[Sample] update XGBoost sample (#2220)
* init.

* add delete op

* Add new transform and analyze op

* Save.

* Partial work

* partial work

* WIP

* Clean up code.

* Disable xgboost cm check for now.

* Update commit SHA

* Update kw arg

* Correct the url format.

* Switch to gcp component for create cluster.

* add secret

* fix create cluster op

* fix path problem

* Move display name

* no op

* Improve

* doc

* Solve

* update sample test launcher.

* Fix component test yaml
Jiaxiao Zheng committed Oct 15, 2019
1 parent 0bde90d commit dbac974
Showing 5 changed files with 218 additions and 283 deletions.
41 changes: 10 additions & 31 deletions samples/core/xgboost_training_cm/README.md
@@ -1,10 +1,9 @@
 ## Overview
 
-The `xgboost-training-cm.py` pipeline creates XGBoost models on structured data in CSV format. Both classification and regression are supported.
+The `xgboost_training_cm.py` pipeline creates XGBoost models on structured data in CSV format. Both classification and regression are supported.
 
 The pipeline starts by creating a Google Cloud Dataproc cluster, and then running analysis, transformation, distributed training and
-prediction in the created cluster. Then a single-node confusion-matrix aggregator is used (for the classification case) to
-provide the confusion matrix data to the front end. Finally, a delete cluster operation runs to destroy the cluster it creates
+prediction in the created cluster. Finally, a delete cluster operation runs to destroy the cluster it creates
 in the beginning. The delete cluster operation is used as an exit handler, meaning it will run regardless of whether the pipeline fails
 or not.

@@ -32,37 +31,17 @@
 pipeline run results. Note that each pipeline run will create a unique directory
 ## Components source
 
 Create Cluster:
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/create_cluster/src)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/create_cluster)
+[source code](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/component_sdk/python/kfp_component/google/dataproc/_create_cluster.py)
 
-Analyze (step one for preprocessing):
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/analyze/src)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/analyze)
-
-Transform (step two for preprocessing):
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/transform/src)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/transform)
-
-Distributed Training:
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/train/src)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/train)
-
-Distributed Predictions:
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/predict/src)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/predict)
-
-Confusion Matrix:
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/local/confusion_matrix/src)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/local/confusion_matrix)
-
-ROC:
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/local/roc/src)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/local/roc)
+Analyze (step one of preprocessing) and Transform (step two of preprocessing) use the PySpark job
+submission component, with
+[source code](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/component_sdk/python/kfp_component/google/dataproc/_submit_pyspark_job.py)
+
+Distributed training and prediction use the Spark job submission component, with
+[source code](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/component_sdk/python/kfp_component/google/dataproc/_submit_spark_job.py)
 
 Delete Cluster:
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/delete_cluster/src)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/delete_cluster)
+[source code](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/component_sdk/python/kfp_component/google/dataproc/_delete_cluster.py)
 
+The container file is located [here](https://github.com/kubeflow/pipelines/tree/master/components/gcp/container).
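The Create Cluster, job-submission, and Delete Cluster steps above are consumed by the sample as reusable KFP components, typically loaded by URL pinned to a specific revision (compare the "Update commit SHA" item in the commit message). A minimal sketch of how such pinned component URLs can be assembled — the path layout and helper function here are illustrative assumptions, not values taken from this commit:

```python
# Sketch: building pinned URLs for the GCP Dataproc components.
# The repository layout below is an assumption for illustration; check
# the kubeflow/pipelines repo for the authoritative component locations.
COMPONENT_URL_TEMPLATE = (
    "https://raw.githubusercontent.com/kubeflow/pipelines/"
    "{revision}/components/gcp/dataproc/{name}/component.yaml"
)

DATAPROC_COMPONENTS = (
    "create_cluster",
    "submit_pyspark_job",  # Analyze and Transform steps
    "submit_spark_job",    # distributed training and prediction
    "delete_cluster",      # cleanup, used as the exit handler
)

def component_url(name: str, revision: str = "master") -> str:
    """Return the raw-content URL for one Dataproc component spec,
    pinned to a branch name or commit SHA."""
    if name not in DATAPROC_COMPONENTS:
        raise ValueError(f"unknown component: {name}")
    return COMPONENT_URL_TEMPLATE.format(revision=revision, name=name)
```

In a pipeline these URLs would typically be passed to `kfp.components.load_component_from_url(...)`; pinning `revision` to a commit SHA rather than `master` keeps pipeline runs reproducible.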

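As the README diff notes, the delete-cluster op is wired up as an exit handler (`dsl.ExitHandler` in the KFP DSL), so cleanup runs whether the pipeline succeeds or fails. A library-free sketch of that control flow, using Python's try/finally as the analogue — the step names are hypothetical, not the sample's actual ops:

```python
# Sketch of the exit-handler guarantee: cleanup runs even on failure.
# Step names are hypothetical; the real sample submits Dataproc ops.
def run_pipeline() -> list:
    log = []

    def create_cluster():
        log.append("create_cluster")

    def train():
        log.append("train")
        raise RuntimeError("simulated training failure")

    def delete_cluster():
        log.append("delete_cluster")

    try:
        create_cluster()
        train()
    except RuntimeError:
        log.append("failure recorded")
    finally:
        delete_cluster()  # the "exit handler": always reached
    return log

# run_pipeline() -> ['create_cluster', 'train', 'failure recorded', 'delete_cluster']
```

The design point is the same in both worlds: because the cleanup step is attached as an exit handler rather than a downstream task, a failed training step cannot leave an orphaned (and billed) Dataproc cluster behind.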