[Sample] update XGBoost sample (#2220)
* init.

* add delete op

* Add new transform and analyze op

* Save.

* Partial work

* partial work

* WIP

* Clean up code.

* Disable xgboost cm check for now.

* Update commit SHA

* Update kw arg

* Correct the url format.

* Switch to gcp component for create cluster.

* add secret

* fix create cluster op

* fix path problem

* Move display name

* no op

* Improve

* doc

* Solve

* update sample test launcher.

* Fix component test yaml
Jiaxiao Zheng committed Oct 15, 2019
1 parent 0bde90d commit dbac974
Showing 5 changed files with 218 additions and 283 deletions.
41 changes: 10 additions & 31 deletions samples/core/xgboost_training_cm/README.md
@@ -1,10 +1,9 @@
 ## Overview
 
-The `xgboost-training-cm.py` pipeline creates XGBoost models on structured data in CSV format. Both classification and regression are supported.
+The `xgboost_training_cm.py` pipeline creates XGBoost models on structured data in CSV format. Both classification and regression are supported.
 
 The pipeline starts by creating a Google Cloud Dataproc cluster, and then running analysis, transformation, distributed training and
-prediction in the created cluster. Then a single-node confusion-matrix aggregator is used (for the classification case) to
-provide the confusion matrix data to the front end. Finally, a delete cluster operation runs to destroy the cluster it creates
+prediction in the created cluster. Finally, a delete cluster operation runs to destroy the cluster it creates
 in the beginning. The delete cluster operation is used as an exit handler, meaning it will run regardless of whether the pipeline fails
 or not.

@@ -32,37 +31,17 @@
 pipeline run results. Note that each pipeline run will create a unique directory
 ## Components source
 
 Create Cluster:
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/create_cluster/src)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/create_cluster)
+[source code](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/component_sdk/python/kfp_component/google/dataproc/_create_cluster.py)
 
-Analyze (step one for preprocessing):
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/analyze/src)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/analyze)
-
-Transform (step two for preprocessing):
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/transform/src)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/transform)
-
-Distributed Training:
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/train/src)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/train)
-
-Distributed Predictions:
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/predict/src)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/predict)
-
-Confusion Matrix:
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/local/confusion_matrix/src)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/local/confusion_matrix)
-
-ROC:
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/local/roc/src)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/local/roc)
+Analyze (step one of preprocessing) and Transform (step two of preprocessing) use the PySpark job
+submission component, with
+[source code](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/component_sdk/python/kfp_component/google/dataproc/_submit_pyspark_job.py)
+
+Distributed training and prediction use the Spark job submission component, with
+[source code](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/component_sdk/python/kfp_component/google/dataproc/_submit_spark_job.py)
 
 Delete Cluster:
-[source code](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/delete_cluster/src)
-[container](https://github.com/kubeflow/pipelines/tree/master/components/deprecated/dataproc/delete_cluster)
+[source code](https://github.com/kubeflow/pipelines/blob/master/components/gcp/container/component_sdk/python/kfp_component/google/dataproc/_delete_cluster.py)
 
+The container file is located [here](https://github.com/kubeflow/pipelines/tree/master/components/gcp/container).
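The Create Cluster, job-submission, and Delete Cluster steps above are consumed by the sample as reusable KFP components, typically loaded by URL pinned to a specific revision (compare the "Update commit SHA" item in the commit message). A minimal sketch of how such pinned component URLs can be assembled — the path layout and helper function here are illustrative assumptions, not values taken from this commit:

```python
# Sketch: building pinned URLs for the GCP Dataproc components.
# The repository layout below is an assumption for illustration; check
# the kubeflow/pipelines repo for the authoritative component locations.
COMPONENT_URL_TEMPLATE = (
    "https://raw.githubusercontent.com/kubeflow/pipelines/"
    "{revision}/components/gcp/dataproc/{name}/component.yaml"
)

DATAPROC_COMPONENTS = (
    "create_cluster",
    "submit_pyspark_job",  # Analyze and Transform steps
    "submit_spark_job",    # distributed training and prediction
    "delete_cluster",      # cleanup, used as the exit handler
)

def component_url(name: str, revision: str = "master") -> str:
    """Return the raw-content URL for one Dataproc component spec,
    pinned to a branch name or commit SHA."""
    if name not in DATAPROC_COMPONENTS:
        raise ValueError(f"unknown component: {name}")
    return COMPONENT_URL_TEMPLATE.format(revision=revision, name=name)
```

In a pipeline these URLs would typically be passed to `kfp.components.load_component_from_url(...)`; pinning `revision` to a commit SHA rather than `master` keeps pipeline runs reproducible.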

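As the README diff notes, the delete-cluster op is wired up as an exit handler (`dsl.ExitHandler` in the KFP DSL), so cleanup runs whether the pipeline succeeds or fails. A library-free sketch of that control flow, using Python's try/finally as the analogue — the step names are hypothetical, not the sample's actual ops:

```python
# Sketch of the exit-handler guarantee: cleanup runs even on failure.
# Step names are hypothetical; the real sample submits Dataproc ops.
def run_pipeline() -> list:
    log = []

    def create_cluster():
        log.append("create_cluster")

    def train():
        log.append("train")
        raise RuntimeError("simulated training failure")

    def delete_cluster():
        log.append("delete_cluster")

    try:
        create_cluster()
        train()
    except RuntimeError:
        log.append("failure recorded")
    finally:
        delete_cluster()  # the "exit handler": always reached
    return log

# run_pipeline() -> ['create_cluster', 'train', 'failure recorded', 'delete_cluster']
```

The design point is the same in both worlds: because the cleanup step is attached as an exit handler rather than a downstream task, a failed training step cannot leave an orphaned (and billed) Dataproc cluster behind.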