
[Sample] update XGBoost sample #2220

Merged 27 commits into kubeflow:master on Oct 15, 2019

Conversation

@numerology (Author) commented Sep 24, 2019

Update the XGBoost sample to adopt the new GCP components (a brief sketch of the component-loading pattern follows the TODO list).
TODO:

  • Wait for the updated create_cluster component.
  • Add back visualization; planned for a future PR, since it requires changes to the components.

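For context, a minimal sketch of how a GCP Dataproc component is loaded from its published component.yaml (the URL below is illustrative, not the exact release pinned by this PR):

from kfp import components

# Illustrative URL -- the sample pins a specific release of the Dataproc
# submit_pyspark_job component; the exact path/tag here is an assumption.
dataproc_submit_pyspark_job_op = components.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/master/'
    'components/gcp/dataproc/submit_pyspark_job/component.yaml')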

@numerology (Author)

cc @gaoning777
One nit question: is there any way to specify or modify the component name after loading a component from a URL? Right now four steps in the pipeline are named 'dataproc_submit_(py)spark_job', which is not very informative.

@numerology (Author)

/test kubeflow-pipeline-sample-test

@gaoning777 (Contributor)

> cc @gaoning777
> One nit question: is there any way to specify or modify the component name after loading a component from a URL? Right now four steps in the pipeline are named 'dataproc_submit_(py)spark_job', which is not very informative.

@Ark-kun do you know if there is a way to change the name in the task factory?

Merge commit: …enew-xgboost-sample (conflicts in samples/core/xgboost_training_cm/xgboost_training_cm.py)

@Ark-kun (Contributor) commented Oct 11, 2019

> Is there any way to specify or modify the component name after loading a component from a URL? Right now four steps in the pipeline are named 'dataproc_submit_(py)spark_job', which is not very informative.
>
> @Ark-kun do you know if there is a way to change the name in the task factory?

Yes! You can use set_display_name:

my_task = my_op(...).set_display_name('Custom name')
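
A slightly fuller, illustrative sketch of how the four 'dataproc_submit_(py)spark_job' tasks could each get a distinct display name (the *_op factories and *_args dictionaries are placeholders, not this sample's actual code):

# Each task keeps its component name but shows a readable name in the UI.
analyze_task = dataproc_submit_pyspark_job_op(**analyze_args).set_display_name('Analyze')
transform_task = dataproc_submit_pyspark_job_op(**transform_args).set_display_name('Transform')
train_task = dataproc_submit_spark_job_op(**train_args).set_display_name('Train XGBoost')
predict_task = dataproc_submit_spark_job_op(**predict_args).set_display_name('Predict')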

schema=schema,
train_data=train_data,
output=output_template
).after(create_cluster_op).apply(gcp.use_gcp_secret('user-gcp-sa'))
@Ark-kun (Contributor), Oct 11, 2019:

It might be nicer to apply the secret to all ops in one line at the end of the pipeline function:

kfp.dsl.get_pipeline_conf().add_op_transformer(use_gcp_secret('user-gcp-sa'))
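
As a sketch, that op_transformer approach looks roughly like this inside the pipeline function (the pipeline name is made up and the individual tasks are elided placeholders):

from kfp import dsl, gcp

@dsl.pipeline(name='xgboost-trainer')
def xgb_train_pipeline(project='my-project', region='us-central1'):
    # ... create_cluster / analyze / transform / train / predict tasks go here ...
    # One transformer applies the GCP service-account secret to every op,
    # replacing the per-task .apply(gcp.use_gcp_secret('user-gcp-sa')) calls.
    dsl.get_pipeline_conf().add_op_transformer(gcp.use_gcp_secret('user-gcp-sa'))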

Contributor:

Good point

target=target,
analysis=analyze_output,
output=output_template
).after(analyze_op).apply(gcp.use_gcp_secret('user-gcp-sa'))
Contributor:

Manual execution-order control (.after) is usually not needed, since task dependencies should normally be data dependencies. Needing it can be a sign of component or pipeline design issues.

@numerology (Author):

Agreed. But currently we're lacking a way for those GCP components to output arbitrary artifacts that downstream steps can consume, so this is just a workaround.
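
To illustrate the trade-off being discussed (a generic sketch, not code from this PR; task and argument names are made up):

# Preferred: a data dependency orders the tasks implicitly, because the
# downstream task consumes an output of the upstream one.
transform_task = transform_op(analysis=analyze_task.outputs['output'])

# Workaround used here: the Dataproc components don't expose such an output,
# so the ordering has to be declared by hand with .after().
transform_task = transform_op(analysis=analyze_output_uri).after(analyze_task)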

analyze_op.output,
output_template
).apply(gcp.use_gcp_secret('user-gcp-sa'))
project=project,
Contributor:

This is a good change. Explicitly named arguments make the pipeline robust against signature changes.
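
For illustration (argument names are placeholders), keyword arguments keep the call correct even if the component's parameter order changes, whereas positional calls can silently bind values to the wrong inputs:

# Fragile: breaks or mis-binds if the component reorders its inputs.
train_task = train_op(project, region, cluster_name, train_data)

# Robust: each value is tied to the input name declared in component.yaml.
train_task = train_op(
    project=project,
    region=region,
    cluster_name=cluster_name,
    train_data=train_data,
)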

@numerology (Author)

/hold cancel

region='us-central1',
train_data='gs://ml-pipeline-playground/sfpd/train.csv',
eval_data='gs://ml-pipeline-playground/sfpd/eval.csv',
schema='gs://ml-pipeline-playground/sfpd/schema.json',
target='resolution',
rounds=200,
rounds=5,
@gaoning777 (Contributor), Oct 14, 2019:

If this change is for fast testing, could you set it in the test config?

@gaoning777 (Contributor)

/lgtm
/approve

@numerology (Author)

Memo: since the new Dataproc components no longer support custom output, visualization components like the confusion matrix and ROC curve are no longer available. See #2177
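
For background on what that custom output enables: the confusion-matrix and ROC visualizations appear when a step writes a KFP UI metadata file, which the new Dataproc components cannot do. A generic sketch of that mechanism (not this sample's code; the GCS path is a placeholder):

import json

# Written inside the step's container; the KFP UI reads this file and
# renders a confusion-matrix visualization from the referenced CSV.
metadata = {
    'outputs': [{
        'type': 'confusion_matrix',
        'format': 'csv',
        'schema': [
            {'name': 'target', 'type': 'CATEGORY'},
            {'name': 'predicted', 'type': 'CATEGORY'},
            {'name': 'count', 'type': 'NUMBER'},
        ],
        'source': 'gs://my-bucket/confusion_matrix.csv',  # placeholder
    }]
}
with open('/mlpipeline-ui-metadata.json', 'w') as f:
    json.dump(metadata, f)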

@gaoning777 (Contributor)

/hold
Moving the public dataset.

@k8s-ci-robot removed the lgtm label on Oct 14, 2019
@numerology (Author)

> /hold
> Moving the public dataset.

Done.

@numerology (Author)

/retest

@numerology (Author)

> /hold
> Moving the public dataset.

Just remembered: another to-do is to make sure it's compatible with the post-submit tests as well.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gaoning777

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gaoning777 (Contributor)

/hold cancel

@gaoning777 (Contributor)

/lgtm

@numerology (Author)

Auto-merging does not seem to work. Merged manually.

@numerology merged commit dbac974 into kubeflow:master on Oct 15, 2019
@numerology deleted the renew-xgboost-sample branch on Oct 15, 2019, 15:03
"""

# Remove existing [output]/train and [output]/eval if they exist.
delete_directory_from_gcs(os.path.join(output, 'train'))
Contributor:

Are you sure this is going to work?

delete_directory_from_gcs is not a component, so it won't be executed as part of the pipeline.
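
For reference, one way cleanup like this could run as an actual pipeline step (not what the sample does; everything below is a hedged sketch with assumed names) is to wrap the function in a lightweight Python component:

from kfp import components

def delete_gcs_directory(dir_uri: str):
    """Delete every blob under a gs://<bucket>/<prefix> 'directory'."""
    # Imported inside the function so the generated component only needs this package.
    from google.cloud import storage
    client = storage.Client()
    bucket_name, _, prefix = dir_uri[len('gs://'):].partition('/')
    for blob in client.bucket(bucket_name).list_blobs(prefix=prefix):
        blob.delete()

# func_to_container_op turns the function into a component, so the deletion
# runs as its own pipeline step instead of only at pipeline-compile time.
delete_gcs_directory_op = components.func_to_container_op(
    delete_gcs_directory,
    packages_to_install=['google-cloud-storage'],
)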
