-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SageMaker HPO component and sample usage in a pipeline #1628
Add SageMaker HPO component and sample usage in a pipeline #1628
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here (e.g. What to do if you already signed the CLAIndividual signers
Corporate signers
ℹ️ Googlers: Go here for more info. |
Hi @carolynwang. Thanks for your PR. I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@carolynwang Thanks for the contribution. Please sign CLA first |
/cc @Jeffwan |
CLA signed |
CLAs look good, thanks! ℹ️ Googlers: Go here for more info. |
Thanks. I will have a review and come back to you. |
description: 'Tuned hyperparameters' | ||
implementation: | ||
container: | ||
image: carowang/kfp-aws-sm-hpo:190716-01 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will you be renaming this image?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
eventually, I think
message = response['FailureReason'] | ||
logging.error('Hyperparameter tuning failed with the following error: {}'.format(message)) | ||
raise Exception('Hyperparameter tuning job failed') | ||
logging.info("Hyperparameter tuning job is still in status: " + status) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the job has stopped then it will go in infinite loop.
we should check for progress case and continue the loop for failure case we should just return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like Completed
and Failed
are only two end state? Job will eventually fit into these two? @carolynwang Can you confirm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that's right. Eventually if/when stopping the job when the run is terminated is implemented, the job can also be Stopping/Stopped state but the pod+container+script should be terminated anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Over all looks good to me. Few comments please take care of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add feedbacks from reviewers
@@ -17,8 +17,9 @@ RUN apt-get update -y && apt-get install --no-install-recommends -y -q ca-certif | |||
|
|||
RUN easy_install pip | |||
|
|||
RUN pip install boto3==1.9.130 sagemaker pathlib2 | |||
RUN pip install boto3==1.9.169 sagemaker pathlib2 pyyaml==3.12 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it's a revision version change, I don't have concern. Better to run exiting workflows to make sure this boto version works for all previous jobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've run a pipeline using the existing components using this version of boto and they still work
{"Name": "extra_center_factor", "MinValue": "10", "MaxValue": "20"}]', | ||
continuous_parameters='[]', | ||
categorical_parameters='[{"Name": "init_method", "Values": ["random", "kmeans++"]}]', | ||
channels='[{"ChannelName": "train", \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Have you considered to transform channels to simpler configs? If I am a user, I would think this is easy to break. If S3Uri, S3DataType amd S3DataDistributionType are fields helpful, can we think of a way to build abstraction? This is low priority
role_arn='', | ||
): | ||
|
||
training = sagemaker_hpo_op( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, does user create SM HOP job separately? Do they pipeline this job with other training jobs or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user can do it separately like in this sample, but they can also pipeline this job with another training or create model component since it outputs the tuned hyperparameters and the job name and model artifacts url of the best training job within the HPO job. I'll have an example with it in a pipeline with other components
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I am thinking is it's better to have a more complex example to demo real world use case like taxi example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, will work on making a more complex example when I get a chance to
request['TrainingJobDefinition']['StaticHyperParameters'] = args['static_parameters'] | ||
request['TrainingJobDefinition']['AlgorithmSpecification']['TrainingInputMode'] = args['training_input_mode'] | ||
|
||
# TODO: determine if algorithm name or training image is used for algorithms from AWS marketplace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems you cover this from line 279-300 Remove #TODO if it's done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as other TODOs
message = response['FailureReason'] | ||
logging.error('Hyperparameter tuning failed with the following error: {}'.format(message)) | ||
raise Exception('Hyperparameter tuning job failed') | ||
logging.info("Hyperparameter tuning job is still in status: " + status) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like Completed
and Failed
are only two end state? Job will eventually fit into these two? @carolynwang Can you confirm?
/ok-to-test |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Jeffwan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test kubeflow-pipeline-e2e-test |
Summary
@Jeffwan @mbaijal @gautamkmr @kalyc
Updated Dockerfile
Added Dockerfile for just HPO
Added hyperparameter tuning component
Sample pipeline (of just the HPO component)
Testing
HPO component functionality
This change is