
Add SageMaker HPO component and sample usage in a pipeline #1628

Merged · 7 commits · Jul 22, 2019

Conversation

carolynwang
Contributor

@carolynwang carolynwang commented Jul 16, 2019

Summary

@Jeffwan @mbaijal @gautamkmr @kalyc

Updated Dockerfile

  • Copies hyperparameter_tuning.py into the container
  • Updates the boto3 pin and installs pyyaml

Added Dockerfile for just HPO

  • Same as the original, but only copies in the HPO script

Added hyperparameter tuning component

  • hpo.template.yaml: template for API call request generation
  • _utils.py: additional functions for hyperparameter tuning
  • hyperparameter_tuning.py
  • README: details inputs/outputs for the HPO component and how to run the sample
  • component.yaml: the Docker image used here is built from the HPO-only Dockerfile
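The request-generation flow these files implement can be sketched roughly as follows. This is a simplified, hypothetical illustration: the actual component renders hpo.template.yaml with PyYAML, and the argument names used here (`job_name`, `integer_parameters`, etc.) are assumptions, not the component's exact interface.

```python
import json

# Illustrative stand-in for hpo.template.yaml; field names follow the
# SageMaker CreateHyperParameterTuningJob request shape, but this is a
# trimmed-down subset, not the component's real template.
REQUEST_TEMPLATE = {
    'HyperParameterTuningJobName': '',
    'HyperParameterTuningJobConfig': {
        'ParameterRanges': {
            'IntegerParameterRanges': [],
            'ContinuousParameterRanges': [],
            'CategoricalParameterRanges': [],
        },
    },
    'TrainingJobDefinition': {
        'StaticHyperParameters': {},
        'AlgorithmSpecification': {'TrainingInputMode': ''},
    },
}

def build_request(args):
    """Fill a fresh copy of the template from component arguments.

    Parameter ranges arrive as JSON strings (as in the sample pipeline),
    so they are parsed here before being placed into the request.
    """
    request = json.loads(json.dumps(REQUEST_TEMPLATE))  # cheap deep copy
    request['HyperParameterTuningJobName'] = args['job_name']
    ranges = request['HyperParameterTuningJobConfig']['ParameterRanges']
    ranges['IntegerParameterRanges'] = json.loads(args['integer_parameters'])
    ranges['CategoricalParameterRanges'] = json.loads(args['categorical_parameters'])
    request['TrainingJobDefinition']['StaticHyperParameters'] = args['static_parameters']
    request['TrainingJobDefinition']['AlgorithmSpecification']['TrainingInputMode'] = args['training_input_mode']
    return request
```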

Sample pipeline (of just the HPO component)

  • Show sample usage for the HPO component for the K-Means algorithm using the MNIST dataset
  • Uses all possible parameters, though some parameters have default values
  • Note: the default inputs used for the parameters are not actually practical; they are just used to demonstrate that the component works
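Because the sample encodes parameter ranges as JSON strings, a malformed string would only fail at runtime inside the container. A small validation helper (hypothetical, not part of the component) can catch that at pipeline-authoring time; the range values below are the ones the K-Means sample passes.

```python
import json

# Range strings taken from the sample pipeline's inputs.
integer_parameters = '[{"Name": "extra_center_factor", "MinValue": "10", "MaxValue": "20"}]'
categorical_parameters = '[{"Name": "init_method", "Values": ["random", "kmeans++"]}]'

def parse_ranges(json_str, required_keys):
    """Parse a JSON-string range list and check each entry has the keys
    SageMaker expects for that range type. Raises ValueError early instead
    of failing later inside the tuning-job request."""
    ranges = json.loads(json_str)
    for r in ranges:
        missing = [k for k in required_keys if k not in r]
        if missing:
            raise ValueError('range {} missing keys: {}'.format(r, missing))
    return ranges
```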

Testing

HPO component functionality

  • kmeans-hpo-pipeline.py compiles and runs without errors using the default values set in the pipeline parameters
  • Note: has not been tested with SageMaker custom algorithm resources, AWS marketplace algorithm resources, BYOC, or any other built-in algorithms aside from K-Means.


@googlebot
Collaborator

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.



@k8s-ci-robot
Contributor

Hi @carolynwang. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Jeffwan
Member

Jeffwan commented Jul 16, 2019

@carolynwang Thanks for the contribution. Please sign CLA first

@Jeffwan
Member

Jeffwan commented Jul 16, 2019

/cc @Jeffwan

@carolynwang
Contributor Author

CLA signed

@googlebot
Collaborator

CLAs look good, thanks!


@Jeffwan
Member

Jeffwan commented Jul 16, 2019

Thanks. I will review and get back to you.

description: 'Tuned hyperparameters'
implementation:
  container:
    image: carowang/kfp-aws-sm-hpo:190716-01

will you be renaming this image?

Contributor Author

eventually, I think

message = response['FailureReason']
logging.error('Hyperparameter tuning failed with the following error: {}'.format(message))
raise Exception('Hyperparameter tuning job failed')
logging.info("Hyperparameter tuning job is still in status: " + status)
Contributor

If the job has stopped, this will go into an infinite loop. We should continue the loop only for the in-progress case; for the failure case we should just return.

Member

Looks like Completed and Failed are the only two end states? Will the job eventually fit into one of these two? @carolynwang Can you confirm?

Contributor Author

Yes, that's right. Eventually, if/when stopping the job on run termination is implemented, the job can also be in a Stopping/Stopped state, but the pod, container, and script should be terminated anyway.
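A polling loop that treats all three outcomes discussed here (Completed, Failed, Stopped) as terminal could look like the sketch below. It is self-contained: `describe` is a stand-in for SageMaker's DescribeHyperParameterTuningJob call (any callable returning a dict of that shape works), and the component's real loop may differ in detail.

```python
import logging
import time

def wait_for_tuning_job(describe, poll_interval=30):
    """Poll until the tuning job reaches a terminal status.

    Completed returns the final response; Failed and Stopped raise, so the
    loop can never spin forever on a job that is no longer running.
    """
    while True:
        response = describe()
        status = response['HyperParameterTuningJobStatus']
        if status == 'Completed':
            return response
        if status in ('Failed', 'Stopped'):
            # FailureReason is only present for failed jobs.
            message = response.get('FailureReason', status)
            logging.error('Hyperparameter tuning ended with: {}'.format(message))
            raise Exception('Hyperparameter tuning job ' + status.lower())
        logging.info('Hyperparameter tuning job is still in status: ' + status)
        time.sleep(poll_interval)
```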

Contributor

@goswamig goswamig left a comment

Overall looks good to me. A few comments; please take care of them.

Member

@Jeffwan Jeffwan left a comment

Please address the feedback from reviewers.

@@ -17,8 +17,9 @@ RUN apt-get update -y && apt-get install --no-install-recommends -y -q ca-certif

RUN easy_install pip

RUN pip install boto3==1.9.130 sagemaker pathlib2
RUN pip install boto3==1.9.169 sagemaker pathlib2 pyyaml==3.12
Member

Since it's a revision-version change, I don't have concerns. Better to run the existing workflows to make sure this boto3 version works for all previous jobs.

Contributor Author

I've run a pipeline using the existing components with this version of boto3, and they still work.

{"Name": "extra_center_factor", "MinValue": "10", "MaxValue": "20"}]',
continuous_parameters='[]',
categorical_parameters='[{"Name": "init_method", "Values": ["random", "kmeans++"]}]',
channels='[{"ChannelName": "train", \
Member

Minor: Have you considered transforming channels into simpler configs? As a user, I would think this is easy to break. If S3Uri, S3DataType, and S3DataDistributionType are helpful fields, can we think of a way to build an abstraction? This is low priority.
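One possible shape for such an abstraction, sketched as a hypothetical helper (not part of the component) that defaults the rarely-changed S3 settings. The field names follow the SageMaker channel/InputDataConfig shape; the helper's signature and defaults are illustrative.

```python
def make_channel(name, s3_uri,
                 s3_data_type='S3Prefix',
                 s3_data_distribution_type='FullyReplicated',
                 content_type='', compression_type='None'):
    """Build one `channels` entry from a few common fields, so users only
    spell out S3DataType/S3DataDistributionType when they need to."""
    return {
        'ChannelName': name,
        'DataSource': {
            'S3DataSource': {
                'S3Uri': s3_uri,
                'S3DataType': s3_data_type,
                'S3DataDistributionType': s3_data_distribution_type,
            }
        },
        'ContentType': content_type,
        'CompressionType': compression_type,
    }
```

Callers could then write `make_channel('train', 's3://bucket/mnist/train')` instead of a hand-built nested JSON string.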

role_arn='',
):

training = sagemaker_hpo_op(
Member

Basically, does the user create the SageMaker HPO job separately? Do they pipeline this job with other training jobs or something else?

Contributor Author

The user can run it separately, as in this sample, but they can also pipeline this job with a training or create-model component, since it outputs the tuned hyperparameters plus the job name and model artifacts URL of the best training job within the HPO job. I'll add an example using it in a pipeline with other components.
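As a sketch of that hand-off, the per-job outputs could be pulled from a DescribeHyperParameterTuningJob-style response as below. The helper is hypothetical; in KFP the component would write these values to output files so downstream training or create-model steps can consume them, and the model artifacts URL comes from a separate DescribeTrainingJob call on the best job.

```python
def best_training_job_outputs(response):
    """Extract the outputs a downstream step would consume: the best
    training job's name and its tuned hyperparameters.

    `response` follows the shape of SageMaker's
    DescribeHyperParameterTuningJob result, whose BestTrainingJob entry
    carries TrainingJobName and TunedHyperParameters.
    """
    best = response['BestTrainingJob']
    return {
        'best_job_name': best['TrainingJobName'],
        'tuned_hyperparameters': best['TunedHyperParameters'],
    }
```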

Member

What I am thinking is that it's better to have a more complex example to demo a real-world use case, like the taxi example.

Contributor Author

Okay, I will work on making a more complex example when I get a chance.

request['TrainingJobDefinition']['StaticHyperParameters'] = args['static_parameters']
request['TrainingJobDefinition']['AlgorithmSpecification']['TrainingInputMode'] = args['training_input_mode']

# TODO: determine if algorithm name or training image is used for algorithms from AWS marketplace
Member

Seems you cover this in lines 279-300. Remove the #TODO if it's done.

Member

Same as other TODOs

@Jeffwan
Member

Jeffwan commented Jul 21, 2019

/ok-to-test
/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jeffwan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Jeffwan
Member

Jeffwan commented Jul 22, 2019

/test kubeflow-pipeline-e2e-test

@k8s-ci-robot k8s-ci-robot merged commit 2778632 into kubeflow:master Jul 22, 2019
@carolynwang carolynwang deleted the sagemaker-hpo-component branch July 22, 2019 19:09