
Add SageMaker HPO component and sample usage in a pipeline #1628

Merged · 7 commits · Jul 22, 2019

Conversation

carolynwang
Contributor

@carolynwang carolynwang commented Jul 16, 2019

Summary

@Jeffwan @mbaijal @gautamkmr @kalyc

Updated Dockerfile

  • Copies hyperparameter_tuning.py into the container
  • Updates the boto3 pin and installs pyyaml

Added Dockerfile for just HPO

  • Same as the original, but only copies in the HPO script

Added hyperparameter tuning component

  • hpo.template.yaml: template for API call request generation
  • _utils.py: additional functions for hyperparameter tuning
  • hyperparameter_tuning.py
  • README: details inputs/outputs for the HPO component and how to run the sample
  • component.yaml: the Docker image used here is built from the HPO-only Dockerfile
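The request-generation flow these files implement can be sketched roughly as follows. This is a simplified, hypothetical illustration: the actual component renders hpo.template.yaml with PyYAML, and the argument names used here (`job_name`, `integer_parameters`, etc.) are assumptions, not the component's exact interface.

```python
import json

# Illustrative stand-in for hpo.template.yaml; field names follow the
# SageMaker CreateHyperParameterTuningJob request shape, but this is a
# trimmed-down subset, not the component's real template.
REQUEST_TEMPLATE = {
    'HyperParameterTuningJobName': '',
    'HyperParameterTuningJobConfig': {
        'ParameterRanges': {
            'IntegerParameterRanges': [],
            'ContinuousParameterRanges': [],
            'CategoricalParameterRanges': [],
        },
    },
    'TrainingJobDefinition': {
        'StaticHyperParameters': {},
        'AlgorithmSpecification': {'TrainingInputMode': ''},
    },
}

def build_request(args):
    """Fill a fresh copy of the template from component arguments.

    Parameter ranges arrive as JSON strings (as in the sample pipeline),
    so they are parsed here before being placed into the request.
    """
    request = json.loads(json.dumps(REQUEST_TEMPLATE))  # cheap deep copy
    request['HyperParameterTuningJobName'] = args['job_name']
    ranges = request['HyperParameterTuningJobConfig']['ParameterRanges']
    ranges['IntegerParameterRanges'] = json.loads(args['integer_parameters'])
    ranges['CategoricalParameterRanges'] = json.loads(args['categorical_parameters'])
    request['TrainingJobDefinition']['StaticHyperParameters'] = args['static_parameters']
    request['TrainingJobDefinition']['AlgorithmSpecification']['TrainingInputMode'] = args['training_input_mode']
    return request
```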

Sample pipeline (of just the HPO component)

  • Show sample usage for the HPO component for the K-Means algorithm using the MNIST dataset
  • Uses all possible parameters, though some parameters have default values
  • Note: the default inputs used for the parameters are not actually practical; they are just used to demonstrate that the component works
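Because the sample encodes parameter ranges as JSON strings, a malformed string would only fail at runtime inside the container. A small validation helper (hypothetical, not part of the component) can catch that at pipeline-authoring time; the range values below are the ones the K-Means sample passes.

```python
import json

# Range strings taken from the sample pipeline's inputs.
integer_parameters = '[{"Name": "extra_center_factor", "MinValue": "10", "MaxValue": "20"}]'
categorical_parameters = '[{"Name": "init_method", "Values": ["random", "kmeans++"]}]'

def parse_ranges(json_str, required_keys):
    """Parse a JSON-string range list and check each entry has the keys
    SageMaker expects for that range type. Raises ValueError early instead
    of failing later inside the tuning-job request."""
    ranges = json.loads(json_str)
    for r in ranges:
        missing = [k for k in required_keys if k not in r]
        if missing:
            raise ValueError('range {} missing keys: {}'.format(r, missing))
    return ranges
```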

Testing

HPO component functionality

  • kmeans-hpo-pipeline.py compiles and runs without errors using the default values set in the pipeline parameters
  • Note: has not been tested with SageMaker custom algorithm resources, AWS marketplace algorithm resources, BYOC, or any other built-in algorithms aside from K-Means.


@googlebot
Collaborator

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here (e.g. I signed it!) and we'll verify it.



@k8s-ci-robot
Contributor

Hi @carolynwang. Thanks for your PR.

I'm waiting for a kubeflow member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Jeffwan
Member

Jeffwan commented Jul 16, 2019

@carolynwang Thanks for the contribution. Please sign CLA first

@Jeffwan
Member

Jeffwan commented Jul 16, 2019

/cc @Jeffwan

@carolynwang
Contributor Author

CLA signed

@googlebot
Collaborator

CLAs look good, thanks!


@Jeffwan
Member

Jeffwan commented Jul 16, 2019

Thanks. I will review and get back to you.

description: 'Tuned hyperparameters'
implementation:
  container:
    image: carowang/kfp-aws-sm-hpo:190716-01

will you be renaming this image?

Contributor Author

eventually, I think

message = response['FailureReason']
logging.error('Hyperparameter tuning failed with the following error: {}'.format(message))
raise Exception('Hyperparameter tuning job failed')
logging.info("Hyperparameter tuning job is still in status: " + status)
Contributor

If the job has stopped, this will go into an infinite loop. We should continue the loop only for the in-progress case; for the failure case we should just return.

Member

Looks like Completed and Failed are the only two end states? Will the job eventually fit into one of these two? @carolynwang Can you confirm?

Contributor Author

Yes, that's right. Eventually, if/when stopping the job on run termination is implemented, the job can also be in a Stopping/Stopped state, but the pod, container, and script should be terminated anyway.
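A polling loop that treats all three outcomes discussed here (Completed, Failed, Stopped) as terminal could look like the sketch below. It is self-contained: `describe` is a stand-in for SageMaker's DescribeHyperParameterTuningJob call (any callable returning a dict of that shape works), and the component's real loop may differ in detail.

```python
import logging
import time

def wait_for_tuning_job(describe, poll_interval=30):
    """Poll until the tuning job reaches a terminal status.

    Completed returns the final response; Failed and Stopped raise, so the
    loop can never spin forever on a job that is no longer running.
    """
    while True:
        response = describe()
        status = response['HyperParameterTuningJobStatus']
        if status == 'Completed':
            return response
        if status in ('Failed', 'Stopped'):
            # FailureReason is only present for failed jobs.
            message = response.get('FailureReason', status)
            logging.error('Hyperparameter tuning ended with: {}'.format(message))
            raise Exception('Hyperparameter tuning job ' + status.lower())
        logging.info('Hyperparameter tuning job is still in status: ' + status)
        time.sleep(poll_interval)
```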

Contributor

@goswamig goswamig left a comment

Overall looks good to me. A few comments; please take care of them.

Member

@Jeffwan Jeffwan left a comment

Please address the feedback from reviewers.

@@ -17,8 +17,9 @@ RUN apt-get update -y && apt-get install --no-install-recommends -y -q ca-certif

RUN easy_install pip

RUN pip install boto3==1.9.130 sagemaker pathlib2
RUN pip install boto3==1.9.169 sagemaker pathlib2 pyyaml==3.12
Member

Since it's a revision-version change, I don't have concerns. Better to run the existing workflows to make sure this boto3 version works for all previous jobs.

Contributor Author

I've run a pipeline using the existing components with this version of boto3, and they still work.

{"Name": "extra_center_factor", "MinValue": "10", "MaxValue": "20"}]',
continuous_parameters='[]',
categorical_parameters='[{"Name": "init_method", "Values": ["random", "kmeans++"]}]',
channels='[{"ChannelName": "train", \
Member

Minor: Have you considered transforming channels into simpler configs? As a user, I would think this is easy to break. If S3Uri, S3DataType, and S3DataDistributionType are helpful fields, can we think of a way to build an abstraction? This is low priority.
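One possible shape for such an abstraction, sketched as a hypothetical helper (not part of the component) that defaults the rarely-changed S3 settings. The field names follow the SageMaker channel/InputDataConfig shape; the helper's signature and defaults are illustrative.

```python
def make_channel(name, s3_uri,
                 s3_data_type='S3Prefix',
                 s3_data_distribution_type='FullyReplicated',
                 content_type='', compression_type='None'):
    """Build one `channels` entry from a few common fields, so users only
    spell out S3DataType/S3DataDistributionType when they need to."""
    return {
        'ChannelName': name,
        'DataSource': {
            'S3DataSource': {
                'S3Uri': s3_uri,
                'S3DataType': s3_data_type,
                'S3DataDistributionType': s3_data_distribution_type,
            }
        },
        'ContentType': content_type,
        'CompressionType': compression_type,
    }
```

Callers could then write `make_channel('train', 's3://bucket/mnist/train')` instead of a hand-built nested JSON string.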

role_arn='',
):

training = sagemaker_hpo_op(
Member

Basically, does the user create the SageMaker HPO job separately? Do they pipeline this job with other training jobs or something else?

Contributor Author

The user can run it separately, as in this sample, but they can also pipeline this job with a training or create-model component, since it outputs the tuned hyperparameters plus the job name and model artifacts URL of the best training job within the HPO job. I'll add an example using it in a pipeline with other components.
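As a sketch of that hand-off, the per-job outputs could be pulled from a DescribeHyperParameterTuningJob-style response as below. The helper is hypothetical; in KFP the component would write these values to output files so downstream training or create-model steps can consume them, and the model artifacts URL comes from a separate DescribeTrainingJob call on the best job.

```python
def best_training_job_outputs(response):
    """Extract the outputs a downstream step would consume: the best
    training job's name and its tuned hyperparameters.

    `response` follows the shape of SageMaker's
    DescribeHyperParameterTuningJob result, whose BestTrainingJob entry
    carries TrainingJobName and TunedHyperParameters.
    """
    best = response['BestTrainingJob']
    return {
        'best_job_name': best['TrainingJobName'],
        'tuned_hyperparameters': best['TunedHyperParameters'],
    }
```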

Member

What I am thinking is that it's better to have a more complex example to demo a real-world use case, like the taxi example.

Contributor Author

Okay, I will work on making a more complex example when I get a chance.

request['TrainingJobDefinition']['StaticHyperParameters'] = args['static_parameters']
request['TrainingJobDefinition']['AlgorithmSpecification']['TrainingInputMode'] = args['training_input_mode']

# TODO: determine if algorithm name or training image is used for algorithms from AWS marketplace
Member

Seems you cover this in lines 279-300. Remove the #TODO if it's done.

Member

Same as other TODOs

@Jeffwan
Member

Jeffwan commented Jul 21, 2019

/ok-to-test
/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jeffwan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Jeffwan
Member

Jeffwan commented Jul 22, 2019

/test kubeflow-pipeline-e2e-test

@k8s-ci-robot k8s-ci-robot merged commit 2778632 into kubeflow:master Jul 22, 2019
@carolynwang carolynwang deleted the sagemaker-hpo-component branch July 22, 2019 19:09