# Hyperparameter Tuning using Your Own Keras/TensorFlow Container

This notebook shows how to build your own Keras (TensorFlow) container, test it locally using SageMaker Python SDK local mode, and bring it to SageMaker for training with hyperparameter tuning.

The model used for this notebook is a ResNet model, trained on the CIFAR-10 dataset. The example is based on https://github.com/keras-team/keras/blob/master/examples/cifar10_cnn.py

## Set up the notebook instance to support local mode
Currently you need to install docker-compose in order to use local mode (i.e., testing the container in the notebook instance without pushing it to ECR).

In [1]:
!/bin/bash setup.sh

The user has root access.
SageMaker instance route table setup is ok. We are good to go.
SageMaker instance routing for Docker is ok. We are good to go!


## Permissions

Running this notebook requires permissions in addition to the normal `AmazonSageMakerFullAccess` permissions, because it creates new repositories in Amazon ECR. The easiest way to add these permissions is to attach the managed policy `AmazonEC2ContainerRegistryFullAccess` to the role that you used to start your notebook instance. There's no need to restart your notebook instance when you do this; the new permissions are available immediately.

## Set up the environment
We will set up a few things before starting the workflow. 

1. Get the execution role, which is passed to SageMaker so that it can access your resources, such as your S3 bucket.
2. Specify the S3 bucket and prefix where the training data set and model artifacts are stored.

In [2]:
import os
import numpy as np
import tempfile

import tensorflow as tf

import sagemaker
import boto3
from sagemaker.estimator import Estimator

region = boto3.Session().region_name

sagemaker_session = sagemaker.Session()
smclient = boto3.client('sagemaker')

bucket = sagemaker.Session().default_bucket()  # s3 bucket name, must be in the same region as the one specified above
prefix = 'sagemaker/DEMO-hpo-keras-cifar10'

role = sagemaker.get_execution_role()

NUM_CLASSES = 10   # the data set has 10 categories of images

## Complete source code
- [trainer/start.py](trainer/start.py): Keras model
- [trainer/environment.py](trainer/environment.py): Contains information about the SageMaker environment

## Building the image
We will build the Docker image from one of the TensorFlow images published on Docker Hub. The full list of TensorFlow version tags can be found at https://hub.docker.com/r/tensorflow/tensorflow/tags/


In [3]:
import shlex
import subprocess

def get_image_name(ecr_repository, tensorflow_version_tag):
    return '%s:tensorflow-%s' % (ecr_repository, tensorflow_version_tag)

def build_image(name, version):
    cmd = 'docker build -t %s --build-arg VERSION=%s -f Dockerfile .' % (name, version)
    subprocess.check_call(shlex.split(cmd))

#version tag can be found at https://hub.docker.com/r/tensorflow/tensorflow/tags/ 
#e.g., latest cpu version is 'latest', while latest gpu version is 'latest-gpu'
tensorflow_version_tag = '1.10.1'   

account = boto3.client('sts').get_caller_identity()['Account']
    
ecr_repository = "%s.dkr.ecr.%s.amazonaws.com/test" % (account, region)  # your ECR repository; it is created in the push step below if it does not exist yet

image_name = get_image_name(ecr_repository, tensorflow_version_tag)

print('building image:'+image_name)
build_image(image_name, tensorflow_version_tag)

building image:023375022819.dkr.ecr.us-east-1.amazonaws.com/test:tensorflow-1.10.1
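
To see exactly what the build step runs without invoking Docker, the sketch below reuses the same naming scheme and prints the command as an argument list. The account ID and region are placeholders chosen for illustration, and `build_command` is a hypothetical helper that mirrors `build_image` without executing anything.

```python
import shlex

def get_image_name(ecr_repository, tensorflow_version_tag):
    # Same naming scheme as the notebook: <repository>:tensorflow-<tag>
    return '%s:tensorflow-%s' % (ecr_repository, tensorflow_version_tag)

def build_command(name, version):
    # Hypothetical dry-run helper: returns the argv list instead of running it
    cmd = 'docker build -t %s --build-arg VERSION=%s -f Dockerfile .' % (name, version)
    return shlex.split(cmd)

# Placeholder account ID and region, for illustration only
repo = '%s.dkr.ecr.%s.amazonaws.com/test' % ('123456789012', 'us-east-1')
image = get_image_name(repo, '1.10.1')
argv = build_command(image, '1.10.1')

print(image)
print(argv)
```

The `VERSION` build argument is what selects the TensorFlow base image tag, so switching to a GPU build should only require a different tag such as `latest-gpu`.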


## Prepare the data

In [4]:
def upload_channel(channel_name, x, y):
    y = tf.keras.utils.to_categorical(y, NUM_CLASSES)

    file_path = tempfile.mkdtemp()
    np.savez_compressed(os.path.join(file_path, 'cifar-10-npz-compressed.npz'), x=x, y=y)

    return sagemaker_session.upload_data(path=file_path, bucket=bucket, key_prefix='data/DEMO-keras-cifar10/%s' % channel_name)


def upload_training_data():
    # The data, split between train and test sets:
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()

    train_data_location = upload_channel('train', x_train, y_train)
    test_data_location = upload_channel('test', x_test, y_test)

    return {'train': train_data_location, 'test': test_data_location}

channels = upload_training_data()


Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
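
Each channel is uploaded as a single compressed `.npz` file holding the images under `x` and the one-hot labels under `y`. The sketch below reproduces that round trip on a tiny random stand-in for CIFAR-10, which is roughly what a training script would do when it reads a channel directory. The `one_hot` helper is a hypothetical NumPy substitute for `tf.keras.utils.to_categorical`, used here so the example runs without TensorFlow.

```python
import os
import tempfile

import numpy as np

NUM_CLASSES = 10

def one_hot(labels, num_classes):
    # Hypothetical stand-in for tf.keras.utils.to_categorical on integer labels
    labels = np.asarray(labels).reshape(-1)
    return np.eye(num_classes, dtype='float32')[labels]

# Tiny stand-in for CIFAR-10: four random 32x32 RGB images with integer labels
x = np.random.randint(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
y = np.array([[3], [0], [9], [3]])

# Write the channel file the same way upload_channel does before uploading
channel_dir = tempfile.mkdtemp()
npz_path = os.path.join(channel_dir, 'cifar-10-npz-compressed.npz')
np.savez_compressed(npz_path, x=x, y=one_hot(y, NUM_CLASSES))

# Read it back, as a training script could do from its channel directory
with np.load(npz_path) as data:
    x_loaded, y_loaded = data['x'], data['y']

print(x_loaded.shape, y_loaded.shape)
```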


## Testing the container locally (optional)

You can test the container locally using the local mode of the SageMaker Python SDK. A training container will be created in the notebook instance based on the Docker image you built. Note that we have not pushed the Docker image to ECR yet, since we are only running local mode here. You can skip to the tuning step if you want, but testing the container locally can help you find issues quickly before kicking off the tuning job.

### Setting the hyperparameters

In [5]:
hyperparameters = dict(batch_size=32, data_augmentation=True, learning_rate=.0001, 
                       width_shift_range=.1, height_shift_range=.1, epochs=1)
hyperparameters

{'batch_size': 32,
 'data_augmentation': True,
 'learning_rate': 0.0001,
 'width_shift_range': 0.1,
 'height_shift_range': 0.1,
 'epochs': 1}
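
SageMaker hands hyperparameters to the container as strings, so values such as `epochs=1` and `data_augmentation=True` arrive as `'1'` and `'True'`. The sketch below shows one way a training script could convert them back to typed values. `parse_hyperparameters` and its cast table are assumptions made for illustration; the actual handling in `trainer/start.py` may differ.

```python
import json

def parse_hyperparameters(raw):
    # Hypothetical cast table for the hyperparameters used in this notebook
    casts = {
        'batch_size': int,
        'epochs': int,
        'learning_rate': float,
        'width_shift_range': float,
        'height_shift_range': float,
        # bool('False') is True in Python, so compare the text explicitly
        'data_augmentation': lambda v: str(v).strip().lower() == 'true',
    }
    # Unknown keys are left as strings
    return {key: casts.get(key, str)(value) for key, value in raw.items()}

# String-valued hyperparameters, as the container would receive them
raw = json.loads(
    '{"epochs": "1", "width_shift_range": "0.1", "height_shift_range": "0.1", '
    '"data_augmentation": "True", "learning_rate": "0.0001", "batch_size": "32"}'
)
params = parse_hyperparameters(raw)
print(params)
```

The explicit boolean comparison matters: calling `bool()` on any non-empty string, including `'False'`, returns `True`.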

### Create a training job using local mode

In [6]:
%%time

output_location = "s3://{}/{}/output".format(bucket,prefix)

estimator = Estimator(image_name, role=role, output_path=output_location,
                      train_instance_count=1, 
                      train_instance_type='local', hyperparameters=hyperparameters)
estimator.fit(channels)

Creating tmp1vk9g3nw_algo-1-4bgv9_1 ... done
Attaching to tmp1vk9g3nw_algo-1-4bgv9_1
algo-1-4bgv9_1  | Using TensorFlow backend.
algo-1-4bgv9_1  |   return yaml.load(f)
algo-1-4bgv9_1  | creating SageMaker trainer environment:
algo-1-4bgv9_1  | TrainerEnvironment(input_dir='/opt/ml/input', input_config_dir='/opt/ml/input/config', model_dir='/opt/ml/model', output_dir='/opt/ml/output', hyperparameters={'epochs': '1', 'width_shift_range': '0.1', 'height_shift_range': '0.1', 'data_augmentation': 'True', 'learning_rate': '0.0001', 'batch_size': '32'}, resource_config={'current_host': 'algo-1-4bgv9', 'hosts': ['algo-1-4bgv9']}, input_data_config={'test': {'TrainingInputMode': 'File'}, 'train': {'TrainingInputMode': 'File'}}, output_data_dir='/opt/ml/output/data', hosts=['algo-1-4bgv9'], channel_dirs={'test': '/opt/ml/input/data/test', 'train': '/opt/ml/input/data/train'}, current_host='algo-1-4bgv9', available_gpus=0, available_cpus=2)

## Pushing the container to ECR
Now that we've tested the container locally and it works, we can move on to hyperparameter tuning. Before kicking off the tuning job, you need to push the Docker image to ECR.

The cell below will create the ECR repository, if it does not exist yet, and push the image to ECR.

In [7]:
# The name of our algorithm
algorithm_name = 'test'

# If the repository doesn't exist in ECR, create it.
# Capture stdout (the repository description) and discard only stderr, so that
# exist_repo is empty exactly when the repository is missing.
exist_repo = !aws ecr describe-repositories --repository-names {algorithm_name} 2> /dev/null

if not exist_repo:
    !aws ecr create-repository --repository-name {algorithm_name} > /dev/null

# Get the login command from ECR and execute it directly
!$(aws ecr get-login --region {region} --no-include-email)

!docker push {image_name}

https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded
The push refers to repository [023375022819.dkr.ecr.us-east-1.amazonaws.com/test]

tensorflow-1.10.1: digest: sha256:5f8ac077707eb285ffd5df97f0d715a82de98016b33bd673294090601b855519 size: 3461


## Specify hyperparameter tuning job configuration
*Note: with the default settings below, the hyperparameter tuning job can take 20 to 30 minutes to complete. You can customize the code to get better results, for example by increasing the total number of training jobs or the number of epochs, with the understanding that the tuning time will increase accordingly.*

Now you configure the tuning job by defining a JSON object that you pass as the value of the `HyperParameterTuningJobConfig` parameter to the `create_hyper_parameter_tuning_job` call. In this JSON object, you specify:
* The ranges of hyperparameters you want to tune
* The limits of the resource the tuning job can consume 
* The objective metric for the tuning job

In [8]:
import json
from time import gmtime, strftime

tuning_job_name = 'BYO-keras-tuningjob-' + strftime("%d-%H-%M-%S", gmtime())

print(tuning_job_name)

tuning_job_config = {
    "ParameterRanges": {
      "CategoricalParameterRanges": [],
      "ContinuousParameterRanges": [
        {
          "MaxValue": "0.001",
          "MinValue": "0.0001",
          "Name": "learning_rate",          
        }
      ],
      "IntegerParameterRanges": []
    },
    "ResourceLimits": {
      "MaxNumberOfTrainingJobs": 9,
      "MaxParallelTrainingJobs": 3
    },
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
      "MetricName": "loss",
      "Type": "Minimize"
    }
  }


BYO-keras-tuningjob-28-16-18-48
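
Before submitting, it can help to sanity-check the job name and configuration locally. The sketch below is a hypothetical helper, not part of the SageMaker SDK. It assumes the tuning job name is limited to 32 characters of letters, digits, and hyphens, and it returns the minimum number of sequential waves implied by the resource limits.

```python
import math
import re

def check_tuning_config(job_name, config):
    # Assumed name constraint: at most 32 characters, alphanumerics separated by hyphens
    if len(job_name) > 32 or not re.fullmatch(r'[a-zA-Z0-9](-*[a-zA-Z0-9])*', job_name):
        raise ValueError('invalid tuning job name: %r' % job_name)

    # Each continuous range must have MinValue strictly below MaxValue
    for rng in config['ParameterRanges']['ContinuousParameterRanges']:
        if float(rng['MinValue']) >= float(rng['MaxValue']):
            raise ValueError('empty range for %s' % rng['Name'])

    limits = config['ResourceLimits']
    total = limits['MaxNumberOfTrainingJobs']
    parallel = limits['MaxParallelTrainingJobs']
    if parallel > total:
        raise ValueError('MaxParallelTrainingJobs exceeds MaxNumberOfTrainingJobs')

    # Lower bound on the number of sequential waves of training jobs
    return math.ceil(total / parallel)

config = {
    "ParameterRanges": {
        "CategoricalParameterRanges": [],
        "ContinuousParameterRanges": [
            {"MaxValue": "0.001", "MinValue": "0.0001", "Name": "learning_rate"}
        ],
        "IntegerParameterRanges": [],
    },
    "ResourceLimits": {"MaxNumberOfTrainingJobs": 9, "MaxParallelTrainingJobs": 3},
}

waves = check_tuning_config('BYO-keras-tuningjob-28-16-18-48', config)
print(waves)
```

With 9 jobs and 3 in parallel, the tuning job needs at least three waves, which is the main driver of total tuning time. The job name used here is 31 characters long, so a longer prefix or timestamp format would exceed the assumed limit.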


## Specify training job configuration
Now you configure the training jobs that the tuning job launches by defining a JSON object that you pass as the value of the `TrainingJobDefinition` parameter to the `create_hyper_parameter_tuning_job` call.
In this JSON object, you specify:
* Metrics that the training jobs emit
* The container image for the algorithm to train
* The input configuration for your training and test data
* Configuration for the output of the algorithm
* The values of any algorithm hyperparameters that are not tuned in the tuning job
* The type of instance to use for the training jobs
* The stopping condition for the training jobs

This example defines one metric that the TensorFlow container emits: loss.

In [9]:
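training_image = image_name

# Defined here as well so this cell works even if the optional local-mode test was skipped
output_location = "s3://{}/{}/output".format(bucket, prefix)

print('training artifacts will be uploaded to: {}'.format(output_location))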

training_job_definition = {
    "AlgorithmSpecification": {
      "MetricDefinitions": [
        {
          "Name": "loss",
          "Regex": "loss: ([0-9\\.]+)"
        }
      ],
      "TrainingImage": training_image,
      "TrainingInputMode": "File"
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": channels['train'],
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "CompressionType": "None",
            "RecordWrapperType": "None"
        },
        {
            "ChannelName": "test",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": channels['test'],
                    "S3DataDistributionType": "FullyReplicated"
                }
            },            
            "CompressionType": "None",
            "RecordWrapperType": "None"            
        }
    ],
    "OutputDataConfig": {
      "S3OutputPath": "s3://{}/{}/output".format(bucket,prefix)
    },
    "ResourceConfig": {
      "InstanceCount": 1,
      "InstanceType": "ml.m4.xlarge",
      "VolumeSizeInGB": 50
    },
    "RoleArn": role,
    "StaticHyperParameters": {
        "batch_size":"32",
        "data_augmentation":"True",
        "height_shift_range":"0.1",
        "width_shift_range":"0.1",
        "epochs":"1"
    },
    "StoppingCondition": {
      "MaxRuntimeInSeconds": 43200
    }
}


training artifacts will be uploaded to: s3://sagemaker-us-east-1-023375022819/sagemaker/DEMO-hpo-keras-cifar10/output
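
The metric definition relies on a regular expression applied to the training log, so it is worth testing the pattern offline. The sketch below runs the notebook's pattern against a made-up Keras progress line (the numbers are illustrative, not real output). Note that the pattern also matches the `val_loss` value. The stricter variant with a lookbehind is an assumption for illustration; whether SageMaker's regex engine accepts lookbehind is not confirmed here.

```python
import re

# Pattern used in the notebook's MetricDefinitions
ORIGINAL = r"loss: ([0-9\.]+)"
# Hypothetical stricter variant: reject matches preceded by a letter or underscore
STRICT = r"(?<![A-Za-z_])loss: ([0-9\.]+)"

# Made-up Keras progress line, for illustration only
sample = ("1563/1563 [==============================] - 309s 198ms/step"
          " - loss: 1.2361 - acc: 0.5570 - val_loss: 1.1021 - val_acc: 0.6100")

original_matches = re.findall(ORIGINAL, sample)
strict_matches = re.findall(STRICT, sample)

print(original_matches)
print(strict_matches)
```

If the objective should track only the training loss (or only the validation loss), a simple alternative is to have the training script print a uniquely named metric line and match that instead.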


## Create and launch a hyperparameter tuning job
Now you can launch a hyperparameter tuning job by calling the `create_hyper_parameter_tuning_job` API. Pass the name and JSON objects you created in the previous steps as the values of the parameters. After the tuning job is created, you can describe it to see its progress in the next step, and you can go to the SageMaker console (under Jobs) to check the progress of each training job that has been created.

In [10]:
smclient.create_hyper_parameter_tuning_job(HyperParameterTuningJobName = tuning_job_name,
                                               HyperParameterTuningJobConfig = tuning_job_config,
                                               TrainingJobDefinition = training_job_definition)

{'HyperParameterTuningJobArn': 'arn:aws:sagemaker:us-east-1:023375022819:hyper-parameter-tuning-job/byo-keras-tuningjob-28-16-18-48',
 'ResponseMetadata': {'RequestId': '3c2121f6-b46e-4f1a-8d99-da877e00d763',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '3c2121f6-b46e-4f1a-8d99-da877e00d763',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '132',
   'date': 'Wed, 28 Aug 2019 16:23:05 GMT'},
  'RetryAttempts': 0}}

Let's run a quick check of the hyperparameter tuning job's status to make sure it started successfully and is `InProgress`. Rerunning the cell after the job has finished reports `Completed`, as in the output below.

In [24]:
smclient.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName = tuning_job_name)['HyperParameterTuningJobStatus']

'Completed'
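
Instead of rerunning the status cell by hand, you can poll until the job reaches a terminal state. The sketch below is a hypothetical helper that takes the describe function as an argument, so it can be exercised here with a stand-in. In the notebook you would pass `smclient.describe_hyper_parameter_tuning_job` and `tuning_job_name`.

```python
import time

# Statuses after which the tuning job will not change any further
TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}

def wait_for_tuning_job(describe, job_name, poll_seconds=60, max_polls=120, sleep=time.sleep):
    # Poll the describe call until a terminal status is seen or polls run out
    for _ in range(max_polls):
        response = describe(HyperParameterTuningJobName=job_name)
        status = response['HyperParameterTuningJobStatus']
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError('tuning job %s still running after %d polls' % (job_name, max_polls))

# Stand-in for smclient.describe_hyper_parameter_tuning_job, for illustration only
_responses = iter(['InProgress', 'InProgress', 'Completed'])

def fake_describe(HyperParameterTuningJobName):
    return {'HyperParameterTuningJobStatus': next(_responses)}

final_status = wait_for_tuning_job(fake_describe, 'demo-job', sleep=lambda seconds: None)
print(final_status)
```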

## Analyze tuning job results - after tuning job is completed
Please refer to "HPO_Analyze_TuningJob_Results.ipynb" to see example code to analyze the tuning job results.
<br>https://github.com/awslabs/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/analyze_results/HPO_Analyze_TuningJob_Results.ipynb

In [23]:
# run this cell to check current status of hyperparameter tuning job
tuning_job_result = smclient.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=tuning_job_name)

status = tuning_job_result['HyperParameterTuningJobStatus']
if status != 'Completed':
    print('Reminder: the tuning job has not been completed.')
    
job_count = tuning_job_result['TrainingJobStatusCounters']['Completed']
print("%d training jobs have completed" % job_count)
    
is_minimize = (tuning_job_result['HyperParameterTuningJobConfig']['HyperParameterTuningJobObjective']['Type'] != 'Maximize')
objective_name = tuning_job_result['HyperParameterTuningJobConfig']['HyperParameterTuningJobObjective']['MetricName']

9 training jobs have completed


In [25]:
from pprint import pprint
if tuning_job_result.get('BestTrainingJob',None):
    print("Best model found so far:")
    pprint(tuning_job_result['BestTrainingJob'])
else:
    print("No training jobs have reported results yet.")

Best model found so far:
{'CreationTime': datetime.datetime(2019, 8, 28, 16, 30, 41, tzinfo=tzlocal()),
 'FinalHyperParameterTuningJobObjectiveMetric': {'MetricName': 'loss',
                                                 'Value': 1.2361425161361694},
 'ObjectiveStatus': 'Succeeded',
 'TrainingEndTime': datetime.datetime(2019, 8, 28, 16, 37, 42, tzinfo=tzlocal()),
 'TrainingJobArn': 'arn:aws:sagemaker:us-east-1:023375022819:training-job/byo-keras-tuningjob-28-16-18-48-004-ef381052',
 'TrainingJobName': 'BYO-keras-tuningjob-28-16-18-48-004-ef381052',
 'TrainingJobStatus': 'Completed',
 'TrainingStartTime': datetime.datetime(2019, 8, 28, 16, 32, 33, tzinfo=tzlocal()),
 'TunedHyperParameters': {'learning_rate': '0.0007307682577397425'}}
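
The best training job reports only the tuned hyperparameters. To retrain with the winning configuration, for example with more epochs, you would combine them with the static values from the training job definition. The sketch below shows that merge using the values reported above; `full_hyperparameters` is a hypothetical helper.

```python
# Static values from the training job definition in this notebook
static_hyperparameters = {
    "batch_size": "32",
    "data_augmentation": "True",
    "height_shift_range": "0.1",
    "width_shift_range": "0.1",
    "epochs": "1",
}

# Relevant fields of the BestTrainingJob entry shown above
best_training_job = {
    'TrainingJobName': 'BYO-keras-tuningjob-28-16-18-48-004-ef381052',
    'TunedHyperParameters': {'learning_rate': '0.0007307682577397425'},
}

def full_hyperparameters(static, best_job):
    # Tuned values take precedence over static ones if a key appears in both
    merged = dict(static)
    merged.update(best_job['TunedHyperParameters'])
    return merged

best_config = full_hyperparameters(static_hyperparameters, best_training_job)
print(best_config)
```

The merged dictionary can be passed as `hyperparameters` to an `Estimator`, as in the local-mode step.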


In [19]:
import pandas as pd

In [26]:
tuner = sagemaker.HyperparameterTuningJobAnalytics(tuning_job_name)

full_df = tuner.dataframe()
df = full_df  # fall back to the unfiltered frame so that `df` is always defined

if len(full_df) > 0:
    df = full_df[full_df['FinalObjectiveValue'] > -float('inf')]
    if len(df) > 0:
        df = df.sort_values('FinalObjectiveValue', ascending=is_minimize)
        print("Number of training jobs with valid objective: %d" % len(df))
        print({"lowest":min(df['FinalObjectiveValue']),"highest": max(df['FinalObjectiveValue'])})
        pd.set_option('display.max_colwidth', -1)  # Don't truncate TrainingJobName        
    else:
        print("No training jobs have reported valid results yet.")
        
df

Number of training jobs with valid objective: 9
{'lowest': 1.2361425161361694, 'highest': 1.610754132270813}


| | FinalObjectiveValue | TrainingElapsedTimeSeconds | TrainingEndTime | TrainingJobName | TrainingJobStatus | TrainingStartTime | learning_rate |
|---|---|---|---|---|---|---|---|
| 5 | 1.236143 | 309.0 | 2019-08-28 16:37:42+00:00 | BYO-keras-tuningjob-28-16-18-48-004-ef381052 | Completed | 2019-08-28 16:32:33+00:00 | 0.000731 |
| 0 | 1.255291 | 359.0 | 2019-08-28 16:46:58+00:00 | BYO-keras-tuningjob-28-16-18-48-009-28cb54f4 | Completed | 2019-08-28 16:40:59+00:00 | 0.00091 |
| 4 | 1.258336 | 308.0 | 2019-08-28 16:37:52+00:00 | BYO-keras-tuningjob-28-16-18-48-005-161aa93f | Completed | 2019-08-28 16:32:44+00:00 | 0.000823 |
| 7 | 1.264098 | 319.0 | 2019-08-28 16:30:33+00:00 | BYO-keras-tuningjob-28-16-18-48-002-1f6407a6 | Completed | 2019-08-28 16:25:14+00:00 | 0.000432 |
| 1 | 1.271155 | 380.0 | 2019-08-28 16:46:44+00:00 | BYO-keras-tuningjob-28-16-18-48-008-9658a1a4 | Completed | 2019-08-28 16:40:24+00:00 | 0.00091 |
| 8 | 1.281773 | 313.0 | 2019-08-28 16:30:21+00:00 | BYO-keras-tuningjob-28-16-18-48-001-4a888e29 | Completed | 2019-08-28 16:25:08+00:00 | 0.000933 |
| 3 | 1.33297 | 376.0 | 2019-08-28 16:39:03+00:00 | BYO-keras-tuningjob-28-16-18-48-006-3fd4da64 | Completed | 2019-08-28 16:32:47+00:00 | 0.000751 |
| 6 | 1.4538 | 314.0 | 2019-08-28 16:30:22+00:00 | BYO-keras-tuningjob-28-16-18-48-003-ff9fc73c | Completed | 2019-08-28 16:25:08+00:00 | 0.000463 |
| 2 | 1.610754 | 303.0 | 2019-08-28 16:45:21+00:00 | BYO-keras-tuningjob-28-16-18-48-007-85d569b3 | Completed | 2019-08-28 16:40:18+00:00 | 0.000166 |


### See TuningJob results vs time

Next we will show how the objective metric changes over time as the tuning job progresses. With the Bayesian strategy, you should expect to see a general trend towards better results, but this progress will not be steady, because the algorithm needs to balance exploration of new areas of the parameter space against exploitation of known good areas. This can give you a sense of whether the number of training jobs is sufficient for the complexity of your search space.

In [27]:
import bokeh
import bokeh.io
bokeh.io.output_notebook()
from bokeh.plotting import figure, show
from bokeh.models import HoverTool

class HoverHelper():

    def __init__(self, tuning_analytics):
        self.tuner = tuning_analytics

    def hovertool(self):
        tooltips = [
            ("FinalObjectiveValue", "@FinalObjectiveValue"),
            ("TrainingJobName", "@TrainingJobName"),
        ]
        for k in self.tuner.tuning_ranges.keys():
            tooltips.append( (k, "@{%s}" % k) )

        ht = HoverTool(tooltips=tooltips)
        return ht

    def tools(self, standard_tools='pan,crosshair,wheel_zoom,zoom_in,zoom_out,undo,reset'):
        return [self.hovertool(), standard_tools]

hover = HoverHelper(tuner)

p = figure(plot_width=900, plot_height=400, tools=hover.tools(), x_axis_type='datetime')
p.circle(source=df, x='TrainingStartTime', y='FinalObjectiveValue')
show(p)
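
A compact way to summarize the same trend without a plot is the running best objective value. The sketch below uses the nine objective values from the results table, ordered by the job number in each training job name as an approximation of launch order, and computes the best loss seen so far after each job.

```python
from itertools import accumulate

# (job number, final loss) taken from the results table above
objective_by_job = [
    ("001", 1.281773), ("002", 1.264098), ("003", 1.4538),
    ("004", 1.236143), ("005", 1.258336), ("006", 1.33297),
    ("007", 1.610754), ("008", 1.271155), ("009", 1.255291),
]

values = [loss for _, loss in objective_by_job]

# The objective is minimized, so the running best is a cumulative minimum
running_best = list(accumulate(values, min))

# Index of the first job that reached the overall best value
first_best = running_best.index(min(values))

for (job, loss), best in zip(objective_by_job, running_best):
    print(job, loss, best)
```

In this run the best value appears at the fourth job and is not improved by the remaining five, which is the kind of plateau that may suggest widening the search range or training for more epochs rather than simply adding jobs.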

### Analyze the correlation between objective metric and individual hyperparameters
Now that you have finished a tuning job, you may want to know the correlation between your objective metric and the individual hyperparameters you selected to tune. That insight helps you decide whether it makes sense to adjust the search ranges for certain hyperparameters and start another tuning job. For example, if the objective improves steadily as a numerical hyperparameter increases, you probably want to set a higher tuning range for that hyperparameter in your next tuning job.

The following cell draws a graph for each hyperparameter to show its correlation with your objective metric.

In [22]:
ranges = tuner.tuning_ranges
figures = []
for hp_name, hp_range in ranges.items():
    categorical_args = {}
    if hp_range.get('Values'):
        # This is marked as categorical.  Check if all options are actually numbers.
        def is_num(x):
            try:
                float(x)
                return 1
            except (TypeError, ValueError):
                return 0
        vals = hp_range['Values']
        if sum([is_num(x) for x in vals]) == len(vals):
            # Bokeh has issues plotting a "categorical" range that's actually numeric, so plot as numeric
            print("Hyperparameter %s is tuned as categorical, but all values are numeric" % hp_name)
        else:
            # Set up extra options for plotting categoricals.  A bit tricky when they're actually numbers.
            categorical_args['x_range'] = vals

    # Now plot it
    p = figure(plot_width=500, plot_height=500, 
               title="Objective vs %s" % hp_name,
               tools=hover.tools(),
               x_axis_label=hp_name, y_axis_label=objective_name,
               **categorical_args)
    p.circle(source=df, x=hp_name, y='FinalObjectiveValue')
    figures.append(p)
show(bokeh.layouts.Column(*figures))
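
To put a number on the visual trend, the sketch below computes the Pearson correlation between `learning_rate` and the final loss, using the rounded values from the results table. The `pearson` helper is written out by hand so the example needs only the standard library. With nine one-epoch jobs, the result is a rough indication at best.

```python
import math

# (learning_rate, final loss) pairs from the results table above
points = [
    (0.000731, 1.236143), (0.00091, 1.255291), (0.000823, 1.258336),
    (0.000432, 1.264098), (0.00091, 1.271155), (0.000933, 1.281773),
    (0.000751, 1.33297), (0.000463, 1.4538), (0.000166, 1.610754),
]

def pearson(pairs):
    # Pearson correlation coefficient of the x and y components
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)

r = pearson(points)
print(round(r, 2))
```

The coefficient is negative here: within the searched range, higher learning rates tended to give lower loss. Because the objective is minimized, that points towards trying a range that extends above 0.001 in a follow-up tuning job.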

## Deploy the best model
Now that we have identified the best training job, we can deploy its model to an endpoint. Please refer to other SageMaker sample notebooks or the SageMaker documentation to see how to deploy a model.