### Deep Learning with Keras on Amazon SageMaker

Amazon SageMaker is a modular, fully managed Machine Learning service that lets you easily build, train and deploy models at any scale.

In this workshop, we'll use Keras (with the TensorFlow backend) to build a simple Convolutional Neural Network (CNN). We'll then train it to classify the Fashion-MNIST image data set. Fashion-MNIST is a Zalando dataset consisting of a training set of 60,000 examples and a validation set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes: it's a drop-in replacement for MNIST.

First, you will learn how to:
  * use the SageMaker SDK
  * adapt existing code to run on SageMaker ('script mode')
  * train your code locally for fast experimentation, without firing up managed infrastructure ('local mode')
  * train and deploy models on managed infrastructure
  * predict data samples
  
Then, we'll move on to more advanced topics, such as:
  * saving up to 90% on training costs with Managed Spot Training
  * saving up to 80% on GPU inference costs with Elastic Inference 
  * finding optimal hyperparameters automatically with Automatic Model Tuning

Resources
  * Amazon SageMaker documentation [ https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html ]
  * SageMaker SDK 
    * Code [ https://github.com/aws/sagemaker-python-sdk ] 
    * Documentation [ https://sagemaker.readthedocs.io/ ]
  * Fashion-MNIST [ https://github.com/zalandoresearch/fashion-mnist ] 
  * Keras documentation [ https://keras.io/ ]
  * Numpy documentation [ https://docs.scipy.org/doc/numpy/index.html ]

## Import the latest SageMaker SDK

In [None]:
!pip install -qU sagemaker

Restart the Jupyter kernel to use the latest SageMaker SDK ("Kernel" / "Restart").

In [None]:
import sagemaker

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

## Download the Fashion-MNIST dataset

In [None]:
from IPython.display import Image
Image("fashion-mnist-sprite.png")

First, we need to download the data set from the Internet. Fortunately, Keras provides a simple way to do this. The data set is already split (training and validation), with separate Numpy arrays for samples and labels. 

We create a local directory, and save the training and validation data sets separately.

In [None]:
import os
import keras
import numpy as np
from keras.datasets import fashion_mnist
(x_train, y_train), (x_val, y_val) = fashion_mnist.load_data()

os.makedirs("./data", exist_ok = True)

np.savez('./data/training', image=x_train, label=y_train)
np.savez('./data/validation', image=x_val, label=y_val)

In [None]:
%%sh
ls -l data
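
As a quick sanity check, we can reload a saved archive and read the arrays back by name. This is only a sketch: it uses small synthetic arrays and a temporary directory (both assumptions) so that it runs without downloading anything, but the `image` and `label` keys match the ones used above.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-ins for the Fashion-MNIST arrays (assumption: same layout
# as the real data, but only 8 blank samples).
x_demo = np.zeros((8, 28, 28), dtype=np.uint8)
y_demo = np.arange(8, dtype=np.uint8)

# Save to a temporary directory so the real ./data directory is left untouched.
demo_dir = tempfile.mkdtemp()
np.savez(os.path.join(demo_dir, "training"), image=x_demo, label=y_demo)

# np.savez appends the '.npz' extension when the file name does not have one.
saved_files = sorted(os.listdir(demo_dir))

# Reload the archive and read each array back by the keyword used when saving.
with np.load(os.path.join(demo_dir, "training.npz")) as archive:
    images = archive["image"]
    labels = archive["label"]

print(saved_files, images.shape, labels.shape)
```

The same `np.load()` call should work on `./data/training.npz` and `./data/validation.npz`.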

## Take a look at our Keras code

In [None]:
%%sh
pygmentize mnist_keras_tf.py

The main steps are:
  * receive and parse command line arguments: five hyperparameters, and four environment variables (we'll get back to these in a moment)
  * load the data sets
  * make sure data sets have the right shape for TensorFlow (channels last)
  * normalize data sets, i.e. transform [0-255] pixel values to [0-1] values
  * one-hot encode category labels (not familiar with this? More info: [ https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/ ])
  * build a Sequential model in Keras: two convolution blocks with max pooling, followed by a fully connected layer with dropout, and a final classification layer. Don't worry if this sounds like gibberish, it's not our focus today
  * train the model, leveraging multiple GPUs if they're available
  * print statistics
  * save the model in TensorFlow Serving format
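
To make the three data preparation steps above more concrete (channels last, normalization, one-hot encoding), here is a minimal NumPy-only sketch. It is not the code from the script: the `preprocess` helper and the sample values are assumptions chosen for illustration.

```python
import numpy as np


def preprocess(images, labels, num_classes=10):
    """Hypothetical helper: reshape, normalize and one-hot encode a batch."""
    # Channels last: (N, 28, 28) -> (N, 28, 28, 1) for single-channel images.
    x = images.reshape(images.shape[0], 28, 28, 1).astype("float32")
    # Normalize [0-255] pixel values to [0-1].
    x /= 255.0
    # One-hot encode: row i of the identity matrix is the encoding of class i.
    y = np.eye(num_classes, dtype="float32")[labels]
    return x, y


# Two all-white images with made-up labels 0 and 9.
demo_images = np.full((2, 28, 28), 255, dtype=np.uint8)
demo_labels = np.array([0, 9])

x, y = preprocess(demo_images, demo_labels)
print(x.shape, x.max(), y[1])
```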
  

## Train outside of SageMaker (just like on your laptop)

Before we start training on SageMaker, let's run this code locally and make sure that it trains fine. We just need to set the four environment variables that it expects, and run the script with Python.

In [None]:
# TODO: set the four environment variables expected by the script

# Number of GPUs on this machine
%env SM_NUM_GPUS=
# Where to save the model
%env SM_MODEL_DIR=
# Where the training data is
%env SM_CHANNEL_TRAINING=
# Where the validation data is
%env SM_CHANNEL_VALIDATION=

!python mnist_keras_tf.py --epochs 1

Why are these environment variables important anyway? Well, they will be automatically passed to our script by SageMaker, so that we know where the data sets are, where to save the model, and how many GPUs we have. So, if you write your local code this way, **there won't be anything to change** to run it on SageMaker.

This feature is called '**script mode**'. It's the recommended way to work with built-in frameworks on SageMaker.
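
Here is a minimal sketch of how a script can pick up these values, assuming it reads them with `argparse` and falls back to the environment variables as defaults. The four `SM_*` names and `--epochs` come from this notebook; every other argument name, the default values and the demo paths are hypothetical and may differ from the actual script.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command line arguments.
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=128)  # hypothetical name
    # Environment settings default to the variables set by SageMaker.
    # The argument names below are assumptions made for this sketch.
    parser.add_argument("--gpu-count", type=int,
                        default=int(os.environ.get("SM_NUM_GPUS", "0")))
    parser.add_argument("--model-dir", type=str,
                        default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--training", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAINING"))
    parser.add_argument("--validation", type=str,
                        default=os.environ.get("SM_CHANNEL_VALIDATION"))
    # parse_known_args() tolerates extra arguments instead of failing on them.
    args, _ = parser.parse_known_args(argv)
    return args


# Demo values only: outside of SageMaker, we set the variables ourselves.
os.environ.update({
    "SM_NUM_GPUS": "0",
    "SM_MODEL_DIR": "/tmp/demo-model",
    "SM_CHANNEL_TRAINING": "/tmp/demo-training",
    "SM_CHANNEL_VALIDATION": "/tmp/demo-validation",
})

args = parse_args(["--epochs", "1"])
print(args.epochs, args.gpu_count, args.model_dir, args.training)
```

With this pattern, the same script runs unchanged on a laptop (where we set the variables) and on SageMaker (where they are set for us).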

## Train on the notebook instance (aka 'local mode')

Our code runs fine. Now, let's try to run it inside the built-in TensorFlow environment provided by SageMaker. For fast experimentation, let's use local mode.

Read this first, and complete the cells below: [ https://sagemaker.readthedocs.io/en/stable/using_tf.html ]

In [None]:
from sagemaker.tensorflow import TensorFlow

## TODO: configure a local training job for 'mnist_keras_tf.py', 
#        training with TensorFlow 1.14 in script mode for just one epoch

tf_estimator = # TODO

Now, let's define the local location of the training and validation data sets

In [None]:
## TODO: define the local location for the training and validation data sets

local_training_input_path   = # TODO
local_validation_input_path = # TODO

Let's train!

In [None]:
## TODO: train on the local training and validation data sets

tf_estimator.fit(# TODO)

OK, our job runs fine locally. Let's now run the same job on a managed instance.

## Upload the data set to S3

SageMaker training instances expect data sets to be stored in Amazon S3, so let's upload them there. We could use boto3 to do this, but the SageMaker SDK includes a simple function: *Session.upload_data()*

[https://sagemaker.readthedocs.io/en/stable/session.html]

*Note: for high-performance workloads, Amazon EFS and Amazon FSx for Lustre are now also supported. More info here: [ https://aws.amazon.com/blogs/machine-learning/speed-up-training-on-amazon-sagemaker-using-amazon-efs-or-amazon-fsx-for-lustre-file-systems/ ]*



In [None]:
prefix = 'keras-fashion-mnist'

# TODO: in the default bucket, upload the training data set to 'keras-fashion-mnist/training'
training_input_path   = # TODO

# TODO: in the default bucket, upload the validation data set to 'keras-fashion-mnist/validation'
validation_input_path = # TODO

print(training_input_path)
print(validation_input_path)

We're done with our data set. Of course, in real life, much more work would be needed for data cleaning and preparation!

## Configure the training job on a fully managed instance

In [None]:
## TODO: configure a managed training job for 'mnist_keras_tf.py', 
#        using a single c5.2xlarge instance
#        running TensorFlow 1.14 in script mode for ten epochs

tf_estimator = # TODO

Let's train!

In [None]:
## TODO: train on the training and validation data sets stored in S3

tf_estimator.fit(# TODO)

This will take 4-5 minutes. Please take a look at the training log. The first few lines show SageMaker preparing the managed instance.

While the job is training, you can also look at metrics in the AWS console for SageMaker, and at the training log in the AWS console for CloudWatch Logs.

Once the job is complete, the trained model is saved in S3, and is now ready to be deployed.

## Deploy our model to a real-time endpoint

In [None]:
## TODO: Deploy the model to an endpoint backed by a single m5.large instance

import time

tf_endpoint_name = 'keras-tf-fmnist-'+time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

tf_predictor = # TODO

This will take about 7-8 minutes. While the model is deploying, please take a look at the "Endpoints" section in the AWS console for SageMaker.

## Predict 

Once the model is deployed, we can use it to predict data samples.

The cell below grabs 10 random images from the validation data set, and uses the predict() API to predict their class. 

More precisely, we build a prediction request in TensorFlow Serving format, and send an HTTPS POST request to the prediction endpoint: its URL is visible in the AWS console for SageMaker.
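
For reference, here is a sketch of what such a request and response can look like, assuming the row format of the TensorFlow Serving REST API (`instances` in the request, `predictions` in the response). The SDK and the serving container may build part of this for us, and the scores shown here are made up.

```python
import json

import numpy as np

# Two blank 28x28 single-channel images stand in for real samples (assumption).
batch = np.zeros((2, 28, 28, 1), dtype="float32")

# Row format: one entry per input sample under the "instances" key.
request_body = json.dumps({"instances": batch.tolist()})

# A matching response holds one list of 10 class scores per sample.
# These scores are invented for illustration only.
response_body = json.dumps({"predictions": [[0.1] * 10, [0.1] * 10]})

num_instances = len(json.loads(request_body)["instances"])
scores = np.array(json.loads(response_body)["predictions"])
print(num_instances, scores.shape)
```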

In [None]:
%matplotlib inline
import random
import matplotlib.pyplot as plt

num_samples = 10
indices = random.sample(range(x_val.shape[0]), num_samples)
images = x_val[indices]/255
labels = y_val[indices]

for i in range(num_samples):
    plt.subplot(1,num_samples,i+1)
    plt.imshow(images[i].reshape(28, 28), cmap='gray')
    plt.title(labels[i])
    plt.axis('off')
    
prediction = tf_predictor.predict(images.reshape(num_samples, 28, 28, 1))['predictions']
prediction = np.array(prediction)
predicted_label = prediction.argmax(axis=1)
print('Predicted labels are: {}'.format(predicted_label))
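
The predicted labels are class indices from 0 to 9. To make them easier to read, we can map them to the class names listed in the Fashion-MNIST repository. The sketch below uses made-up scores instead of a real endpoint response.

```python
import numpy as np

# Class names in label order, as listed in the Fashion-MNIST repository.
CLASS_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]

# Made-up scores for two samples: the first peaks at class 9, the second at class 1.
demo_scores = np.zeros((2, 10))
demo_scores[0, 9] = 0.8
demo_scores[1, 1] = 0.7

# argmax over the class axis gives one class index per sample.
demo_labels = demo_scores.argmax(axis=1)
demo_names = [CLASS_NAMES[i] for i in demo_labels]
print(demo_names)
```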

## Clean up

Once we're done with this endpoint, we should delete it to avoid unnecessary costs.

In [None]:
## TODO: delete the endpoint

We've covered the basics. Now let's look at more advanced topics.
***

## Managed Spot Training

EC2 Spot Instances have long been a great cost optimization feature, and spot training is now available on SageMaker.

This blog post has more info: [ https://aws.amazon.com/blogs/aws/managed-spot-training-save-up-to-90-on-your-amazon-sagemaker-training-jobs/ ]

In [None]:
## TODO: configure the same training job as above,
#        with Managed Spot Training

tf_estimator = # TODO

tf_estimator.fit(# TODO)

Check out the end of the training log. How much did you save?
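
If you'd like to check the number yourself, here is a small sketch of the arithmetic, assuming the log reports both the training seconds and the billable seconds of the job. The values below are made up.

```python
def spot_savings_percent(training_seconds, billable_seconds):
    """Percentage saved when we are billed for fewer seconds than we trained."""
    return (1 - billable_seconds / training_seconds) * 100


# Made-up example: 200 seconds of training, 50 of them billed.
savings = spot_savings_percent(training_seconds=200, billable_seconds=50)
print("Savings: {:.1f}%".format(savings))
```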

## Elastic Inference

Elastic Inference is a feature that lets you attach fractional GPU acceleration to any EC2 instance. It's also available on SageMaker.

This blog post has more info: [ https://aws.amazon.com/blogs/aws/amazon-elastic-inference-gpu-powered-deep-learning-inference-acceleration/ ] 

In [None]:
## TODO: Deploy the model to an endpoint backed by a single c5.large instance,
#  accelerated by a medium-size elastic inference accelerator

tf_endpoint_name = 'keras-tf-fmnist-'+time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

tf_predictor = # TODO

Once the endpoint is deployed, you can use the 'Predict' cell above to send it some samples.

Don't forget to delete the endpoint once you're done.

In [None]:
## TODO: delete the endpoint

## Automatic Model Tuning

Automatic model tuning is a great feature that helps you automatically find the best hyperparameters for your training job.

This blog post has more info: [ https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-automatic-model-tuning-now-supports-random-search-and-hyperparameter-scaling/ ]

First, let's define parameter ranges

In [None]:
## TODO: define parameter ranges like so:
#       - learning rate: from 0.001 to 0.1, with logarithmic scaling
#       - batch size: from 32 to 1024
#       - dense layer: from 128 to 1024 neurons
#       - dropout: from 0.2 to 0.6

from sagemaker.tuner import IntegerParameter, ContinuousParameter

hyperparameter_ranges = {
    # TODO
}
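
Why logarithmic scaling for the learning rate? The range 0.001 to 0.1 spans two orders of magnitude, and exploring it uniformly would spend most trials between 0.01 and 0.1. The sketch below illustrates the idea with plain random sampling. It is only an illustration of the concept, not how SageMaker picks values, and the helper name is hypothetical.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Hypothetical helper: sample uniformly in log10 space, then convert back."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


rng = random.Random(0)  # fixed seed so that the sketch is repeatable

log_samples = [sample_log_uniform(0.001, 0.1, rng) for _ in range(1000)]
linear_samples = [rng.uniform(0.001, 0.1) for _ in range(1000)]

# Fraction of samples that fall in the lower decade (below 0.01).
log_fraction = sum(s < 0.01 for s in log_samples) / len(log_samples)
linear_fraction = sum(s < 0.01 for s in linear_samples) / len(linear_samples)
print(log_fraction, linear_fraction)
```

With logarithmic sampling, each decade gets roughly the same share of samples, whereas linear sampling rarely visits the smallest values.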

The next step is to define the metric we're optimizing for: in this case, we want to maximize validation accuracy. This value is reported as 'val_acc' in the training log, and we have to pass a regular expression so that SageMaker can find it accurately.

This last step is not necessary for built-in algorithms and built-in frameworks: there, we can simply pass the metric name.

In [None]:
objective_metric_name = 'val_acc'
objective_type = 'Maximize'
metric_definitions = [{'Name': 'val_acc', 'Regex': 'val_acc: ([0-9\\.]+)'}]
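
To see what this regular expression captures, we can try it on a sample log line with Python's `re` module. The log line below is made up, in the style of a Keras progress line, and the actual training log may be formatted differently.

```python
import re

# Same pattern as in metric_definitions, written here as a raw string.
METRIC_REGEX = r"val_acc: ([0-9\.]+)"

# Made-up log line for illustration.
log_line = ("60000/60000 - 12s - loss: 0.3512 - acc: 0.8731 "
            "- val_loss: 0.2987 - val_acc: 0.8925")

# The capture group holds the metric value as text.
match = re.search(METRIC_REGEX, log_line)
val_acc = float(match.group(1))
print(val_acc)
```

Note that the pattern only matches the value following 'val_acc: ', so the training accuracy reported as 'acc: ' earlier on the same line is ignored.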

Then, it's time to put everything together, and configure the tuning job.

In [None]:
from sagemaker.tuner import HyperparameterTuner

## TODO: configure a tuning job using the TensorFlow estimator, the parameter ranges and the metric defined above.
#        Let's run ten individual jobs, two by two.

tuner = HyperparameterTuner(# TODO)

Finally, let's launch the tuning job, just like a normal estimator. We definitely want to use spot training here!

In [None]:
## TODO: launch the tuning job, passing the location of the data sets in S3.

tuner.fit(# TODO)

While the job is running, you can view it in the AWS console for SageMaker: individual jobs (and their logs), best training job so far, etc.

Of course, you can also inspect the job programmatically using boto3: *describe_hyper_parameter_tuning_job()*, etc. [ https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html ]

## Deploy the best model

Once the tuning job is complete, you can deploy the best model, just like we did for a normal estimator.

*Note: if you call deploy() while the tuning job is running, it will deploy the best current model.*

In [None]:
tf_endpoint_name = 'keras-tf-fmnist-'+time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

## TODO: Deploy the model to an endpoint backed by a single c5.large instance,
#  accelerated by a medium-size elastic inference accelerator

tf_predictor = # TODO

Again, you can use the 'Predict' cell above to send some samples to the endpoint.

Don't forget to delete the endpoint once you're done.

In [None]:
## TODO: delete the endpoint

Congratulations, you've made it to the end of this workshop. I hope that you had fun and learned a lot! Before leaving, please check the AWS console for SageMaker, and make sure that you've shut down all your endpoints in order to avoid unnecessary costs.

If you want to learn more, please check out the large collection of SageMaker notebooks on GitHub:
[ https://github.com/awslabs/amazon-sagemaker-examples ]

Julien Simon @julsimon