In [1]:
%%bash

# Do we have GPU support?
nvidia-smi > /dev/null 2>&1
if [ $? -eq 0 ]; then
  # check if we have nvidia-docker
  NVIDIA_DOCKER=`rpm -qa | grep -c nvidia-docker2`
  if [ $NVIDIA_DOCKER -eq 0 ]; then
    # Install nvidia-docker2
    DOCKER_VERSION=`yum list docker | tail -1 | awk '{print $2}' | head -c 2`

    if [ $DOCKER_VERSION -eq 17 ]; then
      DOCKER_PKG_VERSION='17.09.1ce-1.111.amzn1'
      NVIDIA_DOCKER_PKG_VERSION='2.0.3-1.docker17.09.1.ce.amzn1'
    else
      DOCKER_PKG_VERSION='18.06.1ce-3.17.amzn1'
      NVIDIA_DOCKER_PKG_VERSION='2.0.3-1.docker18.06.1.ce.amzn1'
    fi

    sudo yum -y remove docker
    sudo yum -y install docker-$DOCKER_PKG_VERSION

    sudo /etc/init.d/docker start

    curl -s -L https://nvidia.github.io/nvidia-docker/amzn1/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
    sudo yum install -y nvidia-docker2-$NVIDIA_DOCKER_PKG_VERSION
    sudo cp daemon.json /etc/docker/daemon.json
    sudo pkill -SIGHUP dockerd
    echo "installed nvidia-docker2"
  else
    echo "nvidia-docker2 already installed. We are good to go!"
  fi
fi

# This is common for both GPU and CPU instances

# check if we have docker-compose
docker-compose version >/dev/null 2>&1
if [ $? -ne 0 ]; then
  # install docker compose
  pip install docker-compose
fi

# check if we need to configure our docker interface
SAGEMAKER_NETWORK=`docker network ls | grep -c sagemaker-local`
if [ $SAGEMAKER_NETWORK -eq 0 ]; then
  docker network create --driver bridge sagemaker-local
fi

# Notebook instance Docker networking fixes
RUNNING_ON_NOTEBOOK_INSTANCE=`sudo iptables -S OUTPUT -t nat | grep -c 169.254.0.2`

# Get the Docker Network CIDR and IP for the sagemaker-local docker interface.
SAGEMAKER_INTERFACE=br-`docker network ls | grep sagemaker-local | cut -d' ' -f1`
DOCKER_NET=`ip route | grep $SAGEMAKER_INTERFACE | cut -d" " -f1`
DOCKER_IP=`ip route | grep $SAGEMAKER_INTERFACE | cut -d" " -f12`

# check if both IPTables and the Route Table are OK.
IPTABLES_PATCHED=`sudo iptables -S PREROUTING -t nat | grep -c 169.254.0.2`
ROUTE_TABLE_PATCHED=`sudo ip route show table agent | grep -c $SAGEMAKER_INTERFACE`

if [ $RUNNING_ON_NOTEBOOK_INSTANCE -gt 0 ]; then

  if [ $ROUTE_TABLE_PATCHED -eq 0 ]; then
    # fix routing
    sudo ip route add $DOCKER_NET via $DOCKER_IP dev $SAGEMAKER_INTERFACE table agent
  else
    echo "SageMaker instance route table setup is ok. We are good to go."
  fi

  if [ $IPTABLES_PATCHED -eq 0 ]; then
    sudo iptables -t nat -A PREROUTING  -i $SAGEMAKER_INTERFACE -d 169.254.169.254/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 169.254.0.2:9081
    echo "iptables for Docker setup done"
  else
    echo "SageMaker instance routing for Docker is ok. We are good to go!"
  fi
fi

nvidia-docker2 already installed. We are good to go!
SageMaker instance route table setup is ok. We are good to go.
SageMaker instance routing for Docker is ok. We are good to go!
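
The setup cell above reads the Docker network CIDR and gateway IP from `ip route` by field position (`cut -d" " -f1` and `-f12`), which depends on the exact spacing of that command's output. As a minimal sketch of a less position-dependent alternative, the same values can be looked up by keyword. The helper name `parse_route_line` and the sample route line are assumptions made for illustration; real values depend on the Docker network on your instance.

```python
def parse_route_line(line):
    """Return (network_cidr, interface, source_ip) from one `ip route` line."""
    tokens = line.split()  # split() with no argument collapses repeated spaces
    network = tokens[0]
    interface = tokens[tokens.index("dev") + 1]
    source_ip = tokens[tokens.index("src") + 1]
    return network, interface, source_ip


# Hypothetical sample line; the bridge name and addresses are made up.
sample = "172.18.0.0/16 dev br-0123456789ab  proto kernel  scope link  src 172.18.0.1"
route = parse_route_line(sample)
print(route)
```

Looking up the token after `src` keeps working if the number of spaces between fields changes, whereas a fixed field index does not.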


# ResNet CIFAR-10 with TensorBoard

This notebook shows how to use TensorBoard and how the training job writes checkpoints to an external bucket.
The model used for this notebook is a ResNet model, trained with the CIFAR-10 dataset.
See the following papers for more background:

[Deep Residual Learning for Image Recognition](https://arxiv.org/pdf/1512.03385.pdf) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Dec 2015.

[Identity Mappings in Deep Residual Networks](https://arxiv.org/pdf/1603.05027.pdf) by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Jul 2016.

### Set up the environment

In [2]:
import os
import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()
# role = 'SageMakerRole'

### Download the CIFAR-10 dataset
Downloading the test and training data will take around 5 minutes.

In [3]:
import utils

utils.cifar10_download()

>> Downloading cifar-10-binary.tar.gz 
Successfully downloaded cifar-10-binary.tar.gz 170052171 bytes.


### Upload the data to an S3 bucket

In [4]:
# inputs = sagemaker_session.upload_data(path='/tmp/cifar10_data', key_prefix='data/DEMO-cifar10')
inputs = 's3://sagemaker-us-west-2-369233609183/data/DEMO-cifar10'

**sagemaker_session.upload_data** uploads the CIFAR-10 dataset from your machine to a bucket named **sagemaker-{region}-{*your aws account number*}**. If you don't have this bucket yet, sagemaker_session will create it for you.
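
As a small illustration of the naming pattern described above, the sketch below assembles the bucket name and the resulting S3 URI from a region, an account number, and a key prefix. The helper names `default_bucket_name` and `s3_uri` are hypothetical and are not part of the SageMaker SDK; the values are the ones that appear in the `inputs` string used in this notebook.

```python
def default_bucket_name(region, account_id):
    """Build a bucket name following the sagemaker-{region}-{account} pattern."""
    return "sagemaker-{}-{}".format(region, account_id)


def s3_uri(bucket, key_prefix):
    """Join a bucket and key prefix into an s3:// URI."""
    return "s3://{}/{}".format(bucket, key_prefix.strip("/"))


bucket = default_bucket_name("us-west-2", "369233609183")
inputs_uri = s3_uri(bucket, "data/DEMO-cifar10")
print(inputs_uri)
```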

### Complete source code
- [source_dir/resnet_model.py](source_dir/resnet_model.py): ResNet model
- [source_dir/resnet_cifar_10.py](source_dir/resnet_cifar_10.py): main script used for training and hosting

## Create a training job using the sagemaker.TensorFlow estimator
## THIS IS LOCAL MODE

In [9]:
from sagemaker.tensorflow import TensorFlow


source_dir = os.path.join(os.getcwd(), 'source_dir')
estimator = TensorFlow(entry_point='benchmark_main.py', framework_version='1.11', py_version='py3',
                       source_dir=source_dir,
                       role=role,
                       hyperparameters={'model': 'resnet50', 'dist_strat': ''},
#                                         'training':'/opt/ml/input/data/training'},
                       train_instance_count=1, train_instance_type='local_gpu')

estimator.fit(inputs)

INFO:sagemaker:Creating training-job with name: sagemaker-tensorflow-scriptmode-2018-11-20-00-36-36-090


ClientError: An error occurred (503) when calling the GetAuthorizationToken operation (reached max retries: 4): Service Unavailable

# ON SAGEMAKER

In [10]:
from sagemaker.tensorflow import TensorFlow


source_dir = os.path.join(os.getcwd(), 'source_dir')
estimator = TensorFlow(entry_point='benchmark_main.py', framework_version='1.11', py_version='py3',
                       source_dir=source_dir,
                       role=role,
                       hyperparameters={'model': 'resnet50', 'dist_strat': ''},
#                                         'training':'/opt/ml/input/data/training'},
                       train_instance_count=2, train_instance_type='ml.p2.xlarge')

estimator.fit(inputs)

INFO:sagemaker:Creating training-job with name: sagemaker-tensorflow-scriptmode-2018-11-20-01-48-29-524


2018-11-20 01:48:30 Starting - Starting the training job...
2018-11-20 01:48:31 Starting - Launching requested ML instances......
2018-11-20 01:49:35 Starting - Preparing the instances for training.........
2018-11-20 01:51:23 Downloading - Downloading input data..
2018-11-20 01:51:40,983 sagemaker-containers INFO     Imported framework sagemaker_tensorflow_container.training
2018-11-20 01:51:41,489 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    },
    "current_host": "algo-2",
    "framework_module": "sagemaker_tensorflow_container.training:main",
    "hosts": [
        "algo-1",
        "algo-2"
    ],
    "hyperparameters": {
        "dist_strat": "",
        "model": "resnet50",
        "model_dir": "s3://sagemaker-us-west-2-369233609183/sagemaker-tensorflow-scriptmode-2018-11-20-01-48-29-524/model"
    },

The **```fit```** method creates a training job named **```sagemaker-tensorflow-scriptmode-{unique identifier}```** on two **ml.p2.xlarge** instances. These instances write checkpoints to the S3 bucket **```sagemaker-{region}-{your aws account number}```**.

If you don't have this bucket yet, **```sagemaker_session```** will create it for you. These checkpoints can be used to restore the training job and to analyze training job metrics using **TensorBoard**.
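
The training environment printed in the log above lists every host in the job and the host that produced the log. A minimal sketch of how a script could use those fields is shown below. The JSON is an excerpt of that log output trimmed to three fields, and the helper `describe_host` is hypothetical. Treating the first host in sorted order as the coordinator is an assumption made for illustration, not something the container is known to enforce.

```python
import json

# Excerpt of the training environment from the log, trimmed to the fields used here.
training_env_json = """
{
    "channel_input_dirs": {"training": "/opt/ml/input/data/training"},
    "current_host": "algo-2",
    "hosts": ["algo-1", "algo-2"]
}
"""


def describe_host(env):
    """Summarize where the current host sits in the list of training hosts."""
    hosts = sorted(env["hosts"])
    return {
        "host_index": hosts.index(env["current_host"]),
        "num_hosts": len(hosts),
        # Assumption: the first host in sorted order acts as the coordinator.
        "is_first_host": env["current_host"] == hosts[0],
        "training_dir": env["channel_input_dirs"]["training"],
    }


info = describe_host(json.loads(training_env_json))
print(info)
```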

The parameter **```run_tensorboard_locally=True```** runs **TensorBoard** on the machine where this notebook is running. Every time the training job creates a new checkpoint in the S3 bucket, **```fit```** downloads the checkpoint to the temp folder that **TensorBoard** is pointing to.
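
The download step described above amounts to comparing what is in the bucket with what is already in the local folder. The sketch below shows that comparison only. It is not the SDK's implementation; the helper `keys_to_download` and the checkpoint key names are hypothetical, and the S3 listing is represented by a plain list of keys.

```python
import os
import tempfile


def keys_to_download(remote_keys, local_dir):
    """Return remote keys whose file name is not yet present in local_dir."""
    already_synced = set(os.listdir(local_dir))
    return [
        key for key in sorted(remote_keys)
        if os.path.basename(key) not in already_synced
    ]


with tempfile.TemporaryDirectory() as tmp:
    # Pretend one checkpoint file was downloaded on an earlier pass.
    open(os.path.join(tmp, "model.ckpt-0.index"), "w").close()
    remote = [
        "job/checkpoints/model.ckpt-0.index",    # hypothetical key
        "job/checkpoints/model.ckpt-100.index",  # hypothetical key
    ]
    pending = keys_to_download(remote, tmp)

print(pending)
```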

When the **```fit```** method starts the training, it logs the port that **TensorBoard** is using to display the metrics. The default port is **6006**, but another port may be chosen depending on availability: the port number increases until an available port is found, and the chosen port is then printed to stdout.
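
The port search described above can be sketched as a loop that tries to bind each port in turn, starting at 6006. This is an illustration of the idea under stated assumptions, not the SDK's code: the function name `find_available_port`, the `max_tries` limit, and the choice of binding to `127.0.0.1` are all hypothetical.

```python
import socket


def find_available_port(start=6006, max_tries=100, host="127.0.0.1"):
    """Return the first port at or above `start` that can be bound on `host`."""
    last = min(start + max_tries, 65536)  # never go past the valid port range
    for port in range(start, last):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is in use, try the next one
            return port
    raise RuntimeError("no available port found starting at {}".format(start))


print(find_available_port())
```

The probe socket is closed before the function returns, so another process could still claim the port in the meantime; a real launcher would need to handle that case.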

It takes a few minutes to provision containers and start the training job. **TensorBoard** will start to display metrics shortly after that.

You can access **TensorBoard** locally at [http://localhost:6006](http://localhost:6006) or through your SageMaker notebook instance at [proxy/6006/](/proxy/6006/). TensorBoard will not work if you omit the trailing slash, '/', at the end of the URL. If TensorBoard started on a different port, adjust these URLs to match.

This example uses the optional hyperparameter **```throttle_secs```** to generate training evaluations more often, so that **TensorBoard** scalar data can be visualized sooner. You can find the available optional hyperparameters [here](https://github.com/aws/sagemaker-python-sdk#optional-hyperparameters).
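
Adjusting the two TensorBoard URLs for a non-default port can be sketched as follows. The helper `tensorboard_urls` is hypothetical; it only reproduces the two URL shapes given above, including the trailing slash that the notebook proxy path requires.

```python
def tensorboard_urls(port=6006):
    """Return the local and notebook-proxy TensorBoard URLs for a port."""
    return {
        "local": "http://localhost:{}".format(port),
        # The trailing slash is required for the proxy path to work.
        "notebook_proxy": "/proxy/{}/".format(port),
    }


urls = tensorboard_urls(6007)
print(urls["local"])
print(urls["notebook_proxy"])
```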

# Deploy the trained model to prepare for predictions

The deploy() method creates an endpoint that serves prediction requests in real time.

In [None]:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

# Make a prediction with fake data to verify the endpoint is up

Prediction is not the focus of this notebook, so to verify the endpoint's functionality, we'll simply generate random data in the correct shape and make a prediction.

In [None]:
import numpy as np

random_image_data = np.random.rand(32, 32, 3)
predictor.predict(random_image_data)

# Cleaning up
To avoid incurring charges to your AWS account for the resources used in this tutorial, delete the **SageMaker Endpoint**:

In [None]:
sagemaker.Session().delete_endpoint(predictor.endpoint)