## MNIST Training with MXNet and Gluon

MNIST is a widely used dataset for handwritten digit classification. It consists of 70,000 labeled 28x28 pixel grayscale images of handwritten digits, split into 60,000 training images and 10,000 test images. There are 10 classes (one for each of the 10 digits). This tutorial shows how to train and test an MNIST model on SageMaker using MXNet and the Gluon API.



In [1]:
import os
import boto3
import sagemaker
from sagemaker.mxnet import MXNet
from mxnet import gluon
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

role = get_execution_role()

## Download training and test data

In [2]:
gluon.data.vision.MNIST('./data/train', train=True)
gluon.data.vision.MNIST('./data/test', train=False)

Downloading ./data/train/train-images-idx3-ubyte.gz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/mnist/train-images-idx3-ubyte.gz...
Downloading ./data/train/train-labels-idx1-ubyte.gz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/mnist/train-labels-idx1-ubyte.gz...
Downloading ./data/test/t10k-images-idx3-ubyte.gz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/mnist/t10k-images-idx3-ubyte.gz...
Downloading ./data/test/t10k-labels-idx1-ubyte.gz from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/mnist/t10k-labels-idx1-ubyte.gz...


<mxnet.gluon.data.vision.datasets.MNIST at 0x7fc2712d0e80>
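The four files downloaded above are gzip-compressed IDX files. As a quick sanity check, the header of such a file can be inspected with the standard library alone. The following is a minimal sketch, not part of the original notebook; the helper name `read_idx_header` is hypothetical.

```python
import gzip
import struct

# Magic numbers defined by the IDX format used for MNIST.
IMAGE_MAGIC = 2051  # 0x00000803: unsigned bytes, 3 dimensions
LABEL_MAGIC = 2049  # 0x00000801: unsigned bytes, 1 dimension


def read_idx_header(path):
    """Return (magic, dims) for a gzip-compressed IDX file (hypothetical helper)."""
    with gzip.open(path, "rb") as f:
        # Bytes 0-1 are zero, byte 2 is the data type, byte 3 the number of dimensions.
        _zero, dtype, ndim = struct.unpack(">HBB", f.read(4))
        # Each dimension size is a big-endian unsigned 32-bit integer.
        dims = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
    magic = (dtype << 8) | ndim
    return magic, dims
```

For `./data/train/train-images-idx3-ubyte.gz`, the dimensions should correspond to the 60,000 training images of 28x28 pixels described earlier.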

## Uploading the data

We use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value `inputs` identifies that location; we will use it later when we start the training job.

In [3]:
inputs = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-mnist')
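To make the upload concrete, the sketch below lists which S3 keys a local directory would map to. It assumes the directory layout is preserved under the key prefix, which is my understanding of how `upload_data` treats a directory. The helper names `plan_upload` and `s3_uri` are hypothetical, and the code does not call S3.

```python
import os


def plan_upload(path, key_prefix):
    """List (local_path, s3_key) pairs for every file under a local directory."""
    pairs = []
    for dirpath, _, filenames in os.walk(path):
        for name in filenames:
            local_path = os.path.join(dirpath, name)
            # Path relative to the upload root, with forward slashes for S3 keys.
            relative = os.path.relpath(local_path, start=path).replace(os.sep, "/")
            pairs.append((local_path, key_prefix + "/" + relative))
    return sorted(pairs)


def s3_uri(bucket, key_prefix):
    """Compose the S3 URI that identifies an uploaded prefix."""
    return "s3://{}/{}".format(bucket, key_prefix)
```

Under these assumptions, `plan_upload('data', 'data/DEMO-mnist')` would place the training files under `data/DEMO-mnist/train/` and the test files under `data/DEMO-mnist/test/`.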

## Implement the training function

We need to provide a training script that can run on the SageMaker platform. The training script is essentially the same as one you would write for local training, except that you need to provide a `train` function. The `train` function checks the validation accuracy at the end of every epoch and checkpoints the best model so far, along with the optimizer state, in the folder `/opt/ml/checkpoints` if that folder exists; otherwise it skips checkpointing. When SageMaker calls your function, it passes in arguments that describe the training environment. Check the script below to see how this works.
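The checkpointing rule described above can be sketched independently of MXNet. This is an illustration under stated assumptions rather than the code in `mnist.py`: `maybe_checkpoint` and the `save_fn` callback are hypothetical names. In a Gluon script, `save_fn` might write the network parameters and the trainer state into the given folder.

```python
import os


def maybe_checkpoint(val_acc, best_acc, checkpoint_dir, save_fn):
    """Save via save_fn when val_acc improves and checkpoint_dir exists.

    Returns the best validation accuracy seen so far.
    """
    if val_acc <= best_acc:
        return best_acc  # no improvement, nothing to save
    if os.path.isdir(checkpoint_dir):
        # save_fn is a hypothetical callback that persists model and optimizer state.
        save_fn(checkpoint_dir)
    # The best accuracy is tracked even when the folder is missing and saving is skipped.
    return val_acc
```

Called once per epoch with the latest validation accuracy, this keeps only the best model on disk and is a no-op for saving when `/opt/ml/checkpoints` does not exist.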

The script here is an adaptation of the [Gluon MNIST example](https://github.com/apache/incubator-mxnet/blob/master/example/gluon/mnist.py) provided by the [Apache MXNet](https://mxnet.incubator.apache.org/) project. 

In [4]:
!cat 'mnist.py'

from __future__ import print_function

import argparse
import logging
import os

import mxnet as mx
from mxnet import gluon, autograd
from mxnet.gluon import nn
import numpy as np
import json
import time


logging.basicConfig(level=logging.DEBUG)

# ------------------------------------------------------------ #
# Training methods                                             #
# ------------------------------------------------------------ #


def train(args):
    # SageMaker passes num_cpus, num_gpus and other args we can use to tailor training to
    # the current container environment, but here we just use simple cpu context.
    ctx = mx.cpu()

    # retrieve the hyperparameters we set in notebook (with some defaults)
    batch_size = args.batch_size
    epochs = args.epochs
    learning_rate = args.learning_rate
    momentum = args.momentum
    log_interval = args.log_interval

    num_gpus = int(os.environ['SM_NUM_GPUS'])
    current_host = args.cur

## Run the training script on SageMaker

The `MXNet` class allows us to run our training function on SageMaker infrastructure. We need to configure it with our training script, an IAM role, the number of training instances, and the training instance type. In this case we will run our training job on a single `ml.p2.xlarge` instance, using managed spot capacity with a maximum wait time of 360,000 seconds.

In [5]:
m = MXNet("mnist.py",
          role=role,
          train_instance_count=1,
          train_instance_type='ml.p2.xlarge',
          framework_version="1.4.1",
          train_use_spot_instances=True,
          train_max_wait=360000,
          py_version="py3",
          hyperparameters={'batch-size': 100,
                           'epochs': 20,
                           'learning-rate': 0.1,
                           'momentum': 0.9,
                           'log-interval': 100})
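The keys in `hyperparameters` reach the training script as command-line arguments, which is why they use dashes. The following is a minimal sketch of how a script might parse them, not the parser from `mnist.py`. The defaults are illustrative, and the `--model-dir` and `--train` options are assumptions that read the `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING` environment variables set inside the training container.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters passed as CLI arguments (illustrative defaults)."""
    parser = argparse.ArgumentParser()
    # argparse exposes "--batch-size" as args.batch_size, and so on.
    parser.add_argument("--batch-size", type=int, default=100)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--learning-rate", type=float, default=0.1)
    parser.add_argument("--momentum", type=float, default=0.9)
    parser.add_argument("--log-interval", type=int, default=100)
    # Assumed options: locations supplied by the container via environment variables.
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAINING"))
    return parser.parse_args(argv)
```

Running `parse_args(["--batch-size", "100", "--epochs", "20"])` locally gives an `args` object whose `batch_size` and `epochs` attributes match the values set in the notebook.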

After we've constructed our `MXNet` object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk.


In [6]:
m.fit(inputs)

2019-11-12 23:18:15 Starting - Starting the training job...
2019-11-12 23:18:19 Starting - Launching requested ML instances.........
2019-11-12 23:19:52 Starting - Preparing the instances for training...
2019-11-12 23:20:33 Downloading - Downloading input data...
2019-11-12 23:21:15 Training - Training image download completed. Training in progress..
2019-11-12 23:21:15,719 sagemaker-containers INFO     Imported framework sagemaker_mxnet_container.training
2019-11-12 23:21:15,722 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2019-11-12 23:21:15,734 sagemaker_mxnet_container.training INFO     MXNet training environment: {'SM_HOSTS': '["algo-1"]', 'SM_NETWORK_INTERFACE_NAME': 'eth0', 'SM_HPS': '{"batch-size":100,"epochs":20,"learning-rate":0.1,"log-interval":100,"momentum":0.9}', 'SM_USER_ENTRY_POINT': 'mnist.py', 'SM_FRAMEWORK_PARAMS': '{}', 'SM_RESOURCE_CONFIG': '{"current_host":"algo-1","hosts":["algo-1"],"network_interface_name":"e

[Epoch 0 Batch 100] Training: accuracy=0.800891, 4772.382719 samples/s
[Epoch 0 Batch 200] Training: accuracy=0.860398, 4767.500597 samples/s
[Epoch 0 Batch 300] Training: accuracy=0.886013, 5708.943908 samples/s
[Epoch 0 Batch 400] Training: accuracy=0.900000, 4816.388774 samples/s
[Epoch 0 Batch 500] Training: accuracy=0.910160, 5075.822008 samples/s
[Epoch 0] Training: accuracy=0.917933
[Epoch 0] Validation: accuracy=0.956400
[Epoch 1 Batch 100] Training: accuracy=0.960792, 4839.116239 samples/s
[Epoch 1 Batch 200] Training: accuracy=0.963433, 4873.074555 samples/s
[Epoch 1 Batch 300] Training: accuracy=0.963987, 4849.186658 samples/s
[Epoch 1 Batch 400] Training: accuracy=0.965112, 4868.831983 samples/s
[Epoch 1 Batch 500] Training: accuracy=0.965090, 4902.923539 samples/s
[Epoch 1] Training: accuracy=0.966000
[Epoch 1] Validation: accuracy=0.961100
[Ep

[Epoch 16] Training: accuracy=0.993800
[Epoch 16] Validation: accuracy=0.974200
[Epoch 17 Batch 100] Training: accuracy=0.994554, 4768.476222 samples/s
[Epoch 17 Batch 200] Training: accuracy=0.995622, 4766.579540 samples/s
[Epoch 17 Batch 300] Training: accuracy=0.995681, 4952.069706 samples/s
[Epoch 17 Batch 400] Training: accuracy=0.995661, 4748.716671 samples/s
[Epoch 17 Batch 500] Training: accuracy=0.995689, 4767.446407 samples/s
[Epoch 17] Training: accuracy=0.995583
[Epoch 17] Validation: accuracy=0.974900
[Epoch 18 Batch 100] Training: accuracy=0.994950, 4870.867495 samples/s
[Epoch 18 Batch 200] Training: accuracy=0.995920, 4849.130596 samples/s
[Epoch 18 Batch 300] Training: accuracy=0.995814, 4840.903477 samples/s
[Epoch 18 Batch 400] Training: accuracy=0.996185, 4924.568222 samples/s
[Epoch 18 Batch 500] Training: accuracy=0.996627, 4841.015224 samples/
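The per-epoch summary lines in this log are regular enough to parse, which helps when comparing training and validation accuracy across epochs. The sketch below assumes the line format shown above; the function name `epoch_accuracies` is hypothetical. It also drops terminal color codes if any are present.

```python
import re

# Terminal color codes such as "[31m" and "[0m", with or without the escape byte.
ANSI_COLOR = re.compile(r"\x1b?\[\d+m")
# Matches per-epoch summaries only; "[Epoch 0 Batch 100]" lines do not match.
EPOCH_LINE = re.compile(r"\[Epoch (\d+)\] (Training|Validation): accuracy=([0-9.]+)")


def epoch_accuracies(log_text):
    """Map each epoch number to {'Training': acc, 'Validation': acc}."""
    results = {}
    for line in log_text.splitlines():
        match = EPOCH_LINE.search(ANSI_COLOR.sub("", line))
        if match:
            epoch, phase, acc = match.groups()
            results.setdefault(int(epoch), {})[phase] = float(acc)
    return results
```

Applied to the output above, this makes the widening gap visible: by epoch 17 the training accuracy is 0.995583 while the validation accuracy is 0.974900.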

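After training, we use the `MXNet` object to build and deploy an `MXNetPredictor` object. This creates a SageMaker endpoint that we can use to perform inference. Here the endpoint runs on an `ml.m4.xlarge` instance with an `ml.eia2.medium` Elastic Inference accelerator attached.

The predictor lets us perform inference on JSON-encoded multi-dimensional arrays.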

In [7]:
base_CPU_instance_type = 'ml.m4.xlarge'
ei_accelerator_type = 'ml.eia2.medium'

predictor = m.deploy(initial_instance_count=1,
                     instance_type=base_CPU_instance_type,
                     accelerator_type=ei_accelerator_type)

---------------------------------------------------------------------------------------------------!

We can now use this predictor to classify handwritten digits. Drawing into the image box loads the pixel data into a `data` variable in this notebook, which we can then pass to the MXNet predictor.

In [None]:
from IPython.display import HTML
HTML(open("input.html").read())

The predictor runs inference on our input data and returns the predicted digit (as a float value, so we convert to int for display).

In [None]:
response = predictor.predict(data)
print(int(response))
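For readers without the drawing widget, the request can be sketched with plain Python. The input shape used here, a batch of one single-channel 28x28 image, is an assumption about what the model expects and is not confirmed by this notebook. The helper `to_digit` is a hypothetical name.

```python
import json

# A blank 28x28 grayscale image as nested lists (all pixels 0.0).
image = [[0.0] * 28 for _ in range(28)]

# Assumed input shape (1, 1, 28, 28): one image, one channel.
payload = json.dumps([[image]])


def to_digit(response):
    """Convert a float prediction such as 7.0 to an int class label."""
    return int(response)
```

The nested list `[[image]]` plays the role of `data` in the cell above, and `payload` shows what its JSON encoding would look like.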

## Cleanup

After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it.

In [None]:
predictor.delete_endpoint()