In [1]:
from utils import *
%matplotlib inline

# Train the neural network

<br>
<center><img src="support/robot.gif" width=600></center>

In this section, we discuss how to train a previously defined network on data. We first import the libraries. The new ones are `mxnet.init` for additional weight initialization methods, `datasets` and `transforms` for loading and transforming computer vision datasets, `matplotlib` for drawing, and `time` for benchmarking.

In [2]:
from mxnet import nd, gluon, init, autograd

from mxnet.gluon import nn
from mxnet.gluon.data.vision import datasets, transforms

import matplotlib.pyplot as plt
from time import time

## Get data

### Training Dataset: MNIST

The MNIST dataset of handwritten digits is one of the most commonly used datasets in deep learning, so we'll use it here.

The dataset can be downloaded automatically through Gluon's `data.vision.datasets.MNIST`, which is a subclass of `gluon.data.Dataset`.

In [3]:
mnist_train = datasets.MNIST(train=True)
X, y = mnist_train[0]
print('X shape: %s dtype: %s' % (X.shape, X.dtype))
print("Number of images: %d" % len(mnist_train))

X shape: (28, 28, 1) dtype: <class 'numpy.uint8'>
Number of images: 60000


In order to feed data into a Gluon model, we need to convert the images to the `(channel, height, width)` format with a floating point data type. `transforms.ToTensor` does both, and it also scales all pixel values to lie between 0 and 1. We apply this transform to the first element of each data pair, namely the images.

Transform dataset using `data.vision.transforms.ToTensor`:
- channel first, float32
- Min-Max Normalization
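
To make the two effects listed above concrete, here is a minimal pure-Python sketch of the same conversion on a tiny nested-list image. The helper `to_tensor_sketch` is hypothetical and only mimics the described behavior; it is not Gluon's implementation.

```python
def to_tensor_sketch(image_hwc):
    """Mimic the layout and scaling change: (H, W, C) ints in 0..255 -> (C, H, W) floats in [0, 1]."""
    height = len(image_hwc)
    width = len(image_hwc[0])
    channels = len(image_hwc[0][0])
    return [
        [[image_hwc[h][w][c] / 255.0 for w in range(width)] for h in range(height)]
        for c in range(channels)
    ]


# A 2x3 single-channel "image" standing in for a 28x28x1 MNIST digit.
image = [
    [[0], [51], [102]],
    [[153], [204], [255]],
]
tensor = to_tensor_sketch(image)
shape = (len(tensor), len(tensor[0]), len(tensor[0][0]))
print(shape)       # (1, 2, 3)
print(tensor[0])   # [[0.0, 0.2, 0.4], [0.6, 0.8, 1.0]]
```

The channel axis moves to the front and every value is divided by 255, which is why a `(28, 28, 1)` `uint8` image becomes a `(1, 28, 28)` floating point array.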


In [4]:
mnist_train = mnist_train.transform_first(transforms.ToTensor())

### Data Loading

In [5]:
batch_size = 256

train_data = gluon.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=True, num_workers=4)

The returned `train_data` is an iterable that yields batches of image and label pairs.

In [6]:
for data, label in train_data:
    print(data.shape, label.shape)
    break

(256, 1, 28, 28) (256,)
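
As a rough illustration of what shuffling and batching amount to, the sketch below splits sample indices into mini-batches. The function `batch_indices` is a hypothetical helper, and we assume the final, smaller batch is kept rather than dropped.

```python
import random


def batch_indices(num_samples, batch_size, shuffle=True, seed=0):
    """Yield lists of sample indices, one list per mini-batch."""
    order = list(range(num_samples))
    if shuffle:
        random.Random(seed).shuffle(order)
    for start in range(0, num_samples, batch_size):
        yield order[start:start + batch_size]


batches = list(batch_indices(num_samples=60000, batch_size=256))
print(len(batches))        # 235 batches per epoch
print(len(batches[0]))     # 256
print(len(batches[-1]))    # 96 samples in the final, smaller batch
```

With 60000 images and a batch size of 256, one pass over the data therefore takes 235 iterations under this assumption.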


## Define the model

We implement the simple neural network model introduced before. One difference here is that we change the weight initialization method to `Xavier`, which is a popular choice for deep convolutional neural networks.

In [7]:
net = nn.Sequential()
with net.name_scope():
    net.add(
        nn.Flatten(),
        nn.Dense(120, activation="relu"),
        nn.Dense(84, activation="relu"),
        nn.Dense(10)
    )
net.initialize(init=init.Xavier())
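
For intuition about what Xavier initialization does, here is a small standard-library sketch. It assumes the uniform variant that scales by the average of fan-in and fan-out with a magnitude of 3, which we take to match the defaults of `init.Xavier`; treat that as an assumption. The helper names are hypothetical.

```python
import math
import random


def xavier_uniform_bound(fan_in, fan_out, magnitude=3.0):
    """Half-width of the uniform range, using the average of fan-in and fan-out."""
    return math.sqrt(magnitude / ((fan_in + fan_out) / 2.0))


def xavier_uniform(fan_in, fan_out, seed=0):
    """Draw a fan_out x fan_in weight matrix from U(-bound, bound)."""
    bound = xavier_uniform_bound(fan_in, fan_out)
    rng = random.Random(seed)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)] for _ in range(fan_out)]


# First Dense layer: 28 * 28 = 784 inputs, 120 outputs.
bound = xavier_uniform_bound(784, 120)
weights = xavier_uniform(784, 120)
print(round(bound, 4))  # 0.0815
```

Wider layers get a smaller range, which keeps the scale of activations roughly stable from layer to layer.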

Besides the neural network, we need to define the loss function and the optimization method for training. We will use the standard softmax cross-entropy loss for classification problems. It first applies softmax to the output to obtain the predicted probabilities, and then uses cross entropy to compare them with the label.

### Loss

In [8]:
softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()

<center><img src="support/cross_entropy.png" width=400></center>
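
The two steps of the loss (softmax, then cross entropy against the label) can be sketched for a single sample in plain Python. The function names below are hypothetical, and the sketch is not Gluon's implementation.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy_sketch(scores, label):
    """Negative log-probability that softmax assigns to the true class."""
    return -math.log(softmax(scores)[label])


scores = [2.0, 1.0, 0.1]   # raw network outputs for a 3-class toy problem
probs = softmax(scores)
loss_right = softmax_cross_entropy_sketch(scores, label=0)  # label agrees with the top score
loss_wrong = softmax_cross_entropy_sketch(scores, label=2)  # label disagrees
print(round(loss_right, 3), round(loss_wrong, 3))
```

The loss is small when the network puts high probability on the correct class and grows as that probability shrinks.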

The optimization method we pick is standard stochastic gradient descent with a constant learning rate of 0.1.

### Optimization

In [9]:
trainer = gluon.Trainer(net.collect_params(),
                        'sgd', {'learning_rate': 0.1})

<center><img src="support/optimization.gif" width=400></center>

The `trainer` is created with all parameters (both weights and gradients) in `net`. Later on, we only need to call the `step` method to update its weights.
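
As a sketch of what one plain SGD update does, the snippet below applies the rule `w <- w - lr * grad / batch_size` to a list of weights. It assumes the gradients were summed over the batch and that `step(batch_size)` normalizes them by the batch size; momentum and weight decay are left out. `sgd_step` is a hypothetical helper.

```python
def sgd_step(weights, summed_grads, learning_rate, batch_size):
    """Return weights after one plain SGD update on batch-averaged gradients."""
    return [
        w - learning_rate * g / batch_size
        for w, g in zip(weights, summed_grads)
    ]


weights = [0.5, -0.25]
summed_grads = [25.6, -51.2]   # gradients summed over a batch of 256 samples
new_weights = sgd_step(weights, summed_grads, learning_rate=0.1, batch_size=256)
print(new_weights)
```

Each weight moves against its gradient, by 0.01 and 0.02 respectively in this toy case.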

### Accuracy 

In [10]:
def acc(output, label):
    # output: (batch, num_output) float32 ndarray
    # label: (batch, ) int32 ndarray
    # Compare the highest-scoring class of each row with its label.
    correct = (output.argmax(axis=1) == label.astype('float32'))
    return correct.mean().asscalar()
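
The same accuracy computation can be traced by hand on a few rows using plain Python lists. `accuracy_sketch` is a hypothetical stand-in that avoids any MXNet dependency.

```python
def accuracy_sketch(outputs, labels):
    """Fraction of rows whose highest-scoring column equals the label."""
    correct = 0
    for scores, label in zip(outputs, labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)  # argmax
        correct += int(predicted == label)
    return correct / len(labels)


outputs = [
    [0.1, 2.3, -1.0],   # predicts class 1
    [1.5, 0.2, 0.3],    # predicts class 0
    [0.0, 0.1, 4.0],    # predicts class 2
    [0.9, 0.8, 0.7],    # predicts class 0
]
labels = [1, 0, 2, 1]
print(accuracy_sketch(outputs, labels))  # 0.75
```

Three of the four predictions match their labels, so the accuracy is 0.75.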

## Training loop

Now we can implement the complete training loop.

In [11]:
for epoch in range(10):
    train_loss, train_acc = 0., 0.
    tic = time()
    for data, label in train_data:
        with autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, label)
        loss.backward()
        trainer.step(batch_size)
        
        train_loss += loss.mean().asscalar()
        train_acc += acc(output, label)

  
    print("Epoch[%d] Loss:%.3f Acc:%.3f Perf: %.1f img/sec"%(
        epoch, train_loss/len(train_data),
        train_acc/len(train_data),
        len(mnist_train)/(time()-tic)))

Epoch[0] Loss:0.589 Acc:0.837 Perf: 31231.4 img/sec
Epoch[1] Loss:0.289 Acc:0.917 Perf: 30453.0 img/sec
Epoch[2] Loss:0.233 Acc:0.934 Perf: 27167.5 img/sec
Epoch[3] Loss:0.197 Acc:0.944 Perf: 28271.2 img/sec
Epoch[4] Loss:0.172 Acc:0.951 Perf: 29715.1 img/sec
Epoch[5] Loss:0.151 Acc:0.956 Perf: 29598.4 img/sec
Epoch[6] Loss:0.134 Acc:0.962 Perf: 28834.3 img/sec
Epoch[7] Loss:0.121 Acc:0.965 Perf: 30130.1 img/sec
Epoch[8] Loss:0.110 Acc:0.969 Perf: 29800.6 img/sec
Epoch[9] Loss:0.100 Acc:0.972 Perf: 28675.8 img/sec


## Validate the model

In [12]:
#validation dataset
mnist_valid = gluon.data.vision.MNIST(train=False)

valid_data = gluon.data.DataLoader(mnist_valid.transform_first(transforms.ToTensor()), 
                                   batch_size=batch_size, 
                                   num_workers=4)

In [13]:
valid_acc = 0.
for data, label in valid_data:
    output = net(data)
    valid_acc += acc(output, label)
    
"Validation accuracy: %.2f"%(valid_acc/len(valid_data))

'Validation accuracy: 0.97'
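
One detail worth noting: averaging per-batch accuracies gives the last, smaller batch the same weight as a full one. The sketch below compares that average with a sample-weighted one. The per-batch accuracies are made up for illustration, and both helper names are hypothetical.

```python
def mean_of_batch_means(batch_accs):
    """Average per-batch accuracies, giving every batch equal weight."""
    return sum(batch_accs) / len(batch_accs)


def sample_weighted_mean(batch_accs, batch_sizes):
    """Average per-batch accuracies, weighting each batch by its size."""
    correct = sum(a * n for a, n in zip(batch_accs, batch_sizes))
    return correct / sum(batch_sizes)


# 10000 validation images in batches of 256: 39 full batches plus one of 16.
batch_sizes = [256] * 39 + [16]
# Hypothetical per-batch accuracies, made up for illustration only.
batch_accs = [0.97] * 39 + [0.75]

print(sum(batch_sizes))                                          # 10000
print(round(mean_of_batch_means(batch_accs), 4))                 # 0.9645
print(round(sample_weighted_mean(batch_accs, batch_sizes), 4))   # 0.9696
```

The gap is small here, but it grows when the final batch is tiny and behaves differently from the rest.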

## Save the model

Finally, we save the trained parameters onto disk, so that we can use them later.


<center><img src="support/save.gif" width=600></center>

In [14]:
net.save_parameters('net.params')

# Training with Amazon SageMaker

<br>
<center><img src="support/cloud-upload.gif" width=600></center>

Now let's see how to train the previously defined network on the AWS cloud, using Amazon SageMaker to manage the training job and its data. Let's import the SageMaker libraries.

In [15]:
import sagemaker
from sagemaker import get_execution_role

role = get_execution_role()
session = sagemaker.Session()

## Handling the data

Point to the data and output locations in S3.

In [16]:
data_location = 's3://{}/{}'.format(session.default_bucket(), 'data')
output_location = 's3://{}/{}'.format(session.default_bucket(), 'results')

## MXNet Model Training script

Package the training code defined above, together with the functions for inference, into a script that SageMaker uses as the entry point.

In [17]:
!pygmentize train_sagemaker.py

import argparse
import logging
import os
from time import time

import mxnet as mx
from mxnet import nd, gluon, init, autograd
from mxnet.gluon import nn
from mxnet.gluon.data.vision import datasets, transforms

def parse_args():
    parser = argparse.ArgumentParser()

    parser.add_argument('--num-gpus', type=int, default=1)
    parser.add_argument('--epochs', type=int, defaul

## SageMaker MXNet Estimator

In [18]:
from sagemaker.mxnet import MXNet

In [19]:
train_instance_type = 'ml.p3.2xlarge'

m = MXNet(entry_point='train_sagemaker.py',
          py_version='py3',
          role=role, 
          train_instance_count=1, 
          train_instance_type=train_instance_type,
          output_path=output_location,
          hyperparameters={'num-gpus': 1,
                           'epochs': 10,
                           'optimizer': 'adam',
                           'batch-size':256},
         input_mode='File',
         train_max_run=7200,
         framework_version='1.3.0')
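
To show how these hyperparameters might reach the entry point, here is a hedged sketch. It assumes SageMaker passes each key as a `--key value` command-line argument and exposes paths through the `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` environment variables. The function `parse_args_sketch`, the `--model-dir` and `--train` arguments, and all default values are assumptions made for illustration; they are not taken from `train_sagemaker.py`.

```python
import argparse
import os


def parse_args_sketch(argv=None):
    """Parse the hyperparameters used in this notebook plus two SageMaker paths."""
    parser = argparse.ArgumentParser()
    # Hyperparameters, named as in the estimator's `hyperparameters` dictionary.
    parser.add_argument('--num-gpus', type=int, default=1)
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--optimizer', type=str, default='sgd')
    parser.add_argument('--batch-size', type=int, default=256)
    # Paths read from environment variables; the fallbacks are assumed
    # container paths so that the sketch also runs outside SageMaker.
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    parser.add_argument('--train', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train'))
    return parser.parse_args(argv)


# Simulate the command line built from the hyperparameters above.
args = parse_args_sketch(['--num-gpus', '1', '--epochs', '10',
                          '--optimizer', 'adam', '--batch-size', '256'])
print(args.num_gpus, args.epochs, args.optimizer, args.batch_size)  # 1 10 adam 256
```

Note that `argparse` turns the dashes in `--num-gpus` and `--batch-size` into underscores on the resulting namespace.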

## Fit MXNet Estimator

In [20]:
m.fit({'train': data_location})

2020-01-09 08:29:48 Starting - Starting the training job...
2020-01-09 08:29:50 Starting - Launching requested ML instances......
2020-01-09 08:30:54 Starting - Preparing the instances for training......
2020-01-09 08:32:11 Downloading - Downloading input data
2020-01-09 08:32:11 Training - Downloading the training image...
2020-01-09 08:32:32 Training - Training image download completed. Training in progress.
2020-01-09 08:32:33,492 sagemaker-containers INFO     Imported framework sagemaker_mxnet_container.training
2020-01-09 08:32:33,519 sagemaker_mxnet_container.training INFO     MXNet training environment: {'SM_HP_BATCH-SIZE': '256', 'SM_FRAMEWORK_PARAMS': '{}', 'SM_HP_EPOCHS': '10', 'SM_NETWORK_INTERFACE_NAME': 'eth0', 'SM_LOG_LEVEL': '20', 'SM_HOSTS': '["algo-1"]', 'SM_OUTPUT_INTERMEDIATE_DIR': '/opt/ml/output/intermediate', 'SM_CHANNELS': '["train"]', 'SM_OUTPUT_DATA_DIR': '/opt/ml/output/data', 'SM_INPUT_DIR': '/opt/ml/input', 'SM_FRAMEWORK_MODULE': 'sagemaker_mxne

2020-01-09 08:33:53,592 INFO __main__: Epoch[0] Loss:0.241 Acc:0.926|0.965 Perf: 6394.7 img/sec
2020-01-09 08:34:03,043 INFO __main__: Epoch[1] Loss:0.113 Acc:0.965|0.975 Perf: 6348.6 img/sec
2020-01-09 08:34:13,297 INFO __main__: Epoch[2] Loss:0.084 Acc:0.974|0.980 Perf: 5851.8 img/sec
2020-01-09 08:34:23,168 INFO __main__: Epoch[3] Loss:0.069 Acc:0.978|0.980 Perf: 6078.7 img/sec
2020-01-09 08:34:33,011 INFO __main__: Epoch[4] Loss:0.061 Acc:0.980|0.982 Perf: 6095.7 img/sec
2020-01-09 08:34:43,314 INFO __main__: Epoch[5] Loss:0.060 Acc:0.981|0.986 Perf: 5823.6 img/sec
2020-01-09 08:34:53,216 INFO __main__: Epoch[6] Loss:0.052 Acc:0.983|0.982 Perf: 6059.9 img/sec
2020-01-09 08:35:03,277 INFO __main__: Epoch[7] Loss:0.058 Acc:0.982|0.987 Perf: 5963.7 img/sec
2020-01-09 08:35:12,422 INFO __main__: Epoch[8] Loss:0.053 Acc:0.984|0.987 Perf: 6561.7 img/sec
2020-01-09 08:35:22,424 INFO __main__: Epoch[9] Lo

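The per-epoch lines in the training log follow a fixed pattern, so the numbers can be pulled out with a regular expression. The sketch below is one way to do that. The names `EPOCH_LINE` and `parse_epoch_line` are hypothetical, and because the notebook does not say what the two accuracy values separated by `|` mean, they are kept as an unlabeled pair.

```python
import re

# Pattern for the per-epoch lines logged by the training script.
EPOCH_LINE = re.compile(
    r"Epoch\[(?P<epoch>\d+)\] Loss:(?P<loss>[\d.]+) "
    r"Acc:(?P<acc_a>[\d.]+)\|(?P<acc_b>[\d.]+) Perf: (?P<perf>[\d.]+) img/sec"
)


def parse_epoch_line(line):
    """Return the metrics in one log line as a dict, or None if it does not match."""
    match = EPOCH_LINE.search(line)
    if match is None:
        return None
    return {
        "epoch": int(match.group("epoch")),
        "loss": float(match.group("loss")),
        "acc": (float(match.group("acc_a")), float(match.group("acc_b"))),
        "perf": float(match.group("perf")),
    }


line = "2020-01-09 08:33:53,592 INFO __main__: Epoch[0] Loss:0.241 Acc:0.926|0.965 Perf: 6394.7 img/sec"
parsed = parse_epoch_line(line)
print(parsed)
```

Lines that do not match, such as the job status messages, return `None` and can simply be skipped.
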
## Deploy Trained model to a predictor

In [21]:
predictor = m.deploy(initial_instance_count=1,
                     endpoint_name="mxnet-sagemaker-demo-endpoint",
                     instance_type='ml.m4.xlarge')

---------------------------------------------------------------------------------------------------!

In [None]:
predictor.delete_endpoint()