# Custom Models with PyTorch - Amazon SageMaker Workshop

<div class="alert alert-block alert-info">
    <b>Note:</b> You may be asked to choose a kernel for this notebook. Please use the following:    
    <ul><li>Standard Amazon SageMaker notebook instances - <i>conda_pytorch_p36</i></li>
        <li>Amazon SageMaker Studio notebooks - <i>Python 3 (PyTorch 1.6 Python 3.6 GPU Optimized)</i></li></ul>
</div>

Although the built-in algorithms provided by Amazon SageMaker are useful, sometimes it is necessary to write custom algorithms. In this workshop, you will learn how to write custom algorithms while taking advantage of all the features offered by Amazon SageMaker, including training jobs and hyperparameter tuning jobs. This workshop is just a primer meant to get you started; there are too many features to cover in a couple of hours. 

Amazon SageMaker supports many major deep learning frameworks, including TensorFlow, PyTorch, and Apache MXNet. In this workshop, you will build a custom model in PyTorch, but there are plenty of resources available for the other frameworks.

**Contents**
1. [Create a PyTorch script](#script)
1. [Transform script for Amazon SageMaker](#transform)
 1. [Add environment variables](#vars)
 1. [Enable hyperparameter tracking](#hyperparams)
 1. [Add logging](#logging)
1. [Running the script with Amazon SageMaker](#run)
 1. [Setup](#setup)
 1. [Training](#train)
 1. [Hyperparameter tuning](#tune)
 1. [Deploy](#deploy)
1. [Resources](#iwantmore)

<div class="alert alert-block alert-info">
    <b>Note:</b> This workshop assumes that you have a basic knowledge of using deep learning frameworks, like PyTorch, as well as a basic knowledge of Amazon SageMaker. Please consider completing a basic Amazon SageMaker workshop before commencing with this workshop.
</div>

## Create a PyTorch script <a class="anchor" id="script"></a>

Let's get started by grabbing one of the example scripts from the official PyTorch repository. We will use the [MNIST example](https://github.com/pytorch/examples/tree/master/mnist) to keep things simple. If you are not familiar with the [MNIST dataset](https://en.wikipedia.org/wiki/MNIST_database), it is a collection of images of handwritten digits (0 - 9) popular for deep learning tutorials. It could be seen as the "Hello, World" of deep learning.

<img src="img/mnist.png" alt="MNIST data" width="300"/>

We will start with the original PyTorch script and modify it to work with Amazon SageMaker. First, take a look at the original script. A copy of it has already been placed in the `scripts` folder contained in this workshop.

In [None]:
!pygmentize scripts/original_pytorch_mnist.py

The script constructs a simple convolutional neural network, provides training and testing code, defines the hyperparameters as arguments, and loads the MNIST dataset.

Let's run the script as-is to see if it works. The code below will run the PyTorch script on the instance running this notebook, which should be an `ml.t3.medium` (best practice is to run notebooks on small instances and only use powerful GPU instances for training). Since this instance does not have a GPU, the script will take a long time to run. We only want to see if it works, so set the number of epochs to 1 to reduce the time it takes.

<div class="alert alert-block alert-info">
    <b>Note:</b> While you wait for the script to complete training, please read on and complete the next steps.
</div>

In [None]:
!python scripts/original_pytorch_mnist.py --epochs=1

## Transform script for Amazon SageMaker <a class="anchor" id="transform"></a>

Now, you are going to make changes to this script to make it compatible with Amazon SageMaker. Navigate to the `scripts` folder and duplicate `original_pytorch_mnist.py`, then rename the duplicate to `sagemaker_pytorch_mnist.py`. Open this file and make the changes described in the following sections.

### Add environment variables <a class="anchor" id="vars"></a>

First, you need to add some additional arguments to the script. By doing so, your script has access to important environment variables from the Amazon SageMaker container. In this case, the only variables you need to add are:

* `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to. These artifacts are uploaded to S3 for model hosting.
* `SM_CHANNEL_TRAINING`: A string that represents the path to the directory that contains the input data for the training channel.
* `SM_NUM_GPUS`: An integer representing the number of GPUs available to the host.

However, there are many variables not covered in this workshop, which can be useful when building custom models. For more information, see the [SageMaker Containers GitHub](https://github.com/aws/sagemaker-training-toolkit#read-additional-information-using-environment-variables) and the [Amazon SageMaker Python SDK Documentation](https://sagemaker.readthedocs.io/en/stable/index.html).

Start by adding the code below in the `main()` method of `sagemaker_pytorch_mnist.py`. Place it below the original argument additions, but before `args = parser.parse_args()`.

```
# SageMaker Container environment
parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
parser.add_argument('--num-gpus', type=int, default=os.environ['SM_NUM_GPUS'])
```
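Note that `os.environ['SM_MODEL_DIR']` raises a `KeyError` when the script runs outside a SageMaker container, for example on this notebook instance. If you also want to keep running the script locally, one option is to fall back to a default value. The following is a minimal sketch of that idea, not a required change; the `build_parser` helper and the fallback values (`./model`, `../data`, `0`) are assumptions made for illustration.

```python
import argparse
import os


def build_parser():
    """Build a parser that reads SageMaker variables when they are set."""
    parser = argparse.ArgumentParser()
    # os.environ.get returns the fallback when the variable is missing,
    # so the script does not crash outside a SageMaker container.
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', './model'))
    parser.add_argument('--data-dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAINING', '../data'))
    # argparse applies `type` to string defaults, so '0' becomes the int 0.
    parser.add_argument('--num-gpus', type=int,
                        default=os.environ.get('SM_NUM_GPUS', '0'))
    return parser


args = build_parser().parse_args([])
print(args.model_dir, args.data_dir, args.num_gpus)
```

Inside a training job the environment variables are always present, so the fallbacks only matter for local runs.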

Now that your script has access to the environment variables, you need to edit the script to use them, starting with the number of GPUs available to the host. Remove the following argument from the script, as it won't be needed anymore:

```
parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
```

Now replace

```
use_cuda = not args.no_cuda and torch.cuda.is_available()
```

with

```
use_cuda = args.num_gpus > 0
```

to make use of the new environment variable.

Next, you need to make sure the script fetches the data from the data directory specified in our SageMaker environment, instead of downloading it to the instance running this notebook. 

In this section of the code

```
    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.batch_size, shuffle=True, **kwargs)
    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=False, transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.test_batch_size, shuffle=True, **kwargs)
```

replace the two occurrences of `'../data'` with `args.data_dir`.

Similarly, you want to change the script so the model is saved in the model directory specified in our SageMaker environment, instead of saving it to the instance running this notebook.

Replace

```
    if args.save_model:
        torch.save(model.state_dict(), "mnist_cnn.pt")
```

with

```
    if args.save_model:
        model_path = os.path.join(args.model_dir, "mnist_cnn.pt")
        torch.save(model.state_dict(), model_path)
```

and don't forget to add `import os` to the top of the script.
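The model directory matters because, once training finishes, SageMaker archives whatever it finds there into `model.tar.gz` and uploads that archive to S3. The sketch below imitates this step locally so you can see what ends up in the archive. It is only an illustration: the `pack_model_dir` helper is hypothetical, and a placeholder file stands in for the output of `torch.save`.

```python
import os
import tarfile
import tempfile


def pack_model_dir(model_dir, archive_path):
    """Archive the files in model_dir and return the archived names."""
    with tarfile.open(archive_path, 'w:gz') as tar:
        for name in sorted(os.listdir(model_dir)):
            # arcname keeps the file at the top level of the archive.
            tar.add(os.path.join(model_dir, name), arcname=name)
    with tarfile.open(archive_path, 'r:gz') as tar:
        return tar.getnames()


with tempfile.TemporaryDirectory() as workdir:
    model_dir = os.path.join(workdir, 'model')
    os.makedirs(model_dir, exist_ok=True)

    # Same path construction as in the modified training script.
    model_path = os.path.join(model_dir, 'mnist_cnn.pt')
    with open(model_path, 'wb') as f:
        f.write(b'placeholder for the bytes written by torch.save')

    names = pack_model_dir(model_dir, os.path.join(workdir, 'model.tar.gz'))
    print(names)
```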

### Enable hyperparameter tracking <a class="anchor" id="hyperparams"></a>

This part is super easy, because you don't have to make any changes! You want SageMaker to know which hyperparameters are used by the script, so you can define these in the training jobs and hyperparameter tuning jobs. To do this, you must add the hyperparameters as arguments to `ArgumentParser` in the `main()` method. Fortunately, this was already done in the original PyTorch script, hence there is nothing additional to be done for SageMaker to track these hyperparameters.
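To see why no changes are needed, it helps to picture how the values reach the script. SageMaker passes each hyperparameter to the entry point as a command-line argument of the form `--name value`, which is exactly what `ArgumentParser` expects. The sketch below imitates that hand-off locally. The `to_cli_args` helper is hypothetical and written only for this illustration, and the parser declares just three of the script's arguments with illustrative defaults.

```python
import argparse


def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into ['--name', 'value', ...]."""
    argv = []
    for name, value in hyperparameters.items():
        argv.extend(['--' + name, str(value)])
    return argv


parser = argparse.ArgumentParser()
parser.add_argument('--batch-size', type=int, default=64)
parser.add_argument('--epochs', type=int, default=14)
parser.add_argument('--lr', type=float, default=1.0)

# Every value arrives as a string; argparse converts it using `type`.
argv = to_cli_args({'batch-size': 128, 'epochs': 2, 'lr': 0.5})
args = parser.parse_args(argv)
print(argv)
print(args.batch_size, args.epochs, args.lr)
```

A hyperparameter name that is not declared in the parser would make `parse_args()` fail, which is why the names have to line up.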

### Add logging <a class="anchor" id="logging"></a>

SageMaker writes the logs of its training and hyperparameter tuning jobs to Amazon CloudWatch. Fortunately, this also does not require any changes to our original script. All of the `print()` statements in the original script, along with some additional startup information, will be automatically read by SageMaker and printed in the logs. Similarly, you can use the Python `logging` module instead of `print()` statements - both will work in the same way.
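If you prefer the `logging` module, a minimal sketch could look like the one below. It assumes you want the log lines to keep the same layout as the original `print()` output, so that any regular expressions written against that output keep working. The helper names `get_logger` and `format_train_line` are illustrative and do not appear in the workshop scripts.

```python
import logging
import sys


def get_logger(name='mnist', stream=sys.stdout):
    """Return a logger that writes bare messages to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate lines when re-running a cell
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter('%(message)s'))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def format_train_line(epoch, seen, total, loss):
    """Format a progress line in the same layout as the original script."""
    return 'Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
        epoch, seen, total, 100.0 * seen / total, loss)


logger = get_logger()
logger.info(format_train_line(1, 57600, 60000, 0.161597))
```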

That's it! This is all you need to run your custom PyTorch algorithms using Amazon SageMaker.

## Running the script with Amazon SageMaker <a class="anchor" id="run"></a>

Assuming you made all the changes correctly, we should now be able to use `sagemaker_pytorch_mnist.py` to run the full training on a GPU instance using a SageMaker training job. In this workshop, we store the script on the instance running this notebook, but SageMaker can also fetch these scripts from code repositories, where they would normally be stored in production.

### Setup <a class="anchor" id="setup"></a>

First, we need to set up our execution role, session, and S3 bucket. 

<div class="alert alert-block alert-info">
    <b>Note:</b> 
    If you used the CloudFormation template to create the resources for this workshop in your account, or if you are running this notebook as part of an AWS-hosted workshop, an S3 bucket has already been created in your account. Identify the right S3 bucket and copy/paste this name in the code below.

If you do not have an existing S3 bucket in your account for this workshop, use `bucket = sagemaker_session.default_bucket()` to have SageMaker create a bucket for you.
</div>

In [None]:
import sagemaker
import boto3

# Get the role associated with this SageMaker notebook.
role = sagemaker.get_execution_role()
print("Role name: {}".format(role))

# Start a session
sagemaker_session = sagemaker.Session()

# Specify an S3 bucket for storing the training data.
# !ACTION REQUIRED! Replace <TODO> with the name of the S3 bucket created by the CloudFormation template.
# If no S3 bucket has been created, use bucket = sagemaker_session.default_bucket()
bucket = "<TODO>"
print("Bucket name: {}".format(bucket))

# Set a prefix for storing your data - this will look like a folder in the S3 bucket.
prefix = 'sagemaker-workshop-pytorch'

Normally, the next step would be to download the dataset. However, you already downloaded the dataset when you ran the original PyTorch MNIST script at the start of this workshop. If you check the root folder of this workshop, you should see a `data` folder with the MNIST data. So all we need to do is upload that to the S3 bucket by running the command below.

In [None]:
inputs = sagemaker_session.upload_data(path='../data', bucket=bucket, key_prefix=prefix)
print('Data uploaded to: {}'.format(inputs))

### Training <a class="anchor" id="train"></a>

Now we can call the SageMaker PyTorch estimator to start a training job. This should look familiar to you if you have used the SageMaker built-in algorithms before. It has similar parameters for specifying the instance type, instance count, role, and job name.

However, it also has parameters that may be new to you. Use `entry_point` to tell SageMaker where to find your custom PyTorch script and use `framework_version` to specify the version of the PyTorch container to use.

The hyperparameter names must exactly match the names used in `ArgumentParser` in your script. Notice that we use a string instead of a boolean value for `save-model`, because SageMaker passes every hyperparameter to the script as a string command-line argument.
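Because the value for `save-model` arrives as the text `True`, the script has to turn it back into a boolean. Whether that works depends on how `--save-model` is declared in your copy of the script: a flag declared with `action='store_true'` takes no value, so `argparse` would likely reject the extra `True` token. One common workaround, shown here as a sketch and not as the workshop's official solution, is a small converter function. The name `str2bool` is hypothetical.

```python
import argparse


def str2bool(value):
    """Convert strings such as 'True' or 'false' into a bool for argparse."""
    if isinstance(value, bool):
        return value
    lowered = value.strip().lower()
    if lowered in ('true', '1', 'yes'):
        return True
    if lowered in ('false', '0', 'no'):
        return False
    raise argparse.ArgumentTypeError(
        'Expected a boolean value, got {!r}'.format(value))


parser = argparse.ArgumentParser()
# The flag now takes a value, so '--save-model True' parses cleanly.
parser.add_argument('--save-model', type=str2bool, default=False)

args = parser.parse_args(['--save-model', 'True'])
print(args.save_model)
```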

The final part of the process is to define the metrics which you want SageMaker to track. Take a look at the output produced by the original script when you ran it at the start of this workshop. It should look similar to:

```
...
Train Epoch: 1 [57600/60000 (96%)]	Loss: 0.161597
Train Epoch: 1 [58240/60000 (97%)]	Loss: 0.040758
Train Epoch: 1 [58880/60000 (98%)]	Loss: 0.080115
Train Epoch: 1 [59520/60000 (99%)]	Loss: 0.104489

Test set: Average loss: 0.0602, Accuracy: 9799/10000 (98%)
```

We can ask SageMaker to track specific values from the output by providing regular expressions which extract the values of interest. In this case, you'll want to track the training loss, training accuracy, validation loss, and validation accuracy. In the code below, we have provided the regular expression for extracting the training loss, but the other three have been left blank for you to define. We recommend using a [regex tool](https://pythex.org/) to test your regular expressions.
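You can also check a regular expression locally before starting a job. The sketch below runs the provided training loss pattern against a few lines copied from the sample output above, using Python's `re` module. It only tests the pattern and does not call SageMaker; the `extract_metric` helper is a hypothetical name, and the remaining three patterns are still yours to write.

```python
import re

# Lines copied from the sample script output.
sample_log = (
    "Train Epoch: 1 [57600/60000 (96%)]\tLoss: 0.161597\n"
    "Train Epoch: 1 [58240/60000 (97%)]\tLoss: 0.040758\n"
    "\n"
    "Test set: Average loss: 0.0602, Accuracy: 9799/10000 (98%)\n"
)


def extract_metric(regex, log_text):
    """Return every value captured by the pattern's group as a float."""
    return [float(value) for value in re.findall(regex, log_text)]


train_loss = extract_metric(r'Train Epoch: .+Loss: (.+)', sample_log)
print(train_loss)
```

If `float()` raises an error for one of your own patterns, the capture group is matching more than the number, which is a sign that the pattern needs tightening.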

In [None]:
from sagemaker.pytorch import PyTorch

# !ACTION REQUIRED! In the code below, you need to replace the <TODO>'s!

estimator = PyTorch(
    entry_point='scripts/sagemaker_pytorch_mnist.py',
    base_job_name='training-pytorch-mnist',
    role=role,
    framework_version='1.2.0',
    py_version='py3',
    instance_count=1,
    instance_type='ml.p2.xlarge',
    hyperparameters={
        'batch-size': 64,
        'test-batch-size': 1000,
        'epochs': 10,
        'lr': 1.0,
        'gamma': 0.7,
        'seed': 1,
        'save-model': 'True'
    },
    metric_definitions=[
        {'Name': 'train:loss', 'Regex': 'Train Epoch: .+Loss: (.+)'},
        {'Name': 'train:accuracy', 'Regex': '<TODO>'},
        {'Name': 'val:loss', 'Regex': '<TODO>'},
        {'Name': 'val:accuracy', 'Regex': '<TODO>'}
    ]
)

When you're happy with the hyperparameter values and the metric definitions, go ahead and run the training job!

<div class="alert alert-block alert-warning">
    <b>Encounter an error?</b> If you encounter an error related to instance types, this likely means that you do not have access to p2 instances in your AWS account. New AWS accounts have limits on certain resource types to prevent abuse. If you are using your own account, request a limit increase through the <a href="https://console.aws.amazon.com/support/home#/">AWS Support Center</a>. If this is not your own account, simply run the training on an 'ml.c4.xlarge' instance (and reduce the number of epochs).
</div>

In [None]:
estimator.fit({'training': inputs}, wait=False)

Once the training job is running, navigate to SageMaker in the AWS console to see your training job. It will likely show the status 'InProgress'. Feel free to continue with this workshop and check back later to view the results.

<img src="img/training_job.PNG" alt="Training job in the console" width="800"/>

If you look at the training job more closely, you'll see that the console also lists the values of the hyperparameters, which is great if you need to look up this information at a later time.

<img src="img/training_job_parameters.PNG" alt="Training job hyperparameters" width="800"/>

Once the training job has finished, SageMaker will also display graphs of the metrics we asked it to track.

<img src="img/training_job_metrics.PNG" alt="Training job metrics" width="800"/>

### Hyperparameter tuning <a class="anchor" id="tune"></a>

Similar to running training jobs with custom PyTorch algorithms, you can run hyperparameter tuning jobs. The code below shows you how. Most of this code should look familiar to you if you have used built-in algorithms before. First, you define an estimator, same as for a training job. Then you specify which hyperparameters to tune and their search range. Finally, you define a tuner, with a strategy, an objective, and job settings. 

Don't forget to add the `metric_definitions` you defined before for the training job. In this example, we choose to find the best values for the learning rate (`lr`) and `gamma`. We also tell SageMaker to tune based on the `val:accuracy` metric which you have defined.
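The two ranges in this workshop use different scaling types: `lr` uses `Logarithmic` scaling, while `gamma` uses `Auto`, which lets SageMaker choose. To get a feel for what logarithmic scaling means, the sketch below maps evenly spaced positions between 0 and 1 onto the `lr` range of 0.001 to 1.0, once linearly and once on a log scale. This is only an illustration of the two scales, not how the Bayesian strategy actually picks candidates, and the function names are made up for this example.

```python
import math


def linear_scale(u, low, high):
    """Map u in [0, 1] onto [low, high] with even spacing."""
    return low + u * (high - low)


def log_scale(u, low, high):
    """Map u in [0, 1] onto [low, high] with even spacing in log space."""
    if low <= 0:
        raise ValueError('Logarithmic scaling needs a range above zero')
    return math.exp(math.log(low) + u * (math.log(high) - math.log(low)))


for u in (0.0, 0.25, 0.5, 0.75, 1.0):
    print('u={:.2f}  linear={:.4f}  log={:.4f}'.format(
        u, linear_scale(u, 0.001, 1.0), log_scale(u, 0.001, 1.0)))
```

On the log scale, half of the positions fall below roughly 0.03, so small learning rates get far more attention than they would on a linear scale.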

In [None]:
from sagemaker.pytorch import PyTorch
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, CategoricalParameter, ContinuousParameter

# !ACTION REQUIRED! In the code below, you need to replace the <TODO>'s!

estimator = PyTorch(
    entry_point='scripts/sagemaker_pytorch_mnist.py',
    base_job_name='training-pytorch-mnist',
    role=role,
    framework_version='1.2.0',
    py_version='py3',
    instance_count=1,
    instance_type='ml.p2.xlarge',
    hyperparameters={
        'batch-size': 64,
        'test-batch-size': 1000,
        'epochs': 5,
        'seed': 1,
        'save-model': 'True'
    },
    metric_definitions=[
        {'Name': 'train:loss', 'Regex': 'Train Epoch: .+Loss: (.+)'},
        {'Name': 'train:accuracy', 'Regex': '<TODO>'},
        {'Name': 'val:loss', 'Regex': '<TODO>'},
        {'Name': 'val:accuracy', 'Regex': '<TODO>'}
    ]
)

hyperparameter_ranges = {
    'lr': ContinuousParameter(0.001, 1.0, scaling_type='Logarithmic'),
    'gamma': ContinuousParameter(0.01, 0.9, scaling_type='Auto')
}

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name='val:accuracy',
    hyperparameter_ranges=hyperparameter_ranges,
    metric_definitions=[
        {'Name': 'train:loss', 'Regex': 'Train Epoch: .+Loss: (.+)'},
        {'Name': 'train:accuracy', 'Regex': '<TODO>'},
        {'Name': 'val:loss', 'Regex': '<TODO>'},
        {'Name': 'val:accuracy', 'Regex': '<TODO>'}
    ],
    strategy='Bayesian',
    objective_type='Maximize',
    max_jobs=4,
    max_parallel_jobs=2,
    base_tuning_job_name='tuning-pytorch-mnist'
)

tuner.fit(inputs=inputs)

Once the hyperparameter tuning job is running, navigate to SageMaker in the AWS console to see it. It will likely show the status 'InProgress'. Feel free to continue with this workshop and check back later to view the results.

<div class="alert alert-block alert-warning">
    <b>Encounter an error?</b> If you encounter an error related to instance types, this likely means that you do not have access to p2 instances in your AWS account. New AWS accounts have limits on certain resource types to prevent abuse. If you are using your own account, request a limit increase through the <a href="https://console.aws.amazon.com/support/home#/">AWS Support Center</a>. If this is not your own account, simply run the training on an 'ml.c4.xlarge' instance (and reduce the number of epochs).
</div> 

### Deploy <a class="anchor" id="deploy"></a>

Finally, let's take a look at how to deploy a custom PyTorch model to an Amazon SageMaker endpoint. The code required to deploy the model could be included in `sagemaker_pytorch_mnist.py`, but in this case we will store it in a separate file called `serve_pytorch_mnist.py` which has already been placed in the `scripts` folder of this workshop. Take a look at the script first.

In [None]:
!pygmentize scripts/serve_pytorch_mnist.py

Notice that the script includes the same convolutional neural network definition as the training script. However, the key method to understand is the `model_fn()` method. Amazon SageMaker model serving defines four methods which you can use to customize the behavior of the deployed model:

* `model_fn`: Takes a model directory, loads the model artifacts, and returns the model object.
* `input_fn` (*optional*): Takes request data and deserializes the data into an object for prediction.
* `predict_fn` (*optional*): Takes the deserialized request object and performs inference against the loaded model.
* `output_fn` (*optional*): Takes the result of prediction and serializes this according to the response content type.

These methods can be completely customized based on your needs. The `input_fn`, `predict_fn`, and `output_fn` methods have default implementations in the SageMaker PyTorch model server. You only need to include these methods if you want to modify the default implementation. However, the `model_fn` method must always be defined.
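To make the flow between the four methods concrete, the sketch below chains them together for a single request. It is a toy stand-in and not the workshop's serving script: the "model" is a plain Python function that returns the index of the largest input value instead of a PyTorch network, and the JSON request and response formats are assumptions chosen to keep the example self-contained.

```python
import json


def model_fn(model_dir):
    """Load the model. A real script would rebuild the network and load
    the weights stored in model_dir; here we return a toy callable."""
    return lambda values: max(range(len(values)), key=values.__getitem__)


def input_fn(request_body, request_content_type):
    """Deserialize the request body into an object for prediction."""
    if request_content_type != 'application/json':
        raise ValueError(
            'Unsupported content type: {}'.format(request_content_type))
    return json.loads(request_body)


def predict_fn(input_data, model):
    """Run inference against the loaded model."""
    return model(input_data)


def output_fn(prediction, accept):
    """Serialize the prediction for the response."""
    return json.dumps({'prediction': prediction})


# The order in which the methods are called for one request.
model = model_fn('/opt/ml/model')
body = json.dumps([0.1, 0.7, 0.2])
data = input_fn(body, 'application/json')
response = output_fn(predict_fn(data, model), 'application/json')
print(response)
```

The model is loaded once by `model_fn`, while the other three methods run for every request.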

When you're happy with the serving script, you can use the code below to set up a SageMaker endpoint. Hopefully, at this point, at least one of your training jobs has finished. You'll need to find the S3 URI of the model artifact for the model you want to deploy. It should look like: `s3://<bucket-name>/.../model.tar.gz`. This URI can be found through the console or through code in this notebook - it is up to you to pick a method, figure out how to find the URI, and insert it in the code below.
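Whichever way you find the URI, a quick format check before deploying can catch copy/paste mistakes early. The sketch below is one possible check using only the standard library. The `check_model_uri` helper is hypothetical, and the URI passed to it is a made-up example, not a real artifact location.

```python
from urllib.parse import urlparse


def check_model_uri(uri):
    """Validate the shape of a model artifact URI and split it up."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError(
            'Expected an S3 URI such as s3://<bucket-name>/.../model.tar.gz')
    if not parsed.path.endswith('/model.tar.gz'):
        raise ValueError(
            'URI should point at a model.tar.gz file: {}'.format(uri))
    # netloc is the bucket name; the path without its leading slash is the key.
    return parsed.netloc, parsed.path.lstrip('/')


# Made-up example URI, for illustration only.
bucket_name, key = check_model_uri(
    's3://example-bucket/example-prefix/output/model.tar.gz')
print(bucket_name, key)
```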

In [None]:
from sagemaker.pytorch import PyTorchModel

# !ACTION REQUIRED! In the code below, you need to replace the <TODO>'s!

model = PyTorchModel(
    model_data='<TODO>',
    role=role,
    framework_version='1.2.0',
    py_version='py3',
    entry_point='scripts/serve_pytorch_mnist.py',
)

predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

This workshop will not go through the process of running test data through the endpoint, because this is covered in all basic workshops. However, feel free to write this code yourself as a challenge.

Don't forget to delete the endpoint before you finish.

In [None]:
predictor.delete_endpoint()

## Resources <a class="anchor" id="iwantmore"></a>

If you are interested in learning more about the advanced features of Amazon SageMaker, below are some recommended resources for development and further learning.

**Further Learning**
* [Distributed MNIST with PyTorch on Amazon SageMaker](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/pytorch_mnist/mnist.py) - a more advanced version of the code in this workshop, making use of distributed computing.
* [Amazon SageMaker Immersion Day Workshop](https://sagemaker-immersionday.workshop.aws/) - a set of workshops from the Amazon SageMaker Immersion Day.

**Useful Resources**
* [Amazon SageMaker Official Documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html)
* [AWS SageMaker Blogs](https://aws.amazon.com/blogs/?filtered-posts.q=sagemaker&filtered-posts.q_operator=AND)
* [Amazon SageMaker Python SDK Documentation](https://sagemaker.readthedocs.io/en/stable/index.html)
* [Using PyTorch with the Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/using_pytorch.html)
* [AWS Samples GitHub](https://github.com/aws-samples)
* [Amazon SageMaker Examples GitHub](https://github.com/awslabs/amazon-sagemaker-examples)