# How to train a Segmentation model on AWS SageMaker

## Contents

1. [Create an AWS SageMaker Notebook](#1)
2. [Create an AWS S3 Bucket](#2)
3. [Setup](#3)
4. [Data](#4)
5. [Train the model](#5)
6. [Host](#6)
7. [Clean up](#7)
8. [Additional resources](#8)

### 1. Create an AWS SageMaker Notebook Instance

You first need an AWS account. For further explanations, refer to [Create a SageMaker Notebook](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-setup-working-env.html).

During the creation of the notebook, you will need to:
 - Have an IAM role that allows the notebook to interact with the S3 bucket.
 - Choose an instance type for your notebook. Since the model in this guide requires GPU instances for training, you need to choose a GPU instance. This guide uses *ml.p3.2xlarge*, the smallest instance of the P3 family. Refer to [this page](https://aws.amazon.com/fr/sagemaker/pricing/instance-types) to see the instances available in your region.

### 2. Create an AWS S3 Bucket

To train your model, you need a place to store your datasets, and the training process will produce a model (*model.pth*) and other outputs that also need to be stored. For that purpose, you can create an S3 bucket by following the steps on this page: [Create an S3 Bucket](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-bucket.html).

For example, assume that you create an S3 bucket named *sagemaker-inspi-data* with a folder *data/*, where you store your training data in a *train/* folder and your validation data in a *val/* folder. Both folders have two subfolders, *image* and *mask*, that store the images and their labels.
You can also create a folder in the S3 bucket to store the results of the training process. Let's call it *output/*.
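
Before uploading, it can help to check that a local copy of the data follows this layout. The sketch below is a minimal check using only the standard library. It assumes that an image and its mask share the same file name stem, which is an assumption made for illustration and is not stated by this guide; `find_unpaired` is a hypothetical helper name.

```python
from pathlib import Path


def find_unpaired(split_dir):
    """Return the sorted stems found in only one of image/ and mask/.

    Assumption (hypothetical): an image and its mask share the same
    file name stem, e.g. image/0001.jpg and mask/0001.png.
    """
    split = Path(split_dir)
    images = {p.stem for p in (split / "image").iterdir() if p.is_file()}
    masks = {p.stem for p in (split / "mask").iterdir() if p.is_file()}
    # Symmetric difference: stems that are missing on either side.
    return sorted(images ^ masks)
```

Calling `find_unpaired('data/train')` returns an empty list when every image has a mask and every mask has an image.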

### 3. Setup


We first need to create a SageMaker session and specify the IAM role that will be used. This is the role that allows the notebook to access your code and your data stored in S3. If the role used to create the SageMaker notebook is different from the one used for S3, replace ```sagemaker.get_execution_role()``` with the appropriate role.

In [1]:
import sagemaker

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

### 4. Data

Next, you need the S3 paths of the input data (training and validation) in order to pass them to the training job.

In [2]:
training_dir = 's3://sagemaker-inspi-data/data/train'
testing_dir = 's3://sagemaker-inspi-data/data/val'  # validation data is stored in the val/ folder

If you skipped **Step 2** and prefer to use the default bucket, uncomment the following cell.
It will use the default bucket and upload your data to a *data/* folder under the given prefix.

In [None]:
# bucket = sagemaker_session.default_bucket()
# prefix = 'sagemaker-inspi-data/pytorch'

# Download the data and save it in the folder 'data'

# Upload the data to the S3 bucket
# training_dir = sagemaker_session.upload_data(path='data/train', bucket=bucket, key_prefix=prefix+'/data/train')
# testing_dir = sagemaker_session.upload_data(path='data/val', bucket=bucket, key_prefix=prefix+'/data/val')
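
Key prefixes are plain strings, so a missing `/` silently produces a key such as `sagemaker-inspi-data/pytorchdata/train`. The following is a minimal sketch of a safer way to build them; `s3_key` is a hypothetical helper written for this example and is not part of the SageMaker SDK.

```python
def s3_key(*parts):
    """Join key parts with exactly one '/' between them (hypothetical helper)."""
    cleaned = [p.strip("/") for p in parts if p.strip("/")]
    return "/".join(cleaned)


prefix = "sagemaker-inspi-data/pytorch"
train_prefix = s3_key(prefix, "data/train")
val_prefix = s3_key(prefix, "/data/val")
print(train_prefix)
print(val_prefix)
```

The resulting strings can then be passed as `key_prefix` to `upload_data`.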

The ```input_data``` variable holds the paths of your training and validation data.

In [None]:
input_data = {'training':training_dir, 'testing':testing_dir}
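
Each key of this dictionary becomes a channel. Inside the training container, SageMaker exposes a channel named `training` through the environment variable `SM_CHANNEL_TRAINING`, which points to `/opt/ml/input/data/training`. The sketch below only illustrates this naming rule; `channel_info` is a hypothetical helper, not an SDK function.

```python
def channel_info(channel_name):
    """Return the env var name and container path SageMaker uses for a channel."""
    env_var = "SM_CHANNEL_" + channel_name.upper()
    container_path = "/opt/ml/input/data/" + channel_name
    return env_var, container_path


# The two channel names used in this guide.
for name in ("training", "testing"):
    print(name, *channel_info(name))
```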

### 5. Train the model

### 5-1. Training file

To train the model, you will need to run the training file ```train.py```. This script provides all the code we need for training and hosting a SageMaker model (model_fn function to load a model). The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as:

   - SM_MODEL_DIR: A string representing the path to the directory to write model artifacts to. These artifacts are uploaded to S3 for model hosting.
   - SM_NUM_GPUS: The number of GPUs available in the current container.
   - SM_CURRENT_HOST: The name of the current container on the container network.
   - SM_HOSTS: JSON-encoded list containing all the hosts.
   - SM_CHANNEL_TRAINING: A string representing the path to the directory that contains training datasets.
   - SM_CHANNEL_TESTING: A string representing the path to the directory that contains testing datasets.
   
For more information about SageMaker environment variables, refer to [SageMaker Containers](https://github.com/aws/sagemaker-containers).
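
A training script typically reads these variables as defaults for its command-line arguments, so that the same script also runs outside of SageMaker. The following is a minimal sketch of that pattern and not the actual content of `train.py`; the argument names and the local fallback values are assumptions made for illustration.

```python
import argparse
import json
import os


def parse_sagemaker_env(argv=None):
    """Read SageMaker environment variables, with fallbacks for local runs."""
    env = os.environ
    parser = argparse.ArgumentParser()
    # Fallback values below are hypothetical and only used outside SageMaker.
    parser.add_argument("--model-dir", default=env.get("SM_MODEL_DIR", "model"))
    parser.add_argument("--train-dir", default=env.get("SM_CHANNEL_TRAINING", "data/train"))
    parser.add_argument("--test-dir", default=env.get("SM_CHANNEL_TESTING", "data/val"))
    parser.add_argument("--num-gpus", type=int, default=int(env.get("SM_NUM_GPUS", "0")))
    # SM_HOSTS is a JSON-encoded list, so it is decoded with json.loads.
    parser.add_argument("--hosts", type=json.loads,
                        default=json.loads(env.get("SM_HOSTS", '["localhost"]')))
    parser.add_argument("--current-host", default=env.get("SM_CURRENT_HOST", "localhost"))
    return parser.parse_args(argv)
```

Calling `parse_sagemaker_env([])` ignores the command line and returns only the values taken from the environment.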

### 5-2. Run the training job

The PyTorch class allows us to run our training function as a training job on SageMaker infrastructure. We need to configure it with our training script, an IAM role, the number of training instances, the training instance type, and hyperparameters. 

This code was written with **PyTorch 1.6.0**, while the latest version of PyTorch available in SageMaker at the time of writing is **1.5.0**; this is not a problem. You need to specify the version explicitly, otherwise the default version of PyTorch, **0.4.1**, will be used, which can cause many errors.

If you want to run distributed training, you will need to specify more than one ML instance. As mentioned before, our code requires a GPU instance, so the instances used will be *ml.p3.2xlarge*.
If you choose a CPU instance, or a GPU instance without enough memory for your data, you will run out of memory. In that case, you will need to reduce your batch size.
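
For the distributed case, each instance has to know how many hosts take part in the job and which rank it holds. A common convention, assumed here and not necessarily the one used in `train.py`, is to derive both from `SM_HOSTS` and `SM_CURRENT_HOST`. The helper name `distributed_info` is hypothetical.

```python
import json
import os


def distributed_info(environ=os.environ):
    """Derive (world_size, rank) from the SageMaker host variables."""
    hosts = json.loads(environ.get("SM_HOSTS", '["localhost"]'))
    current = environ.get("SM_CURRENT_HOST", hosts[0])
    # Sorting gives every instance the same ordering, so ranks are consistent.
    hosts = sorted(hosts)
    return len(hosts), hosts.index(current)
```

The two values can then be given to `torch.distributed.init_process_group` as `world_size` and `rank`, together with the chosen backend.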

If your training script is in a folder, you will also need to specify it with ```source_dir```.
If you use libraries or packages that are not installed by default in the PyTorch containers, you must include a *requirements.txt* file that lists all the dependencies. They will be installed automatically when the training job runs.

The ```hyperparameters``` parameter is a dict of values that will be passed to your training script. Our training script takes many hyperparameters; you can see how to access these values by running ```python3 train.py --help``` or by opening the file.
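
SageMaker passes each entry of this dict to the script as a `--key value` command-line argument. The sketch below mimics that conversion locally and parses the result with `argparse`. The argument names mirror the example hyperparameters used in this guide, while the defaults are placeholders and not the real defaults of `train.py`; `to_cli_args` and `build_parser` are hypothetical helper names.

```python
import argparse


def to_cli_args(hyperparameters):
    """Flatten a hyperparameter dict into ['--key', 'value', ...]."""
    args = []
    for key, value in hyperparameters.items():
        args += ["--" + key, str(value)]
    return args


def build_parser():
    # Defaults are placeholders for illustration only.
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--backend", type=str, default="gloo")
    parser.add_argument("--batch-size", type=int, default=4)
    parser.add_argument("--workers", type=int, default=0)
    parser.add_argument("--train-size", type=int, default=225)
    return parser


hyper_param = {"epochs": 100, "backend": "nccl", "batch-size": 10}
args = build_parser().parse_args(to_cli_args(hyper_param))
print(args.epochs, args.backend, args.batch_size)
```

Note that `argparse` turns a dashed name such as `--batch-size` into the attribute `args.batch_size`.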

In [13]:
pytorch_version = '1.5.0'  # PyTorch version used for training
trainfile = 'train.py'   # file containing the training script
nb_ml_instances = 1   # number of instances used for training
type_ml_instance = 'ml.p3.2xlarge'  # type of instances used for training
# type_ml_instance = 'local'  # alternative: run the job in local mode
hyper_param = {'epochs': 100, 
               'backend': 'nccl',
               'batch-size' : 10,
               'workers':2,
               'train-size': 225
              }
output_path = 's3://sagemaker-inspi-data/output'

You can also use **TensorBoard** to visualize some information about your training. It will produce an event file that you can visualize with TensorBoard.

In [14]:
from sagemaker.debugger import TensorBoardOutputConfig

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path='s3://sagemaker-inspi-data/output/data/emission'
)

To run the training job, you now need to create a PyTorch object and pass it all the parameters.

In [15]:
from sagemaker.pytorch import PyTorch

# Note: in version 2 of the SageMaker Python SDK, train_instance_count and
# train_instance_type were renamed to instance_count and instance_type.
estimator = PyTorch(entry_point=trainfile,
                    role=role,
                    source_dir='code',
                    output_path=output_path,
                    framework_version=pytorch_version,
                    train_instance_count=nb_ml_instances,
                    train_instance_type=type_ml_instance,
                    tensorboard_output_config=tensorboard_output_config,
                    hyperparameters=hyper_param)

After creating your PyTorch object, you can now fit your model to your input data.

In [1]:
estimator.fit(input_data)

After training, you will get a model in your S3 bucket. You can download it in order to use it on other devices or services, or you can deploy it with SageMaker.
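
SageMaker stores the contents of `SM_MODEL_DIR` as a compressed archive, `model.tar.gz`, under `<output_path>/<job-name>/output/`. Once the archive has been downloaded (for example from the S3 console), it can be unpacked with the standard library. This is a minimal sketch; the function name and the local paths are hypothetical.

```python
import tarfile
from pathlib import Path


def extract_model(archive_path, dest_dir):
    """Unpack a downloaded model.tar.gz and return the archived file names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Use the safer 'data' extraction filter where Python supports it.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest, **kwargs)
        return sorted(tar.getnames())
```

For example, `extract_model('model.tar.gz', 'model')` places the archived files, such as *model.pth*, in a local *model/* folder.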

### 6. Host

### 6-1. Create endpoint

After training, we use the PyTorch estimator object to build and deploy a PyTorchPredictor. This creates a SageMaker Endpoint, a hosted prediction service that we can use to perform inference.

We have implemented the required ```model_fn``` function in the ```train.py``` script. We are going to use the default implementations of *input_fn, predict_fn, output_fn and transform_fn* defined in sagemaker-pytorch-containers.

The arguments to the deploy function allow us to set the number and type of instances that will be used for the Endpoint. These do not need to be the same as the values we used for the training job. For example, you can train a model on a set of GPU-based instances and then deploy the Endpoint to a fleet of CPU-based instances. In that case, make sure that you return or save your model as a CPU model, similar to what we did in ```train.py```. Here we will deploy the model to a single *ml.m5.xlarge* instance.


In [None]:
predictor = estimator.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')

### 6-2. Evaluate

You can now use the endpoint to make predictions on images.
Refer to the ```test_1.py``` file to see how to use the model.

### 7. Clean up

After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it.

In [None]:
predictor.delete_endpoint()  # delete the endpoint created by estimator.deploy()

To avoid unnecessary costs, you will need to delete the resources that you no longer need.
Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/ and delete the following resources:

1. **The endpoint**. Deleting the endpoint also deletes the ML compute instance or instances that support it.

2. **The endpoint configuration.**

3. **The model.**

4. **The notebook instance.** Before deleting the notebook instance, stop it.

5. **The S3 bucket.**

### 8. Additional resources

You can find further information in the following resources:
1. [Get Started with AWS SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-console.html)
2. [Examples of notebooks with AWS SageMaker](https://github.com/awslabs/amazon-sagemaker-examples)