# Training an MMDetection Mask-RCNN Model on a SageMaker Distributed Cluster

## Motivation
[MMDetection](https://github.com/open-mmlab/mmdetection) is a popular open-source Deep Learning framework focused on Computer Vision models and use cases. MMDetection provides higher-level APIs for model training and inference. It reports [state-of-the-art benchmarks](https://github.com/open-mmlab/mmdetection#benchmark-and-model-zoo) for a variety of model architectures and ships with an extensive Model Zoo.

In this notebook, we will build a custom training container with the MMDetection library and then train a Mask-RCNN model from scratch on the [COCO2017 dataset](https://cocodataset.org/#home), using the SageMaker distributed [training feature](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html) to reduce training time.

### Preconditions
- To execute this notebook, you need the COCO 2017 training and validation datasets uploaded to an S3 bucket that the Amazon SageMaker service can access.


## Building Training Container

Amazon SageMaker lets you bring your own (BYO) containers for training, data processing, and inference. In our case, we need to build a custom training container and push it to the [ECR service](https://aws.amazon.com/ecr/) in your AWS account.

To do this, we need to log in to both the public ECR that hosts the SageMaker base images and your private ECR repository.

In [None]:
import sagemaker, boto3

session = sagemaker.Session()
region = session.boto_region_name
account = boto3.client('sts').get_caller_identity().get('Account')
bucket = session.default_bucket()

container = "mmdetection-training" # your container name
tag = "latest"

In [None]:
# login to Sagemaker ECR with Deep Learning Containers
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin 763104351884.dkr.ecr.{region}.amazonaws.com
# login to your private ECR
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account}.dkr.ecr.{region}.amazonaws.com

Now, let's review the training container. The Dockerfile does the following:
- uses the SageMaker PyTorch 1.5.0 container as the base image;
- installs the latest version of the PyTorch libraries and the MMDetection dependencies;
- builds MMDetection from source;
- configures the SageMaker environment variables, specifically which script to run at training time.

In [None]:
! pygmentize -l docker Dockerfile.training

<br>
<br>
Next, we build the custom training container and push it to the private ECR repository.
<br>
<br>

In [None]:
! ./build_and_push.sh $container $tag Dockerfile.training

### Training script

At training time, SageMaker executes the training script defined in the `SAGEMAKER_PROGRAM` variable. In our case, this script does the following:
- parses user parameters passed via the SageMaker Hyperparameter dictionary;
- constructs the launch command based on these parameters;
- uses the `torch.distributed.launch` utility to launch distributed training;
- uses MMDetection `tools/train.py` to configure the training process.
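The steps above can be sketched roughly as follows. This is a minimal illustration and not the actual `mmdetection_train.py` script: the function name `build_launch_command`, the port number, and the config path are hypothetical. In a real job the host list, current host, and GPU count would typically come from the `SM_HOSTS`, `SM_CURRENT_HOST`, and `SM_NUM_GPUS` environment variables that the SageMaker training toolkit sets inside the container.

```python
def build_launch_command(hosts, current_host, num_gpus, config_file, extra_args=()):
    """Build a torch.distributed.launch command line for one node of the cluster.

    Hypothetical helper: every node runs the same script, and the node rank is
    derived from the position of the current host in the sorted host list.
    """
    hosts = sorted(hosts)
    return [
        "python", "-m", "torch.distributed.launch",
        f"--nnodes={len(hosts)}",
        f"--node_rank={hosts.index(current_host)}",
        f"--nproc_per_node={num_gpus}",
        f"--master_addr={hosts[0]}",   # first host acts as the rendezvous master
        "--master_port=29500",         # assumption: any free port would do
        "tools/train.py", config_file,
        "--launcher", "pytorch",       # tells MMDetection to use torch.distributed
        *extra_args,
    ]


# Example: the second node of a two-node cluster with 8 GPUs per node.
cmd = build_launch_command(
    hosts=["algo-1", "algo-2"],
    current_host="algo-2",
    num_gpus=8,
    config_file="configs/custom/faster_rcnn_r50_fpn_1x_coco.py",
)
print(" ".join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the MMDetection root directory.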


In [None]:
! pygmentize container_training/mmdetection_train.py

## Define training configuration

In [None]:
# Define IAM role
import boto3
import re

import os
import numpy as np
import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()

In [None]:
from time import gmtime, strftime

prefix_input = 'mmdetection-input'
prefix_output = 'mmdetection-output'
image = '{}.dkr.ecr.{}.amazonaws.com/{}:{}'.format(account, region, container, tag)

In [None]:
# HERE
# algorithm parameters

hyperparameters = {
    "config-file" : "configs/custom/faster_rcnn_r50_fpn_1x_coco.py", # config path is relative to MMDetection root directory
    "dataset" : "coco",
    "auto-scale" : "false", # whether to scale LR and Warm Up time
    "validate" : "true", # whether to run validation after training is done
    
    # 'options' lets you override individual config values
    "options" : "total_epochs=10; optimizer.lr=0.08; evaluation.gpu_collect=True",
}
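SageMaker delivers hyperparameters to the container as strings, so the training script has to convert values such as `"true"` and split the `options` string itself. The sketch below shows one way this could be done; `parse_bool` and `parse_options` are hypothetical helpers and may differ from what `mmdetection_train.py` actually does.

```python
def parse_bool(value):
    """Interpret a string hyperparameter such as 'true' or 'False' as a bool."""
    return str(value).strip().lower() in ("true", "1", "yes")


def parse_options(options):
    """Split 'a=1; b.c=2' into ['a=1', 'b.c=2'], validating each entry."""
    pairs = [item.strip() for item in options.split(";")]
    pairs = [p for p in pairs if p]  # drop empty entries, e.g. after a trailing ';'
    for pair in pairs:
        if "=" not in pair:
            raise ValueError(f"Expected key=value, got: {pair!r}")
    return pairs


options = parse_options(
    "total_epochs=10; optimizer.lr=0.08; evaluation.gpu_collect=True"
)
print(options)
print(parse_bool("false"))
```

Each `key=value` entry can then be forwarded to `tools/train.py` as a config override (recent MMDetection 2.x releases accept these via `--cfg-options`, older ones via `--options`).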

In [None]:
# Sagemaker will parse metrics from STDOUT and store/visualize them as part of training job
# Raw strings keep the regex escapes (\s, \d, \.) intact
metrics = [
    {
        "Name": "loss",
        "Regex": r".*loss:\s([0-9.]+)\s*"
    },
    {
        "Name": "loss_rpn_cls",
        "Regex": r".*loss_rpn_cls:\s([0-9.]+)\s*"
    },
    {
        "Name": "loss_rpn_bbox",
        "Regex": r".*loss_rpn_bbox:\s([0-9.]+)\s*"
    },
    {
        "Name": "loss_cls",
        "Regex": r".*loss_cls:\s([0-9.]+)\s*"
    },
    {
        "Name": "acc",
        "Regex": r".*acc:\s([0-9.]+)\s*"
    },
    {
        "Name": "loss_bbox",
        "Regex": r".*loss_bbox:\s([0-9.]+)\s*"
    },
    {
        "Name": "loss_mask",
        "Regex": r".*loss_mask:\s([0-9.]+)\s*"
    },
    {
        "Name": "lr",
        "Regex": r"lr: (-?\d+\.?\d*(?:[Ee]-\d+)?)"
    }
]
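Before launching a job, it can help to check the metric patterns against a representative log line. The snippet below is a small sanity check using Python's `re` module. The sample line is a hypothetical approximation of an MMDetection training log entry, not output captured from a real run, and only a subset of the metrics is shown.

```python
import re

# Subset of the metric definitions, written as raw strings.
patterns = {
    "loss": r".*loss:\s([0-9.]+)\s*",
    "loss_rpn_cls": r".*loss_rpn_cls:\s([0-9.]+)\s*",
    "loss_cls": r".*loss_cls:\s([0-9.]+)\s*",
    "acc": r".*acc:\s([0-9.]+)\s*",
    "lr": r"lr: (-?\d+\.?\d*(?:[Ee]-\d+)?)",
}

# Hypothetical log line; the real format depends on the MMDetection version.
sample = (
    "Epoch [1][50/7330]\tlr: 1.978e-03, loss_rpn_cls: 0.1234, "
    "loss_cls: 0.5678, acc: 91.2500, loss: 1.2345"
)

extracted = {}
for name, pattern in patterns.items():
    match = re.search(pattern, sample)
    if match:
        extracted[name] = float(match.group(1))

print(extracted)
```

Note that `loss:` does not match `loss_rpn_cls:` or `loss_cls:`, because the pattern requires the colon to follow `loss` directly.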

## Test training script and container locally


Amazon SageMaker supports [local mode](https://sagemaker.readthedocs.io/en/stable/overview.html?highlight=local%20mode#local-mode), which lets you run a training job locally before deploying your training container to a remote SageMaker Training cluster.

To use local mode, we first need to install some dependencies. Note that you may need to restart your kernel for these changes to take effect.

In [None]:
# Install all dependencies for the local run.
# Note you may need to restart your Sagemaker Notebook kernel to have changes applied.
! pip install 'sagemaker[local]' --upgrade

In [None]:
from sagemaker.local import LocalSession

# Configure our local training session
sagemaker_local_session = LocalSession()
sagemaker_local_session.config = {'local': {'local_code': True}}

In [None]:
# HERE
# Copy training data locally
! mkdir -p ../dominos
! aws s3 cp s3://roboflow-data-1/dominos/ ../dominos --recursive

Now we are ready to run our training container locally. To give the container access to CUDA devices, set the instance type to the special value `local_gpu`. If you don't need access to GPUs, use the `local` instance type instead.

Depending on the configuration of your local host and its available memory, you may run into memory issues when loading the dataset. In that case, try reducing the batch size to bring down memory consumption.
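One way to reduce the batch size without editing the config file is to add an override to the `options` hyperparameter. The sketch below assumes the config follows the MMDetection 2.x layout, where the per-GPU batch size lives under `data.samples_per_gpu`; `with_option` is a hypothetical helper, and the dictionary is a trimmed stand-in for the notebook's hyperparameters.

```python
def with_option(hyperparameters, key, value):
    """Return a copy of the hyperparameters with `key=value` set in 'options'."""
    updated = dict(hyperparameters)
    entries = [o.strip() for o in updated.get("options", "").split(";") if o.strip()]
    # Drop any earlier override of the same key so the new value wins.
    entries = [o for o in entries if o.split("=", 1)[0].strip() != key]
    entries.append(f"{key}={value}")
    updated["options"] = "; ".join(entries)
    return updated


base = {
    "config-file": "configs/custom/faster_rcnn_r50_fpn_1x_coco.py",
    "options": "total_epochs=10; optimizer.lr=0.08",
}

# Assumption: 'data.samples_per_gpu' is the batch-size key in the config.
local_hyperparameters = with_option(base, "data.samples_per_gpu", 1)
print(local_hyperparameters["options"])
```

The original dictionary is left untouched, so the same base hyperparameters can still be used for the remote training job.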

In [None]:
est = sagemaker.estimator.Estimator(image,
                                    role=role,
                                    instance_count=1,
                                    instance_type='local',
                                    output_path="s3://{}/{}".format(bucket, prefix_output),
                                    metric_definitions = metrics,
                                    hyperparameters = hyperparameters, 
                                    sagemaker_session=sagemaker_local_session
)
# HERE
est.fit({"training" : "file:///home/ec2-user/SageMaker/dominos"})

## Start Sagemaker Training 

Now that we have tested our training script and container locally, we are ready to run the training job on a distributed SageMaker training cluster. Execute the cell below to start training on SageMaker. Use the `instance_count` and `instance_type` parameters to control your training cluster configuration.
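Changing the cluster size changes the effective batch size, so the learning rate usually needs to change with it. The sketch below applies the common linear scaling rule. It assumes a reference learning rate of 0.02 for a total batch of 16 images (8 GPUs with 2 images each, the usual MMDetection baseline), and it is not necessarily what the `auto-scale` hyperparameter of the training script does. The GPU counts per instance type are listed for illustration only.

```python
# Illustrative GPU counts per instance type (only the types used in this sketch).
GPUS_PER_INSTANCE = {
    "ml.p2.xlarge": 1,
    "ml.p3.16xlarge": 8,
}


def scaled_lr(num_gpus, samples_per_gpu, base_lr=0.02, base_batch_size=16):
    """Linear scaling rule: the learning rate grows with the total batch size."""
    total_batch_size = num_gpus * samples_per_gpu
    return base_lr * total_batch_size / base_batch_size


def cluster_lr(instance_type, instance_count, samples_per_gpu=2):
    """Learning rate for a cluster of identical instances."""
    world_size = GPUS_PER_INSTANCE[instance_type] * instance_count
    return scaled_lr(world_size, samples_per_gpu)


print(cluster_lr("ml.p2.xlarge", 2))    # 2 GPUs in total
print(cluster_lr("ml.p3.16xlarge", 4))  # 32 GPUs in total
```

A value computed this way could be passed to the job through the `optimizer.lr` entry of the `options` hyperparameter.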

In [None]:
# HERE
instance_type = 'ml.p2.xlarge'
instance_count = 2

est = sagemaker.estimator.Estimator(image,
                                    role=role,
                                    instance_count=instance_count,
                                    instance_type=instance_type,
                                    volume_size=100,  # EBS volume size in GB (named train_volume_size in SDK v1)
                                    output_path="s3://{}/{}".format(bucket, prefix_output),
                                    metric_definitions=metrics,
                                    hyperparameters=hyperparameters,
                                    sagemaker_session=session
)
# HERE
est.fit({"training" : "s3://roboflow-data-1/dominos/"})