# YOLO v3 Finetuning on AWS

This series of notebooks demonstrates how to finetune a pretrained YOLO v3 (aka YOLO3) model using MXNet on AWS.

**This notebook** walks through using a [SageMaker Hyperparameter Tuning Job](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html) to find optimized hyperparameters and finetune the model.

**The notebooks in this series** show how to:

* Use the MXNet YOLO3 pretrained model
* Use Deep SORT with MXNet YOLO3
* Create a Ground Truth dataset from images the model mis-detected
* Finetune the model using the created dataset
* Load your finetuned model and deploy it to a SageMaker endpoint on a CPU instance
* Load your finetuned model and deploy it to a SageMaker endpoint on a GPU instance

## Pre-requisites

This notebook is designed to be run in Amazon SageMaker. To run it (and understand what's going on), you'll need:

* Basic familiarity with Python, [MXNet](https://mxnet.apache.org/), [AWS S3](https://docs.aws.amazon.com/s3/index.html), [Amazon SageMaker](https://aws.amazon.com/sagemaker/)
* To create an **S3 bucket** in the same region, and ensure the SageMaker notebook's role has access to this bucket.
* Sufficient [SageMaker quota limits](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html#limits_sagemaker) set on your account to run GPU-accelerated spot training jobs.

## Cost and runtime

Depending on your configuration, this demo may consume resources outside of the free tier but should not generally be expensive because we'll be training on a small number of images. You might wish to review the following for your region:

* [Amazon SageMaker pricing](https://aws.amazon.com/sagemaker/pricing/)

The standard `ml.t2.medium` instance should be sufficient to run the notebooks.

We will use GPU-accelerated instance types for training and hyperparameter optimization, and use spot instances where appropriate to optimize these costs.

As noted in the step-by-step guidance, you should take particular care to delete any created SageMaker real-time prediction endpoints when finishing the demo.

## Step 0: Dependencies and configuration

As usual we'll start by loading libraries, defining configuration, and connecting to the AWS SDKs:

In [65]:
%load_ext autoreload
%autoreload 1

# Built-Ins:
import os
import time
import json
from datetime import datetime
from glob import glob
from pprint import pprint

# External Dependencies:
import boto3
import imageio
import sagemaker
import numpy as np
from matplotlib import pyplot as plt
from sagemaker.mxnet import MXNet
from botocore.exceptions import ClientError

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [2]:
# restore stored variables
%store -r

In [3]:
session = boto3.session.Session()
region = session.region_name
s3 = session.resource('s3')
bucket = s3.Bucket(BUCKET_NAME)
smclient = session.client('sagemaker')

In [4]:
iam = boto3.client('iam')

In [5]:
print(bucket.name)

sagemaker-ap-northeast-2-929831892372


## Step 1: Recap output.manifest

In the last notebook, we created *output.manifest*, which contains the annotation information along with each image's location. Here is a recap of its content.

Each line of the file is a JSON dictionary with 2 essential keys, *labels* and *source-ref*.
- **labels** - contains the bounding boxes under the key *annotations*. *class_id* is always *0* because the dataset has only one class, *person*.
- **source-ref** - the S3 location of the image, the same value as in the *input.manifest* file

For an introduction to model training and deployment, see [**Train a Model with Amazon SageMaker**](http://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html)

In [6]:
output_manifest_path = f'annotations/{job_name}/manifests/output/output.manifest'
output_manifest_obj = bucket.Object(output_manifest_path)
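# one JSON record per line; drop blank lines (e.g. the empty string left by a trailing newline)
dataset = [line for line in output_manifest_obj.get()['Body'].read().decode('utf-8').split('\n') if line]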

In [7]:
dataset[0]

'{"source-ref":"s3://sagemaker-ap-northeast-2-929831892372/yolo-workshop-batch/images/235.jpg","labels":{"annotations":[{"class_id":0,"width":182,"top":45,"height":230,"left":90},{"class_id":0,"width":174,"top":0,"height":231,"left":150}],"image_size":[{"width":427,"depth":3,"height":320}]},"labels-metadata":{"job-name":"labeling-job/yolo-job-0","class-map":{"0":"Person"},"human-annotated":"yes","objects":[{"confidence":0.09},{"confidence":0.09}],"creation-date":"2020-03-14T12:51:04.692288","type":"groundtruth/object-detection"}}'

In [8]:
pprint(json.loads(dataset[0]))

{'labels': {'annotations': [{'class_id': 0,
                             'height': 230,
                             'left': 90,
                             'top': 45,
                             'width': 182},
                            {'class_id': 0,
                             'height': 231,
                             'left': 150,
                             'top': 0,
                             'width': 174}],
            'image_size': [{'depth': 3, 'height': 320, 'width': 427}]},
 'labels-metadata': {'class-map': {'0': 'Person'},
                     'creation-date': '2020-03-14T12:51:04.692288',
                     'human-annotated': 'yes',
                     'job-name': 'labeling-job/yolo-job-0',
                     'objects': [{'confidence': 0.09}, {'confidence': 0.09}],
                     'type': 'groundtruth/object-detection'},
 'source-ref': 's3://sagemaker-ap-northeast-2-929831892372/yolo-workshop-batch/images/235.jpg'}
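Detection code often expects corner coordinates rather than the (left, top, width, height) layout stored in the manifest. The following is a minimal sketch of that conversion, using an abbreviated copy of the record shown above. The helper name `annotations_to_corner_boxes` and the `[xmin, ymin, xmax, ymax, class_id]` row layout are assumptions for illustration, not necessarily what `yolo_finetune.py` does.

```python
import json

# Abbreviated copy of the record shown above (only the keys needed here).
record_line = (
    '{"source-ref":"s3://sagemaker-ap-northeast-2-929831892372/yolo-workshop-batch/images/235.jpg",'
    '"labels":{"annotations":['
    '{"class_id":0,"width":182,"top":45,"height":230,"left":90},'
    '{"class_id":0,"width":174,"top":0,"height":231,"left":150}],'
    '"image_size":[{"width":427,"depth":3,"height":320}]}}'
)


def annotations_to_corner_boxes(record):
    """Convert (left, top, width, height) boxes to [xmin, ymin, xmax, ymax, class_id] rows."""
    boxes = []
    for ann in record['labels']['annotations']:
        xmin, ymin = ann['left'], ann['top']
        xmax, ymax = xmin + ann['width'], ymin + ann['height']
        boxes.append([xmin, ymin, xmax, ymax, ann['class_id']])
    return boxes


record = json.loads(record_line)
boxes = annotations_to_corner_boxes(record)
print(boxes)  # [[90, 45, 272, 275, 0], [150, 0, 324, 231, 0]]
```

Both boxes stay inside the 427x320 image reported in *image_size*, which is a quick sanity check worth running over the whole manifest.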


## Step 2: Split dataset into Train and Test datasets

Splitting a dataset into train and test sets is a common procedure in machine learning (ML).

There are several ways to do this, and we are going to use the simplest one here: shuffle the entire dataset and split it with a ratio of 9:1 into train and test sets respectively.

After the split, you will get 2 files, *train.manifest* and *test.manifest*, in the same path where *output.manifest* is located.

In [9]:
RATIO = 0.9

# shuffle dataset
np.random.shuffle(dataset)

n_samples_total = len(dataset)
train_test_split_index = round(n_samples_total*RATIO)

# split datasets
train_dataset = dataset[:train_test_split_index]
test_dataset = dataset[train_test_split_index:]

n_samples_train = len(train_dataset)
%store n_samples_train
n_samples_test = len(test_dataset)
%store n_samples_test

# store manifests on the local filesystem, skipping blank lines
with open('train.manifest', 'w') as f:
    for line in train_dataset:
        if not line:
            continue
        f.write(str(line))
        f.write("\n")

with open('test.manifest', 'w') as f:
    for line in test_dataset:
        if not line:
            continue
        f.write(str(line))
        f.write("\n")
        
# store train/test manifests to s3 bucket where output.manifest is located.
manifest_path = output_manifest_path.rsplit('/', 1)[0]
bucket.upload_file('train.manifest', f'{manifest_path}/train.manifest')
print('Training manifest uploaded to:\n' + f's3://{bucket.name}/{manifest_path}/train.manifest')
bucket.upload_file('test.manifest', f'{manifest_path}/test.manifest')
print('Test manifest uploaded to:\n' + f"s3://{bucket.name}/{manifest_path}/test.manifest")

Stored 'n_samples_train' (int)
Stored 'n_samples_test' (int)
Training manifest uploaded to:
s3://sagemaker-ap-northeast-2-929831892372/annotations/yolo-job-0/manifests/output/train.manifest
Test manifest uploaded to:
s3://sagemaker-ap-northeast-2-929831892372/annotations/yolo-job-0/manifests/output/test.manifest
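The shuffle above is unseeded, so every run produces a different split. If you want to compare tuning runs on the same train/test split, a seeded variant may help. This is a sketch only: the function name `split_records`, the seed value, and the toy records are assumptions, not part of the original notebook.

```python
import random


def split_records(records, train_ratio=0.9, seed=42):
    """Shuffle a copy of `records` with a fixed seed and split it into (train, test)."""
    records = [r for r in records if r]   # drop blank lines
    rng = random.Random(seed)             # local RNG: leaves global random state untouched
    rng.shuffle(records)
    split_index = round(len(records) * train_ratio)
    return records[:split_index], records[split_index:]


# Toy stand-in for the manifest lines (hypothetical content), plus one blank line.
toy_dataset = [f'{{"source-ref": "s3://example-bucket/images/{i}.jpg"}}' for i in range(20)] + ['']
train, test = split_records(toy_dataset)
print(len(train), len(test))  # 18 2
```

Calling `split_records` again with the same seed returns the same split, and the input list is not modified.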


## Step 3: Create Hyperparameter Tuning Job

Now you are ready to finetune the MXNet YOLO model with the *train.manifest* and *test.manifest* datasets.

You can create a hyperparameter tuning job on the AWS Console, but it is much easier to do the same thing from a SageMaker notebook.

SageMaker provides the *sagemaker.mxnet.MXNet* estimator to train models. With this class you can run a training job or a hyperparameter tuning job for your own model.

First, you should define a metric for the tuning job. Its goal is to minimize or maximize the metric you give it.

In this chapter we are going to use the training *loss* as the metric, so the goal is to minimize it as much as possible.

SageMaker automatically captures the training container's *stdout*, searches it for the *Regex* pattern you defined, and uses the captured value as the metric to minimize.

In [10]:
metric_definitions = [
    { 'Name': 'TrainLoss', 'Regex': 'Train Loss: (.*?) ;' },
]
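To see what the pattern captures, it can help to test it locally against a sample log line before launching any jobs. The log line below is hypothetical; the real format is whatever `yolo_finetune.py` prints.

```python
import re

metric_regex = 'Train Loss: (.*?) ;'   # same pattern as in metric_definitions

# Hypothetical log line, used only to exercise the pattern.
log_line = '[Epoch 3] Train Loss: 7.664 ; Elapsed: 12.3s'

match = re.search(metric_regex, log_line)
train_loss = float(match.group(1))
print(train_loss)  # 7.664

# The pattern requires a space before the semicolon; without it nothing matches.
no_match = re.search(metric_regex, 'Train Loss: 7.664;')
print(no_match)  # None
```

If the training script's log format changes, the pattern has to change with it, otherwise the tuning job receives no metric values.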

If you already have an execution role for SageMaker, set *role_name* in the cell below to its name.

Otherwise, create an IAM role on the [**AWS Console**]() with the *AmazonSageMakerFullAccess* policy attached, as in the screenshot below.

<img src="Assets/ExecutionRole.png" />

In [11]:
# replace role_name with yours
role_name = 'AmazonSageMaker-ExecutionRole-20200129T183159'
%store role_name

Stored 'role_name' (str)


In [12]:
role = iam.get_role(RoleName=role_name)['Role']['Arn']

Let's make the estimator. An estimator handles end-to-end Amazon SageMaker training and deployment tasks.

You can run your training jobs on *Spot Instances*, and we are going to do that, because Spot Instances are usually much cheaper than on-demand instances.

Each training job launched by this estimator uses *one ml.p3.8xlarge Spot instance* (4 GPUs), and the tuner created later in this notebook runs up to 3 such jobs concurrently (*max_parallel_jobs*).

Let me explain some important parameters before running the code:

* entry_point - Python script that contains the train/finetune logic.
* source_dir - local folder in which `entry_point` is located.
* framework_version - MXNet framework version.
* input_mode - 'File' or 'Pipe'. The entry_point script must be implemented with the chosen input_mode in mind.
* train_use_spot_instances - True if you want to run training jobs on Spot instances.
* output_path - S3 path where models and checkpoints will be stored.
* hyperparameters - default hyperparameters. Most of the values will be overridden by the hyperparameter tuning job (see the *hyperparameter_ranges* variable in a later cell).
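In script mode, SageMaker passes the `hyperparameters` dictionary to the entry point as command-line arguments and exposes channel and model directories through `SM_*` environment variables. The sketch below shows how an entry point might read them. It is not the actual content of `yolo_finetune.py`: the default values and local fallback paths are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameters arrive as command-line arguments, e.g. --lr 0.000361
    parser.add_argument('--epochs', type=int, default=30)
    parser.add_argument('--batch-size', type=int, default=8)
    parser.add_argument('--num-gpus', type=int, default=0)
    parser.add_argument('--lr', type=float, default=0.001)
    parser.add_argument('--momentum', type=float, default=0.9)
    parser.add_argument('--wd', type=float, default=0.0005)
    parser.add_argument('--optimizer', type=str, default='sgd')
    # Channel and model directories come from SM_* environment variables
    # (the fallbacks are hypothetical paths for running outside SageMaker).
    parser.add_argument('--train', default=os.environ.get('SM_CHANNEL_TRAIN', 'data/train'))
    parser.add_argument('--model-dir', default=os.environ.get('SM_MODEL_DIR', 'model'))
    # parse_known_args tolerates extra arguments the script does not declare
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args(['--lr', '0.000361', '--optimizer', 'sgd', '--batch-size', '8'])
print(args.lr, args.optimizer, args.batch_size, args.epochs)
```

Note that argparse turns `--batch-size` into the attribute `args.batch_size`, which matches the hyphenated keys used in the estimator's `hyperparameters`.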

In [70]:
model_output_path = f's3://{BUCKET_NAME}/{MODELS_PREFIX}'
%store model_output_path

Stored 'model_output_path' (str)


In [19]:
estimator = MXNet(
    role=role,
    entry_point='yolo_finetune.py',
    source_dir='src',
    framework_version='1.4.1',
    py_version='py3',
    input_mode='File',
    train_volume_size=n_samples_train,  # EBS volume size in GB (here set to the number of training samples)
    train_instance_count=1,
    train_instance_type='ml.p3.8xlarge',
    train_max_run=5*60*60,
    train_use_spot_instances=True,
    train_max_wait=5*60*60,
    metric_definitions=metric_definitions,
    base_job_name='yolo-finetune-0',
    output_path=model_output_path,
    hyperparameters={
        'epochs': 30,
        'num-workers': 4,
        'batch-size': 8,
        'num-gpus': 4,
        'data-shape': 320,
        'lr': 0.000361,
        'momentum': 0.299848,
        'wd': 0.986724,
        'optimizer': 'sgd',
    }
)

## Step 5: Prepare channel

A hyperparameter tuning job requires data channels to fetch data from S3.

With the estimator in *File* mode, *image_channel* must be provided to the tuner, because the SageMaker training container uses it to copy all train/test images into the container when the instance is created.

We are using *File* mode because our dataset is small enough, but if you are planning to deal with a very large dataset, consider *Pipe* mode.

In [20]:
# pass only essential keys
attribute_names = ['source-ref', 'labels']

In [21]:
train_channel = sagemaker.session.s3_input(
    f's3://{BUCKET_NAME}/{manifest_path}/train.manifest',
    distribution='FullyReplicated',
    s3_data_type="S3Prefix",
    attribute_names=attribute_names
)
                                        
test_channel = sagemaker.session.s3_input(
    f's3://{BUCKET_NAME}/{manifest_path}/test.manifest',
    distribution='FullyReplicated',
    s3_data_type='S3Prefix',
    attribute_names=attribute_names
)

image_channel = sagemaker.session.s3_input(
    f's3://{BUCKET_NAME}/{BATCH_NAME}/{IMAGE_PREFIX}',
    s3_data_type="S3Prefix"
)
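In *File* mode each channel is downloaded to `/opt/ml/input/data/<channel_name>` inside the training container, so the training script has to translate each `source-ref` S3 URI from the manifest into a local path under the `images` channel. A minimal sketch, assuming the images sit directly under the channel prefix (no subfolders); the helper name `local_image_path` is hypothetical:

```python
import os
from urllib.parse import urlparse


def local_image_path(source_ref, images_dir=None):
    """Map an s3:// URI from a manifest to the file copied into the 'images' channel."""
    if images_dir is None:
        # SageMaker sets SM_CHANNEL_<NAME> for every channel passed to fit()
        images_dir = os.environ.get('SM_CHANNEL_IMAGES', '/opt/ml/input/data/images')
    filename = os.path.basename(urlparse(source_ref).path)
    return os.path.join(images_dir, filename)


path = local_image_path(
    's3://sagemaker-ap-northeast-2-929831892372/yolo-workshop-batch/images/235.jpg',
    images_dir='/opt/ml/input/data/images',
)
print(path)
```

If your image prefix contains subfolders, keep the part of the key below the prefix instead of only the file name.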

## Step 6: Finetune Model using Hyperparameter tuning job

[How Hyperparameter Tuning Works](https://docs.aws.amazon.com/sagemaker/latest/dg/automatic-model-tuning-how-it-works.html) gives detailed information about hyperparameter tuning jobs.

Simply put, a hyperparameter tuning job tries *likely* parameter combinations within the given ranges and finds the best combination it can for the model on the given dataset.

For this reason, you should provide hyperparameter ranges that are likely to contain the best values.

In [27]:
hyperparameter_ranges = {
    'lr': sagemaker.tuner.ContinuousParameter(0.0001, 0.1),
    'momentum': sagemaker.tuner.ContinuousParameter(0.0, 0.99),
    'wd': sagemaker.tuner.ContinuousParameter(0.0, 0.99),
    'optimizer': sagemaker.tuner.CategoricalParameter(['sgd', 'adam', 'rmsprop', 'adadelta'])
}
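The `lr` range above spans three orders of magnitude, so how the range is scaled matters. The sketch below is only an illustration of that effect with plain random sampling; it is not how the SageMaker tuner searches. It compares how often a linear and a log-uniform draw land below 0.001.

```python
import math
import random

LR_MIN, LR_MAX = 0.0001, 0.1   # same bounds as the 'lr' range above


def sample_linear(rng):
    return rng.uniform(LR_MIN, LR_MAX)


def sample_log_uniform(rng):
    # uniform in log10 space, then mapped back to the original scale
    return 10 ** rng.uniform(math.log10(LR_MIN), math.log10(LR_MAX))


rng = random.Random(0)
n = 10_000
linear = [sample_linear(rng) for _ in range(n)]
log_uniform = [sample_log_uniform(rng) for _ in range(n)]

frac_small_linear = sum(x < 0.001 for x in linear) / n
frac_small_log = sum(x < 0.001 for x in log_uniform) / n
print(f'linear: {frac_small_linear:.3f}, log-uniform: {frac_small_log:.3f}')
```

Linear sampling puts roughly 1% of the draws below 0.001, while log-uniform sampling puts about a third there. Recent versions of the SageMaker Python SDK accept a `scaling_type` argument on `ContinuousParameter` for this purpose; check the documentation of your installed version before relying on it.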

Putting it all together (the estimator, the metric, and the hyperparameter ranges), we are going to run the Hyperparameter Tuning Job.

In [42]:
max_jobs = 24
%store max_jobs

tuner = sagemaker.tuner.HyperparameterTuner(
    estimator,
    'TrainLoss',
    objective_type='Minimize',
    metric_definitions=metric_definitions,
    hyperparameter_ranges=hyperparameter_ranges,
    base_tuning_job_name='yolo-htj-batch-0',
    max_jobs=max_jobs,
    max_parallel_jobs=3
)

tuner.fit(
    {
        "train": train_channel,
        "test": test_channel,
        "images": image_channel
    },
    include_cls_metadata=False
)

Stored 'max_jobs' (int)


ResourceLimitExceeded: An error occurred (ResourceLimitExceeded) when calling the CreateHyperParameterTuningJob operation: The account-level service limit 'ml.p3.8xlarge for spot training job usage' is 3 Instances, with current utilization of 3 Instances and a request delta of 3 Instances. Please contact AWS support to request an increase for this limit.

Once you call the *fit()* method, you can check the progress on the [AWS Console](https://console.aws.amazon.com).

<img src="Assets/TrainingJobStatus.png" />

And, of course, you can check the progress from the notebook using the SageMaker Python SDK.

In [68]:
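# find training jobs launched by the tuner that are still running
training_jobs = smclient.list_training_jobs(NameContains=tuner.base_tuning_job_name, StatusEquals='InProgress')
training_job_summaries = training_jobs['TrainingJobSummaries']
# strip the trailing '-<index>-<hash>' from a training job name to recover the tuning job name
training_job_name = training_job_summaries[0]['TrainingJobName'].rsplit('-', 2)[0]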
%store training_job_name

Stored 'training_job_name' (str)


In [40]:
analytics = sagemaker.HyperparameterTuningJobAnalytics(training_job_name)
analytics.dataframe()

| | lr | momentum | optimizer | wd | TrainingJobName | TrainingJobStatus | FinalObjectiveValue | TrainingStartTime | TrainingEndTime | TrainingElapsedTimeSeconds |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.088727 | 0.976522 | "adadelta" | 0.989257 | yolo-htj-batch-0-200316-1644-024-2add2e90 | InProgress | | NaT | NaT | |
| 1 | 0.000467 | 0.238498 | "sgd" | 0.608111 | yolo-htj-batch-0-200316-1644-023-8f2c0fe6 | InProgress | | 2020-03-16 17:40:20+09:00 | NaT | |
| 2 | 0.000121 | 0.016268 | "sgd" | 0.979727 | yolo-htj-batch-0-200316-1644-022-906d9c4c | InProgress | | NaT | NaT | |
| 3 | 0.000245 | 0.989981 | "sgd" | 0.971713 | yolo-htj-batch-0-200316-1644-021-12794dc4 | Completed | 17.40723 | 2020-03-16 17:34:03+09:00 | 2020-03-16 17:38:16+09:00 | 253.0 |
| 4 | 0.000132 | 0.97625 | "sgd" | 0.208593 | yolo-htj-batch-0-200316-1644-020-c4d5fd9f | Completed | 8.67067 | 2020-03-16 17:33:35+09:00 | 2020-03-16 17:37:52+09:00 | 257.0 |
| 5 | 0.000229 | 0.99 | "sgd" | 0.97038 | yolo-htj-batch-0-200316-1644-019-e92e161e | Completed | 16.758648 | 2020-03-16 17:33:22+09:00 | 2020-03-16 17:37:40+09:00 | 258.0 |
| 6 | 0.000349 | 0.735787 | "sgd" | 0.981868 | yolo-htj-batch-0-200316-1644-018-0d72777e | Completed | 7.828395 | 2020-03-16 17:26:49+09:00 | 2020-03-16 17:31:11+09:00 | 262.0 |
| 7 | 0.000532 | 0.987465 | "adam" | 0.984736 | yolo-htj-batch-0-200316-1644-017-4c9a29c3 | Completed | 11.105245 | 2020-03-16 17:25:49+09:00 | 2020-03-16 17:30:23+09:00 | 274.0 |
| 8 | 0.001306 | 0.986137 | "adam" | 0.989794 | yolo-htj-batch-0-200316-1644-016-77ed042d | Completed | 14.89971 | 2020-03-16 17:27:05+09:00 | 2020-03-16 17:31:40+09:00 | 275.0 |
| 9 | 0.000218 | 0.885406 | "adam" | 0.967501 | yolo-htj-batch-0-200316-1644-015-2493e60c | Completed | 7.664004 | 2020-03-16 17:18:30+09:00 | 2020-03-16 17:23:09+09:00 | 279.0 |
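Because the objective type is *Minimize*, the best job so far is the completed one with the smallest *FinalObjectiveValue*. As a small sketch, here is that selection applied to the completed rows copied from the output above (only the rows shown there, not the full set of 24 jobs):

```python
# Completed rows copied from the output above: (TrainingJobName, FinalObjectiveValue)
completed_jobs = [
    ('yolo-htj-batch-0-200316-1644-021-12794dc4', 17.40723),
    ('yolo-htj-batch-0-200316-1644-020-c4d5fd9f', 8.67067),
    ('yolo-htj-batch-0-200316-1644-019-e92e161e', 16.758648),
    ('yolo-htj-batch-0-200316-1644-018-0d72777e', 7.828395),
    ('yolo-htj-batch-0-200316-1644-017-4c9a29c3', 11.105245),
    ('yolo-htj-batch-0-200316-1644-016-77ed042d', 14.89971),
    ('yolo-htj-batch-0-200316-1644-015-2493e60c', 7.664004),
]

# The objective is 'Minimize', so the best job has the smallest value.
best_name, best_loss = min(completed_jobs, key=lambda job: job[1])
print(best_name, best_loss)  # yolo-htj-batch-0-200316-1644-015-2493e60c 7.664004
```

On the real dataframe, sorting with `analytics.dataframe().sort_values('FinalObjectiveValue')` should give the same ordering, with unfinished jobs (no objective value yet) at the end.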


Now we are going to wait until all training jobs are completed. It will take a few minutes.

In [63]:
def wait_for_training(base_job_name, total_jobs):
    """Block until `total_jobs` training jobs whose names contain `base_job_name` are completed.

    Only jobs with status 'Completed' are counted, so the loop does not end
    if a job fails or is stopped.
    """
    completed_jobs = 0
    while completed_jobs < total_jobs:
        print(f'{completed_jobs}/{total_jobs} of training jobs are completed...')
        # count the completed training jobs launched by the tuning job
        completed_jobs = len(smclient.list_training_jobs(
            NameContains=base_job_name,
            StatusEquals='Completed',
            MaxResults=total_jobs,
        )['TrainingJobSummaries'])
        time.sleep(10)

    print(f'All({total_jobs}) training jobs are completed!!')

In [66]:
wait_for_training(tuner.base_tuning_job_name, max_jobs)

0/24 of training jobs are completed...
All(24) training jobs are completed!!
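Counting completed training jobs works when every job succeeds, but it keeps waiting if a job fails or is stopped. An alternative is to poll the status of the tuning job itself with the SageMaker client's `describe_hyper_parameter_tuning_job` call. The sketch below takes the client as a parameter and uses a small fake client so it can run offline; `wait_for_tuning_job`, `FakeClient`, and the polling interval are hypothetical, and in the notebook you would pass `smclient` instead.

```python
import time


def wait_for_tuning_job(client, tuning_job_name, poll_seconds=30, sleep=time.sleep):
    """Poll a tuning job until it leaves the 'InProgress'/'Stopping' states."""
    while True:
        desc = client.describe_hyper_parameter_tuning_job(
            HyperParameterTuningJobName=tuning_job_name
        )
        status = desc['HyperParameterTuningJobStatus']
        counters = desc['TrainingJobStatusCounters']
        print(f"{status}: {counters['Completed']} completed, "
              f"{counters['NonRetryableError']} failed")
        if status not in ('InProgress', 'Stopping'):
            return status
        sleep(poll_seconds)


class FakeClient:
    """Stand-in for the boto3 SageMaker client so the sketch runs offline."""

    def __init__(self):
        self.calls = 0

    def describe_hyper_parameter_tuning_job(self, HyperParameterTuningJobName):
        self.calls += 1
        done = self.calls >= 3   # pretend the job finishes on the third poll
        return {
            'HyperParameterTuningJobStatus': 'Completed' if done else 'InProgress',
            'TrainingJobStatusCounters': {
                'Completed': 24 if done else 8 * self.calls,
                'NonRetryableError': 0,
            },
        }


fake = FakeClient()
final_status = wait_for_tuning_job(fake, 'yolo-htj-batch-0-200316-1644', sleep=lambda s: None)
print(final_status)  # Completed
```

The function returns the final status, so the caller can tell a completed tuning job from a failed or stopped one.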
