# YOLO v3 Finetuning on AWS

This series of notebooks demonstrates how to finetune pretrained YOLO v3 (aka YOLO3) using MXNet on AWS.

**This notebook** walks through using the [SageMaker Ground Truth](https://aws.amazon.com/sagemaker/groundtruth/) tool to annotate training and validation data sets.

**Follow-on notebooks** cover:

* How to use the pretrained MXNet YOLO3 model
* How to use Deep SORT with MXNet YOLO3
* How to create a Ground Truth dataset from images the model mis-detected
* How to finetune the model using the created dataset
* How to load your finetuned model and deploy it to a SageMaker endpoint on a CPU instance
* How to load your finetuned model and deploy it to a SageMaker endpoint on a GPU instance

## Pre-requisites

This notebook is designed to be run in Amazon SageMaker. To run it (and understand what's going on), you'll need:

* Basic familiarity with Python, [MXNet](https://mxnet.apache.org/), [AWS S3](https://docs.aws.amazon.com/s3/index.html), and [Amazon SageMaker](https://aws.amazon.com/sagemaker/)
* An **S3 bucket** in the same region as the notebook, with access granted to the SageMaker notebook's role
* Sufficient [SageMaker quota limits](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html#limits_sagemaker) set on your account to run GPU-accelerated spot training jobs.

## Cost and runtime

Depending on your configuration, this demo may consume resources outside of the free tier but should not generally be expensive because we'll be training on a small number of images. You might wish to review the following for your region:

* [Amazon SageMaker pricing](https://aws.amazon.com/sagemaker/pricing/)
* [SageMaker Ground Truth pricing](https://aws.amazon.com/sagemaker/groundtruth/pricing/)

The standard `ml.t2.medium` instance should be sufficient to run the notebooks.

We will use GPU-accelerated instance types for training and hyperparameter optimization, and use spot instances where appropriate to optimize these costs.

As noted in the step-by-step guidance, you should take particular care to delete any created SageMaker real-time prediction endpoints when finishing the demo.

## Step 0: Dependencies and configuration

As usual we'll start by loading libraries, defining configuration, and connecting to the AWS SDKs:

In [2]:
%load_ext autoreload
%autoreload 1

# Built-Ins:
import re
import os
import json
from glob import glob
from pprint import pprint

# External Dependencies:
import boto3
import imageio
import sagemaker
from botocore.exceptions import ClientError
from matplotlib import pyplot as plt



In [3]:
BUCKET_NAME = sagemaker.Session().default_bucket()
%store BUCKET_NAME

REGION = sagemaker.Session().boto_region_name
%store REGION

IMAGE_PREFIX = 'images'
%store IMAGE_PREFIX

MODELS_PREFIX = 'models'
%store MODELS_PREFIX

CLASS_NAMES = ['person']
%store CLASS_NAMES

BATCH_NAME = 'yolo-workshop-batch'
%store BATCH_NAME

Stored 'BUCKET_NAME' (str)
Stored 'REGION' (str)
Stored 'IMAGE_PREFIX' (str)
Stored 'MODELS_PREFIX' (str)
Stored 'CLASS_NAMES' (list)
Stored 'BATCH_NAME' (str)


In [4]:
session = boto3.session.Session()
region = session.region_name
s3 = session.resource('s3')
bucket = s3.Bucket(BUCKET_NAME)
smclient = session.client('sagemaker')

In [5]:
print(bucket.name)

sagemaker-ap-northeast-2-929831892372


## Step 1: Create bucket

In this notebook, we are going to label the mis-detected images using SageMaker Ground Truth.

Most SageMaker services, including Ground Truth, read their input data from S3, so we need a bucket to hold the images.


In [6]:
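# In us-east-1, S3 rejects an explicit LocationConstraint, so only set it elsewhere.
create_kwargs = {'ACL': 'private'}
if REGION != 'us-east-1':
    create_kwargs['CreateBucketConfiguration'] = {'LocationConstraint': REGION}

try:
    bucket.create(**create_kwargs)
except ClientError as e:
    if e.response['Error']['Code'] == 'BucketAlreadyOwnedByYou':
        print('Bucket has already been created.')
    else:
        raise

Bucket has already been created.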


In [7]:
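# Set a CORS policy on the bucket so that browser-based tools, such as the Ground Truth labeling UI, can request the images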

cors_config = {
    'CORSRules': [{
        'AllowedHeaders': ['Authorization'],
        'AllowedMethods': ['GET', 'PUT'],
        'AllowedOrigins': ['*'],
        'ExposeHeaders': ['GET', 'PUT'],
    }]
}
cors = bucket.Cors()
cors.put(CORSConfiguration=cors_config)

{'ResponseMetadata': {'RequestId': '124A9C91E5987B2F',
  'HostId': 'INFvvjRgmoRxAvaFGLzh5SVnFZcnjHOkWYyBOShcnTlTxtB+owhuH0QsS9Xvhg6plDusJGpzPxc=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'INFvvjRgmoRxAvaFGLzh5SVnFZcnjHOkWYyBOShcnTlTxtB+owhuH0QsS9Xvhg6plDusJGpzPxc=',
   'x-amz-request-id': '124A9C91E5987B2F',
   'date': 'Wed, 20 May 2020 03:03:24 GMT',
   'content-length': '0',
   'server': 'AmazonS3'},
  'RetryAttempts': 0}}

## Step 2: Upload images to S3

Suppose the mis-detected images are stored in `/Users/dongkyl/Documents/git/mxnet-deepsort-yolo3/images`. The code below uploads them to S3.

In [8]:
local_image_path = '/Users/dongkyl/Documents/git/mxnet-deepsort-yolo3/images' # change this
print(local_image_path)

/Users/dongkyl/Documents/git/mxnet-deepsort-yolo3/images


On S3, the images are stored under `{BATCH_NAME}/{IMAGE_PREFIX}`.

The `full s3 path` of each image is therefore `s3://{BUCKET_NAME}/{BATCH_NAME}/{IMAGE_PREFIX}/{FILENAME}`.

In [9]:
upload_path = f'{BATCH_NAME}/{IMAGE_PREFIX}'
print(upload_path)

yolo-workshop-batch/images
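
The path convention above can be captured in a small helper. This is only a minimal sketch: `s3_uri` is a hypothetical helper name (it is not part of boto3 or the SageMaker SDK), and the bucket name is a placeholder standing in for `BUCKET_NAME`.

```python
def s3_uri(bucket_name, *key_parts):
    """Join a bucket name and key segments into an s3:// URI.

    Hypothetical helper for illustration; it strips stray slashes so that
    segments such as 'images/' and '/1.jpg' do not produce '//'.
    """
    cleaned = [part.strip('/') for part in key_parts if part and part.strip('/')]
    return f"s3://{bucket_name}/" + '/'.join(cleaned)


# Placeholder values standing in for BUCKET_NAME, BATCH_NAME and IMAGE_PREFIX.
example_uri = s3_uri('my-example-bucket', 'yolo-workshop-batch', 'images', '1.jpg')
print(example_uri)
```

In the notebook itself, the same call would take `BUCKET_NAME`, `BATCH_NAME`, `IMAGE_PREFIX` and a filename.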


Now we upload the images to the bucket so that we can build the `input.manifest` for the Ground Truth labeling job. The same images will also be used for finetuning.

In [19]:
filenames = []

# Upload every image whose name starts with a digit (e.g. 88.jpg).
for file_path in glob(os.path.join(local_image_path, '[0-9]*.jpg')):
    filename = os.path.basename(file_path)
    bucket.upload_file(file_path, f'{upload_path}/{filename}')
    filenames.append(filename)
    print(f'uploaded {filename}...')

uploaded 88.jpg...
uploaded 76.jpg...
uploaded 49.jpg...
uploaded 61.jpg...
uploaded 64.jpg...
uploaded 70.jpg...
uploaded 58.jpg...
uploaded 73.jpg...
uploaded 67.jpg...
uploaded 28.jpg...
uploaded 115.jpg...
uploaded 100.jpg...
uploaded 103.jpg...
uploaded 16.jpg...
uploaded 106.jpg...
uploaded 112.jpg...
uploaded 13.jpg...
uploaded 10.jpg...
uploaded 121.jpg...
uploaded 109.jpg...
uploaded 34.jpg...
uploaded 22.jpg...
uploaded 37.jpg...
uploaded 118.jpg...
uploaded 31.jpg...
uploaded 25.jpg...
uploaded 19.jpg...
uploaded 4.jpg...
uploaded 94.jpg...
uploaded 43.jpg...
uploaded 55.jpg...
uploaded 7.jpg...
uploaded 82.jpg...
uploaded 97.jpg...
uploaded 40.jpg...
uploaded 79.jpg...
uploaded 1.jpg...
uploaded 91.jpg...
uploaded 85.jpg...
uploaded 52.jpg...
uploaded 46.jpg...
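
The upload order above is simply the order in which `glob` returned the files, which is not guaranteed to be sorted. If a deterministic order is preferred for the manifest, the filenames can be sorted numerically. A minimal sketch, assuming filenames of the form `<number>.jpg` as in the listing above; `numeric_key` is a hypothetical helper, not something the notebook defines.

```python
import os
import re


def numeric_key(filename):
    """Sort key based on the leading integer of names like '88.jpg'."""
    match = re.match(r'(\d+)', os.path.basename(filename))
    # Names without a leading number sort last, in alphabetical order.
    return (0, int(match.group(1)), filename) if match else (1, 0, filename)


sample = ['88.jpg', '7.jpg', '100.jpg', '13.jpg']
sorted_sample = sorted(sample, key=numeric_key)
print(sorted_sample)
```

Applied to the notebook, this would be `filenames.sort(key=numeric_key)` before the manifest is written.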


## Step 3: Generate input.manifest

To set up the SageMaker Ground Truth labeling job, you need a manifest file that lists the files on S3.

The [**manifest**](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-data-input.html) file contains one JSON object per line, and each object here must contain a `source-ref` key. The `source-ref` value is the `full s3 path` of the image file, as described above.

We generate the input manifest locally at `{BATCH_NAME}/manifests/input.manifest`, then upload it to the S3 bucket under the same key.

The full path of the manifest is `s3://{BUCKET_NAME}/{BATCH_NAME}/manifests/input.manifest`.

In [20]:
os.makedirs(f'{BATCH_NAME}/manifests', exist_ok=True)
input_manifest_loc = f'{BATCH_NAME}/manifests/input.manifest'

with open(input_manifest_loc, 'w') as fp:
    for filename in filenames:
        source_ref = f's3://{bucket.name}/{upload_path}/{filename}'
        fp.write(json.dumps({'source-ref': source_ref})+'\n')

bucket.upload_file(input_manifest_loc, input_manifest_loc)
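
Before uploading, it can be useful to check that the manifest round-trips as valid JSON Lines. The following is a minimal sketch of that idea; `build_manifest_lines` and `parse_manifest` are hypothetical helpers, and the bucket name is a placeholder.

```python
import json


def build_manifest_lines(bucket_name, prefix, filenames):
    """Return one JSON string per image, each with a 'source-ref' key."""
    return [
        json.dumps({'source-ref': f's3://{bucket_name}/{prefix}/{name}'})
        for name in filenames
    ]


def parse_manifest(text):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Placeholder bucket name; the prefix mirrors upload_path from the notebook.
lines = build_manifest_lines(
    'my-example-bucket', 'yolo-workshop-batch/images', ['1.jpg', '4.jpg'])
manifest_text = '\n'.join(lines) + '\n'
records = parse_manifest(manifest_text)
print(records[0]['source-ref'])
```

Every parsed record should carry a `source-ref` value that starts with `s3://`; a line that fails `json.loads` points to a malformed manifest.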

You can visit the [**AWS S3 Console**](https://s3.console.aws.amazon.com/) to confirm that the images were uploaded successfully.

You can also list the files from the notebook using boto3:

In [21]:
for obj in bucket.objects.filter(Prefix=f'{BATCH_NAME}'):
    print(obj.key)

yolo-workshop-batch/images/1.jpg
yolo-workshop-batch/images/10.jpg
yolo-workshop-batch/images/100.jpg
yolo-workshop-batch/images/103.jpg
yolo-workshop-batch/images/106.jpg
yolo-workshop-batch/images/109.jpg
yolo-workshop-batch/images/112.jpg
yolo-workshop-batch/images/115.jpg
yolo-workshop-batch/images/118.jpg
yolo-workshop-batch/images/121.jpg
yolo-workshop-batch/images/13.jpg
yolo-workshop-batch/images/16.jpg
yolo-workshop-batch/images/19.jpg
yolo-workshop-batch/images/22.jpg
yolo-workshop-batch/images/25.jpg
yolo-workshop-batch/images/28.jpg
yolo-workshop-batch/images/31.jpg
yolo-workshop-batch/images/34.jpg
yolo-workshop-batch/images/37.jpg
yolo-workshop-batch/images/4.jpg
yolo-workshop-batch/images/40.jpg
yolo-workshop-batch/images/43.jpg
yolo-workshop-batch/images/46.jpg
yolo-workshop-batch/images/49.jpg
yolo-workshop-batch/images/52.jpg
yolo-workshop-batch/images/55.jpg
yolo-workshop-batch/images/58.jpg
yolo-workshop-batch/images/61.jpg
yolo-workshop-batch/images/64.jpg
yolo-wor

## Step 4: Set up the SageMaker Ground Truth labeling workforce

SageMaker Ground Truth offers three workforce options:

- Amazon Mechanical Turk
- Private
- Vendor

For more details, see [**Use Amazon SageMaker Ground Truth for Labeling**](https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html).

In this notebook, we use a *Private workforce* to draw bounding boxes on our data. A private workforce lets your own employees or contractors handle the data within your organization.

**In the [AWS console](https://console.aws.amazon.com)**, under *Services* go to *Amazon SageMaker* and select *Ground Truth > Labeling workforces* from the side-bar menu on the left. Select the *Private* tab at the top, then click *Create private team* to create the private workforce.

<img src="Assets/PrivateWorkforce.png" />

After the workforce is created:

- The *Private* tab displays a link under *Labeling portal sign-in URL*
- Workers receive an email with a temporary password

When a worker visits the link and logs in for the first time (using their email address as the username and the temporary password), they must reset the password.

After resetting the password, the worker is ready to start!

## Step 5: Set up the SageMaker Ground Truth labeling job

Now that our images and a manifest file listing them are ready in S3, we'll set up the Ground Truth labeling job **in the [AWS console](https://console.aws.amazon.com)**.

Under *Services* go to *Amazon SageMaker*, and select *Ground Truth > Labeling Jobs* from the side-bar menu on the left.

### Job Details

Click the **Create labeling job** button, and you'll be asked to specify job details as follows:

* **Job name:** Choose a name to identify this labeling job, e.g. `yolo-workshop-job-0`
* **Label name (The override checkbox):** Consider overriding this to `labels`
* **Input data location:** The path to the input manifest file in S3 (see output above)
* **Output data location:** The path in S3 where the labeled dataset will be stored (e.g. *s3://{BUCKET_NAME}/annotations*, which is the location the later cells read from)
* **IAM role:** If you're not sure whether your existing roles have the sufficient permissions for Ground Truth, select the options to create a new role
* **Task type:** Image > Bounding box

<img src="Assets/SetupGroundTruth.png"/>

All other settings can be left as default. Record your job name below, because we'll need it later; the following cell prints the input and output locations to enter in the console:

In [24]:
job_name = 'yolo-workshop-job-0'
%store job_name

Stored 'job_name' (str)


In [23]:
print(f'input_dataset_location: s3://{bucket.name}/{input_manifest_loc}')
print(f'output_dataset_location: s3://{bucket.name}/annotations')

input_dataset_location: s3://sagemaker-ap-northeast-2-929831892372/yolo-workshop-batch/manifests/input.manifest
output_dataset_location: s3://sagemaker-ap-northeast-2-929831892372/annotations


### Workers
On the next screen, we'll configure who will annotate our data: Ground Truth allows you to define your own in-house Private Workforces; use Vendor Managed Workforces for specialist tasks; or use the public workforce provided by Amazon Mechanical Turk.

Select Private worker type, and you'll be prompted either to select from your existing private workforces, or create a new one if none exist.

If you need to create a new private workforce, simply follow the UI workflow with the default settings. It doesn't matter what you call the workforce, and you can create a new Cognito User Group to define it. Add yourself to the user pool by adding your email address. You should shortly receive a confirmation email with a temporary password and a link to the annotation portal.

Automatic data labeling is applicable only for data sets over 1000 samples, so leave this turned off for now.

<img src="Assets/Workers.png" />

### Labeling Tool
Since you'll be labelling the data yourself, a brief description of the task should be fine in this case. When using real workforces, it's important to be really clear in this section about the task requirements and best practices - to ensure consistency of annotations between human workers.

For example: In the common case where we see a pair of boots from the side and one is almost entirely obscured, how should the image be annotated? Should model cats count, or only real ones?

The most important configuration here is to set the label options to match our `CLASS_NAMES`. This workshop has only one label, *Person*.

<img src="Assets/LabelingTool.png" />

Take some time to explore the other options for configuring the annotation tool; and when you're ready click "Create" to launch the labeling job.

## Step 6: Label those images!

Follow the link you received in your workforce invitation email to the workforce's **labeling portal**, and log in with the default password given in the email (which you'll be asked to change).

If you lose the portal link, you can always retrieve it through the *Ground Truth > Labeling workforces* menu in the SageMaker console. It appears near the top of the private workforce summary.

New jobs can sometimes take a minute or two to appear for workers. Select the job and click "Start working" to enter the labeling tool.

<img src="Assets/WorkerLabelingJobs.png"/>

Once the labeling job has started, you will see the labeling page:

<img src="Assets/WorkerLabelingPage.png"/>

Note that you can check the progress of labeling jobs through the APIs as well as in the AWS console.
A few seconds after the workers finish labeling, the status changes to *Completed*.

In [25]:
smclient.describe_labeling_job(LabelingJobName=job_name)['LabelingJobStatus']

'Completed'
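
Rather than re-running the cell by hand, the status check can be wrapped in a polling loop. This is a rough sketch built on the same `describe_labeling_job` call; `wait_for_labeling_job`, its parameters, and the stub client with its made-up counter values are hypothetical, and the stub is there only so the loop can run offline.

```python
import time

# Statuses after which the job will not change any further.
TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}


def wait_for_labeling_job(sm_client, name, poll_seconds=30, max_polls=120):
    """Poll describe_labeling_job until the job reaches a terminal status.

    Returns the last status seen; gives up after max_polls checks.
    """
    status = None
    for attempt in range(max_polls):
        description = sm_client.describe_labeling_job(LabelingJobName=name)
        status = description['LabelingJobStatus']
        counters = description.get('LabelCounters', {})
        print(f"{status}: {counters.get('TotalLabeled', 0)} labeled, "
              f"{counters.get('Unlabeled', 0)} remaining")
        if status in TERMINAL_STATUSES:
            break
        if attempt < max_polls - 1:
            time.sleep(poll_seconds)
    return status


class _StubClient:
    """Offline stand-in that reports InProgress once, then Completed."""

    def __init__(self):
        self._responses = iter([
            {'LabelingJobStatus': 'InProgress',
             'LabelCounters': {'TotalLabeled': 10, 'Unlabeled': 31}},
            {'LabelingJobStatus': 'Completed',
             'LabelCounters': {'TotalLabeled': 41, 'Unlabeled': 0}},
        ])

    def describe_labeling_job(self, LabelingJobName):
        return next(self._responses)


final_status = wait_for_labeling_job(_StubClient(), 'yolo-workshop-job-0', poll_seconds=0)
```

With real credentials, the call would be `wait_for_labeling_job(smclient, job_name)`. A human labeling job can take much longer than the default polling budget, so the limits are worth adjusting.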

## Step 7: Check the labeling results

When your workers have finished, *output.manifest* is written to the following path:

*s3://{BUCKET_NAME}/annotations/{job_name}/manifests/output/output.manifest*

In [26]:
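output_manifest_path = f'annotations/{job_name}/manifests/output/output.manifest'
output_manifest_obj = bucket.Object(output_manifest_path)
manifest_lines = output_manifest_obj.get()['Body'].read().decode('utf-8').splitlines()

# Each non-empty line is one JSON record; show only the first one.
first_record = json.loads(next(line for line in manifest_lines if line.strip()))
pprint(first_record)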

{'labels': {'annotations': [{'class_id': 0,
                             'height': 383,
                             'left': 61,
                             'top': 32,
                             'width': 105},
                            {'class_id': 0,
                             'height': 409,
                             'left': 113,
                             'top': 0,
                             'width': 345},
                            {'class_id': 0,
                             'height': 97,
                             'left': 33,
                             'top': 151,
                             'width': 36},
                            {'class_id': 0,
                             'height': 93,
                             'left': 10,
                             'top': 168,
                             'width': 25}],
            'image_size': [{'depth': 3, 'height': 416, 'width': 740}]},
 'labels-metadata': {'class-map': {'0': 'Person'},
                     'crea

As you can see, each record has top-level keys such as *source-ref*, *labels*, and *labels-metadata*. The *labels* key is the label name you gave when setting up the labeling job; it contains all the bounding boxes for the *source-ref* image.
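
Each annotation stores a box as `left`, `top`, `width` and `height` in pixels. Many detection pipelines expect corner coordinates instead, so a conversion along the following lines may be useful. This is a sketch only: `to_corner_boxes` is a hypothetical helper, and whether the finetuning code in the follow-on notebooks expects this exact layout is an assumption.

```python
def to_corner_boxes(record, label_key='labels'):
    """Convert Ground Truth boxes (left, top, width, height) into
    (class_id, xmin, ymin, xmax, ymax) tuples in pixel coordinates."""
    boxes = []
    for ann in record[label_key]['annotations']:
        xmin, ymin = ann['left'], ann['top']
        boxes.append((ann['class_id'], xmin, ymin,
                      xmin + ann['width'], ymin + ann['height']))
    return boxes


# Abbreviated record built from the first two boxes in the output above.
sample_record = {
    'labels': {
        'annotations': [
            {'class_id': 0, 'height': 383, 'left': 61, 'top': 32, 'width': 105},
            {'class_id': 0, 'height': 409, 'left': 113, 'top': 0, 'width': 345},
        ],
        'image_size': [{'depth': 3, 'height': 416, 'width': 740}],
    },
}
corner_boxes = to_corner_boxes(sample_record)
print(corner_boxes)
```

The `label_key` argument should match the label name chosen when the job was created (`labels` in this walkthrough).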

In the next chapter, we will create a SageMaker Hyperparameter Optimization (HPO) job with this *output.manifest*.