# Training Object Detection Models in SageMaker with Augmented Manifests

This notebook demonstrates the use of an "augmented manifest" to train an object detection machine learning model with Amazon SageMaker.

**Note:** This notebook was adapted from https://github.com/awslabs/amazon-sagemaker-examples.git for the Belong Deeplens Innovation Sprint. This lab is based on a small, pre-labelled dataset of 500 images of analogue gauges. The labelling task was to draw a bounding box around the analogue gauge in each image.

The detailed lab guide can be found at:
https://aws-computer-vision.jacobcantwell.com/

The dataset and its augmented manifest file can be found at:
https://aws-computer-vision.jacobcantwell.com/jupyter/analogue-guage-detection.zip

Author: Jacob Cantwell.

## Initialise project variables

Below we initialise the location of files and objects that we need to set up the training job.

In [None]:
import time

## Update the value below to your lab team's S3 bucket
bucket_name = "deeplens-[YOUR TEAM NAME]-belong-lab" # Replace '[YOUR TEAM NAME]' with your lab team's bucket name.

# Create a unique job name
job_name_prefix = '[YOUR TEAM NAME]' # Enter your lab team name or another unique identifier (letters, digits and hyphens only).

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp

## URL of the training images and Augmented Manifest file being used for this lab
# Replace if needed.
training_data_url = "https://aws-computer-vision.jacobcantwell.com/jupyter/analogue-guage-detection.zip"

## Image dataset directory
# Used for both the local and the S3 image directory, so it must match the local directory name of the images.
image_dataset_prefix = 'training-images'

manifest_prefix = 'manifests' # S3 folder to store and process the manifest file. 
train_manifest = 'train.manifest'
validate_manifest = 'validate.manifest'

## Format full path for training and validation manifests
s3_train_manifest = 's3://{}/{}/{}'.format(bucket_name, manifest_prefix, train_manifest)
s3_validate_manifest = 's3://{}/{}/{}'.format(bucket_name, manifest_prefix, validate_manifest)

## Output folder for the training output and the model
s3_output_path = 's3://{}/training-output'.format(bucket_name)

## Print the paths just to validate
print("Training Job Name: {}".format(job_name))
print("S3 Bucket: {}".format(bucket_name))
print("Augmented manifest for training data: {}".format(s3_train_manifest))
print("Augmented manifest for validation data: {}".format(s3_validate_manifest))
print("Output training data path: {}".format(s3_output_path))
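
The job name built above is only accepted if it follows the naming rules for training jobs. As a minimal sketch, assuming the pattern and 63-character limit documented for `TrainingJobName` in the `CreateTrainingJob` API, the check below shows why the `[YOUR TEAM NAME]` placeholder must be replaced before the job is created. The helper name `is_valid_job_name` and the example prefix `team-gauge` are hypothetical.

```python
import re
import time

# Assumed rule: alphanumeric characters separated by hyphens, at most 63 characters.
JOB_NAME_PATTERN = re.compile(r'^[a-zA-Z0-9](-*[a-zA-Z0-9])*$')


def is_valid_job_name(name):
    """Return True if the name satisfies the assumed training job naming rules."""
    return len(name) <= 63 and JOB_NAME_PATTERN.match(name) is not None


# Fixed timestamp so the example is repeatable; the notebook uses the current time.
example_timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime(0))

print(is_valid_job_name('team-gauge' + example_timestamp))        # True
print(is_valid_job_name('[YOUR TEAM NAME]' + example_timestamp))  # False
```

Running a check like this before submitting the job gives a clearer error than a validation failure returned by the API.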

## Setup

Here we define the training image containing the object detection algorithm, and instantiate a SageMaker session.

In [None]:
import boto3
import re
import sagemaker
from sagemaker import get_execution_role
import time
from time import gmtime, strftime
import json

role = get_execution_role()
sess = sagemaker.Session()
s3 = boto3.resource('s3')

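# Note: get_image_uri is the SageMaker Python SDK v1 helper; SDK v2 replaces it with sagemaker.image_uris.retrieve.
training_image = sagemaker.amazon.amazon_estimator.get_image_uri(boto3.Session().region_name, 'object-detection', repo_version='latest')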

print ('Execution Role: {}'.format(role))

## Understanding the Augmented Manifest Format

Augmented manifests provide two key benefits. First, the format is consistent with that of a labelling job output manifest. This means that you can take the output manifests from a Ground Truth labelling job, whether the dataset objects were entirely human-labelled, entirely machine-labelled, or anything in between, and use them as inputs to SageMaker training jobs without any additional translation or reformatting. Second, the dataset objects and their corresponding ground truth labels/annotations are captured *inline*. This effectively reduces the required number of channels by half, since you no longer need one channel for the dataset objects alone and another for the associated ground truth labels/annotations.

The augmented manifest format is essentially the [json-lines format](http://jsonlines.org/), also called the new-line delimited JSON format. This format consists of an arbitrary number of well-formed, fully-defined JSON objects, each on a separate line. Augmented manifests must contain a field that defines a dataset object, and a field that defines the corresponding annotation. Let's look at an example for an object detection problem.


{<span style="color:blue">"source-ref"</span>: "s3://bucket_name/path_to_a_dataset_object.jpeg", <span style="color:blue">"labeling-job-name"</span>: {"annotations":[{"class_id":"0",`<bounding box dimensions>`}],"image_size":[{`<image size dimensions>`}]}}

The Ground Truth output format is discussed more fully for various types of labelling jobs in the [official documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/sms-data-output.html).

The first field will always be either `source` or `source-ref`. This defines an individual dataset object. The name of the second field depends on whether the labelling job was created from the SageMaker console or through the Ground Truth API. If the job was created through the console, then the name of the field will be the labelling job name. Alternatively, if the job was created through the API, then this field maps to the `LabelAttributeName` parameter in the API.

The training job request requires a parameter called `AttributeNames`. This should be a two-element list of strings, where the first string is "source-ref", and the second string is the label attribute name from the augmented manifest. This corresponds to the <span style="color:blue">blue text</span> in the example above. In this case, we would define `attribute_names = ["source-ref", "labeling-job-name"]`.

*Be sure to carefully inspect your augmented manifest so that you can define the `attribute_names` variable below.*

The key feature of the augmented manifest is that it has both the data object itself (i.e., the image), and the annotation in-line in a single JSON object. Note that the `annotations` keyword contains dimensions and coordinates (e.g., width, top, height, left) for bounding boxes! The augmented manifest can contain an arbitrary number of lines, as long as each line adheres to this format.

Let's discuss this format in more detail by describing each parameter of this JSON object format.

* The `source-ref` field defines a single dataset object, which in this case is an image over which bounding boxes should be drawn. Note that the name of this field is arbitrary. 
* The label attribute field (`labeling-job-name` in the example above) defines the ground truth bounding box annotations that pertain to the image identified in the `source-ref` field. As mentioned above, note that the name of this field is arbitrary. You must take care to define this field in the `AttributeNames` parameter of the training job request, as shown later on in this notebook.
* When an augmented manifest is generated through a Ground Truth labelling job, each line also contains an additional metadata field named after the label attribute with a `-metadata` suffix (for example, `labeling-job-name-metadata`), which is not shown in the example above. This field contains various pieces of metadata from the labelling job that produced the bounding box annotation(s) for the associated image, e.g., the creation date, confidence scores for the annotations, etc. This field is ignored during the training job. However, to make it as easy as possible to translate Ground Truth labelling jobs into trained SageMaker models, it is safe to include this field in the augmented manifest you supply to the training job.
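
To make the field layout concrete, here is a small sketch that parses one manifest line and works out which two field names would go into `AttributeNames`. The sample line, bucket name, and bounding box values are invented for illustration, and the helper `guess_attribute_names` is hypothetical. It assumes a line has exactly one label field besides the source field and any `-metadata` field.

```python
import json

# Hypothetical sample line; bucket, key, and box values are made up.
sample_line = json.dumps({
    "source-ref": "s3://example-bucket/training-images/gauge-001.jpg",
    "labeling-job-name": {
        "annotations": [
            {"class_id": 0, "left": 120, "top": 80, "width": 200, "height": 210}
        ],
        "image_size": [{"width": 640, "height": 480, "depth": 3}],
    },
    "labeling-job-name-metadata": {"creation-date": "2019-01-01T00:00:00"},
})


def guess_attribute_names(manifest_line):
    """Return [source field, label field] for one augmented manifest line."""
    record = json.loads(manifest_line)
    source_field = next((k for k in record if k in ("source-ref", "source")), None)
    if source_field is None:
        raise ValueError("No 'source-ref' or 'source' field found")
    # The label field is whatever remains once the source and metadata fields are excluded.
    label_fields = [
        k for k in record
        if k != source_field and not k.endswith("-metadata")
    ]
    if len(label_fields) != 1:
        raise ValueError("Expected exactly one label field, found: {}".format(label_fields))
    return [source_field, label_fields[0]]


print(guess_attribute_names(sample_line))  # ['source-ref', 'labeling-job-name']
```

A manifest with several label attributes would need a manual choice instead, which is why this sketch raises an error in that case.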

## Download the Augmented Manifest and Training Images

Download the training images and manifest file to this Notebook's workspace.
**Note:** For larger datasets it's more efficient to download directly to S3 instead of to the notebook.


In [None]:
import os
import urllib.request
from zipfile import ZipFile

temp_zipfile = './analogue-guage-detection.zip'
## Download the analogue gauge training manifest and image dataset as the ZIP attached to this lab.
urllib.request.urlretrieve(training_data_url, temp_zipfile)

## Unzip the image set
with ZipFile(temp_zipfile, 'r') as zipObj:
    # Extract all the contents of the ZIP file into the current directory
    zipObj.extractall()

## Delete the downloaded ZIP file to conserve space
os.remove(temp_zipfile)

## Print out the files in the local workspace as validation
print('Files in local workspace:')
os.listdir()


## Separate the Manifest into Training and Validation Files

While building a model, SageMaker needs some images to be excluded from the training process so that they can be used to validate the accuracy of the model being developed. If the model is validated on images it has already seen during training, the measured accuracy will be artificially high.

To allow for this, we separate a small subset (in this case 10%) of the images listed in the augmented manifest file and add them to the validation manifest. The remaining labelled images listed in the augmented manifest are saved into the training manifest.


In [None]:
import random

# Create the training and validation manifests from the labelled Augmented Manifest
print ('Creating the training and validation manifests from the labelled augmented manifest:')

local_manifest = 'augmented-manifest.json'
validation_ratio = 0.1      # Ratio of images to separate to validation manifest

##############################################################
# Get all manifest source-ref lines into sourceref_array
sourceref_array = []
print('Reading Augmented Manifest file to sourceref_array:')
with open(local_manifest) as manifest_file:
    for line in manifest_file:
        # Localref replaces a placeholder S3 bucket in the manifest with the local value.
        localref = line.replace( '[BUCKET_AND_PATH]', '{}/{}'.format(bucket_name, image_dataset_prefix))
        sourceref_array.append(localref)

dataset_size = len(sourceref_array)
print ('Found: {} image source-refs in augmented manifest'.format(dataset_size))
print ('complete.')

##############################################################
# Calculate training and validation image manifest lengths.
print ('\nCalculate training and validation manifest lengths at {} of complete image dataset.'.format(validation_ratio))
validation_size = int(round(dataset_size * float(validation_ratio)))
training_size = dataset_size - validation_size

print ('Total Dataset Images: {}'.format(dataset_size))
print ('Training Images: {}'.format(training_size))
print ('Validation Images: {}'.format(validation_size))
print ('complete.')

##############################################################
# Get random image references for validation manifest and write to workspace
print ('Nominate {} random image references for validation manifest'.format(validation_size))
validation_array = []

for i in range(validation_size):
    # Get the current size of sourceref_array, which shrinks as items are popped.
    dataset_remain_size = len(sourceref_array)
    # Pick a random index from 0 up to (but not including) the current size of sourceref_array
    rand_val = random.randrange(0, dataset_remain_size)
    # Pop the random key value off sourceref_array and into the validation_array
    validation_array.append(sourceref_array.pop(rand_val))

print ('{} image refs applied to validation manifest'.format(len(validation_array)))
print ('complete.')

##############################################################
# Write training and validation manifests to workspace
print ('\nWrite training and validation manifests to workspace:')

# Write sourceref_array lines not split out to validation_array to training manifest file
print ('Write Training manifest to: {}'.format(train_manifest))
with open(train_manifest, 'w') as f:
    for line in sourceref_array:
        f.write(line)

print ('complete.')
  
# Write validation_array lines to validation manifest file
print ('Write Validation manifest to: {}'.format(validate_manifest))
with open(validate_manifest, 'w') as f:
    for line in validation_array:
        f.write(line)
print ('complete.')
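
The same split can be expressed more compactly with a seeded shuffle, which also makes the split repeatable between runs. This is an alternative sketch, not the method used in the cell above. The function name `split_manifest_lines`, the seed value, and the example lines are assumptions for illustration.

```python
import random


def split_manifest_lines(lines, validation_ratio=0.1, seed=None):
    """Shuffle a copy of the lines and return (training_lines, validation_lines)."""
    rng = random.Random(seed)  # a fixed seed gives the same split every run
    shuffled = list(lines)     # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    validation_size = int(round(len(shuffled) * validation_ratio))
    return shuffled[validation_size:], shuffled[:validation_size]


# 500 made-up manifest lines, matching the size of the lab dataset.
example_lines = [
    '{{"source-ref": "s3://example-bucket/img-{}.jpg"}}\n'.format(i)
    for i in range(500)
]
train_lines, validation_lines = split_manifest_lines(example_lines, 0.1, seed=42)
print(len(train_lines), len(validation_lines))  # 450 50
```

Because every line ends up in exactly one of the two lists, no image can appear in both the training and the validation manifest.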

## Upload the Training Image Dataset and Manifest Files to S3 for SageMaker

In the following steps we are going to launch a dedicated instance to build and train the object detection model. Because this instance doesn't have access to the dataset and manifest files that were processed in this notebook, we need to upload them to S3 so the training instance can access them.


In [None]:
# Upload the training manifest to S3
print ('Uploading {} to {}'.format(train_manifest, s3_train_manifest))
s3.meta.client.upload_file(train_manifest, bucket_name, '{}/{}'.format(manifest_prefix, train_manifest))
print ('complete\n')

# Upload the validation manifest to S3
print ('Uploading {} to {}'.format(validate_manifest, s3_validate_manifest))
s3.meta.client.upload_file(validate_manifest, bucket_name, '{}/{}'.format(manifest_prefix, validate_manifest))
print ('complete\n')

# Upload the training image dataset
print ('Uploading training images in {} to {}'.format(image_dataset_prefix, bucket_name))
for filename in os.listdir(image_dataset_prefix):
    image_path = '{}/{}'.format(image_dataset_prefix, filename)
    s3.meta.client.upload_file(image_path, bucket_name, image_path)
    print ('Successfully uploaded Image: {}'.format(image_path))

print ('Upload Complete\n\n')


## Preview Input Data

Let's read the augmented manifest so we can inspect its contents to better understand the format and to verify that it's now accessible from S3.

In [None]:
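# The manifest key is the manifest prefix plus the file name, as used for the upload above.
augmented_manifest_s3_key = '{}/{}'.format(manifest_prefix, train_manifest)
s3_obj = s3.Object(bucket_name, augmented_manifest_s3_key)
augmented_manifest = s3_obj.get()['Body'].read().decode('utf-8')
# Drop empty lines (for example, after a trailing newline) so they are not counted as samples.
augmented_manifest_lines = [line for line in augmented_manifest.splitlines() if line.strip()]

num_training_samples = len(augmented_manifest_lines) # Compute number of training samples for use in training job request.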


print('Preview of Augmented Manifest File Contents')
print('-------------------------------------------')
print('\n')

for i in range(2):
    print('Line {}'.format(i+1))
    print(augmented_manifest_lines[i])
    print('\n')
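
Two small details matter when reading a manifest back from S3: deriving the object key from an `s3://` URI, and counting records without being thrown off by a trailing newline. The sketch below shows one way to handle both with the standard library only. The helper names `split_s3_uri` and `count_manifest_records` and the example bucket are hypothetical.

```python
from urllib.parse import urlparse


def split_s3_uri(s3_uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError('Not an S3 URI: {}'.format(s3_uri))
    return parsed.netloc, parsed.path.lstrip('/')


def count_manifest_records(manifest_text):
    """Count non-empty lines, so a trailing newline does not add a record."""
    return sum(1 for line in manifest_text.splitlines() if line.strip())


print(split_s3_uri('s3://example-bucket/manifests/train.manifest'))
# ('example-bucket', 'manifests/train.manifest')
print(count_manifest_records('{"a": 1}\n{"a": 2}\n'))  # 2
```

The record count matters because it is passed to the training job as `num_training_samples`.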

## Create Training Job

### Set the attribute names:
In the previous step you can see that the name of the object containing all the labelling data is `vehicle-class`, so the attribute names for this manifest are `source-ref` and `vehicle-class`. This custom attribute was configured during the labelling task.

In [None]:
# If you are using a different manifest file than the one given then make sure to update this field accordingly.

attribute_names = ["source-ref","vehicle-class"]
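
Since a mismatch between `attribute_names` and the manifest only surfaces once the training job runs, it can help to check the manifest lines first. The following is a minimal sketch of such a check. The function name `find_lines_missing_attributes` and the two example lines are made up for illustration.

```python
import json


def find_lines_missing_attributes(manifest_lines, attribute_names):
    """Return the 1-based numbers of lines that lack any of the attribute names."""
    problems = []
    for line_number, line in enumerate(manifest_lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        if any(name not in record for name in attribute_names):
            problems.append(line_number)
    return problems


# Two hypothetical manifest lines; the second uses a different label attribute.
example_manifest = [
    '{"source-ref": "s3://example-bucket/a.jpg", "vehicle-class": {"annotations": []}}',
    '{"source-ref": "s3://example-bucket/b.jpg", "other-label": {"annotations": []}}',
]
print(find_lines_missing_attributes(example_manifest, ["source-ref", "vehicle-class"]))  # [2]
```

An empty list means every line carries both attributes named in the training job request.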

### Construct Training Parameters:
In this step we construct the parameters for the training job.

+ The required parameters have been derived or entered in the steps above.
+ The hyperparameters are beyond the scope of this lab and have been set to sane defaults.

In [None]:

training_params = \
{
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "Pipe"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": s3_output_path
    },
    "ResourceConfig": {
        "InstanceCount": 1,   
        "InstanceType": "ml.p3.2xlarge",
        "VolumeSizeInGB": 50
    },
    "TrainingJobName": job_name,
    "HyperParameters": { 
         "base_network": "resnet-50",
         "use_pretrained_model": "1",
         "num_classes": "4",
         "mini_batch_size": "1",
         "epochs": "50",
         "learning_rate": "0.001",
         "lr_scheduler_step": "3,6",
         "lr_scheduler_factor": "0.1",
         "optimizer": "rmsprop",
         "momentum": "0.9",
         "weight_decay": "0.0005",
         "overlap_threshold": "0.5",
         "nms_threshold": "0.45",
         "image_shape": "300",
         "label_width": "350",
         "num_training_samples": str(num_training_samples)
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 86400
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "AugmentedManifestFile",
                    "S3Uri": s3_train_manifest,
                    "S3DataDistributionType": "FullyReplicated",
                    "AttributeNames": attribute_names
                }
            },
            "ContentType": "application/x-recordio",
            "RecordWrapperType": "RecordIO",
            "CompressionType": "None"
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "AugmentedManifestFile",
                    "S3Uri": s3_validate_manifest,
                    "S3DataDistributionType": "FullyReplicated",
                    "AttributeNames": attribute_names
                }
            },
            "ContentType": "application/x-recordio",
            "RecordWrapperType": "RecordIO",
            "CompressionType": "None"
        }
    ]
}
 
print('Training job name: {}'.format(job_name))
print('\nInput Data Location: {}'.format(training_params['InputDataConfig'][0]['DataSource']['S3DataSource']))
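
Note that every value under `HyperParameters` above is a string, because the `CreateTrainingJob` request expects a map of strings to strings. If you prefer to keep numeric values in Python while experimenting, a small conversion step such as the sketch below can build the string map. The helper name `stringify_hyperparameters` is hypothetical.

```python
def stringify_hyperparameters(hyperparameters):
    """Convert every hyperparameter value to a string for the training job request."""
    return {name: str(value) for name, value in hyperparameters.items()}


example = stringify_hyperparameters({
    "epochs": 50,
    "learning_rate": 0.001,
    "base_network": "resnet-50",
})
print(example)
# {'epochs': '50', 'learning_rate': '0.001', 'base_network': 'resnet-50'}
```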

## Start the Training Job

Now we create the Amazon SageMaker training job.

In [None]:
client = boto3.client(service_name='sagemaker')
client.create_training_job(**training_params)

# Confirm that the training job has started
status = client.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print('Training job current status: {}'.format(status))


## Monitor Progress of the Training Job

Execute the cell below to get a status update from the training job every 30 seconds.

**Note:** Training is expected to take about 15 minutes. This is a good time to take a short break while the instance spins up and the training returns some meaningful data.

You can also view the progress and results of the training in the SageMaker console.


In [None]:
# Poll the training job every 30 seconds until it reaches a terminal state.
description = client.describe_training_job(TrainingJobName=job_name)
TrainingJobStatus = description['TrainingJobStatus']
SecondaryStatus = description['SecondaryStatus']
print(TrainingJobStatus, SecondaryStatus)
# 'Stopped' is also a terminal state; without it the loop would never exit for a stopped job.
while TrainingJobStatus not in ('Completed', 'Failed', 'Stopped'):
    time.sleep(30)
    description = client.describe_training_job(TrainingJobName=job_name)
    TrainingJobStatus = description['TrainingJobStatus']
    SecondaryStatus = description['SecondaryStatus']
    print(TrainingJobStatus, SecondaryStatus)
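
The polling logic can also be wrapped in a helper that treats every terminal status the same way and gives up after a fixed number of polls. The sketch below takes the describe call as a parameter so that it runs here without AWS access. The function name `wait_for_training_job`, the `max_polls` limit, and the fake responses are assumptions for illustration.

```python
import time

# Statuses after which a training job no longer changes.
TERMINAL_STATUSES = ('Completed', 'Failed', 'Stopped')


def wait_for_training_job(describe, job_name, poll_seconds=30, max_polls=240):
    """Poll `describe` until the job reaches a terminal status or polls run out."""
    for _ in range(max_polls):
        description = describe(TrainingJobName=job_name)
        status = description['TrainingJobStatus']
        print(status, description.get('SecondaryStatus'))
        if status in TERMINAL_STATUSES:
            return description
        time.sleep(poll_seconds)
    raise TimeoutError('Job {} still running after {} polls'.format(job_name, max_polls))


# Stand-in for client.describe_training_job so the sketch runs offline.
_fake_responses = iter([
    {'TrainingJobStatus': 'InProgress', 'SecondaryStatus': 'Starting'},
    {'TrainingJobStatus': 'InProgress', 'SecondaryStatus': 'Training'},
    {'TrainingJobStatus': 'Stopped', 'SecondaryStatus': 'Stopped'},
])


def fake_describe(TrainingJobName):
    return next(_fake_responses)


final = wait_for_training_job(fake_describe, 'example-job', poll_seconds=0)
print(final['TrainingJobStatus'])  # Stopped
```

In the notebook this would be called as `wait_for_training_job(client.describe_training_job, job_name)`. boto3 also provides a built-in waiter, `client.get_waiter('training_job_completed_or_stopped')`, which may be simpler if you do not need the intermediate status printouts.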

## Conclusion

That's it! Let's review what we've learned. 
* Augmented manifests are a new format that provides a seamless interface between Ground Truth labelling jobs and SageMaker training jobs.
* In augmented manifests, you specify the dataset objects and the associated annotations in-line.
* Be sure to pay close attention to the `AttributeNames` parameter in the training job request. The strings you specify in this field must correspond to those that are present in your augmented manifest.