# DeepLens Outware faces training

This is an end-to-end template showing how to train an image classification model and run inference with it in Amazon SageMaker. If you want to train your own model, make a copy of this notebook and modify it from there.

In this template, we use faces extracted from Outware photos as training data. See the `deeplens_sagemaker_data_preparation.ipynb` notebook for how to prepare the data for training.

## Image Classification

### Initial configuration

Set up the SageMaker job configuration. We use the image classification container image provided by Amazon.

Please modify *bucket* and *train_name* to match your data.

In [None]:
%%time
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

# Customize to your S3 bucket
bucket='deeplens-sagemaker-2bbe16b4-c056-4ae2-9332-d31dd7aeb470'
train_name='gender'

# Training image container
containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/image-classification:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/image-classification:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/image-classification:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/image-classification:latest'}
training_image = containers[boto3.Session().region_name]
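The lookup above raises a bare `KeyError` when the notebook runs in a region that is not in the dictionary. A minimal sketch of a friendlier lookup follows; the helper name `resolve_training_image` is hypothetical, and the map is copied from the cell above rather than fetched from any API.

```python
# Region-to-image map copied from the configuration cell above.
CONTAINERS = {
    'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/image-classification:latest',
    'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/image-classification:latest',
    'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/image-classification:latest',
    'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/image-classification:latest',
}


def resolve_training_image(region, containers=CONTAINERS):
    """Return the image URI for `region`, or raise a readable error.

    Hypothetical helper: it only knows the regions listed in `containers`.
    """
    try:
        return containers[region]
    except KeyError:
        supported = ', '.join(sorted(containers))
        raise ValueError(
            'No image-classification image listed for region {!r}; '
            'listed regions: {}'.format(region, supported)
        ) from None


print(resolve_training_image('us-west-2'))
```

In the notebook this would be called as `resolve_training_image(boto3.Session().region_name)`.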

## Hyperparameters

See the detailed explanation of each hyperparameter [here](https://docs.aws.amazon.com/sagemaker/latest/dg/IC-Hyperparameter.html).

You will need to modify these hyperparameters to match your training data and requirements.
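Two of the values in the next cell, `num_training_samples` and `num_classes`, are line counts of the training `.lst` file and the labels file. A minimal sketch of deriving them instead of typing them by hand is shown here. The file names and contents are made up for the demo, and the `.lst` layout (index, label, path separated by tabs) is an assumption about how the data preparation notebook writes it.

```python
import os
import tempfile


def count_lines(path):
    """Count the non-empty lines in a text file."""
    with open(path) as f:
        return sum(1 for line in f if line.strip())


# Demo with throwaway files; replace the paths with your own
# YOUR_DATA_PREFIX_train.lst and labels file.
with tempfile.TemporaryDirectory() as tmp:
    lst_path = os.path.join(tmp, 'demo_train.lst')   # hypothetical file name
    labels_path = os.path.join(tmp, 'demo_labels')   # hypothetical file name
    with open(lst_path, 'w') as f:
        f.write('0\t0\tfaces/a.jpg\n1\t1\tfaces/b.jpg\n2\t0\tfaces/c.jpg\n')
    with open(labels_path, 'w') as f:
        f.write('class_a 0\nclass_b 1\n')

    # SageMaker expects hyperparameters as strings.
    num_training_samples = str(count_lines(lst_path))
    num_classes = str(count_lines(labels_path))

print(num_training_samples, num_classes)
```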

In [None]:
# Use pre-trained model to populate parameters
use_pretrained_model = "1"
# Checkpoint frequency, for example, for checkpoint_frequency = "10", the training job will save the model artifact every 10 epochs
checkpoint_frequency = "5"
# The algorithm supports multiple network depths (number of layers): 18, 34, 50, 101, 152 and 200
# For this training, we will use 50 layers
num_layers = "50"
# we need to specify the input image shape for the training data
image_shape = "3,224,224"
# we also need to specify the number of training samples in the training set
# num_training_samples is the number of lines in our YOUR_DATA_PREFIX_train.lst file
num_training_samples = "1294"
# specify the number of output classes
# num_classes is the number of lines in our outwarians_labels file
num_classes = "2"
# batch size for training
# make sure it is not too big 
mini_batch_size = "32"
# number of epochs
epochs = "100"
# optimizer
optimizer = "sgd"
# learning rate
learning_rate = "0.1"
# Decrease factor for learning rate
lr_scheduler_factor = "0.1"
# Epoch number when decrease in learning rate should happen, comma separated
lr_scheduler_step = "80,90"
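To see what `learning_rate`, `lr_scheduler_factor` and `lr_scheduler_step` mean together, the sketch below computes the learning rate per epoch for a step schedule. It assumes the rate is multiplied by the factor once for each step epoch that has been reached; the exact epoch boundary used inside the SageMaker algorithm is not confirmed by this notebook, and `learning_rate_at` is a hypothetical helper.

```python
def learning_rate_at(epoch, base_lr, factor, steps):
    """Learning rate at `epoch` under a step schedule.

    Assumption: the rate drops by `factor` at each epoch listed in `steps`.
    """
    drops = sum(1 for step in steps if epoch >= step)
    return base_lr * factor ** drops


# Values from the hyperparameter cell above, parsed from their string form.
base_lr = float("0.1")
factor = float("0.1")
steps = [int(s) for s in "80,90".split(",")]

for epoch in (0, 79, 80, 90, 99):
    print(epoch, learning_rate_at(epoch, base_lr, factor, steps))
```

Under that assumption the 100 epochs run at roughly 0.1 until epoch 80, 0.01 until epoch 90, and 0.001 afterwards.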

## Training parameters

In this section, we set up the training parameters. Note that image classification needs two input channels: the train and validation folders in the S3 bucket.

In [None]:
%%time
import time
import boto3
from time import gmtime, strftime


# create unique job name 
job_name_prefix = 'sagemaker-{}-training-notebook'.format(train_name)
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp
training_params = \
{
    # specify the training docker image
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": 's3://{}/model/{}/{}/output'.format(bucket, train_name, job_name_prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.p2.xlarge",
        "VolumeSizeInGB": 100
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
        "use_pretrained_model": str(use_pretrained_model),
        "checkpoint_frequency": str(checkpoint_frequency),
        "optimizer": str(optimizer),
        "image_shape": image_shape,
        "num_layers": str(num_layers),
        "num_training_samples": str(num_training_samples),
        "num_classes": str(num_classes),
        "mini_batch_size": str(mini_batch_size),
        "epochs": str(epochs),
        "learning_rate": str(learning_rate),
        "lr_scheduler_factor": str(lr_scheduler_factor),
        "lr_scheduler_step": str(lr_scheduler_step),
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 43200
    },
    # Training data should be inside a subdirectory called "train"
    # Validation data should be inside a subdirectory called "validation"
    # The algorithm currently only supports FullyReplicated mode (where data is copied onto each machine)
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": 's3://{}/train/{}/'.format(bucket, train_name),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-recordio",
            "CompressionType": "None"
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": 's3://{}/validation/{}/'.format(bucket, train_name),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "application/x-recordio",
            "CompressionType": "None"
        }
    ]
}
print('Training job name: {}'.format(job_name))
print('\nTraining Data Location: {}'.format(training_params['InputDataConfig'][0]['DataSource']['S3DataSource']))
print('\nValidation Data Location: {}'.format(training_params['InputDataConfig'][1]['DataSource']['S3DataSource']))
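Because the job name is built from `train_name`, a value with underscores or a long prefix can produce a name the API rejects. The sketch below checks the name before the request is sent. The limits used (at most 63 characters, alphanumerics separated by hyphens) reflect the CreateTrainingJob constraints as best understood here and should be verified against the current API reference; `make_job_name` is a hypothetical helper.

```python
import re
import time

# Assumed constraints for TrainingJobName; verify against the API reference.
JOB_NAME_PATTERN = re.compile(r'^[a-zA-Z0-9](-*[a-zA-Z0-9])*$')
JOB_NAME_MAX_LENGTH = 63


def make_job_name(prefix, now=None):
    """Append a UTC timestamp to `prefix` and validate the result."""
    if now is None:
        now = time.gmtime()
    name = prefix + time.strftime('-%Y-%m-%d-%H-%M-%S', now)
    if len(name) > JOB_NAME_MAX_LENGTH:
        raise ValueError('Job name is {} characters, limit is {}: {}'.format(
            len(name), JOB_NAME_MAX_LENGTH, name))
    if not JOB_NAME_PATTERN.match(name):
        raise ValueError('Job name has unsupported characters: {}'.format(name))
    return name


print(make_job_name('sagemaker-gender-training-notebook'))
```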

## Training model

Create a training job with the parameters we set in the earlier step. It starts the instance type we specified above, "ml.p2.xlarge", which has 4 vCPUs, an NVIDIA K80 GPU with 12 GB of GPU memory and 61 GB of system RAM; we attach 100 GB of EBS storage. This instance costs $0.90/hr per node.

Once the training job has started you can go to [Console > SageMaker > Jobs](https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/jobs) to check the latest job progress.

In [None]:
# create the Amazon SageMaker training job
sagemaker = boto3.client(service_name='sagemaker')
sagemaker.create_training_job(**training_params)

# confirm that the training job has started
training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
status = training_info['TrainingJobStatus'] + ": " + training_info["SecondaryStatus"]
print('Training job current status: {}'.format(status))

try:
    # wait for the job to finish and report the ending status
    sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
    training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    status = training_info['TrainingJobStatus'] + ": " + training_info["SecondaryStatus"]
    print("Training job ended with status: " + status)
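except Exception as error:
    # the waiter raises if the job fails or if waiting times out
    print('Training did not complete successfully: {}'.format(error))
    training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    # FailureReason is only present when the job has actually failed
    message = training_info.get('FailureReason', 'no failure reason reported')
    print('Training failed with the following error: {}'.format(message))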

In [None]:
training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
status = training_info['TrainingJobStatus']
print("Training job ended with status: " + status)
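If you prefer to see progress while waiting, the waiter can be replaced by an explicit polling loop. The following is a minimal sketch, not the notebook's method: `wait_for_training_job` is a hypothetical helper, the default of 1440 polls at 30 seconds is chosen to match the 43200-second `MaxRuntimeInSeconds` above, and a stub stands in for the SageMaker client so the example runs offline.

```python
import time

# Statuses after which a training job no longer changes state.
TERMINAL_STATUSES = {'Completed', 'Failed', 'Stopped'}


def wait_for_training_job(describe, job_name, poll_seconds=30,
                          max_polls=1440, sleep=time.sleep):
    """Poll `describe` until the job reaches a terminal status.

    `describe` is any callable that accepts TrainingJobName=... and returns
    a dict shaped like the describe_training_job response.
    Returns a (status, failure_reason) tuple.
    """
    for _ in range(max_polls):
        info = describe(TrainingJobName=job_name)
        status = info['TrainingJobStatus']
        if status in TERMINAL_STATUSES:
            return status, info.get('FailureReason')
        sleep(poll_seconds)
    raise TimeoutError('Job {} still running after {} polls'.format(
        job_name, max_polls))


# Stub responses standing in for the real client.
_responses = iter([
    {'TrainingJobStatus': 'InProgress'},
    {'TrainingJobStatus': 'Failed', 'FailureReason': 'demo failure'},
])
demo_result = wait_for_training_job(
    lambda **kwargs: next(_responses), 'demo-job', sleep=lambda seconds: None)
print(demo_result)
```

Against the real service the call would be `wait_for_training_job(sagemaker.describe_training_job, job_name)`.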

## Create model

In [None]:
%%time
import time
import boto3

sagemaker = boto3.Session().client(service_name='sagemaker') 

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
model_name="test-{}-classification-model{}".format(train_name, timestamp)
print(model_name)
info = sagemaker.describe_training_job(TrainingJobName=job_name)
model_data = info['ModelArtifacts']['S3ModelArtifacts']
print(model_data)

containers = {'us-west-2': '433757028032.dkr.ecr.us-west-2.amazonaws.com/image-classification:latest',
              'us-east-1': '811284229777.dkr.ecr.us-east-1.amazonaws.com/image-classification:latest',
              'us-east-2': '825641698319.dkr.ecr.us-east-2.amazonaws.com/image-classification:latest',
              'eu-west-1': '685385470294.dkr.ecr.eu-west-1.amazonaws.com/image-classification:latest'}
hosting_image = containers[boto3.Session().region_name]
primary_container = {
    'Image': hosting_image,
    'ModelDataUrl': model_data,
}

create_model_response = sagemaker.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

print(create_model_response['ModelArn'])

## Create endpoint configuration

In [None]:
import time

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_config_name = job_name_prefix + '-epc-' + timestamp
endpoint_config_response = sagemaker.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.m4.xlarge',
        'InitialInstanceCount':1,
        'ModelName':model_name,
        'VariantName':'AllTraffic'}])

print('Endpoint configuration name: {}'.format(endpoint_config_name))
print('Endpoint configuration arn:  {}'.format(endpoint_config_response['EndpointConfigArn']))

## Create endpoint

The endpoint is hosted on ml.m4.xlarge and is billed for as long as it is running. Please delete the endpoint after you finish all the work. The endpoint can always be recreated from the endpoint configuration.

In [None]:
%%time
import time

timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
endpoint_name = job_name_prefix + '-ep-' + timestamp
print('Endpoint name: {}'.format(endpoint_name))

endpoint_params = {
    'EndpointName': endpoint_name,
    'EndpointConfigName': endpoint_config_name,
}
endpoint_response = sagemaker.create_endpoint(**endpoint_params)
print('EndpointArn = {}'.format(endpoint_response['EndpointArn']))

In [None]:
# get the status of the endpoint
response = sagemaker.describe_endpoint(EndpointName=endpoint_name)
status = response['EndpointStatus']
print('EndpointStatus = {}'.format(status))


# wait until the status has changed
sagemaker.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)


# print the status of the endpoint
endpoint_response = sagemaker.describe_endpoint(EndpointName=endpoint_name)
status = endpoint_response['EndpointStatus']
print('Endpoint creation ended with EndpointStatus = {}'.format(status))

if status != 'InService':
    raise Exception('Endpoint creation failed.')

## Inference

In this section we will test our model with a set of images on the S3 bucket.

Update the following environment variables for your project.

In [None]:
%env TRAIN_LABEL=om_gender_label
%env TEST_IMAGE_PATH=test_images

In [None]:
import boto3
runtime = boto3.Session().client(service_name='runtime.sagemaker') 

In [None]:
%%bash

echo "Download test images"
aws s3 cp --recursive "s3://deeplens-sagemaker-2bbe16b4-c056-4ae2-9332-d31dd7aeb470/$TEST_IMAGE_PATH/" ./$TEST_IMAGE_PATH/
echo "Download label file"
aws s3 cp "s3://deeplens-sagemaker-2bbe16b4-c056-4ae2-9332-d31dd7aeb470/$TRAIN_LABEL" .
echo "Download cv2 face detection model"
aws s3 cp "s3://deeplens-sagemaker-2bbe16b4-c056-4ae2-9332-d31dd7aeb470/model/opencv_haarcascade_model/haarcascade_frontalface_default.xml" .

In [None]:
import os

label_file_name = os.environ['TRAIN_LABEL']
test_image_path = os.environ['TEST_IMAGE_PATH']
print(label_file_name)
print(test_image_path)

In [None]:
%%time
import json
import numpy as np
import cv2
from glob import glob

labels = []


def loadLabel():
    with open(label_file_name) as f:
        for line in f:
            (label, index) = line.split()
            labels.append(label)


def findFaces(imageNdarray, classifier):
    faces = classifier.detectMultiScale(imageNdarray,
                                        scaleFactor=1.3,
                                        minNeighbors=5,
                                        minSize=(50, 50),
                                        flags = cv2.CASCADE_SCALE_IMAGE)
    for (x, y, w, h) in faces:
        yield (x, y , w, h)


def queryEndpoint(face):
    face = cv2.resize(face, (224, 224), interpolation = cv2.INTER_CUBIC)
    ret, face = cv2.imencode('.jpg', face)
    if ret:
        # tobytes() replaces the deprecated ndarray.tostring()
        face = bytearray(face.tobytes())
    else:
        print('Failed to prepare image for recognition')
        return

    response = runtime.invoke_endpoint(EndpointName=endpoint_name, 
                                       ContentType='application/x-image', 
                                       Body=face)
    result = response['Body'].read()
    # result will be in json format and convert it to ndarray
    result = json.loads(result)
    # the result will output the probabilities for all classes
    # find the class with maximum probability and print the class index
    index = np.argmax(result)
    prob = result[index]
    thresh = 0.50
    if prob < thresh:
        print("Result: Highest probability below threshold ({:.2f}%).".format(thresh * 100))
    else:
        print("Result: label - {}, probability - {:.2f}%".format(labels[index], result[index] * 100))

    for i in range(0, len(result)):
        print("    Label - {}, probability - {:.2f}%".format(labels[i], result[i] * 100))



def checkFace(fileName, faceCascade):
    image = cv2.imread(fileName, 1)
    faces = findFaces(image, faceCascade)
    for face in faces:
        (x, y, w, h) = face
        face = image[y:y+h, x:x+w]
        queryEndpoint(face)


photos = glob("./{}/*.jpg".format(test_image_path))

loadLabel()

faceCascade = cv2.CascadeClassifier("./haarcascade_frontalface_default.xml")

for photo in photos:
    print(photo)
    checkFace(photo, faceCascade)
    
cv2.destroyAllWindows()
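The decision step inside `queryEndpoint` can be separated from the endpoint call so it can be checked without a live endpoint. A minimal sketch, assuming the response is a flat list of per-class probabilities in label order; `top_prediction` and the label names are hypothetical.

```python
def top_prediction(probabilities, labels, threshold=0.5):
    """Return (label, probability) for the most likely class.

    Returns None when the highest probability is below `threshold`.
    """
    if len(probabilities) != len(labels):
        raise ValueError('Got {} probabilities for {} labels'.format(
            len(probabilities), len(labels)))
    index = max(range(len(probabilities)), key=probabilities.__getitem__)
    probability = probabilities[index]
    if probability < threshold:
        return None
    return labels[index], probability


demo_labels = ['class_a', 'class_b']  # made-up labels for the demo
print(top_prediction([0.2, 0.8], demo_labels))
print(top_prediction([0.45, 0.55], demo_labels, threshold=0.6))
```

If the two class probabilities sum to 1, the highest one is never below 0.5, so a threshold of 0.50 never rejects a prediction; a higher value is needed for the rejection branch to have any effect.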

In [None]:
sagemaker.delete_endpoint(EndpointName=endpoint_name)
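Deleting the endpoint stops the hosting charges, but the endpoint configuration and the model stay registered. The sketch below optionally removes those as well, using the boto3 SageMaker client methods `delete_endpoint_config` and `delete_model`. `delete_hosting_resources` is a hypothetical helper, and the configuration should only be removed if you do not plan to recreate the endpoint from it. A recording stub replaces the client so the example runs offline.

```python
def delete_hosting_resources(client, endpoint_name,
                             endpoint_config_name=None, model_name=None):
    """Delete the endpoint and, if names are given, its config and model."""
    client.delete_endpoint(EndpointName=endpoint_name)
    if endpoint_config_name is not None:
        client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    if model_name is not None:
        client.delete_model(ModelName=model_name)


class RecordingClient:
    """Stub that records calls instead of contacting SageMaker."""

    def __init__(self):
        self.calls = []

    def delete_endpoint(self, **kwargs):
        self.calls.append(('delete_endpoint', kwargs))

    def delete_endpoint_config(self, **kwargs):
        self.calls.append(('delete_endpoint_config', kwargs))

    def delete_model(self, **kwargs):
        self.calls.append(('delete_model', kwargs))


demo_client = RecordingClient()
delete_hosting_resources(demo_client, 'demo-ep', 'demo-epc', 'demo-model')
print([name for name, _ in demo_client.calls])
```

With the notebook's variables, the call would be `delete_hosting_resources(sagemaker, endpoint_name, endpoint_config_name, model_name)`.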

In [None]:
%%bash

echo "Remove test artifacts"
rm ./$TRAIN_LABEL
rm -rf ./$TEST_IMAGE_PATH/
rm ./haarcascade_frontalface_default.xml