# Train an image classification model on Caltech-256, with automatic model tuning and Elastic Inference


## Introduction

This workshop module is a variant of the Image Classification with Transfer Learning workshop. It is an end-to-end example of image classification using Amazon SageMaker's image classification algorithm, this time with Automatic Model Tuning and Elastic Inference. Again, a model pre-trained on the public ImageNet dataset will be fine-tuned using the [caltech-256 dataset](http://www.vision.caltech.edu/Image_Datasets/Caltech256/).

To get started, we need to set up the environment with a few prerequisite steps, for permissions, configurations, and so on.

## Setup

Let's start by doing a little housework, just to make sure we have the latest versions of everything we need.

Run the cell by clicking either (1) the play symbol that appears to the left of In[] when you hover over it, or (2) the 'Run cell' button in the toolbar above, or (3) using Control + Enter from your keyboard.


In [None]:
!pip uninstall --yes numpy
!pip uninstall --yes numpy
!pip install -U  numpy==1.14 sagemaker

## Prerequisites and Preprocessing

### Permissions and environment variables

Here we set up the linkage and authentication for AWS services. There are two parts to this:

* The S3 bucket that you want to use for training and model data.
* The Amazon SageMaker image classification Docker image which you can use out of the box, without modifications.

#### First part

In [None]:
import boto3
import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()

%env bucket s3://$bucket

#### Second part
Get the URI of the image classification algorithm's Docker image in our region.

In [None]:
region_name = boto3.Session().region_name
algorithm = sagemaker.amazon.amazon_estimator.get_image_uri(region_name, "image-classification", "latest")

print("Using algorithm %s" % algorithm)

## Fine-tuning the Image Classification model

The Caltech-256 dataset consists of images from 257 categories (the last one being a clutter category). It has about 30k images, with a minimum of 80 images and a maximum of about 800 images per category.

The image classification algorithm can take two types of input formats. The first is a [recordio format](https://mxnet.incubator.apache.org/faq/recordio.html), and the other is a [lst format](https://mxnet.incubator.apache.org/faq/recordio.html?highlight=im2rec). Files for both these formats are available at http://data.dmlc.ml/mxnet/data/caltech-256/. In this example, we will use the recordio format for training and use the training/validation split [specified here](http://data.dmlc.ml/mxnet/data/caltech-256/).

In [None]:
%%sh
wget http://data.mxnet.io/data/caltech-256/caltech-256-60-train.rec
wget http://data.mxnet.io/data/caltech-256/caltech-256-60-val.rec

Now we can upload the dataset to S3 and define some S3 locations.

In [None]:
session.upload_data(path='caltech-256-60-train.rec', bucket=bucket, key_prefix='ml-immersion-day/adv-image-train')
session.upload_data(path='caltech-256-60-val.rec',   bucket=bucket, key_prefix='ml-immersion-day/adv-image-validation')

s3_train      = 's3://{}/ml-immersion-day/adv-image-train/'.format(bucket)
s3_validation = 's3://{}/ml-immersion-day/adv-image-validation/'.format(bucket)
s3_output     = 's3://{}/ml-immersion-day/adv-image-output'.format(bucket)

%env s3_train      $s3_train
%env s3_validation $s3_validation
%env s3_output     $s3_output

Let's have a look and make sure everything is in the right place.

In [None]:
%%sh
aws s3 ls $s3_train
aws s3 ls $s3_validation

### Set dataset parameters

Here we tell SageMaker where it can find the datasets for training and validation, and bind them to the two input channels.

In [None]:
train_data = sagemaker.session.s3_input(s3_train, 
                                        distribution='FullyReplicated', 
                                        content_type='application/x-recordio',
                                        s3_data_type='S3Prefix')

validation_data = sagemaker.session.s3_input(s3_validation,
                                             distribution='FullyReplicated', 
                                             content_type='application/x-recordio', 
                                             s3_data_type='S3Prefix')

data_channels = {'train': train_data, 'validation': validation_data}

## Training
Now that we are done with all the setup, we are ready to train our image classification model. To begin, let us create a ``sagemaker.estimator.Estimator`` object. This Estimator will launch the training job.

### Training parameters
There are two kinds of parameters that need to be set for training. The first kind is the parameters for the training job. These include:

* **Training instance count**: This is the number of instances on which to run the training. When the number of instances is greater than one, then the image classification algorithm will run in a distributed cluster. 
* **Training instance type**: This indicates the type of machine on which to run the training. Typically, we use GPU instances for computer vision models such as this one.
* **Output path**: This is the S3 folder in which the training output is stored.

In [None]:
ic = sagemaker.estimator.Estimator(algorithm,
                                   sagemaker.get_execution_role(), 
                                   train_instance_count=1, 
                                   train_instance_type='ml.p3.8xlarge',
                                   output_path=s3_output,
                                   sagemaker_session=session)

Apart from the above set of parameters, there are hyperparameters that are specific to the algorithm. These are:

* **num_layers**: The number of layers (depth) for the network. We use 18 in this sample, but other values such as 50 or 152 can be used.
* **use_pretrained_model**: Set to 1 to use a pretrained model for transfer learning.
* **image_shape**: The input image dimensions, 'num_channels, height, width', for the network. It should be no larger than the actual image size. The number of channels should be the same as in the actual image.
* **num_classes**: This is the number of output classes for the new dataset. ImageNet was trained with 1000 output classes, but the number of output classes can be changed for fine-tuning. For Caltech, we use 257 because it has 256 object categories + 1 clutter class.
* **num_training_samples**: This is the total number of training samples. It is set to 15420 for the Caltech dataset with the current split.
* **mini_batch_size**: The number of training samples used for each mini batch. In distributed training, the number of training samples used per batch will be N * mini_batch_size, where N is the number of hosts on which training is run.
* **epochs**: Number of training epochs.
* **learning_rate**: Learning rate for training.
* **precision_dtype**: Training datatype precision (default: float32). If set to 'float16', the training will be done in mixed_precision mode and will be faster than float32 mode.
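
The sample-count and batch-size arithmetic in the list above can be checked with a short sketch. It assumes that the `60` in the `caltech-256-60-train.rec` file name means 60 training images per class, which agrees with the value 15420 used for `num_training_samples` in this notebook. The `mini_batch_size` and host count are illustrative values, not settings from this workshop.

```python
import math

NUM_CLASSES = 257        # 256 object categories + 1 clutter class
IMAGES_PER_CLASS = 60    # assumption: read off the "-60-" in the .rec file name

num_training_samples = NUM_CLASSES * IMAGES_PER_CLASS


def effective_batch_size(mini_batch_size, num_hosts):
    """Samples consumed per step across all hosts (N * mini_batch_size)."""
    return mini_batch_size * num_hosts


def steps_per_epoch(num_samples, mini_batch_size, num_hosts=1):
    """Batches needed to cover every sample once, counting a final partial batch."""
    return math.ceil(num_samples / effective_batch_size(mini_batch_size, num_hosts))


print(num_training_samples)                        # 15420
print(effective_batch_size(128, num_hosts=2))      # 256
print(steps_per_epoch(num_training_samples, 128))  # 121
```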


In [None]:
ic.set_hyperparameters(num_layers=18,               # Train a Resnet-18 model
                       use_pretrained_model=1,      # Fine-tune on our dataset
                       num_classes=257,             # 256 classes + 1 clutter class
                       num_training_samples=15420,  # Number of training samples
                       optimizer='nag',
                       epochs=10,
                       augmentation_type='crop_color_transform' # Add altered images
                      )

### Configure model tuning job

Rather than fixing every hyperparameter, we give the tuner a range to search for some of them, here `mini_batch_size` and `learning_rate`.
[See here for the details on these HyperParameters](https://docs.aws.amazon.com/sagemaker/latest/dg/IC-Hyperparameter.html)

In [None]:
from sagemaker.tuner import CategoricalParameter,IntegerParameter, ContinuousParameter

hyperparameter_ranges = {
                        'mini_batch_size': IntegerParameter(128, 2048),
                        'learning_rate': ContinuousParameter(0.001, 0.1, scaling_type='Logarithmic'),
                        }
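
To see why a logarithmic scale suits a range like 0.001 to 0.1, the following sketch compares uniform sampling on a linear scale with uniform sampling on a log scale. It only illustrates the effect of the scaling; it is not how SageMaker's tuner picks values, since the tuner uses its own search strategy. The helper names are hypothetical.

```python
import math
import random


def sample_linear(low, high, rng):
    """Draw uniformly between low and high."""
    return rng.uniform(low, high)


def sample_log(low, high, rng):
    """Draw the exponent uniformly, then map back to the original scale."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def share_below(samples, threshold):
    """Fraction of samples that fall below the threshold."""
    return sum(s < threshold for s in samples) / len(samples)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10000
linear = [sample_linear(0.001, 0.1, rng) for _ in range(n)]
logarithmic = [sample_log(0.001, 0.1, rng) for _ in range(n)]

# On a linear scale only about 9% of draws land in [0.001, 0.01);
# on a log scale each decade gets roughly half of the draws.
print("linear share below 0.01:", share_below(linear, 0.01))
print("log share below 0.01:   ", share_below(logarithmic, 0.01))
```

With linear scaling, small learning rates would rarely be tried at all, which is why the log scale is the usual choice for parameters that span orders of magnitude.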

In [None]:
objective_metric_name = 'validation:accuracy'
objective_type = 'Maximize'

In [None]:
from sagemaker.tuner import HyperparameterTuner

tuner = HyperparameterTuner(ic,
                            objective_metric_name,
                            hyperparameter_ranges,
                            objective_type=objective_type,
                            max_jobs=10,
                            max_parallel_jobs=2)

## Start the training

Now we can start the tuning job, which launches the individual training jobs, by calling the `fit` method of the `HyperparameterTuner` object.

In [None]:
tuner.fit(inputs=data_channels, logs=True)

## Show the best tuning job so far

We kick off ten training jobs, with two running concurrently to stay within default AWS account limits. In a production environment you can scale this out horizontally and run all ten concurrently if you wish; the total billable time will be roughly the same.
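
The trade-off between concurrency and elapsed time can be sketched as follows. This is a rough model that assumes every training job takes the same time; the 8-minute figure is an illustrative assumption, and `tuning_time_estimate` is a hypothetical helper, not part of the SageMaker SDK.

```python
import math


def tuning_time_estimate(max_jobs, max_parallel_jobs, minutes_per_job):
    """Return (waves, wall-clock minutes, billable minutes) for a tuning job."""
    waves = math.ceil(max_jobs / max_parallel_jobs)  # sequential rounds of jobs
    wall_clock = waves * minutes_per_job             # time you wait
    billable = max_jobs * minutes_per_job            # instance time you pay for
    return waves, wall_clock, billable


for parallel in (2, 10):
    waves, wall_clock, billable = tuning_time_estimate(10, parallel, minutes_per_job=8)
    print("parallel=%d: %d waves, %d min wall clock, %d min billable"
          % (parallel, waves, wall_clock, billable))
# parallel=2: 5 waves, 40 min wall clock, 80 min billable
# parallel=10: 1 waves, 8 min wall clock, 80 min billable
```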

In [None]:
# Get tuning job name
job_name = tuner.latest_tuning_job.job_name
print(job_name)

In [None]:
# Show best tuning job
from pprint import pprint
import time

# Use a distinct name for the low-level client so it does not shadow the `sagemaker` module
sm_client = boto3.Session().client(service_name='sagemaker')
best_job_yet = None
last_job_count = -1

tuning_job_result = sm_client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=job_name)
status = tuning_job_result['HyperParameterTuningJobStatus']
print(status)
print("each batch can take approx. 7 or 8 minutes")
print("don't move on until this has completed for all 10 jobs")
counter = 0
# Poll every 30 seconds, for at most 60 polls (about 30 minutes)
while status != 'Completed' and counter < 60:

    if tuning_job_result.get('BestTrainingJob', None):
        current_job = tuning_job_result['BestTrainingJob']
        if current_job != best_job_yet:
            best_job_yet = current_job
            print("Best model found so far:")
            pprint(tuning_job_result['BestTrainingJob'])

    job_count = tuning_job_result['TrainingJobStatusCounters']['Completed']
    if job_count != last_job_count:
        last_job_count = job_count
        print("%d training jobs have completed out of 10" % job_count)
        
    # Sleep for 30 seconds, then refresh the job description and status
    time.sleep(30)
    counter += 1
    tuning_job_result = sm_client.describe_hyper_parameter_tuning_job(HyperParameterTuningJobName=job_name)
    status = tuning_job_result['HyperParameterTuningJobStatus']
    if status == 'Failed':
        raise Exception('Job failed because :: {}'.format(tuning_job_result['FailureReason']))
if tuning_job_result.get('BestTrainingJob', None):
    print("Best model after training:")
    pprint(tuning_job_result['BestTrainingJob'])
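
Beyond the single best job, it can be useful to compare all finished training jobs. The sketch below ranks job summaries by their objective value. It assumes the summaries have the shape of the `TrainingJobSummaries` entries returned by the boto3 SageMaker client's `list_training_jobs_for_hyper_parameter_tuning_job` call; the helper name and the example values are hypothetical.

```python
METRIC_KEY = "FinalHyperParameterTuningJobObjectiveMetric"


def rank_training_jobs(summaries):
    """Return (name, objective value, tuned hyperparameters) tuples, best first."""
    # Jobs that are still running have no final objective metric yet
    finished = [s for s in summaries if METRIC_KEY in s]
    ranked = sorted(finished, key=lambda s: s[METRIC_KEY]["Value"], reverse=True)
    return [(s["TrainingJobName"], s[METRIC_KEY]["Value"], s["TunedHyperParameters"])
            for s in ranked]


# Hypothetical summaries standing in for a real API response.
# In the notebook they could come from something like:
#   boto3.client("sagemaker").list_training_jobs_for_hyper_parameter_tuning_job(
#       HyperParameterTuningJobName=job_name)["TrainingJobSummaries"]
example_summaries = [
    {"TrainingJobName": "job-a",
     "TunedHyperParameters": {"learning_rate": "0.05", "mini_batch_size": "256"},
     METRIC_KEY: {"MetricName": "validation:accuracy", "Value": 0.71}},
    {"TrainingJobName": "job-b",
     "TunedHyperParameters": {"learning_rate": "0.004", "mini_batch_size": "128"},
     METRIC_KEY: {"MetricName": "validation:accuracy", "Value": 0.83}},
    {"TrainingJobName": "job-c",  # still in progress, so no metric yet
     "TunedHyperParameters": {"learning_rate": "0.01", "mini_batch_size": "512"}},
]

for name, value, params in rank_training_jobs(example_summaries):
    print(name, value, params)
```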


### Deploy the best model using Elastic Inference

***

A trained model does nothing on its own. We now want to use the model to perform inference. For this example, that means predicting the class of a given image. To deploy the model from the best training job, we simply use the `deploy` method of the tuner. Unlike the previous module, we add an accelerator type and use a cheaper instance from the ml family.


In [None]:
ic_predictor = tuner.deploy(initial_instance_count=1,
                         instance_type='ml.c5.large',        # $0.134/hour in eu-west-1
                         accelerator_type='ml.eia1.medium')  # $0.140/hour in eu-west-1

#ic_predictor = ic.deploy(initial_instance_count=1,
#                         instance_type='ml.p2.xlarge')     # $1.361/hour in eu-west-1

The combination of c5.large and eia1.medium gives you performance comparable to p2.xlarge at roughly an ***80% discount***.

You'll save ***$782 per instance per month***. 
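
The figures above follow from the hourly prices quoted in the deploy cell. The sketch below reproduces the arithmetic, assuming a 30-day month of continuous use; the prices are the eu-west-1 figures from this workshop and may since have changed.

```python
C5_LARGE = 0.134      # $/hour for ml.c5.large (from the deploy cell)
EIA1_MEDIUM = 0.140   # $/hour for ml.eia1.medium (from the deploy cell)
P2_XLARGE = 1.361     # $/hour for ml.p2.xlarge (from the deploy cell)
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, endpoint always on

ei_hourly = C5_LARGE + EIA1_MEDIUM
discount = 1 - ei_hourly / P2_XLARGE
monthly_saving = (P2_XLARGE - ei_hourly) * HOURS_PER_MONTH

print("Elastic Inference endpoint: $%.3f/hour" % ei_hourly)  # $0.274/hour
print("Discount vs p2.xlarge: %.1f%%" % (discount * 100))    # 79.9%
print("Monthly saving: $%.2f" % monthly_saving)              # $782.64
```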

### Download a test image

In the introductory-level workshop, this image was identified with approximately 78% confidence.

In [None]:
!wget -O /tmp/test.jpg http://www.vision.caltech.edu/Image_Datasets/Caltech256/images/008.bathtub/008_0007.jpg
file_name = '/tmp/test.jpg'
# test image
from IPython.display import Image
Image(file_name)  


### Evaluation

Let's now use the SageMaker endpoint hosting the trained model to predict the class of the test image. The model outputs class probabilities.  Typically, one selects the class with the maximum probability as the final predicted class output.


In [None]:
object_categories = ['ak47', 'american-flag', 'backpack', 'baseball-bat', 'baseball-glove', 'basketball-hoop', 'bat', 'bathtub', 'bear', 'beer-mug', 'billiards', 'binoculars', 'birdbath', 'blimp', 'bonsai-101', 'boom-box', 'bowling-ball', 'bowling-pin', 'boxing-glove', 'brain-101', 'breadmaker', 'buddha-101', 'bulldozer', 'butterfly', 'cactus', 'cake', 'calculator', 'camel', 'cannon', 'canoe', 'car-tire', 'cartman', 'cd', 'centipede', 'cereal-box', 'chandelier-101', 'chess-board', 'chimp', 'chopsticks', 'cockroach', 'coffee-mug', 'coffin', 'coin', 'comet', 'computer-keyboard', 'computer-monitor', 'computer-mouse', 'conch', 'cormorant', 'covered-wagon', 'cowboy-hat', 'crab-101', 'desk-globe', 'diamond-ring', 'dice', 'dog', 'dolphin-101', 'doorknob', 'drinking-straw', 'duck', 'dumb-bell', 'eiffel-tower', 'electric-guitar-101', 'elephant-101', 'elk', 'ewer-101', 'eyeglasses', 'fern', 'fighter-jet', 'fire-extinguisher', 'fire-hydrant', 'fire-truck', 'fireworks', 'flashlight', 'floppy-disk', 'football-helmet', 'french-horn', 'fried-egg', 'frisbee', 'frog', 'frying-pan', 'galaxy', 'gas-pump', 'giraffe', 'goat', 'golden-gate-bridge', 'goldfish', 'golf-ball', 'goose', 'gorilla', 'grand-piano-101', 'grapes', 'grasshopper', 'guitar-pick', 'hamburger', 'hammock', 'harmonica', 'harp', 'harpsichord', 'hawksbill-101', 'head-phones', 'helicopter-101', 'hibiscus', 'homer-simpson', 'horse', 'horseshoe-crab', 'hot-air-balloon', 'hot-dog', 'hot-tub', 'hourglass', 'house-fly', 'human-skeleton', 'hummingbird', 'ibis-101', 'ice-cream-cone', 'iguana', 'ipod', 'iris', 'jesus-christ', 'joy-stick', 'kangaroo-101', 'kayak', 'ketch-101', 'killer-whale', 'knife', 'ladder', 'laptop-101', 'lathe', 'leopards-101', 'license-plate', 'lightbulb', 'light-house', 'lightning', 'llama-101', 'mailbox', 'mandolin', 'mars', 'mattress', 'megaphone', 'menorah-101', 'microscope', 'microwave', 'minaret', 'minotaur', 'motorbikes-101', 'mountain-bike', 'mushroom', 'mussels', 'necktie', 'octopus', 'ostrich', 
'owl', 'palm-pilot', 'palm-tree', 'paperclip', 'paper-shredder', 'pci-card', 'penguin', 'people', 'pez-dispenser', 'photocopier', 'picnic-table', 'playing-card', 'porcupine', 'pram', 'praying-mantis', 'pyramid', 'raccoon', 'radio-telescope', 'rainbow', 'refrigerator', 'revolver-101', 'rifle', 'rotary-phone', 'roulette-wheel', 'saddle', 'saturn', 'school-bus', 'scorpion-101', 'screwdriver', 'segway', 'self-propelled-lawn-mower', 'sextant', 'sheet-music', 'skateboard', 'skunk', 'skyscraper', 'smokestack', 'snail', 'snake', 'sneaker', 'snowmobile', 'soccer-ball', 'socks', 'soda-can', 'spaghetti', 'speed-boat', 'spider', 'spoon', 'stained-glass', 'starfish-101', 'steering-wheel', 'stirrups', 'sunflower-101', 'superman', 'sushi', 'swan', 'swiss-army-knife', 'sword', 'syringe', 'tambourine', 'teapot', 'teddy-bear', 'teepee', 'telephone-box', 'tennis-ball', 'tennis-court', 'tennis-racket', 'theodolite', 'toaster', 'tomato', 'tombstone', 'top-hat', 'touring-bike', 'tower-pisa', 'traffic-light', 'treadmill', 'triceratops', 'tricycle', 'trilobite-101', 'tripod', 't-shirt', 'tuning-fork', 'tweezer', 'umbrella-101', 'unicorn', 'vcr', 'video-projector', 'washing-machine', 'watch-101', 'waterfall', 'watermelon', 'welding-mask', 'wheelbarrow', 'windmill', 'wine-bottle', 'xylophone', 'yarmulke', 'yo-yo', 'zebra', 'airplanes-101', 'car-side-101', 'faces-easy-101', 'greyhound', 'tennis-shoes', 'toad', 'clutter']

In [None]:
import json
import numpy as np

# Load test image from file
with open(file_name, 'rb') as f:
    payload = f.read()
    payload = bytearray(payload)

# Set content type
ic_predictor.content_type = 'application/x-image'

# Predict image and parse the JSON prediction
result = json.loads(ic_predictor.predict(payload))

# Print top class
index = np.argmax(result)
print("Result: label - " + object_categories[index] + ", probability - " + str(result[index]))
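
Looking at only the top class hides how close the runners-up are. A small helper for listing the k most probable classes could look like the sketch below; `top_k` is a hypothetical name, and the four-class example data stands in for the 257-entry `result` list.

```python
def top_k(probabilities, labels, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i],
                    reverse=True)
    return [(labels[i], probabilities[i]) for i in ranked[:k]]


# Hypothetical 4-class output, for illustration only
example_probs = [0.02, 0.90, 0.05, 0.03]
example_labels = ['backpack', 'bathtub', 'bear', 'beer-mug']

for label, prob in top_k(example_probs, example_labels, k=2):
    print("%s: %.2f" % (label, prob))
# bathtub: 0.90
# bear: 0.05
```

In the notebook, the equivalent call would be `top_k(result, object_categories)`.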

Here we get a confidence score of 99.9%, compared to the 78% in the Transfer Learning workshop, thanks to Automatic Model Tuning. We also get a cheaper hosted prediction endpoint by using Elastic Inference, all in just a couple of lines of code.


### Clean up

When we're done with the endpoint, we can just delete it and the backing instance will be released.  Run the following cell to delete the endpoint.

In [None]:
ic_predictor.delete_endpoint()
# save some local disk space
!rm -rf ./caltech*