# Deploy fast.ai model with Amazon SageMaker
_**Hosting a fastai-based Pre-Trained Model in Amazon SageMaker Algorithm Containers**_

## Background

Amazon SageMaker provides a hosted notebook environment, distributed managed training, and real-time hosting. We think it works best when all three are used together, but each can also be used independently. Some use cases require only hosting, for example when the model was trained in a different service or before Amazon SageMaker existed.

This notebook shows how to use a pre-existing [fast.ai](https://github.com/fastai/fastai) based model with an Amazon SageMaker Algorithm container to quickly create a hosted endpoint for that model.

## Setup
*This notebook was created and tested on an ml.p2.xlarge notebook instance.*

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting.
- The Elastic Container Registry (ECR) repository where the custom Docker image used for model inference will be stored.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note: if more than one role is required for notebook instances, training, and/or hosting, replace the `get_execution_role()` call with the appropriate full IAM role ARN string(s).
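
If you do hard-code a role ARN instead of calling `get_execution_role()`, it can help to sanity-check its shape before passing it to SageMaker. The following is only an illustrative sketch: `is_iam_role_arn` and the example ARN are hypothetical, and the pattern assumes the standard `aws` partition. It checks the form of the string, not whether the role exists or has the right permissions.

```python
import re

# Hypothetical helper: verifies only that the string looks like an IAM role ARN.
# Assumes the standard "aws" partition and a 12-digit account ID.
_ROLE_ARN_RE = re.compile(r"^arn:aws:iam::\d{12}:role/[\w+=,.@\-/]+$")

def is_iam_role_arn(arn):
    """Return True if `arn` has the general shape of an IAM role ARN."""
    return bool(_ROLE_ARN_RE.match(arn))

# Placeholder ARN for illustration; substitute your own role.
example_role = "arn:aws:iam::123456789012:role/service-role/MySageMakerRole"
print(is_iam_role_arn(example_role))
```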

In [1]:
# Put these at the top of every notebook, to get automatic reloading and inline plotting
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
bucket='sagemaker-mcclean-eu-west-1'  # customize to the name of your S3 bucket
ecr_repo_name = 'fastai-conv-net'     # customize to the name of your ECR repo
PATH='data/dogscats/'           # customize to the relative location of your data folder

Next, install the extra libraries that the fastai conda environment needs in order to call the SageMaker and other AWS service endpoints: boto3 and the SageMaker Python SDK.

In [None]:
!pip install boto3 sagemaker

We also need to make the fastai library accessible by creating a symlink to it in the directory where the notebook resides.

In [3]:
! [ -L fastai ] || (echo "Creating symlink" && ln -s ~/SageMaker/fastai/fastai fastai)

In [4]:
import boto3
import re
import os
import urllib.request
import zipfile

In [5]:
%%time

from sagemaker import get_execution_role

role = get_execution_role()

client = boto3.client("sts")
account_id = client.get_caller_identity()["Account"]
region_name = boto3.Session().region_name
print('AWS Account ID: {}'.format(account_id))
print('Region: {}'.format(region_name))

training_image = '{}.dkr.ecr.{}.amazonaws.com/{}:latest'.format(account_id, region_name, ecr_repo_name)

print('Docker image for training is: {}'.format(training_image))
print('IAM role for SageMaker: {}'.format(role))

AWS Account ID: 934676248949
Region: eu-west-1
Docker image for training is: 934676248949.dkr.ecr.eu-west-1.amazonaws.com/fastai-conv-net:latest
IAM role for SageMaker: arn:aws:iam::934676248949:role/service-role/AmazonSageMaker-ExecutionRole-20171203T194740
CPU times: user 120 ms, sys: 20 ms, total: 140 ms
Wall time: 758 ms


## Data
For simplicity, we'll use the dataset from [lesson 1](http://course.fast.ai/lessons/lesson1.html) of the fast.ai course. We will download the _dogscats_ image dataset, which comes from a Kaggle competition, and save it to a local directory.

In [7]:
%%time
if not os.path.isdir(PATH):
    print("Downloading data....")
    os.makedirs("data", exist_ok=True)
    zipfile_path = 'data/dogscats.zip'
    urllib.request.urlretrieve("http://files.fast.ai/data/dogscats.zip", zipfile_path)
    print("Extracting zipfile....")
    f = zipfile.ZipFile(zipfile_path)
    f.extractall("data")
    os.remove(zipfile_path)

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 30.8 µs


## Train model locally
Now we will train the model based on lesson 1 of the fast.ai course.

In [8]:
import torch
from fastai.imports import *

In [9]:
from fastai.transforms import *
from fastai.conv_learner import *
from fastai.model import *
from fastai.dataset import *
from fastai.sgdr import *
from fastai.plots import *

In [10]:
sz=224

In [11]:
torch.cuda.is_available()

True

In [12]:
torch.backends.cudnn.enabled

True

In [13]:
arch=resnet34
data = ImageClassifierData.from_paths(PATH, tfms=tfms_from_model(arch, sz))
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 2)

Downloading: "https://download.pytorch.org/models/resnet34-333f7ec4.pth" to /home/ec2-user/.torch/models/resnet34-333f7ec4.pth
100%|██████████| 87306240/87306240 [00:01<00:00, 50916979.96it/s]


100%|██████████| 360/360 [02:24<00:00,  2.48it/s]
100%|██████████| 32/32 [00:13<00:00,  2.45it/s]


epoch      trn_loss   val_loss   accuracy                     
    0      0.04534    0.025115   0.989746  
    1      0.046546   0.024737   0.990234                     



[0.024736615, 0.990234375]

In [14]:
learn.save('fastai_catsdogs')

## Upload models to S3
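Now that we have trained a model and saved it locally, we can upload it, plus some extra files, to S3. The cells below prepare those extra files: a small sample of the training and validation images, copied into the `models/data` directory next to the saved model.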

In [15]:
def get_relative_path(filename):
    s1 = os.path.split(filename)
    p = os.path.split(s1[0])[1]
    return os.path.join(p, s1[1])

In [16]:
def create_dummy_data(src_path, dest_root, sub_dir, num_items=2):
    if not os.path.isdir(dest_root): os.mkdir(dest_root)
    dst_path = os.path.join(dest_root, sub_dir)
    classes = os.listdir(src_path)
    for d in classes:
        if d.startswith('.'): continue
        if not os.path.isdir(dst_path): os.mkdir(dst_path)
        if not os.path.isdir(os.path.join(dst_path, d)): os.mkdir(os.path.join(dst_path, d))
        fnames = glob('{}/{}/*.jpg'.format(src_path, d))
        for i in range(num_items):
            shutil.copyfile(fnames[i], os.path.join(dst_path, get_relative_path(fnames[i])))

In [17]:
create_dummy_data(PATH + "train", PATH + "models/data", "train")
create_dummy_data(PATH + "valid", PATH + "models/data", "valid")
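
The notebook does not show the upload step itself. A minimal sketch of one way to do it with boto3 follows, assuming the files to upload all live under a single local directory. The helper names `iter_upload_keys` and `upload_dir`, and the key prefix in the usage comment, are hypothetical and not part of the original notebook.

```python
import os

def iter_upload_keys(local_dir, key_prefix):
    """Yield (local_path, s3_key) pairs for every file under local_dir."""
    for root, _dirs, files in os.walk(local_dir):
        for name in sorted(files):
            local_path = os.path.join(root, name)
            rel = os.path.relpath(local_path, local_dir)
            # S3 keys use forward slashes regardless of the local OS.
            yield local_path, key_prefix.rstrip("/") + "/" + rel.replace(os.sep, "/")

def upload_dir(s3_client, local_dir, bucket, key_prefix):
    """Upload every file under local_dir to s3://bucket/key_prefix/ and return the keys."""
    uploaded = []
    for local_path, key in iter_upload_keys(local_dir, key_prefix):
        s3_client.upload_file(local_path, bucket, key)  # boto3 S3 client method
        uploaded.append(key)
    return uploaded

# Usage (requires AWS credentials; the prefix is an assumption):
# import boto3
# upload_dir(boto3.client('s3'), PATH + 'models', bucket, 'fastai-dogscats/models')
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object before pointing it at a real bucket.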

## Hyperparameters
Set the hyperparameters for the training of the model.

In [None]:
# The algorithm supports multiple parameters
# we need to specify the input image size
image_size = 224
# number of epochs
epochs = 3
# learning rate
learning_rate = 0.01

## Build the Docker image
Build the Docker image and push it to ECR. The same image is used for both training and inference with our fast.ai model.
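
The build commands are not included in this notebook. As a rough sketch, the usual sequence is to log Docker in to ECR, build the image, tag it with the repository URI, and push it. The helper below only assembles those shell commands as strings so they can be reviewed before running. It assumes a Dockerfile in the build context directory, an ECR repository that already exists, and an AWS CLI version that provides `aws ecr get-login-password`. The function name `ecr_push_commands` and the account ID are hypothetical.

```python
def ecr_push_commands(account_id, region_name, repo_name, tag="latest", context_dir="."):
    """Return the shell commands that build an image and push it to ECR."""
    registry = "{}.dkr.ecr.{}.amazonaws.com".format(account_id, region_name)
    image_uri = "{}/{}:{}".format(registry, repo_name, tag)
    return [
        # Log Docker in to the account's ECR registry.
        "aws ecr get-login-password --region {} | "
        "docker login --username AWS --password-stdin {}".format(region_name, registry),
        # Build from the Dockerfile assumed to be in context_dir.
        "docker build -t {}:{} {}".format(repo_name, tag, context_dir),
        # Tag the local image with the full ECR URI, then push it.
        "docker tag {}:{} {}".format(repo_name, tag, image_uri),
        "docker push {}".format(image_uri),
    ]

# Placeholder account ID for illustration only.
for cmd in ecr_push_commands("123456789012", "eu-west-1", "fastai-conv-net"):
    print(cmd)
```

The resulting image URI has the same form as the `training_image` string built in the Setup section.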

## Training
Run the training using the Amazon SageMaker CreateTrainingJob API.

In [None]:
%%time
import time
import boto3
from time import gmtime, strftime


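s3 = boto3.client('s3')
# S3 key prefix that holds the train/ and valid/ folders.
# Assumption: this value is not defined elsewhere in the notebook; adjust it to your own layout.
data_prefix = 'data/dogscats'
# create a unique job name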
job_name_prefix = 'sagemaker-fastai-dogscats'
timestamp = time.strftime('-%Y-%m-%d-%H-%M-%S', time.gmtime())
job_name = job_name_prefix + timestamp
training_params = \
{
    # specify the training docker image
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"
    },
    "RoleArn": role,
    "OutputDataConfig": {
        "S3OutputPath": 's3://{}/{}/output'.format(bucket, job_name_prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 1,
        "InstanceType": "ml.p2.xlarge",
        "VolumeSizeInGB": 50
    },
    "TrainingJobName": job_name,
    "HyperParameters": {
        "image_size": str(image_size),
        "epochs": str(epochs),
        "learning_rate": str(learning_rate)
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 360000
    },
    # Training data should be inside a subdirectory called "train"
    # Validation data should be inside a subdirectory called "valid"
    # The algorithm currently supports only FullyReplicated mode (the data is copied onto each machine)
    "InputDataConfig": [
        {
            "ChannelName": "training",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": 's3://{}/{}/train/'.format(bucket, data_prefix),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "image/jpeg",
            "CompressionType": "None"
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": 's3://{}/{}/valid/'.format(bucket, data_prefix),
                    "S3DataDistributionType": "FullyReplicated"
                }
            },
            "ContentType": "image/jpeg",
            "CompressionType": "None"
        }
    ]
}
print('Training job name: {}'.format(job_name))
print('\nInput Data Location: {}'.format(training_params['InputDataConfig'][0]['DataSource']['S3DataSource']))
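
The `str(...)` conversions in `HyperParameters` above are needed because the CreateTrainingJob API accepts hyperparameter values only as strings. SageMaker then makes them available inside the training container as a JSON file at `/opt/ml/input/config/hyperparameters.json`. How this notebook's container reads that file is not shown, so the following is only a sketch of one way to convert the values back to numbers. The function name `load_hyperparameters` and the fallback defaults (which mirror the values set earlier) are assumptions.

```python
import json

def load_hyperparameters(path="/opt/ml/input/config/hyperparameters.json"):
    """Read SageMaker hyperparameters and convert them back to numeric types."""
    with open(path) as f:
        raw = json.load(f)  # every value arrives as a string
    return {
        # Defaults are assumptions that mirror the values used in this notebook.
        "image_size": int(raw.get("image_size", "224")),
        "epochs": int(raw.get("epochs", "3")),
        "learning_rate": float(raw.get("learning_rate", "0.01")),
    }
```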

In [None]:
# create the Amazon SageMaker training job
sagemaker = boto3.client(service_name='sagemaker')
sagemaker.create_training_job(**training_params)

# confirm that the training job has started
status = sagemaker.describe_training_job(TrainingJobName=job_name)['TrainingJobStatus']
print('Training job current status: {}'.format(status))

try:
    # wait for the job to finish and report the ending status
    sagemaker.get_waiter('training_job_completed_or_stopped').wait(TrainingJobName=job_name)
    training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    status = training_info['TrainingJobStatus']
    print("Training job ended with status: " + status)
except Exception:
    # the waiter raises if the job fails or the wait times out
    training_info = sagemaker.describe_training_job(TrainingJobName=job_name)
    message = training_info.get('FailureReason', 'no failure reason reported')
    print('Training did not complete successfully: {}'.format(message))

## Create Model

We now create a SageMaker Model from the training output. Using the model we can create an Endpoint Configuration.

In [None]:
%%time
import boto3
from time import gmtime, strftime

sage = boto3.Session().client(service_name='sagemaker') 

model_name="fastai-conv-net-model"
print(model_name)
info = sage.describe_training_job(TrainingJobName=job_name)
model_data = info['ModelArtifacts']['S3ModelArtifacts']
print(model_data)

primary_container = {
    'Image': training_image,
    'ModelDataUrl': model_data,
}

create_model_response = sage.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = primary_container)

print(create_model_response['ModelArn'])

## Create the Endpoint Configuration

Now we can create the Endpoint configuration to deploy the model for inference.

In [None]:
from time import gmtime, strftime

endpoint_config_name = 'fastai-convnet-endpoint-config-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_config_name)
create_endpoint_config_response = sagemaker.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ProductionVariants=[{
        'InstanceType':'ml.p2.xlarge',
        'InitialInstanceCount':1,
        'ModelName':model_name,
        'VariantName':'AllTraffic'}])

print("Endpoint Config Arn: " + create_endpoint_config_response['EndpointConfigArn'])

## Create the Endpoint

Now we can create the endpoint to do the model inference.

In [None]:
%%time
import time

endpoint_name = 'fastai-convnet-endpoint-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_name)
create_endpoint_response = sagemaker.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name)
print(create_endpoint_response['EndpointArn'])

resp = sagemaker.describe_endpoint(EndpointName=endpoint_name)
status = resp['EndpointStatus']
print("Status: " + status)

try:
    sagemaker.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)
finally:
    resp = sagemaker.describe_endpoint(EndpointName=endpoint_name)
    status = resp['EndpointStatus']
    print("Arn: " + resp['EndpointArn'])
    print("Create endpoint ended with status: " + status)

    if status != 'InService':
        # FailureReason is only present when endpoint creation has failed
        message = resp.get('FailureReason', 'no failure reason reported')
        print('Endpoint creation failed with the following error: {}'.format(message))
        raise Exception('Endpoint creation did not succeed')

## Randomly select test image
Randomly select an image from the test folder to submit to the SageMaker prediction endpoint.

In [None]:
import os, random
from IPython.display import Image

dir_name = 'data/dogscats/test1/'  # change to the directory that holds your test images
file_name = os.path.join(dir_name, random.choice(os.listdir(dir_name)))
print(file_name)
# To test a specific image instead, set the path explicitly:
# file_name = 'data/dogscats/test1/9969.jpg'
Image(file_name)

## Call Endpoint
Call the endpoint with some test data.

In [None]:
%%time
import time
import json

runtime = boto3.Session().client('sagemaker-runtime')

with open(file_name, 'rb') as f:
    payload = f.read()
    payload = bytearray(payload)
response = runtime.invoke_endpoint(EndpointName=endpoint_name, 
                                   ContentType='application/x-image', 
                                   Body=payload)
result = response['Body'].read()
print(json.loads(result))

## Delete Endpoint
Delete the endpoint to stop incurring costs.

In [None]:
import sagemaker as sage

sess = sage.Session()
sess.delete_endpoint(endpoint_name)
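
Deleting the endpoint stops the hosting charges, but the endpoint configuration and the model created earlier stay registered in the account. If you want to remove those as well, a minimal sketch using the boto3 SageMaker client could look like the following. The helper name `delete_hosting_resources` is hypothetical; the three client calls are standard boto3 operations.

```python
def delete_hosting_resources(sm_client, endpoint_name, endpoint_config_name, model_name):
    """Delete the endpoint, then its configuration, then the model."""
    sm_client.delete_endpoint(EndpointName=endpoint_name)
    sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sm_client.delete_model(ModelName=model_name)

# Usage (requires AWS credentials). Use this instead of, not in addition to,
# the delete_endpoint call above, since the endpoint can only be deleted once:
# import boto3
# delete_hosting_resources(boto3.client('sagemaker'),
#                          endpoint_name, endpoint_config_name, model_name)
```

Because `model_name` is a fixed string in this notebook, removing the model also avoids a likely name conflict if the Create Model cell is run again.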