# Building your own algorithm container

With Amazon SageMaker, you can package your own algorithms so that they can then be trained and deployed in the SageMaker environment. This notebook guides you through an example that shows how to build a Docker container for SageMaker that you can also use in EC2.

By packaging an algorithm in a container, you can bring almost any code to the Amazon SageMaker environment, regardless of programming language, environment, framework, or dependencies. 


1. [Building your own algorithm container](#Building-your-own-algorithm-container)
  1. [When should I build my own algorithm container?](#When-should-I-build-my-own-algorithm-container%3F)
  1. [Permissions](#Permissions)
  1. [The example](#The-example)
  1. [The presentation](#The-presentation)
1. [Part 1: Packaging and Uploading your Algorithm for use with Amazon SageMaker](#Part-1%3A-Packaging-and-Uploading-your-Algorithm-for-use-with-Amazon-SageMaker)
    1. [An overview of Docker](#An-overview-of-Docker)
    1. [How Amazon SageMaker runs your Docker container](#How-Amazon-SageMaker-runs-your-Docker-container)
      1. [Running your container during training](#Running-your-container-during-training)
        1. [The input](#The-input)
        1. [The output](#The-output)
      1. [Running your container during hosting](#Running-your-container-during-hosting)
    1. [The parts of the sample container](#The-parts-of-the-sample-container)
    1. [The Dockerfile](#The-Dockerfile)
    1. [Building and registering the container](#Building-and-registering-the-container)
  1. [Testing your algorithm on your local machine or on an Amazon SageMaker notebook instance](#Testing-your-algorithm-on-your-local-machine-or-on-an-Amazon-SageMaker-notebook-instance)
1. [Part 2: Using your Algorithm in Amazon SageMaker](#Part-2%3A-Using-your-Algorithm-in-Amazon-SageMaker)
  1. [Set up the environment](#Set-up-the-environment)
  1. [Create the session](#Create-the-session)
  1. [Upload the data for training](#Upload-the-data-for-training)
  1. [Create an estimator and fit the model](#Create-an-estimator-and-fit-the-model)
  1. [Hosting your model](#Hosting-your-model)
    1. [Deploy the model](#Deploy-the-model)
    2. [Choose some data and use it for a prediction](#Choose-some-data-and-use-it-for-a-prediction)
    3. [Optional cleanup](#Optional-cleanup)
  1. [Run Batch Transform Job](#Run-Batch-Transform-Job)
    1. [Create a Transform Job](#Create-a-Transform-Job)
    2. [View Output](#View-Output)

_or_ I'm impatient, just [let me see the code](#The-Dockerfile)!

## When should I build my own algorithm container?

You may not need to create a container to bring your own code to Amazon SageMaker. When you are using a framework (such as Apache MXNet or TensorFlow) that has direct support in SageMaker, you can simply supply the Python code that implements your algorithm using the SDK entry points for that framework. This set of frameworks is continually expanding, so we recommend that you check the current list if your algorithm is written in a common machine learning environment.

Even if there is direct SDK support for your environment or framework, you may find it more effective to build your own container. If the code that implements your algorithm is quite complex on its own or you need special additions to the framework, building your own container may be the right choice.

If there isn't direct SDK support for your environment, don't worry. You'll see in this walk-through that building your own container is quite straightforward.

## Permissions

Running this notebook requires permissions in addition to the normal `SageMakerFullAccess` permissions, because we'll be creating new repositories in Amazon ECR. The easiest way to add these permissions is to add the managed policy `AmazonEC2ContainerRegistryFullAccess` to the role that you used to start your notebook instance. There's no need to restart your notebook instance when you do this; the new permissions will be available immediately.

## The example

Here, we'll show how to package a simple Python example which showcases the [decision tree][] algorithm from the widely used [scikit-learn][] machine learning package. The example is purposefully fairly trivial since the point is to show the surrounding structure that you'll want to add to your own code so you can train and host it in Amazon SageMaker.

The ideas shown here will work in any language or environment. You'll need to choose the right tools for your environment to serve HTTP requests for inference, but good HTTP environments are available in every language these days.
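As one illustration of the serving side, the sketch below uses only Python's standard library to answer the two requests that SageMaker hosting sends to a container on port 8080: `GET /ping` for health checks and `POST /invocations` for inference. It is not the stack used in this example, which relies on nginx, gunicorn, and Flask. The `predict_rows` function is a hypothetical placeholder for a real model, and the CSV request format is an assumption.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict_rows(rows):
    """Hypothetical stand-in for a model: reports the field count of each CSV row."""
    return ["{} fields".format(len(row.split(","))) for row in rows]


class InferenceHandler(BaseHTTPRequestHandler):
    def _reply(self, status, body=b"", content_type="text/plain"):
        self.send_response(status)
        self.send_header("Content-Type", content_type)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health check: return 200 on /ping when the container is ready.
        self._reply(200 if self.path == "/ping" else 404)

    def do_POST(self):
        if self.path != "/invocations":
            self._reply(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length).decode("utf-8")
        rows = [line for line in payload.splitlines() if line.strip()]
        result = "\n".join(predict_rows(rows)).encode("utf-8")
        self._reply(200, result, "text/csv")

    def log_message(self, fmt, *args):
        pass  # suppress per-request logging in this sketch


def make_server(host="0.0.0.0", port=8080):
    """Create (but do not start) the HTTP server."""
    return HTTPServer((host, port), InferenceHandler)
```

Calling `make_server().serve_forever()` would start the server in the foreground. A production container would normally put a real application server in front of the model, as the sample does.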

In this example, we use a single image to support training and hosting. This is easy because it means that we only need to manage one image and we can set it up to do everything. Sometimes you'll want separate images for training and hosting because they have different requirements. Just separate the parts discussed below into separate Dockerfiles and build two images. Choosing whether to have a single image or two images is really a matter of which is more convenient for you to develop and manage.

If you're only using Amazon SageMaker for training or hosting, but not both, there is no need to build the unused functionality into your container.

[scikit-learn]: http://scikit-learn.org/stable/
[decision tree]: http://scikit-learn.org/stable/modules/tree.html

## The presentation

This presentation is divided into two parts: _building_ the container and _using_ the container.

### The Dockerfile

The Dockerfile describes the image that we want to build. You can think of it as describing the complete operating system installation of the system that you want to run. A running Docker container is quite a bit lighter than a full operating system, however, because it relies on the Linux kernel of the host machine for basic operations.

For the Python science stack, we will start from a standard Ubuntu installation and run the normal tools to install the things needed by scikit-learn. Finally, we add the code that implements our specific algorithm to the container and set up the right environment to run under.

Along the way, we clean up extra space. This makes the container smaller and faster to start.

Let's look at the Dockerfile for the example:

In [1]:
!cat container/Dockerfile

# Build an image that can do training and inference in SageMaker
# This is a Python 3 image that uses the nginx, gunicorn, flask stack
# for serving inferences in a stable way.

FROM ubuntu:18.04

MAINTAINER Amazon AI <sage-learner@amazon.com>


RUN apt-get -y update && apt-get install -y --no-install-recommends \
         wget \
         python3 \
         python3-pip\
         python3-setuptools\
         nginx \
         ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Here we get all python packages.
# There's substantial overlap between scipy and numpy that we eliminate by
# linking them together. Likewise, pip leaves the install caches populated which uses
# a significant amount of space. These optimizations save a fair amount of space in the
# image, which reduces start up time.
RUN pip3 install numpy==1.16.2 scipy==1.2.1 scikit-learn==0.20.2 pandas boto3 flask gunicorn 
    #&& \
 #       (cd /usr/local/lib/python3.6/dist-packages/scipy/.libs; rm *; ln ../../numpy/.libs/

In [4]:
import json
import boto3
import re
import os
import numpy as np
import pandas as pd
from time import gmtime, strftime
import pickle
from sklearn import tree, linear_model
from sklearn.utils import shuffle

s3bucket = 'privisaa-bucket-virginia' #replace with your bucket 
datakey = 'iris.csv'
s3client = boto3.client('s3')

params = {'max_leaf_nodes':10,
         'bucket':s3bucket,
         'input_data':datakey}

with open('hyperparameters.json', 'w') as f:
    json.dump(params, f)

To get our SageMaker training container working in EC2, we have to make a few changes.

SageMaker runs training containers with the command `docker run <imagename> train`.

The `train` command executes an entrypoint that runs our training code. SageMaker expects a few things when running training jobs:

- hyperparameters
- input data

Above we saved our hyperparameters to `hyperparameters.json`. Our image will need to either contain or be sent these hyperparameters, as well as our training data.
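A minimal sketch of how a `train` script might read these hyperparameters follows. Inside a SageMaker training job the file is placed at `/opt/ml/input/config/hyperparameters.json` and every value arrives as a string. Whether the EC2 variant of this container uses the same path is an assumption, and `load_hyperparameters` is a hypothetical helper, not part of the sample code.

```python
import json
import os
import tempfile

# Standard location of the hyperparameter file inside a SageMaker training container.
SAGEMAKER_PARAM_PATH = "/opt/ml/input/config/hyperparameters.json"


def load_hyperparameters(path=SAGEMAKER_PARAM_PATH):
    """Read the hyperparameter file and convert the values this example uses."""
    with open(path) as f:
        raw = json.load(f)
    max_leaf_nodes = raw.get("max_leaf_nodes")
    return {
        # Values may arrive as strings, so convert explicitly.
        "max_leaf_nodes": int(max_leaf_nodes) if max_leaf_nodes is not None else None,
        "bucket": raw["bucket"],
        "input_data": raw["input_data"],
    }


# Demo: a temporary file stands in for the real path; "my-bucket" is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "hyperparameters.json")
    with open(demo_path, "w") as f:
        json.dump({"max_leaf_nodes": "10", "bucket": "my-bucket", "input_data": "iris.csv"}, f)
    demo_params = load_hyperparameters(demo_path)

print(demo_params)
```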

In [3]:
%%sh

cp -r ~/.aws container

# The name of our algorithm
algorithm_name=sagemaker-decision-trees-ec2

cd container

chmod +x decision_trees/train
chmod +x decision_trees/serve

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

# Get the login command from ECR and execute it directly
$(aws ecr get-login --region ${region} --no-include-email)

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build  -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}

# Remove the copied credentials (we are still inside container/ after the cd above)
rm -rf .aws

Login Succeeded
Sending build context to Docker daemon  68.61kB
Step 1/10 : FROM ubuntu:18.04
 ---> d27b9ffc5667
Step 2/10 : MAINTAINER Amazon AI <sage-learner@amazon.com>
 ---> Using cache
 ---> 9aaae5d8bdd3
Step 3/10 : RUN apt-get -y update && apt-get install -y --no-install-recommends          wget          python3          python3-pip         python3-setuptools         nginx          ca-certificates     && rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> 5e3947b20d07
Step 4/10 : RUN pip3 install numpy==1.16.2 scipy==1.2.1 scikit-learn==0.20.2 pandas boto3 flask gunicorn
 ---> Using cache
 ---> 30a825e8f417
Step 5/10 : ENV PYTHONUNBUFFERED=TRUE
 ---> Using cache
 ---> 15c31efaa8a6
Step 6/10 : ENV PYTHONDONTWRITEBYTECODE=TRUE
 ---> Using cache
 ---> a148c9b730fd
Step 7/10 : ENV PATH="/opt/program:${PATH}"
 ---> Using cache
 ---> ed2488a43c18
Step 8/10 : COPY decision_trees /opt/program
 ---> 5b46f18eb688
Step 9/10 : COPY .aws /root/.aws
 ---> 7ea70513e328
Step 10/10 : WORKDIR /opt
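The full image name that the script tags and pushes follows the ECR naming pattern `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`. As a small sketch, the same string can be assembled in Python. The helper name `ecr_image_uri` is hypothetical, and the account ID below is a placeholder.

```python
def ecr_image_uri(account, repository, region=None, tag="latest"):
    """Build the full ECR image name used for `docker tag` and `docker push`."""
    region = region or "us-west-2"  # same fallback as the shell script above
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(account, region, repository, tag)


# "123456789012" is a placeholder account ID.
print(ecr_image_uri("123456789012", "sagemaker-decision-trees-ec2", "us-east-1"))
```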




# Part 2: Using your Algorithm in Amazon SageMaker

Once you have your container packaged, you can use it to train models and use the model for hosting or batch transforms. Let's do that with the algorithm we made above.

## Set up the environment

Here we specify the S3 prefix to use. Because we run the container in EC2 rather than through SageMaker, we no longer need the SageMaker execution role.

In [5]:
# S3 prefix
prefix = 'DEMO-scikit-byo-iris'

# we remove the execution role
# from sagemaker import get_execution_role

# role = get_execution_role()

## Upload the data for training

When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to create your data in S3. For the purposes of this example, we're using the classic [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), which we have included.


In [None]:
# WORK_DIRECTORY = 'data'

# data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=prefix)

## View our data

In [5]:
iris = pd.read_csv('container/decision_trees/iris.csv', header=None)
iris = shuffle(iris)
train_data = iris.iloc[:-5,:]
val_data = iris.iloc[-5:,:]
train_data.head()

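|     | 0          | 1   | 2   | 3   | 4   |
| --- | ---------- | --- | --- | --- | --- |
| 142 | virginica  | 5.8 | 2.7 | 5.1 | 1.9 |
| 119 | virginica  | 6.0 | 2.2 | 5.0 | 1.5 |
| 55  | versicolor | 5.7 | 2.8 | 4.5 | 1.3 |
| 67  | versicolor | 5.8 | 2.7 | 4.1 | 1.0 |
| 41  | setosa     | 4.5 | 2.3 | 1.3 | 0.3 |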


## Run our training job in the notebook

In [None]:
# set variables
max_leaf_nodes = 30

# labels are in the first column
train_y = train_data.iloc[:,0]
train_X = train_data.iloc[:,1:]

# Now use scikit-learn's decision tree classifier to train the model.
clf = tree.DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
# clf = linear_model.LogisticRegression(max_iter=1000)
clf = clf.fit(train_X, train_y)

print("Val predictions: ",clf.predict(val_data.iloc[:,1:]))

## Create an estimator and fit the model

Normally we would use a SageMaker Estimator to fit our algorithm here. Instead, we do the equivalent in EC2 by running the container directly with `docker run <imagename> train`.

In [None]:
%%sh

account=$(aws sts get-caller-identity --query Account --output text)
region=$(aws configure get region)

# use aws cli for this instead 
# account = sess.boto_session.client('sts').get_caller_identity()['Account']
# region = sess.boto_session.region_name
# image = '{}.dkr.ecr.{}.amazonaws.com/sagemaker-decision-trees:latest'.format(account, region)

# Name of the ECR repository that we pushed the image to earlier
image=sagemaker-decision-trees-ec2

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${image}:latest"

### Instead of using a SageMaker estimator, we can just run the docker train command
# tree = sage.estimator.Estimator(image,
#                        role, 1, 'ml.c4.2xlarge',
#                        output_path="s3://{}/output".format(sess.default_bucket()),
#                        sagemaker_session=sess)

# tree.fit(data_location)

docker run "${fullname}" train
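When SageMaker runs a training container, it mounts its inputs and outputs under `/opt/ml` (for example `/opt/ml/input/config/hyperparameters.json` and `/opt/ml/model`). Outside SageMaker, one way to mimic this is to mount a local directory at `/opt/ml` when calling `docker run`. The sketch below only builds the command line. The helper name `docker_train_command`, the image name, and the local directory are hypothetical, and whether this example's `train` script reads from `/opt/ml` is an assumption.

```python
import os
import shlex


def docker_train_command(image, local_ml_dir):
    """Return the argv for running `train` with a local directory mounted at /opt/ml."""
    mount = "{}:/opt/ml".format(os.path.abspath(local_ml_dir))
    return ["docker", "run", "--rm", "-v", mount, image, "train"]


# "my-image" and "test_dir" are placeholder names.
cmd = docker_train_command("my-image", "test_dir")
print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would start the container on a machine where Docker is installed.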

### Run inference

In [10]:
# download our trained model from S3
s3client.download_file(s3bucket, 'decision-tree-model.pkl', 'decision-tree-model.pkl')


In [13]:
# load our model
with open('decision-tree-model.pkl', 'rb') as f:
    model = pickle.load(f)



### Choose some data and use it for a prediction

In order to do some predictions, we'll extract some of the data we used for training and do predictions against it. This is, of course, bad statistical practice, but a good way to see how the mechanism works.

In [14]:
shape=pd.read_csv("data/iris.csv", header=None)
shape.sample(3)

|     | 0         | 1   | 2   | 3   | 4   |
| --- | --------- | --- | --- | --- | --- |
| 108 | virginica | 6.7 | 2.5 | 5.8 | 1.8 |
| 25  | setosa    | 5.0 | 3.0 | 1.6 | 0.2 |
| 1   | setosa    | 4.9 | 3.0 | 1.4 | 0.2 |


In [15]:
# drop the label column in the training set
shape.drop(shape.columns[[0]],axis=1,inplace=True)
shape.sample(3)

|     | 1   | 2   | 3   | 4   |
| --- | --- | --- | --- | --- |
| 90  | 5.5 | 2.6 | 4.4 | 1.2 |
| 3   | 4.6 | 3.1 | 1.5 | 0.2 |
| 115 | 6.4 | 3.2 | 5.3 | 2.3 |


In [16]:
import itertools

a = [50*i for i in range(3)]
b = [40+i for i in range(10)]
indices = [i+j for i,j in itertools.product(a,b)]

test_data = shape.iloc[indices[:-1]]
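To make the row selection in the cell above concrete, the following sketch recomputes the same indices with the standard library only and prints which rows are picked. The names `selected` and `blocks` are introduced here for illustration.

```python
import itertools

a = [50 * i for i in range(3)]   # block offsets: 0, 50, 100
b = [40 + i for i in range(10)]  # positions 40..49 within each block
indices = [i + j for i, j in itertools.product(a, b)]

# The notebook drops the last index, leaving 29 rows.
selected = indices[:-1]

# Group the selected rows by their block of 50 for a compact summary.
blocks = {}
for idx in selected:
    blocks.setdefault(idx // 50, []).append(idx)

for block, rows in sorted(blocks.items()):
    print("block", block, "rows", rows[0], "to", rows[-1], "count", len(rows))
```

The selection therefore covers rows 40 to 49, 90 to 99, and 140 to 148, that is, the last rows of each block of 50.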


Prediction is as easy as calling `predict` on the model we loaded from the pickle file, passing the data we want predictions for.

In [17]:
model.predict(test_data.values)

array(['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa',
       'setosa', 'setosa', 'setosa', 'setosa', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'versicolor', 'versicolor', 'versicolor', 'versicolor',
       'virginica', 'virginica', 'virginica', 'virginica', 'virginica',
       'virginica', 'virginica', 'virginica', 'virginica'], dtype=object)
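As a quick sanity check on the output above, the sketch below compares the printed predictions with the labels expected for the selected rows. It assumes that the rows in `iris.csv` follow the classic ordering (rows 0 to 49 setosa, 50 to 99 versicolor, 100 to 149 virginica), which matches the samples shown earlier but is not verified here. The `accuracy` helper is introduced for illustration.

```python
def accuracy(predicted, expected):
    """Fraction of positions where the predicted label equals the expected label."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("at least one label is required")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


species = ["setosa", "versicolor", "virginica"]

# Same 29 row indices as in the notebook: 40..49, 90..99, 140..148.
rows = [50 * i + 40 + j for i in range(3) for j in range(10)][:-1]

# Assumed ordering: each block of 50 rows holds one species.
expected = [species[r // 50] for r in rows]

# Predictions copied from the notebook output above.
predicted = ["setosa"] * 10 + ["versicolor"] * 10 + ["virginica"] * 9

score = accuracy(predicted, expected)
print("accuracy on the selected rows:", score)
```

Because these rows were also part of the training data, this check only confirms that the mechanism works; it says nothing about how well the model generalizes.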