# Building your own algorithm container

# Part 1: Packaging and Uploading your Algorithm for use with Amazon SageMaker

In [1]:
!cat Dockerfile

FROM python:2

RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential nginx git

RUN pip install --upgrade \
    numpy \
    scipy \
    scikit-learn \
    matplotlib \
    pandas


RUN  git clone https://github.com/mvsusp/sagemaker-containers.git -b mvs-sagemaker-containers-train-improvements && cd sagemaker-containers && pip install .

COPY decision_trees /decision_trees

ENV PYTHONPATH /decision_trees


ENV SAGEMAKER_TRAINING_MODULE train
ENV SAGEMAKER_SERVING_MODULE serve:main
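
The last two `ENV` lines name the Python entry points: `train` is a plain module name, while `serve:main` uses a `module:callable` form. Because `PYTHONPATH` points at `/decision_trees`, both names are importable. The sketch below shows one way such a spec could be resolved with the standard library. `resolve_entry_point` is a hypothetical helper written for illustration, not the sagemaker-containers implementation, and standard-library names stand in for `train` and `serve:main`.

```python
import importlib


def resolve_entry_point(spec):
    """Resolve 'module' or 'module:callable' to a module or a callable.

    Hypothetical helper: it illustrates the naming convention only.
    """
    module_name, _, attr = spec.partition(":")
    module = importlib.import_module(module_name)
    # With no ':callable' part, the module itself is the entry point.
    return getattr(module, attr) if attr else module


# Standard-library names stand in for 'train' and 'serve:main' here.
dumps = resolve_entry_point("json:dumps")
print(dumps([1, 2]))
```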


### Building the container

In [2]:
%%sh

IMAGE_NAME=scikit-learn-image

docker build -t ${IMAGE_NAME} .

Sending build context to Docker daemon    960kB
Step 1/8 : FROM python:2
 ---> 17c0fe4e76a5
Step 2/8 : RUN apt-get update && apt-get install -y --no-install-recommends     build-essential nginx git
 ---> Using cache
 ---> 07ba460e18b6
Step 3/8 : RUN pip install --upgrade     numpy     scipy     scikit-learn     matplotlib     pandas
 ---> Using cache
 ---> 0b8511851624
Step 4/8 : RUN  git clone https://github.com/mvsusp/sagemaker-containers.git -b mvs-sagemaker-containers-train-improvements && cd sagemaker-containers && pip install .
 ---> Using cache
 ---> baae39ee7ee9
Step 5/8 : COPY decision_trees /decision_trees
 ---> Using cache
 ---> 68a8dcfa7151
Step 6/8 : ENV PYTHONPATH /decision_trees
 ---> Using cache
 ---> ea7700919582
Step 7/8 : ENV SAGEMAKER_TRAINING_MODULE train
 ---> Using cache
 ---> f465d7e812bd
Step 8/8 : ENV SAGEMAKER_SERVING_MODULE serve:main
 ---> Using cache
 ---> 09a5f26709de
Successfully built 09a5f26709de
Successfully tagged scikit-learn-image:latest


### Training the container locally

In [5]:
%%sh

IMAGE_NAME=scikit-learn-image

docker run -v $(pwd)/data:/opt/ml/input/data/training/ -v $(pwd)/model:/opt/ml/model/  ${IMAGE_NAME} train --max-leaf-nodes 10

2018-08-02 16:25:42,964 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2018-08-02 16:25:42,967 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "network_interface_name": "ethwe", 
    "log_level": 20, 
    "model_dir": "/opt/ml/model", 
    "num_gpus": 0, 
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    }, 
    "input_config_dir": "/opt/ml/input/config", 
    "num_cpus": 2, 
    "input_data_config": {
        "training": {}
    }, 
    "output_data_dir": "/opt/ml/output/data", 
    "hosts": [
        "0a8f2cf64ccb"
    ], 
    "output_dir": "/opt/ml/output", 
    "module_dir": "/opt/ml/code", 
    "hyperparameters": {
        "max-leaf-nodes": 10
    }, 
    "module_name": "None", 
    "current_host": "0a8f2cf64ccb", 
    "input_dir": "/opt/ml/input", 
    "job_name": null, 
    "resource_config": {
        "current_host": "0a8f2cf64ccb", 
        "hosts": [
            "0a8f2cf64ccb"
        ]
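
The `--max-leaf-nodes 10` argument passed after `train` shows up under `"hyperparameters"` in the Training Env above. The `train` module itself is not shown in this notebook, so the following is only a sketch of how a training script might read such a flag with `argparse`. The `--train-dir` and `--model-dir` flags are assumptions made up for this example; their defaults mirror the paths printed in the Training Env.

```python
import argparse


def parse_training_args(argv=None):
    """Parse hyperparameters and data locations for a training run.

    Assumption: only --max-leaf-nodes comes from the notebook; the
    directory flags are hypothetical and default to the printed paths.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--max-leaf-nodes", type=int, default=None)
    parser.add_argument("--train-dir", default="/opt/ml/input/data/training")
    parser.add_argument("--model-dir", default="/opt/ml/model")
    return parser.parse_args(argv)


args = parse_training_args(["--max-leaf-nodes", "10"])
print(args.max_leaf_nodes, args.train_dir, args.model_dir)
```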
    

### Serving the container locally in the background

In [10]:
%%script bash --bg

IMAGE_NAME=scikit-learn-image

docker run --name DEMO-scikit-byo-sagemaker-containers -p 8080:8080 -v $(pwd)/model:/opt/ml/model/  ${IMAGE_NAME} serve

You can confirm that the server is running by checking the container logs:

In [12]:
!docker logs DEMO-scikit-byo-sagemaker-containers

[2018-08-02 16:36:40 +0000] [12] [INFO] Starting gunicorn 19.9.0
[2018-08-02 16:36:40 +0000] [12] [INFO] Listening at: unix:/tmp/gunicorn.sock (12)
[2018-08-02 16:36:40 +0000] [12] [INFO] Using worker: gevent
[2018-08-02 16:36:40 +0000] [18] [INFO] Booting worker with pid: 18
[2018-08-02 16:36:40 +0000] [19] [INFO] Booting worker with pid: 19
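
Besides reading the logs, you can poll the container's health check. SageMaker hosting expects a serving container to answer `GET /ping` on port 8080; the sketch below assumes this image does so when run locally with the port mapping used above. The helper names `ping` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def ping(url="http://localhost:8080/ping", timeout=2.0):
    """Return True if the health-check URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status.
        return False


def wait_until_ready(check=ping, attempts=10, delay=1.0):
    """Call `check` until it succeeds or the attempts run out."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay)
    return False
```

Calling `wait_until_ready()` right after starting the container gives the server a few seconds to boot before the first prediction request is sent.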


### You can make predictions now

In [14]:
!curl -X POST http://localhost:8080/invocations -d '[[1.0,2.0,5.0,9.0]]' -H "Content-Type: application/json" -H "Accept: application/json"

["virginica"]

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    32  100    13  100    19     31     46 --:--:-- --:--:-- --:--:--    46
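
The same request can be made from Python. This is a minimal sketch using only the standard library; `build_invocation_request` is a hypothetical helper, and the URL assumes the local container started above is still listening on port 8080.

```python
import json
import urllib.request


def build_invocation_request(rows, url="http://localhost:8080/invocations"):
    """Build a POST request carrying `rows` as a JSON body."""
    body = json.dumps(rows).encode("utf-8")
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_invocation_request([[1.0, 2.0, 5.0, 9.0]])
print(request.get_method(), request.full_url, request.data)

# With the container running, send the request and decode the JSON reply:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```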


### Don't forget to stop your container

In [18]:
%%script bash

docker stop DEMO-scikit-byo-sagemaker-containers
docker rm DEMO-scikit-byo-sagemaker-containers

DEMO-scikit-byo-sagemaker-containers
DEMO-scikit-byo-sagemaker-containers


### Pushing the image

In [23]:
%%script bash

IMAGE_NAME=scikit-learn-image

# Get the account number associated with the current IAM credentials
AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
REGION=$(aws configure get region)
REGION=${REGION:-us-west-2}


ECR_IMAGE_NAME="${AWS_ACCOUNT}.dkr.ecr.${REGION}.amazonaws.com/${IMAGE_NAME}:latest"

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "${IMAGE_NAME}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${IMAGE_NAME}" > /dev/null
fi

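# Get the login command from ECR and execute it directly.
# Note: `aws ecr get-login` exists only in AWS CLI v1; v2 replaces it with `aws ecr get-login-password`.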
$(aws ecr get-login --region ${REGION} --no-include-email)

# Tag the locally built image with its full ECR name, then push it to ECR.

docker tag ${IMAGE_NAME} ${ECR_IMAGE_NAME}

docker push ${ECR_IMAGE_NAME}


Login Succeeded
The push refers to repository [369233609183.dkr.ecr.us-west-2.amazonaws.com/scikit-learn-image]
3659b7d07935: Preparing
0d2eb7422a3d: Preparing
687cfc3f7262: Preparing
a0af8ca0a64c: Preparing
8b7c9a62baba: Preparing
66ea76482fd5: Preparing
1defa7e52b6e: Preparing
8eb4c3a69e64: Preparing
1fa8778eb779: Preparing
fa0c3f992cbd: Preparing
ce6466f43b11: Preparing
719d45669b35: Preparing
3b10514a95be: Preparing
8eb4c3a69e64: Waiting
1fa8778eb779: Waiting
fa0c3f992cbd: Waiting
ce6466f43b11: Waiting
719d45669b35: Waiting
3b10514a95be: Waiting
66ea76482fd5: Waiting
1defa7e52b6e: Waiting
a0af8ca0a64c: Layer already exists
687cfc3f7262: Layer already exists
3659b7d07935: Layer already exists
0d2eb7422a3d: Layer already exists
8b7c9a62baba: Layer already exists
1defa7e52b6e: Layer already exists
66ea76482fd5: Layer already exists
fa0c3f992cbd: Layer already exists
8eb4c3a69e64: Layer already exists
1fa8778eb779: Layer already exists
ce6466f43b11: Layer already exists
719d45669b35: L



# Part 2: Training and Hosting your Algorithm in Amazon SageMaker

Once you have your container packaged, you can use it to train and serve models. Let's do that with the algorithm we made above.

## Set up the environment

Here we specify the S3 prefix under which the data will be stored and the IAM role that will be used for working with SageMaker.

In [22]:
# S3 prefix
prefix = 'DEMO-scikit-byo-sagemaker-containers'

# Define IAM role
import boto3
import re

import os
import numpy as np
import pandas as pd
from sagemaker import get_execution_role

role = 'SageMakerRole'

## Create the session

The session remembers our connection parameters to SageMaker. We'll use it to perform all of our SageMaker operations.

In [20]:
import sagemaker as sage
from time import gmtime, strftime

sess = sage.Session()

## Upload the data for training

When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to create your data in S3. For the purposes of this example, we're using the classic [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), which we have included.

We can use the tools provided by the SageMaker Python SDK to upload the data to a default bucket.

In [21]:
WORK_DIRECTORY = 'data'

data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=prefix)
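
For a directory upload, `data_location` is expected to be an S3 URI made of the bucket name and the key prefix. The sketch below only illustrates that shape and makes no AWS calls. `expected_upload_uri` is a hypothetical helper, the region and account ID are placeholders, and the `sagemaker-<region>-<account>` bucket pattern is an assumption about how the default bucket is named.

```python
def expected_upload_uri(bucket, key_prefix):
    """Return the S3 URI under which an uploaded directory is expected to land."""
    return "s3://{}/{}".format(bucket, key_prefix)


# Placeholder region and account ID; the default bucket is assumed to
# follow the 'sagemaker-<region>-<account>' naming pattern.
bucket = "sagemaker-{}-{}".format("us-west-2", "123456789012")
uri = expected_upload_uri(bucket, "DEMO-scikit-byo-sagemaker-containers")
print(uri)
```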

## Create an estimator and fit the model

In order to use SageMaker to fit our algorithm, we'll create an `Estimator` that defines how to use the container to train. This includes the configuration we need to invoke SageMaker training:

* The __container name__. This is constructed as in the shell commands above.
* The __role__. As defined above.
* The __instance count__, which is the number of machines to use for training.
* The __instance type__, which is the type of machine to use for training.
* The __output path__, which determines where the model artifact will be written.
* The __session__, which is the SageMaker session object that we defined above.

Then we call `fit()` on the estimator to train against the data that we uploaded above.

In [24]:
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/scikit-learn-image:latest'.format(account, region)

tree = sage.estimator.Estimator(image,
                       role, 1, 'ml.c4.2xlarge',
                       output_path="s3://{}/output".format(sess.default_bucket()),
                       sagemaker_session=sess)

tree.fit(data_location)

INFO:sagemaker:Creating training-job with name: scikit-learn-image-2018-08-02-16-42-41-094


...............
2018-08-02 16:44:58,144 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2018-08-02 16:44:58,146 sagemaker-containers INFO     Invoking user script

Training Env:

{
    "network_interface_name": "ethwe", 
    "log_level": 20, 
    "model_dir": "/opt/ml/model", 
    "num_gpus": 0, 
    "channel_input_dirs": {
        "training": "/opt/ml/input/data/training"
    }, 
    "input_config_dir": "/opt/ml/input/config", 
    "num_cpus": 8, 
    "input_data_config": {
        "training": {
            "TrainingInputMode": "File", 
            "RecordWrapperType": "None", 
            "S3DistributionType": "FullyReplicated"
        }
    }, 
    "output_data_dir": "/opt/ml/output/data", 
    "hosts": [
        "algo-1"
    ], 
    "output_dir": "/opt/ml/output", 
    "module_dir": "/opt/ml/code", 
    "hyperparameters": {}, 
    "module_name": "None", 
    "current_host": "algo-1", 
    "input_dir": "/opt/ml/input", 
  

## Deploy the model

Deploying the model to SageMaker hosting just requires a `deploy` call on the fitted model. This call takes an instance count, instance type, and optionally serializer and deserializer functions. These are used when the resulting predictor is created on the endpoint.

In [31]:
from sagemaker.predictor import json_deserializer, json_serializer
predictor = tree.deploy(1, 'ml.m4.xlarge', serializer=json_serializer, deserializer=json_deserializer)

INFO:sagemaker:Creating model with name: scikit-learn-image-2018-08-02-17-00-08-769
INFO:sagemaker:Creating endpoint with name scikit-learn-image-2018-08-02-16-42-41-094


--------------------------------------------------!
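
The serializer turns the input into the request body and the deserializer decodes the response. The sketch below mimics that round trip with the standard library only. `to_json_body` and `from_json_stream` are hypothetical stand-ins written for illustration, not the SDK's `json_serializer` and `json_deserializer`.

```python
import io
import json


def to_json_body(data):
    """Serialize rows to a JSON string, accepting array-like inputs."""
    # NumPy arrays expose tolist(); plain lists pass through unchanged.
    if hasattr(data, "tolist"):
        data = data.tolist()
    return json.dumps(data)


def from_json_stream(stream):
    """Decode a JSON response body read from a file-like object."""
    return json.loads(stream.read().decode("utf-8"))


body = to_json_body([[1.0, 2.0, 5.0, 9.0]])
reply = from_json_stream(io.BytesIO(b'["virginica"]'))
print(body, reply)
```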

## Choose some data and use it for a prediction

In order to do some predictions, we'll extract some of the data we used for training and do predictions against it. This is, of course, bad statistical practice, but a good way to see how the mechanism works.

In [26]:
shape=pd.read_csv("data/iris.csv", header=None)

import itertools

a = [50*i for i in range(3)]
b = [40+i for i in range(10)]
indices = [i+j for i,j in itertools.product(a,b)]

test_data=shape.iloc[indices[:-1]]
test_X=test_data.iloc[:,1:]
test_y=test_data.iloc[:,0]
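
The index arithmetic above is easier to follow when run on its own. Assuming `iris.csv` holds 150 rows grouped as three blocks of 50 (one per species), it picks rows 40 to 49 of each block and then drops the very last index, which leaves 29 rows.

```python
import itertools

a = [50 * i for i in range(3)]        # block offsets: 0, 50, 100
b = [40 + i for i in range(10)]       # positions 40..49 within a block
indices = [i + j for i, j in itertools.product(a, b)]

selected = indices[:-1]               # drops the final index, 149
print(len(selected), selected[0], selected[-1])
print(selected[:10], selected[10:20], selected[20:])
```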

Prediction is as easy as calling predict with the predictor we got back from deploy and the data we want to do predictions with. The serializers take care of doing the data conversions for us.

In [32]:
predictor.content_type = 'application/json'
predictor.accept = 'application/json'

print(predictor.predict(test_X.values))

['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica']
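
To compare the predictions with the held-back labels in `test_y`, a simple accuracy score is enough. This is a minimal sketch with toy labels; `accuracy` is a hypothetical helper, and in the notebook it could be called as `accuracy(predictor.predict(test_X.values), list(test_y))`, assuming the label column holds the same species strings that the endpoint returns.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(predicted) == 0:
        raise ValueError("cannot score an empty prediction list")
    matches = sum(1 for p, t in zip(predicted, actual) if p == t)
    return matches / len(predicted)


# Toy labels for illustration only.
score = accuracy(["setosa", "virginica", "virginica"],
                 ["setosa", "versicolor", "virginica"])
print(score)
```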


## Optional cleanup

When you're done with the endpoint, you'll want to clean it up.

In [33]:
sess.delete_endpoint(predictor.endpoint)

INFO:sagemaker:Deleting endpoint with name: scikit-learn-image-2018-08-02-16-42-41-094
