# Auto categorizer model in SageMaker

<img src="../old-work-overview.png" width="40%" style="float:left;" /><img src="../auto-categorizer-model.png" width="60%" style="float:left;" />

### Building and registering the container

The following shell code shows how to build the container image using `docker build` and push the container image to ECR using `docker push`. This code is also available as the shell script `container/build-and-push.sh`, which you can run as `build-and-push.sh sagemaker-auto-categorization` to build the image `sagemaker-auto-categorization`. 

This code looks for an ECR repository in the account you're using and the current default region (if you're using a SageMaker notebook instance, this will be the region where the notebook instance was created). If the repository doesn't exist, the script will create it.

In [53]:
%%sh
# The name of our algorithm
algorithm_name=sagemaker-auto-categorization

account=$(aws sts get-caller-identity --query Account --output text)

# Get the region defined in the current configuration (default to us-west-2 if none defined)
region=$(aws configure get region)
region=${region:-us-west-2}

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"

# If the repository doesn't exist in ECR, create it.

aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
fi

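# Get the login command from ECR and execute it directly.
# `aws ecr get-login` exists only in AWS CLI v1; with CLI v2 use
# `aws ecr get-login-password | docker login --username AWS --password-stdin <registry>`.
$(aws ecr get-login --region ${region} --no-include-email)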

# Build the docker image locally with the image name and then push it to ECR
# with the full name.

docker build -t ${algorithm_name} -f _sagemaker/Dockerfile .
docker tag ${algorithm_name} ${fullname}

docker push ${fullname}

Login Succeeded
Sending build context to Docker daemon    636MB
Step 1/19 : FROM ubuntu:16.04
 ---> 7e87e2b3bf7a
Step 2/19 : MAINTAINER Amazon AI <sage-learner@amazon.com>
 ---> Using cache
 ---> 8fbda223875c
Step 3/19 : WORKDIR /opt
 ---> Using cache
 ---> b5e579200e0a
Step 4/19 : COPY requirements.txt /opt/requirements.txt
 ---> Using cache
 ---> f17aeedf6f36
Step 5/19 : RUN apt-get -y update && apt-get install -y --no-install-recommends          wget          python3          nginx          git          libicu-dev          ca-certificates     && rm -rf /var/lib/apt/lists/*
 ---> Using cache
 ---> 0a15d4a4f323
Step 6/19 : RUN alias python=python3
 ---> Using cache
 ---> 51d3395c87c4
Step 7/19 : RUN wget https://bootstrap.pypa.io/get-pip.py && python3 get-pip.py &&     pip install -r requirements.txt &&         rm -rf /root/.cache
 ---> Using cache
 ---> 55b8d0a4e299
Step 8/19 : ENV MODEL_PATH=/opt/ml/model
 ---> Using cache
 ---> dc8f06a791be
Step 9/19 : ENV PYTHONUNBUFFERED=TRUE
 --

Error processing tar file(exit status 1): write /opt/program/learning/trained-models/doc2vec.model.docvecs.doctag_syn0.npy: no space left on device


#### Inside the container, SageMaker defines the input and output locations as follows:
###### The input

* `/opt/ml/input/config` contains information to control how your program runs. `hyperparameters.json` is a JSON-formatted dictionary of hyperparameter names to values. These values will always be strings, so you may need to convert them. `resourceConfig.json` is a JSON-formatted file that describes the network layout used for distributed training. Since scikit-learn doesn't support distributed training, we'll ignore it here.
* `/opt/ml/input/data/<channel_name>/` (for File mode) contains the input data for that channel. The channels are created from the call to `CreateTrainingJob`, so their names need to match what the algorithm expects. The files for each channel are copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure.
* `/opt/ml/input/data/<channel_name>_<epoch_number>` (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. There is no limit to the number of epochs that you can run, but you must close each pipe before reading the next epoch.
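
As a rough illustration of the input layout described above, a training script might read its configuration and data like this. This is a minimal sketch, not the code in this container: the helper names and the hyperparameter names `vector_size` and `epochs` are hypothetical, and the directory arguments default to the SageMaker paths listed above.

```python
import json
import os


def load_hyperparameters(config_dir="/opt/ml/input/config"):
    """Read hyperparameters.json; every value arrives as a string."""
    path = os.path.join(config_dir, "hyperparameters.json")
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def parse_hyperparameters(raw):
    """Convert string values to typed values (hypothetical parameter names)."""
    return {
        "vector_size": int(raw.get("vector_size", "100")),
        "epochs": int(raw.get("epochs", "10")),
    }


def list_channel_files(channel_name, data_dir="/opt/ml/input/data"):
    """Return every file under a File-mode channel, including subdirectories."""
    channel_dir = os.path.join(data_dir, channel_name)
    found = []
    for dirpath, _, filenames in os.walk(channel_dir):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Inside the container this could be used as `parse_hyperparameters(load_hyperparameters())` and `list_channel_files("training")`, assuming the job was started with a channel named `training`.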

###### The output

* `/opt/ml/model/` is the directory where you write the model that your algorithm generates. Your model can be in any format that you want. It can be a single file or a whole directory tree. SageMaker will package any files in this directory into a compressed tar archive file. This file will be available at the S3 location returned in the `DescribeTrainingJob` result.
* `/opt/ml/output` is a directory where the algorithm can write a file `failure` that describes why the job failed. The contents of this file will be returned in the `FailureReason` field of the `DescribeTrainingJob` result. For jobs that succeed, there is no reason to write this file as it will be ignored.
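
The output contract above can be sketched as a small wrapper around the training step. This is an illustrative sketch only: `run_training` and `train_fn` are hypothetical names, and the wrapper simply re-raises so that the process exits with a non-zero status after the `failure` file has been written.

```python
import os
import traceback


def run_training(train_fn, model_dir="/opt/ml/model", output_dir="/opt/ml/output"):
    """Run train_fn(model_dir); on error, write the 'failure' file and re-raise."""
    os.makedirs(model_dir, exist_ok=True)
    try:
        # train_fn is expected to write its model files into model_dir
        train_fn(model_dir)
    except Exception as exc:
        os.makedirs(output_dir, exist_ok=True)
        with open(os.path.join(output_dir, "failure"), "w") as f:
            f.write("Exception during training: {}\n{}".format(exc, traceback.format_exc()))
        raise
```

On success nothing is written to the output directory, which matches the note above that the `failure` file is ignored for successful jobs.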

# Part 2: Training and Hosting your Algorithm in Amazon SageMaker

Once you have your container packaged, you can use it to train and serve models. Let's do that with the algorithm we made above.

## Set up the environment

Here we specify a bucket to use and the role that will be used for working with SageMaker.

In [25]:
# S3 prefix
prefix = 'data/DEMO-auto-categorizer'

# Define IAM role
import boto3
import re

import os
import numpy as np
import pandas as pd
from sagemaker import get_execution_role

role = get_execution_role()

## Create the session

The session remembers our connection parameters to SageMaker. We'll use it to perform all of our SageMaker operations.

In [26]:
import sagemaker as sage
from time import gmtime, strftime

sess = sage.Session()

## Upload the data for training

When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to create your data in S3. For the purposes of this example, we're using some articles from CS (AWS RDS). 

We can use the tools provided by the SageMaker Python SDK to upload the data to a default bucket.

In [47]:
WORK_DIRECTORY = 'learning/data'

data_location = sess.upload_data(WORK_DIRECTORY, key_prefix=prefix)
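
To check what an upload like this would produce before running it, the object keys can be previewed locally. The sketch below assumes that uploading a directory mirrors its tree under `key_prefix`, which is consistent with the channel layout described earlier; `preview_s3_keys` is a hypothetical helper and not part of the SageMaker SDK.

```python
import os


def preview_s3_keys(local_dir, key_prefix):
    """List (local_path, s3_key) pairs for every file under local_dir."""
    pairs = []
    for dirpath, _, filenames in os.walk(local_dir):
        for name in filenames:
            local_path = os.path.join(dirpath, name)
            rel = os.path.relpath(local_path, start=local_dir)
            # S3 keys use forward slashes regardless of the local OS
            key = "{}/{}".format(key_prefix, rel.replace(os.sep, "/"))
            pairs.append((local_path, key))
    return sorted(pairs)
```

For example, `preview_s3_keys(WORK_DIRECTORY, prefix)` would list the keys expected under `data/DEMO-auto-categorizer/` in the default bucket.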

## Create an estimator and fit the model

In order to use SageMaker to fit our algorithm, we'll create an `Estimator` that defines how to use the container to train. This includes the configuration we need to invoke SageMaker training:

* The __container name__, constructed as in the shell commands above.
* The __role__, as defined above.
* The __instance count__, the number of machines to use for training.
* The __instance type__, the type of machine to use for training.
* The __output path__, which determines where the model artifact will be written.
* The __session__, the SageMaker session object that we defined above.

Then we call `fit()` on the estimator to train against the data that we uploaded above.

In [28]:
account = sess.boto_session.client('sts').get_caller_identity()['Account']
region = sess.boto_session.region_name
image = '{}.dkr.ecr.{}.amazonaws.com/sagemaker-auto-categorization:latest'.format(account, region)

tree = sage.estimator.Estimator(image,
                       role, 1, 'ml.c4.2xlarge',
                       output_path="s3://{}/output".format(sess.default_bucket()),
                       sagemaker_session=sess)
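
The image name follows the same pattern as `fullname` in the build script, so it can be factored into a small helper to keep the two in sync. A minimal sketch; `ecr_image_uri` is a hypothetical name introduced here for illustration.

```python
def ecr_image_uri(account, region, repository, tag="latest"):
    """Build the ECR image URI in the same form the build script pushes to."""
    return "{}.dkr.ecr.{}.amazonaws.com/{}:{}".format(account, region, repository, tag)
```

With the variables from the cell above, `ecr_image_uri(account, region, 'sagemaker-auto-categorization')` yields the same string as `image`.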

### Train model


In [None]:
%%capture
tree.fit(data_location)

INFO:sagemaker:Creating training-job with name: sagemaker-auto-categorization-2019-03-05-14-19-23-464


## Deploy the model

Deploying the model to SageMaker hosting just requires a `deploy` call on the fitted model. This call takes an instance count, instance type, and optionally serializer and deserializer functions. These are used when the resulting predictor is created on the endpoint.

In [16]:
from sagemaker.predictor import csv_serializer
predictor = tree.deploy(1, 'ml.m4.xlarge', serializer=csv_serializer, endpoint_name='categorize')

INFO:sagemaker:Creating model with name: sagemaker-auto-categorization-2019-02-11-14-32-56-853
INFO:sagemaker:Creating endpoint with name categorize


---------------------------------------------------------------------------!

## Choose some data and use it for a prediction

To see how the mechanism works, we'll run predictions against some of the data we used for training. This is, of course, bad statistical practice, but it is a quick way to exercise the endpoint.

In [None]:
predictor.predict("Exempel text")
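
The same endpoint can also be called without the `predictor` object, for example from another service. The sketch below assumes the container accepts `text/csv` request bodies (as implied by `csv_serializer` above) and returns a UTF-8 text response; the response format depends on the container, and `categorize` is a hypothetical helper name.

```python
def categorize(text, runtime_client, endpoint_name="categorize"):
    """Send one document to the endpoint and return the decoded response body."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=text.encode("utf-8"),
    )
    # The response body is a stream; read it fully and decode it
    return response["Body"].read().decode("utf-8")
```

A typical call would be `categorize("Exempel text", boto3.client("sagemaker-runtime"))`. Passing the client in as an argument keeps the function easy to test with a stub.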

## Optional cleanup

When you're done with the endpoint, you'll want to clean it up.

In [15]:
sess.delete_endpoint(predictor.endpoint)

INFO:sagemaker:Deleting endpoint with name: sagemaker-auto-categorization-2018-09-28-10-58-26-758
