# Bring your own Scikit-learn model to SageMaker for Batch Transform
This tutorial shows you how to bring your [Scikit-learn](https://scikit-learn.org/stable/) models to SageMaker so that you can host and make inferences using SageMaker infrastructure. Scikit-learn is a popular Python machine learning framework. It includes a number of different algorithms for classification, regression, clustering, dimensionality reduction, and data/feature pre-processing. 

The [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) module makes it easy to take an existing scikit-learn model and generate predictions using the SageMaker hosting and inference service. For more information about the Scikit-learn container, see the [sagemaker-scikit-learn-containers](https://github.com/aws/sagemaker-scikit-learn-container) repository and the [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) repository. Note that the version of scikit-learn used for training has to match the version in the provided container.

For more on Scikit-learn, please visit the Scikit-learn website: <http://scikit-learn.org/stable/>.

### Table of contents
* [Upload the data for training](#upload_data)
* [Create a Scikit-learn script to train with](#create_sklearn_script)
* [Batch Transform](#batch_transform)
 * [Prepare Input Data](#prepare_input_data)
 * [Run Transform Job](#run_transform_job)
 * [Check Output Data](#check_output_data)

In [3]:
!pip install -U scikit-learn==0.23.1

Collecting scikit-learn==0.23.1
  Downloading scikit_learn-0.23.1-cp36-cp36m-manylinux1_x86_64.whl (6.8 MB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-2.1.0-py3-none-any.whl (12 kB)
Installing collected packages: threadpoolctl, scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 0.22.1
    Uninstalling scikit-learn-0.22.1:
      Successfully uninstalled scikit-learn-0.22.1
Successfully installed scikit-learn-0.23.1 threadpoolctl-2.1.0
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


First, let's create our SageMaker session and role, and define an S3 prefix to use for the notebook example.

In [4]:
# S3 prefix
prefix = 'Scikit-iris'

import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

# Get a SageMaker-compatible role used by this Notebook Instance.
role = get_execution_role()

## Upload the data for training <a class="anchor" id="upload_data"></a>

When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to create your data in S3. For the purposes of this example, we're using the classic [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), which is included with Scikit-learn. We load the dataset and write it to a local CSV file; samples of this file are uploaded to S3 later as input for the batch transform job.

In [5]:
import numpy as np
import os
from sklearn import datasets

# Load Iris dataset, then join labels and features
iris = datasets.load_iris()
joined_iris = np.insert(iris.data, 0, iris.target, axis=1)

# Create directory and write csv
os.makedirs('./data', exist_ok=True)
np.savetxt('./data/iris.csv', joined_iris, delimiter=',', fmt='%1.1f, %1.3f, %1.3f, %1.3f, %1.3f')
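
Because `fmt` is a multi-format string here, NumPy applies it to each row as a whole, so the fields end up separated by a comma followed by a space rather than by the `delimiter` argument alone. The following is a minimal sketch in plain Python of how one row is rendered; `format_row` is a hypothetical helper used only for illustration and is not part of the notebook.

```python
# Same format string as the one passed to np.savetxt above.
ROW_FORMAT = '%1.1f, %1.3f, %1.3f, %1.3f, %1.3f'


def format_row(row):
    """Render one (label, 4 features) row the way the format string does."""
    return ROW_FORMAT % tuple(row)


line = format_row([0.0, 5.1, 3.5, 1.4, 0.2])
print(line)

# Splitting on ',' leaves a leading space on every feature field.
fields = line.split(',')
label, features = fields[0], fields[1:]
```

This layout matters later on: the label is the first comma-separated field, and everything after the first comma is the feature vector.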

In [10]:
joined_iris, iris

(array([[0. , 5.1, 3.5, 1.4, 0.2],
        [0. , 4.9, 3. , 1.4, 0.2],
        [0. , 4.7, 3.2, 1.3, 0.2],
        [0. , 4.6, 3.1, 1.5, 0.2],
        [0. , 5. , 3.6, 1.4, 0.2],
        [0. , 5.4, 3.9, 1.7, 0.4],
        [0. , 4.6, 3.4, 1.4, 0.3],
        [0. , 5. , 3.4, 1.5, 0.2],
        [0. , 4.4, 2.9, 1.4, 0.2],
        [0. , 4.9, 3.1, 1.5, 0.1],
        [0. , 5.4, 3.7, 1.5, 0.2],
        [0. , 4.8, 3.4, 1.6, 0.2],
        [0. , 4.8, 3. , 1.4, 0.1],
        [0. , 4.3, 3. , 1.1, 0.1],
        [0. , 5.8, 4. , 1.2, 0.2],
        [0. , 5.7, 4.4, 1.5, 0.4],
        [0. , 5.4, 3.9, 1.3, 0.4],
        [0. , 5.1, 3.5, 1.4, 0.3],
        [0. , 5.7, 3.8, 1.7, 0.3],
        [0. , 5.1, 3.8, 1.5, 0.3],
        [0. , 5.4, 3.4, 1.7, 0.2],
        [0. , 5.1, 3.7, 1.5, 0.4],
        [0. , 4.6, 3.6, 1. , 0.2],
        [0. , 5.1, 3.3, 1.7, 0.5],
        [0. , 4.8, 3.4, 1.9, 0.2],
        [0. , 5. , 3. , 1.6, 0.2],
        [0. , 5. , 3.4, 1.6, 0.4],
        [0. , 5.2, 3.5, 1.5, 0.2],
        [0. , 5.2, 3

Once we have the dataset built up, we can start the sklearn ML training. 

In [16]:
import joblib
import sklearn
print(sklearn.__version__)
from sklearn import tree

0.23.1


In [12]:
train_y = joined_iris[:, 0] # label
train_X = joined_iris[:, 1:] # features

Let's try a simple [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html). The model trains very quickly because the dataset is small.

In [15]:
clf = tree.DecisionTreeClassifier(max_leaf_nodes=30)
clf = clf.fit(train_X, train_y)

Once the classifier `clf` is trained, we can save it using [joblib](https://scikit-learn.org/stable/modules/model_persistence.html), which is more efficient than pickle on objects that carry large numpy arrays internally, as is often the case for fitted scikit-learn estimators. The model then needs to be archived into a file named `model.tar.gz` and uploaded to S3 so that it can be used as a SageMaker model.

In [40]:
joblib.dump(clf, 'model.joblib')

!tar -czf model.tar.gz model.joblib
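
The archive can also be built from Python with the standard `tarfile` module, which may be convenient where the `tar` command is not available. This is a minimal sketch: `make_model_archive` and `list_archive` are hypothetical helper names, and the demo packs a placeholder file rather than a real fitted model. The file is stored at the root of the archive, so it ends up directly inside the model directory once the archive is extracted.

```python
import os
import tarfile
import tempfile


def make_model_archive(model_path, archive_path):
    """Pack a single model file at the root of a gzip-compressed tar archive."""
    with tarfile.open(archive_path, 'w:gz') as tar:
        # arcname drops any directory prefix so the file sits at the archive root.
        tar.add(model_path, arcname=os.path.basename(model_path))
    return archive_path


def list_archive(archive_path):
    """Return the member names stored in the archive."""
    with tarfile.open(archive_path, 'r:gz') as tar:
        return tar.getnames()


# Demo with a placeholder file standing in for the real model.joblib.
workdir = tempfile.mkdtemp()
placeholder = os.path.join(workdir, 'model.joblib')
with open(placeholder, 'wb') as f:
    f.write(b'placeholder bytes, not a real model')

archive = make_model_archive(placeholder, os.path.join(workdir, 'model.tar.gz'))
members = list_archive(archive)
print(members)
```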

In [None]:
model='s3://%s/%s/output/model.tar.gz' % (sagemaker_session.default_bucket(), prefix)
!aws s3 cp model.tar.gz {model}

To get inferences for an entire dataset, use batch transform. With batch transform, you create a batch transform job using a trained model and the dataset, which must be stored in Amazon S3. Amazon SageMaker saves the inferences in an S3 bucket that you specify when you create the batch transform job. Batch transform manages all of the compute resources required to get inferences. This includes launching instances and deleting them after the batch transform job has completed. Batch transform manages interactions between the data and the model with an object within the instance node called an agent.

Use batch transform when you:

- Want to get inferences for an entire dataset and index them to serve inferences in real time

- Don't need a persistent endpoint that applications (for example, web or mobile apps) can call to get inferences

- Don't need the subsecond latency that Amazon SageMaker hosted endpoints provide

You can also use batch transform to preprocess your data before using it to train a new model or generate inferences.

The following diagram shows [the workflow of a batch transform job](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html):

![image](https://docs.aws.amazon.com/sagemaker/latest/dg/images/batch-transform-v2.png)

Now we have the classifier packaged up and are almost ready to use it as a SageMaker model for inference. We also need the serving code that goes along with the classifier in `model.tar.gz`. The serving script that is placed into the scikit-learn serving container requires an input argument `--model-dir`, so that SageMaker can pass in the model location, and a function `model_fn` that deserializes and returns the fitted classifier. We create the script below and save it as `sklearn_iris_serving.py`.


In [23]:
%%writefile sklearn_iris_serving.py
from __future__ import print_function

import argparse
import joblib
import os
import pandas as pd
from sklearn import tree

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # SageMaker-specific arguments. Defaults are set in the environment variables.
    # parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    # Parse the arguments so that --model-dir is actually consumed.
    args, _ = parser.parse_known_args()


def model_fn(model_dir):
    """Deserialize and return the fitted model.

    Note that the file name must match the name the model was serialized under (model.joblib).
    """
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf

Writing sklearn_iris_serving.py


Now we are ready to create a SageMaker model for the scikit-learn model. We need the model location (S3 path), the SageMaker execution role, the serving script as `entry_point`, and the scikit-learn version number as `framework_version`.

In [33]:
from sagemaker.sklearn.model import SKLearnModel
sklearn_model = SKLearnModel(model_data=model,
                             role=role,
                             entry_point="sklearn_iris_serving.py",
                             framework_version="0.23-1")
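
Since the local scikit-learn version has to line up with the container version, a quick check before creating the model can catch a mismatch early. The sketch below assumes that agreeing on the major and minor version is what matters, which is an assumption rather than a documented rule; `major_minor` and `versions_compatible` are hypothetical helpers.

```python
def major_minor(version):
    """Return the first two numeric components of a version string.

    Accepts both '0.23.1' (scikit-learn style) and '0.23-1' (the
    framework_version string used above).
    """
    parts = version.replace('-', '.').split('.')
    return tuple(int(p) for p in parts[:2])


def versions_compatible(local_version, framework_version):
    """True when both strings share the same major.minor release."""
    return major_minor(local_version) == major_minor(framework_version)


same_release = versions_compatible('0.23.1', '0.23-1')
different_release = versions_compatible('0.22.1', '0.23-1')
print(same_release, different_release)
```

In the notebook, the first argument would be `sklearn.__version__`.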

## Batch Transform <a class="anchor" id="batch_transform"></a>

In [34]:
# Define a SKLearn Transformer from the sklearn model
transformer = sklearn_model.transformer(instance_count=1, instance_type='ml.m5.xlarge')

### Prepare Input Data <a class="anchor" id="prepare_input_data"></a>
We draw 10 random samples of 100 rows each from the training data, then split the features (X) from the labels (Y). We then upload the feature files to S3 as the input data.

In [35]:
%%bash
# Randomly sample the iris dataset 10 times, then split X and Y
mkdir -p batch_data/XY batch_data/X batch_data/Y
for i in {0..9}; do
    shuf -n 100 data/iris.csv > batch_data/XY/iris_sample_${i}.csv
    cut -d',' -f2- batch_data/XY/iris_sample_${i}.csv > batch_data/X/iris_sample_X_${i}.csv
    cut -d',' -f1 batch_data/XY/iris_sample_${i}.csv > batch_data/Y/iris_sample_Y_${i}.csv
done
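
The same sampling and splitting can be sketched in Python for environments without `shuf` and `cut`. This is an illustrative equivalent, not the notebook's method: `sample_and_split` is a hypothetical helper, and the three input rows are a tiny stand-in for the contents of `data/iris.csv`.

```python
import random


def sample_and_split(lines, n, seed=None):
    """Pick n distinct rows at random and split the label from the features.

    Each row is expected to look like 'label, f1, f2, f3, f4'.
    """
    rng = random.Random(seed)
    sampled = rng.sample(lines, n)  # sampling without replacement
    labels, features = [], []
    for row in sampled:
        label, rest = row.split(',', 1)  # first field vs. everything after it
        labels.append(label)
        features.append(rest)
    return labels, features


# Tiny stand-in input in the same layout as data/iris.csv.
rows = [
    '0.0, 5.100, 3.500, 1.400, 0.200',
    '1.0, 7.000, 3.200, 4.700, 1.400',
    '2.0, 6.300, 3.300, 6.000, 2.500',
]
labels, features = sample_and_split(rows, n=2, seed=0)
```

Keeping labels and features in matching order is what makes the later line-by-line comparison of predictions against true labels possible.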

In [36]:
# Upload input data from local filesystem to S3
batch_input_s3 = sagemaker_session.upload_data('batch_data/X', key_prefix=prefix + '/batch_input')

### Run Transform Job <a class="anchor" id="run_transform_job"></a>
Using the Transformer, run a transform job on the S3 input data. `content_type` indicates that the input `batch_input_s3` is of csv type.
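
To make the `text/csv` content type concrete, the sketch below shows how a CSV request body in the layout of these input files can be decoded into rows of floats. It is a conceptual illustration only and not the container's actual request handling; `parse_csv_payload` is a hypothetical helper.

```python
import csv
import io


def parse_csv_payload(payload):
    """Turn a text/csv request body into a list of rows of floats."""
    reader = csv.reader(io.StringIO(payload))
    # float() tolerates the leading space left by the ', ' separator.
    return [[float(value) for value in row] for row in reader if row]


payload = ' 5.100, 3.500, 1.400, 0.200\n 7.000, 3.200, 4.700, 1.400\n'
rows = parse_csv_payload(payload)
print(rows)
```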

In [37]:
# Start a transform job and wait for it to finish
transformer.transform(batch_input_s3, content_type='text/csv')
print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()

Waiting for transform job: sagemaker-scikit-learn-2020-08-06-23-35-2020-08-06-23-35-45-075
.....................2020-08-06 23:39:01,441 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)
2020-08-06 23:39:01,443 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)
2020-08-06 23:39:01,444 INFO - sagemaker-containers - nginx config: 
worker_processes auto;
daemon off;
pid /tmp/nginx.pid;
error_log  /dev/stderr;

worker_rlimit_nofile 4096;

events {
  worker_connections 2048;
}

http {
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  access_log /dev/stdout combined;

  upstream gunicorn {
    server unix:/tmp/gunicorn.sock;
  }

  server {
    listen 8080 deferred;
    client_max_body_size 0;

    keepalive_timeout 3;

    location ~ ^/(ping|invocations|execution-parameters) {
      proxy_set_header X-Forwarded-For $prox

### Check Output Data  <a class="anchor" id="check_output_data"></a>
After the transform job has completed, download the output data from S3. For each file "f" in the input data, we have a corresponding file "f.out" containing the predicted labels from each input row. We can compare the predicted labels to the true labels saved earlier.

In [38]:
# Download the output data from S3 to local filesystem
batch_output = transformer.output_path
!mkdir -p batch_data/output
!aws s3 cp --recursive $batch_output/ batch_data/output/
# Head to see what the batch output looks like
!head batch_data/output/*

download: s3://sagemaker-us-west-2-029454422462/sagemaker-scikit-learn-2020-08-06-23-35-2020-08-06-23-35-45-075/iris_sample_X_0.csv.out to batch_data/output/iris_sample_X_0.csv.out
download: s3://sagemaker-us-west-2-029454422462/sagemaker-scikit-learn-2020-08-06-23-35-2020-08-06-23-35-45-075/iris_sample_X_5.csv.out to batch_data/output/iris_sample_X_5.csv.out
download: s3://sagemaker-us-west-2-029454422462/sagemaker-scikit-learn-2020-08-06-23-35-2020-08-06-23-35-45-075/iris_sample_X_1.csv.out to batch_data/output/iris_sample_X_1.csv.out
download: s3://sagemaker-us-west-2-029454422462/sagemaker-scikit-learn-2020-08-06-23-35-2020-08-06-23-35-45-075/iris_sample_X_2.csv.out to batch_data/output/iris_sample_X_2.csv.out
download: s3://sagemaker-us-west-2-029454422462/sagemaker-scikit-learn-2020-08-06-23-35-2020-08-06-23-35-45-075/iris_sample_X_4.csv.out to batch_data/output/iris_sample_X_4.csv.out
download: s3://sagemaker-us-west-2-029454422462/sagemaker-scikit-learn-2020-08-06-23-35-2020-08

In [39]:
%%bash
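# For each sample file, compare the predicted labels from batch output to the true labels.
# Note: this loop covers samples 1 through 9, matching the recorded output below; use {0..9} to include sample 0 as well.
for i in {1..9}; do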
    diff -s batch_data/Y/iris_sample_Y_${i}.csv \
        <(cat batch_data/output/iris_sample_X_${i}.csv.out | sed 's/[["]//g' | sed 's/, \|]/\n/g') \
        | sed "s/\/dev\/fd\/63/batch_data\/output\/iris_sample_X_${i}.csv.out/"
done

Files batch_data/Y/iris_sample_Y_1.csv and batch_data/output/iris_sample_X_1.csv.out are identical
Files batch_data/Y/iris_sample_Y_2.csv and batch_data/output/iris_sample_X_2.csv.out are identical
Files batch_data/Y/iris_sample_Y_3.csv and batch_data/output/iris_sample_X_3.csv.out are identical
Files batch_data/Y/iris_sample_Y_4.csv and batch_data/output/iris_sample_X_4.csv.out are identical
Files batch_data/Y/iris_sample_Y_5.csv and batch_data/output/iris_sample_X_5.csv.out are identical
Files batch_data/Y/iris_sample_Y_6.csv and batch_data/output/iris_sample_X_6.csv.out are identical
Files batch_data/Y/iris_sample_Y_7.csv and batch_data/output/iris_sample_X_7.csv.out are identical
Files batch_data/Y/iris_sample_Y_8.csv and batch_data/output/iris_sample_X_8.csv.out are identical
Files batch_data/Y/iris_sample_Y_9.csv and batch_data/output/iris_sample_X_9.csv.out are identical
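
The comparison can also be done in Python, which makes it easy to report an accuracy figure instead of a plain identical/different verdict. The sketch assumes each `.out` file holds a single JSON list of predicted labels, which is inferred from the `sed` expressions above rather than confirmed; the helper names and the sample contents are made up for illustration.

```python
import json


def parse_predictions(out_text):
    """Parse the body of a .out file, assumed to be a JSON list of labels."""
    return [float(value) for value in json.loads(out_text)]


def parse_labels(label_text):
    """Parse a Y file with one label per line."""
    return [float(line) for line in label_text.splitlines() if line.strip()]


def accuracy(predicted, expected):
    """Fraction of positions where prediction and true label agree."""
    if len(predicted) != len(expected):
        raise ValueError('prediction and label counts differ')
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


# Made-up file contents for illustration.
out_text = '[0.0, 1.0, 2.0, 1.0]'
label_text = '0.0\n1.0\n2.0\n2.0\n'
score = accuracy(parse_predictions(out_text), parse_labels(label_text))
print(score)
```

Note that the model here is evaluated on rows it was trained on, so a perfect match says little about how it generalizes.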


### Inferencing job information
You can look up information about a completed batch transform job, such as its execution time, the model used, and the input data, from the [SageMaker console](https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/transform-jobs) or from boto3 using the [describe_transform_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.describe_transform_job) API.

In [44]:
import boto3
sm_client=boto3.client('sagemaker')
result=sm_client.describe_transform_job(
            TransformJobName=transformer.latest_transform_job.job_name)
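
One use of this response is to work out how long the job waited and how long it ran. The sketch below relies on the `CreationTime`, `TransformStartTime`, and `TransformEndTime` fields of the response; `TransformEndTime` is only present once the job has finished. `job_timings` is a hypothetical helper, and the example dictionary uses made-up timestamps instead of a real API response.

```python
from datetime import datetime, timedelta, timezone


def job_timings(description):
    """Return (queue_time, run_time) for a describe_transform_job response."""
    queued = description['TransformStartTime'] - description['CreationTime']
    ran = description['TransformEndTime'] - description['TransformStartTime']
    return queued, ran


# Stand-in response with made-up timestamps; a real response has many more keys.
example = {
    'CreationTime': datetime(2020, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
    'TransformStartTime': datetime(2020, 1, 1, 12, 1, 30, tzinfo=timezone.utc),
    'TransformEndTime': datetime(2020, 1, 1, 12, 5, 0, tzinfo=timezone.utc),
}
queued, ran = job_timings(example)
print(queued, ran)
```

With the notebook's own response, the call would be `job_timings(result)`.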

In [46]:
# model used
result['ModelName']

'sagemaker-scikit-learn-2020-08-06-23-35-42-199'

In [47]:
# Input data
result['TransformInput']

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
   'S3Uri': 's3://sagemaker-us-west-2-029454422462/Scikit-iris/batch_input'}},
 'ContentType': 'text/csv',
 'CompressionType': 'None',
 'SplitType': 'None'}

In [48]:
# Output result
result['TransformOutput']

{'S3OutputPath': 's3://sagemaker-us-west-2-029454422462/sagemaker-scikit-learn-2020-08-06-23-35-2020-08-06-23-35-45-075',
 'AssembleWith': 'None',
 'KmsKeyId': ''}

In [45]:
result

{'TransformJobName': 'sagemaker-scikit-learn-2020-08-06-23-35-2020-08-06-23-35-45-075',
 'TransformJobArn': 'arn:aws:sagemaker:us-west-2:029454422462:transform-job/sagemaker-scikit-learn-2020-08-06-23-35-2020-08-06-23-35-45-075',
 'TransformJobStatus': 'Completed',
 'ModelName': 'sagemaker-scikit-learn-2020-08-06-23-35-42-199',
 'TransformInput': {'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix',
    'S3Uri': 's3://sagemaker-us-west-2-029454422462/Scikit-iris/batch_input'}},
  'ContentType': 'text/csv',
  'CompressionType': 'None',
  'SplitType': 'None'},
 'TransformOutput': {'S3OutputPath': 's3://sagemaker-us-west-2-029454422462/sagemaker-scikit-learn-2020-08-06-23-35-2020-08-06-23-35-45-075',
  'AssembleWith': 'None',
  'KmsKeyId': ''},
 'TransformResources': {'InstanceType': 'ml.m5.xlarge', 'InstanceCount': 1},
 'CreationTime': datetime.datetime(2020, 8, 6, 23, 35, 45, 252000, tzinfo=tzlocal()),
 'TransformStartTime': datetime.datetime(2020, 8, 6, 23, 37, 2, tzinfo=tzlocal()

In [50]:
# More information about the model
model_result=sm_client.describe_model(ModelName=result['ModelName'])
model_result

{'ModelName': 'sagemaker-scikit-learn-2020-08-06-23-35-42-199',
 'PrimaryContainer': {'Image': '246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-scikit-learn:0.23-1-cpu-py3',
  'Mode': 'SingleModel',
  'ModelDataUrl': 's3://sagemaker-us-west-2-029454422462/Scikit-iris/output/model.tar.gz',
  'Environment': {'SAGEMAKER_CONTAINER_LOG_LEVEL': '20',
   'SAGEMAKER_ENABLE_CLOUDWATCH_METRICS': 'false',
   'SAGEMAKER_PROGRAM': 'sklearn_iris_serving.py',
   'SAGEMAKER_REGION': 'us-west-2',
   'SAGEMAKER_SUBMIT_DIRECTORY': 's3://sagemaker-us-west-2-029454422462/sagemaker-scikit-learn-2020-08-06-23-35-41-900/sourcedir.tar.gz'}},
 'ExecutionRoleArn': 'arn:aws:iam::029454422462:role/service-role/AmazonSageMaker-ExecutionRole-20191112T221060',
 'CreationTime': datetime.datetime(2020, 8, 6, 23, 35, 42, 404000, tzinfo=tzlocal()),
 'ModelArn': 'arn:aws:sagemaker:us-west-2:029454422462:model/sagemaker-scikit-learn-2020-08-06-23-35-42-199',
 'EnableNetworkIsolation': False,
 'ResponseMetadata': {'R