# Bring your own Scikit-learn model to SageMaker for Batch Transform
This tutorial shows you how to bring your [Scikit-learn](https://scikit-learn.org/stable/) models to SageMaker so that you can host and make inferences using SageMaker infrastructure. Scikit-learn is a popular Python machine learning framework. It includes a number of different algorithms for classification, regression, clustering, dimensionality reduction, and data/feature pre-processing. 

The [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) module makes it easy to take an existing scikit-learn model and generate predictions using the SageMaker hosting and inference service. For more information about the Scikit-learn container, see the [sagemaker-scikit-learn-containers](https://github.com/aws/sagemaker-scikit-learn-container) repository and the [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk) repository. Note that the scikit-learn version used for training has to match that of the provided container.

For more on Scikit-learn, please visit the Scikit-learn website: <http://scikit-learn.org/stable/>.

### Table of contents
* [Upload the data for training](#upload_data)
* [Batch Transform](#batch_transform)
 * [Prepare Input Data](#prepare_input_data)
 * [Run Transform Job](#run_transform_job)
 * [Check Output Data](#check_output_data)

In [None]:
!pip install -U scikit-learn==0.20.0

First, let's create our SageMaker session and role, and create an S3 prefix to use for the notebook example.

In [None]:
# S3 prefix
prefix = 'Scikit-two-outputs'

import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

# Get a SageMaker-compatible role used by this Notebook Instance.
role = get_execution_role()

## Upload the data for training <a class="anchor" id="upload_data"></a>

When training large models with huge amounts of data, you'll typically use big data tools, like Amazon Athena, AWS Glue, or Amazon EMR, to create your data in S3. For the purposes of this example, we create a sample classification dataset in memory. Later in the notebook, we write it to a local CSV file and upload it to S3 as the batch transform input.

In [None]:
import numpy as np
import os
from scipy import sparse
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=90000, n_features=100, random_state=0)

Once we have the dataset built up, we can start the sklearn ML training. 

In [None]:
import sklearn
print(sklearn.__version__)
from sklearn import tree

Let's try a simple [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

In [None]:
clf = tree.DecisionTreeClassifier(max_leaf_nodes=30)
clf = clf.fit(X, y)

Once the classifier `clf` is trained, we can serialize it with `pickle`, which works for fitted scikit-learn estimators. The model then needs to be archived into a file named `model.tar.gz` and saved to S3 to be used as a SageMaker model.

In [None]:
import pickle
with open('model.pkl', 'wb') as f:  # close the file before archiving it
    pickle.dump(clf, f)
!tar -czf model.tar.gz model.pkl

In [None]:
!tar -tzvf model.tar.gz

In [None]:
model='s3://%s/%s/output/model.tar.gz' % (sagemaker_session.default_bucket(), prefix)
!aws s3 cp model.tar.gz {model}
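
The archive layout matters because the `model_fn` defined later in this notebook looks for `model.pkl` directly inside the model directory. The following is a minimal sketch, using only the standard library and a plain dictionary as a hypothetical stand-in for the fitted classifier, of how one might confirm that the pickle sits at the archive root and survives a round trip. The temporary directory and variable names are illustrative assumptions.

```python
import os
import pickle
import tarfile
import tempfile

# Hypothetical stand-in for a fitted estimator; any picklable object works here.
stand_in_model = {"max_leaf_nodes": 30}

workdir = tempfile.mkdtemp()
pkl_path = os.path.join(workdir, "model.pkl")
archive_path = os.path.join(workdir, "model.tar.gz")

# Serialize the object, closing the file before it is archived.
with open(pkl_path, "wb") as f:
    pickle.dump(stand_in_model, f)

# arcname keeps the pickle at the archive root, with no leading directories.
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add(pkl_path, arcname="model.pkl")

# List the members, then read the pickle back to confirm the round trip.
with tarfile.open(archive_path, "r:gz") as tar:
    member_names = tar.getnames()
    extracted = tar.extractfile("model.pkl")
    restored = pickle.load(extracted)

print(member_names)
print(restored == stand_in_model)
```

The same check could be run against the real `model.tar.gz` by opening it with `tarfile.open` and loading the pickle in an environment whose scikit-learn version matches the one used for training.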

To get inferences for an entire dataset, use batch transform. With batch transform, you create a batch transform job using a trained model and the dataset, which must be stored in Amazon S3. Amazon SageMaker saves the inferences in an S3 bucket that you specify when you create the batch transform job. Batch transform manages all of the compute resources required to get inferences. This includes launching instances and deleting them after the batch transform job has completed. Batch transform manages interactions between the data and the model with an object within the instance node called an agent.

Use batch transform when you:

- Want to get inferences for an entire dataset and index them to serve inferences in real time

- Don't need a persistent endpoint that applications (for example, web or mobile apps) can call to get inferences

- Don't need the subsecond latency that Amazon SageMaker hosted endpoints provide

You can also use batch transform to preprocess your data before using it to train a new model or generate inferences.

The following diagram shows [the workflow of a batch transform job](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html):

![image](https://docs.aws.amazon.com/sagemaker/latest/dg/images/batch-transform-v2.png)

Now we have the classifier packaged up and are almost ready to use it as a SageMaker model for inference. We also need the serving code that goes along with the `model.tar.gz` archive. The serving script that runs inside the sklearn serving container requires an input argument `--model-dir` so that SageMaker can pass in the model location, and a function `model_fn` to deserialize and return the fitted classifier. We create the script below and save it as `sklearn_serving.py`.


Additionally, we need to customize `predict_fn` and `output_fn` to accommodate the multi-value output. `predict_fn` needs to return all the values in a numpy array of shape (m, n), where m is the number of data points and n is the number of prediction outputs. For example, if we want the batch transform to output the hard label plus the probability for each class (2 values in the binary classification case), the output file should contain 3 values for each data point. We concatenate `clf.predict()` and `clf.predict_proba()` using `np.hstack()` in `predict_fn`, as shown below. SageMaker Batch Transform then passes the numpy array returned by `predict_fn` to `output_fn`, which constructs the final output based on the `accept` type, a user input to `sklearn_model.transformer`.




In [None]:
clf.predict(X)

In [None]:
clf.predict_proba(X)

In [None]:
out = np.hstack([clf.predict(X).reshape(-1, 1), clf.predict_proba(X)])

In [None]:
out.shape

In [None]:
%%writefile sklearn_serving.py
from __future__ import print_function

import argparse
import os
import pickle
import numpy as np
import json
from sagemaker_containers.beta.framework import worker, encoders

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # SageMaker-specific arguments. Defaults are set in the environment variables.
    # parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    
def model_fn(model_dir):
    """Deserialize and return the fitted model.

    Note that the file name must match the name used when the model was pickled (model.pkl).
    """
    with open(os.path.join(model_dir, "model.pkl"), 'rb') as f:
        clf = pickle.load(f)
    return clf

def predict_fn(input_data, model):
    """Make inference against the model

    We would like to return the probability in addition to the hard classification. 
    Note that the output should be of np.array type which is what the downstream function
    output_fn expects.
    """
    prediction = model.predict(input_data)
    pred_prob = model.predict_proba(input_data)
    return np.hstack([prediction.reshape(-1, 1), pred_prob])

def output_fn(prediction, accept):
    """Format prediction output
    
    The default accept/content-type between containers for serial inference is JSON.
    We also want to set the ContentType or mimetype as the same value as accept so the next
    container can read the response payload correctly.
    """
    if accept == "application/json":
        instances = []
        for row in prediction.tolist():
            instances.append({"prediction": row})

        json_output = {"instances": instances}

        return worker.Response(json.dumps(json_output), mimetype=accept)
    elif accept == 'text/csv':
        return worker.Response(encoders.encode(prediction, accept), mimetype=accept)
    else:
        raise RuntimeError("{} accept type is not supported by this script.".format(accept))
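
To make the JSON branch of `output_fn` concrete, here is a small sketch of the payload it would build for two data points. The numbers are made up for illustration, and the hypothetical helper `to_json_payload` only mirrors the list-building logic of the script above; it does not use `worker.Response` or any SageMaker library.

```python
import json

def to_json_payload(rows):
    """Mirror the JSON branch of output_fn for a list of prediction rows.

    Each row is [hard_label, prob_class_0, prob_class_1], that is, one row of
    the stacked array from predict_fn after .tolist() has been applied.
    """
    instances = [{"prediction": row} for row in rows]
    return json.dumps({"instances": instances})

# Made-up values for two data points (not real model output).
sample_rows = [[0.0, 0.9, 0.1], [1.0, 0.2, 0.8]]
payload = to_json_payload(sample_rows)
print(payload)
```

With `accept='text/csv'`, which is what this notebook uses, each row would instead be written as one comma-separated line.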

Now we are ready to create a SageMaker Model for the sklearn model. We need the model location (S3 path), the SageMaker execution role, the serving script as `entry_point`, and the sklearn version number.

In [None]:
from sagemaker.sklearn.model import SKLearnModel

In [None]:
sklearn_model = SKLearnModel(model_data=model,
                             role=role,
                             entry_point="sklearn_serving.py",
                             framework_version="0.20.0")

## Preparing input data <a class="anchor" id="prepare_input_data"></a>
We write the feature matrix `X` to a local CSV file and upload it to S3 as the input for the transform job.

In [None]:
os.makedirs('./data', exist_ok=True)  # np.savetxt does not create missing directories
np.savetxt('./data/x_dense.csv', X, delimiter=',')

In [None]:
!ls -lhrt ./data/

In [None]:
# Upload input data from local filesystem to S3
test_csv_input_s3 = sagemaker_session.upload_data('./data/x_dense.csv', key_prefix=prefix + '/x_csv')

## Batch Transform <a class="anchor" id="batch_transform"></a>

### Run Transform Job with default <a class="anchor" id="run_transform_job"></a>
Using the Transformer, run a transform job on the S3 input data. When creating the transformer, we specify `accept='text/csv'`, which instructs the job to produce CSV output. `content_type` in the `.transform()` call indicates that the input `test_csv_input_s3` is CSV. We also instruct the transformer to split the input data by `'Line'` and to assemble the results back into a single CSV by `'Line'`.
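
Conceptually, splitting by `'Line'` and assembling with `'Line'` means the input file is cut at newline boundaries into requests that fit within the payload limit, and the per-request responses are joined back with newlines. The sketch below simulates that idea in plain Python. It is an illustration under simplifying assumptions, not SageMaker's implementation: `split_by_line`, `fake_model`, and the tiny byte limit (standing in for the megabyte-scale `max_payload`) are hypothetical, and how records are actually grouped into requests depends on the batch strategy.

```python
def split_by_line(text, max_bytes):
    """Group whole lines into chunks whose encoded size stays within max_bytes."""
    chunks, current, size = [], [], 0
    for line in text.splitlines():
        line_size = len(line.encode("utf-8")) + 1  # +1 for the newline
        if current and size + line_size > max_bytes:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(line)
        size += line_size
    if current:
        chunks.append("\n".join(current))
    return chunks

def fake_model(chunk):
    """Hypothetical stand-in for the container: one output line per input line."""
    return "\n".join(str(len(line.split(","))) for line in chunk.splitlines())

csv_text = "1,2,3\n4,5,6\n7,8,9\n10,11,12"
chunks = split_by_line(csv_text, max_bytes=12)       # tiny limit to force several chunks
responses = [fake_model(chunk) for chunk in chunks]  # one response per request
assembled = "\n".join(responses)                     # join responses line by line

print(len(chunks), "chunks")
print(assembled)
```

The point of the simulation is that rows are never cut in the middle and the assembled output keeps one line per input line, in the original order.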

### max_payload = 6, split_type = 'Line'

In [None]:
transformer = sklearn_model.transformer(instance_count=1, instance_type='ml.m5.xlarge', 
                                         max_payload = 6, accept = 'text/csv', assemble_with = 'Line')
# Start a transform job
transformer.transform(test_csv_input_s3, content_type='text/csv', split_type = 'Line')

### Check Output Data  <a class="anchor" id="check_output_data"></a>
After the transform job has completed, download the output data from S3. For each file "f" in the input data, there is a corresponding file "f.out" containing one line per input row, with the predicted label followed by the class probabilities. We can compare the predicted labels to the true labels in `y`.

In [None]:
# Download the output data from S3 to local filesystem
batch_output = transformer.output_path
!mkdir -p batch_data/output
!aws s3 cp --recursive $batch_output/ batch_data/output/
# Head to see what the batch output looks like
!head batch_data/output/*
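
Since each output line holds the hard label followed by the two class probabilities, the file can be parsed with the standard `csv` module and compared against the true labels. The sketch below uses a short made-up string in place of the downloaded `.out` file and made-up true labels in place of `y`; `parse_output` and `accuracy` are hypothetical helper names.

```python
import csv
import io

def parse_output(text):
    """Split each CSV row into (hard_label, [class probabilities])."""
    labels, probabilities = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        values = [float(v) for v in row]
        labels.append(int(values[0]))
        probabilities.append(values[1:])
    return labels, probabilities

def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Made-up sample standing in for the contents of a downloaded .out file.
sample_output = "1.0,0.1,0.9\n0.0,0.8,0.2\n1.0,0.3,0.7\n"
true_labels = [1, 0, 0]  # made-up stand-in for the first rows of y

labels, probabilities = parse_output(sample_output)
print(labels)
print(accuracy(labels, true_labels))
```

To apply this to the real job, one would read the text of the downloaded output file and pass `y` as the true labels. Keep in mind that the classifier was trained on the same rows, so the resulting accuracy reflects training fit rather than generalization.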