# 1. Train a scikit-learn Model Using a Custom Training Script

* Goals:
    * Walk through a basic example of training a custom model in a Studio Notebook.
    * Introduce the `sagemaker.sklearn.estimator.SKLearn` class for end-to-end training of custom Scikit-learn models.

* Code adapted from the [scikit_learn_iris](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/scikit_learn_iris) sample notebook.

---
## 1. Setup

Change into the notebooks directory

In [26]:
%cd /root/sagemaker-workshop-420/notebooks

/root/sagemaker-workshop-420/notebooks


In [27]:
import os

import boto3
import numpy as np
import sagemaker
from sagemaker import get_execution_role
from sklearn import datasets

Set up the S3 and local paths used in this notebook.

In [28]:
BUCKET = 'sagemaker-workshop-420'
PREFIX = 'iris'

LOCAL_DATA_DIRECTORY = f'../data/{PREFIX}'

print(f"Artifacts will be written to s3://{BUCKET}/{PREFIX}/ .")

Artifacts will be written to s3://sagemaker-workshop-420/iris/ .


Create the boto3 session, the SageMaker session, and the execution role.

In [29]:
boto_session = boto3.Session()
region = boto_session.region_name
sagemaker_session = sagemaker.Session()
role = get_execution_role()
print(role)

arn:aws:iam::209970524256:role/service-role/AmazonSageMaker-ExecutionRole-20200414T065516


## 2. Create a dataset and upload to S3 for training

We're using a sample of the classic [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), which is included with Scikit-learn. We will load the dataset, write it to a local CSV file, then upload that file to S3 for training.

In [30]:
# Load Iris dataset, then join labels and features
iris = datasets.load_iris()
joined_iris = np.insert(iris.data, 0, iris.target, axis=1)

# Create directory and write csv
os.makedirs(LOCAL_DATA_DIRECTORY, exist_ok=True)
np.savetxt(f'{LOCAL_DATA_DIRECTORY}/iris.csv', joined_iris, delimiter=',',
           fmt='%1.1f, %1.3f, %1.3f, %1.3f, %1.3f')
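
The resulting file has no header row; each line holds the class label first, followed by the four feature values. A minimal sketch of reading that layout back with the standard library follows. The two sample rows and the `split_label_and_features` helper are illustrative and not part of the original notebook.

```python
import csv
import io

# Two illustrative rows in the same layout as iris.csv:
# label first, then four features.
sample_csv = "0.0, 5.100, 3.500, 1.400, 0.200\n1.0, 7.000, 3.200, 4.700, 1.400\n"


def split_label_and_features(row):
    """Split one CSV row into (label, features), assuming the label is column 0."""
    values = [float(cell) for cell in row]  # float() tolerates the leading spaces
    return int(values[0]), values[1:]


rows = list(csv.reader(io.StringIO(sample_csv)))
parsed = [split_label_and_features(row) for row in rows]
for label, features in parsed:
    print(label, features)
```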

Once we have the data locally, we can use the tools provided by the SageMaker Python SDK to upload the data to S3.

In [31]:
train_input = sagemaker_session.upload_data(
    f'{LOCAL_DATA_DIRECTORY}/iris.csv',
    bucket=BUCKET,
    key_prefix=PREFIX)
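
`upload_data` returns the S3 URI of the uploaded object, which is passed to `fit()` later in this notebook. The sketch below shows the URI this call is expected to produce, assuming the single-file case where the object key is the prefix followed by the file name. `expected_s3_uri` is a hypothetical helper for illustration, not an SDK function.

```python
import posixpath


def expected_s3_uri(bucket, key_prefix, local_path):
    """Build the URI a single-file upload is assumed to end up at."""
    filename = posixpath.basename(local_path)
    return f"s3://{bucket}/{key_prefix}/{filename}"


uri = expected_s3_uri("sagemaker-workshop-420", "iris", "../data/iris/iris.csv")
print(uri)
```

Printing `train_input` in the notebook is a quick way to confirm the actual value.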

## 3. Create a custom Scikit-learn script to train a model

SageMaker can run a scikit-learn script using the `SKLearn` estimator. When the script is executed on SageMaker, a number of helpful environment variables expose properties of the training environment, such as:

* `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to. Any artifacts saved in this folder are uploaded to S3 for model hosting after the training job completes.
* `SM_OUTPUT_DIR`: A string representing the filesystem path to write output artifacts to. Output artifacts may include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed and uploaded to S3 to the same S3 prefix as the model artifacts.

Supposing two input channels, 'train' and 'test', were used in the call to the `SKLearn` estimator's `fit()` method, the following environment variables will be set, following the format `SM_CHANNEL_[channel_name]`:

* `SM_CHANNEL_TRAIN`: A string representing the path to the directory containing data in the 'train' channel.
* `SM_CHANNEL_TEST`: Same as above, but for the 'test' channel.
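
Inside a training script these are ordinary environment variables. A minimal sketch of looking them up is shown here. The `channel_dir` helper, the simulated environment, and the `/opt/ml/...` fallback path are assumptions made so the example runs outside a training container; the notebook itself does not define them.

```python
import os


def channel_dir(channel_name, environ=os.environ):
    """Return the directory for an input channel, following SM_CHANNEL_[channel_name]."""
    variable = f"SM_CHANNEL_{channel_name.upper()}"
    # Fallback used only when the variable is unset (for example, a local dry run).
    return environ.get(variable, f"/opt/ml/input/data/{channel_name}")


# Simulated environment for a job started with a single 'train' channel.
fake_environ = {
    "SM_CHANNEL_TRAIN": "/opt/ml/input/data/train",
    "SM_MODEL_DIR": "/opt/ml/model",
}

train_dir = channel_dir("train", fake_environ)
model_dir = fake_environ["SM_MODEL_DIR"]
print(train_dir, model_dir)
```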

A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves the model to `model_dir` so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an `argparse.ArgumentParser` instance. For example, the script that we will run in this notebook is shown below:

In [32]:
!pygmentize '../scripts/sklearn_iris.py'

#  Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
#  Licensed under the Apache License, Version 2.0 (the "License").
#  You may not use this file except in compliance with the License.
#  A copy of the License is located at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  or in the "license" file accompanying this file. This file is distributed
#  on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
#  express or implied. See the License for the specific language governing
#  permissions and limitations under the License.

from __future__ import print_function

import argparse
import os
import

Because the Scikit-learn container imports your training script, you should always put your training code in a main guard (`if __name__ == '__main__':`) so that the container does not inadvertently run your training code at the wrong point in execution.
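
A minimal skeleton showing these pieces together: hyperparameters parsed with `argparse`, directories defaulted from the environment variables, and a main guard. This is a sketch rather than the contents of `sklearn_iris.py`. The `parse_args` and `train` function names are hypothetical, and it assumes hyperparameters arrive as `--name value` command-line arguments.

```python
import argparse
import os


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hyperparameter, matching the key used later in this notebook.
    parser.add_argument("--max_leaf_nodes", type=int, default=None)
    # Directories default to the SageMaker environment variables when set.
    parser.add_argument("--model-dir", type=str, default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
    # parse_known_args ignores any extra arguments instead of exiting.
    args, _unknown = parser.parse_known_args(argv)
    return args


def train(args):
    # Placeholder: a real script would load data from args.train, fit a model,
    # and save it under args.model_dir.
    return f"would train with max_leaf_nodes={args.max_leaf_nodes}"


if __name__ == "__main__":
    # Runs only when the file is executed as a script, not when it is imported.
    print(train(parse_args()))
```

Keeping the work inside functions means importing the module has no side effects, which is the property the main guard is meant to protect.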

For more information about training environment variables, please visit https://github.com/aws/sagemaker-containers.

## 4. Create SageMaker Scikit Estimator

To run our Scikit-learn training script on SageMaker, we construct a `sagemaker.sklearn.estimator.SKLearn` estimator, which accepts several constructor arguments:

* __entry_point__: The path to the Python script SageMaker runs for training and prediction.
* __role__: The ARN of the IAM role that SageMaker uses to run the training job.
* __train_instance_type__ *(optional)*: The type of SageMaker instances for training. __Note__: Because Scikit-learn does not natively support GPU training, SageMaker Scikit-learn does not currently support training on GPU instance types.
* __sagemaker_session__ *(optional)*: The session used to train on SageMaker.
* __hyperparameters__ *(optional)*: A dictionary of hyperparameters passed to the training script.
* __code_location__ *(optional)*: The S3 prefix URI where custom code will be uploaded.
* __output_path__ *(optional)*: S3 location for saving the training result (model artifacts and output files).

The source code for the `SKLearn` estimator is available at https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/sklearn

In [33]:
from sagemaker.sklearn.estimator import SKLearn

sklearn_estimator = SKLearn(
    entry_point='../scripts/sklearn_iris.py',
    train_instance_type="ml.c4.xlarge",
    code_location=f"s3://{BUCKET}/{PREFIX}",
    output_path=f"s3://{BUCKET}/{PREFIX}",
    role=role,
    sagemaker_session=sagemaker_session,
    hyperparameters={'max_leaf_nodes': 30})

## 5. Fit the SKLearn Estimator on Iris data

Training is very simple: just call `fit` on the Estimator. This starts a SageMaker training job that downloads the data for us, invokes our scikit-learn code (in the provided script file), and saves any model artifacts that the script creates.

In [34]:
sklearn_estimator.fit({'train': train_input})

2020-04-14 11:31:11 Starting - Starting the training job...
2020-04-14 11:31:14 Starting - Launching requested ML instances...
2020-04-14 11:32:11 Starting - Preparing the instances for training......
2020-04-14 11:33:09 Downloading - Downloading input data...
2020-04-14 11:33:34 Training - Downloading the training image..
2020-04-14 11:33:48,337 sagemaker-containers INFO     Imported framework sagemaker_sklearn_container.training
2020-04-14 11:33:48,340 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-04-14 11:33:48,350 sagemaker_sklearn_container.training INFO     Invoking user training script.
2020-04-14 11:33:48,641 sagemaker-containers INFO     Module sklearn_iris does not provide a setup.py.
Generating setup.py
2020-04-14 11:33:48,641 sagemaker-containers INFO     Generating setup.cfg
2020-04-14 11:33:48,641 sagemaker-containers INFO     Generating MANIFEST.in
2020-04-14 11:33:48,

## 6. View the training job details

Head to the URL in the following cell to view the details of this training job.

In [35]:
f"https://{region}.console.aws.amazon.com/sagemaker/home?region={region}#/jobs"

'https://us-east-2.console.aws.amazon.com/sagemaker/home?region=us-east-2#/jobs'
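
The URL above opens the list of training jobs. To link to a single job directly, a sketch along these lines may help. It assumes the console accepts the job name appended to the `#/jobs` fragment; `training_job_url` and the job name shown are hypothetical. In the notebook, the name would typically come from `sklearn_estimator.latest_training_job.name`.

```python
def training_job_url(region, job_name=None):
    """Build a console URL for the job list, or for one job if a name is given."""
    base = f"https://{region}.console.aws.amazon.com/sagemaker/home?region={region}#/jobs"
    return f"{base}/{job_name}" if job_name else base


# "example-job-name" is a placeholder, not a real training job.
url = training_job_url("us-east-2", "example-job-name")
print(url)
```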