## Amazon SageMaker Awesome Builder Notebook

We use Amazon SageMaker Processing jobs, which provide a simplified, managed way to run data pre-processing, post-processing, and model evaluation workloads on the Amazon SageMaker platform.

A processing job downloads input from Amazon Simple Storage Service (Amazon S3), then uploads outputs to Amazon S3 during or after the processing job.

This notebook runs a processing job using a scikit-learn script that cleans and pre-processes the input data, performs feature engineering, and splits the data into train and test sets.

The dataset is the PAMAP2 Physical Activity Monitoring dataset from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/PAMAP2+Physical+Activity+Monitoring

We select features from this dataset, clean the data, transform it into features that the training algorithm can use to train a multi-class classification model, and split it into train and test sets.

After training a logistic regression model, you evaluate the model against a hold-out test dataset, and save the classification evaluation metrics, including precision, recall, and F1 score for each label, and accuracy and ROC AUC for the model.

## Data pre-processing and feature engineering

To run the scikit-learn preprocessing script as a processing job, create a `SKLearnProcessor`, which lets you run scripts inside of processing jobs using the scikit-learn image provided.

In [1]:
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

region = boto3.session.Session().region_name

role = get_execution_role()
sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     instance_type='ml.m5.xlarge',
                                     instance_count=1)

print('Region: {}'.format(region))
print('Role: {}'.format(role))

Region: us-east-1
Role: arn:aws:iam::572539092864:role/service-role/AmazonSageMaker-ExecutionRole-20200407T174741


Before writing the script we will use for data cleaning, pre-processing, and feature engineering, we inspect the first 10 rows (observations) of the dataset. 

The label for each observation is the `activity_id` column. The primary features you select from the dataset are `heart_rate`, `imu_wrist_temp` (temperature in degrees Celsius), and three sets of inertial measurement unit (IMU) data. The IMU data is collected from three separate sensors located on wearable devices, like a smartwatch, and the heart rate comes from a separate heart rate monitor.

The dataset can be used for activity recognition and intensity estimation, while developing and applying algorithms of data processing, segmentation, feature extraction and classification.

**Sensors**

- 3 Colibri wireless inertial measurement units (IMU):
  - sampling frequency: 100 Hz
  - position of the sensors:
    - 1 IMU over the wrist on the dominant arm
    - 1 IMU on the chest
    - 1 IMU on the dominant side's ankle
- HR-monitor:
  - sampling frequency: ~9 Hz
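
Because the heart rate monitor samples at roughly 9 Hz while the IMUs sample at 100 Hz, only a small share of rows can carry a heart rate reading. The following back-of-the-envelope check uses only the two sampling rates listed above; the variable names are illustrative.

```python
# Sampling rates taken from the sensor description above.
imu_hz = 100.0  # one row per IMU sample
hr_hz = 9.0     # heart rate monitor, approximately 9 Hz

# At most hr_hz / imu_hz of the rows can contain a heart rate value.
expected_hr_fraction = hr_hz / imu_hz
rows_per_hr_reading = imu_hz / hr_hz

print('Expected share of rows with heart_rate: {:.1%}'.format(expected_hr_fraction))
print('Roughly one heart_rate reading every {:.0f} rows'.format(rows_per_hr_reading))
```

The `df.info()` output later in this notebook reports 34,388 non-null `heart_rate` values out of 376,416 rows (about 9.1%), which is consistent with this estimate.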

In [30]:
import pandas as pd
import numpy as np

bucket_name = 'octank-smartwatch-data'
s3prefix = 'sagemaker-train'

input_data_uri = 's3://{}/{}/subject101.csv'.format(bucket_name, s3prefix)

print('S3 input URI: {}'.format(input_data_uri))

S3 input URI: s3://octank-smartwatch-data/sagemaker-train/subject101.csv


In [31]:
column_names = ['timestamp', 'activity_id', 'heart_rate', 'imu_wrist_temp', 
  'imu_wrist_accel16_x', 'imu_wrist_accel16_y', 'imu_wrist_accel16_z', 
  'imu_wrist_accel6_x', 'imu_wrist_accel6_y', 'imu_wrist_accel6_z', 
  'imu_wrist_gyro_x', 'imu_wrist_gyro_y', 'imu_wrist_gyro_z', 
  'imu_wrist_magnet_x', 'imu_wrist_magnet_y', 'imu_wrist_magnet_z', 
  'imu_wrist_orient1', 'imu_wrist_orient2', 'imu_wrist_orient3', 'imu_wrist_orient4', 
  'imu_chest_temp', 'imu_chest_accel16_x', 'imu_chest_accel16_y', 'imu_chest_accel16_z', 
  'imu_chest_accel6_x', 'imu_chest_accel6_y', 'imu_chest_accel6_z', 
  'imu_chest_gyro_x', 'imu_chest_gyro_y', 'imu_chest_gyro_z', 
  'imu_chest_magnet_x', 'imu_chest_magnet_y', 'imu_chest_magnet_z', 
  'imu_chest_orient1', 'imu_chest_orient2', 'imu_chest_orient3', 'imu_chest_orient4', 
  'imu_ankle_temp', 'imu_ankle_accel16_x', 'imu_ankle_accel16_y', 'imu_ankle_accel16_z', 
  'imu_ankle_accel6_x', 'imu_ankle_accel6_y', 'imu_ankle_accel6_z', 
  'imu_ankle_gyro_x', 'imu_ankle_gyro_y', 'imu_ankle_gyro_z', 
  'imu_ankle_magnet_x', 'imu_ankle_magnet_y', 'imu_ankle_magnet_z', 
  'imu_ankle_orient1', 'imu_ankle_orient2', 'imu_ankle_orient3', 'imu_ankle_orient4']

df = pd.read_csv(input_data_uri, header=0, names=column_names, index_col=False)

In [32]:
# inspect DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 376416 entries, 0 to 376415
Data columns (total 54 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   timestamp            376416 non-null  float64
 1   activity_id          376416 non-null  int64  
 2   heart_rate           34388 non-null   float64
 3   imu_wrist_temp       374962 non-null  float64
 4   imu_wrist_accel16_x  374962 non-null  float64
 5   imu_wrist_accel16_y  374962 non-null  float64
 6   imu_wrist_accel16_z  374962 non-null  float64
 7   imu_wrist_accel6_x   374962 non-null  float64
 8   imu_wrist_accel6_y   374962 non-null  float64
 9   imu_wrist_accel6_z   374962 non-null  float64
 10  imu_wrist_gyro_x     374962 non-null  float64
 11  imu_wrist_gyro_y     374962 non-null  float64
 12  imu_wrist_gyro_z     374962 non-null  float64
 13  imu_wrist_magnet_x   374962 non-null  float64
 14  imu_wrist_magnet_y   374962 non-null  float64
 15  imu_wrist_magnet_

In [36]:
print(df.shape)
df.head(n=10)

(376416, 54)


Unnamed: 0,timestamp,activity_id,heart_rate,imu_wrist_temp,imu_wrist_accel16_x,imu_wrist_accel16_y,imu_wrist_accel16_z,imu_wrist_accel6_x,imu_wrist_accel6_y,imu_wrist_accel6_z,...,imu_ankle_gyro_x,imu_ankle_gyro_y,imu_ankle_gyro_z,imu_ankle_magnet_x,imu_ankle_magnet_y,imu_ankle_magnet_z,imu_ankle_orient1,imu_ankle_orient2,imu_ankle_orient3,imu_ankle_orient4
0,8.39,0,,30.0,2.18837,8.5656,3.66179,2.39494,8.55081,3.64207,...,-0.006577,-0.004638,0.000368,-59.8479,-38.8919,-58.5253,1.0,0.0,0.0,0.0
1,8.4,0,,30.0,2.37357,8.60107,3.54898,2.30514,8.53644,3.7328,...,0.003014,0.000148,0.022495,-60.7361,-39.4138,-58.3999,1.0,0.0,0.0,0.0
2,8.41,0,,30.0,2.07473,8.52853,3.66021,2.33528,8.53622,3.73277,...,0.003175,-0.020301,0.011275,-60.4091,-38.7635,-58.3956,1.0,0.0,0.0,0.0
3,8.42,0,,30.0,2.22936,8.83122,3.7,2.23055,8.59741,3.76295,...,0.012698,-0.014303,-0.002823,-61.5199,-39.3879,-58.2694,1.0,0.0,0.0,0.0
4,8.43,0,,30.0,2.29959,8.82929,3.5471,2.26132,8.65762,3.77788,...,-0.006089,-0.016024,0.00105,-60.2954,-38.8778,-58.3977,1.0,0.0,0.0,0.0
5,8.44,0,,30.0,2.33738,8.829,3.54767,2.27703,8.77828,3.7323,...,-0.031973,-0.053934,0.015594,-60.6307,-38.8676,-58.2711,1.0,0.0,0.0,0.0
6,8.45,0,,30.0,2.37142,9.055,3.39347,2.39786,8.89814,3.64131,...,-0.019643,-0.039937,-0.000785,-60.5171,-38.9819,-58.2733,1.0,0.0,0.0,0.0
7,8.46,0,,30.0,2.33951,9.13251,3.54668,2.44371,8.98841,3.62596,...,0.013747,-0.010042,0.017701,-61.2916,-39.6182,-58.1499,1.0,0.0,0.0,0.0
8,8.47,0,,30.0,2.25966,9.09415,3.43015,2.42877,9.01871,3.61081,...,0.007649,-0.013923,0.014498,-60.8509,-39.0821,-58.1478,1.0,0.0,0.0,0.0
9,8.48,0,104.0,30.0,2.29745,8.9045,3.46984,2.39736,8.94335,3.53551,...,0.0789,0.002283,0.020352,-61.5302,-38.724,-58.386,1.0,0.0,0.0,0.0


### Here are the raw labels, activity_ids, coming in from the dataset

List of activity IDs and corresponding activities:

- 0 = other (transient activities)
- 1 = lying
- 2 = sitting
- 3 = standing
- 4 = walking
- 5 = running
- 6 = cycling
- 7 = Nordic walking
- 9 = watching TV
- 10 = computer work
- 11 = car driving
- 12 = ascending stairs
- 13 = descending stairs
- 16 = vacuum cleaning
- 17 = ironing
- 18 = folding laundry
- 19 = house cleaning
- 20 = playing soccer
- 24 = rope jumping


In [16]:
# Note: unless we impute the heart_rate data, dropping NaN rows discards most of the dataset
# len(df) counts rows; df.size would count rows x columns
print('Number of rows BEFORE drop: {}'.format(len(df)))
df.dropna(inplace=True)
# df.drop_duplicates(inplace=True)
print('Number of rows AFTER drop: {}'.format(len(df)))

Number of rows BEFORE drop: 376416
Number of rows AFTER drop: 34089
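
The comment in the cell above notes that dropping rows with missing values discards most of the data unless `heart_rate` is imputed. One common option is to carry the last reading forward until the next one arrives. The sketch below shows the idea without any library; the function name `forward_fill` and the sample values are made up for illustration, and this notebook does not apply the technique.

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value.

    Leading None entries (before the first reading) are left as None.
    """
    filled = []
    last_seen = None
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


# Made-up heart rate column: readings arrive far less often than IMU rows.
heart_rate = [None, None, 104.0, None, None, None, 105.0, None]
filled = forward_fill(heart_rate)
print(filled)
```

In pandas, the same idea is available as `df['heart_rate'].ffill()`. Rows before the first reading would still be missing and would need to be dropped or filled another way.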


In [38]:
# We can replace the raw activity_id (integers) with their label meanings

raw_activity_ids = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]

activity_labels = ['unassigned', 'lying', 'sitting', 'standing ', 'walking', 'running', 'cycling', 
                'nordic_walking ', 'missing_8', 'watching_tv ', 'computer_work ', 'car_driving ', 
                'ascending_stairs ', 'descending_stairs ', 'missing_14', 'missing_15', 
                'vacuuming', 'ironing', 'folding_laundry', 'house_cleaning', 'playing soccer',
                'missing_21', 'missing_22', 'missing_23', 'rope jumping'] 

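# Replace values in the 'activity_id' column only; DataFrame.replace would also rewrite matching sensor values (e.g. 0.0 or 1.0)
df['activity_id'] = df['activity_id'].replace(raw_activity_ids, activity_labels)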

In [43]:
# Let's take a look at the 'activity_id' label and show each transition in the first 100K rows
# (head() also works if the DataFrame has fewer than 100K rows)
prev_activity = None
for idx, activity in enumerate(df['activity_id'].head(100000)):
    if activity != prev_activity:
        print('Transition to activity label[{}]: {}'.format(idx, activity))
        prev_activity = activity

Transition to activity label[0]: unassigned
Transition to activity label[2927]: lying
Transition to activity label[30114]: sitting
Transition to activity label[53594]: standing 
Transition to activity label[75311]: unassigned
Transition to activity label[84966]: ironing
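
The transition listing above is a form of run-length encoding: consecutive rows with the same label collapse into one segment. A compact way to express this, sketched here with `itertools.groupby` on a made-up label sequence (the helper name `activity_segments` is hypothetical), also yields the length of each segment.

```python
from itertools import groupby


def activity_segments(labels):
    """Return (start_index, label, length) for each run of identical labels."""
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = sum(1 for _ in run)
        segments.append((start, label, length))
        start += length
    return segments


# Made-up label sequence, standing in for df['activity_id'].
labels = ['unassigned'] * 3 + ['lying'] * 4 + ['sitting'] * 2 + ['unassigned']
segments = activity_segments(labels)
for start, label, length in segments:
    print('label[{}]: {} ({} rows)'.format(start, label, length))
```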


In [18]:
# Let's find out how many unique/discrete values per column in our dataset
df.nunique(axis='index') 

timestamp              34089
activity_id               13
heart_rate               106
imu_wrist_temp            63
imu_wrist_accel16_x    32425
imu_wrist_accel16_y    32527
imu_wrist_accel16_z    32492
imu_wrist_accel6_x     32315
imu_wrist_accel6_y     32919
imu_wrist_accel6_z     32368
imu_wrist_gyro_x       33910
imu_wrist_gyro_y       33883
imu_wrist_gyro_z       33860
imu_wrist_magnet_x     33464
imu_wrist_magnet_y     32978
imu_wrist_magnet_z     33370
imu_wrist_orient1          1
imu_wrist_orient2          1
imu_wrist_orient3          1
imu_wrist_orient4          1
imu_chest_temp            88
imu_chest_accel16_x    31183
imu_chest_accel16_y    29647
imu_chest_accel16_z    30940
imu_chest_accel6_x     33222
imu_chest_accel6_y     29860
imu_chest_accel6_z     32419
imu_chest_gyro_x       33868
imu_chest_gyro_y       33886
imu_chest_gyro_z       33863
imu_chest_magnet_x     33231
imu_chest_magnet_y     32476
imu_chest_magnet_z     33227
imu_chest_orient1          1
imu_chest_orie

In [44]:
# Let's get the distinct list of unique activities
print(np.unique(df['activity_id'].to_numpy()))

['ascending_stairs ' 'cycling' 'descending_stairs ' 'ironing' 'lying'
 'nordic_walking ' 'rope jumping' 'running' 'sitting' 'standing '
 'unassigned' 'vacuuming' 'walking']


This notebook cell writes a file `preprocessing.py`, which contains the pre-processing script. You can update the script and rerun this cell to overwrite `preprocessing.py`. You run this as a processing job in the next cell. In this script, you

* remove rows with missing values and duplicate rows
* replace the raw integer `activity_id` values with their activity labels
* use `activity_id` as the multi-class target and every other column except `timestamp` as a feature
* split the data into training and test datasets
* scale the continuous sensor features (heart rate, temperature, and IMU readings) so they're suitable for training
* save the training features and labels and the test features and labels

Our training script will use the pre-processed training features and labels to train a model, and our model evaluation script will use the trained model and pre-processed test features and labels to evaluate the model.

In [None]:
%%writefile preprocessing.py

import argparse
import os
import warnings

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelBinarizer, KBinsDiscretizer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import make_column_transformer

from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

column_names = ['timestamp', 'activity_id', 'heart_rate', 'imu_wrist_temp', 
  'imu_wrist_accel16_x', 'imu_wrist_accel16_y', 'imu_wrist_accel16_z', 
  'imu_wrist_accel6_x', 'imu_wrist_accel6_y', 'imu_wrist_accel6_z', 
  'imu_wrist_gyro_x', 'imu_wrist_gyro_y', 'imu_wrist_gyro_z', 
  'imu_wrist_magnet_x', 'imu_wrist_magnet_y', 'imu_wrist_magnet_z', 
  'imu_wrist_orient1', 'imu_wrist_orient2', 'imu_wrist_orient3', 'imu_wrist_orient4', 
  'imu_chest_temp', 'imu_chest_accel16_x', 'imu_chest_accel16_y', 'imu_chest_accel16_z', 
  'imu_chest_accel6_x', 'imu_chest_accel6_y', 'imu_chest_accel6_z', 
  'imu_chest_gyro_x', 'imu_chest_gyro_y', 'imu_chest_gyro_z', 
  'imu_chest_magnet_x', 'imu_chest_magnet_y', 'imu_chest_magnet_z', 
  'imu_chest_orient1', 'imu_chest_orient2', 'imu_chest_orient3', 'imu_chest_orient4', 
  'imu_ankle_temp', 'imu_ankle_accel16_x', 'imu_ankle_accel16_y', 'imu_ankle_accel16_z', 
  'imu_ankle_accel6_x', 'imu_ankle_accel6_y', 'imu_ankle_accel6_z', 
  'imu_ankle_gyro_x', 'imu_ankle_gyro_y', 'imu_ankle_gyro_z', 
  'imu_ankle_magnet_x', 'imu_ankle_magnet_y', 'imu_ankle_magnet_z', 
  'imu_ankle_orient1', 'imu_ankle_orient2', 'imu_ankle_orient3', 'imu_ankle_orient4']


raw_activity_ids = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]

activity_labels = ['unassigned', 'lying', 'sitting', 'standing ', 'walking', 'running', 'cycling', 
                'nordic_walking ', 'missing_8', 'watching_tv ', 'computer_work ', 'car_driving ', 
                'ascending_stairs ', 'descending_stairs ', 'missing_14', 'missing_15', 
                'vacuuming', 'ironing', 'folding_laundry', 'house_cleaning', 'playing soccer',
                'missing_21', 'missing_22', 'missing_23', 'rope jumping'] 

# Column that holds the multi-class target
label_column = 'activity_id'


def print_shape(df):
    # Report the overall shape and the number of examples per activity label
    print('Data shape: {}'.format(df.shape))
    print('Examples per label:\n{}'.format(df[label_column].value_counts()))

if __name__=='__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--train-test-split-ratio', type=float, default=0.3)
    args, _ = parser.parse_known_args()

    print('Received arguments {}'.format(args))

    input_data_path = os.path.join('/opt/ml/processing/input', 'subject101.csv')

    print('Reading input data from {}'.format(input_data_path))
    # Read the file the same way as in the notebook cell above
    df = pd.read_csv(input_data_path, header=0, names=column_names, index_col=False)

    # len(df) counts rows; df.size would count rows x columns
    print('Number of rows BEFORE drop: {}'.format(len(df)))
    df.dropna(inplace=True)
    df.drop_duplicates(inplace=True)
    print('Number of rows AFTER drop: {}'.format(len(df)))

    # Replace values in the label column only; DataFrame.replace would also rewrite matching sensor values
    df[label_column] = df[label_column].replace(raw_activity_ids, activity_labels)
    print('Data after cleaning:')
    print_shape(df)

    split_ratio = args.train_test_split_ratio
    print('Splitting data into train and test sets with ratio {}'.format(split_ratio))
    # The timestamp indexes the recording and is not used as a feature
    features = df.drop([label_column, 'timestamp'], axis=1)
    X_train, X_test, y_train, y_test = train_test_split(features, df[label_column], test_size=split_ratio, random_state=0)

    # The remaining columns are continuous sensor readings, so standardize them.
    # The scaler is fit on the training split only and then applied to the test split.
    scaler = StandardScaler()
    print('Running preprocessing and feature engineering transformations')
    train_features = scaler.fit_transform(X_train)
    test_features = scaler.transform(X_test)

    print('Train data shape after preprocessing: {}'.format(train_features.shape))
    print('Test data shape after preprocessing: {}'.format(test_features.shape))

    train_features_output_path = os.path.join('/opt/ml/processing/train', 'train_features.csv')
    train_labels_output_path = os.path.join('/opt/ml/processing/train', 'train_labels.csv')

    test_features_output_path = os.path.join('/opt/ml/processing/test', 'test_features.csv')
    test_labels_output_path = os.path.join('/opt/ml/processing/test', 'test_labels.csv')

    print('Saving training features to {}'.format(train_features_output_path))
    pd.DataFrame(train_features).to_csv(train_features_output_path, header=False, index=False)

    print('Saving test features to {}'.format(test_features_output_path))
    pd.DataFrame(test_features).to_csv(test_features_output_path, header=False, index=False)

    print('Saving training labels to {}'.format(train_labels_output_path))
    y_train.to_csv(train_labels_output_path, header=False, index=False)

    print('Saving test labels to {}'.format(test_labels_output_path))
    y_test.to_csv(test_labels_output_path, header=False, index=False)
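
The script fits its transformations on the training split (`fit_transform`) and only applies them to the test split (`transform`), so no statistics from the test data leak into training. The following dependency-free sketch illustrates that pattern for standardization of a single feature; the helper names and the heart rate values are made up for illustration and are not part of the script.

```python
from statistics import mean, pstdev


def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    mu = mean(column)
    sigma = pstdev(column)
    # Guard against constant columns, which have zero spread.
    if sigma == 0:
        sigma = 1.0
    return mu, sigma


def apply_scaler(column, mu, sigma):
    """Standardize values with previously learned statistics."""
    return [(x - mu) / sigma for x in column]


# Made-up heart rate values, for illustration only.
train_hr = [100.0, 110.0, 120.0]
test_hr = [110.0, 130.0]

mu, sigma = fit_scaler(train_hr)                  # statistics come from train only
train_scaled = apply_scaler(train_hr, mu, sigma)
test_scaled = apply_scaler(test_hr, mu, sigma)    # the same statistics are reused on test
print(train_scaled, test_scaled)
```

The scaled training values average to zero by construction, while the scaled test values generally do not, because they are expressed relative to the training statistics.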


Run this script as a processing job using the `SKLearnProcessor.run()` method. Give the `run()` method one `ProcessingInput`, where the `source` is the PAMAP2 CSV file in Amazon S3 and the `destination` is the path the script reads this data from, in this case `/opt/ml/processing/input`. Local paths inside the processing container must begin with `/opt/ml/processing/`.

Also give the `run()` method a `ProcessingOutput`, where the `source` is the path the script writes output data to. For outputs, the `destination` defaults to an S3 bucket that the Amazon SageMaker Python SDK creates for you, following the format `s3://sagemaker-<region>-<account_id>/<processing_job_name>/output/<output_name>/`. You also give each `ProcessingOutput` an `output_name`, which makes it easier to retrieve the output artifacts after the job has run.
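
To make the default destination format concrete, the sketch below fills in the pattern quoted above with placeholder values. The helper `default_output_uri` is hypothetical and only spells out the string format; the SDK generates the real URI for you, and the region, account ID, and job name here are not real resources.

```python
def default_output_uri(region, account_id, job_name, output_name):
    """Fill in the default S3 destination pattern (hypothetical helper)."""
    return 's3://sagemaker-{}-{}/{}/output/{}/'.format(
        region, account_id, job_name, output_name)


# Placeholder values, for illustration only.
example_uri = default_output_uri(
    'us-east-1', '123456789012', 'example-processing-job', 'train_data')
print(example_uri)
```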

The `arguments` parameter of the `run()` method holds the command-line arguments that are passed to the `preprocessing.py` script.
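
The list given as `arguments` reaches the script as ordinary command-line tokens, which the script reads with `argparse`. The sketch below parses such a list directly; the extra flag `--some-other-flag` is made up to show that `parse_known_args()` sets unrecognized tokens aside instead of failing.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--train-test-split-ratio', type=float, default=0.3)

# Tokens as they would arrive on the command line; the second flag is hypothetical.
cli_tokens = ['--train-test-split-ratio', '0.2', '--some-other-flag', 'x']
args, unknown = parser.parse_known_args(cli_tokens)

# Dashes in the flag name become underscores in the attribute name.
print(args.train_test_split_ratio)
print(unknown)

# Without the flag, the default applies.
default_args, _ = parser.parse_known_args([])
print(default_args.train_test_split_ratio)
```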

In [None]:
from sagemaker.processing import ProcessingInput, ProcessingOutput

sklearn_processor.run(code='preprocessing.py',
                      inputs=[ProcessingInput(
                        source=input_data_uri,
                        destination='/opt/ml/processing/input')],
                      outputs=[ProcessingOutput(output_name='train_data',
                                                source='/opt/ml/processing/train'),
                               ProcessingOutput(output_name='test_data',
                                                source='/opt/ml/processing/test')],
                      arguments=['--train-test-split-ratio', '0.2']
                     )

preprocessing_job_description = sklearn_processor.jobs[-1].describe()

output_config = preprocessing_job_description['ProcessingOutputConfig']
for output in output_config['Outputs']:
    if output['OutputName'] == 'train_data':
        preprocessed_training_data = output['S3Output']['S3Uri']
    if output['OutputName'] == 'test_data':
        preprocessed_test_data = output['S3Output']['S3Uri']

Now inspect the output of the pre-processing job, which consists of the processed features.

In [None]:
# The features file is written without a header row, so read it with header=None
training_features = pd.read_csv(preprocessed_training_data + '/train_features.csv', header=None, nrows=10)
print('Training features shape: {}'.format(training_features.shape))
training_features.head(n=10)

## Training using the pre-processed data

We create a `SKLearn` instance, which we will use to run a training job using the training script `train.py`.  

In [None]:
from sagemaker.sklearn.estimator import SKLearn

sklearn = SKLearn(
    entry_point='train.py',
    train_instance_type="ml.m5.xlarge",
    role=role)

The training script `train.py` trains a logistic regression model on the training data and saves the model to the `/opt/ml/model` directory. At the end of the training job, Amazon SageMaker packages the contents of that directory into a `model.tar.gz` archive and uploads it to Amazon S3.
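
The round trip from a saved model file to a `model.tar.gz` archive and back can be sketched locally with the standard library. This is an illustration under stated assumptions, not what SageMaker runs internally: a plain dictionary stands in for the fitted model, `pickle` stands in for `joblib` to avoid extra dependencies, and a temporary directory stands in for `/opt/ml/model`.

```python
import os
import pickle
import tarfile
import tempfile

# Stand-in for a fitted model (hypothetical contents).
model = {'coef': [0.5, -1.2], 'classes': ['lying', 'sitting']}

with tempfile.TemporaryDirectory() as workdir:
    # 1. The training script writes the model file into the model directory.
    model_dir = os.path.join(workdir, 'model')
    os.makedirs(model_dir)
    with open(os.path.join(model_dir, 'model.joblib'), 'wb') as f:
        pickle.dump(model, f)

    # 2. The model file is packed into a gzip-compressed tar archive.
    archive_path = os.path.join(workdir, 'model.tar.gz')
    with tarfile.open(archive_path, 'w:gz') as tar:
        tar.add(os.path.join(model_dir, 'model.joblib'), arcname='model.joblib')

    # 3. A consumer, such as an evaluation script, extracts and loads it again.
    extract_dir = os.path.join(workdir, 'extracted')
    os.makedirs(extract_dir)
    with tarfile.open(archive_path) as tar:
        tar.extractall(path=extract_dir)
    with open(os.path.join(extract_dir, 'model.joblib'), 'rb') as f:
        restored = pickle.load(f)

print(restored == model)
```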

In [None]:
%%writefile train.py

import os

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.externals import joblib

if __name__=="__main__":
    training_data_directory = '/opt/ml/input/data/train'
    train_features_data = os.path.join(training_data_directory, 'train_features.csv')
    train_labels_data = os.path.join(training_data_directory, 'train_labels.csv')
    print('Reading input data')
    X_train = pd.read_csv(train_features_data, header=None)
    # Read the labels as a 1-D array, which is the shape fit() expects
    y_train = pd.read_csv(train_labels_data, header=None).values.ravel()

    model = LogisticRegression(class_weight='balanced', solver='lbfgs')
    print('Training LR model')
    model.fit(X_train, y_train)
    model_output_directory = os.path.join('/opt/ml/model', "model.joblib")
    print('Saving model to {}'.format(model_output_directory))
    joblib.dump(model, model_output_directory)

Run the training job using `train.py` on the preprocessed training data.

In [None]:
sklearn.fit({'train': preprocessed_training_data})
training_job_description = sklearn.jobs[-1].describe()
model_data_s3_uri = '{}{}/{}'.format(
    training_job_description['OutputDataConfig']['S3OutputPath'],
    training_job_description['TrainingJobName'],
    'output/model.tar.gz')

## Model Evaluation

`evaluation.py` is the model evaluation script. Because the script also depends on scikit-learn, run it using the `SKLearnProcessor` you created previously. The script takes the trained model and the test dataset as input, and produces a JSON file containing classification evaluation metrics: precision, recall, and F1 score for each label, plus accuracy and ROC AUC for the model.
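
To make the per-label metrics concrete, here is a small dependency-free sketch that computes precision, recall, and F1 score for each label, plus overall accuracy, from made-up labels and predictions. The function name `per_label_metrics` is hypothetical; the evaluation script itself relies on scikit-learn for these numbers.

```python
def per_label_metrics(y_true, y_pred):
    """Compute precision, recall, and F1 for every label, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        report[label] = {'precision': precision, 'recall': recall, 'f1-score': f1}
    report['accuracy'] = sum(1 for t, p in pairs if t == p) / len(pairs)
    return report


# Made-up labels and predictions, for illustration only.
y_true = ['lying', 'lying', 'sitting', 'sitting', 'walking', 'walking']
y_pred = ['lying', 'sitting', 'sitting', 'sitting', 'walking', 'lying']
report = per_label_metrics(y_true, y_pred)
for name, values in report.items():
    print(name, values)
```

In this toy example, `sitting` has perfect recall (both true `sitting` rows are found) but lower precision, because one `lying` row is also predicted as `sitting`.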


In [None]:
%%writefile evaluation.py

import json
import os
import tarfile

import pandas as pd

from sklearn.externals import joblib
from sklearn.metrics import classification_report, roc_auc_score, accuracy_score
from sklearn.preprocessing import label_binarize

if __name__=="__main__":
    model_path = os.path.join('/opt/ml/processing/model', 'model.tar.gz')
    print('Extracting model from path: {}'.format(model_path))
    with tarfile.open(model_path) as tar:
        tar.extractall(path='.')
    print('Loading model')
    model = joblib.load('model.joblib')

    print('Loading test input data')
    test_features_data = os.path.join('/opt/ml/processing/test', 'test_features.csv')
    test_labels_data = os.path.join('/opt/ml/processing/test', 'test_labels.csv')

    X_test = pd.read_csv(test_features_data, header=None)
    # Read the labels as a 1-D array
    y_test = pd.read_csv(test_labels_data, header=None).values.ravel()
    predictions = model.predict(X_test)

    print('Creating classification evaluation report')
    report_dict = classification_report(y_test, predictions, output_dict=True)
    report_dict['accuracy'] = accuracy_score(y_test, predictions)
    # roc_auc_score cannot score hard multi-class predictions, so binarize the labels
    # (one column per class) and macro-average the one-vs-rest AUC of the predicted probabilities
    y_test_binarized = label_binarize(y_test, classes=model.classes_)
    report_dict['roc_auc'] = roc_auc_score(y_test_binarized, model.predict_proba(X_test), average='macro')

    print('Classification report:\n{}'.format(report_dict))

    evaluation_output_path = os.path.join('/opt/ml/processing/evaluation', 'evaluation.json')
    print('Saving classification report to {}'.format(evaluation_output_path))

    with open(evaluation_output_path, 'w') as f:
        f.write(json.dumps(report_dict))

In [None]:
import json
from sagemaker.s3 import S3Downloader

sklearn_processor.run(code='evaluation.py',
                      inputs=[ProcessingInput(
                                  source=model_data_s3_uri,
                                  destination='/opt/ml/processing/model'),
                              ProcessingInput(
                                  source=preprocessed_test_data,
                                  destination='/opt/ml/processing/test')],
                      outputs=[ProcessingOutput(output_name='evaluation',
                                  source='/opt/ml/processing/evaluation')]
                     )                    
evaluation_job_description = sklearn_processor.jobs[-1].describe()

Now retrieve the file `evaluation.json` from Amazon S3, which contains the evaluation report.

In [None]:
evaluation_output_config = evaluation_job_description['ProcessingOutputConfig']
for output in evaluation_output_config['Outputs']:
    if output['OutputName'] == 'evaluation':
        evaluation_s3_uri = output['S3Output']['S3Uri'] + '/evaluation.json'
        break

evaluation_output = S3Downloader.read_file(evaluation_s3_uri)
evaluation_output_dict = json.loads(evaluation_output)
print(json.dumps(evaluation_output_dict, sort_keys=True, indent=4))
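
Once the report is loaded as a dictionary, it can be inspected programmatically, for example to find the activity the model handles worst. The sketch below assumes the report mixes per-label dictionaries with scalar entries such as `accuracy` and aggregate rows such as `macro avg`; all numbers are hypothetical and do not come from an actual run of this notebook.

```python
import json

# Hypothetical report contents; real values come from evaluation.json.
evaluation_json = json.dumps({
    'lying': {'precision': 0.95, 'recall': 0.97, 'f1-score': 0.96, 'support': 120},
    'sitting': {'precision': 0.80, 'recall': 0.70, 'f1-score': 0.75, 'support': 100},
    'macro avg': {'precision': 0.875, 'recall': 0.835, 'f1-score': 0.855, 'support': 220},
    'accuracy': 0.85,
    'roc_auc': 0.93,
})

example_report = json.loads(evaluation_json)

# Keep per-label entries only: skip scalars and the aggregate "... avg" rows.
per_label = {
    name: values for name, values in example_report.items()
    if isinstance(values, dict) and not name.endswith(' avg')
}
weakest = min(per_label, key=lambda name: per_label[name]['f1-score'])
print('Lowest F1: {} ({:.2f})'.format(weakest, per_label[weakest]['f1-score']))
```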

## Running processing jobs with your own dependencies

Above, you used a processing container that has scikit-learn installed, but you can run your own processing container in your processing job as well, and still provide a script to run within your processing container.

Below, you walk through how to create a processing container, and how to use a `ScriptProcessor` to run your own code within a container. Create a scikit-learn container and run a processing job using the same `preprocessing.py` script you used above. You can provide your own dependencies inside this container to run your processing script with.

In [None]:
!mkdir -p docker

This is the Dockerfile to create the processing container. Install `pandas` and `scikit-learn` into it. You can install your own dependencies.

In [None]:
%%writefile docker/Dockerfile

FROM python:3.7-slim-buster

RUN pip3 install pandas==0.25.3 scikit-learn==0.21.3
ENV PYTHONUNBUFFERED=TRUE

ENTRYPOINT ["python3"]

This block of code builds the container using the `docker` command, creates an Amazon Elastic Container Registry (Amazon ECR) repository, and pushes the image to Amazon ECR.

In [None]:
import boto3

account_id = boto3.client('sts').get_caller_identity().get('Account')
ecr_repository = 'sagemaker-processing-container'
tag = ':latest'

uri_suffix = 'amazonaws.com'
if region in ['cn-north-1', 'cn-northwest-1']:
    uri_suffix = 'amazonaws.com.cn'
processing_repository_uri = '{}.dkr.ecr.{}.{}/{}'.format(account_id, region, uri_suffix, ecr_repository + tag)

# Create ECR repository and push docker image
!docker build -t $ecr_repository docker
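# `aws ecr get-login` is not available in AWS CLI v2; get-login-password works in recent v1 releases and in v2
!aws ecr get-login-password --region $region | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.{uri_suffix}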
!aws ecr create-repository --repository-name $ecr_repository
!docker tag {ecr_repository + tag} $processing_repository_uri
!docker push $processing_repository_uri

The `ScriptProcessor` class lets you run a command inside this container, which you can use to run your own script.

In [None]:
from sagemaker.processing import ScriptProcessor

script_processor = ScriptProcessor(command=['python3'],
                image_uri=processing_repository_uri,
                role=role,
                instance_count=1,
                instance_type='ml.m5.xlarge')

Run the same `preprocessing.py` script you ran above, but now, this code is running inside of the Docker container you built in this notebook, not the scikit-learn image maintained by Amazon SageMaker. You can add the dependencies to the Docker image, and run your own pre-processing, feature-engineering, and model evaluation scripts inside of this container.

In [None]:
script_processor.run(code='preprocessing.py',
                     inputs=[ProcessingInput(
                         source=input_data_uri,
                         destination='/opt/ml/processing/input')],
                     outputs=[ProcessingOutput(output_name='train_data',
                                               source='/opt/ml/processing/train'),
                              ProcessingOutput(output_name='test_data',
                                               source='/opt/ml/processing/test')],
                     arguments=['--train-test-split-ratio', '0.2']
                    )
script_processor_job_description = script_processor.jobs[-1].describe()
print(script_processor_job_description)