# Build a SciKit Learn Pipeline Model

In this notebook we will train and deploy a SKLearn Pipeline on Sagemaker infrastructure. 


## Create a training script

We need to define a model that can run on Sagemaker training hardware.

Detailed guidance here https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#preparing-the-scikit-learn-training-script


In [1]:
%%writefile src/sklearn_pipeline_training_script.py

import argparse
#import joblib
from sklearn.externals import joblib
import os

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

from sklearn.compose import ColumnTransformer
#, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer, StandardScaler, OneHotEncoder
import UnknownFeatureGenerator as ufg

# inference functions ---------------
def model_fn(model_dir):
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf


if __name__ =='__main__':
    
    #------------------------------- parsing input parameters (from command line)
    print('extracting arguments')
    parser = argparse.ArgumentParser()

    # RandomForest hyperparameters
    parser.add_argument('--n_estimators', type=int, default=150)
    parser.add_argument('--min_samples_leaf', type=int, default=20)
    parser.add_argument('--max_depth', type=int, default=9)
    
    # Data, model, and output directories
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train_dir', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test_dir', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    parser.add_argument('--train_file', type=str, default='train.csv')
    parser.add_argument('--test_file', type=str, default='validation.csv')
    parser.add_argument('--features', type=str, default='')  # explicitly name which features to use
    parser.add_argument('--target_variable', type=str)  # explicitly name the column to be used as target

    args, _ = parser.parse_known_args()
    
    #------------------------------- data preparation
    print('reading data')
    train_df = pd.read_csv(os.path.join(args.train_dir, args.train_file))
    test_df = pd.read_csv(os.path.join(args.test_dir, args.test_file))

    features = args.features.split()
    if features == []:
        features = list(train_df.columns)
        features.remove(args.target_variable)
    
    print('building training and testing datasets')
    X_train = train_df[features]
    X_test = test_df[features]
    y_train = train_df[args.target_variable]
    y_test = test_df[args.target_variable]
    
    
    numeric_cols = list( X_train.select_dtypes(include="number").columns)
    categorical_cols = list( X_train.select_dtypes(exclude="number").columns)
    
    #------------------------------- setup the preprocessing
    print('preprocesser setup')
    
    feature_gen = Pipeline([
        ("unknown", ufg.UnknownFeatureGenerator("admission_type_id", "UNKNOWN_admission_type") )
    ])
    
    numeric_transformer = make_pipeline(
        SimpleImputer(strategy='median'),
        StandardScaler()
    )

    categorical_transformer = make_pipeline(
        SimpleImputer(strategy='constant', fill_value='missing'),
        OneHotEncoder(handle_unknown='ignore')
    )

    preprocessor = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, numeric_cols),
            ("cat", categorical_transformer, categorical_cols)
        ]
    )
    
    #------------------------------- model training
    print('training model')
    rfcl = RandomForestClassifier(
        n_estimators=args.n_estimators,
        min_samples_leaf=args.min_samples_leaf,
        max_depth=args.max_depth,
        n_jobs=-1)
    
    model = Pipeline(steps=[
        ('feature_gen', feature_gen),
        ('preprocessor', preprocessor),
        ('rf', rfcl )
    ])
    
    model.fit(X_train, y_train)
    
    #-------------------------------  model testing
    print('testing model')

    test_preds = model.predict_proba(X_test)
    roc_auc = roc_auc_score(y_test, test_preds[:,1])
    print("Validation AUC: ", roc_auc)
        
    #------------------------------- save model
    path = os.path.join(args.model_dir, "model.joblib")
    joblib.dump(model, path)
    print('model saved at ' + path)

Overwriting src/sklearn_pipeline_training_script.py



#### Local training

Script arguments allows us to remove from the script any SageMaker-specific configuration, and run locally

**Note** This script relies on scikit-learn version 0.22 (Certain functions have been deprecated in 0.23)


In [24]:
! python src/sklearn_pipeline_training_script.py \
    --n_estimators 200 \
    --min_samples_leaf 8 \
    --model_dir 'model/' \
    --train_dir '../../data/partitioned/' \
    --test_dir '../../data/partitioned/' \
    --train_file 'train.csv' \
    --test_file 'validation.csv' \
    --target_variable 'readmitted'


extracting arguments
reading data
building training and testing datasets
preprocesser setup
training model
testing model
Validation AUC:  0.6852319377840976
model saved at model/model.joblib


## Create a script to execute training

To build the model we execute on Sagemake infrastructure using the SKLearn Estimator.
https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html

Write out a script for this (that can be run locally or via the master RUN script)

In [25]:
%%writefile RUN_Sagemaker_02b_Build.py

# We use the Estimator from the SageMaker Python SDK
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.sklearn.model import SKLearnModel
import sagemaker
import sys

sys.path.append("../../")
import utils.config as cfg
import utils.models as mods

config = cfg.get_config()
region = config['region']
target = config['target']
bucket_name = config['bucket_name']
bucket_prefix = config['bucket_prefix']
sgmk_session = config['sgmk_session']
sgmk_role = config['sgmk_role']
sm_boto3 = config['sm_boto3']

train_path_s3 = cfg.get_s3_path('train')
test_path_s3 = cfg.get_s3_path('validation')

sklearn_estimator = SKLearn(
    entry_point='sklearn_pipeline_training_script.py',
    role=sgmk_role,
    source_dir='src',
    instance_count=1,
    instance_type='ml.m5.large',
    framework_version='0.20.0',
    base_job_name='rf-scikit',
    metric_definitions=[
        { 'Name': 'AUC', 'Regex': 'Validation AUC: ([0-9.]+).*$' },
    ],
    hyperparameters={
        'n_estimators': 300,
        'min_samples_leaf': 8,
        'target_variable': target,
    },
    max_run=20*60,  # Maximum allowed active runtime (in seconds)
    use_spot_instances=True,  # Use spot instances to reduce cost
    max_wait=30*60,  # Maximum clock time (including spot delays)
)

data_dict = {'train':train_path_s3, 'test': test_path_s3}

sklearn_estimator.fit({'train':train_path_s3, 'test': test_path_s3}, wait=True)

sklearn_estimator.latest_training_job.wait(logs='None')

model_artifact = sm_boto3.describe_training_job(
    TrainingJobName=sklearn_estimator.latest_training_job.name)['ModelArtifacts']['S3ModelArtifacts']

print('Model artifact saved at:', model_artifact)

model = SKLearnModel(
    model_data=model_artifact,
    framework_version='0.20.0',
    py_version='py3',
    role=sgmk_role,
    source_dir='src',
    entry_point='sklearn_pipeline_training_script.py',
)

predictor = model.deploy(
    instance_type='ml.c5.large',
    initial_instance_count=1,
)

mods.register("Custom_SKLearn_RF", "Random Forest Classifier with Added Features", model_artifact, predictor)



Writing RUN_Sagemaker_02b_Build.py


### Execute it

In [26]:
!python RUN_Sagemaker_02b_Build.py

2020-12-29 02:31:07 Starting - Starting the training job...
2020-12-29 02:31:09 Starting - Launching requested ML instances......
2020-12-29 02:32:11 Starting - Preparing the instances for training...
2020-12-29 02:32:52 Downloading - Downloading input data...
2020-12-29 02:33:28 Training - Downloading the training image..[34m2020-12-29 02:33:42,567 sagemaker-training-toolkit INFO     Imported framework sagemaker_sklearn_container.training[0m
[34m2020-12-29 02:33:42,569 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2020-12-29 02:33:42,579 sagemaker_sklearn_container.training INFO     Invoking user training script.[0m
[34m2020-12-29 02:33:42,768 botocore.utils INFO     IMDS ENDPOINT: http://169.254.169.254/[0m
[34m2020-12-29 02:33:42,933 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[0m
[34m2020-12-29 02:33:42,946 sagemaker-training-toolkit INFO     No GPUs detected (normal if no gpus installed)[

In [9]:
from sagemaker.sklearn.model import SKLearnPredictor

predictor2 = SKLearnPredictor(
    endpoint_name=predictor._endpoint_config_name,
    sagemaker_session=sgmk_session
)
print(predictor2._endpoint_config_name)

sagemaker-scikit-learn-2020-12-28-01-10-24-002


In [3]:
model_artifact = "s3://sagemaker-us-east-2-320389841409/rf-scikit-2020-12-29-02-31-07-488/output/model.tar.gz"

In [4]:
from sagemaker.sklearn.model import SKLearnPredictor
from sagemaker.sklearn.model import SKLearnModel
import sys
sys.path.append("../../")
import utils.config as cfg
import utils.models as mods

config = cfg.get_config()
sgmk_session = config['sgmk_session']
sgmk_role = config['sgmk_role']

model = SKLearnModel(
    model_data=model_artifact,
    framework_version='0.20.0',
    py_version='py3',
    role=sgmk_role,
    source_dir='src',
    entry_point='sklearn_pipeline_training_script.py',
)


In [5]:

predictor = model.deploy(
    instance_type='ml.c5.large',
    initial_instance_count=1,
)

mods.register("Custom_SKLearn_RF", "Random Forest Classifier with Added Features", model_artifact, predictor)


-------------!

'Done'

In [6]:
print(predictor._endpoint_config_name)

sagemaker-scikit-learn-2020-12-29-08-22-10-131


In [9]:
# Create a new predictor with a specified serializer

from sagemaker.predictor import csv_serializer
from sagemaker.sklearn.model import SKLearnPredictor

new_predictor = SKLearnPredictor(
        endpoint_name=predictor._endpoint_config_name,
        sagemaker_session=sgmk_session,
        serializer = csv_serializer,
)

In [10]:
import pandas as pd
test_df = pd.read_csv('../../data/partitioned/test.csv')


In [11]:
test_df.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,319655078,87697485,Caucasian,Male,[50-60),?,Clinic Referral,1,1,1,...,No,No,No,No,No,No,No,No,No,0
1,46247880,9346356,Caucasian,Female,[60-70),?,Physician Referral,1,7,8,...,No,No,No,No,No,No,No,No,No,0
2,85492566,24242400,Caucasian,Female,[60-70),?,Transfer from a Skilled Nursing Facility (SNF),1,17,3,...,No,Down,No,No,No,No,No,Ch,Yes,0
3,238261572,90486225,Caucasian,Male,[80-90),?,Clinic Referral,1,1,5,...,No,Steady,No,No,No,No,No,Ch,Yes,0
4,138396858,47461050,Caucasian,Male,[60-70),?,Physician Referral,1,1,1,...,No,No,No,No,No,No,No,No,No,1


In [12]:
target_variable="readmitted"
features = list(test_df.columns)
features.remove(target_variable)

In [13]:
X_test = test_df[features]
X_test.head()

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,examide,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed
0,319655078,87697485,Caucasian,Male,[50-60),?,Clinic Referral,1,1,1,...,No,No,No,No,No,No,No,No,No,No
1,46247880,9346356,Caucasian,Female,[60-70),?,Physician Referral,1,7,8,...,No,No,No,No,No,No,No,No,No,No
2,85492566,24242400,Caucasian,Female,[60-70),?,Transfer from a Skilled Nursing Facility (SNF),1,17,3,...,No,No,Down,No,No,No,No,No,Ch,Yes
3,238261572,90486225,Caucasian,Male,[80-90),?,Clinic Referral,1,1,5,...,No,No,Steady,No,No,No,No,No,Ch,Yes
4,138396858,47461050,Caucasian,Male,[60-70),?,Physician Referral,1,1,1,...,No,No,No,No,No,No,No,No,No,No


In [14]:
# the SKLearnPredictor does the serialization from pandas for us
preds = new_predictor.predict(X_test)

#print(predictor.predict(testX[data.feature_names]))

The csv_serializer has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


KeyError: 0