### Training a Model Using a Built-in Amazon SageMaker Algorithm - B
Mico Ellerich M. Comia

This notebook trains an Amazon SageMaker Linear Learner built-in algorithm to predict a binary output from a multi-dimensional input. No hyperparameter optimization was applied; apart from the values passed explicitly to the estimator (predictor_type and mini_batch_size), the algorithm's default values were used as is.

---

- SELECT 2 MACHINE LEARNING ALGORITHMS 
- FOR EACH OF THE ALGORITHMS
    - PERFORM TRAINING ON THE TRAINING DATASET
    - EVALUATE ON THE VALIDATION DATASET
    - TEST THE TRAINED MODEL ON THE TEST SET
    - SAVE THE MODEL USING JOBLIB (OR ALTERNATIVE)
- COMPARE THE “PERFORMANCE” OF THE 2 MODELS USING THE EVALUATION METRICS

---

In [12]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import warnings
import joblib
import time
import ast

from sklearn import metrics

### I. Import dataset splits
---
First, we import the generated synthetic dataset from the previous notebook using Pandas' read_csv. This imports the CSV files as dataframes.

In [2]:
X_train = pd.read_csv('files/data/X_train.csv')
X_test = pd.read_csv('files/data/X_test.csv')
X_val = pd.read_csv('files/data/X_val.csv')
y_train = pd.read_csv('files/data/y_train.csv')
y_test = pd.read_csv('files/data/y_test.csv')
y_val = pd.read_csv('files/data/y_val.csv')

### II. Training the Linear Learner model
---
#### A. Training using the training dataset

Using the Linear Learner algorithm, we set the predictor_type as binary_classifier since we will be predicting binary outputs.

In [3]:
import sagemaker
from sagemaker import get_execution_role

session = sagemaker.Session() # Represent experiment session
role = get_execution_role() # Execution role of instance
bucket = session.default_bucket() # Refers to S3 bucket

In [4]:
from sagemaker import LinearLearner

estimator = LinearLearner(role=role,
                          instance_count=1,
                          instance_type='ml.m5.xlarge',
                          predictor_type='binary_classifier', 
                          mini_batch_size=4)

We then convert the training features and labels to NumPy arrays before passing them to the record_set method, whose result is what the fit method receives.

In [5]:
X_train_arr = X_train.values
y_train_arr = y_train['Y'].values

The record_set method automatically uploads the training set to S3.

In [6]:
record_set = estimator.record_set(train=X_train_arr.astype('float32'), 
                                  labels=y_train_arr.astype('float32'))
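
As a rough illustration of the conversion step, the sketch below casts a small feature matrix and label vector to float32 and checks their shapes. The arrays are made-up stand-ins for X_train.values and y_train['Y'].values. The shape expectations (2-D features, 1-D labels) reflect how the SDK documentation describes record_set, so treat them as an assumption to verify against your SDK version.

```python
import numpy as np

# Hypothetical stand-ins for X_train.values and y_train['Y'].values.
features = np.array([[-1.53, 1.03, 1.21, 0.07, 1.01],
                     [-0.35, -1.71, 1.15, 1.87, -0.81]])
labels = np.array([1, 0])

# Cast both arrays to float32, as done before calling record_set.
features32 = features.astype('float32')
labels32 = labels.astype('float32')

# Assumed expectations: a 2-D feature matrix, a 1-D label vector,
# and one label per feature row.
assert features32.ndim == 2
assert labels32.ndim == 1
assert features32.shape[0] == labels32.shape[0]

print(features32.dtype, labels32.dtype, features32.shape, labels32.shape)
```

Selecting the single column y_train['Y'] before calling .values is what produces a 1-D label array; calling .values on the whole label dataframe would give a 2-D array with one column instead.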

In [7]:
estimator.fit(record_set)

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


2021-05-29 14:56:42 Starting - Starting the training job...
2021-05-29 14:56:47 Starting - Launching requested ML instances.........
ProfilerReport-1622300201: InProgress
2021-05-29 14:58:25 Starting - Preparing the instances for training......
2021-05-29 14:59:25 Downloading - Downloading input data...
2021-05-29 15:00:07 Training - Training image download completed. Training in progress.
2021-05-29 15:00:07 Uploading - Uploading generated training model
Docker entrypoint called with argument(s): train
Running default environment configuration script
[05/29/2021 15:00:05 INFO 139668206253888] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-input.json: {'mini_batch_size': '1000', 'epochs': '15', 'feature_dim': 'auto', 'use_bias': 'true', 'binary_classifier_model_selection_criteria': 'accuracy', 'f_beta': '1.0', 'target_recall': '0.8', 'target_precision': '0.8', 'num_models': 'auto', 'num_calibration_samples': 

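Since we neither used automated hyperparameter tuners nor assigned different hyperparameter values, the algorithm's default values were used. Accessing the _hyperparameters attribute of the estimator confirms this: every hyperparameter that was not set explicitly appears as None, which leaves the algorithm to fall back on its default for that setting.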

In [8]:
estimator._hyperparameters

{'predictor_type': 'binary_classifier',
 'binary_classifier_model_selection_criteria': None,
 'target_recall': None,
 'target_precision': None,
 'positive_example_weight_mult': None,
 'epochs': None,
 'use_bias': None,
 'num_models': None,
 'num_calibration_samples': None,
 'init_method': None,
 'init_scale': None,
 'init_sigma': None,
 'init_bias': None,
 'optimizer': None,
 'loss': None,
 'wd': None,
 'l1': None,
 'momentum': None,
 'learning_rate': None,
 'beta_1': None,
 'beta_2': None,
 'bias_lr_mult': None,
 'bias_wd_mult': None,
 'use_lr_scheduler': None,
 'lr_scheduler_step': None,
 'lr_scheduler_factor': None,
 'lr_scheduler_minimum_lr': None,
 'normalize_data': None,
 'normalize_label': None,
 'unbias_data': None,
 'unbias_label': None,
 'num_point_for_scaler': None,
 'margin': None,
 'quantile': None,
 'loss_insensitivity': None,
 'huber_delta': None,
 'early_stopping_patience': None,
 'early_stopping_tolerance': None,
 'num_classes': None,
 'accuracy_top_k': None,
 'f_beta'
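
To make the None entries easier to read, the sketch below separates the hyperparameters that were set explicitly from those left to the algorithm's defaults. The dictionary is a shortened, hand-copied subset of the output above, not the live estimator attribute.

```python
# Shortened, hand-copied subset of the dictionary printed above.
hyperparameters = {
    'predictor_type': 'binary_classifier',
    'epochs': None,
    'use_bias': None,
    'learning_rate': None,
}

# Entries with a value were set by us; None means "use the default".
explicitly_set = {name: value for name, value in hyperparameters.items()
                  if value is not None}
left_to_default = sorted(name for name, value in hyperparameters.items()
                         if value is None)

print(explicitly_set)
print(left_to_default)
```

The same filter could be applied to the estimator's own dictionary, assuming it keeps the plain name-to-value layout shown in the output.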

Before we can use our model for prediction, we must create an inference endpoint, which can be done using the deploy method.

In [10]:
# Creates an inference endpoint: a separate hosted instance that serves predictions
linear_model = estimator.deploy(initial_instance_count=1,
                                instance_type='ml.t2.medium')

Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


-------------------!

The following lines of code use the deployed model to generate predictions. We then use list comprehensions to extract the predicted labels and probabilities for the training set.

In [34]:
payload = X_train.values.astype('float32')
linear_pred_train = linear_model.predict(payload) 

In [59]:
# Each returned record stores its outputs under the 'score' and 'predicted_label' keys
linear_train_prob = [rec.label['score'].float32_tensor.values[0] for rec in linear_pred_train]
linear_train_pred = [rec.label['predicted_label'].float32_tensor.values[0] for rec in linear_pred_train]
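
The nested access pattern in the list comprehensions can be tried out without an endpoint. The sketch below builds hypothetical stand-in objects that only mimic the attribute layout used above (label, then a key, then float32_tensor.values); make_record and extract are illustrative helpers and are not part of the SageMaker SDK.

```python
from types import SimpleNamespace


def make_record(score, predicted_label):
    """Build a hypothetical stand-in mimicking the layout accessed above."""
    def tensor(value):
        return SimpleNamespace(float32_tensor=SimpleNamespace(values=[value]))
    return SimpleNamespace(label={'score': tensor(score),
                                  'predicted_label': tensor(predicted_label)})


def extract(records, key):
    """Pull the first value stored under `key` from each record."""
    return [rec.label[key].float32_tensor.values[0] for rec in records]


# Two fake records standing in for the endpoint's response.
fake_response = [make_record(0.912893, 1.0), make_record(0.014013, 0.0)]
probabilities = extract(fake_response, 'score')
predictions = extract(fake_response, 'predicted_label')

print(probabilities, predictions)
```

Wrapping the access in a helper such as extract avoids repeating the long attribute chain for the training, validation, and test responses.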

In [62]:
linear_train_scores = [metrics.accuracy_score(y_train, linear_train_pred)*100,
                     metrics.precision_score(y_train, linear_train_pred)*100,
                     metrics.recall_score(y_train, linear_train_pred)*100] 

df_linear_train = pd.DataFrame(linear_train_scores, columns = ['Scores'], index = ['Accuracy', 'Precision', 'Recall'])
df_linear_train

Unnamed: 0,Scores
Accuracy,89.833333
Precision,86.627907
Recall,95.207668


The trained model achieved an accuracy of 89.83% on the training set, which is slightly higher than that of scikit-learn's Logistic Regression model. As with Logistic Regression, Linear Learner outputs a probability for each point, and that probability determines whether the point is classified as 0 or 1.

In [63]:
df_train = pd.concat([X_train, y_train], axis = 1)
df_train["Pred"] = linear_train_pred
df_train.head(5)

Unnamed: 0,X1,X2,X3,X4,X5,Y,Pred
0,-1.529955,1.033142,1.205786,0.06703,1.009152,1,1.0
1,-0.352503,-1.71406,1.15273,1.865132,-0.811933,0,0.0
2,0.767941,-0.049229,-0.619864,-0.663686,-0.023202,0,0.0
3,-0.345336,-1.656973,2.182604,0.799374,0.477992,0,0.0
4,-1.74595,1.103049,0.770478,0.754043,0.352163,1,1.0


Using the probabilities extracted from the inference endpoint, we can see that once the probability reaches a certain threshold (>= 0.5 in the rows shown), the output is classified as "1."

In [64]:
df_train["Probability"] = linear_train_prob
selection = df_train[["Y", "Pred", "Probability"]]
selection.head(10)

Unnamed: 0,Y,Pred,Probability
0,1,1.0,0.912893
1,0,0.0,0.014013
2,0,0.0,0.019133
3,0,0.0,0.18099
4,1,1.0,0.790828
5,0,0.0,0.0111
6,1,1.0,0.574729
7,1,1.0,0.542053
8,0,0.0,0.022855
9,0,0.0,0.018688
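
The thresholding behaviour can be checked against the ten rows shown above. The sketch below applies an assumed cutoff of 0.5 to the listed probabilities and compares the result with the Pred column. Ten rows are consistent with that cutoff but cannot confirm the exact threshold the model uses.

```python
# Probability and Pred columns copied from the ten rows shown above.
probabilities = [0.912893, 0.014013, 0.019133, 0.18099, 0.790828,
                 0.0111, 0.574729, 0.542053, 0.022855, 0.018688]
reported_pred = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0]

THRESHOLD = 0.5  # assumed cutoff, inferred from the rows rather than read from the model

# Classify as 1.0 when the probability reaches the cutoff, else 0.0.
thresholded = [1.0 if p >= THRESHOLD else 0.0 for p in probabilities]

print(thresholded == reported_pred)
```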


---
#### B. Evaluating the model using the validation set
To evaluate the performance of our model, we use the accuracy, precision, and recall metrics. Higher values of these metrics are desirable. The evaluation steps are the same for the validation and test sets. In practice, a validation set is most useful for model selection (for example, comparing hyperparameter settings), while the test set is reserved for a final check.
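
For reference, the three metrics can be written directly in terms of confusion-matrix counts. The sketch below uses made-up counts (not taken from this notebook's results) and assumes at least one predicted positive and one actual positive, so that neither denominator is zero.

```python
def classification_scores(tp, fp, fn, tn):
    """Return accuracy, precision, and recall as percentages.

    tp/fp/fn/tn are the counts of true positives, false positives,
    false negatives, and true negatives.
    """
    total = tp + fp + fn + tn
    return {
        'Accuracy': 100 * (tp + tn) / total,   # share of all points classified correctly
        'Precision': 100 * tp / (tp + fp),     # share of predicted 1s that are truly 1
        'Recall': 100 * tp / (tp + fn),        # share of true 1s that were predicted as 1
    }


# Hypothetical counts for a 100-row dataset.
scores = classification_scores(tp=40, fp=10, fn=5, tn=45)
print(scores)
```

A model that predicts 1 too readily tends to show this pattern: recall above precision, because false positives lower precision while few true 1s are missed.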

In [66]:
payload = X_val.values.astype('float32')
linear_pred_val = linear_model.predict(payload) 

In [68]:
linear_val_pred = [rec.label['predicted_label'].float32_tensor.values[0] for rec in linear_pred_val]

In [74]:
linear_val_scores = [metrics.accuracy_score(y_val, linear_val_pred)*100,
                     metrics.precision_score(y_val, linear_val_pred)*100,
                     metrics.recall_score(y_val, linear_val_pred)*100] 

df_linear_val = pd.DataFrame(linear_val_scores, columns = ['Scores'], index = ['Accuracy', 'Precision', 'Recall'])
df_linear_val

Unnamed: 0,Scores
Accuracy,91.0
Precision,85.981308
Recall,96.842105


---
#### C. Evaluating the model using the test set

In [76]:
payload = X_test.values.astype('float32')
linear_pred_test = linear_model.predict(payload) 

In [77]:
# Iterate over the test predictions themselves (the original looped over len(linear_pred_val))
linear_test_pred = [rec.label['predicted_label'].float32_tensor.values[0] for rec in linear_pred_test]

In [78]:
linear_test_scores = [metrics.accuracy_score(y_test, linear_test_pred)*100,
                     metrics.precision_score(y_test, linear_test_pred)*100,
                     metrics.recall_score(y_test, linear_test_pred)*100] 

df_linear_test = pd.DataFrame(linear_test_scores, columns = ['Scores'], index = ['Accuracy', 'Precision', 'Recall'])
df_linear_test

Unnamed: 0,Scores
Accuracy,89.0
Precision,82.242991
Recall,96.703297


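Overall, the scores of the Linear Learner model are high. Moreover, the scores vary little across the training, validation, and test sets: accuracy and recall stay within about 2 percentage points, and the largest gap is in precision (86.63% on training versus 82.24% on test). This suggests that the model is neither underfitting nor overfitting.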
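
The comparison across splits can be made concrete by subtracting the test score from the training score for each metric. The sketch below uses the values reported in the tables above; a large positive gap would be one rough warning sign of overfitting.

```python
# Scores copied from the training, validation, and test tables above.
scores = {
    'Accuracy':  {'Training': 89.833333, 'Validation': 91.0,      'Test': 89.0},
    'Precision': {'Training': 86.627907, 'Validation': 85.981308, 'Test': 82.242991},
    'Recall':    {'Training': 95.207668, 'Validation': 96.842105, 'Test': 96.703297},
}

# Training-minus-test gap per metric, in percentage points.
gaps = {metric: round(by_split['Training'] - by_split['Test'], 2)
        for metric, by_split in scores.items()}

print(gaps)
```

Precision shows the widest gap, while recall is slightly higher on the test set than on the training set.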

---
#### D. Saving metrics and model
For future use and reference, we save the scores of the model. For the metrics, we concatenate the different scores into a single dataframe and export it as a CSV file with a timestamp in its name. This also allows us to access the metrics on our local machine.

In [79]:
# Getting the current time to serve as timestamps
timestr = time.strftime("%m%d-%H%M")

In [81]:
df_linear_scores = pd.concat([df_linear_train, df_linear_val, df_linear_test], axis = 1)
df_linear_scores.columns = ['Training','Validation','Test']
df_linear_scores

Unnamed: 0,Training,Validation,Test
Accuracy,89.833333,91.0,89.0
Precision,86.627907,85.981308,82.242991
Recall,95.207668,96.842105,96.703297


In [83]:
metrics_filename = 'files/results/linear_' + timestr + '.csv'
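# Keep the index so the metric names (Accuracy, Precision, Recall) are saved with the scores
df_linear_scores.to_csv(metrics_filename, index_label='Metric')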
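
The "%m%d-%H%M" format used above omits the year and the seconds, so two runs within the same minute, or on the same date in different years, would produce the same filename. A possible variant is sketched below; timestamped_path is a hypothetical helper, and the fixed timestamp is used only to make the example reproducible.

```python
import time
from pathlib import Path


def timestamped_path(directory, prefix, fmt="%Y%m%d-%H%M%S", when=None):
    """Build '<directory>/<prefix>_<timestamp>.csv' (hypothetical helper)."""
    moment = when if when is not None else time.localtime()
    stamp = time.strftime(fmt, moment)
    return Path(directory) / f"{prefix}_{stamp}.csv"


# Fixed moment for a reproducible example; omit `when` to use the current time.
fixed = time.strptime("2021-05-29 15:00:07", "%Y-%m-%d %H:%M:%S")
example = timestamped_path("files/results", "linear", when=fixed)

print(example.as_posix())
```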

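Since we trained our model in SageMaker, the model artifact is automatically stored in S3. The output_path attribute of the estimator gives the S3 location under which the training job writes its output; the full path to the model artifact itself is available through the estimator's model_data attribute.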

In [86]:
estimator.output_path

's3://sagemaker-us-east-1-305262579855/'
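
If the bucket name and key prefix are needed separately (for instance, to browse the output with another tool), the URI can be split using the standard library. The sketch below is one possible approach; split_s3_uri is a hypothetical helper, not an SDK function.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key/prefix' into (bucket, key_prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != 's3':
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The bucket is the network location; the key is the path without its leading slash.
    return parsed.netloc, parsed.path.lstrip('/')


bucket_name, key_prefix = split_s3_uri('s3://sagemaker-us-east-1-305262579855/')
print(bucket_name, repr(key_prefix))
```

For the output path shown above, the key prefix is empty, meaning the training output is written directly under the bucket root.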

---
#### E. Deleting endpoint
Finally, we delete our endpoint after producing the needed data.

In [87]:
linear_model.delete_endpoint()
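
Because an endpoint keeps incurring charges while it is running, it can be worth guaranteeing the cleanup even if an earlier step fails. The sketch below shows the try/finally pattern with a hypothetical stand-in class, so no AWS calls are made; in the notebook, linear_model would take the place of FakePredictor.

```python
class FakePredictor:
    """Hypothetical stand-in for the deployed predictor; makes no AWS calls."""

    def __init__(self):
        self.deleted = False

    def predict(self, payload):
        raise RuntimeError("simulated failure during inference")

    def delete_endpoint(self):
        self.deleted = True


predictor = FakePredictor()
caught = None
try:
    try:
        predictor.predict([[0.0] * 5])
    finally:
        # Runs whether or not predict() raised, so the endpoint is always removed.
        predictor.delete_endpoint()
except RuntimeError as err:
    caught = str(err)

print(predictor.deleted, caught)
```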