## IDS 706

## Project: Build and Deploy Sagemaker Classifier Model to Predict Which Client's are not Likely to Purchase Long-Term Deposit Subscriptions.

If you are running a bank, how do you target long-term deposit marketing campaigns to increase your odds of success?

### 1. Background

A Portuguese retail bank (unnamed for privacy) in 2014, sought to estimate the success of telemarketing calls made to clients between 2008 and 2013 to sell long-term deposits. Long-term financial instruments are indispensable to the growth and operational liquidity of financial institutions; thus, a lot of resources are deployed to sell such plans. Often, clients made a decision on long-term deposit subscription after receiving more than one phone call. 

### 2. Project Goal
The goal of this project is to build and and deploy a Sagemaker classifier model to predict whether a customer will purchase a long-term deposit, given predictor variables.

### 3. Data Code

* age : age of customer   

* job : 0 - admin., 1 - blue collar, 2 - technician, 3 - services, 4 - management, 5 - retired, 6 - entrepreneur, 7 - self-employed, 8 - housemaid, 9 - unemployed, 10 -       student, 11 - unknown

* marital : 0 - single, 1 - married, 3 - divorced

* education : 0 - primary, 1 - secondary, 2 - tertiary, 3 - unknown

* default (has credit in default?) : 0 - no, 1 - yes

* balance (account balance in Dollars): numeric

* housing (has housing loan?) : 0 - no, 1 - yes

* loan (has personal loan?) : 0 - no, 1 - yes

* contact - to be dropped

* day (day of the week): numeric

* month (last contact month of year) : 0 - jan, 1 - feb, 2 - mar, 3 - apr, 4 - may, 5 - jun, 6 - jul, 7 - aug, 8 - sep, 9 - oct, 10 - nov, 11 - dec

* day (last contact day of the week) : 0 - mon, 1 - tue, 2 - wed, 3 - thu, 4 - fri, 5 - sat, 6 - sun

* duration (of last contact in seconds) : numeric

* campaign (number of contacts performed during this campaign) : numeric

* pdays (number of days that passed by after the client was last contacted from a previous campaign): numeric 

* previous (number of contacts performed before this campaign and for this client ) : numeric

* poutcome (outcome of previous marketing campaigns) : 0 - failures, 1 - success, 2 - nonexistent

* subscription_decision (response/target variable) : 0 - no, 1 - yes



### 4. EDA

In [1]:
# import libraries
import pandas as pd
import numpy as np
# ingest data
df = pd.read_csv("revised_bank.csv", sep=";")
df = df.rename(columns={"y": "decision"})
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,decision
0,30,unemployed,married,primary,no,1787,no,no,cellular,19,oct,79,1,-1,0,unknown,no
1,33,services,married,secondary,no,4789,yes,yes,cellular,11,may,220,1,339,4,failure,no
2,35,management,single,tertiary,no,1350,yes,no,cellular,16,apr,185,1,330,1,failure,no
3,30,management,married,tertiary,no,1476,yes,yes,unknown,3,jun,199,4,-1,0,unknown,no
4,59,blue-collar,married,secondary,no,0,yes,no,unknown,5,may,226,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4516,33,services,married,secondary,no,-333,yes,no,cellular,30,jul,329,5,-1,0,unknown,no
4517,57,self-employed,married,tertiary,yes,-3313,yes,yes,unknown,9,may,153,1,-1,0,unknown,no
4518,57,technician,married,secondary,no,295,no,no,cellular,19,aug,151,11,-1,0,unknown,no
4519,28,blue-collar,married,secondary,no,1137,no,no,cellular,6,feb,129,4,211,3,other,no


### 5. Data Pre-processing

In [162]:
df["decision"].value_counts()

no     600
yes    521
Name: decision, dtype: int64

Right from the onset we know that approximately 13% of observations in the data set bought long-term deposits

Right from the onset we noticed a significant imbalance in number of observations belonging each group of the outcome variable. This is bound to affect the model. So we will enforce balance my making sure that there are comparable number of customers who bought long term deposits versus customers who did not.

In [163]:
df.sort_values(by=['decision'], inplace=True)
df.decision

0        no
396      no
397      no
398      no
399      no
       ... 
766     yes
765     yes
764     yes
762     yes
1120    yes
Name: decision, Length: 1121, dtype: object

In [164]:
df.reset_index(drop=True, inplace=True)

In [165]:
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,decision
0,33,management,divorced,tertiary,no,0,no,no,cellular,13,aug,305,2,-1,0,unknown,no
1,48,services,married,primary,yes,-583,yes,no,unknown,2,jun,25,7,-1,0,unknown,no
2,32,management,single,tertiary,no,151,yes,no,unknown,6,may,118,1,-1,0,unknown,no
3,51,services,married,secondary,no,867,yes,no,cellular,3,feb,177,2,211,3,failure,no
4,30,self-employed,married,secondary,no,1772,yes,no,cellular,13,apr,158,4,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1116,28,self-employed,single,tertiary,no,4579,no,no,cellular,12,jan,409,2,-1,0,unknown,yes
1117,34,services,married,secondary,no,1076,no,no,cellular,12,may,152,1,182,6,success,yes
1118,31,technician,married,tertiary,no,636,yes,no,cellular,4,may,352,4,-1,0,unknown,yes
1119,63,retired,married,secondary,no,474,no,no,cellular,25,jan,423,1,-1,0,unknown,yes


In [166]:
df = df.truncate(before=3400)
df = df.reset_index(drop=True)
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,decision


In [7]:
df["decision"].value_counts()

no     600
yes    521
Name: decision, dtype: int64

We now have a more balanced data set.

We selected the following predictor variables to model the subscription decision:

* age: numeric

* balance (account balance in US $): numeric

* housing (has housing loan?): categorical

* loan (has personal loan?): categorical


And the target variable:

* decision (subscription decision on long term deposit): categorical (0=no, 1=yes)

In [8]:
df_clean = df.drop(columns=["job", "marital", "education", "default", "previous", "contact", "day", "month", "duration", "campaign", "pdays", "poutcome"]).copy()
df_clean.decision = df_clean.decision.replace("no", 0)
df_clean.decision = df_clean.decision.replace("yes", 1)
df_clean.housing = df_clean.housing.replace("no", 0)
df_clean.housing = df_clean.housing.replace("yes", 1)
df_clean.loan = df_clean.loan.replace("no", 0)
df_clean.loan = df_clean.loan.replace("yes", 1)
df_clean

Unnamed: 0,age,balance,housing,loan,decision
0,33,0,0,0,0
1,34,417,1,0,0
2,60,71,0,0,0
3,43,1188,0,0,0
4,50,8139,1,0,0
...,...,...,...,...,...
1116,41,-386,0,1,1
1117,39,426,0,0,1
1118,28,171,0,0,1
1119,38,6728,0,0,1


Split the data into target and feature variables

In [9]:
X = df_clean.drop("decision", axis=1)
y = df_clean["decision"].astype('int')
X

Unnamed: 0,age,balance,housing,loan
0,33,0,0,0
1,34,417,1,0
2,60,71,0,0
3,43,1188,0,0
4,50,8139,1,0
...,...,...,...,...
1116,41,-386,0,1
1117,39,426,0,0
1118,28,171,0,0
1119,38,6728,0,0


In [10]:
y.dtype

dtype('int64')

### 6. Train and Evaluate XGBOOST Model Locally

In [11]:
# import scikit-learn packages
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, accuracy_score
from xgboost import XGBClassifier

We will create a pipeline that transforms the categorical columns and carries out the logistic regression for both training and testing data.

In [12]:
# define column transformer
#column_trans = make_column_transformer(
    #(OneHotEncoder(), ['job', 'marital', 'education', 'default', 'housing', 'loan']),
    #remainder = "passthrough")


# instantiate an xgboost logistic regression model
xgboost = XGBClassifier(use_label_encoder=False, learning_rate=0.005, n_estimators=100, objective='binary:logistic', eval_metric='logloss')

In [13]:
# define pipeline
#pipe = make_pipeline(column_trans, xgboost)

We could use cross-validation to eveluate the model. However, it would deprive us the opportunity easily evaluate our model on data it has was not trained with

In [14]:
#cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()

In [15]:
# we split our data into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# we split our test data into validation data in order to fit the XGBOOST model
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size = 0.5)

In [16]:
X_test.columns

Index(['age', 'balance', 'housing', 'loan'], dtype='object')

In [19]:
#fit a model to the training data with pipeline
xgboost.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              eval_metric='logloss', gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.005, max_delta_step=0,
              max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=2,
              num_parallel_tree=1, objective='binary:logistic',
              predictor='auto', random_state=0, reg_alpha=0, reg_lambda=1,
              scale_pos_weight=1, subsample=1, tree_method='exact',
              use_label_encoder=False, validate_parameters=1, ...)

We evaluated the model on the training data.

In [20]:
# predict y_train_hat
y_train_hat = xgboost.predict(X_train)

# model evaluation for training data
print("Precision = {}".format(precision_score(y_train, y_train_hat)))
print("Recall = {}".format(recall_score(y_train, y_train_hat)))
print("Accuracy = {}".format(accuracy_score(y_train, y_train_hat)))

Precision = 0.7002652519893899
Recall = 0.7115902964959568
Accuracy = 0.7193877551020408


We then evaluated the model on the testing data.

In [21]:
# predict y_test_hat
y_test_hat = xgboost.predict(X_test)

# model evaluation for testing data
print("Precision = {}".format(precision_score(y_test, y_test_hat)))
print("Recall = {}".format(recall_score(y_test, y_test_hat)))
print("Accuracy = {}".format(accuracy_score(y_test, y_test_hat)))

Precision = 0.5866666666666667
Recall = 0.55
Accuracy = 0.6011904761904762


### 7. Train and Evaluate Linear Learner Logistic Model Using Sagemaker

In [22]:
X1 = np.array(X).astype('float32')
y1 = np.array(y).astype('float32')

In [23]:
y1 = y1.reshape(-1,1)
y1.shape

(1121, 1)

In [24]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size = 0.2)

In [25]:
X1_test[:5,:]

array([[ 3.000e+01,  2.581e+03,  0.000e+00,  0.000e+00],
       [ 4.800e+01,  5.680e+02,  1.000e+00,  0.000e+00],
       [ 2.700e+01, -1.240e+02,  0.000e+00,  0.000e+00],
       [ 3.700e+01,  3.750e+02,  0.000e+00,  0.000e+00],
       [ 4.000e+01,  0.000e+00,  0.000e+00,  0.000e+00]], dtype=float32)

In [40]:
X1_train.shape

(896, 4)

In [26]:
y1_train.shape

(896, 1)

In [27]:
y1_test.shape

(225, 1)

In [167]:
### Initialize and Store Training Data into S3 Bucket

In [28]:
import sagemaker
import boto3
from sagemaker import Session

# Let's create a Sagemaker session
sagemaker_session = sagemaker.Session()
bucket = 'linear-logistic2.0'
# Let's define the S3 bucket and prefix that we want to use in this session
# bucket = 'sagemaker-practica' # bucket named 'sagemaker-practical' was created beforehand
prefix = 'linear_learner' # prefix is the subfolder within the bucket.

# Let's get the execution role for the notebook instance. 
# This is the IAM role that you created when you created your notebook instance. You pass the role to the training job.
# Note that AWS Identity and Access Management (IAM) role that Amazon SageMaker can assume to perform tasks on your behalf (for example, reading training results, called model artifacts, from the S3 bucket and writing training results to Amazon S3). 
role = sagemaker.get_execution_role()
print(role)

arn:aws:iam::194272474426:role/service-role/AmazonSageMaker-ExecutionRole-20211015T211867


In [30]:
# labels should be a vector
y1_train = y1_train[:,0]

In [31]:
import io # The io module allows for dealing with various types of I/O (text I/O, binary I/O and raw I/O). 
import numpy as np
import sagemaker.amazon.common as smac # sagemaker common libary

## CODE BELOW CONVERTS THE DATA IN NUMPY ARRAY FORMAT TO RecordIO FORMAT
# This is the format required by Sagemaker Linear Learner 

buf = io.BytesIO() # create an in-memory byte array (buf is a buffer I will be writing to)
smac.write_numpy_to_dense_tensor(buf, X1_train, y1_train)
buf.seek(0) 
# When you write to in-memory byte arrays, it increments 1 every time you write to it
# Let's reset that back to zero 

0

In [32]:
import os

# Code to upload RecordIO data to S3
 
# Key refers to the name of the file    
key = 'linear-train-data'

# The following code uploads the data in record-io format to S3 bucket to be accessed later for training
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)

# Let's print out the training data location in s3
s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)
print('uploaded training data location: {}'.format(s3_train_data))

uploaded training data location: s3://linear-logistic2.0/linear_learner/train/linear-train-data


In [33]:
# Make sure that the target label is a vector
y1_test = y1_test[:,0]

In [34]:
# Code to upload RecordIO data to S3

buf = io.BytesIO() # create an in-memory byte array (buf is a buffer I will be writing to)
smac.write_numpy_to_dense_tensor(buf, X1_test, y1_test)
buf.seek(0) 
# When you write to in-memory byte arrays, it increments 1 every time you write to it
# Let's reset that back to zero 


0

In [35]:
# Key refers to the name of the file    
key = 'linear-test-data'

# The following code uploads the data in record-io format to S3 bucket to be accessed later for training
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'test', key)).upload_fileobj(buf)

# Let's print out the testing data location in s3
s3_test_data = 's3://{}/{}/test/{}'.format(bucket, prefix, key)
print('uploaded training data location: {}'.format(s3_test_data))

uploaded training data location: s3://linear-logistic2.0/linear_learner/test/linear-test-data


In [36]:
# create an output placeholder in S3 bucket to store the linear learner output

output_location = 's3://{}/{}/output'.format(bucket, prefix)
print('Training artifacts will be uploaded to: {}'.format(output_location))

Training artifacts will be uploaded to: s3://linear-logistic2.0/linear_learner/output


In [37]:
# This code is used to get the training container of sagemaker built-in algorithms
# all we have to do is to specify the name of the algorithm, that we want to use

# Let's obtain a reference to the linearLearner container image
# Note that all regression models are named estimators
# You don't have to specify (hardcode) the region, get_image_uri will get the current region name using boto3.Session

from sagemaker.amazon.amazon_estimator import get_image_uri

container = get_image_uri(boto3.Session().region_name, 'linear-learner')

The method get_image_uri has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
Defaulting to the only supported framework/algorithm version: 1. Ignoring framework/algorithm version: 1.


In [41]:
# We have pass in the container, the type of instance that we would like to use for training 
# output path and sagemaker session into the Estimator. 
# We can also specify how many instances we would like to use for training
# sagemaker_session = sagemaker.Session()

linear = sagemaker.estimator.Estimator(container,
                                       role, 
                                       train_instance_count = 1, 
                                       train_instance_type = 'ml.c4.xlarge',
                                       output_path = output_location,
                                       sagemaker_session = sagemaker_session,
                                       train_use_spot_instances = True,
                                       train_max_run = 300,
                                       train_max_wait = 600)


# We can tune parameters like the number of features that we are passing in, type of predictor like 'regressor' or 'classifier', mini batch size, epochs
# Train 32 different versions of the model and will get the best out of them (built-in parameters optimization!)

linear.set_hyperparameters(feature_dim = 4,
                           predictor_type = 'binary_classifier',
                           mini_batch_size = 5,
                           epochs = 10,
                           num_models = 32,
                           loss = 'absolute_loss')

# Now we are ready to pass in the training data from S3 to train the linear learner model

linear.fit({'train': s3_train_data})

# Let's see the progress using cloudwatch logs

train_instance_count has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_instance_type has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_max_run has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_use_spot_instances has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
train_max_wait has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


2021-11-17 15:52:02 Starting - Starting the training job...
2021-11-17 15:52:26 Starting - Launching requested ML instancesProfilerReport-1637164322: InProgress
......
2021-11-17 15:53:26 Starting - Preparing the instances for training.........
2021-11-17 15:55:02 Downloading - Downloading input data...
2021-11-17 15:55:29 Training - Downloading the training image...
2021-11-17 15:55:50 Training - Training image download completed. Training in progress.[34mDocker entrypoint called with argument(s): train[0m
[34mRunning default environment configuration script[0m
[34m[11/17/2021 15:55:56 INFO 140605343524672] Reading default configuration from /opt/amazon/lib/python3.7/site-packages/algorithm/resources/default-input.json: {'mini_batch_size': '1000', 'epochs': '15', 'feature_dim': 'auto', 'use_bias': 'true', 'binary_classifier_model_selection_criteria': 'accuracy', 'f_beta': '1.0', 'target_recall': '0.8', 'target_precision': '0.8', 'num_models': 'auto', 'num_calibration_samples': '1

### 8. Deploy and Test Sagemaker Linear Learner Model

In [210]:
# Deploying the model to perform inference 

linear_classifier2= linear.deploy(initial_instance_count = 1,
                                          instance_type = 'ml.m4.xlarge')

------!

In [211]:
linear_classifier2.endpoint

The endpoint attribute has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


'linear-learner-2021-11-21-04-07-17-994'

In [212]:
from sagemaker.predictor import csv_serializer, json_deserializer

# Content type overrides the data that will be passed to the deployed model, since the deployed model expects data in text/csv format.

# Serializer accepts a single argument, the input data, and returns a sequence of bytes in the specified content type

# Deserializer accepts two arguments, the result data and the response content type, and return a sequence of bytes in the specified content type.

# Reference: https://sagemaker.readthedocs.io/en/stable/predictors.html

# linear_regressor.content_type = 'text/csv'
linear_classifier2.serializer = csv_serializer
linear_classifier2.deserializer = json_deserializer

In [213]:
def np2csv(arr):
    csv = io.BytesIO()
    np.savetxt(csv, arr, delimiter=",", fmt="%g")
    return csv.getvalue().decode().rstrip()

In [214]:
import json

runtime = boto3.client("runtime.sagemaker")

payload = np2csv(X1_test)
response = runtime.invoke_endpoint(
    EndpointName='linear-learner-2021-11-21-04-07-17-994', ContentType="text/csv", Body=payload
)
result = json.loads(response["Body"].read().decode())
test_pred = np.array([r["score"] for r in result["predictions"]])

In [215]:
test_pred

array([ 1.00285006e+00,  9.91261005e-03,  1.00013506e+00,  9.98888671e-01,
        9.97880459e-01,  1.09378099e-02,  1.25964284e-02,  9.98641968e-01,
        1.24634504e-02,  9.71913338e-03,  1.22565627e-02,  9.94534135e-01,
        9.68050957e-03,  1.36758685e-02,  1.32721066e-02,  9.08565521e-03,
        1.00351369e+00,  1.12811327e-02,  9.99398768e-01,  8.72182846e-03,
        1.33011937e-02,  5.71089983e-03,  1.00138736e+00,  9.95606363e-01,
        8.77016783e-03,  1.17211342e-02,  9.97880459e-01,  1.29795074e-02,
        1.14455223e-02,  1.16245151e-02,  9.95587051e-01,  9.95426118e-01,
        9.98805285e-01,  1.27957463e-02,  1.79201365e-02,  1.18142366e-02,
        1.01616383e-02,  9.99409735e-01,  1.23304725e-02,  1.00008881e+00,
        5.01942635e-03,  9.90481794e-01,  9.99170244e-01,  5.79184294e-03,
        1.12763047e-02,  1.28745437e-02,  1.10756159e-02,  1.09051466e-02,
        1.39600635e-02,  1.28696561e-02,  1.68103576e-02,  1.09413862e-02,
        1.19073391e-02,  

In [216]:
# making prediction on the test data

result = linear_classifier2.predict([[20, 3000, 1, 1]])

The csv_serializer has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.
The json_deserializer has been renamed in sagemaker>=2.
See: https://sagemaker.readthedocs.io/en/stable/v2.html for details.


In [217]:
result # results are in Json format

{'predictions': [{'score': 0.00952976942062378, 'predicted_label': 0}]}

In [218]:
# Since the result is in json format, we access the scores by iterating through the scores in the predictions

predictions = np.array([r['predicted_label'] for r in result['predictions']])

In [219]:
predictions

array([0])

In [220]:
y1_test = y1_test.reshape(-1,1)
y1_test.shape

(225, 1)

### 6. Evaluation of Sagemaker Linear Learner Model

Let's compare linear learner based mean absolute prediction errors from a baseline prediction which uses majority class to predict every instance.

In [222]:
test_mae_linear = np.mean(np.abs(y1_test - test_pred))
test_mae_baseline = np.mean(
    np.abs(y1_test- np.median(y1_train))
)  ## training median as baseline predictor

print("Test MAE Baseline :", round(test_mae_baseline, 3))
print("Test MAE Linear:", round(test_mae_linear, 3))

Test MAE Baseline : 0.467
Test MAE Linear: 0.499


Let's compare predictive accuracy using a classification threshold of 0.5 for the predicted and compare against the majority class prediction from training data set.

In [None]:
test_pred_class = (test_pred > 0.5) + 0
test_pred_baseline = np.repeat(np.median(y1_train), len(y1_test))

prediction_accuracy = np.mean((test_y == test_pred_class)) * 100
baseline_accuracy = np.mean((test_y == test_pred_baseline)) * 100

print("Prediction Accuracy:", round(prediction_accuracy, 1), "%")
print("Baseline Accuracy:", round(baseline_accuracy, 1), "%")

In [207]:
# Delete the end-point

linear_classifier2.delete_endpoint()