# Exploring Sagemaker notebooks on AWS

In this notebook I will be trying to recreate a notebook example
from the AWS Sagemaker notebook examples, specifically from this page
   https://sagemaker-examples.readthedocs.io/en/latest/introduction_to_applying_machine_learning/xgboost_customer_churn/xgboost_customer_churn_outputs.html
   
As you will see below, I ran into some issues using XGBoost with the Sagemaker notebook,
so I decided to finish the modeling using Random Forest from scikit-learn.

That also presented some problems, as the default version of sklearn on the site is a bit old.

Anyway, this is more an example of trying to make some of these cloud products work
than a real novel data science project.

But a worthy exercise in any case.

Here we go ...

## Set up a Sagemaker session

In [1]:
import sagemaker

sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = "sagemaker/test-churn"

# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

In [2]:
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt
import io
import os
import sys
import time
import json
from IPython.display import display
from time import strftime, gmtime
from sagemaker.inputs import TrainingInput
from sagemaker.serializers import CSVSerializer

## Download the files from an S3 bucket

In [3]:
s3 = boto3.client("s3")
s3.download_file("sagemaker-sample-files", "datasets/tabular/synthetic/churn.txt", "churn.txt")

## Load the file and preview

In [4]:
orig = pd.read_csv("./churn.txt")
pd.set_option("display.max_columns", 500)
display(orig.head(3))

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,PA,163,806,403-2562,no,yes,300,8.162204,3,7.579174,3.933035,4,6.508639,4.065759,100,5.111624,4.92816,6,5.673203,3,True.
1,SC,15,836,158-8416,yes,no,0,10.018993,4,4.226289,2.325005,0,9.972592,7.14104,200,6.436188,3.221748,6,2.559749,8,False.
2,MO,131,777,896-6253,no,yes,300,4.70849,3,4.76816,4.537466,3,4.566715,5.363235,100,5.142451,7.139023,2,6.254157,4,False.


## Description of the variables

State: the US state in which the customer resides, indicated by a two-letter abbreviation; for example, OH or NJ

Account Length: the number of days that this account has been active

Area Code: the three-digit area code of the corresponding customer’s phone number

Phone: the remaining seven-digit phone number

Int’l Plan: whether the customer has an international calling plan: yes/no

VMail Plan: whether the customer has a voice mail feature: yes/no

VMail Message: the average number of voice mail messages per month

Day Mins: the total number of calling minutes used during the day

Day Calls: the total number of calls placed during the day

Day Charge: the billed cost of daytime calls

Eve Mins, Eve Calls, Eve Charge: the minutes used, number of calls, and billed cost for calls placed during the evening

Night Mins, Night Calls, Night Charge: the minutes used, number of calls, and billed cost for calls placed during nighttime

Intl Mins, Intl Calls, Intl Charge: the minutes used, number of calls, and billed cost for international calls

CustServ Calls: the number of calls placed to Customer Service

Churn?: whether the customer left the service: true/false

## Let's look at correlations to see if any two predictive columns are too highly correlated

In [5]:
cordf = orig.corr()
cordf[cordf > 0.6]

Unnamed: 0,Account Length,Area Code,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls
Account Length,1.0,,,,,,,,,,,,,,,
Area Code,,1.0,,,,,,,,,,,,,,
VMail Message,,,1.0,,,,,,,,,,,,,
Day Mins,,,,1.0,,0.667941,,,0.766489,,,,,,,
Day Calls,,,,,1.0,,,,,,,,,,,
Day Charge,,,,0.667941,,1.0,,,,,,,,,,
Eve Mins,,,,,,,1.0,,,,,,,,,
Eve Calls,,,,,,,,1.0,,,,,,,,
Eve Charge,,,,0.766489,,,,,1.0,,,,,,,
Night Mins,,,,,,,,,,1.0,,,,,,


In [6]:
#Let's see if any of the predictors are highly correlated


A few of the correlations seem high, e.g. Eve Charge and Day Mins, but I am not going to worry too much about it  

since I plan to use a tree-based method.  
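
Reading the masked correlation matrix by eye is a bit tedious. A small sketch of how the high pairs could be listed directly is shown here. The helper name `high_corr_pairs` and the toy frame are made up for illustration; on the real data it would be called on `orig`.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.6):
    """Return (col_a, col_b, corr) for each pair of numeric columns above threshold."""
    corr = df.select_dtypes(include="number").corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) never pass this comparison.
    pairs = pairs[pairs > threshold]
    return [(a, b, float(v)) for (a, b), v in pairs.items()]

# Toy data: "b" is an exact multiple of "a"; "c" is only weakly related.
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, 10], "c": [5, 1, 4, 2, 3]})
pairs = high_corr_pairs(toy)
print(pairs)
```

Like the cell above, this only looks at positive correlations; comparing `pairs.abs()` against the threshold would also catch strong negative ones.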

In the example I borrowed from they removed some variables.

I think I will remove ["Area Code", "Phone", and "State"] 

since those are all geographic and have a lot of distinct values

In [7]:
churn = orig.drop(labels=["Area Code", "State", "Phone"], axis=1)

## Categorical Variables

Let's look at the columns that have categorical variables.

Not a problem with some tree methods, but they have to be converted for XGBoost.

In [8]:
obj_cols = [idx for idx, val in churn.dtypes.items() if val == 'object']
for oc in obj_cols:
    print(oc)
    display(churn[oc].value_counts(dropna=False))
    print("--")

Int'l Plan


no     2507
yes    2493
Name: Int'l Plan, dtype: int64

--
VMail Plan


yes    2512
no     2488
Name: VMail Plan, dtype: int64

--
Churn?


False.    2502
True.     2498
Name: Churn?, dtype: int64

--


### Convert to numeric

In [9]:
churn.replace({"Int'l Plan":{'no': 0, 'yes': 1}}, inplace=True)
churn.replace({"VMail Plan":{'no': 0, 'yes': 1}}, inplace=True)
churn.replace({"Churn?":{'False.': 0, 'True.': 1}}, inplace=True)
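
An alternative sketch of the same conversion uses `Series.map` with one dictionary per column. The toy frame below is invented for illustration and only mirrors two of the real columns. The possible advantage is that `map` gives NaN for any value missing from the dictionary, so an unexpected category shows up right away instead of silently staying a string.

```python
import pandas as pd

# Toy frame that mirrors two of the categorical columns.
toy = pd.DataFrame({
    "Int'l Plan": ["no", "yes", "no"],
    "Churn?": ["False.", "True.", "False."],
})

mappings = {
    "Int'l Plan": {"no": 0, "yes": 1},
    "Churn?": {"False.": 0, "True.": 1},
}

encoded = toy.copy()
for col, mapping in mappings.items():
    # Series.map returns NaN for any value missing from the mapping,
    # so an unexpected category is caught by the check below.
    encoded[col] = toy[col].map(mapping)
    assert encoded[col].notna().all(), f"unmapped value in {col}"
    encoded[col] = encoded[col].astype("int64")

print(encoded.dtypes)
```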

In [10]:
#check to make sure everything is numeric
churn.dtypes

Account Length      int64
Int'l Plan          int64
VMail Plan          int64
VMail Message       int64
Day Mins          float64
Day Calls           int64
Day Charge        float64
Eve Mins          float64
Eve Calls           int64
Eve Charge        float64
Night Mins        float64
Night Calls         int64
Night Charge      float64
Intl Mins         float64
Intl Calls          int64
Intl Charge       float64
CustServ Calls      int64
Churn?              int64
dtype: object

### Split the dataset into train, validation and test

In [11]:
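def df_split(df, pct=0.1):
    # Hold out a random `pct` of the rows of `df`; return (remainder, held_out).
    idx = df.index
    b_size = int(np.floor(len(idx) * pct))
    b_idx = np.random.choice(idx, b_size, replace=False)
    b = df.loc[b_idx]
    a = df.drop(b_idx, axis=0)
    return a, b
# Note: the sizes and scores recorded below come from the original run, in which
# this function read from the global `churn` and ignored `pct`, so `train`
# (4550 rows) still contained the 500 test rows.
train_val,  test = df_split(churn, pct=0.1)
train, validation = df_split(train_val, pct=0.1)
print(churn.shape[0], train.shape[0], validation.shape[0], test.shape[0])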

train.to_csv("train.csv", header=False, index=False)
validation.to_csv("validation.csv", header=False, index=False)
test.to_csv("test.csv", header=False, index=False)

5000 4550 450 500
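
In a clean three-way split the three sizes should add up to the size of the original frame, whereas the sizes printed above add up to 5500. A minimal sketch of a split with a sanity check follows. The helper name `split_off`, the seeds, and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def split_off(df, pct=0.1, seed=0):
    """Split df into (rest, held_out) where held_out has about pct of the rows."""
    rng = np.random.default_rng(seed)
    n_out = int(np.floor(len(df) * pct))
    out_idx = rng.choice(df.index.to_numpy(), size=n_out, replace=False)
    return df.drop(index=out_idx), df.loc[out_idx]

toy = pd.DataFrame({"x": range(1000)})
train_val, test = split_off(toy, pct=0.1, seed=1)
train, validation = split_off(train_val, pct=0.1, seed=2)

# The three index sets should be pairwise disjoint ...
assert set(train.index).isdisjoint(test.index)
assert set(train.index).isdisjoint(validation.index)
assert set(validation.index).isdisjoint(test.index)
# ... and together they should account for every row exactly once.
assert len(train) + len(validation) + len(test) == len(toy)
print(len(train), len(validation), len(test))  # 810 90 100
```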


# Sagemaker specific stuff

In [12]:
for dset in ["train", 'validation']:
    boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, f"{dset}/{dset}.csv")).upload_file(f"{dset}.csv")
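
One small caveat on the upload cell: `os.path.join` uses the separator of the local operating system. That is fine on the Linux notebook instance, but S3 keys always use forward slashes, so building the key as a plain string is a bit more portable. A sketch, with `s3_key` as a made-up helper name:

```python
def s3_key(prefix, dset):
    """Build the object key '<prefix>/<dset>/<dset>.csv' with forward slashes."""
    # S3 keys are plain strings that use "/" regardless of the local OS,
    # so strip stray slashes from the prefix and join explicitly.
    return f"{prefix.strip('/')}/{dset}/{dset}.csv"

keys = [s3_key("sagemaker/test-churn", d) for d in ["train", "validation"]]
print(keys)
```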

In [13]:
container = sagemaker.image_uris.retrieve("xgboost", sess.boto_region_name, "1.5-1")
display(container)

'683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.5-1'

In [14]:
s3_input_train = TrainingInput(
    s3_data="s3://{}/{}/train".format(bucket, prefix), content_type="csv"
)

s3_input_validation = TrainingInput(
    s3_data="s3://{}/{}/validation/".format(bucket, prefix), content_type="csv")

In [15]:
s3_input_validation

<sagemaker.inputs.TrainingInput at 0x7fbbd081a640>

In [16]:
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(
    container,
    role,
    instance_count=1,
    instance_type= "ml.m4.xlarge",
    output_path="s3://{}/{}/output".format(bucket, prefix),
    sagemaker_session=sess,
)
xgb.set_hyperparameters(
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    verbosity=0,
    objective="binary:logistic",
    num_round=100,
)

## Fitting
Now we should be ready to fit the model.
At first I had an issue where I had no resources to allocate.
A few Google searches led me to open an AWS ticket to get that resolved,
and it was resolved fairly quickly.

But there are still some issues, as can be seen from the error messages below.

In [17]:
xgb.fit({"train": s3_input_train, "validation": s3_input_validation})

2022-11-13 15:37:44 Starting - Starting the training job...ProfilerReport-1668353864: InProgress
...
2022-11-13 15:38:28 Starting - Preparing the instances for training............
2022-11-13 15:40:32 Downloading - Downloading input data...
2022-11-13 15:41:10 Training - Training image download completed. Training in progress...
[2022-11-13 15:41:26.569 ip-10-2-155-235.ec2.internal:7 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2022-11-13:15:41:26:INFO] Imported framework sagemaker_xgboost_container.training
[2022-11-13:15:41:26:INFO] Failed to parse hyperparameter objective value binary:logistic to Json.
Returning the value itself
[2022-11-13:15:41:26:INFO] No GPUs detected (normal if no gpus installed)
[2022-11-13:15:41:26:INFO] Running XGBoost Sagemaker in algorithm mode
[2022-11-13:15:41:26:INFO] Determined delimiter of CSV input is ','
[2022-11-13:15:41:26:INFO] Determined delimiter of CSV input is ','

UnexpectedStatusException: Error for Training job sagemaker-xgboost-2022-11-13-15-37-44-854: Failed. Reason: AlgorithmError: framework error: 
Traceback (most recent call last):
  File "/miniconda3/lib/python3.8/site-packages/sagemaker_xgboost_container/algorithm_mode/train.py", line 278, in train_job
    bst = xgb.train(
  File "/miniconda3/lib/python3.8/site-packages/xgboost/training.py", line 188, in train
    bst = _train_internal(params, dtrain,
  File "/miniconda3/lib/python3.8/site-packages/xgboost/training.py", line 81, in _train_internal
    bst.update(dtrain, i, obj)
  File "/miniconda3/lib/python3.8/site-packages/xgboost/core.py", line 1680, in update
    _check_call(_LIB.XGBoosterUpdateOneIter(self.handle,
  File "/miniconda3/lib/python3.8/site-packages/xgboost/core.py", line 218, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [15:41:26] ../src/objective/regression_obj.cu:120: label must be in [0,1] for logistic regression
Stack trace:
  [bt] (0) /miniconda3/lib/python3.8/site-packages/xgboost/lib/libxgboost.so(+0x455ac9) [0x7f86d6229ac9]
  [bt] (1) /mini

I fixed the resource error but was not sure how to fix this one.
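
A likely cause, going by the SageMaker documentation for the built-in XGBoost algorithm: with CSV input it treats the first column as the label and expects no header row. In the files written above "Churn?" is the last column, so "Account Length" would be read as the label, which would fit the "label must be in [0,1]" message. The sketch below shows one way to write the CSV with the label first. The helper name `to_xgb_csv` and the toy frame are made up, and the training job has not been rerun with this change, so treat it as an untested guess at the fix.

```python
import io
import pandas as pd

def to_xgb_csv(df, label_col):
    """Return CSV text with label_col moved to the first column, no header, no index."""
    ordered = [label_col] + [c for c in df.columns if c != label_col]
    buf = io.StringIO()
    df[ordered].to_csv(buf, header=False, index=False)
    return buf.getvalue()

# Toy frame with the label in the last column, as in the notebook.
toy = pd.DataFrame({
    "Account Length": [163, 15],
    "CustServ Calls": [3, 8],
    "Churn?": [1, 0],
})
csv_text = to_xgb_csv(toy, "Churn?")
print(csv_text)
```

On the real data the same reordering would be applied to `train` and `validation` before the `to_csv` calls.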

## Plan B
Let's try using RandomForest from Sklearn.

In [18]:
# define a function to compute accuracy
# for each of the datasets given an input model
from sklearn.metrics import accuracy_score
# from sklearn.metrics import log_loss # note: log_loss is not defined in the version of sklearn
def make_results(dsets, mod):
    dfs = []
    for name, (X, y) in dsets.items():
        ypred = mod.predict(X.values)
        # accuracy_score expects (y_true, y_pred)
        acc = accuracy_score(y.to_numpy().ravel(), ypred)
        rdf = pd.DataFrame({"acc": acc}, index=[0])
        rdf["ds"] = name
        dfs.append(rdf)
    res_df = pd.concat(dfs)
    return res_df

Create the 3 X and y data sets

In [19]:
ycol = "Churn?"
xcols = [c for c in churn if c != ycol]
Xtrain = train[xcols]
ytrain = train[ycol]
Xval = validation[xcols]
yval = validation[ycol]
Xtest = test[xcols]
ytest = test[ycol]

Define a dictionary containing the train, val and test sets, both X and y for each,
and check the size of each.  
I will use this in calling the function make_results above
to get metrics on each data set.

In [20]:

ds_dict = {"train" : (Xtrain, ytrain),
            "val" : (Xval, yval),
            "test": (Xtest, ytest)}
for ds in ds_dict.keys():
    X, y = ds_dict[ds]
    print(f"""{ds} X {X.shape} y {y.shape}""")

train X (4550, 17) y (4550,)
val X (450, 17) y (450,)
test X (500, 17) y (500,)


In [21]:
# set the parameters to search over.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import  GridSearchCV 

param_grid = {
    "n_estimators": [50, 100, 250,  ],
    "max_depth": [ 20, 30],
     "min_samples_leaf": [ 3, 5],
    "criterion" : ["entropy"], # "log_loss"]   
}

Run the sklearn GridSearch algorithm to find the best parameters
and save the best fitting set of parameters.

In [22]:
start = datetime.datetime.now()
grid_rf = GridSearchCV(
    RandomForestClassifier(),
    param_grid=param_grid,
    scoring="accuracy",
    refit=True,
    cv=5,
    verbose=True,
    error_score="raise",
)

rf_fit = grid_rf.fit(Xtrain.values, ytrain.to_numpy().ravel())
end = datetime.datetime.now()
print(f"search time {(end-start).total_seconds()}")

Fitting 5 folds for each of 12 candidates, totalling 60 fits
search time 84.982004
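
The counts in the log line above follow directly from the grid: one candidate per combination of parameter values, and one fit per candidate per fold. A small sketch of the arithmetic using only the standard library:

```python
from itertools import product
from math import prod

param_grid = {
    "n_estimators": [50, 100, 250],
    "max_depth": [20, 30],
    "min_samples_leaf": [3, 5],
    "criterion": ["entropy"],
}
cv_folds = 5

# One candidate per combination of values, one fit per candidate per fold.
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 12 60

# Enumerate the candidates explicitly as dictionaries of parameters.
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]
```

This makes it easy to estimate how much longer the search would take before adding more values to the grid.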


In [23]:
best_rf = rf_fit.best_estimator_
rf_res = make_results(ds_dict, mod=best_rf)
print(rf_fit.best_params_)
print ("\nAccuracy on the 3 data sets\n")
rf_res

{'criterion': 'entropy', 'max_depth': 30, 'min_samples_leaf': 3, 'n_estimators': 250}

Accuracy on the 3 data sets



Unnamed: 0,acc,ds
0,0.992747,train
0,0.926667,val
0,0.994,test


Note that I used "X.values" instead of just X in the fitting and eval functions.
Otherwise it generated errors in the fitting.
 

In [24]:
# So ... what were the best parameters
best_rf

RandomForestClassifier(criterion='entropy', max_depth=30, min_samples_leaf=3,
                       n_estimators=250)

In [25]:
print ("\n Cross tab on the test data set\n")
predictions = best_rf.predict(Xtest.values)
pd.crosstab(
    index=ytest,
    columns=np.round(predictions),
    rownames=["actual"],
    colnames=["predictions"],
)



 Cross tab on the test data set



predictions,0,1
actual,Unnamed: 1_level_1,Unnamed: 2_level_1
0,227,1
1,2,270


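Above is the crosstab from my run.
We slightly outperform the results from the example, but only by a little.

The boost in performance could be due to the grid search over the params,
or it could be just random.

Note, though, that the split sizes printed earlier (4550 + 450 + 500) add up to more than the 5000 rows in the data,
so the train set overlaps the test set and the test accuracy is likely optimistic.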
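
The counts in the crosstab can also be turned into the usual summary metrics by hand. A short sketch, with the four counts copied from the table above:

```python
# Counts copied from the crosstab above (rows = actual, columns = predicted).
tn, fp = 227, 1
fn, tp = 2, 270

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of predicted churners, how many really churned
recall = tp / (tp + fn)      # of actual churners, how many were caught

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.994 precision=0.996 recall=0.993
```

The accuracy matches the 0.994 reported for the test set in the results table.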

# Summary

I was mainly interested in comparing this AWS notebook environment to Google's Colab.

The AWS environment was fairly easy to get up and running and seemed to have many
packages installed ... although the sklearn was a bit old. It was also disappointing that the example from the AWS site did not work as is.

Also, the Github integration seemed much more involved with AWS than it does with Colab.

I have read that you get more powerful machines with the free tier on AWS than you do with Colab, but I am not sure of that.

In summary, right now I would prefer to use Colab unless I needed some specific functionality from AWS or Sagemaker.
