### Module 15 - 1: Using XGBoost on AWS SageMaker

In this module, we will train an XGBoost model using SageMaker's implementation of the algorithm.<P>

This will be very different from our past work, which used sklearn libraries and local processing in our SageMaker Studio IDE.<P>
    
It takes both machine learning and AWS knowledge.<P>
    
Quickly review pricing:
- https://aws.amazon.com/sagemaker/pricing/
       
**Data**: We will use a familiar dataset: Abalone

Recall:
- The predictors, X, are physical measurements of the abalone
- In the target column: adult = 1, youth = 0. We are trying to predict if the abalone is adult or youth.<P>
 
The process we will follow is long:<P>
1. Load and investigate the data
2. Prepare data for XGBOOST on SM
3. Splitting into 3 sets: Train, Validate and Test
4. Upload prepared data to S3
5. Setup Sagemaker training channels
6. Create Sagemaker XGBoost Model
7. Train the model (**Costs money**)
8. Start the model: Also called an inference instance, predictor or endpoint (**Costs money**)
9. Use the endpoint to evaluate performance
10. Predict on unseen data
11. Delete your endpoint so we don't have to pay for it anymore...


In [19]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import boto3
import pandas as pd
import numpy as np
import pickle
import time
from io import StringIO
import sagemaker
import pprint

### 1. Load and investigate the data
We prepared this data in an earlier module. It should be all ready to go for classification.<P>

In [20]:
# Setup boto3
sess = boto3.session.Session()
s3 = sess.client('s3') 
source_bucket = 'machinelearning-read-only'
source_key = 'data/abalone_clean.pkl' 
response = s3.get_object(Bucket = source_bucket, Key = source_key)
#
body = response['Body'].read()
#
# Create a new pandas DataFrame using the pickle.loads() function
abalone_df = pickle.loads(body)
abalone_df.head(3)

| | sex | length | diameter | height | whole weight | shucked weight | viscera weight | shell weight | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 | 1 |
| 1 | 0 | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 | 0 |
| 2 | 1 | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 | 0 |


In [21]:
# Verify data types are numeric and no missing values
abalone_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2835 entries, 0 to 2834
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   sex             2835 non-null   int64  
 1   length          2835 non-null   float64
 2   diameter        2835 non-null   float64
 3   height          2835 non-null   float64
 4   whole weight    2835 non-null   float64
 5   shucked weight  2835 non-null   float64
 6   viscera weight  2835 non-null   float64
 7   shell weight    2835 non-null   float64
 8   target          2835 non-null   int64  
dtypes: float64(7), int64(2)
memory usage: 199.5 KB


### 2. Prepare data for XGBOOST on SM

For CSV input, XGBoost on SageMaker expects the target value in the first column and no header row. We will store each dataset in a single file.

In [5]:
# Reorder the columns to put the target in the first position.
df = abalone_df[['target','sex','length','diameter','height',
                 'whole weight','shucked weight','viscera weight','shell weight']]
df.head()

| | target | sex | length | diameter | height | whole weight | shucked weight | viscera weight | shell weight |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0.455 | 0.365 | 0.095 | 0.514 | 0.2245 | 0.101 | 0.15 |
| 1 | 0 | 0 | 0.35 | 0.265 | 0.09 | 0.2255 | 0.0995 | 0.0485 | 0.07 |
| 2 | 0 | 1 | 0.53 | 0.42 | 0.135 | 0.677 | 0.2565 | 0.1415 | 0.21 |
| 3 | 1 | 0 | 0.44 | 0.365 | 0.125 | 0.516 | 0.2155 | 0.114 | 0.155 |
| 4 | 1 | 1 | 0.53 | 0.415 | 0.15 | 0.7775 | 0.237 | 0.1415 | 0.33 |
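
The reordering above lists every column by hand. A minimal sketch of a more general version, assuming a hypothetical helper named `target_first` (it is not part of the original notebook), builds the ordering from whatever column names exist:

```python
def target_first(columns, target="target"):
    """Return the column names with `target` moved to the front.

    `target_first` is a hypothetical helper, for illustration only.
    """
    if target not in columns:
        raise ValueError(f"column {target!r} not found")
    return [target] + [c for c in columns if c != target]


# Column names as listed by abalone_df.info()
columns = ["sex", "length", "diameter", "height", "whole weight",
           "shucked weight", "viscera weight", "shell weight", "target"]
ordered = target_first(columns)
print(ordered)
```

With pandas, this could be applied as `df = abalone_df[target_first(list(abalone_df.columns))]`.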


### 3. Splitting into 3 sets: Train, Validate and Test

We will split the data in two steps. First, we split the full dataset into a training set and a remainder. Then, we split the remainder in half.

- train: 80% of the data used to train the model
- validate: 10% of the data used to validate the training model
- test: 10% of the data held out for us to evaluate the model performance on unseen data

In [22]:
# Use train_test_split twice
train, test_and_validate = train_test_split(df, test_size=0.2, random_state=42, stratify=df['target'])
test, validate = train_test_split(test_and_validate, test_size=0.5, 
                                  random_state=42, stratify=test_and_validate['target'])
print('Training:', train.shape) # Training data
print('Validate:', validate.shape) # Validate during training
print('Test:', test.shape) # Saved for us to test with at the end

Training: (2268, 9)
Validate: (284, 9)
Test: (283, 9)
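
The row counts above follow from the two split fractions. The sketch below reproduces the arithmetic in plain Python. The function name is hypothetical, and the rounding rule (round the held-out part up, give the rest to the other part) is an assumption about how scikit-learn sizes the split, not something stated in the notebook.

```python
import math


def two_stage_split_sizes(n_rows, first_test_size=0.2, second_test_size=0.5):
    """Estimate (train, test, validate) sizes for the two splits above.

    Assumption: the held-out part of each split is rounded up and the
    other part receives the remaining rows.
    """
    remainder = math.ceil(n_rows * first_test_size)     # test_and_validate
    train = n_rows - remainder
    validate = math.ceil(remainder * second_test_size)  # held-out part of split 2
    test = remainder - validate                         # remaining rows of split 2
    return train, test, validate


train_n, test_n, validate_n = two_stage_split_sizes(2835)
print(train_n, test_n, validate_n)  # 2268 283 284
```

Under that assumption, 20% of 2835 rows is 567, and an odd remainder cannot be halved evenly, which is why validate gets 284 rows and test gets 283.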


In [23]:
# Quick check that the target values are distributed similarly in each set
print(train['target'].value_counts())
print(validate['target'].value_counts())
print(test['target'].value_counts())

1    1467
0     801
Name: target, dtype: int64
1    184
0    100
Name: target, dtype: int64
1    183
0    100
Name: target, dtype: int64
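
Raw counts are hard to compare across sets of different sizes. As a small sketch, the counts printed above can be turned into the share of adults in each set. The dictionary below is typed in by hand from that output.

```python
# Class counts copied from the value_counts() output above: (adult, youth)
counts = {
    "train": (1467, 801),
    "validate": (184, 100),
    "test": (183, 100),
}

# Share of adults (target = 1) in each set
adult_share = {name: adult / (adult + youth)
               for name, (adult, youth) in counts.items()}

for name, share in adult_share.items():
    print(f"{name}: {share:.3f} adult")
```

All three shares are close to 65%, which is what the `stratify` argument is meant to achieve.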


### 4. Upload prepared data to S3

In [24]:
# This function will help us upload the data to S3
def upload(key, bucket, dataframe):
    csv_buffer = StringIO()
    dataframe.to_csv(csv_buffer, header=False, index=False)
    csv_buffer.seek(0)
    response = s3.put_object(Bucket = bucket, Body = csv_buffer.getvalue(), Key = key)
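
To see what the uploaded file body looks like, here is a minimal sketch that uses the standard `csv` module as a stand-in for `dataframe.to_csv(header=False, index=False)`. The helper name `rows_to_csv_body` is hypothetical. The sample rows are the first two rows of the reordered data shown earlier.

```python
import csv
from io import StringIO


def rows_to_csv_body(rows):
    """Serialize rows as CSV text with no header row and no index column."""
    buffer = StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# First two rows of the reordered data (target is the first value)
sample_rows = [
    [1, 0, 0.455, 0.365, 0.095, 0.514, 0.2245, 0.101, 0.15],
    [0, 0, 0.35, 0.265, 0.09, 0.2255, 0.0995, 0.0485, 0.07],
]
body = rows_to_csv_body(sample_rows)
print(body)
```

Each line starts with the target value and contains only numbers, with no column names and no row index.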

In [25]:
# Upload to S3
# Define some variables to store the data
#
# This is a bucket to which you have full access
destination_bucket = 'machinelearning-shared'
#
# Keys: Modify for your use
train_file='data/kcolvin/sagemaker/abalone_train.csv'
validate_file='data/kcolvin/sagemaker/abalone_validate.csv'
test_file='data/kcolvin/sagemaker/abalone_test.csv'
#
# Use the function from above to upload to S3
upload(train_file,destination_bucket,train)
upload(validate_file,destination_bucket,validate)
upload(test_file,destination_bucket,test)

In [26]:
# Verify the files are in the correct location
# Note: list_objects returns at most 1,000 keys per call
response = s3.list_objects(Bucket = destination_bucket)
# Parse through the response
# Modify for your use
my_folder = 'kcolvin/sagemaker/abalone' # A filter to look for only files in my folder
for obj in response['Contents']: # 'obj' avoids shadowing the built-in 'object'
    if my_folder in obj['Key']: # This is looking for a substring in the obj['Key'] string
        print(obj['Key'])

data/kcolvin/sagemaker/abalone_test.csv
data/kcolvin/sagemaker/abalone_train.csv
data/kcolvin/sagemaker/abalone_validate.csv
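
The substring check above also matches any key that contains the same text somewhere else in its path. A stricter alternative is to match on the start of the key. The sketch below works on a plain list of key strings. The helper name and the fourth key are hypothetical.

```python
def keys_with_prefix(keys, prefix):
    """Return the keys that start with `prefix` (hypothetical helper)."""
    return [key for key in keys if key.startswith(prefix)]


keys = [
    "data/kcolvin/sagemaker/abalone_test.csv",
    "data/kcolvin/sagemaker/abalone_train.csv",
    "data/kcolvin/sagemaker/abalone_validate.csv",
    # Hypothetical key: contains the same text, but in another location
    "backup/data/kcolvin/sagemaker/abalone_train.csv",
]

mine = keys_with_prefix(keys, "data/kcolvin/sagemaker/abalone")
print(mine)
```

The S3 listing call also accepts a `Prefix` argument, which applies the same kind of filter on the server side.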


### 5. Setup Sagemaker training channels
This is how the SM model obtains the data for training.

In [27]:
# Setup channels
# Recall, we've set up these variables
print(destination_bucket)
print(train_file)
print(validate_file)

# Define train and validate channels
train_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}".format(destination_bucket,train_file),
    content_type='text/csv')
print("s3://{}/{}".format(destination_bucket,train_file))

validate_channel = sagemaker.inputs.TrainingInput(
    "s3://{}/{}".format(destination_bucket,validate_file),
    content_type='text/csv')
print("s3://{}/{}".format(destination_bucket,validate_file))
#
# Finally, set up a dictionary pointing to the channels (required for SM xgboost)
data_channels = {'train': train_channel, 'validation': validate_channel}
pprint.pprint(data_channels)

machinelearning-shared
data/kcolvin/sagemaker/abalone_train.csv
data/kcolvin/sagemaker/abalone_validate.csv
s3://machinelearning-shared/data/kcolvin/sagemaker/abalone_train.csv
s3://machinelearning-shared/data/kcolvin/sagemaker/abalone_validate.csv
{'train': <sagemaker.inputs.TrainingInput object at 0x7f0b367a2fd0>,
 'validation': <sagemaker.inputs.TrainingInput object at 0x7f0b367a2e50>}
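
The cell above builds the same `s3://bucket/key` string four times. A small helper, hypothetical and shown here only as a sketch, keeps that formatting in one place:

```python
def s3_uri(bucket, key):
    """Build an s3:// URI from a bucket name and an object key."""
    # Strip stray slashes so the result never contains '//' after the bucket
    return "s3://{}/{}".format(bucket.strip("/"), key.lstrip("/"))


train_uri = s3_uri("machinelearning-shared",
                   "data/kcolvin/sagemaker/abalone_train.csv")
print(train_uri)
```

The resulting string can be passed as the first argument to `sagemaker.inputs.TrainingInput`, as in the cell above.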


### 6. Create Sagemaker XGBoost Model
Documentation:<BR>
- https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html
- https://docs.aws.amazon.com/sagemaker/latest/dg/ecr-us-west-2.html#xgboost-us-west-2.title

In [28]:
# We will work in the 'us-west-2' region
reg = sess.region_name
reg

'us-west-2'

In [29]:
# This will get us a pointer to a 'container'
# Discuss containers:
#  https://www.docker.com/resources/what-container/
#  https://sagemaker-workshop.com/custom/containers.html
from sagemaker.image_uris import retrieve
#
#container = retrieve('xgboost',boto3.Session().region_name,'1.0-1') # Old version of model
container = retrieve('xgboost',reg,'1.5-1')
print(type(container))
container

<class 'str'>


'246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.5-1'
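
The container value is just a string that names a Docker image in a registry. As a sketch, assuming the layout `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`, the hypothetical helper below splits the string into its parts:

```python
def parse_image_uri(uri):
    """Split an image URI into account, region, repository and tag.

    Hypothetical helper. Assumes the layout
    <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>
    """
    registry, _, repo_and_tag = uri.partition("/")
    repository, _, tag = repo_and_tag.rpartition(":")
    host_parts = registry.split(".")
    return {
        "account": host_parts[0],
        "region": host_parts[3],
        "repository": repository,
        "tag": tag,
    }


parts = parse_image_uri(
    "246618743249.dkr.ecr.us-west-2.amazonaws.com/sagemaker-xgboost:1.5-1")
print(parts)
```

The region and tag in the string match the `reg` and `'1.5-1'` arguments passed to `retrieve`.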

In [30]:
# Set a location to output model artifacts (files)
my_prefix = 'data/kcolvin/sagemaker'

s3_output_location="s3://{}/{}/output/".format(destination_bucket,my_prefix)
print(s3_output_location)

s3://machinelearning-shared/data/kcolvin/sagemaker/output/


In [31]:
# Create a Sagemaker Estimator (this is creating the 'model')
# Setup job_name
my_job = 'kcolvin'
#
xgb_model=sagemaker.estimator.Estimator(container, # Reference to the container, defined above 
                                        sagemaker.get_execution_role(),
                                        instance_count=1,
                                        instance_type= 'ml.m5.large', # smallest available
                                        output_path=s3_output_location,
                                        sagemaker_session=sagemaker.session.Session(),
                                        base_job_name = my_job) 
xgb_model

<sagemaker.estimator.Estimator at 0x7f0b35df1cd0>

In [32]:
# Currently, all hyperparameters are set to the default value
#   https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html
#
xgb_model.hyperparameters()

{}

In [33]:
# We can set hyperparameters like this:
#
xgb_model.set_hyperparameters(
    num_round = 42,
    eval_metric = "error",
    objective = "binary:logistic")
# Show the non-default parameters
xgb_model.hyperparameters()

{'num_round': 42, 'eval_metric': 'error', 'objective': 'binary:logistic'}
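
The `error` evaluation metric is the binary classification error rate: the fraction of rows that are misclassified when predictions above 0.5 are treated as positive. A minimal sketch of that calculation follows. The function name and the sample values are hypothetical.

```python
def binary_error(y_true, y_prob, threshold=0.5):
    """Fraction of rows whose thresholded prediction differs from the label."""
    wrong = sum(
        int(prob > threshold) != label
        for label, prob in zip(y_true, y_prob)
    )
    return wrong / len(y_true)


# Hypothetical labels and predicted probabilities, for illustration only
y_true = [1, 0, 1, 0]
y_prob = [0.97, 0.40, 0.30, 0.80]
print(binary_error(y_true, y_prob))  # 0.5
```

In this example the last two rows are misclassified, so the error is 2 out of 4. Lower is better, and `1 - error` is the accuracy.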

### 7. Train the model
OK, here comes a familiar function: model.fit().<P>
    
This will launch the container and train the model using the train & validate data. <P>
    
It takes 4 or 5 minutes.

In [39]:
# Only run this cell once. 
#  It will start a new training container every time you run it.
start = time.time()
print("Executing...")
xgb_model.fit(inputs=data_channels, logs=False)
end = time.time()
print(end - start, 'seconds')

Executing...

2022-09-02 18:19:20 Starting - Starting the training job...
2022-09-02 18:19:35 Starting - Preparing the instances for training.............
2022-09-02 18:20:49 Downloading - Downloading input data...........
2022-09-02 18:21:49 Training - Downloading the training image........
2022-09-02 18:22:35 Training - Training image download completed. Training in progress...
2022-09-02 18:22:50 Uploading - Uploading generated training model...
2022-09-02 18:23:07 Completed - Training job completed
231.5843222141266 seconds
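
Since training costs money, it can help to translate elapsed time into a rough dollar figure. The sketch below assumes cost is simply time multiplied by an hourly rate. The rate shown is a placeholder, not a real AWS price, and the billed duration may differ from the wall-clock time measured here.

```python
def estimate_cost_usd(billable_seconds, rate_usd_per_hour):
    """Estimate cost as billable time multiplied by an hourly rate."""
    return billable_seconds / 3600 * rate_usd_per_hour


# Wall-clock time printed above; the billed duration may differ.
elapsed_seconds = 231.5843222141266
# Placeholder rate, NOT a real AWS price: look up the current rate
# for your instance type and region on the pricing page.
hypothetical_rate_usd_per_hour = 0.10

cost = estimate_cost_usd(elapsed_seconds, hypothetical_rate_usd_per_hour)
print(f"Estimated cost: ${cost:.4f}")
```

Replace the placeholder with the rate listed for your instance type on the pricing page linked at the top of this module.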


In [40]:
# After completion, we can see the past training jobs. We paid for a few minutes of compute for each of these jobs.
# Create a Sagemaker client
sm = sess.client('sagemaker')
for job in sm.list_training_jobs()['TrainingJobSummaries']:
    print(job['TrainingJobName'])

kcolvin-2022-09-02-18-19-20-018
gmcgregor-2022-09-02-18-14-38-411
georgelund-2022-09-02-18-14-11-475
kcolvin-2022-09-02-18-13-48-345
joshtolle-2022-09-02-18-13-43-993
mmacaraeg-2022-09-02-18-13-21-998
lou-2022-09-02-18-09-25-441
kcolvin-2022-09-02-17-35-53-093
student1-2022-08-31-17-01-27-906
kcolvin--2022-08-31-15-10-53-664


In [41]:
# We can also see where the trained model file was stored
response = s3.list_objects(Bucket = destination_bucket)
# Modify for your use
my_filter = 'model.tar.gz' # A filter
for obj in response['Contents']: # 'obj' avoids shadowing the built-in 'object'
    if my_filter in obj['Key']: # This is looking for a substring in the obj['Key'] string
        print(obj['Key'])

data/george.lund/sagemaker/output/georgelund-2022-09-02-18-14-11-475/output/model.tar.gz
data/gmcgregor/sagemaker/output/gmcgregor-2022-09-02-18-14-38-411/output/model.tar.gz
data/joshtolle/sagemaker/output/joshtolle-2022-09-02-18-13-43-993/output/model.tar.gz
data/kcolvin/sagemaker/output/kcolvin--2022-08-30-20-15-31-507/output/model.tar.gz
data/kcolvin/sagemaker/output/kcolvin--2022-08-30-21-11-52-206/output/model.tar.gz
data/kcolvin/sagemaker/output/kcolvin--2022-08-30-22-11-25-279/output/model.tar.gz
data/kcolvin/sagemaker/output/kcolvin--2022-08-31-15-10-53-664/output/model.tar.gz
data/kcolvin/sagemaker/output/kcolvin-2022-09-02-17-35-53-093/output/model.tar.gz
data/kcolvin/sagemaker/output/kcolvin-2022-09-02-18-13-48-345/output/model.tar.gz
data/kcolvin/sagemaker/output/kcolvin-2022-09-02-18-19-20-018/output/model.tar.gz
data/kcolvin/sagemaker/output/sagemaker-xgboost-2022-08-30-17-47-33-158/output/model.tar.gz
data/lou/sagemaker/output/lou-2022-09-02-18-09-25-441/output/model.ta

### 8. Start the model: Also called an inference instance, predictor or endpoint
This takes about 5 or 6 minutes to start.

In [42]:
# Start inference instance (or predictor)
#
# Name your endpoint. This must be unique
my_endpoint = 'kcolvin-endpoint10'
#
start = time.time()
print("Executing...")
xgb_predictor = xgb_model.deploy(initial_instance_count=1,
                                 serializer = sagemaker.serializers.CSVSerializer(), # How to input data
                                 #deserializer= sagemaker.deserializers.JSONDeserializer(), # return JSON
                                 deserializer= sagemaker.deserializers.CSVDeserializer(), # return CSV
                                 instance_type='ml.t2.medium', # smallest possible instance
                                 endpoint_name = my_endpoint # must be a unique name
                                )
end = time.time()
print('\n',end - start, 'seconds')
xgb_predictor

Executing...
--------!
 241.865980386734 seconds


<sagemaker.predictor.Predictor at 0x7f0b37c18e50>

### 9. Use the endpoint to evaluate performance
Warning: The output of the model is a little confusing. Some coding required....

In [43]:
# Use the validate data to see how the model performed during training
#
# To predict, we need to drop 'target' from the validate data
X_test = validate.drop('target', axis = 1) 
X_test_array = X_test.to_numpy() # Need to convert the dataframe to a numpy array
# Just predict one row of data
p = xgb_predictor.predict(X_test_array[0])
# Check out the datatype: list of lists of strings!
p

[['0.972118616104126']]

In [44]:
# Sagemaker XGBOOST returns the probability that the target is 1 (adult).
# Let's just convert this to a 0 or 1
# First, strip the lists, then convert to a floating point
p_prob = float(p[0][0])
# Now convert it to a 1 or 0
if p_prob < .5:
    adult = 0
else:
    adult = 1
print('The first abalone in the validate set is adult:', adult)

The first abalone in the validate set is adult: 1


In [46]:
X_test_array.shape

(284, 8)

In [47]:
# Predict for the whole array
y_pred = xgb_predictor.predict(X_test_array)
# Convert the list of list of strings to a list of integers (0,1)
y_lst = []
for i in y_pred:
    if float(i[0]) < .5:  # Could modify this, if wanted
        y_lst.append(0)
    else:
        y_lst.append(1)
# Now we have a list of predicted 1 (adults) or 0 (youth)
y_pred = pd.Series(y_lst) # convert it back to a pandas series
y_pred.head()

0    1
1    0
2    0
3    0
4    0
dtype: int64
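
The conversion loop is needed again for the test data, so it may be worth wrapping in a function. A minimal sketch follows. The helper name is hypothetical, and it assumes the response has the list-of-lists-of-strings shape shown above.

```python
def probabilities_to_labels(response, threshold=0.5):
    """Convert a list of lists of strings into a list of 0/1 labels.

    Hypothetical helper: assumes each inner list holds one probability
    as its first element, as in the endpoint output shown above.
    """
    return [0 if float(row[0]) < threshold else 1 for row in response]


# The single-row response shown earlier
print(probabilities_to_labels([['0.972118616104126']]))  # [1]
```

Passing a different `threshold` changes the cutoff between youth and adult without rewriting the loop.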

In [48]:
# Performance on the validate data (the data used to validate during training)
# These metrics should look familiar
y_test = validate['target']
print('Model accuracy score:', accuracy_score(y_test,y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))
# What now?

Model accuracy score: 0.7323943661971831
Confusion Matrix:
 [[ 57  43]
 [ 33 151]]
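
One way to answer "What now?" is to read more out of the confusion matrix. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and derives precision, recall and a simple baseline from the numbers printed above.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions
matrix = [[57, 43],
          [33, 151]]
(tn, fp), (fn, tp) = matrix

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision = tp / (tp + fp)    # of the predicted adults, how many are adults
recall = tp / (tp + fn)       # of the true adults, how many were found
baseline = (fn + tp) / total  # accuracy of always predicting "adult"

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} baseline={baseline:.3f}")
```

The baseline is useful context: always predicting "adult" would already be right about 65% of the time on this set, so the model's 73% is a modest improvement.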


### 10. Predict on unseen data
Remember, we held out 10% of the data to see how the model does on non-training data.

In [50]:
# Now, do the same thing using the test dataset
X_test = test.drop('target', axis = 1) 
X_test_array = X_test.to_numpy() # Need to convert the dataframe to a numpy array
y_pred = xgb_predictor.predict(X_test_array)
y_lst = []
for i in y_pred:
    if float(i[0]) < .5:  # Could modify this, if wanted
        y_lst.append(0)
    else:
        y_lst.append(1)
y_pred = pd.Series(y_lst) # convert it back to a pandas series
# Performance from test data
y_test = test['target']
print('Model accuracy score:', accuracy_score(y_test,y_pred))
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))

Model accuracy score: 0.7597173144876325
Confusion Matrix:
 [[ 57  43]
 [ 25 158]]


### 11. Delete your endpoint so we don't have to pay for it anymore...

In [91]:
# When you are ready, uncomment the next line and delete your endpoint
#sm.delete_endpoint(EndpointName = my_endpoint)

{'ResponseMetadata': {'RequestId': '44c0415f-b4a5-46c4-819e-421d9f146966',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '44c0415f-b4a5-46c4-819e-421d9f146966',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '0',
   'date': 'Wed, 31 Aug 2022 16:31:16 GMT'},
  'RetryAttempts': 0}}