## Train, Tune and Deploy XGBoost (prebuilt containers)

Here we will <b>train, tune, and deploy a model on Amazon SageMaker using the popular XGBoost ML algorithm with SageMaker's prebuilt containers</b>. SageMaker manages all of the underlying infrastructure to train your model at petabyte scale, and deploy it to production.

In this tutorial, you will assume the role of a machine learning developer working at a bank. You have been asked to develop a machine learning model to predict whether a customer will enroll for a certificate of deposit (CD). The model will be trained on the marketing dataset that contains information on customer demographics, responses to marketing events, and external factors.

The data has been labeled for your convenience, and a column in the dataset identifies whether the customer is enrolled for a product offered by the bank. A version of this dataset is publicly available from the ML repository curated by the University of California, Irvine.

### 1. Train and tune the model

In this step, we will train a machine learning model with the training dataset that was uploaded to an Amazon S3 bucket in the previous lab (SageMaker Processing).


In [17]:
# cell 01: Set up the environment
import sagemaker
bucket = sagemaker.Session().default_bucket()  # (re)define default bucket
print('Default bucket:', bucket)
prefix = 'sagemaker/DEMO-xgboost-dm'
 
# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

# (Re)define train - test - validation paths:
train_path = f"s3://{bucket}/{prefix}/train"
validation_path = f"s3://{bucket}/{prefix}/validation"
test_path = f"s3://{bucket}/{prefix}/test"

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml
Default bucket: sagemaker-eu-central-1-365644463685


First, we will <b>specify the XGBoost ECR container location</b> for SageMaker's implementation of XGBoost.

In [18]:
# cell 10: Specify location of XGBoost ECR container
container = sagemaker.image_uris.retrieve(region=boto3.Session().region_name, framework='xgboost', version='latest')

INFO:sagemaker.image_uris:Ignoring unnecessary instance type: None.


Then, we specify the <b>train and validation data set locations</b>. Because we're training with the CSV file format, we'll create `TrainingInput` objects that our training function can use as pointers to the files in S3 and that also specify that the content type is CSV.

In [19]:
# cell 11: Specify train and validation set location
# train_path and validation_path are already fully formatted S3 URIs (see cell 01)
s3_input_train = sagemaker.inputs.TrainingInput(s3_data=train_path, content_type='csv')
s3_input_validation = sagemaker.inputs.TrainingInput(s3_data=validation_path, content_type='csv')

Now, <b>define the training parameters in SageMaker Estimator</b>. This includes:
1. The `xgboost` algorithm container
1. The IAM role to use
1. Training instance type and count
1. S3 location for output data
1. Algorithm hyperparameters

And then a `.fit()` function which specifies:
1. <b>S3 location for input data</b>.  In this case we have both a training and validation set which are passed in.

You will also set the algorithm hyperparameters with `set_hyperparameters` and then call the `fit` method to train the model.

In [24]:
# cell 12: define the training parameters in SageMaker Estimator object
sess = sagemaker.Session()

xgb = sagemaker.estimator.Estimator(container,
                                    role, 
                                    instance_count=1, 
                                    instance_type='ml.m4.xlarge',  #'ml.c5.2xlarge',  #default: 'ml.m4.xlarge' 
                                    output_path='s3://{}/{}/output'.format(bucket, prefix),
                                    #keep_alive_period_in_seconds=60*60,
                                    sagemaker_session=sess)
xgb.set_hyperparameters(max_depth=5,
                        eta=0.2,
                        gamma=4,
                        min_child_weight=6,
                        subsample=0.8,
                        silent=0,
                        objective='binary:logistic',
                        num_round=100)

xgb.fit({'train': s3_input_train, 'validation': s3_input_validation}) 

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


INFO:sagemaker:Creating training-job with name: xgboost-2023-11-16-14-32-10-312


2023-11-16 14:32:10 Starting - Starting the training job...
2023-11-16 14:32:26 Starting - Preparing the instances for training......
2023-11-16 14:33:36 Downloading - Downloading input data...
2023-11-16 14:34:06 Training - Downloading the training image......
2023-11-16 14:35:07 Uploading - Uploading generated training model
Arguments: train
[2023-11-16:14:34:57:INFO] Running standalone xgboost training.
[2023-11-16:14:34:57:INFO] File size need to be processed in the node: 4.35mb. Available memory size in the node: 8561.76mb
[2023-11-16:14:34:57:INFO] Determined delimiter of CSV input is ','
[14:34:57] S3DistributionType set as FullyReplicated
[14:34:57] 28831x59 matrix with 1701029 entries loaded from /opt/ml/input/data/train?format=csv&label_column=0&delimiter=,
[2023-11-16:14:34:57:INFO] Determined delimiter of CSV input is ','
[14:34:57] S3DistributionType set as FullyReplicated
[14:34:57] 8238x59 matrix


2023-11-16 14:35:23 Completed - Training job completed
Training seconds: 107
Billable seconds: 107


You can also <b>check in the AWS console to verify that the training job started</b> and wait until its status becomes “Completed” (by going to SageMaker -> Training jobs).


### 2. Deploy the model

In this step, we deploy the trained model to a real-time HTTPS endpoint. This process can take around 6-8 minutes. The cell below uses “ml.c5.2xlarge”; you can also choose another instance type, such as “ml.m5.xlarge”, for this deployment.



In [5]:
# cell 13: Deploy trained model to real-time endpoint (with pre-built container)
xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.c5.2xlarge')

INFO:sagemaker:Creating model with name: xgboost-2023-11-16-13-51-16-536
INFO:sagemaker:Creating endpoint-config with name xgboost-2023-11-16-13-51-16-536
INFO:sagemaker:Creating endpoint with name xgboost-2023-11-16-13-51-16-536


---!

Now we will <b>check the SageMaker endpoint deployment in the AWS console</b> view as well. In the SageMaker AWS Console, go to Inference -> "Endpoints" on the left pane. You will see the endpoint in “Creating” state, which will soon transition to “InService” state.



### 3. Predict and Evaluate model performance

In this step you will <b>reformat the CSV data, then run the model to create predictions</b>. There are many ways to compare the performance of a machine learning model, but let's start by simply comparing actual to predicted values (test set accuracy). In this case, we're simply predicting whether the customer subscribed to a term deposit (1) or not (0), which produces a simple confusion matrix that will help us evaluate model performance.

First we'll need to determine how we pass data into and receive data from our endpoint. Our data is currently stored as NumPy arrays in the memory of our notebook instance. To send it in an HTTP POST request, we'll serialize it as a CSV string and then decode the CSV that the endpoint returns.
Note: For inference with CSV format, SageMaker XGBoost requires that the data does NOT include the target variable.
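
As a rough offline illustration of that round trip (a sketch, not the SDK's own serializer code), the snippet below turns a small NumPy array into a CSV payload and parses a CSV response back into an array. The helper names `to_csv_payload` and `parse_csv_response`, and the sample response bytes, are hypothetical.

```python
import io

import numpy as np


def to_csv_payload(rows: np.ndarray) -> str:
    """Serialize a 2-D array to CSV text: one record per line, no header."""
    buffer = io.StringIO()
    np.savetxt(buffer, rows, delimiter=",", fmt="%g")
    return buffer.getvalue().rstrip("\n")


def parse_csv_response(body: bytes) -> np.ndarray:
    """Decode a comma-separated response body into a float array."""
    return np.array(body.decode("utf-8").split(","), dtype=float)


# Features only: the target column must not be part of the payload.
features = np.array([[1.0, 2.5], [3.0, 4.0]])
payload = to_csv_payload(features)

# Made-up response bytes, standing in for what an endpoint might return.
scores = parse_csv_response(b"0.12,0.87")
```

Each row of `features` becomes one line of the payload, and each value in the response corresponds to one input row.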

In [6]:
# cell 14: Create csv serializer object
xgb_predictor.serializer = sagemaker.serializers.CSVSerializer()

In [7]:
# cell 15: Download the test features and labels from S3
!aws s3 cp $test_path/test_x.csv /tmp/test_x.csv
!aws s3 cp $test_path/test_y.csv /tmp/test_y.csv

download: s3://sagemaker-eu-central-1-365644463685/sagemaker/DEMO-xgboost-dm/test/test_x.csv to ../../../../tmp/test_x.csv
download: s3://sagemaker-eu-central-1-365644463685/sagemaker/DEMO-xgboost-dm/test/test_y.csv to ../../../../tmp/test_y.csv


Now, we'll use a simple function to:
1. Loop over our test dataset
1. Split it into mini-batches of rows 
1. Convert those mini-batches to CSV string payloads (notice, we drop the target variable from our dataset first)
1. Retrieve mini-batch predictions by invoking the XGBoost endpoint
1. Collect predictions and convert from the CSV output our model provides into a NumPy array
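
The batching arithmetic in these steps can be checked offline before any endpoint is involved. The sketch below uses the same batch-count formula as the `predict` function in this notebook; the helper name `batch_sizes` is hypothetical, and 4,119 rows is used only as an example size.

```python
import numpy as np


def batch_sizes(n_rows: int, rows: int = 500) -> list:
    """Return the size of each mini-batch that np.array_split would produce."""
    n_batches = int(n_rows / float(rows) + 1)
    return [len(batch) for batch in np.array_split(np.arange(n_rows), n_batches)]


sizes = batch_sizes(4119)
```

Because `np.array_split` spreads the remainder across the first batches, every batch stays at or below the requested `rows` limit and the sizes differ by at most one.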

In [8]:
# cell 16: Produce test set predictions
import numpy as np
import pandas as pd

def predict(data, predictor, rows=500):
    split_array = np.array_split(data, int(data.shape[0] / float(rows) + 1))
    predictions = ''
    for array in split_array:
        predictions = ','.join([predictions, predictor.predict(array).decode('utf-8')])

    return np.fromstring(predictions[1:], sep=',')

test_x = pd.read_csv('/tmp/test_x.csv', names=[f'{i}' for i in range(59)])
test_y = pd.read_csv('/tmp/test_y.csv', names=['y'])
predictions = predict(test_x.drop(test_x.columns[0], axis=1).to_numpy(), xgb_predictor)

Now we'll check our confusion matrix to see how well we predicted versus actuals.

In [9]:
# cell 17: Produce confusion matrix
pd.crosstab(index=test_y['y'].values, columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])

| actuals \ predictions | 0.0 | 1.0 |
|---|---|---|
| 0 | 3632 | 3 |
| 1 | 479 | 5 |


So, of the 4,119 potential customers in the test set, we predicted 8 would subscribe and 5 of them actually did.  We also had 479 subscribers whom we did not predict would subscribe.  This is less than desirable, but the model can (and should) be tuned to improve this.  Most importantly, note that with minimal effort, our model produced accuracies similar to those published [here](http://media.salford-systems.com/video/tutorial/2015/targeted_marketing.pdf).

_Note that because there is some element of randomness in the algorithm's subsample, your results may differ slightly from the text written above._
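
To put numbers on that reading, here is a short worked calculation of accuracy, precision, and recall from the four counts in the confusion matrix above. The variable names are illustrative only.

```python
# Counts read off the confusion matrix above (rows: actuals, columns: predictions).
tn, fp = 3632, 3   # actual 0: predicted 0, predicted 1
fn, tp = 479, 5    # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # share of predicted subscribers who actually subscribed
recall = tp / (tp + fn)     # share of actual subscribers the model found

print(f"total={total} accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

The high accuracy mostly reflects the class imbalance; the very low recall is what tuning would need to improve.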

### 4. Automatic model tuning

Amazon SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.

For example, suppose that you want to solve a binary classification problem on this marketing dataset. Your goal is to maximize the area under the curve (AUC) metric by training an XGBoost model. You don't know which values of the eta, alpha, min_child_weight, and max_depth hyperparameters to use to train the best model. To find the best values for these hyperparameters, you can specify ranges of values that Amazon SageMaker hyperparameter tuning searches to find the combination of values that results in the training job that performs the best, as measured by the objective metric that you chose. Hyperparameter tuning launches training jobs that use hyperparameter values in the ranges that you specified, and returns the training job with the highest AUC.

<b>We will tune four hyperparameters in this example:</b>

eta: Step size shrinkage used in updates to prevent overfitting. After each boosting step, you can directly get the weights of new features. The eta parameter actually shrinks the feature weights to make the boosting process more conservative.

min_child_weight: Minimum sum of instance weight (hessian) needed in a child. If the tree partition step results in a leaf node with the sum of instance weight less than min_child_weight, the building process gives up further partitioning. In a linear regression task, this simply corresponds to a minimum number of instances needed in each node. The larger the value, the more conservative the algorithm is.

alpha: L1 regularization term on weights. Increasing this value makes models more conservative.

max_depth: Maximum depth of a tree. Increasing this value makes the model more complex and likely to be overfitted.

In [10]:
# cell 18: Set search ranges for hyperparameters
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner

hyperparameter_ranges = {'eta': ContinuousParameter(0, 1),
                            'min_child_weight': ContinuousParameter(1, 10),
                            'alpha': ContinuousParameter(0, 2),
                            'max_depth': IntegerParameter(1, 10)}
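
To make the idea of searching these ranges concrete, here is a minimal sketch that uses plain random search with a made-up scoring function. This is not how SageMaker does it: the tuner uses Bayesian optimization by default and scores each candidate by running a real training job. The names `sample_config`, `toy_objective`, and `random_search` are hypothetical and exist only for illustration.

```python
import random

# Search ranges mirroring the four hyperparameters tuned in this section.
RANGES = {
    "eta": (0.0, 1.0),
    "min_child_weight": (1.0, 10.0),
    "alpha": (0.0, 2.0),
    "max_depth": (1, 10),
}


def sample_config(rng: random.Random) -> dict:
    """Draw one candidate: continuous values uniformly, max_depth as an integer."""
    config = {name: rng.uniform(*RANGES[name]) for name in ("eta", "min_child_weight", "alpha")}
    config["max_depth"] = rng.randint(*RANGES["max_depth"])
    return config


def toy_objective(config: dict) -> float:
    """Hypothetical stand-in for 'train a model and return its validation AUC'."""
    return 1.0 - abs(config["eta"] - 0.2) - 0.01 * abs(config["max_depth"] - 5)


def random_search(n_jobs: int, seed: int = 0):
    """Score n_jobs random candidates and return the (score, config) pair with the highest score."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n_jobs):
        config = sample_config(rng)
        scored.append((toy_objective(config), config))
    return max(scored, key=lambda pair: pair[0])


best_score, best_config = random_search(n_jobs=20)
```

The structure is the same as in a real tuning job: propose values inside the ranges, score each candidate, and keep the best one.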


Next we'll specify the objective metric that we'd like to tune and its definition, which includes the regular expression (regex) needed to extract that metric from the CloudWatch logs of the training job. Since we are using the built-in XGBoost algorithm here, it emits two predefined metrics, validation:auc and train:auc, and we elected to monitor validation:auc, as you can see below. In this case, we only need to specify the metric name and do not need to provide a regex. If you bring your own algorithm, your algorithm emits metrics by itself. In that case, you'll need to add a MetricDefinition object here to define the format of those metrics through regex, so that SageMaker knows how to extract those metrics from your CloudWatch logs.
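
For the bring-your-own-algorithm case, the sketch below shows how a regex-based metric definition picks a value out of a log line. The SageMaker Python SDK accepts metric definitions as dictionaries with `Name` and `Regex` keys; the regex, the sample log line, and the helper `extract_metric` are assumptions made up for this example, and your own algorithm's log format will differ.

```python
import re

# Hypothetical metric definition; the regex assumes the made-up log format below.
metric_definitions = [
    {"Name": "validation:auc", "Regex": r"validation-auc:([0-9\.]+)"},
]

# Made-up log line, standing in for one line of a training job's log output.
sample_log_line = "[42] train-auc:0.951 validation-auc:0.927"


def extract_metric(definition: dict, line: str):
    """Return the first captured group as a float, or None if the regex does not match."""
    match = re.search(definition["Regex"], line)
    return float(match.group(1)) if match else None


value = extract_metric(metric_definitions[0], sample_log_line)
```

The single capture group in the regex marks the number to record as the metric value.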


In [11]:
# cell 19: Specify the objective metric (evaluation criterion) to be optimized
objective_metric_name = 'validation:auc'

Now, we'll create a HyperparameterTuner object, to which we pass:
- The XGBoost estimator we created above
- Our hyperparameter ranges
- The objective metric name
- Tuning resource configuration, such as the total number of training jobs to run and how many training jobs can run in parallel

In [12]:
# cell 20: Create a HyperparameterTuner object using all above-defined parameters
tuner = HyperparameterTuner(xgb,
                            objective_metric_name,
                            hyperparameter_ranges,
                            max_jobs=20,
                            max_parallel_jobs=3)


Now we can launch a hyperparameter tuning job by calling the `fit()` function. After the hyperparameter tuning job is created, we can go to the SageMaker console to track its progress until it is completed.


In [13]:
# cell 21: Launch a hyperparameter tuning job by calling fit() function [about 30' - check console]
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})

INFO:sagemaker:Creating hyperparameter tuning job with name: xgboost-231116-1353


.......................................................................................................................................................................................!


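Let's run a quick check of the hyperparameter tuning job's status with the command below. While the job is still running, the output is “InProgress”, which means the job started successfully; once all of its training jobs have finished, the output is “Completed”.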


In [14]:
# cell 22: check status of the hyperparameter tuning jobs
boto3.client('sagemaker').describe_hyper_parameter_tuning_job(
HyperParameterTuningJobName=tuner.latest_tuning_job.job_name)['HyperParameterTuningJobStatus']

'Completed'

We can go to the SageMaker console to track the progress of the hyperparameter tuning job until it is completed (Training -> Hyperparameter tuning jobs). It takes around 30 minutes to complete.
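
Rather than re-running the status cell by hand, the check can be wrapped in a small polling loop. This is a minimal sketch: `wait_for_tuning_job`, `poll_seconds`, and `max_polls` are hypothetical names, and the set of terminal statuses is an assumption that covers the common cases.

```python
import time

# Assumed terminal statuses; anything else is treated as "still running".
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_tuning_job(client, job_name: str, poll_seconds: float = 60.0, max_polls: int = 120) -> str:
    """Poll the tuning job status until it is terminal or max_polls is reached."""
    status = "Unknown"
    for _ in range(max_polls):
        response = client.describe_hyper_parameter_tuning_job(
            HyperParameterTuningJobName=job_name
        )
        status = response["HyperParameterTuningJobStatus"]
        if status in TERMINAL_STATES:
            break
        time.sleep(poll_seconds)
    return status


# Usage in this notebook (assumes `boto3` and `tuner` from the cells above):
# final_status = wait_for_tuning_job(boto3.client("sagemaker"),
#                                    tuner.latest_tuning_job.job_name)
```

Passing the client in as an argument keeps the function easy to exercise with a stand-in object before pointing it at a real job.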

Once the tuning job is completed, you can pick the training job with the best performance, deploy, predict and evaluate the model developed by the job as done before.


In [15]:
# cell 23: Return the best training job name
tuner.best_training_job()

'xgboost-231116-1353-003-0661e71f'

In [16]:
# cell 24: Deploy the best trained or user specified model to an Amazon SageMaker endpoint
tuner_predictor = tuner.deploy(initial_instance_count=1,
                           instance_type='ml.m4.xlarge')


2023-11-16 14:03:15 Starting - Preparing the instances for training
2023-11-16 14:03:15 Downloading - Downloading input data
2023-11-16 14:03:15 Training - Training image download completed. Training in progress.
2023-11-16 14:03:15 Stopping - Stopping the training job
2023-11-16 14:03:15 Uploading - Uploading generated training model
2023-11-16 14:03:15 Stopped - Resource reused by training job: xgboost-231116-1353-009-23a73979

INFO:sagemaker:Creating model with name: xgboost-2023-11-16-14-09-10-027





ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Could not find model data at s3://sagemaker-eu-central-1-365644463685/sagemaker/DEMO-xgboost-dm/output/xgboost-231116-1353-003-0661e71f/output/model.tar.gz.

In [None]:
# cell 25: Create a serializer
tuner_predictor.serializer = sagemaker.serializers.CSVSerializer()

In [None]:
# cell 26 - Invoke inference endpoint with test set data for predictions
predictions = predict(test_x.to_numpy(), tuner_predictor)

In [None]:
# cell 27: Produce confusion matrix for the tuned model's predictions
pd.crosstab(index=test_y['y'].values, columns=np.round(predictions), rownames=['actuals'], colnames=['predictions'])

### Clean-up

If you are done with this notebook, please run the cell below.  This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [None]:
# cell 28: Delete the endpoint (and its config) created from the first training job
xgb_predictor.delete_endpoint(delete_endpoint_config=True)

In [None]:
# cell 29: Delete the endpoint (and its config) created from the tuning job
tuner_predictor.delete_endpoint(delete_endpoint_config=True)