# 3a - Training Amazon's XGBoost

## Introduction
In this notebook, we will train Amazon's XGBoost implementation, evaluate the training performance, and output model artifacts.

## Setup
First, we import the SageMaker SDK dependencies and the other modules used in this notebook.
We also create the relevant sessions and read in local environment data.

In [71]:
import json
import uuid
import boto3
import random
import tarfile
import pickle as pkl
import datetime as dt
import sagemaker as sm
import sagemaker.xgboost as xgb

In [72]:
sm_session = sm.Session()
role = sm.get_execution_role()
boto3_session = boto3.session.Session()
now = dt.datetime.now().strftime(r"%Y%m%dT%H%M%S")

In [33]:
# Get boto3 session attributes.
account = boto3_session.client("sts").get_caller_identity()["Account"]
region = boto3_session.region_name

# Create S3 resource and retrieve data bucket name.
s3_resource = boto3_session.resource("s3")
with open("/home/ec2-user/.aiml-bb/stack-data.json", "r") as f:
    data = json.load(f)
    data_bucket = data["data_bucket"]
    model_bucket = data["model_bucket"]
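
The cell above assumes the stack data file exists and contains both bucket names. A minimal sketch of a loader that fails early with a readable message is shown below; the helper name `load_stack_data` and the `REQUIRED_KEYS` constant are hypothetical and not part of the original notebook.

```python
import json
from pathlib import Path

# Keys read by the setup cell above; the constant itself is a hypothetical name.
REQUIRED_KEYS = ("data_bucket", "model_bucket")


def load_stack_data(path):
    """Load the stack data JSON and return (data_bucket, model_bucket).

    Raises KeyError if either bucket name is missing or empty.
    """
    data = json.loads(Path(path).read_text())
    missing = [key for key in REQUIRED_KEYS if not data.get(key)]
    if missing:
        raise KeyError(f"stack data is missing: {', '.join(missing)}")
    return data["data_bucket"], data["model_bucket"]
```

In the notebook this could be called as `data_bucket, model_bucket = load_stack_data("/home/ec2-user/.aiml-bb/stack-data.json")`.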

## Define resources for estimator

In [34]:
# Get XGBoost container image for current region.
xgb_container_image = sm.image_uris.retrieve("xgboost", region, "latest")

# Create a unique training job name.
training_job_name = f"xgboost-{str(uuid.uuid4())[:8]}"
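
Training job names have to satisfy SageMaker's naming rules. The sketch below builds a name the same way as the cell above and checks it against the pattern and 63-character limit documented for the `CreateTrainingJob` API, as we understand them. The helper name `make_job_name` is hypothetical.

```python
import re
import uuid

# Pattern and length limit as documented for CreateTrainingJob (our reading of the API reference).
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")
MAX_JOB_NAME_LENGTH = 63


def make_job_name(prefix="xgboost"):
    """Return '<prefix>-<8 hex chars>' or raise ValueError if the name is invalid."""
    name = f"{prefix}-{uuid.uuid4().hex[:8]}"
    if len(name) > MAX_JOB_NAME_LENGTH or not JOB_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"invalid training job name: {name!r}")
    return name
```

Underscores, for example, are rejected, so `make_job_name("xgb_test")` raises `ValueError` instead of failing later when the job is submitted.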

In [35]:
train_input = sm.inputs.TrainingInput(
    s3_data=f"s3://{model_bucket}/preprocessing_output/train/", 
    content_type="csv"
)
validation_input = sm.inputs.TrainingInput(
    s3_data=f"s3://{model_bucket}/preprocessing_output/validation/",
    content_type="csv"
)

## Create and fit estimator
Here we create the `Estimator` object and define default hyperparameters along with the ranges to search during tuning.
We then wrap the estimator in a `HyperparameterTuner` and fit the tuner.

In [47]:
# Create estimator running the XGBoost container.
xgb_estimator = sm.estimator.Estimator(
    xgb_container_image,
    role, 
    instance_count=1, 
    instance_type="ml.m5.12xlarge",
    volume_size=50,
    output_path=f"s3://{model_bucket}/sagemaker-xgboost/"
)

In [48]:
# Define starting hyperparameters for the model.
xgb_estimator.set_hyperparameters(
    eval_metric="auc",
    objective="binary:logistic",
    max_depth=5,
    eta=0.2,
    gamma=4,
    min_child_weight=6,
    subsample=0.8,
    silent=0,
    num_round=100
)
# Set ranges of XGBoost hyperparameters for tuning.
xgb_hyperparameter_ranges = {
    "eta": sm.tuner.ContinuousParameter(0, 1),
    "alpha": sm.tuner.ContinuousParameter(0, 2),
    "min_child_weight": sm.tuner.ContinuousParameter(1, 10),
    "max_depth": sm.tuner.IntegerParameter(1, 10)
}
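
Three of the starting hyperparameters (`eta`, `min_child_weight`, `max_depth`) also appear in the tuning ranges. As far as we understand the SageMaker Python SDK, a value that has a range is chosen by the tuner for each job, so its starting value is not held fixed. The sketch below separates the two groups using plain dictionaries; the `(low, high)` tuples stand in for the SDK parameter objects and the function name is hypothetical.

```python
# Starting values from the set_hyperparameters call above.
defaults = {
    "eval_metric": "auc", "objective": "binary:logistic", "max_depth": 5,
    "eta": 0.2, "gamma": 4, "min_child_weight": 6, "subsample": 0.8,
    "silent": 0, "num_round": 100,
}
# Tuning ranges from above, written as (low, high) tuples for illustration.
ranges = {
    "eta": (0, 1), "alpha": (0, 2),
    "min_child_weight": (1, 10), "max_depth": (1, 10),
}


def split_hyperparameters(defaults, ranges):
    """Return (static, overridden): defaults kept fixed vs. defaults replaced by tuning."""
    static = {k: v for k, v in defaults.items() if k not in ranges}
    overridden = {k: v for k, v in defaults.items() if k in ranges}
    return static, overridden


static, overridden = split_hyperparameters(defaults, ranges)
print("fixed for every job:", sorted(static))
print("starting values superseded by ranges:", sorted(overridden))
```

Note that `alpha` has a range but no starting value, so it is tuned without a default being set here.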

In [49]:
# Create tuner and fit.
xgb_objective_metric_name = "validation:auc"
xgb_tuner = sm.tuner.HyperparameterTuner(
    xgb_estimator,
    xgb_objective_metric_name,
    xgb_hyperparameter_ranges,
    max_jobs=10,
    max_parallel_jobs=5,
    strategy="Bayesian"
)
xgb_tuner.fit(
    {"train": train_input, "validation": validation_input}
)

## Select best model
View the analytics from the tuning job and select the best model.
We then copy the best model to a location that other notebooks can reference more easily.

In [54]:
# View analytics on tuning job results, sorted ascending so the highest AUC is in the last row.
xgb_tuner_analytics = sm.HyperparameterTuningJobAnalytics(
    xgb_tuner.describe()["HyperParameterTuningJobName"]
)
xgb_tuner_analytics.dataframe().sort_values("FinalObjectiveValue")

| | alpha | eta | max_depth | min_child_weight | TrainingJobName | TrainingJobStatus | FinalObjectiveValue | TrainingStartTime | TrainingEndTime | TrainingElapsedTimeSeconds |
|---|---|---|---|---|---|---|---|---|---|---|
| 4 | 1.976319 | 0.077564 | 3.0 | 8.556083 | xgboost-220126-2026-006-cfbfc4df | Completed | 0.656425 | 2022-01-26 20:54:29+00:00 | 2022-01-26 21:21:13+00:00 | 1604.0 |
| 0 | 0.042032 | 0.05376 | 4.0 | 3.718979 | xgboost-220126-2026-010-a8a6520e | Completed | 0.656621 | 2022-01-26 21:24:19+00:00 | 2022-01-26 21:55:56+00:00 | 1897.0 |
| 9 | 1.128438 | 0.526651 | 1.0 | 2.209369 | xgboost-220126-2026-001-dc8ccd8f | Completed | 0.658205 | 2022-01-26 20:29:18+00:00 | 2022-01-26 20:50:46+00:00 | 1288.0 |
| 6 | 1.053283 | 0.175031 | 3.0 | 1.363933 | xgboost-220126-2026-004-d3fb01e7 | Completed | 0.664995 | 2022-01-26 20:29:16+00:00 | 2022-01-26 20:56:07+00:00 | 1611.0 |
| 5 | 0.161095 | 0.407553 | 2.0 | 9.182872 | xgboost-220126-2026-005-7a1b368f | Completed | 0.665094 | 2022-01-26 20:29:39+00:00 | 2022-01-26 20:53:42+00:00 | 1443.0 |
| 3 | 0.350343 | 0.783599 | 2.0 | 9.896415 | xgboost-220126-2026-007-54fa9be2 | Completed | 0.666874 | 2022-01-26 20:56:31+00:00 | 2022-01-26 21:20:31+00:00 | 1440.0 |
| 8 | 0.541318 | 0.217562 | 4.0 | 2.557321 | xgboost-220126-2026-002-c9e0a844 | Completed | 0.671957 | 2022-01-26 20:29:30+00:00 | 2022-01-26 21:00:09+00:00 | 1839.0 |
| 1 | 1.083935 | 0.437011 | 4.0 | 3.119949 | xgboost-220126-2026-009-63ddd00a | Completed | 0.675611 | 2022-01-26 21:02:53+00:00 | 2022-01-26 21:33:04+00:00 | 1811.0 |
| 2 | 0.868387 | 0.26184 | 9.0 | 1.005081 | xgboost-220126-2026-008-180ba864 | Completed | 0.691933 | 2022-01-26 20:59:02+00:00 | 2022-01-26 21:50:34+00:00 | 3092.0 |
| 7 | 1.865069 | 0.474299 | 10.0 | 8.813583 | xgboost-220126-2026-003-1fc7f8bd | Completed | 0.70186 | 2022-01-26 20:29:45+00:00 | 2022-01-26 21:26:31+00:00 | 3406.0 |
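
Because the objective is `validation:auc` and the tuner maximizes its objective by default, the best job is the one with the highest `FinalObjectiveValue`. A minimal sketch of that selection, using three rows copied from the results above as plain dictionaries (the helper name `best_job` is hypothetical):

```python
# Three rows copied from the tuning results above.
rows = [
    {"TrainingJobName": "xgboost-220126-2026-006-cfbfc4df", "FinalObjectiveValue": 0.656425},
    {"TrainingJobName": "xgboost-220126-2026-008-180ba864", "FinalObjectiveValue": 0.691933},
    {"TrainingJobName": "xgboost-220126-2026-003-1fc7f8bd", "FinalObjectiveValue": 0.70186},
]


def best_job(rows, maximize=True):
    """Return the row with the best objective value (highest if maximize, else lowest)."""
    pick = max if maximize else min
    return pick(rows, key=lambda row: row["FinalObjectiveValue"])


print(best_job(rows)["TrainingJobName"])
```

For these rows the selected job is `xgboost-220126-2026-003-1fc7f8bd`, the job with `max_depth` 10 and an AUC of 0.70186.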


In [63]:
# Take best model and copy to location we can reference in other notebooks.
xgb_best_training_job_name = xgb_tuner.best_training_job()
xgb_best_training_job_key = f"sagemaker-xgboost/{xgb_best_training_job_name}/output/model.tar.gz"
copy_source = {
    "Bucket": model_bucket,
    "Key": xgb_best_training_job_key
}
s3_resource.Bucket(model_bucket).copy(
    copy_source, 
    "sagemaker-xgboost-tuned/model.tar.gz"
)

## Host and test model
To test the model, we now create an endpoint to which we can send the test data that was set aside during preprocessing.

In [79]:
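xgb_model = sm.model.Model(
    image_uri=xgb_container_image,
    model_data=f"s3://{model_bucket}/sagemaker-xgboost-tuned/model.tar.gz",
    role=role,
    # Without predictor_cls, deploy() on a generic Model returns None.
    predictor_cls=sm.Predictor,
    sagemaker_session=sm_session
)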
endpoint_name = f"xgb-test-endpt-{now}"
xgb_predictor = xgb_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name=endpoint_name
)

In [None]:
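# Iterate over testing data and compute statistics.
# NOTE: this cell is unfinished in the source. The test data prefix is not shown,
# so the loop lists every object in the bucket as a placeholder.
for obj in s3_resource.Bucket(model_bucket).objects.all():
    print(obj.key)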
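
The statistics themselves are not shown in the notebook. As a sketch of what could be computed once labels and predicted scores have been collected from the endpoint, the function below derives accuracy, precision, recall, and AUC in plain Python. The function name, the 0.5 threshold, and the choice of metrics are assumptions, not the author's method.

```python
def binary_classification_stats(labels, scores, threshold=0.5):
    """Return accuracy, precision, recall, and AUC for 0/1 labels and predicted scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)

    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    # AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half.
    # This pairwise form is quadratic, which is fine for a sketch but slow on large test sets.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    auc = wins / (len(positives) * len(negatives)) if positives and negatives else float("nan")

    return {
        "accuracy": (tp + tn) / len(labels),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "auc": auc,
    }
```

For example, `binary_classification_stats([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` gives an accuracy of 0.75, precision of 1.0, recall of 0.5, and AUC of 0.75.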

In [74]:
sm_session.delete_endpoint(endpoint_name)
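
If the testing step raises an error, the delete cell above may never run and the endpoint keeps accruing cost. One way to guard against that, sketched under the assumption that deploy, test, and delete run in a single cell, is a small context manager that always calls the delete function on exit. The name `temporary_endpoint` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def temporary_endpoint(endpoint_name, delete_fn):
    """Yield the endpoint name and call delete_fn(endpoint_name) on exit, even after an error."""
    try:
        yield endpoint_name
    finally:
        delete_fn(endpoint_name)
```

In this notebook it could be used as `with temporary_endpoint(endpoint_name, sm_session.delete_endpoint): ...`, with the test loop inside the `with` block.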