## SageMaker XGBoost Algorithm

This notebook trains a model with SageMaker's built-in XGBoost algorithm. The documentation is here:

https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html

For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record.
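
As a minimal sketch of that layout requirement, the snippet below moves a label column to the front and drops the header row using only the standard library. The helper name `to_sagemaker_csv` and the three-column sample are hypothetical and are not part of this lab's dataset preparation.

```python
import csv
import io


def to_sagemaker_csv(raw_csv: str, target: str) -> str:
    """Return CSV text with `target` as the first column and no header row."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    # Keep the remaining columns in their original order.
    features = [name for name in reader.fieldnames if name != target]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([row[target]] + [row[name] for name in features])
    return out.getvalue()


# Hypothetical sample with the label in the middle column.
sample = "Pclass,Survived,Age\n3,0,22\n1,1,38\n"
converted = to_sagemaker_csv(sample, target="Survived")
print(converted)
```

The same reordering can be done with pandas if the data is already in a DataFrame; the point is only that the label comes first and no header line is written.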

For the list of supported hyperparameters, see:

https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html

A visual representation of the gradient boosted trees algorithm can be found [here](https://www.researchgate.net/figure/A-simple-example-of-visualizing-gradient-boosting_fig5_326379229).


In [None]:
import sagemaker
import boto3
from sagemaker import image_uris
from sagemaker.inputs import TrainingInput
import numpy as np

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()

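# Get the URI of the built-in XGBoost container image for the current region.
# Pinning an explicit version instead of 'latest' makes runs easier to reproduce.
container_image = image_uris.retrieve(framework='xgboost', region=boto3.Session().region_name, version='latest')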

# build a SageMaker estimator class
xgb_estimator = sagemaker.estimator.Estimator(
    container_image,
    role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path='s3://{}/titanic/training'.format(bucket),
    sagemaker_session=sagemaker_session
)

# set the hyperparameters
xgb_estimator.set_hyperparameters(
    max_depth=6,
    eta=0.1,
    gamma=0,
    min_child_weight=6,
    subsample=0.7,
    verbosity=1,
    objective='multi:softmax',
    num_class=2,
    num_round=5
)

### Upload the Training Dataset

In [None]:
# Upload the dataset to our S3 bucket
input_train = sagemaker_session.upload_data(path='train.csv', key_prefix='titanic')
input_val = sagemaker_session.upload_data(path='val.csv', key_prefix='titanic')

### Start Training

In [None]:
# Run training against the train and validation sets uploaded above.
# Progress can be followed under "Training jobs" in the SageMaker console.

content_type = "csv"
train_input = TrainingInput(input_train, content_type=content_type)
validation_input = TrainingInput(input_val, content_type=content_type)

xgb_estimator.fit({
    'train': train_input,
    'validation': validation_input
})

### Save Costs with Spot Instances

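Let's train the model again on managed Spot capacity, which is usually cheaper than on-demand instances. We also change some hyperparameters to see whether the model improves, and switch the objective to `binary:logistic` (logistic regression for binary classification).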
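
One practical difference between the two objectives: `multi:softmax` returns the predicted class directly, while `binary:logistic` returns the probability of the positive class. The sketch below shows one way to turn such probabilities into labels. The helper name, the sample values, and the 0.5 cut-off are illustrative assumptions, not something this lab prescribes.

```python
def probabilities_to_labels(probabilities, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels using a cut-off."""
    return [1 if p > threshold else 0 for p in probabilities]


# Hypothetical model outputs for four passengers.
predicted_probabilities = [0.91, 0.12, 0.55, 0.49]
labels = probabilities_to_labels(predicted_probabilities)
print(labels)
```

A different threshold trades precision against recall, which is one reason to prefer probabilities over hard class labels.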

In [None]:
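# Build an estimator that trains on Spot capacity.
# max_run caps the training time in seconds; max_wait (>= max_run) also includes time spent waiting for capacity.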
xgb_estimator = sagemaker.estimator.Estimator(
    container_image,
    role,
    use_spot_instances=True,
    max_run=1200,
    max_wait=1800,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path='s3://{}/titanic/training'.format(bucket),
    sagemaker_session=sagemaker_session
)

# set the hyperparameters
xgb_estimator.set_hyperparameters(
    max_depth=6,
    eta=0.2,
    gamma=2,
    min_child_weight=2,
    subsample=0.8,
    verbosity=1,
    objective='binary:logistic',
    num_round=15
)

xgb_estimator.fit({
    'train': train_input,
    'validation': validation_input
})
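
When a Spot job finishes, its log reports training seconds, billable seconds, and the resulting savings. As a rough sketch, the snippet below reproduces that percentage and checks the `max_wait >= max_run` constraint. The helper names and the sample durations are hypothetical; only the two limits are taken from the estimator above.

```python
def check_spot_limits(max_run: int, max_wait: int) -> None:
    """Raise if the total wait budget is shorter than the training budget."""
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")


def spot_savings_percent(training_seconds: float, billable_seconds: float) -> float:
    """Share of the training time that was not billed, as a percentage."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100


# Limits used by the estimator above.
check_spot_limits(max_run=1200, max_wait=1800)

# Hypothetical durations read from a finished job's log.
savings = spot_savings_percent(training_seconds=100, billable_seconds=30)
print(f"Managed Spot Training savings: {savings:.1f}%")
```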

### Train with Script Mode

Let's keep the same training setup, but switch to the `XGBoost` framework estimator and provide our own training script, `./src/train.py`.
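
The contents of `./src/train.py` are not shown here. As a hedged sketch of how a script-mode entry point typically receives its inputs: SageMaker passes hyperparameters as command-line arguments and exposes the data channels and model directory through environment variables such as `SM_CHANNEL_TRAIN`, `SM_CHANNEL_VALIDATION`, and `SM_MODEL_DIR`. The function name `parse_args` and the default values are assumptions for illustration only.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters and SageMaker-provided paths (illustrative only)."""
    parser = argparse.ArgumentParser()

    # Hyperparameters arrive as command-line arguments.
    parser.add_argument("--num_class", type=int, default=2)
    parser.add_argument("--silent", type=int, default=0)
    parser.add_argument("--objective", type=str, default="multi:softmax")
    parser.add_argument("--num_round", type=int, default=10)

    # Channel and model locations are exposed as environment variables.
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--validation", default=os.environ.get("SM_CHANNEL_VALIDATION"))
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR"))

    # Ignore any extra arguments the platform may add.
    args, _ = parser.parse_known_args(argv)
    return args


# Simulate the arguments a training container might pass to the script.
args = parse_args(["--num_round", "10", "--objective", "multi:softmax"])
print(args.num_round, args.objective, args.num_class)
```

A real script would go on to load the CSV files from the channel directories, train the booster, and save the model under `model_dir`.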


In [None]:
from sagemaker.xgboost.estimator import XGBoost

# build a SageMaker estimator Framework class
xgb_estimator = XGBoost(
    role=role,
    framework_version='1.0-1',
    instance_count=1,
    instance_type='ml.m5.large',
    output_path='s3://{}/titanic/training'.format(bucket),
    entry_point="./src/train.py", ## OUR SCRIPT
    sagemaker_session=sagemaker_session,
    hyperparameters={
        "num_class": 2,
        "silent": 0,
        "objective": 'multi:softmax',
        "num_round": 10 
    })

xgb_estimator.fit({
    'train': train_input,
    'validation': validation_input
})

You can download and extract the trained model artefact locally. It can later be loaded back into an XGBoost Python object and used for further training or for prediction.

In [None]:
!aws s3 cp {xgb_estimator.output_path}/{xgb_estimator.latest_training_job.job_name}/output/model.tar.gz .
!tar -xzvf model.tar.gz
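
If you prefer to stay in Python rather than shell out to `tar`, the archive can be unpacked with the standard library. This is a minimal sketch: the helper name `extract_model_archive` and the `model` target directory are assumptions, and the file names inside the archive depend on what the training script saved.

```python
import tarfile
from pathlib import Path


def extract_model_archive(archive_path, target_dir="model"):
    """Extract a .tar.gz archive and return the names of its members."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        archive.extractall(path=target)
    return names


# Only runs when the archive has already been downloaded next to the notebook.
if Path("model.tar.gz").exists():
    print(extract_model_archive("model.tar.gz"))
```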

### Hyperparameter Tuning

We set the objective metric to `validation:merror`, which the [XGBoost documentation](https://xgboost.readthedocs.io/en/stable/parameter.html) defines as:

`merror: Multiclass classification error rate. It is calculated as #(wrong cases)/#(all cases)`
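
To make the definition concrete, here is a small sketch that computes the same ratio by hand. The function name `merror` and the sample labels are illustrative; XGBoost computes the metric internally during training.

```python
def merror(y_true, y_pred):
    """Multiclass error rate: number of wrong predictions / number of cases."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one case is required")
    wrong = sum(1 for truth, guess in zip(y_true, y_pred) if truth != guess)
    return wrong / len(y_true)


# Hypothetical labels: two of the five predictions are wrong.
actual = [0, 1, 1, 0, 1]
predicted = [0, 1, 0, 0, 0]
error_rate = merror(actual, predicted)
print(error_rate)
```

Because this is an error rate, lower is better, which is why the tuner below is configured to minimize it.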

In [None]:
from time import gmtime, strftime

from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

objective_metric_name = "validation:merror"

hyperparameter_ranges = {
    "alpha": ContinuousParameter(0.01, 10, scaling_type="Logarithmic"),
    "lambda": ContinuousParameter(0.01, 10, scaling_type="Logarithmic"),
    "eta": ContinuousParameter(0, 1, scaling_type="Linear"),
    "gamma": ContinuousParameter(0, 10, scaling_type="Linear")
}

tuner = HyperparameterTuner(
    xgb_estimator,
    objective_metric_name,
    hyperparameter_ranges,
    max_jobs=9,
    max_parallel_jobs=3,
    strategy="Bayesian",
    objective_type='Minimize'
)
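
The ranges above use logarithmic scaling for `alpha` and `lambda` and linear scaling for `eta` and `gamma`. Logarithmic scaling only works for ranges with values greater than 0, which fits the ranges starting at 0.01 but not those starting at 0. The sketch below uses plain random sampling, not the Bayesian strategy the tuner applies, purely to illustrate how the two scales spread values over the range 0.01 to 10. All helper names are hypothetical.

```python
import math
import random


def sample_linear(low, high, rng):
    """Draw a value uniformly between low and high."""
    return rng.uniform(low, high)


def sample_logarithmic(low, high, rng):
    """Draw a value uniformly in log space; requires low > 0."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
n = 10_000
linear = [sample_linear(0.01, 10, rng) for _ in range(n)]
logarithmic = [sample_logarithmic(0.01, 10, rng) for _ in range(n)]

# Share of samples that fall below 1.0 on each scale.
share_linear = sum(value < 1 for value in linear) / n
share_log = sum(value < 1 for value in logarithmic) / n
print(f"linear: {share_linear:.2f}, logarithmic: {share_log:.2f}")
```

On the linear scale only about a tenth of the draws fall below 1, while on the logarithmic scale roughly two thirds do, so small regularization values get far more attention.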

In [None]:
tuner.fit({
    'train': train_input,
    'validation': validation_input
    },
    job_name="xgb-bayesian-" + strftime("%Y%m%d-%H-%M-%S", gmtime()),
)

In [None]:
sagemaker.HyperparameterTuningJobAnalytics(
    tuner.latest_tuning_job.job_name
).dataframe()

In [None]:
tuner.best_training_job()

In [None]:
# Use this in the next notebook
tuner.latest_tuning_job.job_name

You can now move on to [Lab 3](./3-Deploy.ipynb).