## SageMaker XGBoost Algorithm

This lab trains a model with SageMaker's built-in XGBoost algorithm. The documentation is here:

https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost.html

For CSV training, the algorithm assumes that the target variable is in the first column and that the CSV does not have a header record.
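
As a minimal sketch of that layout, the helper below rewrites a CSV that has a header row so that the target column comes first and the header is dropped. It uses only the standard library. The function name `to_xgboost_csv` and the column names (including `Survived` as the target) are illustrative assumptions, not part of the lab's data preparation code.

```python
import csv
import io


def to_xgboost_csv(src, dst, target):
    """Copy a headered CSV from src to dst with the target column first and no header."""
    reader = csv.reader(src)
    header = next(reader)        # first row holds the column names
    t = header.index(target)     # raises ValueError if the target column is missing
    writer = csv.writer(dst, lineterminator="\n")
    for row in reader:
        # target value first, then the remaining columns in their original order
        writer.writerow([row[t]] + row[:t] + row[t + 1:])


# Example with hypothetical columns; "Survived" is the assumed target name.
raw = io.StringIO("Age,Fare,Survived\n22,7.25,0\n38,71.28,1\n")
out = io.StringIO()
to_xgboost_csv(raw, out, target="Survived")
print(out.getvalue())
```

The same function works on real files if `src` and `dst` are opened with `open(..., newline="")`.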

For the full list of hyperparameters, see:

https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html


In [None]:
import sagemaker
import boto3
from sagemaker import image_uris
from sagemaker.inputs import TrainingInput
import numpy as np

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()

# get the URI for the XGBoost container
container_image = image_uris.retrieve(
    framework='xgboost',
    region=boto3.Session().region_name,
    version='latest'
)

# build a SageMaker estimator class
xgb_estimator = sagemaker.estimator.Estimator(
    container_image,
    role,
    instance_count=1,
    instance_type='ml.m5.large',
    output_path='s3://{}/titanic/output'.format(bucket),
    sagemaker_session=sagemaker_session
)

# set the hyperparameters
xgb_estimator.set_hyperparameters(
    max_depth=6,
    eta=0.1,
    gamma=0,
    min_child_weight=6,
    subsample=0.7,
    verbosity=1,
    objective='multi:softmax',
    num_class=2,
    num_round=5
)

#### Uploading the Training Dataset

In [None]:
# Upload the dataset to our S3 bucket
input_train = sagemaker_session.upload_data(path='train.csv', key_prefix='titanic')
input_val = sagemaker_session.upload_data(path='val.csv', key_prefix='titanic')

#### Start Training

In [None]:
# Train against the training and validation sets uploaded above.
# Progress can also be followed under Training jobs in the SageMaker console.

content_type = "csv"
train_input = TrainingInput(input_train, content_type=content_type)
validation_input = TrainingInput(input_val, content_type=content_type)

xgb_estimator.fit({
    'train': train_input,
    'validation': validation_input
})

In [None]:
# Save this result to be used in the next notebook
xgb_estimator.latest_training_job.job_name

### Hyperparameter Tuning

In [None]:
from time import gmtime, strftime

from sagemaker.tuner import (
    IntegerParameter,
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)

objective_metric_name = "validation:merror"

hyperparameter_ranges = {
    "alpha": ContinuousParameter(0.01, 10, scaling_type="Logarithmic"),
    "lambda": ContinuousParameter(0.01, 10, scaling_type="Logarithmic"),
    "eta": ContinuousParameter(0, 1, scaling_type="Linear"),
    "gamma": ContinuousParameter(0, 10, scaling_type="Linear")
}

tuner = HyperparameterTuner(
    xgb_estimator,
    objective_metric_name,
    hyperparameter_ranges,
    max_jobs=9,
    max_parallel_jobs=3,
    strategy="Bayesian",
    objective_type='Minimize'
)
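
The `scaling_type` argument controls how values are spread across a range. The sketch below illustrates only that idea, using plain random draws rather than SageMaker's Bayesian search: with logarithmic scaling each order of magnitude gets equal weight, so small values of `alpha` and `lambda` are explored far more often than with linear scaling. Because the logarithm needs a positive lower bound, ranges that start at 0 (`eta` and `gamma` above) use linear scaling. The helper names are illustrative.

```python
import math
import random


def sample_linear(low, high, rng):
    """Draw uniformly between low and high."""
    return rng.uniform(low, high)


def sample_log(low, high, rng):
    """Draw uniformly in log10 space, then map back; requires low > 0."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


rng = random.Random(0)  # fixed seed so the comparison is repeatable
n = 10_000
linear = [sample_linear(0.01, 10, rng) for _ in range(n)]
logged = [sample_log(0.01, 10, rng) for _ in range(n)]

# Fraction of draws that land below 1.0 under each scheme
frac_linear = sum(x < 1.0 for x in linear) / n
frac_logged = sum(x < 1.0 for x in logged) / n
print(f"linear: {frac_linear:.2f}, log: {frac_logged:.2f}")
```

On the range 0.01 to 10, roughly a tenth of the linear draws fall below 1.0, compared with about two thirds of the logarithmic draws.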

In [None]:
tuner.fit({
    'train': train_input,
    'validation': validation_input
    },
    job_name="xgb-bayes-" + strftime("%Y%m%d-%H-%M-%S", gmtime()),
)
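
The tuning job name is built from a prefix plus a UTC timestamp, and the timestamp alone takes 17 characters. The sketch below checks a candidate name before submitting it, assuming the limits of 32 characters and letters, digits, and single or repeated hyphens between them; treat both constants as assumptions to confirm against the `CreateHyperParameterTuningJob` API reference. The helper `make_tuning_job_name` is hypothetical.

```python
import re
from time import gmtime, strftime

MAX_TUNING_JOB_NAME = 32  # assumed limit; confirm in the API reference
NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9])*")  # assumed allowed pattern


def make_tuning_job_name(prefix, now=None):
    """Append a UTC timestamp to prefix and validate the result against the assumed rules."""
    stamp = strftime("%Y%m%d-%H-%M-%S", now if now is not None else gmtime())
    name = prefix + stamp
    if len(name) > MAX_TUNING_JOB_NAME:
        raise ValueError(f"{name!r} is {len(name)} characters; limit is {MAX_TUNING_JOB_NAME}")
    if not NAME_PATTERN.fullmatch(name):
        raise ValueError(f"{name!r} must use only letters, digits, and hyphens")
    return name


print(make_tuning_job_name("xgb-tune-"))
```

With these assumptions a prefix can be at most 15 characters long, and it cannot contain underscores.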

In [None]:
sagemaker.HyperparameterTuningJobAnalytics(
    tuner.latest_tuning_job.job_name
).dataframe()

In [None]:
tuner.best_training_job()

In [None]:
# Use this in the next notebook
tuner.latest_tuning_job.job_name

You can now move on to [Lab3](./3-Deploy.ipynb).