# XGBoost Cloud Training Template

This template uses _Amazon SageMaker's_ built-in _XGBoost_ algorithm together with the _SageMaker Python SDK_ and _Boto3_ (the AWS SDK for Python).

The _SageMaker Python SDK_ includes ```Estimator``` and ```Model``` implementations for _MXNet, TensorFlow, Chainer, PyTorch, scikit-learn_, the _Amazon SageMaker_ built-in algorithms, and Reinforcement Learning. There is also an ```Estimator``` that runs _SageMaker_-compatible custom Docker containers, so you can run your own ML algorithms through the SDK.

**Note: To simplify the process, ensure that the S3 bucket is in the same region as the SageMaker instance, e.g. us-east-2.**

## Import Packages

In [1]:
import numpy as np
import pandas as pd

import boto3
import re

import sagemaker
from sagemaker import get_execution_role
# SageMaker SDK documentation: https://sagemaker.readthedocs.io/en/latest/overview.html

# from sagemaker.amazon.amazon_estimator import get_image_uri

## Upload Data to S3 Bucket

In [2]:
# Specify the bucket here
bucket_name = 's3-2-ml-sagemaker'

training_folder = r'bikerental/training/'
validation_folder = r'bikerental/validation/'
test_folder = r'bikerental/test/'

s3_model_output_location = r's3://{0}/bikerental/model'.format(bucket_name)
s3_training_file_location = r's3://{0}/{1}'.format(bucket_name,training_folder)
s3_validation_file_location = r's3://{0}/{1}'.format(bucket_name,validation_folder)
s3_test_file_location = r's3://{0}/{1}'.format(bucket_name,test_folder)

In [3]:
print(s3_model_output_location)
print(s3_training_file_location)
print(s3_validation_file_location)
print(s3_test_file_location)

s3://s3-2-ml-sagemaker/bikerental/model
s3://s3-2-ml-sagemaker/bikerental/training/
s3://s3-2-ml-sagemaker/bikerental/validation/
s3://s3-2-ml-sagemaker/bikerental/test/
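
The printed locations are plain `s3://<bucket>/<key prefix>` strings. When a Boto3 call needs the bucket and key separately, as the upload helper in this section does, the URI can be split with the standard library. This is a minimal sketch: `split_s3_uri` is a hypothetical helper name, not part of Boto3 or the SageMaker SDK.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("not an S3 URI: {!r}".format(uri))
    # The path keeps its leading '/', which is not part of the S3 key.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri("s3://s3-2-ml-sagemaker/bikerental/training/")
print(bucket)  # s3-2-ml-sagemaker
print(prefix)  # bikerental/training/
```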


In [4]:
# Files are referred to as objects in S3.
# A file name is referred to as a key (key name) in S3.

# Objects stored in S3 (Standard storage class) are automatically replicated across
# at least three Availability Zones in the region where the bucket was created.

# http://boto3.readthedocs.io/en/latest/guide/s3.html
def write_to_s3(filename, bucket, key):
    with open(filename,'rb') as f: # Read in binary mode
        return boto3.Session().resource('s3').Bucket(bucket).Object(key).upload_fileobj(f)

In [5]:
write_to_s3('bikeTrainingv2.csv', 
            bucket_name,
            training_folder + 'bikeTrainingv2.csv')

write_to_s3('bikeValidationv2.csv',
            bucket_name,
            validation_folder + 'bikeValidationv2.csv')

write_to_s3('bikeTestv2.csv',
            bucket_name,
            test_folder + 'bikeTestv2.csv')

The files used are the v2 versions, i.e. the ones prepared with the _log1p(count)_ transform of the target.
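
As a reminder of what the _log1p(count)_ transform does: it maps a count `c` to `log(1 + c)`, which compresses large values and is defined at zero, and `expm1` inverts it. The sketch below uses Python's `math` module on made-up counts. How the v2 files were actually produced is not shown in this template.

```python
import math

# Hypothetical hourly rental counts (illustrative values, not from the dataset).
counts = [0, 1, 16, 250]

# Forward transform applied to the target: log(1 + count).
transformed = [math.log1p(c) for c in counts]

# Inverse transform back to the original scale: exp(x) - 1.
recovered = [math.expm1(x) for x in transformed]

for c, t, r in zip(counts, transformed, recovered):
    print("count={:>3}  log1p={:.4f}  expm1(log1p)={:.1f}".format(c, t, r))
```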

## Training Algorithm Docker Image

### SageMaker maintains a separate image for each algorithm and region

_Common Parameters for Built-In Algorithms_
https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html

### Start a New Session with AWS

In [6]:
sess = sagemaker.Session()

In [7]:
role = get_execution_role()

In [8]:
# The role contains the permissions required to train and deploy models
# The SageMaker service is trusted to assume this role
print(role) # To show the ARN for the role

arn:aws:iam::399426528351:role/service-role/AmazonSageMaker-ExecutionRole-20200203T173955


In [9]:
# The SageMaker SDK maintains the algorithm-to-container mapping for us
# Specify the region, algorithm and version
# This model specifically uses v0.90-2; "latest" can be passed instead

region = "us-east-2" # configure region of the s3 bucket


container = sagemaker.amazon.amazon_estimator.get_image_uri(
    region,
    "xgboost", 
    "0.90-2")

print('Using SageMaker XGBoost container:\n{} ({})'.format(container, region))

Using SageMaker XGBoost container:
257758044811.dkr.ecr.us-east-2.amazonaws.com/sagemaker-xgboost:0.90-2-cpu-py3 (us-east-2)


## Build Model

In [10]:
# Configure the training job
# Specify type and number of instances to use
# Specify the S3 location where the final model artifacts are stored

#   Reference: http://sagemaker.readthedocs.io/en/latest/estimators.html

estimator = sagemaker.estimator.Estimator(
    container,
    role, 
    train_instance_count=1, 
    train_instance_type='ml.m4.xlarge',
    output_path=s3_model_output_location,
    sagemaker_session=sess,
    base_job_name ='xgboost-bikerental-v2')

In [11]:
# XGBoost Training Parameter Reference: 
#   https://github.com/dmlc/xgboost/blob/master/doc/parameter.md

# Candidate settings: max_depth=5, eta=0.1, subsample=0.7, num_round=200
# (subsample is not set in the call below)
estimator.set_hyperparameters(max_depth=5,
                              objective="reg:squarederror",
                              eta=0.1,
                              num_round=200)

In [12]:
estimator.hyperparameters() # double-check of settings (in dictionary format)

{'max_depth': 5, 'objective': 'reg:squarederror', 'eta': 0.1, 'num_round': 200}
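
SageMaker passes hyperparameters to the training container as strings, and the container then tries to decode each value as JSON. Numbers such as `5` or `0.1` decode cleanly, while a bare string such as `reg:squarederror` is not valid JSON and is kept as-is. That is what the informational `Failed to parse hyperparameter objective value reg:squarederror to Json` line in the training log refers to. The function below is a minimal sketch of that behaviour as I understand it; `decode_hyperparameters` is a hypothetical name, not the container's actual code.

```python
import json


def decode_hyperparameters(raw):
    """Decode string-valued hyperparameters, keeping non-JSON values unchanged."""
    decoded = {}
    for name, value in raw.items():
        try:
            decoded[name] = json.loads(value)
        except ValueError:
            # Not valid JSON (e.g. an unquoted string): keep the value itself.
            decoded[name] = value
    return decoded


raw = {"max_depth": "5", "objective": "reg:squarederror", "eta": "0.1", "num_round": "200"}
print(decode_hyperparameters(raw))
```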

## Specify Training Data Location and Optionally Validation Data Location

For XGBoost, the content type can be CSV or LIBSVM.
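
For CSV input, SageMaker's built-in XGBoost expects files without a header row and with the target variable in the first column. A quick local check before uploading can catch a stray header. The sketch below runs on an in-memory sample with made-up values, and `looks_headerless` is a hypothetical helper name.

```python
import csv
import io


def looks_headerless(csv_text):
    """Return True if every field in the first row parses as a number."""
    first_row = next(csv.reader(io.StringIO(csv_text)))
    try:
        for field in first_row:
            float(field)
    except ValueError:
        return False
    return True


# Made-up rows: the first column is the target, the rest are features.
good = "3.73,3,0,1,2,28.7\n2.19,1,0,0,1,9.84\n"
bad = "count,season,holiday,workingday,weather,temp\n" + good

print(looks_headerless(good))  # True
print(looks_headerless(bad))   # False
```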

In [13]:
training_input_config = sagemaker.session.s3_input(
    s3_data=s3_training_file_location,
    content_type='csv',
    s3_data_type='S3Prefix')

validation_input_config = sagemaker.session.s3_input(
    s3_data=s3_validation_file_location,
    content_type='csv',
    s3_data_type='S3Prefix'
)

data_channels = {'train': training_input_config, 'validation': validation_input_config}

In [14]:
print(training_input_config.config)
print(validation_input_config.config)
# double-check the locations

{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://s3-2-ml-sagemaker/bikerental/training/', 'S3DataDistributionType': 'FullyReplicated'}}, 'ContentType': 'csv'}
{'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://s3-2-ml-sagemaker/bikerental/validation/', 'S3DataDistributionType': 'FullyReplicated'}}, 'ContentType': 'csv'}


## Train the Model

This launches the ML instance requested above and runs the XGBoost training job on it.

In [15]:
# XGBoost supports "train", "validation" channels
estimator.fit(data_channels)

2020-03-07 01:50:48 Starting - Starting the training job...
2020-03-07 01:50:49 Starting - Launching requested ML instances...
2020-03-07 01:51:47 Starting - Preparing the instances for training.........
2020-03-07 01:52:49 Downloading - Downloading input data...
2020-03-07 01:53:46 Training - Training image download completed. Training in progress.
2020-03-07 01:53:46 Uploading - Uploading generated training model.
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value reg:squarederror to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input

## Deploy the Model

In [16]:
# Ref: http://sagemaker.readthedocs.io/en/latest/estimators.html
# predictor = estimator.deploy(initial_instance_count=1,
#                             instance_type='ml.m4.xlarge',
#                              endpoint_name = 'xgboost-bikerental-v2')

predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='ml.m4.xlarge')

--------------!

## Run Predictions

In [17]:
from sagemaker.predictor import csv_serializer, json_deserializer

predictor.content_type = 'text/csv'
predictor.serializer = csv_serializer
predictor.deserializer = None

In [18]:
predictor.predict([[3,0,1,2,28.7,33.335,79,12.998,2011,7,7,3]])

b'3.7342963218688965'
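
With `predictor.deserializer = None`, the endpoint response comes back as raw bytes, as in the output above. Assuming the model was trained on the _log1p(count)_ target (the v2 files), the value can be decoded and mapped back to a rental count with `expm1`. This is a sketch under that assumption, and `to_count` is a hypothetical helper name.

```python
import math


def to_count(response_bytes):
    """Decode a single-value response and undo the log1p transform."""
    log_count = float(response_bytes.decode("utf-8"))
    return math.expm1(log_count)


# Response bytes copied from the prediction above.
print(round(to_count(b"3.7342963218688965"), 2))
```

For the example row this comes out a little under 41 rentals.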