## SageMaker XGBoost model applied to NYC Uber data

Our SageMaker XGBoost regression model predicts trip duration from a feature vector that includes the source zone, the destination zone, and the month, day, and hour of the pickup time.

The first step in using SageMaker is to obtain the SageMaker execution role, which carries the permissions that SageMaker uses on our behalf.

In [None]:
from sagemaker import get_execution_role
import boto3

# Retrieve the SageMaker execution role
role = get_execution_role()

# Get the URI of the container image for the built-in XGBoost algorithm
from sagemaker.amazon.amazon_estimator import get_image_uri
xgboost_image = get_image_uri(boto3.Session().region_name, 'xgboost')
print(xgboost_image)



### Stage CSV data for SageMaker training

We have multiple CSV files available in an S3 bucket. Each CSV file contains numeric columns for the encoded origin zone, encoded destination zone, month, day, hour, trip distance in miles, and trip duration in seconds.
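To make the row layout concrete, here is a minimal sketch of how one such numeric row could be assembled from a trip record. The helper `make_feature_row`, the sample values, and the column order are assumptions for illustration, not taken from the actual files. The label is placed first because SageMaker's built-in XGBoost expects the target in the first column of CSV training data.

```python
from datetime import datetime

def make_feature_row(origin_zone, dest_zone, pickup, distance_miles, duration_seconds=None):
    """Build one numeric CSV row (hypothetical helper; zones are assumed to be integer-encoded)."""
    features = [origin_zone, dest_zone, pickup.month, pickup.day, pickup.hour, distance_miles]
    # Assumption: for training rows the label (trip duration) goes in the first column.
    values = features if duration_seconds is None else [duration_seconds] + features
    return ",".join(str(v) for v in values)

# A training row (with label) and an inference row (features only)
train_row = make_feature_row(12, 45, datetime(2015, 6, 3, 17, 30), 2.4, duration_seconds=780)
infer_row = make_feature_row(12, 45, datetime(2015, 6, 3, 17, 30), 2.4)
print(train_row)
print(infer_row)
```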

We will download the CSV files from the source S3 bucket, split them into training, validation, and test data sets, and upload them to the destination bucket to stage them as SageMaker training input.

In [None]:
import tempfile


# source bucket with CSV files
source_bucket='<source-s3-bucket>'
source_prefix='glue/output/uber_nyc'

# destination bucket to upload SageMaker training input data files
dest_bucket='<destination-s3-bucket>'
dest_prefix = 'sagemaker/input/uber_nyc'

s3 = boto3.client('s3')

# list_objects_v2 returns at most 1,000 keys per call; paginate if the prefix holds more files
response=s3.list_objects_v2(Bucket=source_bucket, Prefix=source_prefix)
contents=response['Contents']
count=len(contents)

sbucket = boto3.resource('s3').Bucket(source_bucket)
ntrain=int(count*0.80)
nval = int(count*0.18)
ntest = count - ntrain - nval

def stage_data(start, end, name):
    """Copy contents[start:end] from the source bucket to the named channel prefix."""
    for i in range(start, end):
        key = contents[i]['Key']
        with tempfile.NamedTemporaryFile(mode='w+b', suffix='.csv', prefix='data-', delete=True) as csv_file:
            print(f'Download {key} file')
            sbucket.download_fileobj(key, csv_file)

            # Rewind so the upload reads the downloaded bytes from the start
            csv_file.seek(0)
            dest_key = f'{dest_prefix}/{name}/part-{i}.csv'
            print(f'upload {dest_key} file')
            s3.upload_fileobj(csv_file, dest_bucket, dest_key)
            
stage_data(0, ntrain, 'train')
stage_data(ntrain, ntrain+nval, 'validation')
stage_data(ntrain+nval, count, 'test')
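
The split arithmetic used above can be checked offline before touching S3. The sketch below reproduces the same 80% / 18% / remainder rule on a file count; `split_ranges` is a hypothetical helper, not part of the notebook.

```python
def split_ranges(count, train_frac=0.80, val_frac=0.18):
    """Return the index ranges for each channel, using the same rule as the staging cell."""
    ntrain = int(count * train_frac)
    nval = int(count * val_frac)
    return {
        'train': range(0, ntrain),
        'validation': range(ntrain, ntrain + nval),
        'test': range(ntrain + nval, count),  # whatever is left over
    }

for n in (4, 10, 25):
    sizes = {name: len(r) for name, r in split_ranges(n).items()}
    print(n, sizes)
```

Because `int()` truncates, a small file count can leave the validation set empty (for example, 4 files give 3 / 0 / 1), so it is worth confirming the counts before launching a tuning job that relies on a validation metric.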


### Create data input channels

We will create train, validation, and test input channels.

In [None]:
from sagemaker import s3_input

s3_train = s3_input(s3_data=f's3://{dest_bucket}/{dest_prefix}/train', content_type='csv')
s3_validation = s3_input(s3_data=f's3://{dest_bucket}/{dest_prefix}/validation', content_type='csv')
s3_test = s3_input(s3_data=f's3://{dest_bucket}/{dest_prefix}/test', content_type='csv')

output_path=f's3://{dest_bucket}/sagemaker/output/uber_nyc/xgboost'
data_channels = {'train': s3_train, 'validation': s3_validation, 'test': s3_test}

### Create SageMaker XGBoost Estimator

The SageMaker Estimator class defines the SageMaker job for training the XGBoost model.

In [None]:
from sagemaker.estimator import Estimator
from sagemaker import Session

sagemaker_session = Session()

xgb = Estimator(image_name=xgboost_image,
                role=role,
                train_instance_count=1,
                train_instance_type='ml.c5.9xlarge',
                output_path=output_path,
                sagemaker_session=sagemaker_session)

xgb.set_hyperparameters(objective='reg:linear',
                       grow_policy='depthwise')


### Create SageMaker XGBoost Hyper-parameter Tuner ###
Before we train the model, we will tune the hyper-parameters. Below we tune two hyper-parameters, 'num_round' and 'max_depth'.

In [None]:
from sagemaker.tuner import HyperparameterTuner, IntegerParameter

objective_metric_name = "validation:rmse"

num_round = IntegerParameter(10,100)
max_depth = IntegerParameter(8,32)
hyperparameter_ranges={}
hyperparameter_ranges['num_round'] = num_round
hyperparameter_ranges['max_depth'] = max_depth
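
For readers unfamiliar with the `validation:rmse` objective, the metric is the square root of the mean squared difference between predicted and actual values on the validation set. The sketch below computes it in plain Python. The `rmse` helper and the sample durations are illustrative assumptions, not output from the model.

```python
import math

def rmse(predicted, expected):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(predicted) != len(expected) or len(predicted) == 0:
        raise ValueError('inputs must be non-empty and of equal length')
    squared_errors = [(p - e) ** 2 for p, e in zip(predicted, expected)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up trip durations in seconds
predicted = [600.0, 900.0, 1200.0]
expected = [630.0, 870.0, 1260.0]
print(rmse(predicted, expected))
```

Since the error is measured in the units of the label, an RMSE here is expressed in seconds of trip duration.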

We define a hyper-parameter tuner that uses Bayesian search to minimize the validation Root Mean Square Error (RMSE) objective. We limit the total number of hyper-parameter tuning jobs to 30 and the number of concurrent jobs to 5.

In [None]:
hyperparameter_tuner=HyperparameterTuner(xgb, 
                                         objective_metric_name, 
                                         hyperparameter_ranges, 
                                         strategy='Bayesian', 
                                         objective_type='Minimize', 
                                         max_jobs=30, 
                                         max_parallel_jobs=5, 
                                         base_tuning_job_name='xgboost-tuning')
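
A quick back-of-the-envelope calculation shows why a guided search is useful here. The sketch below assumes both integer ranges are inclusive at each end, and treats the number of "waves" as a rough estimate only, since tuning jobs do not necessarily start and finish in lockstep.

```python
import math

# Ranges taken from the tuning cell above: num_round in [10, 100], max_depth in [8, 32]
num_round_values = 100 - 10 + 1
max_depth_values = 32 - 8 + 1
grid_size = num_round_values * max_depth_values

max_jobs, max_parallel_jobs = 30, 5
waves = math.ceil(max_jobs / max_parallel_jobs)  # rough number of sequential batches
coverage = max_jobs / grid_size                  # share of the full grid that 30 jobs can try

print(f'grid size: {grid_size}, waves: {waves}, coverage: {coverage:.1%}')
```

With only a small fraction of the combinations explored, the tuner relies on the results of earlier jobs to choose where to search next.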

### Start Hyper-parameter Tuning ###
Below we launch the hyper-parameter tuner. You can monitor the hyper-parameter tuning jobs in the SageMaker console.

In [None]:
hyperparameter_tuner.fit(inputs=data_channels, logs=True)

We use the best hyper-parameters reported in the SageMaker console to set the hyper-parameters for the training estimator.

In [None]:
xgb.set_hyperparameters(objective='reg:linear',
                        grow_policy='depthwise',
                        num_round=89, 
                        max_depth=10)

### Start Training Job ###
Below we launch the training job.

In [None]:
xgb.fit(inputs=data_channels, logs=True)

### Deploy Model to Endpoint ###
Once training is complete, we deploy the trained model to a SageMaker endpoint. This step can take a few minutes, so be patient.

In [None]:
xgb_predictor = xgb.deploy(initial_instance_count=1,
                           instance_type='ml.t2.xlarge')

Below we import the classes used to make predictions with test data. These classes serialize the data sent to the SageMaker endpoint and de-serialize its response.

In [None]:
from sagemaker.predictor import csv_serializer, json_deserializer

xgb_predictor.content_type = 'text/csv'
xgb_predictor.serializer = csv_serializer
xgb_predictor.deserializer = json_deserializer
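
To illustrate what travels over the wire, the sketch below mimics the round trip with two hypothetical helpers. It assumes the endpoint answers a single-row request with a bare number as text. It shows the payload shape only and is not the SDK's own implementation.

```python
import json

def to_csv_payload(features):
    """Join one feature vector into a single CSV line (hypothetical helper)."""
    return ','.join(str(v) for v in features)

def parse_prediction(body):
    """Parse a response body that holds one bare number (assumed response shape)."""
    # A bare number such as '812.5' is valid JSON, which is why a JSON deserializer can read it.
    return json.loads(body)

payload = to_csv_payload([12, 45, 6, 3, 17, 2.4])
prediction = parse_prediction('812.5')
print(payload)
print(prediction)
```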

### Make Predictions for Test Data ###
Below we download the test data and submit it to the deployed SageMaker endpoint to make predictions, then compare the predictions to the expected output.

In [None]:
import numpy as np
import tempfile

s3 = boto3.client('s3')
bucket = boto3.resource('s3').Bucket(dest_bucket)
with tempfile.NamedTemporaryFile(mode='w+b', suffix='.csv', prefix='data-', delete=True) as test_csv:
    # part-20.csv must be one of the files staged under the test prefix; adjust the index to your data
    key=f'{dest_prefix}/test/part-20.csv'
    print(f'download: {key}')
    bucket.download_fileobj(key, test_csv)
    # Flush buffered writes so the file can be read back by name
    test_csv.flush()
    print("read test csv file")
    array = np.genfromtxt(test_csv.name, delimiter=',', skip_header=0)
    np.random.shuffle(array)
    # Column 0 holds the expected value; the remaining columns are the features
    for i in range(min(100, len(array))):
        print(f'test input: {array[i, 1:]}')
        result = xgb_predictor.predict(array[i, 1:])
        print(f'predicted: {result}')
        print(f'expected: {array[i,0]}')
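
Reading 100 printed pairs by eye is tedious, so it can help to reduce them to a couple of summary numbers. The sketch below is one possible approach. The `summarize` helper, the 60-second tolerance, and the sample pairs are assumptions chosen for illustration, not results from the endpoint.

```python
def summarize(pairs, tolerance_seconds=60.0):
    """Return (mean absolute error, share of predictions within the tolerance)."""
    errors = [abs(predicted - expected) for predicted, expected in pairs]
    mae = sum(errors) / len(errors)
    within = sum(1 for err in errors if err <= tolerance_seconds) / len(errors)
    return mae, within

# Made-up (predicted, expected) trip durations in seconds
pairs = [(600.0, 630.0), (900.0, 870.0), (1200.0, 1290.0), (300.0, 310.0)]
mae, within = summarize(pairs)
print(f'MAE: {mae:.1f} s, within 60 s: {within:.0%}')
```

In the loop above, the pairs could be collected by appending `(result, array[i, 0])` to a list on each iteration.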