## SageMaker XGBoost model applied to NYC Uber data

Our SageMaker XGBoost regression model predicts trip duration from a feature vector that includes the source zone, the destination zone, and the month, day, and hour of the pickup time.
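
As a minimal sketch of how such a feature vector could be assembled, the snippet below derives the month, day, and hour from a pickup timestamp and places them after the two zone codes. The function name `pickup_features`, the timestamp format, and the integer zone codes are assumptions made for illustration; the actual encoding used to produce the CSV files is not shown in this notebook.

```python
from datetime import datetime

def pickup_features(origin_zone, dest_zone, pickup_time, fmt="%Y-%m-%d %H:%M:%S"):
    """Return [origin, destination, month, day, hour] for one trip.

    origin_zone and dest_zone are assumed to be already-encoded integers;
    pickup_time is a string in the (assumed) format given by fmt.
    """
    ts = datetime.strptime(pickup_time, fmt)
    return [origin_zone, dest_zone, ts.month, ts.day, ts.hour]

# hypothetical trip: zone 12 to zone 87, picked up on 3 June at 17:45
print(pickup_features(12, 87, "2015-06-03 17:45:00"))
```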

The first step in using SageMaker is to obtain a SageMaker execution role, an IAM role that carries the permissions SageMaker needs to act on our behalf.

In [None]:
from sagemaker import get_execution_role
import boto3

# Get the SageMaker execution role attached to this notebook
role = get_execution_role()

# get the URI of the container image for the built-in XGBoost algorithm
from sagemaker.amazon.amazon_estimator import get_image_uri
xgboost_image = get_image_uri(boto3.Session().region_name, 'xgboost')
print(xgboost_image)

# destination bucket for the transformed CSV files
dest_bucket='aws-ajayvohra-nyc-tlc-sagemaker'

### Prepare the CSV data for XGBoost

The source S3 bucket holds multiple CSV files. Each CSV file contains numeric columns for the encoded origin zone, encoded destination zone, month, day, hour, trip distance in miles, and trip duration in seconds.

We will download each CSV file from the source S3 bucket, drop its header row, move the trip duration (the label) to the first column followed by the five feature columns, and upload the combined result to the destination S3 bucket as CSV. The trip distance column is not used as a feature. The built-in XGBoost algorithm expects CSV training input to have no header row and the label in the first column.
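
The per-row transformation used in the cell below is `[row[-1]] + row[0:5]`: the last column (trip duration) becomes the label in the first position, followed by the first five columns. The following is a small, self-contained sketch of that step on an in-memory CSV string. The helper names `to_xgboost_row` and `convert_csv` and the header names in the sample are hypothetical, chosen only to mirror the column order described above.

```python
import csv
import io

def to_xgboost_row(row):
    """Move the label (last column) to the front and keep the first five features."""
    return [row[-1]] + row[0:5]

def convert_csv(text):
    """Drop the header row and rewrite every data row label-first."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header row
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow(to_xgboost_row(row))
    return out.getvalue()

# hypothetical sample with the column order described in the text
sample = (
    "origin,destination,month,day,hour,distance,duration\n"
    "12,87,6,3,17,2.4,780\n"
)
print(convert_csv(sample))
```

Note that the trip distance column (index 5) is dropped by the slice `row[0:5]`.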

During this process, we will also split the CSV files into training, validation, and test data sets. The split is made at the file level: whole files are assigned to each set.
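
A file-level split can be expressed as three index ranges over the list of source files. The sketch below is one way to compute them; the function name `split_ranges` and the default fractions are assumptions for illustration. The training and validation fractions need to sum to less than 1, otherwise the test range ends up empty.

```python
def split_ranges(count, train_frac=0.8, val_frac=0.1):
    """Return (start, end) index ranges for each data set; test gets the remainder."""
    ntrain = int(count * train_frac)
    nval = int(count * val_frac)
    return {
        "train": (0, ntrain),
        "validation": (ntrain, ntrain + nval),
        "test": (ntrain + nval, count),
    }

# with 10 source files: 8 for training, 1 for validation, 1 for test
print(split_ranges(10))
```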

In [None]:
import tempfile
import csv

# source bucket with CSV files
source_bucket='aws-ajayvohra-nyc-tlc-glue'
source_prefix='uber'

s3 = boto3.client('s3')

response=s3.list_objects_v2(Bucket=source_bucket, Prefix=source_prefix)
contents=response['Contents']
count=len(contents)

sbucket = boto3.resource('s3').Bucket(source_bucket)
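# split at file granularity: 80% train, 10% validation, remainder test
# (0.8 + 0.2 would leave no files for the test set)
ntrain = int(count*0.8)
nval = int(count*0.1)
ntest = count - ntrain - nval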

def transform_upload(start, end, name):
    # collect the transformed rows for this data set in a single temporary CSV file
    with tempfile.NamedTemporaryFile(mode='w+t', suffix='.csv', prefix='data-', delete=True) as data:
        csv_writer = csv.writer(data, delimiter=',')

        for i in range(start, end):
            key = contents[i]['Key']
            print("transforming: " + key)
            with tempfile.NamedTemporaryFile(mode='w+b', suffix='.csv', prefix='data-', delete=True) as csv_file:
                sbucket.download_fileobj(key, csv_file)
                # flush so the downloaded bytes are on disk before the file is reopened for reading
                csv_file.flush()

                print("reading: " + csv_file.name)
                with open(csv_file.name, 'rt') as csv_file2:
                    csv_reader = csv.reader(csv_file2, delimiter=',')
                    next(csv_reader, None)  # skip the header row
                    for row in csv_reader:
                        # label (trip duration, last column) first, then the five feature columns
                        csv_writer.writerow([row[-1]] + row[0:5])

        # flush buffered rows, then reopen in binary mode for the upload
        data.flush()
        with open(data.name, 'rb') as data_reader:
            print(f'upload {name} data file')
            s3.upload_fileobj(data_reader, dest_bucket, f'csv/{name}/data.csv')

transform_upload(0, ntrain, 'train')
transform_upload(ntrain, ntrain+nval, 'validation')
transform_upload(ntrain+nval, count, 'test')


### Create data input channels

We will create train, validation, and test input channels.
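
Each channel points at the S3 prefix that holds that data set's file, that is, `csv/<name>` for a file uploaded as `csv/<name>/data.csv`. The sketch below only builds those URIs in plain Python so the layout is easy to check; the helper name `channel_uris` and the bucket name are hypothetical, and no SageMaker call is made.

```python
def channel_uris(bucket, names=("train", "validation", "test"), prefix="csv"):
    """Map each channel name to the S3 prefix that contains its data file."""
    return {name: f"s3://{bucket}/{prefix}/{name}" for name in names}

# hypothetical bucket name, for illustration only
for name, uri in channel_uris("my-example-bucket").items():
    print(name, uri)
```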

In [None]:
from sagemaker import s3_input

s3_train = s3_input(s3_data=f's3://{dest_bucket}/csv/train', content_type='csv')
s3_validation = s3_input(s3_data=f's3://{dest_bucket}/csv/validation', content_type='csv')
s3_test = s3_input(s3_data=f's3://{dest_bucket}/csv/test', content_type='csv')

output_path=f's3://{dest_bucket}/output/xgboost/model'

### Create SageMaker XGBoost Estimator

The SageMaker Estimator class defines the SageMaker training job for the XGBoost model.

In [None]:
from sagemaker.estimator import Estimator
from sagemaker import Session

sagemaker_session = Session()

xgboost_estimator = Estimator(image_name=xgboost_image,
                              role=role,
                              train_instance_count=1,
                              train_instance_type='ml.m5.12xlarge',
                              output_path=output_path,
                              sagemaker_session=sagemaker_session)
# 20 boosting rounds with a squared-error regression objective
xgboost_estimator.set_hyperparameters(num_round=20, objective='reg:linear')


In [None]:
xgboost_estimator.fit(inputs={'train': s3_train, 'validation': s3_validation, 'test': s3_test})
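
The `reg:linear` objective minimizes squared error, so root-mean-square error (RMSE) on held-out data is a natural way to read the quality of the trained model. The sketch below shows how RMSE is computed for trip durations in seconds. The helper name `rmse` and the sample numbers are made up for illustration and are not results from this training job.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences of numbers."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# made-up trip durations in seconds, not output from the model
print(rmse([600, 900, 1200], [630, 870, 1200]))
```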