# SageMaker PySpark XGBoost Regression Example

1. [Introduction](#Introduction)
2. [Setup](#Setup)
3. [Data Cleansing](#Data-Cleansing)
4. [Feature Trend Analysis](#Feature-Trend-Analysis)
5. [Feature Engineering](#Feature-Engineering)
6. [Split data into training and test dataset](#Split-data-into-training-and-test-dataset)
7. [Training and Hosting XGBoost Model](#Training-and-Hosting-XGBoost-Model)
8. [Run Predictions](#Run-Predictions)
9. [Clean up](#Clean-up)

## Introduction
This notebook shows how to predict taxi tips using the XGBoost algorithm on Amazon SageMaker through the SageMaker PySpark library. We will train an XGBoost model on Amazon SageMaker using a curated NYC Taxi dataset, host the trained model on Amazon SageMaker, and then make predictions against the hosted model.

Unlike the other notebooks that demonstrate XGBoost on Amazon SageMaker, this notebook uses a SparkSession to manipulate data, and uses the SageMaker Spark library to interact with SageMaker with Spark Estimators and Transformers.

You can visit SageMaker Spark's GitHub repository at https://github.com/aws/sagemaker-spark to learn more about SageMaker Spark.

You can visit XGBoost's GitHub repository at https://github.com/dmlc/xgboost to learn more about XGBoost.

This notebook was created and tested on an ml.m4.xlarge notebook instance.

## Setup

1. Import Spark and Glue packages
2. Initialize GlueContext and SparkSession

In [None]:
import os

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from datetime import datetime
import sagemaker_pyspark
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.ml.feature import StringIndexer, VectorIndexer, OneHotEncoder, VectorAssembler, IndexToString
from pyspark.sql.functions import *
from pyspark.ml import Pipeline

start_time = datetime.now()

sc=sc if 'sc' in vars() else SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

#### Get Current IAM Execution Role 

In [None]:
import boto3

roleName = 'AWSGlueServiceSageMakerNotebookRole-nyctaxi'

iam = boto3.client('iam')
role = iam.get_role(RoleName=roleName)
execution_role = role["Role"]['Arn']
print('IAM role arn: {}'.format(execution_role))

## Loading the Data

Read the optimized yellow taxi dataset from the Glue Data Catalog, using a pushdown predicate to load only the January 2017 partition.

In [None]:
nyctaxidyf = glueContext.create_dynamic_frame.from_catalog(
    database='nyctaxi',
    table_name='yellow_opt',
    push_down_predicate='pu_year=2017 and pu_month=1')

nyctaxidyf.printSchema()

Convert the Glue DynamicFrame to a Spark DataFrame, limited to 100,000 rows

In [None]:
nyctaxidf = nyctaxidyf.toDF().limit(100000)

## Data Cleansing

1. Pickup and dropoff timestamps are not directly useful for training, but the day of week and hour of day derived from them are useful features.
2. Remove samples with a zero or negative total amount or fare amount.
3. Remove samples with a negative tip amount.
4. Remove all non-electronic transactions, since most drivers do not report tips on cash transactions (payment_type = 2).
5. Remove all payments of type 'No Charge', 'Dispute', or 'Unknown' (payment_type = 3, 4, or 5).
6. Remove samples where the tip is 100% or more of the fare amount, since these are outliers and have a significant impact on algorithms that try to optimize MSE.
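
The rules above can be sketched in plain Python on a handful of records, which can help when checking the logic without a Spark cluster. This is only an illustration: the dictionary layout and the `clean_trips` helper are hypothetical, and only the pickup-side features are derived. The column names follow the ones used in this notebook.

```python
from datetime import datetime


def clean_trips(trips):
    """Apply the cleansing rules to a list of trip dicts (hypothetical layout)."""
    cleaned = []
    for trip in trips:
        # Drop rows with any missing value, similar to dropna().
        if any(value is None for value in trip.values()):
            continue
        # Keep only trips with a positive total and a positive fare.
        if trip["total_amount"] <= 0 or trip["fare_amount"] <= 0:
            continue
        # Keep only credit card payments (payment_type == 1).
        if trip["payment_type"] != 1:
            continue
        # Drop negative tips and tips that reach or exceed the fare.
        if trip["tip_amount"] < 0 or trip["tip_amount"] >= trip["fare_amount"]:
            continue
        row = dict(trip)
        # Derive the temporal features from the pickup timestamp.
        row["pickup_dow_str"] = trip["pu_datetime"].strftime("%a")
        row["pickup_hr"] = trip["pu_datetime"].hour
        cleaned.append(row)
    return cleaned


sample_trips = [
    # Valid card trip on Monday, 2 January 2017 at 08:15.
    {"pu_datetime": datetime(2017, 1, 2, 8, 15), "payment_type": 1,
     "fare_amount": 10.0, "tip_amount": 2.0, "total_amount": 12.8},
    # Cash trip: removed.
    {"pu_datetime": datetime(2017, 1, 2, 9, 0), "payment_type": 2,
     "fare_amount": 8.0, "tip_amount": 0.0, "total_amount": 8.8},
    # Tip larger than the fare: removed as an outlier.
    {"pu_datetime": datetime(2017, 1, 3, 18, 30), "payment_type": 1,
     "fare_amount": 5.0, "tip_amount": 6.0, "total_amount": 11.8},
    # Missing fare: removed.
    {"pu_datetime": datetime(2017, 1, 4, 7, 45), "payment_type": 1,
     "fare_amount": None, "tip_amount": 1.0, "total_amount": 9.0},
]

cleaned_trips = clean_trips(sample_trips)
print(len(cleaned_trips), cleaned_trips[0]["pickup_dow_str"], cleaned_trips[0]["pickup_hr"])
```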

In [None]:
from dateutil import parser
from pyspark.sql.types import IntegerType,StringType,ArrayType


nyctaxidf1 = nyctaxidf.withColumn('pickup_dow_str',date_format(col('pu_datetime'),'E'))\
    .withColumn('pickup_hr',hour(col('pu_datetime')))\
    .withColumn('dropoff_dow_str',date_format(col('do_datetime'),'E'))\
    .withColumn('dropoff_hr',hour(col('do_datetime')))\
    .withColumn('taxes',col('extra')+col('mta_tax')+col('tolls_amount')+col('improvement_surcharge'))\
    .filter( (nyctaxidf.total_amount > 0) & (nyctaxidf.fare_amount > 0))\
    .filter(nyctaxidf.payment_type == 1)\
    .filter(nyctaxidf.fare_amount > nyctaxidf.tip_amount)\
    .filter(nyctaxidf.tip_amount >= 0)\
    .dropna() 

print("Cleansed Dataset sample count:{}".format(nyctaxidf1.count()))

nyctaxidf2 = nyctaxidf1.select('tip_amount','pickup_dow_str','pickup_hr','pu_locationid',\
                               'dropoff_dow_str','dropoff_hr','do_locationid',\
                               'trip_distance','total_amount')


## Feature Trend Analysis

In this section we look at trends in the features. This phase lets us remove outlier samples that may skew training and predictions, which is an important part of any data science workflow. We only look at the distribution of the tip ratio here, but other features are also influential (for example, geographical features such as taxi zones, temporal features such as pickup and dropoff datetimes, and trip features such as trip distance and number of passengers per ride). Exploring those features is left as an exercise.

Register a Spark Temp View using Spark Dataframe

https://spark.apache.org/docs/latest/sql-programming-guide.html#running-sql-queries-programmatically

In [None]:
spark.catalog.dropTempView("nytaxi_view")
nyctaxidf2.createTempView("nytaxi_view")

In [None]:
%%sql -q -o tip_ratio_pd

select round(((tip_amount/total_amount * 100)),0) as tip_to_total_percent , count(*) as counts
from nytaxi_view 
  group by round(((tip_amount/total_amount * 100)),0)
        

In [None]:
%matplotlib inline
# Matplotlib + numpy initialization and imports
import matplotlib.pyplot as plt
import numpy as np

#display(tip_ratio_pd)
plt.bar(tip_ratio_pd['tip_to_total_percent'],tip_ratio_pd['counts'],width=0.8)
plt.title('Tip Ratio vs Number of Rides');
plt.xlabel('Tip Ratio')
plt.ylabel('Number of Rides')

#### Observation: Most riders tip between 5% and 30%

Based on the observation above, we keep only samples whose tip ratio (tip/total) is between 5% and 30%, because outliers can skew model training.
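
As a rough illustration of the tip ratio computation and the 5% to 30% filter, here is a plain-Python sketch on made-up (tip, total) pairs. The `tip_percent` helper and the sample values are hypothetical. Spark SQL's `round` rounds half up, whereas Python's built-in `round` rounds half to even, so the sketch uses `floor(x + 0.5)` to stay close to the SQL behavior for non-negative values.

```python
import math
from collections import Counter


def tip_percent(tip_amount, total_amount):
    """Tip as a whole-number percentage of the total, rounded half up."""
    return math.floor(tip_amount / total_amount * 100 + 0.5)


# Hypothetical (tip_amount, total_amount) pairs.
rides = [(2.0, 12.0), (0.0, 8.0), (3.0, 15.0), (9.0, 20.0), (1.5, 10.0)]

# Count rides per rounded tip percentage, like the GROUP BY query.
histogram = Counter(tip_percent(tip, total) for tip, total in rides)

# Keep only rides whose tip percentage lies between 5 and 30 inclusive.
kept = [(tip, total) for tip, total in rides if 5 <= tip_percent(tip, total) <= 30]

print(sorted(histogram.items()))
print(kept)
```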

In [None]:
nyctaxidf_filtered = spark.sql("select *\
                      from nytaxi_view\
                      where round(((tip_amount/total_amount * 100)),0) between 5 and 30 ")

## Feature Engineering

Perform feature engineering by converting categorical features to binary vectors using one-hot encoding, then assemble the features into (label, features) vectors.

Refer to https://spark.apache.org/docs/2.1.0/ml-features.html for the complete set of feature extraction utilities.
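
To make the indexing and encoding steps concrete, the following plain-Python sketch mimics the idea behind a string indexer followed by a one-hot encoder. It is not Spark's implementation: the helper names are hypothetical, labels are ordered by descending frequency with alphabetical tie-breaking, and the last category is dropped so that it encodes as all zeros. The extra index that `setHandleInvalid("keep")` reserves for unseen labels is left out for brevity.

```python
from collections import Counter


def fit_string_index(values):
    """Map each label to an index, most frequent label first (ties alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a list of 0.0/1.0 values.

    With drop_last=True the vector has num_categories - 1 slots and the
    final category is represented by all zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    return [1.0 if position == index else 0.0 for position in range(size)]


days = ["Mon", "Tue", "Mon", "Wed", "Mon", "Tue"]
day_index = fit_string_index(days)
day_vectors = [one_hot(day_index[day], len(day_index)) for day in days]

print(day_index)
print(day_vectors[:4])
```

Dropping the last category keeps the encoded columns from summing to one in every row, which avoids a redundant column in the feature vector.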

In [None]:
pickupdowIndexer = StringIndexer(inputCol='pickup_dow_str',outputCol='pickup_dayofweek').setHandleInvalid("keep")
dropoffdowIndexer = StringIndexer(inputCol='dropoff_dow_str',outputCol='dropoff_dayofweek').setHandleInvalid("keep")

pickupdayEncoder = OneHotEncoder(inputCol='pickup_dayofweek',outputCol='pickupdowVec')
dropoffdayEncoder = OneHotEncoder(inputCol='dropoff_dayofweek',outputCol='dropoffdowVec')

pickuphourEncoder = OneHotEncoder(inputCol='pickup_hr',outputCol='pickup_hrVec')
dropoffhourEncoder = OneHotEncoder(inputCol='dropoff_hr',outputCol='dropoff_hrVec')

pu_locationidEncoder = OneHotEncoder(inputCol='pu_locationid',outputCol='pu_locationidVec')
do_locationidEncoder = OneHotEncoder(inputCol='do_locationid',outputCol='do_locationidVec')

assembler = VectorAssembler(inputCols=['trip_distance','total_amount','pickupdowVec',\
                                       'dropoffdowVec','pickup_hrVec','dropoff_hrVec',\
                                       'pu_locationidVec','do_locationidVec'],outputCol='features')

pipeline = Pipeline(stages=[pickupdowIndexer,dropoffdowIndexer,pickupdayEncoder,\
                            dropoffdayEncoder,pickuphourEncoder,dropoffhourEncoder,\
                            pu_locationidEncoder,do_locationidEncoder,assembler])

model = pipeline.fit(nyctaxidf_filtered)

transformed_nyctaxidf = model.transform(nyctaxidf_filtered)

transformed_rdd = transformed_nyctaxidf.rdd.map(lambda x:(x.tip_amount,x.features))

transformed_2_nyctaxidf = transformed_rdd.toDF()

## Split data into training and test dataset

Randomly split the transformed data, using 90% of the samples for training and the remaining 10% for testing.
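
A random 90/10 split can be sketched in plain Python as follows. This is an illustration only, and the `random_split` helper and its default seed are hypothetical. Each row is assigned independently, so the split sizes are approximate rather than exact, which is also what to expect from `randomSplit`. Fixing a seed makes the split repeatable; `randomSplit` accepts an optional `seed` argument for the same purpose.

```python
import random


def random_split(rows, train_fraction=0.9, seed=42):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(1000))
train_rows, test_rows = random_split(rows)

print("Number of training samples: {}".format(len(train_rows)))
print("Number of test samples: {}".format(len(test_rows)))
```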

In [None]:
(trainDF,testDF) = transformed_2_nyctaxidf.toDF('label','features').randomSplit([.90,.10])   

print("Number of training samples: {}".format(trainDF.count()))
print("Number of test samples: {}".format(testDF.count()))

## Training and Hosting XGBoost Model
Now we create an XGBoostSageMakerEstimator, which uses the XGBoost Amazon SageMaker Algorithm to train on our input data, and uses the XGBoost Amazon SageMaker model image to host our model.

The following cell initializes the XGBoostSageMakerEstimator and sets its hyperparameters. Refer to https://docs.aws.amazon.com/sagemaker/latest/dg/xgboost_hyperparameters.html for more information on hyperparameters.

In [None]:
import random
from sagemaker_pyspark import IAMRole, S3DataPath
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator
from sagemaker_pyspark.transformation import serializers

xgboost_estimator = XGBoostSageMakerEstimator(
    sagemakerRole=IAMRole(execution_role),
    trainingInstanceType='ml.m5.large',
    trainingInstanceCount=1,
    endpointInstanceType='ml.m4.xlarge',
    endpointInitialInstanceCount=1,
    trainingInstanceVolumeSizeInGB=20
 )


xgboost_estimator.setEta(0.2)
xgboost_estimator.setGamma(4)
xgboost_estimator.setMinChildWeight(6)
xgboost_estimator.setSilent(0)
xgboost_estimator.setObjective("reg:linear")
xgboost_estimator.setNumRound(50)
xgboost_estimator.setEvalMetric("rmse")

Calling fit() on this estimator will train our model on Amazon SageMaker, and then create an Amazon SageMaker Endpoint to host our model.

We can then use the SageMakerModel returned by this call to fit() to transform Dataframes using our hosted model.

The following cell runs a training job and creates an endpoint to host the resulting model, so this cell can take up to **twenty minutes to complete**.

**After running the cell below, while waiting for SageMaker training to finish, open the AWS console in a separate browser tab and navigate to SageMaker -> Training Jobs to observe job statistics.**

In [None]:
# train
model = xgboost_estimator.fit(trainDF)

## Run Predictions

Now we use the test DataFrame to call the SageMaker endpoint through the transform() method.

In [None]:
transformedData = model.transform(testDF)

transformedData.select('label','prediction').show(5)

Define Spark Temp View of Prediction results to plot graph

In [None]:
spark.catalog.dropTempView('predict_view')
transformedData.createTempView("predict_view")

Dump SQL output to pandas Dataframe

In [None]:
%%sql -q -o predict_pd

select monotonically_increasing_id() as id ,label, prediction from predict_view

Plot the graph of Actuals vs Prediction of Tips to understand how well our model is doing

**Disclaimer: Prediction accuracy may be limited because we trained for only 50 rounds; it can be improved further with hyperparameter tuning.**
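
Alongside the plot, a single number can help when comparing runs. The sketch below computes the root mean squared error (RMSE), the same metric the estimator was configured to report during training, on a few made-up label and prediction values. The `rmse` helper and the sample numbers are hypothetical and are not results from this model.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equally sized sequences."""
    if len(labels) != len(predictions) or len(labels) == 0:
        raise ValueError("labels and predictions must be non-empty and the same length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up tip amounts, for illustration only.
actual_tips = [2.0, 3.5, 1.0, 4.0]
predicted_tips = [2.5, 3.0, 1.5, 3.5]

score = rmse(actual_tips, predicted_tips)
print("RMSE: {:.2f}".format(score))
```

Assuming the `predict_pd` pandas DataFrame from the previous cell is available, the same helper could be called as `rmse(list(predict_pd['label']), list(predict_pd['prediction']))`.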

In [None]:
%matplotlib inline
# Matplotlib + numpy initialization and imports
import matplotlib.pyplot as plt
import numpy as np

plt.plot(predict_pd['id'],predict_pd['label'],'r--',predict_pd['id'],predict_pd['prediction'],'b--')
plt.title('Compare Predictions vs Actuals');

## Clean-up

Since we don't need to make any more inferences, now we delete the resources (endpoints, models, configurations, etc):

In [None]:
# Delete the resources
from sagemaker_pyspark import SageMakerResourceCleanup

def cleanUp(model):
    resource_cleanup = SageMakerResourceCleanup(model.sagemakerClient)
    resource_cleanup.deleteResources(model.getCreatedResources())

# Don't forget to include any models or pipeline models that you created in the notebook
models = [model]

# Delete regular SageMakerModels
for m in models:
    cleanUp(m)

In [None]:
end_time = datetime.now()

print("Total time to execute notebook:{} mins".format((end_time-start_time).total_seconds()/60))