## Regression with Amazon SageMaker Linear Learner algorithm for Taxi ride fare prediction
_**Single machine training for regression with Amazon SageMaker Linear Learner algorithm**_

## Introduction

This notebook demonstrates how to use Amazon SageMaker's implementation of the Linear Learner algorithm to train and host a regression model that predicts taxi fares. It uses the [New York City Taxi and Limousine Commission (TLC) Trip Record Data](https://registry.opendata.aws/nyc-tlc-trip-records-pds/#) to train the model. Rather than the whole dataset, we train on a small subset of it, which you will load in the steps below.



---
## Setup


This notebook was tested in Amazon SageMaker Studio on an ml.t3.medium instance with the Python 3 (Data Science) kernel.

Let's start by specifying:
1. The S3 buckets and prefixes that you want to use for training data and model data. These should be in the same region as the notebook instance, training, and hosting.
1. The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note that if more than one role is required for notebook instances, training, and/or hosting, you should replace the `sagemaker.get_execution_role()` call with the appropriate full IAM role ARN string(s).

In [None]:
# cell 01
!pip install numpy==1.18.1

In [None]:
# cell 02
import os
import boto3
import re
import sagemaker
import numpy as np

In [None]:
# cell 03
role = sagemaker.get_execution_role()
sess = sagemaker.Session()
region = boto3.Session().region_name

# S3 bucket for training data.
# this uses (and creates if needed) the default bucket, named like 'sagemaker-<region>-<your account id>'
data_bucket = sess.default_bucket()
data_prefix = "1p-notebooks-datasets/taxi/text-csv"


# S3 bucket for saving code and model artifacts.
output_bucket = data_bucket
output_prefix = "sagemaker/DEMO-linear-learner-taxifare-regression"

### Before running the cell below, make sure that you have uploaded the provided nyc-taxi.csv file to SageMaker Studio, in the same folder as this notebook.
    


In [None]:
# cell 04
import boto3
FILE_TRAIN = "nyc-taxi.csv"
# s3 = boto3.client("s3")
# s3.download_file(data_bucket, f"{FILE_TRAIN}", FILE_TRAIN)

import pandas as pd  # Read in csv and store in a pandas dataframe

# df = pd.read_csv(FILE_TRAIN, sep=",", encoding="latin1")
df = pd.read_csv(FILE_TRAIN, sep=",", encoding="latin1", names=["fare_amount","vendor_id","pickup_datetime","dropoff_datetime","passenger_count","trip_distance","pickup_longitude","pickup_latitude","rate_code","store_and_fwd_flag","dropoff_longitude","dropoff_latitude","payment_type","surcharge","mta_tax","tip_amount","tolls_amount","total_amount"])
print(df.head(5))

In [None]:
# cell 05
df.info()

#### The dataset has 18 columns: the target "fare_amount" and 17 features, "vendor_id", "pickup_datetime", "dropoff_datetime", "passenger_count", "trip_distance", "pickup_longitude", "pickup_latitude", "rate_code", "store_and_fwd_flag", "dropoff_longitude", "dropoff_latitude", "payment_type", "surcharge", "mta_tax", "tip_amount", "tolls_amount", "total_amount"

Let's explore the dataset.

In [None]:
# cell 06
# Frequency tables for each categorical feature
for column in df.select_dtypes(include=['object']).columns:
    display(pd.crosstab(index=df[column], columns='% observations', normalize='columns'))

# Summary statistics and histograms for each numeric feature
display(df.describe())
%matplotlib inline
hist = df.hist(bins=30, sharey=True, figsize=(10, 10))

#### The store_and_fwd_flag column has very little variance (about 98% of its values are N and 2% are Y), so it is unlikely to have much impact on the target variable (fare_amount). Domain knowledge also suggests that payment_type does not affect the trip fare, so we can drop both features from the dataset.

In [None]:
# cell 07
df = df.drop(['payment_type', 'store_and_fwd_flag'], axis=1)
df.info()

#### The dataset has two features, 'pickup_datetime' and 'dropoff_datetime', that record when the ride started and when it ended. Because taxi fare depends heavily on the duration of the ride, as part of feature engineering we create a feature that calculates the ride duration from these two columns.

In [None]:
# cell 08
df['dropoff_datetime']= pd.to_datetime(df['dropoff_datetime'])
df['pickup_datetime']= pd.to_datetime(df['pickup_datetime'])
df['journey_time'] = (df['dropoff_datetime'] - df['pickup_datetime'])
df['journey_time'] = df['journey_time'].dt.total_seconds()
df['journey_time']
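
The subtraction in the cell above yields one time delta per row, which is then converted to seconds. As a hand check of that arithmetic, here is a minimal sketch using only the standard library; the two timestamps and the format string are made-up values for illustration and are not taken from the dataset.

```python
from datetime import datetime

# Made-up pickup and dropoff timestamps, for illustration only.
FMT = "%Y-%m-%d %H:%M:%S"
pickup = datetime.strptime("2013-01-01 08:15:00", FMT)
dropoff = datetime.strptime("2013-01-01 08:27:30", FMT)

# Subtracting two datetimes gives a timedelta; total_seconds() returns a float.
journey_time = (dropoff - pickup).total_seconds()
print(journey_time)  # 750.0 (12 minutes 30 seconds)
```

Expressing the duration as a single float keeps the feature numeric, which is what the training data needs.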

#### After creating the 'journey_time' feature, we can drop the 'pickup_datetime' and 'dropoff_datetime' features.

In [None]:
# cell 09
df = df.drop(['dropoff_datetime', 'pickup_datetime'], axis=1)
df.info()

#### vendor_id is still a categorical feature, so we need to convert it to float (using dummy variables that are 0 or 1) so that the dataset can be passed to the Linear Learner algorithm.

In [None]:
# cell 10
df = pd.get_dummies(df, dtype=float)
df.info()
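
To see what the encoding step does, here is a small sketch on a toy frame. The vendor codes and fares are made-up illustrative values, not rows from the dataset.

```python
import pandas as pd

# Toy frame; the vendor codes and fares are made-up illustrative values.
toy = pd.DataFrame(
    {
        "fare_amount": [6.5, 12.0, 9.0],
        "vendor_id": ["CMT", "VTS", "CMT"],
    }
)

# Object columns are expanded into one 0/1 float column per category;
# numeric columns pass through unchanged and stay in front.
encoded = pd.get_dummies(toy, dtype=float)
print(list(encoded.columns))
print(encoded)
```

Note that the target column stays first after encoding. This matters because, for CSV training data, the SageMaker built-in algorithms expect the label in the first column.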

#### Split the dataframe into train (70%), validation (20%), and test (10%) sets

In [None]:
# cell 11
import numpy as np

train_data, validation_data, test_data = np.split(df.sample(frac=1, random_state=1729), [int(0.7 * len(df)), int(0.9 * len(df))])
train_data.to_csv('train.csv', header=False, index=False)
validation_data.to_csv('validation.csv', header=False, index=False)
test_data.to_csv('test.csv', header=False, index=False)
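
The two cut points passed to `np.split` above mark the end of the training rows and the end of the validation rows. The sketch below reproduces that boundary logic with plain lists on a hypothetical 10-row dataset; it illustrates the index arithmetic only and does not reproduce the shuffle order that pandas produces.

```python
import random

n_rows = 10  # hypothetical row count, for illustration only
rows = list(range(n_rows))
random.Random(1729).shuffle(rows)  # shuffle before splitting

train_end = int(0.7 * n_rows)       # first 70% of rows go to training
validation_end = int(0.9 * n_rows)  # next 20% go to validation

train = rows[:train_end]
validation = rows[train_end:validation_end]
test = rows[validation_end:]  # remaining 10% are held out for testing

print(len(train), len(validation), len(test))  # 7 2 1
```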

In [None]:
# cell 12
boto3.Session().resource('s3').Bucket(data_bucket).Object(os.path.join(data_prefix, 'train/train.csv')).upload_file('train.csv')
boto3.Session().resource('s3').Bucket(data_bucket).Object(os.path.join(data_prefix, 'validation/validation.csv')).upload_file('validation.csv')
boto3.Session().resource('s3').Bucket(data_bucket).Object(os.path.join(data_prefix, 'test/test.csv')).upload_file('test.csv')


---
Let us prepare the handshake between our data channels and the algorithm. To do this, we create `sagemaker.inputs.TrainingInput` objects from our [data channels](https://sagemaker.readthedocs.io/en/v1.2.4/session.html#). These objects are then put in a simple dictionary, which the algorithm consumes. Notice that we use a `content_type` of `text/csv` for the pre-processed files in the data_bucket. We use two channels here, one for training and one for validation. The test samples from above will be used in the prediction step.

In [None]:
# cell 13
# creating the inputs for the fit() function with the training and validation location
s3_train_data = f"s3://{data_bucket}/{data_prefix}/train"
print(f"training files will be taken from: {s3_train_data}")

s3_validation_data = f"s3://{data_bucket}/{data_prefix}/validation"
print(f"validation files will be taken from: {s3_validation_data}")

s3_test_data = f"s3://{data_bucket}/{data_prefix}/test"
print(f"test files will be taken from: {s3_test_data}")

output_location = f"s3://{output_bucket}/{output_prefix}/output"
print(f"training artifacts output location: {output_location}")

# building the TrainingInput objects for fit() that the SDK accepts
train_data = sagemaker.inputs.TrainingInput(
    s3_train_data,
    distribution="FullyReplicated",
    content_type="text/csv",
    s3_data_type="S3Prefix",
    record_wrapping=None,
    compression=None,
)
validation_data = sagemaker.inputs.TrainingInput(
    s3_validation_data,
    distribution="FullyReplicated",
    content_type="text/csv",
    s3_data_type="S3Prefix",
    record_wrapping=None,
    compression=None,
)

## Training the Linear Learner model

First, we retrieve the image for the Linear Learner Algorithm according to the region.

Then we create an [estimator from the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) using the Linear Learner container image, and we set up the training parameters and hyperparameter configuration.


In [None]:
# cell 14
# getting the linear learner image according to the region
from sagemaker.image_uris import retrieve

container = retrieve("linear-learner", boto3.Session().region_name, version="1")
print(container)

In [None]:
%%time
import boto3
import sagemaker
from time import gmtime, strftime

sess = sagemaker.Session()

job_name = "DEMO-linear-learner-taxifare-regression-" + strftime("%H-%M-%S", gmtime())
print("Training job", job_name)

linear = sagemaker.estimator.Estimator(
    container,
    role,
    input_mode="File",
    instance_count=1,
    instance_type="ml.m4.xlarge",
    output_path=output_location,
    sagemaker_session=sess,
)

linear.set_hyperparameters(
    epochs=16,
    wd=0.01,
    loss="absolute_loss",
    predictor_type="regressor",
    normalize_data=True,
    optimizer="adam",
    mini_batch_size=1000,
    lr_scheduler_step=100,
    lr_scheduler_factor=0.99,
    lr_scheduler_minimum_lr=0.0001,
    learning_rate=0.1,
)
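
To build some intuition for the three `lr_scheduler_*` hyperparameters above, here is a rough sketch of a step-decay schedule. It assumes the learning rate is multiplied by `lr_scheduler_factor` once every `lr_scheduler_step` steps and never drops below `lr_scheduler_minimum_lr`; the function name is hypothetical and the algorithm's internal schedule may differ in detail.

```python
def scheduled_lr(step, base_lr=0.1, factor=0.99, every=100, minimum=0.0001):
    """Hypothetical step-decay schedule using the hyperparameter values above."""
    decays = step // every  # number of decay events so far
    return max(minimum, base_lr * factor**decays)


for step in (0, 100, 1000, 100000):
    print(step, scheduled_lr(step))
```

Under this assumption the rate starts at 0.1, shrinks by 1% every 100 steps, and eventually settles at the 0.0001 floor.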

---
After configuring the Estimator object and setting its hyperparameters, the only remaining thing to do is to train the algorithm, which the following cell does. Training involves a few steps. First, the instances that we requested when creating the Estimator are provisioned and set up with the appropriate libraries. Then, the data from our channels is downloaded onto the instance. Once this is done, the training job begins. Provisioning and data downloading take time, depending on the size of the data, so it might be a few minutes before we start getting logs for the training job. The logs also print loss metrics on the validation data for every epoch, that is, for every full pass over the dataset. These metrics are a proxy for the quality of the model.

Once the job has finished, a "Job complete" message will be printed. The trained model can be found in the S3 bucket that was set up as output_path in the estimator.

In [None]:
%%time
linear.fit(inputs={"train": train_data, "validation": validation_data}, job_name=job_name)
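
Because the estimator is configured with `loss="absolute_loss"`, the quantity being minimized is based on the absolute difference between predicted and actual fares. The sketch below shows how a mean absolute error is computed on made-up values; it is an illustration of the metric, not the algorithm's internal code.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all samples."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up fares, for illustration only.
actual_fares = [10.0, 20.0, 30.0]
predicted_fares = [12.0, 18.0, 33.0]

mae = mean_absolute_error(actual_fares, predicted_fares)
print(round(mae, 4))  # (2 + 2 + 3) / 3
```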

## Set up hosting for the model

Once the training is done, we can deploy the trained model as an Amazon SageMaker real-time hosted endpoint. This allows us to make predictions (or inferences) from the model. Note that we don't have to host on the same instance (or type of instance) that we used to train. Training is a prolonged and compute-heavy job with compute and memory requirements that hosting typically does not share, so we can choose any type of instance to host the model. In our case we chose the ml.m4.xlarge instance to train, but we host the model on the less expensive CPU instance, ml.c4.xlarge. The endpoint deployment can be accomplished as follows:

In [None]:
%%time
# creating the endpoint out of the trained model
linear_predictor = linear.deploy(initial_instance_count=1, instance_type="ml.c4.xlarge")
print(f"\ncreated endpoint: {linear_predictor.endpoint_name}")

#### Copy the endpoint name of the deployed model from above and save it for later

## Inference

Now that the trained model is deployed at an endpoint that is up and running, we can use this endpoint for inference. To do this, we configure the [predictor object](https://sagemaker.readthedocs.io/en/v1.2.4/predictors.html) to serialize requests as text/csv and to deserialize the JSON reply received from the endpoint.


In [None]:
# cell 18
# configure the predictor to serialize csv input and parse the response as json
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import JSONDeserializer

linear_predictor.serializer = CSVSerializer()
linear_predictor.deserializer = JSONDeserializer()
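
With the JSON deserializer in place, the predictor returns a Python dictionary. The sketch below parses a hypothetical response body by hand to show the shape that the inference cell in this notebook indexes into (`result["predictions"][0]["score"]`); the score value is made up.

```python
import json

# Hypothetical response body; the score value is made up.
raw_response = '{"predictions": [{"score": 12.3456}]}'

parsed = json.loads(raw_response)
# One entry per input record; each entry carries the predicted value under "score".
score = round(float(parsed["predictions"][0]["score"]), 2)
print(score)  # 12.35
```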

---
We then use the test file, which contains the records that we held out, to test the model's predictions. Each run of the cell below selects a random sample from the test set to perform inference with.

In [None]:
%%time
import json
from itertools import islice
import math
import struct
import boto3
import random

# downloading the test file from data_bucket
FILE_TEST = "test.csv"
s3 = boto3.client("s3")
s3.download_file(data_bucket, f"{data_prefix}/test/{FILE_TEST}", FILE_TEST)

# getting a random testing sample from our test file
with open(FILE_TEST, "r") as f:
    # strip the trailing newline so it does not end up in the payload
    test_data = [line.strip() for line in f if line.strip()]
sample = random.choice(test_data).split(",")
actual_fare = sample[0]
payload = sample[1:]  # removing the actual fare (the label) from the sample
payload = ",".join(map(str, payload))
print('payload: ', payload, type(payload))
# Invoke the predictor and analyze the result
result = linear_predictor.predict(payload)
print('Result:', result)
# extracting the prediction value
result = round(float(result["predictions"][0]["score"]), 2)


accuracy = str(round(100 - ((abs(float(result) - float(actual_fare)) / float(actual_fare)) * 100), 2))
print(f"Actual fare: {actual_fare}\nPrediction: {result}\nAccuracy: {accuracy}")
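
The accuracy figure printed above is 100 minus the absolute percentage error, which is undefined when the actual fare is zero. Here is a minimal sketch of the same calculation with a guard for that case; the helper name is hypothetical, and returning `None` for a zero fare is one possible design choice rather than the notebook's behavior.

```python
def percent_accuracy(actual, predicted):
    """Return 100 minus the absolute percentage error, or None if actual is 0."""
    if actual == 0:
        return None  # percentage error is undefined for a zero fare
    return round(100 - abs(predicted - actual) / abs(actual) * 100, 2)


print(percent_accuracy(10.0, 9.0))
print(percent_accuracy(0.0, 2.5))
```

Keep in mind that a single random sample says little about overall model quality; averaging the error over the whole test set gives a more reliable picture.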
