# Build Predictive Model(s)

In this workbook, you will read the merged dataset you created previously and build pipelines that train a binary classification model to predict whether a trip has a tip or not.

Instructions:

1. Read in your merged dataset
2. Use transformers and encoders to perform feature engineering
3. Split into training and testing
4. Build `LogisticRegression` model(s) and train them using pipelines
5. Evaluate the performance of the model(s) using `BinaryClassificationMetrics`

You are welcome to add as many cells as you need below up until the next section. **You must include comments in your code.**

In [2]:
# Use findspark to locate the Spark installation, then create a Spark session
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("model-data").getOrCreate()

In [3]:
# Check spark session
spark

In [4]:
# Load dataset
data = spark.read.parquet("s3://anly502s3/a5_mergeData/")

# Print and show the schema of the dataset
data.printSchema()
data.show(5)

print("Number of records: " + str(data.count()))

root
 |-- medallion: string (nullable = true)
 |-- hack_license: string (nullable = true)
 |-- vendor_id: string (nullable = true)
 |-- rate_code: integer (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_time_in_secs: float (nullable = true)
 |-- trip_distance: float (nullable = true)
 |-- pickup_longitude: float (nullable = true)
 |-- pickup_latitude: float (nullable = true)
 |-- dropoff_longitude: float (nullable = true)
 |-- dropoff_latitude: float (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- fare_amount: float (nullable = true)
 |-- surcharge: float (nullable = true)
 |-- mta_tax: float (nullable = true)
 |-- tip_amount: float (nullable = true)
 |-- tolls_amount: float (nullable = true)
 |-- total_amount: float (nullable = true)

+--------------------+--------------------+---------+---

In [84]:
# Import all the packages I need
from pyspark.ml.feature import OneHotEncoder, StringIndexer, IndexToString, VectorAssembler, Binarizer
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.evaluation import BinaryClassificationMetrics
from pyspark.ml import Pipeline, Model

In [69]:
# Prepare the label column: first cast the tip_amount column from float to double
data_new = data.withColumn("tip_amount", data["tip_amount"].cast("double"))

# Binarize with a threshold of 0: tip_amount = 0 gives label 0, tip_amount > 0 gives label 1
binarizer = Binarizer(threshold = 0, inputCol = "tip_amount", outputCol = "label")
binarizer.transform(data_new).show(3)

+--------------------+--------------------+---------+---------+------------------+-------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+------------+-----------+---------+-------+------------------+------------+------------+-----+
|           medallion|        hack_license|vendor_id|rate_code|store_and_fwd_flag|    pickup_datetime|   dropoff_datetime|passenger_count|trip_time_in_secs|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|payment_type|fare_amount|surcharge|mta_tax|        tip_amount|tolls_amount|total_amount|label|
+--------------------+--------------------+---------+---------+------------------+-------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+------------+-----------+---------+-------+------------------+------------+------------+-----+
|
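The labeling rule applied by `Binarizer` can be sketched in plain Python. This is a minimal illustration, not Spark code: the helper `binarize` and the sample tip values are hypothetical, and it assumes the rule documented for Spark's `Binarizer`, where values strictly greater than the threshold map to 1.0 and all other values map to 0.0.

```python
def binarize(value, threshold=0.0):
    """Return 1.0 if value is strictly greater than threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0

# Hypothetical tip amounts, for illustration only
sample_tips = [0.0, 1.5, 0.01, 3.0]
labels = [binarize(t) for t in sample_tips]
print(labels)
```

With a threshold of 0, any positive tip, however small, is labeled as a tipped trip.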

In [106]:
# Prepare training and testing data
splitted_data = data_new.randomSplit([0.8, 0.2], 810)
train_data = splitted_data[0]
test_data = splitted_data[1]

# Print the number of rows in each dataset
print("Number of training records: " + str(train_data.count()))
print("Number of testing records : " + str(test_data.count()))


Number of training records: 138551442
Number of testing records : 34633649
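A quick arithmetic check on the counts above shows how close the split is to the requested 80/20 weights. As far as I understand, `randomSplit` samples each row independently, so the weights are targets rather than exact proportions. The variable names below are only for illustration.

```python
# Row counts printed by the cell above
n_train = 138551442
n_test = 34633649

# Realized share of rows in each split
n_total = n_train + n_test
train_fraction = n_train / n_total
test_fraction = n_test / n_total

print("Total records  : %d" % n_total)
print("Train fraction : %.4f" % train_fraction)
print("Test fraction  : %.4f" % test_fraction)
```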


In [107]:
# Index the string-typed predictors so they can be used as numeric features
strInd_medall = StringIndexer(inputCol="medallion", outputCol="medall_IX", handleInvalid = "skip")
strInd_hack = StringIndexer(inputCol="hack_license", outputCol="hack_IX", handleInvalid = "skip")
strInd_vendor = StringIndexer(inputCol="vendor_id", outputCol="vendor_IX", handleInvalid = "skip")
strInd_flag = StringIndexer(inputCol="store_and_fwd_flag", outputCol="flag_IX", handleInvalid = "skip")
strInd_pmt = StringIndexer(inputCol="payment_type", outputCol="pmt_IX")
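What a `StringIndexer` does when fitted and applied can be sketched in plain Python. This is an approximation, not Spark code: it assumes the default ordering (most frequent label gets index 0), and the helpers `fit_string_index` and `transform_skip` and the sample values are hypothetical.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_skip(values, mapping):
    """Index values, dropping any label not seen during fitting (like handleInvalid="skip")."""
    return [mapping[v] for v in values if v in mapping]

# Hypothetical payment-type values, for illustration only
train_values = ["CRD", "CSH", "CRD", "CRD", "CSH", "UNK"]
mapping = fit_string_index(train_values)
print(mapping)

# "DIS" was never seen during fitting, so its row is dropped
print(transform_skip(["CSH", "CRD", "DIS"], mapping))
```

Note that `strInd_pmt` above leaves `handleInvalid` at its default, which to my knowledge is `"error"`, so an unseen payment type in the test data would raise an error instead of being skipped.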

In [108]:
# Assemble the indexed and numeric predictor columns into a single feature vector
vectorAssembler_features = VectorAssembler(
    inputCols=["medall_IX", 
               "hack_IX", 
               "vendor_IX", 
               "rate_code",
               "flag_IX",
               "passenger_count",
               "trip_time_in_secs",
               "trip_distance",
               "pickup_longitude",
               "pickup_latitude",
               "dropoff_longitude",
               "dropoff_latitude",
               "pmt_IX",
               "fare_amount",
               "surcharge",
               "mta_tax",
               "tolls_amount"], 
    outputCol="features")
vectorAssembler_features


VectorAssembler_2fe17f6f6bcf

In [109]:
# Define estimators
lr = LogisticRegression(labelCol="label", featuresCol="features")
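For intuition, the prediction rule of a fitted logistic regression can be sketched in plain Python. This is not how Spark implements it: the weights, intercept, and feature values are made up, and the sketch assumes a row is labeled 1 when its predicted probability exceeds a threshold of 0.5.

```python
import math

def predict_proba(features, weights, intercept):
    """Logistic model: P(label = 1) = sigmoid(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, intercept, threshold=0.5):
    """Predict 1.0 when the probability exceeds the threshold, else 0.0."""
    return 1.0 if predict_proba(features, weights, intercept) > threshold else 0.0

# Hypothetical weights and feature values, for illustration only
weights = [0.8, -0.3]
intercept = -0.5
features = [2.0, 1.0]

p = predict_proba(features, weights, intercept)
print("P(tip) = %.3f, predicted label = %s" % (p, predict_label(features, weights, intercept)))
```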


In [110]:
# Build the pipeline
pipeline_lr = Pipeline(stages=[strInd_medall, 
                                strInd_hack, 
                                strInd_vendor, 
                                strInd_flag, 
                                strInd_pmt,
                                binarizer,
                                vectorAssembler_features,
                                lr])

In [111]:
# Fit the pipeline (feature stages and logistic regression) on the training data
model_lr = pipeline_lr.fit(train_data)


In [112]:
# Evaluate the model on the test data using BinaryClassificationEvaluator.
# Its default metric is areaUnderROC, so the value below is the AUC, not classification accuracy.
predictions = model_lr.transform(test_data)
evaluatorLR = BinaryClassificationEvaluator()
accuracy = evaluatorLR.evaluate(predictions)  # name kept for later cells; this holds the AUC

print("Area under ROC = %g" % accuracy)
print("1 - Area under ROC = %g" % (1.0 - accuracy))

Area under ROC = 0.989056
1 - Area under ROC = 0.010944
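`BinaryClassificationEvaluator` reports the area under the ROC curve by default, which measures how well the model ranks positives above negatives rather than the share of correct predictions. The difference can be sketched in plain Python with made-up labels and scores. The helpers `auc_from_scores` and `accuracy_at` are hypothetical, and the pairwise loop is only practical for tiny inputs, not for the test set used here.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def accuracy_at(labels, scores, threshold=0.5):
    """Fraction of rows whose thresholded score matches the label."""
    hits = sum(1 for y, s in zip(labels, scores) if (1 if s > threshold else 0) == y)
    return hits / len(labels)

# Hypothetical labels and predicted probabilities, for illustration only
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.45, 0.4, 0.35, 0.3, 0.1]

auc = auc_from_scores(labels, scores)
acc = accuracy_at(labels, scores)
print("AUC      = %g" % auc)
print("Accuracy = %g" % acc)
```

In this toy case every positive is scored above every negative, so the ranking is perfect, yet two positives fall below the 0.5 threshold and are misclassified. The two metrics answer different questions and should be labeled accordingly.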


## In the following cells, please provide the requested code and output. Do not change the order and/or structure of the cells.

In the following cell, print the Area Under the Curve (AUC) for your binary classifier.

In [104]:
# The evaluator's default metric is areaUnderROC, so this value is the AUC
print("Area under ROC (AUC) = %g" % accuracy)

Area under ROC (AUC) = 0.989059


In the following cell, provide the code that saves your model to your S3 bucket.

In [105]:
# Save the fitted pipeline model to the S3 bucket
model_lr.save("s3://anly502s3/a5_model/")

In [113]:
spark.stop()