---
# PySpark Machine Learning Models
---
* Use Spark to build ML models of your choice (classification or regression) in an attempt to solve your business problem.
* Build at least 4 models, optimize them, test them, and include your results. You will not be graded on accuracy; it is more important that you implement your techniques correctly, even if the model does not return the intended results. If your model does not reach a sufficient accuracy level, present your work and explain possible reasons for the shortfall.

### **Columns:**

* legId: An identifier for the flight.
* searchDate: The date (YYYY-MM-DD) on which this entry was taken from Expedia.
* flightDate: The date (YYYY-MM-DD) of the flight.
* startingAirport: Three-character IATA airport code for the initial location.
* destinationAirport: Three-character IATA airport code for the arrival location.
* fareBasisCode: The fare basis code.
* travelDuration: The travel duration in hours and minutes.
* elapsedDays: The number of elapsed days (usually 0).
* isBasicEconomy: Boolean for whether the ticket is for basic economy.
* isRefundable: Boolean for whether the ticket is refundable.
* isNonStop: Boolean for whether the flight is non-stop.
* baseFare: The price of the ticket (in USD).
* totalFare: The price of the ticket (in USD) including taxes and other fees.
* seatsRemaining: Integer for the number of seats remaining.
* totalTravelDistance: The total travel distance in miles. This data is sometimes missing.
* segmentsDepartureTimeEpochSeconds: String containing the departure time (Unix time) for each leg of the trip. The entries for each of the legs are separated by '||'.
* segmentsDepartureTimeRaw: String containing the departure time (ISO 8601 format: YYYY-MM-DDThh:mm:ss.000±[hh]:00) for each leg of the trip. The entries for each of the legs are separated by '||'.
* segmentsArrivalTimeEpochSeconds: String containing the arrival time (Unix time) for each leg of the trip. The entries for each of the legs are separated by '||'.
* segmentsArrivalTimeRaw: String containing the arrival time (ISO 8601 format: YYYY-MM-DDThh:mm:ss.000±[hh]:00) for each leg of the trip. The entries for each of the legs are separated by '||'.
* segmentsArrivalAirportCode: String containing the IATA airport code for the arrival location for each leg of the trip. The entries for each of the legs are separated by '||'.
* segmentsDepartureAirportCode: String containing the IATA airport code for the departure location for each leg of the trip. The entries for each of the legs are separated by '||'.
* segmentsAirlineName: String containing the name of the airline that services each leg of the trip. The entries for each of the legs are separated by '||'.
* segmentsAirlineCode: String containing the two-letter airline code that services each leg of the trip. The entries for each of the legs are separated by '||'.
* segmentsEquipmentDescription: String containing the type of airplane used for each leg of the trip (e.g. "Airbus A321" or "Boeing 737-800"). The entries for each of the legs are separated by '||'.
* segmentsDurationInSeconds: String containing the duration of the flight (in seconds) for each leg of the trip. The entries for each of the legs are separated by '||'.
* segmentsDistance: String containing the distance traveled (in miles) for each leg of the trip. The entries for each of the legs are separated by '||'.
* segmentsCabinCode: String containing the cabin for each leg of the trip (e.g. "coach"). The entries for each of the legs are separated by '||'.
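
Several of the `segments*` fields pack one value per leg into a single string separated by `||`. A minimal sketch of unpacking such a field in plain Python, assuming the format described above; `split_segments` is a hypothetical helper, not part of the dataset or of PySpark, and the sample values are made up:

```python
from typing import List, Optional


def split_segments(value: Optional[str]) -> List[str]:
    """Split a '||'-separated segment field into one entry per leg."""
    if value is None or value == "":
        return []  # missing field: no legs to report
    return value.split("||")


# Illustrative values only, not rows taken from the dataset.
airports = split_segments("ATL||CLT")
durations = [int(s) for s in split_segments("4200||5400")]
total_seconds = sum(durations)
```

In PySpark the same split can be written with `pyspark.sql.functions.split`, whose pattern argument is a regular expression, so the separator has to be escaped (for example `r'\|\|'`).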

## Install PySpark

In [27]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Importing Packages 

In [28]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

import pyspark
from pyspark.sql.functions import col, isnan, when, count
from pyspark.ml.feature import Imputer, VectorAssembler, StringIndexer
from pyspark.sql import functions as F
from pyspark.sql import types
from pyspark.sql.functions import mean
from pyspark.ml.regression import RandomForestRegressor, DecisionTreeRegressor, GBTRegressor, LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

## Connect to the Spark server
* Initializing a Spark Session

In [29]:
spark = pyspark.sql.SparkSession.builder.getOrCreate()

## Obtain the Data
* Reading the data
* Schema information of the data

In [30]:
data = spark.read.csv('FlightDataset.csv',
                     sep=',',
                     inferSchema=True,
                     header=True,
                     multiLine=True)
data.printSchema()

root
 |-- flightDate: string (nullable = true)
 |-- startingAirport: string (nullable = true)
 |-- destinationAirport: string (nullable = true)
 |-- elapsedDays: integer (nullable = true)
 |-- isBasicEconomy: boolean (nullable = true)
 |-- isRefundable: boolean (nullable = true)
 |-- isNonStop: boolean (nullable = true)
 |-- baseFare: double (nullable = true)
 |-- totalFare: double (nullable = true)
 |-- seatsRemaining: integer (nullable = true)
 |-- totalTravelDistance: integer (nullable = true)
 |-- segmentsDepartureTimeEpochSeconds: integer (nullable = true)
 |-- segmentsDepartureTimeRaw: timestamp (nullable = true)
 |-- segmentsArrivalTimeEpochSeconds: integer (nullable = true)
 |-- segmentsArrivalTimeRaw: timestamp (nullable = true)
 |-- segmentsArrivalAirportCode: string (nullable = true)
 |-- segmentsDepartureAirportCode: string (nullable = true)
 |-- segmentsAirlineName: string (nullable = true)
 |-- segmentsDurationInSeconds: integer (nullable = true)
 |-- segmentsDistance: in

## Shape of the dataset

In [31]:
print("Shape of the dataset: ", (data.count(), len(data.columns)))

Shape of the dataset:  (880990, 24)


## Show Dataset

In [32]:
data.show()

+----------+---------------+------------------+-----------+--------------+------------+---------+--------+---------+--------------+-------------------+---------------------------------+------------------------+-------------------------------+----------------------+--------------------------+----------------------------+-------------------+-------------------------+----------------+-----------------+-------------------------------+----------------------------------+------------------------+
|flightDate|startingAirport|destinationAirport|elapsedDays|isBasicEconomy|isRefundable|isNonStop|baseFare|totalFare|seatsRemaining|totalTravelDistance|segmentsDepartureTimeEpochSeconds|segmentsDepartureTimeRaw|segmentsArrivalTimeEpochSeconds|segmentsArrivalTimeRaw|segmentsArrivalAirportCode|segmentsDepartureAirportCode|segmentsAirlineName|segmentsDurationInSeconds|segmentsDistance|segmentsCabinCode|segmentsArrivalTimeEpochInhours|segmentsDepartureTimeEpochInhours |segmentsDurationInhours |
+---------

## Data Preprocessing & Cleaning

* Change data type

In [33]:
# from boolean to string
data = data.withColumn('isBasicEconomy', F.col('isBasicEconomy').cast(types.StringType()))
data = data.withColumn('isRefundable', F.col('isRefundable').cast(types.StringType()))
data = data.withColumn('isNonStop', F.col('isNonStop').cast(types.StringType()))

In [34]:
# from timestamp to string
data = data.withColumn('segmentsDepartureTimeRaw', F.col('segmentsDepartureTimeRaw').cast(types.StringType()))
data = data.withColumn('segmentsArrivalTimeRaw', F.col('segmentsArrivalTimeRaw').cast(types.StringType()))

In [35]:
data.printSchema()

root
 |-- flightDate: string (nullable = true)
 |-- startingAirport: string (nullable = true)
 |-- destinationAirport: string (nullable = true)
 |-- elapsedDays: integer (nullable = true)
 |-- isBasicEconomy: string (nullable = true)
 |-- isRefundable: string (nullable = true)
 |-- isNonStop: string (nullable = true)
 |-- baseFare: double (nullable = true)
 |-- totalFare: double (nullable = true)
 |-- seatsRemaining: integer (nullable = true)
 |-- totalTravelDistance: integer (nullable = true)
 |-- segmentsDepartureTimeEpochSeconds: integer (nullable = true)
 |-- segmentsDepartureTimeRaw: string (nullable = true)
 |-- segmentsArrivalTimeEpochSeconds: integer (nullable = true)
 |-- segmentsArrivalTimeRaw: string (nullable = true)
 |-- segmentsArrivalAirportCode: string (nullable = true)
 |-- segmentsDepartureAirportCode: string (nullable = true)
 |-- segmentsAirlineName: string (nullable = true)
 |-- segmentsDurationInSeconds: integer (nullable = true)
 |-- segmentsDistance: integer (nu

* Check for null values

In [36]:
# Find Count of Null, None, NaN of All DataFrame Columns
from pyspark.sql.functions import col,isnan, when, count
data.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in data.columns]
   ).show()

+----------+---------------+------------------+-----------+--------------+------------+---------+--------+---------+--------------+-------------------+---------------------------------+------------------------+-------------------------------+----------------------+--------------------------+----------------------------+-------------------+-------------------------+----------------+-----------------+-------------------------------+----------------------------------+------------------------+
|flightDate|startingAirport|destinationAirport|elapsedDays|isBasicEconomy|isRefundable|isNonStop|baseFare|totalFare|seatsRemaining|totalTravelDistance|segmentsDepartureTimeEpochSeconds|segmentsDepartureTimeRaw|segmentsArrivalTimeEpochSeconds|segmentsArrivalTimeRaw|segmentsArrivalAirportCode|segmentsDepartureAirportCode|segmentsAirlineName|segmentsDurationInSeconds|segmentsDistance|segmentsCabinCode|segmentsArrivalTimeEpochInhours|segmentsDepartureTimeEpochInhours |segmentsDurationInhours |
+---------
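
The expression in the cell above counts, for each column, the rows whose value is either null or NaN. The same logic in plain Python, as a sketch over a list of dictionaries; the rows and the `count_missing` helper are made up for illustration:

```python
import math
from typing import Any, Dict, List


def is_missing(value: Any) -> bool:
    """True for None and for float NaN, mirroring isNull() | isnan()."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def count_missing(rows: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count missing values per column across all rows."""
    columns = list(rows[0].keys()) if rows else []
    return {c: sum(is_missing(r.get(c)) for r in rows) for c in columns}


# Illustrative rows only, not taken from the dataset.
rows = [
    {"baseFare": 217.67, "totalTravelDistance": 947},
    {"baseFare": float("nan"), "totalTravelDistance": None},
    {"baseFare": 213.02, "totalTravelDistance": None},
]
missing = count_missing(rows)
```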

## Descriptive Statistics

In [37]:
# The last two column names end with a space, as they appear in the dataset's header.
data.select('elapsedDays', 'baseFare', 'totalFare', 'seatsRemaining', 'totalTravelDistance',
            'segmentsDepartureTimeEpochSeconds', 'segmentsArrivalTimeEpochSeconds',
            'segmentsDurationInSeconds', 'segmentsDistance', 'segmentsArrivalTimeEpochInhours',
            'segmentsDepartureTimeEpochInhours ', 'segmentsDurationInhours ').summary().show()

+-------+-------------------+------------------+------------------+-----------------+-------------------+---------------------------------+-------------------------------+-------------------------+-----------------+-------------------------------+----------------------------------+------------------------+
|summary|        elapsedDays|          baseFare|         totalFare|   seatsRemaining|totalTravelDistance|segmentsDepartureTimeEpochSeconds|segmentsArrivalTimeEpochSeconds|segmentsDurationInSeconds| segmentsDistance|segmentsArrivalTimeEpochInhours|segmentsDepartureTimeEpochInhours |segmentsDurationInhours |
+-------+-------------------+------------------+------------------+-----------------+-------------------+---------------------------------+-------------------------------+-------------------------+-----------------+-------------------------------+----------------------------------+------------------------+
|  count|             880990|            880990|            880990|         

In [38]:
data.show()

+----------+---------------+------------------+-----------+--------------+------------+---------+--------+---------+--------------+-------------------+---------------------------------+------------------------+-------------------------------+----------------------+--------------------------+----------------------------+-------------------+-------------------------+----------------+-----------------+-------------------------------+----------------------------------+------------------------+
|flightDate|startingAirport|destinationAirport|elapsedDays|isBasicEconomy|isRefundable|isNonStop|baseFare|totalFare|seatsRemaining|totalTravelDistance|segmentsDepartureTimeEpochSeconds|segmentsDepartureTimeRaw|segmentsArrivalTimeEpochSeconds|segmentsArrivalTimeRaw|segmentsArrivalAirportCode|segmentsDepartureAirportCode|segmentsAirlineName|segmentsDurationInSeconds|segmentsDistance|segmentsCabinCode|segmentsArrivalTimeEpochInhours|segmentsDepartureTimeEpochInhours |segmentsDurationInhours |
+---------

In [39]:
data.printSchema()

root
 |-- flightDate: string (nullable = true)
 |-- startingAirport: string (nullable = true)
 |-- destinationAirport: string (nullable = true)
 |-- elapsedDays: integer (nullable = true)
 |-- isBasicEconomy: string (nullable = true)
 |-- isRefundable: string (nullable = true)
 |-- isNonStop: string (nullable = true)
 |-- baseFare: double (nullable = true)
 |-- totalFare: double (nullable = true)
 |-- seatsRemaining: integer (nullable = true)
 |-- totalTravelDistance: integer (nullable = true)
 |-- segmentsDepartureTimeEpochSeconds: integer (nullable = true)
 |-- segmentsDepartureTimeRaw: string (nullable = true)
 |-- segmentsArrivalTimeEpochSeconds: integer (nullable = true)
 |-- segmentsArrivalTimeRaw: string (nullable = true)
 |-- segmentsArrivalAirportCode: string (nullable = true)
 |-- segmentsDepartureAirportCode: string (nullable = true)
 |-- segmentsAirlineName: string (nullable = true)
 |-- segmentsDurationInSeconds: integer (nullable = true)
 |-- segmentsDistance: integer (nu

* Encode every categorical column with StringIndexer, then drop the original columns.
* Each encoded column is stored under the original name plus the suffix '_idx', so category becomes category_idx.
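
With its default settings, StringIndexer gives index 0.0 to the most frequent label, 1.0 to the next most frequent, and so on. A plain-Python imitation of that ordering, as a sketch: `fit_string_index` is a hypothetical helper, the cabin values are made up, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter
from typing import Dict, Iterable


def fit_string_index(values: Iterable[str]) -> Dict[str, float]:
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically (tie-break is an assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Illustrative values only, not rows taken from the dataset.
cabins = ["coach", "coach", "first", "coach", "business", "first"]
mapping = fit_string_index(cabins)
encoded = [mapping[c] for c in cabins]
```

The resulting numbers are codes rather than magnitudes, which matters for models such as linear regression that read them as ordered quantities.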

In [40]:
cat_cols = ['flightDate', 'startingAirport', 'destinationAirport',\
            'isBasicEconomy', 'isRefundable','isNonStop','segmentsDepartureTimeRaw',\
            'segmentsArrivalTimeRaw','segmentsArrivalAirportCode',\
            'segmentsDepartureAirportCode', 'segmentsAirlineName','segmentsCabinCode']

In [41]:
# Use a loop variable that does not shadow the imported pyspark function `col`
for cat_col in cat_cols:
    indexer = StringIndexer(inputCol=cat_col, outputCol=cat_col + '_idx')
    data = indexer.fit(data).transform(data)  # fit the indexer and apply it in one step

data = data.drop(*cat_cols)  # drop the original categorical columns

In [42]:
data.show()

+-----------+--------+---------+--------------+-------------------+---------------------------------+-------------------------------+-------------------------+----------------+-------------------------------+----------------------------------+------------------------+--------------+-------------------+----------------------+------------------+----------------+-------------+----------------------------+--------------------------+------------------------------+--------------------------------+-----------------------+---------------------+
|elapsedDays|baseFare|totalFare|seatsRemaining|totalTravelDistance|segmentsDepartureTimeEpochSeconds|segmentsArrivalTimeEpochSeconds|segmentsDurationInSeconds|segmentsDistance|segmentsArrivalTimeEpochInhours|segmentsDepartureTimeEpochInhours |segmentsDurationInhours |flightDate_idx|startingAirport_idx|destinationAirport_idx|isBasicEconomy_idx|isRefundable_idx|isNonStop_idx|segmentsDepartureTimeRaw_idx|segmentsArrivalTimeRaw_idx|segmentsArrivalAirportCod

### Combining Feature Columns
* Use VectorAssembler to combine all the feature columns into a single column called features, because machine learning algorithms in PySpark take two columns: one that contains all the features, and another that contains the labels.
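
What the assembler does to each row can be pictured in plain Python: every column except the label is packed, in order, into one list. A sketch with a three-column row; `assemble` is a hypothetical helper, and the real VectorAssembler produces Spark vectors rather than Python lists.

```python
from typing import Dict, List, Tuple


def assemble(row: Dict[str, float], label_col: str) -> Tuple[List[float], float]:
    """Return (features, label): all non-label columns in order, plus the label."""
    feature_cols = [c for c in row if c != label_col]
    return [float(row[c]) for c in feature_cols], float(row[label_col])


# Illustrative row using three of the dataset's column names.
row = {"elapsedDays": 0, "baseFare": 217.67, "totalFare": 248.6}
features, label = assemble(row, label_col="totalFare")
```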

In [43]:
cols = data.columns
cols.remove('totalFare') #remove price -> we need this to be our label

#vector assembler will take all the columns and convert them into one column called features
assembler = VectorAssembler(inputCols=cols, outputCol='features')

#the .transform will apply the changes here
data = assembler.transform(data)

In [44]:
data.show()

+-----------+--------+---------+--------------+-------------------+---------------------------------+-------------------------------+-------------------------+----------------+-------------------------------+----------------------------------+------------------------+--------------+-------------------+----------------------+------------------+----------------+-------------+----------------------------+--------------------------+------------------------------+--------------------------------+-----------------------+---------------------+--------------------+
|elapsedDays|baseFare|totalFare|seatsRemaining|totalTravelDistance|segmentsDepartureTimeEpochSeconds|segmentsArrivalTimeEpochSeconds|segmentsDurationInSeconds|segmentsDistance|segmentsArrivalTimeEpochInhours|segmentsDepartureTimeEpochInhours |segmentsDurationInhours |flightDate_idx|startingAirport_idx|destinationAirport_idx|isBasicEconomy_idx|isRefundable_idx|isNonStop_idx|segmentsDepartureTimeRaw_idx|segmentsArrivalTimeRaw_idx|segm

## Split Data
* 80% of the rows go to the training set and 20% to the testing set

In [60]:
flight_data = data.select(F.col('features'), F.col('totalFare').alias('label'))

flight_train, flight_test = flight_data.randomSplit([0.8, 0.2])
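
`randomSplit` assigns rows to the splits at random, so the exact partition changes between runs unless a value is passed through its optional `seed` argument. The idea, sketched in plain Python with a hypothetical `random_split` helper; this imitates the behaviour and is not Spark's implementation:

```python
import random
from typing import List, Sequence, Tuple


def random_split(rows: Sequence[int], train_fraction: float, seed: int) -> Tuple[List[int], List[int]]:
    """Send each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train: List[int] = []
    test: List[int] = []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(1000))
train_a, test_a = random_split(rows, 0.8, seed=42)
train_b, test_b = random_split(rows, 0.8, seed=42)
```

As with `randomSplit`, the split sizes are only approximately 80/20, because each row is assigned independently.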

In [61]:
flight_data.show()

+--------------------+------+
|            features| label|
+--------------------+------+
|[0.0,217.67,9.0,9...| 248.6|
|[0.0,217.67,4.0,9...| 248.6|
|[0.0,217.67,9.0,9...| 248.6|
|[0.0,217.67,8.0,9...| 248.6|
|[0.0,217.67,9.0,9...| 248.6|
|[0.0,213.02,3.0,9...| 251.1|
|[0.0,213.02,3.0,9...| 251.1|
|[0.0,213.02,7.0,9...| 251.1|
|[0.0,213.02,7.0,9...| 251.1|
|[0.0,213.02,1.0,9...| 252.6|
|[0.0,213.02,3.0,1...| 252.6|
|[0.0,213.02,5.0,1...| 252.6|
|[0.0,213.02,3.0,1...| 252.6|
|[0.0,213.02,2.0,1...| 252.6|
|[0.0,260.47,1.0,9...|302.11|
|[0.0,260.47,1.0,9...|302.11|
|[0.0,260.47,1.0,9...|302.11|
|[0.0,260.47,1.0,9...|302.11|
|[0.0,260.47,1.0,9...|302.11|
|[1.0,258.6,9.0,94...| 307.2|
+--------------------+------+
only showing top 20 rows



## Model Building

### Initialize Evaluator and Grid

In [62]:
evaluator = RegressionEvaluator()
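
A hyperparameter grid, in the sense of Spark's `ParamGridBuilder`, is the Cartesian product of the candidate values listed for each parameter. A plain-Python sketch of that enumeration; the candidate values below are assumptions chosen for illustration, not settings tuned in this notebook:

```python
from itertools import product
from typing import Dict, List

# Hypothetical candidate values for two LinearRegression parameters.
candidates: Dict[str, List[float]] = {
    "regParam": [0.01, 0.1, 0.3],
    "elasticNetParam": [0.0, 0.5, 0.8],
}

# One dictionary per combination: 3 x 3 = 9 settings to train and score.
names = list(candidates)
grid = [dict(zip(names, combo)) for combo in product(*candidates.values())]
```

Each setting would then be fitted on the training data and scored with the evaluator, keeping the one with the best metric.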

### Linear Regression Model 

In [63]:
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
r_lr = lr.fit(flight_train)

### Make predictions on the testing set using the trained model

In [64]:
pred_lr = r_lr.transform(flight_test)

In [65]:
pred_lr.show()

+--------------------+------+------------------+
|            features| label|        prediction|
+--------------------+------+------------------+
|(23,[1,2,3,4,5,6,...| 117.6|118.03737805620781|
|(23,[1,2,3,4,5,6,...|132.61|  133.183318815681|
|(23,[1,2,3,4,5,6,...| 137.6|138.05546134428994|
|(23,[1,2,3,4,5,6,...| 137.6|  138.030955144963|
|(23,[1,2,3,4,5,6,...| 147.6|148.19912191713786|
|(23,[1,2,3,4,5,6,...| 147.6|148.19912191713786|
|(23,[1,2,3,4,5,6,...| 157.6| 158.0765321462324|
|(23,[1,2,3,4,5,6,...| 163.6|164.22302828622344|
|(23,[1,2,3,4,5,6,...|171.61|172.08223934317988|
|(23,[1,2,3,4,5,6,...|176.61| 177.0875070436655|
|(23,[1,2,3,4,5,6,...|177.61| 178.2414921366118|
|(23,[1,2,3,4,5,6,...| 181.6|182.10553139759202|
|(23,[1,2,3,4,5,6,...| 182.6| 183.1450692936748|
|(23,[1,2,3,4,5,6,...| 191.6|192.26329142550316|
|(23,[1,2,3,4,5,6,...| 200.6| 201.1010424601105|
|(23,[1,2,3,4,5,6,...| 200.6| 201.1010424601105|
|(23,[1,2,3,4,5,6,...| 200.6| 201.1255486594374|
|(23,[1,2,3,4,5,6,..

### Random Forest Regression Model 

In [66]:
# maxBins must be at least the number of categories in the largest indexed categorical feature
rf = RandomForestRegressor(featuresCol='features', labelCol='label', maxBins=1483)
r_rf = rf.fit(flight_train)

### Make predictions on the testing set using the trained model

In [67]:
pred_rf = r_rf.transform(flight_test)

In [68]:
pred_rf.show()

+--------------------+------+------------------+
|            features| label|        prediction|
+--------------------+------+------------------+
|(23,[1,2,3,4,5,6,...| 117.6|169.69188384812307|
|(23,[1,2,3,4,5,6,...|132.61|167.28919253444704|
|(23,[1,2,3,4,5,6,...| 137.6| 161.4016138643882|
|(23,[1,2,3,4,5,6,...| 137.6|158.15718381649572|
|(23,[1,2,3,4,5,6,...| 147.6|174.02242484944549|
|(23,[1,2,3,4,5,6,...| 147.6|174.02242484944549|
|(23,[1,2,3,4,5,6,...| 157.6|177.08987051806832|
|(23,[1,2,3,4,5,6,...| 163.6|166.28251170931384|
|(23,[1,2,3,4,5,6,...|171.61|196.00830849516657|
|(23,[1,2,3,4,5,6,...|176.61|196.00830849516657|
|(23,[1,2,3,4,5,6,...|177.61|183.53020717354076|
|(23,[1,2,3,4,5,6,...| 181.6| 182.6701378943761|
|(23,[1,2,3,4,5,6,...| 182.6| 187.3353620794656|
|(23,[1,2,3,4,5,6,...| 191.6| 181.5895118133684|
|(23,[1,2,3,4,5,6,...| 200.6|190.15782149292292|
|(23,[1,2,3,4,5,6,...| 200.6|190.15782149292292|
|(23,[1,2,3,4,5,6,...| 200.6|190.00396384374505|
|(23,[1,2,3,4,5,6,..

### Decision Tree Regression Model 

In [69]:
dt = DecisionTreeRegressor(featuresCol="features", labelCol='label', maxBins=1483)
r_dt = dt.fit(flight_train)

### Make predictions on the testing set using the trained model

In [70]:
pred_dt = r_dt.transform(flight_test)

In [71]:
pred_dt.show()

+--------------------+------+------------------+
|            features| label|        prediction|
+--------------------+------+------------------+
|(23,[1,2,3,4,5,6,...| 117.6|121.25001448638402|
|(23,[1,2,3,4,5,6,...|132.61|121.25001448638402|
|(23,[1,2,3,4,5,6,...| 137.6|139.96858776256445|
|(23,[1,2,3,4,5,6,...| 137.6|139.96858776256445|
|(23,[1,2,3,4,5,6,...| 147.6|139.96858776256445|
|(23,[1,2,3,4,5,6,...| 147.6|139.96858776256445|
|(23,[1,2,3,4,5,6,...| 157.6|156.34973035241737|
|(23,[1,2,3,4,5,6,...| 163.6|156.34973035241737|
|(23,[1,2,3,4,5,6,...|171.61| 173.9613824042483|
|(23,[1,2,3,4,5,6,...|176.61| 173.9613824042483|
|(23,[1,2,3,4,5,6,...|177.61| 173.9613824042483|
|(23,[1,2,3,4,5,6,...| 181.6| 173.9613824042483|
|(23,[1,2,3,4,5,6,...| 182.6| 173.9613824042483|
|(23,[1,2,3,4,5,6,...| 191.6|192.63440021554726|
|(23,[1,2,3,4,5,6,...| 200.6|192.63440021554726|
|(23,[1,2,3,4,5,6,...| 200.6|192.63440021554726|
|(23,[1,2,3,4,5,6,...| 200.6|192.63440021554726|
|(23,[1,2,3,4,5,6,..

### Gradient Boosted Tree Regression Model 

In [72]:
gbt = GBTRegressor(featuresCol="features", labelCol='label', maxIter=10, maxBins=1483)
r_gbt = gbt.fit(flight_train)

### Make predictions on the testing set using the trained model

In [73]:
pred_gbt = r_gbt.transform(flight_test)

In [74]:
pred_gbt.show()

+--------------------+------+------------------+
|            features| label|        prediction|
+--------------------+------+------------------+
|(23,[1,2,3,4,5,6,...| 117.6|124.42174153854407|
|(23,[1,2,3,4,5,6,...|132.61|125.77469226613337|
|(23,[1,2,3,4,5,6,...| 137.6|141.69123200597073|
|(23,[1,2,3,4,5,6,...| 137.6|141.69123200597073|
|(23,[1,2,3,4,5,6,...| 147.6|143.86354547811362|
|(23,[1,2,3,4,5,6,...| 147.6|143.86354547811362|
|(23,[1,2,3,4,5,6,...| 157.6|158.95609707822112|
|(23,[1,2,3,4,5,6,...| 163.6| 160.2054285238784|
|(23,[1,2,3,4,5,6,...|171.61| 177.0961043404126|
|(23,[1,2,3,4,5,6,...|176.61|177.34093388678065|
|(23,[1,2,3,4,5,6,...|177.61|177.01490336051776|
|(23,[1,2,3,4,5,6,...| 181.6|177.84929415721408|
|(23,[1,2,3,4,5,6,...| 182.6|177.84929415721408|
|(23,[1,2,3,4,5,6,...| 191.6|194.51360677131686|
|(23,[1,2,3,4,5,6,...| 200.6|196.80066270985486|
|(23,[1,2,3,4,5,6,...| 200.6|196.80066270985486|
|(23,[1,2,3,4,5,6,...| 200.6|196.80066270985486|
|(23,[1,2,3,4,5,6,..

## Models Evaluation

In [75]:
# Prediction DataFrames, one per model, in the same order as the index of the results table
predictions = [pred_rf, pred_gbt, pred_dt, pred_lr]

evaluator_R = RegressionEvaluator(predictionCol='prediction', labelCol='label', metricName='r2')

evaluator_RMSE = RegressionEvaluator(predictionCol='prediction', labelCol='label', metricName='rmse')

evaluator_MAE = RegressionEvaluator(predictionCol='prediction', labelCol='label', metricName='mae')

# Lists that collect the scores for each metric, one entry per model.
R2 = []
RMSE = []
MAE = []

# Score every prediction DataFrame on all three metrics.
for pred in predictions:
    R2.append(evaluator_R.evaluate(pred))
    RMSE.append(evaluator_RMSE.evaluate(pred))
    MAE.append(evaluator_MAE.evaluate(pred))
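
The three metrics requested from `RegressionEvaluator` follow the standard definitions: RMSE is the square root of the mean squared error, MAE is the mean absolute error, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. A plain-Python sketch on made-up numbers; `regression_metrics` is a hypothetical helper:

```python
import math
from typing import Dict, Sequence


def regression_metrics(labels: Sequence[float], preds: Sequence[float]) -> Dict[str, float]:
    """Compute R-squared, RMSE and MAE from paired labels and predictions."""
    n = len(labels)
    errors = [p - y for y, p in zip(labels, preds)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_y = sum(labels) / n
    ss_tot = sum((y - mean_y) ** 2 for y in labels)  # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
    }


# Illustrative labels and predictions, not model output from this notebook.
metrics = regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```

RMSE and MAE are expressed in the label's units (USD here), while R-squared is unitless.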

In [76]:
# We will convert all lists created above into a dataframe for easy viewing.
Models_Evaluation = pd.DataFrame(list(zip(R2, RMSE, MAE)), 
                     columns = ['R-squared', 'Root Mean Squared Error', 'Mean Absolute Error'],
                     index = ['Random Forest Regressor', 'Gradient Boosted Trees Regressor', 'Decision Tree Regressor', 'Linear Regression'])

In [77]:
Models_Evaluation

| | R-squared | Root Mean Squared Error | Mean Absolute Error |
|---|---|---|---|
| Random Forest Regressor | 0.949554 | 34.144418 | 23.221323 |
| Gradient Boosted Trees Regressor | 0.998799 | 5.268007 | 4.300254 |
| Decision Tree Regressor | 0.997764 | 7.188836 | 5.818735 |
| Linear Regression | 0.999753 | 2.390109 | 1.682378 |


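### Above are the test-set results for each model. Every model reaches a high R-squared (0.9496 or above).
### The best is Linear Regression, with an R-squared of 0.9998, an RMSE of about 2.39 USD and an MAE of about 1.68 USD.

R-squared is the share of variance explained, not an accuracy percentage. Since baseFare is among the features and totalFare is baseFare plus taxes and fees, the near-perfect fit probably owes much to that one column.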