<h1 align="center"> Predicting late flight arrivals in US </h1>
<h2 align="center"> W261 - Final Project </h2>
<h5 align="center"> by Team 25: Adam Sohn, Chandra Shekar Bikkanur, Jayesh Parikh, Tucker Anderson</h5>

<h2 align="center"> Business Question:</h2>

Flights that arrive late at their destination, whether because of weather, technical problems, security holds, airspace congestion or air traffic, cause economic loss for airlines and inconvenience for passengers. According to the Federal Aviation Administration (FAA) (https://www.faa.gov/nextgen/programs/weather/faq/), a flight is considered delayed if it arrives 15 minutes or more after its scheduled arrival time, and 69% of all airline delays are due to adverse weather. From an airline operations perspective, predicting whether a flight will arrive late right after it departs gives the airline a head start to act and mitigate losses. For this analysis, we combine airline data with weather data to **predict the arrival delay for a given flight (in minutes) right after it departs the origin station**. This prediction could be fed into other systems such as `MIL8` (application for flight status) or `iReebook` (rebooking/rescheduling portal) to mitigate the impact of delays on the airline. 

We evaluate our regression models on the basis of \\( R^2 \\). An \\( R^2 \\) of 1 indicates a perfect prediction and 0 indicates a model that does no better than always predicting the mean delay; on held-out data the value can even be negative. \\( R^2 \\) gives the proportion of variance in the dependent variable (`Arrival Delay`) that is explained by the model using all independent features together, so a better model explains a larger share of the variance in `Arrival Delay`.
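
To make the metric concrete, the sketch below computes \\( R^2 \\) by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the toy delay values are illustrative assumptions and are not part of the project code.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical arrival delays in minutes (illustrative values only)
actual = [-15.0, 128.0, -14.0, 30.0]
perfect = list(actual)
mean_only = [sum(actual) / len(actual)] * len(actual)
worse_than_mean = [100.0, -50.0, 80.0, -20.0]

r2_perfect = r_squared(actual, perfect)      # exact predictions
r2_mean = r_squared(actual, mean_only)       # always predicting the mean delay
r2_bad = r_squared(actual, worse_than_mean)  # worse than the mean baseline

print(r2_perfect, r2_mean, r2_bad)
```

The three cases mark the reference points of the scale: exact predictions, the mean-only baseline, and a model that is worse than that baseline.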

<h5>Import Libraries & Data:</h5>

For this analysis, we import the following modules and classes from `pyspark` and other data analysis libraries.

In [4]:
# Import libraries
import re
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import time
import functools
import dateutil.parser
import datetime
from math import atan2, cos, sin, radians, degrees

from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType, NullType, ShortType, DateType, BooleanType, BinaryType, TimestampType
from pyspark.sql import SQLContext
from pyspark.sql.functions import col, concat, lit, udf
from pyspark.sql import DataFrameNaFunctions
sqlContext = SQLContext(sc)

from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit, CrossValidator, CrossValidatorModel
from pyspark.ml.stat import ChiSquareTest

Next, we load the airlines, weather, stations and airport codes data for the analysis. The data comes from the following sources:

+ Airlines: Bureau Of Transportation Statistics (https://www.transtats.bts.gov/)
+ Weather: National Centers for Environmental Information (https://www.ncei.noaa.gov/)
+ Stations: Databricks FileStore (dbfs:/mnt/mids-w261/data/DEMO8/gsod/stations.csv.gz)
+ Airport Location: Open Flights Organization (https://openflights.org/data.html)

In [6]:
#Read in airlines, weather, stations, airport codes dataset
airlines = spark.read.option("header", "true").parquet(f"dbfs:/mnt/mids-w261/data/datasets_final_project/parquet_airlines_data/201*.parquet")
weather_parquet = spark.read.option("header", "true")\
                      .parquet(f"dbfs:/mnt/mids-w261/data/datasets_final_project/new_weather_parquet_177/weather201*a.parquet")
stations = spark.read.option("header", "true").csv("dbfs:/mnt/mids-w261/data/DEMO8/gsod/stations.csv.gz")
airport_codes = spark.read.csv('/FileStore/tables/airport_codes.csv', header="true", inferSchema="true")
airport_codes = airport_codes.selectExpr("`IATA Code` as code", "Latitude as lat", "Longitude as lon")

<h2 align="center">EDA & Discussion of Challenges:</h2>

We now conduct an exploratory data analysis of the above datasets to gain deeper insight into the data and its structure.

<h5>Airlines Data:</h5>

In [10]:
airlines.printSchema() # Check the data structure of airlines dataset

In [11]:
airlines.count() # Check for total number of records in airlines dataset

In [12]:
len(airlines.columns) # Check for total number of columns in airlines dataset

In [13]:
display(airlines.sample(fraction=0.0000001, seed=0)) # sample and display a fraction of records from airlines dataset (keyword arguments make the fraction and seed explicit)

YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,FL_DATE,OP_UNIQUE_CARRIER,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_FIPS,ORIGIN_STATE_NM,ORIGIN_WAC,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_FIPS,DEST_STATE_NM,DEST_WAC,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DELAY_NEW,DEP_DEL15,DEP_DELAY_GROUP,DEP_TIME_BLK,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DELAY_NEW,ARR_DEL15,ARR_DELAY_GROUP,ARR_TIME_BLK,CANCELLED,DIVERTED,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY
2016,1,2,27,6,2016-02-27,AA,14679,1467903,33570,SAN,"San Diego, CA",CA,6,California,91,14107,1410702,30466,PHX,"Phoenix, AZ",AZ,4,Arizona,81,1930,1919,-11.0,0.0,0.0,-1,1900-1959,15.0,1934,2126,7.0,2148,2133,-15.0,0.0,0.0,-1,2100-2159,False,False,78.0,74.0,52.0,1.0,304.0,2,,,,,
2019,2,4,22,1,2019-04-22,WN,11292,1129202,30325,DEN,"Denver, CO",CO,8,Colorado,82,10140,1014005,30140,ABQ,"Albuquerque, NM",NM,35,New Mexico,86,2240,52,132.0,132.0,1.0,8,2200-2259,9.0,101,158,5.0,2355,203,128.0,128.0,1.0,8,2300-2359,False,False,75.0,71.0,57.0,1.0,349.0,2,0.0,0.0,0.0,0.0,128.0
2018,2,6,28,4,2018-06-28,AA,14307,1430705,30721,PVD,"Providence, RI",RI,44,Rhode Island,15,11057,1105703,31057,CLT,"Charlotte, NC",NC,37,North Carolina,36,1116,1112,-4.0,0.0,0.0,-1,1100-1159,12.0,1124,1306,10.0,1330,1316,-14.0,0.0,0.0,-1,1300-1359,False,False,134.0,124.0,102.0,1.0,683.0,3,,,,,


In [14]:
display(airlines.describe()) # descriptive statistics of the airlines dataset 

summary,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,OP_UNIQUE_CARRIER,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,ORIGIN,ORIGIN_CITY_NAME,ORIGIN_STATE_ABR,ORIGIN_STATE_FIPS,ORIGIN_STATE_NM,ORIGIN_WAC,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID,DEST,DEST_CITY_NAME,DEST_STATE_ABR,DEST_STATE_FIPS,DEST_STATE_NM,DEST_WAC,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DELAY_NEW,DEP_DEL15,DEP_DELAY_GROUP,DEP_TIME_BLK,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DELAY_NEW,ARR_DEL15,ARR_DELAY_GROUP,ARR_TIME_BLK,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY
count,31746841.0,31746841.0,31746841.0,31746841.0,31746841.0,31746841,31746841.0,31746841.0,31746841.0,31746841,31746841,31746841,31746841.0,31746841,31746841.0,31746841.0,31746841.0,31746841.0,31746841,31746841,31746841,31746841.0,31746841,31746841.0,31746841.0,31274521.0,31269545.0,31269545.0,31269545.0,31269545.0,31746841,31260424.0,31260429.0,31244917.0,31244917.0,31746841.0,31244919.0,31176201.0,31176201.0,31176201.0,31176201.0,31746841,31746677.0,31178799.0,31178799.0,31746841.0,31746841.0,31746841.0,5799114.0,5799114.0,5799114.0,5799114.0,5799114.0
mean,2017.1512498204152,2.51748770846208,6.552106365480585,15.749554640727876,3.9346285509162944,,12668.724409461716,1266875.803290192,31729.315288031336,,,,26.35374732245013,,54.91906164774001,12668.666651116562,1266870.0274082704,31729.2951808339,,,,26.354102948384693,,54.919218135750896,1330.0884999550035,1334.2122192375064,9.855285614165476,12.909587811399238,0.1820794322398998,0.0360368850905889,,16.830789563186986,1356.9563268309594,1464.4766360877195,7.5604571777227,1488.9034405659447,1468.8957719173477,4.615475952313754,12.966188215170924,0.1860109575249402,-0.2096807112579239,,143.2167191860742,138.22906985609035,113.8502422431345,1.0,823.2170183483768,3.765292206553717,19.98459350859459,3.2259498606166392,15.44036813209742,0.0891679315150555,25.364284785572416
stddev,1.4316532810214986,1.1053295681781927,3.399430256141529,8.77423808835453,1.9917635387471784,,1526.7397787182156,152673.7066902925,1289.4588026200722,,,,16.539517798596844,,26.577828324534646,1526.721213157486,152671.85014169718,1289.419206153187,,,,16.53967926196839,,26.57807966993097,489.86848319644014,503.29228877418456,43.5052029370407,42.44165318434855,0.3859099860819423,2.161932356946247,,9.488981863443776,504.9367808166725,531.9873729297798,5.92997944817499,516.8048646426244,536.3586689058153,45.59418015238943,42.14088584758871,0.3891155176322491,2.2975645036344488,,74.73117735923346,74.33716296557806,72.24024903973572,0.0,607.6826683052022,2.392350188769286,59.30797970625765,26.81202538233581,34.73908233877255,2.9147981743398192,48.60358147038267
min,2015.0,1.0,1.0,1.0,1.0,9E,10135.0,1013503.0,30070.0,ABE,"Aberdeen, SD",AK,1.0,Alabama,1.0,10135.0,1013503.0,30070.0,ABE,"Aberdeen, SD",AK,1.0,Alabama,1.0,1.0,1.0,-234.0,0.0,0.0,-2.0,0001-0559,0.0,1.0,1.0,0.0,1.0,1.0,-238.0,0.0,0.0,-2.0,0001-0559,-99.0,14.0,4.0,1.0,21.0,1.0,0.0,0.0,0.0,0.0,0.0
max,2019.0,4.0,12.0,31.0,7.0,YX,16869.0,1686901.0,36133.0,YUM,"Yuma, AZ",WY,78.0,Wyoming,93.0,16869.0,1686901.0,36133.0,YUM,"Yuma, AZ",WY,78.0,Wyoming,93.0,2359.0,2400.0,2755.0,2755.0,1.0,12.0,2300-2359,227.0,2400.0,2400.0,414.0,2400.0,2400.0,2695.0,2695.0,1.0,12.0,2300-2359,948.0,1604.0,1557.0,1.0,5095.0,11.0,2695.0,2692.0,1848.0,1078.0,2454.0


The descriptive statistics above show that the airlines dataset contains a mix of categorical and numerical columns, and that some columns have missing values.

As part of the EDA, we need to check for null/NaN values in the dataset. Where nulls are present, we need to know what proportion of each column is missing, because this determines the imputation strategy for missing values.

In [16]:
def nullDataFrame(df):
  '''
  Returns a pandas dataframe consisting of column names, null counts and percentage of null values for the given dataframe, 'df' 
  '''
  null_feature_list = []
  count = df.count()
  for column in df.columns:
    nulls = df.filter(df[column].isNull()).count() # triggers one Spark job per column
    nulls_perct = np.round((nulls/count)*100, 2)
    null_feature_list.append([column, nulls, nulls_perct])
  # Build the frame from the list directly; wrapping it in np.array would cast every value to a string
  nullCounts_df = pd.DataFrame(null_feature_list, columns=['Feature_Name', 'Null_Counts', 'Percentage_Null_Counts'])
  return nullCounts_df

airlines_raw_nullCounts_df = nullDataFrame(airlines)
airlines_raw_nullCounts_df

Unnamed: 0,Feature_Name,Null_Counts,Percentage_Null_Counts
0,YEAR,0,0.0
1,QUARTER,0,0.0
2,MONTH,0,0.0
3,DAY_OF_MONTH,0,0.0
4,DAY_OF_WEEK,0,0.0
5,FL_DATE,0,0.0
6,OP_UNIQUE_CARRIER,0,0.0
7,ORIGIN_AIRPORT_ID,0,0.0
8,ORIGIN_AIRPORT_SEQ_ID,0,0.0
9,ORIGIN_CITY_MARKET_ID,0,0.0


The null-value dataframe above shows that most columns have only a small proportion of missing values (at most ~2%), except for `CARRIER_DELAY`, `WEATHER_DELAY`, `NAS_DELAY`, `SECURITY_DELAY` and `LATE_AIRCRAFT_DELAY`, where more than 81% of the values are missing.
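
These percentages can be cross-checked against the `count` row of the `describe()` output, which reports the number of non-null values per column. The following is a minimal sketch of that arithmetic using counts copied from the table above; the helper name `null_percentage` is an illustrative assumption.

```python
TOTAL_ROWS = 31746841  # count reported for fully populated columns such as YEAR

# Non-null counts taken from the `count` row of airlines.describe()
non_null_counts = {
    "DEP_DELAY": 31269545,
    "ARR_DELAY": 31176201,
    "CARRIER_DELAY": 5799114,
}

def null_percentage(non_null, total=TOTAL_ROWS):
    """Percentage of rows in which the column is null."""
    return round((total - non_null) / total * 100, 2)

null_pct = {name: null_percentage(n) for name, n in non_null_counts.items()}
for name, pct in null_pct.items():
    print(f"{name}: {pct}% null")
```

All five delay-cause columns share the same non-null count in the `describe()` output, so the same percentage applies to each of them.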

Next, we plot histograms of all numerical features in the airlines dataset to see how the data is distributed. To keep this manageable, we sample a fraction (0.0001) of the original airlines data and plot the histograms from that sample.

In [18]:
sample_airlines_df = airlines.sample(False, 0.0001, 2020) # Sample a fraction of the airlines data 
airlines_pandas_df = sample_airlines_df.toPandas() # Converting spark SQL dataframe to pandas dataframe for plotting Histograms

In [19]:
numeric_features = [x[0] for x in airlines.dtypes if x[1] == 'int' or x[1] == 'double'] # Retrieving only numeric features
airlines_pandas_df[numeric_features].hist(figsize=(30,30), bins=50)
display(plt.show())

The histograms above show a variety of shapes across the numeric features: some are approximately normal, some roughly uniform, and others right-skewed or left-skewed.

Next, we create a correlation matrix to see which features are positively correlated, negatively correlated or uncorrelated with the dependent variable, `ARR_DELAY`.
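
As a reminder of what each cell in such a matrix represents, here is a minimal sketch of the Pearson correlation coefficient in plain Python. It drops pairs where either value is missing, which mirrors how pandas `DataFrame.corr()` excludes nulls; the helper name `pearson` and the toy values are illustrative assumptions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation over pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)

# Hypothetical departure and arrival delays in minutes (illustrative values only)
dep_delay = [-11.0, 132.0, -4.0, 20.0, None]
arr_delay = [-15.0, 128.0, -14.0, 25.0, 3.0]

r = pearson(dep_delay, arr_delay)  # the last pair is skipped because dep_delay is missing
print(round(r, 3))
```

Because pairs with a missing value are dropped, correlations that involve the sparsely populated delay-cause columns are computed only from the roughly 18% of rows in which those columns have a value.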

In [21]:
airlines_pandas_df.corr() # Pairwise correlation matrix of all numeric features (including ARR_DELAY)

Unnamed: 0,YEAR,QUARTER,MONTH,DAY_OF_MONTH,DAY_OF_WEEK,ORIGIN_AIRPORT_ID,ORIGIN_AIRPORT_SEQ_ID,ORIGIN_CITY_MARKET_ID,ORIGIN_STATE_FIPS,ORIGIN_WAC,DEST_AIRPORT_ID,DEST_AIRPORT_SEQ_ID,DEST_CITY_MARKET_ID,DEST_STATE_FIPS,DEST_WAC,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,DEP_DELAY_NEW,DEP_DEL15,DEP_DELAY_GROUP,TAXI_OUT,WHEELS_OFF,WHEELS_ON,TAXI_IN,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,ARR_DELAY_NEW,ARR_DEL15,ARR_DELAY_GROUP,CANCELLED,DIVERTED,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,FLIGHTS,DISTANCE,DISTANCE_GROUP,CARRIER_DELAY,WEATHER_DELAY,NAS_DELAY,SECURITY_DELAY,LATE_AIRCRAFT_DELAY
YEAR,1.0,-0.023964,-0.022491,-0.007307,0.032029,-0.032237,-0.032234,-0.000268,0.036116,-0.043534,-0.004743,-0.00474,0.007205,0.044205,-0.053312,0.024911,0.017232,-0.027968,-0.0263,-0.015745,-0.019021,0.055078,0.016186,-0.002382,0.021339,0.002688,-0.00283,-0.02502,-0.021542,-0.00988,-0.02359,0.001103,-0.003404,0.017987,0.015116,0.006332,,0.004652,0.004249,-0.043138,-0.072671,0.040554,0.041112,-0.00913
QUARTER,-0.023964,1.0,0.971104,-0.000317,0.007619,-0.014616,-0.014616,-0.005211,-0.005331,-0.021934,-0.023697,-0.023697,-0.007301,0.01543,0.000916,-0.0315,-0.03806,-0.043209,-0.0427,-0.045323,-0.052936,-0.024982,-0.035004,-0.037289,0.011482,-0.035696,-0.037232,-0.043613,-0.043244,-0.05082,-0.054218,-0.045627,-0.008022,-0.006456,-0.008733,-0.006493,,-0.002265,0.000755,-0.058572,0.025753,-0.013076,0.03357,-0.003006
MONTH,-0.022491,0.971104,1.0,0.002864,0.011495,-0.009463,-0.009463,-0.007782,-0.000326,-0.025366,-0.022373,-0.022373,-0.006931,0.020669,0.002173,-0.031618,-0.040719,-0.040609,-0.039772,-0.040259,-0.048662,-0.021117,-0.038202,-0.044565,0.011818,-0.037658,-0.044487,-0.042914,-0.043124,-0.046117,-0.052005,-0.045996,-6e-06,-0.005699,-0.007302,-0.005587,,-0.002628,0.000531,-0.05389,0.029388,-0.012077,0.030341,-0.021316
DAY_OF_MONTH,-0.007307,-0.000317,0.002864,1.0,0.017867,0.014004,0.014004,0.010799,0.022461,0.016925,0.005518,0.005518,-0.002448,-0.000102,-0.001188,-0.014095,-0.013099,0.011964,0.011224,0.00334,0.00804,-0.008494,-0.022608,-0.031264,0.016419,-0.014676,-0.021905,0.014361,0.013408,0.013648,0.013173,0.011098,0.001873,0.007335,0.00926,0.009371,,0.006975,0.005997,-0.008347,-0.003131,-0.044797,-0.054705,0.079791
DAY_OF_WEEK,0.032029,0.007619,0.011495,0.017867,1.0,0.028071,0.028071,0.023936,-0.003751,0.006944,-0.010144,-0.010144,-0.018529,-0.013169,0.009163,-0.019032,-0.017127,0.009819,0.010484,0.005107,0.002475,-0.040331,-0.022003,-0.022137,0.015048,-0.014357,-0.018589,-0.000909,0.007808,0.009725,-0.000494,0.007637,-0.000473,0.048105,0.035618,0.041394,,0.050705,0.052722,0.027475,0.057209,-0.001755,-0.007218,-0.051268
ORIGIN_AIRPORT_ID,-0.032237,-0.014616,-0.009463,0.014004,0.028071,1.0,1.0,0.636801,-0.072113,0.299136,0.047763,0.047763,0.001126,-0.062905,0.148701,-0.03043,-0.025472,0.001796,0.002165,0.007496,0.000359,-0.021272,-0.030065,-0.001464,0.016093,-0.004275,-0.002272,0.005978,0.001159,-0.001144,0.005921,-0.004175,-0.008449,0.045136,0.046166,0.049366,,0.070136,0.074029,-0.018266,-0.028921,0.051608,-0.036194,-0.00527
ORIGIN_AIRPORT_SEQ_ID,-0.032234,-0.014616,-0.009463,0.014004,0.028071,1.0,1.0,0.6368,-0.072114,0.299136,0.047764,0.047764,0.001126,-0.062905,0.1487,-0.030429,-0.025471,0.001796,0.002165,0.007496,0.000359,-0.021271,-0.030065,-0.001464,0.016093,-0.004275,-0.002271,0.005978,0.001159,-0.001144,0.005921,-0.004175,-0.008449,0.045136,0.046167,0.049366,,0.070136,0.074029,-0.018267,-0.028922,0.051608,-0.036193,-0.00527
ORIGIN_CITY_MARKET_ID,-0.000268,-0.005211,-0.007782,0.010799,0.023936,0.636801,0.6368,1.0,0.042448,0.098846,-0.010251,-0.010251,-0.067665,-0.05622,0.038953,-0.02148,-0.017146,-0.007418,0.000686,-0.026361,-0.014995,-0.051242,-0.022566,-0.00422,0.068493,-0.008283,-0.007569,-0.003542,0.002458,-0.026217,-0.010814,0.014223,-0.007664,0.004067,0.004288,0.006368,,0.013098,0.016578,-0.004086,-0.01222,0.063343,0.02139,0.046291
ORIGIN_STATE_FIPS,0.036116,-0.005331,-0.000326,0.022461,-0.003751,-0.072113,-0.072114,0.042448,1.0,-0.03267,-0.053603,-0.053603,-0.077605,0.029607,-0.06508,-0.027567,-0.026426,-0.026261,-0.020406,-0.027378,-0.029905,0.030748,-0.020369,-0.012682,0.043254,-0.016571,-0.015601,-0.025081,-0.013111,-0.012128,-0.033599,0.024905,0.000184,0.011967,0.010703,0.003548,,-0.021189,-0.025378,0.034296,-0.022391,0.041051,-0.028536,-0.071792
ORIGIN_WAC,-0.043534,-0.021934,-0.025366,0.016925,0.006944,0.299136,0.299136,0.098846,-0.03267,1.0,0.123715,0.123715,0.040373,-0.050955,0.457458,-0.03756,-0.035965,-0.00643,-0.009046,-0.008122,-0.010665,-0.076174,-0.045583,0.0131,-0.009549,0.009247,0.015759,0.008487,-0.012983,-0.00634,-0.001769,-0.007525,0.012282,-0.04365,-0.041734,-0.031806,,0.027871,0.027277,0.021808,-0.025508,-0.006519,-0.034485,-0.040636


The correlation matrix above shows that `ARR_DELAY` is highly correlated with `DEP_DELAY`, `CARRIER_DELAY`, `WEATHER_DELAY`, `NAS_DELAY` and `LATE_AIRCRAFT_DELAY`, among other features. For our supervised machine learning model, we will use these features for training, along with other features chosen for their relevance and on the basis of domain knowledge.

<h5>Weather Data:</h5>

In [24]:
weather_parquet.printSchema() # data structure of weather data

In [25]:
weather_parquet.count() # to check the number of records in weather data

In [26]:
len(weather_parquet.columns) # Check the number of columns in weather data

<h5>Airport Codes Data:</h5>

In [28]:
airport_codes.printSchema() # data structure of airport_codes data

In [29]:
airport_codes.count() # to check the number of records in airport_codes data

In [30]:
len(airport_codes.columns) # Check the number of columns in airport_codes data

In [31]:
display(airport_codes.sample(fraction=0.001, seed=0)) # sample a fraction of records for data insight (keyword arguments make the fraction and seed explicit)

code,lat,lon
FYT,17.91710091,19.11109924
LXR,25.671,32.7066
TLH,30.39649963,-84.35030365
TYF,60.1576004,12.99129963
MOI,-19.84250069,-157.7030029
KTR,-14.52110004,132.378006
\N,49.23570251,140.1931
LNX,54.824,32.025
\N,49.79059982,30.44140053


In [32]:
display(airport_codes.describe()) # descriptive statistics of the airport_codes dataset

summary,code,lat,lon
count,7698,7698.0,7698.0
mean,,25.808442484891227,-1.3905462050833943
stddev,,28.404945978641383,86.51916220191839
min,AAA,-90.0,-179.8769989
max,\N,89.5,179.951004


In [33]:
airport_codes_raw_nullCounts_df = nullDataFrame(airport_codes) # Check to see if null values are present in the airport_codes data set
airport_codes_raw_nullCounts_df

Unnamed: 0,Feature_Name,Null_Counts,Percentage_Null_Counts
0,code,0,0.0
1,lat,0,0.0
2,lon,0,0.0


<h2 align="center"> Feature Engineering:</h2>

<h5>Preprocess Airlines Data:</h5> 

Before we build prediction models for `ARR_DELAY`, we need to clean the airlines data and handle the missing values. The preprocessing steps are: 
+ Introduce new features in the airlines dataset: `IS_WEEKEND`, `DEP_RUSH_HOUR` and `ARR_RUSH_HOUR`. 
+ Introduce 2 further features, `ORIGIN_CARRIER` and `DEST_CARRIER`, which are interaction terms of the origin and destination airports with `OP_UNIQUE_CARRIER`. 
+ Filter out (remove) records where `CANCELLED` or `DIVERTED` is True.
+ Convert `CARRIER_DELAY`, `WEATHER_DELAY`, `NAS_DELAY`, `SECURITY_DELAY` and `LATE_AIRCRAFT_DELAY` to binary indicators (1 if a value is recorded, 0 if it is null) instead of using their numeric values.
+ Fill missing `ARR_DELAY` and `DEP_DELAY` values with 0.
+ Remove all columns that are not relevant for predicting `ARR_DELAY`.

In [36]:
def is_Weekend(x):
  """
  Function to determine if a given day of the week (1 = Monday ... 7 = Sunday) is a weekend day (Friday, Saturday, Sunday)
  """
  if   x < 5: 
    return 0
  else: 
    return 1

def is_RushHour(x):
  """
  Function to determine if a given time of the day (HHMM) is rush hour (1600-2100, inclusive)
  """
  if (x is not None) and (x >= 1600) and (x <= 2100): 
    return 1
  else: 
    return 0
 
def preprocessAirlines(df):
  cols_to_keep = ['MONTH', 'DAY_OF_WEEK', 'FL_DATE', 'OP_UNIQUE_CARRIER', 'ORIGIN', 'DEST', 'DEP_DELAY', 'DEP_TIME_BLK', 'ARR_DELAY', 'ARR_TIME_BLK', 'CRS_ELAPSED_TIME', 'DISTANCE',  'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY', 'IS_WEEKEND', 'DEP_RUSH_HOUR', 'ARR_RUSH_HOUR', 'DEP_TIME', 'CRS_DEP_TIME', 'ARR_TIME', 'CRS_ARR_TIME']
  cols_to_remove = [x for x in df.columns if x not in cols_to_keep]
  df = df.orderBy("FL_DATE") 
  df = df.filter(df.CANCELLED == False)
  df = df.filter(df.DIVERTED == False)
  df = df.withColumn('CARRIER_DELAY', f.when(df.CARRIER_DELAY.isNotNull(), 1).otherwise(0))
  df = df.withColumn('WEATHER_DELAY', f.when(df.WEATHER_DELAY.isNotNull(), 1).otherwise(0))
  df = df.withColumn('NAS_DELAY', f.when(df.NAS_DELAY.isNotNull(), 1).otherwise(0))
  df = df.withColumn('SECURITY_DELAY', f.when(df.SECURITY_DELAY.isNotNull(), 1).otherwise(0))
  df = df.withColumn('LATE_AIRCRAFT_DELAY', f.when(df.LATE_AIRCRAFT_DELAY.isNotNull(), 1).otherwise(0))
  df = df.withColumn("IS_WEEKEND", f.udf(is_Weekend, IntegerType())("DAY_OF_WEEK"))
  df = df.withColumn("DEP_RUSH_HOUR", f.udf(is_RushHour, IntegerType())("DEP_TIME"))
  df = df.withColumn("ARR_RUSH_HOUR", f.udf(is_RushHour, IntegerType())("CRS_ARR_TIME"))
  df = df.fillna(0, subset=['ARR_DELAY', 'DEP_DELAY'])  
  df = df.withColumn('ORIGIN_CARRIER', concat(col("ORIGIN"), lit("_"), col("OP_UNIQUE_CARRIER")))
  df = df.withColumn('DEST_CARRIER', concat(col("DEST"), lit("_"), col("OP_UNIQUE_CARRIER")))
  preprocessAirlines_df = df.drop(*cols_to_remove)
  return preprocessAirlines_df
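
To illustrate what these derived columns look like for individual flights, the sketch below applies the same rules in plain Python to values copied from two of the sampled rows shown earlier (SAN to PHX and DEN to ABQ). The lowercase helper names and the `delay_flag` function are illustrative stand-ins for the Spark expressions above, not part of the project code.

```python
def is_weekend(day_of_week):
    """1 for days 5-7 (Friday to Sunday), else 0, as in is_Weekend."""
    return 0 if day_of_week < 5 else 1

def is_rush_hour(hhmm):
    """1 if the HHMM time falls within 1600-2100 inclusive, else 0."""
    return 1 if hhmm is not None and 1600 <= hhmm <= 2100 else 0

def delay_flag(value):
    """1 if a delay-cause value is recorded (not null), else 0."""
    return 0 if value is None else 1

# Values copied from two rows of the airlines sample output
flights = [
    {"route": "SAN-PHX", "DAY_OF_WEEK": 6, "DEP_TIME": 1919, "CRS_ARR_TIME": 2148,
     "CARRIER_DELAY": None, "LATE_AIRCRAFT_DELAY": None},
    {"route": "DEN-ABQ", "DAY_OF_WEEK": 1, "DEP_TIME": 52, "CRS_ARR_TIME": 2355,
     "CARRIER_DELAY": 0.0, "LATE_AIRCRAFT_DELAY": 128.0},
]

features = {
    flight["route"]: {
        "IS_WEEKEND": is_weekend(flight["DAY_OF_WEEK"]),
        "DEP_RUSH_HOUR": is_rush_hour(flight["DEP_TIME"]),
        "ARR_RUSH_HOUR": is_rush_hour(flight["CRS_ARR_TIME"]),
        "CARRIER_DELAY": delay_flag(flight["CARRIER_DELAY"]),
        "LATE_AIRCRAFT_DELAY": delay_flag(flight["LATE_AIRCRAFT_DELAY"]),
    }
    for flight in flights
}

for route, values in features.items():
    print(route, values)
```

Note that a recorded value of `0.0` still yields a flag of 1, because the indicator tests only whether the column is null. Since the five delay-cause columns share the same non-null count in the `describe()` output, the five flags appear to carry the same information: whether any delay cause was recorded for the flight.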

<h5>Airlines & Weather Data Merge:</h5>

Now, we reduce the weather dataset to United States data only and merge it with the airlines data. The airport_codes dataset provides the link between the two: its airport coordinates let us match each airport in the airlines data to the closest weather station.
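
The matching logic can be sketched in plain Python as follows: compute the great-circle distance from an airport to every station, keep the nearest one, and discard it if it lies beyond a cutoff. The sketch uses the same distance formula, the same Earth radius of 3959 miles and the same 50-mile cutoff as the notebook code; the function names, the station IDs and the station coordinates are hypothetical.

```python
from math import acos, cos, sin, radians

EARTH_RADIUS_MILES = 3959   # same constant as in the SQL distance query
MAX_DISTANCE_MILES = 50.0   # same cutoff as MAX_ALLOWABLE_WEATHER_DISTANCE

def great_circle_miles(lat_a, lon_a, lat_b, lon_b):
    """Great-circle distance in miles (spherical law of cosines)."""
    lat_a, lon_a, lat_b, lon_b = map(radians, (lat_a, lon_a, lat_b, lon_b))
    cos_angle = (cos(lat_a) * cos(lat_b) * cos(lon_b - lon_a)
                 + sin(lat_a) * sin(lat_b))
    # Clamp to [-1, 1] to guard against floating-point rounding before acos
    return EARTH_RADIUS_MILES * acos(max(-1.0, min(1.0, cos_angle)))

def closest_station(airport, stations, max_miles=MAX_DISTANCE_MILES):
    """Return (station_id, miles) for the nearest station within max_miles, else None."""
    best = min(
        ((sid, great_circle_miles(airport[0], airport[1], lat, lon))
         for sid, lat, lon in stations),
        key=lambda pair: pair[1],
    )
    return best if best[1] < max_miles else None

# TLH coordinates come from the airport_codes sample; the stations are hypothetical
tlh = (30.39649963, -84.35030365)
stations = [("STATION_A", 30.40, -84.35), ("STATION_B", 32.00, -86.00)]

nearest = closest_station(tlh, stations)
too_far = closest_station(tlh, [("STATION_B", 32.00, -86.00)])
print(nearest, too_far)
```

In the notebook the same pairing is done at scale with a Spark SQL `CROSS JOIN` followed by a `row_number()` window per airport; the sketch only illustrates the per-airport logic.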

In [38]:
def unionAll_fn(dfs):
    return functools.reduce(lambda df1,df2: df1.union(df2.select(df1.columns)), dfs) 

def US_fn(df):
    """
    Reduce df to US only to reduce size of dataset
    """
    # Each US area is a lat/long bounding box in the format: [(lat_low, lat_high), (long_low, long_high)]
    # Note: the last box has a latitude upper bound of 175, above the maximum valid latitude of 90, so it effectively has no upper latitude limit
    US = [[(24,49),(-125,-67)],[(17,19),(-68,-65.5)], [(13,14),(144,145)], [(15,16),(145,146)], [(-15,-14), (-171,-170)], [(18,19),(-65.4,-64)], [(18,23),(-160,-154)], [(50,175),(-170,-103)]]  

    list_df = [] # one filtered dataframe per bounding box

    #Filtering for individual areas in US
    for item in US:
      parquet_part = df.filter((f.col('Latitude') > item[0][0]) & (f.col('Latitude') < item[0][1]) & (f.col('Longitude') > item[1][0]) & (f.col('Longitude') < item[1][1]))
      list_df.append(parquet_part)
    
    #Appending each individual US area
    parquet_us = unionAll_fn(list_df)

    return parquet_us

def reduce_split_cols_fn(weather_parquet_us):
    """
    Reduce weather dataset to columns of interest and split each comma-separated column into one column per comma-separated value.
    """
    #Reduce weather dataset to columns of interest (high level) and split the comma-separated columns into arrays
    weather_pre_split = weather_parquet_us.select('STATION','DATE','SOURCE','LATITUDE','LONGITUDE',f.split('WND', ',').alias('WND'),f.split('VIS', ',').alias('VIS'),f.split('SLP', ',').alias('SLP'),f.split('AA1', ',').alias('AA1'))
    df_sizes_WND = weather_pre_split.select(f.size('WND').alias('WND'))
    df_sizes_VIS = weather_pre_split.select(f.size('VIS').alias('VIS'))
    df_sizes_SLP = weather_pre_split.select(f.size('SLP').alias('SLP'))
    df_sizes_AA1 = weather_pre_split.select(f.size('AA1').alias('AA1'))
    df_max_WND = df_sizes_WND.agg(f.max('WND'))
    df_max_VIS = df_sizes_VIS.agg(f.max('VIS'))
    df_max_SLP = df_sizes_SLP.agg(f.max('SLP'))
    df_max_AA1 = df_sizes_AA1.agg(f.max('AA1'))
    nb_columns_WND = df_max_WND.collect()[0][0]
    nb_columns_VIS = df_max_VIS.collect()[0][0]
    nb_columns_SLP = df_max_SLP.collect()[0][0]
    nb_columns_AA1 = df_max_AA1.collect()[0][0]
    weather_post_split = weather_pre_split.select('STATION','DATE','SOURCE','LATITUDE','LONGITUDE',*[weather_pre_split['WND'][i] for i in range(nb_columns_WND)],*[weather_pre_split['VIS'][i] for i in range(nb_columns_VIS)],*[weather_pre_split['SLP'][i] for i in range(nb_columns_SLP)],*[weather_pre_split['AA1'][i] for i in range(nb_columns_AA1)])
  
    #Filtering out data with quality issues: missing-value sentinels (e.g. 999, 9999) and quality codes flagged as suspect or erroneous (2, 3, 6, 7). All values are compared as strings
    fltr_msk = [
    f.col('WND[0]') != '999',
    f.col('WND[1]') != '2',
    f.col('WND[1]') != '3',
    f.col('WND[1]') != '6',
    f.col('WND[1]') != '7',
    f.col('WND[2]') != '9',
    f.col('WND[3]') != '9999',  
    f.col('WND[4]') != '2',
    f.col('WND[4]') != '3',
    f.col('WND[4]') != '6',
    f.col('WND[4]') != '7',
    f.col('VIS[0]') != '999999',
    f.col('VIS[1]') != '2',
    f.col('VIS[1]') != '3',
    f.col('VIS[1]') != '6',
    f.col('VIS[1]') != '7',
    f.col('VIS[2]') != '9',
    f.col('VIS[3]') != '2',
    f.col('VIS[3]') != '3',
    f.col('VIS[3]') != '6',
    f.col('VIS[3]') != '7',
    f.col('SLP[0]') != '99999',
    f.col('SLP[1]') != '2',
    f.col('SLP[1]') != '3',
    f.col('SLP[1]') != '6',
    f.col('SLP[1]') != '7',
    f.col('SLP[1]') != '9',
    f.col('AA1[0]') != '99',
    f.col('AA1[1]') != '9999',
    f.col('AA1[2]') != '9',
    f.col('AA1[3]') != '2',
    f.col('AA1[3]') != '3',
    f.col('AA1[3]') != '6',
    f.col('AA1[3]') != '7'
    ]
    weather_fltr = weather_post_split
    for i in fltr_msk:
      weather_fltr = weather_fltr.filter(i)

    #Reduce weather dataset to columns of interest (low level)
    weather_fltr_drop = weather_fltr.select('STATION','DATE','SOURCE','LATITUDE','LONGITUDE','WND[0]', 'WND[3]','VIS[0]','SLP[0]','AA1[0]')
    weather_fltr_drop = weather_fltr_drop.withColumnRenamed("DATE", "TIMESTAMP")

    return weather_fltr_drop

def distinct_station_fn(weather_fltr_drop):
    """
    For df input, return distinct stations for calculating closest stations to airports
    """
    weather_fltr_drop_distinct = weather_fltr_drop.select("STATION", "LATITUDE", "LONGITUDE").distinct()
    return weather_fltr_drop_distinct

def haversine_join_station_aircode_fn(airport_codes_df, weather_df):
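    """
    For df inputs, cross join airports with weather stations and return the great-circle distance in miles
    for every airport/station pair (spherical law of cosines with an Earth radius of 3959 miles)
    """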
    airport_codes_df.createOrReplaceTempView('airport_codes_us')
    weather_df.createOrReplaceTempView('stations_all')
    distance_query = "(SELECT airport_codes_us.code, stations_all.STATION, airport_codes_us.lat AS airport_lat, airport_codes_us.lon AS airport_lon, ( 3959 * acos(cos(radians(airport_codes_us.lat) ) * cos( radians( stations_all.LATITUDE ) ) * cos( radians( stations_all.LONGITUDE ) - radians(airport_codes_us.lon) ) + sin(radians(airport_codes_us.lat) ) * sin( radians( stations_all.LATITUDE ) ) ) ) AS airport_station_distance FROM airport_codes_us CROSS JOIN stations_all)"
    airports_stations_distance_all = spark.sql(distance_query)
    return airports_stations_distance_all
  
def airports_closest_stations_fn(airports_stations_distance_all):
    """
    For df input, return df with the closest weather station to each airport,
    keeping only stations within MAX_ALLOWABLE_WEATHER_DISTANCE miles
    """

    airports_stations_distance_all.createOrReplaceTempView('airports_stations_distance')
  
    # Rank stations per airport by distance and keep the nearest one (seqnum = 1)
    min_distance_query = "(SELECT code AS airport_code, STATION AS station_code, airport_lat, airport_lon, airport_station_distance FROM (SELECT *, row_number() over (partition by code order by airport_station_distance ASC) as seqnum from airports_stations_distance) airports_stations_distance where seqnum = 1)"
    airports_closest_station = spark.sql(min_distance_query)    

    MAX_ALLOWABLE_WEATHER_DISTANCE = 50.0 # miles
    airports_closest_station_filtered = airports_closest_station.filter(airports_closest_station.airport_station_distance < MAX_ALLOWABLE_WEATHER_DISTANCE)
    return airports_closest_station_filtered
  
def bearingClass_fn(flight_bearing, denominations=8):
    """
    Map a bearing in degrees (clockwise from true north) to an 8-point compass class
    """
    denom = 360/denominations
    # atan2-based bearings fall in (-180, 180]; normalize to [0, 360) so that westbound bearings are not all classified as "N"
    bearing = int(flight_bearing) % 360
        
    if (bearing < 0 + denom/2) or (bearing > (7*denom) + (denom/2)):
      flight_bearing_class = "N"
    elif bearing <= denom + (denom/2):
      flight_bearing_class = "NE"
    elif bearing <= (2*denom) + (denom/2):
      flight_bearing_class = "E"
    elif bearing <= (3*denom) + (denom/2):
      flight_bearing_class = "SE"
    elif bearing <= (4*denom) + (denom/2):
      flight_bearing_class = "S"
    elif bearing <= (5*denom) + (denom/2):
      flight_bearing_class = "SW"
    elif bearing <= (6*denom) + (denom/2):
      flight_bearing_class = "W"
    elif bearing <= (7*denom) + (denom/2):
      flight_bearing_class = "NW"
    else:
      flight_bearing_class = "UNK"
      
    return flight_bearing_class
  
udfBearingClass_fn = udf(bearingClass_fn, StringType())

def bearingCalculation_fn(lat_a, lon_a, lat_b, lon_b):  
    lat_a_r, lat_b_r, lon_a_r, lon_b_r = radians(lat_a), radians(lat_b), radians(lon_a), radians(lon_b)
    delta_lon = lon_b - lon_a
    delta_lon_r = lon_b_r - lon_a_r
    X = cos(lat_b_r) * sin(delta_lon_r)
    Y = cos(lat_a_r) * sin(lat_b_r) - sin(lat_a_r) * cos(lat_b_r) * cos(delta_lon_r)
  
    flight_bearing = degrees(atan2(X, Y))
        
    flight_bearing_class = bearingClass_fn(flight_bearing)
  
    return flight_bearing_class
udfBearingCalculation_fn = udf(bearingCalculation_fn, StringType())

def join_closest_weather_airlines_fn(airlines_df, airports_closest_station_filtered):

    # add closest weather station for the origin airport to airlines dataset
    airlines_station_origin_filtered = airlines_df.join(airports_closest_station_filtered, airlines_df.ORIGIN == airports_closest_station_filtered.airport_code, how="inner")
    airlines_station_origin_filtered = airlines_station_origin_filtered.withColumnRenamed("station_code", "ORIGIN_STATION")
    airlines_station_origin_filtered = airlines_station_origin_filtered.withColumnRenamed("airport_station_distance", "ORIGIN_STATION_DISTANCE")
    airlines_station_origin_filtered = airlines_station_origin_filtered.withColumnRenamed("airport_lat", "ORIGIN_LAT")
    airlines_station_origin_filtered = airlines_station_origin_filtered.withColumnRenamed("airport_lon", "ORIGIN_LON")
    airlines_station_origin_filtered = airlines_station_origin_filtered.drop("airport_code")

    # add closest weather station for the destination airport to airlines dataset
    airlines_station_filtered = airlines_station_origin_filtered.join(airports_closest_station_filtered, airlines_station_origin_filtered.DEST == airports_closest_station_filtered.airport_code, how="inner")
    airlines_station_filtered = airlines_station_filtered.withColumnRenamed("station_code", "DEST_STATION")
    airlines_station_filtered = airlines_station_filtered.withColumnRenamed("airport_station_distance", "DEST_STATION_DISTANCE")
    airlines_station_filtered = airlines_station_filtered.withColumnRenamed("airport_lat", "DEST_LAT")
    airlines_station_filtered = airlines_station_filtered.withColumnRenamed("airport_lon", "DEST_LON")
    airlines_station_filtered = airlines_station_filtered.drop("airport_code")

    #add flight bearing angle in degrees from true north (consistent with wind direction)
    airlines_station_filtered = airlines_station_filtered.withColumn("FLIGHT_BEARING", udfBearingCalculation_fn("ORIGIN_LAT","ORIGIN_LON","DEST_LAT","DEST_LON"))
    return airlines_station_filtered

def flightDateTimeCalculation_fn(flight_date, flight_time):  
    timestamp_date = str(flight_date)
    timestamp_hour = str(flight_time).zfill(4)[:-2]
    timestamp_minute = str(flight_time).zfill(4)[-2:]
  
    timestamp = timestamp_date + 'T' + timestamp_hour + ':' + timestamp_minute# + ".000+0000"
    try:
      datetime_timestamp = dateutil.parser.isoparse(timestamp)
    except ValueError:
      timestamp = timestamp_date + 'T' + '00' + ':' + timestamp_minute# + ".000+0000"
      datetime_timestamp = dateutil.parser.isoparse(timestamp)
    
    return datetime_timestamp
  
def flightDateTimeCalculationArr_fn(flight_date, flight_time_dep, flight_time_arr):  
    timestamp_dep_date = str(flight_date)
    timestamp_arr_date = str(flight_date)
  
    
    timestamp_dep_hour = str(flight_time_dep).zfill(4)[:-2]
    timestamp_dep_minute = str(flight_time_dep).zfill(4)[-2:]
    timestamp_arr_hour = str(flight_time_arr).zfill(4)[:-2]
    timestamp_arr_minute = str(flight_time_arr).zfill(4)[-2:]
    
    timestamp_dep = timestamp_dep_hour + ':' + timestamp_dep_minute
    timestamp_arr = timestamp_arr_hour + ':' + timestamp_arr_minute
    
    timestamp_dep = timestamp_dep_date + 'T' + timestamp_dep_hour + ':' + timestamp_dep_minute# + ".000+0000"
    try:
      datetime_timestamp_dep = dateutil.parser.isoparse(timestamp_dep)
    except ValueError:
      timestamp_dep = timestamp_dep_date + 'T' + '00' + ':' + timestamp_dep_minute# + ".000+0000"
      datetime_timestamp_dep = dateutil.parser.isoparse(timestamp_dep)
    
    timestamp_arr = timestamp_arr_date + 'T' + timestamp_arr_hour + ':' + timestamp_arr_minute# + ".000+0000"
    try:
      datetime_timestamp_arr = dateutil.parser.isoparse(timestamp_arr)
    except ValueError:
      timestamp_arr = timestamp_arr_date + 'T' + '00' + ':' + timestamp_arr_minute# + ".000+0000"
      datetime_timestamp_arr = dateutil.parser.isoparse(timestamp_arr)
  
    # if the scheduled arrival is earlier than the departure, the flight lands on the next day (only valid for flights shorter than 24 hours):
    if datetime_timestamp_dep > datetime_timestamp_arr:
      datetime_timestamp_arr = datetime_timestamp_arr + datetime.timedelta(days=1)

    return datetime_timestamp_arr

udfFlightDateTimeCalculation_fn = udf(flightDateTimeCalculation_fn, TimestampType())
udfFlightDateTimeCalculationArr_fn = udf(flightDateTimeCalculationArr_fn, TimestampType())

def airlines_station_datetime_fn(airlines_station_filtered):
    airlines_station_datetime = airlines_station_filtered.withColumn("CRS_DEP_TIMESTAMP", udfFlightDateTimeCalculation_fn("FL_DATE","CRS_DEP_TIME"))
    airlines_station_datetime = airlines_station_datetime.withColumn("CRS_ARR_TIMESTAMP", udfFlightDateTimeCalculationArr_fn("FL_DATE","CRS_DEP_TIME", "CRS_ARR_TIME"))
    return airlines_station_datetime

def airlines_station_datetime_unix_fn(airlines_station_datetime):
    airlines_station_datetime_unix = airlines_station_datetime.withColumn("CRS_DEP_TIMESTAMP_UNIX", f.unix_timestamp("CRS_DEP_TIMESTAMP"))
    airlines_station_datetime_unix = airlines_station_datetime_unix.withColumn("CRS_ARR_TIMESTAMP_UNIX", f.unix_timestamp("CRS_ARR_TIMESTAMP"))
    airlines_station_datetime_unix = airlines_station_datetime_unix.withColumn("DEP_HOUR", f.hour("CRS_DEP_TIMESTAMP"))
    airlines_station_datetime_unix = airlines_station_datetime_unix.withColumn("ARR_HOUR", f.hour("CRS_ARR_TIMESTAMP"))
    
    return airlines_station_datetime_unix
  
def weather_fltr_datetime_fn(weather_fltr_drop):
    weather_fltr_datetime = weather_fltr_drop.withColumn("DATE_TIMESTAMP_UNIX", f.unix_timestamp("TIMESTAMP"))
    weather_fltr_datetime = weather_fltr_datetime.withColumn('DATE', f.col("TIMESTAMP").cast(DateType()))
    weather_fltr_datetime = weather_fltr_datetime.withColumn("HOUR", f.hour("TIMESTAMP"))
    
    return weather_fltr_datetime

def weather_avg_fn(weather_fltr_datetime):
    weather_fltr_datetime.createOrReplaceTempView('weather_time')
    weather_avg_query = "(SELECT STATION, DATE, HOUR, ROUND(AVG(`WND[0]`),0) AS `WND[0]`, ROUND(AVG(`WND[3]`),0) AS `WND[3]`, ROUND(AVG(`VIS[0]`),0) AS `VIS[0]`, ROUND(AVG(`SLP[0]`),0) AS `SLP[0]`, ROUND(AVG(`AA1[0]`),0) AS `AA1[0]` FROM weather_time GROUP BY STATION, DATE, HOUR)"

    weather_avg = spark.sql(weather_avg_query)
    
    weather_avg = weather_avg.withColumn("WND_CLASS[0]", udfBearingClass_fn("WND[0]"))
    weather_avg = weather_avg.drop("WND[0]")
    
    return weather_avg
  
def weather_add_values_fn(weather_avg):
    weather_fltr_datetime_origin = weather_avg.withColumnRenamed("STATION", "ORIGIN_STATION_WEATHER")
    weather_fltr_datetime_origin = weather_fltr_datetime_origin.withColumnRenamed("DATE", "ORIGIN_STATION_DATE")
    weather_fltr_datetime_origin = weather_fltr_datetime_origin.withColumnRenamed("HOUR", "ORIGIN_STATION_HOUR")
    weather_fltr_datetime_origin = weather_fltr_datetime_origin.withColumnRenamed("WND_CLASS[0]", "ORIGIN_STATION_WND[0]")
    weather_fltr_datetime_origin = weather_fltr_datetime_origin.withColumnRenamed("WND[3]", "ORIGIN_STATION_WND[3]")
    weather_fltr_datetime_origin = weather_fltr_datetime_origin.withColumnRenamed("VIS[0]", "ORIGIN_STATION_VIS[0]")
    weather_fltr_datetime_origin = weather_fltr_datetime_origin.withColumnRenamed("SLP[0]", "ORIGIN_STATION_SLP[0]")
    weather_fltr_datetime_origin = weather_fltr_datetime_origin.withColumnRenamed("AA1[0]", "ORIGIN_STATION_AA1[0]")
    weather_fltr_datetime_dest = weather_avg.withColumnRenamed("STATION", "DEST_STATION_WEATHER")
    weather_fltr_datetime_dest = weather_fltr_datetime_dest.withColumnRenamed("DATE", "DEST_STATION_DATE")
    weather_fltr_datetime_dest = weather_fltr_datetime_dest.withColumnRenamed("HOUR", "DEST_STATION_HOUR")
    weather_fltr_datetime_dest = weather_fltr_datetime_dest.withColumnRenamed("WND_CLASS[0]", "DEST_STATION_WND[0]")
    weather_fltr_datetime_dest = weather_fltr_datetime_dest.withColumnRenamed("WND[3]", "DEST_STATION_WND[3]")
    weather_fltr_datetime_dest = weather_fltr_datetime_dest.withColumnRenamed("VIS[0]", "DEST_STATION_VIS[0]")
    weather_fltr_datetime_dest = weather_fltr_datetime_dest.withColumnRenamed("SLP[0]", "DEST_STATION_SLP[0]")
    weather_fltr_datetime_dest = weather_fltr_datetime_dest.withColumnRenamed("AA1[0]", "DEST_STATION_AA1[0]")
    return weather_fltr_datetime_origin, weather_fltr_datetime_dest
  
def departure_final_fn(airlines_station_datetime_unix):
    airlines_station_datetime_unix.createOrReplaceTempView("airports_weather")
    weather_fltr_datetime_origin.createOrReplaceTempView("origin_weather")
    origin_join_query = "(SELECT * FROM airports_weather a INNER JOIN origin_weather w ON a.ORIGIN_STATION = w.ORIGIN_STATION_WEATHER AND a.FL_DATE = w.ORIGIN_STATION_DATE AND a.DEP_HOUR = w.ORIGIN_STATION_HOUR)"

    departure_final = spark.sql(origin_join_query)
    return departure_final

def airlines_weather_final_trim_fn(departure_final):
    departure_final.createOrReplaceTempView("airports_weather_dest")
    weather_fltr_datetime_dest.createOrReplaceTempView("dest_weather")
    # changed to join on weather @ destination airport @ departure time
    dest_join_query = "(SELECT * FROM airports_weather_dest a INNER JOIN dest_weather w ON a.DEST_STATION = w.DEST_STATION_WEATHER AND a.FL_DATE = w.DEST_STATION_DATE AND a.DEP_HOUR = w.DEST_STATION_HOUR)"

    airlines_weather_final = spark.sql(dest_join_query)
    drop_cols = ['DEST_STATION_DATE', 'DEST_STATION_HOUR', 'ORIGIN_STATION_HOUR', 'ORIGIN_STATION_DATE', 'ORIGIN_LAT', 'ORIGIN_LON', 'DEST_LAT', 'DEST_LON', 'CRS_DEP_TIMESTAMP_UNIX', 'CRS_ARR_TIMESTAMP_UNIX', 'DEP_HOUR', 'ARR_HOUR', 'ORIGIN_STATION', 'DEST_STATION', 'ORIGIN_STATION_WEATHER', 'DEST_STATION_WEATHER']
    airlines_weather_final_trim = airlines_weather_final.drop(*drop_cols)
    return airlines_weather_final_trim
  
def airlines_weather_to_parquet_fn(airlines_weather_final_trim):
    dbutils.fs.rm("dbfs:/tmp/parquet/airlines_weather_final_4_7.parquet", True) # recurse=True, since Spark writes parquet output as a directory
    airlines_weather_final_trim.write.parquet("dbfs:/tmp/parquet/airlines_weather_final_4_7.parquet")
    return None
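
The scheduled-time helpers above pad the `hhmm` integer to four digits, fall back to hour `00` when the hour cannot be parsed, and move the arrival to the next day when it would otherwise precede the departure. A minimal standard-library sketch of the same idea follows. The names `parse_hhmm` and `scheduled_timestamps` are hypothetical, the date is assumed to be a `YYYY-MM-DD` string, and treating any hour above 23 as hour 00 is a simplification of the fallback used in the notebook.

```python
from datetime import datetime, timedelta

def parse_hhmm(flight_date, hhmm):
    """Combine a 'YYYY-MM-DD' date with an hhmm integer such as 45 or 2359."""
    digits = str(hhmm).zfill(4)              # 45 -> '0045'
    hour, minute = int(digits[:-2]), int(digits[-2:])
    if hour > 23:                            # assumption: treat '24xx' as hour 00
        hour = 0
    day = datetime.strptime(str(flight_date), "%Y-%m-%d")
    return day.replace(hour=hour, minute=minute)

def scheduled_timestamps(flight_date, dep_hhmm, arr_hhmm):
    """Return (departure, arrival) datetimes for one scheduled flight."""
    dep = parse_hhmm(flight_date, dep_hhmm)
    arr = parse_hhmm(flight_date, arr_hhmm)
    if dep > arr:                            # arrival after midnight (flights under 24 hours)
        arr += timedelta(days=1)
    return dep, arr

print(scheduled_timestamps("2015-01-31", 2330, 45)[1])  # 2015-02-01 00:45:00
```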

In [39]:
# Merging airlines and weather data within United States
airlines_df =  preprocessAirlines(airlines)
weather_parquet_us = US_fn(weather_parquet)
weather_fltr_drop = reduce_split_cols_fn(weather_parquet_us)
weather_fltr_drop_distinct = distinct_station_fn(weather_fltr_drop)
airport_codes_us = US_fn(airport_codes)
airports_stations_distance_all = haversine_join_station_aircode_fn(airport_codes_us, weather_fltr_drop_distinct)
airports_closest_station_filtered = airports_closest_stations_fn(airports_stations_distance_all)
airports_closest_station_filtered = join_closest_weather_airlines_fn(airlines_df, airports_closest_station_filtered)
airlines_station_datetime = airlines_station_datetime_fn(airports_closest_station_filtered)
airlines_station_datetime_unix = airlines_station_datetime_unix_fn(airlines_station_datetime)  
weather_fltr_datetime = weather_fltr_datetime_fn(weather_fltr_drop)
weather_avg = weather_avg_fn(weather_fltr_datetime)
weather_fltr_datetime_origin, weather_fltr_datetime_dest = weather_add_values_fn(weather_avg)
departure_final = departure_final_fn(airlines_station_datetime_unix)
airlines_weather_final_trim = airlines_weather_final_trim_fn(departure_final)
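
One of the features produced by this merge is `FLIGHT_BEARING`, which is derived from the origin and destination coordinates. The sketch below reproduces only the angle calculation in plain Python so it can be checked by hand. The function name `initial_bearing_deg` is hypothetical, and normalising the result to the range [0, 360) is an assumption of this sketch; the notebook passes the raw angle to `bearingClass_fn`, which is not reproduced here.

```python
from math import atan2, cos, degrees, radians, sin

def initial_bearing_deg(lat_a, lon_a, lat_b, lon_b):
    """Initial great-circle bearing from point A to point B, in degrees from true north."""
    lat_a_r, lat_b_r = radians(lat_a), radians(lat_b)
    delta_lon_r = radians(lon_b - lon_a)
    x = cos(lat_b_r) * sin(delta_lon_r)
    y = cos(lat_a_r) * sin(lat_b_r) - sin(lat_a_r) * cos(lat_b_r) * cos(delta_lon_r)
    return degrees(atan2(x, y)) % 360.0      # assumption: map (-180, 180] onto [0, 360)

# Due east and due west along the equator
print(round(initial_bearing_deg(0, 0, 0, 10)))   # 90
print(round(initial_bearing_deg(0, 0, 0, -10)))  # 270
```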

In [40]:
airlines_preprocessed_nullCounts_df = nullDataFrame(airlines_df) # Check to see if there are any null values after the data merge
airlines_preprocessed_nullCounts_df

| | Feature_Name | Null_Counts | Percentage_Null_Counts |
|---|---|---|---|
| 0 | MONTH | 0 | 0.0 |
| 1 | DAY_OF_WEEK | 0 | 0.0 |
| 2 | FL_DATE | 0 | 0.0 |
| 3 | OP_UNIQUE_CARRIER | 0 | 0.0 |
| 4 | ORIGIN | 0 | 0.0 |
| 5 | DEST | 0 | 0.0 |
| 6 | CRS_DEP_TIME | 0 | 0.0 |
| 7 | DEP_TIME | 0 | 0.0 |
| 8 | DEP_DELAY | 0 | 0.0 |
| 9 | DEP_TIME_BLK | 0 | 0.0 |


After merging the airlines and weather datasets, we do not observe any null values in the airlines features checked above.

We now have the cleaned, preprocessed, and merged dataset ready to train our supervised regression algorithms.

In [42]:
#airlines_weather_final_trim.write.parquet("/FileStore/tables/airlines_weather_final_trim.parquet")
airlines_weather_final_trim = spark.read.parquet("/FileStore/tables/airlines_weather_final_trim.parquet")

In [43]:
airlines_weather_final_trim.printSchema() # Check the schema of the final dataset after merge

<h5>Data Split:</h5> 
We will now split the merged dataset into `train`, `validation` and `test` sets for the machine learning algorithms, using 80%, 10% and 10% of the records respectively.

In [45]:
airlines_train, airlines_val, airlines_test = airlines_weather_final_trim.randomSplit([0.8,0.1,0.1], seed = 2020)

In [46]:
train_cnt = airlines_train.count()
val_cnt = airlines_val.count()
test_cnt = airlines_test.count()
total_cnt = train_cnt + val_cnt + test_cnt
print('airlines_train records: {}\nairlines_val records: {}\nairlines_test records: {}\ntotal records: {}'.format(train_cnt, val_cnt, test_cnt, total_cnt)) # Check the number of records after data split
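
The counts printed by this cell will only be approximately 80/10/10, because `randomSplit` normalises the weights and assigns rows by random sampling rather than cutting at exact positions. The plain-Python sketch below illustrates that behaviour under simplifying assumptions: the helper `random_split` is hypothetical, and it draws one uniform number per row, which is not Spark's actual sampler.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split with probability proportional to its weight."""
    total = float(sum(weights))
    bounds, running = [], 0.0
    for w in weights:                        # cumulative boundaries, e.g. 0.8, 0.9, 1.0
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)                # fixed seed makes the split repeatable
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train, val, test = random_split(list(range(10000)), [0.8, 0.1, 0.1], seed=2020)
print(len(train), len(val), len(test))       # roughly 8000, 1000, 1000
```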

<h5>Feature Selection:</h5> 

We will now remove any features that are not relevant in predicting `ARR_DELAY`.

In [48]:
def featureSelection(df):
  cols_to_keep = ['MONTH', 'DAY_OF_WEEK', 'DEP_DELAY', 'DEP_TIME_BLK', 'ARR_DELAY', 'ARR_TIME_BLK', 'CRS_ELAPSED_TIME', 'DISTANCE', 'CARRIER_DELAY', 'WEATHER_DELAY', 'NAS_DELAY', 'SECURITY_DELAY', 'LATE_AIRCRAFT_DELAY', 'IS_WEEKEND', 'DEP_RUSH_HOUR', 'ARR_RUSH_HOUR', 'FLIGHT_BEARING', 'ORIGIN_CARRIER', 'DEST_CARRIER', 'ORIGIN_STATION_WND_DIR', 'ORIGIN_STATION_VIS', 'ORIGIN_STATION_SLP', 'ORIGIN_STATION_AA1', 'ORIGIN_STATION_WND', 'DEST_STATION_WND_DIR', 'DEST_STATION_VIS', 'DEST_STATION_SLP', 'DEST_STATION_AA1', 'DEST_STATION_WND']
  # Rename the bracketed weather columns first so that they match the names in cols_to_keep
  df = df.withColumnRenamed("ORIGIN_STATION_WND[0]", "ORIGIN_STATION_WND_DIR")
  df = df.withColumnRenamed("ORIGIN_STATION_VIS[0]", "ORIGIN_STATION_VIS")
  df = df.withColumnRenamed("ORIGIN_STATION_SLP[0]", "ORIGIN_STATION_SLP")
  df = df.withColumnRenamed("ORIGIN_STATION_AA1[0]", "ORIGIN_STATION_AA1")
  df = df.withColumnRenamed("ORIGIN_STATION_WND[3]", "ORIGIN_STATION_WND")
  
  df = df.withColumnRenamed("DEST_STATION_WND[0]", "DEST_STATION_WND_DIR")
  df = df.withColumnRenamed("DEST_STATION_VIS[0]", "DEST_STATION_VIS")
  df = df.withColumnRenamed("DEST_STATION_SLP[0]", "DEST_STATION_SLP")
  df = df.withColumnRenamed("DEST_STATION_AA1[0]", "DEST_STATION_AA1")
  df = df.withColumnRenamed("DEST_STATION_WND[3]", "DEST_STATION_WND")
  
  # Drop every column that is not listed in cols_to_keep
  cols_to_remove = [x for x in df.columns if x not in cols_to_keep]
  featureSelection_df = df.drop(*cols_to_remove)
  return featureSelection_df

In [49]:
airlines_train_df =  featureSelection(airlines_train) # Select only the features required for the predictive models

In [50]:
numeric_features = [x[0] for x in airlines_train_df.dtypes if x[1] == 'int' or x[1] == 'double'] # Retrieving numeric features from the dataset
numeric_features.remove('ARR_DELAY') # As this will be our label/dependent variable, we will remove it from our training features
cat_features = ['MONTH', 'DAY_OF_WEEK'] # These are categorical features stored as 'int' type
numeric_features = [x for x in numeric_features if x not in cat_features]
numeric_features

In [51]:
categorical_features = [x[0] for x in airlines_train_df.dtypes if x[1] == 'string']# Retrieving categorical features from the dataset
categorical_features = categorical_features + cat_features
categorical_features

In [52]:
stages = []
for categoricalCol in categorical_features:
    stringIndexer = StringIndexer(inputCol = categoricalCol, outputCol = categoricalCol + 'Index', handleInvalid="keep") # Transforming categorical features to StringIndexer
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "classVec"]) # Transforming StringIndexers to One-Hot Encoded vectors
    stages += [stringIndexer, encoder]
assemblerInputs = [c + "classVec" for c in categorical_features] + numeric_features
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features", handleInvalid="keep") # Transforming all input variables into one Vector column called 'features' 
stages += [assembler]
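
To make the two encoding stages concrete, the sketch below mimics them on a tiny list of carrier codes in plain Python. It is an illustration under stated assumptions, not Spark's implementation: the helper names are hypothetical, ties in frequency are broken alphabetically, and the output is a dense list with one extra slot for unseen labels (mirroring `handleInvalid="keep"`). Spark's encoder returns sparse vectors and by default drops the last category, which this sketch does not do.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Assumption of this sketch: ties are broken alphabetically
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}

def one_hot(label, index):
    """Dense indicator vector with a final slot reserved for unseen labels."""
    vec = [0] * (len(index) + 1)
    vec[index.get(label, len(index))] = 1
    return vec

index = fit_string_index(["AA", "DL", "AA", "UA", "AA", "DL"])
print(index)                 # {'AA': 0, 'DL': 1, 'UA': 2}
print(one_hot("DL", index))  # [0, 1, 0, 0]
print(one_hot("ZZ", index))  # [0, 0, 0, 1]  (label not seen during fitting)
```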

In [53]:
pipeline = Pipeline().setStages(stages).fit(airlines_train_df) # Creating a pipeline to include all the stages and fitting the train data
vector_airlines_train_df = pipeline.transform(airlines_train_df) # Transforming the train data based on fitted pipeline's estimator 
vector_airlines_train_df.printSchema()

In [54]:
train_df = vector_airlines_train_df.select(col("ARR_DELAY").alias("label"), col("features")) # Renaming 'ARR_DELAY' to 'label'
train_df.show(2)

In [55]:
airlines_val_df =  featureSelection(airlines_val) # feature selecting for validation data
vector_airlines_val_df = pipeline.transform(airlines_val_df) # Transforming the validation data based on fitted pipeline's estimator
val_df = vector_airlines_val_df.select(col("ARR_DELAY").alias("label"), col("features"))  # Renaming 'ARR_DELAY' to 'label'

<h2 align="center">Algorithm Exploration:</h2>
To predict `ARR_DELAY` from the dataset, we are going to consider the following supervised machine learning algorithms. 
1. Linear Regression
2. Decision Tree Regressor
3. Random Forest Regressor
4. Gradient Boosted Tree Regressor

For all these algorithms we will use 5-fold cross validation to reduce the risk of overfitting, together with grid search to tune the hyperparameters of each algorithm. Inside `CrossValidator`, `RegressionEvaluator()` is left at its default metric (RMSE) to choose the best parameter combination. Once a model is trained on the training data, we will save it so that it can be retrieved later for inference. We will then compare the tuned models on their \\( R^2 \\) values to find the best algorithm, and use that model for inference on the test data to obtain the final \\( R^2 \\) value.

#### Linear Regression:


We will train different variations of linear regression: `OLS Regression`, `LASSO Regression`, `Ridge Regression` and `Elastic Net Regression`. We will accomplish this by tuning the `elasticNetParam` hyperparameter over the values 0.0 (Ridge Regression), 0.5 (Elastic Net Regression) and 1.0 (LASSO Regression). In addition, we will tune the regularization parameter (`regParam`) and the maximum number of iterations (`maxIter`) used to train the model.
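
As a rough illustration of how `elasticNetParam` blends the two penalties, the sketch below evaluates the penalty term as we understand it from Spark's documented objective, namely `regParam * (elasticNetParam * L1 + (1 - elasticNetParam) / 2 * L2)`. The function name and the example weights are hypothetical, and the sketch ignores the feature standardization that Spark applies by default.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the squared-error loss for a given coefficient vector."""
    l1 = sum(abs(w) for w in weights)        # LASSO term
    l2 = sum(w * w for w in weights)         # Ridge term
    return reg_param * (elastic_net_param * l1 + (1.0 - elastic_net_param) / 2.0 * l2)

weights = [3.0, -4.0]                        # hypothetical coefficients
for alpha in (0.0, 0.5, 1.0):                # Ridge, Elastic Net, LASSO
    print(alpha, round(elastic_net_penalty(weights, 0.01, alpha), 4))
# 0.0 0.125
# 0.5 0.0975
# 1.0 0.07
```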

In [58]:
lr = LinearRegression(featuresCol = 'features', labelCol='label') #Initializing LinearRegression class
paramGrid_lr = ParamGridBuilder() \
   .addGrid(lr.regParam, [0.1, 0.01, 0.001]) \
   .addGrid(lr.maxIter, [10, 20]) \
   .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0]) \
   .build() # Building a parameter grid for hyperparameter tuning

crossval_lr = CrossValidator(estimator=lr,
                          estimatorParamMaps=paramGrid_lr,
                          evaluator=RegressionEvaluator(),
                          numFolds=5)  # Hyperparameter tuning the model and 5-Fold Cross validation 

cvModel_lr = crossval_lr.fit(train_df) # Fitting the model on train data

In [59]:
regression_evaluator_r2 = RegressionEvaluator(predictionCol="prediction", labelCol="label",metricName="r2") # R^2 metric
regression_evaluator_rmse = RegressionEvaluator(predictionCol="prediction", labelCol="label",metricName="rmse") # Root Mean Squared Error metric
regression_evaluator_mae = RegressionEvaluator(predictionCol="prediction", labelCol="label",metricName="mae") # Mean Absolute Error metric

regression_metrics_list = [] # Creating an empty list of metrics
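
For reference, the three metrics configured above can be computed by hand on a small example. The sketch below uses the standard definitions (\\( R^2 \\) as one minus the ratio of residual to total sum of squares); the function name and the four label/prediction pairs are made up purely for illustration.

```python
from math import sqrt

def regression_metrics(labels, predictions):
    """Return R^2, RMSE and MAE for paired labels and predictions."""
    n = len(labels)
    mean_label = sum(labels) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))  # residual sum of squares
    ss_tot = sum((y - mean_label) ** 2 for y in labels)              # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(y - p) for y, p in zip(labels, predictions)) / n,
    }

# Hypothetical arrival delays (minutes) and model predictions
metrics = regression_metrics([10.0, 0.0, 25.0, -5.0], [12.0, 1.0, 20.0, -4.0])
print(metrics["mae"])  # 2.25
```

RMSE squares each error before averaging, so it penalises a few large misses more heavily than MAE does; that is why the RMSE values in this notebook are larger than the corresponding MAE values.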

In [60]:
cvModel_lr.write().overwrite().save('/FileStore/tables/cvModel_lr') # Save the CrossValidatorModel

In [61]:
saved_cvModel_lr = CrossValidatorModel.load('/FileStore/tables/cvModel_lr') # Load the CrossValidatorModel

In [62]:
# train_df evaluation metrics
lr_predictions_train = saved_cvModel_lr.transform(train_df) # prediction on train data
lr_train_r2 = regression_evaluator_r2.evaluate(lr_predictions_train)
lr_train_rmse = regression_evaluator_rmse.evaluate(lr_predictions_train)
lr_train_mae = regression_evaluator_mae.evaluate(lr_predictions_train)
regression_metrics_list.append(["LinearRegression_TrainData_CV", lr_train_r2, lr_train_rmse, lr_train_mae ])

#  val_df evaluation metrics
lr_predictions_val = saved_cvModel_lr.transform(val_df)  # prediction on validation data
lr_val_r2 = regression_evaluator_r2.evaluate(lr_predictions_val)
lr_val_rmse = regression_evaluator_rmse.evaluate(lr_predictions_val)
lr_val_mae = regression_evaluator_mae.evaluate(lr_predictions_val)
regression_metrics_list.append(["LinearRegression_ValData_CV", lr_val_r2, lr_val_rmse, lr_val_mae ])

In [63]:
bestLRModel = cvModel_lr.bestModel # Retrieving the best model
bestParams = bestLRModel.extractParamMap()
bestParams # Best parameters after hyperparameter tuning

#### Decision Tree Regressor:

`Decision Tree Regressor` is a tree-based supervised machine learning algorithm: the model builds a tree whose internal nodes split on input features and whose leaves hold the predicted values. For parameter tuning, we will consider `maxBins`, `maxDepth` and `minInstancesPerNode`.
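
To show what `maxDepth` and `minInstancesPerNode` control, here is a small regression tree on a single numeric feature, written in plain Python. It is a sketch under simplifying assumptions: the function names are hypothetical, every distinct feature value is tried as a threshold (so there is no equivalent of `maxBins`), and splits are chosen by minimising the summed squared error of the two children.

```python
def sum_sq(rows):
    """Sum of squared deviations of the labels from their mean."""
    ys = [y for _, y in rows]
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def build_tree(rows, max_depth, min_instances_per_node):
    """rows is a list of (x, y) pairs; returns a nested dict."""
    best = None
    if max_depth > 0:
        for threshold in sorted({x for x, _ in rows})[:-1]:
            left = [r for r in rows if r[0] <= threshold]
            right = [r for r in rows if r[0] > threshold]
            if min(len(left), len(right)) < min_instances_per_node:
                continue                     # split would create a node that is too small
            sse = sum_sq(left) + sum_sq(right)
            if best is None or sse < best[0]:
                best = (sse, threshold, left, right)
    if best is None:                         # depth exhausted or no valid split: make a leaf
        return {"leaf": sum(y for _, y in rows) / len(rows)}
    _, threshold, left, right = best
    return {"threshold": threshold,
            "left": build_tree(left, max_depth - 1, min_instances_per_node),
            "right": build_tree(right, max_depth - 1, min_instances_per_node)}

def predict(tree, x):
    while "leaf" not in tree:
        tree = tree["left"] if x <= tree["threshold"] else tree["right"]
    return tree["leaf"]

rows = [(1, 5.0), (2, 6.0), (3, 7.0), (10, 20.0), (11, 22.0), (12, 24.0)]
tree = build_tree(rows, max_depth=1, min_instances_per_node=1)
print(predict(tree, 2), predict(tree, 11))   # 6.0 22.0
```

Raising `min_instances_per_node` to 4 on this six-row example leaves no valid split, so the tree collapses to a single leaf that predicts the overall mean.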

In [65]:
dt = DecisionTreeRegressor(featuresCol="features", labelCol='label')  #Initializing Decision Tree Regressor class

paramGrid_dt = ParamGridBuilder()\
    .addGrid(dt.maxBins, [16, 32]) \
    .addGrid(dt.maxDepth, [5, 10]) \
    .addGrid(dt.minInstancesPerNode, [1, 5]) \
    .build()   # Building a parameter grid for hyperparameter tuning


crossval_dt = CrossValidator(estimator=dt,
                          estimatorParamMaps=paramGrid_dt,
                          evaluator=RegressionEvaluator(),
                          numFolds=5)  # Hyperparameter tuning the model and 5-Fold Cross validation 

cvModel_dt = crossval_dt.fit(train_df) # Fitting the model on train data

In [66]:
cvModel_dt.write().overwrite().save('/FileStore/tables/cvModel_dt') # Save the trained CrossValidatorModel

In [67]:
saved_cvModel_dt = CrossValidatorModel.load('/FileStore/tables/cvModel_dt') # Load the trained CrossValidatorModel

In [68]:
# train_df evaluation metrics
dt_predictions_train = saved_cvModel_dt.transform(train_df) # prediction on train data
dt_train_r2 = regression_evaluator_r2.evaluate(dt_predictions_train)
dt_train_rmse = regression_evaluator_rmse.evaluate(dt_predictions_train)
dt_train_mae = regression_evaluator_mae.evaluate(dt_predictions_train)
regression_metrics_list.append(["DecisionTreeRegressor_TrainData_CV", dt_train_r2, dt_train_rmse, dt_train_mae ])

# val_df evaluation metrics
dt_predictions_val = saved_cvModel_dt.transform(val_df) # prediction on validation data
dt_val_r2 = regression_evaluator_r2.evaluate(dt_predictions_val)
dt_val_rmse = regression_evaluator_rmse.evaluate(dt_predictions_val)
dt_val_mae = regression_evaluator_mae.evaluate(dt_predictions_val)
regression_metrics_list.append(["DecisionTreeRegressor_ValData_CV", dt_val_r2, dt_val_rmse, dt_val_mae ])

In [69]:
bestDTModel = cvModel_dt.bestModel # Retrieving the best model
bestParams_dt = bestDTModel.extractParamMap()
bestParams_dt  # Best parameters after hyperparameter tuning

#### Random Forest Regressor:

Random Forest Regressor is an ensemble model (bootstrap aggregation, or bagging) consisting of several decision trees. The trees are trained independently of each other, so they can be trained in parallel, and their predictions are averaged to produce the final prediction. This averaging gives a Random Forest Regressor much lower variance than a single decision tree. We will tune `maxBins`, `numTrees` and `minInstancesPerNode`.
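
The bagging idea can be sketched in a few lines of plain Python. To keep the example short, each "tree" is replaced by a stand-in model that simply predicts the mean label of its bootstrap sample; the helper names are hypothetical and the random feature selection that a real random forest performs is left out.

```python
import random

def bootstrap_sample(labels, rng):
    """Draw len(labels) values with replacement."""
    return [rng.choice(labels) for _ in labels]

def fit_mean_model(labels):
    """Stand-in for a decision tree: always predicts the mean of its sample."""
    mean = sum(labels) / len(labels)
    return lambda: mean

def fit_bagged(labels, num_models, seed):
    rng = random.Random(seed)
    return [fit_mean_model(bootstrap_sample(labels, rng)) for _ in range(num_models)]

def predict_bagged(models):
    """Average the predictions of all models in the ensemble."""
    return sum(m() for m in models) / len(models)

models = fit_bagged([0.0, 10.0, 20.0, 30.0], num_models=20, seed=7)
individual = [m() for m in models]
print(min(individual), predict_bagged(models), max(individual))
```

The averaged prediction always lies between the most extreme individual predictions, which is the mechanism behind the variance reduction described above.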

In [71]:
rf = RandomForestRegressor(featuresCol="features", labelCol='label') #Initializing Random Forest Regressor class

paramGrid_rf = ParamGridBuilder()\
    .addGrid(rf.maxBins, [16, 32]) \
    .addGrid(rf.numTrees, [20, 40]) \
    .addGrid(rf.minInstancesPerNode, [1, 5]) \
    .build()   # Building a parameter grid for hyperparameter tuning


crossval_rf = CrossValidator(estimator=rf,
                          estimatorParamMaps=paramGrid_rf,
                          evaluator=RegressionEvaluator(),
                          numFolds=5)  # Hyperparameter tuning the model and 5-Fold Cross validation 


cvModel_rf = crossval_rf.fit(train_df) # Fitting the model on train data

In [72]:
cvModel_rf.write().overwrite().save('/FileStore/tables/cvModel_rf')   # Save the trained CrossValidatorModel

In [73]:
saved_cvModel_rf = CrossValidatorModel.load('/FileStore/tables/cvModel_rf') # Load the trained CrossValidatorModel

In [74]:
# train_df evaluation metrics
rf_predictions_train = saved_cvModel_rf.transform(train_df) # prediction on train data
rf_train_r2 = regression_evaluator_r2.evaluate(rf_predictions_train)
rf_train_rmse = regression_evaluator_rmse.evaluate(rf_predictions_train)
rf_train_mae = regression_evaluator_mae.evaluate(rf_predictions_train)
regression_metrics_list.append(["RandomForestRegressor_TrainData_CV", rf_train_r2, rf_train_rmse, rf_train_mae ])

# val_df evaluation metrics
rf_predictions_val = saved_cvModel_rf.transform(val_df) # prediction on validation data
rf_val_r2 = regression_evaluator_r2.evaluate(rf_predictions_val)
rf_val_rmse = regression_evaluator_rmse.evaluate(rf_predictions_val)
rf_val_mae = regression_evaluator_mae.evaluate(rf_predictions_val)
regression_metrics_list.append(["RandomForestRegressor_ValData_CV", rf_val_r2, rf_val_rmse, rf_val_mae ])

In [75]:
bestRFModel = cvModel_rf.bestModel # Retrieving the best model
bestParams_rf = bestRFModel.extractParamMap()
bestParams_rf  # Best parameters after hyperparameter tuning

#### Gradient-Boosted Tree Regressor:

Gradient Boosted Tree Regressor is an ensemble model that trains decision trees sequentially, with each new tree correcting the errors of the trees before it, so that many weak learners combine into a stronger predictor. We will tune `maxBins` and `minInstancesPerNode`.
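
A minimal sketch of boosting with squared-error loss is shown below, assuming a single numeric feature and depth-one trees (stumps). Each round fits a stump to the current residuals and adds a scaled copy of it to the ensemble. The function names, the toy data, and the values of `num_rounds` and `step_size` are illustrative choices, not settings taken from the notebook.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on one feature, minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def fit_boosted(xs, ys, num_rounds=20, step_size=0.1):
    base = sum(ys) / len(ys)                 # start from the mean label
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(num_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)     # weak learner fitted to the current errors
        stumps.append(stump)
        predictions = [p + step_size * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: base + step_size * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [5.0, 6.0, 7.0, 20.0, 22.0, 25.0]
model = fit_boosted(xs, ys, num_rounds=50, step_size=0.1)
print(round(model(1), 1), round(model(6), 1))
```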

In [77]:
gbt = GBTRegressor(featuresCol="features", labelCol='label')  # Initializing Gradient Boosted Tree Regressor class

paramGrid_gbt = ParamGridBuilder()\
    .addGrid(gbt.maxBins, [10, 32]) \
    .addGrid(gbt.minInstancesPerNode, [1, 5]) \
    .build() # Building a parameter grid for hyperparameter tuning 

crossval_gbt = CrossValidator(estimator=gbt,
                          estimatorParamMaps=paramGrid_gbt,
                          evaluator=RegressionEvaluator(),
                          numFolds=5) # Hyperparameter tuning the model and 5-Fold Cross validation 

cvModel_gbt = crossval_gbt.fit(train_df) # Fitting the model on train data

In [78]:
cvModel_gbt.write().overwrite().save('/FileStore/tables/cvModel_gbt') # Save the trained CrossValidatorModel

In [79]:
saved_cvModel_gbt = CrossValidatorModel.load('/FileStore/tables/cvModel_gbt') # Load the trained CrossValidatorModel

In [80]:
# train_df evaluation metrics
gbt_predictions_train = saved_cvModel_gbt.transform(train_df) # prediction on train data
gbt_train_r2 = regression_evaluator_r2.evaluate(gbt_predictions_train)
gbt_train_rmse = regression_evaluator_rmse.evaluate(gbt_predictions_train)
gbt_train_mae = regression_evaluator_mae.evaluate(gbt_predictions_train)
regression_metrics_list.append(["GradientBoostedTreeRegressor_TrainData_CV", gbt_train_r2, gbt_train_rmse, gbt_train_mae ])

# val_df evaluation metrics
gbt_predictions_val = saved_cvModel_gbt.transform(val_df) # prediction on validation data
gbt_val_r2 = regression_evaluator_r2.evaluate(gbt_predictions_val)
gbt_val_rmse = regression_evaluator_rmse.evaluate(gbt_predictions_val)
gbt_val_mae = regression_evaluator_mae.evaluate(gbt_predictions_val)
regression_metrics_list.append(["GradientBoostedTreeRegressor_ValData_CV", gbt_val_r2, gbt_val_rmse, gbt_val_mae ])

In [81]:
bestGBTModel = cvModel_gbt.bestModel # Retrieving the best model
bestParams_gbt = bestGBTModel.extractParamMap()
bestParams_gbt # Best parameters after hyperparameter tuning

## Results:

In [83]:
regression_metrics_df = pd.DataFrame(regression_metrics_list, columns = ['Model_Data' , 'R^2', 'RMSE', 'MAE']) # Creating a Pandas DataFrame out of the evaluation metrics
display(regression_metrics_df)

| Model_Data | R^2 | RMSE | MAE |
|---|---|---|---|
| LinearRegression_TrainData_CV | 0.9556805965871404 | 13.422721977071546 | 9.542329680250475 |
| LinearRegression_ValData_CV | 0.9578701682031706 | 13.397124217962183 | 9.65773179158638 |
| DecisionTreeRegressor_TrainData_CV | 0.7762539878356634 | 29.88296529022086 | 11.616142687603576 |
| DecisionTreeRegressor_ValData_CV | 0.7428123335791165 | 33.48911533037165 | 11.402500854607558 |
| RandomForestRegressor_TrainData_CV | 0.70284374595401 | 34.934630735911 | 13.792047792327292 |
| RandomForestRegressor_ValData_CV | 0.6815277669348756 | 36.31649598510621 | 13.553556662155774 |
| GradientBoostedTreeRegressor_TrainData_CV | 0.7553940164461354 | 30.382756887520877 | 11.892415663469045 |
| GradientBoostedTreeRegressor_ValData_CV | 0.7765612911647015 | 33.29219070666156 | 12.084116229875963 |


From the above results, we can see that Linear Regression has the highest \\( R^2 \\) and the lowest RMSE and MAE on both the training and validation data. We will therefore use the Linear Regression model for inference on the test data.
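
The same selection can be made programmatically from rows shaped like the entries of `regression_metrics_list`. The sketch below copies the four validation rows from the results above; the helper name `best_by_r2` is hypothetical.

```python
validation_rows = [
    ["LinearRegression_ValData_CV", 0.9578701682031706, 13.397124217962183, 9.65773179158638],
    ["DecisionTreeRegressor_ValData_CV", 0.7428123335791165, 33.48911533037165, 11.402500854607558],
    ["RandomForestRegressor_ValData_CV", 0.6815277669348756, 36.31649598510621, 13.553556662155774],
    ["GradientBoostedTreeRegressor_ValData_CV", 0.7765612911647015, 33.29219070666156, 12.084116229875963],
]

def best_by_r2(metric_rows, suffix="_ValData_CV"):
    """Return the name of the validation row with the highest R^2 (column 1)."""
    candidates = [row for row in metric_rows if row[0].endswith(suffix)]
    return max(candidates, key=lambda row: row[1])[0]

print(best_by_r2(validation_rows))  # LinearRegression_ValData_CV
```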

#### Inference:
We will now predict `ARR_DELAY` for the test data using the Linear Regression model that we trained with cross validation and hyperparameter tuning.

In [86]:
vector_airlines_test_df = pipeline.transform(featureSelection(airlines_test)) # test data clean up, feature selection and vectorization in a pipeline transform
test_df = vector_airlines_test_df.select(col("ARR_DELAY").alias("label"), col("features")) #selecting only the columns needed for inference

In [87]:
# test_df evaluation metrics
lr_predictions_test = saved_cvModel_lr.transform(test_df) # prediction on test data
lr_test_r2 = regression_evaluator_r2.evaluate(lr_predictions_test) # R^2 value of test data inference
lr_test_rmse = regression_evaluator_rmse.evaluate(lr_predictions_test) # Root Mean Squared Error value of test data inference
lr_test_mae = regression_evaluator_mae.evaluate(lr_predictions_test) # Mean Absolute Error value of test data inference
regression_metrics_list.append(["LinearRegression_TestData_CV", lr_test_r2, lr_test_rmse, lr_test_mae ])

In [88]:
regression_metrics_test_df = pd.DataFrame(regression_metrics_list, columns = ['Model_Data' , 'R^2', 'RMSE', 'MAE'])   # Creating a Pandas DataFrame out of the evaluation metrics
display(regression_metrics_test_df)

| Model_Data | R^2 | RMSE | MAE |
|---|---|---|---|
| LinearRegression_TrainData_CV | 0.9556805965871404 | 13.422721977071546 | 9.542329680250475 |
| LinearRegression_ValData_CV | 0.9578701682031706 | 13.397124217962183 | 9.65773179158638 |
| DecisionTreeRegressor_TrainData_CV | 0.7762539878356634 | 29.88296529022086 | 11.616142687603576 |
| DecisionTreeRegressor_ValData_CV | 0.7428123335791165 | 33.48911533037165 | 11.402500854607558 |
| RandomForestRegressor_TrainData_CV | 0.70284374595401 | 34.934630735911 | 13.792047792327292 |
| RandomForestRegressor_ValData_CV | 0.6815277669348756 | 36.31649598510621 | 13.553556662155774 |
| GradientBoostedTreeRegressor_TrainData_CV | 0.7553940164461354 | 30.382756887520877 | 11.892415663469045 |
| GradientBoostedTreeRegressor_ValData_CV | 0.7765612911647015 | 33.29219070666156 | 12.084116229875963 |
| LinearRegression_TestData_CV | 0.9529835434342336 | 13.605171894321128 | 9.520017398426202 |


From the above \\( R^2 \\) value (0.953) for test data inference, we can see that the Ridge Regression model did not overfit the data: the train, validation and test scores are almost the same. An \\( R^2 \\) of 0.953 means that about 95% of the variance in the dependent variable (`ARR_DELAY`) is explained by the predictive model (Ridge Regression) using all the independent features together.

<h2 align="center">Algorithm Implementation:</h2>

For implementation details of Linear Regression with Ridge regularization, please see the links below on Databricks or GitHub.

+ Link on Databricks: https://dbc-b1c912e7-d804.cloud.databricks.com/?o=7564214546094626#notebook/2791835342844630
+ Link on GitHub: https://github.com/UCB-w261/project-sp20-team-25/blob/master/W261_SPRING_TOY_EXAMPLE_TEAM25.ipynb

<h2 align="center">Conclusion:</h2>

The primary evaluation metric chosen was \\( R^2 \\).<br>
<b>Linear Regression with an L2 penalty turned out to be the best-performing algorithm.</b><br>
Based on the \\( R^2 \\) value, ~95% of the variance in the dependent variable (ARR_DELAY) is explained by the Linear Regression model with L2 regularization. On the unseen test dataset its RMSE is 13.61 mins and its MAE is 9.52 mins, i.e. the predicted arrival delay is off by about 9.5 minutes on average.

<h4>Learnings</h4>
<ul>
  <li>Spark dataframes were much more efficient than RDDs</li>
  <li>Parquet format worked better</li>
  <li>Sparse Matrix handled better in Spark (VectorAssembler, OneHotEncoderEstimator)</li>
  <li>Learning curve for Spark SQL</li>
  </ul>

<h2 align="center">Application of Course Concepts:</h2>

A good model is not one that merely gives accurate predictions on the training data, but one that also predicts well on new, unseen data, avoiding both overfitting and underfitting.
To avoid overfitting, a <b>5-fold CrossValidator</b> was used, with grid search for hyperparameter tuning. 

In 5-fold cross validation, the dataset is divided into 5 equal parts (folds) and the process below runs 5 times, each time with a different fold held out.
1. Take one fold as the holdout (test) set
2. Take the remaining folds as the training set
3. Fit a model on the training set and evaluate it on the holdout set
4. Retain the evaluation scores and models for comparison

At the end of this process, select the best-performing model, along with its parameters, based on the chosen evaluation metric.
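
The steps above, combined with a grid of candidate parameters, can be sketched in plain Python. This is an illustration under stated assumptions: all function names are hypothetical, folds are assigned by row position rather than randomly, and the model is a one-feature ridge regression without intercept, scored by RMSE (lower is better).

```python
def k_fold_indices(num_rows, num_folds):
    """Assign row i to fold i % num_folds (a simplification of random fold assignment)."""
    return [[i for i in range(num_rows) if i % num_folds == f] for f in range(num_folds)]

def fit_ridge_slope(xs, ys, reg_param):
    """Closed-form slope minimising mean squared error / 2 plus reg_param / 2 * slope^2."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + n * reg_param)

def rmse(slope, xs, ys):
    return (sum((slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5

def cross_validate(xs, ys, candidates, num_folds=5):
    """Return the candidate with the lowest average holdout RMSE, plus all averages."""
    folds = k_fold_indices(len(xs), num_folds)
    averages = {}
    for reg_param in candidates:
        scores = []
        for holdout in folds:
            held = set(holdout)
            train_x = [x for i, x in enumerate(xs) if i not in held]
            train_y = [y for i, y in enumerate(ys) if i not in held]
            slope = fit_ridge_slope(train_x, train_y, reg_param)
            scores.append(rmse(slope, [xs[i] for i in holdout], [ys[i] for i in holdout]))
        averages[reg_param] = sum(scores) / len(scores)
    return min(averages, key=averages.get), averages

xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x for x in xs]                   # noiseless toy data
best, averages = cross_validate(xs, ys, [0.1, 0.01, 0.001])
print(best)  # 0.001
```

On this noiseless toy data the smallest penalty wins because any shrinkage only pulls the slope away from its true value. After choosing the best combination, Spark's `CrossValidator` refits it on the full training set, a step this sketch leaves out.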

Along with the 5-fold CrossValidator, regularization was also used to avoid overfitting.<p>
  <b>Regularization</b> adds a penalty on the size of the model coefficients to the training objective, which discourages fits that follow the training set too closely. 

Linear Regression was run with regularization parameter (lambda) values of [0.1, 0.01, 0.001], and the best value was 0.01, combined with an L2 (Ridge) penalty.

<b>One hot encoding</b> converts each categorical variable into a binary indicator vector, a form that ML algorithms can consume directly.
One hot encoding was used for categorical variables such as 'ORIGIN_CARRIER', 'DEST_CARRIER', 'FLIGHT_BEARING' etc. The \*\_DELAY features (e.g. WEATHER_DELAY, CARRIER_DELAY) were also one hot encoded.

<h5>Decision Tree - Regression</h5>

A decision tree builds regression or classification models in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast and Rainy), each representing values of the attribute tested. A leaf node (e.g., Hours Played) represents a decision on the numerical target. The topmost decision node in the tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data. 

Decision Tree Regression was used as one of the algorithms to predict Late Arrival Delay in the US