# Machine Learning

This file contains instructions on how to approach the challenge.

We are going to work on different types of Machine Learning problems:

- **Regression Problem**: The goal is to predict flight delays.
- **(Stretch) Multiclass Classification**: If a flight is delayed, we will predict what type of delay it is.
- **(Stretch) Binary Classification**: The goal is to predict whether a flight will be cancelled.

In [94]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import OrdinalEncoder

## Main Task: Regression Problem

The target variable is **ARR_DELAY**. We need to be careful about which columns we use and which we don't. For example, DEP_DELAY would be a near-perfect predictor, but we can't use it because in a real-life scenario we want to predict the delay before the flight takes off. We can use the average delay from earlier days, but not the delay of the flight we are predicting.

Likewise, the variables **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY** shouldn't be used directly as predictors. However, we can create various transformations from earlier values.
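One way to build a leakage-free feature from earlier values is a lagged mean per carrier. The snippet below is a minimal sketch on made-up data: the column names mirror the flights table, while `carrier_prev_mean_delay` is a hypothetical feature name. It assumes one row per carrier per day; with several flights per day, or a one-week prediction horizon, the delays would first need to be aggregated by day and shifted by the appropriate number of days.

```python
import pandas as pd

# Toy data; column names mirror the flights table, values are made up.
flights = pd.DataFrame({
    "fl_date": pd.to_datetime(
        ["2019-01-01", "2019-01-02", "2019-01-03", "2019-01-01", "2019-01-02"]),
    "mkt_carrier": ["AA", "AA", "AA", "WN", "WN"],
    "arr_delay": [10.0, 20.0, 30.0, -5.0, 5.0],
})

flights = flights.sort_values(["mkt_carrier", "fl_date"]).reset_index(drop=True)

# shift(1) removes the current row from its own statistic, so the feature
# only uses delays observed before the flight being predicted.
flights["carrier_prev_mean_delay"] = (
    flights.groupby("mkt_carrier")["arr_delay"]
    .transform(lambda s: s.shift(1).expanding().mean())
)
print(flights)
```

The first flight of each carrier has no history, so its feature value is missing and needs an imputation strategy of your choice.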

We will be evaluating your models by predicting the ARR_DELAY for all flights **1 week in advance**.

### Feature Engineering

Feature engineering will play a crucial role in these problems. We have only a few attributes, so we need to create features that carry some predictive power.

- weather: we can use a weather API to look up the weather at the time of the scheduled departure and scheduled arrival.
- statistics (mean, median, std, min, max...): we can take a look at previous delays and compute descriptive statistics
- airports encoding: we need to think about what to do with the airports and other categorical variables
- time of the day: the delay probably depends on the airport traffic which varies during the day.
- airport traffic
- unsupervised learning as feature engineering?
- **what are the additional options?**: Think about what we could do more to improve the model.
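Two of the ideas above, time of day and airport traffic, can be derived from the schedule alone. The following is a rough sketch on made-up rows; `dep_hour` and `origin_daily_flights` are hypothetical feature names, and counting scheduled departures per airport per day is only one possible proxy for traffic.

```python
import pandas as pd

flights = pd.DataFrame({
    "fl_date": ["2019-05-10", "2019-05-10", "2019-05-10", "2019-05-11"],
    "origin": ["BWI", "BWI", "JFK", "BWI"],
    "crs_dep_time": [1105, 1545, 600, 915],  # scheduled departure as HHMM
})

# Scheduled departure hour (0-23) taken from the HHMM integer.
flights["dep_hour"] = flights["crs_dep_time"] // 100

# Number of scheduled departures per origin airport per day.
flights["origin_daily_flights"] = (
    flights.groupby(["origin", "fl_date"])["crs_dep_time"].transform("count")
)
print(flights)
```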

# Airports and Time of Day Encoding

In [14]:
df_flights_test = pd.read_csv('../../../data/preprocessed_data/flights_test(raw_random).csv') 
df_flights_test.head()

Unnamed: 0,fl_date,mkt_unique_carrier,branded_code_share,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,origin_city_name,dest_airport_id,dest,dest_city_name,crs_dep_time,crs_arr_time,dup,crs_elapsed_time,flights,distance
0,1577854800000,WN,WN,WN,5888,WN,N951WN,5888,13891,ONT,"Ontario, CA",14771,SFO,"San Francisco, CA",1810,1945,N,95,1,363
1,1577854800000,WN,WN,WN,6276,WN,N467WN,6276,13891,ONT,"Ontario, CA",14771,SFO,"San Francisco, CA",1150,1320,N,90,1,363
2,1577854800000,WN,WN,WN,4598,WN,N7885A,4598,13891,ONT,"Ontario, CA",14831,SJC,"San Jose, CA",2020,2130,N,70,1,333
3,1577854800000,WN,WN,WN,4761,WN,N551WN,4761,13891,ONT,"Ontario, CA",14831,SJC,"San Jose, CA",1340,1455,N,75,1,333
4,1577854800000,WN,WN,WN,5162,WN,N968WN,5162,13891,ONT,"Ontario, CA",14831,SJC,"San Jose, CA",915,1035,N,80,1,333


In [4]:
df_flights = pd.read_csv('../../../data/preprocessed_data/df_flights_clean.csv')
pd.set_option('display.max_columns', None)
df_flights.head()

Unnamed: 0,fl_date,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,origin,origin_city_name,dest,dest_city_name,crs_dep_time,dep_time,dep_delay,taxi_out,wheels_off,wheels_on,taxi_in,crs_arr_time,arr_time,arr_delay,actual_elapsed_time,air_time,distance
0,2019-05-10,WN,1912,WN,BWI,"Baltimore, MD",FLL,"Fort Lauderdale, FL",1105,1115.0,10.0,13.0,1128.0,1330.0,3.0,1345,1333.0,-12.0,138.0,122.0,925
1,2019-04-27,AA,3666,MQ,JFK,"New York, NY",CLE,"Cleveland, OH",1545,1548.0,3.0,35.0,1623.0,1743.0,5.0,1747,1748.0,1.0,120.0,80.0,425
2,2018-03-08,WN,588,WN,ORF,"Norfolk, VA",BWI,"Baltimore, MD",1000,1012.0,12.0,8.0,1020.0,1058.0,4.0,1100,1102.0,2.0,50.0,38.0,159
3,2018-04-05,AA,1618,AA,PHX,"Phoenix, AZ",MIA,"Miami, FL",1330,1332.0,2.0,11.0,1343.0,2019.0,7.0,2048,2026.0,-22.0,234.0,216.0,1972
4,2018-01-31,UA,4171,EV,CHA,"Chattanooga, TN",EWR,"Newark, NJ",600,547.0,-13.0,19.0,606.0,745.0,20.0,818,805.0,-13.0,138.0,99.0,718


In [19]:
#drop unnecessary columns (columns= already selects the axis, so axis=1 is not needed)
df_flights_feat = df_flights.drop(columns = ['op_unique_carrier', 'dep_time', 'arr_time', 'origin_city_name', 'dest_city_name', 'taxi_out', 'wheels_off', 'wheels_on', 'taxi_in', 'actual_elapsed_time'])
df_flights_feat.head()

Unnamed: 0,fl_date,mkt_carrier,mkt_carrier_fl_num,origin,dest,crs_dep_time,dep_delay,crs_arr_time,arr_delay,air_time,distance
0,2019-05-10,WN,1912,BWI,FLL,1105,10.0,1345,-12.0,122.0,925
1,2019-04-27,AA,3666,JFK,CLE,1545,3.0,1747,1.0,80.0,425
2,2018-03-08,WN,588,ORF,BWI,1000,12.0,1100,2.0,38.0,159
3,2018-04-05,AA,1618,PHX,MIA,1330,2.0,2048,-22.0,216.0,1972
4,2018-01-31,UA,4171,CHA,EWR,600,-13.0,818,-13.0,99.0,718


In [20]:
#create bins and labels for each time of day category
bins = [-1, 599, 1199, 1700, 2400]
labels = ['late night', 'morning', 'afternoon', 'evening']
df_flights_feat['crs_dep_time_cat'] = pd.cut(df_flights_feat['crs_dep_time'], bins=bins, labels=labels)
df_flights_feat.head()

Unnamed: 0,fl_date,mkt_carrier,mkt_carrier_fl_num,origin,dest,crs_dep_time,dep_delay,crs_arr_time,arr_delay,air_time,distance,crs_dep_time_cat
0,2019-05-10,WN,1912,BWI,FLL,1105,10.0,1345,-12.0,122.0,925,morning
1,2019-04-27,AA,3666,JFK,CLE,1545,3.0,1747,1.0,80.0,425,afternoon
2,2018-03-08,WN,588,ORF,BWI,1000,12.0,1100,2.0,38.0,159,morning
3,2018-04-05,AA,1618,PHX,MIA,1330,2.0,2048,-22.0,216.0,1972,afternoon
4,2018-01-31,UA,4171,CHA,EWR,600,-13.0,818,-13.0,99.0,718,morning
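By default `pd.cut` builds right-inclusive intervals, so each bin above is of the form (left, right]. That is why the 600 departure in the last row is labelled morning rather than late night. A small check on boundary values, using the same bins and labels with a made-up series of times:

```python
import pandas as pd

bins = [-1, 599, 1199, 1700, 2400]
labels = ["late night", "morning", "afternoon", "evening"]

# Boundary values: 599 and 600 fall on different sides of the first edge,
# as do 1199/1200 and 1700/1701.
times = pd.Series([0, 599, 600, 1199, 1200, 1700, 1701, 2359])
cats = pd.cut(times, bins=bins, labels=labels)
print(pd.DataFrame({"crs_dep_time": times, "category": cats}))
```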


In [32]:
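# Label encode airline and airport columns
# fit on origins and destinations together so an airport that appears only as a destination still gets a code
le = preprocessing.LabelEncoder()
le.fit(pd.concat([df_flights_feat['origin'], df_flights_feat['dest']]))
df_flights_feat['origin_encoded'] = le.transform(df_flights_feat['origin'])
df_flights_feat['dest_encoded'] = le.transform(df_flights_feat['dest'])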

label_enc = preprocessing.LabelEncoder()
label_enc.fit(df_flights_feat['mkt_carrier'])
df_flights_feat['airline_encoded'] = label_enc.transform(df_flights_feat['mkt_carrier'])

label_enc2 = preprocessing.LabelEncoder()
label_enc2.fit(df_flights_feat['mkt_carrier_fl_num'])
df_flights_feat['fl_num_encoded'] = label_enc2.transform(df_flights_feat['mkt_carrier_fl_num'])

df_flights_feat.head()


Unnamed: 0,fl_date,mkt_carrier,mkt_carrier_fl_num,origin,dest,crs_dep_time,dep_delay,crs_arr_time,arr_delay,air_time,distance,crs_dep_time_cat,origin_encoded,dest_encoded,airline_encoded,fl_num_encoded
0,2019-05-10,WN,1912,BWI,FLL,1105,10.0,1345,-12.0,122.0,925,morning,59,127,10,1908
1,2019-04-27,AA,3666,JFK,CLE,1545,3.0,1747,1.0,80.0,425,afternoon,185,72,0,3661
2,2018-03-08,WN,588,ORF,BWI,1000,12.0,1100,2.0,38.0,159,morning,258,59,10,586
3,2018-04-05,AA,1618,PHX,MIA,1330,2.0,2048,-22.0,216.0,1972,afternoon,272,229,0,1614
4,2018-01-31,UA,4171,CHA,EWR,600,-13.0,818,-13.0,99.0,718,morning,66,119,8,4166
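A label encoder fitted on the training data raises an error when the test file contains an airport it has never seen. One possible way around this, sketched below on made-up frames, is `OrdinalEncoder` with `handle_unknown="use_encoded_value"` (this assumes scikit-learn 0.24 or newer). The choice of `-1` as the code for unseen airports is an assumption for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up train/test frames; ONT and SFO do not occur in train.
train = pd.DataFrame({"origin": ["BWI", "JFK", "ORF"], "dest": ["FLL", "CLE", "BWI"]})
test = pd.DataFrame({"origin": ["ONT", "JFK"], "dest": ["SFO", "BWI"]})

# One shared, sorted airport list so origin and dest use the same codes.
airports = sorted(set(train["origin"]) | set(train["dest"]))

enc = OrdinalEncoder(
    categories=[airports, airports],
    handle_unknown="use_encoded_value",
    unknown_value=-1,  # assumed placeholder code for unseen airports
)
enc.fit(train[["origin", "dest"]])
test_codes = enc.transform(test[["origin", "dest"]])
print(test_codes)
```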


In [40]:
# get dummy variables for estimated departure time
df_flights_enc = pd.get_dummies(df_flights_feat, columns=['crs_dep_time_cat'], drop_first=False)
df_flights_enc.head()

Unnamed: 0,fl_date,mkt_carrier,mkt_carrier_fl_num,origin,dest,crs_dep_time,dep_delay,crs_arr_time,arr_delay,air_time,distance,origin_encoded,dest_encoded,airline_encoded,fl_num_encoded,crs_dep_time_cat_late night,crs_dep_time_cat_morning,crs_dep_time_cat_afternoon,crs_dep_time_cat_evening
0,2019-05-10,WN,1912,BWI,FLL,1105,10.0,1345,-12.0,122.0,925,59,127,10,1908,0,1,0,0
1,2019-04-27,AA,3666,JFK,CLE,1545,3.0,1747,1.0,80.0,425,185,72,0,3661,0,0,1,0
2,2018-03-08,WN,588,ORF,BWI,1000,12.0,1100,2.0,38.0,159,258,59,10,586,0,1,0,0
3,2018-04-05,AA,1618,PHX,MIA,1330,2.0,2048,-22.0,216.0,1972,272,229,0,1614,0,0,1,0
4,2018-01-31,UA,4171,CHA,EWR,600,-13.0,818,-13.0,99.0,718,66,119,8,4166,0,1,0,0


# Average Delay for each Airline

In [57]:
#re-import the cleaned data
df_flights = pd.read_csv('../../../data/preprocessed_data/df_flights_clean.csv')

#arrival delay series taken from the encoded dataframe
arr_delay = df_flights_enc.arr_delay

#lower and upper limits
q1_arr_delay = arr_delay.quantile(0.25)
q3_arr_delay = arr_delay.quantile(0.75)
iqr_arr_delay = q3_arr_delay - q1_arr_delay
lower_limit_arr_delay = q1_arr_delay - 1.5 * iqr_arr_delay
upper_limit_arr_delay = q3_arr_delay + 1.5 * iqr_arr_delay

#remove arrival delays outliers
df_flights_no_outliers = df_flights_enc[(df_flights_enc.arr_delay>lower_limit_arr_delay) & (df_flights_enc.arr_delay<upper_limit_arr_delay)]


In [58]:
df_flights_no_outliers.head()

Unnamed: 0,fl_date,mkt_carrier,mkt_carrier_fl_num,origin,dest,crs_dep_time,dep_delay,crs_arr_time,arr_delay,air_time,distance,origin_encoded,dest_encoded,airline_encoded,fl_num_encoded,crs_dep_time_cat_late night,crs_dep_time_cat_morning,crs_dep_time_cat_afternoon,crs_dep_time_cat_evening
0,2019-05-10,WN,1912,BWI,FLL,1105,10.0,1345,-12.0,122.0,925,59,127,10,1908,0,1,0,0
1,2019-04-27,AA,3666,JFK,CLE,1545,3.0,1747,1.0,80.0,425,185,72,0,3661,0,0,1,0
2,2018-03-08,WN,588,ORF,BWI,1000,12.0,1100,2.0,38.0,159,258,59,10,586,0,1,0,0
3,2018-04-05,AA,1618,PHX,MIA,1330,2.0,2048,-22.0,216.0,1972,272,229,0,1614,0,0,1,0
4,2018-01-31,UA,4171,CHA,EWR,600,-13.0,818,-13.0,99.0,718,66,119,8,4166,0,1,0,0


In [68]:
df_flights_no_outliers['avg_arr_delay'] = df_flights_no_outliers.loc[:, 'mkt_carrier']
df_flights_no_outliers

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_flights_no_outliers['avg_arr_delay'] = df_flights_no_outliers.loc[:, 'mkt_carrier']


Unnamed: 0,fl_date,mkt_carrier,mkt_carrier_fl_num,origin,dest,crs_dep_time,dep_delay,crs_arr_time,arr_delay,air_time,distance,origin_encoded,dest_encoded,airline_encoded,fl_num_encoded,crs_dep_time_cat_late night,crs_dep_time_cat_morning,crs_dep_time_cat_afternoon,crs_dep_time_cat_evening,avg_arr_delay
0,2019-05-10,WN,1912,BWI,FLL,1105,10.0,1345,-12.0,122.0,925,59,127,10,1908,0,1,0,0,WN
1,2019-04-27,AA,3666,JFK,CLE,1545,3.0,1747,1.0,80.0,425,185,72,0,3661,0,0,1,0,AA
2,2018-03-08,WN,588,ORF,BWI,1000,12.0,1100,2.0,38.0,159,258,59,10,586,0,1,0,0,WN
3,2018-04-05,AA,1618,PHX,MIA,1330,2.0,2048,-22.0,216.0,1972,272,229,0,1614,0,0,1,0,AA
4,2018-01-31,UA,4171,CHA,EWR,600,-13.0,818,-13.0,99.0,718,66,119,8,4166,0,1,0,0,UA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
293916,2018-02-25,AA,3709,ORD,GRB,1319,-7.0,1414,-17.0,34.0,173,257,141,0,3704,0,0,1,0,AA
293917,2018-04-13,AA,3678,TLH,MIA,605,-2.0,731,-8.0,62.0,402,351,229,0,3673,0,1,0,0,AA
293918,2018-10-06,DL,3674,SLC,BZN,1105,-4.0,1230,-15.0,52.0,347,332,60,3,3669,0,1,0,0,DL
293919,2018-12-13,DL,892,ATL,LEX,1814,1.0,1927,-1.0,46.0,304,22,203,3,889,0,0,0,1,DL
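The `SettingWithCopyWarning` in the output above appears because `df_flights_no_outliers` was created by filtering another dataframe, and pandas cannot tell whether the new column is meant for the filtered result or the original. A minimal sketch of the same IQR filter on made-up data, taking an explicit `.copy()` before adding a column (`carrier_copy` is a hypothetical column name):

```python
import pandas as pd

df = pd.DataFrame({
    "mkt_carrier": ["WN", "AA", "WN", "AA", "UA", "DL"],
    "arr_delay": [-12.0, 1.0, 2.0, -22.0, -13.0, 2041.0],
})

# IQR limits, as in the cell that removes arrival delay outliers.
q1 = df["arr_delay"].quantile(0.25)
q3 = df["arr_delay"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
mask = (df["arr_delay"] > lower) & (df["arr_delay"] < upper)

# .copy() makes the filtered frame independent of df, so assigning a new
# column to it is unambiguous.
df_no_outliers = df.loc[mask].copy()
df_no_outliers["carrier_copy"] = df_no_outliers["mkt_carrier"]
print(df_no_outliers)
```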


In [104]:
#use groupby method to find mean of arrival delay for each carrier using dataset without outliers
avg_delay_outlier = df_flights_no_outliers.groupby(['mkt_carrier'])['arr_delay'].mean()
#convert series into dataframe
df_avg_delay_outlier = avg_delay_outlier.to_frame()
df_avg_delay_outlier.head()

Unnamed: 0_level_0,arr_delay
mkt_carrier,Unnamed: 1_level_1
AA,-4.730407
AS,-5.009705
B6,-5.768038
DL,-7.602095
F9,-5.123199


In [105]:
#use groupby method to find mean of arrival delay for each carrier
avg_delay = df_flights_enc.groupby(['mkt_carrier'])['arr_delay'].mean()
#convert series into dataframe
df_avg_delay = avg_delay.to_frame()
df_avg_delay.head()

Unnamed: 0_level_0,arr_delay
mkt_carrier,Unnamed: 1_level_1
AA,6.38164
AS,0.794956
B6,11.477069
DL,2.541737
F9,12.437104


In [87]:
#create bins and labels for categorization
bins_delay = [-1.0, 5.0, 10.0, 15.0]
labels_delay = ['low', 'medium', 'high']
df_avg_delay['avg_arr_delay'] = pd.cut(df_avg_delay['arr_delay'], bins=bins_delay, labels=labels_delay)
df_avg_delay.head()

Unnamed: 0_level_0,arr_delay,avg_arr_delay
mkt_carrier,Unnamed: 1_level_1,Unnamed: 2_level_1
AA,6.38164,medium
AS,0.794956,low
B6,11.477069,high
DL,2.541737,low
F9,12.437104,high


In [106]:
# df_avg_delay.set_index('index', inplace=True)
df_avg_delay['mkt_carrier'] = df_avg_delay.index
df_avg_delay.head()

Unnamed: 0_level_0,arr_delay,mkt_carrier
mkt_carrier,Unnamed: 1_level_1,Unnamed: 2_level_1
AA,6.38164,AA
AS,0.794956,AS
B6,11.477069,B6
DL,2.541737,DL
F9,12.437104,F9


In [90]:
#reset index of dataframe
df_avg_delay.reset_index(drop=True, inplace=True)
df_avg_delay.head()

In [92]:
# map avg_arr_delay values to main dataframe
df_flights_enc['avg_arr_delay'] = df_flights_enc['mkt_carrier'].map(df_avg_delay.set_index('mkt_carrier')['avg_arr_delay'])
df_flights_enc.head()
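Because `groupby(...).mean()` already returns a Series indexed by carrier, it can be passed to `map` directly, without copying the index into a column and resetting it. The sketch below shows that shorter route on made-up data; `carrier_mean_delay` is a hypothetical column name, and any carrier whose mean falls outside the bins would get a missing category.

```python
import pandas as pd

df = pd.DataFrame({
    "mkt_carrier": ["AA", "AA", "WN", "WN", "UA"],
    "arr_delay": [10.0, 2.0, -4.0, 8.0, 6.0],
})

# Series indexed by carrier; map() looks each row's carrier up in it.
carrier_mean = df.groupby("mkt_carrier")["arr_delay"].mean()
df["carrier_mean_delay"] = df["mkt_carrier"].map(carrier_mean)

# Same bins and labels as in the categorization cell.
bins_delay = [-1.0, 5.0, 10.0, 15.0]
labels_delay = ["low", "medium", "high"]
df["avg_arr_delay"] = pd.cut(df["carrier_mean_delay"], bins=bins_delay, labels=labels_delay)
print(df)
```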

In [101]:
# encode average arrival delay using OrdinalEncoder
# note: with no categories given, the codes follow alphabetical order (high=0, low=1, medium=2)
# reshape data
x = np.asarray(df_flights_enc['avg_arr_delay']).reshape(-1, 1)
# define ordinal encoding
ocode = OrdinalEncoder()
# fit data
ocode.fit(x)
# transform data
df_flights_enc['avg_arr_delay_enc'] = ocode.transform(x)
df_flights_enc.head()

Unnamed: 0,fl_date,mkt_carrier,mkt_carrier_fl_num,origin,dest,crs_dep_time,dep_delay,crs_arr_time,arr_delay,air_time,distance,origin_encoded,dest_encoded,airline_encoded,fl_num_encoded,crs_dep_time_cat_late night,crs_dep_time_cat_morning,crs_dep_time_cat_afternoon,crs_dep_time_cat_evening,avg_arr_delay,avg_arr_delay_enc
0,2019-05-10,WN,1912,BWI,FLL,1105,10.0,1345,-12.0,122.0,925,59,127,10,1908,0,1,0,0,low,1.0
1,2019-04-27,AA,3666,JFK,CLE,1545,3.0,1747,1.0,80.0,425,185,72,0,3661,0,0,1,0,medium,2.0
2,2018-03-08,WN,588,ORF,BWI,1000,12.0,1100,2.0,38.0,159,258,59,10,586,0,1,0,0,low,1.0
3,2018-04-05,AA,1618,PHX,MIA,1330,2.0,2048,-22.0,216.0,1972,272,229,0,1614,0,0,1,0,medium,2.0
4,2018-01-31,UA,4171,CHA,EWR,600,-13.0,818,-13.0,99.0,718,66,119,8,4166,0,1,0,0,medium,2.0
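In the output above, low is encoded as 1.0 and medium as 2.0 because `OrdinalEncoder` sorts string categories alphabetically unless told otherwise, which leaves high at 0. If the codes are meant to reflect the low < medium < high ordering, the order can be passed explicitly. A small sketch on a made-up array:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

x = np.array(["low", "medium", "high", "low"], dtype=object).reshape(-1, 1)

# Default behaviour: categories are sorted alphabetically.
default_codes = OrdinalEncoder().fit_transform(x).ravel()

# Explicit order: low -> 0, medium -> 1, high -> 2.
ordered = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordered_codes = ordered.fit_transform(x).ravel()

print(default_codes)
print(ordered_codes)
```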


In [102]:
#drop unnecessary columns
df_flights_enc_clean = df_flights_enc.drop(columns = ['mkt_carrier', 'mkt_carrier_fl_num', 'origin', 'dest'])
df_flights_enc_clean['arr_delay'].describe()

count    293922.000000
mean          5.547431
std          50.375212
min        -160.000000
25%         -15.000000
50%          -6.000000
75%           8.000000
max        2041.000000
Name: arr_delay, dtype: float64

In [103]:
df_flights_enc_clean.head()

Unnamed: 0,fl_date,crs_dep_time,dep_delay,crs_arr_time,arr_delay,air_time,distance,origin_encoded,dest_encoded,airline_encoded,fl_num_encoded,crs_dep_time_cat_late night,crs_dep_time_cat_morning,crs_dep_time_cat_afternoon,crs_dep_time_cat_evening,avg_arr_delay,avg_arr_delay_enc
0,2019-05-10,1105,10.0,1345,-12.0,122.0,925,59,127,10,1908,0,1,0,0,low,1.0
1,2019-04-27,1545,3.0,1747,1.0,80.0,425,185,72,0,3661,0,0,1,0,medium,2.0
2,2018-03-08,1000,12.0,1100,2.0,38.0,159,258,59,10,586,0,1,0,0,low,1.0
3,2018-04-05,1330,2.0,2048,-22.0,216.0,1972,272,229,0,1614,0,0,1,0,medium,2.0
4,2018-01-31,600,-13.0,818,-13.0,99.0,718,66,119,8,4166,0,1,0,0,medium,2.0


### Feature Selection / Dimensionality Reduction

We need to apply different selection techniques to find out which one will be the best for our problems.

- Original Features vs. PCA components?
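One way to compare the two options is to cross-validate the same model with and without a PCA step. The sketch below uses synthetic data and a linear model purely for illustration; the number of components (3) and the data itself are assumptions, not values taken from the flights dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target, stand-ins for the engineered flight features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200)

original = make_pipeline(StandardScaler(), LinearRegression())
with_pca = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression())

# Mean cross-validated R^2 for each variant.
r2_original = cross_val_score(original, X, y, cv=5, scoring="r2").mean()
r2_pca = cross_val_score(with_pca, X, y, cv=5, scoring="r2").mean()
print(f"original features: {r2_original:.3f}, PCA components: {r2_pca:.3f}")
```

Keeping the scaler and PCA inside the pipeline means they are refitted on each training fold, so the validation folds do not influence the components.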

### Modeling

Use different ML techniques to solve each problem.

- linear / logistic / multinomial logistic regression
- Naive Bayes
- Random Forest
- SVM
- XGBoost
- The ensemble of your own choice
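A simple loop makes it easy to compare several of these models under the same cross-validation scheme. The following is a sketch for the regression task on synthetic data; the two models and their hyperparameters are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data with one linear and one non-linear effect.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + rng.normal(scale=0.3, size=300)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# scikit-learn reports negated RMSE so that higher is better; flip the sign back.
rmse = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
    rmse[name] = -scores.mean()

print(rmse)
```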

### Evaluation

You have data from 2018 and 2019 to develop models. Use different evaluation metrics for each problem and compare the performance of different models.

You are required to predict delays on **out of sample** data from the **first 7 days (1st-7th) of January 2020** and to share the file with LighthouseLabs. A sample submission can be found in the file **_sample_submission.csv_**
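Since the final predictions are for dates after the training period, a time-based validation split gives a more realistic estimate than a random one. Below is a minimal sketch on synthetic data, holding out December 2019 as the validation period; the cut-off date, the single `distance` feature, and the model are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic daily data covering 2018-2019.
rng = np.random.default_rng(1)
dates = pd.date_range("2018-01-01", "2019-12-31", freq="D")
df = pd.DataFrame({"fl_date": dates, "distance": rng.uniform(100, 2500, len(dates))})
df["arr_delay"] = 0.01 * df["distance"] + rng.normal(scale=10.0, size=len(df))

# Train on everything before the cut-off, validate on what comes after.
train = df[df["fl_date"] < "2019-12-01"]
valid = df[df["fl_date"] >= "2019-12-01"]

model = LinearRegression().fit(train[["distance"]], train["arr_delay"])
pred = model.predict(valid[["distance"]])

mae = mean_absolute_error(valid["arr_delay"], pred)
rmse = np.sqrt(mean_squared_error(valid["arr_delay"], pred))
r2 = r2_score(valid["arr_delay"], pred)
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.3f}")
```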

======================================================================
## Stretch Tasks

### Multiclass Classification

The target variables are **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY**. We need to do additional transformations because these variables are not binary but continuous. For each flight that was delayed, we need to have one of these variables as 1 and the others as 0.

It can happen that we have two types of delays with more than 0 minutes. In this case, take the bigger one as 1 and others as 0.
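The transformation described above can be sketched with `idxmax` across the delay columns. The lowercase column names below are an assumption (matching the lowercase style of the flights table) and the values are made up. Note that when two delay types tie, `idxmax` picks the first column in the list.

```python
import pandas as pd

delay_cols = ["carrier_delay", "weather_delay", "nas_delay",
              "security_delay", "late_aircraft_delay"]

# Made-up delay minutes; the last flight has no recorded delay cause.
delays = pd.DataFrame(
    [[15.0, 0.0, 0.0, 0.0, 0.0],
     [5.0, 0.0, 30.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.0]],
    columns=delay_cols,
)

# Keep only flights with some delay, then take the largest delay type as the class.
delayed = delays[delays[delay_cols].sum(axis=1) > 0]
delay_type = delayed[delay_cols].idxmax(axis=1)

# One-hot form: exactly one delay column is 1 per delayed flight.
one_hot = (
    pd.get_dummies(delay_type)
    .reindex(columns=delay_cols, fill_value=0)
    .astype(int)
)
print(one_hot)
```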

### Binary Classification

The target variable is **CANCELLED**. The main problem here is going to be a huge class imbalance: only a small fraction of all flights are cancelled. It is important to do the right sampling before training and to choose appropriate evaluation metrics.
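As a starting point, the sketch below combines a stratified split, class weighting, and metrics that stay informative when positives are rare (F1 and average precision). The data is synthetic with a few percent positives, and logistic regression with `class_weight="balanced"` is just one possible choice; resampling the training set would be an alternative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced target: only a few percent of rows are positive.
rng = np.random.default_rng(7)
n = 2000
X = rng.normal(size=(n, 3))
logits = -4.5 + 2.0 * X[:, 0]
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# stratify=y keeps the positive rate similar in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" up-weights the rare class during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

f1 = f1_score(y_test, clf.predict(X_test))
pr_auc = average_precision_score(y_test, proba)
print(f"positive rate={y.mean():.3f}  F1={f1:.3f}  PR-AUC={pr_auc:.3f}")
```

Accuracy is deliberately left out: a model that never predicts a cancellation would still score very high on it.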