# Machine Learning

This file contains instructions on how to approach the challenge.

We are going to work on different types of Machine Learning problems:

- **Regression Problem**: The goal is to predict the delay of flights.
- **(Stretch) Multiclass Classification**: If the plane was delayed, we will predict what type of delay it is (will be).
- **(Stretch) Binary Classification**: The goal is to predict if the flight will be cancelled.

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import OrdinalEncoder

## Main Task: Regression Problem

The target variable is **ARR_DELAY**. We need to be careful about which columns we use and which we don't. For example, DEP_DELAY would be a near-perfect predictor, but we can't use it: in a real-life scenario we want to predict the delay before the flight takes off. We can use the average delay from earlier days, but not the delay of the flight we are predicting.  

Similarly, the variables **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY** shouldn't be used directly as predictors. However, we can create various transformations from earlier values.
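
One way to build such a transformation is sketched below on a toy frame: compute each carrier's daily mean arrival delay, then give every flight a rolling mean over days that ended at least 7 days before its date. The helper name `lagged_carrier_delay`, the 7-day lag, the 30-day window and the lowercase column names are assumptions for illustration, not part of the challenge.

```python
import pandas as pd

def lagged_carrier_delay(df, lag_days=7, window_days=30):
    """Attach a per-carrier mean of daily mean delays from a past window.

    Hypothetical helper: expects columns fl_date (datetime64), mkt_carrier, arr_delay.
    """
    daily = (
        df.groupby(["mkt_carrier", "fl_date"])["arr_delay"]
        .mean()
        .unstack("mkt_carrier")   # rows: dates, columns: carriers
        .asfreq("D")              # one row per calendar day, gaps become NaN
    )
    # rolling mean over the window, then shift so a date only sees
    # days that are at least lag_days old
    past = daily.rolling(window_days, min_periods=1).mean().shift(lag_days)
    long = (
        past.rename_axis(index="fl_date")
        .reset_index()
        .melt(id_vars="fl_date", var_name="mkt_carrier",
              value_name="carrier_delay_lagged")
    )
    return df.merge(long, how="left", on=["fl_date", "mkt_carrier"])

toy = pd.DataFrame({
    "fl_date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-09"]),
    "mkt_carrier": ["AA", "AA", "AA"],
    "arr_delay": [10.0, 20.0, 100.0],
})
out = lagged_carrier_delay(toy)
print(out)
```

Flights without enough history get NaN, which has to be imputed or handled by the model. The same pattern could be applied per origin airport or per route.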

We will be evaluating your models by predicting the ARR_DELAY for all flights **1 week in advance**.

### Feature Engineering

Feature engineering will play a crucial role in this problem. We have very few attributes, so we need to create features that carry some predictive power.

- weather: we can use a weather API to look up the weather at the time of the scheduled departure and scheduled arrival.
- statistics (mean, median, std, min, max...): we can take a look at previous delays and compute descriptive statistics
- airports encoding: we need to think about what to do with the airports and other categorical variables
- time of the day: the delay probably depends on the airport traffic which varies during the day.
- airport traffic
- unsupervised learning as feature engineering?
- **what are the additional options?**: Think about what we could do more to improve the model.
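
As one additional option for the time-of-day item, a cyclical encoding keeps the fact that 23:59 and 00:01 are close, which fixed bins cannot express. The sketch below assumes the scheduled times are HHMM integers, as in `crs_dep_time`; the helper name `add_cyclical_time` is hypothetical.

```python
import numpy as np
import pandas as pd

def add_cyclical_time(df, col):
    """Append sin/cos encodings of an HHMM integer column (e.g. 1545 means 15:45)."""
    minutes = (df[col] // 100) * 60 + (df[col] % 100)   # minutes since midnight
    angle = 2 * np.pi * (minutes % 1440) / 1440          # 2400 wraps around to 0
    out = df.copy()
    out[f"{col}_sin"] = np.sin(angle)
    out[f"{col}_cos"] = np.cos(angle)
    return out

toy = pd.DataFrame({"crs_dep_time": [0, 600, 1200, 1800, 2359]})
toy = add_cyclical_time(toy, "crs_dep_time")
print(toy)
```

Tree-based models can often work with the raw bins as well, so whether this helps is something to check empirically.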

# Airports and Time of Day Encoding

In [2]:
df_flights_test = pd.read_csv('../../../data/raw_data/flights_test(raw_random).csv') 
df_flights_test.head()

Unnamed: 0,fl_date,mkt_unique_carrier,branded_code_share,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,tail_num,op_carrier_fl_num,origin_airport_id,origin,origin_city_name,dest_airport_id,dest,dest_city_name,crs_dep_time,crs_arr_time,dup,crs_elapsed_time,flights,distance
0,1578286800000,DL,DL_CODESHARE,DL,5137,9E,N134EV,5137,11973,GPT,"Gulfport/Biloxi, MS",10397,ATL,"Atlanta, GA",1235,1503,N,88,1,352
1,1577941200000,NK,NK,NK,267,NK,N521NK,267,14100,PHL,"Philadelphia, PA",12892,LAX,"Los Angeles, CA",1645,2006,N,381,1,2402
2,1578286800000,DL,DL,DL,1370,DL,N866DN,1370,13931,ORF,"Norfolk, VA",10397,ATL,"Atlanta, GA",1714,1915,N,121,1,516
3,1578373200000,UA,UA_CODESHARE,UA,3867,ZW,N465AW,3867,14711,SCE,"State College, PA",13930,ORD,"Chicago, IL",1730,1841,N,131,1,528
4,1578373200000,UA,UA_CODESHARE,UA,3415,YX,N730YX,3415,12266,IAH,"Houston, TX",15370,TUL,"Tulsa, OK",1445,1618,N,93,1,429


In [3]:
df_flights = pd.read_csv('../../../data/preprocessed_data/df_flights_clean.csv')
pd.set_option('display.max_columns', None)
df_flights.head()

Unnamed: 0,fl_date,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,origin,dest,crs_dep_time,dep_time,dep_delay,taxi_out,wheels_off,wheels_on,taxi_in,crs_arr_time,arr_time,arr_delay,distance
0,2019-05-10,WN,1912,WN,BWI,FLL,1105,1115.0,10.0,13.0,1128.0,1330.0,3.0,1345,1333.0,-12.0,925
1,2019-04-27,AA,3666,MQ,JFK,CLE,1545,1548.0,3.0,35.0,1623.0,1743.0,5.0,1747,1748.0,1.0,425
2,2018-03-08,WN,588,WN,ORF,BWI,1000,1012.0,12.0,8.0,1020.0,1058.0,4.0,1100,1102.0,2.0,159
3,2018-04-05,AA,1618,AA,PHX,MIA,1330,1332.0,2.0,11.0,1343.0,2019.0,7.0,2048,2026.0,-22.0,1972
4,2018-01-31,UA,4171,EV,CHA,EWR,600,547.0,-13.0,19.0,606.0,745.0,20.0,818,805.0,-13.0,718


In [4]:
#drop unnecessary columns
df_flights_feat = df_flights.drop(columns=['arr_time', 'taxi_out', 'wheels_off', 'wheels_on', 'taxi_in'])
df_flights_feat.head()

Unnamed: 0,fl_date,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,origin,dest,crs_dep_time,dep_time,dep_delay,crs_arr_time,arr_delay,distance
0,2019-05-10,WN,1912,WN,BWI,FLL,1105,1115.0,10.0,1345,-12.0,925
1,2019-04-27,AA,3666,MQ,JFK,CLE,1545,1548.0,3.0,1747,1.0,425
2,2018-03-08,WN,588,WN,ORF,BWI,1000,1012.0,12.0,1100,2.0,159
3,2018-04-05,AA,1618,AA,PHX,MIA,1330,1332.0,2.0,2048,-22.0,1972
4,2018-01-31,UA,4171,EV,CHA,EWR,600,547.0,-13.0,818,-13.0,718


In [5]:
#create bins and labels for each time of day category
bins = [-1, 599, 1199, 1700, 2400]
labels = ['late night', 'morning', 'afternoon', 'evening']
df_flights_feat['crs_dep_time_cat'] = pd.cut(df_flights_feat['crs_dep_time'], bins=bins, labels=labels)
df_flights_feat['crs_arr_time_cat'] = pd.cut(df_flights_feat['crs_arr_time'], bins=bins, labels=labels)
df_flights_feat.head()

Unnamed: 0,fl_date,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,origin,dest,crs_dep_time,dep_time,dep_delay,crs_arr_time,arr_delay,distance,crs_dep_time_cat,crs_arr_time_cat
0,2019-05-10,WN,1912,WN,BWI,FLL,1105,1115.0,10.0,1345,-12.0,925,morning,afternoon
1,2019-04-27,AA,3666,MQ,JFK,CLE,1545,1548.0,3.0,1747,1.0,425,afternoon,evening
2,2018-03-08,WN,588,WN,ORF,BWI,1000,1012.0,12.0,1100,2.0,159,morning,morning
3,2018-04-05,AA,1618,AA,PHX,MIA,1330,1332.0,2.0,2048,-22.0,1972,afternoon,evening
4,2018-01-31,UA,4171,EV,CHA,EWR,600,547.0,-13.0,818,-13.0,718,morning,morning


In [6]:
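# Label encode airline and airport columns
# The airport encoder is fitted on 'origin' only, so transforming 'dest' assumes every
# destination airport also appears as an origin (transform raises ValueError otherwise)
le = preprocessing.LabelEncoder()
le.fit(df_flights_feat['origin'])
df_flights_feat['origin_encoded'] = le.transform(df_flights_feat['origin'])
df_flights_feat['dest_encoded'] = le.transform(df_flights_feat['dest'])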

label_enc = preprocessing.LabelEncoder()
label_enc.fit(df_flights_feat['mkt_carrier'])
df_flights_feat['airline_encoded'] = label_enc.transform(df_flights_feat['mkt_carrier'])

label_enc2 = preprocessing.LabelEncoder()
label_enc2.fit(df_flights_feat['mkt_carrier_fl_num'])
df_flights_feat['fl_num_encoded'] = label_enc2.transform(df_flights_feat['mkt_carrier_fl_num'])

df_flights_feat.head()


Unnamed: 0,fl_date,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,origin,dest,crs_dep_time,dep_time,dep_delay,crs_arr_time,arr_delay,distance,crs_dep_time_cat,crs_arr_time_cat,origin_encoded,dest_encoded,airline_encoded,fl_num_encoded
0,2019-05-10,WN,1912,WN,BWI,FLL,1105,1115.0,10.0,1345,-12.0,925,morning,afternoon,59,127,10,1908
1,2019-04-27,AA,3666,MQ,JFK,CLE,1545,1548.0,3.0,1747,1.0,425,afternoon,evening,185,72,0,3661
2,2018-03-08,WN,588,WN,ORF,BWI,1000,1012.0,12.0,1100,2.0,159,morning,morning,258,59,10,586
3,2018-04-05,AA,1618,AA,PHX,MIA,1330,1332.0,2.0,2048,-22.0,1972,afternoon,evening,272,229,0,1614
4,2018-01-31,UA,4171,EV,CHA,EWR,600,547.0,-13.0,818,-13.0,718,morning,morning,66,119,8,4166


In [7]:
# get dummy variables for estimated departure time
df_flights_enc = pd.get_dummies(df_flights_feat, columns=['crs_dep_time_cat'], drop_first=False)
df_flights_enc2 = pd.get_dummies(df_flights_enc, columns=['crs_arr_time_cat'], drop_first=False)
df_flights_enc2.head()

Unnamed: 0,fl_date,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,origin,dest,crs_dep_time,dep_time,dep_delay,crs_arr_time,arr_delay,distance,origin_encoded,dest_encoded,airline_encoded,fl_num_encoded,crs_dep_time_cat_late night,crs_dep_time_cat_morning,crs_dep_time_cat_afternoon,crs_dep_time_cat_evening,crs_arr_time_cat_late night,crs_arr_time_cat_morning,crs_arr_time_cat_afternoon,crs_arr_time_cat_evening
0,2019-05-10,WN,1912,WN,BWI,FLL,1105,1115.0,10.0,1345,-12.0,925,59,127,10,1908,0,1,0,0,0,0,1,0
1,2019-04-27,AA,3666,MQ,JFK,CLE,1545,1548.0,3.0,1747,1.0,425,185,72,0,3661,0,0,1,0,0,0,0,1
2,2018-03-08,WN,588,WN,ORF,BWI,1000,1012.0,12.0,1100,2.0,159,258,59,10,586,0,1,0,0,0,1,0,0
3,2018-04-05,AA,1618,AA,PHX,MIA,1330,1332.0,2.0,2048,-22.0,1972,272,229,0,1614,0,0,1,0,0,0,0,1
4,2018-01-31,UA,4171,EV,CHA,EWR,600,547.0,-13.0,818,-13.0,718,66,119,8,4166,0,1,0,0,0,1,0,0


# Average Delay for each Airline

In [8]:
#creating arrival and departure delay datasets
arr_delay = df_flights_enc2.arr_delay

#lower and upper limits
q1_arr_delay = arr_delay.quantile(0.25)
q3_arr_delay = arr_delay.quantile(0.75)
iqr_arr_delay = q3_arr_delay - q1_arr_delay
lower_limit_arr_delay = q1_arr_delay - 1.5 * iqr_arr_delay
upper_limit_arr_delay = q3_arr_delay + 1.5 * iqr_arr_delay

#remove arrival delays outliers
df_flights_no_outliers = df_flights_enc2[(df_flights_enc2.arr_delay>lower_limit_arr_delay) & (df_flights_enc2.arr_delay<upper_limit_arr_delay)]


In [9]:
df_flights_no_outliers.head()

Unnamed: 0,fl_date,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,origin,dest,crs_dep_time,dep_time,dep_delay,crs_arr_time,arr_delay,distance,origin_encoded,dest_encoded,airline_encoded,fl_num_encoded,crs_dep_time_cat_late night,crs_dep_time_cat_morning,crs_dep_time_cat_afternoon,crs_dep_time_cat_evening,crs_arr_time_cat_late night,crs_arr_time_cat_morning,crs_arr_time_cat_afternoon,crs_arr_time_cat_evening
0,2019-05-10,WN,1912,WN,BWI,FLL,1105,1115.0,10.0,1345,-12.0,925,59,127,10,1908,0,1,0,0,0,0,1,0
1,2019-04-27,AA,3666,MQ,JFK,CLE,1545,1548.0,3.0,1747,1.0,425,185,72,0,3661,0,0,1,0,0,0,0,1
2,2018-03-08,WN,588,WN,ORF,BWI,1000,1012.0,12.0,1100,2.0,159,258,59,10,586,0,1,0,0,0,1,0,0
3,2018-04-05,AA,1618,AA,PHX,MIA,1330,1332.0,2.0,2048,-22.0,1972,272,229,0,1614,0,0,1,0,0,0,0,1
4,2018-01-31,UA,4171,EV,CHA,EWR,600,547.0,-13.0,818,-13.0,718,66,119,8,4166,0,1,0,0,0,1,0,0


In [10]:
# take an explicit copy so that adding a column does not trigger SettingWithCopyWarning
df_flights_no_outliers = df_flights_no_outliers.copy()
df_flights_no_outliers['avg_arr_delay'] = df_flights_no_outliers['mkt_carrier']
df_flights_no_outliers.head()


Unnamed: 0,fl_date,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,origin,dest,crs_dep_time,dep_time,dep_delay,crs_arr_time,arr_delay,distance,origin_encoded,dest_encoded,airline_encoded,fl_num_encoded,crs_dep_time_cat_late night,crs_dep_time_cat_morning,crs_dep_time_cat_afternoon,crs_dep_time_cat_evening,crs_arr_time_cat_late night,crs_arr_time_cat_morning,crs_arr_time_cat_afternoon,crs_arr_time_cat_evening,avg_arr_delay
0,2019-05-10,WN,1912,WN,BWI,FLL,1105,1115.0,10.0,1345,-12.0,925,59,127,10,1908,0,1,0,0,0,0,1,0,WN
1,2019-04-27,AA,3666,MQ,JFK,CLE,1545,1548.0,3.0,1747,1.0,425,185,72,0,3661,0,0,1,0,0,0,0,1,AA
2,2018-03-08,WN,588,WN,ORF,BWI,1000,1012.0,12.0,1100,2.0,159,258,59,10,586,0,1,0,0,0,1,0,0,WN
3,2018-04-05,AA,1618,AA,PHX,MIA,1330,1332.0,2.0,2048,-22.0,1972,272,229,0,1614,0,0,1,0,0,0,0,1,AA
4,2018-01-31,UA,4171,EV,CHA,EWR,600,547.0,-13.0,818,-13.0,718,66,119,8,4166,0,1,0,0,0,1,0,0,UA


In [11]:
#use groupby method to find mean of arrival delay for each carrier using dataset without outliers
avg_delay_outlier = df_flights_no_outliers.groupby(['mkt_carrier'])['arr_delay'].mean()
#convert series into dataframe
df_avg_delay_outlier = avg_delay_outlier.to_frame()
df_avg_delay_outlier.head()

Unnamed: 0_level_0,arr_delay
mkt_carrier,Unnamed: 1_level_1
AA,-4.730407
AS,-5.009705
B6,-5.768038
DL,-7.602095
F9,-5.123199


In [12]:
#use groupby method to find mean of arrival delay for each carrier
avg_delay = df_flights_enc.groupby(['mkt_carrier'])['arr_delay'].mean()
#convert series into dataframe
df_avg_delay = avg_delay.to_frame()
df_avg_delay.head()

Unnamed: 0_level_0,arr_delay
mkt_carrier,Unnamed: 1_level_1
AA,6.38164
AS,0.794956
B6,11.477069
DL,2.541737
F9,12.437104


In [13]:
#create bins and labels for categorization
bins_delay = [-1.0, 5.0, 10.0, 15.0]
labels_delay = ['low', 'medium', 'high']
df_avg_delay['avg_arr_delay'] = pd.cut(df_avg_delay['arr_delay'], bins=bins_delay, labels=labels_delay)
df_avg_delay.head()

Unnamed: 0_level_0,arr_delay,avg_arr_delay
mkt_carrier,Unnamed: 1_level_1,Unnamed: 2_level_1
AA,6.38164,medium
AS,0.794956,low
B6,11.477069,high
DL,2.541737,low
F9,12.437104,high


In [14]:
# df_avg_delay.set_index('index', inplace=True)
df_avg_delay['mkt_carrier'] = df_avg_delay.index
df_avg_delay.head()

Unnamed: 0_level_0,arr_delay,avg_arr_delay,mkt_carrier
mkt_carrier,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
AA,6.38164,medium,AA
AS,0.794956,low,AS
B6,11.477069,high,B6
DL,2.541737,low,DL
F9,12.437104,high,F9


In [15]:
#reset index of dataframe
df_avg_delay.reset_index(drop=True, inplace=True)
df_avg_delay.head()

Unnamed: 0,arr_delay,avg_arr_delay,mkt_carrier
0,6.38164,medium,AA
1,0.794956,low,AS
2,11.477069,high,B6
3,2.541737,low,DL
4,12.437104,high,F9


In [16]:
# map avg_arr_delay values to main dataframe
df_flights_enc2['avg_arr_delay'] = df_flights_enc2['mkt_carrier'].map(df_avg_delay.set_index('mkt_carrier')['avg_arr_delay'])
df_flights_enc2.head()

Unnamed: 0,fl_date,mkt_carrier,mkt_carrier_fl_num,op_unique_carrier,origin,dest,crs_dep_time,dep_time,dep_delay,crs_arr_time,arr_delay,distance,origin_encoded,dest_encoded,airline_encoded,fl_num_encoded,crs_dep_time_cat_late night,crs_dep_time_cat_morning,crs_dep_time_cat_afternoon,crs_dep_time_cat_evening,crs_arr_time_cat_late night,crs_arr_time_cat_morning,crs_arr_time_cat_afternoon,crs_arr_time_cat_evening,avg_arr_delay
0,2019-05-10,WN,1912,WN,BWI,FLL,1105,1115.0,10.0,1345,-12.0,925,59,127,10,1908,0,1,0,0,0,0,1,0,low
1,2019-04-27,AA,3666,MQ,JFK,CLE,1545,1548.0,3.0,1747,1.0,425,185,72,0,3661,0,0,1,0,0,0,0,1,medium
2,2018-03-08,WN,588,WN,ORF,BWI,1000,1012.0,12.0,1100,2.0,159,258,59,10,586,0,1,0,0,0,1,0,0,low
3,2018-04-05,AA,1618,AA,PHX,MIA,1330,1332.0,2.0,2048,-22.0,1972,272,229,0,1614,0,0,1,0,0,0,0,1,medium
4,2018-01-31,UA,4171,EV,CHA,EWR,600,547.0,-13.0,818,-13.0,718,66,119,8,4166,0,1,0,0,0,1,0,0,medium


In [17]:
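# encode average arrival delay using OrdinalEncoder
# reshape data into a single-column 2D array, as the encoder expects
x = np.asarray(df_flights_enc2['avg_arr_delay']).reshape(-1, 1)
# define ordinal encoding
# note: by default the categories are sorted alphabetically (high=0, low=1, medium=2),
# so the codes do not follow low < medium < high; pass
# categories=[['low', 'medium', 'high']] to OrdinalEncoder to preserve that order
ocode = OrdinalEncoder()
# fit data
ocode.fit(x)
# transform data and flatten the (n, 1) result back into a single column
df_flights_enc2['avg_arr_delay_enc'] = ocode.transform(x).ravel()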

In [18]:
#drop unnecessary columns
df_flights_enc_clean = df_flights_enc2.drop(columns = ['mkt_carrier', 'mkt_carrier_fl_num',
                                                       'crs_dep_time', 'crs_arr_time', 'avg_arr_delay', 'dep_delay',
                                                        ])
df_flights_enc_clean['arr_delay'].describe()

count    293922.000000
mean          5.547431
std          50.375212
min        -160.000000
25%         -15.000000
50%          -6.000000
75%           8.000000
max        2041.000000
Name: arr_delay, dtype: float64

In [19]:
df_flights_enc_clean.head()

Unnamed: 0,fl_date,op_unique_carrier,origin,dest,dep_time,arr_delay,distance,origin_encoded,dest_encoded,airline_encoded,fl_num_encoded,crs_dep_time_cat_late night,crs_dep_time_cat_morning,crs_dep_time_cat_afternoon,crs_dep_time_cat_evening,crs_arr_time_cat_late night,crs_arr_time_cat_morning,crs_arr_time_cat_afternoon,crs_arr_time_cat_evening,avg_arr_delay_enc
0,2019-05-10,WN,BWI,FLL,1115.0,-12.0,925,59,127,10,1908,0,1,0,0,0,0,1,0,1.0
1,2019-04-27,MQ,JFK,CLE,1548.0,1.0,425,185,72,0,3661,0,0,1,0,0,0,0,1,2.0
2,2018-03-08,WN,ORF,BWI,1012.0,2.0,159,258,59,10,586,0,1,0,0,0,1,0,0,1.0
3,2018-04-05,AA,PHX,MIA,1332.0,-22.0,1972,272,229,0,1614,0,0,1,0,0,0,0,1,2.0
4,2018-01-31,EV,CHA,EWR,547.0,-13.0,718,66,119,8,4166,0,1,0,0,0,1,0,0,2.0


In [20]:
#Further new features are engineered and added to dataframe data_f
data_f = df_flights_enc_clean
data_f.head()

Unnamed: 0,fl_date,op_unique_carrier,origin,dest,dep_time,arr_delay,distance,origin_encoded,dest_encoded,airline_encoded,fl_num_encoded,crs_dep_time_cat_late night,crs_dep_time_cat_morning,crs_dep_time_cat_afternoon,crs_dep_time_cat_evening,crs_arr_time_cat_late night,crs_arr_time_cat_morning,crs_arr_time_cat_afternoon,crs_arr_time_cat_evening,avg_arr_delay_enc
0,2019-05-10,WN,BWI,FLL,1115.0,-12.0,925,59,127,10,1908,0,1,0,0,0,0,1,0,1.0
1,2019-04-27,MQ,JFK,CLE,1548.0,1.0,425,185,72,0,3661,0,0,1,0,0,0,0,1,2.0
2,2018-03-08,WN,ORF,BWI,1012.0,2.0,159,258,59,10,586,0,1,0,0,0,1,0,0,1.0
3,2018-04-05,AA,PHX,MIA,1332.0,-22.0,1972,272,229,0,1614,0,0,1,0,0,0,0,1,2.0
4,2018-01-31,EV,CHA,EWR,547.0,-13.0,718,66,119,8,4166,0,1,0,0,0,1,0,0,2.0


In [21]:
data_p = pd.read_csv('../../../src/modules/data_exploration/passengers(raw).csv')
data_p.head()

Unnamed: 0,departures_scheduled,departures_performed,payload,seats,passengers,freight,mail,distance,ramp_to_ramp,air_time,unique_carrier,airline_id,unique_carrier_name,region,carrier,carrier_name,carrier_group,carrier_group_new,origin_airport_id,origin_city_market_id,origin,origin_city_name,origin_country,origin_country_name,dest_airport_id,dest_city_market_id,dest,dest_city_name,dest_country,dest_country_name,aircraft_group,aircraft_type,aircraft_config,year,month,distance_group,class,data_source
0,1,1,51700,187,138,0,548,1670,202,186,AA,19805,American Airlines Inc.,D,AA,American Airlines Inc.,3,3,14107,30466,PHX,"Phoenix, AZ",US,United States,11066,31066,CMH,"Columbus, OH",US,United States,6,699,1,2018,12,4,F,DU
1,85,82,1012700,4100,3550,699,0,691,9690,7791,MQ,20398,Envoy Air,D,MQ,Envoy Air,3,3,11298,30194,DFW,"Dallas/Fort Worth, TX",US,United States,13367,33367,MLI,"Moline, IL",US,United States,6,675,1,2016,4,2,F,DU
2,1,1,19092,69,66,0,0,1158,173,141,YX,20452,Republic Airline,D,YX,Republic Airline,3,3,13851,33851,OKC,"Oklahoma City, OK",US,United States,11278,30852,DCA,"Washington, DC",US,United States,6,677,1,2019,6,3,F,DU
3,31,31,535355,2139,1568,0,0,1304,5710,4887,OO,20304,SkyWest Airlines Inc.,D,OO,SkyWest Airlines Inc.,3,3,10372,30372,ASE,"Aspen, CO",US,United States,10397,30397,ATL,"Atlanta, GA",US,United States,6,631,1,2018,3,3,F,DU
4,5,5,186000,930,649,0,0,872,744,635,F9,20436,Frontier Airlines Inc.,D,F9,Frontier Airlines Inc.,3,3,13244,33244,MEM,"Memphis, TN",US,United States,11292,30325,DEN,"Denver, CO",US,United States,6,722,1,2019,4,2,F,DU


In [22]:
data_f['year'] = pd.DatetimeIndex(data_f['fl_date']).year
data_f['month'] = pd.DatetimeIndex(data_f['fl_date']).month
data_p.rename(columns = {'unique_carrier':'op_unique_carrier'}, inplace = True)

In [23]:
# total passengers per year, month, route and operating carrier
data_p = data_p.groupby(['year', 'month', 'origin', 'dest', 'op_unique_carrier'], as_index=False)['passengers'].sum()
data_f = data_f.merge(data_p, how='left', on=['year', 'month', 'origin', 'dest', 'op_unique_carrier'])

In [24]:
flight_traffic = data_f.groupby(['year', 'month', 'origin'], as_index=False).count()[['year', 'month', 'origin','fl_date']]
flight_traffic.rename(columns = {'fl_date':'flight_traffic'}, inplace = True)

In [25]:
data_f = data_f.merge(flight_traffic, how='left', on=['year', 'month', 'origin'])

In [26]:
from pandas.tseries.holiday import USFederalHolidayCalendar

# US federal holidays across the two years covered by the data
holidays = USFederalHolidayCalendar().holidays(start='2018-01-01', end='2019-12-31')
data_f['fl_date'] = pd.to_datetime(data_f['fl_date'])
# flag flights on a federal holiday or on a weekend (Saturday=5, Sunday=6)
data_f['holiday'] = data_f['fl_date'].isin(holidays) | (data_f['fl_date'].dt.dayofweek >= 5)
data_f.head()

TypeError: Cannot interpret '<attribute 'dtype' of 'numpy.generic' objects>' as a data type

In [None]:
data_f.to_csv('df_flights_enc_clean.csv')

### Feature Selection / Dimensionality Reduction

We need to apply different selection techniques to find out which one will be the best for our problems.

- Original features vs. PCA components?
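
A minimal sketch of how that comparison could be set up, assuming cross-validated RMSE is the yardstick. The data is a synthetic stand-in generated with `make_regression`, and the choice of 8 components is arbitrary; neither comes from the flight dataset.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the engineered feature matrix (assumption)
X_demo, y_demo = make_regression(
    n_samples=500, n_features=18, n_informative=6, noise=10.0, random_state=0
)

candidates = {
    "original": make_pipeline(StandardScaler(), LinearRegression()),
    "pca_8": make_pipeline(StandardScaler(), PCA(n_components=8), LinearRegression()),
}

# scikit-learn reports RMSE as a negative score so that higher is always better
scores = {
    name: cross_val_score(
        model, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
    ).mean()
    for name, model in candidates.items()
}
for name, score in scores.items():
    print(f"{name}: RMSE = {-score:.2f}")
```

Keeping the scaler and PCA inside the pipeline means they are refitted on each training fold, so the validation folds do not leak into the projection.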

### Modeling

Use different ML techniques to predict each problem.

- linear / logistic / multinomial logistic regression
- Naive Bayes
- Random Forest
- SVM
- XGBoost
- The ensemble of your own choice

# Decision Tree Modeling

In [None]:
data_f.head()

In [None]:
# 'holiday' is a boolean column (not the strings 'True'/'False'), so cast it to 0/1
data_f['holiday_'] = data_f['holiday'].astype(int)


In [None]:
# binary target: 1 if the flight arrived late (arr_delay > 0), otherwise 0
data_f['arr_delay_binary'] = (data_f['arr_delay'] > 0).astype(int)

df_f_sample = data_f.sample(frac=0.5)

In [None]:
data_f.head()

In [None]:
df_f_sample.shape

In [None]:
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation

feature_cols = ['distance', 'origin_encoded', 'dest_encoded', 'airline_encoded', 'fl_num_encoded', 'crs_dep_time_cat_late night', 'crs_dep_time_cat_morning',
                'crs_dep_time_cat_afternoon', 'crs_dep_time_cat_evening', 'crs_arr_time_cat_late night', 'crs_arr_time_cat_morning', 'crs_arr_time_cat_afternoon',
               'crs_arr_time_cat_evening', 'avg_arr_delay_enc', 'year', 'month', 'flight_traffic', 'holiday']
min_col = ['distance', 'origin_encoded', 'dest_encoded', 'airline_encoded', 'avg_arr_delay_enc', 'month', 'flight_traffic', 'holiday']



X = data_f[feature_cols]
y = data_f.arr_delay

In [None]:
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test

In [None]:
# check the shape of X_train and X_test

X_train.shape, X_test.shape

In [None]:
# arr_delay is continuous (minutes), so binarize it before fitting a classifier:
# 1 = arrived late, 0 = on time or early (same rule as arr_delay_binary)
y_train_bin = (y_train > 0).astype(int)
y_test_bin = (y_test > 0).astype(int)

# Create Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train Decision Tree classifier
clf = clf.fit(X_train, y_train_bin)

# Predict the response for the test dataset
y_pred = clf.predict(X_test)

In [None]:
# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test_bin, y_pred))

# Model precision: what fraction of flights predicted as delayed were actually delayed?
print("Precision:", metrics.precision_score(y_test_bin, y_pred))

# Model recall: what fraction of actually delayed flights were predicted as delayed?
print("Recall:", metrics.recall_score(y_test_bin, y_pred))

In [None]:
# from sklearn.tree import export_graphviz
# from six import StringIO
# from IPython.display import Image  
# import pydotplus

# dot_data = StringIO()
# export_graphviz(clf, out_file=dot_data,  
#                 filled=True, rounded=True,
#                 special_characters=True,feature_names = feature_cols)
# graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
# graph.write_png('diabetes.png')
# Image(graph.create_png())

In [None]:
# Create a shallower Decision Tree classifier that splits on entropy
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)

# Train Decision Tree classifier on the binarized target (1 = arrived late)
clf = clf.fit(X_train, (y_train > 0).astype(int))

# Predict the response for the test dataset
y_pred = clf.predict(X_test)

# Model accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score((y_test > 0).astype(int), y_pred))

# Random Forest Modeling

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # data visualization
import seaborn as sns # statistical data visualization
%matplotlib inline
import warnings

warnings.filterwarnings('ignore')

In [None]:
# # import Random Forest classifier

# from sklearn.ensemble import RandomForestClassifier

# # instantiate the classifier 

# rfc = RandomForestClassifier(n_estimators=10, random_state=0)

# # fit the model

# rfc.fit(X_train, y_train)

# # Predict the Test set results

# y_pred = rfc.predict(X_test)

# # Check accuracy score 

# from sklearn.metrics import accuracy_score

# print('Model accuracy score with 10 decision-trees : {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

# takes up too much ram


#Model accuracy score with 10 decision-trees : 0.0202
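
The 2% accuracy and the memory pressure are both consistent with fitting a classifier on the continuous `arr_delay` target, where every distinct delay value becomes its own class. A minimal sketch of a lighter alternative, assuming a regression forest with capped tree size is acceptable for the main task. The data is synthetic and the hyperparameter values are illustrative guesses, not tuned settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# synthetic stand-in for the flight features (assumption: not the real data)
X_demo, y_demo = make_regression(n_samples=2000, n_features=18, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

rf = RandomForestRegressor(
    n_estimators=50,
    max_depth=12,          # cap tree depth instead of growing until leaves are pure
    min_samples_leaf=20,   # fewer, larger leaves
    max_samples=0.3,       # each tree is built on a 30% bootstrap sample of the rows
    n_jobs=-1,
    random_state=0,
)
rf.fit(X_tr, y_tr)

deepest = max(tree.get_depth() for tree in rf.estimators_)
print("deepest tree:", deepest)
print("R^2 on held-out rows:", round(rf.score(X_te, y_te), 3))
```

Depth, leaf size and the per-tree sample fraction all trade accuracy against memory, so they are worth tuning together.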

# XGBOOST Modeling

In [None]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error

In [None]:
data_dmatrix = xgb.DMatrix(data=X,label=y)

In [None]:
# 'reg:squarederror' replaces the deprecated 'reg:linear' objective name
xg_reg = xgb.XGBRegressor(objective='reg:squarederror', colsample_bytree=0.3, learning_rate=0.1,
                          max_depth=5, alpha=10, n_estimators=10)

In [None]:
xg_reg.fit(X_train,y_train)

preds = xg_reg.predict(X_test)

In [None]:
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))

In [None]:
# 'reg:squarederror' replaces the deprecated 'reg:linear' objective name
params = {'objective': 'reg:squarederror', 'colsample_bytree': 0.3, 'learning_rate': 0.1,
          'max_depth': 5, 'alpha': 10}

cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
                    num_boost_round=50,early_stopping_rounds=10,metrics="rmse", as_pandas=True, seed=123)

In [None]:
cv_results.head()

In [None]:
xg_reg = xgb.train(params=params, dtrain=data_dmatrix, num_boost_round=10)

In [None]:
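# set the figure size before plotting so that it applies to the new figure;
# 1000 x 1000 inches is far beyond what matplotlib can render
plt.rcParams['figure.figsize'] = [50, 10]
xgb.plot_tree(xg_reg, num_trees=0)
# plt.show()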

In [None]:
# set the figure size before plotting so that it applies to the new figure
plt.rcParams['figure.figsize'] = [10, 10]
xgb.plot_importance(xg_reg)

# SVM Modeling

In [None]:
#Import svm model
from sklearn import svm

#Create a svm Classifier
# note: SVC training time grows faster than linearly with the number of rows,
# so fitting on the full training set may be very slow
clf = svm.SVC(kernel='linear') # Linear Kernel

#Train the model on the binarized target (1 = arrived late, 0 = on time or early)
clf.fit(X_train, (y_train > 0).astype(int))

#Predict the response for test dataset
y_pred = clf.predict(X_test)

In [None]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model Accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score((y_test > 0).astype(int), y_pred))

In [None]:
# Model Precision: what fraction of flights predicted as delayed were actually delayed?
print("Precision:", metrics.precision_score((y_test > 0).astype(int), y_pred))

# Model Recall: what fraction of actually delayed flights were predicted as delayed?
print("Recall:", metrics.recall_score((y_test > 0).astype(int), y_pred))

# Naive Bayes Modeling

In [None]:
# training the model on training set
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
# arr_delay is continuous, so fit on the binarized target (1 = arrived late)
y_train_bin = (y_train > 0).astype(int)
y_test_bin = (y_test > 0).astype(int)
gnb.fit(X_train, y_train_bin)

# making predictions on the testing set
y_pred = gnb.predict(X_test)

# comparing actual response values (y_test_bin) with predicted response values (y_pred)
from sklearn import metrics
print("Gaussian Naive Bayes model accuracy (in %):", metrics.accuracy_score(y_test_bin, y_pred)*100)

# Model Precision: what fraction of flights predicted as delayed were actually delayed?
print("Precision:", metrics.precision_score(y_test_bin, y_pred))

# Model Recall: what fraction of actually delayed flights were predicted as delayed?
print("Recall:", metrics.recall_score(y_test_bin, y_pred))

### Evaluation

You have data from 2018 and 2019 to develop models. Use different evaluation metrics for each problem and compare the performance of different models.

You are required to predict delays on **out of sample** data from the **first 7 days (1st-7th) of January 2020** and to share the file with LighthouseLabs. A sample submission can be found in the file **_sample_submission.csv_**
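
Because the submission covers dates after the training period, a chronological holdout may give a more realistic estimate than a random split. The sketch below shows one possible setup on toy data; the helper names `time_split` and `regression_report` and the cutoff date are assumptions, and the "model" is only a train-mean baseline.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def time_split(df, date_col, cutoff):
    """Rows strictly before cutoff go to train, the rest to validation."""
    cutoff = pd.Timestamp(cutoff)
    return df[df[date_col] < cutoff], df[df[date_col] >= cutoff]

def regression_report(y_true, y_pred):
    """Collect several regression metrics in one dictionary."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "RMSE": float(np.sqrt(mse)),
        "R2": r2_score(y_true, y_pred),
    }

# toy data: 12 consecutive days with a made-up delay value per day
toy = pd.DataFrame({
    "fl_date": pd.date_range("2019-12-20", periods=12, freq="D"),
    "arr_delay": np.arange(12, dtype=float),
})
train, valid = time_split(toy, "fl_date", "2019-12-25")

# baseline: predict the training mean for every validation row
baseline = np.full(len(valid), train["arr_delay"].mean())
report = regression_report(valid["arr_delay"], baseline)
print(report)
```

Any real model should at least beat this kind of constant baseline on the chronological holdout.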

======================================================================
## Stretch Tasks

### Multiclass Classification

The target variables are **CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY**. We need to do additional transformations because these variables are not binary but continuous. For each flight that was delayed, we need to have one of these variables as 1 and the others as 0.

It can happen that we have two types of delays with more than 0 minutes. In this case, take the bigger one as 1 and others as 0.
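
The rule above can be sketched as follows, assuming the delay columns use lowercase names as in the cleaned flight data (the text above uses the uppercase names). The helper `delay_type_label` is hypothetical, and ties are broken in favor of the first column in `DELAY_COLS`.

```python
import numpy as np
import pandas as pd

# assumed lowercase column names; adjust to match the actual dataset
DELAY_COLS = ["carrier_delay", "weather_delay", "nas_delay",
              "security_delay", "late_aircraft_delay"]

def delay_type_label(df):
    """Name of the largest delay column per row, NaN when no delay is recorded."""
    delays = df[DELAY_COLS].fillna(0)
    label = delays.idxmax(axis=1)               # column holding the biggest delay
    return label.where(delays.max(axis=1) > 0)  # blank out rows without any delay

toy = pd.DataFrame({
    "carrier_delay": [10, 0, np.nan],
    "weather_delay": [0, 0, 5],
    "nas_delay": [25, 0, np.nan],
    "security_delay": [0, 0, np.nan],
    "late_aircraft_delay": [0, 0, np.nan],
})
labels = delay_type_label(toy)
one_hot = pd.get_dummies(labels)   # one indicator column per delay type seen
print(labels.tolist())
```

The single label column can be fed directly to a multiclass classifier, while the one-hot form matches the "one variable as 1 and the others as 0" description.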

### Binary Classification

The target variable is **CANCELLED**. The main problem here is going to be a huge class imbalance: only a small fraction of all flights are cancelled. It is important to use an appropriate sampling strategy before training and to choose suitable evaluation metrics.
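
A minimal sketch of one way to handle the imbalance, assuming a stratified split, class weighting and imbalance-aware metrics are acceptable. The data is synthetic with roughly 2% positives, and logistic regression is only a placeholder model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, f1_score, recall_score
from sklearn.model_selection import train_test_split

# synthetic stand-in with a rare positive class (assumption: not the flight data)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=0
)

# stratify keeps the share of positives the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# class_weight="balanced" reweights errors inversely to class frequency
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
pred = clf.predict(X_te)
ap = average_precision_score(y_te, proba)
print("average precision:", round(ap, 3))
print("recall:", round(recall_score(y_te, pred), 3))
print("F1:", round(f1_score(y_te, pred), 3))
```

Accuracy is left out on purpose: with so few positives, a model that never predicts a cancellation would still score very high on it.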