# FEATURE ENGINEERING
In this section we will create new features useful for further predictions.

In [1]:
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

In [2]:
pd.options.display.max_columns = 250
pd.set_option('mode.chained_assignment', None)

First, we load the data and sort it by normalized arrival time. Sorting by arrival makes it easier to aggregate over flights that have just landed.

In [3]:
data = pd.read_pickle('data.pkl')
data.sort_values('ARRIVAL_TIME_NORMALIZED',  inplace = True)
data.reset_index(drop = True, inplace = True)

In [4]:
data.head()

Unnamed: 0,MONTH,DAY,DAY_OF_WEEK,AIRLINE,FLIGHT_NUMBER,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,SCHEDULED_DEPARTURE,DEPARTURE_TIME,DEPARTURE_DELAY,TAXI_OUT,WHEELS_OFF,SCHEDULED_TIME,ELAPSED_TIME,AIR_TIME,DISTANCE,WHEELS_ON,TAXI_IN,SCHEDULED_ARRIVAL,ARRIVAL_DELAY,AIR_SYSTEM_DELAY,SECURITY_DELAY,AIRLINE_DELAY,LATE_AIRCRAFT_DELAY,WEATHER_DELAY,STATE_ORIGIN,LATITUDE_ORIGIN,LONGITUDE_ORIGIN,STATE_DESTINATION,LATITUDE_DESTINATION,LONGITUDE_DESTINATION,SCHEDULED_DEPARTURE_HH,DEPARTURE_TIME_HH,SCHEDULED_ARRIVAL_HH,WHEELS_OFF_HH,ARRIVAL_TIME_HH,ARRIVAL_TIME,DEPARTURE_TIME_NORMALIZED,ARRIVAL_TIME_NORMALIZED,ROUTE,ROUTE_STATES
0,1,1,4,NK,451,N633NK,PBG,FLL,2015-01-01 01:55:00,2015-01-01 01:39:00,-16.0,10.0,149.0,208.0,191.0,174.0,1334,443.0,7.0,523,-33.0,,,,,,NY,44.69282,-73.45562,FL,26.07258,-80.15275,1,1,5,1,4,2015-01-01 04:50:00,2015-01-01 06:39:00,2015-01-01 09:50:00,PBG_FLL,NY_FL
1,1,1,4,B6,668,N653JB,PSE,MCO,2015-01-01 02:55:00,2015-01-01 02:48:00,-7.0,10.0,258.0,185.0,183.0,169.0,1179,447.0,4.0,500,-9.0,,,,,,PR,18.0083,-66.56301,FL,28.42889,-81.31603,2,2,5,2,4,2015-01-01 04:51:00,2015-01-01 06:48:00,2015-01-01 09:51:00,PSE_MCO,PR_FL
2,1,1,4,NK,647,N630NK,IAG,FLL,2015-01-01 02:00:00,2015-01-01 01:55:00,-5.0,21.0,216.0,184.0,178.0,149.0,1176,445.0,8.0,504,-11.0,,,,,,NY,43.10726,-78.94538,FL,26.07258,-80.15275,2,1,5,2,4,2015-01-01 04:53:00,2015-01-01 06:55:00,2015-01-01 09:53:00,IAG_FLL,NY_FL
3,1,1,4,DL,2336,N958DN,DEN,ATL,2015-01-01 00:30:00,2015-01-01 00:24:00,-6.0,12.0,36.0,173.0,149.0,133.0,1199,449.0,4.0,523,-30.0,,,,,,CO,39.85841,-104.667,GA,33.64044,-84.42694,0,0,5,0,4,2015-01-01 04:53:00,2015-01-01 07:24:00,2015-01-01 09:53:00,DEN_ATL,CO_GA
4,1,1,4,UA,1528,N76519,SJU,EWR,2015-01-01 01:54:00,2015-01-01 01:57:00,3.0,12.0,209.0,255.0,241.0,220.0,1608,449.0,9.0,509,-11.0,,,,,,PR,18.43942,-66.00183,NJ,40.6925,-74.16866,1,1,5,2,4,2015-01-01 04:58:00,2015-01-01 05:57:00,2015-01-01 09:58:00,SJU_EWR,PR_NJ


Let's start by deriving three columns from the existing ones: DELAY_REDUCTION, AIRLINE_ROUTE and AIRLINE_FLIGHT_NUMBER_ROUTE.
* DELAY_REDUCTION is the difference between departure delay and arrival delay, so a positive value means the flight made up time on the way.
* AIRLINE_ROUTE lets us distinguish arrival delays on a particular route by airline.
* AIRLINE_FLIGHT_NUMBER_ROUTE identifies particular recurring flights.

In [5]:
data['DELAY_REDUCTION'] = data.DEPARTURE_DELAY - data.ARRIVAL_DELAY
data['AIRLINE_ROUTE'] = data.AIRLINE + '_' + data.ROUTE
data['AIRLINE_FLIGHT_NUMBER_ROUTE'] = data.AIRLINE + '_' + data.FLIGHT_NUMBER.astype('str') + '_' + data.ROUTE 

Predictions will be made for flights whose departure delay is already known. Because departure delay is in most cases clearly correlated with arrival delay, we build the new features on delay reduction, which is more informative than arrival delay itself. The other variables are meant to improve the estimate of the difference between departure and arrival delay.
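To make the three derived columns concrete, here is a small sketch that recomputes them for a single flight. The values are copied from the first row of `data.head()` above; the one-row frame `sample` is only an illustration.

```python
import pandas as pd

# Values copied from the first row of data.head() shown earlier.
sample = pd.DataFrame({
    'AIRLINE': ['NK'],
    'FLIGHT_NUMBER': [451],
    'ROUTE': ['PBG_FLL'],
    'DEPARTURE_DELAY': [-16.0],
    'ARRIVAL_DELAY': [-33.0],
})

# A positive value means the flight made up time between departure and arrival.
sample['DELAY_REDUCTION'] = sample.DEPARTURE_DELAY - sample.ARRIVAL_DELAY
sample['AIRLINE_ROUTE'] = sample.AIRLINE + '_' + sample.ROUTE
sample['AIRLINE_FLIGHT_NUMBER_ROUTE'] = (
    sample.AIRLINE + '_' + sample.FLIGHT_NUMBER.astype('str') + '_' + sample.ROUTE
)

print(sample[['DELAY_REDUCTION', 'AIRLINE_ROUTE', 'AIRLINE_FLIGHT_NUMBER_ROUTE']])
```

The flight left 16 minutes early and arrived 33 minutes early, so its delay reduction is 17 minutes.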

We will create new features based on different approaches:
* Mean delay reduction for flights that arrived in the last 1/5/24 hours, grouped by various columns. Recent delay reduction within a group should help in estimating the expected arrival delay.
* Mean taxi-out/taxi-in time for flights that arrived in the last 1/5/24 hours, grouped by origin airport/destination airport. We only use groupings that make sense here. Taxi out is the time between leaving the gate at the origin airport and the moment the aircraft takes off, so results grouped by origin airport should be the best indicator of expected taxi out. We will also check whether particular recurring flights have an influence on it. The same approach is used for taxi in, which is grouped by destination airport and by particular flights.
* Mean delay reduction by various features over the whole train set. Based on past flights, we can suspect that new flights will behave similarly.
* Mean over the previous 1/3/7 flights, grouped by features for which a flight cannot start before the previous flight in the same group has arrived. We can only rely on flights that have already finished.
* The percentage of flights whose delay increased (negative delay reduction), grouped by various columns over the whole train set. It indicates whether we should expect the delay to grow or shrink on new flights.
* Differences between the average delay reduction from the last 1/5/24 hours, or over the last 1/3/7 flights, and the mean over the whole train set. They tell us how far the most recent results deviate from the average.

The first step, features based on the last 1/5/24 hours, can be performed on the full data set, so we don't have to split the data into train and test sets yet.
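The window rule behind these features can be sketched on a toy example: for a flight departing at time t, we average over flights of the same group that landed strictly between t minus the window length and t. The frame `flights` and the helper `window_mean` are hypothetical and only illustrate the selection; they are not part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical toy flights belonging to a single group (e.g. one origin airport).
flights = pd.DataFrame({
    'DEPARTURE_TIME_NORMALIZED': pd.to_datetime(
        ['2015-01-01 08:00', '2015-01-01 09:00', '2015-01-01 12:30', '2015-01-01 13:00']),
    'ARRIVAL_TIME_NORMALIZED': pd.to_datetime(
        ['2015-01-01 10:00', '2015-01-01 12:15', '2015-01-01 14:00', '2015-01-01 15:00']),
    'DELAY_REDUCTION': [4.0, 10.0, -2.0, 6.0],
})

def window_mean(frame, departure, hours):
    """Mean DELAY_REDUCTION of flights that landed in (departure - hours, departure)."""
    start = departure - pd.Timedelta(hours=hours)
    landed = (frame.ARRIVAL_TIME_NORMALIZED < departure) & (frame.ARRIVAL_TIME_NORMALIZED > start)
    return frame.loc[landed, 'DELAY_REDUCTION'].mean()

departure = flights.DEPARTURE_TIME_NORMALIZED.iloc[3]   # the flight leaving at 13:00
mean_1h = window_mean(flights, departure, hours=1)      # only the 12:15 arrival qualifies
mean_5h = window_mean(flights, departure, hours=5)      # the 10:00 and 12:15 arrivals qualify
first_flight = window_mean(flights, flights.DEPARTURE_TIME_NORMALIZED.iloc[0], hours=24)

print(mean_1h, mean_5h, first_flight)
```

Flights still in the air at departure time are excluded, and a flight with no earlier arrivals in its window gets NaN, which is why the first rows of the resulting features are empty.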

In [6]:
hours_vals = [1, 5, 24]
def delay_by_periods(data, col, col_agg = 'DELAY_REDUCTION'):
    # One NaN-initialised output column per window length in the global hours_vals.
    for hours in hours_vals:
        data[col + '_' + col_agg + '_' + str(hours) + '_HOURS_MEAN'] = np.nan
    for col_val in data[col].unique():
        data_frac = data[data[col] == col_val]
        for i in range(len(data_frac)):
            departure = data_frac.DEPARTURE_TIME_NORMALIZED.iloc[i]
            for hours in hours_vals:
                # Flights of the same group that landed within `hours` before this departure.
                delays = data_frac.loc[(departure > data_frac.ARRIVAL_TIME_NORMALIZED) &
                        (data_frac.ARRIVAL_TIME_NORMALIZED > (departure - pd.Timedelta(hours = hours))),
                        col_agg]
                # Label-based .loc avoids chained assignment, which may write to a temporary copy.
                data.loc[data_frac.index[i], col + '_' + col_agg + '_' + str(hours) + '_HOURS_MEAN'] = delays.mean()
    return data

In [7]:
cols_to_agg = ['ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'AIRLINE', 'ROUTE_STATES', 'ROUTE', 'TAIL_NUMBER', 'AIRLINE_ROUTE']

for col in cols_to_agg:
    data = delay_by_periods(data, col)
    
data = delay_by_periods(data, 'ORIGIN_AIRPORT', 'TAXI_OUT')
data = delay_by_periods(data, 'DESTINATION_AIRPORT', 'TAXI_IN')

Let's also create the average delay reduction from the last 24 hours grouped by AIRLINE_FLIGHT_NUMBER_ROUTE. It doesn't make sense to also create 1-hour and 5-hour variables here, because the same flight usually takes place only once a day.

In [8]:
hours_vals = [24]

data = delay_by_periods(data, 'AIRLINE_FLIGHT_NUMBER_ROUTE')

For the next steps we have to divide our data into train and test sets.
The train set will include data from January to March, leaving April for the test set.
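The split has to come first because the aggregates that follow are computed on the train set only and then attached to both sets. A minimal sketch of that pattern, using hypothetical toy frames `toy_train` and `toy_test`:

```python
import pandas as pd

# Hypothetical toy data: airline 'UA' appears only in the test set.
toy_train = pd.DataFrame({'AIRLINE': ['AA', 'AA', 'DL'],
                          'DELAY_REDUCTION': [2.0, 6.0, 10.0]})
toy_test = pd.DataFrame({'AIRLINE': ['AA', 'UA']})

# Group means are computed on the train set only.
train_means = (toy_train.groupby('AIRLINE')['DELAY_REDUCTION'].mean()
               .rename('AIRLINE_DELAY_REDUCTION_MEAN')
               .reset_index())

# A left merge keeps every row; groups unseen in train end up as NaN.
toy_train = toy_train.merge(train_means, on='AIRLINE', how='left')
toy_test = toy_test.merge(train_means, on='AIRLINE', how='left')

print(toy_test)
```

Groups that never occur in the train set receive NaN in the test set, so the model has to cope with missing values in these columns.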

In [9]:
train = data[data.MONTH <= 3]
test = data[data.MONTH == 4].reset_index(drop = True)
del data

In [10]:
def aggregates_on_delay(data, col, aggregate = 'mean', col_agg = 'DELAY_REDUCTION'):
    # Statistics are always computed on the global train set and then merged onto `data`.
    new_col = col + '_' + col_agg + '_' + aggregate.upper()
    train_agg = train.groupby(col)[col_agg].agg([aggregate]).rename({aggregate: new_col}, axis = 1)
    data = pd.merge(data, train_agg, on = col, how = 'left')
    return data

In [11]:
cols_to_agg_delay_reduction = ['ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'AIRLINE', 'ROUTE_STATES', 'ROUTE', 'TAIL_NUMBER', 
               'SCHEDULED_DEPARTURE_HH', 'SCHEDULED_ARRIVAL_HH', 'AIRLINE_ROUTE', 'AIRLINE_FLIGHT_NUMBER_ROUTE']

cols_to_agg_taxi_out = ['ORIGIN_AIRPORT', 'AIRLINE_FLIGHT_NUMBER_ROUTE']
cols_to_agg_taxi_in = ['DESTINATION_AIRPORT', 'AIRLINE_FLIGHT_NUMBER_ROUTE']

for col in cols_to_agg_delay_reduction:
    train = aggregates_on_delay(train, col, 'mean')
    test = aggregates_on_delay(test, col, 'mean')

for col in cols_to_agg_taxi_out: 
    train = aggregates_on_delay(train, col, 'mean', 'TAXI_OUT')
    test = aggregates_on_delay(test, col, 'mean', 'TAXI_OUT')        
    
for col in cols_to_agg_taxi_in: 
    train = aggregates_on_delay(train, col, 'mean', 'TAXI_IN')
    test = aggregates_on_delay(test, col, 'mean', 'TAXI_IN')

For the percentage-of-increased-delay function we set the minimum number of required observations to 10. This guards against groups where, out of a very small number of observations, the vast majority show an increased delay even though the delay decreases overall. A sample that is too small cannot give reliable results.
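One way to read this threshold is per group: a group's share is kept only when the group has at least `min_n` flights in the train set. The sketch below illustrates that reading on a hypothetical toy frame `toy_train`; it is an assumption about the intent, not a description of the function that follows.

```python
import pandas as pd

# Hypothetical toy train set: route 'A_B' has 12 flights, route 'C_D' only 2.
toy_train = pd.DataFrame({
    'ROUTE': ['A_B'] * 12 + ['C_D'] * 2,
    'DELAY_REDUCTION': [-1.0] * 3 + [5.0] * 9 + [-4.0, -2.0],
})

min_n = 10
totals = toy_train.groupby('ROUTE')['DELAY_REDUCTION'].count()
increased = (toy_train[toy_train.DELAY_REDUCTION < 0]
             .groupby('ROUTE')['DELAY_REDUCTION'].count())

# Share of flights whose delay grew; groups with fewer than min_n flights become NaN.
share = (increased / totals).where(totals >= min_n)
print(share)
```

Without the threshold, route `C_D` would report that 100% of its flights increased their delay, based on just two observations.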

In [12]:
def perc_of_increased_delay(data, col, min_n = 10):
    # Flights per group in the global train set, and those whose delay grew (DELAY_REDUCTION < 0).
    totals = train.groupby(col)['DELAY_REDUCTION'].count()
    increased = train.loc[train.DELAY_REDUCTION < 0].groupby(col)['DELAY_REDUCTION'].count()
    # Groups without any increased-delay flight stay NaN, as do groups with fewer than min_n flights.
    aggregate = (increased / totals).where(totals >= min_n)\
        .rename('PERC_OF_INCREASED_DEL_' + col).to_frame()
    data = pd.merge(data, aggregate, on = col, how = 'left')
    return data

In [13]:
cols_to_agg_perc = ['ORIGIN_AIRPORT', 'DESTINATION_AIRPORT', 'AIRLINE', 'ROUTE_STATES', 'ROUTE', 'TAIL_NUMBER',
                    'AIRLINE_ROUTE', 'SCHEDULED_DEPARTURE_HH', 'SCHEDULED_ARRIVAL_HH', 'AIRLINE_FLIGHT_NUMBER_ROUTE']

for col in cols_to_agg_perc:
    train = perc_of_increased_delay(train, col)
    test = perc_of_increased_delay(test, col)

Individual flights can be identified by the AIRLINE_FLIGHT_NUMBER_ROUTE and TAIL_NUMBER variables, so the means over the previous 1/3/7 flights are computed within these groups.
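The "previous n flights" mean can be illustrated on a toy example with pandas' `shift` and `rolling`. The frame `toy` is hypothetical and assumed to be sorted by arrival time already.

```python
import pandas as pd

# Hypothetical toy data, already sorted by arrival time.
toy = pd.DataFrame({
    'TAIL_NUMBER': ['N1', 'N1', 'N2', 'N1', 'N1', 'N2'],
    'DELAY_REDUCTION': [2.0, 4.0, 10.0, 6.0, 8.0, 20.0],
})

n = 3
# shift(1) removes the current flight from its own window;
# rolling(n, n) yields NaN until n previous flights are available.
toy['PREV_3_MEAN'] = toy.groupby('TAIL_NUMBER')['DELAY_REDUCTION'].transform(
    lambda x: x.shift(1).rolling(n, n).mean()
)
print(toy)
```

Only the fourth flight of aircraft `N1` has three predecessors, so it is the only row with a value (the mean of 2, 4 and 6); every other row stays NaN.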

In [14]:
def rolling_previous(train_set, test_set, col, n, aggregate = 'mean', col_agg = 'DELAY_REDUCTION'):
    data = pd.concat([train_set, test_set], axis = 0).reset_index(drop = True)
    data[col + '_' + col_agg + '_ROLLING_' + str(n) + '_' + aggregate.upper()] =\
        data.groupby(col)[col_agg].transform(lambda x: x.shift(1).rolling(n, n).agg(aggregate))
    train_set = data[data.MONTH <= 3]
    test_set = data[data.MONTH == 4].reset_index(drop = True)
    return train_set, test_set

In [15]:
for n in [1, 3, 7]:
    train, test = rolling_previous(train, test, col = 'AIRLINE_FLIGHT_NUMBER_ROUTE', n = n)
    train, test = rolling_previous(train, test, col = 'AIRLINE_FLIGHT_NUMBER_ROUTE', n = n, col_agg = 'TAXI_IN')
    train, test = rolling_previous(train, test, col = 'AIRLINE_FLIGHT_NUMBER_ROUTE', n = n, col_agg = 'TAXI_OUT')
    train, test = rolling_previous(train, test, col = 'TAIL_NUMBER', n = n)

The next step computes the differences between the means of delay_reduction/taxi_in/taxi_out over a recent period and over the whole train set. After that, we will no longer need the overall means: their information is already contained in the period variables and the differences.
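The naming convention makes it possible to find the matching overall-mean column by removing the period part from a column name. A short sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical values for two flights from the same origin airport.
toy = pd.DataFrame({
    'ORIGIN_AIRPORT_DELAY_REDUCTION_1_HOURS_MEAN': [9.0, 1.5],
    'ORIGIN_AIRPORT_DELAY_REDUCTION_MEAN': [4.0, 4.0],
})

col = 'ORIGIN_AIRPORT_DELAY_REDUCTION_1_HOURS_MEAN'
# Dropping '1_HOURS_' from the name gives the overall-mean column.
overall_col = col.replace('1_HOURS_', '')
toy[col + '_DIFF'] = toy[col] - toy[overall_col]

print(toy[col + '_DIFF'])
```

A positive difference means the last hour at that airport went better than its long-run average, a negative one that it went worse.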

In [16]:
for hour in [1, 5, 24]:
    period = str(hour) + '_HOURS'
    # Same order as before: delay reduction first, then taxi in, then taxi out.
    for key in ['DELAY', 'TAXI_IN', 'TAXI_OUT']:
        for col in train.columns[(train.columns.str.contains(period)) & (train.columns.str.contains(key))]:
            # Removing the period part of the name gives the overall-mean column.
            overall_col = col.replace(period + '_', '')
            train[col + '_DIFF'] = train[col] - train[overall_col]
            test[col + '_DIFF'] = test[col] - test[overall_col]

In [17]:
for previous in [1, 3, 7]:
    for col in train.columns[(train.columns.str.contains('ROLLING_' + str(previous)))]:
        train[col + '_DIFF'] = train[col] - train[col.replace('_ROLLING_' + str(previous) , '')]
        test[col + '_DIFF'] = test[col] - test[col.replace('_ROLLING_' + str(previous), '')]

In [18]:
overall_mean_cols = ['ORIGIN_AIRPORT_DELAY_REDUCTION_MEAN',
       'DESTINATION_AIRPORT_DELAY_REDUCTION_MEAN',
       'AIRLINE_DELAY_REDUCTION_MEAN', 'ROUTE_STATES_DELAY_REDUCTION_MEAN',
       'ROUTE_DELAY_REDUCTION_MEAN', 'TAIL_NUMBER_DELAY_REDUCTION_MEAN',
       'SCHEDULED_DEPARTURE_HH_DELAY_REDUCTION_MEAN',
       'SCHEDULED_ARRIVAL_HH_DELAY_REDUCTION_MEAN',
       'AIRLINE_ROUTE_DELAY_REDUCTION_MEAN',
       'AIRLINE_FLIGHT_NUMBER_ROUTE_DELAY_REDUCTION_MEAN',
       'ORIGIN_AIRPORT_TAXI_OUT_MEAN',
       'AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_OUT_MEAN',
       'DESTINATION_AIRPORT_TAXI_IN_MEAN',
       'AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_IN_MEAN']

# The same overall-mean columns are removed from both sets.
train.drop(overall_mean_cols, axis = 1, inplace = True)
test.drop(overall_mean_cols, axis = 1, inplace = True)

Let's also drop the other variables that won't be used when building the models.

In [19]:
unused_cols = ['MONTH', 'DAY', 'SCHEDULED_DEPARTURE', 'DEPARTURE_TIME', 'TAXI_OUT', 'WHEELS_OFF',
               'SCHEDULED_TIME', 'ELAPSED_TIME', 'AIR_TIME', 'WHEELS_ON', 'TAXI_IN', 'SCHEDULED_ARRIVAL',
               'AIR_SYSTEM_DELAY', 'SECURITY_DELAY', 'AIRLINE_DELAY', 'LATE_AIRCRAFT_DELAY',
               'WEATHER_DELAY', 'DEPARTURE_TIME_HH', 'WHEELS_OFF_HH', 'ARRIVAL_TIME_HH',
               'ARRIVAL_TIME', 'DEPARTURE_TIME_NORMALIZED', 'ARRIVAL_TIME_NORMALIZED', 'DELAY_REDUCTION',
               'FLIGHT_NUMBER']

# The same columns are removed from both sets.
train.drop(unused_cols, axis = 1, inplace = True)
test.drop(unused_cols, axis = 1, inplace = True)

In [6]:
print('Shape of the train set:', train.shape)
print('Shape of the test set:', test.shape)

Shape of the train set: (1356814, 110)
Shape of the test set: (479251, 110)


In [7]:
train.head()

Unnamed: 0,DAY_OF_WEEK,AIRLINE,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,DEPARTURE_DELAY,DISTANCE,ARRIVAL_DELAY,STATE_ORIGIN,LATITUDE_ORIGIN,LONGITUDE_ORIGIN,STATE_DESTINATION,LATITUDE_DESTINATION,LONGITUDE_DESTINATION,SCHEDULED_DEPARTURE_HH,SCHEDULED_ARRIVAL_HH,ROUTE,ROUTE_STATES,ORIGIN_AIRPORT_DELAY_REDUCTION_1_HOURS_MEAN,ORIGIN_AIRPORT_DELAY_REDUCTION_5_HOURS_MEAN,DESTINATION_AIRPORT_DELAY_REDUCTION_1_HOURS_MEAN,DESTINATION_AIRPORT_DELAY_REDUCTION_5_HOURS_MEAN,AIRLINE_DELAY_REDUCTION_1_HOURS_MEAN,AIRLINE_DELAY_REDUCTION_5_HOURS_MEAN,ROUTE_STATES_DELAY_REDUCTION_1_HOURS_MEAN,ROUTE_STATES_DELAY_REDUCTION_5_HOURS_MEAN,ROUTE_DELAY_REDUCTION_1_HOURS_MEAN,ROUTE_DELAY_REDUCTION_5_HOURS_MEAN,TAIL_NUMBER_DELAY_REDUCTION_1_HOURS_MEAN,TAIL_NUMBER_DELAY_REDUCTION_5_HOURS_MEAN,ORIGIN_AIRPORT_DELAY_REDUCTION_24_HOURS_MEAN,DESTINATION_AIRPORT_DELAY_REDUCTION_24_HOURS_MEAN,AIRLINE_DELAY_REDUCTION_24_HOURS_MEAN,ROUTE_STATES_DELAY_REDUCTION_24_HOURS_MEAN,ROUTE_DELAY_REDUCTION_24_HOURS_MEAN,TAIL_NUMBER_DELAY_REDUCTION_24_HOURS_MEAN,AIRLINE_ROUTE,AIRLINE_FLIGHT_NUMBER_ROUTE,AIRLINE_ROUTE_DELAY_REDUCTION_1_HOURS_MEAN,AIRLINE_ROUTE_DELAY_REDUCTION_5_HOURS_MEAN,AIRLINE_ROUTE_DELAY_REDUCTION_24_HOURS_MEAN,AIRLINE_FLIGHT_NUMBER_ROUTE_DELAY_REDUCTION_24_HOURS_MEAN,DESTINATION_AIRPORT_TAXI_IN_1_HOURS_MEAN,DESTINATION_AIRPORT_TAXI_IN_5_HOURS_MEAN,DESTINATION_AIRPORT_TAXI_IN_24_HOURS_MEAN,PERC_OF_INCREASED_DEL_ORIGIN_AIRPORT,PERC_OF_INCREASED_DEL_DESTINATION_AIRPORT,PERC_OF_INCREASED_DEL_AIRLINE,PERC_OF_INCREASED_DEL_ROUTE_STATES,PERC_OF_INCREASED_DEL_ROUTE,PERC_OF_INCREASED_DEL_TAIL_NUMBER,PERC_OF_INCREASED_DEL_SCHEDULED_DEPARTURE_HH,PERC_OF_INCREASED_DEL_SCHEDULED_ARRIVAL_HH,PERC_OF_INCREASED_DEL_AIRLINE_ROUTE,PERC_OF_INCREASED_DEL_AIRLINE_FLIGHT_NUMBER_ROUTE,AIRLINE_FLIGHT_NUMBER_ROUTE_DELAY_REDUCTION_ROLLING_1_MEAN,AIRLINE_FLIGHT_NUMBER_ROUTE_DELAY_REDUCTION_ROLLING_3_MEAN,AIRLINE_FLIGHT_NUMBER_ROUTE_DELAY_REDUCTION_ROLLING_7_MEAN,ORIGIN_AIRPORT_TAXI_OUT_1_HOURS_MEAN,ORIGIN_AIRPORT_TAXI_OUT_5_HOURS_MEAN,ORIGIN_AIRPORT_TAXI_OUT_24_HOURS_MEAN,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_IN_ROLLING_1_MEAN,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_IN_ROLLING_3_MEAN,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_IN_ROLLING_7_MEAN,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_OUT_ROLLING_1_MEAN,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_OUT_ROLLING_3_MEAN,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_OUT_ROLLING_7_MEAN,TAIL_NUMBER_DELAY_REDUCTION_ROLLING_1_MEAN,TAIL_NUMBER_DELAY_REDUCTION_ROLLING_3_MEAN,TAIL_NUMBER_DELAY_REDUCTION_ROLLING_7_MEAN,ORIGIN_AIRPORT_DELAY_REDUCTION_1_HOURS_MEAN_DIFF,DESTINATION_AIRPORT_DELAY_REDUCTION_1_HOURS_MEAN_DIFF,AIRLINE_DELAY_REDUCTION_1_HOURS_MEAN_DIFF,ROUTE_STATES_DELAY_REDUCTION_1_HOURS_MEAN_DIFF,ROUTE_DELAY_REDUCTION_1_HOURS_MEAN_DIFF,TAIL_NUMBER_DELAY_REDUCTION_1_HOURS_MEAN_DIFF,AIRLINE_ROUTE_DELAY_REDUCTION_1_HOURS_MEAN_DIFF,DESTINATION_AIRPORT_TAXI_IN_1_HOURS_MEAN_DIFF,ORIGIN_AIRPORT_TAXI_OUT_1_HOURS_MEAN_DIFF,ORIGIN_AIRPORT_DELAY_REDUCTION_5_HOURS_MEAN_DIFF,DESTINATION_AIRPORT_DELAY_REDUCTION_5_HOURS_MEAN_DIFF,AIRLINE_DELAY_REDUCTION_5_HOURS_MEAN_DIFF,ROUTE_STATES_DELAY_REDUCTION_5_HOURS_MEAN_DIFF,ROUTE_DELAY_REDUCTION_5_HOURS_MEAN_DIFF,TAIL_NUMBER_DELAY_REDUCTION_5_HOURS_MEAN_DIFF,AIRLINE_ROUTE_DELAY_REDUCTION_5_HOURS_MEAN_DIFF,DESTINATION_AIRPORT_TAXI_IN_5_HOURS_MEAN_DIFF,ORIGIN_AIRPORT_TAXI_OUT_5_HOURS_MEAN_DIFF,ORIGIN_AIRPORT_DELAY_REDUCTION_24_HOURS_MEAN_DIFF,DESTINATION_AIRPORT_DELAY_REDUCTION_24_HOURS_MEAN_DIFF,AIRLINE_DELAY_REDUCTION_24_HOURS_MEAN_DIFF,ROUTE_STATES_DELAY_REDUCTION_24_HOURS_MEAN_DIFF,ROUTE_DELAY_REDUCTION_24_HOURS_MEAN_DIFF,TAIL_NUMBER_DELAY_REDUCTION_24_HOURS_MEAN_DIFF,AIRLINE_ROUTE_DELAY_REDUCTION_24_HOURS_MEAN_DIFF,AIRLINE_FLIGHT_NUMBER_ROUTE_DELAY_REDUCTION_24_HOURS_MEAN_DIFF,DESTINATION_AIRPORT_TAXI_IN_24_HOURS_MEAN_DIFF,ORIGIN_AIRPORT_TAXI_OUT_24_HOURS_MEAN_DIFF,AIRLINE_FLIGHT_NUMBER_ROUTE_DELAY_REDUCTION_ROLLING_1_MEAN_DIFF,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_IN_ROLLING_1_MEAN_DIFF,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_OUT_ROLLING_1_MEAN_DIFF,TAIL_NUMBER_DELAY_REDUCTION_ROLLING_1_MEAN_DIFF,AIRLINE_FLIGHT_NUMBER_ROUTE_DELAY_REDUCTION_ROLLING_3_MEAN_DIFF,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_IN_ROLLING_3_MEAN_DIFF,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_OUT_ROLLING_3_MEAN_DIFF,TAIL_NUMBER_DELAY_REDUCTION_ROLLING_3_MEAN_DIFF,AIRLINE_FLIGHT_NUMBER_ROUTE_DELAY_REDUCTION_ROLLING_7_MEAN_DIFF,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_IN_ROLLING_7_MEAN_DIFF,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_OUT_ROLLING_7_MEAN_DIFF,TAIL_NUMBER_DELAY_REDUCTION_ROLLING_7_MEAN_DIFF
0,4,NK,N633NK,PBG,FLL,-16.0,1334,-33.0,NY,44.69282,-73.45562,FL,26.07258,-80.15275,1,5,PBG_FLL,NY_FL,,,,,,,,,,,,,,,,,,,NK_PBG_FLL,NK_451_PBG_FLL,,,,,,,,0.356322,0.258135,0.36375,0.308834,0.356322,0.3725,0.377837,0.304152,0.356322,0.356322,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,4,B6,N653JB,PSE,MCO,-7.0,1179,-9.0,PR,18.0083,-66.56301,FL,28.42889,-81.31603,2,5,PSE_MCO,PR_FL,,,,,,,,,,,,,,,,,,,B6_PSE_MCO,B6_668_PSE_MCO,,,,,,,,0.385027,0.275322,0.336859,0.299888,0.382353,0.378981,0.377301,0.304152,0.382353,0.416667,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,4,DL,N958DN,DEN,ATL,-6.0,1199,-30.0,CO,39.85841,-104.667,GA,33.64044,-84.42694,0,5,DEN_ATL,CO_GA,,,,,,,,,,,,,,,,,,,DL_DEN_ATL,DL_2336_DEN_ATL,,,,,,,,0.275941,0.254317,0.22284,0.209451,0.212521,0.217507,0.346747,0.304152,0.173841,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,4,NK,N630NK,IAG,FLL,-5.0,1176,-11.0,NY,43.10726,-78.94538,FL,26.07258,-80.15275,2,5,IAG_FLL,NY_FL,,,,,,,,,,,,,,,,,,,NK_IAG_FLL,NK_647_IAG_FLL,,,,,,,,0.375,0.258135,0.36375,0.308834,0.365854,0.395577,0.377301,0.304152,0.365854,0.365854,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,4,UA,N76519,SJU,EWR,3.0,1608,-11.0,PR,18.43942,-66.00183,NJ,40.6925,-74.16866,1,5,SJU_EWR,PR_NJ,,,,,,,,,,,,,,,,,,,UA_SJU_EWR,UA_1528_SJU_EWR,,,,,,,,0.307278,0.279543,0.216393,0.301408,0.300341,0.205021,0.377837,0.304152,0.307692,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [8]:
test.head()

Unnamed: 0,DAY_OF_WEEK,AIRLINE,TAIL_NUMBER,ORIGIN_AIRPORT,DESTINATION_AIRPORT,DEPARTURE_DELAY,DISTANCE,ARRIVAL_DELAY,STATE_ORIGIN,LATITUDE_ORIGIN,LONGITUDE_ORIGIN,STATE_DESTINATION,LATITUDE_DESTINATION,LONGITUDE_DESTINATION,SCHEDULED_DEPARTURE_HH,SCHEDULED_ARRIVAL_HH,ROUTE,ROUTE_STATES,ORIGIN_AIRPORT_DELAY_REDUCTION_1_HOURS_MEAN,ORIGIN_AIRPORT_DELAY_REDUCTION_5_HOURS_MEAN,DESTINATION_AIRPORT_DELAY_REDUCTION_1_HOURS_MEAN,DESTINATION_AIRPORT_DELAY_REDUCTION_5_HOURS_MEAN,AIRLINE_DELAY_REDUCTION_1_HOURS_MEAN,AIRLINE_DELAY_REDUCTION_5_HOURS_MEAN,ROUTE_STATES_DELAY_REDUCTION_1_HOURS_MEAN,ROUTE_STATES_DELAY_REDUCTION_5_HOURS_MEAN,ROUTE_DELAY_REDUCTION_1_HOURS_MEAN,ROUTE_DELAY_REDUCTION_5_HOURS_MEAN,TAIL_NUMBER_DELAY_REDUCTION_1_HOURS_MEAN,TAIL_NUMBER_DELAY_REDUCTION_5_HOURS_MEAN,ORIGIN_AIRPORT_DELAY_REDUCTION_24_HOURS_MEAN,DESTINATION_AIRPORT_DELAY_REDUCTION_24_HOURS_MEAN,AIRLINE_DELAY_REDUCTION_24_HOURS_MEAN,ROUTE_STATES_DELAY_REDUCTION_24_HOURS_MEAN,ROUTE_DELAY_REDUCTION_24_HOURS_MEAN,TAIL_NUMBER_DELAY_REDUCTION_24_HOURS_MEAN,AIRLINE_ROUTE,AIRLINE_FLIGHT_NUMBER_ROUTE,AIRLINE_ROUTE_DELAY_REDUCTION_1_HOURS_MEAN,AIRLINE_ROUTE_DELAY_REDUCTION_5_HOURS_MEAN,AIRLINE_ROUTE_DELAY_REDUCTION_24_HOURS_MEAN,AIRLINE_FLIGHT_NUMBER_ROUTE_DELAY_REDUCTION_24_HOURS_MEAN,DESTINATION_AIRPORT_TAXI_IN_1_HOURS_MEAN,DESTINATION_AIRPORT_TAXI_IN_5_HOURS_MEAN,DESTINATION_AIRPORT_TAXI_IN_24_HOURS_MEAN,PERC_OF_INCREASED_DEL_ORIGIN_AIRPORT,PERC_OF_INCREASED_DEL_DESTINATION_AIRPORT,PERC_OF_INCREASED_DEL_AIRLINE,PERC_OF_INCREASED_DEL_ROUTE_STATES,PERC_OF_INCREASED_DEL_ROUTE,PERC_OF_INCREASED_DEL_TAIL_NUMBER,PERC_OF_INCREASED_DEL_SCHEDULED_DEPARTURE_HH,PERC_OF_INCREASED_DEL_SCHEDULED_ARRIVAL_HH,PERC_OF_INCREASED_DEL_AIRLINE_ROUTE,PERC_OF_INCREASED_DEL_AIRLINE_FLIGHT_NUMBER_ROUTE,AIRLINE_FLIGHT_NUMBER_ROUTE_DELAY_REDUCTION_ROLLING_1_MEAN,AIRLINE_FLIGHT_NUMBER_ROUTE_DELAY_REDUCTION_ROLLING_3_MEAN,AIRLINE_FLIGHT_NUMBER_ROUTE_DELAY_REDUCTION_ROLLING_7_MEAN,ORIGIN_AIRPORT_TAXI_OUT_1_HOURS_MEAN,ORIGIN_AIRPORT_TAXI_OUT_5_HOURS_MEAN,ORIGIN_AIRPORT_TAXI_OUT_24_HOURS_MEAN,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_IN_ROLLING_1_MEAN,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_IN_ROLLING_3_MEAN,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_IN_ROLLING_7_MEAN,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_OUT_ROLLING_1_MEAN,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_OUT_ROLLING_3_MEAN,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_OUT_ROLLING_7_MEAN,TAIL_NUMBER_DELAY_REDUCTION_ROLLING_1_MEAN,TAIL_NUMBER_DELAY_REDUCTION_ROLLING_3_MEAN,TAIL_NUMBER_DELAY_REDUCTION_ROLLING_7_MEAN,ORIGIN_AIRPORT_DELAY_REDUCTION_1_HOURS_MEAN_DIFF,DESTINATION_AIRPORT_DELAY_REDUCTION_1_HOURS_MEAN_DIFF,AIRLINE_DELAY_REDUCTION_1_HOURS_MEAN_DIFF,ROUTE_STATES_DELAY_REDUCTION_1_HOURS_MEAN_DIFF,ROUTE_DELAY_REDUCTION_1_HOURS_MEAN_DIFF,TAIL_NUMBER_DELAY_REDUCTION_1_HOURS_MEAN_DIFF,AIRLINE_ROUTE_DELAY_REDUCTION_1_HOURS_MEAN_DIFF,DESTINATION_AIRPORT_TAXI_IN_1_HOURS_MEAN_DIFF,ORIGIN_AIRPORT_TAXI_OUT_1_HOURS_MEAN_DIFF,ORIGIN_AIRPORT_DELAY_REDUCTION_5_HOURS_MEAN_DIFF,DESTINATION_AIRPORT_DELAY_REDUCTION_5_HOURS_MEAN_DIFF,AIRLINE_DELAY_REDUCTION_5_HOURS_MEAN_DIFF,ROUTE_STATES_DELAY_REDUCTION_5_HOURS_MEAN_DIFF,ROUTE_DELAY_REDUCTION_5_HOURS_MEAN_DIFF,TAIL_NUMBER_DELAY_REDUCTION_5_HOURS_MEAN_DIFF,AIRLINE_ROUTE_DELAY_REDUCTION_5_HOURS_MEAN_DIFF,DESTINATION_AIRPORT_TAXI_IN_5_HOURS_MEAN_DIFF,ORIGIN_AIRPORT_TAXI_OUT_5_HOURS_MEAN_DIFF,ORIGIN_AIRPORT_DELAY_REDUCTION_24_HOURS_MEAN_DIFF,DESTINATION_AIRPORT_DELAY_REDUCTION_24_HOURS_MEAN_DIFF,AIRLINE_DELAY_REDUCTION_24_HOURS_MEAN_DIFF,ROUTE_STATES_DELAY_REDUCTION_24_HOURS_MEAN_DIFF,ROUTE_DELAY_REDUCTION_24_HOURS_MEAN_DIFF,TAIL_NUMBER_DELAY_REDUCTION_24_HOURS_MEAN_DIFF,AIRLINE_ROUTE_DELAY_REDUCTION_24_HOURS_MEAN_DIFF,AIRLINE_FLIGHT_NUMBER_ROUTE_DELAY_REDUCTION_24_HOURS_MEAN_DIFF,DESTINATION_AIRPORT_TAXI_IN_24_HOURS_MEAN_DIFF,ORIGIN_AIRPORT_TAXI_OUT_24_HOURS_MEAN_DIFF,AIRLINE_FLIGHT_NUMBER_ROUTE_DELAY_REDUCTION_ROLLING_1_MEAN_DIFF,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_IN_ROLLING_1_MEAN_DIFF,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_OUT_ROLLING_1_MEAN_DIFF,TAIL_NUMBER_DELAY_REDUCTION_ROLLING_1_MEAN_DIFF,AIRLINE_FLIGHT_NUMBER_ROUTE_DELAY_REDUCTION_ROLLING_3_MEAN_DIFF,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_IN_ROLLING_3_MEAN_DIFF,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_OUT_ROLLING_3_MEAN_DIFF,TAIL_NUMBER_DELAY_REDUCTION_ROLLING_3_MEAN_DIFF,AIRLINE_FLIGHT_NUMBER_ROUTE_DELAY_REDUCTION_ROLLING_7_MEAN_DIFF,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_IN_ROLLING_7_MEAN_DIFF,AIRLINE_FLIGHT_NUMBER_ROUTE_TAXI_OUT_ROLLING_7_MEAN_DIFF,TAIL_NUMBER_DELAY_REDUCTION_ROLLING_7_MEAN_DIFF
0,3,NK,N613NK,IAG,FLL,-12.0,1176,-31.0,NY,43.10726,-78.94538,FL,26.07258,-80.15275,2,5,IAG_FLL,NY_FL,,,19.333333,12.357143,3.416667,4.278481,19.333333,16.735294,,,-8.0,-8.0,16.0,14.430279,3.696667,16.335079,16.0,2.5,NK_IAG_FLL,NK_647_IAG_FLL,,,16.0,16.0,4.0,4.410714,5.298805,0.375,0.258135,0.36375,0.308834,0.365854,0.326648,0.377301,0.304152,0.365854,0.365854,16.0,14.0,4.571429,,,11.0,5.0,7.333333,8.857143,11.0,10.333333,9.142857,-8.0,4.666667,2.285714,,13.127885,1.833765,13.889732,,-9.862464,,-2.257,,,6.151695,2.695579,11.291692,,-9.862464,,-1.846285,,16.135417,8.224831,2.113765,10.891477,16.036585,0.637536,16.036585,16.036585,-0.958195,-4.072917,16.036585,-6.890244,-4.731707,-9.862464,14.036585,-4.556911,-5.398374,2.804202,4.608014,-3.033101,-6.58885,0.42325
1,3,B6,N665JB,SJU,JFK,-4.0,1598,-20.0,PR,18.43942,-66.00183,NY,40.63975,-73.77893,1,5,SJU_JFK,PR_NY,15.5,-3.0,3.454545,6.528571,10.642857,7.301676,15.5,5.5,15.5,5.5,,14.0,1.253521,5.842105,7.356952,14.294118,13.666667,6.5,B6_SJU_JFK,B6_1204_SJU_JFK,10.0,8.5,15.875,13.0,10.0,11.514286,9.677193,0.307278,0.337974,0.336859,0.367756,0.366125,0.355556,0.377837,0.304152,0.35284,0.490909,13.0,14.333333,-3.714286,10.0,12.727273,13.746479,10.0,8.0,9.285714,11.0,11.666667,16.0,14.0,16.333333,17.428571,11.288583,0.026126,7.945688,12.378158,12.274166,,6.394148,0.130555,-3.911217,-7.211417,3.100152,4.604507,2.378158,2.274166,12.02963,4.894148,1.64484,-1.183944,-2.957896,2.413685,4.659783,11.172275,10.440833,4.52963,12.269148,14.363636,-0.192252,-0.164738,14.363636,1.363636,-2.418182,12.02963,15.69697,-0.636364,-1.751515,14.362963,-2.350649,0.649351,2.581818,15.458201
2,3,UA,N37273,SJU,EWR,6.0,1608,-13.0,PR,18.43942,-66.00183,NJ,40.6925,-74.16866,0,5,SJU_EWR,PR_NJ,15.5,-3.0,6.4,-4.388235,10.453125,8.039062,,-11.5,,-11.5,15.0,15.0,1.444444,2.12844,9.215837,5.0,2.0,15.0,UA_SJU_EWR,UA_1261_SJU_EWR,,-24.0,2.333333,,6.6,10.105882,9.464832,0.307278,0.279543,0.216393,0.301408,0.300341,0.232227,0.346747,0.304152,0.307692,,,,,10.0,12.727273,13.763889,,,,,,,15.0,7.666667,9.428571,11.288583,1.251773,2.399367,,,7.023697,,-3.450774,-3.911217,-7.211417,-9.536462,-0.014696,-16.857746,-16.602389,7.023697,-29.553846,0.055108,-1.183944,-2.766973,-3.019787,1.162079,-0.357746,-3.102389,7.023697,-3.220513,,-0.585942,-0.147328,,,,7.023697,,,,-0.309637,,,,1.452268
3,3,DL,N557NW,DEN,ATL,-8.0,1199,-11.0,CO,39.85841,-104.667,GA,33.64044,-84.42694,0,5,DEN_ATL,CO_GA,0.636364,5.595092,,2.488372,1.896552,3.900862,,6.5,,6.5,,4.0,6.940778,8.227695,6.922259,8.0,7.588235,5.833333,DL_DEN_ATL,DL_1214_DEN_ATL,,8.0,10.75,14.0,,11.244186,8.29368,0.275941,0.254317,0.22284,0.209451,0.212521,0.194373,0.346747,0.304152,0.173841,0.103448,14.0,11.333333,9.714286,9.272727,13.202454,13.527919,5.0,5.333333,6.0,11.0,12.666667,13.714286,4.0,7.333333,6.142857,-3.743105,,-4.965448,,,,,,-7.512678,1.215623,-3.268542,-2.961137,-1.479566,-1.540362,-3.309463,-2.705298,2.507256,-3.582951,2.561309,2.470781,0.060259,0.020434,-0.452127,-1.47613,0.044702,6.310345,-0.44325,-3.257487,6.310345,-1.586207,-4.931034,-3.309463,3.643678,-1.252874,-3.264368,0.02387,2.024631,-0.586207,-2.216749,-1.166606
4,3,NK,N624NK,PBG,FLL,-12.0,1334,-20.0,NY,44.69282,-73.45562,FL,26.07258,-80.15275,1,5,PBG_FLL,NY_FL,,,18.5,12.189655,2.538462,3.853659,18.5,16.333333,,,0.0,1.0,18.0,14.404,3.647841,16.310526,18.0,2.0,NK_PBG_FLL,NK_451_PBG_FLL,,,18.0,18.0,4.0,4.448276,5.304,0.356322,0.258135,0.36375,0.308834,0.356322,0.367847,0.377837,0.304152,0.356322,0.356322,18.0,14.666667,0.571429,,,13.0,8.0,9.0,10.571429,13.0,11.0,13.142857,0.0,2.0,3.428571,,12.294552,0.955559,13.056398,,-1.433243,,-2.257,,,5.984207,2.270756,10.889732,,-0.433243,,-1.808724,,14.321839,8.198552,2.064938,10.866925,14.321839,0.566757,14.321839,14.321839,-0.953,-2.597701,14.321839,-0.574713,-2.597701,-1.433243,10.988506,0.425287,-4.597701,0.566757,-3.106732,1.996716,-2.454844,1.995329


In [9]:
train.to_pickle('train.pkl')
test.to_pickle('test.pkl')