# Table of Contents #

- [Importing Necessary Libraries](#Importing-Necessary-Libraries)
- [Importing Data and Initial Checks](#Importing-Data-and-Initial-Checks)
- [Target Variable and Features Matrix](#Target-Variable-and-Features-Matrix)
- [FFNN](#FFNN)

## Importing Necessary Libraries ##

In [1]:
import warnings

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from keras.callbacks import EarlyStopping

#Suppressing FutureWarning messages
warnings.simplefilter(action='ignore', category=FutureWarning)

Using TensorFlow backend.


## Importing Data and Initial Checks ##

In [2]:
#Loading data from a csv file
data = pd.read_csv('~/ga/projects/capstone_data/data/data_ready.csv')

#Checking size
data.shape

(644232, 12)

In [3]:
#Checking columns
data.columns

Index(['Unnamed: 0', 'month', 'day_of_month', 'day_of_week',
       'op_carrier_fl_num', 'origin', 'dest', 'arr_delay', 'delay_indicator',
       'distance', 'carrier', 'dep_hour'],
      dtype='object')

In [4]:
#Dropping a technical column
data.drop(columns=['Unnamed: 0'], inplace=True)

#Checking DataFrame
data.head()

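|   | month | day_of_month | day_of_week | op_carrier_fl_num | origin | dest | arr_delay | delay_indicator | distance | carrier | dep_hour |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 3 | 3 | 5228 | ONT | SFO | -12.0 | 0.0 | 363.0 | Delta | 11 |
| 1 | 11 | 7 | 3 | 1443 | BNA | DAL | -7.0 | 0.0 | 623.0 | SouthWest | 15 |
| 2 | 12 | 14 | 5 | 4072 | LGA | CLE | -12.0 | 0.0 | 419.0 | United | 15 |
| 3 | 12 | 9 | 7 | 331 | JFK | LAX | -17.0 | 0.0 | 2475.0 | American | 11 |
| 4 | 12 | 17 | 1 | 3539 | SLC | GEG | -19.0 | 0.0 | 546.0 | Delta | 15 |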


In [5]:
#Keeping only the flights flagged as delayed
data = data[data['delay_indicator'] == 1]

#Keeping only delays of at most 120 minutes
data = data[data['arr_delay'] <= 120]

## Target Variable and Features Matrix ##

To fit a regression model, we use **ARR_DELAY** (the arrival delay in minutes) as our continuous target variable. We also need to drop DELAY_INDICATOR from our features: it is tied directly to the arrival delay and, after the filtering step above, equals 1 for every remaining row, so it carries no information.

In [6]:
#Target variable
y = data['arr_delay']

#Features matrix
X = data.drop(columns=['delay_indicator','arr_delay'])

In [7]:
#Baseline: the mean arrival delay in minutes, i.e. the prediction of a naive constant model
y.mean()

41.329520619577565
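The mean is a useful baseline because a model that always predicts it has an MSE equal to the (population) variance of the target, which gives a number to compare any trained model against. The sketch below illustrates this on a handful of made-up delay values; the list `delays` is hypothetical and is not taken from the dataset.

```python
import math
import statistics

# Hypothetical arrival delays in minutes (illustrative values, not from the dataset)
delays = [15.0, 22.0, 30.0, 41.0, 58.0, 75.0, 110.0]

# A constant model that always predicts the mean delay
baseline_prediction = statistics.fmean(delays)

# MSE of that constant prediction; equals the population variance of the delays
baseline_mse = sum((d - baseline_prediction) ** 2 for d in delays) / len(delays)
baseline_rmse = math.sqrt(baseline_mse)

print(baseline_prediction, baseline_mse, baseline_rmse)
```

On the notebook's own target, the same quantity could be computed as `((y - y.mean()) ** 2).mean()`.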

In [8]:
#Checking our feature matrix data types
X.dtypes

month                  int64
day_of_month           int64
day_of_week            int64
op_carrier_fl_num      int64
origin                object
dest                  object
distance             float64
carrier               object
dep_hour               int64
dtype: object

In [9]:
#Getting dummies for our text features ORIGIN, DEST and CARRIER
X = pd.get_dummies(X,columns = ['origin','dest','carrier'],drop_first=True)

#Checking the shape of our feature matrix
X.shape

(287809, 712)
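With `drop_first=True`, a text column with k distinct values becomes k-1 indicator columns, and the dropped (alphabetically first) category is represented by all zeros. A minimal sketch on a toy frame, assuming pandas is available; the rows are hypothetical and only borrow carrier names seen earlier in the notebook.

```python
import pandas as pd

# Hypothetical toy rows; the carrier names mirror those in the notebook's data
toy = pd.DataFrame({
    "distance": [363.0, 623.0, 419.0, 2475.0],
    "carrier": ["Delta", "SouthWest", "United", "Delta"],
})

# Three distinct carriers produce two indicator columns; "Delta" is the dropped reference level
encoded = pd.get_dummies(toy, columns=["carrier"], drop_first=True)

print(list(encoded.columns))
```

This is why the feature matrix grows so wide: every distinct origin, destination and carrier beyond the first adds one column.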

In [10]:
#Training and testing sets split with random_state=1519 for reproducibility of results
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1519)
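Since no `test_size` is passed, `train_test_split` falls back to its default of holding out 25% of the rows, rounding the test count up. The short check below reproduces the resulting set sizes from the row count shown earlier; it is plain arithmetic and does not call scikit-learn.

```python
import math

n_samples = 287809   # rows in the feature matrix, from X.shape above
test_size = 0.25     # scikit-learn's default when neither size is given

# The test set gets the rounded-up share; the training set gets the remainder
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 215856 71953
```

These are the sample counts reported at the start of the training log further down.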

In [11]:
#As an additional step, we need to scale our feature matrices before modeling
#The scaler is fit on the training set only and then applied to the test set
ss = StandardScaler()
X_train = pd.DataFrame(ss.fit_transform(X_train),columns = X.columns)
X_test = pd.DataFrame(ss.transform(X_test),columns = X.columns)
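The scaler learns its mean and standard deviation from the training set alone, and the test set is transformed with those same statistics so that no information from the test rows leaks into preprocessing. A minimal sketch with a single hypothetical feature, assuming NumPy and scikit-learn are installed:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature train and test sets (illustrative values only)
train = np.array([[0.0], [2.0], [4.0]])
test = np.array([[6.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train, then scales it
test_scaled = scaler.transform(test)        # reuses the train statistics, no refitting

print(scaler.mean_, train_scaled.ravel(), test_scaled.ravel())
```

The scaled training column has zero mean, while the test value lands outside the training range because it is measured against the training mean and spread.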

## FFNN ##

Let's create and compile a feed-forward neural network (FFNN) with two hidden layers of 512 and 256 neurons, respectively.

In [15]:
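#Model's topology

model = Sequential()

#First hidden layer: 512 neurons, one input per feature column
model.add(Dense(512,
                input_shape = (712,),
                activation = 'relu'))

#Second hidden layer: 256 neurons
model.add(Dense(256,
                activation = 'relu'))

#Output layer: a single linear unit for the regression target
model.add(Dense(1,
                activation = None))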

#Early stopping: halt training once val_loss has failed to improve by at least min_delta for `patience` consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', min_delta=10, patience=5, verbose=1, mode='auto')

model.compile(loss='mse',
              optimizer='adam',
              metrics=['mse']
             )

#Fitting the model and saving the training history
history = model.fit(
    X_train,
    y_train,
    validation_data=(X_test, y_test),
    epochs=10,
    batch_size=512,
    verbose=2,
    callbacks=[early_stop]
)

Train on 215856 samples, validate on 71953 samples
Epoch 1/10
215856/215856 - 16s - loss: 699.0941 - mse: 699.0938 - val_loss: 648.9366 - val_mse: 648.9366
Epoch 2/10
215856/215856 - 14s - loss: 641.5759 - mse: 641.5758 - val_loss: 647.2638 - val_mse: 647.2637
Epoch 3/10
215856/215856 - 14s - loss: 639.4631 - mse: 639.4633 - val_loss: 645.7061 - val_mse: 645.7061
Epoch 4/10
215856/215856 - 14s - loss: 637.1404 - mse: 637.1406 - val_loss: 645.3549 - val_mse: 645.3549
Epoch 5/10
215856/215856 - 14s - loss: 635.5403 - mse: 635.5404 - val_loss: 645.2430 - val_mse: 645.2431
Epoch 6/10
215856/215856 - 14s - loss: 632.9087 - mse: 632.9088 - val_loss: 644.5767 - val_mse: 644.5767
Epoch 00006: early stopping
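Training stopped at epoch 6 even though the validation loss was still creeping down, because with `min_delta=10` an epoch only counts as an improvement if it beats the best validation loss by more than 10. The sketch below is a simplified re-implementation of that rule (not the Keras source), applied to the validation losses from the log above; the function name `early_stopping_epoch` is hypothetical.

```python
def early_stopping_epoch(val_losses, min_delta, patience):
    """Return the epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement larger than min_delta: remember it and reset the counter
            best = loss
            wait = 0
        else:
            # No sufficient improvement: count the epoch against patience
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Validation losses copied from the training log above
val_losses = [648.9366, 647.2638, 645.7061, 645.3549, 645.2430, 644.5767]

stopped_at = early_stopping_epoch(val_losses, min_delta=10, patience=5)
print(stopped_at)  # 6
```

With `min_delta=0` the same losses would never trigger a stop, since every epoch improves on the previous one by a small amount.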


As we can see, the model's performance falls well short of expectations. It is computationally expensive, yet its validation MSE is about 645, which corresponds to an RMSE of about 25 minutes. In other words, our prediction of how long a delayed flight's delay will last is typically off by roughly 25 minutes. For a flight with the average delay of about 41 minutes, that means a predicted range of roughly 16 to 66 minutes, which is far from satisfactory.
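The arithmetic behind these figures can be checked directly. The snippet below takes the final validation MSE from the training log and the mean delay computed earlier, and derives the RMSE and the resulting one-RMSE band around the mean; treating that band as a prediction interval is an informal reading rather than a formal confidence interval.

```python
import math

val_mse = 644.5767                 # final val_mse from the training log
mean_delay = 41.329520619577565    # y.mean() from the baseline cell

# RMSE is the square root of MSE, expressed in minutes like the target itself
rmse = math.sqrt(val_mse)

# One-RMSE band around the average delay
lower = mean_delay - rmse
upper = mean_delay + rmse

print(round(rmse, 1), round(lower, 1), round(upper, 1))
```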