# Table of Contents #

- [Importing Necessary Libraries](#Importing-Necessary-Libraries)
- [Importing Data and Initial Checks](#Importing-Data-and-Initial-Checks)
- [Target Variable and Features Matrix](#Target-Variable-and-Features-Matrix)
- [FFNN](#FFNN)

## Importing Necessary Libraries ##

In [1]:
import pandas as pd
import numpy as np


import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.pipeline                import Pipeline
from sklearn.model_selection         import train_test_split, GridSearchCV
from sklearn.linear_model            import LogisticRegression
from sklearn.ensemble                import (BaggingClassifier, RandomForestClassifier,
                                             ExtraTreesClassifier, AdaBoostClassifier,
                                             GradientBoostingClassifier)

from sklearn.tree                    import DecisionTreeClassifier
from sklearn.svm                     import SVC

import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

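#Importing from tensorflow.keras throughout; mixing it with standalone keras can cause incompatibilities
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dropout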


## Importing Data and Initial Checks ##

In [2]:
#Loading data from a csv file
data = pd.read_csv('~/ga/projects/capstone_data/data/data_ready.csv')

#Checking size
data.shape

(644232, 11)

In [3]:
#Checking columns
data.columns

Index(['Unnamed: 0', 'month', 'day_of_month', 'day_of_week',
       'op_carrier_fl_num', 'origin', 'dest', 'arr_delay', 'delay_indicator',
       'distance', 'carrier'],
      dtype='object')

In [4]:
#Dropping a technical column (the old index saved to the csv file)
data.drop(columns=['Unnamed: 0'], inplace=True)

#Checking DataFrame
data.head()

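| | month | day_of_month | day_of_week | op_carrier_fl_num | origin | dest | arr_delay | delay_indicator | distance | carrier |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 3 | 3 | 4195 | LEX | ORD | -15.0 | 0.0 | 323.0 | American |
| 1 | 11 | 5 | 1 | 6002 | DFW | CHA | -7.0 | 0.0 | 695.0 | American |
| 2 | 11 | 4 | 7 | 1937 | ORD | AUS | -24.0 | 0.0 | 977.0 | United |
| 3 | 10 | 13 | 6 | 948 | SFO | DEN | -3.0 | 0.0 | 967.0 | United |
| 4 | 11 | 9 | 5 | 1026 | HOU | ABQ | -8.0 | 0.0 | 759.0 | SouthWest |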


## Target Variable and Features Matrix ##

To fit a logistic regression, we use **DELAY_INDICATOR** as our target variable. We also drop ARR_DELAY from our features: the target variable was engineered directly from it, so keeping it would leak the answer into the model.
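
As a quick illustration of why ARR_DELAY has to go, here is a minimal sketch on toy data. The 15-minute cutoff and the `toy` frame are assumptions made for illustration only; the notebook does not state how DELAY_INDICATOR was actually derived.

```python
import pandas as pd

# Toy data; the 15-minute cutoff is a hypothetical threshold, not taken from the notebook
toy = pd.DataFrame({"arr_delay": [-15.0, -7.0, 22.0, 40.0, 3.0, 16.0]})
toy["delay_indicator"] = (toy["arr_delay"] >= 15).astype(float)

# A single rule on arr_delay reproduces the target exactly, so a model that
# sees this column can score perfectly without learning anything useful
leaky_prediction = (toy["arr_delay"] >= 15).astype(float)
leak_accuracy = (leaky_prediction == toy["delay_indicator"]).mean()
print(leak_accuracy)
```

Whatever the real rule was, any feature that determines the target this directly would make the evaluation meaningless, which is why we remove it before modeling.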

In [5]:
#Target variable
y = data['delay_indicator']

#Features matrix
X = data.drop(columns=['delay_indicator','arr_delay'])

In [6]:
#Baseline model accuracy: the mean of a 0/1 target is the share of delayed flights
y.mean()

0.5
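
Because the target is coded 0/1, `y.mean()` is the share of delayed flights, and it only equals the baseline accuracy when the classes are balanced, as they are here. A slightly more general sketch takes the share of the most frequent class instead. The helper name `baseline_accuracy` and the toy series are ours and not part of the original analysis.

```python
import pandas as pd

def baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class."""
    return y.value_counts(normalize=True).max()

balanced = pd.Series([0.0, 1.0, 0.0, 1.0])   # same class balance as our target
skewed = pd.Series([0.0, 0.0, 0.0, 1.0])     # hypothetical imbalanced target

print(balanced.mean(), baseline_accuracy(balanced))
print(skewed.mean(), baseline_accuracy(skewed))
```

On the skewed series the mean is 0.25 while the majority-class baseline is 0.75, so the two numbers should not be used interchangeably on imbalanced data.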

In [7]:
#Checking our feature matrix data types
X.dtypes

month                  int64
day_of_month           int64
day_of_week            int64
op_carrier_fl_num      int64
origin                object
dest                  object
distance             float64
carrier               object
dtype: object

In [8]:
#Getting dummies for our text features ORIGIN, DEST and CARRIER
X = pd.get_dummies(X, columns=['origin', 'dest', 'carrier'], drop_first=True)

#Checking the shape of our feature matrix
X.shape

(644232, 712)
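
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up frame (the rows and the name `toy` are illustrative). Each categorical column with k levels contributes k - 1 dummy columns, which is how our three text features grow the matrix from 8 to 712 columns.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns
toy = pd.DataFrame({
    "distance": [323.0, 695.0, 977.0, 967.0],
    "origin": ["LEX", "DFW", "ORD", "SFO"],
    "carrier": ["American", "American", "United", "United"],
})

# drop_first=True drops the first level of each category to avoid redundant columns
encoded = pd.get_dummies(toy, columns=["origin", "carrier"], drop_first=True)

print(list(encoded.columns))
print(encoded.shape)
```

Here `origin` has four levels and yields three dummies, `carrier` has two levels and yields one, so the encoded frame has 1 + 3 + 1 = 5 columns.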

In [9]:
#Training and testing sets split with random_state=1519 for reproducibility of results
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1519)
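
We did not pass `test_size`, so scikit-learn should fall back to its default of holding out 25% of the rows. A back-of-the-envelope check, assuming that default and the row count reported by `data.shape`:

```python
import math

n_rows = 644232        # from data.shape
test_fraction = 0.25   # assumed scikit-learn default when no test_size is given

# The test set size is rounded up; the training set gets the remaining rows
n_test = math.ceil(n_rows * test_fraction)
n_train = n_rows - n_test

print(n_train, n_test)
```

These sizes agree with the sample counts Keras reports when training starts. Since our classes are balanced, a plain random split is reasonable; for an imbalanced target, passing `stratify=y` would be worth considering.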

In [10]:
#As an additional step, we need to scale our feature matrices before modeling.
#The scaler is fit on the training set only and then applied to the test set.
ss = StandardScaler()
X_train = pd.DataFrame(ss.fit_transform(X_train), columns=X.columns, index=X_train.index)
X_test = pd.DataFrame(ss.transform(X_test), columns=X.columns, index=X_test.index)
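
We call `fit_transform` on the training set but only `transform` on the test set because the scaling statistics should come from the training data alone. A rough NumPy sketch of the same idea, using toy arrays rather than our flight data:

```python
import numpy as np

X_train_toy = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
X_test_toy = np.array([[3.0, 70.0]])

# Statistics are computed on the training rows only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population std (ddof=0), the convention StandardScaler uses

# The same training statistics are applied to both sets
train_scaled = (X_train_toy - mean) / std
test_scaled = (X_test_toy - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

The scaled training columns have mean 0 and standard deviation 1 by construction, while the test rows generally do not, and that is expected. This sketch ignores constant columns, where the standard deviation is zero and the division would fail.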

## FFNN ##
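
The network built below takes our 712 scaled features, passes them through two hidden layers of 128 and 64 units, and ends in a single sigmoid unit. As a rough size check, the number of trainable parameters can be worked out by hand, assuming each Dense layer has one weight per input-output pair plus one bias per output unit (the Keras default). The helper `dense_params` is our own name.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Input width followed by the width of each Dense layer
layer_sizes = [712, 128, 64, 1]

per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer, total_params)
```

Almost all of the parameters sit in the first layer, which is a direct consequence of one-hot encoding the airports. Calling `model.summary()` after the model is built should report the same total.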

In [11]:
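#Feedforward network: two hidden ReLU layers and a sigmoid output for binary classification
model = Sequential()

model.add(Dense(128,
                input_shape=(X_train.shape[1],),  #712 features after getting dummies
                activation='relu'))

# model.add(Dropout(0.25))

model.add(Dense(64, activation='relu'))

# model.add(Dropout(0.25))

model.add(Dense(1, activation='sigmoid'))

#Stop training once val_loss has not improved by at least 0.001 for 5 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', min_delta=0.001, patience=5, verbose=1, mode='auto')

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])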
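
To show what `min_delta=0.001` and `patience=5` mean in practice, here is a simplified pure-Python sketch of the stopping rule. It is our own re-implementation of the idea, not Keras's actual code, and the function name `stopping_epoch` and the toy loss curve are illustrative.

```python
def stopping_epoch(val_losses, min_delta=0.001, patience=5):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Improvement large enough to count: remember it and reset the counter
            best = loss
            wait = 0
        else:
            # No meaningful improvement this epoch
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up curve that plateaus after the second epoch
toy_losses = [0.70, 0.68, 0.6795, 0.6799, 0.6801, 0.6797, 0.6796]
print(stopping_epoch(toy_losses))
```

On the toy curve the loss stops improving by more than 0.001 after epoch 2, so the counter reaches 5 at epoch 7 and training would stop there. Note that the callback monitors whatever is passed as `validation_data`, so using the test set for this purpose means the test score is no longer a fully independent estimate.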

In [12]:
history = model.fit(
    X_train,
    y_train,
    validation_data=(X_test, y_test),
    epochs=100,
    batch_size=512,
    verbose=2,
    callbacks=[early_stop]
)

Train on 483174 samples, validate on 161058 samples
Epoch 1/100
483174/483174 - 23s - loss: 0.6822 - accuracy: 0.5672 - val_loss: 0.6755 - val_accuracy: 0.5765
Epoch 2/100
483174/483174 - 27s - loss: 0.6702 - accuracy: 0.5869 - val_loss: 0.6691 - val_accuracy: 0.5888
Epoch 3/100
483174/483174 - 24s - loss: 0.6644 - accuracy: 0.5972 - val_loss: 0.6661 - val_accuracy: 0.5940
Epoch 4/100
483174/483174 - 23s - loss: 0.6590 - accuracy: 0.6059 - val_loss: 0.6611 - val_accuracy: 0.6031
Epoch 5/100
483174/483174 - 22s - loss: 0.6524 - accuracy: 0.6142 - val_loss: 0.6585 - val_accuracy: 0.6072
Epoch 6/100
483174/483174 - 23s - loss: 0.6466 - accuracy: 0.6216 - val_loss: 0.6553 - val_accuracy: 0.6119
Epoch 7/100
483174/483174 - 23s - loss: 0.6414 - accuracy: 0.6291 - val_loss: 0.6541 - val_accuracy: 0.6123
Epoch 8/100
483174/483174 - 23s - loss: 0.6369 - accuracy: 0.6347 - val_loss: 0.6523 - val_accuracy: 0.6161
Epoch 9/100
483174/483174 - 24s - loss: 0.6327 - accuracy: 0.6399 - val_loss: 0.6506