# Table of Contents #

- [Importing Necessary Libraries](#Importing-Necessary-Libraries)
- [Importing Data and Initial Checks](#Importing-Data-and-Initial-Checks)
- [Target Variable and Features Matrix](#Target-Variable-and-Features-Matrix)
- [FFNN](#FFNN)

## Importing Necessary Libraries ##

In [1]:
import pandas as pd
import numpy as np


import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.pipeline                import Pipeline
from sklearn.model_selection         import train_test_split, GridSearchCV
from sklearn.linear_model            import LogisticRegression
from sklearn.ensemble                import (BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier,
                                             AdaBoostClassifier, GradientBoostingClassifier)

from sklearn.tree                    import DecisionTreeClassifier
from sklearn.svm                     import SVC

import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dropout


## Importing Data and Initial Checks ##

In [2]:
#Loading data from a csv file
data = pd.read_csv('~/ga/projects/capstone_data/data/data_ready.csv')

#Checking size
data.shape

(650556, 11)

In [3]:
#Checking columns
data.columns

Index(['Unnamed: 0', 'month', 'day_of_month', 'day_of_week',
       'op_carrier_fl_num', 'origin', 'dest', 'arr_delay', 'distance',
       'carrier', 'delay_indicator'],
      dtype='object')

In [4]:
#Dropping a technical column
data.drop(columns=['Unnamed: 0'], inplace=True)

#Checking DataFrame
data.head()

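|   | month | day_of_month | day_of_week | op_carrier_fl_num | origin | dest | arr_delay | distance | carrier | delay_indicator |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 29 | 4 | 3539 | GEG | SLC | -13.0 | 546.0 | Delta | 0 |
| 1 | 11 | 3 | 6 | 3614 | TPA | RDU | -14.0 | 587.0 | Delta | 0 |
| 2 | 12 | 19 | 3 | 3013 | LAX | PDX | -3.0 | 834.0 | SouthWest | 0 |
| 3 | 11 | 1 | 4 | 3557 | IAH | CVG | -13.0 | 871.0 | American | 0 |
| 4 | 12 | 11 | 2 | 4903 | IAH | DTW | -19.0 | 1075.0 | Delta | 0 |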


## Target Variable and Features Matrix ##

To fit our model we use **DELAY_INDICATOR** as the target variable. We also drop ARR_DELAY from the features: the target variable was engineered directly from it, so keeping it would leak the answer into the model.

In [5]:
#Target variable
y = data['delay_indicator']

#Features matrix
X = data.drop(columns=['delay_indicator','arr_delay'])

In [6]:
#Baseline model accuracy: share of the positive class (the classes are balanced, so always guessing one class scores 0.5)
y.mean()

0.5
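The baseline is the accuracy of always predicting the most frequent class. `y.mean()` returns the share of positives, which matches that baseline when the classes are balanced, as they are here, but not when negatives are the majority. A minimal sketch of the more general computation, using a hypothetical toy label series (`toy_y`) rather than the flight data:

```python
import pandas as pd

# Hypothetical imbalanced labels for illustration only (not the flight data)
toy_y = pd.Series([0, 0, 0, 1])

# Share of the positive class: what y.mean() reports
positive_share = toy_y.mean()

# Majority-class baseline: accuracy of always predicting the most common label
baseline_accuracy = toy_y.value_counts(normalize=True).max()

print(positive_share, baseline_accuracy)
```

On the toy labels the two numbers differ, which is why the majority-class share is the safer definition if the class balance ever changes.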

In [7]:
#Checking our feature matrix data types
X.dtypes

month                  int64
day_of_month           int64
day_of_week            int64
op_carrier_fl_num      int64
origin                object
dest                  object
distance             float64
carrier               object
dtype: object

In [8]:
#Getting dummies for our text features ORIGIN, DEST and CARRIER
X = pd.get_dummies(X,columns = ['origin','dest','carrier'],drop_first=True)

#Checking the shape of our feature matrix
X.shape

(650556, 711)
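The jump to 711 columns comes from one-hot encoding: each text feature is replaced by one indicator column per category, minus one because of `drop_first=True`. A small sketch of that behaviour on a hypothetical three-row frame (`toy`), not the real data:

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and two text columns
toy = pd.DataFrame({
    'distance': [546.0, 587.0, 834.0],
    'origin': ['GEG', 'TPA', 'LAX'],
    'carrier': ['Delta', 'Delta', 'SouthWest'],
})

# drop_first=True drops the first category (in sorted order) of each encoded
# column, so k categories become k - 1 indicator columns
encoded = pd.get_dummies(toy, columns=['origin', 'carrier'], drop_first=True)

print(list(encoded.columns))
print(encoded.shape)
```

Here `origin` has three categories and `carrier` has two, so the encoded frame has 1 + 2 + 1 = 4 columns.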

In [9]:
#Training and testing sets split with random_state=1519 for reproducibility of results
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1519)
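With no `test_size` given, `train_test_split` holds out 25% of the rows for testing. The sketch below shows this on hypothetical toy arrays. The `stratify` argument is an addition that the cell above does not use; it is shown only as one way to keep the class balance identical in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 samples, 2 features, balanced binary labels
toy_X = np.arange(16).reshape(8, 2)
toy_y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# Default split is 75% train / 25% test; stratify is an optional extra
# (an assumption here, not part of the original cell)
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, stratify=toy_y, random_state=1519
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```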

In [10]:
# #As an additional step, we need to scale our feature matrices before modeling
# ss = StandardScaler()
# X_train = ss.fit_transform(X_train)
# X_test = ss.transform(X_test)
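If scaling is switched back on, one option is to standardize only the numeric columns and leave the 0/1 dummy columns untouched. The sketch below illustrates that idea on hypothetical toy frames (`toy_train`, `toy_test`); the choice of which columns to scale is an assumption, not something the notebook specifies. The scaler is fitted on the training rows only, so no information from the test rows reaches it.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frames standing in for X_train / X_test
toy_train = pd.DataFrame({
    'distance': [546.0, 587.0, 834.0, 871.0],
    'carrier_SouthWest': [0, 0, 1, 0],
})
toy_test = pd.DataFrame({
    'distance': [1075.0],
    'carrier_SouthWest': [0],
})

# Assumption: only continuous columns are scaled; dummies stay as 0/1
numeric_cols = ['distance']

scaler = StandardScaler()
toy_train_scaled = toy_train.copy()
toy_test_scaled = toy_test.copy()

# Fit on the training rows only, then reuse the same mean and std for the test rows
toy_train_scaled[numeric_cols] = scaler.fit_transform(toy_train[numeric_cols])
toy_test_scaled[numeric_cols] = scaler.transform(toy_test[numeric_cols])

print(toy_train_scaled)
print(toy_test_scaled)
```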

## FFNN ##

In [11]:
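#Feed-forward network: two hidden ReLU layers and a sigmoid output for binary classification
model = Sequential()

#Input size is taken from the feature matrix (711 columns after dummy encoding)
model.add(Dense(128,
                input_shape=(X_train.shape[1],),
                activation='relu'))

# model.add(Dropout(0.25))

model.add(Dense(128,
                activation='relu'))

# model.add(Dropout(0.25))

model.add(Dense(1,
                activation='sigmoid'))

#Stop training once validation loss has not improved by at least 0.001 for 5 epochs
early_stop = EarlyStopping(monitor='val_loss', min_delta=0.001, patience=5, verbose=1, mode='auto')

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])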
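To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name introduced for this sketch; the layer sizes are the ones used in the model above.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Layer widths from the model above: 711 inputs -> 128 -> 128 -> 1 output
layer_sizes = [711, 128, 128, 1]

# Pair each layer's input width with its output width
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer, total_params)
```

Most of the parameters sit in the first layer, because the dummy encoding makes the input so wide.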

In [None]:
history = model.fit(
    X_train,
    y_train,
    validation_data=(X_test, y_test),
    epochs=10,
    batch_size=512,
    verbose=2,
    callbacks=[early_stop]
)

Train on 487917 samples, validate on 162639 samples
Epoch 1/10
487917/487917 - 46s - loss: 1.9725 - accuracy: 0.5142 - val_loss: 0.7403 - val_accuracy: 0.5257
Epoch 2/10
487917/487917 - 38s - loss: 0.7510 - accuracy: 0.5331 - val_loss: 0.7459 - val_accuracy: 0.5305
Epoch 3/10
487917/487917 - 51s - loss: 0.7253 - accuracy: 0.5408 - val_loss: 0.7160 - val_accuracy: 0.5226
Epoch 4/10
