# Table of Contents #

- [Importing Necessary Libraries](#Importing-Necessary-Libraries)
- [Importing Data and Initial Checks](#Importing-Data-and-Initial-Checks)
- [Target Variable and Features Matrix](#Target-Variable-and-Features-Matrix)
- [FFNN](#FFNN)

## Importing Necessary Libraries ##

In [1]:
import pandas as pd
import numpy as np


import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.pipeline                import Pipeline
from sklearn.model_selection         import train_test_split, GridSearchCV
from sklearn.linear_model            import LogisticRegression
from sklearn.ensemble                import (BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier,
                                             AdaBoostClassifier, GradientBoostingClassifier)

from sklearn.tree                    import DecisionTreeClassifier
from sklearn.svm                     import SVC

import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical

from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dropout


## Importing Data and Initial Checks ##

In [2]:
#Loading data from a csv file
data = pd.read_csv('~/ga/projects/capstone_data/data/data_ready.csv')

#Checking size
data.shape

(644232, 11)

In [3]:
#Checking columns
data.columns

Index(['Unnamed: 0', 'month', 'day_of_month', 'day_of_week',
       'op_carrier_fl_num', 'origin', 'dest', 'arr_delay', 'delay_indicator',
       'distance', 'carrier'],
      dtype='object')

In [4]:
#Dropping a technical column (a leftover index from the csv export)
data.drop(columns=['Unnamed: 0'], inplace=True)

#Checking DataFrame
data.head()

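| | month | day_of_month | day_of_week | op_carrier_fl_num | origin | dest | arr_delay | delay_indicator | distance | carrier |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 3 | 3 | 4195 | LEX | ORD | -15.0 | 0.0 | 323.0 | American |
| 1 | 11 | 5 | 1 | 6002 | DFW | CHA | -7.0 | 0.0 | 695.0 | American |
| 2 | 11 | 4 | 7 | 1937 | ORD | AUS | -24.0 | 0.0 | 977.0 | United |
| 3 | 10 | 13 | 6 | 948 | SFO | DEN | -3.0 | 0.0 | 967.0 | United |
| 4 | 11 | 9 | 5 | 1026 | HOU | ABQ | -8.0 | 0.0 | 759.0 | SouthWest |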


In [5]:
#Keeping only delayed flights
data = data[data['delay_indicator'] == 1]

#Keeping only delays of at most 120
data = data[data['arr_delay'] <= 120]

## Target Variable and Features Matrix ##

To fit a regression model, we use **ARR_DELAY** as our continuous target variable. We also need to drop DELAY_INDICATOR from our features: it was engineered from ARR_DELAY, and after the filter above it equals 1 for every remaining row, so it carries no information.

In [6]:
#Target variable
y = data['arr_delay']

#Features matrix
X = data.drop(columns=['delay_indicator','arr_delay'])

In [7]:
#Baseline prediction: the mean arrival delay
y.mean()

41.329520619577565
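
For a regression target, the mean is the prediction of the simplest possible model, and the error of that constant prediction is the number later models should beat. The sketch below illustrates this on made-up delays; `y_toy` and its values are assumptions for illustration, not rows from the dataset.

```python
import numpy as np

# Toy delays standing in for the real `arr_delay` column (illustrative values only).
y_toy = np.array([15.0, 30.0, 45.0, 60.0, 100.0])

# A constant model that always predicts the mean delay.
baseline_pred = np.full_like(y_toy, y_toy.mean())

# Errors of the constant model: a useful model should score below these.
baseline_mse = np.mean((y_toy - baseline_pred) ** 2)
baseline_rmse = np.sqrt(baseline_mse)

print(f"mean = {y_toy.mean():.2f}, MSE = {baseline_mse:.2f}, RMSE = {baseline_rmse:.2f}")
```

The RMSE of the mean predictor equals the population standard deviation of the target, so on the real data `y.std(ddof=0)` would give the same baseline in one call.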

In [8]:
#Checking our feature matrix data types
X.dtypes

month                  int64
day_of_month           int64
day_of_week            int64
op_carrier_fl_num      int64
origin                object
dest                  object
distance             float64
carrier               object
dtype: object

In [9]:
#Getting dummies for our text features ORIGIN, DEST and CARRIER
X = pd.get_dummies(X,columns = ['origin','dest','carrier'],drop_first=True)

#Checking the shape of our feature matrix
X.shape

(287809, 711)
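
To make the effect of `drop_first=True` concrete, here is a small sketch on a hypothetical four-row frame (`toy` is an assumption for illustration). Each text column becomes one 0/1 column per category, minus the first category in sorted order, which serves as the reference level.

```python
import pandas as pd

# Hypothetical miniature of the feature matrix: one numeric and one text column.
toy = pd.DataFrame({
    "distance": [323.0, 695.0, 977.0, 967.0],
    "carrier": ["American", "American", "United", "SouthWest"],
})

# Three carriers become two dummy columns; "American" is dropped as the reference level.
encoded = pd.get_dummies(toy, columns=["carrier"], drop_first=True)

print(encoded.columns.tolist())
```

The same mechanism explains the width of the real matrix: every distinct origin, destination and carrier except one per column adds a feature.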

In [10]:
#Training and testing sets split with random_state=1519 for reproducibility of results
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1519)

In [11]:
#As an additional step, we need to scale our feature matrices before modeling
#The scaler is fit on the training set only and then applied to the testing set
ss = StandardScaler()
X_train = pd.DataFrame(ss.fit_transform(X_train),columns = X.columns)
X_test = pd.DataFrame(ss.transform(X_test),columns = X.columns)
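
The reason for calling `fit_transform` on the training set and only `transform` on the testing set can be seen in a minimal sketch. It uses plain NumPy to expose the arithmetic (subtract the training mean, divide by the training population standard deviation); the arrays are made-up values, not data from this notebook.

```python
import numpy as np

# Made-up single-feature matrices standing in for X_train and X_test.
X_train_toy = np.array([[100.0], [200.0], [300.0]])
X_test_toy = np.array([[400.0]])

# "Fit": the statistics come from the training rows only.
train_mean = X_train_toy.mean(axis=0)
train_std = X_train_toy.std(axis=0)  # population std (ddof=0)

# "Transform": both sets are scaled with the training statistics.
X_train_scaled = (X_train_toy - train_mean) / train_std
X_test_scaled = (X_test_toy - train_mean) / train_std

print(X_train_scaled.ravel(), X_test_scaled.ravel())
```

The scaled training column has mean 0 and unit variance, while the test value lands wherever the training statistics put it. Fitting on the test rows as well would leak information about them into training.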

## FFNN ##

In [12]:
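#Feedforward network: two hidden ReLU layers and one linear output unit for regression
model = Sequential()

model.add(Dense(512,
                input_shape=(X_train.shape[1],),
                activation='relu'))

# model.add(Dropout(0.25))

model.add(Dense(256,
                activation='relu'))

# model.add(Dropout(0.25))

#Linear output unit predicting the delay
model.add(Dense(1,
                activation=None))

#Stop when val_loss has not improved by more than 10 for 5 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', min_delta=10, patience=5, verbose=1, mode='auto')

model.compile(loss='mse',
              optimizer='adam',
              metrics=['mse'])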

In [13]:
history = model.fit(
    X_train,
    y_train,
    validation_data=(X_test, y_test),
    epochs=10,
    batch_size=512,
    verbose=2,
    callbacks=[early_stop]
)

Train on 215856 samples, validate on 71953 samples
Epoch 1/10
215856/215856 - 19s - loss: 700.5732 - mse: 700.5735 - val_loss: 652.8702 - val_mse: 652.8704
Epoch 2/10
215856/215856 - 17s - loss: 646.1203 - mse: 646.1203 - val_loss: 652.9156 - val_mse: 652.9154
Epoch 3/10
215856/215856 - 17s - loss: 643.9416 - mse: 643.9416 - val_loss: 652.3562 - val_mse: 652.3560
Epoch 4/10
215856/215856 - 17s - loss: 641.3621 - mse: 641.3619 - val_loss: 649.8003 - val_mse: 649.8001
Epoch 5/10
215856/215856 - 14s - loss: 639.3518 - mse: 639.3517 - val_loss: 651.1601 - val_mse: 651.1600
Epoch 6/10
215856/215856 - 14s - loss: 637.4871 - mse: 637.4871 - val_loss: 649.3446 - val_mse: 649.3446
Epoch 00006: early stopping
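
Training stopped at epoch 6 even though `val_loss` was still drifting down, and the `min_delta=10, patience=5` settings account for that. The sketch below is a simplified re-implementation of the patience rule, not Keras's own code; `early_stop_epoch` is a hypothetical helper, and the losses are copied from the log above.

```python
def early_stop_epoch(val_losses, min_delta, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            # Counts as an improvement only if it beats the best by more than min_delta.
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# val_loss values from the training log above.
val_losses = [652.8702, 652.9156, 652.3562, 649.8003, 651.1601, 649.3446]

stop_strict = early_stop_epoch(val_losses, min_delta=10, patience=5)
stop_loose = early_stop_epoch(val_losses, min_delta=0, patience=5)

print(stop_strict, stop_loose)
```

With `min_delta=10`, no epoch after the first improves on 652.87 by more than 10, so the five-epoch patience runs out at epoch 6. With `min_delta=0` the same losses would not trigger a stop within these six epochs, which suggests the threshold is large relative to the per-epoch gains seen here.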


In [14]:
#Approximate RMSE: square root of an MSE of roughly 640, in the same units as arr_delay
np.sqrt(640)

25.298221281347036
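
The value 640 is a round approximation of the losses in the log. As a slightly more precise sketch, the validation MSE of each epoch can be converted to an RMSE directly; the numbers below are the `val_mse` values copied from the training output.

```python
import numpy as np

# val_mse per epoch, copied from the training log.
val_mse = [652.8704, 652.9154, 652.3560, 649.8001, 651.1600, 649.3446]

# RMSE is on the same scale as the target, which makes it easier to interpret.
val_rmse = np.sqrt(val_mse)

best_epoch = int(np.argmin(val_rmse)) + 1
print(np.round(val_rmse, 2), best_epoch)
```

The validation RMSE stays close to 25.5 across all six epochs, so the network's typical error on unseen flights barely changes during training.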