# Table of Contents #

- [Importing Necessary Libraries](#Importing-Necessary-Libraries)
- [Importing Data and Initial Checks](#Importing-Data-and-Initial-Checks)
- [Target Variable and Features Matrix](#Target-Variable-and-Features-Matrix)
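- [Decision Tree Classifier](#Decision-Tree-Classifier)
- [Bagging Classifier](#Bagging-Classifier)
- [Random Forest Classifier](#Random-Forest-Classifier)
- [AdaBoost Classifier](#AdaBoost-Classifier)
- [XGBoost](#XGBoost)
- [Feature Importances](#Feature-Importances)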

## Importing Necessary Libraries ##

In [29]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.pipeline                import Pipeline
from sklearn.model_selection         import train_test_split, GridSearchCV
from sklearn.linear_model            import LogisticRegression
from sklearn.ensemble                import (BaggingClassifier, RandomForestClassifier,
                                             ExtraTreesClassifier, AdaBoostClassifier,
                                             GradientBoostingClassifier)

from sklearn.tree                    import DecisionTreeClassifier
from sklearn.svm                     import SVC

import xgboost as xgb

import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

## Importing Data and Initial Checks ##

In [2]:
#Loading data from a csv file
data = pd.read_csv('~/ga/projects/capstone_data/data/data_ready.csv')

#Checking size
data.shape

(650556, 11)

In [3]:
#Checking columns
data.columns

Index(['Unnamed: 0', 'month', 'day_of_month', 'day_of_week',
       'op_carrier_fl_num', 'origin', 'dest', 'arr_delay', 'distance',
       'carrier', 'delay_indicator'],
      dtype='object')

In [4]:
#Dropping the leftover index column ('Unnamed: 0')
data.drop(columns=['Unnamed: 0'], inplace=True)

#Checking DataFrame
data.head()

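|   | month | day_of_month | day_of_week | op_carrier_fl_num | origin | dest | arr_delay | distance | carrier | delay_indicator |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 29 | 4 | 3539 | GEG | SLC | -13.0 | 546.0 | Delta | 0 |
| 1 | 11 | 3 | 6 | 3614 | TPA | RDU | -14.0 | 587.0 | Delta | 0 |
| 2 | 12 | 19 | 3 | 3013 | LAX | PDX | -3.0 | 834.0 | SouthWest | 0 |
| 3 | 11 | 1 | 4 | 3557 | IAH | CVG | -13.0 | 871.0 | American | 0 |
| 4 | 12 | 11 | 2 | 4903 | IAH | DTW | -19.0 | 1075.0 | Delta | 0 |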


## Target Variable and Features Matrix ##

To fit a Decision Tree Classifier, we use **DELAY_INDICATOR** as the target variable. We also drop ARR_DELAY from the features: the target was engineered directly from it, so keeping it would leak the answer into the model.

In [5]:
#Target variable
y = data['delay_indicator']

#Features matrix
X = data.drop(columns=['delay_indicator','arr_delay'])

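#Baseline model accuracy: share of the positive class, which equals the majority-class share only when the classes are balanced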
y.mean()

0.5
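For a 0/1 target, `y.mean()` is the share of the positive class, while the usual baseline accuracy is the share of the majority class. The two coincide only when the classes are balanced, as they are here. A minimal sketch on a made-up, deliberately imbalanced series (`y_toy` and its values are illustrative assumptions, not the flight data):

```python
import pandas as pd

# Hypothetical, deliberately imbalanced 0/1 target (not the flight data)
y_toy = pd.Series([0, 0, 0, 1])

# Share of the positive class
positive_share = y_toy.mean()

# Accuracy of always predicting the most frequent class
baseline_accuracy = y_toy.value_counts(normalize=True).max()

print(positive_share, baseline_accuracy)
```

On an imbalanced target the majority-class share is the safer number to beat.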

In [6]:
#Checking our feature matrix data types
X.dtypes

month                  int64
day_of_month           int64
day_of_week            int64
op_carrier_fl_num      int64
origin                object
dest                  object
distance             float64
carrier               object
dtype: object

In [7]:
#Getting dummies for our text features ORIGIN, DEST and CARRIER
X = pd.get_dummies(X,columns = ['origin','dest','carrier'],drop_first=True)

#Checking the shape of our feature matrix
X.shape

(650556, 711)
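The 711 columns are the 5 numeric features plus 706 indicator columns. As a rough illustration of what `drop_first=True` does, here is a sketch on a made-up three-row frame (`toy` and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical mini feature matrix (values are made up)
toy = pd.DataFrame({
    "distance": [546.0, 587.0, 834.0],
    "carrier": ["Delta", "American", "United"],
})

# One indicator column per category, minus the first level in sorted order
toy_dummies = pd.get_dummies(toy, columns=["carrier"], drop_first=True)

print(list(toy_dummies.columns))
```

The dropped level ("American" in this toy case) becomes the implicit reference category: a row with all carrier indicators off belongs to it.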

In [8]:
#Training and testing sets split with random_state=1519 for reproducibility of results
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1519)
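With no `test_size` given, `train_test_split` holds out 25% of the rows. The sketch below, on made-up balanced data (`X_toy` and `y_toy` are assumptions), shows the default split size, the effect of a fixed `random_state`, and the optional `stratify` argument, which the split above does not use:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 200 rows, one feature
X_toy = np.arange(200).reshape(-1, 1)
y_toy = np.array([0, 1] * 100)

# Default test_size is 0.25; stratify keeps the class ratio in both parts
split_a = train_test_split(X_toy, y_toy, stratify=y_toy, random_state=1519)
split_b = train_test_split(X_toy, y_toy, stratify=y_toy, random_state=1519)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape, y_te.mean())

# The same random_state reproduces the same held-out rows
same_split = np.array_equal(split_a[1], split_b[1])
print(same_split)
```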

## Decision Tree Classifier ##

We fit a Decision Tree Classifier first: it is the simplest model in the tree-based family and still reasonably effective.

In [10]:
#Instantiating the model
tree = DecisionTreeClassifier(random_state=1519)

#Fitting the model
result_tree_cvec = tree.fit(X_train, y_train)

#Accuracy score training set
round(result_tree_cvec.score(X_train, y_train),4)

1.0

In [11]:
#Accuracy score testing set
round(result_tree_cvec.score(X_test, y_test),4)

0.6103394634743204
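A training accuracy of 1.0 next to a testing accuracy of about 0.61 points to overfitting: an unrestricted tree keeps splitting until its leaves are pure. One common remedy is to cap the depth. The sketch below uses synthetic data from `make_classification` as a stand-in for the flight features, and `max_depth=5` is an illustrative value rather than a tuned one:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data standing in for the flight features (an assumption)
X_syn, y_syn = make_classification(n_samples=2000, n_features=20,
                                   n_informative=5, flip_y=0.2,
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# Unrestricted tree: grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Depth-limited tree: max_depth=5 is illustrative, not tuned
shallow_tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

# Print training and testing accuracy for both trees
for name, model in [("full", full_tree), ("max_depth=5", shallow_tree)]:
    print(name,
          round(model.score(X_tr, y_tr), 4),
          round(model.score(X_te, y_te), 4))
```

Comparing the gap between the two scores for each tree gives a quick read on how much the depth limit reduces memorization.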

## Bagging Classifier ##

Next, we try to improve on the single Decision Tree with an ensemble model, a Bagging Classifier.

In [13]:
#Instantiating Bagging Classifier
bag = BaggingClassifier(random_state=1519)

#Fitting the model
results_bag = bag.fit(X_train, y_train)

#Accuracy score training set
round(results_bag.score(X_train, y_train),4)

0.9760799480239467

In [14]:
#Accuracy score testing set
round(results_bag.score(X_test, y_test),4)

0.6428040015002552

## Random Forest Classifier ##

Another attempt to improve on the Decision Tree's performance, this time with a Random Forest.

In [15]:
#Instantiating Random Forest Classifier
forest = RandomForestClassifier(random_state=1519)

#Fitting the model
results_forest = forest.fit(X_train, y_train)

#Accuracy score training set
round(results_forest.score(X_train, y_train),4)

0.9789349418036264

In [16]:
#Accuracy score testing set
round(results_forest.score(X_test, y_test),4)

0.6247947909173076

In [17]:
#Initializing a pipeline for grid-searching the best Random Forest parameters
pipe = Pipeline(steps=[('model', RandomForestClassifier(random_state=1519))])

#Hyperparameters
#Note: np.linspace returns floats (2.0, 3.6, ..., 10.0); recent scikit-learn versions expect an integer max_depth
hyperparams = {'model__max_depth': np.linspace(2, 10, 6),
               'model__n_estimators': [5, 10],
               'model__min_samples_split': [2, 3, 4]}

#Initializing GridSearch with 3-fold cross-validation
gs = GridSearchCV(pipe,
                  hyperparams,
                  n_jobs=-1,
                  verbose=1,
                  cv=3)

#Fitting GridSearch and saving results
results = gs.fit(X_train,y_train)

Fitting 3 folds for each of 36 candidates, totalling 108 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:  9.6min finished
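The 36 candidates and 108 fits in the log follow directly from the grid: 6 depths × 2 forest sizes × 3 split thresholds, each fitted on 3 folds. The sketch below reproduces that count with `ParameterGrid` and also casts `max_depth` to integers, on the assumption that a recent scikit-learn release is in use, where a float `max_depth` is rejected. The name `hyperparams_int` is hypothetical:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as above, with max_depth truncated to integers
hyperparams_int = {
    "model__max_depth": np.linspace(2, 10, 6).astype(int).tolist(),
    "model__n_estimators": [5, 10],
    "model__min_samples_split": [2, 3, 4],
}

# Number of parameter combinations and total fits with cv=3
n_candidates = len(ParameterGrid(hyperparams_int))
n_fits = n_candidates * 3

print(hyperparams_int["model__max_depth"], n_candidates, n_fits)
```

Note that the cast changes the intermediate depths (3.6 becomes 3, for example), so results from an integer grid would not be strictly comparable with the run recorded here.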


In [18]:
#Best gridsearched Random Forest mean cross-validated accuracy (computed on folds of the training set, not on the testing set)
round(results.best_score_,4)

0.5859

In [19]:
#Best gridsearched Random Forest accuracy score on training set
round(results.score(X_train,y_train),4)

0.5911
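`best_score_` is the mean cross-validated score of the best candidate, so it is computed entirely within the training data. A held-out estimate needs a separate call to `score` on the testing set. A minimal sketch on synthetic data (the data, the small grid, and the name `gs_toy` are all assumptions made for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption, not the flight data)
X_syn, y_syn = make_classification(n_samples=600, n_features=10,
                                   random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=0)

# Tiny illustrative grid
gs_toy = GridSearchCV(RandomForestClassifier(n_estimators=10, random_state=0),
                      {"max_depth": [2, 4]},
                      cv=3)
gs_toy.fit(X_tr, y_tr)

cv_score = gs_toy.best_score_            # mean accuracy over the 3 CV folds
train_score = gs_toy.score(X_tr, y_tr)   # refit best model, training set
test_score = gs_toy.score(X_te, y_te)    # refit best model, held-out set

print(round(cv_score, 4), round(train_score, 4), round(test_score, 4))
```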

In [20]:
#Best gridsearched model's parameters
results.best_params_

{'model__max_depth': 10.0,
 'model__min_samples_split': 3,
 'model__n_estimators': 10}

## AdaBoost Classifier ##

The grid-searched Random Forest no longer overfits (training accuracy of 0.5911 against a cross-validated 0.5859), but its accuracy is still unsatisfactory, so let's try a boosting technique.

In [26]:
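#Instantiating an AdaBoost Classifier with the grid-searched Random Forest (max_depth=10, min_samples_split=3, n_estimators=10) as base estimator
#Note: newer scikit-learn releases rename the base_estimator argument to estimator
ada = AdaBoostClassifier(base_estimator=RandomForestClassifier(max_depth=10, min_samples_split=3, n_estimators=10),
                         random_state=1519)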

#Fitting the model
results_ada = ada.fit(X_train, y_train)

#Accuracy score training set
round(results_ada.score(X_train, y_train),4)

0.6825

In [34]:
#Accuracy score testing set
round(results_ada.score(X_test, y_test),4)

0.633
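AdaBoost is more commonly paired with a weak learner such as a one-level decision tree (a stump) than with a full Random Forest. The sketch below shows that variant on synthetic data. It is an alternative for comparison, not the configuration used above; the base learner is passed positionally so the snippet does not depend on whether the keyword is `base_estimator` or `estimator` in the installed scikit-learn version:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not the flight data)
X_syn, y_syn = make_classification(n_samples=500, n_features=8,
                                   random_state=0)

# Decision stump as base learner, passed positionally
ada_toy = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=10,
                             random_state=0)
ada_toy.fit(X_syn, y_syn)

# Training accuracy and one importance value per feature
toy_score = ada_toy.score(X_syn, y_syn)
print(round(toy_score, 4))
print(ada_toy.feature_importances_.shape)
```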

## XGBoost ##

In [32]:
model = xgb.XGBClassifier(max_depth=10,
                          n_estimators=10,
                          n_jobs=-1,
                          random_state=1519,
                          verbosity=1)
model.fit(X_train, y_train)
model.score(X_train, y_train)

[15:57:36] INFO: src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1540 extra nodes, 0 pruned nodes, max_depth=10
[15:58:17] INFO: src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1230 extra nodes, 0 pruned nodes, max_depth=10
[15:58:56] INFO: src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1434 extra nodes, 0 pruned nodes, max_depth=10
[15:59:34] INFO: src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1346 extra nodes, 0 pruned nodes, max_depth=10
[16:00:13] INFO: src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1294 extra nodes, 0 pruned nodes, max_depth=10
[16:01:00] INFO: src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1186 extra nodes, 0 pruned nodes, max_depth=10
[16:01:38] INFO: src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1206 extra nodes, 0 pruned nodes, max_depth=10
[16:02:18] INFO: src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 1230 extra nodes, 0 pruned nodes, max_depth=10
[16:02:57] INFO: src/tree/update

0.6413795789037889

In [33]:
model.score(X_test,y_test)

0.6322345808815844

## Feature Importances ##

In [38]:
feat_imp = pd.DataFrame(results_ada.feature_importances_)

In [39]:
feat_imp.columns = ['coef']

In [46]:
feat_imp[['coef']].sort_values('coef', ascending=False ).head(10)

|   | coef |
|---|---|
| 1 | 0.26309 |
| 2 | 0.155008 |
| 3 | 0.138084 |
| 0 | 0.09159 |
| 4 | 0.029548 |
| 704 | 0.01119 |
| 703 | 0.008547 |
| 710 | 0.006976 |
| 708 | 0.005911 |
| 464 | 0.005724 |


In [79]:
imp_features = feat_imp[feat_imp['coef']> .001]['coef'].sort_values(ascending=False).index
imp_features

Int64Index([  1,   2,   3,   0,   4, 704, 703, 710, 708, 464, 705, 245, 707,
            653, 444,  94, 373, 593,  95,  24, 115, 544, 443, 195, 651, 305,
            101, 231, 709, 314, 450,  47, 303, 423, 580, 702, 706, 662,  74,
            511, 396, 607, 162, 535, 210, 218,  93, 533, 186, 442, 258, 293,
            184,  59, 259, 208, 606, 641,  90, 472, 557, 525, 151, 509, 408,
            567, 207, 502, 602, 254, 176, 123, 439],
           dtype='int64')

In [80]:
X[X.columns[imp_features]].columns

Index(['day_of_month', 'day_of_week', 'op_carrier_fl_num', 'month', 'distance',
       'carrier_Delta', 'carrier_American', 'carrier_United',
       'carrier_SouthWest', 'dest_EWR', 'carrier_Frontier Airlines',
       'origin_ORD', 'carrier_JetBlue', 'dest_SFO', 'dest_DFW', 'origin_DEN',
       'dest_ATL', 'dest_ORD', 'origin_DFW', 'origin_ATL', 'origin_EWR',
       'dest_LGA', 'dest_DEN', 'origin_LGA', 'dest_SEA', 'origin_SFO',
       'origin_DTW', 'origin_MSP', 'carrier_Spirit Airlines', 'origin_SLC',
       'dest_DTW', 'origin_BOS', 'origin_SEA', 'dest_CLT', 'dest_MSP',
       'carrier_Allegiant Air', 'carrier_Hawaiian Airlines', 'dest_SLC',
       'origin_CLT', 'dest_IAH', 'dest_BOS', 'dest_PHX', 'origin_IAH',
       'dest_LAX', 'origin_MDW', 'origin_MIA', 'origin_DCA', 'dest_LAS',
       'origin_LAX', 'dest_DCA', 'origin_PHL', 'origin_SAN', 'origin_LAS',
       'origin_BWI', 'origin_PHX', 'origin_MCO', 'dest_PHL', 'dest_SAN',
       'origin_DAL', 'dest_FLL', 'dest_MCO', 'dest_JFK'

In [81]:
X.columns

Index(['month', 'day_of_month', 'day_of_week', 'op_carrier_fl_num', 'distance',
       'origin_ABI', 'origin_ABQ', 'origin_ABR', 'origin_ABY', 'origin_ACK',
       ...
       'dest_YUM', 'carrier_Allegiant Air', 'carrier_American',
       'carrier_Delta', 'carrier_Frontier Airlines',
       'carrier_Hawaiian Airlines', 'carrier_JetBlue', 'carrier_SouthWest',
       'carrier_Spirit Airlines', 'carrier_United'],
      dtype='object', length=711)

In [86]:
dict_feat = {}
#Looking up each importance by its original feature index, not by its rank in the sorted list
for idx in imp_features:
    dict_feat[X.columns[idx]] = feat_imp['coef'][idx]

In [96]:
dict_feat.keys()

dict_keys(['day_of_month', 'day_of_week', 'op_carrier_fl_num', 'month', 'distance', 'carrier_Delta', 'carrier_American', 'carrier_United', 'carrier_SouthWest', 'dest_EWR', 'carrier_Frontier Airlines', 'origin_ORD', 'carrier_JetBlue', 'dest_SFO', 'dest_DFW', 'origin_DEN', 'dest_ATL', 'dest_ORD', 'origin_DFW', 'origin_ATL', 'origin_EWR', 'dest_LGA', 'dest_DEN', 'origin_LGA', 'dest_SEA', 'origin_SFO', 'origin_DTW', 'origin_MSP', 'carrier_Spirit Airlines', 'origin_SLC', 'dest_DTW', 'origin_BOS', 'origin_SEA', 'dest_CLT', 'dest_MSP', 'carrier_Allegiant Air', 'carrier_Hawaiian Airlines', 'dest_SLC', 'origin_CLT', 'dest_IAH', 'dest_BOS', 'dest_PHX', 'origin_IAH', 'dest_LAX', 'origin_MDW', 'origin_MIA', 'origin_DCA', 'dest_LAS', 'origin_LAX', 'dest_DCA', 'origin_PHL', 'origin_SAN', 'origin_LAS', 'origin_BWI', 'origin_PHX', 'origin_MCO', 'dest_PHL', 'dest_SAN', 'origin_DAL', 'dest_FLL', 'dest_MCO', 'dest_JFK', 'origin_HNL', 'dest_IAD', 'dest_BWI', 'dest_MIA', 'origin_MCI', 'dest_HOU', 'dest_PDX

In [97]:
df = pd.DataFrame.from_dict(data = dict_feat, orient='index', columns=['value'])

In [101]:
df.shape

(73, 1)
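The same name-to-importance table can be built without positional bookkeeping by indexing the importances with the column names from the start. A minimal sketch, using made-up importances and a handful of column names (`importances`, `columns`, and their values are assumptions, not the fitted model's output):

```python
import numpy as np
import pandas as pd

# Hypothetical importances and column names (not the fitted model's values)
importances = np.array([0.10, 0.60, 0.0005, 0.2995])
columns = pd.Index(["month", "day_of_month", "origin_ABI", "distance"])

# Index by feature name so values and labels cannot drift apart
imp_series = pd.Series(importances, index=columns, name="value")

# Keep features above the 0.001 threshold, most important first
top = imp_series[imp_series > 0.001].sort_values(ascending=False)

print(top)
```

With the real model, the analogous call would pass `results_ada.feature_importances_` and `X.columns`, which keeps each value attached to its own feature through any later sorting or filtering.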