# Table of Contents #

- [Importing Necessary Libraries](#Importing-Necessary-Libraries)
- [Importing Data and Initial Checks](#Importing-Data-and-Initial-Checks)
- [Target Variable and Features Matrix](#Target-Variable-and-Features-Matrix)
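- [Decision Tree Classifier](#Decision-Tree-Classifier)
- [Bagging Classifier](#Bagging-Classifier)
- [Random Forest Classifier](#Random-Forest-Classifier)
- [AdaBoost Classifier](#AdaBoost-Classifier)
- [XGBoost](#XGBoost)
- [Feature Importances](#Feature-Importances)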

## Importing Necessary Libraries ##

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.pipeline                import Pipeline
from sklearn.model_selection         import train_test_split, GridSearchCV
from sklearn.linear_model            import LogisticRegression
from sklearn.ensemble                import BaggingClassifier,RandomForestClassifier,ExtraTreesClassifier,AdaBoostClassifier,GradientBoostingClassifier

from sklearn.tree                    import DecisionTreeClassifier
from sklearn.svm                     import SVC

import xgboost as xgb

import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

## Importing Data and Initial Checks ##

In [2]:
#Loading data from a csv file
data = pd.read_csv('~/ga/projects/capstone_data/data/~data_ready.csv')

#Checking size
data.shape

(644232, 12)

In [3]:
#Checking columns
data.columns

Index(['Unnamed: 0', 'month', 'day_of_month', 'day_of_week',
       'op_carrier_fl_num', 'origin', 'dest', 'arr_delay', 'delay_indicator',
       'distance', 'carrier', 'dep_hour'],
      dtype='object')

In [4]:
#Dropping a technical column
data.drop(columns=['Unnamed: 0'], inplace=True)

#Checking DataFrame
data.head()

,month,day_of_month,day_of_week,op_carrier_fl_num,origin,dest,arr_delay,delay_indicator,distance,carrier,dep_hour
0,10,3,3,5228,ONT,SFO,-12.0,0.0,363.0,Delta,11
1,11,7,3,1443,BNA,DAL,-7.0,0.0,623.0,SouthWest,15
2,12,14,5,4072,LGA,CLE,-12.0,0.0,419.0,United,15
3,12,9,7,331,JFK,LAX,-17.0,0.0,2475.0,American,11
4,12,17,1,3539,SLC,GEG,-19.0,0.0,546.0,Delta,15


## Target Variable and Features Matrix ##

In order to fit a Decision Tree Classifier we need to use **DELAY_INDICATOR** as our target variable. We also need to drop ARR_DELAY from our features: the target variable was engineered from it, so keeping it would leak the answer into the model.

In [5]:
#Target variable
y = data['delay_indicator']

#Features matrix
X = data.drop(columns=['delay_indicator','arr_delay'])

#Baseline model accuracy (share of flights with delay_indicator = 1; 0.5 means the classes are balanced)
y.mean()

0.5
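The mean of a binary target is the share of the positive class, so it equals the baseline accuracy only when that class is the majority or, as here, the classes are balanced. A minimal sketch on a made-up series (`y_toy` and its values are illustrative, not taken from the flight data) shows one way to read the majority-class baseline directly:

```python
import pandas as pd

# Hypothetical binary target, for illustration only (not the flight data)
y_toy = pd.Series([0, 0, 0, 1, 1, 0, 1, 0])

# Share of each class in the target
class_shares = y_toy.value_counts(normalize=True)

# Baseline accuracy: always predict the most frequent class
baseline = class_shares.max()

# The mean only gives the share of the positive class
positive_share = y_toy.mean()
```

On an imbalanced target the two numbers differ, and `baseline` is the one a model has to beat.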

In [6]:
#Checking our feature matrix data types
X.dtypes

month                  int64
day_of_month           int64
day_of_week            int64
op_carrier_fl_num      int64
origin                object
dest                  object
distance             float64
carrier               object
dep_hour               int64
dtype: object

In [7]:
#Getting dummies for our text features ORIGIN, DEST and CARRIER
X = pd.get_dummies(X,columns = ['origin','dest','carrier'],drop_first=True)

#Checking the shape of our feature matrix
X.shape

(644232, 713)
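With `drop_first=True`, a text column with k distinct values becomes k-1 indicator columns, which is how nine features grow to 713. A small sketch on a made-up frame (`toy` and its rows are illustrative only) shows the effect:

```python
import pandas as pd

# Hypothetical mini-frame, for illustration only
toy = pd.DataFrame({
    'distance': [363.0, 623.0, 419.0, 2475.0],
    'carrier': ['Delta', 'SouthWest', 'United', 'Delta'],
})

# One indicator column per carrier, minus the first level in sorted order ('Delta')
toy_dummies = pd.get_dummies(toy, columns=['carrier'], drop_first=True)

dummy_columns = list(toy_dummies.columns)
```

A row with zeros in every `carrier_` column then stands for the dropped level.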

In [8]:
#Training and testing sets split with random_state=1519 for reproducibility of results
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1519)
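By default `train_test_split` holds out 25% of the rows for testing. The sketch below uses synthetic arrays (`X_toy` and `y_toy` are made up) to show the resulting sizes. The `stratify` argument is an addition that the original split does not use; it keeps the class shares the same in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 2 features, balanced binary target
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0, 1] * 50)

# Default test_size is 0.25; stratify is an assumption, not part of the original call
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=1519, stratify=y_toy
)

train_rows, test_rows = X_tr.shape[0], X_te.shape[0]
```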

## Decision Tree Classifier ##

A Decision Tree Classifier is fitted first: it is the simplest model of the tree-based family, still a reasonably effective one, and a reference point for the ensembles that follow.

In [9]:
#Instantiating the model
tree = DecisionTreeClassifier(random_state=1519)

#Fitting the model
result_tree_cvec = tree.fit(X_train, y_train)

#Accuracy score training set
round(result_tree_cvec.score(X_train, y_train),4)

1.0

In [10]:
#Accuracy score testing set
round(result_tree_cvec.score(X_test, y_test),4)

0.6214
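A training accuracy of 1.0 against 0.6214 on the testing set means the unrestricted tree memorizes the training data. The sketch below reproduces that pattern on synthetic data (all names and settings are assumptions) and fits a depth-limited tree next to it for comparison. Whether a depth limit helps on the flight data is not tested here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification problem standing in for the flight data
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    flip_y=0.2, random_state=1519
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=1519)

# Compare an unrestricted tree (max_depth=None) with a depth-limited one
scores = {}
for depth in (None, 5):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=1519)
    clf.fit(X_tr, y_tr)
    # Store (training accuracy, testing accuracy) for each depth setting
    scores[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
```

Comparing the two tuples in `scores` shows how much of the training accuracy survives on unseen rows.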

## Bagging Classifier ##

An attempt to improve Decision Tree Classifier's performance using an ensemble model.

In [11]:
#Instantiating Bagging Classifier
bag = BaggingClassifier(random_state=1519)

#Fitting the model
results_bag = bag.fit(X_train, y_train)

#Accuracy score training set
round(results_bag.score(X_train, y_train),4)

0.9796

In [12]:
#Accuracy score testing set
round(results_bag.score(X_test, y_test),4)

0.6579

## Random Forest Classifier ##

An attempt to further improve Decision Tree Classifier's performance.

In [13]:
#Instantiating Random Forest Classifier
forest = RandomForestClassifier(random_state=1519)

#Fitting the model
results_forest = forest.fit(X_train, y_train)

#Accuracy score training set
round(results_forest.score(X_train, y_train),4)

0.9831

In [14]:
#Accuracy score testing set
round(results_forest.score(X_test, y_test),4)

0.6371

In [15]:
#Initializing a pipeline for gridsearching best Random Forest parameters
pipe = Pipeline(steps = [('model', RandomForestClassifier(random_state=1519))])

#Hyperparameters
#Note: np.linspace(2,10,6) yields floats (2.0, 3.6, 5.2, 6.8, 8.4, 10.0);
#newer scikit-learn releases may require an integer max_depth
hyperparams = {'model__max_depth':np.linspace(2,10,6),
               'model__n_estimators':[5,10],
               'model__min_samples_split':[2,3,4]
              }

#Initializing GridSearch with 3-fold cross-validation
gs = GridSearchCV(pipe,
                  hyperparams,
                  n_jobs=-1,
                  verbose=1,
                  cv=3)

#Fitting GridSearch and saving results
results = gs.fit(X_train,y_train)

Fitting 3 folds for each of 36 candidates, totalling 108 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  3.5min
[Parallel(n_jobs=-1)]: Done 108 out of 108 | elapsed:  8.8min finished


In [16]:
#Best gridsearched Random Forest mean cross-validated accuracy score (3 folds of the training set)
round(results.best_score_,4)

0.5939

In [17]:
#Best gridsearched Random Forest accuracy score on training set
round(results.score(X_train,y_train),4)

0.5987

In [18]:
#Best gridsearched model's parameters
results.best_params_

{'model__max_depth': 10.0,
 'model__min_samples_split': 2,
 'model__n_estimators': 10}
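The grid above passes `np.linspace(2,10,6)` as tree depths, which produces non-integer values, and `best_score_` is a mean over the validation folds rather than a testing-set score. The sketch below runs on synthetic data with an integer grid (`int_depths`, `pipe_toy`, `grid_toy` and `gs_toy` are hypothetical names) to show both points. It is an illustration, not the search that produced the results above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# np.linspace returns evenly spaced floats, not integers
float_depths = np.linspace(2, 10, 6)   # 2.0, 3.6, 5.2, 6.8, 8.4, 10.0
int_depths = list(range(2, 11, 2))     # 2, 4, 6, 8, 10

# Synthetic stand-in for the flight data
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=1519)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=1519)

pipe_toy = Pipeline(steps=[('model', RandomForestClassifier(random_state=1519))])
grid_toy = {'model__max_depth': int_depths, 'model__n_estimators': [5, 10]}

# 5 depths x 2 forest sizes = 10 candidates, each scored with 3-fold cross-validation
gs_toy = GridSearchCV(pipe_toy, grid_toy, cv=3).fit(X_tr, y_tr)

cv_accuracy = gs_toy.best_score_          # mean accuracy over the 3 validation folds
test_accuracy = gs_toy.score(X_te, y_te)  # refitted best model on held-out rows
n_candidates = len(gs_toy.cv_results_['params'])
```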

## AdaBoost Classifier ##

The gridsearched Random Forest is now well fitted (training accuracy 0.5987 against a cross-validated 0.5939), but its accuracy is lower than the 0.6371 the default forest reached on the testing set and is still not satisfactory, so let's try a boosting technique.

In [19]:
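#Instantiating an AdaBoost Classifier with a Random Forest Classifier (max_depth=10, min_samples_split=3, n_estimators=10) as base estimator
#Note: newer scikit-learn releases name this argument estimator instead of base_estimator
ada = AdaBoostClassifier(base_estimator=RandomForestClassifier(max_depth=10, min_samples_split=3, n_estimators=10),
                         random_state=1519)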

#Fitting the model
results_ada = ada.fit(X_train, y_train)

#Accuracy score training set
round(results_ada.score(X_train, y_train),4)

0.6954

In [20]:
#Accuracy score testing set
round(results_ada.score(X_test, y_test),4)

0.647
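AdaBoost fits its base models one after another and reweights the training rows so that each new model focuses on the cases the previous ones misclassified. The sketch below runs on synthetic data and uses depth-1 trees as the base model, a common choice but not the random forest used above. The base model is passed positionally because its keyword name differs between scikit-learn releases. All names (`ada_toy`, `X_syn`, ...) are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the flight data
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=1519)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=1519)

# Base model passed positionally: the keyword is base_estimator in older
# scikit-learn releases and estimator in newer ones
ada_toy = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=10,
    random_state=1519,
)
ada_toy.fit(X_tr, y_tr)

train_acc = ada_toy.score(X_tr, y_tr)
test_acc = ada_toy.score(X_te, y_te)

# One importance value per input feature, aggregated over the boosted trees
n_importances = len(ada_toy.feature_importances_)
```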

## XGBoost ##

As another attempt to improve my classifier's performance, I would like to apply the Extreme Gradient Boosting (XGBoost) technique.

In [21]:
#Instantiating an XGBoost Classifier with max_depth=10 and n_estimators=10
model = xgb.XGBClassifier(max_depth=10,
                          n_estimators=10,
                          n_jobs=-1,
                          random_state=1519,
                          verbosity=1)

#Fitting the model
model.fit(X_train, y_train)

#Model's accuracy on training set
model.score(X_train, y_train)

0.6513740391660148

In [33]:
#Model's accuracy on testing set
model.score(X_test,y_test)

0.6415204460504911

In [35]:
#Initializing a pipeline for gridsearching best XGBoost parameters
pipe = Pipeline(steps = [('model', xgb.XGBClassifier(random_state=1519))])

#Hyperparameters
#Note: min_samples_split is a scikit-learn tree parameter, not an XGBoost one;
#it is kept here as originally run, but it likely only triples the number of fits
hyperparams = {'model__max_depth':[2,4,6,8,10],
               'model__n_estimators':[5,10],
               'model__min_samples_split':[2,3,4]
              }

#Initializing GridSearch with 3-fold cross-validation
gs = GridSearchCV(pipe,
                  hyperparams,
                  n_jobs=-1,
                  verbose=1,
                  cv=3)

#Fitting GridSearch and saving results
results_xg = gs.fit(X_train,y_train)

Fitting 3 folds for each of 30 candidates, totalling 90 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.


KeyboardInterrupt: 

In [None]:
#Best gridsearched XGBoost mean cross-validated accuracy score (3 folds of the training set)
round(results_xg.best_score_,4)

In [None]:
#Best gridsearched XGBoost accuracy score on training set
round(results_xg.score(X_train,y_train),4)

In [None]:
#Best gridsearched model's parameters
results_xg.best_params_

## Feature Importances ##

**This part is done on Boom Devahastin's advice in an attempt to improve the existing models' performance**

One useful property of tree-based models is that they expose feature importance coefficients. Let's take a closer look at one of our best performing models, the AdaBoost Classifier, and try to understand which features are most influential in predicting a flight delay.

In [39]:
#Saving corresponding coefficients to a DataFrame
feat_imp = pd.DataFrame(results_ada.feature_importances_)
feat_imp.columns = ['coef']

#Having a look at the top 10 values
feat_imp[['coef']].sort_values('coef', ascending=False ).head(10)

,coef
1,0.22603
2,0.134846
5,0.10954
3,0.09516
0,0.086396
4,0.02837
706,0.011326
710,0.010104
705,0.008049
712,0.005758


As we can see, the values decrease quickly. Let's collect the indices of the values of some significance (> .001) for further processing.

In [41]:
#Saving corresponding indices, sorted by descending coefficient
imp_features = feat_imp[feat_imp['coef']> .001]['coef'].sort_values(ascending=False).index

In [27]:
# X[X.columns[imp_features]].columns

# X.columns

Index(['day_of_month', 'day_of_week', 'dep_hour', 'op_carrier_fl_num', 'month',
       'distance', 'carrier_Delta', 'carrier_SouthWest', 'carrier_American',
       'carrier_United', 'carrier_Frontier Airlines', 'origin_ORD', 'dest_EWR',
       'dest_ATL', 'dest_DFW', 'carrier_JetBlue', 'dest_LGA', 'origin_DFW',
       'origin_LGA', 'origin_EWR', 'dest_SFO', 'origin_DEN', 'dest_ORD',
       'origin_ATL', 'origin_SFO', 'origin_BOS', 'dest_DEN', 'dest_IAH',
       'dest_SEA', 'dest_BOS', 'carrier_Spirit Airlines', 'origin_MSP',
       'dest_CLT', 'origin_SLC', 'origin_SEA', 'dest_SLC', 'origin_IAH',
       'carrier_Hawaiian Airlines', 'dest_LAX', 'dest_DTW', 'origin_DTW',
       'carrier_Allegiant Air', 'dest_MSP', 'origin_CLT', 'dest_LAS',
       'origin_PHL', 'origin_LAS', 'origin_PHX', 'origin_MCO', 'origin_LAX',
       'dest_PHX', 'origin_MDW', 'dest_PHL', 'origin_FLL', 'origin_MIA',
       'dest_FLL', 'dest_DCA', 'dest_BWI', 'origin_DCA', 'dest_MCO',
       'origin_IAD', 'origin_PDX'

In [44]:
#Let's create a dictionary of the original features and their corresponding coefficients
#(each coefficient is looked up by its feature index, not by its rank in the sorted list)
dict_feat={}
for idx in imp_features:
    dict_feat.update({X.columns[idx]:feat_imp['coef'][idx]})

#Saving dictionary into a DataFrame for future separate processing for visualization purposes
df = pd.DataFrame.from_dict(data = dict_feat, orient='index', columns=['value'])
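Pairing each importance with its column name is easier to check when the names are used as the index from the start. The sketch below uses made-up importances and names (`importances`, `columns` and `origin_XXX` are hypothetical) to show one way to do the thresholding and sorting on a labelled Series:

```python
import numpy as np
import pandas as pd

# Hypothetical importances and column names, for illustration only
importances = np.array([0.05, 0.60, 0.0005, 0.3495])
columns = pd.Index(['month', 'dep_hour', 'origin_XXX', 'distance'])

# Label every importance with its feature name up front
imp = pd.Series(importances, index=columns, name='value')

# Keep values above the threshold, largest first; names stay attached
top = imp[imp > .001].sort_values(ascending=False)

# Same layout as a one-column DataFrame indexed by feature name
top_df = top.to_frame()
```

Because the names travel with the values, filtering and sorting cannot misalign them.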

In [48]:
#Saving our DataFrame to a csv file
df.to_csv('~/ga/projects/cstone/AUX/features_importance.csv')