# Table of Contents #

- [Importing Necessary Libraries](#Importing-Necessary-Libraries)
- [Importing Data and Initial Checks](#Importing-Data-and-Initial-Checks)
- [Target Variable and Features Matrix](#Target-Variable-and-Features-Matrix)
- [Decision Tree Classifier](#Decision-Tree-Classifier)
- [Bagging Classifier](#Bagging-Classifier)
- [Random Forest Classifier](#Random-Forest-Classifier)
- [AdaBoost Classifier](#AdaBoost-Classifier)
- [XGBoost](#XGBoost)
- [Feature Importances](#Feature-Importances)

## Importing Necessary Libraries ##

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.pipeline                import Pipeline
from sklearn.model_selection         import train_test_split, GridSearchCV
from sklearn.linear_model            import LogisticRegression
from sklearn.ensemble                import (BaggingClassifier, RandomForestClassifier,
                                             ExtraTreesClassifier, AdaBoostClassifier,
                                             GradientBoostingClassifier)

from sklearn.tree                    import DecisionTreeClassifier
from sklearn.svm                     import SVC

import xgboost as xgb

import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

## Importing Data and Initial Checks ##

In [2]:
#Loading data from a csv file
data = pd.read_csv('~/ga/projects/capstone_data/data/data_ready.csv')

#Checking size
data.shape

(644232, 11)

In [3]:
#Checking columns
data.columns

Index(['Unnamed: 0', 'month', 'day_of_month', 'day_of_week',
       'op_carrier_fl_num', 'origin', 'dest', 'arr_delay', 'delay_indicator',
       'distance', 'carrier'],
      dtype='object')

In [4]:
#Dropping a technical column
data.drop(columns=['Unnamed: 0'], inplace=True)

#Checking DataFrame
data.head()

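| | month | day_of_month | day_of_week | op_carrier_fl_num | origin | dest | arr_delay | delay_indicator | distance | carrier |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 3 | 3 | 4195 | LEX | ORD | -15.0 | 0.0 | 323.0 | American |
| 1 | 11 | 5 | 1 | 6002 | DFW | CHA | -7.0 | 0.0 | 695.0 | American |
| 2 | 11 | 4 | 7 | 1937 | ORD | AUS | -24.0 | 0.0 | 977.0 | United |
| 3 | 10 | 13 | 6 | 948 | SFO | DEN | -3.0 | 0.0 | 967.0 | United |
| 4 | 11 | 9 | 5 | 1026 | HOU | ABQ | -8.0 | 0.0 | 759.0 | SouthWest |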


## Target Variable and Features Matrix ##

To fit a Decision Tree Classifier, we use **delay_indicator** as the target variable. We also drop **arr_delay** from the features: the target was engineered directly from it, so keeping it would leak the answer to the model.

In [5]:
#Target variable
y = data['delay_indicator']

#Features matrix
X = data.drop(columns=['delay_indicator','arr_delay'])

#Baseline accuracy: share of delayed flights (the classes are balanced, so always predicting one class scores 0.5)
y.mean()

0.5
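
For a 0/1 target, `y.mean()` is the share of the positive class, which equals the baseline accuracy only when the classes are balanced, as they are here. A minimal sketch of the general case, using a small hypothetical target (`y_toy` is not the notebook's data):

```python
import pandas as pd

# Hypothetical toy target: 1 = delayed, 0 = on time (not the notebook's data)
y_toy = pd.Series([0, 0, 0, 1, 1, 0, 1, 0])

# Share of the positive class, which is what y.mean() reports for a 0/1 target
positive_share = y_toy.mean()

# Accuracy of a model that always predicts the most frequent class
baseline_accuracy = y_toy.value_counts(normalize=True).max()

print(positive_share, baseline_accuracy)
```

With imbalanced classes the two numbers differ, and the majority-class share is the one a model has to beat.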

In [6]:
#Checking our feature matrix data types
X.dtypes

month                  int64
day_of_month           int64
day_of_week            int64
op_carrier_fl_num      int64
origin                object
dest                  object
distance             float64
carrier               object
dtype: object

In [7]:
#Getting dummies for our text features ORIGIN, DEST and CARRIER
X = pd.get_dummies(X,columns = ['origin','dest','carrier'],drop_first=True)

#Checking the shape of our feature matrix
X.shape

(644232, 712)
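
The jump in column count comes from one-hot encoding: each text feature becomes one column per category, minus the first category when `drop_first=True`. A small sketch on a hypothetical frame (`toy` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical miniature of the features matrix
toy = pd.DataFrame({
    'distance': [323.0, 695.0, 977.0, 967.0],
    'origin': ['LEX', 'DFW', 'ORD', 'SFO'],
    'carrier': ['American', 'American', 'United', 'United'],
})

# origin has 4 categories -> 3 dummy columns, carrier has 2 -> 1, plus distance
encoded = pd.get_dummies(toy, columns=['origin', 'carrier'], drop_first=True)

print(encoded.shape)
print(list(encoded.columns))
```

The dropped category (here `DFW` and `American`, the first in sorted order) is represented implicitly by all of its dummy columns being zero.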

In [8]:
#Training and testing sets split with random_state=1519 for reproducibility of results
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1519)
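
When no size is given, `train_test_split` holds out 25% of the rows for testing. The sketch below shows this on hypothetical toy arrays and adds `stratify`, an optional argument the cell above does not use, to keep the class balance identical in both sets:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 40 rows, perfectly balanced 0/1 target
X_toy = np.arange(80).reshape(40, 2)
y_toy = np.array([0, 1] * 20)

# Default split is 75% train / 25% test; stratify preserves the class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=1519
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```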

## Decision Tree Classifier ##

We fit a Decision Tree Classifier first. It is the simplest model in the tree-based family, yet it often performs reasonably well.

In [9]:
#Instantiating the model
tree = DecisionTreeClassifier(random_state=1519)

#Fitting the model
result_tree_cvec = tree.fit(X_train, y_train)

#Accuracy score training set
round(result_tree_cvec.score(X_train, y_train),4)

1.0

In [10]:
#Accuracy score testing set
round(result_tree_cvec.score(X_test, y_test),4)

0.6065
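
A training accuracy of 1.0 next to a testing accuracy of about 0.61 is the usual signature of an overfitted tree: with no depth limit, it keeps splitting until it memorizes the training set. One common remedy is to cap the depth. The sketch below illustrates the effect on synthetic data (`make_classification` with noisy labels), not on the flights data, and `max_depth=4` is an arbitrary illustrative value:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 20% randomly flipped labels
X_syn, y_syn = make_classification(n_samples=2000, n_features=20, n_informative=5,
                                   flip_y=0.2, random_state=1519)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=1519)

# Unrestricted tree versus a depth-limited one
full_tree = DecisionTreeClassifier(random_state=1519).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(max_depth=4, random_state=1519).fit(X_tr, y_tr)

for name, model in [('full', full_tree), ('max_depth=4', shallow_tree)]:
    print(name, round(model.score(X_tr, y_tr), 4), round(model.score(X_te, y_te), 4))
```

The depth-limited tree can no longer fit the training set perfectly, which narrows the gap between the two scores.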

## Bagging Classifier ##

Next, we try to improve on the single Decision Tree with an ensemble model: a Bagging Classifier, which combines trees fitted on bootstrap samples of the training set.

In [None]:
#Instantiating Bagging Classifier
bag = BaggingClassifier(random_state=1519)

#Fitting the model
results_bag = bag.fit(X_train, y_train)

#Accuracy score training set
round(results_bag.score(X_train, y_train),4)

In [None]:
#Accuracy score testing set
round(results_bag.score(X_test, y_test),4)
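
With default settings, `BaggingClassifier` fits 10 decision trees, each on a bootstrap sample of the training set, and combines their predictions. A minimal sketch that spells those defaults out on synthetic stand-in data (not the flights data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with noisy labels
X_syn, y_syn = make_classification(n_samples=1000, n_features=20, n_informative=5,
                                   flip_y=0.2, random_state=1519)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=1519)

# The base estimator is passed positionally so the sketch does not depend on
# the keyword name, which differs between scikit-learn versions
bag_toy = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=1519)
bag_toy.fit(X_tr, y_tr)

print(len(bag_toy.estimators_))
print(round(bag_toy.score(X_tr, y_tr), 4), round(bag_toy.score(X_te, y_te), 4))
```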

## Random Forest Classifier ##

A further attempt to improve on the Decision Tree Classifier: a Random Forest, which also considers only a random subset of features at each split.

In [None]:
#Instantiating Random Forest Classifier
forest = RandomForestClassifier(random_state=1519)

#Fitting the model
results_forest = forest.fit(X_train, y_train)

#Accuracy score training set
round(results_forest.score(X_train, y_train),4)

In [None]:
#Accuracy score testing set
round(results_forest.score(X_test, y_test),4)

In [None]:
#Initializing a pipeline for grid-searching the best Random Forest parameters
pipe = Pipeline(steps = [('model', RandomForestClassifier(random_state=1519))])

#Hyperparameters (max_depth must be an integer, so the linspace values are cast to int)
hyperparams = {'model__max_depth': np.linspace(2, 10, 6).astype(int).tolist(),
               'model__n_estimators': [5, 10],
               'model__min_samples_split': [2, 3, 4]
              }
#Initializing GridSearch with 3-fold cross-validation
gs = GridSearchCV(pipe,
                  hyperparams,
                  n_jobs=-1,
                  verbose=1,
                  cv=3)

#Fitting GridSearch and saving results
results = gs.fit(X_train,y_train)

In [None]:
#Best gridsearched Random Forest mean cross-validated accuracy (best_score_ comes from the validation folds, not from the testing set)
round(results.best_score_,4)

In [None]:
#Best gridsearched Random Forest accuracy score on training set
round(results.score(X_train,y_train),4)

In [None]:
#Best gridsearched model's parameters
results.best_params_
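
Note that `best_score_` is the mean accuracy over the cross-validation folds carved out of the training set, while accuracy on the held-out testing set comes from calling `score` on the fitted search object. A small sketch of the difference on synthetic stand-in data, with a deliberately tiny hypothetical grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=600, n_features=10, random_state=1519)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=1519)

# Tiny illustrative grid, 3-fold cross-validation
gs_toy = GridSearchCV(RandomForestClassifier(n_estimators=10, random_state=1519),
                      {'max_depth': [2, 4]},
                      cv=3)
gs_toy.fit(X_tr, y_tr)

cv_score = gs_toy.best_score_          # mean accuracy over the 3 validation folds
test_score = gs_toy.score(X_te, y_te)  # accuracy of the refitted best model on the test set

print(gs_toy.best_params_, round(cv_score, 4), round(test_score, 4))
```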

## AdaBoost Classifier ##

The optimized Random Forest improved a bit and is now reasonably well fitted, but its performance is still not satisfactory, so let's try a boosting technique.

In [None]:
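#Instantiating an AdaBoost Classifier with a Random Forest (max_depth=10, min_samples_split=3, n_estimators=10) as the base estimator
#Note: scikit-learn 1.2+ names this parameter `estimator`; `base_estimator` was removed in 1.4
ada = AdaBoostClassifier(base_estimator=RandomForestClassifier(max_depth=10, min_samples_split=3, n_estimators=10),
                         random_state=1519)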

#Fitting the model
results_ada = ada.fit(X_train, y_train)

#Accuracy score training set
round(results_ada.score(X_train, y_train),4)

In [None]:
#Accuracy score testing set
round(results_ada.score(X_test, y_test),4)
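
For comparison, AdaBoost is most often paired with very weak learners such as depth-1 trees ("stumps") rather than a full Random Forest. The sketch below shows that more conventional setup on synthetic stand-in data. The base estimator is passed positionally, an assumption made here so the same call works whether the installed scikit-learn names the keyword `base_estimator` or `estimator`; `n_estimators=25` is an arbitrary illustrative value.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X_syn, y_syn = make_classification(n_samples=1000, n_features=10, n_informative=4,
                                   random_state=1519)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=1519)

# Boosting over decision stumps; each round reweights the misclassified rows
ada_toy = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=25,
                             random_state=1519)
ada_toy.fit(X_tr, y_tr)

print(round(ada_toy.score(X_tr, y_tr), 4), round(ada_toy.score(X_te, y_te), 4))
```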

## XGBoost ##

In [None]:
#Instantiating an XGBoost Classifier
model = xgb.XGBClassifier(max_depth=10,
                          n_estimators=10,
                          n_jobs=-1,
                          random_state=1519,
                          verbosity=1)

#Fitting the model
model.fit(X_train, y_train)

#Accuracy score training set
model.score(X_train, y_train)

In [None]:
#Accuracy score testing set
model.score(X_test, y_test)

## Feature Importances ##

In [None]:
#Feature importances of the fitted AdaBoost model, one row per column of X
feat_imp = pd.DataFrame(results_ada.feature_importances_)

In [None]:
feat_imp.columns = ['coef']

In [None]:
feat_imp[['coef']].sort_values('coef', ascending=False ).head(10)

In [None]:
imp_features = feat_imp[feat_imp['coef']> .001]['coef'].sort_values(ascending=False).index
imp_features

In [None]:
X[X.columns[imp_features]].columns

In [None]:
X.columns

In [None]:
#Mapping each important feature's name to its own importance value
dict_feat = {}
for idx in imp_features:
    dict_feat[X.columns[idx]] = feat_imp['coef'][idx]

In [None]:
dict_feat

In [None]:
df = pd.DataFrame.from_dict(data = dict_feat, orient='index', columns=['value'])

In [None]:
df.shape
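
The same name-to-importance mapping can be built more directly by indexing the importances with the feature names from the start, which avoids keeping positions and names in sync by hand. A minimal sketch on synthetic stand-in data; the column names `feat_0` to `feat_5` and the small forest are hypothetical, and the 0.001 threshold mirrors the one used above:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical column names
X_arr, y_syn = make_classification(n_samples=500, n_features=6, n_informative=3,
                                   random_state=1519)
X_toy = pd.DataFrame(X_arr, columns=[f'feat_{i}' for i in range(6)])

forest_toy = RandomForestClassifier(n_estimators=20, random_state=1519).fit(X_toy, y_syn)

# Importances labelled by feature name, filtered and sorted in one step
importances = pd.Series(forest_toy.feature_importances_, index=X_toy.columns)
top = importances[importances > 0.001].sort_values(ascending=False)

print(top)
```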