# Table of Contents #

- [Importing Necessary Libraries](#Importing-Necessary-Libraries)
- [Importing Data and Initial Checks](#Importing-Data-and-Initial-Checks)
- [Target Variable and Features Matrix](#Target-Variable-and-Features-Matrix)
- [Fitting Multivariate Logistic Regression](#Fitting-Multivariate-Logistic-Regression)

## Importing Necessary Libraries ##

In [1]:
import pandas as pd
import numpy as np


import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from sklearn.pipeline                import Pipeline
from sklearn.model_selection         import train_test_split, GridSearchCV
from sklearn.linear_model            import LogisticRegression
from sklearn.ensemble                import (BaggingClassifier, RandomForestClassifier,
                                             ExtraTreesClassifier, AdaBoostClassifier,
                                             GradientBoostingClassifier)

from sklearn.tree                    import DecisionTreeClassifier
from sklearn.svm                     import SVC

import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)

## Importing Data and Initial Checks ##

In [2]:
#Loading data from a csv file
data = pd.read_csv('~/ga/projects/capstone_data/data/data_ready.csv')

#Checking size
data.shape

(644232, 11)

In [3]:
#Checking columns
data.columns

Index(['Unnamed: 0', 'month', 'day_of_month', 'day_of_week',
       'op_carrier_fl_num', 'origin', 'dest', 'arr_delay', 'delay_indicator',
       'distance', 'carrier'],
      dtype='object')

In [4]:
#Dropping a technical column
data.drop(columns=['Unnamed: 0'], inplace=True)

#Checking DataFrame
data.head()

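|   | month | day_of_month | day_of_week | op_carrier_fl_num | origin | dest | arr_delay | delay_indicator | distance | carrier |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 3 | 3 | 4195 | LEX | ORD | -15.0 | 0.0 | 323.0 | American |
| 1 | 11 | 5 | 1 | 6002 | DFW | CHA | -7.0 | 0.0 | 695.0 | American |
| 2 | 11 | 4 | 7 | 1937 | ORD | AUS | -24.0 | 0.0 | 977.0 | United |
| 3 | 10 | 13 | 6 | 948 | SFO | DEN | -3.0 | 0.0 | 967.0 | United |
| 4 | 11 | 9 | 5 | 1026 | HOU | ABQ | -8.0 | 0.0 | 759.0 | SouthWest |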


## Target Variable and Features Matrix ##

To fit a logistic regression, we use **delay_indicator** as our target variable. We also need to drop **arr_delay** from our features: the target variable was engineered directly from it, so keeping it would leak the answer into the model.
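To illustrate why a column that the target was derived from has to be dropped, here is a minimal sketch on toy data. The 15-minute threshold and the toy values are assumptions made for illustration only; the rule actually used to build `delay_indicator` is not shown in this notebook.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; the 15-minute threshold is an assumption for illustration,
# not the rule used to build delay_indicator in this notebook
toy = pd.DataFrame({'arr_delay': [-15.0, -7.0, 3.0, 20.0, 45.0, 90.0]})
toy['delay_indicator'] = (toy['arr_delay'] > 15).astype(float)

# A one-split tree on the leaked column alone recovers the target perfectly
stump = DecisionTreeClassifier(max_depth=1, random_state=1519)
stump.fit(toy[['arr_delay']], toy['delay_indicator'])
leak_accuracy = stump.score(toy[['arr_delay']], toy['delay_indicator'])
print(leak_accuracy)
```

A score this good says nothing about predictive power, since the delay itself would not be known before the flight arrives.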

In [5]:
#Target variable
y = data['delay_indicator']

#Features matrix
X = data.drop(columns=['delay_indicator','arr_delay'])

In [6]:
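#Baseline model accuracy: share of the positive class (equals the majority-class accuracy here because the classes are balanced)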
y.mean()

0.5
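Note that `y.mean()` matches the baseline accuracy only when the classes are perfectly balanced, as they are here. A more general sketch, using hypothetical toy labels rather than the flight data, takes the share of the most frequent class; `baseline_accuracy` is an illustrative helper name, not part of the original notebook.

```python
import pandas as pd

def baseline_accuracy(labels: pd.Series) -> float:
    """Accuracy of always predicting the most frequent class."""
    return labels.value_counts(normalize=True).max()

# Hypothetical toy labels, not the flight data
toy_balanced = pd.Series([0.0, 1.0, 0.0, 1.0])
toy_skewed = pd.Series([0.0, 0.0, 0.0, 1.0])

print(baseline_accuracy(toy_balanced))  # 0.5
print(baseline_accuracy(toy_skewed))    # 0.75
```

For the skewed labels, the mean would be 0.25 while the majority-class baseline is 0.75, which is why the two should not be confused on imbalanced data.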

In [7]:
#Checking our feature matrix data types
X.dtypes

month                  int64
day_of_month           int64
day_of_week            int64
op_carrier_fl_num      int64
origin                object
dest                  object
distance             float64
carrier               object
dtype: object

In [8]:
#Getting dummies for our text features ORIGIN, DEST and CARRIER
X = pd.get_dummies(X,columns = ['origin','dest','carrier'],drop_first=True)

#Checking the shape of our feature matrix
X.shape

(644232, 712)
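The jump to 712 columns comes from one-hot encoding: with `drop_first=True`, a text column with k distinct values becomes k-1 indicator columns. A small sketch on a hypothetical toy frame (the rows below are illustrative, not a sample of the real data) shows the mechanics:

```python
import pandas as pd

# Hypothetical toy frame; values are illustrative only
toy = pd.DataFrame({
    'origin': ['LEX', 'DFW', 'ORD', 'DFW'],
    'carrier': ['American', 'United', 'United', 'American'],
    'distance': [323.0, 695.0, 977.0, 695.0],
})

toy_encoded = pd.get_dummies(toy, columns=['origin', 'carrier'], drop_first=True)

# origin has 3 levels -> 2 dummies, carrier has 2 levels -> 1 dummy, plus distance
print(toy_encoded.columns.tolist())
print(toy_encoded.shape)
```

The dropped first level of each column (here `DFW` and `American`) acts as the reference category against which the remaining coefficients are interpreted.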

In [9]:
#Training and testing sets split with random_state=1519 for reproducibility of results
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1519)

## Fitting Multivariate Logistic Regression ##

In [10]:
#Initializing Logistic Regression
log_reg=LogisticRegression(random_state=1519)

#Fitting Multivariate Logistic Regression
log_reg.fit(X_train,y_train)

#Evaluating accuracy on the training set
print(f' Training set accuracy score is {round(log_reg.score(X_train,y_train),4)}')

 Training set accuracy score is 0.5737


In [11]:
#Evaluating accuracy on the testing set
print(f' Testing set accuracy score is {round(log_reg.score(X_test,y_test),4)}')

 Testing set accuracy score is 0.5716


As we can see, our model does not overfit: it performs with nearly equal accuracy on the training and testing sets. However, its performance is only slightly higher than our baseline model's accuracy of 0.5.
Let's try grid searching the model's regularization parameter (C, which in scikit-learn is the inverse of regularization strength, so smaller values regularize more) and penalty ('l1' for Lasso penalty and 'l2' for Ridge penalty) in order to improve the model's performance.
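To make the effect of C and the 'l1' penalty concrete, the following sketch fits Lasso-penalized logistic regressions on synthetic data and counts how many coefficients stay non-zero. The synthetic dataset, the two C values, and the helper name `count_nonzero_coefs` are assumptions for illustration; `solver='liblinear'` is used because it supports the 'l1' penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration, not the flight data)
X_syn, y_syn = make_classification(n_samples=500, n_features=20,
                                   n_informative=3, random_state=1519)

def count_nonzero_coefs(C):
    """Fit an L1-penalized logistic regression and count surviving coefficients."""
    model = LogisticRegression(penalty='l1', C=C, solver='liblinear',
                               random_state=1519)
    model.fit(X_syn, y_syn)
    return int(np.count_nonzero(model.coef_))

strong = count_nonzero_coefs(0.01)   # small C = strong regularization
weak = count_nonzero_coefs(1.0)      # large C = weak regularization
print(strong, weak)
```

Stronger regularization typically zeroes out more coefficients, which may matter with hundreds of dummy columns such as the ones in our feature matrix.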

In [None]:
#Initializing a pipeline for grid searching the best Logistic Regression parameters
#solver='liblinear' supports both 'l1' and 'l2' penalties (the newer default solver, 'lbfgs', does not support 'l1')
pipe = Pipeline(steps = [('model', LogisticRegression(solver='liblinear'))])

#Hyperparameters
hyperparams = {'model__C': np.linspace(.1,1,10),
               'model__penalty': ['l1', 'l2']}

#Initializing GridSearch with 3-fold cross-validation
gs = GridSearchCV(pipe,
                  hyperparams,
                  n_jobs=-1,
                  verbose=2,
                  cv=3)

#Fitting GridSearch and saving results
results = gs.fit(X_train,y_train)

Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed: 17.7min


In [None]:
#Best gridsearched Logistic Regression mean cross-validated accuracy (averaged over the 3 validation folds of the training data, not the testing set)
round(results.best_score_,4)

In [None]:
#Best gridsearched Logistic Regression accuracy score on training set
round(results.score(X_train,y_train),4)

In [None]:
#Best gridsearched model's parameters
results.best_params_
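Because `best_score_` is a mean cross-validated score computed on the training data, a held-out estimate still requires scoring on the testing set. A minimal sketch on synthetic data (the dataset and the reduced grid are assumptions chosen to keep the example fast, not the search run above) shows both numbers side by side:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the flight data (illustrative assumption)
X_syn, y_syn = make_classification(n_samples=400, n_features=10, random_state=1519)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=1519)

search = GridSearchCV(
    LogisticRegression(solver='liblinear'),
    {'C': [0.1, 1.0], 'penalty': ['l1', 'l2']},
    cv=3,
)
search.fit(X_tr, y_tr)

cv_accuracy = search.best_score_          # mean accuracy over validation folds
test_accuracy = search.score(X_te, y_te)  # accuracy of the refit best model on held-out data
print(round(cv_accuracy, 4), round(test_accuracy, 4))
```

In this notebook the equivalent call would be `results.score(X_test, y_test)`, since `GridSearchCV` refits the best parameter combination on the full training set by default.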