# Table of Contents #

- [Importing Necessary Libraries](#Importing-Necessary-Libraries)
- [Importing Data and Initial Checks](#Importing-Data-and-Initial-Checks)
- [Target Variable and Features Matrix](#Target-Variable-and-Features-Matrix)
- [Fitting Multivariate Logistic Regression](#Fitting-Multivariate-Logistic-Regression)

## Importing Necessary Libraries ##

In [1]:
import pandas as pd
import numpy as np

from sklearn.pipeline                import Pipeline
from sklearn.model_selection         import train_test_split, GridSearchCV
from sklearn.linear_model            import LogisticRegression

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Importing Data and Initial Checks ##

In [2]:
#Loading data from a csv file
data = pd.read_csv('~/ga/projects/capstone_data/data/data_ready.csv')

#Checking size
data.shape

(644232, 12)

In [3]:
#Checking columns
data.columns

Index(['Unnamed: 0', 'month', 'day_of_month', 'day_of_week',
       'op_carrier_fl_num', 'origin', 'dest', 'arr_delay', 'delay_indicator',
       'distance', 'carrier', 'dep_hour'],
      dtype='object')

In [4]:
#Dropping a technical column
data.drop(columns=['Unnamed: 0'], inplace=True)

#Checking DataFrame
data.head()

,month,day_of_month,day_of_week,op_carrier_fl_num,origin,dest,arr_delay,delay_indicator,distance,carrier,dep_hour
0,10,3,3,5228,ONT,SFO,-12.0,0.0,363.0,Delta,11
1,11,7,3,1443,BNA,DAL,-7.0,0.0,623.0,SouthWest,15
2,12,14,5,4072,LGA,CLE,-12.0,0.0,419.0,United,15
3,12,9,7,331,JFK,LAX,-17.0,0.0,2475.0,American,11
4,12,17,1,3539,SLC,GEG,-19.0,0.0,546.0,Delta,15


## Target Variable and Features Matrix ##

To fit a logistic regression we use **DELAY_INDICATOR** as the target variable. We also drop ARR_DELAY from the features: the target was engineered directly from it, so keeping it would leak the answer into the model.

In [5]:
#Target variable
y = data['delay_indicator']

#Features matrix
X = data.drop(columns=['delay_indicator','arr_delay'])

In [6]:
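#Baseline model accuracy: the share of delayed flights (the classes are balanced, so always predicting one class scores 0.5)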
y.mean()

0.5
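
Using `y.mean()` as the baseline works here because the two classes are balanced. As a minimal sketch of the more general case, the baseline is the share of the most frequent class. The helper `baseline_accuracy` and the toy series below are hypothetical and are not part of the flight data.

```python
import pandas as pd


def baseline_accuracy(y: pd.Series) -> float:
    """Accuracy of a model that always predicts the most frequent class."""
    return float(y.value_counts(normalize=True).max())


# Hypothetical toy targets, not the flight data
balanced = pd.Series([0.0, 1.0, 0.0, 1.0])
imbalanced = pd.Series([0.0, 0.0, 0.0, 1.0])

# For a balanced target the mean and the majority-class share coincide
print(balanced.mean(), baseline_accuracy(balanced))      # 0.5 0.5

# For an imbalanced target they differ: the mean is the positive-class share
print(imbalanced.mean(), baseline_accuracy(imbalanced))  # 0.25 0.75
```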

In [7]:
#Checking our feature matrix data types
X.dtypes

month                  int64
day_of_month           int64
day_of_week            int64
op_carrier_fl_num      int64
origin                object
dest                  object
distance             float64
carrier               object
dep_hour               int64
dtype: object

In [8]:
#Getting dummies for our text features ORIGIN, DEST and CARRIER
X = pd.get_dummies(X,columns = ['origin','dest','carrier'],drop_first=True)

#Checking the shape of our feature matrix
X.shape

(644232, 713)
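
The jump from 9 to 713 columns comes from the one-hot encoding: every distinct airport and carrier becomes its own indicator column, minus one dropped level per feature. The sketch below shows the same call on a small hypothetical frame (the rows are illustrative only).

```python
import pandas as pd

# Hypothetical toy frame that mimics three of the flight columns
toy = pd.DataFrame({
    'origin':   ['ONT', 'BNA', 'LGA', 'ONT'],
    'carrier':  ['Delta', 'SouthWest', 'United', 'Delta'],
    'dep_hour': [11, 15, 15, 11],
})

# drop_first=True removes the first (alphabetically sorted) level of each feature,
# so a feature with k levels contributes k - 1 indicator columns
encoded = pd.get_dummies(toy, columns=['origin', 'carrier'], drop_first=True)

print(encoded.columns.tolist())
# ['dep_hour', 'origin_LGA', 'origin_ONT', 'carrier_SouthWest', 'carrier_United']
```

Here `origin` has 3 levels and `carrier` has 3 levels, so the frame ends up with 1 + 2 + 2 = 5 columns.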

In [9]:
#Training and testing sets split with random_state=1519 for reproducibility of results
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1519)
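
With no `test_size` given, `train_test_split` holds out 25% of the rows for testing. The sketch below checks this on hypothetical toy arrays. The `stratify` argument is an addition that the notebook does not use; it keeps the class shares close to equal in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 2 features, balanced binary target
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0, 1] * 50)

# Default split is 75% train / 25% test; stratify is an assumption, not in the notebook
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=1519, stratify=y_toy
)

print(X_tr.shape, X_te.shape)  # (75, 2) (25, 2)

# Share of the positive class in each split
print(y_tr.mean(), y_te.mean())
```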

## Fitting Multivariate Logistic Regression ##

In [10]:
#Initializing Logistic Regression
log_reg = LogisticRegression(random_state=1519)

#Fitting Multivariate Logistic Regression
log_reg.fit(X_train, y_train)

#Evaluating accuracy on the training set
print(f' Training set accuracy score is {round(log_reg.score(X_train,y_train),4)}')

 Training set accuracy score is 0.5938


In [11]:
#Evaluating accuracy on the testing set
print(f' Testing set accuracy score is {round(log_reg.score(X_test,y_test),4)}')

 Testing set accuracy score is 0.5925


As we can see, the model does not overfit: it scores nearly the same accuracy on the training set (0.5938) and the testing set (0.5925). However, that is only about 0.09 above the baseline accuracy of 0.5.
Let's try grid searching the model's inverse regularization strength (C, where smaller values mean stronger regularization) and penalty type ('l1' for Lasso and 'l2' for Ridge) in order to improve the model's performance.
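
To illustrate what the penalty type changes, here is a minimal sketch on synthetic data (generated with `make_classification`, not the flight data). An L1 penalty can push coefficients to exactly zero, while an L2 penalty only shrinks them. The dataset sizes and `C=0.1` are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 3 of which carry signal (assumed setup)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=20, n_informative=3, n_redundant=0, random_state=0
)

zero_counts = {}
for penalty in ('l1', 'l2'):
    # liblinear supports both penalty types
    model = LogisticRegression(penalty=penalty, C=0.1, solver='liblinear')
    model.fit(X_syn, y_syn)
    # Count coefficients that were driven to exactly zero
    zero_counts[penalty] = int(np.sum(model.coef_ == 0))

print(zero_counts)
```

With a feature matrix as wide as the one-hot encoded one here, an L1 penalty therefore also acts as a form of feature selection.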

In [12]:
#Initializing a pipeline for gridsearching the best Logistic Regression parameters
#solver='liblinear' supports both 'l1' and 'l2' penalties (the newer default solver, 'lbfgs', does not support 'l1')
pipe = Pipeline(steps=[('model', LogisticRegression(solver='liblinear'))])

#Hyperparameters
hyperparams = {'model__C': np.linspace(.1, 1, 10),
               'model__penalty': ['l1', 'l2']}

#Initializing GridSearch with 3-fold cross-validation
gs = GridSearchCV(pipe,
                  hyperparams,
                  n_jobs=-1,
                  verbose=2,
                  cv=3)

#Fitting GridSearch and saving results
results = gs.fit(X_train,y_train)

Fitting 3 folds for each of 20 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed: 18.4min
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed: 35.8min finished


In [13]:
#Best gridsearched Logistic Regression mean cross-validated accuracy score (computed on folds of the training set)
round(results.best_score_,4)

0.5923

In [14]:
#Best gridsearched Logistic Regression accuracy score on training set
round(results.score(X_train,y_train),4)

0.5949

In [15]:
#Best gridsearched model's parameters
results.best_params_

{'model__C': 0.5, 'model__penalty': 'l1'}
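
Note that `best_score_` is the mean cross-validated score over folds of the training data, which is a different quantity from the score on the held-out testing set. The sketch below, on synthetic data with a reduced hypothetical grid, shows how the three numbers can be obtained side by side from a fitted `GridSearchCV`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the flight data (assumed sizes)
X_syn, y_syn = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=1519)

pipe = Pipeline(steps=[('model', LogisticRegression(solver='liblinear'))])
grid = {'model__C': [0.1, 0.5, 1.0], 'model__penalty': ['l1', 'l2']}

# After the search, GridSearchCV refits the best candidate on all of X_tr
search = GridSearchCV(pipe, grid, cv=3).fit(X_tr, y_tr)

cv_score = search.best_score_             # mean accuracy over the 3 validation folds
train_score = search.score(X_tr, y_tr)    # accuracy of the refit model on the training set
test_score = search.score(X_te, y_te)     # accuracy of the refit model on the held-out set

print(round(cv_score, 4), round(train_score, 4), round(test_score, 4))
```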

So far, the Logistic Regression classifier has scored about 0.59 accuracy on both the training and testing sets with default settings. The tuned model (C=0.5, L1 penalty) shows no meaningful improvement: 0.5923 mean cross-validated accuracy and 0.5949 on the training set, after a grid search that took roughly 36 minutes.