## Model iterations with Logistic Regression
In this notebook we use logistic regression to build our predictive model.

Steps:
- Run a vanilla logistic regression model with 3-fold cross-validation and record the recall score for class 3 (this is our baseline LR model).
- Run a grid search to identify hyperparameters that improve the class 3 recall score.
- Build a logistic regression model with the best hyperparameters found and compare training, validation, and test recall scores.

In [1]:
#automatically reload edited project modules
%load_ext autoreload
%autoreload 2

In [2]:
import os, sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

from src import cleaning_functions as cfs
from src import preprocessing_functions as pfs
from src import modeling_functions as mfs
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import confusion_matrix, plot_confusion_matrix, classification_report
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.metrics import recall_score, f1_score, make_scorer
from sklearn.linear_model import LogisticRegression

In [3]:
df = cfs.cleaned_dataframe()

df = df.drop(['vdcmun_id', 'ward_id'], axis=1)

#create target and feature dataframes 
y = df['target']
X = df.drop('target', axis=1)

#train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 2015, test_size = .2)

#train validation split
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, random_state = 2015, test_size = .25)

#random undersampling of the training split so that all classes are equally represented
rus = RandomUnderSampler(random_state=2015)
X_tr_res, y_tr_res = rus.fit_resample(X_tr, y_tr)

#one-hot encode the categorical features: land_surface_condition, foundation_type, roof_type, ground_floor_type, other_floor_type
X_tr_res, X_val, X_test = pfs.ohe_train_and_test_features(X_tr_res, X_val, X_test)

#One hot encode district_id
X_tr_res, X_val, X_test = pfs.ohe_train_val_test_geos(X_tr_res, X_val, X_test)
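The splitting and balancing steps above can be sketched on toy data. This is a minimal illustration, not the project's code: the `toy` frame and its values are made up, and the pandas group-wise sampling only mimics the idea behind `RandomUnderSampler` (sample every class down to the size of the rarest one). It shows that the two splits yield 60/20/20 proportions and that only the training portion is balanced.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (hypothetical values).
rng = np.random.default_rng(2015)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "target": rng.choice([1, 2, 3], size=1000, p=[0.2, 0.2, 0.6]),
})
X, y = toy.drop("target", axis=1), toy["target"]

# 20% is held out for test; 25% of the remaining 80% is another 20% of the total.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2015)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=2015)

# Balance the training portion only: sample each class down to the rarest class size.
n_min = y_tr.value_counts().min()
balanced_idx = y_tr.groupby(y_tr).sample(n=n_min, random_state=2015).index
X_tr_res, y_tr_res = X_tr.loc[balanced_idx], y_tr.loc[balanced_idx]

print(len(X_tr), len(X_val), len(X_test))
print(y_tr_res.value_counts().to_dict())
```

Validation and test data keep their original class proportions, which is why the supports in the validation and test reports are unequal while the training supports are identical across classes.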

In [4]:
X_train_scaled = X_tr_res.copy()
col_names = ['count_floors_pre_eq', 'count_families', 'age_building', 'plinth_area_sq_ft', 'height_ft_pre_eq']
features = X_train_scaled[col_names]
scaler = StandardScaler().fit(features.values)
features = scaler.transform(features.values)

X_train_scaled[col_names] = features

X_val_scaled = X_val.copy()
X_test_scaled = X_test.copy()

features = X_val_scaled[col_names]
features = scaler.transform(features.values)
X_val_scaled[col_names] = features
features = X_test_scaled[col_names]
features = scaler.transform(features.values)
X_test_scaled[col_names] = features
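The cell above repeats the same copy, transform, and assign pattern for each split. A small helper can express it once; the sketch below uses a hypothetical function name (`scale_columns`) and toy frames, and keeps the important property of the original: the scaler is fitted on training data only and then reused for the other splits.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale_columns(frame, columns, scaler):
    """Return a copy of `frame` with `columns` transformed by an already-fitted scaler."""
    scaled = frame.copy()
    scaled[columns] = scaler.transform(frame[columns].to_numpy())
    return scaled

# Toy frames standing in for the training and validation features.
train = pd.DataFrame({"age_building": [5.0, 10.0, 30.0, 55.0], "has_flag": [0, 1, 0, 1]})
val = pd.DataFrame({"age_building": [12.0, 80.0], "has_flag": [1, 0]})
numeric_cols = ["age_building"]

# Fit on the training split only, so no validation statistics leak into the scaler.
scaler = StandardScaler().fit(train[numeric_cols].to_numpy())
train_scaled = scale_columns(train, numeric_cols, scaler)
val_scaled = scale_columns(val, numeric_cols, scaler)

print(train_scaled)
print(val_scaled)
```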

In [5]:
#create the class 3 recall scorer
recall3 = mfs.scorer_recall3()
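The body of `mfs.scorer_recall3` is not shown in this notebook; the grid search output further down reports it as `make_scorer(recall_score_class3)`. The following is a guess at what such a scorer could look like, not the project's actual implementation. The name `recall3_sketch` and the toy labels are assumptions.

```python
import numpy as np
from sklearn.metrics import make_scorer, recall_score

def recall_score_class3(y_true, y_pred):
    """Recall for label 3 only: of the true class-3 rows, the fraction predicted as 3."""
    return recall_score(y_true, y_pred, labels=[3], average=None)[0]

# A scorer object usable as `scoring=` in cross_val_score or GridSearchCV.
recall3_sketch = make_scorer(recall_score_class3)

# Toy check: four true class-3 rows, two of them predicted as class 3.
y_true = np.array([3, 3, 3, 3, 1, 2])
y_pred = np.array([3, 3, 1, 2, 1, 3])
print(recall_score_class3(y_true, y_pred))
```

Passing `labels=[3]` selects the class by its label rather than by its position in the sorted label list, which avoids relying on an index such as `[2]`.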

### Vanilla Logistic Regression Model

In [6]:
lr1 = LogisticRegression(random_state=2015)
print('Recall Score for class 3 : ',cross_val_score(lr1, X_train_scaled, y_tr_res, scoring=recall3, cv=3))

Recall Score for class 3 :  [0.63749313 0.64293044 0.64468958]


The class 3 recall varies little across the three folds (about 0.637 to 0.645).
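To put a number on the spread between folds, the three scores printed above can be summarized directly. This is only a convenience sketch; the variable name `fold_recall` is an assumption.

```python
import numpy as np

# Class-3 recall per fold, copied from the cross_val_score output above.
fold_recall = np.array([0.63749313, 0.64293044, 0.64468958])

fold_mean = fold_recall.mean()
fold_std = fold_recall.std()
fold_range = fold_recall.max() - fold_recall.min()
print(f"mean={fold_mean:.4f}, std={fold_std:.4f}, range={fold_range:.4f}")
```

The standard deviation is well under half a percentage point of recall, which supports treating the baseline as stable across folds.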

Next, we print the classification report for the model on the validation set.

In [7]:
lr1.fit(X_train_scaled, y_tr_res)
print('Classification Report for lr1, Validation Set','\n',
      '======================================================','\n',
      classification_report(y_val, lr1.predict(X_val_scaled)))

Classification Report for lr1, Validation Set 
               precision    recall  f1-score   support

         1.0       0.61      0.67      0.64     33118
         2.0       0.28      0.48      0.35     27409
         3.0       0.85      0.64      0.73     91892

    accuracy                           0.62    152419
   macro avg       0.58      0.60      0.57    152419
weighted avg       0.69      0.62      0.64    152419



In [8]:
print('Classification Report for lr1, Training Data','\n',
      '======================================================','\n',
      classification_report(y_tr_res, lr1.predict(X_train_scaled)))

Classification Report for lr1, Training Data 
               precision    recall  f1-score   support

         1.0       0.67      0.68      0.67     81857
         2.0       0.48      0.47      0.48     81857
         3.0       0.63      0.64      0.64     81857

    accuracy                           0.60    245571
   macro avg       0.60      0.60      0.60    245571
weighted avg       0.60      0.60      0.60    245571



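We run a **grid search** here to identify the best hyperparameters for our data.

**C** is the inverse of the regularization strength (larger values mean weaker regularization), and the **penalty** values l2 and l1 refer to ridge and lasso regularization. Note that `LogisticRegression` is created here with its default solver, which in recent scikit-learn versions is `lbfgs` and does not support the l1 penalty. The near-zero fit times for the l1 candidates in the log below are consistent with those fits failing rather than being evaluated, so the search effectively compares only the l2 candidates.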
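For reference, one common way to write the L2-penalized objective for the binary case (the form used in earlier scikit-learn documentation, with labels $y_i \in \{-1, 1\}$, weights $w$, and intercept $c$) is shown below. The symbols are that documentation's, not this notebook's, and the multiclass model fitted here replaces the binary loss with a multinomial one.

$$
\min_{w,\,c} \; \frac{1}{2} w^\top w + C \sum_{i=1}^{n} \log\left(1 + \exp\left(-y_i \left(x_i^\top w + c\right)\right)\right)
$$

The l1 variant replaces $\frac{1}{2} w^\top w$ with $\lVert w \rVert_1$. Because $C$ multiplies the data-fit term, increasing $C$ makes the penalty matter less, which is why it is described as the inverse of regularization strength.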

In [9]:
lr = LogisticRegression(random_state=2015)
params = {'C':[0.1, 1, 5, 10], 'penalty':['l2','l1']}
gs_lr = GridSearchCV(lr, param_grid=params, scoring=recall3, cv=3, verbose=2)

In [10]:
gs_lr.fit(X_train_scaled, y_tr_res)

Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] C=0.1, penalty=l2 ...............................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ................................ C=0.1, penalty=l2, total=   6.2s
[CV] C=0.1, penalty=l2 ...............................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    6.2s remaining:    0.0s


[CV] ................................ C=0.1, penalty=l2, total=   6.4s
[CV] C=0.1, penalty=l2 ...............................................
[CV] ................................ C=0.1, penalty=l2, total=   6.3s
[CV] C=0.1, penalty=l1 ...............................................
[CV] ................................ C=0.1, penalty=l1, total=   0.0s
[CV] C=0.1, penalty=l1 ...............................................
[CV] ................................ C=0.1, penalty=l1, total=   0.1s
[CV] C=0.1, penalty=l1 ...............................................
[CV] ................................ C=0.1, penalty=l1, total=   0.0s
[CV] C=1, penalty=l2 .................................................
[CV] .................................. C=1, penalty=l2, total=   6.3s
[CV] C=1, penalty=l2 .................................................
[CV] .................................. C=1, penalty=l2, total=   6.9s
[CV] C=1, penalty=l2 .................................................
[CV] .

[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:  1.4min finished


GridSearchCV(cv=3, estimator=LogisticRegression(random_state=2015),
             param_grid={'C': [0.1, 1, 5, 10], 'penalty': ['l2', 'l1']},
             scoring=make_scorer(recall_score_class3), verbose=2)

In [11]:
gs_lr.best_params_

{'C': 10, 'penalty': 'l2'}
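`best_params_` reports only the winning candidate. Inspecting `cv_results_` shows how close the candidates were and whether any of them failed to fit. The sketch below runs on synthetic data and, as an assumption that differs from the cell above, uses `solver="saga"` so that both l1 and l2 candidates can be fitted; the scoring string and all variable names are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Small synthetic 3-class problem (hypothetical stand-in for the real features).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X = StandardScaler().fit_transform(X)

# saga supports both l1 and l2 penalties, so every candidate can be fitted.
lr = LogisticRegression(solver="saga", max_iter=5000, random_state=0)
params = {"C": [0.1, 1, 10], "penalty": ["l2", "l1"]}
gs = GridSearchCV(lr, param_grid=params, scoring="recall_macro", cv=3)
gs.fit(X, y)

# One row per candidate, best first; NaN scores would indicate failed fits.
results = pd.DataFrame(gs.cv_results_)[
    ["param_C", "param_penalty", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results.to_string(index=False))
print("failed candidates:", int(results["mean_test_score"].isna().sum()))
```

Applied to `gs_lr`, the same table would show whether the l1 rows carry NaN scores and how small the differences between the l2 candidates are.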

In [12]:
lr2 = LogisticRegression(penalty='l2', C=10, random_state=2015)
lr2.fit(X_train_scaled, y_tr_res)

LogisticRegression(C=10, random_state=2015)

In [13]:
print('Classification Report for lr2, Validation Data','\n',
      '======================================================','\n',
      classification_report(y_val, lr2.predict(X_val_scaled)))

Classification Report for lr2, Validation Data 
               precision    recall  f1-score   support

         1.0       0.61      0.67      0.64     33118
         2.0       0.28      0.47      0.35     27409
         3.0       0.85      0.64      0.73     91892

    accuracy                           0.62    152419
   macro avg       0.58      0.60      0.57    152419
weighted avg       0.69      0.62      0.64    152419



In [14]:
print('Classification Report for lr2, Train Data','\n',
      '======================================================','\n',
      classification_report(y_tr_res, lr2.predict(X_train_scaled)))

Classification Report for lr2, Train Data 
               precision    recall  f1-score   support

         1.0       0.67      0.68      0.67     81857
         2.0       0.48      0.47      0.48     81857
         3.0       0.63      0.64      0.64     81857

    accuracy                           0.60    245571
   macro avg       0.60      0.60      0.60    245571
weighted avg       0.60      0.60      0.60    245571



In [15]:
print('Classification Report for lr2, Test Data','\n',
      '======================================================','\n',
      classification_report(y_test, lr2.predict(X_test_scaled)))

Classification Report for lr2, Test Data 
               precision    recall  f1-score   support

         1.0       0.62      0.68      0.65     33245
         2.0       0.28      0.48      0.35     27146
         3.0       0.85      0.64      0.73     92028

    accuracy                           0.62    152419
   macro avg       0.58      0.60      0.58    152419
weighted avg       0.70      0.62      0.64    152419



Next, we check whether weakening the regularization further, by increasing C to 100, improves the class 3 recall.

In [16]:
lr3 = LogisticRegression(penalty='l2', C=100, random_state=2015)
lr3.fit(X_train_scaled, y_tr_res)

LogisticRegression(C=100, random_state=2015)

In [17]:
recall_score(y_tr_res, lr3.predict(X_train_scaled), average=None)[2]

0.6425351527664097

In [18]:
recall_score(y_val, lr3.predict(X_val_scaled), average=None)[2]

0.6420689505071171

Class 3 recall is unchanged from lr2 (about 0.64 on both the training and validation data), so we choose **lr2 as our final logistic regression model**.