
# Logistic Regression Classification

Import the necessary libraries.

In [0]:
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
%matplotlib inline

In [0]:
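#Function for importing preprocessed data
#(the df argument is unused; it is kept so the calls below still work)
def import_pkl(df,name):
    fullname = name+'.pkl'
    #Use a context manager so the file handle is closed after loading
    with open(fullname, 'rb') as f:
        df = pickle.load(f)
    return df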

In [0]:
#Declaring training, validation and testing dataframes
df_train = pd.DataFrame()
df_valid = pd.DataFrame()
df_test = pd.DataFrame()
df_train_l = pd.DataFrame()
df_valid_l = pd.DataFrame()
df_test_l = pd.DataFrame()

In [0]:
#Loading the preprocessed data into pandas dataframes
df_train = import_pkl(df_train,'train_x')
df_valid = import_pkl(df_valid,'valid_x')
df_test = import_pkl(df_test,'test_x')
df_train_l = import_pkl(df_train_l,'train_x_l')
df_valid_l = import_pkl(df_valid_l,'valid_x_l')
df_test_l = import_pkl(df_test_l,'test_x_l')

In [0]:
print(df_train.shape)
print(df_valid.shape)
print(df_test.shape)
print(df_train_l.shape)
print(df_valid_l.shape)
print(df_test_l.shape)

(77854, 168)
(13737, 168)
(10175, 168)
(77854, 1)
(13737, 1)
(10175, 1)


In [0]:
#import warnings filter
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)

In [0]:
#Logistic Regression Model Training
from sklearn.linear_model import LogisticRegression
LogReg_model = LogisticRegression(class_weight='balanced')
LogReg_model.fit(df_train,np.ravel(df_train_l))

LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
#Logistic Regression Training Score
#score() returns mean accuracy, not error
score = LogReg_model.score(df_train,np.ravel(df_train_l))
print('Training accuracy: ', score*100)

Training accuracy:  56.47750918385697


In [0]:
#Logistic Regression Validation Score
#score() returns mean accuracy, not error
score = LogReg_model.score(df_valid,np.ravel(df_valid_l))
print('Validation accuracy: ', score*100)

Validation accuracy:  56.16218970663173


In [0]:
from pprint import pprint
pprint(LogReg_model.get_params())

{'C': 1.0,
 'class_weight': 'balanced',
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'warn',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'warn',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}


Above are all the parameters of the logistic regression model. Next, we tune the hyperparameters, starting with RandomizedSearchCV.

In [0]:
from sklearn.model_selection import RandomizedSearchCV

penalty = ['l1', 'l2']
C = [1, 10, 100]
max_iter = [10, 20]
tol = [0.01, 0.001]

# Create the random grid
random_grid = {'penalty': penalty,
               'max_iter': max_iter,
               'C': C,
               'tol': tol}
pprint(random_grid)

{'C': [1, 10, 100],
 'max_iter': [10, 20],
 'penalty': ['l1', 'l2'],
 'tol': [0.01, 0.001]}


In [0]:
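from sklearn.model_selection import PredefinedSplit
#Note: this estimator does not set class_weight='balanced', unlike LogReg_model above
LogReg_modelgcv = LogisticRegression()
train_len = len(df_train)
valid_len = len(df_valid)
#Stack training and validation data so a single predefined split can be used
df_tv = pd.concat([df_train, df_valid], ignore_index = True)
df_tv_l = pd.concat([df_train_l, df_valid_l], ignore_index = True)
#test_fold: -1 keeps a row in the training set, 0 assigns it to the validation fold
bound = np.array([(i < train_len) * -1 for i in range(train_len + valid_len)])
split = PredefinedSplit(bound)
#n_iter exceeds the 24 possible combinations, so every combination is tried
logreg_random = RandomizedSearchCV(estimator = LogReg_modelgcv, param_distributions = random_grid, n_iter = 100, n_jobs = -1, verbose = 1, cv = split)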

logreg_random.fit(df_tv,np.ravel(df_tv_l))

logreg_random.best_params_

Fitting 1 folds for each of 24 candidates, totalling 24 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:  3.8min finished


{'C': 10, 'max_iter': 10, 'penalty': 'l2', 'tol': 0.001}

Of all the parameter combinations tried, the one shown in the output above performed best on the validation split. Now, we train the model with these parameters and check the accuracy.
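
The search above evaluates each candidate on one fixed train/validation split rather than on k-fold cross-validation. The sketch below illustrates, in plain Python, the convention that `PredefinedSplit` follows for its `test_fold` array: entries of -1 are never used for validation, and entries of 0 form the single validation fold. The helper name `predefined_split_indices` and the toy sizes are assumptions made for illustration; they are not part of the notebook.

```python
def predefined_split_indices(test_fold):
    """Return (train_idx, valid_idx) for a single predefined fold.

    Entries equal to -1 always stay in the training set;
    entries equal to 0 make up the validation fold.
    """
    train_idx = [i for i, fold in enumerate(test_fold) if fold == -1]
    valid_idx = [i for i, fold in enumerate(test_fold) if fold == 0]
    return train_idx, valid_idx


# Toy sizes (hypothetical): 5 training rows followed by 2 validation rows
train_len, valid_len = 5, 2
test_fold = [-1 if i < train_len else 0 for i in range(train_len + valid_len)]

train_idx, valid_idx = predefined_split_indices(test_fold)
print(test_fold)
print(train_idx)
print(valid_idx)
```

With only one validation fold, each candidate is fit exactly once, which matches the "Fitting 1 folds" message in the search output.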

In [0]:
logreg_new = LogisticRegression(class_weight='balanced',
 C=10,
 max_iter=10,
 penalty='l2',
 tol=0.001)
logreg_new.fit(df_train,np.ravel(df_train_l))
score = logreg_new.score(df_train,np.ravel(df_train_l))
print(score*100)

56.47750918385697


In [0]:
score = logreg_new.score(df_valid,np.ravel(df_valid_l))
print(score*100)

56.15491009681881


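As we can see, the accuracy is essentially unchanged: the training accuracy is identical (56.48%) and the validation accuracy drops very slightly, from 56.16% to 56.15%. Note that the search reported 24 candidates, which is every combination in the grid (2 × 3 × 2 × 2), so RandomizedSearchCV already covered the full search space here. We nevertheless run GridSearchCV on a narrowed grid to confirm the result.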
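
One way to check how much of the search space the randomized search covered is to count the combinations in the grid and compare that count with `n_iter`. The sketch below does this with the standard library only, reusing the values of `random_grid` and `n_iter` from the cells above; the variable names `combinations` and `covers_full_grid` are introduced here for illustration.

```python
from itertools import product

# Same search space as random_grid in the cell above
random_grid = {'penalty': ['l1', 'l2'],
               'max_iter': [10, 20],
               'C': [1, 10, 100],
               'tol': [0.01, 0.001]}

# Enumerate every combination of parameter values
combinations = [dict(zip(random_grid, values))
                for values in product(*random_grid.values())]

n_iter = 100  # value passed to RandomizedSearchCV above
covers_full_grid = n_iter >= len(combinations)

print(len(combinations))
print(covers_full_grid)
```

The count agrees with the "24 candidates" line in the search output, which suggests the randomized search and an exhaustive grid search over the same lists would evaluate the same candidates.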

In [0]:
from sklearn.model_selection import GridSearchCV


max_iter = [10,20]

C = [10]

penalty = ['l2']

tol = [0.01, 0.001]


search_grid = {'max_iter' : max_iter,
               'C': C,
               'penalty': penalty,
               'tol': tol}

pprint(search_grid)

{'C': [10], 'max_iter': [10, 20], 'penalty': ['l2'], 'tol': [0.01, 0.001]}


In [0]:
# Create a base model
log_reg_3 = LogisticRegression()
# Instantiate the grid search model
logreg_grid = GridSearchCV(estimator = log_reg_3, param_grid = search_grid, n_jobs = 2, verbose = 1, cv = split)

# Fit the grid search model
logreg_grid.fit(df_tv, np.ravel(df_tv_l))

logreg_grid.best_params_

Fitting 1 folds for each of 4 candidates, totalling 4 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   4 out of   4 | elapsed:   53.5s finished


{'C': 10, 'max_iter': 10, 'penalty': 'l2', 'tol': 0.001}

Now, we find the final score of the model with hyperparameters tuned as above.

In [0]:
logreg_final = LogisticRegression(C=10, max_iter=10, penalty='l2', tol=0.001)
logreg_final.fit(df_train, np.ravel(df_train_l))
score = logreg_final.score(df_train,np.ravel(df_train_l))
print(score*100)

58.0419760063709


In [0]:
score = logreg_final.score(df_valid,np.ravel(df_valid_l))
print(score*100)

57.95297372060858


In [0]:
score = logreg_final.score(df_test,np.ravel(df_test_l))
print(score*100)

58.37837837837838


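As we can see, the accuracy has increased from about 56% to about 58% (58.04% training, 57.95% validation, 58.38% test). However, GridSearchCV selected the same hyperparameters as RandomizedSearchCV, so the search itself does not explain the gain. The difference is that logreg_final is fit without class_weight='balanced', whereas LogReg_model and logreg_new use it, so the improvement most likely comes from that change.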
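
Because some models in this notebook use `class_weight='balanced'` and others do not, plain accuracy may not be the fairest basis for comparison. The sketch below is a minimal, standard-library illustration of per-class recall and balanced accuracy (the mean of the per-class recalls). The function names and the toy labels are hypothetical; the class distribution of the actual dataset is not shown in this notebook, so this is only an illustration of the metric, not a result.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Fraction of correctly predicted samples for each true class."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in total}


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Toy labels (hypothetical): 8 samples of class 0, 2 samples of class 1,
# and a classifier that always predicts the majority class
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = per_class_recall(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)

print(accuracy)
print(recalls)
print(bal_acc)
```

In this toy case the plain accuracy is high while the minority class is never predicted correctly, which is the kind of gap a balanced metric exposes. scikit-learn provides `balanced_accuracy_score` in `sklearn.metrics` for the same purpose.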