## 3.2. Logistic Regression

## Content
- [Importing Libraries and Dataset](#Importing-Libraries-and-Dataset)
- [Functions for model presentation](#Functions-for-model-presentation)
- [Training model](#Training-model)

## Importing Libraries and Dataset

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
from mlxtend.preprocessing import DenseTransformer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from imblearn.over_sampling import SMOTE
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix, make_scorer, recall_score, accuracy_score, precision_score, f1_score, roc_auc_score
import pickle

In [2]:
val = pd.read_csv('../datasets/val.csv')
train = pd.read_csv('../datasets/train.csv')
df = pd.read_csv('../datasets/dataset.csv')

In [3]:
X_val = val.text
y_val = val.target_variable
X_train = train.text
y_train = train.target_variable
X = df['text']
y = df['target_variable']

## Functions for model presentation

In [4]:
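# Scorers for multi-metric grid search: precision, recall, accuracy, F1 and ROC AUC.
# The grid search is refit on 'accuracy_score' (see the refit argument of GridSearchCV).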
scorers = {'precision_score': make_scorer(precision_score),
           'recall_score': make_scorer(recall_score),
           'accuracy_score': make_scorer(accuracy_score),
           'f1_score': make_scorer(f1_score),
           'roc_auc_score': make_scorer(roc_auc_score, needs_threshold=True)
          }

#make a function that prints evaluation metrics score
def evaluation_metrics(model):
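    # GridSearchCV.score uses the refit metric, which is accuracy here
    print('Train\'s accuracy_score: {}'.format(round(model.score(X_train, y_train),4)))
    # best_score_ is the mean cross-validated accuracy of the best parameter combination
    print('Best accuracy score from training: {}'.format(round(model.best_score_,4)))
    print('Validation\'s accuracy score : {}'.format(round(model.score(X_val, y_val),4)))
    # Note: this is the gap between the cross-validated score and the validation score
    print('Difference in accuracy scores between train and val: {}'.format(round(model.best_score_ - model.score(X_val, y_val),4)))
    # Probability of the positive class (second column of predict_proba)
    model_proba = model.predict_proba(X_val)[:, 1]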
    print('ROC_AUC score on Validation Set: {}'.format(round(roc_auc_score(y_val, model_proba), 4)))
    
    y_pred = model.predict(X_val)
    tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()
    sensitivity = tp/(tp+fn)
    specificity = tn/(tn+fp)
    precision = tp/(tp+fp)
    
    print(f'Model sensitivity is : {sensitivity}')
    print(f'Model specificity is : {specificity}')
    print(f'Model f1 score is : {(2*sensitivity*precision)/(sensitivity+precision)}')
    print('\n\nClassification report :\n', classification_report(y_val, y_pred),'\n')
    print(pd.DataFrame({'Pred Negative' : [tn,fn], 'Pred Positive' : [fp,tp]}, index = ['Actual Negative','Actual Positive']))


#for final model section:
#make a function that prints all classification metrics, AUC-ROC + TN, FP, FN, TP
def all_metrics(model):
    y_pred = model.predict(X_val)
    tn, fp, fn, tp = confusion_matrix(y_val, y_pred).ravel()
    print("True Negatives: " + str(tn))
    print("False Positives: " + str(fp))
    print("False Negatives: " + str(fn))
    print("True Positives: " + str(tp))
    print()
    print('--------------------------------')
    print()
    print('Accuracy: {}'.format(round(accuracy_score(y_val, y_pred), 4)))
    print('Misclassification rate: {}'.format(round((fp+fn)/(tp+fp+tn+fn),4)))
    print('Precision: {}'.format(round(precision_score(y_val, y_pred), 4)))
    print('Recall: {}'.format(round(recall_score(y_val, y_pred), 4)))
    print('Specificity: {}'.format(round(tn/(tn+fp),4)))
    print(f'Model f1 score is : {(f1_score(y_val, y_pred))}')
    #get roc auc score
    model_proba = model.predict_proba(X_val)[:, 1]
    print('ROC_AUC score on Validation Set: {}'.format(round(roc_auc_score(y_val, model_proba), 4)))

In [5]:
# Set up stratified k-fold for cross-validation.
# Stratification keeps the class proportions of the full training set in every fold,
# which matters here because the two classes are not evenly balanced.

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
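
To illustrate what stratification does, here is a minimal sketch on made-up labels (the 40/60 class split and the names `y_toy`, `X_toy` and `skf_demo` are assumptions for the example, not the project data). Each test fold should end up with roughly the same share of positives as the full label vector.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 40 negatives and 60 positives
y_toy = np.array([0] * 40 + [1] * 60)
X_toy = np.zeros((len(y_toy), 1))  # features do not influence the split

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Share of positive labels in each test fold
fold_positive_rates = [
    y_toy[test_idx].mean() for _, test_idx in skf_demo.split(X_toy, y_toy)
]
print(fold_positive_rates)
```

A plain `KFold` gives no such guarantee, so individual folds can drift away from the overall class ratio.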

## Training model

**Count Vectorizer**

In [6]:
log_pipe_cvec = Pipeline([('cvec',CountVectorizer()),('log',LogisticRegression())])

In [7]:
print(f'Accuracy : {np.mean(cross_val_score(log_pipe_cvec, X_train, y_train, cv = skf, n_jobs = -1))}')

Accuracy : 0.8271772967961475


**TFIDF Vectorizer**

In [8]:
log_pipe_tvec = Pipeline([('cvec',CountVectorizer()),('tvec',TfidfTransformer()),('log',LogisticRegression())])

In [9]:
print(f'Accuracy : {np.mean(cross_val_score(log_pipe_tvec, X_train, y_train, cv = skf, n_jobs = -1))}')

Accuracy : 0.8295429209045556
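
The pipeline above chains `CountVectorizer` and `TfidfTransformer`. With default settings this should produce the same matrix as a single `TfidfVectorizer`, which the following sketch checks on a few made-up sentences (`docs`, `two_step` and `one_step` are illustrative names, not part of the project).

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus
docs = ["the cat sat on the mat", "the dog ate my homework", "cats and dogs"]

# Two-step version, as used in the notebook
two_step = Pipeline([('cvec', CountVectorizer()), ('tvec', TfidfTransformer())])
# One-step equivalent
one_step = TfidfVectorizer()

tfidf_two_step = two_step.fit_transform(docs)
tfidf_one_step = one_step.fit_transform(docs)

# Compare the dense versions of both sparse matrices
same = np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray())
print(same)
```

Keeping the two steps separate has one practical benefit in this notebook: the `cvec__` parameter names stay the same between the count-based and TF-IDF pipelines.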


**Grid Search**

In [10]:
params = {'cvec__max_features': [5000, 8000, 10000, 16000, 240000],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [0.9, 0.95],
    'cvec__ngram_range': [(1,1),(1,2),(2,2)],
    'log__penalty':['l1','l2'],
    'log__solver':['liblinear'],
    'log__C': [1,10]}

log_gs_tvec = GridSearchCV(log_pipe_tvec, param_grid = params, cv=skf, n_jobs=-1, verbose=1, scoring=scorers, refit='accuracy_score')
log_gs_tvec.fit(X_train, y_train)
log_gs_tvec.best_params_

Fitting 5 folds for each of 240 candidates, totalling 1200 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:  8.5min
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed: 26.6min
[Parallel(n_jobs=-1)]: Done 792 tasks      | elapsed: 53.6min
[Parallel(n_jobs=-1)]: Done 1200 out of 1200 | elapsed: 88.0min finished


{'cvec__max_df': 0.9,
 'cvec__max_features': 240000,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 2),
 'log__C': 10,
 'log__penalty': 'l2',
 'log__solver': 'liblinear'}
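
The candidate count reported in the fit log can be checked without running the search. This sketch rebuilds the same grid and counts its combinations with `ParameterGrid`; the variable names `n_candidates` and `n_fits` are only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the grid search cell
params = {'cvec__max_features': [5000, 8000, 10000, 16000, 240000],
          'cvec__min_df': [2, 3],
          'cvec__max_df': [0.9, 0.95],
          'cvec__ngram_range': [(1, 1), (1, 2), (2, 2)],
          'log__penalty': ['l1', 'l2'],
          'log__solver': ['liblinear'],
          'log__C': [1, 10]}

n_candidates = len(ParameterGrid(params))  # 5 * 2 * 2 * 3 * 2 * 1 * 2
n_fits = n_candidates * 5                  # one fit per candidate per fold
print(n_candidates, n_fits)
```

Because `scoring` is a dictionary, the per-metric results should also be available in `log_gs_tvec.cv_results_` under keys of the form `mean_test_<scorer name>`, for example `mean_test_f1_score`.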

**Model Performance**

In [11]:
evaluation_metrics(log_gs_tvec)

Train's accuracy_score: 0.9882
Best accuracy score from training: 0.8337
Validation's accuracy score : 0.8349
Difference in accuracy scores between train and val: -0.0011
ROC_AUC score on Validation Set: 0.8978
Model sensitivity is : 0.9018026565464896
Model specificity is : 0.7152542372881356
Model f1 score is : 0.8750287686996547


Classification report :
               precision    recall  f1-score   support

           0       0.80      0.72      0.76      1180
           1       0.85      0.90      0.88      2108

    accuracy                           0.83      3288
   macro avg       0.83      0.81      0.82      3288
weighted avg       0.83      0.83      0.83      3288
 

                 Pred Negative  Pred Positive
Actual Negative            844            336
Actual Positive            207           1901
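
As a cross-check, the reported metrics can be recomputed by hand from the four confusion-matrix counts above. This is plain arithmetic with no model involved.

```python
# Counts taken from the confusion matrix on the validation set
tn, fp, fn, tp = 844, 336, 207, 1901

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # recall of the positive class
specificity = tn / (tn + fp)   # recall of the negative class
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(round(accuracy, 4), round(sensitivity, 4),
      round(specificity, 4), round(f1, 4))
```

The model recovers about 90% of the positive class but only about 72% of the negative class, so most of its errors are false positives.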


**Saving model**

In [15]:
logreg = Pipeline([('cvec',CountVectorizer(max_df = 0.9,
                                           max_features = 240000,
                                           min_df = 2,
                                           ngram_range = (1, 2))),
                   ('tvec',TfidfTransformer()),
                   ('log',LogisticRegression(C = 10,
                                             penalty = 'l2',
                                             solver = 'liblinear'))])
logreg.fit(X,y)

Pipeline(steps=[('cvec',
                 CountVectorizer(max_df=0.9, max_features=240000, min_df=2,
                                 ngram_range=(1, 2))),
                ('tvec', TfidfTransformer()),
                ('log', LogisticRegression(C=10, solver='liblinear'))])

In [16]:
# Use a context manager so the file handle is closed after writing
with open('../saved_models/logreg.sav', 'wb') as f:
    pickle.dump(logreg, f)
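
A saved pipeline can be restored with `pickle.load`. The sketch below shows the round trip on a tiny made-up dataset and a temporary file, so it does not touch `../saved_models/`; `texts`, `labels`, `toy_model` and `toy_logreg.sav` are hypothetical names used only here.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data, standing in for the real text and target columns
texts = ["good movie", "great film", "bad movie", "awful film"]
labels = [1, 1, 0, 0]

toy_model = Pipeline([('cvec', CountVectorizer()),
                      ('tvec', TfidfTransformer()),
                      ('log', LogisticRegression(solver='liblinear'))])
toy_model.fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_logreg.sav')
    # Save the fitted pipeline
    with open(path, 'wb') as f:
        pickle.dump(toy_model, f)
    # Load it back as a new object
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored pipeline should predict exactly like the original
same_predictions = list(restored.predict(texts)) == list(toy_model.predict(texts))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.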