## Logistic Regression

We will predict the CD account variable in the universal dataset that we used last week.

## 1. Setup

In [30]:
# Common imports
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

np.random.seed(1)

## 2. Load the data

We will use the universal data that we cleaned in the last class (the original, not the version that you altered for last week's exercise).

In [31]:
X_train = pd.read_csv("./data/universal_train_X.csv")
X_test = pd.read_csv("./data/universal_test_X.csv")
y_train = pd.read_csv("./data/universal_train_y.csv")
y_test = pd.read_csv("./data/universal_test_y.csv")

## 3. Model the data

First, we will create a dataframe to hold all the results of our models.

In [32]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})

### 3.1 Fit and test a Logistic Regression model

In [33]:
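# No regularization. In scikit-learn 1.2 and later this is written penalty=None; the string 'none' is deprecated.
log_reg_model = LogisticRegression(penalty='none', max_iter=900)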
_ = log_reg_model.fit(X_train, np.ravel(y_train))

In [34]:
model_preds = log_reg_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"default logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance


| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | default logistic | 0.940909 | NaN | 0.0 | 0.0 |
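The Precision value is missing in the table above because the expression `TP/(TP+FP)` divides zero by zero when the model makes no positive predictions. A minimal sketch of a helper that handles this case explicitly is shown here. The function name `summarize_confusion` is hypothetical, and the counts in the usage example are illustrative values chosen to be consistent with the reported accuracy, not numbers taken from the notebook.

```python
import numpy as np
import pandas as pd


def summarize_confusion(name, c_matrix):
    """Return a one-row DataFrame of metrics for a 2x2 confusion matrix.

    Hypothetical helper (not part of the original notebook). Rows are true
    classes and columns are predicted classes, as in sklearn's confusion_matrix.
    """
    (TN, FP), (FN, TP) = np.asarray(c_matrix).tolist()  # plain Python ints
    # Return NaN instead of raising or warning when a denominator is zero.
    precision = TP / (TP + FP) if (TP + FP) > 0 else float("nan")
    recall = TP / (TP + FN) if (TP + FN) > 0 else float("nan")
    f1_denominator = 2 * TP + FP + FN
    f1 = 2 * TP / f1_denominator if f1_denominator > 0 else float("nan")
    return pd.DataFrame({
        "model": [name],
        "Accuracy": [(TP + TN) / (TP + TN + FP + FN)],
        "Precision": [precision],
        "Recall": [recall],
        "F1": [f1],
    })


# Illustrative counts only: a model that never predicts the positive class.
all_negative = [[3105, 0], [195, 0]]
row = summarize_confusion("all-negative example", all_negative)
print(row)
```

Rows built this way could be appended with `pd.concat([performance, row], ignore_index=True)`, which would give each model its own index value instead of a repeated 0.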


### 3.2 Change to liblinear solver

In [35]:
log_reg_liblin_model = LogisticRegression(solver='liblinear').fit(X_train, np.ravel(y_train))

In [36]:
model_preds = log_reg_liblin_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"liblinear logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance


| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | default logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | liblinear logistic | 0.940909 | NaN | 0.0 | 0.0 |


### 3.3 L2 Regularization

In [37]:
log_reg_L2_model = LogisticRegression(penalty='l2', max_iter=1000)
_ = log_reg_L2_model.fit(X_train, np.ravel(y_train))

In [38]:
model_preds = log_reg_L2_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"L2 logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance


| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | default logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | liblinear logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | L2 logistic | 0.940909 | NaN | 0.0 | 0.0 |
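Three models now share the same accuracy with a Recall of 0, which is the pattern produced by a classifier that predicts the negative class for every row. One quick check is to count the predicted classes and compare the accuracy with the share of negatives in the test labels. The sketch below uses synthetic labels (an assumption, sized to be consistent with the reported accuracy); in the notebook, `model_preds` and `np.ravel(y_test)` would take the place of `y_pred` and `y_true`.

```python
import numpy as np

# Synthetic stand-ins (assumptions): 3300 test labels with 195 positives,
# and a model that predicts the negative class for every row.
y_true = np.array([1] * 195 + [0] * 3105)
y_pred = np.zeros_like(y_true)

# How many times was each class predicted?
classes, counts = np.unique(y_pred, return_counts=True)
predicted_counts = {int(c): int(n) for c, n in zip(classes, counts)}
print(predicted_counts)

# Accuracy of the all-negative prediction equals the share of negatives.
accuracy = float(np.mean(y_pred == y_true))
negative_share = float(np.mean(y_true == 0))
print(round(accuracy, 6), round(negative_share, 6))
```

If only one class shows up in the counts, a high accuracy says little about the model and mostly reflects the class imbalance.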


### 3.4 L1 Regularization

In [39]:
log_reg_L1_model = LogisticRegression(solver='liblinear', penalty='l1')
_ = log_reg_L1_model.fit(X_train, np.ravel(y_train))

In [40]:
model_preds = log_reg_L1_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"L1 logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | default logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | liblinear logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | L2 logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | L1 logistic | 0.978485 | 1.0 | 0.635897 | 0.777429 |
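As a sanity check on the L1 row, the reported F1 can be recomputed from the reported Precision and Recall, since F1 is their harmonic mean. The count-based form used in the cells above, `2*TP/(2*TP+FP+FN)`, should agree. The counts below are an assumption: they are one set of values consistent with the reported metrics, and the notebook does not show the actual confusion matrix.

```python
# Values reported for the L1 model in the table above.
precision = 1.0
recall = 0.635897

# F1 as the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))

# Count-based form, using illustrative counts (assumption, see lead-in).
TP, FP, FN = 124, 0, 71
f1_from_counts = 2 * TP / (2 * TP + FP + FN)
print(round(f1_from_counts, 6))
```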


### 3.5 Elastic Net Regularization

In [41]:
log_reg_elastic_model = LogisticRegression(solver='saga', penalty='elasticnet', l1_ratio=0.5, max_iter=1000)
_ = log_reg_elastic_model.fit(X_train, np.ravel(y_train))



In [42]:
model_preds = log_reg_elastic_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Elestic logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])


## 4. Summary

Sorted by Accuracy in ascending order, so the best model is in the last row:

In [43]:
performance.sort_values(by=['Accuracy'])

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | default logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | liblinear logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | L2 logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | Elestic logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | L1 logistic | 0.978485 | 1.0 | 0.635897 | 0.777429 |


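Sorted by Precision in ascending order. Models with an undefined (NaN) Precision are placed last, so here the best model is in the first row: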

In [44]:
performance.sort_values(by=['Precision'])

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | L1 logistic | 0.978485 | 1.0 | 0.635897 | 0.777429 |
| 0 | default logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | liblinear logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | L2 logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | Elestic logistic | 0.940909 | NaN | 0.0 | 0.0 |
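The row order differs between these summaries because `sort_values` sorts in ascending order by default and places NaN values last. A minimal sketch of a best-first ranking follows, using `ascending=False` together with `na_position="last"`. The small DataFrame is rebuilt by hand from the values in the table so that the example is self-contained; the variable name `ranked` is an assumption.

```python
import numpy as np
import pandas as pd

# Values copied from the performance table; NaN marks undefined Precision.
performance = pd.DataFrame({
    "model": ["default logistic", "liblinear logistic", "L2 logistic",
              "Elestic logistic", "L1 logistic"],
    "Precision": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "F1": [0.0, 0.0, 0.0, 0.0, 0.777429],
})

# Best model first; rows with undefined Precision stay at the bottom.
ranked = performance.sort_values(
    by="Precision", ascending=False, na_position="last"
).reset_index(drop=True)
print(ranked)
```

The same arguments work for the other metrics, which would make every summary read from best to worst.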


Sorted by Recall in ascending order, so the best model is in the last row:

In [45]:
performance.sort_values(by=['Recall'])

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | default logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | liblinear logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | L2 logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | Elestic logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | L1 logistic | 0.978485 | 1.0 | 0.635897 | 0.777429 |


Sorted by F1 in ascending order, so the best model is in the last row:

In [46]:
performance.sort_values(by=['F1'])

| | model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 0 | default logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | liblinear logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | L2 logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | Elestic logistic | 0.940909 | NaN | 0.0 | 0.0 |
| 0 | L1 logistic | 0.978485 | 1.0 | 0.635897 | 0.777429 |


### So which model is the 'best', and which one should you choose?

This depends heavily on the profit or loss associated with each FP, FN, TP, and TN. We will discuss this in the next class.
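To illustrate the idea, the sketch below assigns a payoff to each of the four outcomes and compares the total payoff of two confusion matrices. Everything here is an assumption made for illustration: the payoff values, the function name `total_profit`, and the confusion matrix counts are hypothetical and do not come from the notebook.

```python
import numpy as np

# Hypothetical payoffs per outcome (assumptions for illustration only).
payoff = {"TP": 100.0, "FP": -10.0, "FN": -50.0, "TN": 0.0}


def total_profit(c_matrix, payoff):
    """Total payoff for a 2x2 confusion matrix laid out as [[TN, FP], [FN, TP]]."""
    (TN, FP), (FN, TP) = np.asarray(c_matrix).tolist()
    return (TP * payoff["TP"] + FP * payoff["FP"]
            + FN * payoff["FN"] + TN * payoff["TN"])


# Illustrative confusion matrices (not taken from the notebook's output).
candidates = {
    "predicts all negative": [[3105, 0], [195, 0]],
    "finds some positives": [[3105, 0], [71, 124]],
}
profits = {name: total_profit(cm, payoff) for name, cm in candidates.items()}
best = max(profits, key=profits.get)
print(profits, best)
```

With different payoffs, for example a large penalty for false positives, the ranking of models could change, which is why no single metric identifies the best model on its own.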