# **Lab 4B: Logistic Regression**

**WHAT** This nonmandatory lab consists of several programming and insight exercises/questions.

**WHY** The exercises are meant to familiarize yourself with the basic concepts of logistic regression.

**HOW** Follow the exercises in this notebook either on your own or with a fellow student. Work your way through these exercises at your own pace and be sure to ask the TAs when you don't understand something.

$\newcommand{\q}[1]{\rightarrow \textbf{Question #1}.}$
$\newcommand{\ex}[1]{\rightarrow \textbf{Exercise #1}.}$

In [None]:
# import packages
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
from sklearn.metrics import precision_recall_curve, auc, average_precision_score

## 1. Loading the Data

In [None]:
train = pd.read_excel('lendingclub_traindata.xlsx')
validation = pd.read_excel('lendingclub_valdata.xlsx')
test = pd.read_excel('lendingclub_testdata.xlsx')

# 1 = good, 0 = default

#give column names
cols = ['home_ownership', 'income', 'dti', 'fico', 'loan_status']

train.columns = validation.columns = test.columns = cols

print(train.head())
print("--------------------------------")
print(validation.head())
print("--------------------------------")
print(test.head())

The data has already been split into a training set, a validation set, and a test set. The training set has 7000 instances, the validation set 3000, and the test set 2290. The four features are labeled home ownership, income, dti and fico.

In [None]:
# remove target column to create feature only dataset
X_train = train.drop('loan_status', axis=1)
X_val = validation.drop('loan_status', axis=1)
X_test = test.drop('loan_status', axis=1)

# Scale data using the mean and standard deviation of the training set. 
# This is not necessary for the simple logistic regression we will do here 
# but should be done if L1 or L2 regularization is carried out
Xmean = X_train.mean()
Xstd = X_train.std()
X_train = (X_train - Xmean) / Xstd
X_val = (X_val - Xmean) / Xstd
X_test = (X_test - Xmean) / Xstd

# store target column as y-variables 
y_train = train['loan_status']
y_val = validation['loan_status']
y_test = test['loan_status']

#print first five instances for each data set

print(X_train.head())
print("--------------------------------")
print(X_val.head())
print("--------------------------------")
print(X_test.head())

print(X_train.shape, y_train.shape, X_val.shape,y_val.shape, X_test.shape, y_test.shape)
X_train.columns

In [None]:
freq = y_train.value_counts()           # count frequency of different classes in training set
freq/sum(freq)*100                      # get percentage of above

## 2. Using Logistic Regression
We use Sklearn's `LogisticRegression`:

In [None]:
# Create an instance of LogisticRegression named LogReg (no regularization).
# Note: scikit-learn 1.2 and later expect penalty=None; the string "none" was removed in 1.4.

LogReg =  LogisticRegression(penalty="none", solver="newton-cg")     

# Fit logistic regression to training set
LogReg.fit(X_train, y_train)                                     # fit the model to the training data
print("Parameter estimates from training data:")
print(LogReg.intercept_, *LogReg.coef_)                           # intercept, then the coefficient of each feature

When used on scaled data the model has a bias of 1.416 and coefficients of 0.145, 0.034, -0.324 and 0.363. We now test the model on the validation set.
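To make the role of these numbers concrete, the sketch below plugs the reported bias and coefficients into the logistic (sigmoid) function by hand. The helper names `sigmoid` and `prob_good` and the two example applicants are illustrative assumptions, not part of the lab code, and the coefficient order is assumed to follow the column order of `X_train`.

```python
import math

# Values quoted in the text above (model fitted to the scaled training data).
bias = 1.416
coefs = [0.145, 0.034, -0.324, 0.363]  # assumed order: home_ownership, income, dti, fico


def sigmoid(z):
    """Logistic function: maps any real number to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def prob_good(x_scaled):
    """P(loan is good) for one applicant, given the four scaled feature values."""
    z = bias + sum(c * x for c, x in zip(coefs, x_scaled))
    return sigmoid(z)


# Hypothetical applicants: all scaled features at zero (the training means),
# and the same applicant with a fico score one standard deviation above the mean.
p_mean = prob_good([0.0, 0.0, 0.0, 0.0])
p_high_fico = prob_good([0.0, 0.0, 0.0, 1.0])

print(f"P(good), all features at the training mean: {p_mean:.3f}")
print(f"P(good), fico one std above the mean:       {p_high_fico:.3f}")
```

With all scaled features at zero only the bias matters, so the predicted probability of a good loan is `sigmoid(1.416)`, roughly 0.80. A positive coefficient (such as the one for fico) pushes this probability up, a negative one (dti) pushes it down.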

The cost function used is the negative of the average log-likelihood. Below, it is computed for the validation and test set as well.
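As a small illustration of this cost function, the sketch below evaluates it on a handful of made-up labels and probabilities, using only the standard library. The function name `neg_avg_loglik_1d` and the toy numbers are assumptions for illustration. It also evaluates a constant predictor that always outputs the share of good loans, which gives a rough reference point for judging a cost value.

```python
import math


def neg_avg_loglik_1d(y, p_good):
    """Average of -log(probability the model assigned to the observed class).

    y:      list of 0/1 labels (1 = good, 0 = default)
    p_good: list of predicted probabilities P(Y=1)
    """
    terms = [-math.log(p if label == 1 else 1.0 - p) for label, p in zip(y, p_good)]
    return sum(terms) / len(terms)


# Hypothetical toy data: five loans, four of them good.
y_toy = [1, 1, 1, 0, 1]
p_model = [0.90, 0.80, 0.70, 0.40, 0.85]  # made-up model output

# Constant predictor: always predict the observed share of good loans.
share_good = sum(y_toy) / len(y_toy)
p_const = [share_good] * len(y_toy)

cost_model = neg_avg_loglik_1d(y_toy, p_model)
cost_const = neg_avg_loglik_1d(y_toy, p_const)

print(f"cost of toy model          = {cost_model:.4f}")
print(f"cost of constant predictor = {cost_const:.4f}")
```

A model that assigns high probability to the classes that actually occur gets a cost close to zero; a model that does no better than the class frequencies ends up near the cost of the constant predictor.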

In [None]:
def neg_avg_loglik(y, probs):
    # Preconditions:
    # y: ndarray of 0/1
    # probs: two-column ndarray, rows contain [P(Y=0),P(Y=1)]
    return np.average(-np.log(y * probs[:,1] + (1-y) * probs[:,0]))
    

# prob_train, prob_val, and prob_test are the predicted probabilities for the training,
# validation, and test set, using the fitted logistic regression model

prob_train = LogReg.predict_proba(X_train)
prob_val = LogReg.predict_proba(X_val)
prob_test = LogReg.predict_proba(X_test)

# Calculate the negative of the average loglikelihood for training, validation, and test set

cost_func_train_minimum = neg_avg_loglik(y_train, prob_train)
cost_func_val = neg_avg_loglik(y_val, prob_val)
cost_func_test = neg_avg_loglik(y_test, prob_test)

print('Cost function value overview:')
print(f'minimum reached for training set = {cost_func_train_minimum:8.6}')
print("From this trained model:")
print(f'        value for validation set = {cost_func_val:8.6}')
print(f'              value for test set = {cost_func_test:8.6}')

The differences don't tell us very much. It would be more meaningful to compare the last two numbers to the cost function values we would get if the validation set and the test set, respectively, were used to fit the model; those are going to be lower than the numbers above (can you see why?).

<div style="background-color:#c2eafa">
    
$\ex{1}$ a) Fit the logistic regression model to the validation data and then compute `neg_avg_loglik(y_val, prob_val_new)`, where `prob_val_new` are the predicted probabilities from the (new) fitted model. Now compute how much this differs from the result above, expressed as 2 times the difference in log-likelihood. b) Then repeat this for the test data set.

The outcomes can be used to perform a formal test about the quality of the (training data) fit. The details are beyond our scope.

In [None]:
# START ANSWER

# computation for validation data
LogReg.fit(X_val,y_val)
print("Parameter estimates from validation data:")
print(LogReg.intercept_, *LogReg.coef_)                           # intercept, then the coefficient of each feature

# probabilities from model fitted to validation data
prob_val_new = LogReg.predict_proba(X_val)
cost_func_val_minimum = neg_avg_loglik(y_val, prob_val_new)

# calculate the deviance: twice the difference in log-likelihood
deviance_val = 2*(cost_func_val - cost_func_val_minimum)*X_val.shape[0]

# Print the cost function values and the deviance for the validation set
print('\nCost function value overview, for validation data set:')
print(f'\tminimum reached for validation set = {cost_func_val_minimum:8.6}')
print(f'\tvalue based on training model      = {cost_func_val:8.6}')
print(f'twice the difference in loglikelihood = {deviance_val:6.2f}')


# computation for test data
LogReg.fit(X_test,y_test)
print("\n\nParameter estimates from test data:")
print(LogReg.intercept_, *LogReg.coef_)                           # intercept, then the coefficient of each feature

# probabilities from model fitted to test data
prob_test_new = LogReg.predict_proba(X_test)
cost_func_test_minimum = neg_avg_loglik(y_test, prob_test_new)

# calculate the deviance: twice the difference in log-likelihood
deviance_test = 2*(cost_func_test - cost_func_test_minimum)*X_test.shape[0]

# Print the cost function values and the deviance for the test set
print('\nCost function value overview, for test data set:')
print(f'\tminimum reached for test set       = {cost_func_test_minimum:8.6}')
print(f'\tvalue based on training model      = {cost_func_test:8.6}')
print(f'twice the difference in loglikelihood = {deviance_test:6.2f}')
# END ANSWER

<div style="background-color:#ffa500">

Some comments (visible in answers-version only):

[//]: # (START ANSWER)
**This is optional: YOU DO NOT NEED TO KNOW/UNDERSTAND THIS.**

For each data set, the cost function value is lowest when computed with the parameters from the fit to that data set itself. If we compare this to the cost function value computed from another model, the difference tells us how well that model fits. Looking above, the values based on the training model are naturally higher (i.e., worse) than those of the best-fitting model. The deviance values are to be compared to a $\chi^2(5)$-distribution, resulting in $p$-values of about 25 and 5 percent, indicating that the model fits and generalizes well enough.
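As a rough sketch of how such $p$-values could be obtained without extra packages: for 5 degrees of freedom the upper tail of the $\chi^2$-distribution has a closed form in terms of the complementary error function. The function name `chi2_sf_df5` is an illustrative assumption, and the deviance values below are placeholders chosen near the 25 and 5 percent points, not output of this lab. In practice one would more likely call `scipy.stats.chi2.sf(deviance, 5)`.

```python
import math


def chi2_sf_df5(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 5 degrees of freedom."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    return (math.erfc(math.sqrt(half))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-half) * (1.0 + x / 3.0))


# Placeholder deviances (hypothetical, NOT the values computed in Exercise 1).
example_deviances = [6.63, 11.07]

for d in example_deviances:
    print(f"deviance = {d:5.2f}  ->  p-value = {chi2_sf_df5(d):.3f}")
```

A small $p$-value would indicate that the training-set parameters fit the other data set noticeably worse than its own best fit does.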

[//]: # (END ANSWER)

## 3. Model Evaluation

An analyst must decide on a criterion for predicting whether a loan will be good or will default. This involves specifying a threshold. By default this threshold is set to 0.5, i.e., loans are separated into good and bad categories according to whether the probability of no default is greater or less than 0.5. However, this does not work well for an imbalanced data set such as this one: it would predict that all loans are good! We will look at the results for a few other thresholds.
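The effect of the threshold can be seen on a few made-up probabilities. This is only a sketch: the helper `predict_labels` and the probabilities are hypothetical, chosen to mimic a data set where most loans have a high predicted probability of being good.

```python
def predict_labels(p_good, threshold):
    """Label a loan as good (1) if P(good) exceeds the threshold, else as default (0)."""
    return [1 if p > threshold else 0 for p in p_good]


# Hypothetical predicted probabilities of "good" for five loans.
p_good_toy = [0.62, 0.78, 0.91, 0.55, 0.83]

labels_at_050 = predict_labels(p_good_toy, 0.50)
labels_at_080 = predict_labels(p_good_toy, 0.80)

print("threshold 0.50:", labels_at_050)
print("threshold 0.80:", labels_at_080)
```

With a threshold of 0.5 every loan in this toy example is labeled good, while a threshold of 0.8 flags three of the five as likely defaults. On arrays, the same rule can be written with `np.where`.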



<div style="background-color:#c2eafa">
$\ex{2}$ Complete the code below to get the predicted labels for each probability threshold


In [None]:
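# Refit on the training set: Exercise 1 left LogReg fitted to the test data
LogReg.fit(X_train, y_train)

THRESHOLD = [0.7, .75, .80, .85]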
# Create dataframe to store results
results = pd.DataFrame(columns=["THRESHOLD", "accuracy", "true pos rate", "true neg rate", "false pos rate", "precision", "f-score"]) # df to store results

# Create threshold row
results['THRESHOLD'] = THRESHOLD                                                                         
             
j = 0                                                                                                      

# Iterate over the thresholds

for i in THRESHOLD:                                                                                       
    
    # If prob for test set > threshold predict 1
    # START ANSWER
    preds = np.where(LogReg.predict_proba(X_test)[:,1] > i, 1, 0)     
    # END ANSWER                                 
    
    
    # create confusion matrix 
    cm = (confusion_matrix(y_test, preds,labels=[1, 0], sample_weight=None) / len(y_test))*100    # confusion matrix (in percentage)
    
    print('Confusion matrix for threshold =',i)
    print(cm)
    print(' ')      
    
    TP = cm[0][0]                                                                                          # True Positives
    FN = cm[0][1]                                                                                          # False Negatives
    FP = cm[1][0]                                                                                          # False Positives
    TN = cm[1][1]                                                                                          # True Negatives
        
    results.iloc[j,1] = accuracy_score(y_test, preds) 
    results.iloc[j,2] = recall_score(y_test, preds)
    results.iloc[j,3] = TN/(FP+TN)                                                                         # True negative rate
    results.iloc[j,4] = FP/(FP+TN)                                                                         # False positive rate
    results.iloc[j,5] = precision_score(y_test, preds)
    results.iloc[j,6] = f1_score(y_test, preds)
   
   
    j += 1

print('ALL METRICS')
results.T

This table illustrates the trade-off between the true positive rate and the false positive rate: we can improve the percentage of good loans we identify only by increasing the percentage of bad loans that are misclassified. The receiver operating characteristic (ROC) curve captures this trade-off by considering different thresholds.
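The trade-off can be reproduced by hand on a toy example. The sketch below, with hypothetical labels and probabilities and an illustrative helper `tpr_fpr`, computes one (false positive rate, true positive rate) pair per threshold; each pair is one point on the ROC curve.

```python
def tpr_fpr(y_true, p_good, threshold):
    """Return (true positive rate, false positive rate) at the given threshold."""
    preds = [1 if p > threshold else 0 for p in p_good]
    tp = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 0)
    fp = sum(1 for y, yhat in zip(y_true, preds) if y == 0 and yhat == 1)
    tn = sum(1 for y, yhat in zip(y_true, preds) if y == 0 and yhat == 0)
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical toy data: four good loans (1) and two defaults (0).
y_toy = [1, 1, 1, 1, 0, 0]
p_toy = [0.95, 0.85, 0.75, 0.60, 0.80, 0.55]

roc_points = {t: tpr_fpr(y_toy, p_toy, t) for t in (0.5, 0.7, 0.9)}

for t, (tpr, fpr) in roc_points.items():
    print(f"threshold {t:.1f}: TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

Raising the threshold lowers both rates at once: fewer bad loans slip through, but fewer good loans are recognized as well.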

<div style="background-color:#c2eafa">
$\ex{3}$ Plot the ROC curve for the model and also plot the baseline that can be obtained by random predictions.

In [None]:
# Calculate the ROC curve and the AUC measure
LogReg.fit(X_train, y_train)  # make sure we have the correct trained model

lr_prob=LogReg.predict_proba(X_test)
lr_prob=lr_prob[:, 1]
baseline_prob=[0 for _ in range(len(y_test))]
baseline_auc=roc_auc_score(y_test, baseline_prob)
lr_auc=roc_auc_score(y_test,lr_prob)
print("AUC random predictions =", baseline_auc)
print("AUC predictions from logistic regression model =", lr_auc)
baseline_fpr,baseline_tpr,_=roc_curve(y_test,baseline_prob)
lr_fpr,lr_tpr,_=roc_curve(y_test,lr_prob)

# START ANSWER
plt.plot(baseline_fpr,baseline_tpr,linestyle='--',label='Random Prediction')
plt.plot(lr_fpr,lr_tpr,marker='.',label='Logistic Regression')

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()
# END ANSWER

## 4. Can we tell how well the model generalizes?
There is a difficulty: how could we detect overfitting? The "difference of log-likelihoods test" mentioned above was an idea outside our scope. We could try this:

<div style="background-color:#c2eafa">
$\ex{4}$ Produce and plot another ROC curve that would give you an idea about how well the fitted model generalizes. Add it to the plot above or make a new one combining the two below. Use a different color for the 2nd ROC.

In [None]:
# START ANSWER
LogReg.fit(X_train, y_train)
lr_prob=LogReg.predict_proba(X_test)
lr_prob=lr_prob[:, 1]
baseline_prob=[0 for _ in range(len(y_test))]
baseline_auc=roc_auc_score(y_test, baseline_prob)
lr_auc=roc_auc_score(y_test,lr_prob)
print("AUC random predictions =", baseline_auc)
print("AUC predictions from logistic regression model fitted with train set      =", lr_auc)
baseline_fpr,baseline_tpr,_=roc_curve(y_test,baseline_prob)
lr_fpr,lr_tpr,_=roc_curve(y_test,lr_prob)


plt.plot(baseline_fpr,baseline_tpr,linestyle='--',label='Random Prediction')
plt.plot(lr_fpr,lr_tpr,marker='.',label='LR trained on train set', color="r", alpha=0.5)

LogReg.fit(X_val, y_val)
lr_prob=LogReg.predict_proba(X_test)
lr_prob=lr_prob[:, 1]
baseline_prob=[0 for _ in range(len(y_test))]
lr_auc=roc_auc_score(y_test,lr_prob)
print("AUC predictions from logistic regression model fitted with validation set =", lr_auc)
baseline_fpr,baseline_tpr,_=roc_curve(y_test,baseline_prob)
lr_fpr,lr_tpr,_=roc_curve(y_test,lr_prob)

plt.plot(lr_fpr,lr_tpr,marker='.',label='LR trained on val set', linestyle='--', color ="g", alpha=0.3)

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

# END ANSWER

<div style="background-color:#c2eafa">
$\q{1}$ Explain your choice of ROC curve. State any conclusion you think you can draw from the comparison.

<div style="background-color:#ffa500">
    
Write your answer in this colored box:

[//]: # (START ANSWER)
One possibility is to fit a logistic regression model to the validation data and then use the test data for the ROC. If the newly fitted model differs much from the original one, this might show up in the ROC curve. The only thing we might have to worry about is that the training data set is more than twice as big as the validation data set, so possibly the model fit to the validation data is not as good.

We only see small differences in the ROCs, and the AUC differs by $0.00015$, so a minuscule amount.
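To interpret the AUC values being compared: the AUC equals the probability that a randomly chosen good loan receives a higher score than a randomly chosen default, with ties counted as one half. The sketch below computes it directly from this definition on hypothetical toy data; the helper name `auc_by_pairs` is an illustrative assumption, and the quadratic loop is meant for explanation rather than for large data sets.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for s_pos in pos:
        for s_neg in neg:
            if s_pos > s_neg:
                wins += 1.0
            elif s_pos == s_neg:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: four good loans (1) and two defaults (0).
y_toy = [1, 1, 1, 1, 0, 0]
p_toy = [0.95, 0.85, 0.75, 0.60, 0.80, 0.55]

auc_model = auc_by_pairs(y_toy, p_toy)
auc_constant = auc_by_pairs(y_toy, [0.0] * len(y_toy))  # same score for every loan

print("AUC of toy scores:     ", auc_model)
print("AUC of constant scores:", auc_constant)
```

The constant scores tie on every pair and therefore give an AUC of 0.5, which is why an all-zero prediction vector can serve as the baseline in the plots.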

[//]: # (END ANSWER)