# Logistic Regression

This section covers:

- How maximum likelihood estimation (MLE) is used to estimate the coefficients
- Log-odds
- Odds ratios
- The assumptions behind MLE


In [2]:
import pandas as pd
import numpy as np
from main import preprocess_df
from collections import defaultdict as dd


my_df = pd.read_csv(
    "./datasets/lc_data_2007_to_2018.csv",
    low_memory=False,
    encoding="latin1",
    nrows=100000,  # only looking at 100k rows right now for performance
)
pd.set_option("display.max_columns", None)
cleaned_df = preprocess_df(my_df)

Logistic regression computes a score for how likely an event is, using a linear equation. In our case, we want to predict the probability of default from the factors in the dataset. Because the score comes from a linear equation, it can range from $-\infty$ to $\infty$, so on its own it could take a value that makes no sense as a probability, such as $1.2$. To fix this, we pass the score through the sigmoid function:

##### $S(z) = \frac{1}{1+e^{-z}}$

This function accepts any value from $-\infty$ to $\infty$ and outputs only values between 0 and 1, which is exactly the range a probability needs.
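As a quick illustration of that squashing behaviour, here is a minimal sketch of the sigmoid in plain Python. The scores in the loop are made up for illustration and are not taken from the loan data.

```python
import math


def sigmoid(z: float) -> float:
    """Map any real-valued score z to a value between 0 and 1."""
    # Numerically stable form: never call exp() on a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


# Illustrative scores only (not taken from the loan data).
for score in (-5.0, -1.2, 0.0, 1.2, 5.0):
    print(f"z = {score:+.1f} -> S(z) = {sigmoid(score):.4f}")
```

A score of $0$ maps to exactly $0.5$, and a score such as $1.2$, which is not a valid probability on its own, maps to a value inside the valid range.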


### An Explanation of Why Coefficients are Estimated Using Maximum Likelihood

##### Background

In logistic regression, a function $S(z)$ predicts the probability of a binary outcome from inputs ($x_1, x_2, \dots, x_n$). $z$ is a linear combination of terms $\beta_i x_i$, where each term is the product of a factor's value ($x_i$) and the weight ($\beta_i$) associated with that factor. 'Coefficients' here refers to the $\beta\text{s}$. The model takes in all the training data and tries to minimise the error between what it predicts and the true outcome.

##### Explanation

Betas are estimated with MLE rather than Least Squares because the 'true' output is always one of two values: yes or no, 1 or 0. Errors are defined as $\epsilon = y-\hat y$, and Least Squares inherently assumes that these errors are Gaussian around the prediction. But for any one prediction $\hat y$, the error has only two possible values: $-\hat y$ (if $y=0$) or $1-\hat y$ (if $y=1$). A variable with two possible values follows a (shifted) Bernoulli distribution, not a Gaussian one. Gaussian tails are also infinite, which would imply a nonzero probability that the true value lies outside $[0,1]$. The assumptions behind Least Squares therefore do not hold, so we do not use it. Since the outcomes are Bernoulli, the $\beta\text{s}$ are found with MLE: we choose the $\beta$ values that maximise the likelihood of the outcomes actually observed in the training data.
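To make "maximise the likelihood" concrete, the sketch below scores two candidate coefficient vectors by their Bernoulli log-likelihood. Everything in it is hypothetical: the toy rows, the outcomes, and the two candidate beta vectors are invented for illustration, and the helper `log_likelihood` is not part of the notebook or of scikit-learn.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(betas, rows, outcomes):
    """Bernoulli log-likelihood of the observed 0/1 outcomes under given betas.

    betas[0] is the intercept; betas[1:] pair up with the factors in each row.
    """
    total = 0.0
    for x, y in zip(rows, outcomes):
        z = betas[0] + sum(b * xi for b, xi in zip(betas[1:], x))
        p = sigmoid(z)  # predicted probability that y == 1
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


# Hypothetical toy data: one factor per row, outcome 1 = default.
rows = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
outcomes = [0, 0, 1, 0, 1, 1]

# Two hypothetical candidate coefficient vectors [intercept, beta_1].
flat = [0.0, 0.0]    # ignores the factor: predicts 0.5 for every row
sloped = [0.0, 1.0]  # higher factor value -> higher predicted probability

print(f"log-likelihood (flat):   {log_likelihood(flat, rows, outcomes):.3f}")
print(f"log-likelihood (sloped): {log_likelihood(sloped, rows, outcomes):.3f}")
```

The sloped candidate gets the higher log-likelihood because it assigns more probability to what was actually observed; MLE searches for the betas that push this number as high as possible. Note that scikit-learn's `LogisticRegression` also applies L2 regularisation by default, so its fitted betas maximise a penalised version of this quantity.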


Before fitting, two preprocessing steps are needed: the data contains NaN values, so we impute them, and the features are on different scales, so we standardise them.


In [3]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.impute import SimpleImputer

# Features are every column except the target; the target is a 1-D array.
X = cleaned_df.drop(columns="did_default")
y = cleaned_df["did_default"].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

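# Fit the imputer on the training split only, then apply it to both splits,
# so no information from the test set leaks into preprocessing.
imputer = SimpleImputer(strategy="mean")
imputed_X_train = imputer.fit_transform(X_train)
imputed_X_test = imputer.transform(X_test)

# Same pattern for scaling: learn mean and variance from the training split.
scaler = StandardScaler()
scaled_X_train = scaler.fit_transform(imputed_X_train)
scaled_X_test = scaler.transform(imputed_X_test)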

Now we fit the logistic regression twice, once for each `class_weight` setting (`None` and `"balanced"`):


In [None]:
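# None: every loan counts equally; "balanced": reweight by class frequency.
weights = [None, "balanced"]
for w in weights:
    log_reg = LogisticRegression(class_weight=w, random_state=42)
    log_reg.fit(scaled_X_train, y_train)
    y_pred = log_reg.predict(scaled_X_test)  # hard 0/1 predictions
    y_pred_proba = log_reg.predict_proba(scaled_X_test)  # class probabilities (not used below)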

    accuracy = accuracy_score(y_true=y_test, y_pred=y_pred)
    print(f"Accuracy: {accuracy:.2f}")

    cm = confusion_matrix(y_test, y_pred)
    print("\nConfusion Matrix:")
    print(cm)

    cr = classification_report(y_test, y_pred, target_names=["Paid", "Defaulted"])
    print("Classification Report:")
    print(cr)
    print("-" * 60)

Accuracy: 0.80

Confusion Matrix:
[[13615   413]
 [ 3045   506]]
Classification Report:
              precision    recall  f1-score   support

        Paid       0.82      0.97      0.89     14028
   Defaulted       0.55      0.14      0.23      3551

    accuracy                           0.80     17579
   macro avg       0.68      0.56      0.56     17579
weighted avg       0.76      0.80      0.75     17579

------------------------------------------------------------
Accuracy: 0.68

Confusion Matrix:
[[9765 4263]
 [1302 2249]]
Classification Report:
              precision    recall  f1-score   support

        Paid       0.88      0.70      0.78     14028
   Defaulted       0.35      0.63      0.45      3551

    accuracy                           0.68     17579
   macro avg       0.61      0.66      0.61     17579
weighted avg       0.77      0.68      0.71     17579

------------------------------------------------------------


### Background on Class Weighting

In the test set with seed 42, there are 14028 non-defaulting customers and 3551 defaulting customers. Because defaulters make up a smaller share of the population, errors on them contribute less to the total loss, so an unweighted model effectively treats them as less important and concentrates on getting the paid customers right. In reality, though, granting a loan to someone who won't pay it back costs far more than missing out on interest by refusing a loan to someone who would have repaid. So we need the model to weight the defaulted cases more heavily than it currently does.

### Results Without Class Weighting

```
Accuracy: 0.80

Confusion Matrix:
  13615   413
   3045   506

Classification Report:
              precision    recall  f1-score   support
        Paid       0.82      0.97      0.89     14028
   Defaulted       0.55      0.14      0.23      3551
    accuracy                           0.80     17579
   macro avg       0.68      0.56      0.56     17579
weighted avg       0.76      0.80      0.75     17579
```

Without class weighting, accuracy is 0.80 (high) but recall on defaulted loans is 0.14 (very low). Overall, the model is right about whether a loan will default 80% of the time, but among loans that did default, it correctly predicted 'default' only 14% of the time. Of the 3551 loans that ended up defaulting, it predicted full repayment for 3045, which would be disastrous for the bank: losing the full principal when an approved borrower defaults is a much worse outcome than missing out on the interest from rejecting a borrower who would have repaid in full. Accuracy is high here only because the model predicted 'paid' for 95% of loans and 80% of loans were indeed paid. The score reflects the imbalance between paid and defaulted loans rather than any skill at spotting defaults.
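The figures quoted above can be re-derived from the confusion matrix by hand. The sketch below does that arithmetic in plain Python; the variable names are illustrative, and the matrix is copied from the unweighted run, with rows as actual classes and columns as predicted classes (the layout `confusion_matrix` uses).

```python
# Confusion matrix from the unweighted run.
# Rows are actual classes, columns are predicted classes: [Paid, Defaulted].
cm = [[13615, 413],
      [3045, 506]]

paid_as_paid, paid_as_default = cm[0]
default_as_paid, default_as_default = cm[1]
total = sum(sum(row) for row in cm)

accuracy = (paid_as_paid + default_as_default) / total
recall_default = default_as_default / (default_as_paid + default_as_default)
precision_default = default_as_default / (paid_as_default + default_as_default)
share_predicted_paid = (paid_as_paid + default_as_paid) / total
share_actually_paid = (paid_as_paid + paid_as_default) / total

print(f"accuracy:              {accuracy:.2f}")              # 0.80
print(f"recall (Defaulted):    {recall_default:.2f}")        # 0.14
print(f"precision (Defaulted): {precision_default:.2f}")     # 0.55
print(f"share predicted paid:  {share_predicted_paid:.2f}")  # 0.95
print(f"share actually paid:   {share_actually_paid:.2f}")   # 0.80
```

The last two lines show why accuracy is misleading here: a model that labels nearly everything 'paid' is bound to score close to the share of loans that really were paid.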

### Results With Class Weighting

```
Accuracy: 0.68

Confusion Matrix:
  9765 4263
  1302 2249
Classification Report:
              precision    recall  f1-score   support
        Paid       0.88      0.70      0.78     14028
   Defaulted       0.35      0.63      0.45      3551
    accuracy                           0.68     17579
   macro avg       0.61      0.66      0.61     17579
weighted avg       0.77      0.68      0.71     17579
```

When we introduce class weighting, accuracy drops from 0.80 to 0.68. The model is now more conservative: it predicts default for more loans, including many that were actually repaid. This is a worthwhile sacrifice, because the number of defaulted loans it predicted as 'paid' (false negatives) dropped from 3045 to 1302. The model went from catching 14% of defaulters to catching 63% of them. The change comes from how the model weights errors. Giving every loan equal weight is what led the model to guess 'paid' almost every time, earning a 0.80 accuracy score while failing to flag most loans that defaulted. The approach used here instead gives each class of loans equal total weight. Since there are fewer defaulted loans, each one is individually weighted higher than an individual paid loan, so predicting 'paid' for a loan that actually defaulted is penalised more than predicting 'defaulted' for a loan that was actually repaid, which is the way it should be.

However, precision on defaulted loans took a hit with the more conservative approach, dropping from 0.55 to 0.35: of all the borrowers predicted to default, the proportion who actually did fell from 55% to 35%. With balanced class weights, then, 65% of the borrowers we predicted to default would actually have repaid. That is a lot of interest to miss out on, but the bank must accept that forgoing some interest is worth it to avoid losing entire principals.
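For a sense of scale, the sketch below applies the formula scikit-learn documents for `class_weight="balanced"`, `n_samples / (n_classes * class_count)`. One assumption to flag: the real weights are computed from `y_train`, whose class counts are not printed in this notebook, so the test-split counts are used here purely as a stand-in.

```python
# Class counts from the test split, used here only as a stand-in:
# the real weights are derived from y_train, whose counts are not printed.
counts = {"Paid": 14028, "Defaulted": 3551}

n_samples = sum(counts.values())
n_classes = len(counts)

# The "balanced" heuristic: n_samples / (n_classes * class_count)
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

for label, weight in weights.items():
    total_weight = weight * counts[label]
    print(f"{label:<10} weight per loan = {weight:.3f}, total class weight = {total_weight:.1f}")

ratio = weights["Defaulted"] / weights["Paid"]
print(f"one defaulted loan counts as much as {ratio:.2f} paid loans")
```

With these stand-in counts, both classes end up with the same total weight, and a single defaulted loan carries roughly four times the weight of a single paid loan, which is what pushes the model toward flagging more defaults.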
