# Binary Classification: Simple Logistic Regression with Statsmodels

Predict whether a bank client is likely to default on their loan or not.

Note that `interest_rate` is the 3-month interbank interest rate and `duration` is the time since the last contact was made with a given customer. The `previous` variable shows whether the last marketing campaign was successful with this customer. `march` and `may` are Boolean variables that account for when the call was made to the specific customer, and `credit` shows whether the customer has enough credit to avoid defaulting.

## Import libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats



## Load data

In [2]:
# load bank csv data and store in df
url = "https://raw.githubusercontent.com/lucaskienast/Classification-Models/main/1)%20Binary%20Classification/Bank_data.csv"
df = pd.read_csv(url)
df = df.drop(["Unnamed: 0"], axis=1)
df.head()

,interest_rate,credit,march,may,previous,duration,y
0,1.334,0.0,1.0,0.0,0.0,117.0,no
1,0.767,0.0,0.0,2.0,1.0,274.0,yes
2,4.858,0.0,1.0,0.0,0.0,167.0,no
3,4.12,0.0,0.0,0.0,0.0,686.0,yes
4,4.856,0.0,1.0,0.0,0.0,157.0,no


## Map target: yes/no -> 1/0

In [3]:
# change yes/no in column y to 1/0
df["y"] = np.where(df["y"]=="yes", 1, 0)
df.head()

,interest_rate,credit,march,may,previous,duration,y
0,1.334,0.0,1.0,0.0,0.0,117.0,0
1,0.767,0.0,0.0,2.0,1.0,274.0,1
2,4.858,0.0,1.0,0.0,0.0,167.0,0
3,4.12,0.0,0.0,0.0,0.0,686.0,1
4,4.856,0.0,1.0,0.0,0.0,157.0,0


## Explore data

In [4]:
# show summary descriptive statistics
df.describe(include="all")

,interest_rate,credit,march,may,previous,duration,y
count,518.0,518.0,518.0,518.0,518.0,518.0,518.0
mean,2.835776,0.034749,0.266409,0.388031,0.127413,382.177606,0.5
std,1.876903,0.183321,0.442508,0.814527,0.333758,344.29599,0.500483
min,0.635,0.0,0.0,0.0,0.0,9.0,0.0
25%,1.04275,0.0,0.0,0.0,0.0,155.0,0.0
50%,1.466,0.0,0.0,0.0,0.0,266.5,0.5
75%,4.9565,0.0,1.0,0.0,0.0,482.75,1.0
max,4.97,1.0,1.0,5.0,1.0,2653.0,1.0


## Declare features and targets

In [5]:
# create feature (X) and target (y) variables
y = df["y"]
X1 = df["duration"]

## Create Simple Logistic Regression Model

In [6]:
# build model
X = sm.add_constant(X1) # y-intercept
log_reg = sm.Logit(y, X)
log_results = log_reg.fit()

Optimization terminated successfully.
         Current function value: 0.546118
         Iterations 7


## Interpret Model

In [7]:
# show summary of logit regression model
print(log_results.summary())

                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                  518
Model:                          Logit   Df Residuals:                      516
Method:                           MLE   Df Model:                            1
Date:                Fri, 27 Aug 2021   Pseudo R-squ.:                  0.2121
Time:                        18:25:13   Log-Likelihood:                -282.89
converged:                       True   LL-Null:                       -359.05
Covariance Type:            nonrobust   LLR p-value:                 5.387e-35
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.7001      0.192     -8.863      0.000      -2.076      -1.324
duration       0.0051      0.001      9.159      0.000       0.004       0.006


The dependent variable is 'y'; 'duration' is the single explanatory variable. The model is a Logit regression (logistic regression in common usage), fitted by Maximum Likelihood Estimation (MLE). The optimizer converged after 7 iterations on 518 observations.
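To make the MLE step less of a black box, here is a minimal sketch of Newton-Raphson for a one-feature logistic regression, written in plain Python. Newton-Raphson is the default optimizer of `Logit.fit()` in statsmodels, but this is not the library's implementation. The function name `fit_simple_logit` and the toy data are hypothetical and are not taken from the bank dataset.

```python
import math

def fit_simple_logit(x, y, max_iter=25, tol=1e-10):
    """Fit log(odds) = b0 + b1 * x by Newton-Raphson.

    Returns (b0, b1, iterations_used, converged).
    """
    b0, b1 = 0.0, 0.0
    for iteration in range(1, max_iter + 1):
        # Predicted probabilities under the current parameters
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Gradient of the log-likelihood with respect to (b0, b1)
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Negative Hessian (observed information), a symmetric 2x2 matrix
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve H * step = g with the closed-form 2x2 inverse
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (-h01 * g0 + h00 * g1) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            return b0, b1, iteration, True
    return b0, b1, max_iter, False

# Hypothetical toy data: the classes overlap, so a finite MLE exists
x_toy = [1, 2, 3, 4, 5, 6, 7, 8]
y_toy = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1, n_iter, converged = fit_simple_logit(x_toy, y_toy)
print(f"b0={b0:.4f}, b1={b1:.4f}, iterations={n_iter}, converged={converged}")
```

At the maximum the gradient is zero, which is what the "converged" flag in the summary is reporting on for the real model.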

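The pseudo R-squared is 0.21. Statsmodels reports McFadden's pseudo R-squared, for which values between roughly 0.2 and 0.4 are commonly regarded as a good fit, so this is within the 'acceptable region'.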
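As a quick cross-check, McFadden's pseudo R-squared can be recomputed from the two log-likelihoods in the summary. The sketch below uses only numbers printed above; the variable names are illustrative. It also checks two side facts: the null log-likelihood follows from the target mean of 0.5, and the "Current function value" printed by `fit()` is the mean negative log-likelihood.

```python
import math

n_obs = 518
ll_model = -282.89   # "Log-Likelihood" from the summary
ll_null = -359.05    # "LL-Null" from the summary (intercept-only model)

# McFadden's pseudo R-squared: 1 - LL_model / LL_null
pseudo_r2 = 1.0 - ll_model / ll_null
print(round(pseudo_r2, 4))  # 0.2121

# y has mean 0.5, so the intercept-only model predicts 0.5 for every row
ll_null_check = n_obs * math.log(0.5)
print(round(ll_null_check, 2))  # -359.05

# fit() printed the mean negative log-likelihood (0.546118)
ll_model_check = -0.546118 * n_obs
print(round(ll_model_check, 2))  # -282.89
```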

The duration variable is statistically significant (z = 9.159, p < 0.001) and its coefficient is 0.0051.

The constant is also significant (z = -8.863, p < 0.001) and equals -1.70.
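Significance here refers to the Wald z-test shown in the summary table. As a rough illustration, the sketch below rebuilds the 95% confidence interval of the constant from its coefficient and standard error, and converts the reported z-statistic into a two-sided p-value under a standard normal distribution. The variable names are illustrative, and 1.96 is the usual approximation of the 97.5% normal quantile.

```python
import math

# Values for the constant, copied from the summary table
coef, std_err, z_reported = -1.7001, 0.192, -8.863

# 95% confidence interval: coef +/- 1.96 * std err
z_crit = 1.96
ci_lower = coef - z_crit * std_err
ci_upper = coef + z_crit * std_err
print(round(ci_lower, 3), round(ci_upper, 3))  # -2.076 -1.324

# Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z_reported) / math.sqrt(2.0))
print(p_value < 0.001)  # True, displayed as 0.000 in the summary
```

Because the summary rounds the standard error to three decimals, recomputing z as coef / std err only approximates the reported value, and the same rounding makes this check too coarse to reproduce the interval for the duration coefficient.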

The logit model goes as follows:

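$ \log\left( \frac{P(x)}{1-P(x)} \right) = -1.7 + 0.0051 \cdot \text{duration} $

where $P(x)$ is the predicted probability that $y = 1$ for customer $x$. Writing $t$ for the duration:

$ \log( \text{odds}(x) ) = -1.7 + 0.0051 \cdot t $

Hence, for two customers $x_1$ and $x_2$ with durations $t_1$ and $t_2$:

$ \log( \text{odds}(x_1) ) = -1.7 + 0.0051 \cdot t_1 $

$ \log( \text{odds}(x_2) ) = -1.7 + 0.0051 \cdot t_2 $

Subtracting the second equation from the first gives:

$ \log\left( \frac{\text{odds}(x_1)}{\text{odds}(x_2)} \right) = 0.0051 \cdot ( t_1 - t_2 ) $

Taking the exponential:

$ \frac{\text{odds}(x_1)}{\text{odds}(x_2)} = e^{0.0051 \cdot ( t_1 - t_2 )} $

Assume $t_1 - t_2 = 100 \text{ days}$:

$ \frac{\text{odds}(x_1)}{\text{odds}(x_2)} = e^{0.0051 \cdot 100} $

$ \frac{\text{odds}(x_1)}{\text{odds}(x_2)} \approx 1.7 $

Therefore,

$ \text{odds}(x_1) \approx 1.7 \cdot \text{odds}(x_2) $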

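So if the duration for person x1 is 100 days longer than for x2, the odds of the positive outcome (y = 1) for x1 are about 1.7 times those for x2. This is a ratio of odds, not of probabilities, so it does not mean that x1 is 1.7 times as likely to have y = 1.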
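The difference between an odds ratio and a probability ratio is easy to check numerically. The following sketch plugs the fitted coefficients into the logistic function; the helper names `prob` and `odds` and the example durations are illustrative choices, not part of the original analysis.

```python
import math

B0, B1 = -1.7001, 0.0051  # fitted constant and duration coefficient

def prob(t):
    """Predicted probability of y = 1 at duration t."""
    return 1.0 / (1.0 + math.exp(-(B0 + B1 * t)))

def odds(t):
    """Odds of y = 1 at duration t."""
    p = prob(t)
    return p / (1.0 - p)

# Odds ratio for a 100-unit increase in duration (about 1.67)
odds_ratio = math.exp(B1 * 100)
print(round(odds_ratio, 3))

# The odds ratio is the same at every starting point; the probability ratio is not
for t2 in (100, 300, 600):
    t1 = t2 + 100
    print(t2, t1, round(odds(t1) / odds(t2), 3), round(prob(t1) / prob(t2), 3))
```

The odds ratio stays constant across starting durations, while the ratio of probabilities is always smaller and shrinks as the baseline probability grows.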