# Binary Classification: Multivariate Logistic Regression with Statsmodels

Predict whether a bank client is likely to default on their loan.

The dataset contains the following variables:

- `interest_rate` is the 3-month interest rate between banks.
- `duration` is the time since the last contact was made with a given customer.
- `previous` shows whether the last marketing campaign was successful with this customer.
- `march` and `may` are Boolean variables that account for when the call was made to the specific customer.
- `credit` shows whether the customer has enough credit to avoid defaulting.

## Import libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats


## Load data

In [2]:
# load bank csv data and store in df
url = "https://raw.githubusercontent.com/lucaskienast/Classification-Models/main/1)%20Binary%20Classification/Bank_data.csv"
df = pd.read_csv(url)
df = df.drop(["Unnamed: 0"], axis=1)
df.head()

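| | interest_rate | credit | march | may | previous | duration | y |
|---|---|---|---|---|---|---|---|
| 0 | 1.334 | 0.0 | 1.0 | 0.0 | 0.0 | 117.0 | no |
| 1 | 0.767 | 0.0 | 0.0 | 2.0 | 1.0 | 274.0 | yes |
| 2 | 4.858 | 0.0 | 1.0 | 0.0 | 0.0 | 167.0 | no |
| 3 | 4.12 | 0.0 | 0.0 | 0.0 | 0.0 | 686.0 | yes |
| 4 | 4.856 | 0.0 | 1.0 | 0.0 | 0.0 | 157.0 | no |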


## Map target: yes/no -> 1/0

In [3]:
# change yes/no in column y to 1/0
df["y"] = np.where(df["y"]=="yes", 1, 0)
df.head()

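| | interest_rate | credit | march | may | previous | duration | y |
|---|---|---|---|---|---|---|---|
| 0 | 1.334 | 0.0 | 1.0 | 0.0 | 0.0 | 117.0 | 0 |
| 1 | 0.767 | 0.0 | 0.0 | 2.0 | 1.0 | 274.0 | 1 |
| 2 | 4.858 | 0.0 | 1.0 | 0.0 | 0.0 | 167.0 | 0 |
| 3 | 4.12 | 0.0 | 0.0 | 0.0 | 0.0 | 686.0 | 1 |
| 4 | 4.856 | 0.0 | 1.0 | 0.0 | 0.0 | 157.0 | 0 |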


## Explore data

In [4]:
# show summary descriptive statistics
df.describe(include="all")

| | interest_rate | credit | march | may | previous | duration | y |
|---|---|---|---|---|---|---|---|
| count | 518.0 | 518.0 | 518.0 | 518.0 | 518.0 | 518.0 | 518.0 |
| mean | 2.835776 | 0.034749 | 0.266409 | 0.388031 | 0.127413 | 382.177606 | 0.5 |
| std | 1.876903 | 0.183321 | 0.442508 | 0.814527 | 0.333758 | 344.29599 | 0.500483 |
| min | 0.635 | 0.0 | 0.0 | 0.0 | 0.0 | 9.0 | 0.0 |
| 25% | 1.04275 | 0.0 | 0.0 | 0.0 | 0.0 | 155.0 | 0.0 |
| 50% | 1.466 | 0.0 | 0.0 | 0.0 | 0.0 | 266.5 | 0.5 |
| 75% | 4.9565 | 0.0 | 1.0 | 0.0 | 0.0 | 482.75 | 1.0 |
| max | 4.97 | 1.0 | 1.0 | 5.0 | 1.0 | 2653.0 | 1.0 |


## Declare features and targets

In [5]:
# create feature (X) and target (y) variables
y = df['y']
X1 = df[['interest_rate','credit','march','previous','duration']]

## Train Multivariate Logistic Regression Model

In [6]:
# build model
X = sm.add_constant(X1) # y-intercept
log_reg = sm.Logit(y, X)
log_results = log_reg.fit()

Optimization terminated successfully.
         Current function value: 0.336664
         Iterations 7


In [7]:
# show summary statistics
print(log_results.summary())

                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:                  518
Model:                          Logit   Df Residuals:                      512
Method:                           MLE   Df Model:                            5
Date:                Fri, 27 Aug 2021   Pseudo R-squ.:                  0.5143
Time:                        19:53:45   Log-Likelihood:                -174.39
converged:                       True   LL-Null:                       -359.05
Covariance Type:            nonrobust   LLR p-value:                 1.211e-77
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const            -0.0211      0.311     -0.068      0.946      -0.631       0.589
interest_rate    -0.8001      0.089     -8.943      0.000      -0.975      -0.625
credit            2.3585      1.088     

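Interpretation:

- The coefficients are on the log-odds scale. Holding all other feature values fixed, a client with `credit` = 1 has log-odds of y = 1 that are 2.36 higher than a client with `credit` = 0 (the variable only takes the values 0 and 1 here). The odds of y = 1 are therefore multiplied by exp(2.3585) ≈ 10.6. This is an odds ratio, not a statement that the client is 2.36 times as likely.
- The other coefficients are read the same way. For example, each one-unit increase in `interest_rate` multiplies the odds of y = 1 by exp(-0.8001) ≈ 0.45.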
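
To make the interpretation of the coefficients concrete, the following minimal sketch exponentiates logit coefficients to obtain odds ratios. The values are copied by hand from the rows of the summary that are visible above, and the dictionary name `coefs` is only illustrative. With the fitted model available, `np.exp(log_results.params)` should give the corresponding figures for every feature.

```python
import numpy as np

# Coefficients copied from the summary output above (log-odds scale)
coefs = {"const": -0.0211, "interest_rate": -0.8001, "credit": 2.3585}

# Exponentiating a logit coefficient gives the multiplicative change in the
# odds of y = 1 for a one-unit increase in that feature
odds_ratios = {name: float(np.exp(beta)) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.3f}")
```

An odds ratio above 1 means the feature raises the odds of y = 1, and a ratio below 1 means it lowers them.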

## Model Accuracy

In [8]:
# make predictions for training data
log_results.predict()

array([0.10845729, 0.94310101, 0.01016712, 0.8107996 , 0.00950534,
       0.53401102, 0.03208113, 0.00360438, 0.71233264, 0.06744818,
       0.9536981 , 0.49838996, 0.89280993, 0.1015164 , 0.07595256,
       0.81903592, 0.87459125, 0.84855627, 0.3728643 , 0.76321316,
       0.97293194, 0.87334213, 0.12982914, 0.09814958, 0.65435403,
       0.08506009, 0.77401405, 0.76293622, 0.90314904, 0.04231132,
       0.02297297, 0.12217763, 0.27155822, 0.71480891, 0.05771755,
       0.01197368, 0.99318432, 0.06187083, 0.99998254, 0.54171261,
       0.05858667, 0.84153321, 0.05871492, 0.00874369, 0.8683381 ,
       0.54679839, 0.00476303, 0.15173607, 0.15689295, 0.8281327 ,
       0.72220836, 0.05318745, 0.03268499, 0.05137519, 0.00752218,
       0.80512363, 0.09452058, 0.80034267, 0.98080827, 0.83781628,
       0.03688478, 0.1128039 , 0.98917666, 0.76387305, 0.16448677,
       0.30797084, 0.04896691, 0.05903493, 0.99644849, 0.06952408,
       0.02351395, 0.71587342, 0.13853176, 0.97026524, 0.87188

Each value is the model's estimated probability that the observation belongs to class 1. To turn these probabilities into class labels, they are compared against a threshold of 0.5: a probability of 0.5 or more is assigned to class 1, and anything lower to class 0.
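
As a small illustration of that thresholding step, the sketch below converts the first five probabilities from the output above into class labels. The variable names are illustrative and not part of the original notebook.

```python
import numpy as np

# First five predicted probabilities from the output above
probs = np.array([0.10845729, 0.94310101, 0.01016712, 0.8107996, 0.00950534])

# Assign class 1 when the probability is at least 0.5, otherwise class 0
threshold = 0.5
labels = (probs >= threshold).astype(int)
print(labels)  # [0 1 0 1 0]
```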

In [9]:
# show confusion matrix
log_results.pred_table()

array([[218.,  41.],
       [ 30., 229.]])

In [10]:
# confusion matrix formatted as df
cm_df = pd.DataFrame(log_results.pred_table())
cm_df.columns = ["Predicted 0", "Predicted 1"]
cm_df = cm_df.rename(index={0:"Actual 0", 1:"Actual 1"})
cm_df

| | Predicted 0 | Predicted 1 |
|---|---|---|
| Actual 0 | 218.0 | 41.0 |
| Actual 1 | 30.0 | 229.0 |


- For 218 observations, the model predicted 0 when the true value was 0.
- For 30 observations, the model predicted 0 when the true value was 1.
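- For 41 observations, the model predicted 1 when the true value was 0.
- For 229 observations, the model predicted 1 when the true value was 1.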

In [11]:
# calculate actual accuracy
cm = np.array(cm_df)
accuracy_train = (cm[0,0] + cm[1,1])/cm.sum()
accuracy_train

0.862934362934363

The model shows 86% accuracy on the training data.
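
The same figure can be reproduced directly from the confusion matrix. The sketch below hard-codes the training matrix shown above and, as an additional illustration that is not in the original notebook, also reports the share of each actual class that was classified correctly.

```python
import numpy as np

# Training confusion matrix from above: rows = actual, columns = predicted
cm = np.array([[218.0, 41.0],
               [30.0, 229.0]])

# Accuracy: correctly classified observations (the diagonal) over all observations
accuracy = np.trace(cm) / cm.sum()

# Share of each actual class that the model classified correctly
per_class = np.diag(cm) / cm.sum(axis=1)

print(round(accuracy, 4))
print(per_class.round(4))
```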

## Test Model

In [12]:
# load test bank csv data and store in df
url = "https://raw.githubusercontent.com/lucaskienast/Classification-Models/main/1)%20Binary%20Classification/Bank_data_testing.csv"
df_test = pd.read_csv(url)
df_test = df_test.drop(["Unnamed: 0"], axis=1)
df_test["y"] = np.where(df_test["y"]=="yes", 1, 0)
df_test.head()

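| | interest_rate | credit | march | may | previous | duration | y |
|---|---|---|---|---|---|---|---|
| 0 | 1.313 | 0.0 | 1.0 | 0.0 | 0.0 | 487.0 | 0 |
| 1 | 4.961 | 0.0 | 0.0 | 0.0 | 0.0 | 132.0 | 0 |
| 2 | 4.856 | 0.0 | 1.0 | 0.0 | 0.0 | 92.0 | 0 |
| 3 | 4.12 | 0.0 | 0.0 | 0.0 | 0.0 | 1468.0 | 1 |
| 4 | 4.963 | 0.0 | 0.0 | 0.0 | 0.0 | 36.0 | 0 |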


In [13]:
# declare features and targets of test dataset
y_test = df_test['y']
X1_test = df_test[['interest_rate','credit','march','previous','duration']]
X_test = sm.add_constant(X1_test) # y-intercept

In [14]:
def confusion_matrix(data, actual_values, model):
    """Return the confusion matrix and accuracy of a fitted Logit model.

    Parameters
    ----------
    data : data frame or array
        Features formatted in the same way as the training input, without
        the actual values (e.g. const, var1, var2, ...). Column order matters.
    actual_values : data frame or array
        The actual values from the test data. For a logistic regression,
        this is a single column of 0s and 1s.
    model : LogitResults
        The fitted model (log_results in this notebook).
    """
    # Predict the class-1 probabilities using the Logit model
    pred_values = model.predict(data)
    # Bin edges: values in [0, 0.5) count as 0, values in [0.5, 1] count as 1
    bins = np.array([0, 0.5, 1])
    # 2-D histogram of actual vs. predicted: rows are actual, columns are predicted
    cm = np.histogram2d(actual_values, pred_values, bins=bins)[0]
    # Accuracy: share of observations on the diagonal
    accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
    # Return the confusion matrix and the accuracy
    return cm, accuracy

In [15]:
confusion_matrix(X_test, y_test, log_results)

(array([[93., 18.],
        [13., 98.]]), 0.8603603603603603)

The test accuracy of 86.0% is only slightly below the training accuracy of 86.3%. A small drop is expected, because the model was fitted on the training data.
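
To see how the `np.histogram2d` call inside `confusion_matrix` produces the matrix, the sketch below applies the same bin edges to a small made-up example. The labels and probabilities are hypothetical and chosen only so that each cell can be checked by hand.

```python
import numpy as np

# Hypothetical actual labels and predicted probabilities (not from the dataset)
actual = np.array([0, 0, 1, 1, 1])
pred = np.array([0.1, 0.7, 0.4, 0.9, 0.5])

# Same bin edges as in confusion_matrix: [0, 0.5) -> 0 and [0.5, 1] -> 1
bins = np.array([0, 0.5, 1])

# Rows index the actual class, columns index the predicted class
cm = np.histogram2d(actual, pred, bins=bins)[0]
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()

print(cm)
print(accuracy)
```

Because the last histogram bin includes its right edge, an actual label of 1 and a probability of exactly 0.5 or 1.0 both fall into the second bin.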