# Understanding Logistic Regression Tables

Using the same code as in the previous exercise, try to interpret the summary table.

### More information about the dataset: 
Note that <i>interest rate</i> indicates the 3-month interest rate between banks and <i>duration</i> indicates the time since the last contact was made with a given consumer. The <i>previous</i> variable shows whether the last marketing campaign was successful with this customer. <i>March</i> and <i>May</i> are Boolean variables that account for when the call was made to the specific customer, and <i>credit</i> shows if the customer has enough credit to avoid defaulting.

<i> Notes: 
    <li> the first column of the dataset is an index column; </li>
    <li> you don't need the graph for this exercise; </li>
    <li> the dataset used is much bigger </li>
</i>

## Import the relevant libraries

In [19]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

## Load the data

Load the ‘Bank_data.csv’ dataset.

In [2]:
raw_data = pd.read_csv("Bank_data.csv")
raw_data

,Unnamed: 0,interest_rate,credit,march,may,previous,duration,y
0,0,1.334,0.0,1.0,0.0,0.0,117.0,no
1,1,0.767,0.0,0.0,2.0,1.0,274.0,yes
2,2,4.858,0.0,1.0,0.0,0.0,167.0,no
3,3,4.120,0.0,0.0,0.0,0.0,686.0,yes
4,4,4.856,0.0,1.0,0.0,0.0,157.0,no
...,...,...,...,...,...,...,...,...
513,513,1.334,0.0,1.0,0.0,0.0,204.0,no
514,514,0.861,0.0,0.0,2.0,1.0,806.0,yes
515,515,0.879,0.0,0.0,0.0,0.0,290.0,no
516,516,0.877,0.0,0.0,5.0,1.0,473.0,yes


In [4]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 518 entries, 0 to 517
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     518 non-null    int64  
 1   interest_rate  518 non-null    float64
 2   credit         518 non-null    float64
 3   march          518 non-null    float64
 4   may            518 non-null    float64
 5   previous       518 non-null    float64
 6   duration       518 non-null    float64
 7   y              518 non-null    object 
dtypes: float64(6), int64(1), object(1)
memory usage: 32.5+ KB


In [7]:
raw_data.describe()

,Unnamed: 0,interest_rate,credit,march,may,previous,duration
count,518.0,518.0,518.0,518.0,518.0,518.0,518.0
mean,258.5,2.835776,0.034749,0.266409,0.388031,0.127413,382.177606
std,149.677988,1.876903,0.183321,0.442508,0.814527,0.333758,344.29599
min,0.0,0.635,0.0,0.0,0.0,0.0,9.0
25%,129.25,1.04275,0.0,0.0,0.0,0.0,155.0
50%,258.5,1.466,0.0,0.0,0.0,0.0,266.5
75%,387.75,4.9565,0.0,1.0,0.0,0.0,482.75
max,517.0,4.97,1.0,1.0,5.0,1.0,2653.0


In [8]:
raw_data["credit"].unique()

array([0., 1.])

In [10]:
raw_data["march"].unique()

array([1., 0.])

In [11]:
raw_data["may"].unique()

array([0., 2., 1., 3., 4., 5.])

In [13]:
# We make sure to create a copy of the data before we start altering it. Note that we don't change the original data we loaded.
data = raw_data.copy()
data = data.drop("Unnamed: 0", axis=1)
data["y"] = data["y"].map({"yes":1, "no":0})
data['credit'] = data['credit'].astype(int)
data = data.astype({"march": 'int', "may": 'int', "previous": 'int'})
data

,interest_rate,credit,march,may,previous,duration,y
0,1.334,0,1,0,0,117.0,0
1,0.767,0,0,2,1,274.0,1
2,4.858,0,1,0,0,167.0,0
3,4.120,0,0,0,0,686.0,1
4,4.856,0,1,0,0,157.0,0
...,...,...,...,...,...,...,...
513,1.334,0,1,0,0,204.0,0
514,0.861,0,0,2,1,806.0,1
515,0.879,0,0,0,0,290.0,0
516,0.877,0,0,5,1,473.0,1


In [14]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 518 entries, 0 to 517
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   interest_rate  518 non-null    float64
 1   credit         518 non-null    int64  
 2   march          518 non-null    int64  
 3   may            518 non-null    int64  
 4   previous       518 non-null    int64  
 5   duration       518 non-null    float64
 6   y              518 non-null    int64  
dtypes: float64(2), int64(5)
memory usage: 28.5 KB


### Declare the dependent and independent variables

Use 'duration' as the independent variable.

In [15]:
y = data["y"]
x1 = data["duration"]

### Simple Logistic Regression

Run the regression.

In [16]:
x = sm.add_constant(x1)
reg_log = sm.Logit(y,x)
results_log = reg_log.fit()

Optimization terminated successfully.
         Current function value: 0.546118
         Iterations 7


### Interpretation

In [17]:
results_log.summary()

Dep. Variable:,y,No. Observations:,518.0
Model:,Logit,Df Residuals:,516.0
Method:,MLE,Df Model:,1.0
Date:,"Mon, 21 Nov 2022",Pseudo R-squ.:,0.2121
Time:,18:17:05,Log-Likelihood:,-282.89
converged:,True,LL-Null:,-359.05
Covariance Type:,nonrobust,LLR p-value:,5.387e-35

,coef,std err,z,P>|z|,[0.025,0.975]
const,-1.7001,0.192,-8.863,0.000,-2.076,-1.324
duration,0.0051,0.001,9.159,0.000,0.004,0.006


The dependent variable is 'y', and 'duration' is the independent variable. The model used is a Logit regression (logistic in common lingo), while the method is Maximum Likelihood Estimation (MLE). The optimization converged after 7 iterations, using 518 observations.

The Pseudo R-squared is 0.21, which is within the 'acceptable region' (between 0.2 and 0.4 for McFadden's measure).

The duration variable is significant (p < 0.001) and its coefficient is 0.0051.

The constant is also significant and equals -1.70.

MLE is based on the likelihood function. It is a function which estimates how likely it is that the model at hand describes the real underlying relationship of the variables. The bigger the likelihood, the more plausible our model is given the observed data. Without getting too much into the statistics of it, MLE tries to maximize the likelihood function. That's why it is called maximum likelihood estimation.

Knowing this, and the fact that iterations are in play, we should already have an idea of what's going on behind the scenes. The computer goes through different parameter values until it finds the model for which the likelihood is the highest. When it can no longer improve it, it stops the optimization. That is also how a typical machine learning process goes.
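
The iterative search described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the routine that statsmodels runs: it applies Newton-Raphson updates to synthetic data, works with the logarithm of the likelihood (which has the same maximizer), and the names `fit_logit_newton`, `tol` and `max_iter` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Map log-odds to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(beta, X, y):
    """Bernoulli log likelihood of a logit model with coefficients beta."""
    p = sigmoid(X @ beta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logit_newton(X, y, tol=1e-8, max_iter=25):
    """Hypothetical helper: maximize the log likelihood with Newton-Raphson steps."""
    beta = np.zeros(X.shape[1])                 # start from all-zero coefficients
    history = [log_likelihood(beta, X, y)]
    for _ in range(max_iter):
        p = sigmoid(X @ beta)
        gradient = X.T @ (y - p)                # first derivative of the log likelihood
        hessian = -(X * (p * (1 - p))[:, None]).T @ X   # second derivative
        beta = beta - np.linalg.solve(hessian, gradient)
        history.append(log_likelihood(beta, X, y))
        if abs(history[-1] - history[-2]) < tol:
            break                               # no further improvement: stop
    return beta, history

# Synthetic data (an assumption for illustration, not the bank dataset).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([np.ones_like(x), x])       # constant + one regressor
true_beta = np.array([-0.5, 1.2])
y = (rng.random(500) < sigmoid(X @ true_beta)).astype(float)

beta_hat, history = fit_logit_newton(X, y)
for i, ll in enumerate(history):
    print(f"iteration {i}: log likelihood = {ll:.4f}")
print("estimated coefficients:", beta_hat)
```

Each printed line plays the same role as the iteration count in the fitting output: the log likelihood climbs until the improvement is negligible, and then the loop stops.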

It is much more convenient to take the log likelihood when performing MLE. Because of this convenience, the log likelihood is the more popular metric. The value of the log likelihood is almost always, but not always, negative, and the bigger it is, the better.
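
To make the log likelihood concrete, the sketch below evaluates it by hand for the five rows displayed in the data preview, using the coefficients reported in the summary (-1.7001 and 0.0051). It assumes the standard Bernoulli form, the sum of y·log(π) + (1-y)·log(1-π); the helper name `log_likelihood` is hypothetical. It also checks an observation about the fitting output: the 'Current function value' multiplied by the number of observations appears to match the magnitude of the reported Log-Likelihood.

```python
import numpy as np

def log_likelihood(y, prob):
    """Hypothetical helper: Bernoulli log likelihood of outcomes y given probabilities."""
    return float(np.sum(y * np.log(prob) + (1 - y) * np.log(1 - prob)))

# The five rows shown in the data preview (duration and encoded y).
duration = np.array([117.0, 274.0, 167.0, 686.0, 157.0])
y = np.array([0, 1, 0, 1, 0])

# Coefficients copied from the summary table.
const, coef = -1.7001, 0.0051
prob = 1.0 / (1.0 + np.exp(-(const + coef * duration)))

ll_five_rows = log_likelihood(y, prob)
print(f"log likelihood of the five rows: {ll_five_rows:.3f}")

# Consistency check against the fitting output (values copied from it).
n_obs = 518
current_function_value = 0.546118
print(f"n * function value: {n_obs * current_function_value:.2f}")
```

Every term in the sum is the log of a probability, which is why the total is negative here.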

Then we have LL-Null. It stands for log likelihood-null. The LL-Null is the log likelihood of a model which has no independent variables. More precisely, the same y is the dependent variable of that model, but its sole regressor is an array of ones. This array is the constant we are adding with the add_constant method. With π denoting the probability that y equals 1, the model is:
log(π/(1-π)) = β₀ * 1
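
For an intercept-only logit the maximum has a closed form: the fitted probability is simply the sample share of ones. The sketch below assumes a balanced sample (259 'yes' out of 518), an assumption made purely for illustration; `null_log_likelihood` is a hypothetical helper.

```python
import math

def null_log_likelihood(n_obs, n_ones):
    """Hypothetical helper: log likelihood of an intercept-only logit model."""
    p = n_ones / n_obs                      # fitted probability = share of ones
    return n_ones * math.log(p) + (n_obs - n_ones) * math.log(1 - p)

n_obs, n_ones = 518, 259                    # assumed balanced sample
p = n_ones / n_obs
beta0 = math.log(p / (1 - p))               # the intercept is the log-odds of p
ll_null = null_log_likelihood(n_obs, n_ones)
print(f"beta0 = {beta0:.4f}, LL-Null = {ll_null:.2f}")
```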

In [20]:
# A quick demonstration: x0 is an array of ones only, one entry per observation.
# A logistic regression fitted on it alone has a log likelihood equal to the
# LL-Null of the previous model.
x0 = np.ones(518)
reg_log = sm.Logit(y,x0)
results_log = reg_log.fit()
results_log.summary()

Optimization terminated successfully.
         Current function value: 0.693147
         Iterations 1


Dep. Variable:,y,No. Observations:,518.0
Model:,Logit,Df Residuals:,517.0
Method:,MLE,Df Model:,0.0
Date:,"Mon, 21 Nov 2022",Pseudo R-squ.:,0.0
Time:,18:38:45,Log-Likelihood:,-359.05
converged:,True,LL-Null:,-359.05
Covariance Type:,nonrobust,LLR p-value:,

,coef,std err,z,P>|z|,[0.025,0.975]
const,0,0.088,0,1.000,-0.172,0.172


So, the two values match. All right, but why does it matter? Well, you may want to compare the log likelihood of your model with the LL-Null, to see if your model has any explanatory power. Does that sound familiar? It is the same as checking whether our model is significant.
There was the F-test for the linear regression. There must be one for logistic regression, too.
Good news, there is. It is called the log likelihood ratio test, or LLR. It is based on the log likelihood of the model and the LL-Null. It measures whether our model is statistically different from the null model, AKA a useless model.
Without going into the exact way to perform it, we have its p-value and that's all we need. As we can see, it is very low, practically 0 (LLR p-value: 5.387e-35). Our model is significant.
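
For readers curious about the mechanics, here is a sketch of the test using only the two log likelihoods reported in the summary. The statistic is twice their difference and, under the null hypothesis, follows a chi-squared distribution with degrees of freedom equal to Df Model (1 here). For one degree of freedom the upper tail equals erfc(sqrt(statistic / 2)), so the standard library suffices. The inputs are the rounded values from the table, so the result only approximates the reported p-value.

```python
import math

# Values copied (rounded) from the summary table.
ll_model = -282.89
ll_null = -359.05

# Likelihood ratio statistic: twice the gain in log likelihood over the null model.
llr_stat = 2 * (ll_model - ll_null)

# Upper tail of a chi-squared distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(llr_stat / 2))

print(f"LLR statistic = {llr_stat:.2f}, p-value = {p_value:.3e}")
```

On a fitted statsmodels result the same quantities are available as `results_log.llr` and `results_log.llr_pvalue`.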

Finally, let's talk about the Pseudo R-squared. Unlike linear regression, logistic regression has no clearly defined R-squared. There are several proposals which have a similar meaning to the R-squared, but none of them is even close to the real deal. Some measures you may have heard of are AIC, BIC, and McFadden's R-squared. The one reported here is McFadden's R-squared (Pseudo R-squ.: 0.2121). According to McFadden himself, a good Pseudo R-squared is somewhere between 0.2 and 0.4. Moreover, this measure is mostly useful for comparing variations of the same model. Different models will have completely different and incomparable Pseudo R-squared values.
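
McFadden's measure can be recomputed from the two log likelihoods in the summary as one minus their ratio. The sketch below uses the rounded values from the table, so the last digits may differ slightly from the reported figure.

```python
# Values copied (rounded) from the summary table.
ll_model = -282.89
ll_null = -359.05

# McFadden's pseudo R-squared: 1 - LL(model) / LL(null).
mcfadden_r2 = 1 - ll_model / ll_null
print(f"McFadden's pseudo R-squared = {mcfadden_r2:.4f}")
```

On a fitted statsmodels result the same value is available as `results_log.prsquared`.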

## Interpreting the coefficients table

In [22]:
x = sm.add_constant(x1)
reg_log = sm.Logit(y,x)
results_log = reg_log.fit()
results_log.summary()

Optimization terminated successfully.
         Current function value: 0.546118
         Iterations 7


Dep. Variable:,y,No. Observations:,518.0
Model:,Logit,Df Residuals:,516.0
Method:,MLE,Df Model:,1.0
Date:,"Mon, 21 Nov 2022",Pseudo R-squ.:,0.2121
Time:,18:51:46,Log-Likelihood:,-282.89
converged:,True,LL-Null:,-359.05
Covariance Type:,nonrobust,LLR p-value:,5.387e-35

,coef,std err,z,P>|z|,[0.025,0.975]
const,-1.7001,0.192,-8.863,0.000,-2.076,-1.324
duration,0.0051,0.001,9.159,0.000,0.004,0.006


In the equation below, π refers to the probability of an event occurring, while 1 - π is the probability of the event not occurring. The ratio of the two, π/(1-π), is a very popular concept: the odds.
log(π/(1-π)) = -1.7001 + 0.0051*duration 

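For a unit change in a variable, the odds are multiplied by the exponential of its coefficient. Here exp(0.0051) ≈ 1.0051, so each additional unit of duration raises the odds by about 0.5%. (See the proof in lecture 242, section 36.)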
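
A short numerical check of this rule, using the coefficients from the summary table. The helper names `odds` and `probability` and the example durations are hypothetical, chosen only for illustration.

```python
import math

# Coefficients copied from the summary table.
const, coef = -1.7001, 0.0051

def odds(duration):
    """Odds pi / (1 - pi) implied by the fitted equation."""
    return math.exp(const + coef * duration)

def probability(duration):
    """Convert the odds back into a probability."""
    o = odds(duration)
    return o / (1 + o)

# One extra unit of duration multiplies the odds by exp(coef), whatever the start.
odds_ratio_unit = odds(201) / odds(200)
# A change of 100 units multiplies the odds by exp(100 * coef).
odds_ratio_100 = odds(300) / odds(200)

print(f"odds ratio for +1 unit:    {odds_ratio_unit:.4f}")
print(f"odds ratio for +100 units: {odds_ratio_100:.4f}")
print(f"probability at duration 200: {probability(200):.3f}")
```

Note that the multiplicative effect applies to the odds, not to the probability: the change in probability for one extra unit depends on the starting duration.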