### Logistic Regression Example 5.1

We fit a multiple logistic regression model to the **Default** data set using **balance**, **income**, and **student** as predictor variables. The last of these is a *qualitative* predictor with levels **Yes** and **No**. To use it in the regression model, we define a *dummy variable* that takes the value $1$ if **student=Yes** and $0$ if **student=No**. As in the code below, the model is fitted on a downsampled data set that contains all 333 defaulting customers and 333 randomly chosen non-defaulting customers.
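The dummy coding described above can be tried on a small example first. The following is a minimal sketch on a hypothetical toy data frame (the four rows are made up for illustration and are not taken from `Default.csv`); it uses `pd.get_dummies` with `drop_first=True`, so that only the column for the level **Yes** is kept.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from Default.csv
toy = pd.DataFrame({"student": ["Yes", "No", "No", "Yes"]})

# drop_first=True drops the first level ("No"), so a single
# column student_Yes remains: 1 for students, 0 otherwise.
dummies = pd.get_dummies(toy["student"], prefix="student",
                         drop_first=True, dtype=int)
print(dummies)
```

Dropping one level avoids a redundant column: once the intercept is in the model, **student_No** carries no information beyond **student_Yes**.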


In [6]:
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Load data
df = pd.read_csv('./data/Default.csv', sep=';')

# Add dummy columns default_Yes and student_Yes (the level 'No' is dropped)
df = df.join(pd.get_dummies(df[['default', 'student']], 
                            prefix={'default': 'default', 
                                    'student': 'student'},
                            drop_first=True))
# Set random seed
np.random.seed(1)
# Index of Yes:
i_yes = df.loc[df['default_Yes'] == 1, :].index

# Random set of No, the same size as the set of Yes:
i_no = df.loc[df['default_Yes'] == 0, :].index
i_no = np.random.choice(i_no, replace=False, size=len(i_yes))

# Build the downsampled data set for the logistic model
# (i_ds holds index labels, so select rows with .loc)
i_ds = np.concatenate((i_no, i_yes))
x_ds = df.loc[i_ds, ['balance', 'income', 'student_Yes']]
y_ds = df.loc[i_ds, 'default_Yes']

# Model using statsmodels.api
x_sm = sm.add_constant(x_ds.astype('float'))
model_sm = sm.GLM(y_ds, x_sm, family=sm.families.Binomial())
model_sm = model_sm.fit()

print(model_sm.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:            default_Yes   No. Observations:                  666
Model:                            GLM   Df Residuals:                      662
Model Family:                Binomial   Df Model:                            3
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -186.21
Date:                Mon, 20 Oct 2025   Deviance:                       372.42
Time:                        16:18:38   Pearson chi2:                     571.
No. Iterations:                     7   Pseudo R-squ. (CS):             0.5627
Covariance Type:            nonrobust                                         
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
const          -7.1303      0.869     -8.205      

In [7]:
# Predict training data
x_pred = x_sm
y_pred = model_sm.predict(x_pred)

# Round to 0 or 1
y_pred = y_pred.round()

# Create confusion matrix
confusion = pd.DataFrame({'predicted': y_pred,
                          'true': y_ds})
confusion = pd.crosstab(confusion.predicted, confusion.true, 
                        margins=True, margins_name="Sum")

print(confusion)

true       False  True  Sum
predicted                  
0.0          293    36  329
1.0           40   297  337
Sum          333   333  666


In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Model using sklearn
model_sk = LogisticRegression(solver='liblinear', penalty='l1')

# Calculate cross validation scores:
scores = cross_val_score(model_sk, x_ds, y_ds, cv=5)
print(np.mean(scores))

0.8827965435978005


First we find that the predictors **balance** and **student** are statistically significant, i.e. their coefficients differ significantly from zero in the model for **default**. The coefficient of **student** is negative, i.e. for fixed values of **balance** and **income**, being a student *decreases* the estimated probability of default.

We further obtain a cross-validated score of 0.8828, which means that the model correctly classifies 88.28% of the cases on average over the five folds. This is not much of an improvement over the simple logistic regression model. The confusion matrix is also very similar to that of the simple regression case.
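For comparison with the cross-validated score, the training accuracy can be read off the confusion matrix printed above. The sketch below copies the four counts by hand; the names `sensitivity` and `specificity` are added here for illustration and do not appear in the cells above.

```python
import numpy as np

# Counts copied from the printed confusion matrix:
# rows = predicted (0, 1), columns = true (False, True)
counts = np.array([[293, 36],
                   [40, 297]])

tn, fn = counts[0]   # predicted 0: true negatives, false negatives
fp, tp = counts[1]   # predicted 1: false positives, true positives

accuracy = (tp + tn) / counts.sum()   # share of correct predictions
sensitivity = tp / (tp + fn)          # share of defaults detected
specificity = tn / (tn + fp)          # share of non-defaults detected

print(f"accuracy:    {accuracy:.4f}")
print(f"sensitivity: {sensitivity:.4f}")
print(f"specificity: {specificity:.4f}")
```

The training accuracy is $590/666 \approx 0.886$, close to the cross-validated score, which suggests that the model is not overfitting the downsampled data much.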

We now use the estimated coefficients to predict the probability of default for new observations. For example, if a student has a credit card balance of CHF 1500 and an income of CHF 40000, the estimated probability of **default** is
\begin{equation}
\hat{p}(1500,40,1)
=\dfrac{e^{-6.679+0.00529\cdot 1500 -0.0043\cdot 40-0.6468\cdot 1}}{1+e^{-6.679+0.00529\cdot 1500-0.0043\cdot 40-0.6468\cdot 1}}
\approx 0.608
\end{equation}

For a non-student with the same balance and income, the estimated probability of default is
\begin{equation}
\hat{p}(1500,40,0)
=\dfrac{e^{-6.679+0.00529\cdot 1500-0.0043\cdot 40-0.6468\cdot 0}}{1+e^{-6.679+0.00529\cdot 1500-0.0043\cdot 40-0.6468\cdot 0}}
=0.747
\end{equation}
The coefficient for **income** has been multiplied by $1000$ for readability, so income enters the formula in units of CHF 1000. Thus we insert $40$ instead of $40000$ into the model.
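The two probabilities can also be evaluated numerically. The following is a minimal sketch that plugs the rounded coefficients from the formulas above into the logistic function; the dictionary `coef` and the helper `default_probability` are hypothetical names introduced for illustration, not part of the fitted `statsmodels` object.

```python
import numpy as np

# Rounded coefficients as they appear in the formulas above.
# The income coefficient refers to income in units of CHF 1000.
coef = {"const": -6.679, "balance": 0.00529,
        "income": -0.0043, "student": -0.6468}

def default_probability(balance, income_k, student):
    """Estimated probability of default for a single observation."""
    eta = (coef["const"]
           + coef["balance"] * balance
           + coef["income"] * income_k
           + coef["student"] * student)
    # Logistic function: e^eta / (1 + e^eta)
    return 1.0 / (1.0 + np.exp(-eta))

p_student = default_probability(1500, 40, 1)
p_non_student = default_probability(1500, 40, 0)

print(f"student:     {p_student:.3f}")
print(f"non-student: {p_non_student:.3f}")
```

Because the coefficient of **student** is negative, the student's probability is the lower of the two for the same balance and income.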
