## **Descriptive/Predictive Logistic Regression**

In [8]:
import pandas as pd
import numpy as np

## Universal Bank Data Set
## Goal: probability to get a personal loan with the bank
df = pd.read_csv('https://raw.githubusercontent.com/martinwg/ISA591/refs/heads/main/data/UniversalBank.csv')
df.head()

Unnamed: 0,ID,Personal Loan,Age,Experience,Income,ZIP Code,Family,CCAvg,Education,Mortgage,Securities Account,CD Account,Online,CreditCard
0,1,0,25,1,49,91107,4,1.6,1,0,1,0,0,0
1,2,0,45,19,34,90089,3,1.5,1,0,1,0,0,0
2,3,0,39,15,11,94720,1,1.0,1,0,0,0,0,0
3,4,0,35,9,100,94112,1,2.7,2,0,0,0,0,0
4,5,0,35,8,45,91330,4,1.0,2,0,0,0,0,1


In [9]:
## Remove uninformative variables
## ID and ZIP Code
df = df.drop(['ID', 'ZIP Code'], axis=1)
df.head()

Unnamed: 0,Personal Loan,Age,Experience,Income,Family,CCAvg,Education,Mortgage,Securities Account,CD Account,Online,CreditCard
0,0,25,1,49,4,1.6,1,0,1,0,0,0
1,0,45,19,34,3,1.5,1,0,1,0,0,0
2,0,39,15,11,1,1.0,1,0,0,0,0,0
3,0,35,9,100,1,2.7,2,0,0,0,0,0
4,0,35,8,45,4,1.0,2,0,0,0,0,1


In [10]:
## check for missing values in each column
df.isnull().sum()

Unnamed: 0,0
Personal Loan,0
Age,0
Experience,0
Income,0
Family,0
CCAvg,0
Education,0
Mortgage,0
Securities Account,0
CD Account,0


In [11]:
## Linear models require that columns are NOT PERFECTLY CORRELATED
## pd.get_dummies() --> PERFECT CORRELATION if you don't use drop_first=True
## Ideally, check for very high or perfect correlation
# Create a correlation matrix
corr_matrix = df.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with correlation greater than a threshold (e.g., 0.99)
threshold = 0.99
to_drop = [column for column in upper.columns if any(upper[column] > threshold)]

# Drop the highly correlated features
df = df.drop(columns=to_drop)

print(f"Columns removed: {to_drop}")

Columns removed: ['Experience']
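
The warning about `pd.get_dummies()` in the cell above can be illustrated without any library. The sketch below is a hand-rolled stand-in for dummy coding; the `education` values and the helper `one_hot` are hypothetical and not taken from the data set. With every level kept, the dummy columns in each row sum to 1, which duplicates a constant (intercept) column, and that is the perfect collinearity `drop_first=True` is meant to avoid.

```python
def one_hot(values, drop_first=False):
    """Return (levels, rows) where each row holds 0/1 dummies for one value."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]  # the first level becomes the baseline
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

# Hypothetical categorical column with three levels
education = [1, 2, 3, 2, 1, 3]

_, full = one_hot(education)
_, reduced = one_hot(education, drop_first=True)

# With all levels kept, every row sums to 1 (identical to an intercept column)
full_row_sums = [sum(row) for row in full]
# With the first level dropped, the row sums vary, so the duplicate disappears
reduced_row_sums = [sum(row) for row in reduced]

print(full_row_sums)
print(reduced_row_sums)
```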


In [16]:
## split into 70% train / 30% test with random_state = 591
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop('Personal Loan', axis=1), df['Personal Loan'], test_size=0.3, random_state=591)

## **Logistic Regression**

Descriptive modeling checks how well the model fits the data. Goodness-of-fit metrics include p-values and tests of hypotheses.

In [17]:
import statsmodels.api as sm

## instance and fit
lr = sm.Logit(y_train, sm.add_constant(X_train)).fit()

## summary
print(lr.summary())

Optimization terminated successfully.
         Current function value: 0.127007
         Iterations 9
                           Logit Regression Results                           
Dep. Variable:          Personal Loan   No. Observations:                 3500
Model:                          Logit   Df Residuals:                     3489
Method:                           MLE   Df Model:                           10
Date:                Thu, 14 Nov 2024   Pseudo R-squ.:                  0.5865
Time:                        18:47:06   Log-Likelihood:                -444.53
converged:                       True   LL-Null:                       -1075.0
Covariance Type:            nonrobust   LLR p-value:                1.030e-264
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                -13.5833      0.803    -16.916      0.000     -15.157     -12.009
Age  

In [18]:
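##  Iterations 9: the optimizer took 9 steps to reach the min of the loss function (here the negative log-likelihood, not SSE)
##  LLR (Log-Likelihood Ratio) p-value: 1.030e-264
##  H0: Beta_1 = Beta_2 = ... = Beta_10 = 0
##  HA: At least one Beta is NOT 0
##  The LLR statistic compares the Log-Likelihood (-444.53) with LL-Null (-1075.0)
##  The probability to obtain a statistic this large or MORE EXTREME given that H0 is true is practically 0
##  We reject H0 in favor of HA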

##  EACH PREDICTOR HAS ITS OWN TEST
##  H0: Beta_Age = 0
##  HA: Beta_Age != 0
##  p-value:  0.286
##  In the presence of the other predictors, Age is NOT a significant
##  predictor of the PROBABILITY TO GET A LOAN

##  TO SIMPLIFY a descriptive model remove INSIGNIFICANT predictors (the fit barely changes)
##  recommendation: drop ONE variable at a time UNTIL ALL that are LEFT are significant
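
The LLR p-value in the summary can be reproduced by hand from the two log-likelihoods it reports. The sketch below uses only the standard library and relies on the closed-form chi-square tail that holds for an even number of degrees of freedom; in practice `scipy.stats.chi2.sf` would be the usual route. The variable and function names are illustrative.

```python
import math

ll_model = -444.53   # Log-Likelihood from the summary
ll_null = -1075.0    # LL-Null (intercept-only model)
df_model = 10        # Df Model: number of slopes tested under H0

# Likelihood-ratio statistic: twice the gain in log-likelihood
lr_stat = 2 * (ll_model - ll_null)

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square with EVEN df (closed form)."""
    if df % 2 != 0:
        raise ValueError("closed form only holds for even df")
    half = x / 2
    terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
    return math.exp(-half) * terms

p_value = chi2_sf_even_df(lr_stat, df_model)
print(f"LR statistic = {lr_stat:.2f}, p-value = {p_value:.3e}")
```

Because the log-likelihoods are rounded in the printed summary, the result should land close to, but not exactly on, the reported 1.030e-264.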

In [22]:
## let's use alpha = 0.05
column_drop_list = ['Age', 'CCAvg', 'Mortgage']

## instance and fit
lr = sm.Logit(y_train, sm.add_constant(X_train.drop(columns = column_drop_list))).fit()

## summary
print(lr.summary())

Optimization terminated successfully.
         Current function value: 0.128005
         Iterations 9
                           Logit Regression Results                           
Dep. Variable:          Personal Loan   No. Observations:                 3500
Model:                          Logit   Df Residuals:                     3492
Method:                           MLE   Df Model:                            7
Date:                Thu, 14 Nov 2024   Pseudo R-squ.:                  0.5832
Time:                        18:58:23   Log-Likelihood:                -448.02
converged:                       True   LL-Null:                       -1075.0
Covariance Type:            nonrobust   LLR p-value:                1.517e-266
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                -13.0113      0.669    -19.443      0.000     -14.323     -11.700
Incom
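
Dropping `Age`, `CCAvg`, and `Mortgage` can also be checked as a single nested-model comparison, using the two log-likelihoods reported in the summaries above. This is a sketch and not part of the original notebook: it uses the closed-form chi-square tail for 3 degrees of freedom so that only the standard library is needed, and the variable names are illustrative.

```python
import math

ll_full = -444.53     # model with all 10 predictors
ll_reduced = -448.02  # model without Age, CCAvg, Mortgage
df_diff = 3           # number of predictors dropped

# Likelihood-ratio statistic for the nested comparison
lr_stat = 2 * (ll_full - ll_reduced)

def chi2_sf_3df(x):
    """Upper-tail probability of a chi-square with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2))

p_value = chi2_sf_3df(lr_stat)
print(f"LR statistic = {lr_stat:.2f} on {df_diff} df, p-value = {p_value:.4f}")
```

The p-value comes out above 0.05, which suggests the three dropped predictors are not jointly significant at alpha = 0.05.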

In [None]:
## Final descriptive model (all remaining predictors are significant at alpha = 0.05)

In [None]:
## Income: positive coefficient (+), so higher Income raises the probability of getting a loan

In [None]:
## Interpretations (Income appears to be recorded in thousands of dollars, so 1 unit = $1,000):
## As Income increases by 1 unit, by how much does the PROBABILITY OF GETTING A LOAN CHANGE?
## (not easy to understand) As Income increases by 1 unit, the estimated log(odds) of getting a loan increase by 0.0574 controlling for OTHER FACTORS
## As Income increases by 1 unit, the estimated odds of getting a loan increase by a FACTOR OF 1.0590 (increase by 5.9%) controlling for OTHER FACTORS

In [23]:
import numpy as np
np.exp(0.0574)

1.0590793574234165
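
The first question above (how much the probability itself changes) has no single answer, because the change depends on the starting probability. A small sketch, assuming a few hypothetical baseline probabilities, converts each one to odds, applies the odds ratio `exp(0.0574)`, and converts back:

```python
import math

beta_income = 0.0574                 # Income coefficient quoted above
odds_ratio = math.exp(beta_income)   # multiplicative change in odds per unit

def shift_probability(p, ratio):
    """Probability after the odds p / (1 - p) are multiplied by ratio."""
    odds = p / (1 - p) * ratio
    return odds / (1 + odds)

# Hypothetical baseline probabilities, chosen only for illustration
baselines = [0.05, 0.50, 0.90]
changes = {p: shift_probability(p, odds_ratio) - p for p in baselines}

for p, delta in changes.items():
    print(f"baseline {p:.2f} -> change in probability {delta:+.4f}")
```

The same odds ratio moves the probability most near 0.50 and much less near 0 or 1, which is why the coefficient is interpreted on the odds scale.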

In [25]:
## Online {1: has an online banking acct, 0: does not}
## Interpret the estimate: -0.8040  (the probability to get a LOAN is lower WITH an online account than WITHOUT one)
## (log odds) The log(odds) of getting a loan are 0.8040 lower for those who have ONLINE accts compared to those who do not (CFOF = controlling for other factors)
## (odds) The odds of getting a loan are multiplied by a factor of 0.4475 (about 55% lower) for those who have ONLINE accts compared to those who do not (CFOF)

In [26]:
np.exp(-0.8040)

0.44753523810441237

In [27]:
## e.g., if the odds are 10 without an ONLINE acct, they are 10*0.4475 with one
10*0.4475

4.475

## **Predictive Logistic Regression**

We do not consider goodness-of-fit metrics (p-values, hypothesis tests, ...). We just want to accurately predict NEW DATA.

What about unimportant variables?

In [30]:
## for predictive models use sklearn
from sklearn.linear_model import LogisticRegression

## instance
lr = LogisticRegression(max_iter=1000) ## allow the solver up to 1000 iterations

## fit
lr.fit(X_train, y_train)

In [None]:
## what if it does NOT converge? Increase max_iter or CHANGE THE optimization method
## solver='lbfgs' (the default) is faster but often does not converge within the iteration limit
## solver='liblinear' converges MOST times (as long as there is NO perfect collinearity) - SLOW

In [37]:
## This model uses L2 regularization by default
## L2 shrinks the slopes TOWARD zero to reduce the influence of each variable in the model
lr = LogisticRegression(max_iter=1000)

lr.fit(X_train, y_train)

In [38]:
## estimates
lr.coef_

array([[ 8.35784738e-03,  5.42925780e-02,  6.63688824e-01,
         8.89574958e-02,  1.62277419e+00,  1.13394212e-03,
        -4.07518706e-01,  3.02678088e+00, -7.16442029e-01,
        -9.80041548e-01]])
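
To see what "close to zero" means for the L2 penalty, the sketch below fits a one-feature logistic regression by plain gradient descent, once without a penalty and once with one. It is a hand-rolled illustration on made-up data, not scikit-learn's solver; the function `fit_logistic_1d`, the penalty weight `l2`, and the data are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_1d(xs, ys, l2=0.0, lr=0.5, steps=2000):
    """Gradient descent on mean log-loss + (l2 / 2) * slope**2."""
    intercept, slope = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_b = grad_w = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(intercept + slope * x) - y
            grad_b += err / n
            grad_w += err * x / n
        grad_w += l2 * slope  # the penalty acts on the slope only
        intercept -= lr * grad_b
        slope -= lr * grad_w
    return intercept, slope

# Made-up data: larger x means class 1, with two labels flipped so the
# classes overlap and the unpenalized estimate stays finite
xs = [i * 0.25 for i in range(-8, 9)]
ys = [1 if x > 0 else 0 for x in xs]
ys[xs.index(-0.5)] = 1
ys[xs.index(0.5)] = 0

_, slope_plain = fit_logistic_1d(xs, ys, l2=0.0)
_, slope_l2 = fit_logistic_1d(xs, ys, l2=1.0)

print(f"slope without penalty: {slope_plain:.3f}")
print(f"slope with L2 penalty: {slope_l2:.3f}")
```

In scikit-learn the strength of this penalty is set through the `C` parameter of `LogisticRegression`, where a smaller `C` means stronger regularization.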