
In [96]:
%pip install dmba



In [97]:
%matplotlib inline
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, roc_auc_score, roc_curve
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pylab as plt
import seaborn as sns
from dmba import classificationSummary, gainsChart, liftChart
from dmba.metric import AIC_score
import math
from scipy.stats import chi2


DATA = Path('/content/sample_data/')

#***Section A:*** ⤵
####Fit a logistic regression model to reproduce parameter (coefficient) estimates (up to 4 decimals) in Tables 7(a), 8, 9 of this article using the SBA case data SBAcase.11.13.17.csv by using (1) sklearn LogisticRegression() liblinear solver and (2) sklearn LogisticRegression() Default Solver 'lbfgs'.
____________

#Using sklearn LogisticRegression() liblinear solver

##Table 7(a).1

In [102]:
sba_df = pd.read_csv(DATA / 'SBAcase.11.13.17.csv')

# TARGET VARIABLE - 'Default' is our dummy variable derived from "MIS_Status"
# The value for “Default” = 1 if MIS_Status = CHGOFF, and “Default” = 0 if MIS_Status = PIF
sba_df['Default'] = np.where(sba_df['MIS_Status'] == 'CHGOFF', 1, 0)
sba_df.drop(columns=['MIS_Status'], inplace=True)

# Training data only
train = sba_df[sba_df['Selected'] == 1].copy()

# Predictors for Table 7(a)
predictors = ['New', 'RealEstate', 'DisbursementGross', 'Portion', 'Recession']
X_train = train[predictors].copy()
y_train = train['Default']

scaler = StandardScaler()
X_train['DisbursementGross'] = scaler.fit_transform(X_train[['DisbursementGross']])

# Fit logistic regression (liblinear, very large C to mimic no penalty)
logit_reg = LogisticRegression(solver='liblinear', penalty='l2', C=1e42, max_iter=1000)
logit_reg.fit(X_train, y_train)


print(f"intercept  {logit_reg.intercept_[0]:.4f}")
coef_df = pd.DataFrame({'coeff': logit_reg.coef_[0]}, index=predictors)
print(coef_df.round(4))   # round to 4 decimals
print()


intercept  1.2703
                    coeff
New               -0.0772
RealEstate        -2.0329
DisbursementGross -0.1160
Portion           -2.8297
Recession          0.4971
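
The code comment "very large C to mimic no penalty" can be made precise. As a sketch, assuming the L2-penalized objective in the form scikit-learn's user guide has given it (labels coded as $y_i \in \{-1, 1\}$, weight vector $w$, intercept $c$; this notation is an assumption and does not come from the article):

$$\min_{w,\,c}\ \frac{1}{2} w^\top w + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i (x_i^\top w + c)}\right)$$

Dividing the objective by $C$ does not change its minimizer:

$$\min_{w,\,c}\ \frac{1}{2C} w^\top w + \sum_{i=1}^{n} \log\left(1 + e^{-y_i (x_i^\top w + c)}\right)$$

With $C = 10^{42}$ the penalty weight $\frac{1}{2C}$ is effectively zero, so the fit should approach the unpenalized maximum-likelihood estimate, up to the solver's convergence tolerance.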



##Table 8.1

In [103]:
sba_df = pd.read_csv(DATA / 'SBAcase.11.13.17.csv')

# TARGET VARIABLE - 'Default' is our dummy variable derived from "MIS_Status"
# The value for “Default” = 1 if MIS_Status = CHGOFF, and “Default” = 0 if MIS_Status = PIF
sba_df['Default'] = np.where(sba_df['MIS_Status'] == 'CHGOFF', 1, 0)
sba_df.drop(columns=['MIS_Status'], inplace=True)

# Training data only
train = sba_df[sba_df['Selected'] == 1].copy()

# Predictors for Table 8
predictors = ['RealEstate', 'Portion', 'Recession']
X_train = train[predictors]
y_train = train['Default']

# validation/test set (Selected = 0)
valid_X = sba_df[sba_df['Selected'] == 0][predictors]
valid_y = sba_df[sba_df['Selected'] == 0]['Default']

# Fit logistic regression (liblinear, very large C to mimic no penalty)
logit_reg = LogisticRegression(solver='liblinear', penalty='l2', C=1e42, max_iter=1000)
logit_reg.fit(X_train, y_train)


print(f"intercept  {logit_reg.intercept_[0]:.4f}")
coef_df = pd.DataFrame({'coeff': logit_reg.coef_[0]}, index=predictors)
print(coef_df.round(4))   # round to 4 decimals
print()

intercept  1.3930
             coeff
RealEstate -2.1282
Portion    -2.9874
Recession   0.5041



##Table 9.1

In [104]:
#code from Discussion Assignment #1  - sklearn LogisticRegression() liblinear solver
#Table 9

# Predicted probability of default (class 1), classified below with a 0.5 cutoff
predictions = logit_reg.predict_proba(valid_X)[:, 1]
predictions_nominal = [ 0 if x < 0.5 else 1 for x in predictions]
classificationSummary(valid_y, predictions_nominal)

Confusion Matrix (Accuracy 0.6784)

       Prediction
Actual   0   1
     0 682  14
     1 324  31


# Using sklearn LogisticRegression() Default Solver **'lbfgs'**

##Table 7(a).2


In [106]:
sba_df = pd.read_csv(DATA / 'SBAcase.11.13.17.csv')

# TARGET VARIABLE - 'Default' is our dummy variable derived from "MIS_Status"
# The value for “Default” = 1 if MIS_Status = CHGOFF, and “Default” = 0 if MIS_Status = PIF
sba_df['Default'] = np.where(sba_df['MIS_Status'] == 'CHGOFF', 1, 0)
sba_df.drop(columns=['MIS_Status'], inplace=True)

# Training data only
train = sba_df[sba_df['Selected'] == 1].copy()

# Predictors for Table 7(a)
predictors = ['New', 'RealEstate', 'DisbursementGross', 'Portion', 'Recession']
X_train = train[predictors].copy()
y_train = train['Default']

# validation/test set (Selected = 0)
valid_X = sba_df[sba_df['Selected'] == 0][predictors]
valid_y = sba_df[sba_df['Selected'] == 0]['Default']

logit_reg_default = LogisticRegression(penalty=None, solver='lbfgs', max_iter=10000)
logit_reg_default.fit(X_train, y_train)

print(f"intercept  {logit_reg_default.intercept_[0]:.4f}")
coef_df = pd.DataFrame({'coeff': logit_reg_default.coef_[0]}, index=predictors)
print(coef_df.round(4))   # round to 4 decimals
print()


intercept  0.6887
                    coeff
New               -0.2054
RealEstate        -2.6111
DisbursementGross -0.0000
Portion           -1.5922
Recession          0.3239



##Table 8.2

In [107]:
sba_df = pd.read_csv(DATA / 'SBAcase.11.13.17.csv')

# TARGET VARIABLE - 'Default' is our dummy variable derived from "MIS_Status"
# The value for “Default” = 1 if MIS_Status = CHGOFF, and “Default” = 0 if MIS_Status = PIF
sba_df['Default'] = np.where(sba_df['MIS_Status'] == 'CHGOFF', 1, 0)
sba_df.drop(columns=['MIS_Status'], inplace=True)

# Training data only
train = sba_df[sba_df['Selected'] == 1].copy()

# Predictors for Table 8
predictors = ['RealEstate', 'Portion', 'Recession']
X_train = train[predictors].copy()
y_train = train['Default']

# validation/test set (Selected = 0)
valid_X = sba_df[sba_df['Selected'] == 0][predictors]
valid_y = sba_df[sba_df['Selected'] == 0]['Default']

logit_reg_default = LogisticRegression(penalty=None, solver='lbfgs', max_iter=10000)
logit_reg_default.fit(X_train, y_train)

print(f"intercept  {logit_reg_default.intercept_[0]:.4f}")
coef_df = pd.DataFrame({'coeff': logit_reg_default.coef_[0]}, index=predictors)
print(coef_df.round(4))   # round to 4 decimals
print()


intercept  1.4010
             coeff
RealEstate -2.1301
Portion    -3.0000
Recession   0.5002



##Table 9.2

In [108]:
#code from Discussion Assignment #1  - sklearn LogisticRegression() default solver 'lbfgs'
#Table 9

# Score the validation set with the lbfgs model fitted in Table 8.2
predictions = logit_reg_default.predict(valid_X)
predictions_nominal = [ 0 if x < 0.5 else 1 for x in predictions]
classificationSummary(valid_y, predictions_nominal)

Confusion Matrix (Accuracy 0.6784)

       Prediction
Actual   0   1
     0 682  14
     1 324  31


#***Section B*** ⤵
####Refer to Table 8 of the article. Write the estimated equation that associates the outcome variable (i.e., default or not) with predictors RealEstate, Portion, and Recession, in three formats:
* The logit as a function of the predictors
* The odds as a function of the predictors
* The probability as a function of the predictors
------


Estimated Equation(s):
* B0 = Intercept
* B1 = coefficient of RealEstate (X1)
* B2 = coefficient of Portion (X2)
* B3 = coefficient of Recession (X3)

p = P(Default = 1 | RealEstate, Portion, Recession)

1) Logit Function: $\ln\left(\frac{p}{1-p}\right) = B_0 + B_1 X_1 + B_2 X_2 + B_3 X_3$
* $\ln\left(\frac{p}{1-p}\right) = 1.3931 - 2.1282\,X_1 - 2.9875\,X_2 + 0.5041\,X_3$

2) Odds Function: $\frac{p}{1-p} = e^{\text{logit}}$
* $\frac{p}{1-p} = e^{1.3931 - 2.1282\,X_1 - 2.9875\,X_2 + 0.5041\,X_3}$

3) Probability Function: $p = \frac{\text{odds}}{1 + \text{odds}}$
* $p = \dfrac{e^{1.3931 - 2.1282\,X_1 - 2.9875\,X_2 + 0.5041\,X_3}}{1 + e^{1.3931 - 2.1282\,X_1 - 2.9875\,X_2 + 0.5041\,X_3}}$
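
The three formats can be evaluated numerically for a single loan. This is a minimal sketch that uses the liblinear estimates printed under Table 8.1 in this notebook; the function names and the example loan (no real estate collateral, 50% SBA-guaranteed portion, not active during the recession) are hypothetical and chosen only for illustration.

```python
import math

# Estimates printed under Table 8.1 (liblinear fit in this notebook).
INTERCEPT = 1.3930
COEF = {"RealEstate": -2.1282, "Portion": -2.9874, "Recession": 0.5041}

def default_logit(real_estate, portion, recession):
    """Log-odds of default as a linear function of the predictors."""
    return (INTERCEPT
            + COEF["RealEstate"] * real_estate
            + COEF["Portion"] * portion
            + COEF["Recession"] * recession)

def default_odds(real_estate, portion, recession):
    """Odds of default: e raised to the logit."""
    return math.exp(default_logit(real_estate, portion, recession))

def default_probability(real_estate, portion, recession):
    """Probability of default: odds / (1 + odds)."""
    odds = default_odds(real_estate, portion, recession)
    return odds / (1 + odds)

# Hypothetical loan, for illustration only.
loan = dict(real_estate=0, portion=0.5, recession=0)
print(f"logit       {default_logit(**loan):.4f}")
print(f"odds        {default_odds(**loan):.4f}")
print(f"probability {default_probability(**loan):.4f}")
```

A negative logit corresponds to odds below 1 and a probability below 0.5, so this hypothetical loan would be classified as a non-default at the 0.5 cutoff.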

#***Section C*** ⤵
#####Explain why risk indicators in Table 8 were selected using p-values in Table 7(a).
------


The risk indicators in Table 8 were selected using the p-values in Table 7(a) to make the model simpler and more reliable. Table 7(a) includes five predictors, but DisbursementGross and New were dropped because their p-values were greater than 0.05. They are therefore not statistically significant: there is not enough evidence that they affect the probability of default once the other predictors are in the model. The remaining predictors (RealEstate, Portion, and Recession) all had p-values below 0.05 and were retained.
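
The selection rule can be sketched in code. This assumes the p-values are two-sided Wald tests (coefficient divided by its standard error, compared with a standard normal), which is what most logistic regression software reports; the article's tables do not confirm the test used. The predictor names, coefficients, and standard errors below are hypothetical placeholders, not values from Table 7(a).

```python
import math

def wald_p_value(coef, std_err):
    """Two-sided p-value for H0: coefficient = 0 under a standard normal."""
    z = coef / std_err
    # P(|Z| > |z|) for Z ~ N(0, 1) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical (coefficient, standard error) pairs, for illustration only.
estimates = {
    "x_a": (-2.0, 0.25),
    "x_b": (-0.1, 0.20),
    "x_c": (0.5, 0.20),
}

ALPHA = 0.05  # significance level used in Section C
kept = [name for name, (coef, se) in estimates.items()
        if wald_p_value(coef, se) < ALPHA]

for name, (coef, se) in estimates.items():
    print(f"{name}: z = {coef / se:6.2f}, p = {wald_p_value(coef, se):.4f}")
print("kept:", kept)
```

In this made-up example `x_b` is dropped because its coefficient is small relative to its standard error, which mirrors the reasoning applied to DisbursementGross and New.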