<h3 style="text-align:center; font-size:36px; color:black; font-weight:bold"> Default Credit Score Case</h3>
<h3 style="text-align:center; font-size:26px; color:black">Analytics</h3>

---

The purpose of this notebook is to understand the business through a suitable regression model. More advanced models can outperform GLMs in some cases, so the model developed here will be used only to quantify the effect of the covariates (independent variables) on the business, since GLMs are fully interpretable while more advanced models often are not. It can also serve as a baseline model.

---

# 1. Loading libraries

In [52]:
import os
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from imblearn.over_sampling import SMOTE
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score, confusion_matrix
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns

# 2. Loading data

In [11]:
file = os.path.join('data','processed_dataframe.csv')
try:
    df = pd.read_csv(file)
    print(f' data frame: {file} read')
except FileNotFoundError:
    # Only a missing path is handled here; other read errors propagate
    print(f'error in loading dataframe, verify the path or file {file}')

 data frame: data\processed_dataframe.csv read


In [12]:
df.round(2).head(5)

Unnamed: 0,Default,UIS,age,NTD3059,RDW,MW,OCL,NTDGT90,NB,NTD6089,ND
0,1.0,0.77,45.0,2.0,0.8,9120.0,13.0,0.0,6.0,0.0,2.0
1,0.0,0.96,40.0,0.0,0.12,2600.0,4.0,0.0,0.0,0.0,1.0
2,0.0,0.66,38.0,1.0,0.09,3042.0,2.0,1.0,0.0,0.0,0.0
3,0.0,0.23,30.0,0.0,0.04,3300.0,5.0,0.0,0.0,0.0,0.0
4,0.0,0.21,74.0,0.0,0.38,3500.0,3.0,0.0,1.0,0.0,1.0


# 3. Regression Model - Binomial Model with Logistic Link

Since our target (Default) has two categories, default (1) and non-default (0), several Generalized Linear Models (GLMs) are available for this kind of data, such as log-linear models and logistic models. For the sake of simplicity, we will use logistic regression, that is, the binomial model with a logit link function.
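
As a minimal sketch of the logit link (the function names `logit` and `inv_logit` are illustrative and not part of this notebook's code), the link maps a probability to log-odds, and its inverse maps a linear predictor back to a probability:

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Probability implied by a linear predictor eta = x'beta."""
    return 1.0 / (1.0 + math.exp(-eta))

# A linear predictor of 0 corresponds to a default probability of 0.5
print(inv_logit(0.0))
# The two functions are inverses of each other
print(round(inv_logit(logit(0.2)), 6))
```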

## 3.1. First Iteration

In [32]:
# Getting features and target
X = df[['UIS','age','RDW','MW','NTD3059','NTD6089','NTDGT90','NB','ND','OCL']]
y = df['Default'].to_numpy()  # 1-D target array

# Apply SMOTE to oversample the minority class (no train/test split is made here)
smote = SMOTE(sampling_strategy = 'not majority', random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Fitting the model
# Note: sm.Logit does not add an intercept; none is included here
log_model = sm.Logit(y_resampled, X_resampled)
fit_model = log_model.fit()

print(fit_model.summary())

Optimization terminated successfully.
         Current function value: 0.478411
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:               142966
Model:                          Logit   Df Residuals:                   142956
Method:                           MLE   Df Model:                            9
Date:                Wed, 14 Feb 2024   Pseudo R-squ.:                  0.3098
Time:                        08:30:58   Log-Likelihood:                -68396.
converged:                       True   LL-Null:                       -99096.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
UIS            1.7725      0.020     87.459      0.000       1.733       1.812
age           -0.0319      0.

The result of the logistic regression is presented above. As the table shows, the coefficient for ND (Number of Dependents) has a high p-value, so we fail to reject H<sub>0</sub>, the hypothesis that the coefficient is zero. For all the other coefficients we found strong evidence to reject H<sub>0</sub>.

The pseudo R-squared statistic indicates how well a GLM fits the data. For this model its value is 0.3098 on a scale from 0 to 1. Unlike the R-squared of a linear model, it measures the relative improvement in log-likelihood over the null model (1 - LL/LL-Null) rather than the share of data variability explained, and its values tend to be much lower, so 0.31 on its own does not indicate a poor fit.
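
The reported value can be reproduced by hand from the two log-likelihoods in the summary, assuming the statistic is McFadden's pseudo R-squared. The variable names below are illustrative:

```python
# Log-likelihoods copied from the first-iteration summary above
ll_full = -68396.0   # Log-Likelihood
ll_null = -99096.0   # LL-Null

# McFadden's pseudo R-squared: relative improvement over the null model
pseudo_r2 = 1.0 - ll_full / ll_null
print(round(pseudo_r2, 4))
```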

Another point to note is the comparison between the log-likelihood of the full model and that of the null model (denoted by LL-Null). The output above reports the p-value of the likelihood ratio test between the two models (denoted by LLR p-value). This p-value is very low, meaning that we have strong evidence to reject the hypothesis H<sub>0</sub>: "the full model is equivalent to the null model".

## 3.2. Second Iteration

In this second iteration, we fit the logistic regression without the variable ND to obtain a more parsimonious model. Note that the feature list in the code below also omits MW.

In [42]:
# Keep the reduced feature set (ND and MW are not in the list)
X_resampled = X_resampled[['UIS','age','RDW','NTD3059','NTD6089','NTDGT90','NB','OCL']]

# Fitting the model
log_model = sm.Logit(y_resampled, X_resampled)
fit_model = log_model.fit()

print(fit_model.summary())

Optimization terminated successfully.
         Current function value: 0.480385
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:                      y   No. Observations:               142966
Model:                          Logit   Df Residuals:                   142958
Method:                           MLE   Df Model:                            7
Date:                Wed, 14 Feb 2024   Pseudo R-squ.:                  0.3070
Time:                        08:58:40   Log-Likelihood:                -68679.
converged:                       True   LL-Null:                       -99096.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
UIS            1.6652      0.019     85.448      0.000       1.627       1.703
age           -0.0360      0.

As we can see, we now have strong evidence to reject the hypothesis H<sub>0</sub>: &beta;=0 for the coefficients, as shown by their low p-values. The pseudo R-squared is practically unchanged (0.3070 versus 0.3098), and the LLR p-value shows that the current model is not equivalent to the null model.

To check whether removing these variables affects the prediction of the target, we can perform the likelihood ratio test between the current model and the full model. Since the feature list of the current model omits two variables of the full model (ND and MW), the statistic has 2 degrees of freedom:

<center> &Lambda; = -2&times;[log(Likelihood<sub>current</sub>)-log(Likelihood<sub>full</sub>)] &sim; &chi;<sup>2</sup>(2)</center>

In [44]:
# Likelihood Ratio Test

LLc, LLf = -68679, -68396   # log-likelihoods: current (reduced) and full model
LLR = -2*(LLc-LLf)

# Degrees of freedom = number of dropped variables (ND and MW)
pval = stats.chi2.sf(LLR, 2)

print('-------------------------------------')
print(f'Log-likelihood ratio test: {pval.round(4)}')
print('-------------------------------------')

-------------------------------------
Log-likelihood ratio test: 0.0
-------------------------------------


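As we can see, the LLR test yields a very low p-value, so we have strong evidence to reject H<sub>0</sub>: "both models are equivalent". In other words, the removed variables carry statistically significant information, although the pseudo R-squared changes little (0.3098 to 0.3070).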
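
A small sketch of the same test using only the standard library, under the assumption that two variables (ND and MW, as in the feature lists of the two iterations) separate the models. For 2 degrees of freedom the chi-square survival function reduces to exp(-x/2), which also shows that the p-value is tiny rather than exactly zero. The function name `lr_test_df2` is hypothetical:

```python
import math

def lr_test_df2(ll_reduced, ll_full):
    """Likelihood ratio statistic and p-value for 2 dropped parameters."""
    stat = -2.0 * (ll_reduced - ll_full)
    # For a chi-square with 2 degrees of freedom, P(X > x) = exp(-x / 2)
    p_value = math.exp(-stat / 2.0)
    return stat, p_value

# Log-likelihoods copied from the two summaries above
stat, p_value = lr_test_df2(-68679.0, -68396.0)
print(stat)               # likelihood ratio statistic
print(f"{p_value:.3e}")   # far below any usual significance level
```

With SciPy, `stats.chi2.sf(stat, df)` gives the same tail probability for any number of degrees of freedom and avoids the rounding loss of `1 - cdf`.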

Another measure of goodness of fit is the Area Under the Receiver Operating Characteristic curve (ROC AUC).

In [45]:
# Area under the ROC curve.
AUC = roc_auc_score(y_resampled, fit_model.fittedvalues )

print('-------------------------------------')
print(f'Area Under ROC: {AUC.round(2)}')
print('-------------------------------------')

-------------------------------------
Area Under ROC: 0.86
-------------------------------------


As we can see, the area under the ROC curve is 0.86, meaning that the model discriminates well between defaults and non-defaults according to rules of thumb in the literature, such as those of Hosmer and Lemeshow.
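
The call above passes `fit_model.fittedvalues` to `roc_auc_score`. Assuming these values are linear predictors rather than probabilities, the AUC is unaffected, because it depends only on the ranking of the scores. A minimal pure-Python sketch on made-up toy data (the helper `auc_by_pairs` and the numbers are illustrative only):

```python
import math

def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and linear predictors (not taken from the notebook's data)
y_toy = [0, 0, 1, 0, 1, 1]
eta_toy = [-2.0, -0.5, -0.1, 0.3, 1.2, 2.5]
# Probabilities obtained through the inverse logit, a monotone transformation
prob_toy = [1.0 / (1.0 + math.exp(-e)) for e in eta_toy]

auc_eta = auc_by_pairs(y_toy, eta_toy)
auc_prob = auc_by_pairs(y_toy, prob_toy)
print(auc_eta, auc_prob)
```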

Another diagnostic for the logistic regression model is the confusion matrix:

In [67]:
pd.DataFrame(fit_model.pred_table(threshold=0.5))

Unnamed: 0,0,1
0,57655.0,13828.0
1,17544.0,53939.0


The confusion matrix for the parsimonious model is shown above, where the rows represent the observed classes and the columns the predicted classes. The false positives (13,828) and false negatives (17,544) are well below the true negatives (57,655) and true positives (53,939), by roughly 76% and 67% respectively. So, we can conclude that this model fits the data fairly well and is a good choice to quantify the effects of the covariates when analysing defaults.
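
The usual summary rates can be derived from the four counts in the matrix. The sketch below uses the figures printed above, which refer to the SMOTE-resampled data; the variable names are illustrative:

```python
# Counts copied from the confusion matrix above (rows: observed, columns: predicted)
tn, fp = 57655, 13828
fn, tp = 17544, 53939

total = tn + fp + fn + tp
accuracy = (tp + tn) / total        # share of correct predictions
sensitivity = tp / (tp + fn)        # true positive rate (recall)
specificity = tn / (tn + fp)        # true negative rate
precision = tp / (tp + fp)          # share of predicted defaults that are defaults

for name, value in [("accuracy", accuracy), ("sensitivity", sensitivity),
                    ("specificity", specificity), ("precision", precision)]:
    print(f"{name}: {value:.3f}")
```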

# 4. Conclusions

* The binomial model presented can be considered a fair model to assess the default classification, as evidenced by the AUC and the confusion matrix.

* The exponentiated coefficients of the model can be interpreted as odds ratios. This analysis gives us an understanding of the business, as follows:

In [46]:
pd.DataFrame(np.exp(fit_model.params)).transpose().round(2)

Unnamed: 0,UIS,age,RDW,NTD3059,NTD6089,NTDGT90,NB,OCL
0,5.29,0.96,1.71,2.23,4.12,5.07,0.97,1.01


| Variable | Odds ratio | Interpretation |
| --- | --- | --- |
| UIS | 5.29 | For each 10-percentage-point increase in the utilization of unsecured credit lines relative to the limit, the odds of default rise by about 18% (5.29<sup>0.1</sup> &asymp; 1.18)* |
| age | 0.96 | The odds of default decrease by about 16% for each 5 years of age (exp(5 &times; -0.0360) &asymp; 0.84) |
| RDW | 1.71 | Each 10-percentage-point rise in the debt ratio raises the odds of default by about 5.5% (1.71<sup>0.1</sup> &asymp; 1.055) |
| NTD3059 | 2.23 | Each time the client has been in default between 30 and 59 days, the odds of default rise by 123% (1.23) |
| NTD6089 | 4.12 | Each time the client has been in default between 60 and 89 days, the odds of default rise by 312% (3.12) |
| NTDGT90 | 5.07 | Each time the client has been in default for 90 days or more, the odds of default rise by 407% (4.07) |
| NB | 0.97 | Each real estate loan decreases the odds of default by 3% (0.03) |
| OCL | 1.01 | Each open credit loan (number of loans) raises the odds of default by 1% (0.01) |
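
An odds ratio refers to a one-unit change in the covariate. For a change of a different size the coefficient is scaled before exponentiating, so the effect is multiplicative rather than proportional. A minimal sketch using the two coefficients visible in the second-iteration summary (the helper name `odds_ratio_for_change` is hypothetical):

```python
import math

def odds_ratio_for_change(coef, delta):
    """Multiplicative change in the odds when a covariate moves by delta units."""
    return math.exp(coef * delta)

# Coefficients read from the second-iteration summary above
or_uis_10pct = odds_ratio_for_change(1.6652, 0.10)   # +0.10 in UIS
or_age_5y = odds_ratio_for_change(-0.0360, 5)        # +5 years of age

print(f"UIS +0.10: odds x {or_uis_10pct:.3f}")
print(f"age +5:    odds x {or_age_5y:.3f}")
```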