# Classification : Logistic Regression
Data : bankloan.csv     
How  :    
- Build a logistic regression model
    - Target = default
    - Features = employ, debtinc, creddebt, othdebt
- Interpret the result
- Validate model accuracy on a 20% held-out test set

## Import Library & Data

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

In [2]:
df = pd.read_csv(r'C:\Users\user\Documents\Data Science\What is Classification_\bankloan.csv')
df

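|  | age | ed | employ | address | income | debtinc | creddebt | othdebt | default |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | 3 | 17 | 12 | 176 | 9.3 | 11.359392 | 5.008608 | 1 |
| 1 | 27 | 1 | 10 | 6 | 31 | 17.3 | 1.362202 | 4.000798 | 0 |
| 2 | 40 | 1 | 15 | 14 | 55 | 5.5 | 0.856075 | 2.168925 | 0 |
| 3 | 41 | 1 | 15 | 14 | 120 | 2.9 | 2.658720 | 0.821280 | 0 |
| 4 | 24 | 2 | 2 | 0 | 28 | 17.3 | 1.787436 | 3.056564 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 695 | 36 | 2 | 6 | 15 | 27 | 4.6 | 0.262062 | 0.979938 | 1 |
| 696 | 29 | 2 | 6 | 4 | 21 | 11.5 | 0.369495 | 2.045505 | 0 |
| 697 | 33 | 1 | 15 | 3 | 32 | 7.6 | 0.491264 | 1.940736 | 0 |
| 698 | 45 | 1 | 19 | 22 | 77 | 8.4 | 2.302608 | 4.165392 | 0 |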


Information :
- age : customer age
- ed : education level (1, 2, 3,...)
- employ : length of employment (in years)
- address : length of stay at the current address
- income : income per month
- debtinc : debt as a percentage of income
- creddebt : nominal credit debt (in 1000 dollars)
- othdebt : nominal debt from other sources (in 1000 dollars)

In [3]:
features = ['employ', 'debtinc', 'creddebt', 'othdebt'] # features
target = ['default'] # target

x = df[features]
y = df[target]

x.describe() # descriptive statistics

|  | employ | debtinc | creddebt | othdebt |
|---|---|---|---|---|
| count | 700.0 | 700.0 | 700.0 | 700.0 |
| mean | 8.388571 | 10.260571 | 1.553553 | 3.058209 |
| std | 6.658039 | 6.827234 | 2.117197 | 3.287555 |
| min | 0.0 | 0.4 | 0.011696 | 0.045584 |
| 25% | 3.0 | 5.0 | 0.369059 | 1.044178 |
| 50% | 7.0 | 8.6 | 0.854869 | 1.987567 |
| 75% | 12.0 | 14.125 | 1.901955 | 3.923065 |
| max | 31.0 | 41.3 | 20.56131 | 27.0336 |



## Modeling

In [4]:
model = sm.Logit(y, sm.add_constant(x)) # model definition
result = model.fit() # training model

Optimization terminated successfully.
         Current function value: 0.411165
         Iterations 7


In [5]:
print(result.summary()) # logistic regression result

                           Logit Regression Results                           
Dep. Variable:                default   No. Observations:                  700
Model:                          Logit   Df Residuals:                      695
Method:                           MLE   Df Model:                            4
Date:                Wed, 21 Jul 2021   Pseudo R-squ.:                  0.2844
Time:                        22:47:38   Log-Likelihood:                -287.82
converged:                       True   LL-Null:                       -402.18
Covariance Type:            nonrobust   LLR p-value:                 2.473e-48
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.2302      0.236     -5.210      0.000      -1.693      -0.767
employ        -0.2436      0.029     -8.456      0.000      -0.300      -0.187
debtinc        0.0885      0.021      4.200      0.0

> Result :    
> 1. LLR p-value : 2.473e-48
> 2. p-value :
    - const : 0.000
    - employ : 0.000
    - debtinc : 0.000
    - creddebt : 0.000
    - othdebt : 0.940
> 3. Coef :
    - employ : -0.2436
    - debtinc : 0.0885
    - creddebt : 0.5041      
    
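Coefficient -> odds ratio: $\mathrm{OR} = \exp(\beta\,(c - a))$, where the feature changes from $a$ to $c$ and the other features stay fixed.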
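The formula follows from the log-odds form of the logistic model. A short derivation, writing $x_j$ for the feature of interest and holding the other features fixed:

```latex
\begin{aligned}
\log\frac{p}{1-p} &= \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k \\
\log \mathrm{odds}(x_j = c) - \log \mathrm{odds}(x_j = a) &= \beta_j\,(c - a) \\
\mathrm{OR} = \frac{\mathrm{odds}(x_j = c)}{\mathrm{odds}(x_j = a)} &= \exp\bigl(\beta_j\,(c - a)\bigr)
\end{aligned}
```

Only the difference $c - a$ enters the result, so $a = 15$, $c = 20$ gives the same odds ratio as any other 5-unit increase.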

In [6]:
c = 20
a = 15

OR_employ = np.exp(0.2436*(c-a)) # coef is -0.2436; its absolute value gives the factor by which the odds decrease
print(OR_employ)

3.380420128015566


In [7]:
c = 20
a = 15

OR_debtinc = np.exp(0.0885*(c-a))
print(OR_debtinc)

1.5565938428137092


In [8]:
c = 20
a = 15

OR_creddebt = np.exp(0.5041*(c-a))
print(OR_creddebt)

12.434812515742879


The modeled outcome is **default** = 1, that is, the risk that a customer pays poorly (defaults).
> Interpretation :
> 1. LLR p-value : 2.473e-48 < 0.05 (**reject H0**), so at least one of the variables **has a significant effect** on default risk.
> 2. p-value :
    - const : 0.000 < 0.05 (**reject H0**), so the model requires an intercept.
    - employ : 0.000 < 0.05 (**reject H0**), so length of employment has a significant **negative effect** on default risk. The longer a person has been employed, the lower their risk of default (employ ranges from 0 to 31 years).
    - debtinc : 0.000 < 0.05 (**reject H0**), so the percentage of debt has a significant **positive effect** on default risk. The greater the percentage of debt, the greater a person's risk of default (debtinc ranges from 0.4 to 41.3 %).
    - creddebt : 0.000 < 0.05 (**reject H0**), so the nominal amount of debt has a significant **positive effect** on default risk. The greater the nominal debt, the greater a person's risk of default (creddebt ranges from about 0.01 to 20.56, in 1000 dollars).
    - othdebt : 0.940 > 0.05 (**fail to reject H0**), so there is not enough evidence that nominal debt from other sources has an effect on default risk.
> 3. Coef :
    - employ : -0.2436, when length of employment increases by 5 years, the odds of default decrease by a factor of 3.38 (other variable values do not change)
    - debtinc : 0.0885, when the percentage of debt increases by 5 percentage points, the odds of default increase by a factor of 1.56 (other variable values do not change)
    - creddebt : 0.5041, when the nominal debt increases by 5000 dollars, the odds of default increase by a factor of 12.43 (other variable values do not change)
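The three odds ratios can also be computed in one pass while keeping the sign of each coefficient, which makes the direction of the effect visible in the number itself. This is a minimal sketch: the coefficients are copied from the summary, and the 5-unit step matches `c - a = 20 - 15`; the names `coefs`, `step`, and `odds_ratios` are illustrative only.

```python
import math

# Coefficients copied from the regression summary (othdebt is left out: not significant).
coefs = {"employ": -0.2436, "debtinc": 0.0885, "creddebt": 0.5041}
step = 5  # size of the increase, same as c - a in the cells above

odds_ratios = {}
for name, beta in coefs.items():
    odds_ratios[name] = math.exp(beta * step)       # multiplicative change in the odds
    pct_change = (odds_ratios[name] - 1) * 100      # the same change in percent
    print(f"{name}: OR = {odds_ratios[name]:.4f} ({pct_change:+.1f}% odds)")
```

With the sign kept, the odds ratio for `employ` comes out below 1 (the reciprocal of 3.38), which reads directly as a reduction in the odds of default.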

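Coefficient of determination:
<br>
Pseudo R-squared = 0.2844 (McFadden): the log-likelihood of the fitted model (-287.82) is 28.44% closer to zero than that of the intercept-only model (-402.18). It is a goodness-of-fit measure and should not be read as a literal share of explained variance, as R-squared is in linear regression.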
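The pseudo R-squared reported by statsmodels is McFadden's, which compares the log-likelihood of the fitted model with that of an intercept-only model. A quick check using the two log-likelihoods from the summary (on a fitted result the same quantities are available as `result.llf`, `result.llnull`, and `result.prsquared`):

```python
# Log-likelihoods copied from the summary table.
ll_model = -287.82   # "Log-Likelihood": fitted model
ll_null = -402.18    # "LL-Null": intercept-only model

# McFadden's pseudo R-squared: 1 - LL(model) / LL(null)
pseudo_r2 = 1 - ll_model / ll_null
print(round(pseudo_r2, 4))
```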



## Multicollinearity

In [9]:
from statsmodels.stats.outliers_influence import variance_inflation_factor # VIF

def calc_vif(X): # VIF calculation function
    # One VIF per column: each column is regressed on all the others
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif

In [10]:
calc_vif(x) 

|  | variables | VIF |
|---|---|---|
| 0 | employ | 2.222753 |
| 1 | debtinc | 3.045977 |
| 2 | creddebt | 2.816577 |
| 3 | othdebt | 4.116876 |


All VIF values are below 5, a commonly used rule-of-thumb threshold, so there is no sign of a multicollinearity problem.
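One caveat may be worth checking: `variance_inflation_factor` regresses each column on the others exactly as given, so passing `x` without a constant produces VIFs from regressions that have no intercept. The sketch below is a NumPy-only illustration of the intercept-including version on synthetic data; the function name `centered_vif` and the data are hypothetical and not part of the notebook. With statsmodels, the analogous step would presumably be to pass `sm.add_constant(x)` to `calc_vif` and ignore the row for `const`.

```python
import numpy as np

def centered_vif(X):
    """VIF per column: regress it on the other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])   # intercept + remaining columns
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))                    # VIF = 1 / (1 - R^2)
    return np.array(vifs)

# Synthetic data: column 2 is nearly a copy of column 0, column 1 is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)
vifs = centered_vif(np.column_stack([a, b, c]))
print(vifs)
```

The two nearly collinear columns should get large VIFs, while the independent column stays close to 1.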

        

        
## Model Validation
Using 20% of the data as the test set.

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [12]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 2020) # data splitting

In [13]:
sm_logit_train = sm.Logit(y_train, sm.add_constant(x_train)) # model definition
result_train = sm_logit_train.fit() # training model

Optimization terminated successfully.
         Current function value: 0.408607
         Iterations 7


In [14]:
y_prob = result_train.predict(sm.add_constant(x_test)) # probability of default risk
y_class = np.where(y_prob > 0.5, 1, 0) # default or non-default class
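For a logit model, `predict` returns probabilities obtained by passing the linear predictor through the logistic function, so a 0.5 cut-off is the same as asking whether the linear predictor is positive. A minimal sketch with made-up linear-predictor values (hypothetical, not taken from the fitted model):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear predictors (const + sum of coef * feature) for four customers.
linear_pred = np.array([-2.0, -0.1, 0.3, 1.5])

prob = sigmoid(linear_pred)
pred_class = np.where(prob > 0.5, 1, 0)   # same rule as in the cell above

# Thresholding the probability at 0.5 matches thresholding the linear predictor at 0.
assert np.array_equal(pred_class, (linear_pred > 0).astype(int))
print(prob.round(3), pred_class)
```

A lower cut-off than 0.5 would flag more customers as likely defaulters, at the cost of more false alarms.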

In [15]:
print('Model accuracy in test dataset: ',accuracy_score(y_test, y_class))

Model accuracy in test dataset:  0.8214285714285714


The model classifies about 82% of the test customers correctly: out of every 100 customers, roughly 82 are expected to be assigned the right class (default / non-default).
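Accuracy alone does not show which kind of error the model makes, and it can look good when one class dominates. The sketch below uses hypothetical labels (not the notebook's test set) to compute accuracy next to the confusion counts and a majority-class baseline. In scikit-learn, `confusion_matrix` and `recall_score` from `sklearn.metrics` provide the same quantities.

```python
import numpy as np

# Hypothetical true and predicted labels (1 = default), for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # defaults correctly flagged
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # non-defaults correctly passed
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false alarms
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # missed defaults

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)                          # share of actual defaults caught
baseline = max(np.mean(y_true), 1 - np.mean(y_true))  # always predict the majority class

print(f"accuracy={accuracy:.2f} recall={recall:.2f} baseline={baseline:.2f}")
```

In this toy case accuracy is no better than always predicting "non-default", and most defaults are missed, which is why it helps to compare the 0.82 above against the share of non-defaulters in the test set.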