# Explainable AI: Applications in Credit Scoring

Thesis: Explainable AI: Applications in Credit Scoring <br>
Degree: Master of Information Management <br>
Dataset: Give Me Some Credit (GMC), taken from Kaggle

### Import all the necessary packages

In [42]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

### Importing the dataset

In [7]:
path = "~/Desktop/Master of Information Management/Master's Thesis/GiveMeSomeCredit/cs-training.csv"
df_training = pd.read_csv(path)

In [11]:
df_training.drop(['Unnamed: 0'], axis=1, inplace=True)
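The `Unnamed: 0` column dropped above is typically what pandas produces when the first CSV column is a saved row index with no header name. As a minimal sketch, assuming that is the case here, the column can be avoided at load time with `index_col=0`. The inline CSV text is a made-up two-row stand-in, not the real `cs-training.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: the first header field is empty,
# which is the pattern that yields an "Unnamed: 0" column by default.
csv_text = ",SeriousDlqin2yrs,age\n1,1,45\n2,0,40\n"

# Default behaviour: the unnamed first column becomes a regular column.
df_default = pd.read_csv(io.StringIO(csv_text))

# Alternative: treat the first column as the row index while reading.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

Either route leaves the same feature columns. Note that `index_col=0` keeps the file's row ids as the index, whereas dropping the column keeps the default `RangeIndex`.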

### Pre-processing

#### Dealing with missing values

In [13]:
imputer = SimpleImputer(strategy="median")

In [15]:
imputer.fit(df_training)

SimpleImputer(strategy='median')

In [18]:
X = imputer.transform(df_training)

In [23]:
df_training = pd.DataFrame(X, columns=df_training.columns, index=df_training.index)

In [3]:
## check that no missing values remain after imputation
df_training.isna().any()
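The imputation steps above can be checked on a small example. This is a minimal sketch on made-up values (the toy frame is hypothetical and only borrows two column names from the dataset). It shows that `SimpleImputer(strategy="median")` fills each gap with that column's median, and that wrapping the result in a `DataFrame` restores the column names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "MonthlyIncome": [3000.0, np.nan, 5000.0, 10000.0],
    "NumberOfDependents": [0.0, 2.0, np.nan, 1.0],
})

imputer = SimpleImputer(strategy="median")

# fit_transform returns a NumPy array, so rebuild the DataFrame
# with the original column names and index.
filled = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# Each gap is replaced by the median of the observed values in its column.
print(filled)
print(filled.isna().any())
```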

#### Dealing with multicollinearity based on VIF <= 10

In [44]:
## independent variables
X = df_training[['RevolvingUtilizationOfUnsecuredLines', 'age',
       'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'MonthlyIncome',
       'NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate',
       'NumberRealEstateLoansOrLines', 'NumberOfTime60-89DaysPastDueNotWorse',
       'NumberOfDependents']]

In [46]:
## VIF dataframe
vif_data = pd.DataFrame()

In [47]:
vif_data["feature"] = X.columns

In [48]:
## calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]

In [49]:
print(vif_data)

                                feature        VIF
0  RevolvingUtilizationOfUnsecuredLines   1.000777
1                                   age   3.638439
2  NumberOfTime30-59DaysPastDueNotWorse  41.173243
3                             DebtRatio   1.049552
4                         MonthlyIncome   1.269632
5       NumberOfOpenCreditLinesAndLoans   4.570548
6               NumberOfTimes90DaysLate  73.196237
7          NumberRealEstateLoansOrLines   2.304678
8  NumberOfTime60-89DaysPastDueNotWorse  91.181441
9                    NumberOfDependents   1.403443
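For reference, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below computes this with NumPy on synthetic data, with an intercept included in each auxiliary regression. To my knowledge, `variance_inflation_factor` in statsmodels does not add a constant itself, so the values printed above (computed on `X.values` without one) may differ from an intercept-based VIF. The helper name `vif_with_intercept` and the synthetic columns are assumptions for illustration only.

```python
import numpy as np


def vif_with_intercept(X):
    """Return the VIF of each column, regressing it on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    vifs = []
    for j in range(n_cols):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        ss_res = float(residuals @ residuals)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r_squared = 1.0 - ss_res / ss_tot
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)


# Synthetic example: column c is almost a copy of column a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)

vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(vifs)  # a and c should be far above 10, b close to 1
```

With the real data, a comparable result could likely be obtained by adding a constant column to `X` before calling `variance_inflation_factor` and ignoring the VIF reported for the constant.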


In [51]:
## drop 'NumberOfTimes90DaysLate' and 'NumberOfTime60-89DaysPastDueNotWorse' and repeat process
X = df_training[['RevolvingUtilizationOfUnsecuredLines', 'age',
       'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'MonthlyIncome',
       'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines',
       'NumberOfDependents']]

In [52]:
vif_data=pd.DataFrame()

In [53]:
vif_data["feature"] = X.columns

In [54]:
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]

In [55]:
print(vif_data)

                                feature       VIF
0  RevolvingUtilizationOfUnsecuredLines  1.000777
1                                   age  3.623498
2  NumberOfTime30-59DaysPastDueNotWorse  1.007052
3                             DebtRatio  1.049539
4                         MonthlyIncome  1.269621
5       NumberOfOpenCreditLinesAndLoans  4.502202
6          NumberRealEstateLoansOrLines  2.303571
7                    NumberOfDependents  1.399348
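The "drop and repeat" procedure above can also be automated. The following is a sketch of one common variant, not the procedure used in this notebook: it removes the single feature with the highest VIF, recomputes, and stops once every remaining VIF is at or below the threshold. The function names, the toy column names, and the synthetic data are all hypothetical.

```python
import numpy as np
import pandas as pd


def compute_vif(frame):
    """VIF per column, using an intercept in each auxiliary regression."""
    values = frame.to_numpy(dtype=float)
    n_rows = values.shape[0]
    result = {}
    for j, name in enumerate(frame.columns):
        y = values[:, j]
        design = np.column_stack([np.ones(n_rows), np.delete(values, j, axis=1)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residuals = y - design @ coef
        # VIF = 1 / (1 - R^2) = total sum of squares / residual sum of squares
        result[name] = ((y - y.mean()) ** 2).sum() / (residuals @ residuals)
    return pd.Series(result)


def drop_high_vif(frame, threshold=10.0):
    """Drop the highest-VIF column repeatedly until all VIFs are <= threshold."""
    kept = frame.copy()
    while kept.shape[1] > 1:
        vif = compute_vif(kept)
        if vif.max() <= threshold:
            break
        kept = kept.drop(columns=vif.idxmax())
    return kept


# Synthetic data: three nearly identical "past due" counts plus an unrelated column.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "late_30_59": base + 0.05 * rng.normal(size=500),
    "late_60_89": base + 0.05 * rng.normal(size=500),
    "late_90": base + 0.05 * rng.normal(size=500),
    "age": rng.normal(size=500),
})

reduced = drop_high_vif(toy)
print(compute_vif(reduced))
```

Dropping one feature at a time is more conservative than removing several at once, because removing one collinear feature often lowers the VIF of the others. That is the same effect seen above, where `NumberOfTime30-59DaysPastDueNotWorse` falls from about 41 to about 1 once the other two past-due counts are removed.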


In [57]:
df_training = df_training.drop(['NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfTimes90DaysLate'], axis=1)

In [60]:
df_training.shape

(150000, 9)

In [61]:
df_training.columns

Index(['SeriousDlqin2yrs', 'RevolvingUtilizationOfUnsecuredLines', 'age',
       'NumberOfTime30-59DaysPastDueNotWorse', 'DebtRatio', 'MonthlyIncome',
       'NumberOfOpenCreditLinesAndLoans', 'NumberRealEstateLoansOrLines',
       'NumberOfDependents'],
      dtype='object')

In [62]:
df_training.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 9 columns):
 #   Column                                Non-Null Count   Dtype  
---  ------                                --------------   -----  
 0   SeriousDlqin2yrs                      150000 non-null  float64
 1   RevolvingUtilizationOfUnsecuredLines  150000 non-null  float64
 2   age                                   150000 non-null  float64
 3   NumberOfTime30-59DaysPastDueNotWorse  150000 non-null  float64
 4   DebtRatio                             150000 non-null  float64
 5   MonthlyIncome                         150000 non-null  float64
 6   NumberOfOpenCreditLinesAndLoans       150000 non-null  float64
 7   NumberRealEstateLoansOrLines          150000 non-null  float64
 8   NumberOfDependents                    150000 non-null  float64
dtypes: float64(9)
memory usage: 10.3 MB


### Feature scaling will be performed in the MLP Notebook