<h3> Objective: Build a model that accurately predicts whether a loan applicant will default, based on a set of features (Loan Amount, Debt-to-Income Ratio, Loan Type, Rate of Interest, Income, Credit Worthiness, Credit Score). </h3>

In [1]:
import sys
!{sys.executable} -m pip install xgboost


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m23.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip3.8 install --upgrade pip[0m


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, KFold
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.utils import resample

In [3]:
df = pd.read_csv("Loan_Default_Cleaned.csv")
df.head()

,Unnamed: 0,id,year,loan_amount,rate_of_interest,interest_rate_spread,upfront_charges,term,property_value,income,...,construction_type,occupancy_type,secured_by,total_units,credit_type,co-applicant_credit_type,age,submission_of_application,region,security_type
0,0,24890.0,2019.0,116500.0,4.201667,0.972767,3405.226667,360.0,118000.0,1740.0,...,sb,pr,home,1U,EXP,CIB,25-34,to_inst,south,direct
1,1,24891.0,2019.0,206500.0,3.996667,0.6778,558.893333,360.0,281333.333333,4980.0,...,sb,pr,home,1U,EQUI,EXP,55-64,to_inst,North,direct
2,2,24892.0,2019.0,406500.0,4.56,0.2,595.0,360.0,508000.0,9480.0,...,sb,pr,home,1U,EXP,CIB,35-44,to_inst,south,direct
3,3,24893.0,2019.0,456500.0,4.25,0.681,124.5,360.0,658000.0,11880.0,...,sb,pr,home,1U,EXP,CIB,45-54,not_inst,North,direct
4,4,24894.0,2019.0,696500.0,4.0,0.3042,0.0,360.0,758000.0,10440.0,...,sb,pr,home,1U,CRIF,EXP,25-34,not_inst,North,direct


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148670 entries, 0 to 148669
Data columns (total 35 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   Unnamed: 0                 148670 non-null  int64  
 1   id                         148670 non-null  float64
 2   year                       148670 non-null  float64
 3   loan_amount                148670 non-null  float64
 4   rate_of_interest           148670 non-null  float64
 5   interest_rate_spread       148670 non-null  float64
 6   upfront_charges            148670 non-null  float64
 7   term                       148670 non-null  float64
 8   property_value             148670 non-null  float64
 9   income                     148670 non-null  float64
 10  credit_score               148670 non-null  float64
 11  ltv                        148670 non-null  float64
 12  status                     148670 non-null  float64
 13  dtir1                      14

Extract only the columns in the feature set, plus the target column status, and one-hot encode the categorical columns (loan_type, credit_worthiness) so that all data is numerical for use in the machine learning models.

In [5]:
df_features = df[['loan_amount', 'dtir1', 'loan_type', 'rate_of_interest', 'income', 'credit_worthiness', 'credit_score', 'status']]

In [6]:
df_features = pd.get_dummies(df_features, columns=['loan_type', 'credit_worthiness'], drop_first=True)
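
As a minimal sketch of what the encoding step does, the toy frame below one-hot encodes a single categorical column. The rows are illustrative and not taken from the dataset. With drop_first=True the first category becomes the implicit baseline, which is why only loan_type_type2 and loan_type_type3 appear as columns.

```python
import pandas as pd

# Toy frame; the rows are illustrative, not taken from the dataset
toy = pd.DataFrame({
    "loan_amount": [116500.0, 206500.0, 406500.0],
    "loan_type": ["type1", "type2", "type3"],
})

# drop_first=True drops the first category (type1), which becomes the baseline
encoded = pd.get_dummies(toy, columns=["loan_type"], drop_first=True)
print(encoded.columns.tolist())
```

A row with zeros in both dummy columns therefore corresponds to the baseline category.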

In [7]:
df_features.head()

,loan_amount,dtir1,rate_of_interest,income,credit_score,status,loan_type_type2,loan_type_type3,credit_worthiness_l2
0,116500.0,45.0,4.201667,1740.0,758.0,1.0,0,0,0
1,206500.0,36.333333,3.996667,4980.0,552.0,1.0,1,0,0
2,406500.0,46.0,4.56,9480.0,834.0,0.0,0,0,0
3,456500.0,42.0,4.25,11880.0,587.0,0.0,0,0,0
4,696500.0,39.0,4.0,10440.0,602.0,0.0,0,0,0


In [8]:
df_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 148670 entries, 0 to 148669
Data columns (total 9 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   loan_amount           148670 non-null  float64
 1   dtir1                 148670 non-null  float64
 2   rate_of_interest      148670 non-null  float64
 3   income                148670 non-null  float64
 4   credit_score          148670 non-null  float64
 5   status                148670 non-null  float64
 6   loan_type_type2       148670 non-null  uint8  
 7   loan_type_type3       148670 non-null  uint8  
 8   credit_worthiness_l2  148670 non-null  uint8  
dtypes: float64(6), uint8(3)
memory usage: 7.2 MB


The two status classes are imbalanced, so we first downsample the majority class (status 0) so that both categories have a roughly equal number of rows.

In [9]:
df_features['status'].value_counts()

0.0    112031
1.0     36639
Name: status, dtype: int64

In [10]:
# Separate the data into two DataFrames based on categories 0 and 1
df_features_0 = df_features[df_features['status'] == 0 ]
df_features_1 = df_features[df_features['status'] == 1 ]

In [11]:
# Determine the desired number of rows for each category
desired_rows = 36000

In [12]:
# Resample the larger category to match the desired number of rows
df_resampled_0 = resample(df_features_0, n_samples=desired_rows, replace=False, random_state=42)

In [13]:
# Concatenate the resampled DataFrame with the smaller category DataFrame
df_balanced = pd.concat([df_resampled_0, df_features_1])

In [14]:
df_balanced.head()

,loan_amount,dtir1,rate_of_interest,income,credit_score,status,loan_type_type2,loan_type_type3,credit_worthiness_l2
130220,176500.0,26.0,4.375,4500.0,520.0,0.0,0,0,0
52597,476500.0,31.0,3.99,7560.0,771.0,0.0,0,0,0
84169,726500.0,44.0,3.875,10620.0,750.0,0.0,0,0,0
144846,726500.0,44.0,4.0,10560.0,613.0,0.0,0,0,0
67253,96500.0,38.0,5.5,5700.0,553.0,0.0,0,0,0


In [15]:
df_balanced['status'].value_counts()

1.0    36639
0.0    36000
Name: status, dtype: int64
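
The downsampling steps above could also be wrapped in a small helper that shrinks every class to the size of the smallest one, instead of using a hand-picked row count. This is only a sketch: downsample_majority is a hypothetical name, and the toy data is made up for illustration.

```python
import pandas as pd
from sklearn.utils import resample


def downsample_majority(df, target, random_state=42):
    """Hypothetical helper: downsample each class to the size of the smallest class."""
    counts = df[target].value_counts()
    n_min = int(counts.min())
    parts = [
        resample(df[df[target] == label], n_samples=n_min,
                 replace=False, random_state=random_state)
        for label in counts.index
    ]
    # Shuffle so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state)


# Toy data: 7 rows of class 0.0 and 3 rows of class 1.0
toy = pd.DataFrame({"x": range(10), "status": [0.0] * 7 + [1.0] * 3})
balanced = downsample_majority(toy, "status")
print(balanced["status"].value_counts())
```

Sampling without replacement (replace=False) means no row is duplicated, at the cost of discarding part of the majority class.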

We now split the data into a training set and a test set. The model is trained on X_train (features) with y_train as the ground truth, and evaluated by comparing its predictions on X_test against y_test. We use 70% of the data for training, so that the algorithm has enough data to learn from, and hold out 30% for testing. Setting random_state=42 makes the split reproducible across executions; it does not change the model's performance, but it ensures that scores from different runs are comparable.

In [16]:
# ====================================================================
# Splitting Data
# ====================================================================

# train_test_split
train_set, test_set = train_test_split(df_balanced, test_size=0.3, random_state=42)

y_train = train_set['status']
X_train = train_set.drop(columns=['status'])
y_test = test_set['status']
X_test = test_set.drop(columns=['status'])
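
A variant of the split, sketched here on toy data, passes the features and target separately and adds stratify=y so that both sets keep the same class proportions. The stratify argument is an addition for illustration and is not used in the notebook's own split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with two equally sized classes (illustrative only)
toy = pd.DataFrame({
    "loan_amount": [float(i) for i in range(20)],
    "status": [0.0, 1.0] * 10,
})
X = toy.drop(columns=["status"])
y = toy["status"]

# stratify=y keeps the class ratio identical in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Repeating the call with the same random_state reproduces the same split
X_train2, X_test2, _, _ = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
print(X_test.index.equals(X_test2.index))
```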

Because our target is a categorical variable, linear regression is not appropriate. Instead, we fit three different classification algorithms and choose the best one:

- <b>RandomForestClassifier:</b> a supervised machine learning algorithm widely used for classification and regression problems. It builds decision trees on different bootstrap samples of the data and combines them by majority vote for classification or by averaging for regression.
- <b>XGBoostClassifier:</b> XGBoost is a scalable, distributed gradient-boosted decision tree (GBDT) machine learning library. It provides parallel tree boosting and is widely used for regression, classification, and ranking problems.
- <b>NaiveBayesClassifier:</b> a supervised machine learning algorithm used for classification tasks such as text classification. It belongs to the family of generative learning algorithms, meaning that it models the distribution of the inputs within each class or category.
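
Comparing several classifiers on the same split can be sketched as a loop over a dictionary of models. The example below uses synthetic data from make_classification as a stand-in for the loan data and includes only scikit-learn estimators; an XGBClassifier could be added to the dictionary in the same way, assuming xgboost is installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data; the notebook uses the loan features instead
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "gaussian_nb": GaussianNB(),
}

# Fit each model and record accuracy on both the training and the test set
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "train": model.score(X_train, y_train),
        "test": model.score(X_test, y_test),
    }

for name, scores in results.items():
    print(f"{name}: train={scores['train']:.3f} test={scores['test']:.3f}")
```

Recording the training accuracy next to the test accuracy makes a large gap between the two, a sign of overfitting, easy to spot.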

In [17]:
# ====================================================================
# Random Forest Classifier: 0.86 accuracy on test set
# ====================================================================

# Fitting Model
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

# Predict on test data
rf_preds = rf.predict(X_test)

# Test set Performance
print(accuracy_score(y_test, rf_preds))
print(confusion_matrix(y_test, rf_preds))

0.8630231277533039
[[9433 1381]
 [1604 9374]]


In [18]:
print(f'training accuracy: {rf.score(X_train,y_train)}')
print(f'testing accuracy: {rf.score(X_test,y_test)}')

training accuracy: 1.0
testing accuracy: 0.8630231277533039
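
Accuracy alone hides which kind of error the model makes. As a sketch, the confusion matrix printed for the random forest can be unpacked into precision and recall for the default class (status 1), using scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix reported for the random forest on the test set
# (rows: true labels 0 and 1, columns: predicted labels 0 and 1)
cm = np.array([[9433, 1381],
               [1604, 9374]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted defaulters who did default
recall = tp / (tp + fn)           # share of actual defaulters who were caught

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
```

For loan default, recall on the default class is often the more relevant number, because a missed defaulter (a false negative) is typically costlier than a wrongly flagged applicant.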


Next, let's try the XGBoostClassifier and evaluate if it can outperform the RandomForestClassifier.

In [19]:
# ====================================================================
# XGBoost Classifier: 0.94 accuracy on test set
# ====================================================================

# Fitting Model
xgbc = XGBClassifier()
xgbc.fit(X_train, y_train)

# Scores on train set
scores = cross_val_score(xgbc, X_train, y_train, cv=5)
print("Mean cross-validation score: %.2f" % scores.mean())

# Predict on test data
xgb_preds = xgbc.predict(X_test)

# Test set performance
print(accuracy_score(y_test, xgb_preds))
print(confusion_matrix(y_test, xgb_preds))

Mean cross-validation score: 0.93
0.9367198972099853
[[10648   166]
 [ 1213  9765]]


In [20]:
print(f'training accuracy: {xgbc.score(X_train,y_train)}')
print(f'testing accuracy: {xgbc.score(X_test,y_test)}')

training accuracy: 0.9484335359018231
testing accuracy: 0.9367198972099853
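
The cross-validation step can be made more explicit by passing a splitter object instead of cv=5. The sketch below uses StratifiedKFold with shuffling and reports the per-fold spread. To keep it dependent on scikit-learn only, GradientBoostingClassifier and synthetic data stand in for XGBClassifier and the loan data; both substitutions are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (illustrative only)
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Shuffled, stratified folds with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Stand-in gradient-boosted model; cross_val_score fits a fresh clone per fold
model = GradientBoostingClassifier(n_estimators=50, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Per-fold accuracy:", scores.round(3))
print("Mean: %.2f, std: %.2f" % (scores.mean(), scores.std()))
```

Because cross_val_score clones the estimator for every fold, it does not depend on, or modify, a model that was fitted beforehand.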


Finally, let's try the NaiveBayesClassifier.

In [21]:
# ====================================================================
# Naive Bayes Classifiers: 0.57 (Gaussian) and 0.55 (Bernoulli) accuracy on test set
# ====================================================================

# Fitting Model: Gaussian
naive_bayes = GaussianNB()
naive_bayes.fit(X_train , y_train)

# Predict on test data
nb_preds = naive_bayes.predict(X_test)

# Test set performance (Gaussian)
print(accuracy_score(y_test, nb_preds))
print(confusion_matrix(y_test, nb_preds))

print("====================")

# Fitting Model: Bernoulli
naive_bayes = BernoulliNB()
naive_bayes.fit(X_train , y_train)

# Predict on test data
nb_preds = naive_bayes.predict(X_test)

# Test set performance (Bernoulli)
print(accuracy_score(y_test, nb_preds))
print(confusion_matrix(y_test, nb_preds))

0.5723660058737151
[[7477 3337]
 [5982 4996]]
0.5482745961820852
[[9101 1713]
 [8131 2847]]


In [22]:
# naive_bayes refers to the BernoulliNB model fitted last
print(f'training accuracy: {naive_bayes.score(X_train,y_train)}')
print(f'testing accuracy: {naive_bayes.score(X_test,y_test)}')

training accuracy: 0.5459515802308887
testing accuracy: 0.5482745961820852
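
One plausible contributor to the weak BernoulliNB result is its default binarize=0.0 setting: every feature value greater than 0 is mapped to 1 before fitting. The sketch below applies the same thresholding with Binarizer to four continuous columns of two rows from the df_features preview (loan_amount, dtir1, rate_of_interest, credit_score). Whether this fully explains the score is an assumption, not something verified on the full data.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Two rows from the df_features preview:
# loan_amount, dtir1, rate_of_interest, credit_score
X = np.array([[116500.0, 45.0, 4.201667, 758.0],
              [696500.0, 39.0, 4.0, 602.0]])

# Same thresholding that BernoulliNB applies by default (binarize=0.0)
binarized = Binarizer(threshold=0.0).fit_transform(X)
print(binarized)
```

Since all four columns are strictly positive here, both rows become identical after binarization, so these features carry no information for the Bernoulli model.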


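<b>Conclusion:</b> The Naive Bayes classifiers performed much worse than the other two models (test accuracy of about 0.55 to 0.57). The RandomForestClassifier reached a test accuracy of 0.86 while fitting the training set perfectly (training accuracy 1.0), and the XGBoostClassifier achieved the highest test accuracy, about 0.94 (mean cross-validation score 0.93). On this dataset and with the identified set of features, XGBoost is therefore the most promising candidate for identifying loan defaulters in the finance industry.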