# Home Credit Default Risk Prediction Using Naive Bayes Model 

## By Debayan Dutta July-09-2023

## Introduction

Project Goal: Enable Home Credit to identify whether a customer is a safe candidate to lend to and then create a personalized loan and repayment plan for that customer, resulting in increased revenue, an improved customer experience, and lower default rates.

Business Problem: Home Credit wants to identify safe borrowers within a customer base that is unfamiliar with banking and to give each customer a plan for successful loan repayment. Lending to customers who are likely to default decreases Home Credit's profits and results in negative customer experiences.

Analytic Problem:

The target variable indicates whether a customer has a negative or a positive history of loan repayment. It is represented in the application_train/test.csv sets as a binary value, where 1 = not a trustworthy borrower (client with payment difficulties) and 0 = trustworthy borrower (client with a good repayment history).

Predict which customers will be good borrowers, using a classification method based on customer financial behavior data.

Use a machine learning method to examine how a customer's trustworthiness relates to their other attributes.

## Machine Learning Model Choice 4 - Naive Bayes Algorithm

- Naive Bayes is a simple yet powerful probabilistic classifier based on Bayes' theorem and the assumption of feature independence.

- Naive Bayes assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. This makes it a very fast and efficient algorithm to train.

- It is commonly used for classification tasks and works well with high-dimensional data.

- Naive Bayes calculates the probability of a class given the feature values and makes predictions based on the class with the highest probability.

- The "naive" assumption assumes that features are conditionally independent given the class, which simplifies the calculation of probabilities.

- Naive Bayes is computationally efficient and performs well in many real-world scenarios, especially when the independence assumption holds reasonably well.
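
The points above can be made concrete with a small hand-rolled calculation. The following is a minimal sketch, not the scikit-learn implementation: it assumes Gaussian feature likelihoods, and the toy arrays `X_toy` and `y_toy` and the helper `gaussian_nb_posterior` are hypothetical names introduced only for illustration. It shows how the class prior and the per-feature likelihoods combine under the independence assumption.

```python
import numpy as np

# Hypothetical toy data: two numeric features, binary class label
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.2],
                  [3.0, 0.5], [3.2, 0.7], [2.9, 0.4]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

def gaussian_nb_posterior(X, y, x_new, eps=1e-9):
    """Return (classes, posterior probabilities) for a single sample x_new."""
    classes = np.unique(y)
    log_joint = np.empty(len(classes))
    for i, c in enumerate(classes):
        X_c = X[y == c]
        log_prior = np.log(len(X_c) / len(X))   # P(class)
        mean = X_c.mean(axis=0)
        var = X_c.var(axis=0) + eps             # eps avoids division by zero
        # Independence assumption: per-feature log densities are simply summed
        log_likelihood = -0.5 * np.sum(
            np.log(2 * np.pi * var) + (x_new - mean) ** 2 / var
        )
        log_joint[i] = log_prior + log_likelihood
    # Normalize in log space for numerical stability
    log_joint -= log_joint.max()
    posterior = np.exp(log_joint) / np.exp(log_joint).sum()
    return classes, posterior

classes, posterior = gaussian_nb_posterior(X_toy, y_toy, np.array([1.1, 2.0]))
predicted_class = classes[np.argmax(posterior)]  # class with the highest posterior
print(predicted_class, posterior)
```

The prediction is the class with the highest posterior, which is the rule described in the list above.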

# Analysis

## Importing packages and data

In [80]:
#Import necessary libraries
import os
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import randint, uniform

#conda install -c conda-forge imbalanced-learn
from imblearn.over_sampling import RandomOverSampler

from sklearn import metrics
from sklearn.impute import SimpleImputer
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             f1_score, precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     cross_val_score, train_test_split)
from sklearn.naive_bayes import BernoulliNB, CategoricalNB, GaussianNB
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

## Loading Datasets

In [81]:
#Load the necessary datasets
train_data = pd.read_csv('application_train.csv')
test_data = pd.read_csv('application_test.csv')
pos_cash_balance = pd.read_csv('POS_CASH_balance.csv')
bureau = pd.read_csv('bureau.csv')
#bureau_balance = pd.read_csv('bureau_balance.csv')
credit_card_balance = pd.read_csv('credit_card_balance.csv')
#installments_payments = pd.read_csv('installments_payments.csv')
#previous_application = pd.read_csv('previous_application.csv')

## Joining in Datasets 

In [82]:
#Calculate average 'SK_DPD' values in credit_card_balance.csv
average_sk_dpd = credit_card_balance.groupby('SK_ID_CURR')['SK_DPD'].mean().reset_index()

#Merge average_sk_dpd with train_data based on 'SK_ID_CURR'
train_data = train_data.merge(average_sk_dpd, on='SK_ID_CURR', how='left')

#Merge average_sk_dpd with test_data based on 'SK_ID_CURR'
test_data = test_data.merge(average_sk_dpd, on='SK_ID_CURR', how='left')

#Fill missing values with 0
train_data['SK_DPD'] = train_data['SK_DPD'].fillna(0)
test_data['SK_DPD'] = test_data['SK_DPD'].fillna(0)

In [83]:
#Calculate average 'CREDIT_DAY_OVERDUE' values in bureau.csv
average_credit_day_overdue = bureau.groupby('SK_ID_CURR')['CREDIT_DAY_OVERDUE'].mean().reset_index()

#Merge average_credit_day_overdue with train_data based on 'SK_ID_CURR'
train_data = train_data.merge(average_credit_day_overdue, on='SK_ID_CURR', how='left')

#Merge average_credit_day_overdue with test_data based on 'SK_ID_CURR'
test_data = test_data.merge(average_credit_day_overdue, on='SK_ID_CURR', how='left')

#Fill missing values with 0
train_data['CREDIT_DAY_OVERDUE'] = train_data['CREDIT_DAY_OVERDUE'].fillna(0)
test_data['CREDIT_DAY_OVERDUE'] = test_data['CREDIT_DAY_OVERDUE'].fillna(0)
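
The two join cells above repeat the same aggregate, merge, and fill pattern. As a minimal sketch, the pattern could be wrapped in one helper. The function name `add_mean_feature` and the toy frames below are hypothetical and not part of the original notebook.

```python
import pandas as pd

def add_mean_feature(base, aux, key, column):
    """Left-join the per-key mean of `column` from `aux` onto `base`, filling gaps with 0."""
    mean_per_key = aux.groupby(key)[column].mean().reset_index()
    merged = base.merge(mean_per_key, on=key, how="left")
    # Keys with no rows in `aux` get NaN after the left join; treat them as 0
    merged[column] = merged[column].fillna(0)
    return merged

# Toy stand-ins for the application and credit card balance tables
applications = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
card_balance = pd.DataFrame({"SK_ID_CURR": [1, 1, 2], "SK_DPD": [0, 4, 10]})

result = add_mean_feature(applications, card_balance, "SK_ID_CURR", "SK_DPD")
print(result)
```

Because the aggregation yields one row per key, the left join keeps the row count of the base table unchanged.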

In [84]:
#Shape of the joined datasets
print(train_data.shape)
print(test_data.shape)

(307511, 124)
(48744, 123)


## EDA - Treating Missing Data

In [85]:
#Remove columns with more than 30% null values
train_data = train_data.dropna(thresh=len(train_data) * 0.7, axis=1)
test_data = test_data.dropna(thresh=len(test_data) * 0.7, axis=1)

#Select numeric columns; non-numeric columns are dropped from train_data in the next step
numeric_columns = train_data.select_dtypes(include=np.number).columns

#Keep only the numeric columns and impute the remaining nulls with the column mean in train_data
train_data = train_data.loc[:, numeric_columns].fillna(train_data.loc[:, numeric_columns].mean())

#Select numeric columns in test_data
test_numeric_columns = test_data.select_dtypes(include=np.number).columns

#Keep only the numeric columns and impute the remaining nulls with the column mean in test_data
test_data = test_data.loc[:, test_numeric_columns].fillna(test_data.loc[:, test_numeric_columns].mean())

In [86]:
# Check for null values in train_data
null_counts = train_data.isnull().sum()

print(null_counts)

print(train_data.shape)
print(test_data.shape)

SK_ID_CURR                    0
TARGET                        0
CNT_CHILDREN                  0
AMT_INCOME_TOTAL              0
AMT_CREDIT                    0
                             ..
AMT_REQ_CREDIT_BUREAU_MON     0
AMT_REQ_CREDIT_BUREAU_QRT     0
AMT_REQ_CREDIT_BUREAU_YEAR    0
SK_DPD                        0
CREDIT_DAY_OVERDUE            0
Length: 63, dtype: int64
(307511, 63)
(48744, 62)


Dropping the columns with more than 30% null values, together with keeping only the numeric columns, reduced the number of variables in the train and test sets by 61 (from 124 to 63 and from 123 to 62, respectively).
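
The effect of the missing-data cell can be reproduced on a small frame. This is a minimal sketch on hypothetical toy data (the column names are made up). It shows that the threshold drop removes sparse columns and that selecting numeric dtypes also removes any text column before mean imputation.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical toy frame: one mostly complete numeric column,
# one half-empty numeric column, and one text column
df = pd.DataFrame({
    "mostly_present": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "half_missing": [1.0, np.nan, 3.0, np.nan, 5.0, np.nan, 7.0, np.nan, 9.0, np.nan],
    "category": list("abcdefghij"),
})

# Keep columns with at least 70% non-null values
min_non_null = math.ceil(len(df) * 0.7)
kept = df.dropna(thresh=min_non_null, axis=1)

# Selecting numeric dtypes silently drops the text column as well
numeric = kept.select_dtypes(include=np.number)

# Fill the remaining gaps with each column's mean
imputed = numeric.fillna(numeric.mean())
print(list(kept.columns), list(imputed.columns))
```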

## Modeling Creation Processing & Evaluation

In [108]:

# Label-encode any object columns (a no-op here: only numeric columns remain after the imputation step)
label_encoder = LabelEncoder()
for column in train_data.columns:
    if train_data[column].dtype == 'object':
        train_data[column] = label_encoder.fit_transform(train_data[column].astype(str))

# Separate the features (X) and target variable (y)
X = train_data.drop('TARGET', axis=1)
y = train_data['TARGET']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Naive Bayes model
naive_bayes = GaussianNB()
naive_bayes.fit(X_train, y_train)

# Make predictions on the test set
y_pred1 = naive_bayes.predict(X_test)

# Calculate accuracy
accuracy1 = accuracy_score(y_test, y_pred1)
print("Accuracy:", accuracy1)

Accuracy: 0.9172072907012666


In [107]:
#Estimator BernoulliNB

label_encoder = LabelEncoder()
for column in train_data.columns:
    if train_data[column].dtype == 'object':
        train_data[column] = label_encoder.fit_transform(train_data[column].astype(str))

# Separate the features (X) and target variable (y)
X = train_data.drop('TARGET', axis=1)
y = train_data['TARGET']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Naive Bayes model
naive_bayes_b1 = BernoulliNB()
naive_bayes_b1.fit(X_train, y_train)

# Make predictions on the test set
y_pred2 = naive_bayes_b1.predict(X_test)

# Calculate accuracy
accuracy2 = accuracy_score(y_test, y_pred2)
print("Accuracy:", accuracy2)


Accuracy: 0.9184104840414289


In [89]:
# Perform cross-validation
scores = cross_val_score(naive_bayes, X_train, y_train, cv=5)

# Print cross-validation scores
print("Cross-Validation Scores:", scores)
print("Mean Accuracy:", scores.mean())
print("Standard Deviation:", scores.std())

Cross-Validation Scores: [0.91774725 0.91813341 0.91849925 0.91790817 0.91798947]
Mean Accuracy: 0.9180555095225575
Standard Deviation: 0.00025458504357842574
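
With a heavily imbalanced target, accuracy alone says little about how well the positive class is ranked. The following is a minimal sketch on synthetic data, not the notebook's data. It assumes ROC AUC is a metric of interest, and `X_demo`, `y_demo`, and the stratified splitter are illustrative choices. It runs the same cross-validation with two scoring options.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in for the training data (about 8% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.92, 0.08], random_state=42
)

# Stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

acc_scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=cv, scoring="accuracy")
auc_scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=cv, scoring="roc_auc")

print("Mean accuracy:", acc_scores.mean())
print("Mean ROC AUC:", auc_scores.mean())
```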


In [109]:
# Adjusting hyperparameters for Bernoulli's Estimator 
# Perform label encoding on object columns
label_encoder = LabelEncoder()
for column in train_data.columns:
    if train_data[column].dtype == 'object':
        train_data[column] = label_encoder.fit_transform(train_data[column].astype(str))

# Separate the features (X) and target variable (y)
X = train_data.drop('TARGET', axis=1)
y = train_data['TARGET']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the hyperparameters and their values to tune
param_grid = {
    'alpha': [0.1, 0.5, 1.0, 2.0, 5.0],
    'binarize': [0.0, 0.5, 1.0],
    'fit_prior': [True, False]
}

# Create Bernoulli Naive Bayes model
naive_bayes_b2 = BernoulliNB()

# Perform Grid Search with 5-fold cross-validation
grid_search = GridSearchCV(estimator=naive_bayes_b2, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best hyperparameter values
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

# Train the Naive Bayes model with the best hyperparameters
best_naive_bayes = BernoulliNB(**best_params)
best_naive_bayes.fit(X_train, y_train)

# Make predictions on the test set
y_pred3 = best_naive_bayes.predict(X_test)

# Calculate accuracy
accuracy3 = accuracy_score(y_test, y_pred3)
print("Accuracy:", accuracy3)


Best Hyperparameters: {'alpha': 0.1, 'binarize': 1.0, 'fit_prior': True}
Accuracy: 0.9195323805342829
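
The grid search above optimizes the default accuracy score. A minimal sketch of one possible variation follows, on synthetic data with a reduced, illustrative grid. It passes an explicit `scoring` argument and reuses `best_estimator_`, which `GridSearchCV` refits on the full training split by default, so the model does not need to be rebuilt by hand. The variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import BernoulliNB

# Synthetic, imbalanced stand-in data (about 10% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Reduced grid, for illustration only
param_grid = {"alpha": [0.1, 1.0], "binarize": [0.0, 0.5], "fit_prior": [True, False]}

# Score candidates by ROC AUC instead of the default accuracy
search = GridSearchCV(BernoulliNB(), param_grid, cv=5, scoring="roc_auc")
search.fit(X_tr, y_tr)

# refit=True (the default) leaves a model trained on all of X_tr with the best parameters
tuned_model = search.best_estimator_
test_predictions_demo = tuned_model.predict(X_te)
print("Best parameters:", search.best_params_)
```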


### Feature Importance 

In [105]:
from sklearn.inspection import permutation_importance

# Calculate permutation importance
importance = permutation_importance(naive_bayes, X_test, y_test)

# Get feature importances
feature_importance = importance.importances_mean

# Print feature importances
for i, feature_name in enumerate(X.columns):
    if feature_importance[i] != 0:
        print(f"{feature_name}: {feature_importance[i]}")

AMT_INCOME_TOTAL: -3.2518738923403843e-06
AMT_CREDIT: -2.6014991138678666e-05
AMT_ANNUITY: -5.8533730061949286e-05
AMT_GOODS_PRICE: -9.755621676976744e-06
DAYS_BIRTH: -6.5037477846585645e-06
DAYS_EMPLOYED: -3.2518738923403843e-06
DAYS_REGISTRATION: 1.95112433539979e-05
DAYS_ID_PUBLISH: -1.6259369461657514e-05
DAYS_LAST_PHONE_CHANGE: 9.105246898526431e-05
SK_DPD: -6.5037477846585645e-06
CREDIT_DAY_OVERDUE: 0.0003154317675560625


In [111]:
# Calculate a confusion matrix for each model (GaussianNB, BernoulliNB, tuned BernoulliNB)
conf_matrix1 = confusion_matrix(y_test, y_pred1)
conf_matrix2 = confusion_matrix(y_test, y_pred2)
conf_matrix3 = confusion_matrix(y_test, y_pred3)


# Print the confusion matrix
print("Confusion Matrix:")
print(conf_matrix1)
print(conf_matrix2)
print(conf_matrix3)

Confusion Matrix:
[[56385   169]
 [ 4923    26]]
[[56467    87]
 [ 4931    18]]
[[56554     0]
 [ 4949     0]]
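
The third matrix has no predicted positives at all. A short check using only the counts printed above shows that the tuned model's accuracy equals the accuracy of always predicting class 0. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix of the tuned BernoulliNB model, copied from the output above
conf_matrix_tuned = np.array([[56554, 0],
                              [4949, 0]])

# Rows are actual classes, columns are predicted classes
actual_negatives, actual_positives = conf_matrix_tuned.sum(axis=1)
predicted_positives = conf_matrix_tuned[:, 1].sum()

# Accuracy of a "model" that always predicts class 0
majority_baseline = actual_negatives / (actual_negatives + actual_positives)

# Accuracy of the tuned model: correct predictions lie on the diagonal
model_accuracy = np.trace(conf_matrix_tuned) / conf_matrix_tuned.sum()

print("Predicted positives:", predicted_positives)
print("Majority-class baseline:", majority_baseline)
print("Tuned model accuracy:", model_accuracy)
```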


In [112]:
#Evaluating confusion matrix values for Naive Bayes [Estimator = GaussianNB]
TN, FP, FN, TP = conf_matrix1.ravel()

#Calculate accuracy
accuracy = (TP + TN) / (TP + TN + FP + FN)

#Calculate precision
precision = TP / (TP + FP)

#Calculate recall
recall = TP / (TP + FN)

#Calculate F1-score (named f1 so the imported sklearn f1_score function is not overwritten)
f1 = 2 * (precision * recall) / (precision + recall)

# Print the metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Accuracy: 0.9172072907012666
Precision: 0.13333333333333333
Recall: 0.005253586583148111
F1-score: 0.010108864696734058


In [113]:
#Evaluating confusion matrix values for Naive Bayes [Estimator = BernoulliNB]
TN, FP, FN, TP = conf_matrix2.ravel()

#Calculate accuracy
accuracy = (TP + TN) / (TP + TN + FP + FN)

#Calculate precision
precision = TP / (TP + FP)

#Calculate recall
recall = TP / (TP + FN)

#Calculate F1-score (named f1 so the imported sklearn f1_score function is not overwritten)
f1 = 2 * (precision * recall) / (precision + recall)

# Print the metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Accuracy: 0.9184104840414289
Precision: 0.17142857142857143
Recall: 0.003637098403717923
F1-score: 0.007123070834982192


In [114]:
#Evaluating confusion matrix values for Naive Bayes [Estimator = BernoulliNB] with hyperparameter tuning
TN, FP, FN, TP = conf_matrix3.ravel()

#Calculate accuracy
accuracy = (TP + TN) / (TP + TN + FP + FN)

#Calculate precision
precision = TP / (TP + FP)

#Calculate recall
recall = TP / (TP + FN)

#Calculate F1-score (named f1 so the imported sklearn f1_score function is not overwritten)
f1 = 2 * (precision * recall) / (precision + recall)

# Print the metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

Accuracy: 0.9195323805342829
Precision: nan
Recall: 0.0
F1-score: nan


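Precision and F1-score are nan for the tuned BernoulliNB model because it predicted no positive cases, so TP + FP = 0 and the division in precision = TP / (TP + FP) is undefined (NumPy emits a runtime warning for it).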


For the GaussianNB model, the accuracy score is 0.917, indicating that about 91.7% of the predictions were correct. A precision score of 0.133 indicates that only 13.3% of the cases predicted as positive were actually positive. A recall score of 0.005 means that the model identified only about 0.5% of the actual positive cases. The F1-score is 0.010, which shows that the model performs very poorly on the positive (default) class despite its high accuracy.
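
The same metrics can also be obtained from scikit-learn's own functions, which accept a `zero_division` argument for the case where no positives are predicted. The sketch below rebuilds label arrays from the confusion matrix counts printed earlier. The helper `labels_from_counts` is a hypothetical convenience, not part of the notebook.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def labels_from_counts(tn, fp, fn, tp):
    """Rebuild (y_true, y_hat) label arrays that reproduce the given confusion matrix counts."""
    y_true = np.repeat([0, 0, 1, 1], [tn, fp, fn, tp])
    y_hat = np.repeat([0, 1, 0, 1], [tn, fp, fn, tp])
    return y_true, y_hat

# Counts copied from the GaussianNB and tuned BernoulliNB confusion matrices
y_true_g, y_hat_g = labels_from_counts(56385, 169, 4923, 26)
y_true_t, y_hat_t = labels_from_counts(56554, 0, 4949, 0)

gaussian_precision = precision_score(y_true_g, y_hat_g, zero_division=0)
gaussian_recall = recall_score(y_true_g, y_hat_g, zero_division=0)

# With no predicted positives, zero_division=0 returns 0.0 instead of nan
tuned_precision = precision_score(y_true_t, y_hat_t, zero_division=0)
tuned_f1 = f1_score(y_true_t, y_hat_t, zero_division=0)

print(gaussian_precision, gaussian_recall, tuned_precision, tuned_f1)
```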

### Prediction on the Test Set

In [116]:
test_predictions = best_naive_bayes.predict(test_data)

In [117]:
#Create a submission DataFrame
submission = pd.DataFrame({
    "SK_ID_CURR": test_data["SK_ID_CURR"], 
    "TARGET": test_predictions
})

#Remove duplicate SK_ID_CURR values
submission = submission.drop_duplicates(subset="SK_ID_CURR", keep="first")

#Save the submission DataFrame to a CSV file
#submission.to_csv("submission.csv", index=False)

In [118]:
test_predictions

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [119]:
submission

| | SK_ID_CURR | TARGET |
| --- | --- | --- |
| 0 | 100001 | 0 |
| 1 | 100005 | 0 |
| 2 | 100013 | 0 |
| 3 | 100028 | 0 |
| 4 | 100038 | 0 |
| ... | ... | ... |
| 48739 | 456221 | 0 |
| 48740 | 456222 | 0 |
| 48741 | 456223 | 0 |
| 48742 | 456224 | 0 |


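With the GaussianNB estimator the accuracy is 0.9172, and with the BernoulliNB estimator it is 0.9184; it rises to 0.9195 after hyperparameter tuning. However, the tuned model predicts class 0 for every customer in the hold-out split (its confusion matrix contains no predicted positives), so 0.9195 is simply the share of class 0 in that split.
With the BernoulliNB estimator, we obtained a Kaggle score of 0.502.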
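
A score near 0.5 is what a ranking metric gives when every customer receives the same prediction. The following is a minimal sketch on synthetic data, assuming the leaderboard metric is ROC AUC (an assumption, not stated in this notebook). It compares constant hard labels with the positive-class probabilities from `predict_proba`. All names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import BernoulliNB

# Synthetic, imbalanced stand-in data (about 10% positives)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)
model = BernoulliNB().fit(X_demo, y_demo)

# Constant predictions cannot rank any positive above any negative
all_zero_labels = np.zeros_like(y_demo)
auc_constant = roc_auc_score(y_demo, all_zero_labels)

# Probabilities of the positive class keep the ranking information
positive_proba = model.predict_proba(X_demo)[:, 1]
auc_proba = roc_auc_score(y_demo, positive_proba)

print("AUC with constant labels:", auc_constant)
print("AUC with probabilities:", auc_proba)
```

Under that assumption, submitting the positive-class probability rather than the hard 0/1 label would be the natural choice for the submission file.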