## **Problem Statement:**
For a safe and secure lending experience, it is important to analyze past data. In this project, you have to build a deep learning model that uses historical data to predict the chance of default for future loans. As you will see, the dataset is highly imbalanced and has many features, both of which make the problem more challenging.

## **Objective:**
Create a model that uses historical data to predict whether an applicant will be able to repay a loan.

## **Steps to be done:**

1. Load the dataset that is given to you


In [None]:
#Upload the kaggle.json file
from google.colab import files
files.upload()

In [None]:
!pwd


In [None]:
ls


In [None]:
!unzip loan_data.zip -d loan_data

In [None]:
import os
os.listdir("/content/loan_data")

In [None]:
import pandas as pd
ld=pd.read_csv("loan_data.csv",index_col="SK_ID_CURR")
ld.head()

2. Check for null values in the dataset

In [None]:
column_values=ld.isnull().sum()
columns_with_null_values=column_values[column_values>0]
print(columns_with_null_values)
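
Raw null counts are hard to compare across columns, so it can help to look at the share of missing rows instead. The sketch below is only an illustration on a toy frame standing in for `ld`; the column names `AMT_INCOME` and `OCCUPATION` are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `ld`; column names are illustrative only.
toy = pd.DataFrame({
    "AMT_INCOME": [100.0, np.nan, 300.0, 400.0],
    "OCCUPATION": ["a", None, None, "b"],
    "TARGET": [0, 1, 0, 0],
})

# isnull().mean() is the fraction of missing rows per column.
null_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)

# Keep only the columns that actually have missing values.
null_pct = null_pct[null_pct > 0]
print(null_pct)
```

The same two lines should work on `ld` directly, and the resulting percentages make it easier to decide which columns to impute and which to drop.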

3. Print the percentage of defaulters to payers in the dataset, using the TARGET column

In [None]:
# In the context of a loan, "default" means the borrower has failed to meet the agreed-upon repayment terms.
# TARGET == 1 is taken to mark a default (the minority class in this imbalanced data), TARGET == 0 a repaid loan.
default=(ld['TARGET']==1).sum()
payer=(ld['TARGET']==0).sum()
default_to_payer=default/payer*100
print("percentage of defaulters to payers:",default_to_payer)
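
The ratio of defaulters to payers is not the same number as the share of defaulters among all loans. A minimal sketch on toy labels shows the difference; it assumes that `TARGET == 1` marks a default and `TARGET == 0` a repaid loan.

```python
import pandas as pd

# Toy labels: 9 repaid loans and 1 default (assumed encoding: 1 = default).
target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name="TARGET")

# Share of each class among all loans, in percent.
class_share = target.value_counts(normalize=True).mul(100)
default_share = class_share.get(1, 0.0)

# Defaults per 100 payers, the ratio computed in the cell above.
default_to_payer = (target == 1).sum() / (target == 0).sum() * 100

print(f"default share of all loans: {default_share:.1f}%")
print(f"defaults per 100 payers:    {default_to_payer:.1f}")
```

Reporting both figures avoids confusion when the class imbalance is discussed later.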

6. Encode the columns that are required for the model

In [None]:
#Convert the categorical columns to numerical
categorical_cols = ld.select_dtypes(include=['object', 'category']).columns
print(categorical_cols)
#One-Hot Encoding
ld = pd.get_dummies(ld, columns=categorical_cols, drop_first=True)
print(ld)
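
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up frame (the `CONTRACT` and `AMT` columns are hypothetical). It also passes `dtype=float`, since recent pandas versions return boolean dummy columns by default and a numeric dtype is usually more convenient for a neural network.

```python
import pandas as pd

# Hypothetical two-column frame: one categorical, one numeric feature.
toy = pd.DataFrame({
    "CONTRACT": ["cash", "revolving", "cash"],
    "AMT": [1.0, 2.0, 3.0],
})

# drop_first=True drops the first category ("cash"), so a two-level
# column becomes a single 0/1 indicator instead of two redundant ones.
encoded = pd.get_dummies(toy, columns=["CONTRACT"], drop_first=True, dtype=float)
print(encoded)
```

With `drop_first=True`, the dropped category is represented implicitly by all indicator columns being zero.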


4. Balance the dataset if the data is imbalanced

In [None]:
ld.isnull().sum()

In [None]:
ld.corr(numeric_only=True)["TARGET"].sort_values(ascending=False)

In [None]:
#Remove columns which are not correlated with the target
# Correlation with target
target_corr = ld.corr(numeric_only=True)['TARGET'].abs()

# Drop features with correlation below threshold
weak_features = target_corr[target_corr < 0.01].index.tolist()
ld_filtered = ld.drop(columns=weak_features)

print("Dropped low-correlation features:", weak_features)

In [None]:
ld_filtered.isnull().sum()

In [None]:
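numeric_cols = ld_filtered.select_dtypes(include=['int64', 'float64']).columns

# Fill missing numeric values with the column median.
# Assigning the result back avoids the chained inplace fillna, which newer pandas versions warn about.
for col in numeric_cols:
    ld_filtered[col] = ld_filtered[col].fillna(ld_filtered[col].median())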

In [None]:
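categorical_cols = ld_filtered.select_dtypes(include=['object', 'category']).columns

# Fill missing categorical values with the most frequent value (mode).
# After one-hot encoding there may be no such columns left, in which case this loop does nothing.
for col in categorical_cols:
    ld_filtered[col] = ld_filtered[col].fillna(ld_filtered[col].mode()[0])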

In [None]:
ld_filtered.isnull().sum()
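
The cells above clean the data but do not rebalance the classes. One common option, shown here only as a sketch and not as the method used in this notebook, is to weight the classes inversely to their frequency with scikit-learn's `compute_class_weight`. The toy label array `y_toy` is made up for the example.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 9 majority-class rows and 1 minority-class row.
y_toy = np.array([0] * 9 + [1])
classes = np.unique(y_toy)

# "balanced" uses n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)
```

A dictionary of this shape can be passed to Keras through the `class_weight` argument of `model.fit`, so that mistakes on the rare class cost more during training. Resampling the training data is an alternative approach.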

5. Plot the balanced or imbalanced data

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(12, 8))
sns.heatmap(ld.corr(numeric_only=True), annot=False, cmap='coolwarm')
plt.title("Correlation Heatmap - ld")
plt.show()

plt.figure(figsize=(12, 8))
sns.heatmap(ld_filtered.corr(numeric_only=True), annot=False, cmap='coolwarm')
plt.title("Correlation Heatmap - ld_filtered")
plt.show()
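
The heatmaps show feature correlations, not the class balance that this step asks about. A bar chart of the `TARGET` counts is one simple way to show the imbalance; the sketch below uses toy labels in place of `ld_filtered['TARGET']`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy labels standing in for ld_filtered['TARGET'].
target = pd.Series([0] * 9 + [1], name="TARGET")

# Number of rows per class, ordered by class label.
counts = target.value_counts().sort_index()

fig, ax = plt.subplots(figsize=(5, 4))
ax.bar(counts.index.astype(str), counts.values)
ax.set_xlabel("TARGET")
ax.set_ylabel("Number of loans")
ax.set_title("Class distribution")
plt.show()
```

Replacing `target` with the real column should produce the same kind of plot for the full dataset.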

6. Build a Deep Learning Model

In [None]:
# Separate features and target
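# Cast to float32 so that boolean dummy columns and numeric columns form one numeric array for Keras
X = ld_filtered.drop('TARGET', axis=1).astype('float32')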
y = ld_filtered['TARGET']

In [None]:
# Train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
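
The notebook feeds the features to the network unscaled. Neural networks often train poorly when features have very different ranges, so standardizing them is a common extra step. The following is a minimal sketch on synthetic data, not part of the original pipeline; `X_toy` and `y_toy` are made up, and the scaler is fitted on the training rows only so that no test statistics leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, plus imbalanced labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=[0.0, 50_000.0], scale=[1.0, 20_000.0], size=(200, 2))
y_toy = np.array([0] * 180 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std from training rows only
X_te_scaled = scaler.transform(X_te)      # reuse the training statistics

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
```

In the notebook, the same pattern would apply to `X_train` and `X_test` before they are passed to `model.fit` and `model.evaluate`.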

In [None]:
#Build a Deep Learning Model
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Define the model
model = Sequential([
    Dense(64, activation='relu', input_shape=(X_train.shape[1],)),
    Dropout(0.3),
    Dense(32, activation='relu'),
    Dropout(0.2),
    Dense(1, activation='sigmoid')  # Binary classification
])

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy', tf.keras.metrics.Recall(name="sensitivity")])


In [None]:
#Train the model
home_loan_model = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=25,
    batch_size=32
)

7. Calculate Sensitivity as a metric

In [None]:
loss, accuracy, sensitivity = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy:.4f}")
print(f"Sensitivity (Recall): {sensitivity:.4f}")
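
Sensitivity is the fraction of actual positives that the model flags: TP / (TP + FN). As a cross-check on the Keras metric, it can be computed from a confusion matrix. The sketch below uses made-up labels and probabilities and a threshold of 0.5, which is also the default threshold of `tf.keras.metrics.Recall`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up true labels and predicted probabilities.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.9, 0.6, 0.4, 0.2, 0.7, 0.3, 0.2, 0.1, 0.1, 0.05])

# Turn probabilities into hard predictions at threshold 0.5.
y_hat = (y_prob >= 0.5).astype(int)

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
print(f"recall_score={recall_score(y_true, y_hat):.2f}")
```

On heavily imbalanced data, accuracy can look high even when sensitivity is close to zero, which is why the two are reported together.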

In [None]:
import matplotlib.pyplot as plt

plt.plot(home_loan_model.history['accuracy'], label='Train Accuracy')
plt.plot(home_loan_model.history['val_accuracy'], label='Val Accuracy')
plt.legend()
plt.title('Accuracy Over Epochs')
plt.show()

8. Calculate the area under the receiver operating characteristic curve

The AUC-ROC is a powerful metric for evaluating binary classifiers. It tells you how well the model separates the two classes (e.g., defaulters vs non-defaulters) across all thresholds.

Recap:
The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) as the classification threshold varies.

The AUC (Area Under the Curve):

0.5 = No better than random

1.0 = Perfect classifier
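
To make the definitions concrete, the sketch below computes TPR and FPR by hand at two thresholds and then the AUC with scikit-learn. The labels, scores, and the helper `tpr_fpr` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores for four samples.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

def tpr_fpr(y_true, y_score, threshold):
    """Return (TPR, FPR) when scores >= threshold are predicted positive."""
    y_hat = y_score >= threshold
    tp = np.sum(y_hat & (y_true == 1))
    fp = np.sum(y_hat & (y_true == 0))
    tpr = tp / np.sum(y_true == 1)
    fpr = fp / np.sum(y_true == 0)
    return float(tpr), float(fpr)

# Each threshold gives one point on the ROC curve.
for t in (0.2, 0.5):
    print(t, tpr_fpr(y_true, y_score, t))

# AUC summarizes the whole curve in one number.
auc = roc_auc_score(y_true, y_score)

# A model that gives every sample the same score cannot rank at all.
auc_constant = roc_auc_score(y_true, np.full(4, 0.5))
print(auc, auc_constant)
```

The second AUC illustrates the 0.5 end of the scale: a model whose scores carry no ranking information lands exactly on the random baseline.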

The next cell uses:

y_test → true labels

y_pred_prob → predicted probabilities from the model (model.predict())

In [None]:
from sklearn.metrics import roc_auc_score, roc_curve
import matplotlib.pyplot as plt

# 1. Get predicted probabilities
y_pred_prob = model.predict(X_test).ravel()  # flatten the (n, 1) output to 1-D

# 2. Compute AUC score
auc_score = roc_auc_score(y_test, y_pred_prob)
print(f"AUC-ROC Score: {auc_score:.4f}")

# 3. Plot the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'AUC = {auc_score:.2f}')
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')  # baseline
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid()
plt.show()


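So the area under the ROC curve is 0.50, which by the scale above means the model is no better than random at separating defaulters from non-defaulters.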