## Loan Default Prediction using Deep Neural Networks with ADASYN

This notebook trains a deep neural network (DNN) to predict loan defaults on Lending Club data, using ADASYN to oversample the minority class. Key steps include:
1.  **Importing Libraries**: Essential packages for data manipulation, visualization, and modeling.
2.  **Loading Dataset**: Reading the loan data.
3.  **Exploratory Data Analysis (EDA)**: Understanding data structure, selecting features, and preparing the target variable.
4.  **Feature Visualization**: Visualizing distributions and correlations of selected features.
5.  **Data Preprocessing**: Splitting data, scaling features, and checking class balance.
6.  **Handling Imbalanced Data**: Applying ADASYN to the training set to address class imbalance.
7.  **Deep Neural Network (DNN) Modeling**: Defining, compiling, and training a DNN.
8.  **Model Evaluation**: Assessing the model's performance using various metrics.

### 1. Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, RobustScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, classification_report
from imblearn.over_sampling import ADASYN
import tensorflow as tf
from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

# Visualization Setup
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.6f' % x)

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


### 2. Load Dataset
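The data comes from the Kaggle dataset at https://www.kaggle.com/datasets/adarshsng/lending-club-loan-data-csv (distributed as CSV); the cell below reads a local copy saved in Parquet format.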

In [2]:
# Update the path to your dataset file if necessary
df = pd.read_parquet("C:/DATA/Data/loan_ORI.parquet.gzip")

### 3. Exploratory Data Analysis (EDA)

#### 3.1. Dataset Information

In [None]:
df.info()

#### 3.2. Descriptive Statistics

In [None]:
numerical_cols_for_describe = df.select_dtypes(include=np.number).columns.tolist()
describe_df = df[numerical_cols_for_describe].describe().T
describe_df

#### 3.3. Feature Selection and Initial Target Exploration

In [None]:
selected_features = ['loan_amnt', 'int_rate', 'installment', 'annual_inc', 'loan_status']
df_selected_features = df[selected_features].copy()

print("Original Loan Status Distribution:")
loan_status_prs = df_selected_features['loan_status'].value_counts(normalize=True) * 100
loan_status_count = df_selected_features['loan_status'].value_counts()
loan_status_summary = pd.DataFrame({
    'Count': loan_status_count,
    'Percent (%)': loan_status_prs
}).reset_index()
loan_status_summary

#### 3.4. Target Variable Preparation for Binary Classification

In [None]:
df_selected = df_selected_features[df_selected_features['loan_status'].isin(['Fully Paid','Default'])].copy()
mapping = {'Fully Paid': 0, 'Default': 1}
df_selected['loan_status'] = df_selected['loan_status'].map(mapping)

print("Encoded Loan Status Distribution (%):")
print(df_selected['loan_status'].value_counts(normalize=True) * 100)
print("\nEncoded Loan Status Distribution (count):")
print(df_selected['loan_status'].value_counts())

plt.figure(figsize=(6,4))
ax = sns.countplot(x='loan_status', data=df_selected)
plt.title('Target Class Distribution (0: Fully Paid, 1: Default)')
plt.xlabel('Loan Status (Encoded)')
plt.ylabel('Count')
# Annotate each bar with its integer count
for p in ax.patches:
    ax.annotate(f'{int(p.get_height())}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 5), textcoords='offset points')
plt.show()

target_variable = 'loan_status'
independent_features = ['loan_amnt', 'int_rate', 'installment', 'annual_inc']

#### 3.5. Handling Missing Values in Selected Features

In [None]:
print("Missing Values Before Handling:")
print(df_selected[independent_features].isnull().sum())

df_selected.dropna(subset=independent_features, inplace=True)
print("\nMissing Values After Row Removal (if any):")
print(df_selected[independent_features].isnull().sum())
print(f"\nDataset Size (After Handling Missing Values): {df_selected.shape}")

### 4. Feature Visualization

In [None]:
if not df_selected.empty and all(f in df_selected.columns for f in independent_features):
    for col in independent_features:
        plt.figure(figsize=(8, 5))
        sns.histplot(df_selected[col], kde=True, bins=30)
        plt.title(f'Distribution of {col}')
        plt.xlabel(col)
        plt.ylabel('Frequency')
        plt.show()

    plt.figure(figsize=(8, 6))
    correlation_matrix = df_selected[independent_features].corr()
    sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
    plt.title('Correlation Matrix of Independent Features')
    plt.show()
else:
    print("Skipping feature visualization as df_selected is empty or features are missing.")

### 5. Data Preprocessing

#### 5.1. Data Splitting & Feature Scaling

In [None]:
X = df_selected[independent_features]
y = df_selected[target_variable]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

loan_amnt_col = ['loan_amnt']
other_cols = ['int_rate', 'installment', 'annual_inc']

min_max_scaler_loan_amnt = MinMaxScaler(feature_range=(-1, 1))
scaler_others = StandardScaler()

X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

X_train_scaled[loan_amnt_col] = min_max_scaler_loan_amnt.fit_transform(X_train[loan_amnt_col])
X_test_scaled[loan_amnt_col] = min_max_scaler_loan_amnt.transform(X_test[loan_amnt_col])

if other_cols: # Ensure other_cols is not empty
    X_train_scaled[other_cols] = scaler_others.fit_transform(X_train[other_cols])
    X_test_scaled[other_cols] = scaler_others.transform(X_test[other_cols])

print("\nDescriptive statistics of X_train_scaled:")
print(X_train_scaled.describe())
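
The scaling above fits each scaler on the training split only and reuses the fitted parameters on the test split, which keeps test-set statistics out of preprocessing. A minimal sketch of the same idea with scikit-learn's `ColumnTransformer` follows. The synthetic `demo` frame and the names `demo_train`, `demo_test`, and `scaler` are illustrative assumptions, not part of the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the four selected features (values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "loan_amnt": rng.uniform(1_000, 40_000, 200),
    "int_rate": rng.uniform(5, 30, 200),
    "installment": rng.uniform(30, 1_500, 200),
    "annual_inc": rng.lognormal(11, 0.5, 200),
})
demo_train, demo_test = train_test_split(demo, test_size=0.2, random_state=42)

# One transformer per column group; output columns follow the order listed here
scaler = ColumnTransformer([
    ("loan_amnt", MinMaxScaler(feature_range=(-1, 1)), ["loan_amnt"]),
    ("others", StandardScaler(), ["int_rate", "installment", "annual_inc"]),
])

train_scaled = scaler.fit_transform(demo_train)  # parameters learned from train only
test_scaled = scaler.transform(demo_test)        # same parameters reused on test

print(train_scaled[:, 0].min(), train_scaled[:, 0].max())
```

Because the test split is transformed with training-set parameters, its `loan_amnt` values may fall slightly outside [-1, 1]; that is expected rather than a bug.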

#### 5.2. Check Class Composition in Splits

In [None]:
# Check composition and count of classes in 'y' after splitting
print("Class composition and counts in y_train:")
print(y_train.value_counts())
print("\nClass proportions in y_train (%):")
print(y_train.value_counts(normalize=True) * 100)

print("\nClass composition and counts in y_test:")
print(y_test.value_counts())
print("\nClass proportions in y_test (%):")
print(y_test.value_counts(normalize=True) * 100)

### 6. Handling Imbalanced Data with ADASYN

In [None]:
print(f"\nClass distribution in y_train before ADASYN: \n{y_train.value_counts(normalize=True)}")

# Save X_train_scaled and y_train before ADASYN for PCA visualization
X_train_scaled_before_adasyn = X_train_scaled.copy()
y_train_before_adasyn = y_train.copy()

# ADASYN parameters (K=5 for n_neighbors)
adasyn = ADASYN(random_state=420, n_neighbors=5) # n_neighbors=5 is the default
X_train_resampled, y_train_resampled = adasyn.fit_resample(X_train_scaled, y_train)

print(f"\nShape of X_train after ADASYN: {X_train_resampled.shape}")
print(f"Class distribution in y_train after ADASYN: \n{pd.Series(y_train_resampled).value_counts(normalize=True)}")

# --- Visualize ADASYN Results with PCA ---
print("\nVisualizing ADASYN results with PCA (2 components)...")
pca = PCA(n_components=2)

# Apply PCA to training data BEFORE ADASYN
X_train_pca_before = pca.fit_transform(X_train_scaled_before_adasyn)

# Apply PCA to training data AFTER ADASYN
if isinstance(X_train_resampled, np.ndarray):
    X_train_resampled_df = pd.DataFrame(X_train_resampled, columns=X_train_scaled.columns)
else:
    X_train_resampled_df = X_train_resampled

X_train_pca_after = pca.transform(X_train_resampled_df) # Use transform as PCA is already fitted

plt.figure(figsize=(14, 7))

# Plot before ADASYN
plt.subplot(1, 2, 1)
plt.scatter(X_train_pca_before[y_train_before_adasyn == 0, 0], X_train_pca_before[y_train_before_adasyn == 0, 1], label='Majority (Class 0 - Fully Paid)', alpha=0.5, s=10)
plt.scatter(X_train_pca_before[y_train_before_adasyn == 1, 0], X_train_pca_before[y_train_before_adasyn == 1, 1], label='Minority (Class 1 - Default)', alpha=0.7, s=15, c='red')
plt.title('Training Data Before ADASYN (PCA 2D)')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.legend()

# Plot after ADASYN
plt.subplot(1, 2, 2)
# Original and synthetic minority samples share label 1, so they are plotted together

plt.scatter(X_train_pca_after[y_train_resampled == 0, 0], X_train_pca_after[y_train_resampled == 0, 1], label='Majority (Class 0 - Fully Paid)', alpha=0.3, s=10)
plt.scatter(X_train_pca_after[y_train_resampled == 1, 0], X_train_pca_after[y_train_resampled == 1, 1], label='Minority + Synthetic (Class 1 - Default)', alpha=0.5, s=15, c='red')

plt.title('Training Data After ADASYN (PCA 2D)')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.legend()
plt.tight_layout()
plt.show()
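
ADASYN differs from plain oversampling in that it requests more synthetic points for minority samples whose neighborhoods are dominated by the majority class. The sketch below illustrates only that allocation step, under simplifying assumptions; it is not imbalanced-learn's implementation, and the function name `adasyn_allocation` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_allocation(X, y, k=5, beta=1.0):
    """Number of synthetic points requested per minority sample (label 1)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    minority = X[y == 1]
    n_min, n_maj = len(minority), int((y == 0).sum())
    total_needed = int((n_maj - n_min) * beta)  # points needed to balance the classes

    # Ask for k + 1 neighbours and drop the first: each point is its own nearest neighbour
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    neighbour_idx = nn.kneighbors(minority, return_distance=False)[:, 1:]

    # r_i: fraction of the k neighbours that belong to the majority class
    r = (y[neighbour_idx] == 0).mean(axis=1)
    if r.sum() == 0:
        return np.zeros(n_min, dtype=int)
    r_hat = r / r.sum()  # normalise to a distribution over minority samples
    return np.rint(r_hat * total_needed).astype(int)

# Toy data: 100 majority points, 5 minority points mixed in, 5 minority points far away
rng = np.random.default_rng(0)
X_majority = rng.normal(0.0, 1.0, size=(100, 2))
X_minority_mixed = rng.normal(0.0, 1.0, size=(5, 2))
X_minority_isolated = rng.normal(10.0, 0.1, size=(5, 2))
X_demo = np.vstack([X_majority, X_minority_mixed, X_minority_isolated])
y_demo = np.r_[np.zeros(100, dtype=int), np.ones(10, dtype=int)]

counts = adasyn_allocation(X_demo, y_demo, k=3)
print("mixed:", counts[:5], "isolated:", counts[5:])
```

In this toy setup the isolated minority points receive no synthetic samples because all of their neighbors are also minority points, while the points mixed into the majority cloud receive the whole budget.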

### 7. Deep Neural Network (DNN) Modeling

#### 7.1. Define DNN Architecture

In [None]:
model = Sequential([
    Input(shape=(X_train_resampled.shape[1],)), # Input shape based on number of features
    Dense(64, activation='relu'),
    Dropout(0.3), # Adding Dropout for regularization
    Dense(32, activation='relu'),
    Dropout(0.3),
    Dense(16, activation='relu'),
    Dropout(0.3),
    Dense(1, activation='sigmoid') # Sigmoid for binary classification output
])

# Compile the model
# Optimizer: Adam, Loss: binary_crossentropy (for binary classification)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy', 'Precision', 'Recall', tf.keras.metrics.AUC(name='auc')])

model.summary()

# Callback for Early Stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=20, restore_best_weights=True)
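
As a cross-check on `model.summary()`, the trainable parameter count of a stack of `Dense` layers can be worked out by hand: each layer holds `inputs * units` weights plus `units` biases, and `Dropout` layers add none. The sketch below assumes the four input features listed in `independent_features`; the helper `dense_params` is a hypothetical name used only for this illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

# Input width followed by the units of each Dense layer: 4 -> 64 -> 32 -> 16 -> 1
layer_sizes = [4, 64, 32, 16, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)     # [320, 2080, 528, 17]
print(total_params)  # 2945
```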

#### 7.2. Train the DNN Model

In [None]:
print("\nStarting DNN model training...")
history = model.fit(
    X_train_resampled, y_train_resampled, # Use ADASYN-resampled data
    epochs=100,
    batch_size=32,
    validation_split=0.2, # Keras holds out the last 20% of the rows, taken before shuffling
    callbacks=[early_stopping],
    verbose=1
)
print("Model training finished.")
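
One point worth checking with `validation_split`: Keras takes the validation rows from the end of the arrays before any shuffling. If the oversampler appends its synthetic rows after the original rows (an assumption to verify on the installed imbalanced-learn version), the held-out slice can consist mostly of synthetic class-1 samples. The sketch below uses made-up label counts to show the effect and one possible remedy, a shared shuffle before calling `fit`; `tail_positive_rate` and `shuffle_together` are hypothetical helper names.

```python
import numpy as np

def tail_positive_rate(y, fraction=0.2):
    """Share of class-1 labels in the last `fraction` of rows."""
    n_val = int(len(y) * fraction)
    return float(np.mean(y[-n_val:]))

def shuffle_together(X, y, seed=0):
    """Apply one shared random permutation to features and labels."""
    perm = np.random.default_rng(seed).permutation(len(y))
    return X[perm], y[perm]

# Made-up layout: 1000 original rows (800 class 0, 200 class 1, mixed),
# followed by 600 synthetic class-1 rows appended at the end
rng = np.random.default_rng(0)
y_original = rng.permutation(np.r_[np.zeros(800, dtype=int), np.ones(200, dtype=int)])
y_demo = np.r_[y_original, np.ones(600, dtype=int)]
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # row ids stand in for features

rate_before = tail_positive_rate(y_demo)
X_shuf, y_shuf = shuffle_together(X_demo, y_demo)
rate_after = tail_positive_rate(y_shuf)

print(f"class-1 share of validation tail, unshuffled: {rate_before:.2f}")
print(f"class-1 share of validation tail, shuffled:   {rate_after:.2f}")
```

If the resampled features are a pandas DataFrame, `.iloc[perm]` would play the role of `X[perm]`. An alternative design is to carve out a validation set before oversampling and pass it via `validation_data`, so that validation metrics are computed on real loans only.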

#### 7.3. Plot Training History
Visualize the model's training and validation accuracy and loss over epochs.

In [None]:
if history is not None:
    plt.figure(figsize=(14, 6))

    # Plot training & validation accuracy values
    plt.subplot(1, 2, 1)
    plt.plot(history.history['accuracy'], label='Training Accuracy')
    if 'val_accuracy' in history.history:
        plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
    plt.title('Model Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend(loc='lower right')

    # Plot training & validation loss values
    plt.subplot(1, 2, 2)
    plt.plot(history.history['loss'], label='Training Loss')
    if 'val_loss' in history.history:
        plt.plot(history.history['val_loss'], label='Validation Loss')
    plt.title('Model Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend(loc='upper right')

    plt.tight_layout()
    plt.show()
else:
    print("Training history is not available for plotting.")

### 8. Model Evaluation
Evaluate the trained model on the (unseen) test set. Calculate and display accuracy, precision, recall, specificity, confusion matrix, and classification report.
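
scikit-learn has no dedicated specificity function, so it is usually derived from the confusion matrix as TN / (TN + FP). The sketch below works through a small made-up example; the arrays `y_true_demo` and `y_pred_demo` are illustrative only. Passing `labels=[0, 1]` keeps the matrix 2x2 even when only one class appears.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = Fully Paid, 1 = Default
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred_demo = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# Row = actual class, column = predicted class; ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1]).ravel()

specificity_demo = tn / (tn + fp)  # 4 / 5 = 0.8
recall_demo = tp / (tp + fn)       # 2 / 3
precision_demo = tp / (tp + fp)    # 2 / 3

# With labels fixed, a single-class input still produces a 2x2 matrix
single_class_cm = confusion_matrix([0, 0, 0], [0, 0, 0], labels=[0, 1])
print(specificity_demo, single_class_cm.shape)
```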

In [None]:
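# return_dict=True yields one entry per compiled metric (loss, accuracy, precision, recall, auc)
test_metrics = model.evaluate(X_test_scaled, y_test, verbose=1, return_dict=True)
for name, value in test_metrics.items():
    print(f"Test {name}: {value:.4f}")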


In [None]:
# Predictions on the test set
y_pred_proba = model.predict(X_test_scaled)
y_pred = (y_pred_proba > 0.5).astype(int) # Threshold 0.5 for binary classification

# Evaluation Metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, pos_label=1) # Precision for the 'Default' class (1)
recall = recall_score(y_test, y_pred, pos_label=1)       # Recall for the 'Default' class (1)

# Passing labels=[0, 1] guarantees a 2x2 matrix even if only one class is predicted
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

# Specificity = TN / (TN + FP): the share of 'Fully Paid' loans correctly identified
specificity = tn / (tn + fp) if (tn + fp) > 0 else np.nan

print("\n--- MODEL EVALUATION RESULTS ---")
print(f"Accuracy           : {accuracy:.4f} (Paper target: 0.941)")
print(f"Precision (Default): {precision:.4f} (Paper target: 0.972 for Default class)")
print(f"Recall (Sensitivity, Default): {recall:.4f} (Paper target: 0.960 for Default class)")
print(f"Specificity (Fully Paid): {specificity:.4f} (Paper target: 0.823 for Fully Paid class)")

print("\nConfusion Matrix:")
print(cm)
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Predicted Fully Paid (0)', 'Predicted Default (1)'],
            yticklabels=['Actual Fully Paid (0)', 'Actual Default (1)'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Fully Paid (0)', 'Default (1)']))

print("\nNOTES:")
print("1. Results can vary depending on the specific dataset split, preprocessing, and random weight initialization in the DNN.")
print("2. Ensure the dataset path passed to pd.read_parquet in Section 2 is correct.")
print("3. The reference paper might use a different version or source of the Lending Club dataset (e.g., 2007-2015), which could lead to different metrics.")
print("4. Hyperparameters such as the number of units in hidden layers, dropout rate, learning rate, etc., can be further tuned for optimization.")
print("5. Feature scaling was applied: 'loan_amnt' scaled to [-1,1] with MinMaxScaler, other specified features with StandardScaler.")
print("6. PCA visualization was added to show the effect of ADASYN on the training data distribution.")