![banner-image](assets/banner-image.jpg)

<a id='top'></a>
# Project: Loan Default Prediction

## Table of Contents
<ul>
    <li><a href='#intro'>Introduction</a></li>
    <li><a href='#gather'>Data Gathering</a></li>
    <li><a href='#explore'>Data Exploration</a></li>
    <li><a href='#merge'>Merging</a></li>
    <li><a href='#process'>Preprocessing</a></li>
    <li><a href='#engineer'>Feature Engineering</a></li>
    <li><a href='#imbalance'>Handling Imbalance</a></li>
    <li><a href='#prediction'>Prediction</a></li>
    <li><a href='#results'>Results</a></li>
    <li><a href='#end'>End</a></li>
</ul>

<a id='intro'></a>

## Introduction

In [1]:
import pandas as pd
import numpy as np

<a id='gather'></a>

## Data Gathering

The data was collected from the Zindi competition (Nigeria dataset).

In [2]:
demo = pd.read_csv("data/train/traindemographics.csv")
perf = pd.read_csv("data/train/trainperf.csv")
prev = pd.read_csv("data/train/trainprevloans.csv")

<a id='explore'></a>

## Data Exploration

1. Check the shape of the demo, perf and prev datasets.
2. Check for duplicate customer IDs in demo and perf.

In [3]:
demo.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4346 entries, 0 to 4345
Data columns (total 9 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   customerid                  4346 non-null   object 
 1   birthdate                   4346 non-null   object 
 2   bank_account_type           4346 non-null   object 
 3   longitude_gps               4346 non-null   float64
 4   latitude_gps                4346 non-null   float64
 5   bank_name_clients           4346 non-null   object 
 6   bank_branch_clients         51 non-null     object 
 7   employment_status_clients   3698 non-null   object 
 8   level_of_education_clients  587 non-null    object 
dtypes: float64(2), object(7)
memory usage: 305.7+ KB


In [4]:
perf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4368 entries, 0 to 4367
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   customerid     4368 non-null   object 
 1   systemloanid   4368 non-null   int64  
 2   loannumber     4368 non-null   int64  
 3   approveddate   4368 non-null   object 
 4   creationdate   4368 non-null   object 
 5   loanamount     4368 non-null   float64
 6   totaldue       4368 non-null   float64
 7   termdays       4368 non-null   int64  
 8   referredby     587 non-null    object 
 9   good_bad_flag  4368 non-null   object 
dtypes: float64(2), int64(3), object(5)
memory usage: 341.4+ KB


In [5]:
demo.shape

(4346, 9)

In [6]:
perf.shape

(4368, 10)

In [7]:
prev.shape

(18183, 12)

In [8]:

demo['customerid'].nunique()

4334

In [9]:
perf['customerid'].nunique()

4368

In [10]:
prev['customerid'].nunique()

4359
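Comparing the unique counts with the row counts shows where customer IDs repeat: demo has 4346 rows but 4334 unique IDs, while perf has 4368 of each. Below is a minimal sketch of that check on a toy frame. The data and the names `toy_demo` and `toy_demo_unique` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the demographics table; values are illustrative only.
toy_demo = pd.DataFrame({
    "customerid": ["a", "b", "b", "c", "c", "c"],
    "bank_account_type": ["Savings", "Other", "Other",
                          "Savings", "Savings", "Savings"],
})

n_rows = len(toy_demo)
n_unique = toy_demo["customerid"].nunique()
# Rows beyond the first occurrence of each customerid
n_repeated = int(toy_demo["customerid"].duplicated().sum())

print(f"rows={n_rows}, unique ids={n_unique}, repeated rows={n_repeated}")

# Keep one row per customer before joining on customerid
toy_demo_unique = toy_demo.drop_duplicates(subset="customerid", keep="first")
```

Whether keeping the first row per customer is appropriate depends on whether the repeated rows are exact copies, which is worth checking on the real data.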

<a id='merge'></a>

## Merge data

We merge the training tables, joining demographics onto performance.

Performance (trainperf) is the main dataset since it contains the target (good_bad_flag).

In [11]:
df = perf.merge(demo, on="customerid", how="left")

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4376 entries, 0 to 4375
Data columns (total 18 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   customerid                  4376 non-null   object 
 1   systemloanid                4376 non-null   int64  
 2   loannumber                  4376 non-null   int64  
 3   approveddate                4376 non-null   object 
 4   creationdate                4376 non-null   object 
 5   loanamount                  4376 non-null   float64
 6   totaldue                    4376 non-null   float64
 7   termdays                    4376 non-null   int64  
 8   referredby                  589 non-null    object 
 9   good_bad_flag               4376 non-null   object 
 10  birthdate                   3277 non-null   object 
 11  bank_account_type           3277 non-null   object 
 12  longitude_gps               3277 non-null   float64
 13  latitude_gps                3277 
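The merged frame has 4376 rows against 4368 in perf, which is consistent with the repeated `customerid` values seen in demo: a left join emits one output row per matching row on the right. The sketch below reproduces the effect on toy data and shows one possible guard, using the `validate` argument of `merge`. The frames `perf_toy` and `demo_toy` are hypothetical, and dropping duplicates is an assumption about how the repeats should be handled.

```python
import pandas as pd

# Toy frames; "b" appears twice in the demographics table.
perf_toy = pd.DataFrame({
    "customerid": ["a", "b", "c"],
    "good_bad_flag": ["Good", "Bad", "Good"],
})
demo_toy = pd.DataFrame({
    "customerid": ["a", "b", "b"],
    "bank_account_type": ["Savings", "Other", "Other"],
})

# Naive left join: the repeated key on the right duplicates the loan row.
naive = perf_toy.merge(demo_toy, on="customerid", how="left")

# Deduplicate first, then ask pandas to verify that right-hand keys are unique.
demo_dedup = demo_toy.drop_duplicates(subset="customerid")
safe = perf_toy.merge(
    demo_dedup, on="customerid", how="left", validate="many_to_one"
)

print(len(perf_toy), len(naive), len(safe))
```

With `validate="many_to_one"`, pandas raises a `MergeError` if the right-hand keys are not unique, so an unexpected row increase is caught at merge time.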

In [13]:
prev['approveddate'] = pd.to_datetime(prev['approveddate'])
prev['creationdate'] = pd.to_datetime(prev['creationdate'])
prev['closeddate'] = pd.to_datetime(prev['closeddate'])
prev['firstduedate'] = pd.to_datetime(prev['firstduedate'])
prev['firstrepaiddate'] = pd.to_datetime(prev['firstrepaiddate'])

# repayment delay
prev['repayment_delay'] = (prev['firstrepaiddate'] - prev['firstduedate']).dt.days

# aggregation
agg_prev = prev.groupby("customerid").agg({
    "systemloanid": "count",            # number of previous loans
    "loanamount": ["mean", "max"],
    "totaldue": ["mean"],
    "termdays": ["mean", "max"],
    "repayment_delay": ["mean"],
})

agg_prev.columns = ["_".join(col) for col in agg_prev.columns]
agg_prev.head()

| customerid | systemloanid_count | loanamount_mean | loanamount_max | totaldue_mean | termdays_mean | termdays_max | repayment_delay_mean |
|---|---|---|---|---|---|---|---|
| 8a1088a0484472eb01484669e3ce4e0b | 1 | 10000.0 | 10000.0 | 11500.0 | 15.0 | 15 | 6.0 |
| 8a1a1e7e4f707f8b014f797718316cad | 4 | 17500.0 | 30000.0 | 22375.0 | 37.5 | 60 | -0.25 |
| 8a1a32fc49b632520149c3b8fdf85139 | 7 | 12857.142857 | 20000.0 | 15214.285714 | 19.285714 | 30 | -0.428571 |
| 8a1eb5ba49a682300149c3c068b806c7 | 8 | 16250.0 | 30000.0 | 20300.0 | 33.75 | 60 | -3.125 |
| 8a1edbf14734127f0147356fdb1b1eb2 | 2 | 10000.0 | 10000.0 | 12250.0 | 22.5 | 30 | -4.0 |
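The aggregation above produces a two-level column index that is then flattened by joining the level names. As an alternative sketch, pandas named aggregation yields the same flat column names directly. The toy frame `prev_toy` and its values are invented for illustration, and only a subset of the notebook's columns is shown.

```python
import pandas as pd

# Toy previous-loans table; dates and amounts are illustrative only.
prev_toy = pd.DataFrame({
    "customerid": ["a", "a", "b"],
    "systemloanid": [1, 2, 3],
    "loanamount": [10000.0, 20000.0, 10000.0],
    "firstduedate": pd.to_datetime(["2017-01-10", "2017-02-10", "2017-03-10"]),
    "firstrepaiddate": pd.to_datetime(["2017-01-08", "2017-02-14", "2017-03-10"]),
})

# Negative values mean the first repayment came before the due date.
prev_toy["repayment_delay"] = (
    prev_toy["firstrepaiddate"] - prev_toy["firstduedate"]
).dt.days

# Named aggregation: output_column=(input_column, function)
agg_toy = prev_toy.groupby("customerid").agg(
    systemloanid_count=("systemloanid", "count"),
    loanamount_mean=("loanamount", "mean"),
    loanamount_max=("loanamount", "max"),
    repayment_delay_mean=("repayment_delay", "mean"),
)

print(agg_toy)
```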


#### Merge aggregated previous-loan statistics

In [14]:
df = df.merge(agg_prev, on="customerid", how="left")

In [15]:
df.head()

Unnamed: 0,customerid,systemloanid,loannumber,approveddate,creationdate,loanamount,totaldue,termdays,referredby,good_bad_flag,...,bank_branch_clients,employment_status_clients,level_of_education_clients,systemloanid_count,loanamount_mean,loanamount_max,totaldue_mean,termdays_mean,termdays_max,repayment_delay_mean
0,8a2a81a74ce8c05d014cfb32a0da1049,301994762,12,2017-07-25 08:22:56.000000,2017-07-25 07:22:47.000000,30000.0,34500.0,30,,Good,...,,Permanent,Post-Graduate,11.0,18181.818182,30000.0,22081.818182,30.0,30.0,-0.909091
1,8a85886e54beabf90154c0a29ae757c0,301965204,2,2017-07-05 17:04:41.000000,2017-07-05 16:04:18.000000,15000.0,17250.0,30,,Good,...,"DUGBE,IBADAN",Permanent,Graduate,,,,,,,
2,8a8588f35438fe12015444567666018e,301966580,7,2017-07-06 14:52:57.000000,2017-07-06 13:52:51.000000,20000.0,22250.0,15,,Good,...,,Permanent,,6.0,10000.0,10000.0,11750.0,17.5,30.0,0.833333
3,8a85890754145ace015429211b513e16,301999343,3,2017-07-27 19:00:41.000000,2017-07-27 18:00:35.000000,10000.0,11500.0,15,,Good,...,,Permanent,,2.0,10000.0,10000.0,12250.0,22.5,30.0,7.5
4,8a858970548359cc0154883481981866,301962360,9,2017-07-03 23:42:45.000000,2017-07-03 22:42:39.000000,40000.0,44000.0,30,,Good,...,,Permanent,Primary,8.0,18750.0,30000.0,23550.0,37.5,60.0,-3.125


In [16]:
df.to_csv("output.csv", index=False)

<a id='process'></a>

## Data Preprocessing

#### Handling Categorical Features

In [17]:
cat_cols = ["bank_account_type", "bank_name_clients", "employment_status_clients",
            "level_of_education_clients"]

df = pd.get_dummies(df, columns=cat_cols, drop_first=True)
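Several of these columns contain missing values (for example, `level_of_education_clients` has 587 non-null entries in demo). With `drop_first=True` and the default `dummy_na=False`, a missing value gets all-zero dummies, the same encoding as the dropped first category. The sketch below illustrates this on a toy column. Using `dummy_na=True` is shown as one possible option, not as what the notebook does.

```python
import pandas as pd

# Toy column with one missing value; categories are illustrative.
toy = pd.DataFrame(
    {"employment_status_clients": ["Permanent", "Student", None, "Permanent"]}
)

# Default behaviour: "Permanent" is dropped, and the missing row is all zeros too.
encoded = pd.get_dummies(
    toy, columns=["employment_status_clients"], drop_first=True
)

# Optional: add an explicit indicator column for missing values.
encoded_na = pd.get_dummies(
    toy, columns=["employment_status_clients"], drop_first=True, dummy_na=True
)

print(encoded)
print(encoded_na)
```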

In [18]:
X = df.drop("good_bad_flag", axis=1)
y = df["good_bad_flag"]

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

In [19]:
y_train.value_counts(normalize=True)

good_bad_flag
Good    0.782286
Bad     0.217714
Name: proportion, dtype: float64

<a id='engineer'></a>

## Feature Engineering

In [20]:
X = df.drop("good_bad_flag", axis=1)
y = df["good_bad_flag"]

from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

<a id='imbalance'></a>

## Check Imbalance

In [21]:
y_train.value_counts(normalize=True)

good_bad_flag
Good    0.782286
Bad     0.217714
Name: proportion, dtype: float64

Handle the imbalance with random undersampling.

In [22]:
from imblearn.under_sampling import RandomUnderSampler

# Initialize RandomUnderSampler
rus = RandomUnderSampler(random_state=42)

# Apply random undersampling
X_train_resampled, y_train_resampled = rus.fit_resample(X_train, y_train)

In [23]:
y_train_resampled.value_counts(normalize=True)

good_bad_flag
Bad     0.5
Good    0.5
Name: proportion, dtype: float64

In [24]:
y_train_resampled.shape

(1524,)
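Random undersampling keeps every minority-class row and draws an equally sized random subset of the majority class, which is why the resampled training set has 1524 rows split 50/50. The following is a rough pandas-only sketch of the idea on toy data. It is not the imblearn implementation, and the names `X_toy`, `y_toy`, `X_bal` and `y_bal` are hypothetical.

```python
import pandas as pd

# Toy imbalanced target: 8 "Good" vs 2 "Bad".
y_toy = pd.Series(["Good"] * 8 + ["Bad"] * 2, name="good_bad_flag")
X_toy = pd.DataFrame({"loanamount": range(10)})

# Size of the smallest class
n_minority = int(y_toy.value_counts().min())

# Draw n_minority rows from each class without replacement
keep_idx = y_toy.groupby(y_toy).sample(n=n_minority, random_state=42).index

X_bal = X_toy.loc[keep_idx]
y_bal = y_toy.loc[keep_idx]

print(y_bal.value_counts(normalize=True))
```

The trade-off is that most majority-class rows are discarded, so the model trains on less data in exchange for balanced classes.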

In [25]:
# Check columns and data types
X_train_resampled.dtypes

customerid                                   object
systemloanid                                  int64
loannumber                                    int64
approveddate                                 object
creationdate                                 object
loanamount                                  float64
totaldue                                    float64
termdays                                      int64
referredby                                   object
birthdate                                    object
longitude_gps                               float64
latitude_gps                                float64
bank_branch_clients                          object
systemloanid_count                          float64
loanamount_mean                             float64
loanamount_max                              float64
totaldue_mean                               float64
termdays_mean                               float64
termdays_max                                float64
repayment_de

<a id='prediction'></a>

## Model Training - XGBoost Pipeline

In [26]:
import xgboost as xgb
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
import matplotlib.pyplot as plt

# Step 1: Prepare data - drop identifier, date, and free-text columns
cols_to_drop = ['customerid', 'systemloanid', 'approveddate', 'creationdate', 
                'birthdate', 'referredby', 'bank_branch_clients']
existing_cols_to_drop = [col for col in cols_to_drop if col in X_train_resampled.columns]

X_train_clean = X_train_resampled.drop(columns=existing_cols_to_drop)
X_val_clean = X_val.drop(columns=existing_cols_to_drop)

# Step 2: Handle missing values by column type
# For numeric columns (int64, float64), fill with median
numeric_cols = X_train_clean.select_dtypes(include=['int64', 'float64']).columns
X_train_clean[numeric_cols] = X_train_clean[numeric_cols].fillna(X_train_clean[numeric_cols].median())
X_val_clean[numeric_cols] = X_val_clean[numeric_cols].fillna(X_train_clean[numeric_cols].median())

# For boolean columns, fill with False
bool_cols = X_train_clean.select_dtypes(include=['bool']).columns
X_train_clean[bool_cols] = X_train_clean[bool_cols].fillna(False)
X_val_clean[bool_cols] = X_val_clean[bool_cols].fillna(False)

print(f"Training shape: {X_train_clean.shape}")
print(f"Validation shape: {X_val_clean.shape}")

# Step 3: Encode target variable (Good=0, Bad=1)
y_train_encoded = (y_train_resampled == 'Bad').astype(int)
y_val_encoded = (y_val == 'Bad').astype(int)

print(f"\nTarget distribution - Train: {y_train_encoded.value_counts().to_dict()}")
print(f"Target distribution - Val: {y_val_encoded.value_counts().to_dict()}")

Training shape: (1524, 40)
Validation shape: (876, 40)

Target distribution - Train: {1: 762, 0: 762}
Target distribution - Val: {0: 685, 1: 191}
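Step 2 fills missing numeric values in both sets with medians computed on the training data, so no statistics from the validation set are used during preparation. A minimal sketch of the pattern on toy frames (the column values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy train/validation frames with missing values.
train_toy = pd.DataFrame({"loanamount_mean": [10000.0, np.nan, 30000.0, 20000.0]})
val_toy = pd.DataFrame({"loanamount_mean": [np.nan, 50000.0]})

# Median is computed on training rows only (NaN values are skipped).
train_median = train_toy.median()

# The same training statistic fills both sets.
train_filled = train_toy.fillna(train_median)
val_filled = val_toy.fillna(train_median)

print(train_median["loanamount_mean"])
print(val_filled)
```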


In [27]:
# Step 4: Train XGBoost Model
xgb_model = xgb.XGBClassifier(
    n_estimators=200,
    max_depth=2,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    min_child_weight=3,
    gamma=0,
    random_state=42,
    eval_metric='logloss',
    early_stopping_rounds=20
)

print("\nTraining XGBoost model...")
xgb_model.fit(
    X_train_clean, 
    y_train_encoded,
    eval_set=[(X_val_clean, y_val_encoded)],
    verbose=10
)

print("\nModel training complete!")


Training XGBoost model...
[0]	validation_0-logloss:0.68443
[10]	validation_0-logloss:0.64840
[20]	validation_0-logloss:0.63622
[30]	validation_0-logloss:0.63393
[40]	validation_0-logloss:0.63334
[50]	validation_0-logloss:0.63476
[60]	validation_0-logloss:0.63486
[64]	validation_0-logloss:0.63416

Model training complete!


In [28]:
# Feature Importance
feature_importance = pd.DataFrame({
    'feature': X_train_clean.columns,
    'importance': xgb_model.feature_importances_
}).sort_values('importance', ascending=False)

# Top 15 features
feature_importance.head(15)

| | feature | importance |
|---|---|---|
| 12 | repayment_delay_mean | 0.168386 |
| 9 | totaldue_mean | 0.071008 |
| 13 | bank_account_type_Other | 0.065331 |
| 7 | loanamount_mean | 0.062137 |
| 14 | bank_account_type_Savings | 0.054027 |
| 2 | totaldue | 0.052329 |
| 3 | termdays | 0.049509 |
| 0 | loannumber | 0.048204 |
| 20 | bank_name_clients_GT Bank | 0.047051 |
| 15 | bank_name_clients_Diamond Bank | 0.04567 |


In [None]:
# Plot Feature Importance
# DataFrame.plot creates its own figure, so figsize is passed there;
# a separate plt.figure() call beforehand would only leave an empty figure.
ax = feature_importance.head(15).plot(
    x='feature', y='importance', kind='barh', figsize=(10, 8)
)
ax.set_xlabel('Importance Score')
ax.set_ylabel('Features')
ax.set_title('Top 15 Most Important Features')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

<a id='results'></a>

## Results

In [None]:
# Make predictions
y_pred = xgb_model.predict(X_val_clean)
y_pred_proba = xgb_model.predict_proba(X_val_clean)[:, 1]

# Calculate ROC-AUC
roc_auc = roc_auc_score(y_val_encoded, y_pred_proba)
print(f"ROC-AUC Score: {roc_auc:.4f}")

# Classification report
print("\n" + "="*60)
print("Classification Report:")
print("="*60)
print(classification_report(y_val_encoded, y_pred, target_names=['Good', 'Bad']))
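ROC-AUC can be read as the probability that a randomly chosen "Bad" loan receives a higher predicted score than a randomly chosen "Good" loan, with ties counting as one half. The sketch below computes it that way on a four-row toy example. The helper `roc_auc_pairwise` is a hypothetical illustration, not a replacement for `roc_auc_score`, and the scores are made up.

```python
import numpy as np

def roc_auc_pairwise(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]   # scores of the positive class (Bad = 1)
    neg = scores[y_true == 0]   # scores of the negative class (Good = 0)
    diff = pos[:, None] - neg[None, :]   # every positive vs. every negative
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))

# Toy labels and scores, for illustration only.
y_true_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]

auc_toy = roc_auc_pairwise(y_true_toy, scores_toy)
print(auc_toy)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

Because it depends only on the ranking of scores, ROC-AUC is not affected by the 0.5 threshold that `predict` applies.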

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_val_encoded, y_pred)
print("Confusion Matrix:")
print("="*60)
print(f"{'':12} {'Predicted Good':>15} {'Predicted Bad':>15}")
print(f"{'Actual Good':12} {cm[0][0]:>15} {cm[0][1]:>15}")
print(f"{'Actual Bad':12} {cm[1][0]:>15} {cm[1][1]:>15}")
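With the encoding used here (Good=0, Bad=1), `cm[1][1]` counts correctly flagged bad loans and `cm[0][1]` counts good loans flagged as bad. Precision and recall for the "Bad" class follow directly from those cells. The counts in the sketch below are hypothetical placeholders chosen for easy arithmetic; they are not results from this model.

```python
import numpy as np

# Hypothetical confusion matrix, rows = actual, columns = predicted,
# ordered [Good, Bad]. These numbers are NOT the notebook's results.
cm_toy = np.array([
    [50, 10],   # actual Good: 50 predicted Good, 10 predicted Bad
    [5, 35],    # actual Bad:   5 predicted Good, 35 predicted Bad
])

tn, fp = cm_toy[0]
fn, tp = cm_toy[1]

precision_bad = tp / (tp + fp)   # of loans flagged Bad, share that are Bad
recall_bad = tp / (tp + fn)      # of Bad loans, share that were flagged
f1_bad = 2 * precision_bad * recall_bad / (precision_bad + recall_bad)

print(f"precision={precision_bad:.3f} recall={recall_bad:.3f} f1={f1_bad:.3f}")
```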

<a id='end'></a> 

## End

<ul>
    <li><a href='#top'>Back to top</a></li>
</ul>