# Project: Bank loan status prediction

Dataset features description ([download here](https://www.kaggle.com/datasets/zaurbegiev/my-dataset?select=credit_train.csv)):
- `Loan ID`: A unique identifier for each loan. This feature helps differentiate between various loans and is often used as a primary key in the data.
- `Customer ID`: A unique identifier for each customer. This feature helps track loans for each customer and can be used to connect with other customer information.
- `Current Loan Amount`: The current amount of the loan. This feature indicates the amount the customer is currently borrowing and can be used to analyze the customer’s current debt situation.
- `Term`: The term of the loan (`Short Term` or `Long Term` in this dataset). This feature indicates the period over which the customer must repay the loan and can affect their ability to repay.
- `Credit Score`: The customer’s credit score. This feature is often used to assess the customer’s creditworthiness; a higher score suggests a better ability to repay.
- `Annual Income`: The customer’s annual income. This feature helps assess the customer’s financial capacity; higher income can indicate a better ability to repay.
- `Years in current job`: The number of years the customer has been at their current job, stored as text (e.g., `8 years`, `10+ years`). This feature shows the customer’s job stability, which can positively impact their ability to repay.
- `Home Ownership`: The customer’s home ownership status (e.g., Rent, Own Home, Home Mortgage). This feature can reflect the customer’s financial stability and security.
- `Purpose`: The purpose of the loan (e.g., Debt Consolidation, Home Improvements, Business Loan, Other). This feature helps understand the reason for borrowing and can affect the loan's risk level.
- `Monthly Debt`: The customer’s monthly debt payment. This feature indicates the total amount the customer must pay each month for other debts, helping to evaluate their debt burden.
- `Years of Credit History`: The number of years of the customer’s credit history. This feature reflects the customer’s experience in managing credit; a longer history can indicate financial stability.
- `Months since last delinquent`: The number of months since the customer last had a late payment. This feature helps assess the recent delinquency status of the customer, which can affect credit decisions.
- `Number of Open Accounts`: The number of open credit accounts the customer currently has. This feature indicates the current level of credit usage by the customer.
- `Number of Credit Problems`: The number of credit issues the customer has faced (e.g., defaults, late payments). This feature helps assess the customer’s credit risk.
- `Current Credit Balance`: The current balance on the customer’s credit accounts. This feature indicates the total amount the customer still owes on their current credit accounts.
- `Maximum Open Credit`: The maximum amount of credit the customer has ever had open. This feature can indicate the customer’s ability to manage large amounts of credit in the past.
- `Bankruptcies`: The number of times the customer has declared bankruptcy. This feature is an important indicator of credit risk, as bankruptcies usually lead to difficulties in obtaining credit.
- `Tax Liens`: The number of tax liens against the customer. This feature is also a credit risk indicator, showing the customer has unpaid tax debts.
- `Loan Status`: The target variable, either `Fully Paid` or `Charged Off`.
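
Some of these features are easier to interpret in combination. As a minimal sketch (not part of the original pipeline), the snippet below derives a debt-to-income ratio from `Monthly Debt` and `Annual Income`. The helper name `add_debt_to_income` and the column name `Debt to Income` are hypothetical, the toy values are made up, and it assumes both columns are expressed in the same currency unit.

```python
import numpy as np
import pandas as pd


def add_debt_to_income(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a hypothetical 'Debt to Income' column."""
    out = df.copy()
    # Treat a zero income as missing so the ratio becomes NaN rather than inf
    income = out["Annual Income"].replace(0, np.nan)
    # Annualize the monthly debt payment and divide by the annual income
    out["Debt to Income"] = out["Monthly Debt"] * 12 / income
    return out


# Toy rows (made-up values, not taken from the dataset)
toy = pd.DataFrame(
    {
        "Monthly Debt": [1000.0, 500.0, 250.0],
        "Annual Income": [60000.0, np.nan, 0.0],
    }
)
toy = add_debt_to_income(toy)
print(toy)
```

Rows with a missing or zero income keep a missing ratio, so they can be handled by the same imputation step as the other numeric features.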

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import time

seed = 10

## 1. Import dataset

In [2]:
from sklearn.model_selection import train_test_split

df = pd.read_csv('./../data/dataset.csv')

print('Data shape:', df.shape)

# Display the first few rows of the dataset
print('Samples data:')
df.head()

Data shape: (100000, 19)
Samples data:


Unnamed: 0,Loan ID,Customer ID,Current Loan Amount,Term,Credit Score,Annual Income,Years in current job,Home Ownership,Purpose,Monthly Debt,Years of Credit History,Months since last delinquent,Number of Open Accounts,Number of Credit Problems,Current Credit Balance,Maximum Open Credit,Bankruptcies,Tax Liens,Loan Status
0,14dd8831-6af5-400b-83ec-68e61888a048,981165ec-3274-42f5-a3b4-d104041a9ca9,445412,Short Term,709.0,1167493.0,8 years,Home Mortgage,Home Improvements,5214.74,17.2,,6,1,228190,416746.0,1.0,0.0,Fully Paid
1,4771cc26-131a-45db-b5aa-537ea4ba5342,2de017a3-2e01-49cb-a581-08169e83be29,262328,Short Term,,,10+ years,Home Mortgage,Debt Consolidation,33295.98,21.1,8.0,35,0,229976,850784.0,0.0,0.0,Fully Paid
2,4eed4e6a-aa2f-4c91-8651-ce984ee8fb26,5efb2b2b-bf11-4dfd-a572-3761a2694725,99999999,Short Term,741.0,2231892.0,8 years,Own Home,Debt Consolidation,29200.53,14.9,29.0,18,1,297996,750090.0,0.0,0.0,Fully Paid
3,77598f7b-32e7-4e3b-a6e5-06ba0d98fe8a,e777faab-98ae-45af-9a86-7ce5b33b1011,347666,Long Term,721.0,806949.0,3 years,Own Home,Debt Consolidation,8741.9,12.0,,9,0,256329,386958.0,0.0,0.0,Fully Paid
4,d4062e70-befa-4995-8643-a0de73938182,81536ad9-5ccf-4eb8-befb-47a4d608658e,176220,Short Term,,,5 years,Rent,Debt Consolidation,20639.7,6.1,,15,0,253460,427174.0,0.0,0.0,Fully Paid


In [3]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=seed, stratify=df['Loan Status'])
train_df.reset_index(drop=True, inplace=True)
test_df.reset_index(drop=True, inplace=True)

print(f'Train dataset size: {len(train_df)}')
print(f'Test dataset size: {len(test_df)}')

Train dataset size: 80000
Test dataset size: 20000
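
The split above passes `stratify=df['Loan Status']`, which keeps the class proportions of the target the same in both subsets. The sketch below illustrates this on a made-up, imbalanced toy frame; the `toy` data and variable names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced binary target (made-up data)
toy = pd.DataFrame(
    {
        "x": range(100),
        "Loan Status": ["Fully Paid"] * 80 + ["Charged Off"] * 20,
    }
)

toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=10, stratify=toy["Loan Status"]
)

# Class proportions in each split
train_ratio = toy_train["Loan Status"].value_counts(normalize=True)
test_ratio = toy_test["Loan Status"].value_counts(normalize=True)
print(train_ratio)
print(test_ratio)
```

Running the same `value_counts(normalize=True)` check on `train_df` and `test_df` is a quick way to confirm that the real split is balanced in the same way.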


## 2. Exploratory Data Analysis (EDA)

In [4]:
# List all features in the DataFrame
original_features = train_df.columns[train_df.columns != 'Loan Status'].tolist()

# Print the features
print("Features in the DataFrame:")
for feature in original_features:
    print(feature)

Features in the DataFrame:
Loan ID
Customer ID
Current Loan Amount
Term
Credit Score
Annual Income
Years in current job
Home Ownership
Purpose
Monthly Debt
Years of Credit History
Months since last delinquent
Number of Open Accounts
Number of Credit Problems
Current Credit Balance
Maximum Open Credit
Bankruptcies
Tax Liens


In [5]:
def drop_features(df, features):
    # Drop the listed columns, silently skipping any that are not present
    return df.drop(columns=features, errors='ignore')

# Drop identifier columns, which are not useful as predictors
train_df = drop_features(train_df, ['Loan ID', 'Customer ID'])
test_df = drop_features(test_df, ['Loan ID', 'Customer ID'])
train_df.head()

Unnamed: 0,Current Loan Amount,Term,Credit Score,Annual Income,Years in current job,Home Ownership,Purpose,Monthly Debt,Years of Credit History,Months since last delinquent,Number of Open Accounts,Number of Credit Problems,Current Credit Balance,Maximum Open Credit,Bankruptcies,Tax Liens,Loan Status
0,674256,Long Term,,,7 years,Rent,Other,22710.13,10.1,,7,0,520239,667832.0,0.0,0.0,Charged Off
1,99999999,Short Term,734.0,1045171.0,4 years,Home Mortgage,Debt Consolidation,20380.73,17.0,43.0,8,0,65189,237380.0,0.0,0.0,Fully Paid
2,583616,Short Term,731.0,1024841.0,9 years,Rent,Business Loan,9052.55,17.6,7.0,10,0,13509,186208.0,0.0,0.0,Charged Off
3,134068,Short Term,,,10+ years,Rent,Debt Consolidation,8427.07,13.8,,7,0,118674,234476.0,0.0,0.0,Fully Paid
4,154638,Short Term,698.0,745560.0,10+ years,Home Mortgage,Debt Consolidation,14849.07,15.6,39.0,19,0,156142,318274.0,0.0,0.0,Charged Off


In [6]:
categorical_features = ['Term', 'Years in current job', 'Home Ownership', 'Purpose', 'Loan Status']

for feature in categorical_features:
    train_df[feature] = train_df[feature].astype('category')
    test_df[feature] = test_df[feature].astype('category')

In [7]:
print('Categorical features summary:')
train_df.describe(include='category')

Categorical features summary:


| | Term | Years in current job | Home Ownership | Purpose | Loan Status |
|---|---|---|---|---|---|
| count | 80000 | 76608 | 80000 | 80000 | 80000 |
| unique | 2 | 11 | 4 | 15 | 2 |
| top | Short Term | 10+ years | Home Mortgage | Debt Consolidation | Fully Paid |
| freq | 57762 | 24969 | 38768 | 62856 | 61889 |


In [8]:
# Get summary statistics of the dataset
print('Numerical features summary:')
train_df.describe()

Numerical features summary:


| | Current Loan Amount | Credit Score | Annual Income | Monthly Debt | Years of Credit History | Months since last delinquent | Number of Open Accounts | Number of Credit Problems | Current Credit Balance | Maximum Open Credit | Bankruptcies | Tax Liens |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 80000.0 | 64600.0 | 64600.0 | 80000.0 | 80000.0 | 37539.0 | 80000.0 | 80000.0 | 80000.0 | 79999.0 | 79839.0 | 79991.0 |
| mean | 11764210.0 | 1074.426873 | 1376833.0 | 18472.860872 | 18.216044 | 34.905325 | 11.133588 | 0.16715 | 294595.1 | 757268.5 | 0.116484 | 0.029766 |
| std | 31788100.0 | 1471.517917 | 922334.8 | 12200.854672 | 7.023719 | 21.964263 | 5.026861 | 0.485246 | 374335.1 | 8448273.0 | 0.350582 | 0.263695 |
| min | 10802.0 | 585.0 | 76627.0 | 0.0 | 3.6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 179696.0 | 705.0 | 848844.0 | 10222.8075 | 13.5 | 16.0 | 8.0 | 0.0 | 112822.0 | 273603.0 | 0.0 | 0.0 |
| 50% | 312950.0 | 724.0 | 1175482.0 | 16236.26 | 16.9 | 32.0 | 10.0 | 0.0 | 210197.0 | 468930.0 | 0.0 | 0.0 |
| 75% | 525481.0 | 741.0 | 1650592.0 | 24005.5025 | 21.7 | 51.0 | 14.0 | 0.0 | 368870.8 | 783112.0 | 0.0 | 0.0 |
| max | 100000000.0 | 7510.0 | 36475440.0 | 435843.28 | 65.0 | 176.0 | 76.0 | 15.0 | 32878970.0 | 1539738000.0 | 7.0 | 15.0 |
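
The `count` row above differs between columns (for example 64600 of 80000 rows for `Credit Score` and `Annual Income`, and 37539 for `Months since last delinquent`), which points to missing values. A minimal sketch of how the gaps could be summarized per column follows; `missing_summary` is a hypothetical helper, and the toy frame reuses a few values from the sample rows shown earlier.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and share of missing values per column, largest share first."""
    summary = pd.DataFrame(
        {
            "missing": df.isna().sum(),
            "share": df.isna().mean(),
        }
    )
    # Keep only the columns that actually contain missing values
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("share", ascending=False)


toy = pd.DataFrame(
    {
        "Credit Score": [709.0, np.nan, 741.0, 721.0],
        "Months since last delinquent": [np.nan, 8.0, 29.0, np.nan],
        "Monthly Debt": [5214.74, 33295.98, 29200.53, 8741.9],
    }
)
report = missing_summary(toy)
print(report)
```

Calling `missing_summary(train_df)` would list the columns that the imputers in the modeling step have to fill.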


## 3. Modeling

In [9]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder, LabelEncoder

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier

from tqdm import tqdm

In [79]:
def get_card_split(df, cols, n=11):
    """
    Splits categorical columns into 2 lists based on cardinality (i.e. number of unique values).

    Parameters
    ----------
    df : Pandas DataFrame
        DataFrame from which the cardinality of the columns is calculated.
    cols : pandas Index
        Categorical columns to split.
    n : int, optional (default=11)
        Cardinality threshold used to split the columns.

    Returns
    -------
    card_low : pandas Index
        Columns with cardinality <= n
    card_high : pandas Index
        Columns with cardinality > n
    """
    cond = df[cols].nunique() > n
    card_high = cols[cond]
    card_low = cols[~cond]
    return card_low, card_high


def fit(X_train, X_test, y_train, y_test, classifiers):
    # Initialize a LabelEncoder
    le = LabelEncoder()

    # Fit and transform the labels in the train set
    y_train = le.fit_transform(y_train)

    # Transform the labels in the test set
    y_test = le.transform(y_test)


    numeric_transformer = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="mean")),
            ("scaler", StandardScaler())
        ]
    )

    categorical_transformer_low = Pipeline(
        steps=[
            # fill_value is only used with strategy="constant", so it is omitted here
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("encoding", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
        ]
    )

    categorical_transformer_high = Pipeline(
        steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            # Map categories unseen during fitting to -1 instead of raising an error
            ("encoding", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
        ]
    )

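    # Convert NumPy input to DataFrames first, because the column selection below needs DataFrame methods
    if isinstance(X_train, np.ndarray):
        X_train = pd.DataFrame(X_train)
        X_test = pd.DataFrame(X_test)

    numeric_features = X_train.select_dtypes(include=[np.number]).columns
    # The categorical columns were cast to 'category' earlier, so select that dtype as well as 'object';
    # otherwise they match nothing and are dropped by the ColumnTransformer
    categorical_features = X_train.select_dtypes(include=["object", "category"]).columns

    categorical_low, categorical_high = get_card_split(
        X_train, categorical_features
    )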

    preprocessor = ColumnTransformer(
        transformers=[
            ("numeric", numeric_transformer, numeric_features),
            ("categorical_low", categorical_transformer_low, categorical_low),
            ("categorical_high", categorical_transformer_high, categorical_high),
        ]
    )

    models = {}
    results = {}

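    for name, model, param_grid in tqdm(classifiers):
        # Reset the metrics for every model so that a failed run is reported as NaN
        # instead of silently reusing the previous model's scores
        accuracy = f1 = roc_auc = np.nan
        start = time.time()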
        try:
            pipe = Pipeline(
                steps=[
                    ("preprocessor", preprocessor),
                    ("classifier", model)
                ]
            )

            search = RandomizedSearchCV(
                pipe,
                param_distributions=param_grid,
                n_iter=10,  # Number of parameter settings that are sampled
                scoring='f1_weighted',
                cv=5,  # 5-fold cross-validation
                random_state=seed,
                n_jobs=-1  # Use all available cores
            )

            search.fit(X_train, y_train)
            models[name] = search
            y_pred = search.predict(X_test)
            y_proba = search.predict_proba(X_test)[:, 1]
            accuracy = accuracy_score(y_test, y_pred, normalize=True)
            f1 = f1_score(y_test, y_pred, average="weighted")
            roc_auc = roc_auc_score(y_test, y_proba)

        except Exception as exception:
            print(name + " model failed to execute")
            print(exception)
        
        results[name] = {'Accuracy': round(accuracy, 2), 'F1 Score': round(f1, 2), 'ROC AUC': round(roc_auc, 2), 'Time': round(time.time() - start, 3)}
        
    return results, models
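
To make the cardinality split concrete, the sketch below applies the same rule as `get_card_split` (more than `n` unique values counts as high cardinality) to a made-up toy frame. The function is restated in compact form so the snippet runs on its own, and the `purpose_*` labels are hypothetical.

```python
import pandas as pd


def get_card_split(df, cols, n=11):
    # Same rule as the helper above: more than n unique values means "high"
    cond = (df[cols].nunique() > n).to_numpy()
    return cols[~cond], cols[cond]


toy = pd.DataFrame(
    {
        "Term": ["Short Term", "Long Term"] * 6,  # 2 unique values
        "Purpose": [f"purpose_{i}" for i in range(12)],  # 12 made-up labels
    }
)
cat_cols = toy.columns  # a pandas Index, like select_dtypes(...).columns
low, high = get_card_split(toy, cat_cols)
print("one-hot encoded:", list(low))
print("ordinal encoded:", list(high))
```

With the default `n=11`, a column with exactly 11 unique values, such as `Years in current job` in the summary above, still lands in the low-cardinality group and is one-hot encoded.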


In [None]:
rf_param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [None, 10, 20, 30],
    'classifier__min_samples_split': [2, 5, 10],
    'classifier__min_samples_leaf': [1, 2, 4]
}

xgb_param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [None, 3, 5],
    'classifier__learning_rate': [None, 0.01, 0.05],
    'classifier__subsample': [0.8, 1],
    'classifier__colsample_bytree': [None, 0.8, 1]
}

lgbm_param_grid = {
    'classifier__n_estimators': [200, 300],
    'classifier__learning_rate': [0.05, 0.2, 0.25],
    'classifier__max_depth': [-1, 2, 3],
    'classifier__num_leaves': [31, 62, 124],
    # Note: LightGBM only applies subsample when subsample_freq > 0 (the default is 0)
    'classifier__subsample': [0.5, 0.7, 1],
    'classifier__colsample_bytree': [0.5, 0.6, 0.7],
}

cat_param_grid = {
    'classifier__iterations': [200, 300],
    'classifier__learning_rate': [0.01, 0.03, 0.1],
    'classifier__depth': [4, 6, 10],
    'classifier__l2_leaf_reg': [1.0, 3.0, 5.0],
    'classifier__border_count': [64, 128],
}

classifiers = [
    # ('Random Forest', RandomForestClassifier(random_state=seed), rf_param_grid),
    # ('XGBoost', XGBClassifier(random_state=seed), xgb_param_grid),
    ('LightGBM', LGBMClassifier(random_state=seed, verbose=-1), lgbm_param_grid),
    ('CatBoost', CatBoostClassifier(random_state=seed, verbose=False), cat_param_grid)
]

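# Separate the features from the target for both splits
X_train, y_train = train_df.drop('Loan Status', axis=1), train_df['Loan Status']
X_test, y_test = test_df.drop('Loan Status', axis=1), test_df['Loan Status']

results, models = fit(X_train, X_test, y_train, y_test, classifiers)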

results = pd.DataFrame(results).T

results

In [44]:
weights = {'F1 Score': 0.4, 'ROC AUC': 0.3, 'Time': 0.2, 'Accuracy': 0.1}

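# Invert 'Time' so that faster models score higher (assumes all times are positive).
# Work on a copy so that re-running this cell does not invert the column a second time.
scored = results.copy()
scored['Time'] = 1 / scored['Time']

results['Weighted Score'] = scored.apply(lambda row: sum(row[metric] * weight for metric, weight in weights.items()), axis=1)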

results = results.sort_values(by='Weighted Score', ascending=False)

best_model_name = results['Weighted Score'].idxmax()

print('Best Model:', best_model_name)
models[best_model_name].best_params_

Best Model: LightGBM


{'classifier__subsample': 0.7,
 'classifier__num_leaves': 31,
 'classifier__n_estimators': 300,
 'classifier__max_depth': -1,
 'classifier__learning_rate': 0.25,
 'classifier__colsample_bytree': 0.5}
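
The weighted score above combines metrics that lie in [0, 1] with `1 / Time`, whose magnitude depends on how many seconds each search took. As one possible alternative (a sketch, not the method used in this notebook), each column can be min-max scaled before weighting so that every metric contributes on the same scale. The `toy_results` values and the `weighted_score` helper are made up for illustration.

```python
import pandas as pd

weights = {"F1 Score": 0.4, "ROC AUC": 0.3, "Time": 0.2, "Accuracy": 0.1}

# Made-up results for two hypothetical models (not the notebook's numbers)
toy_results = pd.DataFrame(
    {
        "Accuracy": [0.80, 0.82],
        "F1 Score": [0.75, 0.78],
        "ROC AUC": [0.76, 0.77],
        "Time": [120.0, 600.0],  # seconds
    },
    index=["model_a", "model_b"],
)


def weighted_score(results: pd.DataFrame, weights: dict) -> pd.Series:
    scaled = results[list(weights)].astype(float)
    # Lower time is better, so negate it before scaling
    scaled["Time"] = -scaled["Time"]
    # Min-max scale every column to [0, 1]; constant columns become 0
    span = (scaled.max() - scaled.min()).replace(0, 1.0)
    scaled = (scaled - scaled.min()) / span
    # Weight each scaled column and sum across columns per model
    return (scaled * pd.Series(weights)).sum(axis=1)


scores = weighted_score(toy_results, weights)
print(scores.sort_values(ascending=False))
```

With only two models, min-max scaling reduces each metric to a win or a loss, so this variant is more informative once more classifiers are enabled in `classifiers`.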

In [52]:
import pickle

# Persist the tuned pipeline of the model selected above
best_model = models[best_model_name].best_estimator_
with open('best_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)
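
A saved pipeline can be loaded back and used for prediction without refitting. The sketch below shows the round trip with a small stand-in pipeline (a `LogisticRegression` on random data, an assumption made only so the snippet runs on its own) written to a temporary directory.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the tuned pipeline (a tiny made-up model, not the notebook's)
rng = np.random.default_rng(10)
X = rng.normal(size=(50, 3))
y = (X[:, 0] > 0).astype(int)
pipe = Pipeline([("scaler", StandardScaler()), ("classifier", LogisticRegression())])
pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "best_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(pipe, f)
    # Load the pipeline back, as a separate scoring script would
    with open(path, "rb") as f:
        loaded = pickle.load(f)

same_predictions = np.array_equal(pipe.predict(X), loaded.predict(X))
print("Predictions match after reload:", same_predictions)
```

Note that `fit` encodes the target with a `LabelEncoder` that is not stored in the pipeline, so the reloaded model returns integer labels. Because `LabelEncoder` sorts classes alphabetically, 0 corresponds to `Charged Off` and 1 to `Fully Paid`.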