# PREDICTING THE PROBABILITY OF DEFAULTING ON A LOAN

# I. OVERVIEW

## 1. DATA
The dataset includes various types of information such as demographic data, loan application history, payment records, and credit bureau data.

**application_train.csv/application_test.csv**: Contains the main training and test data, including demographic and financial information for each applicant.

**bureau.csv**: Contains data about the applicant's previous credits from other financial institutions.

**bureau_balance.csv**: Monthly balance of previous credits from the credit bureau.

**previous_application.csv**: Previous applications for loans at Home Credit.

**POS_CASH_balance.csv**: Monthly balance snapshots of previous point-of-sale and cash loans.

**credit_card_balance.csv**: Monthly balance snapshots of previous credit cards.

**installments_payments.csv**: Payments made on previous loans.

# II. DATA OVERVIEW AND ANALYSIS

## 1. DATA OVERVIEW

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Base directory of the competition data on Google Drive
DATA_DIR = '/content/drive/MyDrive/AI/home-credit-default-risk/data'

# Load main datasets
train = pd.read_csv(f'{DATA_DIR}/application_train.csv')
test = pd.read_csv(f'{DATA_DIR}/application_test.csv')

# Load related datasets
bureau = pd.read_csv(f'{DATA_DIR}/bureau.csv')
bureau_balance = pd.read_csv(f'{DATA_DIR}/bureau_balance.csv')
previous_application = pd.read_csv(f'{DATA_DIR}/previous_application.csv')
pos_cash_balance = pd.read_csv(f'{DATA_DIR}/POS_CASH_balance.csv')
installments_payments = pd.read_csv(f'{DATA_DIR}/installments_payments.csv')
credit_card_balance = pd.read_csv(f'{DATA_DIR}/credit_card_balance.csv')

# Shapes of all eight tables, in load order
(train.shape, test.shape, bureau.shape, bureau_balance.shape,
 previous_application.shape, pos_cash_balance.shape,
 installments_payments.shape, credit_card_balance.shape)

((307511, 122),
 (48744, 121),
 (1716428, 17),
 (27299925, 3),
 (1670214, 37),
 (10001358, 8),
 (13605401, 8),
 (3840312, 23))

## 2. MERGING DATASETS & FEATURE ENGINEERING
In this notebook, we take a single-model approach and use LightGBM (Light Gradient Boosting Machine) as the only base model.

To fit the data to the base model, we will aggregate and merge all the data available in the other datasets into the train and test datasets.
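
As a toy illustration of this aggregate-then-merge pattern (the frames and values below are made up, not taken from the competition data), a child table is reduced to one row per `SK_ID_CURR` and then left-joined onto the main table:

```python
import pandas as pd

# Hypothetical miniature stand-ins for application_train and bureau
main = pd.DataFrame({"SK_ID_CURR": [1, 2, 3], "TARGET": [0, 1, 0]})
child = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT_SUM": [100.0, 300.0, 50.0],
})

# One row per applicant; agg with a dict of lists yields MultiIndex columns
child_agg = child.groupby("SK_ID_CURR").agg({"AMT_CREDIT_SUM": ["mean", "max"]}).reset_index()

# Flatten ('AMT_CREDIT_SUM', 'mean') -> 'AMT_CREDIT_SUM_mean', keep 'SK_ID_CURR' as-is
child_agg.columns = ["_".join(col) if col[1] else col[0] for col in child_agg.columns]

# Left join keeps every applicant; those without child rows get NaN
merged = main.merge(child_agg, on="SK_ID_CURR", how="left")
print(merged)
```

Applicant 3 has no rows in the child table, so its aggregated features come out as NaN. The same effect produces many of the missing values examined later in this notebook.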

### 2.1 AGGREGATION FUNCTIONS

In [3]:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

def encode_categorical(df):
    le = LabelEncoder()
    for col in df.columns:
        if df[col].dtype == 'object':
            df[col] = le.fit_transform(df[col].astype(str))
    return df

def aggregate_bureau(bureau, bureau_balance):
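    # Encode STATUS as an ordinal score: 'C' (closed) and '0' (no days past due)
    # map to 0, 'X' (status unknown) to -1, '1'-'5' to increasing days-past-due buckets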
    status_mapping = {'C': 0, 'X': -1, '0': 0, '1': 1, '2': 2, '3': 3, '4': 4, '5': 5}
    bureau_balance['STATUS'] = bureau_balance['STATUS'].map(status_mapping)

    # Aggregate bureau_balance
    bureau_balance_agg = bureau_balance.groupby('SK_ID_BUREAU').agg({
        'MONTHS_BALANCE': ['min', 'max', 'size'],
        'STATUS': ['mean']
    }).reset_index()
    # Flatten the MultiIndex columns, keeping 'SK_ID_BUREAU' without a trailing underscore
    bureau_balance_agg.columns = ['_'.join(col).strip() if col[1] else col[0] for col in bureau_balance_agg.columns.values]

    # Merge aggregated bureau_balance with bureau
    bureau = bureau.merge(bureau_balance_agg, how='left', on='SK_ID_BUREAU')

    # Aggregate bureau
    bureau_agg = bureau.groupby('SK_ID_CURR').agg({
        'DAYS_CREDIT': ['mean', 'max', 'min'],
        'CREDIT_DAY_OVERDUE': ['mean', 'max'],
        'AMT_CREDIT_MAX_OVERDUE': ['mean', 'max'],
        'CNT_CREDIT_PROLONG': ['sum'],
        'AMT_CREDIT_SUM': ['mean', 'sum', 'max'],
        'AMT_CREDIT_SUM_DEBT': ['mean', 'sum'],
        'AMT_CREDIT_SUM_LIMIT': ['mean', 'sum'],
        'AMT_CREDIT_SUM_OVERDUE': ['mean', 'sum'],
        'DAYS_CREDIT_UPDATE': ['mean', 'max'],
        'AMT_ANNUITY': ['mean', 'sum'],
        'MONTHS_BALANCE_min': ['mean'],
        'MONTHS_BALANCE_max': ['mean'],
        'MONTHS_BALANCE_size': ['mean'],
        'STATUS_mean': ['mean']
    }).reset_index()

    # Flatten column names
    bureau_agg.columns = ['_'.join(col).strip() if col[1] else col[0] for col in bureau_agg.columns.values]
    return bureau_agg

def aggregate_prev_app(prev_app):
    prev_app = encode_categorical(prev_app)
    # Only use the most significant fields to avoid noise and overfitting
    prev_app_agg = prev_app.groupby('SK_ID_CURR').agg({
        'AMT_ANNUITY': ['mean', 'max', 'sum'],
        'AMT_APPLICATION': ['mean', 'max', 'sum'],
        'AMT_CREDIT': ['mean', 'max', 'sum'],
        'AMT_DOWN_PAYMENT': ['mean', 'max', 'sum'],
        'AMT_GOODS_PRICE': ['mean', 'max', 'sum'],
        'HOUR_APPR_PROCESS_START': ['mean', 'max', 'min'],
        'RATE_DOWN_PAYMENT': ['mean', 'max', 'min'],
        'DAYS_DECISION': ['mean', 'max', 'min'],
        'CNT_PAYMENT': ['mean', 'max', 'sum']
    }).reset_index()

    prev_app_agg.columns = ['_'.join(col).strip() if col[1] else col[0] for col in prev_app_agg.columns.values]
    return prev_app_agg

def aggregate_pos_cash(pos_cash):
    pos_cash = encode_categorical(pos_cash)
    # Only use the most significant fields to avoid noise and overfitting
    pos_cash_agg = pos_cash.groupby('SK_ID_CURR').agg({
        'MONTHS_BALANCE': ['mean', 'max', 'min', 'size'],
        'SK_DPD': ['mean', 'max', 'sum'],
        'SK_DPD_DEF': ['mean', 'max', 'sum']
    }).reset_index()

    pos_cash_agg.columns = ['_'.join(col).strip() if col[1] else col[0] for col in pos_cash_agg.columns.values]
    return pos_cash_agg

def aggregate_installments(installments):
    installments = encode_categorical(installments)
    # Only use the most significant fields to avoid noise and overfitting
    installments_agg = installments.groupby('SK_ID_CURR').agg({
        'NUM_INSTALMENT_VERSION': ['nunique'],
        'NUM_INSTALMENT_NUMBER': ['mean', 'max', 'sum'],
        'DAYS_INSTALMENT': ['mean', 'max', 'min'],
        'DAYS_ENTRY_PAYMENT': ['mean', 'max', 'min'],
        'AMT_INSTALMENT': ['mean', 'max', 'sum'],
        'AMT_PAYMENT': ['mean', 'max', 'sum']
    }).reset_index()

    installments_agg.columns = ['_'.join(col).strip() if col[1] else col[0] for col in installments_agg.columns.values]
    return installments_agg

def aggregate_credit_card(credit_card):
    credit_card = encode_categorical(credit_card)
    # Only use the most significant fields to avoid noise and overfitting
    credit_card_agg = credit_card.groupby('SK_ID_CURR').agg({
        'MONTHS_BALANCE': ['mean', 'max', 'min', 'size'],
        'AMT_BALANCE': ['mean', 'max', 'sum'],
        'AMT_CREDIT_LIMIT_ACTUAL': ['mean', 'max', 'sum'],
        'AMT_DRAWINGS_ATM_CURRENT': ['mean', 'max', 'sum'],
        'AMT_DRAWINGS_CURRENT': ['mean', 'max', 'sum'],
        'AMT_DRAWINGS_OTHER_CURRENT': ['mean', 'max', 'sum'],
        'AMT_DRAWINGS_POS_CURRENT': ['mean', 'max', 'sum'],
        'AMT_INST_MIN_REGULARITY': ['mean', 'max', 'sum'],
        'AMT_PAYMENT_TOTAL_CURRENT': ['mean', 'max', 'sum'],
        'AMT_RECEIVABLE_PRINCIPAL': ['mean', 'max', 'sum'],
        'AMT_RECIVABLE': ['mean', 'max', 'sum'],
        'AMT_TOTAL_RECEIVABLE': ['mean', 'max', 'sum'],
        'CNT_DRAWINGS_ATM_CURRENT': ['mean', 'max', 'sum'],
        'CNT_DRAWINGS_CURRENT': ['mean', 'max', 'sum'],
        'CNT_DRAWINGS_OTHER_CURRENT': ['mean', 'max', 'sum'],
        'CNT_DRAWINGS_POS_CURRENT': ['mean', 'max', 'sum'],
        'CNT_INSTALMENT_MATURE_CUM': ['mean', 'max', 'sum'],
        'SK_DPD': ['mean', 'max', 'sum'],
        'SK_DPD_DEF': ['mean', 'max', 'sum']
    }).reset_index()

    credit_card_agg.columns = ['_'.join(col).strip() if col[1] else col[0] for col in credit_card_agg.columns.values]
    return credit_card_agg

### 2.2 MERGING ALL OTHER DATASETS INTO TRAIN AND TEST

In [4]:
# Aggregate and merge additional datasets
bureau_agg = aggregate_bureau(bureau, bureau_balance)
prev_app_agg = aggregate_prev_app(previous_application)
pos_cash_agg = aggregate_pos_cash(pos_cash_balance)
installments_agg = aggregate_installments(installments_payments)
credit_card_agg = aggregate_credit_card(credit_card_balance)

# Merge aggregated datasets into train and test
train = train.merge(bureau_agg, on='SK_ID_CURR', how='left')
test = test.merge(bureau_agg, on='SK_ID_CURR', how='left')

train = train.merge(prev_app_agg, on='SK_ID_CURR', how='left')
test = test.merge(prev_app_agg, on='SK_ID_CURR', how='left')

train = train.merge(pos_cash_agg, on='SK_ID_CURR', how='left')
test = test.merge(pos_cash_agg, on='SK_ID_CURR', how='left')

train = train.merge(installments_agg, on='SK_ID_CURR', how='left')
test = test.merge(installments_agg, on='SK_ID_CURR', how='left')

train = train.merge(credit_card_agg, on='SK_ID_CURR', how='left')
test = test.merge(credit_card_agg, on='SK_ID_CURR', how='left')

In [5]:
train.columns

Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY',
       ...
       'CNT_DRAWINGS_POS_CURRENT_sum', 'CNT_INSTALMENT_MATURE_CUM_mean',
       'CNT_INSTALMENT_MATURE_CUM_max', 'CNT_INSTALMENT_MATURE_CUM_sum',
       'SK_DPD_mean_y', 'SK_DPD_max_y', 'SK_DPD_sum_y', 'SK_DPD_DEF_mean_y',
       'SK_DPD_DEF_max_y', 'SK_DPD_DEF_sum_y'],
      dtype='object', length=258)
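
The `_y` suffixes in this listing appear because `pos_cash_agg` and `credit_card_agg` both produce columns such as `SK_DPD_mean`, and pandas disambiguates clashing names with `_x`/`_y` when merging. One way to keep the source of each feature readable is to prefix every aggregated column before merging. The helper below is a hypothetical sketch (`add_prefix_except` and the `POS_`/`CC_` prefixes are illustrative names, not part of the notebook):

```python
import pandas as pd

def add_prefix_except(df, prefix, key="SK_ID_CURR"):
    """Return a copy of df with every column except `key` prefixed."""
    return df.rename(columns={c: f"{prefix}{c}" for c in df.columns if c != key})

# Toy aggregated frames that share a column name
pos = pd.DataFrame({"SK_ID_CURR": [1, 2], "SK_DPD_mean": [0.0, 1.5]})
cc = pd.DataFrame({"SK_ID_CURR": [1, 2], "SK_DPD_mean": [2.0, 0.0]})

merged = (
    add_prefix_except(pos, "POS_")
    .merge(add_prefix_except(cc, "CC_"), on="SK_ID_CURR", how="left")
)
print(merged.columns.tolist())
```

With prefixes in place, no `_x`/`_y` suffixes are generated and each feature name states which table it came from.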

In [6]:
train.head()

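| | SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | ... | CNT_DRAWINGS_POS_CURRENT_sum | CNT_INSTALMENT_MATURE_CUM_mean | CNT_INSTALMENT_MATURE_CUM_max | CNT_INSTALMENT_MATURE_CUM_sum | SK_DPD_mean_y | SK_DPD_max_y | SK_DPD_sum_y | SK_DPD_DEF_mean_y | SK_DPD_DEF_max_y | SK_DPD_DEF_sum_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |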


## 3. MISSING DATA & IMPUTATION

### 3.1 MISSING DATA

In [7]:
# Calculate total missing values in the train dataset
total_missing = train.isnull().sum().sum()

# Calculate total entries in the dataset
# (np.prod replaces np.product, which was removed in NumPy 2.0)
total_entries = np.prod(train.shape)

# Calculate the percentage of missing data
missing_percentage = (total_missing / total_entries) * 100

print(f"Total missing data in train: {total_missing}")
print(f"Total entries in train: {total_entries}")
print(f"Percentage of missing data in train: {missing_percentage:.2f}%")

Total missing data in train: 25383166
Total entries in train: 79337838
Percentage of missing data in train: 31.99%


In [8]:
# Calculate total missing values in the test dataset
total_missing = test.isnull().sum().sum()

# Calculate total entries in the dataset
# (np.prod replaces np.product, which was removed in NumPy 2.0)
total_entries = np.prod(test.shape)

# Calculate the percentage of missing data
missing_percentage = (total_missing / total_entries) * 100

print(f"Total missing data in test: {total_missing}")
print(f"Total entries in test: {total_entries}")
print(f"Percentage of missing data in test: {missing_percentage:.2f}%")

Total missing data in test: 3590712
Total entries in test: 12527208
Percentage of missing data in test: 28.66%
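
The overall percentage does not show how the missing values are spread across columns. A per-column breakdown can be computed in the same spirit; the sketch below uses a made-up frame with hypothetical column names rather than the real `train` table:

```python
import numpy as np
import pandas as pd

# Toy frame: two applicants have no credit card history, so their card features are NaN
df = pd.DataFrame({
    "AMT_CREDIT": [100.0, 200.0, 300.0, 400.0],
    "CC_BALANCE_mean": [10.0, np.nan, np.nan, 5.0],
    "CC_BALANCE_max": [20.0, np.nan, np.nan, 9.0],
})

# Share of missing values per column, highest first
missing_share = df.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Overall share, equivalent to total_missing / total_entries
overall = df.isnull().to_numpy().mean()
print(f"overall: {overall:.2%}")
```

Applied to `train`, the same two lines would show which columns contribute most to the totals above.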


### 3.2 IMPUTATION OF MISSING DATA
With about 32% of the entries in train and 29% of the entries in test missing, we use simple strategies in this notebook to fill missing values:
* Missing numerical values are filled with the column mean
* Missing categorical values are filled with "missing"

We will experiment with K-Nearest Neighbors (KNN) imputation later.
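
For reference, KNN imputation is available in scikit-learn as `sklearn.impute.KNNImputer`. The sketch below runs it on a tiny made-up array; it is not the configuration used later, and `n_neighbors=2` is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up data with two missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is replaced by the mean of that feature over the
# n_neighbors nearest rows in which the feature is observed
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```

Because every imputed value requires distances to other rows, this is likely to be much slower than a mean fill on a table of this size, which is one reason to start with the simple strategy.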

In [9]:
def fill_missing_values(df):
    # Separate numerical and categorical columns
    num_cols = df.select_dtypes(include=[np.number]).columns
    cat_cols = df.select_dtypes(include=['object']).columns

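    # Fill missing values in numerical columns with the column mean.
    # Assign the result back: chained fillna(..., inplace=True) on df[col] is
    # deprecated and does not modify df under pandas copy-on-write.
    for col in num_cols:
        df[col] = df[col].fillna(df[col].mean())

    # Fill missing values in categorical columns with 'missing'
    for col in cat_cols:
        df[col] = df[col].fillna('missing')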

    return df

In [10]:
# Fill missing values in main datasets
train = fill_missing_values(train)
test = fill_missing_values(test)

# Encode
train = encode_categorical(train)
test = encode_categorical(test)
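
A caveat on the encoding step: `encode_categorical` fits a new `LabelEncoder` mapping for each DataFrame it receives, so if a column has different sets of values in `train` and `test`, the same string can end up with different integer codes in the two frames. One possible way to keep the codes aligned is to derive the categories from both frames together. This is a sketch of an alternative, not what the notebook ran; `encode_consistently` is a hypothetical helper:

```python
import pandas as pd

def encode_consistently(train_df, test_df):
    """Integer-encode non-numeric columns using categories seen in either frame."""
    train_df, test_df = train_df.copy(), test_df.copy()
    for col in train_df.columns:
        if pd.api.types.is_numeric_dtype(train_df[col]) or col not in test_df.columns:
            continue
        # Sorted union of values so both frames share one mapping
        categories = sorted(set(train_df[col].astype(str)) | set(test_df[col].astype(str)))
        train_df[col] = pd.Categorical(train_df[col].astype(str), categories=categories).codes
        test_df[col] = pd.Categorical(test_df[col].astype(str), categories=categories).codes
    return train_df, test_df

# Toy example: 'A' is absent from the test frame
tr = pd.DataFrame({"CODE": ["A", "B", "C"]})
te = pd.DataFrame({"CODE": ["C", "B"]})
tr_enc, te_enc = encode_consistently(tr, te)
print(tr_enc["CODE"].tolist(), te_enc["CODE"].tolist())
```

In this toy case `'C'` is encoded as 2 in both frames, whereas fitting separately on the test frame alone would assign it 1.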

In [11]:
# Check for missing values in train dataset
missing_values_train = train.isnull().sum()
missing_values_train = missing_values_train[missing_values_train > 0]  # Filter out columns with no missing values
print("Missing values in train dataset:")
print(missing_values_train)

# Check for missing values in test dataset
missing_values_test = test.isnull().sum()
missing_values_test = missing_values_test[missing_values_test > 0]  # Filter out columns with no missing values
print("Missing values in test dataset:")
print(missing_values_test)

Missing values in train dataset:
Series([], dtype: int64)
Missing values in test dataset:
Series([], dtype: int64)


# IV. MODELLING

In this notebook, we use a single model, LightGBM (Light Gradient Boosting Machine). It is highly efficient and handles large datasets with many features, such as the one in this problem.

The competition uses Area Under the ROC Curve as the metric.
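
As a reminder of what this metric measures: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function below is a brute-force illustration on made-up scores (`pairwise_auc` is a hypothetical name); the notebook itself uses `sklearn.metrics.roc_auc_score`.

```python
def pairwise_auc(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, hence 0.75. Only the ranking of the scores matters, not their absolute values.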

In [24]:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Prepare data for LightGBM
X = train.drop(['TARGET', 'SK_ID_CURR'], axis=1)
y = train['TARGET']
X_test = test.drop(['SK_ID_CURR'], axis=1)

# Split data for validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Create LightGBM dataset
dtrain = lgb.Dataset(X_train, label=y_train)
dvalid = lgb.Dataset(X_valid, label=y_valid, reference=dtrain)

# Set parameters
params = {
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'metric': 'auc',
    'learning_rate': 0.01,
    'num_leaves': 31,
    'max_bin': 255,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.9,
    'bagging_freq': 5,
    'early_stopping_round': 100,
    'verbose': -1
}

# Train the model
clf = lgb.train(
    params,
    dtrain,
    num_boost_round=10000,
    valid_sets=[dtrain, dvalid]
)

# Predict
y_pred = clf.predict(X_valid, num_iteration=clf.best_iteration)
print('Validation AUC score:', roc_auc_score(y_valid, y_pred))

Validation AUC score: 0.7812848211284044
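
Defaults are a small minority of the rows (the LightGBM training log in this notebook reports a positive rate of about 8%), so a plain random split can leave the training and validation parts with slightly different class ratios. An optional variation, not used for the score above, is to pass `stratify=y` to `train_test_split`. A sketch on made-up labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with 10% positives (made-up data)
y = np.array([1] * 10 + [0] * 90)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the positive rate the same in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_tr.mean(), y_va.mean())
```

Both parts keep the 10% positive rate of the toy data.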


# V. HYPERPARAMETER TUNING
An AUC score of about 0.781 is quite encouraging. Let's see if we can improve it further by tuning hyperparameters with RandomizedSearchCV.

## 1. HYPERPARAMETER TUNING

In [13]:
# This section can be run independently from section IV
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV

# Prepare data for LightGBM
X = train.drop(['TARGET', 'SK_ID_CURR'], axis=1)
y = train['TARGET']
X_test = test.drop(['SK_ID_CURR'], axis=1)

# Split data for validation
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the parameter grid
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.15],
    'num_leaves': [31, 50, 70, 90],
    'max_depth': [-1, 10, 20, 30],
    'min_data_in_leaf': [20, 50, 100],
    'feature_fraction': [0.8, 0.9, 1.0],
    'bagging_fraction': [0.8, 0.9, 1.0],
    'bagging_freq': [0, 5, 10, 15],
    'lambda_l1': [0, 0.1, 0.5, 1.0],
    'lambda_l2': [0, 0.1, 0.5, 1.0]
}

# Create LightGBM dataset
dtrain = lgb.Dataset(X_train, label=y_train)

# Initialize the LightGBM model
lgbm = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metric='auc')

# Perform RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=lgbm,
    param_distributions=param_grid,
    n_iter=10,
    scoring='roc_auc',
    cv=3,
    verbose=1,
    random_state=42,
    n_jobs=-1
)

# Fit the model
random_search.fit(X_train, y_train)

# Get the best parameters
best_params = random_search.best_params_
print("Best parameters found by RandomizedSearchCV:", best_params)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[LightGBM] [Info] Number of positive: 19876, number of negative: 226132
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.256402 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 41417
[LightGBM] [Info] Number of data points in the train set: 246008, number of used features: 251
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.080794 -> initscore=-2.431606
[LightGBM] [Info] Start training from score -2.431606
Best parameters found by RandomizedSearchCV: {'num_leaves': 70, 'min_data_in_leaf': 50, 'max_depth': 30, 'learning_rate': 0.1, 'lambda_l2': 1.0, 'lambda_l1': 0.1, 'feature_fraction': 0.9, 'bagging_freq': 5, 'bagging_fraction': 1.0}
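
To put `n_iter=10` in perspective, the grid above contains far more combinations than were sampled. A quick count using only the standard library (the dictionary is copied from `param_grid` in the cell above):

```python
import math

param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.15],
    'num_leaves': [31, 50, 70, 90],
    'max_depth': [-1, 10, 20, 30],
    'min_data_in_leaf': [20, 50, 100],
    'feature_fraction': [0.8, 0.9, 1.0],
    'bagging_fraction': [0.8, 0.9, 1.0],
    'bagging_freq': [0, 5, 10, 15],
    'lambda_l1': [0, 0.1, 0.5, 1.0],
    'lambda_l2': [0, 0.1, 0.5, 1.0],
}

# Number of distinct parameter combinations in the full grid
n_combinations = math.prod(len(values) for values in param_grid.values())
n_sampled = 10  # n_iter passed to RandomizedSearchCV

print(n_combinations)  # 110592
print(f"sampled fraction: {n_sampled / n_combinations:.4%}")
```

The random search therefore evaluated only a very small fraction of the 110,592 possible combinations, which may be one reason the tuned parameters do not necessarily beat the hand-picked baseline.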


In [15]:
best_params = random_search.best_params_
best_params

{'num_leaves': 70,
 'min_data_in_leaf': 50,
 'max_depth': 30,
 'learning_rate': 0.1,
 'lambda_l2': 1.0,
 'lambda_l1': 0.1,
 'feature_fraction': 0.9,
 'bagging_freq': 5,
 'bagging_fraction': 1.0}

In [22]:
best_params.update({
    'objective': 'binary',
    'metric': 'auc',
    'early_stopping_round': 100,
    'boosting_type': 'gbdt'
})
# Create LightGBM dataset
dtrain = lgb.Dataset(X_train, label=y_train)
dvalid = lgb.Dataset(X_valid, label=y_valid, reference=dtrain)

# Train the model
clf_tuned = lgb.train(
    best_params,
    dtrain,
    num_boost_round=10000,
    valid_sets=[dtrain, dvalid]
)

[LightGBM] [Info] Number of positive: 19876, number of negative: 226132
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.167707 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 41417
[LightGBM] [Info] Number of data points in the train set: 246008, number of used features: 251
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.080794 -> initscore=-2.431606
[LightGBM] [Info] Start training from score -2.431606
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[186]	training's auc: 0.909627	valid_1's auc: 0.777496


In [23]:
# Predict
y_pred = clf_tuned.predict(X_valid, num_iteration=clf_tuned.best_iteration)
print('Validation AUC score:', roc_auc_score(y_valid, y_pred))

Validation AUC score: 0.7774955749264917


After several experiments with different options, the AUC score does not improve: the tuned model reaches 0.7775 on the validation set versus 0.7813 for the baseline, and its training AUC of 0.9096 suggests overfitting. The conclusion we draw is that for a problem like this, with multiple large and interconnected datasets, feature aggregation and engineering matter more than hyperparameter tuning alone.

This is the strategy we will follow in the second notebook for this problem, where we process the raw data more carefully and use ensembles for better modelling.

## 2. PREDICTIONS ON THE TEST DATASET AND SUBMISSION FILE

In [27]:
# Predict on the validation set with the baseline model (clf),
# which scored higher on validation than clf_tuned
y_pred = clf.predict(X_valid, num_iteration=clf.best_iteration)
print('Validation AUC score:', roc_auc_score(y_valid, y_pred))

# Generate predictions for the test set
test_pred = clf.predict(X_test, num_iteration=clf.best_iteration)

# Create submission file
submission = pd.DataFrame({'SK_ID_CURR': test['SK_ID_CURR'], 'TARGET': test_pred})
submission.to_csv('/content/drive/MyDrive/AI/home-credit-default-risk/submission_1.csv', index=False)

Validation AUC score: 0.7812848211284044
