# Home Credit Default Risk - Machine Learning Project  

## Project Overview  
This project aims to predict **loan default risk** using historical credit data from the **Home Credit dataset**.  
By analyzing multiple financial datasets from past loan applications, we extract insights to improve risk assessment and minimize losses for lenders.  
While this model is trained specifically on Home Credit’s dataset, the process—data collection, preprocessing, feature engineering, and modeling—can be adapted to other financial institutions. 

## Live Application Deployment  
This project is also deployed as an **interactive Angular + Flask application**, allowing users to observe real-time model inference.  
🔗 **Try it here:** [Live Loan Default Predictor](https://ai.fullstackista.com/ai-loan-default-predictor/)  

### Key Steps in the Project  
1. **Understanding the Problem** – Define the objective: predict loan default risk using Home Credit data.  
2. **Data Processing & Feature Engineering** – Process multiple datasets, clean missing values, extract features, and aggregate information.  
3. **Exploratory Data Analysis (EDA)** – Identify trends, correlations, and risk factors in loan applications.  
4. **Merging Datasets** – Integrate primary (`application_train.csv`) and secondary datasets (e.g., `bureau.csv`, `credit_card_balance.csv`) for a unified view.  
5. **Model Training & Hyperparameter Tuning** – Train and optimize models (e.g., LightGBM) for predictive performance.  
6. **Model Evaluation** – Validate performance using metrics such as AUC-ROC.  
7. **Final Prediction** – Apply the trained model to `application_test.csv` and generate predictions.  

## About This Notebook  

This notebook prepares the **test dataset (`application_test.csv`)** for **predictions** using the trained model.  

### Key Steps:
- **Load & Merge Datasets** – Processed datasets are loaded and merged in the same way as during model training.  
- **Generate Predictions** – The trained LightGBM model is applied to the test data.  
- **Create Prediction File** – While real-time model deployment is often preferred in production (as demonstrated in my [Angular/Flask-based deployed app](https://ai.fullstackista.com/ai-loan-default-predictor/)), structured output files are still used in certain batch-processing scenarios, such as risk assessment and regulatory reporting in financial applications.

## Project Notebooks  

### Main Dataset and Model Training  
- [1. Application Train (Main Dataset)](./01_application_train.ipynb)
- [2. Model Training and Final Pipeline](./02_model_training_pipeline.ipynb)  

### Secondary Datasets Processing  
- [3. Bureau Data](./03_bureau_data.ipynb)  
- [4. Bureau Balance Data](./04_bureau_balance.ipynb)  
- [5. Credit Card Balance](./05_credit_card_balance.ipynb)  
- [6. Previous Applications](./06_previous_applications.ipynb)  
- [7. POS Cash Balance](./07_pos_cash_balance.ipynb)  
- [8. Installments Payments](./08_installments_payments.ipynb)  

### Final Prediction  
- [9. Model Predictions on Test Data](./09_model_predictions.ipynb) _(Current Notebook)_
- [10. Application Test Data Processing](./10_application_test_processing.ipynb)

# Model Predictions (All Datasets)  

## 1. Load All Processed Datasets  

### 1.1 Load Required Libraries  
Import necessary Python libraries for **data processing, loading the trained model, and making predictions.**  

In [1]:
import os
import pandas as pd
import json
import lightgbm as lgb

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)

### 1.2 Define Dataset Path  
Set the file path where the processed **test dataset** is stored.  

In [2]:
# Define dataset path
DATASET_PATH = "/kaggle/input/home-credit-processed-data-and-model/"

### 1.3 Load Processed Datasets  
Load all preprocessed datasets from the specified directory.  
These datasets include the main **test dataset** and various secondary datasets that provide additional financial and credit history details for loan applicants.  

In [3]:
# Load data from processed dataset
df_application_test_processed = pd.read_pickle(DATASET_PATH + "application_test_processed.pkl")
df_bureau_aggregated = pd.read_pickle(DATASET_PATH + "bureau_aggregated.pkl")
df_bureau_balance_aggregated_with_curr_final = pd.read_pickle(DATASET_PATH + "bureau_balance_aggregated_with_curr_final.pkl")
df_pos_cash_balance_aggregated = pd.read_pickle(DATASET_PATH + "pos_cash_balance_aggregated.pkl")
df_credit_card_balance_aggregated = pd.read_pickle(DATASET_PATH + "credit_card_balance_aggregated.pkl")
df_previous_application_aggregated = pd.read_pickle(DATASET_PATH + "previous_application_aggregated.pkl")
df_installments_payments_aggregated = pd.read_pickle(DATASET_PATH + "installments_payments_aggregated.pkl")

# Confirm successful loading
print("✅ All datasets loaded successfully from Kaggle dataset")

✅ All datasets loaded successfully from Kaggle dataset


### 1.4 Check Dataset Shapes  
Display the number of rows and columns in each dataset to ensure proper loading.  

In [4]:
# Quick check of shapes
print("Dataset Shapes:")
print(f"Application Test Processed: {df_application_test_processed.shape}")
print(f"Bureau Aggregated: {df_bureau_aggregated.shape}")
print(f"Bureau Balance Aggregated with Curr Final: {df_bureau_balance_aggregated_with_curr_final.shape}")
print(f"POS Cash Balance Aggregated: {df_pos_cash_balance_aggregated.shape}")
print(f"Credit Card Balance Aggregated: {df_credit_card_balance_aggregated.shape}")
print(f"Previous Application Aggregated: {df_previous_application_aggregated.shape}")
print(f"Installments Payments Aggregated: {df_installments_payments_aggregated.shape}")

Dataset Shapes:
Application Test Processed: (48744, 80)
Bureau Aggregated: (305811, 56)
Bureau Balance Aggregated with Curr Final: (134542, 63)
POS Cash Balance Aggregated: (337252, 36)
Credit Card Balance Aggregated: (103558, 99)
Previous Application Aggregated: (338857, 109)
Installments Payments Aggregated: (339587, 62)


### 1.5 Final Check: ID Columns in Merged Datasets
- This check ensures that **no unwanted ID-related columns** (e.g., system-generated IDs) accidentally sneak into the data passed to the model.
- **`SK_ID_CURR` is the primary key** used to merge datasets and should not be removed.
- **`DAYS_ID_PUBLISH` is an important time-related feature** and should not be excluded.
- If additional unexpected ID columns appear, they should be investigated and removed if necessary.

In [5]:
# Define datasets
datasets = {
    "Application Test": df_application_test_processed,
    "Bureau Aggregated": df_bureau_aggregated,
    "Bureau Balance Aggregated with Curr Final": df_bureau_balance_aggregated_with_curr_final,
    "POS Cash Balance Aggregated": df_pos_cash_balance_aggregated,
    "Credit Card Balance Aggregated": df_credit_card_balance_aggregated,
    "Previous Application Aggregated": df_previous_application_aggregated,
    "Installments Payments Aggregated": df_installments_payments_aggregated,
}

### 1.6 Display Summary of ID Columns in Merged Datasets  
This table lists the ID-related columns found in each dataset.  
It helps verify that only the correct primary key (`SK_ID_CURR`) is present and that no unnecessary ID columns remain.  

In [6]:
# Create a list to store results
sk_id_summary_list = []

for name, df in datasets.items():
    # Get all column names
    all_columns = df.columns.tolist()
    
    # Extract columns that contain "_id_" (case-insensitive search)
    id_cols = [col for col in all_columns if "_id_" in col.lower()]

    # Append results
    sk_id_summary_list.append({
        "Dataset": name,
        "ID Columns": ", ".join(id_cols) if id_cols else "None"
    })

# Convert to DataFrame
sk_id_summary = pd.DataFrame(sk_id_summary_list)

# Display results
print("Updated ID Column Summary:")
print(sk_id_summary.to_string(index=False))

Updated ID Column Summary:
                                  Dataset                  ID Columns
                         Application Test SK_ID_CURR, DAYS_ID_PUBLISH
                        Bureau Aggregated                  SK_ID_CURR
Bureau Balance Aggregated with Curr Final                  SK_ID_CURR
              POS Cash Balance Aggregated                  SK_ID_CURR
           Credit Card Balance Aggregated                  SK_ID_CURR
          Previous Application Aggregated                  SK_ID_CURR
         Installments Payments Aggregated                  SK_ID_CURR


### Observation  
- The **primary key `SK_ID_CURR`** is correctly present in all datasets.  
- The **`DAYS_ID_PUBLISH`** column appears in `application_test_processed`, but this is a time-related feature, not an unwanted ID column.  
- **No unexpected ID columns were found, so merging on `SK_ID_CURR` will not carry stray identifier columns into the model features.**  

## 2. Merging Datasets  
Combine the processed secondary datasets with the main **application test dataset**.  
- All datasets are merged on **`SK_ID_CURR`**, the unique identifier for loan applications.  
- **Left joins (`how='left'`)** ensure that all records from the test dataset are retained; applicants with no record in a secondary dataset get `NaN` in that dataset's columns, which LightGBM can handle natively.  
- This merged dataset will be used for **making final predictions**.  

In [7]:
df_final_test = df_application_test_processed.merge(df_bureau_aggregated, on='SK_ID_CURR', how='left') \
                                             .merge(df_bureau_balance_aggregated_with_curr_final, on='SK_ID_CURR', how='left') \
                                             .merge(df_pos_cash_balance_aggregated, on='SK_ID_CURR', how='left') \
                                             .merge(df_credit_card_balance_aggregated, on='SK_ID_CURR', how='left') \
                                             .merge(df_previous_application_aggregated, on='SK_ID_CURR', how='left') \
                                             .merge(df_installments_payments_aggregated, on='SK_ID_CURR', how='left')

print(f"✅ Final df_final_test shape: {df_final_test.shape}")

✅ Final df_final_test shape: (48744, 499)
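
The left-join behaviour described above can be checked on a toy example. The sketch below is illustrative only: the two small DataFrames and the column names `AMT_CREDIT` and `BUREAU_LOAN_COUNT` are stand-ins, not the notebook's actual data. It assumes each aggregated table holds one row per `SK_ID_CURR`, which `validate="one_to_one"` enforces by raising an error if the key is duplicated on either side.

```python
import pandas as pd

# Toy stand-ins for the test table and one aggregated secondary table (hypothetical values).
df_main = pd.DataFrame({
    "SK_ID_CURR": [100001, 100005, 100013],
    "AMT_CREDIT": [568800.0, 222768.0, 663264.0],
})
df_secondary = pd.DataFrame({
    "SK_ID_CURR": [100001, 100013],
    "BUREAU_LOAN_COUNT": [7, 4],
})

# validate="one_to_one" raises pandas.errors.MergeError if SK_ID_CURR repeats on either side.
# indicator=True adds a "_merge" column showing where each row found a match.
merged = df_main.merge(
    df_secondary,
    on="SK_ID_CURR",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# A left join must keep exactly the rows of the left table.
rows_preserved = len(merged) == len(df_main)

# Share of applicants that have a record in the secondary table.
coverage = (merged["_merge"] == "both").mean()

# Applicants without a match keep their row and get NaN in the secondary columns.
missing_ids = merged.loc[merged["_merge"] == "left_only", "SK_ID_CURR"].tolist()

print(rows_preserved, round(coverage, 3), missing_ids)
```

If an aggregated table unexpectedly contained several rows per applicant, a plain left join would silently duplicate test rows, so a row-count comparison like `rows_preserved` is a cheap safeguard.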


## 3. Check Train and Test Dataset Feature Alignment  

### 3.1 Load Feature Metadata  
Load the stored **feature names, data types, and category mappings** from the training phase.  
This ensures that the test dataset is processed consistently with the training dataset.

In [8]:
# Path to the feature metadata saved during training
json_path = DATASET_PATH + "df_final_features.json"

# Load feature names, dtypes & category mappings from JSON
with open(json_path, "r") as f:
    train_metadata = json.load(f)


train_features = train_metadata["features"]
train_dtypes = {k: pd.api.types.pandas_dtype(v) for k, v in train_metadata["dtypes"].items()}
category_mappings = train_metadata.get("category_mappings", {})  # Load category mappings

print("✅ Loaded df_final features, dtypes & category mappings successfully!")

✅ Loaded df_final features, dtypes & category mappings successfully!
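
For orientation, the sketch below shows one way a metadata file with the keys read above (`features`, `dtypes`, `category_mappings`) could be produced and restored. It is an assumption about the file's layout based only on how it is loaded here, not the actual code from the training notebook; the toy DataFrame and its columns are hypothetical.

```python
import json

import pandas as pd

# Toy training frame with one categorical, one nullable-integer and one float column.
df_train_toy = pd.DataFrame({
    "CODE_GENDER": pd.Categorical(["F", "M", "F"], categories=["F", "M"]),
    "CNT_CHILDREN": pd.array([0, 2, None], dtype="Int64"),
    "AMT_INCOME_TOTAL": [135000.0, 99000.0, 202500.0],
})

# Store dtypes as strings so they survive JSON serialization.
metadata = {
    "features": df_train_toy.columns.tolist(),
    "dtypes": {col: str(dtype) for col, dtype in df_train_toy.dtypes.items()},
    "category_mappings": {
        col: df_train_toy[col].cat.categories.tolist()
        for col in df_train_toy.select_dtypes(include="category").columns
    },
}

# Round trip through JSON (a string here; the notebook reads from a file instead).
serialized = json.dumps(metadata)
restored = json.loads(serialized)

# Same conversion as in the cell above: dtype strings back to pandas dtype objects.
restored_dtypes = {
    k: pd.api.types.pandas_dtype(v) for k, v in restored["dtypes"].items()
}

print(restored["dtypes"])
print(restored["category_mappings"])
```

Note that the string `"category"` alone does not carry the category values, which is why a separate `category_mappings` entry is needed to rebuild identical categorical columns at prediction time.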


### 3.2 Compare Feature Names and Order  
Check if the test dataset has the same feature names and order as the training dataset.  

In [9]:
# Compare feature names and order
test_features = df_final_test.columns.tolist()

if train_features == test_features:
    print("✅ Feature names and order match!")
else:
    print("⚠️ Feature names/order mismatch!")
    print("Missing in Test:", set(train_features) - set(test_features))
    print("Extra in Test:", set(test_features) - set(train_features))

⚠️ Feature names/order mismatch!
Missing in Test: set()
Extra in Test: {'SK_ID_CURR'}


### Observation  
- The test dataset contains **one extra feature (`SK_ID_CURR`)** compared to the final feature list from training.  
- This happens because `SK_ID_CURR` was **dropped in the training phase before saving the final feature list** in JSON.  
- The `TARGET` column is not reported as missing because the saved feature list contains only model inputs, and the test dataset never had a `TARGET` column.  
- This difference is expected and does not affect predictions, as `SK_ID_CURR` is only used for merging and identification.  
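
Because the identifier is expected to be the only difference, the comparison can also be run with ID columns set aside, so that a warning only appears for real mismatches. The helper below is a hypothetical sketch (the function name and the toy feature lists are not part of the notebook):

```python
def compare_features(train_features, test_columns, id_columns=("SK_ID_CURR",)):
    """Compare feature lists while ignoring identifier columns in the test data."""
    test_features = [col for col in test_columns if col not in id_columns]
    return {
        "missing_in_test": sorted(set(train_features) - set(test_features)),
        "extra_in_test": sorted(set(test_features) - set(train_features)),
        "same_order": list(train_features) == test_features,
    }


# Toy lists standing in for train_features and df_final_test.columns.
report = compare_features(
    train_features=["AMT_CREDIT", "DAYS_BIRTH"],
    test_columns=["SK_ID_CURR", "AMT_CREDIT", "DAYS_BIRTH"],
)
print(report)
```

In the notebook, the same call would take `train_features` and `df_final_test.columns.tolist()` as arguments.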

### 3.3 Align Data Types  
Ensure that all columns in the test dataset have the **same data types** as in the training dataset.  
- **Categorical columns** are cast to match the training set categories.  
- **Integer vs. Float columns** are adjusted where necessary.  
- This step ensures consistency in model input formatting.  

In [10]:
# Explicitly recast each column using the training metadata
for col in train_features:
    if col in df_final_test.columns:
        desired_dtype = train_dtypes[col]
        
        # Handle categorical columns
        if str(desired_dtype) == 'category':
            cats = category_mappings.get(col, None)
            if cats is not None:
                # Force the column to have exactly these categories (assuming unordered; if ordered, add ordered=True)
                cat_dtype = pd.CategoricalDtype(categories=cats, ordered=False)
                df_final_test[col] = df_final_test[col].astype(cat_dtype)
            else:
                df_final_test[col] = df_final_test[col].astype('category')
                
        # Handle special case: training column is nullable integer (Int64) but test column is float64
        elif desired_dtype.name == "Int64" and df_final_test[col].dtype == "float64":
            # Convert from float64 to pandas' nullable integer type
            df_final_test[col] = df_final_test[col].astype("Int64")
            
        else:
            # For non-categorical columns, cast directly
            df_final_test[col] = df_final_test[col].astype(desired_dtype)
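
One side effect of casting to a fixed `CategoricalDtype` is worth keeping in mind: values that were not among the training categories become missing. The sketch below demonstrates this on a toy Series; the category values are illustrative and not taken from the notebook's metadata.

```python
import pandas as pd

# Categories as they might have been stored at training time (hypothetical values).
train_categories = ["Cash loans", "Revolving loans"]
cat_dtype = pd.CategoricalDtype(categories=train_categories, ordered=False)

# Test data containing one value the training data never saw.
test_values = pd.Series(["Cash loans", "Revolving loans", "Unseen type"])
casted = test_values.astype(cat_dtype)

# Values outside the training categories turn into NaN (category code -1).
unseen_mask = casted.isna() & test_values.notna()
unseen_values = test_values[unseen_mask].unique().tolist()

print(casted.cat.codes.tolist())
print(unseen_values)
```

Counting such values per column before casting is a simple way to notice category drift between training and test data.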


### 3.4 Verify Data Type Consistency
This check ensures that:
- All columns in the test dataset match the data types from training.
- Categorical columns retain their expected category values.
- Integer and float columns are correctly aligned.

In [11]:
# Compare dtypes; for categorical columns, also compare the category lists
dtype_mismatches = {}

for col in train_features:
    dt_train = train_dtypes[col]
    dt_test = df_final_test[col].dtype

    # If both training and test dtypes are categorical, compare their category lists
    if str(dt_train) == 'category' and str(dt_test) == 'category':
        cats_train = category_mappings.get(col, [])
        cats_test = list(df_final_test[col].cat.categories)
        if cats_train != cats_test:
            dtype_mismatches[col] = (f"categories={cats_train}", f"categories={cats_test}")
    else:
        if dt_train != dt_test:
            dtype_mismatches[col] = (dt_train, dt_test)

if not dtype_mismatches:
    print("✅ All data types match!")
else:
    print("⚠️ Data Type Mismatches:")
    for col, (train_dtype, test_dtype) in dtype_mismatches.items():
        print(f"❌ {col}: Train = {train_dtype}, Test = {test_dtype}")

✅ All data types match!


## 4. Prediction

### 4.1 Load Trained Model
Load the trained LightGBM model to generate predictions on the test dataset.

In [12]:
# 1. Load the model using LightGBM's native method
model_path = "/kaggle/input/home-credit-processed-data-and-model/lightgbm_model.txt"
model = lgb.Booster(model_file=model_path)

print("✅ LightGBM model loaded successfully!")

✅ LightGBM model loaded successfully!
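
As an optional extra check, the feature names stored in the loaded model can be compared against the metadata list before predicting. The helper below is a hypothetical sketch that works on plain lists so it can run without a model file; in the notebook it would receive `model.feature_name()` (a LightGBM `Booster` method that returns the feature names) and `train_features`.

```python
def check_model_features(model_feature_names, train_features):
    """Return a list of problems; an empty list means both lists agree."""
    model_feature_names = list(model_feature_names)
    train_features = list(train_features)
    problems = []

    missing = sorted(set(model_feature_names) - set(train_features))
    extra = sorted(set(train_features) - set(model_feature_names))
    if missing:
        problems.append(f"Expected by model but not in metadata: {missing}")
    if extra:
        problems.append(f"In metadata but unknown to model: {extra}")
    if not missing and not extra and model_feature_names != train_features:
        problems.append("Same features, different order")
    return problems


# In the notebook this would be called as (not executed here):
#     check_model_features(model.feature_name(), train_features)
toy_model_features = ["AMT_CREDIT", "AMT_ANNUITY", "DAYS_BIRTH"]
ok = check_model_features(toy_model_features, ["AMT_CREDIT", "AMT_ANNUITY", "DAYS_BIRTH"])
reordered = check_model_features(toy_model_features, ["AMT_ANNUITY", "AMT_CREDIT", "DAYS_BIRTH"])
print(ok, reordered)
```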


### 4.2 Generate Predictions and Save Output File
- Apply the trained LightGBM model to the processed test dataset.
- Save the final predictions as a structured CSV file for analysis or further processing.

In [13]:
# 2. Prepare your test dataset using the same features as in training
X_test = df_final_test[train_features]

# 3. Run predictions on the test dataset
#    For LightGBM, model.predict returns probabilities for binary classification.
predictions = model.predict(X_test)

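# 4. Create a prediction DataFrame and save to CSV.
#    Keep full-precision probabilities: rounding (e.g. to one decimal)
#    creates ties between applicants and can lower AUC-ROC.
prediction_output = pd.DataFrame({
    "SK_ID_CURR": df_final_test["SK_ID_CURR"],
    "TARGET": predictions
})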

prediction_output.to_csv("predictions.csv", index=False)

print("✅ Predictions complete. Prediction file saved as 'predictions.csv'.")

✅ Predictions complete. Prediction file saved as 'predictions.csv'.
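
Before handing a prediction file to a downstream batch process, a few sanity checks can catch common problems. The function below is a minimal sketch under the assumption that the file has exactly the two columns written above (`SK_ID_CURR`, `TARGET`); the function name and the toy data are hypothetical.

```python
import pandas as pd


def validate_predictions(prediction_output, expected_ids):
    """Return a list of problems found in a prediction table (empty if none)."""
    problems = []
    ids = prediction_output["SK_ID_CURR"]
    scores = prediction_output["TARGET"]

    if ids.duplicated().any():
        problems.append("duplicate SK_ID_CURR values")
    if set(ids) != set(expected_ids):
        problems.append("IDs differ from the test dataset")
    if scores.isna().any():
        problems.append("missing predictions")
    # Probabilities from a binary classifier should lie in [0, 1].
    if not scores.dropna().between(0, 1).all():
        problems.append("predictions outside [0, 1]")
    return problems


# Toy tables standing in for prediction_output.
good = pd.DataFrame({"SK_ID_CURR": [100001, 100005], "TARGET": [0.031, 0.127]})
bad = pd.DataFrame({"SK_ID_CURR": [100001, 100001], "TARGET": [0.5, 1.2]})

good_report = validate_predictions(good, [100001, 100005])
bad_report = validate_predictions(bad, [100001, 100005])
print(good_report)
print(bad_report)
```

In the notebook, `expected_ids` would be `df_application_test_processed["SK_ID_CURR"]`.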


## ✅ Final Summary
- The trained LightGBM model was successfully applied to the test dataset.
- Predictions were generated and saved as **`predictions.csv`**.
- While real-time inference is preferred in production (as demonstrated in my [Angular/Flask-based deployed app](https://ai.fullstackista.com/ai-loan-default-predictor/)), structured output files like this are useful for batch processing and reporting.