# Home Credit Default Risk - Machine Learning Project  

## Project Overview  
This project aims to predict **loan default risk** using historical credit data provided by the **Home Credit dataset**.  
By analyzing multiple financial datasets from past loan applications, we extract insights to improve risk assessment and minimize losses for lenders.  
While this model is trained specifically on Home Credit’s dataset, the process—data collection, preprocessing, feature engineering, and modeling—can be adapted to other financial institutions.  

## Live Application Deployment  
This project is also deployed as an **interactive Angular + Flask application**, allowing users to observe real-time model inference.  
🔗 **Try it here:** [Live Loan Default Predictor](https://ai.fullstackista.com/ai-loan-default-predictor/)  

### Key Steps in the Project  
1. **Understanding the Problem** – Define the objective: predict loan default risk using Home Credit data.  
2. **Data Processing & Feature Engineering** – Process multiple datasets, clean missing values, extract features, and aggregate information.  
3. **Exploratory Data Analysis (EDA)** – Identify trends, correlations, and risk factors in loan applications.  
4. **Merging Datasets** – Integrate primary (`application_train.csv`) and secondary datasets (e.g., `bureau.csv`, `credit_card_balance.csv`) for a unified view.  
5. **Model Training & Hyperparameter Tuning** – Train and optimize models (e.g., LightGBM) for predictive performance.  
6. **Model Evaluation** – Validate performance using metrics such as AUC-ROC.  
7. **Final Prediction** – Apply the trained model to `application_test.csv` and generate predictions.  

## About This Notebook  
This notebook processes the `application_test.csv` dataset, which contains loan application data for predictions.  
Unlike `application_train.csv`, this dataset does not include the target variable (`TARGET`).  
The processed features will be used for generating predictions with the trained model.

## Project Notebooks  

### Main Dataset and Model Training  
- [1. Application Train (Main Dataset)](./01_application_train.ipynb)
- [2. Model Training and Final Pipeline](./02_model_training_pipeline.ipynb)  

### Secondary Datasets Processing  
- [3. Bureau Data](./03_bureau_data.ipynb)  
- [4. Bureau Balance Data](./04_bureau_balance.ipynb)  
- [5. Credit Card Balance](./05_credit_card_balance.ipynb)  
- [6. Previous Applications](./06_previous_applications.ipynb)  
- [7. POS Cash Balance](./07_pos_cash_balance.ipynb)  
- [8. Installments Payments](./08_installments_payments.ipynb)  

### Final Prediction  
- [9. Model Predictions on Test Data](./09_model_predictions.ipynb)  
- [10. Application Test Data Processing](./10_application_test_processing.ipynb) _(Current Notebook)_

# Processing `application_test.csv` (Main Loan Application Data for Predictions)

## 1. Load Data (`application_test.csv`)
We start by loading the dataset and inspecting its structure to understand its key features and statistics.

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import gdown
import os

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)

In [2]:
# Load dataset 
df_application_test = pd.read_csv("/kaggle/input/home-credit-default-risk/application_test.csv")

## 2. Initial Data Inspection (`application_test.csv`)
Before processing, we inspect the dataset for potential issues such as **infinite values**, **missing values**, and other inconsistencies.

### 2.1 Checking for Infinite Values  
Infinite values (e.g., `inf`, `-inf`) can break numerical calculations and should be identified before proceeding.  
The code below scans for any **positive or negative infinite values** in the dataset.  

In [3]:
# Check for infinite values
print("Checking for infinite values in dataset...")
inf_count = (df_application_test == np.inf).sum().sum()
neg_inf_count = (df_application_test == -np.inf).sum().sum()

if inf_count > 0 or neg_inf_count > 0:
    print(f"⚠️ Found {inf_count} positive and {neg_inf_count} negative infinite values!")
else:
    print("✅ No infinite values detected.")

Checking for infinite values in dataset...
✅ No infinite values detected.
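
The equality comparison above runs over every column, including text columns. As a minimal sketch (the helper name `count_infinite` and the demo frame are illustrative, not part of the notebook), the same check can be limited to numeric columns with `np.isinf` and reported per column:

```python
import numpy as np
import pandas as pd


def count_infinite(df: pd.DataFrame) -> pd.Series:
    """Return per-column counts of +inf/-inf values, numeric columns only."""
    # Cast to float64 so np.isinf works uniformly on integer and float columns
    numeric = df.select_dtypes(include="number").astype("float64")
    counts = np.isinf(numeric).sum()
    # Keep only the columns that actually contain infinite values
    return counts[counts > 0]


# Hypothetical demo data: column "a" holds one +inf and one -inf
demo = pd.DataFrame({
    "a": [1.0, np.inf, -np.inf],
    "b": [1, 2, 3],
    "c": ["x", "y", "z"],
})
inf_counts = count_infinite(demo)
print(inf_counts)
```

A per-column result makes it easier to see where infinite values come from if any are ever found.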


### 2.2 Checking for Missing Values (NaNs)  
Missing values (**NaNs**) can affect model performance and should be handled properly.  
Here, we count the number of missing values in each column and print the results.

In [4]:
# Check for missing values
missing_values = df_application_test.isnull().sum()
missing_values = missing_values[missing_values > 0]  

if not missing_values.empty:
    print("⚠️ Missing values detected in columns:")
    print(missing_values)
else:
    print("✅ No missing values detected.")

⚠️ Missing values detected in columns:
AMT_ANNUITY                      24
NAME_TYPE_SUITE                 911
OWN_CAR_AGE                   32312
OCCUPATION_TYPE               15605
EXT_SOURCE_1                  20532
                              ...  
AMT_REQ_CREDIT_BUREAU_DAY      6049
AMT_REQ_CREDIT_BUREAU_WEEK     6049
AMT_REQ_CREDIT_BUREAU_MON      6049
AMT_REQ_CREDIT_BUREAU_QRT      6049
AMT_REQ_CREDIT_BUREAU_YEAR     6049
Length: 64, dtype: int64


### 2.3 Print Dataset Columns  
To get an overview of the dataset structure, we print every column name together with its data type.  
This helps us understand the available features and spot columns stored with an unexpected type.

In [5]:
# Display all columns and their data types
pd.set_option('display.max_rows', None) 
print("✅ Data Types:")
print(df_application_test.dtypes)
pd.reset_option('display.max_rows')  

✅ Data Types:
SK_ID_CURR                        int64
NAME_CONTRACT_TYPE               object
CODE_GENDER                      object
FLAG_OWN_CAR                     object
FLAG_OWN_REALTY                  object
CNT_CHILDREN                      int64
AMT_INCOME_TOTAL                float64
AMT_CREDIT                      float64
AMT_ANNUITY                     float64
AMT_GOODS_PRICE                 float64
NAME_TYPE_SUITE                  object
NAME_INCOME_TYPE                 object
NAME_EDUCATION_TYPE              object
NAME_FAMILY_STATUS               object
NAME_HOUSING_TYPE                object
REGION_POPULATION_RELATIVE      float64
DAYS_BIRTH                        int64
DAYS_EMPLOYED                     int64
DAYS_REGISTRATION               float64
DAYS_ID_PUBLISH                   int64
OWN_CAR_AGE                     float64
FLAG_MOBIL                        int64
FLAG_EMP_PHONE                    int64
FLAG_WORK_PHONE                   int64
FLAG_CONT_MOBILE          

### 2.4 Detecting Extreme Values (Outliers)  

Extreme values (**outliers**) can skew model performance and lead to **unstable predictions**.  
We flag columns using **percentile thresholds**. A column is reported when:  

- Its **maximum lies above its own 99th percentile** → Very large values.  
- Its **minimum lies below its own 1st percentile** → Very small values.  

In [6]:
# Check for extreme values using percentile-based thresholds
print("Checking for extreme values in dataset using percentile thresholds...")

# Exclude the ID column and DAYS_ID_PUBLISH from percentile-based detection
exclude_cols = ['SK_ID_CURR', 'DAYS_ID_PUBLISH']  
numeric_columns = df_application_test.select_dtypes(include=["number"]).drop(columns=exclude_cols, errors='ignore')

# Compute percentile-based thresholds
upper_threshold = numeric_columns.quantile(0.99)
lower_threshold = numeric_columns.quantile(0.01)

# Identify extreme values
extreme_columns = numeric_columns.max() > upper_threshold
small_columns = numeric_columns.min() < lower_threshold

# Print results
if extreme_columns.any():
    print(f"⚠️ Columns with very large values (above 99th percentile):\n{numeric_columns.loc[:, extreme_columns].max()}")
if small_columns.any():
    print(f"⚠️ Columns with very small values (below 1st percentile):\n{numeric_columns.loc[:, small_columns].min()}")
if not extreme_columns.any() and not small_columns.any():
    print("✅ No extreme values detected.")


Checking for extreme values in dataset using percentile thresholds...
⚠️ Columns with very large values (above 99th percentile):
CNT_CHILDREN                       20.0
AMT_INCOME_TOTAL              4410000.0
AMT_CREDIT                    2245500.0
AMT_ANNUITY                    180576.0
AMT_GOODS_PRICE               2245500.0
                                ...    
AMT_REQ_CREDIT_BUREAU_DAY           2.0
AMT_REQ_CREDIT_BUREAU_WEEK          2.0
AMT_REQ_CREDIT_BUREAU_MON           6.0
AMT_REQ_CREDIT_BUREAU_QRT           7.0
AMT_REQ_CREDIT_BUREAU_YEAR         17.0
Length: 71, dtype: float64
⚠️ Columns with very small values (below 1st percentile):
AMT_INCOME_TOTAL                26941.500000
AMT_CREDIT                      45000.000000
AMT_ANNUITY                      2295.000000
AMT_GOODS_PRICE                 45000.000000
REGION_POPULATION_RELATIVE          0.000253
DAYS_BIRTH                     -25195.000000
DAYS_EMPLOYED                  -17463.000000
DAYS_REGISTRATION              
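
Because the check above compares each column's maximum with that column's own 99th percentile, it flags most non-constant columns (71 in the output above) without saying how many rows are affected. A possible complement, sketched here with a hypothetical helper `count_outside_percentiles` and made-up demo data, counts the rows that fall outside the two thresholds:

```python
import pandas as pd


def count_outside_percentiles(df: pd.DataFrame, lower: float = 0.01,
                              upper: float = 0.99) -> pd.DataFrame:
    """Count, per numeric column, the rows below/above the given percentiles."""
    numeric = df.select_dtypes(include="number")
    low_bound = numeric.quantile(lower)
    high_bound = numeric.quantile(upper)
    # Comparisons align the threshold Series with the column labels;
    # NaN values compare as False and are therefore never counted
    below = numeric.lt(low_bound, axis="columns").sum()
    above = numeric.gt(high_bound, axis="columns").sum()
    return pd.DataFrame({"below": below, "above": above})


# Hypothetical demo: one extreme income value, one constant flag column
demo = pd.DataFrame({
    "income": list(range(1, 100)) + [10_000],
    "flag": [0] * 100,
})
outlier_counts = count_outside_percentiles(demo)
print(outlier_counts)
```

Row counts like these help decide whether a column needs clipping or transformation, or whether the flagged values are just the normal tail of the distribution.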

### 2.5 Checking Dataset Shape  

The dataset's shape provides a quick view of its size, showing the number of **rows** (loan records) and **columns** (features).  

In [7]:
# Check the shape of the dataset
print("DataFrame Shape:", df_application_test.shape)

DataFrame Shape: (48744, 121)


### 2.6 Viewing Sample Data (`head()`)  

To understand the dataset, we display the **first few rows**.  
This helps verify that data is loaded correctly and gives an initial sense of typical feature values.  

In [8]:
print("First few rows of the DataFrame:")
display(df_application_test.head())

First few rows of the DataFrame:


| | SK_ID_CURR | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100001 | Cash loans | F | N | Y | 0 | 135000.0 | 568800.0 | 20560.5 | 450000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 100005 | Cash loans | M | N | Y | 0 | 99000.0 | 222768.0 | 17370.0 | 180000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 2 | 100013 | Cash loans | M | Y | Y | 0 | 202500.0 | 663264.0 | 69777.0 | 630000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 |
| 3 | 100028 | Cash loans | F | N | Y | 2 | 315000.0 | 1575000.0 | 49018.5 | 1575000.0 | ... | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 |
| 4 | 100038 | Cash loans | M | Y | N | 1 | 180000.0 | 625500.0 | 32067.0 | 625500.0 | ... | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |


### 2.7 Dataset Summary (`info()`)  

The `info()` function provides:  
- **Column names and types** (e.g., integer, float, categorical).  
- **Non-null counts** (to check for missing data).  
- **Memory usage**, which is useful for optimizing performance.  

In [9]:
# Get a concise summary of the DataFrame
print("DataFrame Info:")
df_application_test.info()

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48744 entries, 0 to 48743
Columns: 121 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(40), object(16)
memory usage: 45.0+ MB
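
The `+` in `45.0+ MB` means pandas did not measure the contents of `object` columns, so the real footprint may be larger. The sketch below (demo data and variable names are illustrative only) shows how `memory_usage(deep=True)` gives a fuller figure and how converting a repetitive text column to `category` can shrink it:

```python
import pandas as pd

# Hypothetical demo frame with a repetitive text column
demo = pd.DataFrame({
    "contract": ["Cash loans", "Revolving loans"] * 500,
    "amount": [1.0, 2.0] * 500,
})

# Shallow estimate: may undercount string data held in object columns
shallow_bytes = demo.memory_usage().sum()

# Deep estimate: inspects the actual string contents
deep_bytes = demo.memory_usage(deep=True).sum()

# Same data with the text column stored as a category
category_bytes = demo.astype({"contract": "category"}).memory_usage(deep=True).sum()

print(f"shallow: {shallow_bytes} B, deep: {deep_bytes} B, as category: {category_bytes} B")
```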


### 2.8 Descriptive Statistics  

We generate summary statistics for:  
- **Numeric columns** (`describe()`) → Mean, standard deviation, min/max, and quartiles.  
- **Categorical columns** (`describe(include='object')`) → Count of unique values, most frequent categories.  
This helps in **understanding distributions** and identifying possible anomalies.  

In [10]:
# Get summary statistics of numeric columns
print("Descriptive Statistics:")
display(df_application_test.describe())
display(df_application_test.describe(include="object"))

Descriptive Statistics:


| | SK_ID_CURR | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | ... | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48744.0 | 48744.0 | 48744.0 | 48744.0 | 48720.0 | 48744.0 | 48744.0 | 48744.0 | 48744.0 | 48744.0 | ... | 48744.0 | 48744.0 | 48744.0 | 48744.0 | 42695.0 | 42695.0 | 42695.0 | 42695.0 | 42695.0 | 42695.0 |
| mean | 277796.67635 | 0.397054 | 178431.8 | 516740.4 | 29426.240209 | 462618.8 | 0.021226 | -16068.084605 | 67485.366322 | -4967.652716 | ... | 0.001559 | 0.0 | 0.0 | 0.0 | 0.002108 | 0.001803 | 0.002787 | 0.009299 | 0.546902 | 1.983769 |
| std | 103169.547296 | 0.709047 | 101522.6 | 365397.0 | 16016.368315 | 336710.2 | 0.014428 | 4325.900393 | 144348.507136 | 3552.612035 | ... | 0.039456 | 0.0 | 0.0 | 0.0 | 0.046373 | 0.046132 | 0.054037 | 0.110924 | 0.693305 | 1.838873 |
| min | 100001.0 | 0.0 | 26941.5 | 45000.0 | 2295.0 | 45000.0 | 0.000253 | -25195.0 | -17463.0 | -23722.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 188557.75 | 0.0 | 112500.0 | 260640.0 | 17973.0 | 225000.0 | 0.010006 | -19637.0 | -2910.0 | -7459.25 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 277549.0 | 0.0 | 157500.0 | 450000.0 | 26199.0 | 396000.0 | 0.01885 | -15785.0 | -1293.0 | -4490.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| 75% | 367555.5 | 1.0 | 225000.0 | 675000.0 | 37390.5 | 630000.0 | 0.028663 | -12496.0 | -296.0 | -1901.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 |
| max | 456250.0 | 20.0 | 4410000.0 | 2245500.0 | 180576.0 | 2245500.0 | 0.072508 | -7338.0 | 365243.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2.0 | 2.0 | 6.0 | 7.0 | 17.0 |


| | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | OCCUPATION_TYPE | WEEKDAY_APPR_PROCESS_START | ORGANIZATION_TYPE | FONDKAPREMONT_MODE | HOUSETYPE_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 48744 | 48744 | 48744 | 48744 | 47833 | 48744 | 48744 | 48744 | 48744 | 33139 | 48744 | 48744 | 15947 | 25125 | 24851 | 26535 |
| unique | 2 | 2 | 2 | 2 | 7 | 7 | 5 | 5 | 6 | 18 | 7 | 58 | 4 | 3 | 7 | 2 |
| top | Cash loans | F | N | Y | Unaccompanied | Working | Secondary / secondary special | Married | House / apartment | Laborers | TUESDAY | Business Entity Type 3 | reg oper account | block of flats | Panel | No |
| freq | 48305 | 32678 | 32311 | 33658 | 39727 | 24533 | 33988 | 32283 | 43645 | 8655 | 9751 | 10840 | 12124 | 24659 | 11269 | 26179 |


## 3. Initial Data Cleaning (`application_test.csv`)
After identifying potential issues, we clean the dataset by handling **infinite values, categorical features, and potential misclassified columns**.

### 3.1 Replacing Infinite Values  
Since infinite values (`inf`, `-inf`) can interfere with model training, we replace them with `NaN` to handle them properly later.  

In [11]:
# Replace infinite values with NaN before handling missing values
df_application_test.replace([np.inf, -np.inf], np.nan, inplace=True)
print("✅ Infinite values replaced with NaN.")

✅ Infinite values replaced with NaN.


### 3.2 Converting Categorical Features  
Some columns are stored as `object` or numeric types but should be categorical.  
We treat every `object` column, plus every non-`object` column with fewer than 20 unique values, as a categorical candidate and convert it to the **category dtype** for efficiency and proper encoding.  

In [12]:
# Step 1: Identify categorical columns
categorical_candidates = df_application_test.select_dtypes(include=['object']).columns.tolist()

# Step 2: Identify numeric columns with low unique values (possible categorical)
low_unique_counts = df_application_test.nunique()
numeric_categoricals = low_unique_counts[
    (low_unique_counts < 20) & (df_application_test.dtypes != 'object')
].index.tolist()

# Step 3: Combine all categorical columns
final_categorical_columns = categorical_candidates + numeric_categoricals

# Step 4: Convert detected columns to 'category' dtype
if final_categorical_columns:
    df_application_test[final_categorical_columns] = df_application_test[final_categorical_columns].astype("category")
    print(f"✅ Converted categorical columns: {final_categorical_columns}")
else:
    print("✅ No categorical columns detected in this dataset. No dtype conversion needed.")

# Step 5: Manual Verification
print("\nChecking final dtype distribution:")
print(df_application_test.dtypes.value_counts())

✅ Converted categorical columns: ['NAME_CONTRACT_TYPE', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'NAME_TYPE_SUITE', 'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'OCCUPATION_TYPE', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE', 'CNT_CHILDREN', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT', 'REGION_RATING_CLIENT_W_CITY', 'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'DEF_30_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE', 'FLAG_DOCUMENT_2', 'FLAG_DOCUMENT_3', 'FLAG_DOCUMENT_4', 'FLAG_DOCUMENT_5', 'FLAG_DOCUMENT_6', 'FLAG_DOCUMENT_7', 'FLAG_DOCUMENT_8', 'FLAG_DOCUMENT_9', 'FLAG_DOCUMENT_10', 'FLAG_DOCUMENT_11', 'FLAG_DOCUMENT_12', 'FLAG_D

### 3.3 Final Dtype Adjustments  
After the initial programmatic conversion, we manually adjust specific columns to ensure correct data types.  

In [13]:
# Adjust column data types after initial programmatic conversion (allowing NaNs in integers)
convert_to_int = [
    "CNT_CHILDREN", "CNT_FAM_MEMBERS", "REGION_RATING_CLIENT", "REGION_RATING_CLIENT_W_CITY"
]

convert_to_float = [
    "AMT_REQ_CREDIT_BUREAU_HOUR", "AMT_REQ_CREDIT_BUREAU_DAY",
    "AMT_REQ_CREDIT_BUREAU_WEEK", "AMT_REQ_CREDIT_BUREAU_QRT"
]

# Use nullable integer type to avoid NaN conversion issues
df_application_test[convert_to_int] = df_application_test[convert_to_int].astype(pd.Int64Dtype())
df_application_test[convert_to_float] = df_application_test[convert_to_float].astype("float64")

print("✅ Final dtype adjustments applied (without disrupting NaN handling)!")

✅ Final dtype adjustments applied (without disrupting NaN handling)!


### 3.4 Checking Decimal Values in Float Columns  

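Some columns are stored as floats but should contain only **integer values** (e.g., counts of transactions or installments).  
To verify correctness, we check how many rows in each float column contain non-integer (decimal) values.  
Note that `NaN % 1` is `NaN`, which is not equal to 0, so missing values are counted as well. For example, the `OWN_CAR_AGE` count of 32312 equals its number of missing values.  
This helps detect potential **data type mismatches** or **unexpected floating-point precision issues**.  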

In [14]:
# Checking if float columns contain decimal values
print("\nChecking if float columns contain decimal values:")

# Identify all float columns
float_cols = df_application_test.select_dtypes(include=['float']).columns

# Count the number of rows in each float column that contain decimal values
decimal_counts = df_application_test[float_cols].map(lambda x: x % 1 != 0).sum()

# Print results
print(decimal_counts)


Checking if float columns contain decimal values:
AMT_INCOME_TOTAL                   87
AMT_CREDIT                       5380
AMT_ANNUITY                     24054
AMT_GOODS_PRICE                   148
REGION_POPULATION_RELATIVE      48744
DAYS_REGISTRATION                   0
OWN_CAR_AGE                     32312
EXT_SOURCE_1                    48744
EXT_SOURCE_2                    48744
EXT_SOURCE_3                    48744
APARTMENTS_AVG                  48588
BASEMENTAREA_AVG                46482
YEARS_BEGINEXPLUATATION_AVG     48676
YEARS_BUILD_AVG                 48718
COMMONAREA_AVG                  47351
ELEVATORS_AVG                   35253
ENTRANCES_AVG                   48677
FLOORSMAX_AVG                   48293
FLOORSMIN_AVG                   48328
LANDAREA_AVG                    46068
LIVINGAPARTMENTS_AVG            48676
LIVINGAREA_AVG                  48669
NONLIVINGAPARTMENTS_AVG         39915
NONLIVINGAREA_AVG               39249
APARTMENTS_MODE                 48561
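
Since the check above also counts `NaN` entries, a variant that ignores missing values can be useful. The following is a minimal sketch on made-up data; `count_fractional` is a hypothetical helper name, not part of the notebook:

```python
import numpy as np
import pandas as pd


def count_fractional(df: pd.DataFrame) -> pd.Series:
    """Count, per float column, the non-missing values with a fractional part."""
    floats = df.select_dtypes(include="float")
    # NaN % 1 is NaN, and NaN != 0 is True, so mask out missing values explicitly
    has_fraction = floats.notna() & (floats % 1 != 0)
    return has_fraction.sum()


# Hypothetical demo: "age" is integer-like with a gap, "score" has one real decimal
demo = pd.DataFrame({
    "age": [3.0, np.nan, 7.0],
    "score": [0.5, np.nan, 2.0],
})
fraction_counts = count_fractional(demo)
print(fraction_counts)
```

With this variant, a count of zero means every non-missing value in the column is a whole number.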

### 3.5 Identifying Integer-Like Float Columns  
Some columns are stored as floats but only contain whole numbers.  
To optimize memory usage and ensure data consistency, we identify float columns where all non-missing values are actually integers.  
These columns are candidates for conversion to the nullable `Int64` type, which preserves missing values.

In [15]:
# Identify float columns that contain only integer values
int_like_columns = [
    col for col in float_cols 
    if df_application_test[col].dropna().apply(lambda x: x % 1 == 0).all()
]

print("Columns that are actually integers but stored as floats:", int_like_columns)

Columns that are actually integers but stored as floats: ['DAYS_REGISTRATION', 'OWN_CAR_AGE', 'OBS_30_CNT_SOCIAL_CIRCLE', 'OBS_60_CNT_SOCIAL_CIRCLE', 'DAYS_LAST_PHONE_CHANGE', 'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_QRT']


### 3.6 Converting Integer-Like Float Columns to `Int64`  
Now that we have identified float columns that contain only whole numbers,  
we convert them to the nullable `Int64` type so that the data is represented as integers without affecting missing values.  
`DAYS_REGISTRATION` was also detected as integer-like but is not in the list below, so it remains `float64`.

In [16]:
# Convert float columns that contain only whole numbers to nullable Int64
convert_floats_to_int = [
    "OWN_CAR_AGE", "OBS_30_CNT_SOCIAL_CIRCLE", "OBS_60_CNT_SOCIAL_CIRCLE",
    "DAYS_LAST_PHONE_CHANGE", "AMT_REQ_CREDIT_BUREAU_HOUR", "AMT_REQ_CREDIT_BUREAU_DAY",
    "AMT_REQ_CREDIT_BUREAU_WEEK", "AMT_REQ_CREDIT_BUREAU_QRT"
]

df_application_test[convert_floats_to_int] = df_application_test[convert_floats_to_int].astype(pd.Int64Dtype())

print("✅ Converted integer-like float columns to int64 (without affecting NaNs).")

✅ Converted integer-like float columns to int64 (without affecting NaNs).
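
The detection and conversion steps above can also be combined without a hand-typed column list. This is only a sketch under that assumption: `integer_like_float_columns` is a hypothetical helper, and the demo frame is made up.

```python
import numpy as np
import pandas as pd


def integer_like_float_columns(df: pd.DataFrame) -> list:
    """Return float columns whose non-missing values are all whole numbers."""
    floats = df.select_dtypes(include="float")
    result = []
    for col in floats.columns:
        values = floats[col].dropna()
        # Skip columns that are entirely missing; nothing can be inferred from them
        if not values.empty and (values % 1 == 0).all():
            result.append(col)
    return result


# Hypothetical demo: "count" is integer-like with a gap, "ratio" is a true float
demo = pd.DataFrame({
    "count": [1.0, np.nan, 3.0],
    "ratio": [0.25, 1.0, np.nan],
})
int_like = integer_like_float_columns(demo)

# The nullable Int64 dtype keeps missing entries as <NA> instead of failing
converted = demo.astype({col: "Int64" for col in int_like})
print(converted.dtypes)
```

An explicit list, as used in the notebook, keeps control over which columns change type; the automatic version trades that control for less maintenance.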


### 3.7 Displaying Updated Data Types  

After data cleaning, we check if all columns have the correct data types.  
This ensures that categorical, numeric, and ID columns are properly assigned before further processing.  

In [17]:
# Display all columns and their data types
pd.set_option('display.max_rows', None) 
print("✅ Updated Data Types:")
print(df_application_test.dtypes)
pd.reset_option('display.max_rows')  

✅ Updated Data Types:
SK_ID_CURR                         int64
NAME_CONTRACT_TYPE              category
CODE_GENDER                     category
FLAG_OWN_CAR                    category
FLAG_OWN_REALTY                 category
CNT_CHILDREN                       Int64
AMT_INCOME_TOTAL                 float64
AMT_CREDIT                       float64
AMT_ANNUITY                      float64
AMT_GOODS_PRICE                  float64
NAME_TYPE_SUITE                 category
NAME_INCOME_TYPE                category
NAME_EDUCATION_TYPE             category
NAME_FAMILY_STATUS              category
NAME_HOUSING_TYPE               category
REGION_POPULATION_RELATIVE       float64
DAYS_BIRTH                         int64
DAYS_EMPLOYED                      int64
DAYS_REGISTRATION                float64
DAYS_ID_PUBLISH                    int64
OWN_CAR_AGE                        Int64
FLAG_MOBIL                      category
FLAG_EMP_PHONE                  category
FLAG_WORK_PHONE                 cat

### 3.8 Fixing Data Type Inconsistencies
After converting integer-like floats to `Int64`, we review all data types to ensure correctness.  
Some numerical columns were mistakenly identified as categorical. We now convert them back to a numeric type (nullable `Int64`) for consistency.

In [18]:
# Identify columns that should be numerical but were categorized incorrectly
convert_to_numeric = ["AMT_REQ_CREDIT_BUREAU_MON", "AMT_REQ_CREDIT_BUREAU_YEAR"]

# Convert these columns back to a numeric dtype (nullable Int64) for numerical consistency
df_application_test[convert_to_numeric] = df_application_test[convert_to_numeric].astype("Int64")

print("✅ Corrected numerical columns misidentified as categorical.")

✅ Corrected numerical columns misidentified as categorical.


### 3.8 Handling Missing Values  
Missing values can impact model performance, so we analyze and handle them based on their percentage.  

#### 3.8.1 Checking for Missing Values  
We first identify missing values in each column to understand their distribution.  

In [19]:
# Check for missing values
pd.set_option('display.max_rows', None) 
print("Missing values in each column:")
print(df_application_test.isnull().sum())
pd.reset_option('display.max_rows')  

Missing values in each column:
SK_ID_CURR                          0
NAME_CONTRACT_TYPE                  0
CODE_GENDER                         0
FLAG_OWN_CAR                        0
FLAG_OWN_REALTY                     0
CNT_CHILDREN                        0
AMT_INCOME_TOTAL                    0
AMT_CREDIT                          0
AMT_ANNUITY                        24
AMT_GOODS_PRICE                     0
NAME_TYPE_SUITE                   911
NAME_INCOME_TYPE                    0
NAME_EDUCATION_TYPE                 0
NAME_FAMILY_STATUS                  0
NAME_HOUSING_TYPE                   0
REGION_POPULATION_RELATIVE          0
DAYS_BIRTH                          0
DAYS_EMPLOYED                       0
DAYS_REGISTRATION                   0
DAYS_ID_PUBLISH                     0
OWN_CAR_AGE                     32312
FLAG_MOBIL                          0
FLAG_EMP_PHONE                      0
FLAG_WORK_PHONE                     0
FLAG_CONT_MOBILE                    0
FLAG_PHONE         

#### 3.8.2 Calculating Missing Value Percentages  
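To categorize missing values, we define thresholds:  
- **Low**: Less than 1% missing (likely safe to fill with median/mean).  
- **Moderate**: 1%–20% missing (requires careful handling).  
- **High**: More than 50% missing (usually dropped unless critical).  

Columns with 20%–50% missing values fall outside these three bands. In this notebook they are kept and imputed together with the remaining columns.  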

In [20]:
# Define missing value thresholds
low_threshold = 1    # Less than 1% missing
moderate_threshold = 20  # Between 1% and 20% missing
high_threshold = 50   # More than 50% missing 

# Calculate missing value percentage for df_application_test
missing_percent = (df_application_test.isnull().sum() / len(df_application_test)) * 100  

# Display missing percentages sorted from highest to lowest
print("🔍 Missing Value Percentages:")
display(missing_percent[missing_percent > 0].sort_values(ascending=False).apply(lambda x: f"{x:.2f}%"))

🔍 Missing Value Percentages:


COMMONAREA_MODE             68.72%
COMMONAREA_MEDI             68.72%
COMMONAREA_AVG              68.72%
NONLIVINGAPARTMENTS_MEDI    68.41%
NONLIVINGAPARTMENTS_AVG     68.41%
                             ...  
OBS_60_CNT_SOCIAL_CIRCLE     0.06%
DEF_30_CNT_SOCIAL_CIRCLE     0.06%
OBS_30_CNT_SOCIAL_CIRCLE     0.06%
AMT_ANNUITY                  0.05%
EXT_SOURCE_2                 0.02%
Length: 64, dtype: object
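
To make the bands concrete, the sketch below labels each column by its missing percentage. The function `missing_band`, the label `"unassigned"` for the 20%–50% range, and the demo data are all illustrative assumptions; the notebook itself only uses the 50% threshold in code.

```python
import numpy as np
import pandas as pd


def missing_band(percent: float, low: float = 1, moderate: float = 20,
                 high: float = 50) -> str:
    """Map a missing-value percentage to a band name."""
    if percent < low:
        return "low"
    if percent <= moderate:
        return "moderate"
    if percent > high:
        return "high"
    # Between 20% and 50%: no band is defined for this range
    return "unassigned"


# Hypothetical demo frame with 10 rows and varying amounts of missing data
demo = pd.DataFrame({
    "complete": range(10),
    "some_missing": [np.nan] + [1.0] * 9,
    "third_missing": [np.nan] * 3 + [1.0] * 7,
    "mostly_missing": [np.nan] * 7 + [1.0] * 3,
})
missing_percent_demo = demo.isna().mean() * 100
bands = missing_percent_demo.apply(missing_band)
print(bands)
```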

#### 3.8.3 Dropping Columns with Excessive Missing Data  
Columns with **more than 50% missing values** are removed to reduce noise, since imputing most of a column would add little real signal.  

In [21]:
# Identify columns to drop (more than 50% missing)
columns_to_drop = missing_percent[missing_percent > high_threshold].index

# Drop columns
df_application_test.drop(columns=columns_to_drop, inplace=True)

# Print removed columns
print(f"✅ Dropped {len(columns_to_drop)} columns with more than 50% missing values.")
print("Dropped columns:", list(columns_to_drop))

✅ Dropped 29 columns with more than 50% missing values.
Dropped columns: ['OWN_CAR_AGE', 'BASEMENTAREA_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'BASEMENTAREA_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'BASEMENTAREA_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE']


### 3.9 Filling Missing Values  

To ensure data completeness, we fill missing values in both **numeric** and **categorical** columns using appropriate strategies.  
- **Numeric values** → Filled with **median** (robust to outliers).  
- **Categorical values** → Filled with **mode** (most frequent category).  

In [22]:
# Fill numeric columns with median (for application_test)
numeric_cols = df_application_test.select_dtypes(include=['int64', 'float64']).columns
df_application_test[numeric_cols] = df_application_test[numeric_cols].fillna(df_application_test[numeric_cols].median())

print("✅ Filled numeric missing values with median.")

✅ Filled numeric missing values with median.


In [23]:
# Fill categorical columns with mode (for application_test)
categorical_cols = df_application_test.select_dtypes(include=['category']).columns

for col in categorical_cols:
    df_application_test[col] = df_application_test[col].fillna(df_application_test[col].mode()[0])

print("✅ Filled categorical missing values with mode.")

✅ Filled categorical missing values with mode.
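
The two cells above can be expressed as a single step that first collects one fill value per column and then applies them all at once. This is a sketch on made-up data, and `compute_fill_values` is a hypothetical helper rather than the notebook's own code:

```python
import numpy as np
import pandas as pd


def compute_fill_values(df: pd.DataFrame) -> dict:
    """Return {column: fill value}: median for numeric columns, mode otherwise."""
    fills = {}
    for col in df.columns:
        non_null = df[col].dropna()
        if non_null.empty:
            continue  # nothing to learn from an all-missing column
        if pd.api.types.is_numeric_dtype(df[col]):
            fills[col] = non_null.median()
        else:
            fills[col] = non_null.mode().iloc[0]
    return fills


# Hypothetical demo: one numeric column with an extreme value, one categorical column
demo = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0, 1000.0],
    "housing": pd.Categorical(["House", None, "House", "Rented"]),
})
fill_values = compute_fill_values(demo)
filled = demo.fillna(fill_values)
print(fill_values)
```

Keeping the fill values in a dictionary makes them easy to inspect and to reapply to another frame with the same columns.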


### 3.10 Final Missing Values Check  
After filling missing values, we perform a final check to confirm that **no NaNs remain** in the dataset.  

In [24]:
print("Final Missing Values Check:")
print(df_application_test.isnull().sum().sum())

Final Missing Values Check:
0


### 3.11 Checking for Duplicates  
Duplicate rows would produce redundant predictions and may point to a problem in how the data was loaded or merged.  
We check for duplicates and ensure that no redundant rows exist.  

In [25]:
# Check for duplicates
print("Number of duplicate rows:", df_application_test.duplicated().sum())

Number of duplicate rows: 0
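
`duplicated()` without arguments only flags rows that match in every column. Assuming `SK_ID_CURR` is meant to identify each application uniquely, a stricter check looks at that key alone. The sketch below uses made-up rows to show the difference:

```python
import pandas as pd

# Hypothetical demo: the same ID appears twice with different credit amounts
demo = pd.DataFrame({
    "SK_ID_CURR": [100001, 100005, 100005],
    "AMT_CREDIT": [568800.0, 222768.0, 250000.0],
})

# Rows identical in every column
full_row_duplicates = demo.duplicated().sum()

# Rows that repeat an ID already seen, regardless of the other columns
id_duplicates = demo.duplicated(subset=["SK_ID_CURR"]).sum()

print("Full-row duplicates:", full_row_duplicates)
print("Duplicate IDs:", id_duplicates)
```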


## 4. Comparison with application_train (`application_test.csv`)
To ensure consistency between training and test datasets, we load the processed `application_train.pkl` file and compare it with `application_test.csv`.  
This step checks for:
- **Missing columns** in the test dataset that exist in train.
- **Extra columns** in test that do not exist in train.
- **Feature count consistency** (ensuring the same number of features, except `TARGET`).
- **Feature order consistency** (ensuring the test dataset has the same column order as train).

This alignment is necessary before making predictions using the trained model.

### 4.1 Load Processed `application_train.pkl` for Comparison  
We load the processed `application_train.pkl` file from the dataset input to compare it with `application_test.csv`.

In [26]:
# Define the path to the processed application_train dataset from the dataset input
input_path = "/kaggle/input/home-credit-processed-data-and-model/application_train_processed.pkl"

# Load the processed application_train dataset
df_application_train = pd.read_pickle(input_path)

print("✅ application_train_processed.pkl loaded successfully!")

✅ application_train_processed.pkl loaded successfully!


### 4.2 Compare Columns in Train and Test Datasets  
To ensure feature consistency, we compare columns between `application_train` and `application_test`.  
This step identifies:
- **Columns in train but missing in test** (expected: `TARGET`, since it's not available in test).
- **Columns in test but missing in train** (these extra columns will be dropped to align datasets).

In [27]:
# Get sets of column names
train_columns = set(df_application_train.columns)
test_columns = set(df_application_test.columns)

# Columns in train but missing in test
missing_in_test = train_columns - test_columns
# Columns in test but missing in train
extra_in_test = test_columns - train_columns

print("✅ Columns in Train but Missing in Test:", missing_in_test)
print("✅ Columns in Test but Missing in Train:", extra_in_test)

✅ Columns in Train but Missing in Test: {'TARGET'}
✅ Columns in Test but Missing in Train: {'WALLSMATERIAL_MODE', 'LIVINGAREA_MEDI', 'LIVINGAREA_AVG', 'LIVINGAREA_MODE', 'ENTRANCES_MEDI', 'APARTMENTS_MEDI', 'APARTMENTS_MODE', 'EXT_SOURCE_1', 'HOUSETYPE_MODE', 'ENTRANCES_AVG', 'APARTMENTS_AVG', 'ENTRANCES_MODE'}


### 4.3 Drop Extra Columns from Test Dataset  
The previous step identified extra columns in `application_test` that do not exist in `application_train`.  
We remove these columns to ensure both datasets have identical feature sets before model inference.

In [28]:
# Drop extra columns that are not in application_train
df_application_test.drop(columns=extra_in_test, inplace=True)

print("✅ Dropped extra columns from df_application_test to align with df_application_train.")

✅ Dropped extra columns from df_application_test to align with df_application_train.
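
An alternative to dropping columns in place is to select the train feature list directly from the test frame. This is only a sketch on toy data (the frames and column names are hypothetical): it removes extra columns and applies the train column order in one step, and it raises a `KeyError` if the test frame lacks any train feature.

```python
import pandas as pd

# Toy frames standing in for the real datasets (hypothetical columns)
toy_train = pd.DataFrame({"A": [1], "B": [2.0], "TARGET": [0]})
toy_test = pd.DataFrame({"B": [3.0], "EXTRA": ["x"], "A": [4]})

# Feature list taken from train, excluding the label
feature_cols = [c for c in toy_train.columns if c != "TARGET"]

# Selecting by list drops EXTRA and reorders to train's order;
# a KeyError is raised if a train feature is missing from test
aligned_test = toy_test[feature_cols]
print(list(aligned_test.columns))  # ['A', 'B']
```

This variant leaves the original frame untouched, which can make the cell safe to re-run.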


### 4.4 Verify Feature Count Consistency  
To ensure consistency between `application_train` and `application_test`, we verify that:  
- Both datasets have the same number of features (excluding `TARGET`, which is absent in test).  
- No unexpected feature differences exist after column alignment.  

In [29]:
# Check that feature count matches exactly (excluding TARGET)
print("Train feature count:", df_application_train.drop(columns=["TARGET"]).shape[1])
print("Test feature count:", df_application_test.shape[1])

Train feature count: 80
Test feature count: 80


### 4.5 Ensure Feature Order Consistency  
For model inference, the order of features in `application_test` must match `application_train`.  
We check whether the feature order is identical across both datasets and flag mismatches.  

In [30]:
# Compare Feature Order
train_feature_order = list(df_application_train.drop(columns=["TARGET"]).columns)
test_feature_order = list(df_application_test.columns)

if train_feature_order == test_feature_order:
    print("✅ Feature order matches!")
else:
    print("⚠️ Feature order mismatch!")

✅ Feature order matches!
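
If the check above ever reports a mismatch, it helps to see where the two orders diverge. The sketch below lists every position at which the names differ. It uses hypothetical column lists and assumes both lists have the same length, as verified in the previous step.

```python
# Hypothetical feature orders for illustration
train_order = ["A", "B", "C"]
test_order = ["A", "C", "B"]

# Collect (position, train_name, test_name) wherever the two orders disagree
order_mismatches = [
    (i, train_name, test_name)
    for i, (train_name, test_name) in enumerate(zip(train_order, test_order))
    if train_name != test_name
]
print(order_mismatches)  # [(1, 'B', 'C'), (2, 'C', 'B')]
```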


### 4.6 Validate Data Type Consistency  
To ensure that `application_test` matches `application_train`, we:  
- Identify mismatches in column data types.  
- Filter out **category vs. category** mismatches, where both columns are categorical and differ only in their category definitions (treated here as non-issues).  
- Report only **real** data type inconsistencies (e.g., int vs. float mismatches).  

A fully aligned dataset ensures smooth model inference without conversion errors. 

In [31]:
# Compare dtypes of columns present in both datasets (computed from the current frames)
common_columns = df_application_train.columns.intersection(df_application_test.columns)

# Identify mismatches
dtype_mismatches = {
    col: (df_application_train[col].dtype, df_application_test[col].dtype)
    for col in common_columns if df_application_train[col].dtype != df_application_test[col].dtype
}

# Filter out category vs. category mismatches (which are not real issues)
real_dtype_mismatches = {
    col: (train_dtype, test_dtype)
    for col, (train_dtype, test_dtype) in dtype_mismatches.items()
    if not (train_dtype.name == "category" and test_dtype.name == "category")
}

# Display results
if real_dtype_mismatches:
    print("⚠️ Real Data Type Mismatches Found:")
    for col, (train_dtype, test_dtype) in real_dtype_mismatches.items():
        print(f"❌ {col}: Train = {train_dtype}, Test = {test_dtype}")
else:
    print("✅ No real data type mismatches! Train and Test datasets are fully aligned.")

✅ No real data type mismatches! Train and Test datasets are fully aligned.
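
Two categorical columns compare as different dtypes when their category sets differ. Whether that matters depends on how the downstream model encodes categoricals. If identical encodings are wanted, one option is to re-encode the test column with the train categories. The sketch below uses toy data and is an assumption, not a step from the original notebook.

```python
import pandas as pd

# Toy columns: train has a level ("XNA") that never occurs in test
toy_train = pd.DataFrame({"CODE_GENDER": pd.Categorical(["F", "M", "XNA"])})
toy_test = pd.DataFrame({"CODE_GENDER": pd.Categorical(["M", "F", "F"])})

# Unordered categorical dtypes are equal only if their category sets are equal
same_before = toy_train["CODE_GENDER"].dtype == toy_test["CODE_GENDER"].dtype
print(same_before)  # False

# Re-encode test using train's categories so both columns share one dtype
toy_test["CODE_GENDER"] = toy_test["CODE_GENDER"].cat.set_categories(
    toy_train["CODE_GENDER"].cat.categories
)
same_after = toy_train["CODE_GENDER"].dtype == toy_test["CODE_GENDER"].dtype
print(same_after)  # True
```

Note that `set_categories` turns any test value absent from the train categories into `NaN`, so a missing-value check afterwards is advisable.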


### 4.7 Handling Missing Values  
Before finalizing `application_test`, we perform **data quality checks** to detect potential issues:  

✅ **Check for standard missing values (`NaN`)**  
✅ **Detect hidden missing values (empty strings, spaces)**  
✅ **Identify infinite values (`inf`) in numeric columns**  

In [32]:
# Check missing values
missing_values_test = df_application_test.isnull().sum()
missing_values_test = missing_values_test[missing_values_test > 0].sort_values(ascending=False)

# Check for hidden NaNs (empty strings, spaces)
hidden_nans = (df_application_test == "").sum().sum() + (df_application_test == " ").sum().sum()

# Check for infinite values **only in numeric columns**
numeric_cols = df_application_test.select_dtypes(include=['number']).columns
hidden_infs = np.isinf(df_application_test[numeric_cols]).sum().sum()

# Print results
print("**Data Quality Checks in application_test:**")

if not missing_values_test.empty:
    print("\n⚠️ Missing Values Detected:")
    display(missing_values_test.apply(lambda x: f"{x} ({(x / len(df_application_test)) * 100:.2f}%)"))
else:
    print("\n✅ No Missing Values!")

print(f"\nHidden NaNs (Empty Strings, Spaces): {hidden_nans}")
print(f"Hidden Infinite Values: {hidden_infs}")

if hidden_nans == 0 and hidden_infs == 0:
    print("\n✅ No Hidden NaNs or Infinite Values Found! Dataset is clean.")
else:
    print("\n⚠️ Warning: Hidden NaNs or Infinite Values Found! Consider fixing them before merging.")


**Data Quality Checks in application_test:**

✅ No Missing Values!

Hidden NaNs (Empty Strings, Spaces): 0
Hidden Infinite Values: 0

✅ No Hidden NaNs or Infinite Values Found! Dataset is clean.
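
The hidden-value check above matches only the exact strings `""` and `" "`. A slightly broader variant, sketched here on toy data, counts every string cell that is empty after stripping whitespace. The helper name `count_blank_strings` is a hypothetical addition.

```python
import numpy as np
import pandas as pd

def count_blank_strings(df):
    """Count cells holding a string that is empty or whitespace-only."""
    is_blank = df.apply(
        lambda col: col.map(lambda v: isinstance(v, str) and v.strip() == "")
    )
    return int(is_blank.to_numpy().sum())

# Toy frame with two blank strings and two infinite values (hypothetical data)
toy = pd.DataFrame({
    "NAME": ["a", "", "   ", None],
    "AMT": [1.0, np.inf, -np.inf, 2.0],
})

blank_count = count_blank_strings(toy)
inf_count = int(np.isinf(toy.select_dtypes(include="number")).to_numpy().sum())

print(blank_count)  # 2
print(inf_count)    # 2
```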


## 5. Finalizing Processed application_test Dataset  
The `application_test.csv` dataset is now **fully processed**, cleaned, and aligned with the processed `application_train` dataset.  
It is saved as `application_test_processed.csv` and `application_test_processed.pkl` for **model inference**. 

In [33]:
# Save processed application_test dataset
df_application_test.to_csv("application_test_processed.csv", index=False)
df_application_test.to_pickle("application_test_processed.pkl")

print("✅ Application test dataset saved successfully!")

✅ Application test dataset saved successfully!
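
The two output formats behave differently on reload: a pickle restores column dtypes such as `category`, while a CSV stores only text, so dtypes are re-inferred on read. The round trip below is a small sketch on a toy frame written to a temporary directory. The file names and values are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with one categorical column (hypothetical values)
toy = pd.DataFrame({
    "SK_ID_CURR": [1, 2],
    "NAME_CONTRACT_TYPE": pd.Categorical(["Cash loans", "Revolving loans"]),
})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "toy.pkl"
    csv_path = Path(tmp) / "toy.csv"
    toy.to_pickle(pkl_path)
    toy.to_csv(csv_path, index=False)

    from_pkl = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# The pickle round trip preserves values and dtypes
print(from_pkl.equals(toy))  # True

# The CSV round trip does not restore the category dtype
csv_is_categorical = isinstance(from_csv["NAME_CONTRACT_TYPE"].dtype, pd.CategoricalDtype)
print(csv_is_categorical)  # False
```

For inference with a model trained on categorical columns, loading the `.pkl` file avoids having to rebuild those dtypes by hand.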
