# Feature Engineering for Credit Risk Prediction

## Objective
The objective of this notebook is to create meaningful, business-driven
features that improve the predictive power of the credit risk model.

Feature engineering is critical in credit risk modeling because:
- Raw features rarely capture risk patterns directly
- Ratios and derived features reflect real financial behavior
- Well-designed features often matter more than complex models


In [1]:
import numpy as np
import pandas as pd


## Load Cleaned Dataset

We load the cleaned datasets created in Step 1.
All missing values have already been handled.


In [2]:
train_df = pd.read_csv("../data/processed/train_cleaned.csv")
test_df  = pd.read_csv("../data/processed/test_cleaned.csv")

print(train_df.shape, test_df.shape)


(307511, 73) (48744, 72)


## Why Feature Engineering Matters in Credit Risk

Raw features such as income, loan amount, or age do not fully capture risk on their own.
Relationships between features often provide stronger signals.

Examples:
- Loan amount relative to income
- Credit amount relative to annuity
- Employment stability relative to age

These engineered features reflect real-world financial behavior.


## Financial Ratio Features

We create ratio-based features that scale loan amounts by income, making
applicants with different income levels comparable. The credit-to-annuity
ratio additionally approximates the loan term in payment periods.


In [3]:
for df in [train_df, test_df]:
    df["CREDIT_INCOME_RATIO"] = df["AMT_CREDIT"] / df["AMT_INCOME_TOTAL"]
    df["ANNUITY_INCOME_RATIO"] = df["AMT_ANNUITY"] / df["AMT_INCOME_TOTAL"]
    df["CREDIT_ANNUITY_RATIO"] = df["AMT_CREDIT"] / df["AMT_ANNUITY"]
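
The ratios above divide by income and annuity, so a zero denominator would produce an infinite value. As a minimal sketch of one way to guard against this at creation time, the hypothetical helper `safe_ratio` below (not part of the original notebook) returns NaN instead. The toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two columns, returning NaN where the denominator is zero."""
    # Replacing 0 with NaN makes the division yield NaN instead of +/-inf.
    return numerator / denominator.replace(0, np.nan)

# Toy data (made up for illustration); the second applicant has zero income.
toy = pd.DataFrame({
    "AMT_CREDIT": [200000.0, 150000.0],
    "AMT_INCOME_TOTAL": [100000.0, 0.0],
})
toy["CREDIT_INCOME_RATIO"] = safe_ratio(toy["AMT_CREDIT"], toy["AMT_INCOME_TOTAL"])
print(toy)
```

With this approach the NaN values can be imputed together with any other missing values later in the notebook.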


## Employment and Age-Based Features

Employment stability and age are strong indicators of repayment capacity.
The code below negates the `DAYS_*` columns, which are stored as negative day
counts, converts them to years, and then takes their ratio.


In [5]:
for df in [train_df, test_df]:
    df["AGE_YEARS"] = (-df["DAYS_BIRTH"]) / 365
    df["EMPLOYMENT_YEARS"] = (-df["DAYS_EMPLOYED"]) / 365
    df["EMPLOYMENT_AGE_RATIO"] = df["EMPLOYMENT_YEARS"] / df["AGE_YEARS"]
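
One caveat worth checking: in the public Home Credit Default Risk data, `DAYS_EMPLOYED` uses the placeholder value 365243 for applicants without an employment record. If this dataset follows the same convention and Step 1 did not already handle it (both are assumptions to verify), the conversion above would turn that placeholder into roughly -1000 years. A minimal sketch of one possible treatment, where `EMPLOYED_SENTINEL` and `EMPLOYED_UNKNOWN` are hypothetical names:

```python
import numpy as np
import pandas as pd

# Assumption: 365243 is a placeholder for "no employment record",
# as in the public Home Credit data; confirm this for your own dataset.
EMPLOYED_SENTINEL = 365243

# Toy data (made up for illustration).
toy = pd.DataFrame({
    "DAYS_BIRTH": [-14600, -21900],
    "DAYS_EMPLOYED": [-3650, EMPLOYED_SENTINEL],
})

# Keep the information as a flag, then blank out the placeholder.
toy["EMPLOYED_UNKNOWN"] = (toy["DAYS_EMPLOYED"] == EMPLOYED_SENTINEL).astype(int)
days_employed = toy["DAYS_EMPLOYED"].replace(EMPLOYED_SENTINEL, np.nan)

toy["AGE_YEARS"] = -toy["DAYS_BIRTH"] / 365
toy["EMPLOYMENT_YEARS"] = -days_employed / 365
toy["EMPLOYMENT_AGE_RATIO"] = toy["EMPLOYMENT_YEARS"] / toy["AGE_YEARS"]
print(toy)
```

Keeping a separate flag preserves the fact that the value was a placeholder, while the NaN can be imputed like any other missing value.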


## Income and Family Responsibility Features

Family size and dependents affect financial obligations.
We create features to capture this impact.


In [6]:
for df in [train_df, test_df]:
    df["INCOME_PER_PERSON"] = df["AMT_INCOME_TOTAL"] / df["CNT_FAM_MEMBERS"]


## Binary Risk Flags

Binary flags help models capture threshold-based risk patterns.
These features encode domain knowledge explicitly.


In [7]:
for df in [train_df, test_df]:
    df["HIGH_CREDIT_INCOME"] = (df["CREDIT_INCOME_RATIO"] > 0.5).astype(int)
    df["LOW_EMPLOYMENT"] = (df["EMPLOYMENT_YEARS"] < 1).astype(int)
    df["HIGH_ANNUITY_BURDEN"] = (df["ANNUITY_INCOME_RATIO"] > 0.3).astype(int)
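
A flag that is set for almost every applicant, or for almost none, carries little signal, so it can be worth checking how often each threshold fires. The sketch below uses made-up toy ratios with the same thresholds as above; on the real data the same `.mean()` call would be applied to `train_df`.

```python
import pandas as pd

# Toy ratios (made up for illustration).
toy = pd.DataFrame({
    "CREDIT_INCOME_RATIO": [0.4, 2.0, 3.5, 6.0],
    "ANNUITY_INCOME_RATIO": [0.05, 0.15, 0.25, 0.40],
})
toy["HIGH_CREDIT_INCOME"] = (toy["CREDIT_INCOME_RATIO"] > 0.5).astype(int)
toy["HIGH_ANNUITY_BURDEN"] = (toy["ANNUITY_INCOME_RATIO"] > 0.3).astype(int)

# The mean of a 0/1 column is the share of rows with the flag set.
flag_rates = toy[["HIGH_CREDIT_INCOME", "HIGH_ANNUITY_BURDEN"]].mean()
print(flag_rates)
```

If a rate turns out to be close to 0 or 1 on the training data, the threshold may need revisiting.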


## Handling Infinite and Invalid Values

Ratio features can produce infinite values when the denominator is zero,
and NaN when an input is missing or when zero is divided by zero.
These must be handled before modeling.


In [8]:
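# Division by zero yields +/-inf; convert these to NaN so they can be imputed.
train_df.replace([np.inf, -np.inf], np.nan, inplace=True)
test_df.replace([np.inf, -np.inf], np.nan, inplace=True)

# All numeric feature columns; TARGET exists only in the training set.
num_cols = train_df.select_dtypes(include="number").columns.drop("TARGET")

# Compute medians on the training set only and reuse them for the test set,
# so that no test-set statistics leak into the features.
train_medians = train_df[num_cols].median()
train_df[num_cols] = train_df[num_cols].fillna(train_medians)
test_df[num_cols]  = test_df[num_cols].fillna(train_medians)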
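
The effect of this step can be seen on a small example. The sketch below uses made-up toy frames with a single hypothetical `RATIO` column: the infinite value becomes NaN, and both frames are then filled with the median computed from the training rows only.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration).
train = pd.DataFrame({"RATIO": [1.0, np.inf, 3.0, 5.0], "TARGET": [0, 1, 0, 0]})
test = pd.DataFrame({"RATIO": [np.nan, 2.0]})

train = train.replace([np.inf, -np.inf], np.nan)
test = test.replace([np.inf, -np.inf], np.nan)

feature_cols = train.select_dtypes(include="number").columns.drop("TARGET")
train_medians = train[feature_cols].median()  # NaN is skipped: median of 1, 3, 5

train[feature_cols] = train[feature_cols].fillna(train_medians)
test[feature_cols] = test[feature_cols].fillna(train_medians)
print(train["RATIO"].tolist(), test["RATIO"].tolist())
```

The missing test value is filled with the training median rather than a statistic computed from the test set itself.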


## Feature Engineering Summary

New features created include:
- Financial burden ratios
- Employment stability indicators
- Income normalization features
- Binary risk flags based on domain knowledge

These engineered features are designed to:
- Improve model interpretability
- Capture real-world risk behavior
- Enhance predictive performance


## Save Feature-Engineered Dataset

The transformed training and test datasets are saved for model training and evaluation.


In [None]:
train_df.to_csv("../data/processed/train_fe.csv", index=False)
test_df.to_csv("../data/processed/test_fe.csv", index=False)
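
Before relying on the saved files, it can help to confirm that both frames expose the same feature columns and contain no remaining missing values. The helper `find_alignment_problems` below is a hypothetical sketch, not part of the original notebook, shown on made-up toy frames. It checks every column, so it assumes that non-numeric columns were also cleaned in Step 1.

```python
import pandas as pd

def find_alignment_problems(train, test, target="TARGET"):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    train_features = [c for c in train.columns if c != target]
    missing = sorted(set(train_features) - set(test.columns))
    extra = sorted(set(test.columns) - set(train_features))
    if missing:
        problems.append(f"columns missing from test: {missing}")
    if extra:
        problems.append(f"unexpected columns in test: {extra}")
    for name, df in [("train", train), ("test", test)]:
        n_missing = int(df.isna().sum().sum())
        if n_missing:
            problems.append(f"{name} still contains {n_missing} missing values")
    return problems

# Toy frames (made up for illustration).
good_train = pd.DataFrame({"AGE_YEARS": [40.0, 60.0], "TARGET": [0, 1]})
good_test = pd.DataFrame({"AGE_YEARS": [35.0]})
bad_test = pd.DataFrame({"AGE_YEARS": [None], "EXTRA": [1]})

print(find_alignment_problems(good_train, good_test))
print(find_alignment_problems(good_train, bad_test))
```

On the real data the call would be `find_alignment_problems(train_df, test_df)`, and an empty list would indicate that the two files are consistent.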


## Summary of Step 2

In this notebook, we:
- Designed business-driven features
- Created ratio-based financial indicators
- Added employment and family-related features
- Incorporated domain-specific risk flags
- Replaced infinite values and imputed missing values with training-set medians
- Saved the final feature-engineered datasets

The datasets are now ready for model training and evaluation.
