# Predictive Modeling

## Objective
The goal of this notebook is to build and evaluate baseline predictive models for loan default risk using cross-validation.

This notebook focuses on:
- Establishing a strong, interpretable baseline using logistic regression
- Comparing feature sets to assess the incremental value of engineered features
- Evaluating model performance using appropriate metrics for imbalanced data
- Prioritizing robustness and stability over marginal performance gains

## Modeling Strategy

Two baseline logistic regression models are evaluated:

**Model A (Interpretable Baseline)**
- Uses features with strong monotonic or well-understood relationships to default risk

**Model B (Extended Baseline)**
- Includes all Model A features plus the credit-to-income ratio
- Tests whether any measured improvement is large enough to justify the added feature

Both models are evaluated using identical preprocessing, cross-validation folds, and metrics.
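
The "identical folds" requirement can be checked directly. `StratifiedKFold` builds its splits from the target only, so with a fixed integer `random_state` the test folds are the same whatever feature matrix is passed in. The sketch below uses synthetic stand-ins (`y_demo`, `X_a`, `X_b` are illustrative names, not the notebook's data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-ins (assumption): ~8% positives, two feature sets of different width
rng = np.random.default_rng(0)
y_demo = (rng.random(1000) < 0.08).astype(int)
X_a = rng.normal(size=(1000, 3))   # plays the role of Model A's features
X_b = rng.normal(size=(1000, 4))   # plays the role of Model B's features

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Collect the test indices produced for each feature matrix
folds_a = [test_idx for _, test_idx in cv.split(X_a, y_demo)]
folds_b = [test_idx for _, test_idx in cv.split(X_b, y_demo)]

same_folds = all(np.array_equal(a, b) for a, b in zip(folds_a, folds_b))
print("Identical test folds:", same_folds)
```

Because the folds match, per-fold scores of the two models can be compared as pairs rather than only through their means.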

In [1]:
import pandas as pd
import numpy as np

from pathlib import Path

from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

In [2]:
PROJECT_ROOT = Path.cwd().parents[0]  # notebooks/ -> repo root
DATA_DIR = PROJECT_ROOT / "data"

df = pd.read_csv(DATA_DIR / "application_train.csv")
print("Shape:", df.shape)
df[["TARGET"]].value_counts(normalize=True).rename("rate")

Shape: (307511, 122)


TARGET
0         0.919271
1         0.080729
Name: rate, dtype: float64

In [3]:
def add_engineered_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Age in years
    df["AGE_YEARS"] = (-df["DAYS_BIRTH"] / 365).round(1)

    # Ratios (guard against divide-by-zero)
    income = df["AMT_INCOME_TOTAL"].replace(0, np.nan)
    df["CREDIT_INCOME_RATIO"] = df["AMT_CREDIT"] / income
    df["ANNUITY_INCOME_RATIO"] = df["AMT_ANNUITY"] / income

    # Missingness indicators for EXT_SOURCE
    ext_features = ["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]
    for col in ext_features:
        df[f"{col}_MISSING"] = df[col].isnull().astype(int)

    return df

df_fe = add_engineered_features(df)
df_fe[["AGE_YEARS","CREDIT_INCOME_RATIO","ANNUITY_INCOME_RATIO","EXT_SOURCE_1_MISSING"]].head()

| | AGE_YEARS | CREDIT_INCOME_RATIO | ANNUITY_INCOME_RATIO | EXT_SOURCE_1_MISSING |
|---|---|---|---|---|
| 0 | 25.9 | 2.007889 | 0.121978 | 0 |
| 1 | 45.9 | 4.79075 | 0.132217 | 0 |
| 2 | 52.2 | 2.0 | 0.1 | 1 |
| 3 | 52.1 | 2.316167 | 0.2199 | 1 |
| 4 | 54.6 | 4.222222 | 0.179963 | 1 |
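
The divide-by-zero guard in `add_engineered_features` matters because an unguarded float division yields `inf`, which scikit-learn's imputers and scalers reject, whereas `NaN` is handled by the median imputer downstream. A minimal sketch on a two-row toy frame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data (assumption): the second applicant reports zero income
toy = pd.DataFrame({
    "AMT_CREDIT": [200_000.0, 150_000.0],
    "AMT_INCOME_TOTAL": [100_000.0, 0.0],
})

# Unguarded division: zero income produces an infinite ratio
naive_ratio = toy["AMT_CREDIT"] / toy["AMT_INCOME_TOTAL"]

# Guarded division, as in the notebook: zero income becomes NaN first
guarded_ratio = toy["AMT_CREDIT"] / toy["AMT_INCOME_TOTAL"].replace(0, np.nan)

print("naive:  ", naive_ratio.tolist())
print("guarded:", guarded_ratio.tolist())
```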


In [4]:
TARGET = "TARGET"

# Categorical features (low-cardinality, easy to explain)
CAT_FEATURES = [
    "NAME_CONTRACT_TYPE",
    "CODE_GENDER",
    "FLAG_OWN_CAR",
    "FLAG_OWN_REALTY"
]

# Numeric features (signal-rich + engineered)
NUM_BASE = [
    "AGE_YEARS",
    "ANNUITY_INCOME_RATIO",
    "EXT_SOURCE_1",
    "EXT_SOURCE_2",
    "EXT_SOURCE_3",
    "EXT_SOURCE_1_MISSING",
    "EXT_SOURCE_2_MISSING",
    "EXT_SOURCE_3_MISSING",
]

# Model A: interpretable baseline
BASE_FEATURES = CAT_FEATURES + NUM_BASE

# Model B: baseline + empirically tested non-monotonic feature
EXTENDED_FEATURES = BASE_FEATURES + ["CREDIT_INCOME_RATIO"]

X_base = df_fe[BASE_FEATURES].copy()
X_ext = df_fe[EXTENDED_FEATURES].copy()
y = df_fe[TARGET].astype(int).copy()

print("Base feature count:", X_base.shape[1])
print("Extended feature count:", X_ext.shape[1])

Base feature count: 12
Extended feature count: 13


In [5]:
def build_logreg_pipeline(numeric_features, categorical_features):
    numeric_pipe = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ])

    categorical_pipe = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ])

    preprocessor = ColumnTransformer(
        transformers=[
            ("num", numeric_pipe, numeric_features),
            ("cat", categorical_pipe, categorical_features),
        ],
        remainder="drop"
    )

    model = LogisticRegression(
        max_iter=2000,
        solver="lbfgs",
        class_weight="balanced",   # important with ~8% positive class
        n_jobs=None
    )

    return Pipeline(steps=[
        ("preprocess", preprocessor),
        ("model", model)
    ])
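
To make the effect of `class_weight="balanced"` concrete, the weights can be computed with `sklearn.utils.class_weight.compute_class_weight`, which applies the same rule: `n_samples / (n_classes * count_per_class)`. The sketch below uses a synthetic label vector with exactly 8% positives (an approximation of the ~8.07% rate seen above, not the real target):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels (assumption): 92 non-defaults, 8 defaults
y_demo = np.array([0] * 92 + [1] * 8)

weights = compute_class_weight(
    class_weight="balanced",
    classes=np.array([0, 1]),
    y=y_demo,
)
weight_by_class = dict(zip([0, 1], weights))
print(weight_by_class)
```

Each default counts roughly 11 to 12 times as much as a non-default in the loss. One consequence is that the predicted probabilities are shifted upward relative to the true default rate, so they should be treated as risk scores rather than calibrated probabilities.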

In [6]:
def evaluate_cv(pipeline, X, y, n_splits=5, random_state=42):
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)

    scores = cross_validate(
        pipeline,
        X, y,
        cv=cv,
        scoring={
            "roc_auc": "roc_auc",
            "pr_auc": "average_precision"
        },
        return_train_score=False
    )

    return {
        "roc_auc_mean": float(np.mean(scores["test_roc_auc"])),
        "roc_auc_std": float(np.std(scores["test_roc_auc"])),
        "pr_auc_mean": float(np.mean(scores["test_pr_auc"])),
        "pr_auc_std": float(np.std(scores["test_pr_auc"]))
    }

In [7]:
pipe_base = build_logreg_pipeline(
    numeric_features=NUM_BASE,
    categorical_features=CAT_FEATURES
)

pipe_ext = build_logreg_pipeline(
    numeric_features=NUM_BASE + ["CREDIT_INCOME_RATIO"],
    categorical_features=CAT_FEATURES
)

results_base = evaluate_cv(pipe_base, X_base, y)
results_ext  = evaluate_cv(pipe_ext,  X_ext,  y)

results = pd.DataFrame([
    {"model": "LogReg - Base", **results_base},
    {"model": "LogReg - Extended (+CIR)", **results_ext},
])

results

| | model | roc_auc_mean | roc_auc_std | pr_auc_mean | pr_auc_std |
|---|---|---|---|---|---|
| 0 | LogReg - Base | 0.732087 | 0.004943 | 0.207924 | 0.005663 |
| 1 | LogReg - Extended (+CIR) | 0.732191 | 0.004896 | 0.208168 | 0.005532 |
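
One detail when reading the `*_std` columns: `evaluate_cv` uses `np.std`, which defaults to the population standard deviation (`ddof=0`), while the coefficient summary later in the notebook uses pandas `.std()`, which defaults to the sample standard deviation (`ddof=1`). With five folds the two differ by a factor of sqrt(5/4). A small sketch with made-up fold scores (illustrative values, not results from this notebook):

```python
import numpy as np
import pandas as pd

# Made-up fold scores for illustration only
fold_scores = np.array([0.70, 0.72, 0.71, 0.73, 0.69])

population_std = np.std(fold_scores)         # ddof=0, the convention in evaluate_cv
sample_std = pd.Series(fold_scores).std()    # ddof=1, the pandas default

print(f"population std: {population_std:.5f}")
print(f"sample std:     {sample_std:.5f}")
print(f"ratio:          {sample_std / population_std:.4f}")
```

Passing `ddof=1` to `np.std` would put both summaries on the same footing if the spreads are ever compared directly.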


## CV Comparison Takeaways
- Compare ROC AUC and PR AUC (PR AUC is especially informative given ~8% default rate)
- Prefer the model that is both better on average and more stable (lower std across folds)
- If Extended improves only marginally or inconsistently, keep the credit-to-income ratio (CIR) out of the linear baseline and reserve it for tree-based models
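
For context on the PR AUC values: a scorer with no information achieves an average precision close to the positive rate, so the reference point here is about 0.08 rather than 0.5. The sketch below checks this on synthetic labels with random scores (an assumption-laden illustration, not the notebook's data) and then expresses the reported Model A PR AUC as a multiple of the observed default rate:

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n = 200_000
y_demo = (rng.random(n) < 0.08).astype(int)   # synthetic labels, ~8% positives
random_scores = rng.random(n)                  # scores that carry no information

no_skill_ap = average_precision_score(y_demo, random_scores)
prevalence = y_demo.mean()
print(f"prevalence={prevalence:.4f}  no-skill AP={no_skill_ap:.4f}")

# Values copied from the notebook output above
reported_pr_auc = 0.207924   # Model A mean PR AUC
default_rate = 0.080729      # share of TARGET == 1
lift = reported_pr_auc / default_rate
print(f"lift over the no-skill reference: {lift:.2f}x")
```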

In [8]:
from sklearn.base import clone
from sklearn.metrics import roc_auc_score

def coef_stability(pipeline, X, y, n_splits=5, random_state=42):
    # Same splitter settings as evaluate_cv, so the folds match the CV results above
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)

    coefs = []
    aucs = []

    for fold, (train_idx, test_idx) in enumerate(cv.split(X, y), 1):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

        # Fit a fresh clone per fold so the caller's pipeline is left unfitted
        fold_pipe = clone(pipeline)
        fold_pipe.fit(X_train, y_train)
        proba = fold_pipe.predict_proba(X_test)[:, 1]
        aucs.append(roc_auc_score(y_test, proba))

        # Extract coefficients aligned to transformed feature names
        pre = fold_pipe.named_steps["preprocess"]
        feat_names = pre.get_feature_names_out()
        coef = fold_pipe.named_steps["model"].coef_.ravel()
        coefs.append(pd.Series(coef, index=feat_names, name=f"fold_{fold}"))

    coef_df = pd.concat(coefs, axis=1)
    summary = pd.DataFrame({
        "coef_mean": coef_df.mean(axis=1),
        "coef_std": coef_df.std(axis=1),
        "abs_mean": coef_df.abs().mean(axis=1),
    }).sort_values("abs_mean", ascending=False)

    return summary, aucs

base_coef_summary, base_fold_aucs = coef_stability(pipe_base, X_base, y)
ext_coef_summary,  ext_fold_aucs  = coef_stability(pipe_ext,  X_ext,  y)

print("Base fold AUCs:", [round(x, 4) for x in base_fold_aucs])
print("Ext  fold AUCs:", [round(x, 4) for x in ext_fold_aucs])

base_coef_summary.head(15)

Base fold AUCs: [0.7302, 0.737, 0.7317, 0.7374, 0.724]
Ext  fold AUCs: [0.7302, 0.7371, 0.7318, 0.7376, 0.7243]


| feature | coef_mean | coef_std | abs_mean |
|---|---|---|---|
| num__EXT_SOURCE_3 | -0.507822 | 0.002101 | 0.507822 |
| num__EXT_SOURCE_2 | -0.439829 | 0.005024 | 0.439829 |
| cat__NAME_CONTRACT_TYPE_Revolving loans | -0.288051 | 0.006388 | 0.288051 |
| cat__CODE_GENDER_F | -0.271293 | 0.004419 | 0.271293 |
| num__EXT_SOURCE_1 | -0.229897 | 0.005305 | 0.229897 |
| cat__FLAG_OWN_CAR_Y | -0.227635 | 0.005152 | 0.227635 |
| cat__NAME_CONTRACT_TYPE_Cash loans | 0.128446 | 0.005103 | 0.128446 |
| num__EXT_SOURCE_1_MISSING | 0.123123 | 0.004684 | 0.123123 |
| num__AGE_YEARS | -0.119853 | 0.004349 | 0.119853 |
| num__EXT_SOURCE_3_MISSING | 0.113754 | 0.004489 | 0.113754 |


In [9]:
ext_coef_summary.head(15)

| feature | coef_mean | coef_std | abs_mean |
|---|---|---|---|
| num__EXT_SOURCE_3 | -0.507269 | 0.002137 | 0.507269 |
| num__EXT_SOURCE_2 | -0.439082 | 0.005114 | 0.439082 |
| cat__NAME_CONTRACT_TYPE_Revolving loans | -0.288567 | 0.006268 | 0.288567 |
| cat__CODE_GENDER_F | -0.269231 | 0.004544 | 0.269231 |
| num__EXT_SOURCE_1 | -0.229298 | 0.005348 | 0.229298 |
| cat__FLAG_OWN_CAR_Y | -0.226409 | 0.005132 | 0.226409 |
| cat__NAME_CONTRACT_TYPE_Cash loans | 0.128845 | 0.004996 | 0.128845 |
| num__EXT_SOURCE_1_MISSING | 0.12319 | 0.004684 | 0.12319 |
| num__ANNUITY_INCOME_RATIO | 0.11849 | 0.003047 | 0.11849 |
| num__AGE_YEARS | -0.11644 | 0.004082 | 0.11644 |


## Feature Set Comparison: Linear Baseline

Two logistic regression baselines were evaluated using stratified cross-validation:
- Model A: Interpretable baseline feature set
- Model B: Baseline + credit-to-income ratio

### Key Findings
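- Adding the credit-to-income ratio raised mean ROC AUC by about 0.0001 (0.73209 to 0.73219) and mean PR AUC by about 0.0002 (0.20792 to 0.20817), both far below the fold-to-fold standard deviation of roughly 0.005 to 0.006
- Model B matched or exceeded Model A in every fold, but never by more than 0.0003 ROC AUC, so the gain is consistent in direction yet too small to be practically meaningful
- The eight largest coefficients kept the same ranking and changed by about 0.002 or less, indicating that the added feature did not alter how the model uses the existing ones; `ANNUITY_INCOME_RATIO` does move into the top ten in Model B, which may reflect some overlap between the two income ratios
- The annuity-to-income ratio carried a larger, stable coefficient (0.118 with a fold standard deviation of 0.003 in Model B), whereas the credit-to-income ratio does not appear among the ten largest coefficients shown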
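
The fold-level comparison can be reproduced from the printed fold AUCs. Because both models were scored on identical folds, the per-fold differences are the relevant quantity. Note that the inputs below are the 4-decimal values from the output, so the differences are only resolved to 0.0001:

```python
import numpy as np

# Fold-level ROC AUCs copied from the printed output (rounded to 4 decimals)
base_fold_aucs = np.array([0.7302, 0.7370, 0.7317, 0.7374, 0.7240])
ext_fold_aucs = np.array([0.7302, 0.7371, 0.7318, 0.7376, 0.7243])

# Paired differences, rounded to the precision of the inputs
deltas = np.round(ext_fold_aucs - base_fold_aucs, 4)
base_spread = float(base_fold_aucs.std())   # population std across folds

print("Per-fold delta (Ext - Base):", deltas.tolist())
print("Mean delta:", round(float(deltas.mean()), 5))
print("Fold-to-fold std of Base AUC:", round(base_spread, 4))
```

The mean difference is more than an order of magnitude smaller than the variation between folds, which is the basis for treating the improvement as negligible.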

### Decision
- The credit-to-income ratio was excluded from the linear baseline model
- The feature will be retained for evaluation in nonlinear, tree-based models where non-monotonic effects can be exploited

This decision prioritizes robustness, interpretability, and empirical evidence over intuitive but unsupported feature inclusion.
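
The reasoning behind deferring the feature to tree-based models can be illustrated on synthetic data. In the sketch below, risk is U-shaped in a single feature: a logistic regression on the raw feature has almost nothing to work with, while a shallow decision tree can split on both ends. Everything here is made up for illustration (the feature `x`, the risk levels 0.30 and 0.05, and the choice of `DecisionTreeClassifier`), and it says nothing about how `CREDIT_INCOME_RATIO` will actually behave:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 4000
x = rng.uniform(-1.0, 1.0, size=n)

# Synthetic U-shaped risk: high at both extremes of x, low in the middle
p_default = np.where(np.abs(x) > 0.5, 0.30, 0.05)
y_demo = (rng.random(n) < p_default).astype(int)

X_demo = x.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Linear model on the raw feature vs. a shallow tree on the same feature
linear = LogisticRegression().fit(X_tr, y_tr)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

auc_linear = roc_auc_score(y_te, linear.predict_proba(X_te)[:, 1])
auc_tree = roc_auc_score(y_te, tree.predict_proba(X_te)[:, 1])
print(f"linear AUC={auc_linear:.3f}  tree AUC={auc_tree:.3f}")
```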