# Loan Payback — Meta-Boosted XGBoost (loan_meta_optimized)

This notebook builds a **two-stage boosted model** for the loan payback competition:

1. **Stage 1**: A strong XGBoost classifier is trained on preprocessed features.
2. **Stage 2 (Meta Boost)**:  
   - We convert Stage 1 predicted probabilities to **logits** (log-odds).  
   - These logits are passed to a *second* XGBoost model via `base_margin`, so Stage 2
     starts from the Stage 1 prediction and **fits additive corrections in log-odds space**
     instead of starting from scratch.
   - The meta model outputs the corrected probabilities.

The goal is to provide a clean, reproducible pipeline that you can easily extend and
tune further on Kaggle.
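
The `base_margin` idea can be illustrated with plain NumPy arithmetic. The sketch below is not XGBoost itself: `stage2_correction` is a hypothetical stand-in for the summed output of the Stage 2 trees, and the helper names are illustrative. It shows that the final probability is the sigmoid of the Stage 1 logit plus the correction, so a zero correction reproduces Stage 1 exactly.

```python
import numpy as np

def to_logit(p, eps=1e-6):
    """Convert probabilities to log-odds, clipping to avoid infinities."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def sigmoid(z):
    """Convert log-odds back to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical Stage 1 probabilities for three loans.
stage1_proba = np.array([0.20, 0.50, 0.90])
base_margin = to_logit(stage1_proba)

# With no correction, Stage 2 returns the Stage 1 probabilities unchanged.
no_change = sigmoid(base_margin + 0.0)

# Hypothetical tree output: push the first loan up and the last one down.
stage2_correction = np.array([0.5, 0.0, -0.5])
refined = sigmoid(base_margin + stage2_correction)
```

If Stage 2 cannot find a useful correction, its output stays close to Stage 1, which is why the notebook later keeps a fallback to the Stage 1 predictions.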


In [13]:

# 1) Imports & basic configuration
import os
from pathlib import Path
from datetime import datetime

import numpy as np
import pandas as pd

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, accuracy_score, f1_score, log_loss
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

import xgboost as xgb

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

DATA_DIR = Path("Data")
# If running locally, you can override DATA_DIR, e.g.:
# DATA_DIR = Path("/mnt/data") / "loan-payback"

def log(msg: str):
    ts = datetime.now().strftime("%H:%M:%S")
    print(f"[{ts}] {msg}")


In [14]:

# 2) Data loading and automatic target / id detection

train_path = None
test_path = None

# Heuristic: pick first train/test-looking CSVs
csv_files = sorted(list(DATA_DIR.glob("*.csv")))
for p in csv_files:
    name = p.name.lower()
    if "train" in name and train_path is None:
        train_path = p
    if "test" in name and test_path is None and "train" not in name:
        test_path = p

if train_path is None or test_path is None:
    raise FileNotFoundError(
        f"Could not detect train/test CSVs inside {DATA_DIR}. "
        "Please set train_path and test_path manually."
    )

log(f"Using train: {train_path.name}")
log(f"Using test : {test_path.name}")

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

log(f"Train shape: {train_df.shape}")
log(f"Test  shape: {test_df.shape}")

def detect_target(train_df: pd.DataFrame, test_df: pd.DataFrame) -> str:
    diff = list(set(train_df.columns) - set(test_df.columns))
    # Prefer a low-cardinality column (at most 3 distinct values), as a binary label would be
    candidates = []
    for c in diff:
        if train_df[c].nunique() <= 3:
            candidates.append(c)
    if len(candidates) == 1:
        return candidates[0]
    if len(diff) == 1:
        return diff[0]
    for name in ["loan_paid_back", "target", "label", "is_default", "default", "paid"]:
        if name in train_df.columns and name not in test_df.columns:
            return name
    raise ValueError(f"Could not detect target. Diff columns: {diff}")

target_col = detect_target(train_df, test_df)
log(f"Detected target column: {target_col}")

# Simple ID detection: column whose values are unique in train and test
id_col = None
for col in train_df.columns:
    if col == target_col:
        continue
    if col in test_df.columns:
        if train_df[col].is_unique and test_df[col].is_unique:
            id_col = col
            break

log(f"Detected id column: {id_col}")

y = train_df[target_col].astype(int).values

feature_cols = [c for c in train_df.columns if c not in [target_col, id_col]]
X = train_df[feature_cols].copy()
X_test = test_df[feature_cols].copy()

log(f"Number of features: {len(feature_cols)}")


[22:30:50] Using train: train.csv
[22:30:50] Using test : test.csv
[22:30:51] Train shape: (593994, 13)
[22:30:51] Test  shape: (254569, 12)
[22:30:51] Detected target column: loan_paid_back
[22:30:51] Detected id column: id
[22:30:51] Number of features: 11
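
Because the target is detected heuristically and then cast with `astype(int)`, a wrongly detected numeric column could be truncated silently instead of being rejected. A minimal sketch of a sanity check is shown below; `check_binary_target` is a hypothetical helper, not part of the notebook, and the toy labels stand in for `train_df[target_col]`.

```python
import numpy as np

def check_binary_target(values):
    """Return the positive rate after confirming the labels are exactly 0/1."""
    values = np.asarray(values)
    unique = np.unique(values)
    if not np.isin(unique, [0, 1]).all():
        raise ValueError(f"Expected only 0/1 labels, found: {unique.tolist()}")
    return float(values.mean())

# Hypothetical labels standing in for train_df[target_col].
toy_labels = np.array([1, 1, 0, 1, 1, 0, 1, 1])
positive_rate = check_binary_target(toy_labels)
print(f"Positive rate: {positive_rate:.3f}")
```

The positive rate is also worth knowing before reading F1 scores later, since F1 on the positive class depends strongly on class balance.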


In [15]:

# 3) Preprocessing: numeric + categorical pipelines

numeric_cols = X.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = X.select_dtypes(include=["object", "category", "bool"]).columns.tolist()

log(f"Numeric features    : {len(numeric_cols)}")
log(f"Categorical features: {len(categorical_cols)}")

numeric_transformer = SimpleImputer(strategy="median")

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=True)),
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols),
    ]
)

# Fit on full training data, then transform train & test
log("Fitting preprocessing on full training data...")
preprocess.fit(X)

X_proc = preprocess.transform(X)
X_test_proc = preprocess.transform(X_test)

log(f"Processed X shape      : {X_proc.shape}")
log(f"Processed X_test shape : {X_test_proc.shape}")


[22:30:51] Numeric features    : 5
[22:30:51] Categorical features: 6
[22:30:51] Fitting preprocessing on full training data...
[22:30:53] Processed X shape      : (593994, 60)
[22:30:53] Processed X_test shape : (254569, 60)
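
The preprocessing above is fitted once on all training rows, so during cross-validation the imputation statistics also include each fold's validation rows. With simple median and most-frequent imputation on a dataset this large the effect is probably negligible, but a stricter setup would fit the statistics on the training part of each fold only. The sketch below illustrates that idea for a single numeric column, assuming the data contains missing values; `impute_with_train_median` and the toy arrays are hypothetical.

```python
import numpy as np

def impute_with_train_median(train_col, valid_col):
    """Fill NaNs in both arrays using the median of the training rows only."""
    median = np.nanmedian(train_col)
    train_filled = np.where(np.isnan(train_col), median, train_col)
    valid_filled = np.where(np.isnan(valid_col), median, valid_col)
    return train_filled, valid_filled, median

# Hypothetical numeric feature split into a training and a validation fold.
train_col = np.array([1.0, np.nan, 3.0, 5.0])
valid_col = np.array([np.nan, 100.0])

train_filled, valid_filled, median = impute_with_train_median(train_col, valid_col)
```

Here the validation value `100.0` has no influence on the fill value, which is the property a fold-wise fit guarantees.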


In [16]:

# 4) Stage 1: XGBoost base model with StratifiedKFold OOF

n_splits = 5
skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=RANDOM_STATE)

oof_pred_stage1 = np.zeros(X_proc.shape[0])
test_pred_stage1_folds = []

params_stage1 = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "max_depth": 4,
    "learning_rate": 0.05,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "reg_lambda": 1.0,
    "reg_alpha": 0.0,
    "tree_method": "hist",
    "random_state": RANDOM_STATE,
}

log("Training Stage 1 model (base XGBoost)...")

for fold, (tr_idx, val_idx) in enumerate(skf.split(X_proc, y), 1):
    log(f"Fold {fold}/{n_splits}")
    X_tr, X_val = X_proc[tr_idx], X_proc[val_idx]
    y_tr, y_val = y[tr_idx], y[val_idx]

    dtrain = xgb.DMatrix(X_tr, label=y_tr)
    dval = xgb.DMatrix(X_val, label=y_val)
    dtest = xgb.DMatrix(X_test_proc)

    evals = [(dtrain, "train"), (dval, "valid")]

    booster = xgb.train(
        params_stage1,
        dtrain,
        num_boost_round=2000,
        evals=evals,
        early_stopping_rounds=100,
        verbose_eval=200,
    )

    oof_pred_stage1[val_idx] = booster.predict(dval, iteration_range=(0, booster.best_iteration + 1))
    test_pred_stage1_folds.append(
        booster.predict(dtest, iteration_range=(0, booster.best_iteration + 1))
    )

auc_stage1 = roc_auc_score(y, oof_pred_stage1)
log(f"Stage 1 OOF ROC-AUC: {auc_stage1:.5f}")


[22:30:53] Training Stage 1 model (base XGBoost)...
[22:30:53] Fold 1/5
[0]	train-auc:0.88926	valid-auc:0.88960
[200]	train-auc:0.91682	valid-auc:0.91735
[400]	train-auc:0.91986	valid-auc:0.91945
[600]	train-auc:0.92224	valid-auc:0.92089
[800]	train-auc:0.92405	valid-auc:0.92177
[1000]	train-auc:0.92543	valid-auc:0.92225
[1200]	train-auc:0.92658	valid-auc:0.92252
[1400]	train-auc:0.92757	valid-auc:0.92272
[1600]	train-auc:0.92853	valid-auc:0.92289
[1800]	train-auc:0.92936	valid-auc:0.92294
[1999]	train-auc:0.93011	valid-auc:0.92296
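
The out-of-fold (OOF) bookkeeping in the Stage 1 loop can be shown on a toy problem. The sketch below makes two simplifying assumptions: it uses a plain shuffled K-fold split rather than `StratifiedKFold`, and the "model" is just the mean label of the training rows. All names and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_folds = 10, 5

# Hypothetical binary labels; the "model" below just predicts the training mean.
y_toy = rng.integers(0, 2, size=n_rows)

# Shuffle row indices once and cut them into disjoint validation folds.
folds = np.array_split(rng.permutation(n_rows), n_folds)

oof = np.full(n_rows, np.nan)
for val_idx in folds:
    train_mask = np.ones(n_rows, dtype=bool)
    train_mask[val_idx] = False
    # Fit on the other folds only, then predict the held-out rows.
    oof[val_idx] = y_toy[train_mask].mean()

# Every row now holds exactly one prediction from a model that never saw it.
```

This property is what makes the OOF vector usable both for an honest ROC-AUC estimate and as the `base_margin` input for Stage 2.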

In [17]:

# 5) Stage 2: Meta XGBoost boosting over Stage‑1 logits (base_margin trick)

def prob_to_logit(p: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

log("Computing Stage 1 logits for OOF and test...")

logits_oof_stage1 = prob_to_logit(oof_pred_stage1)
test_pred_stage1_folds = np.vstack(test_pred_stage1_folds)  # (n_splits, n_test)
logits_test_stage1_folds = prob_to_logit(test_pred_stage1_folds)

oof_pred_stage2 = np.zeros(X_proc.shape[0])
test_pred_stage2_folds = []

params_stage2 = {
    "objective": "binary:logistic",
    "eval_metric": "auc",
    "max_depth": 3,
    "learning_rate": 0.03,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "reg_lambda": 1.5,
    "reg_alpha": 0.2,
    "tree_method": "hist",
    "random_state": RANDOM_STATE + 1,
}

log("Training Stage 2 meta model with base_margin (logit boosting)...")

for fold, (tr_idx, val_idx) in enumerate(skf.split(X_proc, y), 1):
    log(f"Stage 2 Fold {fold}/{n_splits}")
    X_tr, X_val = X_proc[tr_idx], X_proc[val_idx]
    y_tr, y_val = y[tr_idx], y[val_idx]

    dtrain = xgb.DMatrix(X_tr, label=y_tr)
    dval = xgb.DMatrix(X_val, label=y_val)
    dtest = xgb.DMatrix(X_test_proc)

    # base_margin = Stage‑1 logits
    dtrain.set_base_margin(logits_oof_stage1[tr_idx])
    dval.set_base_margin(logits_oof_stage1[val_idx])
    dtest.set_base_margin(logits_test_stage1_folds[fold - 1])

    evals = [(dtrain, "train"), (dval, "valid")]

    booster_meta = xgb.train(
        params_stage2,
        dtrain,
        num_boost_round=2000,
        evals=evals,
        early_stopping_rounds=100,
        verbose_eval=200,
    )

    oof_pred_stage2[val_idx] = booster_meta.predict(
        dval, iteration_range=(0, booster_meta.best_iteration + 1)
    )
    test_pred_stage2_folds.append(
        booster_meta.predict(dtest, iteration_range=(0, booster_meta.best_iteration + 1))
    )

auc_stage2 = roc_auc_score(y, oof_pred_stage2)
log(f"Stage 2 OOF ROC-AUC: {auc_stage2:.5f}")



[22:40:02] Computing Stage 1 logits for OOF and test...
[22:40:02] Training Stage 2 meta model with base_margin (logit boosting)...
[22:40:02] Stage 2 Fold 1/5
[0]	train-auc:0.92168	valid-auc:0.92297
[99]	train-auc:0.92189	valid-auc:0.92284
[22:40:08] Stage 2 Fold 2/5
[0]	train-auc:0.92175	valid-auc:0.92271
[99]	train-auc:0.92188	valid-auc:0.92259
[22:40:13] Stage 2 Fold 3/5
[0]	train-auc:0.92221	valid-auc:0.92083
[99]	train-auc:0.92235	valid-auc:0.92069
[22:40:20] Stage 2 Fold 4/5
[0]	train-auc:0.92196	valid-auc:0.92186
[100]	train-auc:0.92210	valid-auc:0.92169
[22:40:25] Stage 2 Fol
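
In the log above, each fold stops after roughly 100 rounds with a validation AUC no better than at round 0, which suggests the meta model found little to add on top of Stage 1. One way to quantify this is to measure how far Stage 2 moved each prediction in log-odds space. The sketch below is an illustration under assumptions: `correction_summary` is a hypothetical helper, and the toy arrays stand in for `oof_pred_stage1` and `oof_pred_stage2`.

```python
import numpy as np

def logit(p, eps=1e-6):
    """Log-odds with clipping, mirroring prob_to_logit in the notebook."""
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def correction_summary(stage1_proba, stage2_proba):
    """Summarise how far Stage 2 moved each prediction in log-odds space."""
    shift = logit(stage2_proba) - logit(stage1_proba)
    return {
        "mean_abs_shift": float(np.mean(np.abs(shift))),
        "max_abs_shift": float(np.max(np.abs(shift))),
    }

# Hypothetical probabilities standing in for the Stage 1 and Stage 2 OOF vectors.
p1 = np.array([0.10, 0.50, 0.90])
p2 = np.array([0.10, 0.52, 0.90])
summary = correction_summary(p1, p2)
print(summary)
```

Shifts close to zero would be consistent with near-identical Stage 1 and Stage 2 metrics.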

In [18]:

# 6) Compare Stage 1 vs Stage 2 and find a good classification threshold

def evaluate_at_threshold(y_true, proba, thr: float) -> dict:
    pred = (proba >= thr).astype(int)
    return {
        "accuracy": accuracy_score(y_true, pred),
        "f1": f1_score(y_true, pred),
        "logloss": log_loss(y_true, proba),
    }

thr_grid = np.linspace(0.1, 0.9, 17)

log("Searching threshold on Stage 2 OOF probabilities...")
best_thr = 0.5
best_f1 = -1.0
for thr in thr_grid:
    metrics = evaluate_at_threshold(y, oof_pred_stage2, thr)
    if metrics["f1"] > best_f1:
        best_f1 = metrics["f1"]
        best_thr = thr

log(f"Best threshold on OOF: {best_thr:.3f} (F1={best_f1:.4f})")

metrics1 = evaluate_at_threshold(y, oof_pred_stage1, best_thr)
metrics2 = evaluate_at_threshold(y, oof_pred_stage2, best_thr)

print('=== Stage 1 (base model) ===')
print(f"ROC-AUC : {roc_auc_score(y, oof_pred_stage1):.5f}")
print(f"Accuracy: {metrics1['accuracy']:.5f}")
print(f"F1      : {metrics1['f1']:.5f}")
print(f"LogLoss : {metrics1['logloss']:.5f}")

print('\n=== Stage 2 (meta boosted) ===')
print(f"ROC-AUC : {roc_auc_score(y, oof_pred_stage2):.5f}")
print(f"Accuracy: {metrics2['accuracy']:.5f}")
print(f"F1      : {metrics2['f1']:.5f}")
print(f"LogLoss : {metrics2['logloss']:.5f}")


[22:40:31] Searching threshold on Stage 2 OOF probabilities...
[22:40:32] Best threshold on OOF: 0.450 (F1=0.9430)
=== Stage 1 (base model) ===
ROC-AUC : 0.92194
Accuracy: 0.90480
F1      : 0.94296
LogLoss : 0.24518

=== Stage 2 (meta boosted) ===
ROC-AUC : 0.92194
Accuracy: 0.90481
F1      : 0.94296
LogLoss : 0.24518
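
To make the threshold search concrete, here is a small worked example that computes F1 for the positive class directly from confusion counts and sweeps the same 17-point grid. The labels and probabilities are hypothetical toy values, not the notebook's data, and `f1_at_threshold` is an illustrative helper rather than a replacement for `f1_score`.

```python
import numpy as np

def f1_at_threshold(y_true, proba, thr):
    """F1 for the positive class, computed from confusion counts."""
    pred = proba >= thr
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical labels and probabilities, chosen to avoid ties with grid points.
y_toy = np.array([1, 1, 1, 0, 1, 0])
p_toy = np.array([0.91, 0.83, 0.47, 0.41, 0.62, 0.23])

grid = np.linspace(0.1, 0.9, 17)
scores = [f1_at_threshold(y_toy, p_toy, t) for t in grid]
toy_best_thr = grid[int(np.argmax(scores))]
```

Note that F1 ignores true negatives, so on a dataset where the positive class dominates it can stay high even when accuracy is noticeably lower. ROC-AUC and log loss do not depend on the chosen threshold at all.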


In [19]:

# 7) Build final test predictions and submission file

test_pred_stage1 = test_pred_stage1_folds.mean(axis=0)
test_pred_stage2 = np.mean(np.vstack(test_pred_stage2_folds), axis=0)

# Use Stage 2 only if its OOF ROC-AUC is at least as high as Stage 1's; otherwise fall back to Stage 1
use_stage2 = auc_stage2 >= auc_stage1
final_test_proba = test_pred_stage2 if use_stage2 else test_pred_stage1

log(f"Using {'Stage 2 meta' if use_stage2 else 'Stage 1 base'} predictions for submission.")

sub = pd.DataFrame()
if id_col is not None:
    sub[id_col] = test_df[id_col]
else:
    sub["id"] = np.arange(len(test_df))

sub[target_col] = final_test_proba

sub_path = Path("loan_meta_optimized_submission.csv")
sub.to_csv(sub_path, index=False)
log(f"Saved submission to: {sub_path.resolve()}")


[22:40:33] Using Stage 1 base predictions for submission.
[22:40:33] Saved submission to: /Users/lionelweng/Downloads/s5e11-Predicting-Loan-Payback/loan_meta_optimized_submission.csv



## 8) Quick domain insights

Some intuitive risk directions that help when *interpreting* feature importances
or partial dependence plots (the model is not given these rules and has to learn any such patterns from the data):

- **Debt‑to‑Income Ratio** — higher ratio ⇒ typically **riskier** (harder to take on new debt).
- **Credit Score** — lower score ⇒ **riskier**.
- **Interest Rate** — higher interest ⇒ higher repayment burden ⇒ **riskier**.
- **Annual Income** — lower income ⇒ **riskier**.
- **Employment Status** — unemployed borrowers are **riskier** on average.
- **Loan Grade / Subgrade** — poorer grades (e.g. *E, F, G*) encode higher credit risk.

This notebook does not hard‑code these rules; instead, the boosted trees can learn
non‑linear interactions between all of the above and more.
You can apply SHAP or feature-importance plots to the trained models to
check whether the learned behaviour matches these expectations.
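
As a starting point for such an inspection, note that the `DMatrix` objects in this notebook are built without feature names, so XGBoost importance scores are keyed as `f0`, `f1`, and so on, and features never used in a split are omitted. The sketch below maps those keys back to readable names. It is an illustration under assumptions: `name_importances` is a hypothetical helper, the scores are made-up values of the kind returned by `Booster.get_score(importance_type="gain")`, and the column names are invented examples of what the fitted preprocessor's `get_feature_names_out()` might return.

```python
def name_importances(scores, feature_names):
    """Map XGBoost-style keys ('f0', 'f12', ...) to names, sorted by score."""
    named = {}
    for key, value in scores.items():
        index = int(key[1:])  # strip the leading 'f'
        named[feature_names[index]] = value
    return sorted(named.items(), key=lambda item: item[1], reverse=True)

# Hypothetical values for illustration only.
toy_scores = {"f0": 12.5, "f2": 40.0}
toy_names = ["num__annual_income", "num__credit_score", "cat__grade_A"]

ranked = name_importances(toy_scores, toy_names)
for name, gain in ranked:
    print(f"{name:<22} {gain:.1f}")
```

Comparing the top-ranked names with the risk directions listed above is a quick way to spot features the model relies on unexpectedly.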
