# Loan Default Prediction — Reproducible Project

This project contains:
- EDA, data cleaning, feature engineering
- Classical ML: XGBoost
- Deep Learning: Keras ANN and PyTorch MLP
- Scripts / Jupyter Notebook cells and saved artifacts

## File list
- `Loan_Project_Consolidated.ipynb` — consolidated notebook (or paste cells into a new notebook)
- `preprocessor.joblib` — saved preprocessing pipeline (created after running notebook)
- `xgb_model.json`, `keras_ann.h5`, `pytorch_mlp.pt` — saved models (created after running)
- `requirements.txt` — Python package list
- `FINAL_REPORT.pdf` — concise report (2-3 pages) summarizing results (you can create from provided text)


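## How to run
Install the packages listed in `requirements.txt`, open `Loan_Project_Consolidated.ipynb`, and run cells in order.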

## Reproducibility notes
- The random seed is fixed as `RANDOM_SEED = 42` for NumPy, TensorFlow, and PyTorch. Results may still vary slightly with hardware (GPU/CPU) and package versions.
- Preprocessing pipeline saved to `preprocessor.joblib`. Use the same preprocessor at inference.
- For offline-RL evaluation, the notebook uses a naive policy-value estimator (average realized reward). For robust evaluation, use IPS/DR estimators.
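The gap between the naive estimator and IPS can be sketched on synthetic logged data. This is an illustration only, not the notebook's code: `naive_value`, `ips_value`, and the synthetic logging policy are assumptions, and real loan data rarely records the historical approval probabilities, so in practice they would have to be estimated.

```python
import numpy as np

def naive_value(rewards, logged_actions, target_actions):
    """Average realized reward over rows where the target policy agrees with the log."""
    match = logged_actions == target_actions
    return float(rewards[match].mean()) if match.any() else 0.0

def ips_value(rewards, logged_actions, target_actions, propensities):
    """Inverse propensity scoring: reweight matching rows by 1 / P(logged action | context)."""
    match = (logged_actions == target_actions).astype(float)
    return float(np.mean(match * rewards / propensities))

rng = np.random.default_rng(42)
n = 10_000
x = rng.uniform(size=n)                        # one synthetic context feature
p_approve = 0.2 + 0.6 * x                      # hypothetical logging policy
logged = (rng.uniform(size=n) < p_approve).astype(int)
propensity = np.where(logged == 1, p_approve, 1 - p_approve)
# Synthetic reward: approving pays 100*x - 30, denying pays 0
rewards = np.where(logged == 1, 100 * x - 30, 0.0)

target = np.ones(n, dtype=int)                 # evaluate the "always approve" policy
print("naive:", naive_value(rewards, logged, target))
print("IPS:  ", ips_value(rewards, logged, target, propensity))
print("true: ", float(np.mean(100 * x - 30)))
```

Because the synthetic logging policy approved high-`x` applicants more often, the naive average overstates the value of approving everyone, while IPS corrects for that selection.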

## What to inspect
- XGBoost metrics and feature importance.
- ANN training curves and test AUC.



## Notes / Next steps
- Consider temporal splits (train on older loans, test on newer loans) to avoid data leakage.
- Consider using `d3rlpy` for a formal offline RL algorithm (CQL/AWAC) and proper off-policy evaluation tools.
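A temporal split could look roughly like the sketch below. The helper name `temporal_split` and the cutoff date are assumptions chosen for illustration; `issue_d` is the issue-date column the notebook parses to datetime.

```python
import pandas as pd

def temporal_split(df, date_col="issue_d", cutoff="2017-01-01"):
    """Train on loans issued before `cutoff`, test on loans issued on or after it.

    Rows with a missing date fall into neither set.
    """
    dates = pd.to_datetime(df[date_col])
    cutoff = pd.Timestamp(cutoff)
    return df[dates < cutoff].copy(), df[dates >= cutoff].copy()

# Tiny synthetic frame standing in for the cleaned loan data
demo = pd.DataFrame({
    "issue_d": pd.to_datetime(["2015-03-01", "2016-07-01", "2017-02-01", "2018-05-01"]),
    "target_default": [0, 1, 0, 1],
})
train_df, test_df = temporal_split(demo)
print(len(train_df), len(test_df))
```

With this kind of split the preprocessor would be fit on `train_df` only and then applied to `test_df`.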


## Requirements.txt

In [4]:
# numpy>=1.21
# pandas>=1.3
# scikit-learn>=1.0
# matplotlib>=3.4
# seaborn>=0.11
# xgboost>=1.5
# tensorflow>=2.9
# torch>=1.12
# joblib>=1.1
# nbformat>=5.0

Loan Default Prediction — Final Report
=====================================

Author: Pranay Kumbhare
Date: 16 September 2025

1. Project summary
------------------
This project analyzes a loan dataset to predict loan default (binary: Fully Paid = 0, Default/Charged Off = 1) and to learn an offline policy (approve/deny) using a contextual-bandit style reward model. We compare a classical ML model (XGBoost) and two deep learning approaches (Keras ANN and PyTorch MLP). 

2. Data & preprocessing
-----------------------
Dataset: (user-provided CSV)

Key preprocessing steps:
- Normalized column names (lowercase, underscores).
- Parsed common fields: `int_rate` normalized from '13.56%' to 0.1356; `term` parsed to numeric months; `emp_length` parsed to years; date fields parsed to datetime; derived `credit_length_years`.
- Mapped `loan_status` to `target_default` using domain rules: 'Fully Paid' -> 0; 'Charged Off', 'Default', and late statuses containing '120' (e.g., 'Late (31-120 days)') -> 1. Rows without a clearly defined outcome (e.g., current loans) were dropped to avoid label noise.
- Dropped columns with >75% missingness (adjustable threshold) to reduce noise.
- Built a reproducible `preprocessor` pipeline (ColumnTransformer) that:
  - median-imputes numeric data, winsorizes (1–99 percentile), and standard-scales.
  - imputes categorical variables with 'missing' and one-hot encodes them.
- Produced stratified train/test split to preserve class balance.

Rationale:
The pipeline is deliberately simple. Because it is saved as a single object, the same imputation, scaling, and encoding steps are applied at test and inference time. Winsorizing limits the influence of extreme outliers; the imputation choices are simple but robust. A temporal split is recommended before deployment; a stratified random split was used for model development.
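One caveat on winsorizing: the notebook's `winsorize_array` is wrapped in a `FunctionTransformer`, so it recomputes the 1st and 99th percentiles on whatever array it receives, including a single row at inference. A stateful alternative that learns the bounds during `fit` might look like the sketch below. The `Winsorizer` class is hypothetical and is not part of the notebook.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class Winsorizer(BaseEstimator, TransformerMixin):
    """Clip each column to percentile bounds learned during fit."""

    def __init__(self, lower_pct=1.0, upper_pct=99.0):
        self.lower_pct = lower_pct
        self.upper_pct = upper_pct

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Bounds are stored on the estimator, so they persist through joblib
        self.lower_ = np.nanpercentile(X, self.lower_pct, axis=0)
        self.upper_ = np.nanpercentile(X, self.upper_pct, axis=0)
        return self

    def transform(self, X):
        return np.clip(np.asarray(X, dtype=float), self.lower_, self.upper_)

rng = np.random.default_rng(42)
X_train_demo = rng.normal(size=(1000, 2))      # synthetic training data
X_new = np.array([[50.0, -50.0]])              # extreme row seen only at inference
w = Winsorizer().fit(X_train_demo)
print(w.transform(X_new))                      # clipped to the training bounds
```

Such a class could replace the `('winsor', ...)` step in the numeric pipeline, at the cost of having to make the class importable wherever `preprocessor.joblib` is loaded.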

3. Models & training
--------------------
a) XGBoost (classical baseline)
- Hyperparameters: n_estimators=300, lr=0.05, max_depth=6, subsample=0.8, colsample_bytree=0.8.
- Training on preprocessed features.
- Metrics reported: ROC AUC, F1, precision/recall, and confusion matrix.
- Feature importance extracted (if preprocessor supports feature names).

b) Keras ANN (deep learning)
- Architecture: Dense(128)->Dropout->Dense(64)->Dropout->Dense(1, sigmoid)
- Loss: binary_crossentropy; optimizer: Adam; metrics: accuracy & AUC.
- Trained for 15 epochs with a 15% validation split (`validation_split=0.15`) and batch size 256.

c) PyTorch MLP (alternate deep model)
- Architecture mirrors the Keras model (128 -> 64 -> 1 with dropout); trained for 8 epochs with BCEWithLogitsLoss and Adam (lr=1e-3).

4. Results (example)
--------------------
(Replace with your dataset-specific numbers after running the notebook.)

- XGBoost: AUC = 0.78, F1 = 0.45
- Keras ANN: Test AUC = 0.76, Test accuracy = 0.88
- PyTorch MLP: AUC = 0.75
- RL policy estimated value (avg reward per applicant): e.g., 120.5 (compare with always-approve and always-deny baselines)

Interpretation:
- AUC/F1 for classifiers capture ranking ability and balance of precision/recall. AUC is threshold-agnostic and useful for model selection; F1 summarizes performance at the chosen threshold and is sensitive to class imbalance.
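The contrast between the two metrics can be seen in a small synthetic example: AUC is computed once from the scores, while F1 changes with the decision threshold. The data and thresholds below are made up for illustration and are unrelated to the loan dataset.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score

rng = np.random.default_rng(42)
n = 5000
y_true = (rng.uniform(size=n) < 0.15).astype(int)      # roughly 15% positives (defaults)
# Synthetic scores: positives score higher on average
scores = np.clip(rng.normal(0.25 + 0.25 * y_true, 0.15), 0, 1)

auc = roc_auc_score(y_true, scores)                    # threshold-free ranking quality
f1_by_threshold = {
    t: f1_score(y_true, (scores >= t).astype(int))     # depends on the chosen cut-off
    for t in (0.3, 0.4, 0.5)
}

print(f"AUC: {auc:.3f}")
for t, f1 in f1_by_threshold.items():
    print(f"threshold={t:.1f}  F1={f1:.3f}")
```

On imbalanced data like this, reporting F1 without stating the threshold makes numbers hard to compare across models.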


5. Comparison and disagreement analysis
--------------------------------------
- The DL classifier outputs a default-probability score; a business rule (threshold) converts it into an approve/deny decision.
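A minimal sketch of such a rule, and of measuring how often two models disagree under it, is shown below. The function name `approve_decisions`, the 0.2 threshold, and the probability arrays are hypothetical.

```python
import numpy as np

def approve_decisions(default_proba, threshold=0.2):
    """Return 1 (approve) where predicted default probability is below the threshold, else 0 (deny)."""
    return (np.asarray(default_proba) < threshold).astype(int)

# Made-up default probabilities from two models for the same four applicants
model_a = approve_decisions([0.05, 0.19, 0.20, 0.75])
model_b = approve_decisions([0.10, 0.25, 0.15, 0.80])

# Share of applicants on which the two decision rules differ
disagreement_rate = float(np.mean(model_a != model_b))
print(model_a, model_b, disagreement_rate)
```

Inspecting the rows where the decisions differ is usually more informative than the rate alone.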


6. Limitations & future steps
-----------------------------
- The off-policy evaluation used here is naive (direct averaging). For trustworthy deployment, evaluate policies with IPS or doubly robust estimators, or run a randomized A/B test.
- Reward engineering here is simplified: it ignores time value of money, recovery rates, and collections costs. A better reward should discount future payments, include expected recovery fractions, and account for operational costs.
- The dataset may have leakage if features include post-funding events; we printed potential leakage columns for review. For production, remove any feature that wouldn't be available at decision time.
- Consider temporal splitting (train on loans issued before year N, test on later loans) to simulate live deployment.
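As a rough sketch of the richer reward described above, the function below discounts the payment stream, applies a recovery fraction on default, and subtracts a fixed servicing cost. Every parameter value (`annual_discount`, `recovery_frac`, `servicing_cost`) and the function name `approval_reward` are assumptions for illustration; this is not the reward used in the notebook.

```python
def approval_reward(loan_amnt, int_rate, term_months, defaulted,
                    annual_discount=0.05, recovery_frac=0.10, servicing_cost=50.0):
    """Approximate net present value of approving a loan (denying is assumed to yield 0).

    int_rate is an annual rate expressed as a fraction (0.12 means 12%).
    """
    if defaulted:
        # Simplification: the principal is lost and a fixed fraction is recovered at once
        return -loan_amnt * (1.0 - recovery_frac) - servicing_cost

    r = int_rate / 12.0                               # monthly interest rate
    d = (1.0 + annual_discount) ** (1 / 12) - 1.0     # monthly discount rate
    # Standard amortized monthly payment (falls back to straight-line if r == 0)
    if r > 0:
        payment = loan_amnt * r / (1.0 - (1.0 + r) ** -term_months)
    else:
        payment = loan_amnt / term_months
    # Present value of all scheduled payments
    pv = sum(payment / (1.0 + d) ** t for t in range(1, term_months + 1))
    return pv - loan_amnt - servicing_cost

print(approval_reward(10_000, 0.12, 36, defaulted=False))
print(approval_reward(10_000, 0.12, 36, defaulted=True))
```

A more realistic version would use the actual payment history and default timing rather than an all-or-nothing outcome.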


7. Reproducibility
------------------
- Preprocessing pipeline saved as `preprocessor.joblib`.
- Models saved as `xgb_model.json`, `keras_ann.h5`, and `pytorch_mlp.pt`.
- Use `requirements.txt` to recreate environment.
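Reusing the saved preprocessor at inference amounts to loading it and calling `transform` only. The sketch below uses a small stand-in pipeline and a temporary directory so that it runs on its own; the data and pipeline are not the notebook's. Note that a pipeline containing a `FunctionTransformer` around a notebook-defined function is pickled by reference, so that function has to be defined or importable in the process that loads the file.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the notebook's fitted preprocessor
X_train_demo = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
prep = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
]).fit(X_train_demo)

path = os.path.join(tempfile.mkdtemp(), "preprocessor.joblib")
joblib.dump(prep, path)

# At inference: load the fitted pipeline and call transform only, never fit again
loaded = joblib.load(path)
X_new = np.array([[2.0, 20.0]])
print(loaded.transform(X_new))
```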

Appendix: Code pointers
- See `Loan_Project_Consolidated.ipynb` cells for exact code used for preprocessing, modeling, policy derivation, and analysis.


In [6]:
# Cell 1: Title + imports
# Loan Default Prediction: EDA, ML, DL & Offline RL (contextual-bandit)
import os, sys, math, re, json
import numpy as np, pandas as pd
import matplotlib.pyplot as plt, seaborn as sns
sns.set(style='whitegrid')

# ML & preprocessing
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, f1_score, classification_report, confusion_matrix

# XGBoost (classical ML)
from xgboost import XGBClassifier

# Keras (DL)
import tensorflow as tf
from tensorflow import keras

# PyTorch (alternate DL)
import torch, torch.nn as nn, torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
import sklearn


# Reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)



In [7]:
df = pd.read_csv("accepted_2007_to_2018Q4.csv", low_memory=False)


In [None]:
print("Columns:", df.columns.tolist()[:50])
print(df.dtypes.value_counts())
print("\nTop missing fractions:")
print(df.isnull().mean().sort_values(ascending=False).head(30))


In [None]:
if 'loan_status' in df.columns:
    print("\nloan_status values sample:")
    print(df['loan_status'].value_counts().head(30))

In [None]:
num_candidates = ['loan_amnt','int_rate','annual_inc','dti']  # adjust to your dataset columns
num_present = [c for c in num_candidates if c in df.columns]

In [None]:
plt.figure(figsize=(4*len(num_present),4))
for i,c in enumerate(num_present):
    plt.subplot(1,len(num_present),i+1)
    sns.histplot(df[c].dropna(), kde=False, bins=40)
    plt.title(c)
plt.tight_layout()

In [None]:
cat_candidates = ['home_ownership','verification_status','purpose']
cat_present = [c for c in cat_candidates if c in df.columns]

In [None]:
for c in cat_present:
    plt.figure(figsize=(8,4))
    sns.countplot(y=c, data=df, order=df[c].value_counts().index)
    plt.title(c)
    plt.show()

In [None]:
import joblib

In [None]:
df.columns = df.columns.astype(str).str.strip().str.lower().str.replace(' ', '_').str.replace('-', '_')

In [None]:
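def parse_pct(x):
    """Convert '13.56%' to 0.1356; plain numbers pass through; unparseable values become NaN."""
    if pd.isna(x):
        return np.nan
    s = str(x).strip()
    try:
        # Only strings ending in '%' are divided by 100
        return float(s.rstrip('%')) / 100.0 if s.endswith('%') else float(s)
    except ValueError:
        return np.nan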


In [None]:
def emp_len_to_int(x):
    if pd.isna(x): return np.nan
    s = str(x).lower().strip()
    if s=='n/a': return np.nan
    if '<' in s: return 0.0
    if '10+' in s: return 10.0
    m = re.search(r'(\d+)', s)
    return float(m.group(1)) if m else np.nan

In [None]:
if 'int_rate' in df.columns: df['int_rate'] = df['int_rate'].apply(parse_pct)
if 'term' in df.columns: df['term'] = df['term'].astype(str).str.extract(r'(\d+)').astype(float)
if 'emp_length' in df.columns: df['emp_length'] = df['emp_length'].apply(emp_len_to_int)
for c in ['issue_d','earliest_cr_line','last_pymnt_d','last_credit_pull_d']:
    if c in df.columns:
        # infer_datetime_format is deprecated in pandas 2.x; format inference is now the default
        df[c] = pd.to_datetime(df[c], errors='coerce')
if 'issue_d' in df.columns and 'earliest_cr_line' in df.columns:
    df['credit_length_years'] = (df['issue_d'] - df['earliest_cr_line']).dt.days / 365.25


In [None]:
def map_target(status):
    s = str(status).lower()
    if 'fully paid' in s: return 0
    if 'charged off' in s or 'default' in s: return 1
    if 'late' in s and '120' in s: return 1
    return np.nan

In [None]:
if 'loan_status' not in df.columns:
    raise ValueError("Expected column 'loan_status' but not found.")

In [None]:
df['target_default'] = df['loan_status'].apply(map_target)
print("target counts (with nulls):", df['target_default'].value_counts(dropna=False))

In [None]:
df = df[df['target_default'].notna()].copy()
df['target_default'] = df['target_default'].astype(int)

In [None]:
miss_frac = df.isnull().mean()
sparse_cols = miss_frac[miss_frac>0.75].index.tolist()

In [None]:
print("Dropping", len(sparse_cols), "sparse columns")
df.drop(columns=sparse_cols, inplace=True, errors='ignore')

In [None]:
exclude = set(['loan_status','target_default'])
num_cols = df.select_dtypes(include=[np.number]).columns.difference(exclude).tolist()
cat_cols = df.select_dtypes(include=['object','category']).columns.difference(exclude).tolist()

In [None]:
for c in ['id','member_id','url','zip_code','addr_state']: 
    if c in num_cols: num_cols.remove(c)
    if c in cat_cols: cat_cols.remove(c)


In [None]:
print("num_cols:", len(num_cols), "cat_cols:", len(cat_cols))

In [None]:
print("Rows:", df.shape[0])
print("Num cols:", len(num_cols))
print("Cat cols:", len(cat_cols))
for c in cat_cols:
    print(c, df[c].nunique())


In [None]:
high_card_cols = [c for c in cat_cols if df[c].nunique() > 50]  # threshold can be tuned
print("Dropping high-cardinality cols:", high_card_cols)

cat_cols = [c for c in cat_cols if c not in high_card_cols]


In [None]:
print("df columns:", df.columns.tolist())
print("num_cols:", num_cols)
print("cat_cols:", cat_cols)
missing = [c for c in (num_cols + cat_cols) if c not in df.columns]
print("Missing from df:", missing)


In [None]:
num_cols = [c for c in num_cols if c in df.columns]
cat_cols = [c for c in cat_cols if c in df.columns]


In [None]:
# Winsorization helper
def winsorize_array(X):
    X = np.array(X, dtype=float)
    lower = np.nanpercentile(X, 1, axis=0)
    upper = np.nanpercentile(X, 99, axis=0)
    return np.minimum(np.maximum(X, lower), upper)

# Numeric pipeline
num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('winsor', FunctionTransformer(winsorize_array, validate=False)),
    ('scaler', StandardScaler())
])

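# Handle sklearn version: OneHotEncoder's `sparse` argument was renamed to `sparse_output` in 1.2.
# Compare parsed integers; a plain string comparison would misorder e.g. "1.10" and "1.2".
ohe_params = {'handle_unknown': 'ignore'}
sk_version = tuple(int(p) for p in re.match(r'(\d+)\.(\d+)', sklearn.__version__).groups())
if sk_version >= (1, 2):
    ohe_params['sparse_output'] = True
else:
    ohe_params['sparse'] = True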

# Categorical pipeline
cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(**ohe_params))
])

# Column transformer
preprocessor = ColumnTransformer([
    ('num', num_transformer, num_cols),
    ('cat', cat_transformer, cat_cols)
], remainder='drop', verbose_feature_names_out=False)

In [None]:
# Split first, then fit the preprocessor on the training rows only so that
# test-set statistics (medians, scaling, categories) do not leak into training.
y_all = df['target_default'].values
df_train, df_test, y_train, y_test = train_test_split(
    df, y_all, test_size=0.2, random_state=RANDOM_SEED, stratify=y_all
)
X_train = preprocessor.fit_transform(df_train)
X_test = preprocessor.transform(df_test)

print(type(X_train))
print("Preprocessing done. Train/Test shapes:", X_train.shape, X_test.shape)

In [None]:
joblib.dump(preprocessor, 'preprocessor.joblib')


In [None]:
xgb = XGBClassifier(
    n_estimators=300, learning_rate=0.05, max_depth=6,
    subsample=0.8, colsample_bytree=0.8, random_state=RANDOM_SEED,
    use_label_encoder=False, eval_metric='logloss'
)

In [None]:
xgb.fit(X_train, y_train)
y_proba_xgb = xgb.predict_proba(X_test)[:,1]
y_pred_xgb = (y_proba_xgb >= 0.5).astype(int)
print("XGBoost AUC:", roc_auc_score(y_test, y_proba_xgb))
print(classification_report(y_test, y_pred_xgb))

In [None]:
try:
    feat_names = preprocessor.get_feature_names_out()
    fi = pd.Series(xgb.feature_importances_, index=feat_names).sort_values(ascending=False).head(20)
    print("Top features:\n", fi)
    plt.figure(figsize=(8,6)); sns.barplot(x=fi.values, y=fi.index); plt.title("XGBoost feature importance (top20)"); plt.show()
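except Exception as exc:
    # Feature names are unavailable if any pipeline step lacks get_feature_names_out
    print("Skipping feature importance:", exc)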

In [None]:
input_dim = X_train.shape[1]
ann = keras.Sequential([
    keras.layers.Input(shape=(input_dim,)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(1, activation='sigmoid')
])

In [None]:
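ann.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy', keras.metrics.AUC(name='auc')])
# validation_split expects dense arrays; densify if the preprocessor returned a sparse matrix
# (this can need a lot of memory on the full dataset)
X_train_dense = X_train.toarray() if hasattr(X_train, 'toarray') else np.asarray(X_train)
X_test_dense = X_test.toarray() if hasattr(X_test, 'toarray') else np.asarray(X_test)
history = ann.fit(X_train_dense, y_train, validation_split=0.15, epochs=15, batch_size=256, verbose=1)
ann_eval = ann.evaluate(X_test_dense, y_test, verbose=0)
print("ANN test acc, auc:", ann_eval[1], ann_eval[2])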

In [None]:
plt.figure(figsize=(10,4))
plt.subplot(1,2,1); plt.plot(history.history['loss'], label='train_loss'); plt.plot(history.history['val_loss'], label='val_loss'); plt.legend(); plt.title('Loss')
plt.subplot(1,2,2); plt.plot(history.history['auc'], label='train_auc'); plt.plot(history.history['val_auc'], label='val_auc'); plt.legend(); plt.title('AUC')
plt.show()

In [None]:
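device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# torch.tensor needs dense input; densify if the preprocessor returned a scipy sparse matrix
def to_dense(X):
    return X.toarray() if hasattr(X, 'toarray') else np.asarray(X)

Xtr = torch.tensor(to_dense(X_train), dtype=torch.float32).to(device)
ytr = torch.tensor(y_train, dtype=torch.float32).to(device)
Xte = torch.tensor(to_dense(X_test), dtype=torch.float32).to(device)
yte = torch.tensor(y_test, dtype=torch.float32).to(device)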


In [None]:
class MLP(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim,128), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(128,64), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(64,1)
        )
    def forward(self,x): return self.net(x).squeeze(-1)

In [None]:
model = MLP(X_train.shape[1]).to(device)
loss_fn = nn.BCEWithLogitsLoss()
opt = optim.Adam(model.parameters(), lr=1e-3)
loader = DataLoader(TensorDataset(Xtr,ytr), batch_size=256, shuffle=True)

In [None]:
for epoch in range(8):
    model.train()
    total=0
    for xb,yb in loader:
        opt.zero_grad()
        logits = model(xb)
        loss = loss_fn(logits, yb)
        loss.backward()
        opt.step()
        total += loss.item()*xb.size(0)
    print(f'Epoch {epoch+1} avg loss: {total/len(Xtr):.5f}')

In [None]:
model.eval()
with torch.no_grad():
    # torch.sigmoid avoids the overflow warnings np.exp can raise on large-magnitude logits
    probs = torch.sigmoid(model(Xte)).cpu().numpy()
print('PyTorch MLP AUC:', roc_auc_score(y_test, probs))

In [None]:
import joblib
joblib.dump(preprocessor, 'preprocessor.joblib')
xgb.save_model('xgb_model.json')
ann.save('keras_ann.h5')
torch.save(model.state_dict(), 'pytorch_mlp.pt')
