# TRUS.AI Loan Model Prep

This notebook trains a lightweight loan approval classifier on the Kaggle dataset and exports artifacts for the Node.js backend.

## Workflow Outline

1. Load raw Kaggle data (CSV files) from `../data/raw/`.
2. Perform cleaning, feature engineering, and train/validation split.
3. Fit a baseline logistic regression classifier (scikit-learn).
4. Compute SHAP values for validation set examples.
5. Export model coefficients, encoders, and SHAP summaries to `../data/artifacts/`.
6. Generate seed documents for MongoDB collections (`customers`, `loan_applications`).

> **Note:** Run this notebook in a Python environment with scikit-learn, pandas, numpy, and shap installed.
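
Before running the cells below, it can help to confirm that the packages named in the note are importable. This is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper, and `sklearn` is the import name of scikit-learn.

```python
import importlib.util

# Import names for the packages listed in the note above
# (scikit-learn is imported as `sklearn`).
REQUIRED = ["sklearn", "pandas", "numpy", "shap"]


def missing_packages(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```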



In [None]:
import pathlib
from datetime import datetime

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

import shap

RAW_DATA_DIR = pathlib.Path("../data/raw")
ARTIFACT_DIR = pathlib.Path("../data/artifacts")
ARTIFACT_DIR.mkdir(parents=True, exist_ok=True)

print("Raw data dir exists:", RAW_DATA_DIR.exists())
print("Artifacts dir:", ARTIFACT_DIR.resolve())


In [None]:
# TODO: Replace filenames with actual Kaggle CSV paths once downloaded.
train_path = RAW_DATA_DIR / "train_u6lujuX_CVtuZ9i.csv"

df = pd.read_csv(train_path)
print("Dataset shape:", df.shape)
df.head()
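
The hard-coded filename above is a placeholder until the download is done. One way to fail early with a readable message is sketched below with the standard library only; `find_raw_csvs` is a hypothetical helper, not part of the notebook.

```python
import pathlib


def find_raw_csvs(raw_dir):
    """Return sorted CSV paths under `raw_dir`, or raise if none are present."""
    raw_dir = pathlib.Path(raw_dir)
    csv_paths = sorted(raw_dir.glob("*.csv"))
    if not csv_paths:
        raise FileNotFoundError(
            f"No CSV files found in {raw_dir}; download the Kaggle dataset first."
        )
    return csv_paths
```

Calling `find_raw_csvs(RAW_DATA_DIR)` lists the candidates. Picking the first entry blindly is only safe if the directory holds a single training CSV, since a test file may sort ahead of the training file.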


In [None]:
target_column = "Loan_Status"
feature_columns = [col for col in df.columns if col not in {target_column, "Loan_ID"}]

X = df[feature_columns]
y = (df[target_column] == "Y").astype(int)

# Treat every numeric dtype as numeric, not only int64/float64.
numeric_features = X.select_dtypes(include="number").columns.tolist()
categorical_features = [col for col in feature_columns if col not in numeric_features]

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median"))
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000))
])

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model.fit(X_train, y_train)
print(classification_report(y_val, model.predict(X_val)))


In [None]:
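# SHAP value computation placeholder.
# To keep the notebook light for the hackathon, sample at most 200 records.
# Example (sketch, not yet run against this dataset; sample sizes are assumptions):
# background_sample = X_train.sample(n=min(100, len(X_train)), random_state=42)
# data_sample = X_val.sample(n=min(200, len(X_val)), random_state=42)
# explainer = shap.KernelExplainer(model.predict_proba, background_sample)
# shap_values = explainer.shap_values(data_sample)
# Caveat: raw string or NaN columns may need encoding before KernelExplainer can use them.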
#
# TODO:
# 1. Derive per-example top feature impacts (positive/negative).
# 2. Persist aggregated SHAP outputs to JSON for backend explanations.
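
For TODO item 1, a linear model admits a closed form: if features are treated as independent, the SHAP value of feature `i` in log-odds space is `coef[i] * (x[i] - mean[i])`, where `mean` is the background average. The sketch below uses NumPy only, with made-up coefficients and feature names (all hypothetical). It is a stand-in for the `shap` library, not a call into it.

```python
import numpy as np


def linear_shap(coef, background, X):
    """Closed-form SHAP values (log-odds) for a linear model, assuming independent features."""
    coef = np.asarray(coef, dtype=float)
    baseline = np.asarray(background, dtype=float).mean(axis=0)
    return (np.asarray(X, dtype=float) - baseline) * coef


def top_impacts(shap_row, feature_names, k=2):
    """Return up to k most positive and k most negative contributions for one example."""
    order = np.argsort(shap_row)  # ascending: most negative first
    negative = [(feature_names[i], float(shap_row[i])) for i in order[:k] if shap_row[i] < 0]
    positive = [(feature_names[i], float(shap_row[i])) for i in order[::-1][:k] if shap_row[i] > 0]
    return {"positive": positive, "negative": negative}


# Hypothetical encoded feature names and coefficients, for illustration only.
names = ["income", "loan_amount", "credit_history"]
coef = [0.5, -0.8, 3.0]
background = np.array([[1.0, 1.0, 1.0], [3.0, 1.0, 0.0]])  # column means: 2.0, 1.0, 0.5
X = np.array([[4.0, 2.0, 1.0]])

values = linear_shap(coef, background, X)
summary = top_impacts(values[0], names)
print(summary)
```

The per-example values sum to the difference between the example's log-odds and the log-odds at the background mean, which gives a quick sanity check before persisting anything to JSON.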



In [None]:
# Artifact export placeholders
import json
from datetime import timezone

artifacts = {
    # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
    "trained_at": datetime.now(timezone.utc).isoformat(),
    "model_type": "logistic_regression",
    "feature_columns": feature_columns,
    "numeric_features": numeric_features,
    "categorical_features": categorical_features,
    # TODO: add coefficient matrix, intercept, encoder metadata
}

artifacts_path = ARTIFACT_DIR / "model_metadata.json"
artifacts_path.write_text(json.dumps(artifacts, indent=2))
print("Wrote artifact metadata to", artifacts_path)



In [None]:
# TODO: Generate seed records for MongoDB collections.
# Example structure (needs `import json` at the top of the cell):
# seed = {
#     "customers": [...],
#     "loan_applications": [...],
#     "shap_explanations": [...]
# }
# with open(ARTIFACT_DIR / "seed_data.json", "w") as fp:
#     json.dump(seed, fp, indent=2)
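
A minimal sketch of the seed structure outlined above, using only the standard library. The field names (`customer_id`, `name`, `amount`, `status`) and the `build_seed` helper are hypothetical; the real schema should follow whatever the Node.js backend's collections expect.

```python
import json


def build_seed(rows):
    """Split flat application rows into `customers` and `loan_applications` documents."""
    customers = {}
    applications = []
    for row in rows:
        customer_id = row["customer_id"]
        # Keep one customer document per distinct id.
        customers.setdefault(customer_id, {"_id": customer_id, "name": row["name"]})
        applications.append({
            "customer_id": customer_id,
            "amount": row["amount"],
            "status": row["status"],
        })
    return {
        "customers": list(customers.values()),
        "loan_applications": applications,
        "shap_explanations": [],  # left empty until SHAP outputs exist
    }


# Made-up demo rows, not taken from the dataset.
demo_rows = [
    {"customer_id": "C001", "name": "Demo User A", "amount": 120, "status": "approved"},
    {"customer_id": "C001", "name": "Demo User A", "amount": 80, "status": "rejected"},
    {"customer_id": "C002", "name": "Demo User B", "amount": 200, "status": "approved"},
]
seed = build_seed(demo_rows)
seed_json = json.dumps(seed, indent=2)
print(seed_json)
```

Writing `seed_json` to `ARTIFACT_DIR / "seed_data.json"` would then match the commented example in the cell above.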



## Next Actions

- [ ] Download the Kaggle dataset into `data/raw/` (`../data/raw/` relative to this notebook).
- [ ] Finalize preprocessing choices and ensure categorical handling matches backend expectations.
- [ ] Export trained model coefficients (or pickled model) and serialization helpers.
- [ ] Produce sample SHAP explanations and narrative templates.
- [ ] Generate MongoDB seed JSON for demo users and loan applications.

