# Credit Risk Project — End-to-end Notebook

This notebook documents and runs the full workflow for the credit risk project:

- Load dataset
- Exploratory Data Analysis (EDA)
- Data cleaning and outlier handling
- Model training & evaluation (LogReg, RandomForest, GradientBoosting)
- Save best model artifact and run sample predictions

Notes:
- This notebook uses project helper functions in `src/` (e.g., `data_io`, `cleaning`, `modeling`).
- Update paths or configs as needed to suit your environment.

In [None]:
# 1) Setup: imports and environment
import sys
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Project helpers
from src.data_io import read_dataset_from_path
from src.constants import DEFAULT_DATASET_PATH, TARGET_COL, ARTIFACTS_DIR
from src.cleaning import CleaningConfig, clean_dataframe, infer_column_types
from src.modeling import TrainConfig, train_and_evaluate, save_artifact, load_artifact

# Quick version check
import sklearn
import joblib
print(f"Python: {sys.version.splitlines()[0]}")
print(f"pandas: {pd.__version__}")
print(f"scikit-learn: {sklearn.__version__}")
print(f"joblib: {joblib.__version__}")

# plotting defaults
sns.set_theme(style="whitegrid")  # set_theme is the current name; sns.set is an older alias
%matplotlib inline

# Paths
DATA_PATH = DEFAULT_DATASET_PATH
ARTIFACTS_DIR = Path(ARTIFACTS_DIR)  # ensure a Path so the "/" operator works when saving the artifact
print(f"Dataset path: {DATA_PATH}")
print(f"Artifacts dir: {ARTIFACTS_DIR}")

## 2) Load dataset

In [None]:
# Load data
df = read_dataset_from_path(DATA_PATH)
print("Dataset loaded:", df.shape)
df.head()

## 3) Exploratory Data Analysis (EDA)

Inspect column types and missing values, check the target class balance, and summarize the numeric and categorical features.

In [None]:
# Basic summary
df.info()  # info() prints its report directly and returns None, so it should not be wrapped in print()

# Target distribution
print('\nTarget distribution:')
print(df[TARGET_COL].value_counts(dropna=False))

# Numeric summary
display(df.describe(include=["number"]).T)

# Categorical top values
display(df.describe(include=["object", "category"]).T)
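
The summaries above can be complemented with two numbers that are often checked before cleaning: the share of missing values per column and the relative class balance of the target. The following is a minimal sketch using only pandas; the helper name `quick_diagnostics` and the toy column names (`loan_amount`, `grade`, `default`) are hypothetical and are not part of the project's `src/` helpers.

```python
import pandas as pd


def quick_diagnostics(df, target_col):
    """Return (missing share per column, class share of the target)."""
    # Fraction of missing values in each column, worst columns first
    missing_rate = df.isna().mean().sort_values(ascending=False)
    # Relative frequency of each target class, counting missing targets too
    target_share = df[target_col].value_counts(normalize=True, dropna=False)
    return missing_rate, target_share


# Toy data with hypothetical column names, for illustration only
toy = pd.DataFrame({
    "loan_amount": [1000, 2500, None, 4000],
    "grade": ["A", "B", "B", None],
    "default": [0, 0, 1, 0],
})
missing_rate, target_share = quick_diagnostics(toy, "default")
print(missing_rate)
print(target_share)
```

In the notebook, the same call would presumably be `quick_diagnostics(df, TARGET_COL)`. A strongly skewed target share is a hint that accuracy alone may be a misleading metric in the modeling step.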

## 4) Data cleaning

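Use the project's `cleaning` helpers to impute missing values (median for numeric columns, mode for categorical ones), drop duplicate rows, and handle outliers with a z-score threshold of 3.0.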

In [None]:
# Infer numeric/categorical columns
numeric_cols, categorical_cols = infer_column_types(df, TARGET_COL)
print("Numeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)

# Configure cleaning
cfg = CleaningConfig(
    target_col=TARGET_COL,
    numeric_missing="median",
    categorical_missing="mode",
    drop_duplicates=True,
    outlier_method="zscore",  # 'none' | 'zscore' | 'mean_std'
    outlier_cols=numeric_cols,  # you can restrict to specific important numeric cols
    zscore_threshold=3.0,
    mean_std_k=3.0,
)

cleaned_df, clean_report = clean_dataframe(df, cfg)
print("Cleaning report:")
for k, v in clean_report.items():
    print(k, ":", v)

print("After cleaning shape:", cleaned_df.shape)
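
For readers unfamiliar with z-score filtering, the sketch below shows one common way the `outlier_method="zscore"` option with `zscore_threshold=3.0` could work. It is an assumption about the technique, not the actual implementation inside `src/cleaning.py`; the function name `drop_zscore_outliers` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_zscore_outliers(df, cols, threshold=3.0):
    """Drop rows whose |z-score| exceeds `threshold` in any of `cols`.

    Returns the filtered frame and the number of rows removed.
    """
    keep = pd.Series(True, index=df.index)
    for col in cols:
        values = df[col].astype(float)
        std = values.std(ddof=0)  # population standard deviation
        if not np.isfinite(std) or std == 0:
            continue  # constant or empty column: z-score is undefined
        z = (values - values.mean()) / std
        # NaN z-scores compare False here, so rows with missing values are kept
        keep &= ~(z.abs() > threshold)
    return df[keep], int((~keep).sum())


# Toy data (hypothetical columns): 20 typical incomes and one extreme value
demo = pd.DataFrame({
    "income": [50.0] * 20 + [1000.0],
    "age": list(range(30, 51)),
})
filtered, n_removed = drop_zscore_outliers(demo, ["income", "age"])
print(n_removed, filtered.shape)
```

One property worth keeping in mind: with the population standard deviation, the largest possible |z| in a sample of n rows is (n - 1) / sqrt(n), so a threshold of 3.0 can never flag anything in a sample of fewer than 11 rows.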

## 5) Modeling

Train three models on an 80/20 train/test split and compare their evaluation metrics, using the `train_and_evaluate` helper from `src/modeling.py`.

In [None]:
models = ["LogReg", "RandomForest", "GradientBoosting"]
results = {}

for m in models:
    print(f"Training: {m}")
    train_cfg = TrainConfig(target_col=TARGET_COL, test_size=0.2, random_state=42, model_name=m)
    pipe, metrics = train_and_evaluate(
        cleaned_df,
        numeric_cols=numeric_cols,
        categorical_cols=categorical_cols,
        cfg=train_cfg,
    )
    results[m] = (pipe, metrics)
    print(f" -> accuracy: {metrics['accuracy']:.4f}, f1: {metrics['f1']:.4f}, roc_auc: {metrics.get('roc_auc')}")

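# Summarize metrics per model; reindex tolerates a missing metric (e.g. roc_auc) by filling it with NaN
summary = pd.DataFrame([
    {"model": m, **r[1]}
    for m, r in results.items()
])
summary = summary.set_index("model").reindex(columns=["accuracy", "f1", "precision", "recall", "roc_auc"])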
summary
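
The internals of `train_and_evaluate` are not shown in this notebook. As a rough idea of what such a helper typically does, here is a minimal scikit-learn sketch: split the data, preprocess numeric and categorical columns separately, fit a classifier, and compute metrics on the held-out split. The function name `sketch_train_and_evaluate`, the preprocessing choices, and the toy data are all assumptions made for illustration and may differ from the project's code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def sketch_train_and_evaluate(df, numeric_cols, categorical_cols, target_col,
                              test_size=0.2, random_state=42):
    """Fit a preprocessing + logistic regression pipeline and score it on a hold-out split."""
    numeric_cols, categorical_cols = list(numeric_cols), list(categorical_cols)
    X = df[numeric_cols + categorical_cols]
    y = df[target_col]
    # Stratify so both splits keep the same class balance
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    pipe = Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(X_train, y_train)
    preds = pipe.predict(X_test)
    proba = pipe.predict_proba(X_test)[:, 1]
    metrics = {
        "accuracy": accuracy_score(y_test, preds),
        "f1": f1_score(y_test, preds),
        "roc_auc": roc_auc_score(y_test, proba),
    }
    return pipe, metrics


# Synthetic toy data (hypothetical columns), not the project's dataset
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "income": rng.normal(50, 10, n),
    "grade": rng.choice(["A", "B", "C"], n),
})
toy["default"] = (toy["income"] + rng.normal(0, 5, n) < 45).astype(int)
pipe, metrics = sketch_train_and_evaluate(toy, ["income"], ["grade"], "default")
print(metrics)
```

Keeping preprocessing inside the pipeline means the scaler and encoder are fitted on the training split only, which avoids leaking test-set statistics into the model.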

## 6) Save the best model

In [None]:
# Choose best by F1 score
best_name, (best_pipe, best_metrics) = max(results.items(), key=lambda kv: kv[1][1]["f1"])
print(f"Best model: {best_name} — f1: {best_metrics['f1']:.4f}")

artifact_path = ARTIFACTS_DIR / "credit_risk_model_notebook.joblib"
save_artifact(artifact_path, pipeline=best_pipe, metadata=best_metrics)
print(f"Saved artifact to: {artifact_path}")

# Verify load
loaded = load_artifact(artifact_path)
print("Loaded artifact keys:", list(loaded.keys()))
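
The cell above relies on `save_artifact` and `load_artifact` from `src/modeling.py`, whose bodies are not shown here. A plausible minimal version, assuming the artifact is a plain dictionary persisted with `joblib`, might look like the sketch below. The `"pipeline"` key matches how the loaded artifact is used later in this notebook; the `"metadata"` key and the `sketch_` function names are assumptions.

```python
import tempfile
from pathlib import Path

import joblib


def sketch_save_artifact(path, pipeline, metadata):
    """Persist the pipeline and its metadata together in one joblib file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the artifacts dir if needed
    joblib.dump({"pipeline": pipeline, "metadata": metadata}, path)


def sketch_load_artifact(path):
    """Load the dictionary written by sketch_save_artifact."""
    return joblib.load(path)


# Round trip in a temporary directory; the values are placeholders, not real results
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "artifacts" / "demo_model.joblib"
    sketch_save_artifact(
        demo_path,
        pipeline={"placeholder": "any picklable object"},
        metadata={"f1": 0.5},
    )
    restored = sketch_load_artifact(demo_path)

print(sorted(restored.keys()))
```

Since joblib files are pickle-based, an artifact should only be loaded from a trusted source, and ideally with the same scikit-learn version that produced it.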

## 7) Sample predictions

In [None]:
# Predict on a small sample (smoke test only: these rows come from the cleaned data and may have been seen during training)
pipeline = loaded["pipeline"]
sample_X = cleaned_df.drop(columns=[TARGET_COL]).iloc[:6]
preds = pipeline.predict(sample_X)
prob = pipeline.predict_proba(sample_X)[:, 1] if hasattr(pipeline, "predict_proba") else None

out = sample_X.copy()
out["pred"] = preds
if prob is not None:
    out["proba"] = prob

display(out)

## 8) Next steps & notes ✅

- Run hyperparameter tuning (GridSearchCV / RandomizedSearchCV) for the best model.
- Add cross-validation and confidence intervals for metrics.
- Add feature importance or SHAP explanations to inspect the most influential predictors.
- Integrate the saved artifact into the Streamlit page `pages/4_Predict.py` for deployment.
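
As a starting point for the cross-validation item in the list above, the following sketch scores a model with stratified 5-fold cross-validation and reports the mean F1 with a rough interval. It runs on synthetic data from `make_classification` rather than the project's dataset, and the interval is a simple normal approximation: fold scores are not independent, so it should be read as indicative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced binary classification data (illustration only)
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=42
)

# Stratified folds keep the class balance similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")

mean_f1 = scores.mean()
std_f1 = scores.std(ddof=1)
# Rough 95% interval from a normal approximation over the fold scores
half_width = 1.96 * std_f1 / np.sqrt(len(scores))
print(f"F1: {mean_f1:.3f} +/- {half_width:.3f} over {len(scores)} folds")
```

With the project's data, the estimator would presumably be replaced by the full preprocessing pipeline so that imputation and encoding are refitted inside each fold.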

---

This notebook provides an executable end-to-end flow using existing project helpers. Update configuration values (paths, cleaning strategy, outlier columns, model choice) to iterate further.