# Financial Offer Propensity — From Exploration to Production

This notebook simulates the **exploratory phase** of the project: data analysis, model training, and evaluation. The rest of the codebase (pipelines, serving, Streamlit app, Docker, monitoring) was then created to **productionize** this prototype.

**Flow:** Load UCI Bank Marketing data → EDA → Feature engineering → Train models → Evaluate → (Production pipeline mirrors this in `src/pipelines/`).

## 1. Setup and load data

We use the UCI Bank Marketing dataset. Run `make data` from the repo root first; otherwise, edit `RAW_PATH` below to point to the CSV.

In [None]:
import os
import sys
from pathlib import Path

# Resolve the repo root (also when launched from notebooks/) so relative paths and imports work
ROOT = Path.cwd().resolve()
if ROOT.name == "notebooks":
    ROOT = ROOT.parent
os.chdir(ROOT)
sys.path.insert(0, str(ROOT))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

RAW_PATH = ROOT / "data" / "raw" / "bank-additional-full.csv"
if not RAW_PATH.exists():
    raise FileNotFoundError(f"Data not found at {RAW_PATH}. Run 'make data' from repo root.")

df = pd.read_csv(RAW_PATH, sep=";", encoding="utf-8")
print(f"Shape: {df.shape}")
df.head()
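
The raw file is semicolon-delimited, which is why `sep=";"` is passed above: with pandas' default comma separator every row collapses into a single column. The sketch below illustrates this on a two-row in-memory sample and adds a small column check. `REQUIRED_COLUMNS` and `check_columns` are hypothetical names used for illustration only; they are not taken from `src/pipelines/ingest.py`, whose validation may look quite different.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the raw CSV (values are made up for illustration)
SAMPLE = "age;job;y\n30;admin.;no\n45;technician;yes\n"

# Hypothetical minimum schema; a real pipeline would likely check more
REQUIRED_COLUMNS = {"age", "job", "y"}


def check_columns(frame: pd.DataFrame, required: set) -> list:
    """Return the sorted list of required columns missing from the frame."""
    return sorted(required - set(frame.columns))


wrong = pd.read_csv(io.StringIO(SAMPLE))           # default sep="," gives one wide column
right = pd.read_csv(io.StringIO(SAMPLE), sep=";")  # matches the file's delimiter

print(wrong.shape, check_columns(wrong, REQUIRED_COLUMNS))
print(right.shape, check_columns(right, REQUIRED_COLUMNS))
```

A check like this fails fast with a readable message instead of a `KeyError` several cells later.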

## 2. Data analysis (EDA)

Understand schema, target balance, and key distributions.

In [None]:
# Schema and dtypes
print("Columns and dtypes:")
df.dtypes

In [None]:
# Target: subscription (yes/no)
target = "y"
print("Target distribution:")
print(df[target].value_counts())
print(f"\nAcceptance rate: {df[target].eq('yes').mean():.2%}")
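
With an imbalanced target, accuracy alone says little: a model that always predicts "no" scores the majority-class share while finding no subscribers at all. A minimal sketch on made-up labels, assuming a hypothetical 10% positive rate (not the dataset's measured rate) and using scikit-learn's `DummyClassifier`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 100 positives out of 1000 (hypothetical 10% acceptance rate)
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
pred = baseline.predict(X_demo)

baseline_acc = accuracy_score(y_demo, pred)
baseline_recall = recall_score(y_demo, pred, zero_division=0)
print(f"Majority-class accuracy: {baseline_acc:.2f}, recall: {baseline_recall:.2f}")
```

This is why the evaluation later leans on ROC-AUC, precision, and recall rather than accuracy alone.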

In [None]:
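# Numerical summary. "balance" and "day" come from the older bank-full.csv schema and
# may be absent from bank-additional-full.csv, so keep only the columns present.
num_cols = ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]
num_cols = [c for c in num_cols if c in df.columns]
df[num_cols].describe()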

In [None]:
# Categorical value counts (sample)
cat_cols = ["job", "marital", "education", "contact", "poutcome"]
for c in cat_cols:
    if c in df.columns:
        print(f"\n{c}:")
        print(df[c].value_counts().head(8))

In [None]:
# Visual: age and a second numeric column by target
# "balance" may be absent from bank-additional-full.csv; fall back to "campaign" in that case
second_col = "balance" if "balance" in df.columns else "campaign"
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df.boxplot(column="age", by="y", ax=axes[0])
axes[0].set_title("Age by subscription")
df.boxplot(column=second_col, by="y", ax=axes[1])
axes[1].set_title(f"{second_col.capitalize()} by subscription")
plt.suptitle("")
plt.tight_layout()
plt.show()

## 3. Feature engineering

Same choices as in production: numerical columns are standardized, and categorical columns are one-hot encoded with the first level of each column dropped.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

NUM_COLS = ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]
CAT_COLS = ["job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome"]
NUM_COLS = [c for c in NUM_COLS if c in df.columns]
CAT_COLS = [c for c in CAT_COLS if c in df.columns]

y = (df["y"].astype(str).str.lower() == "yes").astype(int).values
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), NUM_COLS),
        ("cat", OneHotEncoder(drop="first", handle_unknown="ignore"), CAT_COLS),
    ],
    remainder="drop",
)
X = df[NUM_COLS + CAT_COLS]
print(f"Input columns: {X.shape[1]}, samples: {X.shape[0]}")

In [None]:
RANDOM_STATE = 42
# Split the raw frame first so the scaler and encoder are fitted on training rows only
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=RANDOM_STATE, stratify=y
)
X_train = preprocessor.fit_transform(X_train_raw)
X_test = preprocessor.transform(X_test_raw)
feature_names = NUM_COLS + list(preprocessor.named_transformers_["cat"].get_feature_names_out(CAT_COLS))
print(f"Features: {X_train.shape[1]}, Train: {X_train.shape[0]}, Test: {X_test.shape[0]}")

## 4. Model training

We train a **Logistic Regression** (baseline) and **Gradient Boosting** (primary), same as in `src/pipelines/train.py`.
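
Training accuracy is an optimistic number, so a quick way to compare the baseline and the primary model before touching the test set is cross-validated ROC-AUC. Cross-validation is not part of this notebook or, as far as the text states, of `src/pipelines/train.py`; the sketch below is an illustrative addition that runs on synthetic data from `make_classification`, with smaller hyperparameters chosen only to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (not the bank dataset)
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=0
)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "gboost": GradientBoostingClassifier(n_estimators=50, random_state=0),
}
# Stratified folds keep the positive rate roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

cv_auc = {}
for name, est in candidates.items():
    scores = cross_val_score(est, X_syn, y_syn, cv=cv, scoring="roc_auc")
    cv_auc[name] = scores
    print(f"{name}: mean ROC-AUC {scores.mean():.3f} (+/- {scores.std():.3f})")
```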

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

lr = LogisticRegression(C=1.0, max_iter=1000, solver="lbfgs", random_state=RANDOM_STATE)
lr.fit(X_train, y_train)
# Format with an f-string: score() may return a plain Python float, which has no .round()
print(f"Logistic Regression train accuracy: {lr.score(X_train, y_train):.4f}")

gb = GradientBoostingClassifier(
    n_estimators=100, max_depth=5, learning_rate=0.1, min_samples_leaf=10, random_state=RANDOM_STATE
)
gb.fit(X_train, y_train)
print(f"Gradient Boosting train accuracy: {gb.score(X_train, y_train):.4f}")

## 5. Evaluation

ROC-AUC, precision, recall, F1, confusion matrix, and calibration (same metrics as `src/pipelines/evaluate.py`).
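
The calibration number reported in this section is an unweighted average of the per-bin gap between predicted and observed positive rates, so a sparsely populated bin counts as much as a crowded one. The sketch below contrasts that with a count-weighted average. `calibration_gaps` is a hypothetical helper written for illustration; it uses plain NumPy with equal-width bins, and its handling of values that fall exactly on a bin edge may differ from `sklearn.calibration.calibration_curve`.

```python
import numpy as np


def calibration_gaps(y_true, y_prob, n_bins=10):
    """Return (unweighted mean gap, count-weighted mean gap) over non-empty bins."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Equal-width bins on [0, 1]; a probability of exactly 1.0 goes in the last bin
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    counts = np.bincount(bin_ids, minlength=n_bins)
    nonempty = counts > 0
    sum_pred = np.bincount(bin_ids, weights=y_prob, minlength=n_bins)
    sum_true = np.bincount(bin_ids, weights=y_true, minlength=n_bins)
    mean_pred = sum_pred[nonempty] / counts[nonempty]
    frac_pos = sum_true[nonempty] / counts[nonempty]
    gaps = np.abs(frac_pos - mean_pred)
    unweighted = gaps.mean()
    weighted = np.sum(gaps * counts[nonempty] / counts.sum())
    return unweighted, weighted


# Made-up scores: 8 confident "no" predictions and 2 confident "yes" predictions
demo_prob = [0.05] * 8 + [0.95] * 2
demo_true = [0] * 8 + [1, 0]
unweighted_gap, weighted_gap = calibration_gaps(demo_true, demo_prob)
print(f"unweighted: {unweighted_gap:.2f}, weighted: {weighted_gap:.2f}")
# unweighted: 0.25, weighted: 0.13
```

In this example the two-sample bin at 0.95 dominates the unweighted figure, which is worth keeping in mind when most predicted probabilities cluster near zero.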

In [None]:
from sklearn.metrics import (
    roc_auc_score,
    average_precision_score,
    precision_recall_fscore_support,
    accuracy_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
)
from sklearn.calibration import calibration_curve

model = gb  # primary model
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("--- Test set metrics ---")
print(f"ROC-AUC:           {roc_auc_score(y_test, y_prob):.4f}")
print(f"Average precision: {average_precision_score(y_test, y_prob):.4f}")
print(f"Accuracy:          {accuracy_score(y_test, y_pred):.4f}")
prec, rec, f1, _ = precision_recall_fscore_support(y_test, y_pred, average="binary", zero_division=0)
print(f"Precision:         {prec:.4f}")
print(f"Recall:            {rec:.4f}")
print(f"F1:                {f1:.4f}")

In [None]:
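# Calibration: unweighted mean absolute gap between the observed and predicted
# positive rates across the non-empty bins
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)
cal_mae = np.abs(prob_true - prob_pred).mean()
print(f"Calibration MAE:   {cal_mae:.4f}")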

plt.figure(figsize=(5, 4))
plt.plot(prob_pred, prob_true, "s-", label="Model")
plt.plot([0, 1], [0, 1], "k--", label="Perfect")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.title("Calibration curve")
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["No", "Yes"])
disp.plot(cmap="Blues")
plt.title("Confusion matrix (test set)")
plt.tight_layout()
plt.show()

## 6. From prototype to production

This exploratory workflow was **productionized** into:

- **`src/pipelines/ingest.py`** — load and validate raw data
- **`src/pipelines/features.py`** — same preprocessor (ColumnTransformer + feature names)
- **`src/pipelines/train.py`** — same models and train/test split
- **`src/pipelines/evaluate.py`** — same metrics and calibration
- **`src/serving/predict.py`** — load saved model + preprocessor for inference
- **`src/app/streamlit_app.py`** — UI that calls the same inference for propensity and ranking

Run `make train` and `make run` (or the Docker image) to use the production pipeline and app.