# BA x AI Final Project (Option A): Pre‑Call Targeting for Bank Telemarketing

This notebook implements **Steps 1–3** from `OptionA_DirectMarketing_TechnicalPlan.md`:

1. Define the **technical outputs** required for the project.
2. Set up a **reproducible environment** (seeds + package versions).
3. **Fetch the dataset** and print **data provenance** + a compact dataset snapshot.

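> Note: Step 2 checks that `numpy`, `pandas`, `ucimlrepo`, and `IPython` are importable and raises an error if any is missing, so install `ucimlrepo` before running Step 2 (Step 3 then uses it to fetch the data).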


In [1]:
# Step 1 — Required technical outputs (high level)

PROJECT_OUTPUTS = {
    "provenance": [
        "dataset source URL + DOI + licence + access date",
        "rows/features + target definition + class balance",
        "Option A feature availability table (included vs excluded)",
    ],
    "models": [
        "Dummy baseline",
        "Logistic regression (interpretable)",
        "One tree-based model (nonlinear)",
        "(Optional) calibrated probabilities",
    ],
    "evaluation": [
        "PR curve + PR-AUC (primary)",
        "ROC curve + ROC-AUC (secondary)",
        "Calibration curve + Brier score",
        "Lift/gains and precision@K / recall@K",
        "Profit curves vs K and vs threshold + sensitivity over (P, C)",
    ],
    "interpretation": [
        "Top logistic coefficients (business-readable)",
        "Permutation importance for tree model",
    ],
    "reproducibility": [
        "random seeds",
        "train/validation/test split details",
        "model hyperparameters",
        "package versions",
    ],
}

for section, items in PROJECT_OUTPUTS.items():
    print(f"\n{section.upper()}")
    for item in items:
        print(f"- {item}")



PROVENANCE
- dataset source URL + DOI + licence + access date
- rows/features + target definition + class balance
- Option A feature availability table (included vs excluded)

MODELS
- Dummy baseline
- Logistic regression (interpretable)
- One tree-based model (nonlinear)
- (Optional) calibrated probabilities

EVALUATION
- PR curve + PR-AUC (primary)
- ROC curve + ROC-AUC (secondary)
- Calibration curve + Brier score
- Lift/gains and precision@K / recall@K
- Profit curves vs K and vs threshold + sensitivity over (P, C)

INTERPRETATION
- Top logistic coefficients (business-readable)
- Permutation importance for tree model

REPRODUCIBILITY
- random seeds
- train/validation/test split details
- model hyperparameters
- package versions
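
The top-K evaluation items listed above (precision@K, recall@K, profit vs K) can be illustrated with a minimal sketch. It assumes that `P` in "(P, C)" means profit per successful subscription and `C` means cost per call; the plan does not define them here, so both readings, the function name `top_k_metrics`, and the toy labels and scores are hypothetical.

```python
import numpy as np

def top_k_metrics(y_true, scores, k, profit_per_success, cost_per_call):
    """Precision@K, recall@K and profit when calling the K highest-scored customers.

    Assumption (not from the plan): profit = P * successes - C * calls made.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    # Indices of the K highest scores; a stable sort keeps ties in input order.
    top_k = np.argsort(-scores, kind="stable")[:k]
    hits = int(y_true[top_k].sum())
    total_positives = int(y_true.sum())
    return {
        "precision_at_k": hits / k,
        "recall_at_k": hits / total_positives if total_positives else float("nan"),
        "profit": profit_per_success * hits - cost_per_call * k,
    }

# Toy labels (1 = subscribed) and model scores, made up for illustration only.
y_toy = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
s_toy = [0.9, 0.8, 0.1, 0.7, 0.2, 0.3, 0.05, 0.4, 0.15, 0.6]

result = top_k_metrics(y_toy, s_toy, k=3, profit_per_success=10.0, cost_per_call=1.0)
print(result)
```

Sweeping `k` from 1 to the number of customers and recording `profit` at each value would give one possible form of the "profit vs K" curve; repeating the sweep over a grid of `(P, C)` values would give the sensitivity view.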


In [2]:
# Step 2 — Environment & reproducibility setup

import os
import platform
import random
import sys
from datetime import datetime, timezone
import importlib.metadata as md
import importlib.util

SEED = 42
random.seed(SEED)
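# PYTHONHASHSEED is read at interpreter startup, so setting it here only affects
# child processes; it does not change hash randomization in this running kernel.
os.environ["PYTHONHASHSEED"] = str(SEED)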

def require_importable(module_name: str) -> None:
    if importlib.util.find_spec(module_name) is None:
        raise ModuleNotFoundError(
            f"Missing required module '{module_name}'. "
            "Install dependencies for this notebook environment before continuing."
        )

# Only require packages needed for Steps 1–3. (Later steps will also need scikit-learn + plotting libs.)
for module in ["numpy", "pandas", "ucimlrepo", "IPython"]:
    require_importable(module)

import numpy as np
import pandas as pd

np.random.seed(SEED)

ACCESS_UTC = datetime.now(timezone.utc)

def pkg_version(dist_name: str) -> str:
    try:
        return md.version(dist_name)
    except md.PackageNotFoundError:
        return "not-installed"

env_info = {
    "python": sys.version.split()[0],
    "platform": platform.platform(),
    "seed": SEED,
    "accessed_utc": ACCESS_UTC.isoformat(timespec="seconds"),
    "numpy": pkg_version("numpy"),
    "pandas": pkg_version("pandas"),
    "scikit-learn": pkg_version("scikit-learn"),
    "matplotlib": pkg_version("matplotlib"),
    "seaborn": pkg_version("seaborn"),
    "ucimlrepo": pkg_version("ucimlrepo"),
}

print("Environment / Reproducibility Info")
for k, v in env_info.items():
    print(f"- {k}: {v}")


Environment / Reproducibility Info
- python: 3.11.14
- platform: macOS-26.2-arm64-arm-64bit
- seed: 42
- accessed_utc: 2025-12-29T03:07:54+00:00
- numpy: 2.4.0
- pandas: 2.3.3
- scikit-learn: not-installed
- matplotlib: not-installed
- seaborn: not-installed
- ucimlrepo: 0.0.7
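
The seeding above relies on NumPy's global state via `np.random.seed`. As a hedged alternative for later steps, a seeded `numpy.random.Generator` can be passed explicitly to any code that needs randomness. The helper name `make_rng` below is hypothetical and not part of the notebook.

```python
import numpy as np

SEED = 42  # same value as the notebook's SEED

def make_rng(seed: int = SEED) -> np.random.Generator:
    """Return an independent, seeded generator (hypothetical helper)."""
    return np.random.default_rng(seed)

# Two generators built from the same seed produce the same draws, so a
# shuffle or split based on them can be repeated exactly across runs.
rng_a = make_rng()
rng_b = make_rng()
draw_a = rng_a.permutation(10)
draw_b = rng_b.permutation(10)

same = bool(np.array_equal(draw_a, draw_b))
print(same)
```

Passing a generator explicitly keeps each random step independent of whatever other code has already drawn from the global state.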


## Step 3 — Data Ingest and Provenance

The cell below fetches the UCI **Bank Marketing** dataset (id=222) via `ucimlrepo` and prints:
- A provenance block (source URLs, DOI, licence, access timestamp) for referencing
- A compact dataset snapshot (shape, target balance, and top missingness rates)
- The variable information table provided by UCI metadata
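
The snapshot quantities listed above (target balance and per-column missingness) can be tried on a small frame without network access. This is a minimal sketch on made-up rows: `toy_X` and `toy_y` are hypothetical stand-ins and do not contain values from the UCI dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows that only mimic the shape of the real data.
toy_X = pd.DataFrame(
    {
        "age": [30, 45, 52, 28],
        "contact": ["cellular", np.nan, "telephone", np.nan],
        "poutcome": [np.nan, np.nan, np.nan, "success"],
    }
)
toy_y = pd.Series(["no", "yes", "no", "no"], name="y")

# Share of rows whose target equals "yes".
positive_rate = float((toy_y == "yes").mean())

# Percentage of NaN values per column, largest first, dropping complete columns.
missing_pct = toy_X.isna().mean().mul(100).sort_values(ascending=False)
missing_pct = missing_pct[missing_pct > 0]

print(f"positive rate: {positive_rate:.4f}")
print(missing_pct)
```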


In [3]:
# Step 3 — Data ingest + provenance report (UCI Bank Marketing, id=222)

from ucimlrepo import fetch_ucirepo
from IPython.display import display

DATASET_ID = 222

try:
    bank_marketing = fetch_ucirepo(id=DATASET_ID)
except Exception as e:
    raise RuntimeError(
        f"Failed to fetch UCI dataset id={DATASET_ID} via ucimlrepo. "
        "Check your internet access (or whether the dataset is cached in your environment)."
    ) from e

# Data
X_raw = bank_marketing.data.features.copy()
y_raw = bank_marketing.data.targets.iloc[:, 0].copy()

# Provenance (for assignment reporting)
provenance = {
    "dataset_name": bank_marketing.metadata.get("name", "Bank Marketing"),
    "uci_id": bank_marketing.metadata.get("uci_id", DATASET_ID),
    "repository_url": bank_marketing.metadata.get(
        "repository_url", "https://archive.ics.uci.edu/dataset/222/bank+marketing"
    ),
    "data_url": bank_marketing.metadata.get(
        "data_url", "https://archive.ics.uci.edu/static/public/222/data.csv"
    ),
    "dataset_doi": bank_marketing.metadata.get("dataset_doi", "10.24432/C5K306"),
    "licence": "CC BY 4.0 (as listed on UCI)",
    "accessed_utc": ACCESS_UTC.isoformat(timespec="seconds"),
}

print("Data provenance")
for k, v in provenance.items():
    print(f"- {k}: {v}")

print("\nDataset snapshot")
print(f"- rows: {X_raw.shape[0]:,}")
print(f"- features: {X_raw.shape[1]}")
print(f"- feature columns: {list(X_raw.columns)}")

target_counts = y_raw.value_counts(dropna=False)
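# Compare as strings so the rate is also computed when pandas stores the target
# with a dedicated string dtype rather than object.
positive_rate = float((y_raw.astype(str) == "yes").mean())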
print("\nTarget distribution (y)")
display(target_counts.to_frame(name="count"))
print(f"- positive rate (y == 'yes'): {positive_rate:.4f}")

print("\nMissingness (NaN) — top columns")
missing_rate = X_raw.isna().mean().sort_values(ascending=False)
missing_tbl = (
    missing_rate[missing_rate > 0]
    .mul(100)
    .round(2)
    .rename("missing_%")
    .to_frame()
)
display(missing_tbl.head(10))

print("\nVariable information (from UCI metadata)")
display(bank_marketing.variables)


Data provenance
- dataset_name: Bank Marketing
- uci_id: 222
- repository_url: https://archive.ics.uci.edu/dataset/222/bank+marketing
- data_url: https://archive.ics.uci.edu/static/public/222/data.csv
- dataset_doi: 10.24432/C5K306
- licence: CC BY 4.0 (as listed on UCI)
- accessed_utc: 2025-12-29T03:07:54+00:00

Dataset snapshot
- rows: 45,211
- features: 16
- feature columns: ['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact', 'day_of_week', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome']

Target distribution (y)


y,count
no,39922
yes,5289


- positive rate (y == 'yes'): 0.1170

Missingness (NaN) — top columns


column,missing_%
poutcome,81.75
contact,28.8
education,4.11
job,0.64



Variable information (from UCI metadata)


index,name,role,type,demographic,description,units,missing_values
0,age,Feature,Integer,Age,,,no
1,job,Feature,Categorical,Occupation,"type of job (categorical: 'admin.','blue-colla...",,no
2,marital,Feature,Categorical,Marital Status,"marital status (categorical: 'divorced','marri...",,no
3,education,Feature,Categorical,Education Level,"(categorical: 'basic.4y','basic.6y','basic.9y'...",,no
4,default,Feature,Binary,,has credit in default?,,no
5,balance,Feature,Integer,,average yearly balance,euros,no
6,housing,Feature,Binary,,has housing loan?,,no
7,loan,Feature,Binary,,has personal loan?,,no
8,contact,Feature,Categorical,,contact communication type (categorical: 'cell...,,yes
9,day_of_week,Feature,Date,,last contact day of the week,,no
