In [1]:
from pathlib import Path
from IPython.display import HTML, display
css = Path("../../../css/custom.css").read_text(encoding="utf-8")
display(HTML(f"<style>{css}</style>"))

# Chapter 1 — Lesson 4: ML Workflow (Data, Model, Evaluation, Deployment)

This notebook is a practical, end-to-end walkthrough of a classical Machine Learning workflow. The focus is not on any single algorithm; it is on the engineering discipline that turns data into a **reliable** model and then into a **usable** artifact.

You will see the same workflow patterns across three tasks:

- **Classification**: predict a binary label (diabetic vs non-diabetic).
- **Regression**: predict a numeric target (diamond price).
- **Clustering**: discover structure without labels (airport coordinates).

The code uses dataset paths consistent with your repository layout, such as:

- `../../../Datasets/Classification/diabetes.csv`
- `../../../Datasets/Regression/diamonds.csv`
- `../../../Datasets/Clustering/airports.csv`

If a dataset file is not present in the runtime environment, the notebook synthesizes a dataset with a compatible schema so that every section remains runnable.

---

## Workflow overview

```mermaid
flowchart TD
A[Data] --> B[Problem formulation]
B --> C[Split strategy]
C --> D[Preprocess & Feature engineering]
D --> E[Model training]
E --> F[Evaluation & error analysis]
F --> G{Good enough?}
G -- No --> D
G -- Yes --> H[Package & Deploy]
H --> I[Monitoring & Feedback]
I --> D
```

---

## A compact mathematical view

Training is often written as regularized empirical risk minimization:

$$
\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i, f(x_i)\big) + \lambda \Omega(f)
$$

- $x_i$ are features, $y_i$ are targets.
- $\ell$ is a loss (log-loss, squared error, etc.).
- $\Omega$ is a regularizer, and $\lambda \ge 0$ controls its strength.
- $\mathcal{F}$ is a hypothesis class (linear models, trees, kernels, …).

In real work, the *workflow around* this objective determines whether the model generalizes.
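
To make the objective concrete, here is a minimal sketch that evaluates it for one particular choice: a linear model, squared-error loss, and an L2 regularizer $\Omega(f) = \lVert w \rVert^2$. The synthetic data, `lam`, and the function name `regularized_risk` are illustrative assumptions and are not used elsewhere in the notebook.

```python
import numpy as np

def regularized_risk(w, X, y, lam):
    """Mean squared-error loss plus lam * ||w||^2 (one instance of the objective)."""
    residuals = y - X @ w
    return float(np.mean(residuals ** 2) + lam * np.sum(w ** 2))

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ w_true + rng.normal(scale=0.1, size=200)

lam = 0.1
n = len(y_demo)

# Closed-form minimizer of (1/n) * ||y - Xw||^2 + lam * ||w||^2:
# setting the gradient to zero gives (X^T X + n * lam * I) w = X^T y.
w_hat = np.linalg.solve(X_demo.T @ X_demo + n * lam * np.eye(3), X_demo.T @ y_demo)

print("risk at w_hat:", regularized_risk(w_hat, X_demo, y_demo, lam))
print("risk at zeros:", regularized_risk(np.zeros(3), X_demo, y_demo, lam))
```

Larger values of `lam` pull `w_hat` toward zero, trading a higher training loss for a simpler model.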

---

## 0) Problem formulation (what you must decide before modeling)

### 0.1 Unit of prediction

Define what a single row represents. For example:

- one patient snapshot
- one loan application
- one transaction
- one listing

This choice affects leakage, splitting, and interpretability.

### 0.2 Target definition and timing

A target definition must be unambiguous:

- What is the event you predict?
- When does the event occur relative to the available features?
- How are labels produced? (manual, automated, delayed, noisy)
- Are there edge cases or ambiguous labels?

If your label is “did the user churn in the next 30 days”, then features must reflect only information available **at the prediction time**, not after churn.
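
One way to enforce this is to build features only from records that precede the prediction time. The sketch below assumes a hypothetical event log with `user_id`, `event_time`, and `amount` columns; none of these names come from the notebook's datasets.

```python
import pandas as pd

# Hypothetical event log: one row per user action (illustrative values).
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event_time": pd.to_datetime(
        ["2024-01-03", "2024-01-20", "2024-02-10", "2024-01-15", "2024-02-05"]
    ),
    "amount": [10.0, 20.0, 99.0, 5.0, 7.0],
})

prediction_time = pd.Timestamp("2024-02-01")

# Keep only information that existed strictly before the prediction time.
history = events[events["event_time"] < prediction_time]

# Aggregate the allowed history into one feature row per user.
features = (
    history.groupby("user_id")["amount"]
    .agg(n_events="count", total_amount="sum")
    .reset_index()
)
print(features)
```

The two events dated after `prediction_time` never reach the feature table, so they cannot leak into a model trained on it.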

### 0.3 Constraints and success criteria

List constraints explicitly:

- latency (e.g., < 50 ms)
- memory/model size
- interpretability/auditability
- privacy and retention constraints
- fairness goals
- cost of false positives/negatives

Define success criteria:

- primary metric (ROC-AUC, RMSE, …)
- acceptable operating point (precision/recall at a threshold)
- stability requirement (variance across folds and slices)

### 0.4 Split strategy

Split strategy is a modeling choice:

- random split for i.i.d. tabular data
- stratified split to preserve class balance
- group split to avoid entity leakage
- time split to mimic deployment conditions
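
The following sketch shows how three of these strategies might look with scikit-learn and NumPy. The arrays `groups` and `timestamps` are synthetic stand-ins (for example, for a patient ID and an event time); the notebook's datasets do not contain such columns.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)
n = 200
X_demo = rng.normal(size=(n, 3))
y_demo = rng.integers(0, 2, size=n)
groups = rng.integers(0, 40, size=n)   # assumed entity ID (e.g. patient)
timestamps = np.arange(n)              # assumed event time, already numeric

# Stratified split: class proportions are preserved in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Group split: every entity lands entirely in train or entirely in test.
gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(gss.split(X_demo, y_demo, groups=groups))

# Time split: train on the past, evaluate on the most recent 25% of rows.
order = np.argsort(timestamps)
cutoff = int(0.75 * n)
time_train_idx, time_test_idx = order[:cutoff], order[cutoff:]

print("groups shared by train and test:", set(groups[train_idx]) & set(groups[test_idx]))
```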

---

## 1) Data stage: ingest, validate, and understand

### 1.1 Ingest

Good habits:

- load from known paths
- log dataset version / snapshot
- inspect shape and column names
- store quick summaries (min/max, missingness)

### 1.2 Validate (minimal data contract)

At a minimum:

- required columns exist
- numeric ranges are plausible
- categorical values are known or handled safely
- missingness is within expected bounds

This prevents silent failures in training and in production.
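
As an illustration of the missingness item, the sketch below flags columns whose missing fraction exceeds an allowed bound. The helper `check_missingness` and the 25% bounds are hypothetical; a real contract would take its bounds from historical data.

```python
import numpy as np
import pandas as pd

def check_missingness(df, max_missing_frac):
    """Return {column: observed missing fraction} for columns that exceed their bound."""
    observed = df.isna().mean()
    return {
        col: float(observed[col])
        for col, bound in max_missing_frac.items()
        if col in df.columns and observed[col] > bound
    }

demo = pd.DataFrame({
    "Glucose": [148, 85, np.nan, 89, 137],
    "BMI": [33.6, np.nan, np.nan, 28.1, np.nan],
})

# Illustrative bounds: at most 25% missing values per column.
violations = check_missingness(demo, {"Glucose": 0.25, "BMI": 0.25})
print(violations)
```

Returning the violations instead of raising lets the caller decide whether to fail the run, log a warning, or fall back to imputation.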

### 1.3 Understand

You should always look for:

- class balance (classification prevalence)
- target noise
- duplicates
- suspicious “too informative” features (potential leakage)
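
A few of these checks take only a line or two of pandas. The sketch below uses a tiny hand-made frame (not one of the notebook's datasets) to show prevalence, exact duplicates, and identical feature rows that carry conflicting labels, which is one visible symptom of target noise.

```python
import pandas as pd

demo = pd.DataFrame({
    "Glucose": [148, 85, 183, 85, 137, 148],
    "BMI": [33.6, 26.6, 23.3, 26.6, 43.1, 33.6],
    "label": [1, 0, 1, 0, 1, 0],
})
feature_cols = ["Glucose", "BMI"]

# Class balance: fraction of positive labels.
prevalence = demo["label"].mean()

# Exact duplicates: rows identical in every column, label included.
n_exact_duplicates = int(demo.duplicated().sum())

# Conflicts: identical feature values that appear with more than one label.
n_conflicting = int(demo.groupby(feature_cols)["label"].nunique().gt(1).sum())

print(prevalence, n_exact_duplicates, n_conflicting)
```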

---

## 2) Modeling stage: pipelines and baselines

### 2.1 Pipelines prevent leakage

All preprocessing that learns statistics from data (imputation, scaling, encoding) must be fitted inside a pipeline. Otherwise, statistics computed from the full dataset contaminate the validation and test data.

We will use:

- `ColumnTransformer` for column-wise preprocessing
- `Pipeline` for preprocessing → model
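
A small sketch of why this matters: when the scaler lives inside a `Pipeline`, calling `fit` on the training rows means its statistics come from those rows only. The data here is synthetic, and the 70/30 split by position is just for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 2))
y_demo = (X_demo[:, 0] + rng.normal(size=100) > 5.0).astype(int)

# Illustrative split by position: first 70 rows train, last 30 rows held out.
X_tr, X_te = X_demo[:70], X_demo[70:]
y_tr, y_te = y_demo[:70], y_demo[70:]

pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_tr, y_tr)

scaler_mean = pipe.named_steps["scaler"].mean_
print("scaler mean       :", scaler_mean)
print("train-only mean   :", X_tr.mean(axis=0))
print("full-dataset mean :", X_demo.mean(axis=0))  # what a leaky workflow would use
```

The same property holds inside cross-validation: each fold refits the whole pipeline, so no fold sees statistics from its own validation rows.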

### 2.2 Baselines first

A baseline model answers:

- is there real signal?
- how hard is the task?
- what performance level is realistic?

We use logistic regression as a strong baseline for tabular classification, then compare to a small random forest.

---

## 3) Evaluation stage: metrics, thresholds, slices

### 3.1 Global metrics

Binary classification:

- Accuracy, Precision, Recall, F1
- ROC-AUC (threshold-free ranking quality)

Regression:

- MAE, RMSE, $R^2$

### 3.2 Threshold tuning under costs

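Deployment requires a decision threshold $t$ on the predicted probability. If a false negative (FN) costs 5× as much as a false positive (FP), that is $c_{FN} = 5\,c_{FP}$, choose the threshold that minimizes: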

$$
\text{Cost}(t) = c_{FP}\cdot FP(t) + c_{FN}\cdot FN(t)
$$
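
The cost function can be minimized with a simple grid search. The sketch below is one possible helper (the name `best_threshold` and the tiny validation arrays are assumptions for illustration). In practice the search would usually run on a validation split, so that test metrics stay unbiased.

```python
import numpy as np

def best_threshold(y_true, proba, thresholds, cost_fp=1.0, cost_fn=5.0):
    """Return (threshold, cost) minimizing cost_fp * FP + cost_fn * FN on the grid."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    costs = []
    for t in thresholds:
        pred = proba >= t
        fp = np.sum(pred & (y_true == 0))   # predicted positive, actually negative
        fn = np.sum(~pred & (y_true == 1))  # predicted negative, actually positive
        costs.append(cost_fp * fp + cost_fn * fn)
    best = int(np.argmin(costs))            # first minimum wins on ties
    return float(thresholds[best]), float(costs[best])

# Illustrative validation labels and predicted probabilities.
y_val = np.array([0, 0, 0, 1, 1, 0, 1, 0])
proba_val = np.array([0.1, 0.3, 0.45, 0.4, 0.8, 0.6, 0.55, 0.2])

t_star, cost_star = best_threshold(y_val, proba_val, np.array([0.3, 0.5, 0.7]))
print("threshold:", t_star, "cost:", cost_star)
```

With false negatives five times as expensive, the lowest threshold on this grid wins: it accepts three false positives to avoid any false negative.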

### 3.3 Slice analysis

Compute metrics by subgroups (age bands, cohorts, regions). A model can look “good” overall and still be unacceptable on critical slices.

---

## 4) Deployment stage (minimal, but real)

You should be able to:

- save the entire pipeline
- load it back
- run inference on new rows
- document required input schema
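
One lightweight way to document the input schema is to store it as JSON next to the model artifact and check incoming rows against it. This is a sketch under assumptions: the helper names, the file name `input_schema.json`, and the temporary directory are all hypothetical.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

def describe_schema(df):
    """Record the column names and dtypes expected at inference time."""
    return {"columns": [{"name": c, "dtype": str(df[c].dtype)} for c in df.columns]}

def check_against_schema(df, schema):
    """Return a list of human-readable problems (an empty list means OK)."""
    expected = [c["name"] for c in schema["columns"]]
    problems = [f"missing column: {c}" for c in expected if c not in df.columns]
    problems += [f"unexpected column: {c}" for c in df.columns if c not in expected]
    return problems

train_like = pd.DataFrame({"Glucose": [148, 85], "BMI": [33.6, 26.6]})
schema = describe_schema(train_like)

# Hypothetical location; in practice this could sit beside the .joblib file.
schema_path = Path(tempfile.mkdtemp()) / "input_schema.json"
schema_path.write_text(json.dumps(schema, indent=2), encoding="utf-8")

reloaded = json.loads(schema_path.read_text(encoding="utf-8"))
problems = check_against_schema(pd.DataFrame({"Glucose": [120]}), reloaded)
print(problems)
```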

---

## 5) Monitoring stage (minimum viable)

Monitoring typically includes:

- feature drift
- performance decay
- operational health

We compute the Population Stability Index (PSI) for one feature as a simple drift indicator.
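
For reference, the quantity computed in the PSI code cell can be written as follows, where $e_b$ and $a_b$ denote the fraction of reference ("expected") and new ("actual") values that fall in bin $b$, and the $B$ bins are quantile bins of the reference sample:

$$
\mathrm{PSI} = \sum_{b=1}^{B} (a_b - e_b)\,\ln\frac{a_b}{e_b}
$$

Each term is non-negative, so PSI is zero only when the two binned distributions coincide. A commonly cited rule of thumb treats values below 0.1 as stable and values above 0.25 as a substantial shift; these cut-offs are conventions, not statistical guarantees.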

---

## 6) Leakage (the classic workflow failure)

Leakage frequently produces unrealistically high metrics. We create an intentionally leaky feature (a proxy for the target) to demonstrate how metrics can be misleading.
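
A cheap first screen for leakage candidates is to score each feature on its own. The sketch below is one possible heuristic, not a complete leakage audit; the helper `single_feature_auc` and the synthetic columns are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def single_feature_auc(X, y):
    """ROC-AUC of each numeric feature used alone as a score, ignoring direction."""
    out = {}
    for col in X.columns:
        auc = roc_auc_score(y, X[col])
        out[col] = max(auc, 1.0 - auc)  # a strongly negative relation is suspicious too
    return pd.Series(out).sort_values(ascending=False)

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=300)
X_demo = pd.DataFrame({
    "weak_signal": y_demo + rng.normal(0, 2.0, size=300),
    "noise": rng.normal(size=300),
    "suspicious_proxy": y_demo + rng.normal(0, 0.02, size=300),  # near-copy of the target
})

screen = single_feature_auc(X_demo, y_demo)
print(screen.round(3))
```

A feature that nearly separates the classes on its own deserves a question about how and when it is produced, before it is trusted as genuine signal.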

---

## 7) Unsupervised workflow (clustering)

Even without labels, workflow discipline remains:

- define the goal (segmentation vs anomaly detection vs compression)
- standardize features
- choose $k$ (elbow/inertia, stability, business constraints)
- interpret clusters and validate with domain checks
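
Inertia always decreases as $k$ grows, so it helps to pair the elbow heuristic with a second criterion. The sketch below uses the silhouette score on synthetic blobs as one such option; the data and the names `scores` and `best_k_demo` are illustrative and separate from the airports example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs in 2D.
centers = np.array([[0.0, 0.0], [6.0, 6.0], [0.0, 6.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(60, 2)) for c in centers])
Xs_demo = StandardScaler().fit_transform(X_demo)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(Xs_demo)
    scores[k] = silhouette_score(Xs_demo, labels)  # higher means tighter, better-separated clusters

best_k_demo = max(scores, key=scores.get)
print({k: round(v, 3) for k, v in scores.items()}, "-> best k:", best_k_demo)
```

On real data the silhouette curve is rarely this clear-cut, which is why stability checks and business constraints remain part of the decision.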

---

## Practical checklists

### Data checklist
- [ ] Target definition includes timing constraints
- [ ] Split strategy matches real deployment
- [ ] Schema validation implemented
- [ ] Leakage candidates reviewed
- [ ] Data snapshot/version tracked

### Modeling checklist
- [ ] Baseline established
- [ ] Preprocessing is inside pipeline
- [ ] Reproducibility controls in place (seeds, environment)
- [ ] Hyperparameters tuned without test leakage

### Evaluation checklist
- [ ] Global metrics and uncertainty (CV)
- [ ] Threshold chosen under costs
- [ ] Slice analysis performed
- [ ] Error examples inspected manually

### Deployment checklist
- [ ] Artifact saved and load-tested
- [ ] Input schema documented
- [ ] Monitoring and retraining triggers defined

This notebook demonstrates a minimal version of each item.

In [2]:
import math
import random
from pathlib import Path

import numpy as np
import pandas as pd
from IPython.display import display

from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    confusion_matrix,
    mean_absolute_error, mean_squared_error, r2_score
)
from sklearn.cluster import KMeans

import joblib

SEED = 42
np.random.seed(SEED)
random.seed(SEED)

print("Imports OK")

Imports OK


In [3]:
candidates = [
    ("classification_diabetes", "../../../Datasets/Classification/diabetes.csv"),
    ("classification_iris", "../../../Datasets/Classification/iris.csv"),
    ("classification_wine", "../../../Datasets/Classification/Wine_Quality.csv"),
    ("regression_diamonds", "../../../Datasets/Regression/diamonds.csv"),
    ("regression_house_prices", "../../../Datasets/Regression/house-prices.csv"),
    ("clustering_airports", "../../../Datasets/Clustering/airports.csv"),
    ("clustering_hw200", "../../../Datasets/Clustering/hw_200.csv"),
]

random.seed(4)  # fixed for stable lesson content
chosen = dict(random.sample(candidates, k=3))
chosen

{'classification_iris': '../../../Datasets/Classification/iris.csv',
 'classification_wine': '../../../Datasets/Classification/Wine_Quality.csv',
 'classification_diabetes': '../../../Datasets/Classification/diabetes.csv'}

In [4]:
def _exists(path_str: str) -> bool:
    try:
        return Path(path_str).exists()
    except Exception:
        return False

DIABETES_SAMPLE = r'''Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,classification
6,148,72,35,0,33.6,0.627,50,Diabetic
1,85,66,29,0,26.6,0.351,31,Non-Diabetic
8,183,64,0,0,23.3,0.672,32,Diabetic
1,89,66,23,94,28.1,0.167,21,Non-Diabetic
0,137,40,35,168,43.1,2.288,33,Diabetic
'''

AIRPORTS_SAMPLE = r'''"latitude_deg","longitude_deg","elevation_ft"
40.070985,-74.933689,11
38.704022,-101.473911,3435
59.947733,-151.692524,450
'''

def load_or_synthesize_diabetes(path: str, n: int = 420, seed: int = 42) -> pd.DataFrame:
    if _exists(path):
        return pd.read_csv(path)
    rng = np.random.default_rng(seed)
    from io import StringIO  # standard-library StringIO; avoids relying on pandas internals
    df0 = pd.read_csv(StringIO(DIABETES_SAMPLE))
    cols = [c for c in df0.columns if c != "classification"]
    mu = df0[cols].mean()
    sd = df0[cols].std().replace(0, 1.0).fillna(1.0)

    X = rng.normal(loc=mu.values, scale=sd.values, size=(n, len(cols)))
    X = pd.DataFrame(X, columns=cols)

    X["Pregnancies"] = np.clip(np.round(X["Pregnancies"]), 0, 20)
    X["Glucose"] = np.clip(X["Glucose"], 50, 250)
    X["BloodPressure"] = np.clip(X["BloodPressure"], 30, 140)
    X["SkinThickness"] = np.clip(X["SkinThickness"], 0, 100)
    X["Insulin"] = np.clip(X["Insulin"], 0, 600)
    X["BMI"] = np.clip(X["BMI"], 15, 60)
    X["DiabetesPedigreeFunction"] = np.clip(X["DiabetesPedigreeFunction"], 0.05, 3.0)
    X["Age"] = np.clip(X["Age"], 18, 85)

    score = (
        0.03 * (X["Glucose"] - 120)
        + 0.06 * (X["BMI"] - 30)
        + 0.02 * (X["Age"] - 35)
        + 0.15 * (X["DiabetesPedigreeFunction"] - 0.5)
    )
    p = 1 / (1 + np.exp(-score))
    y = rng.binomial(1, np.clip(p, 0.05, 0.95), size=n)
    X["classification"] = np.where(y == 1, "Diabetic", "Non-Diabetic")
    return X

def load_or_synthesize_diamonds(path: str, n: int = 600, seed: int = 42) -> pd.DataFrame:
    if _exists(path):
        return pd.read_csv(path)
    rng = np.random.default_rng(seed)
    cuts = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
    colors = list("DEFGHIJ")
    clarities = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

    carat = np.clip(rng.lognormal(mean=-0.4, sigma=0.5, size=n), 0.2, 2.5)
    cut = rng.choice(cuts, size=n, p=[0.03, 0.10, 0.25, 0.30, 0.32])
    color = rng.choice(colors, size=n, p=[0.15, 0.18, 0.17, 0.15, 0.13, 0.12, 0.10])
    clarity = rng.choice(clarities, size=n, p=[0.02, 0.10, 0.18, 0.20, 0.18, 0.15, 0.10, 0.07])

    depth = np.clip(rng.normal(61.5, 1.5, size=n), 55, 70)
    table = np.clip(rng.normal(57.0, 2.0, size=n), 50, 70)

    x = np.clip(3.0 + 2.2 * np.sqrt(carat) + rng.normal(0, 0.15, size=n), 3.0, 10.0)
    y = np.clip(x + rng.normal(0, 0.08, size=n), 3.0, 10.0)
    z = np.clip(2.0 + 1.4 * np.sqrt(carat) + rng.normal(0, 0.12, size=n), 1.5, 6.5)

    base = 800 * (carat ** 1.7)
    noise = rng.normal(0, 250, size=n)
    price = np.clip(base + noise, 200, None)

    return pd.DataFrame({
        "id": np.arange(1, n+1).astype(str),
        "carat": carat,
        "cut": cut,
        "color": color,
        "clarity": clarity,
        "depth": depth,
        "table": table,
        "price": price.round(0).astype(int),
        "x": x.round(2),
        "y": y.round(2),
        "z": z.round(2),
    })

def load_or_synthesize_airports(path: str, n: int = 240, seed: int = 42) -> pd.DataFrame:
    if _exists(path):
        return pd.read_csv(path)
    rng = np.random.default_rng(seed)
    from io import StringIO  # standard-library StringIO; avoids relying on pandas internals
    df0 = pd.read_csv(StringIO(AIRPORTS_SAMPLE))
    centers = df0[["latitude_deg", "longitude_deg", "elevation_ft"]].to_numpy()
    cluster = rng.integers(0, len(centers), size=n)
    base = centers[cluster]
    lat = base[:, 0] + rng.normal(0, 1.0, size=n)
    lon = base[:, 1] + rng.normal(0, 1.6, size=n)
    elev = np.clip(base[:, 2] + rng.normal(0, 800, size=n), 0, 12000).astype(int)
    return pd.DataFrame({"latitude_deg": lat, "longitude_deg": lon, "elevation_ft": elev})

print("Loader functions ready.")

Loader functions ready.


In [5]:
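# `chosen` is a random sample and may not contain every key; .get() falls back to the default path.
diabetes_path = chosen.get("classification_diabetes", "../../../Datasets/Classification/diabetes.csv")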
diamonds_path = chosen.get("regression_diamonds", "../../../Datasets/Regression/diamonds.csv")
airports_path = chosen.get("clustering_airports", "../../../Datasets/Clustering/airports.csv")

df_diabetes = load_or_synthesize_diabetes(diabetes_path, n=420, seed=SEED)
df_diamonds = load_or_synthesize_diamonds(diamonds_path, n=600, seed=SEED)
df_airports = load_or_synthesize_airports(airports_path, n=240, seed=SEED)

print("Diabetes:", df_diabetes.shape, " Diamonds:", df_diamonds.shape, " Airports:", df_airports.shape)
display(df_diabetes.head())

Diabetes: (768, 9)  Diamonds: (53940, 11)  Airports: (83125, 19)


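|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | classification |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | Diabetic |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | Non-Diabetic |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | Diabetic |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | Non-Diabetic |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | Diabetic |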


In [6]:
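def validate_input(df: pd.DataFrame, required_cols, numeric_ranges=None):
    """Minimal data contract: required columns exist and numeric values lie within ranges.

    NaN values are not flagged by the range check, because comparisons with NaN are False.
    """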
    missing = [c for c in required_cols if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    if numeric_ranges:
        for c, (lo, hi) in numeric_ranges.items():
            if c in df.columns:
                bad = df[(df[c] < lo) | (df[c] > hi)]
                if len(bad) > 0:
                    raise ValueError(f"Column {c} has values outside [{lo}, {hi}] (n_bad={len(bad)})")
    return True

required = ["Pregnancies","Glucose","BloodPressure","BMI","Age","classification"]
ranges = {"Age": (0, 120), "Glucose": (0, 400), "BMI": (0, 100)}
print("Validation OK?", validate_input(df_diabetes, required, ranges))

Validation OK? True


In [7]:
target = "classification"
X = df_diabetes.drop(columns=[target]).copy()
y = df_diabetes[target].map({"Non-Diabetic": 0, "Diabetic": 1}).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=SEED, stratify=y
)

print("Train:", X_train.shape, "Test:", X_test.shape)
print("Positive rate (train/test):", round(y_train.mean(), 3), round(y_test.mean(), 3))

Train: (576, 8) Test: (192, 8)
Positive rate (train/test): 0.349 0.349


In [8]:
numeric_features = X_train.columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]), numeric_features)]
)

lr = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", LogisticRegression(max_iter=250, random_state=SEED))
])

lr.fit(X_train, y_train)

proba_lr = lr.predict_proba(X_test)[:, 1]
pred_lr = (proba_lr >= 0.5).astype(int)

metrics_lr = {
    "accuracy": accuracy_score(y_test, pred_lr),
    "precision": precision_score(y_test, pred_lr, zero_division=0),
    "recall": recall_score(y_test, pred_lr, zero_division=0),
    "f1": f1_score(y_test, pred_lr, zero_division=0),
    "roc_auc": roc_auc_score(y_test, proba_lr),
}

pd.Series(metrics_lr).round(4)

accuracy     0.7344
precision    0.6481
recall       0.5224
f1           0.5785
roc_auc      0.8320
dtype: float64

In [9]:
rf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", RandomForestClassifier(
        n_estimators=40, random_state=SEED, n_jobs=-1,
        max_depth=7, min_samples_leaf=2
    ))
])

rf.fit(X_train, y_train)

proba_rf = rf.predict_proba(X_test)[:, 1]
pred_rf = (proba_rf >= 0.5).astype(int)

metrics_rf = {
    "accuracy": accuracy_score(y_test, pred_rf),
    "precision": precision_score(y_test, pred_rf, zero_division=0),
    "recall": recall_score(y_test, pred_rf, zero_division=0),
    "f1": f1_score(y_test, pred_rf, zero_division=0),
    "roc_auc": roc_auc_score(y_test, proba_rf),
}

pd.DataFrame([metrics_lr, metrics_rf], index=["LogReg", "RandomForest"]).round(4)

|   | accuracy | precision | recall | f1 | roc_auc |
|---|---|---|---|---|---|
| LogReg | 0.7344 | 0.6481 | 0.5224 | 0.5785 | 0.832 |
| RandomForest | 0.7656 | 0.7037 | 0.5672 | 0.6281 | 0.8192 |


In [10]:
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=SEED)
cv_auc = cross_val_score(lr, X_train, y_train, scoring="roc_auc", cv=skf)
print("Baseline LR CV ROC-AUC:", np.round(cv_auc, 4), " mean:", round(cv_auc.mean(), 4))

Baseline LR CV ROC-AUC: [0.8206 0.8086]  mean: 0.8146


In [11]:
cost_fp = 1.0
cost_fn = 5.0

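# NOTE: for brevity the threshold is chosen on the test set; in a real project,
# choose it on a validation split so the test metrics stay unbiased.
thresholds = np.linspace(0.1, 0.9, 9)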
rows = []
for t in thresholds:
    p = (proba_rf >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, p).ravel()
    cost = cost_fp * fp + cost_fn * fn
    rows.append((t, tp, fp, tn, fn, cost))

df_thr = pd.DataFrame(rows, columns=["threshold", "TP", "FP", "TN", "FN", "cost"])
display(df_thr)
print("Best threshold by cost:")
display(df_thr.sort_values("cost").head(1))

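|   | threshold | TP | FP | TN | FN | cost |
|---|---|---|---|---|---|---|
| 0 | 0.1 | 65 | 78 | 47 | 2 | 88.0 |
| 1 | 0.2 | 58 | 60 | 65 | 9 | 105.0 |
| 2 | 0.3 | 53 | 44 | 81 | 14 | 114.0 |
| 3 | 0.4 | 45 | 27 | 98 | 22 | 137.0 |
| 4 | 0.5 | 38 | 16 | 109 | 29 | 161.0 |
| 5 | 0.6 | 29 | 10 | 115 | 38 | 200.0 |
| 6 | 0.7 | 21 | 4 | 121 | 46 | 234.0 |
| 7 | 0.8 | 9 | 0 | 125 | 58 | 290.0 |
| 8 | 0.9 | 0 | 0 | 125 | 67 | 335.0 |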


Best threshold by cost:


|   | threshold | TP | FP | TN | FN | cost |
|---|---|---|---|---|---|---|
| 0 | 0.1 | 65 | 78 | 47 | 2 | 88.0 |


In [12]:
df_eval = X_test.copy()
df_eval["y_true"] = y_test.values
df_eval["y_pred"] = pred_rf

df_eval["age_band"] = pd.cut(df_eval["Age"], bins=[18, 30, 40, 50, 60, 85], include_lowest=True)

def slice_metrics(g):
    yt = g["y_true"].values
    yp = g["y_pred"].values
    return pd.Series({
        "n": len(g),
        "pos_rate": yt.mean(),
        "recall": recall_score(yt, yp, zero_division=0),
        "precision": precision_score(yt, yp, zero_division=0),
        "f1": f1_score(yt, yp, zero_division=0),
    })

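# Pass only the columns the metric function needs; this also keeps the grouping
# column out of apply(), which newer pandas versions warn about.
df_eval.groupby("age_band", observed=True)[["y_true", "y_pred"]].apply(slice_metrics).reset_index().round(4)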


|   | age_band | n | pos_rate | recall | precision | f1 |
|---|---|---|---|---|---|---|
| 0 | (17.999, 30.0] | 114.0 | 0.2105 | 0.375 | 0.6 | 0.4615 |
| 1 | (30.0, 40.0] | 33.0 | 0.6364 | 0.5238 | 0.7857 | 0.6286 |
| 2 | (40.0, 50.0] | 27.0 | 0.5556 | 0.8667 | 0.8667 | 0.8667 |
| 3 | (50.0, 60.0] | 13.0 | 0.4615 | 0.6667 | 0.5 | 0.5714 |
| 4 | (60.0, 85.0] | 5.0 | 0.2 | 1.0 | 0.5 | 0.6667 |


In [13]:
X_leaky = X.copy()
rng = np.random.default_rng(SEED)
X_leaky["leaky_target_proxy"] = y + rng.normal(0, 0.02, size=len(y))

Xl_train, Xl_test, yl_train, yl_test = train_test_split(
    X_leaky, y, test_size=0.25, random_state=SEED, stratify=y
)

pre_leaky = ColumnTransformer(
    transformers=[("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]), Xl_train.columns.tolist())]
)

leaky = Pipeline(steps=[
    ("preprocess", pre_leaky),
    ("model", LogisticRegression(max_iter=250, random_state=SEED))
])

leaky.fit(Xl_train, yl_train)
proba = leaky.predict_proba(Xl_test)[:, 1]
print("ROC-AUC with leakage:", round(roc_auc_score(yl_test, proba), 6))

ROC-AUC with leakage: 1.0


In [14]:
artifact_dir = Path("./_artifacts")
artifact_dir.mkdir(parents=True, exist_ok=True)

model_path = artifact_dir / "chapter1_lesson4_diabetes_rf_pipeline.joblib"
joblib.dump(rf, model_path)

loaded = joblib.load(model_path)

one = X_test.iloc[[0]].copy()
display(one)

p1 = loaded.predict_proba(one)[:, 1][0]
print("P(diabetic) =", round(float(p1), 4), " -> class =", int(p1 >= 0.5))

|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 635 | 13 | 104 | 72 | 0 | 0 | 31.2 | 0.465 | 38 |


P(diabetic) = 0.3655  -> class = 0


In [15]:
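def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `actual` relative to `expected`, using quantile bins of `expected`."""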
    q = np.linspace(0, 1, bins + 1)
    cuts = np.quantile(expected, q)
    cuts[0], cuts[-1] = -np.inf, np.inf
    e_counts, _ = np.histogram(expected, bins=cuts)
    a_counts, _ = np.histogram(actual, bins=cuts)
    e = np.clip(e_counts / max(e_counts.sum(), 1), eps, 1)
    a = np.clip(a_counts / max(a_counts.sum(), 1), eps, 1)
    return float(np.sum((a - e) * np.log(a / e)))

future = X_test.copy()
future["Glucose"] = future["Glucose"] + 15  # simulate shift
print("PSI(Glucose):", round(psi(X_train["Glucose"].to_numpy(), future["Glucose"].to_numpy()), 4))

PSI(Glucose): 0.291


In [16]:
df = df_diamonds.copy()
y_r = df["price"].astype(float)
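# NOTE: any row-identifier column (e.g. `id`) stays in X_r here. Identifiers are not
# properties of a diamond and can leak the target when rows are ordered by price,
# so drop them in real work and treat a near-perfect R^2 with suspicion.
X_r = df.drop(columns=["price"]).copy()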

cat_cols = [c for c in X_r.columns if X_r[c].dtype == "object"]
num_cols = [c for c in X_r.columns if c not in cat_cols]

X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_r, y_r, test_size=0.25, random_state=SEED)

preprocess_r = ColumnTransformer(
    transformers=[
        ("num", Pipeline(steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]), num_cols),
        ("cat", Pipeline(steps=[("imputer", SimpleImputer(strategy="most_frequent")), ("onehot", OneHotEncoder(handle_unknown="ignore"))]), cat_cols),
    ]
)

reg = Pipeline(steps=[
    ("preprocess", preprocess_r),
    ("model", RandomForestRegressor(
        n_estimators=40, random_state=SEED, n_jobs=-1,
        max_depth=10, min_samples_leaf=2
    ))
])

reg.fit(X_train_r, y_train_r)
pred = reg.predict(X_test_r)

mae = mean_absolute_error(y_test_r, pred)
rmse = math.sqrt(mean_squared_error(y_test_r, pred))
r2 = r2_score(y_test_r, pred)

print("MAE:", round(mae, 2), " RMSE:", round(rmse, 2), " R^2:", round(r2, 4))
display(pd.DataFrame({"y_true": y_test_r.values[:8], "y_pred": pred[:8]}).round(2))

MAE: 12.32  RMSE: 63.21  R^2: 0.9997


|   | y_true | y_pred |
|---|---|---|
| 0 | 559.0 | 555.48 |
| 1 | 2201.0 | 2202.37 |
| 2 | 1238.0 | 1241.6 |
| 3 | 1304.0 | 1294.37 |
| 4 | 6901.0 | 6901.72 |
| 5 | 3011.0 | 3014.04 |
| 6 | 1765.0 | 1764.63 |
| 7 | 1679.0 | 1675.78 |


In [17]:
# --- Robust clustering prep: handle missing values and enforce numeric types ---
from sklearn.impute import SimpleImputer

dfc = df_airports.copy()

cols = ["latitude_deg", "longitude_deg", "elevation_ft"]
missing_cols = [c for c in cols if c not in dfc.columns]
if missing_cols:
    raise ValueError(
        f"Expected columns not found in airports data: {missing_cols}\n"
        f"Available columns (first 40): {list(dfc.columns)[:40]}"
    )

Xc = dfc[cols].copy()

# Coerce to numeric (some CSVs store these as strings); invalid parses become NaN
for c in cols:
    Xc[c] = pd.to_numeric(Xc[c], errors="coerce")

# Impute NaNs (KMeans cannot handle NaN)
imputer = SimpleImputer(strategy="median")
X_imp = imputer.fit_transform(Xc)

# Standardize
Xs = StandardScaler().fit_transform(X_imp)

# --- Model selection proxy: inertia for a small k grid (fast) ---
ks = [2, 3, 4, 5, 6]
rows = []
models = {}
for k in ks:
    km = KMeans(n_clusters=k, random_state=SEED, n_init=10)
    km.fit(Xs)
    rows.append((k, float(km.inertia_)))
    models[k] = km

inertia_df = pd.DataFrame(rows, columns=["k", "inertia"]).sort_values("k")
display(inertia_df)

best_k = 3  # for demonstration; in practice use elbow + stability + constraints
km = models[best_k]
dfc["cluster"] = km.predict(Xs)
display(dfc["cluster"].value_counts().sort_index())


|   | k | inertia |
|---|---|---|
| 0 | 2 | 171305.3195 |
| 1 | 3 | 123707.639589 |
| 2 | 4 | 80408.774662 |
| 3 | 5 | 64335.499902 |
| 4 | 6 | 54530.896375 |


cluster
0    55872
1    19937
2     7316
Name: count, dtype: int64

## Exercises (recommended)

1. **Split strategy stress test**  
   Replace the random split with a time-based split (simulate time by sorting on Age or another feature) and compare metrics. Explain why metrics change.

2. **Cost-aware thresholding**  
   Change the cost ratio to FN 10× FP and re-compute the best threshold. What happens to precision and recall?

3. **Schema enforcement**  
   Extend `validate_input` to enforce:
   - no extra columns (strict schema)
   - allowed ranges for multiple features
   - rejection or imputation rules for missing values

4. **Leakage forensics**  
   Try adding an “ID-like” feature (e.g., row index) and see whether metrics change. Discuss when IDs leak information and when they do not.

5. **Model governance notes**  
   Write a short “model card” (one page) describing:
   - intended use
   - training data
   - metrics
   - limitations
   - monitoring plan

6. **Clustering interpretation**  
   For each cluster, compute mean/median latitude/longitude/elevation and write a short interpretation. What might those clusters represent?

7. **Reproducibility**  
   Run the notebook twice and confirm that:
   - chosen datasets are stable (seeded)
   - metrics match (all seeds are fixed, so any difference points to an unseeded source of randomness)