# Notebook 03: Baseline models with cross-validation (no-leak)

This notebook covers the following requirements:
- **Requirement 6:** Model selection (baseline candidates) and hyperparameters
- **Requirement 7:** Training (within cross-validation)
- **Requirement 8:** Evaluation and results (CV metrics)

Contents:
- Definition of the no-leak feature set for the baselines
- Baseline models: Dummy, Ridge (linear), Poly2 + Ridge
- Selection of the regularization parameter `alpha` via cross-validation
- Export of the CV results as the basis for Notebook 04 (hold-out evaluation)

No-leak rule:
- `elapsed_time` is present in `model_ready.csv` but is not used as a feature.
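
The no-leak rule can also be enforced in code rather than by convention. The following is a minimal sketch, not part of the original notebook: the helper `select_no_leak_features`, the set `LEAK_COLUMNS`, and the toy data frame are hypothetical names and values introduced only for illustration. It assumes that `elapsed_time` is the only column to be excluded, as stated above.

```python
import pandas as pd

# Hypothetical guard (assumption, not from the notebook): columns that must
# never be used as model inputs.
LEAK_COLUMNS = {"elapsed_time"}
BASE_FEATURES = ["distance", "total_elevation_gain", "highest_elevation", "lowest_elevation"]
TARGET = "moving_time"


def select_no_leak_features(df, features, target, leak_columns):
    """Return df[features], refusing any feature that is the target or a known leak column."""
    forbidden = set(leak_columns) | {target}
    leaked = sorted(set(features) & forbidden)
    if leaked:
        raise ValueError(f"Leaky columns requested as features: {leaked}")
    return df[features].copy()


# Toy frame with made-up values, only to exercise the helper.
toy = pd.DataFrame({
    "distance": [10000.0, 25000.0],
    "total_elevation_gain": [120.0, 480.0],
    "highest_elevation": [310.0, 950.0],
    "lowest_elevation": [250.0, 400.0],
    "elapsed_time": [3900.0, 9100.0],
    "moving_time": [3600.0, 8400.0],
})

X_toy = select_no_leak_features(toy, BASE_FEATURES, TARGET, LEAK_COLUMNS)
print(list(X_toy.columns))
```

Requesting `BASE_FEATURES + ["elapsed_time"]` would raise a `ValueError` instead of silently training on the excluded column.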

Outputs:
- `data_derived/03_alpha_grid_ridge_degree1.csv`
- `data_derived/03_alpha_grid_poly2_ridge.csv`
- `data_derived/03_cv_results_final_no_leak.csv`

In [1]:
from __future__ import annotations

from pathlib import Path
import numpy as np
import pandas as pd

from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.dummy import DummyRegressor
from sklearn.base import clone

SEED = 42
np.random.seed(SEED)

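def find_repo_root(start: Path) -> Path:
    """Walk upwards from `start` and return the first directory that contains data/processed.

    Falls back to `start` itself if no such directory is found.
    """
    start = start.resolve()
    for p in [start] + list(start.parents):
        if (p / "data" / "processed").exists():
            return p
    return start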

REPO_ROOT = find_repo_root(Path.cwd())

PATH_MODEL_READY = REPO_ROOT / "data" / "processed" / "model_ready.csv"
PATH_DERIVED = REPO_ROOT / "data_derived"
PATH_DERIVED.mkdir(parents=True, exist_ok=True)

TARGET = "moving_time"

print("REPO_ROOT:", REPO_ROOT)
print("PATH_MODEL_READY:", PATH_MODEL_READY)

REPO_ROOT: /Users/justuspfeifer/Documents/AML/aml-justus-pfeifer
PATH_MODEL_READY: /Users/justuspfeifer/Documents/AML/aml-justus-pfeifer/data/processed/model_ready.csv


In [2]:
if not PATH_MODEL_READY.exists():
    raise FileNotFoundError(
        "[ERROR] model_ready.csv not found.\n"
        f"Expected path: {PATH_MODEL_READY}\n"
        "Please run Notebook 02 first."
    )

df = pd.read_csv(PATH_MODEL_READY)
print("model_ready loaded:", df.shape)

# No-leak feature set: elapsed_time is deliberately not included.
BASE_FEATURES = ["distance", "total_elevation_gain", "highest_elevation", "lowest_elevation"]

missing = [c for c in BASE_FEATURES + [TARGET] if c not in df.columns]
if missing:
    raise ValueError(f"[ERROR] Expected columns are missing from model_ready.csv: {missing}")

if "elapsed_time" not in df.columns:
    print("[WARN] 'elapsed_time' is not present in model_ready.csv (QC column, optional).")

X = df[BASE_FEATURES].copy()
y = df[TARGET].astype(float).copy()

print("Baseline Features:", BASE_FEATURES)
print("X shape:", X.shape, "| y shape:", y.shape)

model_ready loaded: (9237, 13)
Baseline Features: ['distance', 'total_elevation_gain', 'highest_elevation', 'lowest_elevation']
X shape: (9237, 4) | y shape: (9237,)


## Cross-validation and hyperparameter selection

We compare the (no-leak) baseline models using 5-fold cross-validation with shuffling and a fixed seed.

Baselines:
- Dummy regressor (median)
- Ridge regression (linear, L2 regularization)
- Polynomial features (degree 2) + Ridge

For the Ridge models, the regularization parameter `alpha` is selected by cross-validation over a fixed grid.
The decision criterion is the lowest mean CV MAE (in seconds).
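
The same selection rule (lowest mean CV MAE over an `alpha` grid) can also be expressed with scikit-learn's `GridSearchCV`. The sketch below is an alternative formulation, not the approach used in this notebook, which loops over the grid manually. The data are synthetic stand-ins (an assumption); only the pipeline layout, the grid, and the fold setup mirror the notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

SEED = 42
rng = np.random.default_rng(SEED)

# Synthetic stand-in data (assumption): 200 rows, 4 features, linear target plus noise.
X_demo = rng.uniform(0.0, 1.0, size=(200, 4))
true_coef = np.array([3000.0, 1500.0, 200.0, -100.0])
y_demo = X_demo @ true_coef + rng.normal(0.0, 50.0, size=200)

pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", Ridge()),
])
alphas = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]

# scikit-learn maximizes scores, so MAE enters as its negative.
search = GridSearchCV(
    pipe,
    param_grid={"model__alpha": alphas},
    scoring="neg_mean_absolute_error",
    cv=KFold(n_splits=5, shuffle=True, random_state=SEED),
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["model__alpha"]
best_mae = -search.best_score_  # flip the sign back to a positive MAE
print("best alpha:", best_alpha, "| mean CV MAE:", round(best_mae, 2))
```

The manual loop used in this notebook has the advantage of recording MAE, RMSE, and R² with their fold standard deviations for every `alpha` in one table, which is what gets exported to `data_derived/`.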

In [3]:
cv = KFold(n_splits=5, shuffle=True, random_state=SEED)

scoring = {
    "mae": "neg_mean_absolute_error",
    "rmse": "neg_root_mean_squared_error",
    "r2": "r2",
}

alphas = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]

def cv_metrics(estimator, X, y) -> dict:
    res = cross_validate(
        estimator, X, y,
        cv=cv,
        scoring=scoring,
        n_jobs=-1,
        return_train_score=False
    )
    # scikit-learn reports error metrics negated ("higher is better"); flip the sign back.
    mae = -res["test_mae"]
    rmse = -res["test_rmse"]
    r2 = res["test_r2"]
    return {
        "mae_mean": float(np.mean(mae)),
        "mae_std": float(np.std(mae, ddof=1)),
        "rmse_mean": float(np.mean(rmse)),
        "rmse_std": float(np.std(rmse, ddof=1)),
        "r2_mean": float(np.mean(r2)),
        "r2_std": float(np.std(r2, ddof=1)),
    }

# Dummy (no alpha)
dummy_est = DummyRegressor(strategy="median")
dummy_metrics = cv_metrics(dummy_est, X, y)

# Ridge grid (linear)
ridge_template = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", Ridge(random_state=SEED)),
])

rows_ridge = []
for a in alphas:
    est = clone(ridge_template).set_params(model__alpha=a)
    rows_ridge.append({"alpha": a, **cv_metrics(est, X, y)})
ridge_grid = pd.DataFrame(rows_ridge).sort_values("mae_mean", ascending=True).reset_index(drop=True)

# Poly2 + Ridge grid
poly2_template = Pipeline([
    ("scaler", MinMaxScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("model", Ridge(random_state=SEED)),
])

rows_poly2 = []
for a in alphas:
    est = clone(poly2_template).set_params(model__alpha=a)
    rows_poly2.append({"alpha": a, **cv_metrics(est, X, y)})
poly2_grid = pd.DataFrame(rows_poly2).sort_values("mae_mean", ascending=True).reset_index(drop=True)

print("Ridge Alpha-Grid (best first):")
display(ridge_grid)

print("Poly2+Ridge Alpha-Grid (best first):")
display(poly2_grid)

ridge_path = PATH_DERIVED / "03_alpha_grid_ridge_degree1.csv"
poly2_path = PATH_DERIVED / "03_alpha_grid_poly2_ridge.csv"
ridge_grid.to_csv(ridge_path, index=False)
poly2_grid.to_csv(poly2_path, index=False)

print("Saved:", ridge_path)
print("Saved:", poly2_path)
print("Best Ridge alpha:", ridge_grid.iloc[0]["alpha"])
print("Best Poly2+Ridge alpha:", poly2_grid.iloc[0]["alpha"])

Ridge Alpha-Grid (best first):


| | alpha | mae_mean | mae_std | rmse_mean | rmse_std | r2_mean | r2_std |
|---|---|---|---|---|---|---|---|
| 0 | 0.1 | 760.562772 | 26.621199 | 1467.069287 | 105.499846 | 0.917594 | 0.012653 |
| 1 | 0.01 | 760.593339 | 26.541681 | 1467.483605 | 106.353344 | 0.917537 | 0.012793 |
| 2 | 0.001 | 760.60567 | 26.534458 | 1467.534059 | 106.439292 | 0.91753 | 0.012807 |
| 3 | 0.0001 | 760.607101 | 26.533471 | 1467.539197 | 106.447893 | 0.91753 | 0.012809 |
| 4 | 1e-05 | 760.607245 | 26.533372 | 1467.539711 | 106.448753 | 0.91753 | 0.012809 |
| 5 | 1e-06 | 760.607259 | 26.533362 | 1467.539763 | 106.448839 | 0.91753 | 0.012809 |
| 6 | 1.0 | 767.314151 | 27.137225 | 1470.436026 | 97.552489 | 0.917304 | 0.011435 |
| 7 | 10.0 | 959.984979 | 29.78234 | 1713.256863 | 56.040079 | 0.888095 | 0.007344 |


Poly2+Ridge Alpha-Grid (best first):


| | alpha | mae_mean | mae_std | rmse_mean | rmse_std | r2_mean | r2_std |
|---|---|---|---|---|---|---|---|
| 0 | 0.01 | 763.936073 | 23.316878 | 1464.24806 | 107.188927 | 0.917733 | 0.014052 |
| 1 | 0.001 | 765.181559 | 24.428758 | 1468.576401 | 107.245045 | 0.917276 | 0.01386 |
| 2 | 0.0001 | 765.498546 | 24.626436 | 1469.79303 | 107.327029 | 0.917148 | 0.013806 |
| 3 | 1e-05 | 765.532422 | 24.647593 | 1469.932948 | 107.338242 | 0.917133 | 0.0138 |
| 4 | 1e-06 | 765.535829 | 24.649723 | 1469.94715 | 107.339399 | 0.917131 | 0.013799 |
| 5 | 0.1 | 766.268446 | 22.7012 | 1464.604835 | 106.884172 | 0.917727 | 0.013814 |
| 6 | 1.0 | 786.140723 | 21.220868 | 1462.7428 | 98.607154 | 0.917967 | 0.013184 |
| 7 | 10.0 | 964.537475 | 29.026315 | 1640.924166 | 45.997332 | 0.897084 | 0.010189 |


Saved: /Users/justuspfeifer/Documents/AML/aml-justus-pfeifer/data_derived/03_alpha_grid_ridge_degree1.csv
Saved: /Users/justuspfeifer/Documents/AML/aml-justus-pfeifer/data_derived/03_alpha_grid_poly2_ridge.csv
Best Ridge alpha: 0.1
Best Poly2+Ridge alpha: 0.01


In [4]:
best_ridge = ridge_grid.iloc[0]
best_poly2 = poly2_grid.iloc[0]

cv_results = pd.DataFrame([
    {"model": "Dummy (median)", **dummy_metrics},
    {"model": f"Ridge (degree=1, alpha={best_ridge['alpha']})", **best_ridge.drop(labels=["alpha"]).to_dict()},
    {"model": f"Poly2 + Ridge (alpha={best_poly2['alpha']})", **best_poly2.drop(labels=["alpha"]).to_dict()},
]).sort_values("mae_mean", ascending=True).reset_index(drop=True)

best_idx = int(cv_results["mae_mean"].idxmin())
cv_results.loc[best_idx, "model"] = cv_results.loc[best_idx, "model"] + " [BEST]"

display(cv_results)

cv_path = PATH_DERIVED / "03_cv_results_final_no_leak.csv"
cv_results.to_csv(cv_path, index=False)
print("Saved:", cv_path)

print("CV baseline winner:", cv_results.iloc[0]["model"])

| | model | mae_mean | mae_std | rmse_mean | rmse_std | r2_mean | r2_std |
|---|---|---|---|---|---|---|---|
| 0 | Ridge (degree=1, alpha=0.1) [BEST] | 760.562772 | 26.621199 | 1467.069287 | 105.499846 | 0.917594 | 0.012653 |
| 1 | Poly2 + Ridge (alpha=0.01) | 763.936073 | 23.316878 | 1464.24806 | 107.188927 | 0.917733 | 0.014052 |
| 2 | Dummy (median) | 3241.350846 | 116.880913 | 5567.842705 | 213.335264 | -0.179608 | 0.022013 |


Saved: /Users/justuspfeifer/Documents/AML/aml-justus-pfeifer/data_derived/03_cv_results_final_no_leak.csv
CV baseline winner: Ridge (degree=1, alpha=0.1) [BEST]


## Result and baseline decision

- In this cross-validation, **Ridge (linear)** with `alpha=0.1` achieves the lowest MAE (760.6 s) and is chosen as the baseline model.
- **Poly2 + Ridge** (best at `alpha=0.01`) offers no consistent advantage over linear Ridge regression in this configuration: its MAE is slightly higher (763.9 s), while its RMSE and R² are only marginally better.
- The Dummy regressor serves as a reference and shows the error magnitude without any modelled relationships (MAE of about 3241 s).

The chosen baseline model is evaluated on a hold-out dataset in Notebook 04.
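
To put the gap between the two Ridge variants into perspective, it can be compared with the fold-to-fold spread of the MAE. The snippet below is a rough plausibility check, not a formal significance test; the numbers are copied from the CV results table in this notebook, and the rule "gap smaller than the smaller fold standard deviation" is an assumed heuristic.

```python
# Values copied from the CV results table of this notebook (seconds).
ridge = {"mae_mean": 760.562772, "mae_std": 26.621199}
poly2 = {"mae_mean": 763.936073, "mae_std": 23.316878}

# Difference in mean CV MAE between the two candidates.
gap = poly2["mae_mean"] - ridge["mae_mean"]

# Assumed heuristic: treat the gap as noise if it is below the smaller fold std.
smaller_std = min(ridge["mae_std"], poly2["mae_std"])
within_noise = gap < smaller_std

print(f"MAE gap: {gap:.2f} s | smaller fold std: {smaller_std:.2f} s | within noise: {within_noise}")
```

With a gap of a few seconds against a fold standard deviation of more than 20 s, the ranking of the two Ridge variants should be read as a near tie. The simpler linear model is the one carried forward.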

### Output

- `data_derived/03_alpha_grid_ridge_degree1.csv`
- `data_derived/03_alpha_grid_poly2_ridge.csv`
- `data_derived/03_cv_results_final_no_leak.csv`