# Notebook 04: Hold-out evaluation of the baseline (no-leak)

This notebook covers the following requirements:
- **Requirement 7:** Training (baseline on the training split)
- **Requirement 8:** Evaluation and results (hold-out metrics)

Contents:
- Reproducible train/hold-out split
- Training of the baseline model (no-leak)
- Hold-out metrics (MAE, RMSE, R²)
- Export of metrics and predictions for Notebook 08 (final report) and error analysis

No-leak rule:
- `elapsed_time` is not used as a feature, because it would leak information about the target `moving_time`.
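
The no-leak rule can be enforced with a small guard before training. The following is a minimal sketch: the feature list and target name are copied from this notebook, while `LEAKY_COLUMNS` and `assert_no_leak` are hypothetical helpers introduced only for illustration.

```python
# Hypothetical guard: names mirror the notebook, the helper itself is an assumption.
LEAKY_COLUMNS = {"elapsed_time"}
BASE_FEATURES = ["distance", "total_elevation_gain", "highest_elevation", "lowest_elevation"]
TARGET = "moving_time"


def assert_no_leak(features, target, leaky=LEAKY_COLUMNS):
    """Raise ValueError if a leaky column or the target itself is used as a feature."""
    forbidden = (set(leaky) | {target}) & set(features)
    if forbidden:
        raise ValueError(f"Leaky feature(s) in feature list: {sorted(forbidden)}")


# Passes silently for the baseline feature set.
assert_no_leak(BASE_FEATURES, TARGET)
```

Calling the guard with `BASE_FEATURES + ["elapsed_time"]` raises a `ValueError`, so an accidental change to the feature list fails early rather than producing overly optimistic metrics.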

Outputs:
- `data_derived/04_holdout_metrics_no_leak.csv`
- `data_derived/04_holdout_predictions_no_leak.csv`

In [1]:
from __future__ import annotations

from pathlib import Path
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

SEED = 42
np.random.seed(SEED)

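def find_repo_root(start: Path) -> Path:
    # Walk upwards from `start` until a directory containing data/processed is found;
    # fall back to `start` itself if no such directory exists.
    start = start.resolve()
    for p in [start] + list(start.parents):
        if (p / "data" / "processed").exists():
            return p
    return start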

REPO_ROOT = find_repo_root(Path.cwd())

PATH_MODEL_READY = REPO_ROOT / "data" / "processed" / "model_ready.csv"
PATH_DERIVED = REPO_ROOT / "data_derived"
PATH_DERIVED.mkdir(parents=True, exist_ok=True)

TARGET = "moving_time"
BASE_FEATURES = ["distance", "total_elevation_gain", "highest_elevation", "lowest_elevation"]

print("REPO_ROOT:", REPO_ROOT)
print("PATH_MODEL_READY:", PATH_MODEL_READY)

REPO_ROOT: /Users/justuspfeifer/Documents/AML/aml-justus-pfeifer
PATH_MODEL_READY: /Users/justuspfeifer/Documents/AML/aml-justus-pfeifer/data/processed/model_ready.csv


In [2]:
import re

CV_PATH = PATH_DERIVED / "03_cv_results_final_no_leak.csv"

def _pick_col(df: pd.DataFrame, candidates: list[str]) -> str | None:
    for c in candidates:
        if c in df.columns:
            return c
    return None

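def _extract_alpha_from_model_string(s: str) -> float | None:
    # Accepts e.g. "Ridge (degree=1, alpha=0.1) [BEST]" or "alpha=1e-4".
    # The pattern matches exactly one float literal, so trailing punctuation
    # (e.g. "alpha=0.1.") is not captured and float() cannot fail on it.
    m = re.search(r"alpha\s*=\s*([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)", s)
    return float(m.group(1)) if m else None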

ALPHA_BASELINE = None

if CV_PATH.exists():
    cv = pd.read_csv(CV_PATH)
    mae_col = _pick_col(cv, ["mae_mean_s", "mae_cv_mean_s", "mae_mean", "mae"])
    if mae_col is None:
        raise ValueError(f"No MAE column found in {CV_PATH.name}. Columns: {list(cv.columns)}")

    best = cv.sort_values(mae_col, ascending=True).iloc[0]
    ALPHA_BASELINE = _extract_alpha_from_model_string(str(best.get("model", "")))

# Fallback (should practically never be needed, but is better than a crash)
if ALPHA_BASELINE is None:
    ALPHA_BASELINE = 0.01

print("Baseline alpha (from CV best, Notebook 03):", ALPHA_BASELINE)

Baseline alpha (from CV best, Notebook 03): 0.1
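
The alpha lookup in the cell above relies on parsing the model label with a regular expression. The sketch below checks such a parser on a few sample labels. It assumes the labels follow the `"Ridge (degree=1, alpha=0.1)"` convention seen in this notebook. `extract_alpha` is a hypothetical standalone helper, and its pattern is one possible choice that accepts exactly one float literal.

```python
import re

# Assumption: model labels look like the "model" column values used in this notebook.
_ALPHA_RE = re.compile(r"alpha\s*=\s*([-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?)")


def extract_alpha(label: str):
    """Return the alpha value embedded in a model label, or None if absent."""
    m = _ALPHA_RE.search(label)
    return float(m.group(1)) if m else None


print(extract_alpha("Ridge (degree=1, alpha=0.1) [BEST]"))  # 0.1
print(extract_alpha("alpha=1e-4"))                          # 0.0001
print(extract_alpha("Ridge (degree=2)"))                    # None
```

Returning `None` for labels without an alpha keeps the fallback logic explicit at the call site.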


In [3]:
if not PATH_MODEL_READY.exists():
    raise FileNotFoundError(
        "[ERROR] model_ready.csv not found.\n"
        f"Expected path: {PATH_MODEL_READY}\n"
        "Please run Notebook 02 first."
    )

df = pd.read_csv(PATH_MODEL_READY)
print("model_ready loaded:", df.shape)

X = df[BASE_FEATURES].copy()
y = df[TARGET].astype(float).copy()

X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y,
    test_size=0.20,
    random_state=SEED,
)

print("Train:", X_train.shape, "Hold-out:", X_holdout.shape)

model_ready loaded: (9237, 13)
Train: (7389, 4) Hold-out: (1848, 4)
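
The split above is reproducible because `random_state` is fixed. A minimal sketch of that property follows. It uses synthetic row indices instead of the real data, so the array `idx` and its size are assumptions made only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42
idx = np.arange(100)  # synthetic stand-in for row indices, not the real data

train_a, hold_a = train_test_split(idx, test_size=0.20, random_state=SEED)
train_b, hold_b = train_test_split(idx, test_size=0.20, random_state=SEED)

# Same seed -> identical partition on every run.
print(np.array_equal(hold_a, hold_b))
# Split sizes for 100 rows with test_size=0.20.
print(len(train_a), len(hold_a))
# Train and hold-out never share a row.
print(set(train_a).isdisjoint(hold_a))
```

For a float `test_size`, scikit-learn rounds the hold-out size up, which matches the shapes printed above: ceil(0.20 × 9237) = 1848 hold-out rows and 7389 training rows.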


## Train the baseline model and evaluate it on the hold-out set

Baseline from Notebook 03:
- Ridge regression (linear) with `alpha = 0.1`
- Features: `distance`, `total_elevation_gain`, `highest_elevation`, `lowest_elevation`

We report MAE, RMSE, and R² on the hold-out set.
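
To make the three metrics concrete, here is a small worked example in plain Python. The numbers are made up and chosen so the arithmetic is easy to follow. They are not taken from the data set.

```python
import math

y_true = [600.0, 1200.0, 1800.0]  # made-up moving times in seconds
y_pred = [660.0, 1080.0, 1800.0]  # made-up predictions in seconds

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]  # -60, 120, 0

# MAE: mean of absolute errors -> (60 + 120 + 0) / 3 = 60 s
mae = sum(abs(e) for e in errors) / n

# RMSE: square root of the mean squared error -> sqrt(18000 / 3) = sqrt(6000) s
rmse = math.sqrt(sum(e * e for e in errors) / n)

# R²: 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true
y_mean = sum(y_true) / n                       # 1200
ss_res = sum(e * e for e in errors)            # 18000
ss_tot = sum((t - y_mean) ** 2 for t in y_true)  # 720000
r2 = 1.0 - ss_res / ss_tot                     # 0.975

print(mae, rmse, r2)
```

Because RMSE squares the errors before averaging, a single large miss raises it more than it raises MAE. Comparing the two therefore gives a rough hint about outliers.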

In [4]:
# Baseline model (no-leak): Ridge(alpha=0.1)
# alpha is pinned to the documented baseline value; it equals the CV best printed in In [2].
ALPHA_BASELINE = 0.1

baseline_model = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", Ridge(alpha=ALPHA_BASELINE, random_state=SEED)),
])

baseline_model.fit(X_train, y_train)
y_pred = baseline_model.predict(X_holdout)

print("Baseline trained. Predictions generated:", y_pred.shape)

Baseline trained. Predictions generated: (1848,)
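
Wrapping the scaler and the model in one `Pipeline` means the scaling statistics are learned from the training split only, which fits the no-leak setup. The sketch below illustrates this on synthetic data. The arrays and the out-of-range hold-out row are assumptions made for the demonstration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X_train = rng.uniform(0, 10, size=(50, 2))                  # synthetic features
y_train = 3.0 * X_train[:, 0] + rng.normal(0, 0.1, size=50)  # synthetic target
X_holdout = np.array([[12.0, 5.0]])  # first value lies outside the training range on purpose

pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", Ridge(alpha=0.1)),
])
pipe.fit(X_train, y_train)

scaler = pipe.named_steps["scaler"]
# The fitted min/max come from X_train alone.
print(np.allclose(scaler.data_max_, X_train.max(axis=0)))
# Hold-out rows are scaled with the training statistics, so values above 1 are possible.
print(scaler.transform(X_holdout)[0, 0] > 1.0)
print(pipe.predict(X_holdout).shape)
```

Fitting the scaler on the full data set before splitting would instead let hold-out minima and maxima influence the training features.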


In [5]:
# Hold-out metrics (seconds) + exports

mae = mean_absolute_error(y_holdout, y_pred)

# RMSE as sqrt(MSE) instead of the `squared` argument, which is not available in every scikit-learn version
mse = mean_squared_error(y_holdout, y_pred)
rmse = float(np.sqrt(mse))

r2 = r2_score(y_holdout, y_pred)

metrics = pd.DataFrame([{
    "block": "Hold-out – Baseline (no-leak)",
    "model": f"Ridge (degree=1, alpha={ALPHA_BASELINE})",
    "mae_s": float(mae),
    "rmse_s": float(rmse),
    "r2": float(r2),
    "mae_min": float(mae / 60.0),
    "rmse_min": float(rmse / 60.0),
}])

display(metrics)

metrics_path = PATH_DERIVED / "04_holdout_metrics_no_leak.csv"
metrics.to_csv(metrics_path, index=False)
print("Saved:", metrics_path)

pred_df = X_holdout.copy()
pred_df["y_true"] = y_holdout.values
pred_df["y_pred"] = y_pred
pred_df["abs_error"] = (pred_df["y_true"] - pred_df["y_pred"]).abs()

pred_path = PATH_DERIVED / "04_holdout_predictions_no_leak.csv"
pred_df.to_csv(pred_path, index=False)
print("Saved:", pred_path)

print("MAE:", f"{mae:.2f}s ({mae/60:.2f} min)", "| RMSE:", f"{rmse:.2f}s", "| R2:", f"{r2:.4f}")
print("pred_df shape:", pred_df.shape)

| | block | model | mae_s | rmse_s | r2 | mae_min | rmse_min |
|---|---|---|---|---|---|---|---|
| 0 | Hold-out – Baseline (no-leak) | Ridge (degree=1, alpha=0.1) | 760.643263 | 1631.394177 | 0.898263 | 12.677388 | 27.189903 |


Saved: /Users/justuspfeifer/Documents/AML/aml-justus-pfeifer/data_derived/04_holdout_metrics_no_leak.csv
Saved: /Users/justuspfeifer/Documents/AML/aml-justus-pfeifer/data_derived/04_holdout_predictions_no_leak.csv
MAE: 760.64s (12.68 min) | RMSE: 1631.39s | R2: 0.8983
pred_df shape: (1848, 7)
