# 02 – Baseline risk model (Logistic Regression)

## Objective
Develop an interpretable baseline model predicting `high_risk_next`
using competitive workload features.

Key principles:

- Chronological split
- Feature scaling
- ROC-AUC & PR-AUC evaluation
- Coefficient interpretation
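
The chronological split listed above can be made on a calendar cut-off rather than a row count, so that rows sharing a date never land on both sides of the split. The sketch below is a minimal illustration on made-up data; `split_by_date` and the toy frame are hypothetical names, not part of this notebook.

```python
import pandas as pd

# Toy data standing in for the modelling frame (all values are made up).
toy = pd.DataFrame({
    "match_date": pd.to_datetime([
        "2024-01-01", "2024-01-08", "2024-01-08", "2024-01-15",
        "2024-01-22", "2024-01-22", "2024-01-29", "2024-02-05",
        "2024-02-05", "2024-02-12",
    ]),
    "acwr": [0.9, 1.1, 1.0, 1.3, 0.8, 1.2, 1.4, 1.0, 0.7, 1.5],
    "high_risk_next": [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
})


def split_by_date(frame, date_col="match_date", train_frac=0.8):
    """Split on a calendar cut-off so no date appears in both train and test."""
    frame = frame.sort_values(date_col, kind="stable")
    # Date of the row sitting at the train_frac position.
    cutoff = frame[date_col].iloc[int(len(frame) * train_frac)]
    train = frame[frame[date_col] < cutoff]
    test = frame[frame[date_col] >= cutoff]
    return train, test, cutoff


train_toy, test_toy, cutoff = split_by_date(toy)
print("cutoff:", cutoff.date(), "| train rows:", len(train_toy), "| test rows:", len(test_toy))
```

With this toy frame, a plain 80/20 row split would put one of the two 2024-02-05 rows in train and the other in test; the date cut-off keeps both in test at the price of a slightly smaller training set.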


In [None]:
# === Setup ===
import os
import duckdb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss
from sklearn.calibration import calibration_curve


def resolve_db_path():
    cwd = Path.cwd()
    candidates = [
        cwd / "lakehouse" / "analytics.duckdb",
        cwd.parent / "lakehouse" / "analytics.duckdb"
    ]
    for p in candidates:
        if p.exists():
            return p
    raise FileNotFoundError("DuckDB file not found.")

DB_PATH = resolve_db_path()


In [None]:
# === Load Data ===
with duckdb.connect(str(DB_PATH)) as con:
    dfp = con.execute("SELECT * FROM player_dataset_predictive WHERE acwr IS NOT NULL").df()

dfp.shape


## Feature selection
Interpretable competitive workload variables.


In [None]:
TABLE = os.getenv("FRA_TABLE", "player_dataset_predictive_v2")

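# Open read-only; the context manager closes the connection on exit.
# Note: this overwrites the `dfp` loaded in the previous cell.
with duckdb.connect(str(DB_PATH), read_only=True) as con:
    dfp = con.execute(f"SELECT * FROM {TABLE}").fetchdf()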

print("Loaded TABLE:", TABLE)
print("Columns:", len(dfp.columns))

In [None]:
features = [
    "minutes_last_7d","minutes_last_14d","minutes_last_28d","minutes_last_5_matches","acwr",
    "minutes_std_last_5_matches","minutes_std_last_10_matches",
    "delta_7d_14d","delta_14d_28d",
    "ratio_7d_14d","ratio_14d_28d",
    "acwr_change",
    "season_minutes_cum","season_matches_played","season_avg_minutes",
    "minutes_last_3_matches","season_momentum_3v_season_avg"
]

target = "high_risk_next"

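cols = features + [target]

if "match_date" in dfp.columns:
    cols.append("match_date")

d = dfp[cols].copy()

if "match_date" in d.columns:
    # Stable sort keeps the original order among rows sharing a date.
    d = d.sort_values("match_date", kind="stable")
else:
    # Without a date column the split below is positional, not chronological.
    print("Warning: 'match_date' not found; split is not chronological.")

# First 80% of rows for training, last 20% for testing.
cut = int(len(d) * 0.8)
train = d.iloc[:cut]
test = d.iloc[cut:]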

X_train = train[features]
y_train = train[target].astype(int)

X_test = test[features]
y_test = test[target].astype(int)

len(train), len(test)


In [None]:
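# Sanity check: features expected by the model but absent from the loaded table.
# An empty list is the expected result.
missing = [c for c in features if c not in dfp.columns]
print("Missing:", missing)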

## Model Evaluation (Chronological Test)

- ROC-AUC
- PR-AUC
- Brier score
- Prevalence
- Threshold policy
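
The metrics listed above are easier to read against a no-skill reference, which is why prevalence is reported with them. A scorer with no ranking ability has ROC-AUC 0.5 and an expected PR-AUC equal to the prevalence, and a constant forecast of the prevalence `p` has Brier score `p * (1 - p)`. The helpers below are a hypothetical sketch with made-up labels, not part of this notebook.

```python
import numpy as np


def no_skill_reference(y_true):
    """Reference values for a model with no skill on these labels."""
    y = np.asarray(y_true, dtype=float)
    p = float(y.mean())  # prevalence of the positive class
    return {
        "prevalence": p,
        "roc_auc": 0.5,          # random ranking
        "pr_auc": p,             # expected average precision of random scores
        "brier": p * (1.0 - p),  # constant forecast equal to the prevalence
    }


def brier_skill_score(brier_model, y_true):
    """1 - Brier / reference Brier; positive means better than the constant forecast.

    Undefined when the labels contain a single class (reference Brier is 0).
    """
    ref = no_skill_reference(y_true)["brier"]
    return 1.0 - brier_model / ref


# Made-up labels and a made-up model Brier score, for illustration only.
y_toy = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]
ref = no_skill_reference(y_toy)
bss = brier_skill_score(0.12, y_toy)
print(ref, round(bss, 3))
```

In this notebook the same comparison could be made by passing `y_test` and the printed `brier` value to these helpers.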


In [None]:
# === Model Training ===

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # robust to outliers
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=2000)),
])

pipe.fit(X_train, y_train)
y_proba = pipe.predict_proba(X_test)[:, 1]
roc = roc_auc_score(y_test, y_proba)
pr  = average_precision_score(y_test, y_proba)
brier = brier_score_loss(y_test, y_proba)

print(f"ROC-AUC (test): {roc:.4f}")
print(f"PR-AUC  (test): {pr:.4f}")
print(f"Brier  (test): {brier:.4f}")
print(f"Prevalence (test): {y_test.mean():.4f}")


## Probability Calibration & Reliability Analysis

- Reliability curve
- Expected Calibration Error (ECE)
- Operational interpretation
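
The Expected Calibration Error listed above is usually defined as the bin-weighted mean absolute gap between observed frequency and mean predicted probability. The symbols are notation introduced here: $B$ bins, $n_b$ samples in bin $b$, $N$ samples overall, $\bar{y}_b$ the observed positive rate in bin $b$, and $\bar{p}_b$ the mean predicted probability in bin $b$.

$$
\mathrm{ECE} = \sum_{b=1}^{B} \frac{n_b}{N}\,\left|\bar{y}_b - \bar{p}_b\right|
$$

A value of 0 means that predicted probabilities match observed frequencies in every bin. Because the gap is taken in absolute value, ECE does not show whether the model over- or underestimates risk; the reliability curve is needed for that.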

In [None]:
# -----------------------
# CALIBRATION (Reliability + ECE)
# -----------------------
def expected_calibration_error(y_true, y_prob, n_bins=10, strategy="quantile"):
    y_true = np.asarray(y_true).astype(int)
    y_prob = np.asarray(y_prob).astype(float)

    if strategy == "quantile":
        bins = np.quantile(y_prob, np.linspace(0, 1, n_bins + 1))
        bins[0], bins[-1] = 0.0, 1.0
    else:
        bins = np.linspace(0, 1, n_bins + 1)

    bin_ids = np.digitize(y_prob, bins[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.sum() == 0:
            continue
        acc = y_true[mask].mean()
        conf = y_prob[mask].mean()
        w = mask.mean()
        ece += w * abs(acc - conf)
    return ece

prob_true, prob_pred = calibration_curve(y_test, y_proba, n_bins=10, strategy="quantile")
ece = expected_calibration_error(y_test, y_proba, n_bins=10, strategy="quantile")

plt.figure()
plt.plot([0, 1], [0, 1], linestyle="--", label="Perfect calibration")
plt.plot(prob_pred, prob_true, marker="o", label="Logistic Regression")
plt.title(f"Calibration curve (TEST) | ECE={ece:.3f}")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed frequency")
plt.legend()
plt.show()

print("ECE (test):", round(ece, 4))

### Calibration analysis (test)

The model shows acceptable calibration on the test set (ECE ≈ 0.089).  
In the higher-probability bins, observed frequency slightly exceeds predicted probability, so the model mildly underestimates elevated risk.

Underestimation keeps the alert rate down, which suits operational settings where false alarms carry a cost, but it also means some genuinely high-risk cases receive scores that are too low.
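
The direction of miscalibration described above can be checked bin by bin, since ECE alone discards the sign. The sketch below computes the signed gap (observed minus predicted) per bin; a positive gap means underestimated risk. `signed_calibration_gaps` is a hypothetical helper using equal-count bins, which may differ slightly from the binning in `calibration_curve`, and the data are made up.

```python
import numpy as np


def signed_calibration_gaps(y_true, y_prob, n_bins=10):
    """Return (mean predicted, observed frequency, observed - predicted) per bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Sort by predicted probability, then cut into near-equal-count bins.
    order = np.argsort(y_prob, kind="stable")
    rows = []
    for idx in np.array_split(order, n_bins):
        if idx.size == 0:
            continue
        pred = float(y_prob[idx].mean())
        obs = float(y_true[idx].mean())
        rows.append((pred, obs, obs - pred))
    return rows


# Made-up example: the high-probability bin is underestimated.
y_toy = [0, 0, 0, 0, 1, 1, 1, 0]
p_toy = [0.1, 0.1, 0.1, 0.1, 0.6, 0.6, 0.6, 0.6]
gaps = signed_calibration_gaps(y_toy, p_toy, n_bins=2)
for pred, obs, gap in gaps:
    print(f"pred={pred:.2f} obs={obs:.2f} gap={gap:+.2f}")
```

Applied to `y_test` and `y_proba`, positive gaps concentrated in the top bins would be consistent with the underestimation noted above.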

## Coefficient interpretation

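Because the features are standardised inside the pipeline, each coefficient is the change in the log-odds of `high_risk_next` per one-standard-deviation increase in that feature. Positive coefficients raise the log-odds of the elevated-risk proxy; negative coefficients lower them.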
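
Since the pipeline applies `StandardScaler` before the model, the fitted coefficients are per standard deviation of each feature. To express an effect in raw units (for example per 90 minutes played), the coefficient can be divided by the scaler's `scale_` for that feature. The sketch below shows the arithmetic; `odds_ratio_per_unit` is a hypothetical helper and the numbers are made up.

```python
import numpy as np


def odds_ratio_per_unit(coef_std, scale, units=1.0):
    """Odds ratio for a `units` increase in the raw (unscaled) feature.

    coef_std: coefficients fitted on standardised features.
    scale:    per-feature standard deviations used by the scaler.
    """
    coef_std = np.asarray(coef_std, dtype=float)
    scale = np.asarray(scale, dtype=float)
    return np.exp(units * coef_std / scale)


# Made-up values for two features: a minutes count and a ratio.
coef_std = np.array([0.40, -0.25])
scale = np.array([120.0, 0.35])

per_sd = np.exp(coef_std)                                    # odds ratio per 1 SD
per_90_min = odds_ratio_per_unit(coef_std[0], scale[0], 90)  # per 90 raw minutes
print(per_sd, per_90_min)
```

In this notebook the inputs would be `pipe.named_steps["model"].coef_[0]` and `pipe.named_steps["scaler"].scale_`.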


In [None]:
logit_model = pipe.named_steps["model"]

coef_df = pd.DataFrame({
    "feature": features,
    "coef": logit_model.coef_[0],
    "odds_ratio": np.exp(logit_model.coef_[0])
})

coef_df["abs_coef"] = coef_df["coef"].abs()
coef_df = coef_df.sort_values("abs_coef", ascending=False)

coef_df


In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Alert cap: flag roughly the top 10% of scores, with the cut-off learned on train.
cap = 0.10
thr = float(np.quantile(pipe.predict_proba(X_train)[:, 1], 1 - cap))
y_pred = (y_proba >= thr).astype(int)

print(f"Threshold (train quantile @{cap:.0%}):", thr)
print("Test alert rate:", y_pred.mean())
print("Precision:", precision_score(y_test, y_pred, zero_division=0))
print("Recall:", recall_score(y_test, y_pred, zero_division=0))
print("F1:", f1_score(y_test, y_pred, zero_division=0))
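
A threshold taken as a quantile of the training scores only guarantees the alert cap on the training set. If the score distribution shifts between train and test, the test alert rate can move away from the cap, which is why it is printed above. The sketch below illustrates this with synthetic scores; the Beta distributions are arbitrary assumptions chosen only to create a shift.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic scores: the "test" distribution is shifted towards higher risk.
train_scores = rng.beta(2, 8, size=5000)
test_scores = rng.beta(2, 6, size=2000)

cap = 0.10
# Cut-off that flags the top `cap` share of training scores.
thr = float(np.quantile(train_scores, 1 - cap))

train_rate = float((train_scores >= thr).mean())
test_rate = float((test_scores >= thr).mean())
print(f"threshold={thr:.3f} train alert rate={train_rate:.3f} test alert rate={test_rate:.3f}")
```

When the two rates diverge, the threshold may need to be re-estimated on more recent data to keep the intended alert volume.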

## Positioning within the project

This notebook establishes the baseline modelling layer.

Subsequent notebooks extend this foundation:

- Notebook 03 → Model comparison
- Notebook 04 → Operational thresholding
- Notebook 05 → Rolling deployment validation

The baseline selected for operational use is Logistic Regression,
due to its interpretability and stable calibration behaviour.

## Feature engineering rationale

Workload features were expanded to include:

- Rolling standard deviations
- Short vs medium-term deltas
- Season cumulative load
- Season momentum relative to baseline

The objective was to capture:

1. Acute workload shocks
2. Chronic load accumulation
3. Structural season fatigue
4. Load volatility
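
The four signals above map onto rolling-window features. The sketch below shows one way such features could be derived for a single player in a single season. The definitions are assumptions for illustration, not the pipeline that built the table: windows include the current match, and `acwr` is taken as the 7-day load divided by the weekly average of the 28-day load.

```python
import pandas as pd

# Made-up match log for one player in one season.
toy = pd.DataFrame({
    "match_date": pd.to_datetime(
        ["2024-01-01", "2024-01-04", "2024-01-08", "2024-01-15", "2024-01-29"]
    ),
    "minutes": [90, 60, 90, 45, 90],
})

s = toy.set_index("match_date")["minutes"]

feat = pd.DataFrame({
    # 1. Acute load: minutes in the trailing 7 days (window is (t - 7D, t]).
    "minutes_last_7d": s.rolling("7D").sum(),
    # 2. Chronic load: minutes in the trailing 28 days.
    "minutes_last_28d": s.rolling("28D").sum(),
    # 3. Season accumulation: running total of minutes.
    "season_minutes_cum": s.cumsum(),
    # 4. Volatility: standard deviation over the last 5 matches.
    "minutes_std_last_5_matches": s.rolling(5, min_periods=2).std(),
})

# Assumed ACWR definition: acute load over average weekly chronic load.
feat["acwr"] = feat["minutes_last_7d"] / (feat["minutes_last_28d"] / 4)
print(feat.round(3))
```

For a multi-player table the same windows would be computed per player (and per season for the cumulative total), for example with a `groupby` before the rolling step. A predictive setup would also need to ensure that the windows only use information available before the target match.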

Rolling validation confirmed that the ratio features (e.g. `ratio_7d_14d`, `ratio_14d_28d`)
provide incremental predictive signal.

The expanded feature set also gave a more stable ROC-AUC
than the initial baseline specification.