# 02 – Baseline risk model (Logistic Regression)

## Objective
Develop an interpretable baseline model predicting `high_risk_next`
using competitive workload features.

Key principles:

- Chronological split
- Feature scaling
- ROC-AUC & PR-AUC evaluation
- Coefficient interpretation


In [29]:
# === Setup ===
from pathlib import Path

import duckdb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss

def resolve_db_path():
    """Return the DuckDB file path, looking in the current and parent working directories."""
    cwd = Path.cwd()
    candidates = [
        cwd / "lakehouse" / "analytics.duckdb",
        cwd.parent / "lakehouse" / "analytics.duckdb",
    ]
    for p in candidates:
        if p.exists():
            return p
    raise FileNotFoundError("DuckDB file not found.")

DB_PATH = resolve_db_path()


In [9]:
# === Load Data ===
with duckdb.connect(str(DB_PATH)) as con:
    dfp = con.execute("SELECT * FROM player_dataset_predictive WHERE acwr IS NOT NULL").df()

dfp.shape


(51569, 23)

## Feature selection
Interpretable competitive workload variables.


In [30]:
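import os  # required by os.getenv

TABLE = os.getenv("FRA_TABLE", "player_dataset_predictive_v2")

# Open read-only; the context manager closes the connection even if the query fails.
with duckdb.connect(str(DB_PATH), read_only=True) as con:
    dfp = con.execute(f"SELECT * FROM {TABLE}").fetchdf()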

print("Loaded TABLE:", TABLE)
print("Columns:", len(dfp.columns))

Loaded TABLE: player_dataset_predictive_v2
Columns: 36


In [31]:
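# Sanity check that every model feature exists in the loaded table.
# Note: `features` is defined in the following cell, so on a fresh kernel run that cell first.
missing = [c for c in features if c not in dfp.columns]
print("Missing:", missing)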

Missing: []


In [32]:
features = [
    "minutes_last_7d","minutes_last_14d","minutes_last_28d","minutes_last_5_matches","acwr",
    "minutes_std_last_5_matches","minutes_std_last_10_matches",
    "delta_7d_14d","delta_14d_28d",
    "ratio_7d_14d","ratio_14d_28d",
    "acwr_change",
    "season_minutes_cum","season_matches_played","season_avg_minutes",
    "minutes_last_3_matches","season_momentum_3v_season_avg"
]

target = "high_risk_next"

cols = features + [target]

if "match_date" in dfp.columns:
    cols.append("match_date")

d = dfp[cols].copy()

if "match_date" in d.columns:
    d = d.sort_values("match_date")

cut = int(len(d) * 0.8)
train = d.iloc[:cut]
test = d.iloc[cut:]

X_train = train[features]
y_train = train["high_risk_next"].astype(int)

X_test = test[features]
y_test = test["high_risk_next"].astype(int)

len(train), len(test)


(64926, 16232)
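
The split above cuts at a row position after sorting by `match_date`, so rows that share the boundary date can end up on both sides of the cut. The following is a minimal sketch of a date-based cutoff that avoids this, assuming `match_date` is present and holds real dates. The helper name `split_by_date` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def split_by_date(df, date_col="match_date", train_frac=0.8):
    """Split so that every row dated before the cutoff goes to train."""
    ordered = df.sort_values(date_col, kind="stable").reset_index(drop=True)
    # Date of the row sitting at the train_frac position; used as the cutoff.
    cutoff = ordered[date_col].iloc[int(len(ordered) * train_frac)]
    train = ordered[ordered[date_col] < cutoff]
    test = ordered[ordered[date_col] >= cutoff]
    return train, test, cutoff

# Toy data for illustration only.
demo = pd.DataFrame({
    "match_date": pd.to_datetime(
        ["2024-01-01", "2024-01-08", "2024-01-08", "2024-01-15", "2024-01-15"]
    ),
    "high_risk_next": [0, 1, 0, 1, 1],
})
train_demo, test_demo, cutoff = split_by_date(demo)
print(len(train_demo), len(test_demo), cutoff.date())
```

With this variant the train share can fall slightly below `train_frac`, because all rows on the cutoff date are assigned to the test set.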

In [33]:
# === Model Training ===

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # robust to outliers
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=2000)),
])

pipe.fit(X_train, y_train)
y_proba = pipe.predict_proba(X_test)[:, 1]
roc = roc_auc_score(y_test, y_proba)
pr  = average_precision_score(y_test, y_proba)
brier = brier_score_loss(y_test, y_proba)

print(f"ROC-AUC (test): {roc:.4f}")
print(f"PR-AUC  (test): {pr:.4f}")
print(f"Brier  (test): {brier:.4f}")
print(f"Prevalence (test): {y_test.mean():.4f}")


ROC-AUC (test): 0.7671
PR-AUC  (test): 0.6968
Brier  (test): 0.2020
Prevalence (test): 0.4730


## Coefficient interpretation

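Features are standardized before fitting, so each coefficient is the change in the log-odds of `high_risk_next` per one-standard-deviation increase in that feature, holding the others fixed. Positive coefficients raise the predicted risk proxy and negative ones lower it. Several features are related by construction (for example the 7-, 14-, and 28-day minute totals and their deltas and ratios), so individual signs are likely affected by correlation and should be read with care.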


In [35]:
logit_model = pipe.named_steps["model"]

coef_df = pd.DataFrame({
    "feature": features,
    "coef": logit_model.coef_[0],
    "odds_ratio": np.exp(logit_model.coef_[0])
})

coef_df["abs_coef"] = coef_df["coef"].abs()
coef_df = coef_df.sort_values("abs_coef", ascending=False)

coef_df


| | feature | coef | odds_ratio | abs_coef |
|---|---|---|---|---|
| 3 | minutes_last_5_matches | 0.874259 | 2.397099 | 0.874259 |
| 7 | delta_7d_14d | 0.636781 | 1.890385 | 0.636781 |
| 0 | minutes_last_7d | 0.633581 | 1.884345 | 0.633581 |
| 9 | ratio_7d_14d | -0.537556 | 0.584174 | 0.537556 |
| 15 | minutes_last_3_matches | -0.510555 | 0.600162 | 0.510555 |
| 10 | ratio_14d_28d | -0.467732 | 0.626421 | 0.467732 |
| 8 | delta_14d_28d | 0.445992 | 1.562039 | 0.445992 |
| 14 | season_avg_minutes | 0.314266 | 1.369254 | 0.314266 |
| 2 | minutes_last_28d | -0.227956 | 0.796159 | 0.227956 |
| 4 | acwr | -0.199881 | 0.818828 | 0.199881 |
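
As a worked example of reading these values, the snippet below converts the reported coefficient for `minutes_last_5_matches` into an odds ratio per standard deviation and then per raw minutes. The raw standard deviation `sd_minutes` is a made-up placeholder; in the notebook it could be read from `pipe.named_steps["scaler"].scale_` at the feature's position.

```python
import math

coef_per_sd = 0.874259  # reported coefficient for minutes_last_5_matches
odds_ratio_per_sd = math.exp(coef_per_sd)

# Hypothetical raw standard deviation of the feature, for illustration only.
sd_minutes = 120.0
coef_per_minute = coef_per_sd / sd_minutes
odds_ratio_per_90_min = math.exp(coef_per_minute * 90)

print(f"Odds ratio per +1 SD: {odds_ratio_per_sd:.4f}")
print(f"Odds ratio per +90 min (assumed SD={sd_minutes:.0f}): {odds_ratio_per_90_min:.4f}")
```

The first value reproduces the `odds_ratio` column; the second depends entirely on the assumed standard deviation.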


In [36]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Alert budget: flag roughly the top 10% of predicted risk.
cap = 0.10
# The threshold is the (1 - cap) quantile of TRAIN probabilities, so no test data is used to set it.
thr = float(np.quantile(pipe.predict_proba(X_train)[:, 1], 1 - cap))
y_pred = (y_proba >= thr).astype(int)

print("Threshold (train quantile @10%):", thr)
print("Test alert rate:", y_pred.mean())
print("Precision:", precision_score(y_test, y_pred, zero_division=0))
print("Recall:", recall_score(y_test, y_pred, zero_division=0))
print("F1:", f1_score(y_test, y_pred, zero_division=0))

Threshold (train quantile @10%): 0.6464096474419989
Test alert rate: 0.09813947757516017
Precision: 0.7545511613308223
Recall: 0.15657157743910383
F1: 0.2593311758360302
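
The printed rates can be turned back into confusion-matrix counts, which makes the trade-off behind the 10% alert cap easier to see. This is a small arithmetic sketch using only the numbers reported above (test size 16232 and the printed alert rate, precision, and recall); the variable names are illustrative.

```python
n_test = 16232                        # test rows from the split
alert_rate = 0.09813947757516017
precision = 0.7545511613308223
recall = 0.15657157743910383

alerts = round(alert_rate * n_test)   # rows flagged by the threshold
tp = round(precision * alerts)        # flagged and truly high risk
fp = alerts - tp                      # flagged but not high risk
positives = round(tp / recall)        # all high-risk rows in the test set
fn = positives - tp                   # high-risk rows that were not flagged

f1 = 2 * tp / (alerts + positives)        # equivalent to 2TP / (2TP + FP + FN)
lift = precision / (positives / n_test)   # precision relative to prevalence

print(alerts, tp, fp, fn, positives)
print(f"F1: {f1:.4f}  lift: {lift:.2f}")
```

The recovered F1 matches the printed value, so the counts are consistent with the reported metrics. Precision among alerts is roughly 1.6 times the base rate, while most high-risk rows fall below the threshold.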


## Conclusion

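- Competitive workload features carry measurable predictive signal for `high_risk_next`.
- The model beats an uninformative baseline on the test set: ROC-AUC 0.7671 versus 0.5, and PR-AUC 0.6968 versus the 0.4730 prevalence.
- Tree-based models may capture additional non-linear effects.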

This baseline serves as a transparent reference model.
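
To make "exceeds a random baseline" concrete, the sketch below compares the reported test metrics with the values an uninformative model would obtain. It uses only the printed numbers. The constant-prediction Brier reference uses the test prevalence, which a real model would not know in advance, so it is a slightly optimistic reference.

```python
roc_auc, pr_auc, brier, prevalence = 0.7671, 0.6968, 0.2020, 0.4730

# Reference values for an uninformative model on this test set.
roc_auc_random = 0.5                              # random ranking
pr_auc_random = prevalence                        # expected average precision of a random ranking
brier_constant = prevalence * (1 - prevalence)    # always predicting the base rate

# Brier skill score: fraction of the reference error removed by the model.
brier_skill = 1 - brier / brier_constant

print(f"ROC-AUC: {roc_auc:.4f} vs {roc_auc_random:.4f}")
print(f"PR-AUC:  {pr_auc:.4f} vs {pr_auc_random:.4f}")
print(f"Brier:   {brier:.4f} vs {brier_constant:.4f} (skill {brier_skill:.3f})")
```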


## Positioning within the project

This notebook establishes the baseline modelling layer.

Subsequent notebooks extend this foundation:

- Notebook 03 → Model comparison
- Notebook 04 → Operational thresholding
- Notebook 05 → Rolling deployment validation

The baseline selected for operational use is Logistic Regression,
due to its interpretability and stable calibration behaviour.
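
One way calibration could be inspected is a reliability table that compares mean predicted probability with the observed positive rate per probability bin. The following is a minimal sketch on toy data, not the check used in the project; the function name `reliability_table` is hypothetical. scikit-learn's `sklearn.calibration.calibration_curve` offers a ready-made equivalent.

```python
def reliability_table(y_true, y_prob, n_bins=5):
    """Return (mean predicted, observed rate, count) for each non-empty equal-width bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((y, p))
    rows = []
    for members in bins:
        if members:
            n = len(members)
            mean_pred = sum(p for _, p in members) / n
            obs_rate = sum(y for y, _ in members) / n
            rows.append((round(mean_pred, 3), round(obs_rate, 3), n))
    return rows

# Toy labels and probabilities, for illustration only.
y_true_demo = [0, 0, 1, 0, 1, 1, 1, 0]
y_prob_demo = [0.05, 0.15, 0.30, 0.35, 0.55, 0.70, 0.90, 0.95]
rows = reliability_table(y_true_demo, y_prob_demo, n_bins=5)
for row in rows:
    print(row)
```

A well-calibrated model shows the first two columns close to each other in every bin; applied to the notebook, the inputs would be `y_test` and `y_proba`.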

## Feature engineering rationale

Workload features were expanded to include:

- Rolling standard deviations
- Short vs medium-term deltas
- Season cumulative load
- Season momentum relative to baseline

The objective was to capture:

1. Acute workload shocks
2. Chronic load accumulation
3. Structural season fatigue
4. Load volatility
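
The feature families above could be derived along the following lines. This is a sketch under stated assumptions: the input columns `player_id` and `minutes`, the single-season toy frame, and the exact definitions of the delta and ratio features are guesses for illustration, since the actual feature pipeline is not shown in this notebook.

```python
import numpy as np
import pandas as pd

def add_workload_features(df):
    """Add illustrative volatility, delta, ratio, and cumulative-load columns."""
    out = df.sort_values(["player_id", "match_date"]).copy()
    minutes_by_player = out.groupby("player_id")["minutes"]

    # Load volatility: std of minutes over up to the last 5 matches (current match included).
    out["minutes_std_last_5_matches"] = minutes_by_player.transform(
        lambda s: s.rolling(5, min_periods=2).std()
    )
    # Acute shock: short window minus medium window (assumed definition).
    out["delta_7d_14d"] = out["minutes_last_7d"] - out["minutes_last_14d"]
    # Ratio of the same windows; a zero denominator becomes NaN instead of inf.
    out["ratio_7d_14d"] = out["minutes_last_7d"] / out["minutes_last_14d"].replace(0, np.nan)
    # Chronic accumulation: running total per player (assumes one season per frame).
    out["season_minutes_cum"] = minutes_by_player.cumsum()
    return out

# Toy frame for illustration only.
demo = pd.DataFrame({
    "player_id": ["a", "a", "a"],
    "match_date": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-15"]),
    "minutes": [90, 90, 0],
    "minutes_last_7d": [90, 90, 0],
    "minutes_last_14d": [90, 180, 0],
})
feats = add_workload_features(demo)
print(feats[["minutes_std_last_5_matches", "delta_7d_14d", "ratio_7d_14d", "season_minutes_cum"]])
```

Whether the current match belongs inside each window depends on how the target is aligned; for a next-match target the windows must not include information from after the prediction point.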

Rolling validation (Notebook 05) indicated that the ratio features
(`ratio_7d_14d`, `ratio_14d_28d`) provide incremental predictive signal.

Feature expansion improved overall ROC-AUC stability
compared to the initial baseline specification.
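
For readers unfamiliar with rolling validation, the sketch below generates expanding-window folds over time-sorted rows: each fold trains on everything before a cut point and tests on the next block. It only illustrates the idea and is not the procedure from Notebook 05; the function name and parameters are hypothetical. scikit-learn's `sklearn.model_selection.TimeSeriesSplit` provides a ready-made alternative.

```python
def rolling_origin_folds(n_rows, n_folds=4, min_train_frac=0.5):
    """Yield (train_rows, test_rows) as ranges over rows sorted by time."""
    start = int(n_rows * min_train_frac)
    fold_size = (n_rows - start) // n_folds
    for k in range(n_folds):
        test_start = start + k * fold_size
        # The last fold absorbs any remainder rows.
        test_end = n_rows if k == n_folds - 1 else test_start + fold_size
        yield range(0, test_start), range(test_start, test_end)

folds = list(rolling_origin_folds(100, n_folds=4))
for train_rows, test_rows in folds:
    print(f"train rows 0-{train_rows[-1]}, test rows {test_rows[0]}-{test_rows[-1]}")
```

Fitting the pipeline on each train range and scoring ROC-AUC on the matching test range gives one score per fold, and the spread of those scores is one way to judge stability.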