# 02 – Baseline Risk Model (Logistic Regression)

## Objective
Develop an interpretable baseline model predicting `high_risk_next`
using competitive workload features.

Key principles:

- Chronological split
- Feature scaling
- ROC-AUC & PR-AUC evaluation
- Coefficient interpretation
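
The first two principles interact: the scaler should learn its mean and variance from the training period only. A minimal sketch of that idea, using synthetic stand-in data and scikit-learn's `make_pipeline` (the notebook itself calls `StandardScaler` and `LogisticRegression` separately; the data and variable names here are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 4 features, rows already in time order.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + rng.normal(scale=1.0, size=500) > 0.5).astype(int)

# Chronological split: the earliest 80% of rows train, the latest 20% test.
split = int(len(X) * 0.8)
X_tr, X_te = X[:split], X[split:]
y_tr, y_te = y[:split], y[split:]

# The pipeline fits the scaler on training rows only and reuses those
# statistics at prediction time, so no test-period information leaks in.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
pipe.fit(X_tr, y_tr)

proba = pipe.predict_proba(X_te)[:, 1]
fitted_scaler = pipe.named_steps["standardscaler"]
```

Bundling both steps in one estimator is a convenience, not a requirement; calling `fit_transform` on the training set and `transform` on the test set achieves the same separation.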


In [1]:
# === Setup ===
import duckdb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, average_precision_score

def resolve_db_path():
    cwd = Path.cwd()
    candidates = [
        cwd / "lakehouse" / "analytics.duckdb",
        cwd.parent / "lakehouse" / "analytics.duckdb"
    ]
    for p in candidates:
        if p.exists():
            return p
    raise FileNotFoundError("DuckDB file not found.")

DB_PATH = resolve_db_path()


In [2]:
# === Load Data ===
with duckdb.connect(str(DB_PATH)) as con:
    dfp = con.execute("SELECT * FROM player_dataset_predictive WHERE acwr IS NOT NULL").df()

dfp.shape


(51569, 23)

## Feature Selection
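The model uses four interpretable competitive workload variables: rolling minutes over the last 7, 14, and 28 days, plus the acute:chronic workload ratio (`acwr`).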
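
The source does not state how `acwr` is computed. One common convention divides the acute load (last 7 days) by the average weekly load over the last 28 days. The sketch below illustrates that convention on hypothetical rows; the column `acwr_sketch` and the formula are assumptions, not the table's actual definition.

```python
import pandas as pd

# Hypothetical rows; the column names mirror the notebook's features.
toy = pd.DataFrame({
    "minutes_last_7d": [90.0, 180.0, 0.0],
    "minutes_last_28d": [360.0, 360.0, 0.0],
})

# Assumed definition: acute load (last 7 days) divided by the average weekly
# load over the last 28 days. The source table may compute acwr differently.
chronic_weekly = toy["minutes_last_28d"] / 4

# Leave the ratio undefined (NaN) when there is no chronic load to divide by.
toy["acwr_sketch"] = (toy["minutes_last_7d"] / chronic_weekly).where(chronic_weekly > 0)
```

Under this convention the ratio is undefined for players with no recent minutes, which would be consistent with the `acwr IS NOT NULL` filter in the load query. It also makes `acwr` a function of the other features, which is worth keeping in mind when reading the coefficients.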


In [3]:
features = ["minutes_last_7d", "minutes_last_14d", "minutes_last_28d", "acwr"]

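# Keep match_date (when present) so rows can be ordered chronologically
# before the positional split; selecting only the features would drop it.
target = "high_risk_next"
cols = features + [target]
if "match_date" in dfp.columns:
    cols = ["match_date"] + cols

d = dfp[cols].dropna(subset=features + [target]).copy()

if "match_date" in d.columns:
    d = d.sort_values("match_date")

# Earliest 80% of rows for training, latest 20% for testing.
cut = int(len(d) * 0.8)
train = d.iloc[:cut]
test = d.iloc[cut:]

X_train = train[features]
y_train = train[target].astype(int)
X_test = test[features]
y_test = test[target].astype(int)

len(train), len(test)
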
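
A chronological split only holds if rows are ordered by date before the positional cut. A minimal sketch on a hypothetical toy frame (the `toy_*` names and values are made up for illustration), with a check that no training row is dated after the earliest test row:

```python
import pandas as pd

# Hypothetical toy frame: rows deliberately stored out of date order.
toy = pd.DataFrame({
    "match_date": pd.to_datetime(
        ["2024-03-01", "2024-01-01", "2024-05-01", "2024-02-01", "2024-04-01"]
    ),
    "acwr": [1.2, 0.8, 1.5, 1.0, 1.1],
    "high_risk_next": [0, 0, 1, 0, 1],
})

# Sort by date first, then cut by position so the test rows are the latest ones.
toy = toy.sort_values("match_date").reset_index(drop=True)
toy_cut = int(len(toy) * 0.8)
toy_train, toy_test = toy.iloc[:toy_cut], toy.iloc[toy_cut:]

# Leakage check: no training row may be dated after the earliest test row.
no_overlap = toy_train["match_date"].max() <= toy_test["match_date"].min()
```

The same comparison can be run on the real `train` and `test` frames, provided the date column is carried through the split.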

In [None]:
# === Model Training ===
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model = LogisticRegression(max_iter=2000)
model.fit(X_train_s, y_train)

y_proba = model.predict_proba(X_test_s)[:,1]

roc = roc_auc_score(y_test, y_proba)
pr = average_precision_score(y_test, y_proba)

roc, pr
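
To read `roc` and `pr` against a reference point, it helps to know what an uninformative scorer achieves: ROC-AUC is about 0.5 regardless of class balance, while average precision (PR-AUC) is about the positive-class rate. The sketch below demonstrates this on synthetic labels; the 10% prevalence is an arbitrary assumption, not the dataset's actual rate.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Synthetic labels with roughly 10% positives (hypothetical prevalence).
rng = np.random.default_rng(42)
y_true = (rng.random(20_000) < 0.10).astype(int)

# Scores that carry no information about the labels.
random_scores = rng.random(20_000)

prevalence = y_true.mean()
roc_random = roc_auc_score(y_true, random_scores)
pr_random = average_precision_score(y_true, random_scores)

print(f"prevalence={prevalence:.3f}  ROC-AUC={roc_random:.3f}  PR-AUC={pr_random:.3f}")
```

For the real data, `y_test.mean()` gives the PR-AUC floor to compare `pr` against.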


## Coefficient Interpretation

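Positive coefficients increase the log-odds of the elevated-risk proxy (`high_risk_next`). Because the features are standardized, each coefficient is the change in log-odds per one-standard-deviation increase in that feature, and its exponential is the corresponding odds ratio.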
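
Coefficients fitted on standardized inputs are expressed per standard deviation. Dividing by the scaler's `scale_` converts them back to original units, which can be easier to communicate. A minimal sketch on one synthetic feature (the data-generating process and the 90-minute increment are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one workload feature measured in minutes (hypothetical).
rng = np.random.default_rng(1)
minutes = rng.uniform(0, 630, size=2000)
true_logit = -2.0 + 0.004 * minutes  # assumed data-generating process
y = (rng.random(2000) < 1 / (1 + np.exp(-true_logit))).astype(int)

sc = StandardScaler()
X_s = sc.fit_transform(minutes.reshape(-1, 1))
clf = LogisticRegression(max_iter=2000).fit(X_s, y)

coef_per_sd = clf.coef_[0]                 # log-odds change per 1 SD
coef_per_unit = coef_per_sd / sc.scale_    # log-odds change per 1 minute
or_per_sd = np.exp(coef_per_sd)            # odds ratio per 1 SD
or_per_90min = np.exp(coef_per_unit * 90)  # odds ratio per extra 90 minutes
```

With the notebook's objects, the equivalent conversion would be `model.coef_[0] / scaler.scale_`, since both arrays follow the order of `features`.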


In [None]:
coef_df = pd.DataFrame({
    "feature": features,
    "coef": model.coef_[0],
    "odds_ratio": np.exp(model.coef_[0])
}).sort_values("coef", ascending=False)

coef_df


## Conclusion

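- Competitive workload provides measurable predictive signal for `high_risk_next`.
- Model performance exceeds a random baseline (ROC-AUC of 0.5; PR-AUC equal to the positive-class rate in the test set).
- Tree-based models may capture additional non-linear effects that a linear logit cannot.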

This baseline serves as a transparent reference model.
