
# Fourth & Value — Game Predictions Walkthrough

This notebook builds **game-level ML predictions** (spread & total) by rolling up **player weekly stats → team → game** and then training simple baseline models.

**What you'll do here:**
1. Load `data/weekly_player_stats_{SEASON}.parquet` (player-level weekly box scores).
2. Roll up to **team/game totals**.
3. Build **rolling features** per team (last N weeks).
4. Construct **matchup features** (home vs away).
5. Train **Ridge** models for **spread (point differential)** and **total (combined points)**.
6. Evaluate on held-out weeks.
7. (Optional) Compare model vs **market lines** from `data/odds/latest.csv`.

> This is a **clear, testable scaffold**. Start with it as-is; then iterate: add better features (EPA, success rate, pace), injuries, weather/dome flags, and calibration.



## Paths & Parameters

Set your season/week here. This notebook expects your repo layout (as we've been using for props):
- `data/weekly_player_stats_{SEASON}.parquet` ← downloaded from nflverse
- `data/odds/latest.csv` ← from The Odds API (optional for later merge)
- Outputs:
  - `data/games/model_preds_week{WEEK}.csv` (predictions for the selected week)


In [None]:

from pathlib import Path
import pandas as pd
import numpy as np

# Modeling
from sklearn.linear_model import Ridge

SEASON = 2025
WEEK   = 3

PLAYER_PARQUET = Path(f"data/weekly_player_stats_{SEASON}.parquet")
ODDS_CSV       = Path("data/odds/latest.csv")
OUT_DIR        = Path("data/games")
OUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"[conf] SEASON={SEASON} WEEK={WEEK}")
print(f"[paths] player_stats={PLAYER_PARQUET.exists()} odds={ODDS_CSV.exists()} out_dir={OUT_DIR.as_posix()}")


In [None]:

if not PLAYER_PARQUET.exists():
    raise FileNotFoundError(f"Missing {PLAYER_PARQUET}. Fetch via nflverse weekly release first.")

stats = pd.read_parquet(PLAYER_PARQUET)

# Peek at columns so it's easy to adjust below if your schema differs.
print(f"[stats] shape={stats.shape}")
print("[stats] sample cols:", list(stats.columns)[:25])
stats.head()



## Roll up: Player → Team/Game

We aggregate player stats up to **team** per **game**, keeping the **points** column as the target reference.
If your parquet lacks any of these exact columns, tweak the mapping block below where noted.


In [None]:

# ---- Column mapping (adjust here if needed) ----
# Try common nflverse names; fall back if not present.
def first_present(d, keys, default=np.nan):
    for k in keys:
        if k in d:
            return d[k]
    return default

# Normalize a few expected columns
# (You may tweak these if your parquet uses different names)
colmap = {
    "team":       "team",             # team abbreviation
    "opponent":   "opponent",         # opponent team abbr
    "game_id":    "game_id",
    "week":       "week",
    "pass_yds":   first_present(stats, ["passing_yards", "pass_yards", "pass_yds"], 0),
    "rush_yds":   first_present(stats, ["rushing_yards", "rush_yards", "rush_yds"], 0),
    "ints":       first_present(stats, ["interceptions", "int", "interceptions_thrown"], 0),
    "fumbles_lost": first_present(stats, ["fumbles_lost", "fumbles_lost_team"], 0),
    "sacks":      first_present(stats, ["sacks", "sacked", "sacks_taken"], 0),
    "points":     first_present(stats, ["points", "team_points", "pts"], np.nan),
}

# Build a lightweight frame with mapped columns
mini = pd.DataFrame({
    "game_id":   stats["game_id"],
    "week":      stats["week"],
    "team":      stats["team"],
    "opponent":  stats.get("opponent", stats.get("opp", np.nan)),
    "pass_yds":  colmap["pass_yds"],
    "rush_yds":  colmap["rush_yds"],
    "ints":      colmap["ints"],
    "fumbles_lost": colmap["fumbles_lost"],
    "sacks":     colmap["sacks"],
    "points":    colmap["points"],
})

# Aggregate to team per game
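team_games = (
    mini.groupby(["game_id", "week", "team"], dropna=False)
        .agg(pass_yds=("pass_yds","sum"),
             rush_yds=("rush_yds","sum"),
             ints=("ints","sum"),
             fumbles_lost=("fumbles_lost","sum"),
             sacks=("sacks","sum"),
             points=("points","max"))  # team points should be same for all players
        .reset_index()
)
# Turnovers = interceptions thrown + fumbles lost (not interceptions alone)
team_games["turnovers"] = team_games["ints"] + team_games["fumbles_lost"]

# Without team points there are no targets to train on, so flag it early
if team_games["points"].isna().all():
    print("[warn] no points column found; adjust the mapping block or merge in game scores")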

print("[team_games] shape:", team_games.shape)
team_games.head()



## Rolling Features per Team

We compute simple **rolling means** over the last `window` weeks to stabilize noisy weekly stats.
You can switch to **exponential moving averages** or **weighted** windows easily later.
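As an illustration of the exponential option, here is a minimal sketch on toy data. The helper name `add_ewm_features`, the `_ewm` column suffix, and the `span` default are hypothetical choices, and the `shift(1)` is an added assumption so that each feature only sees games played before the current one.

```python
import pandas as pd

def add_ewm_features(df: pd.DataFrame, cols, span: int = 3) -> pd.DataFrame:
    """Add '<col>_ewm' columns: per-team exponentially weighted means of prior games."""
    out = df.sort_values(["team", "week"]).copy()
    for col in cols:
        out[f"{col}_ewm"] = out.groupby("team")[col].transform(
            # shift(1) drops the current game so only earlier weeks contribute
            lambda s: s.shift(1).ewm(span=span, min_periods=1).mean()
        )
    return out

# Toy team/game rows (hypothetical values)
toy = pd.DataFrame({
    "team": ["KC", "KC", "KC", "BUF", "BUF"],
    "week": [1, 2, 3, 1, 2],
    "pass_yds": [300.0, 200.0, 250.0, 280.0, 310.0],
})
toy_ewm = add_ewm_features(toy, ["pass_yds"], span=3)
print(toy_ewm)
```

Each team's first game has no history, so its feature is NaN; decide whether to drop or impute those rows before modeling.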


In [None]:

def add_rolling_features(df: pd.DataFrame, window: int = 3) -> pd.DataFrame:
    # Sort so each team's games are in chronological order
    df = df.sort_values(["team", "week"]).copy()
    # Rolling means per team over *prior* games only: shift(1) excludes the current
    # game, so a row's features never contain the result they are used to predict.
    feature_map = {
        "pass_yds": "pass_yds_r",
        "rush_yds": "rush_yds_r",
        "turnovers": "to_r",
        "sacks": "sacks_r",
        "points": "pts_r",
    }
    for src, dst in feature_map.items():
        df[dst] = df.groupby("team")[src].transform(
            lambda s: s.shift(1).rolling(window, min_periods=1).mean()
        )
    # A team's first game has no history; drop it rather than pass NaNs to the model
    return df.dropna(subset=["pass_yds_r", "rush_yds_r", "to_r", "sacks_r"])

team_feats = add_rolling_features(team_games, window=3)
print("[team_feats] shape:", team_feats.shape)
team_feats.head()



## Matchup Rows (Home vs Away)

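We build **one row per game** from **feature differences** (first team minus second team); both the spread and total models currently use these differences.
No home/away flag is read from the data. The two teams in each game are ordered alphabetically and the first is labeled "home" as a placeholder,
so swap in real home/away labels from a schedule before reading anything into the sign of the spread.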
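Real home/away labels can come from a schedule table. The sketch below assumes a hypothetical `schedule` frame with `game_id`, `home_team`, and `away_team` columns (your source may name them differently) and a toy stand-in for the team-level feature frame; `label_home_away` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

# Hypothetical schedule: one row per game with explicit home/away teams
schedule = pd.DataFrame({
    "game_id": ["g1", "g2"],
    "home_team": ["KC", "BUF"],
    "away_team": ["DEN", "MIA"],
})

# Toy team-level features: one row per team per game
toy_feats = pd.DataFrame({
    "game_id": ["g1", "g1", "g2", "g2"],
    "team": ["DEN", "KC", "BUF", "MIA"],
    "pts_r": [20.0, 27.0, 24.0, 21.0],
})

def label_home_away(feats: pd.DataFrame, sched: pd.DataFrame) -> pd.DataFrame:
    """Return one row per game with *_home and *_away feature columns."""
    # Attach the home team's features to each scheduled game
    home = sched.merge(feats.rename(columns={"team": "home_team"}),
                       on=["game_id", "home_team"])
    # Attach the away team's features, keeping only game_id plus features
    away = (sched[["game_id", "away_team"]]
            .merge(feats.rename(columns={"team": "away_team"}),
                   on=["game_id", "away_team"])
            .drop(columns="away_team"))
    return home.merge(away, on="game_id", suffixes=("_home", "_away"))

labeled = label_home_away(toy_feats, schedule)
print(labeled)
```

With true labels in place, the spread target keeps a consistent home-minus-away meaning across games.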


In [None]:

# We need both teams per game on one row. No home/away flag is used here:
# teams are paired within game_id and the alphabetically first team is labeled 'home'.
# Replace this placeholder with actual home/away flags from schedules when available.

# Build all team pairs within a game (two rows → one row)
pairs = (team_feats.merge(team_feats, on=["game_id","week"], suffixes=("_home","_away"))
                   .query("team_home != team_away"))

# Keep exactly one pairing per game: sort team codes to choose one canonical order
pairs["canon"] = pairs.apply(lambda r: "|".join(sorted([r["team_home"], r["team_away"]])), axis=1)
pairs = pairs.sort_values(["game_id","canon","team_home"]).drop_duplicates(["game_id","canon"])

# Feature diffs, 'home' minus 'away' (used by both models; no sum features are built yet)
pairs["yds_diff_r"] = (pairs["pass_yds_r_home"] + pairs["rush_yds_r_home"]
                       - (pairs["pass_yds_r_away"] + pairs["rush_yds_r_away"]))
pairs["to_diff_r"]  = pairs["to_r_home"] - pairs["to_r_away"]
pairs["sack_diff_r"]= pairs["sacks_r_home"] - pairs["sacks_r_away"]

# Targets from points
pairs["spread_actual"] = pairs["points_home"] - pairs["points_away"]
pairs["total_actual"]  = pairs["points_home"] + pairs["points_away"]

matchups = pairs[["game_id","week",
                  "team_home","team_away",
                  "yds_diff_r","to_diff_r","sack_diff_r",
                  "spread_actual","total_actual"]].copy()

print("[matchups] shape:", matchups.shape)
matchups.head()



## Train & Predict (Ridge)

Two independent regressions:
- **Spread** target = `points_home - points_away`
- **Total** target = `points_home + points_away`

We keep the baseline small (Ridge) so it is easy to debug; you can upgrade later (e.g., GradientBoosting, XGBoost).
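Because the total target adds both teams' points, features that add the two teams' rolling stats may suit it better than first-minus-second differences. A minimal sketch on toy data, assuming a frame with `_home`/`_away` suffixed rolling columns; the helper `add_sum_features` and the names `yds_sum_r` and `pts_sum_r` are hypothetical.

```python
import pandas as pd

# Toy matchup rows with suffixed rolling columns (hypothetical values)
toy_pairs = pd.DataFrame({
    "pass_yds_r_home": [250.0, 210.0],
    "rush_yds_r_home": [120.0, 90.0],
    "pass_yds_r_away": [230.0, 260.0],
    "rush_yds_r_away": [100.0, 110.0],
    "pts_r_home": [27.0, 20.0],
    "pts_r_away": [21.0, 24.0],
})

def add_sum_features(pairs: pd.DataFrame) -> pd.DataFrame:
    """Add combined (home + away) rolling features for a totals model."""
    out = pairs.copy()
    yard_cols = ["pass_yds_r_home", "rush_yds_r_home",
                 "pass_yds_r_away", "rush_yds_r_away"]
    out["yds_sum_r"] = out[yard_cols].sum(axis=1)              # combined rolling yardage
    out["pts_sum_r"] = out["pts_r_home"] + out["pts_r_away"]   # combined rolling points
    return out

toy_sums = add_sum_features(toy_pairs)
print(toy_sums[["yds_sum_r", "pts_sum_r"]])
```

To try this, pass a separate feature list to the total model instead of reusing the spread features.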


In [None]:

# Train = all weeks < WEEK; Test = WEEK
train = matchups[matchups["week"] < WEEK].copy()
test  = matchups[matchups["week"] == WEEK].copy()

feature_cols = ["yds_diff_r","to_diff_r","sack_diff_r"]

print(f"[split] train={train.shape} test={test.shape}")
train[["week"] + feature_cols + ["spread_actual","total_actual"]].head()


In [None]:

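def fit_ridge(train_df, target, features):
    # Drop rows with missing features or target; Ridge cannot fit on NaNs
    clean = train_df.dropna(subset=features + [target])
    if clean.empty:
        raise ValueError(f"No usable training rows for '{target}'; check WEEK and the points column.")
    X = clean[features].values
    y = clean[target].values
    model = Ridge(alpha=1.0)
    model.fit(X, y)
    return model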

spread_model = fit_ridge(train, "spread_actual", feature_cols)
total_model  = fit_ridge(train, "total_actual",  feature_cols)

test = test.copy()
test["spread_pred"] = spread_model.predict(test[feature_cols].values)
test["total_pred"]  = total_model.predict(test[feature_cols].values)

test[["game_id","team_home","team_away","spread_pred","total_pred"]].head()



## Quick Evaluation

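Compute MAE for both targets on the test week (or a validation split). A single week holds at most 16 games,
so treat these numbers as a smoke test. For a proper backtest, loop over weeks, retrain on all earlier weeks each time, and pool the errors across folds.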
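A walk-forward backtest can be sketched as follows. This is an illustration on synthetic data, not the notebook's pipeline: `walk_forward_mae` and `min_train_weeks` are hypothetical names, and the toy frame only mimics the `week`, feature, and target columns of `matchups`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

def walk_forward_mae(df, features, target, min_train_weeks=2, alpha=1.0):
    """For each week w: train on weeks < w, test on week w, record the MAE."""
    rows = []
    for w in sorted(df["week"].unique()):
        train = df[df["week"] < w].dropna(subset=features + [target])
        test = df[df["week"] == w].dropna(subset=features + [target])
        # Skip folds with too little history or nothing to score
        if train["week"].nunique() < min_train_weeks or test.empty:
            continue
        model = Ridge(alpha=alpha).fit(train[features].to_numpy(), train[target].to_numpy())
        err = np.abs(model.predict(test[features].to_numpy()) - test[target].to_numpy())
        rows.append({"week": int(w), "n_games": len(test), "mae": float(err.mean())})
    return pd.DataFrame(rows)

# Synthetic matchups: 6 weeks, 8 games each, target loosely tied to the feature
rng = np.random.default_rng(0)
n_weeks, games_per_week = 6, 8
toy = pd.DataFrame({
    "week": np.repeat(np.arange(1, n_weeks + 1), games_per_week),
    "yds_diff_r": rng.normal(0, 50, n_weeks * games_per_week),
})
toy["spread_actual"] = 0.05 * toy["yds_diff_r"] + rng.normal(0, 3, len(toy))

bt = walk_forward_mae(toy, ["yds_diff_r"], "spread_actual")
print(bt)
# Weight each week's MAE by its game count to get the pooled error
print("pooled MAE:", float((bt["mae"] * bt["n_games"]).sum() / bt["n_games"].sum()))
```

Expanding the training window each week matches how the model would be used in season, since only past games are available at prediction time.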


In [None]:

def mae(a, b):
    return float(np.abs(a - b).mean()) if len(a) else np.nan

mae_spread = mae(test["spread_pred"], test["spread_actual"]) if "spread_actual" in test else np.nan
mae_total  = mae(test["total_pred"],  test["total_actual"])  if "total_actual"  in test else np.nan

print(f"[eval] MAE spread (week={WEEK}): {mae_spread:.3f}")
print(f"[eval] MAE total  (week={WEEK}): {mae_total:.3f}")



## Export Predictions

Write per-game predictions for the chosen week:
`data/games/model_preds_week{WEEK}.csv`

This mirrors your props flow so you can later add a **site page** and **edges vs market**.


In [None]:

out = test[["game_id","team_home","team_away","spread_pred","total_pred"]].rename(
    columns={"spread_pred":"model_spread","total_pred":"model_total"}
).copy()

out_path = OUT_DIR / f"model_preds_week{WEEK}.csv"
out.to_csv(out_path, index=False)
print(f"[write] {out_path.as_posix()} ({out.shape[0]} rows)")
out.head()



## Next Steps

- Add richer features (EPA/play, success rate, early-downs rates, pace).
- Injury/QB adjustments (at minimum a QB availability flag or ELO-style rating shift).
- Weather/dome flags.
- Calibration & book comparisons:
  - Convert spread to **win probability** with a logistic link.
  - Compare to market moneylines and compute **edge**.
- Integrate into Makefile (`make game_preds`) once satisfied with outputs.
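The calibration items above can be sketched as follows. The logistic `scale` of 7.0 is a placeholder assumption to be fitted on historical games, and the function names are hypothetical. The moneyline conversion is the standard American-odds formula and still includes the bookmaker's margin.

```python
import math

def spread_to_win_prob(spread: float, scale: float = 7.0) -> float:
    """Logistic link: P(home win) from a predicted home-minus-away margin.

    `scale` is an assumed tuning constant; fit it on past results.
    """
    return 1.0 / (1.0 + math.exp(-spread / scale))

def moneyline_to_prob(american_odds: float) -> float:
    """Implied probability from American odds (vig not removed)."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100.0)
    return 100.0 / (american_odds + 100.0)

model_p = spread_to_win_prob(3.0)    # model favors home by 3 points
market_p = moneyline_to_prob(-150)   # market prices home at -150
edge = model_p - market_p            # positive means the model likes home more than the market
print(f"model={model_p:.3f} market={market_p:.3f} edge={edge:+.3f}")
```

For a fairer comparison, remove the vig first by normalizing the two sides' implied probabilities so they sum to 1.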


In [None]:

# OPTIONAL: Merge with market lines to compute edges later.
# This block is a stub; you'll likely have to map game keys with your odds format.
if ODDS_CSV.exists():
    odds = pd.read_csv(ODDS_CSV)
    print("[odds] shape:", odds.shape)
    display(odds.head())
else:
    print("[odds] not found; skip for now.")
