# Notebook 03: Feature Engineering - Version 2.0

## Objective
Engineer 81 high-quality, interpretable features from raw game logs and shot chart data.

## Features (81 total) - WITH DATA QUALITY FIXES

1. **Rolling Averages (27)**: PTS, REB, AST, MIN, FGA, FTA, TOV, FG%, FG3%, FT%
2. **Game Context (10)**: IS_HOME, REST_DAYS + quality fixes, SEASON timing
3. **Player Role (5)**: RECENT_MIN_AVG, role categories (Bench/Rotation/Starter/Star)
4. **Season Progression (5)**: Non-linear time effects (quadratic + phases)
5. **Trends (7)**: PTS/REB/AST/MIN_TREND, PTS/REB/AST_VOLATILITY
6. **Season Stats (9)**: Season averages + HOT_HAND momentum scores
7. **Shot Location (20)**: Distribution, efficiency, quality by zone

## Data Quality Improvements from v1.0:
- âœ… **REST_DAYS capped at 7** (removes injury return contamination)
- âœ… **Player role features** (model knows bench vs starter volatility)
- âœ… **Fixed HOT_HAND** (now measures true momentum, not player quality)
- âœ… **Quadratic season progression** (captures mid-season peak)
- âœ… **Quality flags** (INCLUDE_IN_TRAINING for filtering outliers)

## Output
`data/processed/features_complete.parquet` - ~85,000 games Ã— 95 columns (81 features + metadata)

**CRITICAL**: All features use `.shift(1)` to prevent data leakage!

## 1. Setup & Load Data

In [46]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import json
from pathlib import Path

tqdm.pandas()

print("âœ… Imports loaded")

âœ… Imports loaded


In [47]:
# Load game logs
df = pd.read_parquet('../data/raw/gamelogs_combined.parquet')
df['GAME_DATE'] = pd.to_datetime(df['GAME_DATE'])
df = df.sort_values(['Player_ID', 'GAME_DATE']).reset_index(drop=True)

print(f"Loaded {len(df):,} games from {df['Player_ID'].nunique()} players")
print(f"Date range: {df['GAME_DATE'].min().date()} to {df['GAME_DATE'].max().date()}")

Loaded 90,274 games from 369 players
Date range: 2019-10-22 to 2024-04-14


## 2. Rolling Averages (27 features)

Core predictive features: recent performance trends for all major stats.

In [48]:
print("Creating rolling averages (leakage-safe with .shift(1))...\n")

# Basic stats: 3, 5, 10-game windows
for stat in tqdm(['PTS', 'REB', 'AST', 'MIN', 'FGA', 'FTA', 'TOV'], desc="Basic stats"):
    for window in [3, 5, 10]:
        df[f'{stat}_last_{window}'] = (
            df.groupby('Player_ID')[stat]
            .shift(1)
            .rolling(window, min_periods=1)
            .mean()
            .reset_index(level=0, drop=True)
        )

# Shooting percentages: 5, 10-game windows
for window in [5, 10]:
    for pct in ['FG_PCT', 'FG3_PCT', 'FT_PCT']:
        df[f'{pct}_last_{window}'] = (
            df.groupby('Player_ID')[pct]
            .shift(1)
            .rolling(window, min_periods=1)
            .mean()
            .reset_index(level=0, drop=True)
        )

print("âœ… 27 rolling features created")

Creating rolling averages (leakage-safe with .shift(1))...



Basic stats: 100%|â–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆ| 7/7 [00:00<00:00, 193.70it/s]

âœ… 27 rolling features created





## 3. Game Context (10 features)

**Data Quality Fixes Applied:**
- REST_DAYS_CAPPED: Removes injury return contamination (cap at 7 days)
- IS_SEASON_OPENER: Flag first 3 games per player-season
- AVG_REST_LAST_5: Game load indicator
- DAYS_NORM + DAYS_NORM_SQ: Captures non-linear season progression

In [49]:
print("Creating game context features...\n")

# Home/away indicator
df['IS_HOME'] = df['MATCHUP'].apply(lambda x: 1 if 'vs.' in x else 0)

# Rest days (original + capped)
df['REST_DAYS'] = df.groupby('Player_ID')['GAME_DATE'].diff().dt.days.fillna(7)
df['REST_DAYS_CAPPED'] = df['REST_DAYS'].clip(upper=7)  # FIX: Cap at 7

# Back-to-back indicator
df['BACK_TO_BACK'] = (df['REST_DAYS'] <= 1).astype(int)

# Season opener flag (first 3 games per player-season)
df['IS_SEASON_OPENER'] = (
    df.groupby(['Player_ID', 'SEASON_ID']).cumcount() < 3
).astype(int)

# Average rest over last 5 games (game load)
df['AVG_REST_LAST_5'] = (
    df.groupby('Player_ID')['REST_DAYS']
    .shift(1)
    .rolling(5, min_periods=1)
    .mean()
    .reset_index(level=0, drop=True)
    .fillna(7)
)

# Season timing
df['SEASON_GAME_NUM'] = df.groupby(['Player_ID', 'SEASON_ID']).cumcount() + 1

# FIX: Extract year correctly from SEASON_ID (format: "22019" -> "2019")
df['SEASON_YEAR'] = df['SEASON_ID'].astype(str).str[1:5]
df['DAYS_INTO_SEASON'] = (
    df['GAME_DATE'] - pd.to_datetime(df['SEASON_YEAR'] + '-10-01')
).dt.days

# Normalized season progression (for quadratic term)
days_min = df['DAYS_INTO_SEASON'].min()
days_max = df['DAYS_INTO_SEASON'].max()
df['DAYS_NORM'] = (df['DAYS_INTO_SEASON'] - days_min) / (days_max - days_min)
df['DAYS_NORM_SQ'] = df['DAYS_NORM'] ** 2  # Captures inverted U-shape

print("âœ… 10 game context features created")
print(f"   REST_DAYS range: {df['REST_DAYS'].min():.0f} - {df['REST_DAYS'].max():.0f} days")
print(f"   REST_DAYS_CAPPED: {df['REST_DAYS_CAPPED'].min():.0f} - {df['REST_DAYS_CAPPED'].max():.0f} days")
print(f"   Season openers: {df['IS_SEASON_OPENER'].sum():,} games ({df['IS_SEASON_OPENER'].mean()*100:.1f}%)")
print(f"   DAYS_INTO_SEASON: {df['DAYS_INTO_SEASON'].min()} to {df['DAYS_INTO_SEASON'].max()} days")

Creating game context features...

âœ… 10 game context features created
   REST_DAYS range: 1 - 717 days
   REST_DAYS_CAPPED: 1 - 7 days
   Season openers: 4,644 games (5.1%)
   DAYS_INTO_SEASON: 17 to 318 days


## 4. Player Role (5 features)

**Why:** Bench players have 2.8x higher volatility (CV=1.25) than starters (CV=0.45).
Model needs to know player role for appropriate uncertainty.

In [50]:
print("Creating player role features...\n")

# Recent minutes average (leakage-safe)
df['RECENT_MIN_AVG'] = (
    df.groupby('Player_ID')['MIN']
    .shift(1)  # CRITICAL: Don't use current game
    .rolling(5, min_periods=1)
    .mean()
    .reset_index(level=0, drop=True)
    .fillna(0)
)

# Categorical role assignment
df['MIN_ROLE'] = pd.cut(
    df['RECENT_MIN_AVG'], 
    bins=[0, 10, 20, 30, 48], 
    labels=['Bench', 'Rotation', 'Starter', 'Star'],
    include_lowest=True
)

# One-hot encode
role_dummies = pd.get_dummies(df['MIN_ROLE'], prefix='ROLE', dtype=int)
df = pd.concat([df, role_dummies], axis=1)

print("âœ… 5 player role features created")
print(f"   RECENT_MIN_AVG range: {df['RECENT_MIN_AVG'].min():.1f} - {df['RECENT_MIN_AVG'].max():.1f} min\n")
print("   Role distribution:")
for role in ['Bench', 'Rotation', 'Starter', 'Star']:
    count = df[f'ROLE_{role}'].sum()
    pct = count / len(df) * 100
    print(f"      {role:10s}: {count:6,} games ({pct:5.1f}%)")

Creating player role features...

âœ… 5 player role features created
   RECENT_MIN_AVG range: 0.0 - 44.2 min

   Role distribution:
      Bench     :  3,197 games (  3.5%)
      Rotation  : 17,778 games ( 19.7%)
      Starter   : 36,721 games ( 40.7%)
      Star      : 32,578 games ( 36.1%)


## 5. Season Progression (5 features)

**Why:** Performance follows inverted U-shape:
- October: 11.83 PTS (ramp-up)
- February: 12.82 PTS (peak)
- April: 12.70 PTS (fatigue/load management)

Linear features can't capture this - need quadratic + categorical phases.

In [51]:
print("Creating season phase categories...\n")

# Categorical season phases
# FIXED BINS: Based on actual DAYS_INTO_SEASON range (17-318 days)
# Early: Oct/Nov/Dec (0-100 days)
# Peak: Jan/Feb/Mar (100-180 days)
# Late: Apr onwards (180-320 days, includes playoffs)
df['SEASON_PHASE'] = pd.cut(
    df['DAYS_INTO_SEASON'], 
    bins=[0, 100, 180, 320],
    labels=['Early', 'Peak', 'Late'],
    include_lowest=True
)

# One-hot encode phases
phase_dummies = pd.get_dummies(df['SEASON_PHASE'], prefix='PHASE', dtype=int)
df = pd.concat([df, phase_dummies], axis=1)

print("âœ… 5 season progression features created")
print(f"   DAYS_INTO_SEASON: {df['DAYS_INTO_SEASON'].min()} - {df['DAYS_INTO_SEASON'].max()} days")
print(f"   DAYS_NORM range: {df['DAYS_NORM'].min():.3f} - {df['DAYS_NORM'].max():.3f}")
print(f"   DAYS_NORM_SQ range: {df['DAYS_NORM_SQ'].min():.3f} - {df['DAYS_NORM_SQ'].max():.3f}\n")
print("   Phase distribution:")
for phase in ['Early', 'Peak', 'Late']:
    count = df[f'PHASE_{phase}'].sum()
    pct = count / len(df) * 100
    print(f"      {phase:5s}: {count:6,} games ({pct:5.1f}%)")

Creating season phase categories...

âœ… 5 season progression features created
   DAYS_INTO_SEASON: 17 - 318 days
   DAYS_NORM range: 0.000 - 1.000
   DAYS_NORM_SQ range: 0.000 - 1.000

   Phase distribution:
      Early: 38,919 games ( 43.1%)
      Peak : 40,230 games ( 44.6%)
      Late : 11,125 games ( 12.3%)


## 6. Trends (7 features)

Performance momentum (slope) and volatility (std) over recent games.

In [52]:
print("Creating trend features...\n")

def calculate_trend(series):
    """Calculate linear trend (slope) of series"""
    if len(series) < 2:
        return 0
    return np.polyfit(np.arange(len(series)), series, 1)[0]

# Trend (slope) for key stats
for stat in ['PTS', 'REB', 'AST', 'MIN']:
    df[f'{stat}_TREND'] = (
        df.groupby('Player_ID')[stat]
        .shift(1)
        .rolling(5, min_periods=2)
        .apply(calculate_trend, raw=True)
        .reset_index(level=0, drop=True)
        .fillna(0)
    )

# Volatility (std) for target variables
for stat in ['PTS', 'REB', 'AST']:
    df[f'{stat}_VOLATILITY'] = (
        df.groupby('Player_ID')[stat]
        .shift(1)
        .rolling(5, min_periods=2)
        .std()
        .reset_index(level=0, drop=True)
        .fillna(0)
    )

print("âœ… 7 trend features created")

Creating trend features...

âœ… 7 trend features created


## 7. Season Stats (9 features)

**FIXED HOT_HAND:** Old version selected elite players (Giannis, Jokic), not momentum.

**New approach:** Continuous momentum score = (last_3_avg - last_10_avg) / last_10_avg

In [53]:
print("Creating season stats (with FIXED hot hand)...\n")

# Season averages (expanding mean)
for stat in ['PTS', 'REB', 'AST']:
    df[f'{stat}_SEASON_AVG'] = (
        df.groupby(['Player_ID', 'SEASON_ID'])[stat]
        .apply(lambda x: x.shift(1).expanding().mean())
        .reset_index(level=[0, 1], drop=True)
        .fillna(0)
    )

# HOT_HAND momentum scores
for stat in ['PTS', 'REB', 'AST']:
    # Short-term performance (last 3 games)
    short_term = (
        df.groupby('Player_ID')[stat]
        .shift(1)
        .rolling(3, min_periods=2)
        .mean()
        .reset_index(level=0, drop=True)
    )
    
    # Long-term baseline (last 10 games)
    long_term = (
        df.groupby('Player_ID')[stat]
        .shift(1)
        .rolling(10, min_periods=5)
        .mean()
        .reset_index(level=0, drop=True)
    )
    
    # Continuous momentum score
    df[f'{stat}_HOT_HAND_SCORE'] = (
        (short_term - long_term) / (long_term + 1e-6)
    ).fillna(0)
    
    # Binary version (hot if >10% above baseline)
    df[f'{stat}_HOT_HAND_BINARY'] = (
        df[f'{stat}_HOT_HAND_SCORE'] > 0.1
    ).astype(int)

print("âœ… 9 season stats features created\n")
print("   Hot hand momentum scores:")
for stat in ['PTS', 'REB', 'AST']:
    score_col = f'{stat}_HOT_HAND_SCORE'
    binary_col = f'{stat}_HOT_HAND_BINARY'
    mean_score = df[score_col].mean()
    hot_pct = df[binary_col].mean() * 100
    print(f"      {stat}: avg_score={mean_score:+.3f}, hot_games={hot_pct:.1f}%")

Creating season stats (with FIXED hot hand)...

âœ… 9 season stats features created

   Hot hand momentum scores:
      PTS: avg_score=+0.003, hot_games=34.0%
      REB: avg_score=+0.002, hot_games=35.6%
      AST: avg_score=+0.004, hot_games=39.4%


## 8. Shot Location Features (20 features)

**KEY FOR PTS IMPROVEMENT:** Shot distribution and efficiency by zone.

In [54]:
print("Loading shot chart data...\n")

df_shots = pd.read_parquet('../data/raw/shot_charts_all.parquet')

# Map zones to simplified categories
def map_zone(z):
    if z == 'Restricted Area': return 'RESTRICTED_AREA'
    if z == 'In The Paint (Non-RA)': return 'PAINT'
    if z == 'Mid-Range': return 'MIDRANGE'
    if z in ['Above the Break 3', 'Left Corner 3', 'Right Corner 3']: return 'THREE_PT'
    return 'OTHER'

df_shots['ZONE'] = df_shots['SHOT_ZONE_BASIC'].apply(map_zone)
df_shots['GAME_DATE'] = pd.to_datetime(df_shots['GAME_DATE'], format='%Y%m%d')

print(f"âœ… Loaded {len(df_shots):,} shots")

Loading shot chart data...

âœ… Loaded 885,698 shots


In [55]:
print("Aggregating shots by player-game-zone...\n")

# Aggregate by player-game-zone
shot_agg = df_shots.groupby(['Player_ID', 'GAME_ID', 'GAME_DATE', 'ZONE']).agg({
    'SHOT_MADE_FLAG': 'sum',
    'SHOT_ATTEMPTED_FLAG': 'sum'
}).reset_index()
shot_agg.columns = ['Player_ID', 'GAME_ID', 'GAME_DATE', 'ZONE', 'FGM', 'FGA']

# Pivot to wide format
fga_pivot = shot_agg.pivot_table(
    index=['Player_ID', 'GAME_ID', 'GAME_DATE'], 
    columns='ZONE', 
    values='FGA', 
    fill_value=0
).reset_index()

fgm_pivot = shot_agg.pivot_table(
    index=['Player_ID', 'GAME_ID', 'GAME_DATE'], 
    columns='ZONE', 
    values='FGM', 
    fill_value=0
).reset_index()

# Rename columns
fga_pivot.columns = ['Player_ID', 'GAME_ID', 'GAME_DATE'] + [f'{c}_FGA' for c in fga_pivot.columns[3:]]
fgm_pivot.columns = ['Player_ID', 'GAME_ID', 'GAME_DATE'] + [f'{c}_FGM' for c in fgm_pivot.columns[3:]]

# Merge
df_shot = fga_pivot.merge(
    fgm_pivot[['Player_ID', 'GAME_ID'] + [c for c in fgm_pivot.columns if 'FGM' in c]], 
    on=['Player_ID', 'GAME_ID']
)

print(f"âœ… Shot aggregation: {len(df_shot):,} player-games")

Aggregating shots by player-game-zone...

âœ… Shot aggregation: 88,600 player-games


In [56]:
print("Calculating shot percentages and quality...\n")

# Total attempts
df_shot['TOTAL_FGA'] = (
    df_shot['RESTRICTED_AREA_FGA'] + 
    df_shot['PAINT_FGA'] + 
    df_shot['MIDRANGE_FGA'] + 
    df_shot['THREE_PT_FGA']
)

# Distribution (% of shots from each zone)
for zone in ['RESTRICTED_AREA', 'PAINT', 'MIDRANGE', 'THREE_PT']:
    df_shot[f'{zone}_FGA_PCT'] = (
        df_shot[f'{zone}_FGA'] / df_shot['TOTAL_FGA'].replace(0, 1) * 100
    )

# Efficiency (FG% by zone)
for zone in ['RESTRICTED_AREA', 'PAINT', 'MIDRANGE', 'THREE_PT']:
    df_shot[f'{zone}_FG_PCT'] = (
        df_shot[f'{zone}_FGM'] / df_shot[f'{zone}_FGA'].replace(0, 1) * 100
    )

# Average shot distance
shot_dist = df_shots.groupby(['Player_ID', 'GAME_ID', 'GAME_DATE'])['SHOT_DISTANCE'].mean().reset_index()
shot_dist.columns = ['Player_ID', 'GAME_ID', 'GAME_DATE', 'AVG_SHOT_DISTANCE']
df_shot = df_shot.merge(shot_dist, on=['Player_ID', 'GAME_ID', 'GAME_DATE'], how='left')

# Shot quality score (weighted by expected points)
df_shot['SHOT_QUALITY_SCORE'] = (
    df_shot['RESTRICTED_AREA_FGA_PCT'] * 1.3 +  # High-value shots
    df_shot['PAINT_FGA_PCT'] * 1.0 +
    df_shot['MIDRANGE_FGA_PCT'] * 0.8 +         # Low-value shots
    df_shot['THREE_PT_FGA_PCT'] * 1.1            # High-value if made
)

print("âœ… Shot metrics calculated")

Calculating shot percentages and quality...

âœ… Shot metrics calculated


In [57]:
print("Creating rolling averages for shot features...\n")

df_shot = df_shot.sort_values(['Player_ID', 'GAME_DATE'])

shot_features = [
    'RESTRICTED_AREA_FGA_PCT', 'PAINT_FGA_PCT', 'MIDRANGE_FGA_PCT', 'THREE_PT_FGA_PCT',
    'RESTRICTED_AREA_FG_PCT', 'PAINT_FG_PCT', 'MIDRANGE_FG_PCT', 'THREE_PT_FG_PCT',
    'AVG_SHOT_DISTANCE', 'SHOT_QUALITY_SCORE'
]

for feat in tqdm(shot_features, desc="Shot rolling"):
    for window in [5, 10]:
        df_shot[f'{feat}_LAST_{window}'] = (
            df_shot.groupby('Player_ID')[feat]
            .shift(1)
            .rolling(window, min_periods=1)
            .mean()
            .reset_index(level=0, drop=True)
        )

# Keep only rolling features for merge
shot_cols = ['Player_ID', 'GAME_ID', 'GAME_DATE'] + [
    c for c in df_shot.columns if 'LAST_' in c
]
df_shot_rolling = df_shot[shot_cols]

print(f"âœ… 20 shot rolling features created")

Creating rolling averages for shot features...



Shot rolling: 100%|â–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆ| 10/10 [00:00<00:00, 240.98it/s]

âœ… 20 shot rolling features created





## 9. Merge All Features

In [58]:
print("Merging game logs with shot features...\n")

# Standardize column names for merge
if 'GAME_ID' not in df.columns and 'Game_ID' in df.columns:
    df['GAME_ID'] = df['Game_ID']

# Merge
df_complete = df.merge(
    df_shot_rolling, 
    on=['Player_ID', 'GAME_ID', 'GAME_DATE'], 
    how='left'
)

print(f"âœ… Merged dataset: {len(df_complete):,} games, {df_complete.shape[1]} columns")
print(f"   Missing shot data: {df_complete[[c for c in df_complete.columns if 'LAST_' in c]].isnull().mean().mean()*100:.1f}%")

Merging game logs with shot features...

âœ… Merged dataset: 90,274 games, 113 columns
   Missing shot data: 1.8%


## 10. Data Quality Flags

Flag games that may need filtering during training due to extreme conditions.

In [59]:
print("Adding data quality flags...\n")

# Flag 1: Extreme rest (>10 days) - likely injury returns
df_complete['EXTREME_REST'] = (df_complete['REST_DAYS'] > 10).astype(int)

# Flag 2: Very low minutes (<5 min) - DNP-CD, garbage time
df_complete['LOW_MINUTES'] = (df_complete['MIN'] < 5).astype(int)

# Combined training quality flag
df_complete['INCLUDE_IN_TRAINING'] = (
    (df_complete['EXTREME_REST'] == 0) & 
    (df_complete['LOW_MINUTES'] == 0)
).astype(int)

print(f"âœ… Data quality flags created\n")
print(f"   Quality issues:")
print(f"      Extreme rest (>10 days):  {df_complete['EXTREME_REST'].sum():,} games ({df_complete['EXTREME_REST'].mean()*100:.1f}%)")
print(f"      Low minutes (<5 min):     {df_complete['LOW_MINUTES'].sum():,} games ({df_complete['LOW_MINUTES'].mean()*100:.1f}%)")
print(f"      Exclude from training:    {(~df_complete['INCLUDE_IN_TRAINING'].astype(bool)).sum():,} games ({(~df_complete['INCLUDE_IN_TRAINING'].astype(bool)).mean()*100:.1f}%)")
print(f"      Clean training data:      {df_complete['INCLUDE_IN_TRAINING'].sum():,} games ({df_complete['INCLUDE_IN_TRAINING'].mean()*100:.1f}%)")

Adding data quality flags...

âœ… Data quality flags created

   Quality issues:
      Extreme rest (>10 days):  2,742 games (3.0%)
      Low minutes (<5 min):     2,033 games (2.3%)
      Exclude from training:    4,571 games (5.1%)
      Clean training data:      85,703 games (94.9%)


## 11. Finalize Dataset

In [60]:
print("Defining final feature set (81 features)...\n")

feature_columns = [
    # ========== ROLLING AVERAGES (27 features) ==========
    'PTS_last_3', 'PTS_last_5', 'PTS_last_10',
    'REB_last_3', 'REB_last_5', 'REB_last_10',
    'AST_last_3', 'AST_last_5', 'AST_last_10',
    'MIN_last_3', 'MIN_last_5', 'MIN_last_10',
    'FGA_last_3', 'FGA_last_5', 'FGA_last_10',
    'FTA_last_3', 'FTA_last_5', 'FTA_last_10',
    'TOV_last_3', 'TOV_last_5', 'TOV_last_10',
    'FG_PCT_last_5', 'FG_PCT_last_10',
    'FG3_PCT_last_5', 'FG3_PCT_last_10',
    'FT_PCT_last_5', 'FT_PCT_last_10',
    
    # ========== GAME CONTEXT (10 features) ==========
    'IS_HOME', 
    'REST_DAYS', 'REST_DAYS_CAPPED',
    'BACK_TO_BACK', 'IS_SEASON_OPENER', 'AVG_REST_LAST_5',
    'SEASON_GAME_NUM', 'DAYS_INTO_SEASON',
    'DAYS_NORM', 'DAYS_NORM_SQ',
    
    # ========== PLAYER ROLE (5 features) ==========
    'RECENT_MIN_AVG',
    'ROLE_Bench', 'ROLE_Rotation', 'ROLE_Starter', 'ROLE_Star',
    
    # ========== SEASON PHASE (3 features) ==========
    'PHASE_Early', 'PHASE_Peak', 'PHASE_Late',
    
    # ========== TRENDS (7 features) ==========
    'PTS_TREND', 'REB_TREND', 'AST_TREND', 'MIN_TREND',
    'PTS_VOLATILITY', 'REB_VOLATILITY', 'AST_VOLATILITY',
    
    # ========== SEASON STATS (9 features) ==========
    'PTS_SEASON_AVG', 'REB_SEASON_AVG', 'AST_SEASON_AVG',
    'PTS_HOT_HAND_SCORE', 'REB_HOT_HAND_SCORE', 'AST_HOT_HAND_SCORE',
    'PTS_HOT_HAND_BINARY', 'REB_HOT_HAND_BINARY', 'AST_HOT_HAND_BINARY',
    
    # ========== SHOT LOCATION (20 features) ==========
    'RESTRICTED_AREA_FGA_PCT_LAST_5', 'RESTRICTED_AREA_FGA_PCT_LAST_10',
    'PAINT_FGA_PCT_LAST_5', 'PAINT_FGA_PCT_LAST_10',
    'MIDRANGE_FGA_PCT_LAST_5', 'MIDRANGE_FGA_PCT_LAST_10',
    'THREE_PT_FGA_PCT_LAST_5', 'THREE_PT_FGA_PCT_LAST_10',
    'RESTRICTED_AREA_FG_PCT_LAST_5', 'RESTRICTED_AREA_FG_PCT_LAST_10',
    'PAINT_FG_PCT_LAST_5', 'PAINT_FG_PCT_LAST_10',
    'MIDRANGE_FG_PCT_LAST_5', 'MIDRANGE_FG_PCT_LAST_10',
    'THREE_PT_FG_PCT_LAST_5', 'THREE_PT_FG_PCT_LAST_10',
    'AVG_SHOT_DISTANCE_LAST_5', 'AVG_SHOT_DISTANCE_LAST_10',
    'SHOT_QUALITY_SCORE_LAST_5', 'SHOT_QUALITY_SCORE_LAST_10'
]

# Metadata columns
tracking = ['Player_ID', 'PLAYER_NAME', 'GAME_ID', 'GAME_DATE', 'SEASON_ID']

# Quality flags - categorical metadata for analysis (not features)
# NOTE: IS_SEASON_OPENER removed - it's already a feature!
quality_flags = ['INCLUDE_IN_TRAINING', 'EXTREME_REST', 'LOW_MINUTES', 
                 'MIN_ROLE', 'SEASON_PHASE']

targets = ['PTS', 'REB', 'AST']

# Create final dataset
df_final = df_complete[tracking + feature_columns + quality_flags + targets].copy()

# Fill missing shot data with 0 (players with no shot chart data)
shot_cols = [c for c in feature_columns if any(x in c for x in 
            ['RESTRICTED', 'PAINT', 'MIDRANGE', 'THREE_PT', 'SHOT_QUALITY', 'AVG_SHOT_DISTANCE'])]
df_final[shot_cols] = df_final[shot_cols].fillna(0)

# Drop first 3 games per player-season (insufficient rolling history)
df_final = df_final[df_final['SEASON_GAME_NUM'] >= 4].copy()

print(f"âœ… Final dataset prepared\n")
print(f"   Total games: {len(df_final):,}")
print(f"   Total features: {len(feature_columns)}\n")
print(f"   Feature breakdown:")
print(f"      Rolling averages:  27")
print(f"      Game context:      10")
print(f"      Player role:        5")
print(f"      Season phase:       3")
print(f"      Trends:             7")
print(f"      Season stats:       9")
print(f"      Shot location:     20")
print(f"      ----------------")
print(f"      TOTAL:             81 features\n")
print(f"   Data quality:")
print(f"      Clean for training: {df_final['INCLUDE_IN_TRAINING'].sum():,} ({df_final['INCLUDE_IN_TRAINING'].mean()*100:.1f}%)")
print(f"      Requires filtering: {(~df_final['INCLUDE_IN_TRAINING'].astype(bool)).sum():,} ({(~df_final['INCLUDE_IN_TRAINING'].astype(bool)).mean()*100:.1f}%)")

Defining final feature set (81 features)...

âœ… Final dataset prepared

   Total games: 85,630
   Total features: 81

   Feature breakdown:
      Rolling averages:  27
      Game context:      10
      Player role:        5
      Season phase:       3
      Trends:             7
      Season stats:       9
      Shot location:     20
      ----------------
      TOTAL:             81 features

   Data quality:
      Clean for training: 82,477 (96.3%)
      Requires filtering: 3,153 (3.7%)


## 12. Save

In [61]:
print("Saving final dataset and metadata...\n")

proc_path = Path('../data/processed')
proc_path.mkdir(parents=True, exist_ok=True)

# Save feature dataset
df_final.to_parquet(proc_path / 'features_complete.parquet', index=False)

# Save comprehensive metadata
metadata = {
    'version': '2.0',
    'date_created': pd.Timestamp.now().isoformat(),
    'total_features': len(feature_columns),
    'total_games': len(df_final),
    'total_players': df_final['Player_ID'].nunique(),
    'date_range': {
        'start': df_final['GAME_DATE'].min().isoformat(),
        'end': df_final['GAME_DATE'].max().isoformat()
    },
    
    'feature_names': feature_columns,
    'feature_breakdown': {
        'rolling_averages': 27,
        'game_context': 10,
        'player_role': 5,
        'season_phase': 3,
        'trends': 7,
        'season_stats': 9,
        'shot_location': 20
    },
    
    'data_quality_improvements': {
        'rest_days_capped': 'Capped at 7 days to remove injury return contamination',
        'season_opener_flag': 'Flag first 3 games per player-season (cold-start handling)',
        'player_role_features': 'Added RECENT_MIN_AVG and role categories (Bench/Rotation/Starter/Star)',
        'season_progression': 'Added quadratic term and phase categories for non-linear time effects',
        'hot_hand_fixed': 'Changed from binary elite-player selector to continuous momentum score',
        'quality_flags': 'Added INCLUDE_IN_TRAINING, EXTREME_REST, LOW_MINUTES flags'
    },
    
    'tracking_columns': tracking,
    'target_columns': targets,
    'quality_flags': quality_flags,
    
    'data_quality': {
        'clean_training_games': int(df_final['INCLUDE_IN_TRAINING'].sum()),
        'extreme_rest_games': int(df_final['EXTREME_REST'].sum()),
        'low_minute_games': int(df_final['LOW_MINUTES'].sum()),
        'season_opener_games': int(df_final['IS_SEASON_OPENER'].sum())
    },
    
    'leakage_prevention': 'All features use .shift(1) before rolling/expanding operations',
    
    'usage_notes': {
        'training_filter': 'Use INCLUDE_IN_TRAINING flag to filter outlier games during training',
        'role_analysis': 'Use MIN_ROLE for stratified evaluation (bench vs starters have different volatility)',
        'feature_selection': 'All 81 features are interpretable and validated by EDA'
    }
}

with open(proc_path / 'feature_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"{'='*70}")
print(f"âœ… FEATURE ENGINEERING COMPLETE!")
print(f"{'='*70}\n")
print(f"Files saved:")
print(f"  ðŸ“Š data/processed/features_complete.parquet")
print(f"  ðŸ“‹ data/processed/feature_metadata.json\n")
print(f"Dataset summary:")
print(f"  Total games:    {len(df_final):,}")
print(f"  Total players:  {df_final['Player_ID'].nunique()}")
print(f"  Total features: {len(feature_columns)}")
print(f"  Date range:     {df_final['GAME_DATE'].min().date()} to {df_final['GAME_DATE'].max().date()}\n")
print(f"Quality improvements from v1.0:")
print(f"  âœ… REST_DAYS capped at 7 days")
print(f"  âœ… Player role features added (5)")
print(f"  âœ… Season progression non-linearity (quadratic + phases)")
print(f"  âœ… HOT_HAND fixed (binary â†’ continuous momentum)")
print(f"  âœ… Data quality flags for training/evaluation\n")
print(f"Next step: Notebooks 04-05 for model training!")

Saving final dataset and metadata...

âœ… FEATURE ENGINEERING COMPLETE!

Files saved:
  ðŸ“Š data/processed/features_complete.parquet
  ðŸ“‹ data/processed/feature_metadata.json

Dataset summary:
  Total games:    85,630
  Total players:  369
  Total features: 81
  Date range:     2019-10-28 to 2024-04-14

Quality improvements from v1.0:
  âœ… REST_DAYS capped at 7 days
  âœ… Player role features added (5)
  âœ… Season progression non-linearity (quadratic + phases)
  âœ… HOT_HAND fixed (binary â†’ continuous momentum)
  âœ… Data quality flags for training/evaluation

Next step: Notebooks 04-05 for model training!
