# Notebook 03: Feature Engineering (Evidence-Based)

## Objective
Engineer **38 effective features** based on EDA findings from Notebook 02.

## Key Insights from Exploration (Notebook 02):

### What Matters (High Correlation):
- **Player rolling averages** (PTS/REB/AST last 3/5/10 games) - captures recent form
- **Season averages** - baseline performance level
- **Opponent defensive rating** - +9% effect (Elite D: 12.66 PTS vs Weak D: 13.82 PTS)
- **IS_HOME** - consistent +1.7% boost
- **Player shot tendencies** - archetype matters (rim runner vs shooter)

### What Doesn't Matter (Low Correlation):
- **TEAM_PACE** - r=0.014, only +2.4% effect (keep but don't expect much)
- **IS_B2B** - paradox: +1.7% scoring (no fatigue signal) → **EXCLUDE**
- **Hot hand binary** - selection bias artifact → Use continuous momentum instead

### The Fundamental Challenge:
- FGA (r=0.874) and MIN (r=0.683) are strongest predictors but **unavailable at prediction time**
- This creates a performance ceiling - we must predict without knowing usage

## Features (38 total):

1. **Rolling Averages (9)**: PTS/REB/AST last 3, 5, 10 games
2. **Season Context (6)**: Season avg PTS/REB/AST, SEASON_GAME_NUM, month, day_of_week
3. **Opponent Context (4)**: OPP_DEF_RATING, OPP_PACE, OPP_W_PCT, OPP_OFF_RATING
4. **Team Context (4)**: TEAM_DEF_RATING, TEAM_PACE, TEAM_W_PCT, TEAM_OFF_RATING
5. **Game Context (5)**: IS_HOME, DAYS_REST, REST_0_1, REST_2_3, REST_4_PLUS
6. **Shot Tendencies (4)**: Season % from: Restricted Area, Paint, Mid-Range, Three-Point
7. **Momentum (6)**: PTS/REB/AST trends (last 5 games slope) + volatility (std)

## Data Quality:
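- **Leakage prevention**: All rolling, season-to-date, and momentum features apply `.shift(1)` within each player's history, so the current game never feeds its own features
- **Shot tendencies**: Season-to-date shares from prior games only (not per-game, as that data doesn't exist pre-game)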
- **Rest days**: Binned for non-linear effects (4+ days = injury signal)
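
A quick way to check the leakage rule is a perturbation test: change only the current game's target and confirm that the same game's feature does not move. The sketch below runs on toy data; the helper `add_pts_last_3` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd


def add_pts_last_3(frame: pd.DataFrame) -> pd.DataFrame:
    """Add a per-player mean of the previous (up to) 3 games, excluding the current one."""
    out = frame.copy()
    out["PTS_LAST_3"] = out.groupby("PLAYER_ID")["PTS"].transform(
        lambda s: s.shift(1).rolling(3, min_periods=1).mean()
    )
    return out


games = pd.DataFrame({"PLAYER_ID": [1, 1, 1, 1], "PTS": [8.0, 12.0, 20.0, 16.0]})
baseline = add_pts_last_3(games)

# Perturb only the last game's points; its own feature must stay the same.
perturbed_games = games.copy()
perturbed_games.loc[3, "PTS"] = 99.0
perturbed = add_pts_last_3(perturbed_games)

leak_free = baseline.loc[3, "PTS_LAST_3"] == perturbed.loc[3, "PTS_LAST_3"]
print(f"Feature unchanged after perturbing the current game: {leak_free}")
```

If the feature for a game changes when only that game's own target changes, the current game is leaking into its features.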

## Train/Val/Test Split:
- **Temporal split** (not random!)
- Train: < 2023-01-01 (~68%)
- Val: 2023-01-01 up to, but not including, 2024-01-01 (~19%)
- Test: >= 2024-01-01 (~13%)

## Output:
- `data/processed/features_engineered.parquet` - Full dataset with 38 features
- `data/processed/train.parquet` - Training set
- `data/processed/val.parquet` - Validation set
- `data/processed/test.parquet` - Test set
- `data/processed/feature_metadata_v2.json` - Feature documentation

## 1. Setup & Load PROCESSED Data

In [5]:
import pandas as pd
import numpy as np
from pathlib import Path
import json
from tqdm import tqdm

print("✅ Imports loaded")

✅ Imports loaded


In [6]:
# Load PROCESSED game logs (cleaned, deduplicated, with opponent/team context)
df = pd.read_parquet('../data/processed/gamelogs_combined.parquet')
df['GAME_DATE'] = pd.to_datetime(df['GAME_DATE'])
df = df.sort_values(['PLAYER_ID', 'GAME_DATE']).reset_index(drop=True)

print(f"Loaded {len(df):,} games from {df['PLAYER_ID'].nunique()} players")
print(f"Date range: {df['GAME_DATE'].min().date()} to {df['GAME_DATE'].max().date()}")
print(f"\nColumns: {df.shape[1]}")
print(f"Sample opponent/team features: {[c for c in df.columns if 'OPP_' in c or 'TEAM_' in c][:8]}")

Loaded 72,509 games from 289 players
Date range: 2019-10-22 to 2024-04-14

Columns: 45
Sample opponent/team features: ['OPP_TEAM_ABBREV', 'OPP_TEAM_NAME', 'OPP_DEF_RATING', 'OPP_OFF_RATING', 'OPP_PACE', 'OPP_W', 'OPP_L', 'OPP_W_PCT']


## 2. Rolling Averages (9 features)

Recent performance for targets only (not FGA/FTA/TOV - those won't exist at prediction time).

In [7]:
print("Creating rolling averages (leakage-safe with .shift(1))...\n")

# Only for target variables (PTS/REB/AST)
for stat in ['PTS', 'REB', 'AST']:
    for window in [3, 5, 10]:
        # transform keeps both the shift and the rolling window inside each
        # player's group; chaining .rolling() onto groupby().shift() would
        # let windows run across player boundaries.
        df[f'{stat}_LAST_{window}'] = (
            df.groupby('PLAYER_ID')[stat]
            .transform(
                # CRITICAL: shift(1) excludes the current game
                lambda s, w=window: s.shift(1).rolling(w, min_periods=1).mean()
            )
        )

print("✅ 9 rolling features created (PTS/REB/AST × 3/5/10 games)")
print(f"   Sample: PTS_LAST_3 range = {df['PTS_LAST_3'].min():.1f} to {df['PTS_LAST_3'].max():.1f}")

Creating rolling averages (leakage-safe with .shift(1))...

✅ 9 rolling features created (PTS/REB/AST × 3/5/10 games)
   Sample: PTS_LAST_3 range = 0.0 to 52.7
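
Per-player windows need both the shift and the rolling mean to run inside each player's group; a `.rolling()` chained onto the output of `groupby(...).shift(1)` would operate on the whole column instead. The sketch below uses a hypothetical two-player toy frame to show what a per-player result looks like.

```python
import pandas as pd

# Hypothetical toy data: two players, three games each, already sorted by date.
toy = pd.DataFrame({
    "PLAYER_ID": [1, 1, 1, 2, 2, 2],
    "PTS": [10.0, 20.0, 30.0, 2.0, 4.0, 6.0],
})

# Shift and roll inside each player's group so windows never span two players.
toy["PTS_LAST_3"] = toy.groupby("PLAYER_ID")["PTS"].transform(
    lambda s: s.shift(1).rolling(3, min_periods=1).mean()
)
print(toy)
```

Each player's first game has no history and stays `NaN`; player 2's values are built only from player 2's earlier games.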


## 3. Season Context (6 features)

Season averages, game count, timing features.

In [8]:
print("Creating season context features...\n")

# Season averages (expanding mean, leakage-safe)
# transform returns a result aligned to df's index on any pandas version,
# so no reset_index is needed afterwards.
for stat in ['PTS', 'REB', 'AST']:
    df[f'{stat}_SEASON_AVG'] = (
        df.groupby(['PLAYER_ID', 'SEASON'])[stat]
        .transform(lambda x: x.shift(1).expanding().mean())
        .fillna(0)  # first game of a season has no prior games
    )

# Games played this season (so far)
df['SEASON_GAME_NUM'] = df.groupby(['PLAYER_ID', 'SEASON']).cumcount() + 1

# Month (October = 10, April = 4)
df['MONTH'] = df['GAME_DATE'].dt.month

# Day of week (0=Monday, 6=Sunday)
df['DAY_OF_WEEK'] = df['GAME_DATE'].dt.dayofweek

print("✅ 6 season context features created")
print(f"   SEASON_GAME_NUM range: {df['SEASON_GAME_NUM'].min()} to {df['SEASON_GAME_NUM'].max()}")
print(f"   MONTH range: {df['MONTH'].min()} to {df['MONTH'].max()}")

Creating season context features...

✅ 6 season context features created
   SEASON_GAME_NUM range: 1 to 84
   MONTH range: 1 to 12
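
The season average is an expanding mean that restarts every season and excludes the current game. The sketch below is a minimal illustration on hypothetical toy data; the `SEASON` labels are assumed for the example.

```python
import pandas as pd

# Hypothetical toy data: one player, two games in each of two seasons.
toy = pd.DataFrame({
    "PLAYER_ID": [7, 7, 7, 7],
    "SEASON": ["2021-22", "2021-22", "2022-23", "2022-23"],
    "PTS": [10.0, 20.0, 30.0, 50.0],
})

toy["PTS_SEASON_AVG"] = (
    toy.groupby(["PLAYER_ID", "SEASON"])["PTS"]
    .transform(lambda s: s.shift(1).expanding().mean())
    .fillna(0)  # season openers have no prior games in that season
)
print(toy["PTS_SEASON_AVG"].tolist())
```

The season opener is filled with 0, which is not a real average. This is one reason to drop a player's earliest games of each season before modelling.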


## 4. Opponent Context (4 features)

From EDA: OPP_DEF_RATING has +9% effect (Elite D: 12.66 vs Weak D: 13.82 PTS).

In [9]:
print("Opponent context features...\n")

# These columns already exist in processed data from 01_data_collection
opponent_features = ['OPP_DEF_RATING', 'OPP_PACE', 'OPP_W_PCT', 'OPP_OFF_RATING']

# Verify they exist
missing = [c for c in opponent_features if c not in df.columns]
if missing:
    print(f"⚠️  Missing opponent features: {missing}")
else:
    print(f"✅ 4 opponent context features confirmed")
    print(f"   OPP_DEF_RATING range: {df['OPP_DEF_RATING'].min():.1f} to {df['OPP_DEF_RATING'].max():.1f}")
    print(f"   OPP_PACE range: {df['OPP_PACE'].min():.1f} to {df['OPP_PACE'].max():.1f}")

Opponent context features...

✅ 4 opponent context features confirmed
   OPP_DEF_RATING range: 102.5 to 119.6
   OPP_PACE range: 95.6 to 105.5
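
The +9% figure quoted for `OPP_DEF_RATING` follows from the two group means reported in the EDA. As a small worked check (the variable names are illustrative):

```python
# Mean points against elite and weak defenses, as reported in the EDA.
elite_d_pts = 12.66
weak_d_pts = 13.82

# Relative lift of facing a weak defense, measured against the elite-defense mean.
relative_lift = (weak_d_pts - elite_d_pts) / elite_d_pts
print(f"Relative lift: {relative_lift:.1%}")
```

The lift is measured relative to the elite-defense mean; using the weak-defense mean as the base would give a slightly smaller percentage.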


## 5. Team Context (4 features)

From EDA: TEAM_PACE has minimal effect (r=0.014, +2.4%), but keep for completeness.

In [10]:
print("Team context features...\n")

# These columns already exist in processed data from 01_data_collection
team_features = ['TEAM_DEF_RATING', 'TEAM_PACE', 'TEAM_W_PCT', 'TEAM_OFF_RATING']

# Verify they exist
missing = [c for c in team_features if c not in df.columns]
if missing:
    print(f"⚠️  Missing team features: {missing}")
else:
    print(f"✅ 4 team context features confirmed")
    print(f"   TEAM_PACE range: {df['TEAM_PACE'].min():.1f} to {df['TEAM_PACE'].max():.1f}")

Team context features...

✅ 4 team context features confirmed
   TEAM_PACE range: 95.6 to 105.5


## 6. Game Context (5 features)

From EDA: 
- IS_HOME has consistent +1.7% boost
- DAYS_REST has non-linear effect (4+ days = injury signal, lower scoring)
- **Excluding IS_B2B** (paradoxically +1.7% scoring, no fatigue effect)

In [11]:
print("Creating game context features...\n")

# Home/away indicator (already exists in processed data as IS_HOME)
if 'IS_HOME' not in df.columns:
    df['IS_HOME'] = df['MATCHUP'].apply(lambda x: 1 if 'vs.' in x else 0)

# DAYS_REST already exists in processed data
# Create binned versions for non-linear relationship
df['REST_0_1'] = (df['DAYS_REST'] <= 1).astype(int)  # 0-1 days
df['REST_2_3'] = ((df['DAYS_REST'] >= 2) & (df['DAYS_REST'] <= 3)).astype(int)  # 2-3 days
df['REST_4_PLUS'] = (df['DAYS_REST'] >= 4).astype(int)  # 4+ days (injury signal)

print("✅ 5 game context features created")
print(f"   IS_HOME: {df['IS_HOME'].mean()*100:.1f}% home games")
print(f"   REST_0_1: {df['REST_0_1'].mean()*100:.1f}%")
print(f"   REST_2_3: {df['REST_2_3'].mean()*100:.1f}%")
print(f"   REST_4_PLUS: {df['REST_4_PLUS'].mean()*100:.1f}%")
print(f"\n   NOTE: IS_B2B excluded (EDA showed +1.7% paradox, no fatigue effect)")

Creating game context features...

✅ 5 game context features created
   IS_HOME: 50.1% home games
   REST_0_1: 15.8%
   REST_2_3: 72.9%
   REST_4_PLUS: 11.3%

   NOTE: IS_B2B excluded (EDA showed +1.7% paradox, no fatigue effect)
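
The three rest indicators are meant to be mutually exclusive and exhaustive, which matches the recorded shares summing to 100%. An alternative sketch of the same binning with `pd.cut` follows, on hypothetical toy values; the bin edges are chosen to reproduce the 0-1 / 2-3 / 4+ rule above.

```python
import pandas as pd

# Hypothetical rest values covering every bin.
days_rest = pd.Series([0, 1, 2, 3, 4, 9], name="DAYS_REST")

# Right-closed bins: (-inf, 1], (1, 3], (3, inf)
rest_bin = pd.cut(
    days_rest,
    bins=[-float("inf"), 1, 3, float("inf")],
    labels=["REST_0_1", "REST_2_3", "REST_4_PLUS"],
)

# One indicator column per bin; every row should have exactly one 1.
rest_dummies = pd.get_dummies(rest_bin).astype(int)
print(rest_dummies)
print("Exactly one bin per row:", bool((rest_dummies.sum(axis=1) == 1).all()))
```

A missing `DAYS_REST` value would fall into no bin and produce a row of zeros, so the row-sum check also flags missing rest data.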


## 7. Shot Tendencies (4 features)

**CRITICAL**: Shot data doesn't exist pre-game, so we use **season averages** (not per-game).

Calculate % of shots from each zone using player's prior games in the season.

In [13]:
print("Loading shot chart data...\n")

df_shots = pd.read_parquet('../data/processed/shot_charts_all.parquet')
df_shots['GAME_DATE'] = pd.to_datetime(df_shots['GAME_DATE'])

# Rename 'Season' to 'SEASON' for consistency
if 'Season' in df_shots.columns:
    df_shots.rename(columns={'Season': 'SEASON'}, inplace=True)

print(f"Loaded {len(df_shots):,} shots from {df_shots['PLAYER_ID'].nunique()} players")

Loading shot chart data...

Loaded 591,467 shots from 229 players


In [14]:
print("Calculating season-to-date shot tendencies...\n")

# Map zones to simplified categories
def map_zone(z):
    if z == 'Restricted Area': return 'RESTRICTED_AREA'
    if z == 'In The Paint (Non-RA)': return 'PAINT'
    if z == 'Mid-Range': return 'MIDRANGE'
    if z in ['Above the Break 3', 'Left Corner 3', 'Right Corner 3']: return 'THREE_PT'
    return 'OTHER'

df_shots['ZONE'] = df_shots['SHOT_ZONE_BASIC'].apply(map_zone)

# For each player-game, calculate zone percentages from PRIOR games in season
shot_tendencies_list = []

for (player, season), group in tqdm(df_shots.groupby(['PLAYER_ID', 'SEASON']), desc="Players"):
    group = group.sort_values('GAME_DATE')
    game_dates = group['GAME_DATE'].unique()
    
    for game_date in game_dates:
        # Get all PRIOR games in this season
        prior_shots = group[group['GAME_DATE'] < game_date]
        
        if len(prior_shots) == 0:
            # No prior games this season: fall back to fixed default shares
            shot_tendencies_list.append({
                'PLAYER_ID': player,
                'GAME_DATE': game_date,
                'SEASON': season,
                'RESTRICTED_AREA_PCT': 0.25,  # Reasonable defaults
                'PAINT_PCT': 0.15,
                'MIDRANGE_PCT': 0.20,
                'THREE_PT_PCT': 0.40
            })
        else:
            # Calculate zone percentages from prior games
            zone_counts = prior_shots['ZONE'].value_counts()
            total_shots = len(prior_shots)
            
            shot_tendencies_list.append({
                'PLAYER_ID': player,
                'GAME_DATE': game_date,
                'SEASON': season,
                'RESTRICTED_AREA_PCT': zone_counts.get('RESTRICTED_AREA', 0) / total_shots,
                'PAINT_PCT': zone_counts.get('PAINT', 0) / total_shots,
                'MIDRANGE_PCT': zone_counts.get('MIDRANGE', 0) / total_shots,
                'THREE_PT_PCT': zone_counts.get('THREE_PT', 0) / total_shots
            })

df_shot_tendencies = pd.DataFrame(shot_tendencies_list)

print(f"\n✅ Shot tendencies calculated for {len(df_shot_tendencies):,} player-games")

Calculating season-to-date shot tendencies...

Players: 100%|██████████| 974/974 [00:11<00:00, 84.53it/s]

✅ Shot tendencies calculated for 55,949 player-games





In [15]:
print("Merging shot tendencies with game logs...\n")

# Merge with main dataframe
df = df.merge(
    df_shot_tendencies,
    on=['PLAYER_ID', 'GAME_DATE', 'SEASON'],
    how='left'
)

# Fill missing values (players with no shot data) with the same fixed default shares used above
df['RESTRICTED_AREA_PCT'] = df['RESTRICTED_AREA_PCT'].fillna(0.25)
df['PAINT_PCT'] = df['PAINT_PCT'].fillna(0.15)
df['MIDRANGE_PCT'] = df['MIDRANGE_PCT'].fillna(0.20)
df['THREE_PT_PCT'] = df['THREE_PT_PCT'].fillna(0.40)

print("✅ 4 shot tendency features added")
# Note: measured after fillna, so this only confirms that no gaps remain;
# it does not report how many player-games lacked shot data before filling.
print(f"   Missing shot data: {df[['RESTRICTED_AREA_PCT', 'PAINT_PCT']].isnull().mean().mean()*100:.1f}%")
print(f"   Sample distributions:")
print(f"      RESTRICTED_AREA_PCT: {df['RESTRICTED_AREA_PCT'].mean()*100:.1f}%")
print(f"      PAINT_PCT: {df['PAINT_PCT'].mean()*100:.1f}%")
print(f"      MIDRANGE_PCT: {df['MIDRANGE_PCT'].mean()*100:.1f}%")
print(f"      THREE_PT_PCT: {df['THREE_PT_PCT'].mean()*100:.1f}%")

Merging shot tendencies with game logs...

✅ 4 shot tendency features added
   Missing shot data: 0.0%
   Sample distributions:
      RESTRICTED_AREA_PCT: 29.1%
      PAINT_PCT: 16.5%
      MIDRANGE_PCT: 13.9%
      THREE_PT_PCT: 40.4%
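
The season-to-date shares in this section can also be computed without a nested loop, using cumulative counts per player-season minus the current game's counts. The sketch below is one possible vectorized version on a hypothetical toy shot log. It is an illustration under those assumptions, not the code that produced the numbers above.

```python
import pandas as pd

# Hypothetical toy shot log: one row per shot, already mapped to a ZONE.
shots = pd.DataFrame({
    "PLAYER_ID": [1] * 6,
    "SEASON": ["2023-24"] * 6,
    "GAME_DATE": pd.to_datetime(
        ["2023-10-25"] * 2 + ["2023-10-27"] * 2 + ["2023-10-29"] * 2
    ),
    "ZONE": ["THREE_PT", "PAINT", "THREE_PT", "THREE_PT", "MIDRANGE", "PAINT"],
})

keys = ["PLAYER_ID", "SEASON"]

# Shot counts per player-game (rows) and zone (columns), in date order.
per_game = (
    shots.groupby(keys + ["GAME_DATE", "ZONE"])
    .size()
    .unstack("ZONE", fill_value=0)
    .sort_index()
)

# Cumulative counts up to and including each game, minus that game's own
# shots, leaves only shots from PRIOR games in the same season.
prior = per_game.groupby(level=keys).cumsum() - per_game

# Convert to shares; games with no prior shots become NaN (to be filled later).
prior_total = prior.sum(axis=1)
shares = prior.div(prior_total.where(prior_total > 0), axis=0)
print(shares)
```

In this toy log the second game sees a 50/50 split between paint and three-point shots from the first game. The first game has no history, so its row is `NaN` until defaults are filled in.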


## 8. Momentum (6 features)

Trends (slope) and volatility (std) for recent performance.

In [16]:
print("Creating momentum features...\n")

def calculate_trend(values):
    """Least-squares slope of the non-NaN values in a window (0 if fewer than 2)."""
    mask = ~np.isnan(values)
    if mask.sum() < 2:
        return 0.0
    # Keep original positions so gaps do not distort the slope
    return np.polyfit(np.arange(len(values))[mask], values[mask], 1)[0]

# Trend (slope) for target stats over last 5 games
for stat in ['PTS', 'REB', 'AST']:
    df[f'{stat}_TREND'] = (
        df.groupby('PLAYER_ID')[stat]
        # transform keeps the shift and the window inside each player's history
        .transform(
            lambda s: s.shift(1).rolling(5, min_periods=2).apply(calculate_trend, raw=True)
        )
        .fillna(0)
    )

# Volatility (std) for target stats over last 5 games
for stat in ['PTS', 'REB', 'AST']:
    df[f'{stat}_VOLATILITY'] = (
        df.groupby('PLAYER_ID')[stat]
        .transform(lambda s: s.shift(1).rolling(5, min_periods=2).std())
        .fillna(0)
    )

print("✅ 6 momentum features created (3 trends + 3 volatility)")
print(f"   PTS_TREND range: {df['PTS_TREND'].min():.2f} to {df['PTS_TREND'].max():.2f}")
print(f"   PTS_VOLATILITY range: {df['PTS_VOLATILITY'].min():.2f} to {df['PTS_VOLATILITY'].max():.2f}")

Creating momentum features...

✅ 6 momentum features created (3 trends + 3 volatility)
   PTS_TREND range: -10.90 to 11.80
   PTS_VOLATILITY range: 0.00 to 24.71
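
To make the two momentum quantities concrete, here is a small worked example on a hypothetical five-game window. The trend is the least-squares slope of points against game index, and the volatility is the sample standard deviation (`ddof=1`, which is what pandas' `.std()` uses by default). The helper `window_slope` is illustrative only.

```python
import numpy as np


def window_slope(values: np.ndarray) -> float:
    """Closed-form least-squares slope of values against game index 0..n-1."""
    x = np.arange(len(values), dtype=float)
    x_centered = x - x.mean()
    return float((x_centered * (values - values.mean())).sum() / (x_centered ** 2).sum())


# Hypothetical window: a player improving by 2 points every game.
last_five = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

slope_closed_form = window_slope(last_five)
slope_polyfit = np.polyfit(np.arange(len(last_five)), last_five, 1)[0]
volatility = last_five.std(ddof=1)

print(f"slope (closed form): {slope_closed_form:.3f}")
print(f"slope (polyfit):     {slope_polyfit:.3f}")
print(f"volatility (std):    {volatility:.3f}")
```

Both slope calculations agree at 2 points per game up to floating-point error. The volatility is positive even though the trend is perfectly steady, because a standard deviation cannot distinguish a steady climb from noise.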


## 9. Finalize Feature Set

In [17]:
print("Defining final feature set...\n")

# Feature columns (38 total)
feature_columns = [
    # ========== ROLLING AVERAGES (9) ==========
    'PTS_LAST_3', 'PTS_LAST_5', 'PTS_LAST_10',
    'REB_LAST_3', 'REB_LAST_5', 'REB_LAST_10',
    'AST_LAST_3', 'AST_LAST_5', 'AST_LAST_10',
    
    # ========== SEASON CONTEXT (6) ==========
    'PTS_SEASON_AVG', 'REB_SEASON_AVG', 'AST_SEASON_AVG',
    'SEASON_GAME_NUM', 'MONTH', 'DAY_OF_WEEK',
    
    # ========== OPPONENT CONTEXT (4) ==========
    'OPP_DEF_RATING', 'OPP_PACE', 'OPP_W_PCT', 'OPP_OFF_RATING',
    
    # ========== TEAM CONTEXT (4) ==========
    'TEAM_DEF_RATING', 'TEAM_PACE', 'TEAM_W_PCT', 'TEAM_OFF_RATING',
    
    # ========== GAME CONTEXT (5) ==========
    'IS_HOME', 'DAYS_REST', 'REST_0_1', 'REST_2_3', 'REST_4_PLUS',
    
    # ========== SHOT TENDENCIES (4) ==========
    'RESTRICTED_AREA_PCT', 'PAINT_PCT', 'MIDRANGE_PCT', 'THREE_PT_PCT',
    
    # ========== MOMENTUM (6) ==========
    'PTS_TREND', 'REB_TREND', 'AST_TREND',
    'PTS_VOLATILITY', 'REB_VOLATILITY', 'AST_VOLATILITY'
]

# Tracking columns
tracking = ['PLAYER_ID', 'PLAYER_NAME', 'GAME_ID', 'GAME_DATE', 'SEASON', 'MATCHUP']

# Target columns
targets = ['PTS', 'REB', 'AST']

# Keep games with sufficient history (at least 3 prior games in the same season)
df_filtered = df[df['SEASON_GAME_NUM'] >= 4].copy()

# Select final columns
df_final = df_filtered[tracking + feature_columns + targets].copy()

print(f"✅ Final feature set prepared\n")
print(f"   Total games: {len(df_final):,}")
print(f"   Total features: {len(feature_columns)}\n")
print(f"   Feature breakdown:")
print(f"      Rolling averages:    9")
print(f"      Season context:      6")
print(f"      Opponent context:    4")
print(f"      Team context:        4")
print(f"      Game context:        5")
print(f"      Shot tendencies:     4")
print(f"      Momentum:            6")
print(f"      ----------------------")
print(f"      TOTAL:              {len(feature_columns)} features")

Defining final feature set...

✅ Final feature set prepared

   Total games: 68,765
   Total features: 38

   Feature breakdown:
      Rolling averages:    9
      Season context:      6
      Opponent context:    4
      Team context:        4
      Game context:        5
      Shot tendencies:     4
      Momentum:            6
      ----------------------
      TOTAL:              38 features
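
Because the per-category counts are typed by hand in several places, a small consistency check can help. The sketch below rebuilds the same 38 names by category and confirms the counts and the absence of duplicates; the `feature_groups` dictionary is a hypothetical helper structure, not something the notebook defines.

```python
stats = ["PTS", "REB", "AST"]

# Same feature names as in the final list, grouped by category.
feature_groups = {
    "rolling_averages": [f"{s}_LAST_{w}" for s in stats for w in (3, 5, 10)],
    "season_context": [f"{s}_SEASON_AVG" for s in stats]
    + ["SEASON_GAME_NUM", "MONTH", "DAY_OF_WEEK"],
    "opponent_context": ["OPP_DEF_RATING", "OPP_PACE", "OPP_W_PCT", "OPP_OFF_RATING"],
    "team_context": ["TEAM_DEF_RATING", "TEAM_PACE", "TEAM_W_PCT", "TEAM_OFF_RATING"],
    "game_context": ["IS_HOME", "DAYS_REST", "REST_0_1", "REST_2_3", "REST_4_PLUS"],
    "shot_tendencies": [
        "RESTRICTED_AREA_PCT", "PAINT_PCT", "MIDRANGE_PCT", "THREE_PT_PCT",
    ],
    "momentum": [f"{s}_{kind}" for kind in ("TREND", "VOLATILITY") for s in stats],
}

all_features = [name for names in feature_groups.values() for name in names]
group_sizes = {group: len(names) for group, names in feature_groups.items()}
duplicates = sorted({n for n in all_features if all_features.count(n) > 1})

print(group_sizes)
print(f"total={len(all_features)}, duplicates={duplicates}")
```

Deriving the printed breakdown from a structure like this, rather than from literals, would keep the summary in step with the actual list if features are added or removed.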


## 10. Train/Val/Test Split (Temporal)

**CRITICAL**: Use a temporal split (not random) so that the model is never trained on games played after the ones it is evaluated on.

- **Train**: < 2023-01-01 (~68%)
- **Val**: 2023-01-01 up to, but not including, 2024-01-01 (~19%)
- **Test**: >= 2024-01-01 (~13%)

In [18]:
print("Creating temporal train/val/test split...\n")

# Define split dates
train_end = pd.Timestamp('2023-01-01')
val_end = pd.Timestamp('2024-01-01')

# Create masks
train_mask = df_final['GAME_DATE'] < train_end
val_mask = (df_final['GAME_DATE'] >= train_end) & (df_final['GAME_DATE'] < val_end)
test_mask = df_final['GAME_DATE'] >= val_end

# Split datasets
train = df_final[train_mask].copy()
val = df_final[val_mask].copy()
test = df_final[test_mask].copy()

print("✅ Temporal split complete\n")
print(f"   Train: {len(train):,} games ({len(train)/len(df_final)*100:.1f}%) | {train['GAME_DATE'].min().date()} to {train['GAME_DATE'].max().date()}")
print(f"   Val:   {len(val):,} games ({len(val)/len(df_final)*100:.1f}%) | {val['GAME_DATE'].min().date()} to {val['GAME_DATE'].max().date()}")
print(f"   Test:  {len(test):,} games ({len(test)/len(df_final)*100:.1f}%) | {test['GAME_DATE'].min().date()} to {test['GAME_DATE'].max().date()}")
print(f"\n   Total: {len(df_final):,} games")

Creating temporal train/val/test split...

✅ Temporal split complete

   Train: 46,824 games (68.1%) | 2019-10-28 to 2022-12-31
   Val:   13,337 games (19.4%) | 2023-01-01 to 2023-12-31
   Test:  8,604 games (12.5%) | 2024-01-01 to 2024-04-14

   Total: 68,765 games
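
A temporal split is easy to verify: the three parts should cover every row exactly once, and each part should end before the next begins. The sketch below wraps the same boundary logic in a hypothetical helper, `temporal_split`, and checks it on toy dates that sit right on the cut-offs.

```python
import pandas as pd


def temporal_split(frame, train_end, val_end, date_col="GAME_DATE"):
    """Split by date: train < train_end <= val < val_end <= test."""
    train = frame[frame[date_col] < train_end]
    val = frame[(frame[date_col] >= train_end) & (frame[date_col] < val_end)]
    test = frame[frame[date_col] >= val_end]
    return train, val, test


# Hypothetical games placed on both sides of each boundary.
games = pd.DataFrame({
    "GAME_DATE": pd.to_datetime(
        ["2022-12-31", "2023-01-01", "2023-12-31", "2024-01-01"]
    )
})

train, val, test = temporal_split(
    games, pd.Timestamp("2023-01-01"), pd.Timestamp("2024-01-01")
)

# Every row lands in exactly one split, and the splits are ordered in time.
covers_all = len(train) + len(val) + len(test) == len(games)
ordered = (
    train["GAME_DATE"].max() < val["GAME_DATE"].min()
    and val["GAME_DATE"].max() < test["GAME_DATE"].min()
)
print(f"covers_all={covers_all}, ordered={ordered}")
```

Games played exactly on a boundary date belong to the later split, which matches the recorded date ranges above.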


## 11. Save Datasets & Metadata

In [19]:
print("Saving datasets and metadata...\n")

proc_path = Path('../data/processed')
proc_path.mkdir(parents=True, exist_ok=True)

# Save full dataset
df_final.to_parquet(proc_path / 'features_engineered.parquet', index=False)

# Save splits
train.to_parquet(proc_path / 'train.parquet', index=False)
val.to_parquet(proc_path / 'val.parquet', index=False)
test.to_parquet(proc_path / 'test.parquet', index=False)

# Save metadata
metadata = {
    'version': '2.0_evidence_based',
    'date_created': pd.Timestamp.now().isoformat(),
    'total_features': len(feature_columns),
    'total_games': len(df_final),
    'total_players': df_final['PLAYER_ID'].nunique(),
    
    'date_range': {
        'start': df_final['GAME_DATE'].min().isoformat(),
        'end': df_final['GAME_DATE'].max().isoformat()
    },
    
    'feature_names': feature_columns,
    
    'feature_breakdown': {
        'rolling_averages': 9,
        'season_context': 6,
        'opponent_context': 4,
        'team_context': 4,
        'game_context': 5,
        'shot_tendencies': 4,
        'momentum': 6
    },
    
    'evidence_from_eda': {
        'high_correlation_features': ['Rolling averages (PTS/REB/AST)', 'Season averages', 'Opponent defense (+9% effect)', 'Home advantage (+1.7%)'],
        'low_correlation_features': ['TEAM_PACE (r=0.014, +2.4%)', 'IS_B2B (excluded, paradox +1.7%)'],
        'fundamental_challenge': 'FGA (r=0.874) and MIN (r=0.683) unavailable at prediction time',
        'opponent_effect': 'Elite D: 12.66 PTS vs Weak D: 13.82 PTS (+9.2%)',
        'rest_days_effect': 'Non-linear (4+ days = injury signal, lower scoring)'
    },
    
    'data_quality': {
        'leakage_prevention': 'Rolling, season-to-date, and momentum features use .shift(1) within each player history',
        'shot_tendencies': 'Season-to-date shares from prior games only (not per-game)',
        'minimum_games': 4,
        'source': 'data/processed/gamelogs_combined.parquet (cleaned, deduplicated)'
    },
    
    'train_val_test_split': {
        'method': 'Temporal (not random)',
        'train': {'end_date': '2023-01-01', 'games': len(train), 'pct': f"{len(train)/len(df_final)*100:.1f}%"},
        'val': {'start_date': '2023-01-01', 'end_date': '2024-01-01', 'games': len(val), 'pct': f"{len(val)/len(df_final)*100:.1f}%"},
        'test': {'start_date': '2024-01-01', 'games': len(test), 'pct': f"{len(test)/len(df_final)*100:.1f}%"}
    },
    
    'tracking_columns': tracking,
    'target_columns': targets,
    
    'excluded_features': {
        'IS_B2B': 'Paradox: +1.7% scoring instead of fatigue (r=+0.009)',
        'FGA/MIN': 'Unavailable at prediction time (r=0.874 and r=0.683)',
        'Per_game_shot_data': 'Does not exist pre-game; season-to-date averages used instead'
    }
}

with open(proc_path / 'feature_metadata_v2.json', 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"{'='*70}")
print(f"✅ FEATURE ENGINEERING COMPLETE (EVIDENCE-BASED)")
print(f"{'='*70}\n")
print(f"Files saved:")
print(f"  📊 data/processed/features_engineered.parquet ({len(df_final):,} games)")
print(f"  📊 data/processed/train.parquet ({len(train):,} games)")
print(f"  📊 data/processed/val.parquet ({len(val):,} games)")
print(f"  📊 data/processed/test.parquet ({len(test):,} games)")
print(f"  📋 data/processed/feature_metadata_v2.json\n")
print(f"Key improvements from v1.0:")
print(f"  ✅ Reduced from 81 to {len(feature_columns)} evidence-based features")
print(f"  ✅ Excluded IS_B2B (no fatigue effect)")
print(f"  ✅ Shot tendencies as season averages (not per-game)")
print(f"  ✅ Binned DAYS_REST for non-linear effects")
print(f"  ✅ Added opponent/team context from processed data")
print(f"  ✅ Temporal train/val/test split (prevents leakage)\n")
print(f"Next step: Notebook 04 for baseline model training!")

Saving datasets and metadata...

✅ FEATURE ENGINEERING COMPLETE (EVIDENCE-BASED)

Files saved:
  📊 data/processed/features_engineered.parquet (68,765 games)
  📊 data/processed/train.parquet (46,824 games)
  📊 data/processed/val.parquet (13,337 games)
  📊 data/processed/test.parquet (8,604 games)
  📋 data/processed/feature_metadata_v2.json

Key improvements from v1.0:
  ✅ Reduced from 81 to 38 evidence-based features
  ✅ Excluded IS_B2B (no fatigue effect)
  ✅ Shot tendencies as season averages (not per-game)
  ✅ Binned DAYS_REST for non-linear effects
  ✅ Added opponent/team context from processed data
  ✅ Temporal train/val/test split (prevents leakage)

Next step: Notebook 04 for baseline model training!
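
A downstream notebook could use the metadata file to select model inputs instead of retyping column names. The sketch below illustrates that pattern with small in-memory stand-ins. The `feature_names` and `target_columns` keys match the metadata written above, while the toy frame, the two-feature list, and the temporary file location are assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for a saved split and its metadata.
frame = pd.DataFrame({
    "PTS_LAST_3": [11.0, 14.5],
    "IS_HOME": [1, 0],
    "PTS": [12, 20],
    "REB": [4, 7],
    "AST": [3, 5],
})
metadata = {
    "feature_names": ["PTS_LAST_3", "IS_HOME"],
    "target_columns": ["PTS", "REB", "AST"],
}

# Round-trip the metadata through JSON, as a later notebook would read it.
with tempfile.TemporaryDirectory() as tmp:
    meta_path = Path(tmp) / "feature_metadata_v2.json"
    meta_path.write_text(json.dumps(metadata, indent=2))
    loaded = json.loads(meta_path.read_text())

# Fail early if the data and the metadata disagree about column names.
expected = loaded["feature_names"] + loaded["target_columns"]
missing = [c for c in expected if c not in frame.columns]

X = frame[loaded["feature_names"]]
y = frame[loaded["target_columns"]]
print(f"missing={missing}, X shape={X.shape}, y shape={y.shape}")
```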
