# Enhanced Data Collection (3 Seasons + Contextual Features)

This notebook collects comprehensive data for the midterm report:

**Enhancements over notebook 01:**
1. **3 full seasons:** 2022-23, 2023-24, 2024-25 (instead of just 1 season)
2. **100+ players:** Top players by total minutes (instead of 44)
3. **Team context:** Opponent defensive rating, offensive rating, pace
4. **Game context:** Home/away, rest days, back-to-back flags

**Expected dataset size:** ~24,000 player-game rows (one row per player per game)

**Collection time:** ~5-8 minutes (about 360 requests, throttled to roughly 1 request per second)

## Setup and Imports

In [1]:
from nba_api.stats.endpoints import playergamelog, leaguegamefinder, leaguedashteamstats
from nba_api.stats.static import players, teams
import pandas as pd
import numpy as np
import time
from tqdm import tqdm
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

print("Imports successful!")
print(f"Collection started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

Imports successful!
Collection started: 2025-10-18 20:39:35


## Step 1: Select Top 100+ Players

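Strategy: Select the top 120 players by total minutes played in the 2023-24 season, so that the sample consists of active, high-minute players. Selecting 120 leaves headroom above the 100-player target in case some players have incomplete data.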

In [2]:
# Get all players who played in 2023-24 season (use this as reference)
print("Fetching top players by minutes played in 2023-24 season...")

try:
    # Get player stats from 2023-24 to identify top players
    from nba_api.stats.endpoints import leaguedashplayerstats
    
    player_stats = leaguedashplayerstats.LeagueDashPlayerStats(
        season='2023-24',
        season_type_all_star='Regular Season',
        per_mode_detailed='Totals'
    )
    
    stats_df = player_stats.get_data_frames()[0]
    time.sleep(1.5)
    
    # Sort by total minutes and select top 120 (some may have incomplete data)
    top_players = stats_df.nlargest(120, 'MIN')[['PLAYER_ID', 'PLAYER_NAME', 'MIN', 'GP']]
    
    print(f"\nTop 120 players by minutes in 2023-24:")
    print(f"  Total minutes range: {top_players['MIN'].min():.0f} - {top_players['MIN'].max():.0f}")
    print(f"  Games played range: {top_players['GP'].min():.0f} - {top_players['GP'].max():.0f}")
    
    player_ids = top_players['PLAYER_ID'].tolist()
    
    print(f"\nSample players:")
    print(top_players.head(20)[['PLAYER_NAME', 'MIN', 'GP']].to_string(index=False))
    
except Exception as e:
    print(f"Error fetching player stats: {e}")
    print("\nFalling back to manual list of top players...")
    
    # Fallback: comprehensive list of top NBA players
    top_player_names = [
        # Superstars
        'Luka Doncic', 'Giannis Antetokounmpo', 'Nikola Jokic', 'Joel Embiid',
        'Shai Gilgeous-Alexander', 'Jayson Tatum', 'Kevin Durant', 'LeBron James',
        'Stephen Curry', 'Damian Lillard', 'Anthony Edwards', 'Tyrese Haliburton',
        'Donovan Mitchell', 'Jaylen Brown', 'Devin Booker', 'Anthony Davis',
        
        # All-Stars and high-minute players
        'Jalen Brunson', 'De\'Aaron Fox', 'Trae Young', 'LaMelo Ball',
        'Tyrese Maxey', 'Paolo Banchero', 'Franz Wagner', 'Scottie Barnes',
        'Cade Cunningham', 'Jalen Williams', 'Desmond Bane', 'Mikal Bridges',
        'Lauri Markkanen', 'Pascal Siakam', 'Julius Randle', 'Domantas Sabonis',
        'Bam Adebayo', 'Karl-Anthony Towns', 'Jaren Jackson Jr.', 'Evan Mobley',
        'Chet Holmgren', 'Victor Wembanyama', 'Alperen Sengun', 'Jarrett Allen',
        
        # Starters and rotation players
        'DeMar DeRozan', 'Zach LaVine', 'Nikola Vucevic', 'Coby White',
        'Darius Garland', 'Kyrie Irving', 'Kawhi Leonard', 'Paul George',
        'James Harden', 'Norman Powell', 'Russell Westbrook', 'Bradley Beal',
        'Deandre Ayton', 'Kristaps Porzingis', 'Clint Capela', 'Rudy Gobert',
        'Nikola Jovic', 'Tyler Herro', 'Jimmy Butler', 'Terry Rozier',
        'Brandon Ingram', 'Zion Williamson', 'CJ McCollum', 'Dejounte Murray',
        'Trey Murphy III', 'Jordan Poole', 'Klay Thompson', 'Andrew Wiggins',
        'Draymond Green', 'Anfernee Simons', 'Jerami Grant', 'Shaedon Sharpe',
        'Jalen Green', 'Alperen Sengun', 'Fred VanVleet', 'Jabari Smith Jr.',
        'OG Anunoby', 'RJ Barrett', 'Immanuel Quickley', 'Josh Hart',
        'Miles Bridges', 'Mark Williams', 'Brandon Miller', 'Nick Richards',
        'Cam Thomas', 'Nic Claxton', 'Spencer Dinwiddie', 'Dorian Finney-Smith',
        'Tobias Harris', 'Kelly Oubre Jr.', 'Kyle Lowry', 'Caleb Martin',
        'Cole Anthony', 'Jalen Suggs', 'Wendell Carter Jr.', 'Jonathan Isaac',
        'Malcolm Brogdon', 'Jerami Grant', 'Josh Giddey', 'Jalen Duren',
        'Ivica Zubac', 'Terance Mann', 'Bones Hyland', 'Malik Monk',
        'Keegan Murray', 'Harrison Barnes', 'Kevin Huerter', 'Trey Lyles',
        'Jonas Valanciunas', 'Jusuf Nurkic', 'Collin Sexton', 'Lauri Markkanen',
        'Walker Kessler', 'Jordan Clarkson', 'John Collins', 'Onyeka Okongwu',
        'Bogdan Bogdanovic', 'Dejounte Murray', 'Saddiq Bey', 'Jalen Johnson'
    ]
    
    all_players = players.get_players()
    player_ids = []
    
    for name in top_player_names:
        player = [p for p in all_players if p['full_name'] == name]
        if player:
            player_ids.append(player[0]['id'])
    
    player_ids = list(set(player_ids))  # Remove duplicates
    print(f"Collected {len(player_ids)} unique player IDs")

Fetching top players by minutes played in 2023-24 season...

Top 120 players by minutes in 2023-24:
  Total minutes range: 1920 - 2989
  Games played range: 54 - 84

Sample players:
     PLAYER_NAME         MIN  GP
   DeMar DeRozan 2988.568333  79
Domantas Sabonis 2928.021667  82
      Coby White 2881.356667  79
   Mikal Bridges 2854.240000  82
  Paolo Banchero 2798.701667  80
    Kevin Durant 2790.636667  75
 Dejounte Murray 2783.343333  78
 Anthony Edwards 2770.383333  79
    Nikola Jokić 2736.533333  79
   Jalen Brunson 2726.303333  77
       Josh Hart 2706.771667  81
   Anthony Davis 2700.243333  76
   Fred VanVleet 2684.426667  73
    De'Aaron Fox 2658.763333  74
   Pascal Siakam 2657.888333  80
    Jayson Tatum 2645.213333  74
   Austin Reaves 2628.950000  82
    Tyrese Maxey 2625.806667  70
     Luka Dončić 2624.040000  70
  Nikola Vučević 2609.501667  76
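
The fallback branch matches names with an exact `==` comparison, while the API output above spells several names with diacritics (`Luka Dončić`, `Nikola Jokić`). If the static player list uses the same accented spellings (an assumption, not verified here), the ASCII names in the fallback list would be skipped silently. A minimal sketch of accent-insensitive matching follows; `normalize_name` and the toy `sample_players` records with their IDs are hypothetical stand-ins, not real API data.

```python
import unicodedata

def normalize_name(name: str) -> str:
    """Lowercase a name and strip combining accents (e.g. 'Dončić' -> 'doncic')."""
    decomposed = unicodedata.normalize('NFKD', name)
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower().strip()

# Hypothetical stand-in for players.get_players(); IDs are made up for illustration
sample_players = [
    {'id': 1, 'full_name': 'Luka Dončić'},
    {'id': 2, 'full_name': 'Nikola Jokić'},
    {'id': 3, 'full_name': 'Jalen Brunson'},
]
lookup = {normalize_name(p['full_name']): p['id'] for p in sample_players}

wanted = ['Luka Doncic', 'Nikola Jokic', 'Jalen Brunson', 'Not A Player']
matched_ids = [lookup[normalize_name(n)] for n in wanted if normalize_name(n) in lookup]
unmatched = [n for n in wanted if normalize_name(n) not in lookup]

print(matched_ids)
print(unmatched)
```

Collecting the `unmatched` names makes it visible when the fallback list and the lookup disagree, instead of dropping those players without notice.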


## Step 2: Collect Player Game Logs (3 Seasons)

Collect game logs for each player across the 2022-23, 2023-24, and 2024-25 seasons, one API request per player-season.

In [3]:
seasons = ['2022-23', '2023-24', '2024-25']
all_gamelogs = []

print(f"Collecting game logs for {len(player_ids)} players across {len(seasons)} seasons...")
print(f"Estimated time: {len(player_ids) * len(seasons) * 1.2 / 60:.1f} minutes\n")

failed_requests = []

for season in seasons:
    print(f"\n{'='*60}")
    print(f"Season: {season}")
    print('='*60)
    
    for player_id in tqdm(player_ids, desc=f"{season}"):
        try:
            gamelog = playergamelog.PlayerGameLog(
                player_id=str(player_id),
                season=season,
                season_type_all_star='Regular Season'
            )
            df = gamelog.get_data_frames()[0]
            
            if len(df) > 0:
                df['PLAYER_ID'] = player_id
                df['SEASON'] = season
                all_gamelogs.append(df)
            
            time.sleep(0.6)  # Throttle: 0.6 s pause plus request latency gives roughly 1 req/sec
            
        except Exception as e:
            failed_requests.append({'player_id': player_id, 'season': season, 'error': str(e)})
            time.sleep(2)  # Back off on error
            continue

print(f"\n{'='*60}")
print("Collection Complete!")
print('='*60)
print(f"Successful: {len(all_gamelogs)} player-seasons")
print(f"Failed: {len(failed_requests)} requests")

if len(failed_requests) > 0:
    print(f"\nFailed requests (showing first 10):")
    for req in failed_requests[:10]:
        print(f"  Player {req['player_id']}, {req['season']}: {req['error'][:50]}")

Collecting game logs for 120 players across 3 seasons...
Estimated time: 7.2 minutes


Season: 2022-23


2022-23: 100%|█████████████████████████████████████| 120/120 [01:36<00:00,  1.24it/s]



Season: 2023-24


2023-24: 100%|█████████████████████████████████████| 120/120 [01:36<00:00,  1.24it/s]



Season: 2024-25


2024-25: 100%|█████████████████████████████████████| 120/120 [01:39<00:00,  1.20it/s]


Collection Complete!
Successful: 352 player-seasons
Failed: 0 requests
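
The collection loop above records a failed request and moves on, so a transient timeout would leave that player-season missing. One possible refinement is to retry with a growing delay before giving up. The helper below is a hypothetical sketch (`fetch_with_retry` is not part of `nba_api`), and the `flaky` function only simulates a failing request.

```python
import time

def fetch_with_retry(fetch, max_attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds; the delay doubles after each failed attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller log the failure
            sleep(base_delay * 2 ** (attempt - 1))

# Simulated request that fails twice, then succeeds
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('simulated timeout')
    return 'ok'

delays = []  # record the requested delays instead of actually sleeping
result = fetch_with_retry(flaky, sleep=delays.append)
print(result, delays)

# In the notebook loop, the wrapped call might look like:
#   df = fetch_with_retry(lambda: playergamelog.PlayerGameLog(
#       player_id=str(player_id), season=season,
#       season_type_all_star='Regular Season').get_data_frames()[0])
```

Passing `sleep` as a parameter keeps the example fast to run; in real use the default `time.sleep` would apply.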





In [10]:
# Combine all game logs
gamelogs_df = pd.concat(all_gamelogs, ignore_index=True)

# Convert GAME_DATE to datetime
gamelogs_df['GAME_DATE'] = pd.to_datetime(gamelogs_df['GAME_DATE'])

# Sort by player and date
gamelogs_df = gamelogs_df.sort_values(['PLAYER_ID', 'GAME_DATE']).reset_index(drop=True)

print(f"\nDataset Summary:")
print(f"  Total games: {len(gamelogs_df):,}")
print(f"  Unique players: {gamelogs_df['PLAYER_ID'].nunique()}")
print(f"  Date range: {gamelogs_df['GAME_DATE'].min().date()} to {gamelogs_df['GAME_DATE'].max().date()}")
print(f"  Seasons: {gamelogs_df['SEASON'].unique()}")

print(f"\nGames per season:")
print(gamelogs_df.groupby('SEASON').size())

print(f"\nTarget statistics:")
print(f"  PTS: {gamelogs_df['PTS'].mean():.1f} ± {gamelogs_df['PTS'].std():.1f}")
print(f"  REB: {gamelogs_df['REB'].mean():.1f} ± {gamelogs_df['REB'].std():.1f}")
print(f"  AST: {gamelogs_df['AST'].mean():.1f} ± {gamelogs_df['AST'].std():.1f}")


Dataset Summary:
  Total games: 23,925
  Unique players: 120
  Date range: 2022-10-18 to 2025-04-13
  Seasons: ['2022-23' '2023-24' '2024-25']

Games per season:
SEASON
2022-23    7585
2023-24    8826
2024-25    7514
dtype: int64

Target statistics:
  PTS: 17.1 ± 9.3
  REB: 5.4 ± 3.7
  AST: 3.9 ± 3.0


## Step 3: Collect Team Stats (For Opponent Context)

Collect team statistics (defensive rating, offensive rating, pace) for each season to enable opponent-based features.

In [11]:
print("Collecting team statistics for opponent context...\n")

all_team_stats = []

for season in seasons:
    try:
        print(f"Fetching {season} team stats...")
        
        team_stats = leaguedashteamstats.LeagueDashTeamStats(
            season=season,
            season_type_all_star='Regular Season',
            measure_type_detailed_defense='Advanced'
        )
        
        df = team_stats.get_data_frames()[0]
        df['SEASON'] = season
        
        # Keep relevant columns
        df = df[['TEAM_ID', 'TEAM_NAME', 'SEASON', 'GP', 'W', 'L', 
                 'OFF_RATING', 'DEF_RATING', 'NET_RATING', 'PACE']]
        
        all_team_stats.append(df)
        
        print(f"  ✓ {len(df)} teams")
        time.sleep(1.5)
        
    except Exception as e:
        print(f"  ✗ Error: {e}")
        continue

team_stats_df = pd.concat(all_team_stats, ignore_index=True)

print(f"\nTeam stats collected:")
print(f"  Total records: {len(team_stats_df)}")
print(f"  Teams per season: {team_stats_df.groupby('SEASON').size().to_dict()}")

print(f"\nSample team stats:")
print(team_stats_df[team_stats_df['SEASON'] == '2023-24'].head(10).to_string(index=False))

Collecting team statistics for opponent context...

Fetching 2022-23 team stats...
  ✓ 72 teams
Fetching 2023-24 team stats...
  ✓ 73 teams
Fetching 2024-25 team stats...
  ✓ 76 teams

Team stats collected:
  Total records: 221
  Teams per season: {'2022-23': 72, '2023-24': 73, '2024-25': 76}

Sample team stats:
   TEAM_ID           TEAM_NAME  SEASON  GP  W  L  OFF_RATING  DEF_RATING  NET_RATING   PACE
1611661330       Atlanta Dream 2023-24  40 19 21        99.7       101.5        -1.8  98.70
1610612737       Atlanta Hawks 2023-24  87 39 48       115.9       118.0        -2.0 100.86
1612709890        Austin Spurs 2023-24  34 20 14       111.6       108.2         3.4 101.76
1612709913 Birmingham Squadron 2023-24  34 15 19       117.2       118.9        -1.7 101.08
1610612738      Boston Celtics 2023-24  87 66 21       121.3       110.2        11.0  98.40
1610612751       Brooklyn Nets 2023-24  87 35 52       112.1       114.4        -2.4  98.02
1612709928  Capital City Go-Go 2023-24  34
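
The output above reports 72 to 76 "teams" per season, and the sample includes non-NBA sides such as the Atlanta Dream and Austin Spurs, so the response is not limited to the 30 NBA franchises. The later merge on `OPP_TEAM_ID` ignores those extra rows, but filtering them out keeps the team table easier to check. A minimal sketch, assuming the static `teams.get_teams()` list contains only NBA franchises; the toy frame reuses IDs and names from the sample output.

```python
import pandas as pd

# Toy stand-in for team_stats_df, built from rows shown in the sample output
team_stats_toy = pd.DataFrame({
    'TEAM_ID': [1611661330, 1610612737, 1612709890, 1610612738],
    'TEAM_NAME': ['Atlanta Dream', 'Atlanta Hawks', 'Austin Spurs', 'Boston Celtics'],
    'SEASON': ['2023-24'] * 4,
})

# In the notebook this set could be built from the static list:
#   nba_team_ids = {t['id'] for t in teams.get_teams()}
nba_team_ids = {1610612737, 1610612738}  # truncated to two teams for illustration

nba_only = team_stats_toy[team_stats_toy['TEAM_ID'].isin(nba_team_ids)].reset_index(drop=True)
teams_per_season = nba_only.groupby('SEASON').size().to_dict()

print(nba_only)
print(teams_per_season)
```

With the full ID set, a per-season count of 30 would be a quick sanity check on the filtered table.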

## Step 4: Add Contextual Features

Add game-level contextual features:
- **Home/Away indicator**
- **Rest days** since last game
- **Back-to-back games** flag
- **Opponent team ID** (extracted from MATCHUP)

In [12]:
# Extract opponent team from MATCHUP column (e.g., "BOS @ LAL" -> opponent is LAL)
def extract_opponent_and_location(matchup):
    """Extract opponent abbreviation and home/away from MATCHUP string."""
    if ' vs. ' in matchup:
        # Home game: "BOS vs. LAL" -> opponent is LAL, home = True
        parts = matchup.split(' vs. ')
        return parts[1], True
    elif ' @ ' in matchup:
        # Away game: "BOS @ LAL" -> opponent is LAL, home = False
        parts = matchup.split(' @ ')
        return parts[1], False
    else:
        return None, None

gamelogs_df[['OPP_ABBREV', 'IS_HOME']] = gamelogs_df['MATCHUP'].apply(
    lambda x: pd.Series(extract_opponent_and_location(x))
)

# Create team abbreviation lookup
all_teams = teams.get_teams()
team_abbrev_to_id = {t['abbreviation']: t['id'] for t in all_teams}

# Map opponent abbreviation to team ID
gamelogs_df['OPP_TEAM_ID'] = gamelogs_df['OPP_ABBREV'].map(team_abbrev_to_id)

print("Opponent and location extracted:")
print(f"  Home games: {gamelogs_df['IS_HOME'].sum():,} ({gamelogs_df['IS_HOME'].mean()*100:.1f}%)")
print(f"  Away games: {(~gamelogs_df['IS_HOME']).sum():,} ({(~gamelogs_df['IS_HOME']).mean()*100:.1f}%)")
print(f"  Opponent IDs mapped: {gamelogs_df['OPP_TEAM_ID'].notna().sum():,} / {len(gamelogs_df):,}")

Opponent and location extracted:
  Home games: 12,019 (50.2%)
  Away games: 11,906 (49.8%)
  Opponent IDs mapped: 23,925 / 23,925
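
The row-wise `apply` above works, but the same two fields can also be derived with vectorized string operations. This is a sketch of an alternative, not the notebook's method, and it assumes every `MATCHUP` value follows the `"TEAM vs. OPP"` or `"TEAM @ OPP"` pattern; unlike the function above, it does not return `None` for unrecognized strings.

```python
import pandas as pd

# Example MATCHUP strings in the two formats handled above
matchups = pd.Series(['LAL @ GSW', 'LAL vs. LAC', 'BOS vs. LAL'])

# Home games contain ' vs. '; regex=False treats the dot literally
is_home = matchups.str.contains(' vs. ', regex=False)

# The opponent abbreviation is the last whitespace-separated token
opp_abbrev = matchups.str.split().str[-1]

print(pd.DataFrame({'MATCHUP': matchups, 'OPP_ABBREV': opp_abbrev, 'IS_HOME': is_home}))
```

This also yields a proper boolean `IS_HOME` column directly, which keeps the `~` negation used in the summary well defined.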


In [13]:
# Compute rest days for each player
print("Computing rest days and back-to-back flags...\n")

rest_features = []

for player_id in tqdm(gamelogs_df['PLAYER_ID'].unique(), desc="Players"):
    player_df = gamelogs_df[gamelogs_df['PLAYER_ID'] == player_id].copy()
    player_df = player_df.sort_values('GAME_DATE')
    
    # Compute rest days (days since previous game)
    player_df['PREV_GAME_DATE'] = player_df['GAME_DATE'].shift(1)
    player_df['REST_DAYS'] = (player_df['GAME_DATE'] - player_df['PREV_GAME_DATE']).dt.days - 1
    
    # Fill first game with NaN (no previous game)
    player_df.loc[player_df.index[0], 'REST_DAYS'] = np.nan
    
    # Back-to-back flag (0 rest days)
    player_df['IS_BACK_TO_BACK'] = (player_df['REST_DAYS'] == 0).astype(int)
    
    # Keep PLAYER_ID alongside Game_ID so each row has a unique merge key
    rest_features.append(player_df[['PLAYER_ID', 'Game_ID', 'REST_DAYS', 'IS_BACK_TO_BACK']])

rest_features_df = pd.concat(rest_features, ignore_index=True)

# Merge back to main dataframe on both PLAYER_ID and Game_ID.
# Game_ID alone is shared by every player in that game, so merging on it would multiply rows.
gamelogs_df = gamelogs_df.merge(rest_features_df, on=['PLAYER_ID', 'Game_ID'], how='left')

print("Rest days statistics:")
print(f"  Mean rest days: {gamelogs_df['REST_DAYS'].mean():.1f}")
print(f"  Median rest days: {gamelogs_df['REST_DAYS'].median():.0f}")
print(f"  Back-to-back games: {gamelogs_df['IS_BACK_TO_BACK'].sum():,} ({gamelogs_df['IS_BACK_TO_BACK'].mean()*100:.1f}%)")

print(f"\nRest days distribution:")
print(gamelogs_df['REST_DAYS'].value_counts().sort_index().head(10))

Computing rest days and back-to-back flags...



Players: 100%|████████████████████████████████████| 120/120 [00:00<00:00, 679.92it/s]

Rest days statistics:
  Mean rest days: 3.4
  Median rest days: 1
  Back-to-back games: 3,726 (15.6%)

Rest days distribution:
REST_DAYS
0.0     3726
1.0    14100
2.0     3566
3.0     1000
4.0      326
5.0      124
6.0       94
7.0      262
8.0      117
9.0       40
Name: count, dtype: int64
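
The mean rest of 3.4 days sits well above the median of 1. The loop above sorts each player's games across all three seasons and shifts by one, so the first game of a later season carries the whole off-season as "rest". A sketch of a per-season alternative follows, using `groupby(...).diff()` so that each season's first game gets `NaN` instead; the toy dates are made up for illustration.

```python
import pandas as pd

# Toy log: two games at the end of one season, two at the start of the next
toy = pd.DataFrame({
    'PLAYER_ID': [2544] * 4,
    'SEASON': ['2022-23', '2022-23', '2023-24', '2023-24'],
    'GAME_DATE': pd.to_datetime(['2023-04-07', '2023-04-09', '2023-10-24', '2023-10-25']),
}).sort_values(['PLAYER_ID', 'GAME_DATE'])

# Gap to the previous game within the same player-season (NaT for the first game)
gap = toy.groupby(['PLAYER_ID', 'SEASON'])['GAME_DATE'].diff()

# Same convention as above: full days off between games, 0 means back-to-back
toy['REST_DAYS'] = gap.dt.days - 1
toy['IS_BACK_TO_BACK'] = (toy['REST_DAYS'] == 0).astype(int)

print(toy)
```

Because the result keeps the original index, it can be assigned straight back to the frame, which also removes the need for the per-player loop and the follow-up merge.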





## Step 5: Merge Opponent Team Stats

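Merge each opponent's season-level stats onto every game row. These are full-season aggregates rather than season-to-date values, so a game's opponent ratings also reflect games played after it; treat them as a proxy for opponent strength over the season.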

In [14]:
# Merge opponent team stats (season-level)
# Rename columns to indicate they are opponent stats
opponent_stats = team_stats_df.copy()
opponent_stats = opponent_stats.rename(columns={
    'TEAM_ID': 'OPP_TEAM_ID',
    'TEAM_NAME': 'OPP_TEAM_NAME',
    'GP': 'OPP_GP',
    'W': 'OPP_W',
    'L': 'OPP_L',
    'OFF_RATING': 'OPP_OFF_RATING',
    'DEF_RATING': 'OPP_DEF_RATING',
    'NET_RATING': 'OPP_NET_RATING',
    'PACE': 'OPP_PACE'
})

# Merge on opponent team ID and season
gamelogs_enhanced = gamelogs_df.merge(
    opponent_stats,
    on=['OPP_TEAM_ID', 'SEASON'],
    how='left'
)

print("Opponent stats merged:")
print(f"  Total records: {len(gamelogs_enhanced):,}")
print(f"  Records with opponent stats: {gamelogs_enhanced['OPP_DEF_RATING'].notna().sum():,}")
print(f"  Coverage: {gamelogs_enhanced['OPP_DEF_RATING'].notna().mean()*100:.1f}%")

print(f"\nOpponent stats summary:")
print(gamelogs_enhanced[['OPP_DEF_RATING', 'OPP_OFF_RATING', 'OPP_PACE']].describe())

Opponent stats merged:
  Total records: 23,925
  Records with opponent stats: 23,925
  Coverage: 100.0%

Opponent stats summary:
       OPP_DEF_RATING  OPP_OFF_RATING      OPP_PACE
count    23925.000000    23925.000000  23925.000000
mean       113.454775      113.475156     99.742025
std          2.628779        3.227698      1.742195
min        106.400000      105.400000     96.290000
25%        111.400000      111.700000     98.370000
50%        113.200000      113.800000     99.670000
75%        114.800000      115.800000    100.970000
max        118.900000      121.300000    103.640000
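
The row count is unchanged after the merge (23,925), which suggests each opponent and season matched exactly one team row. That expectation can be enforced rather than checked by eye using the `validate` argument of `DataFrame.merge`. A minimal sketch on toy data: the `Game_ID` values are hypothetical, and the two ratings are taken from the sample team output.

```python
import pandas as pd

games = pd.DataFrame({
    'Game_ID': ['g1', 'g2', 'g3'],  # hypothetical IDs
    'OPP_TEAM_ID': [1610612738, 1610612737, 1610612738],
    'SEASON': ['2023-24'] * 3,
})
opp = pd.DataFrame({
    'OPP_TEAM_ID': [1610612737, 1610612738],
    'SEASON': ['2023-24'] * 2,
    'OPP_DEF_RATING': [118.0, 110.2],
})

# many_to_one: raises MergeError if the right side has duplicate keys
merged = games.merge(opp, on=['OPP_TEAM_ID', 'SEASON'], how='left',
                     validate='many_to_one')

# A duplicated team-season row on the right side is rejected
dup = pd.concat([opp, opp.iloc[[0]]], ignore_index=True)
try:
    games.merge(dup, on=['OPP_TEAM_ID', 'SEASON'], how='left', validate='many_to_one')
    duplicate_rejected = False
except pd.errors.MergeError:
    duplicate_rejected = True

print(merged)
print(duplicate_rejected)
```

The same argument would also fit the earlier rest-features merge, where duplicated keys were the original cause of row multiplication.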


## Step 6: Save Enhanced Dataset

In [17]:
# Save to parquet
output_path = '../data/raw/player_gamelogs_enhanced_2022-2025.parquet'
gamelogs_enhanced.to_parquet(output_path)

print("="*70)
print("ENHANCED DATASET SAVED")
print("="*70)

print(f"\nFile: {output_path}")
# Note: memory_usage() measures the in-memory DataFrame, not the parquet file on disk
print(f"  Size: {gamelogs_enhanced.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
print(f"  Rows: {len(gamelogs_enhanced):,}")
print(f"  Columns: {len(gamelogs_enhanced.columns)}")

print(f"\nKey enhancements:")
print(f"  ✓ 3 seasons of data (2022-23, 2023-24, 2024-25)")
print(f"  ✓ {gamelogs_enhanced['PLAYER_ID'].nunique()} players")
print(f"  ✓ Opponent team stats (DEF_RATING, OFF_RATING, PACE)")
print(f"  ✓ Game context (home/away, rest days, back-to-back)")

print(f"\nColumn list:")
for i, col in enumerate(gamelogs_enhanced.columns, 1):
    print(f"  {i:2}. {col}")

print(f"\nSample data:")
display(gamelogs_enhanced[[
    'GAME_DATE', 'PLAYER_ID', 'MATCHUP', 'IS_HOME', 'REST_DAYS', 'IS_BACK_TO_BACK',
    'PTS', 'REB', 'AST', 'OPP_DEF_RATING', 'OPP_PACE'
]].head(30))


print(f"\n{'='*70}")
print(f"Ready for feature engineering!")
print(f"Next: Run notebook 08 to create enhanced features and re-run ML models")
print(f"{'='*70}")

ENHANCED DATASET SAVED

File: ../data/raw/player_gamelogs_enhanced_2022-2025.parquet
  Size: 16.5 MB
  Rows: 23,925
  Columns: 42

Key enhancements:
  ✓ 3 seasons of data (2022-23, 2023-24, 2024-25)
  ✓ 120 players
  ✓ Opponent team stats (DEF_RATING, OFF_RATING, PACE)
  ✓ Game context (home/away, rest days, back-to-back)

Column list:
   1. SEASON_ID
   2. Player_ID
   3. Game_ID
   4. GAME_DATE
   5. MATCHUP
   6. WL
   7. MIN
   8. FGM
   9. FGA
  10. FG_PCT
  11. FG3M
  12. FG3A
  13. FG3_PCT
  14. FTM
  15. FTA
  16. FT_PCT
  17. OREB
  18. DREB
  19. REB
  20. AST
  21. STL
  22. BLK
  23. TOV
  24. PF
  25. PTS
  26. PLUS_MINUS
  27. VIDEO_AVAILABLE
  28. PLAYER_ID
  29. SEASON
  30. OPP_ABBREV
  31. IS_HOME
  32. OPP_TEAM_ID
  33. REST_DAYS
  34. IS_BACK_TO_BACK
  35. OPP_TEAM_NAME
  36. OPP_GP
  37. OPP_W
  38. OPP_L
  39. OPP_OFF_RATING
  40. OPP_DEF_RATING
  41. OPP_NET_RATING
  42. OPP_PACE

Sample data:


| | GAME_DATE | PLAYER_ID | MATCHUP | IS_HOME | REST_DAYS | IS_BACK_TO_BACK | PTS | REB | AST | OPP_DEF_RATING | OPP_PACE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-10-18 | 2544 | LAL @ GSW | False | NaN | 0 | 31 | 15 | 8 | 113.0 | 102.59 |
| 1 | 2022-10-20 | 2544 | LAL vs. LAC | True | 1.0 | 0 | 20 | 10 | 6 | 112.5 | 99.0 |
| 2 | 2022-10-23 | 2544 | LAL vs. POR | True | 2.0 | 0 | 31 | 8 | 8 | 116.6 | 99.36 |
| 3 | 2022-10-26 | 2544 | LAL @ DEN | False | 2.0 | 0 | 19 | 7 | 9 | 112.8 | 98.78 |
| 4 | 2022-10-28 | 2544 | LAL @ MIN | False | 1.0 | 0 | 28 | 7 | 5 | 112.5 | 101.52 |
| 5 | 2022-10-30 | 2544 | LAL vs. DEN | True | 1.0 | 0 | 26 | 6 | 8 | 112.8 | 98.78 |
| 6 | 2022-11-02 | 2544 | LAL vs. NOP | True | 2.0 | 0 | 20 | 10 | 8 | 111.3 | 99.67 |
| 7 | 2022-11-04 | 2544 | LAL vs. UTA | True | 1.0 | 0 | 17 | 10 | 8 | 114.5 | 101.11 |
| 8 | 2022-11-06 | 2544 | LAL vs. CLE | True | 1.0 | 0 | 27 | 7 | 4 | 109.8 | 96.29 |
| 9 | 2022-11-09 | 2544 | LAL @ LAC | False | 2.0 | 0 | 30 | 8 | 5 | 112.5 | 99.0 |



Ready for feature engineering!
Next: Run notebook 08 to create enhanced features and re-run ML models
