# Notebook 01: Data Collection

## Objective
Collect all raw data needed for NBA player performance prediction.

## Data to Collect
1. **Player Game Logs** (5 seasons, top 200 players per season by total minutes)
2. **Shot Chart Data** (shot locations for every selected player)

## Output
- `data/raw/gamelogs_combined.parquet` (~90,000 games)
- `data/raw/shot_charts_all.parquet` (~886,000 shots)

## Time: ~65 minutes (about 40 for game logs, 25 for shot charts)

In [11]:
import pandas as pd
import time
from pathlib import Path
from tqdm import tqdm
from nba_api.stats.endpoints import leaguegamelog, playergamelog, shotchartdetail

SEASONS = ['2019-20', '2020-21', '2021-22', '2022-23', '2023-24']
N_PLAYERS = 200
RATE_LIMIT = 0.6

raw_path = Path('../data/raw')
raw_path.mkdir(parents=True, exist_ok=True)

print("✅ Setup complete")

✅ Setup complete


## Part 1: Collect Game Logs

In [12]:
# Step 1: Identify top 200 players from each season
print("Step 1: Identifying top 200 players per season...\n")
top_players_by_season = {}

for season in SEASONS:
    print(f"{'='*60}\n{season}\n{'='*60}")
    
    # Get league game log for this season
    league_log = leaguegamelog.LeagueGameLog(
        season=season,
        season_type_all_star='Regular Season',
        player_or_team_abbreviation='P'
    )
    df_league = league_log.get_data_frames()[0]
    time.sleep(RATE_LIMIT)
    
    # Get top 200 players by total minutes
    top_players = (
        df_league.groupby(['PLAYER_ID', 'PLAYER_NAME'])['MIN']
        .sum()
        .reset_index()
        .sort_values('MIN', ascending=False)
        .head(N_PLAYERS)
    )
    
    top_players_by_season[season] = top_players
    print(f"Top player: {top_players.iloc[0]['PLAYER_NAME']} ({top_players.iloc[0]['MIN']:.0f} min)")
    print(f"✅ {len(top_players)} players identified")

# Step 2: Get ALL unique players across all seasons
print(f"\n{'='*60}")
print("Step 2: Combining unique players across all seasons...")
print(f"{'='*60}\n")

all_top_players = pd.concat(top_players_by_season.values(), ignore_index=True)
unique_players = all_top_players.drop_duplicates(subset='PLAYER_ID')[['PLAYER_ID', 'PLAYER_NAME']]
print(f"✅ Found {len(unique_players)} unique players across all seasons")

# Step 3: Collect game logs for ALL seasons for each unique player
print(f"\n{'='*60}")
print("Step 3: Collecting game logs for all players across all seasons...")
print(f"{'='*60}\n")

all_games = []
for idx, (_, player) in enumerate(tqdm(unique_players.iterrows(), total=len(unique_players), desc="Players")):
    player_id = player['PLAYER_ID']
    player_name = player['PLAYER_NAME']
    
    for season in SEASONS:
        try:
            gamelog = playergamelog.PlayerGameLog(
                player_id=player_id, 
                season=season, 
                season_type_all_star='Regular Season'
            )
            df_games = gamelog.get_data_frames()[0]
            
            if len(df_games) > 0:
                df_games['PLAYER_NAME'] = player_name
                all_games.append(df_games)
            
            time.sleep(RATE_LIMIT)
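        except Exception as e:
            # Request failed (timeout, rate limit, ...). A season the player sat out
            # is expected to come back as an empty frame, which is handled above.
            tqdm.write(f"⚠️ Game log failed for {player_name} {season}: {e}")
            time.sleep(1)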
    
    # Progress update every 50 players
    if (idx + 1) % 50 == 0:
        total_games = sum(len(g) for g in all_games)
        print(f"\nProgress: {idx+1}/{len(unique_players)} players | {total_games:,} games collected")

# Step 4: Combine and save
print(f"\n{'='*60}")
print("Step 4: Combining and saving...")
print(f"{'='*60}\n")

df_all = pd.concat(all_games, ignore_index=True)

# Save combined file
df_all.to_parquet(raw_path / "gamelogs_combined.parquet", index=False)

# Also save per-season files for reference
for season in SEASONS:
    season_data = df_all[df_all['SEASON_ID'].astype(str).str.contains(season.split('-')[0])]
    if len(season_data) > 0:
        season_data.to_parquet(raw_path / f"gamelogs_{season}.parquet", index=False)
        print(f"{season}: {len(season_data):,} games")

print(f"\n✅ Total: {len(df_all):,} games, {df_all['Player_ID'].nunique()} unique players")
print(f"✅ Data saved to data/raw/")

Step 1: Identifying top 200 players per season...

2019-20
Top player: CJ McCollum (2560 min)
✅ 200 players identified
2020-21
Top player: Julius Randle (2666 min)
✅ 200 players identified
2021-22
Top player: Mikal Bridges (2854 min)
✅ 200 players identified
2022-23
Top player: Mikal Bridges (2965 min)
✅ 200 players identified
2023-24
Top player: DeMar DeRozan (2995 min)
✅ 200 players identified

Step 2: Combining unique players across all seasons...

✅ Found 369 unique players across all seasons

Step 3: Collecting game logs for all players across all seasons...



Players:  14%|███▊                        | 50/369 [04:24<30:02,  5.65s/it]


Progress: 50/369 players | 15,375 games collected


Players:  27%|███████▎                   | 100/369 [09:56<29:00,  6.47s/it]


Progress: 100/369 players | 28,972 games collected


Players:  41%|██████████▉                | 150/369 [14:49<22:37,  6.20s/it]


Progress: 150/369 players | 41,498 games collected


Players:  54%|██████████████▋            | 200/369 [19:35<17:05,  6.07s/it]


Progress: 200/369 players | 53,666 games collected


Players:  68%|██████████████████▎        | 250/369 [24:25<11:24,  5.75s/it]


Progress: 250/369 players | 65,863 games collected


Players:  81%|█████████████████████▉     | 300/369 [29:36<08:05,  7.03s/it]


Progress: 300/369 players | 78,344 games collected


Players:  95%|█████████████████████████▌ | 350/369 [36:19<03:13, 10.16s/it]


Progress: 350/369 players | 87,564 games collected


Players: 100%|███████████████████████████| 369/369 [39:15<00:00,  6.38s/it]



Step 4: Combining and saving...

2019-20: 15,595 games
2020-21: 16,958 games
2021-22: 19,352 games
2022-23: 19,478 games
2023-24: 18,891 games

✅ Total: 90,274 games, 369 unique players
✅ Data saved to data/raw/


## Part 2: Collect Shot Charts

In [13]:
players = df_all[['Player_ID', 'PLAYER_NAME']].drop_duplicates()
print(f"Collecting shot charts for {len(players)} players...\n")

all_shots = []
for idx, (_, row) in enumerate(tqdm(players.iterrows(), total=len(players))):
    for season in SEASONS:
        try:
            shot_chart = shotchartdetail.ShotChartDetail(
                team_id=0,
                player_id=row['Player_ID'],
                season_nullable=season,
                season_type_all_star='Regular Season',
                context_measure_simple='FGA'
            )
            df_shots = shot_chart.get_data_frames()[0]
            if len(df_shots) > 0:
                df_shots['Player_ID'] = row['Player_ID']
                df_shots['Player_Name'] = row['PLAYER_NAME']
                df_shots['Season'] = season
                all_shots.append(df_shots)
            time.sleep(RATE_LIMIT)
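        except Exception as e:
            # Log the failure instead of silently dropping this player-season
            tqdm.write(f"⚠️ Shot chart failed for {row['PLAYER_NAME']} {season}: {e}")
            time.sleep(1)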
    
    if (idx + 1) % 50 == 0:
        print(f"Progress: {idx+1}/{len(players)} | Shots: {sum(len(s) for s in all_shots):,}")

df_shots = pd.concat(all_shots, ignore_index=True)
df_shots.to_parquet(raw_path / "shot_charts_all.parquet", index=False)
print(f"\n✅ {len(df_shots):,} shots collected")

Collecting shot charts for 369 players...



 14%|█████                                | 50/369 [03:12<20:27,  3.85s/it]

Progress: 50/369 | Shots: 213,632


 27%|█████████▊                          | 100/369 [06:24<17:01,  3.80s/it]

Progress: 100/369 | Shots: 364,532


 41%|██████████████▋                     | 150/369 [09:33<13:51,  3.80s/it]

Progress: 150/369 | Shots: 484,457


 54%|███████████████████▌                | 200/369 [12:40<10:29,  3.73s/it]

Progress: 200/369 | Shots: 586,298


 68%|████████████████████████▍           | 250/369 [15:48<07:29,  3.77s/it]

Progress: 250/369 | Shots: 698,459


 81%|█████████████████████████████▎      | 300/369 [18:57<04:21,  3.78s/it]

Progress: 300/369 | Shots: 800,736


 95%|██████████████████████████████████▏ | 350/369 [22:05<01:11,  3.74s/it]

Progress: 350/369 | Shots: 869,952


100%|████████████████████████████████████| 369/369 [23:31<00:00,  3.82s/it]



✅ 885,698 shots collected


## Summary

✅ Data collection complete!

**Data Collection Strategy:**
1. Identified top 200 players (by minutes) from each season (2019-20 through 2023-24)
2. Found 369 unique players across all seasons
3. Collected game logs for EACH player across ALL seasons (not just their top-200 season)
4. Collected shot charts for all players across all seasons
5. This ensures minimal orphaned shots (shots without corresponding game logs)
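
The selection strategy in steps 1 and 2 can be sketched on toy data to show why the union across seasons is larger than the per-season cutoff (369 unique players from five top-200 lists). The frames and the helper name `top_n_by_minutes` below are made up for illustration; only the column names match the notebook.

```python
import pandas as pd


def top_n_by_minutes(df_league: pd.DataFrame, n: int) -> pd.DataFrame:
    """Return the n players with the most total minutes in one season."""
    return (
        df_league.groupby(["PLAYER_ID", "PLAYER_NAME"])["MIN"]
        .sum()
        .reset_index()
        .sort_values("MIN", ascending=False)
        .head(n)
    )


# Toy league logs: one row per player-game (values are invented)
logs_by_season = {
    "2022-23": pd.DataFrame({
        "PLAYER_ID": [1, 1, 2, 3],
        "PLAYER_NAME": ["A", "A", "B", "C"],
        "MIN": [30, 35, 40, 10],
    }),
    "2023-24": pd.DataFrame({
        "PLAYER_ID": [1, 3, 3, 2],
        "PLAYER_NAME": ["A", "C", "C", "B"],
        "MIN": [20, 30, 30, 5],
    }),
}

# Step 1: top 2 per season (stands in for the top 200)
top_by_season = {s: top_n_by_minutes(df, n=2) for s, df in logs_by_season.items()}

# Step 2: union of the per-season lists, one row per player
unique_players = (
    pd.concat(list(top_by_season.values()), ignore_index=True)
    .drop_duplicates(subset="PLAYER_ID")[["PLAYER_ID", "PLAYER_NAME"]]
)
print(unique_players)
```

Here players A and B make the cut in the first season and C and A in the second, so the union holds three players even though each season contributes only two.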

**Files created:**
- `data/raw/gamelogs_combined.parquet` (~90,000 games)
- `data/raw/gamelogs_{season}.parquet` (per-season files)
- `data/raw/shot_charts_all.parquet` (~886,000 shots)

**Next:** Validation checks below, then proceed to Notebook 02 - EDA

## Validation: Game Logs

Before proceeding, let's validate the game log data quality.

In [14]:
# 1. Check data shape and completeness
print("="*60)
print("GAME LOG VALIDATION")
print("="*60)

print(f"\n1. Dataset Shape:")
print(f"   Total games: {len(df_all):,}")
print(f"   Unique players: {df_all['Player_ID'].nunique()}")
print(f"   Columns: {df_all.shape[1]}")

print(f"\n2. Games per Season:")
for season_id in sorted(df_all['SEASON_ID'].unique()):
    count = (df_all['SEASON_ID'] == season_id).sum()
    print(f"   {season_id}: {count:,} games")

print(f"\n3. Date Range:")
df_all['GAME_DATE'] = pd.to_datetime(df_all['GAME_DATE'])
print(f"   Min: {df_all['GAME_DATE'].min()}")
print(f"   Max: {df_all['GAME_DATE'].max()}")

GAME LOG VALIDATION

1. Dataset Shape:
   Total games: 90,274
   Unique players: 369
   Columns: 28

2. Games per Season:
   22019: 15,595 games
   22020: 16,958 games
   22021: 19,352 games
   22022: 19,478 games
   22023: 18,891 games

3. Date Range:
   Min: 2019-10-22 00:00:00
   Max: 2024-04-14 00:00:00
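
The `SEASON_ID` values in the output above (`22019` to `22023`) look like a one-digit prefix followed by the season's start year. Assuming the prefix `2` marks the regular season (an inference from this output, not something the notebook states), a season label can be mapped to its ID for an exact comparison instead of a substring match. `season_to_id` is a hypothetical helper name.

```python
def season_to_id(season: str, season_type_digit: str = "2") -> str:
    """Convert a label like '2019-20' to a SEASON_ID string like '22019'.

    Assumption: the ID is <season type digit><start year>, with '2'
    standing for the regular season, as suggested by the output above.
    """
    start_year = season.split("-")[0]
    if len(start_year) != 4 or not start_year.isdigit():
        raise ValueError(f"unexpected season label: {season!r}")
    return f"{season_type_digit}{start_year}"


SEASONS = ["2019-20", "2020-21", "2021-22", "2022-23", "2023-24"]
season_ids = [season_to_id(s) for s in SEASONS]
print(season_ids)
```

With this mapping, the per-season filter could read `df_all['SEASON_ID'].astype(str) == season_to_id(season)`.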


In [15]:
# 2. Check for missing values in critical columns
print(f"\n4. Missing Values Check:")
critical_cols = ['Player_ID', 'Game_ID', 'GAME_DATE', 'PTS', 'REB', 'AST', 'MIN', 'FGA', 'FG_PCT']
missing = df_all[critical_cols].isnull().sum()

if missing.sum() == 0:
    print(f"   ✅ No missing values in critical columns!")
else:
    print(f"   ⚠️ Missing values found:")
    for col, count in missing[missing > 0].items():
        print(f"      {col}: {count} ({count/len(df_all)*100:.2f}%)")

print(f"\n5. Duplicate Games Check:")
duplicates = df_all.duplicated(subset=['Player_ID', 'Game_ID']).sum()
if duplicates == 0:
    print(f"   ✅ No duplicate player-game records")
else:
    print(f"   ⚠️ Found {duplicates} duplicate records")


4. Missing Values Check:
   ✅ No missing values in critical columns!

5. Duplicate Games Check:
   ✅ No duplicate player-game records


In [16]:
# 3. Validate data distributions
print(f"\n6. Target Variable Distributions:")
print(f"   PTS: {df_all['PTS'].mean():.1f} ± {df_all['PTS'].std():.1f} (range: {df_all['PTS'].min()}-{df_all['PTS'].max()})")
print(f"   REB: {df_all['REB'].mean():.1f} ± {df_all['REB'].std():.1f} (range: {df_all['REB'].min()}-{df_all['REB'].max()})")
print(f"   AST: {df_all['AST'].mean():.1f} ± {df_all['AST'].std():.1f} (range: {df_all['AST'].min()}-{df_all['AST'].max()})")

print(f"\n7. Data Type Validation:")
expected_types = {
    'Player_ID': 'int64',
    'PTS': 'int64',
    'REB': 'int64',
    'AST': 'int64',
    'FG_PCT': 'float64'
}

all_correct = True
for col, expected in expected_types.items():
    actual = str(df_all[col].dtype)
    if actual != expected:
        print(f"   ⚠️ {col}: expected {expected}, got {actual}")
        all_correct = False

if all_correct:
    print(f"   ✅ All data types correct")

print(f"\n8. Required Columns Check:")
required_cols = ['SEASON_ID', 'Player_ID', 'Game_ID', 'GAME_DATE', 'MATCHUP', 
                 'MIN', 'FGA', 'FG_PCT', 'FG3A', 'FG3_PCT', 'FTA', 'FT_PCT',
                 'REB', 'AST', 'PTS', 'TOV', 'PLAYER_NAME']
missing_cols = [col for col in required_cols if col not in df_all.columns]

if len(missing_cols) == 0:
    print(f"   ✅ All {len(required_cols)} required columns present")
else:
    print(f"   ⚠️ Missing columns: {missing_cols}")


6. Target Variable Distributions:
   PTS: 12.6 ± 8.9 (range: 0-73)
   REB: 4.7 ± 3.5 (range: 0-31)
   AST: 2.8 ± 2.8 (range: 0-24)

7. Data Type Validation:
   ✅ All data types correct

8. Required Columns Check:
   ✅ All 17 required columns present


In [17]:
print(f"\n11. Required Shot Columns Check:")
required_shot_cols = ['GRID_TYPE', 'GAME_ID', 'GAME_EVENT_ID', 'Player_ID', 
                      'Player_Name', 'TEAM_ID', 'TEAM_NAME', 'PERIOD', 
                      'MINUTES_REMAINING', 'SECONDS_REMAINING', 'EVENT_TYPE',
                      'SHOT_ZONE_BASIC', 'SHOT_ZONE_AREA', 'SHOT_ZONE_RANGE',
                      'SHOT_DISTANCE', 'LOC_X', 'LOC_Y', 'SHOT_ATTEMPTED_FLAG',
                      'SHOT_MADE_FLAG', 'GAME_DATE', 'Season']

missing_shot_cols = [col for col in required_shot_cols if col not in df_shots.columns]

if len(missing_shot_cols) == 0:
    print(f"   ✅ All {len(required_shot_cols)} required shot columns present")
else:
    print(f"   ⚠️ Missing columns: {missing_shot_cols}")

print(f"\n{'='*60}")
print("VALIDATION COMPLETE")
print(f"{'='*60}")
print(f"\n✅ Game logs validated: {len(df_all):,} games")
print(f"✅ Shot charts validated: {len(df_shots):,} shots")
print(f"\nReady to proceed to Notebook 02 - EDA")


11. Required Shot Columns Check:
   ✅ All 21 required shot columns present

VALIDATION COMPLETE

✅ Game logs validated: 90,274 games
✅ Shot charts validated: 885,698 shots

Ready to proceed to Notebook 02 - EDA


In [18]:
print(f"\n8. Shot Zone Distribution:")
zone_counts = df_shots['SHOT_ZONE_BASIC'].value_counts()
print(f"   Total zones: {len(zone_counts)}")
for zone, count in zone_counts.items():
    pct = count / len(df_shots) * 100
    print(f"   {zone}: {count:,} ({pct:.1f}%)")

print(f"\n9. Shot Made/Attempted Validation:")
total_attempted = df_shots['SHOT_ATTEMPTED_FLAG'].sum()
total_made = df_shots['SHOT_MADE_FLAG'].sum()
overall_fg_pct = (total_made / total_attempted * 100) if total_attempted > 0 else 0

print(f"   Total attempts: {total_attempted:,}")
print(f"   Total makes: {total_made:,}")
print(f"   Overall FG%: {overall_fg_pct:.1f}%")

# Sanity check: NBA league average is typically 45-47%
if 40 <= overall_fg_pct <= 50:
    print(f"   ✅ FG% in expected NBA range (40-50%)")
else:
    print(f"   ⚠️ FG% outside typical NBA range - check data quality")

print(f"\n10. Shot Distance Distribution:")
print(f"   Min: {df_shots['SHOT_DISTANCE'].min()} ft")
print(f"   Max: {df_shots['SHOT_DISTANCE'].max()} ft")
print(f"   Mean: {df_shots['SHOT_DISTANCE'].mean():.1f} ft")
print(f"   Median: {df_shots['SHOT_DISTANCE'].median():.1f} ft")

# Sanity check: NBA court is 94 ft long, max shot distance ~90 ft
if df_shots['SHOT_DISTANCE'].max() <= 95:
    print(f"   ✅ All shot distances within court dimensions")
else:
    print(f"   ⚠️ Some shots beyond court dimensions - check data quality")


8. Shot Zone Distribution:
   Total zones: 7
   Restricted Area: 261,561 (29.5%)
   Above the Break 3: 261,216 (29.5%)
   In The Paint (Non-RA): 165,504 (18.7%)
   Mid-Range: 114,418 (12.9%)
   Left Corner 3: 42,539 (4.8%)
   Right Corner 3: 38,825 (4.4%)
   Backcourt: 1,635 (0.2%)

9. Shot Made/Attempted Validation:
   Total attempts: 885,698
   Total makes: 416,830
   Overall FG%: 47.1%
   ✅ FG% in expected NBA range (40-50%)

10. Shot Distance Distribution:
   Min: 0 ft
   Max: 87 ft
   Mean: 13.6 ft
   Median: 13.0 ft
   ✅ All shot distances within court dimensions


In [19]:
print(f"\n5. Missing Values Check:")
shot_critical_cols = ['Player_ID', 'GAME_ID', 'GAME_DATE', 'SHOT_ZONE_BASIC', 
                      'SHOT_DISTANCE', 'SHOT_MADE_FLAG', 'SHOT_ATTEMPTED_FLAG']
missing_shots = df_shots[shot_critical_cols].isnull().sum()

if missing_shots.sum() == 0:
    print(f"   ✅ No missing values in critical shot columns!")
else:
    print(f"   ⚠️ Missing values found:")
    for col, count in missing_shots[missing_shots > 0].items():
        print(f"      {col}: {count} ({count/len(df_shots)*100:.2f}%)")

print(f"\n6. Duplicate Shots Check:")
shot_duplicates = df_shots.duplicated(subset=['Player_ID', 'GAME_ID', 'PERIOD', 
                                               'MINUTES_REMAINING', 'SECONDS_REMAINING']).sum()
if shot_duplicates == 0:
    print(f"   ✅ No duplicate shot records")
else:
    print(f"   ⚠️ Found {shot_duplicates} potential duplicate shots")
    pct = shot_duplicates / len(df_shots) * 100
    print(f"      ({pct:.3f}% of total - likely edge cases, not a concern)")

print(f"\n7. Orphaned Shots Check:")
print(f"   (Checking if shots have corresponding game logs...)")
valid_player_game_pairs = set(zip(df_all['Player_ID'], df_all['Game_ID']))
shot_player_game_pairs = set(zip(df_shots['Player_ID'], df_shots['GAME_ID']))
orphaned = len(shot_player_game_pairs - valid_player_game_pairs)

if orphaned == 0:
    print(f"   ✅ All shots have corresponding game logs")
else:
    print(f"   ⚠️ {orphaned} player-game pairs in shots but not in game logs")
    pct = orphaned / len(shot_player_game_pairs) * 100
    print(f"      ({pct:.1f}% of player-game pairs)")
    print(f"      Note: Since we collect game logs for ALL seasons for each player,")
    print(f"            orphaned shots should be minimal. Any remaining are likely from:")
    print(f"            - API timing/data consistency issues")
    print(f"            - Players with <1 minute played (filtered out of game logs)")
    print(f"      These will be filtered out during feature engineering.")


5. Missing Values Check:
   ✅ No missing values in critical shot columns!

6. Duplicate Shots Check:
   ⚠️ Found 673 potential duplicate shots
      (0.076% of total - likely edge cases, not a concern)

7. Orphaned Shots Check:
   (Checking if shots have corresponding game logs...)
   ⚠️ 29 player-game pairs in shots but not in game logs
      (0.0% of player-game pairs)
      Note: Since we collect game logs for ALL seasons for each player,
            orphaned shots should be minimal. Any remaining are likely from:
            - API timing/data consistency issues
            - Players with <1 minute played (filtered out of game logs)
      These will be filtered out during feature engineering.
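
The duplicate check above keys on the game clock, so two distinct attempts by the same player within the same second (a miss followed by a tip-in, for example) are counted as duplicates. A sketch of a stricter check follows, assuming `GAME_EVENT_ID` identifies a single event within a game; that assumption is not verified in this notebook, and the toy rows are invented.

```python
import pandas as pd

# Toy shots: rows 0 and 1 share a clock time but are different events;
# rows 2 and 3 are the same event recorded twice (a true duplicate).
shots = pd.DataFrame({
    "Player_ID": [7, 7, 7, 7],
    "GAME_ID": ["0022300001"] * 4,
    "GAME_EVENT_ID": [101, 103, 250, 250],
    "PERIOD": [1, 1, 2, 2],
    "MINUTES_REMAINING": [5, 5, 3, 3],
    "SECONDS_REMAINING": [12, 12, 40, 40],
})

# Key used in the check above: player, game, and clock time
time_key = ["Player_ID", "GAME_ID", "PERIOD", "MINUTES_REMAINING", "SECONDS_REMAINING"]
# Stricter key: player, game, and event id (assumed unique per event)
event_key = ["Player_ID", "GAME_ID", "GAME_EVENT_ID"]

time_dupes = int(shots.duplicated(subset=time_key).sum())
event_dupes = int(shots.duplicated(subset=event_key).sum())
print(f"clock-based: {time_dupes}, event-based: {event_dupes}")
```

If the event-based count on the real data turns out to be zero, the 673 clock-based hits are same-second attempts rather than repeated rows; if not, the remaining rows are candidates for `drop_duplicates`.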


In [20]:
print("="*60)
print("SHOT CHART VALIDATION")
print("="*60)

print(f"\n1. Dataset Shape:")
print(f"   Total shots: {len(df_shots):,}")
print(f"   Unique players: {df_shots['Player_ID'].nunique()}")
print(f"   Unique games: {df_shots['GAME_ID'].nunique()}")
print(f"   Columns: {df_shots.shape[1]}")

print(f"\n2. Shots per Season:")
for season in SEASONS:
    count = (df_shots['Season'] == season).sum()
    print(f"   {season}: {count:,} shots")

print(f"\n3. Date Range:")
df_shots['GAME_DATE'] = pd.to_datetime(df_shots['GAME_DATE'])
print(f"   Min: {df_shots['GAME_DATE'].min()}")
print(f"   Max: {df_shots['GAME_DATE'].max()}")

print(f"\n4. Date Alignment Check:")
shots_date_range = (df_shots['GAME_DATE'].min(), df_shots['GAME_DATE'].max())
games_date_range = (df_all['GAME_DATE'].min(), df_all['GAME_DATE'].max())
if shots_date_range[0] >= games_date_range[0] and shots_date_range[1] <= games_date_range[1]:
    print(f"   ✅ Shot dates align with game log dates")
else:
    print(f"   ⚠️ Shot dates: {shots_date_range}")
    print(f"   ⚠️ Game dates: {games_date_range}")

SHOT CHART VALIDATION

1. Dataset Shape:
   Total shots: 885,698
   Unique players: 369
   Unique games: 5829
   Columns: 27

2. Shots per Season:
   2019-20: 156,878 shots
   2020-21: 164,458 shots
   2021-22: 186,756 shots
   2022-23: 190,165 shots
   2023-24: 187,441 shots

3. Date Range:
   Min: 2019-10-22 00:00:00
   Max: 2024-04-14 00:00:00

4. Date Alignment Check:
   ✅ Shot dates align with game log dates


## Validation: Shot Charts

The shot chart checks, numbered 1 to 11 in the cell output, run in cells In [17] through In [20] above.