# NBA Data Collection: Both Teams Lead by 5+ Prediction

This notebook demonstrates the complete data collection pipeline for our unique prediction target:
**Identifying games where both teams led by 5 or more points at any point during the game.**

## Workflow
1. Set up nba_api with proper rate limiting
2. Collect season game data
3. Collect play-by-play data (critical for labeling)
4. Collect boxscore data (for features)
5. Label games using play-by-play analysis
6. Analyze label distribution

## Critical Notes
- **Rate limiting is MANDATORY** (700ms between requests)
- Full season collection takes ~2-3 hours
- Save data incrementally to avoid losing progress

In [None]:
# Import required libraries
import sys
import os
from pathlib import Path

# Add src to path
sys.path.append(str(Path.cwd().parent / 'src'))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

# Set up logging
import logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)

# Import our modules
from data.collectors import NBADataCollector
from data.labelers import GameLabeler, calculate_label_statistics
from utils.config import get_config

print("✓ Imports successful")

In [None]:
# Load configuration
config = get_config()

print("Configuration:")
print(f"  Seasons to collect: {config.seasons}")
print(f"  Rate limit interval: {config.rate_limit_interval:.3f}s")
print(f"  API timeout: {config.api_timeout}s")
print(f"  Target threshold: {config.target_threshold} points")

## Step 1: Initialize Data Collector

The collector automatically implements:
- Rate limiting (700ms between requests)
- Browser-like headers (avoid bot detection)
- Exponential backoff on errors
- Proper timeout handling
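
The internals of `NBADataCollector` are not shown in this notebook. As a rough sketch of how a minimum request interval and exponential backoff can be combined, consider the hypothetical `RateLimitedCaller` helper below. It is an illustration only, not part of `nba_api` or of our `src` package, and the retry count and delays are assumed values.

```python
import time


class RateLimitedCaller:
    """Hypothetical helper: spaces out calls and retries failures with backoff."""

    def __init__(self, min_interval=0.7, max_retries=3, base_delay=1.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.min_interval = min_interval  # seconds between consecutive calls
        self.max_retries = max_retries    # retries allowed after the first failure
        self.base_delay = base_delay      # first backoff delay, doubled on each retry
        self._sleep = sleep               # injectable so the logic can be tested
        self._clock = clock
        self._last_call = None

    def _wait_for_slot(self):
        # Sleep only for the part of the interval that has not yet elapsed.
        if self._last_call is not None:
            remaining = self.min_interval - (self._clock() - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
        self._last_call = self._clock()

    def call(self, func, *args, **kwargs):
        for attempt in range(self.max_retries + 1):
            self._wait_for_slot()
            try:
                return func(*args, **kwargs)
            except Exception:
                if attempt == self.max_retries:
                    raise  # out of retries: surface the error to the caller
                # Exponential backoff: base_delay, then 2x, 4x, ...
                self._sleep(self.base_delay * 2 ** attempt)
```

Any request function can be wrapped as `caller.call(fetch, game_id)`, where `fetch` stands for whatever function issues the request. Catching a bare `Exception` keeps the sketch short; a real collector would likely retry only on timeouts and connection errors.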

In [None]:
collector = NBADataCollector(config)
print(f"✓ Collector initialized with {len(collector.all_teams)} teams loaded")

## Step 2: Collect Data for One Season (Demo)

**WARNING**: Full season collection takes 2-3 hours due to rate limiting.

For demonstration, the collection call below is left commented out so the notebook can run without triggering a full download. For production:
- Run overnight or in multiple sessions
- Use checkpointing to resume if interrupted
- Monitor API errors and adjust rate limiting if needed
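
One way to make collection resumable is to write one file per game and skip any game whose file already exists. The sketch below assumes a hypothetical `fetch_game` callable that returns a DataFrame for a game ID. The checkpointing inside `NBADataCollector` may work differently.

```python
from pathlib import Path

import pandas as pd


def collect_with_checkpoints(game_ids, fetch_game, out_dir):
    """Fetch each game once, writing one CSV per game so a rerun can resume.

    `fetch_game` is a hypothetical callable: game_id -> pandas.DataFrame.
    Returns the list of game IDs fetched during this run.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for game_id in game_ids:
        target = out_dir / f"{game_id}.csv"
        if target.exists():
            continue  # already saved by an earlier run
        df = fetch_game(game_id)
        # Write to a temporary name first, then rename, so an interrupted
        # write never leaves a partial file that looks like a finished one.
        tmp = target.with_name(target.name + ".tmp")
        df.to_csv(tmp, index=False)
        tmp.replace(target)
        fetched.append(game_id)
    return fetched
```

Because finished games are detected by file existence, rerunning the same call after an interruption fetches only the games that are still missing.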

In [None]:
# For demo: collect just 2023-24 season
DEMO_SEASON = '2023-24'
save_dir = config.data_paths['raw'] / DEMO_SEASON

print(f"Starting data collection for {DEMO_SEASON}...")
print(f"Save directory: {save_dir}")
print("\nThis will take approximately 2-3 hours for a full season.")
print("Progress will be logged every 50 games.\n")

In [None]:
# Collect season data
# UNCOMMENT TO RUN FULL COLLECTION (takes 2-3 hours)

# saved_files = collector.collect_season_data(DEMO_SEASON, save_dir)
# print("\nData collection complete!")
# print(f"Files saved:")
# for key, path in saved_files.items():
#     print(f"  {key}: {path}")

## Step 3: Load Collected Data

For this demo, we load data that was already collected into `save_dir`. If you already have labeled data, you can skip ahead to feature engineering (Notebook 02).

In [None]:
# Load games list
games_file = save_dir / f'games_{DEMO_SEASON}.csv'

if games_file.exists():
    games_df = pd.read_csv(games_file)
    print(f"Loaded {len(games_df)} games")
    print(f"\nSample of games data:")
    display(games_df.head())
else:
    print("No games data found. Run collection first.")

## Step 4: Label Games Using Play-by-Play Data

This is the **critical step** that defines our unique target.

For each game, we analyze play-by-play data to determine:
- Maximum lead for home team
- Maximum lead for away team
- Whether both reached our threshold (a lead of 5 or more points)
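
`GameLabeler` lives in our `src` package and is not reproduced here. As a minimal sketch of the underlying idea, assume the play-by-play has already been reduced to a running score margin (home score minus away score) after each scoring event. The function name `both_teams_led_by` and the margin input are illustrative assumptions, not the labeler's actual interface.

```python
def both_teams_led_by(margins, threshold=5):
    """Summarize a game from its running margins (home score - away score)."""
    margins = list(margins)
    # A positive margin means the home team leads; a negative one, the away team.
    home_max_lead = max(max(margins, default=0), 0)
    away_max_lead = max(-min(margins, default=0), 0)
    return {
        "home_max_lead": home_max_lead,
        "away_max_lead": away_max_lead,
        # "Led by 5 or more" is inclusive, hence >= rather than >.
        "both_led": home_max_lead >= threshold and away_max_lead >= threshold,
    }
```

For the margins `[2, 5, 3, -1, -4, -6, -2, 1]`, the home team's largest lead is 5 and the away team's is 6, so the game is labeled positive. A game in which one team never trails gets a maximum lead of 0 for the other side and is labeled negative.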

In [None]:
# Initialize labeler
labeler = GameLabeler(threshold=config.target_threshold)
print(f"Labeler initialized with threshold: {config.target_threshold} points")

In [None]:
# Demo: Label a single game
pbp_dir = save_dir / 'play_by_play'

if pbp_dir.exists():
    # Get first available play-by-play file
    # Sort so the sample game and the demo subset are the same on every run
    pbp_files = sorted(pbp_dir.glob('*.csv'))
    
    if pbp_files:
        sample_file = pbp_files[0]
        game_id = sample_file.stem
        
        print(f"Labeling game: {game_id}")
        pbp_df = pd.read_csv(sample_file)
        
        label = labeler.label_game_from_pbp(pbp_df)
        
        print(f"\nLabel Results:")
        print(f"  Both teams led by 5+: {label['both_teams_led_5plus']}")
        print(f"  Home max lead: {label['home_max_lead']}")
        print(f"  Away max lead: {label['away_max_lead']}")
        print(f"  Lead changes: {label['total_lead_changes']}")
        print(f"  Volatility score: {label['game_volatility_score']:.2f}")
    else:
        print(f"No play-by-play CSV files found in {pbp_dir}.")
else:
    print("No play-by-play data found. Run collection first.")

In [None]:
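# Label games in the season (first 100 only for the demo; remove the slice to label all)
if pbp_dir.exists():
    demo_files = pbp_files[:100]  # Limit to 100 for demo
    print(f"Labeling {len(demo_files)} of {len(pbp_files)} games...")
    
    all_labels = []
    
    for pbp_file in tqdm(demo_files):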
        game_id = pbp_file.stem
        pbp_df = pd.read_csv(pbp_file)
        
        label = labeler.label_game_from_pbp(pbp_df)
        label['game_id'] = game_id
        all_labels.append(label)
    
    labels_df = pd.DataFrame(all_labels)
    
    # Save labels
    labels_file = config.data_paths['labels'] / f'labels_{DEMO_SEASON}.csv'
    labels_file.parent.mkdir(parents=True, exist_ok=True)
    labels_df.to_csv(labels_file, index=False)
    
    print(f"\n✓ Labels saved to {labels_file}")
    display(labels_df.head(10))

## Step 5: Analyze Label Distribution

Understanding our target variable is critical for model development.
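
A quick check worth running before modeling is how balanced the target is, because the share of the more common class sets the accuracy that a trivial "always predict the majority" model would achieve. The helper below is a hedged sketch. `class_balance` is a hypothetical name and is separate from `calculate_label_statistics`, whose implementation is not shown here.

```python
import pandas as pd


def class_balance(labels):
    """Summarize a boolean target: positive rate and majority-class baseline."""
    labels = pd.Series(labels).astype(bool)
    positive_rate = labels.mean()
    return {
        "n_games": int(labels.size),
        "positive_rate": float(positive_rate),
        # Accuracy of always predicting the more common class.
        "majority_baseline_accuracy": float(max(positive_rate, 1 - positive_rate)),
    }
```

In this notebook it could be called as `class_balance(labels_df['both_teams_led_5plus'])` once `labels_df` exists. Any model trained later should beat the reported baseline accuracy to be worth keeping.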

In [None]:
# Calculate label statistics
if 'labels_df' in locals():
    stats = calculate_label_statistics(labels_df)
    
    print(f"\n{'='*60}")
    print("TARGET VARIABLE ANALYSIS")
    print(f"{'='*60}")
    print(f"\nTotal games analyzed: {stats['total_games']}")
    print(f"\nGames where both teams led by 5+:")
    print(f"  Count: {stats['both_teams_led_count']}")
    print(f"  Percentage: {stats['both_teams_led_pct']:.1%}")
    print(f"\nAverage game statistics:")
    print(f"  Home max lead: {stats['avg_home_max_lead']:.1f} points")
    print(f"  Away max lead: {stats['avg_away_max_lead']:.1f} points")
    print(f"  Lead changes: {stats['avg_lead_changes']:.1f}")
    print(f"  Volatility: {stats['avg_volatility']:.1f}")

In [None]:
# Visualize label distribution
if 'labels_df' in locals():
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
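    # Target distribution
    # Fix the bar order to [False, True] so colors and tick labels always match;
    # value_counts() alone orders the bars by frequency.
    target_counts = (
        labels_df['both_teams_led_5plus']
        .astype(bool)
        .value_counts()
        .reindex([False, True], fill_value=0)
    )
    target_counts.plot(kind='bar', ax=axes[0,0], color=['red', 'green'])
    axes[0,0].set_title('Target Distribution: Both Teams Led by 5+', fontsize=12, fontweight='bold')
    axes[0,0].set_xlabel('Both Teams Led by 5+')
    axes[0,0].set_ylabel('Count')
    axes[0,0].set_xticklabels(['False', 'True'], rotation=0)
    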
    
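    # Max lead distribution
    # Use plot(kind='hist') to overlay both columns on one axis. DataFrame.hist()
    # draws one subplot per column and clears the figure when given a single ax.
    labels_df[['home_max_lead', 'away_max_lead']].plot(
        kind='hist', ax=axes[0,1], bins=20, alpha=0.7
    )
    axes[0,1].set_title('Distribution of Maximum Leads', fontsize=12, fontweight='bold')
    axes[0,1].set_xlabel('Points')
    axes[0,1].set_ylabel('Count')
    axes[0,1].legend(['Home', 'Away'])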
    
    # Lead changes
    labels_df['total_lead_changes'].hist(ax=axes[1,0], bins=20, color='purple', alpha=0.7)
    axes[1,0].set_title('Distribution of Lead Changes', fontsize=12, fontweight='bold')
    axes[1,0].set_xlabel('Number of Lead Changes')
    axes[1,0].set_ylabel('Count')
    
    # Volatility
    labels_df['game_volatility_score'].hist(ax=axes[1,1], bins=20, color='orange', alpha=0.7)
    axes[1,1].set_title('Distribution of Game Volatility', fontsize=12, fontweight='bold')
    axes[1,1].set_xlabel('Volatility Score (Std Dev of Point Diff)')
    axes[1,1].set_ylabel('Count')
    
    plt.tight_layout()
    config.output_path.mkdir(parents=True, exist_ok=True)  # ensure the output directory exists
    plt.savefig(config.output_path / 'label_distribution.png', dpi=150)
    plt.show()

## Next Steps

1. **Feature Engineering** (Notebook 02): Create predictive features from boxscore data
2. **Model Training** (Notebook 03): Train baseline models
3. **Model Evaluation** (Notebook 04): Comprehensive evaluation and comparison
4. **Deployment** (Notebook 05): Real-time prediction pipeline

## Key Takeaways

- Data collection requires patience (2-3 hours per season)
- Rate limiting is non-negotiable to avoid IP bans
- Play-by-play data is critical for our unique target
- Target distribution will inform model selection and training