# NBA Team Pattern Analysis - Data Collection

In this notebook, we're gathering everything we need to understand NBA team patterns. Think of it like building the foundation of a house - we need all the right materials before we can start building.

## What We're Collecting

We're pulling together three datasets from Kaggle, ranging from a couple of thousand team seasons to millions of individual shots:

1. **Shot Data (2004-2024)**
   - Over 2.1 million shots
   - Every single shot taken in NBA games
   - Details like who took it, where they took it from, whether it went in
   - This helps us understand how teams approach offense

2. **Injury Data (1951-2023)**
   - 23,450 injury records
   - Every time a player got hurt or missed games
   - Information about what happened and how long they were out
   - This helps us understand how teams deal with player availability

3. **Team Statistics (1951-2023)**
   - 2,160 team seasons worth of data
   - All the standard basketball stats
   - Everything from points and rebounds to advanced metrics
   - This gives us the big picture of how teams performed

## Implementation

The downloads are automated: a `KaggleCollector` helper fetches each dataset from Kaggle, while a logger and a progress tracker record what happened. First, let's set up our environment with the necessary tools.

In [7]:
import os
import sys
from datetime import datetime

# Add the project root (the parent directory) to the Python path so the `src` package can be imported
sys.path.append('..')

# Import utility functions and classes for logging and progress tracking
from src.data.utils import setup_logging, DataCollectionProgress
from src.data.collectors.kaggle_collector import KaggleCollector

# Set up logging to track the progress of the data collection process
logger = setup_logging()

# Initialize a progress tracker to monitor the status of each data collection task
progress = DataCollectionProgress()

# Initialize the KaggleCollector to manage dataset downloads
kaggle = KaggleCollector('../data/raw/kaggle')
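
The internals of `KaggleCollector` are not shown in this notebook. As a rough sketch only, a `download_dataset` method that returns the `status` / `error` dictionary used in the cells below could be built on the official `kaggle` package as follows. The function name, the `fetch` parameter, and the one-folder-per-dataset layout are assumptions made for illustration; the real class in `src/data/collectors` may work differently.

```python
from pathlib import Path
from typing import Callable, Optional


def download_dataset(name: str, slug: str, base_dir: str,
                     fetch: Optional[Callable[[str, str], None]] = None) -> dict:
    """Download one Kaggle dataset into base_dir/name and report the outcome.

    Hypothetical stand-in for KaggleCollector.download_dataset.
    `fetch` lets a caller swap in a fake downloader (for example in tests).
    """
    target = Path(base_dir) / name
    try:
        target.mkdir(parents=True, exist_ok=True)
        if fetch is None:
            # Imported lazily because the kaggle package reads credentials
            # when it is imported.
            from kaggle.api.kaggle_api_extended import KaggleApi
            api = KaggleApi()
            api.authenticate()
            fetch = lambda s, p: api.dataset_download_files(s, path=p, unzip=True)
        fetch(slug, str(target))
        return {'status': 'success', 'path': str(target)}
    except Exception as exc:
        # Report the failure instead of raising, matching the result
        # dictionary the notebook cells inspect.
        return {'status': 'error', 'error': str(exc)}
```

Returning a dictionary rather than raising keeps each download cell simple: it only has to branch on `result['status']`.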

### Collecting Shot Data

First, let's get the shot location data. This will show us how teams approach offense: where they shoot from, what kinds of shots they prefer, and how this has changed over time.

In [8]:
# Download NBA Shots Dataset
progress.add_task('download_shots', total_steps=1)
progress.start_task('download_shots')

try:
    result = kaggle.download_dataset('nba_shots', 'mexwell/nba-shots')
    if result['status'] == 'success':
        logger.info("Successfully downloaded NBA shots dataset")
        progress.complete_task('download_shots')
    else:
        # Use .get() so a missing 'error' key cannot mask the real failure
        error = result.get('error', 'unknown error')
        logger.error(f"Failed to download NBA shots dataset: {error}")
        progress.complete_task('download_shots', success=False, error=error)
except Exception as e:
    logger.error(f"Error downloading NBA shots dataset: {e}")
    progress.complete_task('download_shots', success=False, error=str(e))

2025-02-23 16:05:30 - INFO - Downloading dataset: nba_shots


Dataset URL: https://www.kaggle.com/datasets/mexwell/nba-shots


2025-02-23 16:07:04 - INFO - Successfully downloaded nba_shots
2025-02-23 16:07:04 - INFO - Successfully downloaded NBA shots dataset


### Collecting Injury Data

Next, we'll gather the injury data. This is crucial for understanding how teams adapt when players are unavailable and how injuries affect team performance patterns.

In [9]:
# Download NBA Injury Stats Dataset
progress.add_task('download_injuries', total_steps=1)
progress.start_task('download_injuries')

try:
    result = kaggle.download_dataset('nba_injuries', 'loganlauton/nba-injury-stats-1951-2023')
    if result['status'] == 'success':
        logger.info("Successfully downloaded NBA injury stats dataset")
        progress.complete_task('download_injuries')
    else:
        # Use .get() so a missing 'error' key cannot mask the real failure
        error = result.get('error', 'unknown error')
        logger.error(f"Failed to download NBA injury stats dataset: {error}")
        progress.complete_task('download_injuries', success=False, error=error)
except Exception as e:
    logger.error(f"Error downloading NBA injury stats dataset: {e}")
    progress.complete_task('download_injuries', success=False, error=str(e))

2025-02-23 16:07:04 - INFO - Downloading dataset: nba_injuries


Dataset URL: https://www.kaggle.com/datasets/loganlauton/nba-injury-stats-1951-2023


2025-02-23 16:07:05 - INFO - Successfully downloaded nba_injuries
2025-02-23 16:07:05 - INFO - Successfully downloaded NBA injury stats dataset


### Collecting Team Statistics

Finally, we'll get the comprehensive team statistics. This data gives us the big picture of how teams performed and how basketball has evolved over seven decades.

In [10]:
# Download NBA Team Stats Dataset
progress.add_task('download_team_stats', total_steps=1)
progress.start_task('download_team_stats')

try:
    result = kaggle.download_dataset('nba_team_stats', 'sumitrodatta/nba-aba-baa-stats')
    if result['status'] == 'success':
        logger.info("Successfully downloaded NBA team stats dataset")
        progress.complete_task('download_team_stats')
    else:
        # Use .get() so a missing 'error' key cannot mask the real failure
        error = result.get('error', 'unknown error')
        logger.error(f"Failed to download NBA team stats dataset: {error}")
        progress.complete_task('download_team_stats', success=False, error=error)
except Exception as e:
    logger.error(f"Error downloading NBA team stats dataset: {e}")
    progress.complete_task('download_team_stats', success=False, error=str(e))

2025-02-23 16:07:05 - INFO - Downloading dataset: nba_team_stats


Dataset URL: https://www.kaggle.com/datasets/sumitrodatta/nba-aba-baa-stats


2025-02-23 16:07:09 - INFO - Successfully downloaded nba_team_stats
2025-02-23 16:07:09 - INFO - Successfully downloaded NBA team stats dataset
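
The three download cells follow the same add / start / download / complete sequence, so they could be folded into one loop. The sketch below is one possible way to do that, not the notebook's actual code. `DATASETS` and `download_all` are names invented here; the `progress` and collector methods are called with the same arguments used in the cells above.

```python
import logging

# Task suffix -> (local folder name, Kaggle dataset slug), as used above.
DATASETS = {
    'shots': ('nba_shots', 'mexwell/nba-shots'),
    'injuries': ('nba_injuries', 'loganlauton/nba-injury-stats-1951-2023'),
    'team_stats': ('nba_team_stats', 'sumitrodatta/nba-aba-baa-stats'),
}


def download_all(collector, progress, logger, datasets=DATASETS):
    """Download every dataset and return {task_name: succeeded}."""
    outcomes = {}
    for key, (name, slug) in datasets.items():
        task = f'download_{key}'
        progress.add_task(task, total_steps=1)
        progress.start_task(task)
        try:
            result = collector.download_dataset(name, slug)
        except Exception as exc:
            # Treat an exception like a reported failure so one bad
            # dataset does not stop the remaining downloads.
            result = {'status': 'error', 'error': str(exc)}
        succeeded = result.get('status') == 'success'
        if succeeded:
            logger.info("Successfully downloaded %s", name)
            progress.complete_task(task)
        else:
            error = result.get('error', 'unknown error')
            logger.error("Failed to download %s: %s", name, error)
            progress.complete_task(task, success=False, error=error)
        outcomes[task] = succeeded
    return outcomes
```

With the objects created in the setup cell, this would be called as `download_all(kaggle, progress, logger)`. Adding a fourth dataset then means adding one entry to `DATASETS` rather than copying a cell.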


## What We Learned

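Collecting these datasets already tells us a few things to keep in mind:
- The datasets cover different periods: shots span 2004-2024, while injuries and team statistics span 1951-2023
- Shot-level detail is only available from 2004 onward, so shot analysis cannot reach as far back as the other two datasets
- Injury reporting appears to have become more standardized in recent seasons, which we'll need to confirm during cleaning
- Some teams changed names or moved cities, which we'll need to account for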
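
To make the point about renamed and relocated teams concrete, here is a minimal sketch of a lookup that maps a few historical team names to the current franchise name. The relocations listed are real, but whether the datasets spell team names this way is an assumption, and `FRANCHISE_ALIASES` and `canonical_team` are hypothetical names, not part of the project code.

```python
# Historical name -> current franchise name (a deliberately partial list).
FRANCHISE_ALIASES = {
    'Seattle SuperSonics': 'Oklahoma City Thunder',
    'New Jersey Nets': 'Brooklyn Nets',
    'Vancouver Grizzlies': 'Memphis Grizzlies',
    'Charlotte Bobcats': 'Charlotte Hornets',
    'New Orleans Hornets': 'New Orleans Pelicans',
}


def canonical_team(name: str) -> str:
    """Collapse extra whitespace, then map old names to the current franchise."""
    cleaned = ' '.join(name.split())
    # Names without an alias are returned unchanged.
    return FRANCHISE_ALIASES.get(cleaned, cleaned)


print(canonical_team('Seattle SuperSonics'))
print(canonical_team('Boston Celtics'))
```

A plain name lookup like this ignores the season, so it would not be enough wherever the same name belonged to different franchises in different eras; a season-aware key would be needed there.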

Let's verify our collection was successful and see what we've gathered.

In [11]:
# Get a summary of the data collection process
summary = progress.get_summary()

print("Data Collection Summary:")
print(f"Total Tasks: {summary['total_tasks']}")
print(f"Completed Successfully: {summary['completed_tasks']}")
print(f"Failed: {summary['failed_tasks']}")
print(f"Total Duration: {summary['duration']}")

print("\nTask Details:")
for name, task in summary['tasks'].items():
    status = task['status']
    # Only compute a duration when both timestamps were recorded
    if task['start_time'] and task['end_time']:
        duration = task['end_time'] - task['start_time']
    else:
        duration = None
    print(f"\n{name}:")
    print(f"  Status: {status}")
    print(f"  Duration: {duration}")
    if task['error']:
        print(f"  Error: {task['error']}")

Data Collection Summary:
Total Tasks: 3
Completed Successfully: 3
Failed: 0
Total Duration: 0:01:38.885207

Task Details:

download_shots:
  Status: completed
  Duration: 0:01:33.967557

download_injuries:
  Status: completed
  Duration: 0:00:00.843514

download_team_stats:
  Status: completed
  Duration: 0:00:04.005904
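
The summary above reports that each task finished, but not what landed on disk. A quick check of the download folder can confirm that. This sketch assumes each dataset sits in its own subfolder of `../data/raw/kaggle`, which the notebook does not confirm; `summarize_downloads` is a hypothetical helper.

```python
from pathlib import Path


def summarize_downloads(base_dir):
    """Return {subfolder_name: (file_count, total_bytes)} for base_dir."""
    base = Path(base_dir)
    summary = {}
    if not base.is_dir():
        # Nothing downloaded yet, or the path is wrong.
        return summary
    for folder in sorted(p for p in base.iterdir() if p.is_dir()):
        files = [f for f in folder.rglob('*') if f.is_file()]
        total_bytes = sum(f.stat().st_size for f in files)
        summary[folder.name] = (len(files), total_bytes)
    return summary


for name, (count, size) in summarize_downloads('../data/raw/kaggle').items():
    print(f"{name}: {count} files, {size / 1e6:.1f} MB")
```

An empty result, or a folder with zero files, would indicate that a download reported success without leaving any data behind.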


## Why This Matters

Having all this data in one place is crucial because:
- We can see the complete picture of how teams play
- We can track changes over many decades
- We can connect different aspects of the game (like how injuries affect shooting)
- We have enough data to find real patterns, not just random variation

## Next Steps

With our data collected and organized, we're ready to move on to cleaning and standardization. Think of it like having all the ingredients for a recipe - now we need to prepare them properly before we can start cooking.