# NBA Playoff Predictor - Data Collection

This notebook downloads the raw data for an NBA playoff prediction dataset: historical team stats, shot locations, and injury records, each from a separate Kaggle dataset.

## Notebook Overview

1. **Setup and Imports**: Load libraries and prepare the environment.
2. **Kaggle Data Collection**: Download the shot location, injury, and team stats datasets from Kaggle.
3. **Collection Summary**: Report the status and duration of each download task.


## Setup and Imports

This section sets up the environment by importing necessary libraries and enabling logging to monitor progress and troubleshoot issues during data collection.


In [6]:
import os
import sys
from datetime import datetime

# Add the project root (the parent directory) to the Python path so the src package can be imported
sys.path.append('..')

# Import utility functions and classes for logging and progress tracking
from src.data.utils import setup_logging, DataCollectionProgress
from src.data.collectors.kaggle_collector import KaggleCollector

# Set up logging to track the progress of the data collection process
logger = setup_logging()

# Initialize a progress tracker to monitor the status of each data collection task
progress = DataCollectionProgress()

# Initialize the KaggleCollector to manage dataset downloads
kaggle = KaggleCollector('../data/raw/kaggle')
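For readers without access to `src.data.utils`, the sketch below shows one way a tracker with the same interface as `DataCollectionProgress` could work. It is an assumption inferred from how this notebook calls the tracker, not the project's actual implementation. The class name `ProgressSketch` and the `pending`, `running`, and `failed` status strings are hypothetical; only `completed` appears in the notebook output.

```python
from datetime import datetime


class ProgressSketch:
    """Hypothetical stand-in for DataCollectionProgress (not the project's code)."""

    def __init__(self):
        self.tasks = {}
        self.started_at = datetime.now()

    def add_task(self, name, total_steps=1):
        # Register a task before it starts so it shows up in the summary.
        self.tasks[name] = {
            "status": "pending",
            "total_steps": total_steps,
            "start_time": None,
            "end_time": None,
            "error": None,
        }

    def start_task(self, name):
        self.tasks[name]["status"] = "running"
        self.tasks[name]["start_time"] = datetime.now()

    def complete_task(self, name, success=True, error=None):
        # Record the outcome and keep the error message, if any.
        task = self.tasks[name]
        task["status"] = "completed" if success else "failed"
        task["end_time"] = datetime.now()
        task["error"] = error

    def get_summary(self):
        statuses = [task["status"] for task in self.tasks.values()]
        return {
            "total_tasks": len(self.tasks),
            "completed_tasks": statuses.count("completed"),
            "failed_tasks": statuses.count("failed"),
            "duration": datetime.now() - self.started_at,
            "tasks": self.tasks,
        }
```

Storing `start_time` and `end_time` as `datetime` objects is what makes it possible to compute per-task durations by subtraction, as the summary cell does.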

## Kaggle Data Collection

We use the Kaggle API, through `KaggleCollector`, to download three datasets. Each dataset is handled in its own cell so that its outcome is logged and tracked separately.
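The internals of `KaggleCollector` are not shown in this notebook. As a rough sketch only, a download step built on the official `kaggle` Python package might look like the following. The function names, the `downloader` parameter, the `'error'` status value, and the `path` key are assumptions made for illustration; the notebook itself only confirms the `status` and `error` keys and the `'success'` value.

```python
from pathlib import Path


def kaggle_download(dataset_ref, target_dir):
    """Download and unzip a Kaggle dataset with the official `kaggle` package."""
    # Imported lazily because importing the kaggle package usually attempts
    # authentication, which needs API credentials (e.g. ~/.kaggle/kaggle.json).
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    api.dataset_download_files(dataset_ref, path=str(target_dir), unzip=True)


def download_dataset(name, dataset_ref, base_dir, downloader=kaggle_download):
    """Return a status dict shaped like the one the notebook cells check."""
    # Each dataset gets its own subdirectory (an assumed layout).
    target_dir = Path(base_dir) / name
    target_dir.mkdir(parents=True, exist_ok=True)
    try:
        downloader(dataset_ref, target_dir)
    except Exception as exc:
        return {"status": "error", "error": str(exc)}
    return {"status": "success", "path": str(target_dir)}
```

Passing the downloader as a parameter keeps the status handling testable without network access or credentials.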

### NBA Shot Locations Dataset

This dataset provides detailed shot locations and types to analyze team shooting profiles.


In [7]:
# Download NBA Shots Dataset
progress.add_task('download_shots', total_steps=1)
progress.start_task('download_shots')

try:
    result = kaggle.download_dataset('nba_shots', 'mexwell/nba-shots')
    if result['status'] == 'success':
        logger.info("Successfully downloaded NBA shots dataset")
        progress.complete_task('download_shots')
    else:
        # Record the reported error so it appears in the collection summary
        logger.error(f"Failed to download NBA shots dataset: {result.get('error')}")
        progress.complete_task('download_shots', success=False, error=result.get('error'))
except Exception as e:
    logger.error(f"Error downloading NBA shots dataset: {e}")
    progress.complete_task('download_shots', success=False, error=str(e))

2024-12-10 23:38:34 - INFO - Downloading dataset: nba_shots


Dataset URL: https://www.kaggle.com/datasets/mexwell/nba-shots


2024-12-10 23:38:40 - INFO - Successfully downloaded nba_shots
2024-12-10 23:38:40 - INFO - Successfully downloaded NBA shots dataset


### NBA Injury Statistics Dataset

This dataset includes injury data from 1951 to 2023, offering insights into how injuries impact team performance.


In [8]:
# Download NBA Injury Stats Dataset
progress.add_task('download_injuries', total_steps=1)
progress.start_task('download_injuries')

try:
    result = kaggle.download_dataset('nba_injuries', 'loganlauton/nba-injury-stats-1951-2023')
    if result['status'] == 'success':
        logger.info("Successfully downloaded NBA injury stats dataset")
        progress.complete_task('download_injuries')
    else:
        # Record the reported error so it appears in the collection summary
        logger.error(f"Failed to download NBA injury stats dataset: {result.get('error')}")
        progress.complete_task('download_injuries', success=False, error=result.get('error'))
except Exception as e:
    logger.error(f"Error downloading NBA injury stats dataset: {e}")
    progress.complete_task('download_injuries', success=False, error=str(e))

2024-12-10 23:38:40 - INFO - Downloading dataset: nba_injuries


Dataset URL: https://www.kaggle.com/datasets/loganlauton/nba-injury-stats-1951-2023


2024-12-10 23:38:40 - INFO - Successfully downloaded nba_injuries
2024-12-10 23:38:40 - INFO - Successfully downloaded NBA injury stats dataset


### NBA/ABA/BAA Team Statistics Dataset

This is the main dataset. It contains historical team stats from 1950 to the present, including regular season data and advanced metrics.


In [9]:
# Download NBA Team Stats Dataset
progress.add_task('download_team_stats', total_steps=1)
progress.start_task('download_team_stats')

try:
    result = kaggle.download_dataset('nba_team_stats', 'sumitrodatta/nba-aba-baa-stats')
    if result['status'] == 'success':
        logger.info("Successfully downloaded NBA team stats dataset")
        progress.complete_task('download_team_stats')
    else:
        # Record the reported error so it appears in the collection summary
        logger.error(f"Failed to download NBA team stats dataset: {result.get('error')}")
        progress.complete_task('download_team_stats', success=False, error=result.get('error'))
except Exception as e:
    logger.error(f"Error downloading NBA team stats dataset: {e}")
    progress.complete_task('download_team_stats', success=False, error=str(e))

2024-12-10 23:38:40 - INFO - Downloading dataset: nba_team_stats


Dataset URL: https://www.kaggle.com/datasets/sumitrodatta/nba-aba-baa-stats


2024-12-10 23:38:41 - INFO - Successfully downloaded nba_team_stats
2024-12-10 23:38:41 - INFO - Successfully downloaded NBA team stats dataset
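The three download cells above follow the same pattern, so they could be folded into one helper. The sketch below is one possible refactoring, not part of the original notebook: `DATASETS` and `download_with_tracking` are hypothetical names. It assumes the collector returns a dict with a `status` key and an optional `error` key, as the cells above do.

```python
import logging

# Dataset keys and Kaggle references copied from the cells above.
DATASETS = {
    "shots": ("nba_shots", "mexwell/nba-shots"),
    "injuries": ("nba_injuries", "loganlauton/nba-injury-stats-1951-2023"),
    "team_stats": ("nba_team_stats", "sumitrodatta/nba-aba-baa-stats"),
}


def download_with_tracking(collector, progress, label, name, ref, log=None):
    """Run one download and record its outcome; returns True on success."""
    log = log or logging.getLogger(__name__)
    task = f"download_{label}"
    progress.add_task(task, total_steps=1)
    progress.start_task(task)
    try:
        result = collector.download_dataset(name, ref)
    except Exception as exc:
        # Unexpected failure (network, credentials, disk, ...).
        log.error("Error downloading %s: %s", name, exc)
        progress.complete_task(task, success=False, error=str(exc))
        return False
    if result.get("status") == "success":
        log.info("Successfully downloaded %s", name)
        progress.complete_task(task)
        return True
    # The collector reported a failure without raising.
    error = result.get("error", "unknown error")
    log.error("Failed to download %s: %s", name, error)
    progress.complete_task(task, success=False, error=error)
    return False
```

In this notebook it would be called once per entry, for example `for label, (name, ref) in DATASETS.items(): download_with_tracking(kaggle, progress, label, name, ref, logger)`. Separate cells remain a reasonable choice when per-dataset visibility matters more than brevity.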


## Collection Summary

This section reviews the data collection results: how many tasks succeeded or failed, the total run time, and the status and duration of each task.


In [10]:
# Get a summary of the data collection process
summary = progress.get_summary()

print("Data Collection Summary:")
print(f"Total Tasks: {summary['total_tasks']}")
print(f"Completed Successfully: {summary['completed_tasks']}")
print(f"Failed: {summary['failed_tasks']}")
print(f"Total Duration: {summary['duration']}")

print("\nTask Details:")
for name, task in summary['tasks'].items():
    status = task['status']
    duration = task['end_time'] - task['start_time'] if task['end_time'] and task['start_time'] else None
    print(f"\n{name}:")
    print(f"  Status: {status}")
    print(f"  Duration: {duration}")
    if task['error']:
        print(f"  Error: {task['error']}")

Data Collection Summary:
Total Tasks: 3
Completed Successfully: 3
Failed: 0
Total Duration: 0:00:07.317472

Task Details:

download_shots:
  Status: completed
  Duration: 0:00:06.069880

download_injuries:
  Status: completed
  Duration: 0:00:00.341802

download_team_stats:
  Status: completed
  Duration: 0:00:00.855184
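The summary above confirms that each task finished, but it does not show what landed on disk. A quick inventory of the download directory can serve as a sanity check. The helper below is a minimal sketch: `list_downloaded_files` is a hypothetical name, and it assumes the datasets unpack to CSV files somewhere under `../data/raw/kaggle`, which this notebook does not confirm.

```python
from pathlib import Path


def list_downloaded_files(base_dir, pattern="*.csv"):
    """Return sorted (relative path, size in bytes) pairs for matching files."""
    base = Path(base_dir)
    return sorted(
        (path.relative_to(base).as_posix(), path.stat().st_size)
        for path in base.rglob(pattern)  # search all subdirectories
        if path.is_file()
    )
```

For example, `list_downloaded_files('../data/raw/kaggle')` would list every CSV found below the directory passed to `KaggleCollector`. An empty result would suggest the files use a different format or location.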
