# NBA Playoff Predictor - Data Collection

This notebook gathers the raw data for our NBA Playoff Predictor. It downloads three Kaggle datasets (historical team statistics, shot locations, and injury records) that together form the basis for predicting playoff outcomes.

---

## Notebook Overview

The notebook has four sections:

1. **Setup and Imports**: Prepare the environment and import necessary libraries.
2. **Directory Structure**: Create folders to store the downloaded data.
3. **Kaggle Data Collection**: Download historical team statistics, shot data, and injury records from Kaggle.
4. **Collection Summary**: Summarize the results of the data collection process, including dataset sizes.

---


## 1. Setup and Imports

In this section, we set up the environment by importing the required libraries and initializing logging. Logging helps us track the progress of the data collection process and identify any issues that occur.

In [1]:
import os
import sys
from datetime import datetime

# Add the project root (the parent directory) to the Python path so the `src` package can be imported
sys.path.append('..')

# Import utility functions and classes for logging and progress tracking
from src.data.utils import setup_logging, create_data_directories, DataCollectionProgress
from src.data.collectors.kaggle_collector import KaggleCollector

# Set up logging to track the progress of the data collection process
logger = setup_logging()

# Initialize a progress tracker to monitor the status of each data collection task
progress = DataCollectionProgress()
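
The `setup_logging` helper lives in `src/data/utils.py` and is not shown in this notebook. As a rough illustration only, a helper that produces the timestamped format seen in this notebook's log output (`2024-12-09 14:20:17 - INFO - ...`) might look like the sketch below. The function name `setup_logging_sketch`, the logger name, and the choice of handler are assumptions, not the project's actual implementation.

```python
import logging
import sys


def setup_logging_sketch(level: int = logging.INFO) -> logging.Logger:
    """Hypothetical stand-in for src.data.utils.setup_logging."""
    logger = logging.getLogger("nba_data_collection")  # assumed logger name
    logger.setLevel(level)

    # Avoid stacking duplicate handlers when a notebook cell is re-run
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter(
                fmt="%(asctime)s - %(levelname)s - %(message)s",
                datefmt="%Y-%m-%d %H:%M:%S",
            )
        )
        logger.addHandler(handler)
    return logger


sketch_logger = setup_logging_sketch()
sketch_logger.info("Logging initialized")
```

The handler check matters in notebooks: without it, every re-run of the cell would attach another handler and each message would be printed multiple times.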

## 2. Directory Structure

Before downloading any data, we need to ensure that the necessary folder structure exists. This section creates directories to store the raw data files. If the directories already exist, this step will simply confirm their presence.

In [2]:
# Create all required directories for storing raw data
created_dirs = create_data_directories('../data')
logger.info(f"Created {len(created_dirs)} directories")

2024-12-09 14:20:17 - INFO - Created 8 directories
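
For readers without access to `src/data/utils.py`, the idempotent behavior described above can be sketched with `pathlib`. This is a minimal sketch under stated assumptions: the function name `create_data_directories_sketch` and the subdirectory names are hypothetical, and the real helper manages eight directories whose names are not listed in this notebook.

```python
import tempfile
from pathlib import Path
from typing import List, Sequence

# Assumed layout for illustration; the real subdirectory names are defined in src/data/utils.py
HYPOTHETICAL_SUBDIRS = ("raw/kaggle", "processed")


def create_data_directories_sketch(
    base_dir: str, subdirs: Sequence[str] = HYPOTHETICAL_SUBDIRS
) -> List[Path]:
    """Ensure each subdirectory exists under base_dir and return the paths."""
    ensured = []
    for sub in subdirs:
        path = Path(base_dir) / sub
        # parents=True creates intermediate folders; exist_ok=True makes re-runs a no-op
        path.mkdir(parents=True, exist_ok=True)
        ensured.append(path)
    return ensured


# Demonstrate in a temporary directory so nothing is written to the project tree
with tempfile.TemporaryDirectory() as tmp:
    first = create_data_directories_sketch(tmp)
    second = create_data_directories_sketch(tmp)  # calling again is safe
    print(all(p.is_dir() for p in second), first == second)
```

Because `exist_ok=True` suppresses the error for folders that already exist, the second call returns the same paths without modifying anything.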


## 3. Kaggle Data Collection

In this section, we use the Kaggle API to download the three datasets described below: historical team statistics, shot locations, and injury records. The KaggleCollector class handles the interaction with Kaggle, including downloading each dataset, validating it, and saving it to the appropriate directory.

### Primary Dataset: NBA/ABA/BAA Team Statistics

- **Source**: NBA Teams Stats (1950-Present)
- **Description**: Historical team statistics from 1950 to present
- **Key Features**:
    - Regular season team statistics (points, rebounds, assists, etc.)
    - Advanced metrics (shooting percentages, efficiency ratings)
    - Team identifiers and season information
- **Size**: Approximately 1,800+ team-season records

### Supporting Dataset: NBA Shot Locations

- **Source**: NBA Shots Dataset
- **Description**: Detailed shot location and type data
- **Features**: Shot coordinates, shot types, makes/misses
- **Purpose**: Will be used to enhance team shooting profile analysis
- **Size**: Approximately 1,000,000+ shot records

### Additional Context: NBA Injury Statistics

- **Source**: NBA Injury Stats (1951-2023)
- **Description**: Historical injury data
- **Purpose**: Provides context for team performance variations
- **Size**: Approximately 50,000+ injury records
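
The record counts quoted for these datasets are estimates. Once the files are on disk, they could be checked with a small helper along the lines of the sketch below. It assumes the datasets unpack to UTF-8 CSV files with a single header row; the function names are hypothetical and not part of the project code.

```python
import csv
from pathlib import Path
from typing import Dict


def count_csv_rows(csv_path: Path) -> int:
    """Count data rows in a CSV file, excluding the header line."""
    with open(csv_path, newline="", encoding="utf-8", errors="replace") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row if the file is not empty
        return sum(1 for _ in reader)


def summarize_downloads(raw_dir: str) -> Dict[str, int]:
    """Map each CSV found under raw_dir (recursively) to its row count."""
    return {
        str(path.relative_to(raw_dir)): count_csv_rows(path)
        for path in sorted(Path(raw_dir).rglob("*.csv"))
    }
```

After the download step has run, calling `summarize_downloads('../data/raw/kaggle')` would report one count per CSV file, which can then be compared against the approximate sizes listed here.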

### Steps:

1. Initialize the KaggleCollector with the path to the raw data folder.
2. Register the Kaggle download task with the progress tracker and start it.
3. Download all datasets and collect the per-dataset results.
4. If any dataset fails to download, log an error and mark the task as failed; otherwise mark it as completed.

In [3]:
# Initialize the KaggleCollector to manage dataset downloads
kaggle = KaggleCollector('../data/raw/kaggle')

# Add a task to the progress tracker for downloading Kaggle datasets
progress.add_task('kaggle_download', total_steps=len(kaggle.datasets))
progress.start_task('kaggle_download')

try:
    # Download all datasets from Kaggle
    results = kaggle.download_all()
    
    # Check the results of the download process
    failed = [name for name, result in results.items() if result['status'] != 'success']
    if failed:
        logger.error(f"Failed to download datasets: {failed}")
        progress.complete_task('kaggle_download', success=False)
    else:
        logger.info("Successfully downloaded all Kaggle datasets")
        progress.complete_task('kaggle_download')
        
except Exception as e:
    # Log any errors that occur during the download process
    logger.error(f"Error downloading Kaggle datasets: {e}")
    progress.complete_task('kaggle_download', success=False, error=str(e))

2024-12-09 14:20:17 - INFO - Downloading sumitrodatta/nba-aba-baa-stats...


Dataset URL: https://www.kaggle.com/datasets/sumitrodatta/nba-aba-baa-stats


2024-12-09 14:20:19 - INFO - Successfully downloaded and validated sumitrodatta/nba-aba-baa-stats
2024-12-09 14:20:19 - INFO - Downloading mexwell/nba-shots...


Dataset URL: https://www.kaggle.com/datasets/mexwell/nba-shots


2024-12-09 14:20:28 - INFO - Successfully downloaded and validated mexwell/nba-shots
2024-12-09 14:20:28 - INFO - Downloading loganlauton/nba-injury-stats-1951-2023...


Dataset URL: https://www.kaggle.com/datasets/loganlauton/nba-injury-stats-1951-2023


2024-12-09 14:20:28 - INFO - Successfully downloaded and validated loganlauton/nba-injury-stats-1951-2023
2024-12-09 14:20:28 - INFO - Successfully downloaded all Kaggle datasets
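
The cell above relies on `download_all()` returning a dictionary that maps each dataset name to a result containing a `'status'` key. The internals of `KaggleCollector` are not shown here, so the following is only a sketch of how such a per-dataset result map could be built. The `download_fn` parameter, the `'failed'` status value, the `'error'` key, and the `example/missing-dataset` slug are all assumptions made for illustration; a real `download_fn` would wrap a call to the Kaggle API client.

```python
from typing import Callable, Dict, Iterable


def download_all_sketch(
    datasets: Iterable[str],
    download_fn: Callable[[str], None],
) -> Dict[str, Dict[str, str]]:
    """Attempt every download and record a per-dataset status."""
    results = {}
    for name in datasets:
        try:
            download_fn(name)
            results[name] = {"status": "success"}
        except Exception as exc:
            # One failed dataset should not stop the remaining downloads
            results[name] = {"status": "failed", "error": str(exc)}
    return results


def fake_download(name: str) -> None:
    """Stand-in for a real Kaggle call; fails for one dataset to show the error path."""
    if name == "example/missing-dataset":  # made-up slug for illustration
        raise FileNotFoundError(f"{name} not found")


demo_results = download_all_sketch(
    ["sumitrodatta/nba-aba-baa-stats", "example/missing-dataset"],
    fake_download,
)
# Same filter as the notebook cell: anything that is not 'success' counts as failed
demo_failed = [name for name, result in demo_results.items() if result["status"] != "success"]
print(demo_failed)  # prints ['example/missing-dataset']
```

Catching exceptions per dataset, rather than around the whole loop, is what lets the caller report exactly which downloads failed.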


## 4. Collection Summary

After data collection finishes, this section summarizes the results: the total number of tasks, how many completed successfully, how many failed, and how long each took. The dataset sizes printed below are the approximate figures quoted in Section 3, not counts measured from the downloaded files. Use this summary to confirm that collection succeeded and to spot any issues that need attention.

In [4]:
# Get a summary of the data collection process
summary = progress.get_summary()

print("Data Collection Summary:")
print(f"Total Tasks: {summary['total_tasks']}")
print(f"Completed Successfully: {summary['completed_tasks']}")
print(f"Failed: {summary['failed_tasks']}")
print(f"Total Duration: {summary['duration']}")

# Approximate sizes quoted from the dataset descriptions (hard-coded, not measured from the files)
print("\nDataset Sizes:")
print("NBA/ABA/BAA Team Statistics: Approximately 1,800+ team-season records")
print("NBA Shot Locations: Approximately 1,000,000+ shot records")
print("NBA Injury Statistics: Approximately 50,000+ injury records")

print("\nTask Details:")
for name, task in summary['tasks'].items():
    status = task['status']
    # Duration is only defined once the task has both started and finished
    if task['start_time'] and task['end_time']:
        duration = task['end_time'] - task['start_time']
    else:
        duration = None
    print(f"\n{name}:")
    print(f"  Status: {status}")
    print(f"  Duration: {duration}")
    if task['error']:
        print(f"  Error: {task['error']}")

Data Collection Summary:
Total Tasks: 1
Completed Successfully: 1
Failed: 0
Total Duration: 0:00:11.009065

Dataset Sizes:
NBA/ABA/BAA Team Statistics: Approximately 1,800+ team-season records
NBA Shot Locations: Approximately 1,000,000+ shot records
NBA Injury Statistics: Approximately 50,000+ injury records

Task Details:

kaggle_download:
  Status: completed
  Duration: 0:00:10.943648
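
The summary above depends on the shape of the dictionary returned by `progress.get_summary()`. The `DataCollectionProgress` class itself is defined in `src/data/utils.py` and not shown in this notebook, so the following is a minimal sketch of a tracker with the same method names and summary keys used in the cells above. The class name, the `'pending'` and `'in_progress'` status values, and the choice to measure total duration from tracker creation are assumptions.

```python
from datetime import datetime
from typing import Any, Dict, Optional


class ProgressTrackerSketch:
    """Hypothetical stand-in for DataCollectionProgress."""

    def __init__(self) -> None:
        self.created_at = datetime.now()
        self.tasks: Dict[str, Dict[str, Any]] = {}

    def add_task(self, name: str, total_steps: int = 1) -> None:
        self.tasks[name] = {
            "status": "pending",
            "total_steps": total_steps,
            "start_time": None,
            "end_time": None,
            "error": None,
        }

    def start_task(self, name: str) -> None:
        self.tasks[name].update(status="in_progress", start_time=datetime.now())

    def complete_task(
        self, name: str, success: bool = True, error: Optional[str] = None
    ) -> None:
        self.tasks[name].update(
            status="completed" if success else "failed",
            end_time=datetime.now(),
            error=error,
        )

    def get_summary(self) -> Dict[str, Any]:
        statuses = [task["status"] for task in self.tasks.values()]
        return {
            "total_tasks": len(self.tasks),
            "completed_tasks": statuses.count("completed"),
            "failed_tasks": statuses.count("failed"),
            # Assumption: total duration runs from tracker creation to now
            "duration": datetime.now() - self.created_at,
            "tasks": self.tasks,
        }
```

Under this assumption the total duration is always at least as long as any single task's duration, which matches the output above, where the total (about 11.0 seconds) slightly exceeds the `kaggle_download` task (about 10.9 seconds).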
