# DeepShot: Data Collection

## Introduction

Data collection is the foundation of our NBA shot prediction project. In this notebook, we gather comprehensive data on NBA shots, player information, and team statistics to support our analysis and modeling.

The quality and scope of our data directly impact the insights we can derive and the accuracy of our predictive models. For this project, we need data that captures:

1. **Shot Information**: Location, outcome, shooter, game context
2. **Player Information**: Career statistics, position, experience
3. **Team Information**: Performance metrics, playing style, defensive ratings

We use Kaggle as our primary data source because it hosts maintained NBA datasets that cover shots, injuries, and team statistics. Downloading them through the Kaggle API rather than by hand makes the process reproducible and easy to re-run when the datasets are updated.
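
Programmatic downloads only work when the Kaggle API can find credentials. As far as we know, the client looks for a `kaggle.json` token file in `~/.kaggle` (or in the directory named by `KAGGLE_CONFIG_DIR`) or for the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables. The sketch below checks for these before any download is attempted. The helper name `kaggle_credentials_available` is our own and is not part of the `kaggle` package.

```python
import os
from pathlib import Path


def kaggle_credentials_available(env=None, config_dir=None):
    """Return True if Kaggle API credentials appear to be configured.

    Hypothetical helper for illustration; not part of the kaggle package.
    """
    env = os.environ if env is None else env
    # Option 1: both environment variables are set and non-empty.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True
    # Option 2: a kaggle.json token file exists in the config directory.
    if config_dir is None:
        config_dir = env.get("KAGGLE_CONFIG_DIR", Path.home() / ".kaggle")
    return (Path(config_dir) / "kaggle.json").is_file()
```

Running this check before `import kaggle` can give a clearer error message, since the package typically tries to authenticate at import time.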

## Data Source Selection

When selecting data sources for this project, we considered several factors:

1. **Comprehensiveness**: We need data covering multiple seasons to identify long-term patterns
2. **Granularity**: Shot-level data is required for spatial analysis
3. **Reliability**: Data should come from reputable sources with minimal errors
4. **Accessibility**: Data should be programmatically accessible for reproducibility

The Kaggle datasets we've selected meet these criteria and provide complementary information:

- **NBA Shots Dataset** (`mexwell/nba-shots`): Provides detailed shot-level information
- **NBA Injury Stats Dataset** (`loganlauton/nba-injury-stats-1951-2023`): Provides context about player availability
- **NBA Team Statistics Dataset** (`sumitrodatta/nba-aba-baa-stats`): Provides team-level performance metrics

Together, these datasets let us relate individual shots to player availability and team performance across two decades of NBA play.

## Data Organization

Proper organization of our data is essential for an efficient workflow. We've structured our data directory as follows:

- **Raw Data**: Original, unmodified datasets as downloaded from Kaggle
- **Interim Data**: Data that has been cleaned but not yet fully processed
- **Processed Data**: Fully processed, analysis-ready data

This layout keeps the original downloads untouched, so every processing step can be re-run from the raw files, and it preserves a record of intermediate results.

In [None]:
# ##HIDE## 
import os
import pandas as pd
from pathlib import Path
import kaggle

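data_dir = Path('../data')
raw_dir = data_dir / 'raw'              # original downloads, never modified
interim_dir = data_dir / 'interim'      # cleaned but not fully processed
processed_dir = data_dir / 'processed'  # analysis-ready data

# Create the directory layout described above (no error if it already exists)
for directory in [data_dir, raw_dir, interim_dir, processed_dir]:
    directory.mkdir(parents=True, exist_ok=True)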

In [None]:
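def download_dataset(dataset, path):
    """Download a Kaggle dataset into `path`, unzip it, and return its CSV files."""
    os.makedirs(path, exist_ok=True)
    # Requires Kaggle API credentials to be configured
    kaggle.api.dataset_download_files(dataset, path=path, unzip=True)
    # glob order is not guaranteed, so the first file may differ between systems
    return list(Path(path).glob('*.csv'))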

In [None]:
shots_path = raw_dir / 'shots'
shot_files = download_dataset('mexwell/nba-shots', shots_path)
print(f"Downloaded {len(shot_files)} shot data files")

injuries_path = raw_dir / 'injuries'
injury_files = download_dataset('loganlauton/nba-injury-stats-1951-2023', injuries_path)
print(f"Downloaded {len(injury_files)} injury data files")

team_stats_path = raw_dir / 'team_stats'
team_stats_files = download_dataset('sumitrodatta/nba-aba-baa-stats', team_stats_path)
print(f"Downloaded {len(team_stats_files)} team stats files")

Dataset URL: https://www.kaggle.com/datasets/mexwell/nba-shots
Downloaded 21 shot data files
Dataset URL: https://www.kaggle.com/datasets/loganlauton/nba-injury-stats-1951-2023
Downloaded 1 injury data files
Dataset URL: https://www.kaggle.com/datasets/sumitrodatta/nba-aba-baa-stats
Downloaded 22 team stats files
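
Re-running the cell above downloads every dataset again, even when the files are already on disk. A minimal sketch of one way to avoid that is shown here: reuse the local CSV files when they exist and call the downloader only when the directory is empty. The helper `download_if_missing` is hypothetical, and `downloader` stands for any callable with the signature `(dataset, path)`, such as `download_dataset`.

```python
from pathlib import Path


def download_if_missing(dataset, path, downloader):
    """Download `dataset` into `path` only if no CSV files are there yet.

    Hypothetical helper for illustration. `downloader` is any callable
    taking (dataset, path), for example the notebook's download_dataset.
    """
    path = Path(path)
    existing = sorted(path.glob("*.csv"))
    if existing:
        # Reuse the local copy and skip the network call.
        return existing
    downloader(dataset, path)
    return sorted(path.glob("*.csv"))
```

With this helper, the shot download would read `download_if_missing('mexwell/nba-shots', shots_path, download_dataset)`. The trade-off is that local files are never refreshed automatically; deleting the directory forces a fresh download.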


In [None]:
# Preview only the first file of each dataset; the row counts printed are per file
if shot_files:
    shots_sample = pd.read_csv(shot_files[0])
    print(f"Shot data: {shots_sample.shape[0]} rows, {shots_sample.shape[1]} columns")
    display(shots_sample.head(3))

if injury_files:
    injuries_sample = pd.read_csv(injury_files[0])
    print(f"Injury data: {injuries_sample.shape[0]} rows, {injuries_sample.shape[1]} columns")
    display(injuries_sample.head(3))

if team_stats_files:
    team_stats_sample = pd.read_csv(team_stats_files[0])
    print(f"Team stats: {team_stats_sample.shape[0]} rows, {team_stats_sample.shape[1]} columns")
    display(team_stats_sample.head(3))

Shot data: 199030 rows, 26 columns


Unnamed: 0,SEASON_1,SEASON_2,TEAM_ID,TEAM_NAME,PLAYER_ID,PLAYER_NAME,POSITION_GROUP,POSITION,GAME_DATE,GAME_ID,...,BASIC_ZONE,ZONE_NAME,ZONE_ABB,ZONE_RANGE,LOC_X,LOC_Y,SHOT_DISTANCE,QUARTER,MINS_LEFT,SECS_LEFT
0,2009,2008-09,1610612744,Golden State Warriors,201627,Anthony Morrow,G,SG,04-15-2009,20801229,...,Restricted Area,Center,C,Less Than 8 ft.,-0.0,5.25,0,4,0,1
1,2009,2008-09,1610612744,Golden State Warriors,101235,Kelenna Azubuike,F,SF,04-15-2009,20801229,...,Restricted Area,Center,C,Less Than 8 ft.,-0.0,5.25,0,4,0,9
2,2009,2008-09,1610612756,Phoenix Suns,255,Grant Hill,F,SF,04-15-2009,20801229,...,Restricted Area,Center,C,Less Than 8 ft.,-0.0,5.25,0,4,0,25


Injury data: 37667 rows, 6 columns


Unnamed: 0.1,Unnamed: 0,Date,Team,Acquired,Relinquished,Notes
0,0,1951-12-25,Bullets,,Don Barksdale,placed on IL
1,1,1952-12-26,Knicks,,Max Zaslofsky,placed on IL with torn side muscle
2,2,1956-12-29,Knicks,,Jim Baechtold,placed on inactive list


Team stats: 1432 rows, 28 columns


Unnamed: 0,season,lg,team,abbreviation,playoffs,g,mp,fg_per_100_poss,fga_per_100_poss,fg_percent,...,ft_percent,orb_per_100_poss,drb_per_100_poss,trb_per_100_poss,ast_per_100_poss,stl_per_100_poss,blk_per_100_poss,tov_per_100_poss,pf_per_100_poss,pts_per_100_poss
0,2025,NBA,Atlanta Hawks,ATL,False,60,14500,40.9,88.3,0.463,...,0.769,11.4,31.8,43.2,28.1,9.6,4.9,15.4,18.1,112.3
1,2025,NBA,Boston Celtics,BOS,False,60,14525,42.6,92.5,0.461,...,0.796,11.1,34.8,45.9,26.4,7.4,5.7,12.2,16.7,119.7
2,2025,NBA,Brooklyn Nets,BRK,False,59,14235,38.9,88.5,0.439,...,0.795,11.4,31.3,42.6,25.6,8.1,4.5,16.1,21.3,108.7
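
The previews above read only the first file of each dataset, so the row counts describe single files rather than whole datasets. To check dataset-level totals, one option is to count rows across every downloaded file. The sketch below does this in chunks so that large files do not have to fit in memory at once. The helper `count_rows` and its `chunksize` default are our own choices for illustration.

```python
from pathlib import Path

import pandas as pd


def count_rows(files, chunksize=100_000):
    """Return a {file name: row count} mapping for a list of CSV paths.

    Hypothetical helper for illustration. Each file is read in chunks
    of `chunksize` rows, so memory use stays bounded.
    """
    counts = {}
    for f in files:
        total = 0
        for chunk in pd.read_csv(f, chunksize=chunksize):
            total += len(chunk)
        counts[Path(f).name] = total
    return counts
```

For example, `sum(count_rows(shot_files).values())` gives the total number of shot records across all downloaded shot files.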


## Next Steps

With the data collected and organized, we can move on to data cleaning and validation. In the next notebook, we'll:

1. Inspect the data for quality issues
2. Handle missing values and outliers
3. Validate data consistency across datasets
4. Prepare the data for standardization

The data collection phase has given us over 4.2 million shots spanning two decades, 37,667 injury records, and team-level statistics. This is the raw material for the models and the analysis of NBA shooting patterns in the notebooks that follow.