# NBA Playoff Predictor - Data Cleaning

This notebook cleans and preprocesses raw NBA data from three Kaggle datasets. We'll prepare the data for feature engineering by standardizing formats, handling missing values, and checking data quality across the sources.

## Data Sources and Cleaning Goals

1. NBA/ABA/BAA Stats (sumitrodatta)
   - Player Season Info: Contains individual player statistics per season
     - Cleaning focuses on standardizing team names, filtering for NBA-only data, and handling missing values
   - Team Stats Per Game: Contains team-level performance metrics
     - Cleaning involves normalizing team names and ensuring consistent statistical calculations

2. NBA Injury Stats (loganlauton)
   - Contains historical injury data from 1951 to 2023
   - Cleaning involves:
     - Standardizing injury descriptions
     - Converting dates to consistent format
     - Matching team names with other datasets
     - Removing duplicate entries

3. NBA Shots Data (mexwell)
   - Contains detailed shot location and outcome data
   - Cleaning involves:
     - Standardizing coordinate systems
     - Validating shot types and distances
     - Ensuring consistent player and team naming
     - Removing invalid or incomplete shot records

In [12]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import sys
from pathlib import Path

sys.path.append('..')

from src.data.cleaners.sumitrodatta_cleaner import SumitrodattaCleaner
from src.data.cleaners.loganlauton_cleaner import LoganlautonCleaner
from src.data.cleaners.mexwell_cleaner import MexwellCleaner
from src.data.utils import setup_logging

logger = setup_logging()

sns.set_theme()

## Clean NBA/ABA/BAA Stats

Process data from sumitrodatta's dataset. This section focuses on cleaning two key datasets:
1. Player season data - individual player statistics
2. Team stats data - aggregated team performance metrics

The cleaning process ensures consistent formatting and removes any anomalies that could affect our analysis.
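As a rough illustration of the NBA-only filtering mentioned in the cleaning goals, the sketch below keeps rows whose league label is `NBA`. The column name `lg` and the helper `keep_nba_rows` are assumptions for this example; `SumitrodattaCleaner` may use a different column or approach.

```python
import pandas as pd


def keep_nba_rows(df, league_col="lg"):
    """Keep only rows labelled NBA (league_col is an assumed column name)."""
    # Normalize case and surrounding whitespace before comparing.
    is_nba = df[league_col].astype(str).str.strip().str.upper() == "NBA"
    return df.loc[is_nba].reset_index(drop=True)
```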

In [13]:
# Load and clean player season data
sumitrodatta = SumitrodattaCleaner()

player_season_df = pd.read_csv('../data/raw/kaggle/sumitrodatta/nba-aba-baa-stats/Player Season Info.csv')
cleaned_player_season = sumitrodatta.clean_player_season_data(player_season_df)
logger.info(f"Cleaned {len(cleaned_player_season)} player season records")

# Load and clean team stats data
team_stats_df = pd.read_csv('../data/raw/kaggle/sumitrodatta/nba-aba-baa-stats/Team Stats Per Game.csv')
cleaned_team_stats = sumitrodatta.clean_team_stats_data(team_stats_df)
logger.info(f"Cleaned {len(cleaned_team_stats)} team statistics records")

2024-12-09 17:35:17 - INFO - Cleaned 32358 player season records
2024-12-09 17:35:17 - INFO - Cleaned 1876 team statistics records



Cleaning player season data...
Cleaned 32358 player season records

Cleaning team stats data...
Cleaned 1876 team statistics records


## Clean NBA Injury Stats

Process data from loganlauton's dataset. This section handles injury data cleaning, which is crucial for:
- Understanding player availability
- Analyzing team performance impact from injuries
- Tracking injury patterns and their effect on playoff chances

In [14]:
# Load and clean injury data
loganlauton = LoganlautonCleaner()
injury_df = pd.read_csv('../data/raw/kaggle/loganlauton/nba-injury-stats-1951-2023/NBA Player Injury Stats(1951 - 2023).csv')
cleaned_injuries = loganlauton.clean_injury_data(injury_df)
logger.info(f"Cleaned {len(cleaned_injuries)} injury records")

2024-12-09 17:35:17 - INFO - Cleaned 37667 injury records



Cleaning injury data...
Cleaned 37667 injury records


## Clean NBA Shots Data

Process data from mexwell's dataset. This section handles shot data cleaning, which provides insights into:
- Team shooting patterns and efficiency
- Player shooting preferences and success rates
- Spatial analysis of scoring

The cleaning process ensures accurate shot coordinates and consistent categorization of shot types.
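One way to express the "remove invalid or incomplete shot records" goal is sketched below. The column names come from the shots files themselves (`LOC_X`, `LOC_Y`, `SHOT_DISTANCE`, `SHOT_TYPE`), but the validity rules are assumptions: the set of accepted shot types, in particular `2PT Field Goal`, and the non-negative distance check are illustrative, and `drop_invalid_shots` is a hypothetical helper rather than part of `MexwellCleaner`.

```python
import pandas as pd

REQUIRED_COLUMNS = ["LOC_X", "LOC_Y", "SHOT_DISTANCE", "SHOT_TYPE"]
# Assumed label set; only "3PT Field Goal" appears in the sample output.
VALID_SHOT_TYPES = {"2PT Field Goal", "3PT Field Goal"}


def drop_invalid_shots(shots):
    """Drop shots with missing key fields, unknown types, or negative distance."""
    complete = shots.dropna(subset=REQUIRED_COLUMNS)
    is_valid = (
        complete["SHOT_TYPE"].isin(VALID_SHOT_TYPES)
        & (complete["SHOT_DISTANCE"] >= 0)
    )
    return complete.loc[is_valid].reset_index(drop=True)
```

Comparing `len(shots)` before and after such a filter gives the "records removed" figure reported in the cleaning summary.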

In [15]:
# Load and clean shots data
mexwell = MexwellCleaner()
shots_dir = Path('../data/raw/kaggle/mexwell/nba-shots')

logger.info(f"Looking for shot files in: {shots_dir.absolute()}")

shots_data = []

try:
    # Get shot files
    if not shots_dir.exists():
        raise FileNotFoundError(f"Directory not found: {shots_dir}")
    
    shots_files = sorted(shots_dir.glob('NBA_20[0-9][0-9]_Shots.csv'))
    if not shots_files:
        raise FileNotFoundError(f"No shot files found in {shots_dir}")
    
    logger.info(f"Found {len(shots_files)} shot files:")
    for file in shots_files:
        logger.info(f"- {file.name}")

    # Load each season file; the combined row count is logged after concat
    for file in shots_files:
        logger.info(f"Loading {file.name}...")
        df = pd.read_csv(file)
        logger.info(f"Loaded {len(df):,} records")
        shots_data.append(df)

    shots_df = pd.concat(shots_data, ignore_index=True)
    logger.info(f"Successfully combined {len(shots_df):,} total records")

    logger.info("Initial data overview:")
    logger.info(f"Shape: {shots_df.shape}")
    logger.info("Column types:")
    logger.info(shots_df.dtypes)
    
    # Clean the data
    cleaned_shots = mexwell.clean_shots_data(shots_df)
    logger.info(f"\nCleaned {len(cleaned_shots):,} shot records")
    
    # Show cleaning results
    logger.info("Cleaning summary:")
    logger.info(f"Original records: {len(shots_df):,}")
    logger.info(f"Cleaned records: {len(cleaned_shots):,}")
    logger.info(f"Records removed: {len(shots_df) - len(cleaned_shots):,}")

except Exception as e:
    # Log the failure, then re-raise so the notebook stops at this cell
    logger.error(f"Error processing shots data: {e}")
    raise

2024-12-09 17:35:17 - INFO - Looking for shot files in: /Users/luke/src/github.com/lukelittle/csca5622-final-project/notebooks/../data/raw/kaggle/mexwell/nba-shots
2024-12-09 17:35:18 - INFO - Found 21 shot files:
2024-12-09 17:35:18 - INFO - - NBA_2004_Shots.csv
2024-12-09 17:35:18 - INFO - - NBA_2005_Shots.csv
2024-12-09 17:35:18 - INFO - - NBA_2006_Shots.csv
2024-12-09 17:35:18 - INFO - - NBA_2007_Shots.csv
2024-12-09 17:35:18 - INFO - - NBA_2008_Shots.csv
2024-12-09 17:35:18 - INFO - - NBA_2009_Shots.csv
2024-12-09 17:35:18 - INFO - - NBA_2010_Shots.csv
2024-12-09 17:35:18 - INFO - - NBA_2011_Shots.csv
2024-12-09 17:35:18 - INFO - - NBA_2012_Shots.csv
2024-12-09 17:35:18 - INFO - - NBA_2013_Shots.csv
2024-12-09 17:35:18 - INFO - - NBA_2014_Shots.csv
2024-12-09 17:35:18 - INFO - - NBA_2015_Shots.csv
2024-12-09 17:35:18 - INFO - - NBA_2016_Shots.csv
2024-12-09 17:35:18 - INFO - - NBA_2017_Shots.csv
2024-12-09 17:35:18 - INFO - - NBA_2018_Shots.csv
2024-12-09 17:35:18 - INFO - - NBA_2


Cleaning shots data...

Initial column types:
SEASON_1: int64 | Sample: 2004
SEASON_2: object | Sample: 2003-04
TEAM_ID: int64 | Sample: 1610612747
TEAM_NAME: object | Sample: Los Angeles Lakers
PLAYER_ID: int64 | Sample: 977
PLAYER_NAME: object | Sample: Kobe Bryant
POSITION_GROUP: object | Sample: G
POSITION: object | Sample: SG
GAME_DATE: object | Sample: 04-14-2004
GAME_ID: int64 | Sample: 20301187
HOME_TEAM: object | Sample: POR
AWAY_TEAM: object | Sample: LAL
EVENT_TYPE: object | Sample: Made Shot
SHOT_MADE: bool | Sample: True
ACTION_TYPE: object | Sample: Jump Shot
SHOT_TYPE: object | Sample: 3PT Field Goal
BASIC_ZONE: object | Sample: Above the Break 3
ZONE_NAME: object | Sample: Left Side Center
ZONE_ABB: object | Sample: LC
ZONE_RANGE: object | Sample: 24+ ft.
LOC_X: float64 | Sample: 20.0
LOC_Y: float64 | Sample: 21.35
SHOT_DISTANCE: int64 | Sample: 25
QUARTER: int64 | Sample: 6
MINS_LEFT: int64 | Sample: 0
SECS_LEFT: int64 | Sample: 0

Processing column: PLAYER_NAME
Curre

2024-12-09 17:35:51 - INFO - 
Cleaned 4,231,262 shot records
2024-12-09 17:35:51 - INFO - Cleaning summary:
2024-12-09 17:35:51 - INFO - Original records: 4,231,262
2024-12-09 17:35:51 - INFO - Cleaned records: 4,231,262
2024-12-09 17:35:51 - INFO - Records removed: 0


Cleaned 4231262 shot records


## Save Cleaned Data

Save each cleaned dataset as a CSV file under `../data/processed/historical/`.
The feature engineering phase reads these files to create predictive features for our playoff prediction model.
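`DataFrame.to_csv` does not create missing parent directories, so saving fails on a fresh checkout where the processed directory does not exist yet. A small wrapper like the following could guard against that; `save_csv` is a hypothetical helper, not something defined in this project.

```python
from pathlib import Path

import pandas as pd


def save_csv(df, path):
    """Write df to path as CSV, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path
```

For example, `save_csv(cleaned_shots, '../data/processed/historical/shots.csv')` would behave like the direct `to_csv` call while also creating `historical/` when it is absent.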

In [16]:
# Save cleaned data
cleaned_player_season.to_csv('../data/processed/historical/player_season.csv', index=False)
cleaned_team_stats.to_csv('../data/processed/historical/team_stats.csv', index=False)
cleaned_injuries.to_csv('../data/processed/historical/injuries.csv', index=False)
cleaned_shots.to_csv('../data/processed/historical/shots.csv', index=False)
print("Saved all cleaned data")

Saved all cleaned data
