# NBA Game Log Data Analysis

Analysis of NBA player game logs from the 2003-04 season onward (539,777 records as of February 4, 2025), focusing on data quality and statistical validation. Key areas examined: team-name standardization, location/outcome encoding, and verification of the points calculation.

## Data Loading
Load the game logs for every season since 2003-04 from the archived CSV files.

In [1]:
import pandas as pd
from nba_data_forge.common.utils.paths import paths

# set options to see all columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Combine all season files
seasons_data = []
raw_dir = paths.get_path("raw_archive") # data has been loaded into database and moved into archive

for file in raw_dir.glob("game_logs_*.csv"):
    df = pd.read_csv(file)
    df['season'] = int(file.stem.split('_')[2])  # Extract year from filename
    seasons_data.append(df)

df = pd.concat(seasons_data, ignore_index=True)

In [2]:
def data_overview(df: pd.DataFrame):
    # Basic dataset info
    print("\n=== NBA Game Log Data Overview ===")
    print(f"Total Records: {df.shape[0]:,}")
    print(f"Columns: {df.shape[1]}")
    
    # Date range
    if 'date' in df.columns:
        print("\n=== Time Range ===")
        print(f"Start Date: {df['date'].min()}")
        print(f"End Date: {df['date'].max()}")
        
    # Data types
    print("\n=== Column Data Types ===")
    for col, dtype in df.dtypes.items():
        print(f"{col:<20} {dtype}")        
        
    # Unique values for specified columns
    categorical_cols = ['team', 'location', 'opponent', 'outcome']
    print("\n=== Selected Categorical Column Values ===")
    for col in categorical_cols:
        if col in df.columns:
            unique_vals = df[col].nunique()
            print(f"\n{col} - {unique_vals} unique values:")
            value_counts = df[col].value_counts()
            for val, count in value_counts.items():
                print(f"  {val:<30} {count:,}")     
        
    # Data quality check
    print("\n=== Data Quality ===")
    null_counts = df.isnull().sum()
    if null_counts.any():
        print("\nMissing Values:")
        print(null_counts[null_counts > 0])
    else:
        print("No missing values found")        
        
    print(f"\nDuplicate Rows: {df.duplicated().sum():,}")
            
    

In [3]:
data_overview(df)


=== NBA Game Log Data Overview ===
Total Records: 539,777
Columns: 26

=== Time Range ===
Start Date: 2003-10-28
End Date: 2025-02-04

=== Column Data Types ===
date                 object
team                 object
location             object
opponent             object
outcome              object
active               bool
seconds_played       int64
made_field_goals     int64
attempted_field_goals int64
made_three_point_field_goals int64
attempted_three_point_field_goals int64
made_free_throws     int64
attempted_free_throws int64
offensive_rebounds   int64
defensive_rebounds   int64
assists              int64
steals               int64
blocks               int64
turnovers            int64
personal_fouls       int64
points_scored        int64
game_score           float64
plus_minus           int64
player_id            object
name                 object
season               int64

=== Selected Categorical Column Values ===

team - 37 unique values:
  Team.SAN_ANTONIO_SPURS         19

## Review data fetched using `regular_season_player_box_scores`

### Dataset Overview
- Total records: 539,777 game logs spanning from October 28, 2003 to February 4, 2025
- Complete dataset with no missing values or duplicates
- 26 columns capturing various game statistics and player information

### Key Observations

#### Team Distribution
- Data covers all NBA teams, including historical team changes
- Most represented teams:
  - San Antonio Spurs (19,356 records)
  - Dallas Mavericks (18,698 records)
  - Utah Jazz (18,486 records)
- Historical team transitions visible in the data:
  - Seattle SuperSonics → Oklahoma City Thunder
  - New Jersey Nets → Brooklyn Nets
  - Charlotte Bobcats → Charlotte Hornets
  - New Orleans Hornets → New Orleans Pelicans

#### Data Quality Issues
- Minor format inconsistencies in team names:
  - Main format: "Team.TEAM_NAME" (e.g., "Team.ATLANTA_HAWKS")
  - Alternative format: Plain text (e.g., "ATLANTA HAWKS")
- Similar inconsistencies in location and outcome fields:
  - Primary format: "Location.HOME"/"Location.AWAY"
  - Alternative format: "HOME"/"AWAY"
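
The two label formats above can be reconciled with a small normalization step. The sketch below is only an illustration on made-up sample rows: `normalize_label` is a hypothetical helper, not the logic inside `GameLogTransformer`, and it assumes that no label contains a period other than the enum-style prefix.

```python
import pandas as pd


def normalize_label(value: str) -> str:
    """Reduce 'Team.ATLANTA_HAWKS' and 'ATLANTA HAWKS' to 'ATLANTA HAWKS'."""
    # Drop an enum-style prefix such as 'Team.' or 'Location.' if present.
    # Assumption: the only period in a label is the one after the prefix.
    _, _, tail = value.rpartition(".")
    # Use spaces instead of underscores and a single casing.
    return tail.replace("_", " ").strip().upper()


# Illustrative rows only; these are not taken from the archive files.
sample = pd.DataFrame(
    {
        "team": ["Team.ATLANTA_HAWKS", "ATLANTA HAWKS"],
        "location": ["Location.HOME", "AWAY"],
    }
)

for col in ["team", "location"]:
    sample[col] = sample[col].map(normalize_label)

print(sample)
```

After this step both spellings of a team collapse to one value, so `nunique()` on the column would count franchises rather than formats.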

#### Game Distribution
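- Nearly perfect balance between:
  - Home (269,927) vs Away (269,765) games
  - Wins (269,969) vs Losses (269,723) games
- Each pair sums to 539,692, which is 85 fewer than the 539,777 total records; the remaining rows presumably carry the alternative plain-text labels noted under Data Quality Issues.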

This dataset provides a comprehensive view of NBA game logs with high data quality, though some minor standardization would be beneficial for team names and game attributes.
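
Because the season-level data carries both `points_scored` and the individual shooting columns, the points calculation can be cross-checked row by row. The following is a minimal sketch on invented sample rows. It assumes that `made_field_goals` already includes three-pointers, so a made three contributes one point beyond the two counted for a field goal; `expected_points` is a hypothetical helper name.

```python
import pandas as pd


def expected_points(df: pd.DataFrame) -> pd.Series:
    """Points implied by the shooting columns.

    Assumes made_field_goals already includes three-pointers, so each made
    three adds one extra point on top of the two counted for a field goal.
    """
    return (
        2 * df["made_field_goals"]
        + df["made_three_point_field_goals"]
        + df["made_free_throws"]
    )


# Illustrative rows only; the second row is deliberately inconsistent.
sample = pd.DataFrame(
    {
        "made_field_goals": [10, 4],
        "made_three_point_field_goals": [2, 0],
        "made_free_throws": [5, 1],
        "points_scored": [27, 10],
    }
)

# Keep only the rows where the stored total disagrees with the implied total.
mismatch = sample[expected_points(sample) != sample["points_scored"]]
print(f"Rows with inconsistent points: {len(mismatch)}")
```

Applied to the full DataFrame, an empty `mismatch` would indicate that the stored totals agree with the shooting columns under the stated assumption.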

In [4]:
from nba_data_forge.etl.transformers.game_log_transformer import GameLogTransformer

transformer = GameLogTransformer()
df_transformed = transformer.transform(df)

2025-02-06 12:13:58 | GameLogTransformer | INFO     | Cleaning column: team
2025-02-06 12:13:59 | GameLogTransformer | INFO     | Cleaning column: opponent
2025-02-06 12:13:59 | GameLogTransformer | INFO     | Cleaning column: location
2025-02-06 12:13:59 | GameLogTransformer | INFO     | Cleaning column: outcome
2025-02-06 12:13:59 | GameLogTransformer | INFO     | Transformed 539777 game logs with 31 columns


In [5]:
data_overview(df_transformed)


=== NBA Game Log Data Overview ===
Total Records: 539,777
Columns: 31

=== Time Range ===
Start Date: 2003-10-28 00:00:00
End Date: 2025-02-04 00:00:00

=== Column Data Types ===
date                 datetime64[ns]
team                 object
location             object
opponent             object
outcome              object
active               bool
seconds_played       int64
made_field_goals     int64
attempted_field_goals int64
made_three_point_field_goals int64
attempted_three_point_field_goals int64
made_free_throws     int64
attempted_free_throws int64
offensive_rebounds   int64
defensive_rebounds   int64
assists              int64
steals               int64
blocks               int64
turnovers            int64
personal_fouls       int64
points_scored        int64
game_score           float64
plus_minus           int64
player_id            object
name                 object
season               int64
team_abbrev          object
opponent_abbrev      object
is_home              bo

## Check data obtained by `player_box_scores`

In [6]:
from datetime import datetime
import pandas as pd
from nba_data_forge.common.utils.paths import paths
from nba_data_forge.etl.extractors.daily_game_log_extractor import DailyGameLogExtractor

# Commented out because the data has already been fetched
# extractor = DailyGameLogExtractor()
# sample_daily = extractor.extract_daily(datetime.strptime("2025-02-05", "%Y-%m-%d").date())
# sample_daily.to_csv(paths.get_path("test")/"test_daily.csv", index=False)

In [7]:
df_daily = pd.read_csv(paths.get_path("test")/"test_daily.csv")

In [8]:
data_overview(df_daily)


=== NBA Game Log Data Overview ===
Total Records: 396
Columns: 23

=== Time Range ===
Start Date: 2025-02-04
End Date: 2025-02-05

=== Column Data Types ===
slug                 object
name                 object
team                 object
location             object
opponent             object
outcome              object
seconds_played       int64
made_field_goals     int64
attempted_field_goals int64
made_three_point_field_goals int64
attempted_three_point_field_goals int64
made_free_throws     int64
attempted_free_throws int64
offensive_rebounds   int64
defensive_rebounds   int64
assists              int64
steals               int64
blocks               int64
turnovers            int64
personal_fouls       int64
plus_minus           float64
game_score           float64
date                 object

=== Selected Categorical Column Values ===

team - 30 unique values:
  Team.TORONTO_RAPTORS           22
  Team.BROOKLYN_NETS             21
  Team.PHILADELPHIA_76ERS        19
  Team.CH

## Review data fetched using `player_box_scores`

### Dataset Overview
- **Total Records**: 396 player game logs
- **Time Period**: 2 days (Feb 4-5, 2025)
- **Data Quality**: Complete with no missing values or duplicates
- **Structure**: 23 columns tracking various game statistics

### Game Distribution

#### Team Participation
- All 30 NBA teams represented
- Most active teams:
  - Toronto Raptors (22 game logs)
  - Brooklyn Nets (21 game logs)

#### Game Balance
- **Location**: Nearly balanced
  - Home: 200 player appearances
  - Away: 196 player appearances
- **Outcomes**: Well balanced
  - Wins: 201 player appearances
  - Losses: 195 player appearances

### Data Structure
- Consistent format for categorical values (e.g., "Team.TEAM_NAME")
- `plus_minus` is stored as float64 here, whereas it is int64 in the season-level data

The dataset provides a complete and balanced snapshot of NBA games over this two-day period, with each team's participation properly represented.
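
The `plus_minus` dtype difference noted under Data Structure can be handled by checking that every value is whole before casting. This is a sketch on illustrative values, not the behavior of `DailyGameLogTransformer`; the choice of pandas' nullable `Int64` dtype is an assumption made here so that any missing values would survive the cast.

```python
import pandas as pd

# Illustrative values only.
sample = pd.DataFrame({"plus_minus": [-38.0, 42.0, 0.0]})

# True when every non-missing value has no fractional part.
values = sample["plus_minus"].dropna()
is_integral = bool((values % 1 == 0).all())

if is_integral:
    # Nullable Int64 keeps any missing values as <NA> instead of failing.
    sample["plus_minus"] = sample["plus_minus"].astype("Int64")

print(sample["plus_minus"].dtype)
```

If the check fails, leaving the column as float and investigating the fractional values is safer than truncating them.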

In [9]:
print("\n=== Plus_Minus Range ===")
print(f"plus_minus min: {df_daily['plus_minus'].min()}")
print(f"plus_minus max: {df_daily['plus_minus'].max()}")


=== Plus_Minus Range ===
plus_minus min: -38.0
plus_minus max: 42.0


In [10]:
from nba_data_forge.etl.transformers.daily_game_log_transformer import DailyGameLogTransformer

transformer = DailyGameLogTransformer()
df_daily_transformed = transformer.transform(df_daily)

2025-02-06 12:14:00 | DailyGameLogTransformer | INFO     | Cleaning column: team
2025-02-06 12:14:00 | DailyGameLogTransformer | INFO     | Cleaning column: opponent
2025-02-06 12:14:00 | DailyGameLogTransformer | INFO     | Cleaning column: location
2025-02-06 12:14:00 | DailyGameLogTransformer | INFO     | Cleaning column: outcome
2025-02-06 12:14:00 | DailyGameLogTransformer | INFO     | Transformed 396 game logs with 28 columns
2025-02-06 12:14:00 | DailyGameLogTransformer | INFO     | Transformed 396 daily game logs with 31 columns


In [11]:
data_overview(df_daily_transformed)


=== NBA Game Log Data Overview ===
Total Records: 396
Columns: 31

=== Time Range ===
Start Date: 2025-02-04 00:00:00
End Date: 2025-02-05 00:00:00

=== Column Data Types ===
player_id            object
name                 object
team                 object
location             object
opponent             object
outcome              object
seconds_played       int64
made_field_goals     int64
attempted_field_goals int64
made_three_point_field_goals int64
attempted_three_point_field_goals int64
made_free_throws     int64
attempted_free_throws int64
offensive_rebounds   int64
defensive_rebounds   int64
assists              int64
steals               int64
blocks               int64
turnovers            int64
personal_fouls       int64
plus_minus           float64
game_score           float64
date                 datetime64[ns]
team_abbrev          object
opponent_abbrev      object
is_home              bool
is_win               bool
minutes_played       float64
season               int