# NBA Game Log Data Analysis

Analysis of 523,825 NBA player game logs from 2003-2024, focusing on data quality and statistical validation. Key areas examined: team name standardization, location/outcome encoding, and verification of the points calculation.

## Data Loading
Loading 20 seasons of game logs from CSV files, with date parsing and sorting.

In [1]:
import pandas as pd
from pathlib import Path

# Combine all season files into a single DataFrame
seasons_data = []
raw_dir = Path("../data/raw")

for file in raw_dir.glob("game_logs_*.csv"):
    # Use a separate name so the loop does not shadow the combined `df` below
    season_df = pd.read_csv(file)
    season_df['season'] = int(file.stem.split('_')[2])  # Extract year from filename
    seasons_data.append(season_df)

df = pd.concat(seasons_data, ignore_index=True)

# Parse dates and sort chronologically
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values('date')

In [2]:
# set options to see all columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## Display basic dataset info

In [3]:
print("Dataset Shape:", df.shape)
print("\nColumn Names:")
print(df.columns.tolist())
print("\nData Types:")
print(df.dtypes)

Dataset Shape: (523825, 26)

Column Names:
['date', 'team', 'location', 'opponent', 'outcome', 'active', 'seconds_played', 'made_field_goals', 'attempted_field_goals', 'made_three_point_field_goals', 'attempted_three_point_field_goals', 'made_free_throws', 'attempted_free_throws', 'offensive_rebounds', 'defensive_rebounds', 'assists', 'steals', 'blocks', 'turnovers', 'personal_fouls', 'points_scored', 'game_score', 'plus_minus', 'player_id', 'name', 'season']

Data Types:
date                                 datetime64[ns]
team                                         object
location                                     object
opponent                                     object
outcome                                      object
active                                         bool
seconds_played                                int64
made_field_goals                              int64
attempted_field_goals                         int64
made_three_point_field_goals                  int64
att

### Observation:
The `team`, `location`, `opponent`, and `outcome` columns have `object` dtype, so their string values need to be checked for consistency.

## Null and duplicate check

In [4]:
# Check null values
null_counts = df.isnull().sum()
print("Null Values:\n", null_counts[null_counts > 0])

# Check duplicates
duplicate_count = df.duplicated().sum()
print("\nDuplicate Rows:", duplicate_count)

Null Values:
 Series([], dtype: int64)

Duplicate Rows: 0


## Statistical Analysis
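Examining the distribution of game statistics, including minutes played, scoring, rebounds, and efficiency metrics. Each row is one player's log for one game, so the averages below are per player per game. Key findings:
- Points Per Game: Average 9.94 (Range: 0-81)
- Playing Time: Average 23.2 minutes (mean `seconds_played` of about 1,395 seconds)
- Field Goal Stats: Average 3.69 made / 8.07 attempted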
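
The playing-time figure is stored in seconds, so it has to be converted before it can be read as minutes. A minimal sketch of that arithmetic, using the `seconds_played` mean printed by `df.describe()` in this notebook (the variable names here are illustrative only):

```python
# Mean of seconds_played, copied from the df.describe() output in this notebook
mean_seconds_played = 1394.785056

# Convert seconds to minutes
mean_minutes_played = mean_seconds_played / 60
print(round(mean_minutes_played, 2))  # 23.25
```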

In [5]:
# Get summary statistics for numeric columns
print(df.describe())

                                date  seconds_played  made_field_goals  \
count                         523825   523825.000000     523825.000000   
mean   2014-02-22 08:51:37.034315008     1394.785056          3.692073   
min              2003-10-28 00:00:00        0.000000          0.000000   
25%              2008-12-19 00:00:00      900.000000          1.000000   
50%              2014-02-26 00:00:00     1434.000000          3.000000   
75%              2019-02-23 00:00:00     1931.000000          5.000000   
max              2024-04-14 00:00:00     3620.000000         28.000000   
std                              NaN      670.955014          3.086132   

       attempted_field_goals  made_three_point_field_goals  \
count          523825.000000                 523825.000000   
mean                8.066530                      0.823838   
min                 0.000000                      0.000000   
25%                 4.000000                      0.000000   
50%                 7.0

In [6]:
# Check unique values in categorical columns
categorical_cols = ['team', 'location', 'opponent', 'outcome']
for col in categorical_cols:
    print(f"\nUnique values in {col}:")
    print(f"Total unique values: {df[col].nunique()}")
    print(df[col].value_counts())
    print("-"*50)
    


Unique values in team:
Total unique values: 37
team
Team.SAN_ANTONIO_SPURS                    18814
Team.DALLAS_MAVERICKS                     18128
Team.UTAH_JAZZ                            17989
Team.BOSTON_CELTICS                       17815
Team.INDIANA_PACERS                       17628
Team.WASHINGTON_WIZARDS                   17620
Team.LOS_ANGELES_CLIPPERS                 17613
Team.MILWAUKEE_BUCKS                      17573
Team.MEMPHIS_GRIZZLIES                    17542
Team.ATLANTA_HAWKS                        17519
Team.GOLDEN_STATE_WARRIORS                17462
Team.MINNESOTA_TIMBERWOLVES               17459
Team.DENVER_NUGGETS                       17448
Team.ORLANDO_MAGIC                        17440
Team.TORONTO_RAPTORS                      17405
Team.LOS_ANGELES_LAKERS                   17393
Team.SACRAMENTO_KINGS                     17383
Team.PHILADELPHIA_76ERS                   17377
Team.DETROIT_PISTONS                      17354
Team.CLEVELAND_CAVALIERS           

In [7]:
# Check if points match field goals and free throws
df['calculated_points'] = (df['made_field_goals'] - df['made_three_point_field_goals']) * 2 + \
                         df['made_three_point_field_goals'] * 3 + \
                         df['made_free_throws']
points_mismatch = (df['calculated_points'] != df['points_scored']).sum()
print("\nPoints calculation mismatches:", points_mismatch)


Points calculation mismatches: 2


In [8]:
points_mismatch_rows = (df['calculated_points'] != df['points_scored'])
mismatch_rows = df[points_mismatch_rows].copy()
scoring_cols = [
   'date', 'name', 'team', 'opponent',
   'made_field_goals', 'made_three_point_field_goals', 
   'made_free_throws', 'calculated_points', 'points_scored'
]
print(mismatch_rows[scoring_cols])

            date             name                     team  \
29866 2021-01-05    Anthony Davis  Team.LOS_ANGELES_LAKERS   
43715 2021-01-05  Dennis Schröder  Team.LOS_ANGELES_LAKERS   

                     opponent  made_field_goals  made_three_point_field_goals  \
29866  Team.MEMPHIS_GRIZZLIES                10                             4   
43715  Team.MEMPHIS_GRIZZLIES                 5                             1   

       made_free_throws  calculated_points  points_scored  
29866                 3                 27             26  
43715                 0                 11             12  


## Data Quality Issues

### Team and Opponent Names
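- 37 unique `team` values (expected: 30)
- 63 unique `opponent` values (expected: 30)
- Historical franchise transitions (e.g., SuperSonics → Thunder) add extra names
- Two naming formats coexist across the `team` and `opponent` columns: enum-style (`Team.ATLANTA_HAWKS`) and plain (`ATLANTA HAWKS`)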
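
One possible way to collapse these variants is sketched below. This is not part of the original notebook: the helper `normalize_team` and the `FRANCHISE_MAP` dictionary are hypothetical names, and the mapping covers only two relocations for illustration. Whether historical names should be merged into the current franchise at all is a modelling choice that depends on the analysis.

```python
# Hypothetical mapping from historical names to current franchise names.
# Illustrative only: it covers two relocations and is not exhaustive.
FRANCHISE_MAP = {
    "SEATTLE SUPERSONICS": "OKLAHOMA CITY THUNDER",
    "NEW JERSEY NETS": "BROOKLYN NETS",
}


def normalize_team(name: str) -> str:
    """Return a plain upper-case team name, mapped to its current franchise."""
    # Drop the enum-style prefix if present: 'Team.ATLANTA_HAWKS' -> 'ATLANTA_HAWKS'
    if name.startswith("Team."):
        name = name[len("Team."):]
    # Unify separators and case: 'ATLANTA_HAWKS' -> 'ATLANTA HAWKS'
    cleaned = name.replace("_", " ").strip().upper()
    # Collapse historical names onto one franchise (an assumption, see above)
    return FRANCHISE_MAP.get(cleaned, cleaned)


# Names taken from the notebook's own output
raw_names = [
    "Team.ATLANTA_HAWKS", "ATLANTA HAWKS",
    "SEATTLE SUPERSONICS", "NEW JERSEY NETS", "Team.BROOKLYN_NETS",
]
normalized = sorted({normalize_team(n) for n in raw_names})
print(normalized)  # ['ATLANTA HAWKS', 'BROOKLYN NETS', 'OKLAHOMA CITY THUNDER']
```

On the DataFrame, the same function could be applied with `df['team'].map(normalize_team)` and `df['opponent'].map(normalize_team)`, after which `nunique()` can be re-checked against the expected 30.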

### Locations and Game Outcomes
Standardization needed for:
- Location values: `Location.HOME`/`Location.AWAY` appear alongside plain `HOME`/`AWAY`
- Outcome values: `Outcome.WIN`/`Outcome.LOSS` appear alongside plain `WIN`/`LOSS`
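
Both columns mix an enum-style prefix with the bare value, so stripping everything up to the last dot should be enough to unify them. A minimal sketch, assuming no legitimate value contains a dot; the helper name `strip_enum_prefix` is hypothetical and not from the original notebook:

```python
def strip_enum_prefix(value: str) -> str:
    """Drop an enum-style prefix such as 'Location.' or 'Outcome.' if present."""
    # rpartition splits on the last '.', so 'Location.HOME' -> 'HOME',
    # while a bare 'HOME' (no dot) is returned unchanged.
    return value.rpartition(".")[2]


# Unique values as printed by the notebook
locations = ['Location.HOME', 'Location.AWAY', 'AWAY', 'HOME']
outcomes = ['Outcome.WIN', 'Outcome.LOSS', 'LOSS', 'WIN']

print(sorted({strip_enum_prefix(v) for v in locations}))  # ['AWAY', 'HOME']
print(sorted({strip_enum_prefix(v) for v in outcomes}))   # ['LOSS', 'WIN']
```

Applied to the DataFrame, this would be `df['location'].map(strip_enum_prefix)` and `df['outcome'].map(strip_enum_prefix)`, which should leave two unique values in each column.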

### Points Validation
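Found 2 scoring discrepancies:
- Both occur in the Lakers vs Grizzlies game (2021-01-05)
- Point attribution error between Davis and Schröder: Davis's calculated total is one point above his recorded points (27 vs 26), and Schröder's is one point below (11 vs 12), so the two differences cancel out
- I will omit these two records, since they do not have a major effect on the analysis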
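
The offsetting nature of the two discrepancies can be checked directly from the mismatch rows printed in cell `In [8]`. The sketch below re-applies the formula from cell `In [7]` to those two rows; the short dictionary keys (`fg`, `fg3`, `ft`) are shorthand introduced here for illustration.

```python
# Values copied from the mismatch output of cell In [8]
rows = [
    {"name": "Anthony Davis", "fg": 10, "fg3": 4, "ft": 3, "points_scored": 26},
    {"name": "Dennis Schröder", "fg": 5, "fg3": 1, "ft": 0, "points_scored": 12},
]


def calculated_points(fg: int, fg3: int, ft: int) -> int:
    # Same formula as cell In [7]:
    # two-pointers * 2 + three-pointers * 3 + free throws
    return (fg - fg3) * 2 + fg3 * 3 + ft


# Difference between calculated and recorded points for each player
diffs = {
    r["name"]: calculated_points(r["fg"], r["fg3"], r["ft"]) - r["points_scored"]
    for r in rows
}
print(diffs)                # {'Anthony Davis': 1, 'Dennis Schröder': -1}
print(sum(diffs.values()))  # 0
```

Because the differences sum to zero, the team's total for that game is unaffected. Only the split between the two players is in doubt.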

#### Get all unique values

In [9]:
# List the unique values of each categorical column
categorical_cols = ['location', 'outcome']
unique_values = {col: df[col].unique().tolist() for col in categorical_cols}
print("\nUnique values as lists:")
for col, values in unique_values.items():
    print(f"{col}: {values}")
    print("-"*100)


Unique values as lists:
location: ['Location.HOME', 'Location.AWAY', 'AWAY', 'HOME']
----------------------------------------------------------------------------------------------------
outcome: ['Outcome.WIN', 'Outcome.LOSS', 'LOSS', 'WIN']
----------------------------------------------------------------------------------------------------


In [10]:
categorical_cols = ['team', 'opponent']
unique_values = {col: df[col].unique().tolist() for col in categorical_cols}
all_team_names = unique_values["team"] + unique_values["opponent"]
all_team_names = sorted(set(all_team_names))
print("Unique team names occur in the data:")
print(all_team_names)

Unique team names occur in the data:
['ATLANTA HAWKS', 'BOSTON CELTICS', 'CHICAGO BULLS', 'CLEVELAND CAVALIERS', 'DALLAS MAVERICKS', 'DENVER NUGGETS', 'DETROIT PISTONS', 'GOLDEN STATE WARRIORS', 'HOUSTON ROCKETS', 'INDIANA PACERS', 'LOS ANGELES CLIPPERS', 'LOS ANGELES LAKERS', 'MEMPHIS GRIZZLIES', 'MIAMI HEAT', 'MILWAUKEE BUCKS', 'MINNESOTA TIMBERWOLVES', 'NEW JERSEY NETS', 'NEW ORLEANS HORNETS', 'NEW YORK KNICKS', 'ORLANDO MAGIC', 'PHILADELPHIA 76ERS', 'PHOENIX SUNS', 'PORTLAND TRAIL BLAZERS', 'SACRAMENTO KINGS', 'SAN ANTONIO SPURS', 'SEATTLE SUPERSONICS', 'TORONTO RAPTORS', 'Team.ATLANTA_HAWKS', 'Team.BOSTON_CELTICS', 'Team.BROOKLYN_NETS', 'Team.CHARLOTTE_BOBCATS', 'Team.CHARLOTTE_HORNETS', 'Team.CHICAGO_BULLS', 'Team.CLEVELAND_CAVALIERS', 'Team.DALLAS_MAVERICKS', 'Team.DENVER_NUGGETS', 'Team.DETROIT_PISTONS', 'Team.GOLDEN_STATE_WARRIORS', 'Team.HOUSTON_ROCKETS', 'Team.INDIANA_PACERS', 'Team.LOS_ANGELES_CLIPPERS', 'Team.LOS_ANGELES_LAKERS', 'Team.MEMPHIS_GRIZZLIES', 'Team.MIAMI_HEAT'