# GameLogs data exploration
In this notebook we'll explore the data obtained from the `LeagueGameLog` endpoint (docs at https://github.com/swar/nba_api/blob/master/docs/nba_api/stats/endpoints/leaguegamelog.md).

In [1]:
import pandas as pd

In [2]:
from nba_api.stats.endpoints import LeagueGameLog

LeagueGameLog?

Init signature:
LeagueGameLog(
    counter=0,
    direction='ASC',
    league_id='00',
    player_or_team_abbreviation='T',
    season='2024-25',
    season_type_all_star='Regular Season',
    sorter='DATE',
    date_from_nullable='',
    date_to_nullable='',
    proxy=None,
    headers=None,
    timeout=30,
    get_request=True,
)
Docstring:      <no docstring>
File:           ~/anaconda3/envs/MBAI/lib/python3.13/site-packages/nba_api/stats/endpoints/leaguegamelog.py
Type:           type
Subclasses:     

We can get game logs for both teams and players, either for an entire season or for a specific date range. As a case study for exploring the endpoint, we'll analyse the past regular season (2024-25).

We first make the API calls and then extract the dataframes. This way, we can modify the dfs as many times as we want without making any more API requests.

In [3]:
gamelogs_T = LeagueGameLog(season='2024', player_or_team_abbreviation='T')
gamelogs_P = LeagueGameLog(season='2024', player_or_team_abbreviation='P')

In [4]:
df_T = gamelogs_T.get_data_frames()[0]
df_P = gamelogs_P.get_data_frames()[0]

In [5]:
print("Teams columns:\n", df_T.columns.tolist())
print()
print("Players columns:\n", df_P.columns.tolist())

Teams columns:
 ['SEASON_ID', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID', 'GAME_DATE', 'MATCHUP', 'WL', 'MIN', 'FGM', 'FGA', 'FG_PCT', 'FG3M', 'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'PLUS_MINUS', 'VIDEO_AVAILABLE']

Players columns:
 ['SEASON_ID', 'PLAYER_ID', 'PLAYER_NAME', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID', 'GAME_DATE', 'MATCHUP', 'WL', 'MIN', 'FGM', 'FGA', 'FG_PCT', 'FG3M', 'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', 'PLUS_MINUS', 'FANTASY_PTS', 'VIDEO_AVAILABLE']


We need to make sure that the two dfs we scraped are consistent with each other, in particular for the `SEASON_ID` and `GAME_ID`s. 

In [6]:
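# Compare as sets: `==` between the arrays returned by .unique() only works when each holds a single value.
assert set(df_T['SEASON_ID']) == set(df_P['SEASON_ID'])
# The symmetric difference is empty only if both dfs cover exactly the same games.
assert set(df_T['GAME_ID']) ^ set(df_P['GAME_ID']) == set()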

In [9]:
season_id = df_T['SEASON_ID'].unique()[0]
game_ids = df_T['GAME_ID'].unique()

We'll check that for each `GAME_ID` we have exactly 2 teams and at least 5 players for each team. 

In [10]:
for game_id in game_ids:
    teams_df = df_T[df_T['GAME_ID'] == game_id]
    assert len(teams_df) == 2
    
    players_df = df_P[df_P['GAME_ID'] == game_id]
    for team_id in teams_df['TEAM_ID']:
        assert len(players_df[players_df['TEAM_ID'] == team_id]) >= 5
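
The loop above filters both dfs once per game. As a rough sketch, the same two conditions can also be checked with `groupby`, which scans each df only once. The frames `toy_T` and `toy_P` below are small made-up stand-ins for `df_T` and `df_P`, not real game logs.

```python
import pandas as pd

# Hypothetical toy data: two games, with team 1 playing in both.
toy_T = pd.DataFrame({
    "GAME_ID": ["g1", "g1", "g2", "g2"],
    "TEAM_ID": [1, 2, 1, 3],
})
toy_P = pd.DataFrame({
    "GAME_ID": ["g1"] * 10 + ["g2"] * 11,
    "TEAM_ID": [1] * 5 + [2] * 5 + [1] * 6 + [3] * 5,
    "PLAYER_ID": range(21),
})

# Number of team rows per game: every game should have exactly 2.
teams_per_game = toy_T.groupby("GAME_ID").size()

# Number of player rows per (game, team) pair: at least 5 each.
players_per_team = toy_P.groupby(["GAME_ID", "TEAM_ID"]).size()

# Every (game, team) pair in the teams df must also appear in the players df.
team_pairs = set(zip(toy_T["GAME_ID"], toy_T["TEAM_ID"]))
player_pairs = set(players_per_team.index)

assert (teams_per_game == 2).all()
assert (players_per_team >= 5).all()
assert team_pairs == player_pairs
```

The pair comparison is needed because `groupby` silently skips combinations with no rows, so a team without any player rows would otherwise go unnoticed.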

Now we'll need to select which columns are important for our analysis. First, let's look at which columns appear in only one of the two dfs. 

In [11]:
print("Different columns: ", set(df_T.columns) ^ set(df_P.columns))

Different columns:  {'PLAYER_ID', 'PLAYER_NAME', 'FANTASY_PTS'}


Since we already know the `SEASON_ID`, we can drop it from both dfs.  
We can also drop `GAME_DATE`, since that information can be retrieved later from the `GAME_ID`.  
Since `TEAM_ID` and `PLAYER_ID` are unique identifiers, we no longer need the names or abbreviations.  
We can also drop `TEAM_ID` from the players df, detaching each `PLAYER_ID` from its team.  
We know that we have two teams for each game, therefore the `MATCHUP` column isn't useful.  
Finally, we can drop the `VIDEO_AVAILABLE` and `FANTASY_PTS` columns.

In [12]:
cols2drop = ['SEASON_ID', 'GAME_DATE', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'MATCHUP', 'VIDEO_AVAILABLE']
df_T.drop(columns=cols2drop, inplace=True)
df_P.drop(columns=cols2drop + ['TEAM_ID', 'PLAYER_NAME', 'FANTASY_PTS'], inplace=True)

We are ready to look at the data! Let's start with the teams df. 

In [13]:
df_T.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2460 entries, 0 to 2459
Data columns (total 23 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   TEAM_ID     2460 non-null   int64  
 1   GAME_ID     2460 non-null   object 
 2   WL          2460 non-null   object 
 3   MIN         2460 non-null   int64  
 4   FGM         2460 non-null   int64  
 5   FGA         2460 non-null   int64  
 6   FG_PCT      2460 non-null   float64
 7   FG3M        2460 non-null   int64  
 8   FG3A        2460 non-null   int64  
 9   FG3_PCT     2460 non-null   float64
 10  FTM         2460 non-null   int64  
 11  FTA         2460 non-null   int64  
 12  FT_PCT      2460 non-null   float64
 13  OREB        2460 non-null   int64  
 14  DREB        2460 non-null   int64  
 15  REB         2460 non-null   int64  
 16  AST         2460 non-null   int64  
 17  STL         2460 non-null   int64  
 18  BLK         2460 non-null   int64  
 19  TOV         2460 non-null  

In [14]:
assert df_T['TEAM_ID'].nunique() == 30, "The NBA league is composed of 30 teams, but we're missing some!"

In [15]:
game_counts = df_T.groupby('TEAM_ID').size()
assert game_counts.nunique() == 1, "Teams have different game counts!"
assert game_counts.unique() == 82, "An NBA regular season is composed of 82 games, but we're missing some!"

The `PLUS_MINUS` column can be removed from the teams df, as it is just the point differential between the two teams: within each game the two values sum to zero, and each has the magnitude of the score difference.

In [16]:
for _, teams_df in df_T.groupby('GAME_ID'):
    pts = teams_df['PTS']
    plus_minus = teams_df['PLUS_MINUS']
    
    assert plus_minus.sum() == 0.0
    assert abs(pts.diff().iloc[-1]) == 0.5 * plus_minus.abs().sum()
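
To make the redundancy explicit, here is a minimal sketch that rebuilds the differential from `PTS` alone. It assumes exactly two team rows per game (which we asserted earlier) and uses a made-up frame `toy_T` rather than the real data.

```python
import pandas as pd

# Hypothetical toy data: two games, two teams each.
toy_T = pd.DataFrame({
    "GAME_ID": ["g1", "g1", "g2", "g2"],
    "TEAM_ID": [1, 2, 3, 1],
    "PTS": [110, 102, 95, 121],
    "PLUS_MINUS": [8.0, -8.0, -26.0, 26.0],
})

# Total points scored in each game, broadcast back onto both team rows.
game_total = toy_T.groupby("GAME_ID")["PTS"].transform("sum")

# With exactly two teams per game, opponent points = game total - own points.
opp_pts = game_total - toy_T["PTS"]

# Point differential recomputed from PTS only.
recomputed = toy_T["PTS"] - opp_pts

assert (recomputed == toy_T["PLUS_MINUS"]).all()
```

If the assertion holds on the real teams df as well, dropping `PLUS_MINUS` loses no information.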

In [17]:
df_T.drop('PLUS_MINUS', axis=1, inplace=True)

Now, let's dive into the players df. 

In [18]:
df_P.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26306 entries, 0 to 26305
Data columns (total 23 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   PLAYER_ID   26306 non-null  int64  
 1   GAME_ID     26306 non-null  object 
 2   WL          26306 non-null  object 
 3   MIN         26306 non-null  int64  
 4   FGM         26306 non-null  int64  
 5   FGA         26306 non-null  int64  
 6   FG_PCT      24835 non-null  float64
 7   FG3M        26306 non-null  int64  
 8   FG3A        26306 non-null  int64  
 9   FG3_PCT     20944 non-null  float64
 10  FTM         26306 non-null  int64  
 11  FTA         26306 non-null  int64  
 12  FT_PCT      14343 non-null  float64
 13  OREB        26306 non-null  int64  
 14  DREB        26306 non-null  int64  
 15  REB         26306 non-null  int64  
 16  AST         26306 non-null  int64  
 17  STL         26306 non-null  int64  
 18  BLK         26306 non-null  int64  
 19  TOV         26306 non-nul

From what we can see, the only data missing is in `FG_PCT`, `FG3_PCT` and `FT_PCT`. Let's check that the data is missing because there aren't any attempts.

In [19]:
assert (
    df_P[(df_P['FGA'] > 0) & (df_P['FG_PCT'].isnull())].shape[0] == 0 and
    df_P[(df_P['FGA'] == 0) & (df_P['FG_PCT'].isnull())].shape[0] == df_P['FG_PCT'].isnull().sum()
), "Missing FG_PCT should only occur when FGA = 0"

assert (
    df_P[(df_P['FG3A'] > 0) & (df_P['FG3_PCT'].isnull())].shape[0] == 0 and
    df_P[(df_P['FG3A'] == 0) & (df_P['FG3_PCT'].isnull())].shape[0] == df_P['FG3_PCT'].isnull().sum()
), "Missing FG3_PCT should only occur when FG3A = 0"

assert (
    df_P[(df_P['FTA'] > 0) & (df_P['FT_PCT'].isnull())].shape[0] == 0 and
    df_P[(df_P['FTA'] == 0) & (df_P['FT_PCT'].isnull())].shape[0] == df_P['FT_PCT'].isnull().sum()
), "Missing FT_PCT should only occur when FTA = 0"

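Since each percentage can be recomputed from the corresponding makes and attempts (`FGM`/`FGA`, `FG3M`/`FG3A`, `FTM`/`FTA`), the simplest solution to this problem is to exclude the `*_PCT` columns from our dfs. 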

In [20]:
df_T.drop(columns=['FG_PCT', 'FG3_PCT', 'FT_PCT'], inplace=True)
df_P.drop(columns=['FG_PCT', 'FG3_PCT', 'FT_PCT'], inplace=True)
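
Dropping these columns is safe because they are derived quantities. As an illustration, the sketch below recomputes a field-goal percentage from makes and attempts on a made-up frame `toy_P`; any rounding the API may apply to its own `*_PCT` values is not reproduced here.

```python
import pandas as pd

# Hypothetical toy data: the last player attempted no shots.
toy_P = pd.DataFrame({
    "FGM": [4, 0, 0],
    "FGA": [8, 3, 0],
})

# Mask zero attempts as NaN so the ratio is missing instead of a division by zero,
# mirroring the missing values observed in the players df.
attempts = toy_P["FGA"].where(toy_P["FGA"] > 0)
fg_pct = toy_P["FGM"] / attempts

print(fg_pct)
```

The same pattern would apply to `FG3M`/`FG3A` and `FTM`/`FTA`.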

In [21]:
assert df_T['WL'].nunique() == df_P['WL'].nunique() == 2, "A basketball game can only finish with a win or a loss!"

In [22]:
# 'W' -> True, 'L' -> False; a direct comparison yields a bool column without the replace/astype round trip.
df_T['WL'] = df_T['WL'] == 'W'
df_P['WL'] = df_P['WL'] == 'W'


Since we're interested in scraping a lot of data, we shouldn't forget about memory! Both team and player stats are stored as `int64` (8 bytes per value)... let's see if we can use something smarter. 

In [23]:
for col in df_T.select_dtypes(include=['int64']).columns:
    df_T[col] = pd.to_numeric(df_T[col], downcast='unsigned')

In [24]:
df_T.dtypes

TEAM_ID    uint32
GAME_ID    object
WL           bool
MIN        uint16
FGM         uint8
FGA         uint8
FG3M        uint8
FG3A        uint8
FTM         uint8
FTA         uint8
OREB        uint8
DREB        uint8
REB         uint8
AST         uint8
STL         uint8
BLK         uint8
TOV         uint8
PF          uint8
PTS         uint8
dtype: object
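
The dtypes above follow from how `pd.to_numeric` downcasts: it picks the smallest integer type that can hold every value in the column. A small sketch on made-up Series (the values are illustrative, not taken from the dfs):

```python
import pandas as pd

small = pd.Series([0, 57, 131])                  # fits in 8 bits
minutes = pd.Series([240, 265, 290])             # exceeds 255, needs 16 bits
team_ids = pd.Series([1610612737, 1610612738])   # needs 32 bits
plus_minus = pd.Series([-12, 0, 7])              # contains negative values

small_u = pd.to_numeric(small, downcast="unsigned")
minutes_u = pd.to_numeric(minutes, downcast="unsigned")
team_ids_u = pd.to_numeric(team_ids, downcast="unsigned")

# Negative values cannot be stored unsigned, so this column is left as int64...
plus_minus_u = pd.to_numeric(plus_minus, downcast="unsigned")
# ...and needs a signed downcast instead.
plus_minus_s = pd.to_numeric(plus_minus, downcast="signed")

for name, s in [("small", small_u), ("minutes", minutes_u), ("team_ids", team_ids_u),
                ("plus_minus (unsigned)", plus_minus_u), ("plus_minus (signed)", plus_minus_s)]:
    print(f"{name}: {s.dtype}")
```

This is why a column with negative values, such as the players' `PLUS_MINUS`, needs a separate `downcast='signed'` pass.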

In [25]:
original_memory = gamelogs_T.get_data_frames()[0].memory_usage(deep=True).sum()
optimized_memory = df_T.memory_usage(deep=True).sum()

print(f"Original teams df memory: {original_memory / 1024**2:.2f} MB")
print(f"Optimized teams df memory: {optimized_memory / 1024**2:.2f} MB")
print(f"Memory saved: {(original_memory - optimized_memory) / 1024**2:.2f} MB ({(1 - optimized_memory/original_memory)*100:.1f}%)")

Original teams df memory: 1.35 MB
Optimized teams df memory: 0.19 MB
Memory saved: 1.16 MB (85.9%)


In [26]:
for col in df_P.select_dtypes(include=['int64']).columns:
    df_P[col] = pd.to_numeric(df_P[col], downcast='unsigned')

df_P['PLUS_MINUS'] = pd.to_numeric(df_P['PLUS_MINUS'], downcast='signed')

In [27]:
df_P.dtypes

PLAYER_ID     uint32
GAME_ID       object
WL              bool
MIN            uint8
FGM            uint8
FGA            uint8
FG3M           uint8
FG3A           uint8
FTM            uint8
FTA            uint8
OREB           uint8
DREB           uint8
REB            uint8
AST            uint8
STL            uint8
BLK            uint8
TOV            uint8
PF             uint8
PTS            uint8
PLUS_MINUS      int8
dtype: object

In [28]:
original_memory = gamelogs_P.get_data_frames()[0].memory_usage(deep=True).sum()
optimized_memory = df_P.memory_usage(deep=True).sum()

print(f"Original players df memory: {original_memory / 1024**2:.2f} MB")
print(f"Optimized players df memory: {optimized_memory / 1024**2:.2f} MB")
print(f"Memory saved: {(original_memory - optimized_memory) / 1024**2:.2f} MB ({(1 - optimized_memory/original_memory)*100:.1f}%)")

Original players df memory: 16.38 MB
Optimized players df memory: 2.03 MB
Memory saved: 14.35 MB (87.6%)


Now we can finally save the dfs as CSV files, split by game: each game gets its own directory with one file for the teams and one for the players. 

In [30]:
from pathlib import Path
from tqdm import tqdm

# Expand "~" once so that every path below is absolute.
season_path = Path(f"~/MBAI/data/rs{season_id}").expanduser()
for game_id in tqdm(game_ids):

    game_path = season_path / "games" / f"g{game_id}"
    try:
        game_path.mkdir(parents=True, exist_ok=True)
    except Exception as e:
        print(f"Error in creating the game directory: {e}")
        continue  # without the directory, neither CSV can be written

    # GAME_ID is already encoded in the directory name, so we drop it from the files.
    teams_df = df_T[df_T['GAME_ID'] == game_id].drop('GAME_ID', axis=1)
    try:
        teams_df.to_csv(game_path / "teams.csv", index=False)
    except Exception as e:
        print(f"Error in saving the teams csv: {e}")

    players_df = df_P[df_P['GAME_ID'] == game_id].drop('GAME_ID', axis=1)
    try:
        players_df.to_csv(game_path / "players.csv", index=False)
    except Exception as e:
        print(f"Error in saving the players csv: {e}")

100%|███████████████████████████████████████| 1230/1230 [00:33<00:00, 36.23it/s]
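
One caveat worth keeping in mind: CSV is a plain-text format and does not store dtypes, so the compact integer types chosen earlier are lost when a file is read back unless they are passed explicitly. A minimal sketch of the round trip, using a temporary directory and a made-up frame `toy` (the column subset and values are illustrative):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame with the compact dtypes used in this notebook.
toy = pd.DataFrame({
    "TEAM_ID": pd.Series([1610612737, 1610612738], dtype="uint32"),
    "WL": [True, False],
    "PTS": pd.Series([110, 102], dtype="uint8"),
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "teams.csv"
    toy.to_csv(csv_path, index=False)

    # Without hints, read_csv infers int64 for the integer columns.
    naive = pd.read_csv(csv_path)

    # Passing the dtypes explicitly restores the compact representation.
    restored = pd.read_csv(csv_path, dtype={"TEAM_ID": "uint32", "PTS": "uint8"})

print(naive.dtypes)
print(restored.dtypes)
```

Keeping a dtype mapping like this next to the loading code is one way to preserve the memory savings when the files are reloaded.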
