This is everything I have done for the project so far. I have only just started on the data cleaning, and that is as far as I got. I won't say too much about it here, as the class and the functions inside the class are fairly well documented. The five or so cells below the class are where I started cleaning the data.
If you run this to check that it works, note that the API can be finicky and will sometimes time out. In the functions that pull from the endpoints, I have a sleep time of 1 second set. If you get an error, adding a few seconds to the sleep time will sometimes help, although the run will take much longer. If you are pulling every box score stat for every game, it will take a good amount of time to run even with the sleep time set at 1.

In [1]:
pip install nba_api

Note: you may need to restart the kernel to use updated packages.


In [32]:
from nba_api.stats.library.http import NBAStatsHTTP
NBAStatsHTTP.timeout = 60  


In [1]:
# import needed libraries
from nba_api.stats.endpoints import leaguegamelog, boxscoretraditionalv2, boxscoreadvancedv2, playercareerstats, teamgamelogs, playergamelogs
from nba_api.stats.static import players, teams
import numpy as np
import pandas as pd
import time

def retry(func):
    '''
    This is a decorator that takes a function and allows it to be retried up to three times if it raises an exception. If the number of retries is exhausted, it raises
    an exception indicating that all retry attempts have failed.
    
    Parameters:
    - func (callable): The function to be retried. It is attempted up to three times.
    
    Returns:
    - A wrapper function that includes the retry logic. It returns the result of the func parameter if it succeeds within the retry attempts.
    
    Raises:
    - Exception: An exception is raised indicating that all retry attempts have failed.
    '''
    def retry_wrapper(*args, **kwargs):
        retries = 3 # can change the number of retries if needed
        attempts = 0 # counter
        while attempts < retries:
            try:
                return func(*args, **kwargs)
            except Exception as e:
                # Log the failure, wait, and try again until the retries run out
                print(f"Attempt {attempts + 1} failed with error: {e}")
                time.sleep(10)
                attempts += 1
        raise Exception("All retry attempts failed.")
    return retry_wrapper

class NBA_Data:
    def __init__(self):
        '''
        This is a class for pulling NBA statistics from NBA.com's API. 

        This class provides methods to retrieve various types of NBA data such as player career stats,
        player box scores, team box scores, and comprehensive box scores for all players or teams within a season.
        The data pulled by these methods can be found on the NBA.com statistics page.

        Methods are decorated with a `retry` mechanism that attempts a fixed number of retries upon encountering
        exceptions, to improve reliability in the face of temporary network or API issues.

        Methods:
        - get_player_career_stats(player_name): Gets career statistics for a given player.
        - get_player_boxscores(player_name, season): Gets box scores for a specific player and season.
        - get_team_boxscores(team_name, season): Gets box scores for a specific team and season.
        - get_all_players_boxscores(season, advanced_boxscore=False): Gets box scores for all players in a given season,
          with an option for getting advanced box scores.
        - get_all_teams_boxscore(season, advanced_boxscore=False): Gets box scores for all teams in a specified season,
          with an option for getting advanced statistics.

        Usage:
        - nba_data = NBA_Data()
        - player_career_stats = nba_data.get_player_career_stats('Stephen Curry')
        - team_boxscores = nba_data.get_team_boxscores('Golden State Warriors', 2023)

        Dependencies:
        - For this class to run you will need to install the following packages:
            - pandas
            - nba_api
        - The time module from the standard library is also used.
        - For the NBA API you will need the following imports:
            - from nba_api.stats.endpoints import leaguegamelog, boxscoretraditionalv2, boxscoreadvancedv2, playercareerstats, teamgamelogs, playergamelogs
            - from nba_api.stats.static import players, teams

        There are no parameters that are required for initializing this class
        '''
    pass

    @retry
    def get_player_career_stats(self, player_name):
        '''
        This function gets a player's career statistics.

        Parameters:
        - player_name (str): The full name of the NBA player. Example: 'Stephen Curry'

        Returns:
        - DataFrame: A pandas DataFrame that contains the player's career statistics.
          If the lookup fails, an error message string is returned instead.
        '''
        try:
            # Get the player's ID
            nba_players = players.get_players()
            player = [player for player in nba_players if player['full_name'] == player_name][0]
            playerID = player['id']

            # Get the stats
            career_stats = playercareerstats.PlayerCareerStats(player_id=playerID).get_data_frames()[0]
            return career_stats

        except IndexError:
            return f"No data found for player: {player_name}"
        except Exception as e:
            return f"An error occurred: {e}"
    
    @retry
    def get_player_boxscores(self, player_name, season):
        '''
        This function gets a player's box scores for a season.

        Parameters:
        - player_name (str): The full name of the NBA player. Example: 'Stephen Curry'
        - season (int): The season to get the box scores for. Example: 2023.

        Returns:
        - DataFrame: A pandas DataFrame that contains the player's box scores for the season.
          If the lookup fails, an error message string is returned instead.
        '''
        season = str(season) + "-" + str(season+1)[-2:] # Convert year to season format, e.g. 2020 -> 2020-21
        try:
            # Get the player's ID
            nba_players = players.get_players()
            player = [player for player in nba_players if player['full_name'] == player_name][0]
            playerID = player['id']

            # Get the stats
            player_boxscore = playergamelogs.PlayerGameLogs(player_id_nullable=playerID, season_nullable=season).get_data_frames()[0]
            return player_boxscore

        except IndexError:
            return f"No data found for player: {player_name}"
        except Exception as e:
            return f"An error occurred: {e}"
                              
        

    @retry
    def get_team_boxscores(self, team_name, season): 
        '''
        This function gets a team's box scores for a season.

        Parameters:
        - team_name (str): The full name of the NBA team. Example: 'Golden State Warriors'
        - season (int): The season to get the box scores for. Example: 2023.

        Returns:
        - DataFrame: A pandas DataFrame that contains the team's box scores for the season.
        '''
        try:
            season = str(season) + "-" + str(season+1)[-2:] # Convert year to season format, e.g. 2020 -> 2020-21

            # Get team ID
            nba_teams = teams.get_teams()
            team = [team for team in nba_teams if team["full_name"] == team_name][0]
            teamID = team['id']

            # Get the stats
            teamGameStats = teamgamelogs.TeamGameLogs(team_id_nullable=teamID, season_nullable=season).get_data_frames()[0]
            return teamGameStats
        
        except IndexError:
            return f"No data found for team: {team_name}"
        except Exception as e:
            return f"An error occurred: {e}"
        
    @retry
    def get_all_players_boxscores(self, season, advanced_boxscore=False):
        '''
        Retrieves box score statistics for all games played by every player in a given NBA season.
        
        Parameters:
        - season (int): The season to get the box scores for. Example: 2023.
        - advanced_boxscore (bool, optional): If this is set to True, the advanced box score statistics will be pulled instead of the traditional box score statistics.

        Returns:
        - pandas.DataFrame: A DataFrame with box score statistics for each game for every player in a specified season. 
        '''
        try:
            season_format = str(season) + "-" + str(season+1)[-2:]
            season_games = leaguegamelog.LeagueGameLog(season=season_format).get_data_frames()[0]
            game_id_list = season_games['GAME_ID'].unique()

            all_games_box_score = []

            for game_id in game_id_list:
                if advanced_boxscore:
                    adv_box_score = boxscoreadvancedv2.BoxScoreAdvancedV2(game_id=game_id).get_data_frames()[0] # 0 returns the player data from the JSON
                    all_games_box_score.append(adv_box_score)
                    
                else:
                    box_score = boxscoretraditionalv2.BoxScoreTraditionalV2(game_id=game_id).get_data_frames()[0] # 0 returns the player data from the JSON
                    all_games_box_score.append(box_score)
                time.sleep(1)
                                               
            boxscore_combined = pd.concat(all_games_box_score, ignore_index=True)
            return boxscore_combined
                                            
        except IndexError:
            return f"No data found for season: {season}"
        except Exception as e:
            return f"An error occurred: {e}"


    @retry
    def get_all_teams_boxscore(self, season, advanced_boxscore=False):
        '''
        Retrieves box score statistics for all games played by every team in a given NBA season.
        
        Parameters:
        - season (int): The season to get the box scores for. Example: 2023.
        - advanced_boxscore (bool, optional): If this is set to True, the advanced box score statistics will be pulled instead of the traditional box score statistics.

        Returns:
        - pandas.DataFrame: A DataFrame with box score statistics for each game for every team in a specified season. 
        '''
        try:
            season_format = str(season) + "-" + str(season+1)[-2:]
            season_games = leaguegamelog.LeagueGameLog(season=season_format).get_data_frames()[0]
            game_id_list = season_games['GAME_ID'].unique()

            if advanced_boxscore:
                all_games_box_score = []
                for game_id in game_id_list:
                    box_score = boxscoreadvancedv2.BoxScoreAdvancedV2(game_id=game_id)
                    team_stats = box_score.team_stats.get_data_frame() # team-level rows for this game
                    all_games_box_score.append(team_stats)
                    time.sleep(5)
                advanced_combined = pd.concat(all_games_box_score, ignore_index=True)
                return advanced_combined
     
            else:
                return season_games
    
        except IndexError:
            return f"No data found for season: {season}"
        except Exception as e:
            return f"An error occurred: {e}"
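One thing worth noting about how `retry` interacts with the methods above: each method catches its own exceptions and returns a message string, so the decorator never sees an exception and never gets a chance to retry. The sketch below illustrates this with stand-ins only. `retry_sketch`, `swallow_errors`, and `propagate_errors` are hypothetical names made up for this example, and nothing here calls the NBA API. It is meant to show the behavior, not to replace the class.

```python
import time
from functools import wraps


def retry_sketch(retries=3, delay=0.0):
    """Hypothetical, parameterized variant of the retry decorator (illustration only)."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            for attempt in range(1, retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_error = e
                    print(f"Attempt {attempt} failed with error: {e}")
                    if attempt < retries:
                        time.sleep(delay)
            raise RuntimeError(f"All {retries} retry attempts failed.") from last_error
        return wrapper
    return decorator


calls = {"swallow": 0, "propagate": 0}


@retry_sketch(retries=3)
def swallow_errors():
    # Mirrors the pattern in the class: the error is caught and turned into a string,
    # so the decorator sees a normal return value and does not retry.
    calls["swallow"] += 1
    try:
        raise TimeoutError("simulated read timeout")
    except Exception as e:
        return f"An error occurred: {e}"


@retry_sketch(retries=3)
def propagate_errors():
    # Lets the error reach the decorator; this stand-in succeeds on the third call.
    calls["propagate"] += 1
    if calls["propagate"] < 3:
        raise TimeoutError("simulated read timeout")
    return "data"


swallowed = swallow_errors()
propagated = propagate_errors()
print(calls, swallowed, propagated)
```

If the retries are meant to kick in on timeouts, one option would be to let unexpected exceptions propagate out of the methods and only catch the `IndexError` for unknown names.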

In [34]:
nbaData = NBA_Data()
traditional_team_box_scores = nbaData.get_all_teams_boxscore(2022)
advanced_team_box_scores = nbaData.get_all_teams_boxscore(2022, advanced_boxscore=True)
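The methods take the season as the integer start year and turn it into the "YYYY-YY" string inside each method. As a small sketch, that conversion could live in one helper; `season_to_string` is a hypothetical name and is not part of the class above.

```python
def season_to_string(season):
    """Convert a start year such as 2020 into the 'YYYY-YY' form, e.g. '2020-21'."""
    return f"{season}-{str(season + 1)[-2:]}"


print(season_to_string(2022))
print(season_to_string(1999))
```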

In [35]:
# traditional_team_box_scores
advanced_team_box_scores


"An error occurred: HTTPSConnectionPool(host='stats.nba.com', port=443): Read timed out. (read timeout=30)"

In [4]:
path = r'/Users/nicholashorton/Documents/NBA Data'
advanced_team_box_scores.to_csv(r'/Users/nicholashorton/Documents/NBA Data/Advanced box scores 2022.csv')
traditional_team_box_scores.to_csv(r'/Users/nicholashorton/Documents/NBA Data/traditional box scores 2022.csv')

AttributeError: 'str' object has no attribute 'to_csv'
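The AttributeError above happens because the method returned its error message string rather than a DataFrame, and a string has no `to_csv`. A minimal sketch of a guard, assuming a hypothetical helper named `save_if_dataframe` (the temporary paths are placeholders for the example):

```python
import os
import tempfile

import pandas as pd


def save_if_dataframe(result, path):
    """Write result to CSV only if it is a DataFrame; return True when a file was written."""
    if isinstance(result, pd.DataFrame):
        result.to_csv(path, index=False)
        return True
    # Anything else (for example an error message string) is reported, not saved
    print(f"Nothing saved to {path}: {result}")
    return False


out_dir = tempfile.mkdtemp()
ok_path = os.path.join(out_dir, "ok.csv")
failed_path = os.path.join(out_dir, "failed.csv")

saved_ok = save_if_dataframe(pd.DataFrame({"GAME_ID": ["0022300061"]}), ok_path)
saved_failed = save_if_dataframe("An error occurred: Read timed out.", failed_path)
```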

In [4]:
nbaData = NBA_Data()
bs = nbaData.get_all_players_boxscores(2023)
bs

Unnamed: 0,GAME_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_CITY,PLAYER_ID,PLAYER_NAME,NICKNAME,START_POSITION,COMMENT,MIN,...,OREB,DREB,REB,AST,STL,BLK,TO,PF,PTS,PLUS_MINUS
0,0022300061,1610612747,LAL,Los Angeles,1627752,Taurean Prince,Taurean,F,,29.000000:53,...,1.0,2.0,3.0,1.0,0.0,1.0,1.0,0.0,18.0,-14.0
1,0022300061,1610612747,LAL,Los Angeles,2544,LeBron James,LeBron,F,,29.000000:01,...,1.0,7.0,8.0,5.0,1.0,0.0,0.0,1.0,21.0,7.0
2,0022300061,1610612747,LAL,Los Angeles,203076,Anthony Davis,Anthony,C,,34.000000:09,...,1.0,7.0,8.0,4.0,0.0,2.0,2.0,3.0,17.0,-17.0
3,0022300061,1610612747,LAL,Los Angeles,1630559,Austin Reaves,Austin,G,,31.000000:20,...,4.0,4.0,8.0,4.0,2.0,0.0,2.0,2.0,14.0,-14.0
4,0022300061,1610612747,LAL,Los Angeles,1626156,D'Angelo Russell,D'Angelo,G,,36.000000:11,...,0.0,4.0,4.0,7.0,1.0,0.0,3.0,3.0,11.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32380,0022301198,1610612744,GSW,Golden State,203967,Dario Šarić,Dario,,,12.000000:39,...,2.0,3.0,5.0,1.0,0.0,1.0,3.0,0.0,12.0,-2.0
32381,0022301198,1610612744,GSW,Golden State,1631311,Lester Quinones,Lester,,,17.000000:17,...,1.0,0.0,1.0,3.0,0.0,0.0,1.0,3.0,12.0,9.0
32382,0022301198,1610612744,GSW,Golden State,1630311,Pat Spencer,Pat,,,15.000000:09,...,2.0,2.0,4.0,3.0,0.0,0.0,1.0,0.0,2.0,7.0
32383,0022301198,1610612744,GSW,Golden State,1629010,Jerome Robinson,Jerome,,,5.000000:25,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,5.0,7.0


In [5]:
bs.to_csv('player_box_scores_2023.csv')

In [None]:
# how to merge the two data frames
'''
This is the logic to combine the advanced and the basic data frames.
These snippets belong inside the class methods, where advanced_boxscore,
box_score, and season_games are defined.
'''
# this is for get_all_players_boxscores
adv_box_score = advanced_boxscore.drop(columns=['TEAM_NAME','TEAM_ABBREVIATION',
                                                'TEAM_CITY','MIN'])
combined_df = pd.merge(adv_box_score, box_score, on=['GAME_ID', 'TEAM_ID'], how='inner')

# this is for get_all_teams_boxscore
advanced_boxscore_drop = advanced_boxscore.drop(columns=['TEAM_NAME','TEAM_ABBREVIATION',
                                                         'TEAM_CITY','MIN'])
combined_df = pd.merge(advanced_boxscore_drop, season_games, on=['GAME_ID', 'TEAM_ID'], how='right')
# inside the method this would end with: return combined_df
                

In [68]:
advanced_boxscore_drop = advanced_team_box_scores.drop(columns=['TEAM_NAME','TEAM_ABBREVIATION',
                                                                     'TEAM_CITY','MIN'])
combined_df = pd.merge(traditional_team_box_scores, advanced_boxscore_drop, on=['GAME_ID', 'TEAM_ID'], how='right')

In [69]:
combined_df

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,MIN,FGM,...,TM_TOV_PCT,EFG_PCT,TS_PCT,USG_PCT,E_USG_PCT,E_PACE,PACE,PACE_PER40,POSS,PIE
0,22023,1610612747,LAL,Los Angeles Lakers,0022300061,2023-10-24,LAL @ DEN,L,240,41,...,12.5,0.511,0.541,1.0,0.199,98.54,95.5,79.58,96,0.421
1,22023,1610612743,DEN,Denver Nuggets,0022300061,2023-10-24,DEN vs. LAL,W,240,48,...,12.6,0.604,0.618,1.0,0.196,98.54,95.5,79.58,95,0.579
2,22023,1610612756,PHX,Phoenix Suns,0022300062,2023-10-24,PHX @ GSW,W,240,42,...,18.8,0.500,0.527,1.0,0.198,105.40,101.5,84.58,101,0.564
3,22023,1610612744,GSW,Golden State Warriors,0022300062,2023-10-24,GSW vs. PHX,L,240,36,...,10.8,0.406,0.459,1.0,0.198,105.40,101.5,84.58,102,0.436
4,22023,1610612764,WAS,Washington Wizards,0022300064,2023-10-25,WAS @ IND,L,240,44,...,12.7,0.505,0.552,1.0,0.199,113.02,110.5,92.08,110,0.428
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1929,22023,1610612744,GSW,Golden State Warriors,0022300936,2024-03-11,GSW @ SAS,W,240,46,...,14.0,0.559,0.570,1.0,0.198,101.94,100.0,83.33,100,0.526
1930,22023,1610612761,TOR,Toronto Raptors,0022300937,2024-03-11,TOR @ DEN,L,240,47,...,12.0,0.565,0.592,1.0,0.196,101.20,99.5,82.92,100,0.460
1931,22023,1610612743,DEN,Denver Nuggets,0022300937,2024-03-11,DEN vs. TOR,W,240,52,...,9.1,0.600,0.607,1.0,0.198,101.20,99.5,82.92,99,0.540
1932,22023,1610612757,POR,Portland Trail Blazers,0022300938,2024-03-11,POR vs. BOS,L,240,39,...,12.1,0.506,0.539,1.0,0.200,94.06,91.0,75.83,91,0.401
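The merge above joins on GAME_ID and TEAM_ID with `how='right'`, so the result has one row per row of the advanced table. The toy example below (made-up frames that reuse a few IDs from the output, not real pulls) sketches how `indicator=True` and `validate` can be used to check which rows matched.

```python
import pandas as pd

traditional = pd.DataFrame({
    "GAME_ID": ["0022300061", "0022300061", "0022300062"],
    "TEAM_ID": [1610612747, 1610612743, 1610612756],
    "PTS": [107, 119, 108],
})
advanced = pd.DataFrame({
    "GAME_ID": ["0022300061", "0022300061"],
    "TEAM_ID": [1610612747, 1610612743],
    "PACE": [95.5, 95.5],
})

# Right join: keeps only the games present in the advanced table.
# validate raises if a (GAME_ID, TEAM_ID) pair is duplicated on either side.
merged = pd.merge(traditional, advanced, on=["GAME_ID", "TEAM_ID"],
                  how="right", validate="one_to_one", indicator=True)

# Left join: keeps every traditional row; unmatched rows get NaN for PACE.
left_merged = pd.merge(traditional, advanced, on=["GAME_ID", "TEAM_ID"],
                       how="left", indicator=True)

print(merged)
print(left_merged)
```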


In [6]:
# regex=False so that the '.' in 'vs.' is matched literally
conditions = [traditional_team_box_scores['MATCHUP'].str.contains('vs.', regex=False),
              traditional_team_box_scores['MATCHUP'].str.contains('@', regex=False)]
choices = ['home', 'away']

traditional_team_box_scores['home_away'] = np.select(conditions, choices, default='unknown')
traditional_team_box_scores.to_csv(r'/Users/nicholashorton/Documents/traditoinal_boxscore_2024_12_13.csv')

In [52]:
# This is to get a column of if the game was home or away
conditions = [combined_df['MATCHUP'].str.contains('vs.', regex=False),
              combined_df['MATCHUP'].str.contains('@', regex=False)]
choices = ['home', 'away']

combined_df['home_away'] = np.select(conditions, choices, default='unknown')
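As a small illustration of the home/away labeling, here is the same `np.select` pattern on a few made-up MATCHUP strings. By default `str.contains` treats the pattern as a regular expression, so the `.` in `'vs.'` matches any character; the last two lines of the sketch show the difference that `regex=False` makes.

```python
import numpy as np
import pandas as pd

matchups = pd.Series(["LAL @ DEN", "DEN vs. LAL", "no venue"])

conditions = [matchups.str.contains("vs.", regex=False),
              matchups.str.contains("@", regex=False)]
labels = np.select(conditions, ["home", "away"], default="unknown")
print(list(labels))

# Regex pitfall: with the default regex=True, 'vs.' also matches 'vs ' (no period)
regex_hit = pd.Series(["DEN vs LAL"]).str.contains("vs.")
literal_hit = pd.Series(["DEN vs LAL"]).str.contains("vs.", regex=False)
```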

In [70]:
combined_df['GAME_DATE'] = pd.to_datetime(combined_df['GAME_DATE'])
combined_df.sort_values(by='GAME_DATE', ascending=False, inplace=True)
combined_df.columns = combined_df.columns.str.lower()

In [21]:
traditional_team_box_scores.columns


Index(['SEASON_ID', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID',
       'GAME_DATE', 'MATCHUP', 'WL', 'MIN', 'FGM', 'FGA', 'FG_PCT', 'FG3M',
       'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST',
       'STL', 'BLK', 'TOV', 'PF', 'PTS', 'PLUS_MINUS', 'VIDEO_AVAILABLE'],
      dtype='object')

In [5]:
##  FOR DATA VIZ HOMEWORK
traditional_team_box_scores.columns
df = traditional_team_box_scores[['SEASON_ID', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_NAME', 'GAME_ID',
       'GAME_DATE', 'MATCHUP', 'WL','REB', 'AST',
       'STL', 'BLK', 'TOV', 'PTS']]
df

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,REB,AST,STL,BLK,TOV,PTS
0,22023,1610612756,PHX,Phoenix Suns,0022300062,2023-10-24,PHX @ GSW,W,60,23,5,7,19,108
1,22023,1610612744,GSW,Golden State Warriors,0022300062,2023-10-24,GSW vs. PHX,L,49,19,11,6,11,104
2,22023,1610612747,LAL,Los Angeles Lakers,0022300061,2023-10-24,LAL @ DEN,L,44,23,5,4,12,107
3,22023,1610612743,DEN,Denver Nuggets,0022300061,2023-10-24,DEN vs. LAL,W,42,29,9,6,12,119
4,22023,1610612740,NOP,New Orleans Pelicans,0022300071,2023-10-25,NOP @ MEM,W,52,22,8,5,21,111
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2455,22023,1610612764,WAS,Washington Wizards,0022301186,2024-04-14,WAS @ BOS,L,36,33,9,4,12,122
2456,22023,1610612755,PHI,Philadelphia 76ers,0022301192,2024-04-14,PHI vs. BKN,W,57,30,8,6,14,107
2457,22023,1610612751,BKN,Brooklyn Nets,0022301192,2024-04-14,BKN @ PHI,L,42,19,8,6,13,86
2458,22023,1610612766,CHA,Charlotte Hornets,0022301187,2024-04-14,CHA @ CLE,W,47,36,10,9,10,120


In [6]:
# This is to get a column of if the game was home or away
conditions = [df['MATCHUP'].str.contains('vs.', regex=False),
              df['MATCHUP'].str.contains('@', regex=False)]
choices = ['home', 'away']

df['home_away'] = np.select(conditions, choices, default='unknown')
df['home_away'].isnull().sum()
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['home_away'] = np.select(conditions, choices, default='unknown')


Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,REB,AST,STL,BLK,TOV,PTS,home_away
0,22023,1610612756,PHX,Phoenix Suns,0022300062,2023-10-24,PHX @ GSW,W,60,23,5,7,19,108,away
1,22023,1610612744,GSW,Golden State Warriors,0022300062,2023-10-24,GSW vs. PHX,L,49,19,11,6,11,104,home
2,22023,1610612747,LAL,Los Angeles Lakers,0022300061,2023-10-24,LAL @ DEN,L,44,23,5,4,12,107,away
3,22023,1610612743,DEN,Denver Nuggets,0022300061,2023-10-24,DEN vs. LAL,W,42,29,9,6,12,119,home
4,22023,1610612740,NOP,New Orleans Pelicans,0022300071,2023-10-25,NOP @ MEM,W,52,22,8,5,21,111,away
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2455,22023,1610612764,WAS,Washington Wizards,0022301186,2024-04-14,WAS @ BOS,L,36,33,9,4,12,122,away
2456,22023,1610612755,PHI,Philadelphia 76ers,0022301192,2024-04-14,PHI vs. BKN,W,57,30,8,6,14,107,home
2457,22023,1610612751,BKN,Brooklyn Nets,0022301192,2024-04-14,BKN @ PHI,L,42,19,8,6,13,86,away
2458,22023,1610612766,CHA,Charlotte Hornets,0022301187,2024-04-14,CHA @ CLE,W,47,36,10,9,10,120,away
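The warning above appears because `df` was created by selecting columns from another DataFrame, and pandas cannot tell whether the new column is being written to a copy or to the original. A minimal sketch of the usual way around it, taking an explicit `.copy()` when making the subset (the frame here is made-up toy data, and `full` and `subset` are names used only for this example):

```python
import numpy as np
import pandas as pd

full = pd.DataFrame({
    "TEAM_NAME": ["Phoenix Suns", "Golden State Warriors"],
    "MATCHUP": ["PHX @ GSW", "GSW vs. PHX"],
    "WL": ["W", "L"],
    "PTS": [108, 104],
})

# .copy() makes the subset an independent DataFrame, so adding columns is unambiguous
subset = full[["TEAM_NAME", "MATCHUP", "WL"]].copy()
subset["home_away"] = np.where(subset["MATCHUP"].str.contains("@", regex=False), "away", "home")

print(subset)
```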


# Create data for charts

### Get home, away win percentage, and home court advantage differential

In [7]:
# filter for home and away games
home_games = df.loc[df['home_away'] == 'home']
away_games = df.loc[df['home_away'] == 'away']

# calculate home win percentage
home_win_count = home_games[home_games["WL"] == 'W'].groupby('TEAM_NAME').size()
home_total_count = home_games.groupby('TEAM_NAME').size()
home_win_percentage = ((home_win_count / home_total_count) * 100).fillna(0)

# calculate away win percentage 
away_win_count = away_games[away_games['WL'] == 'W'].groupby('TEAM_NAME').size()
away_total_count = away_games.groupby('TEAM_NAME').size()
away_win_percentage = ((away_win_count / away_total_count) * 100).fillna(0)

# merge back to original dataframe
df['home_win_percentage'] = df['TEAM_NAME'].map(home_win_percentage)
df['away_win_percentage'] = df['TEAM_NAME'].map(away_win_percentage)

# calculate home court advantage differential
df['HCAD'] = df['home_win_percentage'] - df['away_win_percentage']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['home_win_percentage'] = df['TEAM_NAME'].map(home_win_percentage)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['away_win_percentage'] = df['TEAM_NAME'].map(away_win_percentage)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['HCAD'] = df['home_win_percentage'] - df['away_win_percentage']
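For reference, the same home and away win percentages can also be computed in one pass by grouping on team and venue. This is only a sketch on made-up data (teams "A" and "B" are placeholders), not the code that produced the output in this notebook.

```python
import pandas as pd

games = pd.DataFrame({
    "TEAM_NAME": ["A", "A", "A", "B", "B", "B"],
    "home_away": ["home", "home", "away", "home", "away", "away"],
    "WL": ["W", "L", "W", "W", "L", "L"],
})

# The mean of a boolean "won" flag is the win rate; unstack gives one column per venue
win_pct = (
    games.assign(win=games["WL"].eq("W"))
         .groupby(["TEAM_NAME", "home_away"])["win"]
         .mean()
         .unstack("home_away") * 100
)

# Home court advantage differential: home win % minus away win %
win_pct["HCAD"] = win_pct["home"] - win_pct["away"]
print(win_pct)
```

The resulting table has one row per team, so it can be mapped back onto the game-level frame by TEAM_NAME if the per-row columns are still needed.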


In [20]:
df

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,REB,AST,STL,BLK,TOV,PTS,home_away,home_win_percentage,away_win_percentage,HCAD
0,22023,1610612756,PHX,Phoenix Suns,0022300062,2023-10-24,PHX @ GSW,W,60,23,5,7,19,108,away,60.975610,58.536585,2.439024
1,22023,1610612744,GSW,Golden State Warriors,0022300062,2023-10-24,GSW vs. PHX,L,49,19,11,6,11,104,home,51.219512,60.975610,-9.756098
2,22023,1610612747,LAL,Los Angeles Lakers,0022300061,2023-10-24,LAL @ DEN,L,44,23,5,4,12,107,away,66.666667,47.500000,19.166667
3,22023,1610612743,DEN,Denver Nuggets,0022300061,2023-10-24,DEN vs. LAL,W,42,29,9,6,12,119,home,80.487805,58.536585,21.951220
4,22023,1610612750,MIN,Minnesota Timberwolves,0022300069,2023-10-25,MIN @ TOR,L,62,20,8,9,14,94,away,73.170732,63.414634,9.756098
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2455,22023,1610612740,NOP,New Orleans Pelicans,0022301195,2024-04-14,NOP vs. LAL,L,39,26,8,1,19,108,home,52.500000,66.666667,-14.166667
2456,22023,1610612750,MIN,Minnesota Timberwolves,0022301194,2024-04-14,MIN vs. PHX,L,36,22,7,5,24,106,home,73.170732,63.414634,9.756098
2457,22023,1610612758,SAC,Sacramento Kings,0022301200,2024-04-14,SAC vs. POR,W,51,29,11,6,14,121,home,58.536585,53.658537,4.878049
2458,22023,1610612757,POR,Portland Trail Blazers,0022301200,2024-04-14,POR @ SAC,L,54,18,11,2,18,82,away,26.829268,24.390244,2.439024


### Create data for pie chart

In [68]:
home_filter = df[(df['WL'] == 'W') & (df['home_away'] == 'home')]
away_filter = df[(df['WL'] == 'W') & (df['home_away'] == 'away')]
# Build the one-row summary directly instead of adding constant columns to df
pie = pd.DataFrame({'num_home_wins': [len(home_filter)],
                    'num_away_wins': [len(away_filter)]})
pie


Unnamed: 0,num_home_wins,num_away_wins
0,668,562


In [69]:
pie.to_csv("pie_chart_data.csv")

### Create data for bar chart

In [91]:
bar = df[['TEAM_NAME', 'home_win_percentage']]
bar = bar.drop_duplicates()
bar['home_win_percentage'] = bar['home_win_percentage'].round(0).astype(int)
#bar.to_csv('bar_chart.csv')
bar

Unnamed: 0,TEAM_NAME,home_win_percentage
0,Golden State Warriors,51
1,Phoenix Suns,61
2,Los Angeles Lakers,67
3,Denver Nuggets,80
4,Chicago Bulls,49
5,Orlando Magic,71
6,Houston Rockets,66
7,Utah Jazz,51
8,Sacramento Kings,59
9,LA Clippers,61


### Create data for line chart

In [17]:

df

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,REB,AST,STL,BLK,TOV,PTS,home_away
0,22023,1610612747,LAL,Los Angeles Lakers,0022300061,2023-10-24,LAL @ DEN,L,44,23,5,4,12,107,away
1,22023,1610612744,GSW,Golden State Warriors,0022300062,2023-10-24,GSW vs. PHX,L,49,19,11,6,11,104,home
2,22023,1610612756,PHX,Phoenix Suns,0022300062,2023-10-24,PHX @ GSW,W,60,23,5,7,19,108,away
3,22023,1610612743,DEN,Denver Nuggets,0022300061,2023-10-24,DEN vs. LAL,W,42,29,9,6,12,119,home
4,22023,1610612752,NYK,New York Knicks,0022300065,2023-10-25,NYK vs. BOS,L,47,24,9,0,11,104,home
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2455,22023,1610612744,GSW,Golden State Warriors,0022301198,2024-04-14,GSW vs. UTA,W,42,35,10,6,9,123,home
2456,22023,1610612762,UTA,Utah Jazz,0022301198,2024-04-14,UTA @ GSW,L,48,22,6,5,20,116,away
2457,22023,1610612745,HOU,Houston Rockets,0022301199,2024-04-14,HOU @ LAC,W,59,31,7,8,18,116,away
2458,22023,1610612746,LAC,LA Clippers,0022301199,2024-04-14,LAC vs. HOU,L,51,23,13,8,12,105,home


In [11]:


# Sort by TEAM_ID and GAME_DATE so each team's games are in chronological order
df = df.sort_values(by=['TEAM_ID', 'GAME_DATE'])

# Initialize the HCAD column and set starting value to 0 for each team
df['HCAD'] = 0

# Dictionary to store the running HCAD score for each team
team_hcad = {}

# Iterate over the DataFrame row by row
for index, row in df.iterrows():
    team = row['TEAM_NAME']
    
    # Initialize the team's HCAD score at 0 if not already in the dictionary
    if team not in team_hcad:
        team_hcad[team] = 0

    # Only adjust HCAD for home games
    if row['home_away'] == 'home':
        if row['WL'] == 'W':  # If the home team won the game
            team_hcad[team] += 1  # Add 1 for home win
        elif row['WL'] == 'L':  # If the home team lost the game
            team_hcad[team] -= 1  # Subtract 1 for home loss

    # if row['home_away'] == 'away':
    #     if row['WL'] == 'W':  # If the home team won the game
    #         team_hcad[team] += 1  # Add 1 for home win
    #     elif row['WL'] == 'L':  # If the home team lost the game
    #         team_hcad[team] -= 1  # Subtract 1 for home loss
    
    # Update the HCAD column with the current HCAD score
    df.at[index, 'HCAD'] = team_hcad[team]



df

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,REB,AST,STL,BLK,TOV,PTS,home_away,HCAD,home_win_percentage,away_win_percentage
23,22023,1610612737,ATL,Atlanta Hawks,0022300063,2023-10-25,ATL @ CHA,L,42,24,12,1,12,110,away,0,51.219512,63.414634
35,22023,1610612737,ATL,Atlanta Hawks,0022300079,2023-10-27,ATL vs. NYK,L,44,28,7,6,14,120,home,-1,51.219512,63.414634
78,22023,1610612737,ATL,Atlanta Hawks,0022300097,2023-10-29,ATL @ MIL,W,46,32,15,2,17,127,away,-1,51.219512,63.414634
99,22023,1610612737,ATL,Atlanta Hawks,0022300104,2023-10-30,ATL vs. MIN,W,36,28,6,7,11,127,home,0,51.219512,63.414634
127,22023,1610612737,ATL,Atlanta Hawks,0022300117,2023-11-01,ATL vs. WAS,W,57,26,8,3,21,130,home,1,51.219512,63.414634
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2341,22023,1610612766,CHA,Charlotte Hornets,0022301135,2024-04-07,CHA vs. OKC,L,36,29,12,9,18,118,home,-18,26.829268,75.609756
2356,22023,1610612766,CHA,Charlotte Hornets,0022301144,2024-04-09,CHA vs. DAL,L,39,24,8,3,12,104,home,-19,26.829268,75.609756
2386,22023,1610612766,CHA,Charlotte Hornets,0022301159,2024-04-10,CHA @ ATL,W,33,25,11,2,14,115,away,-19,26.829268,75.609756
2415,22023,1610612766,CHA,Charlotte Hornets,0022301173,2024-04-12,CHA @ BOS,L,33,20,3,7,20,98,away,-19,26.829268,75.609756
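The running HCAD above is built row by row with `iterrows`. As an alternative sketch, the same running total can be computed with a per-game step (+1 for a home win, -1 for a home loss, 0 for away games) and a grouped cumulative sum. The five rows below are the first five Atlanta Hawks games from the output above; the `step` column is a name introduced only for this example.

```python
import numpy as np
import pandas as pd

games = pd.DataFrame({
    "TEAM_NAME": ["Atlanta Hawks"] * 5,
    "GAME_DATE": pd.to_datetime(["2023-10-25", "2023-10-27", "2023-10-29",
                                 "2023-10-30", "2023-11-01"]),
    "home_away": ["away", "home", "away", "home", "home"],
    "WL": ["L", "L", "W", "W", "W"],
})

# Chronological order within each team, as in the loop version
games = games.sort_values(["TEAM_NAME", "GAME_DATE"])

is_home = games["home_away"].eq("home")
games["step"] = np.select(
    [is_home & games["WL"].eq("W"), is_home & games["WL"].eq("L")],
    [1, -1],
    default=0,
)

# Running total per team
games["HCAD"] = games.groupby("TEAM_NAME")["step"].cumsum()
print(games[["GAME_DATE", "home_away", "WL", "HCAD"]])
```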


In [12]:
df.away_win_percentage.max()

83.33333333333334

In [36]:
line = df[['TEAM_NAME','GAME_DATE','HCAD']]
line.to_csv('line_chart.csv')

### Create data for other charts

In [15]:
# drop_duplicates() returns a new DataFrame, which avoids modifying a slice of df in place
other = df[['TEAM_NAME','home_win_percentage','away_win_percentage','HCAD']].drop_duplicates()
other.to_csv('scatter_plot.csv')


In [11]:
# Filter for wins
wins_df = df[df['WL'] == 'W']

# Calculate total wins for each team
wins_count = wins_df.groupby('TEAM_NAME').size().reset_index(name='Total_Wins')

# Get home and away win percentages (assuming they are constant per team)
win_percentages = df[['TEAM_NAME', 'home_win_percentage', 'away_win_percentage']].drop_duplicates()

# Merge the wins count with win percentages
result = pd.merge(wins_count, win_percentages, on='TEAM_NAME', how='left')
#result.to_csv('scatter_plot.csv')

### Create data for updated scatter plot with PIE

In [14]:
scatter = df
scatter

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_NAME,GAME_ID,GAME_DATE,MATCHUP,WL,REB,AST,STL,BLK,TOV,PTS,home_away,home_win_percentage,away_win_percentage,HCAD
0,22023,1610612756,PHX,Phoenix Suns,0022300062,2023-10-24,PHX @ GSW,W,60,23,5,7,19,108,away,60.975610,58.536585,2.439024
1,22023,1610612744,GSW,Golden State Warriors,0022300062,2023-10-24,GSW vs. PHX,L,49,19,11,6,11,104,home,51.219512,60.975610,-9.756098
2,22023,1610612747,LAL,Los Angeles Lakers,0022300061,2023-10-24,LAL @ DEN,L,44,23,5,4,12,107,away,66.666667,47.500000,19.166667
3,22023,1610612743,DEN,Denver Nuggets,0022300061,2023-10-24,DEN vs. LAL,W,42,29,9,6,12,119,home,80.487805,58.536585,21.951220
4,22023,1610612740,NOP,New Orleans Pelicans,0022300071,2023-10-25,NOP @ MEM,W,52,22,8,5,21,111,away,52.500000,66.666667,-14.166667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2455,22023,1610612764,WAS,Washington Wizards,0022301186,2024-04-14,WAS @ BOS,L,36,33,9,4,12,122,away,17.073171,19.512195,-2.439024
2456,22023,1610612755,PHI,Philadelphia 76ers,0022301192,2024-04-14,PHI vs. BKN,W,57,30,8,6,14,107,home,60.975610,53.658537,7.317073
2457,22023,1610612751,BKN,Brooklyn Nets,0022301192,2024-04-14,BKN @ PHI,L,42,19,8,6,13,86,away,48.780488,29.268293,19.512195
2458,22023,1610612766,CHA,Charlotte Hornets,0022301187,2024-04-14,CHA @ CLE,W,47,36,10,9,10,120,away,26.829268,24.390244,2.439024


In [9]:
advanced_team_box_scores = nbaData.get_all_teams_boxscore(2023, advanced_boxscore=True)
advanced_team_box_scores

Unnamed: 0,GAME_ID,TEAM_ID,TEAM_NAME,TEAM_ABBREVIATION,TEAM_CITY,MIN,E_OFF_RATING,OFF_RATING,E_DEF_RATING,DEF_RATING,...,TM_TOV_PCT,EFG_PCT,TS_PCT,USG_PCT,E_USG_PCT,E_PACE,PACE,PACE_PER40,POSS,PIE
0,0022300061,1610612747,Lakers,LAL,Los Angeles,240.000000:00,109.4,111.5,119.9,125.3,...,12.5,0.511,0.541,1.0,0.199,98.54,95.5,79.58,96,0.421
1,0022300061,1610612743,Nuggets,DEN,Denver,240.000000:00,119.9,125.3,109.4,111.5,...,12.6,0.604,0.618,1.0,0.196,98.54,95.5,79.58,95,0.579
2,0022300062,1610612756,Suns,PHX,Phoenix,240.000000:00,103.4,106.9,97.8,102.0,...,18.8,0.500,0.527,1.0,0.198,105.40,101.5,84.58,101,0.564
3,0022300062,1610612744,Warriors,GSW,Golden State,240.000000:00,97.8,102.0,103.4,106.9,...,10.8,0.406,0.459,1.0,0.198,105.40,101.5,84.58,102,0.436
4,0022300064,1610612764,Wizards,WAS,Washington,240.000000:00,107.4,109.1,125.1,128.8,...,12.7,0.505,0.552,1.0,0.199,113.02,110.5,92.08,110,0.428
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2455,0022301192,1610612751,Nets,BKN,Brooklyn,240.000000:00,86.4,86.9,103.3,107.0,...,13.1,0.440,0.475,1.0,0.199,101.54,99.5,82.92,99,0.403
2456,0022301187,1610612739,Cavaliers,CLE,Cleveland,240.000000:00,111.2,113.4,125.2,123.7,...,14.4,0.559,0.567,1.0,0.196,97.40,97.0,80.83,97,0.427
2457,0022301187,1610612766,Hornets,CHA,Charlotte,240.000000:00,125.2,123.7,111.2,113.4,...,10.3,0.610,0.626,1.0,0.198,97.40,97.0,80.83,97,0.573
2458,0022301200,1610612758,Kings,SAC,Sacramento,240.000000:00,121.1,122.2,81.6,82.8,...,14.1,0.569,0.612,1.0,0.198,100.18,99.0,82.50,99,0.731


In [26]:
merge = pd.merge(scatter, advanced_team_box_scores, on=['TEAM_ID','GAME_ID'], how='left')
merge

Unnamed: 0,SEASON_ID,TEAM_ID,TEAM_ABBREVIATION_x,TEAM_NAME_x,GAME_ID,GAME_DATE,MATCHUP,WL,REB,AST,...,TM_TOV_PCT,EFG_PCT,TS_PCT,USG_PCT,E_USG_PCT,E_PACE,PACE,PACE_PER40,POSS,PIE
0,22023,1610612756,PHX,Phoenix Suns,0022300062,2023-10-24,PHX @ GSW,W,60,23,...,18.8,0.500,0.527,1.0,0.198,105.40,101.5,84.58,101,0.564
1,22023,1610612744,GSW,Golden State Warriors,0022300062,2023-10-24,GSW vs. PHX,L,49,19,...,10.8,0.406,0.459,1.0,0.198,105.40,101.5,84.58,102,0.436
2,22023,1610612747,LAL,Los Angeles Lakers,0022300061,2023-10-24,LAL @ DEN,L,44,23,...,12.5,0.511,0.541,1.0,0.199,98.54,95.5,79.58,96,0.421
3,22023,1610612743,DEN,Denver Nuggets,0022300061,2023-10-24,DEN vs. LAL,W,42,29,...,12.6,0.604,0.618,1.0,0.196,98.54,95.5,79.58,95,0.579
4,22023,1610612740,NOP,New Orleans Pelicans,0022300071,2023-10-25,NOP @ MEM,W,52,22,...,20.4,0.553,0.586,1.0,0.199,104.74,103.5,86.25,103,0.543
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2455,22023,1610612764,WAS,Washington Wizards,0022301186,2024-04-14,WAS @ BOS,L,36,33,...,11.4,0.544,0.568,1.0,0.196,108.32,105.5,87.92,105,0.442
2456,22023,1610612755,PHI,Philadelphia 76ers,0022301192,2024-04-14,PHI vs. BKN,W,57,30,...,14.0,0.500,0.521,1.0,0.197,101.54,99.5,82.92,100,0.597
2457,22023,1610612751,BKN,Brooklyn Nets,0022301192,2024-04-14,BKN @ PHI,L,42,19,...,13.1,0.440,0.475,1.0,0.199,101.54,99.5,82.92,99,0.403
2458,22023,1610612766,CHA,Charlotte Hornets,0022301187,2024-04-14,CHA @ CLE,W,47,36,...,10.3,0.610,0.626,1.0,0.198,97.40,97.0,80.83,97,0.573
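
The merge above carries over every column from the advanced box scores, which is why TEAM_NAME and TEAM_ABBREVIATION pick up _x / _y suffixes. Below is a minimal sketch of one way to check the join, assuming each (TEAM_ID, GAME_ID) pair should appear once in each frame. The two small frames and their values are made up for illustration; they are not the real API output.

```python
import pandas as pd

# Toy stand-ins for `scatter` and `advanced_team_box_scores` (illustrative values only)
game_log = pd.DataFrame({
    "TEAM_ID": [1, 2],
    "GAME_ID": ["001", "001"],
    "TEAM_NAME": ["Phoenix Suns", "Golden State Warriors"],
    "WL": ["W", "L"],
})
advanced = pd.DataFrame({
    "TEAM_ID": [1, 2],
    "GAME_ID": ["001", "001"],
    "TEAM_NAME": ["Suns", "Warriors"],
    "PIE": [0.564, 0.436],
})

# Keep only the join keys plus the new stat so no *_x / *_y columns are created
merged = pd.merge(
    game_log,
    advanced[["TEAM_ID", "GAME_ID", "PIE"]],
    on=["TEAM_ID", "GAME_ID"],
    how="left",
    validate="one_to_one",  # raises MergeError if a (team, game) pair is duplicated
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Rows from the game log that found no advanced box score
unmatched = merged[merged["_merge"] == "left_only"]
```

If `unmatched` is empty, every game in the log found its advanced stats, and TEAM_NAME keeps its original name with no suffix.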


In [30]:
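# regex=False matches 'vs.' as literal text; as a regex the '.' would match any character
conditions = [merge['MATCHUP'].str.contains('vs.', regex=False),
              merge['MATCHUP'].str.contains('@', regex=False)]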
choices = ['home', 'away']

merge['home_away'] = np.select(conditions, choices, default='unknown')

# filter for home and away games
home_games = merge.loc[merge['home_away'] == 'home']
away_games = merge.loc[merge['home_away'] == 'away']

# calculate home win percentage
home_win_count = home_games[home_games["WL"] == 'W'].groupby('TEAM_NAME_x').size()
home_total_count = home_games.groupby('TEAM_NAME_x').size()
home_win_percentage = ((home_win_count / home_total_count) * 100).fillna(0)

# calculate away win percentage 
away_win_count = away_games[away_games['WL'] == 'W'].groupby('TEAM_NAME_x').size()
away_total_count = away_games.groupby('TEAM_NAME_x').size()
away_win_percentage = ((away_win_count / away_total_count) * 100).fillna(0)

# merge back to original dataframe
merge['home_win_percentage'] = merge['TEAM_NAME_x'].map(home_win_percentage)
merge['away_win_percentage'] = merge['TEAM_NAME_x'].map(away_win_percentage)

wins_df = merge[merge['WL'] == 'W']

# Calculate total wins for each team
wins_count = wins_df.groupby('TEAM_NAME_x').size().reset_index(name='Total_Wins')

# Get home and away win percentages (assuming they are constant per team)
win_percentages = merge[['TEAM_NAME_x', 'home_win_percentage', 'away_win_percentage']].drop_duplicates()

# Merge the wins count with win percentages
result = pd.merge(wins_count, win_percentages, on='TEAM_NAME_x', how='left')
result

Unnamed: 0,TEAM_NAME_x,Total_Wins,home_win_percentage,away_win_percentage
0,Atlanta Hawks,36,51.219512,36.585366
1,Boston Celtics,64,90.243902,65.853659
2,Brooklyn Nets,32,48.780488,29.268293
3,Charlotte Hornets,21,26.829268,24.390244
4,Chicago Bulls,39,48.780488,46.341463
5,Cleveland Cavaliers,48,63.414634,53.658537
6,Dallas Mavericks,50,60.97561,60.97561
7,Denver Nuggets,57,80.487805,58.536585
8,Detroit Pistons,14,17.5,16.666667
9,Golden State Warriors,46,51.219512,60.97561
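
The home and away win percentages above take several intermediate frames. A shorter route, sketched here under the assumption that the frame has TEAM_NAME_x, home_away and WL columns like `merge`, is to group on both team and venue at once. The toy data is invented for illustration.

```python
import pandas as pd

# Toy game log (illustrative values only)
games = pd.DataFrame({
    "TEAM_NAME_x": ["A", "A", "A", "A", "B", "B"],
    "home_away":   ["home", "home", "away", "away", "home", "away"],
    "WL":          ["W", "L", "W", "W", "L", "W"],
})

win_pct = (
    games.assign(win=games["WL"].eq("W"))            # True for a win, False for a loss
         .groupby(["TEAM_NAME_x", "home_away"])["win"]
         .mean()                                     # share of wins per team and venue
         .mul(100)
         .unstack("home_away")                       # one column per venue
         .rename(columns={"home": "home_win_percentage",
                          "away": "away_win_percentage"})
         .reset_index()
)
```

Because the mean is taken over a True/False column, a team with no wins at a venue comes out as 0 without needing `fillna(0)`, and the result already has one row per team.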


In [37]:
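# .copy() makes test an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning
test = merge[['TEAM_NAME_x','home_away','PIE']].copy()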
test

Unnamed: 0,TEAM_NAME_x,home_away,PIE
0,Phoenix Suns,away,0.564
1,Golden State Warriors,home,0.436
2,Los Angeles Lakers,away,0.421
3,Denver Nuggets,home,0.579
4,New Orleans Pelicans,away,0.543
...,...,...,...
2455,Washington Wizards,away,0.442
2456,Philadelphia 76ers,home,0.597
2457,Brooklyn Nets,away,0.403
2458,Charlotte Hornets,away,0.573


In [49]:
# Calculate mean home PIE per team; only the home rows get a value, away rows are left as NaN
test['home_pie'] = test.loc[test['home_away'] == 'home'].groupby('TEAM_NAME_x')['PIE'].transform('mean')

# Calculate mean away PIE per team; only the away rows get a value, home rows are left as NaN
test['away_pie'] = test.loc[test['home_away'] == 'away'].groupby('TEAM_NAME_x')['PIE'].transform('mean')

test.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test['home_pie'] = test.loc[test['home_away'] == 'home'].groupby('TEAM_NAME_x')['PIE'].transform('mean')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test['away_pie'] = test.loc[test['home_away'] == 'away'].groupby('TEAM_NAME_x')['PIE'].transform('mean')


Unnamed: 0,TEAM_NAME_x,home_away,PIE,home_pie,away_pie
0,Phoenix Suns,away,0.564,,0.527512
1,Golden State Warriors,home,0.436,0.506122,
2,Los Angeles Lakers,away,0.421,,0.499975
3,Denver Nuggets,home,0.579,0.555707,
4,New Orleans Pelicans,away,0.543,,0.529429
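
The NaN values in the table above come from index alignment: `transform` on the filtered rows only returns values for those rows, so every other row is left empty when the result is assigned back. A small sketch of the effect, and of one possible fix using `.map` the same way the win percentages were mapped earlier. The frame and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `merge` (illustrative values only)
merged = pd.DataFrame({
    "TEAM_NAME_x": ["A", "A", "B", "B"],
    "home_away":   ["home", "away", "home", "away"],
    "PIE":         [0.6, 0.4, 0.5, 0.3],
})

# Explicit copy, so new columns go on an independent frame
subset = merged[["TEAM_NAME_x", "home_away", "PIE"]].copy()

# transform() on the home rows returns a Series indexed by the home rows only
home_only = subset.loc[subset["home_away"] == "home"]
partial = home_only.groupby("TEAM_NAME_x")["PIE"].transform("mean")

# Assignment aligns on the index, so the away rows receive NaN
subset["home_pie_partial"] = partial

# Mapping a per-team mean back by team name fills every row instead
home_mean = home_only.groupby("TEAM_NAME_x")["PIE"].mean()
subset["home_pie"] = subset["TEAM_NAME_x"].map(home_mean)
```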


In [53]:
# Calculate the mean PIE for home games and broadcast it to all rows for each team
test['home_pie'] = test.groupby('TEAM_NAME_x')['PIE'].transform(lambda x: x[test['home_away'] == 'home'].mean())

# Calculate the mean PIE for away games and broadcast it to all rows for each team
test['away_pie'] = test.groupby('TEAM_NAME_x')['PIE'].transform(lambda x: x[test['home_away'] == 'away'].mean())

test

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test['home_pie'] = test.groupby('TEAM_NAME_x')['PIE'].transform(lambda x: x[test['home_away'] == 'home'].mean())
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test['away_pie'] = test.groupby('TEAM_NAME_x')['PIE'].transform(lambda x: x[test['home_away'] == 'away'].mean())


Unnamed: 0,TEAM_NAME_x,home_away,PIE,home_pie,away_pie
0,Phoenix Suns,away,0.564,0.527415,0.527512
1,Golden State Warriors,home,0.436,0.506122,0.519268
2,Los Angeles Lakers,away,0.421,0.537071,0.499975
3,Denver Nuggets,home,0.579,0.555707,0.503951
4,New Orleans Pelicans,away,0.543,0.516050,0.529429
...,...,...,...,...,...
2455,Washington Wizards,away,0.442,0.460756,0.449610
2456,Philadelphia 76ers,home,0.597,0.518220,0.492293
2457,Brooklyn Nets,away,0.403,0.503927,0.460317
2458,Charlotte Hornets,away,0.573,0.457293,0.436756
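
Since the scatter plot only needs one row per team, another option is to aggregate straight to team level instead of broadcasting the means across all game rows. This is a sketch only, assuming the same TEAM_NAME_x, home_away and PIE columns as `test`, with invented values.

```python
import pandas as pd

# Toy per-game frame (illustrative values only)
games = pd.DataFrame({
    "TEAM_NAME_x": ["A", "A", "A", "B", "B"],
    "home_away":   ["home", "home", "away", "home", "away"],
    "PIE":         [0.6, 0.4, 0.5, 0.3, 0.7],
})

# One row per team, one column per venue, each cell the mean PIE
pie_by_venue = (
    games.pivot_table(index="TEAM_NAME_x", columns="home_away",
                      values="PIE", aggfunc="mean")
         .rename(columns={"home": "home_pie", "away": "away_pie"})
         .reset_index()
)
```

A frame shaped like `pie_by_venue` can be merged with `result` on TEAM_NAME_x directly, with no need to drop the per-game columns or the duplicate rows afterwards.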


In [54]:
result.head()

Unnamed: 0,TEAM_NAME_x,Total_Wins,home_win_percentage,away_win_percentage
0,Atlanta Hawks,36,51.219512,36.585366
1,Boston Celtics,64,90.243902,65.853659
2,Brooklyn Nets,32,48.780488,29.268293
3,Charlotte Hornets,21,26.829268,24.390244
4,Chicago Bulls,39,48.780488,46.341463


In [69]:
test2 = pd.merge(result, test, on='TEAM_NAME_x', how='left')
test2.drop(columns=['PIE', 'home_away'], inplace=True)
test2 = test2.drop_duplicates().reset_index(drop=True)

In [70]:
test2.to_csv('scatter_plot.csv')
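
One thing to be aware of when the CSV is read back in: `to_csv` writes the row index by default, and `read_csv` then loads it as an extra column named "Unnamed: 0". The sketch below shows the difference using an in-memory buffer rather than a real file. Whether the index is wanted in scatter_plot.csv depends on what reads it, so treat `index=False` as an option, not a required change.

```python
import io
import pandas as pd

# Two rows taken from the `result` table above
summary = pd.DataFrame({
    "TEAM_NAME_x": ["Atlanta Hawks", "Boston Celtics"],
    "Total_Wins": [36, 64],
})

# Default: the index is written as an unnamed first column
buf_with_index = io.StringIO()
summary.to_csv(buf_with_index)
buf_with_index.seek(0)
reread_with_index = pd.read_csv(buf_with_index)

# index=False: only the data columns are written
buf_no_index = io.StringIO()
summary.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
reread_no_index = pd.read_csv(buf_no_index)
```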
