## More Than Just a Crowd: An NBA EDA Quantifying Home-Court Advantage

This project investigates the reality of home-court advantage in the NBA through comprehensive exploratory data analysis. We will quantify its impact by systematically comparing key performance metrics, including **points**, **rebounds**, **assists**, and **shooting efficiency**, between home and away games. 

Furthermore, the analysis will extend to evaluating individual performance disparities, examining how **'star players'** productivity and efficiency fluctuate across different game environments to identify patterns of home-court influence on both team and player levels.

In [1]:
import pandas as pd

In [2]:
df_games_details = pd.read_csv('./Unclean Data/games_details.csv')
df_games = pd.read_csv('./Unclean Data/games.csv')

  df_games_details = pd.read_csv('./Unclean Data/games_details.csv')


In [3]:
cols1 = list(df_games.columns)
cols2 = list(df_games_details.columns)

print("df_games:", cols1)
print("\ndf_games_details:", cols2)

df_games: ['GAME_DATE_EST', 'GAME_ID', 'GAME_STATUS_TEXT', 'HOME_TEAM_ID', 'VISITOR_TEAM_ID', 'SEASON', 'TEAM_ID_home', 'PTS_home', 'FG_PCT_home', 'FT_PCT_home', 'FG3_PCT_home', 'AST_home', 'REB_home', 'TEAM_ID_away', 'PTS_away', 'FG_PCT_away', 'FT_PCT_away', 'FG3_PCT_away', 'AST_away', 'REB_away', 'HOME_TEAM_WINS']

df_games_details: ['GAME_ID', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_CITY', 'PLAYER_ID', 'PLAYER_NAME', 'NICKNAME', 'START_POSITION', 'COMMENT', 'MIN', 'FGM', 'FGA', 'FG_PCT', 'FG3M', 'FG3A', 'FG3_PCT', 'FTM', 'FTA', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PF', 'PTS', 'PLUS_MINUS']


In [5]:
display(df_games.head(1))
display(df_games_details.head(1))

Unnamed: 0,GAME_DATE_EST,GAME_ID,GAME_STATUS_TEXT,HOME_TEAM_ID,VISITOR_TEAM_ID,SEASON,TEAM_ID_home,PTS_home,FG_PCT_home,FT_PCT_home,...,AST_home,REB_home,TEAM_ID_away,PTS_away,FG_PCT_away,FT_PCT_away,FG3_PCT_away,AST_away,REB_away,HOME_TEAM_WINS
0,2022-12-22,22200477,Final,1610612740,1610612759,2022,1610612740,126.0,0.484,0.926,...,25.0,46.0,1610612759,117.0,0.478,0.815,0.321,23.0,44.0,1


Unnamed: 0,GAME_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_CITY,PLAYER_ID,PLAYER_NAME,NICKNAME,START_POSITION,COMMENT,MIN,...,OREB,DREB,REB,AST,STL,BLK,TO,PF,PTS,PLUS_MINUS
0,22200477,1610612759,SAS,San Antonio,1629641,Romeo Langford,Romeo,F,,18:06,...,1.0,1.0,2.0,0.0,1.0,0.0,2.0,5.0,2.0,-2.0


## Data Cleaning

First we will remove any unnecessary columns in our datasets. These are columns that we either wouldn't explore in our EDA or columns that we considered unnecessary.

In [10]:
print(df_games.columns)
print(df_games_details.columns)

Index(['GAME_DATE_EST', 'GAME_ID', 'HOME_TEAM_ID', 'VISITOR_TEAM_ID', 'SEASON',
       'TEAM_ID_home', 'PTS_home', 'FG_PCT_home', 'FT_PCT_home',
       'FG3_PCT_home', 'AST_home', 'REB_home', 'TEAM_ID_away', 'PTS_away',
       'FG_PCT_away', 'FT_PCT_away', 'FG3_PCT_away', 'AST_away', 'REB_away',
       'HOME_TEAM_WINS'],
      dtype='object')
Index(['GAME_ID', 'TEAM_ID', 'TEAM_ABBREVIATION', 'TEAM_CITY', 'PLAYER_ID',
       'PLAYER_NAME', 'FGM', 'FGA', 'FG_PCT', 'FG3M', 'FG3A', 'FG3_PCT', 'FTM',
       'FTA', 'FT_PCT', 'OREB', 'DREB', 'REB', 'AST', 'STL', 'BLK', 'TO', 'PF',
       'PTS', 'PLUS_MINUS'],
      dtype='object')


In [18]:
## For df_games
df_games = df_games.drop(columns=['GAME_STATUS_TEXT'])

## For df_games_details
df_games_details = df_games_details.drop(columns=['NICKNAME', 'START_POSITION', 'COMMENT', 'MIN'])

print(df_games.columns)
print(df_games_details.columns)

KeyError: "['GAME_STATUS_TEXT'] not found in axis"

Second, we will check both data frames for null values to assess data quality and determine the appropriate handling strategy for missing data.

In [11]:
## Null values in this case can be dropped.
df_games.isna().sum()

GAME_DATE_EST       0
GAME_ID             0
HOME_TEAM_ID        0
VISITOR_TEAM_ID     0
SEASON              0
TEAM_ID_home        0
PTS_home           99
FG_PCT_home        99
FT_PCT_home        99
FG3_PCT_home       99
AST_home           99
REB_home           99
TEAM_ID_away        0
PTS_away           99
FG_PCT_away        99
FT_PCT_away        99
FG3_PCT_away       99
AST_away           99
REB_away           99
HOME_TEAM_WINS      0
dtype: int64

In [12]:
df_games = df_games.dropna()
df_games.isna().sum()

GAME_DATE_EST      0
GAME_ID            0
HOME_TEAM_ID       0
VISITOR_TEAM_ID    0
SEASON             0
TEAM_ID_home       0
PTS_home           0
FG_PCT_home        0
FT_PCT_home        0
FG3_PCT_home       0
AST_home           0
REB_home           0
TEAM_ID_away       0
PTS_away           0
FG_PCT_away        0
FT_PCT_away        0
FG3_PCT_away       0
AST_away           0
REB_away           0
HOME_TEAM_WINS     0
dtype: int64

In [13]:
## Null values means the player didn't play in that game.
df_games_details.isna().sum()

GAME_ID                   0
TEAM_ID                   0
TEAM_ABBREVIATION         0
TEAM_CITY                 0
PLAYER_ID                 0
PLAYER_NAME               0
FGM                  109690
FGA                  109690
FG_PCT               109690
FG3M                 109690
FG3A                 109690
FG3_PCT              109690
FTM                  109690
FTA                  109690
FT_PCT               109690
OREB                 109690
DREB                 109690
REB                  109690
AST                  109690
STL                  109690
BLK                  109690
TO                   109690
PF                   109690
PTS                  109690
PLUS_MINUS           133351
dtype: int64

In [14]:
df_games_details = df_games_details.dropna()
df_games_details.isna().sum()

GAME_ID              0
TEAM_ID              0
TEAM_ABBREVIATION    0
TEAM_CITY            0
PLAYER_ID            0
PLAYER_NAME          0
FGM                  0
FGA                  0
FG_PCT               0
FG3M                 0
FG3A                 0
FG3_PCT              0
FTM                  0
FTA                  0
FT_PCT               0
OREB                 0
DREB                 0
REB                  0
AST                  0
STL                  0
BLK                  0
TO                   0
PF                   0
PTS                  0
PLUS_MINUS           0
dtype: int64

Third step is to assess the data types of each data frame and decide whether some columns' data types need to be changed.

In [15]:
df_games.dtypes

GAME_DATE_EST       object
GAME_ID              int64
HOME_TEAM_ID         int64
VISITOR_TEAM_ID      int64
SEASON               int64
TEAM_ID_home         int64
PTS_home           float64
FG_PCT_home        float64
FT_PCT_home        float64
FG3_PCT_home       float64
AST_home           float64
REB_home           float64
TEAM_ID_away         int64
PTS_away           float64
FG_PCT_away        float64
FT_PCT_away        float64
FG3_PCT_away       float64
AST_away           float64
REB_away           float64
HOME_TEAM_WINS       int64
dtype: object

In [17]:
## Change object data types to a more specific data type.
## In this case only GAME_DATE_EST had an object data type.
df_games['GAME_DATE_EST'] = pd.to_datetime(df_games['GAME_DATE_EST'])
df_games.dtypes

GAME_DATE_EST      datetime64[ns]
GAME_ID                     int64
HOME_TEAM_ID                int64
VISITOR_TEAM_ID             int64
SEASON                      int64
TEAM_ID_home                int64
PTS_home                  float64
FG_PCT_home               float64
FT_PCT_home               float64
FG3_PCT_home              float64
AST_home                  float64
REB_home                  float64
TEAM_ID_away                int64
PTS_away                  float64
FG_PCT_away               float64
FT_PCT_away               float64
FG3_PCT_away              float64
AST_away                  float64
REB_away                  float64
HOME_TEAM_WINS              int64
dtype: object

In [87]:
df_games_details.dtypes

GAME_ID                int64
TEAM_ID                int64
TEAM_ABBREVIATION     object
TEAM_CITY             object
PLAYER_ID              int64
PLAYER_NAME           object
MIN                   object
FGM                  float64
FGA                  float64
FG_PCT               float64
FG3M                 float64
FG3A                 float64
FG3_PCT              float64
FTM                  float64
FTA                  float64
FT_PCT               float64
OREB                 float64
DREB                 float64
REB                  float64
AST                  float64
STL                  float64
BLK                  float64
TO                   float64
PF                   float64
PTS                  float64
PLUS_MINUS           float64
dtype: object

In [None]:
## Change object data types to a more specific data type.
df_games_details['TEAM_ABBREVIATION'] = df_games_details['TEAM_ABBREVIATION'].astype(str)
df_games_details['TEAM_CITY'] = df_games_details['TEAM_CITY'].astype(str)
df_games_details['PLAYER_NAME'] = df_games_details['PLAYER_NAME'].astype(str)
df_games_details['MIN'] = df_games_details['MIN'].astype(str)


In [84]:
df_games_details

Unnamed: 0,GAME_ID,TEAM_ID,TEAM_ABBREVIATION,TEAM_CITY,PLAYER_ID,PLAYER_NAME,MIN,FGM,FGA,FG_PCT,...,OREB,DREB,REB,AST,STL,BLK,TO,PF,PTS,PLUS_MINUS
0,22200477,1610612759,SAS,San Antonio,1629641,Romeo Langford,18:06,1.0,1.0,1.000,...,1.0,1.0,2.0,0.0,1.0,0.0,2.0,5.0,2.0,-2.0
1,22200477,1610612759,SAS,San Antonio,1631110,Jeremy Sochan,31:01,7.0,14.0,0.500,...,6.0,3.0,9.0,6.0,1.0,0.0,2.0,1.0,23.0,-14.0
2,22200477,1610612759,SAS,San Antonio,1627751,Jakob Poeltl,21:42,6.0,9.0,0.667,...,1.0,3.0,4.0,1.0,1.0,0.0,2.0,4.0,13.0,-4.0
3,22200477,1610612759,SAS,San Antonio,1630170,Devin Vassell,30:20,4.0,13.0,0.308,...,0.0,9.0,9.0,5.0,3.0,0.0,2.0,1.0,10.0,-18.0
4,22200477,1610612759,SAS,San Antonio,1630200,Tre Jones,27:44,7.0,12.0,0.583,...,0.0,2.0,2.0,3.0,0.0,0.0,2.0,2.0,19.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
665964,21200001,1610612739,CLE,Cleveland,101139,CJ Miles,17:42,1.0,5.0,0.200,...,0.0,4.0,4.0,1.0,0.0,0.0,3.0,0.0,2.0,2.0
665965,21200001,1610612739,CLE,Cleveland,203092,Tyler Zeller,14:53,2.0,4.0,0.500,...,0.0,2.0,2.0,0.0,1.0,1.0,0.0,2.0,5.0,4.0
665966,21200001,1610612739,CLE,Cleveland,200789,Daniel Gibson,16:11,3.0,5.0,0.600,...,1.0,2.0,3.0,1.0,0.0,1.0,0.0,2.0,10.0,-9.0
665967,21200001,1610612739,CLE,Cleveland,2575,Luke Walton,12:14,1.0,2.0,0.500,...,0.0,1.0,1.0,0.0,0.0,0.0,2.0,0.0,2.0,-11.0
