## NBA Playoffs Home Court Advantage

This notebook examines how home court advantage in the NBA playoffs differs from the regular season.

In [1]:
import numpy as np
import pandas as pd
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999
pd.options.display.float_format = '{:.3f}'.format

In [2]:
import collections
import math

In [3]:
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases

In [4]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import seaborn as sns
sns.set(context='notebook', palette='colorblind')

We will use the [`pracnbastats`](https://pypi.org/project/pracnbastats/) package to scrape [stats.nba.com](http://stats.nba.com/). You can install this package in your sports analytics Python environment by executing `pip install pracnbastats` in Terminal (on Mac or Linux computers) or at the Anaconda Prompt (on Windows computers).

In [5]:
import pracnbastats as nba

In [6]:
from pathlib import Path

This code assumes the existence of a directory to hold scraped NBA data. You can create and name this directory however you want, and adjust the code in the cell below to suit your preferences. If you've previously scraped the data, the `pracnbastats` library can find it and avoid re-scraping. You just need to specify the location of the previously scraped data using the `store` object defined below.

In [7]:
PROJECT_DIR = Path.cwd().parent
DATA_DIR = PROJECT_DIR / 'data'
STATS_DIR = DATA_DIR / 'stats-nba-com'

To scrape data from [stats.nba.com](http://stats.nba.com/), you need to specify a user agent. Below is the user agent I used. You can find your own user agent by searching for "my user agent" in Google.

In [8]:
USER_AGENT = (
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/66.0.3359.139 Safari/537.36'
)

In [9]:
session = nba.scrape.NBASession(user_agent=USER_AGENT)

In [10]:
store = nba.store.FlatFiles.CSV(path=STATS_DIR)

In [11]:
scraper = nba.scrape.NBAScraper(session=session, store=store)

### Playoff Data

As with our [prior analysis of NBA home court advantage during the regular season](http://practicallypredictable.com/2018/01/09/first-look-nba-home-court-advantage/), we will use NBA games from the 1996-97 to 2016-17 seasons. This time, we will look at both regular- and post-season games.

In order to understand home court in the playoffs, we need to look at the games in a particular playoff series as a group. We'll look at this in more detail later in this notebook. For now, we want to be able to group all of the games between two teams in a particular season. The easiest way to do this is to create a "match up ID", which will be equal to the combination of the two team abbreviations and the season. The functions below will load the historical team box scores into a `pandas` `DataFrame` and add the new match up ID column.

In [12]:
def matchup_id(season_year, abbrs):
    """Unique identifier for a matchup between two teams in a particular season"""
    assert len(abbrs) == 2
    abbrs = sorted(list(abbrs))
    abbr1 = abbrs.pop(0)
    abbr2 = abbrs.pop(0)
    return f'{abbr1}_{abbr2}_{season_year}'

In [13]:
def create_matchup_ids(row):
    """Matchup identifier for a DataFrame row"""
    return matchup_id(row['season'], set([row['team_abbr_h'], row['team_abbr_r']]))

In [14]:
def load_season_matchups(
        scraper, *,
        season_type=nba.params.SeasonType.Regular,
        start_season=nba.params.MIN_YEAR,
        end_season=nba.params.Season.current_start_year()):
    """Historical NBA match ups for the given season type"""
    seasons = {
        season: nba.team.BoxScores(
            scraper=scraper,
            season=nba.params.Season(start_year=season),
            season_type=season_type,
        ) for season in tqdm(range(start_season, end_season+1))
    }
    df = pd.concat(seasons[season].matchups for season in seasons)
    df['matchup_id'] = df.apply(create_matchup_ids, axis=1)
    first_cols = ['matchup_id', ]
    cols = first_cols + [col for col in df.columns if col not in first_cols]
    return df[cols].sort_values(by=['date']).reset_index(drop=True)

Let's get the regular season data going back to 1996. We will ignore the 2017-18 season, since we are only going to examine home court advantage using data from complete seasons (regular and playoffs).

In [15]:
nba_reg = load_season_matchups(scraper, end_season=2016)

In [16]:
len(nba_reg)

24798

Our regular season data consist of 24,798 games.

In [17]:
nba_reg.head()

Unnamed: 0,matchup_id,game_id,season,season_type,date,video,team_id_h,team_abbr_h,pts_h,win_loss_h,team_id_r,team_abbr_r,pts_r,win_loss_r,hr_winner,winner,loser,mov
0,MIN_SAS_1996,29600008,1996,reg,1996-11-01,N,1610612750,MIN,82,W,1610612759,SAS,78,L,H,MIN,SAS,4
1,ORL_WAS_1996,29600004,1996,reg,1996-11-01,N,1610612753,ORL,92,L,1610612764,WAS,96,W,R,WAS,ORL,4
2,GSW_LAC_1996,29600013,1996,reg,1996-11-01,N,1610612744,GSW,85,L,1610612746,LAC,97,W,R,LAC,GSW,12
3,BOS_CHI_1996,29600001,1996,reg,1996-11-01,N,1610612738,BOS,98,L,1610612741,CHI,107,W,R,CHI,BOS,9
4,POR_VAN_1996,29600014,1996,reg,1996-11-01,N,1610612763,VAN,85,L,1610612757,POR,114,W,R,POR,VAN,29


Now let's get the playoff data.

In [18]:
nba_post = load_season_matchups(scraper=scraper, season_type=nba.params.SeasonType.Playoffs, end_season=2016)

In [19]:
len(nba_post)

1686

We have 1,686 playoff games in our historical data set.

In [20]:
nba_post.tail(10)

Unnamed: 0,matchup_id,game_id,season,season_type,date,video,team_id_h,team_abbr_h,pts_h,win_loss_h,team_id_r,team_abbr_r,pts_r,win_loss_r,hr_winner,winner,loser,mov
1676,GSW_SAS_2016,41600313,2016,post,2017-05-20,Y,1610612759,SAS,108,L,1610612744,GSW,120,W,R,GSW,SAS,12
1677,BOS_CLE_2016,41600303,2016,post,2017-05-21,Y,1610612739,CLE,108,L,1610612738,BOS,111,W,R,BOS,CLE,3
1678,GSW_SAS_2016,41600314,2016,post,2017-05-22,Y,1610612759,SAS,115,L,1610612744,GSW,129,W,R,GSW,SAS,14
1679,BOS_CLE_2016,41600304,2016,post,2017-05-23,Y,1610612739,CLE,112,W,1610612738,BOS,99,L,H,CLE,BOS,13
1680,BOS_CLE_2016,41600305,2016,post,2017-05-25,Y,1610612738,BOS,102,L,1610612739,CLE,135,W,R,CLE,BOS,33
1681,CLE_GSW_2016,41600401,2016,post,2017-06-01,Y,1610612744,GSW,113,W,1610612739,CLE,91,L,H,GSW,CLE,22
1682,CLE_GSW_2016,41600402,2016,post,2017-06-04,Y,1610612744,GSW,132,W,1610612739,CLE,113,L,H,GSW,CLE,19
1683,CLE_GSW_2016,41600403,2016,post,2017-06-07,Y,1610612739,CLE,113,L,1610612744,GSW,118,W,R,GSW,CLE,5
1684,CLE_GSW_2016,41600404,2016,post,2017-06-09,Y,1610612739,CLE,137,W,1610612744,GSW,116,L,H,CLE,GSW,21
1685,CLE_GSW_2016,41600405,2016,post,2017-06-12,Y,1610612744,GSW,129,W,1610612739,CLE,120,L,H,GSW,CLE,9


### Regular- and Post-Season Home Win Percentages

Let's look at the average home win percentages over this 21-season period.

In [21]:
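def home_win_percentage(df):
    """Fraction of games won by the home team"""
    games = len(df)
    if games > 0:
        # Count home wins directly, so a sample with no home wins
        # returns 0.0 instead of raising a KeyError
        return float((df['hr_winner'] == 'H').sum() / games)
    else:
        return np.nan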

In [22]:
home_win_percentage(nba_reg)

0.5980320993628518

In [23]:
home_win_percentage(nba_post)

0.6453143534994069

We see that, during the 1996-97 through 2016-17 NBA seasons, the home team has won about 64.5% of the time during the playoffs, compared to 59.8% during the regular season.

This is identical to the home win percentage observed by the analysts at FiveThirtyEight in [this article](https://fivethirtyeight.com/features/a-home-playoff-game-is-a-big-advantage-unless-you-play-hockey/). It seems clear that the home court advantage is even stronger during the playoffs than it is during the regular season.

#### Differences Between the Regular- and the Post-Season Schedules

The story is a little more complicated than this, however. Let's take a moment to think about how the NBA playoff schedule might impact these results.

The regular season NBA schedule is relatively balanced by design. Each team plays an equal number of home and road games. The league also tries to make sure teams play comparably difficult schedules in terms of travel distance and rest between games. You can read a good overview of the NBA scheduling process [here](https://www.nbastuffer.com/analytics101/how-the-nba-schedule-is-made/).

In the post-season, the scheduling is intentionally designed to be unbalanced. The team with the better regular season record gets home court advantage in each playoff series. (See [here](https://www.quora.com/How-is-the-home-court-advantage-determined-in-the-NBA-playoffs) for the list of the home court tie-break rules in case the two playoff teams have the same regular-season record.)

The playoff schedule is designed to give the better teams (as demonstrated in the regular season) the right to play a decisive game 7 at home (or game 5 in the opening rounds prior to 2003; see [here](https://en.wikipedia.org/wiki/NBA_playoffs#Format) for a history of the NBA playoff format). To see the importance of playoff home court advantage, [read what Steve Kerr thinks about it here](http://www.nba.com/article/2018/01/22/golden-state-warriors-steve-kerr-home-court-advantage-probably-top-goal). The playoff home court rules give championship-contender teams a strong incentive to play well and win games during the regular season even after they have clinched a playoff berth.

So, why does this matter?

Remember: home court advantage is supposed to measure the probability that an _average_ team wins at home. On a neutral court, you would expect an exactly _average_ team to win 50% of its games. When we computed the 59.8% historical regular season home court win probability, we implicitly assumed that the NBA regular season schedule is balanced. If the worst teams always played at home and the best teams were always on the road, the league-average home court win percentage would be much lower than 59.8%. This is because the schedule would be very unbalanced.

The playoff home court rules mean that the better team is more likely to play games at home than on the road. Or at least this is true if you believe that the team that happens to get home court advantage is actually the better team. That's a judgment on whether regular season record and the NBA tie-breaking procedure are a reasonable measure of team quality.
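
To make the effect of an unbalanced schedule concrete, here is a minimal sketch using the standard Elo win probability formula. The 100-point rating gap and the 70-point home court adjustment are made-up values chosen only for illustration; they are not estimates from the data in this notebook.

```python
def elo_win_prob(rating_diff):
    """Elo win probability for a team that is rating_diff points stronger."""
    return 1 / (1 + 10 ** (-rating_diff / 400))

# Hypothetical values, chosen only for illustration
RATING_GAP = 100  # how much stronger the better team is, in Elo points
HCA = 70          # home court adjustment, in Elo points

# Probability that the home team wins, depending on who hosts
p_better_hosts = elo_win_prob(RATING_GAP + HCA)
p_worse_hosts = elo_win_prob(-RATING_GAP + HCA)

# Balanced schedule: each team hosts the same number of games
balanced = (p_better_hosts + p_worse_hosts) / 2

# Unbalanced schedule: the better team hosts 4 games of a 7 game series
unbalanced = (4 * p_better_hosts + 3 * p_worse_hosts) / 7

print(f'balanced:   {balanced:.3f}')
print(f'unbalanced: {unbalanced:.3f}')
```

The unbalanced average comes out higher than the balanced one, even though the home court adjustment is identical in both cases. The difference comes entirely from who is hosting.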

#### Creating Balanced Playoff Match Ups

Since it is reasonable to assume that the better team is playing more games at home during the playoffs, the home win percentage is not independent of team quality. We need to figure out a way to try to compensate for this to get a better estimate of NBA playoff home court advantage.

In this notebook, we are going to try to do this by forcing our playoff match ups to be balanced. Specifically, we will _exclude games_ so that each match up has an equal number of home and away games for the team with series home court advantage.

It's always a big step to intentionally exclude data from an analysis. You should never do it lightly, and you should always disclose what you are doing. In this case, we've already looked at the full data set, so you can see the impact of excluding certain games.

We'll walk through the process of excluding games to create balanced playoff series, step-by-step.

First, we want to have a way to get all of the games for a given match up ID.

In [24]:
def extract_matchups(df, *, matchup_id):
    """Matchups between two teams in a particular season"""
    return df[df['matchup_id'] == matchup_id].copy()

Here are the games from the fantastic 1996-97 season Bulls-Jazz Finals.

In [25]:
extract_matchups(nba_post, matchup_id=matchup_id(1996, ['CHI', 'UTA']))

Unnamed: 0,matchup_id,game_id,season,season_type,date,video,team_id_h,team_abbr_h,pts_h,win_loss_h,team_id_r,team_abbr_r,pts_r,win_loss_r,hr_winner,winner,loser,mov
66,CHI_UTA_1996,49600083,1996,post,1997-06-01,N,1610612741,CHI,84,W,1610612762,UTA,82,L,H,CHI,UTA,2
67,CHI_UTA_1996,49600084,1996,post,1997-06-04,N,1610612741,CHI,97,W,1610612762,UTA,85,L,H,CHI,UTA,12
68,CHI_UTA_1996,49600085,1996,post,1997-06-06,N,1610612762,UTA,104,W,1610612741,CHI,93,L,H,UTA,CHI,11
69,CHI_UTA_1996,49600086,1996,post,1997-06-08,N,1610612762,UTA,78,W,1610612741,CHI,73,L,H,UTA,CHI,5
70,CHI_UTA_1996,49600087,1996,post,1997-06-11,N,1610612762,UTA,88,L,1610612741,CHI,90,W,R,CHI,UTA,2
71,CHI_UTA_1996,49600088,1996,post,1997-06-13,N,1610612741,CHI,90,W,1610612762,UTA,86,L,H,CHI,UTA,4


Notice that this series has 6 games and is already balanced. There are 3 Chicago home games, and 3 Utah home games.
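
A quick way to confirm the home and road split is to count home games per team. The sketch below rebuilds just the home team column of this series by hand (copied from the table above), so it runs without the scraped data.

```python
import pandas as pd

# Home team for each game of the 1996-97 Finals, in date order
series = pd.DataFrame({
    'team_abbr_h': ['CHI', 'CHI', 'UTA', 'UTA', 'UTA', 'CHI'],
})

# Number of home games for each team
home_counts = series['team_abbr_h'].value_counts()

# With two teams hosting, the series is balanced when both counts are equal
is_balanced = home_counts.nunique() == 1

print(home_counts.to_dict(), is_balanced)
```

With the real data, the same two lines could be applied to the result of `extract_matchups`.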

Here are the games from last season's NBA Finals.

In [26]:
extract_matchups(nba_post, matchup_id=matchup_id(2016, ['GSW', 'CLE']))

Unnamed: 0,matchup_id,game_id,season,season_type,date,video,team_id_h,team_abbr_h,pts_h,win_loss_h,team_id_r,team_abbr_r,pts_r,win_loss_r,hr_winner,winner,loser,mov
1681,CLE_GSW_2016,41600401,2016,post,2017-06-01,Y,1610612744,GSW,113,W,1610612739,CLE,91,L,H,GSW,CLE,22
1682,CLE_GSW_2016,41600402,2016,post,2017-06-04,Y,1610612744,GSW,132,W,1610612739,CLE,113,L,H,GSW,CLE,19
1683,CLE_GSW_2016,41600403,2016,post,2017-06-07,Y,1610612739,CLE,113,L,1610612744,GSW,118,W,R,GSW,CLE,5
1684,CLE_GSW_2016,41600404,2016,post,2017-06-09,Y,1610612739,CLE,137,W,1610612744,GSW,116,L,H,CLE,GSW,21
1685,CLE_GSW_2016,41600405,2016,post,2017-06-12,Y,1610612744,GSW,129,W,1610612739,CLE,120,L,H,GSW,CLE,9


This series has 5 games, so it is unbalanced. Golden State played 3 games at home. To balance this series, we will need to exclude one of the Golden State home games.

And here are the games from the 2015-16 NBA Finals where LeBron brought it home for Cleveland.

In [27]:
extract_matchups(nba_post, matchup_id=matchup_id(2015, ['GSW', 'CLE']))

Unnamed: 0,matchup_id,game_id,season,season_type,date,video,team_id_h,team_abbr_h,pts_h,win_loss_h,team_id_r,team_abbr_r,pts_r,win_loss_r,hr_winner,winner,loser,mov
1600,CLE_GSW_2015,41500401,2015,post,2016-06-02,Y,1610612744,GSW,104,W,1610612739,CLE,89,L,H,GSW,CLE,15
1601,CLE_GSW_2015,41500402,2015,post,2016-06-05,Y,1610612744,GSW,110,W,1610612739,CLE,77,L,H,GSW,CLE,33
1602,CLE_GSW_2015,41500403,2015,post,2016-06-08,Y,1610612739,CLE,120,W,1610612744,GSW,90,L,H,CLE,GSW,30
1603,CLE_GSW_2015,41500404,2015,post,2016-06-10,Y,1610612739,CLE,97,L,1610612744,GSW,108,W,R,GSW,CLE,11
1604,CLE_GSW_2015,41500405,2015,post,2016-06-13,Y,1610612744,GSW,97,L,1610612739,CLE,112,W,R,CLE,GSW,15
1605,CLE_GSW_2015,41500406,2015,post,2016-06-16,Y,1610612739,CLE,115,W,1610612744,GSW,101,L,H,CLE,GSW,14
1606,CLE_GSW_2015,41500407,2015,post,2016-06-19,Y,1610612744,GSW,89,L,1610612739,CLE,93,W,R,CLE,GSW,4


This series had 7 games, so it is also unbalanced. Again, we will need to exclude one of the Golden State home games.

#### Balancing Playoff Series

Now we can develop a function to filter these series, to exclude an extra home game if necessary. We are going to use the above series to make sure our filtering function works correctly.

The function is not that complicated. The function will work on a particular playoff match up, which we will specify by passing a match up ID. We will get all of the games in that playoff series, sorted in order of game date. The team with series home court advantage always plays the first game at home.

All we need to do is the following:

1. Separate the games into two lists: one for the games played on the home court of the team with series home court advantage, and one for the other team.
2. Get the lengths of both of these lists of games.
3. If both lists are the same length, then the series is already balanced and we are all set. Just return the series games as is.
4. Otherwise, find out which list has an extra game, and chop off one of the games from that list.
5. Put the series back together in game order, omitting the excluded game, and return this modified list of games.

Notice that because of the NBA playoff schedule rules, it is always true that chopping off one game will balance the series. In any case, we check this in the function below to make sure that the number of home and away games is balanced after we chop off a game, just to be safe.
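
As a sanity check on that claim, the sketch below enumerates the home and road patterns that, as far as I know, the NBA used over this period (2-2-1-1-1 and 2-3-2 for best-of-seven series, 2-2-1 for best-of-five). It confirms that, wherever a series ends, the two teams' home game counts differ by at most one. The patterns are written down from memory rather than derived from the scraped data, so treat them as assumptions.

```python
# 'H' = game hosted by the team with series home court advantage
# 'A' = game hosted by the other team
# Each entry is (pattern, wins needed). The patterns are assumptions.
FORMATS = {
    'best-of-7 (2-2-1-1-1)': ('HHAAHAH', 4),
    'best-of-7 (2-3-2)': ('HHAAAHH', 4),
    'best-of-5 (2-2-1)': ('HHAAH', 3),
}

def home_game_gap(pattern, games_played):
    """Home games for the HCA team minus home games for the other team."""
    played = pattern[:games_played]
    return played.count('H') - played.count('A')

# A series can end after any number of games from wins needed up to the maximum
gaps = {
    (name, n): home_game_gap(pattern, n)
    for name, (pattern, wins_needed) in FORMATS.items()
    for n in range(wins_needed, len(pattern) + 1)
}

for key, gap in gaps.items():
    print(key, gap)
```

A gap of -1 (a five game series in the 2-3-2 pattern) is the case where the team without series home court advantage ends up with the extra home game, which is why the function below handles both possibilities.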

If the series is unbalanced, there are two ways to remove a game from the team having an extra home game: we can exclude that team's first home game, or its last. The function below defaults to removing the first home game (`last_game=True`, meaning the last home game is kept), but passing `last_game=False` removes the last home game instead. In the analysis below, we will try it both ways to see if it makes a difference.

Of course, in a 7 game series, we could also remove the middle game. The function below ignores this possibility for two reasons. First, it's hard to believe that the middle home game in a 7 game playoff series would have a meaningfully different home court advantage compared to the first or last home games. Second, it would be cumbersome to decide what to do as a fallback for series that didn't get to 7 games.

In [28]:
def filter_matchup(df, *, matchup_id, last_game=True):
    """Subset of matchup having an equal number of home games for each team.

    If last_game is True (the default), the extra home game that is dropped
    is the first one, so the last one is kept. If False, the last is dropped.
    """
    # Sort before resetting the index, so index labels match date order
    df = (
        extract_matchups(df, matchup_id=matchup_id)
        .sort_values(by=['date'])
        .reset_index(drop=True)
    )
    # Games are sorted by date, so first home team has series home court advantage
    hca_team = df.iloc[0]['team_abbr_h']
    other = df.iloc[0]['team_abbr_r']
    home_courts = df['team_abbr_h'].tolist()
    # Row positions of each team's home games, in date order
    home_games = {
        team: [
            index for index, value in enumerate(home_courts)
            if value == team
        ] for team in [hca_team, other]
    }
    hca_at_home = len(home_games[hca_team])
    other_at_home = len(home_games[other])
    pairs = min(hca_at_home, other_at_home)
    if hca_at_home == other_at_home:
        # We are done, the series is already balanced
        return df
    elif pairs == hca_at_home:
        # Need to chop one game off the non-HCA team's home games
        # If last_game, keep the last game and remove the first game other team played at home
        # Otherwise, remove the last game other team played at home
        element = 0 if last_game else -1
        home_games[other].pop(element)
    else:
        # Need to chop one game off the HCA team's home games
        # Same logic applies as for the non-HCA team
        element = 0 if last_game else -1
        home_games[hca_team].pop(element)
    assert len(home_games[hca_team]) == len(home_games[other])
    games = sorted(home_games[hca_team] + home_games[other])
    # The index was reset after sorting, so row positions are also index labels
    return df.loc[games]

Let's test out this function on the 2015-16 season NBA Finals.

In [29]:
filter_matchup(nba_post, matchup_id=matchup_id(2015, ['GSW', 'CLE']))

Unnamed: 0,matchup_id,game_id,season,season_type,date,video,team_id_h,team_abbr_h,pts_h,win_loss_h,team_id_r,team_abbr_r,pts_r,win_loss_r,hr_winner,winner,loser,mov
1,CLE_GSW_2015,41500402,2015,post,2016-06-05,Y,1610612744,GSW,110,W,1610612739,CLE,77,L,H,GSW,CLE,33
2,CLE_GSW_2015,41500403,2015,post,2016-06-08,Y,1610612739,CLE,120,W,1610612744,GSW,90,L,H,CLE,GSW,30
3,CLE_GSW_2015,41500404,2015,post,2016-06-10,Y,1610612739,CLE,97,L,1610612744,GSW,108,W,R,GSW,CLE,11
4,CLE_GSW_2015,41500405,2015,post,2016-06-13,Y,1610612744,GSW,97,L,1610612739,CLE,112,W,R,CLE,GSW,15
5,CLE_GSW_2015,41500406,2015,post,2016-06-16,Y,1610612739,CLE,115,W,1610612744,GSW,101,L,H,CLE,GSW,14
6,CLE_GSW_2015,41500407,2015,post,2016-06-19,Y,1610612744,GSW,89,L,1610612739,CLE,93,W,R,CLE,GSW,4


This used the default value to exclude the first Warriors home game. We can override this to exclude the last Warriors home game instead.

In [30]:
filter_matchup(nba_post, matchup_id=matchup_id(2015, ['GSW', 'CLE']), last_game=False)

Unnamed: 0,matchup_id,game_id,season,season_type,date,video,team_id_h,team_abbr_h,pts_h,win_loss_h,team_id_r,team_abbr_r,pts_r,win_loss_r,hr_winner,winner,loser,mov
0,CLE_GSW_2015,41500401,2015,post,2016-06-02,Y,1610612744,GSW,104,W,1610612739,CLE,89,L,H,GSW,CLE,15
1,CLE_GSW_2015,41500402,2015,post,2016-06-05,Y,1610612744,GSW,110,W,1610612739,CLE,77,L,H,GSW,CLE,33
2,CLE_GSW_2015,41500403,2015,post,2016-06-08,Y,1610612739,CLE,120,W,1610612744,GSW,90,L,H,CLE,GSW,30
3,CLE_GSW_2015,41500404,2015,post,2016-06-10,Y,1610612739,CLE,97,L,1610612744,GSW,108,W,R,GSW,CLE,11
4,CLE_GSW_2015,41500405,2015,post,2016-06-13,Y,1610612744,GSW,97,L,1610612739,CLE,112,W,R,CLE,GSW,15
5,CLE_GSW_2015,41500406,2015,post,2016-06-16,Y,1610612739,CLE,115,W,1610612744,GSW,101,L,H,CLE,GSW,14


Now let's test the filter on last season's Finals.

In [31]:
filter_matchup(nba_post, matchup_id=matchup_id(2016, ['GSW', 'CLE']))

Unnamed: 0,matchup_id,game_id,season,season_type,date,video,team_id_h,team_abbr_h,pts_h,win_loss_h,team_id_r,team_abbr_r,pts_r,win_loss_r,hr_winner,winner,loser,mov
1,CLE_GSW_2016,41600402,2016,post,2017-06-04,Y,1610612744,GSW,132,W,1610612739,CLE,113,L,H,GSW,CLE,19
2,CLE_GSW_2016,41600403,2016,post,2017-06-07,Y,1610612739,CLE,113,L,1610612744,GSW,118,W,R,GSW,CLE,5
3,CLE_GSW_2016,41600404,2016,post,2017-06-09,Y,1610612739,CLE,137,W,1610612744,GSW,116,L,H,CLE,GSW,21
4,CLE_GSW_2016,41600405,2016,post,2017-06-12,Y,1610612744,GSW,129,W,1610612739,CLE,120,L,H,GSW,CLE,9


Lastly, here is the balanced 1996-97 Finals series.

In [32]:
filter_matchup(nba_post, matchup_id=matchup_id(1996, ['CHI', 'UTA']))

Unnamed: 0,matchup_id,game_id,season,season_type,date,video,team_id_h,team_abbr_h,pts_h,win_loss_h,team_id_r,team_abbr_r,pts_r,win_loss_r,hr_winner,winner,loser,mov
0,CHI_UTA_1996,49600083,1996,post,1997-06-01,N,1610612741,CHI,84,W,1610612762,UTA,82,L,H,CHI,UTA,2
1,CHI_UTA_1996,49600084,1996,post,1997-06-04,N,1610612741,CHI,97,W,1610612762,UTA,85,L,H,CHI,UTA,12
2,CHI_UTA_1996,49600085,1996,post,1997-06-06,N,1610612762,UTA,104,W,1610612741,CHI,93,L,H,UTA,CHI,11
3,CHI_UTA_1996,49600086,1996,post,1997-06-08,N,1610612762,UTA,78,W,1610612741,CHI,73,L,H,UTA,CHI,5
4,CHI_UTA_1996,49600087,1996,post,1997-06-11,N,1610612762,UTA,88,L,1610612741,CHI,90,W,R,CHI,UTA,2
5,CHI_UTA_1996,49600088,1996,post,1997-06-13,N,1610612741,CHI,90,W,1610612762,UTA,86,L,H,CHI,UTA,4


### Balanced Playoff Series Home Court Advantage

Now that we have the balancing function working, we can run it on the full historical playoff data set.

In [33]:
filtered = pd.concat(
    filter_matchup(nba_post, matchup_id=matchup_id)
    for matchup_id in nba_post['matchup_id'].unique()
)
len(filtered)

1530

We are left with 1,530 games, compared to the original 1,686 games in the unbalanced playoff data set. We can build a before/after comparison to see which particular series were pruned.

In [34]:
summary = pd.DataFrame(dict(
    original=nba_post.groupby(['matchup_id']).size(),
    filtered=filtered.groupby(['matchup_id']).size()
)).reset_index()
summary.head()

Unnamed: 0,matchup_id,filtered,original
0,ATL_BKN_2014,6,6
1,ATL_BOS_2007,6,7
2,ATL_BOS_2011,6,6
3,ATL_BOS_2015,6,6
4,ATL_CHH_1997,4,4
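
To list only the series that lost a game, we can filter the comparison for rows where the two counts differ. This sketch uses a hand-built copy of the five rows shown above so that it runs on its own; with the real data you would apply the same filter to `summary`.

```python
import pandas as pd

# Hand-copied from the first rows of the summary table above
summary_sample = pd.DataFrame({
    'matchup_id': ['ATL_BKN_2014', 'ATL_BOS_2007', 'ATL_BOS_2011',
                   'ATL_BOS_2015', 'ATL_CHH_1997'],
    'original': [6, 7, 6, 6, 4],
    'filtered': [6, 6, 6, 6, 4],
})

# Keep only the series where balancing removed a game
changed = summary_sample['original'] != summary_sample['filtered']
pruned = summary_sample[changed]

print(pruned)
```

Since each unbalanced series loses exactly one game, the full data set should contain 1,686 - 1,530 = 156 pruned series.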


Now that we are comfortable that the series balancing is working correctly, let's look at the home win percentage for the balanced historical playoff games. First, let's run it on the data with the first extra home game removed (the default for the filtering function).

In [35]:
home_win_percentage(filtered)

0.6333333333333333

Now, let's run the analysis again removing the last extra home game.

In [36]:
reverse_filtered = pd.concat(
    filter_matchup(nba_post, matchup_id=matchup_id, last_game=False)
    for matchup_id in nba_post['matchup_id'].unique()
)
len(reverse_filtered)

1530

Of course, the number of filtered games we end up with has to be the same whether we remove the first or the last extra home game. Let's see what the home win percentage looks like.

In [37]:
home_win_percentage(reverse_filtered)

0.6281045751633987

#### Results

Balancing the playoff games has a measurable impact on the estimated playoff home court advantage. Recall that on the full data set, the estimated home court win percentage was 64.5%. With balancing, the estimated home court win percentage drops to 63.3% (removing first extra game) or 62.8% (removing last extra game).

I don't have a strong view as to which balancing is more theoretically correct, so let's average both methods.

In [38]:
home_win_pct_post = (
    (home_win_percentage(filtered) +
     home_win_percentage(reverse_filtered)) / 2
)
home_win_pct_post

0.630718954248366

We will use this blended estimate of approximately 63.1% in our NBA playoff analysis. Let's see what this is worth in Elo rating points.

In [39]:
def hca_calibrate(home_win_pct):
    """Calibrate Elo home court adjustment to a given historical home team win percentage."""
    if home_win_pct <= 0 or home_win_pct >= 1:
        raise ValueError('invalid home win probability', home_win_pct)
    a = home_win_pct / (1 - home_win_pct)
    hca = 400 * math.log10(a)
    return hca
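
The calculation in `hca_calibrate` is the inverse of the standard Elo expected score curve, p = 1 / (1 + 10^(-hca/400)), applied to two evenly matched teams. As a sketch, the round trip below converts a few win percentages to Elo points and back. The helper `elo_home_win_prob` is a hypothetical name introduced here for illustration, and the input percentages are arbitrary.

```python
import math

def hca_calibrate(home_win_pct):
    """Same calculation as above: home win percentage to Elo points."""
    return 400 * math.log10(home_win_pct / (1 - home_win_pct))

def elo_home_win_prob(hca):
    """Hypothetical helper: Elo points to home win probability
    for two otherwise evenly matched teams."""
    return 1 / (1 + 10 ** (-hca / 400))

# Arbitrary example percentages
for pct in (0.5, 0.6, 0.65):
    hca = hca_calibrate(pct)
    recovered = elo_home_win_prob(hca)
    print(f'{pct:.3f} -> {hca:6.1f} Elo points -> {recovered:.3f}')
```

A 50% home win percentage maps to zero Elo points, as you would expect if home court made no difference.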

First, let's look at what the unbalanced home win percentage is worth in terms of Elo rating points.

In [40]:
hca_calibrate(home_win_percentage(nba_post))

103.97108454950012

Now let's look at Elo rating points consistent with the balanced home win percentage.

In [41]:
hca_calibrate(home_win_pct_post)

92.9915462097416

I think an adjustment of 93 Elo rating points is superior to the 104 points estimated from the full, unbalanced data set. Home court advantage is worth more in the NBA playoffs than in the regular season, but not quite as much as it initially appears.
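
For comparison, the same conversion can be applied to the regular season home win percentage computed earlier. The sketch below hard-codes the percentages printed in this notebook so that it runs without the scraped data. The regular season figure in Elo points is an extra calculation that does not appear elsewhere in this notebook.

```python
import math

def hca_calibrate(home_win_pct):
    """Same calculation as above: home win percentage to Elo points."""
    return 400 * math.log10(home_win_pct / (1 - home_win_pct))

# Values printed earlier in this notebook
REG_HOME_WIN_PCT = 0.5980320993628518   # home_win_percentage(nba_reg)
POST_HOME_WIN_PCT = 0.630718954248366   # blended, balanced playoff estimate

hca_reg = hca_calibrate(REG_HOME_WIN_PCT)
hca_post = hca_calibrate(POST_HOME_WIN_PCT)

print(f'regular season: {hca_reg:.1f}')
print(f'playoffs:       {hca_post:.1f}')
print(f'difference:     {hca_post - hca_reg:.1f}')
```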