<center>
    <h1>NBA Playoff Predictions</h1>
    <h3>Neel Shah</h3>
</center>


**Table of contents**<a id='toc0_'></a>    
1. [Introduction](#toc1_)    
1.1. [Background Information](#toc1_1_)    
1.2. [Objective](#toc1_2_)    
1.3. [Configuration](#toc1_3_)    
2. [Data Collection](#toc2_)    
2.1. [Data Sources](#toc2_1_)    
2.2. [Scraping Data](#toc2_2_)    
3. [Data Processing](#toc3_)    
3.1. [Cleaning Data](#toc3_1_)    
3.1.1. [Renaming a Single Column](#toc3_1_1_)    
3.1.2. [Renaming All Columns](#toc3_1_2_)    
3.1.3. [Cleaning a `CSV` File](#toc3_1_3_)    
3.2. [Merging Data](#toc3_2_)    
4. [Exploratory Data Analysis & Visualization](#toc4_)    
5. [Modeling: Analysis, Hypothesis Testing, & Machine Learning](#toc5_)    
6. [Interpretation: Insight & Policy Decision](#toc6_)    

<!-- vscode-jupyter-toc-config
	numbering=true
	anchor=true
	flat=true
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# 1. <a id='toc1_'></a>[Introduction](#toc0_)


Let's start by providing some background information, defining the objectives of this project, and setting up our environment.


## 1.1. <a id='toc1_1_'></a>[Background Information](#toc0_)


The [National Basketball Association (NBA)][NBA] is a professional basketball league in the United States. There are [$30$ teams][NBA teams] in the league, divided evenly into $2$ conferences: the Eastern Conference and the Western Conference.

In the regular season, each team plays $82$ games. [NBA regular season standings][NBA standings] are determined by teams' win-loss records within their conferences.

The top $8$ teams from each conference advance to the playoffs. In the event of a tie in the standings, there is a [tie-breaking procedure][NBA tie-breaking procedure] used to determine playoff seeding.

Starting in the $2019\text{-}20$ season, the NBA [added a play-in tournament][Bleacher Report play-in tournament] to give the $9^\text{th}$ and $10^\text{th}$ place teams in each conference the opportunity to earn a spot in the playoffs. It works as follows:

- The $7^\text{th}$ and $8^\text{th}$ place teams play a game to determine the $7^\text{th}$ seed. The winner advances to the playoffs.
- The $9^\text{th}$ and $10^\text{th}$ place teams play an elimination game. The loser is eliminated.
- The loser of the $7/8$ game and the winner of the $9/10$ game play an elimination game to determine the $8^\text{th}$ seed. The winner advances to the playoffs; the loser is eliminated.
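
The three play-in games above can be sketched as a small function. This is only an illustration: `resolve_play_in` and the `pick_winner` callback are hypothetical names that are not used elsewhere in this project, and the team labels are placeholders.

```python
def resolve_play_in(seventh, eighth, ninth, tenth, pick_winner):
    """Return the (7th seed, 8th seed) produced by the play-in tournament.

    `pick_winner(a, b)` is a hypothetical callback that returns the winner
    of a single game between teams `a` and `b`.
    """
    # Game 1: 7th vs. 8th place; the winner takes the 7th seed.
    seventh_seed = pick_winner(seventh, eighth)
    loser_7_8 = eighth if seventh_seed == seventh else seventh

    # Game 2: 9th vs. 10th place; the loser is eliminated.
    winner_9_10 = pick_winner(ninth, tenth)

    # Game 3: loser of 7/8 vs. winner of 9/10; the winner takes the 8th seed.
    eighth_seed = pick_winner(loser_7_8, winner_9_10)
    return seventh_seed, eighth_seed


# Placeholder scenario: the higher-placed team wins every game.
chalk = resolve_play_in("7th", "8th", "9th", "10th", pick_winner=lambda a, b: a)

# Placeholder scenario: the lower-placed team wins every game.
upsets = resolve_play_in("7th", "8th", "9th", "10th", pick_winner=lambda a, b: b)

print(chalk, upsets)
```

Note that the $7^\text{th}$ and $8^\text{th}$ place teams get two chances to qualify, while the $9^\text{th}$ and $10^\text{th}$ place teams must win twice.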

Once the final playoff seeding is determined, each team plays an opponent in a best-of-$7$ series. The first team to win $4$ games advances to the next round. The first round is followed by the conference semifinals, then the conference finals, then the NBA Finals. The team that wins the NBA Finals is the NBA Champion.

The matchups for each round are determined using a traditional bracket structure, shown below:

![NBA Playoff Bracket](nba-playoff-bracket.png "NBA Playoff Bracket")

[NBA]: https://www.nba.com/
[NBA teams]: https://www.nba.com/teams
[NBA standings]: https://www.nba.com/standings
[NBA tie-breaking procedure]: https://ak-static-int.nba.com/wp-content/uploads/sites/2/2017/06/NBA_Tiebreaker_Procedures.pdf
[Bleacher Report play-in tournament]: https://bleacherreport.com/articles/10031906-adam-silver-envisions-nba-play-in-tournament-becoming-a-fixture-in-this-league


## 1.2. <a id='toc1_2_'></a>[Objective](#toc0_)


We want to perform some analysis to see if we can identify factors underlying teams' level of success in the playoffs. Our ultimate goal will be to predict the outcome of the NBA playoffs using data from the regular season.

Can we accurately predict how many playoff games a team will win?

With this information, we could determine if a team is likely to:

- Make the conference semifinals (i.e. win at least $4$ playoff games)
- Make the conference finals (i.e. win at least $8$ playoff games)
- Make the finals (i.e. win at least $12$ playoff games)
- Win the championship (i.e. win $16$ playoff games)

These are some of the questions we want to answer as we go through the full data science pipeline.
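
Because every series is best-of-$7$, the thresholds above translate directly into a lookup from playoff wins to the furthest round reached. The sketch below assumes a hypothetical helper named `furthest_round`; it is not part of the pipeline that follows, and the round labels are our own.

```python
def furthest_round(playoff_wins):
    """Map a team's total playoff wins to the furthest round it reached.

    Each series requires 4 wins, so every 4 wins means one more round cleared.
    """
    if not 0 <= playoff_wins <= 16:
        raise ValueError("playoff_wins must be between 0 and 16")
    if playoff_wins == 16:
        return "Champion"
    if playoff_wins >= 12:
        return "Finals"
    if playoff_wins >= 8:
        return "Conference Finals"
    if playoff_wins >= 4:
        return "Conference Semifinals"
    return "First Round"


for wins in (3, 4, 10, 12, 16):
    print(wins, furthest_round(wins))
```

This mapping only applies to teams that made the playoffs; a team that missed the playoffs has no playoff wins recorded at all.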


## 1.3. <a id='toc1_3_'></a>[Configuration](#toc0_)


We'll start by importing the [Python][Python] libraries necessary for this project, suppressing a Beautiful Soup warning, and defining the `HTTP` request headers we will use when scraping.

[Python]: https://www.python.org/


In [141]:
import warnings
import time
from pathlib import Path
import itertools
import requests
from bs4 import BeautifulSoup, Comment, MarkupResemblesLocatorWarning
import pandas as pd

warnings.filterwarnings("ignore", category=MarkupResemblesLocatorWarning, module="bs4")

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
}


# 2. <a id='toc2_'></a>[Data Collection](#toc0_)


Now, we need to collect data that we can use in our analysis.


## 2.1. <a id='toc2_1_'></a>[Data Sources](#toc0_)


We will scrape data from [Basketball Reference][Basketball Reference], a site that provides historical basketball data.

We will use data from the $2002\text{-}03$ season ([when the NBA switched to the current playoff format, where every series is best-of-$7$][NBA playoffs history]) to the $2022\text{-}23$ season (the current season). For each season, we will scrape the following information:

- Per Game Stats from Season Summary page
- Advanced Stats from Season Summary page
- Expanded Standings from Standings page
- Advanced Stats from Playoffs Summary page

For convenience, we will define a function called `pages_to_scrape`. Given a season, it returns a dictionary that maps the URL of each page we will scrape to a list describing the tables we want from that page.

Each element in the list will be in the form of a dictionary with $2$ items:

- The `id` of the `HTML` `table` element we will be scraping from the page
- The `path` where we will be storing the table as a `CSV`

[Basketball Reference]: https://www.basketball-reference.com/
[NBA playoffs history]: https://www.nba.com/magic/history-no-1-vs-no-8-nba-playoffs-20200810


In [142]:
def pages_to_scrape(season):
    return {
        f"https://www.basketball-reference.com/leagues/NBA_{season}.html": [
            {
                "id": "per_game-team",
                "path": f"data/raw/{season}/regular_season/per_game_stats.csv",
            },
            {
                "id": "advanced-team",
                "path": f"data/raw/{season}/regular_season/advanced_stats.csv",
            },
        ],
        f"https://www.basketball-reference.com/leagues/NBA_{season}_standings.html": [
            {
                "id": "expanded_standings",
                "path": f"data/raw/{season}/regular_season/standings.csv",
            }
        ],
        f"https://www.basketball-reference.com/playoffs/NBA_{season}.html": [
            {
                "id": "advanced-team",
                "path": f"data/raw/{season}/playoffs/advanced_stats.csv",
            }
        ],
    }


## 2.2. <a id='toc2_2_'></a>[Scraping Data](#toc0_)


To scrape the data from each page, we will do the following:

- Perform an `HTTP` `GET` request to the appropriate URL using the [Requests][Requests] library.
- Parse the webpage using [Beautiful Soup][Beautiful Soup].
- Find the `HTML` `table` with the appropriate `id`.
- Read the parsed `HTML` `table` into a `DataFrame` using [pandas][pandas].
- Ensure that the appropriate `path` exists in the filesystem using [pathlib][pathlib].
- Write the `DataFrame` to a `CSV` file at the appropriate `path` using [pandas][pandas].

In our approach, there are a few issues that we have to address as well:

- To save time, if the `CSV` files for a page already exist, we will not re-scrape the page.
- It appears that some of the `table` elements are hidden inside of `HTML` comments, so we have to look there if a `table` can't be found normally.
- To avoid hitting rate limits (from making too many requests in a given time period), we have to add a $10$-second `sleep` between `HTTP` `GET` requests using the [time][time] library.

[Requests]: https://requests.readthedocs.io/en/latest/
[Beautiful Soup]: https://beautiful-soup-4.readthedocs.io/en/latest/
[pandas]: https://pandas.pydata.org/
[pathlib]: https://docs.python.org/3/library/pathlib.html
[time]: https://docs.python.org/3/library/time.html
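
To illustrate the hidden-table issue in isolation, here is a minimal sketch that runs against a made-up `HTML` snippet rather than a real page. The helper name `find_table` and the sample markup are assumptions for illustration only; the technique (falling back to a search inside `HTML` comments) is the one described above.

```python
from bs4 import BeautifulSoup, Comment

# Made-up markup: the table only exists inside an HTML comment.
html = """
<div id="all_advanced-team">
<!--
<table id="advanced-team">
  <tr><th>Team</th><th>W</th></tr>
  <tr><td>Example Team</td><td>50</td></tr>
</table>
-->
</div>
"""


def find_table(soup, table_id):
    """Find a table by id, falling back to tables hidden in HTML comments."""
    table = soup.find("table", id=table_id)
    if table is not None:
        return table
    # Comments are stored as strings, so each one has to be parsed separately.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        table = BeautifulSoup(comment, "html.parser").find("table", id=table_id)
        if table is not None:
            return table
    return None


soup = BeautifulSoup(html, "html.parser")
direct = soup.find("table", id="advanced-team")  # not found by a normal search
table = find_table(soup, "advanced-team")
cells = [td.get_text() for td in table.find_all("td")]
print(direct, cells)
```

Returning `None` when nothing is found lets the caller decide how to handle a missing table.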


In [143]:
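seasons = list(range(2003, 2023 + 1))

for season in seasons:

    pages = pages_to_scrape(season)

    for url, infoList in pages.items():

        # Skip the request entirely if every CSV for this page already exists
        if all(Path(info["path"]).exists() for info in infoList):
            continue

        page = requests.get(url, headers=headers)
        soup = BeautifulSoup(page.content, "html.parser")

        for info in infoList:

            # Only scrape the tables that have not been saved yet
            if Path(info["path"]).exists():
                continue

            table = soup.find("table", id=info["id"])
            if table is None:
                # Some tables are hidden inside HTML comments, so search those too
                for comment in soup.find_all(
                    string=lambda text: isinstance(text, Comment)
                ):
                    comment_soup = BeautifulSoup(comment, "html.parser")
                    table = comment_soup.find("table", id=info["id"])
                    if table is not None:
                        break

            # Fail with a clear message instead of passing "None" to pandas
            if table is None:
                raise ValueError(f"Table {info['id']!r} not found at {url}")

            df = pd.read_html(str(table))[0]
            Path(info["path"]).parent.mkdir(parents=True, exist_ok=True)
            df.to_csv(info["path"], index=False)

        # Wait between requests to avoid hitting the site's rate limits
        time.sleep(10)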


# 3. <a id='toc3_'></a>[Data Processing](#toc0_)


Now that we have collected all the data, we need to process it and make it suitable for analysis.


## 3.1. <a id='toc3_1_'></a>[Cleaning Data](#toc0_)


The first step is to figure out how we will be cleaning all the data.


### 3.1.1. <a id='toc3_1_1_'></a>[Renaming a Single Column](#toc0_)


First, let's define a function called `rename_column`. It will take in a column name and return a column name that has been modified to provide more consistency.

The renaming rules are as follows:

- If the column name contains `Unnamed`, then we will replace it with a blank string.
- If the column name is `Rk`, then we will replace it with `Rank`.
- If the column name is `Tm`, then we will replace it with `Team`.
- If the column name is `Offense Four Factors` (which is the name for a group of $4$ different columns), then we will replace it with a blank string. This way, the sub-columns will be assumed to be referring to the team's statistics on offense.
- If the column name is `Defense Four Factors` (which is the name for a group of $4$ different columns), then we will replace it with `Opp`. This way, the sub-columns will be assumed to be referring to the opponent's statistics on offense (the team's statistics on defense).
- Otherwise, we will return the original name.


In [144]:
def rename_column(name):
    if "Unnamed" in name:
        return ""
    elif name == "Rk":
        return "Rank"
    elif name == "Tm":
        return "Team"
    elif name == "Offense Four Factors":
        return ""
    elif name == "Defense Four Factors":
        return "Opp"
    else:
        return name


### 3.1.2. <a id='toc3_1_2_'></a>[Renaming All Columns](#toc0_)


Now, let's define a function called `rename_columns`. It takes in a `DataFrame` and a list of the header rows.

It works as follows:

- If there is a single header row, then it renames each column by calling `rename_column`.
- If there are multiple header rows, then it renames both names for each column by calling `rename_column` and combines them.


In [145]:
def rename_columns(df, header):
    if len(header) == 1:
        df.columns = [rename_column(i) for i in df.columns]
    else:
        df.columns = [f"{rename_column(i)}{rename_column(j)}" for i, j in df.columns]
    return df
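
To make the two-row case concrete, the sketch below flattens a hypothetical two-level header written as plain `(top-level, sub-level)` tuples, so it runs without any `CSV` files. The `rename_column` here is a condensed restatement of the rules above, and the sample header names are assumptions chosen to resemble what pandas produces for blank header cells.

```python
def rename_column(name):
    # Condensed restatement of the renaming rules described above.
    if "Unnamed" in name:
        return ""
    replacements = {
        "Rk": "Rank",
        "Tm": "Team",
        "Offense Four Factors": "",
        "Defense Four Factors": "Opp",
    }
    return replacements.get(name, name)


# Hypothetical two-row header, written as (top-level, sub-level) pairs.
two_level_header = [
    ("Unnamed: 0_level_0", "Rk"),
    ("Unnamed: 1_level_0", "Team"),
    ("Offense Four Factors", "eFG%"),
    ("Defense Four Factors", "eFG%"),
]

# Rename both levels, then join them into a single flat column name.
flattened = [
    f"{rename_column(top)}{rename_column(sub)}" for top, sub in two_level_header
]
print(flattened)
```

The `Opp` prefix is what keeps the two `eFG%` sub-columns distinct after flattening.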


### 3.1.3. <a id='toc3_1_3_'></a>[Cleaning a `CSV` File](#toc0_)


Now, let's create a dictionary called `files_to_clean` with information about the `CSV` files we need to clean. It will map the filename to a dictionary with $3$ items:

- The `header` of the `CSV` file (a list of the row indices for the header)
- The `columns` of the `CSV` file that we want to keep for now (after they have been renamed using the `rename_columns` function above)
- The `column_mappings`, which map old column names to new column names (for renaming purposes)


In [146]:
files_to_clean = {
    "regular_season/per_game_stats.csv": {
        "header": [0],
        "columns": [
            "Team",
            "G",
            "MP",
            "FG",
            "FGA",
            "FG%",
            "3P",
            "3PA",
            "3P%",
            "2P",
            "2PA",
            "2P%",
            "FT",
            "FTA",
            "FT%",
            "ORB",
            "DRB",
            "TRB",
            "AST",
            "STL",
            "BLK",
            "TOV",
            "PF",
            "PTS",
        ],
        "column_mappings": {},
    },
    "regular_season/advanced_stats.csv": {
        "header": [0, 1],
        "columns": [
            "Team",
            "Age",
            "W",
            "L",
            "PW",
            "PL",
            "MOV",
            "SOS",
            "SRS",
            "ORtg",
            "DRtg",
            "NRtg",
            "Pace",
            "FTr",
            "3PAr",
            "TS%",
            "eFG%",
            "TOV%",
            "ORB%",
            "FT/FGA",
            "OppeFG%",
            "OppTOV%",
            "OppDRB%",
            "OppFT/FGA",
        ],
        "column_mappings": {"OppDRB%": "DRB%"},
    },
    "regular_season/standings.csv": {
        "header": [0, 1],
        "columns": ["Rank", "Team", "Overall", "PlaceHome", "PlaceRoad"],
        "column_mappings": {
            "Overall": "OverallRecord",
            "PlaceHome": "HomeRecord",
            "PlaceRoad": "RoadRecord",
        },
    },
    "playoffs/advanced_stats.csv": {
        "header": [0, 1],
        "columns": ["Rank", "Team", "W", "L"],
        "column_mappings": {"Rank": "PlayoffRank", "W": "PlayoffW", "L": "PlayoffL"},
    },
}


Now, let's define a function called `clean_csv`. Given the path to a `CSV` file, it will return a `DataFrame` with a cleaned version of the data from the `CSV` file.

To clean a file, we do the following:

- Using the `files_to_clean` dictionary, obtain the information needed to clean the `CSV` file at the specified path.
- Read the `CSV` file at the specified `path` into a `DataFrame` using [pandas][pandas].
- Do an initial renaming of columns by calling the `rename_columns` function.
- Keep only the specified columns of the `DataFrame`.
- Rename the remaining columns using the specified column mappings.
- Remove any rows where the `Team` is `League Average` (since this is an aggregate of all the rows in the `DataFrame`, and we have no use for it).
- Remove the `*` character for any values in the `Team` column (since it is used to indicate if a team made the playoffs, but we already have that data).

[pandas]: https://pandas.pydata.org/
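
The last two steps (dropping the `League Average` row and stripping the `*` marker) can be tried on a small hand-made `DataFrame`. The team names and win totals below are placeholders, not real data.

```python
import pandas as pd

# Placeholder rows that mimic the shape of a scraped table.
df = pd.DataFrame(
    {
        "Team": ["Example Team A*", "Example Team B", "League Average"],
        "W": [55, 30, 41],
    }
)

# Drop the aggregate row; copy so the next assignment edits an independent frame.
cleaned = df[df["Team"] != "League Average"].copy()

# regex=False treats "*" as a literal character rather than a regex quantifier.
cleaned["Team"] = cleaned["Team"].str.replace("*", "", regex=False)
print(cleaned)
```

Passing `regex=False` matters here because `*` on its own is not a valid regular expression.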


In [147]:
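def clean_csv(path):
    # Build the lookup key from the parent folder and filename,
    # e.g. "regular_season/per_game_stats.csv"
    name = f"{Path(path).parent.name}/{Path(path).name}"
    info = files_to_clean[name]
    df = pd.read_csv(path, header=info["header"])
    df = rename_columns(df, info["header"])
    df = df[info["columns"]]
    df = df.rename(columns=info["column_mappings"])
    # Copy the filtered rows so the assignment below modifies an independent DataFrame
    df = df[df["Team"] != "League Average"].copy()
    df["Team"] = df["Team"].str.replace("*", "", regex=False)
    return df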


## 3.2. <a id='toc3_2_'></a>[Merging Data](#toc0_)


Now, we need to merge all the cleaned data into a single `DataFrame`.

For each season, we will do the following:

- Clean the $4$ `CSV` files for the season using the `clean_csv` function.
- Merge the $4$ `DataFrame`s for the season on the `Team` column using an outer join in [pandas][pandas].
- Add a `Season` column to the beginning of the `DataFrame` and fill all of the rows with the same season value.

Then, we concatenate the `DataFrame`s for each season into a single `DataFrame`.

[pandas]: https://pandas.pydata.org/
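
As a small illustration of why an outer join is used, the sketch below merges two placeholder tables where only one team appears in the playoff data. The team names and numbers are made up for the example.

```python
import pandas as pd

# Placeholder regular-season table: every team appears here.
regular_season = pd.DataFrame(
    {"Team": ["Example Team A", "Example Team B"], "W": [55, 30]}
)

# Placeholder playoff table: only teams that made the playoffs appear here.
playoffs = pd.DataFrame({"Team": ["Example Team A"], "PlayoffW": [9]})

# An outer join keeps teams that are missing from either table.
merged = regular_season.merge(playoffs, on="Team", how="outer")

# Put the season in the first column, with the same value in every row.
merged.insert(0, "Season", 2023)
print(merged)
```

Teams that missed the playoffs keep their regular-season row and get missing values (`NaN`) in the playoff columns, which is why those columns end up with a floating-point type.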


In [148]:
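season_frames = []

for season in seasons:

    # Flatten the per-page lists of table info into one list of CSV paths
    infoList = list(itertools.chain(*pages_to_scrape(season).values()))
    pathList = [info["path"] for info in infoList]

    season_data = None

    for path in pathList:
        df = clean_csv(path)
        if season_data is None:
            season_data = df
        else:
            # Outer join keeps teams that are missing from one of the tables
            season_data = season_data.merge(df, on="Team", how="outer")

    season_data.insert(0, "Season", season)
    season_frames.append(season_data)

# Concatenate once at the end instead of growing a DataFrame inside the loop
data = pd.concat(season_frames)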


Now, we can look at the merged data.


In [149]:
data.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 630 entries, 0 to 29
Data columns (total 55 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Season         630 non-null    int64  
 1   Team           630 non-null    object 
 2   G              628 non-null    float64
 3   MP             628 non-null    float64
 4   FG             628 non-null    float64
 5   FGA            628 non-null    float64
 6   FG%            628 non-null    float64
 7   3P             628 non-null    float64
 8   3PA            628 non-null    float64
 9   3P%            628 non-null    float64
 10  2P             628 non-null    float64
 11  2PA            628 non-null    float64
 12  2P%            628 non-null    float64
 13  FT             628 non-null    float64
 14  FTA            628 non-null    float64
 15  FT%            628 non-null    float64
 16  ORB            628 non-null    float64
 17  DRB            628 non-null    float64
 18  TRB        

In [150]:
data


| | Season | Team | G | MP | FG | FGA | FG% | 3P | 3PA | 3P% | ... | OppTOV% | DRB% | OppFT/FGA | Rank | OverallRecord | HomeRecord | RoadRecord | PlayoffRank | PlayoffW | PlayoffL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2003 | Dallas Mavericks | 82.0 | 241.2 | 38.5 | 85.1 | 0.453 | 7.8 | 20.3 | 0.381 | ... | 14.8 | 70.9 | 0.221 | 1.0 | 60-22 | 33-8 | 27-14 | 8.0 | 10.0 | 10.0 |
| 1 | 2003 | Golden State Warriors | 82.0 | 240.9 | 37.3 | 84.6 | 0.441 | 5.2 | 15.1 | 0.344 | ... | 12.2 | 67.9 | 0.220 | 19.0 | 38-44 | 24-17 | 14-27 | | | |
| 2 | 2003 | Sacramento Kings | 82.0 | 241.8 | 39.5 | 85.2 | 0.464 | 6.0 | 15.7 | 0.381 | ... | 13.6 | 70.6 | 0.204 | 3.0 | 59-23 | 35-6 | 24-17 | 3.0 | 7.0 | 5.0 |
| 3 | 2003 | Los Angeles Lakers | 82.0 | 243.0 | 37.7 | 83.6 | 0.451 | 5.9 | 16.7 | 0.356 | ... | 13.4 | 72.7 | 0.241 | 6.0 | 50-32 | 31-10 | 19-22 | 7.0 | 6.0 | 6.0 |
| 4 | 2003 | Milwaukee Bucks | 82.0 | 242.7 | 37.1 | 81.3 | 0.457 | 7.1 | 18.6 | 0.383 | ... | 13.5 | 69.9 | 0.237 | 16.0 | 42-40 | 25-16 | 17-24 | 11.0 | 2.0 | 4.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 25 | 2023 | Orlando Magic | 82.0 | 241.2 | 40.5 | 86.3 | 0.470 | 10.8 | 31.1 | 0.346 | ... | 13.1 | 77.7 | 0.211 | 25.0 | 34-48 | 20-21 | 14-27 | | | |
| 26 | 2023 | Charlotte Hornets | 82.0 | 241.8 | 41.3 | 90.4 | 0.457 | 10.7 | 32.5 | 0.330 | ... | 12.5 | 75.5 | 0.211 | 27.0 | 27-55 | 13-28 | 14-27 | | | |
| 27 | 2023 | Houston Rockets | 82.0 | 240.9 | 40.6 | 88.9 | 0.457 | 10.4 | 31.9 | 0.327 | ... | 11.8 | 75.8 | 0.218 | 28.0 | 22-60 | 14-27 | 8-33 | | | |
| 28 | 2023 | Detroit Pistons | 82.0 | 241.5 | 39.6 | 87.1 | 0.454 | 11.4 | 32.4 | 0.351 | ... | 11.9 | 74.0 | 0.231 | 30.0 | 17-65 | 9-32 | 8-33 | | | |


# 4. <a id='toc4_'></a>[Exploratory Data Analysis & Visualization](#toc0_)


# 5. <a id='toc5_'></a>[Modeling: Analysis, Hypothesis Testing, & Machine Learning](#toc0_)


# 6. <a id='toc6_'></a>[Interpretation: Insight & Policy Decision](#toc0_)
