# ETL Pipeline for Gathering NBA Official Stats From [nba.com/stats/](https://www.nba.com/stats/)

In [1]:
import copy
from pathlib import Path
import time

from bs4 import BeautifulSoup as soup
import pandas as pd
from splinter import Browser
from webdriver_manager.chrome import ChromeDriverManager

<hr>

## Extract Stage

### Web Scraping Setup

In [2]:
# Set up Splinter (prep the automated browser).
executable_path = {"executable_path": ChromeDriverManager().install()}
browser = Browser("chrome", **executable_path, headless=False)



Current google-chrome version is 100.0.4896
Get LATEST chromedriver version for 100.0.4896 google-chrome
Trying to download new driver from https://chromedriver.storage.googleapis.com/100.0.4896.60/chromedriver_win32.zip
Driver has been saved in cache [C:\Users\cdpet\.wdm\drivers\chromedriver\win32\100.0.4896.60]


### Delay for Loading Pages and Tables

In [3]:
# Seconds to pause with time.sleep() so that pages and tables can finish loading.
delay_time = 5
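
A fixed delay is simple, but it can waste time on a fast connection and be too short on a slow one. As a hedged alternative, the sketch below polls a condition until it holds or a timeout expires. `wait_until` and its parameters are hypothetical names, not part of Splinter. In this notebook the condition could be something like `lambda: browser.is_element_present_by_css("div.nba-stat-table")`, assuming that selector matches the loaded table.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Hypothetical helper (not part of Splinter). Returns True if the
    condition was met, False if the timeout expired first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Stand-in for "the table has loaded": becomes true on the third check.
checks = {"count": 0}


def table_loaded():
    checks["count"] += 1
    return checks["count"] >= 3


loaded = wait_until(table_loaded, timeout=2.0, poll_interval=0.01)
print(loaded, checks["count"])
```

The fixed `delay_time` remains a reasonable fallback when no reliable "loaded" signal exists on the page.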

### Collect Team Statistics URLs for Scraping

In [4]:
# Visit NBA stats page.
base_url = "https://www.nba.com"
href = "/stats"
browser.visit(f"{base_url}{href}")
# Delay to allow the page to load.
time.sleep(delay_time)

# Retrieve html.
html = browser.html
nba_soup = soup(html, "html.parser")

# Find the divs that have the sidebar-module... classes. The div at index 4
# holds the anchor tags that link to the team stats pages.
sidebar_divs = nba_soup.find_all("div", class_="sidebar-module / sidebar__leaders / sidebar-module-next / sidebar-module-quick-links")
team_stats_a_tags = sidebar_divs[4].find_all("a")
# Extract the href string and combine with base_url to form the team stats urls.
team_stats_urls = [f'{base_url}{a_tag.attrs["href"]}' for a_tag in team_stats_a_tags]
team_stats_urls

['https://www.nba.com/stats/teams/traditional/',
 'https://www.nba.com/stats/teams/advanced/?sort=W&dir=-1',
 'https://www.nba.com/stats/teams/misc/?sort=W&dir=-1',
 'https://www.nba.com/stats/teams/clutch-traditional/']

### Retrieve the Stat Tables From Each URL

#### URL Stat Tables
* Each URL has stat tables for a specific stat type, as shown here:

| *team_stats_urls* index | URL                                                     | Stat Type                 | No. of Tables | Range of Seasons   |
| ----------------------- | ------------------------------------------------------- | ------------------------- | ------------- | ------------------ |
| 0                       | https://www.nba.com/stats/teams/traditional/            | Teams General Traditional | 26            | 1996-97 to 2021-22 |
| 1                       | https://www.nba.com/stats/teams/advanced/?sort=W&dir=-1 | Teams General Advanced    | 26            | 1996-97 to 2021-22 |
| 2                       | https://www.nba.com/stats/teams/misc/?sort=W&dir=-1     | Teams General Misc        | 26            | 1996-97 to 2021-22 |
| 3                       | https://www.nba.com/stats/teams/clutch-traditional/     | Teams Clutch Traditional  | 26            | 1996-97 to 2021-22 |

#### URL Selectable Filters 
* Each URL has dropdown menu filters for **Season**, **Season Type**, **Per Mode**, and **Season Segment**.
* **Teams General Advanced** does not have a **Per Mode** filter, so its table mixes per-game stats and full-season totals. On that page, **Season Segment** is at index 2 rather than 3 in the list below.
* Here are the select tag indices if a filter needs to be changed:
    * `browser.find_by_tag("select")[0]`: **Season**
        * 1996-97 to 2021-22 (default)
    * `browser.find_by_tag("select")[1]`: **Season Type**
        * Regular Season (default)
        * Playoffs
    * `browser.find_by_tag("select")[2]`: **Per Mode**
        * Per Game (default)
        * Totals
    * `browser.find_by_tag("select")[3]`: **Season Segment**
        * All Games (default)
* The list above only locates the correct `<select>` tag. To apply a filter, first find the `value` attribute of the desired `<option>` tag inside that `<select>` element, then chain Splinter's `select(value="<option_value>")` method onto the matching expression from the list above.
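
The lookup of an `<option>` value can be sketched without a live browser. The markup and the `string:...` values below are made up for illustration, and `option_value` is a hypothetical helper; the real page may use different values.

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup standing in for the "SEASON TYPE" dropdown.
sample_html = """
<select name="SeasonType">
  <option value="string:Regular Season">Regular Season</option>
  <option value="string:Playoffs">Playoffs</option>
</select>
"""


def option_value(html, select_name, label):
    """Return the `value` attribute of the <option> whose text equals `label`."""
    page = BeautifulSoup(html, "html.parser")
    select = page.find("select", attrs={"name": select_name})
    for option in select.find_all("option"):
        if option.text.strip() == label:
            return option["value"]
    raise ValueError(f"No option labelled {label!r} in select {select_name!r}")


playoffs_value = option_value(sample_html, "SeasonType", "Playoffs")
print(playoffs_value)

# With a live Splinter browser, the value would then be passed on, e.g.:
# browser.find_by_tag("select")[1].select(value=playoffs_value)
```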

#### Data to Collect
* All columns except the ranking.
    * The ranking is the first column, before the team name. It only reflects the site's default sort by winning percentage, so we most likely won't need it.

#### Helper Functions
* Three helper functions retrieve the stat tables:
    1. `season_stat_table(stat_table_soup, season, stat_type)`
        * **Stat table** for **one season** for **one stat type**.
        * `stat_table_soup` contains the html for the current table to be retrieved.
        * `season` and `stat_type` are only printed to identify the page if the table cannot be found.
        * Returns the stat table as a pandas DataFrame.
        
        <br>
        
    2. `all_seasons_stat_tables(seasons, stat_type)`
        * **Stat tables** for **all seasons** for **one stat type**.
        * `seasons` contains a `(season identifier, <option> value)` tuple for each season found in the **Season** selectable filter.
        * `season_stat_table()` is called once for each season.
        * Returns a dictionary. The keys are the season identifiers (for example: `"2021-22"`), and the values are the DataFrames returned by `season_stat_table()`.
            * `{<season_identifier>: <DataFrame>, ...}`
            
        <br> 
         
    3. `retrieve_team_stats(team_stats_urls)`
        * **Stat tables** for **all seasons** and **all stat types**, for both the regular season and the playoffs.
        * `team_stats_urls` contains all the team statistics URLs for the NBA stats site.
        * `all_seasons_stat_tables()` is called once for each stat type (URL) and season type.
        * Returns a tuple of two dictionaries: `(team_stats_reg_season, team_stats_playoffs)`. In each, the keys are the stat types (for example: `"Teams General Traditional"`), and the values are the dictionaries returned by `all_seasons_stat_tables()`.
            * `{<stat_type>: <dict>, ...}`
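
The nested layout described above, `{<stat_type>: {<season_identifier>: <DataFrame>}}`, can be explored with a small toy stand-in. The team names and win totals here are made up; only the shape of the structure mirrors the helpers.

```python
import pandas as pd

# Toy stand-in for the nested structure; the numbers are made up.
team_stats = {
    "Teams General Traditional": {
        "2021-22": pd.DataFrame({"TEAM": ["Team A", "Team B"], "W": [50, 32]}),
        "2020-21": pd.DataFrame({"TEAM": ["Team A", "Team B"], "W": [41, 45]}),
    },
}

# Outer key: stat type. Inner key: season identifier. Value: DataFrame.
for stat_type, seasons in team_stats.items():
    for season, df in seasons.items():
        print(stat_type, season, df.shape)

# Two lookups reach a single season's table.
wins_2020_21 = team_stats["Teams General Traditional"]["2020-21"]["W"].tolist()
print(wins_2020_21)
```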

#### Example Usage
* To see the advanced stats table for the 2021-22 regular season, we can use the following code:
    ```python
    team_stats, team_stats_playoffs = retrieve_team_stats(team_stats_urls)
    team_stats["Teams General Advanced"]["2021-22"]
    ```

#### Team Stats Helper Function #1: One Season, One Stat Type

In [5]:
def season_stat_table(stat_table_soup, season, stat_type):
    """Retrieve the table of data for a season via the stat_table_soup html"""
    try:
        table = stat_table_soup.find("div", class_="nba-stat-table").find("table")
    except AttributeError:
        print(stat_type, season)
        raise
    # Find the column names in the header of the table.
    headers = table.find("thead").find_all("th")
    # The conditional is for removing hidden header values that have no meaning
    # to us. The first list element is removed since it refers to a ranking that
    # we will not need.
    headers = [header.decode_contents().replace("<br/>", " ").replace("\xa0", " ").strip().upper()
               for header in headers
               if "RANK" not in header.text][1:]
    
    # Rows that contain the table data.
    rows = table.find("tbody").find_all("tr")
    
    # dataframe_data will contain dict elements for each row of data.
    dataframe_data = []
    
    # Loop over each row in the table.
    for row in rows:
        # All the table data "td" tags for a given row (i.e. all column values).
        cols = row.find_all("td")
        # Remove the first element that is a ranking.
        cols = [td.text.strip() for td in cols][1:]
    
        # row_dict represents the data for an entire row.
        row_dict = {}

        # Loop over each column in a given row, add the value to row_dict with
        # a key that is the column's header name. 
        for index, value in enumerate(cols):
            # Add team name string.
            if index == 0:
                row_dict[headers[index]] = value
                continue
            # Add team record information: GP, W, and L as integers.
            if index in (1, 2, 3):
                row_dict[headers[index]] = int(value)
                continue
            # Add the remaining team stats: values with thousands separators
            # (for example "8,242") as integers, everything else as floats.
            if "," in value:
                row_dict[headers[index]] = int(value.replace(",", ""))
            else:
                row_dict[headers[index]] = float(value)
        # Add the row's row_dict to dataframe_data.
        dataframe_data.append(row_dict)
        
    return pd.DataFrame(dataframe_data)
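
The cell conversion rules used in the function above can be isolated in a small sketch. `parse_cell` is a hypothetical helper, and the sample row reuses a few values from the 2021-22 advanced table shown later in this notebook (the `"8,242"` string form is an assumption about how the site renders that number).

```python
def parse_cell(text):
    """Convert one table cell string to int or float.

    Hypothetical helper mirroring the rules above: values with thousands
    separators become ints, everything else becomes a float.
    """
    cleaned = text.strip()
    if "," in cleaned:
        return int(cleaned.replace(",", ""))
    return float(cleaned)


headers = ["TEAM", "GP", "W", "L", "OFFRTG", "POSS"]
cells = ["Phoenix Suns", "82", "64", "18", "114.2", "8,242"]

# Team name stays a string; GP, W, and L are integers; the rest are parsed.
row = {headers[0]: cells[0]}
for header, cell in zip(headers[1:4], cells[1:4]):
    row[header] = int(cell)
for header, cell in zip(headers[4:], cells[4:]):
    row[header] = parse_cell(cell)

print(row)
```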

#### Team Stats Helper Function #2: All Seasons, One Stat Type

In [6]:
def all_seasons_stat_tables(seasons, stat_type):
    """Retrieve dataframes for all seasons"""
    dataframes = {}
    for season in seasons:
        # Select Season --------------------------------------------------------
        # Find the select elements. The first select is the "SEASON" dropdown
        # menu. Each element of seasons is a (season identifier, option value)
        # tuple, and the option value is the one the nba website assigned.
        browser.find_by_tag("select")[0].select(value=season[1])
        # Delay to allow the page to load.
        time.sleep(delay_time)

        # Retrieve html --------------------------------------------------------
        html = browser.html
        stat_table_soup = soup(html, "html.parser")

        # Retrieve DataFrame ---------------------------------------------------
        # Assign the DataFrame to the dataframes dict as the value, and the
        # season string as the key.
        dataframes[season[0]] = season_stat_table(stat_table_soup, season[0], stat_type)
        
    return dataframes

#### Team Stats Helper Function #3: All Seasons, All Stat Types

In [7]:
def retrieve_team_stats(team_stats_urls):
    """Retrieve regular season and playoff dataframes for all stat types and seasons"""
    season_types = ["Regular Season", "Playoffs"]
    all_stats = {}
    for season_type in season_types:
        team_stats = {}
        for url in team_stats_urls:
            # Visit URL --------------------------------------------------------
            # Visit a team stats url.
            browser.visit(url)
            # Delay to allow the page to load.
            time.sleep(delay_time)

            # Retrieve html
            html = browser.html
            select_soup = soup(html, "html.parser")

            # Select Season Type -----------------------------------------------
            # Find the options for the "SEASON TYPE" dropdown menu
            options = select_soup.find(attrs={"name":"SeasonType"}).find_all("option")
            season_option = [option["value"] for option in options if option.text == season_type]
            # Select the season_type option to bring up that season type's 
            # table.
            browser.find_by_tag("select")[1].select(value=season_option[0])
            # Delay to allow the page to load.
            time.sleep(delay_time)

            # Retrieve Available Seasons and Their <option> Tag Values ---------
            html = browser.html
            select_soup = soup(html, "html.parser")

            # Find the option tags from the first select tag ("SEASON" dropdown 
            # menu).
            options = select_soup.find("select").find_all("option")
            # Store the season and option value strings for each option.
            seasons = [(option.text, option["value"]) for option in options]

            # Retrieve Table Title Using select_soup ---------------------------
            # Find the title for the table at the current url.
            stat_type_nav = select_soup.find("nav-dropdown")
            # The title is configured in three parts (3 different attributes on
            # the nav tag).
            stat_type = list(stat_type_nav.attrs.values())
            # Join the title words and capitalize.
            stat_type = " ".join(stat_type).title()

            # Retrieve the DataFrames for Each Season --------------------------
            team_stats[stat_type] = all_seasons_stat_tables(seasons, stat_type)
        all_stats[season_type] = team_stats
    team_stats_reg_season = all_stats[season_types[0]]
    team_stats_playoffs = all_stats[season_types[1]]
    
    return team_stats_reg_season, team_stats_playoffs

#### Collect All of the Team Stats

In [8]:
team_stats, team_stats_playoffs = retrieve_team_stats(team_stats_urls)



In [9]:
team_stats.keys()

dict_keys(['Teams General Traditional', 'Teams General Advanced', 'Teams General Misc', 'Teams Clutch Traditional'])

In [10]:
team_stats["Teams General Advanced"]["2021-22"]

Unnamed: 0,TEAM,GP,W,L,MIN,OFFRTG,DEFRTG,NETRTG,AST%,AST/TO,AST RATIO,OREB%,DREB%,REB%,TOV%,EFG%,TS%,PACE,PIE,POSS
0,Phoenix Suns,82,64,18,3946.0,114.2,106.8,7.5,62.7,2.12,19.5,26.4,72.8,50.3,12.9,54.9,58.1,100.26,54.8,8242
1,Memphis Grizzlies,82,56,26,3956.0,114.3,108.9,5.3,59.7,1.97,17.9,33.8,72.6,52.6,13.0,52.2,55.3,100.52,53.0,8295
2,Golden State Warriors,82,53,29,3946.0,112.1,106.6,5.5,66.9,1.82,19.5,26.9,73.6,51.0,15.0,55.2,58.2,98.74,53.6,8121
3,Miami Heat,82,53,29,3971.0,113.0,108.4,4.5,64.4,1.75,18.8,27.8,73.5,51.0,14.9,54.7,58.4,96.53,52.9,7987
4,Dallas Mavericks,82,52,30,3951.0,112.5,109.1,3.5,59.5,1.87,17.8,25.6,73.3,49.6,13.0,53.8,57.2,95.64,51.1,7871
5,Boston Celtics,82,51,31,3981.0,113.6,106.2,7.4,60.9,1.82,18.2,27.7,72.5,50.9,13.9,54.2,57.8,97.26,54.7,8068
6,Milwaukee Bucks,82,51,31,3951.0,114.3,111.1,3.2,57.2,1.78,17.3,26.9,74.7,51.2,13.3,54.6,58.0,100.59,51.6,8284
7,Philadelphia 76ers,82,51,31,3961.0,113.0,110.2,2.8,60.2,1.89,17.9,24.6,72.4,49.0,12.9,53.4,57.8,96.71,51.7,7975
8,Utah Jazz,82,49,33,3946.0,116.2,110.0,6.2,55.2,1.6,16.7,30.0,73.8,52.5,14.3,55.5,58.9,97.5,52.8,8014
9,Denver Nuggets,82,48,34,3961.0,113.8,111.5,2.3,66.7,1.92,20.0,26.8,75.2,51.5,14.6,55.6,59.0,98.41,51.6,8123


### Build Playoff Teams and Champions DataFrames

#### Variable to Extract Playoff Teams From
* `team_stats_playoffs["Teams General Traditional"]`

#### Data to Collect
* Team names
* Total wins for that playoff run

#### Helper Functions
* One helper function builds the DataFrames:
    * `build_playoffs_champions_dataframes(team_stats_playoffs)`
        * **Playoff teams** for **all seasons** in **one DataFrame**.
        * **Champion teams** for **all seasons** (except current season) in **one DataFrame**.
        * Returns a tuple of DataFrames: `(playoff_teams_df, champions_df)`
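
The idea behind the helper is that the champion wins every series it plays and so ends a playoff run with the most wins. Sorting each season by wins therefore puts the champion in row 0. A minimal sketch with made-up win totals (not real data), using `pd.concat` in place of the repeated joins:

```python
import pandas as pd

# Made-up playoff win totals for two seasons (not real data).
playoffs = {
    "2001-02": pd.DataFrame({"TEAM": ["Team A", "Team B", "Team C"], "W": [4, 16, 9]}),
    "2000-01": pd.DataFrame({"TEAM": ["Team A", "Team B", "Team C"], "W": [16, 2, 11]}),
}

columns = []
for season, df in playoffs.items():
    ranked = df.sort_values(by="W", ascending=False).reset_index(drop=True)
    # Keep only the team names, labelled with the season identifier.
    columns.append(ranked["TEAM"].rename(season))

# One column per season; row 0 is the team with the most wins.
playoff_teams_df = pd.concat(columns, axis=1)

# Transpose row 0 so that seasons become the index and teams the values.
champions_df = playoff_teams_df.iloc[[0]].transpose()
champions_df.columns = ["TEAM"]
print(champions_df)
```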

#### Usage
* To build these DataFrames, use the following code:
    ```python
    playoff_teams_df, champions_df = build_playoffs_champions_dataframes(team_stats_playoffs)
    ```

#### Playoff Teams Helper Function #1: Build Playoff Teams and Champions DataFrames

In [11]:
def build_playoffs_champions_dataframes(team_stats_playoffs):
    # Combine Into One DataFrame -----------------------------------------------
    # Create one DataFrame containing all playoff teams from all seasons. The
    # season identifiers will be the column names, and the team names will be
    # the rows. Additionally, sort by wins descending and drop the win column 
    # before joining. The champions from each year will be left at index 0.
    playoffs = team_stats_playoffs["Teams General Traditional"]
    
    index = 0
    for season, playoff_df in playoffs.items():
        playoff_df = playoff_df[["TEAM", "W"]]
        # Sort the DataFrames by wins descending.
        playoff_df = playoff_df.sort_values(by=["W"], ascending=False)
        # Reset the index (to eliminate joining issues), and drop the win column.
        playoff_df = playoff_df.reset_index(drop=True).drop(columns="W")
        # Change the column name to the season identifier.
        playoff_df.columns = [season]
        if index == 0:
            # Set our output variable `playoff_teams_df` to the most recent 
            # season's playoff teams DataFrame. No join is needed as we only
            # have 1 DataFrame here.
            playoff_teams_df = playoff_df
            index += 1
            continue
        # Join each DataFrame to the output variable `playoff_teams_df`
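        playoff_teams_df = playoff_teams_df.join(playoff_df)
        index += 1

    # Row 0 holds the team with the most playoff wins in each season, i.e. the
    # champion. Build this once, after all seasons have been joined.
    champions_df = playoff_teams_df.iloc[[0]].transpose()
    champions_df.columns = ["TEAM"]
    # Drop the current season, which is excluded from the champions table.
    champions_df = champions_df.drop(index="2021-22")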
    
    return (playoff_teams_df, champions_df)

#### Retrieve the Playoff Team Data

In [12]:
playoff_teams_df, champions_df = build_playoffs_champions_dataframes(team_stats_playoffs)

In [13]:
playoff_teams_df

Unnamed: 0,2021-22,2020-21,2019-20,2018-19,2017-18,2016-17,2015-16,2014-15,2013-14,2012-13,...,2005-06,2004-05,2003-04,2002-03,2001-02,2000-01,1999-00,1998-99,1997-98,1996-97
0,Golden State Warriors,Milwaukee Bucks,Los Angeles Lakers,Toronto Raptors,Golden State Warriors,Golden State Warriors,Cleveland Cavaliers,Golden State Warriors,San Antonio Spurs,Miami Heat,...,Miami Heat,San Antonio Spurs,Detroit Pistons,San Antonio Spurs,Los Angeles Lakers,Los Angeles Lakers,Los Angeles Lakers,San Antonio Spurs,Chicago Bulls,Chicago Bulls
1,Miami Heat,Phoenix Suns,Miami Heat,Golden State Warriors,Cleveland Cavaliers,Cleveland Cavaliers,Golden State Warriors,Cleveland Cavaliers,Miami Heat,San Antonio Spurs,...,Dallas Mavericks,Detroit Pistons,Los Angeles Lakers,New Jersey Nets,New Jersey Nets,Philadelphia 76ers,Indiana Pacers,New York Knicks,Utah Jazz,Utah Jazz
2,Philadelphia 76ers,Atlanta Hawks,Boston Celtics,Milwaukee Bucks,Houston Rockets,Boston Celtics,Oklahoma City Thunder,Houston Rockets,Indiana Pacers,Indiana Pacers,...,Detroit Pistons,Miami Heat,Indiana Pacers,Dallas Mavericks,Sacramento Kings,Milwaukee Bucks,Portland Trail Blazers,Indiana Pacers,Indiana Pacers,Houston Rockets
3,Boston Celtics,LA Clippers,Denver Nuggets,Portland Trail Blazers,Boston Celtics,San Antonio Spurs,Toronto Raptors,Atlanta Hawks,Oklahoma City Thunder,Memphis Grizzlies,...,Phoenix Suns,Phoenix Suns,Minnesota Timberwolves,Detroit Pistons,Boston Celtics,San Antonio Spurs,New York Knicks,Portland Trail Blazers,Los Angeles Lakers,Miami Heat
4,Milwaukee Bucks,Brooklyn Nets,Toronto Raptors,Philadelphia 76ers,New Orleans Pelicans,Washington Wizards,Miami Heat,Los Angeles Clippers,Washington Wizards,Golden State Warriors,...,Los Angeles Clippers,Seattle SuperSonics,New Jersey Nets,Sacramento Kings,Dallas Mavericks,Charlotte Hornets,Miami Heat,Utah Jazz,Charlotte Hornets,New York Knicks
5,Dallas Mavericks,Philadelphia 76ers,LA Clippers,Denver Nuggets,Philadelphia 76ers,Houston Rockets,San Antonio Spurs,Washington Wizards,Los Angeles Clippers,New York Knicks,...,Cleveland Cavaliers,Dallas Mavericks,Sacramento Kings,Los Angeles Lakers,Charlotte Hornets,Toronto Raptors,Philadelphia 76ers,Los Angeles Lakers,San Antonio Spurs,Seattle SuperSonics
6,Memphis Grizzlies,Utah Jazz,Milwaukee Bucks,Houston Rockets,Utah Jazz,Toronto Raptors,Portland Trail Blazers,Memphis Grizzlies,Portland Trail Blazers,Oklahoma City Thunder,...,San Antonio Spurs,Indiana Pacers,San Antonio Spurs,Philadelphia 76ers,Detroit Pistons,Dallas Mavericks,Phoenix Suns,Philadelphia 76ers,New York Knicks,Los Angeles Lakers
7,Minnesota Timberwolves,Denver Nuggets,Houston Rockets,Boston Celtics,Toronto Raptors,Utah Jazz,Atlanta Hawks,Chicago Bulls,Brooklyn Nets,Chicago Bulls,...,New Jersey Nets,Washington Wizards,Miami Heat,Boston Celtics,San Antonio Spurs,Sacramento Kings,Utah Jazz,Atlanta Hawks,Seattle SuperSonics,Atlanta Hawks
8,New Orleans Pelicans,Dallas Mavericks,Oklahoma City Thunder,San Antonio Spurs,Indiana Pacers,LA Clippers,Charlotte Hornets,San Antonio Spurs,Atlanta Hawks,Brooklyn Nets,...,Los Angeles Lakers,Boston Celtics,New Orleans Hornets,Orlando Magic,Indiana Pacers,New York Knicks,Milwaukee Bucks,Detroit Pistons,Houston Rockets,Detroit Pistons
9,Phoenix Suns,Los Angeles Lakers,Utah Jazz,LA Clippers,Milwaukee Bucks,Atlanta Hawks,Indiana Pacers,Brooklyn Nets,Dallas Mavericks,Atlanta Hawks,...,Chicago Bulls,Houston Rockets,Dallas Mavericks,Portland Trail Blazers,Philadelphia 76ers,Utah Jazz,Sacramento Kings,Miami Heat,Miami Heat,Orlando Magic


In [14]:
champions_df

Unnamed: 0,TEAM
2020-21,Milwaukee Bucks
2019-20,Los Angeles Lakers
2018-19,Toronto Raptors
2017-18,Golden State Warriors
2016-17,Golden State Warriors
2015-16,Cleveland Cavaliers
2014-15,Golden State Warriors
2013-14,San Antonio Spurs
2012-13,Miami Heat
2011-12,Miami Heat


In [15]:
browser.quit()

In [16]:
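# Combine the per-season DataFrames of each regular season stat type into one
# DataFrame with an added SEASON column.
seasons = list(team_stats["Teams General Traditional"].keys())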
stat_types = list(team_stats.keys())

for stat_type in stat_types: 
    for index, season in enumerate(seasons):
        if index == 0:
            df = team_stats[stat_type][season]
            df["SEASON"] = season
            continue 
        df2 = team_stats[stat_type][season]
        df2["SEASON"] = season
        df = pd.concat([df, df2])
    team_stats[stat_type] = df

In [17]:
# Do the same for the playoff stat tables.
playoff_seasons = list(team_stats_playoffs["Teams General Traditional"].keys())
playoff_stat_types = list(team_stats_playoffs.keys())

for playoff_stat_type in playoff_stat_types: 
    for index, playoff_season in enumerate(playoff_seasons):
        if index == 0:
            df = team_stats_playoffs[playoff_stat_type][playoff_season]
            df["SEASON"] = playoff_season
            continue 
        df2 = team_stats_playoffs[playoff_stat_type][playoff_season]
        df2["SEASON"] = playoff_season
        df = pd.concat([df, df2])
    team_stats_playoffs[playoff_stat_type] = df
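
The two cells above stack the per-season tables of each stat type and tag every row with its season. The same result can be sketched with a single `pd.concat` call over toy data (the numbers are made up). Unlike the loops above, this variant leaves the per-season DataFrames unmodified and renumbers the combined index.

```python
import pandas as pd

# Toy per-season tables for one stat type; the numbers are made up.
per_season = {
    "2021-22": pd.DataFrame({"TEAM": ["Team A", "Team B"], "W": [50, 32]}),
    "2020-21": pd.DataFrame({"TEAM": ["Team A", "Team B"], "W": [41, 45]}),
}

# assign() returns a copy, so the original per-season tables stay unchanged.
combined = pd.concat(
    [df.assign(SEASON=season) for season, df in per_season.items()],
    ignore_index=True,
)
print(combined)
```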

<br>
<hr>
<br>

### Write to CSVs

In [19]:
data_folder = Path("data")
data_folder.mkdir(parents=True, exist_ok=True)

season_type_stats_list = [team_stats, team_stats_playoffs]

# Team Stats -------------------------------------------------------------------
for index, season_type_stats in enumerate(season_type_stats_list):
    for stat_type in season_type_stats:
        # Split and join the stat_type string with underscores for the filepath
        # below.
        split_string = stat_type.split()
        stat_type_underscores = "_".join(split_string).lower()
        if index == 1:
            stat_type_underscores = f"playoffs_{stat_type_underscores}"
            
        # Create filepath for writing to
        table_filepath = Path(f"{data_folder}/{stat_type_underscores}.csv")

        # Write the DataFrame to file
        season_type_stats[stat_type].to_csv(table_filepath, index=False)

# stat_types_and_seasons.txt ---------------------------------------------------
# Write the stat types and season identifiers to file to make it easier to
# read in the csv's. The season identifiers come from the `seasons` list built
# earlier, because team_stats now holds one combined DataFrame per stat type
# (iterating over a DataFrame would yield column names, not seasons).
with open(Path(f"{data_folder}/stat_types_and_seasons.txt"), 'w') as opened_file:
    for stat_type in team_stats:
        opened_file.write(f"{stat_type}\n")
    for season in seasons:
        opened_file.write(f"{season}\n")

# Playoffs ---------------------------------------------------------------------
playoff_teams_filepath = Path(f"{data_folder}/playoff_teams_df.csv")
playoff_teams_df.to_csv(playoff_teams_filepath, index=False)

# Champions --------------------------------------------------------------------
champions_filepath = Path(f"{data_folder}/champions_df.csv")
champions_df.to_csv(champions_filepath, index=True, index_label="SEASON")
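
The file naming scheme used in the cell above (lowercase, underscores, and a `playoffs_` prefix for playoff tables) can be sketched as a small function. `stat_csv_path` is a hypothetical helper introduced only for illustration.

```python
from pathlib import Path


def stat_csv_path(data_folder, stat_type, playoffs=False):
    """Build the CSV path for a stat type (hypothetical helper)."""
    # "Teams General Advanced" -> "teams_general_advanced"
    stem = "_".join(stat_type.split()).lower()
    if playoffs:
        stem = f"playoffs_{stem}"
    return Path(data_folder) / f"{stem}.csv"


regular = stat_csv_path("data", "Teams General Advanced")
playoff = stat_csv_path("data", "Teams General Advanced", playoffs=True)
print(regular.name, playoff.name)
```

Building the path with the `/` operator avoids hard-coding a path separator inside an f-string.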
