# NBA Stats Web Scraper Notebook

## Luke DiPerna

This notebook contains the code I used to scrape NBA boxscore data from [basketball-reference](https://www.basketball-reference.com/) and build a SQLite database, which is available on [Kaggle](https://www.kaggle.com/datasets/lukedip/nba-boxscore-dataset).

WARNING: The runtime is extremely long and this notebook should not be run unless you understand the implications of the code below.

Basketball-Reference does allow web scraping, and I have taken steps to follow web scraping best practices so that the scraper does not violate the website's terms and conditions (as of the time of writing), which can be found [here](https://www.sports-reference.com/bot-traffic.html). However, these terms could change, so it is imperative that you check them before running this code. Even with these safeguards, you may still run into interruptions, in which case you may need to take additional steps such as lengthening the delay or using a VPN with a rotating IP.

The data consists of team and player boxscore statistics from each regular season game. To collect the data, the scraper gathers the URL for each game, loads the webpage using Selenium, processes and stores the data in dataframes, and then appends the processed data to the database. A typical game stats webpage looks like [this](https://www.basketball-reference.com/boxscores/201610260LAL.html). There are a lot of data tables on the page, but I am focused on four in particular: home team boxscore, home team advanced boxscore, away team boxscore, and away team advanced boxscore. The scraper collects these stats, along with some game meta-information, and includes them in the database. I also engineered an additional feature, the PIE (Player Impact Estimate) statistic, for each player.
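
For reference, the PIE calculation performed later in this notebook (in `create_PIE`) can be written as the formula below. The symbols are the boxscore column names; $S(\text{player})$ is evaluated on one player's boxscore line and $S(\text{game})$ on the combined totals of both teams in the same game. The helper name $S$ is introduced here only for readability, and the result is rounded to one decimal place in the code.

```latex
\[
S(x) = \mathrm{PTS}_x + \mathrm{FG}_x + \mathrm{FT}_x - \mathrm{FGA}_x - \mathrm{FTA}_x
     + \mathrm{DRB}_x + 0.5\,\mathrm{ORB}_x + \mathrm{AST}_x + \mathrm{STL}_x
     + 0.5\,\mathrm{BLK}_x - \mathrm{PF}_x - \mathrm{TOV}_x
\]
\[
\mathrm{PIE}_{\text{player}} = 100 \times \frac{S(\text{player})}{S(\text{game})}
\]
```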

In [None]:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

import pandas as pd
import numpy as np
import string
import time
import sqlite3
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)

In [None]:
# global variables

# prepended to the partial urls scraped from each page
base_url = 'https://www.basketball-reference.com'

# running count of games within a season; used to create id strings later
season_gamecount = 1

precovid_seasons = ['1314', '1415', '1516', '1617', '1718', '1819']
precovid_url_years = ['2014', '2015', '2016', '2017', '2018', '2019']
postcovid_seasons = ['1920', '2021', '2122', '2223']
postcovid_url_years = ['2020', '2021', '2022', '2023']

post_covid_season_dict = {'1920': {'month_len': 8, 'final_month_gamecount': 83},
                          '2021': {'month_len': 6, 'final_month_gamecount': 140},
                          '2122': {'month_len': 7, 'final_month_gamecount': 83},
                          '2223': {'month_len': 7, 'final_month_gamecount': 72}
                         }
# used to create sql database table columns
info_columns = ['game_id', 'season', 'date', 'away_team', 'away_score', 'home_team', 'home_score', 'result']
num_columns = ['FG', 'FGA', '3P', '3PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS', '+/-',
               'FG%', '3P%', 'FT%', 'TS%', 'eFG%', '3PAr', 'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%', 'ORtg', 'DRtg', 'BPM']
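# pause between each server call
# time.sleep() returns None, so assigning its result would only sleep once, when this cell runs;
# wrapping it in a function gives a fresh random 3-5 second pause on every call
def delay():
    time.sleep(np.random.randint(3, 6))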
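
A pause only happens at the moment `time.sleep` is called, so one way to get a fresh random wait before every request is to wrap the call in a function. The sketch below uses only the standard library; `polite_pause`, `low`, `high`, and the injectable `sleep` argument are hypothetical names chosen for this illustration. Note that `np.random.randint(3, 6)` excludes the upper bound (it yields 3, 4, or 5), while `random.randint` includes both ends.

```python
import random
import time


def polite_pause(low=3, high=5, sleep=time.sleep):
    """Sleep for a random whole number of seconds in [low, high] and return it.

    All names here are hypothetical and used only for this sketch.
    """
    seconds = random.randint(low, high)  # inclusive on both ends
    sleep(seconds)
    return seconds


# Usage with a stub in place of a real sleep, so the example runs instantly.
waits = []
chosen = [polite_pause(sleep=waits.append) for _ in range(5)]
print(chosen == waits, all(3 <= s <= 5 for s in chosen))
```

Passing the sleep function as an argument is a design choice made here so the pause logic can be checked without actually waiting.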

In [None]:
def create_game_info(url, season_id, season_gamecount):
    '''
    Build the game id, season id, and date for a game from its boxscore url.
    '''
    
    # zero-pad the game count to four digits
    game_count = str(season_gamecount).zfill(4)
    
    id_string = url.strip(string.ascii_letters+string.punctuation)
    year = id_string[0:4]
    month = id_string[4:6]
    day = id_string[6:8]
    
    date = year+'-'+month+'-'+day
    
    game_id = int(season_id+month+day+game_count)
    season_id = int(season_id)
    
    return [game_id, season_id, date]

In [None]:
def create_team_info(table):
    '''
    Create a dataframe with game results. Uses an html table as input.
    
    ---
    Inputs:
    
    table: a BeautifulSoup html table
    ---
    Outputs:
    
    team_info: a dataframe with the relevant game information (team_ids, scores, and boolean 'results' column)
    '''
    
    # get team_ids
    id_rows = table.findAll('th', attrs={'class':'center', 'data-stat':'team', 'scope':'row'})
    team_ids = [row.text.strip() for row in id_rows]
    
    # get final score
    scores = table.findAll('td', attrs={'class': 'center', 'data-stat': 'T'})
    final_scores = [int(score.text.strip()) for score in scores]
    
    # boolean game-winner: away=0, home=1
    if final_scores[0] > final_scores[1]:
        result=0
    else:
        result=1
    
    team_info = [team_ids[0], final_scores[0], team_ids[1], final_scores[1], result]
    
    return team_info

In [None]:
def create_info_df(game_info, team_info, info_columns):
    info = game_info + team_info
    info_df = pd.DataFrame([info], columns=info_columns)
    return info_df

In [None]:
def create_boxscores(table, game_id):
    '''
    Split one html boxscore table into a player dataframe and a one-row team totals dataframe.
    '''

    # ignore first 'tr', it is table title, not column
    rows = table.findAll('tr')[1:]
    # first 'th' is 'Starters', but will be changed into the player names
    headers = rows[0].findAll('th')
    # provide column names
    headerlist = [h.text.strip() for h in headers]
    
    # ignore first row (headers)
    data = rows[1:]
    # get names column
    player_names = [row.find('th').text.strip() for row in rows]
    # get player stats
    player_stats = [[stat.text.strip() for stat in row.findAll('td')] for row in data]
    # add player name as first entry in each row
    for i in range(len(player_stats)):
        # ignore header with i+1
        player_stats[i].insert(0, player_names[i+1])
    
    # create player stats dataframe
    player_box_df = pd.DataFrame(player_stats, columns=headerlist)
    # drop 'Reserves' row
    player_box_df.drop(player_box_df[player_box_df['Starters'] == 'Reserves'].index, inplace=True)
    
    # add game id column
    player_box_df.insert(loc=0, column='game_id', value=game_id)
    
    # create team stats dataframe from last row in player stats
    team_box_df = pd.DataFrame(player_box_df.iloc[-1]).T
    
    #drop team totals from player stats df
    player_box_df = player_box_df[:-1].rename(columns={'Starters': 'player'})

    return player_box_df, team_box_df
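
The row handling in `create_boxscores` can be pictured without pandas: drop the 'Reserves' divider row, treat the last remaining row as the team totals, and keep the rest as player rows. The sketch below works on plain lists; `split_rows` is a hypothetical helper and the row values are placeholders, not real statistics.

```python
def split_rows(rows):
    """Drop the 'Reserves' divider, then split the trailing totals row
    from the player rows. Hypothetical helper for illustration only."""
    kept = [row for row in rows if row[0] != "Reserves"]
    return kept[:-1], kept[-1]


# Placeholder rows: name, minutes played, points.
rows = [
    ["Player A", "30:00", "5"],
    ["Reserves"],
    ["Player B", "12:00", "2"],
    ["Team Totals", "240", "7"],
]
players, totals = split_rows(rows)
print(players)
print(totals)
```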

In [None]:
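def merge_boxscores(boxscore_list, team_ids, scope):
    '''
    Combine each basic/advanced boxscore pair into one dataframe and label it with its team id.
    scope is either 'team' or 'player'.
    '''
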
    # create tuple for every 2 boxscores in list
    pairs = [((boxscore_list[i]), (boxscore_list[i + 1])) for i in range(0, len(boxscore_list), 2)]
    
    clean_boxscores= []
    
    for pair in pairs:
        
        # combine regular and adv boxscores
        df = pd.concat([*pair], axis=1)
        # drop columns with duplicate names
        df = df.loc[:,~df.columns.duplicated()].copy()
        
        clean_boxscores.append(df)
    
    for i in range(len(clean_boxscores)):
        
        if scope=='team':
            clean_boxscores[i].rename(columns={'Starters': 'team'}, inplace=True)
            clean_boxscores[i]['team'] = team_ids[i]
            
        elif scope=='player':
            clean_boxscores[i].insert(loc=2, column='team', value=team_ids[i])
    
    return clean_boxscores
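
The two ideas inside `merge_boxscores` (pairing consecutive tables, then dropping repeated columns while keeping the first occurrence) can be sketched with plain Python containers. `pair_up` and `merge_keep_first` are hypothetical names, and the dictionary values are placeholders. One difference worth noting: `zip` silently ignores a trailing unpaired item, whereas the index-based pairing in the function above would raise an `IndexError` on an odd-length list.

```python
def pair_up(items):
    """Group consecutive items into pairs: [a, b, c, d] -> [(a, b), (c, d)]."""
    return list(zip(items[0::2], items[1::2]))


def merge_keep_first(basic, advanced):
    """Merge two column->value mappings, keeping the first value for repeated columns."""
    merged = dict(basic)
    for column, value in advanced.items():
        merged.setdefault(column, value)
    return merged


tables = ["away_box", "away_box_adv", "home_box", "home_box_adv"]
print(pair_up(tables))
# → [('away_box', 'away_box_adv'), ('home_box', 'home_box_adv')]

basic = {"player": "Player A", "MP": "30:00", "PTS": "20"}
advanced = {"player": "Player A", "MP": "30:00", "TS%": ".600"}
merged = merge_keep_first(basic, advanced)
print(list(merged))
# → ['player', 'MP', 'PTS', 'TS%']
```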

In [None]:
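def change_dtypes(df, num_columns):
    '''
    Convert the stat columns to floats. Empty strings are swapped for a -99 placeholder
    before the cast and set to NaN afterwards.
    '''
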
    df.replace(to_replace='', value='-99', inplace=True)
    
    for column in num_columns:
        df[column] = df[column].astype('float64')
        
    df.replace(to_replace=-99, value=np.nan, inplace=True)
    
    return df
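
As an alternative illustration of the same goal (empty cells become missing values, everything else becomes a float), the sketch below converts cell by cell without a sentinel value. This is not the approach used in `change_dtypes`; `to_float` is a hypothetical helper and the cell values are placeholders.

```python
import math


def to_float(text):
    """Convert a scraped cell to float; empty or missing cells become NaN."""
    if text is None or text.strip() == "":
        return math.nan
    return float(text)


cells = ["12", "", ".456", None, "-3"]
values = [to_float(cell) for cell in cells]
print(values)
```

Avoiding a sentinel means a legitimate stat value can never be mistaken for a missing one.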

In [None]:
def create_PIE(player_boxes, totals):
    
    def pie_terms(box):
        # numerator and denominator share the same combination of boxscore stats
        return (box['PTS'] + box['FG'] + box['FT'] - box['FGA'] - box['FTA']
                + box['DRB'] + (0.5*box['ORB']) + box['AST'] + box['STL']
                + (0.5*box['BLK']) - box['PF'] - box['TOV'])
    
    # player share of the game totals, as a percentage rounded to one decimal
    player_boxes['PIE'] = round((100 * pie_terms(player_boxes) / pie_terms(totals)), 1)
    
    return player_boxes
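
A worked example may help check the PIE arithmetic by hand. The sketch below applies the same combination of stats to plain dictionaries; `pie_terms` is a name introduced for this illustration, and the stat lines are made-up placeholder numbers, not data from any real game.

```python
def pie_terms(box):
    """Combination of boxscore stats used in both the PIE numerator and denominator."""
    return (box["PTS"] + box["FG"] + box["FT"] - box["FGA"] - box["FTA"]
            + box["DRB"] + 0.5 * box["ORB"] + box["AST"] + box["STL"]
            + 0.5 * box["BLK"] - box["PF"] - box["TOV"])


# Placeholder stat lines (one player, and both teams combined).
player = {"PTS": 20, "FG": 8, "FT": 3, "FGA": 15, "FTA": 4, "DRB": 6,
          "ORB": 2, "AST": 5, "STL": 1, "BLK": 2, "PF": 3, "TOV": 2}
game = {"PTS": 210, "FG": 80, "FT": 35, "FGA": 170, "FTA": 45, "DRB": 70,
        "ORB": 20, "AST": 50, "STL": 15, "BLK": 10, "PF": 40, "TOV": 25}

pie = round(100 * pie_terms(player) / pie_terms(game), 1)
print(pie_terms(player), pie_terms(game), pie)
# → 21.0 195.0 10.8
```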

In [None]:
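# connect to sql database
conn = sqlite3.connect('data/temp/NBA-Game-Database-temp')
# Selenium 4 expects the driver path wrapped in a Service object rather than passed positionally
driver = webdriver.Chrome(service=webdriver.chrome.service.Service(ChromeDriverManager().install()))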

In [None]:
for i in range(len(precovid_seasons)):
    
    season_id = precovid_seasons[i]
    season_gamecount = 1
    start_url = 'https://www.basketball-reference.com/leagues/NBA_' + precovid_url_years[i] + '_games.html'
    
    # open the season schedule page
    driver.get(start_url)
    # delay between each server call (a bare `delay` statement does not pause)
    time.sleep(np.random.randint(3, 6))
    src = driver.page_source
    # create beautiful soup object from html/xml
    parser = BeautifulSoup(src, 'lxml')
    
    # every month from the season
    months = parser.find('div', attrs = {'class': 'filter'})
    # partial urls for each month
    links = months.findAll('a')
    # full urls for each month
    month_links = [base_url + link['href'] for link in links]
    # only include regular season months (oct-apr)
    month_links = month_links[0:7]
    
    for month_url in month_links:
        
        # create new browser instance to reduce chance of interruptions
        driver.quit()
        driver = webdriver.Chrome(service=webdriver.chrome.service.Service(ChromeDriverManager().install()))
        time.sleep(np.random.randint(3, 6))
        driver.get(month_url)
        time.sleep(np.random.randint(3, 6))
        src = driver.page_source
        parser = BeautifulSoup(src, 'lxml')
        table = parser.find('div', attrs = {'class': 'table_container is_setup'})
        
        # check if final month (apr). if true, set limit for game_urls before playoffs start
        row_num = None
        splits = table.findAll('tr', attrs = {'class': 'thead'})
        for split in splits:
            if 'Playoffs' in split.text:
                row_num = int(split['data-row'])
                
        # get partial urls of every game in the month (if apr, stop before playoffs)
        if row_num is None:
            game_partial_urls = table.findAll('td', attrs = {'class': 'center', 'data-stat': 'box_score_text'})
        else:
            game_partial_urls = table.findAll('td', attrs = {'class': 'center', 'data-stat': 'box_score_text'}, limit=row_num)
        
        game_urls = [base_url + url.a['href'] for url in game_partial_urls]
        
        # open every game url, retrieve and manipulate data, add to sql database
        for i in range(len(game_urls)):
    
            driver.get(game_urls[i])
            delay
            src = driver.page_source
            parser = BeautifulSoup(src, 'lxml')
            
            # game_info database:
            
            id_table = parser.find('table', attrs = {'class': 'suppress_all stats_table', 'id': 'line_score'})
            game_info = create_game_info(url=game_urls[i],
                                         season_id=season_id,
                                         season_gamecount=season_gamecount)
            # will use game_id with create_boxscores()
            game_id = game_info[0]
            team_info = create_team_info(id_table)
            # will use team_ids with merge_boxscores()
            team_ids = [team_info[0], team_info[2]]
            
            info_df = create_info_df(game_info=game_info,
                                     team_info=team_info,
                                     info_columns=info_columns)
            # write game info to sql database
            info_df.to_sql('game_info', con=conn, if_exists='append', index=False)

            # team/player databases:
            
            # 4 boxscore tables : away_box, away_box_adv, home_box, home_box_adv
            stat_tables = parser.findAll('table', attrs = {'class': 'sortable stats_table now_sortable'})
            
            player_box_list = [None, None, None, None]
            team_box_list = [None, None, None, None]

            # create team and player boxscores
            for table_num, stat_table in enumerate(stat_tables):
                # split player and team boxscores
                player_box_list[table_num], team_box_list[table_num] = create_boxscores(stat_table, game_id=game_id)
            
            # team_stats database:
            
            # combine boxscore and advanced boxscore for each team
            away_team_box, home_team_box = merge_boxscores(team_box_list, team_ids=team_ids, scope='team')
            team_boxes = pd.concat([away_team_box, home_team_box])
            team_boxes.reset_index(drop=True, inplace=True)
            # prepare numeric data
            team_boxes = change_dtypes(team_boxes, num_columns)
            # write to sql database
            team_boxes.to_sql('team_stats', con=conn, if_exists='append', index=False)
            
            # player_stats database:
            
            # combine boxscore and advanced boxscore for each team
            away_player_box, home_player_box = merge_boxscores(player_box_list, team_ids=team_ids, scope='player')
            player_boxes = pd.concat([away_player_box, home_player_box])
            player_boxes.reset_index(drop=True, inplace=True)
            # prepare numeric data
            player_boxes = change_dtypes(player_boxes, num_columns)
            # create team totals for PIE calculation
            totals = dict(team_boxes.loc[:,'FG':'PTS'].sum())
            # add PIE column to player boxscore
            player_boxes = create_PIE(player_boxes, totals)
            # write to sql database
            player_boxes.to_sql('player_stats', con=conn, if_exists='append', index=False)

            # increase gamecount to create next game_id
            season_gamecount += 1
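
The regular-season cut-off in the final month can be pictured as "keep game rows until the first 'Playoffs' header row". The sketch below is a simplified stand-in that works on a list of dictionaries; it does not reproduce the `data-row` and `limit` mechanics used in the loop above, and `regular_season_rows`, the `kind`/`text` keys, and the row contents are all hypothetical.

```python
def regular_season_rows(rows):
    """Return the game rows that come before the first 'Playoffs' header row."""
    kept = []
    for row in rows:
        if row["kind"] == "header" and "Playoffs" in row["text"]:
            break  # everything after this header belongs to the playoffs
        if row["kind"] == "game":
            kept.append(row["text"])
    return kept


# Placeholder schedule rows in page order.
schedule = [
    {"kind": "game", "text": "game 1"},
    {"kind": "game", "text": "game 2"},
    {"kind": "header", "text": "Playoffs"},
    {"kind": "game", "text": "playoff game 1"},
]
print(regular_season_rows(schedule))
# → ['game 1', 'game 2']
```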


In [None]:
for i in range(len(postcovid_seasons)):
    
    season_id = postcovid_seasons[i]
    season_gamecount = 1
    start_url = 'https://www.basketball-reference.com/leagues/NBA_' + postcovid_url_years[i] + '_games.html'
    
    # open the season schedule page
    driver.get(start_url)
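    # delay between each server call (a bare `delay` statement does not pause)
    time.sleep(np.random.randint(3, 6))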
    src = driver.page_source
    # create beautiful soup object from html/xml
    parser = BeautifulSoup(src, 'lxml')
    
    # every month from the season
    months = parser.find('div', attrs = {'class': 'filter'})
    # partial urls for each month
    links = months.findAll('a')
    # full urls for each month
    month_links = [base_url + link['href'] for link in links]
    # only include regular season months
    month_links = month_links[0:post_covid_season_dict[season_id]['month_len']]
    
    for month_url in month_links:
        
        driver.quit()
        driver = webdriver.Chrome(service=webdriver.chrome.service.Service(ChromeDriverManager().install()))
        time.sleep(np.random.randint(3, 6))
        
        driver.get(month_url)
        time.sleep(np.random.randint(3, 6))
        src = driver.page_source
        parser = BeautifulSoup(src, 'lxml')
        table = parser.find('div', attrs = {'class': 'table_container is_setup'})
        
      
        # check if final month. if true, set limit for game_urls before playoffs start        
        if month_url != month_links[-1]:
            game_partial_urls = table.findAll('td', attrs = {'class': 'center', 'data-stat': 'box_score_text'})
        else:
            play_in = table.find('td', string='Play-In Game').find_parent()
            play_in_row = int(play_in['data-row'])
            body= table.find('tbody')
            all_rows = body.findAll('tr', limit=play_in_row)
            
            # keep only rows without a class attribute (e.g. skip 'thead' header rows)
            game_rows = [row for row in all_rows if not row.has_attr('class')]
            game_partial_urls = [row.find(attrs = {'class': 'center', 'data-stat': 'box_score_text'}) for row in game_rows]
        
        # get game_urls
        game_urls = [base_url + url.a['href'] for url in game_partial_urls]
        
        # open every game url, retrieve and manipulate data, add to sql database
        for i in range(len(game_urls)):
    
            driver.get(game_urls[i])
            delay
            src = driver.page_source
            parser = BeautifulSoup(src, 'lxml')
            
            # game_info database:
            
            id_table = parser.find('table', attrs = {'class': 'suppress_all stats_table', 'id': 'line_score'})
            game_info = create_game_info(url=game_urls[i],
                                         season_id=season_id,
                                         season_gamecount=season_gamecount)
            # will use game_id with create_boxscores()
            game_id = game_info[0]
            team_info = create_team_info(id_table)
            # will use team_ids with merge_boxscores()
            team_ids = [team_info[0], team_info[2]]
            info_df = create_info_df(game_info=game_info,
                                     team_info=team_info,
                                     info_columns=info_columns)
            # write to sql database
            info_df.to_sql('game_info', con=conn, if_exists='append', index=False)

            # team/player databases:
            
            # 4 boxscore tables : away_box, away_box_adv, home_box, home_box_adv
            stat_tables = parser.findAll('table', attrs = {'class': 'sortable stats_table now_sortable'})
            
            player_box_list = [None, None, None, None]
            team_box_list = [None, None, None, None]

            # create team and player boxscores
            for table_num, stat_table in enumerate(stat_tables):
                # split player and team boxscores
                player_box_list[table_num], team_box_list[table_num] = create_boxscores(stat_table, game_id=game_id)
            
            # team_stats database:
            
            # combine boxscore and advanced boxscore for each team
            away_team_box, home_team_box = merge_boxscores(team_box_list, team_ids=team_ids, scope='team')
            team_boxes = pd.concat([away_team_box, home_team_box])
            team_boxes.reset_index(drop=True, inplace=True)
            # prepare numeric data
            team_boxes = change_dtypes(team_boxes, num_columns)
            # write to sql database
            team_boxes.to_sql('team_stats', con=conn, if_exists='append', index=False)
            
            # player_stats database:
            
            # combine boxscore and advanced boxscore for each team
            away_player_box, home_player_box = merge_boxscores(player_box_list, team_ids=team_ids, scope='player')
            player_boxes = pd.concat([away_player_box, home_player_box])
            player_boxes.reset_index(drop=True, inplace=True)
            # prepare numeric data
            player_boxes = change_dtypes(player_boxes, num_columns)
            # create team totals for PIE calculation
            totals = dict(team_boxes.loc[:,'FG':'PTS'].sum())
            # add PIE column to player boxscore
            player_boxes = create_PIE(player_boxes, totals)
            # write to sql database
            player_boxes.to_sql('player_stats', con=conn, if_exists='append', index=False)

            # increase gamecount to create next game_id
            season_gamecount += 1
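
Once the three tables (`game_info`, `team_stats`, `player_stats`) are populated, they can be joined on `game_id`. The sketch below shows the idea against an in-memory SQLite database so it runs without the scraped file; the table and column names match those written above, but only a few columns are created and the rows are placeholders. `demo_conn` is a hypothetical name chosen to avoid clashing with the notebook's `conn`.

```python
import sqlite3

# In-memory database with a reduced version of two of the notebook's tables.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE game_info (game_id INTEGER, home_team TEXT, away_team TEXT)")
demo_conn.execute(
    "CREATE TABLE player_stats (game_id INTEGER, player TEXT, team TEXT, PTS REAL)")

# Placeholder rows, not real game data.
demo_conn.execute("INSERT INTO game_info VALUES (1, 'HOM', 'AWY')")
demo_conn.executemany(
    "INSERT INTO player_stats VALUES (?, ?, ?, ?)",
    [(1, "Player A", "HOM", 20.0), (1, "Player B", "AWY", 12.0)],
)

# Points scored by home-team players in each game.
rows = demo_conn.execute(
    """
    SELECT g.game_id, p.player, p.PTS
    FROM game_info AS g
    JOIN player_stats AS p
      ON p.game_id = g.game_id AND p.team = g.home_team
    ORDER BY p.PTS DESC
    """
).fetchall()
demo_conn.close()
print(rows)
# → [(1, 'Player A', 20.0)]
```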