# Data Collection

The purpose of this notebook is to scrape basketball-reference.com. Using custom functions for specific parts of the site, the code visits every game from the 2018 season through the 2024 season. Playwright loads each page and returns its HTML, and BeautifulSoup parses that HTML to find the links to follow. Each game's HTML file is then saved under the 'data' directory.

In [2]:
import os
import requests
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeout
import time

In [None]:
!playwright install

### Pointer Logic

- Define path constants so that the .html files are saved to the correct directories

In [None]:
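DATA_DIR = "data"
STANDINGS_DIR = os.path.join(DATA_DIR, "standings")
SCORES_DIR = os.path.join(DATA_DIR, "scores")
# create both output directories up front so later writes do not fail
for directory in (STANDINGS_DIR, SCORES_DIR):
    os.makedirs(directory, exist_ok=True)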

### Get HTML
- Uses Playwright to launch a browser instance and navigate to the URL passed in as a parameter. Given a CSS selector (for example an element id or class), it returns the inner HTML of the matching element.
- Retries handle the case where Playwright raises a PlaywrightTimeout error. We make up to 3 attempts and pause for roughly 3, 6, or 9 seconds before the first, second, and third attempt respectively. If every attempt times out, the function returns None.

In [None]:
async def get_html(url, selector, sleep=3.01, retries=3):
    html = None
    for i in range(1, retries + 1):
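        # wait before each attempt to avoid getting rate-limited; the pause grows with each retry
        # (time.sleep blocks the event loop, which is acceptable here because pages are fetched one at a time)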
        time.sleep(sleep * i)

        try:
            async with async_playwright() as p:
                # launch the browser instance with playwright
                browser = await p.firefox.launch()

                # create a new tab within the browser
                page = await browser.new_page()

                # navigate to the url in the new tab
                await page.goto(url)

                print(await page.title())

                html = await page.inner_html(selector)

        except PlaywrightTimeout:
            print(f"Timeout error on {url}")
            continue
            
        else:
            break
    
    return html
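
The pause schedule described above can be sketched on its own. This is a minimal illustration rather than part of the scraper: `backoff_delays` is a hypothetical helper name, and it only reproduces the `sleep * i` arithmetic from `get_html()`.

```python
def backoff_delays(sleep=3.01, retries=3):
    """Return the pause (in seconds) applied before each attempt.

    Mirrors the `sleep * i` expression in get_html(), where i counts
    attempts starting at 1, so the wait grows linearly.
    """
    return [sleep * i for i in range(1, retries + 1)]


delays = backoff_delays()
# worst case: every attempt times out, so all pauses are incurred
total_wait = sum(delays)
```

With the default arguments, a page that never loads costs about 18 seconds of waiting on top of Playwright's own timeouts, which is worth keeping in mind when scraping several thousand pages.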

### Scrape Season

- On basketball-reference.com, each NBA season's schedule is separated into months, and each month lives at its own URL. To scrape every month in a season, we first load the season's schedule page and collect the href of each month link from the filter bar.
- We then visit each month's page, download its HTML, and save it to the 'standings' subdirectory of the 'data' folder.

In [None]:
async def scrape_season(season):
    url = f"https://www.basketball-reference.com/leagues/NBA_{season}_games.html"
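    html = await get_html(url, "#content .filter")
    if not html:
        # get_html() returns None when every attempt timed out
        return

    # name the parser explicitly so results do not depend on which parsers are installed
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a")
    hrefs = [l["href"] for l in links]
    # use the same www host as the season URL above
    standings_pages = [f"https://www.basketball-reference.com{l}" for l in hrefs]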
    
    for url in standings_pages:
        save_path = os.path.join(STANDINGS_DIR, url.split("/")[-1])
        if os.path.exists(save_path):
            continue

        html = await get_html(url, "#all_schedule")
        if not html:
            # skip this month if every attempt timed out; f.write(None) would raise
            continue
        with open(save_path, "w+") as f:
            f.write(html)
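
The URL and file-path handling in `scrape_season()` can be illustrated in isolation. The sketch below assumes a month href of the form `/leagues/NBA_2018_games-april.html` (inferred from the season URL and the saved file names, not confirmed against the live site); `month_url_and_path` and its `base` parameter are hypothetical names used only for this example.

```python
import os

# same directory layout as the notebook's path constants
STANDINGS_DIR = os.path.join("data", "standings")


def month_url_and_path(href, base="https://www.basketball-reference.com"):
    """Build the absolute month URL and the local path it is saved to."""
    url = f"{base}{href}"
    # the last URL segment doubles as the file name on disk
    save_path = os.path.join(STANDINGS_DIR, url.split("/")[-1])
    return url, save_path


# assumed href shape for one month of the 2018 season
url, save_path = month_url_and_path("/leagues/NBA_2018_games-april.html")
```

Because the file name is derived from the URL, the `os.path.exists(save_path)` check lets an interrupted run resume without downloading the same month twice.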

### Scrape Season Execution

- Since we are interested in the 2018 through 2024 seasons, we create a list of seasons (range(2018, 2025) excludes its endpoint, so the last season is 2024) and pass each one to our scrape_season() function

In [None]:
seasons = list(range(2018, 2025))
for season in seasons:
    await scrape_season(season)

### Standings Files

- Each 'standing file' holds one month's schedule. In other words, it is an HTML file that contains an overview of each game played in that month. For example, "NBA_2018_games-april.html" contains an overview of each game played in April of the 2018 NBA season.

In [None]:
standings_files = os.listdir(STANDINGS_DIR)
standings_files = [s for s in standings_files if "html" in s]
standings_files.sort(reverse=True)

standings_files

### Scrape Game

- Now that we have each month's schedule for every season, we need to iterate through the games themselves. We first open each 'standing file' and parse it to find each game's box score href, that is, the link that leads to the game's box score page. We then call get_html() with the "#content" selector to scrape each individual game.

In [None]:
async def scrape_game(standings_file):
    with open(standings_file, 'r') as f:
        html = f.read()

    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a")
    hrefs = [l.get('href') for l in links]
    box_scores = [f"https://www.basketball-reference.com{l}" for l in hrefs if l and "boxscore" in l and '.html' in l]

    for url in box_scores:
        save_path = os.path.join(SCORES_DIR, url.split("/")[-1])
        if os.path.exists(save_path):
            continue

        html = await get_html(url, "#content")
        if not html:
            continue
        with open(save_path, "w+") as f:
            f.write(html)
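
The link filter in `scrape_game()` can be checked without touching the network. This is a minimal sketch: `box_score_urls` is a hypothetical helper that repeats the same condition, and the sample hrefs are illustrative values, not links taken from a saved file.

```python
def box_score_urls(hrefs, base="https://www.basketball-reference.com"):
    """Keep only hrefs that look like individual box score pages."""
    return [f"{base}{h}" for h in hrefs if h and "boxscore" in h and ".html" in h]


# illustrative hrefs of the kinds a schedule page might contain
sample_hrefs = [
    None,                            # <a> tag without an href attribute
    "/teams/BOS/2018.html",          # team page, not a box score
    "/boxscores/201710170CLE.html",  # assumed shape of a box score link
    "/boxscores/",                   # index page: has "boxscore" but no ".html"
]
urls = box_score_urls(sample_hrefs)
```

The `h and ...` guard matters because `l.get('href')` returns None for anchors without an href, and testing `"boxscore" in None` would raise a TypeError.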

### Scrape Game Execution 

- For every month of every season, we call scrape_game() to save the HTML file of each game played that month.

In [None]:
for f in standings_files:
    filepath = os.path.join(STANDINGS_DIR, f)

    await scrape_game(filepath)