# NBA Game Prediction (1) - Data Collection

## Introduction

This notebook collects the raw data for a project that predicts the outcome of NBA games. It gathers two things from basketball-reference.com: the monthly schedule pages for each season (referred to as "standings" throughout) and the box score page of every game. Playwright's async API (async_playwright) loads each page in a browser, and BeautifulSoup parses the returned HTML.

## Importing Libraries

We import the libraries the notebook needs: os for building file paths and checking for existing files, BeautifulSoup for parsing HTML, async_playwright for driving a browser, and time for pausing between requests. Playwright's TimeoutError is imported as PlaywrightTimeout so that it is not confused with Python's built-in TimeoutError.

In [None]:
import os 
from bs4 import BeautifulSoup
from playwright.async_api import async_playwright, TimeoutError as PlaywrightTimeout
import time

In [35]:
# %pip install beautifulsoup4

In [36]:
# %pip install playwright

In [37]:
# !playwright install

## Defining Constants and Variables

Here, we define some constants. SEASONS lists the seasons to scrape, 2016 through 2022 (range excludes its end value); basketball-reference.com labels a season by the year in which it ends. DATA_DIR, STANDINGS_DIR, and SCORES_DIR are the directories where the scraped HTML is stored.

In [38]:
SEASONS = list(range(2016, 2023))

In [39]:
# DATA_DIR = "NBA Game Prediction/data"
DATA_DIR = "data"
STANDINGS_DIR = os.path.join(DATA_DIR, "standings")
SCORES_DIR = os.path.join(DATA_DIR, "scores")
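
The scraping functions below write files into STANDINGS_DIR and SCORES_DIR, so both directories have to exist before the first page is saved. A minimal sketch of creating them, assuming the same folder layout as above; the temporary base directory is hypothetical and only keeps the example from touching the real project tree:

```python
import os
import tempfile

# Hypothetical base directory, used only so this example is self-contained.
base_dir = tempfile.mkdtemp()

data_dir = os.path.join(base_dir, "data")
standings_dir = os.path.join(data_dir, "standings")
scores_dir = os.path.join(data_dir, "scores")

# exist_ok=True makes the call safe to re-run; missing parent folders are created too.
for directory in (standings_dir, scores_dir):
    os.makedirs(directory, exist_ok=True)

created = sorted(os.listdir(data_dir))
print(created)
```

In the notebook itself, the same two os.makedirs calls could be applied to STANDINGS_DIR and SCORES_DIR directly.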

## Defining the get_html Function
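get_html loads a page in a Playwright-controlled browser and returns the inner HTML of the element matched by a CSS selector. It waits sleep * i seconds before attempt i, retries up to retries times when Playwright raises a timeout, and returns None if every attempt fails.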

In [40]:
async def get_html(url, selector, sleep=5, retries=3):
    """Return the inner HTML of `selector` on `url`, or None if every attempt times out."""
    html = None
    for i in range(1, retries + 1):
        # Wait longer before each successive attempt to go easy on the site.
        # time.sleep blocks the event loop, which is acceptable here because
        # pages are fetched one at a time.
        time.sleep(sleep * i)

        try:
            async with async_playwright() as p:
                # Use Firefox instead of Chromium (p.chromium.launch()) if
                # get_html(url, "#content .filter") raises a timeout error.
                browser = await p.firefox.launch()
                page = await browser.new_page()
                await page.goto(url)
                print(await page.title())
                html = await page.inner_html(selector)
        except PlaywrightTimeout:
            print(f"Timeout error on {url}")
            continue
        else:
            break
    return html
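
The retry pattern in get_html can be tried without a browser. The sketch below keeps the same shape (wait sleep * i before attempt i, stop after retries attempts, return None on failure) but makes three substitutions that are assumptions for illustration only: a hypothetical flaky_fetch coroutine replaces the Playwright calls, the built-in TimeoutError stands in for PlaywrightTimeout, and asyncio.sleep is used as a non-blocking alternative to time.sleep.

```python
import asyncio

attempts = []  # records the attempt numbers that were made


async def flaky_fetch(url):
    """Hypothetical stand-in for the browser call: fails twice, then succeeds."""
    attempts.append(len(attempts) + 1)
    if len(attempts) < 3:
        raise TimeoutError(f"simulated timeout on {url}")
    return "<div>ok</div>"


async def fetch_with_retries(url, sleep=5, retries=3):
    html = None
    for i in range(1, retries + 1):
        # Linear backoff: the wait grows with the attempt number.
        await asyncio.sleep(sleep * i)
        try:
            html = await flaky_fetch(url)
        except TimeoutError:
            continue  # try again, up to `retries` times
        else:
            break  # success, stop retrying
    return html


# sleep=0 keeps the demonstration instant. In a notebook cell, write
# `await fetch_with_retries(...)` instead of using asyncio.run.
result = asyncio.run(fetch_with_retries("https://example.com", sleep=0))
print(result, attempts)
```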

## Defining the scrape_season Function
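scrape_season collects the monthly schedule pages for one season (the "standings" files). It reads the month links from the season's schedule page, then saves the #all_schedule section of each monthly page to STANDINGS_DIR, skipping any page that is already on disk.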

In [41]:
import logging
from tqdm.notebook import tqdm

# Set up logging
logging.basicConfig(filename='seasons_scraping.log', level=logging.INFO)

async def scrape_season(season):
    url = f"https://www.basketball-reference.com/leagues/NBA_{season}_games.html"
    html = await get_html(url, "#content .filter")
    if not html:
        # get_html returns None when every attempt timed out
        logging.warning(f'Could not load the schedule page for season {season}')
        return

    # Name the parser explicitly so BeautifulSoup does not have to guess one
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a")
    hrefs = [l["href"] for l in links]
    standings_pages = [f"https://www.basketball-reference.com{h}" for h in hrefs]

    for url in tqdm(standings_pages, desc=f'Scraping season {season}'):
        # The last URL segment doubles as the file name; skip pages already saved
        save_path = os.path.join(STANDINGS_DIR, url.split("/")[-1])
        if os.path.exists(save_path):
            continue

        html = await get_html(url, "#all_schedule")
        if not html:
            continue
        with open(save_path, "w+") as f:
            f.write(html)

        # Log the successful scrape
        logging.info(f'Successfully scraped {url}')
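
The skip-if-exists check is what makes the scrape resumable: a page that was saved by an earlier run is never requested again. A minimal sketch of that logic on its own, assuming a hypothetical temporary directory and two illustrative URLs in place of STANDINGS_DIR and the real link list:

```python
import os
import tempfile

# Hypothetical output directory and URLs, used only for this illustration.
standings_dir = tempfile.mkdtemp()
urls = [
    "https://www.basketball-reference.com/leagues/NBA_2016_games-october.html",
    "https://www.basketball-reference.com/leagues/NBA_2016_games-november.html",
]

# Pretend the October page was saved by an earlier run.
with open(os.path.join(standings_dir, "NBA_2016_games-october.html"), "w") as f:
    f.write("<table></table>")

to_fetch = []
for url in urls:
    # The last path segment of the URL doubles as the local file name.
    save_path = os.path.join(standings_dir, url.split("/")[-1])
    if os.path.exists(save_path):
        continue  # already on disk, so no request is needed
    to_fetch.append(url)

print(to_fetch)
```

Only the November URL remains in to_fetch, so re-running the loop after an interruption picks up where it left off.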


## Scraping Season Standings
Here, we loop over each season in SEASONS and scrape its monthly schedule pages. Top-level await works in this cell because Jupyter runs cells inside an event loop.

In [None]:
# code to execute scraping seasons
for season in SEASONS:
    await scrape_season(season)    

In [42]:
# Print the current working directory of the notebook
import os
print(os.getcwd())


/Users/ruijiezheng/DS Project/NBA Game Prediction


## Getting the List of Standings Files
We get a list of the standings files we've saved.

In [43]:
standings_files = os.listdir(STANDINGS_DIR)

In [44]:
standings_files

['NBA_2022_games-october.html',
 'NBA_2021_games-june.html',
 'NBA_2020_games-march.html',
 'NBA_2020_games-september.html',
 'NBA_2020_games-january.html',
 'NBA_2020_games-august.html',
 'NBA_2019_games-april.html',
 'NBA_2022_games-may.html',
 'NBA_2019_games-february.html',
 'NBA_2018_games-february.html',
 'NBA_2016_games-april.html',
 'NBA_2021_games-march.html',
 'NBA_2018_games-january.html',
 'NBA_2017_games-february.html',
 'NBA_2016_games-february.html',
 'NBA_2017_games-october.html',
 'NBA_2018_games-april.html',
 'NBA_2020_games-december.html',
 'NBA_2019_games-october.html',
 'NBA_2020_games-november.html',
 'NBA_2021_games-may.html',
 'NBA_2021_games-december.html',
 'NBA_2022_games-april.html',
 'NBA_2020_games-october-2019.html',
 'NBA_2022_games-december.html',
 'NBA_2017_games-april.html',
 'NBA_2022_games-november.html',
 'NBA_2016_games-january.html',
 'NBA_2018_games-october.html',
 'NBA_2017_games-march.html',
 'NBA_2021_games-february.html',
 'NBA_2020_games-fe

## Defining the scrape_game Function
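scrape_game takes the path of one saved standings file, extracts the box score links from it, and saves the #content section of each game page to SCORES_DIR. One call therefore covers every game listed in that file, which is one month of a season, and games that are already on disk are skipped.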

In [45]:
import logging
from tqdm.notebook import tqdm

# Set up logging. force=True (Python 3.8+) replaces any handlers already attached
# to the root logger; without it, a second basicConfig call is silently ignored.
logging.basicConfig(filename='scores_scraping.log', level=logging.INFO, force=True)

async def scrape_game(standings_file):
    # Read one saved standings page from disk
    with open(standings_file, 'r') as f:
        html = f.read()

    # Name the parser explicitly so BeautifulSoup does not have to guess one
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a")
    hrefs = [l.get("href") for l in links]
    # Keep only links that point to individual box score pages
    box_scores = [l for l in hrefs if l and "boxscore" in l and ".html" in l]
    # Turn the relative links into full URLs
    box_scores = [f"https://www.basketball-reference.com{l}" for l in box_scores]

    for url in tqdm(box_scores, desc='Scraping games'):
        save_path = os.path.join(SCORES_DIR, url.split("/")[-1])
        if os.path.exists(save_path):
            continue

        html = await get_html(url, "#content")
        if not html:
            continue
        with open(save_path, "w+") as f:
            f.write(html)

        # Log the successful scrape
        logging.info(f'Successfully scraped {url}')
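
The link filter inside scrape_game can be checked on its own. The sketch below applies the same condition to a made-up list of href values (the game identifiers are illustrative, not taken from the saved files); None represents an anchor tag without an href attribute, which is why the filter tests for a truthy value first.

```python
# Hypothetical href values, shaped like what l.get("href") can return.
hrefs = [
    "/boxscores/201510270ATL.html",  # a box score page: kept
    "/teams/ATL/2016.html",          # a team page: dropped
    None,                            # <a> without href: dropped
    "/boxscores/",                   # index page, no .html: dropped
    "/boxscores/201510270CHI.html",  # a box score page: kept
]

# Same condition as in scrape_game.
box_scores = [h for h in hrefs if h and "boxscore" in h and ".html" in h]

# Relative links become full URLs.
box_score_urls = [f"https://www.basketball-reference.com{h}" for h in box_scores]
print(box_score_urls)
```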

In [48]:
# Filter out any non-HTML entries (for example, hidden system files)
standings_files = [s for s in standings_files if ".html" in s]
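
The filter above keeps any name that contains ".html". A slightly stricter variant, sketched here on a made-up directory listing rather than the real one, checks the extension with str.endswith; .DS_Store is the kind of hidden entry macOS can add to a folder.

```python
# Hypothetical directory listing, used only for illustration.
listing = [
    "NBA_2016_games-april.html",
    ".DS_Store",
    "NBA_2017_games-march.html",
    "NBA_2017_games-march.html.bak",
]

# endswith(".html") rejects names that merely contain ".html" in the middle.
html_files = [name for name in listing if name.endswith(".html")]
print(html_files)
```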

## Scraping Box Scores
Here, we loop over each file in standings_files and scrape the box scores for the games listed in that file.

In [49]:
# Scrape the box scores listed in each standings file
for f in standings_files:
    filepath = os.path.join(STANDINGS_DIR, f)
    await scrape_game(filepath)

## Conclusion
This notebook scrapes the monthly schedule ("standings") pages and the box score page of each game from basketball-reference.com and saves the raw HTML to the standings and scores directories, respectively. In the next step of the project, this HTML is parsed and transformed into a structured format for further analysis.