# NBA Game Prediction (2) - Parse Data

## Introduction

This notebook parses the HTML box scores scraped in the previous step, extracts the relevant statistics, and assembles them into a structured table for further analysis. We use BeautifulSoup to parse the HTML and pandas to build and manipulate the dataframes.

## Importing Necessary Libraries
We import the libraries we need: os for working with directories and file paths, pandas for data manipulation, and BeautifulSoup for parsing HTML.


In [None]:
import os 
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
print(os.getcwd())


/Users/ruijiezheng/DS Project/NBA Game Prediction


## Defining Constants and Variables
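Here, we define the constants and variables used throughout. SCORES_DIR is the directory where the scraped box scores are saved, and box_scores is the list of box score files in it. Listing the directory returns 1566 entries; after keeping only the .html files, 1564 box scores remain (the total shown by the progress bar further down).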

In [3]:
SCORES_DIR = "data/scores"

In [4]:
box_scores = os.listdir(SCORES_DIR)

In [5]:
len(box_scores)

1566

In [6]:
box_scores = [os.path.join(SCORES_DIR, f) for f in box_scores if f.endswith(".html")]
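
The listing and filtering steps above can be folded into one small helper. The sketch below is only an illustration: list_box_scores is a hypothetical name, the file names are made up, and sorting the paths is an addition that the notebook itself does not do.

```python
import os
import tempfile


def list_box_scores(scores_dir):
    """Return the sorted paths of all .html files in scores_dir (hypothetical helper)."""
    return sorted(
        os.path.join(scores_dir, name)
        for name in os.listdir(scores_dir)
        if name.endswith(".html")
    )


# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["202103030DET.html", "202001160NOP.html", ".DS_Store"]:
        open(os.path.join(tmp, name), "w").close()
    demo_names = [os.path.basename(p) for p in list_box_scores(tmp)]

print(demo_names)
```

Sorting makes the processing order reproducible across machines, since os.listdir returns entries in arbitrary order.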

## Defining the parse_html Function
This function parses the HTML of a single box score file. It reads the file, creates a BeautifulSoup object, and removes the over_header and thead rows, which are extra header rows inside the stats tables that would otherwise end up in the parsed data.

In [7]:
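def parse_html(box_score):
    # Read the raw HTML of one saved box score page
    with open(box_score) as f:
        html = f.read()

    soup = BeautifulSoup(html)
    # Drop the extra header rows so they do not end up in the parsed tables
    for row in soup.select("tr.over_header"):
        row.decompose()
    for row in soup.select("tr.thead"):
        row.decompose()
    return soup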
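
To see what the row removal does, the sketch below runs the same two selectors on a tiny hand-written table. The HTML is invented for illustration and is much simpler than a real box score page; the "html.parser" argument is an assumption, chosen here only because it ships with Python.

```python
from bs4 import BeautifulSoup

# Invented miniature table with one over_header row and one repeated thead row.
html = """
<table id="demo">
  <tr class="over_header"><th colspan="2">Basic Box Score Stats</th></tr>
  <tr><th>Starters</th><th>PTS</th></tr>
  <tr><td>Player A</td><td>10</td></tr>
  <tr class="thead"><th>Reserves</th><th>PTS</th></tr>
  <tr><td>Player B</td><td>4</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Remove both kinds of extra header rows in one pass.
for row in soup.select("tr.over_header, tr.thead"):
    row.decompose()

remaining = [tr.get_text(" ", strip=True) for tr in soup.find_all("tr")]
print(remaining)
```

Only the real header row and the two player rows survive, which is the shape a table reader expects.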

## Defining the read_line_score Function
This function extracts the line score table from the HTML content. It takes a BeautifulSoup object, reads the table with the id line_score into a pandas dataframe, renames the first and last columns to team and total, and keeps only those two columns.


In [8]:
def read_line_score(soup):
    line_score = pd.read_html(str(soup), attrs={"id": "line_score"})[0]
    cols = list(line_score.columns)
    cols[0] = "team"
    cols[-1] = "total"
    line_score.columns = cols
    
    line_score = line_score[["team", "total"]]
    return line_score
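
As a rough illustration of the renaming step in read_line_score, the sketch below applies it to a hand-built dataframe instead of a real page. The team codes, scores, and starting column labels are all made up.

```python
import pandas as pd

# Made-up line score: one unnamed team column, four quarters, and a total.
line_score = pd.DataFrame(
    [["AAA", 25, 30, 20, 25, 100], ["BBB", 22, 28, 27, 21, 98]],
    columns=["Unnamed: 0", "1", "2", "3", "4", "T"],
)

# Rename by position: the first column is the team, the last is the final score.
cols = list(line_score.columns)
cols[0] = "team"
cols[-1] = "total"
line_score.columns = cols

line_score = line_score[["team", "total"]]
print(line_score)
```

Renaming by position rather than by label keeps working when a game goes to overtime and the table gains extra period columns.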

## Defining the read_stats Function
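This function extracts one stats table (either basic or advanced) for one team from the HTML content. The table is looked up by its id, box-{team}-game-{stat}, and every value is converted to a number; entries that cannot be converted, such as "Did Not Play", become NaN.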

In [9]:
def read_stats(soup, team, stat):
    df = pd.read_html(str(soup), attrs = {"id": f"box-{team}-game-{stat}"}, index_col=0)[0]
    df = df.apply(pd.to_numeric, errors="coerce")
    return df
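
The numeric conversion in read_stats has a side effect worth seeing on a small example. In the sketch below, the first two rows abbreviate the output shown later in this notebook; the "Team Totals" label and its values are assumptions about what the final row looks like.

```python
import pandas as pd

raw = pd.DataFrame(
    {"MP": ["42:07", "Did Not Play", "265"], "PTS": ["5", "Did Not Play", "138"]},
    index=["Lonzo Ball", "Zylan Cheatham", "Team Totals"],  # last label is assumed
)

# Same call as in read_stats: anything that is not a number becomes NaN.
clean = raw.apply(pd.to_numeric, errors="coerce")
print(clean)
```

Note that minutes written as mm:ss also turn into NaN, so per-player minutes are lost while the plain numeric team total is kept.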

In [10]:
box_score = box_scores[0]
soup = parse_html(box_score)

line_score = read_line_score(soup)
teams = list(line_score["team"])

summaries = []

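# Read each team's basic box score table to inspect the raw output;
# after the loop, df holds the table of the last team in the list
for team in teams:
    df = pd.read_html(str(soup), attrs = {"id": f"box-{team}-game-basic"}, index_col=0)[0]
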

In [11]:
teams

['UTA', 'NOP']

In [12]:
df

Starters,MP,FG,FGA,FG%,3P,3PA,3P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,+/-
Lonzo Ball,42:07,2,12,.167,1,5,.200,0,0,,0,4,4,13,0,2,2,3,5,+11
Brandon Ingram,41:24,15,25,.600,3,8,.375,16,20,.800,1,7,8,6,0,1,4,3,49,+1
Josh Hart,36:35,3,8,.375,2,5,.400,1,2,.500,1,6,7,0,3,1,1,6,9,+4
Derrick Favors,36:07,10,12,.833,0,0,,1,2,.500,4,7,11,3,2,3,1,3,21,+4
Nicolò Melli,25:55,3,6,.500,1,1,1.000,0,0,,1,3,4,0,0,0,0,2,7,-18
E'Twaun Moore,28:25,7,11,.636,2,3,.667,0,1,.000,0,2,2,1,1,0,2,3,16,+2
Frank Jackson,25:36,3,5,.600,0,0,,4,6,.667,1,2,3,1,1,0,2,0,10,+23
Nickeil Alexander-Walker,16:05,5,11,.455,2,4,.500,0,0,,1,0,1,3,1,1,0,2,12,+2
Jaxson Hayes,12:46,3,3,1.000,0,0,,3,4,.750,0,5,5,2,0,1,1,5,9,+1
Zylan Cheatham,Did Not Play,Did Not Play,Did Not Play,Did Not Play,Did Not Play,Did Not Play,Did Not Play,Did Not Play,Did Not Play,Did Not Play,Did Not Play,Did Not Play,Did Not Play,Did Not Play,Did Not Play,Did Not Play,Did Not Play,Did Not Play,Did Not Play,Did Not Play


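## Defining the read_season_info Function
This function reads the season from the page's bottom navigation. It collects the links inside the element with the id bottom_nav_container, takes the file name of the second link, and keeps the part before the first underscore.

In [13]: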
def read_season_info(soup):
    nav = soup.select("#bottom_nav_container")[0]
    hrefs = [a["href"] for a in nav.find_all("a")]
    season = os.path.basename(hrefs[1]).split("_")[0]
    return season
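
The string handling in read_season_info can be tried in isolation. The sketch below is only that: season_from_href is a hypothetical helper, and the link is invented, since the real link format on the scraped pages is not shown in this notebook.

```python
import os


def season_from_href(href):
    """Keep the part of the file name before the first underscore (hypothetical helper)."""
    return os.path.basename(href).split("_")[0]


# Invented link, used only to show the string operations.
season = season_from_href("/hypothetical/path/2020_games.html")
print(season)
```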
    

## Reading the Box Scores
Here, we loop over each box score in box_scores, parse the HTML, read the line score and both stats tables, and build a two-row dataframe per game: one row per team, with the opponent's values appended as _opp columns. The per-game dataframes are collected in the games list and combined afterwards.

A progress bar and a log file keep track of progress during parsing.

In [15]:
# Parse all box scores, with a progress bar and a log file

import logging
from tqdm.notebook import tqdm

logging.basicConfig(filename='full_game_scraping.log', level=logging.INFO)

base_cols = None
games = []

# Add tqdm to box_scores for progress bar
for box_score in tqdm(box_scores, desc="Processing box scores"):
    soup = parse_html(box_score)
    line_score = read_line_score(soup)
    teams = list(line_score["team"])

    summaries = []

    for team in teams:
        basic = read_stats(soup, team, "basic")
        advanced = read_stats(soup, team, "advanced")

        totals = pd.concat([basic.iloc[-1, :], advanced.iloc[-1,:]])
        totals.index = totals.index.str.lower()

        maxes = pd.concat([basic.iloc[:-1,:].max(), advanced.iloc[:-1,:].max()])
        maxes.index = maxes.index.str.lower() + "_max"

        summary = pd.concat([totals, maxes])

        if base_cols is None:
            base_cols = list(summary.index.drop_duplicates(keep="first"))
            base_cols = [b for b in base_cols if "bpm" not in b]

        summary = summary[base_cols]

        summaries.append(summary)
    summary = pd.concat(summaries, axis=1).T

    game = pd.concat([summary, line_score], axis = 1)

    game["home"] = [0,1]
    game_opp = game.iloc[::-1].reset_index()
    game_opp.columns += "_opp"

    full_game = pd.concat([game, game_opp], axis=1)

    full_game["season"] = read_season_info(soup)

    full_game["date"] = os.path.basename(box_score)[:8]
    full_game["date"] = pd.to_datetime(full_game["date"], format="%Y%m%d")

    full_game["won"] = full_game["total"] > full_game["total_opp"]

    games.append(full_game)

    # Add logging for every 100 games processed
    if len(games) % 100 == 0:
        logging.info(f"Processed {len(games)} out of {len(box_scores)} games.")


Processing box scores:   0%|          | 0/1564 [00:00<?, ?it/s]
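
The per-team summary built inside the loop combines the team totals with the best single-player value in each column. A minimal sketch on made-up numbers, assuming (as the loop does) that the last row of the table holds the team totals:

```python
import pandas as pd

# Made-up box score: two players plus a final totals row.
basic = pd.DataFrame(
    {"PTS": [20.0, 10.0, 30.0], "AST": [3.0, 7.0, 10.0]},
    index=["Player A", "Player B", "Team Totals"],
)

# Last row: the team totals.
totals = basic.iloc[-1, :].copy()
totals.index = totals.index.str.lower()

# All other rows: the highest single-player value per column.
maxes = basic.iloc[:-1, :].max()
maxes.index = maxes.index.str.lower() + "_max"

summary = pd.concat([totals, maxes])
print(summary)
```

The _max suffix keeps the two sets of values apart once they share one index.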

In [21]:
full_game

Unnamed: 0,mp,mp.1,fg,fga,fg%,3p,3pa,3p%,ft,fta,...,tov%_max_opp,usg%_max_opp,ortg_max_opp,drtg_max_opp,team_opp,total_opp,home_opp,season,date,won
0,240.0,240.0,41.0,85.0,0.482,9.0,26.0,0.346,26.0,30.0,...,27.7,27.1,150.0,126.0,MIA,106,1,2020,2020-09-19,True
1,240.0,240.0,33.0,85.0,0.388,12.0,44.0,0.273,28.0,34.0,...,51.5,36.2,141.0,114.0,BOS,117,0,2020,2020-09-19,False
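
The _opp columns in this output come from reversing the two rows of a game and placing them next to the originals. The sketch below shows the idea on made-up teams and scores. It departs from the loop in two ways: it uses reset_index(drop=True), so no extra index column is carried along, and it uses add_suffix instead of adding a string to the column index.

```python
import pandas as pd

# Made-up game: one row per team.
game = pd.DataFrame({"team": ["AAA", "BBB"], "total": [100, 98], "home": [0, 1]})

# Reverse the rows so each team lines up with its opponent.
game_opp = game.iloc[::-1].reset_index(drop=True).add_suffix("_opp")

full_game = pd.concat([game, game_opp], axis=1)
full_game["won"] = full_game["total"] > full_game["total_opp"]
print(full_game)
```

Each row then describes the game from one team's point of view, which is why every game contributes two rows.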


In [64]:
# full_game.to_csv('full_game.csv', index=False)

In [17]:
games_df = pd.concat(games, ignore_index=True)

In [18]:
games_df

Unnamed: 0,mp,mp.1,fg,fga,fg%,3p,3pa,3p%,ft,fta,...,tov%_max_opp,usg%_max_opp,ortg_max_opp,drtg_max_opp,team_opp,total_opp,home_opp,season,date,won
0,265.0,265.0,46.0,100.0,0.460,15.0,39.0,0.385,25.0,32.0,...,20.7,39.9,160.0,125.0,NOP,138,1,2020,2020-01-16,False
1,265.0,265.0,51.0,93.0,0.548,11.0,26.0,0.423,25.0,35.0,...,25.0,39.5,197.0,132.0,UTA,132,0,2020,2020-01-16,True
2,240.0,240.0,48.0,91.0,0.527,20.0,41.0,0.488,13.0,21.0,...,50.0,33.5,147.0,141.0,TOR,105,1,2021,2021-03-03,True
3,240.0,240.0,34.0,77.0,0.442,12.0,36.0,0.333,25.0,30.0,...,25.2,31.2,197.0,115.0,DET,129,0,2021,2021-03-03,False
4,240.0,240.0,38.0,83.0,0.458,12.0,28.0,0.429,16.0,20.0,...,41.0,32.0,180.0,118.0,DAL,115,1,2021,2021-03-10,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3123,240.0,240.0,35.0,94.0,0.372,7.0,32.0,0.219,28.0,34.0,...,33.3,45.4,192.0,115.0,GSW,132,0,2019,2019-04-18,False
3124,240.0,240.0,34.0,82.0,0.415,9.0,33.0,0.273,10.0,16.0,...,26.8,33.4,164.0,99.0,DEN,103,1,2019,2019-02-11,False
3125,240.0,240.0,38.0,89.0,0.427,16.0,37.0,0.432,11.0,13.0,...,41.5,25.7,226.0,117.0,MIA,87,0,2019,2019-02-11,True
3126,240.0,240.0,41.0,85.0,0.482,9.0,26.0,0.346,26.0,30.0,...,27.7,27.1,150.0,126.0,MIA,106,1,2020,2020-09-19,True
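
The date column is taken from the first eight characters of each file name. The sketch below shows the same idea with the standard library's datetime instead of pd.to_datetime; game_date is a hypothetical helper and the path is made up.

```python
import os
from datetime import datetime


def game_date(path):
    """Parse the leading YYYYMMDD part of a box score file name (hypothetical helper)."""
    name = os.path.basename(path)
    return datetime.strptime(name[:8], "%Y%m%d").date()


# Made-up path that follows the date-first naming pattern.
example = game_date("data/scores/202001160NOP.html")
print(example)
```

A file whose name does not start with a valid date raises a ValueError here, which makes misnamed files easy to spot.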


In [19]:
# Check that every game has the expected 150 columns; an empty list means none deviate
[g.shape[1] for g in games if g.shape[1] != 150]

[]
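
The check above hard-codes the expected width of 150 columns. A possible variant, sketched below on made-up frames, takes the most common width as the reference and reports the positions of the frames that differ.

```python
from collections import Counter

import pandas as pd

# Made-up frames: the third one is missing a column.
frames = [
    pd.DataFrame({"a": [1, 2], "b": [3, 4]}),
    pd.DataFrame({"a": [5, 6], "b": [7, 8]}),
    pd.DataFrame({"a": [9, 10]}),
]

# Use the most common column count as the expected width.
width_counts = Counter(frame.shape[1] for frame in frames)
expected_width = width_counts.most_common(1)[0][0]

odd_ones = [i for i, frame in enumerate(frames) if frame.shape[1] != expected_width]
print(expected_width, odd_ones)
```

Returning positions rather than widths makes it possible to look up which file produced the odd frame.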

In [22]:
games_df.to_csv("nba_games.csv")
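
Called without index=False, to_csv also writes the row index, which comes back as an extra "Unnamed: 0" column when the file is read again. The sketch below shows both variants on a made-up dataframe, using in-memory buffers instead of files.

```python
import io

import pandas as pd

games_df = pd.DataFrame({"team": ["AAA", "BBB"], "total": [100, 98]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
games_df.to_csv(with_index)
with_index.seek(0)
cols_with = list(pd.read_csv(with_index).columns)

# index=False: only the data columns are written.
no_index = io.StringIO()
games_df.to_csv(no_index, index=False)
no_index.seek(0)
cols_without = list(pd.read_csv(no_index).columns)

print(cols_with)
print(cols_without)
```

Either form works, as long as the notebook that loads the file handles the extra column, for example with index_col=0.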

## Conclusion
This notebook parses the scraped NBA box scores and transforms them into a structured table with two rows per game, which is saved to nba_games.csv. In the next step of the project, we'll explore this data and train a machine learning model to predict the outcome of NBA games.