# NBA Web Scraping

My goal is to predict the next NBA MVP.
I will scrape "Basketball Reference" to get data about:
1. Past MVP winners
2. Player stats per season
3. Team records per season

In [2]:
import requests
import time
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver

I'm going to work with data from the 1990/1991 season through the 2021/2022 season.

That is a total of 32 seasons, each identified by the year in which it ends.

In [15]:
years = list(range(1991,2023))
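Because `range` excludes its stop value, the upper bound has to be 2023 for the list to end with the 2021/2022 season. A quick sanity check, where `season_label` is a hypothetical helper added only for illustration:

```python
# Each season is identified by the year in which it ends,
# so 1991 stands for the 1990/1991 season.
years = list(range(1991, 2023))  # range() stops before 2023


def season_label(end_year: int) -> str:
    """Hypothetical helper: turn 1991 into '1990/1991'."""
    return f"{end_year - 1}/{end_year}"


print(len(years))               # 32
print(season_label(years[0]))   # 1990/1991
print(season_label(years[-1]))  # 2021/2022
```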

## Web Scraping for Past MVP Winners

Downloading each HTML page and saving it.

I got banned because I sent too many requests in a short period of time.

As a result, I had to download the 2020, 2021 and 2022 pages manually.

In [3]:
url_start = "https://www.basketball-reference.com/awards/awards_{}.html"

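for year in years:
    url = url_start.format(year)
    data = requests.get(url)

    # Save the raw HTML so each page only has to be requested once
    with open("mvp/{}.html".format(year), "w+") as f:
        f.write(data.text)
    time.sleep(6.3)  # pause between requests; without it the site banned me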
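One way to avoid a repeat of the ban is to pause between requests and to skip any year whose file is already on disk, such as the pages downloaded manually. A minimal sketch, assuming a hypothetical helper `download_pages` and an injected `fetch` callable (neither is part of the original notebook); the 5-second default delay is an arbitrary assumption, not a documented limit of the site:

```python
import time
from pathlib import Path
from typing import Callable, Iterable, List


def download_pages(
    years: Iterable[int],
    url_template: str,
    out_dir: Path,
    fetch: Callable[[str], str],
    delay_seconds: float = 5.0,  # assumed value, tune as needed
) -> List[int]:
    """Save one HTML file per year, skipping years already on disk.

    Returns the years that were actually fetched in this run.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    fetched = []
    for year in years:
        target = out_dir / f"{year}.html"
        if target.exists():
            continue  # already downloaded (or saved manually), no request needed
        html = fetch(url_template.format(year))
        target.write_text(html, encoding="utf-8")
        fetched.append(year)
        time.sleep(delay_seconds)  # pause before the next request
    return fetched
```

With the names used in this notebook, a call could look like `download_pages(years, url_start, Path("mvp"), lambda url: requests.get(url).text)`. Files written this way are UTF-8, so they should be opened with `encoding="utf-8"` when read back.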

Parsing each HTML page and extracting the table into a list.

We want to concatenate all of the dataframes into one,

so adding a 'Year' column records which season each row comes from.

In [6]:
dfs = []

for year in years:
    with open("mvp/{}.html".format(year)) as f:
        page = f.read()

    soup = BeautifulSoup(page, "html.parser")
    soup.find("tr", class_="over_header").decompose()  # drop the extra 'over_header' header row
    mvp_table = soup.find(id="mvp")
    mvp = pd.read_html(str(mvp_table))[0]  # [0] because read_html returns a list of dataframes
    mvp['Year'] = year
    dfs.append(mvp)
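If a saved file is not the expected awards page (for example, an error page saved during the ban), `soup.find(...)` returns `None` and the `.decompose()` call fails with an `AttributeError`. A minimal sketch of a guard, assuming a hypothetical helper `extract_table_html` and a made-up miniature page; neither is part of the original notebook:

```python
from bs4 import BeautifulSoup


def extract_table_html(page: str, table_id: str) -> str:
    """Return the HTML of the element with the given id, without its 'over_header' row.

    Raises ValueError when the element is missing, so a bad file is reported clearly.
    """
    soup = BeautifulSoup(page, "html.parser")
    table = soup.find(id=table_id)
    if table is None:
        raise ValueError(f"no element with id={table_id!r} in this page")
    over_header = table.find("tr", class_="over_header")
    if over_header is not None:
        over_header.decompose()  # remove the extra header row if it is present
    return str(table)


# Made-up miniature of an awards page, for illustration only.
sample = """
<table id="mvp">
  <tr class="over_header"><th colspan="2">Voting</th></tr>
  <tr><th>Rank</th><th>Player</th></tr>
  <tr><td>1</td><td>Michael Jordan</td></tr>
</table>
"""
cleaned = extract_table_html(sample, "mvp")
print("over_header" in cleaned)  # False
```

The returned string can be passed to `pd.read_html` in the same way as `str(mvp_table)` in the loop.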

Concatenating all of the dataframes into one.

In [7]:
mvps = pd.concat(dfs)
mvps.head()

Unnamed: 0,Rank,Player,Age,Tm,First,Pts Won,Pts Max,Share,G,MP,...,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48,Year
0,1,Michael Jordan,27,CHI,77.0,891.0,960,0.928,82,37.0,...,6.0,5.5,2.7,1.0,0.539,0.312,0.851,20.3,0.321,1991
1,2,Magic Johnson,31,LAL,10.0,497.0,960,0.518,79,37.1,...,7.0,12.5,1.3,0.2,0.477,0.32,0.906,15.4,0.251,1991
2,3,David Robinson,25,SAS,6.0,476.0,960,0.496,82,37.7,...,13.0,2.5,1.5,3.9,0.552,0.143,0.762,17.0,0.264,1991
3,4,Charles Barkley,27,PHI,2.0,222.0,960,0.231,67,37.3,...,10.1,4.2,1.6,0.5,0.57,0.284,0.722,13.4,0.258,1991
4,5,Karl Malone,27,UTA,0.0,142.0,960,0.148,82,40.3,...,11.8,3.3,1.1,1.0,0.527,0.286,0.77,15.5,0.225,1991


Saving the combined dataframe as a csv.

In [8]:
mvps.to_csv("mvp/mvps.csv")  # forward slash works on Windows, macOS and Linux
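`to_csv` writes the row index by default, so reading the file back yields an extra `Unnamed: 0` column. Passing `index=False` leaves it out. Whether to keep the index is a matter of choice; the sketch below only illustrates the difference, using an in-memory buffer instead of a file and two rows taken from the table above:

```python
import io

import pandas as pd

# Two rows copied from the MVP table, reduced to three columns.
sample_mvps = pd.DataFrame(
    {
        "Player": ["Michael Jordan", "Magic Johnson"],
        "Share": [0.928, 0.518],
        "Year": [1991, 1991],
    }
)

# Default: the row index is written as a first column without a name.
with_index = io.StringIO()
sample_mvps.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()
print(columns_with_index)     # ['Unnamed: 0', 'Player', 'Share', 'Year']

# index=False: only the data columns are written.
without_index = io.StringIO()
sample_mvps.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()
print(columns_without_index)  # ['Player', 'Share', 'Year']
```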

But data on past MVPs is not enough!

We also need data about player stats for each season.

## Web Scraping for Player Stats per Season

"Basketball Reference" uses JavaScript to finish loading this page.

I'm going to use ChromeDriver and the selenium library to render the page and get the complete HTML.

In [3]:
from selenium.webdriver.chrome.service import Service
# executable_path is deprecated in Selenium 4; the driver path is passed through a Service object instead
driver = webdriver.Chrome(service=Service("C://Users//omri8//Desktop//NBA MVP Prediction//webDriver//chromedriver"))


Downloading each html page and saving it.

In [19]:
player_stats_url = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html"

for year in years:
    url = player_stats_url.format(year)
    driver.get(url)
    driver.execute_script("window.scrollTo(1,10000)")  # JavaScript that scrolls the page down
    time.sleep(2)  # wait for the page content to load after scrolling
    html = driver.page_source
    with open("player/{}.html".format(year), "w+") as f:
        f.write(html)

Extracting the table from the html file.

The header row repeats throughout the table, so I'm going to remove every copy of it.

In [58]:
dfs = []

for year in years:
    with open("player/{}.html".format(year)) as f:
        page = f.read()

    soup = BeautifulSoup(page, "html.parser")
    # find() would only return the first repeated header, so loop over find_all()
    for header in soup.find_all("tr", class_="thead"):
        header.decompose()
    player_table = soup.find(id="per_game_stats")
    player = pd.read_html(str(player_table))[0]  # [0] because read_html returns a list of dataframes
    player['Year'] = year
    dfs.append(player)
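`find` returns only the first matching row, whereas `find_all` returns every match, which matters when a header repeats more than once. A small sketch on a made-up miniature table (the markup below is an illustration, not a copy of the real page):

```python
from bs4 import BeautifulSoup

# Made-up table: one real header, then a repeated header after each data row.
sample = """
<table id="per_game_stats">
  <thead><tr><th>Rk</th><th>Player</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Alaa Abdelnaby</td></tr>
    <tr class="thead"><th>Rk</th><th>Player</th></tr>
    <tr><td>2</td><td>Mahmoud Abdul-Rauf</td></tr>
    <tr class="thead"><th>Rk</th><th>Player</th></tr>
    <tr><td>3</td><td>Mark Acres</td></tr>
  </tbody>
</table>
"""

# find(): removes only the first repeated header.
soup_first = BeautifulSoup(sample, "html.parser")
soup_first.find("tr", class_="thead").decompose()
left_after_find = len(soup_first.find_all("tr", class_="thead"))

# find_all(): removes every repeated header.
soup_all = BeautifulSoup(sample, "html.parser")
for row in soup_all.find_all("tr", class_="thead"):
    row.decompose()
left_after_find_all = len(soup_all.find_all("tr", class_="thead"))

print(left_after_find, left_after_find_all)  # 1 0
```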

Concatenating all of the seasons into one dataframe.

In [59]:
players = pd.concat(dfs)
players.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
0,1,Alaa Abdelnaby,PF,22,POR,43,0,6.7,1.3,2.7,...,0.6,1.4,2.1,0.3,0.1,0.3,0.5,0.9,3.1,1991
1,2,Mahmoud Abdul-Rauf,PG,21,DEN,67,19,22.5,6.2,15.1,...,0.5,1.3,1.8,3.1,0.8,0.1,1.6,2.2,14.1,1991
2,3,Mark Acres,C,28,ORL,68,0,19.3,1.6,3.1,...,2.1,3.2,5.3,0.4,0.4,0.4,0.6,3.2,4.2,1991
3,4,Michael Adams,PG,28,DEN,66,66,35.5,8.5,21.5,...,0.9,3.0,3.9,10.5,2.2,0.1,3.6,2.5,26.5,1991
4,5,Mark Aguirre,SF,31,DET,78,13,25.7,5.4,11.7,...,1.7,3.1,4.8,1.8,0.6,0.3,1.6,2.7,14.2,1991


Saving the combined dataframe as a csv.

In [60]:
players.to_csv("player/players.csv")  # forward slash works on Windows, macOS and Linux

## Web Scraping for Team Record per Season

Getting the division standings for each season.

(I added a sleep between requests to avoid getting banned again.)

In [7]:
team_stats_url = "https://www.basketball-reference.com/leagues/NBA_{}_standings.html"

for year in years:
    url = team_stats_url.format(year)
    data = requests.get(url)

    with open("team/{}.html".format(year), "w+") as f:
        f.write(data.text)
    time.sleep(6.3)  # pause between requests to avoid another ban

There are 2 tables to scrape: Eastern Conference and Western Conference.

In [33]:
dfs = []

for year in years:
    with open("team/{}.html".format(year)) as f:
        page = f.read()
    soup = BeautifulSoup(page, "html.parser")

    # The division names (Atlantic Division, Central Division, Midwest Division, Pacific Division)
    # appear as extra header rows inside the tables. Remove them all.
    headers = soup.find_all("tr", class_="thead")
    for header in headers:
        header.decompose()

    # Eastern Conference
    team_table = soup.find(id="all_divs_standings_E")
    team = pd.read_html(str(team_table))[0]  # [0] because read_html returns a list of dataframes
    team['Year'] = year
    team = team.rename(columns={"Eastern Conference": "Team"})  # the team names sit under the 'Eastern Conference' column
    team['Conference'] = "East"
    dfs.append(team)

    # Western Conference
    team_table = soup.find(id="all_divs_standings_W")
    team = pd.read_html(str(team_table))[0]
    team['Year'] = year
    team = team.rename(columns={"Western Conference": "Team"})  # same for the 'Western Conference' column
    team['Conference'] = "West"
    dfs.append(team)
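The two conference blocks differ only in the column being renamed and the label being added, so the shared steps could be factored into one function. A minimal sketch, assuming a hypothetical helper `label_conference` (not part of the original notebook) and assuming the team names are always in the first column:

```python
import pandas as pd


def label_conference(table: pd.DataFrame, conference: str, year: int) -> pd.DataFrame:
    """Rename the first column to 'Team' and tag every row with year and conference."""
    labeled = table.rename(columns={table.columns[0]: "Team"})  # rename returns a new frame
    labeled["Year"] = year
    labeled["Conference"] = conference
    return labeled


# One row per conference, values taken from the standings shown in this notebook.
raw_east = pd.DataFrame({"Eastern Conference": ["Boston Celtics*"], "W": [56], "L": [26]})
raw_west = pd.DataFrame({"Western Conference": ["Memphis Grizzlies*"], "W": [56], "L": [26]})

east = label_conference(raw_east, "East", 1991)
west = label_conference(raw_west, "West", 2022)
print(east.columns.tolist())  # ['Team', 'W', 'L', 'Year', 'Conference']
```

Renaming by position rather than by name means the same call works for both tables, at the cost of relying on the column order.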

Concatenating all of the seasons into one dataframe.

In [34]:
teams = pd.concat(dfs)

In [36]:
teams

Unnamed: 0,Team,W,L,W/L%,GB,PS/G,PA/G,SRS,Year,Conference
0,Boston Celtics*,56,26,0.683,—,111.5,105.7,5.22,1991,East
1,Philadelphia 76ers*,44,38,0.537,12.0,105.4,105.6,-0.39,1991,East
2,New York Knicks*,39,43,0.476,17.0,103.1,103.3,-0.43,1991,East
3,Washington Bullets,30,52,0.366,26.0,101.4,106.4,-4.84,1991,East
4,New Jersey Nets,26,56,0.317,30.0,102.9,107.5,-4.53,1991,East
...,...,...,...,...,...,...,...,...,...,...
10,Memphis Grizzlies*,56,26,0.683,—,115.6,109.9,5.37,2022,West
11,Dallas Mavericks*,52,30,0.634,4.0,108.0,104.7,3.12,2022,West
12,New Orleans Pelicans*,36,46,0.439,20.0,109.3,110.3,-0.84,2022,West
13,San Antonio Spurs,34,48,0.415,22.0,113.2,113.0,0.02,2022,West
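The row labels in the output above restart for every appended table (the last rows are numbered 10 to 13 rather than continuing from the previous seasons), because `pd.concat` keeps each frame's own index by default. If a single running index is preferred, `ignore_index=True` renumbers the rows. A small sketch with a few rows from the table above:

```python
import pandas as pd

east = pd.DataFrame(
    {"Team": ["Boston Celtics*", "Philadelphia 76ers*"], "W": [56, 44], "Year": 1991, "Conference": "East"}
)
west = pd.DataFrame(
    {"Team": ["Memphis Grizzlies*", "Dallas Mavericks*"], "W": [56, 52], "Year": 2022, "Conference": "West"}
)

# Default: each frame keeps its own 0-based index, so labels repeat.
stacked = pd.concat([east, west])
print(stacked.index.tolist())     # [0, 1, 0, 1]

# ignore_index=True: rows are renumbered from 0 across all frames.
renumbered = pd.concat([east, west], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```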


Saving the complete table to a csv.

In [30]:
teams.to_csv("team/teams.csv")