# Data Collection

## Introduction

The goal of this project is to scrape data from the web and use it to build a model that predicts the Most Valuable Player (MVP) for a given NBA season. The first step is to gather the data, which I will collect from Basketball Reference (https://www.basketball-reference.com/). We will need the MVP voting results, player stats for each season, and the record of each NBA team in each season.

First, we will get player stats for each NBA season. The stats for the 2022–23 season can be found here: https://www.basketball-reference.com/leagues/NBA_2023_per_game.html. I have chosen to use data dating back to the 1987 NBA season. I wanted to start after the ABA-NBA merger, which took place in 1976, and to include Michael Jordan's and Magic Johnson's MVPs. Magic Johnson was the first of the two to win an MVP, which he did in 1987, so I will start in 1987.

In [1]:
#Import packages
import requests
import os
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from io import StringIO
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
#Define the years we want to gather data from
years = list(range(1987, 2024))

# Scraping Player Stats

In [3]:
stats_url = 'https://www.basketball-reference.com/leagues/NBA_{}_per_game.html'

Notice that this is the URL for the 2023 season with {} in place of 2023, so we can fill in each year from 1987 to 2023 in turn. When we request a page from Basketball Reference directly, the full HTML does not load, so we will use a Selenium webdriver to load each page in a browser instead.
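
As a quick illustration of how the {} placeholder works, the sketch below fills the template with each year and prints the first and last URLs. The name season_urls is introduced here only for illustration and is not used elsewhere in the notebook.

```python
# Template taken from the notebook; {} is the placeholder for the season year.
stats_url = 'https://www.basketball-reference.com/leagues/NBA_{}_per_game.html'

# Same range as the notebook: 1987 through 2023 inclusive.
years = list(range(1987, 2024))

# Build one URL per season by substituting the year into the template.
# season_urls is a hypothetical name used only in this sketch.
season_urls = [stats_url.format(year) for year in years]

print(season_urls[0])
print(season_urls[-1])
print(len(season_urls))
```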

In [4]:
#Selenium Driver
options = webdriver.ChromeOptions()
options.binary_location = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
service = Service(executable_path="/Users/natefarber/Desktop/chromedriver-mac-arm64/chromedriver")
driver = webdriver.Chrome(service=service, options=options)

In [5]:
#Create the output folder if it does not already exist
os.makedirs('player_data', exist_ok=True)

#Loop over each year
for year in years:
    #Get the url with driver
    driver.get(stats_url.format(year))
    #Scroll down page to load full site
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    #Save file
    with open('player_data/{}.html'.format(year), "w+") as f:
        f.write(driver.page_source)
    #Sleep to prevent rate limits
    time.sleep(20)

It is important to note that this table has extra rows inserted into it that contain the header. These rows do not contain player stats, so we will have to find a way to remove these extra rows so they do not mess up our Pandas dataframe. Let's start by looking at just the year 1987.
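
To make the repeated-header problem concrete, here is a minimal sketch on a made-up table. The HTML below is a hypothetical miniature, not the real page, and it assumes (as the function in the next cell does) that the repeated header rows are tr elements with the class thead.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the stats table: two player rows with a
# repeated header row (class "thead") between them.
sample_html = """
<table id="per_game_stats">
  <thead><tr><th>Rk</th><th>Player</th><th>PTS</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>Player A</td><td>17.5</td></tr>
    <tr class="thead"><th>Rk</th><th>Player</th><th>PTS</th></tr>
    <tr><td>2</td><td>Player B</td><td>11.1</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Delete every <tr> carrying the "thead" class. The real header sits in
# the <thead> element, whose <tr> has no class, so it is kept.
for row in soup.find_all('tr', class_='thead'):
    row.decompose()

# Collect the player names from the rows that remain in the body.
body_rows = soup.find('tbody').find_all('tr')
players = [row.find_all('td')[1].get_text() for row in body_rows]
print(players)
```

Only the two player rows should remain in the table body after the loop, while the single header at the top is untouched.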

In [6]:
#Define a function that gets the stats in a given year
def get_player_stats(year):
    with open(f"player_data/{year}.html") as f:
        soup = BeautifulSoup(f, 'html.parser')
        #Remove the unwanted rows in the dataframe
        for thead in soup.find_all('tr', class_='thead'):
            thead.decompose()
        
        html_content = str(soup.find(id='per_game_stats'))
        player_stats = pd.read_html(StringIO(html_content))[0]
        player_stats["Year"] = year
        return player_stats

In [7]:
#Call our function on the year 1987 to observe the data
get_player_stats(1987)

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
0,1,Kareem Abdul-Jabbar*,C,39,LAL,78,78,31.3,7.2,12.7,...,1.9,4.8,6.7,2.6,0.6,1.2,2.4,3.1,17.5,1987
1,2,Alvan Adams,C,32,PHO,68,40,24.9,4.6,9.1,...,1.3,3.6,5.0,3.3,0.9,0.5,2.0,3.0,11.1,1987
2,3,Michael Adams,PG,24,WSB,63,0,20.7,2.5,6.2,...,0.6,1.3,2.0,3.9,1.3,0.1,1.3,1.4,7.2,1987
3,4,Rafael Addison,SF,22,PHO,62,12,11.5,2.4,5.3,...,0.7,1.0,1.7,0.7,0.4,0.1,0.9,1.2,5.8,1987
4,5,Mark Aguirre,SF,27,DAL,80,80,33.3,9.8,19.9,...,2.3,3.1,5.3,3.2,1.1,0.4,2.7,3.0,25.7,1987
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
373,333,Brad Wright,PF,24,NYK,14,0,9.9,1.4,3.3,...,1.8,2.0,3.8,0.1,0.2,0.4,0.9,1.4,3.7,1987
374,334,Danny Young,PG,24,SEA,73,26,20.3,1.8,3.9,...,0.3,1.2,1.5,4.8,1.0,0.0,1.2,1.0,4.8,1987
375,335,Perry Young,SG,23,TOT,9,0,8.0,0.7,2.3,...,0.3,0.6,0.9,0.8,0.6,0.1,0.4,1.6,1.4,1987
376,335,Perry Young,SG,23,CHI,5,0,4.0,0.4,0.8,...,0.0,0.2,0.2,0.0,0.2,0.0,0.2,0.6,1.0,1987


Our data looks good. Now we will create a list with the data for all of the years.

In [8]:
stats_df_list = [get_player_stats(year) for year in years]

Now we have a list of dataframes. We will finally combine this into one dataframe containing all of the data and export it as a CSV file.
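
One detail worth knowing about pd.concat is how it handles row labels. The sketch below uses two made-up frames in place of stats_df_list. The notebook relies on the default behavior; ignore_index=True is shown only as an optional variation, not as what the notebook does.

```python
import pandas as pd

# Two hypothetical single-season frames standing in for stats_df_list.
season_a = pd.DataFrame({"Player": ["Player A", "Player B"], "Year": [1987, 1987]})
season_b = pd.DataFrame({"Player": ["Player C"], "Year": [1988]})

# Default concat keeps each frame's own row labels, so labels repeat.
stacked = pd.concat([season_a, season_b])
print(stacked.index.tolist())

# ignore_index=True assigns a fresh 0..n-1 index to the combined frame.
renumbered = pd.concat([season_a, season_b], ignore_index=True)
print(renumbered.index.tolist())
```

With the default, the Year column is what distinguishes rows from different seasons, since the row labels restart for each season.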

In [9]:
stats_df = pd.concat(stats_df_list)

In [10]:
stats_df.head(5)

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
0,1,Kareem Abdul-Jabbar*,C,39,LAL,78,78,31.3,7.2,12.7,...,1.9,4.8,6.7,2.6,0.6,1.2,2.4,3.1,17.5,1987
1,2,Alvan Adams,C,32,PHO,68,40,24.9,4.6,9.1,...,1.3,3.6,5.0,3.3,0.9,0.5,2.0,3.0,11.1,1987
2,3,Michael Adams,PG,24,WSB,63,0,20.7,2.5,6.2,...,0.6,1.3,2.0,3.9,1.3,0.1,1.3,1.4,7.2,1987
3,4,Rafael Addison,SF,22,PHO,62,12,11.5,2.4,5.3,...,0.7,1.0,1.7,0.7,0.4,0.1,0.9,1.2,5.8,1987
4,5,Mark Aguirre,SF,27,DAL,80,80,33.3,9.8,19.9,...,2.3,3.1,5.3,3.2,1.1,0.4,2.7,3.0,25.7,1987


In [11]:
#Convert to csv
stats_df.to_csv('stats_df.csv')

Now that we have finished scraping all the player stats dating back to 1987, we need to scrape the MVP voting results for the same years. The link for the MVP voting in 1987 is https://www.basketball-reference.com/awards/awards_1987.html. If you open the link, you can see that Magic Johnson won the MVP that year.

# Scraping MVP Data

In [12]:
#Define the URL we want to scrape
mvp_url = 'https://www.basketball-reference.com/awards/awards_{}.html'

Notice that this is the 1987 awards URL with {} in place of the year, so we can fill in each year from 1987 to 2023 as we did for player stats. The entire page loads with a regular request, so we do not need to use Selenium here.
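
One way to check whether a regular request is enough for a page is to look for the table's id in the HTML that comes back. The sketch below is a rough illustration on made-up HTML strings; has_table is a hypothetical helper, and in practice the html argument would be the text of a response from requests.get.

```python
from bs4 import BeautifulSoup

def has_table(html, table_id):
    """Return True if the HTML contains a <table> element with this id."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find('table', id=table_id) is not None

# Hypothetical snippets: one page with the table in the markup, one without.
page_with_table = '<div><table id="mvp"><tr><td>1</td></tr></table></div>'
page_without_table = '<div id="content"><p>No table here</p></div>'

print(has_table(page_with_table, 'mvp'))
print(has_table(page_without_table, 'mvp'))
```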

In [13]:
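#Create the output folder if it does not already exist
os.makedirs('mvp_data', exist_ok=True)

#Iterate over each year
for year in years:
    #Create a url for each year
    mvp_url_data = mvp_url.format(year)
    #Get the website for the year and add it to mvp_data folder
    mvp_data = requests.get(mvp_url_data)
    #Stop on an HTTP error so an error page is not saved as data
    mvp_data.raise_for_status()
    with open("mvp_data/{}.html".format(year), "w+") as f:
        f.write(mvp_data.text)
    #Add sleep timer to prevent rate limits
    time.sleep(20)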

Now we have a folder called mvp_data with the webpage for the MVP voting results for each year between 1987 and 2023. Next, we have to pull the data out of each HTML file and get it into a Pandas DataFrame. To do this, we will use the BeautifulSoup package. Let's do it for just 1987 first before we loop over all years. Looking at the page's HTML, I found that the table is stored in an element with the ID mvp.

In [14]:
with open("mvp_data/1987.html") as f:
    html_content = f.read()
    soup = BeautifulSoup(html_content, 'html.parser')
    mvp_table = str(soup.find(id="mvp"))
    mvp_df = pd.read_html(StringIO(mvp_table))[0]

In [15]:
mvp_df.head(5)

Unnamed: 0_level_0,Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Voting,Voting,Voting,Voting,Unnamed: 8_level_0,Per Game,Per Game,Per Game,Per Game,Per Game,Per Game,Shooting,Shooting,Shooting,Advanced,Advanced
Unnamed: 0_level_1,Rank,Player,Age,Tm,First,Pts Won,Pts Max,Share,G,MP,PTS,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48
0,1,Magic Johnson,27,LAL,65.0,733.0,780,0.94,80,36.3,23.9,6.3,12.2,1.7,0.5,0.522,0.205,0.848,15.9,0.263
1,2,Michael Jordan,23,CHI,10.0,449.0,780,0.576,82,40.0,37.1,5.2,4.6,2.9,1.5,0.482,0.182,0.857,16.9,0.247
2,3,Larry Bird,30,BOS,1.0,271.0,780,0.347,74,40.6,28.1,9.2,7.6,1.8,0.9,0.525,0.4,0.91,15.2,0.243
3,4,Kevin McHale,29,BOS,0.0,254.0,780,0.326,77,39.7,26.1,9.9,2.6,0.5,2.2,0.604,0.0,0.836,14.8,0.232
4,5,Dominique Wilkins,27,ATL,0.0,128.0,780,0.164,79,37.6,29.0,6.3,3.3,1.5,0.6,0.463,0.292,0.818,12.2,0.197


The column header has two levels. We do not need the top level, which holds group labels such as "Voting" and placeholders such as "Unnamed: 0_level_0," so we will drop it.
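
The effect of dropping a header level can be seen on a small made-up frame. The sketch below builds a hypothetical two-level header that resembles the MVP table's layout and removes the outer level. The sample values are placeholders, not scraped data.

```python
import pandas as pd

# Hypothetical two-level header resembling the MVP table's layout.
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Rank"),
    ("Unnamed: 1_level_0", "Player"),
    ("Voting", "Share"),
    ("Per Game", "PTS"),
])
sample = pd.DataFrame([[1, "Player A", 0.94, 23.9]], columns=columns)

# droplevel() with no argument removes level 0, the outer header row.
sample.columns = sample.columns.droplevel()
print(sample.columns.tolist())
```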

In [16]:
mvp_df.columns = mvp_df.columns.droplevel()

In [17]:
mvp_df.head(5)

Unnamed: 0,Rank,Player,Age,Tm,First,Pts Won,Pts Max,Share,G,MP,PTS,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48
0,1,Magic Johnson,27,LAL,65.0,733.0,780,0.94,80,36.3,23.9,6.3,12.2,1.7,0.5,0.522,0.205,0.848,15.9,0.263
1,2,Michael Jordan,23,CHI,10.0,449.0,780,0.576,82,40.0,37.1,5.2,4.6,2.9,1.5,0.482,0.182,0.857,16.9,0.247
2,3,Larry Bird,30,BOS,1.0,271.0,780,0.347,74,40.6,28.1,9.2,7.6,1.8,0.9,0.525,0.4,0.91,15.2,0.243
3,4,Kevin McHale,29,BOS,0.0,254.0,780,0.326,77,39.7,26.1,9.9,2.6,0.5,2.2,0.604,0.0,0.836,14.8,0.232
4,5,Dominique Wilkins,27,ATL,0.0,128.0,780,0.164,79,37.6,29.0,6.3,3.3,1.5,0.6,0.463,0.292,0.818,12.2,0.197


Now that we are able to do it for one year, we can use a for loop to do it for each year.

In [18]:
#Define list to store the dataframes
mvp_dfs = []

#Iterate over each year
for year in years:
    with open(f"mvp_data/{year}.html") as f:
        #Read the HTML content and parse it
        html_content = f.read()
        soup = BeautifulSoup(html_content, 'html.parser')
        table = soup.find(id="mvp")
        html_file = StringIO(str(table))
        mvp_df = pd.read_html(html_file)[0]
        
        #Drop the outer level of the column header
        mvp_df.columns = mvp_df.columns.droplevel()
        
        #Add a year column 
        mvp_df['Year'] = year
        
        #Add dataframe to the list
        mvp_dfs.append(mvp_df)

mvp_dfs is a list of dataframes. We will convert it to one dataframe using Pandas.

In [19]:
mvp_df = pd.concat(mvp_dfs)

In [20]:
mvp_df.head(5)

Unnamed: 0,Rank,Player,Age,Tm,First,Pts Won,Pts Max,Share,G,MP,...,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48,Year
0,1,Magic Johnson,27,LAL,65.0,733.0,780,0.94,80,36.3,...,6.3,12.2,1.7,0.5,0.522,0.205,0.848,15.9,0.263,1987
1,2,Michael Jordan,23,CHI,10.0,449.0,780,0.576,82,40.0,...,5.2,4.6,2.9,1.5,0.482,0.182,0.857,16.9,0.247,1987
2,3,Larry Bird,30,BOS,1.0,271.0,780,0.347,74,40.6,...,9.2,7.6,1.8,0.9,0.525,0.4,0.91,15.2,0.243,1987
3,4,Kevin McHale,29,BOS,0.0,254.0,780,0.326,77,39.7,...,9.9,2.6,0.5,2.2,0.604,0.0,0.836,14.8,0.232,1987
4,5,Dominique Wilkins,27,ATL,0.0,128.0,780,0.164,79,37.6,...,6.3,3.3,1.5,0.6,0.463,0.292,0.818,12.2,0.197,1987


Now we have to convert it to a CSV file so we can use the data later.

In [21]:
mvp_df.to_csv('mvp_df.csv')

# Scraping Team Record Data

The final piece of information we will scrape is team records. The NBA website (found here: https://www.nba.com/news/blogtable-what-criteria-matters-most-making-mvp-decision) states that "standings matter, but are not entirely," so we will take this data into consideration. We will also get this data from Basketball Reference. Here is the link for the 1987 NBA season standings: https://www.basketball-reference.com/leagues/NBA_1987_standings.html. As usual, I will start with 1987 alone to confirm that the process works before running it for every year. We can pull data from either the division standings table or the expanded standings table. I will use the expanded standings table to minimize web scraping complexity. Because the expanded standings table does not load with a regular requests call, we will have to use Selenium again to load the entire page.

In [22]:
team_url = 'https://www.basketball-reference.com/leagues/NBA_{}_standings.html'

In [23]:
#Create the output folder if it does not already exist
os.makedirs('team_data', exist_ok=True)

#Loop over each year
for year in years:
    #Get the url with driver
    driver.get(team_url.format(year))
    #Scroll down page to load full site
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight)')
    #Save file
    with open('team_data/{}.html'.format(year), "w+") as f:
        f.write(driver.page_source)
    #Sleep to prevent rate limits
    time.sleep(20)

In [24]:
#Define a function to extract standings from html file
def get_team_record(year):
    with open(f"team_data/{year}.html") as f:
        soup = BeautifulSoup(f, 'html.parser')
        html_string = str(soup.find(id='div_expanded_standings'))
        team_record = pd.read_html(StringIO(html_string))[0]
        #Drop the outer level of the column header
        team_record.columns = team_record.columns.droplevel()
        team_record["Year"] = year
        return team_record

In [25]:
#Check 1987 data before concatenating all years
test_year_record = get_team_record(1987)

In [26]:
test_year_record.head(5)

Unnamed: 0,Rk,Team,Overall,Home,Road,E,W,A,C,M,...,≤3,≥10,Oct,Nov,Dec,Jan,Feb,Mar,Apr,Year
0,1,Los Angeles Lakers,65-17,37-4,28-13,18-4,47-13,9-1,9-3,23-7,...,6-1,40-10,,12-2,10-4,12-4,10-4,13-1,8-2,1987
1,2,Boston Celtics,59-23,39-2,20-21,38-20,21-3,15-9,23-11,11-1,...,5-9,36-6,1-0,9-4,10-5,13-2,9-4,11-5,6-3,1987
2,3,Atlanta Hawks,57-25,35-6,22-19,38-20,19-5,21-7,17-13,11-1,...,6-7,35-9,,12-3,8-4,8-8,7-6,13-2,9-2,1987
3,4,Dallas Mavericks,55-27,35-6,20-21,13-9,42-18,7-3,6-6,19-11,...,8-4,34-9,1-0,9-5,9-4,9-6,9-5,12-4,6-3,1987
4,5,Detroit Pistons,52-30,32-9,20-21,38-20,14-10,21-7,17-13,7-5,...,12-6,27-12,0-1,6-5,11-3,11-6,9-3,10-7,5-5,1987


Looks good! Now let's combine all the years into one list using a list comprehension.

In [27]:
team_record_df_list = [get_team_record(year) for year in years]

In [28]:
#Convert the list of dataframes into one dataframe
team_record_df = pd.concat(team_record_df_list)

In [29]:
#Export as csv
team_record_df.to_csv('team_record_df.csv')

Now we have collected all of the data we will need for the project. The next step will be to clean up the data we have so we can use it to create a machine learning model to predict the MVP. This step will be done in a separate notebook called data_cleaning.