# Project: Predicting future NBA Most Valuable Player (MVP)

# Part I: Acquiring historical data - Web Scraping

In the first part of this project, we'll scrape the different pieces of data we need for our analysis and prediction model.

All the data will be scraped from [**Basketball Reference**](https://www.basketball-reference.com/), a reliable and extensive resource for basketball stats. It also has the benefit of being very well structured. For this project we'll use data from 1991 to 2023 (2023 being the last complete season to date).


We're going to extract 3 sets of data:
* MVP Award Ranking for each season (contains the point count from the vote and the stats of each nominee)
* Player stats for every player, for each season
* Team stats for each season

## Format of the data (1991 as an example):

* [**MVP AWARD RANKING**](https://www.basketball-reference.com/awards/awards_1991.html)
* [**PLAYERS STATS**](https://www.basketball-reference.com/leagues/NBA_1991_per_game.html)
* [**TEAMS STATS**](https://www.basketball-reference.com/leagues/NBA_1991_standings.html)
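
Since the three pages differ only by the season in their URL, the links above can be generated from templates. The snippet below is a minimal sketch of that idea; the names `URL_TEMPLATES` and `urls_for` are made up for illustration and are not used elsewhere in this notebook.

```python
# Hypothetical helper: one URL template per data set, keyed by a short name
URL_TEMPLATES = {
    "mvp": "https://www.basketball-reference.com/awards/awards_{}.html",
    "player": "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html",
    "team": "https://www.basketball-reference.com/leagues/NBA_{}_standings.html",
}


def urls_for(year):
    """Return the three page URLs for a given season."""
    return {name: template.format(year) for name, template in URL_TEMPLATES.items()}


# range() excludes its upper bound, so 2024 is needed to include the 2023 season
years = list(range(1991, 2024))

print(len(years), years[0], years[-1])
print(urls_for(1991)["mvp"])
```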

In [1]:
# Let's import the libraries we need:
import requests
import time

## MVP Award Ranking

The **Basketball Reference** website is well made and the *MVP Award Ranking* page for each year is formatted in the same way, so we can simply iterate through the years we are interested in and download each .html page using the `requests` library. Saving the webpages locally lets us minimise the number of requests we submit to the site (best practice!).

In [2]:
# Create our list of years (range is exclusive at the top-end):
years = list(range(1991,2024))

# Baseline url
url_base_mvp = "https://www.basketball-reference.com/awards/awards_{}.html"

In [3]:
# Iterate through years and save each year's webpage to the mvp folder in our directory
for year in years:
    
    # Use the *time* library to remain within request limits of website
    time.sleep(1)
    
    url = url_base_mvp.format(year)
    
    data = requests.get(url)
    
    with open("mvp/{}.html".format(year), "w+") as f:
        f.write(data.text)
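
The loop above downloads every page again each time the cell is run. If we want to take the "save locally" idea one step further, one option is to skip any year whose file already exists. The helper below is only a sketch of that idea: `download_once` and `_http_get` are hypothetical names, and the `fetch` parameter exists purely so the function can be tried without hitting the website.

```python
import os
import time


def _http_get(url):
    """Default fetcher: download `url` with requests and return the page text."""
    import requests  # imported lazily so the helper can be exercised offline

    response = requests.get(url)
    response.raise_for_status()  # fail loudly rather than save an error page
    return response.text


def download_once(url, path, fetch=_http_get, delay=1.0):
    """Save `url` to `path` unless the file already exists.

    Returns True if a request was made, False if the local copy was reused.
    """
    if os.path.exists(path):
        return False

    # Create the target folder (e.g. "mvp") if it is missing
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)

    html = fetch(url)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)

    time.sleep(delay)  # pause only after a real request
    return True
```

With this in place, the body of the loop could become `download_once(url_base_mvp.format(year), "mvp/{}.html".format(year))`, and re-running the cell would only request the pages that are still missing.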

Next, we'll parse the table we are interested in from each webpage. To do so we'll use `BeautifulSoup`, which lets us initialise a parser object to extract the table. We'll then save all the data in one big pandas dataframe (as we iterate through each year, we'll actually build a list of dataframes and combine them into our final dataframe at the end).

We know the structure of the table we want to extract: it has a double header, which is impractical for pandas, so we'll go ahead and remove the top header. Looking at the source code, we've identified the header element: a **'tr'** tag with **class_="over_header"**. We'll remove it before reading the table from the .html file.

Finally, we know that the table has a specific id, **'mvp'**, that we'll use to extract the data we are interested in.
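
To see what removing the top header does, here is a small sketch on a made-up table. The HTML string is a heavily simplified, hypothetical stand-in for a saved awards page (the player name and numbers are placeholders, not real data). It also checks for `None` first, since `find` returns `None` when nothing matches and calling `.decompose()` on it would raise an error.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for a saved awards page
page = """
<table id="mvp">
  <thead>
    <tr class="over_header"><th colspan="2"></th><th colspan="2">Voting</th></tr>
    <tr><th>Rank</th><th>Player</th><th>First</th><th>Pts Won</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Player A</td><td>10</td><td>100</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(page, "html.parser")

# Remove the top header row only if it is actually present
over_header = soup.find("tr", class_="over_header")
if over_header is not None:
    over_header.decompose()

# Select the table by its id and fail with a clear message if it is missing
mvp_table = soup.find(id="mvp")
if mvp_table is None:
    raise ValueError("no table with id='mvp' found")

# The first remaining row is now the real column header
header = [th.get_text(strip=True) for th in mvp_table.find("tr").find_all("th")]
print(header)
```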

In [4]:
# Import libraries:
from bs4 import BeautifulSoup
import pandas as pd

In [5]:
# Initialise our list of dataframes:
dfs = []

# Iterate through each year:
for year in years:
    # First, read the content of the page (string format)
    with open("mvp/{}.html".format(year)) as f:
        page = f.read()
    
    # Then, we initialise BeautifulSoup as an .html parser
    soup = BeautifulSoup(page, 'html.parser')
    
    # Remove the top header of the table
    soup.find('tr', class_="over_header").decompose()
    
    # Use the table 'id' to extract the data
    mvp_table = soup.find_all(id="mvp")[0]
    
    # We can use pandas' ability to read html (note: the table needs to be converted to a string)
    mvp_df = pd.read_html(str(mvp_table))[0]
    
    # Add the year of the data we've just extracted as we'll combine all years
    mvp_df["Year"] = year
    
    # Finally append to our pre-defined list
    dfs.append(mvp_df)

Let's combine the data from each year into one dataframe and save it to .csv file:

In [6]:
mvps = pd.concat(dfs)
#mvps

In [7]:
mvps.to_csv("mvps.csv")
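
One detail worth knowing about this step: `pd.concat` keeps each dataframe's own row index, and `to_csv` writes that index as an extra first column by default. The sketch below uses two tiny placeholder dataframes (not real voting data) to show the difference that `ignore_index=True` and `index=False` make. Whether to use them here is a design choice, not something the notebook requires.

```python
import pandas as pd

# Two tiny placeholder frames standing in for the per-season tables
dfs = []
for year in (1991, 1992):
    df = pd.DataFrame({"Player": ["Player A", "Player B"], "Pts Won": [2, 1]})
    df["Year"] = year
    dfs.append(df)

combined = pd.concat(dfs)                       # index repeats: 0, 1, 0, 1
renumbered = pd.concat(dfs, ignore_index=True)  # index runs from 0 to 3

print(list(combined.index))
print(list(renumbered.index))

# With no path given, to_csv returns the CSV text; index=False omits the index column
csv_text = renumbered.to_csv(index=False)
print(csv_text.splitlines()[0])
```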

## Players stats

Now that we have scraped the data for the **MVP Award Ranking**, we'll do the same for the **Player Stats**. This will contain the stats of all players in the league, not just the MVP nominees. We'll be able to use this data to compare *MVP worthy stats/performance* to the rest of the league when we train our ML algorithm in part 2.

In this case, the table we're interested in isn't fully loaded when we open the page, so a plain request only returns what is loaded initially. When we visit the page, the rest of the table is rendered as we scroll down (scrolling down actually triggers JavaScript to run). Essentially, rather than just sending a request, we need to *run* that script within the browser (to generate all the data) before extracting it.

To do so we can use the `selenium` package and a browser **driver**. We'll install the Chrome driver from [here](https://chromedriver.chromium.org/downloads), which allows Python to automate the browser. We just need to store the executable file in a sensible location. We can then control the browser using `selenium` to visit different pages or, in our case, fully load a page.
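
The "load, scroll, wait, read" sequence described above can be wrapped in a small function. This is a sketch rather than the code used in this notebook: `render_page` is a hypothetical name, and it scrolls to the bottom of the document (`document.body.scrollHeight`) instead of a fixed pixel offset, which is an assumption about what works best. It accepts any object that offers `get()`, `execute_script()` and `page_source`, such as a Selenium driver.

```python
import time


def render_page(driver, url, pause=2.0):
    """Load `url`, scroll to the bottom, wait, and return the rendered HTML.

    `driver` is any object exposing get(), execute_script() and page_source,
    such as a Selenium WebDriver.
    """
    driver.get(url)

    # Scroll to the bottom of the document rather than a fixed pixel offset
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Give the page's scripts time to run before reading the HTML
    time.sleep(pause)

    return driver.page_source
```

Inside a loop over the seasons, the returned string would then be written to a file, just like the text of a `requests` response.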

In [8]:
# Baseline url
url_base_player = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html"

In [9]:
# Import the libraries
import os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

In [10]:
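# On macOS we need to remove the quarantine flag, as chromedriver hasn't been verified by Apple
!xattr -d com.apple.quarantine /Users/Leo/chromedriver

# Initialise the driver.
# Note: recent Selenium 4 releases no longer accept `executable_path`; there the path
# is passed through a `Service` object from selenium.webdriver.chrome.service instead.
driver = webdriver.Chrome(executable_path="/Users/Leo/chromedriver")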

xattr: /Users/Leo/chromedriver: No such xattr: com.apple.quarantine


In [11]:
# Let's grab the html of each year
for year in years:
    
    url = url_base_player.format(year)
    
    # Tell the driver to render the url in the browser 
    driver.get(url)
    
    # Scroll down with JavaScript to make sure the whole table is rendered
    driver.execute_script("window.scrollTo(1,10000)")
    
    # Use the *time* library to give the JavaScript time to run
    time.sleep(2)
    
    # Write and save the data to a file in our player folder
    with open("player/{}.html".format(year), "w+") as f:
        f.write(driver.page_source)

As with the **MVP Award Ranking**, we'll parse the table of interest from the html pages we've just saved. We'll also have to remove the extra headers and add a `Year` column to each dataframe. Finally, we'll combine all the dataframes into one big one and save it to a .csv file.

In [12]:
dfs = []
for year in years:
    with open("player/{}.html".format(year)) as f:
        page = f.read()
    
    soup = BeautifulSoup(page, 'html.parser')
    # Remove the extra header row (note: find() only returns the first match)
    soup.find('tr', class_="thead").decompose()
    player_table = soup.find_all(id="per_game_stats")[0]
    player_df = pd.read_html(str(player_table))[0]
    player_df["Year"] = year
    dfs.append(player_df)

In [13]:
players = pd.concat(dfs)
#players

In [14]:
players.to_csv("players.csv")
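
Because `find` only removes the first matching row, any further header rows repeated inside the table would end up as ordinary data rows. The sketch below shows one way to drop them in pandas afterwards. It assumes such rows can be recognised by the literal text `Player` in the `Player` column; the dataframe is a made-up placeholder, not real stats.

```python
import pandas as pd

# Placeholder rows; the middle row mimics a header repeated inside the table body
players = pd.DataFrame(
    {
        "Player": ["Player A", "Player", "Player B"],
        "PTS": ["20.1", "PTS", "15.3"],
        "Year": [1991, 1991, 1991],
    }
)

# Rows whose "Player" cell is the column name itself are repeated headers
is_header_row = players["Player"] == "Player"
cleaned = players[~is_header_row].reset_index(drop=True)

# The stray text forced the column to strings, so convert it back to numbers
cleaned["PTS"] = pd.to_numeric(cleaned["PTS"])

print(cleaned)
```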

## Teams Stats

Now on to **Team Stats**. A team's W/L record is an important factor in the MVP voting, so we'll use it as a predictor in our ML model in part 2.

We'll extract and then combine the data from the two *Standings* tables (Eastern and Western conferences). In this case we can simply use the `requests` library without going through a driver.

Note: *We could also get the data from the **Expanded Standings** table, but this would require a browser driver and some manipulation in pandas to separate the W/L column of interest. We'll use the method described above instead, to familiarise ourselves with scraping and combining multiple tables.*

Finally, we'll combine all the dataframes into one big one and save it to a .csv file.

In [22]:
# Baseline url
url_base_team = "https://www.basketball-reference.com/leagues/NBA_{}_standings.html"

In [23]:
for year in years:
    
    time.sleep(1)
    
    url = url_base_team.format(year)
    
    data = requests.get(url)
    
    with open("team/{}.html".format(year), "w+") as f:
        f.write(data.text)

In [None]:
dfs = []
for year in years:
    with open("team/{}.html".format(year)) as f:
        page = f.read()
    
    soup = BeautifulSoup(page, 'html.parser')
    # Remove the first in-table header row (note: find() only returns the first match)
    soup.find('tr', class_="thead").decompose()
    
    team_table = soup.find_all(id="divs_standings_E")[0]
    team_df = pd.read_html(str(team_table))[0]
    team_df["Year"] = year
    
    # We'll create a common team column for the data coming from each table, and delete the initial column
    team_df["Team"] = team_df["Eastern Conference"]
    del team_df["Eastern Conference"]
    
    dfs.append(team_df)
    
    team_table = soup.find_all(id="divs_standings_W")[0]
    team_df = pd.read_html(str(team_table))[0]
    team_df["Year"] = year
    
    # Same as above
    team_df["Team"] = team_df["Western Conference"]
    del team_df["Western Conference"]
    
    dfs.append(team_df)

In [None]:
teams = pd.concat(dfs)
#teams

In [None]:
teams.to_csv("teams.csv")
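
The Eastern and Western blocks in the loop above do the same thing with a different column name, so they could share one helper. Below is a sketch using `rename` on placeholder data; `standardise_conference` is a hypothetical name and the team names and records are invented. Unlike the loop above, `rename` keeps the team column in first position instead of moving it to the end.

```python
import pandas as pd


def standardise_conference(df, conference_col, year):
    """Return a copy with the conference-named column renamed to 'Team'."""
    out = df.rename(columns={conference_col: "Team"})
    out["Year"] = year
    return out


# Placeholder standings tables, one per conference
east = pd.DataFrame({"Eastern Conference": ["Team A"], "W": [50], "L": [32]})
west = pd.DataFrame({"Western Conference": ["Team B"], "W": [45], "L": [37]})

teams = pd.concat(
    [
        standardise_conference(east, "Eastern Conference", 1991),
        standardise_conference(west, "Western Conference", 1991),
    ],
    ignore_index=True,
)

print(teams)
```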

### This is the end of Part 1 - We now have the data we need saved in 3 different .csv files, ready for the next step