# Data Scraping for MVP Predictor
For this project, we'll be scraping data from basketball-reference.com.  
  
There are a few different pages we'll be interested in for every given year:
- The page containing all players who received votes for the MVP award
- The page with every player's basic stats (points, assists, rebounds etc.)
- The page with every player's advanced stats (player efficiency rating, value over replacement, etc.)
- The page containing the NBA's standings and team statistics


## Imports

In [1]:
import pandas as pd
import numpy as np
import requests
import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

In [None]:
years = range(1980, 2023)

In [None]:
headers={"User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 12871.102.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.141 Safari/537.36"}

## Scraping MVP votes data

First, we'll save the data locally to minimize the number of requests we make.

In [None]:
for year in years:
    url = f"https://www.basketball-reference.com/awards/awards_{year}.html"
    mvp_votes = requests.get(url, headers = headers).text
    
    with open(f'Data/mvp_votes_{year}', "w+") as f:
        f.write(mvp_votes)
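
The loop above downloads every page again each time the cell is re-run. A minimal sketch of a cache-aware variant is shown below. The helper name `fetch_cached`, its `fetch` callback, and the 3-second default delay are assumptions made for illustration and are not part of the original notebook.

```python
import os
import time

def fetch_cached(url, path, fetch, delay=3.0):
    """Return the page text stored at `path`; download it with `fetch(url)` only if missing."""
    if os.path.exists(path):
        # Already saved locally, so no request is made
        with open(path, encoding="utf-8") as f:
            return f.read()
    text = fetch(url)
    # Create the target folder if needed, then save the page for later runs
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    time.sleep(delay)  # pause after a real download to space out requests
    return text
```

With the names used in this notebook, a call might look like `fetch_cached(url, f'Data/mvp_votes_{year}', lambda u: requests.get(u, headers = headers).text)`.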

Now, we'll read in the data for every year as a Pandas dataframe and concatenate all these dataframes.

In [None]:
mvp_dfs = []
for year in years:
    with open(f'Data/mvp_votes_{year}') as f:
        contents = f.read()
        
    soup = BeautifulSoup(contents, "html.parser")
    soup.find('tr', class_ = 'over_header').decompose()
    
    tab_html = soup.find(id = 'mvp')
    table = pd.read_html(str(tab_html))[0]
    
    table['Year'] = year
    
    mvp_dfs.append(table)

In [None]:
full_mvp = pd.concat(mvp_dfs, axis = 0)
full_mvp.head()

In [None]:
## Saving as CSV
full_mvp.to_csv('data/full_mvp.csv', index = False)

## Scraping Basic Stats 
Now we can scrape the basic stats. We'll use selenium along with a webdriver to do this.  
Repeating the same process, we'll save the data locally, then read it back in as a dataframe.

In [None]:
driver = webdriver.Chrome(executable_path = '/Users/orenciolli/Downloads/chromedriver')

In [None]:
for year in years:
    url = f'https://www.basketball-reference.com/leagues/NBA_{year}_per_game.html'
    driver.get(url)
    driver.execute_script('window.scrollTo(1, 10000)')

    time.sleep(2)

    with open(f'player_stats_{year}.html', 'w+') as f:
        f.write(driver.page_source)

In [None]:
basic_stats = []
for year in years: 
    with open(f'player_stats_{year}.html') as f:
        contents = f.read()

    soup = BeautifulSoup(contents, "html.parser")
    soup.find('tr', class_ = 'thead').decompose()

    tab_html = soup.find(id = 'per_game_stats')
    table = pd.read_html(str(tab_html))[0]

    table['Year'] = year

    basic_stats.append(table)

In [None]:
full_basic_stats = pd.concat(basic_stats, axis = 0)

In [None]:
full_basic_stats.head()

It's worth noting that if a player was traded mid-season, they will have multiple rows in that year: one for each team they played for and one representing their season totals. We'll only keep the total row.  
Additionally, since we will later need to merge on the team column (`Tm`), we'll replace the string 'TOT' (representing total) with the last team that player played for in the season in question.

In [None]:
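def player_traded(df):
    # A traded player has several rows in one season: a 'TOT' row plus one row per team
    if df.shape[0] > 1:
        # .copy() so the assignment below doesn't write to a slice of the original dataframe
        total = df[df['Tm'] == 'TOT'].copy()
        # relabel the total row with the last team listed for that season
        total['Tm'] = df.iloc[-1]['Tm']
        return total
    else:
        return df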
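
To see what this logic does, here is a small sketch on a toy dataframe. The players, teams, and numbers are made up, and `keep_total_row` is a hypothetical stand-in for the function above. The groups are processed with a plain loop rather than `groupby(...).apply(...)`, so the result does not depend on how a given pandas version builds the index.

```python
import pandas as pd

# Toy per-game rows: player "A" was traded (TOT row plus two team rows), player "B" was not
toy = pd.DataFrame({
    "Player": ["A", "A", "A", "B"],
    "Year": [2020, 2020, 2020, 2020],
    "Tm": ["TOT", "CLE", "DET", "LAL"],
    "PTS": [15.0, 17.0, 12.0, 25.0],
})

def keep_total_row(group):
    """Keep the TOT row for traded players and relabel it with the last team listed."""
    if len(group) > 1:
        total = group[group["Tm"] == "TOT"].copy()
        total["Tm"] = group.iloc[-1]["Tm"]
        return total
    return group

# One piece per (Player, Year) group, stitched back together with a fresh index
pieces = [keep_total_row(g) for _, g in toy.groupby(["Player", "Year"])]
result = pd.concat(pieces, ignore_index=True)
print(result)
```

Player "A" ends up with a single row carrying the season totals and the team "DET", while player "B" is left unchanged.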

In [None]:
#applying to the dataframe
full_basic_stats = full_basic_stats.groupby(['Player', 'Year']).apply(player_traded)

In [None]:
full_basic_stats.index = full_basic_stats.index.droplevel()
full_basic_stats.index = full_basic_stats.index.droplevel()

There are also a few entries (such as Kareem Abdul-Jabbar in 1980) which have asterisks next to their names. We'll delete these asterisks so that the dataframe is easier to merge later.

In [None]:
full_basic_stats['Player'] = full_basic_stats['Player'].str.replace('*', '', regex = False)
full_basic_stats = full_basic_stats.drop(columns = 'Rk') #dropping unnecessary column
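
The `regex = False` argument matters here because `*` is a quantifier in regular expressions, so it has to be treated as a literal character. A quick sketch on a toy Series (the second name is a placeholder):

```python
import pandas as pd

names = pd.Series(["Kareem Abdul-Jabbar*", "Some Player"])

# regex=False makes pandas treat '*' as a plain character rather than a regex quantifier
cleaned = names.str.replace("*", "", regex=False)
print(cleaned.tolist())
```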

In [None]:
#saving as csv
full_basic_stats.to_csv('data/full_basic_stats.csv', index = False)

## Scraping Team Data
Now, we'll repeat the same process as before to extract the team data from basketball-reference's standings page.

In [None]:
for year in years:
    team_url = f'https://www.basketball-reference.com/leagues/NBA_{year}_standings.html'
    
    data = requests.get(team_url, headers = headers)
    with open(f'data/team_data_{year}.html', 'w+') as f:
        f.write(data.text)

In [None]:
teams = []
for year in years:
    with open(f'data/team_data_{year}.html') as f:
        contents = f.read()

    soup = BeautifulSoup(contents, "html.parser")
    soup.find('tr', class_ = 'thead').decompose()

    #Scraping the eastern conference table
    tab_html = soup.find(id = 'divs_standings_E')
    table = pd.read_html(str(tab_html))[0]
    table['Team'] = table['Eastern Conference']
    table['Year'] = year
    teams.append(table)

    #scraping the western conference table
    tab_html = soup.find(id = 'divs_standings_W')
    table = pd.read_html(str(tab_html))[0]
    table['Team'] = table['Western Conference']
    table['Year'] = year

    teams.append(table)

In [None]:
teams = pd.concat(teams, axis = 0)

Since these data were scraped from the 'Division Standings' table, we have unwanted rows which are meant to separate the different divisions (for example, we may notice rows which say "Atlantic Division"). We'll simply drop these rows, as they don't contain any useful information for us.

In [None]:
teams = teams[~teams['Team'].str.contains('Division')]
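
As an illustration of this filter, the sketch below applies the same boolean mask to a toy standings table. The rows and win totals are illustrative values, not scraped data.

```python
import pandas as pd

# Toy standings: divider rows repeat the division name in every column
toy_standings = pd.DataFrame({
    "Team": ["Atlantic Division", "Boston Celtics*", "Central Division", "Atlanta Hawks"],
    "W": ["Atlantic Division", "61", "Central Division", "50"],
})

# True for divider rows; ~ inverts the mask so only real team rows are kept
is_divider = toy_standings["Team"].str.contains("Division")
toy_teams = toy_standings[~is_divider]
print(toy_teams)
```

Note that this approach assumes no actual team name contains the word "Division".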

We also combined the 'Western Conference' and 'Eastern Conference' columns into a single column called `Team` which contains team names from both conferences.  
We'll now drop the two original columns.

In [None]:
teams = teams.drop(columns = ['Western Conference', 'Eastern Conference'])

And finally, we must deal with the unwanted asterisks again, as they appear in the `Team` column here.

In [None]:
teams['Team'] = teams['Team'].str.replace('*', '', regex = False)

In [None]:
# Saving as csv
teams.to_csv('data/team_data.csv', index = False)

## Scraping Advanced Player Data
Now, we'll again repeat the process to extract each player's advanced stats from each year.

In [None]:
for year in years:
    players_url = f'https://www.basketball-reference.com/leagues/NBA_{year}_advanced.html'
    
    adv_yr = requests.get(players_url, headers = headers).text
    
    with open(f'Data/advanced_stats_{year}', "w+") as g:
        g.write(adv_yr)

In [None]:
advanced_df = []
for year in years:
    with open(f'Data/advanced_stats_{year}') as g:
        contents = g.read()

    soup = BeautifulSoup(contents, "html.parser")
    table_html = soup.find('table', class_ = 'sortable stats_table')

    adv_stats_tab  = pd.read_html(str(table_html))[0]
    adv_stats_tab['Year'] = year
    advanced_df.append(adv_stats_tab[['Year', 'Player', 'Pos', 'Age', 'Tm', 'G', 'MP', 'PER', 'TS%', '3PAr',
       'FTr', 'ORB%', 'DRB%', 'TRB%', 'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%'
                                      , 'OWS', 'DWS', 'WS', 'WS/48', 'OBPM',
       'DBPM', 'BPM', 'VORP']])

In [None]:
full_advanced = pd.concat(advanced_df, axis = 0)

We have to handle the unwanted asterisks in players' names as we did in the basic player stats:

In [None]:
full_advanced['Player'] = full_advanced['Player'].str.replace('*', '', regex = False)

We'll also handle midseason trades the same way as before:

In [None]:
full_advanced = full_advanced.groupby(['Player', 'Year']).apply(player_traded)

full_advanced.index = full_advanced.index.droplevel()
full_advanced.index = full_advanced.index.droplevel()

In [None]:
#Saving as CSV
full_advanced.to_csv('data/full_advanced.csv', index = False)

## Merging dataframes
Finally, we have all the relevant data, and we now must merge them into a single dataframe which will allow us to train our model.  
As the first step, we'll narrow down the MVP dataframe to avoid creating redundant columns (as a few of the columns are shared with the basic stats dataframe).

In [None]:
narrow_mvp = full_mvp[['Player', 'Year',
                       'Pts Won', 'Pts Max', 'Share', 'Rank']]

Now we can merge these two dataframes.  
  
Note that we use an outer merge, which creates NaN values in the columns unique to the MVP dataframe ('Pts Won', 'Pts Max', 'Share', and 'Rank') for players who don't appear in it.  
We'll use the fillna method to replace these missing values with 0 (since these players by definition won no votes).

In [None]:
players = full_basic_stats.merge(narrow_mvp, left_on = ['Player', 'Year'],
                              right_on = ['Player', 'Year'], how = 'outer').fillna(0)
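
The effect of the outer merge can be seen on a toy example. In the sketch below (made-up players and values), the fill is restricted to the vote column, which is one possible variation on the cell above: it leaves any missing values in the stats columns untouched instead of turning them into 0 as well.

```python
import pandas as pd

# Toy inputs: two players with stats, only one of whom received MVP votes
stats = pd.DataFrame({"Player": ["A", "B"], "Year": [2020, 2020], "PTS": [25.0, 10.0]})
votes = pd.DataFrame({"Player": ["A"], "Year": [2020], "Share": [0.9]})

# The outer merge keeps player "B", whose Share is NaN right after the merge
toy_merged = stats.merge(votes, on=["Player", "Year"], how="outer")

# Fill only the vote-related columns with 0
vote_cols = ["Share"]
toy_merged[vote_cols] = toy_merged[vote_cols].fillna(0)
print(toy_merged)
```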

To facilitate a merge between the `players` dataframe (which has team abbreviations in its `Tm` column) and the `teams` dataframe (which has full team names), we'll use another csv file which contains all the abbreviations and team names.

This data was not scraped. I simply wrote the csv file by hand as there aren't many teams present in the data and it's easy enough to assemble manually.  

In [None]:
abbs = pd.read_csv('data/abbreviations.csv')

We'll now use this dataframe to create a dictionary of abbreviation, team name pairs, and then map it onto the players dataframe to create a new column (`Team_abb`) containing the full team name that corresponds to each abbreviation.

In [None]:
abbs_dict = abbs.set_index('Abbreviation').to_dict(orient = 'dict')['Name']
players['Team_abb'] = players['Tm'].map(abbs_dict)
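
Because the abbreviations file was written by hand, it may be worth checking that every abbreviation was actually mapped. A small sketch with toy data, where "XXX" is a deliberately unmapped abbreviation and the variable names are chosen for illustration:

```python
import pandas as pd

# Toy lookup table with the same two columns as the abbreviations file
toy_abbs = pd.DataFrame({
    "Abbreviation": ["LAL", "BOS"],
    "Name": ["Los Angeles Lakers", "Boston Celtics"],
})
toy_players = pd.DataFrame({"Player": ["A", "B", "C"], "Tm": ["LAL", "BOS", "XXX"]})

# Abbreviation -> full team name
mapping = toy_abbs.set_index("Abbreviation")["Name"].to_dict()
toy_players["Team_abb"] = toy_players["Tm"].map(mapping)

# Abbreviations missing from the dictionary become NaN; list them so they can be fixed
unmapped = toy_players.loc[toy_players["Team_abb"].isna(), "Tm"].unique().tolist()
print(unmapped)
```

Any abbreviation that shows up in this list would fail to match a team in the merge that follows.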

And finally, we can merge our dataframes:

In [None]:
merged = players.merge(teams, left_on = ['Team_abb', 'Year'], right_on = ['Team', 'Year'], how = 'outer')

Now, to merge with the advanced dataframe, we must again exclude certain columns already represented, as this would introduce redundant information.

In [None]:
full_advanced = full_advanced.drop(columns = ['Pos', 'Age', 'Tm', 'G', 'MP'])

We can now make our final merge:

In [None]:
full_table = merged.merge(full_advanced, 
                          left_on = ['Player', 'Year'], 
                          right_on = ['Player', 'Year'], how = 'inner')

Before we save this file, we should convert as many columns as possible to numeric dtypes, which we'll accomplish using Pandas' DataFrame.apply() method and pd.to_numeric() function.

In [None]:
full_table = full_table.apply(lambda x: pd.to_numeric(x, errors = 'ignore'))
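
As far as we're aware, recent pandas releases deprecate `errors = 'ignore'` in `pd.to_numeric`, so the cell above may emit a warning depending on the installed version. A minimal sketch of an equivalent approach using a small helper (the name `to_numeric_if_possible` is hypothetical) on toy data:

```python
import pandas as pd

def to_numeric_if_possible(col):
    """Convert a column to a numeric dtype, or return it unchanged if conversion fails."""
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

# Toy table: PTS holds numbers stored as strings, Player holds real text
toy_table = pd.DataFrame({"Player": ["A", "B"], "PTS": ["25.1", "10"]})
converted = toy_table.apply(to_numeric_if_possible)
print(converted.dtypes)
```

The text column is left as it was, while the numeric-looking column is converted.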

In [None]:
# Saving as CSV
full_table.to_csv('data/nba_player_stats.csv', index = False)

And now, we're ready to use this data to train our model!