# <b>1 <span style='color:#3f4d63'>|</span> Data Scraping</b>

First, we need to decide what data to collect. Since we want to predict who will win the NBA MVP award, we need the MVP voting results from past seasons. We also need stats for every player in each season we want to make predictions for. All the data will be scraped from [Basketball Reference](https://www.basketball-reference.com/).

* `Basketball Reference`: a site with a large collection of NBA statistics, from the league's early years to the present. The data is well structured and consistently formatted.

In [87]:
from IPython.display import clear_output
!pip install requests
clear_output()

## 1.1 MVP Stats

Next, we iterate through the years. For each year we build its URL, request the page from the website, and save the downloaded HTML to its own .html file.

In [88]:
import requests

years = list(range(1980,2022))
# URL we want to scrape. The year is replaced by a {} placeholder,
# so we can fill in each year we want to scrape.
url_start = "https://www.basketball-reference.com/awards/awards_{}.html"

for year in years: 
    url = url_start.format(year)  # Fill the {} placeholder with the specific year
    data = requests.get(url)      # Download the web page (HTML code) into data
    
    # Finally, save one .html file for each of the years we want to scrape
    with open("./HTML files/MVPs/{}.html".format(year), "w+") as f:
        f.write(data.text)
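
The loop above requests every page each time the cell is run. A minimal sketch of a variant that only downloads a page when its file is missing could look like the following. The names `cached_page`, `download_with_requests`, and the `delay` parameter are hypothetical helpers introduced here for illustration; they are not part of the original notebook.

```python
import time
from pathlib import Path


def download_with_requests(url):
    """Download a page with requests (imported lazily, so a fake can be used offline)."""
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail loudly instead of saving an error page
    return response.text


def cached_page(url, path, download=download_with_requests, delay=0.0):
    """Return the HTML stored at `path`, downloading `url` only if the file is missing.

    Hypothetical helper: `download` and `delay` are assumptions made for this sketch.
    """
    path = Path(path)
    if path.exists():
        # Already downloaded in a previous run: no request is made
        return path.read_text(encoding="utf-8")
    html = download(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    time.sleep(delay)  # pause only after a real request
    return html
```

Inside the loop, a call such as `cached_page(url_start.format(year), "./HTML files/MVPs/{}.html".format(year), delay=2)` would then replace the request and the file write.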

Saving these files is worthwhile: while we work on the parsing code we don't want to download the pages again on every run, and when scraping it is good practice to keep the number of requests to a site as low as possible. Next, we'll parse the voting table using the **Beautiful Soup** library.

In [89]:
!pip install beautifulsoup4
clear_output()
from bs4 import BeautifulSoup

with open("./HTML files/MVPs/1991.html") as f:
    page = f.read()

#page 
soup = BeautifulSoup(page, "html.parser")

What we have done above is open the .html file and create a parser object that will let us extract the table we want from the page. The first step is to get rid of the parts we don't need.

* For example, the table's top header row (class `over_header`) is of no use to us, so it is the first thing we remove.

![](./Images/mvps_table.png)

In [90]:
# The top header row of the table is of no use to us, so we remove it first
soup.find('tr', class_="over_header").decompose()
# We only want the MVP voting table, so we select it by its 'id',
# which is unique within the whole .html file.
mvp_table = soup.find(id="mvp")
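
To make the two steps above concrete, here is a small self-contained sketch that runs them on a hand-written HTML snippet instead of the downloaded page. The snippet and the `rows` variable are made up for illustration. It also guards against `find` returning `None`, which would happen if a page had no `over_header` row.

```python
from bs4 import BeautifulSoup

# Toy HTML with the same structure we rely on: two tables, one with id="mvp"
html = """
<table id="other"><tr><td>ignored</td></tr></table>
<table id="mvp">
  <tr class="over_header"><th colspan="2">Voting</th></tr>
  <tr><th>Rank</th><th>Player</th></tr>
  <tr><td>1</td><td>Michael Jordan</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns None when nothing matches, so check before calling decompose()
over_header = soup.find("tr", class_="over_header")
if over_header is not None:
    over_header.decompose()  # removes the row from the parsed tree

# Select only the table we care about by its unique id
mvp_table = soup.find(id="mvp")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in mvp_table.find_all("tr")
]
print(rows)  # [['Rank', 'Player'], ['1', 'Michael Jordan']]
```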

In [91]:
# Pandas is capable of reading tables in html
import pandas as pd
!pip install lxml
clear_output()

Let's take a brief look at `mvp_1991`. `pd.read_html` does not return a single DataFrame but a list of them, one per table it finds. Therefore, we take just the first element. With that, we have parsed the 1991 HTML page, pulled out a single table, and loaded it into pandas.

In [92]:
# pd.read_html expects a string here, so we convert the table with str().
# It returns a list of DataFrames, so we keep the first one.
mvp_1991 = pd.read_html(str(mvp_table))[0]
mvp_1991.head()

Unnamed: 0,Rank,Player,Age,Tm,First,Pts Won,Pts Max,Share,G,MP,PTS,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48
0,1,Michael Jordan,27,CHI,77.0,891.0,960,0.928,82,37.0,31.5,6.0,5.5,2.7,1.0,0.539,0.312,0.851,20.3,0.321
1,2,Magic Johnson,31,LAL,10.0,497.0,960,0.518,79,37.1,19.4,7.0,12.5,1.3,0.2,0.477,0.32,0.906,15.4,0.251
2,3,David Robinson,25,SAS,6.0,476.0,960,0.496,82,37.7,25.6,13.0,2.5,1.5,3.9,0.552,0.143,0.762,17.0,0.264
3,4,Charles Barkley,27,PHI,2.0,222.0,960,0.231,67,37.3,27.6,10.1,4.2,1.6,0.5,0.57,0.284,0.722,13.4,0.258
4,5,Karl Malone,27,UTA,0.0,142.0,960,0.148,82,40.3,29.0,11.8,3.3,1.1,1.0,0.527,0.286,0.77,15.5,0.225


Let's do the same for the rest of the seasons, repeating the same process. Since the table itself does not say which season it belongs to, we add a new column called `Year`. Because we don't want to work with a list of DataFrames, we concatenate them into one. Finally, we save this DataFrame to a .csv file so it is easy to reload later.

In [93]:
# list of dataframes
dfs = []

for year in years:
    with open("./HTML files/MVPs/{}.html".format(year)) as f:
        page = f.read()
    soup = BeautifulSoup(page, "html.parser")
    soup.find('tr', class_="over_header").decompose()
    mvp_table = soup.find(id="mvp")
    mvp = pd.read_html(str(mvp_table))[0]  # Argument must be a string
    # New column Year: the table itself does not record which season the data belongs to
    mvp['Year'] = year  
    
    dfs.append(mvp)
    
mvps = pd.concat(dfs)
mvps.to_csv("./CSV files/mvps.csv")   
mvps.head()

Unnamed: 0,Rank,Player,Age,Tm,First,Pts Won,Pts Max,Share,G,MP,...,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48,Year
0,1,Kareem Abdul-Jabbar,32,LAL,147.0,147.0,221,0.665,82,38.3,...,10.8,4.5,1.0,3.4,0.604,0.0,0.765,14.8,0.227,1980
1,2,Julius Erving,29,PHI,31.5,31.5,221,0.143,78,36.1,...,7.4,4.6,2.2,1.8,0.519,0.2,0.787,12.5,0.213,1980
2,3,George Gervin,27,SAS,19.0,19.0,221,0.086,78,37.6,...,5.2,2.6,1.4,1.0,0.528,0.314,0.852,10.6,0.173,1980
3,4,Larry Bird,23,BOS,15.0,15.0,221,0.068,82,36.0,...,10.4,4.5,1.7,0.6,0.474,0.406,0.836,11.2,0.182,1980
4,5T,Tiny Archibald,31,BOS,2.0,2.0,221,0.009,80,35.8,...,2.5,8.4,1.3,0.1,0.482,0.222,0.83,8.9,0.148,1980
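
One detail worth noting about the concatenation step: by default `pd.concat` keeps each season's own row index, so the combined DataFrame contains repeated index values. The sketch below illustrates this on two tiny frames built from values shown in the outputs above, and shows `ignore_index=True` as one possible way to get a fresh index. The variable names are made up for this example.

```python
import pandas as pd

# Two tiny per-season tables (values taken from the outputs shown above)
season_1980 = pd.DataFrame({"Player": ["Kareem Abdul-Jabbar", "Julius Erving"],
                            "Share": [0.665, 0.143]})
season_1991 = pd.DataFrame({"Player": ["Michael Jordan", "Magic Johnson"],
                            "Share": [0.928, 0.518]})

frames = []
for year, frame in [(1980, season_1980), (1991, season_1991)]:
    frame = frame.copy()
    frame["Year"] = year  # tag every row with its season
    frames.append(frame)

default_index = pd.concat(frames)                   # keeps each frame's own index
fresh_index = pd.concat(frames, ignore_index=True)  # renumbers the rows

print(default_index.index.tolist())  # [0, 1, 0, 1]
print(fresh_index.index.tolist())    # [0, 1, 2, 3]
```

Similarly, passing `index=False` to `to_csv` leaves the index out of the saved file, which may be convenient if the index carries no information.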


## 1.2 Players Stats

If we want to predict who will receive the next NBA MVP award, **we cannot just take into account the votes per year**. We also have to consider the stats of every single player in the NBA. Thus, what we do next is scrape **statistics for all the players** who played in the NBA from 1980 to 2021, the seasons in `years`. These stats are each player's **per game** averages in a season.

In [94]:
player_stats_url = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html"

# At first, we are going to just download the data from one year. 
url = player_stats_url.format(1991)
data = requests.get(url)
with open("./HTML files/Players/1991.html", "w+") as f: 
    f.write(data.text)

If we check the .html file we have just created, the table only has 17 rows, whereas in the browser it has many more. The reason is that this page uses **JavaScript** to render the remaining rows. Therefore, we'll use `Selenium` to scrape this data.
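
One way to check how many rows a saved file contains, without opening it by hand, is to count the `<tr>` tags inside the table. The following is a rough sketch using only the standard library's `html.parser`; the `TableRowCounter` class and `count_rows` function are hypothetical helpers written for this illustration, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class TableRowCounter(HTMLParser):
    """Count <tr> tags inside the table whose id equals `table_id`."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0  # > 0 while we are inside the target table
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # Enter the target table, or track tables nested inside it
            if self.depth > 0 or dict(attrs).get("id") == self.table_id:
                self.depth += 1
        elif tag == "tr" and self.depth > 0:
            self.rows += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


def count_rows(html, table_id):
    counter = TableRowCounter(table_id)
    counter.feed(html)
    return counter.rows


sample = """
<table id="other"><tr><td>x</td></tr><tr><td>y</td></tr></table>
<table id="per_game_stats">
  <tr><th>Player</th></tr>
  <tr><td>A</td></tr>
  <tr><td>B</td></tr>
</table>
"""
print(count_rows(sample, "per_game_stats"))  # 3
```

Running `count_rows` on the contents of a saved page before and after rendering it in a browser should make the difference in row counts visible.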

In [95]:
#!pip install -U selenium
#!pip install webdriver_manager
from selenium import webdriver
#from selenium.webdriver.chrome.service import Service
#ser = Service("/usr/local/bin/chromedriver_linux64/chromedriver")
#op = webdriver.ChromeOptions()
#driver = webdriver.Chrome(service=ser, options=op)
driver = webdriver.Firefox(executable_path= r"/home/javier/Descargas/geckodriver-v0.30.0-linux64/geckodriver")
clear_output()

In [96]:
#driver = webdriver.Chrome(executable_path=r"/usr/local/bin/chromedriver_linux64/chromedriver")

This opens a new Firefox window that is controlled by Selenium. We can now write code that tells the browser to visit different pages, which lets us grab their HTML. In this case, we use the browser to render the page with all of the rows we need and then grab the HTML. As before, we start with the year 1991.

In [97]:
import time

year = 1991
url = player_stats_url.format(year)

driver.get(url)
driver.execute_script("window.scrollTo(1,10000)")
html = driver.page_source

with open("./HTML files/Players/{}.html".format(year), "w+") as f:
    f.write(html)

If we open this .html file now, we will see that it has more than 17 rows; in fact, it has all the rows we need. That's because the browser executed the JavaScript. Having done this for 1991, we now repeat the process for each of the years.

In [98]:
for year in years:
    url = player_stats_url.format(year)

    driver.get(url)
    driver.execute_script("window.scrollTo(1,10000)")
    time.sleep(2)
    html = driver.page_source
    with open("./HTML files/Players/{}.html".format(year), "w+") as f:
        f.write(html)

After this, we parse the saved pages with `BeautifulSoup`, as we did before for the MVP stats. We again combine them all into one large DataFrame with every player's stats and save it to a .csv file.

In [99]:
dfs = []
for year in years: 
    with open("./HTML files/Players/{}.html".format(year)) as f:
        page = f.read()
    
    soup = BeautifulSoup(page, "html.parser")
    soup.find('tr', class_="thead").decompose()
    player_table = soup.find(id="per_game_stats")
    player = pd.read_html(str(player_table))[0]             # Argument must be a string
    player['Year'] = year   
    dfs.append(player)
    
players = pd.concat(dfs)
players.to_csv("./CSV files/players.csv") 
players.tail()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
725,536,Delon Wright,PG,28,SAC,27,8,25.8,3.9,8.3,...,1.0,2.9,3.9,3.6,1.6,0.4,1.3,1.1,10.0,2021
726,537,Thaddeus Young,PF,32,CHI,68,23,24.3,5.4,9.7,...,2.5,3.8,6.2,4.3,1.1,0.6,2.0,2.2,12.1,2021
727,538,Trae Young,PG,22,ATL,63,63,33.7,7.7,17.7,...,0.6,3.3,3.9,9.4,0.8,0.2,4.1,1.8,25.3,2021
728,539,Cody Zeller,C,28,CHO,48,21,20.9,3.8,6.8,...,2.5,4.4,6.8,1.8,0.6,0.4,1.1,2.5,9.4,2021
729,540,Ivica Zubac,C,23,LAC,72,33,22.3,3.6,5.5,...,2.6,4.6,7.2,1.3,0.3,0.9,1.1,2.6,9.0,2021


## 1.3 Team Stats

Another important factor is how the player's team performed that season. In general, the better the team's record, the higher a player's chances of winning the award. Let's now scrape this kind of data. If we look at both kinds of tables available at the following URL, we see that the conference tables use JavaScript. To keep things simple, we scrape the data from the division tables instead.

![](./Images/tablas_scraping.png)

In [100]:
team_stats_url = "https://www.basketball-reference.com/leagues/NBA_{}_standings.html" 

for year in years: 
    url = team_stats_url.format(year)
    data = requests.get(url)
    with open ("./HTML files/Teams/{}.html".format(year), "w+") as f:
        f.write(data.text)

![](./Images/division_table.png)

As we can see, the table contains something that is of no use to us: the header that reports which division the table data is from. Therefore, we remove it from both tables.

In [101]:
dfs = []
for year in years: 
    with open("./HTML files/Teams/{}.html".format(year)) as f:
        page = f.read()
    
    soup = BeautifulSoup(page, "html.parser")
    soup.find('tr', class_="thead").decompose()
    team_table = soup.find(id="divs_standings_E")
    team = pd.read_html(str(team_table))[0]             # Argument must be a string
    team['Year'] = year   
    team["Team"] = team["Eastern Conference"]
    del team["Eastern Conference"]
    dfs.append(team)
    
    soup = BeautifulSoup(page, "html.parser")
    soup.find('tr', class_="thead").decompose()
    team_table = soup.find(id="divs_standings_W")
    team = pd.read_html(str(team_table))[0]             # Argument must be a string
    team['Year'] = year   
    team["Team"] = team["Western Conference"]
    del team["Western Conference"]
    dfs.append(team)
    
teams = pd.concat(dfs)
teams.to_csv("./CSV files/teams.csv")
teams.tail()

Unnamed: 0,W,L,W/L%,GB,PS/G,PA/G,SRS,Year,Team
13,42,30,0.583,—,112.4,110.2,2.26,2021,Dallas Mavericks*
14,38,34,0.528,4.0,113.3,112.3,1.07,2021,Memphis Grizzlies*
15,33,39,0.458,9.0,111.1,112.8,-1.58,2021,San Antonio Spurs
16,31,41,0.431,11.0,114.6,114.9,-0.2,2021,New Orleans Pelicans
17,17,55,0.236,25.0,108.8,116.7,-7.5,2021,Houston Rockets
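
The Eastern and Western blocks in the cell above differ only in the table `id` and in the name of the conference column. As an illustration, the renaming step could be factored into a small function; this is only a sketch on toy data, and `normalize_standings` is a hypothetical helper, not part of the original notebook. Note that `rename` keeps the column in its original position, whereas the cell above ends up with `Team` as the last column.

```python
import pandas as pd


def normalize_standings(table, conference_column, year):
    """Rename the conference-specific column to 'Team' and tag rows with the season."""
    table = table.rename(columns={conference_column: "Team"})  # returns a new DataFrame
    table["Year"] = year
    return table


# Toy stand-ins for the two tables returned by pd.read_html (made-up values)
east = pd.DataFrame({"Eastern Conference": ["East Team A", "East Team B"], "W": [50, 40]})
west = pd.DataFrame({"Western Conference": ["West Team C"], "W": [45]})

teams_demo = pd.concat(
    [
        normalize_standings(east, "Eastern Conference", 2021),
        normalize_standings(west, "Western Conference", 2021),
    ],
    ignore_index=True,
)
print(teams_demo.columns.tolist())  # ['Team', 'W', 'Year']
```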


## 1.4 More Player Stats

In the last part of this section we scrape a few more attributes for each player who has played in the NBA during the seasons we cover. Those attributes are:

* `Height`
* `Weight`
* `First year playing in NBA`
* `Last year playing in NBA`
* `Colleges`

![](./Images/name_table.png)

In [102]:
attributes_url = "https://www.basketball-reference.com/players/{}/"

import string as st
alphabet = list(st.ascii_lowercase)

for l in alphabet: 
    url = attributes_url.format(l)
    data = requests.get(url)
    with open ("./HTML files/Players_more_stats/{}.html".format(l), "w+") as f:
        f.write(data.text)
        
# list of dataframes
dfs = []

for l in alphabet:
    with open("./HTML files/Players_more_stats/{}.html".format(l)) as f:
        page = f.read()
        
    soup = BeautifulSoup(page, "html.parser")
    players_table = soup.find(id="players")
    players_a = pd.read_html(str(players_table))[0]    # Argument must be a string      
    
    dfs.append(players_a)
    
players_more_stats = pd.concat(dfs)
players_more_stats.to_csv("./CSV files/players_more_stats.csv")   
players_more_stats.head()

Unnamed: 0,Player,From,To,Pos,Ht,Wt,Birth Date,Colleges
0,Alaa Abdelnaby,1991,1995,F-C,6-10,240,"June 24, 1968",Duke
1,Zaid Abdul-Aziz,1969,1978,C-F,6-9,235,"April 7, 1946",Iowa State
2,Kareem Abdul-Jabbar*,1970,1989,C,7-2,225,"April 16, 1947",UCLA
3,Mahmoud Abdul-Rauf,1991,2001,G,6-1,162,"March 9, 1969",LSU
4,Tariq Abdul-Wahad,1998,2003,F,6-6,223,"November 3, 1974","Michigan, San Jose State"
