# NBA Web scraping 


This project is a practice exercise in collecting data from the web and cleaning it.

This file is the starting point of the project: it collects NBA statistics, with the end goal of using them to
predict the next Most Valuable Player (MVP) of the NBA. The following data is scraped from the web:

- Most Valuable Player data
- Overall player data
- Team data

With these we can move to cleaning the data, which can later be used in machine learning. 

In [1]:
import requests
import os
import shutil
import pandas as pd

In [2]:
## The project uses data from the years 1990-2023.

years = list(range(1990, 2024))

## MVP data

A few different types of data will be scraped to get a comprehensive understanding of the player statistics.


In [3]:
## Basketball-Reference has the data available, so that site will be used.

url_start = 'https://www.basketball-reference.com/awards/awards_{}.html'

In [4]:
## Scraping the MVP pages from the website, one HTML page per year.

os.makedirs('mvp', exist_ok=True)  ## make sure the output folder exists

for year in years:
    
    url = url_start.format(year)
    data = requests.get(url)
    
    with open('mvp/{}.html'.format(year), 'w+', encoding='utf-8') as f:
        f.write(data.text)
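
A slightly more defensive variant of the download loop could create the output folder if needed, skip years that are already saved, and pause between requests. The sketch below is only an illustration: `download_pages`, `fetch`, and `delay_seconds` are hypothetical names that do not appear in the original code, and the three-second default pause is an arbitrary assumption rather than a documented limit.

```python
import os
import time


def download_pages(years, url_template, out_dir, fetch, delay_seconds=3.0):
    """Save one HTML page per year, skipping years that are already on disk.

    `fetch` is any callable that takes a URL and returns the page text,
    for example `lambda url: requests.get(url).text`.
    """
    os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
    saved = []

    for year in years:
        path = os.path.join(out_dir, "{}.html".format(year))
        if os.path.exists(path):
            continue  # already downloaded, so do not request it again

        html = fetch(url_template.format(year))
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)
        saved.append(path)

        time.sleep(delay_seconds)  # pause between requests

    return saved
```

With the names from this notebook, a call might look like `download_pages(years, url_start, 'mvp', lambda url: requests.get(url).text)`. Passing the fetch function in as an argument keeps the file handling separate from the network call, so the same helper could be reused with the Selenium driver.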

In [5]:
## Importing BeautifulSoup to parse through the data.

from bs4 import BeautifulSoup

In [6]:
## Parsing each year's page to build a complete table of MVP voting results for 1990-2023.

dfs = []

for year in years:

    with open("mvp/{}.html".format(year), encoding='utf-8') as f:
        page = f.read()

    soup = BeautifulSoup(page, 'html.parser')
    soup.find('tr', class_="over_header").decompose() ## unwanted subheaders removed
    
    mvp_table = soup.find_all(id="mvp")[0]

    mvp_df = pd.read_html(str(mvp_table))[0]
    mvp_df["Year"] = year
    
    dfs.append(mvp_df)

In [7]:
## Now that there is data for every year, the yearly tables are concatenated into one DataFrame.

mvps = pd.concat(dfs)
mvps.head()

Unnamed: 0,Rank,Player,Age,Tm,First,Pts Won,Pts Max,Share,G,MP,...,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48,Year
0,1,Magic Johnson,30,LAL,27.0,636.0,920,0.691,79,37.2,...,6.6,11.5,1.7,0.4,0.48,0.384,0.89,16.5,0.27,1990
1,2,Charles Barkley,26,PHI,38.0,614.0,920,0.667,79,39.1,...,11.5,3.9,1.9,0.6,0.6,0.217,0.749,17.3,0.269,1990
2,3,Michael Jordan,26,CHI,21.0,564.0,920,0.613,82,39.0,...,6.9,6.3,2.8,0.7,0.526,0.376,0.848,19.0,0.285,1990
3,4,Karl Malone,26,UTA,2.0,214.0,920,0.233,82,38.1,...,11.1,2.8,1.5,0.6,0.562,0.372,0.762,15.9,0.245,1990
4,5,Patrick Ewing,27,NYK,1.0,162.0,920,0.176,82,38.6,...,10.9,2.2,1.0,4.0,0.551,0.25,0.775,13.5,0.205,1990


In [8]:
mvps.to_csv('mvps.csv') ## Lastly, saving all the MVP data to its own file for later use.
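
Because every yearly table starts its own row index at 0, the concatenated DataFrame repeats index labels, and `to_csv` writes that index as an extra unnamed column by default. If that column is not wanted later, one option is sketched below with made-up toy data. The frames `df_1990` and `df_1991` are hypothetical stand-ins for the scraped tables, and whether the later cleaning step expects the index column is not known from this file.

```python
import io

import pandas as pd

# Two toy yearly tables standing in for the scraped ones (made-up values).
df_1990 = pd.DataFrame({"Player": ["A", "B"], "Share": [0.7, 0.3], "Year": 1990})
df_1991 = pd.DataFrame({"Player": ["C", "D"], "Share": [0.6, 0.4], "Year": 1991})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1, 0, 1.
combined = pd.concat([df_1990, df_1991], ignore_index=True)

# index=False leaves the row index out of the file, so reading it back
# does not produce an extra "Unnamed: 0" column.
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(list(restored.columns))
```

An in-memory buffer is used here only so that the example does not write files; in the notebook the same arguments would go to `mvps.to_csv('mvps.csv', index=False)`.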

## Overall player data

Next up is overall player data. We already have the statistics of the players who received MVP votes, but we also
need the data for all other players, since any of them could end up winning the MVP award.

In [10]:
## Since the list from the website is long and is not visible without scrolling, we need to import parts of
## Selenium to scroll the page. 

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
import time

chrome_options = Options()
chrome_options.add_argument(r"--webdriver.chrome.driver=C:\Users\35845\Downloads\chromedriver-win64.zip\chromedriver-win64")

driver = webdriver.Chrome(options=chrome_options)

In [11]:
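## Loading each year's page, scrolling down so the whole table is rendered, and saving the HTML.

## NOTE: player_stats_url is not defined anywhere above. The per-game URL pattern below is an
## assumption based on the "per_game_stats" table that is parsed in the next cell.
player_stats_url = 'https://www.basketball-reference.com/leagues/NBA_{}_per_game.html'

os.makedirs('player', exist_ok=True)  ## make sure the output folder exists

for year in years: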
    url = player_stats_url.format(year)
    
    driver.get(url)
    driver.execute_script("window.scrollTo(1,10000)")
    time.sleep(2)
    
    with open("player/{}.html".format(year), "wb+") as f:
        f.write(driver.page_source.encode('utf-8'))

In [12]:
dfs = []

for year in years:
    
    with open("player/{}.html".format(year), encoding='utf-8') as f:
        page = f.read()
    
    soup = BeautifulSoup(page, 'html.parser')
    
    for thead_row in soup.find_all('tr', class_="thead"):
        thead_row.decompose()
    
    player_table = soup.find_all(id="per_game_stats")[0]
    
    player_df = pd.read_html(str(player_table))[0]
    player_df["Year"] = year
    
    dfs.append(player_df)

In [13]:
players = pd.concat(dfs)
players.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year
0,1,Mark Acres,C,27,ORL,80,50,21.1,1.7,3.6,...,1.9,3.5,5.4,0.8,0.5,0.3,0.9,3.1,4.5,1990
1,2,Michael Adams,PG,27,DEN,79,74,34.1,5.0,12.5,...,0.6,2.2,2.8,6.3,1.5,0.0,1.8,1.7,15.5,1990
2,3,Mark Aguirre,SF,30,DET,78,40,25.7,5.6,11.5,...,1.5,2.4,3.9,1.9,0.4,0.2,1.6,2.6,14.1,1990
3,4,Danny Ainge,PG,30,SAC,75,68,36.4,6.7,15.4,...,0.9,3.4,4.3,6.0,1.5,0.2,2.5,3.2,17.9,1990
4,5,Mark Alarie,PF,26,WSB,82,10,23.1,4.5,9.6,...,1.8,2.7,4.6,1.7,0.7,0.5,1.2,2.7,10.5,1990


In [14]:
players.to_csv('players.csv') ## Once again saving the data for later use.

## Team data

Lastly, we collect the team data. This lets us check whether a team's record correlates with one of its players
winning the MVP award.

Most of the steps here are the same as for the player data.

In [15]:
team_stats_url = 'https://www.basketball-reference.com/leagues/NBA_{}_standings.html'

In [18]:
## Scraping the team standings pages with the driver.

os.makedirs('team', exist_ok=True)  ## make sure the output folder exists

for year in years:
    url = team_stats_url.format(year)
    
    driver.get(url)
    driver.execute_script("window.scrollTo(1,10000)")
    time.sleep(2)
    
    with open("team/{}.html".format(year), "wb+") as f:
        f.write(driver.page_source.encode('utf-8'))

In [19]:
dfs = []

for year in years:
    
    with open("team/{}.html".format(year), encoding='utf-8') as f:
        page = f.read()
    soup = BeautifulSoup(page, 'html.parser')
    soup.find('tr', class_="thead").decompose()
    
    ## Eastern conference team data
    
    east_table = soup.find_all(id="divs_standings_E")[0]
    east_df = pd.read_html(str(east_table))[0]
    east_df["Year"] = year
    east_df['Team'] = east_df['Eastern Conference']
    del east_df['Eastern Conference']
    dfs.append(east_df)
    
    ## Western conference team data
        
    west_table = soup.find_all(id="divs_standings_W")[0]
    west_df = pd.read_html(str(west_table))[0]
    west_df["Year"] = year
    west_df['Team'] = west_df['Western Conference']
    del west_df['Western Conference']
    dfs.append(west_df)
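
The Eastern and Western blocks above repeat the same steps with a different column name, so they could be folded into one small helper. The sketch below uses made-up toy tables; `tidy_conference` is a hypothetical name that is not part of the original notebook. Unlike the original code, `rename` keeps the team column in its original position instead of moving it to the end.

```python
import pandas as pd


def tidy_conference(df, conference_col, year):
    """Return a copy with the conference column renamed to 'Team' and a 'Year' column added."""
    tidy = df.rename(columns={conference_col: "Team"})  # returns a new DataFrame
    tidy["Year"] = year
    return tidy


# Toy standings tables (made-up values) shaped like the ones read_html returns.
east = pd.DataFrame({"Eastern Conference": ["Team A", "Team B"], "W": [50, 30]})
west = pd.DataFrame({"Western Conference": ["Team C", "Team D"], "W": [45, 35]})

teams_1990 = pd.concat(
    [
        tidy_conference(east, "Eastern Conference", 1990),
        tidy_conference(west, "Western Conference", 1990),
    ],
    ignore_index=True,
)
print(teams_1990)
```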


In [22]:
teams = pd.concat(dfs)
teams.head()

Unnamed: 0,W,L,W/L%,GB,PS/G,PA/G,SRS,Year,Team
0,53,29,0.646,—,110.2,105.2,4.23,1990,Philadelphia 76ers*
1,52,30,0.634,1.0,110.0,106.0,3.23,1990,Boston Celtics*
2,45,37,0.549,8.0,108.3,106.9,0.78,1990,New York Knicks*
3,31,51,0.378,22.0,107.7,109.9,-2.43,1990,Washington Bullets
4,18,64,0.22,35.0,100.6,110.3,-9.59,1990,Miami Heat


In [23]:
teams.to_csv('teams.csv') ## Saving for later use.