# Testing Out Web Scraping of stats.nba.com with Selenium

To get all the NBA data necessary for my analysis, I have to web-scrape multiple pages of stats.nba.com.

Here, I follow this wonderful tutorial: http://kevincsong.com/Scraping-stats.nba.com-with-python/

After testing out both BeautifulSoup and Selenium, I conclude that Selenium makes it much easier to scrape a dynamic, JavaScript-dependent site like **stats.nba.com**. Selenium drives a real browser window, so you can watch where your code navigates, which makes it more user-friendly in my opinion. Finally, Selenium lets you move between pages by "clicking" through them, whereas BeautifulSoup only parses HTML that you have already fetched, so every page needs its own URL.

Selenium will be the vehicle of choice for web-scraping in this project.

In [57]:
from selenium import webdriver
import pandas as pd

In [58]:
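# instantiate a Selenium-controlled Chrome browser
# note: executable_path is the Selenium 3 way to point at the driver; Selenium 4 takes the path through a Service object instead
path_to_chromedriver = '/Users/skylershi/Data Science/chromedriver' # path to the local chromedriver binary
browser = webdriver.Chrome(executable_path=path_to_chromedriver)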

In [3]:
# navigate to URL
url = 'https://stats.nba.com/leaders'
browser.get(url)

In [6]:
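# configure the dynamic page's dropdowns: regular season, display all players
# note: find_element_by_xpath is the Selenium 3 helper; Selenium 4 spells it find_element(By.XPATH, ...)
# these absolute XPaths mirror the page layout at the time of writing and will break if the site changes
# first dropdown: pick its second option
browser.find_element_by_xpath('/html/body/main/div[2]/div/div[2]/div/div/div[1]/div[2]/div/div/label/select/option[2]').click()
# page selector under the stats table: pick its first option
browser.find_element_by_xpath('/html/body/main/div[2]/div/div[2]/div/div/nba-stat-table/div[3]/div/div/select/option[1]').click()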

In [7]:
# get the html table of player stats
table = browser.find_element_by_class_name('nba-stat-table__overflow')

In [8]:
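# convert the table to text and read stats off line by line
# line 0 is the header (its first token is dropped); after that each player spans three lines:
# a line that is skipped (presumably the rank), the player name, then the space-separated stats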
column_names = []
player_stats = []

temp_player_stat = []
for line_id, line in enumerate(table.text.split('\n')):
    if line_id == 0:
        column_names = line.split(' ')[1:]
    else:
        if line_id % 3 == 2:
            temp_player_stat.append(line)
        if line_id % 3 == 0:
            temp_player_stat.extend(line.split(' '))
            player_stats.append(temp_player_stat)
            temp_player_stat = []
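
To make the three-line-per-player layout easier to see, here is a minimal sketch of the same parsing logic as a standalone function, run on a tiny made-up string. The function name `parse_stat_table`, the sample text, and the player names are all hypothetical; the real `table.text` has many more columns, and its exact layout is an assumption read off the loop above.

```python
def parse_stat_table(text):
    """Split table text into (column_names, rows), assuming three lines per player."""
    lines = text.split('\n')
    # Header line: drop the first token, mirroring the [1:] slice in the cell above.
    column_names = lines[0].split(' ')[1:]
    body = lines[1:]
    rows = []
    # Walk the body three lines at a time: skipped line, player name, stat values.
    for start in range(0, len(body) - 2, 3):
        _skipped, name, stats = body[start:start + 3]
        rows.append([name] + stats.split(' '))
    return column_names, rows


# Hypothetical, shortened sample; not real scraped output.
sample_text = "\n".join([
    "# PLAYER GP PTS",
    "1", "Player A", "61 34.4",
    "2", "Player B", "57 30.5",
])
column_names, rows = parse_stat_table(sample_text)
print(column_names)
print(rows)
```

Stepping through the body in strides of three replaces the modulo checks, and an incomplete trailing group is simply ignored, as it is in the original loop.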

In [42]:
# spell out the percent sign in column names, e.g. FG% becomes FG_PCT
column_names = [column_name.replace('%','_PCT') for column_name in column_names]

In [44]:
df = pd.DataFrame(player_stats, columns = column_names)
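
Because every value was read from the table's text, each cell in `player_stats` is still a string, so the resulting columns hold text rather than numbers. One possible way to handle this, sketched below with hypothetical sample rows and a hypothetical helper `to_number`, is to convert the stat cells before building the DataFrame. This step is not part of the original notebook.

```python
def to_number(cell):
    """Return an int or float when the text parses as one, else the original string."""
    for convert in (int, float):
        try:
            return convert(cell)
        except ValueError:
            continue
    return cell


# Hypothetical rows in the same shape as player_stats: name first, then stats.
sample_rows = [["Player A", "61", "34.4"], ["Player B", "57", "30.5"]]

# Leave the player name untouched and convert only the stat cells.
typed_rows = [[row[0]] + [to_number(cell) for cell in row[1:]] for row in sample_rows]
print(typed_rows)
```

Passing rows converted this way to `pd.DataFrame` should give numeric columns, which matters for sorting and arithmetic later on.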

In [45]:
df.head()

,PLAYER,GP,MIN,PTS,FGM,FGA,FG_PCT,3PM,3PA,3P_PCT,...,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,EFF
0,James Harden,61,36.7,34.4,9.9,22.7,43.5,4.4,12.6,35.2,...,11.8,86.1,1.0,5.3,6.4,7.4,1.7,0.9,4.5,31.8
1,Bradley Beal,57,36.0,30.5,10.4,22.9,45.5,3.0,8.4,35.3,...,8.0,84.2,0.9,3.3,4.2,6.1,1.2,0.4,3.4,25.4
2,Giannis Antetokounmpo,57,30.9,29.6,10.9,20.0,54.7,1.5,4.8,30.6,...,10.0,63.3,2.3,11.5,13.7,5.8,1.0,1.0,3.7,34.8
3,Trae Young,60,35.3,29.6,9.1,20.8,43.7,3.4,9.5,36.1,...,9.3,86.0,0.5,3.7,4.3,9.3,1.1,0.1,4.8,26.6
4,Damian Lillard,58,36.9,28.9,9.2,20.0,45.7,3.9,9.9,39.4,...,7.6,88.8,0.5,3.8,4.3,7.8,1.0,0.4,2.9,27.8


# Data Persistence

### Can Store DataFrames as Pickle Files

In [46]:
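# not strictly needed for the cells below: df.to_pickle and pd.read_pickle handle the pickling themselves
import pickle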

In [47]:
df.to_pickle('../Pickles/19-20-nba_player_stats.pkl')

In [48]:
pd.read_pickle('../Pickles/19-20-nba_player_stats.pkl')

,PLAYER,GP,MIN,PTS,FGM,FGA,FG_PCT,3PM,3PA,3P_PCT,...,FTA,FT_PCT,OREB,DREB,REB,AST,STL,BLK,TOV,EFF
0,James Harden,61,36.7,34.4,9.9,22.7,43.5,4.4,12.6,35.2,...,11.8,86.1,1.0,5.3,6.4,7.4,1.7,0.9,4.5,31.8
1,Bradley Beal,57,36.0,30.5,10.4,22.9,45.5,3.0,8.4,35.3,...,8.0,84.2,0.9,3.3,4.2,6.1,1.2,0.4,3.4,25.4
2,Giannis Antetokounmpo,57,30.9,29.6,10.9,20.0,54.7,1.5,4.8,30.6,...,10.0,63.3,2.3,11.5,13.7,5.8,1.0,1.0,3.7,34.8
3,Trae Young,60,35.3,29.6,9.1,20.8,43.7,3.4,9.5,36.1,...,9.3,86.0,0.5,3.7,4.3,9.3,1.1,0.1,4.8,26.6
4,Damian Lillard,58,36.9,28.9,9.2,20.0,45.7,3.9,9.9,39.4,...,7.6,88.8,0.5,3.8,4.3,7.8,1.0,0.4,2.9,27.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
258,Anthony Tolliver,47,15.6,3.5,1.1,3.3,34.8,0.8,2.6,32.2,...,0.5,72.0,0.6,2.2,2.8,0.7,0.3,0.2,0.6,4.7
259,Rodney McGruder,50,15.0,3.2,1.2,3.1,39.1,0.4,1.6,27.8,...,0.6,53.6,0.5,2.0,2.6,0.6,0.5,0.2,0.4,4.4
260,Semi Ojeleye,61,14.6,3.1,1.1,2.6,40.9,0.6,1.6,36.7,...,0.5,89.3,0.4,1.6,2.0,0.5,0.3,0.1,0.2,4.3
261,Matthew Dellavedova,57,14.4,3.1,1.1,3.1,35.4,0.4,1.6,23.1,...,0.6,86.5,0.3,1.0,1.3,3.2,0.4,0.0,1.0,4.9


### Can Write to SQL Database

In [67]:
# you need to install sqlalchemy, psycopg2 AND python-dotenv!
# psycopg2 is the PostgreSQL driver that SQLAlchemy talks through here
from sqlalchemy import create_engine
from dotenv import load_dotenv
import os

In [69]:
# load the database credentials from a local .env file so they stay out of the notebook
load_dotenv()
db_username = os.getenv('db_username')
db_password = os.getenv('db_password')
db_host     = os.getenv('db_host')
db_name     = os.getenv('db_name')
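
Note that `os.getenv` returns `None` for a variable that is not set, so a typo in the `.env` file only surfaces later as a confusing connection error. As a hedged sketch (the helper `read_db_settings` and the sample values are hypothetical, not part of the original notebook), the four settings could be checked up front:

```python
import os

# The variable names used in the cell above.
REQUIRED_KEYS = ("db_username", "db_password", "db_host", "db_name")


def read_db_settings(environ=os.environ):
    """Return the four database settings, or raise if any is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not environ.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: environ[key] for key in REQUIRED_KEYS}


# Usage with a made-up mapping standing in for the real environment.
fake_environ = {
    "db_username": "analyst",
    "db_password": "secret",
    "db_host": "localhost:5432",
    "db_name": "nba",
}
settings = read_db_settings(fake_environ)
print(sorted(settings))
```

In the notebook itself the function would be called with no argument, after `load_dotenv()`, so that it reads the real environment.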

In [61]:
db_url = "postgresql+psycopg2://{}:{}@{}/{}".format(db_username,
                                                    db_password,
                                                    db_host,
                                                    db_name)
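
One caveat with plain string formatting: if the username or password contains characters such as `@` or `/`, the resulting URL is ambiguous and must be URL-encoded first. A minimal sketch using the standard library's `quote_plus`, with a hypothetical helper name `build_db_url` and made-up credentials:

```python
from urllib.parse import quote_plus


def build_db_url(username, password, host, database):
    """Build a PostgreSQL URL, escaping special characters in the credentials."""
    return "postgresql+psycopg2://{}:{}@{}/{}".format(
        quote_plus(username),  # escape characters such as '@' or ':'
        quote_plus(password),
        host,
        database,
    )


# Made-up credentials; the '@' and '/' in the password get percent-encoded.
example_url = build_db_url("analyst", "p@ss/word", "localhost:5432", "nba")
print(example_url)
# prints postgresql+psycopg2://analyst:p%40ss%2Fword@localhost:5432/nba
```

For credentials made only of letters and digits, this produces the same URL as the formatting above.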

In [71]:
sql_db = create_engine(db_url)

In [72]:
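# if_exists='replace' drops and recreates the table when it already exists;
# the DataFrame index is written as its own column by default
df.to_sql('nba_player_stats', sql_db, if_exists = 'replace')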