# Project Luther

Kenny Leung - kenleung11@gmail.com

Part 3/8 - Scraping drafted NBA players' birthdays, heights and weights

This notebook documents the process of scraping the birthday, height and weight of each drafted NBA player from https://www.basketball-reference.com/.

In [1]:
# import packages
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
import pickle
from datetime import datetime

# Data Scraping

To scrape the data on each drafted player's page, I first needed the URL extension (for example, /players/j/jamesle01.html) of every player drafted since 2003. I started by testing the code on the 2003 draft class alone.

In [2]:
url = 'https://www.basketball-reference.com/draft/NBA_2003.html'

response = requests.get(url)
page = response.text
soup = BeautifulSoup(page, "lxml")

url_list = []  # player page URL extensions
for x in soup.find_all(class_='left')[1::3]:  # slice picks the cells that hold the player link
    link = x.find('a')
    # skip drafted players who have no player page (no <a> tag in the cell)
    if link is not None and link.has_attr('href'):
        url_list.append(link['href'])
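
The positional slice `[1::3]` depends on the exact column layout of the draft table. As a rough alternative sketch, player links can also be selected by their `/players/` prefix. The `SAMPLE_HTML` snippet and the `extract_player_links` helper below are hypothetical stand-ins, not the real page markup, and on the live site other `/players/` links outside the draft table might need to be filtered out as well.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for two rows of a draft table.
# The real page markup may differ.
SAMPLE_HTML = """
<table>
  <tr>
    <td class="left"><a href="/teams/CLE/draft.html">CLE</a></td>
    <td class="left"><a href="/players/j/jamesle01.html">LeBron James</a></td>
    <td class="left"></td>
  </tr>
  <tr>
    <td class="left"><a href="/teams/DET/draft.html">DET</a></td>
    <td class="left">Player Without Page</td>
    <td class="left"></td>
  </tr>
</table>
"""


def extract_player_links(html):
    """Return the unique hrefs that start with /players/, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if href.startswith("/players/") and href not in links:
            links.append(href)
    return links


player_links = extract_player_links(SAMPLE_HTML)
print(player_links)
```

A player without a page simply contributes no link, so no try/except is needed in this variant.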

Once I was confident the code worked, I scraped every draft page from 2003 through 2016 and built a list of URL extensions for all drafted players.

In [20]:
url_template = "https://www.basketball-reference.com/draft/NBA_{year}.html"
url_list = []

for year in range(2003, 2017):
    print(year)
    url = url_template.format(year=year)  # get the url

    response = requests.get(url)
    page = response.text

    soup = BeautifulSoup(page, 'html5lib')
    
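    for x in soup.find_all(class_='left')[1::3]:
        link = x.find('a')
        # skip drafted players who have no player page (no <a> tag in the cell)
        if link is not None and link.has_attr('href'):
            url_list.append(link['href'])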
    
    time.sleep(.5+2*random.random())

2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
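
The loop above combines two small ideas: generating one URL per draft year and pausing a random interval between requests. A minimal sketch that separates them is shown here. The helper names `draft_urls` and `polite_delay` are made up for illustration, and no request is actually sent.

```python
import random

DRAFT_URL_TEMPLATE = "https://www.basketball-reference.com/draft/NBA_{year}.html"


def draft_urls(first_year, last_year):
    """Build one draft-page URL per year, with both end years included."""
    return [DRAFT_URL_TEMPLATE.format(year=year)
            for year in range(first_year, last_year + 1)]


def polite_delay(base=0.5, jitter=2.0):
    """Return a pause length in seconds, between base and base + jitter."""
    return base + jitter * random.random()


urls = draft_urls(2003, 2016)
print(len(urls), urls[0], urls[-1])

# In a real scraping loop the pause would be applied with
# time.sleep(polite_delay()) after each request.
pause = polite_delay()
```

Note that `range(2003, 2017)` in the notebook stops at 2016, which is why the helper adds 1 to its inclusive end year.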


In [22]:
# save the url list to a pickle file so the draft pages do not need to be scraped again
with open('draft_url_list.pkl', 'wb') as fp:
    pickle.dump(url_list, fp)

In [10]:
with open("draft_url_list.pkl", 'rb') as picklefile: 
    url_list = pickle.load(picklefile)
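
The save and load cells above can be folded into one caching step: load the pickle if it exists, otherwise run the scrape and save the result. The sketch below assumes a hypothetical helper `load_or_build` and a stand-in `fake_scrape` function, and writes to a temporary directory rather than the notebook's working directory.

```python
import os
import pickle
import tempfile


def load_or_build(path, build):
    """Load a pickled value from path, or build it and save it if missing."""
    if os.path.exists(path):
        with open(path, "rb") as fp:
            return pickle.load(fp)
    value = build()
    with open(path, "wb") as fp:
        pickle.dump(value, fp)
    return value


calls = []  # records how many times the (fake) scrape actually ran


def fake_scrape():
    # Stand-in for the real scraping loop; returns a one-item URL list.
    calls.append(1)
    return ["/players/j/jamesle01.html"]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "draft_url_list.pkl")
    first = load_or_build(cache_path, fake_scrape)   # runs fake_scrape
    second = load_or_build(cache_path, fake_scrape)  # reads the pickle
```

With this pattern, rerunning the notebook would not hit the website again unless the pickle file is deleted.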

# Scraping birthday, height and weight from each player's page

Now that I have a list of player URL extensions, I wanted to scrape the birthday, height and weight of every drafted player. I started by extracting the data I needed from LeBron James's page.

In [12]:
player_url = 'https://www.basketball-reference.com/players/j/jamesle01.html'

response2 = requests.get(player_url)
page2 = response2.text
soup2 = BeautifulSoup(page2, "lxml")

# get player data
Player = soup2.find_all('h1')[0].getText()

# height is listed as feet-inches, e.g. '6-8'
height_parts = soup2.find_all(itemprop="height")[0].getText().split('-')
Height_ft = int(height_parts[0])
Height_in = int(height_parts[1])

# weight text starts with the number of pounds; [:3] assumes a three-digit weight
Weight = int(soup2.find_all(itemprop="weight")[0].getText()[:3])

# the birthDate element holds two links: month and day, then year
birth_links = soup2.find_all(itemprop="birthDate")[0].find_all('a')
BirthMonth = birth_links[0].getText().split()[0][:3]  # abbreviate to three letters for %b
BirthDay = birth_links[0].getText().split()[1]
Birthyear = birth_links[1].getText()

date = BirthMonth + ' ' + BirthDay + ' ' + Birthyear
Birthday = datetime.strptime(date, '%b %d %Y') # convert to datetime object using datetime module

column_headers = ['Player', 'Birthday', 'Height_ft', 'Height_in', 'Weight']
player_data = [Player, Birthday, Height_ft, Height_in, Weight]

dic = dict(zip(column_headers,player_data))

df = pd.DataFrame(dic,columns=column_headers, index=[0])

In [32]:
df

|   | Player | Birthday | Height_ft | Height_in | Weight |
|---|--------|----------|-----------|-----------|--------|
| 0 | LeBron James | 1984-12-30 | 6 | 8 | 250 |
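
The string handling in the cell above can be isolated into small functions that are easy to check without a network request. This is only a sketch: the function names are hypothetical, and the input formats ('6-8', '250lb', 'December 30' plus '1984') are assumptions inferred from how the notebook splits the scraped text.

```python
import re
from datetime import datetime


def parse_height(text):
    """Split a 'feet-inches' string such as '6-8' into two integers."""
    feet, inches = text.strip().split("-")
    return int(feet), int(inches)


def parse_weight(text):
    """Read the leading digits of a string such as '250lb'."""
    match = re.match(r"\s*(\d+)", text)
    if match is None:
        raise ValueError("no leading number in weight text: %r" % text)
    return int(match.group(1))


def parse_birthday(month_day, year):
    """Combine 'December 30' and '1984' into a datetime object."""
    month, day = month_day.split()
    # %b expects an abbreviated English month name under the default locale.
    return datetime.strptime("%s %s %s" % (month[:3], day, year), "%b %d %Y")


height = parse_height("6-8")
weight = parse_weight("250lb")
birthday = parse_birthday("December 30", "1984")
```

Unlike slicing with `[:3]`, the regular expression in `parse_weight` does not assume the weight has exactly three digits.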


In [34]:
# loop through url_list to get all drafted players' info
url_template2 = "https://www.basketball-reference.com{player}"
column_headers = ['Player', 'Birthday', 'Height_ft', 'Height_in', 'Weight']
rows = []  # one dict per player; the DataFrame is built once after the loop

for player in url_list:
    print(player)
    url = url_template2.format(player=player)
    
    response = requests.get(url)
    page = response.text

    soup = BeautifulSoup(page, 'html5lib')
    
    Player = soup.find_all('h1')[0].getText()
    # the birthDate element holds two links: month and day, then year
    birth_links = soup.find_all(itemprop="birthDate")[0].find_all('a')
    BirthMonth = birth_links[0].getText().split()[0][:3]
    BirthDay = birth_links[0].getText().split()[1]
    Birthyear = birth_links[1].getText()
    date = BirthMonth + ' ' + BirthDay + ' ' + Birthyear
    Birthday = datetime.strptime(date, '%b %d %Y')
    # height is listed as feet-inches, e.g. '6-8'
    height_parts = soup.find_all(itemprop="height")[0].getText().split('-')
    Height_ft = int(height_parts[0])
    Height_in = int(height_parts[1])
    # [:3] assumes a three-digit weight
    Weight = int(soup.find_all(itemprop="weight")[0].getText()[:3])
    
    player_data = [Player, Birthday, Height_ft, Height_in, Weight]
    rows.append(dict(zip(column_headers, player_data)))
    
    time.sleep(.5+2*random.random())  # random pause of 0.5 to 2.5 seconds between requests

# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
player_df = pd.DataFrame(rows, columns=column_headers)

/players/j/jamesle01.html
/players/m/milicda01.html
/players/a/anthoca01.html
/players/b/boshch01.html
/players/w/wadedw01.html
/players/k/kamanch01.html
/players/h/hinriki01.html
/players/f/fordtj01.html
/players/s/sweetmi01.html
/players/h/hayesja01.html
/players/p/pietrmi01.html
/players/c/collini01.html
/players/b/banksma01.html
/players/r/ridnolu01.html
/players/g/gainere01.html
/players/b/belltr01.html
/players/c/cabarza01.html
/players/w/westda01.html
/players/p/pavloal01.html
/players/j/jonesda02.html
/players/d/diawbo01.html
/players/p/planizo01.html
/players/o/outlatr01.html
/players/c/cookbr01.html
/players/d/delfica01.html
/players/e/ebind01.html
/players/p/perkike01.html
/players/b/barbole01.html
/players/h/howarjo01.html
/players/l/lampema01.html
/players/k/kaponja01.html
/players/w/waltolu01.html
/players/b/beaslje01.html
/players/s/schorso01.html
/players/s/szewcsz01.html
/players/a/austima01.html
/players/h/hansetr01.html
/players/b/blakest01.html
/players/v/vranesl01.

/players/t/tomican01.html
/players/d/dragigo01.html
/players/w/walkebi01.html
/players/h/hairsma01.html
/players/h/hardide01.html
/players/j/jacksda01.html
/players/d/dragita01.html
/players/l/leunema01.html
/players/t/taylomi01.html
/players/k/kaunsa01.html
/players/c/crawfjo01.html
/players/e/erdense01.html
/players/g/griffbl01.html
/players/t/thabeha01.html
/players/h/hardeja01.html
/players/e/evansty01.html
/players/r/rubiori01.html
/players/f/flynnjo01.html
/players/c/curryst01.html
/players/h/hilljo01.html
/players/d/derozde01.html
/players/j/jennibr01.html
/players/w/willite01.html
/players/h/hendege02.html
/players/h/hansbty01.html
/players/c/clarkea01.html
/players/d/dayeau01.html
/players/j/johnsja01.html
/players/h/holidjr01.html
/players/l/lawsoty01.html
/players/t/teaguje01.html
/players/m/maynoer01.html
/players/c/collida01.html
/players/c/clavevi01.html
/players/c/casspom01.html
/players/m/mulleby01.html
/players/b/beaubro01.html
/players/g/gibsota01.html
/players/c/carr

/players/e/ennisty01.html
/players/h/harriga01.html
/players/c/cabocbr01.html
/players/m/mcgarmi01.html
/players/a/adamsjo01.html
/players/h/hoodro01.html
/players/n/napiesh01.html
/players/c/capelca01.html
/players/h/hairspj02.html
/players/b/bogdabo01.html
/players/w/wilcocj01.html
/players/h/huestjo01.html
/players/a/anderky01.html
/players/i/inglida01.html
/players/m/mcdankj01.html
/players/h/harrijo01.html
/players/e/earlycl01.html
/players/s/stokeja01.html
/players/o/obryajo01.html
/players/d/daniede01.html
/players/d/dinwisp01.html
/players/g/grantje01.html
/players/r/robingl02.html
/players/j/jokicni01.html
/players/j/johnsni01.html
/players/t/tavarwa01.html
/players/b/brownma02.html
/players/p/poweldw01.html
/players/c/clarkjo01.html
/players/s/smithru01.html
/players/p/pattela01.html
/players/b/bairsca01.html
/players/b/brownal01.html
/players/a/antetth01.html
/players/m/micicva01.html
/players/g/gential01.html
/players/d/dangune01.html
/players/c/chrisse01.html
/players/m/ma
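
A long scraping loop like the one above stops at the first page that lacks one of the expected fields. One possible way to make it more forgiving is to record failures and carry on, as in this sketch. The names `scrape_all` and `fake_fetch`, and the path `/players/x/missing01.html`, are hypothetical and used only for illustration; `fake_fetch` stands in for the real request-and-parse step.

```python
def scrape_all(paths, fetch_player):
    """Call fetch_player on each path, collecting rows and failures separately."""
    rows, failures = [], []
    for path in paths:
        try:
            rows.append(fetch_player(path))
        except (IndexError, ValueError, AttributeError) as err:
            # IndexError: an expected element was not found on the page
            # ValueError: a number or date could not be parsed
            # AttributeError: a lookup returned None
            failures.append((path, repr(err)))
    return rows, failures


def fake_fetch(path):
    # Stand-in for requesting and parsing one player page.
    if "missing" in path:
        raise IndexError("no height field on page")
    return {"Player": path}


rows, failures = scrape_all(
    ["/players/j/jamesle01.html", "/players/x/missing01.html"],
    fake_fetch,
)
print(len(rows), "rows,", len(failures), "failures")
```

The failed paths can then be inspected or retried by hand instead of rerunning the whole loop.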

In [36]:
# save to csv file
player_df.to_csv('player_raw.csv')
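
By default `to_csv` also writes the row index as a first column with an empty header, which `read_csv` later labels `Unnamed: 0`. If that extra column is not wanted downstream, one option is to pass `index=False` when saving and `parse_dates` when loading. The sketch below uses a one-row example frame and an in-memory buffer instead of `player_raw.csv`.

```python
import io

import pandas as pd

# One-row example frame with the same columns as player_df.
example_df = pd.DataFrame({
    "Player": ["LeBron James"],
    "Birthday": pd.to_datetime(["1984-12-30"]),
    "Height_ft": [6],
    "Height_in": [8],
    "Weight": [250],
})

buffer = io.StringIO()                # in-memory stand-in for a CSV file
example_df.to_csv(buffer, index=False)  # do not write the row index
buffer.seek(0)

# parse_dates turns the Birthday column back into datetime values.
restored = pd.read_csv(buffer, parse_dates=["Birthday"])
print(restored.dtypes)
```

Whether to drop the index depends on how later notebooks read the file, so this is a suggestion rather than a required change.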

In [35]:
player_df

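|   | Player | Birthday | Height_ft | Height_in | Weight |
|---|--------|----------|-----------|-----------|--------|
| 0 | LeBron James | 1984-12-30 | 6 | 8 | 250 |
| 1 | Darko Milicic | 1985-06-20 | 7 | 0 | 250 |
| 2 | Carmelo Anthony | 1984-05-29 | 6 | 8 | 240 |
| 3 | Chris Bosh | 1984-03-24 | 6 | 11 | 235 |
| 4 | Dwyane Wade | 1982-01-17 | 6 | 4 | 220 |
| 5 | Chris Kaman | 1982-04-28 | 7 | 0 | 265 |
| 6 | Kirk Hinrich | 1981-01-02 | 6 | 4 | 190 |
| 7 | T.J. Ford | 1983-03-24 | 6 | 0 | 165 |
| 8 | Mike Sweetney | 1982-10-25 | 6 | 8 | 275 |
| 9 | Jarvis Hayes | 1981-08-09 | 6 | 7 | 220 |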
