## CE9010 Introduction to Data Science Project

# Data Scraper



First, we import the relevant libraries.

In [2]:
from bs4 import Comment, BeautifulSoup as bs
import urllib.request
import csv
import traceback
import pandas as pd




Data Acquisition
 ------

#### Warning: long runtime of up to 40 minutes. The program has finished scraping when `END OF YEAR: 2016` is printed.

The scraper is written with BeautifulSoup.

#### Types of data extracted
---

Our main goal is to predict the impact of college prospects in their first and second NBA seasons based on their college stats. Hence we use the advanced stats available for college players (ws, ws/40min) to predict their impact in the NBA, measured by the advanced stats available there (per, ws, ws/48, bpm, vorp) for their first two years in the league. We extract data from the 1996 draft onwards, as college advanced statistics are only available from the 1995 season onwards. We also include only the first 30 picks of each draft, as players picked after the 30th pick usually do not play significant minutes for their teams.

Players' NBA stats are extracted from [Basketball Reference](https://www.basketball-reference.com/).

Players' college stats are extracted from [Sports Reference](https://www.sports-reference.com/).
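
To make the layout of the two output files concrete, the sketch below builds the column names implied by the description above and by the order in which the scraper writes each row. The CSV files themselves are written without a header row, so labels such as `per_y1`, `college_ws_per_40` and `draft_year` are hypothetical names chosen for illustration only.

```python
# Hypothetical column labels: the scraper writes no header row, these names are assumptions.
NBA_STATS = ['per', 'ws', 'ws_per_48', 'bpm', 'vorp']  # NBA targets (data-stat names used by the scraper)
COLLEGE_STATS = ['ws', 'ws_per_40', 'sos']             # college inputs written by the scraper


def nba_columns(stats=NBA_STATS):
    """Player name, then every stat for year 1 and year 2, then the draft year."""
    columns = ['player']
    for stat in stats:
        columns.append(stat + '_y1')  # first NBA season
        columns.append(stat + '_y2')  # second NBA season
    columns.append('draft_year')
    return columns


def college_columns(stats=COLLEGE_STATS):
    """Player name, the college stats, then the draft year."""
    return ['player'] + ['college_' + stat for stat in stats] + ['draft_year']


print(nba_columns())
print(college_columns())
```

Labels like these could be passed to `pd.read_csv` through its `names` argument instead of relying on `header=None` alone.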

---



#### Problems encountered

---

The main problem encountered when running the scraper is its long runtime for a relatively small dataset (estimated at fewer than 200 players; this figure still needs to be confirmed). The bottleneck occurs when parsing the whole player page to look for the link to the Sports Reference website, which is needed to acquire each player's college stats. Having to parse the whole page for every player makes the code inefficient and the runtime long. A better solution would be to use another scraping framework such as Scrapy, which allows data extraction using [selectors](https://doc.scrapy.org/en/latest/topics/selectors.html). Extracting the link to the Sports Reference page with an XPath or CSS selector would likely be much quicker than parsing the entire page. However, as this scraper only needs to be run once, optimizing it or rewriting it in another framework is not a major concern.
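
As a rough illustration of stopping at the link instead of building a parse tree for the whole page, the sketch below uses Python's standard-library `html.parser` rather than Scrapy. The class `FirstLinkFinder`, the function `find_link` and the sample HTML are hypothetical, and we have not measured whether this is actually faster on the real player pages.

```python
from html.parser import HTMLParser


class FirstLinkFinder(HTMLParser):
    """Remember the href of the first <a> whose text contains a marker string."""

    def __init__(self, marker):
        super().__init__()
        self.marker = marker
        self.href = None    # result; stays None if no link matches
        self._href = None   # href of the <a> element currently being read
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            if self.href is None and self.marker in ''.join(self._text):
                self.href = self._href
            self._href = None


def find_link(html_text, marker, chunk_size=4096):
    """Feed the page in chunks and stop as soon as a matching link is found."""
    finder = FirstLinkFinder(marker)
    for start in range(0, len(html_text), chunk_size):
        finder.feed(html_text[start:start + chunk_size])
        if finder.href is not None:
            break  # the rest of the page is never parsed
    return finder.href


# Made-up page fragment, only for demonstration
sample = ('<div><a href="/players/x.html">Profile</a>'
          '<p>More: <a href="https://example.com/cbb/players/some-player-1.html">'
          'College Basketball at Sports-Reference.com</a></p>'
          '<a href="/other">College Basketball again</a></div>')
print(find_link(sample, 'College Basketball'))
```

Unlike the scraper, which tests `"College Basketball" in str(i)` on the whole tag, this sketch only looks at the link text, so the two are not guaranteed to match the same links on every page.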

## Utility functions

In [3]:


def extract_comments(inside_html, html_id):
    # The stats tables are wrapped in HTML comments, so collect the comment nodes first
    comments = inside_html.find_all(string=lambda text: isinstance(text, Comment))
    # Keep only the comments that contain the table with the requested id
    matches = [comment for comment in comments if 'id="' + html_id + '"' in comment]
    return bs(matches[0], 'html.parser')  # parse the commented-out table on its own

def read_url_into_soup(url, retries=5):
    # Download the page and parse it; retry a limited number of times on failure
    try:
        page = urllib.request.urlopen(url).read()
        return bs(page, 'html.parser')
    except Exception:
        traceback.print_exc()
        if retries <= 0:
            raise  # give up instead of recursing forever
        return read_url_into_soup(url, retries - 1)
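
The tables we need are wrapped in HTML comments on the page, which is why `extract_comments` searches the `Comment` nodes. The same idea can be illustrated with the standard library alone; the sketch below is not the scraper's own code, and `CommentCollector`, `find_commented_block` and the sample page (including its made-up value) are hypothetical.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the text of every HTML comment in a document."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # data is the text between "<!--" and "-->"
        self.comments.append(data)


def find_commented_block(html_text, html_id):
    """Return the first comment containing id="<html_id>", or None if there is none."""
    collector = CommentCollector()
    collector.feed(html_text)
    collector.close()
    needle = 'id="' + html_id + '"'
    for comment in collector.comments:
        if needle in comment:
            return comment
    return None


# Made-up page fragment: the table only exists inside a comment
sample_page = """
<div id="all_advanced">
<!-- a note that is not a table -->
<!--
<table id="advanced"><tbody><tr><td data-stat="per">1.0</td></tr></tbody></table>
-->
</div>
"""
block = find_commented_block(sample_page, 'advanced')
print(block.strip())
```

The returned string is ordinary HTML and can be parsed again on its own, which is what `extract_comments` does with `bs(..., 'html.parser')`. Returning `None` for a missing table is a design choice of this sketch.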

## Main

In [7]:
# newline="" stops the csv module from writing blank rows on Windows
f = open("data/nba_all.csv", "w", newline="")
f2 = open("data/college_all.csv", "w", newline="")
writer = csv.writer(f)
writerf = csv.writer(f2)
year_dict = {}  # player name -> draft year

for year in range(1996, 2017):
    url = "https://www.basketball-reference.com/draft/NBA_" + str(year) + ".html"
    html = read_url_into_soup(url)

    for row in html.table.tbody.findAll("tr"):

        if row.find("td") is None:  # row without data cells: end of the picks we want
            print("END OF YEAR: " + str(year) + "\n")
            break

        seasons = row.find("td", {"data-stat": "seasons"})
        college = row.find("td", {"data-stat": "college_name"})

        if (not seasons.get_text() or            # player did not play in the NBA after being drafted
                int(seasons.get_text()) < 2 or   # fewer than 2 seasons played
                len(college) < 1):               # no college listed (e.g. player came from a European league)
            continue

        player = row.find("td", {"data-stat": "player"})
        year_dict[player.get_text()] = year
        print(player.get_text())
        inside_html = read_url_into_soup("https://www.basketball-reference.com/" + player.a['href'])

        advanced = extract_comments(inside_html, 'advanced')

        # NBA stats: first and second season for each column
        out = [player.string]
        cols = ['per', 'ws', 'ws_per_48', 'bpm', 'vorp']  # stats to include
        for col in cols:
            cells = advanced.findAll('td', {'data-stat': col})
            out.append(cells[0].string)  # 1st year
            out.append(cells[1].string)  # 2nd year
        out.append(year)
        writer.writerow(out)

        # Find the link to the player's college stats page
        coll_url = None
        for i in inside_html.findAll("a"):  # inefficient: scans every link on the page
            if "College Basketball" in str(i):
                coll_url = i['href']
                break
        if coll_url is None:  # no college page linked: skip rather than reuse the previous player's URL
            continue

        coll_html = read_url_into_soup(coll_url)
        coll_advanced = extract_comments(coll_html, 'players_advanced')

        # College stats: ws and ws/40 from the final college season, plus sos
        ws = coll_advanced.tbody.findAll('td', {'data-stat': 'ws'})[-1].string
        ws40 = coll_advanced.tbody.findAll('td', {'data-stat': 'ws_per_40'})[-1].string
        sos = coll_html.findAll('td', {'data-stat': 'sos'})[-2].string
        writerf.writerow([player.string, ws, ws40, sos, year])

f.close()
f2.close()

Allen Iverson
Marcus Camby
Shareef Abdur-Rahim
Stephon Marbury
Ray Allen
Antoine Walker
Lorenzen Wright
Kerry Kittles
Samaki Walker
Erick Dampier
Todd Fuller
Vitaly Potapenko
Steve Nash
Tony Delk
John Wallace
Walter McCarty
Roy Rogers
Derek Fisher
Jerome Williams
Brian Evans
Priest Lauderdale
Travis Knight
END OF YEAR: 1996

Tim Duncan
Keith Van Horn
Chauncey Billups
Antonio Daniels
Tony Battie
Ron Mercer
Tim Thomas
Adonal Foyle
Danny Fortson
Tariq Abdul-Wahad
Austin Croshere
Derek Anderson
Maurice Taylor
Kelvin Cato
Brevin Knight
Johnny Taylor
Scot Pollard
Paul Grant
Anthony Parker
Ed Gray
Bobby Jackson
Rodrick Rhodes
John Thomas
Charles Smith
Jacque Vaughn
Keith Booth
END OF YEAR: 1997

Michael Olowokandi
Mike Bibby
Raef LaFrentz
Antawn Jamison
Vince Carter
Robert Traylor
Jason Williams
Larry Hughes
Paul Pierce
Bonzi Wells
Michael Doleac
Keon Clark
Michael Dickerson
Matt Harpring
Bryce Drew
Pat Garrity
Roshown McLeod
Ricky Davis
Brian Skinner
Tyronn Lue
Felipe Lopez
Sam Jacobson
Core

## Preparing the data for machine learning

In [16]:
year_dict = {}  # player name -> draft year, rebuilt without re-scraping the player pages

for year in range(1996, 2017):
    url = "https://www.basketball-reference.com/draft/NBA_" + str(year) + ".html"
    html = read_url_into_soup(url)

    for row in html.table.tbody.findAll("tr"):

        if row.find("td") is None:  # row without data cells: end of the picks we want
            print("END OF YEAR: " + str(year) + "\n")
            break

        seasons = row.find("td", {"data-stat": "seasons"})
        college = row.find("td", {"data-stat": "college_name"})

        if (not seasons.get_text() or            # player did not play in the NBA after being drafted
                int(seasons.get_text()) < 2 or   # fewer than 2 seasons played
                len(college) < 1):               # no college listed (e.g. player came from a European league)
            continue

        player = row.find("td", {"data-stat": "player"})
        year_dict[player.get_text()] = year
print(year_dict)

END OF YEAR: 1996

END OF YEAR: 1997

END OF YEAR: 1998

END OF YEAR: 1999

END OF YEAR: 2000

END OF YEAR: 2001

END OF YEAR: 2002

END OF YEAR: 2003

END OF YEAR: 2004

END OF YEAR: 2005

END OF YEAR: 2006

END OF YEAR: 2007

END OF YEAR: 2008

END OF YEAR: 2009

END OF YEAR: 2010

END OF YEAR: 2011

END OF YEAR: 2012

END OF YEAR: 2013

END OF YEAR: 2014

END OF YEAR: 2015

END OF YEAR: 2016

{'Jason Williams': 1998, 'Mike Sweetney': 2003, 'DerMarr Johnson': 2000, 'Reece Gaines': 2003, 'Steve Francis': 1999, 'Adonal Foyle': 1997, 'Charles Smith': 1997, 'Eddie Griffin': 2001, 'C.J. Wilcox': 2014, 'Richard Hamilton': 1999, 'Tyler Ennis': 2014, 'John Salmons': 2002, 'Alec Burks': 2011, 'Ed Davis': 2010, 'Antonio Daniels': 1997, 'Kendall Marshall': 2012, 'Alando Tucker': 2007, 'Andrew Bogut': 2005, 'Derrick Rose': 2008, 'Fred Jones': 2002, 'Brendan Haywood': 2001, 'Trevor Booker': 2010, 'Reggie Jackson': 2011, 'Tyronn Lue': 1998, 'Nerlens Noel': 2013, 'Joe Alexander': 2008, 'Felipe Lope

In [32]:
nba_df = pd.read_csv('nba_all.csv', header=None)
# Look up each player's draft year by name; players missing from year_dict get None
nba_df['year'] = nba_df[0].apply(lambda x: year_dict.get(x))
nba_df.head()

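| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Allen Iverson | 18.0 | 20.4 | 4.1 | 9.0 | 0.065 | 0.138 | 1.5 | 3.8 | 2.7 | 4.6 | 1996 |
| 1 | Marcus Camby | 17.8 | 15.9 | 3.7 | 0.9 | 0.095 | 0.022 | -0.3 | -0.7 | 0.8 | 0.7 | 1996 |
| 2 | Shareef Abdur-Rahim | 17.4 | 21.1 | 2.9 | 6.9 | 0.049 | 0.113 | -2.0 | 1.2 | 0.0 | 2.3 | 1996 |
| 3 | Stephon Marbury | 16.1 | 16.3 | 3.7 | 5.3 | 0.077 | 0.082 | -1.0 | -0.6 | 0.6 | 1.1 | 1996 |
| 4 | Ray Allen | 14.6 | 16.2 | 4.9 | 7.0 | 0.092 | 0.102 | 0.3 | 1.8 | 1.5 | 3.2 | 1996 |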


In [28]:
college_df = pd.read_csv('college_all.csv', header=None)
# Look up each player's draft year by name; players missing from year_dict get None
college_df['year'] = college_df[0].apply(lambda x: year_dict.get(x))
college_df.head()

| | 0 | 1 | 2 | 3 | year |
|---|---|---|---|---|---|
| 0 | Allen Iverson | 8.6 | 0.283 | 9.83 | 1996 |
| 1 | Marcus Camby | 8.1 | 0.32 | 8.92 | 1996 |
| 2 | Shareef Abdur-Rahim | 5.4 | 0.221 | 6.44 | 1996 |
| 3 | Stephon Marbury | 4.3 | 0.127 | 12.71 | 1996 |
| 4 | Ray Allen | 8.3 | 0.303 | 8.19 | 1996 |
