## CE9010 Introduction to Data Science Project

# Data Scraper



First, we import the relevant libraries.

In [1]:
from bs4 import Comment, BeautifulSoup as bs
import urllib.request
import csv
import traceback
import numpy as np
import pandas as pd

%matplotlib inline
from IPython.display import set_matplotlib_formats
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split



Data Acquisition
 ------

#### Warning, long runtime of up to 40 minutes. Program will finish scraping when END OF YEAR: 2016 is printed.

The scraper is written with BeautifulSoup.

#### Types of data extracted
---

Our main goal is to predict the impact of college prospects in their first and second NBA seasons from their college statistics. We therefore use the advanced statistics available for college players (WS, WS/40 min) to predict NBA impact, measured by the advanced statistics available for the NBA (PER, WS, WS/48, BPM, VORP), over a player's first two years in the league. We extract data from the 1996 draft onwards because college advanced statistics are only available from the 1995 season. We also include only the first 30 picks of each draft, since players picked after the 30th usually do not play significant minutes for their teams.

Player's NBA stats extracted from [Basketball Reference](https://www.basketball-reference.com/).

Player's College stats extracted from [Sports Reference](https://www.sports-reference.com/)

---



#### Problems encountered

---

The main problem with the scraper is its long runtime for a relatively small dataset (estimated at fewer than 200 players; this figure still needs to be confirmed). The bottleneck is finding the link to each player's Sports Reference page, which holds their college stats: the scraper parses the whole player page and then checks every anchor on it, once per player. A better solution would be another scraping framework such as Scrapy, which supports data extraction using [selectors](https://doc.scrapy.org/en/latest/topics/selectors.html). Extracting the Sports Reference link with an XPath or CSS selector should be quicker than scanning every anchor on the page. However, as the scraper only needs to run once, optimizing or rewriting it is not a major concern.
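As a rough illustration of a more targeted lookup, the sketch below picks out the first anchor whose `href` contains a given substring. It is not Scrapy and not the scraper used here: it relies only on Python's built-in `html.parser`, the HTML snippet is made up, and both the helper names (`FirstLinkFinder`, `find_college_url`) and the `sports-reference.com/cbb` substring are assumptions chosen for the example.

```python
from html.parser import HTMLParser


class FirstLinkFinder(HTMLParser):
    """Remember the first <a> tag whose href contains a given substring."""

    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.href = None

    def handle_starttag(self, tag, attrs):
        # Only look at anchors, and stop recording after the first match.
        if tag == "a" and self.href is None:
            href = dict(attrs).get("href") or ""
            if self.needle in href:
                self.href = href


def find_college_url(page_html, needle="sports-reference.com/cbb"):
    """Return the first matching link, or None if the page has none."""
    finder = FirstLinkFinder(needle)
    finder.feed(page_html)
    return finder.href


# Made-up snippet standing in for a player page (hypothetical URLs).
sample_page = """
<div id="meta"><a href="/teams/PHI/1997.html">Philadelphia 76ers</a></div>
<div id="links">
  <a href="https://www.sports-reference.com/cbb/players/example-player-1.html">
    College Basketball at Sports-Reference.com</a>
</div>
"""
college_url = find_college_url(sample_page)
print(college_url)
```

Matching on the link target instead of converting every anchor to a string keeps the check cheap, and returning `None` makes a missing college page explicit.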

## Utility functions

In [2]:


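import time

def extract_comments(inside_html, html_id):
    # The stats tables are wrapped in HTML comments, so collect the comment nodes first.
    comments = inside_html.find_all(string=lambda text: isinstance(text, Comment))
    # Keep the comment that contains the table with the requested id.
    # Note: extract() detaches every comment from the soup, so a second call
    # on the same soup object will not find anything.
    comments = [c for c in (comment.extract() for comment in comments) if 'id="' + html_id + '"' in c]
    return bs(comments[0], 'html.parser')

def read_url_into_soup(url, retries=5, delay=2):
    # Retry a bounded number of times instead of recursing forever on a persistent failure.
    for attempt in range(retries):
        try:
            page = urllib.request.urlopen(url).read()
            return bs(page, 'html.parser')
        except Exception:
            traceback.print_exc()
            time.sleep(delay)  # short pause before the next attempt
    raise RuntimeError("Could not fetch " + url + " after " + str(retries) + " attempts")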

## Main

In [20]:
# newline="" is what the csv module expects; it avoids blank rows on Windows
f = open("data/nba_all.csv", "w", newline="")
f2 = open("data/college_all.csv", "w", newline="")
writer = csv.writer(f)
writerf = csv.writer(f2)

for year in range(1996, 2017):
    url = "https://www.basketball-reference.com/draft/NBA_" + str(year) + ".html"
    html = read_url_into_soup(url)

    for row in html.table.tbody.find_all("tr"):

        if row.find("td") is None:  # row without data cells: stop scraping this year
            print("END OF YEAR: " + str(year) + "\n")
            break

        seasons = row.find("td", {"data-stat": "seasons"})
        college = row.find("td", {"data-stat": "college_name"})

        if (not seasons.get_text() or           # player did not play in the NBA after being drafted
                int(seasons.get_text()) < 2 or  # fewer than 2 seasons played
                college is None or not college.get_text()):  # no college listed (e.g. played in the EuroLeague)
            continue

        player = row.find("td", {"data-stat": "player"})
        print(player.get_text())
        inside_html = read_url_into_soup("https://www.basketball-reference.com/" + player.a['href'])  # player page

        advanced = extract_comments(inside_html, 'advanced')

        # NBA stats: 1st and 2nd season for every column
        out = [player.string]
        cols = ['per', 'ws', 'ws_per_48', 'bpm', 'vorp']  # 'ws' added to match the column names used when reading nba_all.csv
        for col in cols:
            cells = advanced.find_all('td', {'data-stat': col})
            out.append(cells[0].string)  # 1st year
            out.append(cells[1].string)  # 2nd year
        out.append(year)
        writer.writerow(out)

        # Find the link to the college stats page (inefficient: checks every anchor on the page).
        coll_url = None
        for a in inside_html.find_all("a"):
            if "College Basketball" in str(a):
                coll_url = a['href']
                break
        if coll_url is None:  # no college page linked: skip rather than reuse the previous player's URL
            continue

        coll_html = read_url_into_soup(coll_url)
        coll_advanced = extract_comments(coll_html, 'players_advanced')
        # extract_comments() detaches the comments from the soup, so re-read the page before the second lookup
        coll_html = read_url_into_soup(coll_url)
        coll_players_pm = extract_comments(coll_html, 'players_per_min')

        # per-minute stats, taken from the last row of the table body
        cols = ['g','gs','fg_per_min','fga_per_min','fg_pct','fg2_per_min','fg2a_per_min','fg2_pct','fg3_per_min',
                'fg3a_per_min','fg3_pct','ft_per_min','fta_per_min','ft_pct','trb_per_min','ast_per_min', 'stl_per_min',
                'blk_per_min', 'tov_per_min','pf_per_min','pts_per_min']
        pm = [coll_players_pm.tbody.find_all('td', {'data-stat': c})[-1].string for c in cols]

        # advanced stats, taken from the last row of the table body
        cols = ["mp", "ts_pct", "efg_pct", "fg3a_per_fga_pct", "fta_per_fga_pct", "ws_per_40"]
        adv = [coll_advanced.tbody.find_all('td', {'data-stat': c})[-1].string for c in cols]

        # strength of schedule: second-to-last 'sos' cell on the page
        sos = coll_html.find_all('td', {'data-stat': 'sos'})[-2].string
        result = [player.string, sos, year]
        result.extend(adv)
        result.extend(pm)
        writerf.writerow(result)

f.close()
f2.close()

Allen Iverson
['Allen Iverson', '9.83', 1996, '1213', '.578', '.547', '.366', '.488', '.283', '37', '37', '10.3', '21.4', '.480', '7.4', '13.6', '.546', '2.9', '7.8', '.366', '7.1', '10.5', '.678', '4.6', '5.7', '4.1', '0.5', '4.6', '2.9', '30.5']
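The row printed above can be sanity-checked against the definition of a per-40-minute rate. A minimal sketch, assuming the fourth value (`'1213'`) is minutes played and the ninth (`'.283'`) is win shares per 40 minutes, as the column order in the scraper suggests; `per_40_to_total` is a hypothetical helper introduced only for this check.

```python
def per_40_to_total(rate_per_40, minutes_played):
    """Convert a per-40-minute rate into a total over the minutes played."""
    return rate_per_40 * minutes_played / 40


# Values copied from the Allen Iverson row printed above.
minutes_played = 1213
ws_per_40 = 0.283

total_ws = per_40_to_total(ws_per_40, minutes_played)
print(round(total_ws, 1))
```

The result rounds to 8.6, which agrees with the college win shares listed for Allen Iverson in the merged table further down.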


## Machine Learning
#### We read college_all.csv and nba_all.csv from copies kept outside the data/ directory, so that accidentally re-running the scraper above does not overwrite the files used here.

- Remove rows with NaN (some college statistics were not recorded for 1997-1998).

In [12]:
college_df = pd.read_csv('college_all.csv', header=None)
college_df = college_df.rename(index=str, columns={0: "Player"})
# college_df.columns = ["Player","sos", "year","mp", "ts_pct","efg_pct","fg3a_per_fga_pct", "fta_per_fga_pct","ws_per_40",
#                       'g','gs','fg_per_min','fga_per_min','fg_pct','fg2_per_min','fg2a_per_min','fg2_pct','fg3_per_min',
#                 'fg3a_per_min','fg3_pct','ft_per_min','fta_per_min','ft_pct','trb_per_min','ast_per_min', 'stl_per_min',
#                 'blk_per_min', 'tov_per_min','pf_per_min','pts_per_min']
nba_df = pd.read_csv('nba_all.csv', header=None)
nba_df.columns = ['Player','PER 1st', 'PER 2nd', 'WS 1st','WS 2nd', 'WS48 1st','WS48 2nd', 'BPM 1st', 'BPM 2nd','VORP 1st', 'VORP 2nd', 'year']

nba_df = nba_df.drop(['year'], axis=1)  # the draft year is also stored in college_df

comb = pd.merge(nba_df, college_df, on='Player')  # inner join on the player name

print(college_df[college_df.isnull().any(axis=1)].shape)  # college rows with at least one NaN
print(comb.shape)
comb = comb.dropna(how='any')  # remove rows with NaN
print(comb.shape)
print("Are there any rows with empty cells?", comb.isnull().values.any())

# comb = comb.sort_values(by = ["College Win Shares"])


(46, 5)
(486, 15)
(440, 15)
Are there any rows with empty cells? False


## Train/test split

In [23]:
# random seed, so that the splits are reproducible
seed = 100

from sklearn.model_selection import KFold

comb = pd.read_csv('merged_data.csv',index_col=0)
comb.head()

Unnamed: 0,Player,PER 1st,PER 2nd,Win Shares 1st,Win Shares 2nd,Win Shares per 48 min 1st,Win Shares per 48 min 2nd,BPM 1st,BPM 2nd,VORP 1st,VORP 2nd,Win Shares(Col),Win Shares per 40 min(Col),Strength of Schedule(Col),Year Drafted
0,Allen Iverson,18.0,20.4,4.1,9.0,0.065,0.138,1.5,3.8,2.7,4.6,8.6,0.283,9.83,1996
1,Marcus Camby,17.8,15.9,3.7,0.9,0.095,0.022,-0.3,-0.7,0.8,0.7,8.1,0.32,8.92,1996
2,Shareef Abdur-Rahim,17.4,21.1,2.9,6.9,0.049,0.113,-2.0,1.2,0.0,2.3,5.4,0.221,6.44,1996
3,Stephon Marbury,16.1,16.3,3.7,5.3,0.077,0.082,-1.0,-0.6,0.6,1.1,4.3,0.127,12.71,1996
4,Ray Allen,14.6,16.2,4.9,7.0,0.092,0.102,0.3,1.8,1.5,3.2,8.3,0.303,8.19,1996


In [25]:
feature_cols = ["Win Shares(Col)", "Win Shares per 40 min(Col)", "Strength of Schedule(Col)", "Year Drafted"]
# Targets are the NBA stat columns of the merged data. Iterating over list(nba_df)
# fails here: it includes 'Player' and names such as 'WS 1st' that do not exist in merged_data.csv.
target_cols = [c for c in comb.columns if c.endswith(("1st", "2nd"))]

kf = KFold(n_splits=4)
print(kf)

def split_val_set(y_var, kf):
    '''
    Hold out 20% of the data as a test set, then cut the remaining 80% into K folds.
    Fold i serves as the validation set when the other folds are used for training.
    '''
    X_train, X_test, y_train, y_test = train_test_split(comb[feature_cols], comb[y_var], test_size=0.20, random_state=seed)
    xset, yset = [], []
    for _, val_index in kf.split(X_train):
        xset.append(X_train.iloc[val_index])
        yset.append(y_train.iloc[val_index])
    return xset, yset, X_test, y_test

for y_name in target_cols:
    print(y_name)
    xset, yset, x_test, y_test = split_val_set(y_name, kf)

KFold(n_splits=4, random_state=None, shuffle=False)

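#### Note on the normalize parameter of sklearn's LinearRegression

From the scikit-learn documentation:

normalize : boolean, optional, default False

This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm. If you wish to standardize, please use sklearn.preprocessing.StandardScaler before calling fit on an estimator with normalize=False.

Note that normalize=True rescales by the l2-norm, which is not the same as z-score standardization. The parameter was deprecated in scikit-learn 1.0 and removed in 1.2, so on recent versions the features have to be scaled explicitly, for example with sklearn.preprocessing.StandardScaler.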
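The difference between the two scalings described in the quoted documentation can be checked by hand. The sketch below is an illustration in plain Python, not scikit-learn's implementation: `l2_normalize` and `standardize` are hypothetical helper names, and the five values are the Win Shares per 40 min(Col) entries from the rows displayed earlier.

```python
import math
import statistics


def l2_normalize(values):
    """Subtract the mean, then divide by the L2 norm of the centered values."""
    mean = statistics.mean(values)
    centered = [v - mean for v in values]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]


def standardize(values):
    """Subtract the mean, then divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# College win shares per 40 minutes for the five players shown earlier.
ws_per_40 = [0.283, 0.32, 0.221, 0.127, 0.303]

normalized = l2_normalize(ws_per_40)
standardized = standardize(ws_per_40)

# The two results differ only by a constant factor of sqrt(n).
ratio = standardized[0] / normalized[0]
print(ratio, math.sqrt(len(ws_per_40)))
```

Because the two scalings differ only by the factor sqrt(n), an unregularized least-squares fit makes the same predictions under either one; only the size of the fitted coefficients changes.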

In [9]:
lin_reg_sklearn = LinearRegression()  # normalize=True is not accepted by recent scikit-learn releases

x_col = "Win Shares per 40 min(Col)"  # column name as stored in merged_data.csv
target_cols = [c for c in comb.columns if c.endswith(("1st", "2nd"))]  # NBA stats only, 'Player' excluded

for y_name in target_cols:
    print(y_name)
    # hold out 20% of the data for testing; cross-validate on the remaining 80%
    X_train, X_test, y_train, y_test = train_test_split(
        comb[[x_col]].values, comb[y_name].values, test_size=0.20, random_state=seed)
    print(len(X_train), len(y_train))

    for train_index, val_index in kf.split(X_train):
        lin_reg_sklearn.fit(X_train[train_index], y_train[train_index])
        w_sklearn = np.zeros([2, 1])
        w_sklearn[0, 0] = lin_reg_sklearn.intercept_
        w_sklearn[1, 0] = lin_reg_sklearn.coef_[0]
        print(w_sklearn)
        print("validation R^2:", lin_reg_sklearn.score(X_train[val_index], y_train[val_index]))

    # training data together with the line fitted on the last fold
    plt.scatter(X_train[:, 0], y_train, s=20, c='r', marker='o', linewidths=1)
    x_line = np.linspace(X_train.min(), X_train.max(), 50).reshape(-1, 1)
    plt.plot(x_line[:, 0], lin_reg_sklearn.predict(x_line))
    plt.xlabel(x_col)
    plt.ylabel(y_name)
    plt.show()

In [87]:
comb.head()

Unnamed: 0,Player,PER 1st,PER 2nd,Win Shares 1st,Win Shares 2nd,Win Shares per 48 min 1st,Win Shares per 48 min 2nd,BPM 1st,BPM 2nd,VORP 1st,VORP 2nd,College Win Shares,College Win Shares per 40 min,College Strength of Schedule,Year Drafted
0,Allen Iverson,18.0,20.4,4.1,9.0,0.065,0.138,1.5,3.8,2.7,4.6,8.6,0.283,9.83,1996
1,Marcus Camby,17.8,15.9,3.7,0.9,0.095,0.022,-0.3,-0.7,0.8,0.7,8.1,0.32,8.92,1996
2,Shareef Abdur-Rahim,17.4,21.1,2.9,6.9,0.049,0.113,-2.0,1.2,0.0,2.3,5.4,0.221,6.44,1996
3,Stephon Marbury,16.1,16.3,3.7,5.3,0.077,0.082,-1.0,-0.6,0.6,1.1,4.3,0.127,12.71,1996
4,Ray Allen,14.6,16.2,4.9,7.0,0.092,0.102,0.3,1.8,1.5,3.2,8.3,0.303,8.19,1996


## Recipe to follow

- Preprocess the data (zero mean? unit variance?).
- Extract a subset of the training data and overfit it: L_train close to zero and L_val high, by manually selecting the hyperparameters.
- Add regularization and evaluate the generalization performance on the validation set.
- The gap between L_val and L_train should be minimized, and both should ideally be small.
- Use all the training data and cross-validation to estimate the parameters and hyperparameters.
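The last steps of the recipe can be sketched on toy data. This is a minimal illustration under stated assumptions, not the project's model: the data is synthetic (a noisy line, not the scraped stats), the model is a one-feature ridge regression solved in closed form, and all function names as well as the grid of regularization strengths are made up for the example.

```python
import random


def fit_ridge_1d(xs, ys, lam):
    """Closed-form 1-D ridge regression; the intercept is not penalized."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + lam)
    return slope, y_mean - slope * x_mean


def mse(xs, ys, slope, intercept):
    """Mean squared error of the line on the given points."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def k_fold_indices(n, k):
    """Split range(n) into k contiguous validation folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_loss(xs, ys, lam, k=4):
    """Average training and validation MSE over k folds."""
    train_losses, val_losses = [], []
    for val_idx in k_fold_indices(len(xs), k):
        val_set = set(val_idx)
        tr_idx = [i for i in range(len(xs)) if i not in val_set]
        x_tr, y_tr = [xs[i] for i in tr_idx], [ys[i] for i in tr_idx]
        x_val, y_val = [xs[i] for i in val_idx], [ys[i] for i in val_idx]
        slope, intercept = fit_ridge_1d(x_tr, y_tr, lam)
        train_losses.append(mse(x_tr, y_tr, slope, intercept))
        val_losses.append(mse(x_val, y_val, slope, intercept))
    return sum(train_losses) / k, sum(val_losses) / k


# Synthetic data (NOT the scraped stats): y = 2x + 1 plus Gaussian noise.
rng = random.Random(100)
xs = [rng.uniform(-1, 1) for _ in range(40)]
ys = [2 * x + 1 + rng.gauss(0, 0.3) for x in xs]

# Compare L_train and L_val for a few regularization strengths.
results = {lam: cross_val_loss(xs, ys, lam) for lam in [0.0, 0.1, 1.0, 10.0]}
best_lam = min(results, key=lambda lam: results[lam][1])
for lam, (l_train, l_val) in results.items():
    print(f"lambda={lam:<5} L_train={l_train:.4f} L_val={l_val:.4f}")
print("best lambda:", best_lam)
```

The regularization strength is picked by the lowest average validation loss; the training loss can only grow as the penalty increases, so it is the validation loss and its gap to the training loss that carry the information.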

    