## Web scraping for the draft table

The code below uses BeautifulSoup to scrape draft data from 1966 to 2017. Unlike the scraping for the player tables, this task needed only a few functions, so I did not write any separate Python scripts for it. All of the scraping is done in this notebook.

In [1]:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import re

In [2]:
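def SoupFromURL(url, suppressOutput=True):
    """Fetch url and return the parsed page, or None if the request fails."""
    if not suppressOutput:
        print(url)
    try:
        r = requests.get(url)
    except requests.RequestException:
        # network-level failure (DNS, connection, timeout, ...)
        return None

    return BeautifulSoup(r.text, "html5lib")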

### Scraping the NBA draft pages

The code below uses BeautifulSoup to pull every drafted player's information from a single draft page (2014 here).

In [3]:
soup = SoupFromURL('http://www.basketball-reference.com/draft/NBA_2014.html')
# the second <tr> holds the column names; drop the first header,
# which has no matching <td> cell in the data rows
column_headers = [th.getText() for th in soup.findAll('tr', limit=2)[1].findAll('th')]
del column_headers[0]

# every row after the two header rows; keep only the <td> cells
data_rows = soup.findAll('tr')[2:]
player_data = [[td.getText() for td in row.findAll('td')] for row in data_rows]

df = pd.DataFrame(player_data, columns=column_headers)
df.head()

Unnamed: 0,Pk,Tm,Player,College,Yrs,G,MP,PTS,TRB,AST,...,3P%,FT%,MP.1,PTS.1,TRB.1,AST.1,WS,WS/48,BPM,VORP
0,1,CLE,Andrew Wiggins,University of Kansas,3,245,8862,4995,993,523,...,0.329,0.76,36.2,20.4,4.1,2.1,10.3,0.056,-2.4,-0.8
1,2,MIL,Jabari Parker,Duke University,3,152,4874,2403,847,314,...,0.341,0.748,32.1,15.8,5.6,2.1,9.0,0.088,-1.3,0.9
2,3,PHI,Joel Embiid,University of Kansas,1,31,786,627,243,66,...,0.367,0.783,25.4,20.2,7.8,2.1,1.9,0.117,3.1,1.0
3,4,ORL,Aaron Gordon,University of Arizona,3,205,4958,1981,1081,311,...,0.289,0.699,24.2,9.7,5.3,1.5,10.1,0.098,-0.1,2.4
4,5,UTA,Dante Exum,,2,148,3045,805,263,309,...,0.308,0.743,20.6,5.4,1.8,2.1,1.1,0.017,-3.3,-1.0
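
To make the extraction step easier to follow without hitting the website, here is a minimal sketch on a hand-written table. The HTML snippet is hypothetical (it only imitates the layout the code above relies on: two header rows, a leading `<th>` in each data row, and a repeated header row in the body), and it uses the built-in `"html.parser"` rather than `"html5lib"` purely to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of a draft table (not copied from the real page)
html = """
<table>
  <tr><th></th><th colspan="2">Totals</th></tr>
  <tr><th>Rk</th><th>Pk</th><th>Player</th></tr>
  <tr><th>1</th><td>1</td><td>Andrew Wiggins</td></tr>
  <tr><th>Rk</th><th>Pk</th><th>Player</th></tr>
  <tr><th>2</th><td>2</td><td>Jabari Parker</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# The second <tr> holds the column names; drop the leading header
headers = [th.get_text() for th in soup.find_all("tr", limit=2)[1].find_all("th")][1:]

# Every later <tr> is treated as data, but only <td> cells are kept,
# so a repeated header row (all <th>) yields an empty list
rows = [[td.get_text() for td in tr.find_all("td")] for tr in soup.find_all("tr")[2:]]

print(headers)
print(rows)
```

The empty list in `rows` is what later shows up as a row of NoneType values once the lists are passed to `pd.DataFrame`.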


### Data Cleaning

Now that we have all the data, we can conduct some data cleaning. Here are some things we need to do with this data:

- Get rid of a couple of rows (repeated header rows) that contain only NoneType values
- Rename some of the columns
- Change the columns to proper data types
- Deal with some more missing values
- Add a column for draft year

#### Null picks
Now let's find the rows containing NoneType values using pandas boolean indexing. Calling the isnull() method (which returns True for a NoneType or NaN) on the 'Pk' column selects them. If the 'Pk' value is missing, that row holds no draft pick, so we can drop it; the cell below does this by keeping only the rows where 'Player' is not null.

In [4]:
# Finding the None rows (not displayed: a cell only shows its last expression)
df[df['Pk'].isnull()]
# Keep only the rows that have a player
df = df[df.Player.notnull()]

Now there are no more rows with a missing pick:

In [5]:
df[df['Pk'].isnull()]

Unnamed: 0,Pk,Tm,Player,College,Yrs,G,MP,PTS,TRB,AST,...,3P%,FT%,MP.1,PTS.1,TRB.1,AST.1,WS,WS/48,BPM,VORP
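
The same filtering can be checked on a toy frame. This is only a sketch: the three-row frame is made up, and the `reset_index` call is an optional extra that the notebook itself does not use.

```python
import pandas as pd

# Toy frame: the middle row mimics a repeated header row (all None)
toy = pd.DataFrame(
    [["1", "Andrew Wiggins"], [None, None], ["2", "Jabari Parker"]],
    columns=["Pk", "Player"],
)

mask = toy["Player"].notnull()              # True where a player name is present
cleaned = toy[mask].reset_index(drop=True)  # optional: renumber the kept rows

print(toy["Pk"].isnull().sum())      # rows without a pick before filtering
print(cleaned["Pk"].isnull().sum())  # rows without a pick after filtering
```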


#### Renaming Columns

We should rename some of the columns: names containing '%' or '/' are not valid Python identifiers, so they cannot be used with attribute access such as df.Player.

In [6]:
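df.rename(columns={'WS/48':'WS_per_48'}, inplace=True)
df.columns = df.columns.str.replace('%', '_Perc')

# We also need to differentiate between per-game stats and cumulative career totals.
# NOTE: the per-game block (MP, PTS, TRB, AST) sits at positions 13-16, so the slice
# 14:18 leaves the second 'MP' unrenamed and tags 'WS' (a career total) as per-game.
df.columns.values[14:18] = [name + "_per_G" for name in df.columns.values[14:18]]
print(df.columns)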

Index(['Pk', 'Tm', 'Player', 'College', 'Yrs', 'G', 'MP', 'PTS', 'TRB', 'AST',
       'FG_Perc', '3P_Perc', 'FT_Perc', 'MP', 'PTS_per_G', 'TRB_per_G',
       'AST_per_G', 'WS_per_G', 'WS_per_48', 'BPM', 'VORP'],
      dtype='object')
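
The Index printed above still contains two 'MP' columns and a 'WS_per_G', because the renaming is done by position. One way to avoid counting positions is to rename by occurrence instead. The sketch below is an alternative, not what the notebook runs: `label_per_game` is a hypothetical helper that appends the suffix to the second occurrence of any repeated name, on the assumption that the repeated names are exactly the per-game columns.

```python
from collections import Counter

def label_per_game(names, suffix="_per_G"):
    """Append `suffix` to the second occurrence of any repeated column name."""
    seen = Counter()
    out = []
    for name in names:
        seen[name] += 1
        out.append(name + suffix if seen[name] > 1 else name)
    return out

# Column names as scraped, before any renaming
raw = ['Pk', 'Tm', 'Player', 'College', 'Yrs', 'G', 'MP', 'PTS', 'TRB', 'AST',
       'FG%', '3P%', 'FT%', 'MP', 'PTS', 'TRB', 'AST', 'WS', 'WS/48', 'BPM', 'VORP']

renamed = label_per_game(raw)
print(renamed[13:18])
```

On a DataFrame this would be applied as `df.columns = label_per_game(df.columns)`, which leaves 'WS' untouched and makes every column name unique.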


#### Change Data to Proper Data Type

Every column is currently of object type, so we need to convert them to proper types.

In [7]:
df.dtypes

Pk           object
Tm           object
Player       object
College      object
Yrs          object
G            object
MP           object
PTS          object
TRB          object
AST          object
FG_Perc      object
3P_Perc      object
FT_Perc      object
MP           object
PTS_per_G    object
TRB_per_G    object
AST_per_G    object
WS_per_G     object
WS_per_48    object
BPM          object
VORP         object
dtype: object

In [8]:
# convert_objects was deprecated and later removed (pandas 1.0);
# on newer versions, convert each column with pd.to_numeric instead
df = df.convert_objects(convert_numeric=True)
df = df.fillna(0)  # replace the remaining NaNs with 0
df.loc[:,'Yrs':'AST'] = df.loc[:,'Yrs':'AST'].astype(int)
df.head()



Unnamed: 0,Pk,Tm,Player,College,Yrs,G,MP,PTS,TRB,AST,...,3P_Perc,FT_Perc,MP.1,PTS_per_G,TRB_per_G,AST_per_G,WS_per_G,WS_per_48,BPM,VORP
0,1,CLE,Andrew Wiggins,University of Kansas,3,245,8862,4995,993,523,...,0.329,0.76,8862,20.4,4.1,2.1,10.3,0.056,-2.4,-0.8
1,2,MIL,Jabari Parker,Duke University,3,152,4874,2403,847,314,...,0.341,0.748,4874,15.8,5.6,2.1,9.0,0.088,-1.3,0.9
2,3,PHI,Joel Embiid,University of Kansas,1,31,786,627,243,66,...,0.367,0.783,786,20.2,7.8,2.1,1.9,0.117,3.1,1.0
3,4,ORL,Aaron Gordon,University of Arizona,3,205,4958,1981,1081,311,...,0.289,0.699,4958,9.7,5.3,1.5,10.1,0.098,-0.1,2.4
4,5,UTA,Dante Exum,,2,148,3045,805,263,309,...,0.308,0.743,3045,5.4,1.8,2.1,1.1,0.017,-3.3,-1.0
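
Since `convert_objects` is no longer available in recent pandas releases, here is a minimal sketch of one possible replacement built on `pd.to_numeric`. The helper name `to_numeric_where_possible` and the toy frame are assumptions for illustration; the idea is simply to convert a column when every value parses as a number and to leave text columns alone.

```python
import pandas as pd

def to_numeric_where_possible(col):
    """Return a numeric version of `col`, or `col` unchanged if it holds text."""
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

# Toy frame of scraped strings; "" stands for a cell with no value
toy = pd.DataFrame({
    "Pk": ["1", "2", "3"],
    "Player": ["Andrew Wiggins", "Jabari Parker", "Joel Embiid"],
    "Yrs": ["3", "", "1"],
    "FT%": [".760", ".748", ""],
})

converted = toy.apply(to_numeric_where_possible)

# Fill missing values in the numeric columns only, then cast counts to int
num_cols = converted.select_dtypes("number").columns
converted[num_cols] = converted[num_cols].fillna(0)
converted["Yrs"] = converted["Yrs"].astype(int)

print(converted.dtypes)
```

Filling only the numeric columns keeps text columns such as Player out of the `fillna(0)` step.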


In [9]:
df.dtypes

Pk             int64
Tm            object
Player        object
College       object
Yrs            int64
G              int64
MP             int64
PTS            int64
TRB            int64
AST            int64
FG_Perc      float64
3P_Perc      float64
FT_Perc      float64
MP             int64
PTS_per_G    float64
TRB_per_G    float64
AST_per_G    float64
WS_per_G     float64
WS_per_48    float64
BPM          float64
VORP         float64
dtype: object

#### Adding year column

Finally, let's add a Draft_Yr column to indicate the draft class year.

In [10]:
df.insert(0, 'Draft_Yr', 2014)
df.head()

Unnamed: 0,Draft_Yr,Pk,Tm,Player,College,Yrs,G,MP,PTS,TRB,...,3P_Perc,FT_Perc,MP.1,PTS_per_G,TRB_per_G,AST_per_G,WS_per_G,WS_per_48,BPM,VORP
0,2014,1,CLE,Andrew Wiggins,University of Kansas,3,245,8862,4995,993,...,0.329,0.76,8862,20.4,4.1,2.1,10.3,0.056,-2.4,-0.8
1,2014,2,MIL,Jabari Parker,Duke University,3,152,4874,2403,847,...,0.341,0.748,4874,15.8,5.6,2.1,9.0,0.088,-1.3,0.9
2,2014,3,PHI,Joel Embiid,University of Kansas,1,31,786,627,243,...,0.367,0.783,786,20.2,7.8,2.1,1.9,0.117,3.1,1.0
3,2014,4,ORL,Aaron Gordon,University of Arizona,3,205,4958,1981,1081,...,0.289,0.699,4958,9.7,5.3,1.5,10.1,0.098,-0.1,2.4
4,2014,5,UTA,Dante Exum,,2,148,3045,805,263,...,0.308,0.743,3045,5.4,1.8,2.1,1.1,0.017,-3.3,-1.0


#### Putting it all together!

Now let's combine all of these steps to build the full draft table.

In [11]:
url_template = "http://www.basketball-reference.com/draft/NBA_{year}.html"
yearly_frames = []  # one DataFrame per draft year

# NOTE: range() excludes its end value, so this loop stops at the 2016 draft;
# use range(1966, 2018) to include 2017
for year in range(1966, 2017):  # for each year
    url = url_template.format(year=year)  # get the url

    soup = SoupFromURL(url)  # create our BS object
    if soup is None:  # the request failed, so skip this year
        print("skipped", year)
        continue

    # column headers (drop the first header, which has no matching <td> cell)
    column_headers = [th.getText() for th in soup.findAll('tr', limit=2)[1].findAll('th')]
    del column_headers[0]

    # get our player data
    data_rows = soup.findAll('tr')[2:]
    player_data = [[td.getText() for td in row.findAll('td')] for row in data_rows]

    # Turn yearly data into a DataFrame
    year_df = pd.DataFrame(player_data, columns=column_headers)
    # create and insert the Draft_Yr column
    year_df.insert(0, 'Draft_Yr', year)

    yearly_frames.append(year_df)

# Combine the yearly frames into the big DataFrame
# (DataFrame.append was removed in pandas 2.0, so use pd.concat)
draft_df = pd.concat(yearly_frames, ignore_index=True)

# Convert data to proper data types
# (convert_objects was removed in pandas 1.0; newer versions need pd.to_numeric per column)
draft_df = draft_df.convert_objects(convert_numeric=True)

# Get rid of the rows full of null values
draft_df = draft_df[draft_df.Player.notnull()]

# Replace NaNs with 0s
draft_df = draft_df.fillna(0)

# Rename columns
draft_df.rename(columns={'WS/48':'WS_per_48'}, inplace=True)
# Change % symbol
draft_df.columns = draft_df.columns.str.replace('%', '_Perc')
# Add per_G to per-game stats
# NOTE: with Draft_Yr at position 0 the per-game block sits at 14:18, so 15:19
# leaves the second 'MP' unrenamed and tags 'WS' (a career total) as per-game
draft_df.columns.values[15:19] = [name + "_per_G" for name in draft_df.columns.values[15:19]]

# Changing the data types to int
draft_df.loc[:,'Yrs':'AST'] = draft_df.loc[:,'Yrs':'AST'].astype(int)



In [12]:
draft_df.shape

(5988, 22)

In [13]:
draft_df.dtypes

Draft_Yr       int64
Pk           float64
Tm            object
Player        object
College       object
Yrs            int64
G              int64
MP             int64
PTS            int64
TRB            int64
AST            int64
FG_Perc      float64
3P_Perc      float64
FT_Perc      float64
MP             int64
PTS_per_G    float64
TRB_per_G    float64
AST_per_G    float64
WS_per_G     float64
WS_per_48    float64
BPM          float64
VORP         float64
dtype: object

In [14]:
draft_df['Pk'] = draft_df['Pk'].astype(int) # change Pk to int
draft_df.dtypes

Draft_Yr       int64
Pk             int64
Tm            object
Player        object
College       object
Yrs            int64
G              int64
MP             int64
PTS            int64
TRB            int64
AST            int64
FG_Perc      float64
3P_Perc      float64
FT_Perc      float64
MP             int64
PTS_per_G    float64
TRB_per_G    float64
AST_per_G    float64
WS_per_G     float64
WS_per_48    float64
BPM          float64
VORP         float64
dtype: object

In [15]:
draft_df.isnull().sum() # No missing values in our DataFrame

Draft_Yr     0
Pk           0
Tm           0
Player       0
College      0
Yrs          0
G            0
MP           0
PTS          0
TRB          0
AST          0
FG_Perc      0
3P_Perc      0
FT_Perc      0
MP           0
PTS_per_G    0
TRB_per_G    0
AST_per_G    0
WS_per_G     0
WS_per_48    0
BPM          0
VORP         0
dtype: int64

In [16]:
draft_df.to_csv("scraped_data/draft_tables/draft_data_1966_to_2017.csv",index=False)