# Python for Webscraping
* SOC 590: Big Data and Population Processes
* 17th October 2016

## Tutorial 3: Webscraping and pre-processing 

In [1]:
import os
import urllib.request
import webbrowser
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup

%matplotlib inline

In [2]:
url = 'http://www.pro-football-reference.com/years/2015/passing.htm'
webbrowser.open_new_tab(url)

True

In [3]:
# The url we will be scraping
url_2015 = "http://www.pro-football-reference.com/years/2015/passing.htm"

# get the html
html = urllib.request.urlopen(url_2015)

# create the BeautifulSoup object
soup = BeautifulSoup(html, "lxml")

## Scraping the Column Headers

The column headers we need for our `DataFrame` are found in the second row of column headers in the PFR table.  We will scrape those, drop the empty strings and the first remaining header, and later prepend a `Player_name` header for the player column.

In [4]:
# Extract the text of the first 30 header (th) cells
# and store it as a list
column_headers = [th.get_text() for th in soup.find_all('th', limit=30)]
# Drop the empty header strings
column_headers = [s for s in column_headers if len(s) != 0]
# Drop the first remaining header so that the list starts at 'Tm'
column_headers = column_headers[1:]
column_headers

['Tm',
 'Age',
 'Pos',
 'G',
 'GS',
 'QBrec',
 'Cmp',
 'Att',
 'Cmp%',
 'Yds',
 'TD',
 'TD%',
 'Int',
 'Int%',
 'Lng',
 'Y/A',
 'AY/A',
 'Y/C',
 'Y/G',
 'Rate',
 'QBR',
 'Sk',
 'Yds',
 'NY/A',
 'ANY/A',
 'Sk%',
 '4QC',
 'GWD']

In [5]:
len(column_headers)

28

## Scraping the Data

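We can extract the rows of data by selecting the table row (`tr`) elements.  One option is a [CSS selector](http://www.w3schools.com/cssref/css_selectors.asp) such as `"#draft tr"`, which selects the table row elements within the HTML element that has the id value `"draft"`.  The code below takes the simpler route of `soup.find_all("tr")`, which returns every table row on the page regardless of the enclosing element.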

A really helpful tool when it comes to finding CSS selectors is [SelectorGadget](http://selectorgadget.com/).  It's a web extension that lets you click on different elements of a web page and provides the CSS selector for those selected elements.
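
As a rough illustration of the selector idea, the sketch below parses a tiny hand-written page and compares a CSS selector with `find_all`.  The markup, the id `passing`, and the use of Python's built-in `html.parser` are assumptions made only for this example; they are not taken from the real PFR page.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page; the id "passing" is an assumption for this sketch
sample_html = """
<table id="passing">
  <tr><th>Rk</th><th>Player</th><th>Tm</th></tr>
  <tr><th>1</th><td>Philip Rivers</td><td>SDG</td></tr>
  <tr><th>2</th><td>Drew Brees</td><td>NOR</td></tr>
</table>
<table id="other">
  <tr><th>1</th><td>ignored</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# CSS selector: only the tr elements inside the element with id="passing"
selected_rows = sample_soup.select("#passing tr")

# find_all: every tr element in the document, whatever table it belongs to
all_rows = sample_soup.find_all("tr")

# Keep the td text of each selected row; the header row has no td cells
sample_data = [
    [td.get_text() for td in row.find_all("td")]
    for row in selected_rows
    if row.find_all("td")
]

print(len(selected_rows), len(all_rows))
print(sample_data)
```

The selector is the safer choice when a page holds several tables, while `find_all("tr")` is enough when the table of interest is the only one present.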

In [6]:
#soup.select("tr")[2]

In [7]:
# The data is found within the table rows (tr elements) of the page
# We skip the first row, which holds the column headers
table_rows = soup.find_all("tr")[1:]
table_rows

[<tr><th class="right " csk="1" data-stat="ranker" scope="row">1</th><td class="left " csk="Rivers,Philip" data-append-csv="RivePh00" data-stat="player"><a href="/players/R/RivePh00.htm">Philip Rivers</a></td><td class="left " data-stat="team"><a href="/teams/sdg/2015.htm" title="San Diego Chargers">SDG</a></td><td class="right " data-stat="age">34</td><td class="left " data-stat="pos">QB</td><td class="right " data-stat="g">16</td><td class="right " data-stat="gs">16</td><td class="right " csk="0.25000" data-stat="qb_rec">4-12-0</td><td class="right " data-stat="pass_cmp">437</td><td class="right " data-stat="pass_att">661</td><td class="right " data-stat="pass_cmp_perc">66.1</td><td class="right " data-stat="pass_yds">4792</td><td class="right " data-stat="pass_td">29</td><td class="right " data-stat="pass_td_perc">4.4</td><td class="right " data-stat="pass_int">13</td><td class="right " data-stat="pass_int_perc">2.0</td><td class="right " data-stat="pass_long">80</td><td class="righ

Note that `table_rows` is a list of tag elements.

In [8]:
type(table_rows)

list

In [9]:
table_rows[0] # take a look at the first row

<tr><th class="right " csk="1" data-stat="ranker" scope="row">1</th><td class="left " csk="Rivers,Philip" data-append-csv="RivePh00" data-stat="player"><a href="/players/R/RivePh00.htm">Philip Rivers</a></td><td class="left " data-stat="team"><a href="/teams/sdg/2015.htm" title="San Diego Chargers">SDG</a></td><td class="right " data-stat="age">34</td><td class="left " data-stat="pos">QB</td><td class="right " data-stat="g">16</td><td class="right " data-stat="gs">16</td><td class="right " csk="0.25000" data-stat="qb_rec">4-12-0</td><td class="right " data-stat="pass_cmp">437</td><td class="right " data-stat="pass_att">661</td><td class="right " data-stat="pass_cmp_perc">66.1</td><td class="right " data-stat="pass_yds">4792</td><td class="right " data-stat="pass_td">29</td><td class="right " data-stat="pass_td_perc">4.4</td><td class="right " data-stat="pass_int">13</td><td class="right " data-stat="pass_int_perc">2.0</td><td class="right " data-stat="pass_long">80</td><td class="right

The data we want for each player is found within the `td` (or table data) elements.

Below I've created a function that extracts the data we want from `table_rows`.  The comments should walk you through what each part of the function does.

In [10]:
def extract_player_data(table_rows):
    """
    Extract and return the text of the td elements within each table row.
    """
    # create the empty list to store the player data
    player_data = []

    for row in table_rows:  # for each row do the following

        # Get the text for each table data (td) element in the row
        player_list = [td.get_text() for td in row.find_all("td")]

        # Some table rows have no td elements; these are the column
        # headers repeated within the table. We skip over those rows
        # and continue the for loop
        if not player_list:
            continue

        # Now append the row's data to the list of data
        player_data.append(player_list)

    return player_data

Now we can create a `DataFrame` with the 2015 passing data.

In [11]:
# extract the data we want
data = extract_player_data(table_rows)
# and then store it in a DataFrame; the column names are assigned later
# (note: despite its name, df_2016 holds the 2015 season data)
df_2016 = pd.DataFrame(data)

In [12]:
data[1]

['Drew Brees',
 'NOR',
 '36',
 'QB',
 '15',
 '15',
 '7-8-0',
 '428',
 '627',
 '68.3',
 '4870',
 '32',
 '5.1',
 '11',
 '1.8',
 '80',
 '7.8',
 '8.0',
 '11.4',
 '324.7',
 '101.0',
 '75.47',
 '31',
 '235',
 '7.04',
 '7.26',
 '4.7',
 '1',
 '2']

In [13]:
df_2016.columns = ['Player_name']+column_headers

In [14]:
df_2016.head()

Unnamed: 0,Player_name,Tm,Age,Pos,G,GS,QBrec,Cmp,Att,Cmp%,...,Y/G,Rate,QBR,Sk,Yds,NY/A,ANY/A,Sk%,4QC,GWD
0,Philip Rivers,SDG,34,QB,16,16,4-12-0,437,661,66.1,...,299.5,93.8,59.44,40,264,6.46,6.45,5.7,1,2
1,Drew Brees,NOR,36,QB,15,15,7-8-0,428,627,68.3,...,324.7,101.0,75.47,31,235,7.04,7.26,4.7,1,2
2,Tom Brady*,NWE,38,QB,16,16,12-4-0,402,624,64.4,...,298.1,102.2,64.42,38,225,6.87,7.48,5.7,2,2
3,Eli Manning*,NYG,34,QB,16,16,6-10-0,387,618,62.6,...,277.0,93.6,60.46,27,157,6.63,6.74,4.2,1,2
4,Matt Ryan,ATL,30,QB,16,16,8-8-0,407,614,66.3,...,286.9,89.0,61.79,30,203,6.81,6.35,4.7,4,4


### Scraping the Data for All Seasons Since 1967

Scraping the data for all seasons since 1967 is essentially the same process as above, just repeated for each year using a `for` loop.

As we loop over the years, we will create a `DataFrame` for each season and append it to a list of `DataFrame`s that contains all the seasons.  We will also keep a separate list that stores any errors along with the url associated with each error.  This will let us know if there are any issues with our scraper, and which url is causing them.  As the combined output further below suggests, the passing tables do not have the same set of columns in every season, which is worth keeping in mind when the per-season `DataFrame`s are combined.

In [15]:
# Create an empty list that will contain all the dataframes
# (one dataframe for each season)
draft_dfs_list = []

# a list to store any errors that may come up while scraping
errors_list = []

In [16]:
# The url template that we pass the year into
url_template = "http://www.pro-football-reference.com/years/{year}/passing.htm"

# for each year from 1967 to (and including) 2016
for year in range(1967, 2017):

    # Use try/except block to catch and inspect any urls that cause an error
    try:
        # get the url for the current year
        url = url_template.format(year=year)

        # get the html
        html = urllib.request.urlopen(url)

        # create the BeautifulSoup object
        soup = BeautifulSoup(html, "lxml")

        # get the column headers, dropping empty strings and the first header
        column_headers = [th.get_text() for th in soup.find_all('th', limit=30)]
        column_headers = [s for s in column_headers if len(s) != 0]
        column_headers = column_headers[1:]

        # select all table rows except the first (header) row
        table_rows = soup.find_all("tr")[1:]

        # extract the player data from the table rows
        player_data = extract_player_data(table_rows)

        # create the dataframe for the current year; the column names
        # are assigned after all years have been combined
        year_df = pd.DataFrame(player_data)

        # add the year to the dataframe
        year_df.insert(0, "Year", year)
        print(year_df)

        # append the current dataframe to the list of dataframes
        draft_dfs_list.append(year_df)

    except Exception as e:
        # Store the url and the error it causes in a list
        error = [url, e]
        # then append it to the list of errors
        errors_list.append(error)

    Year                 0    1   2      3   4   5       6    7    8 ...  \
0   1967  Sonny Jurgensen*  WAS  33     QB  14  14   5-6-3  288  508 ...   
1   1967   Johnny Unitas*+  BAL  34     QB  14  14  11-1-2  255  436 ...   
2   1967        Norm Snead  PHI  28     QB  14  14   6-7-1  240  434 ...   
3   1967          Jim Hart  STL  23     QB  14  14   6-7-1  192  397 ...   
4   1967   Fran Tarkenton*  NYG  27     QB  14  14   7-7-0  204  377 ...   
5   1967    Roman Gabriel*  RAM  27     QB  14  14  11-1-2  196  371 ...   
6   1967       John Brodie  SFO  32     QB  14  10   5-5-0  168  349 ...   
7   1967     Randy Johnson  ATL  23     QB  14  12  1-10-1  142  288 ...   
8   1967        Frank Ryan  CLE  31     QB  13  13   9-4-0  136  280 ...   
9   1967          Kent Nix  PIT  23     QB  12   9   3-6-0  136  268 ...   
10  1967       Gary Cuozzo  NOR  26     QB  13  10   3-7-0  134  260 ...   
11  1967     Don Meredith*  DAL  29     QB  11  11   7-4-0  128  255 ...   
12  1967    

In [17]:
len(errors_list)

0

In [18]:
errors_list

[]

We don't get any errors, so that's good.
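
To see what the errors list would look like when something does go wrong, here is a minimal offline sketch of the same try/except pattern.  The helper `scrape_years`, the stand-in `fake_fetch`, and the `example.com` url are hypothetical names introduced for illustration; no network request is made.

```python
def scrape_years(years, fetch, url_template):
    """Fetch one page per year; collect results and [url, error] pairs."""
    results = {}
    errors = []
    for year in years:
        url = url_template.format(year=year)
        try:
            results[year] = fetch(url)
        except Exception as e:
            # Keep the url together with the error it caused
            errors.append([url, e])
    return results, errors


# Hypothetical stand-in for a network call, so the sketch runs offline
def fake_fetch(url):
    if "1968" in url:
        raise ValueError("simulated download failure")
    return "<html>page for " + url + "</html>"


template = "http://example.com/years/{year}/passing.htm"
pages, failed = scrape_years(range(1967, 1970), fake_fetch, template)

print(sorted(pages))
print([(u, type(e).__name__) for u, e in failed])
```

Because the failing url is stored next to its exception, a failed year can be inspected or retried without rerunning the whole loop.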

Now we can concatenate all the `DataFrame`s we scraped and create one large `DataFrame` containing all the seasons.

In [19]:
type(draft_dfs_list)

list

In [20]:
draft_dfs_list[0:1]

[    Year                 0    1   2      3   4   5       6    7    8 ...  \
 0   1967  Sonny Jurgensen*  WAS  33     QB  14  14   5-6-3  288  508 ...   
 1   1967   Johnny Unitas*+  BAL  34     QB  14  14  11-1-2  255  436 ...   
 2   1967        Norm Snead  PHI  28     QB  14  14   6-7-1  240  434 ...   
 3   1967          Jim Hart  STL  23     QB  14  14   6-7-1  192  397 ...   
 4   1967   Fran Tarkenton*  NYG  27     QB  14  14   7-7-0  204  377 ...   
 5   1967    Roman Gabriel*  RAM  27     QB  14  14  11-1-2  196  371 ...   
 6   1967       John Brodie  SFO  32     QB  14  10   5-5-0  168  349 ...   
 7   1967     Randy Johnson  ATL  23     QB  14  12  1-10-1  142  288 ...   
 8   1967        Frank Ryan  CLE  31     QB  13  13   9-4-0  136  280 ...   
 9   1967          Kent Nix  PIT  23     QB  12   9   3-6-0  136  268 ...   
 10  1967       Gary Cuozzo  NOR  26     QB  13  10   3-7-0  134  260 ...   
 11  1967     Don Meredith*  DAL  29     QB  11  11   7-4-0  128  255 ...   

In [21]:
column_headers.insert(0, "Player_name")
column_headers.insert(0, "Year")
print(column_headers)
print(len(column_headers))

['Year', 'Player_name', 'Tm', 'Age', 'Pos', 'G', 'GS', 'QBrec', 'Cmp', 'Att', 'Cmp%', 'Yds', 'TD', 'TD%', 'Int', 'Int%', 'Lng', 'Y/A', 'AY/A', 'Y/C', 'Y/G', 'Rate', 'QBR', 'Sk', 'Yds', 'NY/A', 'ANY/A', 'Sk%', '4QC', 'GWD']
30


In [22]:
# store all seasons in one DataFrame
# (each season keeps its own 0-based row index; ignore_index=True would renumber)
draft_df = pd.concat(draft_dfs_list, axis=0)
#draft_df = draft_df.iloc[:,0:-1]
draft_df.columns = column_headers

In [23]:
# Select the rows labeled 0: the first row of each season, since
# pd.concat kept each season's own index
draft_df.loc[0,:]

Unnamed: 0,Year,Player_name,Tm,Age,Pos,G,GS,QBrec,Cmp,Att,...,Y/G,Rate,QBR,Sk,Yds,NY/A,ANY/A,Sk%,4QC,GWD
0,1967,Sonny Jurgensen*,WAS,33,QB,14,14,5-6-3,288,508,...,267.6,87.3,,,,,,4.0,2.0,
0,1968,John Brodie,SFO,33,QB,14,14,7-6-1,234,404,...,215.7,78.0,,,,,,1.0,2.0,
0,1969,Sonny Jurgensen*+,WAS,35,QB,14,14,7-5-2,274,442,...,221.6,85.4,40.0,322.0,5.77,5.28,8.3,2.0,0.0,
0,1970,Roman Gabriel,RAM,30,QB,14,14,9-4-1,211,407,...,182.3,72.2,20.0,134.0,5.66,5.15,4.7,1.0,1.0,
0,1971,John Hadl,SDG,31,QB,14,14,6-8-0,233,431,...,219.6,68.9,16.0,145.0,6.55,4.98,3.6,2.0,3.0,
0,1972,Archie Manning,NOR,23,QB,14,14,2-11-1,230,448,...,198.6,64.6,43.0,347.0,4.96,3.77,8.8,1.0,1.0,
0,1973,Roman Gabriel*,PHI,33,QB,14,14,5-8-1,270,460,...,229.9,86.0,31.0,219.0,6.11,5.95,6.3,3.0,2.0,
0,1974,Jim Hart*,STL,30,QB,14,14,10-4-0,200,388,...,172.2,79.5,16.0,134.0,5.64,5.74,4.0,1.0,2.0,
0,1975,Fran Tarkenton*+,MIN,35,QB,14,14,12-2-0,273,425,...,213.9,91.8,27.0,245.0,6.08,5.89,6.0,1.0,1.0,
0,1976,Jim Zorn,SEA,23,QB,14,14,2-12-0,208,439,...,183.6,49.5,25.0,196.0,5.12,3.02,5.4,,,
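
The repeated index label `0` in the output above comes from `pd.concat` keeping the original index of each frame.  The sketch below shows the effect on two tiny made-up frames (the values are placeholders) and how `ignore_index=True` produces a fresh index instead.

```python
import pandas as pd

# Two tiny hypothetical season frames, each with its own 0-based index
season_a = pd.DataFrame({"Year": [1967, 1967], "Player": ["A", "B"]})
season_b = pd.DataFrame({"Year": [1968, 1968], "Player": ["C", "D"]})

kept = pd.concat([season_a, season_b], axis=0)
renumbered = pd.concat([season_a, season_b], axis=0, ignore_index=True)

# With the original labels, .loc[0] matches one row per season
print(kept.loc[0, "Player"].tolist())

# With a fresh index, .loc[0] is a single row
print(renumbered.loc[0, "Player"])
```

Which variant is preferable depends on the use: the kept labels record each player's position within a season, while a fresh index makes every row label unique.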


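We should take a closer look at the column headers, as some of them are repeated (`Yds` appears twice).  The output above also shows that the older seasons have fewer columns than the 2016 headers we assigned, so their values are shifted left under the wrong labels: the 1969 row, for example, shows `322.0` under `Sk` and `5.77` under `Yds`.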

In [24]:
# get the current column headers from the dataframe as a list
column_headers = draft_df.columns.tolist()

# No headers are renamed in this version of the notebook.
# Note that 'Yds' appears twice: once for passing yards and once for the
# yards lost to sacks (the 'Yds' that follows 'Sk').


# Take a look at the updated column headers
column_headers

['Year',
 'Player_name',
 'Tm',
 'Age',
 'Pos',
 'G',
 'GS',
 'QBrec',
 'Cmp',
 'Att',
 'Cmp%',
 'Yds',
 'TD',
 'TD%',
 'Int',
 'Int%',
 'Lng',
 'Y/A',
 'AY/A',
 'Y/C',
 'Y/G',
 'Rate',
 'QBR',
 'Sk',
 'Yds',
 'NY/A',
 'ANY/A',
 'Sk%',
 '4QC',
 'GWD']

In [25]:
# Now assign edited columns to the DataFrame
draft_df.columns = column_headers
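
Repeated headers such as the two `Yds` columns make column selection by name ambiguous.  One possible way to make them unique is sketched below; the helper `make_unique` and its numeric-suffix naming scheme are assumptions for illustration, not something the notebook itself applies.

```python
def make_unique(headers):
    """Return a copy of headers where repeats get a numeric suffix."""
    seen = {}
    unique = []
    for name in headers:
        count = seen.get(name, 0)
        # First occurrence keeps its name; later ones become name_2, name_3, ...
        unique.append(name if count == 0 else f"{name}_{count + 1}")
        seen[name] = count + 1
    return unique


sample_headers = ["Cmp", "Att", "Yds", "TD", "Sk", "Yds"]
print(make_unique(sample_headers))
```

A more descriptive manual rename (for example a sack-specific label for the second `Yds`) would read better; the suffix approach is only a generic fallback.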

Now that we fixed up the necessary columns, let's write out the raw data to a CSV file.

In [26]:
# create the output folders if they do not exist yet
os.makedirs('../data/raw_data', exist_ok=True)
os.makedirs('../data/clean_data', exist_ok=True)


In [27]:
draft_df.head()

Unnamed: 0,Year,Player_name,Tm,Age,Pos,G,GS,QBrec,Cmp,Att,...,Y/G,Rate,QBR,Sk,Yds,NY/A,ANY/A,Sk%,4QC,GWD
0,1967,Sonny Jurgensen*,WAS,33,QB,14,14,5-6-3,288,508,...,267.6,87.3,,,,,,4,2,
1,1967,Johnny Unitas*+,BAL,34,QB,14,14,11-1-2,255,436,...,244.9,83.6,,,,,,4,3,
2,1967,Norm Snead,PHI,28,QB,14,14,6-7-1,240,434,...,242.8,80.0,,,,,,0,1,
3,1967,Jim Hart,STL,23,QB,14,14,6-7-1,192,397,...,214.9,58.4,,,,,,2,2,
4,1967,Fran Tarkenton*,NYG,27,QB,14,14,7-7-0,204,377,...,220.6,85.9,,,,,,2,2,


In [28]:
# Write out the raw data to the raw_data folder in the data folder
#draft_df.to_csv("../data/raw_data/pfr_nfl_draft_data_RAW.csv", index=False)

# Cleaning the Data

Now that we have the raw draft data, we need to clean it up a bit in order to do some of the data exploration we want.  

## Create a Player ID/Links `DataFrame` 

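First let's create a separate `DataFrame` that contains the player names, their player page links, and the player ID on Pro-Football-Reference.  This way we can have a separate CSV file that just contains the necessary information to extract individual player data from Pro-Football-Reference sometime in the future.  Note that this requires a player link column (called `Player_NFL_Link` below), which the scraping code above does not collect, since `extract_player_data` keeps only the cell text.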

To extract the Pro-Football-Reference player ID from the player link, we will need to use a [regular expression](https://en.wikipedia.org/wiki/Regular_expression). Regular expressions are a sequence of characters used to match some pattern in a body of text. The regular expression that we can use to match the pattern of the player link and extract the ID is as follows:

    /.*/.*/(.*)\.
    
What the above regular expression essentially says is match the string with the following pattern:
- One `'/'`. 
- Followed by 0 or more characters (this is represented by the `'.*'` characters).
- Followed by another `'/'` (the 2nd `'/'` character).
- Followed by 0 or more characters (again the `'.*'` characters) .
- Followed by another (3rd) `'/'`.
- Followed by a grouping of 0 or more characters (the `'(.*)'` characters).
  - This is the key part of our regular expression. The `'()'` characters create a grouping around the characters we want to extract.  Since the player IDs are found between the 3rd `'/'` and the `'.'`, we use `'(.*)'` to extract all the characters found in that part of our string.
- Followed by a literal `'.'` character (written `'\.'` in the pattern) after the player ID.

We can extract the IDs by passing the above regular expression into the `pandas` `Series.str.extract` method.
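
The pattern described above can be tried on a couple of sample links before it is applied to a whole column.  In this sketch the first link is the `href` shown in the table rows earlier, and the second is a made-up link in the same style; the variable names are chosen only for this example.

```python
import re

import pandas as pd

# Raw string, so the backslash reaches the regex engine unchanged
pattern = r"/.*/.*/(.*)\."

# The second link is hypothetical and only mimics the format of the first
links = pd.Series(["/players/R/RivePh00.htm", "/players/D/DoeJo00.htm"])

# Plain re: group(1) is the text captured by the parentheses
match = re.search(pattern, links[0])
print(match.group(1))

# pandas: expand=False returns a Series when there is one capture group
ids = links.str.extract(pattern, expand=False)
print(ids.tolist())
```

Since `.*` is greedy, the pattern relies on each link containing exactly the slashes and the single dot seen here; a stricter pattern may be preferable for less regular links.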

In [29]:
# extract the player id from the player links
# expand=False returns the IDs as a pandas Series
# (this assumes draft_df has a Player_NFL_Link column holding each player's
# link; the scraping code above does not create that column)
player_ids = draft_df.Player_NFL_Link.str.extract(r"/.*/.*/(.*)\.",
                                                  expand=False)

In [None]:
# add a Player_ID column to our draft_df
draft_df["Player_ID"] = player_ids

In [None]:
# add the beginning of the pfr url to the player link column
pfr_url = "http://www.pro-football-reference.com"
draft_df.Player_NFL_Link =  pfr_url + draft_df.Player_NFL_Link

Now we can save a `DataFrame` just containing the player names, IDs, and links.
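
A minimal sketch of that step, under the assumption that the frame has `Player_name`, `Player_ID`, and `Player_NFL_Link` columns, could look as follows.  The sample row, the file name `player_ids.csv`, and the temporary output folder are placeholders chosen for this example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical miniature frame; the column names mirror the ones used above
players = pd.DataFrame({
    "Player_name": ["Philip Rivers"],
    "Player_ID": ["RivePh00"],
    "Player_NFL_Link": [
        "http://www.pro-football-reference.com/players/R/RivePh00.htm"
    ],
    "Cmp": [437],
})

# Keep only the identifying columns
player_ids_df = players.loc[:, ["Player_name", "Player_ID", "Player_NFL_Link"]]

# Write to a temporary folder here; a real run would pick its own path
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "player_ids.csv")
player_ids_df.to_csv(out_path, index=False)

print(pd.read_csv(out_path).columns.tolist())
```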

## Cleaning Up the Rest of the Draft Data

Now that we are done with the player IDs, let's get back to dealing with the rest of the data.

Let's first drop some unnecessary columns.

In [None]:
# drop the player link column by name; a positional slice such as
# draft_df.columns[-4:-1] would remove stat columns in this table
draft_df.drop(columns=["Player_NFL_Link"], inplace=True)

The main issue left with the rest of the draft data is converting everything to their proper data type.

In [None]:
draft_df.info()

From the above we can see that a lot of the player data isn't numeric when it should be.  To convert all the columns to their proper numeric type we can apply the `to_numeric` function to the whole `DataFrame`.  Because some of the columns (e.g. `Player_name`, `Tm`) cannot be converted into a numeric type (since they aren't numbers), we set the `errors` parameter to `"ignore"` so that those columns are returned unchanged instead of raising an error.

In [None]:
# convert the data to proper numeric types
draft_df = draft_df.apply(pd.to_numeric, errors="ignore")

In [None]:
draft_df.info()
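
Recent pandas releases deprecate `errors="ignore"` in `to_numeric`, so the cell above may warn or fail depending on the installed version.  A possible alternative is to try each column in turn and keep the original values when the conversion fails.  The helper name `to_numeric_where_possible` and the sample frame below are assumptions for illustration.

```python
import pandas as pd


def to_numeric_where_possible(df):
    """Convert each column to a numeric dtype if every value parses."""
    out = df.copy()
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            # Leave columns such as names or team codes unchanged
            pass
    return out


raw = pd.DataFrame({
    "Player_name": ["Philip Rivers", "Drew Brees"],
    "Cmp": ["437", "428"],
    "Cmp%": ["66.1", "68.3"],
})
converted = to_numeric_where_possible(raw)
print(converted.dtypes)
```

Catching the exception per column keeps the behaviour explicit: a column is either fully converted or left exactly as it was.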

We are not done yet. A lot of our numeric columns are missing data because players didn't accumulate any of those stats.  For example, some players didn't score a TD or even play a game.  Let's select the columns with numeric data and then replace the `NaN`s (the current value that represents the missing data) with 0s, as that is a more appropriate value.

In [None]:
# Get the column names for the numeric columns
num_cols = draft_df.columns[draft_df.dtypes != object]

# Replace all NaNs with 0
draft_df.loc[:, num_cols] = draft_df.loc[:, num_cols].fillna(0)

In [None]:
# Everything is filled, except for Player_ID, which is fine for now
draft_df.info()
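
The same fill can be written with `select_dtypes`, which picks out the numeric columns directly.  The frame below is a made-up miniature with placeholder values, used only to show that the text columns keep their missing entries.

```python
import numpy as np
import pandas as pd

stats = pd.DataFrame({
    "Player_name": ["A", "B"],
    "TD": [3.0, np.nan],
    "Int": [np.nan, 1.0],
    "Player_ID": [None, "xyz"],
})

# select_dtypes picks the numeric columns without comparing dtypes by hand
num_cols = stats.select_dtypes(include="number").columns

# Fill only the numeric columns; text columns keep their missing values
stats[num_cols] = stats[num_cols].fillna(0)
print(stats)
```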

We are finally done cleaning the data and now we can save it to a CSV file.

In [None]:
draft_df.to_csv("../data/clean_data/pfr_nfl_draft_data_CLEAN.csv", index=False)

# Exploring the NFL Draft

Now that we are done getting and cleaning the data we want, we can finally have some fun.  First let's just keep the data up to and including the 2010 season, as players from more recent seasons haven't played enough to accumulate a properly representative career [Approximate Value](http://www.sports-reference.com/blog/approximate-value-methodology/) (or cAV).

In [None]:
# get data for seasons from 1967 to 2010
draft_df_2010 = draft_df.loc[draft_df["Year"] <= 2010, :]

In [None]:
draft_df_2010.tail() # we see that the last season is 2010