# WEBSCRAPING CAVEATS

The code contained in this brief tutorial is intended for illustration purposes only. It was not implemented with code design or optimization in mind. It merely shows how to scrape data from various sources using Beautiful Soup.

In [1]:
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2
from bs4 import BeautifulSoup
import pandas as pd
import unicodedata

# DATA SOURCE #1 - Sports Stats

The website [sports-reference.com](http://www.sports-reference.com/) includes statistics for the MLB, NBA, NFL, NHL, CFB, CBB, and the Olympics. As an example, this tutorial shows how to scrape Chicago Blackhawks player statistics for the 2015-2016 season, located [here](http://www.hockey-reference.com/teams/CHI/2016.html), and convert them to a pandas DataFrame.
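The row-by-row extraction pattern used in this section can be tried offline before touching the live page. The sketch below applies the same idea (header cells from one `tr`, data cells from the remaining rows) to a small inline HTML string. The sample markup and the `table_to_rows` helper are made up for illustration and do not reproduce the real page's structure. It uses Python's built-in `html.parser`, so `lxml` is not needed here.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, loosely shaped like a stats table.
SAMPLE_HTML = """
<table class="stats_table">
  <tr><th>Rk</th><th>Player</th><th>GP</th><th>G</th></tr>
  <tr><th>1</th><td>Patrick Kane</td><td>82</td><td>46</td></tr>
  <tr><th>2</th><td>Artemi Panarin</td><td>80</td><td>30</td></tr>
</table>
"""


def table_to_rows(table):
    """Return (headers, rows) for a parsed <table> element."""
    all_rows = table.find_all("tr")
    # Drop the first header cell ('Rk'). In this sample the rank cells are
    # <th> elements, so they never show up in the <td> lists below.
    headers = [th.get_text(strip=True) for th in all_rows[0].find_all("th")][1:]
    rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
            for tr in all_rows[1:]]
    return headers, rows


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
headers, rows = table_to_rows(soup.find("table", class_="stats_table"))
print(headers)
print(rows)
```

If the number of headers and the number of cells per row disagree, `pd.DataFrame(rows, columns=headers)` raises an error, which is a quick signal that the header slice or the row offset needs adjusting for the page at hand.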

### GET DATA

In [2]:
# GET DATA
url = "http://www.hockey-reference.com/teams/CHI/2016.html"
html = urlopen(url)

# PRE-PROCESS
soup = BeautifulSoup(html, "lxml")
tables = soup.find_all("table", class_="sortable stats_table")
scoring = tables[2]  # subset to just scoring table
column_headers = [th.get_text() for th in
                  scoring.find_all('tr')[1].find_all('th')][1:]  # drop 'Rk'
data_rows = scoring.find_all('tr')[2:]  # skip the first 2 header rows
player_data = [[td.get_text() for td in row.find_all('td')]
               for row in data_rows]

# CREATE DF
df = pd.DataFrame(player_data, columns=column_headers)
df.head()

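|   | Player | Age | Pos | GP | G | A | PTS | +/- | PIM | EV | ... | TOI | ATOI | OPS | DPS | PS | BLK | HIT | FOW | FOL | FO% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Patrick Kane | 27 | RW | 82 | 46 | 60 | 106 | 17 | 30 | 29 | ... | 1674 | 20:25 | 12.4 | 2.6 | 15.0 | 21 | 37 | 11 | 40 | 21.6 |
| 1 | Artemi Panarin | 24 | LW | 80 | 30 | 47 | 77 | 8 | 32 | 22 | ... | 1482 | 18:31 | 7.9 | 2.0 | 9.8 | 14 | 46 | 0 | 0 |  |
| 2 | Jonathan Toews | 27 | C | 80 | 28 | 30 | 58 | 16 | 62 | 18 | ... | 1540 | 19:15 | 5.4 | 2.4 | 7.7 | 38 | 81 | 921 | 652 | 58.6 |
| 3 | Brent Seabrook | 30 | D | 81 | 14 | 35 | 49 | 6 | 32 | 8 | ... | 1849 | 22:49 | 4.8 | 4.4 | 9.2 | 150 | 121 | 0 | 0 |  |
| 4 | Duncan Keith | 32 | D | 67 | 9 | 34 | 43 | 13 | 26 | 5 | ... | 1691 | 25:14 | 3.7 | 4.7 | 8.4 | 116 | 16 | 0 | 1 | 0.0 |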
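Every value scraped this way arrives as a string, so arithmetic on the columns needs a conversion step first. A minimal sketch, assuming a handful of columns copied from the output above; the `atoi_to_minutes` helper and the `ATOI_min` column are names introduced here for illustration.

```python
import pandas as pd

# A few columns copied from the output above; scraped values are all strings.
sample = pd.DataFrame({
    "Player": ["Patrick Kane", "Artemi Panarin", "Jonathan Toews"],
    "GP": ["82", "80", "80"],
    "PTS": ["106", "77", "58"],
    "ATOI": ["20:25", "18:31", "19:15"],
    "FO%": ["21.6", "", "58.6"],
})


def atoi_to_minutes(value):
    """Convert an 'MM:SS' string to minutes as a float (hypothetical helper)."""
    minutes, seconds = value.split(":")
    return int(minutes) + int(seconds) / 60.0


for col in ["GP", "PTS", "FO%"]:
    # errors="coerce" turns empty or malformed strings into NaN.
    sample[col] = pd.to_numeric(sample[col], errors="coerce")
sample["ATOI_min"] = sample["ATOI"].apply(atoi_to_minutes)

print(sample.dtypes)
```

Using `errors="coerce"` is a deliberate choice for cells such as the empty `FO%` entries: they become `NaN` instead of raising an exception.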


# DATA SOURCE #2 - Beer

The website [http://hbd.org/ensmingr/](http://hbd.org/ensmingr/) gives the percent alcohol, number of calories, specific gravity before (OG) and after (FG) fermentation, and apparent attenuation for ~1000 commercial beers from ~100 breweries, as of 1999. As an example, this tutorial shows how to scrape the data for Grant's Brewery, located [here](http://hbd.org/ensmingr/grants.html), and convert it to a pandas DataFrame.

In [3]:
# GET DATA
url2 = "http://hbd.org/ensmingr/grants.html"
html2 = urlopen(url2)

# PRE-PROCESS
soup2 = BeautifulSoup(html2, 'lxml')
column_headers2 = ['Company and Product', 'Location', 'Percent_Alcohol', 'Calories_Per_12', 'OG', 'FG', 'AA', 'Notes']
data_rows2 = soup2.find_all('tr')[2:]  # skip the first 2 header rows
beer_data = [[td.get_text() for td in row.find_all('td')]
             for row in data_rows2]
# Normalize to NFKD and drop non-ASCII characters; keep the first 8 cells of each row.
# The trailing decode keeps the cells as text (not bytes) on Python 3.
beer_data = [[unicodedata.normalize('NFKD', cell).encode('ascii', 'ignore').decode('ascii')
              for cell in row[:8]]
             for row in beer_data]

# CREATE DF
df2 = pd.DataFrame(beer_data, columns=column_headers2).replace({'\r\n': ''}, regex=True)
df2.head()

|   | Company and Product | Location | Percent_Alcohol | Calories_Per_12 | OG | FG | AA | Notes |
|---|---|---|---|---|---|---|---|---|
| 0 | Scottish Ale |  | 4.7 | 158 | 12.5 | 3.1 | 0.752 | OG, FG in Plato |
| 1 | India Pale Ale |  | 4.2 | 143 | 11.5 | 2.8 | 0.757 |  |
| 2 | Glorious Golden Ale |  | 4.6 | 150 | 12.0 | 2.7 | 0.775 |  |
| 3 | Amber Ale |  | 5.5 | 180 | 13.0 | 3.5 | 0.731 |  |
| 4 | HefeWeizen |  | 4.2 | 137 | 11.5 | 2.3 | 0.8 |  |
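The pre-processing step above relies on `unicodedata.normalize('NFKD', ...)` followed by an ASCII encode that ignores errors. The sketch below isolates that step to show what it does to individual strings. The `to_ascii` helper and the sample strings are made up for illustration and are not taken from the scraped page.

```python
import unicodedata


def to_ascii(text):
    """Decompose characters with NFKD, then drop anything outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    # encode(..., "ignore") discards non-ASCII code points; decode returns text again.
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Made-up strings: an accented letter plus a non-breaking space (U+00A0),
# an umlaut, and a character with no decomposition (U+00DF).
print(to_ascii(u"Caf\u00e9\u00a0Ale"))   # Cafe Ale
print(to_ascii(u"K\u00f6lsch"))          # Kolsch
print(to_ascii(u"Wei\u00dfbier"))        # Weibier
```

Accented letters keep their base letter and non-breaking spaces become ordinary spaces, but characters with no decomposition are dropped outright, as the last line shows. That loss is acceptable for a quick cleanup and worth keeping in mind when names matter.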


### WITH PANDAS (much easier)

In [4]:
grants = pd.read_html(url2, header=0)[0]
grants.head()

|   | Company and Product | Location | % Alcohol (v/v) | Calories/12 oz | OG | FG | AA | Notes |
|---|---|---|---|---|---|---|---|---|
| 0 | Grant's Brewery | Yakima, WA |  |  |  |  |  |  |
| 1 | Scottish Ale |  | 4.7 | 158.0 | 12.5 | 3.1 | 0.752 | OG, FG in Plato |
| 2 | India Pale Ale |  | 4.2 | 143.0 | 11.5 | 2.8 | 0.757 |  |
| 3 | Glorious Golden Ale |  | 4.6 | 150.0 | 12.0 | 2.7 | 0.775 |  |
| 4 | Amber Ale |  | 5.5 | 180.0 | 13.0 | 3.5 | 0.731 |  |
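In the `read_html` output above, the first row holds the brewery name and location, not a beer. One way to tidy this is to carry the brewery details down to the product rows and then drop the brewery row. The sketch assumes that a row with a non-empty `Location` is always a brewery header, which may not hold for every page on the site. The sample frame is typed in from the output above, and `Brewery` is a column name introduced here.

```python
import numpy as np
import pandas as pd

# Rows typed in from the read_html output above (a subset of columns).
grants_sample = pd.DataFrame({
    "Company and Product": ["Grant's Brewery", "Scottish Ale", "India Pale Ale"],
    "Location": ["Yakima, WA", np.nan, np.nan],
    "% Alcohol (v/v)": [np.nan, 4.7, 4.2],
    "OG": [np.nan, 12.5, 11.5],
})

# Assumption: a row with a Location is a brewery header, not a beer.
is_brewery = grants_sample["Location"].notna()

tidy = grants_sample.copy()
# Keep the brewery name on header rows only, then forward-fill it and the location.
tidy["Brewery"] = tidy["Company and Product"].where(is_brewery)
tidy[["Brewery", "Location"]] = tidy[["Brewery", "Location"]].ffill()
# Keep only the product rows and renumber the index.
tidy = tidy[~is_brewery].reset_index(drop=True)

print(tidy)
```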


# DATA SOURCE #3 - Wine

The website [http://www.wineinstitute.org/resources/pressroom/07082016](http://www.wineinstitute.org/resources/pressroom/07082016) gives wine sales in the U.S. and California. As an example, this tutorial shows how to scrape wine sales data (in millions of 9-liter cases, from California, other states, and foreign producers entering U.S. distribution) and convert it to a pandas DataFrame.

In [5]:
# GET DATA
url3 = "http://www.wineinstitute.org/resources/pressroom/07082016"
html3 = urlopen(url3)

# PRE-PROCESS
soup3 = BeautifulSoup(html3, "lxml")
tables3 = soup3.find_all("table")
sales = tables3[1]  # subset to just wine sales table
column_headers3 = ['Year', 'Table', 'Dessert', 'Sparkling', 'Total_Wine', 'Total_Retail_Value']
data_rows3 = sales.find_all('tr')[2:]  # skip the first 2 header rows
wine_data = [[td.get_text() for td in row.find_all('td')]
             for row in data_rows3]

# CREATE DF
df3 = pd.DataFrame(wine_data, columns=column_headers3)
df3.head(10)

|   | Year | Table | Dessert | Sparkling | Total_Wine | Total_Retail_Value |
|---|---|---|---|---|---|---|
| 0 | 2014 | 324.0 | 32.7 | 19.7 | 376.4 | $53.1 billion |
| 1 | 2013 | 323.5 | 30.9 | 18.4 | 372.8 | $51.3 billion |
| 2 | 2012 | 319.4 | 29.9 | 17.6 | 366.9 | $50.8 billion |
| 3 | 2011 | 306.2 | 31.5 | 17.2 | 354.9 | $48.6 billion |
| 4 | 2010 | 292.1 | 28.8 | 15.4 | 336.3 | $46.5 billion |
| 5 | 2009 | 295.3 | 26.6 | 13.9 | 335.8 | $45.2 billion |
| 6 | 2008 | 279.7 | 27.4 | 13.5 | 320.6 | $45.0 billion |
| 7 | 2007 | 276.9 | 26.3 | 13.9 | 317.1 | $43.5 billion |
| 8 | 2006 | 266.0 | 27.4 | 13.6 | 304.3 | $41.5 billion |
| 9 | 2005 | 255.4 | 22.5 | 13.1 | 290.9 | $38.5 billion |
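As with the other scraped tables, these cells are strings, and `Total_Retail_Value` also carries a currency symbol and a unit. A minimal sketch of one way to turn them into numbers, using three rows copied from the output above; the `Retail_Value_Billion_USD` column name is introduced here for illustration.

```python
import pandas as pd

# Rows copied from the output above; every scraped cell is a string.
wine_sample = pd.DataFrame({
    "Year": ["2014", "2013", "2012"],
    "Total_Wine": ["376.4", "372.8", "366.9"],
    "Total_Retail_Value": ["$53.1 billion", "$51.3 billion", "$50.8 billion"],
})

wine_sample["Year"] = wine_sample["Year"].astype(int)
wine_sample["Total_Wine"] = pd.to_numeric(wine_sample["Total_Wine"])
# Pull out the number between '$' and 'billion' and store it as a float.
wine_sample["Retail_Value_Billion_USD"] = (
    wine_sample["Total_Retail_Value"]
    .str.extract(r"\$([\d.]+)\s*billion", expand=False)
    .astype(float)
)

print(wine_sample.dtypes)
```

The regular expression assumes every value follows the `$<number> billion` pattern seen in the output. A cell in another form would come back as `NaN`, so it is worth checking for missing values after the conversion.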
