### We're going to cover the basics of scraping a data table from a webpage


#### Function to parse tables from HTML source code into a Python list

First, run the following script. It defines a function called "table_reader" that receives a webpage's HTML source code, finds a table in it, and saves the cell contents of each row to a list.

The result of the function is a list of lists, with each inner list containing the entries from one row of the table.



In [1]:
#import the HTML parser, which builds a tree of tags and their contents
import lxml.html as ET

#Let's make a function that reads tables and gets the useful information
#source_code is the HTML source code for the page
#table_number is which table we should parse if there are multiple tables on the page.
#The default value for table_number is 0, meaning retrieve the first table
def table_reader(source_code, table_number=0):

    #send the page HTML to the HTML parser
    doc = ET.fromstring(source_code)

    #make an empty list to save our table into
    data = []

    #find every <table> on the page and pick the one at position table_number,
    #then collect all of its row elements, which are the <tr> tags
    #(".//tr" also matches rows wrapped in <thead> or <tbody>, which a bare "tr" would miss)
    rows = doc.xpath("//table")[table_number].findall(".//tr")

    #go through the list of table rows
    for row in rows:
        #append to our data the text of every cell in the row
        data.append([c.text_content() for c in row])

    #return the data list
    return data
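
To see the row-by-row idea on its own, here is a minimal sketch that builds the same kind of list of lists using only Python's built-in html.parser module instead of lxml. It is an illustration, not the table_reader function above: the class name TableRows and the two-row sample table are made up for this example, and it assumes Python 3 and a page with a single, non-nested table.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows (a list of lists)
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # only keep text that appears inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# hypothetical sample page, small enough to check by eye
sample_html = """
<table>
  <tr><th>State</th><th>City</th></tr>
  <tr><td>Texas</td><td>Austin</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample_html)
print(parser.rows)
```

Each inner list holds the cell text of one row, which is the same shape that table_reader returns.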

#### Parsing a table from a single webpage

Let's try it out on a simple webpage.

Here is a list of crime rates for major cities across the U.S., from Wikipedia.

We'll send the request for the page using the requests library.

In [2]:
import requests

#store the address of the page in a variable called url
url = "https://en.wikipedia.org/wiki/List_of_United_States_cities_by_crime_rate_(2014)"

#send the request
content = requests.get(url)



#### Sending the source code to our table parser

Now that we have the source, we can:

1. convert it to a string format
2. remove newline characters
3. pass the result to our table reader function, which will output a data_table

In [5]:
#get the content as a string, in this case, the HTML source code
content_string = content.text

#clean up the source code by removing newline characters with the string's replace method
content_string = content_string.replace("\n", "")

#submit the source code to our table reader
data_table = table_reader(content_string)

print(data_table)

[['State', 'City', 'Population', 'Violent Crime', 'Murder andNonnegligent Manslaughter', 'Rape', 'Robbery', 'Aggravated Assault', 'Property Crime', 'Burglary', 'Larceny-Theft', 'Motor Vehicle Theft', 'Arson'], ['New Mexico', 'Albuquerque', '558,874', '882.8', '5.4', '71.9', '247.1', '558.4', '5,446.1', '1,095.6', '3,713.9', '636.6', '15.4'], ['California', 'Anaheim', '346,956', '317.3', '4.0', '22.8', '120.5', '170.1', '2,362.3', '375.0', '1,619.8', '367.5', '6.6'], ['Alaska', 'Anchorage', '301,306', '864.6', '4.0', '130.1', '164.6', '565.9', '3,827.0', '456.3', '3,059.0', '311.6', '26.9'], ['Texas', 'Arlington', '382,976', '484.1', '3.4', '53.8', '128.7', '298.2', '3,515.1', '644.9', '2,633.6', '236.6', '6.8'], ['Georgia', 'Atlanta', '454,363', '1,227.4', '20.5', '33.2', '512.6', '661.1', '5,747.4', '1,203.9', '3,631.0', '912.5', '16.5'], ['Colorado', 'Aurora', '350,948', '412.6', '3.1', '78.1', '118.8', '212.6', '2,838.6', '517.5', '2,018.0', '303.2', '21.7'], ['Texas', 'Austin', '90
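
Deleting newlines outright can glue neighboring words together. As a hedged alternative, the sketch below uses the re (regular expression) library to collapse each run of whitespace into a single space. The helper name clean_cell and the sample string are hypothetical. Whether the "andNonnegligent" header in the output above actually comes from a newline (rather than, say, a line-break tag in the page) is an assumption.

```python
import re


def clean_cell(text):
    """Collapse any run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# made-up cell text with a newline in the middle of the header
raw_header = "Murder and\nNonnegligent Manslaughter"

# deleting the newline joins the two words on either side of it
print(raw_header.replace("\n", ""))

# replacing the whitespace run with one space keeps the words apart
print(clean_cell(raw_header))
```

A helper like this could be applied to each cell after parsing, instead of editing the whole page source before parsing.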

#### Make a pandas DataFrame out of the table data

Next, we'll load the data table into pandas, the same way we usually do when starting from a list of lists.

We will create a DataFrame whose first argument is the data within the table (a list of lists containing the values).

The second argument, columns, is the column information (the first list in our data_table variable).
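
The header/body split described above can be tried on a tiny hand-made list first, without pandas. This is only a sketch: sample_table is a stand-in for data_table (its values are copied from the output above), and the helper to_number is a hypothetical name. It also shows that every scraped cell is still a string, so numbers such as '558,874' need converting before any arithmetic.

```python
# a tiny stand-in for data_table: first row is the header, the rest are cities
sample_table = [
    ["State", "City", "Population", "Violent Crime"],
    ["New Mexico", "Albuquerque", "558,874", "882.8"],
    ["California", "Anaheim", "346,956", "317.3"],
]

header = sample_table[0]   # column names
body = sample_table[1:]    # one list per city


def to_number(text):
    """Turn '558,874' into 558874.0; leave non-numeric text unchanged."""
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return text


# pair each column name with the matching cell in every row
records = [dict(zip(header, [to_number(cell) for cell in row])) for row in body]

print(records[0]["City"], records[0]["Population"])
```

The DataFrame call in the next cell does the same pairing of column names with row values, but keeps the cells as strings until they are converted.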

In [4]:
import pandas as pd

#send the data_table to a pandas DataFrame

#The column information is stored in the first row of the data, data_table[0]
#The actual cell entries are stored in the second row onward, data_table[1:]

crime_df = pd.DataFrame(data_table[1:], columns=data_table[0])

#Let's look at the first ten entries of the U.S. crime dataframe
print(crime_df.head(10))

           State         City Population Violent Crime  \
0     New Mexico  Albuquerque    558,874         882.8   
1     California      Anaheim    346,956         317.3   
2         Alaska    Anchorage    301,306         864.6   
3          Texas    Arlington    382,976         484.1   
4        Georgia      Atlanta    454,363       1,227.4   
5       Colorado       Aurora    350,948         412.6   
6          Texas       Austin    903,924         396.2   
7     California  Bakersfield    367,406         456.7   
8       Maryland    Baltimore    623,513       1,338.5   
9  Massachusetts       Boston    654,413         725.7   

  Murder andNonnegligent Manslaughter   Rape Robbery Aggravated Assault  \
0                                 5.4   71.9   247.1              558.4   
1                                 4.0   22.8   120.5              170.1   
2                                 4.0  130.1   164.6              565.9   
3                                 3.4   53.8   128.7         