# BMI565: Bioinformatics Programming & Scripting


## Week 4: HTML and Web Scraping

**Thanks to Ryan Swan for these materials.**

1. HTML
    * Organization of HTML files
2. LXML Package
    * HTML as a tree structure
    * XPath queries
    * Element objects
    * HTML tag attributes
3. Beautiful Soup
    * Soup objects and methods
    * Using tag attributes with BeautifulSoup
4. The Web Developer's Console
5. A note about APIs and `robots.txt`

#### Requirements

1. Python 2.7
2. `lxml` module
3. `urllib` module
4. `BeautifulSoup (beautifulsoup4)` module
5. `io` module

`lxml` and `beautifulsoup4` should be included in the current Anaconda distribution. `urllib` and `io` are part of the Python standard library.

## HTML

Just like anything else on a computer, the internet is made up of code. At the computer level, most pages look about like this:

    <html>
    <head>
        <title>Hey look, a webpage!</title>
    </head>
    <body>
        <p>webpage goes here</p>
    </body>
    </html>

Granted, most pages are much more complicated, but this page will render just fine if loaded in a standard browser.

Hypertext Markup Language (HTML) is the basis for most pages that are served on the internet. Like other markup languages, it's primarily concerned with telling the browser how to display a document, with an emphasis on formatting and annotation.

HTML is actually very similar to XML (Extensible Markup Language), with the caveat that it also contains presentation semantics, which are attributes that specify how information is meant to be displayed or arranged on a screen. But overall, the nested format is almost exactly like an XML document, and because of that, we can extract information from a standard HTML page exactly the same way we would from an XML document.
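To make the XML comparison concrete, here is a minimal sketch that reads the toy page from above with Python's standard-library XML parser (`xml.etree.ElementTree`, which is not one of the packages used in the rest of this notebook). It only works because that toy page happens to be well-formed XML; real pages frequently are not, which is why the sections below use dedicated HTML parsers.

```python
import xml.etree.ElementTree as ET

# The toy page from above; it happens to be well-formed XML.
page = """<html>
<head>
    <title>Hey look, a webpage!</title>
</head>
<body>
    <p>webpage goes here</p>
</body>
</html>"""

root = ET.fromstring(page)

# Walk the nested tags exactly as we would in an XML document.
title = root.find("head/title").text
paragraph = root.find("body/p").text
child_tags = [child.tag for child in root]

print(title)
print(paragraph)
print(child_tags)
```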

## LXML package

The LXML package for Python contains methods to read HTML pages as a tree structure. It uses a querying syntax called XML Path Language (XPath) to navigate the tree structure and return relevant information from the document.

Before we get started, it helps to have an idea of some of the ways that HTML arranges documents. Most scrapable HTML data is contained in tables like the one at http://www.bioinformatics.org/sms/iupac.html. HTML tables are arranged in the following format:

    <table>
        <tr>
            <td></td>
            <td></td>
            <td></td>
            ...
        </tr>
        <tr>
            ...
        </tr>
    </table>

This general format specifies table rows (`<tr>`) and table data cells (`<td>`), where each cell in a row belongs to a different column. The data in the table is contained inside each of the nested `<td></td>` tag pairs.

XPath querying allows us to find specific kinds of elements and their contents. For example:

In [1]:
from lxml import etree
from urllib import urlopen # lets us open files from web addresses
from io import StringIO # This will help us deal with string inputs

## Get the code from the url
html = urlopen("http://www.bioinformatics.org/sms/iupac.html").read()

## First we have some housekeeping. StringIO only accepts unicode strings,
## so we decode the raw bytes we downloaded before handing them over.

html = html.decode('utf-8')

## Next we have to create a parser that will read the info from the HTML 
## file and tell it what kind of data it will be receiving

parser = etree.HTMLParser()
tree = etree.parse(StringIO(html),parser)

We now have the webpage represented as a tree of data, just like an XML object. We can do all sorts of things now.

We can print the entire tree structure to the screen:

In [2]:
## This function looks in each element node, and if it has 
## contents it performs the same action on the descendent node
## Note that this is an example of recursion - a function 
## that calls itself.

def parseTree(e,t='\t'):
    for i in e:
        print str(t) + str(i)
        parseTree(i,t=t + '\t')

parseTree(tree.getroot())

	<Element head at 0x106825440>
		<Element meta at 0x106825560>
		<Element meta at 0x106825518>
		<Element meta at 0x1068255a8>
		<Element title at 0x106825560>
	<Element body at 0x1068254d0>
		<Element table at 0x106825560>
			<Element tr at 0x1068255a8>
				<Element td at 0x1068255f0>
					<Element font at 0x106825680>
				<Element td at 0x106825638>
					<Element font at 0x1068255f0>
			<Element tr at 0x106825518>
				<Element td at 0x1068255a8>
				<Element td at 0x106825710>
			<Element tr at 0x106825638>
				<Element td at 0x106825518>
				<Element td at 0x1068255a8>
			<Element tr at 0x106825710>
				<Element td at 0x106825638>
				<Element td at 0x106825518>
			<Element tr at 0x1068255a8>
				<Element td at 0x106825710>
				<Element td at 0x106825638>
			<Element tr at 0x106825518>
				<Element td at 0x1068255a8>
				<Element td at 0x106825710>
			<Element tr at 0x106825638>
				<Element td at 0x106825518>
				<Element td at 0x1068255a8>
			<Element tr at 0x106825710>
				...

And we can also navigate the tree just like an XML file:

In [3]:
root = tree.getroot()

for e in root:
    print e
    for i in e:
        print '\t' + str(i)

<Element head at 0x105aeb518>
	<Element meta at 0x106825710>
	<Element meta at 0x106825638>
	<Element meta at 0x106825680>
	<Element title at 0x106825560>
<Element body at 0x106825440>
	<Element table at 0x105aeb518>
	<Element br at 0x106825680>
	<Element table at 0x106825710>


The etree returned has "elements" that behave much like list objects: they can be indexed and iterated over in the same way. However, they also carry additional information, such as the tag name (`.tag`), the tag's attributes (`.attrib`), and its text (`.text`).
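The list-like behavior can be sketched on a small inline table. The HTML string below is a made-up stand-in for the real page, so nothing is downloaded; the variable names are illustrative only.

```python
from lxml import etree

# A small stand-in table (hypothetical data, not fetched from the web).
snippet = """<html><body>
<table cols="2" width="350">
  <tr><td>A</td><td>Adenine</td></tr>
  <tr><td>C</td><td>Cytosine</td></tr>
</table>
</body></html>"""

root = etree.fromstring(snippet, etree.HTMLParser())
body = root.find("body")
table = body[0]                 # index into an element like a list

n_rows = len(table)             # number of child elements (<tr> rows)
row_tags = [child.tag for child in table]        # iterate like a list
first_cells = [cell.text for cell in table[0]]   # text of the first row

# The "additional information": tag name and attributes.
table_tag = table.tag
cols = table.get("cols")
attrs = dict(table.attrib)

print(n_rows)
print(row_tags)
print(first_cells)
print(attrs)
```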

Using the `xpath()` method of an etree, we can explicitly address the parts of the document we want to access. A query always returns a list of matching elements; here we pull every table that sits directly under `<body>`.

In [4]:
table = tree.xpath('body/table')

We've now narrowed our focus to one specific part of the HTML document: the tables of values.

If we don't want to spell out the full path, we can also find specific elements based on their attributes. For example, the table containing the three-letter amino acid codes carries the attribute `cols="3"`, specifying that it has three columns. We can use this attribute to extract that table in particular.

In [5]:
amino = tree.xpath("""//table[@cols='3']""")
amino

[<Element table at 0x106825710>]

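We can use the `.text` attribute of each element to print out the data we've found. Note that `.text` only holds the text that sits directly inside an element, so the three header cells, whose text is wrapped in a nested `<font>` tag, print `None`.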

In [6]:
for row in amino[0]:
    for cell in row:
        print cell.text

None
None
None
A
Ala
Alanine
C
Cys
Cysteine
D
Asp
Aspartic Acid
E
Glu
Glutamic Acid
F
Phe
Phenylalanine
G
Gly
Glycine
H
His
Histidine
I
Ile
Isoleucine
K
Lys
Lysine
L
Leu
Leucine
M
Met
Methionine
N
Asn
Asparagine
P
Pro
Proline
Q
Gln
Glutamine
R
Arg
Arginine
S
Ser
Serine
T
Thr
Threonine
V
Val
Valine
W
Trp
Tryptophan
Y
Tyr
Tyrosine


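We could also do the same thing using the `text()` function that is built into XPath queries. Because `//text()` matches text nodes at any depth below each cell, it also picks up the header text nested inside the `<font>` tags: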

In [7]:
for i in tree.xpath("""//table[@cols='3']/tr/td//text()"""):
    print i

IUPAC amino acid code
Three letter code
Amino acid
A
Ala
Alanine
C
Cys
Cysteine
D
Asp
Aspartic Acid
E
Glu
Glutamic Acid
F
Phe
Phenylalanine
G
Gly
Glycine
H
His
Histidine
I
Ile
Isoleucine
K
Lys
Lysine
L
Leu
Leucine
M
Met
Methionine
N
Asn
Asparagine
P
Pro
Proline
Q
Gln
Glutamine
R
Arg
Arginine
S
Ser
Serine
T
Thr
Threonine
V
Val
Valine
W
Trp
Tryptophan
Y
Tyr
Tyrosine


We can now start using for loops to write more interesting queries, and convert the entire table to data we can use.

One thing to keep in mind is that once you have focused on a particular part of the tree, you can define your position relative to that node. However, the element still has access to the whole HTML document's tree. A query that starts with `/` (or `//`) is an absolute path evaluated from the root of the full document, even when it is called on a sub-element, while a query that starts with `.` is evaluated relative to the current element. Here we use the `.` operator to define a relative path from our element object.
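The difference between the two kinds of path is easy to miss, so here is a minimal sketch on a made-up two-table snippet (the data and variable names are illustrative, not taken from the live page). Both queries are called on the same sub-element, yet they return different results.

```python
from lxml import etree

# Two small stand-in tables (hypothetical data, nothing is downloaded).
snippet = """<html><body>
<table cols="2"><tr><td>A</td><td>Adenine</td></tr></table>
<table cols="3"><tr><td>A</td><td>Ala</td><td>Alanine</td></tr></table>
</body></html>"""

root = etree.fromstring(snippet, etree.HTMLParser())
amino_table = root.xpath("//table[@cols='3']")[0]

# '//' is absolute: it starts from the document root even though we call
# it on a sub-element, so the cells of BOTH tables are returned.
absolute = amino_table.xpath("//td/text()")

# './/' is relative: it starts from the current element, so only the
# cells of the three-column table are returned.
relative = amino_table.xpath(".//td/text()")

print(absolute)
print(relative)
```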

In [8]:
tables = tree.xpath('body/table')

# In this expression, we're interested in the second table found
# (the amino acid table). Remember to use the relative path root.
for tr in tables[1].xpath('./tr'):
    print tr.xpath('./td//text()')

['IUPAC amino acid code', 'Three letter code', 'Amino acid']
['A', 'Ala', 'Alanine']
['C', 'Cys', 'Cysteine']
['D', 'Asp', 'Aspartic Acid']
['E', 'Glu', 'Glutamic Acid']
['F', 'Phe', 'Phenylalanine']
['G', 'Gly', 'Glycine']
['H', 'His', 'Histidine']
['I', 'Ile', 'Isoleucine']
['K', 'Lys', 'Lysine']
['L', 'Leu', 'Leucine']
['M', 'Met', 'Methionine']
['N', 'Asn', 'Asparagine']
['P', 'Pro', 'Proline']
['Q', 'Gln', 'Glutamine']
['R', 'Arg', 'Arginine']
['S', 'Ser', 'Serine']
['T', 'Thr', 'Threonine']
['V', 'Val', 'Valine']
['W', 'Trp', 'Tryptophan']
['Y', 'Tyr', 'Tyrosine']


This data is now in the correct format to be tidied up and written out to a CSV file using the `csv` module, or loaded into a data analysis package for further comparison.
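As a rough sketch of the CSV step, the rows below are a hand-copied subset of the output above, and the file name and the `open_csv` helper are hypothetical choices for illustration. The helper exists only because the `csv` module expects a binary file on Python 2 and a text file opened with `newline=''` on Python 3.

```python
import csv
import os
import sys
import tempfile

# A few rows in the same shape as the XPath output above.
rows = [
    ['IUPAC amino acid code', 'Three letter code', 'Amino acid'],
    ['A', 'Ala', 'Alanine'],
    ['C', 'Cys', 'Cysteine'],
    ['D', 'Asp', 'Aspartic Acid'],
]


def open_csv(path, mode):
    """Open a file the way the csv module expects on this Python version."""
    if sys.version_info[0] == 2:
        return open(path, mode + "b")
    return open(path, mode, newline="")


# Hypothetical output location in the system's temporary directory.
out_path = os.path.join(tempfile.gettempdir(), "amino_acids.csv")

with open_csv(out_path, "w") as handle:
    csv.writer(handle).writerows(rows)

# Read the file back to confirm that the round trip preserves the rows.
with open_csv(out_path, "r") as handle:
    round_trip = [line for line in csv.reader(handle)]

print(round_trip)
```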

## Beautiful Soup 

While that was certainly a fun demonstration of how HTML is organized and can be digested for further analysis, manual XPath evaluations can be a tedious process. Beautiful Soup is a package meant to make the process of getting information from web documents much simpler.

In Beautiful Soup, we first import the package in order to create a "soup" object. Here we use the html object that we acquired earlier.

In [9]:
from bs4 import BeautifulSoup as bs

soup = bs(html, "lxml")

From here we can perform all sorts of different manipulations on the data, and Beautiful Soup takes care of many of the details behind the scenes. For example, we can print the entire page as an indented object:

In [10]:
print(soup.prettify())

<!DOCTYPE html PUBLIC "-//w3c//dtd html 4.0 transitional//en">
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="Mozilla/4.72 [en] (Win98; I) [Netscape]" name="GENERATOR"/>
  <meta content="Paul Stothard" name="Author"/>
  <title>
   IUPAC Codes
  </title>
 </head>
 <body alink="#0000FF" bgcolor="#FFFFCC" link="#0000FF" text="#000000" vlink="#551A8B">
  <table border="" cellpadding="2" cellspacing="0" cols="2" width="350">
   <tr>
    <td bgcolor="#B0C4DE">
     <font color="#000000">
      IUPAC nucleotide code
     </font>
    </td>
    <td bgcolor="#B0C4DE">
     <font color="#000000">
      Base
     </font>
    </td>
   </tr>
   <tr>
    <td>
     A
    </td>
    <td>
     Adenine
    </td>
   </tr>
   <tr>
    <td>
     C
    </td>
    <td>
     Cytosine
    </td>
   </tr>
   <tr>
    <td>
     G
    </td>
    <td>
     Guanine
    </td>
   </tr>
   <tr>
    <td>
     T (or U)
    </td>
    <td>
     Thymine (or Uracil)
    ...

We can also call specific tags and have them returned instantly. For example, this will return all the tables in the document:

In [11]:
tables = soup.find_all("table")
tables

[<table border="" cellpadding="2" cellspacing="0" cols="2" width="350">\n<tr>\n<td bgcolor="#B0C4DE"><font color="#000000">IUPAC nucleotide code</font></td>\n<td bgcolor="#B0C4DE"><font color="#000000">Base</font></td>\n</tr>\n<tr>\n<td>A</td>\n<td>Adenine</td>\n</tr>\n<tr>\n<td>C</td>\n<td>Cytosine</td>\n</tr>\n<tr>\n<td>G</td>\n<td>Guanine</td>\n</tr>\n<tr>\n<td>T (or U)</td>\n<td>Thymine (or Uracil)</td>\n</tr>\n<tr>\n<td>R</td>\n<td>A or G</td>\n</tr>\n<tr>\n<td>Y</td>\n<td>C or T</td>\n</tr>\n<tr>\n<td>S</td>\n<td>G or C</td>\n</tr>\n<tr>\n<td>W</td>\n<td>A or T</td>\n</tr>\n<tr>\n<td>K</td>\n<td>G or T</td>\n</tr>\n<tr>\n<td>M</td>\n<td>A or C</td>\n</tr>\n<tr>\n<td>B</td>\n<td>C or G or T</td>\n</tr>\n<tr>\n<td>D</td>\n<td>A or G or T</td>\n</tr>\n<tr>\n<td>H</td>\n<td>A or C or T</td>\n</tr>\n<tr>\n<td>V</td>\n<td>A or C or G</td>\n</tr>\n<tr>\n<td>N</td>\n<td>any base</td>\n</tr>\n<tr>\n<td>. or -</td>\n<td>gap</td>\n</tr>\n</table>,
 ...

From this object we can find that the table we're interested in lives at index 1. 

We also have the ability to find the table we're interested in by finding attributes that make it unique in the HTML document and using the soup.find() method.

In [12]:
table = soup.find("table",{"width":"350","cols":"3"})
table

<table border="" cellpadding="2" cellspacing="0" cols="3" width="350">\n<tr>\n<td bgcolor="#B0C4DE"><font color="#000000">IUPAC amino acid code</font></td>\n<td bgcolor="#B0C4DE"><font color="#000000">Three letter code</font></td>\n<td bgcolor="#B0C4DE"><font color="#000000">Amino acid</font></td>\n</tr>\n<tr>\n<td>A</td>\n<td>Ala</td>\n<td>Alanine</td>\n</tr>\n<tr>\n<td>C</td>\n<td>Cys</td>\n<td>Cysteine</td>\n</tr>\n<tr>\n<td>D</td>\n<td>Asp</td>\n<td>Aspartic Acid</td>\n</tr>\n<tr>\n<td>E</td>\n<td>Glu</td>\n<td>Glutamic Acid</td>\n</tr>\n<tr>\n<td>F</td>\n<td>Phe</td>\n<td>Phenylalanine</td>\n</tr>\n<tr>\n<td>G</td>\n<td>Gly</td>\n<td>Glycine</td>\n</tr>\n<tr>\n<td>H</td>\n<td>His</td>\n<td>Histidine</td>\n</tr>\n<tr>\n<td>I</td>\n<td>Ile</td>\n<td>Isoleucine</td>\n</tr>\n<tr>\n<td>K</td>\n<td>Lys</td>\n<td>Lysine</td>\n</tr>\n<tr>\n<td>L</td>\n<td>Leu</td>\n<td>Leucine</td>\n</tr>\n<tr>\n<td>M</td>\n<td>Met</td>\n<td>Methionine</td>\n</tr>\n<tr>\n<td>N</td>\n<td>Asn</td>\n...

From either of these we can easily extract the rest of the information as follows.

In [13]:
for row in table.find_all("tr"):
    # Collect the text of every cell in this row
    cells = [c.get_text() for c in row.find_all("td")]
    print cells

[u'IUPAC amino acid code', u'Three letter code', u'Amino acid']
[u'A', u'Ala', u'Alanine']
[u'C', u'Cys', u'Cysteine']
[u'D', u'Asp', u'Aspartic Acid']
[u'E', u'Glu', u'Glutamic Acid']
[u'F', u'Phe', u'Phenylalanine']
[u'G', u'Gly', u'Glycine']
[u'H', u'His', u'Histidine']
[u'I', u'Ile', u'Isoleucine']
[u'K', u'Lys', u'Lysine']
[u'L', u'Leu', u'Leucine']
[u'M', u'Met', u'Methionine']
[u'N', u'Asn', u'Asparagine']
[u'P', u'Pro', u'Proline']
[u'Q', u'Gln', u'Glutamine']
[u'R', u'Arg', u'Arginine']
[u'S', u'Ser', u'Serine']
[u'T', u'Thr', u'Threonine']
[u'V', u'Val', u'Valine']
[u'W', u'Trp', u'Tryptophan']
[u'Y', u'Tyr', u'Tyrosine']


And again we're ready to move on to tidying up the data and getting ready to do actual analysis.
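One possible first tidying step is to pair each row with the header so that values can be looked up by column name. This is only a sketch: the inline HTML is a shortened, hand-written stand-in for the real table, the variable names are illustrative, and it uses Python's built-in `"html.parser"` backend rather than the `"lxml"` backend used above.

```python
from bs4 import BeautifulSoup

# A shortened stand-in for the amino acid table (hypothetical snippet).
snippet = """<table cols="3" width="350">
<tr><td><font>IUPAC amino acid code</font></td>
    <td><font>Three letter code</font></td>
    <td><font>Amino acid</font></td></tr>
<tr><td>A</td><td>Ala</td><td>Alanine</td></tr>
<tr><td>C</td><td>Cys</td><td>Cysteine</td></tr>
</table>"""

soup = BeautifulSoup(snippet, "html.parser")
table = soup.find("table", {"cols": "3"})

# One list of cell strings per row, with surrounding whitespace stripped.
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")]

# Use the first row as column names and turn the rest into dictionaries.
header, body = rows[0], rows[1:]
records = [dict(zip(header, row)) for row in body]

print(records[0]["Amino acid"])
print(len(records))
```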

Scraping tables is just a fraction of what can be done with these packages. You can also build web crawlers that automatically index entire websites and extract relevant information, write bots that monitor websites and update themselves when something changes, or collect images, links, and any other class of information represented in HTML. It is even possible to write interfaces that pull information from JavaScript applications and pages driven by other scripting languages.
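Collecting links and images follows the same pattern as collecting table cells. The snippet below is a hand-written example page; the image path is invented for illustration, and the built-in `"html.parser"` backend is used so that nothing beyond Beautiful Soup is required.

```python
from bs4 import BeautifulSoup

# A made-up page fragment containing two links and one image.
snippet = """<html><body>
<p>See the <a href="/sms/iupac.html">IUPAC codes</a> and the
<a href="http://www.bioinformatics.org/">home page</a>.</p>
<img src="/images/logo.png" alt="logo"/>
</body></html>"""

soup = BeautifulSoup(snippet, "html.parser")

# Tag attributes are read with .get(), much like a dictionary lookup.
links = [a.get("href") for a in soup.find_all("a")]
images = [img.get("src") for img in soup.find_all("img")]

print(links)
print(images)
```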

## The Developer's Console

Both Chrome and Firefox are equipped with a developer's console, meant for debugging code while writing websites. This console can also be used to see what elements your computer is interfacing with while you surf the web. 

To open the developer's console in Firefox, press Ctrl+Shift+K in Windows or Cmd+Opt+K in OSX. The Network tab will allow you to see what information is being sent and when, while the Inspector tab allows you to hover over code and see what element of the page it represents.

Chrome's developer console can be accessed with Ctrl+Shift+J on Windows or Cmd+Opt+J on OSX. While the tabs are named slightly differently, the functions are essentially the same. Notably, Chrome provides native support for web scraping, though the data it gives are usually oriented more toward the organization of entire sites and less toward acquiring data from an individual page.

If you plan on getting data from the web, this is an invaluable tool that will save you a lot of time finding out where data is stored.

## A Word On APIs And robots.txt

Before scraping a site, it is worth taking a couple of things into account so that you remain a good citizen of the web. The `robots.txt` file located in the root directory of most websites will usually tell you which directories automated clients may and may not visit. If you are scraping a large amount of data, it is good practice to stay within the paths that `robots.txt` permits with `Allow:` and to keep out of those it lists under `Disallow:`.
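Python's standard library can check a URL against `robots.txt` rules for you. The sketch below feeds the parser a made-up set of rules and a placeholder site name so that nothing is fetched; for a real site you would point the parser at the site's actual `robots.txt` with `set_url()` and `read()` instead.

```python
try:
    from urllib.robotparser import RobotFileParser  # Python 3
except ImportError:
    from robotparser import RobotFileParser          # Python 2.7

# Hypothetical robots.txt contents, supplied inline so nothing is fetched.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Allow: /sms/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

base = "http://www.example.org"  # placeholder site, not a real target

# can_fetch(useragent, url) reports whether the rules permit the request.
can_read_table = parser.can_fetch("*", base + "/sms/iupac.html")
can_read_private = parser.can_fetch("*", base + "/private/data.html")

print(can_read_table)
print(can_read_private)
```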

Many sites also provide an Application Programming Interface (API) that allows you to acquire information directly without scraping web data from the HTML interface, saving both you and the site manager time and money. If an API is available, it is almost always advisable to make use of it.

## In Class Exercises

In [None]:
## Using either lxml or BeautifulSoup, scrape the values from the first 
## table which contains nucleotides and their corresponding name
## Create a dictionary from these values where the nucleotide is the key.


In [None]:
## Using the query above, write the data from the table to a csv file


## References

- [LXML HTML Xpath Tutorial](http://lxml.de/parsing.html)
- [BeautifulSoup Documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [XPath Syntax Guide](http://www.w3schools.com/xsl/xpath_syntax.asp)