# BMI565: Bioinformatics Programming & Scripting

#### (C) Michael Mooney (mooneymi@ohsu.edu)

## Week 4: XML, HTML and Web Scraping

**Thanks to Gareth Harman for revisions to the XML materials.**
<br />
**Thanks to Ryan Swan for the materials on HTML and web scraping.**


# XML Overview

><b>XML</b> stands for E<u>x</u>tensible <u>M</u>arkup <u>L</u>anguage, and is a set of rules for encoding documents in a machine-readable format. In bioinformatics, XML is a commonly used format for sharing heterogeneous data (as opposed to delimited files, where every record (row) contains the same data elements).

The World Wide Web Consortium (W3C) began overseeing XML development in 1996, and XML 1.0 became a W3C Recommendation in 1998.

### XML Design Goals:
1. XML shall be straightforwardly usable over the Internet
2. XML shall support a wide variety of applications
3. XML shall be compatible with Standard Generalized Markup Language (SGML)
4. It shall be easy to write programs that process XML documents
5. The number of optional features in XML is to be kept to the absolute minimum
6. XML documents should be human-legible and reasonably clear
7. The XML design should be prepared quickly
8. The design of XML shall be formal and concise
9. XML documents shall be easy to create
10. Terseness in XML markup is of minimal importance

### Why can't we use CSV formats?

We usually can, but...

1. CSV files are not always human readable (other documentation is often necessary to identify data elements)
2. Inconsistencies are more likely 
3. CSV files don't easily support multiple levels of data
4. CSV files don't easily support additional details such as formatting or metadata (experimental protocols, etc.)


### XML Format

The first couple lines of an XML document contain information about the XML version used, the document structure and comments:

#### Version

```xml
<?xml version='1.0' encoding='UTF-8'?>
```
    
#### Root Element with Namespace and Schema Declarations
```xml
<uniprot xmlns="http://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">
```

#### XML Document Body

The body of an XML document contains labeled data elements. Data elements can be nested to show relationships. Data labels are called "tags", which can also contain attributes (values are always strings) that provide additional information about the data.
    
```xml
    <parent_tag>
        <child_tag attribute1="value1" attribute2="value2">data</child_tag>
    </parent_tag>
```

It is subjective whether to provide additional information as attributes or additional data elements:

```xml
    <contact birthdate="1-1-1980">
        <name>John Smith</name>
    </contact>
    
    <contact>
        <name>John Smith</name>
        <birthdate>1-1-1980</birthdate>
    </contact>
```
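
Whichever layout is chosen affects how the value is read back. The sketch below uses the `lxml` package (introduced later in these notes) on the two `contact` variants shown above; the variable names are chosen here for illustration only.

```python
import lxml.etree as et

# The same birthdate stored as an attribute and as a child element
as_attribute = et.fromstring(
    '<contact birthdate="1-1-1980"><name>John Smith</name></contact>')
as_element = et.fromstring(
    '<contact><name>John Smith</name><birthdate>1-1-1980</birthdate></contact>')

# Attribute values are read with .get() (or through the .attrib mapping)
birthdate_from_attribute = as_attribute.get("birthdate")

# Child element content is read from the child's text
birthdate_from_element = as_element.findtext("birthdate")

print(birthdate_from_attribute, birthdate_from_element)
```

Both reads return the same string, so the choice is mostly about how the data will be queried and extended later.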

<center><img src="./images/xml_graph.png"></center>

#### DTD and XML Schema

- Document Type Definitions (DTD) and XML Schemas are two ways of describing the structure and content of an XML document
- XML Schemas (a.k.a. XML Schema Definitions or XSDs) were designed to improve upon the shortcomings of DTDs
    - data type support
    - namespace aware
- Example: the UniProt XSD - [http://www.uniprot.org/support/docs/uniprot.xsd](http://www.uniprot.org/support/docs/uniprot.xsd)
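
To make the idea of a schema concrete, here is a minimal sketch of validating a document against an XSD with `lxml`. The tiny `employee` schema and both test documents are made up for illustration (they are not part of the UniProt schema); the point is only that the schema's data type support lets the validator reject a non-numeric salary.

```python
import io
from lxml import etree

# A hypothetical schema: an employee has a name (string) and a salary (integer)
xsd_bytes = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="employee">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="salary" type="xs:integer"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

schema = etree.XMLSchema(etree.parse(io.BytesIO(xsd_bytes)))

good_doc = etree.fromstring(
    b"<employee><name>John Doe</name><salary>2000</salary></employee>")
bad_doc = etree.fromstring(
    b"<employee><name>John Doe</name><salary>lots</salary></employee>")

# validate() returns True or False; details of failures are in schema.error_log
is_valid_good = schema.validate(good_doc)
is_valid_bad = schema.validate(bad_doc)
print(is_valid_good, is_valid_bad)
```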

In [2]:

import os

company_xmlpath = os.path.join('data', 'company.xml')

company_xmlstring = ('''<?xml version='1.0' ?>
                        <company>
                            <department>
                                <employee>
                                    <name>John Doe</name>
                                    <job>Software Analyst</job>
                                    <salary>2000</salary>
                                </employee>
                                <employee>
                                    <name>Jane Fletcher</name>
                                    <job>Designer</job>
                                    <salary>2500</salary>
                                </employee>        
                                <employee>
                                    <name>Mike Mooney</name>
                                    <job>Professor</job>
                                    <salary>250000</salary>
                                </employee>
                                <employee>
                                    <name>Gareth Harman</name>
                                    <job>Student</job>
                                    <salary>10</salary>
                                </employee>
                            </department>
                        </company>''')


# LXML

The `lxml` package parses an XML document into a tree structure. It supports a query syntax called XML Path Language (XPath) for navigating that tree and returning relevant information from the document.

`parse()`
> Reads an XML file.
> Returns an `ElementTree` object.

`fromstring()`
> Creates an `Element` object from a string containing XML

It's important to note that the two loading methods above create different types of objects, as we can see below:

In [3]:

import lxml.etree as et
import re

# Let's load our object from the XML string defined above
parse_company = et.fromstring(company_xmlstring)
tree_company = parse_company.getroottree()

print(f'parse_company: {type(parse_company)}')
print(f'tree_company: {type(tree_company)}')

# And load from an xml file directly
tree_company = et.parse(company_xmlpath)

print(f'tree_company: {type(tree_company)}')


parse_company: <class 'lxml.etree._Element'>
tree_company: <class 'lxml.etree._ElementTree'>
tree_company: <class 'lxml.etree._ElementTree'>


> Let's look at the root node and some of the information about our XML, including a given element's `tag` and `text`

In [4]:

# Obtain the root node of our tree
root_company = tree_company.getroot()
print(f'root_company: {type(root_company)}')

print(f'Company: {root_company.tag} Len: {len(root_company)}')
print(f'Company: {root_company[0].tag} Len: {len(root_company[0])}')
print(f'Company: {root_company[0][0].tag} Len: {len(root_company[0][0])}')

print(f'{root_company[0][0][0].tag}')
print(f'{root_company[0][0][0].text}')


root_company: <class 'lxml.etree._Element'>
Company: company Len: 1
Company: department Len: 4
Company: employee Len: 3
name
John Doe


### Walking the tree

`iterwalk()`
> Iteratively *walk* the elements of an `ElementTree` or `Element` object

`iterparse()`
> Iteratively *parse* and walk the elements of an .xml file

`element.clear()`
> This statement is important!

> XML files can be very large (sometimes >4GB), and a machine may have only 4GB of RAM in total. To traverse such a file we need to clear elements from memory as we go

In [5]:

def walk(iter_obj):

    ''' Walk an iterator of (event, element) pairs and print each element's text '''

    event, root = next(iter_obj)            # First item is the 'start' event of the root

    for event, element in iter_obj:         # Walk through the remaining elements
        if (event == 'end' and              # Element has been read completely
            element.tag != root.tag and     # Skip the root element
            element.text is not None):      # Skip elements with no text at all
            if element.text.strip() != '':  # Skip whitespace-only text
                print(f'{element.tag}:{element.text.strip()}')
                element.clear()             # Free this element's contents
    root.clear()                            # Free whatever is left under the root
    



> Walking through an object loaded into memory (ElementTree)

In [24]:

iter_et_fromtree = et.iterwalk(root_company, events=['start', 'end'])
walk(iter_et_fromtree)


name:John Doe
job:Software Analyst
salary:2000
name:Jane Fletcher
job:Designer
salary:2500
name:Mike Mooney
job:Professor
salary:250000
name:Gareth Harman
job:Student
salary:10


> Walking through an object by iteratively parsing

In [22]:

tree_company = et.iterparse(company_xmlpath, events=['start', 'end'])
walk(tree_company)


name:John Doe
job:Software Analyst
salary:2000
name:Jane Fletcher
job:Designer
salary:2500
name:Mike Mooney
job:Professor
salary:250000
name:Gareth Harman
job:Student
salary:10
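
When only one kind of record is needed, a common variation on the pattern above is to listen only for `end` events on the record-level tag and clear each record once it has been processed. The sketch below assumes a small in-memory document (`xml_bytes`, made up here to stand in for a large file); `tag="employee"` is lxml's keyword for filtering events by tag name.

```python
import io
import lxml.etree as et

# A stand-in for a large file on disk (hypothetical data)
xml_bytes = b"""<company>
  <employee><name>John Doe</name><salary>2000</salary></employee>
  <employee><name>Jane Fletcher</name><salary>2500</salary></employee>
</company>"""

names = []
total_salary = 0

# Only 'end' events for <employee> are reported, so each element
# is complete (all children parsed) by the time we see it
for event, employee in et.iterparse(io.BytesIO(xml_bytes),
                                    events=("end",), tag="employee"):
    names.append(employee.findtext("name"))
    total_salary += int(employee.findtext("salary"))
    employee.clear()  # Drop the children and text we no longer need

print(names, total_salary)
```

For a real file, the `io.BytesIO(...)` argument would be replaced by the file path.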


In [25]:

tree_company = et.parse(company_xmlpath)
root_company = tree_company.getroot()


## Namespaces

> Namespaces allow us to avoid name conflicts by using `prefixes`

```xml
<table>
    <tr>
        <td>Column1</td>
    </tr>
</table>

<table>
    <material>pine</material>
    <width>10</width>
</table>
```

Combining these documents would lead to a conflict, because the `<table>` element means something different in each.

To solve this, we can add a namespace, which is usually declared in the format

```xml
<h:table xmlns:h="http://www.w3.org/TR/html4/">
```

Thus, one solution to our previous issue could look like the following:

```xml
<h:table xmlns:h="http://www.w3.org/TR/html4/">
    <h:tr>
        <h:td>Column1</h:td>
    </h:tr>
</h:table>

<f:table xmlns:f="https://www.w3schools.com/furniture">
    <f:material>pine</f:material>
    <f:width>10</f:width>
</f:table>
```

#### Extracting namespaces

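We can obtain a document's namespace from the tag of its root element, using the following syntax (`path_to_furniture_xml` is a placeholder for the path to your file):

```python
furniture_tree = et.parse(path_to_furniture_xml)
# The root element's tag has the form "{namespace}tagname"
namespace = re.match(r"\{.*\}", furniture_tree.getroot().tag).group()
namespace
```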

Note:

> The namespace is often a URL. The parser does not retrieve or use anything from that URL; it serves only as a unique identifier, although it usually points to a page containing additional information about the XML's namespace
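
Once a document uses namespaces, queries must use them too. The following is a minimal sketch using the furniture example from above: lxml's `xpath()` and `find()` both accept a `namespaces` mapping from a prefix to the namespace URI. The prefix `furn` is chosen here for illustration and does not need to match the prefix used in the document.

```python
import lxml.etree as et

furniture_xml = b"""<f:table xmlns:f="https://www.w3schools.com/furniture">
    <f:material>pine</f:material>
    <f:width>10</f:width>
</f:table>"""

root = et.fromstring(furniture_xml)

# Tags are stored in "{namespace}name" form
root_tag = root.tag

# Map a prefix of our choosing to the namespace URI
ns = {"furn": "https://www.w3schools.com/furniture"}

material = root.xpath("furn:material/text()", namespaces=ns)
width = root.find("furn:width", namespaces=ns).text

# Without the namespace, the element is not found
unqualified = root.find("width")

print(root_tag, material, width, unqualified)
```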

References:
- https://www.w3schools.com/xml/xml_namespaces.asp

# Searching for Elements and XPath Queries


XPath:

- Stands for XML Path Language
- Uses "path like" syntax to identify and navigate nodes in an XML document
- Contains over 200 built-in functions
- A major element in the XSLT standard
- A W3C recommendation

Reference:
- [W3 XPath](https://www.w3schools.com/xml/xpath_intro.asp)


### XPath Syntax

```xml
<?xml version='1.0' ?>
<company>
    <employee id="111">
        <name>John Doe</name>
        <job>Software Analyst</job>
        <salary>2000</salary>
    </employee>
    <employee id="222">
        <name>Jane Fletcher</name>
        <job>Designer</job>
        <salary>2500</salary>
    </employee>       
    <employee>
        <name>Steven Smith</name>
        <job>Cantelope Eater</job>
        <salary>25000</salary>
    </employee>  
</company>
```

| Expression   | Description                                           |
|--------------|-------------------------------------------------------|
| `nodename`   | Get nodes with name "nodename"                        |
| `/`          | Get nodes from the root node                          |
| `//`         | Get nodes **FROM THE ENTIRE DOCUMENT** wherever they are |
| `.`          | Get the current node                                  |
| `..`         | Get the parent of the current node                    |
| `@`          | Get an attribute                                      |



In [27]:

# Reload the xml
tree_company = et.parse(os.path.join('data', 'company_sml.xml'))
root_company = tree_company.getroot()


In [28]:

# Get all employee nodes that are a child of our specified element
print(tree_company.xpath('employee'))


[<Element employee at 0x7fe1c38f3ac0>, <Element employee at 0x7fe1c38f3f40>, <Element employee at 0x7fe1c38f3e40>]


In [29]:

# Get the root element
print(tree_company.xpath('/company'))


[<Element company at 0x7fe1c38f3a00>]


In [30]:

# Get name elements that are children of an employee element
print(tree_company.xpath('employee/name'))


[<Element name at 0x7fe1c38f48c0>, <Element name at 0x7fe1c38f4980>, <Element name at 0x7fe1c38f49c0>]


In [31]:

# Get the current node
print(tree_company.xpath('.'))


[<Element company at 0x7fe1c38f3a00>]
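
The `@` and `..` expressions from the table are not used in the cells above, so here is a small sketch of both. It rebuilds a shortened copy of the example document from a string (the `job` elements are left out for brevity) so that it can be run on its own.

```python
import lxml.etree as et

company_xml = b"""<company>
    <employee id="111"><name>John Doe</name><salary>2000</salary></employee>
    <employee id="222"><name>Jane Fletcher</name><salary>2500</salary></employee>
    <employee><name>Steven Smith</name><salary>25000</salary></employee>
</company>"""
root = et.fromstring(company_xml)

# @ selects attributes; the result is a list of attribute values (strings)
ids = root.xpath("//employee/@id")

# text() selects the text content instead of the element itself
names = root.xpath("//employee/name/text()")

# .. steps from a matched node up to its parent
jane = root.xpath("//name[text()='Jane Fletcher']/..")[0]

print(ids, names, jane.tag, jane.get("id"))
```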


### Predicates

- Denoted by `[ ]` square brackets
- Allow us to identify nodes that contain specific values

| Expression             | Description                                                     |
|------------------------|-----------------------------------------------------------------|
| `/node[n]`             | Get the n-th element (XPath positions start at 1, not 0)        |
| `/node[last()]`        | Get the last element                                            |
| `/node[last()-1]`      | Get second to last element                                      |
| `//node[@attr]`        | Get elements that have an attribute named "attr"                |
| `//node[@attr="what"]` | Get elements that have an attribute named "attr" that is "what" |


In [32]:

# Get the first employee element
print(tree_company.xpath('employee[1]'))


[<Element employee at 0x7fe1c38f3cc0>]


In [33]:

# Get the last employee element
print(tree_company.xpath('employee[last()]'))


[<Element employee at 0x7fe1c361fe80>]


In [34]:

# Get the employee elements that have an "id" attribute
print(tree_company.xpath('//employee[@id]'))


[<Element employee at 0x7fe1c38e9540>, <Element employee at 0x7fe1c38e9080>]


In [35]:

# Get the employee element whose "id" attribute is "222"
print(tree_company.xpath('//employee[@id="222"]'))


[<Element employee at 0x7fe1c38eeb40>]
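
Predicates can also test the content of child elements, not just position and attributes. The sketch below is an illustration on a shortened copy of the example document; the salary threshold is arbitrary.

```python
import lxml.etree as et

company_xml = b"""<company>
    <employee id="111"><name>John Doe</name><salary>2000</salary></employee>
    <employee id="222"><name>Jane Fletcher</name><salary>2500</salary></employee>
    <employee><name>Steven Smith</name><salary>25000</salary></employee>
</company>"""
root = et.fromstring(company_xml)

# Compare a child's value as a number
high_earners = root.xpath("//employee[salary > 2400]/name/text()")

# Position counted from the end
second_to_last = root.xpath("/company/employee[last()-1]/name/text()")

# Negate a predicate with not()
without_id = root.xpath("//employee[not(@id)]/name/text()")

print(high_earners, second_to_last, without_id)
```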


### Wildcards

> Wildcards allow us to do partial searches for nodes or attributes in our document

| Expression   | Description                                    |
|--------------|------------------------------------------------|
| `*`          | Match any element node                         |
| `@*`         | Match any attribute node                       |
| `node()`     | Match any node of any kind                     |
| `/node/*`    | Get all child elements of "node"               |
| `//*`        | Get all elements in the document               |
| `//node[@*]` | Get all node elements which have any attribute |


In [23]:

# Wildcards

# Get any child elements of the company node
print(tree_company.xpath('/company/*'))

# Get any attributes of the first employee element
print(tree_company.xpath('employee[1]/@*'))

# Get all elements in the document
for ii in tree_company.xpath('//*'):
    print(ii)


[<Element employee at 0x7f8b620dc700>, <Element employee at 0x7f8b620dce40>, <Element employee at 0x7f8b620dcdc0>]
['111']
<Element company at 0x7f8b620d7280>
<Element employee at 0x7f8b620dce40>
<Element name at 0x7f8b620dcd80>
<Element job at 0x7f8b620dcf00>
<Element salary at 0x7f8b620dcf80>
<Element employee at 0x7f8b620dce80>
<Element name at 0x7f8b620dcf40>
<Element job at 0x7f8b620df040>
<Element salary at 0x7f8b620df080>
<Element employee at 0x7f8b620dcec0>
<Element name at 0x7f8b620df0c0>
<Element job at 0x7f8b620df100>
<Element salary at 0x7f8b620df140>


> We can also search for elements using the `find`, `findall`, and `iterfind` methods, which accept a simplified subset of XPath

In [24]:

#### FIND ####: Search for name (only on the first level)
print(root_company.find("name"))


#### FIND ####: Search for name anywhere in the tree (.//)
print(root_company.find(".//name").text)


#### FINDALL #: Find all instances of name anywhere in the tree
print(root_company.findall(".//name"))


#### ITERFIND : Generator to find instances of name anywhere in the tree
for jj in root_company.iterfind('.//name'):
    print(jj.text)

    
# Same thing but put it in a list
[jj.text for jj in root_company.iterfind('.//name')]


None
John Doe
[<Element name at 0x7f8b620dfc00>, <Element name at 0x7f8b620dfd00>, <Element name at 0x7f8b620dfd40>]
John Doe
Jane Fletcher
Steven Smith


['John Doe', 'Jane Fletcher', 'Steven Smith']
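
These methods combine naturally to turn a document into ordinary Python data. As a sketch (on a shortened copy of the example document, with variable names chosen for illustration), each `employee` element can be converted into a dictionary keyed by the tags of its children:

```python
import lxml.etree as et

company_xml = b"""<company>
    <employee id="111"><name>John Doe</name><job>Software Analyst</job></employee>
    <employee id="222"><name>Jane Fletcher</name><job>Designer</job></employee>
</company>"""
root = et.fromstring(company_xml)

# Iterating over an element yields its child elements
records = [{child.tag: child.text for child in employee}
           for employee in root.findall("employee")]

# findtext() returns the text of the first match, or a default if none
missing_salary = root.find("employee").findtext("salary", default="unknown")

print(records)
print(missing_salary)
```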

# Building/Writing an XML 

#### Methods for Writing XML
<table align="center">
<tr><td style="text-align:left"><b>Method</b></td><td><b>Description</b></td></tr>
<tr><td style="text-align:left"><code>et.Element(tag)</code></td><td>Creates an element with the specified tag. Returns an element object.</td></tr>
<tr><td style="text-align:left"><code>et.SubElement(element, tag)</code></td><td>Creates a child element under the specified element.</td></tr>
<tr><td style="text-align:left"><code>Element.set(key, value)</code></td><td>Sets the attributes of an element.</td></tr>
<tr><td style="text-align:left"><code>et.ElementTree(root)</code></td><td>Returns an ElementTree object.</td></tr>
<tr><td style="text-align:left"><code>ElementTree.write(file)</code></td><td>Writes an ElementTree object to a file.</td></tr>
</table>

In [38]:

# Setup the root node
multnomah_library = et.Element('library')
doc = et.ElementTree(multnomah_library)

#-> Setup the portland branch
portland_branch = et.SubElement(multnomah_library, "portland_branch",
                                zipcode="97239")

#->-> Add our book category
horror_books = et.SubElement(portland_branch, "horror")

#->->-> Add the first book
h_book1 = et.SubElement(horror_books, "book")

#->->->-> Add the elements to our book
auth = et.SubElement(h_book1, "author", text="Jeff VanderMeer")
title = et.SubElement(h_book1, "title", text="Annihilation")
isbn = et.SubElement(h_book1, "ISBN", text="0374104093")
pub = et.SubElement(h_book1, "Publisher", text="FSG Originals")

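'''
Note: keyword arguments passed to SubElement (such as text=...) are
stored as XML attributes, not as element content, so the elements
above are written as e.g. <Publisher text="FSG Originals"/>.

To store a value as element content instead, omit the keyword and
assign to .text (the tag can be changed the same way via .tag):

pub.tag = "Publisher"
pub.text = "FSG Originals"
'''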

# Save to XML file
with open(os.path.join('data', 'output.xml'), 'wb') as f:
    doc.write(f, pretty_print=True, encoding='utf-8')



In [37]:
print(et.tostring(doc, pretty_print=True).decode())

<library>
  <portland_branch zipcode="97239">
    <horror>
      <book>
        <author text="Jeff VanderMeer"/>
        <title text="Annihilation"/>
        <ISBN text="0374104093"/>
        <Publisher text="FSG Originals"/>
      </book>
    </horror>
  </portland_branch>
</library>
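
Note that in the output above each value appears as an attribute named `text`, because keyword arguments to `SubElement` become attributes. If the values should instead be the content of the elements, a minimal sketch looks like this (the `format` attribute is a made-up example of something that does belong in an attribute):

```python
import lxml.etree as et

book = et.Element("book")

# Create each child, then assign its content through .text
author = et.SubElement(book, "author")
author.text = "Jeff VanderMeer"

title = et.SubElement(book, "title")
title.text = "Annihilation"

# Attributes are set with .set() (or as keyword arguments at creation)
book.set("format", "paperback")

xml_text = et.tostring(book, pretty_print=True).decode()
print(xml_text)
```

With this layout, `author.text` and XPath queries such as `//author/text()` return the author's name directly.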



## xmltodict

- Another option we have for creating Python objects from an XML document is `xmltodict`
- This tool allows us to open an XML document and convert it to a nested Python dictionary

In [39]:

import xmltodict

with open(os.path.join('data', 'output.xml')) as fd:
    doc_dict = xmltodict.parse(fd.read())

# Get the ISBN of the first book
print(doc_dict['library']['portland_branch']['horror']['book']['ISBN'])

OrderedDict([('@text', '0374104093')])
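
As the `@text` key above suggests, `xmltodict` uses naming conventions to keep attributes and content apart. The sketch below illustrates them on a small made-up document (the `id` and `lang` attributes are hypothetical): attribute names are prefixed with `@`, element content that sits next to attributes is stored under `#text`, and repeated child elements are collected into a list.

```python
import xmltodict

xml_text = """<library>
  <book id="b1"><title lang="en">Annihilation</title></book>
  <book id="b2"><title lang="en">The Ruins</title></book>
</library>"""

doc = xmltodict.parse(xml_text)

# Two <book> elements under the same parent become a list
books = doc["library"]["book"]

first_id = books[0]["@id"]                # attribute
first_title = books[0]["title"]["#text"]  # element content
titles = [b["title"]["#text"] for b in books]

print(first_id, first_title, titles)
```

One consequence of the list convention is that a parent with a single child yields a dictionary while a parent with several yields a list, so code that reads the result may need to handle both cases.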


In [40]:
doc_dict

OrderedDict([('library',
              OrderedDict([('portland_branch',
                            OrderedDict([('@zipcode', '97239'),
                                         ('horror',
                                          OrderedDict([('book',
                                                        OrderedDict([('author',
                                                                      OrderedDict([('@text',
                                                                                    'Scott Smith')])),
                                                                     ('title',
                                                                      OrderedDict([('@text',
                                                                                    'The Ruins')])),
                                                                     ('ISBN',
                                                                      OrderedDict([('@text',
                            

#### Drawbacks to XML?

- More difficult to parse than CSV
- Verbose syntax means larger files

## XML and Bioinformatics
#### SBML (Systems Biology Markup Language)
- Used to communicate models of biological processes (cell-signaling pathways, regulatory networks). Models can represent:
    - Chemical Equations
    - Cellular Components: nucleus, cytoplasm, etc.
    - Species: the chemical entities that take part in reactions (metabolites, proteins, ions, etc.)
- [http://sbml.org](http://sbml.org)
- [http://www.ebi.ac.uk/biomodels-main/](http://www.ebi.ac.uk/biomodels-main/)

#### KGML (KEGG Markup Language)
- A format for KEGG pathway maps
    - [http://www.kegg.jp/kegg/xml/](http://www.kegg.jp/kegg/xml/)
    
#### PDBML (Protein Databank Markup Language)
- Describes 3D protein structure
    - relative atomic coordinates
    - secondary structure assignment
    - atomic connectivity
- [http://www.rcsb.org/pdb/home/home.do](http://www.rcsb.org/pdb/home/home.do)
- [http://pdbml.pdb.org/](http://pdbml.pdb.org/)

#### NeuroML (Neuro Markup Language)

The main aims of the NeuroML initiative are to:

- Create specifications for a language (in XML) to describe neuronal systems including: 
    - biophysics 
    - anatomy  
    - network architecture 

- Facilitate the exchange of complex neuronal network models between researchers, allowing for greater transparency and accessibility of models

- Promote software tools supporting NeuroML and to support the development of new software and databases

- Encourage researchers who create models within the scope of NeuroML to exchange and publish their models in this format.

[Resource for NeuroML](https://en.wikipedia.org/wiki/NeuroML)

## HTML

Hypertext Markup Language (HTML) is the basis for most pages that are served on the internet. HTML is very similar to XML (Extensible Markup Language), with the caveat that it also contains presentation semantics: elements and attributes that specify how information is meant to be displayed or arranged on a screen. The nested format is almost exactly like that of an XML document, so we can extract information from an HTML page in much the same way as from an XML document. One practical difference is that real-world HTML is often not well-formed XML (for example, tags may be left unclosed), so it should be read with an HTML parser rather than a strict XML parser. Below is a simple example of an HTML document:

    <html>
    <head>
        <title>Hey look, a webpage!</title>
    </head>
    <body>
        <p>webpage goes here</p>
    </body>
    </html>
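
The difference between HTML and strict XML shows up as soon as a page contains an unclosed tag. The sketch below uses a made-up snippet with a bare `<br>`: the XML parser rejects it, while lxml's `HTMLParser` (used in the next section) reads it without complaint.

```python
from io import StringIO
from lxml import etree

# Valid HTML, but not well-formed XML: <br> is never closed
snippet = "<html><body><p>first line<br>second line</p></body></html>"

try:
    etree.fromstring(snippet)
    parsed_as_xml = True
except etree.XMLSyntaxError:
    parsed_as_xml = False

# The HTML parser knows that <br> has no closing tag
tree = etree.parse(StringIO(snippet), etree.HTMLParser())
paragraph_text = tree.xpath("//p//text()")

print(parsed_as_xml, paragraph_text)
```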


## LXML package

We can also use the LXML package to read HTML pages in the tree structure.

Before we get started, it helps to have an idea of some of the ways that HTML arranges documents. Most scrapable HTML data is contained in tables like the one at http://www.bioinformatics.org/sms/iupac.html. HTML tables are arranged in the following format:

    <table>
        <tr>
            <td></td>
            <td></td>
            <td></td>
            ...
        </tr>
        <tr>
            ...
        </tr>
    </table>

This general format specifies table rows (`<tr>`) and table data cells (`<td>`), where each cell within a row corresponds to a different column. The data in the table is contained inside each of the nested `<td></td>` tag pairs.


In [1]:
from lxml import etree
import requests
from io import StringIO # This will help us deal with string inputs

## Get the code from the url
url = "http://www.bioinformatics.org/sms/iupac.html"
html = requests.get(url).text

## Next we have to create a parser that will read the info from the HTML 
## file and tell it what kind of data it will be receiving
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html),parser) ## Note we could also provide the URL directly

We now have the webpage represented as a tree of elements. Each element is iterable (iterating over it yields its children), just like we saw above when working with XML documents. We can do all sorts of things now.

For example, we can iterate through the tree with a for loop:

In [2]:
## Note: here we are only showing two levels of the tree
root = tree.getroot()

for e in root:
    print(e)
    for i in e:
        print('\t' + str(i))

<Element head at 0x7fbd87fe7780>
	<Element meta at 0x7fbd87feca00>
	<Element meta at 0x7fbd87fdf3c0>
	<Element meta at 0x7fbd87fdffa0>
	<Element title at 0x7fbd87feca00>
<Element body at 0x7fbd87fecd70>
	<Element table at 0x7fbd87fecdc0>
	<Element br at 0x7fbd87fdf3c0>
	<Element table at 0x7fbd87feca00>


In [3]:
## The following function will print the entire tree structure
## This function looks in each element node, and if it has 
## child elements it performs the same action on each child
## Note that this is an example of recursion - a function 
## that calls itself.

def parseTree(e,t='\t'):
    for i in e:
        print(str(t) + str(i))
        parseTree(i,t=t + '\t')

parseTree(tree.getroot())

	<Element head at 0x7fbd88b45140>
		<Element meta at 0x7fbd88b451e0>
		<Element meta at 0x7fbd88b45230>
		<Element meta at 0x7fbd88b45280>
		<Element title at 0x7fbd88b451e0>
	<Element body at 0x7fbd87fecd70>
		<Element table at 0x7fbd88b451e0>
			<Element tr at 0x7fbd88b45230>
				<Element td at 0x7fbd88b45370>
					<Element font at 0x7fbd88b45460>
				<Element td at 0x7fbd88b453c0>
					<Element font at 0x7fbd88b45460>
			<Element tr at 0x7fbd88b452d0>
				<Element td at 0x7fbd88b45320>
				<Element td at 0x7fbd88b45460>
			<Element tr at 0x7fbd88b453c0>
				<Element td at 0x7fbd88b45230>
				<Element td at 0x7fbd88b45320>
			<Element tr at 0x7fbd88b45460>
				<Element td at 0x7fbd88b452d0>
				<Element td at 0x7fbd88b45230>
			<Element tr at 0x7fbd88b45320>
				<Element td at 0x7fbd88b453c0>
				<Element td at 0x7fbd88b452d0>
			<Element tr at 0x7fbd88b45230>
				<Element td at 0x7fbd88b45320>
				<Element td at 0x7fbd88b45460>
			<Element tr at 0x7fbd88b452d0>
				<Element td at 0x

The tree object returned by `etree.parse()` has a method called `xpath()`, which allows us to perform queries on the tree structure to identify specific elements within the HTML document. For example, if we want to find all tables within the body of the document we would do the following:

In [4]:
## This will return a list of table elements
tables = tree.xpath('body/table')
tables

[<Element table at 0x7fbd88b35fa0>, <Element table at 0x7fbd87feca00>]

We can also use tag attributes to perform more specific queries. For instance, we know that the table containing amino acid codes has three columns. To extract this table we could do something like:

In [5]:
## This will find all tables with three columns
## Note: the // means it will look anywhere under the current element (root in this case) 
## (i.e. the table could be nested within another element)
amino = tree.xpath("//table[@cols='3']")
amino

[<Element table at 0x7fbd87feca00>]

In [6]:
## We can iterate through this table to get the data
for row in amino[0]:
    for cell in row:
        print(cell.text)

None
None
None
A
Ala
Alanine
C
Cys
Cysteine
D
Asp
Aspartic Acid
E
Glu
Glutamic Acid
F
Phe
Phenylalanine
G
Gly
Glycine
H
His
Histidine
I
Ile
Isoleucine
K
Lys
Lysine
L
Leu
Leucine
M
Met
Methionine
N
Asn
Asparagine
P
Pro
Proline
Q
Gln
Glutamine
R
Arg
Arginine
S
Ser
Serine
T
Thr
Threonine
V
Val
Valine
W
Trp
Tryptophan
Y
Tyr
Tyrosine


Note that the column headers are missing above. This is because that text is not directly within the table cells; it is nested within a `<font>` tag, which allows additional formatting of the text. The code below will solve this problem. The XPath `text()` function will extract text, and using the `//` means that it will find text anywhere under the `<td>` tag.

In [7]:
for i in tree.xpath("//table[@cols='3']/tr/td//text()"):
    print(i)

IUPAC amino acid code
Three letter code
Amino acid
A
Ala
Alanine
C
Cys
Cysteine
D
Asp
Aspartic Acid
E
Glu
Glutamic Acid
F
Phe
Phenylalanine
G
Gly
Glycine
H
His
Histidine
I
Ile
Isoleucine
K
Lys
Lysine
L
Leu
Leucine
M
Met
Methionine
N
Asn
Asparagine
P
Pro
Proline
Q
Gln
Glutamine
R
Arg
Arginine
S
Ser
Serine
T
Thr
Threonine
V
Val
Valine
W
Trp
Tryptophan
Y
Tyr
Tyrosine


We can now start using for loops to write more interesting queries and convert the entire table to a data structure that we can use more easily.

One thing to keep in mind is that once you have focused on a particular part of the tree, your position is defined relative to that element. However, the object still contains the full information about the whole HTML document's tree. You are able to start a query with the absolute path of the full tree with `/` or you are able to use `.` in order to define a query relative to your current position. Here we use the `.` operator to define a path relative to the current element (e.g. the table element stored in `amino[0]`).
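
The difference is easy to overlook, so here is a small sketch on a made-up page with two tables. Starting from the second table, a query beginning with `//` still searches the whole document, while `.//` searches only beneath the current element.

```python
from io import StringIO
from lxml import etree

page = """<html><body>
<table id="first"><tr><td>A</td></tr></table>
<table id="second"><tr><td>B</td></tr><tr><td>C</td></tr></table>
</body></html>"""

tree = etree.parse(StringIO(page), etree.HTMLParser())
second_table = tree.xpath("//table[@id='second']")[0]

# Relative query: only cells inside the second table
relative_cells = second_table.xpath(".//td/text()")

# Absolute query: cells from the entire document, even though
# it is called on the second table element
absolute_cells = second_table.xpath("//td/text()")

print(relative_cells, absolute_cells)
```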

In [8]:
## Remember here we are only interested in the amino acid table
## Use the . to ensure you are searching for rows within that table only
table_list = []
for tr in amino[0].xpath('./tr'):
    table_list.append(tr.xpath('./td//text()'))
table_list

[['IUPAC amino acid code', 'Three letter code', 'Amino acid'],
 ['A', 'Ala', 'Alanine'],
 ['C', 'Cys', 'Cysteine'],
 ['D', 'Asp', 'Aspartic Acid'],
 ['E', 'Glu', 'Glutamic Acid'],
 ['F', 'Phe', 'Phenylalanine'],
 ['G', 'Gly', 'Glycine'],
 ['H', 'His', 'Histidine'],
 ['I', 'Ile', 'Isoleucine'],
 ['K', 'Lys', 'Lysine'],
 ['L', 'Leu', 'Leucine'],
 ['M', 'Met', 'Methionine'],
 ['N', 'Asn', 'Asparagine'],
 ['P', 'Pro', 'Proline'],
 ['Q', 'Gln', 'Glutamine'],
 ['R', 'Arg', 'Arginine'],
 ['S', 'Ser', 'Serine'],
 ['T', 'Thr', 'Threonine'],
 ['V', 'Val', 'Valine'],
 ['W', 'Trp', 'Tryptophan'],
 ['Y', 'Tyr', 'Tyrosine']]
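
A list of lists like this is easy to reshape further with plain Python. As a sketch, using only the first few rows of the output above as sample data, the header row can be split off and used to build one dictionary per row, or a lookup table from the one-letter code to the three-letter code:

```python
# The first rows of the scraped table, copied from the output above
table_list = [
    ['IUPAC amino acid code', 'Three letter code', 'Amino acid'],
    ['A', 'Ala', 'Alanine'],
    ['C', 'Cys', 'Cysteine'],
    ['D', 'Asp', 'Aspartic Acid'],
]

# Separate the header from the data rows
header, *rows = table_list

# One dictionary per row, keyed by column name
records = [dict(zip(header, row)) for row in rows]

# Lookup table: one-letter code -> three-letter code
one_to_three = {row[0]: row[1] for row in rows}

print(records[0])
print(one_to_three)
```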

## Beautiful Soup 

While that was certainly a fun demonstration of how HTML is organized and can be digested for further analysis, manual XPath evaluations can be a tedious process. Beautiful Soup is a package meant to make the process of getting information from web documents much simpler.

In Beautiful Soup, we first import the package in order to create a "soup" object. Here we use the html object that we acquired earlier.

In [9]:
from bs4 import BeautifulSoup as bs
soup = bs(html, "lxml")

From here we can perform all sorts of different manipulations on the data, and Beautiful Soup takes care of many of the details behind the scenes. Let's take a look at a couple of quick examples:

In [10]:
## Find all tables in the document
tables = soup.find_all("table")
tables

[<table border="" cellpadding="2" cellspacing="0" cols="2" width="350">
 <tr>
 <td bgcolor="#B0C4DE"><font color="#000000">IUPAC nucleotide code</font></td>
 <td bgcolor="#B0C4DE"><font color="#000000">Base</font></td>
 </tr>
 <tr>
 <td>A</td>
 <td>Adenine</td>
 </tr>
 <tr>
 <td>C</td>
 <td>Cytosine</td>
 </tr>
 <tr>
 <td>G</td>
 <td>Guanine</td>
 </tr>
 <tr>
 <td>T (or U)</td>
 <td>Thymine (or Uracil)</td>
 </tr>
 <tr>
 <td>R</td>
 <td>A or G</td>
 </tr>
 <tr>
 <td>Y</td>
 <td>C or T</td>
 </tr>
 <tr>
 <td>S</td>
 <td>G or C</td>
 </tr>
 <tr>
 <td>W</td>
 <td>A or T</td>
 </tr>
 <tr>
 <td>K</td>
 <td>G or T</td>
 </tr>
 <tr>
 <td>M</td>
 <td>A or C</td>
 </tr>
 <tr>
 <td>B</td>
 <td>C or G or T</td>
 </tr>
 <tr>
 <td>D</td>
 <td>A or G or T</td>
 </tr>
 <tr>
 <td>H</td>
 <td>A or C or T</td>
 </tr>
 <tr>
 <td>V</td>
 <td>A or C or G</td>
 </tr>
 <tr>
 <td>N</td>
 <td>any base</td>
 </tr>
 <tr>
 <td>. or -</td>
 <td>gap</td>
 </tr>
 </table>,
 <table border="" cellpadding="2" cellspaci

In [11]:
## Find the first table that matches some criteria
table = soup.find("table",{"width":"350","cols":"3"})
table

<table border="" cellpadding="2" cellspacing="0" cols="3" width="350">
<tr>
<td bgcolor="#B0C4DE"><font color="#000000">IUPAC amino acid code</font></td>
<td bgcolor="#B0C4DE"><font color="#000000">Three letter code</font></td>
<td bgcolor="#B0C4DE"><font color="#000000">Amino acid</font></td>
</tr>
<tr>
<td>A</td>
<td>Ala</td>
<td>Alanine</td>
</tr>
<tr>
<td>C</td>
<td>Cys</td>
<td>Cysteine</td>
</tr>
<tr>
<td>D</td>
<td>Asp</td>
<td>Aspartic Acid</td>
</tr>
<tr>
<td>E</td>
<td>Glu</td>
<td>Glutamic Acid</td>
</tr>
<tr>
<td>F</td>
<td>Phe</td>
<td>Phenylalanine</td>
</tr>
<tr>
<td>G</td>
<td>Gly</td>
<td>Glycine</td>
</tr>
<tr>
<td>H</td>
<td>His</td>
<td>Histidine</td>
</tr>
<tr>
<td>I</td>
<td>Ile</td>
<td>Isoleucine</td>
</tr>
<tr>
<td>K</td>
<td>Lys</td>
<td>Lysine</td>
</tr>
<tr>
<td>L</td>
<td>Leu</td>
<td>Leucine</td>
</tr>
<tr>
<td>M</td>
<td>Met</td>
<td>Methionine</td>
</tr>
<tr>
<td>N</td>
<td>Asn</td>
<td>Asparagine</td>
</tr>
<tr>
<td>P</td>
<td>Pro</td>
<td>Proline</td>


In [12]:
## Iterate through the table and create a list of lists
table_list2 = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    newCells = []
    for c in cells:
        newCells.append(c.get_text())
    table_list2.append(newCells)
table_list2

[['IUPAC amino acid code', 'Three letter code', 'Amino acid'],
 ['A', 'Ala', 'Alanine'],
 ['C', 'Cys', 'Cysteine'],
 ['D', 'Asp', 'Aspartic Acid'],
 ['E', 'Glu', 'Glutamic Acid'],
 ['F', 'Phe', 'Phenylalanine'],
 ['G', 'Gly', 'Glycine'],
 ['H', 'His', 'Histidine'],
 ['I', 'Ile', 'Isoleucine'],
 ['K', 'Lys', 'Lysine'],
 ['L', 'Leu', 'Leucine'],
 ['M', 'Met', 'Methionine'],
 ['N', 'Asn', 'Asparagine'],
 ['P', 'Pro', 'Proline'],
 ['Q', 'Gln', 'Glutamine'],
 ['R', 'Arg', 'Arginine'],
 ['S', 'Ser', 'Serine'],
 ['T', 'Thr', 'Threonine'],
 ['V', 'Val', 'Valine'],
 ['W', 'Trp', 'Tryptophan'],
 ['Y', 'Tyr', 'Tyrosine']]
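
Beautiful Soup also supports CSS selectors through its `select()` method, which can shorten the loop above. The sketch below runs on a small stand-in for the amino acid table (copied from the markup shown earlier, with most attributes removed) and uses Python's built-in `html.parser` so that it has no dependency on `lxml`; both of those choices are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup

html_snippet = """<table cols="3">
<tr><td><font>IUPAC amino acid code</font></td><td><font>Three letter code</font></td><td><font>Amino acid</font></td></tr>
<tr><td>A</td><td>Ala</td><td>Alanine</td></tr>
<tr><td>C</td><td>Cys</td><td>Cysteine</td></tr>
</table>"""

snippet_soup = BeautifulSoup(html_snippet, "html.parser")

# 'table[cols="3"] tr' means: every <tr> inside a <table> whose cols attribute is "3"
table_rows = [[td.get_text(strip=True) for td in tr.select("td")]
              for tr in snippet_soup.select('table[cols="3"] tr')]

print(table_rows)
```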

## The Developer's Console

Both Chrome and Firefox are equipped with a developer's console, meant for debugging code while writing websites. This console can also be used to see what elements your computer is interfacing with while you surf the web. 

To open the developer's console in Firefox, press Ctrl+Shift+K on Windows or Cmd+Opt+K on macOS. The Network tab will allow you to see what information is being sent when, while the Inspector tab allows you to hover over code and see what element of the page it represents. 

Chrome's developer console can be accessed with Ctrl+Shift+J on Windows or Cmd+Opt+J on macOS. While the tabs are named slightly differently, the functions are essentially the same. Notably, Chrome provides native support for web scraping, though the data it gives are usually oriented more toward the organization of entire sites and less toward acquiring data from an individual page.

If you plan on getting data from the web, this is an invaluable tool that will save you a lot of time finding out where data is stored.

## A Word On APIs And robots.txt

Before scraping a site, it is worth taking a couple of things into account in order to make sure that you are a good citizen of the web. The robots.txt file located in the root directory of most websites tells automated clients which paths they may and may not request. If you are scraping a large amount of data, it is good practice to stay within the paths permitted by the "Allow:" and "Disallow:" rules in robots.txt.
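
Python's standard library can check a URL against robots.txt rules for you. The sketch below feeds `urllib.robotparser` a made-up set of rules directly, so it runs without network access; the rules and the `example.org` URLs are hypothetical. For a real site you would call `set_url()` with the address of its robots.txt file and then `read()` instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content
rules = """User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

robots = RobotFileParser()
robots.parse(rules)

# can_fetch(user_agent, url) reports whether the rules permit the request
private_allowed = robots.can_fetch("*", "https://example.org/private/data.html")
public_allowed = robots.can_fetch("*", "https://example.org/public/page.html")

print(private_allowed, public_allowed)
```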

Many sites also provide an Application Programming Interface (API) that allows you to acquire information directly without scraping web data from the HTML interface, saving both you and the site manager time and money. If an API is available, it is almost always advisable to make use of it.

## In-Class Exercises

In [None]:
'''
Exercise 1.

Using our uniprot.xml document,

1. Parse the document
2. Get the document's namespace
3. Get the fullName of the first entry in our document
'''


In [None]:
## Exercise 2.
## Using either lxml or BeautifulSoup, scrape the values from the first 
## table at the URL below, which contains nucleotides and their corresponding name
## Create a dictionary from these values where the nucleotide code is the key.
## "http://www.bioinformatics.org/sms/iupac.html"



## References

- <u>Python Essential Reference</u>, David Beazley, 4th Edition, Addison‐Wesley (2008)
- <u>Python for Bioinformatics</u>, Sebastian Bassi, CRC Press (2010)
- [http://en.wikipedia.org/wiki/XML](http://en.wikipedia.org/wiki/XML)
- [http://docs.python.org/](http://docs.python.org/)
- [https://docs.python.org/2/library/xml.etree.elementtree.html](https://docs.python.org/2/library/xml.etree.elementtree.html)
- [LXML HTML Xpath Tutorial](http://lxml.de/parsing.html)
- [BeautifulSoup Documentation](http://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [XPath Syntax Guide](https://www.w3schools.com/xml/xpath_syntax.asp)

#### Last Updated: 03-Oct-2022