# BMI565: Bioinformatics Programming & Scripting

#### (C) Michael Mooney (mooneymi@ohsu.edu)

## Week 3: XML and Python

1. XML Overview
    - XML Format
2. The Python ElementTree Class
    - Reading XML
    - Writing XML
3. XML and Bioinformatics

#### Requirements

- Python 2.7
- `xml.etree.ElementTree` module
- Data Files
    - `./data/book.xml`
    - `./data/SHH.xml`
- Miscellaneous Files
    - `./images/book_tree.jpg`

## XML Overview

<b>XML</b> stands for E<u>x</u>tensible <u>M</u>arkup <u>L</u>anguage, a set of rules for encoding documents in a machine-readable format. In bioinformatics, XML is commonly used to share heterogeneous data, in contrast to delimited files, where every record (row) contains the same data elements.

Development of XML began in 1996 under the World Wide Web Consortium (W3C), and XML 1.0 became a W3C Recommendation in 1998.

#### XML Design Goals:
1. XML shall be straightforwardly usable over the Internet
2. XML shall support a wide variety of applications
3. XML shall be compatible with Standard Generalized Markup Language (SGML)
4. It shall be easy to write programs that process XML documents
5. The number of optional features in XML is to be kept to the absolute minimum
6. XML documents should be human-legible and reasonably clear
7. The XML design should be prepared quickly
8. The design of XML shall be formal and concise
9. XML documents shall be easy to create
10. Terseness in XML markup is of minimal importance

#### Why can't we use CSV formats?
We usually can, but:

1. CSV files are not always human readable (other documentation is often necessary to identify data elements)
2. Inconsistencies are more likely
3. CSV files don't easily support multiple levels of data
4. CSV files don't easily support additional details such as formatting or metadata (experimental protocols, etc.)


#### UniProt Example: Sonic Hedgehog Protein

[http://www.uniprot.org/uniprot/Q15465.xml](http://www.uniprot.org/uniprot/Q15465.xml)

I've provided this file in the course materials, saved as `SHH.xml`.

### XML Format

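The first few lines of an XML document typically give the XML version and encoding, followed by the opening tag of the root element, which can declare namespaces and the location of the schema describing the document structure:

#### Version
    <?xml version='1.0' encoding='UTF-8'?>
    
#### Root Element (Namespace and Schema Declarations)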
    <uniprot xmlns="http://uniprot.org/uniprot" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">

#### XML Document Body

The body of an XML document contains labeled data elements. Data elements can be nested to show relationships. Data labels are called "tags". A start tag can also carry attributes, name/value pairs whose values are always strings, that provide additional information about the data.
    
    <parent_tag>
        <child_tag attribute1="value1" attribute2="value2">data</child_tag>
    </parent_tag>

Whether to provide additional information as attributes or as additional data elements is subjective:

    <contact birthdate="1-1-1980">
        <name>John Smith</name>
    </contact>
    
    <contact>
        <name>John Smith</name>
        <birthdate>1-1-1980</birthdate>
    </contact>
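
The choice between the two encodings changes how the value is read back in Python. The following is a minimal sketch, assuming the two `contact` records above are parsed from strings with `et.fromstring()`; the variable names are only illustrative. It is written to run under Python 2.7 as well as Python 3.

```python
import xml.etree.ElementTree as et

# The same contact encoded two ways (illustrative data from the text above)
as_attribute = et.fromstring(
    '<contact birthdate="1-1-1980"><name>John Smith</name></contact>'
)
as_element = et.fromstring(
    '<contact><name>John Smith</name><birthdate>1-1-1980</birthdate></contact>'
)

# An attribute value is read from the element itself with get()
birthdate_a = as_attribute.get("birthdate")

# A child element's value is read from the text of that child
birthdate_b = as_element.find("birthdate").text

print(birthdate_a)
print(birthdate_b)
```

Either way the value comes back as a string, so any conversion (to a date, for example) is left to the program reading the file.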

#### DTD and XML Schema

- Document Type Definitions (DTD) and XML Schemas are two ways of describing the structure and content of an XML document
- XML Schemas (a.k.a. XML Schema Definitions or XSDs) were designed to improve upon the shortcomings of DTDs
    - data type support
    - namespace aware
- Example: the UniProt XSD - [http://www.uniprot.org/support/docs/uniprot.xsd](http://www.uniprot.org/support/docs/uniprot.xsd)

## ElementTree
### Reading XML

There are two strategies for reading an XML document:

1. Document Object Model
    - Read the entire file, analyze relationships between elements, and build a tree structure which can be navigated/searched
    - Uses the innate organization of the data
    - Examples: `xml.dom.minidom` and `xml.etree.ElementTree` Python modules
2. Event Driven Parsers (SAX or Simple API for XML)
    - Read the XML file and report events, such as the start and end of an element
    - Uses less memory, no tree construction
    - Examples: `xml.sax` and the `iterparse()` function in `xml.etree.ElementTree`
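
The two strategies can be compared side by side on a small document. This is only a sketch: the `ChapterCounter` class and the inline `BOOK_XML` string are made up for illustration, and the same count is computed once from a full tree and once from parser events. It should run under Python 2.7 and Python 3.

```python
import xml.sax
import xml.etree.ElementTree as et

# Small inline document (illustrative stand-in for a file on disk)
BOOK_XML = b"""<book>
    <title>Ender's Game</title>
    <chapter>Third</chapter>
    <chapter>Peter</chapter>
</book>"""

# Strategy 1: build the whole tree in memory, then search it
root = et.fromstring(BOOK_XML)
tree_count = len(root.findall("chapter"))


# Strategy 2: react to events while the parser streams through the document
class ChapterCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts <chapter> start tags."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.count = 0

    def startElement(self, name, attrs):
        # Called once for every start tag; no tree is ever built
        if name == "chapter":
            self.count += 1


handler = ChapterCounter()
xml.sax.parseString(BOOK_XML, handler)

print(tree_count)
print(handler.count)
```

The tree version is shorter and lets you revisit elements in any order; the event version only ever sees one tag at a time, which is what keeps its memory use low.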

#### A Simple Example

    <book>
        <title>Nineteen Eighty-Four</title>
        <author>George Orwell</author>
        <character>Winston Smith</character>
        <character>Julia</character>
    </book>

    import xml.etree.ElementTree as et
    tree = et.parse("1984.xml")

In the example above, `tree` is an ElementTree object containing a tree of the entire XML file. Start by accessing the root of the tree with `getroot()`. Element objects are iterable: iterating over an element yields its direct child elements. Each element object has three main attributes: the tag name `tag`, the text inside the tag `text`, and a dictionary of the tag attributes `attrib`.

    root_element = tree.getroot()
    for element in root_element:
        print element.tag
        print element.text
        print element.attrib

<img src="./images/book_tree.jpg" align="left" width="700" />

#### Another Example: `book.xml`

    <book>
        <title>Ender's Game</title>
        <author>Orson Scott Card</author>
        <chapter>Third</chapter>
        <chapter>Peter</chapter>
        <chapter>Graff</chapter>
        <publication_info>
            <publisher location="New York">Tor Books</publisher>
            <publication_date>1985</publication_date>
        </publication_info>
    </book>

In [1]:
import xml.etree.ElementTree as et
tree = et.parse('./data/book.xml')
root_element = tree.getroot()
root_element

<Element 'book' at 0x103fd35d0>

In [2]:
list(root_element)

[<Element 'title' at 0x103fd3510>,
 <Element 'author' at 0x103fd3550>,
 <Element 'chapter' at 0x103fd3d90>,
 <Element 'chapter' at 0x103fd3dd0>,
 <Element 'chapter' at 0x103fd3e10>,
 <Element 'publication_info' at 0x103fd3e50>]

In [3]:
len(root_element)

6

In [4]:
for element in root_element:
    print element.tag + ":", element.text.strip()

title: Ender's Game
author: Orson Scott Card
chapter: Third
chapter: Peter
chapter: Graff
publication_info: 


In [5]:
root_element[5]

<Element 'publication_info' at 0x103fd3e50>

In [6]:
len(root_element[5])

2

In [7]:
list(root_element[5])

[<Element 'publisher' at 0x103fd3e90>,
 <Element 'publication_date' at 0x103fd3ed0>]

In [8]:
## Each element is iterable, which allows access
## to child elements. Here we check the length of
## each element to get the number of children
for element in root_element:
    if len(element) > 0:
        print element.tag + ":", element.text.strip(), ", ", element.attrib
        for child in element:
            print "\t" + child.tag + ":", child.text.strip(), ", ", child.attrib
    else:
        print element.tag + ":", element.text.strip(), ", ", element.attrib

title: Ender's Game ,  {}
author: Orson Scott Card ,  {}
chapter: Third ,  {}
chapter: Peter ,  {}
chapter: Graff ,  {}
publication_info:  ,  {}
	publisher: Tor Books ,  {'location': 'New York'}
	publication_date: 1985 ,  {}
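
The loop above handles exactly two levels of nesting, and `element.text.strip()` assumes every element has text. A possible generalization is sketched below, assuming a recursive helper (the name `walk` is made up here) and an inline copy of part of `book.xml`. The empty `<out_of_print/>` element is not in the course file; it is added only to show that `text` is `None` for an empty element.

```python
import xml.etree.ElementTree as et

# Inline excerpt of book.xml; <out_of_print/> is a hypothetical addition
BOOK_XML = b"""<book>
    <title>Ender's Game</title>
    <publication_info>
        <publisher location="New York">Tor Books</publisher>
        <publication_date>1985</publication_date>
    </publication_info>
    <out_of_print/>
</book>"""


def walk(element, depth=0):
    """Return (depth, tag, text, attributes) for an element and all descendants."""
    # element.text is None for an empty element such as <out_of_print/>,
    # so fall back to "" before calling strip()
    text = (element.text or "").strip()
    rows = [(depth, element.tag, text, dict(element.attrib))]
    for child in element:
        rows.extend(walk(child, depth + 1))
    return rows


rows = walk(et.fromstring(BOOK_XML))
for depth, tag, text, attrib in rows:
    print("%s%s: %s %s" % ("\t" * depth, tag, text, attrib))
```

Because the function calls itself on each child, it works for any depth of nesting without adding another inner loop.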


#### ElementTree Element Methods

<table align="left">
<tr><td style="text-align:center"><b>Method</b></td><td><b>Description</b></td></tr>
<tr><td style="text-align:center">`Element.iter(tag=None)`</td><td>Creates a tree iterator with the current element as root.<br />If `tag` is specified, only those elements with a tag equal to `tag` are returned by the iterator.</td></tr>
<tr><td style="text-align:center">`Element.find(tag)`</td><td>Returns the first direct child with a tag equal to `tag`, or `None` if there is no match.</td></tr>
<tr><td style="text-align:center">`Element.findall(tag)`</td><td>Returns a list of all direct children with a tag equal to `tag`.</td></tr>
</table>

In [9]:
author = root_element.find("author")
author.text

'Orson Scott Card'

In [10]:
chapters = root_element.findall("chapter")
[c.text for c in chapters]

['Third', 'Peter', 'Graff']
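
A bare tag name passed to `find()` or `findall()` only matches direct children. The sketch below, which uses an inline excerpt of `book.xml` rather than the file itself, shows three ways to reach deeper elements or pull out text: a path, `iter()`, and `findtext()`. The `isbn` tag is deliberately one that does not exist in the document.

```python
import xml.etree.ElementTree as et

# Inline excerpt of book.xml
BOOK_XML = b"""<book>
    <title>Ender's Game</title>
    <chapter>Third</chapter>
    <chapter>Peter</chapter>
    <publication_info>
        <publisher location="New York">Tor Books</publisher>
    </publication_info>
</book>"""
root = et.fromstring(BOOK_XML)

# <publisher> is a grandchild of the root, so a bare tag name finds nothing
direct = root.find("publisher")

# A slash-separated path reaches the grandchild
by_path = root.find("publication_info/publisher")

# iter() visits every descendant, regardless of depth
by_iter = [e.text for e in root.iter("publisher")]

# findtext() returns the text directly; the second argument is the
# value returned when nothing matches
title = root.findtext("title")
missing = root.findtext("isbn", "unknown")

print(direct)
print(by_path.get("location"))
print(by_iter)
print(title)
print(missing)
```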

If the XML file is very large, you may want to use an iterator rather than creating a tree of the entire file all at once. The `iterparse()` function implements an event-driven parser. It returns an iterator of `(event, element)` tuples, where `event` indicates which part of an element was encountered (e.g. the start tag or end tag). By default, only `end` events are returned. Since `iterparse()` still builds the tree in memory as it goes, you can call `Element.clear()` on elements you have finished processing to save memory.

In [11]:
iter_et = et.iterparse('./data/book.xml')
for event, element in iter_et:
    print event
    print element.tag + ":", element.text.strip()

end
title: Ender's Game
end
author: Orson Scott Card
end
chapter: Third
end
chapter: Peter
end
chapter: Graff
end
publisher: Tor Books
end
publication_date: 1985
end
publication_info: 
end
book: 


In [12]:
## Use clear() to clear each element after processing
## including the root element
iter_et = et.iterparse('./data/book.xml', events=['start', 'end'])
event, root = next(iter_et)
for event, element in iter_et:
    if event == "end" and element.tag != root.tag:
        print element.tag + ":", element.text.strip()
        element.clear()

root.clear()

title: Ender's Game
author: Orson Scott Card
chapter: Third
chapter: Peter
chapter: Graff
publisher: Tor Books
publication_date: 1985
publication_info: 
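
For a file made of many repeated records, the same idea is often wrapped in a generator that yields one record at a time. The following is a sketch under stated assumptions: the `<library>` document, the function name `iter_titles`, and the in-memory `io.BytesIO` stand-in for a large file are all made up for illustration. It clears the root after each record, which is one common variation on calling `clear()` on each element.

```python
import io
import xml.etree.ElementTree as et

# Hypothetical stand-in for a large file of repeated <book> records
LIBRARY_XML = b"""<library>
    <book><title>Ender's Game</title></book>
    <book><title>Nineteen Eighty-Four</title></book>
    <book><title>Dune</title></book>
</library>"""


def iter_titles(source):
    """Yield the title of each <book>, discarding records once they are read."""
    context = et.iterparse(source, events=("start", "end"))
    # The first event is the start of the root element
    event, root = next(context)
    for event, element in context:
        # An "end" event means the record and its children are complete
        if event == "end" and element.tag == "book":
            yield element.findtext("title")
            # Detach finished records from the partially built tree
            root.clear()


titles = list(iter_titles(io.BytesIO(LIBRARY_XML)))
print(titles)
```

Passing a filename instead of the `BytesIO` object works the same way, since `iterparse()` accepts either.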


#### XML Namespaces

XML namespaces are used to create uniquely named elements and attributes in an XML document. Since a single document may contain element names from multiple vocabularies, ambiguity can arise when the same element name is used for different entity definitions. The namespace is prepended to tag names to create unique names; ElementTree writes it in curly braces in front of the tag name. In the UniProt example shown above, the attribute `xmlns="http://uniprot.org/uniprot"` on the root `<uniprot>` element specifies the UniProt namespace.

A document's namespace can be extracted from the root element:

In [13]:
## Get the XML document's namespace
import re
shh_tree = et.parse('./data/SHH.xml')
shh_root = shh_tree.getroot()
namespace = re.match(r"{.*}", shh_root.tag).group()
namespace

'{http://uniprot.org/uniprot}'

In [14]:
## Prepend the namespace to any element name
## you want to find
entry = shh_root.find(namespace+'entry')
entry.find(namespace+'name').text

'SHH_HUMAN'

In [15]:
## Alternatively, pass a dictionary that maps a prefix
## of your choice to the namespace URI
ns = {'uniprot':'http://uniprot.org/uniprot'}
entry = shh_root.find('uniprot:entry', ns)
entry.find("uniprot:name", ns).text

'SHH_HUMAN'
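
A common stumbling block is that a search without the namespace returns `None` rather than raising an error. The sketch below uses a tiny inline document as a stand-in for `SHH.xml` (only the namespace URI and the entry name are taken from the examples above; the prefix `up` is an arbitrary choice) to show the failure and both ways of qualifying a name.

```python
import xml.etree.ElementTree as et

# Miniature stand-in for SHH.xml (illustrative, not the real file)
DOC = b"""<uniprot xmlns="http://uniprot.org/uniprot">
    <entry><name>SHH_HUMAN</name></entry>
</uniprot>"""
root = et.fromstring(DOC)

# ElementTree stores every tag as "{namespace-uri}localname"
qualified_tag = root.tag

# Without the namespace the search quietly finds nothing
unqualified = root.find("entry")

# Option 1: spell out the qualified name
explicit = root.find("{http://uniprot.org/uniprot}entry")

# Option 2: map a prefix to the URI and use "prefix:tag" in the path
ns = {"up": "http://uniprot.org/uniprot"}
name = root.find("up:entry/up:name", ns).text

print(qualified_tag)
print(unqualified)
print(name)
```

The prefix in the dictionary only has meaning inside your own search paths; it does not need to match any prefix used in the file.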

### Writing XML

#### Methods for Writing XML
<table align="left">
<tr><td style="text-align:center"><b>Method</b></td><td><b>Description</b></td></tr>
<tr><td style="text-align:center">`et.Element(tag)`</td><td>Creates an element with the specified tag. Returns an element object.</td></tr>
<tr><td style="text-align:center">`et.SubElement(element, tag)`</td><td>Creates a child element under the specified element.</td></tr>
<tr><td style="text-align:center">`Element.set(key, value)`</td><td>Sets the attribute `key` of an element to `value`.</td></tr>
<tr><td style="text-align:center">`et.ElementTree(root)`</td><td>Returns an ElementTree object.</td></tr>
<tr><td style="text-align:center">`ElementTree.write(file)`</td><td>Writes an ElementTree object to a file.</td></tr>
</table>

In [16]:
## Create a simple XML file
root = et.Element("book")
title = et.SubElement(root, "title")
title.text = "Nineteen Eighty-Four"
author = et.SubElement(root, "author")
author.text = "George Orwell"

pub_info = et.SubElement(root, "publication_info")
pub = et.SubElement(pub_info, "publisher")
pub.text = "Secker and Warburg"
pub.set("location", "London")
tree = et.ElementTree(root)
tree.write("1984.xml")

In [17]:
with open('1984.xml') as fh:
    data = fh.read()
data

'<book><title>Nineteen Eighty-Four</title><author>George Orwell</author><publication_info><publisher location="London">Secker and Warburg</publisher></publication_info></book>'
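
As the output shows, `write()` puts the whole document on a single line. One way to get indented output, sketched here as an option rather than the recommended method, is to serialize the tree in memory with `et.tostring()` and re-parse it with `xml.dom.minidom`, whose `toprettyxml()` method adds line breaks and indentation. The example rebuilds part of the tree above so that it is self-contained.

```python
import xml.etree.ElementTree as et
from xml.dom import minidom

# Rebuild part of the tree from the example above
root = et.Element("book")
et.SubElement(root, "title").text = "Nineteen Eighty-Four"
pub_info = et.SubElement(root, "publication_info")
publisher = et.SubElement(pub_info, "publisher")
publisher.text = "Secker and Warburg"
publisher.set("location", "London")

# Serialize in memory instead of writing to a file
raw = et.tostring(root)

# Re-parse with minidom purely to get indented, multi-line output
pretty = minidom.parseString(raw).toprettyxml(indent="    ")
print(pretty)

# Round trip: the serialized text parses back into an equivalent tree
reparsed = et.fromstring(raw)
print(reparsed.find("publication_info/publisher").get("location"))
```

The indentation is for human readers only; a parser treats the compact and the indented versions as the same elements, apart from the extra whitespace text between tags.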

#### Drawbacks to XML?

- More difficult to parse than CSV
- Verbose syntax means larger files

## XML and Bioinformatics
#### SBML (Systems Biology Markup Language)
- Used to communicate models of biological processes (cell-signaling pathways, regulatory networks). Models can represent:
    - Chemical Equations
    - Cellular Components: nucleus, cytoplasm, etc.
    - Species: genomes, proteomes, etc.
- Supported by many applications: [http://sbml.org/SBML_Software_Guide](http://sbml.org/SBML_Software_Guide)
- [http://www.ebi.ac.uk/biomodels-main/](http://www.ebi.ac.uk/biomodels-main/)

#### KGML (KEGG Markup Language)
- A format for KEGG pathway maps
    - [http://www.kegg.jp/kegg/xml/](http://www.kegg.jp/kegg/xml/)
    
#### PDBML (Protein Data Bank Markup Language)
- Describes 3D protein structure
    - relative atomic coordinates
    - secondary structure assignment
    - atomic connectivity
- [http://www.rcsb.org/pdb/home/home.do](http://www.rcsb.org/pdb/home/home.do)
- [http://pdbml.pdb.org/](http://pdbml.pdb.org/)

## In-Class Exercises

In [None]:
## Exercise 1.
## Extract the title and author list for the 
## first reference in SHH.xml


## References

- <u>Python Essential Reference</u>, David Beazley, 4th Edition, Addison‐Wesley (2008)
- <u>Python for Bioinformatics</u>, Sebastian Bassi, CRC Press (2010)
- [http://en.wikipedia.org/wiki/XML](http://en.wikipedia.org/wiki/XML)
- [http://docs.python.org/](http://docs.python.org/)
- [https://docs.python.org/2/library/xml.etree.elementtree.html](https://docs.python.org/2/library/xml.etree.elementtree.html)

#### Last Updated: 21-Sep-2016