# Reading XML with lxml.objectify

The *ElementTree* abstraction tries to find a compromise between an XML way of thinking and a Python way of thinking.  The Python standard library also comes with several other submodules for handling XML that are much closer to the XML way of thinking.  These include `xml.dom` (Document Object Model), `xml.sax` (Simple API for XML), and `xml.parsers.expat`.

SAX and Expat are incremental, stream-oriented parsers for XML, and both can be very fast.  Both are also low-level and require quite a lot of boilerplate.  Expat is always non-validating and can be blazingly fast.  Like ElementTree, the Document Object Model (DOM) builds an entire specialized object with a variety of methods.  However, DOM is a standard created initially for JavaScript, and its method names are verbose, numerous, and feel out of place in Python.  Unless you need to closely match parallel code written in a language such as Java, JavaScript, or C#, I recommend against using the DOM approach.
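To make the contrast concrete, here is a minimal sketch that pulls the verse text out of a tiny document twice, once with `xml.sax` and once with `xml.dom.minidom`.  The `SAMPLE` document and the `VerseHandler` class are made up for illustration; only the standard library calls are real.

```python
import xml.dom.minidom
import xml.sax

# Hypothetical miniature document, loosely modeled on the Quran markup.
SAMPLE = b"<sura><v>First verse.</v><v>Second verse.</v></sura>"


class VerseHandler(xml.sax.ContentHandler):
    """Collect the text of every <v> element as the parser streams by."""

    def __init__(self):
        super().__init__()
        self.verses = []
        self._in_verse = False

    def startElement(self, name, attrs):
        if name == "v":
            self._in_verse = True
            self.verses.append("")

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_verse:
            self.verses[-1] += content

    def endElement(self, name):
        if name == "v":
            self._in_verse = False


# SAX: we supply callbacks and keep track of parser state ourselves.
handler = VerseHandler()
xml.sax.parseString(SAMPLE, handler)
sax_verses = handler.verses

# DOM: the whole tree is built first, then queried with verbose method names.
dom = xml.dom.minidom.parseString(SAMPLE)
dom_verses = [node.firstChild.data for node in dom.getElementsByTagName("v")]

print(sax_verses)
print(dom_verses)
```

Both approaches produce the same list of strings.  The SAX version needs a stateful handler class, while the DOM version relies on Java-style names such as `getElementsByTagName` and `firstChild`.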

If you want to work in a *more Pythonic* style with XML trees, the `lxml` library comes with an API called `objectify`.  This is based on much earlier work by my colleague Uche Ogbuji on Amara bindery and by me even earlier as `gnosis.xml.objectify`.  Neither of those old projects is currently maintained, but `lxml.objectify` is very similar and intuitive to work with.  In general `lxml` is a fast and well-tested XML library, built on `libxml2` and `libxslt`, that provides both the `objectify` interface and an enhanced and faster version of `ElementTree`.

# A More Pythonic Approach

In [1]:
from lxml import etree
from lxml import objectify

Recall that the XML version of the Quran we worked with in the last lesson looks something like this:

```xml
<?xml version="1.0"?>
<!DOCTYPE tstmt SYSTEM "../common/tstmt.dtd">
<tstmt  attr1="Test1" attr2="Test2">
<coverpg>
<title>The Quran</title>
<!-- some elements omitted -->    
</coverpg>
```

A fragment from further into the same document:

```xml
<suracoll>
<sura>
<bktlong>1. The Opening</bktlong>
<bktshort>1. The Opening</bktshort>
<v>In the name of Allah, the Beneficent, the Merciful.</v>
<v>All praise is due to Allah, the Lord of the Worlds.</v>
<v>The Beneficent, the Merciful.</v>
<v>Master of the Day of Judgment.</v>
<!-- continues -->
</sura>
</suracoll>
```

If we wish to use the ElementTree interface (here as `lxml.etree`) to create a list of the verses in Sura 101, we would write code similar to this:

In [2]:
tree = etree.parse('data/quran.xml')
quran = tree.getroot()

suras = quran.find('suracoll').findall('sura')
[elem.text for elem in suras[100] if elem.tag == 'v']

['The terrible calamity!',
 'What is the terrible calamity!',
 'And what will make you comprehend what the terrible calamity is?',
 'The day on which men shall be as scattered moths,',
 'And the mountains shall be as loosened wool.',
 'Then as for him whose measure of good deeds is heavy,',
 'He shall live a pleasant life.',
 'And as for him whose measure of good deeds is light,',
 'His abode shall be the abyss.',
 'And what will make you know what it is?',
 'A burning fire.']

In contrast, the objectify approach treats nested elements as if they were simply attributes of a native Python object holding nested data.  XML attributes are available through the Python attribute `.attrib`, and text through the Python attribute `.text`.  Child elements that occur in parallel are presented as a list-like collection.  Reading in the XML data requires boilerplate similar to ElementTree, but working with it often feels more natural.

In [3]:
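doc = objectify.parse('data/quran.xml')
# Wrap the document root in a new <root> element so that the top-level
# <tstmt> element can itself be reached by attribute access
quran_o = objectify.E.root(doc.getroot())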

quran_o.tstmt.suracoll.sura[100].v[:]

['The terrible calamity!',
 'What is the terrible calamity!',
 'And what will make you comprehend what the terrible calamity is?',
 'The day on which men shall be as scattered moths,',
 'And the mountains shall be as loosened wool.',
 'Then as for him whose measure of good deeds is heavy,',
 'He shall live a pleasant life.',
 'And as for him whose measure of good deeds is light,',
 'His abode shall be the abyss.',
 'And what will make you know what it is?',
 'A burning fire.']
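If you do not have the data file at hand, the same access patterns can be tried on an inline document.  This is only a sketch: the `SAMPLE` content is invented, with element names borrowed from the Quran markup.  Because `objectify.fromstring()` returns the root element itself, no wrapping element is used here and the paths start directly at `tstmt`.

```python
from lxml import objectify

# Hypothetical miniature document; element names mirror the Quran markup.
SAMPLE = b"""
<tstmt attr1="Test1" attr2="Test2">
  <coverpg><title>The Quran</title></coverpg>
  <suracoll>
    <sura><bktshort>A</bktshort><v>a1</v><v>a2</v></sura>
    <sura><bktshort>B</bktshort><v>b1</v><v>b2</v><v>b3</v></sura>
  </suracoll>
</tstmt>
"""

tstmt = objectify.fromstring(SAMPLE)  # the root <tstmt> element

attrs = dict(tstmt.attrib)            # XML attributes, copied into a dict
title = tstmt.coverpg.title.text      # text of a nested child element
# Parallel <v> children of the second <sura>, reached by index then by name
second = [v.text for v in tstmt.suracoll.sura[1].v]

print(attrs)
print(title)
print(second)
```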

If we want to see the XML attributes, they are provided as a dictionary.

In [4]:
print(quran_o.tstmt.attrib)
title = quran_o.tstmt.coverpg.title
print(title, title.attrib) # No attributes

{'attr1': 'Test1', 'attr2': 'Test2'}
The Quran {}


We can follow a different path into the nested elements in the same way.

In [5]:
quran_o.tstmt.suracoll.sura[100].bktlong

'101. The Terrible Calamity'

As a design compromise, omitting the index is a shortcut for selecting the first of several parallel children.

In [6]:
quran_o.tstmt.suracoll.sura[100].v

'The terrible calamity!'
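One consequence of this shortcut is worth seeing in isolation.  In the sketch below, which uses an invented three-verse `sura`, the unindexed name stands for the first matching child, yet `len()` and iteration on it cover all the same-named siblings.  To count every child regardless of tag, `countchildren()` is used instead.

```python
from lxml import objectify

# Hypothetical miniature sura with three parallel <v> children.
sura = objectify.fromstring(
    b"<sura><bktshort>X</bktshort><v>one</v><v>two</v><v>three</v></sura>"
)

first = sura.v                 # no index: the first <v>
same = sura.v[0]               # explicit index: the same element
n_verses = len(sura.v)         # counts the <v> siblings, not children of <v>
all_text = [v.text for v in sura.v]   # iteration walks all <v> siblings
n_children = sura.countchildren()     # every child, whatever its tag

print(first.text, same.text, n_verses, all_text, n_children)
```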

Working with objectify often lets you reach the portions of interest without loops or comprehensions, as in the examples above.  However, these approaches can be combined as needed.  For example, here are the first three verses of each of the last four suras.

In [7]:
[sura.v[:3] for sura in quran_o.tstmt.suracoll.sura[-4:]]

[['Perdition overtake both hands of Abu Lahab, and he will perish.',
  'His wealth and what he earns will not avail him.',
  'He shall soon burn in fire that flames,'],
 ['Say: He, Allah, is One.',
  'Allah is He on Whom all depend.',
  'He begets not, nor is He begotten.'],
 ['Say: I seek refuge in the Lord of the dawn,',
  'From the evil of what He has created,',
  'And from the evil of the utterly dark night when it comes,'],
 ['Say: I seek refuge in the Lord of men,',
  'The King of men,',
  'The God of men,']]

## Serializing an Element

The `.dump` function is generally only for debugging purposes.  To serialize a subelement, use `etree.tostring()` instead, which emits it as a complete XML document, adding namespace declarations or whatever else is needed for the result to stand alone rather than be a fragment.

In [8]:
sura101 = quran_o.tstmt.suracoll.sura[100]
sura_xml = etree.tostring(sura101, pretty_print=True)
print(sura_xml.decode('utf-8'))

<sura xmlns:py="http://codespeak.net/lxml/objectify/pytype" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <bktlong>101. The Terrible Calamity</bktlong>
  <bktshort>101. The Terrible Calamity</bktshort>
  <epigraph>In the Name of Allah, the Beneficent, the Merciful.
</epigraph>
  <v>The terrible calamity!</v>
  <v>What is the terrible calamity!</v>
  <v>And what will make you comprehend what the terrible calamity is?</v>
  <v>The day on which men shall be as scattered moths,</v>
  <v>And the mountains shall be as loosened wool.</v>
  <v>Then as for him whose measure of good deeds is heavy,</v>
  <v>He shall live a pleasant life.</v>
  <v>And as for him whose measure of good deeds is light,</v>
  <v>His abode shall be the abyss.</v>
  <v>And what will make you know what it is?</v>
  <v>A burning fire.</v>
</sura>
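The namespace declarations on `<sura>` come from the wrapping element created with `objectify.E`.  If you would rather not see them, one option is `etree.cleanup_namespaces()`, which drops declarations that nothing in the tree uses.  The sketch below tries this on an invented inline document, and also passes `encoding="unicode"` so that `tostring()` returns a `str` and no `.decode()` is needed.

```python
from lxml import etree, objectify

# Hypothetical miniature document standing in for the Quran file.
SAMPLE = b"<tstmt><sura><v>one</v><v>two</v></sura></tstmt>"

# Wrap the parsed root in a new <root> element, as was done for quran_o.
root = objectify.E.root(objectify.fromstring(SAMPLE))

# encoding="unicode" makes tostring() return str rather than bytes.
before = etree.tostring(root.tstmt.sura, pretty_print=True, encoding="unicode")

# Remove namespace declarations that no element or attribute uses.
etree.cleanup_namespaces(root)
after = etree.tostring(root.tstmt.sura, pretty_print=True, encoding="unicode")

print(before)
print(after)
```

Whether to clean up is a matter of taste.  The extra declarations are harmless to XML parsers, but they add noise when the output is meant for human readers.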



Adding or modifying elements works much as it does in ElementTree.

In [9]:
child = objectify.SubElement(sura101, "external", silly="yes")
child._setText("*** This text is not part of Quran! ***")

In [10]:
sura_xml = etree.tostring(sura101, pretty_print=True)
print(sura_xml.decode('utf-8'))

<sura xmlns:py="http://codespeak.net/lxml/objectify/pytype" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <bktlong>101. The Terrible Calamity</bktlong>
  <bktshort>101. The Terrible Calamity</bktshort>
  <epigraph>In the Name of Allah, the Beneficent, the Merciful.
</epigraph>
  <v>The terrible calamity!</v>
  <v>What is the terrible calamity!</v>
  <v>And what will make you comprehend what the terrible calamity is?</v>
  <v>The day on which men shall be as scattered moths,</v>
  <v>And the mountains shall be as loosened wool.</v>
  <v>Then as for him whose measure of good deeds is heavy,</v>
  <v>He shall live a pleasant life.</v>
  <v>And as for him whose measure of good deeds is light,</v>
  <v>His abode shall be the abyss.</v>
  <v>And what will make you know what it is?</v>
  <v>A burning fire.</v>
  <external silly="yes">*** This text is not part of Quran! ***</external>
</sura>
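Objectify also allows a more attribute-like way to add a child: assigning a value to a name that does not exist yet creates the element.  The following is a sketch on an invented two-verse `sura`, not a replacement for the `SubElement()` call above.  As far as I know, objectify marks values assigned this way with a `py:pytype` annotation, so the sketch calls `objectify.deannotate()` before serializing.

```python
from lxml import etree, objectify

# Hypothetical miniature sura to modify.
sura = objectify.fromstring(b"<sura><v>one</v><v>two</v></sura>")

# Assigning to a new name creates a child element holding that text.
sura.external = "*** Not part of the source text ***"
# XML attributes are set with the usual ElementTree .set() method.
sura.external.set("silly", "yes")

# Strip objectify's type annotations (if any) and unused namespaces.
objectify.deannotate(sura, cleanup_namespaces=True)

xml_text = etree.tostring(sura, pretty_print=True, encoding="unicode")
print(xml_text)
```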

