# ElementTree for XML

XML is somewhat special as a serialization format.  In particular, XML is not really one format so much as it is a meta-format with many dialects.  Syntactically, XML is a relatively simple format that defines elements with angle bracketed tags (less-than and greater-than signs), allows attributes within tags, and has a few other syntactic forms for special entities and directives.  As a rough approximation, XML is a generalization of HTML; or more accurately, HTML is a dialect of XML (to be pedantic, however, recent versions of HTML are not precisely XML dialects in some technical details).
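As a small illustration of those syntactic pieces (an element, an attribute, a nested element, and character entities), here is a sketch that parses a made-up snippet with the standard library.  The `note` element and its `lang` attribute are invented for this example and are not part of any dialect used in this lesson.

```python
import xml.etree.ElementTree as ET

# A hypothetical one-element document: a tag, an attribute, a nested <i>
# element, a numeric character reference, and the predefined &amp; entity
snippet = '<note lang="en">Caf&#233; <i>menu</i> &amp; prices</note>'
note = ET.fromstring(snippet)

print(note.tag)             # note
print(note.attrib)          # {'lang': 'en'}
print(repr(note.text))      # 'Café '
print(note[0].tag)          # i
print(repr(note[0].tail))   # ' & prices'
```

Notice that the parser resolves the entities for us; the text following the nested element is stored as that element's `.tail`.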

An XML *dialect* is usually defined by a *schema* that specifies exactly which tags and attributes are permitted, and the manners in which they may nest inside one another.  Hundreds of such dialects are widely used; for example all modern word processors and publication systems use an XML dialect to define their documents (with a compression layer wrapped around the underlying XML).  Other non-document formats use XML as well, however.

In contrast to a format like JSON, which allows you to serialize pretty much arbitrary Python objects (with caveats discussed in other lessons), or CSV, which allows you to serialize any data that is roughly tabular, when you work with XML you start with a specific dialect, and read, modify, and write data with that dialect in mind.

# XML Schemata

There are several different languages in which the rules for a particular XML dialect may be defined.  All of them are outside the scope of this lesson, but the most commonly used one is the Document Type Definition (DTD).  Alternatives are XML Schema and RELAX NG.  For the next several lessons, we use an XML markup of an English translation of the Quran that was prepared by J. Bosak.  A number of religious texts are in the common archive that is contained in the repository for this course (following the license for distribution as a whole).

Looking at one DTD will give a sense of how they are defined, but this lesson will not describe precisely all the rules available.  In concept, a DTD is similar to a formal grammar, and somewhat similar to a regular expression or glob pattern.  XML Schema and RELAX NG serve the same purpose as a DTD, but use different syntax.

J. Bosak created a relatively simple DTD that defines elements sufficient to encode the several religious texts.  I have simplified that DTD further to include only those elements required by the Quran translation specifically.  Looking at the simplified DTD will provide some idea of the kinds of elements that can be defined.

```dtd
<!-- DTD for testaments    J. Bosak -->
<!-- Early versions 1992-1998 -->
<!-- Major revision Copyright (c) Jon Bosak September 1998 -->
<!-- Subset by David Mertz 2020 -->
<!ENTITY % plaintext "#PCDATA|i">
<!ELEMENT tstmt     (coverpg?,titlepg?,preface?,suracoll+)>
<!ELEMENT coverpg   ((title|title2)+, (subtitle|p)*)>
<!ELEMENT titlepg   ((title|title2)+, (subtitle|p)*)>
```

```dtd
<!ELEMENT title     (%plaintext;)*>
<!ELEMENT title2    (%plaintext;)*>
<!ELEMENT subtitle  (p)+>
<!ELEMENT preface   (ptitle+, p+)>
<!ELEMENT ptitle    (%plaintext;)*>
<!ELEMENT suracoll  (sura+)>
<!ELEMENT sura      (bktlong, bktshort, epigraph?, v+)>
```

```dtd
<!ELEMENT bktlong   (%plaintext;)*>
<!ELEMENT bktshort  (%plaintext;)*>
<!ELEMENT epigraph  (%plaintext;)*>
<!ELEMENT p         (%plaintext;)*>
<!ELEMENT v         (%plaintext;)*>
<!ELEMENT i         (%plaintext;)*>
```
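To make the comparison with regular expressions concrete, the content model of `sura` in the DTD above can be transcribed by hand into a regex over the sequence of child tag names.  This is only a sketch of the idea, not how a validating parser works; the helper name `matches_sura_model` and the tiny sample elements are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# <!ELEMENT sura (bktlong, bktshort, epigraph?, v+)> rewritten by hand
# as a regex over child tag names, each followed by one space
SURA_MODEL = re.compile(r"bktlong bktshort (epigraph )?(v )+")

def matches_sura_model(sura):
    """Return True if the child tags occur in the order the DTD requires."""
    child_tags = "".join(child.tag + " " for child in sura)
    return SURA_MODEL.fullmatch(child_tags) is not None

good = ET.fromstring(
    "<sura><bktlong>A</bktlong><bktshort>B</bktshort><v>One</v><v>Two</v></sura>")
bad = ET.fromstring(
    "<sura><bktshort>B</bktshort><bktlong>A</bktlong><v>One</v></sura>")

print(matches_sura_model(good))  # True
print(matches_sura_model(bad))   # False
```

The `?` and `+` in the DTD carry the same meaning as in the regex: optional, and one or more.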

The first few lines of the document we will work with follow this schema and look like this:

```xml
<?xml version="1.0"?>
<!DOCTYPE tstmt SYSTEM "../common/tstmt.dtd">
<tstmt  attr1="Test1" attr2="Test2">
<coverpg>
<title>The Quran</title>
<title2>One of a group of four religious works marked up for
electronic publication from publicly available sources</title2>
```

```xml
<subtitle>
<p>SGML version by Jon Bosak, 1992-1994</p>
<p>XML version by Jon Bosak, 1996-1998</p>
<p>The XML markup and added material in this version are
Copyright &#169; 1998 Jon Bosak</p>
</subtitle>
```

```xml
<subtitle>
<p>The set of which this work is a part may freely be distributed on
condition that it not be modified or altered in any way.  The
individual works making up the set &#8212; <i>The Old Testament, The
New Testament, The Quran,</i> and <i>The Book of Mormon</i> &#8212;
cannot be distributed separately without violating the terms under
which the set is made available.</p>
</subtitle>
```

# Reading the XML Document

An ElementTree object is a specialized data structure that mimics the hierarchical features of XML.  Reading it is straightforward, and a variety of attributes and methods are attached to the overall tree and to its various branches and leaves.  The original document uses only subelements and no attributes; I added attributes for demonstration.

> The data associated with this notebook can be found in the files associated with this course

In [32]:
from pprint import pprint
import xml.etree.ElementTree as ET

tree = ET.parse('data/quran.xml')
root = tree.getroot()
root.tag, root.attrib

('tstmt', {'attr1': 'Test1', 'attr2': 'Test2'})

The methods `.find()` and `.findall()` are available on each subelement to locate nested subelements (children) of a given element.

In [33]:
suras = tree.find('suracoll').findall('sura')

print("Number of suras:", len(suras))
print("Structure of sura 101:")
print([elem.tag for elem in suras[100]])

Number of suras: 114
Structure of sura 101:
['bktlong', 'bktshort', 'epigraph', 'v']
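The argument to `.find()` and `.findall()` may also be a slash-separated path, which saves chaining calls.  The following sketch uses a tiny inline document (invented here, loosely shaped like the Quran markup) rather than the course data file, so it can be run on its own.

```python
import xml.etree.ElementTree as ET

# A made-up miniature document with the same nesting as the real one
doc = ET.fromstring(
    "<tstmt><suracoll>"
    "<sura><bktshort>First</bktshort><v>One</v><v>Two</v></sura>"
    "<sura><bktshort>Second</bktshort><v>Three</v></sura>"
    "</suracoll></tstmt>"
)

# A path walks several levels of nesting in one call
suras = doc.findall("suracoll/sura")
print(len(suras))                         # 2

# .find() returns only the first match ...
print(doc.find("suracoll/sura/v").text)   # One

# ... or None when nothing matches, so check before using the result
print(doc.find("preface"))                # None
```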


To find children that may be more deeply nested, the `.iter()` method is often appropriate.  For example, we can find the 114 nested suras.

In [34]:
suras = list(tree.iter('sura'))
sura101 = suras[100] # zero-based Python
len(suras)

114
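The difference in reach between `.findall()` and `.iter()` is easiest to see side by side.  This is a minimal sketch on an invented inline document, not the course data: a bare tag name given to `.findall()` matches direct children only, while `.iter()` visits the element itself and every descendant in document order.

```python
import xml.etree.ElementTree as ET

# A made-up miniature document; verses sit three levels below the root
doc = ET.fromstring(
    "<tstmt><suracoll>"
    "<sura><bktshort>First</bktshort><v>One</v><v>Two</v></sura>"
    "<sura><bktshort>Second</bktshort><v>Three</v></sura>"
    "</suracoll></tstmt>"
)

# No <v> is a direct child of the root, so .findall() comes back empty
print(len(doc.findall("v")))              # 0

# .iter() searches at any depth
print([v.text for v in doc.iter("v")])    # ['One', 'Two', 'Three']

# Without a tag argument, .iter() yields every element, root included
print(len(list(doc.iter())))              # 9
```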

We might wish to view the text within child elements, for example.

In [35]:
for verse in sura101.findall('v'):
    print(verse.text)

The Striking Calamity.
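One caveat with `.text`: it holds only the text before the first child element.  For mixed content, such as the `<p>` elements in the cover page that contain `<i>` children, `.itertext()` collects every text fragment.  The paragraph below is a shortened stand-in written for this sketch, not a quotation from the data file.

```python
import xml.etree.ElementTree as ET

# Mixed content: text interleaved with <i> child elements
p = ET.fromstring(
    "<p>The works <i>The Quran</i> and <i>The Book of Mormon</i> are included.</p>")

# .text stops at the first child element
print(repr(p.text))            # 'The works '

# .itertext() walks the text of the element and all of its descendants
print("".join(p.itertext()))   # The works The Quran and The Book of Mormon are included.
```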


## Modifying an Element

Using methods of elements, we may modify either attributes or children. In this example, we are not following the schema, but instead inventing a new element not defined in the DTD. After we have added an element and some content and attributes to that element, we might serialize the modified element as XML.  For illustration, a comment is also added.

In [36]:
sura101.append(ET.Comment("Demonstrate a comment"))
child = ET.SubElement(sura101, 'external')
child.text = "\n*** This text is not part of Quran! ***\n"
child.set('silly', 'yes')
child.set('discard', "True")

In [None]:
ET.dump(sura101)
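`ET.dump()` writes to standard output and is meant for debugging.  When you want the serialized XML as a value, `ET.tostring()` returns it instead.  The sketch below repeats the same kind of modification on a small invented element (so it does not depend on the data file) and then parses the result back to confirm the round trip.

```python
import xml.etree.ElementTree as ET

# A made-up stand-in for one sura
sura = ET.fromstring("<sura><v>One</v></sura>")
sura.append(ET.Comment("Demonstrate a comment"))
child = ET.SubElement(sura, "external")
child.text = "Not part of the original"
child.set("silly", "yes")

# encoding="unicode" asks for a str rather than bytes
xml_text = ET.tostring(sura, encoding="unicode")
print(xml_text)

# Parsing the string again recovers the added element and its attribute
reparsed = ET.fromstring(xml_text)
print(reparsed.find("external").get("silly"))   # yes
```

Note that the default parser discards comments when reading, so the comment survives in `xml_text` but not in `reparsed`.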

# What Is Missing?

There are a number of XML features this lesson simply has not considered.  We have not looked at validation, entity resolution, namespaces, CDATA sections, character encoding and escaping, or some additional concepts you will need for robust XML processing.  For this lesson, and the next two, we just want you to be familiar with basic serialization and deserialization between Python and XML.

Understanding XML in full is its own longer course, and is not usually something you need for basic handling of the sort we show.
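To give just a taste of two of those topics, here is a brief sketch of how namespaces and escaping surface in ElementTree.  The namespace URI `http://example.com/ns` and the `feed` document are invented for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up document that declares a default namespace
doc = ET.fromstring(
    '<feed xmlns="http://example.com/ns"><title>A &amp; B</title></feed>')

# Namespaced tags are stored as "{uri}localname"
print(doc.tag)                  # {http://example.com/ns}feed

# A plain tag name therefore no longer matches
print(doc.find("title"))        # None

# Supplying a prefix-to-URI mapping makes the search work
ns = {"ex": "http://example.com/ns"}
title = doc.find("ex:title", ns)
print(title.text)               # A & B

# On output, special characters are escaped again automatically
serialized = ET.tostring(title, encoding="unicode")
print("&amp;" in serialized)    # True
```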

# Reading XML with lxml.objectify

The *ElementTree* abstraction tries to find a compromise between an XML way of thinking and a Python way of thinking.  The Python standard library also comes with several other submodules for handling XML that are much closer to the XML way of thinking.  These include `xml.dom` (Document Object Model), `xml.sax` (Simple API for XML), and `xml.parsers.expat`.

SAX and Expat are stream-oriented parsers for XML; both work incrementally and can be very fast.  Both require quite a lot of boilerplate and are low-level.  Expat is always non-validating, and can be blazingly fast.  The Document Object Model (DOM) creates an entire specialized object, with a variety of methods, as does ElementTree.  However, DOM is a standard created initially for JavaScript, and the method names are verbose, numerous, and feel out of place in Python.  Unless you need to closely match parallel code written in a language such as Java, JavaScript, or C#, I recommend against using the DOM approach.
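For a rough feel of these two styles, the sketch below counts verses with a SAX handler and then reads a verse through the DOM, using `xml.dom.minidom` from the standard library.  The `VerseCounter` class and the two-verse sample are invented for this illustration.

```python
import xml.sax
from xml.dom import minidom

SAMPLE = b"<sura><v>One</v><v>Two</v></sura>"

class VerseCounter(xml.sax.ContentHandler):
    """SAX style: react to events as the parser streams through the input."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        # Called once for every opening tag the parser encounters
        if name == "v":
            self.count += 1

handler = VerseCounter()
xml.sax.parseString(SAMPLE, handler)
print(handler.count)                 # 2

# DOM style: build the whole tree, then navigate with verbose method names
dom = minidom.parseString(SAMPLE)
verses = dom.getElementsByTagName("v")
print(verses.length)                 # 2
print(verses[0].firstChild.data)     # One
```

The SAX version never holds the whole document in memory, which is why it suits very large files; the price is that you must track any state yourself.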

If you want to work in a *more Pythonic* style with XML trees, the `lxml` library comes with an API called `objectify`.  This is based on much earlier work by my colleague Uche Ogbuji on Amara bindery and, even earlier, by me as `gnosis.xml.objectify`.  Neither of those old projects is currently maintained, but `lxml.objectify` is very similar and intuitive to work with.  In general `lxml` is a fast and well tested XML library, built on `libxml2` and `libxslt`, that provides both the `objectify` interface and an enhanced and faster version of `ElementTree`.

# A More Pythonic Approach

In [38]:
from lxml import etree
from lxml import objectify

Recall that the marked up version of the Quran as XML we worked with in the last lesson looks something like this:

```xml
<?xml version="1.0"?>
<!DOCTYPE tstmt SYSTEM "../common/tstmt.dtd">
<tstmt  attr1="Test1" attr2="Test2">
<coverpg>
<title>The Quran</title>
<!-- some elements omitted -->
</coverpg>
```

Continuing a fragment of the XML:

```xml
<suracoll>
<sura>
<bktlong>1. The Opening</bktlong>
<bktshort>1. The Opening</bktshort>
<v>In the name of Allah, the Beneficent, the Merciful.</v>
<v>All praise is due to Allah, the Lord of the Worlds.</v>
<v>The Beneficent, the Merciful.</v>
<v>Master of the Day of Judgment.</v>
<!-- continues -->
</sura>
</suracoll>
```

If we wish to use the ElementTree interface (here as `lxml.etree`) to create a list of the verses in Sura 101, we would write code similar to this:

In [39]:
tree = etree.parse('data/quran.xml')
quran = tree.getroot()

suras = quran.find('suracoll').findall('sura')
[elem.text for elem in suras[100] if elem.tag == 'v']

['The Striking Calamity.']

In contrast, the objectify approach treats the nested elements and attributes as if they were simply attributes of a native Python object with nested data.  XML attributes are accessed with the Python attribute `.attrib`.  Text is accessed with the Python attribute `.text`.  Child elements that occur in parallel are simply presented as a list-like collection.  Reading in the XML data requires boilerplate similar to that of ElementTree, but working with it often feels more natural.

In [43]:
doc = objectify.parse('data/quran.xml')  # accepts a filename directly
# Wrap the parsed root in a new parent so that it is reachable as .tstmt
quran_o = objectify.E.root(doc.getroot())

quran_o.tstmt.suracoll.sura[100].v[:]

['The Striking Calamity.']

If we want to see the XML attributes, they are provided as a dictionary.

In [44]:
print(quran_o.tstmt.attrib)
title = quran_o.tstmt.coverpg.title
print(title, title.attrib) # No attributes

{'attr1': 'Test1', 'attr2': 'Test2'}
Quran {}


We can follow a different path into the nested elements in the same way.

In [45]:
quran_o.tstmt.suracoll.sura[100].bktlong

'The Calamity'

In a design compromise, a shortcut to selecting the first of several parallel children is to simply omit indexing.

In [46]:
quran_o.tstmt.suracoll.sura[100].v

'The Striking Calamity.'
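The same access patterns can be tried without the data file.  This sketch assumes `lxml` is installed and uses `objectify.fromstring()` on a small invented document; the element names mirror the Quran markup, but the content is made up.

```python
from lxml import objectify

# A made-up miniature collection with two suras
root = objectify.fromstring(
    "<suracoll>"
    "<sura><bktshort>First</bktshort><v>One</v><v>Two</v></sura>"
    "<sura><bktshort>Second</bktshort><v>Three</v></sura>"
    "</suracoll>"
)

# Omitting the index selects the first of several parallel children
print(root.sura.bktshort.text)            # First

# len() counts the parallel siblings that share the tag
print(len(root.sura))                     # 2

# Iterating visits each of those siblings in turn
print([v.text for v in root.sura[0].v])   # ['One', 'Two']

# Indexing picks out a particular sibling
print(root.sura[1].v.text)                # Three
```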

Often working with objectify allows you to access the portions of interest without needing loops or comprehensions, as in the above examples.  However, these approaches can be combined as needed.  For example, here are up to the first three verses of each of the last four suras.

In [47]:
[sura.v[:3] for sura in quran_o.tstmt.suracoll.sura[-4:]]

[['May the hands of Abu Lahab be ruined, and ruined is he.'],
 ['Say, He is Allah, the One!'],
 ['Say, I seek refuge in the Lord of daybreak.'],
 ['Say, I seek refuge in the Lord of mankind.']]

## Serializing an Element

The `.dump()` function is generally only for debugging purposes.  To serialize a subelement as a complete XML document, use the function `etree.tostring()` instead; it adds namespace declarations or other needed elements so that the result is a complete document rather than a fragment.

In [48]:
sura101 = quran_o.tstmt.suracoll.sura[100]
sura_xml = etree.tostring(sura101, pretty_print=True)
print(sura_xml.decode('utf-8'))

<sura xmlns:py="http://codespeak.net/lxml/objectify/pytype" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" index="101" name="Al-Qariah">
  <bktlong>The Calamity</bktlong>
  <bktshort>Al-Qariah</bktshort>
  <epigraph/>
  <v>The Striking Calamity.</v>
</sura>



Adding or modifying elements is similar to ElementTree.

In [49]:
child = objectify.SubElement(sura101, "external", silly="yes")
child._setText("*** This text is not part of Quran! ***")

In [50]:
sura_xml = etree.tostring(sura101, pretty_print=True)
print(sura_xml.decode('utf-8'))

<sura xmlns:py="http://codespeak.net/lxml/objectify/pytype" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" index="101" name="Al-Qariah">
  <bktlong>The Calamity</bktlong>
  <bktshort>Al-Qariah</bktshort>
  <epigraph/>
  <v>The Striking Calamity.</v>
  <external silly="yes">*** This text is not part of Quran! ***</external>
</sura>

