# ElementTree for XML

XML is somewhat special as a serialization format.  In particular, XML is not really one format so much as it is a meta-format with many dialects.  Syntactically, XML is a relatively simple format that defines elements with angle-bracketed tags (less-than and greater-than signs), allows attributes within tags, and has a few other syntactic forms for special entities and directives.  As a rough approximation, XML is a generalization of HTML; or more accurately, HTML is a dialect of XML (to be pedantic, however, recent versions of HTML are not precisely XML dialects in some technical details).
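As a minimal sketch of those syntactic pieces, the snippet below parses a tiny document with the standard library's `xml.etree.ElementTree`.  The `note` element and its content are invented for illustration and do not belong to any real dialect.

```python
import xml.etree.ElementTree as ET

# A made-up document: an element with an attribute, nested children,
# and a predefined entity (&amp;) inside the text.
doc = '<note lang="en"><to>Reader</to><body>Tags &amp; attributes</body></note>'

note = ET.fromstring(doc)
print(note.tag)                # the element name
print(note.attrib)             # attributes as a dictionary
print(note.find('body').text)  # the entity is decoded during parsing
```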

An XML *dialect* is usually defined by a *schema* that specifies exactly which tags and attributes are permitted, and the ways in which they may nest inside one another.  Hundreds of such dialects are widely used; for example, all modern word processors and publication systems use an XML dialect to define their documents (with a compression layer wrapped around the underlying XML).  Other non-document formats use XML as well, however.

In contrast to a format like JSON, which allows you to serialize pretty much arbitrary Python objects (with caveats discussed in other lessons), or CSV, which allows you to serialize any data that is roughly tabular, when you work with XML you start with a specific dialect, and read, modify, and write data with that dialect in mind.

# XML Schemata

There are several different languages in which the rules for a particular XML dialect may be defined.  All of them are outside the scope of this lesson, but the most commonly used one is the Document Type Definition (DTD).  Alternatives are XML Schema and RELAX NG.  For the next several lessons, we use an XML markup of an English translation of the Quran that was prepared by J. Bosak.  A number of religious texts are in the common archive that is contained in the repository for this course (following the license for distribution as a whole).

Looking at one DTD will give a sense of how they are defined, but this lesson will not describe precisely all the rules available.  In concept, a DTD is similar to a formal grammar, and somewhat similar to a regular expression or glob pattern.  XML Schema and RELAX NG serve the same purpose, but use different syntax.

J. Bosak created a relatively simple DTD that defines elements sufficient to encode the several religious texts.  I have simplified that DTD further to include only those elements required by the Quran translation specifically.  Looking at the simplified DTD will provide some idea of the kinds of elements that can be defined.

```dtd
<!-- DTD for testaments    J. Bosak -->
<!-- Early versions 1992-1998 -->
<!-- Major revision Copyright (c) Jon Bosak September 1998 -->
<!-- Subset by David Mertz 2020 -->
<!ENTITY % plaintext "#PCDATA|i">
<!ELEMENT tstmt     (coverpg?,titlepg?,preface?,suracoll+)>
<!ELEMENT coverpg   ((title|title2)+, (subtitle|p)*)>
<!ELEMENT titlepg   ((title|title2)+, (subtitle|p)*)>
```

```dtd
<!ELEMENT title     (%plaintext;)*>
<!ELEMENT title2    (%plaintext;)*>
<!ELEMENT subtitle  (p)+>
<!ELEMENT preface   (ptitle+, p+)>
<!ELEMENT ptitle    (%plaintext;)*>
<!ELEMENT suracoll  (sura+)>
<!ELEMENT sura      (bktlong, bktshort, epigraph?, v+)>
```

```dtd
<!ELEMENT bktlong   (%plaintext;)*>
<!ELEMENT bktshort  (%plaintext;)*>
<!ELEMENT epigraph  (%plaintext;)*>
<!ELEMENT p         (%plaintext;)*>
<!ELEMENT v         (%plaintext;)*>
<!ELEMENT i         (%plaintext;)*>
```
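To make the regular-expression analogy concrete, here is a rough sketch that hand-translates the `sura` content model, `(bktlong, bktshort, epigraph?, v+)`, into a Python regular expression over the sequence of child tag names.  The helper `matches_sura_model` and the sample elements are hypothetical.  This is not how a real validator works, and `ElementTree` itself performs no DTD validation.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of (bktlong, bktshort, epigraph?, v+).  Each child tag
# is written as "name," so one pattern can match the whole sequence.
SURA_MODEL = re.compile(r'bktlong,bktshort,(epigraph,)?(v,)+')

def matches_sura_model(elem):
    """Return True if the child tags of elem fit the sura content model."""
    tags = ''.join(child.tag + ',' for child in elem)
    return SURA_MODEL.fullmatch(tags) is not None

# Two invented fragments: one in the required order, one not.
good = ET.fromstring(
    '<sura><bktlong>L</bktlong><bktshort>S</bktshort><v>one</v><v>two</v></sura>')
bad = ET.fromstring(
    '<sura><bktshort>S</bktshort><bktlong>L</bktlong><v>one</v></sura>')

print(matches_sura_model(good))  # True
print(matches_sura_model(bad))   # False
```

The DTD's `?` and `+` map directly onto the same regular-expression operators, which is why the translation is nearly mechanical for a model this simple.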

The first few lines of the document we will work with follow this schema and look like the following.

```xml
<?xml version="1.0"?>
<!DOCTYPE tstmt SYSTEM "../common/tstmt.dtd">
<tstmt  attr1="Test1" attr2="Test2">
<coverpg>
<title>The Quran</title>
<title2>One of a group of four religious works marked up for
electronic publication from publicly available sources</title2>
```

```xml
<subtitle>
<p>SGML version by Jon Bosak, 1992-1994</p>
<p>XML version by Jon Bosak, 1996-1998</p>
<p>The XML markup and added material in this version are
Copyright &#169; 1998 Jon Bosak</p>
</subtitle>
```

```xml
<subtitle>
<p>The set of which this work is a part may freely be distributed on
condition that it not be modified or altered in any way.  The
individual works making up the set &#8212; <i>The Old Testament, The
New Testament, The Quran,</i> and <i>The Book of Mormon</i> &#8212;
cannot be distributed separately without violating the terms under
which the set is made available.</p>
</subtitle>
```
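The `&#169;` and `&#8212;` sequences in this excerpt are numeric character references.  As a small sketch, using one line copied from the excerpt and parsed from a string rather than from the course file, the parser turns such references into ordinary Unicode characters, and serialization to ASCII bytes turns them back:

```python
import xml.etree.ElementTree as ET

# One <p> element copied from the excerpt above.
p = ET.fromstring('<p>Copyright &#169; 1998 Jon Bosak</p>')

# After parsing, the reference has become the copyright character itself.
print(p.text)

# Serializing with the default ASCII encoding yields bytes in which
# non-ASCII characters are written as character references again.
as_bytes = ET.tostring(p)
print(as_bytes)

# Asking for encoding='unicode' yields a str with the literal character.
as_text = ET.tostring(p, encoding='unicode')
print(as_text)
```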

# Reading the XML Document

An ElementTree object is a specialized data structure that mimics the hierarchical features of XML.  Reading it is straightforward, and a variety of attributes and methods are attached to the overall tree and to its various branches and leaves.  In the original document only subelements are used, with no attributes; I added attributes to the root element for demonstration.

> The data associated with this notebook can be found in the files associated with this course

```python
from pprint import pprint
import xml.etree.ElementTree as ET

tree = ET.parse('data/quran.xml')
root = tree.getroot()
root.tag, root.attrib
```

```text
('tstmt', {'attr1': 'Test1', 'attr2': 'Test2'})
```

The methods `.find()` and `.findall()` are available on each subelement to locate nested subelements (children) of a given element.

```python
suras = tree.find('suracoll').findall('sura')

print("Number of suras:", len(suras))
print("Structure of sura 101:")
print([elem.tag for elem in suras[100]])
```

```text
Number of suras: 114
Structure of sura 101:
['bktlong', 'bktshort', 'epigraph', 'v']
```
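The difference between these lookups is easiest to see on a small stand-in.  The sketch below uses an invented `sura` fragment with placeholder text (not the course data) and also shows `.findtext()`, a related convenience method.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for one sura; the text is placeholder content.
sura = ET.fromstring(
    '<sura><bktlong>Long title</bktlong><bktshort>Short</bktshort>'
    '<v>first verse</v><v>second verse</v></sura>')

first_v = sura.find('v')           # first matching direct child
all_v = sura.findall('v')          # list of every matching direct child
missing = sura.find('epigraph')    # None when nothing matches
title = sura.findtext('bktshort')  # text of the first match

print(first_v.text, len(all_v), missing, title)
```

Because `.find()` returns `None` when there is no match, chaining a call such as `tree.find('suracoll').findall('sura')` raises `AttributeError` if the first lookup fails, so it is worth checking for `None` when the structure is not guaranteed.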


To find descendants that may be more deeply nested, the `.iter()` method is often appropriate.  For example, we can find the 114 nested suras without first locating their parent element.

```python
suras = list(tree.iter('sura'))
sura101 = suras[100]  # zero-based Python indexing
len(suras)
```

```text
114
```
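As a sketch of how direct-child lookup differs from searching all descendants, the following uses a small invented document with the same nesting as the course file (`tstmt` contains `suracoll`, which contains `sura`).  The path expression `.//sura` is part of the limited XPath subset that `ElementTree` supports.

```python
import xml.etree.ElementTree as ET

# Invented miniature document with the same nesting as the real one.
root = ET.fromstring(
    '<tstmt><suracoll>'
    '<sura><v>a</v></sura><sura><v>b</v></sura>'
    '</suracoll></tstmt>')

direct = root.findall('sura')      # direct children of root only
nested = list(root.iter('sura'))   # descendants at any depth
by_path = root.findall('.//sura')  # path expression for any depth

print(len(direct), len(nested), len(by_path))  # 0 2 2
```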

We might wish to view the text within child elements, for example.

```python
for verse in sura101.findall('v'):
    print(verse.text)
```

```text
The Striking Calamity.
```
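One caveat: the DTD allows an `i` element inside a verse, and `.text` holds only the text that appears before the first child element.  The sketch below, on an invented verse, shows where the remaining text lives and how `.itertext()` collects all of it.

```python
import xml.etree.ElementTree as ET

# Invented verse with inline markup, as permitted by the DTD.
v = ET.fromstring('<v>plain <i>emphasized</i> trailing</v>')

print(repr(v.text))     # 'plain '       text before the first child
print(repr(v[0].text))  # 'emphasized'   text inside the <i> child
print(repr(v[0].tail))  # ' trailing'    text after the <i> child

# itertext() walks the element and yields every piece in document order.
full_text = ''.join(v.itertext())
print(full_text)        # plain emphasized trailing
```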


## Modifying an Element

Using methods of elements, we may modify either attributes or children. In this example, we are not following the schema, but instead inventing a new element not defined in the DTD. After we have added an element and some content and attributes to that element, we might serialize the modified element as XML.  For illustration, a comment is also added.

```python
sura101.append(ET.Comment("Demonstrate a comment"))
child = ET.SubElement(sura101, 'external')
child.text = "\n*** This text is not part of Quran! ***\n"
child.set('silly', 'yes')
child.set('discard', "True")
```

```python
ET.dump(sura101)
```
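`ET.dump()` writes to standard output and is intended mainly for debugging.  As a hedged sketch of getting the serialization as a string instead, the example below repeats the same kind of modification on a small invented element (not the course data) and uses `ET.tostring()`.

```python
import xml.etree.ElementTree as ET

# Invented stand-in for a sura, modified as in the cell above.
sura = ET.fromstring('<sura><v>first verse</v></sura>')
sura.append(ET.Comment("Demonstrate a comment"))
child = ET.SubElement(sura, 'external')
child.text = "not part of the original"
child.set('silly', 'yes')

# encoding='unicode' returns a str rather than bytes.
xml_text = ET.tostring(sura, encoding='unicode')
print(xml_text)

# The string can be parsed again, and the new element survives.
# The default parser drops comments when reading.
reparsed = ET.fromstring(xml_text)
print(reparsed.find('external').attrib)
```

To save a whole document to disk, the `.write()` method of the `ElementTree` object (here, `tree`) accepts a file name of your choosing.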

# What Is Missing?

There are a number of XML features this lesson simply has not considered.  We have not looked at validation, entity resolution, namespaces, CDATA sections, character encoding and escaping, or some additional concepts you will need for robust XML processing.  For this lesson, and the next two, we just want you to be familiar with basic serialization and deserialization between Python and XML.

Understanding XML in full is its own longer course, and is not usually something you need for basic handling of the sort we show.