# Introduction to XML Parsing in Python with the ElementTree module

This notebook reviews some of the essential functions and procedures for parsing XML in Python. XML is a standard markup language derived from SGML. This notebook assumes you are familiar with the basic syntax and requirements of XML. As a primer, the TEI Project provides a "[Gentle Introduction to XML](https://tei-c.org/release/doc/tei-p5-doc/es/html/SG.html)" that covers the primary features of XML as a data structure and for encoding text in XML.

First, import the ElementTree module. Although Python offers other XML parsers, we will mostly use ElementTree. It is conventionally imported under the alias `ET`, which lets you call the module by typing `ET` rather than the full `xml.etree.ElementTree` name:

In [1]:
import xml.etree.ElementTree as ET

In [2]:
import os

## Inspect the available classes and functions

To get an idea of the available classes and functions, try the `inspect` module:

In [3]:
from inspect import getmembers, isclass, isfunction

In [4]:
# Display classes in ET module
for (name, member) in getmembers(ET, isclass):
    if not name.startswith('_'):
        print(name)

C14NWriterTarget
Element
ElementTree
ParseError
QName
TreeBuilder
XMLParser
XMLPullParser


In [5]:
# display functions in ET module
for (name, member) in getmembers(ET, isfunction):
    if not name.startswith('_'):
        print(name)

Comment
PI
ProcessingInstruction
XML
XMLID
canonicalize
dump
fromstring
fromstringlist
indent
iselement
iterparse
parse
register_namespace
tostring
tostringlist


## Basic parsing with etree

In this section, we will focus on the basic usage of a few of these functions and methods, grouped here by task:

### Load and parse an XML document

* `.parse()` - reads an XML file and returns an `ElementTree` object that we can manipulate
* `.getroot()` - returns the root `Element` of an `ElementTree` object
* `.tostring()` - serializes an `Element` into XML markup (a bytes object by default)
* `.fromstring()` - parses XML from a string and returns the root `Element`
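
As a minimal sketch of these four calls, the cell below parses a small made-up document. The `<collection>` markup and the `sample_*` names are hypothetical and are not part of the EAD file used in this notebook; `io.StringIO` stands in for a file on disk.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration
sample_xml = '<collection id="c1"><title>Sample papers</title></collection>'

# fromstring() parses a string and returns the root Element directly
sample_root = ET.fromstring(sample_xml)

# parse() reads from a filename or file-like object and returns an ElementTree;
# getroot() then returns that tree's root Element
sample_tree = ET.parse(io.StringIO(sample_xml))
same_root = sample_tree.getroot()

# tostring() serializes an Element; it returns bytes unless encoding='unicode'
as_bytes = ET.tostring(sample_root)
as_text = ET.tostring(sample_root, encoding='unicode')

print(type(sample_tree).__name__, same_root.tag)
print(type(as_bytes).__name__, type(as_text).__name__)
print(as_text)
```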

### Find and retrieve values of elements and attributes

* `.find()` - returns the first element that matches a tag name or path, provided as a string or variable;
* `.findall()` - returns a list of all matching elements, which is useful for locating multiple elements or searching through the tree;
* `.iter()` - creates an iterator, which can be used in a loop. Useful to find all of the elements within a given tree or element structure, not just the ones at the current level or matching a specific element name;
* `.get()` - returns the value of a specified attribute
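
The difference between these lookups is easier to see on a tiny document. The sketch below uses a hypothetical `<box>` element with two folders; none of these names come from the EAD file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for illustration
sample_root = ET.fromstring(
    '<box id="b1">'
    '<folder n="1"><title>Letters</title></folder>'
    '<folder n="2"><title>Diaries</title></folder>'
    '</box>'
)

# find() returns the first matching direct child (or None if nothing matches)
first_folder = sample_root.find('folder')

# findall() returns a list of every matching direct child
all_folders = sample_root.findall('folder')

# iter() walks the element itself and all of its descendants, at every depth
title_texts = [t.text for t in sample_root.iter('title')]

# get() reads one attribute; a default can be supplied for missing attributes
print(first_folder.get('n'))
print(len(all_folders))
print(title_texts)
print(first_folder.get('missing', 'no such attribute'))
```

Note that `sample_root.find('title')` would return `None` here, because `title` is a grandchild of the root rather than a direct child.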

### Modify or add values

* `.set()` - adds ("sets") a specified attribute, or overwrites its value if it already exists
* `.write()` - when applied to an ElementTree object, writes the XML out to the filename passed as an argument
* `.append()` - adds a new child Element; the child can be built from a string with `.fromstring()` or with an `Element` constructor
* to remove attributes, use the `del` statement - this works because ElementTree stores attributes in a dictionary
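
A short sketch of these modifications on an in-memory element follows. The `<record>` markup and attribute names are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample element, used only for illustration
record = ET.fromstring('<record status="draft"><title>Sample</title></record>')

# set() adds a new attribute or overwrites an existing one
record.set('status', 'final')
record.set('lang', 'en')

# append() attaches an existing Element as the last child;
# here the child is built by parsing a string with fromstring()
record.append(ET.fromstring('<note>Added later</note>'))

# attributes live in a dictionary, so del removes one by key
del record.attrib['lang']

result = ET.tostring(record, encoding='unicode')
print(result)
```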

In [6]:
ead_file = os.path.join('..','data','xml','day_20221004_205435_UTC__ead.xml')

In [7]:
tree = ET.parse(ead_file)
root = tree.getroot()

print(root[:250])

[<Element '{http://ead3.archivists.org/schema/}control' at 0x7fa5a896df90>, <Element '{http://ead3.archivists.org/schema/}archdesc' at 0x7fa5a8974130>]


## Finding values of tags, attributes, or contents

Once elements are identified or parsed, it is possible to extract the tag name, any associated attributes, and the content nested within the element markers. These attributes are available on any `Element` object: `.tag` returns the name of the tag; `.attrib` provides the XML attributes, as a dictionary; and `.text` returns the text that appears directly after the element's opening tag (before any child elements).

In addition, since we are parsing an EAD document with a namespace specified, the name of the element includes the very specific and helpful (though rather long) specification to the EAD3 namespace. In other words, it is telling us the vocabulary that it is associated with.
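
When only the short element name is needed for display, the `{uri}name` form can be split apart with ordinary string handling. This is a minimal sketch: `split_tag` is a hypothetical helper written for this example, not an ElementTree function, and the sample markup is made up.

```python
import xml.etree.ElementTree as ET

def split_tag(tag):
    """Split a '{uri}local' tag into (uri, local); uri is '' if there is no namespace."""
    if tag.startswith('{'):
        uri, local = tag[1:].split('}', 1)
        return uri, local
    return '', tag

# Hypothetical sample in the EAD3 namespace, used only for illustration
sample = ET.fromstring('<ead xmlns="http://ead3.archivists.org/schema/"><control/></ead>')

for element in sample.iter():
    uri, local = split_tag(element.tag)
    print(local, '<-', uri)
```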

In [8]:
for element in root:
    print(element.tag, type(element))

{http://ead3.archivists.org/schema/}control <class 'xml.etree.ElementTree.Element'>
{http://ead3.archivists.org/schema/}archdesc <class 'xml.etree.ElementTree.Element'>


If you see things like ... `control` and `archdesc` IT WORKED! You've parsed XML with Python. 

Now, let's get fancier. For example, let's look at the content as XML text. This can be done with the `.tostring()` function, which serializes an Element back into markup. By default it returns a bytes object (note the `b` prefix in the output below); pass `encoding='unicode'` to get a regular string instead.

In [9]:
print(ET.tostring(root)[:500])

b'<ns0:ead xmlns:ns0="http://ead3.archivists.org/schema/"><ns0:control countryencoding="iso3166-1" dateencoding="iso8601" langencoding="iso639-2b" relatedencoding="marc" repositoryencoding="iso15511" scriptencoding="iso15924" language="en-US"><ns0:recordid instanceurl="">umich-scl-day</ns0:recordid><ns0:filedesc><ns0:titlestmt><ns0:titleproper>Finding Aid for the William R. Day Collection day </ns0:titleproper><ns0:titleproper localtype="filing">William R. Day Collection</ns0:titleproper><ns0:auth'
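
To make the bytes versus string distinction concrete, here is a small sketch on a made-up `<title>` element (the element and its text are hypothetical). It assumes the default behavior of `tostring()`, which serializes to ASCII bytes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample element, used only for illustration
sample = ET.fromstring('<title>Caf\u00e9 papers</title>')

# Default: a bytes object; non-ASCII characters become character references
raw = ET.tostring(sample)

# encoding='unicode': a regular Python string with the characters intact
text = ET.tostring(sample, encoding='unicode')

print(type(raw).__name__, raw)
print(type(text).__name__, text)
```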


We can see the tag of an Element object using the `.tag` attribute, and the XML attributes associated with the element are stored in a dictionary that is available through `.attrib`. 

In [10]:
root.tag

'{http://ead3.archivists.org/schema/}ead'

In [11]:
root.attrib

{}

In [12]:
for element in root:
    print(element.tag, element.attrib)

{http://ead3.archivists.org/schema/}control {'countryencoding': 'iso3166-1', 'dateencoding': 'iso8601', 'langencoding': 'iso639-2b', 'relatedencoding': 'marc', 'repositoryencoding': 'iso15511', 'scriptencoding': 'iso15924', 'language': 'en-US'}
{http://ead3.archivists.org/schema/}archdesc {'level': 'collection'}


Or, to see all of the elements at this level and at every lower subelement level, use the `.iter()` method:

In [13]:
for element in root.iter():
    print(element.tag, element.attrib)

{http://ead3.archivists.org/schema/}ead {}
{http://ead3.archivists.org/schema/}control {'countryencoding': 'iso3166-1', 'dateencoding': 'iso8601', 'langencoding': 'iso639-2b', 'relatedencoding': 'marc', 'repositoryencoding': 'iso15511', 'scriptencoding': 'iso15924', 'language': 'en-US'}
{http://ead3.archivists.org/schema/}recordid {'instanceurl': ''}
{http://ead3.archivists.org/schema/}filedesc {}
{http://ead3.archivists.org/schema/}titlestmt {}
{http://ead3.archivists.org/schema/}titleproper {}
{http://ead3.archivists.org/schema/}titleproper {'localtype': 'filing'}
{http://ead3.archivists.org/schema/}author {}
{http://ead3.archivists.org/schema/}publicationstmt {}
{http://ead3.archivists.org/schema/}publisher {}
{http://ead3.archivists.org/schema/}address {}
{http://ead3.archivists.org/schema/}addressline {}
{http://ead3.archivists.org/schema/}addressline {}
{http://ead3.archivists.org/schema/}addressline {}
{http://ead3.archivists.org/schema/}addressline {}
{http://ead3.archivists.or

To select particular elements, use the `find()` function (this will return the first element that matches your request). The argument here is a modified XPath selector, which is how we will guide the function to the elements that we want to see in the tree. If looking for multiple elements, there is also a `findall()` function. 

In [14]:
control = tree.find('{http://ead3.archivists.org/schema/}control')
print(control.attrib) 

{'countryencoding': 'iso3166-1', 'dateencoding': 'iso8601', 'langencoding': 'iso639-2b', 'relatedencoding': 'marc', 'repositoryencoding': 'iso15511', 'scriptencoding': 'iso15924', 'language': 'en-US'}



Once we find the elements, use the `.get()` method to take a look at the attributes associated with a particular element. 

Continuing with the `control` element, get the value of a particular attribute. For example, `countryencoding`.

In [15]:
countryCode = control.get('countryencoding')

print(f'Country encoding is according to: {countryCode}')

Country encoding is according to: iso3166-1


Once identified, Element objects can be iterated through, like the root element previously:

In [16]:
for element in control:
    print(element.tag, element.attrib)

{http://ead3.archivists.org/schema/}recordid {'instanceurl': ''}
{http://ead3.archivists.org/schema/}filedesc {}
{http://ead3.archivists.org/schema/}maintenancestatus {'value': 'derived'}
{http://ead3.archivists.org/schema/}maintenanceagency {'countrycode': 'US'}
{http://ead3.archivists.org/schema/}languagedeclaration {}
{http://ead3.archivists.org/schema/}conventiondeclaration {}
{http://ead3.archivists.org/schema/}localcontrol {'localtype': 'findaidstatus'}
{http://ead3.archivists.org/schema/}maintenancehistory {}


## Working with namespaces

Now, let's simplify the usage of namespaces. As you can see from the tags above, it can get tedious to use the full reference for each tag. When ElementTree parses an XML document with a namespace declaration, it prepends the associated namespace URI to each tag name. Thus, the `control` element in the EAD document here becomes `{http://ead3.archivists.org/schema/}control`. It's a lot to type each time you reference an element. It would be easier to shorten this to a prefix, in this case `ead:control`, which provides a shorthand for the schema URI. 

To do this, ElementTree accepts a namespace mapping. Establish a dictionary, typically named `ns` (or something short and easy to remember), that maps each prefix to its namespace URI and is passed to `find()`, `findall()`, and related methods:

In [17]:
ns = {
    'ead' : 'http://ead3.archivists.org/schema/'
}

Now, elements associated with that namespace can be referenced in find, findall, or other statements in your code by using the prefix and element name, like `ead:control`: 

In [18]:
control = root.find('ead:control', ns)
print(control.attrib)

{'countryencoding': 'iso3166-1', 'dateencoding': 'iso8601', 'langencoding': 'iso639-2b', 'relatedencoding': 'marc', 'repositoryencoding': 'iso15511', 'scriptencoding': 'iso15924', 'language': 'en-US'}


Compared to JSON, this may seem a bit fussy, but one advantage is that namespaced element names are highly specific, so we can be very precise in defining what an element means. For example, all of the elements and allowed attributes are defined in detail in the EAD specification: https://loc.gov/ead/EAD3taglib/. 
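
A prefix dictionary can hold more than one namespace, and the prefixes used in a query do not have to match the ones written in the document. The sketch below assumes a made-up document that mixes the EAD3 and Dublin Core namespaces; `sample_ns` is a hypothetical name chosen to avoid overwriting the `ns` dictionary above.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-namespace sample, used only for illustration
sample = ET.fromstring(
    '<ead xmlns="http://ead3.archivists.org/schema/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<control><dc:title>Sample</dc:title></control>'
    '</ead>'
)

sample_ns = {
    'ead': 'http://ead3.archivists.org/schema/',
    'dc': 'http://purl.org/dc/elements/1.1/',
}

# The prefixes in the query come from the dictionary, not from the document
title = sample.find('ead:control/dc:title', sample_ns)
print(title.text)

# The fully qualified form selects the same element
same_title = sample.find(
    '{http://ead3.archivists.org/schema/}control/'
    '{http://purl.org/dc/elements/1.1/}title'
)
print(same_title is title)
```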

## Writing XML with etree

To write an XML file, use the `.write()` method. This can be applied to an ElementTree object. Since above we have assigned the object to `tree`, we can use that here to write out the XML. 

A few additional steps help produce clean, readable XML. First, register any namespaces that are used in the document with the `register_namespace()` function; otherwise ElementTree generates prefixes such as `ns0` on its own. The function can be called multiple times to register multiple namespaces within a single script.

To establish a primary namespace (i.e., one that is assumed for the whole document, supplied in an `xmlns` attribute on the root element, and not prepended to each element), use a blank string as the first argument passed to the `register_namespace()` function. Below, the `register_namespace()` function is called multiple times to register EAD, MODS, Dublin Core, and the W3C XML Schema namespace (several attributes shared among these schemas come from it). (To learn more about these metadata structure standards, consult their official documentation.) 

Depending on how you register namespaces, your XML document may look slightly different: if you do not establish a primary namespace each tag will be prepended with the namespace. That can look a bit redundant, but it is specific and still valid XML. As the [MODS User Guide states](https://www.loc.gov/standards/mods/userguide/introduction.html), "Within a record or group of records it is optional to use the "mods" prefix before each element (and before the "mods" namespace declaration), since the MODS namespace is indicated in the record. It is most useful to use the prefix "mods:" before each element when combining a MODS record with XML data from another namespace." (referenced October 2022) 
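
The effect of the two registration styles can be sketched on a small in-memory document. The namespace URI `http://example.org/ns`, the `ex` prefix, and the element names are placeholders invented for this example, so that the EAD registrations used elsewhere in the notebook are not involved.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI and element names, used only for illustration
EXAMPLE_URI = 'http://example.org/ns'
sample = ET.fromstring(f'<root xmlns="{EXAMPLE_URI}"><child/></root>')

# Registered with a prefix: every element is written with that prefix
ET.register_namespace('ex', EXAMPLE_URI)
prefixed = ET.tostring(sample, encoding='unicode')

# Registered with a blank prefix: the namespace becomes the document default
ET.register_namespace('', EXAMPLE_URI)
unprefixed = ET.tostring(sample, encoding='unicode')

print(prefixed)
print(unprefixed)
```

Keep in mind that `register_namespace()` updates a module-wide registry and replaces any earlier mapping for the same prefix or URI, so the most recent registration wins.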

In [19]:
# to establish an unprefixed namespace, use a blank in the first argument:
ET.register_namespace('', 'http://ead3.archivists.org/schema/')
# alternatively, specify the 'ead' prefix to be extra specific
#ET.register_namespace('ead', 'http://ead3.archivists.org/schema/')
ET.register_namespace('mods', 'http://www.loc.gov/mods/v3')
ET.register_namespace('dc', 'http://purl.org/dc/elements/1.1/')
# the xsi prefix conventionally refers to the XML Schema instance namespace
ET.register_namespace('xsi', 'http://www.w3.org/2001/XMLSchema-instance')

In addition, you can specify how ElementTree writes the file. To do this, provide a file name for output, along with the following keyword arguments: 

* `xml_declaration` - takes a boolean (`True` or `False`) that controls whether the XML declaration is written 
* `encoding` - specifies the character encoding (here use `'utf-8'`), and 
* `method` - specifies the serialization method, provided as a string (the default is `'xml'`, but `'html'` and `'text'` are also accepted).

Notice that the first two arguments supply the same information that appears in a standard XML declaration:

```xml
<?xml version='1.0' encoding='utf-8'?>
```

A full `write` function might look like this:

In [20]:
tree.write(ead_file, xml_declaration=True, encoding='utf-8', method='xml')
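
Because the cell above overwrites the source file, it can be safer to experiment with a separate output path first. The following sketch writes a made-up tree to a temporary directory and reads it back; the file name `sample_out.xml` and the `<record>` markup are hypothetical.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical sample tree, used only for illustration
sample_tree = ET.ElementTree(ET.fromstring('<record><title>Sample</title></record>'))

# Write to a new file in a temporary directory instead of overwriting the source
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'sample_out.xml')
sample_tree.write(out_path, xml_declaration=True, encoding='utf-8', method='xml')

# Read the file back to confirm the declaration and content
with open(out_path, encoding='utf-8') as f:
    written = f.read()
print(written)
```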

## Modifying XML with the etree module

The `etree` module can also be used to modify XML, including adding/modifying/removing attributes, reading and modifying elements, or adding text within an element.

To create new attributes, ElementTree provides the `.set()` method. It is called on an Element object and takes the name of the desired attribute and its value as strings. If the attribute already exists, its value is overwritten. 

In [21]:
# control already contains the ead:control element
control.set('language', 'en-US')

In [22]:
# save the changes using write()
tree.write(ead_file, xml_declaration=True, encoding='utf-8', method='xml')
print('wrote',ead_file)

wrote ../data/xml/day_20221004_205435_UTC__ead.xml


When you run the above two cells, you will notice that the `xml-model` declaration disappears, because ElementTree's default parser discards processing instructions. The namespace declaration in the root tag may also change, depending on how the namespace was registered. If you registered the `ead` prefix rather than the blank prefix, the root tag carries an `xmlns:ead` attribute and the elements are written slightly differently as well: instead of bare EAD elements (like `ead` and `control`), they are now prefixed (thus `ead:ead` or `ead:control`). While this makes the file slightly longer, it is standard, and the result is still well-formed XML that refers to the same namespaced elements. In fact, many machine-generated XML files use this sort of prefixed convention. 

To remove attributes, use the `del` statement, which is a standard dictionary operation, since ElementTree stores the attributes of any Element object in a dictionary.

In [23]:
del control.attrib['language']

In [24]:
# save the changes using write()
tree.write(ead_file, xml_declaration=True, encoding='utf-8', method='xml')
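
Adding elements and text follows the same pattern as adding attributes. The sketch below works on a made-up `<record>` element rather than the EAD file; the `note` element, its `type` attribute, and the `audience` attribute are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample element, used only for illustration
record = ET.fromstring('<record><title>Sample</title></record>')

# SubElement() creates a child and attaches it to the parent in one step
note = ET.SubElement(record, 'note', {'type': 'processing'})
note.text = 'Reviewed in 2022'

# Existing text can be changed by assigning to .text
record.find('title').text = 'Sample (revised)'

# pop() removes an attribute without raising an error if it is missing
removed = note.attrib.pop('audience', None)

result = ET.tostring(record, encoding='unicode')
print(result)
```

Using `pop()` with a default is a gentler alternative to `del` when it is not certain that the attribute exists, since `del` raises a `KeyError` for a missing key.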

## Using XPath within etree

XPath is a selector language that allows us to search for specific elements and attributes within the tree. The power of this language is that it allows us to select very precisely, and it also allows us to see multiple items at similar levels in the hierarchy or those that meet particular characteristics (such as having a particular attribute). 

Remember that while XML can be represented as a tree, any of the nodes, attributes, or embedded values in XML might also be represented by a path. For example, take a look at this basic EAD hierarchy:

![A sample EAD hierarchy with 4 levels descending from the "root" EAD element](../assets/xml-tree-basic.png 'A sample EAD hierarchy with 4 levels')

To see this as a "tree," imagine inverting the hierarchy and consider the `ead` element as the "root" from which the other elements branch out. The tree metaphor is commonly used in describing XML. It is useful for illustrating the hierarchical relationships from element to element, including "parent" (source nodes) and "child" (descending nodes) relationships. 

Another metaphor is a file path. In this representation of the structure, imagine notating each node from the top to the destination. Thus, we can create a specific address for each element in the structure. To address the root node, for example, use a slash and the name of the root element: `/ead`. To reference an entire level, for example everything in the `dsc` level, you might use a path expansion: `/ead/archdesc/dsc/*`. Attributes are referenced with the `@` symbol: `/ead/archdesc[@level]` selects `archdesc` elements that have a `level` attribute. This notation provides a specific way to address any element of the structure and reference it within a program. 

**XPath** syntax allows for the creation of patterns that can select particular addresses within the tree. This might be akin to an advanced path expansion or regular expression but for finding things in XML. Currently, ElementTree does not provide full support for XPath syntax, but it does allow for many queries that allow a script to select data from XML in powerful ways. 

Paths in XPath separate node elements with slashes (`/`). 

Elements are strung together with slashes in between to indicate a path from the location of the query. Generally, a query beginning with a slash descends from the root node. Otherwise, the path is evaluated relative to a _context node_, the element from which the query is run. 

XPath selectors available in ElementTree include:

| Syntax | Meaning |
| ------ | ------ | 
| `tag` | Selects all child elements of the context node with the given `tag`. These can be used with namespace selectors as well, for example `{namespace}*` selects all tags in a given namespace, or `{*}tag` selects all matching tags in any (or no) namespace. |
| `*` | Selects all child elements of a context node. |
| `.` | Selects the current node. |
| `//` | Selects all subelements, on all levels beneath the context element. Useful for matching elements or attributes across various branches of the hierarchy. | 
| `..` | Selects the parent element of the context node. |
| `[@attrib]` | Selects all elements with the given attribute. In this case where the attribute name matches `attrib`. |
| `[@attrib='value']` | Selects elements for which the attribute is a given value. In this case where an element includes `attrib="value"`. Note that the value cannot contain quote marks. | 
| `[tag]` | Selects elements with a child that matches the `tag`. | 
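
The selectors in the table can be tried out on a small tree. The sketch below uses a made-up, namespace-free hierarchy that loosely imitates EAD element names; the `id` and `level` values are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample hierarchy, used only for illustration
sample = ET.fromstring(
    '<ead>'
    '<archdesc level="collection">'
    '<dsc>'
    '<c01 id="s1" level="series"><did><unittitle>Letters</unittitle></did></c01>'
    '<c01 id="s2" level="file"><did><unittitle>Diaries</unittitle></did></c01>'
    '</dsc>'
    '</archdesc>'
    '</ead>'
)

# *: direct children of the context node only
children = [e.tag for e in sample.findall('*')]

# .//tag: matching elements at any depth below the context node
series_ids = [e.get('id') for e in sample.findall('.//c01')]

# [@attrib] and [@attrib='value']: filter on attributes
with_level = [e.tag for e in sample.findall('.//*[@level]')]
only_series = [e.get('id') for e in sample.findall(".//c01[@level='series']")]

# [tag]: elements that have a child with the given tag
with_did = [e.tag for e in sample.findall('.//*[did]')]

# ..: step up to the parent of each matched element
parents = [e.tag for e in sample.findall('.//c01/..')]

for result in (children, series_ids, with_level, only_series, with_did, parents):
    print(result)
```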

Additional XPath expressions are possible. For reference, see the [Python ElementTree documentation](https://docs.python.org/3/library/xml.etree.elementtree.html#supported-xpath-syntax), and for more on XPath generally this [quick introduction to XPath from Library Carpentry](https://librarycarpentry.org/lc-webscraping/02-xpath/index.html).

### Exploring XPath in ElementTree

Let's take a closer look with the `<titleproper>` elements and explore basic XPath expressions:

In [25]:
title = tree.find('ead:control/ead:filedesc/ead:titlestmt/ead:titleproper', ns)
print(title.tag, title.text)

{http://ead3.archivists.org/schema/}titleproper Finding Aid for the William R. Day Collection day 


In [26]:
for title in tree.findall('ead:control/ead:filedesc/ead:titlestmt/ead:titleproper', ns):
    print(title.tag, title.text)

{http://ead3.archivists.org/schema/}titleproper Finding Aid for the William R. Day Collection day 
{http://ead3.archivists.org/schema/}titleproper William R. Day Collection


Using an XPath selector, we can write the above more concisely. We can select all elements with a matching element tag using `.` followed by `//`. The `.` selects the current element (when the method is called on `tree`, that is the root element, in this case `ead`), and the double slash `//` selects any descendant element, at any depth, matching the element name supplied. Thus, `.//ead:titleproper` will select every `titleproper` element in the file:

In [27]:
for title in tree.findall('.//ead:titleproper', ns):
    print(title.tag, title.text)

{http://ead3.archivists.org/schema/}titleproper Finding Aid for the William R. Day Collection day 
{http://ead3.archivists.org/schema/}titleproper William R. Day Collection


Similarly, we could look for some of the elements in the collection contents using the `c` elements. For example, the `c01` tag (representing the "first level" of container objects) occurs multiple times and in this case represents the various series in the collection. Below, the loop uses `findall` to look for all `c01` elements in the current tree, and prints the `id` attribute, tag name, and dictionary of attributes. Then, a `find` statement extracts the text of the series description from the `scopecontent` note of the `c01` level and prints it. Finally, a nested `findall()` looks for the `unittitle` of all `c02` elements within the series and prints a list of the folders or boxes in the series:  

In [28]:
did_count = 0

for obj in tree.findall('.//ead:c01', ns):
    did_count += 1
    # extract the series id for the c01 element 
    print(f'Series id: {obj.attrib["id"]}\n', obj.tag, obj.attrib)

    # extract and print the paragraph in the scopecontent note for the series 
    scope = obj.find('.//ead:scopecontent/ead:p', ns)
    print(scope.text,'\n')

    # look through the series and find the c02 second-level components and their unittitles, to list the folders or subseries within this c01
    for item in obj.findall('.//ead:c02//ead:unittitle', ns):
        print(item.text)

Series id: aspace_ref1
 {http://ead3.archivists.org/schema/}c01 {'id': 'aspace_ref1', 'level': 'series'}
The Correspondence and Papers series contains correspondence and papers from William Day and various family members. 

William Day
1896
1897
1898 (3 folders)
1899
1900-1911
1920-1923
Undated
Biographical
Scrapbook
Family
Luther Day (Father)
Emily Spalding Day (Mother)
Louis Schaefer (Father-in-Law)
Other Members
Miscellaneous (3 folders)
Series id: aspace_ref18
 {http://ead3.archivists.org/schema/}c01 {'id': 'aspace_ref18', 'level': 'series'}
The Manuscripts series contains work by William Day and his son, Stephen Day. It also has a dissertation about William Day by Joseph McLean, and a folder of miscellaneous materials. 

William Day on McKinley
Stephan Day Notebook
McLean Dissertation on William Day (2 folders)
Miscellaneous
Series id: aspace_ref22
 {http://ead3.archivists.org/schema/}c01 {'id': 'aspace_ref22', 'level': 'series'}
The Newspaper series includes issues of the Univers
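
The loop above assumes that every series has a `scopecontent` note. When that is not guaranteed, `find()` returns `None` and reading `.text` raises an error. The following is a minimal sketch of two ways to guard against this, on a made-up, namespace-free sample (with the real file, the `ead:` prefix and `ns` dictionary would be needed; the placeholder string is hypothetical).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second series has no scopecontent note
sample = ET.fromstring(
    '<dsc>'
    '<c01 id="s1"><scopecontent><p>Letters and papers.</p></scopecontent></c01>'
    '<c01 id="s2"></c01>'
    '</dsc>'
)

descriptions = {}
for series in sample.findall('.//c01'):
    # find() returns None when nothing matches, so check before using .text
    scope = series.find('.//scopecontent/p')
    if scope is not None:
        descriptions[series.get('id')] = scope.text
    else:
        descriptions[series.get('id')] = '(no scope note)'

# findtext() combines the lookup and the fallback in one call
same = {
    s.get('id'): s.findtext('.//scopecontent/p', default='(no scope note)')
    for s in sample.findall('.//c01')
}

print(descriptions)
print(same == descriptions)
```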

To summarize, this notebook illustrates the basics of parsing XML with the ElementTree module in Python. Specifically, we demonstrated how to load and parse an XML document from a file, how to navigate and identify specific elements and attributes, how to work with namespaces (within a program and when writing out XML), how to write XML to a file, and how to add and remove attributes. This provides a basic toolbox for working with, analyzing, and modifying XML.  