# Introduction to XML Parsing in Python with the ETree module

This notebook reviews some of the essential functions and procedures for parsing XML in Python.

First, import the etree module. Although Python does support other XML parsers, we will mostly use ElementTree, which is conventionally imported under the alias `ET` so that you can call the module by typing `ET` rather than the whole module path:

In [1]:
import xml.etree.ElementTree as ET

To get an idea of the available classes and functions, try the `inspect` module:

In [2]:
from inspect import getmembers, isclass, isfunction

In [3]:
# Display classes in ET module
for (name, member) in getmembers(ET, isclass):
    if not name.startswith('_'):
        print(name)

C14NWriterTarget
Element
ElementTree
ParseError
QName
TreeBuilder
XMLParser
XMLPullParser


Of these classes, we will focus on `Element` and `ElementTree`.

In [4]:
# display functions in ET module
for (name, member) in getmembers(ET, isfunction):
    if not name.startswith('_'):
        print(name)

Comment
PI
ProcessingInstruction
XML
XMLID
canonicalize
dump
fromstring
fromstringlist
indent
iselement
iterparse
parse
register_namespace
tostring
tostringlist


In this section, we will focus on the basic usage of a few of these functions and methods. For initial XML parsing, these are:

* `.parse()` - reads an XML file and returns an `ElementTree` object that we can manipulate in Python
* `.getroot()` - returns the root `Element` of an `ElementTree` object
* `.tostring()` - serializes an `Element` and its children back into XML (as bytes by default)
* `.fromstring()` - parses XML held in a string and returns the root `Element`

Additional methods that we will use include:

* `.get()` - returns the value of a specified attribute
* `.set()` - adds ("sets") a specified attribute, or overwrites its value if it already exists
* `.write()` - when applied to an ElementTree object, writes the XML out to the filename passed as an argument
* `.append()` - adds a new child Element or "tag"; build the child from a string (that is, with `.fromstring()`) or alternatively use an `Element` constructor
* `del` - removes attributes, for example `del element.attrib['name']`; this works because ElementTree stores attributes in a dictionary
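
As a rough illustration of how the parsing functions listed above fit together, here is a minimal sketch. It uses a small made-up XML string (`xml_text`, not the EAD file used in this notebook), and all of the `sample_` names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document, not the EAD file used in this notebook
xml_text = '<collection id="c1"><title lang="en">Sample Papers</title></collection>'

# fromstring() parses a string and returns the root Element directly
sample_root = ET.fromstring(xml_text)

# Wrapping the Element in an ElementTree gives access to getroot() and write()
sample_tree = ET.ElementTree(sample_root)
same_root = sample_tree.getroot()

# get() reads a single attribute; .text holds the text inside an element
collection_id = sample_root.get('id')
title_text = sample_root.find('title').text

# tostring() serializes back to XML; encoding='unicode' returns str instead of bytes
serialized = ET.tostring(sample_root, encoding='unicode')

print(collection_id, title_text)
print(serialized)
```

When reading from a file, `ET.parse(filename)` followed by `.getroot()` should lead to the same kind of `Element` object that `fromstring()` returns here.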

In [2]:
import os  # needed for os.path.join
ead_file = os.path.join('data','day_20221004_205435_UTC__ead.xml')

In [3]:
tree = ET.parse(ead_file)
root = tree.getroot()

print(root[:250])

[<Element '{http://ead3.archivists.org/schema/}control' at 0x7fe048d804a0>, <Element '{http://ead3.archivists.org/schema/}archdesc' at 0x7fe048d78950>]


Note that it is possible to use the `tag` attribute to print the name of the tag.

In addition, since we are parsing an EAD document with a namespace specified, the name of each element includes the very specific and helpful (though rather long) EAD3 namespace URI in curly braces. In other words, it is telling us the vocabulary that the element belongs to.

In [7]:
for element in root:
    print(element.tag, type(element))

{http://ead3.archivists.org/schema/}control <class 'xml.etree.ElementTree.Element'>
{http://ead3.archivists.org/schema/}archdesc <class 'xml.etree.ElementTree.Element'>


If you see things like Element ... `control` and `archdesc` IT WORKED! You've parsed XML with Python.

Now, let's get fancier. For example, let's look at the content as readable XML text. This can be done with the `.tostring()` function, which serializes the Element objects back into XML. By default it returns a bytes object, which is why the output starts with `b'`.

In [10]:
print(ET.tostring(root)[:500])

b'<ns0:ead xmlns:ns0="http://ead3.archivists.org/schema/"><ns0:control countryencoding="iso3166-1" dateencoding="iso8601" langencoding="iso639-2b" relatedencoding="marc" repositoryencoding="iso15511" scriptencoding="iso15924"><ns0:recordid instanceurl="">umich-scl-day</ns0:recordid><ns0:filedesc><ns0:titlestmt><ns0:titleproper>Finding Aid for the William R. Day Collection day </ns0:titleproper><ns0:titleproper localtype="filing">William R. Day Collection</ns0:titleproper><ns0:author>Finding aid pr'
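
If a regular string is preferred over bytes, or a friendlier prefix than `ns0`, something like the following sketch may help. It works on a made-up fragment (`fragment`) that only borrows the EAD3 namespace URI; the choice of `ead` as the prefix is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment that reuses the EAD3 namespace URI from this notebook
fragment = ET.fromstring(
    '<ead xmlns="http://ead3.archivists.org/schema/"><control/></ead>'
)

# Default behavior: bytes, with an auto-generated prefix such as ns0
as_bytes = ET.tostring(fragment)

# encoding='unicode' returns a str instead of bytes
as_text = ET.tostring(fragment, encoding='unicode')

# register_namespace() sets the prefix used when serializing.
# Note: this is a module-wide setting and affects later tostring()/write() calls.
ET.register_namespace('ead', 'http://ead3.archivists.org/schema/')
with_prefix = ET.tostring(fragment, encoding='unicode')

print(type(as_bytes).__name__, type(as_text).__name__)
print(with_prefix)
```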


We can see the tag of an Element object using the `.tag` attribute, and the attributes associated with the element are stored in a dictionary that can be accessed through the `.attrib` attribute.

In [24]:
root.tag

'{http://ead3.archivists.org/schema/}ead'

In [25]:
root.attrib

{}

In [26]:
for element in root:
    print(element.tag, element.attrib)

{http://ead3.archivists.org/schema/}control {'countryencoding': 'iso3166-1', 'dateencoding': 'iso8601', 'langencoding': 'iso639-2b', 'relatedencoding': 'marc', 'repositoryencoding': 'iso15511', 'scriptencoding': 'iso15924'}
{http://ead3.archivists.org/schema/}archdesc {'level': 'collection'}


To select particular elements, use the `find()` method (this will return the first element that matches your request, or `None` if nothing matches). The argument here is a path written in the limited subset of XPath that ElementTree supports, which is how we will guide the method to the elements that we want to see in the tree. If looking for multiple elements, there is also a `findall()` method, which returns a list of every match.

In [32]:
control = tree.find('{http://ead3.archivists.org/schema/}control')
print(control.attrib) 

{'countryencoding': 'iso3166-1', 'dateencoding': 'iso8601', 'langencoding': 'iso639-2b', 'relatedencoding': 'marc', 'repositoryencoding': 'iso15511', 'scriptencoding': 'iso15924'}
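
The difference between `find()` and `findall()`, and what happens when nothing matches, can be sketched on a tiny made-up document (`doc` and its `item` elements are hypothetical, chosen only for illustration):

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document for illustration
doc = ET.fromstring('<list><item>a</item><item>b</item></list>')

first_item = doc.find('item')      # first matching child only
all_items = doc.findall('item')    # list of every matching child
missing = doc.find('nothing')      # no match: returns None rather than raising

item_texts = [item.text for item in all_items]
print(first_item.text, item_texts, missing)
```

Because `find()` can return `None`, it is usually worth checking the result before reading `.text` or `.attrib` from it.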



Once we find the elements, use the `.get()` method to take a look at an individual attribute of a particular element. To start, let's check the `control` element that we just selected. (The root element is less useful here: its `xmlns` namespace declaration is not kept as an attribute, because ElementTree folds it into the tag names, which is why `root.attrib` was empty above.)

In [34]:
countryCode = control.get('countryencoding')

print(f'Country encoding is according to: {countryCode}')

Country encoding is according to: iso3166-1
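
One detail worth knowing is what `.get()` does with an attribute that is absent. The sketch below uses a made-up `<control>` element (`sample_control`) with a single attribute; the `audience` attribute name and the `'unspecified'` fallback are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical element modeled on the <control> attributes shown above
sample_control = ET.fromstring('<control countryencoding="iso3166-1"/>')

present = sample_control.get('countryencoding')

# A missing attribute returns None instead of raising a KeyError
absent = sample_control.get('audience')

# An optional second argument supplies a fallback value
with_default = sample_control.get('audience', 'unspecified')

print(present, absent, with_default)
```

By contrast, `sample_control.attrib['audience']` would raise a `KeyError`, since `.attrib` behaves like an ordinary dictionary.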


Once identified, Element objects can be iterated through, like the root element previously:

In [35]:
type(control)

xml.etree.ElementTree.Element

In [36]:
for element in control:
    print(element.tag, element.attrib)

{http://ead3.archivists.org/schema/}recordid {'instanceurl': ''}
{http://ead3.archivists.org/schema/}filedesc {}
{http://ead3.archivists.org/schema/}maintenancestatus {'value': 'derived'}
{http://ead3.archivists.org/schema/}maintenanceagency {'countrycode': 'US'}
{http://ead3.archivists.org/schema/}languagedeclaration {}
{http://ead3.archivists.org/schema/}conventiondeclaration {}
{http://ead3.archivists.org/schema/}localcontrol {'localtype': 'findaidstatus'}
{http://ead3.archivists.org/schema/}maintenancehistory {}


Now, let's simplify the usage of namespaces. As you can see from the tags above, it can get tedious to use the full reference for each tag. Instead, our aim will be to shorten this to a prefix, in this case `ead:` as a shorthand for the schema URI. 

To do this, the etree module provides namespace handling. To set up namespaces, establish a dictionary, typically named `ns` (or something short and easy to remember), that maps each prefix to its URI and is passed as the second argument to `find()` and `findall()`:

In [37]:
ns = {
    'ead' : 'http://ead3.archivists.org/schema/'
}

In [49]:
control = root.find('ead:control', ns)
print(control.attrib)

{'countryencoding': 'iso3166-1', 'dateencoding': 'iso8601', 'langencoding': 'iso639-2b', 'relatedencoding': 'marc', 'repositoryencoding': 'iso15511', 'scriptencoding': 'iso15924'}
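
To show that the prefixed form is only a shorthand, the following sketch compares three ways of naming the same element on a made-up fragment (`sample`). The `{*}` wildcard form assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

EAD_URI = 'http://ead3.archivists.org/schema/'
ns = {'ead': EAD_URI}

# Hypothetical fragment that reuses the EAD3 namespace URI
sample = ET.fromstring(f'<ead xmlns="{EAD_URI}"><control/><archdesc/></ead>')

# 1. Prefix plus namespace dictionary
by_prefix = sample.find('ead:control', ns)

# 2. Full namespace URI in curly braces, as shown in the tag names
by_full_name = sample.find(f'{{{EAD_URI}}}control')

# 3. Wildcard namespace (assumes Python 3.8+): matches control in any namespace
by_wildcard = sample.find('{*}control')

print(by_prefix is by_full_name, by_prefix is by_wildcard)
```

The prefix in the dictionary is a local choice; it does not need to match whatever prefix, if any, the XML file itself uses.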


Compared to JSON, this may seem a bit fussy, but the advantage is that we can be highly specific when referencing elements, and we can be very precise in defining what an element means. For example, all of the elements and allowed attributes are defined in detail at the EAD specification: https://loc.gov/ead/EAD3taglib/. 

## Using XPath within etree

XPath is a selector language that allows us to search for specific elements and attributes within the tree. The power of this language is that it allows us to select very precisely, and it also allows us to see multiple items at similar levels in the hierarchy or those that meet particular characteristics (such as having a particular attribute). 

For a quick introduction to XPath, see this introduction from Library Carpentry: https://librarycarpentry.org/lc-webscraping/02-xpath/index.html

Let's take a closer look with the `<titleproper>` elements:

In [58]:
title = tree.find('ead:control/ead:filedesc/ead:titlestmt/ead:titleproper', ns)
print(title.tag, title.text)

{http://ead3.archivists.org/schema/}titleproper Finding Aid for the William R. Day Collection day 


In [59]:
for title in tree.findall('ead:control/ead:filedesc/ead:titlestmt/ead:titleproper', ns):
    print(title.tag, title.text)

{http://ead3.archivists.org/schema/}titleproper Finding Aid for the William R. Day Collection day 
{http://ead3.archivists.org/schema/}titleproper William R. Day Collection


Using a more general XPath selector, we can write the above more concisely. We can select all elements with a matching element tag using `.` followed by `//`. The `.` selects the current element (if we call the method on `tree`, that is the root element, in this case `ead`), and the double slash `//` selects any descendant element, at any depth, matching the element name supplied. Thus, `.//ead:titleproper` will select every `titleproper` element in the file:

In [63]:
for title in tree.findall('.//ead:titleproper', ns):
    print(title)

<Element '{http://ead3.archivists.org/schema/}titleproper' at 0x7fe048d80090>
<Element '{http://ead3.archivists.org/schema/}titleproper' at 0x7fe048d80770>
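
Selecting by attribute, mentioned at the start of this section, can be done with a predicate in square brackets. Here is a minimal sketch on a made-up fragment (`titlestmt`) modeled on the two `titleproper` elements seen earlier:

```python
import xml.etree.ElementTree as ET

ns = {'ead': 'http://ead3.archivists.org/schema/'}

# Hypothetical fragment modeled on the <titlestmt> shown earlier
titlestmt = ET.fromstring(
    '<titlestmt xmlns="http://ead3.archivists.org/schema/">'
    '<titleproper>Finding Aid for the William R. Day Collection</titleproper>'
    '<titleproper localtype="filing">William R. Day Collection</titleproper>'
    '</titlestmt>'
)

# [@localtype] keeps only the elements that carry the attribute at all
with_attr = titlestmt.findall('.//ead:titleproper[@localtype]', ns)

# [@localtype='filing'] additionally requires a specific value
filing = titlestmt.findall(".//ead:titleproper[@localtype='filing']", ns)

print(len(with_attr), filing[0].text)
```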


Similarly, we could look for the `c01` elements (representing the "first level" of components in the finding aid; here we will see that these represent the series in the collection):

In [96]:
did_count = 0

for obj in tree.findall('.//ead:c01', ns):
    did_count += 1
    # pull out the series id for the c01 element 
    print(f'Series id: {obj.attrib["id"]}\n',obj.tag, obj.attrib)
    # pull out and print the paragraph in the scopecontent note for the series 
    scope = obj.find('.//ead:scopecontent/ead:p', ns)
    print(scope.text,'\n')
    # look through the series and find the c02 second levels and their unittitles to see the various folders or subseries in the c01 level
    for item in obj.findall('.//ead:c02//ead:unittitle', ns):
        print(item.text)

Series id: aspace_ref1
 {http://ead3.archivists.org/schema/}c01 {'id': 'aspace_ref1', 'level': 'series'}
The Correspondence and Papers series contains correspondence and papers from William Day and various family members. 

William Day
1896
1897
1898 (3 folders)
1899
1900-1911
1920-1923
Undated
Biographical
Scrapbook
Family
Luther Day (Father)
Emily Spalding Day (Mother)
Louis Schaefer (Father-in-Law)
Other Members
Miscellaneous (3 folders)
Series id: aspace_ref18
 {http://ead3.archivists.org/schema/}c01 {'id': 'aspace_ref18', 'level': 'series'}
The Manuscripts series contains work by William Day and his son, Stephen Day. It also has a dissertation about William Day by Joseph McLean, and a folder of miscellaneous materials. 

William Day on McKinley
Stephan Day Notebook
McLean Dissertation on William Day (2 folders)
Miscellaneous
Series id: aspace_ref22
 {http://ead3.archivists.org/schema/}c01 {'id': 'aspace_ref22', 'level': 'series'}
The Newspaper series includes issues of the Univers
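
The loop above reads `scope.text` directly, which would raise an `AttributeError` for any series that has no scope note, since `find()` returns `None` in that case. A more defensive variant might use `findtext()` with a default, sketched here on a made-up `<dsc>` fragment (the ids and the placeholder text are assumptions):

```python
import xml.etree.ElementTree as ET

ns = {'ead': 'http://ead3.archivists.org/schema/'}

# Hypothetical components: the second one has no <scopecontent>
dsc = ET.fromstring(
    '<dsc xmlns="http://ead3.archivists.org/schema/">'
    '<c01 id="ref1"><scopecontent><p>First series.</p></scopecontent></c01>'
    '<c01 id="ref2"/>'
    '</dsc>'
)

summaries = {}
for component in dsc.findall('.//ead:c01', ns):
    # findtext() returns the default instead of failing when nothing matches
    note = component.findtext(
        './/ead:scopecontent/ead:p', default='(no scope note)', namespaces=ns
    )
    summaries[component.get('id')] = note

print(summaries)
```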

## Modifying XML with the etree module

The `etree` module can also be used to modify XML, including adding/modifying/removing attributes, reading and modifying elements, or adding text within an element.

`.set()` and `del()`
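
As a minimal sketch of these operations, assuming a made-up `<c01>` element (`item`) rather than the EAD file itself, and with attribute values chosen only for illustration:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical element; the attribute values are for illustration only
item = ET.fromstring('<c01 id="ref1" level="series"/>')

item.set('audience', 'internal')   # add (or overwrite) an attribute
del item.attrib['level']           # attrib is a dictionary, so del removes a key

# append() takes an Element, built here from a string with fromstring()
item.append(ET.fromstring('<unittitle>Correspondence</unittitle>'))

# write() accepts a filename or a file object; a buffer avoids touching disk here
buffer = io.BytesIO()
ET.ElementTree(item).write(buffer, encoding='utf-8', xml_declaration=True)
print(buffer.getvalue().decode('utf-8'))
```

Passing a filename string to `.write()` instead of the buffer would save the modified XML to disk.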