# XML: XPath and EAD examples

Some additional examples that use XPath to query sample EAD documents stored as XML.

In [1]:
import xml.etree.ElementTree as ET

The lxml library supports more of XPath than ElementTree does and can pretty-print its output. It also has useful functions such as `etree.SubElement()` (also available in ElementTree) and the `.getparent()` method, which ElementTree lacks. So let's import lxml too:

In [2]:
# depending on your VS Code setup or what python you're using, you may need to install lxml
from lxml import etree

In [4]:
tree = ET.parse('data/xml/superior-papers-complete.xml')
root = tree.getroot()

In [5]:
ns = {
    'ead': 'http://ead3.archivists.org/schema/'
}

You can double-check that `root` is an Element by looking at its tag, 
which is the name of the element (in this case the root `<ead></ead>` element): 

In [11]:
root.tag

'{http://ead3.archivists.org/schema/}ead'
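
The braces in that tag are how ElementTree writes a namespaced name: `{namespace URI}localname`. The `ns` dictionary lets queries use the shorter `ead:` prefix instead. A minimal sketch, assuming a small made-up EAD fragment (the `sample` string and the names `sample_root`, `by_prefix`, and `by_full_name` are hypothetical, not part of the Superior Papers file), suggesting that both spellings find the same element:

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for an EAD3 file
sample = """<ead xmlns="http://ead3.archivists.org/schema/">
  <control><recordid>demo-001</recordid></control>
  <archdesc level="collection"/>
</ead>"""

EAD_URI = 'http://ead3.archivists.org/schema/'
ns = {'ead': EAD_URI}

sample_root = ET.fromstring(sample)

# Query with the prefix defined in the ns dictionary...
by_prefix = sample_root.find('ead:control', ns)
# ...or spell out the full {uri}localname form that .tag reports
by_full_name = sample_root.find(f'{{{EAD_URI}}}control')

print(by_prefix.tag)
print(by_prefix is by_full_name)
```

The prefix itself is arbitrary; only the URI it maps to has to match the document.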

Use a basic loop to see the children of the selected element 
(in this case any subelements of the `<ead></ead>` element):

In [6]:
for element in root:
    print(element.tag)

{http://ead3.archivists.org/schema/}control
{http://ead3.archivists.org/schema/}archdesc


Use the `.iter()` method to walk through all elements in the tree, starting with the root itself:

In [8]:
for element in root.iter():
    print(element.tag)

{http://ead3.archivists.org/schema/}ead
{http://ead3.archivists.org/schema/}control
{http://ead3.archivists.org/schema/}recordid
{http://ead3.archivists.org/schema/}filedesc
{http://ead3.archivists.org/schema/}titlestmt
{http://ead3.archivists.org/schema/}titleproper
{http://ead3.archivists.org/schema/}titleproper
{http://ead3.archivists.org/schema/}author
{http://ead3.archivists.org/schema/}publicationstmt
{http://ead3.archivists.org/schema/}publisher
{http://ead3.archivists.org/schema/}address
{http://ead3.archivists.org/schema/}addressline
{http://ead3.archivists.org/schema/}addressline
{http://ead3.archivists.org/schema/}addressline
{http://ead3.archivists.org/schema/}date
{http://ead3.archivists.org/schema/}num
{http://ead3.archivists.org/schema/}p
{http://ead3.archivists.org/schema/}maintenancestatus
{http://ead3.archivists.org/schema/}maintenanceagency
{http://ead3.archivists.org/schema/}agencycode
{http://ead3.archivists.org/schema/}agencyname
{http://ead3.archivists.org/schema
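
Long listings like this are easier to scan with the namespace stripped and repeated tags counted. A minimal sketch, assuming a made-up fragment rather than the real file (`sample`, `local_name`, and `tag_counts` are hypothetical names introduced only for illustration):

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, heavily trimmed stand-in for an EAD3 file
sample = """<ead xmlns="http://ead3.archivists.org/schema/">
  <control>
    <recordid>demo-001</recordid>
    <filedesc><titlestmt><titleproper>Demo</titleproper><titleproper>Demo, The</titleproper></titlestmt></filedesc>
  </control>
  <archdesc level="collection"/>
</ead>"""
sample_root = ET.fromstring(sample)

def local_name(element):
    """Return the tag without its leading {namespace} part (hypothetical helper)."""
    return element.tag.split('}', 1)[-1]

# .iter() visits the starting element and every descendant
tag_counts = Counter(local_name(el) for el in sample_root.iter())
for name, count in tag_counts.most_common():
    print(f'{name}: {count}')
```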

Or you can combine these two to look at the two main elements (`control` and `archdesc`), then show all of the subelements of those two main sections:

In [12]:
for element in root:
    print(element.tag)
    for el in element.iter():
        print('  ', el.tag, el.attrib)

{http://ead3.archivists.org/schema/}control
   {http://ead3.archivists.org/schema/}control {'countryencoding': 'iso3166-1', 'dateencoding': 'iso8601', 'langencoding': 'iso639-2b', 'relatedencoding': 'marc', 'repositoryencoding': 'iso15511', 'scriptencoding': 'iso15924'}
   {http://ead3.archivists.org/schema/}recordid {'instanceurl': 'Reading Room & website'}
   {http://ead3.archivists.org/schema/}filedesc {}
   {http://ead3.archivists.org/schema/}titlestmt {}
   {http://ead3.archivists.org/schema/}titleproper {}
   {http://ead3.archivists.org/schema/}titleproper {'localtype': 'filing'}
   {http://ead3.archivists.org/schema/}author {}
   {http://ead3.archivists.org/schema/}publicationstmt {}
   {http://ead3.archivists.org/schema/}publisher {}
   {http://ead3.archivists.org/schema/}address {}
   {http://ead3.archivists.org/schema/}addressline {}
   {http://ead3.archivists.org/schema/}addressline {}
   {http://ead3.archivists.org/schema/}addressline {'localtype': 'email'}
   {http://ead3.

## Prefix and asterisk

Use the prefix and an asterisk to find all tags from a particular namespace:

In [51]:
allEadTags = root.findall('.//ead:*', ns)

for tag in allEadTags:
    print(tag.tag, tag.attrib)

{http://ead3.archivists.org/schema/}control {'countryencoding': 'iso3166-1', 'dateencoding': 'iso8601', 'langencoding': 'iso639-2b', 'relatedencoding': 'marc', 'repositoryencoding': 'iso15511', 'scriptencoding': 'iso15924'}
{http://ead3.archivists.org/schema/}recordid {'instanceurl': 'Reading Room & website'}
{http://ead3.archivists.org/schema/}filedesc {}
{http://ead3.archivists.org/schema/}titlestmt {}
{http://ead3.archivists.org/schema/}titleproper {}
{http://ead3.archivists.org/schema/}titleproper {'localtype': 'filing'}
{http://ead3.archivists.org/schema/}author {}
{http://ead3.archivists.org/schema/}publicationstmt {}
{http://ead3.archivists.org/schema/}publisher {}
{http://ead3.archivists.org/schema/}address {}
{http://ead3.archivists.org/schema/}addressline {}
{http://ead3.archivists.org/schema/}addressline {}
{http://ead3.archivists.org/schema/}addressline {'localtype': 'email'}
{http://ead3.archivists.org/schema/}date {}
{http://ead3.archivists.org/schema/}num {}
{http://ead3

## Prefix, asterisk, and attribute

Use the prefix and an asterisk, and look for any matching tag with a particular attribute. In this case, `level="collection"`:

In [65]:
levelTags = root.findall('.//ead:*[@level="collection"]', ns)

for tag in levelTags:
    print(tag.tag, tag.attrib)

{http://ead3.archivists.org/schema/}archdesc {'level': 'collection'}


Similar to above, but match the `level` attribute with the value "series", and 
add the attribute `checked="yes"` to each match, as if you were going through to indicate a process had been undertaken. 

In [66]:
# find all tags that have a level attribute that is set to "series"
levelTags = root.findall('.//ead:*[@level="series"]', ns)

for tag in levelTags:
    tag.set('checked', 'yes')
    print(tag.tag, tag.attrib)


{http://ead3.archivists.org/schema/}c {'id': 'aspace_edab5fb678b5e37fcf34130da5836686', 'level': 'series', 'checked': 'yes'}
{http://ead3.archivists.org/schema/}c {'id': 'aspace_cf40ab6e8beda6e2fdb51e0a08d30e06', 'level': 'series', 'checked': 'yes'}
{http://ead3.archivists.org/schema/}c {'id': 'aspace_5f14153a5cfd7f41c0b74cb59f429490', 'level': 'series', 'checked': 'yes'}
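
Because `.set()` changes the tree in memory, later queries can use the new attribute. A minimal sketch on a made-up fragment (the `sample` data and the `c1`/`c2`/`c3` ids are hypothetical) that chains two predicates to find series already marked as checked, and filters in Python for the rest, since ElementTree's XPath subset has no `not()`:

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for an EAD3 file
sample = """<ead xmlns="http://ead3.archivists.org/schema/">
  <archdesc level="collection">
    <dsc>
      <c id="c1" level="series"/>
      <c id="c2" level="series"/>
      <c id="c3" level="file"/>
    </dsc>
  </archdesc>
</ead>"""
ns = {'ead': 'http://ead3.archivists.org/schema/'}
sample_root = ET.fromstring(sample)

# Mark only the first series, as if it had been reviewed
first_series = sample_root.find('.//ead:c[@level="series"]', ns)
first_series.set('checked', 'yes')

# Chain two predicates: series components that are already checked
checked = sample_root.findall('.//ead:c[@level="series"][@checked="yes"]', ns)

# Series components still to do, filtered in Python
unchecked = [c for c in sample_root.findall('.//ead:c[@level="series"]', ns)
             if c.get('checked') != 'yes']

print([c.get('id') for c in checked])
print([c.get('id') for c in unchecked])
```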


## Prefix with asterisk and a specific attribute, no specified value

Here, find any tag from the EAD namespace that has a `level` attribute:

In [67]:
# look for any tags with level attribute
levelTags = root.findall('.//ead:*[@level]', ns)

for tag in levelTags:
    print(tag.tag, tag.attrib)

{http://ead3.archivists.org/schema/}archdesc {'level': 'collection'}
{http://ead3.archivists.org/schema/}c {'id': 'aspace_edab5fb678b5e37fcf34130da5836686', 'level': 'series', 'checked': 'yes'}
{http://ead3.archivists.org/schema/}c {'id': 'aspace_cf40ab6e8beda6e2fdb51e0a08d30e06', 'level': 'series', 'checked': 'yes'}
{http://ead3.archivists.org/schema/}c {'id': 'aspace_5f14153a5cfd7f41c0b74cb59f429490', 'level': 'series', 'checked': 'yes'}
{http://ead3.archivists.org/schema/}c {'id': 'aspace_0d77dcd4a4330f180ac372a192c51427', 'level': 'file'}


Using the asterisk, but only looking within the `archdesc` element:

In [68]:
archDesc = root.find('.//ead:archdesc', ns)

for tag in archDesc:
    print(tag.tag, tag.attrib)
    print('\nlooking for subElements:')
    subElements = tag.findall('.//ead:*', ns)
    for subE in subElements:
        print(subE.tag)

{http://ead3.archivists.org/schema/}did {}

looking for subElements:
{http://ead3.archivists.org/schema/}unittitle
{http://ead3.archivists.org/schema/}unitid
{http://ead3.archivists.org/schema/}unitid
{http://ead3.archivists.org/schema/}repository
{http://ead3.archivists.org/schema/}corpname
{http://ead3.archivists.org/schema/}part
{http://ead3.archivists.org/schema/}langmaterial
{http://ead3.archivists.org/schema/}language
{http://ead3.archivists.org/schema/}physdescstructured
{http://ead3.archivists.org/schema/}quantity
{http://ead3.archivists.org/schema/}unittype
{http://ead3.archivists.org/schema/}physfacet
{http://ead3.archivists.org/schema/}dimensions
{http://ead3.archivists.org/schema/}physdesc
{http://ead3.archivists.org/schema/}unitdatestructured
{http://ead3.archivists.org/schema/}daterange
{http://ead3.archivists.org/schema/}fromdate
{http://ead3.archivists.org/schema/}todate
{http://ead3.archivists.org/schema/}unitdate
{http://ead3.archivists.org/schema/}controlaccess {}


In [69]:
# '.' selects the current node, so this returns only the root element itself
elements = root.findall('.', ns)
for element in elements:
    print(element.tag, element.attrib)

{http://ead3.archivists.org/schema/}ead {}


In [71]:
# all elements with a level="series" attribute that have a child element did
elements = root.findall('.//ead:did/..[@level="series"]', ns)
for element in elements:
    print(element.tag, element.attrib)

{http://ead3.archivists.org/schema/}c {'id': 'aspace_edab5fb678b5e37fcf34130da5836686', 'level': 'series', 'checked': 'yes'}
{http://ead3.archivists.org/schema/}c {'id': 'aspace_cf40ab6e8beda6e2fdb51e0a08d30e06', 'level': 'series', 'checked': 'yes'}
{http://ead3.archivists.org/schema/}c {'id': 'aspace_5f14153a5cfd7f41c0b74cb59f429490', 'level': 'series', 'checked': 'yes'}


In [75]:
# all unitid elements that are descendants (at any depth) of a node with a level="series" attribute
elements = root.findall('.//ead:*[@level="series"]//ead:unitid', ns)
for element in elements:
    print(element.tag, element.attrib, element.text)

{http://ead3.archivists.org/schema/}unitid {'localtype': 'aspace_uri'} /repositories/2/archival_objects/3
{http://ead3.archivists.org/schema/}unitid {} 00-1332-01
{http://ead3.archivists.org/schema/}unitid {'localtype': 'aspace_uri'} /repositories/2/archival_objects/4
{http://ead3.archivists.org/schema/}unitid {} 00-1332-02
{http://ead3.archivists.org/schema/}unitid {'localtype': 'aspace_uri'} /repositories/2/archival_objects/1
{http://ead3.archivists.org/schema/}unitid {} deep-six
{http://ead3.archivists.org/schema/}unitid {'localtype': 'aspace_uri'} /repositories/2/archival_objects/2
{http://ead3.archivists.org/schema/}unitid {} deep-six


# Check and Correct or Remove based on criteria

Using an XPath expression, check for certain conditions, then correct or remove elements based on those conditions.

In [83]:
# An earlier version called root.remove(unitid), which raised
# "ValueError: list.remove(x): x not in list": Element.remove() only removes
# direct children, and these unitid elements are nested deeper in the tree.
# ElementTree elements do not know their parent, so build a child -> parent map first.
parent_map = {child: parent for parent in root.iter() for child in parent}
count_removals = 0

# remove all unitid nodes whose localtype attrib is "aspace_uri"
for unitid in root.findall('.//ead:unitid[@localtype]', ns):
    print(unitid.tag, unitid.attrib)
    localtype = unitid.get('localtype')
    print('localtype:', localtype)
    if localtype == 'aspace_uri':
        count_removals += 1
        print('removing...')
        # remove the element from its actual parent, not from root
        parent_map[unitid].remove(unitid)

# report
print(f'removed {count_removals} localtype elements.')
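
`Element.remove()` only removes direct children of the element it is called on, so a nested element has to be removed through its own parent. One possible approach, sketched here on a made-up fragment (the `sample` data and its repository paths are hypothetical), is to select the parents with a trailing `/..` and remove the matching children from each:

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for an EAD3 file
sample = """<ead xmlns="http://ead3.archivists.org/schema/">
  <archdesc level="collection">
    <did>
      <unitid localtype="aspace_uri">/repositories/0/resources/0</unitid>
      <unitid>demo-001</unitid>
    </did>
    <dsc>
      <c level="series">
        <did>
          <unitid localtype="aspace_uri">/repositories/0/archival_objects/0</unitid>
          <unitid>demo-001-01</unitid>
        </did>
      </c>
    </dsc>
  </archdesc>
</ead>"""
ns = {'ead': 'http://ead3.archivists.org/schema/'}
sample_root = ET.fromstring(sample)

removed = 0
# '/..' steps from each matching unitid up to its parent element
for parent in sample_root.findall('.//ead:unitid[@localtype="aspace_uri"]/..', ns):
    # findall returns a list, so removing while looping over it is safe
    for unitid in parent.findall('ead:unitid[@localtype="aspace_uri"]', ns):
        parent.remove(unitid)
        removed += 1

remaining = [u.text for u in sample_root.iter('{http://ead3.archivists.org/schema/}unitid')]
print(f'removed {removed}; remaining: {remaining}')
```

With lxml the same removal could go through `.getparent()` instead, since lxml elements keep a reference to their parent.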

# Write out / Save as XML

Examples of how to save the modified tree as well-formed XML that keeps the EAD namespace prefix (note that ElementTree does not validate the result against the EAD schema):

In [84]:
ET.register_namespace('ead', 'http://ead3.archivists.org/schema/')

In [91]:
print(ET.tostring(root, encoding='utf-8', xml_declaration=True).decode('utf-8'))

<?xml version='1.0' encoding='utf-8'?>
<ead:ead xmlns:ead="http://ead3.archivists.org/schema/">
  <ead:control countryencoding="iso3166-1" dateencoding="iso8601" langencoding="iso639-2b" relatedencoding="marc" repositoryencoding="iso15511" scriptencoding="iso15924"><ead:recordid instanceurl="Reading Room &amp; website">00-1332</ead:recordid><ead:filedesc><ead:titlestmt><ead:titleproper>The Superior Papers </ead:titleproper><ead:titleproper localtype="filing">Superior Papers, The</ead:titleproper><ead:author>Jesse Johnston</ead:author></ead:titlestmt><ead:publicationstmt><ead:publisher>SI 667: Foundations of Digital Curation, winter 2023</ead:publisher><ead:address><ead:addressline>105 S State St</ead:addressline><ead:addressline>Ann Arbor, Michigan 48109</ead:addressline><ead:addressline localtype="email">jajohnst@umich.edu</ead:addressline></ead:address><ead:date>July 10, 2023</ead:date><ead:num>00.1332</ead:num><ead:p /></ead:publicationstmt></ead:filedesc><ead:maintenancestatus valu

In [88]:
tree.write('data/xml/superior-papers.xml', xml_declaration=True, encoding='utf-8')
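
A quick way to check the output is to serialize, parse the result again, and compare. A minimal sketch on a made-up fragment (`sample`, `xml_text`, and `reparsed` are hypothetical names), assuming Python 3.9 or later for `ET.indent`; it only confirms that the output is well-formed, not that it is valid EAD:

```python
import xml.etree.ElementTree as ET

EAD_URI = 'http://ead3.archivists.org/schema/'

# Hypothetical, heavily trimmed stand-in for an EAD3 file
sample = (f'<ead xmlns="{EAD_URI}"><archdesc level="collection">'
          '<did><unittitle>Demo</unittitle></did></archdesc></ead>')
sample_root = ET.fromstring(sample)

# Same prefix registration as above; without it ElementTree writes ns0: prefixes
ET.register_namespace('ead', EAD_URI)

# ET.indent (Python 3.9+) adds line breaks and indentation in place
ET.indent(sample_root, space='  ')
xml_text = ET.tostring(sample_root, encoding='unicode')
print(xml_text)

# Round trip: parsing raises ParseError if the output is not well-formed
reparsed = ET.fromstring(xml_text)
print(reparsed.tag == sample_root.tag)
```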