# XML: XPath and EAD examples

Some additional examples of using XPath to query a sample EAD document stored as XML.

In [4]:
import xml.etree.ElementTree as ET

The lxml library supports more of XPath than `ElementTree` does and can also pretty-print its output. It adds useful extras such as the `.getparent()` method on elements, alongside the `etree.SubElement()` factory function. So let's import lxml too:

In [5]:
# depending on your VS Code setup or what python you're using, you may need to install lxml
from lxml import etree
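
As a rough illustration of the lxml features mentioned above, here is a minimal sketch. The inline XML string is a hypothetical stand-in for the sample file, and the `abstract` element is added purely for demonstration:

```python
from lxml import etree

# Hypothetical miniature EAD-like document, used only for illustration
EAD_NS = 'http://ead3.archivists.org/schema/'
ns = {'ead': EAD_NS}
xml = (
    f'<ead xmlns="{EAD_NS}"><archdesc level="collection">'
    '<did><unittitle>Superior Papers</unittitle></did>'
    '</archdesc></ead>'
)
lxml_root = etree.fromstring(xml)

# .xpath() accepts XPath 1.0 expressions, including node tests such as text()
titles = lxml_root.xpath('//ead:unittitle/text()', namespaces=ns)

# .getparent() walks upward from an element
unittitle = lxml_root.find('.//ead:unittitle', ns)
parent_tag = etree.QName(unittitle.getparent()).localname

# etree.SubElement() creates a new child and attaches it in one step
note = etree.SubElement(unittitle.getparent(), f'{{{EAD_NS}}}abstract')
note.text = 'Added for illustration.'

print(titles, parent_tag)
print(etree.tostring(lxml_root, pretty_print=True).decode())
```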

In [6]:
tree = ET.parse('data/xml/superior-papers.xml')
root = tree.getroot()

In [7]:
ns = {
    'ead': 'http://ead3.archivists.org/schema/'
}

In [None]:
for element in root:
    print(element.tag)

{http://ead3.archivists.org/schema/}control
{http://ead3.archivists.org/schema/}archdesc
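
The tags above are printed in `{namespace}localname` form (often called Clark notation), and the `ns` dictionary simply maps a short prefix onto that namespace URI. A small sketch of how the two forms relate, using a hypothetical inline document rather than the sample file; `local_name` is an illustrative helper, not part of `ElementTree`:

```python
import xml.etree.ElementTree as ET

EAD_NS = 'http://ead3.archivists.org/schema/'
ns = {'ead': EAD_NS}

# Hypothetical stand-in for the sample file
sample = ET.fromstring(
    f'<ead xmlns="{EAD_NS}"><control/><archdesc level="collection"/></ead>'
)

# The prefixed path and the Clark-notation path select the same element
by_prefix = sample.find('ead:archdesc', ns)
by_clark = sample.find(f'{{{EAD_NS}}}archdesc')

def local_name(tag):
    """Strip a leading '{namespace}' from a Clark-notation tag."""
    return tag.rsplit('}', 1)[-1]

child_names = [local_name(child.tag) for child in sample]
print(by_prefix is by_clark, child_names)
```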


## Prefix and asterisk

Use the prefix and an asterisk to find all tags from a particular namespace:

In [8]:
allEadTags = root.findall('.//ead:*', ns)

for tag in allEadTags:
    print(tag.tag, tag.attrib)

{http://ead3.archivists.org/schema/}control {'countryencoding': 'iso3166-1', 'dateencoding': 'iso8601', 'langencoding': 'iso639-2b'}
{http://ead3.archivists.org/schema/}recordid {'instanceurl': 'http://jajohnst.si.umich.edu/fake-ead.xml'}
{http://ead3.archivists.org/schema/}filedesc {}
{http://ead3.archivists.org/schema/}titlestmt {}
{http://ead3.archivists.org/schema/}titleproper {}
{http://ead3.archivists.org/schema/}publicationstmt {}
{http://ead3.archivists.org/schema/}publisher {}
{http://ead3.archivists.org/schema/}date {'normal': '2022-09-01'}
{http://ead3.archivists.org/schema/}archdesc {'level': 'collection', 'audience': 'external'}
{http://ead3.archivists.org/schema/}did {}
{http://ead3.archivists.org/schema/}repository {}
{http://ead3.archivists.org/schema/}corpname {}
{http://ead3.archivists.org/schema/}part {}
{http://ead3.archivists.org/schema/}part {}
{http://ead3.archivists.org/schema/}bioghist {}
{http://ead3.archivists.org/schema/}dsc {'dsctype': 'otherdsctype', 'audi
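
For comparison, `ElementTree` in Python 3.8 and later also accepts `{*}` as a wildcard for the namespace part. The sketch below assumes Python 3.8+ and uses a hypothetical inline document with a second, made-up namespace (`http://example.org/other`) to show the difference:

```python
import xml.etree.ElementTree as ET

EAD_NS = 'http://ead3.archivists.org/schema/'
ns = {'ead': EAD_NS}

# Hypothetical document mixing the EAD namespace with an invented one
sample = ET.fromstring(
    f'<ead xmlns="{EAD_NS}" xmlns:x="http://example.org/other">'
    '<control/><x:extra/><archdesc><did/></archdesc></ead>'
)

# Any tag name, but only in the EAD namespace
ead_only = sample.findall('.//ead:*', ns)

# A specific tag name in any namespace
any_ns_did = sample.findall('.//{*}did')

# Every descendant, regardless of namespace
everything = sample.findall('.//*')

print(len(ead_only), len(any_ns_did), len(everything))
```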

## Prefix, asterisk, and attribute

Use the prefix and an asterisk to match any tag with a particular attribute value. In this case, `audience="internal"`:

In [9]:
internalTags = root.findall('.//ead:*[@audience="internal"]', ns)

for tag in internalTags:
    print(tag.tag, tag.attrib)

{http://ead3.archivists.org/schema/}c03 {'level': 'file', 'audience': 'internal'}


Similar to the above, but matching `audience="external"`. Each match is also marked with a `checked="yes"` attribute:

In [None]:
# look for any tags that have an audience attribute that is set to "external"
externalTags = root.findall('.//ead:*[@audience="external"]', ns)

for tag in externalTags:
    # set() adds (or overwrites) an attribute on the in-memory element
    tag.set('checked', 'yes')
    print(tag.tag, tag.attrib)


{http://ead3.archivists.org/schema/}archdesc {'level': 'collection', 'audience': 'external', 'checked': 'yes'}
{http://ead3.archivists.org/schema/}dsc {'dsctype': 'otherdsctype', 'audience': 'external', 'checked': 'yes'}


## Prefix with asterisk and a specific attribute, no specified value

Here any tag from the EAD namespace that has an `audience` attribute:

In [None]:
# look for any tags with an audience attribute, whatever its value
audienceTags = root.findall('.//ead:*[@audience]', ns)

for tag in audienceTags:
    print(tag.tag, tag.attrib)

{http://ead3.archivists.org/schema/}archdesc {'level': 'collection', 'audience': 'external', 'checked': 'yes'}
{http://ead3.archivists.org/schema/}dsc {'dsctype': 'otherdsctype', 'audience': 'external', 'checked': 'yes'}
{http://ead3.archivists.org/schema/}c03 {'level': 'file', 'audience': 'internal'}
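
Once every element with an `audience` attribute has been selected, the values can be summarized. A minimal sketch, assuming a hypothetical inline document in place of the sample file:

```python
import xml.etree.ElementTree as ET
from collections import Counter

EAD_NS = 'http://ead3.archivists.org/schema/'
ns = {'ead': EAD_NS}

# Hypothetical stand-in for the sample file
sample = ET.fromstring(
    f'<ead xmlns="{EAD_NS}" audience="external">'
    '<archdesc level="collection" audience="external">'
    '<dsc audience="external"><c01><c03 audience="internal"/></c01></dsc>'
    '</archdesc></ead>'
)

# './/' searches descendants only, so the root's own attribute is not counted
tally = Counter(
    el.get('audience') for el in sample.findall('.//ead:*[@audience]', ns)
)

# Descendants that carry no audience attribute at all
unmarked = [
    el.tag.rsplit('}', 1)[-1]
    for el in sample.findall('.//ead:*', ns)
    if 'audience' not in el.attrib
]

print(tally, unmarked)
```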


Using the asterisk, but only looking within the `archdesc` element:

In [24]:
archDesc = root.find('.//ead:archdesc', ns)

for tag in archDesc:
    print(tag.tag, tag.attrib)
    print('\nlooking for subElements:')
    subElements = tag.findall('.//ead:*', ns)
    for subE in subElements:
        print(subE.tag)

{http://ead3.archivists.org/schema/}did {}

looking for subElements:
{http://ead3.archivists.org/schema/}repository
{http://ead3.archivists.org/schema/}corpname
{http://ead3.archivists.org/schema/}part
{http://ead3.archivists.org/schema/}part
{http://ead3.archivists.org/schema/}bioghist {}

looking for subElements:
{http://ead3.archivists.org/schema/}dsc {'dsctype': 'otherdsctype', 'audience': 'external'}

looking for subElements:
{http://ead3.archivists.org/schema/}c01
{http://ead3.archivists.org/schema/}c02
{http://ead3.archivists.org/schema/}c03
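
The cell above mixes two kinds of traversal: iterating over an element yields only its direct children, while a `.//` path reaches every descendant. A small sketch of the difference, again on a hypothetical inline document; `local` is an illustrative helper:

```python
import xml.etree.ElementTree as ET

EAD_NS = 'http://ead3.archivists.org/schema/'
ns = {'ead': EAD_NS}

# Hypothetical stand-in for the sample file
sample = ET.fromstring(
    f'<ead xmlns="{EAD_NS}"><archdesc>'
    '<did><repository/></did><dsc><c01/></dsc>'
    '</archdesc></ead>'
)

def local(el):
    """Return the tag name without its '{namespace}' part."""
    return el.tag.rsplit('}', 1)[-1]

archdesc = sample.find('ead:archdesc', ns)

# Iterating an element visits direct children only
children = [local(c) for c in archdesc]

# './/' visits all descendants, in document order, excluding archdesc itself
descendants = [local(e) for e in archdesc.findall('.//ead:*', ns)]

# .iter() visits the element itself and then all of its descendants
with_self = [local(e) for e in archdesc.iter()]

print(children, descendants, with_self)
```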


In [29]:
# '.' selects the current node, so this returns the root element itself
elements = root.findall('.', ns)
for element in elements:
    print(element.tag, element.attrib)

{http://ead3.archivists.org/schema/}ead {'audience': 'external'}


In [35]:
# all elements with an audience="external" attribute that have a child element did
elements = root.findall('.//ead:did/..[@audience="external"]', ns)
for element in elements:
    print(element.tag, element.attrib)

{http://ead3.archivists.org/schema/}archdesc {'level': 'collection', 'audience': 'external'}
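
`ElementTree` elements do not keep a reference to their parent, so the `..` step is the built-in way to move upward inside a path. When the parent is needed outside of a path expression, one common approach is to build a child-to-parent dictionary. A minimal sketch on a hypothetical inline document; `parent_map` is just an illustrative variable name:

```python
import xml.etree.ElementTree as ET

EAD_NS = 'http://ead3.archivists.org/schema/'
ns = {'ead': EAD_NS}

# Hypothetical stand-in for the sample file
sample = ET.fromstring(
    f'<ead xmlns="{EAD_NS}">'
    '<archdesc level="collection" audience="external"><did/></archdesc>'
    '</ead>'
)

# Map every child element to its parent in a single pass over the tree
parent_map = {child: parent for parent in sample.iter() for child in parent}

did = sample.find('.//ead:did', ns)
parent_from_map = parent_map[did]

# The same parent, reached with the '..' step inside the path
parent_from_path = sample.find('.//ead:did/..', ns)

print(parent_from_map is parent_from_path, parent_from_map.attrib)
```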


In [43]:
# all corpname elements that are descendants of a node with an external audience attribute
elements = root.findall('.//ead:*[@audience="external"]//ead:corpname', ns)
for element in elements:
    print(element.tag, element.attrib)

{http://ead3.archivists.org/schema/}corpname {}


In [107]:
# Remove every c03 element whose audience attribute is "internal".
# Element.remove() only removes direct children, so calling root.remove() on a
# nested c03 raises "ValueError: list.remove(x): x not in list". Select each
# c03's parent with '/..' and remove the child from that parent instead.
count_removals = 0
for parent in root.findall('.//ead:c03/..', ns):
    for eadNode in parent.findall('ead:c03', ns):
        audience = eadNode.get('audience')
        print(eadNode.tag, 'audience:', audience)
        if audience == 'internal':
            count_removals += 1
            print('removing...')
            parent.remove(eadNode)

print(f'removed {count_removals} internal elements.')
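
A removal has to be made on the element's direct parent, because `Element.remove()` only looks at direct children. The sketch below generalizes this to any element with `audience="internal"`, not just `c03`. It runs on a hypothetical inline document, and `remove_matching` is an illustrative helper rather than part of `ElementTree`:

```python
import xml.etree.ElementTree as ET

EAD_NS = 'http://ead3.archivists.org/schema/'
ns = {'ead': EAD_NS}

# Hypothetical stand-in for the sample file
sample = ET.fromstring(
    f'<ead xmlns="{EAD_NS}"><archdesc><dsc><c01><c02>'
    '<c03 audience="internal"/><c03 audience="external"/>'
    '</c02></c01></dsc></archdesc></ead>'
)

def remove_matching(root, predicate):
    """Remove every descendant for which predicate(element) is true."""
    removed = 0
    # Materialize the traversal first so the tree is not modified mid-iteration
    for parent in list(root.iter()):
        # Copy the child list because parent.remove() mutates it
        for child in list(parent):
            if predicate(child):
                parent.remove(child)
                removed += 1
    return removed

removed = remove_matching(sample, lambda el: el.get('audience') == 'internal')
remaining = sample.findall('.//ead:*[@audience="internal"]', ns)

print(f'removed {removed} internal elements.')
```

With lxml, the same thing could be done by calling `.getparent().remove(element)` on each match.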

# Write out / Save as XML

Examples of how to save the modified tree as well-formed and valid XML with EAD namespace:

In [25]:
ET.register_namespace('ead', 'http://ead3.archivists.org/schema/')
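
`register_namespace()` only affects serialization: it tells `ElementTree` which prefix to write for a namespace URI instead of the generated `ns0`. The registration is global for the running interpreter. A small sketch of the effect on a hypothetical inline document:

```python
import xml.etree.ElementTree as ET

EAD_NS = 'http://ead3.archivists.org/schema/'

# Hypothetical stand-in for the sample file
sample = ET.fromstring(f'<ead xmlns="{EAD_NS}"><control/></ead>')

# Ask the serializer to use the 'ead' prefix for this namespace from now on
ET.register_namespace('ead', EAD_NS)

# encoding='unicode' returns a str instead of bytes
serialized = ET.tostring(sample, encoding='unicode')
print(serialized)
```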

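The `ET.tostring()` function works, but it returns a bytes object that prints as one long escaped line; lxml's `etree.tostring()` accepts `pretty_print=True` for indented output. The `ET` version: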

In [16]:
print(ET.tostring(root))

b'<ns0:ead xmlns:ns0="http://ead3.archivists.org/schema/" audience="external">\n    <ns0:control countryencoding="iso3166-1" dateencoding="iso8601" langencoding="iso639-2b">\n        <ns0:recordid instanceurl="http://jajohnst.si.umich.edu/fake-ead.xml">1234</ns0:recordid>\n        <ns0:filedesc>\n            <ns0:titlestmt>\n                <ns0:titleproper>A Finding Aid for the Superior Papers</ns0:titleproper>\n            </ns0:titlestmt>\n            <ns0:publicationstmt>\n                <ns0:publisher>University of Michigan School of Information</ns0:publisher>\n                <ns0:date normal="2022-09-01">September 2022</ns0:date>\n            </ns0:publicationstmt>\n        </ns0:filedesc>\n    </ns0:control>\n    <ns0:archdesc level="collection" audience="external">\n        <ns0:did>\n            <ns0:repository>\n                <ns0:corpname>\n                    <ns0:part>University of Michigan</ns0:part>\n                    <ns0:part>School of Information</ns0:part>\n  
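
If lxml is not available, the standard library can produce indented output too. This sketch assumes Python 3.9 or later, where `ET.indent()` was added, and uses a hypothetical inline document. Note that `ET.indent()` modifies the tree in place by rewriting whitespace between elements:

```python
import xml.etree.ElementTree as ET

EAD_NS = 'http://ead3.archivists.org/schema/'

# Hypothetical stand-in for the sample file, written without any whitespace
sample = ET.fromstring(
    f'<ead xmlns="{EAD_NS}"><archdesc><did/></archdesc></ead>'
)

# Insert newlines and four-space indentation between elements (Python 3.9+)
ET.indent(sample, space='    ')

# encoding='unicode' returns a str, so newlines print as real line breaks
pretty = ET.tostring(sample, encoding='unicode')
print(pretty)
```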

In [102]:
# note: this overwrites the file that was parsed at the start of the notebook
tree.write('data/xml/superior-papers.xml', xml_declaration=True, encoding='utf-8')
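
To keep the original file untouched while experimenting, the tree can be written to a different path and parsed again as a quick check. A minimal sketch using a temporary directory and a hypothetical inline document; the output file name is made up for the example:

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

EAD_NS = 'http://ead3.archivists.org/schema/'
ET.register_namespace('ead', EAD_NS)

# Hypothetical stand-in for the modified tree
sample = ET.fromstring(
    f'<ead xmlns="{EAD_NS}"><archdesc audience="external" checked="yes"/></ead>'
)
tree = ET.ElementTree(sample)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output name, chosen so the source file is not overwritten
    out_path = Path(tmp) / 'superior-papers-edited.xml'
    tree.write(str(out_path), xml_declaration=True, encoding='utf-8')

    # Read the file back to confirm that it is well-formed and round-trips
    first_line = out_path.read_text(encoding='utf-8').splitlines()[0]
    reloaded = ET.parse(str(out_path)).getroot()

print(first_line)
print(reloaded.tag, reloaded[0].attrib)
```

Parsing the file back only confirms that it is well-formed; checking validity against the EAD schema would need a separate validation step.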