# Navigating XML Document Trees Using `lxml.etree`

Virtually all user-friendly tutorials for accessing XML document trees describe `xml.etree.ElementTree`. That's all fine, but another library, `lxml.etree`, has a slight edge in functionality, notably its XPath implementation and its `getparent()` method. However, its documentation expects users to have learned `ElementTree` first and is thus of little use to newcomers. It also focuses on _writing_ XML, whereas what we need here is _reading_ it. So we'll learn by example how to extract text nodes and metadata from TEI XML documents using `lxml.etree`.

In [1]:
import os
from lxml import etree
from git import Repo

We'll make sure the ECHOE repository has been cloned so we have XML documents to work with:

In [2]:
remote = 'https://github.com/ECHOEProject/echoe.git'
local = 'echoe'
# Only clone if the target folder doesn't already exist:
if not os.path.exists(local):
    repo = Repo.clone_from(remote, local)
# Else, just update the working copy from remote:
else:
    repo = Repo(local)
    assert isinstance(repo, Repo)
    repo.remotes.origin.pull()
assert not repo.bare

## Namespaces

One of the biggest headaches of XML APIs is namespaces. Essentially, because an XML document may mix elements from different schemas (i.e. custom rules for what elements are valid in what contexts), element names can be ambiguous between two or more schemas. Namespace-aware XML documents therefore associate every element with the corresponding namespace. Since a default namespace declaration is inherited down the document tree, a TEI document typically declares the TEI namespace only in its root tag (`<TEI xmlns="http://www.tei-c.org/ns/1.0">`), and everything below it then automatically counts as TEI as well unless prefixed by another namespace abbreviation (e.g. `<me:norm>` to use a Menota tag `<norm>` for normalized text). The downside of all this namespace attribution is that if you try to iterate through `<w>` elements in a TEI document (say), your query will come up empty unless you make clear that you mean `<w>` elements _from the TEI namespace_.

If you have the stomach for it, you can read [the `lxml` tutorial's section on namespaces](https://lxml.de/tutorial.html#namespaces). But it comes down to two options:

1. Either you prefix each element with the namespace URI in curly braces, as below:
   
   ```python
   for element in root.iter('{http://www.tei-c.org/ns/1.0}resp'):
       # tostring() expects the element itself, not its .text string:
       print(etree.tostring(element, method='text', encoding='unicode'))
   ```

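2. Or you set up a shorthand as follows:
   
   ```python
   tei_namespace = 'http://www.tei-c.org/ns/1.0'
   tei = '{%s}' % tei_namespace
   NS = {'tei': tei_namespace}
   ```

   after which you can write element names as `tei + 'w'`, or use prefixed names such as `tei:w` in path expressions and pass the argument `namespaces=NS` to the methods that accept it (`find()`, `findall()`, `iterfind()`, `xpath()`). Note that the `nsmap` argument demonstrated in the `lxml` manual serves a different purpose: it declares namespaces when _creating_ elements.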

I recommend you start with the former solution, and then once you've confirmed your code works you can always experiment with the second.
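
To make the two options concrete, here is a minimal sketch on a made-up TEI fragment. The sample string, the variable names, and the prefix `tei` are assumptions chosen for illustration, not part of the ECHOE files:

```python
from lxml import etree

# A made-up, minimal TEI fragment for illustration only.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    '<text><body><p><w>ic</w> <w>eom</w></p></body></text>'
    '</TEI>'
)
sample_root = etree.fromstring(sample)

# Without a namespace the query comes up empty:
bare = list(sample_root.iter('w'))
print(len(bare))

# Option 1: the namespace URI in curly braces.
braces = [w.text for w in sample_root.iter('{http://www.tei-c.org/ns/1.0}w')]

# Option 2: a prefix-to-URI mapping passed as `namespaces`.
# The prefix "tei" is our own choice; it need not occur in the document.
NS = {'tei': 'http://www.tei-c.org/ns/1.0'}
mapped = [w.text for w in sample_root.findall('.//tei:w', namespaces=NS)]
via_xpath = [w.text for w in sample_root.xpath('//tei:w', namespaces=NS)]

print(braces, mapped, via_xpath)
```

All three namespaced queries should return the same two words, so which notation you use is mostly a matter of readability.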

## Loading Your Document

We don't need to open and read the file ourselves before accessing our document. Instead, we initialize our parser with the desired options, load our document into a document tree by passing its location on disk directly to `etree.parse()`, and navigate to the root node, as follows:

In [3]:
parser = etree.XMLParser(remove_blank_text=True, resolve_entities=True)
filename = 'echoe/xml/344.05.xml'
tree = etree.parse(filename, parser=parser)
root = tree.getroot()
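
To see what the `remove_blank_text` option does, the following sketch parses a made-up, indented fragment from memory with and without it. The fragment and variable names are assumptions for illustration:

```python
from lxml import etree

# Made-up, indented fragment; the indentation is whitespace-only text.
pretty = b'<body>\n  <p>\n    <w>ic</w>\n  </p>\n</body>'

default_root = etree.fromstring(pretty)
stripped_root = etree.fromstring(
    pretty, parser=etree.XMLParser(remove_blank_text=True))

print(repr(default_root.text))        # the indentation is kept as text
print(repr(stripped_root.text))       # the whitespace-only node is dropped
print(etree.tostring(stripped_root))
```

Whitespace-only nodes between tags are pure layout, so dropping them at parse time saves some cleanup later. Whitespace inside running text is left alone.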

## Tree Navigation, ElementPath, and XPath

At this point, if there's some part of the document you're especially interested in, such as the text or some part of the header, you may as well assign the lowest node containing that part of the tree to a variable for easy access, using `lxml`'s `find()` method. However, for this and a number of other features you'll need to rely on XPath-style tree navigation, either using actual XPath as implemented in `etree`'s `xpath()` method or using the simplified syntax called ElementPath, which is used in both ElementTree and `lxml.etree`. Both will take some getting used to! Here are a few pointers:

| Syntax | Refers to                                              | Example                                    |
| ------ | ------------------------------------------------------ | ------------------------------------------ |
| `.`    | Current node                                           | (see below)                                |
| `/`    | Step separator: use to navigate between related nodes  | `./w[1]` (first `w` child of present node) |
| `/`    | Used initially: start from the document root           | `/TEI/teiHeader`                           |
| `//`   | Descendant shorthand: "anywhere below this point"      | `//w` (to select all `w` in the document)  |

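Thus far, the syntax holds for both ElementPath and XPath, with two caveats: positions such as `[1]` count from 1, not 0, and an ElementPath query run on an element cannot begin with `/`, so start it with `./` or `.//` instead. Beyond that, XPath has axes like `child::` and `preceding-sibling::`, while `lxml` embeds its navigation in methods like `find()` and `getparent()`, and ElementPath is on the whole more limited. Read an online XPath tutorial for more guidance.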
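
The difference between the two styles is easiest to see side by side. The sketch below uses a made-up fragment; the names `TEI`, `NS`, and `doc` are assumptions introduced for illustration:

```python
from lxml import etree

TEI = '{http://www.tei-c.org/ns/1.0}'
NS = {'tei': 'http://www.tei-c.org/ns/1.0'}

# Made-up fragment for illustration only.
doc = etree.fromstring(
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<l><w>mid</w><w>rihte</w><w>rice</w></l>'
    '</body></text></TEI>'
)

line = doc.find('.//' + TEI + 'l')

# Positions are 1-based in both ElementPath and XPath:
first = line.find('./' + TEI + 'w[1]')

# ElementPath has no sibling axes; lxml offers methods instead:
second = first.getnext()
parent = second.getparent()

# Full XPath can express the same relation inside the query itself:
before_last = line.xpath(
    './tei:w[last()]/preceding-sibling::tei:w', namespaces=NS)

print(first.text, second.text, [w.text for w in before_last])
```

Note that `xpath()` needs the prefix mapping, while `find()` is happy with the namespace URI in curly braces.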

In [4]:
# You'd think you should be able to drop the period here, 
# but not if we start in a defined position like "root"!
text = root.find('.//{http://www.tei-c.org/ns/1.0}text')


## Text Nodes

Let's say we want to ignore anything above the level of `<w>` (i.e. any sentence-like segments or verse lines), but want to store the text node of each word as a token.

To retrieve every matching node we can use the `iter()` method, which returns all nodes of the specified element name(s) below the specified node in the tree (several names may be passed, either as separate arguments or as a list). Or we can use `findall()`, which runs an ElementPath query and returns all matches. Essentially, as long as you just want to process all nodes of a certain element type, use `iter()`; reserve `findall()` for more complex queries, e.g. involving attribute value matching. Also keep in mind that `iter()` simply takes (namespaced!) element names as its argument, while `findall()` may need an ElementPath-style starting point, such as `.//`.

The problem with just querying `w.text` is that TEI documents are full of mixed content, i.e. your text node may be interrupted by another element. Since `w.text` only holds the text up to the first child element, you would lose everything after that point. So instead, we'll use the `etree.tostring()` function to turn the whole `<w>` node into a text string, descendant nodes and all. We'll use the arguments `method="text"` and `encoding="unicode"` for this:

In [5]:
tokens = []
for token in text.iter('{http://www.tei-c.org/ns/1.0}w'):
    tokens.append(etree.tostring(token, method='text', encoding='unicode'))
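
The mixed-content problem can be seen on a made-up fragment. The sketch below compares `.text` with `etree.tostring()` and also shows that `tostring()` includes the element's tail (the text following its closing tag) unless `with_tail=False` is passed. Whether tails matter in the ECHOE files is not something this example establishes:

```python
from lxml import etree

# Made-up word with an embedded element, followed by tail text.
p = etree.fromstring('<p><w>cyn<ex>in</ex>g</w> tail</p>')
w = p[0]

only_text = w.text
with_tail = etree.tostring(w, method='text', encoding='unicode')
without_tail = etree.tostring(
    w, method='text', encoding='unicode', with_tail=False)
joined = ''.join(w.itertext())

print(repr(only_text))     # stops at the first child element
print(repr(with_tail))     # whole word plus the text after </w>
print(repr(without_tail))  # whole word only
print(repr(joined))        # itertext() is an alternative to tostring()
```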

In [6]:
tokens[:25]

['VRVM',
 'ǷEALDENDE',
 'RIHTGELẎFENDV̅M',
 'A',
 'ǷORVLD\n    a',
 'ƿoruld',
 'minum',
 'þam',
 'leofeſtan',
 'hlaforde',
 'ofer',
 'ealle',
 'oðre',
 'men',
 'eorðlice',
 'kẏninga\uf127',
 'alfƿold',
 'ea\uf127t\n    engla',
 'kẏning',
 'mid',
 'rihte',
 '⁊',
 'mid',
 'geri\uf127enūm',
 'rice']

That's a start, but there's room for improvement. Clearly we need to normalize by stripping

- spaces
- newlines
- at least some unusual characters (perhaps things like "ſ" and "ẏ," but certainly `\uf127`, which represents descending s)
- parallel environments: delete `<abbr>` if printing `<expan>`

Let's start with that last problem. We can make a list of any elements we need ignored, then nuke them before gathering our tokens (don't worry, the input document is not affected; we're just deleting nodes from the document loaded into memory at this stage):

In [7]:
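# We'll make a list of elements to get rid of,
# which we can always add to later:
delenda = ['abbr', 'am', 'sic', 'note']
# Now we search and destroy:
query = ['{http://www.tei-c.org/ns/1.0}' + i for i in delenda]
# strip_elements() deletes each matching element and its subtree;
# with_tail=False keeps the text that follows the deleted element,
# which remove() would silently discard along with it:
etree.strip_elements(text, *query, with_tail=False)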
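
One detail worth checking when deleting nodes: in `lxml`, the text between an element's closing tag and the next tag (its tail) belongs to that element, so `getparent().remove()` discards it too, whereas `etree.strip_elements()` can keep it with `with_tail=False`. A minimal sketch on a made-up word (the `<am>` content is invented for illustration):

```python
from lxml import etree

def fresh():
    # Made-up word: the text "ld" is the tail of <am>.
    return etree.fromstring('<w>woru<am>~</am>ld</w>')

# remove() drops the element together with its tail text:
w1 = fresh()
am = w1.find('am')
am.getparent().remove(am)
removed = etree.tostring(w1, method='text', encoding='unicode')

# strip_elements(..., with_tail=False) drops the element but keeps the tail:
w2 = fresh()
etree.strip_elements(w2, 'am', with_tail=False)
stripped = etree.tostring(w2, method='text', encoding='unicode')

print(repr(removed), repr(stripped))
```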

Now if we rerun the two token-gathering cells (In [5] and In [6]), the abbreviation markers are gone.

Time to move on to traditional normalization:

In [8]:
# Create a dict with characters we want replaced:
substitutions = {
    'ę': 'æ',
    'ƿ': 'w',
    'ẏ': 'y',
    'ſ': 's',
    '\uf127': 's', # Descending s, a private-use code point, written as an escape so it stays visible
    'v': 'u',
    'j': 'i',
    '⁊': 'and',
    ' ': '',
    '\n': ''
}

# Write a function carrying out the desired operations:
def normalize(token):
    # Lowercase:
    token = token.lower()
    for k,v in substitutions.items():
        # Carry out replacements:
        token = token.replace(k, v)
    return token

Now we simply work our function `normalize()` into the existing routine:

In [9]:
tokens = []
for token in text.iter('{http://www.tei-c.org/ns/1.0}w'):
    tokens.append(normalize(etree.tostring(token, method='text', encoding='unicode')))

In [10]:
tokens[:25]

['urum',
 'wealdende',
 'rihtgelyfendum',
 'a',
 'worulda',
 'woruld',
 'minum',
 'þam',
 'leofestan',
 'hlaforde',
 'ofer',
 'ealle',
 'oðre',
 'men',
 'eorðlice',
 'kyningas',
 'alfwold',
 'eastengla',
 'kyning',
 'mid',
 'rihte',
 'and',
 'mid',
 'gerisenum',
 'rice']
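
To group tokens by verse line rather than collecting one flat list, an outer loop over the segmenting element is enough. The sketch below assumes a made-up fragment segmented with TEI's `<l>` (verse line) element; other documents may use a different element, such as `<s>`, in which case only the outer element name changes:

```python
from lxml import etree

TEI = '{http://www.tei-c.org/ns/1.0}'

# Made-up fragment with two verse lines, for illustration only.
sample_text = etree.fromstring(
    '<text xmlns="http://www.tei-c.org/ns/1.0"><body>'
    '<l n="1"><w>mid</w><w>rihte</w></l>'
    '<l n="2"><w>and</w><w>mid</w><w>rice</w></l>'
    '</body></text>'
)

lines = []
for line in sample_text.iter(TEI + 'l'):
    # Inner loop: the same token extraction as before, but per line.
    words = [
        etree.tostring(w, method='text', encoding='unicode')
        for w in line.iter(TEI + 'w')
    ]
    lines.append(words)

print(lines)
```

In the real routine, each string would additionally be passed through `normalize()`.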

## Metadata

That's great: we now have a list of tokens for our document. If we want to go by verse line or sentence-like segment, we simply add an extra level of looping. But what about all that metadata we've encoded?

At present, ECHOE does not yet encode the kind of metadata one might expect on every word, such as part of speech and lemma. So to demonstrate how to read the attributes encoded on elements in your document, we can move away from the `<w>` element and query any proper nouns instead:

In [11]:
names = []
elements = ['placeName', 'persName']
query = ['{http://www.tei-c.org/ns/1.0}' + i for i in elements]
for token in text.iter(query):
    data = dict()
    data['form'] = normalize(etree.tostring(token, method='text', encoding='unicode'))
    # Filtering out "#", which is used in ECHOE documents
    # so a geotag can be applied:
    data['label'] = token.get('key').replace('#', '')
    names.append(data)

In [12]:
names

[{'form': 'alfwold', 'label': 'Ælfwald'},
 {'form': 'felix', 'label': 'Felix'},
 {'form': 'guðhlaces', 'label': 'Guthlac'},
 {'form': 'guðlaces', 'label': 'Guthlac'},
 {'form': 'guðlaces', 'label': 'Guthlac'},
 {'form': 'wilfrides', 'label': 'Wilfrid'},
 {'form': 'angelcynnesland', 'label': 'England'},
 {'form': 'æþelredes', 'label': 'Æthelred'},
 {'form': 'myrcnarice', 'label': 'Mercia'},
 {'form': 'penwald', 'label': 'Penwealh'},
 {'form': 'tette', 'label': 'Tette'},
 {'form': 'middelenglaland', 'label': 'MiddleAnglia'},
 {'form': 'guþlac', 'label': 'Guthlac'},
 {'form': 'guðlac', 'label': 'Guthlac'},
 {'form': 'guthlac', 'label': 'Guthlac'},
 {'form': 'cristes', 'label': 'Jesus'},
 {'form': 'cristes', 'label': 'Jesus'},
 {'form': 'crist', 'label': 'Jesus'},
 {'form': 'hrypadun', 'label': 'Repton'},
 {'form': 'petres', 'label': 'Peter'},
 {'form': 'ælfðryðe', 'label': 'Ælfthryth'},
 {'form': 'bretonelande', 'label': 'Britain'},
 {'form': 'grante', 'label': 'Cam'},
 {'form': 'grantece

Of course for the sort of data here retrieved, this is not the most sensible way to organize the results. We will instead want to create a single dictionary with the labels as keys, and a list of distinct forms as the value. We can use the syntax `persName[@key = 'Moses']` to select only those element nodes where the attribute `key` has the value `Moses`:

In [13]:
labels = []
elements = ['placeName', 'persName']
query = ['{http://www.tei-c.org/ns/1.0}' + i for i in elements]
for token in text.iter(query):
    labels.append(token.get('key').replace('#', ''))
labels = list(set(labels))

data = dict()
for label in labels:
    forms = []
    # Place names carry the "#" prefix in their key; personal names do not:
    queries = [
        './/{http://www.tei-c.org/ns/1.0}placeName[@key = "#' + label + '"]',
        './/{http://www.tei-c.org/ns/1.0}persName[@key = "' + label + '"]',
    ]
    for query in queries:
        for token in text.findall(query):
            forms.append(normalize(etree.tostring(token, method='text', encoding='unicode')))
    data[label] = list(set(forms))
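
The attribute queries above are a good way to practise ElementPath predicates. If all you need is the grouped result, a single pass over the tree is an alternative. The following is a sketch under assumptions: the fragment is made up, its `key` values merely imitate the pattern described above, and `forms_by_label` is a name introduced here:

```python
from collections import defaultdict
from lxml import etree

TEI = '{http://www.tei-c.org/ns/1.0}'

# Made-up fragment for illustration only.
sample_text = etree.fromstring(
    '<text xmlns="http://www.tei-c.org/ns/1.0"><body><p>'
    '<persName key="Guthlac">guðlac</persName> '
    '<persName key="Guthlac">guðlaces</persName> '
    '<placeName key="#Mercia">myrcnarice</placeName> '
    '<persName key="Guthlac">guðlac</persName>'
    '</p></body></text>'
)

forms_by_label = defaultdict(set)
for name in sample_text.iter(TEI + 'placeName', TEI + 'persName'):
    key = name.get('key')
    if key is None:
        continue  # skip names without a key attribute
    form = etree.tostring(
        name, method='text', encoding='unicode', with_tail=False)
    forms_by_label[key.replace('#', '')].add(form)

# Sorted lists make the output reproducible between runs.
result = {label: sorted(forms) for label, forms in forms_by_label.items()}
print(result)
```

Using a set per label removes duplicate forms as they are collected, so no second round of queries is needed.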

In [14]:
data.items()

dict_items([('Hædde', ['hæddehæddan', 'hædda']), ('Pega', ['pege', 'pegan']), ('Hwætred', ['hwætred']), ('Ealdwulf', ['aldwulfes']), ('Moses', ['moyses']), ('Mercia', ['mercnarice', 'myrcnarice']), ('Britain', ['bretonelande', 'bretone']), ('Crowland', ['cruwlande', 'cruwland']), ('Penwealh', ['penwald']), ('Ova', ['oua']), ('Damascus', ['damascūm']), ('Ælfthryth', ['ælfðryðe']), ('Cissa', ['cissa']), ('Tatwine', ['tatwine']), ('Peter', ['petres']), ('Felix', ['felix']), ('NorthSea', ['norðsæ']), ('Æthelbald', ['æþelbald', 'aþelbald', 'aðelbalde', 'aþelbalde', 'aþelbaldes', 'æþelbaldes']), ('Elijah', ['helias']), ('Guthlac', ['guthlac', 'gutðlac', 'guðlace', 'guðlac', 'guðlaces', 'guðhlaces', 'guthlaces', 'guthlace', 'guðlacesguðlac', 'guþlaces', 'guþlac']), ('Wigfrid', ['wigfrið']), ('Jesus', ['cristes', 'crist']), ('Tette', ['tette']), ('MiddleAnglia', ['middelenglaland']), ('Ecga', ['ecga']), ('Bartholomew', ['bartholomei', 'bartholomeus']), ('Ecgburh', ['ecgburh', 'ecgburhe']), ('

Now suppose we want to retrieve all attested forms of the name "Æthelbald":

In [15]:
data['Æthelbald']

['æþelbald', 'aþelbald', 'aðelbalde', 'aþelbalde', 'aþelbaldes', 'æþelbaldes']

Or produce a human-readable table of all names and forms:

In [16]:
for k,v in data.items():
    v = ', '.join(v)
    print(f'{k}: {v}')

Hædde: hæddehæddan, hædda
Pega: pege, pegan
Hwætred: hwætred
Ealdwulf: aldwulfes
Moses: moyses
Mercia: mercnarice, myrcnarice
Britain: bretonelande, bretone
Crowland: cruwlande, cruwland
Penwealh: penwald
Ova: oua
Damascus: damascūm
Ælfthryth: ælfðryðe
Cissa: cissa
Tatwine: tatwine
Peter: petres
Felix: felix
NorthSea: norðsæ
Æthelbald: æþelbald, aþelbald, aðelbalde, aþelbalde, aþelbaldes, æþelbaldes
Elijah: helias
Guthlac: guthlac, gutðlac, guðlace, guðlac, guðlaces, guðhlaces, guthlaces, guthlace, guðlacesguðlac, guþlaces, guþlac
Wigfrid: wigfrið
Jesus: cristes, crist
Tette: tette
MiddleAnglia: middelenglaland
Ecga: ecga
Bartholomew: bartholomei, bartholomeus
Ecgburh: ecgburh, ecgburhe
Ælfwald: alfwold
Repton: hrypadun
Paul: paulus
Beccel: beccelle, beccel
Grantchester: granteceaster
EastAnglia: eastenglalande
Cam: grante
EcgberhtCrowland: ecgberhte, ecgbriht
Æthelred: æþelredes
Wilfrid: wilfrid, wilfriðe, wilfriþ, wilfrið, wilfrides
Coenred: cenredes
England: angelcynnesland
Ceolred