# `lxml.etree` 2

This notebook demonstrates a few more techniques: working through a text verse line by verse line or paragraph by paragraph, and keeping the non-normalized forms in your token dictionaries. It uses one of the student transcriptions, which is not included in the repository, so substitute your own TEI XML file with matching metadata.

In [1]:
from lxml import etree

In [2]:
parser = etree.XMLParser(remove_blank_text=True, resolve_entities=True)
filename = '../projects/example.xml'
tree = etree.parse(filename, parser=parser)
root = tree.getroot()

In [3]:
# Create a dict with characters we want replaced:
substitutions = {
    'ę': 'æ',
    'ƿ': 'w',
    'ẏ': 'y',
    'ſ': 's',
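    # '<descending-s glyph>': 's', # Placeholder: put the glyph for descending s itself between the quotes, not an escaped Unicode code point; an empty-string key would insert 's' between every character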
    'v': 'u',
    'j': 'i',
    '⁊': 'and',
    ' ': '',
    '\n': ''
}

# Write a function carrying out the desired operations:
def normalize(token):
    # Lowercase:
    token = token.lower()
    for k, v in substitutions.items():
        # Carry out replacements:
        token = token.replace(k, v)
    return token
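As a quick sanity check, the function can be tried on a handful of forms before it is run over a whole document. The sketch below repeats a reduced copy of the substitution table (the private-use glyph entry is left out) so that it runs on its own; the sample forms are only illustrative.

```python
# Reduced copy of the substitution table, for illustration only:
substitutions = {
    'ę': 'æ',
    'ƿ': 'w',
    'ẏ': 'y',
    'ſ': 's',
    'v': 'u',
    'j': 'i',
    '⁊': 'and',
    ' ': '',
    '\n': ''
}

def normalize(token):
    # Lowercase first, so the keys above only need to cover lowercase characters:
    token = token.lower()
    for k, v in substitutions.items():
        token = token.replace(k, v)
    return token

# Illustrative forms: capitals, wynn, a leading space, the Tironian et, long s, a newline
samples = ['ÆĐelƿerd', ' Ælfric', '⁊', 'ſecge\n']
normalized = [normalize(s) for s in samples]
print(normalized)
```

Because lowercasing happens before the replacements, an uppercase key such as `'Ƿ'` in the table would never match. Note also that `'v'` and `'j'` are replaced wherever they occur in a token, not only in particular positions.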

We're going to assume our document organizes word tokens into either verse lines or paragraphs, or both. Also, let's provide for any rubrics (`<head>` nodes) encountered and treat them as equivalent to our paragraph/verse line units; if that's undesirable, you can always remove `head` from the iterator or set up your code to flag which chunks are rubrics.

Another thing we neglected to do last week is retain the form as encountered alongside the normalized token form. We'll do this by adding one more key to each token dictionary: `form` keeps the text exactly as it appears in the transcription, and `norm` holds its normalized counterpart.

In [4]:
# Our outermost container will be a list of lists in this approach:
data = []
# Prefix for the TEI namespace, so the tag names below stay readable:
TEI = '{http://www.tei-c.org/ns/1.0}'
text = root.find('.//' + TEI + 'text')

# And now we compile a list of token dictionaries for each rubric, verse line, or paragraph:
for chunk in text.iter(TEI + 'head', TEI + 'p', TEI + 'l'):
    this_data = []
    for word in chunk.iter(TEI + 'w'):
        dictionary = dict()
        # The form as encountered:
        dictionary['form'] = etree.tostring(word, method='text', encoding='unicode')
        # Its normalized counterpart:
        dictionary['norm'] = normalize(dictionary['form'])
        dictionary['pos'] = word.get('pos')
        dictionary['lemma'] = word.get('lemma')
        this_data.append(dictionary)
    data.append(this_data)
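If you would rather flag which chunks are rubrics than treat them as ordinary units, one option is to store a small dictionary per chunk instead of a bare list. The following is a minimal sketch under assumptions: the miniature document is made up for illustration, the `rubric` and `tokens` keys are hypothetical names, and the import falls back to the standard library only so the sketch runs without lxml installed.

```python
try:
    from lxml import etree  # the library this notebook uses
except ImportError:
    import xml.etree.ElementTree as etree  # fallback; the calls below exist in both

TEI = '{http://www.tei-c.org/ns/1.0}'

# Hypothetical miniature document standing in for a real transcription:
sample = '''<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<head><w pos="VBPI" lemma="gretan">gret</w></head>
<p><w pos="PRO^N" lemma="ic">ic</w> <w pos="VBPI" lemma="secgan">secge</w></p>
</body></text></TEI>'''

root = etree.fromstring(sample)
text = root.find('.//' + TEI + 'text')
chunk_tags = {TEI + 'head', TEI + 'p', TEI + 'l'}

flagged = []
for chunk in text.iter():
    # Skip everything that is not a rubric, paragraph, or verse line:
    if chunk.tag not in chunk_tags:
        continue
    tokens = [{'form': ''.join(w.itertext()), 'lemma': w.get('lemma')}
              for w in chunk.iter(TEI + 'w')]
    # Record whether this chunk came from a <head> node:
    flagged.append({'rubric': chunk.tag == TEI + 'head', 'tokens': tokens})

print([(c['rubric'], len(c['tokens'])) for c in flagged])
```

Under lxml, the multi-tag call `text.iter(TEI + 'head', TEI + 'p', TEI + 'l')` from the cell above selects the same chunks; the tag-set test is used here only because it also works in the standard library.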


In [5]:
data

[[{'form': ' Ælfric', 'norm': 'ælfric', 'pos': 'NR^N', 'lemma': ''},
  {'form': 'gret', 'norm': 'gret', 'pos': 'VBPI', 'lemma': 'gretan'},
  {'form': 'eadmolice',
   'norm': 'eadmolice',
   'pos': 'ADV',
   'lemma': 'eadmolice'},
  {'form': 'ÆĐelƿerd', 'norm': 'æđelwerd', 'pos': 'NR^A', 'lemma': ''},
  {'form': 'Ealdorman',
   'norm': 'ealdorman',
   'pos': 'N^A',
   'lemma': 'ealdorman'}],
 [{'form': '⁊', 'norm': 'and', 'pos': 'CONJ', 'lemma': 'and'},
  {'form': 'ic', 'norm': 'ic', 'pos': 'PRO^N', 'lemma': 'ic'},
  {'form': 'secge', 'norm': 'secge', 'pos': 'VBPI', 'lemma': 'secgan'},
  {'form': 'þe', 'norm': 'þe', 'pos': 'PRO^D', 'lemma': 'þu'},
  {'form': 'leof', 'norm': 'leof', 'pos': 'ADJ^N', 'lemma': 'leof'},
  {'form': '\uf149Þæt', 'norm': '\uf149þæt', 'pos': 'C', 'lemma': 'Þæt'},
  {'form': 'ic', 'norm': 'ic', 'pos': 'PRO^N', 'lemma': 'ic'},
  {'form': 'hæbbe', 'norm': 'hæbbe', 'pos': 'HVPI', 'lemma': 'habban'},
  {'form': 'nu', 'norm': 'nu', 'pos': 'ADV', 'lemma': 'nu'},
  ...

This approach still omits any non-word content, such as punctuation marks. You can of course modify the approach to cover any content you need.
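As one possible modification, punctuation can be collected in the same pass. This sketch assumes the transcription encodes punctuation in TEI `<pc>` elements; if your file keeps punctuation as plain text between `<w>` elements, a different approach is needed. The sample document and the `type` key are hypothetical, and the fallback import is there only so the sketch runs without lxml installed.

```python
try:
    from lxml import etree  # the library this notebook uses
except ImportError:
    import xml.etree.ElementTree as etree  # fallback; the calls below exist in both

TEI = '{http://www.tei-c.org/ns/1.0}'

# Hypothetical miniature paragraph with one punctuation element:
sample = '''<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<p><w pos="PRO^N" lemma="ic">ic</w> <w pos="VBPI" lemma="secgan">secge</w><pc>.</pc></p>
</body></text></TEI>'''

root = etree.fromstring(sample)
chunk = root.find('.//' + TEI + 'p')

tokens = []
# Walk the chunk in document order so words and punctuation stay interleaved:
for node in chunk.iter():
    if node.tag == TEI + 'w':
        tokens.append({'form': ''.join(node.itertext()), 'type': 'word',
                       'pos': node.get('pos'), 'lemma': node.get('lemma')})
    elif node.tag == TEI + 'pc':
        tokens.append({'form': ''.join(node.itertext()), 'type': 'punct',
                       'pos': None, 'lemma': None})

print([t['form'] for t in tokens])
```

Keeping a `type` key makes it easy to filter punctuation back out when only word tokens are wanted.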

If we now want to perform operations on all tokens, we simply merge our lower-level lists of token dictionaries back into one master list. Nested list comprehensions get a little hard to read, so the loop variables are named to help with the syntax: `outer` is each chunk list taken from `data`, and `inner` is each token dictionary within that chunk. The `for` clauses read left to right in the same order as nested `for` loops would.

In [6]:
all_tokens = [inner for outer in data for inner in outer]
len(all_tokens)
#alternative_way = []
#for i in data:
#    alternative_way.extend(i)

378
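To make the reading order of the comprehension concrete, here it is next to the equivalent nested loops, plus a standard-library alternative. The tiny `data` list is a stand-in with made-up contents, reduced to a single key.

```python
from itertools import chain

# Small stand-in for `data`: two chunks of token dictionaries
data = [
    [{'norm': 'ælfric'}, {'norm': 'gret'}],
    [{'norm': 'and'}, {'norm': 'ic'}, {'norm': 'secge'}],
]

# The comprehension from the cell above:
all_tokens = [inner for outer in data for inner in outer]

# The same thing as explicit nested loops; the `for` clauses keep their order:
looped = []
for outer in data:        # each chunk (a list of token dictionaries)
    for inner in outer:   # each token dictionary in that chunk
        looped.append(inner)

# A standard-library alternative that avoids the nesting altogether:
chained = list(chain.from_iterable(data))

print(len(all_tokens), all_tokens == looped == chained)
```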

In [7]:
all_tokens[:10]
#alternative_way[:10]

[{'form': ' Ælfric', 'norm': 'ælfric', 'pos': 'NR^N', 'lemma': ''},
 {'form': 'gret', 'norm': 'gret', 'pos': 'VBPI', 'lemma': 'gretan'},
 {'form': 'eadmolice',
  'norm': 'eadmolice',
  'pos': 'ADV',
  'lemma': 'eadmolice'},
 {'form': 'ÆĐelƿerd', 'norm': 'æđelwerd', 'pos': 'NR^A', 'lemma': ''},
 {'form': 'Ealdorman',
  'norm': 'ealdorman',
  'pos': 'N^A',
  'lemma': 'ealdorman'},
 {'form': '⁊', 'norm': 'and', 'pos': 'CONJ', 'lemma': 'and'},
 {'form': 'ic', 'norm': 'ic', 'pos': 'PRO^N', 'lemma': 'ic'},
 {'form': 'secge', 'norm': 'secge', 'pos': 'VBPI', 'lemma': 'secgan'},
 {'form': 'þe', 'norm': 'þe', 'pos': 'PRO^D', 'lemma': 'þu'},
 {'form': 'leof', 'norm': 'leof', 'pos': 'ADJ^N', 'lemma': 'leof'}]
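As an example of an operation over the flat list, the part-of-speech tags can be tallied with `collections.Counter`. The sketch below hard-codes the ten tokens shown in the output above, reduced to two keys, so that it runs without the XML file.

```python
from collections import Counter

# The first ten tokens from the output above, reduced to the keys needed here:
all_tokens = [
    {'norm': 'ælfric', 'pos': 'NR^N'},
    {'norm': 'gret', 'pos': 'VBPI'},
    {'norm': 'eadmolice', 'pos': 'ADV'},
    {'norm': 'æđelwerd', 'pos': 'NR^A'},
    {'norm': 'ealdorman', 'pos': 'N^A'},
    {'norm': 'and', 'pos': 'CONJ'},
    {'norm': 'ic', 'pos': 'PRO^N'},
    {'norm': 'secge', 'pos': 'VBPI'},
    {'norm': 'þe', 'pos': 'PRO^D'},
    {'norm': 'leof', 'pos': 'ADJ^N'},
]

# Count how often each tag occurs:
pos_counts = Counter(token['pos'] for token in all_tokens)

# Collect the normalized forms of all tokens carrying one particular tag:
finite_verbs = [token['norm'] for token in all_tokens if token['pos'] == 'VBPI']

print(pos_counts.most_common(3))
print(finite_verbs)
```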

In [8]:
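# Write one line of normalized tokens per chunk; UTF-8 is set explicitly because the text contains non-ASCII characters:
with open('test.txt', 'w', encoding='utf-8') as outfile:
    for chunk in data:
        line = ' '.join(token['norm'] for token in chunk)
        outfile.write(line + '\n')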
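The plain-text file keeps only the normalized forms. If the forms as encountered and the annotations should survive as well, a tab-separated file with one token per row is one option. This is a sketch under assumptions: the file name `tokens.tsv` and the `chunk` column are hypothetical, the sample `data` is a short excerpt, and the file is written to a temporary directory so nothing in your project is overwritten.

```python
import csv
import os
import tempfile

# Short excerpt standing in for `data`:
data = [
    [{'form': ' Ælfric', 'norm': 'ælfric', 'pos': 'NR^N', 'lemma': ''},
     {'form': 'gret', 'norm': 'gret', 'pos': 'VBPI', 'lemma': 'gretan'}],
    [{'form': '⁊', 'norm': 'and', 'pos': 'CONJ', 'lemma': 'and'}],
]

fields = ['chunk', 'form', 'norm', 'pos', 'lemma']
path = os.path.join(tempfile.mkdtemp(), 'tokens.tsv')  # hypothetical file name

with open(path, 'w', encoding='utf-8', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=fields, delimiter='\t')
    writer.writeheader()
    # Number the chunks from 1 so each row records where its token came from:
    for number, chunk in enumerate(data, start=1):
        for token in chunk:
            writer.writerow({'chunk': number, **token})

# Read the file back to confirm that nothing was lost:
with open(path, encoding='utf-8', newline='') as infile:
    rows = list(csv.DictReader(infile, delimiter='\t'))

print(len(rows), rows[-1])
```

Values come back from the file as strings, so the chunk number has to be converted with `int()` if it is needed for arithmetic or sorting.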