# XML Exercise
## Build a list of catchwords and their corresponding first words

Now that we have a better sense of how to use lxml, can we find all the catchwords in the Spenser XML as well as all the words that correspond to those catchwords?

**N.B.: This exercise is based on a private XML file from the Spenser Project. If you'd like to try it, contact me for the file.**

In [1]:
# First, we need to import lxml and load the Spenser Project XML file for Book 1
from lxml import etree

with open('data/fq1590.bk1.xml', 'r') as fqfile:
    fqxml = etree.fromstring(fqfile.read()) # We can skip a step by nesting our ".read()" inside the ".fromstring()"

# Check our work by printing out the XML:
print(etree.tostring(fqxml, pretty_print=True, encoding="unicode"))

<TEI xmlns="http://www.tei-c.org/ns/1.0">
   <teiHeader>

      <fileDesc>
         <titleStmt>
            <title level="m" type="main">
               <hi rend="face(italics)">The faerie queene.</hi>
            </title>
            <title level="m" type="sub">a digital edition</title>
            <author>Edmund Spenser</author>
            <editor>David Miller</editor>
            <editor>Joseph Loewenstein</editor>
            <editor>Patrick Cheney</editor>
            <editor>Elizabeth Fowler</editor>
            <respStmt>
               <resp>Creation of machine-readable version: </resp>
               <name>Text Creation Partnership</name>
            </respStmt>
            <respStmt>
               <resp>Conversion to TEI-conformant XML: </resp>
               <name>The Spenser Project Staff</name>
            </respStmt>
            <sponsor/>
            <sponsor/>
            <sponsor/>
            <funder/>
            <funder/>
         </titleStmt>
         <editionStm

In [13]:
# Browsing the XML, it's easy to see that catchwords are kept in the "fw" tag with the "type=catch" attribute

# We can build a big list comprehension to get all the catchwords

# Let's start with the .findall() call we need, which should look like:
# fqxml.findall(".//{*}fw[@type='catch']")

# We can embed that in a list comprehension, like so:
# [c for c in fqxml.findall(".//{*}fw[@type='catch']")]

# That would get us a list of the Element objects, but we want just a list of the words
# Sometimes "fw" tags have child tags, so we can't just do c.text
# Instead, we can use etree.tostring() with method="text", which returns all the text inside the tag, like so:
# etree.tostring(c, method="text", encoding="unicode")

# That will give us a string, but it may have some unnecessary spaces and other stuff in it
# We can add some basic string methods to clean it up
# First we use .strip() to remove any whitespace around the text
# To get rid of any extra text, we use .split() and then [0] to get the first item of whatever we split
# Adding that to our etree.tostring() call we get:
# etree.tostring(c, method="text", encoding="unicode").strip().split()[0]

# We can put all of that into our list comprehension, so we get a one-line
# script that pulls out all the catchwords.

catchwords = [etree.tostring(c, method="text", encoding="unicode").strip().split()[0] for c in fqxml.findall(".//{*}fw[@type='catch']")]
print(catchwords)

# There's a small problem with this. It seems like a couple of signatures (M1-4) were mislabeled as catchwords
# Thankfully there's a good workaround:
# We just modify our .findall() to only pick up items placed in the bottom-right with the type "catch", like so:

# fqxml.findall(".//{*}fw[@type='catch'][@place='bottom-right']")

# Now we can run our list comprehension again with this new .findall():

catchwords = [etree.tostring(c, method="text", encoding="unicode").strip().split()[0] for c in fqxml.findall(".//{*}fw[@type='catch'][@place='bottom-right']")]
print(catchwords)

['Of', 'CANT.', 'which', 'Enforst', 'But', 'But', 'There.', 'As', 'Or', 'Then', 'Of', 'Arriued', 'To', 'The', 'That', 'In', "Captiu'd", 'Cant.', 'His', 'Now', 'Sometime', 'With', 'And', 'Shee', 'O', 'Long', 'Least', 'In', 'A', 'The', 'And', 'Yet', 'And', 'Long', 'She', 'He', 'Druncke', 'But', 'That', 'Much', 'Loth', 'Her', 'And', 'Cant.', 'Which', 'By', 'A', 'With', 'Great', 'And', 'Who', 'His', 'In', 'Full', 'And', 'And', 'Cause', 'Thereto', 'Cant..', 'To', 'And', 'And', 'In', 'And', 'Who', 'And', 'Then', 'With', 'There', 'Both', 'Ah', 'Not', 'All', 'But', 'Yet', 'And', 'The', 'They', 'The', 'During', 'So', 'The', 'Yet', 'Too', 'That', 'Whereas', 'So', 'Most', 'Wh', 'The', 'The', 'But', 'From', 'His', 'O', 'The', 'And', 'But', 'A', 'But', 'Till', 'Had', 'As', 'Her', 'The', 'So', 'That', 'That', 'And', 'That', 'The', 'Where', 'Then', 'With', 'But', 'And', 'Her', 'And', 'Faire', 'Aread', 'That', 'Whiles', 'Thine', 'And', 'He', 'But', 'How', 'Whose', 'What', 'Nor', 'Thou', 'And', 'Which'
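
The steps in the cell above can be tried without the private file. What follows is a minimal sketch on a made-up TEI fragment: the `sample` string, its contents, and the `first_token` helper are assumptions for illustration, not part of the Spenser Project XML. It shows why `c.text` alone is not enough when an `fw` tag has a child tag, and how the extra `[@place='bottom-right']` predicate filters out a mislabeled signature.

```python
from lxml import etree

# Hypothetical fragment: one real catchword (wrapped in a child tag),
# one signature mislabeled as a catchword, and one ordinary signature.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <fw type="catch" place="bottom-right">
        <hi rend="face(italics)">Enforst</hi>
      </fw>
      <fw type="catch" place="bottom-center">M2</fw>
      <fw type="sig" place="bottom-center">A3</fw>
    </body>
  </text>
</TEI>"""

root = etree.fromstring(sample)

# {*} matches any namespace, so we don't have to spell out the TEI namespace
all_catch = root.findall(".//{*}fw[@type='catch']")
bottom_right = root.findall(".//{*}fw[@type='catch'][@place='bottom-right']")

def first_token(element):
    """Return the first whitespace-separated word of all text inside element."""
    text = etree.tostring(element, method="text", encoding="unicode")
    return text.strip().split()[0]

# .text only holds the text before the first child tag: here, just whitespace
print(repr(all_catch[0].text))

all_words = [first_token(fw) for fw in all_catch]
right_words = [first_token(fw) for fw in bottom_right]
print(all_words)
print(right_words)
```

In this sketch the first list still contains the mislabeled `M2`, while the second contains only the real catchword.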

In [25]:
# So now we have all the catchwords
# Getting the first word on every page is a bit more complicated...

first_words = [] # Create an empty list for our first words

for pb in fqxml.iter("{*}pb"): # Iterate through each page break tag
    # Page break tags are "self-closing" and don't have children
    # So we need to find their next tags
    for next_tag in pb.itersiblings(): # .itersiblings() yields the siblings that come after pb, in document order
        # Make sure that the tag isn't a comment, and make sure it isn't an "fw" tag
        if isinstance(next_tag.tag, str) and next_tag.tag != "{http://www.tei-c.org/ns/1.0}fw":
            # Let's get all the text inside the tag, with the same stripping and splitting we did on the catchwords
            first_word = etree.tostring(next_tag, method="text", encoding="unicode").strip().split()[0]
            # This works in every case except one: when the next tag is a whole Canto,
            # the first word of the page is in the Canto's heading, so I wrote a special rule for it:
            if next_tag.get("type") == "canto":
                head_tag = next_tag.find(".//{*}head")
                first_word = etree.tostring(head_tag, method="text", encoding="unicode").strip().split()[0]
            first_words.append(first_word)
            # We don't want to iterate through *every* sibling of pb
            # We only need the first one, so we want to stop once we do one
            # To stop a for loop, we can use a Python statement called break:
            break
            
# Now we can print out those first_words:
print(first_words)

# The first two of those are from pages we don't care about, so we can simply slice them off:
first_words = first_words[2:]
print(first_words)

['THE', 'The', 'Of', 'Canto', 'Which', 'Enforst', 'But', 'But', 'Therewith', 'As', 'Or', 'Then', 'Of', 'Arriued', 'To', 'The', 'That', 'In', "Captiu'd", 'Cant.', 'His', 'Now', 'Sometime', 'With', 'And', 'Shee', 'O', 'Long', 'Least', 'In', 'A', 'The', 'And', 'Yet', 'And', 'Long', 'Shee', 'He', 'Dronke', 'But', 'That', 'Much', 'Loth', 'Her', 'And', 'Can.', 'Which', 'By', 'A', 'With', 'Great', 'And', 'Who', 'His', 'In', 'Full', 'And', 'And', 'Cause', 'Thereto', 'Cant', 'To', 'The', 'And', 'In', 'And', 'Who', 'And', 'Then', 'With', 'There', 'Both', 'Ah', 'Not', 'All', 'But', 'Yet', 'And', 'The', 'They', 'The', 'During', 'So', 'The', 'Yet', 'Too', 'That', 'Whereas', 'So', 'Most', 'Who', 'The', 'The', 'But', 'From', 'His', 'O', 'The', 'And', 'But', 'A', 'But', 'Till', 'Had', 'At', 'Her', 'The', 'So', 'That', 'That', 'And', 'That', 'The', 'Where', 'Then', 'With', 'But', 'And', 'Her', 'And', 'Faire', 'Aread', 'That', 'Whiles', 'Thine,', 'And', 'He', 'But', 'How', 'Whose', 'What', 'Nor', 'Thou'
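
The sibling-walking loop above can be seen in isolation on a small example. This is a sketch under assumptions: the `sample` fragment and its contents are invented for illustration and are much simpler than the real file. It shows the two things the loop has to skip (a comment and an `fw` tag) before it reaches the tag that holds the first word of the page.

```python
from lxml import etree

# Hypothetical fragment with two page breaks
sample = """<div xmlns="http://www.tei-c.org/ns/1.0">
  <pb n="1"/>
  <!-- an editorial comment -->
  <fw type="header">Running title</fw>
  <lg><l>First line of the page</l><l>Second line</l></lg>
  <pb n="2"/>
  <lg><l>Another page begins here</l></lg>
</div>"""

root = etree.fromstring(sample)
FW_TAG = "{http://www.tei-c.org/ns/1.0}fw"

first_words = []
for pb in root.iter("{*}pb"):
    for sibling in pb.itersiblings():
        # Comments have a non-string .tag; fw tags hold headers and catchwords
        if not isinstance(sibling.tag, str) or sibling.tag == FW_TAG:
            continue
        text = etree.tostring(sibling, method="text", encoding="unicode")
        first_words.append(text.strip().split()[0])
        break  # only the first qualifying sibling matters

print(first_words)
```

Note that a `pb` with no qualifying sibling after it contributes nothing to the list, which is one way the two lists could end up with different lengths.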

In [26]:
# Now we need to put our catchwords and our first_words together
# To do this we can use a function called "zip()" to make two lists into a list of tuples
# Zip creates a special zip iterator object, which has to be turned into a list to be viewed
# We can do all that by nesting the functions.

# zip() stops at the end of the shorter list, so we should check that our lists are the same length:
print(len(catchwords), len(first_words))


pairs = list(zip(catchwords, first_words))
print(pairs)

# There it is! A list of every catchword and its corresponding word.
# Notice the messiness of real-world XML. We needed special rules for edge cases as
# well as a knowledge of where the taggers had gotten things wrong.
# If we hadn't done these things, our final list would have been less useful.

182 182
[('Of', 'Of'), ('CANT.', 'Canto'), ('which', 'Which'), ('Enforst', 'Enforst'), ('But', 'But'), ('But', 'But'), ('There.', 'Therewith'), ('As', 'As'), ('Or', 'Or'), ('Then', 'Then'), ('Of', 'Of'), ('Arriued', 'Arriued'), ('To', 'To'), ('The', 'The'), ('That', 'That'), ('In', 'In'), ("Captiu'd", "Captiu'd"), ('Cant.', 'Cant.'), ('His', 'His'), ('Now', 'Now'), ('Sometime', 'Sometime'), ('With', 'With'), ('And', 'And'), ('Shee', 'Shee'), ('O', 'O'), ('Long', 'Long'), ('Least', 'Least'), ('In', 'In'), ('A', 'A'), ('The', 'The'), ('And', 'And'), ('Yet', 'Yet'), ('And', 'And'), ('Long', 'Long'), ('She', 'Shee'), ('He', 'He'), ('Druncke', 'Dronke'), ('But', 'But'), ('That', 'That'), ('Much', 'Much'), ('Loth', 'Loth'), ('Her', 'Her'), ('And', 'And'), ('Cant.', 'Can.'), ('Which', 'Which'), ('By', 'By'), ('A', 'A'), ('With', 'With'), ('Great', 'Great'), ('And', 'And'), ('Who', 'Who'), ('His', 'His'), ('In', 'In'), ('Full', 'Full'), ('And', 'And'), ('And', 'And'), ('Cause', 'Cause'), ('The
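
To see why the length check matters, here is a minimal sketch of how `zip()` behaves when the lists differ in length. The short lists reuse a few values from the output above, with one extra first word added on purpose; the `mismatches` list is an assumption about what one might do next with the pairs, not a step from the original exercise.

```python
# A few values from the lists above; first_words has one extra item on purpose
catchwords = ["Of", "CANT.", "which", "There."]
first_words = ["Of", "Canto", "Which", "Therewith", "As"]

pairs = list(zip(catchwords, first_words))
print(len(catchwords), len(first_words), len(pairs))

# zip() silently dropped the unmatched "As", so a length check is worth doing first
print(pairs)

# One possible next step: keep only the pairs where the two words differ
mismatches = [(c, f) for c, f in pairs if c != f]
print(mismatches)
```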