Now that we have some more Python tools at our disposal, we can begin to use XML to greater effect. The MorphAdorned XML for EEBO-TCP gives us word-level tagging (regularized spellings, lemmas, and part-of-speech tags) alongside the document structure. We can use these in concert to...

# Run TF-IDF on Sections of Text

We'll need a MorphAdorned XML copy of the Faerie Queene, which we can get off of Ada (see me for the file).

In [2]:
from lxml import etree
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [5]:
# This is all review:
with open('data/fq_ma.xml', 'rb') as xmlfile: # Open the file in binary mode
    fq_bytes = xmlfile.read() # Read the raw bytes (lxml rejects str input that carries an encoding declaration)
    fq_xml = etree.fromstring(fq_bytes) # Parse the bytes into an etree object

### Note that this is slower than we're used to.

That's because this XML file has many more elements (a tag for each individual word!). The more elements, or "nodes," there are, the longer lxml takes to parse the file.

In [9]:
# What if we simply wanted every word in the corpus?

all_words = [word.text for word in fq_xml.findall('.//{*}w')]
print(all_words)



In [13]:
# We could do the same thing, but getting the regularized values instead
# The regs are stored (along with the lemmas and pos tags) as *attributes*

all_regs = [word.get('reg').lower() for word in fq_xml.findall('.//{*}w')]
print(all_regs)

# Note that we no longer have to filter out punctuation.
# MorphAdorner handles that for us by tagging punctuation not with "w"
# but with "pc" instead. MA also tags spaces with a "c" tag.
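To make the tag layout concrete, here is a minimal sketch on a made-up fragment. The sample markup and its `reg` and `lem` values are assumptions for illustration, not copied from the real file, and the import falls back to the standard library's ElementTree (Python 3.8+) in case lxml isn't installed.

```python
try:
    from lxml import etree
except ImportError:  # standard-library fallback; needs Python 3.8+ for '{*}'
    import xml.etree.ElementTree as etree

# Hypothetical fragment: the attribute values are invented for illustration.
sample = (
    '<l>'
    '<w reg="on" lem="on">on</w><c> </c>'
    '<w reg="the" lem="the">the</w><c> </c>'
    '<w reg="plain" lem="plain">plaine</w>'
    '<pc>,</pc>'
    '</l>'
)
line = etree.fromstring(sample)

# Only <w> elements match, so punctuation (<pc>) and spaces (<c>) drop out.
originals = [w.text for w in line.findall('.//{*}w')]
regs = [w.get('reg').lower() for w in line.findall('.//{*}w')]
punctuation = [pc.text for pc in line.findall('.//{*}pc')]

print(originals)    # ['on', 'the', 'plaine']
print(regs)         # ['on', 'the', 'plain']
print(punctuation)  # [',']
```

The element's `.text` is the original spelling, while `.get('reg')` pulls the regularized form from the attribute.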



## So far, this gives us the same thing a plaintext CSV output would

We instead want to take advantage of the structural divisions that the XML provides, then get the regularized forms one book (or even one canto) at a time.

In [15]:
# Let's start by getting the elements for every book

all_books = fq_xml.findall('.//{*}div[@type="book"]')

# There should be 6 of them:
print(len(all_books))

6


In [24]:
# Now we should loop through and create a list of strings, one per book, for SKLearn to handle.

books_as_reg_strings = []
for book in all_books:
    # Inside our loop we use the same code as above, but we look for w tags in the "book"
    # and not in "fq_xml" as a whole.
    book_regs = [word.get('reg').lower() for word in book.findall('.//{*}w')]
    book_regs_as_string = " ".join(book_regs)
    books_as_reg_strings.append(book_regs_as_string)
  
# We should still have 6 books:
print(len(books_as_reg_strings))

# How many characters in each book? These numbers should be different!
print([len(b) for b in books_as_reg_strings])

6
[237251, 263827, 259800, 230708, 223115, 221877]
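To see why it matters that we search from `book` rather than from `fq_xml`, here is a toy sketch. The two-book document below is hypothetical (the real file has six books and far more markup), and the import again falls back to the standard library if lxml is missing.

```python
try:
    from lxml import etree
except ImportError:  # standard-library fallback; needs Python 3.8+ for '{*}'
    import xml.etree.ElementTree as etree

# Hypothetical two-book document, invented for illustration.
toy = (
    '<text>'
    '<div type="book"><w reg="holiness">Holinesse</w><pc>.</pc></div>'
    '<div type="book"><w reg="temperance">Temperaunce</w>'
    '<w reg="guyon">Guyon</w></div>'
    '</text>'
)
toy_xml = etree.fromstring(toy)
toy_books = toy_xml.findall('.//{*}div[@type="book"]')

# Searching from the root returns every word in the document...
whole_doc = [w.get('reg') for w in toy_xml.findall('.//{*}w')]

# ...while searching from each book returns only that book's words.
per_book = [
    " ".join(w.get('reg') for w in book.findall('.//{*}w'))
    for book in toy_books
]

print(whole_doc)  # ['holiness', 'temperance', 'guyon']
print(per_book)   # ['holiness', 'temperance guyon']
```

Each `findall` only looks at descendants of the element it is called on, which is what lets us slice the text by division.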


In [25]:
# Just for kicks, we can do this task as one big, nested list comprehension

books_as_reg_strings = [" ".join([word.get('reg').lower() for word in book.findall('.//{*}w')]) for book in all_books]


# Our results are exactly the same:

# We should still have 6 books:
print(len(books_as_reg_strings))

# How many characters in each book? These numbers should be different!
print([len(b) for b in books_as_reg_strings])

6
[237251, 263827, 259800, 230708, 223115, 221877]


In [27]:
# Now we're ready for SKLearn to do its magic!
# See the Exercise3 notebook for full documentation of these steps:

vectorizer = TfidfVectorizer(max_df=.65, min_df=1, stop_words=None, use_idf=True, norm=None)

transformed_documents = vectorizer.fit_transform(books_as_reg_strings)

transformed_documents_all = transformed_documents.toarray()

print(transformed_documents_all)

[[0.         0.         1.55961579 ... 0.         0.         2.25276297]
 [1.55961579 2.25276297 0.         ... 2.25276297 2.25276297 0.        ]
 [0.         0.         1.55961579 ... 0.         0.         0.        ]
 [6.23846315 0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [3.11923158 0.         1.55961579 ... 0.         0.         0.        ]]
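The same few values keep recurring in that array because, with only six documents, there are only a few possible IDF weights. As a sanity check, here is a sketch that recomputes them by hand. It assumes scikit-learn's defaults `smooth_idf=True` and `sublinear_tf=False` (neither is set in the call above), so each score is the raw count times ln((1 + n) / (1 + df)) + 1. The helper name `smoothed_idf` is ours, not part of scikit-learn.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF as scikit-learn computes it when smooth_idf=True (the default)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_books = 6

# IDF for a word that appears in 1, 2, or 3 of the six books.
idf_by_df = {df: round(smoothed_idf(n_books, df), 6) for df in (1, 2, 3)}
print(idf_by_df)  # {1: 2.252763, 2: 1.847298, 3: 1.559616}

# With norm=None, a score is simply (raw count in the book) * idf.
# A word found in 3 books and used 4 times in one of them scores:
four_uses = round(4 * smoothed_idf(n_books, 3), 6)
print(four_uses)  # 6.238463
```

Because `max_df=.65` drops any word found in more than 65% of the documents, a word that appears in four or more of the six books (4/6 ≈ 0.67) is removed, so every surviving word appears in at most three books.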


In [29]:
# Let's simply convert to pandas and make it readable
# See the Exercise3 notebook for documentation

# Instead of filenames, we just need the numbers 1-6 to label the books:
book_numbers = range(1,7)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
all_words = vectorizer.get_feature_names_out()

df = pd.DataFrame(transformed_documents_all, columns=all_words, index=book_numbers)
df = df.T
df

word,1,2,3,4,5,6
aback,0.000000,1.559616,0.000000,6.238463,0.000000,3.119232
aband,0.000000,2.252763,0.000000,0.000000,0.000000,0.000000
abandoned,1.559616,0.000000,1.559616,0.000000,0.000000,1.559616
abandoning,0.000000,0.000000,0.000000,0.000000,2.252763,0.000000
abase,0.000000,6.238463,0.000000,1.559616,0.000000,4.678847
abash,0.000000,1.847298,0.000000,0.000000,1.847298,0.000000
abashment,0.000000,0.000000,4.505526,0.000000,0.000000,0.000000
abated,0.000000,2.252763,0.000000,0.000000,0.000000,0.000000
abating,0.000000,0.000000,0.000000,0.000000,0.000000,2.252763
abear,0.000000,0.000000,0.000000,0.000000,1.847298,1.847298


# For Homework!

## Can you do this task for every canto instead of every book?