In [2]:
%pip install lxml

Note: you may need to restart the kernel to use updated packages.


In [3]:
from lxml import etree
from pathlib import Path

In [4]:
files = list(Path("tlg0012").glob("./**/*perseus-eng*.xml"))

In [5]:
TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

NAMESPACES = {
    "tei": TEI_NS,
    "xml": XML_NS,
}

In [6]:
for file in files:
    print(file)
    tree = etree.parse(file)
    text = tree.xpath("//tei:div[@subtype='card']//text()", namespaces=NAMESPACES)
    
    cleaned_text = []
    for t in text:
        if t.strip() != "":
            cleaned_text.append(t)

    # Write one .txt file per XML file into the working directory,
    # skipping files that contain no 'card' text.
    if cleaned_text:
        with open(file.with_suffix(".txt").name, "w") as f:
            f.write('\n'.join(cleaned_text))
    

tlg0012/tlg002/tlg0012.tlg002.perseus-eng3.xml
tlg0012/tlg002/tlg0012.tlg002.perseus-eng4.xml
tlg0012/tlg001/tlg0012.tlg001.perseus-eng3.xml
tlg0012/tlg001/tlg0012.tlg001.perseus-eng4.xml
tlg0012/tlg003/tlg0012.tlg003.perseus-eng1.xml


In [46]:
from collections import Counter

text_files = list(Path(".").glob("tlg0012.tlg00*.perseus-eng*.txt"))

counts = {}

for t in text_files:
    name = str(t)

    with open(t) as f:
        text = f.read().lower().split()
        counts[name] = Counter(text)
    print(counts[name]['throng'])

71
0
5
15


In [60]:
term = 'throng'

df_ulysses = 0

for _, els in counts.items():
    if term in els:
        df_ulysses += 1

df_ulysses

3

Notes from completing the assignment:
1. Review TF-IDF and understand the code:
- Re-inspect the code with: <br>
    `print(counts)` <br>
    `print(counts[name]['odysseus'])` <br>
    then try other terms <br>
    --> Work out how to calculate the TF-IDF score for a term

2. When I first calculated the TF-IDF score, I got a negative value for `tf_idf_ulysses`. This confused me because I had assumed that was impossible. Revisiting my IDF code, I realized that I had added 1 to `df_ulysses` to prevent division by zero. When `df_ulysses` equals the number of documents, `# docs / (df_ulysses + 1)` is less than 1, so its logarithm, `idf_ulysses`, is negative. I re-read about TF-IDF [here](https://www.learndatasci.com/glossary/tf-idf-term-frequency-inverse-document-frequency/) to check whether the standard formula adds 1 to `df_ulysses` when calculating `idf_ulysses`. The article mentions that 1 can be added to the denominator, so keeping the `+ 1` would have been acceptable, but I removed it because the article's worked example does not add it.

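3. To test my code, I looked for a term that appears in some, but not all, of the 4 documents (`throng` appears in 3 of them). A term found in every document has an IDF of 0, so all of its TF-IDF scores would be 0.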
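
The sign problem described in note 2 can be reproduced with a few lines. This is a minimal sketch, not the assignment code: the function names `idf_plain`, `idf_plus_one`, and `idf_smoothed` are hypothetical, and `idf_smoothed` shows one common smoothed variant (adding 1 to both numerator and denominator, then adding 1 to the result) that is not used anywhere in this notebook. It keeps base-10 logarithms to match the cell below.

```python
import math

def idf_plain(n_docs, df):
    """log10(N / df). Raises ZeroDivisionError when df == 0."""
    return math.log10(n_docs / df)

def idf_plus_one(n_docs, df):
    """log10(N / (df + 1)). Safe for df == 0, but negative when df == N."""
    return math.log10(n_docs / (df + 1))

def idf_smoothed(n_docs, df):
    """log10((N + 1) / (df + 1)) + 1. Stays positive for 0 <= df <= N.
    Assumption: shown only for comparison; not the formula in this notebook."""
    return math.log10((n_docs + 1) / (df + 1)) + 1

n_docs = 4  # number of documents, as in this notebook
for df in (1, 3, 4):
    print(df, idf_plain(n_docs, df), idf_plus_one(n_docs, df), idf_smoothed(n_docs, df))
```

With `df = 4`, `idf_plus_one` is the logarithm of 4/5 and therefore negative, which is the behavior that made the TF-IDF score negative.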

In [61]:
import numpy as np

# Calculate IDF score for 'term'
idf_ulysses = np.log10(len(counts) / df_ulysses)
print(len(counts), df_ulysses)

# Calculate TF-IDF score for 'term' in each document
tf_idf_ulysses = {}

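for doc in counts:
    words = counts[doc]
    # Term frequency: occurrences of the term / total tokens in the document
    tf_score = words[term] / sum(words.values())
    tf_idf_score = tf_score * idf_ulysses
    tf_idf_ulysses[doc] = tf_idf_score  # keep the score for later use
    print(tf_score, idf_ulysses, tf_idf_score)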

4 3
0.0004022662889518414 0.12493873660829992 5.025864192175238e-05
0.0 0.12493873660829992 0.0
3.2083777158917363e-05 0.12493873660829992 4.0085065838573656e-06
0.00011265913102256937 0.12493873660829992 1.4075489497348745e-05
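
Removing the `+ 1` from the denominator means the IDF step divides by zero for a term that appears in no document. The following is a rough sketch of one way to guard against that, assuming a dictionary of `Counter` objects shaped like `counts`. The function name `tf_idf` and the toy documents are made up for illustration, and returning 0.0 for an unseen term is a design assumption rather than part of the assignment.

```python
import math
from collections import Counter

def tf_idf(term, counts):
    """Return {doc_name: tf-idf score} for `term`, using log10(N / df)."""
    n_docs = len(counts)
    # Document frequency: number of documents containing the term
    df = sum(1 for c in counts.values() if c[term] > 0)
    if df == 0:
        # Assumption: a term found nowhere scores 0.0 instead of raising
        return {doc: 0.0 for doc in counts}
    idf = math.log10(n_docs / df)
    scores = {}
    for doc, c in counts.items():
        total = sum(c.values())
        tf = c[term] / total if total else 0.0  # guard against empty documents
        scores[doc] = tf * idf
    return scores

# Toy data (hypothetical), not the Perseus texts
toy_counts = {
    "doc_a": Counter("the throng pressed on the gate".split()),
    "doc_b": Counter("the ships lay on the shore".split()),
}
print(tf_idf("throng", toy_counts))
print(tf_idf("chariot", toy_counts))
```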
