This module scans the entire Tesserae text folder and builds frequency data
for each lemma in the corpus. These counts are needed for experiments that choose the best
lemma (in ambiguous cases), synonym, or translation for a given token in context during
NLP tasks. Contributors: James Gawley.

In [3]:
from os import listdir
from os.path import isfile, join, expanduser
from collections import defaultdict
from pprint import PrettyPrinter
from cltk.tokenize.word import WordTokenizer
from cltk.stem.latin.j_v import JVReplacer
from operator import itemgetter
from difflib import SequenceMatcher
import pickle
import os

In [6]:
rel_path = os.path.join('~/cltk')
path = os.path.expanduser(rel_path)
os.chdir(path)
from cltk.tokenize.word import WordTokenizer
from cltk.stem.latin.j_v import JVReplacer
from cltk.semantics.latin.lookup import Lemmata

In [2]:
#This filepath needs to be customized. The git repo is located at https://github.com/jeffkinnison/tesserae-v5.git
rel_path = os.path.join('~/tesserae-v5')
path = os.path.expanduser(rel_path)
os.chdir(path)
from tesserae.utils import TessFile

COUNT_LIBRARY is a large dictionary whose keys are lemmas and whose values are integers representing the number of times that lemma may have appeared in the training corpus.

To build this data structure, the function read_files() moves through a .tess formatted file and extracts words in situ as tokens, one at a time. When a given token has only one possible lemmatization, that lemma's entry in COUNT_LIBRARY is incremented. The interesting case comes when a token has more than one possible lemma.

In cases of ambiguous lemmatization, the COUNT_LIBRARY entries for every possible lemma are incremented. This means that true positives and false positives are lumped together. Over time, however, the true positives seem to outweigh the false positives, because this data structure can be built into a fairly accurate lemmatizer.
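
The counting rule can be tried out on a toy example before running it on the corpus. The sketch below is only an illustration: `TOY_LEMMAS` and `count_tokens` are hypothetical names, and the hand-written lookup table stands in for the CLTK lemmatizer used in the rest of this notebook.

```python
from collections import Counter

# Hypothetical lookup table (token -> candidate lemmas); not real CLTK data.
TOY_LEMMAS = {
    "amor": ["amor", "amo"],  # ambiguous: the noun, or a passive form of the verb
    "amat": ["amo"],
    "amoris": ["amor"],
}

def count_tokens(tokens, lookup):
    """Increment the count of every candidate lemma of every token."""
    counts = Counter()
    for token in tokens:
        # Assumption for this sketch: an unknown token counts as its own lemma.
        for lemma in lookup.get(token, [token]):
            counts[lemma] += 1
    return counts

toy_counts = count_tokens(["amor", "amat", "amoris", "amoris", "amor"], TOY_LEMMAS)
print(toy_counts)
```

Each occurrence of the ambiguous token adds one to both candidates, so the counts mix true and false positives; the unambiguous tokens are what pull the totals apart.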

In [9]:
COUNT_LIBRARY = dict()

In [10]:
def read_files(filepath):
    '''Moves through a .tess file and calls 'token_cleanup' and 'count_lemma' on each token.
    Updates the COUNT_LIBRARY global object.
    Parameters
    ----------
    filepath: a file in .tess format
    '''
    tessobj = TessFile(filepath)
    # a for loop consumes the token generator and ends cleanly when it is exhausted
    for rawtoken in tessobj.read_tokens():
        cleantoken_list = token_cleanup(rawtoken)
        # skip tokens that clean up to nothing; only the first piece is counted
        if cleantoken_list:
            count_lemma(cleantoken_list[0])

In [1]:
lemmatizer = Lemmata(dictionary = 'lemmata', language = 'latin')
def count_lemma(targettoken):
    '''Increments the corpus count of every possible lemma for a token.
    param targettoken: the token in question
    global COUNT_LIBRARY: a dictionary whose keys are lemmas and whose values are
    incremented counts.
    '''
    global COUNT_LIBRARY
    lemmas = lemmatizer.lookup([targettoken])
    lemmas = lemmatizer.isolate(lemmas)
    for lemma in lemmas:
        if lemma not in COUNT_LIBRARY:
            COUNT_LIBRARY[lemma] = 0
        COUNT_LIBRARY[lemma] = COUNT_LIBRARY[lemma] + 1

In [None]:
jv = JVReplacer()
word_tokenizer = WordTokenizer('latin')
def token_cleanup(rawtoken):
    # this cleaning algorithm is a potential area for improvement.
    rawtoken = jv.replace(rawtoken)
    rawtoken = rawtoken.lower()
    tokenlist = word_tokenizer.tokenize(rawtoken)
    #sometimes words are split into enclitics and punctuation.
    return tokenlist

The following is the actual program loop.

In [None]:
#open all the tesserae files
relativepath = join('~/cleantess/tesserae/texts/la')
path = expanduser(relativepath)
#authors whose texts are left out of the count
excluded = ('augustine', 'ambrose', 'jerome', 'tertullian', 'eugippius', 'hilary')
onlyfiles = [f for f in listdir(path)
             if isfile(join(path, f)) and not any(name in f for name in excluded)]
onlyfiles = [join(path, f) for f in onlyfiles]
COUNT_LIBRARY = dict()
for filename in onlyfiles:
    print(filename)
    if '.tess' in filename:
        read_files(filename)

Now that the COUNT_LIBRARY data structure has been built, it's time to test its ability to assign a probability distribution to all possible lemmas in the ambiguous context.
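
As a worked illustration of the idea, the sketch below turns raw counts into a probability distribution over a token's candidate lemmas. The function name `distribution` and the toy counts are hypothetical, and the even-split fallback for unseen lemmas is an assumption of this sketch rather than something the notebook specifies.

```python
def distribution(candidates, counts):
    """Return [(lemma, probability), ...] proportional to corpus counts."""
    total = sum(counts.get(lem, 0) for lem in candidates)
    if total == 0:
        # Assumption: with no corpus evidence, split the probability evenly.
        return [(lem, 1 / len(candidates)) for lem in candidates]
    return [(lem, counts.get(lem, 0) / total) for lem in candidates]

# Toy counts, invented for illustration only.
toy_counts = {"amor": 4, "amo": 3}
dist = distribution(["amor", "amo"], toy_counts)
print(dist)
```

With these toy counts the two candidates receive 4/7 and 3/7 of the probability mass, so the more frequent lemma wins whenever a single answer is required.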

In [2]:
from random import randint

def compare_count(target, control = 0):
    '''Assigns probability values to all possible lemmas.
    parameters
    ----------
    target: the word being lemmatized
    control: options for testing. 0 = use frequency values; 1 = choose at random; 2 = take the first option
    sample output
    -------------
    [(lemma1, .45), (lemma2, .55)]
    '''
    lemmas = lemmatizer.lookup([target])
    lemmas = lemmatizer.isolate(lemmas)
    if len(lemmas) > 1:
        if control == 1:
            # pick one candidate at random and give it the full probability mass
            choice = randint(0, (len(lemmas) - 1))
            return [(lemmas[choice], 1)]
        #this will return an even distribution, which will result in the 1st result being picked.
        if control == 2:
            lemmalist = lemmatizer.lookup([target])
            lemmalist = lemmalist[0][1]
            return lemmalist
        if control == 0:
            # lemmas never seen in the corpus count as 0 rather than raising KeyError
            counts = [COUNT_LIBRARY.get(lem, 0) for lem in lemmas]
            all_lemmas_total = sum(counts)
            if all_lemmas_total == 0:
                # no corpus evidence at all: fall back to an even distribution
                return [(lem, 1 / len(lemmas)) for lem in lemmas]
            # the probability distribution is just the # of appearances in the corpus for one lemma
            # vs. the number of appearances for all lemmata, total.
            return [(lem, count / all_lemmas_total) for lem, count in zip(lemmas, counts)]
    else:
        return [(lemmas[0], 1)]

The following is a test of the lemmatizer on the first 1000 sentences of parsed Latin in the latin_cltk_data repo.

In [None]:
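from itertools import islice

tessobj = TessFile(onlyfiles[389])
tokengenerator = iter(tessobj.read_tokens())
# new_file() and compare_context() are not defined in this notebook, so this cell
# takes the first four tokens directly and scores one of them with compare_count().
tokens = list(islice(tokengenerator, 4))
target = tokens.pop(1)
cleaned = token_cleanup(target)
if cleaned:
    print(compare_count(cleaned[0]))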