This module scans the entire Tesserae text folder and builds contextual data
for each type in the corpus. That data supports experiments that choose the best
lemma (in ambiguous cases), synonym, or translation for a given token-in-context during
NLP tasks. Contributors: James Gawley.

In [3]:
from os import listdir
from os.path import isfile, join, expanduser
from collections import defaultdict
from pprint import PrettyPrinter
from cltk.tokenize.word import WordTokenizer
from cltk.stem.latin.j_v import JVReplacer
from operator import itemgetter
from difflib import SequenceMatcher
import pickle
import os

In [6]:
rel_path = os.path.join('~/cltk')
path = os.path.expanduser(rel_path)
os.chdir(path)
from cltk.tokenize.word import WordTokenizer
from cltk.stem.latin.j_v import JVReplacer
from cltk.semantics.latin.lookup import Lemmata

In [7]:
#This filepath needs to be customized. The git repo is located at https://github.com/jeffkinnison/tesserae-v5.git
os.chdir('/Users/James/tesserae-v5')
from tesserae.utils import TessFile

SKIP_LIBRARY is a large dictionary whose keys are lemmas and whose values are dictionaries.
Each second-layer dictionary describes the contexts in which a lemma's word forms occur in
the corpus, stored as counts of the surrounding word forms.
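
As a rough illustration of the shape described above, here is a hand-built toy version of the structure. The name toy_library, the lemmas, and the counts are all invented for the example; nothing here is drawn from the real corpus.

```python
from collections import defaultdict

# Toy illustration only: these lemmas and counts are made up.
toy_library = {
    'arma': defaultdict(int, {'uirumque': 3, 'cano': 2}),
    'uir': defaultdict(int, {'arma': 3, 'cano': 1}),
}

# Outer key: a lemma. Inner key: a neighbouring word form.
# Inner value: how often that word form appeared near the lemma.
print(toy_library['arma']['cano'])

# Because the inner dictionaries are defaultdict(int), a context word
# that was never seen reads as zero instead of raising KeyError.
print(toy_library['arma']['troiae'])
```

Note that reading a missing key from a defaultdict also inserts it, so lookups on the real structure are safer with .get() when the dictionary should not grow.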

In the original version of this data structure, the SKIP_LIBRARY keys were normalized tokens, in other words inflected forms of Latin words. Converting from tokens to lemmas *massively* increased run time: instead of several hours, the code would take several weeks to execute. The only changes were to the skipgram() function. Specifically, these lines were added:

    lemmatizer = Lemmata(dictionary = 'lemmata', language = 'latin')
    lemmas = lemmatizer.lookup(targettoken)
    lemmas = lemmatizer.isolate(lemmas)

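One possible problem is that a new lemmatizer is instantiated at each step in the program's main loop, in other words close to 10 million times. Since it's the same lemmatizer every time, this doesn't actually need to happen. A Python function can read names defined at module level, so the lemmatizer can be created once outside skipgram() and reused on every call; the definition of skipgram() later in this notebook is written that way.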
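
A minimal sketch of the create-once pattern, using a hypothetical stand-in class instead of the real lemmatizer. ExpensiveLookup, shared_lookup, lemmas_for, and the toy table are assumptions made for illustration and are not part of CLTK.

```python
class ExpensiveLookup:
    """Hypothetical stand-in for an object whose constructor is slow."""
    instances_created = 0

    def __init__(self):
        # Count constructions so the effect of reuse is visible.
        ExpensiveLookup.instances_created += 1
        self.table = {'arma': ['arma', 'armo']}  # invented toy data

    def lookup(self, token):
        # Unknown tokens fall back to themselves in this toy version.
        return self.table.get(token, [token])


# Built once, at module level, before any function uses it.
shared_lookup = ExpensiveLookup()


def lemmas_for(token):
    # Reading a module-level name needs no `global` statement;
    # `global` is only required when rebinding the name.
    return shared_lookup.lookup(token)


for token in ['arma', 'uirum', 'arma']:
    lemmas_for(token)

print(ExpensiveLookup.instances_created)
```

However many tokens pass through lemmas_for, the constructor runs only once, which is the saving the paragraph above is after.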

The idea behind making lemmas the SKIP_LIBRARY keys is that the contextual information for each lemma will be drawn from the inflected forms in the corpus. When a form is ambiguous, the contextual info for every possible lemma is updated. When the form is unambiguous, only the correct lemma is updated. So when we lemmatize in-context, we can look at the surrounding word forms and compare that context to the stored context for each lemma. If the token is ambiguous but its surrounding words look like the words we saw in unambiguous cases, then we know which possible lemma is more likely.
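
The update rule can be sketched on toy data as follows. TOY_LEMMAS is a hypothetical form-to-lemma table standing in for the real lemmatizer, and toy_update and toy_library are names invented for this example.

```python
from collections import defaultdict

# Hypothetical form-to-lemma table; the real lookup comes from the lemmatizer.
TOY_LEMMAS = {
    'arma': ['arma', 'armo'],   # ambiguous form: two candidate lemmas
    'armat': ['armo'],          # unambiguous form: one lemma
}

toy_library = defaultdict(lambda: defaultdict(int))


def toy_update(form, context):
    """Credit the context words to every lemma the form could belong to."""
    for lemma in TOY_LEMMAS.get(form, [form]):
        for word in context:
            toy_library[lemma][word] += 1


toy_update('arma', ['uirumque', 'cano'])   # updates both 'arma' and 'armo'
toy_update('armat', ['milites', 'cano'])   # updates only 'armo'

print(dict(toy_library['arma']))
print(dict(toy_library['armo']))
```

After both updates, 'armo' has a higher count for 'cano' than 'arma' does, because the unambiguous form contributed only to 'armo'. Differences of that kind are what the later comparison step relies on.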

In [9]:
SKIP_LIBRARY = dict()

In [10]:
def read_files(filepath, context_window):
    '''Moves through a .tess file and calls the 'next' and 'skipgram' functions as needed.
    Updates the SKIP_LIBRARY global object.
    Parameters
    ----------
    filepath: a file in .tess format
    context_window: how many words on either side of the target to look at.
    '''
    tessobj = TessFile(filepath)
    tokengenerator = iter(tessobj.read_tokens())
    tokens = new_file(tokengenerator, context_window)
    stop = 0
    while stop != 1:
        #the target sits (context_window + 1) positions from the end of the buffer, until EOF
        target_position = len(tokens) - (context_window + 1)
        targettoken = tokens[target_position]
        #grab all the other tokens but the target
        contexttokens = [x for i, x in enumerate(tokens) if i != target_position]
        #add this context to the skipgram map
        skipgram(targettoken, contexttokens)
        #prep the next token in the file
        try:
            rawtoken = next(tokengenerator)
            cleantoken = token_cleanup(rawtoken)            
            tokens.append(cleantoken)
            if len(tokens) > (context_window * 2 + 1):
                tokens.pop(0)
        except StopIteration:
            #we have reached EOF. The tokens after the current target still need to be
            #processed, each with a shrinking right-hand context. Every remaining token
            #becomes the target exactly once.
            while target_position < len(tokens) - 1:
                if target_position < context_window:
                    #short buffer: the left context is not full yet, so step the target forward
                    target_position += 1
                else:
                    #full left context: drop the oldest token so the next one slides into place
                    tokens.pop(0)
                targettoken = tokens[target_position]
                #grab all the other tokens but the target
                contexttokens = [x for i, x in enumerate(tokens) if i != target_position]
                #add this context to the skipgram map
                skipgram(targettoken, contexttokens)
            stop = 1
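
For comparison, the same target/context pairing can be sketched on a plain list. This is not the streaming approach used in read_files: it assumes the whole token list fits in memory, and the function name windows is invented for this example.

```python
def windows(tokens, context_window):
    """Yield (target, context) for every token, truncating at the edges."""
    for i, target in enumerate(tokens):
        # Up to context_window tokens on the left of the target...
        left = tokens[max(0, i - context_window):i]
        # ...and up to context_window tokens on the right.
        right = tokens[i + 1:i + 1 + context_window]
        yield target, left + right


for target, context in windows(['arma', 'uirumque', 'cano', 'troiae'], 1):
    print(target, context)
```

Each token appears as a target exactly once, with fewer context words near the start and end of the list.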

In [18]:
lemmatizer = Lemmata(dictionary = 'lemmata', language = 'latin')
def skipgram(targettoken, contexttokens):
    '''Builds a complex data structure that will contain the 'average context'
    for each lemma in the corpus. Updates SKIP_LIBRARY.
    param targettoken: the token in question
    param contexttokens: list of tokens surrounding the targettoken
    global SKIP_LIBRARY: a dictionary whose keys are lemmas and whose values are
    dictionaries of context-word counts; see above.
    '''
    global SKIP_LIBRARY
    lemmas = lemmatizer.lookup(targettoken)
    lemmas = lemmatizer.isolate(lemmas)
    for lemma in lemmas:
        if lemma not in SKIP_LIBRARY:
            SKIP_LIBRARY[lemma] = defaultdict(int)
        for contextword in contexttokens:
            SKIP_LIBRARY[lemma][contextword] += 1

In [13]:
def new_file(tokengenerator, context_window):
    '''Takes an iterator object for the file being read.
    Reads in the first context_window + 1 tokens (the first target plus its
    right-hand context) and returns them'''
    tokens = []
    for i in range(0, (context_window + 1)):
        rawtoken = next(tokengenerator)
        cleantoken = token_cleanup(rawtoken)
        # NB: right now the code assumes that first sentence is > n + 1 words
        tokens.append(cleantoken)
    return tokens

In [17]:
jv = JVReplacer()
word_tokenizer = WordTokenizer('latin')
def token_cleanup(rawtoken):
    '''This function is intended to make word-forms in the corpus more uniform.'''
    rawtoken = jv.replace(rawtoken)
    rawtoken = rawtoken.lower()
    tokenlist = word_tokenizer.tokenize(rawtoken)
    return tokenlist[0]

The following is the actual program loop.

In [None]:
#open all the tesserae files
relativepath = join('~/cleantess/tesserae/texts/la')
path = expanduser(relativepath)
onlyfiles = [f for f in listdir(path) if isfile(join(path, f))]
onlyfiles = [join(path, f) for f in onlyfiles]

for filename in onlyfiles:
    print(filename)
    if filename.endswith('.tess'):
        read_files(filename, context_window = 2)

Now that the SKIP_LIBRARY data structure has been built, it's time to test its ability to assign a probability distribution over the possible lemmas of an ambiguous token. This can be done by comparing the context in which the target token is found against the representative context stored for each candidate lemma in SKIP_LIBRARY.
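
The scoring idea can be sketched on toy data before turning to the real function. The function name toy_distribution, the toy library, and the uniform fallback for the no-evidence case are assumptions made for this example.

```python
def toy_distribution(candidates, context, library):
    """Return each candidate lemma's share of the matching context counts."""
    scores = {}
    for lemma in candidates:
        seen = library.get(lemma, {})
        # Sum the stored counts for the context words found near the target.
        scores[lemma] = sum(seen.get(word, 0) for word in set(context))
    total = sum(scores.values())
    if total == 0:
        # Assumption: with no evidence at all, treat every lemma as equally likely.
        return {lemma: 1 / len(candidates) for lemma in candidates}
    return {lemma: score / total for lemma, score in scores.items()}


# Invented counts, for illustration only.
library = {
    'arma': {'uirumque': 6, 'cano': 3},
    'armo': {'milites': 4, 'cano': 1},
}

print(toy_distribution(['arma', 'armo'], ['uirumque', 'cano'], library))
```

With these counts 'arma' scores 9 and 'armo' scores 1, so the distribution is 0.9 to 0.1.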

In [2]:
def compare_context(target, context):
    '''Assigns a probability value to each possible lemma in ambiguous context.
    Returns a list of (lemma, probability) tuples when the target is ambiguous;
    otherwise returns the lemmatizer's own lookup result unchanged.
    params
    ------
    target: the token to lemmatize
    context: tokens found in the vicinity of the target, in situ
    '''
    #gather a list of possible lemmas
    lemmas = lemmatizer.lookup(target)
    lemmas = lemmatizer.isolate(lemmas)
    #if there is more than one possibility, load up their lemma-contexts from SKIP_LIBRARY
    if len(lemmas) > 1:
        shared_context_counts = dict()
        for lem in lemmas:
            #when the text being lemmatized is part of the training corpus, every context word
            #here has already been counted for every candidate lemma, so the number of shared
            #context words is the same for each lemma.
            #so instead of *whether* a word was seen, we rely on how many times.
            #a lemma missing from SKIP_LIBRARY simply contributes no counts.
            lemma_context_dictionary = SKIP_LIBRARY.get(lem, {})
            counts = [lemma_context_dictionary[context_token] for context_token in set(context).intersection(lemma_context_dictionary)]
            shared_context_counts[lem] = sum(counts)
            print(shared_context_counts[lem])
        total_shared = sum(shared_context_counts.values())
        lemmalist = []
        for lem in lemmas:
            if total_shared > 0:
                lemmaprob = shared_context_counts[lem] / total_shared
            else:
                #no shared context at all: fall back to a uniform distribution (an assumption)
                lemmaprob = 1 / len(lemmas)
            lemmaobj = (lem, lemmaprob)
            lemmalist.append(lemmaobj)
        return lemmalist
    else:
        return lemmatizer.lookup(target)

The following is a test of the lemmatizer on the first words of the Aeneid. The Aeneid happens to sit at index 389 of onlyfiles on my machine; this will not be true on all installs.

In [None]:
tessobj = TessFile(onlyfiles[389])
tokengenerator = iter(tessobj.read_tokens())
tokens = new_file(tokengenerator, 4)
target = tokens.pop(1)
compare_context(target, tokens)