## Latin pedagogy tool 

We make use of the CLTK library, an NLP library for classical languages.

**Introduction** : https://aclanthology.org/2021.acl-demo.3.pdf

**Documentation**
* API : https://docs.cltk.org/en/latest/index.html
* Demos : https://github.com/cltk/cltk/tree/master/notebooks

In [10]:
text = """
Architecti est scientia pluribus disciplinis et variis eruditionibus ornata, quae ab ceteris artibus perficiuntur.
Opera ea nascitur et fabrica et ratiocinatione."""

In [11]:
#corpus = get_corpus_reader(corpus_name='latin_text_perseus', language='latin')
from cltk.data.fetch import FetchCorpus
corpus_downloader = FetchCorpus(language="lat")
corpus_downloader.import_corpus('lat_text_perseus')

## Preprocessing

In [12]:
from cltk.alphabet.lat import drop_latin_punctuation
import re

def cleanDoc(text, convertLower=False):
    # Remove metainfo like [c 1Kb]
    cleaned = re.sub(r"[\(\[].*?[\)\]]", "", text)
    # Collapse runs of two or more spaces into a single space
    cleaned = re.sub(r" {2,}", " ", cleaned)
    return cleaned.lower() if convertLower else cleaned
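
As a quick check of the cleaning step, here is a minimal standalone sketch applied to a made-up sample string. The name `clean_doc_sketch` and the sample text are illustrative assumptions, not part of the notebook's pipeline.

```python
import re

def clean_doc_sketch(text: str, lower: bool = False) -> str:
    # Drop bracketed or parenthesised spans such as "[c 1Kb]" (non-greedy match)
    cleaned = re.sub(r"[\(\[].*?[\)\]]", "", text)
    # Collapse any run of two or more spaces into one
    cleaned = re.sub(r" {2,}", " ", cleaned)
    return cleaned.lower() if lower else cleaned

# Hypothetical sample line with a metainfo tag and wide spacing
print(clean_doc_sketch("Gallia [c 1Kb] est   omnis divisa", lower=True))
# gallia est omnis divisa
```

Note that `.` does not match newlines by default, so a bracketed span that runs across a line break would be left in place.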

## Macronizer

In [13]:
from cltk.prosody.lat.macronizer import Macronizer

# NOTE: subpar accuracy for the macronizer
def macronizer(text: str) -> str:
    # Use a distinct local name so the function name is not shadowed
    tagger = Macronizer("tag_tnt")
    return tagger.macronize_text(text)

tēxt = macronizer(text)
print(f"Example : {text[1:33]} -> {tēxt[:32]}")

Example : Architecti est scientia pluribus -> architectī est scientiā plūribus


## Lemmatizer

In [14]:
from cltk.lemmatize.lat import LatinBackoffLemmatizer

# The lemmatizer yields (original, root) tuples; only the roots are returned
# Requires lower-case, non-macron inputs
def lemmatize(tokens: list)-> list:
    lemmatizer = LatinBackoffLemmatizer()
    tokens = lemmatizer.lemmatize(tokens)
    return [root for _, root in tokens]

tokens = ["filias", "pueri", "cecini", "variis"]
print(f"Example : {tokens} -> {lemmatize(tokens)}")


Example : ['filias', 'pueri', 'cecini', 'variis'] -> ['filia', 'puer', 'cano', 'varius1']
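
Since the lemmatizer expects non-macron input and can return lemmas with a trailing digit (`varius1` above), two small helpers can tidy both ends. This is a sketch using only the standard library; the helper names are my own, and reading the trailing digit as a homograph index is an assumption.

```python
import re
import unicodedata

def strip_macrons(text: str) -> str:
    # Decompose characters (NFD), drop the combining macron U+0304, then recompose
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace("\u0304", ""))

def strip_lemma_index(lemma: str) -> str:
    # Remove a trailing number such as the "1" in "varius1"
    # (assumed to be a homograph index)
    return re.sub(r"\d+$", "", lemma)

print(strip_macrons("leōnēs"))       # leones
print(strip_lemma_index("varius1"))  # varius
```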


## Decliner

Declension encodings are described here : https://github.com/cltk/latin_treebank_perseus#readme

E.g. `--s----n-` => singular nominative (the third character encodes number, the eighth encodes case)

It can be used to construct declension tables as below.

In [15]:
from cltk.morphology.lat import CollatinusDecliner
from collections import OrderedDict
from tabulate import tabulate

words = ['leo', 'via']
def declensions(rootWords: list)-> dict:
    dec, decliner = {}, CollatinusDecliner()
    for word in rootWords:
        # Expect root words only
        try:
            dec[word] = decliner.decline(word)
        except Exception:
            print(f"'{word}' is not a root word")
    return dec

# Usage example : declension table (only nouns for now)
def printDecTable(lemma, POS):

    rows = []
    cases = {'n':'Nominative',
             'g':'Genitive',
             'd':'Dative',
             'a':'Accusative',
             'b':'Ablative'}

    d = OrderedDict({c: {} for c in cases.keys()})
    if POS=="noun":
        declens = CollatinusDecliner().decline(lemma)
        for dec, code in declens:
            number, case = code[2], code[7]
            if case in d: d[case][number] = macronizer(dec).lower()
        
        for key, val in d.items():
            row =[cases[key]]+list(val.values())
            rows.append(row)
            
    print(tabulate(rows, headers=['Case', 'Singular', 'Plural']))

decs = declensions(words)
printDecTable("leo", "noun")

Case        Singular    Plural
----------  ----------  --------
Nominative  leō         leōnēs
Genitive    leōnis      leōnum
Dative      leōnī       leōnibus
Accusative  leōnem      leōnēs
Ablative    leōne       leōnibus
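
The positional codes used above can also be decoded on their own. The sketch below reads only the number and case positions (indices 2 and 7, as in `printDecTable`). The function name is hypothetical, and the vocative and locative letters are my reading of the Perseus scheme, so check them against the README linked above.

```python
# Letter-to-label tables for the two positions used in this notebook
NUMBER = {"s": "singular", "p": "plural"}
CASE = {
    "n": "nominative", "g": "genitive", "d": "dative",
    "a": "accusative", "b": "ablative",
    "v": "vocative", "l": "locative",  # assumed from the Perseus scheme
}

def decode_tag(tag: str) -> tuple:
    # Tags are expected to be nine characters long, one slot per feature
    if len(tag) != 9:
        raise ValueError(f"expected a 9-character tag, got {tag!r}")
    number = NUMBER.get(tag[2], "unspecified")
    case = CASE.get(tag[7], "unspecified")
    return number, case

print(decode_tag("--s----n-"))  # ('singular', 'nominative')
```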


## Tokenizer

In [16]:
from cltk.sentence.lat import LatinPunktSentenceTokenizer
from cltk.alphabet.text_normalization import remove_non_latin
from cltk.tokenizers.lat.lat import LatinWordTokenizer

# Sentence tokenizer; when punct is True, each sentence is stripped of non-Latin characters and lower-cased
def sentTokenize(doc: str, punct=True) -> list:
    sent_tokenize = LatinPunktSentenceTokenizer()
    sentences = sent_tokenize.tokenize(doc)
    return [remove_non_latin(s).lower() for s in sentences] if punct else sentences

# Word tokenizer
def word_Tokenizer(sent: str) -> list:
    word_tokenize = LatinWordTokenizer()
    tokens = word_tokenize.tokenize(sent)
    return tokens

sentences = sentTokenize(text)
tokens = word_Tokenizer(sentences[0])

## NLP pipeline

The CLTK library also has a pre-configured NLP pipeline for Latin. Most usefully, it tokenises the text automatically, attaches morphological information (gender, case, etc.) to each `Word` object, and creates word embeddings.

In [17]:
from cltk import NLP
cltk_nlp = NLP(language="lat")

text = drop_latin_punctuation(cleanDoc(text, convertLower=True))
cltk_doc = cltk_nlp.analyze(text=text)

‎𐤀 CLTK version '1.1.6'.
Pipeline for language 'Latin' (ISO: 'lat'): `LatinNormalizeProcess`, `LatinStanzaProcess`, `LatinEmbeddingsProcess`, `StopsProcess`, `LatinLexiconProcess`.


In [33]:
cltk_doc.words[0].__dict__

{'index_char_start': None,
 'index_char_stop': None,
 'index_token': 0,
 'index_sentence': 0,
 'string': 'architecti',
 'pos': verb,
 'lemma': 'architico',
 'stem': None,
 'scansion': None,
 'xpos': 'L2|modM|tem4|grp1|casB|gen1',
 'upos': 'VERB',
 'dependency_relation': 'root',
 'governor': -1,
 'features': {Aspect: [perfective], Case: [genitive], Degree: [positive], Gender: [masculine], Number: [singular], Tense: [past], VerbForm: [participle], Voice: [passive]},
 'category': {F: [neg], N: [neg], V: [pos]},
 'embedding': array([ 1.6831e-01, -1.3812e-01, -2.2983e-01, -1.1404e-01,  6.0152e-01,
        -4.8938e-02, -4.0779e-01,  7.2283e-01,  1.9987e-01, -8.3087e-02,
         2.3405e-01,  2.8201e-01, -2.8227e-01, -4.3663e-01,  1.8110e-01,
        -2.1940e-01,  2.9445e-01,  2.3684e-01,  3.9450e-01, -8.0237e-02,
        -4.2323e-02,  6.2347e-01,  6.8870e-01,  1.2506e-01, -8.6620e-01,
        -3.4647e-01,  4.7634e-01, -3.1648e-01,  5.1305e-01, -6.8620e-01,
        -5.4679e-01, -3.2498e-01,  
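
One common use of the `embedding` field is comparing two words by cosine similarity. The sketch below uses only the standard library and works on any sequence of floats, so it should also accept the arrays shown above. The short vectors are made-up stand-ins, not real CLTK embeddings.

```python
import math

def cosine_similarity(u, v) -> float:
    # Dot product divided by the product of the two vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # A zero vector has no direction; treat the similarity as 0.0
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors for illustration only
vec_a = [0.17, -0.14, -0.23]
vec_b = [0.20, -0.10, -0.30]
print(cosine_similarity(vec_a, vec_b))
```

With a real document, something like `cosine_similarity(cltk_doc.words[0].embedding, cltk_doc.words[1].embedding)` would be the intended call, assuming both words have an embedding.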

## Dictionary

Although the `Word` objects carry definitions, they are poorly formatted (one giant string).
The following API returns a more compact definition: https://www.latin-is-simple.com/api/

`intern_type` can have the following values:
- "dempron" for demonstrative pronouns

In [18]:
import requests

# Formats the header of the word
def formatHeader(header: str, POS: str) -> str:

    # Formatting for verbs
    if POS=="verb":
        principalParts = header.split(",")
        header = ",".join(principalParts[:1]+principalParts[2:])
        return header

    # Formatting for nouns
    elif POS=="noun":
        # decParadigm = header[-1]
        # Can use the paradigm to further format the header
        # E.g. head = formatNoun(header, decParadigm)
        # Raw string avoids invalid-escape warnings for \[ and \]
        header = re.sub(r"[\[\]]", "", header)[:-2]
        return header
    #elif POS=="adverb":
    #elif POS=="adjective":
    #elif POS=="dempron"
    else:
        return header

def getDefinition(word: str, POS: str) -> str:
    word, POS = word.lower(), POS.lower()

    apiURL = "https://www.latin-is-simple.com/api/vocabulary/search/?query="+word+"&forms_only=true"
    # Parse the JSON response once and reuse it below
    results = requests.get(apiURL).json()

    definition = ""

    # Only one result (used regardless of its POS)
    if len(results)==1:
        entry = results[0]
        header = formatHeader(entry['full_name'], entry["intern_type"])
        body = entry["translations_unstructured"]["en"]
        definition = header + " : " + body

    # Multiple results (get the first entry that matches POS)
    else:
        for entry in results:
            if entry["intern_type"]==POS:
                header = formatHeader(entry['full_name'], POS)
                body = entry["translations_unstructured"]["en"]
                definition = header + " : " + body
                break

    return definition

# Example
print(getDefinition('fabrica', 'noun'))

fabrica, fabricae f. : (1.) craft, art, craft of metalwork/building, construction/building/making (2.) smith's shop, workshop
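
The query URL above is assembled by string concatenation, which is fine for plain Latin words. For input containing spaces or other special characters, a safer variant is to let the standard library encode the parameters. This is a sketch only: `build_search_url` is a hypothetical helper, and the endpoint and parameter names are copied from `getDefinition`.

```python
from urllib.parse import urlencode

# Endpoint taken from getDefinition above
BASE_URL = "https://www.latin-is-simple.com/api/vocabulary/search/"

def build_search_url(word: str) -> str:
    # urlencode escapes spaces and special characters in the query value
    query = urlencode({"query": word.lower(), "forms_only": "true"})
    return f"{BASE_URL}?{query}"

print(build_search_url("fabrica"))
# https://www.latin-is-simple.com/api/vocabulary/search/?query=fabrica&forms_only=true
```

Passing the same dictionary as `params=` to `requests.get` should achieve the same encoding without building the string by hand.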


## USE CASE - generate vocabulary aid

I primarily use this library to generate rudimentary word lists (with declension information), either to pre-study vocabulary or to study as I read.

The idea was inspired by Latin readers like this one, with the aim of automating their creation: https://geoffreysteadman.files.wordpress.com/2019/05/ritchie.may2019.pdf

It is most useful for beginners, who can have trouble identifying declension forms.

The example below makes a vocabulary aid from Caesar's Gallic War. More texts are available at https://github.com/cltk/lat_text_latin_library


In [30]:
with open("./Corpus/gall1.txt", "r") as f:
    x = f.read()

# Get the first paragraph 
l = re.search(r"\[ 1 \] (.*)", x)
x = cleanDoc(drop_latin_punctuation(l[1]), convertLower=True)

print(f"Example text : {x[:101]}")

Example text : gallia est omnis divisa in partes tres quarum unam incolunt belgae aliam aquitani tertiam qui ipsorum


In [20]:
gallicWars1 = cltk_nlp.analyze(text=x)

In [31]:
def generateLex(WordList):
    s = set()
    for word in WordList:
        if not word.stop:
            result = getDefinition(word.string, str(word.pos))
            # Alert if word wasn't found
            if result:
                s.add(result)
            else:
                print(word.string + " not found, POS : " + str(word.pos))
    return s

# Pass the list of Word objects explicitly
words = generateLex(gallicWars1.words)

omnis not found, POS : determiner
divisa not found, POS : adjective
tres not found, POS : numeral
aliam not found, POS : determiner
nostra not found, POS : determiner
omnes not found, POS : determiner
omnium not found, POS : pronoun
quod not found, POS : subordinating_conjunction
saepe not found, POS : adverb
quoque not found, POS : adverb
quod not found, POS : subordinating_conjunction
fere not found, POS : adposition
belgae not found, POS : adjective
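
Many of these misses probably come from part-of-speech labels that have no matching `intern_type` in the API, or from inflected forms with no result. Tallying the misses by POS shows which categories would need attention first. The sketch below copies the (word, POS) pairs from the output above; the helper name is my own.

```python
from collections import Counter

# (word, POS) pairs copied from the "not found" output above
misses = [
    ("omnis", "determiner"), ("divisa", "adjective"), ("tres", "numeral"),
    ("aliam", "determiner"), ("nostra", "determiner"), ("omnes", "determiner"),
    ("omnium", "pronoun"), ("quod", "subordinating_conjunction"),
    ("saepe", "adverb"), ("quoque", "adverb"),
    ("quod", "subordinating_conjunction"), ("fere", "adposition"),
    ("belgae", "adjective"),
]

def misses_by_pos(pairs) -> Counter:
    # Count how many missed words fall under each POS label
    return Counter(pos for _, pos in pairs)

for pos, count in misses_by_pos(misses).most_common():
    print(f"{pos}: {count}")
```

Here determiners account for 4 of the 13 misses, so mapping that label to whatever the API uses would be a reasonable first step.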


In [25]:
import pprint
m = sorted(words)
pprint.pprint(m[:10])
with open("lexical_list", "w") as f:
    f.write("\n\n".join(m))

['Aquitania, Aquitaniae f. : Aquitania, one of the divisions of Gaul/France '
 '(southwest)',
 'Aquitanus/Aquitana/Aquitanum, AO : of Aquitania  (southwest Gaul/France)',
 'Belga, Belgae m. : Belgae (pl.)',
 'Celtus/Celta/Celtum, AO : Celts',
 'Gallia, Galliae f. : Gaul',
 'Gallus, Galli m. : Gaul, the Gauls (pl.), cock, rooster',
 'Garumna, Garumnae f. : Garonna',
 'Helvetia, Helvetiae f. : Switzerland',
 'Helvetius, Helvetii m. : Helvetii (pl.), tribe in Central Gaul (Switzerland)',
 'Hispania, Hispaniae f. : Hispania, Spain']
