## Latin pedagogy tool 

We use CLTK (the Classical Language Toolkit), an NLP library for classical languages.

**Introduction** : https://aclanthology.org/2021.acl-demo.3.pdf

**Documentation**
* API : https://docs.cltk.org/en/latest/index.html
* Demos : https://github.com/cltk/cltk/tree/master/notebooks

In [90]:
text = """
Architecti est scientia pluribus disciplinis et variis eruditionibus ornata, quae ab ceteris artibus perficiuntur.
Opera ea nascitur et fabrica et ratiocinatione.
"""

In [5]:
#corpus = get_corpus_reader(corpus_name='latin_text_perseus', language='latin')
from cltk.data.fetch import FetchCorpus
corpus_downloader = FetchCorpus(language="lat")
corpus_downloader.import_corpus('lat_text_perseus')

Downloaded 100% 112.75 MiB | 1.99 MiB/s 

## Preprocessing

In [3]:
from cltk.alphabet.lat import drop_latin_punctuation
import re

def cleanDoc(text, convertLower=False):
    # Remove metainfo like [c 1Kb]
    cleaned = re.sub(r"[\(\[].*?[\)\]]", "", text)
    # Collapse any run of two or more spaces into a single space
    cleaned = re.sub(r" {2,}", " ", cleaned)
    return cleaned.lower() if convertLower else cleaned
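
As a quick sanity check of the bracket-stripping pattern, the sketch below applies the same regular expression to a made-up string. The sample text and the name `strip_meta` are illustrative and not taken from the corpus. Collapsing runs of spaces with a regex is an assumption about what the cleaning step is meant to do.

```python
import re

def strip_meta(text: str) -> str:
    # Same pattern as cleanDoc: drop anything enclosed in (...) or [...]
    cleaned = re.sub(r"[\(\[].*?[\)\]]", "", text)
    # Collapse runs of two or more spaces into one (assumed intent)
    return re.sub(r" {2,}", " ", cleaned).strip()

sample = "Gallia est [c 1Kb] omnis (nota) divisa"  # made-up sample
print(strip_meta(sample))
```

Note that the pattern also matches mismatched pairs such as `(...]`, and that it does not span line breaks because `.` stops at a newline.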

## Decliner

Declension encodings are described here : https://github.com/cltk/latin_treebank_perseus#readme

E.g. --s----n- => singular nominative
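
To make the encoding concrete, here is a minimal sketch that reads the number and case out of a tag, using the same positions as the code in this notebook (`code[2]` for number, `code[7]` for case). The function name `describe_tag` is hypothetical, and only the five cases handled in this notebook are mapped.

```python
# Positions follow the indexing used in this notebook:
# code[2] is the number, code[7] is the case.
NUMBERS = {"s": "singular", "p": "plural"}
CASES = {
    "n": "nominative",
    "g": "genitive",
    "d": "dative",
    "a": "accusative",
    "b": "ablative",
}

def describe_tag(code: str) -> str:
    """Return e.g. 'singular nominative' for '--s----n-'."""
    if len(code) < 8:
        raise ValueError(f"tag too short: {code!r}")
    # Unknown or unset positions (e.g. '-') are reported as '?'
    number = NUMBERS.get(code[2], "?")
    case = CASES.get(code[7], "?")
    return f"{number} {case}"

print(describe_tag("--s----n-"))
print(describe_tag("--p----b-"))
```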

In [7]:
from cltk.morphology.lat import CollatinusDecliner
from collections import OrderedDict
from tabulate import tabulate

words = ['leo', 'via']
def declensions(rootWords: list) -> dict:
    dec, decliner = {}, CollatinusDecliner()
    for word in rootWords:
        # The decliner expects lemmas (root words) only
        try:
            dec[word] = decliner.decline(word)
        except Exception:
            print(f"Could not decline '{word}': not a root word")
    return dec

# Usage example: declension table (nouns only for now)
# NOTE: relies on macronizer(), defined in the Macronizer section; run that cell first
def printDecTable(lemma, POS):

    rows = []
    cases = {'n':'Nominative',
             'g':'Genitive',
             'd':'Dative',
             'a':'Accusative',
             'b':'Ablative'}

    d = OrderedDict({c: {} for c in cases.keys()})
    if POS=="noun":
        declens = CollatinusDecliner().decline(lemma)
        for dec, code in declens:
            number, case = code[2], code[7]
            if case in d: d[case][number] = macronizer(dec).lower()
        
        for key, val in d.items():
            row =[cases[key]]+list(val.values())
            rows.append(row)
            
    print(tabulate(rows, headers=['Case', 'Singular', 'Plural']))

decs = declensions(words)
printDecTable("leo", "noun")

Case        Singular    Plural
----------  ----------  --------
Nominative  leō         leōnēs
Genitive    leōnis      leōnum
Dative      leōnī       leōnibus
Accusative  leōnem      leōnēs
Ablative    leōne       leōnibus


## Lemmatizer

In [6]:
from cltk.lemmatize.lat import LatinBackoffLemmatizer

# Returns tuples of (original, root)
# Requires lower-case, non-macron inputs
def lemmatize(tokens: list)-> list:
    lemmatizer = LatinBackoffLemmatizer()
    tokens = lemmatizer.lemmatize(tokens)
    return [root for _, root in tokens]

tokens = ["filias", "pueri", "cecini", "variis"]
print(f"Example : {tokens} -> {lemmatize(tokens)}")


Example : ['filias', 'pueri', 'cecini', 'variis'] -> ['filia', 'puer', 'cano', 'varius1']
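
Because the lemmatizer expects lower-case input without macrons, text that has already been macronized needs to be normalized first. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and treating the trailing digit in `varius1` as a homograph index is an assumption based on the output above.

```python
import re
import unicodedata

COMBINING_MACRON = "\u0304"

def strip_macrons(token: str) -> str:
    # Decompose so that 'ō' becomes 'o' + combining macron, then drop the macron
    decomposed = unicodedata.normalize("NFD", token)
    without = decomposed.replace(COMBINING_MACRON, "")
    return unicodedata.normalize("NFC", without)

def prepare_for_lemmatizer(tokens: list) -> list:
    # Lower-case and remove macrons, as the lemmatizer requires
    return [strip_macrons(t).lower() for t in tokens]

def strip_homograph_index(lemma: str) -> str:
    # Assumption: trailing digits only distinguish homographs
    return re.sub(r"\d+$", "", lemma)

print(prepare_for_lemmatizer(["Fīliās", "leōnēs"]))
print(strip_homograph_index("varius1"))
```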


## Macronizer

In [5]:
from cltk.prosody.lat.macronizer import Macronizer

# NOTE: subpar accuracy for the macronizer 
def macronizer(text: str) -> str:
    macronizer = Macronizer("tag_tnt")
    text = macronizer.macronize_text(text)
    return text

tēxt = macronizer(text)
print(tēxt)

architectī est scientiā plūribus disciplīnīs et variīs ērudītiōnibus ōrnāta , quae ab cēterīs artibus perficiuntur . opera ea nāscitur et fabricā et ratiōcinātiōne .


## Tokenizer

In [8]:
from cltk.sentence.lat import LatinPunktSentenceTokenizer
from cltk.alphabet.text_normalization import remove_non_latin
from cltk.tokenizers.lat.lat import LatinWordTokenizer

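# Sentence tokenizer; with punct=True, non-Latin characters (including punctuation)
# are stripped and each sentence is lower-cased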
def sentTokenize(doc: str, punct=True) -> list:
    sent_tokenize = LatinPunktSentenceTokenizer()
    sentences = sent_tokenize.tokenize(doc)
    return [remove_non_latin(s).lower() for s in sentences] if punct else sentences

# Word tokenizer
def word_Tokenizer(sent: str) -> list:
    word_tokenize = LatinWordTokenizer()
    tokens = word_tokenize.tokenize(sent)
    return tokens

sentences = sentTokenize(text)
tokens = word_Tokenizer(sentences[0])

In [91]:
# NLP pipeline
from cltk import NLP
cltk_nlp = NLP(language="lat")

text = drop_latin_punctuation(cleanDoc(text, convertLower=True))
cltk_doc = cltk_nlp.analyze(text=text)

‎𐤀 CLTK version '1.1.6'.
Pipeline for language 'Latin' (ISO: 'lat'): `LatinNormalizeProcess`, `LatinStanzaProcess`, `LatinEmbeddingsProcess`, `StopsProcess`, `LatinLexiconProcess`.


In [104]:
# Word attributes
# word.lemma
# word.gender (for nouns)
# word.pos (part of speech nouns etc)
# word.string
# word.features
# word.xpos (treebank POS tag)
# word.upos (universal POS tag)
#   For verbs : Aspect, case, degree, gender
#   For nouns : case, degree, gender, number


architectus, architecti m. : architect, master-builder, inventor, designer, maker, author, deviser
scientia, scientiae f. : knowledge
plus, pluris M : more, several. many
disciplina, disciplinae f. : teaching, instruction, education, training, discipline, method, science, study
varius/varia/varium, AO : different, various, diverse, changing, colored, party colored, variegated
eruditio, eruditionis f. : instruction/teaching/education, learning/erudition, taught knowledge, culture
orno, ornare A, ornavi, ornatum : adorn, decorate
ceterus/cetera/ceterum, AO : remaining, rest
ars, artis f. : art, skill
perficio, perficere M, perfeci, perfectum : complete, finish, execute, bring about, accomplish, do thoroughly
opus, operis n. : work, achievement, oeuvre
nascor, nasci C, natus sum (Dep.) : (1.) be born (2.) spring forth, arise

ratiocinatio, ratiocinationis f. : reasoning, esp. a form of argument, syllogism


## Dictionary 1

Use the following API to get a succinct definition: https://www.latin-is-simple.com/api/

"intern_type" can have the following values:
- "dempron" for demonstrative pronouns
- 

In [97]:
import requests

# Formats the header of the word
def formatHeader(header: str, POS: str) -> str:

    # Formatting for verbs
    if POS=="verb":
        principalParts = header.split(",")
        header = ",".join(principalParts[:1]+principalParts[2:])
        return header

    # Formatting for nouns
    elif POS=="noun":
        # decParadigm = header[-1]
        # Can use the paradigm to further format the header
        # E.g. head = formatNoun(header, decParadigm)
        header = re.sub(r"[\[\]]", "", header)[:-2]
        return header
    #elif POS=="adverb":
    #elif POS=="adjective":
    #elif POS=="dempron"
    else:
        return header

def getDefinition(word: str, POS: str) -> str:
    word, POS = word.lower(), POS.lower()

    # Let requests build and URL-encode the query string
    apiURL = "https://www.latin-is-simple.com/api/vocabulary/search/"
    r = requests.get(apiURL, params={"query": word, "forms_only": "true"})
    entries = r.json()  # parse the response once

    definition = ""

    # Only one result: use it regardless of POS
    if len(entries) == 1:
        entry = entries[0]
        header = formatHeader(entry['full_name'], entry["intern_type"])
        body = entry["translations_unstructured"]["en"]
        definition = header + " : " + body

    # Zero or multiple results (get the first entry that matches POS)
    else:
        for entry in entries:
            if entry["intern_type"] == POS:
                header = formatHeader(entry['full_name'], POS)
                body = entry["translations_unstructured"]["en"]
                definition = header + " : " + body
                break

    return definition

# Example
print(getDefinition('fabrica', 'noun'))

fabrico, fabricare A, fabricavi, fabricatum : build/construct/fashion/forge/shape, train, get ready (meal), invent/devise
fabrica, fabricae f. : (1.) craft, art, craft of metalwork/building, construction/building/making (2.) smith's shop, workshop
architectus, architecti m. : architect, master-builder, inventor, designer, maker, author, deviser
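
The selection rule used by `getDefinition` (take the only entry, otherwise the first entry whose `intern_type` matches the POS) can be tried offline. In the sketch below, `pick_entry` is a hypothetical helper and the payload is illustrative: the field names follow the code above, while the values are abridged from the printed output and shown already formatted, so the raw API strings may look different.

```python
from typing import Optional

def pick_entry(entries: list, pos: str) -> Optional[dict]:
    """Return the only entry, or the first one whose intern_type equals pos."""
    if len(entries) == 1:
        return entries[0]
    return next((e for e in entries if e["intern_type"] == pos), None)

# Illustrative payload, not a real API response
entries = [
    {
        "full_name": "fabrico, fabricare A, fabricavi, fabricatum",
        "intern_type": "verb",
        "translations_unstructured": {"en": "build/construct/fashion/forge/shape"},
    },
    {
        "full_name": "fabrica, fabricae f.",
        "intern_type": "noun",
        "translations_unstructured": {"en": "craft, art"},
    },
]

entry = pick_entry(entries, "noun")
print(entry["full_name"], ":", entry["translations_unstructured"]["en"])
```

Returning `None` when nothing matches makes the "not found" case explicit for the caller.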


## Use case - generate lexical list

Source text: Caesar's Gallic Wars.

In [147]:
with open("./Corpus/gall1.txt", "r") as f:
    x = f.read()

# Get the first paragraph 
l = re.search(r"\[ 1 \] (.*)", x)
x = cleanDoc(drop_latin_punctuation(l[1]), convertLower=True)

In [149]:
gallicWars1 = cltk_nlp.analyze(text=x)

In [155]:
def generateLex(WordList):
    s = set()
    for word in WordList:
        if not word.stop:
            result = getDefinition(word.string, str(word.pos))
            # Need to alert if word wasn't found
            if result:
                s.add(result)
            else:
                print(word.string + " not found, POS : " + str(word.pos))
    return s

In [156]:
words = generateLex(gallicWars1)

omnis not found
divisa not found
tres not found
aliam not found
nostra not found
omnes not found
omnium not found
quod not found
saepe not found
quoque not found
quod not found
fere not found
belgae not found
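
Several of the misses above are inflected forms (for example `divisa`, `aliam`, `nostra`), so one option worth trying is to retry the lookup with the lemma when the surface form returns nothing. The sketch below is hypothetical: `lookup` stands for any function with the same signature as `getDefinition`, and the stand-in word and dictionary exist only so the snippet runs without CLTK or network access. Whether the API actually resolves these lemmas has not been checked here.

```python
from types import SimpleNamespace
from typing import Callable

def define_with_fallback(word, lookup: Callable[[str, str], str]) -> str:
    """Look up the surface form first, then fall back to the lemma."""
    pos = str(word.pos)
    result = lookup(word.string, pos)
    if not result and word.lemma and word.lemma != word.string:
        result = lookup(word.lemma, pos)
    return result

# Stand-in dictionary with a placeholder entry (not real API data)
fake_dictionary = {("divido", "verb"): "<entry for divido>"}

def fake_lookup(word: str, pos: str) -> str:
    return fake_dictionary.get((word, pos), "")

# Stand-in for a CLTK word object exposing .string, .lemma and .pos
word = SimpleNamespace(string="divisa", lemma="divido", pos="verb")
print(define_with_fallback(word, fake_lookup))
```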


In [163]:
m = sorted(words)
with open("lexical_list", "w", encoding="utf-8") as f:
    f.write("\n\n".join(m))