## Latin pedagogy tool 

We make use of the CLTK library, an NLP library for classical languages.

**Introduction** : https://aclanthology.org/2021.acl-demo.3.pdf

**Documentation**
* API : https://docs.cltk.org/en/latest/index.html
* Demos : https://github.com/cltk/cltk/tree/master/notebooks

In [13]:
text = """Architecti est scientia pluribus disciplinis et variis eruditionibus ornata, quae ab ceteris artibus perficiuntur.
Opera ea nascitur et fabrica et ratiocinatione."""

In [11]:
#corpus = get_corpus_reader(corpus_name='latin_text_perseus', language='latin')
from cltk.data.fetch import FetchCorpus
corpus_downloader = FetchCorpus(language="lat")
corpus_downloader.import_corpus('lat_text_perseus')

## 1. Preprocessing

Text preprocessing is an important step for NLP tasks. The processing performed should be tailored around the idiosyncrasies of the text and objective of the task.

#### Processing of non-informative attributes

In [26]:
from cltk.alphabet.lat import normalize_lat
from cltk.alphabet.lat import swallow_angle_brackets
from cltk.alphabet.lat import swallow_square_brackets
from cltk.alphabet.lat import swallow_braces
import re

# Removes every character that is neither a word character (letter, digit, underscore) nor whitespace
def removePunc(text):
    return re.sub(r'[^\w\s]', '', text)

# Removes meta-info (e.g. [5], [c 1Kb], {PRO.})
def removeMeta(text):

    text = swallow_angle_brackets(text)
    text = swallow_square_brackets(text)
    text = swallow_braces(text)

    return text

# Normalise a text based on Latin unique features
def normalizeLatText(text):

    text = normalize_lat(
        text,
        drop_accents=True,          # á -> a
        drop_macrons=True,          # ā -> a
        jv_replacement=True,        # verus -> uerus
        ligature_replacement=True)  # æ -> ae, œ -> oe
    
    return text

# Define final function to process text how we want to
# This should be changed according to the document
def cleanDoc(text, convertLower=False):
    # text = normalizeLatText(text)
    text = removePunc(removeMeta(text))
    text = re.sub(r" {2,}", " ", text)  # collapse any run of spaces into one
    return text.lower() if convertLower else text


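The order of removing punctuation and meta-information is important! If punctuation is stripped first, the brackets disappear and their content can no longer be recognised as meta-information.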

In [27]:
t = "[33] Quid est enim ..."
print(f"Removing punctuation first : '{removeMeta(removePunc(t))}'")
print(f"Removing meta-information first : '{removePunc(removeMeta(t))}'")

Removing punctuation first : '33 Quid est enim'
Removing meta-information first : 'Quid est enim '
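
The effect can be reproduced without CLTK. The sketch below uses a plain regular expression as a stand-in for `swallow_square_brackets` (an assumption for illustration only; the CLTK function may treat edge cases differently) to show why the brackets must still be present when the meta-information is removed. The helper names are hypothetical.

```python
import re

def strip_square_brackets(text: str) -> str:
    # Hypothetical stand-in for CLTK's swallow_square_brackets:
    # drops "[...]" together with any whitespace that follows it.
    return re.sub(r"\[[^\]]*\]\s*", "", text)

def strip_punctuation(text: str) -> str:
    # Same pattern as removePunc above.
    return re.sub(r"[^\w\s]", "", text)

sample = "[33] Quid est enim ..."

# Punctuation first: the brackets are already gone, so "33" survives.
punct_first = strip_square_brackets(strip_punctuation(sample))
# Meta-information first: "[33]" is removed as a unit.
meta_first = strip_punctuation(strip_square_brackets(sample))

print(repr(punct_first))
print(repr(meta_first))
```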


#### Macronizer

Vowel lengths (indicated by macrons) are quite important in Latin, especially with regard to poetry scansion. However, many publishers persist in omitting them, whether out of ignorance or tradition. The following is one method of macronizing a given text, though its performance is not ideal. A better alternative (from experience) is available at https://github.com/Alatius/latin-macronizer.

In [33]:
from cltk.prosody.lat.macronizer import Macronizer

def macronizer(text: str) -> str:
    # Local name differs from the function name to avoid shadowing it
    tagger = Macronizer("tag_tnt")
    return tagger.macronize_text(text)

tēxt = macronizer(text)
print(f"Example : {text[:32]} -> {tēxt[:32]}")

Example : Architecti est scientia pluribus -> architectī est scientiā plūribus


#### Lemmatizer

Lemmatisation reduces an inflected form of a word to its lemma (dictionary form). This is especially important for Latin, which is far more heavily inflected than most modern Indo-European languages.


In [29]:
from cltk.lemmatize.lat import LatinBackoffLemmatizer
from cltk.alphabet.lat import drop_latin_punctuation

# Returns the lemma of each token (the lemmatizer itself yields (original, root) tuples)
# Requires lower-case, non-macron inputs
def lemmatize(tokens: list)-> list:
    lemmatizer = LatinBackoffLemmatizer()
    tokens = lemmatizer.lemmatize(tokens)
    return [root for _, root in tokens]

tokens = ["filias", "pueri", "cecini", "variis"]
print(f"Example : {tokens} -> {lemmatize(tokens)}")

Example : ['filias', 'pueri', 'cecini', 'variis'] -> ['filia', 'puer', 'cano', 'varius1']
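
Note the trailing digit in `varius1`. It appears to be a homograph index carried over from the lemmatizer's underlying lexicon (an assumption based on the output above, not on the CLTK documentation). If plain dictionary forms are needed, for example for a dictionary lookup, a minimal sketch for stripping it could look like this; `strip_homograph_index` is a hypothetical helper name.

```python
import re

def strip_homograph_index(lemma: str) -> str:
    # Remove trailing digits only, e.g. 'varius1' -> 'varius'
    return re.sub(r"\d+$", "", lemma)

lemmas = ["filia", "puer", "cano", "varius1"]
clean_lemmas = [strip_homograph_index(lemma) for lemma in lemmas]
print(clean_lemmas)
```

Stripping the index discards the distinction between homographs, so it is best done only at the point where the lemma is displayed or looked up.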


#### Decliner

This is the opposite of lemmatisation: it generates all the possible inflections of a given lemma. The CLTK decliner returns a collection of tuples, each pairing an inflected form with its declension encoding (https://github.com/cltk/latin_treebank_perseus#readme).

E.g. --s----n- => singular nominative
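
As a rough illustration, such a tag can be decoded by position. The sketch below reads only the two positions used in this notebook (index 2 for number, index 7 for case) and only the five case letters that appear in the declension table code; the `p` code for plural is an assumption made by analogy with `s`. `decode_tag` is a hypothetical helper, not part of CLTK.

```python
NUMBERS = {"s": "singular", "p": "plural"}  # 'p' is assumed by analogy with 's'
CASES = {"n": "nominative", "g": "genitive", "d": "dative",
         "a": "accusative", "b": "ablative"}

def decode_tag(code: str) -> tuple:
    """Return (number, case); a slot that is blank or unknown decodes to None."""
    if len(code) != 9:
        raise ValueError(f"expected a 9-character tag, got {code!r}")
    return NUMBERS.get(code[2]), CASES.get(code[7])

print(decode_tag("--s----n-"))
```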

It can be used to construct declension tables as below.

In [36]:
from cltk.morphology.lat import CollatinusDecliner
from collections import OrderedDict
from tabulate import tabulate

words = ['leo', 'via']
def declensions(rootWords: list)-> dict:
    dec, decliner = {}, CollatinusDecliner()
    for word in rootWords:
        # Expect root words only
        try:
            dec[word] = decliner.decline(word)
        except Exception:
            print('Not a root word')
    return dec

# Usage example : declension table (only nouns for now)
def printDecTable(lemma):

    rows = []
    cases = {'n':'Nominative',
             'g':'Genitive',
             'd':'Dative',
             'a':'Accusative',
             'b':'Ablative'}

    d = OrderedDict({c: {} for c in cases.keys()})
    declens = CollatinusDecliner().decline(lemma)

    for dec, code in declens:
        number, case = code[2], code[7]
        if case in d: d[case][number] = macronizer(dec).lower()

    for key, val in d.items():
        row =[cases[key]]+list(val.values())
        rows.append(row)

    print(tabulate(rows, headers=['Case', 'Singular', 'Plural']))

decs = declensions(words)
printDecTable("leo")

Case        Singular    Plural
----------  ----------  --------
Nominative  leō         leōnēs
Genitive    leōnis      leōnum
Dative      leōnī       leōnibus
Accusative  leōnem      leōnēs
Ablative    leōne       leōnibus


#### Tokenizer

Tokenisation is the task of chopping a text up into pieces. The most common units of tokenisation are sentences and words.

In [46]:
from cltk.sentence.lat import LatinPunktSentenceTokenizer
from cltk.tokenizers.lat.lat import LatinWordTokenizer
import pprint

# Sentence tokenizer (punct=True strips punctuation and lower-cases each sentence)
def sent_Tokenize(doc: str, punct=True) -> list:
    sent_tokenize = LatinPunktSentenceTokenizer()
    sentences = sent_tokenize.tokenize(doc)
    return [removePunc(s).lower() for s in sentences] if punct else sentences

# Word tokenizer
def word_Tokenizer(sent: str) -> list:
    word_tokenize = LatinWordTokenizer()
    tokens = word_tokenize.tokenize(sent)
    return tokens

sentences = sent_Tokenize(text)
tokens = word_Tokenizer(sentences[0])

pprint.pprint(sentences[:1])
pprint.pprint(tokens[:5])

['architecti est scientia pluribus disciplinis et variis eruditionibus ornata '
 'quae ab ceteris artibus perficiuntur']
['architecti', 'est', 'scientia', 'pluribus', 'disciplinis']


## 2. NLP pipeline

The CLTK library also has a pre-configured NLP pipeline for Latin. It automatically tokenises the text, stores linguistic information (such as gender and case) on each `Word` object, and creates embeddings that can be used for machine learning applications.

In [47]:
from cltk import NLP
cltk_nlp = NLP(language="lat")

text = cleanDoc(text, convertLower=True)  # cleanDoc already removes punctuation
cltk_doc = cltk_nlp.analyze(text=text)

‎𐤀 CLTK version '1.1.6'.
Pipeline for language 'Latin' (ISO: 'lat'): `LatinNormalizeProcess`, `LatinStanzaProcess`, `LatinEmbeddingsProcess`, `StopsProcess`, `LatinLexiconProcess`.


In [81]:
# Attributes for the token 'scientia'
tmp_word = cltk_doc.words[2]
tmp_word.__dict__

{'index_char_start': None,
 'index_char_stop': None,
 'index_token': 2,
 'index_sentence': 0,
 'string': 'scientia',
 'pos': noun,
 'lemma': 'scientia',
 'stem': None,
 'scansion': None,
 'xpos': 'A1|grn1|casA|gen2|vgr1',
 'upos': 'NOUN',
 'dependency_relation': 'nsubj',
 'governor': 0,
 'features': {Case: [nominative], Gender: [feminine], Number: [singular]},
 'category': {F: [neg], N: [pos], V: [neg]},
 'embedding': array([-2.8462e-01,  6.4238e-01, -4.0037e-01,  3.9382e-01,  6.0418e-02,
         2.7501e-01,  3.1526e-01,  2.9083e-01, -1.4485e-02,  1.7901e-01,
         2.2285e-01,  6.7856e-01, -1.6518e-01, -9.1198e-02,  2.8839e-01,
         3.6772e-01, -1.6601e-01, -4.8859e-01,  4.9720e-02,  2.7487e-01,
         9.4751e-02, -1.2327e-01,  1.5279e-01, -1.9930e-01,  5.3575e-01,
        -3.5485e-02,  5.1619e-01,  1.4068e-01,  3.3552e-01, -2.2774e-01,
        -8.7011e-01, -5.3603e-01,  5.3102e-01, -3.0086e-01, -1.0687e-01,
        -7.3727e-01, -1.0646e-01, -7.2034e-01, -1.4764e-01, -1.2940e
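
One common way to use such embeddings is to compare two words by the cosine similarity of their vectors. The sketch below is library-independent and uses short made-up vectors rather than real CLTK output; in practice the inputs would be the `embedding` attributes of two `Word` objects. `cosine_similarity` is a hypothetical helper written for this illustration.

```python
import math

def cosine_similarity(u, v) -> float:
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy vectors, invented for illustration only
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```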

## 3. Definitions

Although each `Word` object has a `definition` attribute, its formatting is a mess (one giant string).

In [88]:
print(f"Definition of {tmp_word.string} :")
tmp_word.definition[:100] + " ..."

Definition of scientia :


'scientia\n\n\n ae, \nf\n\nsciens, \na knowing, knowledge, intelligence, science\n: nullam rem quae huius vir ...'
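
A quick way to make such a string readable is to collapse every run of whitespace into a single space. This is only a sketch of one possible clean-up, applied to the truncated example shown above; it does not try to separate the headword from the senses.

```python
raw = ('scientia\n\n\n ae, \nf\n\nsciens, \na knowing, knowledge, '
       'intelligence, science\n: nullam rem quae huius vir')

# str.split() with no argument splits on any run of whitespace
compact = " ".join(raw.split())
print(compact)
```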

We can use <a href='https://www.latin-is-simple.com/api/'>the Latin is Simple API</a> to get a compact definition.

In [60]:
import requests

# Formats the header of the word
def formatHeader(header: str, POS: str) -> str:

    # Formatting for verbs
    if POS=="verb":
        principalParts = header.split(",")
        header = ",".join(principalParts[:1]+principalParts[2:])
        return header

    # Formatting for nouns
    elif POS=="noun":
        # decParadigm = header[-1]
        # Can use the paradigm to further format the header
        # E.g. head = formatNoun(header, decParadigm)
        header = re.sub(r"[\[\]]", "", header)[:-2]
        return header
    #elif POS=="adverb":
    #elif POS=="adjective":
    #elif POS=="dempron"
    else:
        return header

def getDefinition(word: str, POS: str) -> str:
    word, POS = word.lower(), POS.lower()

    apiURL = "https://www.latin-is-simple.com/api/vocabulary/search/"
    # Let requests build and URL-encode the query string
    r = requests.get(apiURL, params={"query": word, "forms_only": "true"}, timeout=10)
    r.raise_for_status()
    entries = r.json()  # parse the response once

    definition = ""

    # Only one result for the query
    if len(entries)==1:
        entry = entries[0]
        header = formatHeader(entry['full_name'], entry["intern_type"])
        body = entry["translations_unstructured"]["en"]
        definition = header + " : " + body

    # Multiple results (get the first entry that matches POS)
    else:
        for entry in entries:
            if entry["intern_type"]==POS:
                header = formatHeader(entry['full_name'], POS)
                body = entry["translations_unstructured"]["en"]
                definition = header + " : " + body
                break

    # Empty string if nothing matched
    return definition

# Example
print(getDefinition('fabrica', 'noun'))

fabrica, fabricae f. : (1.) craft, art, craft of metalwork/building, construction/building/making (2.) smith's shop, workshop


### Generate vocabulary aid

I frequently generate rudimentary word lists (with declension information) to pre-study vocabulary or to consult as I read. The idea is to automate the creation of a Latin reader in <a href='https://geoffreysteadman.files.wordpress.com/2019/05/ritchie.may2019.pdf'>this format</a>.
It is most useful for beginners, who may have trouble identifying declension forms. The example below builds a vocabulary aid from Caesar's Gallic War. More texts are available in the <a href='https://github.com/cltk/lat_text_latin_library'>Latin Library</a> corpus.


In [59]:
with open("./Corpus/gall1.txt", "r", encoding="utf-8") as f:
    x = f.read()

# Get the first paragraph (the text that follows the "[ 1 ]" marker)
match = re.search(r"\[ 1 \] (.*)", x)
x = cleanDoc(drop_latin_punctuation(match[1]), convertLower=True)

print(f"DE BELLO GALLICO : {x[:59]} ...")

DE BELLO GALLICO : gallia est omnis divisa in partes tres quarum unam incolunt ...


In [53]:
gallicWars1 = cltk_nlp.analyze(text=x)

In [31]:
def generateLex(WordList):
    s = set()
    for word in WordList:
        if not word.stop:
            result = getDefinition(word.string, str(word.pos))
            # Alert if word wasn't found
            if result:
                s.add(result)
            else:
                print(word.string + " not found, POS : " + str(word.pos))
    return s

words = generateLex(gallicWars1.words)

omnis not found, POS : determiner
divisa not found, POS : adjective
tres not found, POS : numeral
aliam not found, POS : determiner
nostra not found, POS : determiner
omnes not found, POS : determiner
omnium not found, POS : pronoun
quod not found, POS : subordinating_conjunction
saepe not found, POS : adverb
quoque not found, POS : adverb
quod not found, POS : subordinating_conjunction
fere not found, POS : adposition
belgae not found, POS : adjective


In [25]:
m = sorted(words)
pprint.pprint(m[:10])
with open("lexical_list", "w", encoding="utf-8") as f:
    f.write("\n\n".join(m))

['Aquitania, Aquitaniae f. : Aquitania, one of the divisions of Gaul/France '
 '(southwest)',
 'Aquitanus/Aquitana/Aquitanum, AO : of Aquitania  (southwest Gaul/France)',
 'Belga, Belgae m. : Belgae (pl.)',
 'Celtus/Celta/Celtum, AO : Celts',
 'Gallia, Galliae f. : Gaul',
 'Gallus, Galli m. : Gaul, the Gauls (pl.), cock, rooster',
 'Garumna, Garumnae f. : Garonna',
 'Helvetia, Helvetiae f. : Switzerland',
 'Helvetius, Helvetii m. : Helvetii (pl.), tribe in Central Gaul (Switzerland)',
 'Hispania, Hispaniae f. : Hispania, Spain']
