## Latin pedagogy tool 

We use the CLTK library, an NLP library for classical languages.

**Introduction** : https://aclanthology.org/2021.acl-demo.3.pdf

**Documentation**
* API : https://docs.cltk.org/en/latest/index.html
* Demos : https://github.com/cltk/cltk/tree/master/notebooks

In [1]:
text = """
Architecti est scientia pluribus disciplinis et variis eruditionibus ornata, quae ab ceteris artibus perficiuntur.
Opera ea nascitur et fabrica et ratiocinatione.
"""

In [5]:
#corpus = get_corpus_reader(corpus_name='latin_text_perseus', language='latin')
from cltk.data.fetch import FetchCorpus
corpus_downloader = FetchCorpus(language="lat")
corpus_downloader.import_corpus('lat_text_perseus')

Downloaded 100% 112.75 MiB | 1.99 MiB/s 

## Preprocessing

In [43]:
import re

# For imported text
# Remove bracketed or parenthesised metainfo like [c 1Kb]
def cleanDoc(text, convertLower=False):
    cleaned = re.sub(r"[\(\[].*?[\)\]]", "", text)
    # Collapse any run of spaces left behind by the removal
    cleaned = re.sub(r" {2,}", " ", cleaned)
    return cleaned.lower() if convertLower else cleaned
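
To see what the removal pattern actually catches, here is a small sketch run on made-up strings (the sample inputs are assumptions, not taken from the Perseus corpus). It also shows two limits of the pattern: it does not cross line breaks, and it accepts mismatched delimiters.

```python
import re

# Same pattern as in cleanDoc: an opening ( or [, then as few characters
# as possible, then a closing ) or ].
BRACKETED = re.compile(r"[\(\[].*?[\)\]]")

sample = "Arma [c 1Kb] virumque cano (Aen. 1.1) Troiae"  # made-up input
removed = BRACKETED.findall(sample)
print(removed)  # ['[c 1Kb]', '(Aen. 1.1)']

# '.' does not match a newline, so a span broken across lines is left alone
multi_line = BRACKETED.findall("a [x\ny] b")
print(multi_line)  # []

# The pattern does not check that the delimiters pair up
mixed = BRACKETED.findall("a [x) b")
print(mixed)  # ['[x)']
```

If metainfo can span several lines in the imported text, compiling the pattern with `re.DOTALL` is one option to consider.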

## Decliner

In [42]:
from cltk.morphology.lat import CollatinusDecliner

words = ['leonis', 'via']  # alternative: ['via', 'arbor', 'leo']

def declensions(rootWords: list) -> dict:
    dec, decliner = {}, CollatinusDecliner()
    for word in rootWords:
        # Expects root words (lemmas) only; an inflected form such as 'leonis' fails
        try:
            dec[word] = decliner.decline(word)
        except Exception:
            print('Not a root word')
    return dec

# Only for nouns for now
#def printDecTable()
decs = declensions(words)
decs

Not a root word


{'via': [('via', '--s----n-'),
  ('via', '--s----v-'),
  ('viam', '--s----a-'),
  ('viae', '--s----g-'),
  ('viae', '--s----d-'),
  ('via', '--s----b-'),
  ('viae', '--p----n-'),
  ('viae', '--p----v-'),
  ('vias', '--p----a-'),
  ('viarum', '--p----g-'),
  ('viis', '--p----d-'),
  ('viis', '--p----b-')]}
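
A possible shape for the planned table printer, as a rough sketch for nouns only. It assumes, judging from the output above, that character 3 of each tag holds the number (`s`/`p`) and character 8 the case (`n`, `v`, `a`, `g`, `d`, `b`); the function names `declension_table` and `print_declension_table` are hypothetical.

```python
# Assumed meaning of the tag letters, inferred from the decliner output
CASES = {"n": "nominative", "v": "vocative", "a": "accusative",
         "g": "genitive", "d": "dative", "b": "ablative"}
NUMBERS = {"s": "singular", "p": "plural"}

def declension_table(forms):
    """Arrange (form, tag) pairs into rows of (case, singular, plural)."""
    cells = {}
    for form, tag in forms:
        number, case = tag[2], tag[7]
        if number in NUMBERS and case in CASES:
            cells.setdefault((case, number), []).append(form)
    rows = []
    for code, name in CASES.items():
        # Several forms in one cell are joined with '/'; '-' marks a gap
        singular = "/".join(cells.get((code, "s"), ["-"]))
        plural = "/".join(cells.get((code, "p"), ["-"]))
        rows.append((name, singular, plural))
    return rows

def print_declension_table(forms):
    for name, singular, plural in declension_table(forms):
        print(f"{name:<12}{singular:<10}{plural}")

# The 'via' entry copied from the decliner output
via = [('via', '--s----n-'), ('via', '--s----v-'), ('viam', '--s----a-'),
       ('viae', '--s----g-'), ('viae', '--s----d-'), ('via', '--s----b-'),
       ('viae', '--p----n-'), ('viae', '--p----v-'), ('vias', '--p----a-'),
       ('viarum', '--p----g-'), ('viis', '--p----d-'), ('viis', '--p----b-')]
table = declension_table(via)
print_declension_table(via)
```

Adjectives and pronouns would need the gender position as well, so this layout only fits noun output.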

## Lemmatizer

In [26]:
from cltk.lemmatize.lat import LatinBackoffLemmatizer

# The lemmatizer yields (token, lemma) tuples; only the lemmas are returned here
# Requires lower-case inputs without macrons
def lemmatize(tokens: list) -> list:
    lemmatizer = LatinBackoffLemmatizer()
    pairs = lemmatizer.lemmatize(tokens)
    return [lemma for _, lemma in pairs]

tokens = ["filias", "pueri", "cano"]

lem = lemmatize(tokens)
print(lem)


['filia', 'puer', 'cano']
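
Since the lemmatizer wants lower-case input without macrons, tokens from macronized text need a normalization pass first. A minimal sketch using only the standard library follows; the helper name `normalize_token` is hypothetical, and note that it strips every combining mark (breves and diaereses too), not just macrons.

```python
import unicodedata

def normalize_token(token: str) -> str:
    """Lower-case a token and drop combining marks such as macrons."""
    # NFD splits 'ī' into 'i' plus a combining macron (U+0304)
    decomposed = unicodedata.normalize("NFD", token.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

normalized = [normalize_token(t) for t in ["Fīliās", "puerī", "canō"]]
print(normalized)  # ['filias', 'pueri', 'cano']
```

The resulting list can then be passed to `lemmatize` as in the cell above.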


## Macronizer

In [7]:
from cltk.prosody.lat.macronizer import Macronizer

# NOTE: the macronizer's accuracy is subpar
def macronizer(text: str) -> str:
    latin_macronizer = Macronizer("tag_tnt")
    # Macronize the text that was passed in rather than a hard-coded sentence
    return latin_macronizer.macronize_text(text)

macronized = macronizer("Soles occidere et redire possunt")
print(macronized)

solēs occīdere et redīre possunt


## Tokenizer

In [24]:
from cltk.sentence.lat import LatinPunktSentenceTokenizer
from cltk.alphabet.text_normalization import remove_non_latin
from cltk.tokenizers.lat.lat import LatinWordTokenizer

# Sentence tokenizer
# NOTE: with punct=True each sentence is stripped of non-Latin characters
# (punctuation included) and lower-cased; with punct=False it is returned as-is
def sentTokenize(doc: str, punct=True) -> list:
    sent_tokenize = LatinPunktSentenceTokenizer()
    sentences = sent_tokenize.tokenize(doc)
    return [remove_non_latin(s).lower() for s in sentences] if punct else sentences

# Word tokenizer
def word_Tokenizer(sent: str) -> list:
    word_tokenize = LatinWordTokenizer()
    return word_tokenize.tokenize(sent)

sentences = sentTokenize(text)
tokens = word_Tokenizer(sentences[0])
tokens

['architecti',
 'est',
 'scientia',
 'pluribus',
 'disciplinis',
 'et',
 'variis',
 'eruditionibus',
 'ornata',
 'quae',
 'ab',
 'ceteris',
 'artibus',
 'perficiuntur']

In [25]:
root_words=lemmatize(tokens)
root_words

['architectus',
 'sum',
 'scientia',
 'multus',
 'disciplina',
 'et',
 'varius1',
 'eruditio',
 'orno',
 'qui',
 'ab',
 'ceterus',
 'ars',
 'perficio']
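
One lemma above comes back as `varius1`. The trailing digit looks like a homograph index carried over from the lemmatizer's lexicon (an assumption, not verified here). If plain headwords are wanted for a vocabulary list, a sketch like the following strips it; the helper name `strip_homograph_index` is hypothetical.

```python
import re

def strip_homograph_index(lemma: str) -> str:
    """Remove trailing digits from a lemma, e.g. 'varius1' -> 'varius'."""
    return re.sub(r"\d+$", "", lemma)

# A few lemmas copied from the output above
lemmas = ['architectus', 'sum', 'multus', 'varius1', 'eruditio']
headwords = [strip_homograph_index(lemma) for lemma in lemmas]
print(headwords)  # ['architectus', 'sum', 'multus', 'varius', 'eruditio']
```

Dropping the index merges distinct homographs, so keep the raw lemma if it will be used for dictionary lookups.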

In [2]:
# NLP pipeline - not working as of now
#from cltk import NLP
#cltk_nlp = NLP(language="lat")
# TROUBLE LINE cltk_doc = cltk_nlp.analyze(text=text)

‎𐤀 CLTK version '1.1.6'.
Pipeline for language 'Latin' (ISO: 'lat'): `LatinNormalizeProcess`, `LatinStanzaProcess`, `LatinEmbeddingsProcess`, `StopsProcess`, `LatinLexiconProcess`.
