# Lemmatizing Latin with Flair

This notebook uses the model `mschonhardt/latin-lemmatizer`.

**Important:** this is a **Flair** lemmatizer checkpoint (pickled `.pt`), not a 🤗 Transformers `text2text-generation` model. The intended usage is via `flair.models.Lemmatizer` and token labels of type `predicted`.

The model is available on [Hugging Face](https://huggingface.co/mschonhardt/latin-lemmatizer) and [Zenodo](https://doi.org/10.5281/zenodo.18632650).

![](https://zenodo.org/badge/DOI/10.5281/zenodo.18632650.svg)


In [None]:
# If needed (run once):
# !pip install -U flair huggingface_hub pandas tqdm


## 1) Setup
Imports and two small workarounds:

- **PyTorch ≥ 2.6** changed the default of `weights_only` in `torch.load` to `True`, which can break loading pickled Flair models unless we force `weights_only=False` (see section 2).
- Some GPU setups need the `lengths` passed to `pack_padded_sequence` to stay on the CPU, so we wrap the function to move them there.
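
To see whether the first workaround is relevant in a given environment, the installed PyTorch version can be inspected without importing `torch`. The following is a minimal sketch using only the standard library; the helper names `major_minor` and `needs_weights_only_override` are hypothetical and not part of Flair or PyTorch.

```python
import re
from importlib import metadata


def major_minor(version: str) -> tuple[int, int]:
    """Extract (major, minor) from version strings such as '2.6.0+cu124'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def needs_weights_only_override(torch_version: str) -> bool:
    """True for PyTorch 2.6 and later, where `weights_only` defaults to True."""
    return major_minor(torch_version) >= (2, 6)


try:
    installed = metadata.version("torch")
    print("torch", installed, "-> override needed:", needs_weights_only_override(installed))
except metadata.PackageNotFoundError:
    print("torch is not installed in this environment")
```

Comparing integer tuples rather than raw strings keeps the check correct for two-digit minor versions such as `2.10`.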


In [1]:
import torch
import torch.nn.utils.rnn as rnn

# Patch needed to run on GPU
if not getattr(rnn.pack_padded_sequence, "_cpu_lengths_patched", False):
    _orig_pack = rnn.pack_padded_sequence

    def pack_padded_sequence_cpu_lengths(input, lengths, *args, **kwargs):
        if isinstance(input, rnn.PackedSequence):
            return input
        # PyTorch requires CPU lengths if it's a tensor
        if torch.is_tensor(lengths):
            lengths = lengths.detach().cpu()
        return _orig_pack(input, lengths, *args, **kwargs)

    pack_padded_sequence_cpu_lengths._cpu_lengths_patched = True
    rnn.pack_padded_sequence = pack_padded_sequence_cpu_lengths


## 2) Load the lemmatizer
We download `best-model.pt` and load it with Flair.

Key point: during `Lemmatizer.load(...)` we temporarily patch `torch.load` to pass `weights_only=False`, so the pickled model object is reconstructed correctly (otherwise you often get only weights and end up with `O O O ...`).


In [2]:
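import torch
from huggingface_hub import hf_hub_download
from flair.models import Lemmatizer
from flair.data import Sentence
from flair.tokenization import SpaceTokenizer

print("Load model from Hugging Face Hub...")
model_file = hf_hub_download("mschonhardt/latin-lemmatizer", filename="best-model.pt")

# Temporarily force weights_only=False so the pickled Flair model loads on PyTorch >= 2.6.
_orig_torch_load = torch.load
torch.load = lambda *args, **kwargs: _orig_torch_load(*args, **{**kwargs, "weights_only": False})
try:
    lemmatizer = Lemmatizer.load(model_file)
finally:
    torch.load = _orig_torch_load  # always restore the original loader
print("Model loaded.")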


Load model from Hugging Face Hub...
Model loaded.


## 3) Lemmatize a single text


In [7]:
sent = Sentence(
    "Et videtur , quod sic , quia res empta de pecunia pupilli efficitur",
    use_tokenizer=SpaceTokenizer(),
)

lemmatizer.predict(sent)

for tok in sent:
    print(tok.text, "->", tok.get_label("predicted").value)

print("\nNote that no model is perfect, as the wrong lemmatization of 'empta' shows.")


Et -> et
videtur -> video
, -> ,
quod -> quod
sic -> sic
, -> ,
quia -> quia
res -> res
empta -> empta
de -> de
pecunia -> pecunia
pupilli -> pupillus
efficitur -> efficio

Note that no model is perfect, as the wrong lemmatization of 'empta' shows.
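
Errors like this one can be surfaced systematically by comparing predictions against a small hand-checked reference. The sketch below is only an illustration: `find_mismatches` and the `reference` dictionary are hypothetical names, and the reference lemmas are written by hand here (assuming `emo` as the expected lemma of the participle `empta`). The predicted pairs are copied from the output above.

```python
# Token -> predicted lemma pairs, copied from the output above.
predicted = [
    ("Et", "et"), ("videtur", "video"), ("quod", "quod"), ("res", "res"),
    ("empta", "empta"), ("pupilli", "pupillus"), ("efficitur", "efficio"),
]

# Hand-written reference lemmas (illustrative assumption, not shipped with the model).
reference = {
    "videtur": "video",
    "empta": "emo",
    "pupilli": "pupillus",
    "efficitur": "efficio",
}


def find_mismatches(pairs, reference):
    """Return (token, predicted, expected) for tokens whose lemma differs from the reference."""
    return [
        (token, lemma, reference[token])
        for token, lemma in pairs
        if token in reference and lemma != reference[token]
    ]


mismatches = find_mismatches(predicted, reference)
print(mismatches)  # [('empta', 'empta', 'emo')]
```

Tokens without a reference entry are skipped, so the reference can start small and grow as more forms are checked by hand.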


## 4) Lemmatize multiple texts (chunking)


In [5]:
import pandas as pd
from tqdm.auto import tqdm
from flair.data import Sentence
from flair.tokenization import SpaceTokenizer

def lemmatize_texts(texts, chunk_size=256, batch_size=32):
    out = []
    for i in tqdm(range(0, len(texts), chunk_size), desc="Lemmatizing"):
        chunk = texts[i:i + chunk_size]

        sentences = [
            Sentence(t, use_tokenizer=SpaceTokenizer())
            for t in chunk
        ]

        lemmatizer.predict(
            sentences,
            mini_batch_size=batch_size,
            embedding_storage_mode="none",
        )

        out.extend([
            " ".join(tok.get_label("predicted").value for tok in s)
            for s in sentences
        ])

    return out

texts = [
    "Et videtur , quod sic , quia res empta de pecunia pupilli efficitur",
    "In nomine sanctae et individuae trinitatis .",
    "Quod infames uocentur qui ex consanguineis nascuntur .",
    "Si quis clericus furtum fecerit , deponatur ."
]

lemmatized_texts = lemmatize_texts(texts, chunk_size=256, batch_size=16)
df = pd.DataFrame({"text": texts, "lemmatized_text": lemmatized_texts})
pd.set_option("display.max_colwidth", 300) 
df

| | text | lemmatized_text |
|---|---|---|
| 0 | Et videtur , quod sic , quia res empta de pecunia pupilli efficitur | et video , quod sic , quia res empta de pecunia pupillus efficio |
| 1 | In nomine sanctae et individuae trinitatis . | in nomen sanctus et individuus trinitas . |
| 2 | Quod infames uocentur qui ex consanguineis nascuntur . | quod infamis voco qui ex consanguineus nascor . |
| 3 | Si quis clericus furtum fecerit , deponatur . | si quis clericus furtum facio , depono . |
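
Because the input is tokenized on spaces and the lemmas are joined with spaces again, each lemmatized string can be realigned with its source text token by token. The following is a minimal sketch of that step; `align_tokens` is a hypothetical helper, and it assumes that tokens are separated by single spaces and that no lemma contains a space.

```python
def align_tokens(text: str, lemmatized: str) -> list[tuple[str, str]]:
    """Pair each space-separated token with the lemma at the same position."""
    # Drop empty strings so that repeated spaces do not produce empty tokens.
    tokens = [t for t in text.split(" ") if t]
    lemmas = [l for l in lemmatized.split(" ") if l]
    if len(tokens) != len(lemmas):
        raise ValueError(
            f"Cannot align {len(tokens)} tokens with {len(lemmas)} lemmas"
        )
    return list(zip(tokens, lemmas))


pairs = align_tokens(
    "Si quis clericus furtum fecerit , deponatur .",
    "si quis clericus furtum facio , depono .",
)
for token, lemma in pairs:
    print(token, "->", lemma)
```

The length check fails loudly rather than silently shifting lemmas onto the wrong tokens, which is useful when the assumption about spaces does not hold for a given text.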


## 5) (Optional) Export


In [None]:
# df.to_csv("latin_lemmatization_demo.csv", index=False)
# print("Saved latin_lemmatization_demo.csv")
