# Lemmatizing Latin with a Seq2Seq Transformer (`mschonhardt/latin-lemmatizer`)

This notebook shows how to use the Hugging Face Transformers pipeline with the lemmatization model:

- Lemmatizer: `mschonhardt/latin-lemmatizer`

The model is a `text2text-generation` (Seq2Seq) model that takes Latin text as input and returns a lemmatized version as generated text.
The model is available on [Hugging Face](https://huggingface.co/mschonhardt/latin-lemmatizer) and [Zenodo](https://doi.org/10.5281/zenodo.18632650).


![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18632650.svg)




In [None]:
# If needed, install dependencies
# !pip install -U transformers torch sentencepiece accelerate pandas tqdm


## 1) Setup
Import the libraries and select the compute device (GPU if available, otherwise CPU).


In [None]:
import torch
import pandas as pd
from tqdm.auto import tqdm
from transformers import pipeline

# Device selection for transformers pipeline:
# device=0 uses first CUDA GPU; device=-1 uses CPU.
device = 0 if torch.cuda.is_available() else -1
device_name = torch.cuda.get_device_name(0) if device == 0 else "CPU"
print(f"Using device: {device_name}")


## 2) Load the lemmatizer pipeline
We use `text2text-generation` because the model generates the lemmatized text as output.


In [None]:
MODEL_LEMMATIZER = "mschonhardt/latin-lemmatizer"

print("Loading lemmatizer pipeline...")
lemmatizer_pipe = pipeline(
    task="text2text-generation",
    model=MODEL_LEMMATIZER,
    device=device,
)
print("Loaded:", MODEL_LEMMATIZER)


## 3) Lemmatize a single text
The pipeline returns a list of dicts, one per input; the lemmatized string is stored under the `generated_text` key.
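
To make the return shape concrete without running the model, the sketch below uses a hand-written stand-in result (`mock_result`, not actual model output) and a hypothetical helper, `align_tokens`, that pairs each input token with a lemma. It assumes the model emits exactly one whitespace-separated lemma per whitespace-separated input token; that is an assumption to verify against real output, so the helper returns `None` when the counts differ.

```python
# Hand-written stand-in for a pipeline result; NOT actual model output.
mock_result = [{"generated_text": "in nomen sanctus et individuus trinitas ."}]


def align_tokens(text, lemmatized):
    """Pair whitespace tokens with lemmas; return None if the counts differ."""
    tokens = text.split()
    lemmas = lemmatized.split()
    if len(tokens) != len(lemmas):
        # The one-lemma-per-token assumption does not hold for this input.
        return None
    return list(zip(tokens, lemmas))


example_text = "In nomine sanctae et individuae trinitatis ."
pairs = align_tokens(example_text, mock_result[0]["generated_text"])
for token, lemma in pairs:
    print(f"{token}\t{lemma}")
```

If `align_tokens` returns `None` on real output, inspect that text by hand rather than forcing an alignment.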


In [None]:
example_text = "In nomine sanctae et individuae trinitatis ."

result = lemmatizer_pipe(
    example_text,
    max_length=512,
    truncation=True,
)

lemmatized = result[0]["generated_text"]
print("INPUT :", example_text)
print("OUTPUT:", lemmatized)


## 4) Lemmatize multiple texts efficiently (batching)
For larger corpora, batch the inputs. Adjust `batch_size` to fit the available GPU memory (VRAM), and lower it if you run into out-of-memory errors.
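
When text lengths vary widely, one common way to reduce padding inside each batch is to group texts of similar length and restore the original order afterwards. The sketch below illustrates that idea only; `batched_in_length_order` and `fake_lemmatize` are hypothetical names, and `fake_lemmatize` is a stand-in that lowercases its input instead of calling the model. Character length is used as a rough proxy for token count, which is an assumption.

```python
def batched_in_length_order(texts, batch_size, process_batch):
    """Process texts in length-sorted chunks and return results in input order."""
    # Indices of the texts, shortest first (character length as a rough proxy).
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    results = [None] * len(texts)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        outputs = process_batch([texts[i] for i in idx])
        # Write each output back to the position of its original text.
        for i, out in zip(idx, outputs):
            results[i] = out
    return results


def fake_lemmatize(batch):
    # Hypothetical stand-in for the model call; it only lowercases.
    return [t.lower() for t in batch]


demo = [
    "Arma virumque cano",
    "Si quis",
    "In nomine sanctae et individuae trinitatis",
]
restored = batched_in_length_order(demo, batch_size=2, process_batch=fake_lemmatize)
print(restored)
```

In practice, `process_batch` would wrap the pipeline call and return the `generated_text` strings for the batch.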


In [None]:
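def lemmatize_texts(texts, batch_size=32, max_length=512):
    """Lemmatize a list of texts in chunks of `batch_size`, preserving input order."""
    out = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Lemmatizing"):
        batch = texts[i:i + batch_size]
        results = lemmatizer_pipe(
            batch,
            max_length=max_length,
            truncation=True,
            # Pass batch_size explicitly: by default the pipeline runs
            # list inputs one item at a time.
            batch_size=batch_size,
        )
        out.extend(r["generated_text"] for r in results)
    return out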

texts = [
    "In nomine sanctae et individuae trinitatis .",
    "Quod infames uocentur qui ex consanguineis nascuntur .",
    "Si quis clericus furtum fecerit , deponatur .",
    "Arma virumque cano , Troiae qui primus ab oris .",
]

lemmatized_texts = lemmatize_texts(texts, batch_size=16, max_length=512)

df = pd.DataFrame({
    "text": texts,
    "lemmatized_text": lemmatized_texts,
})
df


## 5) (Optional) Export results
Uncomment the lines below to save the results as a CSV file for later reuse.


In [None]:
# df.to_csv("latin_lemmatization_demo.csv", index=False)
# print("Saved latin_lemmatization_demo.csv")
