# Latin Abbreviation Expansion

This notebook demonstrates how to use the ByT5 model `mschonhardt/latin-normalizer`.
It normalises medieval Latin text into standard Latin.

The model is available on [Hugging Face](https://huggingface.co/mschonhardt/latin-normalizer) and [Zenodo](https://doi.org/10.5281/zenodo.18416639).

![](https://zenodo.org/badge/DOI/10.5281/zenodo.18416639.svg)

## Quick check
You can use `pipeline` to quickly convert input text. 

In [1]:
from transformers import pipeline

# Initialize the normalizer
normalizer = pipeline("text2text-generation", model="mschonhardt/latin-normalizer")

# Example input
raw_text = "viiii vt in sabbato sancto ieiunium ante noctis initium non soluatur"
# Limit generation with max_new_tokens; passing max_length here conflicts
# with the pipeline's default max_new_tokens and is ignored with a warning
result = normalizer(raw_text, max_new_tokens=128)

print(f"Normalized: {result[0]['generated_text']}")


Device set to use cuda:0


Normalized: ix ut in sabbato sancto ieiunium ante noctis initium non solvatur


The sections below load the tokenizer and model explicitly and walk through each inference step in more detail.

## Setup Environment

In [2]:
# Import necessary libraries
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Use the GPU (CUDA) when available for faster inference; otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Torch version: {torch.__version__}")
print(f"Device: {device}")

print("Environment ready.")

Torch version: 2.10.0+cu128
Device: cuda
Environment ready.


## Load the Model from Hugging Face

In [3]:
# Load the model and tokenizer from Hugging Face
model_name = "mschonhardt/latin-normalizer" 
print(f"Loading model: {model_name} ...")
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
print("Model loaded successfully!")

Loading model: mschonhardt/latin-normalizer ...
Model loaded successfully!


### Prediction Logic
The model was trained on individual text lines from manuscripts, so quality may degrade on longer passages.
Note: the model was trained on text without punctuation.
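
Given these two constraints, one option is to strip punctuation and split longer passages into line-sized chunks before passing them to the model. The following is a minimal sketch, not part of the model's own tooling: the helper names `strip_punctuation` and `split_into_chunks`, the choice of ASCII punctuation only, and the 100-byte budget are all assumptions made for illustration. The budget is measured in UTF-8 bytes because ByT5 tokenizes text byte by byte, so input length in tokens tracks byte length.

```python
import re
import string

# Assumption: only ASCII punctuation is removed; extend the set as needed.
_PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")


def strip_punctuation(text: str) -> str:
    """Replace punctuation with spaces and collapse runs of whitespace."""
    return " ".join(_PUNCT.sub(" ", text).split())


def split_into_chunks(text: str, max_bytes: int = 100) -> list[str]:
    """Greedily pack whole words into chunks of at most max_bytes UTF-8 bytes.

    A single word longer than max_bytes is kept whole as its own chunk.
    The default budget is a hypothetical value, not taken from the model card.
    """
    chunks, current = [], ""
    for word in strip_punctuation(text).split():
        candidate = f"{current} {word}" if current else word
        if current and len(candidate.encode("utf-8")) > max_bytes:
            # Adding this word would exceed the budget: close the chunk.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


passage = "Veni, vidi, vici. Avt ferrum lapsvm de manubrio!"
for chunk in split_into_chunks(passage, max_bytes=20):
    print(chunk)
```

Each resulting chunk can then be normalised on its own and the outputs joined with spaces. Splitting on word boundaries rather than at a fixed byte offset avoids cutting a word in half, which the model would otherwise see as two unrelated fragments.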

### Run Inference

In [7]:
# Unnormalised medieval Latin input lines
lines = ["avt ferrum lapsvm de manubrio", "ueni uidi uici"]

for input_text in lines:

    # 1. Tokenize input
    inputs = tokenizer(input_text, return_tensors="pt").to(device)

    # 2. Generate output tokens
    output_tokens = model.generate(**inputs, max_length=128)

    # 3. Decode back to text
    expanded_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

    print(f"Input:    {input_text}")
    print(f"Normalised: {expanded_text}")


Input:    avt ferrum lapsvm de manubrio
Normalised: aut ferrum lapsum de manubrio
Input:    ueni uidi uici
Normalised: veni vidi vici
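
The loop above runs one `generate` call per line. For larger inputs it may be faster to process several lines in a single padded batch. The sketch below assumes the `tokenizer`, `model`, and `device` objects from the earlier cells; the helper name `normalize_batch` is hypothetical, and the `max_new_tokens=128` default simply mirrors the limit used above.

```python
def normalize_batch(lines, tokenizer, model, device="cpu", max_new_tokens=128):
    """Normalise several lines in one generate() call (illustrative sketch).

    Expects a Hugging Face tokenizer and seq2seq model, such as the ones
    loaded earlier in this notebook.
    """
    if not lines:
        return []

    # Pad to the longest line so all inputs fit into one tensor;
    # the attention mask returned by the tokenizer marks the padding.
    inputs = tokenizer(list(lines), return_tensors="pt", padding=True).to(device)

    # Generate output tokens for the whole batch at once
    output_tokens = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Decode every sequence in the batch back to text
    return tokenizer.batch_decode(output_tokens, skip_special_tokens=True)


# Usage, assuming the objects defined in the cells above:
# results = normalize_batch(lines, tokenizer, model, device)
# for source, target in zip(lines, results):
#     print(f"{source} -> {target}")
```

Batch size is limited by available memory, so very long lists are better split into smaller groups before calling the helper.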
