# Latin Abbreviation Expansion

This notebook demonstrates how to use the ByT5 model `mschonhardt/abbreviationes-v2`.
The model expands medieval Latin abbreviations that are encoded with a fixed set of special characters.
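
To see which special characters a given line actually contains, it can help to list its non-ASCII code points before passing it to the model. The sketch below uses only the standard library; `list_special_characters` is a hypothetical helper name, and treating every non-ASCII character as an abbreviation mark is an assumption, since the model's exact character set is not listed in this notebook.

```python
import unicodedata


def list_special_characters(line: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) for each non-ASCII character in line.

    Assumption: any non-ASCII character is a candidate abbreviation mark.
    """
    found = []
    for ch in line:
        if ord(ch) > 127:
            found.append((f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")))
    return found


# "lapsū" is written with an escape so the code point is unambiguous.
for code_point, name in list_special_characters("aut ferrum laps\u016b de manubrio"):
    print(code_point, name)
# prints: U+016B LATIN SMALL LETTER U WITH MACRON
```

Combining marks appear as separate entries in the result, because they are separate code points from the base letter they attach to.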

## Quick check
The `pipeline` helper from `transformers` is the quickest way to expand a line of abbreviated text.

In [13]:
from transformers import pipeline

# Load the expander
expander = pipeline("text2text-generation", model="mschonhardt/abbreviationes-v2")

# Example line containing one abbreviation ("lapsū" for "lapsum")
text = "aut ferrum lapsū de manubrio"
result = expander(text, max_length=512)

print(f"Source: {text}")
print(f"Expanded: {result[0]['generated_text']}")

Device set to use cuda:0
Both `max_new_tokens` (=256) and `max_length`(=512) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Source: aut ferrum lapsū de manubrio
Expanded: aut ferrum lapsum de manubrio


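The warning in the output above appears because the call passes `max_length=512` while `max_new_tokens=256` is also set; as the message states, `max_new_tokens` takes precedence.

The remaining sections load the tokenizer and model directly, which exposes the tokenization, generation, and decoding steps individually.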

## Setup Environment

In [14]:
# Import necessary libraries
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Use the GPU (CUDA) when available for faster inference; otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Torch version: {torch.__version__}")
print(f"Device: {device}")

print("Environment ready.")

Torch version: 2.10.0+cu128
Device: cuda
Environment ready.


## Load the Model from Hugging Face

In [15]:
# Load the model and tokenizer from Hugging Face
model_name = "mschonhardt/abbreviationes-v2"
print(f"Loading model: {model_name} ...")
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
print("Model loaded successfully!")

Loading model: mschonhardt/abbreviationes-v2 ...
Model loaded successfully!


### Prediction Logic
The model was trained on individual abbreviated text lines from manuscripts. Quality may degrade on longer passages, so it is safest to expand one line at a time.
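
One way to respect the line-level training setup is to split a longer passage into lines and expand each line separately. The following is a minimal sketch under stated assumptions: `expand_passage` and `fake_expand` are hypothetical names, and `fake_expand` is a toy stand-in that only rewrites one character so the example runs without downloading the model.

```python
from typing import Callable


def expand_passage(passage: str, expand_line: Callable[[str], str]) -> list[str]:
    """Expand a multi-line passage one line at a time, skipping blank lines."""
    lines = [line.strip() for line in passage.splitlines()]
    return [expand_line(line) for line in lines if line]


def fake_expand(line: str) -> str:
    """Toy stand-in for the model (assumption): rewrites "ū" as "um" only."""
    return line.replace("\u016b", "um")


passage = "aut ferrum laps\u016b de manubrio\n\ntur ab ultore sanguinis"
print(expand_passage(passage, fake_expand))
# prints: ['aut ferrum lapsum de manubrio', 'tur ab ultore sanguinis']
```

With the `expander` pipeline from the quick check, the stand-in could be replaced by something like `lambda s: expander(s)[0]["generated_text"]`.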

### Run Inference

In [16]:
# The abbreviated Medieval Latin text
lines = [
    "aut ferrum lapsū de manubrio",
    "ei᷒ et surgens ꝑcusserit eum et",
    "tur ab ultore sanguinis ꝓximi sui",
    "et illū qui armis c̅tra iniquitatē",
]

for input_text in lines:

    # 1. Tokenize input
    inputs = tokenizer(input_text, return_tensors="pt").to(device)

    # 2. Generate output tokens
    output_tokens = model.generate(**inputs, max_length=128)

    # 3. Decode back to text
    expanded_text = tokenizer.decode(output_tokens[0], skip_special_tokens=True)

    print(f"Input:    {input_text}")
    print(f"Expanded: {expanded_text}")


Input:    aut ferrum lapsū de manubrio
Expanded: aut ferrum lapsum de manubrio
Input:    ei᷒ et surgens ꝑcusserit eum et
Expanded: eius et surgens percusserit eum et
Input:    tur ab ultore sanguinis ꝓximi sui
Expanded: tur ab ultore sanguinis proximi sui
Input:    et illū qui armis c̅tra iniquitatē
Expanded: et illum qui armis contra iniquitatem
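
When reviewing results like those above, it can be useful to list only the words that the model changed. The sketch below compares input and output word by word with the standard library's `difflib`; `changed_words` is a hypothetical helper name, and splitting on whitespace is an assumption that may not suit every transcription convention.

```python
from difflib import SequenceMatcher


def changed_words(source: str, expanded: str) -> list[tuple[str, str]]:
    """Return (source words, expanded words) for each span that differs."""
    src, exp = source.split(), expanded.split()
    matcher = SequenceMatcher(a=src, b=exp, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Join each differing span so insertions and deletions are visible too.
            changes.append((" ".join(src[i1:i2]), " ".join(exp[j1:j2])))
    return changes


print(changed_words("aut ferrum laps\u016b de manubrio",
                    "aut ferrum lapsum de manubrio"))
# prints: [('lapsū', 'lapsum')]
```

An empty result means the model returned the line unchanged, which is expected for lines without abbreviations.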
