# Latin Interpunctuator

This notebook demonstrates how to use the mT5 model `mschonhardt/mt5-latin-punctuator-large`.
The model adds punctuation (interpunctuation) and capitalization to unpunctuated Latin text, following the conventions of its training data.

## Quick check

In [1]:
from transformers import pipeline

# Load the expander
expander = pipeline("text2text-generation", model="mschonhardt/abbreviationes-v2")

# Example: "Vt ep̅i conꝓuinciales peregrina iu¬" abbreviated
text = "Vt ep̅i conꝓuinciales peregrina iu¬"
result = expander(text, max_length=512)

print(f"Source: {text}")
print(f"Expanded: {result[0]['generated_text']}")

Device set to use cuda:0
Both `max_new_tokens` (=256) and `max_length`(=512) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Source: Vt ep̅i conꝓuinciales peregrina iu¬
Expanded: Vt episcopi comprouinciales peregrina iu¬


## Setup Environment

In [None]:
# Import necessary libraries
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Model should be used with GPU (cuda) if available for faster inference
device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Torch version: {torch.__version__}")
print(f"Device: {device}")

print("Environment ready.")

## Load the Model from Hugging Face

In [None]:
# Load the model and tokenizer from Hugging Face.
# from_pretrained expects a repository id ("user/model"), not a full URL.
model_name = "mschonhardt/mt5-latin-punctuator-large"
print(f"Loading model: {model_name} ...")
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
print("Model loaded successfully!")

### Prediction Logic
The model was trained with the prefix `punctuate: `, so every input must start with it. Adjust `num_beams` if you run into hallucinations or repetitions.

In [None]:
def punctuate(text: str) -> str:
    # Add the prefix 'punctuate: ' and lowercase the input, as in the training script
    input_text = "punctuate: " + text.lower()
    
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        truncation=True,
        max_length=1024,
    ).to(device)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_length=1024,
            # Adjust num_beams if hallucinations occur; 4 is a good starting point for punctuation quality
            num_beams=4,
            early_stopping=True,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
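
Because `punctuate` tokenizes with `truncation=True` and `max_length=1024`, any input beyond that limit is cut off without warning. The following is a minimal sketch of one way to handle longer texts: split on whitespace into fixed-size word windows and punctuate each window separately. The helper names `split_into_word_chunks` and `punctuate_long` and the default of 200 words are assumptions for illustration, not part of the model's documentation. A word count is only a rough proxy for token count, and punctuation near chunk boundaries may be less reliable.

```python
from typing import Callable, List


def split_into_word_chunks(text: str, max_words: int = 200) -> List[str]:
    """Split text on whitespace into chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]


def punctuate_long(
    text: str,
    punctuate_fn: Callable[[str], str],
    max_words: int = 200,  # assumed window size; tune it for your texts
) -> str:
    """Apply punctuate_fn to each chunk and join the results with spaces."""
    chunks = split_into_word_chunks(text, max_words)
    return " ".join(punctuate_fn(chunk) for chunk in chunks)
```

In this notebook it could be called as `punctuate_long(long_text, punctuate)`, where `long_text` stands for any string too long for a single pass.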


In [None]:
text = """
Si quis Patrem et Filium et Spiritum Sanctum non confitetur tres personas unius substantiae et virtutis ac potestatis, 
sicut catholica et apostolica ecclesia docet, sed unam tantum ac solitariam dicit esse personam, 
ita ut ipse sit Pater qui Filius, ipse etiam sit Paraclitus Spiritus, sicut Sabellius et Priscillianus dixerunt, anathema sit."""

The model was trained on lowercase input so that it cannot rely on existing capital letters as cues and has to learn linguistic patterns instead.

In [None]:
import textwrap

# Strip punctuation marks and lowercase the text to simulate unpunctuated input
punctuation_to_remove = '.,;:?!-()[]{}"'
text_without_punctuation = text.translate(str.maketrans("", "", punctuation_to_remove)).lower()
print(textwrap.fill(text_without_punctuation, width=80))
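
The cell above removes a fixed list of ASCII punctuation marks. Editions of Latin texts often contain other marks as well (for example guillemets or a middle dot), so a more general variant may be useful. The sketch below drops every character whose Unicode category starts with `P` (punctuation), using the standard library's `unicodedata`. The function name `strip_punctuation` is hypothetical. Unlike the cell above, it also collapses line breaks and repeated spaces into single spaces.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Remove all Unicode punctuation, lowercase, and normalize whitespace."""
    no_punct = "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )
    # split() without arguments also removes newlines and repeated spaces
    return " ".join(no_punct.lower().split())


print(strip_punctuation("Pater, et Filius; anathema sit."))
```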

### Run Inference

In [None]:
# Model will predict punctuation for the input text as well as appropriate use of capital letters
# Note: The model will reflect conventions of material it has seen, which might differ from your expectations.
text_with_punctuation = punctuate(text_without_punctuation)


Because the model applies conventions it has learned from its training data, its decisions might differ from your own conventions and expectations. It has not been designed to produce a 'perfect' text, but to give structure to unstructured text and thereby enable downstream tasks.

In [None]:
print(textwrap.fill(text_with_punctuation, width=80))
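
Since the model's conventions may differ from those of a given edition, it can help to see exactly where the prediction departs from a reference text. The following is a minimal sketch using the standard library's `difflib`. The helper name `diff_tokens` is hypothetical, and comparing whitespace-separated tokens is a simplifying assumption: a token such as `dixerit,` counts as different from `dixerit`.

```python
import difflib
from typing import List, Tuple


def diff_tokens(reference: str, predicted: str) -> List[Tuple[str, str]]:
    """Return (reference_span, predicted_span) pairs where the two texts differ."""
    ref_tokens = reference.split()
    pred_tokens = predicted.split()
    matcher = difflib.SequenceMatcher(a=ref_tokens, b=pred_tokens, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            differences.append(
                (" ".join(ref_tokens[i1:i2]), " ".join(pred_tokens[j1:j2]))
            )
    return differences


for ref_span, pred_span in diff_tokens(
    "Si quis dixerit, anathema sit.",
    "Si quis dixerit anathema sit.",
):
    print(f"reference: {ref_span!r}  predicted: {pred_span!r}")
```

In this notebook, `diff_tokens(text, text_with_punctuation)` would list the places where the model's punctuation and capitalization differ from the original passage.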