# Latin Line Break (`<lb/>`) Detector

This notebook demonstrates how to use the Flair model `mschonhardt/latin-contextual-lb-detector`.
The model predicts whether a line break `<lb/>` in Latin text is a
**word break** (the line break splits a single word, as in hyphenation; tag `WB`) or falls between two **separate words** (equivalent to a space; tag `NB`).

The model is available on [Hugging Face](https://huggingface.co/mschonhardt/latin-contextual-lb-detector) and [Zenodo](https://doi.org/10.5281/zenodo.18390269).


![](https://zenodo.org/badge/DOI/10.5281/zenodo.18390269.svg)


## Set Up the Environment

In [1]:
# Install flair if running in Colab or if not installed locally
# !pip install -q flair
# Flair 0.15.1 recommended

import flair
print(f"Flair version: {flair.__version__}")

print("Environment ready.")

Flair version: 0.15.1
Environment ready.


## Load the Model from Hugging Face

In [2]:
from flair.models import SequenceTagger
from flair.data import Sentence

print("Loading model: mschonhardt/latin-contextual-lb-detector ...")
tagger = SequenceTagger.load('mschonhardt/latin-contextual-lb-detector')
print("Model loaded successfully!")

Loading model: mschonhardt/latin-contextual-lb-detector ...
2026-02-12 15:49:19,216 SequenceTagger predicts: Dictionary with 5 tags: O, NB, WB, <START>, <STOP>
Model loaded successfully!


### Prediction Logic
The input must be tokenized on **whitespace** so that `<lb/>` is treated as a single token. Each `<lb/>` therefore has to be separated from the surrounding words by spaces.
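
If a transcription attaches the tag directly to a word (`ite<lb/>rato`) or gives it attributes (`<lb n="12"/>`), whitespace splitting alone will not produce a clean `<lb/>` token. The following is a minimal sketch of a normalization step, assuming line breaks are encoded as `<lb>` elements without a `>` inside attribute values. The helper name `pad_line_breaks` is hypothetical and not part of Flair or the model.

```python
import re

# Matches <lb/>, <lb />, and <lb n="12"/>; assumes no '>' inside attribute values.
LB_PATTERN = re.compile(r"<lb\b[^>]*>")

def pad_line_breaks(text):
    """Replace every <lb ...> element with a bare, space-separated '<lb/>'."""
    padded = LB_PATTERN.sub(" <lb/> ", text)
    # Collapse the whitespace runs introduced by the padding
    return " ".join(padded.split())

print(pad_line_breaks('possit ite<lb n="12"/>rato agi.').split())
# ['possit', 'ite', '<lb/>', 'rato', 'agi.']
```

Note that this sketch discards any attributes on the original element, so keep the source text if attributes such as `n` are needed later.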

In [3]:
def predict_line_breaks(text_input):
    # The model expects '<lb/>' as a single token. Splitting on whitespace keeps
    # it intact instead of breaking it into '<', 'lb/', '>'.
    # Input: "line1 <lb/> line2" -> ["line1", "<lb/>", "line2"]
    token_list = text_input.split()
    if not token_list:
        return []

    # Create a Flair Sentence from the pre-tokenized input
    sentence = Sentence(token_list)

    # Predict
    tagger.predict(sentence)

    # Map tags to human-readable meanings
    meanings = {
        "WB": "Split Word (Join)",
        "NB": "Separate Words (Space)",
    }

    # Extract results for the line-break tokens only, in document order
    results = []
    for token in sentence:
        if "<lb" in token.text:
            label = token.get_label()
            results.append({
                "token": token.text,
                "prediction": label.value,
                "confidence": label.score,
                # Any other tag (e.g. 'O') is reported explicitly instead of
                # being silently treated as a split word
                "meaning": meanings.get(label.value, "No line-break tag predicted"),
            })

    return results

### Run Inference

In [5]:
# Example 1: A word split across two lines

text1 = "impleatur, et ideo sequatur absolutio, quod post sententiam possit ite <lb/> rato agi. Ego considero, quod terminus offerendi est vsque ad sententiam"
print(f"Input: {text1}")
print(f"Prediction: {predict_line_breaks(text1)}")

print("-" * 30)

# Example 2: Two separate words on either side of the line break
text2 = "vt d. l. si rem. §. fi. Item considero, quod secundus creditor bene habet <lb/> hypothecariam, licet prior ei pręferatur, tamen sua hypothecaria non"
print(f"Input: {text2}")
print(f"Prediction: {predict_line_breaks(text2)}")


Input: impleatur, et ideo sequatur absolutio, quod post sententiam possit ite <lb/> rato agi. Ego considero, quod terminus offerendi est vsque ad sententiam
Prediction: [{'token': '<lb/>', 'prediction': 'WB', 'confidence': 0.8927411437034607, 'meaning': 'Split Word (Join)'}]
------------------------------
Input: vt d. l. si rem. §. fi. Item considero, quod secundus creditor bene habet <lb/> hypothecariam, licet prior ei pręferatur, tamen sua hypothecaria non
Prediction: [{'token': '<lb/>', 'prediction': 'NB', 'confidence': 0.9999591112136841, 'meaning': 'Separate Words (Space)'}]
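
The `confidence` value can be used to send uncertain predictions to manual review. The following is a minimal sketch that operates on result dictionaries shaped like those returned by `predict_line_breaks`. The helper name `needs_review` and the threshold of 0.95 are illustrative assumptions, not values recommended for this model.

```python
def needs_review(results, threshold=0.95):
    """Return (position, result) pairs whose confidence is below the threshold."""
    return [
        (position, result)
        for position, result in enumerate(results)
        if result["confidence"] < threshold
    ]

# The two predictions shown above, combined into one list for illustration
sample = [
    {"token": "<lb/>", "prediction": "WB", "confidence": 0.8927411437034607},
    {"token": "<lb/>", "prediction": "NB", "confidence": 0.9999591112136841},
]

for position, result in needs_review(sample):
    print(position, result["prediction"], round(result["confidence"], 3))
# 0 WB 0.893
```

A suitable threshold depends on the corpus and on how costly a wrongly joined or wrongly split word is in the downstream workflow.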


## Integrate into a Postprocessing Pipeline
With these predictions, we can either merge the lines into running text or keep the line breaks and annotate them, depending on the workflow.

In [6]:
def recreate_sentence(text_input):
    # Get predictions for line breaks, in the order the <lb/> tokens occur
    predictions = iter(predict_line_breaks(text_input))

    # Rebuild the text token by token
    result = ""
    join_next = False  # True right after a word-breaking <lb/>
    for token in text_input.split():
        if "<lb" in token:
            # Consume the prediction that belongs to this <lb/> token, so that
            # several line breaks in one input each get their own prediction
            pred = next(predictions, None)
            # WB = split word: join the two halves directly
            # NB (or anything else) = separate words: keep a space
            join_next = pred is not None and pred["prediction"] == "WB"
            continue
        if result and not join_next:
            result += " "
        result += token
        join_next = False

    return result

# Test with the existing examples
print("Reconstructed text 1:")
print(recreate_sentence(text1))

print("\n" + "="*50 + "\n")

print("Reconstructed text 2:")
print(recreate_sentence(text2))


Reconstructed text 1:
impleatur, et ideo sequatur absolutio, quod post sententiam possit iterato agi. Ego considero, quod terminus offerendi est vsque ad sententiam


Reconstructed text 2:
vt d. l. si rem. §. fi. Item considero, quod secundus creditor bene habet hypothecariam, licet prior ei pręferatur, tamen sua hypothecaria non


In [7]:
def recreate_tei(text_input):
    # Get predictions for line breaks, in the order the <lb/> tokens occur
    predictions = iter(predict_line_breaks(text_input))

    # Reconstruct the text with TEI line-break markup based on predictions
    reconstructed_tei = []
    for token in text_input.split():
        if "<lb" in token:
            # Consume the prediction that belongs to this <lb/> token
            pred = next(predictions, None)
            if pred is not None and pred["prediction"] == "WB":  # WB = split word
                reconstructed_tei.append("<lb break='no'/>")
            else:  # NB = separate words
                reconstructed_tei.append("<lb/>")
        else:
            reconstructed_tei.append(token)

    # Join tokens, then remove the spaces around word-internal line breaks
    result = " ".join(reconstructed_tei)
    result = result.replace(" <lb break='no'/> ", "<lb break='no'/>")

    return result

# Test with the existing examples
print("Reconstructed text 1:")
print(recreate_tei(text1))

print("\n" + "="*50 + "\n")

print("Reconstructed text 2:")
print(recreate_tei(text2))

Reconstructed text 1:
impleatur, et ideo sequatur absolutio, quod post sententiam possit ite<lb break='no'/>rato agi. Ego considero, quod terminus offerendi est vsque ad sententiam


Reconstructed text 2:
vt d. l. si rem. §. fi. Item considero, quod secundus creditor bene habet <lb/> hypothecariam, licet prior ei pręferatur, tamen sua hypothecaria non
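
Since `recreate_tei` returns an XML fragment rather than a complete document, it can be useful to check that the result still parses before writing it back into a TEI file. The following is a minimal sketch using only the standard library. The wrapping `<p>` element and the helper name `count_line_breaks` are assumptions for illustration, and the fragment is assumed to contain no other unescaped markup characters such as `&` or `<`.

```python
import xml.etree.ElementTree as ET

def count_line_breaks(tei_fragment):
    """Parse a TEI fragment and count joined vs. separating <lb/> elements."""
    # Wrap the fragment so it has a single root element; ET.fromstring raises
    # ET.ParseError if the fragment is not well-formed.
    root = ET.fromstring(f"<p>{tei_fragment}</p>")
    counts = {"word_break": 0, "separate_words": 0}
    for lb in root.iter("lb"):
        if lb.get("break") == "no":
            counts["word_break"] += 1
        else:
            counts["separate_words"] += 1
    return counts

fragment = "possit ite<lb break='no'/>rato agi. bene habet <lb/> hypothecariam"
print(count_line_breaks(fragment))
# {'word_break': 1, 'separate_words': 1}
```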
