# POS tagging Latin with Flair contextual string embeddings (legal) + trained tagger

This notebook shows how to use:

- Contextual string embeddings (Flair LM):
  - `mschonhardt/latin-legal-forward`
  - `mschonhardt/latin-legal-backward`
- POS tagger trained on top of those embeddings:
  - `mschonhardt/latin-pos-tagger`

Key point: for POS tagging you usually only need to load the **tagger** (`SequenceTagger.load(...)`), because the saved tagger already includes the embeddings it was trained on. Loading the two LM embeddings separately is still useful if you want to (a) inspect or verify the embeddings, or (b) reuse them for other downstream tasks.


In [None]:
# If needed, install dependencies
# !pip install -U flair torch pandas


## 1) Setup
Import the libraries and select the device (GPU if available, otherwise CPU).


In [None]:
import torch
import pandas as pd
import flair

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger

# Device selection
flair.device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print("flair.device =", flair.device)
if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))


## 2) Load your legal Latin Flair embeddings and stack them
This is the canonical bidirectional setup (forward + backward) for downstream sequence tagging.


In [None]:
LEGAL_FORWARD = "mschonhardt/latin-legal-forward"
LEGAL_BACKWARD = "mschonhardt/latin-legal-backward"

forward_embeddings = FlairEmbeddings(LEGAL_FORWARD)
backward_embeddings = FlairEmbeddings(LEGAL_BACKWARD)

stacked_embeddings = StackedEmbeddings([forward_embeddings, backward_embeddings])
print(stacked_embeddings)


## 3) (Optional) Embed a sentence to verify embeddings work
This does *not* do POS tagging yet; it just computes contextual embeddings for each token.


In [None]:
example_text = "In nomine sanctae et individuae trinitatis ."
sent = Sentence(example_text)

stacked_embeddings.embed(sent)

# Each token now has an embedding vector. We'll print token texts + embedding dimensionality.
for token in sent:
    emb = token.embedding
    print(token.text, "\t", emb.shape)


## 4) Load your POS tagger (trained on those embeddings)
For POS tagging, this is usually all you need.


In [None]:
POS_TAGGER_ID = "mschonhardt/latin-pos-tagger"
tagger = SequenceTagger.load(POS_TAGGER_ID)
print(tagger)

# Optional: inspect what embeddings are inside the tagger
try:
    print("\nTagger embeddings:")
    print(tagger.embeddings)
except Exception as e:
    print("Could not print tagger.embeddings:", e)


## 5) POS-tag a single sentence
Your model card suggests reading tags from `upos`.
We'll implement a small helper that tries `upos` first, then falls back to `pos`.


In [None]:
def get_pos(token):
    # Prefer UPOS (as in your model card), fall back to POS.
    # get_labels() returns an empty list when the token carries no label of that
    # type, so no blanket try/except is needed (one would also hide real API errors).
    for label_type in ("upos", "pos"):
        labels = token.get_labels(label_type)
        if labels and labels[0].value:
            return label_type, labels[0].value, float(labels[0].score)
    return "", "", 0.0

sent = Sentence(example_text)
tagger.predict(sent)

rows = []
for tok in sent:
    label_type, value, score = get_pos(tok)
    rows.append({"token": tok.text, "label_type": label_type, "pos": value, "score": score})

df = pd.DataFrame(rows)
df
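
The `score` column holds the confidence Flair reports with each predicted tag, so rows like these can be used to flag tokens for manual review. A minimal sketch, assuming rows shaped like the dictionaries built above; the helper name `flag_low_confidence`, the 0.9 threshold, and the sample values are illustrative and not actual model output:

```python
def flag_low_confidence(rows, threshold=0.9):
    """Return the rows whose score is below the threshold (hypothetical helper)."""
    return [row for row in rows if row["score"] < threshold]

# Made-up sample rows in the same shape as `rows` above (not real tagger output).
sample_rows = [
    {"token": "In", "label_type": "upos", "pos": "ADP", "score": 0.99},
    {"token": "nomine", "label_type": "upos", "pos": "NOUN", "score": 0.72},
    {"token": ".", "label_type": "upos", "pos": "PUNCT", "score": 1.0},
]

for row in flag_low_confidence(sample_rows):
    print(f'{row["token"]}\t{row["pos"]}\t{row["score"]:.2f}')
```

A suitable threshold depends on the model and the corpus, so it is worth checking a sample of flagged tokens by hand before settling on one.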


## 6) POS-tag multiple texts efficiently (mini-batching)
This is the typical pattern for processing a corpus.


In [None]:
texts = [
    "In nomine sanctae et individuae trinitatis .",
    "Quod infames uocentur qui ex consanguineis nascuntur .",
    "Si quis clericus furtum fecerit , deponatur .",
]

sentences = [Sentence(t) for t in texts]

# mini_batch_size can be increased if you have enough VRAM
tagger.predict(sentences, mini_batch_size=16)

all_rows = []
for i, s in enumerate(sentences):
    for j, tok in enumerate(s):
        label_type, value, score = get_pos(tok)
        all_rows.append({
            "doc_id": i,
            "token_id": j,
            "token": tok.text,
            "label_type": label_type,
            "pos": value,
            "score": score,
        })

df_all = pd.DataFrame(all_rows)
df_all.head(30)
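
For a corpus too large to hold in memory as `Sentence` objects all at once, one option is to stream the texts in fixed-size chunks and tag each chunk in turn. A minimal sketch using only the standard library; `chunked` and the chunk size are hypothetical choices, and the Flair calls are left as comments because they depend on the `tagger` loaded earlier:

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

corpus = [f"text {i}" for i in range(5)]  # placeholder texts, not real data
for chunk in chunked(corpus, size=2):
    # In the notebook, each chunk would be tagged like this:
    #   batch = [Sentence(t) for t in chunk]
    #   tagger.predict(batch, mini_batch_size=16)
    print(len(chunk), chunk)
```

Because `chunked` accepts any iterable, the texts could also come from a generator that reads a file line by line.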


## 7) Export as CoNLL-style text (optional)
Useful if you want to feed the output into evaluation scripts or your own downstream pipeline.


In [None]:
def to_conll(sentences):
    lines = []
    for s in sentences:
        for tok in s:
            _, pos, _ = get_pos(tok)
            lines.append(f"{tok.text}\t{pos}")
        lines.append("")  # sentence boundary
    return "\n".join(lines)

print(to_conll(sentences)[:1000])
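
To read such output back, for example to compare predictions against gold tags, a small parser is enough. A minimal sketch, assuming the two-column tab-separated layout produced by `to_conll`, with blank lines between sentences; `parse_conll` is a hypothetical helper and the sample string is hand-written, not model output:

```python
def parse_conll(text):
    """Parse 'token<TAB>tag' lines into a list of sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        token, _, tag = line.partition("\t")
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences

# Hand-written sample in the same layout (tags are illustrative).
sample = "In\tADP\nnomine\tNOUN\n\nSi\tSCONJ\nquis\tPRON\n"
for sentence in parse_conll(sample):
    print(sentence)
```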
