# POS tagging Latin with Flair contextual string embeddings + trained tagger

This notebook shows how to use:

- Contextual string embeddings (Flair LM):
  - `mschonhardt/latin-legal-forward`
  - `mschonhardt/latin-legal-backward`
- POS tagger trained on top of those embeddings:
  - `mschonhardt/latin-pos-tagger`

For POS tagging you usually only need to load the **tagger** (`SequenceTagger.load(...)`), because it already bundles its base embeddings. The two LM embeddings are loaded separately here for illustration; doing so is also useful if you want to (a) inspect or verify the embeddings, or (b) reuse them for other downstream tasks.
The model is available on [Hugging Face](https://huggingface.co/mschonhardt/latin-pos-tagger) and [Zenodo](https://doi.org/10.5281/zenodo.18631267).


![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.18631267.svg)




In [1]:
# If needed, install dependencies
# !pip install -U flair torch pandas huggingface_hub


## 1) Setup
Select device, import libraries.


In [3]:
import torch
import pandas as pd
import flair

from flair.data import Sentence
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger

# Device selection
flair.device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print("flair.device =", flair.device)
if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))


flair.device = cuda
CUDA device: NVIDIA GeForce RTX 3060


## 2) Load the legal Latin Flair embeddings and stack them
This is the bidirectional setup (forward + backward) for downstream sequence tagging.


In [5]:
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from huggingface_hub import hf_hub_download

# Download the actual model files from Hugging Face
forward_path = hf_hub_download(repo_id="mschonhardt/latin-legal-forward", filename="latin-legal-forward.pt")
backward_path = hf_hub_download(repo_id="mschonhardt/latin-legal-backward", filename="latin-legal-backward.pt")

# Load them using the local paths
forward_embeddings = FlairEmbeddings(forward_path)
backward_embeddings = FlairEmbeddings(backward_path)

# Stack as usual
stacked_embeddings = StackedEmbeddings([forward_embeddings, backward_embeddings])
print(stacked_embeddings)


StackedEmbeddings [0-/home/micha/.cache/huggingface/hub/models--mschonhardt--latin-legal-forward/snapshots/8bea03e437de9ad7da812d6d686ad1fd1d1b1d0c/latin-legal-forward.pt,1-/home/micha/.cache/huggingface/hub/models--mschonhardt--latin-legal-backward/snapshots/d56792215e4f59843b2a08a5804068df749cbaaf/latin-legal-backward.pt]


## 3) (Optional) Embed a sentence to verify embeddings work
This does *not* do POS tagging yet; it just computes contextual embeddings for each token.


In [6]:
example_text = "In nomine sanctae et individuae trinitatis ."
sent = Sentence(example_text)

stacked_embeddings.embed(sent)

# Each token now has an embedding vector. We'll print token texts + embedding dimensionality.
for token in sent:
    emb = token.embedding
    print(token.text, "\t", emb.shape)


In 	 torch.Size([4096])
nomine 	 torch.Size([4096])
sanctae 	 torch.Size([4096])
et 	 torch.Size([4096])
individuae 	 torch.Size([4096])
trinitatis 	 torch.Size([4096])
. 	 torch.Size([4096])
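
The 4096 dimensions per token are what stacking would be expected to produce: the tagger printout in section 4 shows each language model with a 2048-unit LSTM, and `StackedEmbeddings` concatenates the per-token vectors of its members. The following is a minimal sketch of that arithmetic in plain Python (no Flair needed); `stack`, `HIDDEN_SIZE`, and the toy vectors are hypothetical stand-ins, not Flair API.

```python
# Toy illustration of how stacking concatenates per-token vectors.
# HIDDEN_SIZE mirrors the LSTM(200, 2048, ...) shown in the tagger printout;
# `stack` is a hypothetical helper, not part of Flair.
HIDDEN_SIZE = 2048


def stack(*vectors):
    """Concatenate several per-token vectors into one."""
    stacked = []
    for vec in vectors:
        stacked.extend(vec)
    return stacked


forward_vec = [0.0] * HIDDEN_SIZE   # stand-in for the forward LM's token vector
backward_vec = [1.0] * HIDDEN_SIZE  # stand-in for the backward LM's token vector

token_vec = stack(forward_vec, backward_vec)
print(len(token_vec))  # 4096
```

If the printed shape ever differs from the sum of the two hidden sizes, that would be a hint that a different or incomplete embedding file was loaded.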


## 4) Load the POS tagger (trained on those embeddings)
For simple POS tagging, this is usually all you need.


In [8]:
POS_TAGGER_ID = "mschonhardt/latin-pos-tagger"

# Download the specific model file from the repo
# Note: In this repo, the relevant file is named 'best-model.pt'
tagger_path = hf_hub_download(repo_id=POS_TAGGER_ID, filename="best-model.pt")

# Load the tagger using the local path
tagger = SequenceTagger.load(tagger_path)

print(tagger)

# Inspect embeddings
try:
    print("\nTagger embeddings:")
    print(tagger.embeddings)
except Exception as e:
    print("Could not print tagger.embeddings:", e)

best-model.pt:   0%|          | 0.00/752M [00:00<?, ?B/s]

2026-02-14 12:04:39,474 SequenceTagger predicts: Dictionary with 17 tags: ADV, CCONJ, ADJ, NOUN, VERB, ADP, PUNCT, NUM, PRON, PROPN, FM, PART, ORD, ITJ, X, <START>, <STOP>
SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(333, 200)
        (rnn): LSTM(200, 2048, num_layers=2, dropout=0.1)
      )
    )
    (list_embedding_1): FlairEmbeddings(
      (lm): LanguageModel(
        (drop): Dropout(p=0.1, inplace=False)
        (encoder): Embedding(333, 200)
        (rnn): LSTM(200, 2048, num_layers=2, dropout=0.1)
      )
    )
  )
  (dropout): Dropout(p=0.2, inplace=False)
  (word_dropout): WordDropout(p=0.1)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=4096, out_features=4096, bias=True)
  (rnn): LSTM(4096, 1024, num_layers=2, batch_first=True, dropout=0.5, bidirectional=True)
  (linear): Linear(in_features=2048, o
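
The log line above reports 17 dictionary entries, but two of them (`<START>` and `<STOP>`) look like sequence-boundary markers rather than POS labels. Flair typically adds such markers when a CRF layer is used; that is an assumption here, since the printed architecture is cut off. A small sketch, using only the tag list copied from the log line, separates the two groups:

```python
# Tag list copied from the "SequenceTagger predicts" log line above.
LOG_TAGS = (
    "ADV, CCONJ, ADJ, NOUN, VERB, ADP, PUNCT, NUM, PRON, PROPN, "
    "FM, PART, ORD, ITJ, X, <START>, <STOP>"
)

all_tags = [t.strip() for t in LOG_TAGS.split(",")]

# Assumption: entries wrapped in angle brackets are boundary markers, not POS tags.
pos_tags = [t for t in all_tags if not (t.startswith("<") and t.endswith(">"))]

print(len(all_tags), len(pos_tags))  # 17 15
```

With the tagger loaded, the same list should be obtainable in recent Flair versions via `tagger.label_dictionary.get_items()` instead of copying it from the log.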

## 5) POS-tag a single sentence


In [None]:
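def get_pos(token):
    """Return (label_type, value, score) for a token's POS label."""
    # This model stores its predictions under the label type "pos".
    # Note the trailing comma: ("pos") is just the string "pos", so iterating
    # over it would yield the characters "p", "o", "s" instead of one label type.
    for label_type in ("pos",):
        label = token.get_label(label_type)

        # Skip missing labels and the "O" placeholder value.
        if label and label.value and label.value != "O":
            return label_type, label.value, float(label.score)

    # Fall back to whatever label type the tagger itself declares.
    primary_label = token.get_label(tagger.label_type)
    if primary_label:
        return tagger.label_type, primary_label.value, float(primary_label.score)

    return "", "", 0.0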

sent = Sentence(example_text)
tagger.predict(sent)

rows = []
for tok in sent:
    label_type, value, score = get_pos(tok)
    rows.append({
        "token": tok.text, 
        "label_type": label_type, 
        "pos": value, 
        "score": round(score, 4)
    })

df = pd.DataFrame(rows)
df

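The output below comes from a run on a different input sentence (the opening of the *Aeneid*), not on `example_text`:

|   | token | label_type | pos | score |
|---|---|---|---|---|
| 0 | Arma | pos | NOUN | 0.5548 |
| 1 | virumque | pos | PROPN | 0.9616 |
| 2 | cano | pos | NOUN | 0.4905 |
| 3 | , | pos | PUNCT | 1.0 |
| 4 | Troiae | pos | PROPN | 0.9718 |
| 5 | qui | pos | PRON | 0.9985 |
| 6 | primus | pos | NOUN | 0.6289 |
| 7 | ab | pos | ADP | 1.0 |
| 8 | oris | pos | NOUN | 0.9981 |
| 9 | . | pos | PUNCT | 1.0 |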
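
The scores in the output above vary widely, and a common follow-up is to flag low-confidence tokens for manual review. The sketch below does this in plain Python on the rows copied from that output; `scored_rows`, `low_confidence`, and the 0.7 cut-off are hypothetical choices for illustration, not part of the model or of Flair.

```python
# Rows copied from the output above: (token, predicted tag, score).
scored_rows = [
    ("Arma", "NOUN", 0.5548),
    ("virumque", "PROPN", 0.9616),
    ("cano", "NOUN", 0.4905),
    (",", "PUNCT", 1.0),
    ("Troiae", "PROPN", 0.9718),
    ("qui", "PRON", 0.9985),
    ("primus", "NOUN", 0.6289),
    ("ab", "ADP", 1.0),
    ("oris", "NOUN", 0.9981),
    (".", "PUNCT", 1.0),
]

THRESHOLD = 0.7  # hypothetical cut-off; tune it for your own data


def low_confidence(rows, threshold):
    """Return the rows whose score falls below the threshold."""
    return [(tok, pos, score) for tok, pos, score in rows if score < threshold]


for tok, pos, score in low_confidence(scored_rows, THRESHOLD):
    print(f"{tok}\t{pos}\t{score:.4f}")
```

On the DataFrame built in the cell above, the equivalent filter would be `df[df["score"] < THRESHOLD]`.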


## 6) POS-tag multiple texts efficiently (mini-batching)
This is the typical pattern for processing a corpus.


In [None]:
texts = [
    "In nomine sanctae et individuae trinitatis .",
    "Quod infames uocentur qui ex consanguineis nascuntur .",
    "Si quis clericus furtum fecerit , deponatur .",
]

sentences = [Sentence(t) for t in texts]

# mini_batch_size can be increased if you have enough VRAM
tagger.predict(sentences, mini_batch_size=16)

all_rows = []
for i, s in enumerate(sentences):
    for j, tok in enumerate(s):
        label_type, value, score = get_pos(tok)
        all_rows.append({
            "doc_id": i,
            "token_id": j,
            "token": tok.text,
            "label_type": label_type,
            "pos": value,
            "score": score,
        })

df_all = pd.DataFrame(all_rows)
df_all.head(30)
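
`mini_batch_size` controls batching inside a single `predict` call, but all `Sentence` objects still have to exist in memory first. For a corpus too large for that, one option is to feed the tagger in chunks. The following is a minimal sketch of such a chunking helper in plain Python; `chunked` and the stand-in `corpus` are hypothetical, and the tagging step is only indicated in a comment.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


corpus = [f"text {i}" for i in range(5)]  # stand-in for lines read from disk

for chunk in chunked(corpus, 2):
    # In the notebook, this is where one would build Sentence objects for the
    # chunk, call tagger.predict(..., mini_batch_size=...), and store the rows.
    print(len(chunk), chunk)
```

Because `chunked` accepts any iterable, it also works on a file handle, so the full corpus never needs to be loaded at once.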


## 7) Export as CoNLL-style text (optional)
Useful if you want to feed the output into evaluation scripts or your own downstream pipeline.


In [None]:
def to_conll(sentences):
    """Render tagged sentences as 'token<TAB>pos' lines, one blank line per sentence boundary."""
    lines = []
    for s in sentences:
        for tok in s:
            _, pos, _ = get_pos(tok)
            lines.append(f"{tok.text}\t{pos}")
        lines.append("")  # sentence boundary
    return "\n".join(lines)

print(to_conll(sentences)[:1000])
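
To consume this format in a downstream script, the two-column text can be read back into token/tag pairs. The sketch below assumes exactly the layout produced by `to_conll` (one `token<TAB>tag` pair per line, a blank line between sentences); `parse_conll` is a hypothetical helper, and the sample string is hand-written for illustration, not model output.

```python
def parse_conll(text):
    """Parse 'token<TAB>tag' lines into a list of sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split("\t")
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample in the same layout (illustrative tags only).
sample = "In\tADP\nnomine\tNOUN\n\net\tCCONJ\n"

for sentence in parse_conll(sample):
    print(sentence)
```

Full CoNLL-U files carry more columns and comment lines, so this reader only fits the reduced two-column export used here.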
