Release v0.9.0 - Anchor Engine: language packs become unnecessary · oleksiijko/pmb

PMB now understands many languages through the embedder instead of a
hand-written pack per language. The RU/UK packs are gone; recall, intent
detection and keyed-fact extraction ride one mechanism that transfers across
every language the embedder knows, and the cold lexical path compiles itself
from your own traffic.

✨ Highlights

- Semantic Anchor Engine (SAE). Intent detection and keyed-fact extraction
  run on English semantic anchors, classified by margin against calibrated
  per-set thresholds (FPR ≤ 1%). The multilingual embedder projects any language
  next to the English exemplars, so "was sind meine Ziele" and "what are my
  open goals" hit the same anchor - no per-language data.
- Anchor→Lexicon Distillation (ALD). The cold lexical path self-compiles
  from your traffic: the maintenance tick mines high-precision n-grams that
  co-fire with anchors into $PMB_HOME/lang/auto.yaml. A language you actually
  use gets faster over time, with zero configuration.
- One mechanism, every language. ~2,000 lines of hand-written RU/UK lists
  deleted in favour of the embedder + anchors. Adding a language is now usually
  nothing.
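
To make "classified by margin against calibrated per-set thresholds" concrete, here is a minimal sketch. It assumes the margin is the best similarity to an anchor set minus the best similarity to a background set, and that the threshold is picked so that at most 1% of known negatives exceed it. The function names, the background set and the toy 3-d vectors (standing in for real embeddings) are all hypothetical; PMB's actual scoring and calibration may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def margin(query, anchors, background):
    """Best similarity to the anchor set minus best similarity to background."""
    best_anchor = max(cosine(query, a) for a in anchors)
    best_background = max(cosine(query, b) for b in background)
    return best_anchor - best_background

def calibrate_threshold(negative_margins, max_fpr=0.01):
    """Pick a threshold that at most max_fpr of the negatives strictly exceed."""
    ranked = sorted(negative_margins)
    allowed = int(len(ranked) * max_fpr)  # negatives allowed above the threshold
    return ranked[len(ranked) - 1 - allowed]

def fires(query, anchors, background, threshold):
    """An anchor set fires when the margin is strictly above its threshold."""
    return margin(query, anchors, background) > threshold

# Toy vectors (hypothetical): a multilingual embedder would place a German
# and an English phrasing of the same intent close to the same anchor.
goal_anchors = [[1.0, 0.0, 0.0]]
background = [[0.0, 1.0, 0.0]]
query_de = [0.9, 0.1, 0.0]   # stands in for "was sind meine Ziele"
query_other = [0.1, 0.9, 0.0]  # stands in for an unrelated query
```

Because the comparison happens in embedding space, nothing in this sketch depends on the query language; only the embedder does.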

⚠️ Breaking changes

- packs/ru.yaml and packs/uk.yaml are deleted; no pack is active by
  default (_DEFAULT_ACTIVE = ()). The packs-off eval is now a blocking CI gate.
  RU/UK recall is unaffected - the embedder carries it (verified byte-identical).
- On a cold, daemon-less stdio path, RU/UK lexical matchers (first-person,
  self-intent, relation, negation, future-intent, general atomic extraction) no
  longer fire until ALD distils them from traffic. The warm-daemon path (the
  default) is unaffected; the anchor tier handles those.
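
What "ALD distils them from traffic" could look like, as a rough sketch: count n-grams in observed queries and keep those that almost always co-occur with a fired anchor. The support and precision cut-offs, the function names and the in-memory dict result are assumptions for illustration; the sketch does not model the maintenance tick or the layout of $PMB_HOME/lang/auto.yaml.

```python
from collections import Counter

def ngrams(text, n_max=3):
    """Yield word n-grams up to n_max, using whitespace tokenization."""
    tokens = text.lower().split()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def distil(traffic, min_support=3, min_precision=0.95):
    """Mine n-grams that reliably co-fire with an anchor.

    traffic: iterable of (text, anchor_name or None) pairs, where anchor_name
    is whichever anchor the warm tier fired for that text (hypothetical shape).
    """
    seen = Counter()      # how often each n-gram appeared at all
    cofired = Counter()   # how often it appeared together with a given anchor
    for text, anchor in traffic:
        for gram in set(ngrams(text)):
            seen[gram] += 1
            if anchor is not None:
                cofired[(gram, anchor)] += 1
    lexicon = {}
    for (gram, anchor), hits in cofired.items():
        # Keep only frequent n-grams whose firing precision is high enough.
        if hits >= min_support and hits / seen[gram] >= min_precision:
            lexicon.setdefault(anchor, []).append(gram)
    return {anchor: sorted(grams) for anchor, grams in lexicon.items()}
```

With three German goal queries and three unrelated queries that share "was sind", the shared prefix is rejected on precision while "meine ziele" is kept. Note that `str.split()` returns a single token for text written without spaces, which is consistent with the limitation that ALD covers space-delimited languages only.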

✅ Verified

- V1 recall: en/ru/uk top-1 = 1.00 (RU/UK byte-identical to the pack era).
- Multilingual eval (101 queries): overall top-1 = 0.77, top-3 = 0.91;
  top-1 = 1.00 for en/fr/pt/ru.
- Anchor classify latency: p50 ≈ 48 ms, p95 ≈ 81 ms.
- Tests: 1308 passed / 4 skipped / 0 failed · eval gates 18 passed / 0 failed.

🔎 Honest limits (0.9, not 1.0)

- Non-English intents/extraction are warm-only (need the daemon); the cold
  path self-heals with use.
- CJK (zh/ja) is weak on exact top-1 (strong in top-3); ALD covers
  space-delimited languages only.
- The new hypothesis-margin keyed extraction ships default-off pending
  real-world precision data.
- ALD's real-traffic self-healing rate is proven in tests, not yet measured in
  the wild.
- The latency SLO was measured on a constrained box - re-measure on your hardware.
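
For re-measuring latency on your own hardware, a small harness along these lines may help. It is a generic sketch using only the Python standard library; `latency_percentiles` and the callable you pass in are hypothetical stand-ins, since these notes do not describe a PMB benchmarking API.

```python
import statistics
import time

def latency_percentiles(fn, payloads, warmup=3):
    """Time fn(payload) for each payload and return p50/p95 in milliseconds."""
    # Warm-up calls are not timed, so one-off costs do not skew the result.
    for payload in payloads[:warmup]:
        fn(payload)
    samples_ms = []
    for payload in payloads:
        start = time.perf_counter()
        fn(payload)
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns 99 cut points; index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}
```

Pass in whatever callable wraps your classify step and a representative list of queries; a few hundred samples give a steadier p95 than a few dozen.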

⬆️ Upgrading

No action needed for English. For other languages: run the daemon (so the warm
anchor tier + ALD are active) and use PMB normally - the cold path fills in over
a few days. If recall is weak for your language, upgrade the embedder:
pmb config set embedding.model BAAI/bge-m3 && pmb reindex.
Details: https://github.com/oleksiijko/pmb/blob/main/docs/adding-a-language.md

Full changelog: https://github.com/oleksiijko/pmb/blob/main/CHANGELOG.md ·
v0.8.0...v0.9.0
