v0.9.0 - Anchor Engine: language packs become unnecessary
PMB now understands many languages through the embedder instead of a
hand-written pack per language. The RU/UK packs are gone; recall, intent
detection and keyed-fact extraction ride one mechanism that transfers across
every language the embedder knows — and the cold path teaches itself the rest
from your own traffic.
✨ Highlights
- Semantic Anchor Engine (SAE). Intent detection and keyed-fact extraction
run on English semantic anchors, classified by margin against calibrated
per-set thresholds (FPR ≤ 1%). The multilingual embedder projects any language
next to the English exemplars, so "was sind meine Ziele" and "what are my
open goals" hit the same anchor - no per-language data.
- Anchor→Lexicon Distillation (ALD). The cold lexical path self-compiles
from your traffic: the maintenance tick mines high-precision n-grams that
co-fire with anchors into `$PMB_HOME/lang/auto.yaml`. A language you actually
use gets faster over time, with zero configuration.
- One mechanism, every language. ~2,000 lines of hand-written RU/UK lists
deleted in favour of the embedder + anchors. Adding a language now usually
requires no work at all.
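The margin test behind the anchor tier can be sketched as follows. This is a minimal illustration, not PMB's implementation: `cosine`, `calibrate_threshold`, `classify`, the toy 3-d vectors and the 0.2 thresholds are all hypothetical, and "margin" is assumed here to mean the best anchor set's score minus the runner-up's.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def calibrate_threshold(negative_margins, max_fpr=0.01):
    """Pick a threshold so that at most max_fpr of known negatives exceed it."""
    ranked = sorted(negative_margins, reverse=True)
    allowed = int(len(ranked) * max_fpr)  # negatives permitted to fire
    return ranked[allowed]                # fire only on margin > this value

def classify(query_vec, anchor_sets, thresholds):
    """Return the winning anchor set, or None if its margin is too small."""
    # Score each set by its best-matching exemplar.
    scores = {
        label: max(cosine(query_vec, v) for v in vecs)
        for label, vecs in anchor_sets.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, s1), (_, s2) = ranked[0], ranked[1]
    margin = s1 - s2  # assumption: margin = best minus runner-up
    return best if margin > thresholds[best] else None

# Toy data standing in for embedded English exemplars (hypothetical).
anchor_sets = {
    "goals": [(1.0, 0.0, 0.0)],
    "reminder": [(0.0, 1.0, 0.0)],
}
thresholds = {"goals": 0.2, "reminder": 0.2}

clear_query = classify((0.9, 0.1, 0.0), anchor_sets, thresholds)
ambiguous_query = classify((0.5, 0.5, 0.0), anchor_sets, thresholds)
```

Because the query vector comes from a multilingual embedder, the same comparison applies whatever language the query was written in; only the English exemplars need to exist.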
⚠️ Breaking changes
- `packs/ru.yaml` and `packs/uk.yaml` are deleted; no pack is active by
default (`_DEFAULT_ACTIVE = ()`). The packs-off eval is now a blocking CI gate.
- RU/UK recall is unaffected - the embedder carries it (verified byte-identical).
- On a cold, daemon-less stdio path, RU/UK lexical matchers (first-person,
self-intent, relation, negation, future-intent, general atomic extraction) no
longer fire until ALD distils them from traffic. The warm-daemon path — the
default — is unaffected; the anchor tier handles those.
✅ Verified
- V1 recall: en/ru/uk top-1 = 1.00 (RU/UK byte-identical to the pack era).
- Multilingual eval (101 queries): overall top-1 = 0.77, top-3 = 0.91;
top-1 = 1.00 for en/fr/pt/ru.
- Anchor classify latency: p50 ≈ 48 ms, p95 ≈ 81 ms.
- Tests: 1308 passed / 4 skipped / 0 failed · eval gates 18 passed / 0 failed.
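For anyone reproducing these figures on their own data, top-k accuracy and latency percentiles can be computed along these lines. This is a generic sketch, not the project's eval harness: `top_k_accuracy`, `latency_percentiles` and the sample data are hypothetical.

```python
import statistics

def top_k_accuracy(results, k):
    """results: list of (expected_id, ranked_ids); fraction with a hit in the top k."""
    hits = sum(1 for expected, ranked in results if expected in ranked[:k])
    return hits / len(results)

def latency_percentiles(samples_ms):
    """Return (p50, p95) from a list of latency samples in milliseconds."""
    # n=100 yields 99 cut points; index 49 is the 50th percentile, 94 the 95th.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]

# Hypothetical eval rows: (expected memory id, ids returned in rank order).
results = [
    ("a", ["a", "b", "c"]),
    ("b", ["a", "c", "b"]),
    ("c", ["a", "b", "d"]),
]
top1 = top_k_accuracy(results, 1)
top3 = top_k_accuracy(results, 3)
p50, p95 = latency_percentiles(list(range(1, 102)))
```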
🔎 Honest limits (0.9, not 1.0)
- Non-English intents/extraction are warm-only (need the daemon); the cold
path self-heals with use.
- CJK (zh/ja) is weak on exact top-1 (strong in top-3); ALD covers
space-delimited languages only.
- The new hypothesis-margin keyed extraction ships default-off pending
real-world precision data.
- ALD's real-traffic self-healing rate is proven in tests, not yet measured in
the wild.
- The latency SLO was measured on a constrained box - re-measure on your hardware.
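To make the space-delimited limit concrete, here is a rough sketch of the kind of n-gram mining ALD is described as doing. It is an illustration under assumptions, not the shipped algorithm: `ngrams`, `mine_ngrams`, the `min_support` and `min_precision` values and the traffic format are all hypothetical, and the real layout of `auto.yaml` is not shown in these notes.

```python
from collections import Counter

def ngrams(text, max_n=3):
    """Yield 1..max_n word n-grams using whitespace tokenisation."""
    tokens = text.lower().split()  # no spaces in the text means a single token
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def mine_ngrams(traffic, min_support=3, min_precision=0.95):
    """traffic: iterable of (text, anchor_label or None).

    Keep n-grams that appear often enough and almost only alongside one anchor.
    """
    total = Counter()        # n-gram -> messages containing it
    with_anchor = Counter()  # (n-gram, label) -> messages where that anchor fired
    for text, label in traffic:
        for gram in set(ngrams(text)):
            total[gram] += 1
            if label is not None:
                with_anchor[(gram, label)] += 1
    lexicon = {}
    for (gram, label), hits in with_anchor.items():
        if hits >= min_support and hits / total[gram] >= min_precision:
            lexicon.setdefault(label, []).append(gram)
    return {label: sorted(grams) for label, grams in lexicon.items()}

# Hypothetical traffic: three goal queries, three unrelated ones.
traffic = [("was sind meine ziele", "goals")] * 3 + [("was ist das", None)] * 3
lexicon = mine_ngrams(traffic)
cjk_grams = list(ngrams("我的目标是什么"))
```

In this toy run "meine ziele" is kept while "was" is rejected, since it also appears in messages where no anchor fired. The Chinese query collapses into one token, which is why whitespace n-grams cannot cover zh/ja.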
⬆️ Upgrading
No action needed for English. For other languages: run the daemon (so the warm
anchor tier + ALD are active) and use PMB normally - the cold path fills in over
a few days. If recall is weak for your language, upgrade the embedder:
`pmb config set embedding.model BAAI/bge-m3 && pmb reindex`.
Details: https://github.com/oleksiijko/pmb/blob/main/docs/adding-a-language.md
Full changelog: https://github.com/oleksiijko/pmb/blob/main/CHANGELOG.md ·
v0.8.0...v0.9.0