v0.9.0 - Anchor Engine: language packs become unnecessary

@oleksiijko oleksiijko released this 12 Jun 22:28
· 43 commits to main since this release
d22c3c8

PMB now understands many languages through the embedder instead of a
hand-written pack per language. The RU/UK packs are gone; recall, intent
detection and keyed-fact extraction ride one mechanism that transfers across
every language the embedder knows — and the cold path teaches itself the rest
from your own traffic.

✨ Highlights

  • Semantic Anchor Engine (SAE). Intent detection and keyed-fact extraction
    run on English semantic anchors, classified by margin against calibrated
    per-set thresholds (FPR ≤ 1%). The multilingual embedder projects any language
    next to the English exemplars, so "was sind meine Ziele" and "what are my
    open goals" hit the same anchor - no per-language data.
  • Anchor→Lexicon Distillation (ALD). The cold lexical path self-compiles
    from your traffic: the maintenance tick mines high-precision n-grams that
    co-fire with anchors into $PMB_HOME/lang/auto.yaml. A language you actually
    use gets faster over time, with zero configuration.
  • One mechanism, every language. ~2,000 lines of hand-written RU/UK lists
    deleted in favour of the embedder + anchors. Adding a language now usually
    requires no work at all.
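
To make "classified by margin against calibrated per-set thresholds" concrete, here is a minimal sketch. It is not PMB's implementation: the function names, the toy 2-D vectors, and the exact margin definition (best set score minus runner-up set score) are assumptions chosen for illustration, and a real embedder would supply the vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def classify_by_margin(query_vec, anchor_sets, thresholds):
    """Return the winning anchor-set name, or None if its margin is too small.

    Assumption: a set's score is the best cosine similarity to any of its
    exemplars, and the margin is the gap between the top two set scores.
    """
    scores = {
        name: max(cosine(query_vec, ex) for ex in exemplars)
        for name, exemplars in anchor_sets.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_name, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_name if best - runner_up >= thresholds[best_name] else None

def calibrate_threshold(negative_margins, max_fpr=0.01):
    """Pick a threshold so at most max_fpr of known negatives would fire."""
    if not negative_margins:
        raise ValueError("need at least one negative sample")
    ordered = sorted(negative_margins, reverse=True)
    allowed = int(max_fpr * len(ordered))  # negatives permitted to fire
    if allowed >= len(ordered):
        return ordered[-1]
    # Sit just above the first margin that must NOT fire.
    return math.nextafter(ordered[allowed], math.inf)

# Hypothetical anchor sets; real exemplars would be embedded English phrases.
anchor_sets = {"goals": [[1.0, 0.0]], "reminder": [[0.0, 1.0]]}
thresholds = {"goals": 0.2, "reminder": 0.2}
print(classify_by_margin([0.9, 0.1], anchor_sets, thresholds))
print(classify_by_margin([0.7, 0.7], anchor_sets, thresholds))
```

Under these assumptions, a query close to one set fires that anchor, while a query equidistant from two sets has zero margin and is rejected rather than guessed. Calibrating each set separately lets sets with noisier negatives carry a stricter threshold.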

⚠️ Breaking changes

  • packs/ru.yaml and packs/uk.yaml are deleted; no pack is active by
    default (_DEFAULT_ACTIVE = ()). The packs-off eval is now a blocking CI gate.
  • RU/UK recall is unaffected - the embedder carries it (verified byte-identical).
  • On a cold, daemon-less stdio path, RU/UK lexical matchers (first-person,
    self-intent, relation, negation, future-intent, general atomic extraction) no
    longer fire until ALD distils them from traffic. The warm-daemon path (the
    default) is unaffected; the anchor tier handles those.

✅ Verified

  • V1 recall: en/ru/uk top-1 = 1.00 (RU/UK byte-identical to the pack era).
  • Multilingual eval (101 queries): overall top-1 = 0.77, top-3 = 0.91;
    top-1 = 1.00 for en/fr/pt/ru.
  • Anchor classify latency: p50 ≈ 48 ms, p95 ≈ 81 ms.
  • Tests: 1308 passed / 4 skipped / 0 failed · eval gates 18 passed / 0 failed.
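
For readers who want to reproduce p50/p95 figures like the ones above on their own machine, the following is a generic timing sketch. It does not call PMB: `fn` is a stand-in for whatever classify call you wrap, and the nearest-rank percentile method is an assumption, since the notes do not say how the published numbers were computed.

```python
import time

def measure_latency_ms(fn, runs=200, warmup=20):
    """Time fn() repeatedly and return per-call latencies in milliseconds."""
    for _ in range(warmup):          # warm caches before recording
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples

def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct * n / 100)
    return ordered[rank - 1]

# Stand-in workload; replace with a call into your own classify path.
samples = measure_latency_ms(lambda: sum(range(1000)))
print(f"p50={percentile(samples, 50):.3f} ms  p95={percentile(samples, 95):.3f} ms")
```

Discarding warm-up calls matters here because the first few invocations typically include model loading and cache fills that would inflate the tail.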

🔎 Honest limits (0.9, not 1.0)

  • Non-English intents/extraction are warm-only (need the daemon); the cold
    path self-heals with use.
  • CJK (zh/ja) is weak on exact top-1 (strong in top-3); ALD covers
    space-delimited languages only.
  • The new hypothesis-margin keyed extraction ships default-off pending
    real-world precision data.
  • ALD's real-traffic self-healing rate is proven in tests, not yet measured in
    the wild.
  • The latency SLO was measured on a constrained box - re-measure on your hardware.
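
The space-delimited limit on ALD is easier to see with a toy version of n-gram mining. This is an illustrative sketch only: the function names, the support and precision cut-offs, and the `(text, anchor)` traffic shape are assumptions, and it does not reproduce the format of `$PMB_HOME/lang/auto.yaml`.

```python
from collections import Counter, defaultdict

def ngrams(text, max_n=3):
    """Yield word n-grams (n = 1..max_n) using whitespace tokenization."""
    tokens = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def mine_lexicon(traffic, min_support=3, min_precision=0.95, max_n=3):
    """Keep n-grams that almost always co-occur with one anchor.

    traffic: iterable of (text, anchor_name_or_None) pairs, where the anchor
    is whatever the warm tier fired for that text (hypothetical shape).
    """
    total = Counter()                  # messages containing each n-gram
    by_anchor = defaultdict(Counter)   # same count, split by fired anchor
    for text, anchor in traffic:
        for gram in set(ngrams(text, max_n)):
            total[gram] += 1
            if anchor is not None:
                by_anchor[anchor][gram] += 1
    lexicon = {}
    for anchor, counts in by_anchor.items():
        kept = sorted(
            g for g, c in counts.items()
            if c >= min_support and c / total[g] >= min_precision
        )
        if kept:
            lexicon[anchor] = kept
    return lexicon

traffic = [("was sind meine ziele", "goals")] * 3 + [("was ist das", None)] * 3
print(mine_lexicon(traffic))
```

In this toy run "meine ziele" survives because it only ever appears alongside the goals anchor, while "was" is dropped because half of its occurrences fire nothing. A sentence written without spaces, as in zh or ja, comes out of `split()` as a single token, so no reusable n-grams can be mined from it without a segmenter.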

⬆️ Upgrading

No action needed for English. For other languages: run the daemon (so the warm
anchor tier + ALD are active) and use PMB normally - the cold path fills in over
a few days. If recall is weak for your language, upgrade the embedder:

    pmb config set embedding.model BAAI/bge-m3 && pmb reindex

Details: https://github.com/oleksiijko/pmb/blob/main/docs/adding-a-language.md

Full changelog: https://github.com/oleksiijko/pmb/blob/main/CHANGELOG.md ·
v0.8.0...v0.9.0