v0.2.5 — Embed Quality + Tokenizer Parity + HF Regression Gate
Highlights
Embed quality + tokenizer parity + permanent HF regression gate. This release closes the embedding quality gap surfaced by the embed-perf-quality show. Four production embedding models (BGE-small, multilingual-E5-small, all-MiniLM-L6-v2, paraphrase-multilingual-MiniLM-L12-v2) now match the HuggingFace reference at cosine ≥ 0.9998 end-to-end on each of 5 measured inputs.
Tokenizer parity rose from 8/28 to 32/32 audit cases (28 original + 4 added during the fixes) across all three tokenizer families (SentencePiece, BPE, WordPiece). The embedding service gains role-aware prompts, role-distinguished cache keys, and per-model pooling routing. A new HF-reference parity regression test is wired into `make ci`, so future tokenization or forward-pass divergence is caught automatically.
Tokenizer parity (8/28 → 32/32 cases)
- SentencePiece BOS/EOS injection from `tokenizer.json` `post_processor` `TemplateProcessing` (E5 family). Template IDs are authoritative when present; falls back to vocab-name guesses only when the template omits explicit IDs.
- Qwen BPE EOS injection (id `151643`) + GPT-4-style pre-tokenizer regex alignment (URL/punctuation edge cases).
- WordPiece AddedToken metadata preservation + longest-match scan for special tokens appearing literally in input text.
- WordPiece CJK character splitting via Hiragana/Katakana NFD voicing fold (58 entries). Mirrors HuggingFace `BertNormalizer` NFD + Mn-stripping for canonical kana. Closes the BGE Japanese UNK bug.
- SentencePiece trailing-whitespace Metaspace handling. The `normalize()` loop no longer emits a trailing `▁` for whitespace-trailing input, which matches HF's "▁ is the space before a word" semantics. Closes the E5 leading/trailing-ws regression.
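
The Metaspace rule in the last bullet can be illustrated with a minimal sketch. `metaspace_sketch` is a hypothetical helper, not the crate's actual `normalize()`, and it assumes that whitespace runs collapse to a single boundary, which these notes do not state.

```rust
/// Hypothetical helper (not the crate's `normalize()`): illustrates the
/// "▁ is the space before a word" rule for SentencePiece-style input.
///
/// Assumption: runs of whitespace collapse to a single word boundary.
pub fn metaspace_sketch(input: &str) -> String {
    const META: char = '\u{2581}'; // "▁", the Metaspace marker

    let mut out = String::with_capacity(input.len() + 4);
    // `split_whitespace` skips leading, trailing, and repeated whitespace,
    // so a marker is only ever emitted in front of an actual word.
    for word in input.split_whitespace() {
        out.push(META);
        out.push_str(word);
    }
    out
}
```

Because the marker is attached to the word that follows it, input that ends in whitespace cannot produce a dangling `▁`; `"hello world  "` and `"hello world"` normalize to the same string.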
Embed service
- Role-aware prompts: `embed_query()` / `embed_passage()` apply E5 `"passage: "` and Qwen task-instruction prefixes before the forward pass. `embed()` retains backwards-compatible Generic semantics.
- Role-aware cache keys: `EmbeddingRole { Query | Passage | Generic }` distinguished via Blake3 hash tag injection. Identical text in different roles no longer collides in `CachedEmbeddingService`.
- Per-model pooling: `BertPooling { Mean | CLS }` enum on `BertModel`. BGE-{small,base,large}-en-v1.5 → CLS; E5/MiniLM/paraphrase → Mean; Qwen → LastToken (already routed via `QwenModel`). L2 normalization stays post-pool for all paths.
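
A rough sketch of how a role tag can keep cache keys apart is shown here. The `EmbeddingRole` variants come from these notes; everything else (`tag`, `e5_prefix`, `cache_key`, the tag bytes) is hypothetical. The standard library's `DefaultHasher` stands in for Blake3 so the example needs no external crate, and the `"query: "` prefix follows the E5 model card convention rather than anything confirmed about this service.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum EmbeddingRole {
    Query,
    Passage,
    Generic,
}

impl EmbeddingRole {
    /// Assumed tag bytes; the real tag values are not given in these notes.
    fn tag(self) -> &'static [u8] {
        match self {
            EmbeddingRole::Query => b"role:query\0",
            EmbeddingRole::Passage => b"role:passage\0",
            EmbeddingRole::Generic => b"role:generic\0",
        }
    }
}

/// Hypothetical E5-style prompt prefix per role (Generic adds nothing).
pub fn e5_prefix(role: EmbeddingRole) -> &'static str {
    match role {
        EmbeddingRole::Query => "query: ",
        EmbeddingRole::Passage => "passage: ",
        EmbeddingRole::Generic => "",
    }
}

/// Hypothetical cache key: the role tag is mixed into the hash so that the
/// same text under different roles maps to different keys.
/// `DefaultHasher` is a stand-in for Blake3 here.
pub fn cache_key(model: &str, role: EmbeddingRole, text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    model.hash(&mut hasher);
    hasher.write(role.tag());
    text.hash(&mut hasher);
    hasher.finish()
}
```

Hashing the tag rather than the prefixed text keeps the key independent of the exact prompt wording, so a prefix change would not silently alias old and new cache entries under this scheme.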
Permanent HF parity regression gate
- `scripts/gen_embed_parity_goldens.py`: one-shot HuggingFace golden generator (`uv run --with transformers --with torch --with numpy --with sentencepiece`).
- `crates/embed/tests/embed_parity_vs_hf.rs`: Rust integration test that loads committed JSON fixtures, computes lattice embeddings, and asserts cosine + max-abs-diff against the HF reference.
- Committed fixtures for 5 models × 5 inputs in `crates/embed/tests/fixtures/embed_parity_v1/`.
- Wired into `make ci` via `scripts/ci.sh`.
| Model | Min cosine vs HF (5 inputs) |
|---|---|
| `BAAI/bge-small-en-v1.5` | 0.999868 |
| `intfloat/multilingual-e5-small` | 0.999937 |
| `sentence-transformers/all-MiniLM-L6-v2` | 0.999899 |
| `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` | 0.999875 |
| `Qwen/Qwen3-Embedding-0.6B` | `#[ignore]`: forward-pass divergence under investigation, see #103 |
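
The two metrics the gate asserts on, cosine similarity and max-abs-diff, can be sketched as follows. This is not the code in `embed_parity_vs_hf.rs`; `cosine`, `max_abs_diff`, and `check_parity` are hypothetical names, and the thresholds passed by a caller are illustrative.

```rust
/// Cosine similarity of two equal-length vectors, accumulated in f64.
pub fn cosine(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "dimension mismatch");
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    // Yields NaN if either vector is all zeros; the gate treats that as failure.
    dot / (norm_a.sqrt() * norm_b.sqrt())
}

/// Largest element-wise absolute difference between two vectors.
pub fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "dimension mismatch");
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}

/// Hypothetical gate: fails when cosine drops below `min_cos` or any single
/// component drifts by more than `max_diff`.
pub fn check_parity(
    ours: &[f32],
    golden: &[f32],
    min_cos: f64,
    max_diff: f32,
) -> Result<(), String> {
    let cos = cosine(ours, golden);
    // Written as a negated `>=` so that a NaN cosine is rejected too.
    if !(cos >= min_cos) {
        return Err(format!("cosine {cos:.6} below threshold {min_cos}"));
    }
    let diff = max_abs_diff(ours, golden);
    if diff > max_diff {
        return Err(format!("max-abs-diff {diff:e} above threshold {max_diff:e}"));
    }
    Ok(())
}
```

Checking both metrics is useful because cosine is insensitive to a uniform scale error, while max-abs-diff catches a single badly diverging component that an averaged score can hide.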
Bench-compare
No regression in untouched SIMD/forward kernels. The show touched only tokenizer code and the `BertPooling` enum (a single match branch in `BertModel::encode`); SIMD code is unchanged. Bench-compare against v0.2.4 was clean once measurement artifacts from cross-worktree builds were isolated.
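
To show why a pooling switch of this kind stays out of the kernel hot path, here is a minimal sketch of a single match over a pooling mode followed by L2 normalization. It is not the body of `BertModel::encode`; the `pool` function, its row-per-token layout, and the `u8` attention mask are assumptions made for illustration.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum BertPooling {
    Mean,
    CLS,
}

/// Hypothetical pooling step. `hidden` holds one row per token with the
/// [CLS] token first; `mask` is 1 for real tokens and 0 for padding.
pub fn pool(hidden: &[Vec<f32>], mask: &[u8], mode: BertPooling) -> Vec<f32> {
    let dim = hidden.first().map_or(0, |row| row.len());
    let mut out = vec![0.0f32; dim];

    match mode {
        // CLS pooling: take the first token's hidden state as-is.
        BertPooling::CLS => {
            if let Some(first) = hidden.first() {
                out.copy_from_slice(first);
            }
        }
        // Mean pooling: average the hidden states of unmasked tokens.
        BertPooling::Mean => {
            let mut count = 0.0f32;
            for (row, &m) in hidden.iter().zip(mask) {
                if m != 0 {
                    for (o, v) in out.iter_mut().zip(row) {
                        *o += v;
                    }
                    count += 1.0;
                }
            }
            if count > 0.0 {
                for o in &mut out {
                    *o /= count;
                }
            }
        }
    }

    // L2 normalization runs after pooling on every path.
    let norm = out.iter().map(|v| v * v).sum::<f32>().sqrt();
    if norm > 0.0 {
        for o in &mut out {
            *o /= norm;
        }
    }
    out
}
```

The branch executes once per sequence, after the per-token forward pass has finished, which is consistent with the kernels themselves being untouched.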
Crates Published
- `lattice-inference` 0.2.5
- `lattice-embed` 0.2.5
- `lattice-fann` 0.2.5
- `lattice-tune` 0.2.5
- `lattice-transport` 0.2.5
Follow-ups (deferred to subsequent releases)
- #102: SIMD throughput (quantization amortization, simsimd_comparison bench restore, NEON normalize target)
- #103: Qwen3-Embedding forward-pass divergence (0.948 cosine on whitespace input, 0.989 on tokens-match input)
- #116: Codex-review follow-ups from the show stack (edge cases the parity gate doesn't cover: AddedToken structured-field enforcement, Qwen `\s+(?!\S)` whitespace regex, WordPiece NFD combining-mark stripping, SentencePiece literal-`▁` distinction)
Diff Stats
Across 14 commits merged via #104 + 1 commit via #117:
- `crates/inference/src/tokenizer/{bpe,common,sentencepiece,wordpiece}.rs`: tokenizer fixes
- `crates/inference/src/{lib.rs, model/bert.rs, pool.rs}`: BertPooling enum + branch
- `crates/embed/src/{cache.rs, lib.rs, model.rs}`: EmbeddingRole + role-aware cache keys
- `crates/embed/src/service/{cached.rs, mod.rs, native.rs, tests.rs}`: service-level role threading + tests
- `crates/embed/tests/{embed_parity_vs_hf.rs, tokenizer_parity_e2e.rs, fixtures/embed_parity_v1/*.json}`: regression gate
- `crates/inference/tests/audit_tokenizer_parity.rs`: 32 audit cases (28 original + 4 added in fixes)
- `scripts/gen_embed_parity_goldens.py`, `scripts/ci.sh`: generator + CI wiring