Skip to content

v0.2.5 — Embed Quality + Tokenizer Parity + HF Regression Gate

Choose a tag to compare

@ohdearquant ohdearquant released this 25 May 21:21
· 33 commits to main since this release

Highlights

Embed quality + tokenizer parity + permanent HF regression gate. This release closes the embedding quality gap surfaced by the embed-perf-quality show. Four production embedding models (BGE-small, multilingual-E5-small, all-MiniLM-L6-v2, paraphrase-multilingual-MiniLM-L12-v2) now match HuggingFace reference at cosine ≥ 0.9998 end-to-end across 5 measured inputs.

Tokenizer parity was lifted from 8/28 to 32/32 audit cases across all three tokenizer families. Embedding service gains role-aware prompts + cache key distinguishment + per-model pooling routing. A new HF-reference parity regression test is wired into make ci so future tokenization or forward-pass divergence is caught automatically.

Tokenizer parity (8/28 → 32/32 cases)

  • SentencePiece BOS/EOS injection from tokenizer.json post_processor TemplateProcessing (E5 family). Template IDs are authoritative when present; falls back to vocab-name guesses only when the template omits explicit IDs.
  • Qwen BPE EOS injection (id 151643) + GPT-4-style pre-tokenizer regex alignment (URL/punctuation edge cases).
  • WordPiece AddedToken metadata preservation + longest-match scan for special tokens appearing literally in input text.
  • WordPiece CJK character splitting via Hiragana/Katakana NFD voicing fold (58 entries). Mirrors HuggingFace BertNormalizer NFD + Mn-stripping for canonical kana. Closes the BGE Japanese UNK bug.
  • SentencePiece trailing-whitespace Metaspace handling. The normalize() loop no longer emits a trailing for whitespace-trailing input — matches HF's "▁ is the space before a word" semantics. Closes the E5 leading/trailing-ws regression.

Embed service

  • Role-aware prompts: embed_query() / embed_passage() apply E5 "passage: " and Qwen task-instruction prefixes before forward. embed() retains backwards-compatible Generic semantics.
  • Role-aware cache keys: EmbeddingRole { Query | Passage | Generic } distinguished via Blake3 hash tag injection. Identical text in different roles no longer collides in CachedEmbeddingService.
  • Per-model pooling: BertPooling { Mean | CLS } enum on BertModel. BGE-{small,base,large}-en-v1.5 → CLS; E5/MiniLM/paraphrase → Mean; Qwen → LastToken (already routed via QwenModel). L2 normalization stays post-pool for all paths.

Permanent HF parity regression gate

  • scripts/gen_embed_parity_goldens.py — one-shot HuggingFace golden generator (uv run --with transformers --with torch --with numpy --with sentencepiece).
  • crates/embed/tests/embed_parity_vs_hf.rs — Rust integration test that loads committed JSON fixtures + computes lattice embeddings + asserts cosine + max-abs-diff against HF reference.
  • Committed fixtures for 5 models × 5 inputs in crates/embed/tests/fixtures/embed_parity_v1/.
  • Wired into make ci via scripts/ci.sh.
Model Min cosine vs HF (5 inputs)
BAAI/bge-small-en-v1.5 0.999868
intfloat/multilingual-e5-small 0.999937
sentence-transformers/all-MiniLM-L6-v2 0.999899
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 0.999875
Qwen/Qwen3-Embedding-0.6B #[ignore] — forward-pass divergence under investigation, see #103

Bench-compare

No regression in untouched SIMD/forward kernels. The show touched only tokenizer code + BertPooling enum (single match-branch in BertModel::encode); SIMD code unchanged. Bench-compare against v0.2.4 was clean once measurement artifacts from cross-worktree builds were isolated.

Crates Published

  • lattice-inference 0.2.5
  • lattice-embed 0.2.5
  • lattice-fann 0.2.5
  • lattice-tune 0.2.5
  • lattice-transport 0.2.5

Follow-ups (deferred to subsequent releases)

  • #102 — SIMD throughput (quantization amortization, simsimd_comparison bench restore, NEON normalize target)
  • #103 — Qwen3-Embedding forward-pass divergence (0.948 cosine on whitespace input, 0.989 on tokens-match input)
  • #116 — Codex-review follow-ups from the show stack (edge cases the parity gate doesn't cover: AddedToken structured-field enforcement, Qwen \s+(?!\S) whitespace regex, WordPiece NFD combining-mark stripping, SentencePiece literal- distinction)

Diff Stats

Across 14 commits merged via #104 + 1 commit via #117:

  • crates/inference/src/tokenizer/{bpe,common,sentencepiece,wordpiece}.rs — tokenizer fixes
  • crates/inference/src/{lib.rs, model/bert.rs, pool.rs} — BertPooling enum + branch
  • crates/embed/src/{cache.rs, lib.rs, model.rs} — EmbeddingRole + role-aware cache keys
  • crates/embed/src/service/{cached.rs, mod.rs, native.rs, tests.rs} — service-level role threading + tests
  • crates/embed/tests/{embed_parity_vs_hf.rs, tokenizer_parity_e2e.rs, fixtures/embed_parity_v1/*.json} — regression gate
  • crates/inference/tests/audit_tokenizer_parity.rs — 32 audit cases (28 original + 4 added in fixes)
  • scripts/gen_embed_parity_goldens.py, scripts/ci.sh — generator + CI wiring