v0.2.3
Highlights
Quality parity with MLX achieved. Lattice's WikiText-2 perplexity on Qwen3.5-0.8B is now within 0.029 PPL of MLX (15.89 vs 15.86), down from a 0.77 PPL gap in v0.2.1. The remaining gap is within the f32↔bf16 numerical-precision band.
Three independent quality fixes landed this cycle:
-
RoPE pairing — stride-half, not interleaved (#96). Apply_partial_rope was rotating consecutive pairs
(2i, 2i+1)— the GPT-J convention. Qwen3.5 is trained with stride-half pairs(i, half+i)— HF transformersrotate_half, MLXnn.RoPE(traditional=False). The comment "Qwen3.5 uses mrope_interleaved=true" was misread: that config field controls multimodal-position section interleaving (M-RoPE for video/image tokens), not the 1-D text pairing convention. Verified against MLX with max-diff 8e-6 (stride-half) vs 67.5 (interleaved). PPL: 16.62 → 15.89. -
FP16 lm_head for tied-embedding Q8 path. Lattice's per-row symmetric Q8 quantization of the embedding produced sharper logits than MLX (max 25.5 vs 11.1). Switching the tied lm_head to use the FP16 embedding buffer already loaded for embedding lookup brings per-position distributions in line.
-
Asymmetric Q4 quantization. Q4Block was symmetric-only (
scale = abs_max/7). Added asymmetric mode (scale + bias,scale = (max - min)/15). 0.77 PPL improvement on unrotated Q4. Critically, asymmetric BREAKS QuaRot (Hadamard rotation zero-centers weights, so bias adds noise) — QuaRot paths keep symmetric quantization.
WikiText-2 Perplexity (Qwen3.5-0.8B, window=512, stride=256)
| Engine | Quant | PPL | Δ vs MLX gold |
|---|---|---|---|
| MLX | FP16 / Q8 g64 | 15.86 | (reference) |
| Lattice | FP16 / Q8 (auto-quant) | 15.89 | +0.029 |
| MLX | Q4 g64 | 18.18 | +2.32 |
| Lattice | Q4 asymmetric | 19.27 | +3.41 |
What's New
Inference
- XGrammar structured output engine (ADR-046) — grammar-constrained decoding with token-level acceptance masks
- Continuous batching with chunked prefill (ADR-048) — multi-sequence serving with prefix sharing
- Prefix-shared paged KV cache (ADR-047) — IndexMap-backed LRU with O(1) page reuse across sequences
- MoE Metal dispatch with expert coalescing (ADR-053) — top-k expert routing on GPU
- Vision encoder module (ADR-049) — ViT path for Qwen3-VL (text + image)
- Self-speculative decoding via GDN draft heads — uses the model's own linear-attention layers as draft model
- Probabilistic rejection sampling for speculative decoding (ADR-050) — accept/reject with target-model verification probabilities
- MTP on QuaRot Q4 via counter-rotation — multi-token prediction now works alongside rotated quantization
Quality & Numerics
- RoPE stride-half pairing (#96) — closes 0.74 PPL gap
- FP16 lm_head for tied-embedding Q8 path — removes per-row Q8 lm_head distortion
- Asymmetric Q4 quantization — 0.77 PPL improvement on unrotated Q4
Performance
- CPU SIMD optimizations + parity regression tests (#80)
- Embed SIMD optimizations + correctness hardening (#79)
- Metal GPU: zero-copy decode + MLP fusion (#81)
- CI perf regression tracking (ADR-058, #83) —
bench-regression.ymlgates PRs touching CPU kernel paths againstperf-baselinesbranch (>7% CI-lower-bound regression blocks merge)
Fine-tuning
- LoRA full-lifecycle consumer API (ADR-057 D1-D5, #65) — load, save_peft_safetensors, online single-event SGD via
adapt_step, module validation, consumer docs - Adam/AdamW optimizer + LoRA gradient computation (#90) — full backward pass for low-rank adapters
Transport
- Online drift detection via Sinkhorn (ADR-055) — Wasserstein distance between rolling distributions
Bench & Tooling
- Reproducible bench scripts:
bench-compare.sh(origin/main vs HEAD A/B),bench_apples_to_apples.sh,bench_apples_precise.sh,bench_q4_apples.sh,bench_quality.sh - Logit comparison harness:
scripts/compare_logits.py— windowed PPL evaluation, per-position argmax agreement vs MLX - WikiText-2 fixture committed (
docs/bench_results/wiki.test.raw) for reproducible PPL bench - Codex review automation:
scripts/codex_review_pr76*.sh— adversarial review rounds with self-verification
Fixes
- RoPE pairing convention — stride-half not interleaved (#96)
- Metal GPU build repair + zero-copy decode + MLP fusion (#81)
- embed bench:
data()accessor after field privatization (#95) - kv_cache: replace
VecDequeLRU withIndexMapfor correct insertion-order semantics - batch: stop throttling decode by free GDN slots — running sequences already own theirs
- grammar: handle
advance()rejection, gate MTP bypass, remove unsafe impls - generate: close external-review functional gaps (#41)
- bench: correct decode-throughput methodology — honest e2e numbers (#40)
- inference: document and skip MTP emission for QuaRot rotation basis mismatch (#42)
- LoRA hook in batch prefill path (#43)
Internal
- ADR-046 XGrammar structured output engine
- ADR-047 Prefix-shared paged KV cache
- ADR-048 Continuous batching with chunked prefill
- ADR-049 Vision encoder module for Qwen3-VL
- ADR-050 Probabilistic rejection sampling
- ADR-053 MoE Metal dispatch with expert coalescing
- ADR-055 Online drift detection via Sinkhorn
- ADR-056 LoRA full-lifecycle consumer API
- ADR-057 LoRA full-lifecycle implementation (D1-D5)
- ADR-058 CPU perf regression tracking
Crates Published
lattice-inference0.2.3lattice-embed0.2.3lattice-fann0.2.3lattice-tune0.2.3lattice-transport0.2.3
Note on v0.2.2
v0.2.2 was published to crates.io on 2026-05-20 and yanked on 2026-05-25 — it shipped the broken interleaved RoPE pairing fixed in #96. v0.2.3 is the first release that includes the RoPE fix. Use 0.2.3 or later. The GitHub tag v0.2.2 remains for history but no crates.io install from it.
Diff Stats
129 files changed, 29,524 insertions(+), 1,012 deletions(-) since v0.2.1.