Release v0.2.3 · ohdearquant/lattice

Highlights

Quality parity with MLX achieved. Lattice's WikiText-2 perplexity on Qwen3.5-0.8B is now within 0.029 PPL of MLX (15.89 vs 15.86), down from a 0.77 PPL gap in v0.2.1. The remaining gap is within the f32↔bf16 numerical-precision band.

Three independent quality fixes landed this cycle:

RoPE pairing: stride-half, not interleaved (#96). apply_partial_rope was rotating consecutive pairs (2i, 2i+1), which is the GPT-J convention. Qwen3.5 is trained with stride-half pairs (i, half+i), matching HF transformers rotate_half and MLX nn.RoPE(traditional=False). The comment "Qwen3.5 uses mrope_interleaved=true" had been misread: that config field controls multimodal-position section interleaving (M-RoPE for video/image tokens), not the 1-D text pairing convention. Verified against MLX: max-diff 8e-6 with stride-half vs 67.5 with interleaved. PPL: 16.62 → 15.89.
FP16 lm_head for tied-embedding Q8 path. Lattice's per-row symmetric Q8 quantization of the embedding produced sharper logits than MLX (max 25.5 vs 11.1). Switching the tied lm_head to use the FP16 embedding buffer already loaded for embedding lookup brings per-position distributions in line.
Asymmetric Q4 quantization. Q4Block was symmetric-only (scale = abs_max/7). This release adds an asymmetric mode (scale + bias, scale = (max - min)/15), which gives a 0.77 PPL improvement on unrotated Q4. Critically, asymmetric mode breaks QuaRot: Hadamard rotation zero-centers the weights, so the bias only adds noise. QuaRot paths therefore keep symmetric quantization.
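
The difference between the two RoPE pairing conventions can be sketched in a few lines. This is a minimal pure-Python illustration, not Lattice's apply_partial_rope: the function names, the frequency base of 10000, and the toy dimensions are assumptions made for the example.

```python
import math

def rope_angles(pos, rot_dim, base=10000.0):
    """Rotation angle for each of the rot_dim // 2 frequency pairs at position pos."""
    half = rot_dim // 2
    return [pos * base ** (-2.0 * i / rot_dim) for i in range(half)]

def rope_stride_half(x, pos, rot_dim, base=10000.0):
    """Rotate pairs (i, half + i): the rotate_half convention."""
    half = rot_dim // 2
    out = list(x)  # dims >= rot_dim pass through unchanged (partial RoPE)
    for i, theta in enumerate(rope_angles(pos, rot_dim, base)):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[half + i]
        out[i] = a * c - b * s
        out[half + i] = a * s + b * c
    return out

def rope_interleaved(x, pos, rot_dim, base=10000.0):
    """Rotate consecutive pairs (2i, 2i + 1): the GPT-J convention."""
    out = list(x)  # dims >= rot_dim pass through unchanged (partial RoPE)
    for i, theta in enumerate(rope_angles(pos, rot_dim, base)):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out
```

Both functions use the same angles and both preserve the vector norm. They differ only in which coordinates are paired, so at any nonzero position they produce different outputs from the same input.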

WikiText-2 Perplexity (Qwen3.5-0.8B, window=512, stride=256)

Engine	Quant	PPL	Δ vs MLX gold
MLX	FP16 / Q8 g64	15.86	(reference)
Lattice	FP16 / Q8 (auto-quant)	15.89	+0.029
MLX	Q4 g64	18.18	+2.32
Lattice	Q4 asymmetric	19.27	+3.41
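
For readers reproducing these numbers, the sketch below shows one common reading of window=512, stride=256: each window feeds up to 512 tokens of context, and only the positions not already scored by an earlier window count toward the average. Whether scripts/compare_logits.py scores windows exactly this way is an assumption, and the helper names are hypothetical.

```python
import math

def window_spans(n_tokens, window=512, stride=256):
    """Return (begin, end, score_from) triples.

    Feed tokens[begin:end] to the model and score positions
    score_from..end, so every position is scored exactly once.
    """
    spans = []
    prev_end = 0
    for begin in range(0, n_tokens, stride):
        end = min(begin + window, n_tokens)
        spans.append((begin, end, prev_end))
        prev_end = end
        if end == n_tokens:
            break
    return spans

def perplexity(token_nlls):
    """exp of the mean negative log-likelihood (natural log) over scored positions."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```

With these settings every window after the first scores 256 new positions with at least 256 tokens of preceding context, which keeps the comparison between engines independent of how each one chunks its input.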

What's New

Inference

XGrammar structured output engine (ADR-046) — grammar-constrained decoding with token-level acceptance masks
Continuous batching with chunked prefill (ADR-048) — multi-sequence serving with prefix sharing
Prefix-shared paged KV cache (ADR-047) — IndexMap-backed LRU with O(1) page reuse across sequences
MoE Metal dispatch with expert coalescing (ADR-053) — top-k expert routing on GPU
Vision encoder module (ADR-049) — ViT path for Qwen3-VL (text + image)
Self-speculative decoding via GDN draft heads — uses the model's own linear-attention layers as draft model
Probabilistic rejection sampling for speculative decoding (ADR-050) — accept/reject with target-model verification probabilities
MTP on QuaRot Q4 via counter-rotation — multi-token prediction now works alongside rotated quantization
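
The accept/reject step behind probabilistic rejection sampling can be sketched as below. This is the standard speculative-sampling rule written in pure Python, not the ADR-050 implementation: the function name and the list-based probability representation are assumptions for illustration.

```python
import random

def accept_or_resample(draft_token, p_target, q_draft, rng=random):
    """Return (token, accepted) for one drafted token.

    p_target and q_draft are probability lists over the same vocabulary;
    draft_token is assumed to have been sampled from q_draft.
    """
    p, q = p_target[draft_token], q_draft[draft_token]
    # Accept the draft with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True
    # Rejected: resample from the residual max(0, p - q), renormalized
    # implicitly by random.choices.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    weights = residual if sum(residual) > 0 else p_target
    return rng.choices(range(len(p_target)), weights=weights)[0], False
```

The useful property of this rule is that the emitted token is distributed according to p_target regardless of how good the draft distribution is. A poor draft only lowers the acceptance rate, not the output quality.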

Quality & Numerics

RoPE stride-half pairing (#96) — closes 0.74 PPL gap
FP16 lm_head for tied-embedding Q8 path — removes per-row Q8 lm_head distortion
Asymmetric Q4 quantization — 0.77 PPL improvement on unrotated Q4
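
The two Q4 modes can be compared with a small quantize-then-dequantize sketch. It uses the scale formulas from these notes (abs_max/7 and (max - min)/15); taking the bias to be the block minimum, and the function names themselves, are assumptions, and the real Q4Block layout is not modeled.

```python
def dequant_symmetric(block):
    """Symmetric Q4 round trip: codes in [-7, 7], scale = abs_max / 7."""
    abs_max = max(abs(v) for v in block)
    scale = abs_max / 7 if abs_max > 0 else 1.0
    return [max(-7, min(7, round(v / scale))) * scale for v in block]

def dequant_asymmetric(block):
    """Asymmetric Q4 round trip: codes in [0, 15], scale = (max - min) / 15.

    The bias is assumed to be the block minimum.
    """
    lo, hi = min(block), max(block)
    scale = (hi - lo) / 15 if hi > lo else 1.0
    return [max(0, min(15, round((v - lo) / scale))) * scale + lo for v in block]

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
```

On a block whose values sit entirely on one side of zero, the symmetric mode wastes the negative half of its code range while the asymmetric mode uses all 16 levels. On a zero-centered block the two ranges nearly coincide, which is consistent with the note above that the bias brings no benefit after Hadamard rotation.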

Performance

CPU SIMD optimizations + parity regression tests (#80)
Embed SIMD optimizations + correctness hardening (#79)
Metal GPU: zero-copy decode + MLP fusion (#81)
CI perf regression tracking (ADR-058, #83) — bench-regression.yml gates PRs touching CPU kernel paths against perf-baselines branch (>7% CI-lower-bound regression blocks merge)

Fine-tuning

LoRA full-lifecycle consumer API (ADR-057 D1-D5, #65) — load, save_peft_safetensors, online single-event SGD via adapt_step, module validation, consumer docs
Adam/AdamW optimizer + LoRA gradient computation (#90) — full backward pass for low-rank adapters
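
As a rough picture of what a single-event LoRA update involves, the sketch below runs one plain SGD step on a squared-error loss for a single linear layer. It is not the adapt_step API: the function names, the loss, and the list-of-lists matrices are assumptions chosen to keep the example dependency-free.

```python
def matvec(M, v):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lora_forward(W, A, B, x, scale):
    """y = W x + scale * B (A x); W is frozen, A (r x in) and B (out x r) train."""
    h = matvec(A, x)
    base, delta = matvec(W, x), matvec(B, h)
    return [b + scale * d for b, d in zip(base, delta)], h

def lora_sgd_step(W, A, B, x, target, lr, scale):
    """One SGD step on 0.5 * ||y - target||^2.

    Updates A and B in place and returns the loss before the step.
    """
    y, h = lora_forward(W, A, B, x, scale)
    g = [yi - ti for yi, ti in zip(y, target)]  # dL/dy
    # dL/dh = scale * B^T g, computed before B is modified.
    gh = [scale * sum(B[o][k] * g[o] for o in range(len(g))) for k in range(len(h))]
    for o in range(len(B)):
        for k in range(len(h)):
            B[o][k] -= lr * scale * g[o] * h[k]  # dL/dB = scale * g h^T
    for k in range(len(A)):
        for i in range(len(x)):
            A[k][i] -= lr * gh[k] * x[i]  # dL/dA = (scale * B^T g) x^T
    return 0.5 * sum(gi * gi for gi in g)
```

With B initialized to zero, as is conventional for LoRA, the first step moves only B because the gradient with respect to A is zero; A starts to move from the second step onward. The base weights W are never touched.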

Transport

Online drift detection via Sinkhorn (ADR-055) — Wasserstein distance between rolling distributions
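
The idea of scoring drift with Sinkhorn iterations can be illustrated on two small histograms. This is a textbook entropic-regularized optimal-transport sketch, not the lattice-transport API: the function name, the regularization strength eps, the iteration count, and the histogram representation of a rolling window are all assumptions.

```python
import math

def sinkhorn_cost(a, b, cost, eps=0.1, iters=500):
    """Transport cost of the entropic-regularized plan between histograms a and b.

    a and b are probability lists; cost[i][j] is the cost of moving mass
    from bin i of a to bin j of b.
    """
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return sum(u[i] * K[i][j] * v[j] * cost[i][j] for i in range(n) for j in range(m))
```

A drift detector built on this would compare a reference window against the most recent window and raise an alert when the cost crosses a threshold. Note that the regularized cost of a distribution against itself is small but not exactly zero, so any threshold needs to be calibrated against that baseline.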

Bench & Tooling

Reproducible bench scripts: bench-compare.sh (origin/main vs HEAD A/B), bench_apples_to_apples.sh, bench_apples_precise.sh, bench_q4_apples.sh, bench_quality.sh
Logit comparison harness: scripts/compare_logits.py — windowed PPL evaluation, per-position argmax agreement vs MLX
WikiText-2 fixture committed (docs/bench_results/wiki.test.raw) for reproducible PPL bench
Codex review automation: scripts/codex_review_pr76*.sh — adversarial review rounds with self-verification

Fixes

RoPE pairing convention — stride-half not interleaved (#96)
Metal GPU build repair + zero-copy decode + MLP fusion (#81)
embed bench: data() accessor after field privatization (#95)
kv_cache: replace VecDeque LRU with IndexMap for correct insertion-order semantics
batch: stop throttling decode by free GDN slots — running sequences already own theirs
grammar: handle advance() rejection, gate MTP bypass, remove unsafe impls
generate: close external-review functional gaps (#41)
bench: correct decode-throughput methodology — honest e2e numbers (#40)
inference: document and skip MTP emission for QuaRot rotation basis mismatch (#42)
LoRA hook in batch prefill path (#43)

Internal

ADR-046 XGrammar structured output engine
ADR-047 Prefix-shared paged KV cache
ADR-048 Continuous batching with chunked prefill
ADR-049 Vision encoder module for Qwen3-VL
ADR-050 Probabilistic rejection sampling
ADR-053 MoE Metal dispatch with expert coalescing
ADR-055 Online drift detection via Sinkhorn
ADR-056 LoRA full-lifecycle consumer API
ADR-057 LoRA full-lifecycle implementation (D1-D5)
ADR-058 CPU perf regression tracking

Crates Published

lattice-inference 0.2.3
lattice-embed 0.2.3
lattice-fann 0.2.3
lattice-tune 0.2.3
lattice-transport 0.2.3

Note on v0.2.2

v0.2.2 was published to crates.io on 2026-05-20 and yanked on 2026-05-25 because it shipped the broken interleaved RoPE pairing fixed in #96. v0.2.3 is the first release that includes the RoPE fix. Use 0.2.3 or later. The GitHub tag v0.2.2 remains for history, but the corresponding crates.io release is yanked.

Diff Stats

129 files changed, 29,524 insertions(+), 1,012 deletions(-) since v0.2.1.
