v0.3.0 — Seed-Round Research Foundation + 160 tok/s Restored
Major release: 5 new ADR research features landed (Phase 1), embedding quality reached HF parity, and the Metal decode regression is fixed — 160 tok/s on Qwen3.5-0.8B Q8 (2× Ollama, MLX PPL parity).
Performance
- fix(inference): wide f16 GEMV kernel restores 160 tok/s decode (#151) — Root-caused lm_head dispatching 151K threadgroups through NR=1 kernel. New `gemv_decode_wide_f16` (NR=4) cuts to 38K TGs. Zero quality regression (PPL 20.60 vs MLX 20.67).
- perf(embed): amortize int8/int4 query quantization + NEON normalize (#139) — Embedding search throughput improved via quantization amortization and ARM NEON-optimized normalization.
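
The threadgroup arithmetic behind the GEMV fix can be sketched as below. This is an illustration only, not the project's dispatch code: `threadgroups_for_rows` is a hypothetical helper, and the row count 151,936 is an assumed lm_head size chosen to be consistent with the "151K" figure.

```rust
/// Number of threadgroups needed when each threadgroup computes `nr` output rows.
/// Hypothetical helper for illustration; not taken from the project's source.
fn threadgroups_for_rows(rows: usize, nr: usize) -> usize {
    assert!(nr > 0, "rows per threadgroup must be non-zero");
    // Ceiling division so a partial final group still gets a threadgroup.
    (rows + nr - 1) / nr
}
```

With the assumed row count, `threadgroups_for_rows(151_936, 1)` gives 151,936 and `threadgroups_for_rows(151_936, 4)` gives 37,984, which matches the 151K to 38K reduction reported for NR=4.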
New Features (ADR-059–063 Phase 1)
- feat(inference): AttentionKind enum for unified dispatch (#146, ADR-059) — Type-safe attention variant routing (GDN, GQA, NSA, Differential) replacing string matching.
- feat(inference): ShortGPT block influence scoring (#133, ADR-060) — Layer importance scoring for structured pruning research.
- feat(inference): MetricsMode + online softmax entropy (#129, ADR-061) — Per-token entropy and confidence metrics computed online during decode.
- feat(inference): f16 KV cache infrastructure (#130, ADR-062) — Half-precision KV cache storage halving memory footprint. Infrastructure landed (Phase 1); wiring into default decode path planned for v0.3.1.
- feat(inference): unified lattice CLI with chat + serve (#147, ADR-063) — Single `lattice` binary replacing separate `chat_metal`/`serve` binaries.
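
For readers unfamiliar with online softmax entropy (the ADR-061 item above), here is a minimal single-pass sketch. It is not the project's `MetricsMode` implementation: the type and function names are hypothetical, and treating "confidence" as the maximum softmax probability is an assumption. The idea is to keep a running maximum plus two rescaled sums, so entropy falls out without materializing the probability vector.

```rust
/// Running state for single-pass softmax entropy (illustrative names only).
#[derive(Debug, Clone, Copy)]
struct OnlineEntropy {
    max: f32,  // running maximum logit m
    sum: f32,  // sum of exp(x_i - m)
    wsum: f32, // sum of (x_i - m) * exp(x_i - m)
}

impl OnlineEntropy {
    fn new() -> Self {
        Self { max: f32::NEG_INFINITY, sum: 0.0, wsum: 0.0 }
    }

    /// Fold one logit into the running state. Assumes finite logits;
    /// a -inf (masked) logit contributes zero probability and is skipped.
    fn push(&mut self, x: f32) {
        if x == f32::NEG_INFINITY {
            return;
        }
        if x > self.max {
            if self.sum > 0.0 {
                // Rebase both sums from the old maximum to the new one.
                let d = self.max - x; // <= 0
                let scale = d.exp();
                self.wsum = scale * (self.wsum + d * self.sum);
                self.sum *= scale;
            }
            self.max = x;
        }
        let z = x - self.max; // <= 0
        let e = z.exp();
        self.sum += e;
        if e > 0.0 {
            self.wsum += z * e;
        }
    }

    /// Shannon entropy in nats: H = ln(sum) - wsum / sum.
    /// Only meaningful after at least one logit has been pushed.
    fn entropy(&self) -> f32 {
        self.sum.ln() - self.wsum / self.sum
    }

    /// Maximum softmax probability: exp(m - m) / sum = 1 / sum.
    fn confidence(&self) -> f32 {
        1.0 / self.sum
    }
}

/// Convenience wrapper returning (entropy, confidence) for a logit slice.
fn softmax_entropy_online(logits: &[f32]) -> (f32, f32) {
    let mut state = OnlineEntropy::new();
    for &x in logits {
        state.push(x);
    }
    (state.entropy(), state.confidence())
}
```

For four equal logits this yields an entropy of ln 4 and a confidence of 0.25, as expected for a uniform distribution.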
Embedding Quality
- fix(embed): correct double-EOS and trailing-ws tokenization (#143) — Qwen3-Embedding forward parity with HuggingFace reference.
- fix(inference): TemplateProcessing IDs authoritative for SP BOS/EOS (#117) — Fixes SentencePiece tokenization edge cases.
- test(ci): HF tokenizer-parity gate in CI (#140) — Automated regression prevention for tokenizer drift.
Model Support
- feat(quarot): Qwen3 rotation plan + required-tensor-names (#137) — QuaRot quantization support expanded to Qwen3 family.
- feat(tune): validate LoRA adapter dims vs model config (#136) — Fail-fast on shape mismatch before GPU allocation.
- feat(tune): Adam/AdamW optimizer + LoRA gradient computation (#90) — Training loop foundation for fine-tuning.
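
As a reference point for the optimizer item above, a textbook AdamW step looks roughly like this. This is a generic sketch of the standard algorithm with decoupled weight decay, not the code from #90: the struct layout, field names, and default hyperparameters (`beta1 = 0.9`, `beta2 = 0.999`, `eps = 1e-8`) are assumptions.

```rust
/// Minimal AdamW state over a flat parameter vector (illustrative only).
struct AdamW {
    lr: f32,
    beta1: f32,
    beta2: f32,
    eps: f32,
    weight_decay: f32,
    t: i32,      // number of steps taken so far
    m: Vec<f32>, // first-moment estimate
    v: Vec<f32>, // second-moment estimate
}

impl AdamW {
    fn new(n: usize, lr: f32, weight_decay: f32) -> Self {
        Self {
            lr,
            beta1: 0.9,
            beta2: 0.999,
            eps: 1e-8,
            weight_decay,
            t: 0,
            m: vec![0.0; n],
            v: vec![0.0; n],
        }
    }

    /// Apply one update in place. Panics if the slice lengths disagree.
    fn step(&mut self, params: &mut [f32], grads: &[f32]) {
        assert_eq!(params.len(), grads.len());
        assert_eq!(params.len(), self.m.len());
        self.t += 1;
        let (lr, b1, b2, eps, wd) =
            (self.lr, self.beta1, self.beta2, self.eps, self.weight_decay);
        // Bias-correction denominators for the moment estimates.
        let bc1 = 1.0 - b1.powi(self.t);
        let bc2 = 1.0 - b2.powi(self.t);
        let moments = self.m.iter_mut().zip(self.v.iter_mut());
        for ((p, &g), (m, v)) in params.iter_mut().zip(grads).zip(moments) {
            *m = b1 * *m + (1.0 - b1) * g;
            *v = b2 * *v + (1.0 - b2) * g * g;
            let m_hat = *m / bc1;
            let v_hat = *v / bc2;
            // Decoupled weight decay: applied to the parameter, not the gradient.
            *p -= lr * (m_hat / (v_hat.sqrt() + eps) + wd * *p);
        }
    }
}
```

Setting `weight_decay` to 0 reduces this to plain Adam, which is one way a single code path could serve both optimizers.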
Correctness & Stability
- fix(inference): ShardedQwenBacking drop order via ManuallyDrop (#135) — Prevents a use-after-free (UAF) on model unload.
- fix(inference): clarify (1+γ) RMSNorm in golden-logit test (#134) — Test precision improved.
- fix(ci): clippy --all-targets across all 5 crates (#144, #150) — 15 lints fixed, pre-commit tightened.
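
The drop-order fix above relies on a general Rust pattern: struct fields are dropped in declaration order, and `std::mem::ManuallyDrop` lets a `Drop` impl choose a different order explicitly. The sketch below demonstrates that pattern with stand-in types. The actual fields of `ShardedQwenBacking` are not shown in these notes, so `Backing`, `mapping`, and `views` are hypothetical names.

```rust
use std::cell::RefCell;
use std::mem::ManuallyDrop;
use std::rc::Rc;

type DropLog = Rc<RefCell<Vec<&'static str>>>;

/// Stand-in resource that records its name when dropped.
struct Tracked {
    name: &'static str,
    log: DropLog,
}

impl Drop for Tracked {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Hypothetical backing store: `views` refer to memory owned by `mapping`,
/// so the views must be released before the mapping.
struct Backing {
    mapping: ManuallyDrop<Tracked>, // declared first, so it would drop first by default
    views: ManuallyDrop<Tracked>,
}

impl Drop for Backing {
    fn drop(&mut self) {
        // SAFETY: each field is dropped exactly once and never touched afterwards.
        unsafe {
            ManuallyDrop::drop(&mut self.views);
            ManuallyDrop::drop(&mut self.mapping);
        }
    }
}

/// Builds and drops a `Backing`, returning the observed drop order.
fn observed_drop_order() -> Vec<&'static str> {
    let log: DropLog = Rc::new(RefCell::new(Vec::new()));
    {
        let _backing = Backing {
            mapping: ManuallyDrop::new(Tracked { name: "mapping", log: log.clone() }),
            views: ManuallyDrop::new(Tracked { name: "views", log: log.clone() }),
        };
    } // `_backing` is dropped here, running `Backing::drop`.
    let order = log.borrow().clone();
    order
}
```

Reordering the field declarations would achieve the same effect, but the explicit `Drop` impl keeps the ordering requirement visible and robust against later field shuffles.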
Documentation
- docs(adr): ADR-059–063 seed-round research ADRs (#128) — Formal specifications for the 5 research features above.
- docs(inference): MLX Metal kernel study + optimization roadmap (#141) — Competitive analysis guiding kernel development.
- docs: consolidated models, attention variants & features reference (#142)
Tooling
- `ppl_metal` binary — GPU-accelerated perplexity evaluation
- `pyproject.toml` — Python dev dependencies tracked (mlx, datasets, numpy)
- CI bench-regression uses `--quick` + `--save-baseline` for robustness
Measurements
| Metric | v0.2.5 | v0.3.0 |
|---|---|---|
| Decode throughput (Qwen3.5-0.8B Q8) | 133 tok/s | 160 tok/s |
| PPL (wikitext-2, 2048 tok) | 20.60 | 20.60 (MLX: 20.67) |
| Embedding tokenizer parity (HF) | 30/32 | 32/32 |
f16 KV cache (2× memory reduction) infrastructure shipped; default-path activation planned for v0.3.1.
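
The 2× figure follows directly from element width. A rough sizing sketch, assuming a conventional per-layer K and V cache and an f32 baseline: `kv_cache_bytes` is a hypothetical helper, and the dimensions in the usage note are made-up illustration values, not the configuration of any model named in these notes.

```rust
/// Bytes needed to cache keys and values for `seq_len` tokens.
/// Formula: 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_elem.
/// Hypothetical helper for illustration; not the project's API.
fn kv_cache_bytes(
    layers: usize,
    kv_heads: usize,
    head_dim: usize,
    seq_len: usize,
    bytes_per_elem: usize,
) -> usize {
    2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
}
```

For an assumed shape of 24 layers, 2 KV heads, head dimension 128, and 2048 tokens, `kv_cache_bytes(24, 2, 128, 2048, 4)` is 100,663,296 bytes (96 MiB) at f32, and the same call with 2 bytes per element is 50,331,648 bytes (48 MiB) at f16.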