Release v0.3.0 — 160 tok/s decode restored (2× Ollama) · ohdearquant/lattice

v0.3.0 — Seed-Round Research Foundation + 160 tok/s Restored

Major release: 5 new ADR research features landed (Phase 1), embedding quality reached HF parity, and the Metal decode regression is fixed — 160 tok/s on Qwen3.5-0.8B Q8 (2× Ollama, MLX PPL parity).

Performance

fix(inference): wide f16 GEMV kernel restores 160 tok/s decode (#151) — Root-caused lm_head dispatching 151K threadgroups through NR=1 kernel. New gemv_decode_wide_f16 (NR=4) cuts to 38K TGs. Zero quality regression (PPL 20.60 vs MLX 20.67).
perf(embed): amortize int8/int4 query quantization + NEON normalize (#139) — Embedding search throughput improved via quantization amortization and ARM NEON-optimized normalization.

New Features (ADR-059–063 Phase 1)

feat(inference): AttentionKind enum for unified dispatch (#146, ADR-059) — Type-safe attention variant routing (GDN, GQA, NSA, Differential) replacing string matching.
feat(inference): ShortGPT block influence scoring (#133, ADR-060) — Layer importance scoring for structured pruning research.
feat(inference): MetricsMode + online softmax entropy (#129, ADR-061) — Per-token entropy and confidence metrics computed online during decode.
feat(inference): f16 KV cache infrastructure (#130, ADR-062) — Half-precision KV cache storage halving memory footprint. Infrastructure landed (Phase 1); wiring into default decode path planned for v0.3.1.
feat(inference): unified lattice CLI with chat + serve (#147, ADR-063) — Single lattice binary replacing separate chat_metal/serve binaries.

Embedding Quality

fix(embed): correct double-EOS and trailing-ws tokenization (#143) — Qwen3-Embedding forward parity with HuggingFace reference.
fix(inference): TemplateProcessing IDs authoritative for SP BOS/EOS (#117) — Fixes SentencePiece tokenization edge cases.
test(ci): HF tokenizer-parity gate in CI (#140) — Automated regression prevention for tokenizer drift.

Model Support

feat(quarot): Qwen3 rotation plan + required-tensor-names (#137) — QuaRot quantization support expanded to Qwen3 family.
feat(tune): validate LoRA adapter dims vs model config (#136) — Fail-fast on shape mismatch before GPU allocation.
feat(tune): Adam/AdamW optimizer + LoRA gradient computation (#90) — Training loop foundation for fine-tuning.

Correctness & Stability

fix(inference): ShardedQwenBacking drop order via ManuallyDrop (#135) — Prevents UAF on model unload.
fix(inference): clarify (1+γ) RMSNorm in golden-logit test (#134) — Test precision improved.
fix(ci): clippy --all-targets across all 5 crates (#144, #150) — 15 lints fixed, pre-commit tightened.

Documentation

docs(adr): ADR-059–063 seed-round research ADRs (#128) — Formal specifications for the 5 research features above.
docs(inference): MLX Metal kernel study + optimization roadmap (#141) — Competitive analysis guiding kernel development.
docs: consolidated models, attention variants & features reference (#142)

Tooling

ppl_metal binary — GPU-accelerated perplexity evaluation
pyproject.toml — Python dev dependencies tracked (mlx, datasets, numpy)
CI bench-regression uses --quick + --save-baseline for robustness

Measurements

Metric	v0.2.5	v0.3.0
Decode throughput (Qwen3.5-0.8B Q8)	133 tok/s	160 tok/s
PPL (wikitext-2, 2048 tok)	20.60	20.60 (MLX: 20.67)
Embedding tokenizer parity (HF)	30/32	32/32

f16 KV cache (2× memory reduction) infrastructure shipped; default-path activation planned for v0.3.1.

Full Changelog

v0.2.5...v0.3.0

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v0.3.0 — 160 tok/s decode restored (2× Ollama)

Choose a tag to compare

Sorry, something went wrong.