Skip to content

v0.3.0 — 160 tok/s decode restored (2× Ollama)

Latest

Choose a tag to compare

@ohdearquant ohdearquant released this 31 May 13:53
· 13 commits to main since this release

v0.3.0 — Seed-Round Research Foundation + 160 tok/s Restored

Major release: 5 new ADR research features landed (Phase 1), embedding quality reached HF parity, and the Metal decode regression is fixed — 160 tok/s on Qwen3.5-0.8B Q8 (2× Ollama, MLX PPL parity).


Performance

  • fix(inference): wide f16 GEMV kernel restores 160 tok/s decode (#151) — Root-caused lm_head dispatching 151K threadgroups through NR=1 kernel. New gemv_decode_wide_f16 (NR=4) cuts to 38K TGs. Zero quality regression (PPL 20.60 vs MLX 20.67).
  • perf(embed): amortize int8/int4 query quantization + NEON normalize (#139) — Embedding search throughput improved via quantization amortization and ARM NEON-optimized normalization.

New Features (ADR-059–063 Phase 1)

  • feat(inference): AttentionKind enum for unified dispatch (#146, ADR-059) — Type-safe attention variant routing (GDN, GQA, NSA, Differential) replacing string matching.
  • feat(inference): ShortGPT block influence scoring (#133, ADR-060) — Layer importance scoring for structured pruning research.
  • feat(inference): MetricsMode + online softmax entropy (#129, ADR-061) — Per-token entropy and confidence metrics computed online during decode.
  • feat(inference): f16 KV cache infrastructure (#130, ADR-062) — Half-precision KV cache storage halving memory footprint. Infrastructure landed (Phase 1); wiring into default decode path planned for v0.3.1.
  • feat(inference): unified lattice CLI with chat + serve (#147, ADR-063) — Single lattice binary replacing separate chat_metal/serve binaries.

Embedding Quality

  • fix(embed): correct double-EOS and trailing-ws tokenization (#143) — Qwen3-Embedding forward parity with HuggingFace reference.
  • fix(inference): TemplateProcessing IDs authoritative for SP BOS/EOS (#117) — Fixes SentencePiece tokenization edge cases.
  • test(ci): HF tokenizer-parity gate in CI (#140) — Automated regression prevention for tokenizer drift.

Model Support

  • feat(quarot): Qwen3 rotation plan + required-tensor-names (#137) — QuaRot quantization support expanded to Qwen3 family.
  • feat(tune): validate LoRA adapter dims vs model config (#136) — Fail-fast on shape mismatch before GPU allocation.
  • feat(tune): Adam/AdamW optimizer + LoRA gradient computation (#90) — Training loop foundation for fine-tuning.

Correctness & Stability

  • fix(inference): ShardedQwenBacking drop order via ManuallyDrop (#135) — Prevents UAF on model unload.
  • fix(inference): clarify (1+γ) RMSNorm in golden-logit test (#134) — Test precision improved.
  • fix(ci): clippy --all-targets across all 5 crates (#144, #150) — 15 lints fixed, pre-commit tightened.

Documentation

  • docs(adr): ADR-059–063 seed-round research ADRs (#128) — Formal specifications for the 5 research features above.
  • docs(inference): MLX Metal kernel study + optimization roadmap (#141) — Competitive analysis guiding kernel development.
  • docs: consolidated models, attention variants & features reference (#142)

Tooling

  • ppl_metal binary — GPU-accelerated perplexity evaluation
  • pyproject.toml — Python dev dependencies tracked (mlx, datasets, numpy)
  • CI bench-regression uses --quick + --save-baseline for robustness

Measurements

Metric v0.2.5 v0.3.0
Decode throughput (Qwen3.5-0.8B Q8) 133 tok/s 160 tok/s
PPL (wikitext-2, 2048 tok) 20.60 20.60 (MLX: 20.67)
Embedding tokenizer parity (HF) 30/32 32/32

f16 KV cache (2× memory reduction) infrastructure shipped; default-path activation planned for v0.3.1.

Full Changelog

v0.2.5...v0.3.0