0.4.5.dev1

Pre-release

@jundot released this 28 Jun 17:18

This development release focuses on major prefill speedups for GLM-5.2 and MiniMax M3 through custom kernels, API-visible model presets/profiles, and VLM/cache hardening after 0.4.4.

Highlights

  • Major prefill speedups for GLM-5.2 and MiniMax M3 through custom kernels. oMLX now includes GLM MoE DSA / Sparse MLA native kernels and MiniMax M3 sparse-attention acceleration, with build support in the Python package and macOS app. (#1984)
  • API-visible model profiles and refreshed global presets. Profiles can now be exposed as <model>:<profile> or <alias>:<profile> in /v1/models and served through the same loaded engine, while the built-in presets now include MiniMax-M3 and GLM-5.2. by @pablomoralesm in #1838
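As a rough illustration of the `<model>:<profile>` naming described above, a client could split the ids returned by `/v1/models` into a base model (or alias) and an optional profile. This is a minimal sketch, not oMLX code: the helper names, the sample ids, and the profile name `coding` are hypothetical, and it assumes the profile is whatever follows the last `:` in the id. The payload shape follows the usual OpenAI-compatible list response (`{"object": "list", "data": [{"id": ...}]}`).

```python
def split_profile_id(model_id):
    """Split '<model>:<profile>' into (model, profile).

    Assumption: the profile is the part after the LAST colon.
    Ids without a colon are returned as (model_id, None).
    """
    base, sep, profile = model_id.rpartition(":")
    if not sep or not base or not profile:
        return model_id, None
    return base, profile


def profiles_by_model(models_payload):
    """Group profile names under their base model or alias."""
    grouped = {}
    for entry in models_payload.get("data", []):
        base, profile = split_profile_id(entry["id"])
        names = grouped.setdefault(base, [])
        if profile is not None:
            names.append(profile)
    return grouped


# Hypothetical /v1/models payload; the ids are made up for illustration.
sample_payload = {
    "object": "list",
    "data": [
        {"id": "GLM-5.2-oQ4", "object": "model"},
        {"id": "GLM-5.2-oQ4:coding", "object": "model"},
        {"id": "glm:coding", "object": "model"},
    ],
}

grouped = profiles_by_model(sample_payload)
```

Because profiles are served through the same loaded engine, a client would pass the full `<model>:<profile>` string as the `model` field of a request rather than the split parts.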

Performance Snapshot

Model: GLM-5.2-oQ4 (418.1 GB)
Machine: Mac Studio, M3 Ultra, 512 GB unified memory
Workload: single request, 128 generated tokens

| Context | Baseline PP | oMLX 0.4.5 PP | PP vs baseline | Baseline TG | oMLX 0.4.5 TG | TG vs baseline |
|---|---|---|---|---|---|---|
| 1k | 186.8 tok/s | 187.7 tok/s | 1.00x (+0.5%) | 15.6 tok/s | 15.6 tok/s | +0.0% |
| 4k | 187.4 tok/s | 212.2 tok/s | 1.13x (+13.2%) | 14.7 tok/s | 14.9 tok/s | +1.4% |
| 8k | 164.1 tok/s | 192.8 tok/s | 1.17x (+17.5%) | 14.4 tok/s | 14.9 tok/s | +3.5% |
| 16k | 128.1 tok/s | 178.9 tok/s | 1.40x (+39.7%) | 14.4 tok/s | 14.7 tok/s | +2.1% |
| 32k | 87.7 tok/s | 174.4 tok/s | 1.99x (+98.9%) | 14.1 tok/s | 14.5 tok/s | +2.8% |

Model: MiniMax-M3-oQ3 (187.3 GB)
Workload: single request, 128 generated tokens

| Context | Baseline PP | oMLX 0.4.5 PP | PP vs baseline | Baseline TG | oMLX 0.4.5 TG | TG vs baseline |
|---|---|---|---|---|---|---|
| 1k | 325.3 tok/s | 349.6 tok/s | 1.07x (+7.5%) | 28.5 tok/s | 29.7 tok/s | +4.2% |
| 4k | 351.0 tok/s | 359.4 tok/s | 1.02x (+2.4%) | 20.1 tok/s | 20.4 tok/s | +1.5% |
| 8k | 332.1 tok/s | 343.8 tok/s | 1.04x (+3.5%) | 20.1 tok/s | 20.0 tok/s | -0.5% |
| 16k | 293.7 tok/s | 340.9 tok/s | 1.16x (+16.1%) | 19.0 tok/s | 19.7 tok/s | +3.7% |
| 32k | 228.1 tok/s | 327.1 tok/s | 1.43x (+43.4%) | 18.8 tok/s | 19.0 tok/s | +1.1% |
| 64k | 158.8 tok/s | 307.7 tok/s | 1.94x (+93.8%) | 16.0 tok/s | 17.5 tok/s | +9.4% |
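The "vs baseline" columns can be reproduced from the throughput figures. The sketch below assumes the ratio is simply new throughput divided by baseline throughput, and the percentage is that ratio minus one; the function name is illustrative, and the published values may have been computed from unrounded measurements.

```python
def versus_baseline(baseline_tok_s, new_tok_s):
    """Return (speedup ratio, percent change) rounded like the tables above."""
    ratio = new_tok_s / baseline_tok_s
    return round(ratio, 2), round((ratio - 1.0) * 100.0, 1)


# Prefill (PP) figures taken from the tables above.
glm_32k = versus_baseline(87.7, 174.4)       # GLM-5.2-oQ4, 32k context
minimax_64k = versus_baseline(158.8, 307.7)  # MiniMax-M3-oQ3, 64k context

print(f"GLM-5.2 32k prefill: {glm_32k[0]:.2f}x ({glm_32k[1]:+.1f}%)")
print(f"MiniMax M3 64k prefill: {minimax_64k[0]:.2f}x ({minimax_64k[1]:+.1f}%)")
```

Read this way, the prefill gain grows with context length for both models, while token generation stays within a few percent of the baseline.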

New Features

  • Added GLM-5.2 bundled Sparse MLA DSA custom kernels, including DSA indexer, exact block attention, q8 V-up, and fused MoE support. (#1984)
  • Added MiniMax M3 sparse-attention acceleration and adaptive long-prefill sizing.
  • Added API-visible model profiles for OpenAI-compatible clients, with web and macOS UI support. by @pablomoralesm in #1838
  • Updated global model presets for MiniMax-M3 and GLM-5.2.
  • Added Brazilian Portuguese admin UI localization. by @victor-torres in #1919
  • Added Gemma 4 QAT model support in the quantization tool. by @kreeger in #1690
  • Added native Qwen2ForCausalLM embedding serving for models such as jina-code-embeddings and gte-Qwen2. by @JimStenstrom in #1720
  • Modernized macOS app internals with native SwiftUI controls and Observation-based view models. by @Stv-X in #1891 and #1952

Bug Fixes

  • Fixed head_dim=256 long-context prefill OOM by routing eligible prefill through the tiled SDPA256 path. by @StevePierce in #2025
  • Fixed false VLM preflight rejections by counting actual image tokens instead of charging every image at the max-pixels ceiling. by @fqx in #1994
  • Fixed VLM teardown memory reclaim by dropping wrapper/model references before final MLX reclaim. by @zwcf5200 in #2010
  • Fixed SSD cache limit enforcement across model switches and composite CacheList / nested nstate SSD serialization. by @apcooley in #1939
  • Fixed unsafe in-flight model unload races, tuned tiered Memory Guard thresholds, and improved MiniMax M3 long-generation cache materialization.
  • Fixed Gemma 4 parenthesized call:name(...) tool calls. by @richgoodson in #1886
  • Fixed Cohere2 MoE streamed tool arguments with literal control characters and unsafe BPE streaming detokenization. by @ttapper in #1931
  • Fixed /v1/responses system-message fallback and missing reasoning output. by @imi4u36d in #1923
  • Fixed MCP stdio configs with cwd. by @JimStenstrom in #1987
  • Fixed CLI bootstrap base-path loading for non-default installs. by @bspaulding in #1936
  • Fixed Gemma 4 Unified oQ sanitize proxy handling for audio-capable VLM checkpoints.
  • Fixed Gemma 4 E2B/E4B shared-KV VLM checkpoint loading so affected models no longer fall back to text-only LLM loading.
  • Fixed Gemma E4B streaming output leaking raw <pad> / <eos> stop tokens.
  • Fixed MiniMax M3 oQ sanitize paths for proxy sensitivity and compatibility patch ordering.
  • Fixed admin/macOS UI issues including clipped chat action buttons, stale auto-start toggle state, nested local model display names, and the About docs link. by @shreyash0k in #2000 and @ryan-gustafson in #1949
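To illustrate why the VLM preflight fix in the list above matters, the sketch below compares charging every image at a max-pixels ceiling with counting tokens from each image's actual size. It is a hypothetical model, not oMLX's implementation: the patch size of 14, the 2x2 patch merge, the `max_pixels` value, the context budget, and all function names are assumptions chosen only to make the arithmetic concrete.

```python
import math

PATCH = 14               # assumed vision patch size in pixels
MERGE = 2                # assumed 2x2 patch merge per image token
MAX_PIXELS = 1_003_520   # assumed per-image pixel ceiling (1280 tokens)


def ceiling_tokens(max_pixels=MAX_PIXELS):
    """Tokens charged per image when every image is billed at the ceiling."""
    return max_pixels // (PATCH * MERGE) ** 2


def actual_tokens(width, height, max_pixels=MAX_PIXELS):
    """Tokens for one image after downscaling it to fit max_pixels."""
    pixels = width * height
    if pixels > max_pixels:
        scale = math.sqrt(max_pixels / pixels)
        width, height = int(width * scale), int(height * scale)
    cell = PATCH * MERGE
    return max(1, width // cell) * max(1, height // cell)


def fits(image_sizes, text_tokens, context_limit, use_ceiling):
    """Preflight check: does the prompt fit in the context budget?"""
    if use_ceiling:
        image_total = ceiling_tokens() * len(image_sizes)
    else:
        image_total = sum(actual_tokens(w, h) for w, h in image_sizes)
    return text_tokens + image_total <= context_limit


# Four small images plus a short text prompt, against a 4096-token budget.
images = [(448, 448)] * 4
old_check = fits(images, text_tokens=200, context_limit=4096, use_ceiling=True)
new_check = fits(images, text_tokens=200, context_limit=4096, use_ceiling=False)
```

Under these assumptions the ceiling-based check charges 5120 image tokens and rejects the request, while the actual-size check charges 1024 and accepts it, which is the kind of false rejection the fix describes.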

New Contributors

Thank you to @pablomoralesm, @ttapper, @victor-torres, @bspaulding, @ryan-gustafson, @apcooley, @shreyash0k, @StevePierce, and @zwcf5200 for making their first contributions since 0.4.4.

Full Changelog: v0.4.4...v0.4.5.dev1