Release 0.4.5.dev1 · jundot/omlx

This development release, the first after 0.4.4, focuses on three areas: major prefill speedups for GLM-5.2 and MiniMax M3 through custom kernels, API-visible model profiles with refreshed global presets, and VLM and cache hardening.

Highlights

Major prefill speedups for GLM-5.2 and MiniMax M3 through custom kernels. oMLX now includes GLM MoE DSA / Sparse MLA native kernels and MiniMax M3 sparse-attention acceleration, with build support in the Python package and macOS app. (#1984)
API-visible model profiles and refreshed global presets. Profiles can now be exposed as <model>:<profile> or <alias>:<profile> in /v1/models and served through the same loaded engine, while the built-in presets now include MiniMax-M3 and GLM-5.2. by @pablomoralesm in #1838
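As a rough illustration of how a client might consume profile-qualified ids, the sketch below lists models from an OpenAI-compatible `/v1/models` endpoint and splits each id into a model (or alias) and an optional profile. It assumes the response follows the usual OpenAI list shape (`{"data": [{"id": ...}]}`) and that the last colon in an id separates the profile. The function names, the base URL, and the `fast` profile name are hypothetical and not part of oMLX.

```python
import json
from typing import List, Optional, Tuple
from urllib.request import urlopen


def ids_from_models_payload(payload: dict) -> List[str]:
    """Extract ids from an OpenAI-style model list: {"data": [{"id": ...}]}."""
    return [entry["id"] for entry in payload.get("data", [])]


def split_profile_id(model_id: str) -> Tuple[str, Optional[str]]:
    """Split "<model>:<profile>" into (model, profile).

    Assumption: the last colon separates the profile name.
    Returns (model_id, None) when the id carries no profile.
    """
    base, sep, profile = model_id.rpartition(":")
    if not sep:
        return model_id, None
    return base, profile


def list_models(base_url: str) -> List[Tuple[str, Optional[str]]]:
    """Fetch <base_url>/v1/models and return (model, profile) pairs."""
    with urlopen(f"{base_url}/v1/models") as resp:
        payload = json.load(resp)
    return [split_profile_id(mid) for mid in ids_from_models_payload(payload)]


# Offline example with a hand-written payload (ids are illustrative only).
sample = {"data": [{"id": "GLM-5.2"}, {"id": "GLM-5.2:fast"}]}
pairs = [split_profile_id(mid) for mid in ids_from_models_payload(sample)]
print(pairs)  # [('GLM-5.2', None), ('GLM-5.2:fast' split into ('GLM-5.2', 'fast'))]
```

Against a running server, `list_models("http://localhost:8000")` would return the same kind of pairs; the host and port here are placeholders, so substitute the address your server actually listens on.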

Performance Snapshot

Model: GLM-5.2-oQ4 (418.1 GB)
Machine: Mac Studio, M3 Ultra, 512 GB unified memory
Workload: single request, 128 generated tokens (PP = prompt processing, i.e. prefill throughput; TG = token generation throughput)

Context	Baseline PP	oMLX 0.4.5 PP	PP vs baseline	Baseline TG	oMLX 0.4.5 TG	TG vs baseline
1k	186.8 tok/s	187.7 tok/s	1.00x (+0.5%)	15.6 tok/s	15.6 tok/s	+0.0%
4k	187.4 tok/s	212.2 tok/s	1.13x (+13.2%)	14.7 tok/s	14.9 tok/s	+1.4%
8k	164.1 tok/s	192.8 tok/s	1.17x (+17.5%)	14.4 tok/s	14.9 tok/s	+3.5%
16k	128.1 tok/s	178.9 tok/s	1.40x (+39.7%)	14.4 tok/s	14.7 tok/s	+2.1%
32k	87.7 tok/s	174.4 tok/s	1.99x (+98.9%)	14.1 tok/s	14.5 tok/s	+2.8%
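The "vs baseline" columns appear to be plain throughput ratios. A minimal sketch of that arithmetic, assuming the ratio is new throughput divided by baseline throughput and the percentage is the ratio minus one (the function name is illustrative):

```python
def vs_baseline(baseline_tok_s: float, new_tok_s: float) -> str:
    """Format a throughput change as '<ratio>x (<signed percent>%)'."""
    ratio = new_tok_s / baseline_tok_s
    percent = (ratio - 1.0) * 100.0
    return f"{ratio:.2f}x ({percent:+.1f}%)"


# GLM-5.2-oQ4 prefill rows taken from the table above.
print(vs_baseline(87.7, 174.4))   # 32k context -> 1.99x (+98.9%)
print(vs_baseline(128.1, 178.9))  # 16k context -> 1.40x (+39.7%)
```

The same formula reproduces the TG percentages, for example 14.5 / 14.1 gives +2.8% at 32k.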

Model: MiniMax-M3-oQ3 (187.3 GB)
Workload: single request, 128 generated tokens

Context	Baseline PP	oMLX 0.4.5 PP	PP vs baseline	Baseline TG	oMLX 0.4.5 TG	TG vs baseline
1k	325.3 tok/s	349.6 tok/s	1.07x (+7.5%)	28.5 tok/s	29.7 tok/s	+4.2%
4k	351.0 tok/s	359.4 tok/s	1.02x (+2.4%)	20.1 tok/s	20.4 tok/s	+1.5%
8k	332.1 tok/s	343.8 tok/s	1.04x (+3.5%)	20.1 tok/s	20.0 tok/s	-0.5%
16k	293.7 tok/s	340.9 tok/s	1.16x (+16.1%)	19.0 tok/s	19.7 tok/s	+3.7%
32k	228.1 tok/s	327.1 tok/s	1.43x (+43.4%)	18.8 tok/s	19.0 tok/s	+1.1%
64k	158.8 tok/s	307.7 tok/s	1.94x (+93.8%)	16.0 tok/s	17.5 tok/s	+9.4%
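To put the long-context rows in wall-clock terms, prefill time can be estimated as prompt length divided by PP throughput. This is a back-of-the-envelope sketch under two assumptions not stated in the table: that "64k" means 65,536 tokens, and that the PP figure is the average rate over the whole prompt.

```python
def prefill_seconds(context_tokens: int, pp_tok_per_s: float) -> float:
    """Estimate prefill wall time from prompt length and average PP throughput."""
    return context_tokens / pp_tok_per_s


CONTEXT_TOKENS = 64 * 1024  # assumption: "64k" = 65,536 tokens

# MiniMax-M3-oQ3, 64k row from the table above.
baseline_s = prefill_seconds(CONTEXT_TOKENS, 158.8)
new_s = prefill_seconds(CONTEXT_TOKENS, 307.7)

print(f"baseline: {baseline_s:.0f} s, oMLX 0.4.5: {new_s:.0f} s")
```

Under those assumptions the 64k prompt drops from about 413 s to about 213 s of prefill, which is where the 1.94x figure is felt in practice.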

New Features

Added bundled Sparse MLA / DSA custom kernels for GLM-5.2, including a DSA indexer, exact block attention, q8 V-up, and fused MoE support. (#1984)
Added MiniMax M3 sparse-attention acceleration and adaptive long-prefill sizing.
Added API-visible model profiles for OpenAI-compatible clients, with web and macOS UI support. by @pablomoralesm in #1838
Updated global model presets for MiniMax-M3 and GLM-5.2.
Added Brazilian Portuguese admin UI localization. by @victor-torres in #1919
Added Gemma 4 QAT model support in the quantization tool. by @kreeger in #1690
Added native Qwen2ForCausalLM embedding serving for models such as jina-code-embeddings and gte-Qwen2. by @JimStenstrom in #1720
Modernized macOS app internals with native SwiftUI controls and Observation-based view models. by @Stv-X in #1891 and #1952

Bug Fixes

Fixed head_dim=256 long-context prefill OOM by routing eligible prefill through the tiled SDPA256 path. by @StevePierce in #2025
Fixed false VLM preflight rejections by counting actual image tokens instead of charging every image at the max-pixels ceiling. by @fqx in #1994
Fixed VLM teardown memory reclaim by dropping wrapper/model references before final MLX reclaim. by @zwcf5200 in #2010
Fixed SSD cache limit enforcement across model switches and composite CacheList / nested nstate SSD serialization. by @apcooley in #1939
Fixed unsafe in-flight model unload races, tuned tiered Memory Guard thresholds, and improved MiniMax M3 long-generation cache materialization.
Fixed Gemma 4 parenthesized call:name(...) tool calls. by @richgoodson in #1886
Fixed Cohere2 MoE streamed tool arguments with literal control characters and unsafe BPE streaming detokenization. by @ttapper in #1931
Fixed /v1/responses system-message fallback and missing reasoning output. by @imi4u36d in #1923
Fixed MCP stdio configs with cwd. by @JimStenstrom in #1987
Fixed CLI bootstrap base-path loading for non-default installs. by @bspaulding in #1936
Fixed Gemma 4 Unified oQ sanitize proxy handling for audio-capable VLM checkpoints.
Fixed Gemma 4 E2B/E4B shared-KV VLM checkpoint loading so affected models no longer fall back to text-only LLM loading.
Fixed Gemma E4B streaming output leaking raw <pad> / <eos> stop tokens.
Fixed MiniMax M3 oQ sanitize paths for proxy sensitivity and compatibility patch ordering.
Fixed admin/macOS UI issues including clipped chat action buttons, stale auto-start toggle state, nested local model display names, and the About docs link. by @shreyash0k in #2000 and @ryan-gustafson in #1949

New Contributors

Thank you to @pablomoralesm, @ttapper, @victor-torres, @bspaulding, @ryan-gustafson, @apcooley, @shreyash0k, @StevePierce, and @zwcf5200 for making their first contributions since 0.4.4.

Full Changelog: v0.4.4...v0.4.5.dev1
