This hotfix release focuses on macOS 27 compatibility, throughput recovery for Qwopus and other affected models, and Memory Guard optimization and correctness fixes.
Highlights
- Added macOS 27 beta compatibility. oMLX now handles the larger `HOST_VM_INFO64` response shape used by macOS 27 and avoids fragile `psutil` memory-stat paths on macOS. (#1748, #1749)
- Fixed slow streaming decode on Qwopus and related models. Active Memory Guard polling no longer calls MLX/Metal telemetry from the background thread during active requests, removing a major source of decode stalls. (#1745)
- Improved decode performance. In a single-run Qwen 3.6-35B-A3B `tg512` check, throughput improved from 77.5 to 79.0 tok/s (+1.9%) compared with 0.4.2.
- Improved per-model MTP behavior. MTP decode eligibility is now stored on each loaded model instance, so loading a non-MTP model later no longer disables MTP decode on an already-loaded Qwen/Qwopus MTP model. (#1758)
- Optimized Memory Guard preflight estimates. TurboQuant KV, hybrid cache models, fused SDPA, and tiled SDPA scratch memory are now accounted for more accurately, reducing false rejections and avoiding unsafe underestimates. (#1763, #1764)
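The Memory Guard polling change can be pictured as a shared sample cache: the scheduler records memory readings as it steps, and the background enforcer reads the most recent recorded value instead of querying telemetry itself. The sketch below is illustrative only. `MemorySample`, `MemorySampleCache`, `record`, `latest`, and `over_limit` are hypothetical names, and the staleness rule is an assumption; none of this is oMLX's actual implementation.

```python
import threading
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemorySample:
    active_bytes: int
    timestamp: float


class MemorySampleCache:
    """Hypothetical holder for the most recent scheduler-recorded sample."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._latest: Optional[MemorySample] = None

    def record(self, active_bytes: int, now: Optional[float] = None) -> None:
        # Called from the scheduler thread, which already has the reading in hand.
        ts = time.monotonic() if now is None else now
        with self._lock:
            self._latest = MemorySample(active_bytes, ts)

    def latest(self, max_age_s: float, now: Optional[float] = None) -> Optional[MemorySample]:
        # Called from the enforcer loop; it only reads the cached value
        # and never queries telemetry itself.
        current = time.monotonic() if now is None else now
        with self._lock:
            sample = self._latest
        if sample is None or current - sample.timestamp > max_age_s:
            return None  # assumption: a stale sample counts as "no data"
        return sample


def over_limit(cache: MemorySampleCache, limit_bytes: int,
               max_age_s: float = 1.0, now: Optional[float] = None) -> bool:
    # Enforcer-side check built purely on the cached sample.
    sample = cache.latest(max_age_s, now)
    return sample is not None and sample.active_bytes > limit_bytes
```

Treating a stale sample as missing data is one possible policy; a real enforcer might instead fall back to a direct query once no request is active.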
Improvements and Fixes
- Fixed Memory Guard active-request polling so scheduler-recorded MLX memory samples are reused instead of querying MLX telemetry from the enforcer loop.
- Fixed macOS memory detection so system memory and process enforcement remain stable when `HOST_VM_INFO64` sizing changes on macOS 27 beta.
- Fixed TurboQuant KV preflight accounting so Memory Guard no longer overestimates KV peak memory by several times on TurboQuant-enabled models. (#1763)
- Fixed preflight support for hybrid `ArraysCache` models with TurboQuant enabled.
- Fixed fused SDPA memory estimation so MLX fused attention is treated as linear-memory for all `head_dim` values where applicable. by @fqx (#1764)
- Added tiled SDPA scratch accounting for high-head-dimension prefill paths so large VLM/Qwen-style models are guarded more accurately.
- Fixed prefill Memory Guard errors to return a client-visible failure path instead of surfacing as an internal server failure.
- Fixed DFlash fallback scheduler resolution and bumped `dflash-mlx` for the Qwen wrapper compatibility fix.
- Fixed Llama 4 batch cache offsets. (#1752)
- Fixed `max_completion_tokens` handling as an alias for `max_tokens`. (#1759)
- Fixed Harmony encoding loading by retrying transient tokenizer/encoding load failures.
- Fixed stored MarkItDown file placeholders so existing uploaded-file references remain usable after 0.4.2. (#1750)
- Fixed `logits_processors=None` handling to avoid mlx-lm crashes. by @monroewilliams (#1747)
- Added Thaw menu bar manager support. by @youvegotmoxie (#1743)
- Bumped the `mlx-lm`, `mlx-vlm`, and `dflash-mlx` pins to include upstream compatibility fixes used by this hotfix.
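The `max_completion_tokens` fix concerns request-parameter aliasing. A minimal sketch of one way such an alias could be resolved follows. The function name `resolve_max_tokens` and the precedence rule (prefer `max_completion_tokens` when both fields are set) are assumptions made for illustration, not confirmed oMLX behavior.

```python
from typing import Any, Mapping, Optional


def resolve_max_tokens(body: Mapping[str, Any],
                       default: Optional[int] = None) -> Optional[int]:
    """Return the effective token limit from a request body (hypothetical helper)."""
    # Assumed precedence: the newer field name wins when both are present.
    for key in ("max_completion_tokens", "max_tokens"):
        value = body.get(key)
        if value is None:
            continue  # a missing or null field falls through to the next name
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            raise ValueError(f"{key} must be a positive integer")
        return value
    return default
```

For example, `resolve_max_tokens({"max_completion_tokens": 128, "max_tokens": 64})` returns `128` under the assumed precedence, while a body containing neither field yields the supplied default.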
Thanks
Thanks to @Collinw24, @ritbl, @orangeseasun205, @smkzw, @fqx, @monroewilliams, and @youvegotmoxie for the reports and fixes that shaped this release.
New Contributors
Thank you to @youvegotmoxie for making their first contribution in this release.
Full Changelog: v0.4.2...v0.4.3