Release 0.4.4 · jundot/omlx

This release rolls up the 0.4.4 release candidates and focuses on MiniMax M3 support, DiffusionGemma support, DeepSeek V4 oQ/MTP support, stronger macOS 27 compatibility, safer MTP batching, cache-reuse correctness, and Memory Guard hardening.

Highlights

Early support for MiniMax M3 via the upstream mlx-vlm PR. oMLX now tracks the not-yet-merged MiniMax M3 work from Blaizzy/mlx-vlm#1374, originally contributed by @ivanfioravanti, so MiniMax M3 / MiniMax M3 VL can be tried before that support lands upstream. This includes native-text VLM adaptation, MiniMax position handling, sparse-attention left-padding fixes, tool-call marker handling, and related prefix/cache support.
Added DiffusionGemma and expanded speculative decoding support. oMLX can now serve DiffusionGemma through the mlx-vlm path, and VLM MTP can use an external Qwen MTP drafter.
Stronger macOS 27 compatibility. oMLX now uses a macOS memory stats compatibility layer for newer HOST_VM_INFO64 layouts, keeping Memory Guard decisions and admin memory telemetry stable on newer macOS releases. (#1749, #1835)
Added DeepSeek V4 oQ quantization and MTP support. This includes fractional oQ levels, pre-quantized DeepSeek V4 oQ tensors, and safer DeepSeek V4 MTP loading and rollback behavior.
Improved agent cache reuse and cache correctness. Paged SSD cache, prefix-cache restore, rotating-family cache handling, and MiniMax M3 partial-cache resume are now safer for repeated agent-style workloads. by @cfbraun in #1815 and @hojin12312 in #1807
Made native MTP batching safer. Native MTP decode now realigns batch rows and defers unsafe late-join rows, avoiding speculative batching across mismatched cache positions. by @efortin in #1824 and @richgoodson in #1845
Strengthened Memory Guard and hot-cache behavior. oMLX now has better preflight accounting, binding-ceiling diagnostics, and hot-cache pressure handling. by @cfbraun in #1452 and @isaac-cf-wong in #1863
Improved Gemma 4, Harmony, Codex App, and Hermes integration behavior. Tool-call parsing is more robust, malformed Harmony channels are preserved, Codex App Desktop launch is available, and Hermes now launches through the correct hermes chat flow. by @richgoodson in #1854, @jimicze in #1852, and @fparrav in #1878
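
The late-join deferral described in the MTP batching highlight can be pictured as a partition over candidate rows by cache position. The sketch below is illustrative only: `split_by_cache_position`, its arguments, and the strict equality check are assumptions made for this example and are not taken from the oMLX source.

```python
from typing import Dict, List, Tuple


def split_by_cache_position(
    active_offset: int, candidates: Dict[str, int]
) -> Tuple[List[str], List[str]]:
    """Partition candidate rows into those that can join now and those to defer.

    Hypothetical helper: a row may join the speculative batch only when its
    cache offset matches the offset of the rows already being decoded.
    """
    joinable: List[str] = []
    deferred: List[str] = []
    for request_id, offset in candidates.items():
        if offset == active_offset:
            joinable.append(request_id)
        else:
            # Mismatched cache position: hold the row back for a later step.
            deferred.append(request_id)
    return joinable, deferred


joinable, deferred = split_by_cache_position(128, {"a": 128, "b": 64, "c": 128})
```

Here rows `a` and `c` share the active offset and join the batch, while `b` waits for a later step.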

Improvements and Fixes

Added MiniMax M3 native-text VLM support, sparse-attention patching, position-id handling, output parsing, tool-call filtering, and cache/type-handler support.
Exposed nested VLM language models through the oQ sanitize-plan proxy so MiniMax-style nested VLMs can be quantized. by @gilby in #1881
Added VLM MTP support with an external Qwen MTP drafter and fixed VLM MTP benchmark / mRoPE adapter routing paths. by @imi4u36d in #1791, #1813, and #1839
Fixed the native VLM MTP drafter picker for qwen3_5_mtp models. by @chenqianhe in #1860
Enabled tool calling on the serial diffusion lane. by @scubamount in #1837
Fixed SSD cache invalidation for stale layer-cache signatures and rotating-tip cache payload handling. by @cfbraun in #1815 and @hojin12312 in #1807
Added a prefix-cache divergence probe and improved DFlash cached-token reporting / pre-load admission accounting. by @popfido in #1784, #1768, and #1766
Fixed engine-pool settle waits so other serving engines are not delayed unnecessarily. by @JimStenstrom in #1785
Fixed DFlash prefill Memory Guard enforcement on the primary path. by @JimStenstrom in #1770
Fixed Gemma 4 MCP-namespaced and single-quoted tool calls. by @richgoodson in #1854
Fixed /v1/completions thinking_budget forwarding and hardening. by @richgoodson in #1844 and @efortin in #1821
Fixed row-aligned samplers/logits processors after batch-row removal. by @efortin in #1824 and @richgoodson in #1845
Fixed non-ASCII configured API-key validation and safer rejected-key logging. by @richgoodson in #1804 and #1751
Added faithful BGE serving on MLX and improved native embedding/reranker behavior. by @paalolav in #1767
Added TTS language forwarding and NeMo ASR model detection. by @apetersson in #1773 and @scaryrawr in #1742
Fixed Gemma 4 Unified detection as VLM. by @FaisalFehad in #1744
Added Codex App Desktop integration and fixed Hermes launch command handling. by @jimicze in #1852 and @fparrav in #1878
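
The row-alignment fix for samplers and logits processors listed above concerns per-row state that has to shrink together with the batch. A minimal sketch of that invariant, assuming a hypothetical `BatchState` container with one temperature per row standing in for real sampler objects (none of these names come from oMLX):

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class BatchState:
    """Hypothetical per-batch state: index i in every list describes row i."""

    request_ids: List[str]
    temperatures: List[float]  # stand-in for per-row samplers / logits processors

    def remove_rows(self, finished: Set[str]) -> None:
        # Compute the surviving row indices once and apply them to every
        # per-row list, so row i still refers to the same request afterwards.
        keep = [i for i, rid in enumerate(self.request_ids) if rid not in finished]
        self.request_ids = [self.request_ids[i] for i in keep]
        self.temperatures = [self.temperatures[i] for i in keep]


batch = BatchState(request_ids=["a", "b", "c"], temperatures=[0.1, 0.5, 0.9])
batch.remove_rows({"b"})
```

After removing `b`, row 1 holds request `c` together with its own temperature, rather than inheriting the setting that belonged to the removed row.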

Changes Since 0.4.4rc2

Fixed MiniMax M3 partial-cache resume for long-context workloads by trimming partial cache hits back to a safe 2048-token boundary before resuming prefill. (#1888)
Fixed a cache-reuse timing issue for coding-agent style follow-up requests, where a new request arriving immediately after the previous one could miss reusable cache because the prior cache write had not finished yet.
Limited SSD cache preloading to the amount of hot-cache space that can actually be used, avoiding unnecessary preload work under memory pressure.
Improved scheduler recovery when cache loading stalls, so blocked admissions are cleared and new requests can resume normally.
Changed MarkItDown's Show as model integration setting to be off by default, while keeping attachment preprocessing enabled by default.
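
The partial-cache trim in the MiniMax M3 fix above amounts to rounding a matched prefix length down to a token boundary before prefill resumes. A minimal sketch, with the 2048-token boundary taken from the note above; the helper names `trim_to_boundary` and `resume_plan` are hypothetical and this is not the oMLX implementation:

```python
def trim_to_boundary(cached_tokens: int, boundary: int = 2048) -> int:
    """Round a partial cache hit down to the nearest multiple of `boundary`."""
    if cached_tokens < 0 or boundary <= 0:
        raise ValueError("cached_tokens must be >= 0 and boundary must be > 0")
    return (cached_tokens // boundary) * boundary


def resume_plan(prompt_len: int, cached_tokens: int, boundary: int = 2048):
    """Return (tokens reused from cache, tokens left to prefill)."""
    # A cache hit can never cover more than the prompt itself.
    reuse = trim_to_boundary(min(cached_tokens, prompt_len), boundary)
    return reuse, prompt_len - reuse


reuse, remaining = resume_plan(prompt_len=6000, cached_tokens=5000)
```

With a 6000-token prompt and a 5000-token partial hit, the sketch reuses 4096 tokens and prefills the remaining 1904.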

Thanks

Special thanks to @ivanfioravanti for the initial MiniMax M3 support PR in mlx-vlm, and to @Blaizzy for the mlx-vlm work that makes this path possible.

Thanks to @bbongtree1004, @richgoodson, @paalolav, @JimStenstrom, @apetersson, @popfido, @FaisalFehad, @scaryrawr, @hojin12312, @efortin, @imi4u36d, @cfbraun, @fparrav, @gilby, @jimicze, @isaac-cf-wong, @scubamount, and @chenqianhe for the reports and fixes that shaped this release.

New Contributors

Thank you to @paalolav, @hojin12312, @efortin, @jimicze, @isaac-cf-wong, and @gilby for making their first contributions since 0.4.3.

Full Changelog: v0.4.3...v0.4.4
