0.4.4

@jundot released this 16 Jun 09:55

This release rolls up the 0.4.4 release candidates and focuses on MiniMax M3 support, DiffusionGemma support, DeepSeek V4 oQ/MTP support, stronger macOS 27 compatibility, safer MTP batching, cache-reuse correctness, and Memory Guard hardening.

Highlights

  • Early support for MiniMax M3 via the upstream mlx-vlm PR. oMLX now tracks the not-yet-merged MiniMax M3 work from Blaizzy/mlx-vlm#1374, originally contributed by @ivanfioravanti, so MiniMax M3 / MiniMax M3 VL can be tried before that support lands upstream. This includes native-text VLM adaptation, MiniMax position handling, sparse-attention left-padding fixes, tool-call marker handling, and related prefix/cache support.
  • Added DiffusionGemma and expanded speculative decoding support. oMLX can now serve DiffusionGemma through the mlx-vlm path, and VLM MTP can use an external Qwen MTP drafter.
  • Stronger macOS 27 compatibility. oMLX now uses a macOS memory stats compatibility layer for newer HOST_VM_INFO64 layouts, keeping Memory Guard decisions and admin memory telemetry stable on newer macOS releases. (#1749, #1835)
  • Added DeepSeek V4 oQ quantization and MTP support. This includes fractional oQ levels, pre-quantized DeepSeek V4 oQ tensors, and safer DeepSeek V4 MTP loading and rollback behavior.
  • Improved agent cache reuse and cache correctness. Paged SSD cache, prefix-cache restore, rotating-family cache handling, and MiniMax M3 partial-cache resume are now safer for repeated agent-style workloads. by @cfbraun in #1815 and @hojin12312 in #1807
  • Made native MTP batching safer. Native MTP decode now realigns batch rows and defers unsafe late-join rows, avoiding speculative batching across mismatched cache positions. by @efortin in #1824 and @richgoodson in #1845
  • Strengthened Memory Guard and hot-cache behavior. oMLX now has better preflight accounting, binding-ceiling diagnostics, and hot-cache pressure handling. by @cfbraun in #1452 and @isaac-cf-wong in #1863
  • Improved Gemma 4, Harmony, Codex App, and Hermes integration behavior. Tool-call parsing is more robust, malformed Harmony channels are preserved, Codex App Desktop launch is available, and Hermes now launches through the correct hermes chat flow. by @richgoodson in #1854, @jimicze in #1852, and @fparrav in #1878
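To illustrate the late-join deferral described in the MTP batching highlight, here is a minimal sketch. It is not oMLX's implementation: `Row`, `cache_pos`, and `split_for_speculation` are hypothetical names, and the sketch assumes that a row is safe to join only when its cache position equals that of the rows already in the batch. It models deferral only, not the realignment step.

```python
from dataclasses import dataclass


@dataclass
class Row:
    request_id: str
    cache_pos: int  # hypothetical: number of tokens already in this row's cache


def split_for_speculation(active: list[Row], joiners: list[Row]) -> tuple[list[Row], list[Row]]:
    """Admit joiners whose cache position matches the batch; defer the rest.

    Assumes every row in `active` already shares one cache position.
    """
    if active:
        reference = active[0].cache_pos
    elif joiners:
        # No running batch: use the first joiner as the reference position.
        reference = joiners[0].cache_pos
    else:
        return [], []

    batch = list(active)
    deferred: list[Row] = []
    for row in joiners:
        if row.cache_pos == reference:
            batch.append(row)
        else:
            deferred.append(row)  # retried on a later step, outside this batch
    return batch, deferred
```

Deferred rows are not dropped in this sketch; the caller would retry them on a later scheduling step, so speculative decode never mixes rows at different cache positions.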

Improvements and Fixes

  • Added MiniMax M3 native-text VLM support, sparse-attention patching, position-id handling, output parsing, tool-call filtering, and cache/type-handler support.
  • Exposed nested VLM language models through the oQ sanitize-plan proxy so MiniMax-style nested VLMs can be quantized. by @gilby in #1881
  • Added VLM MTP support with an external Qwen MTP drafter and fixed VLM MTP benchmark / mRoPE adapter routing paths. by @imi4u36d in #1791, #1813, and #1839
  • Fixed the native VLM MTP drafter picker for qwen3_5_mtp models. by @chenqianhe in #1860
  • Enabled tool calling on the serial diffusion lane. by @scubamount in #1837
  • Fixed SSD cache invalidation for stale layer-cache signatures and rotating-tip cache payload handling. by @cfbraun in #1815 and @hojin12312 in #1807
  • Added a prefix-cache divergence probe and improved DFlash cached-token reporting / pre-load admission accounting. by @popfido in #1784, #1768, and #1766
  • Fixed engine-pool settle waits so other serving engines are not delayed unnecessarily. by @JimStenstrom in #1785
  • Fixed DFlash prefill Memory Guard enforcement on the primary path. by @JimStenstrom in #1770
  • Fixed Gemma 4 MCP-namespaced and single-quoted tool calls. by @richgoodson in #1854
  • Fixed and hardened thinking_budget forwarding for /v1/completions. by @richgoodson in #1844 and @efortin in #1821
  • Fixed samplers and logits processors falling out of alignment with their batch rows after a row is removed. by @efortin in #1824 and @richgoodson in #1845
  • Fixed validation of configured API keys that contain non-ASCII characters and made logging of rejected keys safer. by @richgoodson in #1804 and #1751
  • Added faithful BGE serving on MLX and improved native embedding/reranker behavior. by @paalolav in #1767
  • Added TTS language forwarding and NeMo ASR model detection. by @apetersson in #1773 and @scaryrawr in #1742
  • Fixed detection of Gemma 4 Unified as a VLM. by @FaisalFehad in #1744
  • Added Codex App Desktop integration and fixed Hermes launch command handling. by @jimicze in #1852 and @fparrav in #1878
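As background for the non-ASCII API-key fix above, the sketch below shows one common way this class of bug arises in Python and one way to avoid it. It is an illustration under assumptions, not the code from the referenced PRs: `api_key_matches` and `key_fingerprint` are hypothetical helpers. The standard-library fact it relies on is that `hmac.compare_digest` accepts `str` arguments only when they are ASCII, so comparing UTF-8 bytes is the safer form.

```python
import hashlib
import hmac


def api_key_matches(presented: str, configured: str) -> bool:
    """Constant-time comparison that also works for non-ASCII keys.

    hmac.compare_digest raises TypeError for str values containing
    non-ASCII characters, so both sides are encoded to bytes first.
    """
    return hmac.compare_digest(presented.encode("utf-8"), configured.encode("utf-8"))


def key_fingerprint(presented: str) -> str:
    """Short, non-reversible tag for logs, so the rejected key itself is never written."""
    return hashlib.sha256(presented.encode("utf-8")).hexdigest()[:8]
```

A rejected request could then be logged with `key_fingerprint(presented)` in place of the raw key, which keeps log lines correlatable without exposing the secret.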

Changes Since 0.4.4rc2

  • Fixed MiniMax M3 partial-cache resume for long-context workloads by trimming partial cache hits back to a safe 2048-token boundary before resuming prefill. (#1888)
  • Fixed a cache-reuse timing issue for coding-agent style follow-up requests, where a new request arriving immediately after the previous one could miss reusable cache because the prior cache write had not finished yet.
  • Limited SSD cache preloading to the amount of hot-cache space that can actually be used, avoiding unnecessary preload work under memory pressure.
  • Improved scheduler recovery when cache loading stalls, so blocked admissions are cleared and new requests can resume normally.
  • Changed MarkItDown's "Show as model integration" setting to default to off, while keeping attachment preprocessing enabled by default.
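The 2048-token trimming mentioned in the MiniMax M3 partial-cache fix can be pictured as rounding a cache hit down to a block boundary. The sketch below assumes that "safe boundary" means the largest multiple of 2048 not exceeding the hit length; the release notes do not spell this out, and `trim_to_boundary` is a hypothetical name rather than an oMLX function.

```python
def trim_to_boundary(cached_tokens: int, boundary: int = 2048) -> int:
    """Round a partial cache hit down to the nearest multiple of `boundary`.

    Tokens past the returned length would be recomputed by prefill.
    """
    if cached_tokens < 0:
        raise ValueError("cached_tokens must be non-negative")
    if boundary <= 0:
        raise ValueError("boundary must be positive")
    return (cached_tokens // boundary) * boundary


# A 5000-token partial hit resumes from token 4096; the remaining 904 are re-prefilled.
resume_from = trim_to_boundary(5000)
```

Under this assumption, a hit shorter than 2048 tokens trims to zero, meaning prefill starts from scratch instead of resuming from an unaligned cache state.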

Thanks

Special thanks to @ivanfioravanti for the initial MiniMax M3 support PR in mlx-vlm, and to @Blaizzy for the mlx-vlm work that makes this path possible.

Thanks to @bbongtree1004, @richgoodson, @paalolav, @JimStenstrom, @apetersson, @popfido, @FaisalFehad, @scaryrawr, @hojin12312, @efortin, @imi4u36d, @cfbraun, @fparrav, @gilby, @jimicze, @isaac-cf-wong, @scubamount, and @chenqianhe for the reports and fixes that shaped this release.

New Contributors

Thank you to @paalolav, @hojin12312, @efortin, @jimicze, @isaac-cf-wong, and @gilby for making their first contributions since 0.4.3.

Full Changelog: v0.4.3...v0.4.4