0.4.4.dev1
Pre-release
This development release improves memory-pressure handling, paged SSD cache reliability, native embedding/reranker serving, DFlash memory/cache accounting, and audio model support.
- Improved emergency memory handling for pinned workloads. Active requests are only aborted as a last resort when memory exceeds the real ceiling.
- Improved paged SSD cache write-back reliability. Dirty hot-cache blocks now fall back to inline SSD writes instead of being dropped.
- Improved API-key log safety. Rejected API keys are logged as fingerprints instead of raw values. by @richgoodson in #1751
- Improved native BGE/XLM-R/BERT serving. bf16 reranker loads, embedding eval mode, and CLS pooling are handled correctly. by @paalolav in #1767
- Improved DFlash prefill memory guarding. DFlash primary mode now applies the prefill memory guard before admission. by @JimStenstrom in #1770
- Improved native embedding and reranker inference. Native paths now match shared serving behavior more closely.
- Improved DFlash preflight memory safety. Unsafe MLX telemetry calls were removed from the preflight guard path.
- Added TTS language forwarding. The audio speech `language` field now reaches mlx-audio `lang_code`. by @apetersson in #1773
- Improved DFlash cache accounting. Prefix-cache hits are reported in `prompt_tokens_details.cached_tokens`. by @popfido in #1768
- Fixed TTS argument forwarding. TTS engine argument order is preserved when language is forwarded.
- Improved Gemma 4 Unified discovery. `gemma4_unified` models are detected as VLMs even without `vision_config`. by @FaisalFehad in #1744
- Improved NeMo ASR discovery. NeMo ASR models are detected as speech-to-text models. by @scaryrawr in #1742
- Improved pre-load memory admission. Tracked model memory now participates in LRU eviction decisions before loading another model. by @popfido in #1766
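
For readers wondering what the API-key log safety change means in practice, here is a minimal sketch of logging a fingerprint instead of a raw key. It is an illustration only, not the project's actual implementation: the function names, the choice of SHA-256, and the 12-character truncation are all assumptions.

```python
import hashlib


def key_fingerprint(api_key: str, length: int = 12) -> str:
    """Return a short, non-reversible fingerprint of an API key.

    Hypothetical helper: SHA-256 and the truncation length are
    assumptions made for this sketch.
    """
    digest = hashlib.sha256(api_key.encode("utf-8")).hexdigest()
    return f"sha256:{digest[:length]}"


def describe_rejection(api_key: str) -> str:
    """Build a log message that never contains the raw key."""
    return f"rejected API key (fingerprint={key_fingerprint(api_key)})"
```

The same key always maps to the same fingerprint, so repeated rejections can still be correlated in logs without exposing the secret itself.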
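
To check the DFlash cache accounting change from a client, the prefix-cache hit count can be read from the usage object in a completion response. The sketch below assumes the response follows the OpenAI-style `usage` layout, with `prompt_tokens_details.cached_tokens` nested under it. The helper name and the sample numbers are hypothetical.

```python
def cached_prompt_tokens(usage: dict) -> int:
    """Return the number of prompt tokens served from the prefix cache.

    Falls back to 0 when the field is absent or null, which a server
    may do when no cache hit occurred (an assumption of this sketch).
    """
    details = usage.get("prompt_tokens_details") or {}
    return int(details.get("cached_tokens") or 0)


# Hypothetical usage payload, for illustration only.
example_usage = {
    "prompt_tokens": 100,
    "completion_tokens": 20,
    "prompt_tokens_details": {"cached_tokens": 64},
}
hit_ratio = cached_prompt_tokens(example_usage) / example_usage["prompt_tokens"]
```

Comparing `cached_tokens` against `prompt_tokens` across repeated requests that share a prefix gives a quick view of how much of each prompt was reused.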