Summary
Running mlx-community/GLM-5.1-MXFP4-Q8 (378GB, 78 layers, 256 MoE experts) distributed across two M3 Ultra 256GB Mac Studios yields a decode speed of only 0.27 tok/s, while single-node benchmarks on 512GB machines report 15-16 tok/s. We have systematically eliminated every variable we can control except the performance of the MXFP4 matmul kernels.
Environment
- Hardware: 2× M3 Ultra Mac Studio, 256GB unified memory each
- Connection: Thunderbolt 5 direct link (also tested Wi-Fi — no difference)
- macOS: 26.4
- MLX: 0.31.1
- mlx_lm: 0.31.2
- Python: 3.11.15
- Model: mlx-community/GLM-5.1-MXFP4-Q8 (~378GB)
- Architecture: DeepSeek-V3.2 (DeepseekV32Model)
Reproduction
Two-node distributed launch via:
mlx.launch --hostfile hostfile.json --backend jaccl script.py
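For reference, a minimal hostfile sketch in the layout the MLX distributed docs describe (hostnames and IPs below are placeholders; the jaccl backend may require fields beyond what the ring backend needs):

```json
[
    {"ssh": "studio-1.local", "ips": ["192.168.2.1"]},
    {"ssh": "studio-2.local", "ips": ["192.168.2.2"]}
]
```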
The script uses model.model.pipeline(g) with a monkey-patched DeepseekV32Model.__call__ that adds:
1. CPU barrier before all_gather: mx.eval(mx.distributed.all_sum(mx.array(1.0), stream=mx.cpu)) — prevents idle-rank Metal GPU timeout
2. Per-layer mx.eval(h) to keep Metal command buffers under the ~45s AGX watchdog limit
Without the monkey-patch, inference crashes with MTLCommandBuffer Execution Error - GPU Timeout.
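A condensed sketch of the patch is below. It is simplified: argument names are assumptions, mask handling is trimmed, and the all_gather / send-recv between pipeline ranks is omitted. Only the two commented additions are the point; the rest mirrors the stock layer loop.

```python
import mlx.core as mx

def patched_call(self, inputs, cache=None, mask=None):
    # (1) CPU barrier: a tiny all_sum evaluated on the CPU stream keeps an
    #     idle rank from sitting in a Metal wait long enough to trip the
    #     ~45 s AGX watchdog while the other rank computes.
    mx.eval(mx.distributed.all_sum(mx.array(1.0), stream=mx.cpu))

    h = self.embed_tokens(inputs)
    if cache is None:
        cache = [None] * len(self.layers)

    for layer, c in zip(self.layers, cache):
        h = layer(h, mask, cache=c)
        # (2) Per-layer eval: flush the Metal command buffer so no single
        #     buffer runs anywhere near the ~45 s watchdog limit.
        mx.eval(h)

    return self.norm(h)

# After loading and pipelining in the distributed script, e.g.
#   model, tokenizer = mlx_lm.utils.load(...); model.model.pipeline(group)
# the patch is applied to the inner model's class:
#   type(model.model).__call__ = patched_call
```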
Systematic Testing
We ran 5 controlled tests to isolate the bottleneck:
| Test | Config | Result | Conclusion |
| ------------------ | ---------------------------------------------- | ------------- | ----------------------------------- |
| Baseline | Pipeline parallelism, eval every 4 layers | 0.270 tok/s | — |
| Lazy decode | Pipeline, eval only at rank boundary | 0.259 tok/s | Eval overhead is NOT the bottleneck |
| Aggressive eval | Pipeline, eval every 39 layers (once per rank) | 0.259 tok/s | Confirms eval is irrelevant |
| Tensor parallelism | model.shard(), both ranks run all 78 layers | 0.270 tok/s | Parallelism strategy doesn't matter |
| RDMA vs TCP | JACCL RDMA backend vs ring TCP | No difference | Network is NOT the bottleneck |
All tests converge on 0.26-0.27 tok/s regardless of configuration.
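For context on how the numbers above are measured: decode tok/s is generated tokens over wall-clock time after the first token. A minimal single-node harness along those lines (prompt, token budget, and the small model are arbitrary; stream_generate is the standard mlx_lm streaming entry point):

```python
import time
from mlx_lm import load, stream_generate

# Small model used only to illustrate the measurement; the GLM runs load and
# pipeline the model inside the distributed script instead of calling load().
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

prompt = "Explain pipeline parallelism in two sentences."
t_first, n_tokens = None, 0

for _ in stream_generate(model, tokenizer, prompt, max_tokens=64):
    if t_first is None:
        t_first = time.perf_counter()   # first token marks the end of prefill
    n_tokens += 1

decode_time = time.perf_counter() - t_first
print(f"decode: {(n_tokens - 1) / decode_time:.2f} tok/s")
```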
Key Observations
1. Small model works perfectly: mlx-community/Llama-3.2-3B-Instruct-4bit runs at 329 tok/s on the same cluster.
2. RSS is correct: ~180GB per rank for pipeline. No OOM or memory pressure (a cross-check snippet follows this list).
3. Metal GPU timeout is NOT configurable on macOS — no sysctl, env var, or boot arg can extend the ~45s limit. The CPU barrier + per-layer eval is required.
4. The 0.27 tok/s appears to be a hard floor — something structural pins the speed regardless of parallelism, eval frequency, or network transport.
5. MXFP4 is the only variable we cannot A/B test — the only other GLM-5.1 quant (inferencerlabs/GLM-5.1-MLX-4.8bit) uses a custom architecture incompatible with standard mlx_lm.
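A trivial way to cross-check observation 2 against MLX's own allocator counters, assuming a recent MLX where the memory API lives at the top level (older releases expose the same calls under mx.metal):

```python
import mlx.core as mx

# Print per-rank allocator stats after the weights are loaded and pipelined.
rank = mx.distributed.init().rank()
print(f"rank {rank}: "
      f"active = {mx.get_active_memory() / 2**30:.1f} GiB, "
      f"peak = {mx.get_peak_memory() / 2**30:.1f} GiB")
```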
Hypothesis
The MXFP4 (microscaling FP4) quantization format may lack optimized Metal kernels for the matmul patterns used by DeepSeek-V3.2's MoE architecture (256 experts, gated routing). The dequantization overhead per matmul may dominate compute time.
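One way to probe this in isolation would be to time a single expert-shaped quantized matmul in both formats. The sketch below uses illustrative shapes, and the mode keyword on mx.quantize / mx.quantized_matmul (with the mxfp4 result carrying no biases) is an assumption based on the MLX version in use, so the signatures should be checked against the installed build.

```python
import time
import mlx.core as mx

def bench(mode, group_size, d_in=4096, d_out=1536, iters=100):
    """Average seconds per quantized matmul for one decode token."""
    x = mx.random.normal((1, d_in)).astype(mx.bfloat16)
    w = mx.random.normal((d_out, d_in)).astype(mx.bfloat16)
    # Affine quantization returns (wq, scales, biases); mxfp4 is assumed to
    # return (wq, scales) only, hence the flexible unpacking.
    wq, scales, *rest = mx.quantize(w, group_size=group_size, bits=4, mode=mode)
    biases = rest[0] if rest else None

    def step():
        return mx.quantized_matmul(x, wq, scales, biases, transpose=True,
                                   group_size=group_size, bits=4, mode=mode)

    mx.eval(step())                       # warm-up / kernel compile
    start = time.perf_counter()
    for _ in range(iters):
        mx.eval(step())
    return (time.perf_counter() - start) / iters

for mode, gs in [("affine", 64), ("mxfp4", 32)]:
    print(f"{mode:6s}: {bench(mode, gs) * 1e6:7.1f} us per matmul")
```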
The 15-16 tok/s benchmarks from Inferencer use proprietary custom Metal kernels and a modified MLX build — not reproducible with standard mlx_lm.
Questions
1. Are there known performance limitations with MXFP4 matmul kernels on Apple Silicon, particularly for MoE gated dispatch patterns?
2. Is there a standard mlx_lm-compatible quantization of GLM-5.1 that uses integer quantization (Q4/Q8) instead of MXFP4?
3. Are MXFP4 kernel optimizations on the roadmap?
4. Is the CPU barrier + per-layer eval pattern the recommended workaround for Metal GPU timeout on large models, or is there a better approach?