Summary
Running mlx-community/GLM-5.1-MXFP4-Q8 (378GB, 78 layers, 256 MoE experts) distributed across two M3 Ultra 256GB Mac Studios yields a decode speed of only 0.27 tok/s, while single-node benchmarks on 512GB machines report 15-16 tok/s. We have systematically eliminated every variable we can control except the performance of the MXFP4 matmul kernels.
Environment
- Hardware: 2× M3 Ultra Mac Studio, 256GB unified memory each
- Connection: Thunderbolt 5 direct link (also tested Wi-Fi — no difference)
- macOS: 26.4
- MLX: 0.31.1
- mlx_lm: 0.31.2
- Python: 3.11.15
- Model: mlx-community/GLM-5.1-MXFP4-Q8 (~378GB)
- Architecture: DeepSeek-V3.2 (DeepseekV32Model)
Reproduction
Two-node distributed launch via:
mlx.launch --hostfile hostfile.json --backend jaccl script.py
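For reference, a minimal hostfile sketch in the layout the MLX distributed docs describe (hostnames and IPs below are placeholders; the jaccl backend may require fields beyond what the ring backend needs):

```json
[
    {"ssh": "studio-1.local", "ips": ["192.168.2.1"]},
    {"ssh": "studio-2.local", "ips": ["192.168.2.2"]}
]
```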
The script uses model.model.pipeline(g) with a monkey-patched DeepseekV32Model.__call__ that adds:
1. CPU barrier before all_gather: mx.eval(mx.distributed.all_sum(mx.array(1.0), stream=mx.cpu)) — prevents idle-rank Metal GPU timeout
2. Per-layer mx.eval(h) to keep Metal command buffers under the ~45s AGX watchdog limit
Without the monkey-patch, inference crashes with MTLCommandBuffer Execution Error - GPU Timeout.
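A condensed sketch of the patch is below. It is simplified: argument names are assumptions, mask handling is trimmed, and the all_gather / send-recv between pipeline ranks is omitted. Only the two commented additions are the point; the rest mirrors the stock layer loop.

```python
import mlx.core as mx

def patched_call(self, inputs, cache=None, mask=None):
    # (1) CPU barrier: a tiny all_sum evaluated on the CPU stream keeps an
    #     idle rank from sitting in a Metal wait long enough to trip the
    #     ~45 s AGX watchdog while the other rank computes.
    mx.eval(mx.distributed.all_sum(mx.array(1.0), stream=mx.cpu))

    h = self.embed_tokens(inputs)
    if cache is None:
        cache = [None] * len(self.layers)

    for layer, c in zip(self.layers, cache):
        h = layer(h, mask, cache=c)
        # (2) Per-layer eval: flush the Metal command buffer so no single
        #     buffer runs anywhere near the ~45 s watchdog limit.
        mx.eval(h)

    return self.norm(h)

# After loading and pipelining in the distributed script, e.g.
#   model, tokenizer = mlx_lm.utils.load(...); model.model.pipeline(group)
# the patch is applied to the inner model's class:
#   type(model.model).__call__ = patched_call
```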
Systematic Testing
We ran 5 controlled tests to isolate the bottleneck:
| Test | Config | Result | Conclusion |
| ------------------ | ---------------------------------------------- | ------------- | ----------------------------------- |
| Baseline | Pipeline parallelism, eval every 4 layers | 0.270 tok/s | — |
| Lazy decode | Pipeline, eval only at rank boundary | 0.259 tok/s | Eval overhead is NOT the bottleneck |
| Aggressive eval | Pipeline, eval every 39 layers (once per rank) | 0.259 tok/s | Confirms eval is irrelevant |
| Tensor parallelism | model.shard(), both ranks run all 78 layers | 0.270 tok/s | Parallelism strategy doesn't matter |
| RDMA vs TCP | JACCL RDMA backend vs ring TCP | No difference | Network is NOT the bottleneck |
All tests converge on 0.26-0.27 tok/s regardless of configuration.
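For context on how the numbers above are measured: decode tok/s is generated tokens over wall-clock time after the first token. A minimal single-node harness along those lines (prompt, token budget, and the small model are arbitrary; stream_generate is the standard mlx_lm streaming entry point):

```python
import time
from mlx_lm import load, stream_generate

# Small model used only to illustrate the measurement; the GLM runs load and
# pipeline the model inside the distributed script instead of calling load().
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

prompt = "Explain pipeline parallelism in two sentences."
t_first, n_tokens = None, 0

for _ in stream_generate(model, tokenizer, prompt, max_tokens=64):
    if t_first is None:
        t_first = time.perf_counter()   # first token marks the end of prefill
    n_tokens += 1

decode_time = time.perf_counter() - t_first
print(f"decode: {(n_tokens - 1) / decode_time:.2f} tok/s")
```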
Key Observations
1. Small model works perfectly: mlx-community/Llama-3.2-3B-Instruct-4bit runs at 329 tok/s on the same cluster.
2. RSS is correct: ~180GB per rank for pipeline. No OOM or memory pressure (a cross-check snippet follows this list).
3. Metal GPU timeout is NOT configurable on macOS — no sysctl, env var, or boot arg can extend the ~45s limit. The CPU barrier + per-layer eval is required.
4. The 0.27 tok/s appears to be a hard floor — something structural pins the speed regardless of parallelism, eval frequency, or network transport.
5. MXFP4 is the only variable we cannot A/B test — the only other GLM-5.1 quant (inferencerlabs/GLM-5.1-MLX-4.8bit) uses a custom architecture incompatible with standard mlx_lm.
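A trivial way to cross-check observation 2 against MLX's own allocator counters, assuming a recent MLX where the memory API lives at the top level (older releases expose the same calls under mx.metal):

```python
import mlx.core as mx

# Print per-rank allocator stats after the weights are loaded and pipelined.
rank = mx.distributed.init().rank()
print(f"rank {rank}: "
      f"active = {mx.get_active_memory() / 2**30:.1f} GiB, "
      f"peak = {mx.get_peak_memory() / 2**30:.1f} GiB")
```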
Hypothesis
The MXFP4 (microscaling FP4) quantization format may lack optimized Metal kernels for the matmul patterns used by DeepSeek-V3.2's MoE architecture (256 experts, gated routing). The dequantization overhead per matmul may dominate compute time.
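One way to probe this in isolation would be to time a single expert-shaped quantized matmul in both formats. The sketch below uses illustrative shapes, and the mode keyword on mx.quantize / mx.quantized_matmul (with the mxfp4 result carrying no biases) is an assumption based on the MLX version in use, so the signatures should be checked against the installed build.

```python
import time
import mlx.core as mx

def bench(mode, group_size, d_in=4096, d_out=1536, iters=100):
    """Average seconds per quantized matmul for one decode token."""
    x = mx.random.normal((1, d_in)).astype(mx.bfloat16)
    w = mx.random.normal((d_out, d_in)).astype(mx.bfloat16)
    # Affine quantization returns (wq, scales, biases); mxfp4 is assumed to
    # return (wq, scales) only, hence the flexible unpacking.
    wq, scales, *rest = mx.quantize(w, group_size=group_size, bits=4, mode=mode)
    biases = rest[0] if rest else None

    def step():
        return mx.quantized_matmul(x, wq, scales, biases, transpose=True,
                                   group_size=group_size, bits=4, mode=mode)

    mx.eval(step())                       # warm-up / kernel compile
    start = time.perf_counter()
    for _ in range(iters):
        mx.eval(step())
    return (time.perf_counter() - start) / iters

for mode, gs in [("affine", 64), ("mxfp4", 32)]:
    print(f"{mode:6s}: {bench(mode, gs) * 1e6:7.1f} us per matmul")
```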
The 15-16 tok/s benchmarks from Inferencer use proprietary custom Metal kernels and a modified MLX build — not reproducible with standard mlx_lm.
Questions
1. Are there known performance limitations with MXFP4 matmul kernels on Apple Silicon, particularly for MoE gated dispatch patterns?
2. Is there a standard mlx_lm-compatible quantization of GLM-5.1 that uses integer quantization (Q4/Q8) instead of MXFP4?
3. Are MXFP4 kernel optimizations on the roadmap?
4. Is the CPU barrier + per-layer eval pattern the recommended workaround for Metal GPU timeout on large models, or is there a better approach?