
MXFP4 quantization extremely slow on distributed GLM-5.1 (DeepSeek-V3.2 arch) — 0.27 tok/s vs 16 tok/s expected #3402

@playtheace

Description

Summary

Running mlx-community/GLM-5.1-MXFP4-Q8 (378GB, 78 layers, 256 MoE experts) distributed across two M3 Ultra 256GB Mac Studios yields a decode speed of only 0.27 tok/s, while single-node benchmarks on 512GB machines report 15-16 tok/s. We've systematically eliminated every variable except the MXFP4 matmul kernel performance.

Environment

  • Hardware: 2× M3 Ultra Mac Studio, 256GB unified memory each
  • Connection: Thunderbolt 5 direct link (also tested Wi-Fi — no difference)
  • macOS: 26.4
  • MLX: 0.31.1
  • mlx_lm: 0.31.2
  • Python: 3.11.15
  • Model: mlx-community/GLM-5.1-MXFP4-Q8 (~378GB)
  • Architecture: DeepSeek-V3.2 (DeepseekV32Model)

Reproduction

Two-node distributed launch via:

mlx.launch --hostfile hostfile.json --backend jaccl script.py

The script uses model.model.pipeline(g) with a monkey-patched DeepseekV32Model.__call__ that adds:

1. CPU barrier before all_gather: mx.eval(mx.distributed.all_sum(mx.array(1.0), stream=mx.cpu)) — prevents idle-rank Metal GPU timeout
2. Per-layer mx.eval(h) to keep Metal command buffers under the ~45s AGX watchdog limit

Without the monkey-patch, inference crashes with MTLCommandBuffer Execution Error - GPU Timeout.
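
For reference, the two workarounds can be factored into small helpers along these lines. This is a sketch of the pattern, not the exact patch: the real replacement __call__ also has to preserve DeepseekV32Model's mask handling, pipeline start/end bookkeeping, and the final all_gather, and the layer(x, mask, cache) call signature is an assumption.

```python
import mlx.core as mx

def cpu_barrier():
    # All-rank barrier executed on the CPU stream. Keeping the collective off
    # the GPU means the idle rank never sits in a long Metal wait that trips
    # the ~45s AGX watchdog. (The call itself is the one quoted above.)
    mx.eval(mx.distributed.all_sum(mx.array(1.0), stream=mx.cpu))

def run_layers(layers, h, mask=None, caches=None, eval_every=1):
    # Apply decoder layers, forcing evaluation every `eval_every` layers so no
    # single Metal command buffer grows past the watchdog limit.
    # Assumed layer signature: layer(x, mask, cache).
    for i, layer in enumerate(layers):
        cache = caches[i] if caches is not None else None
        h = layer(h, mask, cache)
        if (i + 1) % eval_every == 0:
            mx.eval(h)
    return h
```

In the patched __call__, cpu_barrier() runs immediately before the pipeline all_gather, and run_layers replaces the plain layer loop.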

Systematic Testing

We ran 5 controlled tests to isolate the bottleneck:

| Test               | Config                                         | Result        | Conclusion                          |
| ------------------ | ---------------------------------------------- | ------------- | ----------------------------------- |
| Baseline           | Pipeline parallelism, eval every 4 layers      | 0.270 tok/s   ||
| Lazy decode        | Pipeline, eval only at rank boundary           | 0.259 tok/s   | Eval overhead is NOT the bottleneck |
| Aggressive eval    | Pipeline, eval every 39 layers (once per rank) | 0.259 tok/s   | Confirms eval is irrelevant         |
| Tensor parallelism | model.shard(), both ranks run all 78 layers    | 0.270 tok/s   | Parallelism strategy doesn't matter |
| RDMA vs TCP        | JACCL RDMA backend vs ring TCP                 | No difference | Network is NOT the bottleneck       |

All tests converge on 0.26-0.27 tok/s regardless of configuration.
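
Decode throughput for each row can be measured with a loop along these lines (a sketch: the prompt, lazy=True load, and timing loop are illustrative additions; model.model.pipeline(g) and the model path are from the reproduction above, and the watchdog monkey-patch is assumed to already be applied):

```python
import time
import mlx.core as mx
from mlx_lm import load, stream_generate

group = mx.distributed.init()

# Load lazily, then pipeline across ranks as in the reproduction script.
model, tokenizer = load("mlx-community/GLM-5.1-MXFP4-Q8", lazy=True)
model.model.pipeline(group)
mx.eval(model.parameters())

def decode_tps(prompt, max_tokens=64):
    # Rough decode throughput: start timing after the first token so prompt
    # processing (prefill) is excluded from the tok/s figure.
    n, start = 0, None
    for _ in stream_generate(model, tokenizer, prompt, max_tokens=max_tokens):
        if start is None:
            start = time.perf_counter()
        else:
            n += 1
    return n / (time.perf_counter() - start)

# Every rank must run the generation; only rank 0 reports.
tps = decode_tps("Explain microscaling FP4 in one sentence.")
if group.rank() == 0:
    print(f"decode: {tps:.3f} tok/s")
```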

Key Observations

1. Small model works perfectly: mlx-community/Llama-3.2-3B-Instruct-4bit runs at 329 tok/s on the same cluster.
2. RSS is correct: ~180GB per rank for pipeline. No OOM or memory pressure.
3. Metal GPU timeout is NOT configurable on macOS — no sysctl, env var, or boot arg can extend the ~45s limit. The CPU barrier + per-layer eval is required.
4. The 0.27 tok/s appears to be a hard floor — something structural pins the speed regardless of parallelism, eval frequency, or network transport.
5. MXFP4 is the only variable we cannot A/B test — the only other GLM-5.1 quant (inferencerlabs/GLM-5.1-MLX-4.8bit) uses a custom architecture incompatible with standard mlx_lm.

Hypothesis

The MXFP4 (microscaling FP4) quantization format may lack optimized Metal kernels for the matmul patterns used by DeepSeek-V3.2's MoE architecture (256 experts, gated routing). The dequantization overhead per matmul may dominate compute time.

The 15-16 tok/s benchmarks from Inferencer use proprietary custom Metal kernels and a modified MLX build — not reproducible with standard mlx_lm.
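
One way to probe the kernel hypothesis without a second full quantization of the model is a kernel-level A/B test of mx.quantized_matmul. The sketch below assumes the mode="mxfp4" option on mx.quantize / mx.quantized_matmul available in recent MLX releases, and uses illustrative matrix shapes rather than the real expert dimensions; it times a single decode-shaped matmul in MXFP4 against affine 4-bit.

```python
import time
import mlx.core as mx

def bench_qmm(mode, group_size, bits=4, d=7168, n=4096, iters=100):
    # Decode-shaped quantized matmul: one token's activations against a single
    # (n, d) weight matrix. Shapes are illustrative, not the real expert dims.
    x = mx.random.normal((1, d))
    w = mx.random.normal((n, d))
    q = mx.quantize(w, group_size=group_size, bits=bits, mode=mode)
    w_q, scales = q[0], q[1]
    biases = q[2] if len(q) > 2 else None   # affine mode also returns biases
    mx.eval(x, *q)

    def step():
        return mx.quantized_matmul(
            x, w_q, scales, biases,
            transpose=True, group_size=group_size, bits=bits, mode=mode,
        )

    mx.eval(step())                          # warm up / compile kernels
    tic = time.perf_counter()
    for _ in range(iters):
        mx.eval(step())
    return (time.perf_counter() - tic) / iters * 1e3  # ms per matmul

print(f"affine 4-bit: {bench_qmm('affine', group_size=64):.3f} ms")
print(f"mxfp4       : {bench_qmm('mxfp4', group_size=32):.3f} ms")
```

If the MXFP4 kernel is substantially slower per matmul than the affine kernel at the same bit width, that would support the dequantization-overhead hypothesis; comparable times would point elsewhere.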

Questions

1. Are there known performance limitations with MXFP4 matmul kernels on Apple Silicon, particularly for MoE gated dispatch patterns?
2. Is there a standard mlx_lm-compatible quantization of GLM-5.1 that uses integer quantization (Q4/Q8) instead of MXFP4?
3. Are MXFP4 kernel optimizations on the roadmap?
4. Is the CPU barrier + per-layer eval pattern the recommended workaround for Metal GPU timeout on large models, or is there a better approach?
