4-bit `gather_qmm` weight-reuse GEMM tops out ~80 GB/s at small MoE M on M5 Pro - is tile tuning of `gather_qmm_rhs_nax` feasible? #3691

programVeins · 2026-06-15T10:59:45Z


Measured-findings discussion, not a bug.

On an M5 Pro (48 GB, macOS 27 beta), single-stream 4-bit MoE is bottlenecked by streaming expert weights, and the weight-reuse GEMM at the M the MoE actually uses (~16 rows/expert) streams at only ~80 GB/s - vs ~220 GB/s for qmv (M≤2) and ~263 GB/s elementwise on the same machine. Motivating workload: diffusiongemma-26B-A4B-it-4bit (128 experts/top-8, 256-token canvas → M ≈ 256·8/128 ≈ 16 rows/expert; ~12.8 GB of 4-bit weights read per denoising step). MLX@a6ec712 with the #3632 kernel-name fix applied so the NAX gather path loads.
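A rough sketch of the arithmetic behind these figures, in plain Python with no MLX dependency. The token, expert, and bandwidth numbers are the ones quoted above; the "pure streaming" model (a step does nothing but read the 12.8 GB of weights at one fixed rate, with tokens spread evenly over experts) is a simplifying assumption, so the times are floors, not predictions.

```python
# Figures quoted in the post; even routing and pure streaming are assumptions.
canvas_tokens, top_k, num_experts = 256, 8, 128
rows_per_expert = canvas_tokens * top_k / num_experts  # average M seen by each expert

weights_gb_per_step = 12.8  # 4-bit weights read per denoising step
measured_gbps = {"qmm at M~16": 80, "qmv (M<=2)": 220, "elementwise": 263}

def step_time_floor_ms(gb, gbps):
    """Lower bound on step time if the step only streamed `gb` gigabytes at `gbps`."""
    return gb / gbps * 1e3

print(f"rows/expert ~ {rows_per_expert:.0f}")
for name, gbps in measured_gbps.items():
    floor_ms = step_time_floor_ms(weights_gb_per_step, gbps)
    print(f"{name:12s}: >= {floor_ms:5.1f} ms/step")
```

Under this model the gap between the staged rate and the `qmv` rate is what sets the per-step floor, which is why the question is about the ~80 GB/s plateau and not about compute.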

A 4-bit quantized_matmul weight-reuse sweep (fixed weights, vary M) gives a clean two-regime curve. The cliff lands exactly where reuse starts forcing threadgroup staging, and the plateau is precisely the M the MoE needs:

| M (rows reused/weight) | GB/s | kernel | note |
|---|---|---|---|
| 1 | 220 | `qmv` | no reuse - direct stream |
| 2 | 190 | `qmv` | |
| 4 | 107 | transition | reuse begins |
| 8 | 56 | `qmm` | staging begins |
| 16 | 80 | `qmm` (incl. NAX `matmul2d`) | the MoE's M |
| 32 / 64 | 84 / 80 | `qmm` | staged plateau |

import time
import mlx.core as mx

def bench(fn, n=10, warm=3):
    """Mean wall-clock seconds per call of fn, after `warm` untimed warm-up runs."""
    for _ in range(warm):
        mx.eval(fn())
    mx.synchronize()
    t = time.perf_counter()
    for _ in range(n):
        mx.eval(fn())
    mx.synchronize()
    return (time.perf_counter() - t) / n

H, NB = 2816, 1408 * 128  # K = 2816 input features; 1408*128 weight rows
w = mx.random.normal((NB, H)).astype(mx.bfloat16)
wq, s, b = mx.quantize(w, group_size=64, bits=4)
nb = wq.nbytes + s.nbytes + b.nbytes  # bytes streamed per call: packed weights + scales + biases
for M in (1, 2, 4, 8, 16, 32, 64):
    x = mx.random.normal((1, M, H)).astype(mx.bfloat16)
    dt = bench(lambda: mx.quantized_matmul(x, wq, s, b, transpose=True, group_size=64, bits=4))
    print(f"M={M:3d}: {nb/dt/1e9:6.1f} GB/s")
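To make the GB/s column easier to interpret, here is a small sketch in plain Python (no MLX needed). It estimates the byte count the sweep streams per call and converts the table's rates into time per call. It assumes scales and biases are stored as 2-byte values, one of each per group of 64, which is what quantizing a bfloat16 matrix should produce; `quantized_bytes` is a hypothetical helper, not an MLX function. It also compares one M-row call against M separate single-row passes at the M=1 rate.

```python
def quantized_bytes(rows, cols, bits=4, group_size=64, scale_bytes=2):
    """Packed weight bytes plus one scale and one bias per group (2-byte values assumed)."""
    n = rows * cols
    packed = n * bits // 8
    scales_and_biases = 2 * (n // group_size) * scale_bytes
    return packed + scales_and_biases

nb = quantized_bytes(1408 * 128, 2816)  # same weight shape as the sweep
table = {1: 220, 2: 190, 4: 107, 8: 56, 16: 80, 32: 84, 64: 80}  # GB/s from the post

for M, gbps in table.items():
    ms_per_call = nb / (gbps * 1e9) * 1e3
    # Cost of serving the same M rows as M separate single-row passes at the M=1 rate.
    ms_if_repeated = M * nb / (table[1] * 1e9) * 1e3
    print(f"M={M:2d}: {ms_per_call:6.2f} ms/call, "
          f"{ms_if_repeated / ms_per_call:5.2f}x vs {M} single-row passes")
```

Read this way, the staged path still beats repeating `qmv` per row; the open question is how far below the direct-stream rate it has to sit.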

Important: the NAX tensor-op path is on this curve, not above it - `mpp::tensor_ops::matmul2d` (the `(16,32,16)` descriptor in `steel/gemm/nax.h`) is what's selected at M=16–64 and measures the same ~80 GB/s. So this isn't "the generic kernel is slow, NAX would fix it." I also tried 5 custom kernels (register-resident multi-row `qmv`, L2-weight-sharing, capacity-padded NAX gather, `simdgroup_half8x8`, a JIT `matmul2d`) to beat it; the best reached ~84 GB/s. The L2-sharing idea was disproven: simdgroups don't stay in lockstep, so each re-reads weights from DRAM, which is why reuse needs explicit staging. `gather_qmm_rhs_nax` hardcodes `bm=bn=bk=64, wm=wn=2` with a literal `// TODO: Tune the block sizes`.

Question: is small-M tile tuning of `gather_qmm_rhs_nax`/`gather_qmm_t_nax` tractable for these shapes (M≈16, E=128, K=2816, N∈{1408,704}, gs=64) - e.g. smaller `bm` + more N-tiling, or a fused dequant→`matmul2d` staging that keeps a staged weight row resident across more activation rows - or is ~80 GB/s the understood `matmul2d` staged ceiling on this generation? If it's the known ceiling, that's a useful answer too: it confirms single-stream 4-bit MoE on M5 is bandwidth-bound here and the real levers are lower-bit experts or multi-canvas batching, not kernel tuning. Happy to run any tile/shape sweep on M5 Pro and post numbers. (NAX path requires the #3632 bk32→bk64 fix to load - this is a +1 confirmation of that PR, not a competing fix.)
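For a sense of what "smaller `bm` + more N-tiling" changes on paper, here is a minimal tiling-arithmetic sketch in plain Python. It assumes M is padded up to a multiple of `bm` and that each `bm`×`bn` output tile maps to one threadgroup; both are assumptions about the kernel, not something verified against the source. Only the (64, 64) tile comes from the post; the other shapes are hypothetical candidates, and nothing here says whether `matmul2d` supports them or whether they would run faster.

```python
from math import ceil

def tile_stats(M, N, bm, bn):
    """Row utilization and tile count for one expert's M x N output, tiled bm x bn.

    Assumes M is padded up to a multiple of bm and one tile per threadgroup.
    """
    tiles_m, tiles_n = ceil(M / bm), ceil(N / bn)
    utilization = M / (tiles_m * bm)
    return utilization, tiles_m * tiles_n

M, E = 16, 128
for N in (1408, 704):
    # Only (64, 64) is the current hardcoded tile; the rest are hypothetical.
    for bm, bn in ((64, 64), (32, 64), (16, 64), (16, 32)):
        util, tiles = tile_stats(M, N, bm, bn)
        print(f"N={N:4d} bm={bm:2d} bn={bn:2d}: rows used {util:4.0%}, "
              f"{tiles:3d} tiles/expert, {tiles * E:5d} total")
```

The point of the sketch is only that at M≈16 a 64-row tile is mostly padding under these assumptions, so row utilization and threadgroup count are the two quantities a tile sweep would trade off.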
