## Description
Quantized Gemma 4 26B-A4B models (MoE, 128 experts top-8) produce garbage output on the base M4 chip (10 GPU cores), while working correctly on M3 Ultra (80 GPU cores). The issue affects all quantized variants of this model — including community PLE-safe models — but does not affect:
- Dense quantized models (Qwen2.5 0.5B 4-bit)
- Gemma 4 dense bf16 (E2B)
- Other MoE models with fewer experts (Qwen3 30B-A3B, 64 experts, 4-bit)
This suggests a bug in the `gather_mm` Metal kernel dispatch when handling 128-expert quantized MoE on GPUs with a low core count.
## Reproduction
Tested on M4 Mac Mini (10 GPU cores, 24 GB, macOS 26.3.1). Clean venv, pip-only install:
```bash
python3.13 -m venv ~/gemma4-test
source ~/gemma4-test/bin/activate
pip install mlx==0.31.1 mlx-lm==0.31.2 mlx-vlm==0.4.4
```
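To rule out stale or shadowed wheels, the resolved versions can be confirmed with the standard library alone (nothing MLX-specific assumed here):

```python
# Confirm the versions pip actually resolved match the pins above.
from importlib.metadata import version

for pkg in ("mlx", "mlx-lm", "mlx-vlm"):
    print(pkg, version(pkg))
```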
### 1. WORKS — Dense quantized

```bash
python3 -m mlx_lm generate \
  --model mlx-community/Qwen2.5-0.5B-Instruct-4bit \
  --prompt 'Hello, what is 2+2?' --max-tokens 20
```
Output: "2+2 equals 4." ✅
### 2. WORKS — MoE 64 experts, quantized

```bash
python3 -m mlx_lm generate \
  --model mlx-community/Qwen3-30B-A3B-4bit \
  --prompt 'Hello' --max-tokens 20
```
Output: "<think>Okay, the user is asking..." ✅ (51 tok/s)
### 3. WORKS — Gemma 4 dense bf16

```bash
python3 -m mlx_vlm generate \
  --model mlx-community/gemma-4-e2b-it-bf16 \
  --prompt 'What is the capital of Korea?' --max-tokens 20
```
Output: "The capital of Korea is Seoul." ✅
### 4. FAILS — Gemma 4 MoE 128 experts, quantized

```bash
python3 -c "
from mlx_lm import load, generate
model, tok = load('FakeRockert543/gemma-4-26b-a4b-it-MLX-4bit')
r = generate(model, tok, prompt='What is the capital of Korea?', max_tokens=20, verbose=True)
"
```
Output: "는가가?는가?는가?는가?는가?는가?" ❌ GARBAGE
28.6 tok/s, peak 15.3 GB
The same model files (MD5 verified) produce correct output on M3 Ultra (80 GPU cores):
```bash
# Same command on M3 Ultra:
# Output: "The capital of Korea is **Seoul**." ✅
# 107 tok/s
```
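To tell numerical corruption apart from a sampling/decoding problem, a single forward pass can be probed for NaN/Inf logits on the M4. This is a sketch: it assumes the model object returned by `mlx_lm.load` is directly callable on a batch of token ids, as mlx-lm models generally are, and it skips chat templating for brevity:

```python
import mlx.core as mx
from mlx_lm import load

model, tok = load('FakeRockert543/gemma-4-26b-a4b-it-MLX-4bit')
ids = mx.array([tok.encode('What is the capital of Korea?')])

# One forward pass, no sampling: kernel-level corruption usually surfaces
# here as NaN/Inf logits or a wildly wrong argmax on the first token.
logits = model(ids)
print("any NaN:", mx.isnan(logits).any().item())
print("any Inf:", mx.isinf(logits).any().item())
print("first predicted token:", tok.decode([logits[0, -1].argmax().item()]))
```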
## Cross-Validation Matrix
| Model | Architecture | Experts | Quantization | M3 Ultra (80 cores) | M4 base (10 cores) |
|---|---|---|---|---|---|
| Qwen2.5 0.5B | Dense | — | 4-bit | ✅ | ✅ |
| Gemma 4 E2B | Dense | — | bf16 | ✅ | ✅ |
| Qwen3 30B-A3B | MoE | 64 top-8 | 4-bit | ✅ | ✅ |
| Gemma 4 26B-A4B | MoE | 128 top-8 | 4-bit | ✅ | ❌ |
| Gemma 4 26B-A4B | MoE | 128 top-8 | mixed 2/3/4-bit | ✅ | ❌ |
## Hypotheses Tested and Rejected
Before concluding this is a hardware-specific kernel issue, we systematically ruled out:
- File corruption — MD5 hashes match between M3 and M4 (reproduced in the sketch after this list)
- MLX version — Both 0.31.1
- Python version — Tested 3.13 and 3.14, both fail
- transformers version — 5.5.2 and 5.5.3, both fail
- Package conflicts — Clean venv with pip-only install, still fails
- LoRA adapter / KV cache — Base model without any options, still fails
- macOS version — Both 26.3.1
- Metal version — Both Metal 4
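For completeness, the hash comparison behind the first bullet can be reproduced with a chunked stdlib hasher; the model directory path below is hypothetical and should point at the local snapshot on each machine:

```python
import hashlib
from pathlib import Path

def md5sum(path, chunk_size=1 << 20):
    # Hash in 1 MiB chunks so multi-GB safetensors shards fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    return h.hexdigest()

model_dir = Path("~/models/gemma-4-26b-a4b-it-MLX-4bit").expanduser()  # hypothetical path
for f in sorted(model_dir.rglob("*.safetensors")):
    print(md5sum(f), f.name)
```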
MLX basic operations (`quantize`, `dequantize`, `quantized_matmul`) work correctly on M4:

```python
import mlx.core as mx

# 2-bit quantization with one group per 64-element row.
w = mx.random.normal((64, 64))
q, s, b = mx.quantize(w, bits=2, group_size=64)

# Plain (non-gather) quantized matmul: works fine on M4.
r = mx.quantized_matmul(mx.random.normal((1, 64)), q, s, b,
                        transpose=True, group_size=64, bits=2)
```
## Suspected Root Cause
Gemma 4 26B uses `SwitchGLU` with `mx.gather_mm` for 128-expert top-8 routing. The Metal kernel for this operation appears to malfunction specifically on the base M4's 10-core GPU. The 128-expert gather pattern may produce threadgroup dispatch configurations that are incorrect at low GPU core counts (a standalone sketch isolating this path follows the evidence list).
Key evidence:
- Qwen3 MoE (64 experts) works on M4 → expert count matters
- Gemma 4 bf16 works on M4 → quantization + MoE combination triggers it
- FakeRockert543 tested on M4 Max (40 cores) successfully → core count matters
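If this hypothesis is right, the failure should reproduce without loading the full model. The sketch below exercises the quantized gather path directly with a 128-expert, top-8 shape and compares it against a dequantized `gather_mm` reference. It assumes the quantized expert matmul dispatches through `mx.gather_qmm`, as mlx-lm's quantized switch layers do; all dimensions are illustrative, not Gemma 4's actual sizes:

```python
import mlx.core as mx

num_experts, top_k, d_in, d_out = 128, 8, 256, 256

# Stack of per-expert weight matrices, quantized like the 4-bit model.
w = mx.random.normal((num_experts, d_out, d_in))
qw, scales, biases = mx.quantize(w, group_size=64, bits=4)

# top_k routed tokens, each hitting one randomly chosen expert.
x = mx.random.normal((top_k, 1, d_in))
indices = mx.random.randint(0, num_experts, (top_k,))

# Quantized gather matmul (the suspected kernel).
out_q = mx.gather_qmm(x, qw, scales, biases, rhs_indices=indices,
                      transpose=True, group_size=64, bits=4)

# Reference: dequantize, then use the unquantized gather_mm path.
w_ref = mx.dequantize(qw, scales, biases, group_size=64, bits=4)
out_ref = mx.gather_mm(x, mx.swapaxes(w_ref, -1, -2), rhs_indices=indices)

# Large deviations on the M4 (but not the M3 Ultra) would implicate the kernel.
print(mx.abs(out_q - out_ref).max())
```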
## Environment
### Machine A (Working)

```
Chip: Apple M3 Ultra
GPU: 80 cores, Metal 4
Memory: 512 GB
macOS: 26.3.1 (25D771280a)
MLX: 0.31.1
mlx_lm: 0.31.2
```
### Machine B (Broken)

```
Chip: Apple M4
GPU: 10 cores, Metal 4
Memory: 24 GB
macOS: 26.3.1 (25D2128)
MLX: 0.31.1
mlx_lm: 0.31.2
```
## Untested Chips
- M4 Pro (20 cores) — would help isolate the core-count threshold
- M3 base (10 cores) — would confirm if it's M4-specific or core-count-specific
- M1/M2 variants
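To make reports from those chips comparable, a small collection script could gather the fields used in the matrix above; `system_profiler` is the stock macOS tool that reports the GPU core count:

```python
import platform
import subprocess
from importlib.metadata import version

print("macOS:", platform.mac_ver()[0])
for pkg in ("mlx", "mlx-lm"):
    print(pkg, version(pkg))

# On Apple silicon, system_profiler reports "Chipset Model" and
# "Total Number of Cores" for the GPU.
out = subprocess.run(["system_profiler", "SPDisplaysDataType"],
                     capture_output=True, text=True).stdout
for line in out.splitlines():
    if "Chipset Model" in line or "Total Number of Cores" in line:
        print(line.strip())
```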
## Related Issues