## Description
Quantized Gemma 4 26B-A4B models (MoE, 128 experts top-8) produce garbage output on the base M4 chip (10 GPU cores), while working correctly on M3 Ultra (80 GPU cores). The issue affects all quantized variants of this model — including community PLE-safe models — but does not affect:
- Dense quantized models (Qwen2.5 0.5B 4-bit)
- Gemma 4 dense bf16 (E2B)
- Other MoE models with fewer experts (Qwen3 30B-A3B, 64 experts, 4-bit)
This suggests a bug in the `gather_mm` Metal kernel dispatch when handling 128-expert quantized MoE on GPUs with a low core count.
## Reproduction
Tested on M4 Mac Mini (10 GPU cores, 24 GB, macOS 26.3.1). Clean venv, pip-only install:
```bash
python3.13 -m venv ~/gemma4-test
source ~/gemma4-test/bin/activate
pip install mlx==0.31.1 mlx-lm==0.31.2 mlx-vlm==0.4.4
```
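To rule out stale or shadowed wheels, the resolved versions can be confirmed with the standard library alone (nothing MLX-specific assumed here):

```python
# Confirm the versions pip actually resolved match the pins above.
from importlib.metadata import version

for pkg in ("mlx", "mlx-lm", "mlx-vlm"):
    print(pkg, version(pkg))
```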
### 1. WORKS — Dense quantized

```bash
python3 -m mlx_lm generate \
  --model mlx-community/Qwen2.5-0.5B-Instruct-4bit \
  --prompt 'Hello, what is 2+2?' --max-tokens 20
```
Output: "2+2 equals 4." ✅
### 2. WORKS — MoE 64 experts, quantized

```bash
python3 -m mlx_lm generate \
  --model mlx-community/Qwen3-30B-A3B-4bit \
  --prompt 'Hello' --max-tokens 20
```
Output: "<think>Okay, the user is asking..." ✅ (51 tok/s)
### 3. WORKS — Gemma 4 dense bf16

```bash
python3 -m mlx_vlm generate \
  --model mlx-community/gemma-4-e2b-it-bf16 \
  --prompt 'What is the capital of Korea?' --max-tokens 20
```
Output: "The capital of Korea is Seoul." ✅
### 4. FAILS — Gemma 4 MoE 128 experts, quantized

```bash
python3 -c "
from mlx_lm import load, generate
model, tok = load('FakeRockert543/gemma-4-26b-a4b-it-MLX-4bit')
r = generate(model, tok, prompt='What is the capital of Korea?', max_tokens=20, verbose=True)
"
```
Output: "는가가?는가?는가?는가?는가?는가?" ❌ GARBAGE
28.6 tok/s, peak 15.3 GB
The same model files (MD5 verified) produce correct output on M3 Ultra (80 GPU cores):
```bash
# Same command on M3 Ultra:
# Output: "The capital of Korea is **Seoul**." ✅
# 107 tok/s
```
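To tell numerical corruption apart from a sampling/decoding problem, a single forward pass can be probed for NaN/Inf logits on the M4. This is a sketch: it assumes the model object returned by `mlx_lm.load` is directly callable on a batch of token ids, as mlx-lm models generally are, and it skips chat templating for brevity:

```python
import mlx.core as mx
from mlx_lm import load

model, tok = load('FakeRockert543/gemma-4-26b-a4b-it-MLX-4bit')
ids = mx.array([tok.encode('What is the capital of Korea?')])

# One forward pass, no sampling: kernel-level corruption usually surfaces
# here as NaN/Inf logits or a wildly wrong argmax on the first token.
logits = model(ids)
print("any NaN:", mx.isnan(logits).any().item())
print("any Inf:", mx.isinf(logits).any().item())
print("first predicted token:", tok.decode([logits[0, -1].argmax().item()]))
```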
## Cross-Validation Matrix
| Model | Architecture | Experts | Quantization | M3 Ultra (80 cores) | M4 base (10 cores) |
|---|---|---|---|---|---|
| Qwen2.5 0.5B | Dense | — | 4-bit | ✅ | ✅ |
| Gemma 4 E2B | Dense | — | bf16 | ✅ | ✅ |
| Qwen3 30B-A3B | MoE | 64 top-8 | 4-bit | ✅ | ✅ |
| Gemma 4 26B-A4B | MoE | 128 top-8 | 4-bit | ✅ | ❌ |
| Gemma 4 26B-A4B | MoE | 128 top-8 | mixed 2/3/4-bit | ✅ | ❌ |
## Hypotheses Tested and Rejected
Before concluding this is a hardware-specific kernel issue, we systematically ruled out:
- File corruption — MD5 hashes match between M3 and M4 (reproduced in the sketch after this list)
- MLX version — Both 0.31.1
- Python version — Tested 3.13 and 3.14, both fail
- transformers version — 5.5.2 and 5.5.3, both fail
- Package conflicts — Clean venv with pip-only install, still fails
- LoRA adapter / KV cache — Base model without any options, still fails
- macOS version — Both 26.3.1
- Metal version — Both Metal 4
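For completeness, the hash comparison behind the first bullet can be reproduced with a chunked stdlib hasher; the model directory path below is hypothetical and should point at the local snapshot on each machine:

```python
import hashlib
from pathlib import Path

def md5sum(path, chunk_size=1 << 20):
    # Hash in 1 MiB chunks so multi-GB safetensors shards fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            h.update(block)
    return h.hexdigest()

model_dir = Path("~/models/gemma-4-26b-a4b-it-MLX-4bit").expanduser()  # hypothetical path
for f in sorted(model_dir.rglob("*.safetensors")):
    print(md5sum(f), f.name)
```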
MLX basic operations (`quantize`, `dequantize`, `quantized_matmul`) work correctly on M4:

```python
import mlx.core as mx

# 2-bit quantization with one group per 64-element row.
w = mx.random.normal((64, 64))
q, s, b = mx.quantize(w, bits=2, group_size=64)

# Plain (non-gather) quantized matmul: works fine on M4.
r = mx.quantized_matmul(mx.random.normal((1, 64)), q, s, b,
                        transpose=True, group_size=64, bits=2)
```
## Suspected Root Cause
Gemma 4 26B uses `SwitchGLU` with `mx.gather_mm` for 128-expert top-8 routing. The Metal kernel for this operation appears to malfunction specifically on the base M4's 10-core GPU. The 128-expert gather pattern may produce threadgroup dispatch configurations that are incorrect at low GPU core counts (a standalone sketch isolating this path follows the evidence list).
Key evidence:
- Qwen3 MoE (64 experts) works on M4 → expert count matters
- Gemma 4 bf16 works on M4 → quantization + MoE combination triggers it
- FakeRockert543 tested on M4 Max (40 cores) successfully → core count matters
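If this hypothesis is right, the failure should reproduce without loading the full model. The sketch below exercises the quantized gather path directly with a 128-expert, top-8 shape and compares it against a dequantized `gather_mm` reference. It assumes the quantized expert matmul dispatches through `mx.gather_qmm`, as mlx-lm's quantized switch layers do; all dimensions are illustrative, not Gemma 4's actual sizes:

```python
import mlx.core as mx

num_experts, top_k, d_in, d_out = 128, 8, 256, 256

# Stack of per-expert weight matrices, quantized like the 4-bit model.
w = mx.random.normal((num_experts, d_out, d_in))
qw, scales, biases = mx.quantize(w, group_size=64, bits=4)

# top_k routed tokens, each hitting one randomly chosen expert.
x = mx.random.normal((top_k, 1, d_in))
indices = mx.random.randint(0, num_experts, (top_k,))

# Quantized gather matmul (the suspected kernel).
out_q = mx.gather_qmm(x, qw, scales, biases, rhs_indices=indices,
                      transpose=True, group_size=64, bits=4)

# Reference: dequantize, then use the unquantized gather_mm path.
w_ref = mx.dequantize(qw, scales, biases, group_size=64, bits=4)
out_ref = mx.gather_mm(x, mx.swapaxes(w_ref, -1, -2), rhs_indices=indices)

# Large deviations on the M4 (but not the M3 Ultra) would implicate the kernel.
print(mx.abs(out_q - out_ref).max())
```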
## Environment
### Machine A (Working)

```
Chip: Apple M3 Ultra
GPU: 80 cores, Metal 4
Memory: 512 GB
macOS: 26.3.1 (25D771280a)
MLX: 0.31.1
mlx_lm: 0.31.2
```
### Machine B (Broken)

```
Chip: Apple M4
GPU: 10 cores, Metal 4
Memory: 24 GB
macOS: 26.3.1 (25D2128)
MLX: 0.31.1
mlx_lm: 0.31.2
```
## Untested Chips
- M4 Pro (20 cores) — would help isolate the core-count threshold
- M3 base (10 cores) — would confirm if it's M4-specific or core-count-specific
- M1/M2 variants
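To make reports from those chips comparable, a small collection script could gather the fields used in the matrix above; `system_profiler` is the stock macOS tool that reports the GPU core count:

```python
import platform
import subprocess
from importlib.metadata import version

print("macOS:", platform.mac_ver()[0])
for pkg in ("mlx", "mlx-lm"):
    print(pkg, version(pkg))

# On Apple silicon, system_profiler reports "Chipset Model" and
# "Total Number of Cores" for the GPU.
out = subprocess.run(["system_profiler", "SPDisplaysDataType"],
                     capture_output=True, text=True).stdout
for line in out.splitlines():
    if "Chipset Model" in line or "Total Number of Cores" in line:
        print(line.strip())
```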
## Related Issues