
[Bug] Quantized Gemma 4 26B MoE (128 experts) produces garbage output on base M4 chip (10 GPU cores) #3393

@avlp12

Description


Quantized Gemma 4 26B-A4B models (MoE, 128 experts top-8) produce garbage output on the base M4 chip (10 GPU cores), while working correctly on M3 Ultra (80 GPU cores). The issue affects all quantized variants of this model — including community PLE-safe models — but does not affect:

  • Dense quantized models (Qwen2.5 0.5B 4-bit)
  • Gemma 4 dense bf16 (E2B)
  • Other MoE models with fewer experts (Qwen3 30B-A3B, 64 experts, 4-bit)

This suggests a bug in the gather_mm Metal kernel dispatch when handling 128-expert quantized MoE on GPUs with a low core count.

Reproduction

Tested on M4 Mac Mini (10 GPU cores, 24 GB, macOS 26.3.1). Clean venv, pip-only install:

python3.13 -m venv ~/gemma4-test
source ~/gemma4-test/bin/activate
pip install mlx==0.31.1 mlx-lm==0.31.2 mlx-vlm==0.4.4

1. WORKS — Dense quantized

python3 -m mlx_lm generate \
  --model mlx-community/Qwen2.5-0.5B-Instruct-4bit \
  --prompt 'Hello, what is 2+2?' --max-tokens 20

Output: "2+2 equals 4." ✅

2. WORKS — MoE 64 experts, quantized

python3 -m mlx_lm generate \
  --model mlx-community/Qwen3-30B-A3B-4bit \
  --prompt 'Hello' --max-tokens 20

Output: "<think>Okay, the user is asking..." ✅ (51 tok/s)

3. WORKS — Gemma 4 dense bf16

python3 -m mlx_vlm generate \
  --model mlx-community/gemma-4-e2b-it-bf16 \
  --prompt 'What is the capital of Korea?' --max-tokens 20

Output: "The capital of Korea is Seoul." ✅

4. FAILS — Gemma 4 MoE 128 experts, quantized

python3 -c "
from mlx_lm import load, generate
model, tok = load('FakeRockert543/gemma-4-26b-a4b-it-MLX-4bit')
r = generate(model, tok, prompt='What is the capital of Korea?', max_tokens=20, verbose=True)
"

Output: "는가가?는가?는가?는가?는가?는가?" ❌ GARBAGE

28.6 tok/s, peak 15.3 GB

The same model files (MD5 verified) produce correct output on M3 Ultra (80 GPU cores):

# Same command on M3 Ultra:
# Output: "The capital of Korea is **Seoul**." ✅
# 107 tok/s
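For anyone reproducing this across machines, a hypothetical helper to flag the degenerate output automatically (so a script can compare both machines without eyeballing the text) might look like:

```python
def looks_degenerate(text, min_repeats=4):
    """Return True if some short substring repeats back-to-back
    at least `min_repeats` times, e.g. '는가?는가?는가?는가?'."""
    for n in range(1, 9):                       # candidate period lengths
        for i in range(len(text) - n * min_repeats + 1):
            unit = text[i:i + n]
            if text[i:i + n * min_repeats] == unit * min_repeats:
                return True
    return False

print(looks_degenerate("는가?는가?는가?는가?는가?"))        # True
print(looks_degenerate("The capital of Korea is Seoul."))  # False
```

This only catches the short-period repetition seen above, not other failure modes, but it is enough to script a pass/fail matrix run.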

Cross-Validation Matrix

| Model | Architecture | Experts | Quantization | M3 Ultra (80 cores) | M4 base (10 cores) |
|---|---|---|---|---|---|
| Qwen2.5 0.5B | Dense | – | 4-bit | ✅ | ✅ |
| Gemma 4 E2B | Dense | – | bf16 | ✅ | ✅ |
| Qwen3 30B-A3B | MoE | 64 top-8 | 4-bit | ✅ | ✅ |
| Gemma 4 26B-A4B | MoE | 128 top-8 | 4-bit | ✅ | ❌ |
| Gemma 4 26B-A4B | MoE | 128 top-8 | mixed 2/3/4-bit | ✅ | ❌ |

Hypotheses Tested and Rejected

Before concluding this is a hardware-specific kernel issue, we systematically ruled out:

  1. File corruption — MD5 hashes match between M3 and M4
  2. MLX version — Both 0.31.1
  3. Python version — Tested 3.13 and 3.14, both fail
  4. transformers version — 5.5.2 and 5.5.3, both fail
  5. Package conflicts — Clean venv with pip-only install, still fails
  6. LoRA adapter / KV cache — Base model without any options, still fails
  7. macOS version — Both 26.3.1
  8. Metal version — Both Metal 4
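The checksum comparison in (1) can be reproduced with a short portable script; the Hugging Face cache path below is an assumption, adjust to wherever the shards live on your machine:

```python
import hashlib
import pathlib

def md5_of(path, chunk_size=1 << 20):
    """Stream a file through MD5 so multi-GB shards don't load into RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical cache root; run on both machines and diff the output.
cache = pathlib.Path("~/.cache/huggingface/hub").expanduser()
for shard in sorted(cache.rglob("*.safetensors")):
    print(shard.name, md5_of(shard))
```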

MLX basic operations (quantize, dequantize, quantized_matmul) work correctly on M4:

import mlx.core as mx
w = mx.random.normal((64, 64))
q, s, b = mx.quantize(w, bits=2, group_size=64)
r = mx.quantized_matmul(mx.random.normal((1, 64)), q, s, b, transpose=True, group_size=64, bits=2)
# Works fine on M4

Suspected Root Cause

Gemma 4 26B uses SwitchGLU with mx.gather_mm for 128-expert top-8 routing. The Metal kernel for this operation appears to malfunction specifically on the base M4's 10-core GPU. The 128-expert gather pattern may create threadgroup dispatch configurations that are incorrect at low GPU core counts.

Key evidence:

  • Qwen3 MoE (64 experts) works on M4 → expert count matters
  • Gemma 4 bf16 works on M4 → quantization + MoE combination triggers it
  • FakeRockert543 (the model uploader) tested on M4 Max (40 cores) successfully → core count matters

Environment

Machine A (Working)

Chip:    Apple M3 Ultra
GPU:     80 cores, Metal 4
Memory:  512 GB
macOS:   26.3.1 (25D771280a)
MLX:     0.31.1
mlx_lm:  0.31.2

Machine B (Broken)

Chip:    Apple M4
GPU:     10 cores, Metal 4
Memory:  24 GB
macOS:   26.3.1 (25D2128)
MLX:     0.31.1
mlx_lm:  0.31.2

Untested Chips

  • M4 Pro (20 cores) — would help isolate the core-count threshold
  • M3 base (10 cores) — would confirm if it's M4-specific or core-count-specific
  • M1/M2 variants

