Fix RoPE convention for Qwen3-MoE and other NEOX-family architectures by pekkah · Pull Request #7 · pekkah/SharpInference

pekkah · 2026-04-29T06:59:18Z

Summary

Fixes Issue #6 — Qwen3-Coder-30B-A3B (and any other Qwen/Phi/Gemma/Falcon model) generated <|endoftext|> or <|im_end|> as the first token on many short prompts. Originally suspected as a Q4_K_M quantization artifact, the actual cause was a RoPE convention mismatch.

SharpInference was applying LLaMA-style interleaved RoPE (pairs (2i, 2i+1)) to all architectures. Qwen2/Qwen3 (and Phi, Gemma, Falcon, etc.) require NEOX-style rotation (pairs offset by headDim/2). The wrong rotation produced subtly-incorrect attention output that compounded layer-by-layer until the residual landed in a degenerate region where the LM head predicted EOT with high confidence.

Changes

ModelHyperparams.IsNeoxRope — dispatched per architecture string from general.architecture GGUF metadata. NEOX list mirrors llama.cpp/src/llama-model.cpp:llama_model_rope_type() (full enumeration: 60+ archs).
SimdKernels.ApplyRoPECachedNeox — new SIMD (AVX/FMA) kernel for NEOX rotation.
Shaders.RoPENeox — new Vulkan compute shader for NEOX rotation.
IComputeBackend.RoPE — added optional bool neox = false parameter; CpuBackend, VulkanBackend, CudaBackend updated.
ForwardPass.ApplyRope() helper — dispatches between interleaved and NEOX based on _hp.IsNeoxRope. All RoPE call sites updated. Same dispatch added to GpuForwardPass and HybridForwardPass (both GPU and CPU layers).
Tests — split CpuBackendTests.RoPE_PreservesNorm into two tests (interleaved and NEOX). The existing test was silently validating NEOX semantics because CpuBackend.RoPE was NEOX-only while the engine used interleaved everywhere — that inconsistency is now resolved.
Diagnostic instrumentation — added SHARPI_TRACE_NORMS and SHARPI_TRACE_ROUTERS env-var-gated logging used to localize this bug. Kept for future debugging; zero overhead when disabled.
Design doc — replaced the incorrect "Q4_K_M quirk" attribution in Phase 11 with the actual root cause and fix description.

Why is the convention picked automatically?

GGUF has no dedicated rope-type metadata key. general.architecture is the only signal, and llama.cpp itself hardcodes the convention per architecture. We mirror their full mapping. Special rope variants (MROPE for Qwen2VL, IMROPE for Qwen3VL family, conditional GLM4) are explicitly noted as not yet supported and would need their own dispatch + kernels.

Test plan

Qwen3-Coder-30B-A3B previously-failing WGSL/PBR shader prompt — produces clean output
Qwen3-Coder-30B-A3B Python fibonacci (V18 from bisection) — produces clean output
Qwen3-Coder-30B-A3B + TurboQuant (--tq) — produces clean output
SmolLM2 (LLaMA arch, interleaved RoPE — control) — unchanged on CPU, GPU at 164 t/s decode
Full test suite: 206/208 pass; 2 failures are pre-existing partial-Llama-3.1-70B file issues unrelated to RoPE
New RoPE_Neox_PreservesNorm test added; existing RoPE_PreservesNorm reframed for the default interleaved convention
Not yet covered: GPU/Vulkan path on a NEOX-family model (Qwen3 8B etc.) — MoE GPU is gated by issue Hybrid GPU+CPU path broken for MoE models (GpuMoeFfn) #2, and no Qwen3 8B dense file is checked in locally. The new RoPENeox Vulkan shader compiles and dispatches correctly but has not been end-to-end exercised against a reference.

🤖 Generated with Claude Code

Issue #6: Qwen3-Coder-30B-A3B (and other Qwen/Phi/Gemma/Falcon models) generated <|endoftext|> or <|im_end|> as the first token on many short prompts. Originally suspected as a Q4_K_M quantization artifact, the actual cause was a RoPE convention mismatch. SharpInference applied LLaMA-style interleaved RoPE (rotates dim pairs (2i, 2i+1)) to all architectures. Qwen, Phi, Gemma, Falcon, etc. require NEOX-style rotation (pairs offset by headDim/2). The mismatch produced subtly-wrong attention output that compounded layer-by-layer; cumulative direction error eventually pushed the residual into a degenerate region where the LM head predicted EOT with high confidence. Changes: - ModelHyperparams.IsNeoxRope: dispatched per architecture string, mirrors llama.cpp's llama_model_rope_type() (NEOX list copied verbatim). - SimdKernels.ApplyRoPECachedNeox: new SIMD kernel for NEOX rotation. - Shaders.RoPENeox: new Vulkan compute shader for NEOX rotation. - IComputeBackend.RoPE / CpuBackend / VulkanBackend / CudaBackend: added optional bool neox parameter (defaults to false = LLaMA). - ForwardPass.ApplyRope() helper: dispatches between interleaved and NEOX kernels based on _hp.IsNeoxRope. - GpuForwardPass / HybridForwardPass: pass _hp.IsNeoxRope through. - CpuBackendTests: split RoPE_PreservesNorm into two tests covering both conventions; existing test was actually validating NEOX semantics (CpuBackend.RoPE was silently NEOX-only despite the engine using interleaved everywhere — that inconsistency is now resolved). - Added env-var-gated diagnostic instrumentation (SHARPI_TRACE_NORMS and SHARPI_TRACE_ROUTERS) used to localize this bug, kept for future debugging (zero overhead when disabled). - Updated Phase 11 design doc: replaced the incorrect "Q4_K_M quirk" attribution with the actual root cause. Verification: - Qwen3-Coder-30B-A3B: previously-failing WGSL/PBR and Python prompts now produce clean output on CPU and CPU+TurboQuant. - SmolLM2 (LLaMA arch, interleaved): unchanged, GPU still 164 t/s. - 206/208 tests pass (2 pre-existing partial-Llama-3.1-70B model file failures, unrelated to RoPE). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Cross-engine top-1 diff vs llama.cpp b8585 on Qwen3-8B Q4_K_M with matching chat template (--jinja, "You are a helpful assistant.") and --temp 0: 24-token prefill identical, 60-token greedy decode byte-identical through n-predict. Confirms PR #7's NEOX RoPE fix is correct on both CPU and Vulkan paths. - VulkanShaderTests: add RoPENeoxMatchesCpu unit test and a dense-hybrid smoke test that asserts coherent decode (no all-EOS / NaN logits) - HybridForwardPass: keep workaround comment for #19/#3 documenting why embed + Phase-5 norm/output stay on CPU - scripts/xcheck-llamacpp.ps1: reusable cross-engine capture script - Design doc Phase 2b re-validation note updated with the actual diff result (replacing the prior "deferred" placeholder)

pekkah merged commit de80df3 into master Apr 29, 2026
1 check passed

pekkah deleted the fix/qwen-neox-rope branch April 29, 2026 07:06

pekkah mentioned this pull request Apr 29, 2026

Forward-pass divergence on Qwen3-Coder 30B-A3B for some prompts (top-1 logit goes to <|endoftext|>) #6

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix RoPE convention for Qwen3-MoE and other NEOX-family architectures#7

Fix RoPE convention for Qwen3-MoE and other NEOX-family architectures#7
pekkah merged 1 commit into
masterfrom
fix/qwen-neox-rope

pekkah commented Apr 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pekkah commented Apr 29, 2026

Summary

Changes

Why is the convention picked automatically?

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant