Skip to content

Fix RoPE convention for Qwen3-MoE and other NEOX-family architectures#7

Merged
pekkah merged 1 commit into
masterfrom
fix/qwen-neox-rope
Apr 29, 2026
Merged

Fix RoPE convention for Qwen3-MoE and other NEOX-family architectures#7
pekkah merged 1 commit into
masterfrom
fix/qwen-neox-rope

Conversation

@pekkah
Copy link
Copy Markdown
Owner

@pekkah pekkah commented Apr 29, 2026

Summary

Fixes Issue #6 — Qwen3-Coder-30B-A3B (and any other Qwen/Phi/Gemma/Falcon model) generated <|endoftext|> or <|im_end|> as the first token on many short prompts. Originally suspected as a Q4_K_M quantization artifact, the actual cause was a RoPE convention mismatch.

SharpInference was applying LLaMA-style interleaved RoPE (pairs (2i, 2i+1)) to all architectures. Qwen2/Qwen3 (and Phi, Gemma, Falcon, etc.) require NEOX-style rotation (pairs offset by headDim/2). The wrong rotation produced subtly-incorrect attention output that compounded layer-by-layer until the residual landed in a degenerate region where the LM head predicted EOT with high confidence.

Changes

  • ModelHyperparams.IsNeoxRope — dispatched per architecture string from general.architecture GGUF metadata. NEOX list mirrors llama.cpp/src/llama-model.cpp:llama_model_rope_type() (full enumeration: 60+ archs).
  • SimdKernels.ApplyRoPECachedNeox — new SIMD (AVX/FMA) kernel for NEOX rotation.
  • Shaders.RoPENeox — new Vulkan compute shader for NEOX rotation.
  • IComputeBackend.RoPE — added optional bool neox = false parameter; CpuBackend, VulkanBackend, CudaBackend updated.
  • ForwardPass.ApplyRope() helper — dispatches between interleaved and NEOX based on _hp.IsNeoxRope. All RoPE call sites updated. Same dispatch added to GpuForwardPass and HybridForwardPass (both GPU and CPU layers).
  • Tests — split CpuBackendTests.RoPE_PreservesNorm into two tests (interleaved and NEOX). The existing test was silently validating NEOX semantics because CpuBackend.RoPE was NEOX-only while the engine used interleaved everywhere — that inconsistency is now resolved.
  • Diagnostic instrumentation — added SHARPI_TRACE_NORMS and SHARPI_TRACE_ROUTERS env-var-gated logging used to localize this bug. Kept for future debugging; zero overhead when disabled.
  • Design doc — replaced the incorrect "Q4_K_M quirk" attribution in Phase 11 with the actual root cause and fix description.

Why is the convention picked automatically?

GGUF has no dedicated rope-type metadata key. general.architecture is the only signal, and llama.cpp itself hardcodes the convention per architecture. We mirror their full mapping. Special rope variants (MROPE for Qwen2VL, IMROPE for Qwen3VL family, conditional GLM4) are explicitly noted as not yet supported and would need their own dispatch + kernels.

Test plan

  • Qwen3-Coder-30B-A3B previously-failing WGSL/PBR shader prompt — produces clean output
  • Qwen3-Coder-30B-A3B Python fibonacci (V18 from bisection) — produces clean output
  • Qwen3-Coder-30B-A3B + TurboQuant (--tq) — produces clean output
  • SmolLM2 (LLaMA arch, interleaved RoPE — control) — unchanged on CPU, GPU at 164 t/s decode
  • Full test suite: 206/208 pass; 2 failures are pre-existing partial-Llama-3.1-70B file issues unrelated to RoPE
  • New RoPE_Neox_PreservesNorm test added; existing RoPE_PreservesNorm reframed for the default interleaved convention
  • Not yet covered: GPU/Vulkan path on a NEOX-family model (Qwen3 8B etc.) — MoE GPU is gated by issue Hybrid GPU+CPU path broken for MoE models (GpuMoeFfn) #2, and no Qwen3 8B dense file is checked in locally. The new RoPENeox Vulkan shader compiles and dispatches correctly but has not been end-to-end exercised against a reference.

🤖 Generated with Claude Code

Issue #6: Qwen3-Coder-30B-A3B (and other Qwen/Phi/Gemma/Falcon models)
generated <|endoftext|> or <|im_end|> as the first token on many short
prompts. Originally suspected as a Q4_K_M quantization artifact, the
actual cause was a RoPE convention mismatch.

SharpInference applied LLaMA-style interleaved RoPE (rotates dim pairs
(2i, 2i+1)) to all architectures. Qwen, Phi, Gemma, Falcon, etc. require
NEOX-style rotation (pairs offset by headDim/2). The mismatch produced
subtly-wrong attention output that compounded layer-by-layer; cumulative
direction error eventually pushed the residual into a degenerate region
where the LM head predicted EOT with high confidence.

Changes:
- ModelHyperparams.IsNeoxRope: dispatched per architecture string,
  mirrors llama.cpp's llama_model_rope_type() (NEOX list copied verbatim).
- SimdKernels.ApplyRoPECachedNeox: new SIMD kernel for NEOX rotation.
- Shaders.RoPENeox: new Vulkan compute shader for NEOX rotation.
- IComputeBackend.RoPE / CpuBackend / VulkanBackend / CudaBackend:
  added optional bool neox parameter (defaults to false = LLaMA).
- ForwardPass.ApplyRope() helper: dispatches between interleaved and
  NEOX kernels based on _hp.IsNeoxRope.
- GpuForwardPass / HybridForwardPass: pass _hp.IsNeoxRope through.
- CpuBackendTests: split RoPE_PreservesNorm into two tests covering
  both conventions; existing test was actually validating NEOX
  semantics (CpuBackend.RoPE was silently NEOX-only despite the
  engine using interleaved everywhere — that inconsistency is now
  resolved).
- Added env-var-gated diagnostic instrumentation (SHARPI_TRACE_NORMS
  and SHARPI_TRACE_ROUTERS) used to localize this bug, kept for
  future debugging (zero overhead when disabled).
- Updated Phase 11 design doc: replaced the incorrect "Q4_K_M quirk"
  attribution with the actual root cause.

Verification:
- Qwen3-Coder-30B-A3B: previously-failing WGSL/PBR and Python prompts
  now produce clean output on CPU and CPU+TurboQuant.
- SmolLM2 (LLaMA arch, interleaved): unchanged, GPU still 164 t/s.
- 206/208 tests pass (2 pre-existing partial-Llama-3.1-70B model
  file failures, unrelated to RoPE).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@pekkah pekkah merged commit de80df3 into master Apr 29, 2026
1 check passed
@pekkah pekkah deleted the fix/qwen-neox-rope branch April 29, 2026 07:06
pekkah added a commit that referenced this pull request May 18, 2026
Cross-engine top-1 diff vs llama.cpp b8585 on Qwen3-8B Q4_K_M with matching
chat template (--jinja, "You are a helpful assistant.") and --temp 0: 24-token
prefill identical, 60-token greedy decode byte-identical through n-predict.
Confirms PR #7's NEOX RoPE fix is correct on both CPU and Vulkan paths.

- VulkanShaderTests: add RoPENeoxMatchesCpu unit test and a dense-hybrid
  smoke test that asserts coherent decode (no all-EOS / NaN logits)
- HybridForwardPass: keep workaround comment for #19/#3 documenting why
  embed + Phase-5 norm/output stay on CPU
- scripts/xcheck-llamacpp.ps1: reusable cross-engine capture script
- Design doc Phase 2b re-validation note updated with the actual diff
  result (replacing the prior "deferred" placeholder)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant