Fix RoPE convention for Qwen3-MoE and other NEOX-family architectures #7
Merged
Issue #6: Qwen3-Coder-30B-A3B (and other Qwen/Phi/Gemma/Falcon models) generated <|endoftext|> or <|im_end|> as the first token on many short prompts. Originally suspected as a Q4_K_M quantization artifact, the actual cause was a RoPE convention mismatch.

SharpInference applied LLaMA-style interleaved RoPE (rotates dim pairs (2i, 2i+1)) to all architectures. Qwen, Phi, Gemma, Falcon, etc. require NEOX-style rotation (pairs offset by headDim/2). The mismatch produced subtly wrong attention output that compounded layer by layer; the cumulative direction error eventually pushed the residual into a degenerate region where the LM head predicted EOT with high confidence.

Changes:
- ModelHyperparams.IsNeoxRope: dispatched per architecture string, mirrors llama.cpp's llama_model_rope_type() (NEOX list copied verbatim).
- SimdKernels.ApplyRoPECachedNeox: new SIMD kernel for NEOX rotation.
- Shaders.RoPENeox: new Vulkan compute shader for NEOX rotation.
- IComputeBackend.RoPE / CpuBackend / VulkanBackend / CudaBackend: added optional bool neox parameter (defaults to false = LLaMA).
- ForwardPass.ApplyRope() helper: dispatches between interleaved and NEOX kernels based on _hp.IsNeoxRope.
- GpuForwardPass / HybridForwardPass: pass _hp.IsNeoxRope through.
- CpuBackendTests: split RoPE_PreservesNorm into two tests covering both conventions; the existing test was actually validating NEOX semantics (CpuBackend.RoPE was silently NEOX-only despite the engine using interleaved everywhere; that inconsistency is now resolved).
- Added env-var-gated diagnostic instrumentation (SHARPI_TRACE_NORMS and SHARPI_TRACE_ROUTERS) used to localize this bug, kept for future debugging (zero overhead when disabled).
- Updated Phase 11 design doc: replaced the incorrect "Q4_K_M quirk" attribution with the actual root cause.

Verification:
- Qwen3-Coder-30B-A3B: previously-failing WGSL/PBR and Python prompts now produce clean output on CPU and CPU+TurboQuant.
- SmolLM2 (LLaMA arch, interleaved): unchanged, GPU still 164 t/s.
- 206/208 tests pass (2 pre-existing partial-Llama-3.1-70B model file failures, unrelated to RoPE).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
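To make the two pairings described in this commit message concrete, here is a minimal pure-Python sketch of both conventions. It is not the SharpInference kernel code: the function names, the list-based head vector, and the frequency base of 10000 are assumptions made for illustration (real models carry their own frequency base).

```python
import math

def rope_angles(pos, head_dim, base=10000.0):
    """Rotation angle for each of the head_dim/2 pairs at position pos.
    base=10000.0 is an assumed default, not a value taken from the PR."""
    half = head_dim // 2
    return [pos * base ** (-2.0 * i / head_dim) for i in range(half)]

def rope_interleaved(x, pos, base=10000.0):
    """LLaMA-style: pair i is the adjacent elements (2i, 2i+1)."""
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x), base)):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

def rope_neox(x, pos, base=10000.0):
    """NEOX-style: pair i is the elements (i, i + head_dim/2)."""
    half = len(x) // 2
    out = list(x)
    for i, theta in enumerate(rope_angles(pos, len(x), base)):
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(t * t for t in v))

if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    print("interleaved:", rope_interleaved(x, pos=3))
    print("neox:       ", rope_neox(x, pos=3))
```

Both functions apply the same per-pair 2D rotation, so both preserve the vector norm. Only the choice of which elements form a pair differs, which is why a norm-preservation test alone cannot tell the two conventions apart.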
This was referenced Apr 29, 2026
pekkah added a commit that referenced this pull request May 18, 2026
Cross-engine top-1 diff vs llama.cpp b8585 on Qwen3-8B Q4_K_M with matching chat template (--jinja, "You are a helpful assistant.") and --temp 0: 24-token prefill identical, 60-token greedy decode byte-identical through n-predict. Confirms PR #7's NEOX RoPE fix is correct on both CPU and Vulkan paths.

- VulkanShaderTests: add RoPENeoxMatchesCpu unit test and a dense-hybrid smoke test that asserts coherent decode (no all-EOS / NaN logits)
- HybridForwardPass: keep workaround comment for #19/#3 documenting why embed + Phase-5 norm/output stay on CPU
- scripts/xcheck-llamacpp.ps1: reusable cross-engine capture script
- Design doc Phase 2b re-validation note updated with the actual diff result (replacing the prior "deferred" placeholder)
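A top-1 diff of this kind reduces to finding the first position where two greedy token sequences disagree. The sketch below shows that comparison only. It is not the contents of scripts/xcheck-llamacpp.ps1, and it assumes the token IDs from each engine have already been captured into lists; the function name and the sample token IDs are hypothetical.

```python
from typing import Optional, Sequence

def first_divergence(ours: Sequence[int], reference: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if identical.

    A length mismatch with an otherwise equal prefix counts as a divergence
    at the end of the shorter sequence.
    """
    for i, (a, b) in enumerate(zip(ours, reference)):
        if a != b:
            return i
    if len(ours) != len(reference):
        return min(len(ours), len(reference))
    return None

if __name__ == "__main__":
    # Hypothetical token IDs, for illustration only.
    engine_a = [101, 7, 42, 9]
    engine_b = [101, 7, 43, 9]
    print(first_divergence(engine_a, engine_b))
```

With greedy decoding (--temp 0) a single mismatch changes every later token, so the first divergent index is the useful number to report.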
Summary
Fixes Issue #6: Qwen3-Coder-30B-A3B (and any other Qwen/Phi/Gemma/Falcon model) generated <|endoftext|> or <|im_end|> as the first token on many short prompts. Originally suspected as a Q4_K_M quantization artifact, the actual cause was a RoPE convention mismatch.

SharpInference was applying LLaMA-style interleaved RoPE (pairs (2i, 2i+1)) to all architectures. Qwen2/Qwen3 (and Phi, Gemma, Falcon, etc.) require NEOX-style rotation (pairs offset by headDim/2). The wrong rotation produced subtly incorrect attention output that compounded layer by layer until the residual landed in a degenerate region where the LM head predicted EOT with high confidence.

Changes
- ModelHyperparams.IsNeoxRope: dispatched per architecture string from general.architecture GGUF metadata. NEOX list mirrors llama.cpp/src/llama-model.cpp:llama_model_rope_type() (full enumeration: 60+ archs).
- SimdKernels.ApplyRoPECachedNeox: new SIMD (AVX/FMA) kernel for NEOX rotation.
- Shaders.RoPENeox: new Vulkan compute shader for NEOX rotation.
- IComputeBackend.RoPE: added optional bool neox = false parameter; CpuBackend, VulkanBackend, CudaBackend updated.
- ForwardPass.ApplyRope() helper: dispatches between interleaved and NEOX based on _hp.IsNeoxRope. All RoPE call sites updated. Same dispatch added to GpuForwardPass and HybridForwardPass (both GPU and CPU layers).
- Split CpuBackendTests.RoPE_PreservesNorm into two tests (interleaved and NEOX). The existing test was silently validating NEOX semantics because CpuBackend.RoPE was NEOX-only while the engine used interleaved everywhere; that inconsistency is now resolved.
- SHARPI_TRACE_NORMS and SHARPI_TRACE_ROUTERS env-var-gated logging used to localize this bug. Kept for future debugging; zero overhead when disabled.

Why is the convention picked automatically?
GGUF has no dedicated rope-type metadata key. general.architecture is the only signal, and llama.cpp itself hardcodes the convention per architecture. We mirror their full mapping. Special rope variants (MROPE for Qwen2VL, IMROPE for Qwen3VL family, conditional GLM4) are explicitly noted as not yet supported and would need their own dispatch + kernels.

Test plan
- [...] --tq): produces clean output
- RoPE_Neox_PreservesNorm test added; existing RoPE_PreservesNorm reframed for the default interleaved convention
- RoPENeox Vulkan shader compiles and dispatches correctly but has not been end-to-end exercised against a reference.

🤖 Generated with Claude Code
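The per-architecture selection described under "Why is the convention picked automatically?" can be sketched as a lookup on the general.architecture string. This is an illustrative Python sketch, not the C# ModelHyperparams.IsNeoxRope code: the names RopeType, NEOX_ARCHS, UNSUPPORTED_ARCHS and rope_type_for are hypothetical, the architecture set is a small subset rather than the full llama.cpp list, and falling back to interleaved for unlisted architectures is an assumption of this sketch.

```python
from enum import Enum

class RopeType(Enum):
    NORM = "interleaved"   # LLaMA-style pairs (2i, 2i+1)
    NEOX = "neox"          # pairs (i, i + headDim/2)

# Illustrative subset only; the PR mirrors the full llama.cpp enumeration.
NEOX_ARCHS = frozenset({"qwen2", "qwen3", "qwen3moe", "phi3", "gemma", "falcon"})

# Special variants the PR lists as not yet supported (subset, for illustration).
UNSUPPORTED_ARCHS = frozenset({"qwen2vl"})  # MROPE

def rope_type_for(architecture: str) -> RopeType:
    """Pick the RoPE convention from a general.architecture value."""
    arch = architecture.strip().lower()
    if arch in UNSUPPORTED_ARCHS:
        raise NotImplementedError(f"rope variant for '{arch}' needs its own kernels")
    # Assumption of this sketch: anything not listed uses interleaved RoPE.
    return RopeType.NEOX if arch in NEOX_ARCHS else RopeType.NORM

if __name__ == "__main__":
    for name in ("llama", "qwen3moe"):
        print(name, "->", rope_type_for(name).value)
```

Failing loudly on the special variants, instead of silently applying one of the two supported rotations, avoids repeating the kind of quiet mismatch that caused Issue #6.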