Summary
HybridForwardPass produces NaN/degenerate output when running MoE models (qwen3moe, e.g. Qwen3-Coder 30B-A3B-Instruct) with any number of GPU layers (-g N for N >= 1). Non-MoE models on the same hybrid path work correctly.
Repro
dotnet run --project src/SharpInference.Cli -c Release -- \
-m models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf -g 1 --tq -p "Hello" --temp 0
Result: 0 tokens decoded (logits are NaN, sampler picks token 0 = <|endoftext|>).
With more GPU layers (-g -1): degenerate output like "111111 = 1 numbers = ...".
CPU-only path works correctly for the same model:
dotnet run --project src/SharpInference.Cli -c Release -- \
-m models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --tq -p "Hello" --temp 0
Already fixed in adjacent areas
The following hybrid-path bugs were fixed and verified with SmolLM2 (non-MoE):
- Missing compute-stage barrier at the end of
HybridForwardPass.GpuLayer between AddInPlace(_gpuHidden, _gpuResidual) and the next layer's residual copy
vkCmdCopyBuffer (transfer-stage) used for inter-layer copies while every barrier is compute->compute - replaced with RecordComputeCopy
- GPU embedding-lookup shader writes garbage when invoked from
HybridForwardPass (root cause unclear - identical shader works in GpuForwardPass); worked around by forcing CPU embedding+output via ShouldKeepFixedWeightsOnCpu always returning true
After those fixes, SmolLM2 hybrid produces coherent output at all -g N values. Qwen3-Coder MoE still fails - the bug is MoE-specific.
Likely culprits (not yet investigated)
HybridForwardPass.GpuMoeFfn (lines ~1209-1325) does several things differently from non-MoE:
- Splits the command buffer mid-layer to download router logits (
EndRecordAndSubmit -> Download -> BeginRecord)
- Uses
ExpertSlotManager.TryGetCached and falls back to CPU computation for cache-missed experts, with the result uploaded via pinned host-coherent memory (_gpuFallbackContrib)
- Per-expert dispatches reuse
_gpuFfnGate/_gpuFfnUp scratch buffers across iterations of the active-experts loop
_gpuPinnedNorm is populated by an in-record CopyGpuBuffer then read after submit via MapPinned
Worth checking: barrier coverage between the Clear(_gpuHidden) (line ~1273) and the per-expert AddScaledInPlace accumulations, and whether the CPU-fallback contribution is correctly synchronized with the GPU expert outputs.
Workaround
CPU-only with TurboQuant gives ~14 t/s decode on this hardware, which is usable.
Disabling broken paths
Until this is fixed, the CLI should refuse to run MoE models on the hybrid GPU path with a clear error message, rather than silently producing NaN or garbled output.
Summary
HybridForwardPassproduces NaN/degenerate output when running MoE models (qwen3moe, e.g. Qwen3-Coder 30B-A3B-Instruct) with any number of GPU layers (-g Nfor N >= 1). Non-MoE models on the same hybrid path work correctly.Repro
dotnet run --project src/SharpInference.Cli -c Release -- \ -m models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf -g 1 --tq -p "Hello" --temp 0Result: 0 tokens decoded (logits are NaN, sampler picks token 0 =
<|endoftext|>).With more GPU layers (
-g -1): degenerate output like"111111 = 1 numbers = ...".CPU-only path works correctly for the same model:
dotnet run --project src/SharpInference.Cli -c Release -- \ -m models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --tq -p "Hello" --temp 0Already fixed in adjacent areas
The following hybrid-path bugs were fixed and verified with SmolLM2 (non-MoE):
HybridForwardPass.GpuLayerbetweenAddInPlace(_gpuHidden, _gpuResidual)and the next layer's residual copyvkCmdCopyBuffer(transfer-stage) used for inter-layer copies while every barrier is compute->compute - replaced withRecordComputeCopyHybridForwardPass(root cause unclear - identical shader works inGpuForwardPass); worked around by forcing CPU embedding+output viaShouldKeepFixedWeightsOnCpualways returning trueAfter those fixes, SmolLM2 hybrid produces coherent output at all
-g Nvalues. Qwen3-Coder MoE still fails - the bug is MoE-specific.Likely culprits (not yet investigated)
HybridForwardPass.GpuMoeFfn(lines ~1209-1325) does several things differently from non-MoE:EndRecordAndSubmit->Download->BeginRecord)ExpertSlotManager.TryGetCachedand falls back to CPU computation for cache-missed experts, with the result uploaded via pinned host-coherent memory (_gpuFallbackContrib)_gpuFfnGate/_gpuFfnUpscratch buffers across iterations of the active-experts loop_gpuPinnedNormis populated by an in-recordCopyGpuBufferthen read after submit viaMapPinnedWorth checking: barrier coverage between the
Clear(_gpuHidden)(line ~1273) and the per-expertAddScaledInPlaceaccumulations, and whether the CPU-fallback contribution is correctly synchronized with the GPU expert outputs.Workaround
CPU-only with TurboQuant gives ~14 t/s decode on this hardware, which is usable.
Disabling broken paths
Until this is fixed, the CLI should refuse to run MoE models on the hybrid GPU path with a clear error message, rather than silently producing NaN or garbled output.