Hybrid GPU+CPU path broken for MoE models (GpuMoeFfn) #2

@pekkah

Summary

HybridForwardPass produces NaN or degenerate output when running MoE models (qwen3moe, e.g. Qwen3-Coder 30B-A3B-Instruct) with GPU layers enabled (-g N for any N >= 1, and also -g -1). Non-MoE models on the same hybrid path work correctly.

Repro

dotnet run --project src/SharpInference.Cli -c Release -- \
  -m models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf -g 1 --tq -p "Hello" --temp 0

Result: 0 tokens decoded. The logits are NaN, so the sampler picks token 0 (<|endoftext|>) and generation ends immediately.
With more GPU layers (-g -1) the output is degenerate instead, e.g. "111111 = 1 numbers = ...".
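
Why NaN logits end up as token 0 can be illustrated with a plain greedy argmax. This is a minimal sketch, assuming the sampler at --temp 0 reduces to an argmax with a strict `>` comparison; `GreedySampler.ArgMax` is a hypothetical name, not the project's actual sampler.

```csharp
using System;

public static class GreedySampler
{
    // Hypothetical greedy argmax, not the project's real sampler.
    // Every comparison involving NaN evaluates to false, so when all
    // logits are NaN 'best' never moves away from its initial index 0.
    public static int ArgMax(ReadOnlySpan<float> logits)
    {
        int best = 0;
        for (int i = 1; i < logits.Length; i++)
        {
            if (logits[i] > logits[best])
            {
                best = i;
            }
        }
        return best;
    }
}
```

With finite logits the same loop returns the index of the largest value, so the "token 0" symptom is a consequence of the NaN inputs rather than a sampler bug.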

CPU-only path works correctly for the same model:

dotnet run --project src/SharpInference.Cli -c Release -- \
  -m models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --tq -p "Hello" --temp 0

Already fixed in adjacent areas

The following hybrid-path bugs were fixed and verified with SmolLM2 (non-MoE):

  • A compute-stage barrier was missing at the end of HybridForwardPass.GpuLayer, between AddInPlace(_gpuHidden, _gpuResidual) and the next layer's residual copy
  • vkCmdCopyBuffer (transfer-stage) was used for inter-layer copies while every barrier is compute->compute, so the copies were not ordered by those barriers - replaced with RecordComputeCopy
  • The GPU embedding-lookup shader writes garbage when invoked from HybridForwardPass (root cause unclear - the identical shader works in GpuForwardPass); worked around by forcing CPU embedding+output via ShouldKeepFixedWeightsOnCpu always returning true

After those fixes, SmolLM2 hybrid produces coherent output at all -g N values. Qwen3-Coder MoE still fails - the bug is MoE-specific.

Likely culprits (not yet investigated)

HybridForwardPass.GpuMoeFfn (lines ~1209-1325) does several things differently from the non-MoE FFN path:

  • Splits the command buffer mid-layer to download router logits (EndRecordAndSubmit -> Download -> BeginRecord)
  • Uses ExpertSlotManager.TryGetCached and falls back to CPU computation for cache-missed experts, with the result uploaded via pinned host-coherent memory (_gpuFallbackContrib)
  • Per-expert dispatches reuse _gpuFfnGate/_gpuFfnUp scratch buffers across iterations of the active-experts loop
  • _gpuPinnedNorm is populated by an in-record CopyGpuBuffer then read after submit via MapPinned

Worth checking: barrier coverage between the Clear(_gpuHidden) (line ~1273) and the per-expert AddScaledInPlace accumulations, and whether the CPU-fallback contribution is correctly synchronized with the GPU expert outputs.
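
One way to narrow down which of these steps first corrupts the data is to download the intermediate buffers after each stage and scan them for non-finite values. The helper below is a minimal sketch of that check only. `BufferChecks` and both method names are hypothetical, and downloading the GPU buffers into a float span is assumed to happen elsewhere.

```csharp
using System;

public static class BufferChecks
{
    // Returns the index of the first NaN or Infinity in 'data',
    // or -1 when every value is finite.
    public static int FirstNonFinite(ReadOnlySpan<float> data)
    {
        for (int i = 0; i < data.Length; i++)
        {
            if (!float.IsFinite(data[i]))
            {
                return i;
            }
        }
        return -1;
    }

    // Throws with the stage label so the first failing stage is obvious.
    public static void ThrowIfNonFinite(string stage, ReadOnlySpan<float> data)
    {
        int index = FirstNonFinite(data);
        if (index >= 0)
        {
            throw new InvalidOperationException(
                $"{stage}: non-finite value {data[index]} at index {index}");
        }
    }
}
```

Calling `ThrowIfNonFinite` with labels such as "router logits", "after Clear" or "after expert k" on the downloaded data would show the earliest stage that produces NaN. Note that inserting extra submits and downloads changes the synchronization, so a missing-barrier bug may disappear while this check is active.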

Workaround

CPU-only with TurboQuant gives ~14 t/s decode on this hardware, which is usable.

Disabling broken paths

Until this is fixed, the CLI should refuse to run MoE models on the hybrid GPU path with a clear error message, rather than silently producing NaN or garbled output.
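
A guard along these lines could run right after argument parsing and model loading. This is a sketch under stated assumptions: `HybridGuard.EnsureSupported` is a hypothetical name, `expertCount` is assumed to come from the model metadata (0 for dense models), and `gpuLayers == 0` is assumed to mean CPU-only.

```csharp
using System;

public static class HybridGuard
{
    // expertCount: number of experts read from model metadata (assumed 0 for non-MoE).
    // gpuLayers:   value passed via -g (assumed 0 when the flag is omitted).
    public static void EnsureSupported(int expertCount, int gpuLayers)
    {
        bool isMoe = expertCount > 0;
        bool usesGpu = gpuLayers != 0; // covers -g N (N >= 1) and -g -1

        if (isMoe && usesGpu)
        {
            throw new NotSupportedException(
                "MoE models are not supported on the hybrid GPU+CPU path yet " +
                "(GpuMoeFfn produces NaN or garbled output). " +
                "Run without -g to use the CPU-only path.");
        }
    }
}
```

Failing fast here keeps the non-MoE hybrid path and the CPU-only MoE path usable, and the check can be deleted once GpuMoeFfn is fixed.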
