Hybrid GPU+CPU path broken for MoE models (GpuMoeFfn)

## Summary

`HybridForwardPass` produces NaN logits or degenerate output when running MoE models (qwen3moe, e.g. Qwen3-Coder 30B-A3B-Instruct) with any layers on the GPU (`-g N` for N >= 1, and also `-g -1`). Non-MoE models on the same hybrid path work correctly.

## Repro

```bash
dotnet run --project src/SharpInference.Cli -c Release -- \
  -m models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf -g 1 --tq -p "Hello" --temp 0
```

Result: 0 tokens decoded (logits are NaN, sampler picks token 0 = `<|endoftext|>`).
With more GPU layers (`-g -1`): degenerate output like `"111111 = 1 numbers = ..."`.

The CPU-only path works correctly for the same model:

```bash
dotnet run --project src/SharpInference.Cli -c Release -- \
  -m models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf --tq -p "Hello" --temp 0
```
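
The 0-token result is consistent with how a greedy scan treats NaN. The sketch below is illustrative Python, not the project's C# sampler: `greedy_argmax` and `has_non_finite` are hypothetical names, and the strict `>` scan is an assumption about how `--temp 0` selects a token. Every comparison against NaN is false, so when all logits are NaN the scan never leaves index 0.

```python
import math


def greedy_argmax(logits):
    """Index of the largest logit, found with a strict '>' scan (assumed behavior)."""
    best = 0
    for i in range(1, len(logits)):
        # Comparisons involving NaN are always False, so a NaN candidate never
        # wins and a NaN incumbent is never displaced.
        if logits[i] > logits[best]:
            best = i
    return best


def has_non_finite(logits):
    """True if any logit is NaN or +/-inf; a cheap check to run before sampling."""
    return any(not math.isfinite(x) for x in logits)


nan = float("nan")
print(greedy_argmax([nan, nan, nan, nan]))  # all-NaN logits: the scan stays at index 0
print(greedy_argmax([0.1, 2.0, -1.0]))      # finite logits: normal argmax
print(has_non_finite([0.5, nan]))           # flags the bad vector before sampling
```

A check like `has_non_finite` on the logits (or on each layer's hidden state) would turn the silent "0 tokens" outcome into an explicit failure at the first bad tensor.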

## Already fixed in adjacent areas

The following hybrid-path bugs were fixed and verified with SmolLM2 (non-MoE):

- Missing compute-stage barrier at the end of `HybridForwardPass.GpuLayer` between `AddInPlace(_gpuHidden, _gpuResidual)` and the next layer's residual copy
- `vkCmdCopyBuffer` (transfer-stage) used for inter-layer copies while every barrier is compute->compute - replaced with `RecordComputeCopy`
- GPU embedding-lookup shader writes garbage when invoked from `HybridForwardPass` (root cause unclear - identical shader works in `GpuForwardPass`); worked around by forcing CPU embedding+output via `ShouldKeepFixedWeightsOnCpu` always returning true

After those fixes, SmolLM2 hybrid produces coherent output at all `-g N` values. Qwen3-Coder MoE still fails - the bug is MoE-specific.

## Likely culprits (not yet investigated)

`HybridForwardPass.GpuMoeFfn` (lines ~1209-1325) does several things differently from the non-MoE FFN path:

- Splits the command buffer mid-layer to download router logits (`EndRecordAndSubmit` -> `Download` -> `BeginRecord`)
- Uses `ExpertSlotManager.TryGetCached` and falls back to CPU computation for cache-missed experts, with the result uploaded via pinned host-coherent memory (`_gpuFallbackContrib`)
- Per-expert dispatches reuse `_gpuFfnGate`/`_gpuFfnUp` scratch buffers across iterations of the active-experts loop
- `_gpuPinnedNorm` is populated by an in-record `CopyGpuBuffer` then read after submit via `MapPinned`

Worth checking: barrier coverage between the `Clear(_gpuHidden)` (line ~1273) and the per-expert `AddScaledInPlace` accumulations, and whether the CPU-fallback contribution is correctly synchronized with the GPU expert outputs.
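
To make the ordering concern concrete, here is a CPU reference for the accumulation, written in Python for brevity. It is a sketch under assumptions: `moe_accumulate` and `max_abs_diff` are hypothetical helpers, the fallback contribution is assumed to be already weighted, and the real `Clear`/`AddScaledInPlace` kernels may differ in detail. The `stale` parameter simulates an accumulation that runs against a buffer that was not cleared first.

```python
def moe_accumulate(dim, expert_outputs, weights, fallback_contrib=None, stale=None):
    """Reference accumulation: zeroed buffer + sum of weight * expert output.

    'stale' (hypothetical) replaces the zeroed start state to mimic a clear
    that was skipped or reordered after the first accumulation.
    """
    hidden = list(stale) if stale is not None else [0.0] * dim  # Clear(hidden)
    for out, w in zip(expert_outputs, weights):
        for i in range(dim):
            hidden[i] += w * out[i]  # AddScaledInPlace(hidden, out, w)
    if fallback_contrib is not None:
        # CPU-computed experts, assumed here to arrive already weighted.
        for i in range(dim):
            hidden[i] += fallback_contrib[i]
    return hidden


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


outputs = [[1.0, 2.0], [3.0, 4.0]]
weights = [0.5, 0.25]
reference = moe_accumulate(2, outputs, weights)
unclear = moe_accumulate(2, outputs, weights, stale=[10.0, 10.0])
print(reference)
print(max_abs_diff(reference, unclear))
```

Comparing a downloaded `_gpuHidden` against a reference like this, layer by layer, should show whether the first divergence appears at the clear, at an expert accumulation, or when the fallback contribution is added.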

## Workaround

Run CPU-only, as in the second command under Repro. With TurboQuant this gives ~14 t/s decode on this hardware, which is usable.

## Disabling broken paths

Until this is fixed, the CLI should refuse to run MoE models on the hybrid GPU path with a clear error message, rather than silently producing NaN or garbled output.
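
A minimal sketch of such a guard, in Python rather than the project's C#. The names (`check_hybrid_supported`, `UnsupportedConfigurationError`, `MOE_ARCHITECTURES`) are hypothetical; the only architecture string taken from this issue is `qwen3moe`, and treating any non-zero `-g` value (including `-1`) as "hybrid path enabled" is an assumption.

```python
# Architectures known to fail on the hybrid path; only qwen3moe is from this issue.
MOE_ARCHITECTURES = {"qwen3moe"}


class UnsupportedConfigurationError(Exception):
    """Raised when the requested model/backend combination is known to be broken."""


def check_hybrid_supported(architecture, gpu_layers):
    """Refuse MoE models on the hybrid path instead of producing NaN output.

    gpu_layers mirrors the -g flag; 0 is assumed to mean CPU-only.
    """
    if gpu_layers != 0 and architecture in MOE_ARCHITECTURES:
        raise UnsupportedConfigurationError(
            f"Architecture '{architecture}' is MoE and is not supported on the "
            f"hybrid GPU+CPU path (-g {gpu_layers}). "
            "Re-run without -g to use the CPU-only path."
        )


check_hybrid_supported("qwen3moe", 0)   # CPU-only: allowed
check_hybrid_supported("llama", 4)      # non-MoE on hybrid: allowed
try:
    check_hybrid_supported("qwen3moe", 1)
except UnsupportedConfigurationError as err:
    print(err)
```

Running the check before any weights are uploaded keeps the failure fast and points the user at the working CPU-only invocation.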
