Forward-pass divergence on Qwen3-Coder 30B-A3B for some prompts (top-1 logit goes to <|endoftext|>) #6

## Summary

For some prompts, SharpInference's CPU forward pass for Qwen3-Coder 30B-A3B Q4_K_M puts the top-1 logit on `<|endoftext|>` (token 151643) at the very first decode step. The gap to the runner-up is about 5.3 logits (27.95 vs 22.63 in the repro below), large enough that non-greedy sampling would almost always pick it too. llama.cpp with the same model file, the same prompt, and `--temp 0` produces coherent content.

## Reproduction

Same model file: `models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf`, CPU only, `--temp 0`.

**Prompt:** `Write a short wgsl shader for pbr rendering. Output only the code.`

### SharpInference

```
> sharpi-cli ... -p "Write a short wgsl shader for pbr rendering. Output only the code." --temp 0 --verbose-prompt -n 64

Prompt tokens (24): 151644, 872, 198, 7985, 264, 2805, 63581, 3226, 21013, 369, 281, 1323, 20898, 13, 9258, 1172, 279, 2038, 13, 151645, 198, 151644, 77091, 198
[DBG] tok=0 next=151643('<|endoftext|>') stop=True top5:151643(27.95) 73594(22.63) 151645(20.20) 151644(19.41) 264(13.92)

Decode: 0 tokens, 0.0 t/s
```
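To put a number on the gap, the top-5 logits from the `[DBG]` line above can be pushed through a softmax. This is a minimal Python sketch, not SharpInference code. It only sees the five logged logits, so the probabilities are renormalized over those five tokens and overstate the true values by whatever mass the rest of the vocabulary holds.

```python
import math

# Top-5 (token_id, logit) pairs copied from the [DBG] line above.
TOP5 = [
    (151643, 27.95),  # <|endoftext|>
    (73594, 22.63),
    (151645, 20.20),
    (151644, 19.41),
    (264, 13.92),
]


def softmax_over(pairs, temperature=1.0):
    """Softmax restricted to the given (token_id, logit) pairs."""
    top = max(logit for _, logit in pairs)
    # Subtract the max before exponentiating for numerical stability.
    weights = [(tok, math.exp((logit - top) / temperature)) for tok, logit in pairs]
    total = sum(w for _, w in weights)
    return {tok: w / total for tok, w in weights}


probs = softmax_over(TOP5)
gap = TOP5[0][1] - TOP5[1][1]

print(f"top-1 vs top-2 logit gap: {gap:.2f}")
for tok, p in probs.items():
    print(f"{tok:>6}: {p:.4f}")
```

Among these five tokens at temperature 1, `<|endoftext|>` takes over 99% of the mass, which is consistent with the observation that sampling does not rescue the generation.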

### llama.cpp (same prompt, same model)

```
user
Write a short wgsl shader for pbr rendering. Output only the code.
assistant
```wgsl
@group(0) @binding(0)
var<uniform> camera: CameraUniform;

@group(0) @binding(1
```

(generation cut off by user; got 31 useful content tokens)

## Other prompts

- `"Write a one-line Python function that adds two numbers."` works correctly in SharpInference (~21 t/s, coherent code).
- `"Hello"` used to give degenerate single-token output, which we partly mitigated by fixing the Jinja template (#2-related). Even after all template fixes, a "Hello"-style prompt still terminates after a few tokens, which looks like the same divergence pattern.

## Hypotheses to investigate

1. **MoE expert dispatch numerics.** Small drift compounds over 48 layers × 8 active experts/token. Worth comparing logits at each layer between SharpInference (CPU) and llama.cpp (CPU) for one bad prompt.
2. **Q4_K_M dequantization rounding** in the FFN matmul kernels (`SimdKernels.MatVec`).
3. **Sampler / stop-token list.** `<|endoftext|>` is added to stops on top of `tokenizer.EosTokenId` (`<|im_end|>`); llama.cpp's default stop set may be narrower. Worth confirming whether the difference is *generation* (we never produce content tokens at all) vs *stopping* (we generate but cut off too early). Verbose-prompt output above shows the **logits themselves** differ, so this is more than a stop-list issue, but the stop list is still a candidate for related cases.
4. **Chat-template render bytes.** llama.cpp's display strips special tokens; spot-check that the actual *token IDs* fed to its forward pass match ours exactly.
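Hypotheses 1 and 4 both come down to finding the first point where the two runs disagree. The sketch below is one way to do that in Python, assuming each implementation can dump its prompt token IDs and one activation vector per layer as plain lists of floats. Whether SharpInference or llama.cpp exposes such a dump is not established here, and the function names and the tolerance are hypothetical.

```python
import math


def first_token_mismatch(ours, theirs):
    """Return the first index where two token-ID lists differ, or None if identical."""
    for i, (a, b) in enumerate(zip(ours, theirs)):
        if a != b:
            return i
    if len(ours) != len(theirs):
        # One list is a strict prefix of the other.
        return min(len(ours), len(theirs))
    return None


def relative_l2_error(ours, theirs):
    """L2 norm of the difference, relative to the L2 norm of the reference vector."""
    if len(ours) != len(theirs):
        raise ValueError("activation vectors must have the same length")
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(ours, theirs)))
    ref = math.sqrt(sum(b * b for b in theirs))
    return diff / ref if ref > 0 else diff


def first_divergent_layer(ours_layers, theirs_layers, tol=1e-2):
    """Return (layer_index, error) for the first layer whose error exceeds tol, else None."""
    for layer, (a, b) in enumerate(zip(ours_layers, theirs_layers)):
        err = relative_l2_error(a, b)
        if err > tol:
            return layer, err
    return None


# Toy data standing in for real dumps (hypothetical values).
ours_ids = [151644, 872, 198, 7985]
theirs_ids = [151644, 872, 198, 7985]
print("token mismatch at:", first_token_mismatch(ours_ids, theirs_ids))

ours_acts = [[1.0, 2.0], [1.0, 2.0], [3.0, 4.0]]
theirs_acts = [[1.0, 2.0], [1.001, 2.0], [3.0, 5.0]]
print("first divergent layer:", first_divergent_layer(ours_acts, theirs_acts))
```

If the token IDs match, a gradual rise in error across layers would point toward accumulated numerics (hypotheses 1 and 2), while a sudden jump at one layer would point toward a bug in that layer's kernel or expert dispatch.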

## Workaround

Use a more directive prompt (`"Output just the code."` instead of `"Output only the code."`). For now, CPU + `--tq` works well on most coding prompts at ~21 t/s.

## Setup notes for repro

- llama.cpp installed via `.\scripts\setup-llamacpp.ps1` (b8585, CPU build)
- Use `llama-completion`, NOT `llama-cli` — `--no-conversation` no longer exists in this build; `llama-cli` is interactive-only
