Forward-pass divergence on Qwen3-Coder 30B-A3B for some prompts (top-1 logit goes to <|endoftext|>) #6

@pekkah

Summary

For some prompts, SharpInference's CPU forward pass for Qwen3-Coder 30B-A3B Q4_K_M produces a top-1 logit on <|endoftext|> (token 151643) at the very first decode step, with a confidence gap large enough that even non-greedy sampling picks it. llama.cpp on the exact same model + prompt + temp 0 produces coherent content.

Reproduction

Same model file: models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf, CPU only, --temp 0.

Prompt: Write a short wgsl shader for pbr rendering. Output only the code.

SharpInference

> sharpi-cli ... -p "Write a short wgsl shader for pbr rendering. Output only the code." --temp 0 --verbose-prompt -n 64

Prompt tokens (24): 151644, 872, 198, 7985, 264, 2805, 63581, 3226, 21013, 369, 281, 1323, 20898, 13, 9258, 1172, 279, 2038, 13, 151645, 198, 151644, 77091, 198
[DBG] tok=0 next=151643('<|endoftext|>') stop=True top5:151643(27.95) 73594(22.63) 151645(20.20) 151644(19.41) 264(13.92)

Decode: 0 tokens, 0.0 t/s
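To put the gap in the [DBG] line in perspective, here is a minimal Python sketch (not part of SharpInference) that applies a softmax to the five logged logits only. The rest of the vocabulary is not in the log, so the result is an upper bound on the true top-1 probability rather than the probability itself. The helper name `softmax` and the temperature values tried are illustrative assumptions.

```python
import math

# Top-5 (token id, logit) pairs copied from the [DBG] line above.
TOP5 = [(151643, 27.95), (73594, 22.63), (151645, 20.20), (151644, 19.41), (264, 13.92)]

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [logit for _, logit in TOP5]
gap = logits[0] - logits[1]  # top-1 minus top-2 logit

print(f"top-1 vs top-2 logit gap: {gap:.2f}")
for t in (1.0, 2.0):  # temperatures chosen only for illustration
    p = softmax(logits, t)[0]
    print(f"temp={t}: p(<|endoftext|>) <= {p:.4f} (restricted to the logged top 5)")
```

Restricted to these five candidates, <|endoftext|> holds more than 99% of the mass at temperature 1, which is consistent with the observation that sampling rarely escapes it.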

llama.cpp (same prompt, same model)

user
Write a short wgsl shader for pbr rendering. Output only the code.
assistant
```wgsl
@group(0) @binding(0)
var<uniform> camera: CameraUniform;

@group(0) @binding(1
```

(generation cut off by user; got 31 useful content tokens)

Other prompts

  • "Write a one-line Python function that adds two numbers." works correctly in SharpInference (~21 t/s, coherent code).
  • "Hello" gives degenerate single-token output, which we already partly mitigated by fixing the Jinja template (related to #2, "Hybrid GPU+CPU path broken for MoE models (GpuMoeFfn)"). After all template fixes, a "Hello"-style prompt still terminates after a few tokens, showing the same divergence pattern.

Hypotheses to investigate

  1. MoE expert dispatch numerics. Small drift compounds over 48 layers × 8 active experts/token. Worth comparing logits at each layer between SharpInference (CPU) and llama.cpp (CPU) for one bad prompt.
  2. Q4_K_M dequantization rounding in the FFN matmul kernels (SimdKernels.MatVec).
  3. Sampler / stop-token list. <|endoftext|> is added to stops on top of tokenizer.EosTokenId (<|im_end|>); llama.cpp's default stop set may be narrower. Worth confirming whether the difference is generation (we never produce content tokens at all) vs stopping (we generate but cut off too early). Verbose-prompt output above shows the logits themselves differ, so this is more than a stop-list issue, but the stop list is still a candidate for related cases.
  4. Chat-template render bytes. llama.cpp's display strips special tokens; spot-check that the actual token IDs fed to its forward pass match ours exactly.
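Hypotheses 1 and 4 both come down to finding the first point where the two implementations disagree. The sketch below shows one way to do that in Python, assuming the token IDs and per-layer activations from both runs have already been exported as plain lists. Neither tool is known here to emit such dumps in this form, so the export step, the function names, and the `tol` threshold are all hypothetical.

```python
def first_token_mismatch(ours, reference):
    """Return the first index where two token-id lists differ, or None if identical."""
    for i, (a, b) in enumerate(zip(ours, reference)):
        if a != b:
            return i
    if len(ours) != len(reference):
        # One list is a strict prefix of the other.
        return min(len(ours), len(reference))
    return None

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("activation vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))

def first_divergent_layer(ours, reference, tol=1e-2):
    """Return (layer_index, diff) for the first layer whose max abs diff exceeds tol.

    ours / reference: one activation vector (list of floats) per layer.
    tol is an arbitrary illustrative threshold, not a calibrated value.
    """
    for layer, (a, b) in enumerate(zip(ours, reference)):
        diff = max_abs_diff(a, b)
        if diff > tol:
            return layer, diff
    return None

if __name__ == "__main__":
    # Toy data only; real inputs would come from the two runtimes.
    print(first_token_mismatch([151644, 872, 198], [151644, 872, 198]))
    ours_layers = [[0.0, 1.0], [1.0, 2.0]]
    ref_layers = [[0.0, 1.0], [1.0, 2.5]]
    print(first_divergent_layer(ours_layers, ref_layers))
```

Checking token IDs first is cheap and rules out template differences before any per-layer comparison. Some drift between quantized kernels is expected from accumulation order alone, so the layer where the difference jumps matters more than the first nonzero difference.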

Workaround

Use a more directive prompt ("Output just the code." instead of "Output only the code."). For now, CPU + --tq works well on most coding prompts at ~21 t/s.

Setup notes for repro

  • llama.cpp installed via .\scripts\setup-llamacpp.ps1 (b8585, CPU build)
  • Use llama-completion, NOT llama-cli: --no-conversation no longer exists in this build, and llama-cli is interactive-only
