Summary
For some prompts, SharpInference's CPU forward pass for Qwen3-Coder 30B-A3B Q4_K_M produces a top-1 logit on <|endoftext|> (token 151643) at the very first decode step, with a confidence gap large enough that even non-greedy sampling picks it. llama.cpp on the exact same model + prompt + temp 0 produces coherent content.
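To make the "confidence gap" point concrete, here is a minimal sketch of how a logit gap translates into sampling probability under plain temperature softmax. The logit values and the temperature are illustrative assumptions, not measurements from SharpInference, and the real sampler may apply top-k/top-p or other filters on top of this.

```python
import math

def top1_probability(logits, temperature=1.0):
    """Probability that temperature-softmax sampling picks the top-1 logit."""
    if temperature <= 0:
        # Temperature 0 is treated as greedy decoding: top-1 is always chosen.
        return 1.0
    top = max(logits)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp((x - top) / temperature) for x in logits]
    return max(weights) / sum(weights)

# Illustrative logits only (hypothetical): one token leads the others by ~10.
example_logits = [12.0, 2.0, 1.5, 1.0]
print(top1_probability(example_logits, temperature=0.8))
```

With a gap of that size, nearly all probability mass sits on the leading token, so raising the temperature moderately would not be expected to rescue the generation.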
Reproduction
Same model file: models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf, CPU only, --temp 0.
Prompt: Write a short wgsl shader for pbr rendering. Output only the code.
llama.cpp (same prompt, same model):
user
Write a short wgsl shader for pbr rendering. Output only the code.
assistant
```wgsl
@group(0) @binding(0)
var<uniform> camera: CameraUniform;
@group(0) @binding(1
```
(generation cut off by user; got 31 useful content tokens)
Other prompts
"Write a one-line Python function that adds two numbers." works correctly in SharpInference (~21 t/s, coherent code).
"Hello" gives degenerate single-token output, which we already partly mitigated by fixing the Jinja template (related to #2, "Hybrid GPU+CPU path broken for MoE models (GpuMoeFfn)"). Even after all template fixes, a "Hello"-style prompt still terminates after a few tokens, which is the same divergence pattern.
Hypotheses to investigate
MoE expert dispatch numerics. Small drift compounds over 48 layers × 8 active experts/token. Worth comparing logits at each layer between SharpInference (CPU) and llama.cpp (CPU) for one bad prompt.
Q4_K_M dequantization rounding in the FFN matmul kernels (SimdKernels.MatVec).
Sampler / stop-token list. <|endoftext|> is added to the stop set on top of tokenizer.EosTokenId (<|im_end|>); llama.cpp's default stop set may be narrower. Worth confirming whether the difference lies in generation (we never produce content tokens at all) or in stopping (we generate but cut off too early). The verbose-prompt output above shows that the logits themselves differ, so this is more than a stop-list issue, but the stop list is still a candidate for related cases.
Chat-template render bytes. llama.cpp's display strips special tokens; spot-check that the actual token IDs fed to its forward pass match ours exactly.
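The first and last hypotheses suggest a bisection order: confirm that both engines receive identical token IDs, then find the first layer where activations drift apart. Below is a minimal sketch of both checks. It assumes that token IDs and per-layer hidden states have already been dumped from each engine into plain Python lists; the dump mechanism, the function names, and the 0.999 cosine threshold are all hypothetical choices, not part of SharpInference or llama.cpp.

```python
import math

def first_token_mismatch(ref_ids, test_ids):
    """Return the index of the first differing token ID, or None if identical."""
    for i, (a, b) in enumerate(zip(ref_ids, test_ids)):
        if a != b:
            return i
    if len(ref_ids) != len(test_ids):
        # One sequence is a strict prefix of the other.
        return min(len(ref_ids), len(test_ids))
    return None

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Two zero vectors count as identical; zero vs non-zero as unrelated.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)

def first_divergent_layer(ref_layers, test_layers, min_cosine=0.999):
    """Return (layer_index, cosine) for the first layer below min_cosine, else None.

    ref_layers / test_layers: one hidden-state vector per layer, in layer order,
    captured at the same token position in both engines (assumed dump format).
    """
    for i, (ref, test) in enumerate(zip(ref_layers, test_layers)):
        if len(ref) != len(test):
            raise ValueError(f"layer {i}: vector lengths differ")
        cos = cosine_similarity(ref, test)
        if cos < min_cosine:
            return i, cos
    return None

# Tiny synthetic example: layer 0 matches, layer 1 is orthogonal.
ref = [[1.0, 0.0], [1.0, 0.0]]
test = [[1.0, 0.0], [0.0, 1.0]]
print(first_token_mismatch([1, 2, 3], [1, 9, 3]), first_divergent_layer(ref, test))
```

If the token IDs already differ, the template or tokenizer is the first suspect and the layer comparison is not meaningful. If they match, a sudden drop at one layer would point at a specific kernel or expert-routing step, whereas a slow decline across layers would be more consistent with accumulated numeric drift.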
Workaround
Use a more directive prompt ("Output just the code." instead of "Output only the code."). For now, CPU + --tq works well on most coding prompts at ~21 t/s.
Setup notes for repro
llama.cpp installed via .\scripts\setup-llamacpp.ps1 (b8585, CPU build)
Use llama-completion, NOT llama-cli — --no-conversation no longer exists in this build; llama-cli is interactive-only