While reviewing the MTP foundation (commits a28bac4, 7f6a4bc, 82547e9 — #25, #29), I noticed that the MTP attention block's KV cache is NEVER populated during prefill. Only `MtpForward` writes to it, and `MtpForward` is only called during decode (via `MtpDecoder`). Result: the FIRST decode-step `MtpForward` call attends over an empty MTP KV cache — its attention scores only see its own freshly-written K/V at position P.
llama.cpp's MTP integration (per PR #20533 references) likely runs the MTP head over the full prompt too during prefill, so that the MTP attention at decode-step 1 sees K/V for positions 0..P-1.
Why we didn't catch this earlier
The CPU smoke test (`HybridGdnForwardPass_Qwen35Mtp_MtpHeadProducesWellFormedLogits`) calls `MtpForward` once and asserts finite + non-degenerate logits. With empty MTP KV, the attention has only position P to attend to (self-attention), the dot product yields a scalar, softmax produces 1.0, and the output is the value vector itself — finite and non-degenerate. The test passes by accident.
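The single-position case can be checked with a few lines of arithmetic. The sketch below is a toy model in plain Python, not the project's attention kernel; `softmax`, `attend`, and the example vectors are hypothetical and chosen only for illustration. It shows that with exactly one cached key, the softmax weight is 1.0 and the attention output is the value vector unchanged, whatever the query is.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output


# Hypothetical numbers: the MTP cache holds only position P's own K/V.
query = [0.3, -1.2, 0.7, 2.0]
key_p = [1.5, 0.1, -0.4, 0.9]
value_p = [0.25, -0.5, 1.0, 4.0]

weights, output = attend(query, [key_p], [value_p])
print(weights, output)
```

Because the output never depends on the query or key here, a finite and non-degenerate value vector is enough to satisfy the smoke test's assertions.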
The self-parity test (`InferenceEngine_MtpGreedy_MatchesBaselineGreedy_OnCpu`) compares MTP-on vs MTP-off greedy emissions. Since the emitted sequence is always `argmax(main_logits)` (the verified target), the MTP attention's K/V content doesn't affect what gets emitted. The test passes correctly.
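The reason the self-parity test cannot see draft quality can be illustrated with a deliberately simplified greedy verify loop. This is a toy in plain Python under stated assumptions: `main_argmax` stands in for `argmax(main_logits)`, the draft functions stand in for the MTP head, and one token is verified per step. None of these names come from the codebase.

```python
def main_argmax(context):
    """Stand-in for argmax(main_logits): a deterministic toy next-token rule."""
    return (31 * sum(context) + 7 * len(context)) % 50


def greedy_baseline(prompt, n_tokens):
    """Plain greedy decoding with no drafting."""
    context = list(prompt)
    for _ in range(n_tokens):
        context.append(main_argmax(context))
    return context[len(prompt):]


def greedy_with_draft(prompt, n_tokens, draft_fn):
    """Greedy decoding where each draft is verified against the main model.

    The emitted token is always the main model's argmax; the draft only
    changes how many proposals are accepted.
    """
    context = list(prompt)
    emitted = []
    accepted = 0
    while len(emitted) < n_tokens:
        draft = draft_fn(context)
        target = main_argmax(context)
        if draft == target:
            accepted += 1
        context.append(target)
        emitted.append(target)
    return emitted, accepted


prompt = [3, 14, 15, 9]
baseline = greedy_baseline(prompt, 8)
good_out, good_accepted = greedy_with_draft(prompt, 8, main_argmax)
bad_out, bad_accepted = greedy_with_draft(prompt, 8, lambda ctx: -1)
print(baseline == good_out == bad_out, good_accepted, bad_accepted)
```

In this toy, a perfect draft and a useless draft emit the same tokens and differ only in the acceptance count, which is the quantity the self-parity test does not measure.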
So both correctness gates we have today are insensitive to this bug.
What it actually breaks
MTP draft quality at decode-steps 1..K. Without prompt context, the MTP head is missing most of its training-time conditioning. Acceptance rate will be lower than llama.cpp's, so when the speedup work lands (#30: batched main verify + per-token GDN snapshot ring, realizing the MTP >=1.3x speedup), the realised speedup will be below the model's potential.
llama.cpp greedy parity (#31: MTP greedy parity vs llama.cpp `--spec-type draft-mtp`). llama.cpp almost certainly populates MTP KV during prefill, since that is how the head was trained. Our acceptance pattern will diverge from llama.cpp's, even though emitted sequences should still match under greedy.
N>1 MTP draft chains. When the decoder eventually drafts multiple tokens ahead, the second draft (`mtp(t2_draft, P+1, ...)`) relies on the MTP cache having the prompt's K/V. Without prefill, that chain starts from empty.
Scope
Add a `PrefillMtp(IReadOnlyList tokens)` hook on `IForwardPass` (default no-op).
Implement on `HybridGdnForwardPass` (CPU) and `CudaHybridGdnForwardPass` (GPU): for each prompt token `t_i`, call `MtpForward(t_i, i, h_{i-1})` where `h_{i-1}` is the previous main forward's `LastHidden` (zero-vector for i=0). DON'T capture the returned logits — we only need the side effect of populating the MTP KV cache.
Wire into `InferenceEngine.GenerateChunksAsync`: after the main prefill, if `HasMtpHead`, run `PrefillMtp` for the suffix tokens.
Cost: ~1.6% extra per prefill token (1 MTP forward = ~1/64 main forward for 27B-MTP). For long prompts this is non-trivial; gate behind a check like "prefill MTP only if MTP will actually be used".
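The prefill loop described in the scope items above can be sketched as follows. This is a toy model in plain Python, not the C# implementation: `ToyMtpHead`, `prefill_mtp`, `main_hiddens`, and the `mtp_will_be_used` flag are hypothetical names, and the "KV cache" is a dictionary that records what each call wrote. The intent is only to show the indexing (`h_{i-1}` paired with `t_i`, a zero vector at `i = 0`), the discarded logits, and the gate.

```python
class ToyMtpHead:
    """Stand-in for the MTP head: records one cache entry per position."""

    def __init__(self):
        self.kv_cache = {}  # position -> (token, previous hidden state)

    def mtp_forward(self, token, position, prev_hidden):
        # Side effect we care about: write this position into the cache.
        self.kv_cache[position] = (token, tuple(prev_hidden))
        return [0.0]  # placeholder logits; real logits are not modelled


def prefill_mtp(head, tokens, main_hiddens, hidden_dim, mtp_will_be_used=True):
    """Populate the MTP cache for every prompt position.

    main_hiddens[i] is assumed to be the main model's last hidden state
    after processing tokens[i].
    """
    if not mtp_will_be_used:
        return  # gate: skip the extra cost when MTP will not run
    prev_hidden = [0.0] * hidden_dim  # zero vector for i = 0
    for i, token in enumerate(tokens):
        head.mtp_forward(token, i, prev_hidden)  # logits intentionally discarded
        prev_hidden = main_hiddens[i]


tokens = [11, 22, 33]
main_hiddens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

head = ToyMtpHead()
prefill_mtp(head, tokens, main_hiddens, hidden_dim=2)
print(sorted(head.kv_cache))

skipped = ToyMtpHead()
prefill_mtp(skipped, tokens, main_hiddens, hidden_dim=2, mtp_will_be_used=False)
print(skipped.kv_cache)
```

After the loop, the first decode-step call at position P would find entries for positions 0..P-1 already in place, which is the state the acceptance criteria ask for.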
Acceptance criteria
`PrefillMtp` hook added to `IForwardPass` with safe default.
CPU + CUDA hybrid implementations populate the MTP KV cache for every prompt position.
Self-parity test still passes (we know emitted sequence shouldn't change).
Out of scope
Related