MTP attention KV cache is empty during decode-step 1 (prefill doesn't populate it) #33

## Background

While reviewing the MTP foundation (commits a28bac4, 7f6a4bc, 82547e9; #25, #29), I noticed that the MTP attention block's KV cache is never populated during prefill. Only `MtpForward` writes to it, and `MtpForward` is only called during decode (via `MtpDecoder`). As a result, the first decode-step `MtpForward` call attends over an otherwise empty MTP KV cache: its attention scores see only its own freshly written K/V at position P.

llama.cpp's MTP integration (per PR #20533 references) likely also runs the MTP head over the full prompt during prefill, so that the MTP attention at decode-step 1 sees K/V for positions 0..P-1.

## Why we didn't catch this earlier

- The CPU smoke test (`HybridGdnForwardPass_Qwen35Mtp_MtpHeadProducesWellFormedLogits`) calls `MtpForward` once and asserts finite, non-degenerate logits. With an empty MTP KV cache, the attention has only position P to attend to (pure self-attention): there is a single score, softmax over it is 1.0, and the attention output is the value vector itself, which is finite and non-degenerate. The test passes by accident.
- The self-parity test (`InferenceEngine_MtpGreedy_MatchesBaselineGreedy_OnCpu`) compares MTP-on vs MTP-off greedy emissions. Since the emitted sequence is always `argmax(main_logits)` (the verified target), the MTP attention's K/V content doesn't affect what gets emitted. The test passes legitimately, but by construction it cannot detect this bug.

So both correctness gates we have today are insensitive to this bug.
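
To make the smoke-test point concrete, here is a minimal pure-Python sketch of single-query attention. It is not the project's kernel: `attend`, the vectors, and the 4-wide head dimension are all made up for illustration. With one cached entry the output equals the value vector exactly, so a "finite and non-degenerate" assertion cannot tell an empty cache from a populated one.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Scaled dot-product attention for one query over the cached K/V entries.
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Illustrative numbers only (assumption: head dimension 4).
q = [0.3, -1.2, 0.7, 2.0]
k_self = [1.5, 0.1, -0.4, 0.9]
v_self = [0.25, -0.5, 1.0, 4.0]

# Empty-cache case: the only entry is the K/V just written at position P.
out_empty = attend(q, [k_self], [v_self])

# Populated case: two earlier prompt positions are also in the cache.
k_prompt = [[0.2, 0.4, -1.0, 0.3], [-0.7, 1.1, 0.5, 0.0]]
v_prompt = [[1.0, 1.0, 1.0, 1.0], [-2.0, 0.5, 0.0, 3.0]]
out_full = attend(q, k_prompt + [k_self], v_prompt + [v_self])

print(out_empty == v_self)  # the single softmax weight is exactly 1.0
print(out_full)             # a blend of all three value vectors
```

Both outputs are finite and non-degenerate, which is why a test that distinguishes the two cases would need to compare against a reference with a populated cache rather than check well-formedness.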

## What it actually breaks

1. **MTP draft quality at decode-steps 1..K.** Without prompt context, the MTP head is missing most of its training-time conditioning. The acceptance rate will likely be lower than llama.cpp's, so when the speedup work (#30) lands, the realised speedup will be below the model's potential.
2. **llama.cpp greedy parity (#31).** llama.cpp almost certainly populates MTP KV during prefill, since that is how the head was trained. Our acceptance pattern will diverge from llama.cpp's, even though emitted sequences should still match under greedy.
3. **N>1 MTP draft chains.** When the decoder eventually drafts multiple tokens ahead, the second draft (`mtp(t2_draft, P+1, ...)`) relies on the MTP cache having the prompt's K/V. Without prefill, that chain starts from empty.
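
The split between "emitted tokens are unaffected" and "acceptance rate suffers" can be illustrated with a toy greedy draft-and-verify loop. Everything here is hypothetical: `target_next` stands in for `argmax(main_logits)`, `draft_full` for an MTP head that has the context it needs, and `draft_blind` is only an analogy for a head drafting without prompt K/V. It is a sketch of the general technique, not of `MtpDecoder`.

```python
VOCAB = 50

def target_next(ctx):
    # Toy stand-in for argmax(main_logits): depends on last token and position.
    return (31 * ctx[-1] + 7 * len(ctx)) % VOCAB

def draft_full(ctx):
    # Drafter with full context: here it simply agrees with the target.
    return target_next(ctx)

def draft_blind(ctx):
    # Drafter missing context (analogy for empty MTP KV): ignores position.
    return (31 * ctx[-1]) % VOCAB

def baseline_greedy(prompt, n_tokens):
    ctx = list(prompt)
    for _ in range(n_tokens):
        ctx.append(target_next(ctx))
    return ctx[len(prompt):]

def speculative_greedy(draft_next, prompt, n_tokens):
    ctx = list(prompt)
    target_calls = accepted = drafted = 0
    while len(ctx) - len(prompt) < n_tokens:
        draft = draft_next(ctx)
        drafted += 1
        # One batched verify call scores both ctx and ctx + [draft].
        target_calls += 1
        verified = target_next(ctx)
        ctx.append(verified)  # the emitted token always comes from the target
        if draft == verified:
            accepted += 1
            if len(ctx) - len(prompt) < n_tokens:
                # Accepted draft: the same verify call also yields the next token.
                ctx.append(target_next(ctx))
    return ctx[len(prompt):], accepted / drafted, target_calls

prompt = [3, 17, 42]
reference = baseline_greedy(prompt, 20)
out_full, rate_full, calls_full = speculative_greedy(draft_full, prompt, 20)
out_blind, rate_blind, calls_blind = speculative_greedy(draft_blind, prompt, 20)

print(out_full == reference, out_blind == reference)
print(rate_full, rate_blind, calls_full, calls_blind)
```

Both drafters reproduce the baseline sequence, which mirrors why the self-parity test stays green; only the acceptance rate and the number of target calls differ, which is where the cost of an unpopulated MTP cache would show up.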

## Scope

1. Add a `PrefillMtp(IReadOnlyList<int> tokens)` hook on `IForwardPass` (default no-op).
2. Implement it on `HybridGdnForwardPass` (CPU) and `CudaHybridGdnForwardPass` (GPU): for each prompt token `t_i`, call `MtpForward(t_i, i, h_{i-1})`, where `h_{i-1}` is the previous main forward's `LastHidden` (zero vector for i=0). Do not capture the returned logits; only the side effect of populating the MTP KV cache is needed.
3. Wire it into `InferenceEngine.GenerateChunksAsync`: after the main prefill, if `HasMtpHead`, run `PrefillMtp` for the suffix tokens.
4. Cost: ~1.6% extra per prefill token (1 MTP forward ≈ 1/64 of a main forward for 27B-MTP). For long prompts this is non-trivial, so gate it behind a check like "prefill MTP only if MTP will actually be used".
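
The loop in step 2 and the gate in step 4 could look roughly like the following language-neutral sketch, written in Python rather than C# so it runs standalone. `ToyMtpHead`, `prefill_mtp`, and the fake K/V projection are hypothetical stand-ins, not the real `MtpForward`. One assumption worth flagging: the sketch passes the per-position hidden states in explicitly, whereas the proposed `PrefillMtp(tokens)` signature would have to obtain them from the main forward pass some other way.

```python
class ToyMtpHead:
    """Stand-in for the MTP block: records one K/V entry per call."""

    def __init__(self, dim):
        self.dim = dim
        self.kv_cache = {}  # position -> (key, value)

    def mtp_forward(self, token, position, prev_hidden):
        # Hypothetical K/V projection; the real head would run attention here.
        key = [token + h for h in prev_hidden]
        value = [position - h for h in prev_hidden]
        self.kv_cache[position] = (key, value)
        return [0.0] * self.dim  # placeholder logits

def prefill_mtp(head, tokens, hiddens):
    # hiddens[i] is assumed to be the main model's last hidden state after
    # consuming tokens[i]; position 0 has no predecessor, so it gets zeros.
    prev_hidden = [0.0] * head.dim
    for i, token in enumerate(tokens):
        head.mtp_forward(token, i, prev_hidden)  # logits deliberately discarded
        prev_hidden = hiddens[i]

tokens = [11, 22, 33, 44]
hiddens = [[0.1 * (i + 1)] * 3 for i in range(len(tokens))]
head = ToyMtpHead(dim=3)

has_mtp_head = True
mtp_will_be_used = True  # gate from step 4: skip the extra cost otherwise
if has_mtp_head and mtp_will_be_used:
    prefill_mtp(head, tokens, hiddens)

print(sorted(head.kv_cache))  # one cache entry per prompt position
```

After this runs, the first decode-step call at position P would find entries for positions 0..P-1 already in the cache, which is the property the acceptance criteria below ask for.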

## Acceptance criteria

- [ ] `PrefillMtp` hook added to `IForwardPass` with safe default.
- [ ] CPU + CUDA hybrid implementations populate the MTP KV cache for every prompt position.
- [ ] Self-parity test still passes (the emitted sequence should not change).
- [ ] When #31 (llama.cpp parity) lands, MTP-enabled byte-identical greedy parity holds for >=60 tokens.
- [ ] When #30 (batched verify) lands, observed acceptance rate is in line with llama.cpp's reported numbers (>=70% for greedy on standard prompts).

## Out of scope

- Optimising the prefill MTP cost (batched MTP prefill, parallel KV writes) — solve once measured.

## Related

- Parent: #25
- Sibling: #29 (CUDA hybrid MTP — closed; this is a known correctness gap, not a regression).
- Blocks: #31 (llama.cpp parity) — acceptance-rate parity needs this; emitted-sequence parity probably doesn't.
- Affects: #30 (batched verify speedup) — peak acceptance rate, and thus peak speedup, depends on this.
