MTP / qwen3_next_mtp self-speculation for Qwen3.6 hybrid GDN #25

## Summary

Add Multi-Token Prediction (MTP / NEXTN) self-speculative decoding for the
hybrid GDN+attn+MoE family so Qwen3.6-27B-MTP (and any later sibling that
ships MTP heads) gets the same ~1.5–2× decode speedup that dense-model
serving stacks already get. The in-tree 35B-A3B variant doesn't ship MTP
heads (confirmed via `list-tensors`: only one `output.weight`, no
`mtp.*`/`nextn.*` tensors), but `unsloth/Qwen3.6-27B-MTP-GGUF` does, with the
MTP head trained alongside the main model. llama.cpp drives it as
`--spec-type draft-mtp --spec-draft-n-max 2`; vLLM's config method name is
`qwen3_next_mtp`; SGLang uses `--speculative-algo NEXTN`.

Quoted speedup: ~1.5–2× decode, no accuracy loss. For the current CUDA
hybrid path this would be the largest single win available — bigger than
CUDA Graphs or NVRTC kernel fusion, both of which only reduce host-side
launch overhead (~3–5 ms/token), while MTP reduces the number of forward
passes per accepted token.
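
As a rough sanity check on that range, here is a back-of-envelope sketch of how many tokens one verify pass emits. It assumes each draft token is accepted independently with the same probability, which real acceptance rates only approximate, and the relative draft-head cost `draft_cost` is a made-up illustrative parameter, not a measurement from this repo.

```python
def expected_tokens_per_pass(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verify pass.

    Assumption: every draft token is accepted independently with probability
    accept_prob, and the verify pass always contributes one token of its own
    (the correction on a reject, or a bonus token when all drafts pass).
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    # 1 + p + p^2 + ... + p^draft_len
    return sum(accept_prob ** i for i in range(draft_len + 1))


def estimated_speedup(accept_prob: float, draft_len: int, draft_cost: float) -> float:
    """Tokens per pass divided by the cost of a pass, in units of one main forward.

    draft_cost is the (hypothetical) cost of producing one draft token
    relative to one main-model forward pass.
    """
    pass_cost = 1.0 + draft_len * draft_cost
    return expected_tokens_per_pass(accept_prob, draft_len) / pass_cost


if __name__ == "__main__":
    for p in (0.5, 0.7, 0.9):
        tokens = expected_tokens_per_pass(p, 2)
        speedup = estimated_speedup(p, 2, draft_cost=0.05)
        print(f"p={p:.1f}  tokens/pass={tokens:.2f}  speedup~{speedup:.2f}x")
```

With two draft tokens and a 70% per-token acceptance rate this gives about 2.19 tokens per pass, which is in line with the quoted range once draft and verify overhead are subtracted.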

## Why we can't just turn this on today

`HybridGdnForwardPass.SupportsPartialRewind` returns `false`
(HybridGdnForwardPass.cs:395), as does the matching CUDA path
(CudaHybridGdnForwardPass.cs:692). The existing `SpeculativeDecoder` is
gated on `SupportsPartialRewind` and throws otherwise
(SpeculativeDecoder.cs:43-50). GDN destructively updates a
[num_v_heads × head_dim × head_dim] scan state per layer (~16 MiB × 30
GDN layers ≈ 480 MiB), so any verify-and-rollback scheme has to be able to
restore that state at per-token granularity, not just at the end-of-decode
boundary that the issue #21 snapshot covers (commit 7633f8a).

## Work breakdown

1. **GGUF parser** — register `qwen3_next_mtp` (or whatever metadata key
   the 27B-MTP GGUF uses) and load the MTP head tensors. Pull the GGUF,
   run `list-tensors`, and document the exact tensor naming under
   `docs/qwen35moe-plan.md` before writing code.

2. **Per-token GDN state snapshot/restore.** End-of-decode `CaptureSnapshot`
   already exists for prefix-cache reuse, but on the CUDA hybrid path it
   downloads ~480 MiB host-side every snapshot. For per-token use we need
   either:
     - an on-device snapshot ring (keep N copies in VRAM, swap pointers on
       accept/reject), or
     - a delta-state mechanism that records only the rank-1 update each
       step and replays the rejected suffix in reverse.
   The ring is simpler; the cost is N × 480 MiB of VRAM. On a 12 GB card
   this is fine for small N (N=2 → 960 MiB, ~1 GB).

3. **Batched prefill for GDN.** Verify-pass runs N candidate tokens through
   one forward. `HybridGdnForwardPass.Prefill` walks tokens sequentially
   today (explicit v1 limitation, HybridGdnForwardPass.cs:325). Need either
   a real batched recurrence or a "verify-only" path that runs the existing
   recurrence N times into a scratch state, comparing to the main path.
   For attention layers, batching is straightforward; for GDN the rank-1
   update is per-token so the "verify in scratch" approach is fine.

4. **MTP forward + verify wiring.** New engine path or a flag on
   `InferenceEngine`. Existing `SpeculativeDecoder` is built for a separate
   draft model — MTP is single-model, so this is a sibling implementation,
   not a refactor.
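
Steps 2 and 3 can be sketched together on a toy model. The following is a minimal illustration, not the planned implementation: `ToyRecurrentModel`, `SnapshotRing`, `toy_draft`, and `verify_draft` are all hypothetical names, the "state" is a short list of floats standing in for the per-layer GDN scan state, and Python is used only for brevity. It shows the invariant the real code needs: after a rejected draft, the recurrent state is restored to the point just after the last accepted token, so the output matches plain greedy decoding.

```python
from typing import List, Sequence


class ToyRecurrentModel:
    """Stand-in for a recurrent layer whose state is updated destructively."""

    def __init__(self, size: int = 4) -> None:
        self.state: List[float] = [0.0] * size

    def step(self, token: int) -> int:
        # In-place update: the previous state is gone after this loop.
        for i in range(len(self.state)):
            self.state[i] = 0.5 * self.state[i] + float((token + i) % 7)
        return int(sum(self.state)) % 11  # greedy "next token"


class SnapshotRing:
    """Fixed set of preallocated state copies, reused every verify round."""

    def __init__(self, slots: int, size: int) -> None:
        self._slots = [[0.0] * size for _ in range(slots)]

    def save(self, slot: int, state: Sequence[float]) -> None:
        self._slots[slot][:] = state

    def restore(self, slot: int, state: List[float]) -> None:
        state[:] = self._slots[slot]


def toy_draft(model: ToyRecurrentModel, last_token: int, draft_len: int,
              round_idx: int) -> List[int]:
    """Hypothetical drafter: correct on some rounds, corrupted on others."""
    shadow = ToyRecurrentModel(len(model.state))
    shadow.state[:] = model.state
    draft, tok = [], last_token
    for _ in range(draft_len):
        tok = shadow.step(tok)
        draft.append(tok)
    if round_idx % 3 != 0:  # force a rejection at a varying position
        pos = round_idx % draft_len
        draft[pos] = (draft[pos] + 1) % 11
    return draft


def verify_draft(model: ToyRecurrentModel, ring: SnapshotRing,
                 last_token: int, draft: Sequence[int]) -> List[int]:
    """Run last_token + draft through the model, then roll back on a reject."""
    inputs = [last_token] + list(draft)
    preds = []
    for i, tok in enumerate(inputs):
        preds.append(model.step(tok))
        if i < len(draft):  # the final state never needs a snapshot
            ring.save(i, model.state)
    accepted = 0
    while accepted < len(draft) and preds[accepted] == draft[accepted]:
        accepted += 1
    if accepted < len(draft):
        ring.restore(accepted, model.state)  # state after last valid input
    return list(draft[:accepted]) + [preds[accepted]]


def speculative_decode(first_token: int, n: int, draft_len: int = 2) -> List[int]:
    model = ToyRecurrentModel()
    ring = SnapshotRing(draft_len, len(model.state))
    out: List[int] = []
    tok, round_idx = first_token, 0
    while len(out) < n:
        draft = toy_draft(model, tok, draft_len, round_idx)
        emitted = verify_draft(model, ring, tok, draft)
        out.extend(emitted)
        tok, round_idx = emitted[-1], round_idx + 1
    return out[:n]


def greedy_decode(first_token: int, n: int) -> List[int]:
    model = ToyRecurrentModel()
    out: List[int] = []
    tok = first_token
    for _ in range(n):
        tok = model.step(tok)
        out.append(tok)
    return out
```

In this sketch a draft of length N needs only N snapshot slots, because the state after the last input is the live state and is kept as-is when every draft token is accepted. Checking `speculative_decode(3, 60) == greedy_decode(3, 60)` is the toy analogue of the byte-identical parity check in the acceptance criteria.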

## Acceptance criteria

- [ ] `list-tensors` on the 27B-MTP GGUF documented in the design doc.
- [ ] CPU `HybridGdnForwardPass` decodes the 27B-MTP variant with MTP
      enabled and produces byte-identical greedy output vs llama.cpp
      `--spec-type draft-mtp` for ≥60-token decode (one of the standard
      benchmark prompts).
- [ ] CUDA hybrid path lands the same parity check.
- [ ] `bench-textgen.ps1` row added to `bench-all.ps1` covering
      `qwen36-27b-mtp-cuda`; README perf table updated.
- [ ] Measured decode speedup vs the same model without MTP (run with
      `--spec-draft-n-max 0` or equivalent) is ≥ 1.3× — below that the
      complexity isn't worth the maintenance cost.
- [ ] Mirrors prior practice: a `SHARPI_DISABLE_MTP=1` env switch for
      bisecting parity regressions, like the existing
      `SHARPI_BYPASS_GDN`/`SHARPI_CPU_GDN` knobs.

## Out of scope

- Adding MTP heads to the 35B-A3B-UD weights (this would be a training
  project, not an inference change).
- Multi-step draft (N > 2). Start with N=2 to match llama.cpp's default;
  raise once acceptance rates are measured.
- Vulkan hybrid support. The qwen35moe Vulkan hybrid path is already a
  known broken row (README ⚠ note); fix that first under its own issue
  before extending it to MTP.

## References

- Model card: https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF
- Prior context in this repo: `memory/feedback_qwen36_perf_attempts.md`
  in the local agent memory (not checked in).
- llama.cpp PR/docs on `draft-mtp` for the exact verify-and-accept algorithm.
