Summary
Add Multi-Token Prediction (MTP / NEXTN) self-speculative decoding for the
hybrid GDN+attn+MoE family so that Qwen3.6-27B-MTP (and any later sibling that
ships MTP heads) gets the same ~1.5–2× decode speedup that dense-model serving
stacks already get. The 35B-A3B variant in-tree doesn't ship MTP heads
(confirmed via list-tensors: only one output.weight, no mtp.*/nextn.* tensors),
but unsloth/Qwen3.6-27B-MTP-GGUF does, with the MTP head trained alongside
the main model. llama.cpp drives it as
--spec-type draft-mtp --spec-draft-n-max 2; vLLM's config method name is
qwen3_next_mtp; SGLang uses --speculative-algo NEXTN.
Quoted speedup: ~1.5–2× decode, no accuracy loss. For the current CUDA
hybrid path this would be the largest single win available — bigger than
CUDA Graphs or NVRTC kernel fusion, both of which only reduce host-side
launch overhead (~3–5 ms/token), while MTP reduces the number of forward
passes per accepted token.
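The claim that MTP reduces forward passes per token can be made concrete with a back-of-the-envelope estimate. The sketch below assumes every drafted token is accepted independently with the same probability, which real acceptance behaviour only approximates. The function name and the 0.7 acceptance rate are illustrative, not measurements.

```python
def expected_tokens_per_pass(accept_prob, draft_len):
    """Expected tokens emitted by one verify pass.

    Each pass always emits one token from the main model (a correction or a
    bonus token) plus however many leading drafts were accepted. Under the
    independence assumption, the first k drafts are all accepted with
    probability accept_prob ** k, so the expectation is sum_{k=0..N} p^k.
    """
    return sum(accept_prob ** k for k in range(draft_len + 1))


# Two drafted tokens (matching --spec-draft-n-max 2), assumed 70% acceptance.
print(round(expected_tokens_per_pass(0.7, 2), 2))  # 2.19
```

This is an upper bound on the reduction in pass count, not a speedup prediction: a verify pass over several tokens costs more than a single-token pass, and the MTP head adds its own cost.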
Why we can't just turn this on today
HybridGdnForwardPass.SupportsPartialRewind => false
(HybridGdnForwardPass.cs:395), and the same holds for the matching CUDA path
(CudaHybridGdnForwardPass.cs:692). The existing SpeculativeDecoder is
gated on SupportsPartialRewind and throws otherwise
(SpeculativeDecoder.cs:43-50). GDN destructively updates a
[num_v_heads × head_dim × head_dim] scan state per layer (~16 MiB × 30
GDN layers ≈ 480 MiB), so any verify-and-rollback scheme has to be able to
restore that state at per-token granularity, not just at the end-of-decode
boundary that the issue #21 snapshot covers (commit 7633f8a).
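To illustrate why a destructive state update blocks rewinding, here is a toy contrast in Python (not the project's C#). It assumes the usual append-only KV cache for attention and uses a simplified decay-plus-rank-1 update as a stand-in for the real GDN kernel. All names are hypothetical.

```python
import copy


def kv_rewind(kv_cache, keep):
    """Append-only cache: dropping rejected tokens is just a truncation."""
    del kv_cache[keep:]


def recurrent_step(state, decay, k, v):
    """Simplified in-place update: state = decay * state + outer(v, k)."""
    for i in range(len(v)):
        for j in range(len(k)):
            state[i][j] = decay * state[i][j] + v[i] * k[j]


# Attention side: two committed tokens, two rejected drafts.
kv_cache = ["t0", "t1", "d1", "d2"]
kv_rewind(kv_cache, 2)

# Recurrent side: the old matrix is overwritten, so a copy has to be taken
# before the speculative step if we want to get back to it.
state = [[1.0, 0.0], [0.0, 1.0]]
snapshot = copy.deepcopy(state)
recurrent_step(state, 0.5, k=[1.0, 0.0], v=[2.0, 0.0])
after_step = copy.deepcopy(state)
state = snapshot  # restore the pre-draft state
```

The truncation needs no extra memory, while the recurrent side needs one full state copy per position it may have to return to. That is the per-token granularity requirement described above.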
Work breakdown
- GGUF parser: register qwen3_next_mtp (or whatever metadata key
  the 27B-MTP GGUF uses) and load the MTP head tensors. Pull the GGUF,
  run list-tensors, and document the exact tensor naming under
  docs/qwen35moe-plan.md before writing code.
- Per-token GDN state snapshot/restore. End-of-decode CaptureSnapshot
  already exists for prefix-cache reuse, but on the CUDA hybrid path it
  downloads ~480 MiB host-side every snapshot. For per-token use we need
  either:
  - an on-device snapshot ring (keep N copies in VRAM, swap pointers on
    accept/reject), or
  - a delta-state mechanism that records only the rank-1 update each
    step and replays the rejected suffix in reverse.

  The ring is simpler; cost is N × 480 MiB VRAM. At 12 GB this is fine for
  small N (N=2 → ~1 GB).
- Batched prefill for GDN. The verify pass runs N candidate tokens through
  one forward pass. HybridGdnForwardPass.Prefill walks tokens sequentially
  today (explicit v1 limitation, HybridGdnForwardPass.cs:325). We need either
  a real batched recurrence or a "verify-only" path that runs the existing
  recurrence N times into a scratch state, comparing against the main path.
  For attention layers, batching is straightforward; for GDN the rank-1
  update is per-token, so the "verify in scratch" approach is fine.
- MTP forward + verify wiring. New engine path or a flag on
  InferenceEngine. The existing SpeculativeDecoder is built for a separate
  draft model; MTP is single-model, so this is a sibling implementation,
  not a refactor.
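The snapshot ring, the verify pass and the decode loop from the breakdown above fit together roughly as follows. This is a minimal sketch in Python (not the project's C#), built on a toy one-element recurrence. SnapshotRing, step, draft, verify and decode are hypothetical names, and the draft function merely imitates an MTP head that is sometimes wrong. It assumes greedy decoding and a verify pass that consumes all N+1 tokens before deciding what to keep.

```python
class SnapshotRing:
    """Fixed pool of state buffers; rollback swaps references, no bulk copy."""

    def __init__(self, state, slots):
        self.live = list(state)
        self.slots = [list(state) for _ in range(slots)]
        self.used = 0

    def checkpoint(self):
        # In a real CUDA path this would be a device-to-device copy.
        if self.used == len(self.slots):
            raise RuntimeError("snapshot ring is full")
        self.slots[self.used][:] = self.live
        self.used += 1

    def rollback_to(self, index):
        if not 0 <= index < self.used:
            raise IndexError("no such checkpoint")
        self.live, self.slots[index] = self.slots[index], self.live
        self.used = 0

    def commit(self):
        self.used = 0


def step(state, token):
    """Toy recurrence: destructively consume `token`, return the greedy next token."""
    state[0] = (state[0] * 5 + token + 1) % 23
    return state[0] % 7


def draft(state, last_token, n):
    """Stand-in for the MTP head: works on a copy, wrong whenever token % 3 == 0."""
    scratch, token, drafts = list(state), last_token, []
    for _ in range(n):
        guess = step(scratch, token)
        if token % 3 == 0:
            guess = (guess + 1) % 7  # deliberately wrong, to exercise rollback
        drafts.append(guess)
        token = guess
    return drafts


def verify(ring, last_token, drafts):
    """Consume last_token plus all drafts, then keep only the accepted prefix."""
    predictions = []
    for i, token in enumerate([last_token] + drafts):
        predictions.append(step(ring.live, token))
        if i < len(drafts):
            ring.checkpoint()  # state after consuming tokens 0..i
    accepted = 0
    while accepted < len(drafts) and predictions[accepted] == drafts[accepted]:
        accepted += 1
    if accepted == len(drafts):
        ring.commit()
    else:
        ring.rollback_to(accepted)
    return predictions[:accepted + 1]  # accepted drafts + correction/bonus token


def decode(first_token, length, draft_len):
    ring = SnapshotRing([0], draft_len)
    out, passes = [first_token], 0
    while len(out) < length + 1:
        out.extend(verify(ring, out[-1], draft(ring.live, out[-1], draft_len)))
        passes += 1
    return out[:length + 1], passes


spec, spec_passes = decode(1, 12, draft_len=2)
base, base_passes = decode(1, 12, draft_len=0)  # plain greedy baseline
```

Only tokens predicted by the main recurrence are ever emitted, so the speculative output matches the draft_len=0 baseline token for token while using fewer verify passes. The ring holds draft_len buffers, which lines up with the N × 480 MiB estimate above.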
Out of scope
- Adding MTP heads to the 35B-A3B-UD weights (this would be a training
project, not an inference change).
- Multi-step draft (N > 2). Start with N=2 to match llama.cpp's default;
raise once acceptance rates are measured.
- Vulkan hybrid support. The qwen35moe Vulkan hybrid path is already a
known broken row (README ⚠ note); fix that first under its own issue
before extending it to MTP.
Acceptance criteria
- list-tensors on the 27B-MTP GGUF documented in the design doc.
- HybridGdnForwardPass decodes the 27B-MTP variant with MTP enabled and
  produces byte-identical greedy output vs llama.cpp --spec-type draft-mtp
  for ≥60-token decode (one of the standard benchmark prompts).
- bench-textgen.ps1 row added to bench-all.ps1 covering
  qwen36-27b-mtp-cuda; README perf table updated.
- Decode speedup over the MTP-off baseline (--spec-draft-n-max 0 or
  equivalent) is ≥ 1.3×; below that the complexity isn't worth the
  maintenance cost.
- SHARPI_DISABLE_MTP=1 env switch for bisecting parity regressions, like
  the existing SHARPI_BYPASS_GDN / SHARPI_CPU_GDN knobs.
References
- memory/feedback_qwen36_perf_attempts.md in the local agent memory (not
  checked in).
- llama.cpp draft-mtp, for the exact verify-and-accept algorithm.