MTP / qwen3_next_mtp self-speculation for Qwen3.6 hybrid GDN #25

@pekkah

Summary

Add Multi-Token Prediction (MTP / NEXTN) self-speculative decoding for the
hybrid GDN+attn+MoE family so that Qwen3.6-27B-MTP (and any later sibling that
ships MTP heads) gets the same ~1.5–2× decode speedup that dense-model serving
stacks already get. The in-tree 35B-A3B variant doesn't ship MTP heads
(confirmed via list-tensors: only one output.weight, no mtp.*/nextn.* tensors),
but unsloth/Qwen3.6-27B-MTP-GGUF does, with the MTP head trained alongside
the main model. llama.cpp drives it with --spec-type draft-mtp
--spec-draft-n-max 2; vLLM's config method name is qwen3_next_mtp; SGLang
uses --speculative-algo NEXTN.

Quoted speedup: ~1.5–2× decode, no accuracy loss. For the current CUDA
hybrid path this would be the largest single win available — bigger than
CUDA Graphs or NVRTC kernel fusion, both of which only reduce host-side
launch overhead (~3–5 ms/token), while MTP reduces the number of forward
passes per accepted token.
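
To make "fewer forward passes per accepted token" concrete, here is a rough back-of-envelope model. It is a sketch only, assuming each drafted token is accepted independently with probability accept_rate (an idealization; no acceptance rates have been measured here yet) and ignoring the extra cost of the MTP head and of a wider verify pass, so it gives an upper bound on the speedup rather than a prediction.

```python
def expected_tokens_per_pass(accept_rate: float, draft_n: int) -> float:
    """Expected tokens emitted per main-model verify pass.

    Every pass emits the accepted prefix of the draft plus one token that
    comes from the main model itself (the correction on a mismatch, or the
    bonus token when the whole draft is accepted). Under the independence
    assumption that is 1 + a + a^2 + ... + a^draft_n.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_n < 0:
        raise ValueError("draft_n must be non-negative")
    return sum(accept_rate ** k for k in range(draft_n + 1))
```

With draft_n = 2, an acceptance rate of 0.5 gives 1.75 tokens per pass and a perfect draft gives 3, which brackets the quoted ~1.5–2× range only if acceptance is reasonably high.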

Why we can't just turn this on today

HybridGdnForwardPass.SupportsPartialRewind => false
(HybridGdnForwardPass.cs:395), and the matching CUDA path
(CudaHybridGdnForwardPass.cs:692) does the same. The existing
SpeculativeDecoder is gated on SupportsPartialRewind and throws otherwise
(SpeculativeDecoder.cs:43-50). GDN destructively updates a
[num_v_heads × head_dim × head_dim] scan state per layer (~16 MiB × 30
GDN layers ≈ 480 MiB), so any verify-and-rollback scheme has to be able to
restore that state at per-token granularity, not just at the end-of-decode
boundary that the issue #21 snapshot covers (commit 7633f8a).
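
The state-size arithmetic above, as a small sketch. The function names are made up for illustration. The actual num_v_heads, head_dim and element width for this model aren't stated in this issue, so the totals start from the ~16 MiB per-layer figure quoted above rather than from a shape.

```python
MIB = 1024 * 1024


def gdn_layer_state_bytes(num_v_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Size of one layer's [num_v_heads x head_dim x head_dim] scan state."""
    return num_v_heads * head_dim * head_dim * bytes_per_elem


def total_gdn_state_bytes(per_layer_bytes: int, gdn_layers: int) -> int:
    """Bytes that one full snapshot of all GDN layers has to copy."""
    return per_layer_bytes * gdn_layers


# Figures quoted in this issue: ~16 MiB per layer, 30 GDN layers.
one_snapshot = total_gdn_state_bytes(16 * MIB, 30)
print(one_snapshot // MIB, "MiB per full GDN snapshot")
```

The point is that a per-token snapshot copies this whole amount each time, which is why the end-of-decode snapshot path is too heavy to reuse as-is.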

Work breakdown

  1. GGUF parser — register qwen3_next_mtp (or whatever metadata key
    the 27B-MTP GGUF uses) and load the MTP head tensors. Pull the GGUF,
    run list-tensors, and document the exact tensor naming under
    docs/qwen35moe-plan.md before writing code.

  2. Per-token GDN state snapshot/restore. End-of-decode CaptureSnapshot
    already exists for prefix-cache reuse, but on the CUDA hybrid path it
    downloads ~480 MiB host-side every snapshot. For per-token use we need
    either:

    • an on-device snapshot ring (keep N copies in VRAM, swap pointers on
      accept/reject), or
    • a delta-state mechanism that records only the rank-1 update each
      step and replays the rejected suffix in reverse.

    The ring is simpler; its cost is N × 480 MiB of VRAM. With 12 GB of VRAM
    this is fine for small N (N=2 → ~1 GB).

  3. Batched prefill for GDN. The verify pass runs N candidate tokens through
    one forward pass. HybridGdnForwardPass.Prefill walks tokens sequentially
    today (explicit v1 limitation, HybridGdnForwardPass.cs:325). We need either
    a real batched recurrence or a "verify-only" path that runs the existing
    recurrence N times into a scratch state and compares against the main path.
    For attention layers, batching is straightforward; for GDN the rank-1
    update is per-token, so the "verify in scratch" approach is fine.

  4. MTP forward + verify wiring. New engine path or a flag on
    InferenceEngine. Existing SpeculativeDecoder is built for a separate
    draft model — MTP is single-model, so this is a sibling implementation,
    not a refactor.
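
How items 2–4 fit together, as a toy sketch in Python (not the repo's C#, and not a proposed API). Everything here is hypothetical: ToyRecurrentModel stands in for the GDN stack with a single destructively updated state, toy_draft stands in for the MTP head, and the snapshot list stands in for the on-device ring. In a real layered model every layer has consumed all N draft tokens before any logits exist, which is why the verify step below runs the whole draft first and only then rolls back.

```python
from typing import Callable, List


class ToyRecurrentModel:
    """Stand-in for the GDN stack: one state, updated destructively per token."""

    def __init__(self) -> None:
        self.state: List[int] = [0]

    def step(self, token: int) -> int:
        # Toy recurrence (hypothetical); returns the greedy next token.
        self.state[0] = (self.state[0] * 31 + token) % 1009
        return self.state[0] % 7

    def snapshot(self) -> List[int]:
        return list(self.state)

    def restore(self, snap: List[int]) -> None:
        self.state = list(snap)


def verify_draft(model: ToyRecurrentModel, last_token: int, draft: List[int]) -> List[int]:
    """Run last_token + draft through the model, then roll back rejected tokens.

    Keeps len(draft) snapshots (the "ring"). Returns the accepted draft prefix
    plus one token predicted by the main model itself.
    """
    inputs = [last_token] + draft
    ring: List[List[int]] = []
    predictions: List[int] = []
    for i, tok in enumerate(inputs):
        predictions.append(model.step(tok))
        if i < len(draft):  # the final state is live, so it needs no snapshot
            ring.append(model.snapshot())
    accepted = 0
    while accepted < len(draft) and predictions[accepted] == draft[accepted]:
        accepted += 1
    if accepted < len(draft):
        model.restore(ring[accepted])  # state as of the last accepted token
    return draft[:accepted] + [predictions[accepted]]


def greedy_decode(model: ToyRecurrentModel, first_token: int, n: int) -> List[int]:
    out, tok = [], first_token
    for _ in range(n):
        tok = model.step(tok)
        out.append(tok)
    return out


def speculative_decode(model: ToyRecurrentModel, first_token: int, n: int,
                       draft_fn: Callable[[int, int], List[int]],
                       draft_n: int = 2) -> List[int]:
    out: List[int] = []
    tok = first_token
    while len(out) < n:
        out.extend(verify_draft(model, tok, draft_fn(tok, draft_n)))
        tok = out[-1]
    return out[:n]


def toy_draft(last_token: int, n: int) -> List[int]:
    # Deliberately poor guesser; output parity must not depend on draft quality.
    return [(last_token + 1) % 7] * n
```

In this toy, speculative_decode returns the same tokens as greedy_decode for any draft function; only the number of verify passes changes. That is the property the parity criteria below are meant to check on the real paths.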

Acceptance criteria

  • list-tensors on the 27B-MTP GGUF documented in the design doc.
  • CPU HybridGdnForwardPass decodes the 27B-MTP variant with MTP
    enabled and produces byte-identical greedy output vs llama.cpp
    --spec-type draft-mtp for ≥60-token decode (one of the standard
    benchmark prompts).
  • CUDA hybrid path passes the same parity check.
  • bench-textgen.ps1 row added to bench-all.ps1 covering
    qwen36-27b-mtp-cuda; README perf table updated.
  • Measured decode speedup vs the same model without MTP (run with
    --spec-draft-n-max 0 or equivalent) is ≥ 1.3× — below that the
    complexity isn't worth the maintenance cost.
  • A SHARPI_DISABLE_MTP=1 env switch exists for bisecting parity
    regressions, mirroring the existing SHARPI_BYPASS_GDN/SHARPI_CPU_GDN
    knobs.
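
A minimal sketch of the last two criteria, again in Python for illustration only. The function names are hypothetical, and how the switch is parsed isn't specified in this issue; the sketch assumes only the literal value 1 disables MTP.

```python
import os
from typing import Mapping, Optional


def mtp_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """MTP stays on unless SHARPI_DISABLE_MTP is exactly "1" (assumed semantics)."""
    env = os.environ if env is None else env
    return env.get("SHARPI_DISABLE_MTP") != "1"


def meets_speedup_bar(tok_per_s_mtp: float, tok_per_s_baseline: float,
                      min_speedup: float = 1.3) -> bool:
    """True when measured decode throughput with MTP clears the 1.3x bar."""
    if tok_per_s_baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return tok_per_s_mtp / tok_per_s_baseline >= min_speedup
```

Both throughput numbers should come from the same model and prompt, with the baseline run using --spec-draft-n-max 0 or equivalent, as stated above.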

Out of scope

  • Adding MTP heads to the 35B-A3B-UD weights (this would be a training
    project, not an inference change).
  • Multi-step draft (N > 2). Start with N=2 to match llama.cpp's default;
    raise once acceptance rates are measured.
  • Vulkan hybrid support. The qwen35moe Vulkan hybrid path is already a
    known broken row (README ⚠ note); fix that first under its own issue
    before extending it to MTP.
