feat: wire MTP speculative decoding into the offline mlxcel generate path #166

@inureyes

Description

Wire the MTP speculative-decoding round loop into the offline mlxcel generate path. Today MTP speculative decode only runs in the server burst path; the offline CLI returns a deferred error for every target.

Current state

In src/commands/generate.rs (around lines 899-926), passing --draft-kind mtp to the offline mlxcel generate command fails with this error:

"--draft-kind mtp is plumbed end-to-end but the offline `mlxcel generate` path
 does not yet construct the kind-specific MtpGenerator round loop on this target model..."

As a result, a one-shot CLI user cannot get the MTP speedup (measured at roughly 1.2x to 1.87x) without standing up a server. The server burst path already drives MtpGenerator via Gemma4MtpTargetAdapter / Gemma4UnifiedMtpTargetAdapter; the offline path simply never builds that loop.

Goal

Construct and drive the MtpGenerator round loop in the offline generate path for MTP-capable targets, reusing the same per-target MtpTarget adapters the server uses (src/models/gemma4_mtp_target.rs), so mlxcel generate --draft-kind mtp performs speculative decode.
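
For orientation, here is a minimal sketch of what a greedy speculative round loop does. It is not the MtpGenerator or MtpTarget API: the NextToken trait, the toy models, and both functions are hypothetical stand-ins over plain token ids. Each round lets the draft propose up to k tokens, has the target verify them, keeps the longest agreeing prefix, and then appends one token chosen by the target itself. Because every emitted token is the target's own greedy choice, the result matches plain greedy decoding, which is the property the --temp 0 acceptance criterion relies on.

```rust
/// Hypothetical stand-in for a model: the greedy next token for a context.
trait NextToken {
    fn next_token(&self, context: &[u32]) -> u32;
}

/// Non-speculative baseline: one target call per generated token.
fn greedy_decode<T: NextToken>(target: &T, prompt: &[u32], n: usize) -> Vec<u32> {
    let mut ctx = prompt.to_vec();
    for _ in 0..n {
        let t = target.next_token(&ctx);
        ctx.push(t);
    }
    ctx[prompt.len()..].to_vec()
}

/// Greedy speculative decoding. Returns the generated tokens and the number
/// of rounds used. A real implementation would verify all k positions in one
/// batched target forward pass; this sketch calls the target per position.
fn speculative_decode<T: NextToken, D: NextToken>(
    target: &T,
    draft: &D,
    prompt: &[u32],
    n: usize,
    k: usize,
) -> (Vec<u32>, usize) {
    let mut ctx = prompt.to_vec();
    let mut rounds = 0;
    while ctx.len() - prompt.len() < n {
        rounds += 1;
        let base = ctx.len();
        // Draft phase: propose k tokens autoregressively on a scratch copy.
        let mut proposal = ctx.clone();
        for _ in 0..k {
            let t = draft.next_token(&proposal);
            proposal.push(t);
        }
        // Verify phase: accept draft tokens while they match the target.
        let mut accepted = 0;
        loop {
            let expected = target.next_token(&proposal[..base + accepted]);
            ctx.push(expected);
            if accepted < k && proposal[base + accepted] == expected {
                accepted += 1; // draft token confirmed, keep verifying
            } else {
                break; // correction after a mismatch, or the bonus token
            }
        }
    }
    ctx.truncate(prompt.len() + n); // the last round may overshoot n
    (ctx[prompt.len()..].to_vec(), rounds)
}

/// Toy deterministic target model (illustrative only).
struct ToyTarget;
impl NextToken for ToyTarget {
    fn next_token(&self, c: &[u32]) -> u32 {
        let last = c.last().copied().unwrap_or(0);
        (last * 31 + c.len() as u32) % 101
    }
}

/// Toy draft that disagrees with the target at every fourth context length.
struct ToyDraft;
impl NextToken for ToyDraft {
    fn next_token(&self, c: &[u32]) -> u32 {
        let t = ToyTarget.next_token(c);
        if c.len() % 4 == 0 { (t + 1) % 101 } else { t }
    }
}
```

Calling speculative_decode(&ToyTarget, &ToyDraft, &[1, 2, 3], 20, 3) yields the same tokens as greedy_decode(&ToyTarget, &[1, 2, 3], 20) in fewer rounds than tokens; with k = 0 the loop degenerates to one target call per token.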

Touchpoints

  • src/commands/generate.rs — the --draft-kind mtp branch (replace the error with the real loop).
  • src/models/gemma4_mtp_target.rs — reuse Gemma4MtpTargetAdapter / Gemma4UnifiedMtpTargetAdapter.
  • src/lib/mlxcel-core/src/speculative/mtp/MtpGenerator.
  • The same B=1 default-on / MLXCEL_ENABLE_MTP_B1 semantics as the server should apply.

Acceptance criteria

  • mlxcel generate -m <target> --draft-model <assistant> --draft-kind mtp -p "..." -n N runs MTP speculative decode (no error).
  • At --temp 0 the output is byte-identical to the non-speculative mlxcel generate output.
  • Decode shows a measured speedup on the 12B Unified target and/or the 31B + assistant pair, mirroring the server numbers.
  • DFlash and the classic SpeculativeGenerator offline paths are unaffected.
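
The byte-identity criterion could be checked with a small harness along these lines. This is a sketch under assumptions: the helper names generate_args, run_generate, and first_divergence are hypothetical, the flags are the ones quoted in this issue, and it assumes stdout carries only the generated text (if the CLI also prints timing or stats there, those lines would need to be filtered out before comparing).

```rust
use std::process::Command;

/// Build the argument list for one `mlxcel generate` run at --temp 0.
/// Passing `assistant` adds the MTP draft flags quoted in this issue.
fn generate_args(target: &str, prompt: &str, n: u32, assistant: Option<&str>) -> Vec<String> {
    let mut args: Vec<String> =
        ["generate", "-m", target, "-p", prompt, "-n"].map(String::from).to_vec();
    args.push(n.to_string());
    args.extend(["--temp", "0"].map(String::from));
    if let Some(a) = assistant {
        args.extend(["--draft-model", a, "--draft-kind", "mtp"].map(String::from));
    }
    args
}

/// Run the binary and capture raw stdout bytes.
/// Assumption: stdout contains only the generated text.
fn run_generate(bin: &str, args: &[String]) -> std::io::Result<Vec<u8>> {
    let out = Command::new(bin).args(args).output()?;
    Ok(out.stdout)
}

/// Index of the first differing byte, or None when the outputs are identical.
/// If one output is a strict prefix of the other, the index is the shorter length.
fn first_divergence(a: &[u8], b: &[u8]) -> Option<usize> {
    let common = a.iter().zip(b).take_while(|(x, y)| x == y).count();
    if common == a.len() && common == b.len() {
        None
    } else {
        Some(common)
    }
}
```

A check would run run_generate twice, once with generate_args(target, prompt, n, None) and once with Some(assistant), and require first_divergence to return None. Reporting the first differing offset, rather than a bare pass/fail, makes it easier to see at which token the speculative path drifted.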

Metadata

Assignees

No one assigned

    Labels

    area:models (Model architectures, weights, loading, metadata)
    priority:low (Low priority)
    status:done (Completed)
    type:enhancement (New features, capabilities, or significant additions)