Wire the MTP speculative-decoding round loop into the offline mlxcel generate path. Today MTP speculative decode only runs in the server burst path; the offline CLI returns a deferred error for every target.
Current state
src/commands/generate.rs (~lines 899-926): when --draft-kind mtp is passed to offline mlxcel generate, it errors:
"--draft-kind mtp is plumbed end-to-end but the offline `mlxcel generate` path
does not yet construct the kind-specific MtpGenerator round loop on this target model..."
So a one-shot CLI user cannot get the MTP speedup (~1.2-1.87x, measured) without standing up a server. The server burst path already drives MtpGenerator via Gemma4MtpTargetAdapter / Gemma4UnifiedMtpTargetAdapter; the offline path just never builds that loop.
Goal
Construct and drive the MtpGenerator round loop in the offline generate path for MTP-capable targets, reusing the same per-target MtpTarget adapters the server uses (src/models/gemma4_mtp_target.rs), so mlxcel generate --draft-kind mtp performs speculative decode.
Touchpoints
- src/commands/generate.rs: the --draft-kind mtp branch (replace the error with the real loop).
- src/models/gemma4_mtp_target.rs: reuse Gemma4MtpTargetAdapter / Gemma4UnifiedMtpTargetAdapter.
- src/lib/mlxcel-core/src/speculative/mtp/: MtpGenerator.
- The same B=1 default-on / MLXCEL_ENABLE_MTP_B1 semantics as the server should apply.
Acceptance criteria
- mlxcel generate -m <target> --draft-model <assistant> --draft-kind mtp -p "..." -n N runs MTP speculative decode (no error).
- At --temp 0 the output is byte-identical to the non-speculative mlxcel generate output.
- A measured decode speedup on the 12B Unified and/or 31B + assistant pair (mirroring the server numbers).
- DFlash and the classic SpeculativeGenerator offline paths are unaffected.
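
To make the --temp 0 criterion concrete, here is a minimal, self-contained sketch of a greedy draft-and-verify round loop. It is not the MtpGenerator API: the NextToken trait, ToyTarget, ToyDraft, greedy_decode and speculative_decode are all hypothetical names invented for illustration, and the toy models are plain deterministic functions. The point it shows is that when every emitted token comes from the target's own greedy choice, the speculative output matches the non-speculative output exactly, while taking fewer rounds.

```rust
/// Hypothetical stand-in for a model: returns the greedy next token for a context.
/// (Illustration only; this is NOT the mlxcel MtpTarget trait.)
pub trait NextToken {
    fn next(&self, ctx: &[u32]) -> u32;
}

/// Toy target: a deterministic hash of the context, reduced to a 50-token vocabulary.
pub struct ToyTarget;
impl NextToken for ToyTarget {
    fn next(&self, ctx: &[u32]) -> u32 {
        let h = ctx
            .iter()
            .fold(0u32, |acc, &t| acc.wrapping_mul(31).wrapping_add(t));
        h % 50
    }
}

/// Toy draft: agrees with the target except when the context length is a multiple of 3.
pub struct ToyDraft;
impl NextToken for ToyDraft {
    fn next(&self, ctx: &[u32]) -> u32 {
        let t = ToyTarget.next(ctx);
        if ctx.len() % 3 == 0 { (t + 1) % 50 } else { t }
    }
}

/// Baseline: plain greedy decode, one target call per emitted token.
pub fn greedy_decode<T: NextToken>(target: &T, prompt: &[u32], n: usize) -> Vec<u32> {
    let mut ctx = prompt.to_vec();
    for _ in 0..n {
        let t = target.next(&ctx);
        ctx.push(t);
    }
    ctx[prompt.len()..].to_vec()
}

/// Greedy speculative decode: draft proposes up to `k` tokens per round, the target
/// verifies them. Returns the generated tokens and the number of rounds taken.
pub fn speculative_decode<T: NextToken, D: NextToken>(
    target: &T,
    draft: &D,
    prompt: &[u32],
    n: usize,
    k: usize,
) -> (Vec<u32>, usize) {
    let mut ctx = prompt.to_vec();
    let mut rounds = 0;
    while ctx.len() - prompt.len() < n {
        rounds += 1;

        // Draft phase: propose k tokens autoregressively on a scratch copy.
        let mut scratch = ctx.clone();
        let mut proposed = Vec::with_capacity(k);
        for _ in 0..k {
            let t = draft.next(&scratch);
            proposed.push(t);
            scratch.push(t);
        }

        // Verify phase: a real target would score all positions in one batched
        // forward pass; this sketch calls the target once per position instead.
        let mut all_accepted = true;
        for &p in &proposed {
            let expected = target.next(&ctx);
            // Always emit the target's own token, so output never depends on the draft.
            ctx.push(expected);
            if expected != p {
                // First mismatch: the pushed token is the correction; end the round.
                all_accepted = false;
                break;
            }
        }

        // Every proposal matched: the target contributes one extra "bonus" token.
        if all_accepted {
            let bonus = target.next(&ctx);
            ctx.push(bonus);
        }
    }
    // A round may overshoot the budget; trim to exactly n generated tokens.
    ctx.truncate(prompt.len() + n);
    (ctx[prompt.len()..].to_vec(), rounds)
}
```

Calling speculative_decode(&ToyTarget, &ToyDraft, &[1, 2, 3], 16, 4) and comparing its token vector against greedy_decode(&ToyTarget, &[1, 2, 3], 16) is the toy analogue of the byte-identical check above. In the real offline path the comparison would be made on the decoded text of the two mlxcel generate runs.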