Decode starvation when long prefill overlaps active decode on dual.yml #208
Replies: 1 comment
-
|
This is an excellent characterization — and you've correctly found a regime our published numbers don't cover. Mechanism first, since it ties the answers together. The mechanism (our read — consistent with our GDN forensics, not yet instrumented for this exact overlap)Chunked prefill co-batches your big prefill's current chunk with the decoding request's single token into one forward step. That step's latency is dominated by the prefill chunk — and on Qwen3.6's hybrid GDN/Mamba layers the chunk's recurrent forward is heavy (same kernel family behind our single-card "Cliff 2"). So during overlap each step ≈ chunk-time, your decoder emits ~1 token/step, and ITL balloons to the chunk-step time → your 0.2–0.9 TPS. Shrinking Q1 — decode-concurrent only. Confirmed. The 269 TPS @ 4-stream uses the short quicksort prompt (4 short-prefill streams decoding together). We have not benchmarked long-prefill-overlap — genuine gap, thank you for naming it. Q2 — Q3 — can't give you a clean dual-turbo A/B right now: it's currently parked. Q4 — yes, proxy admission control is the right answer — and note priority scheduling won't save you. Q5 — yes, and we'll document it. The decode-concurrent (throughput) vs long-prefill-overlap (latency) distinction is real and our On the A/Bs — you're better placed to run them than we areYou've already got the repro, the rig, and the exact production config — and honestly we don't have spare cycles to stand up a faithful long-prefill-overlap harness our side right now (and dual-turbo, the most interesting arm, is parked per Q3). So if you're up for it, the two highest-signal runs on
Plus |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
-
Decode starvation when long prefill overlaps active decode on
dual.ymlOpening this as an investigation / docs-shaping thread, not a hard bug report.
I think our production traffic is hitting a concurrency regime that is different from the existing
dual-turbo/ TP=4 decode-concurrent benchmarks.TL;DR
On
dual.yml, if a long prompt/tool-result prefill enters while another request is already decoding, generation throughput can collapse until the prefill clears.This is not the same as the published concurrent benchmark shape, which uses short prompts and measures multiple streams decoding together. I think there are two useful regimes to distinguish:
dual-turbonumbers demonstrateRig / stack
iommu=ptdual.yml, TP=2, fp8 KVnightly-bf610c2f56764e1b30bc6065f4ceace3d6e59036align(auto-selected by vLLM with prefix caching)qwen3_xml--disable-custom-all-reduceactive, PCIe/no NVLinkBaseline important args:
Repro shape
Two concurrent
/v1/chat/completionsrequests:The exact script is uninteresting: it just fires those two requests at the same time. I can paste it if useful, but the important part is the shape: a long prefill enters while another request is decoding.
What we observed
8192+FULL_AND_PIECEWISE8192+PIECEWISE2048+PIECEWISE1024Representative vLLM stats while overlap existed:
At
2048 + PIECEWISE:So lowering
max_num_batched_tokenshelps wall time a lot, but does not restore normal decode ITL while overlap exists.1024is not viable on this model/config:Why this seems worth discussing here
docs/DUAL_CARD.mddocumentsdual-turboas true 4-stream parallel decoding, 269 TPS aggregate at 4 streams. The script behind that uses the canonical short code prompt:That is a valid decode-concurrent benchmark. The agentic production shape is different: long tool returns, long accumulated context, or large file reads can enter prefill while another interactive response is decoding.
vLLM's docs say chunked prefill should prioritize decode requests, but with Qwen3.6's hybrid GDN/Mamba
alignpath, decode still appears to starve during large prefill chunks.Questions
Do the existing
dual-turbo/ TP=4 concurrency validations exercise long-prefill-overlap, or are they intentionally decode-concurrent only?Is
max_num_batched_tokens=2048a reasonable production mitigation fordual.yml, or is there a better known value for Qwen3.6 align-mode?1024is invalid becauseblock_size=1568must be <= the budget.Would
dual-turboavoid this shape because of TQ3 + Genesis + MTP +max_num_batched_tokens=4128, or would it show the same starvation if tested with a long-prefill-overlap repro?Is proxy-level admission control the expected production answer for agentic traffic? For example: classify large-prefill requests by estimated prompt tokens and avoid admitting them while an interactive decode is active.
If this behavior is expected, would it be worth documenting the distinction between decode-concurrent throughput and long-prefill-overlap latency?
Happy to run whichever A/B would be most useful. The candidates I see are:
max_num_batched_tokens=1568dual-turbowith the same overlap shape, if useful as a controlled testalignmax_num_seqs=1or proxy serialization as a UX baselineBeta Was this translation helpful? Give feedback.
All reactions