Decode starvation when long prefill overlaps active decode on dual.yml #208

mgabor3141 · 2026-05-23T22:49:39Z

mgabor3141
May 23, 2026

Decode starvation when long prefill overlaps active decode on `dual.yml`

Opening this as an investigation / docs-shaping thread, not a hard bug report.

I think our production traffic is hitting a concurrency regime that is different from the existing dual-turbo / TP=4 decode-concurrent benchmarks.

TL;DR

On dual.yml, if a long prompt/tool-result prefill enters while another request is already decoding, generation throughput can collapse until the prefill clears.

This is not the same as the published concurrent benchmark shape, which uses short prompts and measures multiple streams decoding together. I think there are two useful regimes to distinguish:

Regime	Shape	Observed behavior
Decode-concurrent	several short-prompt requests decode together	scales well, this is what the existing `dual-turbo` numbers demonstrate
Long-prefill-overlap	one large prefill/tool result enters while another stream is decoding	decode can starve, ~0.1-0.9 TPS in our tests

Rig / stack

2x RTX 3090, no NVLink
ASUS PRIME X399-A + Threadripper 1950X
both GPUs PCIe 3.0 x16, iommu=pt
CachyOS bare metal
dual.yml, TP=2, fp8 KV
Qwen3.6-27B AutoRound INT4
vLLM image: nightly-bf610c2f56764e1b30bc6065f4ceace3d6e59036
prefix caching on, chunked prefill on
mamba-cache mode: align (auto-selected by vLLM with prefix caching)
MTP disabled locally for tool-call correctness under agentic workloads
tool parser overridden to qwen3_xml
--disable-custom-all-reduce active, PCIe/no NVLink

Baseline important args:

--max-model-len 262144
--max-num-seqs 2
--max-num-batched-tokens 8192
--kv-cache-dtype fp8_e5m2
--enable-prefix-caching
--enable-chunked-prefill
--scheduling-policy priority

Repro shape

Two concurrent /v1/chat/completions requests:

A: long-prefill / short-decode, 28,623 prompt tokens, 58 completion tokens
B: normal decode, 38 prompt tokens, 512 completion tokens

The exact script is uninteresting: it just fires those two requests at the same time. I can paste it if useful, but the important part is the shape: a long prefill enters while another request is decoding.

What we observed

Config	A wall time	B wall time	Worst observed gen TPS during overlap	Notes
`8192` + `FULL_AND_PIECEWISE`	live traffic only	live traffic only	~0.1-0.2	original symptom
`8192` + `PIECEWISE`	246s	260s	~0.2-0.4	PIECEWISE active, did not fix starvation
`2048` + `PIECEWISE`	107s	117s	~0.9	materially better, still starves decode
`1024`	n/a	n/a	n/a	invalid, align-mode block size is 1568

Representative vLLM stats while overlap existed:

# 8192 + PIECEWISE
Running: 2 reqs
Avg generation throughput: 0.2-0.4 tokens/s

# after overlap cleared
Running: 1 reqs
Avg generation throughput: ~48 tokens/s

At 2048 + PIECEWISE:

Running: 2 reqs
Avg generation throughput: ~0.9 tokens/s

# after overlap cleared
Running: 1 reqs
Avg generation throughput: ~43 tokens/s

So lowering max_num_batched_tokens helps wall time a lot, but does not restore normal decode ITL while overlap exists.

1024 is not viable on this model/config:

AssertionError: In Mamba cache align mode, block_size (1568) must be <= max_num_batched_tokens (1024).

Why this seems worth discussing here

docs/DUAL_CARD.md documents dual-turbo as true 4-stream parallel decoding, 269 TPS aggregate at 4 streams. The script behind that uses the canonical short code prompt:

PROMPT = "Write a Python implementation of quicksort with comments explaining each step."
MAX_TOKENS = 800

That is a valid decode-concurrent benchmark. The agentic production shape is different: long tool returns, long accumulated context, or large file reads can enter prefill while another interactive response is decoding.

vLLM's docs say chunked prefill should prioritize decode requests, but with Qwen3.6's hybrid GDN/Mamba align path, decode still appears to starve during large prefill chunks.

Questions

Do the existing dual-turbo / TP=4 concurrency validations exercise long-prefill-overlap, or are they intentionally decode-concurrent only?
Is max_num_batched_tokens=2048 a reasonable production mitigation for dual.yml, or is there a better known value for Qwen3.6 align-mode? 1024 is invalid because block_size=1568 must be <= the budget.
Would dual-turbo avoid this shape because of TQ3 + Genesis + MTP + max_num_batched_tokens=4128, or would it show the same starvation if tested with a long-prefill-overlap repro?
Is proxy-level admission control the expected production answer for agentic traffic? For example: classify large-prefill requests by estimated prompt tokens and avoid admitting them while an interactive decode is active.
If this behavior is expected, would it be worth documenting the distinction between decode-concurrent throughput and long-prefill-overlap latency?

Happy to run whichever A/B would be most useful. The candidates I see are:

exact floor-ish budget, max_num_batched_tokens=1568
dual-turbo with the same overlap shape, if useful as a controlled test
prefix caching off as a diagnostic only, to bypass mamba-cache align
max_num_seqs=1 or proxy serialization as a UX baseline

noonghunna · 2026-05-23T23:55:30Z

noonghunna
May 23, 2026
Maintainer

This is an excellent characterization — and you've correctly found a regime our published numbers don't cover. Mechanism first, since it ties the answers together.

The mechanism (our read — consistent with our GDN forensics, not yet instrumented for this exact overlap)

Chunked prefill co-batches your big prefill's current chunk with the decoding request's single token into one forward step. That step's latency is dominated by the prefill chunk — and on Qwen3.6's hybrid GDN/Mamba layers the chunk's recurrent forward is heavy (same kernel family behind our single-card "Cliff 2"). So during overlap each step ≈ chunk-time, your decoder emits ~1 token/step, and ITL balloons to the chunk-step time → your 0.2–0.9 TPS.

Shrinking max_num_batched_tokens shrinks the chunk → faster steps → decode tokens land more often (your 8192→2048: 0.2→0.9). But align floors the chunk at block_size=1568, so there's a hard architectural bound — decode-during-overlap ITL is gated by the 1568-token GDN prefill-chunk step time and can't be tuned to zero. That single fact answers most of your questions:

Q1 — decode-concurrent only. Confirmed. The 269 TPS @ 4-stream uses the short quicksort prompt (4 short-prefill streams decoding together). We have not benchmarked long-prefill-overlap — genuine gap, thank you for naming it.

Q2 — 1568 is your floor; 2048 is a fine production value. block_size=1568 is the align hard minimum (your 1024 error). 8192→2048 already captured most of the win; 1568 may nudge ITL slightly but hits the same bound. So yes — 2048 is reasonable; treat it as mitigation, not a fix.

Q3 — can't give you a clean dual-turbo A/B right now: it's currently parked. dual-turbo's Genesis-MTP path is pinned to a vLLM nightly that's been purged from Docker Hub (#167); we're holding for Sander's next stable Genesis release before re-pinning to a current vLLM branch, so it won't boot cleanly today. Two things worth knowing for when it's back: (a) it runs max_num_batched_tokens=4128 — larger than your 2048, so by your own chunk-size finding it'd likely starve more than your 2048 unless its Genesis GDN-chunked-prefill patches change the dynamic; (b) it ships MTP n=3, which you disabled for tool-call correctness — so it'd reintroduce that. Net: not a free win, and currently un-testable.

Q4 — yes, proxy admission control is the right answer — and note priority scheduling won't save you. --scheduling-policy priority controls admission/preemption order, not intra-step compute sharing; once a prefill chunk is co-batched with your decode in one step, priority can't un-starve it. So classify-and-gate at the proxy (don't admit a >N-token prefill while an interactive decode is live, or route big prefills to a separate queue) is the expected pattern. (max_num_seqs=1 serializes it away entirely at the cost of all concurrency — a useful UX baseline to measure against.)

Q5 — yes, and we'll document it. The decode-concurrent (throughput) vs long-prefill-overlap (latency) distinction is real and our DUAL_CARD.md elides it. We'll add it so the 269-TPS number isn't read as a latency guarantee under agentic traffic.

On the A/Bs — you're better placed to run them than we are

You've already got the repro, the rig, and the exact production config — and honestly we don't have spare cycles to stand up a faithful long-prefill-overlap harness our side right now (and dual-turbo, the most interesting arm, is parked per Q3). So if you're up for it, the two highest-signal runs on dual.yml would be:

max_num_batched_tokens=1568 (the align floor) — does overlap ITL improve measurably past 2048, or is 2048 already at the bound?
prefix-caching OFF (diagnostic only) — this drops the mamba-cache align path; if overlap ITL changes materially, it confirms align block granularity is the bound (your leading hypothesis and ours).

Plus max_num_seqs=1 as the serialized UX baseline. Report decode ITL during overlap (not just wall time) and we'll fold your numbers + the regime distinction into DUAL_CARD.md with credit. That's the most useful shape of data here — thank you for the rigorous writeup.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Decode starvation when long prefill overlaps active decode on dual.yml #208

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Decode starvation when long prefill overlaps active decode on dual.yml #208

Uh oh!

mgabor3141 May 23, 2026

Decode starvation when long prefill overlaps active decode on dual.yml

TL;DR

Rig / stack

Repro shape

What we observed

Why this seems worth discussing here

Questions

Replies: 1 comment

Uh oh!

noonghunna May 23, 2026 Maintainer

The mechanism (our read — consistent with our GDN forensics, not yet instrumented for this exact overlap)

On the A/Bs — you're better placed to run them than we are

mgabor3141
May 23, 2026

Decode starvation when long prefill overlaps active decode on `dual.yml`

noonghunna
May 23, 2026
Maintainer