Hardware mention for dual modded 3080s #25

troymroberts · 2026-05-01T22:26:19Z

I have dual modded 3080s (20 GB each, 40 GB VRAM total), so they should also fully support this method (going to try it now, actually).

So I feel like they should get a hardware mention.

noonghunna · 2026-05-02T09:19:17Z

Maintainer

@troymroberts — yes, dual modded 3080 20GB pairs are squarely in scope and you're right that this method should fully apply. They share the relevant hardware properties:

Same SM 8.6 (Ampere consumer) — every Triton kernel we ship (PN26b sparse-V, PN12 FFN intermediate pool, PN17 FA2 lse-clamp) is tuned for SM86 and runs identical code on your 3080s
Same compute capability: there is no FA3/FA4 path (Hopper-only) on either card, so your 3080s lose nothing relative to the 3090s
Slightly less per-card VRAM — 20 GB vs 24 GB. The shipped TQ3 + 0.95 mem-util configs at 180K/145K assume ~24 GB. On 20 GB you'd want to drop max-model-len further. Math: PN12+PN25 pool residence + DeltaNet GDN forward state buffer + activation budget is ~8 GB regardless of card; weights are ~7 GB on TP=2; that leaves ~5 GB for KV pool on a 20 GB card. At ~17 KB/token (TQ3 KV), that's ~100K-130K context on TP=2.

If you do try the install:

bash scripts/setup.sh qwen3.6-27b
# For the dual-3080 case, recommend starting at:
bash scripts/switch.sh vllm/dual    # 262K target — will fail engine pre-check, will report estimated max

The engine pre-check will print the real max your config supports. Drop max-model-len to that number in the compose, then retry.

Happy to add a hardware mention in docs/HARDWARE.md once you've reported boot success at a specific ctx ceiling — that data point is more useful than a theoretical "should work" callout. Could you share the boot log + the Maximum concurrency for X tokens per request line once you have a clean config? I'll add it to the docs with credit.

2× modded 3080 20GB is a meaningful coverage point for the SM86 / 40 GB combined memory class — would be the first published data we have outside the 3090 family.

troymroberts May 2, 2026
Author

All 10 checks passed. Here's the validation report — paste-ready for the maintainer.

(I should note that I have my 3080s power-limited to 200 W each.)

Validation: 2× modded RTX 3080 20 GB on TP=2 — qwen3.6-27b-autoround-int4 + Genesis v7.14 + K8V4 + MTP n=3

Hardware: VM 112 on Proxmox, 2× RTX 3080 20 GB modded (PCIe b3:00, 17:00), driver 580.126.09 / CUDA 13.0, NCCL P2P disabled (NCCL_P2P_DISABLE=1, NCCL_CUMEM_ENABLE=0), 200 W per-card power limit. SM 8.6 confirmed by Genesis platform probe.

Active config — compose/docker-compose.turbo.yml, image vllm/vllm-openai:nightly-07351e0883...:

tensor-parallel-size 2, disable-custom-all-reduce
max-model-len 262144 (full target, not dropped to the 100K–130K you predicted)
gpu-memory-utilization 0.82 (down from shipped 0.95 — see below)
kv-cache-dtype turboquant_k8v4 (tighter than the TQ3 your math assumed)
speculative-config '{"method":"mtp","num_speculative_tokens":3}'
enable-prefix-caching, enable-chunked-prefill, qwen3_coder tool parser
Genesis v7.14 P64/P65/P66/P68/P69 enabled

Boot — the line you asked for:
[gpu_worker.py:440] Available KV cache memory: 5.2 GiB
[kv_cache_utils.py:1404] GPU KV cache size: 422,400 tokens
[kv_cache_utils.py:1409] Maximum concurrency for 262,144 tokens per request: 1.43x
[core.py:298] init engine took 235.71 s (compilation: 120.71 s)

5.2 GiB available KV pool per GPU is dead-on your "~5 GB" prediction. K8V4 (vs TQ3) is what got us to the full 262K target instead of the 100K–130K ceiling you sketched.

verify-full.sh — 10/10 pass (after a one-line patch to check 2; see script bug below):

┌─────┬───────────────────────┬───────────────────────────────────────────────────────┐
│ #   │ Check                 │ Result                                                │
├─────┼───────────────────────┼───────────────────────────────────────────────────────┤
│ 1   │ /v1/models            │ ✓                                                     │
├─────┼───────────────────────┼───────────────────────────────────────────────────────┤
│ 2   │ Genesis patches       │ ⊘ (script-anchor bug, not a stack issue)              │
├─────┼───────────────────────┼───────────────────────────────────────────────────────┤
│ 3   │ Paris                 │ ✓                                                     │
├─────┼───────────────────────┼───────────────────────────────────────────────────────┤
│ 4   │ tool_calls[]          │ ✓ get_weather({"city":"San Francisco"})               │
├─────┼───────────────────────┼───────────────────────────────────────────────────────┤
│ 5   │ SSE streaming         │ ✓ 11 chunks, coherent                                 │
├─────┼───────────────────────┼───────────────────────────────────────────────────────┤
│ 6   │ Thinking mode         │ ✓ reasoning 594 chars, content populated              │
├─────┼───────────────────────┼───────────────────────────────────────────────────────┤
│ 7   │ Long-ctx needle       │ ✓ recalled at 9 819 / 29 320 / 58 568 / 91 069 tokens │
├─────┼───────────────────────┼───────────────────────────────────────────────────────┤
│ 8   │ 15 K tool-prefill OOM │ ✓ survived, emitted tool_call cleanly                 │
├─────┼───────────────────────┼───────────────────────────────────────────────────────┤
│ 9   │ Output quality        │ ✓ variety=0.640, max_line_repeat=0, finish=stop       │
├─────┼───────────────────────┼───────────────────────────────────────────────────────┤
│ 10  │ MTP acceptance length │ ✓ AL = 3.00 (≥ 2.0 threshold)                         │
└─────┴───────────────────────┴───────────────────────────────────────────────────────┘

bench.sh — 3 warmup + 5 measured, 1 000 tokens, single-stream:
wall_TPS mean = 48.97 std = 1.05 CV = 2.1% min 47.43 max 50.11
decode_TPS mean = 49.31 std = 1.06 CV = 2.2% min 47.74 max 50.46
TTFT mean = 139 ms std = 1 ms min 138 max 140
GPU state at end of bench: GPU0 65 % util / 196.7 W / 82 °C, GPU1 56 % / 196.9 W / 74 °C — both right at the 200 W cap, thermals fine.

Steady-state aggregate under 8 concurrent reqs (from earlier prod traffic in the boot log): ~210 tokens/s, KV cache usage 52 %.

MTP from earlier real traffic: mean acceptance length 2.46 (8-concurrent, 1000-tok decode), per-position 0.721 / 0.461 / 0.279, draft acceptance 48.7 %.
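
As a sanity check on the MTP figures above, the per-position rates reproduce both summary numbers. This is a minimal sketch, assuming the mean acceptance length counts one guaranteed token per verify step plus the accepted drafts, and that draft acceptance is the average of the per-position rates; the variable names are illustrative.

```python
# Per-position acceptance rates quoted in the report (draft positions 1..3).
per_position = [0.721, 0.461, 0.279]

# Assumption: every verify step emits one token regardless of the drafts,
# plus one more for each accepted draft position.
mean_acceptance_length = 1 + sum(per_position)

# Assumption: draft acceptance = accepted drafts / proposed drafts,
# which equals the average of the per-position rates.
draft_acceptance = sum(per_position) / len(per_position)

print(f"mean acceptance length ~ {mean_acceptance_length:.2f}")
print(f"draft acceptance ~ {draft_acceptance:.1%}")
```

Both values line up with the 2.46 and 48.7 % reported from real traffic, which suggests the three metrics are internally consistent.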

Two things worth flagging back

Script bug: scripts/verify.sh and scripts/verify-full.sh check 2. The marker anchor is [OK] Qwen3 tool_call fix, which Genesis v7.14 no longer emits. The relevant patches are now P15 Qwen3 None/null tool arg parser (applied) and P12 Qwen3 <tool_call> implicit reasoning end (marked upstream_merged, skipped as obsolete). With no matching line, grep exits with status 1; combined with set -euo pipefail, that non-zero status aborts the script before the intended else warn-branch can run.

Minimal fix:
logs="$(docker logs "${CONTAINER}" 2>&1 | { grep -E "Qwen3 tool_call fix|\[FAILED\]" || true; } | tail -5)"
And updating the anchor list to include P15 Qwen3 None/null tool arg parser / upstream_merged would let the OK branch fire on v7.14+ stacks.
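
One detail worth checking in the anchor pattern: in an extended regex, an unescaped [FAILED] is a bracket expression that matches any single one of the letters F, A, I, L, E, D, so it would match almost every log line. The sketch below illustrates the difference using Python's re module, which treats bracket expressions the same way as grep -E for this case. The failing log line is a hypothetical format, not actual Genesis output.

```python
import re

# Unescaped: "[FAILED]" is a character class (any one of F, A, I, L, E, D).
unescaped = re.compile(r"Qwen3 tool_call fix|[FAILED]")
# Escaped: matches the literal text "[FAILED]".
escaped = re.compile(r"Qwen3 tool_call fix|\[FAILED\]")

# A healthy v7.14-style line (prefix taken from the boot log in this thread).
healthy_line = "[INFO:genesis.apply_all] [Genesis] applied: P15 Qwen3 None/null tool arg parser"
# Hypothetical failure line, for illustration only.
failed_line = "[Genesis] [FAILED] P15 Qwen3 None/null tool arg parser"

unescaped_hits_healthy = unescaped.search(healthy_line) is not None
escaped_hits_healthy = escaped.search(healthy_line) is not None
escaped_hits_failed = escaped.search(failed_line) is not None

print(unescaped_hits_healthy, escaped_hits_healthy, escaped_hits_failed)
```

With the brackets escaped, a healthy log yields no match, which is exactly the case the `|| true` guard is there to survive.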

Why 0.82 mem-util, not the shipped 0.95. vLLM nightly's gpu_worker.py reports: "current --gpu-memory-utilization=0.8200 is equivalent to --gpu-memory-utilization=0.8039 without CUDA-graph memory profiling. To maintain the same effective KV cache size as before, increase to 0.8361." On 20 GB cards the cudagraph-profiling overhead is a meaningful slice; we landed at 0.82 because higher values left no headroom for the activation peak in the 15 K tool-prefill case (check 8). Worth a note in docs/HARDWARE.md for future 20 GB users.
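
The fractions in that log message can be turned into an approximate absolute size. A rough sketch, assuming the utilization fraction applies to the full 20480 MiB that a 20 GB card reports (an assumption, not something the log states):

```python
configured = 0.82     # --gpu-memory-utilization used in this config
equivalent = 0.8039   # equivalent value without CUDA-graph profiling (from the log)
total_mib = 20480     # assumed per-card total memory for a 20 GB card

# Share of the card that the CUDA-graph profiling accounts for.
overhead_fraction = configured - equivalent
overhead_mib = overhead_fraction * total_mib

# Raising the setting by the same share should restore the old KV cache size.
compensated = configured + overhead_fraction

print(f"profiling overhead ~ {overhead_fraction:.4f} of the card (~{overhead_mib:.0f} MiB)")
print(f"compensated setting ~ {compensated:.4f}")
```

The compensated value agrees with the 0.8361 that vLLM suggests, so the overhead is on the order of a few hundred MiB per card under this assumption.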

Net: dual modded 3080 20 GB at SM86, 40 GB combined VRAM, runs the full 262K stack with K8V4 + MTP cleanly. First published data point for this hardware class outside the 3090 family, as you noted.

noonghunna May 2, 2026
Maintainer

@troymroberts — this is exactly the data we needed. Thank you. First validated SM86 / 40 GB combined config outside the 3090 family, and the 5.2 GiB available KV per GPU landing dead-on the prediction is the kind of cross-rig confirmation that makes the documentation honest.

A few observations worth highlighting from your run, then both items shipped to master.

What your run tells us

K8V4 vs TQ3 trade-off was the right call. My math assumed TQ3 (~17 KB/token) and predicted 100K-130K. You went K8V4 (~10-11 KB/token at average density) and unlocked the full 262K target. That's a meaningful data point for any future 20 GB user — K8V4 is the right KV config when memory is the binding constraint, not throughput.
49 TPS wall single-stream at 200W power-limited is impressive. For reference, our 2× 3090 PCIe-only on dual.yml (FP8 KV, no power limit) lands ~50 narr / 67 code TPS — your modded 3080s at 200W are essentially matching unrestricted 3090s on the FP8/K8V4 path. This is consistent with the bottleneck being PCIe collective traffic, not raw compute.
AL=3.00 on MTP matches what Sander's A5000 PROD reports — confirms MTP K=3 acceptance is hardware-class agnostic on SM86. 210 TPS aggregate at n=8 is also right in the expected band (4.3× concurrency multiplier, close to our 4.67× on dual-turbo TP=2 with TQ3).
0.82 mem-util on 20 GB — this is the data point I was missing. The cudagraph-profiling overhead is a fixed-ish absolute size, so it eats a bigger fraction of a smaller card. Now documented.
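
For reference, the concurrency multiplier quoted above is just aggregate throughput divided by single-stream throughput. A small sketch using the numbers from the report (the ~210 TPS figure is approximate, so the result is too):

```python
single_stream_tps = 48.97   # bench.sh wall_TPS mean, single stream
aggregate_tps_n8 = 210.0    # approximate aggregate under 8 concurrent requests
concurrent_requests = 8

# How much total throughput grows when going from 1 to 8 streams.
multiplier = aggregate_tps_n8 / single_stream_tps
# What each individual request sees while sharing the engine.
per_request_tps = aggregate_tps_n8 / concurrent_requests

print(f"concurrency multiplier ~ {multiplier:.1f}x")
print(f"per-request throughput ~ {per_request_tps:.1f} TPS")
```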

Both items shipped to master (commit `95b0905`)

1. Verify script anchor + pipefail bug — fixed

Updated scripts/verify.sh and scripts/verify-full.sh check 2 to use Genesis v7.14+ canonical markers:

apply_all elapsed: (fires once when apply_all completes clean)
[Genesis] applied: (per-patch apply line, partial-log fallback)
[Genesis] FAILED (any patch errored)

And wrapped the grep in { ... || true; } per your suggestion so set -euo pipefail doesn't abort the script when grep returns 1 on no-match. Credited you in the inline comment.

2. `docs/HARDWARE.md` — 2× modded 3080 20 GB row added (with credit)

2× RTX 3080 modded 20 GB | 20 GB / card (40 GB combined) | sm_86 | Tested 2026-05-02 by @troymroberts (#25) at 200W/card power limit. dual.yml (TQ k8v4 KV + MTP K=3) boots at full 262K target with gpu-memory-utilization=0.82 (down from shipped 0.95; see note below). Available KV pool 5.2 GiB/card, max concurrency 1.43×. verify-full 10/10 pass; bench 49 TPS wall single-stream, 210 TPS aggregate at n=8. First published SM86 / 40 GB combined data point outside the 3090 family.

Plus a sub-section noting the 0.82 mem-util target for 20 GB cards. Same commit also adds an A5000 row noting it as Sander's PROD class — useful context for newcomers.

Open follow-ups (low priority)

Power-scaling curve at 250W / 300W on your rig would be interesting if you ever feel like it — would tell us whether the 200W cap is leaving meaningful TPS on the table or whether the modded 3080's silicon ceiling is below 250W under sustained load. Not blocking; just a "if you ever bench with the cap raised" ask.
dual-turbo.yml (TQ3 + spec-decode + 4-stream concurrent serving) would be worth a try on your config now that the basic dual.yml + K8V4 path is validated. The 3-bit KV pool would push you past 4× concurrency at 262K. Same caveat — only if you have time.

The contribution loop you just demonstrated — clone, verify-full, post the report — is exactly the model the pinned welcome post was just rewritten to call out. Yours is the first 40 GB combined SM86 row in HARDWARE.md, and the 0.82 mem-util tuning is a real piece of documentation that came from your rig that nobody else could have surfaced.

Welcome aboard, properly. Thanks for the rigor.

troymroberts · 2026-05-02T11:44:44Z

Author

Hey there. Yes, I fully intend to unlock the power scaling.

I have them in a Dell Precision 9820, and I didn't get the correct Dell GPU power splitter cables that I ordered, so each GPU only has one 8-pin power connector installed for now. When I get the correct cables I will bench at full power.

But thanks for the repo; it was very helpful.

I'm excited to see how far these cards can be pushed. For example, if they ever get around to implementing PFlash in vLLM, that could meaningfully reduce wall-clock time even more.

noonghunna May 2, 2026
Maintainer

@troymroberts — thanks for the context on the power cabling. Dell Precision 9820 + correct splitter cables for full 350W per card should be a meaningful jump from your 200W-capped numbers; once you have them benched I'll happily add a "before/after power scaling" row to docs/HARDWARE.md for the modded 3080 line. That curve is data we don't have on any rig and would help future users size their PSU/cabling planning.

Update on Pflash — apologies, I missed Sandro Puppo's announcement when I checked yesterday. The lucebox post gives a clear picture of what it is, and you're right that this is genuinely exciting for your hardware:

Prefill accelerator, not decode — sits in front of DFlash. The DFlash daemon handles the generation phase; Pflash uses a small drafter (Qwen3-0.6B) + block-sparse attention to score token importance across the full prompt, keep only top-scoring spans (keep_ratio=0.05 claimed), and shrink 128K prompts to ~6.5K tokens before the target model's prefill runs.
Headline claim: TTFT 24.8s vs ~257s vanilla llama.cpp at 128K — ~10.4× prefill speedup, with NIAH single-needle retrieval intact at every tested ctx (32K → 128K).
Hardware: sm_80+, benched on RTX 3090, squarely in the Ampere class we ship for. Modded 3080 20GB is sm_86 too, so the same kernels should run; expect lower absolute numbers than the 3090 figures, given the 3080's fewer SMs and lower memory bandwidth (and your 200W power cap).
Engine reality: C++/CUDA only, custom ggml graph + hand-written kernels — "No Python, no Triton, no PyTorch in the inference loop." Not integrated into vLLM or llama.cpp. To benefit, you'd need to run the lucebox-hub server stack (Luce-Org/lucebox-hub on GitHub), which is the same repo we already track in docs/UPSTREAM.md for DFlash.
MIT licensed, public April 2026.
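
A quick arithmetic check on the headline claims above, assuming "128K" means 128 × 1024 tokens (the post may mean 128,000, which changes the result only slightly):

```python
prompt_tokens = 128 * 1024   # assumed meaning of "128K"
keep_ratio = 0.05            # claimed fraction of prompt tokens kept

# Tokens that reach the target model's prefill after compression.
kept_tokens = prompt_tokens * keep_ratio

ttft_vanilla_s = 257.0       # claimed vanilla llama.cpp TTFT at 128K
ttft_pflash_s = 24.8         # claimed TTFT with the prefill accelerator

speedup = ttft_vanilla_s / ttft_pflash_s

print(f"kept tokens ~ {kept_tokens:.0f}")
print(f"prefill speedup ~ {speedup:.1f}x")
```

The kept-token count and the speedup both match the "~6.5K tokens" and "~10.4×" figures, so the claims are at least arithmetically self-consistent.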

What this changes on club-3090's roadmap:

I just added Pflash to UPSTREAM.md as a watch entry alongside DFlash. The blocker for adoption here is the same one DFlash hit: lucebox-hub is its own server stack with daemon-mode quirks (greedy-only sampling, "empty prompt" regression, vision off, our verify-stress doesn't fully pass at >65K). Until either (a) lucebox-hub stabilizes for production agent traffic or (b) someone ports the ideas into vLLM, Pflash stays a watch entry rather than a shipping option.

But the techniques are genuinely interesting cross-rig signal:

Speculative prefill via drafter attention scoring is a generic idea that could land in vLLM independently
Block-sparse attention on the drafter is exactly the SM86 sparse-V family that Sandermage's PN26b targets — there's potentially shared kernel work here
The combined prefill-acceleration + spec-decode-decoding stack is the right framing for any future TTFT-bound workload (RAG, long-context coding agents)

I'll flag it to Sandermage on the next Genesis discussion — Pflash + Genesis composing on the same z-lab draft model is exactly the kind of cross-stack experiment our community could surface bugs for.

Thanks for the heads-up. The "wait, what's that?" prompt is exactly how new techniques propagate from one cross-rig user to the rest of the community — please keep flagging them.

troymroberts May 2, 2026
Author

Yeah sorry lol, I should have included a link, as the PFlash article only dropped hours ago.

My speculation is that it will arrive in vLLM sooner or later, just like DFlash, as the idea is generic enough to be part of the standard engine.

noonghunna May 2, 2026
Maintainer

@troymroberts — no apology needed, the timing was fine. We tracked it down via @Sandermage's discussion thread reference once we had the right keyword, and added a watch row in docs/UPSTREAM.md with the lucebox blog + Sandro's announcement linked.

Your prediction tracks. Two specific signals that suggest a vLLM port is plausible:

DFlash precedent. vLLM mainline already has a dflash spec-decode method (vllm/v1/spec_decode/dflash.py in our pinned nightly). Same author group (Luce-Org). PFlash sits in front of DFlash as a prefill accelerator — natural extension of the existing pipeline.
Block-sparse drafter attention is generic. PFlash uses block-sparse attention on a small drafter (Qwen3-0.6B) for token-importance scoring. vLLM's existing sparse-V infrastructure (which Sandermage's PN26b kernel targets for SM86 specifically) is structurally similar. A community port wouldn't need to reinvent kernels; just wire the prefill-compression path.

club-3090 plans to explore it once the lucebox-hub server stabilizes (it currently has the daemon-mode and greedy-only quirks documented in UPSTREAM.md). When PR #40898 (DFlash SWA support) lands and the upstream Qwen3-Next path matures, PFlash adoption probably follows shortly after.

Keep flagging things you spot. The cross-rig signal loop is what makes this stack honest.

efschu · 2026-05-02T21:47:28Z

running 2x3080 @ 320W PL

========== NARRATIVE (prompt=65 chars, max_tokens=1000) ==========
=== warmups (3) ===
warm-1 wall= 12.98s ttft= 133ms toks=1000 wall_TPS= 77.05 decode_TPS= 77.84
warm-2 wall= 13.28s ttft= 111ms toks=1000 wall_TPS= 75.31 decode_TPS= 75.95
warm-3 wall= 13.51s ttft= 88ms toks=1000 wall_TPS= 74.00 decode_TPS= 74.49

=== measured (5) ===
run-1 wall= 12.60s ttft= 114ms toks=1000 wall_TPS= 79.34 decode_TPS= 80.06
run-2 wall= 13.31s ttft= 81ms toks=1000 wall_TPS= 75.14 decode_TPS= 75.60
run-3 wall= 13.34s ttft= 87ms toks=1000 wall_TPS= 74.95 decode_TPS= 75.44
run-4 wall= 13.36s ttft= 80ms toks=1000 wall_TPS= 74.86 decode_TPS= 75.32
run-5 wall= 13.08s ttft= 79ms toks= 953 wall_TPS= 72.85 decode_TPS= 73.29

=== summary [narrative] (n=5) ===
wall_TPS mean= 75.43 std= 2.38 CV= 3.2% min=72.85 max=79.34
decode_TPS mean= 75.94 std= 2.49 CV= 3.3% min=73.29 max=80.06
TTFT mean= 88ms std= 15ms min=79ms max=114ms

========== CODE (prompt=78 chars, max_tokens=800) ==========
=== warmups (3) ===
warm-1 wall= 4.75s ttft= 112ms toks= 464 wall_TPS= 97.62 decode_TPS= 99.98
warm-2 wall= 7.06s ttft= 111ms toks= 688 wall_TPS= 97.42 decode_TPS= 98.98
warm-3 wall= 7.69s ttft= 109ms toks= 739 wall_TPS= 96.05 decode_TPS= 97.43

=== measured (5) ===
run-1 wall= 5.94s ttft= 116ms toks= 578 wall_TPS= 97.28 decode_TPS= 99.21
run-2 wall= 7.81s ttft= 112ms toks= 784 wall_TPS=100.38 decode_TPS=101.84
run-3 wall= 8.30s ttft= 109ms toks= 800 wall_TPS= 96.36 decode_TPS= 97.64
run-4 wall= 8.36s ttft= 110ms toks= 800 wall_TPS= 95.73 decode_TPS= 97.02
run-5 wall= 8.22s ttft= 109ms toks= 800 wall_TPS= 97.31 decode_TPS= 98.62

=== summary [code] (n=5) ===
wall_TPS mean= 97.41 std= 1.79 CV= 1.8% min=95.73 max=100.38
decode_TPS mean= 98.86 std= 1.87 CV= 1.9% min=97.02 max=101.84
TTFT mean= 111ms std= 3ms min=109ms max=116ms

=== GPU state ===
0, 86 %, 16445 MiB, 20480 MiB, 295.86 W, 83
1, 90 %, 16445 MiB, 20480 MiB, 253.60 W, 84

K8V4
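
The summary rows can be reproduced from the five measured runs. A minimal sketch for the narrative wall_TPS row, assuming the bench script reports the sample standard deviation (n − 1), which is what matches the printed values:

```python
import statistics

# wall_TPS of the five measured narrative runs listed above.
wall_tps = [79.34, 75.14, 74.95, 74.86, 72.85]

mean = statistics.mean(wall_tps)
std = statistics.stdev(wall_tps)   # sample standard deviation (n - 1)
cv = std / mean                    # coefficient of variation

print(f"wall_TPS mean={mean:.2f} std={std:.2f} CV={cv:.1%}")
```

Using the population standard deviation (statistics.pstdev) instead would give a noticeably smaller spread, so it matters which one is used when comparing against other people's numbers.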

noonghunna May 2, 2026
Maintainer

Thank you @efschu — this is a genuinely valuable cross-rig data point for two reasons:

What you just confirmed

1. K8V4 on dual-card is highly competitive — possibly the right "balanced" choice.

Your numbers: 75.43 narr / 97.41 code TPS on 2× 3080 at 320W PL with K8V4 KV. For context, our shipped docker-compose.dual-turbo.yml (TQ3 KV on 2× 3090 at 230W PL) measures 53.65 narr / 72.93 code. You're getting roughly +41% narr / +34% code with K8V4, despite smaller per-card VRAM, though with a higher power limit (320W vs 230W). That's a strong signal that K8V4 has a real niche between fp8 (dual.yml) and TQ3 (dual-turbo.yml).

This is direct evidence for discussion #33 — Middle Ground between dual and dual-turbo? which @JusefPol opened earlier today asking exactly this question. Going to cross-link your numbers there.
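
The uplift figures in point 1 can be recomputed from the four means. A small sketch (uplift_pct is an illustrative helper, not part of the repo):

```python
def uplift_pct(new_tps: float, baseline_tps: float) -> float:
    """Relative improvement of new_tps over baseline_tps, in percent."""
    return (new_tps / baseline_tps - 1.0) * 100.0

# 2x 3080 at 320W PL versus the shipped dual-turbo numbers on 2x 3090 at 230W PL.
narr_uplift = uplift_pct(75.43, 53.65)
code_uplift = uplift_pct(97.41, 72.93)

print(f"narrative: +{narr_uplift:.1f}%")
print(f"code:      +{code_uplift:.1f}%")
```

This works out to roughly 40.6 % and 33.6 %. The two rigs differ in power limit as well as KV dtype, so the uplift cannot be attributed to the KV config alone.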

2. Power scaling on modded 3080 20GB matters significantly.

@troymroberts at 200W PL got 49 TPS narr / 210 TPS aggregate at n=8 (same hardware class). You at 320W get 75 TPS narr — that's the SM86 family's per-card frequency curve scaling exactly as expected. First published 320W datapoint for modded 3080-20GB.
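
To put the two power limits side by side, here is a rough sketch that compares throughput per watt of configured power limit. It is indicative only: the two results come from different rigs and configs, and the power limit is an upper bound rather than measured draw.

```python
# Narrative single-stream means and per-card power limits from the two reports.
rigs = {
    "200W": {"power_limit_w": 200, "narr_tps": 48.97},
    "320W": {"power_limit_w": 320, "narr_tps": 75.43},
}

tps_ratio = rigs["320W"]["narr_tps"] / rigs["200W"]["narr_tps"]
power_ratio = rigs["320W"]["power_limit_w"] / rigs["200W"]["power_limit_w"]

# Throughput per watt of configured limit, per rig.
tps_per_watt = {
    name: rig["narr_tps"] / rig["power_limit_w"] for name, rig in rigs.items()
}

print(f"TPS ratio ~ {tps_ratio:.2f}x for a {power_ratio:.1f}x power limit")
for name, value in tps_per_watt.items():
    print(f"{name}: {value:.3f} TPS per watt of limit")
```

Throughput rises a little less than the power limit does, which is the usual shape of a frequency/power curve near the top of its range.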

To make this docs-actionable, could you share

A few things would help us turn this into a credited row in docs/HARDWARE.md (alongside troymroberts's 200W row):

Which compose config? Did you edit dual-turbo.yml to swap turboquant_3bit_nc → turboquant_k8v4, or are you using something else?
--max-model-len you set — likely lower than 262K given the 16.45 GB/card showing in your nvidia-smi output. What ctx ceiling did you settle on?
--gpu-memory-utilization — the K8V4 + 2× 20GB combo probably needs lower than dual-turbo's 0.85.
First ~100 lines of your container boot log:
```
docker logs <your-container-name> 2>&1 | head -100
```
That captures the Genesis patch banner + final engine config + KV pool size — enough to reproduce on our side.

club-3090 commit SHA you tested against:

git -C /path/to/club-3090 rev-parse --short HEAD

Once we have the boot log + commit, I'll add a row to docs/HARDWARE.md crediting you, and reference your numbers in the #33 k8v4-variant evaluation.

Thanks for taking the time to bench properly (3 warmups + 5 measured, full GPU state output) — exactly the data shape we need to build out cross-rig coverage of the SM86 family.

efschu · 2026-05-03T06:04:21Z

I'm sorry, it was late yesterday and I started the wrong Docker container to bench. I started the "normal dual" one with fp8_e5m2 instead of turbo.

Here are the numbers for dual-turbo.yml with gpu-memory-utilization "0.82" and turboquant_3bit_nc:

========== NARRATIVE (prompt=65 chars, max_tokens=1000) ==========
=== warmups (3) ===
  warm-1     wall= 18.10s  ttft=  2725ms  toks=1000  wall_TPS= 55.26  decode_TPS= 65.05
  warm-2     wall= 13.38s  ttft=    64ms  toks=1000  wall_TPS= 74.76  decode_TPS= 75.11
  warm-3     wall= 13.28s  ttft=    94ms  toks=1000  wall_TPS= 75.32  decode_TPS= 75.85

=== measured (5) ===
  run-1      wall= 13.15s  ttft=    93ms  toks= 968  wall_TPS= 73.63  decode_TPS= 74.15
  run-2      wall= 13.28s  ttft=    93ms  toks=1000  wall_TPS= 75.31  decode_TPS= 75.84
  run-3      wall= 12.33s  ttft=    91ms  toks=1000  wall_TPS= 81.11  decode_TPS= 81.71
  run-4      wall= 13.22s  ttft=    68ms  toks=1000  wall_TPS= 75.62  decode_TPS= 76.01
  run-5      wall= 13.11s  ttft=    92ms  toks=1000  wall_TPS= 76.27  decode_TPS= 76.81

=== summary [narrative] (n=5) ===
  wall_TPS       mean=  76.39   std=  2.81   CV= 3.7%   min=73.63   max=81.11
  decode_TPS     mean=  76.91   std=  2.86   CV= 3.7%   min=74.15   max=81.71
  TTFT          mean=    87ms  std=   11ms  min=68ms  max=93ms

========== CODE (prompt=78 chars, max_tokens=800) ==========
=== warmups (3) ===
  warm-1     wall=  8.12s  ttft=    95ms  toks= 800  wall_TPS= 98.50  decode_TPS= 99.67
  warm-2     wall=  7.52s  ttft=    92ms  toks= 744  wall_TPS= 98.94  decode_TPS=100.16
  warm-3     wall=  6.92s  ttft=    95ms  toks= 694  wall_TPS=100.31  decode_TPS=101.71

=== measured (5) ===
  run-1      wall=  7.14s  ttft=    93ms  toks= 704  wall_TPS= 98.58  decode_TPS= 99.88
  run-2      wall=  6.83s  ttft=    93ms  toks= 688  wall_TPS=100.75  decode_TPS=102.14
  run-3      wall=  8.11s  ttft=    94ms  toks= 800  wall_TPS= 98.69  decode_TPS= 99.85
  run-4      wall=  6.18s  ttft=    93ms  toks= 628  wall_TPS=101.55  decode_TPS=103.10
  run-5      wall=  7.63s  ttft=    91ms  toks= 800  wall_TPS=104.88  decode_TPS=106.15

=== summary [code] (n=5) ===
  wall_TPS       mean= 100.89   std=  2.58   CV= 2.6%   min=98.58   max=104.88
  decode_TPS     mean= 102.22   std=  2.62   CV= 2.6%   min=99.85   max=106.15
  TTFT          mean=    93ms  std=    1ms  min=91ms  max=94ms

=== GPU state ===
0, 91 %, 15775 MiB, 20480 MiB, 305.66 W, 82
1, 96 %, 15775 MiB, 20480 MiB, 259.89 W, 84

For this run I used dual-turbo.yml with gpu-memory-utilization "0.82" and kv-cache-dtype turboquant_k8v4:

========== NARRATIVE (prompt=65 chars, max_tokens=1000) ==========
=== warmups (3) ===
  warm-1     wall= 14.83s  ttft=  1088ms  toks=1000  wall_TPS= 67.43  decode_TPS= 72.77
  warm-2     wall= 13.80s  ttft=    62ms  toks=1000  wall_TPS= 72.46  decode_TPS= 72.79
  warm-3     wall= 12.89s  ttft=    63ms  toks=1000  wall_TPS= 77.61  decode_TPS= 77.99

=== measured (5) ===
  run-1      wall= 12.12s  ttft=    63ms  toks= 918  wall_TPS= 75.77  decode_TPS= 76.17
  run-2      wall= 13.40s  ttft=    91ms  toks= 996  wall_TPS= 74.31  decode_TPS= 74.82
  run-3      wall= 12.46s  ttft=    90ms  toks= 974  wall_TPS= 78.16  decode_TPS= 78.73
  run-4      wall= 12.57s  ttft=    93ms  toks= 979  wall_TPS= 77.87  decode_TPS= 78.45
  run-5      wall= 12.32s  ttft=    90ms  toks= 964  wall_TPS= 78.26  decode_TPS= 78.83

=== summary [narrative] (n=5) ===
  wall_TPS       mean=  76.87   std=  1.76   CV= 2.3%   min=74.31   max=78.26
  decode_TPS     mean=  77.40   std=  1.81   CV= 2.3%   min=74.82   max=78.83
  TTFT          mean=    85ms  std=   13ms  min=63ms  max=93ms

========== CODE (prompt=78 chars, max_tokens=800) ==========
=== warmups (3) ===
  warm-1     wall=  5.22s  ttft=    91ms  toks= 486  wall_TPS= 93.09  decode_TPS= 94.74
  warm-2     wall=  4.41s  ttft=    90ms  toks= 444  wall_TPS=100.71  decode_TPS=102.81
  warm-3     wall=  3.84s  ttft=    91ms  toks= 372  wall_TPS= 96.77  decode_TPS= 99.12

=== measured (5) ===
  run-1      wall=  4.18s  ttft=    93ms  toks= 424  wall_TPS=101.43  decode_TPS=103.75
  run-2      wall=  5.16s  ttft=    92ms  toks= 501  wall_TPS= 97.04  decode_TPS= 98.79
  run-3      wall=  5.41s  ttft=    91ms  toks= 552  wall_TPS=102.01  decode_TPS=103.75
  run-4      wall=  4.50s  ttft=    93ms  toks= 443  wall_TPS= 98.36  decode_TPS=100.43
  run-5      wall=  7.77s  ttft=    90ms  toks= 785  wall_TPS=101.02  decode_TPS=102.20

=== summary [code] (n=5) ===
  wall_TPS       mean=  99.97   std=  2.15   CV= 2.2%   min=97.04   max=102.01
  decode_TPS     mean= 101.78   std=  2.16   CV= 2.1%   min=98.79   max=103.75
  TTFT          mean=    92ms  std=    1ms  min=90ms  max=93ms

=== GPU state ===
0, 92 %, 15773 MiB, 20480 MiB, 306.70 W, 81
1, 95 %, 15773 MiB, 20480 MiB, 266.92 W, 84


WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
[INFO:genesis.apply_all] Genesis platform: {"vendor": {"is_nvidia_cuda": true, "is_amd_rocm": false, "is_intel_xpu": false, "is_cpu_only": false}, "nvidia": {"compute_capability": [8, 6], "is_ampere_datacenter": false, "is_ampere_consumer": true, "is_ada_lovelace": false, "is_hopper": false, "is_blackwell": false, "has_native_fp8": false}, "amd": {}, "versions": {"torch": [2, 11], "transformers": [5, 6, 2], "vllm": [0, 20, 1], "flash_attn_major": 2}, "paths": {"vllm_install_root": "/usr/local/lib/python3.12/dist-packages/vllm"}}
[INFO:genesis.apply_all] Genesis Unified Patch v7.0 — Ampere FP8 + TQ + MoE + Hybrid + bugfixes. Philosophy: МЫ ЧИНИМ, НЕ ЛОМАЕМ.
[INFO:genesis.apply_all] [Genesis registry] 100 dispatcher entries — schema-clean, dependency graph consistent.
[INFO:genesis.apply_all] ==============================================================================
[INFO:genesis.apply_all] [Genesis GPU profile] detected: NVIDIA GeForce RTX 3080
[INFO:genesis.apply_all]   canonical: RTX 3080  cc: (8, 6)  SM: 68  L2: 5 MB  BW: 760 GB/s  regime: bandwidth
[INFO:genesis.apply_all] 
[INFO:genesis.apply_all] Per-patch recommendations:
[INFO:genesis.apply_all] ------------------------------------------------------------------------------
[INFO:genesis.apply_all]   [OFF] P40                TQ k8v4 GQA grouping kernel (vllm#40792)
[INFO:genesis.apply_all]           gain: +5-15% (mixed regime), +15-30% (compute regime)
[INFO:genesis.apply_all]           (correctly skipped on this regime)
[INFO:genesis.apply_all]           why: Author measured +27% on H100. Empirically NS on A5000 (p=0.28). Cache-locality benefit needs L2 >= 24 MB.
[INFO:genesis.apply_all]   [ON ] P67                Multi-query verify kernel for spec-decode K+1
[INFO:genesis.apply_all]           gain: +25-35%
[INFO:genesis.apply_all]           why: +32% TPS on 35B-A3B-FP8 spec-decode K=3 verify (Genesis internal benchmark, all GPU classes tested).
[INFO:genesis.apply_all]   [REC] P82                SGLang-style acceptance threshold OR-clause
[INFO:genesis.apply_all]           gain: +8-12%
[INFO:genesis.apply_all]           → recommend: export GENESIS_ENABLE_P82=1
[INFO:genesis.apply_all]           why: Cross-rig confirmed: +12% on A5000 FP8, +10.5% on 3090 INT4.
[INFO:genesis.apply_all]   [ON ] PN8                MTP/draft online-quant propagation (vllm#40849)
[INFO:genesis.apply_all]           gain: 0% TPS, but ~1-2 GiB total VRAM headroom
[INFO:genesis.apply_all]           why: Verified ~1 GiB VRAM saved per GPU on 35B-A3B-FP8 + MTP K=3. Use freed VRAM for higher gpu-mem-util or longer ctx.
[INFO:genesis.apply_all]   [ON ] P83+P84+P85        Prefix-cache cake-and-eat patches (vllm#38182)
[INFO:genesis.apply_all]           gain: (currently negative)
[INFO:genesis.apply_all]           ⚠️  enabled but NOT recommended on this GPU (may regress or no-op)
[INFO:genesis.apply_all]           why: Tested 4-arm A/B 2026-04-29: -29% TPS regression even with full stack including root-cause P84. Possible vllm cache machinery fixed-overhead per-step. WAIT for v0.20.2 pin bump.
[INFO:genesis.apply_all] ==============================================================================
[INFO:genesis.wiring.p8_kv_hybrid_reporting] [P8] using V2 import anchor (post-MambaSpec layout)
[INFO:genesis.wiring.text_patch] [P8 KV hybrid reporting (kv_cache_utils)] applied 3 sub-patches: p8_kv_imports, p8_kv_helper_injection, p8_kv_callsite
[INFO:genesis.wiring.text_patch] [P8 KV hybrid reporting (scheduler)] applied 2 sub-patches: p8_sched_import, p8_sched_callsite
[INFO:genesis.apply_all] [Genesis] applied: P8 KV hybrid reporting (per-token capacity) — kv_cache_utils=applied(ok), scheduler=applied(ok)
[INFO:genesis.wiring.text_patch] [P3 TurboQuant BF16->FP8 cast (Ampere fix)] applied 1 sub-patches: p3_bf16_fp8_cast
[INFO:genesis.apply_all] [Genesis] applied: P3 TurboQuant BF16->FP8 cast (Ampere fix) — BF16->FP8 cast guard inserted
[INFO:genesis.wiring.text_patch] [P6 TQ-aware block size alignment] applied 2 sub-patches: p6_import_tqspec, p6_tq_branch
[INFO:genesis.apply_all] [Genesis] applied: P6 TurboQuant-aware attention page size — TQ-aware page-size branch inserted
[INFO:genesis.wiring.text_patch] [P15 Qwen3 None/null tool arg] applied 1 sub-patches: p15_none_null
[INFO:genesis.apply_all] [Genesis] applied: P15 Qwen3 None/null tool arg parser — None/none mapping added to tool param parser
[INFO:genesis.wiring.text_patch] [P12 Qwen3 <tool_call> implicit reasoning end] upstream marker '_tool_call_token_id' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] skipped: P12 Qwen3 <tool_call> implicit reasoning end — upstream_merged
[INFO:genesis.wiring.text_patch] [P27 Qwen3 BEFORE-THINK fallback/p27_nonstream_return_baseline] anchor not found — soft skip (sibling patches continue)
[INFO:genesis.wiring.text_patch] [P27 Qwen3 BEFORE-THINK fallback] applied 3 sub-patches: p27_nonstream_capture, p27_nonstream_return_pr35687, p27_stream_start
[INFO:genesis.apply_all] [Genesis] applied: P27 Qwen3 BEFORE-THINK fallback — BEFORE-THINK fallback wired (non-stream + stream)
[INFO:genesis.wiring.text_patch] [P34 Mamba zero-collapse deadlock guard] applied 1 sub-patches: p34_deadlock_guard
[INFO:genesis.apply_all] [Genesis] applied: P34 Mamba zero-collapse deadlock guard — zero-collapse deadlock guard inserted (fixes #40707 for hybrid Mamba + multimodal)
[INFO:genesis.apply_all] [Genesis] applied: P29 tool parser IndexError guard — upstream already contains bounded-index guards (no-op)
[INFO:genesis.marlin_fp32_reduce] [Genesis P23] Marlin FP32_REDUCE: disabled=True on SM=(8, 6) (auto-from-platform)
[INFO:genesis.apply_all] [Genesis] applied: P23 Marlin FP32_REDUCE env override — decision: fp32_reduce disabled=True (requires upstream wire into Marlin launcher to take effect)
[INFO:genesis.wiring.text_patch] [P4 TurboQuant hybrid model support] applied 2 sub-patches: p4_helper_fn, p4_tq_block
[INFO:genesis.apply_all] [Genesis] applied: P4 TurboQuant hybrid model support — text-patch succeeded
[INFO:genesis.wiring.text_patch] [P5 KV cache page size unification (v1_lcm_pad_max)/p5_import_math] anchor not found — soft skip (sibling patches continue)
[INFO:genesis.wiring.text_patch] [P5 KV cache page size unification (v1_lcm_pad_max)] applied 1 sub-patches: p5_v1_lcm_pad_max_from_baseline
[INFO:genesis.apply_all] [Genesis] applied: P5 KV cache page size unification — text-patch v2 succeeded (pad-smaller-to-max)
[INFO:genesis.apply_all] [Genesis] skipped: P5b KV page-size pad-smaller-to-max (env-opt-in) — opt-in: set GENESIS_ENABLE_P5B=1 to enable pad-smaller-to-max KV page-size strategy (saves ~34% per-block VRAM on Qwen3.6 hybrid vs P5 v1 LCM-pad-up; BLAST-RADIUS is KV allocator → benchmark on VM 100 before enabling in prod)
INFO 05-03 05:54:38 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
INFO 05-03 05:54:38 [nixl_utils.py:32] NIXL is available
[INFO:genesis.wiring.p31_router_softmax] [Genesis P31] wrapped grouped_topk_router.grouped_topk (fp32 upcast active for grouped-MoE models)
[INFO:genesis.apply_all] [Genesis] applied: P31 MoE router fp32 softmax — grouped_topk wrapped (effective in this process)
[INFO:genesis.wiring.p22_tq_prealloc] [P22] PR #40655 merged (bhoomit moved init out) — auto-skip — patch retired in favor of upstream
[INFO:genesis.apply_all] [Genesis] skipped: P22 TurboQuant shared dequant prealloc — PR #40655 merged (bhoomit moved init out) — auto-skip
[INFO:genesis.wiring.text_patch] [P26 TQ prefill output prealloc] upstream marker 'if not hasattr(self, "_cu_2")' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] skipped: P26 TurboQuant prefill output prealloc — upstream_merged
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P61b — Qwen3 streaming partial-tag overlap guard | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P61b Qwen3 streaming overlap guard] applied 2 sub-patches: p61b_import, p61b_overlap_guard
[INFO:genesis.apply_all] [Genesis] applied: P61b Qwen3 streaming partial-tag overlap guard — P61b applied: overlap guard active in streaming path
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P62 — Structured-output spec-decode reasoning-end timing fix | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P62 structured_output/__init__.py] applied 2 sub-patches: p62_grammar_bitmask, p62_new_methods
[INFO:genesis.wiring.text_patch] [P62 scheduler.py] applied 3 sub-patches: p62_sched_update_from_output, p62_sched_udti, p62_sched_udtio
[INFO:genesis.apply_all] [Genesis] applied: P62 structured-output spec-decode timing fix — P62 applied: 2 files modified, 0 idempotent. Reasoning-aware grammar acceptance + spec-token validation now active. Should reduce residual broken tool-call rate when </think> arrives in spec batch.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P61 — Qwen3 multi-tool first-occurrence (DEPRECATED — superseded by P12 v2) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P61 Qwen3 multi-tool first-occurrence] applied 1 sub-patches: p61_first_occurrence
[INFO:genesis.apply_all] [Genesis] applied: P61 Qwen3 multi-tool first-occurrence — P61 applied: extract_content_ids now finds FIRST <tool_call> (was LAST). Multi-tool requests preserve all tool calls.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P60b — GDN+ngram Triton kernel offset (Phase 2) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P60b causal_conv1d.py — Triton kernel + wrapper] applied 8 sub-patches: p60b_kernel_sig, p60b_constexpr, p60b_offset_calc, p60b_step1, p60b_step2, p60b_wrapper_sig, p60b_kernel_call, p60b_kernel_kwarg
[INFO:genesis.wiring.text_patch] [P60b gdn_linear_attn.py — pass num_accepted to conv_fn] applied 1 sub-patches: p60b_gdn_conv_fn
[INFO:genesis.wiring.p60b_gdn_ngram_triton_kernel] [Genesis P60b] Triton cache: no cache dir cleared (may be empty already or permission denied)
[INFO:genesis.apply_all] [Genesis] applied: P60b GDN+ngram Triton kernel offset — P60b Phase 2 applied: 2 files modified, 0 idempotent. Triton kernel offset active. no cache dir cleared (may be empty already or permission denied) First spec-decode call will trigger kernel recompile (~5-10s).
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P60 — GDN+ngram state recovery (Phase 1: SSM pre-copy) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P60 gdn_attn.py — spec_decode_src_indices] applied 3 sub-patches: p60_field, p60_build, p60_ctor
[INFO:genesis.wiring.text_patch] [P60 gdn_linear_attn.py — SSM state pre-copy] applied 2 sub-patches: p60_core, p60_decode
[INFO:genesis.wiring.text_patch] [P60 gpu_model_runner.py — non-spec passthrough] applied 1 sub-patches: p60_passthrough
[INFO:genesis.apply_all] [Genesis] applied: P60 GDN+ngram state recovery — P60 Phase 1 applied: 3 files modified, 0 idempotent. SSM state pre-copy active for GDN+ngram spec decode. Conv state Triton kernel fix (Phase 2) NOT included — if reproducer still shows broken output, Phase 2 is the next step.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P63 — MTP/Eagle drafter GDN state recovery (deprecated — wrong layer) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P63_MTP_GDN_STATE_RECOVERY=1 only for diagnost
[INFO:genesis.apply_all] [Genesis] skipped: P63 MTP/Eagle drafter GDN state recovery — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P63_MTP_GDN_STATE_RECOVERY=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P64 — qwen3coder MTP streaming early-return fix | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P64 qwen3coder_tool_parser.py — MTP streaming early-return removal] applied 2 sub-patches: p64_remove_early_return, p64_unify_emit_at_fnend
[INFO:genesis.wiring.text_patch] [P64 serving.py — MTP safety-net + Pydantic null fix] applied 2 sub-patches: p64_safety_net_widen, p64_callsite_guard
[INFO:genesis.apply_all] [Genesis] applied: P64 qwen3coder MTP streaming early-return fix — P64 applied: 2 files modified, 0 idempotent. qwen3coder streaming parser no longer drops parameters when MTP bundles last param + </function> in same delta. Safety net widened to fire on finish_reason alone.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P65 — TurboQuant spec-decode cudagraph downgrade | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P65 turboquant_attn.py — spec-decode CG downgrade] applied 1 sub-patches: p65_cg_support_downgrade
[INFO:genesis.apply_all] [Genesis] applied: P65 TurboQuant spec-decode cudagraph downgrade — P65 applied: TurboQuant CG support downgraded to UNIFORM_SINGLE_TOKEN_DECODE. Spec-verify K+1 batches now run eager (correct per-request continuation), 1-token decode batches still use cudagraph. Workaround for noonghunna #40880 MTP+TurboQuant bug.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P66 — cudagraph_capture_sizes spec-decode divisibility filter | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P66 config/vllm.py — cudagraph_capture_sizes spec-decode filter] applied 1 sub-patches: p66_size_filter
[INFO:genesis.apply_all] [Genesis] applied: P66 cudagraph_capture_sizes spec-decode divisibility filter — P66 applied: cudagraph_capture_sizes will be filtered to sizes divisible by uniform_decode_query_len when spec-decode is active. Boot 2-4x faster, less peak GPU memory, no mixed-q_len capture risks.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P68 — Auto force tool_choice=required for long-context tool calls | opt-in env (config: neutral)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P69 — Long-context tool-format reminder injection | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P68/P69 serving.py — long-ctx tool-call hook injection] applied 1 sub-patches: p6869_hook_insert
[INFO:genesis.apply_all] [Genesis] applied: P68/P69 long-context tool-call adherence — Hook injected into create_chat_completion. Active mitigations: P68 (auto force tool_choice=required for long ctx); P69 (append tool-format reminder to last user msg). Threshold: GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS env (default 50000, raised from 8000 in v7.65 per Issue #9).
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P70 — Auto-strict-ngram (force prompt_lookup_min>=8) | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to engage

HEAD 95b2c3b
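
A boot log like the one above is easier to compare across runs once it is condensed into a per-patch status table. The following is a minimal sketch, not a Genesis tool: the regular expressions are assumptions derived only from the `applied:` / `skipped:` line shapes and the `set GENESIS_ENABLE_...=1` hints visible in this log, and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed line shape, taken from the log above:
#   [INFO:genesis.apply_all] [Genesis] applied: P3 <title and reason>
#   [INFO:genesis.apply_all] [Genesis] skipped: P12 <title and reason>
LINE_RE = re.compile(
    r"\[INFO:genesis\.apply_all\] \[Genesis\] (applied|skipped): (P\S+)"
)
# Opt-in hints in the log read "set GENESIS_ENABLE_<NAME>=1".
ENV_RE = re.compile(r"set (GENESIS_ENABLE_\w+)=1")


def summarize(log_text):
    """Return (status counts, {patch_id: status}, sorted opt-in env names)."""
    statuses = {}
    env_flags = set()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            status, patch_id = match.groups()
            statuses[patch_id] = status
        env_flags.update(ENV_RE.findall(line))
    return Counter(statuses.values()), statuses, sorted(env_flags)
```

Running `summarize` on two boot logs and diffing the two `statuses` dicts would show which patches changed state between pulls, which is usually the first thing to check when a container stops booting after an update.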

noonghunna · 2026-05-03T09:36:07Z

noonghunna
May 3, 2026
Maintainer

No worries @efschu — easy to mix up at 11pm. Thanks for the correction. Walking back my earlier reply that framed your first post as K8V4 evidence — that was on me; I should have looked at the actual config you ran rather than the bottom-of-output annotation.

Reframing with the corrected attribution:

| Compose | KV | mem-util | 2× 3080 @ 320W (your numbers) | 2× 3090 @ 230W (our numbers) |
|---|---|---|---|---|
| `dual.yml` | fp8_e5m2 | (default 0.92) | 75.43 narr / 97.41 code | 69.05 narr / 88.58 code |
| `dual-turbo.yml` | TQ3 | 0.82 | 76.39 narr / ~99 code | 53.65 narr / 72.93 code |

Two interesting findings from the corrected data:

1. fp8 ≈ TQ3 on your 320W 3080 rig. 75.4 vs 76.4 narr is essentially identical within run-to-run variance. On our 230W 3090 baseline, fp8 (69.05) vs TQ3 (53.65) is a much larger gap: fp8 is about 29% faster, or equivalently TQ3 is about 22% slower. Why the difference?

Hypotheses worth considering:

- The 320W power limit lets your cards stay closer to peak frequency under sustained load. TQ3's higher arithmetic intensity (3-bit unpack + scale) is more sensitive to clock speed than plain fp8 dequant. At our 230W cap, the 3090s downclock during the heavier TQ3 path; your 320W headroom keeps them clocked.
- The 3080 has a narrower memory bus than the 3090 (320-bit vs 384-bit). TQ3 reduces KV bandwidth pressure proportionally more on narrower-bus cards, which could be why the per-stream gap closes on yours.
- It could be just measurement noise. CV is ~3% on both, so a 1 TPS difference between fp8 and TQ3 on your rig isn't necessarily signal.
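
To make the noise question concrete, here is a rough sketch that computes the fp8-vs-TQ3 gap on each rig from the narr numbers in the table and compares it against the ~3% CV mentioned above. The two-sigma threshold and the assumption that runs are independent are mine, chosen only for illustration; `relative_gap` and `exceeds_noise` are hypothetical helper names.

```python
import math

# Narrative-bench TPS values copied from the table in this comment.
RIGS = {
    "2x 3080 @ 320W": {"fp8_e5m2": 75.43, "TQ3": 76.39},
    "2x 3090 @ 230W": {"fp8_e5m2": 69.05, "TQ3": 53.65},
}
CV = 0.03  # run-to-run coefficient of variation quoted in the thread (~3%)


def relative_gap(fp8, tq3):
    """Fractional advantage of fp8 over TQ3, relative to the TQ3 number."""
    return (fp8 - tq3) / tq3


def exceeds_noise(fp8, tq3, cv=CV, k=2.0):
    """Crude check: is |fp8 - tq3| larger than k sigma of the difference?

    Assumes independent runs with std = cv * mean for each config, so the
    difference has std sqrt((cv * fp8)**2 + (cv * tq3)**2).
    """
    sigma = math.sqrt((cv * fp8) ** 2 + (cv * tq3) ** 2)
    return abs(fp8 - tq3) > k * sigma


for rig, tps in RIGS.items():
    gap = relative_gap(tps["fp8_e5m2"], tps["TQ3"])
    signal = exceeds_noise(tps["fp8_e5m2"], tps["TQ3"])
    print(f"{rig}: fp8 vs TQ3 gap {gap:+.1%}, above noise: {signal}")
```

Under these assumptions the 3080 gap (about -1%) sits well inside the noise band, while the 3090 gap (about +29% relative to the TQ3 figure) does not, so only the cross-rig difference needs an explanation.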

If we get a third data point at a different power level (e.g. 230W on your 3080 rig), it'd disambiguate. No pressure — your 320W run is already valuable cross-rig data.

2. The K8V4 question we discussed in #33 is still untested. I'd thought your data was evidence; it wasn't. We don't ship a K8V4 dual variant; my proposal there was to author one (dual-k8v4.yml with --kv-cache-dtype turboquant_k8v4) and run the same bench. Still on the table; just clarifying the evidence base hasn't shifted.

Quick docs-action: getting your data into HARDWARE.md

Both your runs are valuable. To turn them into a proper credited row in docs/HARDWARE.md:

git pull origin master    # gets the new scripts/report.sh diagnostic helper
bash scripts/report.sh --bench > my-rig.md

That captures hardware (incl. PCIe lane width per card — relevant on multi-card boards), full env, AND runs the canonical bench in one pass. Paste the contents into a comment here and I'll add a 2× 3080 @ 320W row to HARDWARE.md crediting you.
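
If you want to sanity-check PCIe lane width and power limit by hand before running the script, something like the sketch below works. It is not what `scripts/report.sh` does internally (I am not reproducing that here); it only uses standard `nvidia-smi --query-gpu` fields, and the field selection and the helper names `parse_gpu_csv` / `query_gpus` are assumptions for illustration.

```python
import csv
import io
import subprocess

# Standard nvidia-smi query fields; which ones matter is an assumption here.
FIELDS = [
    "index",
    "name",
    "pcie.link.gen.current",
    "pcie.link.width.current",
    "power.limit",
    "memory.total",
]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader` output into one dict per GPU."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        dict(zip(FIELDS, (cell.strip() for cell in row)))
        for row in rows
        if row
    ]


def query_gpus():
    """Run nvidia-smi and return the parsed per-GPU records."""
    cmd = [
        "nvidia-smi",
        f"--query-gpu={','.join(FIELDS)}",
        "--format=csv,noheader",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)
```

Calling `query_gpus()` on the host and comparing `pcie.link.width.current` across the two cards shows quickly whether one slot is running at a reduced lane count, which matters for TP=2 traffic.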

efschu · 2026-05-03T10:33:48Z

efschu
May 3, 2026

Pulled, and now the container is not starting anymore:

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
[INFO:genesis.apply_all] Genesis platform: {"vendor": {"is_nvidia_cuda": true, "is_amd_rocm": false, "is_intel_xpu": false, "is_cpu_only": false}, "nvidia": {"compute_capability": [8, 6], "is_ampere_datacenter": false, "is_ampere_consumer": true, "is_ada_lovelace": false, "is_hopper": false, "is_blackwell": false, "has_native_fp8": false}, "amd": {}, "versions": {"torch": [2, 11], "transformers": [5, 6, 2], "vllm": [0, 20, 1], "flash_attn_major": 2}, "paths": {"vllm_install_root": "/usr/local/lib/python3.12/dist-packages/vllm"}}
[INFO:genesis.apply_all] Genesis Unified Patch v7.0 — Ampere FP8 + TQ + MoE + Hybrid + bugfixes. Philosophy: WE FIX, WE DON'T BREAK.
[INFO:genesis.apply_all] [Genesis registry] 100 dispatcher entries — schema-clean, dependency graph consistent.
[INFO:genesis.apply_all] ==============================================================================
[INFO:genesis.apply_all] [Genesis GPU profile] detected: NVIDIA GeForce RTX 3080
[INFO:genesis.apply_all]   canonical: RTX 3080  cc: (8, 6)  SM: 68  L2: 5 MB  BW: 760 GB/s  regime: bandwidth
[INFO:genesis.apply_all] 
[INFO:genesis.apply_all] Per-patch recommendations:
[INFO:genesis.apply_all] ------------------------------------------------------------------------------
[INFO:genesis.apply_all]   [OFF] P40                TQ k8v4 GQA grouping kernel (vllm#40792)
[INFO:genesis.apply_all]           gain: +5-15% (mixed regime), +15-30% (compute regime)
[INFO:genesis.apply_all]           (correctly skipped on this regime)
[INFO:genesis.apply_all]           why: Author measured +27% on H100. Empirically NS on A5000 (p=0.28). Cache-locality benefit needs L2 >= 24 MB.
[INFO:genesis.apply_all]   [ON ] P67                Multi-query verify kernel for spec-decode K+1
[INFO:genesis.apply_all]           gain: +25-35%
[INFO:genesis.apply_all]           why: +32% TPS on 35B-A3B-FP8 spec-decode K=3 verify (Genesis internal benchmark, all GPU classes tested).
[INFO:genesis.apply_all]   [REC] P82                SGLang-style acceptance threshold OR-clause
[INFO:genesis.apply_all]           gain: +8-12%
[INFO:genesis.apply_all]           → recommend: export GENESIS_ENABLE_P82=1
[INFO:genesis.apply_all]           why: Cross-rig confirmed: +12% on A5000 FP8, +10.5% on 3090 INT4.
[INFO:genesis.apply_all]   [ON ] PN8                MTP/draft online-quant propagation (vllm#40849)
[INFO:genesis.apply_all]           gain: 0% TPS, but ~1-2 GiB total VRAM headroom
[INFO:genesis.apply_all]           why: Verified ~1 GiB VRAM saved per GPU on 35B-A3B-FP8 + MTP K=3. Use freed VRAM for higher gpu-mem-util or longer ctx.
[INFO:genesis.apply_all]   [ON ] P83+P84+P85        Prefix-cache cake-and-eat patches (vllm#38182)
[INFO:genesis.apply_all]           gain: (currently negative)
[INFO:genesis.apply_all]           ⚠️  enabled but NOT recommended on this GPU (may regress or no-op)
[INFO:genesis.apply_all]           why: Tested 4-arm A/B 2026-04-29: -29% TPS regression even with full stack including root-cause P84. Possible vllm cache machinery fixed-overhead per-step. WAIT for v0.20.2 pin bump.
[INFO:genesis.apply_all] ==============================================================================
[INFO:genesis.wiring.p8_kv_hybrid_reporting] [P8] using V2 import anchor (post-MambaSpec layout)
[INFO:genesis.wiring.text_patch] [P8 KV hybrid reporting (kv_cache_utils)] applied 3 sub-patches: p8_kv_imports, p8_kv_helper_injection, p8_kv_callsite
[INFO:genesis.wiring.text_patch] [P8 KV hybrid reporting (scheduler)] applied 2 sub-patches: p8_sched_import, p8_sched_callsite
[INFO:genesis.apply_all] [Genesis] applied: P8 KV hybrid reporting (per-token capacity) — kv_cache_utils=applied(ok), scheduler=applied(ok)
[INFO:genesis.wiring.text_patch] [P3 TurboQuant BF16->FP8 cast (Ampere fix)] applied 1 sub-patches: p3_bf16_fp8_cast
[INFO:genesis.apply_all] [Genesis] applied: P3 TurboQuant BF16->FP8 cast (Ampere fix) — BF16->FP8 cast guard inserted
[INFO:genesis.wiring.text_patch] [P6 TQ-aware block size alignment] applied 2 sub-patches: p6_import_tqspec, p6_tq_branch
[INFO:genesis.apply_all] [Genesis] applied: P6 TurboQuant-aware attention page size — TQ-aware page-size branch inserted
[INFO:genesis.wiring.text_patch] [P15 Qwen3 None/null tool arg] applied 1 sub-patches: p15_none_null
[INFO:genesis.apply_all] [Genesis] applied: P15 Qwen3 None/null tool arg parser — None/none mapping added to tool param parser
[INFO:genesis.wiring.text_patch] [P12 Qwen3 <tool_call> implicit reasoning end] upstream marker '_tool_call_token_id' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] skipped: P12 Qwen3 <tool_call> implicit reasoning end — upstream_merged
[INFO:genesis.wiring.text_patch] [P27 Qwen3 BEFORE-THINK fallback/p27_nonstream_return_baseline] anchor not found — soft skip (sibling patches continue)
[INFO:genesis.wiring.text_patch] [P27 Qwen3 BEFORE-THINK fallback] applied 3 sub-patches: p27_nonstream_capture, p27_nonstream_return_pr35687, p27_stream_start
[INFO:genesis.apply_all] [Genesis] applied: P27 Qwen3 BEFORE-THINK fallback — BEFORE-THINK fallback wired (non-stream + stream)
[INFO:genesis.wiring.text_patch] [P34 Mamba zero-collapse deadlock guard] applied 1 sub-patches: p34_deadlock_guard
[INFO:genesis.apply_all] [Genesis] applied: P34 Mamba zero-collapse deadlock guard — zero-collapse deadlock guard inserted (fixes #40707 for hybrid Mamba + multimodal)
[INFO:genesis.apply_all] [Genesis] applied: P29 tool parser IndexError guard — upstream already contains bounded-index guards (no-op)
[INFO:genesis.marlin_fp32_reduce] [Genesis P23] Marlin FP32_REDUCE: disabled=True on SM=(8, 6) (auto-from-platform)
[INFO:genesis.apply_all] [Genesis] applied: P23 Marlin FP32_REDUCE env override — decision: fp32_reduce disabled=True (requires upstream wire into Marlin launcher to take effect)
[INFO:genesis.wiring.text_patch] [P4 TurboQuant hybrid model support] applied 2 sub-patches: p4_helper_fn, p4_tq_block
[INFO:genesis.apply_all] [Genesis] applied: P4 TurboQuant hybrid model support — text-patch succeeded
[INFO:genesis.wiring.text_patch] [P5 KV cache page size unification (v1_lcm_pad_max)/p5_import_math] anchor not found — soft skip (sibling patches continue)
[INFO:genesis.wiring.text_patch] [P5 KV cache page size unification (v1_lcm_pad_max)] applied 1 sub-patches: p5_v1_lcm_pad_max_from_baseline
[INFO:genesis.apply_all] [Genesis] applied: P5 KV cache page size unification — text-patch v2 succeeded (pad-smaller-to-max)
[INFO:genesis.apply_all] [Genesis] skipped: P5b KV page-size pad-smaller-to-max (env-opt-in) — opt-in: set GENESIS_ENABLE_P5B=1 to enable pad-smaller-to-max KV page-size strategy (saves ~34% per-block VRAM on Qwen3.6 hybrid vs P5 v1 LCM-pad-up; BLAST-RADIUS is KV allocator → benchmark on VM 100 before enabling in prod)
INFO 05-03 10:29:09 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
INFO 05-03 10:29:09 [nixl_utils.py:32] NIXL is available
[INFO:genesis.wiring.p31_router_softmax] [Genesis P31] wrapped grouped_topk_router.grouped_topk (fp32 upcast active for grouped-MoE models)
[INFO:genesis.apply_all] [Genesis] applied: P31 MoE router fp32 softmax — grouped_topk wrapped (effective in this process)
[INFO:genesis.wiring.p22_tq_prealloc] [P22] PR #40655 merged (bhoomit moved init out) — auto-skip — patch retired in favor of upstream
[INFO:genesis.apply_all] [Genesis] skipped: P22 TurboQuant shared dequant prealloc — PR #40655 merged (bhoomit moved init out) — auto-skip
[INFO:genesis.wiring.text_patch] [P26 TQ prefill output prealloc] upstream marker 'if not hasattr(self, "_cu_2")' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] skipped: P26 TurboQuant prefill output prealloc — upstream_merged
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P61b — Qwen3 streaming partial-tag overlap guard | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P61b Qwen3 streaming overlap guard] applied 2 sub-patches: p61b_import, p61b_overlap_guard
[INFO:genesis.apply_all] [Genesis] applied: P61b Qwen3 streaming partial-tag overlap guard — P61b applied: overlap guard active in streaming path
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P62 — Structured-output spec-decode reasoning-end timing fix | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P62 structured_output/__init__.py] applied 2 sub-patches: p62_grammar_bitmask, p62_new_methods
[INFO:genesis.wiring.text_patch] [P62 scheduler.py] applied 3 sub-patches: p62_sched_update_from_output, p62_sched_udti, p62_sched_udtio
[INFO:genesis.apply_all] [Genesis] applied: P62 structured-output spec-decode timing fix — P62 applied: 2 files modified, 0 idempotent. Reasoning-aware grammar acceptance + spec-token validation now active. Should reduce residual broken tool-call rate when </think> arrives in spec batch.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P61 — Qwen3 multi-tool first-occurrence (DEPRECATED — superseded by P12 v2) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P61 Qwen3 multi-tool first-occurrence] applied 1 sub-patches: p61_first_occurrence
[INFO:genesis.apply_all] [Genesis] applied: P61 Qwen3 multi-tool first-occurrence — P61 applied: extract_content_ids now finds FIRST <tool_call> (was LAST). Multi-tool requests preserve all tool calls.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P60b — GDN+ngram Triton kernel offset (Phase 2) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P60b causal_conv1d.py — Triton kernel + wrapper] applied 8 sub-patches: p60b_kernel_sig, p60b_constexpr, p60b_offset_calc, p60b_step1, p60b_step2, p60b_wrapper_sig, p60b_kernel_call, p60b_kernel_kwarg
[INFO:genesis.wiring.text_patch] [P60b gdn_linear_attn.py — pass num_accepted to conv_fn] applied 1 sub-patches: p60b_gdn_conv_fn
[INFO:genesis.wiring.p60b_gdn_ngram_triton_kernel] [Genesis P60b] Triton cache: no cache dir cleared (may be empty already or permission denied)
[INFO:genesis.apply_all] [Genesis] applied: P60b GDN+ngram Triton kernel offset — P60b Phase 2 applied: 2 files modified, 0 idempotent. Triton kernel offset active. no cache dir cleared (may be empty already or permission denied) First spec-decode call will trigger kernel recompile (~5-10s).
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P60 — GDN+ngram state recovery (Phase 1: SSM pre-copy) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P60 gdn_attn.py — spec_decode_src_indices] applied 3 sub-patches: p60_field, p60_build, p60_ctor
[INFO:genesis.wiring.text_patch] [P60 gdn_linear_attn.py — SSM state pre-copy] applied 2 sub-patches: p60_core, p60_decode
[INFO:genesis.wiring.text_patch] [P60 gpu_model_runner.py — non-spec passthrough] applied 1 sub-patches: p60_passthrough
[INFO:genesis.apply_all] [Genesis] applied: P60 GDN+ngram state recovery — P60 Phase 1 applied: 3 files modified, 0 idempotent. SSM state pre-copy active for GDN+ngram spec decode. Conv state Triton kernel fix (Phase 2) NOT included — if reproducer still shows broken output, Phase 2 is the next step.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P63 — MTP/Eagle drafter GDN state recovery (deprecated — wrong layer) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P63_MTP_GDN_STATE_RECOVERY=1 only for diagnost
[INFO:genesis.apply_all] [Genesis] skipped: P63 MTP/Eagle drafter GDN state recovery — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P63_MTP_GDN_STATE_RECOVERY=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P64 — qwen3coder MTP streaming early-return fix | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P64 qwen3coder_tool_parser.py — MTP streaming early-return removal] applied 2 sub-patches: p64_remove_early_return, p64_unify_emit_at_fnend
[INFO:genesis.wiring.text_patch] [P64 serving.py — MTP safety-net + Pydantic null fix] applied 2 sub-patches: p64_safety_net_widen, p64_callsite_guard
[INFO:genesis.apply_all] [Genesis] applied: P64 qwen3coder MTP streaming early-return fix — P64 applied: 2 files modified, 0 idempotent. qwen3coder streaming parser no longer drops parameters when MTP bundles last param + </function> in same delta. Safety net widened to fire on finish_reason alone.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P65 — TurboQuant spec-decode cudagraph downgrade | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P65 turboquant_attn.py — spec-decode CG downgrade] applied 1 sub-patches: p65_cg_support_downgrade
[INFO:genesis.apply_all] [Genesis] applied: P65 TurboQuant spec-decode cudagraph downgrade — P65 applied: TurboQuant CG support downgraded to UNIFORM_SINGLE_TOKEN_DECODE. Spec-verify K+1 batches now run eager (correct per-request continuation), 1-token decode batches still use cudagraph. Workaround for noonghunna #40880 MTP+TurboQuant bug.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P66 — cudagraph_capture_sizes spec-decode divisibility filter | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P66 config/vllm.py — cudagraph_capture_sizes spec-decode filter] applied 1 sub-patches: p66_size_filter
[INFO:genesis.apply_all] [Genesis] applied: P66 cudagraph_capture_sizes spec-decode divisibility filter — P66 applied: cudagraph_capture_sizes will be filtered to sizes divisible by uniform_decode_query_len when spec-decode is active. Boot 2-4x faster, less peak GPU memory, no mixed-q_len capture risks.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P68 — Auto force tool_choice=required for long-context tool calls | opt-in env (config: neutral)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P69 — Long-context tool-format reminder injection | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P68/P69 serving.py — long-ctx tool-call hook injection] applied 1 sub-patches: p6869_hook_insert
[INFO:genesis.apply_all] [Genesis] applied: P68/P69 long-context tool-call adherence — Hook injected into create_chat_completion. Active mitigations: P68 (auto force tool_choice=required for long ctx); P69 (append tool-format reminder to last user msg). Threshold: GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS env (default 50000, raised from 8000 in v7.65 per Issue #9).
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P70 — Auto-strict-ngram (force prompt_lookup_min>=8) | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P70 Auto-strict-ngram (force prompt_lookup_min>=8) — opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to engage
[INFO:genesis.wiring.p67_tq_multi_query_kernel] [Genesis P67 H2] baked env at module load: MAX_PRIOR_LEN=4096 DEBUG_COMPARE=False (no per-dispatch env reads)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P67 — TurboQuant multi-query kernel for spec-decode K+1 | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P67 turboquant_attn.py — multi-query kernel hook] applied 1 sub-patches: p67_kernel_hook
[INFO:genesis.apply_all] [Genesis] applied: P67 TurboQuant multi-query kernel for spec-decode K+1 — P67 hook injected at top of _prefill_attention. Multi-query continuation prefill batches (spec-verify K+1 with prior cached KV) will route to Genesis Triton kernel when GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1. Diag: {'env_enabled': True, 'version': 'v7.39_aggressive_tune', 'sm': '8.6', 'fp8_mode': 'e4b15', 'autoconfig': {'BLOCK_KV': 32, 'num_warps': 8, 'num_stages': 3}, 'kernel_built': True}
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P71 — Block-verify rejection sampler (Sun 2024 ICLR) | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P71 Block-verify rejection sampler (vllm#40819 + gemini bug-fixes) — opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P78 — TurboQuant .tolist() capture-guard (adapted from noonghunna) | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P78 TurboQuant .tolist() capture-guard (adapted from noonghunna) — opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P77 — Adaptive ngram K controller (EMA + hysteresis + auto-disable) | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P77 Adaptive ngram K controller (EMA + hysteresis + auto-disable) — opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P79b — Async × spec-decode proposer-sync backport (vllm#40610) | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P79b Async × spec-decode proposer-sync backport (vllm#40610) — opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P79c — Stale spec_token_ids cleanup for unscheduled requests (vllm#37629) | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P79c Stale spec_token_ids cleanup for unscheduled requests (vllm#37629) — opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P81 — fp8 block-scaled MM low-M decode tuning (vllm#40925) | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P81 fp8 block-scaled MM low-M decode tuning (vllm#40925) — opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P82 — SGLang threshold_single OR-clause acceptance (BIASED — opt-in research) | opt-in only — set GENESIS_ENABLE_P82=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P82 SGLang threshold_single OR-clause acceptance (BIASED — opt-in research) — opt-in only — set GENESIS_ENABLE_P82=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P83 — MTP keep-last-cached-block (vllm#38182 downstream symptom — P84 is real fix) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P83 v1/core/single_type_kv_cache_manager.py — MTP keep-last-cached-block (vllm#38182 mitigation)] applied 2 sub-patches: p83_full_attention_skip_pop, p83_sliding_window_skip_pop
[INFO:genesis.apply_all] [Genesis] applied: P83 MTP keep-last-cached-block (vllm#38182 mitigation) — P83 applied: MTP keep-last-cached-block guard installed at both FullAttentionManager and SlidingWindowManager pop sites. Activates ONLY when GENESIS_ENABLE_P83=1 is set in env. MTP-targeted; do NOT enable for true Eagle/Eagle3.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P84 — hash_block_size override (vllm#38182 actual root cause) | opt-in only — set GENESIS_ENABLE_P84=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P84 hash_block_size override (vllm#38182 ACTUAL root cause) — opt-in only — set GENESIS_ENABLE_P84=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P100 — FlashInfer FULL CUDA graph for spec-decode (vllm#41127) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P100 flashinfer.py — native FULL CG for spec-decode (vllm#41127)] applied 11 sub-patches: p100_imports_drop_uniform, p100_fispecdecode_dataclass, p100_metadata_decode_union, p100_cgsupport_uniform_batch, p100_init_decode_wrap_field, p100_init_cgdict, p100_init_reorder_threshold, p100_init_qo_indptr_buffer, p100_get_spec_decode_prefill_wrapper_method, p100_build_query_len_scan_branch, p100_forward_fispecdecode_case
[INFO:genesis.apply_all] [Genesis] applied: P100 FlashInfer FULL CUDA graph for spec-decode (vllm#41127) — P100 v7.62.17 applied: 11 sub-patches on flashinfer.py for native FULL CUDA graph + spec-decode without TRTLLM. 27B variants now get UNIFORM_BATCH cudagraph (was PIECEWISE) for K+1 spec-verify. Expected: +5-10% TPS on Ampere SM 8.6. NO-OP for PROD (TQ backend). Composes with P67/P67b/P98/P99 (different backends).
[INFO:genesis.wiring.text_patch] [P103 model_executor/layers/fla/ops/chunk.py — self-install hook (v7.69, club-3090#19 finding 2)] applied 1 sub-patches: p103_self_install_at_chunk_py_end
[INFO:genesis.wiring.p103_fla_cliff2_chunked] [Genesis P103 self-install] wrapper installed in chunk.py at module-import time (survives `exec vllm serve` + worker spawn)
[INFO:genesis.apply_all] [Genesis] applied: P103 FLA Cliff 2 chunked fwd_h+fwd_o orchestrator — text-patch=applied (chunk.py self-install hook appended (survives `exec vllm serve` + worker spawn)); setattr already idempotent (chunk.py self-install fired or prior apply still active in this process)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P101 — TQ continuation 64-token slicing (vllm#41123 SELECTIVE) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P101 turboquant_attn.py — TQ continuation 64-token slicing (vllm#41123)] applied 2 sub-patches: p101_threshold_lower_and_max_cached, p101_continuation_slicing_loop
[INFO:genesis.apply_all] [Genesis] applied: P101 TQ continuation 64-token slicing (vllm#41123 selective) — P101 v7.62.16 applied: turboquant_attn.py continuation prefill now uses 64-token slicing + 32K cached-len cutoff. Expected: +3-12% TPS on PROD long-context. Composes with P98/P99 (non-overlapping anchors). REMINDER: P56 needs re-anchor against new use_decode_continuation block.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P99 — WorkspaceManager.get_simultaneous memoization (perf hotfix) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P99 workspace.py — memoize get_simultaneous (perf hotfix)] applied 2 sub-patches: p99_get_simultaneous_memo_entry, p99_get_simultaneous_memo_return
[INFO:genesis.apply_all] [Genesis] applied: P99 WorkspaceManager memoize get_simultaneous (perf hotfix) — P99 v7.62.15 applied: WorkspaceManager.get_simultaneous() now memoizes (shapes_and_dtypes, ubatch, ws_data_ptr) → cached views. Cache HIT bypasses list-comps + accumulate + _ensure_workspace_size → ~5x faster per call. Properly invalidates on ws re-alloc.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P98 — TQ WorkspaceManager revert (vllm#40941 perf hotfix) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P98 turboquant_attn.py — revert WorkspaceManager (perf hotfix)] upstream marker 'UNIFORM_SINGLE_TOKEN_DECODE' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] applied: P98 TQ WorkspaceManager revert (vllm#40941 perf hotfix) — P98 v7.62.14 applied: turboquant_attn.py _decode_attention + continuation prefill dequant now use per-layer cached buffers (OLD pattern, pre-vllm#40941). Removes Python indirection from decode hot path. Expected: +15-25% TPS recovery on Ampere small-batch single-stream workload.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P94 — Spec-decode prepare_next_token_ids_padded zero-alloc (vllm#41043) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P94 v1/spec_decode/llm_base_proposer.py — zero-alloc prepare_next_token_ids_padded (vllm#41043)] applied 1 sub-patches: p94_zero_alloc_backup_token_ids
[INFO:genesis.apply_all] [Genesis] applied: P94 Spec-decode prepare_next_token_ids_padded zero-alloc (vllm#41043 backport) — P94 applied: prepare_next_token_ids_padded uses in-place loop instead of .tolist() + list-comp + np.array(). Removes GPU->CPU sync, list-comp allocation, np.array allocation. Algorithmic identity preserved; expected +2-4% wall TPS on MTP K=3 spec decode.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P95 — Marlin TP cudagraph cap on Ampere (vllm#40385) | opt-in only — set GENESIS_ENABLE_P95=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P95 Marlin TP cudagraph cap on Ampere (vllm#40385 backport) — opt-in only — set GENESIS_ENABLE_P95=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P91 — AutoRound row-parallel group cdiv + start-idx fix (vllm#39460) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P91 gptq_marlin.py — cdiv groups + row-group attrs (vllm#39460)] applied 3 sub-patches: p91_gm_floor_input_size_to_cdiv, p91_gm_floor_partition_to_cdiv, p91_gm_register_row_group_attrs
[INFO:genesis.wiring.text_patch] [P91 parameter.py — group-aware row_parallel start_idx (vllm#39460)] applied 1 sub-patches: p91_param_group_aware_start_idx
[INFO:genesis.apply_all] [Genesis] applied: P91 AutoRound row-parallel group cdiv + start-idx fix (vllm#39460 backport) — P91 applied (DUAL FILE): gptq_marlin.py uses cdiv() for scale rows and tags scales/qzeros with row_group_size + row_input_size_per_partition; parameter.py uses group-aware start_idx for row-parallel scale/zero loading. Fixes silent dequant corruption when input_size_per_partition % group_size != 0 at TP>1.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P87 — Marlin W4A16/W8A16 sub-tile output dim pad-on-load (vllm#40361) | opt-in env (config: neutral)
[ERROR:genesis.apply_all] [Genesis] FAILED: P87 Marlin sub-tile output dim pad-on-load (vllm#40361 backport) — P87 marlin.py — sub-tile output dim pad-on-load (vllm#40361): write_error ([Errno 30] Read-only file system: '/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/mixed_precision/marlin.py')
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN8 — MTP/draft online-quant propagation (vllm#40849) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [PN8 model_executor/models/utils.py — get_draft_quant_config online-quant propagation (vllm#40849)] applied 2 sub-patches: pN8_imports, pN8_body
[INFO:genesis.apply_all] [Genesis] applied: PN8 MTP/draft online-quant propagation (vllm#40849 backport) — PN8 applied: get_draft_quant_config() now inherits target's OnlineQuantizationConfig when draft has no explicit quant. Frees ~600 MiB on FP8-target + external-draft (Eagle3/DFlash/MTP). No-op when target is not online-quantized.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN9 — Independent drafter attention backend (vllm#39930) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] skipped: PN9 independent drafter attention backend (vllm#39930 backport) — upstream drift: 'spec_cfg.attention_backend' present in llm_base_proposer.py — PR #39930 (or equivalent) appears merged; PN9 self-retires (use --speculative-config.attention_backend instead)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN11 — GDN a/b contiguity in fix_query_key_value_ordering (vllm#41142) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [PN11 model_executor/layers/mamba/gdn_linear_attn.py — force a/b contiguity in fix_query_key_value_ordering (vllm#41142)] applied 1 sub-patches: pN11_a_b_contiguous
[INFO:genesis.apply_all] [Genesis] applied: PN11 GDN a/b contiguity (vllm#41142 backport) — PN11 applied: GatedDeltaNetAttention.fix_query_key_value_ordering now forces .contiguous() on `b` and `a` after reshape (vllm#41142). Defensive — eliminates silent quality drift class for any future Qwen3-Next/GDN config with num_v_heads == num_k_heads.
[INFO:genesis.apply_all] [Genesis] skipped: P67c sparse-V integration into P67 split-M kernel — opt-in: set GENESIS_ENABLE_P67_SPARSE_V=1 to engage per-q_t sparse-V skip in P67 split-M kernel
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN34 — WorkspaceManager runtime lock relaxation (PN33 companion for runtime decode) | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN34 workspace lock runtime relaxation — opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN33 — Spec-decode warmup K-aware sizing (vllm#37521 extended to MTP/ngram) | config_detect: neutral:
[INFO:genesis.wiring.text_patch] [PN33 v1/worker/gpu_model_runner.py — spec-decode warmup K-aware sizing (vllm#37521 extended to MTP/ngram)] applied 1 sub-patches: pN33_warmup_k_draft_tokens
[INFO:genesis.apply_all] [Genesis] applied: PN33 spec-decode warmup K-aware (vllm#37521 extended) — PN33 applied: spec-decode warmup uses real num_speculative_tokens instead of dummy K=1. Closes (a) ampersandru mid-stream OOM via propose_draft_token_ids and (b) noonghunna workspace-lock AssertionError on TQ + MTP K=3 single-card. Disable via GENESIS_DISABLE_PN33_SPEC_DECODE_WARMUP_K=1 if warmup OOMs.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN32 — GDN _forward_core chunked-prefill v2 (Cliff 2 fix for single-24GB-GPU OOM) | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN32 GDN chunked-prefill (Cliff 2 fix) — opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN31 — FA varlen persistent out buffer (issue #15, sister to P38) | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_OUT=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN31 FA varlen persistent out (issue #15) — opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_OUT=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN30 — DS conv state layout + spec-decode AL>1 fix (issue #17) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [PN30 model_executor/layers/mamba/mamba_utils.py — DS layout spec-decode AL>1 fix (issue #17)] applied 2 sub-patches: pN30_get_conv_copy_spec_contiguous, pN30_module_level_state
[INFO:genesis.wiring.text_patch] [PN30 v1/worker/mamba_utils.py — do_mamba_copy_block stream sync + temp tensor cleanup (issue #17)] applied 1 sub-patches: pN30_do_mamba_copy_block_cleanup
[INFO:genesis.wiring.text_patch] [PN30 v1/worker/mamba_utils.py — collect_mamba_copy_meta dst-shaped DS temp (issue #17, v7.68)] applied 1 sub-patches: pN30_collect_mamba_copy_meta_dst_shaped_temp
[INFO:genesis.apply_all] [Genesis] applied: PN30 DS conv state + spec-decode AL>1 (issue #17) — PN30 v7.68 applied: DS conv state layout + spec-decode AL>1 now uses collect_mamba_copy_meta dst-shaped temp blocks for DS conv offset>0, preserving destination row stride. get_conv_copy_spec fails closed if the collect-time bypass is missed; do_mamba_copy_block keeps PN30's stream sync + temp clear lifecycle. Supersedes v7.65 compact .contiguous() approach which silently corrupted DS row strides (diagnosed by noonghunna/ChatGPT-Codex 2026-05-02).
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN29 — GDN chunk_o scale-fold (vllm#41446 pattern (c)) | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN29 GDN chunk_o scale-fold (vllm#41446 pattern c) — opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN12 — FFN intermediate scratch pool — Cliff 1 fix on TQ3 path | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [PN12 model_executor/layers/activation.py — SiluAndMul forward_cuda FFN intermediate pool (Cliff 1 fix on TQ3)] applied 1 sub-patches: pN12_silu_and_mul_pool
[INFO:genesis.apply_all] [Genesis] applied: PN12 FFN intermediate scratch pool (Cliff 1 fix) — PN12 applied: SiluAndMul.forward_cuda now acquires output via FFNIntermediateCache pool when GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL=1. Closes Cliff 1 OOM on TQ3 path that PN8 couldn't address (different memory class).
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN28 — merge_attn_states NaN guard (vllm#39148 backport) | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN28 merge_attn_states NaN guard (vllm#39148 backport) — opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P15B — FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P15B turboquant_attn.py — _flash_attn_varlen max_seqlen_k clamp (Issue #15 fix)] applied 1 sub-patches: p15b_fa_varlen_clamp
[INFO:genesis.apply_all] [Genesis] applied: P15B FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix) — P15B applied: _flash_attn_varlen now clamps max_seqlen_k to actual cu_seqlens_k span. Prevents 50 MiB FA wrapper workspace OOM on long-context continuation-prefill (Issue #15). Adds one GPU->CPU sync per call on infrequent path.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P38B — P38 compile-safe in-source hook (Issue #14 fix) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P38b turboquant_attn.py — _continuation_prefill compile-safe hook (Issue #14 fix)] applied 1 sub-patches: p38b_continuation_prefill_hook
[INFO:genesis.wiring.p38b_compile_safe_hook] [Genesis P38b] installed _genesis_p38_dispatch on TurboQuantAttentionImpl — compile-safe path active
[INFO:genesis.apply_all] [Genesis] applied: P38B P38 compile-safe in-source hook (Issue #14 fix) — P38b applied: text-patch applied (P38b in-source hook injected); dispatcher installed on TurboQuantAttentionImpl. _continuation_prefill now survives aot_compile_fullgraph capture. Fixes Issue #14 (noonghunna).
[INFO:genesis.kernels.tq_decode_sparse_v] [Genesis PN26 sparse_v] kernel built and cached
[INFO:genesis.wiring.pn26_sparse_v_kernel] [Genesis PN26 sparse_v] wrapped vllm.v1.attention.ops.triton_turboquant_decode.triton_turboquant_decode_attention with sparse-V dispatcher (also rebound on consumers: ['vllm.v1.attention.backends.turboquant_attn'])
[INFO:genesis.apply_all] [Genesis] applied: PN26b sparse-V tile-skip Genesis kernel (BLASST lambda=a/L for SM86) — PN26 sparse-V kernel dispatcher installed. Routes to Genesis fork when seq_len >= min_ctx (default 8192). Threshold: BLASST λ=a/L if GENESIS_PN26_SPARSE_V_SCALE_FACTOR>0, else fixed GENESIS_PN26_SPARSE_V_THRESHOLD (default 0.001). First sparse-V kernel deployed for SM86 (Ampere consumer) — no upstream NVIDIA reference exists yet.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN27 — Revert MoERunnerInterface PluggableLayer (vllm#41440) | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN27 revert MoERunnerInterface PluggableLayer (vllm#41440 backport) — opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN26 — TQ unified perf pack (centroids prebake + sparse V scaffold) | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN26 TQ unified perf pack (centroids prebake + sparse V scaffold) — opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN25 — SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile path) | opt-in env (config: neutral)
[INFO:genesis.silu_and_mul_customop] [PN25] registered torch op genesis::silu_and_mul_pooled via direct_register_custom_op (vLLM canonical, v7.66 — fork-safe + Inductor-opaque)
[INFO:genesis.wiring.text_patch] [PN25 model_executor/layers/activation.py — SiluAndMul forward_native opaque-op pool (Inductor-safe Cliff 1 mech B)] applied 2 sub-patches: pN25_silu_and_mul_import_time_register, pN25_silu_and_mul_forward_native_opaque
[INFO:genesis.apply_all] [Genesis] applied: PN25 SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile-path) — PN25 applied: SiluAndMul.forward_native now dispatches through genesis::silu_and_mul_pooled opaque custom op so torch.compile/Inductor cannot inline the FFN intermediate alloc. Closes Cliff 1 mech B on inductor-compiled FFN forward (club-3090#16 reproducer class). Sister to PN12 — PN12 covers eager forward_cuda, PN25 covers compile forward_native; both share the same FFNIntermediateCache pool and can be enabled simultaneously without conflict.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN13 — CUDAGraphWrapper gc.collect/empty_cache lambda arity (vllm#41235) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [PN13 compilation/cuda_graph.py — CUDAGraphWrapper gc.collect/empty_cache lambda arity fix (vllm#41235)] applied 1 sub-patches: pN13_lambda_arity
[INFO:genesis.apply_all] [Genesis] applied: PN13 CUDAGraphWrapper lambda arity (vllm#41235 backport) — PN13 applied: CUDAGraphWrapper.__call__ now accepts var-args on patched gc.collect / empty_cache lambdas (vllm#41235). Defensive fix — prevents TypeError worker-death class when dynamo recompiles nested @torch.compile callables inside cudagraph capture region.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN14 — TQ decode IOOB safe_page_idx clamp (vllm#40074) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [PN14 triton_turboquant_decode.py — TQ decode IOOB safe_page_idx clamp (vllm#40074)] applied 1 sub-patches: pN14_safe_page_idx
[INFO:genesis.apply_all] [Genesis] applied: PN14 TQ decode IOOB safe_page_idx clamp (vllm#40074 backport) — PN14 applied: _tq_decode_stage1 now clamps masked-out lanes to page_idx=0 before block-table pointer arithmetic (vllm#40074). Defensive fix — prevents Triton bounds-checker assertion class originally reported on 4090 with >32k sequences in upstream #39998.
[INFO:genesis.wiring.text_patch] [PN19 scoped max_split_size_mb] applied 2 sub-patches: pn19_helper_method, pn19_load_model_wrap
[INFO:genesis.apply_all] [Genesis] applied: PN19 Scoped max_split_size_mb during model load (vllm#41268) — PN19 applied: max_split_size_mb=20 scoped to model load (vllm#41268 backport, MatthewBonanni). Frees 200-500 MiB load-time fragmentation.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN23 — DFlash combine_hidden_states dtype cast (vllm#40334) | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN23 DFlash combine_hidden_states dtype cast (vllm#40334 backport) — opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN21 — DFlash SWA support partial backport (vllm#40898) | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN21 DFlash SWA support partial backport (vllm#40898 backport) — opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN22 — Local argmax for TP draft (vllm#39419 backport) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [PN22 qwen3.py — get_top_tokens (vllm#39419)] applied 1 sub-patches: pn22_qwen3_get_top_tokens
[INFO:genesis.wiring.text_patch] [PN22 qwen3_dflash.py — get_top_tokens (vllm#39419)] applied 1 sub-patches: pn22_dflash_get_top_tokens
[INFO:genesis.apply_all] [Genesis] applied: PN22 Local argmax for TP draft (vllm#39419 backport) — PN22 applied: get_top_tokens() added to qwen3.py + qwen3_dflash.py (vllm#39419 backport). Enables vocab-parallel argmax in spec-decode draft path; +9-30% TPS on TP>=2 per PR author.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN24 — DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN24 DFlash aux layer +1 indexing fix (vllm#40727 backport) — opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 to engage
[INFO:genesis.wiring.text_patch] [PN17 FA2 softmax_lse runtime clamp] applied 1 sub-patches: pn17_clamp
[INFO:genesis.apply_all] [Genesis] applied: PN17 FA2 softmax_lse runtime clamp (Issue #11 Cliff 1 mechanism A) — PN17 applied: FA2 softmax_lse buffer now clamped to actual seqused_k at runtime, freeing 50-100 MiB on long-ctx (Cliff 1 mechanism A fix per noonghunna Genesis Issue #11).
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN16 — Lazy-reasoner request hook (per-request enable_thinking) | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN16 Lazy-reasoner request hook (per-request enable_thinking) — opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P86 — ngram batch_propose O(N*K) → O(N+K) direct-fill (vllm#40876) | opt-in only — set GENESIS_ENABLE_P86=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P86 ngram batch_propose O(N+K) direct-fill (vllm#40876 backport) — opt-in only — set GENESIS_ENABLE_P86=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P85 — Hybrid fine-shadow prefix cache (vllm#38182 followup, MambaManager fix) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P85 v1/core/single_type_kv_cache_manager.py — hybrid fine-shadow prefix cache (vllm#38182 followup)] applied 2 sub-patches: p85_mamba_cache_blocks_shadow, p85_mamba_find_longest_cache_hit_fine
[INFO:genesis.apply_all] [Genesis] applied: P85 Hybrid fine-shadow prefix cache (MambaManager fix for vllm#38182 followup) — P85 applied: hybrid fine-shadow prefix cache installed at MambaManager. cache_blocks now registers fine-grained shadow entries; find_longest_cache_hit prefers fine-scan with eviction-safety verify. Requires GENESIS_ENABLE_P84=1 + GENESIS_P84_HASH_BLOCK_SIZE=<N> for fine hashes to be computed in the first place.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P75 — Auto-enable Suffix Decoding (Arctic Inference, vllm#25784) | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P75 Auto-enable Suffix Decoding (vllm#25784 Arctic Inference) — opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P74 — Auto chunk-clamp via long_prefill_token_threshold (P72 companion) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P74 config/scheduler.py — auto chunk-clamp via long_prefill_token_threshold] applied 1 sub-patches: p74_auto_clamp
[INFO:genesis.apply_all] [Genesis] applied: P74 Auto chunk-clamp via long_prefill_token_threshold (P72 companion) — P74 applied: SchedulerConfig auto-clamps long_prefill_token_threshold to GENESIS_PREALLOC_TOKEN_BUDGET when batched_tokens > budget. Prevents prealloc buffer overflow when running with batched=8192.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P72 — profile_run M cap (unblocks --max-num-batched-tokens>4096 on MoE) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P72 v1/worker/gpu_model_runner.py — profile_run M cap] applied 1 sub-patches: p72_profile_run_cap
[INFO:genesis.apply_all] [Genesis] applied: P72 profile_run M cap (unblocks --max-num-batched-tokens>4096 on MoE) — P72 applied: profile_run M capped to GENESIS_PROFILE_RUN_CAP_M (default 4096). Unblocks --max-num-batched-tokens > 4096 by avoiding Dynamo fake-tensor shape mismatch in moe_align_block_size symbolic-shape inference.
[INFO:genesis.wiring.p67b_spec_verify_routing] [Genesis P67b B2] baked env at module load: USE_UPSTREAM=True NUM_KV_SPLITS=(default = self.max_num_kv_splits) (no per-dispatch env reads)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P67b — TurboQuant spec-verify forward() routing (FULL CG enable) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P67b turboquant_attn.py forward() spec-verify routing] applied 1 sub-patches: p67b_forward_spec_verify_branch
[INFO:genesis.apply_all] [Genesis] applied: P67b TurboQuant spec-verify forward() routing — P67b forward() spec-verify routing injected. K+1 batches now bypass _prefill_attention entirely (cudagraph-safe).
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P59 — Qwen3 reasoning embedded tool_call recovery | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P59 Qwen3 reasoning embedded tool_call recovery — opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P58 — Async-scheduler -1 placeholder fix | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P58 request.py — placeholder counter field] applied 2 sub-patches: p58_request_field, p58_request_num_tokens_with_spec
[INFO:genesis.wiring.text_patch] [P58 async_scheduler.py — counter-based placeholder intent] applied 1 sub-patches: p58_async_sched_assignment
[INFO:genesis.wiring.text_patch] [P58 scheduler.py — placeholder gating + new method] applied 5 sub-patches: p58_sched_spec_block, p58_sched_new_method, p58_sched_preempt, p58_sched_draft_site_a, p58_sched_draft_site_b
[INFO:genesis.apply_all] [Genesis] applied: P58 async-scheduler -1 placeholder fix — P58 backport applied: 3 files modified, 0 already-applied. Async scheduler -1 placeholder leakage fixed; spec-decode + cudagraph workloads should no longer loop or IMA. Validate with our regression smoke suite before serving traffic.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P57 — TQ spec-decode capture-safe buffers (deprecated — research artifact) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P57_SPEC_DECODE_CAPTURE_SAFE=1 only for diagno
[INFO:genesis.apply_all] [Genesis] skipped: P57 TQ spec-decode capture-safe buffers — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P57_SPEC_DECODE_CAPTURE_SAFE=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P56 — TQ spec-decode safe-path guard (deprecated — superseded by P65) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P56_SPEC_DECODE_GUARD=1 only for diagnostics
[INFO:genesis.apply_all] [Genesis] skipped: P56 TQ spec-decode safe-path guard — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P56_SPEC_DECODE_GUARD=1 only for diagnostics
[INFO:genesis.wiring.text_patch] [P44 TQ mixed-batch attn_out pool] applied 1 sub-patches: p44_mixed_attn_out_alloc
[INFO:genesis.apply_all] [Genesis] applied: P44 TQ mixed-batch attn_out pool — text-patch applied — mixed-batch attn_out routed through TurboQuantBufferManager pool (~80 MB zero-init eliminated per mixed-batch forward in multi-user serving)
[INFO:genesis.wiring.text_patch] [P46 GDN gating buffer pool] applied 2 sub-patches: p46_g_buffer, p46_beta_buffer
[INFO:genesis.apply_all] [Genesis] applied: P46 GDN gating buffer pool — text-patch applied — fused_gdn_gating now uses GdnGatingBufferManager pool (eliminates ~24k allocs/sec on Qwen3.6-35B-A3B decode)
[INFO:genesis.apply_all] [Genesis] skipped: P7b GDN dual-stream via torch.library.custom_op (opt-in) — opt-in: set GENESIS_ENABLE_P7B=1 to enable GDN dual-stream via torch.library.custom_op (graph-safe alternative to P7; validates numeric equiv + compile-cache freshness before prod)
[INFO:genesis.apply_all] [Genesis] skipped: P40 TurboQuant GQA-grouped decode stage1 (opt-in) — opt-in: set GENESIS_ENABLE_P40=1 to enable GQA-grouped decode stage1 (port of upstream PR #40792)
[INFO:genesis.wiring.p39a_fla_kkt] [Genesis P39b] vllm_config fetch failed (Current vLLM config is not set. This typically means get_current_vllm_config() was called outside of a set_current_vllm_config() context, or a CustomOp was instantiated at module import time or model forward time when config is not set. For tests that directly test custom ops/modules, use the 'default_vllm_config' pytest fixture from tests/conftest.py.); defaults used
[INFO:genesis.wiring.p39a_fla_kkt] [Genesis P39a] rebound vllm.model_executor.layers.fla.ops.chunk_scaled_dot_kkt.chunk_scaled_dot_kkt_fwd (+1 caller mods: ['vllm.model_executor.layers.fla.ops.chunk'])
[INFO:genesis.apply_all] [Genesis] applied: P39a FLA chunk_scaled_dot_kkt persistent A pool — module-level fn replaced (1 caller module(s) also rebound — pool shared across GDN layers)
[INFO:genesis.wiring.p38_tq_continuation_memory] [Genesis P38] rebound TurboQuantAttentionImpl._continuation_prefill (persistent K_full/V_full buffers replace torch.cat peak)
[INFO:genesis.apply_all] [Genesis] applied: P38 TQ _continuation_prefill persistent workspace — class method replaced (persistent K_full/V_full workspace, no .contiguous()/torch.cat transient peaks)
[INFO:genesis.apply_all] [Genesis] skipped: P37 MoE intermediate cache pool (opt-in) — opt-in only — set GENESIS_ENABLE_P37=1 to engage. Manager API is registered and usable independently of this text-patch.
[INFO:genesis.apply_all] [Genesis] skipped: P36 TurboQuant shared decode buffers — redundant: upstream PR #40798 (workspace manager) active — skip
[INFO:genesis.apply_all] [Genesis] applied: P32/P33 TurboQuant cu_2 + synth_seq_lens preallocs — cu_2 + synth_seq_lens preallocs registered (invoked from ensure_turboquant_buffers, fires during profile_run)
[INFO:genesis.prealloc_budget] [Genesis P73] token budget resolved → 4128 (via GENESIS_PREALLOC_TOKEN_BUDGET)
[INFO:genesis.wiring.text_patch] [P28 GDN core_attn_out prealloc] applied 1 sub-patches: p28_core_attn_out_alloc
[INFO:genesis.wiring.p28_gdn_core_attn] [Genesis P28] wrapped GatedDeltaNetAttention.__init__ to attach _genesis_gdn_core_attn_buf on each instance
[INFO:genesis.apply_all] [Genesis] applied: P28 GDN core_attn_out prealloc — forward_cuda patched + __init__ wrapped
[INFO:genesis.gdn_dual_stream] [GDN dual-stream] CUDA aux stream initialized
[INFO:genesis.apply_all] [Genesis P7] dispatcher ready (parallel path)
[INFO:genesis.apply_all] [Genesis] skipped: P7 GDN dual-stream in_proj parallelism — deferred — incompatible with torch.compile fullgraph (CUDA streams not SymPy-graphable); custom op implementation required. Re-enable with GENESIS_ENABLE_P7=1 + --enforce-eager.
[INFO:genesis.apply_all] [Genesis P17/P18] Marlin tuning ready: SM=(8, 6) bsm=8 num_warps=4 num_stages=3
[INFO:genesis.apply_all] [Genesis] applied: P17/P18 Marlin MoE per-SM tuning — SM=(8, 6) bsm=8
[INFO:genesis.wiring.text_patch] [P24 fused_moe num_warps/num_stages overlay] applied 2 sub-patches: p24_fp8_cfg_overlay, p24_general_cfg_overlay
[INFO:genesis.apply_all] [Genesis] applied: P24 fused_moe num_warps/num_stages overlay — num_warps / num_stages overlay wired into get_default_config (active only on Triton fused_moe path; Marlin unaffected)
[INFO:genesis.wiring.p14_block_table] [Genesis P14] rebound BlockTable.append_row + move_row (tail-zero-fill active for concurrent-request safety)
[INFO:genesis.apply_all] [Genesis] applied: P14 block_table tail zero-fill — BlockTable.append_row + move_row wrapped (effective in this process)
[INFO:genesis.tq_decode_tune] [Genesis P18b] TQ decode stage1 using upstream defaults (BLOCK_KV=4 num_warps=1 num_stages=1)
[INFO:genesis.apply_all] [Genesis] applied: P18b TurboQuant decode stage1 tune — no env override — using upstream defaults (4/1/1)
[INFO:genesis.apply_all] [Genesis P20] TQ _continuation_prefill FP16 helpers ready for TurboQuantAttentionImpl hook
[INFO:genesis.apply_all] [Genesis] applied: P20 TurboQuant continuation-prefill FP16 rotate — fp16-rotation helper ready for _continuation_prefill hook
[INFO:genesis.fp8_dispatcher] [Genesis FP8 dispatcher] SM (8, 6) → Marlin fallback required (Triton block FP8 unsupported on Ampere)
[INFO:genesis.apply_all] [Genesis] applied: P1/P2 FP8 kernel dispatcher — SM=(8, 6) → Marlin fallback path selected
[INFO:genesis.apply_all] Genesis Results: 60 applied, 38 skipped, 1 failed, 1 ⚠️ partial-apply warning(s)
[WARNING:genesis.apply_all] [Genesis] 1 partial-apply warning(s) — patch(es) failed to match expected source pattern. Review below to confirm anchor drift vs upstream change vs config issue:
[WARNING:genesis.apply_all] [Genesis] ⚠️  PN9 independent drafter attention backend (vllm#39930 backport) — upstream drift: 'spec_cfg.attention_backend' present in llm_base_proposer.py — PR #39930 (or equivalent) appears merged; PN9 self-retires (use --speculative-config.attention_backend instead)
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] apply matrix:
Patch | Status | Title                                         | Reason                                                       | Credit                        
------+--------+-----------------------------------------------+--------------------------------------------------------------+-------------------------------
P61b  | APPLY  | Qwen3 streaming partial-tag overlap guard     | opt-in env (config: neutral)                                 | ExtReMLapin (vllm#40783)      
P62   | APPLY  | Structured-output spec-decode reasoning-end t | opt-in env (config: neutral)                                 | sfbemerk (vllm#36138), ciciror
P61   | APPLY  | Qwen3 multi-tool first-occurrence (DEPRECATED | opt-in env (config: neutral)                                 | ExtReMLapin (vllm#40783) — P61
P60b  | APPLY  | GDN+ngram Triton kernel offset (Phase 2)      | opt-in env (config: neutral)                                 | tdoublep (vllm#40738)         
P60   | APPLY  | GDN+ngram state recovery (Phase 1: SSM pre-co | opt-in env (config: neutral)                                 | tdoublep (vllm#40738), bhaktat
P63   | SKIP   | MTP/Eagle drafter GDN state recovery (depreca | opt-in only AND empirically deprecated — keeping skip; set G | Genesis-original (hypothesis d
P64   | APPLY  | qwen3coder MTP streaming early-return fix     | opt-in env (config: neutral)                                 | kotori-yan (vllm#39598)       
P65   | APPLY  | TurboQuant spec-decode cudagraph downgrade    | opt-in env (config: neutral)                                 | Genesis-original (root cause f
P66   | APPLY  | cudagraph_capture_sizes spec-decode divisibil | opt-in env (config: neutral)                                 | Genesis-original (mirrors fhl2
P68   | APPLY  | Auto force tool_choice=required for long-cont | opt-in env (config: neutral)                                 | Genesis-original (long-ctx too
P69   | APPLY  | Long-context tool-format reminder injection   | opt-in env (config: neutral)                                 | Genesis-original (long-ctx too
P70   | SKIP   | Auto-strict-ngram (force prompt_lookup_min>=8 | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to  | Genesis-original (vllm#40875 e
P67   | APPLY  | TurboQuant multi-query kernel for spec-decode | opt-in env (config: neutral)                                 | Genesis-original (proper fix f
P71   | SKIP   | Block-verify rejection sampler (Sun 2024 ICLR | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engag | Backport of vllm#40819 (Z. Gol
P78   | SKIP   | TurboQuant .tolist() capture-guard (adapted f | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1  | Adapted from noonghunna's patc
P77   | SKIP   | Adaptive ngram K controller (EMA + hysteresis | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to e | Genesis-original (port of SGLa
P79b  | SKIP   | Async × spec-decode proposer-sync backport (v | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1  | Backport of vllm#40610 (OPEN d
P79c  | SKIP   | Stale spec_token_ids cleanup for unscheduled  | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEAN | Backport of vllm#37629 (OPEN, 
P81   | SKIP   | fp8 block-scaled MM low-M decode tuning (vllm | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8 | Backport of vllm#40925 (tonyli
P82   | SKIP   | SGLang threshold_single OR-clause acceptance  | opt-in only — set GENESIS_ENABLE_P82=1 to engage             | SGLang team (sgl-project/sglan
P83   | APPLY  | MTP keep-last-cached-block (vllm#38182 downst | opt-in env (config: neutral)                                 | Root-cause analysis: vllm#3818
P84   | SKIP   | hash_block_size override (vllm#38182 actual r | opt-in only — set GENESIS_ENABLE_P84=1 to engage             | Genesis-original discovery 202
P100  | APPLY  | FlashInfer FULL CUDA graph for spec-decode (v | opt-in env (config: neutral)                                 | Backport of vllm#41127 (open 2
P101  | APPLY  | TQ continuation 64-token slicing (vllm#41123  | opt-in env (config: neutral)                                 | Selective backport of vllm#411
P99   | APPLY  | WorkspaceManager.get_simultaneous memoization | opt-in env (config: neutral)                                 | Per Sander 2026-04-28: 'if rev
P98   | APPLY  | TQ WorkspaceManager revert (vllm#40941 perf h | opt-in env (config: neutral)                                 | Reverts upstream PR #40941 (ME
P94   | APPLY  | Spec-decode prepare_next_token_ids_padded zer | opt-in env (config: neutral)                                 | Backport of vllm#41043 (wanglu
P95   | SKIP   | Marlin TP cudagraph cap on Ampere (vllm#40385 | opt-in only — set GENESIS_ENABLE_P95=1 to engage             | Backport of vllm#40385 (OPEN a
P91   | APPLY  | AutoRound row-parallel group cdiv + start-idx | opt-in env (config: neutral)                                 | Backport of non-MoE-specific p
P87   | APPLY  | Marlin W4A16/W8A16 sub-tile output dim pad-on | opt-in env (config: neutral)                                 | Backport of vllm#40361 (OPEN).
PN8   | APPLY  | MTP/draft online-quant propagation (vllm#4084 | opt-in env (config: neutral)                                 | Backport of vllm#40849 (bhoomi
PN9   | APPLY  | Independent drafter attention backend (vllm#3 | opt-in env (config: neutral)                                 | Backport of vllm#39930 (Matthe
PN11  | APPLY  | GDN a/b contiguity in fix_query_key_value_ord | opt-in env (config: neutral)                                 | Backport of vllm#41142 (Yeuvoi
PN34  | SKIP   | WorkspaceManager runtime lock relaxation (PN3 | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 | Companion to PN33 — same root 
PN33  | APPLY  | Spec-decode warmup K-aware sizing (vllm#37521 | config_detect: neutral:                                      | Backport of vllm-project/vllm#
PN32  | SKIP   | GDN _forward_core chunked-prefill v2 (Cliff 2 | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1  | Genesis-original v7.69 v2 (202
PN31  | SKIP   | FA varlen persistent out buffer (issue #15, s | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_O | Genesis-original sister patch 
PN30  | APPLY  | DS conv state layout + spec-decode AL>1 fix ( | opt-in env (config: neutral)                                 | Genesis-original fix for issue
PN29  | SKIP   | GDN chunk_o scale-fold (vllm#41446 pattern (c | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to en | Backport of vllm#41446 (zobinH
PN12  | APPLY  | FFN intermediate scratch pool — Cliff 1 fix o | opt-in env (config: neutral)                                 | Genesis-original 2026-04-29 — 
PN28  | SKIP   | merge_attn_states NaN guard (vllm#39148 backp | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 | Backport of vllm#39148 (jasonk
P15B  | APPLY  | FA varlen max_seqlen_k clamp on TQ path (Issu | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 fi
P38B  | APPLY  | P38 compile-safe in-source hook (Issue #14 fi | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 fi
PN27  | SKIP   | Revert MoERunnerInterface PluggableLayer (vll | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 | Backport of vllm#41440 (auto-g
PN26  | SKIP   | TQ unified perf pack (centroids prebake + spa | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage | Genesis-original 2026-05-01 un
PN25  | APPLY  | SiluAndMul.forward_native opaque-op pool (Cli | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 in
PN13  | APPLY  | CUDAGraphWrapper gc.collect/empty_cache lambd | opt-in env (config: neutral)                                 | Backport of vllm#41235 (roikor
PN14  | APPLY  | TQ decode IOOB safe_page_idx clamp (vllm#4007 | opt-in env (config: neutral)                                 | Backport of vllm#40074 (devara
PN23  | SKIP   | DFlash combine_hidden_states dtype cast (vllm | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to  | Backport of vllm#40334 (cipher
PN21  | SKIP   | DFlash SWA support partial backport (vllm#408 | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage | Partial backport of vllm#40898
PN22  | APPLY  | Local argmax for TP draft (vllm#39419 backpor | opt-in env (config: neutral)                                 | Backport of vllm#39419 (EanWan
PN24  | SKIP   | DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 | Backport of vllm#40727 (benchi
PN16  | SKIP   | Lazy-reasoner request hook (per-request enabl | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to eng | Genesis-original 2026-04-29. H
P86   | SKIP   | ngram batch_propose O(N*K) → O(N+K) direct-fi | opt-in only — set GENESIS_ENABLE_P86=1 to engage             | Backport of vllm#40876 (aarona
P85   | APPLY  | Hybrid fine-shadow prefix cache (vllm#38182 f | opt-in env (config: neutral)                                 | Genesis-original 2026-04-27 — 
P75   | SKIP   | Auto-enable Suffix Decoding (Arctic Inference | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to en | Backport-enabler of vllm#25784
P74   | APPLY  | Auto chunk-clamp via long_prefill_token_thres | opt-in env (config: neutral)                                 | Genesis-original (zero-VRAM-co
P72   | APPLY  | profile_run M cap (unblocks --max-num-batched | opt-in env (config: neutral)                                 | Genesis-original (Dynamo fake-
P67b  | APPLY  | TurboQuant spec-verify forward() routing (FUL | opt-in env (config: neutral)                                 | Genesis-original (FULL CG enab
P59   | SKIP   | Qwen3 reasoning embedded tool_call recovery   | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 t | ZenoAFfectionate (vllm#39055) 
P58   | APPLY  | Async-scheduler -1 placeholder fix            | opt-in env (config: neutral)                                 | z1ying (vllm#40768)           
P57   | SKIP   | TQ spec-decode capture-safe buffers (deprecat | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40831), gdn_attn.
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated —  | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40807, #40831)   
[ERROR:genesis.dispatcher] [Genesis Dispatcher v2] validator ERROR: P85 — missing required dependency: P85 requires 'P84' to also be APPLY (currently SKIP)
[ERROR:genesis.dispatcher] [Genesis Dispatcher v2] validator ERROR: P67 — conflict: P65 and P67 are both APPLY but declared mutually exclusive — pick one
[ERROR:genesis.dispatcher] [Genesis Dispatcher v2] validator ERROR: P67b — conflict: P65 and P67b are both APPLY but declared mutually exclusive — pick one
[INFO:genesis.apply_all] [Genesis compile-watchdog] apply_all elapsed: 8.1s
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] apply matrix:
Patch | Status | Title                                         | Reason                                                       | Credit                        
------+--------+-----------------------------------------------+--------------------------------------------------------------+-------------------------------
P61b  | APPLY  | Qwen3 streaming partial-tag overlap guard     | opt-in env (config: neutral)                                 | ExtReMLapin (vllm#40783)      
P62   | APPLY  | Structured-output spec-decode reasoning-end t | opt-in env (config: neutral)                                 | sfbemerk (vllm#36138), ciciror
P61   | APPLY  | Qwen3 multi-tool first-occurrence (DEPRECATED | opt-in env (config: neutral)                                 | ExtReMLapin (vllm#40783) — P61
P60b  | APPLY  | GDN+ngram Triton kernel offset (Phase 2)      | opt-in env (config: neutral)                                 | tdoublep (vllm#40738)         
P60   | APPLY  | GDN+ngram state recovery (Phase 1: SSM pre-co | opt-in env (config: neutral)                                 | tdoublep (vllm#40738), bhaktat
P63   | SKIP   | MTP/Eagle drafter GDN state recovery (depreca | opt-in only AND empirically deprecated — keeping skip; set G | Genesis-original (hypothesis d
P64   | APPLY  | qwen3coder MTP streaming early-return fix     | opt-in env (config: neutral)                                 | kotori-yan (vllm#39598)       
P65   | APPLY  | TurboQuant spec-decode cudagraph downgrade    | opt-in env (config: neutral)                                 | Genesis-original (root cause f
P66   | APPLY  | cudagraph_capture_sizes spec-decode divisibil | opt-in env (config: neutral)                                 | Genesis-original (mirrors fhl2
P68   | APPLY  | Auto force tool_choice=required for long-cont | opt-in env (config: neutral)                                 | Genesis-original (long-ctx too
P69   | APPLY  | Long-context tool-format reminder injection   | opt-in env (config: neutral)                                 | Genesis-original (long-ctx too
P70   | SKIP   | Auto-strict-ngram (force prompt_lookup_min>=8 | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to  | Genesis-original (vllm#40875 e
P67   | APPLY  | TurboQuant multi-query kernel for spec-decode | opt-in env (config: neutral)                                 | Genesis-original (proper fix f
P71   | SKIP   | Block-verify rejection sampler (Sun 2024 ICLR | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engag | Backport of vllm#40819 (Z. Gol
P78   | SKIP   | TurboQuant .tolist() capture-guard (adapted f | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1  | Adapted from noonghunna's patc
P77   | SKIP   | Adaptive ngram K controller (EMA + hysteresis | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to e | Genesis-original (port of SGLa
P79b  | SKIP   | Async × spec-decode proposer-sync backport (v | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1  | Backport of vllm#40610 (OPEN d
P79c  | SKIP   | Stale spec_token_ids cleanup for unscheduled  | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEAN | Backport of vllm#37629 (OPEN, 
P81   | SKIP   | fp8 block-scaled MM low-M decode tuning (vllm | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8 | Backport of vllm#40925 (tonyli
P82   | SKIP   | SGLang threshold_single OR-clause acceptance  | opt-in only — set GENESIS_ENABLE_P82=1 to engage             | SGLang team (sgl-project/sglan
P83   | APPLY  | MTP keep-last-cached-block (vllm#38182 downst | opt-in env (config: neutral)                                 | Root-cause analysis: vllm#3818
P84   | SKIP   | hash_block_size override (vllm#38182 actual r | opt-in only — set GENESIS_ENABLE_P84=1 to engage             | Genesis-original discovery 202
P100  | APPLY  | FlashInfer FULL CUDA graph for spec-decode (v | opt-in env (config: neutral)                                 | Backport of vllm#41127 (open 2
P101  | APPLY  | TQ continuation 64-token slicing (vllm#41123  | opt-in env (config: neutral)                                 | Selective backport of vllm#411
P99   | APPLY  | WorkspaceManager.get_simultaneous memoization | opt-in env (config: neutral)                                 | Per Sander 2026-04-28: 'if rev
P98   | APPLY  | TQ WorkspaceManager revert (vllm#40941 perf h | opt-in env (config: neutral)                                 | Reverts upstream PR #40941 (ME
P94   | APPLY  | Spec-decode prepare_next_token_ids_padded zer | opt-in env (config: neutral)                                 | Backport of vllm#41043 (wanglu
P95   | SKIP   | Marlin TP cudagraph cap on Ampere (vllm#40385 | opt-in only — set GENESIS_ENABLE_P95=1 to engage             | Backport of vllm#40385 (OPEN a
P91   | APPLY  | AutoRound row-parallel group cdiv + start-idx | opt-in env (config: neutral)                                 | Backport of non-MoE-specific p
P87   | APPLY  | Marlin W4A16/W8A16 sub-tile output dim pad-on | opt-in env (config: neutral)                                 | Backport of vllm#40361 (OPEN).
PN8   | APPLY  | MTP/draft online-quant propagation (vllm#4084 | opt-in env (config: neutral)                                 | Backport of vllm#40849 (bhoomi
PN9   | APPLY  | Independent drafter attention backend (vllm#3 | opt-in env (config: neutral)                                 | Backport of vllm#39930 (Matthe
PN11  | APPLY  | GDN a/b contiguity in fix_query_key_value_ord | opt-in env (config: neutral)                                 | Backport of vllm#41142 (Yeuvoi
PN34  | SKIP   | WorkspaceManager runtime lock relaxation (PN3 | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 | Companion to PN33 — same root 
PN33  | APPLY  | Spec-decode warmup K-aware sizing (vllm#37521 | config_detect: neutral:                                      | Backport of vllm-project/vllm#
PN32  | SKIP   | GDN _forward_core chunked-prefill v2 (Cliff 2 | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1  | Genesis-original v7.69 v2 (202
PN31  | SKIP   | FA varlen persistent out buffer (issue #15, s | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_O | Genesis-original sister patch 
PN30  | APPLY  | DS conv state layout + spec-decode AL>1 fix ( | opt-in env (config: neutral)                                 | Genesis-original fix for issue
PN29  | SKIP   | GDN chunk_o scale-fold (vllm#41446 pattern (c | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to en | Backport of vllm#41446 (zobinH
PN12  | APPLY  | FFN intermediate scratch pool — Cliff 1 fix o | opt-in env (config: neutral)                                 | Genesis-original 2026-04-29 — 
PN28  | SKIP   | merge_attn_states NaN guard (vllm#39148 backp | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 | Backport of vllm#39148 (jasonk
P15B  | APPLY  | FA varlen max_seqlen_k clamp on TQ path (Issu | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 fi
P38B  | APPLY  | P38 compile-safe in-source hook (Issue #14 fi | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 fi
PN27  | SKIP   | Revert MoERunnerInterface PluggableLayer (vll | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 | Backport of vllm#41440 (auto-g
PN26  | SKIP   | TQ unified perf pack (centroids prebake + spa | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage | Genesis-original 2026-05-01 un
PN25  | APPLY  | SiluAndMul.forward_native opaque-op pool (Cli | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 in
PN13  | APPLY  | CUDAGraphWrapper gc.collect/empty_cache lambd | opt-in env (config: neutral)                                 | Backport of vllm#41235 (roikor
PN14  | APPLY  | TQ decode IOOB safe_page_idx clamp (vllm#4007 | opt-in env (config: neutral)                                 | Backport of vllm#40074 (devara
PN23  | SKIP   | DFlash combine_hidden_states dtype cast (vllm | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to  | Backport of vllm#40334 (cipher
PN21  | SKIP   | DFlash SWA support partial backport (vllm#408 | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage | Partial backport of vllm#40898
PN22  | APPLY  | Local argmax for TP draft (vllm#39419 backpor | opt-in env (config: neutral)                                 | Backport of vllm#39419 (EanWan
PN24  | SKIP   | DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 | Backport of vllm#40727 (benchi
PN16  | SKIP   | Lazy-reasoner request hook (per-request enabl | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to eng | Genesis-original 2026-04-29. H
P86   | SKIP   | ngram batch_propose O(N*K) → O(N+K) direct-fi | opt-in only — set GENESIS_ENABLE_P86=1 to engage             | Backport of vllm#40876 (aarona
P85   | APPLY  | Hybrid fine-shadow prefix cache (vllm#38182 f | opt-in env (config: neutral)                                 | Genesis-original 2026-04-27 — 
P75   | SKIP   | Auto-enable Suffix Decoding (Arctic Inference | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to en | Backport-enabler of vllm#25784
P74   | APPLY  | Auto chunk-clamp via long_prefill_token_thres | opt-in env (config: neutral)                                 | Genesis-original (zero-VRAM-co
P72   | APPLY  | profile_run M cap (unblocks --max-num-batched | opt-in env (config: neutral)                                 | Genesis-original (Dynamo fake-
P67b  | APPLY  | TurboQuant spec-verify forward() routing (FUL | opt-in env (config: neutral)                                 | Genesis-original (FULL CG enab
P59   | SKIP   | Qwen3 reasoning embedded tool_call recovery   | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 t | ZenoAFfectionate (vllm#39055) 
P58   | APPLY  | Async-scheduler -1 placeholder fix            | opt-in env (config: neutral)                                 | z1ying (vllm#40768)           
P57   | SKIP   | TQ spec-decode capture-safe buffers (deprecat | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40831), gdn_attn.
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated —  | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40807, #40831)   
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
[INFO:genesis.apply_all] Genesis platform: {"vendor": {"is_nvidia_cuda": true, "is_amd_rocm": false, "is_intel_xpu": false, "is_cpu_only": false}, "nvidia": {"compute_capability": [8, 6], "is_ampere_datacenter": false, "is_ampere_consumer": true, "is_ada_lovelace": false, "is_hopper": false, "is_blackwell": false, "has_native_fp8": false}, "amd": {}, "versions": {"torch": [2, 11], "transformers": [5, 6, 2], "vllm": [0, 20, 1], "flash_attn_major": 2}, "paths": {"vllm_install_root": "/usr/local/lib/python3.12/dist-packages/vllm"}}
[INFO:genesis.apply_all] Genesis Unified Patch v7.0 — Ampere FP8 + TQ + MoE + Hybrid + bugfixes. Philosophy: WE FIX, WE DON'T BREAK.
[INFO:genesis.apply_all] [Genesis registry] 100 dispatcher entries — schema-clean, dependency graph consistent.
[INFO:genesis.apply_all] ==============================================================================
[INFO:genesis.apply_all] [Genesis GPU profile] detected: NVIDIA GeForce RTX 3080
[INFO:genesis.apply_all]   canonical: RTX 3080  cc: (8, 6)  SM: 68  L2: 5 MB  BW: 760 GB/s  regime: bandwidth
[INFO:genesis.apply_all] 
[INFO:genesis.apply_all] Per-patch recommendations:
[INFO:genesis.apply_all] ------------------------------------------------------------------------------
[INFO:genesis.apply_all]   [OFF] P40                TQ k8v4 GQA grouping kernel (vllm#40792)
[INFO:genesis.apply_all]           gain: +5-15% (mixed regime), +15-30% (compute regime)
[INFO:genesis.apply_all]           (correctly skipped on this regime)
[INFO:genesis.apply_all]           why: Author measured +27% on H100. Empirically NS on A5000 (p=0.28). Cache-locality benefit needs L2 >= 24 MB.
[INFO:genesis.apply_all]   [ON ] P67                Multi-query verify kernel for spec-decode K+1
[INFO:genesis.apply_all]           gain: +25-35%
[INFO:genesis.apply_all]           why: +32% TPS on 35B-A3B-FP8 spec-decode K=3 verify (Genesis internal benchmark, all GPU classes tested).
[INFO:genesis.apply_all]   [REC] P82                SGLang-style acceptance threshold OR-clause
[INFO:genesis.apply_all]           gain: +8-12%
[INFO:genesis.apply_all]           → recommend: export GENESIS_ENABLE_P82=1
[INFO:genesis.apply_all]           why: Cross-rig confirmed: +12% on A5000 FP8, +10.5% on 3090 INT4.
[INFO:genesis.apply_all]   [ON ] PN8                MTP/draft online-quant propagation (vllm#40849)
[INFO:genesis.apply_all]           gain: 0% TPS, but ~1-2 GiB total VRAM headroom
[INFO:genesis.apply_all]           why: Verified ~1 GiB VRAM saved per GPU on 35B-A3B-FP8 + MTP K=3. Use freed VRAM for higher gpu-mem-util or longer ctx.
[INFO:genesis.apply_all]   [ON ] P83+P84+P85        Prefix-cache cake-and-eat patches (vllm#38182)
[INFO:genesis.apply_all]           gain: (currently negative)
[INFO:genesis.apply_all]           ⚠️  enabled but NOT recommended on this GPU (may regress or no-op)
[INFO:genesis.apply_all]           why: Tested 4-arm A/B 2026-04-29: -29% TPS regression even with full stack including root-cause P84. Possible vllm cache machinery fixed-overhead per-step. WAIT for v0.20.2 pin bump.
[INFO:genesis.apply_all] ==============================================================================
[INFO:genesis.wiring.p8_kv_hybrid_reporting] [P8] kv_cache_utils.py already has Genesis P8 marker — idempotent skip (this is the expected restart path; patch persists across container restart since file edits are on the bind-mount layer)
[INFO:genesis.apply_all] [Genesis] applied: P8 KV hybrid reporting (per-token capacity) — kv_cache_utils=applied(idempotent), scheduler=applied(idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P3 TurboQuant BF16->FP8 cast (Ampere fix) — already applied this image layer (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P6 TurboQuant-aware attention page size — already applied this image layer (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P15 Qwen3 None/null tool arg parser — already applied this image layer (idempotent)
[INFO:genesis.wiring.text_patch] [P12 Qwen3 <tool_call> implicit reasoning end] upstream marker '_tool_call_token_id' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] skipped: P12 Qwen3 <tool_call> implicit reasoning end — upstream_merged
[INFO:genesis.apply_all] [Genesis] applied: P27 Qwen3 BEFORE-THINK fallback — already applied this image layer (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P34 Mamba zero-collapse deadlock guard — already applied this image layer (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P29 tool parser IndexError guard — upstream already contains bounded-index guards (no-op)
[INFO:genesis.marlin_fp32_reduce] [Genesis P23] Marlin FP32_REDUCE: disabled=True on SM=(8, 6) (auto-from-platform)
[INFO:genesis.apply_all] [Genesis] applied: P23 Marlin FP32_REDUCE env override — decision: fp32_reduce disabled=True (requires upstream wire into Marlin launcher to take effect)
[INFO:genesis.apply_all] [Genesis] applied: P4 TurboQuant hybrid model support — already applied this image layer (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P5 KV cache page size unification — v2 already applied this image layer (idempotent)
[INFO:genesis.apply_all] [Genesis] skipped: P5b KV page-size pad-smaller-to-max (env-opt-in) — opt-in: set GENESIS_ENABLE_P5B=1 to enable pad-smaller-to-max KV page-size strategy (saves ~34% per-block VRAM on Qwen3.6 hybrid vs P5 v1 LCM-pad-up; BLAST-RADIUS is KV allocator → benchmark on VM 100 before enabling in prod)
INFO 05-03 10:31:23 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
INFO 05-03 10:31:23 [nixl_utils.py:32] NIXL is available
[INFO:genesis.wiring.p31_router_softmax] [Genesis P31] wrapped grouped_topk_router.grouped_topk (fp32 upcast active for grouped-MoE models)
[INFO:genesis.apply_all] [Genesis] applied: P31 MoE router fp32 softmax — grouped_topk wrapped (effective in this process)
[INFO:genesis.silu_and_mul_customop] [PN25] registered torch op genesis::silu_and_mul_pooled via direct_register_custom_op (vLLM canonical, v7.66 — fork-safe + Inductor-opaque)
[INFO:genesis.wiring.p22_tq_prealloc] [P22] PR #40655 merged (bhoomit moved init out) — auto-skip — patch retired in favor of upstream
[INFO:genesis.apply_all] [Genesis] skipped: P22 TurboQuant shared dequant prealloc — PR #40655 merged (bhoomit moved init out) — auto-skip
[INFO:genesis.wiring.text_patch] [P26 TQ prefill output prealloc] upstream marker 'if not hasattr(self, "_cu_2")' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] skipped: P26 TurboQuant prefill output prealloc — upstream_merged
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P61b — Qwen3 streaming partial-tag overlap guard | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P61b Qwen3 streaming partial-tag overlap guard — already applied (idempotent)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P62 — Structured-output spec-decode reasoning-end timing fix | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P62 structured-output spec-decode timing fix — P62 applied: 0 files modified, 2 idempotent. Reasoning-aware grammar acceptance + spec-token validation now active. Should reduce residual broken tool-call rate when </think> arrives in spec batch.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P61 — Qwen3 multi-tool first-occurrence (DEPRECATED — superseded by P12 v2) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P61 Qwen3 multi-tool first-occurrence — already applied (idempotent)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P60b — GDN+ngram Triton kernel offset (Phase 2) | opt-in env (config: neutral)
[INFO:genesis.wiring.p60b_gdn_ngram_triton_kernel] [Genesis P60b] Triton cache: no cache dir cleared (may be empty already or permission denied)
[INFO:genesis.apply_all] [Genesis] applied: P60b GDN+ngram Triton kernel offset — P60b Phase 2 applied: 0 files modified, 2 idempotent. Triton kernel offset active. no cache dir cleared (may be empty already or permission denied) First spec-decode call will trigger kernel recompile (~5-10s).
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P60 — GDN+ngram state recovery (Phase 1: SSM pre-copy) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P60 GDN+ngram state recovery — P60 Phase 1 applied: 0 files modified, 3 idempotent. SSM state pre-copy active for GDN+ngram spec decode. Conv state Triton kernel fix (Phase 2) NOT included — if reproducer still shows broken output, Phase 2 is the next step.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P63 — MTP/Eagle drafter GDN state recovery (deprecated — wrong layer) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P63_MTP_GDN_STATE_RECOVERY=1 only for diagnost
[INFO:genesis.apply_all] [Genesis] skipped: P63 MTP/Eagle drafter GDN state recovery — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P63_MTP_GDN_STATE_RECOVERY=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P64 — qwen3coder MTP streaming early-return fix | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P64 qwen3coder MTP streaming early-return fix — P64 applied: 0 files modified, 2 idempotent. qwen3coder streaming parser no longer drops parameters when MTP bundles last param + </function> in same delta. Safety net widened to fire on finish_reason alone.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P65 — TurboQuant spec-decode cudagraph downgrade | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P65 TurboQuant spec-decode cudagraph downgrade — P65 applied: TurboQuant CG support downgraded to UNIFORM_SINGLE_TOKEN_DECODE. Spec-verify K+1 batches now run eager (correct per-request continuation), 1-token decode batches still use cudagraph. Workaround for noonghunna #40880 MTP+TurboQuant bug.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P66 — cudagraph_capture_sizes spec-decode divisibility filter | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P66 cudagraph_capture_sizes spec-decode divisibility filter — P66 applied: cudagraph_capture_sizes will be filtered to sizes divisible by uniform_decode_query_len when spec-decode is active. Boot 2-4x faster, less peak GPU memory, no mixed-q_len capture risks.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P68 — Auto force tool_choice=required for long-context tool calls | opt-in env (config: neutral)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P69 — Long-context tool-format reminder injection | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P68/P69 long-context tool-call adherence — Hook injected into create_chat_completion. Active mitigations: P68 (auto force tool_choice=required for long ctx); P69 (append tool-format reminder to last user msg). Threshold: GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS env (default 50000, raised from 8000 in v7.65 per Issue #9).
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P70 — Auto-strict-ngram (force prompt_lookup_min>=8) | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P70 Auto-strict-ngram (force prompt_lookup_min>=8) — opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to engage
[INFO:genesis.wiring.p67_tq_multi_query_kernel] [Genesis P67 H2] baked env at module load: MAX_PRIOR_LEN=4096 DEBUG_COMPARE=False (no per-dispatch env reads)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P67 — TurboQuant multi-query kernel for spec-decode K+1 | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P67 TurboQuant multi-query kernel for spec-decode K+1 — P67 hook injected at top of _prefill_attention. Multi-query continuation prefill batches (spec-verify K+1 with prior cached KV) will route to Genesis Triton kernel when GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1. Diag: {'env_enabled': True, 'version': 'v7.39_aggressive_tune', 'sm': '8.6', 'fp8_mode': 'e4b15', 'autoconfig': {'BLOCK_KV': 32, 'num_warps': 8, 'num_stages': 3}, 'kernel_built': True}
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P71 — Block-verify rejection sampler (Sun 2024 ICLR) | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P71 Block-verify rejection sampler (vllm#40819 + gemini bug-fixes) — opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P78 — TurboQuant .tolist() capture-guard (adapted from noonghunna) | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P78 TurboQuant .tolist() capture-guard (adapted from noonghunna) — opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P77 — Adaptive ngram K controller (EMA + hysteresis + auto-disable) | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P77 Adaptive ngram K controller (EMA + hysteresis + auto-disable) — opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P79b — Async × spec-decode proposer-sync backport (vllm#40610) | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P79b Async × spec-decode proposer-sync backport (vllm#40610) — opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P79c — Stale spec_token_ids cleanup for unscheduled requests (vllm#37629) | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P79c Stale spec_token_ids cleanup for unscheduled requests (vllm#37629) — opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P81 — fp8 block-scaled MM low-M decode tuning (vllm#40925) | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P81 fp8 block-scaled MM low-M decode tuning (vllm#40925) — opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P82 — SGLang threshold_single OR-clause acceptance (BIASED — opt-in research) | opt-in only — set GENESIS_ENABLE_P82=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P82 SGLang threshold_single OR-clause acceptance (BIASED — opt-in research) — opt-in only — set GENESIS_ENABLE_P82=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P83 — MTP keep-last-cached-block (vllm#38182 downstream symptom — P84 is real fix) | opt-in env (config: neutral)
[INFO:genesis.wiring.p83_mtp_keep_last_cached_block] [P83] marker present — skip (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P83 MTP keep-last-cached-block (vllm#38182 mitigation) — idempotent (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P84 — hash_block_size override (vllm#38182 actual root cause) | opt-in only — set GENESIS_ENABLE_P84=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P84 hash_block_size override (vllm#38182 ACTUAL root cause) — opt-in only — set GENESIS_ENABLE_P84=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P100 — FlashInfer FULL CUDA graph for spec-decode (vllm#41127) | opt-in env (config: neutral)
[INFO:genesis.wiring.p100_flashinfer_full_cg_specdec] [P100] marker present — skip (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P100 FlashInfer FULL CUDA graph for spec-decode (vllm#41127) — idempotent (marker present)
[INFO:genesis.wiring.p103_fla_cliff2_chunked] [Genesis P103 self-install] wrapper installed in chunk.py at module-import time (survives `exec vllm serve` + worker spawn)
[INFO:genesis.apply_all] [Genesis] applied: P103 FLA Cliff 2 chunked fwd_h+fwd_o orchestrator — text-patch=idempotent (chunk.py self-install hook appended (survives `exec vllm serve` + worker spawn)); setattr already idempotent (chunk.py self-install fired or prior apply still active in this process)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P101 — TQ continuation 64-token slicing (vllm#41123 SELECTIVE) | opt-in env (config: neutral)
[INFO:genesis.wiring.p101_tq_continuation_slicing] [P101] marker present — skip (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P101 TQ continuation 64-token slicing (vllm#41123 selective) — idempotent (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P99 — WorkspaceManager.get_simultaneous memoization (perf hotfix) | opt-in env (config: neutral)
[INFO:genesis.wiring.p99_workspace_manager_memoize] [P99] marker present — skip (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P99 WorkspaceManager memoize get_simultaneous (perf hotfix) — idempotent (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P98 — TQ WorkspaceManager revert (vllm#40941 perf hotfix) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P98 turboquant_attn.py — revert WorkspaceManager (perf hotfix)] upstream marker 'UNIFORM_SINGLE_TOKEN_DECODE' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] applied: P98 TQ WorkspaceManager revert (vllm#40941 perf hotfix) — P98 v7.62.14 applied: turboquant_attn.py _decode_attention + continuation prefill dequant now use per-layer cached buffers (OLD pattern, pre-vllm#40941). Removes Python indirection from decode hot path. Expected: +15-25% TPS recovery on Ampere small-batch single-stream workload.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P94 — Spec-decode prepare_next_token_ids_padded zero-alloc (vllm#41043) | opt-in env (config: neutral)
[INFO:genesis.wiring.p94_spec_decode_zero_alloc] [P94] marker present — skip (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P94 Spec-decode prepare_next_token_ids_padded zero-alloc (vllm#41043 backport) — idempotent (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P95 — Marlin TP cudagraph cap on Ampere (vllm#40385) | opt-in only — set GENESIS_ENABLE_P95=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P95 Marlin TP cudagraph cap on Ampere (vllm#40385 backport) — opt-in only — set GENESIS_ENABLE_P95=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P91 — AutoRound row-parallel group cdiv + start-idx fix (vllm#39460) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P91 AutoRound row-parallel group cdiv + start-idx fix (vllm#39460 backport) — idempotent (both markers present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P87 — Marlin W4A16/W8A16 sub-tile output dim pad-on-load (vllm#40361) | opt-in env (config: neutral)
[ERROR:genesis.apply_all] [Genesis] FAILED: P87 Marlin sub-tile output dim pad-on-load (vllm#40361 backport) — P87 marlin.py — sub-tile output dim pad-on-load (vllm#40361): write_error ([Errno 30] Read-only file system: '/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/mixed_precision/marlin.py')
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN8 — MTP/draft online-quant propagation (vllm#40849) | opt-in env (config: neutral)
[INFO:genesis.wiring.pN8_mtp_draft_online_quant_propagation] [PN8] marker present — skip (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: PN8 MTP/draft online-quant propagation (vllm#40849 backport) — idempotent (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN9 — Independent drafter attention backend (vllm#39930) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] skipped: PN9 independent drafter attention backend (vllm#39930 backport) — upstream drift: 'spec_cfg.attention_backend' present in llm_base_proposer.py — PR #39930 (or equivalent) appears merged; PN9 self-retires (use --speculative-config.attention_backend instead)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN11 — GDN a/b contiguity in fix_query_key_value_ordering (vllm#41142) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] skipped: PN11 GDN a/b contiguity (vllm#41142 backport) — PN11 model_executor/layers/mamba/gdn_linear_attn.py — force a/b contiguity in fix_query_key_value_ordering (vllm#41142): already applied (marker present)
[INFO:genesis.apply_all] [Genesis] skipped: P67c sparse-V integration into P67 split-M kernel — opt-in: set GENESIS_ENABLE_P67_SPARSE_V=1 to engage per-q_t sparse-V skip in P67 split-M kernel
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN34 — WorkspaceManager runtime lock relaxation (PN33 companion for runtime decode) | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN34 workspace lock runtime relaxation — opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN33 — Spec-decode warmup K-aware sizing (vllm#37521 extended to MTP/ngram) | config_detect: neutral:
[INFO:genesis.wiring.pN33_spec_decode_warmup_k] [PN33] marker present — skip (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: PN33 spec-decode warmup K-aware (vllm#37521 extended) — idempotent (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN32 — GDN _forward_core chunked-prefill v2 (Cliff 2 fix for single-24GB-GPU OOM) | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN32 GDN chunked-prefill (Cliff 2 fix) — opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN31 — FA varlen persistent out buffer (issue #15, sister to P38) | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_OUT=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN31 FA varlen persistent out (issue #15) — opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_OUT=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN30 — DS conv state layout + spec-decode AL>1 fix (issue #17) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] skipped: PN30 DS conv state + spec-decode AL>1 (issue #17) — PN30 DS layout + spec-decode AL>1: already applied (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN29 — GDN chunk_o scale-fold (vllm#41446 pattern (c)) | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN29 GDN chunk_o scale-fold (vllm#41446 pattern c) — opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN12 — FFN intermediate scratch pool — Cliff 1 fix on TQ3 path | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] skipped: PN12 FFN intermediate scratch pool (Cliff 1 fix) — PN12 model_executor/layers/activation.py — SiluAndMul forward_cuda FFN intermediate pool (Cliff 1 fix on TQ3): already applied (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN28 — merge_attn_states NaN guard (vllm#39148 backport) | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN28 merge_attn_states NaN guard (vllm#39148 backport) — opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P15B — FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] skipped: P15B FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix) — P15B turboquant_attn.py — _flash_attn_varlen max_seqlen_k clamp (Issue #15 fix): already applied (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P38B — P38 compile-safe in-source hook (Issue #14 fix) | opt-in env (config: neutral)
[INFO:genesis.wiring.p38b_compile_safe_hook] [Genesis P38b] installed _genesis_p38_dispatch on TurboQuantAttentionImpl — compile-safe path active
[INFO:genesis.apply_all] [Genesis] applied: P38B P38 compile-safe in-source hook (Issue #14 fix) — P38b applied: text-patch skipped (in-source hook already present); dispatcher installed on TurboQuantAttentionImpl. _continuation_prefill now survives aot_compile_fullgraph capture. Fixes Issue #14 (noonghunna).
[INFO:genesis.kernels.tq_decode_sparse_v] [Genesis PN26 sparse_v] kernel built and cached
[INFO:genesis.wiring.pn26_sparse_v_kernel] [Genesis PN26 sparse_v] wrapped vllm.v1.attention.ops.triton_turboquant_decode.triton_turboquant_decode_attention with sparse-V dispatcher (also rebound on consumers: ['vllm.v1.attention.backends.turboquant_attn'])
[INFO:genesis.apply_all] [Genesis] applied: PN26b sparse-V tile-skip Genesis kernel (BLASST lambda=a/L for SM86) — PN26 sparse-V kernel dispatcher installed. Routes to Genesis fork when seq_len >= min_ctx (default 8192). Threshold: BLASST λ=a/L if GENESIS_PN26_SPARSE_V_SCALE_FACTOR>0, else fixed GENESIS_PN26_SPARSE_V_THRESHOLD (default 0.001). First sparse-V kernel deployed for SM86 (Ampere consumer) — no upstream NVIDIA reference exists yet.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN27 — Revert MoERunnerInterface PluggableLayer (vllm#41440) | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN27 revert MoERunnerInterface PluggableLayer (vllm#41440 backport) — opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN26 — TQ unified perf pack (centroids prebake + sparse V scaffold) | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN26 TQ unified perf pack (centroids prebake + sparse V scaffold) — opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN25 — SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile path) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] skipped: PN25 SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile-path) — PN25 model_executor/layers/activation.py — SiluAndMul forward_native opaque-op pool (Inductor-safe Cliff 1 mech B): already applied (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN13 — CUDAGraphWrapper gc.collect/empty_cache lambda arity (vllm#41235) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] skipped: PN13 CUDAGraphWrapper lambda arity (vllm#41235 backport) — PN13 compilation/cuda_graph.py — CUDAGraphWrapper gc.collect/empty_cache lambda arity fix (vllm#41235): already applied (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN14 — TQ decode IOOB safe_page_idx clamp (vllm#40074) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] skipped: PN14 TQ decode IOOB safe_page_idx clamp (vllm#40074 backport) — PN14 triton_turboquant_decode.py — TQ decode IOOB safe_page_idx clamp (vllm#40074): already applied (marker present)
[INFO:genesis.apply_all] [Genesis] skipped: PN19 Scoped max_split_size_mb during model load (vllm#41268) — PN19 scoped max_split_size_mb: already applied (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN23 — DFlash combine_hidden_states dtype cast (vllm#40334) | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN23 DFlash combine_hidden_states dtype cast (vllm#40334 backport) — opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN21 — DFlash SWA support partial backport (vllm#40898) | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN21 DFlash SWA support partial backport (vllm#40898 backport) — opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN22 — Local argmax for TP draft (vllm#39419 backport) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: PN22 Local argmax for TP draft (vllm#39419 backport) — PN22 applied: get_top_tokens() added to qwen3.py + qwen3_dflash.py (vllm#39419 backport). Enables vocab-parallel argmax in spec-decode draft path; +9-30% TPS on TP>=2 per PR author.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN24 — DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN24 DFlash aux layer +1 indexing fix (vllm#40727 backport) — opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN17 FA2 softmax_lse runtime clamp (Issue #11 Cliff 1 mechanism A) — PN17 FA2 softmax_lse runtime clamp: already applied (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN16 — Lazy-reasoner request hook (per-request enable_thinking) | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN16 Lazy-reasoner request hook (per-request enable_thinking) — opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P86 — ngram batch_propose O(N*K) → O(N+K) direct-fill (vllm#40876) | opt-in only — set GENESIS_ENABLE_P86=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P86 ngram batch_propose O(N+K) direct-fill (vllm#40876 backport) — opt-in only — set GENESIS_ENABLE_P86=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P85 — Hybrid fine-shadow prefix cache (vllm#38182 followup, MambaManager fix) | opt-in env (config: neutral)
[INFO:genesis.wiring.p85_hybrid_fine_shadow_prefix_cache] [P85] marker present — skip (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P85 Hybrid fine-shadow prefix cache (MambaManager fix for vllm#38182 followup) — idempotent (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P75 — Auto-enable Suffix Decoding (Arctic Inference, vllm#25784) | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P75 Auto-enable Suffix Decoding (vllm#25784 Arctic Inference) — opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P74 — Auto chunk-clamp via long_prefill_token_threshold (P72 companion) | opt-in env (config: neutral)
[INFO:genesis.wiring.p74_chunk_clamp] [P74] marker present — skip (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P74 Auto chunk-clamp via long_prefill_token_threshold (P72 companion) — idempotent (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P72 — profile_run M cap (unblocks --max-num-batched-tokens>4096 on MoE) | opt-in env (config: neutral)
[INFO:genesis.wiring.p72_profile_run_cap] [P72] marker present — skip (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P72 profile_run M cap (unblocks --max-num-batched-tokens>4096 on MoE) — idempotent (marker present)
[INFO:genesis.wiring.p67b_spec_verify_routing] [Genesis P67b B2] baked env at module load: USE_UPSTREAM=True NUM_KV_SPLITS=(default = self.max_num_kv_splits) (no per-dispatch env reads)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P67b — TurboQuant spec-verify forward() routing (FULL CG enable) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P67b TurboQuant spec-verify forward() routing — P67b forward() spec-verify routing injected. K+1 batches now bypass _prefill_attention entirely (cudagraph-safe).
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P59 — Qwen3 reasoning embedded tool_call recovery | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P59 Qwen3 reasoning embedded tool_call recovery — opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P58 — Async-scheduler -1 placeholder fix | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P58 async-scheduler -1 placeholder fix — P58 backport applied: 0 files modified, 3 already-applied. Async scheduler -1 placeholder leakage fixed; spec-decode + cudagraph workloads should no longer loop or IMA. Validate with our regression smoke suite before serving traffic.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P57 — TQ spec-decode capture-safe buffers (deprecated — research artifact) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P57_SPEC_DECODE_CAPTURE_SAFE=1 only for diagno
[INFO:genesis.apply_all] [Genesis] skipped: P57 TQ spec-decode capture-safe buffers — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P57_SPEC_DECODE_CAPTURE_SAFE=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P56 — TQ spec-decode safe-path guard (deprecated — superseded by P65) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P56_SPEC_DECODE_GUARD=1 only for diagnostics
[INFO:genesis.apply_all] [Genesis] skipped: P56 TQ spec-decode safe-path guard — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P56_SPEC_DECODE_GUARD=1 only for diagnostics
[INFO:genesis.apply_all] [Genesis] applied: P44 TQ mixed-batch attn_out pool — already patched this image layer (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P46 GDN gating buffer pool — already patched this image layer (idempotent)
[INFO:genesis.apply_all] [Genesis] skipped: P7b GDN dual-stream via torch.library.custom_op (opt-in) — opt-in: set GENESIS_ENABLE_P7B=1 to enable GDN dual-stream via torch.library.custom_op (graph-safe alternative to P7; validates numeric equiv + compile-cache freshness before prod)
[INFO:genesis.apply_all] [Genesis] skipped: P40 TurboQuant GQA-grouped decode stage1 (opt-in) — opt-in: set GENESIS_ENABLE_P40=1 to enable GQA-grouped decode stage1 (port of upstream PR #40792)
[INFO:genesis.wiring.p39a_fla_kkt] [Genesis P39b] vllm_config fetch failed (Current vLLM config is not set. This typically means get_current_vllm_config() was called outside of a set_current_vllm_config() context, or a CustomOp was instantiated at module import time or model forward time when config is not set. For tests that directly test custom ops/modules, use the 'default_vllm_config' pytest fixture from tests/conftest.py.); defaults used
[INFO:genesis.wiring.p39a_fla_kkt] [Genesis P39a] rebound vllm.model_executor.layers.fla.ops.chunk_scaled_dot_kkt.chunk_scaled_dot_kkt_fwd (+1 caller mods: ['vllm.model_executor.layers.fla.ops.chunk'])
[INFO:genesis.apply_all] [Genesis] applied: P39a FLA chunk_scaled_dot_kkt persistent A pool — module-level fn replaced (1 caller module(s) also rebound — pool shared across GDN layers)
[INFO:genesis.wiring.p38_tq_continuation_memory] [Genesis P38] rebound TurboQuantAttentionImpl._continuation_prefill (persistent K_full/V_full buffers replace torch.cat peak)
[INFO:genesis.apply_all] [Genesis] applied: P38 TQ _continuation_prefill persistent workspace — class method replaced (persistent K_full/V_full workspace, no .contiguous()/torch.cat transient peaks)
[INFO:genesis.apply_all] [Genesis] skipped: P37 MoE intermediate cache pool (opt-in) — opt-in only — set GENESIS_ENABLE_P37=1 to engage. Manager API is registered and usable independently of this text-patch.
[INFO:genesis.apply_all] [Genesis] skipped: P36 TurboQuant shared decode buffers — redundant: upstream PR #40798 (workspace manager) active — skip
[INFO:genesis.apply_all] [Genesis] applied: P32/P33 TurboQuant cu_2 + synth_seq_lens preallocs — cu_2 + synth_seq_lens preallocs registered (invoked from ensure_turboquant_buffers, fires during profile_run)
[INFO:genesis.prealloc_budget] [Genesis P73] token budget resolved → 4128 (via GENESIS_PREALLOC_TOKEN_BUDGET)
[INFO:genesis.wiring.p28_gdn_core_attn] [Genesis P28] wrapped GatedDeltaNetAttention.__init__ to attach _genesis_gdn_core_attn_buf on each instance
[INFO:genesis.apply_all] [Genesis] applied: P28 GDN core_attn_out prealloc — already applied (idempotent)
[INFO:genesis.gdn_dual_stream] [GDN dual-stream] CUDA aux stream initialized
[INFO:genesis.apply_all] [Genesis P7] dispatcher ready (parallel path)
[INFO:genesis.apply_all] [Genesis] skipped: P7 GDN dual-stream in_proj parallelism — deferred — incompatible with torch.compile fullgraph (CUDA streams not SymPy-graphable); custom op implementation required. Re-enable with GENESIS_ENABLE_P7=1 + --enforce-eager.
[INFO:genesis.apply_all] [Genesis P17/P18] Marlin tuning ready: SM=(8, 6) bsm=8 num_warps=4 num_stages=3
[INFO:genesis.apply_all] [Genesis] applied: P17/P18 Marlin MoE per-SM tuning — SM=(8, 6) bsm=8
[INFO:genesis.apply_all] [Genesis] applied: P24 fused_moe num_warps/num_stages overlay — already applied this image layer (idempotent)
[INFO:genesis.wiring.p14_block_table] [Genesis P14] rebound BlockTable.append_row + move_row (tail-zero-fill active for concurrent-request safety)
[INFO:genesis.apply_all] [Genesis] applied: P14 block_table tail zero-fill — BlockTable.append_row + move_row wrapped (effective in this process)
[INFO:genesis.tq_decode_tune] [Genesis P18b] TQ decode stage1 using upstream defaults (BLOCK_KV=4 num_warps=1 num_stages=1)
[INFO:genesis.apply_all] [Genesis] applied: P18b TurboQuant decode stage1 tune — no env override — using upstream defaults (4/1/1)
[INFO:genesis.apply_all] [Genesis P20] TQ _continuation_prefill FP16 helpers ready for TurboQuantAttentionImpl hook
[INFO:genesis.apply_all] [Genesis] applied: P20 TurboQuant continuation-prefill FP16 rotate — fp16-rotation helper ready for _continuation_prefill hook
[INFO:genesis.fp8_dispatcher] [Genesis FP8 dispatcher] SM (8, 6) → Marlin fallback required (Triton block FP8 unsupported on Ampere)
[INFO:genesis.apply_all] [Genesis] applied: P1/P2 FP8 kernel dispatcher — SM=(8, 6) → Marlin fallback path selected
[INFO:genesis.apply_all] Genesis Results: 51 applied, 47 skipped, 1 failed, 1 ⚠️ partial-apply warning(s)
[WARNING:genesis.apply_all] [Genesis] 1 partial-apply warning(s) — patch(es) failed to match expected source pattern. Review below to confirm anchor drift vs upstream change vs config issue:
[WARNING:genesis.apply_all] [Genesis] ⚠️  PN9 independent drafter attention backend (vllm#39930 backport) — upstream drift: 'spec_cfg.attention_backend' present in llm_base_proposer.py — PR #39930 (or equivalent) appears merged; PN9 self-retires (use --speculative-config.attention_backend instead)
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] apply matrix:
Patch | Status | Title                                         | Reason                                                       | Credit                        
------+--------+-----------------------------------------------+--------------------------------------------------------------+-------------------------------
P61b  | APPLY  | Qwen3 streaming partial-tag overlap guard     | opt-in env (config: neutral)                                 | ExtReMLapin (vllm#40783)      
P62   | APPLY  | Structured-output spec-decode reasoning-end t | opt-in env (config: neutral)                                 | sfbemerk (vllm#36138), ciciror
P61   | APPLY  | Qwen3 multi-tool first-occurrence (DEPRECATED | opt-in env (config: neutral)                                 | ExtReMLapin (vllm#40783) — P61
P60b  | APPLY  | GDN+ngram Triton kernel offset (Phase 2)      | opt-in env (config: neutral)                                 | tdoublep (vllm#40738)         
P60   | APPLY  | GDN+ngram state recovery (Phase 1: SSM pre-co | opt-in env (config: neutral)                                 | tdoublep (vllm#40738), bhaktat
P63   | SKIP   | MTP/Eagle drafter GDN state recovery (depreca | opt-in only AND empirically deprecated — keeping skip; set G | Genesis-original (hypothesis d
P64   | APPLY  | qwen3coder MTP streaming early-return fix     | opt-in env (config: neutral)                                 | kotori-yan (vllm#39598)       
P65   | APPLY  | TurboQuant spec-decode cudagraph downgrade    | opt-in env (config: neutral)                                 | Genesis-original (root cause f
P66   | APPLY  | cudagraph_capture_sizes spec-decode divisibil | opt-in env (config: neutral)                                 | Genesis-original (mirrors fhl2
P68   | APPLY  | Auto force tool_choice=required for long-cont | opt-in env (config: neutral)                                 | Genesis-original (long-ctx too
P69   | APPLY  | Long-context tool-format reminder injection   | opt-in env (config: neutral)                                 | Genesis-original (long-ctx too
P70   | SKIP   | Auto-strict-ngram (force prompt_lookup_min>=8 | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to  | Genesis-original (vllm#40875 e
P67   | APPLY  | TurboQuant multi-query kernel for spec-decode | opt-in env (config: neutral)                                 | Genesis-original (proper fix f
P71   | SKIP   | Block-verify rejection sampler (Sun 2024 ICLR | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engag | Backport of vllm#40819 (Z. Gol
P78   | SKIP   | TurboQuant .tolist() capture-guard (adapted f | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1  | Adapted from noonghunna's patc
P77   | SKIP   | Adaptive ngram K controller (EMA + hysteresis | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to e | Genesis-original (port of SGLa
P79b  | SKIP   | Async × spec-decode proposer-sync backport (v | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1  | Backport of vllm#40610 (OPEN d
P79c  | SKIP   | Stale spec_token_ids cleanup for unscheduled  | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEAN | Backport of vllm#37629 (OPEN, 
P81   | SKIP   | fp8 block-scaled MM low-M decode tuning (vllm | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8 | Backport of vllm#40925 (tonyli
P82   | SKIP   | SGLang threshold_single OR-clause acceptance  | opt-in only — set GENESIS_ENABLE_P82=1 to engage             | SGLang team (sgl-project/sglan
P83   | APPLY  | MTP keep-last-cached-block (vllm#38182 downst | opt-in env (config: neutral)                                 | Root-cause analysis: vllm#3818
P84   | SKIP   | hash_block_size override (vllm#38182 actual r | opt-in only — set GENESIS_ENABLE_P84=1 to engage             | Genesis-original discovery 202
P100  | APPLY  | FlashInfer FULL CUDA graph for spec-decode (v | opt-in env (config: neutral)                                 | Backport of vllm#41127 (open 2
P101  | APPLY  | TQ continuation 64-token slicing (vllm#41123  | opt-in env (config: neutral)                                 | Selective backport of vllm#411
P99   | APPLY  | WorkspaceManager.get_simultaneous memoization | opt-in env (config: neutral)                                 | Per Sander 2026-04-28: 'if rev
P98   | APPLY  | TQ WorkspaceManager revert (vllm#40941 perf h | opt-in env (config: neutral)                                 | Reverts upstream PR #40941 (ME
P94   | APPLY  | Spec-decode prepare_next_token_ids_padded zer | opt-in env (config: neutral)                                 | Backport of vllm#41043 (wanglu
P95   | SKIP   | Marlin TP cudagraph cap on Ampere (vllm#40385 | opt-in only — set GENESIS_ENABLE_P95=1 to engage             | Backport of vllm#40385 (OPEN a
P91   | APPLY  | AutoRound row-parallel group cdiv + start-idx | opt-in env (config: neutral)                                 | Backport of non-MoE-specific p
P87   | APPLY  | Marlin W4A16/W8A16 sub-tile output dim pad-on | opt-in env (config: neutral)                                 | Backport of vllm#40361 (OPEN).
PN8   | APPLY  | MTP/draft online-quant propagation (vllm#4084 | opt-in env (config: neutral)                                 | Backport of vllm#40849 (bhoomi
PN9   | APPLY  | Independent drafter attention backend (vllm#3 | opt-in env (config: neutral)                                 | Backport of vllm#39930 (Matthe
PN11  | APPLY  | GDN a/b contiguity in fix_query_key_value_ord | opt-in env (config: neutral)                                 | Backport of vllm#41142 (Yeuvoi
PN34  | SKIP   | WorkspaceManager runtime lock relaxation (PN3 | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 | Companion to PN33 — same root 
PN33  | APPLY  | Spec-decode warmup K-aware sizing (vllm#37521 | config_detect: neutral:                                      | Backport of vllm-project/vllm#
PN32  | SKIP   | GDN _forward_core chunked-prefill v2 (Cliff 2 | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1  | Genesis-original v7.69 v2 (202
PN31  | SKIP   | FA varlen persistent out buffer (issue #15, s | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_O | Genesis-original sister patch 
PN30  | APPLY  | DS conv state layout + spec-decode AL>1 fix ( | opt-in env (config: neutral)                                 | Genesis-original fix for issue
PN29  | SKIP   | GDN chunk_o scale-fold (vllm#41446 pattern (c | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to en | Backport of vllm#41446 (zobinH
PN12  | APPLY  | FFN intermediate scratch pool — Cliff 1 fix o | opt-in env (config: neutral)                                 | Genesis-original 2026-04-29 — 
PN28  | SKIP   | merge_attn_states NaN guard (vllm#39148 backp | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 | Backport of vllm#39148 (jasonk
P15B  | APPLY  | FA varlen max_seqlen_k clamp on TQ path (Issu | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 fi
P38B  | APPLY  | P38 compile-safe in-source hook (Issue #14 fi | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 fi
PN27  | SKIP   | Revert MoERunnerInterface PluggableLayer (vll | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 | Backport of vllm#41440 (auto-g
PN26  | SKIP   | TQ unified perf pack (centroids prebake + spa | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage | Genesis-original 2026-05-01 un
PN25  | APPLY  | SiluAndMul.forward_native opaque-op pool (Cli | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 in
PN13  | APPLY  | CUDAGraphWrapper gc.collect/empty_cache lambd | opt-in env (config: neutral)                                 | Backport of vllm#41235 (roikor
PN14  | APPLY  | TQ decode IOOB safe_page_idx clamp (vllm#4007 | opt-in env (config: neutral)                                 | Backport of vllm#40074 (devara
PN23  | SKIP   | DFlash combine_hidden_states dtype cast (vllm | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to  | Backport of vllm#40334 (cipher
PN21  | SKIP   | DFlash SWA support partial backport (vllm#408 | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage | Partial backport of vllm#40898
PN22  | APPLY  | Local argmax for TP draft (vllm#39419 backpor | opt-in env (config: neutral)                                 | Backport of vllm#39419 (EanWan
PN24  | SKIP   | DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 | Backport of vllm#40727 (benchi
PN16  | SKIP   | Lazy-reasoner request hook (per-request enabl | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to eng | Genesis-original 2026-04-29. H
P86   | SKIP   | ngram batch_propose O(N*K) → O(N+K) direct-fi | opt-in only — set GENESIS_ENABLE_P86=1 to engage             | Backport of vllm#40876 (aarona
P85   | APPLY  | Hybrid fine-shadow prefix cache (vllm#38182 f | opt-in env (config: neutral)                                 | Genesis-original 2026-04-27 — 
P75   | SKIP   | Auto-enable Suffix Decoding (Arctic Inference | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to en | Backport-enabler of vllm#25784
P74   | APPLY  | Auto chunk-clamp via long_prefill_token_thres | opt-in env (config: neutral)                                 | Genesis-original (zero-VRAM-co
P72   | APPLY  | profile_run M cap (unblocks --max-num-batched | opt-in env (config: neutral)                                 | Genesis-original (Dynamo fake-
P67b  | APPLY  | TurboQuant spec-verify forward() routing (FUL | opt-in env (config: neutral)                                 | Genesis-original (FULL CG enab
P59   | SKIP   | Qwen3 reasoning embedded tool_call recovery   | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 t | ZenoAFfectionate (vllm#39055) 
P58   | APPLY  | Async-scheduler -1 placeholder fix            | opt-in env (config: neutral)                                 | z1ying (vllm#40768)           
P57   | SKIP   | TQ spec-decode capture-safe buffers (deprecat | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40831), gdn_attn.
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated —  | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40807, #40831)   
[ERROR:genesis.dispatcher] [Genesis Dispatcher v2] validator ERROR: P85 — missing required dependency: P85 requires 'P84' to also be APPLY (currently SKIP)
[ERROR:genesis.dispatcher] [Genesis Dispatcher v2] validator ERROR: P67 — conflict: P65 and P67 are both APPLY but declared mutually exclusive — pick one
[ERROR:genesis.dispatcher] [Genesis Dispatcher v2] validator ERROR: P67b — conflict: P65 and P67b are both APPLY but declared mutually exclusive — pick one
[INFO:genesis.apply_all] [Genesis compile-watchdog] apply_all elapsed: 4.7s
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] apply matrix:
Patch | Status | Title                                         | Reason                                                       | Credit                        
------+--------+-----------------------------------------------+--------------------------------------------------------------+-------------------------------
P61b  | APPLY  | Qwen3 streaming partial-tag overlap guard     | opt-in env (config: neutral)                                 | ExtReMLapin (vllm#40783)      
P62   | APPLY  | Structured-output spec-decode reasoning-end t | opt-in env (config: neutral)                                 | sfbemerk (vllm#36138), ciciror
P61   | APPLY  | Qwen3 multi-tool first-occurrence (DEPRECATED | opt-in env (config: neutral)                                 | ExtReMLapin (vllm#40783) — P61
P60b  | APPLY  | GDN+ngram Triton kernel offset (Phase 2)      | opt-in env (config: neutral)                                 | tdoublep (vllm#40738)         
P60   | APPLY  | GDN+ngram state recovery (Phase 1: SSM pre-co | opt-in env (config: neutral)                                 | tdoublep (vllm#40738), bhaktat
P63   | SKIP   | MTP/Eagle drafter GDN state recovery (depreca | opt-in only AND empirically deprecated — keeping skip; set G | Genesis-original (hypothesis d
P64   | APPLY  | qwen3coder MTP streaming early-return fix     | opt-in env (config: neutral)                                 | kotori-yan (vllm#39598)       
P65   | APPLY  | TurboQuant spec-decode cudagraph downgrade    | opt-in env (config: neutral)                                 | Genesis-original (root cause f
P66   | APPLY  | cudagraph_capture_sizes spec-decode divisibil | opt-in env (config: neutral)                                 | Genesis-original (mirrors fhl2
P68   | APPLY  | Auto force tool_choice=required for long-cont | opt-in env (config: neutral)                                 | Genesis-original (long-ctx too
P69   | APPLY  | Long-context tool-format reminder injection   | opt-in env (config: neutral)                                 | Genesis-original (long-ctx too
P70   | SKIP   | Auto-strict-ngram (force prompt_lookup_min>=8 | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to  | Genesis-original (vllm#40875 e
P67   | APPLY  | TurboQuant multi-query kernel for spec-decode | opt-in env (config: neutral)                                 | Genesis-original (proper fix f
P71   | SKIP   | Block-verify rejection sampler (Sun 2024 ICLR | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engag | Backport of vllm#40819 (Z. Gol
P78   | SKIP   | TurboQuant .tolist() capture-guard (adapted f | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1  | Adapted from noonghunna's patc
P77   | SKIP   | Adaptive ngram K controller (EMA + hysteresis | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to e | Genesis-original (port of SGLa
P79b  | SKIP   | Async × spec-decode proposer-sync backport (v | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1  | Backport of vllm#40610 (OPEN d
P79c  | SKIP   | Stale spec_token_ids cleanup for unscheduled  | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEAN | Backport of vllm#37629 (OPEN, 
P81   | SKIP   | fp8 block-scaled MM low-M decode tuning (vllm | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8 | Backport of vllm#40925 (tonyli
P82   | SKIP   | SGLang threshold_single OR-clause acceptance  | opt-in only — set GENESIS_ENABLE_P82=1 to engage             | SGLang team (sgl-project/sglan
P83   | APPLY  | MTP keep-last-cached-block (vllm#38182 downst | opt-in env (config: neutral)                                 | Root-cause analysis: vllm#3818
P84   | SKIP   | hash_block_size override (vllm#38182 actual r | opt-in only — set GENESIS_ENABLE_P84=1 to engage             | Genesis-original discovery 202
P100  | APPLY  | FlashInfer FULL CUDA graph for spec-decode (v | opt-in env (config: neutral)                                 | Backport of vllm#41127 (open 2
P101  | APPLY  | TQ continuation 64-token slicing (vllm#41123  | opt-in env (config: neutral)                                 | Selective backport of vllm#411
P99   | APPLY  | WorkspaceManager.get_simultaneous memoization | opt-in env (config: neutral)                                 | Per Sander 2026-04-28: 'if rev
P98   | APPLY  | TQ WorkspaceManager revert (vllm#40941 perf h | opt-in env (config: neutral)                                 | Reverts upstream PR #40941 (ME
P94   | APPLY  | Spec-decode prepare_next_token_ids_padded zer | opt-in env (config: neutral)                                 | Backport of vllm#41043 (wanglu
P95   | SKIP   | Marlin TP cudagraph cap on Ampere (vllm#40385 | opt-in only — set GENESIS_ENABLE_P95=1 to engage             | Backport of vllm#40385 (OPEN a
P91   | APPLY  | AutoRound row-parallel group cdiv + start-idx | opt-in env (config: neutral)                                 | Backport of non-MoE-specific p
P87   | APPLY  | Marlin W4A16/W8A16 sub-tile output dim pad-on | opt-in env (config: neutral)                                 | Backport of vllm#40361 (OPEN).
PN8   | APPLY  | MTP/draft online-quant propagation (vllm#4084 | opt-in env (config: neutral)                                 | Backport of vllm#40849 (bhoomi
PN9   | APPLY  | Independent drafter attention backend (vllm#3 | opt-in env (config: neutral)                                 | Backport of vllm#39930 (Matthe
PN11  | APPLY  | GDN a/b contiguity in fix_query_key_value_ord | opt-in env (config: neutral)                                 | Backport of vllm#41142 (Yeuvoi
PN34  | SKIP   | WorkspaceManager runtime lock relaxation (PN3 | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 | Companion to PN33 — same root 
PN33  | APPLY  | Spec-decode warmup K-aware sizing (vllm#37521 | config_detect: neutral:                                      | Backport of vllm-project/vllm#
PN32  | SKIP   | GDN _forward_core chunked-prefill v2 (Cliff 2 | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1  | Genesis-original v7.69 v2 (202
PN31  | SKIP   | FA varlen persistent out buffer (issue #15, s | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_O | Genesis-original sister patch 
PN30  | APPLY  | DS conv state layout + spec-decode AL>1 fix ( | opt-in env (config: neutral)                                 | Genesis-original fix for issue
PN29  | SKIP   | GDN chunk_o scale-fold (vllm#41446 pattern (c | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to en | Backport of vllm#41446 (zobinH
PN12  | APPLY  | FFN intermediate scratch pool — Cliff 1 fix o | opt-in env (config: neutral)                                 | Genesis-original 2026-04-29 — 
PN28  | SKIP   | merge_attn_states NaN guard (vllm#39148 backp | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 | Backport of vllm#39148 (jasonk
P15B  | APPLY  | FA varlen max_seqlen_k clamp on TQ path (Issu | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 fi
P38B  | APPLY  | P38 compile-safe in-source hook (Issue #14 fi | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 fi
PN27  | SKIP   | Revert MoERunnerInterface PluggableLayer (vll | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 | Backport of vllm#41440 (auto-g
PN26  | SKIP   | TQ unified perf pack (centroids prebake + spa | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage | Genesis-original 2026-05-01 un
PN25  | APPLY  | SiluAndMul.forward_native opaque-op pool (Cli | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 in
PN13  | APPLY  | CUDAGraphWrapper gc.collect/empty_cache lambd | opt-in env (config: neutral)                                 | Backport of vllm#41235 (roikor
PN14  | APPLY  | TQ decode IOOB safe_page_idx clamp (vllm#4007 | opt-in env (config: neutral)                                 | Backport of vllm#40074 (devara
PN23  | SKIP   | DFlash combine_hidden_states dtype cast (vllm | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to  | Backport of vllm#40334 (cipher
PN21  | SKIP   | DFlash SWA support partial backport (vllm#408 | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage | Partial backport of vllm#40898
PN22  | APPLY  | Local argmax for TP draft (vllm#39419 backpor | opt-in env (config: neutral)                                 | Backport of vllm#39419 (EanWan
PN24  | SKIP   | DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 | Backport of vllm#40727 (benchi
PN16  | SKIP   | Lazy-reasoner request hook (per-request enabl | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to eng | Genesis-original 2026-04-29. H
P86   | SKIP   | ngram batch_propose O(N*K) → O(N+K) direct-fi | opt-in only — set GENESIS_ENABLE_P86=1 to engage             | Backport of vllm#40876 (aarona
P85   | APPLY  | Hybrid fine-shadow prefix cache (vllm#38182 f | opt-in env (config: neutral)                                 | Genesis-original 2026-04-27 — 
P75   | SKIP   | Auto-enable Suffix Decoding (Arctic Inference | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to en | Backport-enabler of vllm#25784
P74   | APPLY  | Auto chunk-clamp via long_prefill_token_thres | opt-in env (config: neutral)                                 | Genesis-original (zero-VRAM-co
P72   | APPLY  | profile_run M cap (unblocks --max-num-batched | opt-in env (config: neutral)                                 | Genesis-original (Dynamo fake-
P67b  | APPLY  | TurboQuant spec-verify forward() routing (FUL | opt-in env (config: neutral)                                 | Genesis-original (FULL CG enab
P59   | SKIP   | Qwen3 reasoning embedded tool_call recovery   | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 t | ZenoAFfectionate (vllm#39055) 
P58   | APPLY  | Async-scheduler -1 placeholder fix            | opt-in env (config: neutral)                                 | z1ying (vllm#40768)           
P57   | SKIP   | TQ spec-decode capture-safe buffers (deprecat | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40831), gdn_attn.
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated —  | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40807, #40831)   
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
[INFO:genesis.apply_all] Genesis platform: {"vendor": {"is_nvidia_cuda": true, "is_amd_rocm": false, "is_intel_xpu": false, "is_cpu_only": false}, "nvidia": {"compute_capability": [8, 6], "is_ampere_datacenter": false, "is_ampere_consumer": true, "is_ada_lovelace": false, "is_hopper": false, "is_blackwell": false, "has_native_fp8": false}, "amd": {}, "versions": {"torch": [2, 11], "transformers": [5, 6, 2], "vllm": [0, 20, 1], "flash_attn_major": 2}, "paths": {"vllm_install_root": "/usr/local/lib/python3.12/dist-packages/vllm"}}
[INFO:genesis.apply_all] Genesis Unified Patch v7.0 — Ampere FP8 + TQ + MoE + Hybrid + bugfixes. Philosophy: WE FIX, WE DON'T BREAK.
[INFO:genesis.apply_all] [Genesis registry] 100 dispatcher entries — schema-clean, dependency graph consistent.
[INFO:genesis.apply_all] ==============================================================================
[INFO:genesis.apply_all] [Genesis GPU profile] detected: NVIDIA GeForce RTX 3080
[INFO:genesis.apply_all]   canonical: RTX 3080  cc: (8, 6)  SM: 68  L2: 5 MB  BW: 760 GB/s  regime: bandwidth
[INFO:genesis.apply_all] 
[INFO:genesis.apply_all] Per-patch recommendations:
[INFO:genesis.apply_all] ------------------------------------------------------------------------------
[INFO:genesis.apply_all]   [OFF] P40                TQ k8v4 GQA grouping kernel (vllm#40792)
[INFO:genesis.apply_all]           gain: +5-15% (mixed regime), +15-30% (compute regime)
[INFO:genesis.apply_all]           (correctly skipped on this regime)
[INFO:genesis.apply_all]           why: Author measured +27% on H100. Empirically NS on A5000 (p=0.28). Cache-locality benefit needs L2 >= 24 MB.
[INFO:genesis.apply_all]   [ON ] P67                Multi-query verify kernel for spec-decode K+1
[INFO:genesis.apply_all]           gain: +25-35%
[INFO:genesis.apply_all]           why: +32% TPS on 35B-A3B-FP8 spec-decode K=3 verify (Genesis internal benchmark, all GPU classes tested).
[INFO:genesis.apply_all]   [REC] P82                SGLang-style acceptance threshold OR-clause
[INFO:genesis.apply_all]           gain: +8-12%
[INFO:genesis.apply_all]           → recommend: export GENESIS_ENABLE_P82=1
[INFO:genesis.apply_all]           why: Cross-rig confirmed: +12% on A5000 FP8, +10.5% on 3090 INT4.
[INFO:genesis.apply_all]   [ON ] PN8                MTP/draft online-quant propagation (vllm#40849)
[INFO:genesis.apply_all]           gain: 0% TPS, but ~1-2 GiB total VRAM headroom
[INFO:genesis.apply_all]           why: Verified ~1 GiB VRAM saved per GPU on 35B-A3B-FP8 + MTP K=3. Use freed VRAM for higher gpu-mem-util or longer ctx.
[INFO:genesis.apply_all]   [ON ] P83+P84+P85        Prefix-cache cake-and-eat patches (vllm#38182)
[INFO:genesis.apply_all]           gain: (currently negative)
[INFO:genesis.apply_all]           ⚠️  enabled but NOT recommended on this GPU (may regress or no-op)
[INFO:genesis.apply_all]           why: Tested 4-arm A/B 2026-04-29: -29% TPS regression even with full stack including root-cause P84. Possible vllm cache machinery fixed-overhead per-step. WAIT for v0.20.2 pin bump.
[INFO:genesis.apply_all] ==============================================================================
[INFO:genesis.wiring.p8_kv_hybrid_reporting] [P8] kv_cache_utils.py already has Genesis P8 marker — idempotent skip (this is the expected restart path; patch persists across container restart since file edits are on the bind-mount layer)
[INFO:genesis.apply_all] [Genesis] applied: P8 KV hybrid reporting (per-token capacity) — kv_cache_utils=applied(idempotent), scheduler=applied(idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P3 TurboQuant BF16->FP8 cast (Ampere fix) — already applied this image layer (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P6 TurboQuant-aware attention page size — already applied this image layer (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P15 Qwen3 None/null tool arg parser — already applied this image layer (idempotent)
[INFO:genesis.wiring.text_patch] [P12 Qwen3 <tool_call> implicit reasoning end] upstream marker '_tool_call_token_id' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] skipped: P12 Qwen3 <tool_call> implicit reasoning end — upstream_merged
[INFO:genesis.apply_all] [Genesis] applied: P27 Qwen3 BEFORE-THINK fallback — already applied this image layer (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P34 Mamba zero-collapse deadlock guard — already applied this image layer (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P29 tool parser IndexError guard — upstream already contains bounded-index guards (no-op)
[INFO:genesis.marlin_fp32_reduce] [Genesis P23] Marlin FP32_REDUCE: disabled=True on SM=(8, 6) (auto-from-platform)
[INFO:genesis.apply_all] [Genesis] applied: P23 Marlin FP32_REDUCE env override — decision: fp32_reduce disabled=True (requires upstream wire into Marlin launcher to take effect)
[INFO:genesis.apply_all] [Genesis] applied: P4 TurboQuant hybrid model support — already applied this image layer (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P5 KV cache page size unification — v2 already applied this image layer (idempotent)
[INFO:genesis.apply_all] [Genesis] skipped: P5b KV page-size pad-smaller-to-max (env-opt-in) — opt-in: set GENESIS_ENABLE_P5B=1 to enable pad-smaller-to-max KV page-size strategy (saves ~34% per-block VRAM on Qwen3.6 hybrid vs P5 v1 LCM-pad-up; BLAST-RADIUS is KV allocator → benchmark on VM 100 before enabling in prod)
INFO 05-03 10:31:48 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
INFO 05-03 10:31:48 [nixl_utils.py:32] NIXL is available
[INFO:genesis.wiring.p31_router_softmax] [Genesis P31] wrapped grouped_topk_router.grouped_topk (fp32 upcast active for grouped-MoE models)
[INFO:genesis.apply_all] [Genesis] applied: P31 MoE router fp32 softmax — grouped_topk wrapped (effective in this process)
[INFO:genesis.silu_and_mul_customop] [PN25] registered torch op genesis::silu_and_mul_pooled via direct_register_custom_op (vLLM canonical, v7.66 — fork-safe + Inductor-opaque)
[INFO:genesis.wiring.p22_tq_prealloc] [P22] PR #40655 merged (bhoomit moved init out) — auto-skip — patch retired in favor of upstream
[INFO:genesis.apply_all] [Genesis] skipped: P22 TurboQuant shared dequant prealloc — PR #40655 merged (bhoomit moved init out) — auto-skip
[INFO:genesis.wiring.text_patch] [P26 TQ prefill output prealloc] upstream marker 'if not hasattr(self, "_cu_2")' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] skipped: P26 TurboQuant prefill output prealloc — upstream_merged
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P61b — Qwen3 streaming partial-tag overlap guard | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P61b Qwen3 streaming partial-tag overlap guard — already applied (idempotent)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P62 — Structured-output spec-decode reasoning-end timing fix | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P62 structured-output spec-decode timing fix — P62 applied: 0 files modified, 2 idempotent. Reasoning-aware grammar acceptance + spec-token validation now active. Should reduce residual broken tool-call rate when </think> arrives in spec batch.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P61 — Qwen3 multi-tool first-occurrence (DEPRECATED — superseded by P12 v2) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P61 Qwen3 multi-tool first-occurrence — already applied (idempotent)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P60b — GDN+ngram Triton kernel offset (Phase 2) | opt-in env (config: neutral)
[INFO:genesis.wiring.p60b_gdn_ngram_triton_kernel] [Genesis P60b] Triton cache: no cache dir cleared (may be empty already or permission denied)
[INFO:genesis.apply_all] [Genesis] applied: P60b GDN+ngram Triton kernel offset — P60b Phase 2 applied: 0 files modified, 2 idempotent. Triton kernel offset active. no cache dir cleared (may be empty already or permission denied) First spec-decode call will trigger kernel recompile (~5-10s).
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P60 — GDN+ngram state recovery (Phase 1: SSM pre-copy) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P60 GDN+ngram state recovery — P60 Phase 1 applied: 0 files modified, 3 idempotent. SSM state pre-copy active for GDN+ngram spec decode. Conv state Triton kernel fix (Phase 2) NOT included — if reproducer still shows broken output, Phase 2 is the next step.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P63 — MTP/Eagle drafter GDN state recovery (deprecated — wrong layer) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P63_MTP_GDN_STATE_RECOVERY=1 only for diagnost
[INFO:genesis.apply_all] [Genesis] skipped: P63 MTP/Eagle drafter GDN state recovery — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P63_MTP_GDN_STATE_RECOVERY=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P64 — qwen3coder MTP streaming early-return fix | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P64 qwen3coder MTP streaming early-return fix — P64 applied: 0 files modified, 2 idempotent. qwen3coder streaming parser no longer drops parameters when MTP bundles last param + </function> in same delta. Safety net widened to fire on finish_reason alone.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P65 — TurboQuant spec-decode cudagraph downgrade | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P65 TurboQuant spec-decode cudagraph downgrade — P65 applied: TurboQuant CG support downgraded to UNIFORM_SINGLE_TOKEN_DECODE. Spec-verify K+1 batches now run eager (correct per-request continuation), 1-token decode batches still use cudagraph. Workaround for noonghunna #40880 MTP+TurboQuant bug.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P66 — cudagraph_capture_sizes spec-decode divisibility filter | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P66 cudagraph_capture_sizes spec-decode divisibility filter — P66 applied: cudagraph_capture_sizes will be filtered to sizes divisible by uniform_decode_query_len when spec-decode is active. Boot 2-4x faster, less peak GPU memory, no mixed-q_len capture risks.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P68 — Auto force tool_choice=required for long-context tool calls | opt-in env (config: neutral)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P69 — Long-context tool-format reminder injection | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P68/P69 long-context tool-call adherence — Hook injected into create_chat_completion. Active mitigations: P68 (auto force tool_choice=required for long ctx); P69 (append tool-format reminder to last user msg). Threshold: GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS env (default 50000, raised from 8000 in v7.65 per Issue #9).
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P70 — Auto-strict-ngram (force prompt_lookup_min>=8) | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P70 Auto-strict-ngram (force prompt_lookup_min>=8) — opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to engage
[INFO:genesis.wiring.p67_tq_multi_query_kernel] [Genesis P67 H2] baked env at module load: MAX_PRIOR_LEN=4096 DEBUG_COMPARE=False (no per-dispatch env reads)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P67 — TurboQuant multi-query kernel for spec-decode K+1 | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P67 TurboQuant multi-query kernel for spec-decode K+1 — P67 hook injected at top of _prefill_attention. Multi-query continuation prefill batches (spec-verify K+1 with prior cached KV) will route to Genesis Triton kernel when GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1. Diag: {'env_enabled': True, 'version': 'v7.39_aggressive_tune', 'sm': '8.6', 'fp8_mode': 'e4b15', 'autoconfig': {'BLOCK_KV': 32, 'num_warps': 8, 'num_stages': 3}, 'kernel_built': True}
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P71 — Block-verify rejection sampler (Sun 2024 ICLR) | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P71 Block-verify rejection sampler (vllm#40819 + gemini bug-fixes) — opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P78 — TurboQuant .tolist() capture-guard (adapted from noonghunna) | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P78 TurboQuant .tolist() capture-guard (adapted from noonghunna) — opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P77 — Adaptive ngram K controller (EMA + hysteresis + auto-disable) | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P77 Adaptive ngram K controller (EMA + hysteresis + auto-disable) — opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P79b — Async × spec-decode proposer-sync backport (vllm#40610) | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P79b Async × spec-decode proposer-sync backport (vllm#40610) — opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P79c — Stale spec_token_ids cleanup for unscheduled requests (vllm#37629) | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P79c Stale spec_token_ids cleanup for unscheduled requests (vllm#37629) — opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P81 — fp8 block-scaled MM low-M decode tuning (vllm#40925) | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P81 fp8 block-scaled MM low-M decode tuning (vllm#40925) — opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P82 — SGLang threshold_single OR-clause acceptance (BIASED — opt-in research) | opt-in only — set GENESIS_ENABLE_P82=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P82 SGLang threshold_single OR-clause acceptance (BIASED — opt-in research) — opt-in only — set GENESIS_ENABLE_P82=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P83 — MTP keep-last-cached-block (vllm#38182 downstream symptom — P84 is real fix) | opt-in env (config: neutral)
[INFO:genesis.wiring.p83_mtp_keep_last_cached_block] [P83] marker present — skip (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P83 MTP keep-last-cached-block (vllm#38182 mitigation) — idempotent (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P84 — hash_block_size override (vllm#38182 actual root cause) | opt-in only — set GENESIS_ENABLE_P84=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P84 hash_block_size override (vllm#38182 ACTUAL root cause) — opt-in only — set GENESIS_ENABLE_P84=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P100 — FlashInfer FULL CUDA graph for spec-decode (vllm#41127) | opt-in env (config: neutral)
[INFO:genesis.wiring.p100_flashinfer_full_cg_specdec] [P100] marker present — skip (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P100 FlashInfer FULL CUDA graph for spec-decode (vllm#41127) — idempotent (marker present)
[INFO:genesis.wiring.p103_fla_cliff2_chunked] [Genesis P103 self-install] wrapper installed in chunk.py at module-import time (survives `exec vllm serve` + worker spawn)
[INFO:genesis.apply_all] [Genesis] applied: P103 FLA Cliff 2 chunked fwd_h+fwd_o orchestrator — text-patch=idempotent (chunk.py self-install hook appended (survives `exec vllm serve` + worker spawn)); setattr already idempotent (chunk.py self-install fired or prior apply still active in this process)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P101 — TQ continuation 64-token slicing (vllm#41123 SELECTIVE) | opt-in env (config: neutral)
[INFO:genesis.wiring.p101_tq_continuation_slicing] [P101] marker present — skip (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P101 TQ continuation 64-token slicing (vllm#41123 selective) — idempotent (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P99 — WorkspaceManager.get_simultaneous memoization (perf hotfix) | opt-in env (config: neutral)
[INFO:genesis.wiring.p99_workspace_manager_memoize] [P99] marker present — skip (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P99 WorkspaceManager memoize get_simultaneous (perf hotfix) — idempotent (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P98 — TQ WorkspaceManager revert (vllm#40941 perf hotfix) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P98 turboquant_attn.py — revert WorkspaceManager (perf hotfix)] upstream marker 'UNIFORM_SINGLE_TOKEN_DECODE' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] applied: P98 TQ WorkspaceManager revert (vllm#40941 perf hotfix) — P98 v7.62.14 applied: turboquant_attn.py _decode_attention + continuation prefill dequant now use per-layer cached buffers (OLD pattern, pre-vllm#40941). Removes Python indirection from decode hot path. Expected: +15-25% TPS recovery on Ampere small-batch single-stream workload.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P94 — Spec-decode prepare_next_token_ids_padded zero-alloc (vllm#41043) | opt-in env (config: neutral)
[INFO:genesis.wiring.p94_spec_decode_zero_alloc] [P94] marker present — skip (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P94 Spec-decode prepare_next_token_ids_padded zero-alloc (vllm#41043 backport) — idempotent (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P95 — Marlin TP cudagraph cap on Ampere (vllm#40385) | opt-in only — set GENESIS_ENABLE_P95=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P95 Marlin TP cudagraph cap on Ampere (vllm#40385 backport) — opt-in only — set GENESIS_ENABLE_P95=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P91 — AutoRound row-parallel group cdiv + start-idx fix (vllm#39460) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P91 AutoRound row-parallel group cdiv + start-idx fix (vllm#39460 backport) — idempotent (both markers present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P87 — Marlin W4A16/W8A16 sub-tile output dim pad-on-load (vllm#40361) | opt-in env (config: neutral)
[ERROR:genesis.apply_all] [Genesis] FAILED: P87 Marlin sub-tile output dim pad-on-load (vllm#40361 backport) — P87 marlin.py — sub-tile output dim pad-on-load (vllm#40361): write_error ([Errno 30] Read-only file system: '/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/mixed_precision/marlin.py')
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN8 — MTP/draft online-quant propagation (vllm#40849) | opt-in env (config: neutral)
[INFO:genesis.wiring.pN8_mtp_draft_online_quant_propagation] [PN8] marker present — skip (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: PN8 MTP/draft online-quant propagation (vllm#40849 backport) — idempotent (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN9 — Independent drafter attention backend (vllm#39930) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] skipped: PN9 independent drafter attention backend (vllm#39930 backport) — upstream drift: 'spec_cfg.attention_backend' present in llm_base_proposer.py — PR #39930 (or equivalent) appears merged; PN9 self-retires (use --speculative-config.attention_backend instead)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN11 — GDN a/b contiguity in fix_query_key_value_ordering (vllm#41142) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] skipped: PN11 GDN a/b contiguity (vllm#41142 backport) — PN11 model_executor/layers/mamba/gdn_linear_attn.py — force a/b contiguity in fix_query_key_value_ordering (vllm#41142): already applied (marker present)
[INFO:genesis.apply_all] [Genesis] skipped: P67c sparse-V integration into P67 split-M kernel — opt-in: set GENESIS_ENABLE_P67_SPARSE_V=1 to engage per-q_t sparse-V skip in P67 split-M kernel
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN34 — WorkspaceManager runtime lock relaxation (PN33 companion for runtime decode) | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN34 workspace lock runtime relaxation — opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN33 — Spec-decode warmup K-aware sizing (vllm#37521 extended to MTP/ngram) | config_detect: neutral:
[INFO:genesis.wiring.pN33_spec_decode_warmup_k] [PN33] marker present — skip (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: PN33 spec-decode warmup K-aware (vllm#37521 extended) — idempotent (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN32 — GDN _forward_core chunked-prefill v2 (Cliff 2 fix for single-24GB-GPU OOM) | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN32 GDN chunked-prefill (Cliff 2 fix) — opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN31 — FA varlen persistent out buffer (issue #15, sister to P38) | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_OUT=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN31 FA varlen persistent out (issue #15) — opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_OUT=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN30 — DS conv state layout + spec-decode AL>1 fix (issue #17) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] skipped: PN30 DS conv state + spec-decode AL>1 (issue #17) — PN30 DS layout + spec-decode AL>1: already applied (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN29 — GDN chunk_o scale-fold (vllm#41446 pattern (c)) | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN29 GDN chunk_o scale-fold (vllm#41446 pattern c) — opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN12 — FFN intermediate scratch pool — Cliff 1 fix on TQ3 path | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] skipped: PN12 FFN intermediate scratch pool (Cliff 1 fix) — PN12 model_executor/layers/activation.py — SiluAndMul forward_cuda FFN intermediate pool (Cliff 1 fix on TQ3): already applied (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN28 — merge_attn_states NaN guard (vllm#39148 backport) | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN28 merge_attn_states NaN guard (vllm#39148 backport) — opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P15B — FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] skipped: P15B FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix) — P15B turboquant_attn.py — _flash_attn_varlen max_seqlen_k clamp (Issue #15 fix): already applied (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P38B — P38 compile-safe in-source hook (Issue #14 fix) | opt-in env (config: neutral)
[INFO:genesis.wiring.p38b_compile_safe_hook] [Genesis P38b] installed _genesis_p38_dispatch on TurboQuantAttentionImpl — compile-safe path active
[INFO:genesis.apply_all] [Genesis] applied: P38B P38 compile-safe in-source hook (Issue #14 fix) — P38b applied: text-patch skipped (in-source hook already present); dispatcher installed on TurboQuantAttentionImpl. _continuation_prefill now survives aot_compile_fullgraph capture. Fixes Issue #14 (noonghunna).
[INFO:genesis.kernels.tq_decode_sparse_v] [Genesis PN26 sparse_v] kernel built and cached
[INFO:genesis.wiring.pn26_sparse_v_kernel] [Genesis PN26 sparse_v] wrapped vllm.v1.attention.ops.triton_turboquant_decode.triton_turboquant_decode_attention with sparse-V dispatcher (also rebound on consumers: ['vllm.v1.attention.backends.turboquant_attn'])
[INFO:genesis.apply_all] [Genesis] applied: PN26b sparse-V tile-skip Genesis kernel (BLASST lambda=a/L for SM86) — PN26 sparse-V kernel dispatcher installed. Routes to Genesis fork when seq_len >= min_ctx (default 8192). Threshold: BLASST λ=a/L if GENESIS_PN26_SPARSE_V_SCALE_FACTOR>0, else fixed GENESIS_PN26_SPARSE_V_THRESHOLD (default 0.001). First sparse-V kernel deployed for SM86 (Ampere consumer) — no upstream NVIDIA reference exists yet.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN27 — Revert MoERunnerInterface PluggableLayer (vllm#41440) | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN27 revert MoERunnerInterface PluggableLayer (vllm#41440 backport) — opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN26 — TQ unified perf pack (centroids prebake + sparse V scaffold) | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN26 TQ unified perf pack (centroids prebake + sparse V scaffold) — opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN25 — SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile path) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] skipped: PN25 SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile-path) — PN25 model_executor/layers/activation.py — SiluAndMul forward_native opaque-op pool (Inductor-safe Cliff 1 mech B): already applied (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN13 — CUDAGraphWrapper gc.collect/empty_cache lambda arity (vllm#41235) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] skipped: PN13 CUDAGraphWrapper lambda arity (vllm#41235 backport) — PN13 compilation/cuda_graph.py — CUDAGraphWrapper gc.collect/empty_cache lambda arity fix (vllm#41235): already applied (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN14 — TQ decode IOOB safe_page_idx clamp (vllm#40074) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] skipped: PN14 TQ decode IOOB safe_page_idx clamp (vllm#40074 backport) — PN14 triton_turboquant_decode.py — TQ decode IOOB safe_page_idx clamp (vllm#40074): already applied (marker present)
[INFO:genesis.apply_all] [Genesis] skipped: PN19 Scoped max_split_size_mb during model load (vllm#41268) — PN19 scoped max_split_size_mb: already applied (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN23 — DFlash combine_hidden_states dtype cast (vllm#40334) | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN23 DFlash combine_hidden_states dtype cast (vllm#40334 backport) — opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN21 — DFlash SWA support partial backport (vllm#40898) | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN21 DFlash SWA support partial backport (vllm#40898 backport) — opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN22 — Local argmax for TP draft (vllm#39419 backport) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: PN22 Local argmax for TP draft (vllm#39419 backport) — PN22 applied: get_top_tokens() added to qwen3.py + qwen3_dflash.py (vllm#39419 backport). Enables vocab-parallel argmax in spec-decode draft path; +9-30% TPS on TP>=2 per PR author.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN24 — DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN24 DFlash aux layer +1 indexing fix (vllm#40727 backport) — opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN17 FA2 softmax_lse runtime clamp (Issue #11 Cliff 1 mechanism A) — PN17 FA2 softmax_lse runtime clamp: already applied (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN16 — Lazy-reasoner request hook (per-request enable_thinking) | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN16 Lazy-reasoner request hook (per-request enable_thinking) — opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P86 — ngram batch_propose O(N*K) → O(N+K) direct-fill (vllm#40876) | opt-in only — set GENESIS_ENABLE_P86=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P86 ngram batch_propose O(N+K) direct-fill (vllm#40876 backport) — opt-in only — set GENESIS_ENABLE_P86=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P85 — Hybrid fine-shadow prefix cache (vllm#38182 followup, MambaManager fix) | opt-in env (config: neutral)
[INFO:genesis.wiring.p85_hybrid_fine_shadow_prefix_cache] [P85] marker present — skip (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P85 Hybrid fine-shadow prefix cache (MambaManager fix for vllm#38182 followup) — idempotent (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P75 — Auto-enable Suffix Decoding (Arctic Inference, vllm#25784) | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P75 Auto-enable Suffix Decoding (vllm#25784 Arctic Inference) — opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P74 — Auto chunk-clamp via long_prefill_token_threshold (P72 companion) | opt-in env (config: neutral)
[INFO:genesis.wiring.p74_chunk_clamp] [P74] marker present — skip (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P74 Auto chunk-clamp via long_prefill_token_threshold (P72 companion) — idempotent (marker present)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P72 — profile_run M cap (unblocks --max-num-batched-tokens>4096 on MoE) | opt-in env (config: neutral)
[INFO:genesis.wiring.p72_profile_run_cap] [P72] marker present — skip (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P72 profile_run M cap (unblocks --max-num-batched-tokens>4096 on MoE) — idempotent (marker present)
[INFO:genesis.wiring.p67b_spec_verify_routing] [Genesis P67b B2] baked env at module load: USE_UPSTREAM=True NUM_KV_SPLITS=(default = self.max_num_kv_splits) (no per-dispatch env reads)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P67b — TurboQuant spec-verify forward() routing (FULL CG enable) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P67b TurboQuant spec-verify forward() routing — P67b forward() spec-verify routing injected. K+1 batches now bypass _prefill_attention entirely (cudagraph-safe).
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P59 — Qwen3 reasoning embedded tool_call recovery | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P59 Qwen3 reasoning embedded tool_call recovery — opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P58 — Async-scheduler -1 placeholder fix | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] applied: P58 async-scheduler -1 placeholder fix — P58 backport applied: 0 files modified, 3 already-applied. Async scheduler -1 placeholder leakage fixed; spec-decode + cudagraph workloads should no longer loop or IMA. Validate with our regression smoke suite before serving traffic.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P57 — TQ spec-decode capture-safe buffers (deprecated — research artifact) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P57_SPEC_DECODE_CAPTURE_SAFE=1 only for diagno
[INFO:genesis.apply_all] [Genesis] skipped: P57 TQ spec-decode capture-safe buffers — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P57_SPEC_DECODE_CAPTURE_SAFE=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P56 — TQ spec-decode safe-path guard (deprecated — superseded by P65) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P56_SPEC_DECODE_GUARD=1 only for diagnostics
[INFO:genesis.apply_all] [Genesis] skipped: P56 TQ spec-decode safe-path guard — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P56_SPEC_DECODE_GUARD=1 only for diagnostics
[INFO:genesis.apply_all] [Genesis] applied: P44 TQ mixed-batch attn_out pool — already patched this image layer (idempotent)
[INFO:genesis.apply_all] [Genesis] applied: P46 GDN gating buffer pool — already patched this image layer (idempotent)
[INFO:genesis.apply_all] [Genesis] skipped: P7b GDN dual-stream via torch.library.custom_op (opt-in) — opt-in: set GENESIS_ENABLE_P7B=1 to enable GDN dual-stream via torch.library.custom_op (graph-safe alternative to P7; validates numeric equiv + compile-cache freshness before prod)
[INFO:genesis.apply_all] [Genesis] skipped: P40 TurboQuant GQA-grouped decode stage1 (opt-in) — opt-in: set GENESIS_ENABLE_P40=1 to enable GQA-grouped decode stage1 (port of upstream PR #40792)
[INFO:genesis.wiring.p39a_fla_kkt] [Genesis P39b] vllm_config fetch failed (Current vLLM config is not set. This typically means get_current_vllm_config() was called outside of a set_current_vllm_config() context, or a CustomOp was instantiated at module import time or model forward time when config is not set. For tests that directly test custom ops/modules, use the 'default_vllm_config' pytest fixture from tests/conftest.py.); defaults used
[INFO:genesis.wiring.p39a_fla_kkt] [Genesis P39a] rebound vllm.model_executor.layers.fla.ops.chunk_scaled_dot_kkt.chunk_scaled_dot_kkt_fwd (+1 caller mods: ['vllm.model_executor.layers.fla.ops.chunk'])
[INFO:genesis.apply_all] [Genesis] applied: P39a FLA chunk_scaled_dot_kkt persistent A pool — module-level fn replaced (1 caller module(s) also rebound — pool shared across GDN layers)
[INFO:genesis.wiring.p38_tq_continuation_memory] [Genesis P38] rebound TurboQuantAttentionImpl._continuation_prefill (persistent K_full/V_full buffers replace torch.cat peak)
[INFO:genesis.apply_all] [Genesis] applied: P38 TQ _continuation_prefill persistent workspace — class method replaced (persistent K_full/V_full workspace, no .contiguous()/torch.cat transient peaks)
[INFO:genesis.apply_all] [Genesis] skipped: P37 MoE intermediate cache pool (opt-in) — opt-in only — set GENESIS_ENABLE_P37=1 to engage. Manager API is registered and usable independently of this text-patch.
[INFO:genesis.apply_all] [Genesis] skipped: P36 TurboQuant shared decode buffers — redundant: upstream PR #40798 (workspace manager) active — skip
[INFO:genesis.apply_all] [Genesis] applied: P32/P33 TurboQuant cu_2 + synth_seq_lens preallocs — cu_2 + synth_seq_lens preallocs registered (invoked from ensure_turboquant_buffers, fires during profile_run)
[INFO:genesis.prealloc_budget] [Genesis P73] token budget resolved → 4128 (via GENESIS_PREALLOC_TOKEN_BUDGET)
[INFO:genesis.wiring.p28_gdn_core_attn] [Genesis P28] wrapped GatedDeltaNetAttention.__init__ to attach _genesis_gdn_core_attn_buf on each instance
[INFO:genesis.apply_all] [Genesis] applied: P28 GDN core_attn_out prealloc — already applied (idempotent)
[INFO:genesis.gdn_dual_stream] [GDN dual-stream] CUDA aux stream initialized
[INFO:genesis.apply_all] [Genesis P7] dispatcher ready (parallel path)
[INFO:genesis.apply_all] [Genesis] skipped: P7 GDN dual-stream in_proj parallelism — deferred — incompatible with torch.compile fullgraph (CUDA streams not SymPy-graphable); custom op implementation required. Re-enable with GENESIS_ENABLE_P7=1 + --enforce-eager.
[INFO:genesis.apply_all] [Genesis P17/P18] Marlin tuning ready: SM=(8, 6) bsm=8 num_warps=4 num_stages=3
[INFO:genesis.apply_all] [Genesis] applied: P17/P18 Marlin MoE per-SM tuning — SM=(8, 6) bsm=8
[INFO:genesis.apply_all] [Genesis] applied: P24 fused_moe num_warps/num_stages overlay — already applied this image layer (idempotent)
[INFO:genesis.wiring.p14_block_table] [Genesis P14] rebound BlockTable.append_row + move_row (tail-zero-fill active for concurrent-request safety)
[INFO:genesis.apply_all] [Genesis] applied: P14 block_table tail zero-fill — BlockTable.append_row + move_row wrapped (effective in this process)
[INFO:genesis.tq_decode_tune] [Genesis P18b] TQ decode stage1 using upstream defaults (BLOCK_KV=4 num_warps=1 num_stages=1)
[INFO:genesis.apply_all] [Genesis] applied: P18b TurboQuant decode stage1 tune — no env override — using upstream defaults (4/1/1)
[INFO:genesis.apply_all] [Genesis P20] TQ _continuation_prefill FP16 helpers ready for TurboQuantAttentionImpl hook
[INFO:genesis.apply_all] [Genesis] applied: P20 TurboQuant continuation-prefill FP16 rotate — fp16-rotation helper ready for _continuation_prefill hook
[INFO:genesis.fp8_dispatcher] [Genesis FP8 dispatcher] SM (8, 6) → Marlin fallback required (Triton block FP8 unsupported on Ampere)
[INFO:genesis.apply_all] [Genesis] applied: P1/P2 FP8 kernel dispatcher — SM=(8, 6) → Marlin fallback path selected
[INFO:genesis.apply_all] Genesis Results: 51 applied, 47 skipped, 1 failed, 1 ⚠️ partial-apply warning(s)
[WARNING:genesis.apply_all] [Genesis] 1 partial-apply warning(s) — patch(es) failed to match expected source pattern. Review below to confirm anchor drift vs upstream change vs config issue:
[WARNING:genesis.apply_all] [Genesis] ⚠️  PN9 independent drafter attention backend (vllm#39930 backport) — upstream drift: 'spec_cfg.attention_backend' present in llm_base_proposer.py — PR #39930 (or equivalent) appears merged; PN9 self-retires (use --speculative-config.attention_backend instead)
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] apply matrix:
Patch | Status | Title                                         | Reason                                                       | Credit                        
------+--------+-----------------------------------------------+--------------------------------------------------------------+-------------------------------
P61b  | APPLY  | Qwen3 streaming partial-tag overlap guard     | opt-in env (config: neutral)                                 | ExtReMLapin (vllm#40783)      
P62   | APPLY  | Structured-output spec-decode reasoning-end t | opt-in env (config: neutral)                                 | sfbemerk (vllm#36138), ciciror
P61   | APPLY  | Qwen3 multi-tool first-occurrence (DEPRECATED | opt-in env (config: neutral)                                 | ExtReMLapin (vllm#40783) — P61
P60b  | APPLY  | GDN+ngram Triton kernel offset (Phase 2)      | opt-in env (config: neutral)                                 | tdoublep (vllm#40738)         
P60   | APPLY  | GDN+ngram state recovery (Phase 1: SSM pre-co | opt-in env (config: neutral)                                 | tdoublep (vllm#40738), bhaktat
P63   | SKIP   | MTP/Eagle drafter GDN state recovery (depreca | opt-in only AND empirically deprecated — keeping skip; set G | Genesis-original (hypothesis d
P64   | APPLY  | qwen3coder MTP streaming early-return fix     | opt-in env (config: neutral)                                 | kotori-yan (vllm#39598)       
P65   | APPLY  | TurboQuant spec-decode cudagraph downgrade    | opt-in env (config: neutral)                                 | Genesis-original (root cause f
P66   | APPLY  | cudagraph_capture_sizes spec-decode divisibil | opt-in env (config: neutral)                                 | Genesis-original (mirrors fhl2
P68   | APPLY  | Auto force tool_choice=required for long-cont | opt-in env (config: neutral)                                 | Genesis-original (long-ctx too
P69   | APPLY  | Long-context tool-format reminder injection   | opt-in env (config: neutral)                                 | Genesis-original (long-ctx too
P70   | SKIP   | Auto-strict-ngram (force prompt_lookup_min>=8 | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to  | Genesis-original (vllm#40875 e
P67   | APPLY  | TurboQuant multi-query kernel for spec-decode | opt-in env (config: neutral)                                 | Genesis-original (proper fix f
P71   | SKIP   | Block-verify rejection sampler (Sun 2024 ICLR | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engag | Backport of vllm#40819 (Z. Gol
P78   | SKIP   | TurboQuant .tolist() capture-guard (adapted f | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1  | Adapted from noonghunna's patc
P77   | SKIP   | Adaptive ngram K controller (EMA + hysteresis | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to e | Genesis-original (port of SGLa
P79b  | SKIP   | Async × spec-decode proposer-sync backport (v | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1  | Backport of vllm#40610 (OPEN d
P79c  | SKIP   | Stale spec_token_ids cleanup for unscheduled  | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEAN | Backport of vllm#37629 (OPEN, 
P81   | SKIP   | fp8 block-scaled MM low-M decode tuning (vllm | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8 | Backport of vllm#40925 (tonyli
P82   | SKIP   | SGLang threshold_single OR-clause acceptance  | opt-in only — set GENESIS_ENABLE_P82=1 to engage             | SGLang team (sgl-project/sglan
P83   | APPLY  | MTP keep-last-cached-block (vllm#38182 downst | opt-in env (config: neutral)                                 | Root-cause analysis: vllm#3818
P84   | SKIP   | hash_block_size override (vllm#38182 actual r | opt-in only — set GENESIS_ENABLE_P84=1 to engage             | Genesis-original discovery 202
P100  | APPLY  | FlashInfer FULL CUDA graph for spec-decode (v | opt-in env (config: neutral)                                 | Backport of vllm#41127 (open 2
P101  | APPLY  | TQ continuation 64-token slicing (vllm#41123  | opt-in env (config: neutral)                                 | Selective backport of vllm#411
P99   | APPLY  | WorkspaceManager.get_simultaneous memoization | opt-in env (config: neutral)                                 | Per Sander 2026-04-28: 'if rev
P98   | APPLY  | TQ WorkspaceManager revert (vllm#40941 perf h | opt-in env (config: neutral)                                 | Reverts upstream PR #40941 (ME
P94   | APPLY  | Spec-decode prepare_next_token_ids_padded zer | opt-in env (config: neutral)                                 | Backport of vllm#41043 (wanglu
P95   | SKIP   | Marlin TP cudagraph cap on Ampere (vllm#40385 | opt-in only — set GENESIS_ENABLE_P95=1 to engage             | Backport of vllm#40385 (OPEN a
P91   | APPLY  | AutoRound row-parallel group cdiv + start-idx | opt-in env (config: neutral)                                 | Backport of non-MoE-specific p
P87   | APPLY  | Marlin W4A16/W8A16 sub-tile output dim pad-on | opt-in env (config: neutral)                                 | Backport of vllm#40361 (OPEN).
PN8   | APPLY  | MTP/draft online-quant propagation (vllm#4084 | opt-in env (config: neutral)                                 | Backport of vllm#40849 (bhoomi
PN9   | APPLY  | Independent drafter attention backend (vllm#3 | opt-in env (config: neutral)                                 | Backport of vllm#39930 (Matthe
PN11  | APPLY  | GDN a/b contiguity in fix_query_key_value_ord | opt-in env (config: neutral)                                 | Backport of vllm#41142 (Yeuvoi
PN34  | SKIP   | WorkspaceManager runtime lock relaxation (PN3 | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 | Companion to PN33 — same root 
PN33  | APPLY  | Spec-decode warmup K-aware sizing (vllm#37521 | config_detect: neutral:                                      | Backport of vllm-project/vllm#
PN32  | SKIP   | GDN _forward_core chunked-prefill v2 (Cliff 2 | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1  | Genesis-original v7.69 v2 (202
PN31  | SKIP   | FA varlen persistent out buffer (issue #15, s | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_O | Genesis-original sister patch 
PN30  | APPLY  | DS conv state layout + spec-decode AL>1 fix ( | opt-in env (config: neutral)                                 | Genesis-original fix for issue
PN29  | SKIP   | GDN chunk_o scale-fold (vllm#41446 pattern (c | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to en | Backport of vllm#41446 (zobinH
PN12  | APPLY  | FFN intermediate scratch pool — Cliff 1 fix o | opt-in env (config: neutral)                                 | Genesis-original 2026-04-29 — 
PN28  | SKIP   | merge_attn_states NaN guard (vllm#39148 backp | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 | Backport of vllm#39148 (jasonk
P15B  | APPLY  | FA varlen max_seqlen_k clamp on TQ path (Issu | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 fi
P38B  | APPLY  | P38 compile-safe in-source hook (Issue #14 fi | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 fi
PN27  | SKIP   | Revert MoERunnerInterface PluggableLayer (vll | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 | Backport of vllm#41440 (auto-g
PN26  | SKIP   | TQ unified perf pack (centroids prebake + spa | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage | Genesis-original 2026-05-01 un
PN25  | APPLY  | SiluAndMul.forward_native opaque-op pool (Cli | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 in
PN13  | APPLY  | CUDAGraphWrapper gc.collect/empty_cache lambd | opt-in env (config: neutral)                                 | Backport of vllm#41235 (roikor
PN14  | APPLY  | TQ decode IOOB safe_page_idx clamp (vllm#4007 | opt-in env (config: neutral)                                 | Backport of vllm#40074 (devara
PN23  | SKIP   | DFlash combine_hidden_states dtype cast (vllm | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to  | Backport of vllm#40334 (cipher
PN21  | SKIP   | DFlash SWA support partial backport (vllm#408 | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage | Partial backport of vllm#40898
PN22  | APPLY  | Local argmax for TP draft (vllm#39419 backpor | opt-in env (config: neutral)                                 | Backport of vllm#39419 (EanWan
PN24  | SKIP   | DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 | Backport of vllm#40727 (benchi
PN16  | SKIP   | Lazy-reasoner request hook (per-request enabl | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to eng | Genesis-original 2026-04-29. H
P86   | SKIP   | ngram batch_propose O(N*K) → O(N+K) direct-fi | opt-in only — set GENESIS_ENABLE_P86=1 to engage             | Backport of vllm#40876 (aarona
P85   | APPLY  | Hybrid fine-shadow prefix cache (vllm#38182 f | opt-in env (config: neutral)                                 | Genesis-original 2026-04-27 — 
P75   | SKIP   | Auto-enable Suffix Decoding (Arctic Inference | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to en | Backport-enabler of vllm#25784
P74   | APPLY  | Auto chunk-clamp via long_prefill_token_thres | opt-in env (config: neutral)                                 | Genesis-original (zero-VRAM-co
P72   | APPLY  | profile_run M cap (unblocks --max-num-batched | opt-in env (config: neutral)                                 | Genesis-original (Dynamo fake-
P67b  | APPLY  | TurboQuant spec-verify forward() routing (FUL | opt-in env (config: neutral)                                 | Genesis-original (FULL CG enab
P59   | SKIP   | Qwen3 reasoning embedded tool_call recovery   | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 t | ZenoAFfectionate (vllm#39055) 
P58   | APPLY  | Async-scheduler -1 placeholder fix            | opt-in env (config: neutral)                                 | z1ying (vllm#40768)           
P57   | SKIP   | TQ spec-decode capture-safe buffers (deprecat | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40831), gdn_attn.
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated —  | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40807, #40831)   
[ERROR:genesis.dispatcher] [Genesis Dispatcher v2] validator ERROR: P85 — missing required dependency: P85 requires 'P84' to also be APPLY (currently SKIP)
[ERROR:genesis.dispatcher] [Genesis Dispatcher v2] validator ERROR: P67 — conflict: P65 and P67 are both APPLY but declared mutually exclusive — pick one
[ERROR:genesis.dispatcher] [Genesis Dispatcher v2] validator ERROR: P67b — conflict: P65 and P67b are both APPLY but declared mutually exclusive — pick one
[INFO:genesis.apply_all] [Genesis compile-watchdog] apply_all elapsed: 4.6s
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] apply matrix:
Patch | Status | Title                                         | Reason                                                       | Credit                        
------+--------+-----------------------------------------------+--------------------------------------------------------------+-------------------------------
P61b  | APPLY  | Qwen3 streaming partial-tag overlap guard     | opt-in env (config: neutral)                                 | ExtReMLapin (vllm#40783)      
P62   | APPLY  | Structured-output spec-decode reasoning-end t | opt-in env (config: neutral)                                 | sfbemerk (vllm#36138), ciciror
P61   | APPLY  | Qwen3 multi-tool first-occurrence (DEPRECATED | opt-in env (config: neutral)                                 | ExtReMLapin (vllm#40783) — P61
P60b  | APPLY  | GDN+ngram Triton kernel offset (Phase 2)      | opt-in env (config: neutral)                                 | tdoublep (vllm#40738)         
P60   | APPLY  | GDN+ngram state recovery (Phase 1: SSM pre-co | opt-in env (config: neutral)                                 | tdoublep (vllm#40738), bhaktat
P63   | SKIP   | MTP/Eagle drafter GDN state recovery (depreca | opt-in only AND empirically deprecated — keeping skip; set G | Genesis-original (hypothesis d
P64   | APPLY  | qwen3coder MTP streaming early-return fix     | opt-in env (config: neutral)                                 | kotori-yan (vllm#39598)       
P65   | APPLY  | TurboQuant spec-decode cudagraph downgrade    | opt-in env (config: neutral)                                 | Genesis-original (root cause f
P66   | APPLY  | cudagraph_capture_sizes spec-decode divisibil | opt-in env (config: neutral)                                 | Genesis-original (mirrors fhl2
P68   | APPLY  | Auto force tool_choice=required for long-cont | opt-in env (config: neutral)                                 | Genesis-original (long-ctx too
P69   | APPLY  | Long-context tool-format reminder injection   | opt-in env (config: neutral)                                 | Genesis-original (long-ctx too
P70   | SKIP   | Auto-strict-ngram (force prompt_lookup_min>=8 | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to  | Genesis-original (vllm#40875 e
P67   | APPLY  | TurboQuant multi-query kernel for spec-decode | opt-in env (config: neutral)                                 | Genesis-original (proper fix f
P71   | SKIP   | Block-verify rejection sampler (Sun 2024 ICLR | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engag | Backport of vllm#40819 (Z. Gol
P78   | SKIP   | TurboQuant .tolist() capture-guard (adapted f | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1  | Adapted from noonghunna's patc
P77   | SKIP   | Adaptive ngram K controller (EMA + hysteresis | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to e | Genesis-original (port of SGLa
P79b  | SKIP   | Async × spec-decode proposer-sync backport (v | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1  | Backport of vllm#40610 (OPEN d
P79c  | SKIP   | Stale spec_token_ids cleanup for unscheduled  | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEAN | Backport of vllm#37629 (OPEN, 
P81   | SKIP   | fp8 block-scaled MM low-M decode tuning (vllm | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8 | Backport of vllm#40925 (tonyli
P82   | SKIP   | SGLang threshold_single OR-clause acceptance  | opt-in only — set GENESIS_ENABLE_P82=1 to engage             | SGLang team (sgl-project/sglan
P83   | APPLY  | MTP keep-last-cached-block (vllm#38182 downst | opt-in env (config: neutral)                                 | Root-cause analysis: vllm#3818
P84   | SKIP   | hash_block_size override (vllm#38182 actual r | opt-in only — set GENESIS_ENABLE_P84=1 to engage             | Genesis-original discovery 202
P100  | APPLY  | FlashInfer FULL CUDA graph for spec-decode (v | opt-in env (config: neutral)                                 | Backport of vllm#41127 (open 2
P101  | APPLY  | TQ continuation 64-token slicing (vllm#41123  | opt-in env (config: neutral)                                 | Selective backport of vllm#411
P99   | APPLY  | WorkspaceManager.get_simultaneous memoization | opt-in env (config: neutral)                                 | Per Sander 2026-04-28: 'if rev
P98   | APPLY  | TQ WorkspaceManager revert (vllm#40941 perf h | opt-in env (config: neutral)                                 | Reverts upstream PR #40941 (ME
P94   | APPLY  | Spec-decode prepare_next_token_ids_padded zer | opt-in env (config: neutral)                                 | Backport of vllm#41043 (wanglu
P95   | SKIP   | Marlin TP cudagraph cap on Ampere (vllm#40385 | opt-in only — set GENESIS_ENABLE_P95=1 to engage             | Backport of vllm#40385 (OPEN a
P91   | APPLY  | AutoRound row-parallel group cdiv + start-idx | opt-in env (config: neutral)                                 | Backport of non-MoE-specific p
P87   | APPLY  | Marlin W4A16/W8A16 sub-tile output dim pad-on | opt-in env (config: neutral)                                 | Backport of vllm#40361 (OPEN).
PN8   | APPLY  | MTP/draft online-quant propagation (vllm#4084 | opt-in env (config: neutral)                                 | Backport of vllm#40849 (bhoomi
PN9   | APPLY  | Independent drafter attention backend (vllm#3 | opt-in env (config: neutral)                                 | Backport of vllm#39930 (Matthe
PN11  | APPLY  | GDN a/b contiguity in fix_query_key_value_ord | opt-in env (config: neutral)                                 | Backport of vllm#41142 (Yeuvoi
PN34  | SKIP   | WorkspaceManager runtime lock relaxation (PN3 | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 | Companion to PN33 — same root 
PN33  | APPLY  | Spec-decode warmup K-aware sizing (vllm#37521 | config_detect: neutral:                                      | Backport of vllm-project/vllm#
PN32  | SKIP   | GDN _forward_core chunked-prefill v2 (Cliff 2 | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1  | Genesis-original v7.69 v2 (202
PN31  | SKIP   | FA varlen persistent out buffer (issue #15, s | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_O | Genesis-original sister patch 
PN30  | APPLY  | DS conv state layout + spec-decode AL>1 fix ( | opt-in env (config: neutral)                                 | Genesis-original fix for issue
PN29  | SKIP   | GDN chunk_o scale-fold (vllm#41446 pattern (c | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to en | Backport of vllm#41446 (zobinH
PN12  | APPLY  | FFN intermediate scratch pool — Cliff 1 fix o | opt-in env (config: neutral)                                 | Genesis-original 2026-04-29 — 
PN28  | SKIP   | merge_attn_states NaN guard (vllm#39148 backp | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 | Backport of vllm#39148 (jasonk
P15B  | APPLY  | FA varlen max_seqlen_k clamp on TQ path (Issu | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 fi
P38B  | APPLY  | P38 compile-safe in-source hook (Issue #14 fi | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 fi
PN27  | SKIP   | Revert MoERunnerInterface PluggableLayer (vll | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 | Backport of vllm#41440 (auto-g
PN26  | SKIP   | TQ unified perf pack (centroids prebake + spa | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage | Genesis-original 2026-05-01 un
PN25  | APPLY  | SiluAndMul.forward_native opaque-op pool (Cli | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 in
PN13  | APPLY  | CUDAGraphWrapper gc.collect/empty_cache lambd | opt-in env (config: neutral)                                 | Backport of vllm#41235 (roikor
PN14  | APPLY  | TQ decode IOOB safe_page_idx clamp (vllm#4007 | opt-in env (config: neutral)                                 | Backport of vllm#40074 (devara
PN23  | SKIP   | DFlash combine_hidden_states dtype cast (vllm | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to  | Backport of vllm#40334 (cipher
PN21  | SKIP   | DFlash SWA support partial backport (vllm#408 | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage | Partial backport of vllm#40898
PN22  | APPLY  | Local argmax for TP draft (vllm#39419 backpor | opt-in env (config: neutral)                                 | Backport of vllm#39419 (EanWan
PN24  | SKIP   | DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 | Backport of vllm#40727 (benchi
PN16  | SKIP   | Lazy-reasoner request hook (per-request enabl | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to eng | Genesis-original 2026-04-29. H
P86   | SKIP   | ngram batch_propose O(N*K) → O(N+K) direct-fi | opt-in only — set GENESIS_ENABLE_P86=1 to engage             | Backport of vllm#40876 (aarona
P85   | APPLY  | Hybrid fine-shadow prefix cache (vllm#38182 f | opt-in env (config: neutral)                                 | Genesis-original 2026-04-27 — 
P75   | SKIP   | Auto-enable Suffix Decoding (Arctic Inference | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to en | Backport-enabler of vllm#25784
P74   | APPLY  | Auto chunk-clamp via long_prefill_token_thres | opt-in env (config: neutral)                                 | Genesis-original (zero-VRAM-co
P72   | APPLY  | profile_run M cap (unblocks --max-num-batched | opt-in env (config: neutral)                                 | Genesis-original (Dynamo fake-
P67b  | APPLY  | TurboQuant spec-verify forward() routing (FUL | opt-in env (config: neutral)                                 | Genesis-original (FULL CG enab
P59   | SKIP   | Qwen3 reasoning embedded tool_call recovery   | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 t | ZenoAFfectionate (vllm#39055) 
P58   | APPLY  | Async-scheduler -1 placeholder fix            | opt-in env (config: neutral)                                 | z1ying (vllm#40768)           
P57   | SKIP   | TQ spec-decode capture-safe buffers (deprecat | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40831), gdn_attn.
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated —  | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40807, #40831)

noonghunna · 2026-05-03T10:47:01Z

noonghunna
May 3, 2026
Maintainer

@efschu — appreciate the immediate flag. Three things going on; need a bit more log to disambiguate.

What I see in your output

Three Genesis dispatcher v2 validator errors:

[ERROR] P85 — missing required dependency: P85 requires 'P84' to also be APPLY (currently SKIP)
[ERROR] P67 — conflict: P65 and P67 are both APPLY but declared mutually exclusive — pick one
[ERROR] P67b — conflict: P65 and P67b are both APPLY but declared mutually exclusive — pick one

These are real config conflicts on our side. Our dual-turbo.yml (and long-text.yml, long-vision.yml, bounded-thinking.yml) ships with all three of P65, P67, and P85 enabled (GENESIS_ENABLE_* = 1). In Genesis v7.69, dispatcher v2 escalated two rules to validator errors:

P67/P67b vs P65 — P65 was the cudagraph-downgrade workaround for the spec-decode cascade bug; P67 is the proper Triton kernel fix designed to replace it. v7.69 says you can't have both — pick the proper fix (P67).
P85 → P84 — P85 (hybrid fine-shadow prefix cache) requires P84 (hash_block_size override, the actual root-cause fix for vllm#38182). We enable P85 without P84.
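
For illustration, the two rule kinds behind those errors (a required dependency and a mutual exclusion) can be sketched in a few lines of Python. This is not the Genesis dispatcher's code: the `REQUIRES` and `CONFLICTS` tables and the `validate` function are hypothetical names, and the statuses are copied from the apply matrix in the log above.

```python
# Illustrative sketch only; rule tables and function names are assumptions,
# not the actual Genesis dispatcher implementation.
from typing import Dict, List, Tuple

# Hypothetical rule tables mirroring the two rule kinds in the validator errors.
REQUIRES: Dict[str, List[str]] = {"P85": ["P84"]}
CONFLICTS: List[Tuple[str, str]] = [("P65", "P67"), ("P65", "P67b")]


def validate(status: Dict[str, str]) -> List[str]:
    """Return a list of problems for a {patch: 'APPLY' | 'SKIP'} map."""
    problems: List[str] = []
    # Dependency rule: an applied patch needs each of its dependencies applied.
    for patch, deps in REQUIRES.items():
        if status.get(patch) != "APPLY":
            continue
        for dep in deps:
            if status.get(dep) != "APPLY":
                problems.append(f"{patch} requires {dep}")
    # Conflict rule: two mutually exclusive patches must not both be applied.
    for a, b in CONFLICTS:
        if status.get(a) == "APPLY" and status.get(b) == "APPLY":
            problems.append(f"{a} and {b} are mutually exclusive")
    return problems


# Statuses as shown in the posted apply matrix.
shipped = {"P65": "APPLY", "P67": "APPLY", "P67b": "APPLY",
           "P85": "APPLY", "P84": "SKIP"}
# Same map with P65 and P85 switched off.
trimmed = {**shipped, "P65": "SKIP", "P85": "SKIP"}

print(validate(shipped))
print(validate(trimmed))
```

The first call reports three problems (one dependency, two conflicts), and the second reports none once P65 and P85 are set to SKIP.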

What I don't know yet

Whether the validator errors are blocking boot on your rig, OR whether they're informational warnings and the actual failure happens downstream.

Reason I'm uncertain: I booted long-text.yml (same conflict set) twice on our test rig today and saw Genesis Results: 63 applied, 36 skipped, 0 failed with no boot blocker. The dispatcher v2 errors might fire and apply_all might continue past them. If so, your actual failure is something else further down the log.
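
One rough way to tell the two cases apart is to check whether a log contains both validator errors and a results summary. The sketch below assumes the summary line looks exactly like the one quoted above ("Genesis Results: N applied, N skipped, N failed"); the `summarize` helper is hypothetical and the sample error lines are abridged versions of the ones in the posted log.

```python
# Illustrative sketch; the line formats are inferred from the log excerpts in
# this thread and may not match every Genesis version.
import re
from typing import Dict, List, Optional, Tuple

VALIDATOR_ERROR = re.compile(r"validator ERROR: (\S+)")
RESULTS = re.compile(r"Genesis Results: (\d+) applied, (\d+) skipped, (\d+) failed")


def summarize(log_text: str) -> Dict[str, object]:
    """Collect the patch ids named in validator errors and the results summary."""
    errors: List[str] = VALIDATOR_ERROR.findall(log_text)
    match = RESULTS.search(log_text)
    results: Optional[Tuple[int, ...]] = (
        tuple(int(x) for x in match.groups()) if match else None
    )
    return {"validator_errors": errors, "results": results}


# Abridged sample: three validator errors followed by a results summary.
sample = "\n".join([
    "[ERROR:genesis.dispatcher] [Genesis Dispatcher v2] validator ERROR: P85 (abridged)",
    "[ERROR:genesis.dispatcher] [Genesis Dispatcher v2] validator ERROR: P67 (abridged)",
    "[ERROR:genesis.dispatcher] [Genesis Dispatcher v2] validator ERROR: P67b (abridged)",
    "Genesis Results: 63 applied, 36 skipped, 0 failed",
])

summary = summarize(sample)
print(summary)
```

If a log yields a non-empty error list and a results tuple with zero failures, the validator errors did not stop apply_all in that run. If the results entry is `None`, the log ended before the summary was printed.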

Could you share the boot log after the second apply-matrix print? Specifically the lines between:

[INFO:genesis.apply_all] [Genesis compile-watchdog] apply_all elapsed: 4.6s
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] apply matrix:
... (all the patch rows)
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated...)

...and whatever comes next (Application startup, EngineCore failed, ValueError, etc.). Easiest:

docker logs vllm-qwen36-27b-dual-turbo 2>&1 | tail -100

In the meantime — quick workaround

If you want to unblock locally and try booting again, edit models/qwen3.6-27b/vllm/compose/docker-compose.dual-turbo.yml and remove these two lines from the environment: block:

- GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1
- GENESIS_ENABLE_P85=1

That keeps P67 (proper fix, supersedes P65) and drops P85 (the hybrid prefix cache patch — opt-in optimization, not load-bearing for correctness). Should clear all three dispatcher v2 validator errors.
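
If you would rather script the edit than do it by hand, a minimal sketch follows. It assumes the flags appear as `- NAME=value` list entries under `environment:`; the `comment_out_env_flags` helper and the `EXAMPLE_OTHER_FLAG` entry are made up for illustration. It comments the two entries out instead of deleting them, so the change is easy to revert.

```python
# Illustrative sketch; operates on plain text and assumes list-style
# environment entries of the form "- NAME=value".
from typing import Iterable, List


def comment_out_env_flags(compose_text: str, flags: Iterable[str]) -> str:
    """Comment out `- NAME=value` entries whose NAME is in `flags`."""
    targets = set(flags)
    out: List[str] = []
    for line in compose_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("- "):
            name = stripped[2:].split("=", 1)[0].strip()
            if name in targets:
                # Preserve the original indentation so the YAML stays aligned.
                indent = line[: len(line) - len(line.lstrip())]
                out.append(f"{indent}# {stripped}")
                continue
        out.append(line)
    trailing = "\n" if compose_text.endswith("\n") else ""
    return "\n".join(out) + trailing


# Hypothetical compose fragment; only the two GENESIS_* names come from the thread.
sample = (
    "    environment:\n"
    "      - GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1\n"
    "      - EXAMPLE_OTHER_FLAG=1\n"
    "      - GENESIS_ENABLE_P85=1\n"
)

edited = comment_out_env_flags(
    sample,
    ["GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE", "GENESIS_ENABLE_P85"],
)
print(edited)
```

Run it against a copy of the compose file first and diff the result before replacing the original.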

If boot still fails after that edit, the validator errors weren't the cause — and your tail -100 log will tell us what was.

On our side

Independent of your reproduction — our shipped composes have outdated env-var bundles vs Genesis v7.69's dispatcher v2. Filing a fix to drop P65 + P85 from our composes (P67 is strictly better; P85 needs P84 which we don't enable). Will land regardless of whether the validators are the blocker on your specific case.

Thanks for catching this immediately — these env-var conflicts would have bitten every dual-turbo / long-* user once they pulled latest. Your bench data tonight unblocks a real fix on our side.

efschu · 2026-05-03T14:40:45Z

efschu
May 3, 2026

still crashing

docker logs vllm-qwen36-27b-dual-turbo 2>&1 | tail -100
PN9   | APPLY  | Independent drafter attention backend (vllm#3 | opt-in env (config: neutral)                                 | Backport of vllm#39930 (Matthe
PN11  | APPLY  | GDN a/b contiguity in fix_query_key_value_ord | opt-in env (config: neutral)                                 | Backport of vllm#41142 (Yeuvoi
PN34  | SKIP   | WorkspaceManager runtime lock relaxation (PN3 | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 | Companion to PN33 — same root 
PN33  | APPLY  | Spec-decode warmup K-aware sizing (vllm#37521 | config_detect: neutral:                                      | Backport of vllm-project/vllm#
PN32  | SKIP   | GDN _forward_core chunked-prefill v2 (Cliff 2 | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1  | Genesis-original v7.69 v2 (202
PN31  | SKIP   | FA varlen persistent out buffer (issue #15, s | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_O | Genesis-original sister patch 
PN30  | APPLY  | DS conv state layout + spec-decode AL>1 fix ( | opt-in env (config: neutral)                                 | Genesis-original fix for issue
PN29  | SKIP   | GDN chunk_o scale-fold (vllm#41446 pattern (c | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to en | Backport of vllm#41446 (zobinH
PN12  | APPLY  | FFN intermediate scratch pool — Cliff 1 fix o | opt-in env (config: neutral)                                 | Genesis-original 2026-04-29 — 
PN28  | SKIP   | merge_attn_states NaN guard (vllm#39148 backp | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 | Backport of vllm#39148 (jasonk
P15B  | APPLY  | FA varlen max_seqlen_k clamp on TQ path (Issu | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 fi
P38B  | APPLY  | P38 compile-safe in-source hook (Issue #14 fi | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 fi
PN27  | SKIP   | Revert MoERunnerInterface PluggableLayer (vll | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 | Backport of vllm#41440 (auto-g
PN26  | SKIP   | TQ unified perf pack (centroids prebake + spa | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage | Genesis-original 2026-05-01 un
PN25  | APPLY  | SiluAndMul.forward_native opaque-op pool (Cli | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 in
PN13  | APPLY  | CUDAGraphWrapper gc.collect/empty_cache lambd | opt-in env (config: neutral)                                 | Backport of vllm#41235 (roikor
PN14  | APPLY  | TQ decode IOOB safe_page_idx clamp (vllm#4007 | opt-in env (config: neutral)                                 | Backport of vllm#40074 (devara
PN23  | SKIP   | DFlash combine_hidden_states dtype cast (vllm | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to  | Backport of vllm#40334 (cipher
PN21  | SKIP   | DFlash SWA support partial backport (vllm#408 | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage | Partial backport of vllm#40898
PN22  | APPLY  | Local argmax for TP draft (vllm#39419 backpor | opt-in env (config: neutral)                                 | Backport of vllm#39419 (EanWan
PN24  | SKIP   | DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 | Backport of vllm#40727 (benchi
PN16  | SKIP   | Lazy-reasoner request hook (per-request enabl | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to eng | Genesis-original 2026-04-29. H
P86   | SKIP   | ngram batch_propose O(N*K) → O(N+K) direct-fi | opt-in only — set GENESIS_ENABLE_P86=1 to engage             | Backport of vllm#40876 (aarona
P85   | SKIP   | Hybrid fine-shadow prefix cache (vllm#38182 f | opt-in only — set GENESIS_ENABLE_P85=1 to engage             | Genesis-original 2026-04-27 — 
P75   | SKIP   | Auto-enable Suffix Decoding (Arctic Inference | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to en | Backport-enabler of vllm#25784
P74   | APPLY  | Auto chunk-clamp via long_prefill_token_thres | opt-in env (config: neutral)                                 | Genesis-original (zero-VRAM-co
P72   | APPLY  | profile_run M cap (unblocks --max-num-batched | opt-in env (config: neutral)                                 | Genesis-original (Dynamo fake-
P67b  | APPLY  | TurboQuant spec-verify forward() routing (FUL | opt-in env (config: neutral)                                 | Genesis-original (FULL CG enab
P59   | SKIP   | Qwen3 reasoning embedded tool_call recovery   | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 t | ZenoAFfectionate (vllm#39055) 
P58   | APPLY  | Async-scheduler -1 placeholder fix            | opt-in env (config: neutral)                                 | z1ying (vllm#40768)           
P57   | SKIP   | TQ spec-decode capture-safe buffers (deprecat | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40831), gdn_attn.
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated —  | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40807, #40831)   
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] validator: clean (no issues)
[INFO:genesis.apply_all] [Genesis compile-watchdog] apply_all elapsed: 8.0s
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] apply matrix:
Patch | Status | Title                                         | Reason                                                       | Credit                        
------+--------+-----------------------------------------------+--------------------------------------------------------------+-------------------------------
P61b  | APPLY  | Qwen3 streaming partial-tag overlap guard     | opt-in env (config: neutral)                                 | ExtReMLapin (vllm#40783)      
P62   | APPLY  | Structured-output spec-decode reasoning-end t | opt-in env (config: neutral)                                 | sfbemerk (vllm#36138), ciciror
P61   | APPLY  | Qwen3 multi-tool first-occurrence (DEPRECATED | opt-in env (config: neutral)                                 | ExtReMLapin (vllm#40783) — P61
P60b  | APPLY  | GDN+ngram Triton kernel offset (Phase 2)      | opt-in env (config: neutral)                                 | tdoublep (vllm#40738)         
P60   | APPLY  | GDN+ngram state recovery (Phase 1: SSM pre-co | opt-in env (config: neutral)                                 | tdoublep (vllm#40738), bhaktat
P63   | SKIP   | MTP/Eagle drafter GDN state recovery (depreca | opt-in only AND empirically deprecated — keeping skip; set G | Genesis-original (hypothesis d
P64   | APPLY  | qwen3coder MTP streaming early-return fix     | opt-in env (config: neutral)                                 | kotori-yan (vllm#39598)       
P65   | SKIP   | TurboQuant spec-decode cudagraph downgrade    | opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWN | Genesis-original (root cause f
P66   | APPLY  | cudagraph_capture_sizes spec-decode divisibil | opt-in env (config: neutral)                                 | Genesis-original (mirrors fhl2
P68   | APPLY  | Auto force tool_choice=required for long-cont | opt-in env (config: neutral)                                 | Genesis-original (long-ctx too
P69   | APPLY  | Long-context tool-format reminder injection   | opt-in env (config: neutral)                                 | Genesis-original (long-ctx too
P70   | SKIP   | Auto-strict-ngram (force prompt_lookup_min>=8 | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to  | Genesis-original (vllm#40875 e
P67   | APPLY  | TurboQuant multi-query kernel for spec-decode | opt-in env (config: neutral)                                 | Genesis-original (proper fix f
P71   | SKIP   | Block-verify rejection sampler (Sun 2024 ICLR | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engag | Backport of vllm#40819 (Z. Gol
P78   | SKIP   | TurboQuant .tolist() capture-guard (adapted f | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1  | Adapted from noonghunna's patc
P77   | SKIP   | Adaptive ngram K controller (EMA + hysteresis | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to e | Genesis-original (port of SGLa
P79b  | SKIP   | Async × spec-decode proposer-sync backport (v | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1  | Backport of vllm#40610 (OPEN d
P79c  | SKIP   | Stale spec_token_ids cleanup for unscheduled  | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEAN | Backport of vllm#37629 (OPEN, 
P81   | SKIP   | fp8 block-scaled MM low-M decode tuning (vllm | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8 | Backport of vllm#40925 (tonyli
P82   | SKIP   | SGLang threshold_single OR-clause acceptance  | opt-in only — set GENESIS_ENABLE_P82=1 to engage             | SGLang team (sgl-project/sglan
P83   | APPLY  | MTP keep-last-cached-block (vllm#38182 downst | opt-in env (config: neutral)                                 | Root-cause analysis: vllm#3818
P84   | SKIP   | hash_block_size override (vllm#38182 actual r | opt-in only — set GENESIS_ENABLE_P84=1 to engage             | Genesis-original discovery 202
P100  | APPLY  | FlashInfer FULL CUDA graph for spec-decode (v | opt-in env (config: neutral)                                 | Backport of vllm#41127 (open 2
P101  | APPLY  | TQ continuation 64-token slicing (vllm#41123  | opt-in env (config: neutral)                                 | Selective backport of vllm#411
P99   | APPLY  | WorkspaceManager.get_simultaneous memoization | opt-in env (config: neutral)                                 | Per Sander 2026-04-28: 'if rev
P98   | APPLY  | TQ WorkspaceManager revert (vllm#40941 perf h | opt-in env (config: neutral)                                 | Reverts upstream PR #40941 (ME
P94   | APPLY  | Spec-decode prepare_next_token_ids_padded zer | opt-in env (config: neutral)                                 | Backport of vllm#41043 (wanglu
P95   | SKIP   | Marlin TP cudagraph cap on Ampere (vllm#40385 | opt-in only — set GENESIS_ENABLE_P95=1 to engage             | Backport of vllm#40385 (OPEN a
P91   | APPLY  | AutoRound row-parallel group cdiv + start-idx | opt-in env (config: neutral)                                 | Backport of non-MoE-specific p
P87   | APPLY  | Marlin W4A16/W8A16 sub-tile output dim pad-on | opt-in env (config: neutral)                                 | Backport of vllm#40361 (OPEN).
PN8   | APPLY  | MTP/draft online-quant propagation (vllm#4084 | opt-in env (config: neutral)                                 | Backport of vllm#40849 (bhoomi
PN9   | APPLY  | Independent drafter attention backend (vllm#3 | opt-in env (config: neutral)                                 | Backport of vllm#39930 (Matthe
PN11  | APPLY  | GDN a/b contiguity in fix_query_key_value_ord | opt-in env (config: neutral)                                 | Backport of vllm#41142 (Yeuvoi
PN34  | SKIP   | WorkspaceManager runtime lock relaxation (PN3 | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 | Companion to PN33 — same root 
PN33  | APPLY  | Spec-decode warmup K-aware sizing (vllm#37521 | config_detect: neutral:                                      | Backport of vllm-project/vllm#
PN32  | SKIP   | GDN _forward_core chunked-prefill v2 (Cliff 2 | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1  | Genesis-original v7.69 v2 (202
PN31  | SKIP   | FA varlen persistent out buffer (issue #15, s | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_O | Genesis-original sister patch 
PN30  | APPLY  | DS conv state layout + spec-decode AL>1 fix ( | opt-in env (config: neutral)                                 | Genesis-original fix for issue
PN29  | SKIP   | GDN chunk_o scale-fold (vllm#41446 pattern (c | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to en | Backport of vllm#41446 (zobinH
PN12  | APPLY  | FFN intermediate scratch pool — Cliff 1 fix o | opt-in env (config: neutral)                                 | Genesis-original 2026-04-29 — 
PN28  | SKIP   | merge_attn_states NaN guard (vllm#39148 backp | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 | Backport of vllm#39148 (jasonk
P15B  | APPLY  | FA varlen max_seqlen_k clamp on TQ path (Issu | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 fi
P38B  | APPLY  | P38 compile-safe in-source hook (Issue #14 fi | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 fi
PN27  | SKIP   | Revert MoERunnerInterface PluggableLayer (vll | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 | Backport of vllm#41440 (auto-g
PN26  | SKIP   | TQ unified perf pack (centroids prebake + spa | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage | Genesis-original 2026-05-01 un
PN25  | APPLY  | SiluAndMul.forward_native opaque-op pool (Cli | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 in
PN13  | APPLY  | CUDAGraphWrapper gc.collect/empty_cache lambd | opt-in env (config: neutral)                                 | Backport of vllm#41235 (roikor
PN14  | APPLY  | TQ decode IOOB safe_page_idx clamp (vllm#4007 | opt-in env (config: neutral)                                 | Backport of vllm#40074 (devara
PN23  | SKIP   | DFlash combine_hidden_states dtype cast (vllm | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to  | Backport of vllm#40334 (cipher
PN21  | SKIP   | DFlash SWA support partial backport (vllm#408 | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage | Partial backport of vllm#40898
PN22  | APPLY  | Local argmax for TP draft (vllm#39419 backpor | opt-in env (config: neutral)                                 | Backport of vllm#39419 (EanWan
PN24  | SKIP   | DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 | Backport of vllm#40727 (benchi
PN16  | SKIP   | Lazy-reasoner request hook (per-request enabl | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to eng | Genesis-original 2026-04-29. H
P86   | SKIP   | ngram batch_propose O(N*K) → O(N+K) direct-fi | opt-in only — set GENESIS_ENABLE_P86=1 to engage             | Backport of vllm#40876 (aarona
P85   | SKIP   | Hybrid fine-shadow prefix cache (vllm#38182 f | opt-in only — set GENESIS_ENABLE_P85=1 to engage             | Genesis-original 2026-04-27 — 
P75   | SKIP   | Auto-enable Suffix Decoding (Arctic Inference | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to en | Backport-enabler of vllm#25784
P74   | APPLY  | Auto chunk-clamp via long_prefill_token_thres | opt-in env (config: neutral)                                 | Genesis-original (zero-VRAM-co
P72   | APPLY  | profile_run M cap (unblocks --max-num-batched | opt-in env (config: neutral)                                 | Genesis-original (Dynamo fake-
P67b  | APPLY  | TurboQuant spec-verify forward() routing (FUL | opt-in env (config: neutral)                                 | Genesis-original (FULL CG enab
P59   | SKIP   | Qwen3 reasoning embedded tool_call recovery   | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 t | ZenoAFfectionate (vllm#39055) 
P58   | APPLY  | Async-scheduler -1 placeholder fix            | opt-in env (config: neutral)                                 | z1ying (vllm#40768)           
P57   | SKIP   | TQ spec-decode capture-safe buffers (deprecat | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40831), gdn_attn.
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated —  | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40807, #40831)

I went back to 95b2c3b and upgraded report.sh and bench.sh to the current version.

my-rig.md

noonghunna · 2026-05-03T14:57:37Z

noonghunna
May 3, 2026
Maintainer

@efschu — your visible log ends mid-dispatcher-matrix, which means the container died WHILE Genesis was still printing the apply table. We need the lines below the matrix to see the actual crash. Two diagnosis paths, do whichever is convenient:

Path A — full log (most informative)

docker logs vllm-qwen36-27b-dual-turbo > /tmp/dual-turbo.log 2>&1
wc -l /tmp/dual-turbo.log

If it's under 1000 lines, paste the whole thing. If it's larger, extract the tail past the matrix with sed:

sed -n '/Genesis Results:/,$p' /tmp/dual-turbo.log
# OR if no 'Genesis Results' line yet (matrix is the last thing printed):
sed -n '/apply matrix:/,$p' /tmp/dual-turbo.log

That reveals whether (a) Genesis crashed mid-apply on a specific patch, (b) Genesis finished but vLLM aborted on the model load / CUDA init / cudagraph capture, or (c) something else entirely.
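
If you'd rather triage the saved log yourself first, here is a minimal Python sketch of that three-way split. It is not part of the repo. The helper name `triage` is hypothetical, and it only assumes the two markers already mentioned in this thread: the `Genesis Results:` line and the `[ERROR:genesis.apply_all] [Genesis] FAILED:` line format. A FAILED line alone does not prove the container died on that patch, so treat case "a" as a lead, not a verdict.

```python
import re
from typing import List, Tuple

# Line format as seen in Genesis logs in this thread:
#   [ERROR:genesis.apply_all] [Genesis] FAILED: P87 Marlin ... <reason>
FAILED_RE = re.compile(r"\[ERROR:genesis\.apply_all\] \[Genesis\] FAILED: (\S+)")


def triage(log_text: str) -> Tuple[str, List[str]]:
    """Guess which case a boot log matches and list patches that logged FAILED.

    Returns ("a" | "b" | "c", failed_patch_ids). Hypothetical helper.
    """
    lines = log_text.splitlines()
    failed = []
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            failed.append(match.group(1))
    finished = any("Genesis Results:" in line for line in lines)
    if finished:
        # Genesis printed its summary, so the crash is likely later (vLLM side).
        return "b", failed
    if failed:
        # No summary line and at least one patch failed: possible mid-apply crash.
        return "a", failed
    # Neither marker found: read the tail by hand.
    return "c", failed
```

Usage would be something like `triage(open("/tmp/dual-turbo.log").read())`. The full log is still the most useful thing to paste.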

Path B — verify your stack is in sync first

Fixes for the dispatcher v2 conflicts I diagnosed earlier (P65/P67/P85) landed on master on 2026-05-03 (commit a26e30b, for the dual + long-* composes). On current master, dual-turbo.yml has P65 and P85 commented out; only P67 is enabled. So if your pull is clean and your Genesis tree is at 2db18df (v7.69), those specific conflicts shouldn't fire.
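
To check which opt-in flags your local compose file actually enables, a small Python sketch along these lines may help. It assumes the compose lists env vars one per line as `- NAME=1` or `NAME: "1"` and that disabled entries are commented out with `#`. The real file layout may differ, and `enabled_flags` is a hypothetical helper, not something the repo ships.

```python
import re
from typing import List

# Matches an uncommented env entry such as:
#   - GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1
#   GENESIS_ENABLE_P85: "1"
ENABLE_RE = re.compile(
    r"""^\s*-?\s*(GENESIS_ENABLE_[A-Z0-9_]+)\s*[=:]\s*["']?1["']?\s*(?:\#.*)?$"""
)


def enabled_flags(compose_text: str) -> List[str]:
    """Return GENESIS_ENABLE_* names set to 1 on lines that are not commented out."""
    flags = []
    for line in compose_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # commented-out entry, treated as disabled
        match = ENABLE_RE.match(line)
        if match:
            flags.append(match.group(1))
    return flags


# Illustrative snippet only; the layout is an assumption, not a copy of dual-turbo.yml.
SAMPLE = """
    environment:
      # - GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1
      - GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1
      # - GENESIS_ENABLE_P85=1
"""
print(enabled_flags(SAMPLE))
```

If the list for your dual-turbo.yml includes P65 or P85 entries, the file probably predates the cleanup commit.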

Today we also shipped bash scripts/update.sh (commit 43fe2a4) which is the one-shot fix for exactly this situation:

bash scripts/update.sh
# → bails if you have local edits (commit/stash them first)
# → git pull --ff-only origin master
# → re-runs setup.sh qwen3.6-27b (re-pins Genesis tree to declared GENESIS_PIN)

This handles the most common failure mode: "pulled but didn't re-run setup, Genesis tree is stale, env vars expect new dispatcher behavior, container crashes." Worth running before further investigation — eliminates a class of false positive.

If update.sh passes cleanly and the container still crashes, then we have a real bug. Path A's log will tell us where.

Also helpful

Run bash scripts/report.sh > my-rig.md and paste the result. It captures the repo commit, GENESIS_PIN declared vs on-disk, container Python/CUDA, and image SHA, which answers the "did setup.sh actually align my Genesis tree" question.
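
The declared-vs-on-disk comparison boils down to a check like the Python sketch below. It assumes both values are git commit hashes and that one of them may be abbreviated. If GENESIS_PIN is a tag or branch name this does not apply. `pins_match` is a hypothetical helper, not what report.sh runs.

```python
def pins_match(declared: str, on_disk: str) -> bool:
    """True when two commit hashes plausibly name the same commit.

    Allows one hash to be an abbreviation of the other. Hypothetical helper.
    """
    a = declared.strip().lower()
    b = on_disk.strip().lower()
    if len(a) < 7 or len(b) < 7:
        return False  # too short to compare safely
    return a.startswith(b) or b.startswith(a)


# Synthetic hashes for illustration only.
print(pins_match("abc1234", "abc1234def5678900000000000000000000000000"))
print(pins_match("abc1234", "fff9999"))
```

A mismatch here is the "pulled but didn't re-run setup" case that update.sh is meant to fix.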

For boot UX: we also shipped a switch.sh improvement today (commit 8e9cf70). wait_ready now detects a dead container and dumps the last 30 log lines immediately instead of timing out at 600s. So when update.sh reboots dual-turbo, you should see a clearer error if it dies again.
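
For anyone curious about the idea behind that change, here is a rough Python sketch of a readiness loop that fails fast when the container exits. It is not the switch.sh code, which is a shell script. The callables `is_running`, `is_ready`, and `tail_logs` are hypothetical stand-ins for whatever probes the real script uses, and the 2s poll interval is an assumption. Only the 600s timeout and the 30-line tail come from the description above.

```python
import time
from typing import Callable, List


def wait_ready(
    is_running: Callable[[], bool],
    is_ready: Callable[[], bool],
    tail_logs: Callable[[int], List[str]],
    timeout_s: float = 600.0,
    poll_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the service is ready, the container dies, or the timeout hits."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if is_ready():
            return True
        if not is_running():
            # Container exited: show the log tail now instead of waiting out the timeout.
            for line in tail_logs(30):
                print(line)
            return False
        sleep(poll_s)
    print(f"timed out after {timeout_s:.0f}s")
    return False
```

Passing `sleep` and `clock` as parameters is just a convenience so the loop can be exercised without real waiting.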

Sorry for the regression. The dispatcher v2 strict validation in v7.69 made our compose env-var combos more fragile, and the cleanup commits landed late.

1 reply

efschu May 3, 2026

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
[INFO:genesis.apply_all] Genesis platform: {"vendor": {"is_nvidia_cuda": true, "is_amd_rocm": false, "is_intel_xpu": false, "is_cpu_only": false}, "nvidia": {"compute_capability": [8, 6], "is_ampere_datacenter": false, "is_ampere_consumer": true, "is_ada_lovelace": false, "is_hopper": false, "is_blackwell": false, "has_native_fp8": false}, "amd": {}, "versions": {"torch": [2, 11], "transformers": [5, 6, 2], "vllm": [0, 20, 1], "flash_attn_major": 2}, "paths": {"vllm_install_root": "/usr/local/lib/python3.12/dist-packages/vllm"}}
[INFO:genesis.apply_all] Genesis Unified Patch v7.0 — Ampere FP8 + TQ + MoE + Hybrid + bugfixes. Philosophy: МЫ ЧИНИМ, НЕ ЛОМАЕМ.
[INFO:genesis.apply_all] [Genesis registry] 100 dispatcher entries — schema-clean, dependency graph consistent.
[INFO:genesis.apply_all] ==============================================================================
[INFO:genesis.apply_all] [Genesis GPU profile] detected: NVIDIA GeForce RTX 3080
[INFO:genesis.apply_all]   canonical: RTX 3080  cc: (8, 6)  SM: 68  L2: 5 MB  BW: 760 GB/s  regime: bandwidth
[INFO:genesis.apply_all] 
[INFO:genesis.apply_all] Per-patch recommendations:
[INFO:genesis.apply_all] ------------------------------------------------------------------------------
[INFO:genesis.apply_all]   [OFF] P40                TQ k8v4 GQA grouping kernel (vllm#40792)
[INFO:genesis.apply_all]           gain: +5-15% (mixed regime), +15-30% (compute regime)
[INFO:genesis.apply_all]           (correctly skipped on this regime)
[INFO:genesis.apply_all]           why: Author measured +27% on H100. Empirically NS on A5000 (p=0.28). Cache-locality benefit needs L2 >= 24 MB.
[INFO:genesis.apply_all]   [ON ] P67                Multi-query verify kernel for spec-decode K+1
[INFO:genesis.apply_all]           gain: +25-35%
[INFO:genesis.apply_all]           why: +32% TPS on 35B-A3B-FP8 spec-decode K=3 verify (Genesis internal benchmark, all GPU classes tested).
[INFO:genesis.apply_all]   [REC] P82                SGLang-style acceptance threshold OR-clause
[INFO:genesis.apply_all]           gain: +8-12%
[INFO:genesis.apply_all]           → recommend: export GENESIS_ENABLE_P82=1
[INFO:genesis.apply_all]           why: Cross-rig confirmed: +12% on A5000 FP8, +10.5% on 3090 INT4.
[INFO:genesis.apply_all]   [ON ] PN8                MTP/draft online-quant propagation (vllm#40849)
[INFO:genesis.apply_all]           gain: 0% TPS, but ~1-2 GiB total VRAM headroom
[INFO:genesis.apply_all]           why: Verified ~1 GiB VRAM saved per GPU on 35B-A3B-FP8 + MTP K=3. Use freed VRAM for higher gpu-mem-util or longer ctx.
[INFO:genesis.apply_all]   [ON ] P83+P84+P85        Prefix-cache cake-and-eat patches (vllm#38182)
[INFO:genesis.apply_all]           gain: (currently negative)
[INFO:genesis.apply_all]           ⚠️  enabled but NOT recommended on this GPU (may regress or no-op)
[INFO:genesis.apply_all]           why: Tested 4-arm A/B 2026-04-29: -29% TPS regression even with full stack including root-cause P84. Possible vllm cache machinery fixed-overhead per-step. WAIT for v0.20.2 pin bump.
[INFO:genesis.apply_all] ==============================================================================
[INFO:genesis.wiring.p8_kv_hybrid_reporting] [P8] using V2 import anchor (post-MambaSpec layout)
[INFO:genesis.wiring.text_patch] [P8 KV hybrid reporting (kv_cache_utils)] applied 3 sub-patches: p8_kv_imports, p8_kv_helper_injection, p8_kv_callsite
[INFO:genesis.wiring.text_patch] [P8 KV hybrid reporting (scheduler)] applied 2 sub-patches: p8_sched_import, p8_sched_callsite
[INFO:genesis.apply_all] [Genesis] applied: P8 KV hybrid reporting (per-token capacity) — kv_cache_utils=applied(ok), scheduler=applied(ok)
[INFO:genesis.wiring.text_patch] [P3 TurboQuant BF16->FP8 cast (Ampere fix)] applied 1 sub-patches: p3_bf16_fp8_cast
[INFO:genesis.apply_all] [Genesis] applied: P3 TurboQuant BF16->FP8 cast (Ampere fix) — BF16->FP8 cast guard inserted
[INFO:genesis.wiring.text_patch] [P6 TQ-aware block size alignment] applied 2 sub-patches: p6_import_tqspec, p6_tq_branch
[INFO:genesis.apply_all] [Genesis] applied: P6 TurboQuant-aware attention page size — TQ-aware page-size branch inserted
[INFO:genesis.wiring.text_patch] [P15 Qwen3 None/null tool arg] applied 1 sub-patches: p15_none_null
[INFO:genesis.apply_all] [Genesis] applied: P15 Qwen3 None/null tool arg parser — None/none mapping added to tool param parser
[INFO:genesis.wiring.text_patch] [P12 Qwen3 <tool_call> implicit reasoning end] upstream marker '_tool_call_token_id' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] skipped: P12 Qwen3 <tool_call> implicit reasoning end — upstream_merged
[INFO:genesis.wiring.text_patch] [P27 Qwen3 BEFORE-THINK fallback/p27_nonstream_return_baseline] anchor not found — soft skip (sibling patches continue)
[INFO:genesis.wiring.text_patch] [P27 Qwen3 BEFORE-THINK fallback] applied 3 sub-patches: p27_nonstream_capture, p27_nonstream_return_pr35687, p27_stream_start
[INFO:genesis.apply_all] [Genesis] applied: P27 Qwen3 BEFORE-THINK fallback — BEFORE-THINK fallback wired (non-stream + stream)
[INFO:genesis.wiring.text_patch] [P34 Mamba zero-collapse deadlock guard] applied 1 sub-patches: p34_deadlock_guard
[INFO:genesis.apply_all] [Genesis] applied: P34 Mamba zero-collapse deadlock guard — zero-collapse deadlock guard inserted (fixes #40707 for hybrid Mamba + multimodal)
[INFO:genesis.apply_all] [Genesis] applied: P29 tool parser IndexError guard — upstream already contains bounded-index guards (no-op)
[INFO:genesis.marlin_fp32_reduce] [Genesis P23] Marlin FP32_REDUCE: disabled=True on SM=(8, 6) (auto-from-platform)
[INFO:genesis.apply_all] [Genesis] applied: P23 Marlin FP32_REDUCE env override — decision: fp32_reduce disabled=True (requires upstream wire into Marlin launcher to take effect)
[INFO:genesis.wiring.text_patch] [P4 TurboQuant hybrid model support] applied 2 sub-patches: p4_helper_fn, p4_tq_block
[INFO:genesis.apply_all] [Genesis] applied: P4 TurboQuant hybrid model support — text-patch succeeded
[INFO:genesis.wiring.text_patch] [P5 KV cache page size unification (v1_lcm_pad_max)/p5_import_math] anchor not found — soft skip (sibling patches continue)
[INFO:genesis.wiring.text_patch] [P5 KV cache page size unification (v1_lcm_pad_max)] applied 1 sub-patches: p5_v1_lcm_pad_max_from_baseline
[INFO:genesis.apply_all] [Genesis] applied: P5 KV cache page size unification — text-patch v2 succeeded (pad-smaller-to-max)
[INFO:genesis.apply_all] [Genesis] skipped: P5b KV page-size pad-smaller-to-max (env-opt-in) — opt-in: set GENESIS_ENABLE_P5B=1 to enable pad-smaller-to-max KV page-size strategy (saves ~34% per-block VRAM on Qwen3.6 hybrid vs P5 v1 LCM-pad-up; BLAST-RADIUS is KV allocator → benchmark on VM 100 before enabling in prod)
INFO 05-03 19:57:11 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
INFO 05-03 19:57:11 [nixl_utils.py:32] NIXL is available
[INFO:genesis.wiring.p31_router_softmax] [Genesis P31] wrapped grouped_topk_router.grouped_topk (fp32 upcast active for grouped-MoE models)
[INFO:genesis.apply_all] [Genesis] applied: P31 MoE router fp32 softmax — grouped_topk wrapped (effective in this process)
[INFO:genesis.wiring.p22_tq_prealloc] [P22] PR #40655 merged (bhoomit moved init out) — auto-skip — patch retired in favor of upstream
[INFO:genesis.apply_all] [Genesis] skipped: P22 TurboQuant shared dequant prealloc — PR #40655 merged (bhoomit moved init out) — auto-skip
[INFO:genesis.wiring.text_patch] [P26 TQ prefill output prealloc] upstream marker 'if not hasattr(self, "_cu_2")' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] skipped: P26 TurboQuant prefill output prealloc — upstream_merged
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P61b — Qwen3 streaming partial-tag overlap guard | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P61b Qwen3 streaming overlap guard] applied 2 sub-patches: p61b_import, p61b_overlap_guard
[INFO:genesis.apply_all] [Genesis] applied: P61b Qwen3 streaming partial-tag overlap guard — P61b applied: overlap guard active in streaming path
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P62 — Structured-output spec-decode reasoning-end timing fix | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P62 structured_output/__init__.py] applied 2 sub-patches: p62_grammar_bitmask, p62_new_methods
[INFO:genesis.wiring.text_patch] [P62 scheduler.py] applied 3 sub-patches: p62_sched_update_from_output, p62_sched_udti, p62_sched_udtio
[INFO:genesis.apply_all] [Genesis] applied: P62 structured-output spec-decode timing fix — P62 applied: 2 files modified, 0 idempotent. Reasoning-aware grammar acceptance + spec-token validation now active. Should reduce residual broken tool-call rate when </think> arrives in spec batch.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P61 — Qwen3 multi-tool first-occurrence (DEPRECATED — superseded by P12 v2) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P61 Qwen3 multi-tool first-occurrence] applied 1 sub-patches: p61_first_occurrence
[INFO:genesis.apply_all] [Genesis] applied: P61 Qwen3 multi-tool first-occurrence — P61 applied: extract_content_ids now finds FIRST <tool_call> (was LAST). Multi-tool requests preserve all tool calls.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P60b — GDN+ngram Triton kernel offset (Phase 2) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P60b causal_conv1d.py — Triton kernel + wrapper] applied 8 sub-patches: p60b_kernel_sig, p60b_constexpr, p60b_offset_calc, p60b_step1, p60b_step2, p60b_wrapper_sig, p60b_kernel_call, p60b_kernel_kwarg
[INFO:genesis.wiring.text_patch] [P60b gdn_linear_attn.py — pass num_accepted to conv_fn] applied 1 sub-patches: p60b_gdn_conv_fn
[INFO:genesis.wiring.p60b_gdn_ngram_triton_kernel] [Genesis P60b] Triton cache: no cache dir cleared (may be empty already or permission denied)
[INFO:genesis.apply_all] [Genesis] applied: P60b GDN+ngram Triton kernel offset — P60b Phase 2 applied: 2 files modified, 0 idempotent. Triton kernel offset active. no cache dir cleared (may be empty already or permission denied) First spec-decode call will trigger kernel recompile (~5-10s).
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P60 — GDN+ngram state recovery (Phase 1: SSM pre-copy) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P60 gdn_attn.py — spec_decode_src_indices] applied 3 sub-patches: p60_field, p60_build, p60_ctor
[INFO:genesis.wiring.text_patch] [P60 gdn_linear_attn.py — SSM state pre-copy] applied 2 sub-patches: p60_core, p60_decode
[INFO:genesis.wiring.text_patch] [P60 gpu_model_runner.py — non-spec passthrough] applied 1 sub-patches: p60_passthrough
[INFO:genesis.apply_all] [Genesis] applied: P60 GDN+ngram state recovery — P60 Phase 1 applied: 3 files modified, 0 idempotent. SSM state pre-copy active for GDN+ngram spec decode. Conv state Triton kernel fix (Phase 2) NOT included — if reproducer still shows broken output, Phase 2 is the next step.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P63 — MTP/Eagle drafter GDN state recovery (deprecated — wrong layer) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P63_MTP_GDN_STATE_RECOVERY=1 only for diagnost
[INFO:genesis.apply_all] [Genesis] skipped: P63 MTP/Eagle drafter GDN state recovery — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P63_MTP_GDN_STATE_RECOVERY=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P64 — qwen3coder MTP streaming early-return fix | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P64 qwen3coder_tool_parser.py — MTP streaming early-return removal] applied 2 sub-patches: p64_remove_early_return, p64_unify_emit_at_fnend
[INFO:genesis.wiring.text_patch] [P64 serving.py — MTP safety-net + Pydantic null fix] applied 2 sub-patches: p64_safety_net_widen, p64_callsite_guard
[INFO:genesis.apply_all] [Genesis] applied: P64 qwen3coder MTP streaming early-return fix — P64 applied: 2 files modified, 0 idempotent. qwen3coder streaming parser no longer drops parameters when MTP bundles last param + </function> in same delta. Safety net widened to fire on finish_reason alone.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P65 — TurboQuant spec-decode cudagraph downgrade | opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P65 TurboQuant spec-decode cudagraph downgrade — opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P66 — cudagraph_capture_sizes spec-decode divisibility filter | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P66 config/vllm.py — cudagraph_capture_sizes spec-decode filter] applied 1 sub-patches: p66_size_filter
[INFO:genesis.apply_all] [Genesis] applied: P66 cudagraph_capture_sizes spec-decode divisibility filter — P66 applied: cudagraph_capture_sizes will be filtered to sizes divisible by uniform_decode_query_len when spec-decode is active. Boot 2-4x faster, less peak GPU memory, no mixed-q_len capture risks.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P68 — Auto force tool_choice=required for long-context tool calls | opt-in env (config: neutral)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P69 — Long-context tool-format reminder injection | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P68/P69 serving.py — long-ctx tool-call hook injection] applied 1 sub-patches: p6869_hook_insert
[INFO:genesis.apply_all] [Genesis] applied: P68/P69 long-context tool-call adherence — Hook injected into create_chat_completion. Active mitigations: P68 (auto force tool_choice=required for long ctx); P69 (append tool-format reminder to last user msg). Threshold: GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS env (default 50000, raised from 8000 in v7.65 per Issue #9).
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P70 — Auto-strict-ngram (force prompt_lookup_min>=8) | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P70 Auto-strict-ngram (force prompt_lookup_min>=8) — opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to engage
[INFO:genesis.wiring.p67_tq_multi_query_kernel] [Genesis P67 H2] baked env at module load: MAX_PRIOR_LEN=4096 DEBUG_COMPARE=False (no per-dispatch env reads)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P67 — TurboQuant multi-query kernel for spec-decode K+1 | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P67 turboquant_attn.py — multi-query kernel hook] applied 1 sub-patches: p67_kernel_hook
[INFO:genesis.apply_all] [Genesis] applied: P67 TurboQuant multi-query kernel for spec-decode K+1 — P67 hook injected at top of _prefill_attention. Multi-query continuation prefill batches (spec-verify K+1 with prior cached KV) will route to Genesis Triton kernel when GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1. Diag: {'env_enabled': True, 'version': 'v7.39_aggressive_tune', 'sm': '8.6', 'fp8_mode': 'e4b15', 'autoconfig': {'BLOCK_KV': 32, 'num_warps': 8, 'num_stages': 3}, 'kernel_built': True}
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P71 — Block-verify rejection sampler (Sun 2024 ICLR) | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P71 Block-verify rejection sampler (vllm#40819 + gemini bug-fixes) — opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P78 — TurboQuant .tolist() capture-guard (adapted from noonghunna) | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P78 TurboQuant .tolist() capture-guard (adapted from noonghunna) — opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P77 — Adaptive ngram K controller (EMA + hysteresis + auto-disable) | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P77 Adaptive ngram K controller (EMA + hysteresis + auto-disable) — opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P79b — Async × spec-decode proposer-sync backport (vllm#40610) | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P79b Async × spec-decode proposer-sync backport (vllm#40610) — opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P79c — Stale spec_token_ids cleanup for unscheduled requests (vllm#37629) | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P79c Stale spec_token_ids cleanup for unscheduled requests (vllm#37629) — opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P81 — fp8 block-scaled MM low-M decode tuning (vllm#40925) | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P81 fp8 block-scaled MM low-M decode tuning (vllm#40925) — opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P82 — SGLang threshold_single OR-clause acceptance (BIASED — opt-in research) | opt-in only — set GENESIS_ENABLE_P82=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P82 SGLang threshold_single OR-clause acceptance (BIASED — opt-in research) — opt-in only — set GENESIS_ENABLE_P82=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P83 — MTP keep-last-cached-block (vllm#38182 downstream symptom — P84 is real fix) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P83 v1/core/single_type_kv_cache_manager.py — MTP keep-last-cached-block (vllm#38182 mitigation)] applied 2 sub-patches: p83_full_attention_skip_pop, p83_sliding_window_skip_pop
[INFO:genesis.apply_all] [Genesis] applied: P83 MTP keep-last-cached-block (vllm#38182 mitigation) — P83 applied: MTP keep-last-cached-block guard installed at both FullAttentionManager and SlidingWindowManager pop sites. Activates ONLY when GENESIS_ENABLE_P83=1 is set in env. MTP-targeted; do NOT enable for true Eagle/Eagle3.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P84 — hash_block_size override (vllm#38182 actual root cause) | opt-in only — set GENESIS_ENABLE_P84=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P84 hash_block_size override (vllm#38182 ACTUAL root cause) — opt-in only — set GENESIS_ENABLE_P84=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P100 — FlashInfer FULL CUDA graph for spec-decode (vllm#41127) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P100 flashinfer.py — native FULL CG for spec-decode (vllm#41127)] applied 11 sub-patches: p100_imports_drop_uniform, p100_fispecdecode_dataclass, p100_metadata_decode_union, p100_cgsupport_uniform_batch, p100_init_decode_wrap_field, p100_init_cgdict, p100_init_reorder_threshold, p100_init_qo_indptr_buffer, p100_get_spec_decode_prefill_wrapper_method, p100_build_query_len_scan_branch, p100_forward_fispecdecode_case
[INFO:genesis.apply_all] [Genesis] applied: P100 FlashInfer FULL CUDA graph for spec-decode (vllm#41127) — P100 v7.62.17 applied: 11 sub-patches on flashinfer.py for native FULL CUDA graph + spec-decode without TRTLLM. 27B variants now get UNIFORM_BATCH cudagraph (was PIECEWISE) for K+1 spec-verify. Expected: +5-10% TPS on Ampere SM 8.6. NO-OP for PROD (TQ backend). Composes with P67/P67b/P98/P99 (different backends).
[INFO:genesis.wiring.text_patch] [P103 model_executor/layers/fla/ops/chunk.py — self-install hook (v7.69, club-3090#19 finding 2)] applied 1 sub-patches: p103_self_install_at_chunk_py_end
[INFO:genesis.wiring.p103_fla_cliff2_chunked] [Genesis P103 self-install] wrapper installed in chunk.py at module-import time (survives `exec vllm serve` + worker spawn)
[INFO:genesis.apply_all] [Genesis] applied: P103 FLA Cliff 2 chunked fwd_h+fwd_o orchestrator — text-patch=applied (chunk.py self-install hook appended (survives `exec vllm serve` + worker spawn)); setattr already idempotent (chunk.py self-install fired or prior apply still active in this process)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P101 — TQ continuation 64-token slicing (vllm#41123 SELECTIVE) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P101 turboquant_attn.py — TQ continuation 64-token slicing (vllm#41123)] applied 2 sub-patches: p101_threshold_lower_and_max_cached, p101_continuation_slicing_loop
[INFO:genesis.apply_all] [Genesis] applied: P101 TQ continuation 64-token slicing (vllm#41123 selective) — P101 v7.62.16 applied: turboquant_attn.py continuation prefill now uses 64-token slicing + 32K cached-len cutoff. Expected: +3-12% TPS on PROD long-context. Composes with P98/P99 (non-overlapping anchors). REMINDER: P56 needs re-anchor against new use_decode_continuation block.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P99 — WorkspaceManager.get_simultaneous memoization (perf hotfix) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P99 workspace.py — memoize get_simultaneous (perf hotfix)] applied 2 sub-patches: p99_get_simultaneous_memo_entry, p99_get_simultaneous_memo_return
[INFO:genesis.apply_all] [Genesis] applied: P99 WorkspaceManager memoize get_simultaneous (perf hotfix) — P99 v7.62.15 applied: WorkspaceManager.get_simultaneous() now memoizes (shapes_and_dtypes, ubatch, ws_data_ptr) → cached views. Cache HIT bypasses list-comps + accumulate + _ensure_workspace_size → ~5x faster per call. Properly invalidates on ws re-alloc.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P98 — TQ WorkspaceManager revert (vllm#40941 perf hotfix) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P98 turboquant_attn.py — revert WorkspaceManager (perf hotfix)] applied 2 sub-patches: p98_decode_workspace_revert, p98_dequant_workspace_revert
[INFO:genesis.apply_all] [Genesis] applied: P98 TQ WorkspaceManager revert (vllm#40941 perf hotfix) — P98 v7.62.14 applied: turboquant_attn.py _decode_attention + continuation prefill dequant now use per-layer cached buffers (OLD pattern, pre-vllm#40941). Removes Python indirection from decode hot path. Expected: +15-25% TPS recovery on Ampere small-batch single-stream workload.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P94 — Spec-decode prepare_next_token_ids_padded zero-alloc (vllm#41043) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P94 v1/spec_decode/llm_base_proposer.py — zero-alloc prepare_next_token_ids_padded (vllm#41043)] applied 1 sub-patches: p94_zero_alloc_backup_token_ids
[INFO:genesis.apply_all] [Genesis] applied: P94 Spec-decode prepare_next_token_ids_padded zero-alloc (vllm#41043 backport) — P94 applied: prepare_next_token_ids_padded uses in-place loop instead of .tolist() + list-comp + np.array(). Removes GPU->CPU sync, list-comp allocation, np.array allocation. Algorithmic identity preserved; expected +2-4% wall TPS on MTP K=3 spec decode.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P95 — Marlin TP cudagraph cap on Ampere (vllm#40385) | opt-in only — set GENESIS_ENABLE_P95=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P95 Marlin TP cudagraph cap on Ampere (vllm#40385 backport) — opt-in only — set GENESIS_ENABLE_P95=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P91 — AutoRound row-parallel group cdiv + start-idx fix (vllm#39460) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P91 gptq_marlin.py — cdiv groups + row-group attrs (vllm#39460)] applied 3 sub-patches: p91_gm_floor_input_size_to_cdiv, p91_gm_floor_partition_to_cdiv, p91_gm_register_row_group_attrs
[INFO:genesis.wiring.text_patch] [P91 parameter.py — group-aware row_parallel start_idx (vllm#39460)] applied 1 sub-patches: p91_param_group_aware_start_idx
[INFO:genesis.apply_all] [Genesis] applied: P91 AutoRound row-parallel group cdiv + start-idx fix (vllm#39460 backport) — P91 applied (DUAL FILE): gptq_marlin.py uses cdiv() for scale rows and tags scales/qzeros with row_group_size + row_input_size_per_partition; parameter.py uses group-aware start_idx for row-parallel scale/zero loading. Fixes silent dequant corruption when input_size_per_partition % group_size != 0 at TP>1.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P87 — Marlin W4A16/W8A16 sub-tile output dim pad-on-load (vllm#40361) | opt-in env (config: neutral)
[ERROR:genesis.apply_all] [Genesis] FAILED: P87 Marlin sub-tile output dim pad-on-load (vllm#40361 backport) — P87 marlin.py — sub-tile output dim pad-on-load (vllm#40361): write_error ([Errno 30] Read-only file system: '/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/mixed_precision/marlin.py')
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN8 — MTP/draft online-quant propagation (vllm#40849) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [PN8 model_executor/models/utils.py — get_draft_quant_config online-quant propagation (vllm#40849)] applied 2 sub-patches: pN8_imports, pN8_body
[INFO:genesis.apply_all] [Genesis] applied: PN8 MTP/draft online-quant propagation (vllm#40849 backport) — PN8 applied: get_draft_quant_config() now inherits target's OnlineQuantizationConfig when draft has no explicit quant. Frees ~600 MiB on FP8-target + external-draft (Eagle3/DFlash/MTP). No-op when target is not online-quantized.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN9 — Independent drafter attention backend (vllm#39930) | opt-in env (config: neutral)
[INFO:genesis.apply_all] [Genesis] skipped: PN9 independent drafter attention backend (vllm#39930 backport) — upstream drift: 'spec_cfg.attention_backend' present in llm_base_proposer.py — PR #39930 (or equivalent) appears merged; PN9 self-retires (use --speculative-config.attention_backend instead)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN11 — GDN a/b contiguity in fix_query_key_value_ordering (vllm#41142) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [PN11 model_executor/layers/mamba/gdn_linear_attn.py — force a/b contiguity in fix_query_key_value_ordering (vllm#41142)] applied 1 sub-patches: pN11_a_b_contiguous
[INFO:genesis.apply_all] [Genesis] applied: PN11 GDN a/b contiguity (vllm#41142 backport) — PN11 applied: GatedDeltaNetAttention.fix_query_key_value_ordering now forces .contiguous() on `b` and `a` after reshape (vllm#41142). Defensive — eliminates silent quality drift class for any future Qwen3-Next/GDN config with num_v_heads == num_k_heads.
[INFO:genesis.apply_all] [Genesis] skipped: P67c sparse-V integration into P67 split-M kernel — opt-in: set GENESIS_ENABLE_P67_SPARSE_V=1 to engage per-q_t sparse-V skip in P67 split-M kernel
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN34 — WorkspaceManager runtime lock relaxation (PN33 companion for runtime decode) | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN34 workspace lock runtime relaxation — opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN33 — Spec-decode warmup K-aware sizing (vllm#37521 extended to MTP/ngram) | config_detect: neutral:
[INFO:genesis.wiring.text_patch] [PN33 v1/worker/gpu_model_runner.py — spec-decode warmup K-aware sizing (vllm#37521 extended to MTP/ngram)] applied 1 sub-patches: pN33_warmup_k_draft_tokens
[INFO:genesis.apply_all] [Genesis] applied: PN33 spec-decode warmup K-aware (vllm#37521 extended) — PN33 applied: spec-decode warmup uses real num_speculative_tokens instead of dummy K=1. Closes (a) ampersandru mid-stream OOM via propose_draft_token_ids and (b) noonghunna workspace-lock AssertionError on TQ + MTP K=3 single-card. Disable via GENESIS_DISABLE_PN33_SPEC_DECODE_WARMUP_K=1 if warmup OOMs.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN32 — GDN _forward_core chunked-prefill v2 (Cliff 2 fix for single-24GB-GPU OOM) | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN32 GDN chunked-prefill (Cliff 2 fix) — opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN31 — FA varlen persistent out buffer (issue #15, sister to P38) | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_OUT=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN31 FA varlen persistent out (issue #15) — opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_OUT=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN30 — DS conv state layout + spec-decode AL>1 fix (issue #17) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [PN30 model_executor/layers/mamba/mamba_utils.py — DS layout spec-decode AL>1 fix (issue #17)] applied 2 sub-patches: pN30_get_conv_copy_spec_contiguous, pN30_module_level_state
[INFO:genesis.wiring.text_patch] [PN30 v1/worker/mamba_utils.py — do_mamba_copy_block stream sync + temp tensor cleanup (issue #17)] applied 1 sub-patches: pN30_do_mamba_copy_block_cleanup
[INFO:genesis.wiring.text_patch] [PN30 v1/worker/mamba_utils.py — collect_mamba_copy_meta dst-shaped DS temp (issue #17, v7.68)] applied 1 sub-patches: pN30_collect_mamba_copy_meta_dst_shaped_temp
[INFO:genesis.apply_all] [Genesis] applied: PN30 DS conv state + spec-decode AL>1 (issue #17) — PN30 v7.68 applied: DS conv state layout + spec-decode AL>1 now uses collect_mamba_copy_meta dst-shaped temp blocks for DS conv offset>0, preserving destination row stride. get_conv_copy_spec fails closed if the collect-time bypass is missed; do_mamba_copy_block keeps PN30's stream sync + temp clear lifecycle. Supersedes v7.65 compact .contiguous() approach which silently corrupted DS row strides (diagnosed by noonghunna/ChatGPT-Codex 2026-05-02).
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN29 — GDN chunk_o scale-fold (vllm#41446 pattern (c)) | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN29 GDN chunk_o scale-fold (vllm#41446 pattern c) — opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN12 — FFN intermediate scratch pool — Cliff 1 fix on TQ3 path | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [PN12 model_executor/layers/activation.py — SiluAndMul forward_cuda FFN intermediate pool (Cliff 1 fix on TQ3)] applied 1 sub-patches: pN12_silu_and_mul_pool
[INFO:genesis.apply_all] [Genesis] applied: PN12 FFN intermediate scratch pool (Cliff 1 fix) — PN12 applied: SiluAndMul.forward_cuda now acquires output via FFNIntermediateCache pool when GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL=1. Closes Cliff 1 OOM on TQ3 path that PN8 couldn't address (different memory class).
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN28 — merge_attn_states NaN guard (vllm#39148 backport) | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN28 merge_attn_states NaN guard (vllm#39148 backport) — opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P15B — FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P15B turboquant_attn.py — _flash_attn_varlen max_seqlen_k clamp (Issue #15 fix)] applied 1 sub-patches: p15b_fa_varlen_clamp
[INFO:genesis.apply_all] [Genesis] applied: P15B FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix) — P15B applied: _flash_attn_varlen now clamps max_seqlen_k to actual cu_seqlens_k span. Prevents 50 MiB FA wrapper workspace OOM on long-context continuation-prefill (Issue #15). Adds one GPU->CPU sync per call on infrequent path.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P38B — P38 compile-safe in-source hook (Issue #14 fix) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P38b turboquant_attn.py — _continuation_prefill compile-safe hook (Issue #14 fix)] applied 1 sub-patches: p38b_continuation_prefill_hook
[INFO:genesis.wiring.p38b_compile_safe_hook] [Genesis P38b] installed _genesis_p38_dispatch on TurboQuantAttentionImpl — compile-safe path active
[INFO:genesis.apply_all] [Genesis] applied: P38B P38 compile-safe in-source hook (Issue #14 fix) — P38b applied: text-patch applied (P38b in-source hook injected); dispatcher installed on TurboQuantAttentionImpl. _continuation_prefill now survives aot_compile_fullgraph capture. Fixes Issue #14 (noonghunna).
[INFO:genesis.kernels.tq_decode_sparse_v] [Genesis PN26 sparse_v] kernel built and cached
[INFO:genesis.wiring.pn26_sparse_v_kernel] [Genesis PN26 sparse_v] wrapped vllm.v1.attention.ops.triton_turboquant_decode.triton_turboquant_decode_attention with sparse-V dispatcher (also rebound on consumers: ['vllm.v1.attention.backends.turboquant_attn'])
[INFO:genesis.apply_all] [Genesis] applied: PN26b sparse-V tile-skip Genesis kernel (BLASST lambda=a/L for SM86) — PN26 sparse-V kernel dispatcher installed. Routes to Genesis fork when seq_len >= min_ctx (default 8192). Threshold: BLASST λ=a/L if GENESIS_PN26_SPARSE_V_SCALE_FACTOR>0, else fixed GENESIS_PN26_SPARSE_V_THRESHOLD (default 0.001). First sparse-V kernel deployed for SM86 (Ampere consumer) — no upstream NVIDIA reference exists yet.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN27 — Revert MoERunnerInterface PluggableLayer (vllm#41440) | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN27 revert MoERunnerInterface PluggableLayer (vllm#41440 backport) — opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN26 — TQ unified perf pack (centroids prebake + sparse V scaffold) | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN26 TQ unified perf pack (centroids prebake + sparse V scaffold) — opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN25 — SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile path) | opt-in env (config: neutral)
[INFO:genesis.silu_and_mul_customop] [PN25] registered torch op genesis::silu_and_mul_pooled via direct_register_custom_op (vLLM canonical, v7.66 — fork-safe + Inductor-opaque)
[INFO:genesis.wiring.text_patch] [PN25 model_executor/layers/activation.py — SiluAndMul forward_native opaque-op pool (Inductor-safe Cliff 1 mech B)] applied 2 sub-patches: pN25_silu_and_mul_import_time_register, pN25_silu_and_mul_forward_native_opaque
[INFO:genesis.apply_all] [Genesis] applied: PN25 SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile-path) — PN25 applied: SiluAndMul.forward_native now dispatches through genesis::silu_and_mul_pooled opaque custom op so torch.compile/Inductor cannot inline the FFN intermediate alloc. Closes Cliff 1 mech B on inductor-compiled FFN forward (club-3090#16 reproducer class). Sister to PN12 — PN12 covers eager forward_cuda, PN25 covers compile forward_native; both share the same FFNIntermediateCache pool and can be enabled simultaneously without conflict.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN13 — CUDAGraphWrapper gc.collect/empty_cache lambda arity (vllm#41235) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [PN13 compilation/cuda_graph.py — CUDAGraphWrapper gc.collect/empty_cache lambda arity fix (vllm#41235)] applied 1 sub-patches: pN13_lambda_arity
[INFO:genesis.apply_all] [Genesis] applied: PN13 CUDAGraphWrapper lambda arity (vllm#41235 backport) — PN13 applied: CUDAGraphWrapper.__call__ now accepts var-args on patched gc.collect / empty_cache lambdas (vllm#41235). Defensive fix — prevents TypeError worker-death class when dynamo recompiles nested @torch.compile callables inside cudagraph capture region.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN14 — TQ decode IOOB safe_page_idx clamp (vllm#40074) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [PN14 triton_turboquant_decode.py — TQ decode IOOB safe_page_idx clamp (vllm#40074)] applied 1 sub-patches: pN14_safe_page_idx
[INFO:genesis.apply_all] [Genesis] applied: PN14 TQ decode IOOB safe_page_idx clamp (vllm#40074 backport) — PN14 applied: _tq_decode_stage1 now clamps masked-out lanes to page_idx=0 before block-table pointer arithmetic (vllm#40074). Defensive fix — prevents Triton bounds-checker assertion class originally reported on 4090 with >32k sequences in upstream #39998.
[INFO:genesis.wiring.text_patch] [PN19 scoped max_split_size_mb] applied 2 sub-patches: pn19_helper_method, pn19_load_model_wrap
[INFO:genesis.apply_all] [Genesis] applied: PN19 Scoped max_split_size_mb during model load (vllm#41268) — PN19 applied: max_split_size_mb=20 scoped to model load (vllm#41268 backport, MatthewBonanni). Frees 200-500 MiB load-time fragmentation.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN23 — DFlash combine_hidden_states dtype cast (vllm#40334) | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN23 DFlash combine_hidden_states dtype cast (vllm#40334 backport) — opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN21 — DFlash SWA support partial backport (vllm#40898) | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN21 DFlash SWA support partial backport (vllm#40898 backport) — opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN22 — Local argmax for TP draft (vllm#39419 backport) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [PN22 qwen3.py — get_top_tokens (vllm#39419)] applied 1 sub-patches: pn22_qwen3_get_top_tokens
[INFO:genesis.wiring.text_patch] [PN22 qwen3_dflash.py — get_top_tokens (vllm#39419)] applied 1 sub-patches: pn22_dflash_get_top_tokens
[INFO:genesis.apply_all] [Genesis] applied: PN22 Local argmax for TP draft (vllm#39419 backport) — PN22 applied: get_top_tokens() added to qwen3.py + qwen3_dflash.py (vllm#39419 backport). Enables vocab-parallel argmax in spec-decode draft path; +9-30% TPS on TP>=2 per PR author.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN24 — DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN24 DFlash aux layer +1 indexing fix (vllm#40727 backport) — opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 to engage
[INFO:genesis.wiring.text_patch] [PN17 FA2 softmax_lse runtime clamp] applied 1 sub-patches: pn17_clamp
[INFO:genesis.apply_all] [Genesis] applied: PN17 FA2 softmax_lse runtime clamp (Issue #11 Cliff 1 mechanism A) — PN17 applied: FA2 softmax_lse buffer now clamped to actual seqused_k at runtime, freeing 50-100 MiB on long-ctx (Cliff 1 mechanism A fix per noonghunna Genesis Issue #11).
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN16 — Lazy-reasoner request hook (per-request enable_thinking) | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN16 Lazy-reasoner request hook (per-request enable_thinking) — opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P86 — ngram batch_propose O(N*K) → O(N+K) direct-fill (vllm#40876) | opt-in only — set GENESIS_ENABLE_P86=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P86 ngram batch_propose O(N+K) direct-fill (vllm#40876 backport) — opt-in only — set GENESIS_ENABLE_P86=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P85 — Hybrid fine-shadow prefix cache (vllm#38182 followup, MambaManager fix) | opt-in only — set GENESIS_ENABLE_P85=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P85 Hybrid fine-shadow prefix cache (MambaManager fix for vllm#38182 followup) — opt-in only — set GENESIS_ENABLE_P85=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P75 — Auto-enable Suffix Decoding (Arctic Inference, vllm#25784) | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P75 Auto-enable Suffix Decoding (vllm#25784 Arctic Inference) — opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P74 — Auto chunk-clamp via long_prefill_token_threshold (P72 companion) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P74 config/scheduler.py — auto chunk-clamp via long_prefill_token_threshold] applied 1 sub-patches: p74_auto_clamp
[INFO:genesis.apply_all] [Genesis] applied: P74 Auto chunk-clamp via long_prefill_token_threshold (P72 companion) — P74 applied: SchedulerConfig auto-clamps long_prefill_token_threshold to GENESIS_PREALLOC_TOKEN_BUDGET when batched_tokens > budget. Prevents prealloc buffer overflow when running with batched=8192.
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P72 — profile_run M cap (unblocks --max-num-batched-tokens>4096 on MoE) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P72 v1/worker/gpu_model_runner.py — profile_run M cap] applied 1 sub-patches: p72_profile_run_cap
[INFO:genesis.apply_all] [Genesis] applied: P72 profile_run M cap (unblocks --max-num-batched-tokens>4096 on MoE) — P72 applied: profile_run M capped to GENESIS_PROFILE_RUN_CAP_M (default 4096). Unblocks --max-num-batched-tokens > 4096 by avoiding Dynamo fake-tensor shape mismatch in moe_align_block_size symbolic-shape inference.
[INFO:genesis.wiring.p67b_spec_verify_routing] [Genesis P67b B2] baked env at module load: USE_UPSTREAM=True NUM_KV_SPLITS=(default = self.max_num_kv_splits) (no per-dispatch env reads)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P67b — TurboQuant spec-verify forward() routing (FULL CG enable) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P67b turboquant_attn.py forward() spec-verify routing] applied 1 sub-patches: p67b_forward_spec_verify_branch
[INFO:genesis.apply_all] [Genesis] applied: P67b TurboQuant spec-verify forward() routing — P67b forward() spec-verify routing injected. K+1 batches now bypass _prefill_attention entirely (cudagraph-safe).
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P59 — Qwen3 reasoning embedded tool_call recovery | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P59 Qwen3 reasoning embedded tool_call recovery — opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P58 — Async-scheduler -1 placeholder fix | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P58 request.py — placeholder counter field] applied 2 sub-patches: p58_request_field, p58_request_num_tokens_with_spec
[INFO:genesis.wiring.text_patch] [P58 async_scheduler.py — counter-based placeholder intent] applied 1 sub-patches: p58_async_sched_assignment
[INFO:genesis.wiring.text_patch] [P58 scheduler.py — placeholder gating + new method] applied 5 sub-patches: p58_sched_spec_block, p58_sched_new_method, p58_sched_preempt, p58_sched_draft_site_a, p58_sched_draft_site_b
[INFO:genesis.apply_all] [Genesis] applied: P58 async-scheduler -1 placeholder fix — P58 backport applied: 3 files modified, 0 already-applied. Async scheduler -1 placeholder leakage fixed; spec-decode + cudagraph workloads should no longer loop or IMA. Validate with our regression smoke suite before serving traffic.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P57 — TQ spec-decode capture-safe buffers (deprecated — research artifact) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P57_SPEC_DECODE_CAPTURE_SAFE=1 only for diagno
[INFO:genesis.apply_all] [Genesis] skipped: P57 TQ spec-decode capture-safe buffers — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P57_SPEC_DECODE_CAPTURE_SAFE=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P56 — TQ spec-decode safe-path guard (deprecated — superseded by P65) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P56_SPEC_DECODE_GUARD=1 only for diagnostics
[INFO:genesis.apply_all] [Genesis] skipped: P56 TQ spec-decode safe-path guard — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P56_SPEC_DECODE_GUARD=1 only for diagnostics
[INFO:genesis.wiring.text_patch] [P44 TQ mixed-batch attn_out pool] applied 1 sub-patches: p44_mixed_attn_out_alloc
[INFO:genesis.apply_all] [Genesis] applied: P44 TQ mixed-batch attn_out pool — text-patch applied — mixed-batch attn_out routed through TurboQuantBufferManager pool (~80 MB zero-init eliminated per mixed-batch forward in multi-user serving)
[INFO:genesis.wiring.text_patch] [P46 GDN gating buffer pool] applied 2 sub-patches: p46_g_buffer, p46_beta_buffer
[INFO:genesis.apply_all] [Genesis] applied: P46 GDN gating buffer pool — text-patch applied — fused_gdn_gating now uses GdnGatingBufferManager pool (eliminates ~24k allocs/sec on Qwen3.6-35B-A3B decode)
[INFO:genesis.apply_all] [Genesis] skipped: P7b GDN dual-stream via torch.library.custom_op (opt-in) — opt-in: set GENESIS_ENABLE_P7B=1 to enable GDN dual-stream via torch.library.custom_op (graph-safe alternative to P7; validates numeric equiv + compile-cache freshness before prod)
[INFO:genesis.apply_all] [Genesis] skipped: P40 TurboQuant GQA-grouped decode stage1 (opt-in) — opt-in: set GENESIS_ENABLE_P40=1 to enable GQA-grouped decode stage1 (port of upstream PR #40792)
[INFO:genesis.wiring.p39a_fla_kkt] [Genesis P39b] vllm_config fetch failed (Current vLLM config is not set. This typically means get_current_vllm_config() was called outside of a set_current_vllm_config() context, or a CustomOp was instantiated at module import time or model forward time when config is not set. For tests that directly test custom ops/modules, use the 'default_vllm_config' pytest fixture from tests/conftest.py.); defaults used
[INFO:genesis.wiring.p39a_fla_kkt] [Genesis P39a] rebound vllm.model_executor.layers.fla.ops.chunk_scaled_dot_kkt.chunk_scaled_dot_kkt_fwd (+1 caller mods: ['vllm.model_executor.layers.fla.ops.chunk'])
[INFO:genesis.apply_all] [Genesis] applied: P39a FLA chunk_scaled_dot_kkt persistent A pool — module-level fn replaced (1 caller module(s) also rebound — pool shared across GDN layers)
[INFO:genesis.wiring.p38_tq_continuation_memory] [Genesis P38] rebound TurboQuantAttentionImpl._continuation_prefill (persistent K_full/V_full buffers replace torch.cat peak)
[INFO:genesis.apply_all] [Genesis] applied: P38 TQ _continuation_prefill persistent workspace — class method replaced (persistent K_full/V_full workspace, no .contiguous()/torch.cat transient peaks)
[INFO:genesis.apply_all] [Genesis] skipped: P37 MoE intermediate cache pool (opt-in) — opt-in only — set GENESIS_ENABLE_P37=1 to engage. Manager API is registered and usable independently of this text-patch.
[INFO:genesis.apply_all] [Genesis] skipped: P36 TurboQuant shared decode buffers — redundant: upstream PR #40798 (workspace manager) active — skip
[INFO:genesis.apply_all] [Genesis] applied: P32/P33 TurboQuant cu_2 + synth_seq_lens preallocs — cu_2 + synth_seq_lens preallocs registered (invoked from ensure_turboquant_buffers, fires during profile_run)
[INFO:genesis.prealloc_budget] [Genesis P73] token budget resolved → 4128 (via GENESIS_PREALLOC_TOKEN_BUDGET)
[INFO:genesis.wiring.text_patch] [P28 GDN core_attn_out prealloc] applied 1 sub-patches: p28_core_attn_out_alloc
[INFO:genesis.wiring.p28_gdn_core_attn] [Genesis P28] wrapped GatedDeltaNetAttention.__init__ to attach _genesis_gdn_core_attn_buf on each instance
[INFO:genesis.apply_all] [Genesis] applied: P28 GDN core_attn_out prealloc — forward_cuda patched + __init__ wrapped
[INFO:genesis.gdn_dual_stream] [GDN dual-stream] CUDA aux stream initialized
[INFO:genesis.apply_all] [Genesis P7] dispatcher ready (parallel path)
[INFO:genesis.apply_all] [Genesis] skipped: P7 GDN dual-stream in_proj parallelism — deferred — incompatible with torch.compile fullgraph (CUDA streams not SymPy-graphable); custom op implementation required. Re-enable with GENESIS_ENABLE_P7=1 + --enforce-eager.
[INFO:genesis.apply_all] [Genesis P17/P18] Marlin tuning ready: SM=(8, 6) bsm=8 num_warps=4 num_stages=3
[INFO:genesis.apply_all] [Genesis] applied: P17/P18 Marlin MoE per-SM tuning — SM=(8, 6) bsm=8
[INFO:genesis.wiring.text_patch] [P24 fused_moe num_warps/num_stages overlay] applied 2 sub-patches: p24_fp8_cfg_overlay, p24_general_cfg_overlay
[INFO:genesis.apply_all] [Genesis] applied: P24 fused_moe num_warps/num_stages overlay — num_warps / num_stages overlay wired into get_default_config (active only on Triton fused_moe path; Marlin unaffected)
[INFO:genesis.wiring.p14_block_table] [Genesis P14] rebound BlockTable.append_row + move_row (tail-zero-fill active for concurrent-request safety)
[INFO:genesis.apply_all] [Genesis] applied: P14 block_table tail zero-fill — BlockTable.append_row + move_row wrapped (effective in this process)
[INFO:genesis.tq_decode_tune] [Genesis P18b] TQ decode stage1 using upstream defaults (BLOCK_KV=4 num_warps=1 num_stages=1)
[INFO:genesis.apply_all] [Genesis] applied: P18b TurboQuant decode stage1 tune — no env override — using upstream defaults (4/1/1)
[INFO:genesis.apply_all] [Genesis P20] TQ _continuation_prefill FP16 helpers ready for TurboQuantAttentionImpl hook
[INFO:genesis.apply_all] [Genesis] applied: P20 TurboQuant continuation-prefill FP16 rotate — fp16-rotation helper ready for _continuation_prefill hook
[INFO:genesis.fp8_dispatcher] [Genesis FP8 dispatcher] SM (8, 6) → Marlin fallback required (Triton block FP8 unsupported on Ampere)
[INFO:genesis.apply_all] [Genesis] applied: P1/P2 FP8 kernel dispatcher — SM=(8, 6) → Marlin fallback path selected
[INFO:genesis.apply_all] Genesis Results: 58 applied, 40 skipped, 1 failed, 1 ⚠️ partial-apply warning(s)
[WARNING:genesis.apply_all] [Genesis] 1 partial-apply warning(s) — patch(es) failed to match expected source pattern. Review below to confirm anchor drift vs upstream change vs config issue:
[WARNING:genesis.apply_all] [Genesis] ⚠️  PN9 independent drafter attention backend (vllm#39930 backport) — upstream drift: 'spec_cfg.attention_backend' present in llm_base_proposer.py — PR #39930 (or equivalent) appears merged; PN9 self-retires (use --speculative-config.attention_backend instead)
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] apply matrix:
Patch | Status | Title                                         | Reason                                                       | Credit                        
------+--------+-----------------------------------------------+--------------------------------------------------------------+-------------------------------
P61b  | APPLY  | Qwen3 streaming partial-tag overlap guard     | opt-in env (config: neutral)                                 | ExtReMLapin (vllm#40783)      
P62   | APPLY  | Structured-output spec-decode reasoning-end t | opt-in env (config: neutral)                                 | sfbemerk (vllm#36138), ciciror
P61   | APPLY  | Qwen3 multi-tool first-occurrence (DEPRECATED | opt-in env (config: neutral)                                 | ExtReMLapin (vllm#40783) — P61
P60b  | APPLY  | GDN+ngram Triton kernel offset (Phase 2)      | opt-in env (config: neutral)                                 | tdoublep (vllm#40738)         
P60   | APPLY  | GDN+ngram state recovery (Phase 1: SSM pre-co | opt-in env (config: neutral)                                 | tdoublep (vllm#40738), bhaktat
P63   | SKIP   | MTP/Eagle drafter GDN state recovery (depreca | opt-in only AND empirically deprecated — keeping skip; set G | Genesis-original (hypothesis d
P64   | APPLY  | qwen3coder MTP streaming early-return fix     | opt-in env (config: neutral)                                 | kotori-yan (vllm#39598)       
P65   | SKIP   | TurboQuant spec-decode cudagraph downgrade    | opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWN | Genesis-original (root cause f
P66   | APPLY  | cudagraph_capture_sizes spec-decode divisibil | opt-in env (config: neutral)                                 | Genesis-original (mirrors fhl2
P68   | APPLY  | Auto force tool_choice=required for long-cont | opt-in env (config: neutral)                                 | Genesis-original (long-ctx too
P69   | APPLY  | Long-context tool-format reminder injection   | opt-in env (config: neutral)                                 | Genesis-original (long-ctx too
P70   | SKIP   | Auto-strict-ngram (force prompt_lookup_min>=8 | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to  | Genesis-original (vllm#40875 e
P67   | APPLY  | TurboQuant multi-query kernel for spec-decode | opt-in env (config: neutral)                                 | Genesis-original (proper fix f
P71   | SKIP   | Block-verify rejection sampler (Sun 2024 ICLR | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engag | Backport of vllm#40819 (Z. Gol
P78   | SKIP   | TurboQuant .tolist() capture-guard (adapted f | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1  | Adapted from noonghunna's patc
P77   | SKIP   | Adaptive ngram K controller (EMA + hysteresis | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to e | Genesis-original (port of SGLa
P79b  | SKIP   | Async × spec-decode proposer-sync backport (v | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1  | Backport of vllm#40610 (OPEN d
P79c  | SKIP   | Stale spec_token_ids cleanup for unscheduled  | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEAN | Backport of vllm#37629 (OPEN, 
P81   | SKIP   | fp8 block-scaled MM low-M decode tuning (vllm | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8 | Backport of vllm#40925 (tonyli
P82   | SKIP   | SGLang threshold_single OR-clause acceptance  | opt-in only — set GENESIS_ENABLE_P82=1 to engage             | SGLang team (sgl-project/sglan
P83   | APPLY  | MTP keep-last-cached-block (vllm#38182 downst | opt-in env (config: neutral)                                 | Root-cause analysis: vllm#3818
P84   | SKIP   | hash_block_size override (vllm#38182 actual r | opt-in only — set GENESIS_ENABLE_P84=1 to engage             | Genesis-original discovery 202
P100  | APPLY  | FlashInfer FULL CUDA graph for spec-decode (v | opt-in env (config: neutral)                                 | Backport of vllm#41127 (open 2
P101  | APPLY  | TQ continuation 64-token slicing (vllm#41123  | opt-in env (config: neutral)                                 | Selective backport of vllm#411
P99   | APPLY  | WorkspaceManager.get_simultaneous memoization | opt-in env (config: neutral)                                 | Per Sander 2026-04-28: 'if rev
P98   | APPLY  | TQ WorkspaceManager revert (vllm#40941 perf h | opt-in env (config: neutral)                                 | Reverts upstream PR #40941 (ME
P94   | APPLY  | Spec-decode prepare_next_token_ids_padded zer | opt-in env (config: neutral)                                 | Backport of vllm#41043 (wanglu
P95   | SKIP   | Marlin TP cudagraph cap on Ampere (vllm#40385 | opt-in only — set GENESIS_ENABLE_P95=1 to engage             | Backport of vllm#40385 (OPEN a
P91   | APPLY  | AutoRound row-parallel group cdiv + start-idx | opt-in env (config: neutral)                                 | Backport of non-MoE-specific p
P87   | APPLY  | Marlin W4A16/W8A16 sub-tile output dim pad-on | opt-in env (config: neutral)                                 | Backport of vllm#40361 (OPEN).
PN8   | APPLY  | MTP/draft online-quant propagation (vllm#4084 | opt-in env (config: neutral)                                 | Backport of vllm#40849 (bhoomi
PN9   | APPLY  | Independent drafter attention backend (vllm#3 | opt-in env (config: neutral)                                 | Backport of vllm#39930 (Matthe
PN11  | APPLY  | GDN a/b contiguity in fix_query_key_value_ord | opt-in env (config: neutral)                                 | Backport of vllm#41142 (Yeuvoi
PN34  | SKIP   | WorkspaceManager runtime lock relaxation (PN3 | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 | Companion to PN33 — same root 
PN33  | APPLY  | Spec-decode warmup K-aware sizing (vllm#37521 | config_detect: neutral:                                      | Backport of vllm-project/vllm#
PN32  | SKIP   | GDN _forward_core chunked-prefill v2 (Cliff 2 | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1  | Genesis-original v7.69 v2 (202
PN31  | SKIP   | FA varlen persistent out buffer (issue #15, s | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_O | Genesis-original sister patch 
PN30  | APPLY  | DS conv state layout + spec-decode AL>1 fix ( | opt-in env (config: neutral)                                 | Genesis-original fix for issue
PN29  | SKIP   | GDN chunk_o scale-fold (vllm#41446 pattern (c | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to en | Backport of vllm#41446 (zobinH
PN12  | APPLY  | FFN intermediate scratch pool — Cliff 1 fix o | opt-in env (config: neutral)                                 | Genesis-original 2026-04-29 — 
PN28  | SKIP   | merge_attn_states NaN guard (vllm#39148 backp | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 | Backport of vllm#39148 (jasonk
P15B  | APPLY  | FA varlen max_seqlen_k clamp on TQ path (Issu | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 fi
P38B  | APPLY  | P38 compile-safe in-source hook (Issue #14 fi | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 fi
PN27  | SKIP   | Revert MoERunnerInterface PluggableLayer (vll | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 | Backport of vllm#41440 (auto-g
PN26  | SKIP   | TQ unified perf pack (centroids prebake + spa | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage | Genesis-original 2026-05-01 un
PN25  | APPLY  | SiluAndMul.forward_native opaque-op pool (Cli | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 in
PN13  | APPLY  | CUDAGraphWrapper gc.collect/empty_cache lambd | opt-in env (config: neutral)                                 | Backport of vllm#41235 (roikor
PN14  | APPLY  | TQ decode IOOB safe_page_idx clamp (vllm#4007 | opt-in env (config: neutral)                                 | Backport of vllm#40074 (devara
PN23  | SKIP   | DFlash combine_hidden_states dtype cast (vllm | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to  | Backport of vllm#40334 (cipher
PN21  | SKIP   | DFlash SWA support partial backport (vllm#408 | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage | Partial backport of vllm#40898
PN22  | APPLY  | Local argmax for TP draft (vllm#39419 backpor | opt-in env (config: neutral)                                 | Backport of vllm#39419 (EanWan
PN24  | SKIP   | DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 | Backport of vllm#40727 (benchi
PN16  | SKIP   | Lazy-reasoner request hook (per-request enabl | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to eng | Genesis-original 2026-04-29. H
P86   | SKIP   | ngram batch_propose O(N*K) → O(N+K) direct-fi | opt-in only — set GENESIS_ENABLE_P86=1 to engage             | Backport of vllm#40876 (aarona
P85   | SKIP   | Hybrid fine-shadow prefix cache (vllm#38182 f | opt-in only — set GENESIS_ENABLE_P85=1 to engage             | Genesis-original 2026-04-27 — 
P75   | SKIP   | Auto-enable Suffix Decoding (Arctic Inference | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to en | Backport-enabler of vllm#25784
P74   | APPLY  | Auto chunk-clamp via long_prefill_token_thres | opt-in env (config: neutral)                                 | Genesis-original (zero-VRAM-co
P72   | APPLY  | profile_run M cap (unblocks --max-num-batched | opt-in env (config: neutral)                                 | Genesis-original (Dynamo fake-
P67b  | APPLY  | TurboQuant spec-verify forward() routing (FUL | opt-in env (config: neutral)                                 | Genesis-original (FULL CG enab
P59   | SKIP   | Qwen3 reasoning embedded tool_call recovery   | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 t | ZenoAFfectionate (vllm#39055) 
P58   | APPLY  | Async-scheduler -1 placeholder fix            | opt-in env (config: neutral)                                 | z1ying (vllm#40768)           
P57   | SKIP   | TQ spec-decode capture-safe buffers (deprecat | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40831), gdn_attn.
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated —  | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40807, #40831)   
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] validator: clean (no issues)
[INFO:genesis.apply_all] [Genesis compile-watchdog] apply_all elapsed: 8.0s
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] apply matrix:
Patch | Status | Title                                         | Reason                                                       | Credit                        
------+--------+-----------------------------------------------+--------------------------------------------------------------+-------------------------------
P61b  | APPLY  | Qwen3 streaming partial-tag overlap guard     | opt-in env (config: neutral)                                 | ExtReMLapin (vllm#40783)      
P62   | APPLY  | Structured-output spec-decode reasoning-end t | opt-in env (config: neutral)                                 | sfbemerk (vllm#36138), ciciror
P61   | APPLY  | Qwen3 multi-tool first-occurrence (DEPRECATED | opt-in env (config: neutral)                                 | ExtReMLapin (vllm#40783) — P61
P60b  | APPLY  | GDN+ngram Triton kernel offset (Phase 2)      | opt-in env (config: neutral)                                 | tdoublep (vllm#40738)         
P60   | APPLY  | GDN+ngram state recovery (Phase 1: SSM pre-co | opt-in env (config: neutral)                                 | tdoublep (vllm#40738), bhaktat
P63   | SKIP   | MTP/Eagle drafter GDN state recovery (depreca | opt-in only AND empirically deprecated — keeping skip; set G | Genesis-original (hypothesis d
P64   | APPLY  | qwen3coder MTP streaming early-return fix     | opt-in env (config: neutral)                                 | kotori-yan (vllm#39598)       
P65   | SKIP   | TurboQuant spec-decode cudagraph downgrade    | opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWN | Genesis-original (root cause f
P66   | APPLY  | cudagraph_capture_sizes spec-decode divisibil | opt-in env (config: neutral)                                 | Genesis-original (mirrors fhl2
P68   | APPLY  | Auto force tool_choice=required for long-cont | opt-in env (config: neutral)                                 | Genesis-original (long-ctx too
P69   | APPLY  | Long-context tool-format reminder injection   | opt-in env (config: neutral)                                 | Genesis-original (long-ctx too
P70   | SKIP   | Auto-strict-ngram (force prompt_lookup_min>=8 | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to  | Genesis-original (vllm#40875 e
P67   | APPLY  | TurboQuant multi-query kernel for spec-decode | opt-in env (config: neutral)                                 | Genesis-original (proper fix f
P71   | SKIP   | Block-verify rejection sampler (Sun 2024 ICLR | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engag | Backport of vllm#40819 (Z. Gol
P78   | SKIP   | TurboQuant .tolist() capture-guard (adapted f | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1  | Adapted from noonghunna's patc
P77   | SKIP   | Adaptive ngram K controller (EMA + hysteresis | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to e | Genesis-original (port of SGLa
P79b  | SKIP   | Async × spec-decode proposer-sync backport (v | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1  | Backport of vllm#40610 (OPEN d
P79c  | SKIP   | Stale spec_token_ids cleanup for unscheduled  | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEAN | Backport of vllm#37629 (OPEN, 
P81   | SKIP   | fp8 block-scaled MM low-M decode tuning (vllm | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8 | Backport of vllm#40925 (tonyli
P82   | SKIP   | SGLang threshold_single OR-clause acceptance  | opt-in only — set GENESIS_ENABLE_P82=1 to engage             | SGLang team (sgl-project/sglan
P83   | APPLY  | MTP keep-last-cached-block (vllm#38182 downst | opt-in env (config: neutral)                                 | Root-cause analysis: vllm#3818
P84   | SKIP   | hash_block_size override (vllm#38182 actual r | opt-in only — set GENESIS_ENABLE_P84=1 to engage             | Genesis-original discovery 202
P100  | APPLY  | FlashInfer FULL CUDA graph for spec-decode (v | opt-in env (config: neutral)                                 | Backport of vllm#41127 (open 2
P101  | APPLY  | TQ continuation 64-token slicing (vllm#41123  | opt-in env (config: neutral)                                 | Selective backport of vllm#411
P99   | APPLY  | WorkspaceManager.get_simultaneous memoization | opt-in env (config: neutral)                                 | Per Sander 2026-04-28: 'if rev
P98   | APPLY  | TQ WorkspaceManager revert (vllm#40941 perf h | opt-in env (config: neutral)                                 | Reverts upstream PR #40941 (ME
P94   | APPLY  | Spec-decode prepare_next_token_ids_padded zer | opt-in env (config: neutral)                                 | Backport of vllm#41043 (wanglu
P95   | SKIP   | Marlin TP cudagraph cap on Ampere (vllm#40385 | opt-in only — set GENESIS_ENABLE_P95=1 to engage             | Backport of vllm#40385 (OPEN a
P91   | APPLY  | AutoRound row-parallel group cdiv + start-idx | opt-in env (config: neutral)                                 | Backport of non-MoE-specific p
P87   | APPLY  | Marlin W4A16/W8A16 sub-tile output dim pad-on | opt-in env (config: neutral)                                 | Backport of vllm#40361 (OPEN).
PN8   | APPLY  | MTP/draft online-quant propagation (vllm#4084 | opt-in env (config: neutral)                                 | Backport of vllm#40849 (bhoomi
PN9   | APPLY  | Independent drafter attention backend (vllm#3 | opt-in env (config: neutral)                                 | Backport of vllm#39930 (Matthe
PN11  | APPLY  | GDN a/b contiguity in fix_query_key_value_ord | opt-in env (config: neutral)                                 | Backport of vllm#41142 (Yeuvoi
PN34  | SKIP   | WorkspaceManager runtime lock relaxation (PN3 | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 | Companion to PN33 — same root 
PN33  | APPLY  | Spec-decode warmup K-aware sizing (vllm#37521 | config_detect: neutral:                                      | Backport of vllm-project/vllm#
PN32  | SKIP   | GDN _forward_core chunked-prefill v2 (Cliff 2 | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1  | Genesis-original v7.69 v2 (202
PN31  | SKIP   | FA varlen persistent out buffer (issue #15, s | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_O | Genesis-original sister patch 
PN30  | APPLY  | DS conv state layout + spec-decode AL>1 fix ( | opt-in env (config: neutral)                                 | Genesis-original fix for issue
PN29  | SKIP   | GDN chunk_o scale-fold (vllm#41446 pattern (c | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to en | Backport of vllm#41446 (zobinH
PN12  | APPLY  | FFN intermediate scratch pool — Cliff 1 fix o | opt-in env (config: neutral)                                 | Genesis-original 2026-04-29 — 
PN28  | SKIP   | merge_attn_states NaN guard (vllm#39148 backp | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 | Backport of vllm#39148 (jasonk
P15B  | APPLY  | FA varlen max_seqlen_k clamp on TQ path (Issu | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 fi
P38B  | APPLY  | P38 compile-safe in-source hook (Issue #14 fi | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 fi
PN27  | SKIP   | Revert MoERunnerInterface PluggableLayer (vll | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 | Backport of vllm#41440 (auto-g
PN26  | SKIP   | TQ unified perf pack (centroids prebake + spa | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage | Genesis-original 2026-05-01 un
PN25  | APPLY  | SiluAndMul.forward_native opaque-op pool (Cli | opt-in env (config: neutral)                                 | Genesis-original 2026-05-01 in
PN13  | APPLY  | CUDAGraphWrapper gc.collect/empty_cache lambd | opt-in env (config: neutral)                                 | Backport of vllm#41235 (roikor
PN14  | APPLY  | TQ decode IOOB safe_page_idx clamp (vllm#4007 | opt-in env (config: neutral)                                 | Backport of vllm#40074 (devara
PN23  | SKIP   | DFlash combine_hidden_states dtype cast (vllm | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to  | Backport of vllm#40334 (cipher
PN21  | SKIP   | DFlash SWA support partial backport (vllm#408 | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage | Partial backport of vllm#40898
PN22  | APPLY  | Local argmax for TP draft (vllm#39419 backpor | opt-in env (config: neutral)                                 | Backport of vllm#39419 (EanWan
PN24  | SKIP   | DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 | Backport of vllm#40727 (benchi
PN16  | SKIP   | Lazy-reasoner request hook (per-request enabl | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to eng | Genesis-original 2026-04-29. H
P86   | SKIP   | ngram batch_propose O(N*K) → O(N+K) direct-fi | opt-in only — set GENESIS_ENABLE_P86=1 to engage             | Backport of vllm#40876 (aarona
P85   | SKIP   | Hybrid fine-shadow prefix cache (vllm#38182 f | opt-in only — set GENESIS_ENABLE_P85=1 to engage             | Genesis-original 2026-04-27 — 
P75   | SKIP   | Auto-enable Suffix Decoding (Arctic Inference | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to en | Backport-enabler of vllm#25784
P74   | APPLY  | Auto chunk-clamp via long_prefill_token_thres | opt-in env (config: neutral)                                 | Genesis-original (zero-VRAM-co
P72   | APPLY  | profile_run M cap (unblocks --max-num-batched | opt-in env (config: neutral)                                 | Genesis-original (Dynamo fake-
P67b  | APPLY  | TurboQuant spec-verify forward() routing (FUL | opt-in env (config: neutral)                                 | Genesis-original (FULL CG enab
P59   | SKIP   | Qwen3 reasoning embedded tool_call recovery   | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 t | ZenoAFfectionate (vllm#39055) 
P58   | APPLY  | Async-scheduler -1 placeholder fix            | opt-in env (config: neutral)                                 | z1ying (vllm#40768)           
P57   | SKIP   | TQ spec-decode capture-safe buffers (deprecat | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40831), gdn_attn.
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated —  | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40807, #40831)

# club-3090 rig report

Generated: 2026-05-03 20:00:58 UTC

_Redacted output (paths, host, user, tokens). Re-run with `--no-redact` for full data._

## System

- **OS:** Debian GNU/Linux 13 (trixie)
- **Kernel:** 6.17.2-2-pve
- **Environment:** bare metal
- **Locale:** en_US.UTF-8
- **Timezone:** CEST
- **Uptime:** up 3 days, 8 hours, 59 minutes

## CPU + RAM

- **CPU:** AMD Ryzen 9 5950X 16-Core Processor (32 threads)
- **RAM:** 125Gi total, 32Gi available

## Disk

- **/spinning/llm_stuff/club-3090/models-cache:** 1.7T available, zfs filesystem
- **/var/lib/docker:** 1.7T available, zfs filesystem

## GPU hardware

- **GPU 0:** NVIDIA GeForce RTX 3080 | 20480 MiB | driver 595.58.03 | VBIOS 94.02.27.00.14 | persistence=Disabled
  - **Power:** limit=320.00 W (default=320.00 W, max=320.00 W) | current_draw=13.55 W
  - **PCIe:** x4 lanes negotiated (GPU max x16, Gen up to 4) | bus 00000000:05:00.0 ⚠ slot is narrower than GPU capability — affects load + all-reduce bandwidth
- **GPU 1:** NVIDIA GeForce RTX 3080 | 20480 MiB | driver 595.58.03 | VBIOS 94.02.27.00.14 | persistence=Disabled
  - **Power:** limit=320.00 W (default=320.00 W, max=320.00 W) | current_draw=17.42 W
  - **PCIe:** x8 lanes negotiated (GPU max x16, Gen up to 4) | bus 00000000:0B:00.0 ⚠ slot is narrower than GPU capability — affects load + all-reduce bandwidth
- **CUDA Runtime (per driver):** 13.2
- **ECC mode:** [N/A] (3090s don't have ECC; expect N/A)

### NVLink

_No NVLink detected (PCIe-only)_

### Topology

<details><summary>PCIe / GPU topology matrix</summary>

GPU0	GPU1	NIC0	NIC1	NIC2	NIC3	NIC4	NIC5	CPU Affinity	NUMA Affinity	GPU NUMA ID

GPU0 X PHB PIX PIX PIX PIX PIX PIX 0-31 0 N/A
GPU1 PHB X PHB PHB PHB PHB PHB PHB 0-31 0 N/A
NIC0 PIX PHB X PIX PIX PIX PIX PIX
NIC1 PIX PHB PIX X PIX PIX PIX PIX
NIC2 PIX PHB PIX PIX X PIX PIX PIX
NIC3 PIX PHB PIX PIX PIX X PIX PIX
NIC4 PIX PHB PIX PIX PIX PIX X PIX
NIC5 PIX PHB PIX PIX PIX PIX PIX X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: rocep4s0f0
NIC1: rocep4s0f1
NIC2: rocep4s0f0v1
NIC3: rocep4s0f0v2
NIC4: rocep4s0f1v1
NIC5: rocep4s0f1v2


</details>

### Full nvidia-smi

<details><summary>Full nvidia-smi output</summary>

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+


</details>

## Display / desktop state

- **$DISPLAY:** localhost:12.0 (X11 / Wayland session present)
- **Display processes running:** none detected
- **GPU 0 idle VRAM:** 0 MiB ✓
- **GPU 1 idle VRAM:** 0 MiB ✓

## Container runtime

- **Docker:** 29.0.0
- **docker compose (v2):** 5.0.0
- **NVIDIA Container Toolkit:** 1.13.5

## Stack version

- **club-3090:** `7be8ecc` (branch: `master`)
- **Working tree:** ⚠ has uncommitted changes (run `git status` to inspect)
- **GENESIS_PIN default:** `2db18df` (per scripts/setup.sh)
- **Cached vLLM images:**
  - tag `nightly` digest `sha256:82ef13585cd058276fdeb1ed01c343c3d9fbc76de7168ff9f4b6159e5c7ad99f` (3 days ago)
  - tag `nightly-7a1eb8ac2ec4ea69338c51dc7afd4b15010abfa8` digest `sha256:7923b48047be655ab8a18f2d152f7a67a03acb80866ee7cae298fea13c7ac9c7` (5 days ago)
  - tag `latest` digest `sha256:8672d9356d4f4474695fd69ef56531d9e482517da3b31feb9c975689332a4fb0` (16 months ago)

## Active container

_No vLLM container running. Start one with `bash scripts/launch.sh` and re-run for the full report._

---

_Generated by `bash scripts/report.sh`. Add `--verify` for verify-full output, `--bench` for canonical TPS bench, `--full` for both. Use `--no-redact` to disable redaction (internal sharing only)._

Tested again with a clean full setup, but the problem persists.
Staying on 95b2c3b in the meantime.

noonghunna · 2026-05-03T15:25:01Z

noonghunna
May 3, 2026
Maintainer

Heads-up @efschu — separate from your dual-turbo boot crash above (still need that full log to diagnose), one finding worth flagging in case it intersects: we just reproduced @GuiPerPT's club-3090#41 Cliff 2 OOM on single-card traffic. Both shipping single-card variants fail at ~21-26K accumulated context. Detailed writeup: #41 comment.

For your dual-3080 setup running dual-turbo (TP=2): activations should be split across both cards, so this single-card ceiling probably doesn't affect you directly. If you can get past the boot crash and run `SOAK_MODE=continuous SOAK_SESSIONS=5 bash scripts/soak-test.sh` against dual-turbo, that would be useful cross-rig validation that TP=2 escapes the cliff. But please prioritize getting your boot fixed first, and let me know if the diagnosis paths I posted earlier surfaced anything.

Codex investigation queued for the kernel-level fix. Updates here as we learn more.


efschu · 2026-05-04T06:44:18Z

efschu
May 4, 2026

Solved by reading this line:
[ERROR:genesis.apply_all] [Genesis] FAILED: P87 Marlin sub-tile output dim pad-on-load (vllm#40361 backport) — P87 marlin.py — sub-tile output dim pad-on-load (vllm#40361): write_error ([Errno 30] Read-only file system: '/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/mixed_precision/marlin.py')
I changed the mount from ro to rw, and now it starts. Next I will run some benches/tests.
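
The error above is a plain write failure on a read-only mount, so it can be caught before boot. A minimal pre-flight sketch, assuming the patcher rewrites files in place; the helper name `find_unwritable` is hypothetical and not part of Genesis or club-3090:

```python
import os


def find_unwritable(paths):
    """Return the paths the current process cannot write to.

    os.access(path, os.W_OK) is False for a file on a read-only mount
    (the "[Errno 30] Read-only file system" case in the log) and also
    for a path that does not exist.
    """
    return [p for p in paths if not os.access(p, os.W_OK)]


if __name__ == "__main__":
    # Path copied from the error line above; run this inside the container.
    targets = [
        "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/"
        "kernels/linear/mixed_precision/marlin.py",
    ]
    for path in find_unwritable(targets):
        print(f"not writable: {path}")
```

Run inside the container, this would list the file from the log while the volume is still mounted ro and print nothing once it is rw.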

Edit:
========== NARRATIVE (prompt=65 chars, max_tokens=1000) ==========
=== warmups (3) ===
warm-1 wall= 12.35s ttft= 275ms toks=1000 wall_TPS= 80.97 decode_TPS= 82.81
warm-2 wall= 11.47s ttft= 64ms toks= 984 wall_TPS= 85.81 decode_TPS= 86.29
warm-3 wall= 11.73s ttft= 88ms toks= 994 wall_TPS= 84.72 decode_TPS= 85.36

=== measured (5) ===
run-1 wall= 11.16s ttft= 90ms toks=1000 wall_TPS= 89.63 decode_TPS= 90.36
run-2 wall= 12.29s ttft= 62ms toks=1000 wall_TPS= 81.40 decode_TPS= 81.81
run-3 wall= 11.16s ttft= 90ms toks= 895 wall_TPS= 80.20 decode_TPS= 80.85
run-4 wall= 12.23s ttft= 92ms toks= 990 wall_TPS= 80.93 decode_TPS= 81.54
run-5 wall= 11.91s ttft= 88ms toks=1000 wall_TPS= 83.98 decode_TPS= 84.61

=== summary [narrative] (n=5) ===
wall_TPS mean= 83.23 std= 3.85 CV= 4.6% min=80.20 max=89.63
decode_TPS mean= 83.83 std= 3.92 CV= 4.7% min=80.85 max=90.36
TTFT mean= 84ms std= 12ms min=62ms max=92ms

========== CODE (prompt=78 chars, max_tokens=800) ==========
=== warmups (3) ===
warm-1 wall= 3.63s ttft= 87ms toks= 394 wall_TPS=108.53 decode_TPS=111.20
warm-2 wall= 4.54s ttft= 88ms toks= 462 wall_TPS=101.69 decode_TPS=103.70
warm-3 wall= 5.03s ttft= 89ms toks= 544 wall_TPS=108.06 decode_TPS=110.00

=== measured (5) ===
run-1 wall= 7.98s ttft= 88ms toks= 800 wall_TPS=100.30 decode_TPS=101.42
run-2 wall= 7.58s ttft= 69ms toks= 800 wall_TPS=105.50 decode_TPS=106.46
run-3 wall= 5.56s ttft= 87ms toks= 616 wall_TPS=110.84 decode_TPS=112.60
run-4 wall= 4.43s ttft= 88ms toks= 469 wall_TPS=105.84 decode_TPS=107.99
run-5 wall= 5.71s ttft= 88ms toks= 595 wall_TPS=104.14 decode_TPS=105.76

=== summary [code] (n=5) ===
wall_TPS mean= 105.32 std= 3.79 CV= 3.6% min=100.30 max=110.84
decode_TPS mean= 106.85 std= 4.04 CV= 3.8% min=101.42 max=112.60
TTFT mean= 84ms std= 8ms min=69ms max=88ms

=== GPU state ===
0, 96 %, 15886 MiB, 20480 MiB, 249.18 W, 78
1, 98 %, 15886 MiB, 20480 MiB, 249.09 W, 82

It is working again.
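
For anyone cross-checking the summary lines: the reported figures are consistent with the arithmetic mean, the sample (n-1) standard deviation, and CV = std / mean. The sketch below recomputes the narrative `wall_TPS` summary from the five measured runs. The bench script's own implementation is not shown in this thread, so this is a reconstruction, not its actual code.

```python
import statistics

# wall_TPS of the five measured NARRATIVE runs, copied from the output above.
wall_tps = [89.63, 81.40, 80.20, 80.93, 83.98]


def summarize(samples):
    """Return (mean, sample std, coefficient of variation in percent)."""
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)  # sample standard deviation (divides by n-1)
    return mean, std, 100.0 * std / mean


mean, std, cv = summarize(wall_tps)
print(f"wall_TPS mean={mean:.2f} std={std:.2f} CV={cv:.1f}%")
# prints: wall_TPS mean=83.23 std=3.85 CV=4.6%
```

The population (divide-by-n) standard deviation would come out near 3.45 here, so the 3.85 in the summary points to the sample form.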


efschu · 2026-05-04T07:32:36Z

efschu
May 4, 2026

But the last check fails:

scripts/verify-stress.sh 
Running STRESS / boundary test against http://localhost:8011 (model=qwen3.6-27b-autoround, container=vllm-qwen36-27b)
  This script does the heavy stuff (longctx needle ladder + ~25K-token tool prefill).
  For the fast functional smoke (~2 min), use verify-full.sh instead.

[1/7] Long-context needle small rungs (10K / 30K) ...
    ✓   9820 tokens: recalled 'crimson iguana 36' (got: crimson iguana 36 )
    ✓  29318 tokens: recalled 'golden otter 64' (got: golden otter 64 )
  ✓ all long-ctx depths recalled secret correctly
[2/7] Tool response prefill OOM (~25K-token mock tool response) ...
  ✓ tool prefill OK — model emitted 1 tool_call(s) (finish=tool_calls, prefill survived)
[3/7] IDE-agent one-shot prompt (sys + tool schemas + user request) ...
  ✓ IDE-agent one-shot OK — 109 completion tokens (195 chars), finish=stop
[4/7] Multi-turn agent prompt (sys + tools + 4-turn history) ...
  ✓ multi-turn agent OK
[5/7] LCB-coding shape (LeetCode-style problem + structured plan) ...
  ✓ LCB-coding shape OK
[6/7] Reasoning-heavy (math problem + max_tokens=8192) ...
  ✓ reasoning-heavy OK — 8192 completion tokens
[7/7] Long-context needle large rungs (60K / 90K — Cliff 2 territory) ...
    ✓  58570 tokens: recalled 'sapphire narwhal 90' (got: sapphire narwhal 90 )
    ✗ scale=1400: HTTP 000 (request failed)
  ✗ partial recall — some in-budget depths failed
    → Attention quality degrades at longer contexts on this config OR the deployment crashed mid-test. Check docker logs.

1 stress check(s) failed. See hints above.


noonghunna · 2026-05-04T11:57:15Z

noonghunna
May 4, 2026
Maintainer

@efschu — that probe 7 fail at 90K matches your full bug report at #47 (same rig, same finding). Just answered the mechanism + recommended workarounds there in detail.

TL;DR for this discussion thread: 90K is at or past the Cliff 2 boundary on 2× 20 GB Ampere even on TP=2 with dual-turbo — DeltaNet GDN forward buffer + activation peak exceeds the per-card budget after split when the card has 4 GB less than a 3090. The "PCIe bandwidth collapse" you're seeing on nvtop is the kernel hanging mid-prefill (allocator wall → all-reduce spins), same cause as the OOM observed at a different layer.

What you've found defines the working envelope precisely: passes 60K, fails 90K on dual-turbo with turboquant_k8v4 KV. That's a real cross-rig data point — would be valuable as a Numbers from your rig submission once you've benched the working 60K config.
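
If a tighter ceiling than "between 60K and 90K" is wanted, the bracket can be narrowed by bisection. A minimal sketch, assuming the pass/fail behaviour is monotonic in context length; `bisect_ceiling` and the `passes` callback are hypothetical names, and the cliff value in the demo is made up purely to exercise the function:

```python
def bisect_ceiling(passes, lo, hi, step=1024):
    """Narrow a known-good `lo` and known-bad `hi` to within `step` tokens.

    Assumes passes(lo) is True, passes(hi) is False, and that every length
    above the first failure also fails. Returns the largest length that
    was observed to pass.
    """
    while hi - lo > step:
        mid = (lo + hi) // 2
        if passes(mid):
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    # Stand-in probe with an invented cliff; a real probe would send a
    # needle prompt of n tokens and report whether the request succeeded.
    fake_cliff = 72_000
    ceiling = bisect_ceiling(lambda n: n < fake_cliff, lo=60_000, hi=90_000)
    print(f"largest passing length found: {ceiling}")
```

A failing probe on a real deployment may hang or crash the container (as the HTTP 000 above suggests), so a real `passes` callback would likely need to restart the service between attempts.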

Continuing on #47 for the bug-tracking thread.

