Initial data from 5090 run #51

apnar · 2026-05-04T17:27:57Z

apnar
May 4, 2026

As requested in one of the other discussions here is some initial output from a rtx 5090. I'm running on microk8s but using a container with the same parameters as:

models/qwen3.6-27b/vllm/compose/docker-compose.yml

Output of scripts/report.sh --full:

club-3090 rig report

Generated: 2026-05-04 17:02:23 UTC

Redacted output (paths, host, user, tokens). Re-run with --no-redact for full data.

System

OS: Ubuntu 26.04 LTS
Kernel: 7.0.0-15-generic
Environment: bare metal
Locale: en_US.UTF-8
Timezone: EDT
Uptime: up 14 minutes

CPU + RAM

CPU: 12th Gen Intel(R) Core(TM) i9-12900K (24 threads)
RAM: 123Gi total, 78Gi available
Swap: 8.0Gi

Disk

/k8s/club-3090/models-cache: 576G available, zfs filesystem

GPU hardware

GPU 0: NVIDIA GeForce RTX 5090 | 32607 MiB | driver 580.142 | VBIOS 98.02.2E.80.39 | persistence=Disabled
- Power: limit=600.00 W (default=600.00 W, max=600.00 W) | current_draw=5.46 W
- PCIe: x16 lanes negotiated (GPU max x16, Gen up to 5) | bus 00000000:01:00.0
CUDA Runtime (per driver): 13.0
ECC mode: [N/A] (3090s don't have ECC; expect N/A)

NVLink

No NVLink detected (PCIe-only)

Topology

PCIe / GPU topology matrix

	GPU0	NIC0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	PHB	0-23	0		N/A
NIC0	PHB	 X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx4_0

Full nvidia-smi

Full nvidia-smi output

Mon May  4 13:02:25 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.142                Driver Version: 580.142        CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        Off |   00000000:01:00.0 Off |                  N/A |
|  0%   35C    P8              6W /  600W |   28452MiB /  32607MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           28688      C   VLLM::EngineCore                      28442MiB |
+-----------------------------------------------------------------------------------------+

Display / desktop state

$DISPLAY: unset (headless)
Display processes running: none detected
GPU 0 idle VRAM: 28452 MiB ⚠ something is using this GPU (display, browser, container)

Container runtime

Docker: not installed

Stack version

club-3090: 011d4cc (branch: master)
Working tree: ⚠ has uncommitted changes (run git status to inspect)
GENESIS_PIN default: 2db18df (per scripts/setup.sh)

Active container

No vLLM container running. Start one with bash scripts/launch.sh and re-run for the full report.

verify-full.sh output

verify-full output

Running FULL functional test against http://localhost:8020 (model=qwen3.6-27b-autoround, container=vllm-qwen36-27b)

[1/8] Server reachable on /v1/models ...
  ✓ server is serving
[2/8] Genesis patches applied ...
  ⊘ docker not in PATH (skipped)
[3/8] Basic completion — capital of France ...
  ✓ reply contains 'Paris'
[4/8] Tool calling ...
  ✓ tool_calls[] populated with get_weather
[5/8] Streaming (SSE) ...
  ✓ streamed 10 chunks, 79 chars:  Logic flows wrong, Red text appears on the screen, Found the missing semicolon. ...
[6/8] Thinking / reasoning mode ...
  ✓ reasoning 866 chars, content 4 chars (finish=stop)
    reasoning: Here's a thinking process:  1.  **Analyze User Input:**    -...
    content:     4....
[7/8] Output quality / cascade detection (2K-token completion) ...
  ✓ output OK — 9395 chars, variety=0.665, max_line_repeat=0, finish=stop
[8/8] MTP acceptance length threshold ...
  ⊘ docker not in PATH (skipped)

All checks passed. Stack is ready for full-functionality use.

verify-stress.sh output

verify-stress output (7 boundary checks incl. Cliff 2 needle recall)

Running STRESS / boundary test against http://localhost:8020 (model=qwen3.6-27b-autoround, container=vllm-qwen36-27b)
  This script does the heavy stuff (longctx needle ladder + ~25K-token tool prefill).
  For the fast functional smoke (~2 min), use verify-full.sh instead.

[1/7] Long-context needle small rungs (10K / 30K) ...
    ✓   9820 tokens: recalled 'sapphire platypus 50' (got: sapphire platypus 50 )
    ✓  29320 tokens: recalled 'sapphire narwhal 66' (got: sapphire narwhal 66 )
  ✓ all long-ctx depths recalled secret correctly
[2/7] Tool response prefill OOM (~25K-token mock tool response) ...
  ✓ tool prefill OK — text response (657 chars, finish=stop)
[3/7] IDE-agent one-shot prompt (sys + tool schemas + user request) ...
  ✓ IDE-agent one-shot OK — 66 completion tokens (114 chars), finish=stop
[4/7] Multi-turn agent prompt (sys + tools + 4-turn history) ...
  ✓ multi-turn agent OK
[5/7] LCB-coding shape (LeetCode-style problem + structured plan) ...
  ✓ LCB-coding shape OK
[6/7] Reasoning-heavy (math problem + max_tokens=8192) ...
  ✓ reasoning-heavy OK — 8192 completion tokens
[7/7] Long-context needle large rungs (60K / 90K — Cliff 2 territory) ...
    ⊘ scale=900: HTTP 400 (exceeds --max-model-len, expected — clean rejection)
    ⊘ scale=1400: HTTP 400 (exceeds --max-model-len, expected — clean rejection)
  ⊘ all depths above --max-model-len (deployed=48000); shrink ladder or raise ctx (skipped)

All stress / boundary checks passed. KV-cache and prefill paths are sound for the deployed config.

soak-test.sh (SOAK_MODE=continuous) output

soak-test stdout (5-session × 5-turn ramping conversation, ~25 min)

[soak] ERROR: 'docker' not found in PATH

_soak summary.md not produced — check stdout above_

bench.sh output

bench output (3 warmups + 5 measured per prompt)


========== NARRATIVE (prompt=65 chars, max_tokens=1000) ==========
=== warmups (3) ===
  warm-1     wall=  8.30s  ttft=    82ms  toks=1000  wall_TPS=120.43  decode_TPS=121.63
  warm-2     wall=  8.30s  ttft=    72ms  toks=1000  wall_TPS=120.52  decode_TPS=121.58
  warm-3     wall=  8.56s  ttft=    73ms  toks=1000  wall_TPS=116.88  decode_TPS=117.89

=== measured (5) ===
  run-1      wall=  8.44s  ttft=    72ms  toks=1000  wall_TPS=118.44  decode_TPS=119.47
  run-2      wall=  8.11s  ttft=    72ms  toks=1000  wall_TPS=123.37  decode_TPS=124.49
  run-3      wall=  8.00s  ttft=    73ms  toks=1000  wall_TPS=124.99  decode_TPS=126.13
  run-4      wall=  8.19s  ttft=    74ms  toks=1000  wall_TPS=122.11  decode_TPS=123.23
  run-5      wall=  8.02s  ttft=    73ms  toks=1000  wall_TPS=124.70  decode_TPS=125.84

=== summary [narrative] (n=5) ===
  wall_TPS       mean= 122.72   std=  2.65   CV= 2.2%   min=118.44   max=124.99
  decode_TPS     mean= 123.83   std=  2.70   CV= 2.2%   min=119.47   max=126.13
  TTFT          mean=    73ms  std=    1ms  min=72ms  max=74ms

========== CODE (prompt=78 chars, max_tokens=800) ==========
=== warmups (3) ===
  warm-1     wall=  4.92s  ttft=    72ms  toks= 800  wall_TPS=162.72  decode_TPS=165.15
  warm-2     wall=  4.60s  ttft=    72ms  toks= 742  wall_TPS=161.34  decode_TPS=163.91
  warm-3     wall=  4.86s  ttft=    74ms  toks= 744  wall_TPS=153.23  decode_TPS=155.61

=== measured (5) ===
  run-1      wall=  4.90s  ttft=    72ms  toks= 800  wall_TPS=163.23  decode_TPS=165.67
  run-2      wall=  4.69s  ttft=    72ms  toks= 800  wall_TPS=170.73  decode_TPS=173.41
  run-3      wall=  4.51s  ttft=    72ms  toks= 737  wall_TPS=163.35  decode_TPS=166.00
  run-4      wall=  5.20s  ttft=    72ms  toks= 800  wall_TPS=153.89  decode_TPS=156.05
  run-5      wall=  4.30s  ttft=    72ms  toks= 689  wall_TPS=160.24  decode_TPS=162.98

=== summary [code] (n=5) ===
  wall_TPS       mean= 162.29   std=  6.08   CV= 3.7%   min=153.89   max=170.73
  decode_TPS     mean= 164.82   std=  6.25   CV= 3.8%   min=156.05   max=173.41
  TTFT          mean=    72ms  std=    0ms  min=72ms  max=72ms

=== GPU state ===
0, 97 %, 29832 MiB, 32607 MiB, 434.06 W, 61

Generated by bash scripts/report.sh. Flags: --verify (verify-full), --stress (verify-stress 7/7 incl. Cliff 2 needles), --soak (SOAK_MODE=continuous, catches Cliff 2b), --bench (canonical TPS), --full (all four, ~35 min). Use --no-redact to disable redaction (internal sharing only).

apnar · 2026-05-04T17:29:20Z

apnar
May 4, 2026
Author

As a comparison here is the bench output from the exact same system with a RTX 3090ti:

bench.sh output

bench output (3 warmups + 5 measured per prompt)


========== NARRATIVE (prompt=65 chars, max_tokens=1000) ==========
=== warmups (3) ===
  warm-1     wall= 12.71s  ttft=    94ms  toks=1000  wall_TPS= 78.71  decode_TPS= 79.29
  warm-2     wall= 12.99s  ttft=    88ms  toks=1000  wall_TPS= 76.98  decode_TPS= 77.50
  warm-3     wall= 13.23s  ttft=    88ms  toks=1000  wall_TPS= 75.57  decode_TPS= 76.08

=== measured (5) ===
  run-1      wall= 12.86s  ttft=    85ms  toks=1000  wall_TPS= 77.76  decode_TPS= 78.28
  run-2      wall= 12.99s  ttft=    85ms  toks=1000  wall_TPS= 76.97  decode_TPS= 77.48
  run-3      wall= 12.92s  ttft=    85ms  toks=1000  wall_TPS= 77.38  decode_TPS= 77.90
  run-4      wall= 13.06s  ttft=    89ms  toks=1000  wall_TPS= 76.58  decode_TPS= 77.11
  run-5      wall= 12.93s  ttft=    85ms  toks=1000  wall_TPS= 77.33  decode_TPS= 77.84

=== summary [narrative] (n=5) ===
  wall_TPS       mean=  77.21   std=  0.45   CV= 0.6%   min=76.58   max=77.76
  decode_TPS     mean=  77.72   std=  0.45   CV= 0.6%   min=77.11   max=78.28
  TTFT          mean=    86ms  std=    2ms  min=85ms  max=89ms

========== CODE (prompt=78 chars, max_tokens=800) ==========
=== warmups (3) ===
  warm-1     wall=  7.07s  ttft=    87ms  toks= 735  wall_TPS=104.01  decode_TPS=105.31
  warm-2     wall=  7.01s  ttft=    89ms  toks= 716  wall_TPS=102.17  decode_TPS=103.48
  warm-3     wall=  4.76s  ttft=    88ms  toks= 484  wall_TPS=101.73  decode_TPS=103.65

=== measured (5) ===
  run-1      wall=  6.63s  ttft=    88ms  toks= 676  wall_TPS=101.96  decode_TPS=103.33
  run-2      wall=  4.97s  ttft=    88ms  toks= 499  wall_TPS=100.40  decode_TPS=102.21
  run-3      wall=  7.95s  ttft=    89ms  toks= 781  wall_TPS= 98.25  decode_TPS= 99.36
  run-4      wall=  7.00s  ttft=    88ms  toks= 726  wall_TPS=103.64  decode_TPS=104.97
  run-5      wall=  8.13s  ttft=    88ms  toks= 800  wall_TPS= 98.44  decode_TPS= 99.51

=== summary [code] (n=5) ===
  wall_TPS       mean= 100.54   std=  2.31   CV= 2.3%   min=98.25   max=103.64
  decode_TPS     mean= 101.88   std=  2.43   CV= 2.4%   min=99.36   max=104.97
  TTFT          mean=    88ms  std=    1ms  min=88ms  max=89ms

=== GPU state ===
0, 99 %, 22102 MiB, 24564 MiB, 405.99 W, 68
1, 0 %, 2189 MiB, 7680 MiB, 26.84 W, 77

0 replies

apnar · 2026-05-04T17:30:11Z

apnar
May 4, 2026
Author

I'd love some suggestions on tweaks to the compose file that I can test to make better use of the 5090 and remove some of the needed workarounds that are in place from the 3090.

0 replies

noonghunna · 2026-05-04T17:38:56Z

noonghunna
May 4, 2026
Maintainer

Massive thanks @apnar — first 5090 (Blackwell sm_120) data point on club-3090. Quick read on what your numbers show, then concrete tweaks to chase.

What the data says (5090 vs 3090 Ti, same system)

Metric	3090 Ti (Ampere sm_86)	RTX 5090 (Blackwell sm_120)	Ratio
Narrative wall TPS (mean of 5)	77.21	122.72	1.59×
Code wall TPS (mean of 5)	100.54	162.29	1.61×
TTFT	86 ms	73 ms	0.85×
Power during decode	405 W	434 W	1.07×
VRAM peak (single card)	22.1 GB / 24 GB	29.8 GB / 32 GB	n/a
verify-stress probes 1-6	(not run here)	✅ all pass	—

~1.6× speedup is roughly the memory-bandwidth ratio (5090's 1.79 TB/s vs 3090 Ti's 1.01 TB/s = 1.77×) which is the right shape for an INT4-quantized 27B that's bandwidth-bound on weight reads. You're getting most of what the hardware promises out of an unmodified-for-Blackwell config — meaning the workarounds we ship for Ampere aren't catastrophically counterproductive on Blackwell, but several of them can be removed or replaced for further gain. Concrete suggestions in priority order:

Tweak A (likely +5-15%): drop the Ampere-specific Marlin RO-mount

The compose mounts ../patches/vllm-marlin-pad/marlin.py over the in-image vLLM file (lines 53-54 of dual-turbo.yml, similar pattern in dual.yml). That patch is our local backport of vllm#40361 — the Marlin pad-sub-tile-n fix that Ampere needs for AutoRound INT4. Blackwell's Marlin tiling logic is different and may not need this fix.

Try:

# In your override compose, comment out or remove the two RO mounts:
# - ../patches/vllm-marlin-pad/marlin.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/mixed_precision/marlin.py:ro
# - ../patches/vllm-marlin-pad/MPLinearKernel.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/kernels/linear/mixed_precision/MPLinearKernel.py:ro

Boot. If it boots clean and bench.sh numbers don't regress vs your current run, the patch was unnecessary on Blackwell — drop it. If you hit a pad_sub_tile_n shape error or accuracy regression, it's still needed even on Blackwell and this becomes useful upstream signal for vllm#40361.

Tweak B (likely +5-10% accuracy, neutral TPS): swap KV format

Your current config inherits --kv-cache-dtype fp8_e5m2 from the Ampere dual.yml baseline. On Ampere fp8 was the safest pick because there's no native FP8 compute and dequant-to-fp16 is the only path. Blackwell has native FP8 tensor cores (sm_120), including FP8 e4m3 which is more accurate than e5m2.

Try:

- --kv-cache-dtype
- fp8_e4m3

e4m3 has 1 more mantissa bit than e5m2 → less rounding noise on KV writes. On Blackwell you get the accuracy benefit AND the native compute path. Run a verify-stress probe at 60K + 90K needle (you'd need to raise --max-model-len from 48K to test those) to confirm no recall regression.

Note: also worth testing nvfp4 if vLLM supports it for your model+TP combo — Blackwell has native FP4 tensor cores. It's the marquee Blackwell feature. Less mature path on vLLM today though.

Tweak C (lots more headroom): raise max-model-len

You have 48K cap but the 5090's 32 GB VRAM has ~3 GB of unused headroom (29.8 used / 32 avail) at your current load. The KV pool is 5K-ish tokens worth bigger than what 48K needs:

max-model-len	Approx KV pool VRAM (fp8 e5m2)
48K (current)	~10 GB
128K	~24 GB (would push close to ceiling)
96K	~17 GB (likely the sweet spot)

Try --max-model-len 96000 and see what Available KV cache memory: X GiB boots at. Probe 7 of verify-stress.sh then runs at 60K / 90K needles instead of skipping. That gives you a real Cliff 2 measurement on Blackwell — first cross-rig data point for the 32 GB single-card class.

Tweak D (probably no-op but worth flagging): Genesis Blackwell auto-detection

Your boot log will show is_blackwell: true in the Genesis platform detection block. Several Genesis patches are SM86-targeted and will auto-skip on Blackwell:

PN26b (SM86 sparse-V Triton kernel) — should auto-skip
P67 (TQ multi-query kernel for spec-decode K+1) — depends on the env-default; safe to leave on
P3 (TurboQuant BF16→FP8 cast for Ampere fix) — should auto-skip

After your bench, would be useful to grep the boot log for is_blackwell and how many patches got [OFF] vs [ON] — different from the Ampere boot:

docker logs vllm-qwen36-27b 2>&1 | grep -E "is_blackwell|\[OFF\]|\[ON \]" | head -30

Some of those [OFF] patches might have Blackwell variants in newer Genesis pins worth opting in. Worth a separate experiment after the easier tweaks above.

On the soak-test docker error

Your soak-test.sh failed with 'docker' not found in PATH (similar story to setup.sh, which we softened on disc #48). Will follow up on softening the docker check in soak-test.sh too — same gate, same fix pattern. Tracking that as a separate item.

In the meantime, the soak-helper update I shipped today (commit f32d8a6) detects "silent-empty" turns (HTTP 200 + 0 completion tokens) automatically — useful gate even when you can't run it via the docker entrypoint. If you can run the helper directly:

python3 scripts/soak-helper.py summary <turn-log.csv> /tmp/summary.md <boot_vram_mib> 200 0 5

(But you'd need to feed it a turn-log.csv from a manually-orchestrated soak run, which on microk8s means writing a small wrapper. Probably not worth it until I soften soak-test.sh's docker check the same way setup.sh got softened.)

What we'd love to see next

In rough priority for adding to BENCHMARKS as a new "5090 single-card" row:

Tweak A/B/C results — even one combo measured cleanly is a great first point
is_blackwell: true boot log section so we can see which Genesis patches auto-skip
NVFP4 attempt if you're up for it — kaitchup/Qwen3.6-27B-autoround-nvfp4-linearattn-BF16 is the closest community quant we know about (only ~12 downloads, so you'd be the canary)

Genuinely thanks for being the first Blackwell data point on this stack. The 5090 sub-class is where most of the hardware roadmap goes; having someone running it through our verify-stress harness is a real contribution.

0 replies

apnar · 2026-05-04T19:16:30Z

apnar
May 4, 2026
Author

Thanks for the suggestions. For A, I'm only running a single card so I've been using the default docker-compose.yml, the suggestion in A appears to only apply to the dual configurations as I didn't see those lines in the default one. I made the changes suggested in B and C and here are the bench results:

bench.sh output

bench output (3 warmups + 5 measured per prompt)


========== NARRATIVE (prompt=65 chars, max_tokens=1000) ==========
=== warmups (3) ===
  warm-1     wall=  9.37s  ttft=  1049ms  toks=1000  wall_TPS=106.70  decode_TPS=120.15
  warm-2     wall=  8.07s  ttft=    76ms  toks=1000  wall_TPS=123.92  decode_TPS=125.10
  warm-3     wall=  8.72s  ttft=    77ms  toks=1000  wall_TPS=114.67  decode_TPS=115.69

=== measured (5) ===
  run-1      wall=  8.52s  ttft=    78ms  toks=1000  wall_TPS=117.37  decode_TPS=118.45
  run-2      wall=  8.44s  ttft=    76ms  toks= 999  wall_TPS=118.33  decode_TPS=119.40
  run-3      wall=  8.13s  ttft=    75ms  toks=1000  wall_TPS=123.00  decode_TPS=124.14
  run-4      wall=  8.47s  ttft=    76ms  toks=1000  wall_TPS=118.05  decode_TPS=119.12
  run-5      wall=  8.00s  ttft=    76ms  toks= 985  wall_TPS=123.09  decode_TPS=124.27

=== summary [narrative] (n=5) ===
  wall_TPS       mean= 119.97   std=  2.83   CV= 2.4%   min=117.37   max=123.09
  decode_TPS     mean= 121.08   std=  2.88   CV= 2.4%   min=118.45   max=124.27
  TTFT          mean=    76ms  std=    1ms  min=75ms  max=78ms

========== CODE (prompt=78 chars, max_tokens=800) ==========
=== warmups (3) ===
  warm-1     wall=  5.04s  ttft=    76ms  toks= 787  wall_TPS=156.29  decode_TPS=158.70
  warm-2     wall=  5.11s  ttft=    75ms  toks= 800  wall_TPS=156.40  decode_TPS=158.72
  warm-3     wall=  4.35s  ttft=    75ms  toks= 680  wall_TPS=156.16  decode_TPS=158.90

=== measured (5) ===
  run-1      wall=  4.96s  ttft=    75ms  toks= 763  wall_TPS=153.69  decode_TPS=156.06
  run-2      wall=  5.37s  ttft=    75ms  toks= 800  wall_TPS=148.89  decode_TPS=150.99
  run-3      wall=  3.37s  ttft=    78ms  toks= 516  wall_TPS=153.13  decode_TPS=156.78
  run-4      wall=  5.31s  ttft=    78ms  toks= 800  wall_TPS=150.73  decode_TPS=152.96
  run-5      wall=  2.95s  ttft=    76ms  toks= 464  wall_TPS=157.37  decode_TPS=161.53

=== summary [code] (n=5) ===
  wall_TPS       mean= 152.76   std=  3.22   CV= 2.1%   min=148.89   max=157.37
  decode_TPS     mean= 155.66   std=  4.03   CV= 2.6%   min=150.99   max=161.53
  TTFT          mean=    76ms  std=    2ms  min=75ms  max=78ms

=== GPU state ===
0, 91 %, 29202 MiB, 32607 MiB, 409.04 W, 60

Additionally, here is the grep for is_blackwell:

k logs club-3090 2>&1 | grep -E "is_blackwell|\[OFF\]|\[ON \]"
[INFO:genesis.apply_all] Genesis platform: {"vendor": {"is_nvidia_cuda": true, "is_amd_rocm": false, "is_intel_xpu": false, "is_cpu_only": false}, "nvidia": {"compute_capability": [12, 0], "is_ampere_datacenter": false, "is_ampere_consumer": false, "is_ada_lovelace": false, "is_hopper": false, "is_blackwell": false, "has_native_fp8": true}, "amd": {}, "versions": {"torch": [2, 11], "transformers": [5, 6, 2], "vllm": [0, 20, 1], "flash_attn_major": 2}, "paths": {"vllm_install_root": "/usr/local/lib/python3.12/dist-packages/vllm"}}
[INFO:genesis.apply_all]   [ON ] P67                Multi-query verify kernel for spec-decode K+1
[INFO:genesis.apply_all]   [OFF] P83+P84+P85        Prefix-cache cake-and-eat patches (vllm#38182)

I'll look to give the suggested model a try later but am happy to try any other config tweaks you'd like to suggest.

2 replies

noonghunna May 4, 2026
Maintainer

@apnar — three observations, plus a Genesis bug to flag.

Tweak A — you're right, NA for single-card

Confirmed: docker-compose.yml (single-card) doesn't have the Marlin RO mount; only the dual-card composes do. My note should have said "applicable on dual-* composes only" — fixing in the docs.

The Genesis sm_120 detection gap

This is the most interesting line in your boot log:

Genesis platform: {"nvidia": {"compute_capability": [12, 0],
  "is_ampere_consumer": false, "is_ada_lovelace": false,
  "is_hopper": false, "is_blackwell": false, "has_native_fp8": true}}

Genesis correctly detects compute_capability=[12, 0] and has_native_fp8=true, but classifies your 5090 as is_blackwell: false — so most Blackwell-targeted patches won't engage. Your [ON] / [OFF] summary shows only P67 ON and P83+P84+P85 OFF (out of ~80 patches Genesis ships) — that's the smallest patch set Genesis applies to any rig class.

This is a Genesis bug, not a club-3090 issue. sm_120 is RTX 5090 / 5080 (consumer Blackwell). Genesis's platform detector probably checks for sm_100 (datacenter Blackwell) but not sm_120 (consumer). Worth filing on Sandermage/genesis-vllm-patches — title something like "Add sm_120 (consumer Blackwell, RTX 5090/5080) to is_blackwell detection".

That filing alone would unlock a bunch of Blackwell-aware patches for you. Would expect non-trivial TPS uplift if it works (PN26b, P67 multi-query, FlashInfer NVFP4 paths, etc., are all Blackwell-relevant).

Your B+C results — surprising and useful

Config	Narr wall TPS	Code wall TPS	VRAM peak
Original (fp8_e5m2 + 48K)	122.72	162.29	29832 MiB
B+C (fp8_e4m3 + 96K)	119.97	152.76	29202 MiB
Δ	-2.2%	-5.9%	-2.1%

Both fp8_e4m3 and 96K ctx slightly REGRESSED on your rig. Counter to my recommendation. Two likely reasons:

fp8_e4m3 path on sm_120 isn't optimized in vLLM yet. fp8_e5m2 has been the more-tested path on Ampere/Ada (Genesis P3 explicitly handles it). vLLM's e4m3 native compute on Blackwell consumer may use a slower fallback. The Genesis detection gap above probably contributes (it would normally apply Blackwell-optimized FP8 paths if it knew you were on Blackwell).
96K context with no actual long-prompt workload = bench prompts are tiny (~65-78 chars), so the 96K capacity isn't exercised, but cudagraph capture sizes shift between 48K and 96K configs, and that costs a small overhead. Worth re-testing 96K ONLY when you have a long-prompt workload.

Recommendation: revert to fp8_e5m2 for now, keep max-model-len at whatever ceiling matches your actual workload (48K for short prompts, 96K+ if you regularly send long prompts). The fp8_e4m3 win on Blackwell is real in theory but needs the vLLM/Genesis stack to catch up.

NVFP4 — the real Blackwell experiment

5090 has native FP4 tensor cores — that's the marquee Blackwell feature. The closest community quant for our model is kaitchup/Qwen3.6-27B-autoround-nvfp4-linearattn-BF16 — only ~12 downloads, so you'd be the canary, but it's specifically NVFP4 with linear_attn BF16 (matches the carve-out we use on Lorbus's INT4).

If it loads cleanly and runs, it should ship a real Blackwell win because the FP4 weights run on native FP4 tensor cores end-to-end (not dequant-to-fp16 like we'd see on Ampere). vLLM's NVFP4 support landed in late 2025 — should be in the nightly you're running.

To try:

hf download kaitchup/Qwen3.6-27B-autoround-nvfp4-linearattn-BF16 --local-dir ~/club-3090/models-cache/qwen3.6-27b-nvfp4
Modify your override compose: change --model path + --quantization auto_round (NVFP4 may need a different quant arg — vLLM's loader auto-detects from the config.json's quantization_config.quant_method)
Boot + bench

Expected: meaningfully faster than fp8 paths if it works. Possibly broken if vLLM nightly's NVFP4 path has issues — that'd be useful upstream signal too.

Action items I owe you

Update the 4-tweak ladder doc note to flag A as dual-only
Revert the fp8_e4m3 recommendation in HARDWARE.md / docs (was wrong on your rig — possibly wrong on all consumer Blackwell until Genesis sm_120 detection lands)
Add NVFP4 as the recommended next experiment for sm_120

Will land those when I do the next pass on docs. In the meantime, the Genesis sm_120 filing is the highest-leverage thing you can do — it'd benefit everyone running 5090 on this stack.

noonghunna May 4, 2026
Maintainer

Filed the Genesis sm_120 detection bug as genesis-vllm-patches#20. Fix is a 1-line change in guards.py:191 (cc[0] == 10 → cc[0] in (10, 12)). Tagging here so you have the upstream pointer; if/when Sander merges it, your boot log will start showing is_blackwell: true and any Blackwell-aware patches he ships will engage on your rig.

apnar · 2026-05-04T20:09:23Z

apnar
May 4, 2026
Author

Tried running the kaitchup/Qwen3.6-27B-autoround-nvfp4-linearattn-BF16 as suggested.

First run:

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
[INFO:genesis.apply_all] Genesis platform: {"vendor": {"is_nvidia_cuda": true, "is_amd_rocm": false, "is_intel_xpu": false, "is_cpu_only": false}, "nvidia": {"compute_capability": [12, 0], "is_ampere_datacenter": false, "is_ampere_consumer": false, "is_ada_lovelace": false, "is_hopper": false, "is_blackwell": false, "has_native_fp8": true}, "amd": {}, "versions": {"torch": [2, 11], "transformers": [5, 6, 2], "vllm": [0, 20, 1], "flash_attn_major": 2}, "paths": {"vllm_install_root": "/usr/local/lib/python3.12/dist-packages/vllm"}}
[INFO:genesis.apply_all] Genesis Unified Patch v7.0 — Ampere FP8 + TQ + MoE + Hybrid + bugfixes. Philosophy: МЫ ЧИНИМ, НЕ ЛОМАЕМ.
[INFO:genesis.apply_all] [Genesis registry] 100 dispatcher entries — schema-clean, dependency graph consistent.
[INFO:genesis.apply_all] ==============================================================================
[INFO:genesis.apply_all] [Genesis GPU profile] detected: NVIDIA GeForce RTX 5090
[INFO:genesis.apply_all]   canonical: RTX 5090  cc: (12, 0)  SM: 170  L2: 88 MB  BW: 1792 GB/s  regime: compute
[INFO:genesis.apply_all] 
[INFO:genesis.apply_all] Per-patch recommendations:
[INFO:genesis.apply_all] ------------------------------------------------------------------------------
[INFO:genesis.apply_all]   [REC] P40                TQ k8v4 GQA grouping kernel (vllm#40792)
[INFO:genesis.apply_all]           gain: +5-15% (mixed regime), +15-30% (compute regime)
[INFO:genesis.apply_all]           → recommend: export GENESIS_ENABLE_P40=1
[INFO:genesis.apply_all]           why: Author measured +27% on H100. Empirically NS on A5000 (p=0.28). Cache-locality benefit needs L2 >= 24 MB.
[INFO:genesis.apply_all]   [ON ] P67                Multi-query verify kernel for spec-decode K+1
[INFO:genesis.apply_all]           gain: +25-35%
[INFO:genesis.apply_all]           why: +32% TPS on 35B-A3B-FP8 spec-decode K=3 verify (Genesis internal benchmark, all GPU classes tested).
[INFO:genesis.apply_all]   [REC] P82                SGLang-style acceptance threshold OR-clause
[INFO:genesis.apply_all]           gain: +8-12%
[INFO:genesis.apply_all]           → recommend: export GENESIS_ENABLE_P82=1
[INFO:genesis.apply_all]           why: Cross-rig confirmed: +12% on A5000 FP8, +10.5% on 3090 INT4.
[INFO:genesis.apply_all]   [REC] PN8                MTP/draft online-quant propagation (vllm#40849)
[INFO:genesis.apply_all]           gain: 0% TPS, but ~1-2 GiB total VRAM headroom
[INFO:genesis.apply_all]           → recommend: export GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1
[INFO:genesis.apply_all]           why: Verified ~1 GiB VRAM saved per GPU on 35B-A3B-FP8 + MTP K=3. Use freed VRAM for higher gpu-mem-util or longer ctx.
[INFO:genesis.apply_all]   [OFF] P83+P84+P85        Prefix-cache cake-and-eat patches (vllm#38182)
[INFO:genesis.apply_all]           gain: (currently negative)
[INFO:genesis.apply_all]           (correctly skipped on this regime)
[INFO:genesis.apply_all]           why: Tested 4-arm A/B 2026-04-29: -29% TPS regression even with full stack including root-cause P84. Possible vllm cache machinery fixed-overhead per-step. WAIT for v0.20.2 pin bump.
[INFO:genesis.apply_all] ==============================================================================
[INFO:genesis.wiring.p8_kv_hybrid_reporting] [P8] using V2 import anchor (post-MambaSpec layout)
[INFO:genesis.wiring.text_patch] [P8 KV hybrid reporting (kv_cache_utils)] applied 3 sub-patches: p8_kv_imports, p8_kv_helper_injection, p8_kv_callsite
[INFO:genesis.wiring.text_patch] [P8 KV hybrid reporting (scheduler)] applied 2 sub-patches: p8_sched_import, p8_sched_callsite
[INFO:genesis.apply_all] [Genesis] applied: P8 KV hybrid reporting (per-token capacity) — kv_cache_utils=applied(ok), scheduler=applied(ok)
[INFO:genesis.wiring.text_patch] [P3 TurboQuant BF16->FP8 cast (Ampere fix)] applied 1 sub-patches: p3_bf16_fp8_cast
[INFO:genesis.apply_all] [Genesis] applied: P3 TurboQuant BF16->FP8 cast (Ampere fix) — BF16->FP8 cast guard inserted
[INFO:genesis.wiring.text_patch] [P6 TQ-aware block size alignment] applied 2 sub-patches: p6_import_tqspec, p6_tq_branch
[INFO:genesis.apply_all] [Genesis] applied: P6 TurboQuant-aware attention page size — TQ-aware page-size branch inserted
[INFO:genesis.wiring.text_patch] [P15 Qwen3 None/null tool arg] applied 1 sub-patches: p15_none_null
[INFO:genesis.apply_all] [Genesis] applied: P15 Qwen3 None/null tool arg parser — None/none mapping added to tool param parser
[INFO:genesis.wiring.text_patch] [P12 Qwen3 <tool_call> implicit reasoning end] upstream marker '_tool_call_token_id' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] skipped: P12 Qwen3 <tool_call> implicit reasoning end — upstream_merged
[INFO:genesis.wiring.text_patch] [P27 Qwen3 BEFORE-THINK fallback/p27_nonstream_return_baseline] anchor not found — soft skip (sibling patches continue)
[INFO:genesis.wiring.text_patch] [P27 Qwen3 BEFORE-THINK fallback] applied 3 sub-patches: p27_nonstream_capture, p27_nonstream_return_pr35687, p27_stream_start
[INFO:genesis.apply_all] [Genesis] applied: P27 Qwen3 BEFORE-THINK fallback — BEFORE-THINK fallback wired (non-stream + stream)
[INFO:genesis.wiring.text_patch] [P34 Mamba zero-collapse deadlock guard] applied 1 sub-patches: p34_deadlock_guard
[INFO:genesis.apply_all] [Genesis] applied: P34 Mamba zero-collapse deadlock guard — zero-collapse deadlock guard inserted (fixes #40707 for hybrid Mamba + multimodal)
[INFO:genesis.apply_all] [Genesis] applied: P29 tool parser IndexError guard — upstream already contains bounded-index guards (no-op)
[INFO:genesis.marlin_fp32_reduce] [Genesis P23] Marlin FP32_REDUCE: disabled=False on SM=(12, 0) (auto-from-platform)
[INFO:genesis.apply_all] [Genesis] applied: P23 Marlin FP32_REDUCE env override — decision: fp32_reduce disabled=False (requires upstream wire into Marlin launcher to take effect)
[INFO:genesis.wiring.text_patch] [P4 TurboQuant hybrid model support] applied 2 sub-patches: p4_helper_fn, p4_tq_block
[INFO:genesis.apply_all] [Genesis] applied: P4 TurboQuant hybrid model support — text-patch succeeded
[INFO:genesis.wiring.text_patch] [P5 KV cache page size unification (v1_lcm_pad_max)/p5_import_math] anchor not found — soft skip (sibling patches continue)
[INFO:genesis.wiring.text_patch] [P5 KV cache page size unification (v1_lcm_pad_max)] applied 1 sub-patches: p5_v1_lcm_pad_max_from_baseline
[INFO:genesis.apply_all] [Genesis] applied: P5 KV cache page size unification — text-patch v2 succeeded (pad-smaller-to-max)
[INFO:genesis.apply_all] [Genesis] skipped: P5b KV page-size pad-smaller-to-max (env-opt-in) — opt-in: set GENESIS_ENABLE_P5B=1 to enable pad-smaller-to-max KV page-size strategy (saves ~34% per-block VRAM on Qwen3.6 hybrid vs P5 v1 LCM-pad-up; BLAST-RADIUS is KV allocator → benchmark on VM 100 before enabling in prod)
INFO 05-04 19:59:26 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
INFO 05-04 19:59:26 [nixl_utils.py:32] NIXL is available
[INFO:genesis.wiring.p31_router_softmax] [Genesis P31] wrapped grouped_topk_router.grouped_topk (fp32 upcast active for grouped-MoE models)
[INFO:genesis.apply_all] [Genesis] applied: P31 MoE router fp32 softmax — grouped_topk wrapped (effective in this process)
[INFO:genesis.wiring.p22_tq_prealloc] [P22] PR #40655 merged (bhoomit moved init out) — auto-skip — patch retired in favor of upstream
[INFO:genesis.apply_all] [Genesis] skipped: P22 TurboQuant shared dequant prealloc — PR #40655 merged (bhoomit moved init out) — auto-skip
[INFO:genesis.wiring.text_patch] [P26 TQ prefill output prealloc] upstream marker 'if not hasattr(self, "_cu_2")' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] skipped: P26 TurboQuant prefill output prealloc — upstream_merged
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P61b — Qwen3 streaming partial-tag overlap guard | opt-in only — set GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P61b Qwen3 streaming partial-tag overlap guard — opt-in only — set GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P62 — Structured-output spec-decode reasoning-end timing fix | opt-in only — set GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P62 structured-output spec-decode timing fix — opt-in only — set GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P61 — Qwen3 multi-tool first-occurrence (DEPRECATED — superseded by P12 v2) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P61_QWEN3_MULTI_TOOL=1 only for diagnostics
[INFO:genesis.apply_all] [Genesis] skipped: P61 Qwen3 multi-tool first-occurrence — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P61_QWEN3_MULTI_TOOL=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P60b — GDN+ngram Triton kernel offset (Phase 2) | opt-in only — set GENESIS_ENABLE_P60B_TRITON_KERNEL=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P60b GDN+ngram Triton kernel offset — opt-in only — set GENESIS_ENABLE_P60B_TRITON_KERNEL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P60 — GDN+ngram state recovery (Phase 1: SSM pre-copy) | opt-in only — set GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P60 GDN+ngram state recovery — opt-in only — set GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P63 — MTP/Eagle drafter GDN state recovery (deprecated — wrong layer) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P63_MTP_GDN_STATE_RECOVERY=1 only for diagnost
[INFO:genesis.apply_all] [Genesis] skipped: P63 MTP/Eagle drafter GDN state recovery — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P63_MTP_GDN_STATE_RECOVERY=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P64 — qwen3coder MTP streaming early-return fix | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P64 qwen3coder_tool_parser.py — MTP streaming early-return removal] applied 2 sub-patches: p64_remove_early_return, p64_unify_emit_at_fnend
[INFO:genesis.wiring.text_patch] [P64 serving.py — MTP safety-net + Pydantic null fix] applied 2 sub-patches: p64_safety_net_widen, p64_callsite_guard
[INFO:genesis.apply_all] [Genesis] applied: P64 qwen3coder MTP streaming early-return fix — P64 applied: 2 files modified, 0 idempotent. qwen3coder streaming parser no longer drops parameters when MTP bundles last param + </function> in same delta. Safety net widened to fire on finish_reason alone.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P65 — TurboQuant spec-decode cudagraph downgrade | opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P65 TurboQuant spec-decode cudagraph downgrade — opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P66 — cudagraph_capture_sizes spec-decode divisibility filter | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P66 config/vllm.py — cudagraph_capture_sizes spec-decode filter] applied 1 sub-patches: p66_size_filter
[INFO:genesis.apply_all] [Genesis] applied: P66 cudagraph_capture_sizes spec-decode divisibility filter — P66 applied: cudagraph_capture_sizes will be filtered to sizes divisible by uniform_decode_query_len when spec-decode is active. Boot 2-4x faster, less peak GPU memory, no mixed-q_len capture risks.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P68 — Auto force tool_choice=required for long-context tool calls | opt-in only — set GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P69 — Long-context tool-format reminder injection | opt-in only — set GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P68/P69 long-context tool-call adherence — neither P68 nor P69 enabled; hook injection skipped to keep serving.py pristine. P68: opt-in only — set GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to engage | P69: opt-in only — set GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P70 — Auto-strict-ngram (force prompt_lookup_min>=8) | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P70 Auto-strict-ngram (force prompt_lookup_min>=8) — opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to engage
[INFO:genesis.wiring.p67_tq_multi_query_kernel] [Genesis P67 H2] baked env at module load: MAX_PRIOR_LEN=4096 DEBUG_COMPARE=False (no per-dispatch env reads)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P67 — TurboQuant multi-query kernel for spec-decode K+1 | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P67 turboquant_attn.py — multi-query kernel hook] applied 1 sub-patches: p67_kernel_hook
[INFO:genesis.apply_all] [Genesis] applied: P67 TurboQuant multi-query kernel for spec-decode K+1 — P67 hook injected at top of _prefill_attention. Multi-query continuation prefill batches (spec-verify K+1 with prior cached KV) will route to Genesis Triton kernel when GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1. Diag: {'env_enabled': True, 'version': 'v7.39_aggressive_tune', 'sm': '12.0', 'fp8_mode': 'e4nv', 'autoconfig': {'BLOCK_KV': 32, 'num_warps': 8, 'num_stages': 3}, 'kernel_built': True}
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P71 — Block-verify rejection sampler (Sun 2024 ICLR) | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P71 Block-verify rejection sampler (vllm#40819 + gemini bug-fixes) — opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P78 — TurboQuant .tolist() capture-guard (adapted from noonghunna) | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P78 TurboQuant .tolist() capture-guard (adapted from noonghunna) — opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P77 — Adaptive ngram K controller (EMA + hysteresis + auto-disable) | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P77 Adaptive ngram K controller (EMA + hysteresis + auto-disable) — opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P79b — Async × spec-decode proposer-sync backport (vllm#40610) | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P79b Async × spec-decode proposer-sync backport (vllm#40610) — opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P79c — Stale spec_token_ids cleanup for unscheduled requests (vllm#37629) | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P79c Stale spec_token_ids cleanup for unscheduled requests (vllm#37629) — opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P81 — fp8 block-scaled MM low-M decode tuning (vllm#40925) | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P81 fp8 block-scaled MM low-M decode tuning (vllm#40925) — opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P82 — SGLang threshold_single OR-clause acceptance (BIASED — opt-in research) | opt-in only — set GENESIS_ENABLE_P82=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P82 SGLang threshold_single OR-clause acceptance (BIASED — opt-in research) — opt-in only — set GENESIS_ENABLE_P82=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P83 — MTP keep-last-cached-block (vllm#38182 downstream symptom — P84 is real fix) | opt-in only — set GENESIS_ENABLE_P83=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P83 MTP keep-last-cached-block (vllm#38182 mitigation) — opt-in only — set GENESIS_ENABLE_P83=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P84 — hash_block_size override (vllm#38182 actual root cause) | opt-in only — set GENESIS_ENABLE_P84=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P84 hash_block_size override (vllm#38182 ACTUAL root cause) — opt-in only — set GENESIS_ENABLE_P84=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P100 — FlashInfer FULL CUDA graph for spec-decode (vllm#41127) | opt-in only — set GENESIS_ENABLE_P100=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P100 FlashInfer FULL CUDA graph for spec-decode (vllm#41127) — opt-in only — set GENESIS_ENABLE_P100=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P103 FLA Cliff 2 chunked fwd_h+fwd_o orchestrator — GENESIS_ENABLE_P103 not set or platform not NVIDIA SM 8.0+
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P101 — TQ continuation 64-token slicing (vllm#41123 SELECTIVE) | opt-in only — set GENESIS_ENABLE_P101=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P101 TQ continuation 64-token slicing (vllm#41123 selective) — opt-in only — set GENESIS_ENABLE_P101=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P99 — WorkspaceManager.get_simultaneous memoization (perf hotfix) | opt-in only — set GENESIS_ENABLE_P99=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P99 WorkspaceManager memoize get_simultaneous (perf hotfix) — opt-in only — set GENESIS_ENABLE_P99=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P98 — TQ WorkspaceManager revert (vllm#40941 perf hotfix) | opt-in only — set GENESIS_ENABLE_P98=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P98 TQ WorkspaceManager revert (vllm#40941 perf hotfix) — opt-in only — set GENESIS_ENABLE_P98=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P94 — Spec-decode prepare_next_token_ids_padded zero-alloc (vllm#41043) | opt-in only — set GENESIS_ENABLE_P94=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P94 Spec-decode prepare_next_token_ids_padded zero-alloc (vllm#41043 backport) — opt-in only — set GENESIS_ENABLE_P94=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P95 — Marlin TP cudagraph cap on Ampere (vllm#40385) | opt-in only — set GENESIS_ENABLE_P95=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P95 Marlin TP cudagraph cap on Ampere (vllm#40385 backport) — opt-in only — set GENESIS_ENABLE_P95=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P91 — AutoRound row-parallel group cdiv + start-idx fix (vllm#39460) | opt-in only — set GENESIS_ENABLE_P91=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P91 AutoRound row-parallel group cdiv + start-idx fix (vllm#39460 backport) — opt-in only — set GENESIS_ENABLE_P91=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P87 — Marlin W4A16/W8A16 sub-tile output dim pad-on-load (vllm#40361) | opt-in only — set GENESIS_ENABLE_P87=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P87 Marlin sub-tile output dim pad-on-load (vllm#40361 backport) — opt-in only — set GENESIS_ENABLE_P87=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN8 — MTP/draft online-quant propagation (vllm#40849) | opt-in only — set GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN8 MTP/draft online-quant propagation (vllm#40849 backport) — opt-in only — set GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN9 — Independent drafter attention backend (vllm#39930) | opt-in only — set GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATTN=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN9 independent drafter attention backend (vllm#39930 backport) — opt-in only — set GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATTN=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN11 — GDN a/b contiguity in fix_query_key_value_ordering (vllm#41142) | opt-in only — set GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN11 GDN a/b contiguity (vllm#41142 backport) — opt-in only — set GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P67c sparse-V integration into P67 split-M kernel — opt-in: set GENESIS_ENABLE_P67_SPARSE_V=1 to engage per-q_t sparse-V skip in P67 split-M kernel
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN34 — WorkspaceManager runtime lock relaxation (PN33 companion for runtime decode) | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN34 workspace lock runtime relaxation — opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN33 — Spec-decode warmup K-aware sizing (vllm#37521 extended to MTP/ngram) | config_detect: neutral:
[INFO:genesis.wiring.text_patch] [PN33 v1/worker/gpu_model_runner.py — spec-decode warmup K-aware sizing (vllm#37521 extended to MTP/ngram)] applied 1 sub-patches: pN33_warmup_k_draft_tokens
[INFO:genesis.apply_all] [Genesis] applied: PN33 spec-decode warmup K-aware (vllm#37521 extended) — PN33 applied: spec-decode warmup uses real num_speculative_tokens instead of dummy K=1. Closes (a) ampersandru mid-stream OOM via propose_draft_token_ids and (b) noonghunna workspace-lock AssertionError on TQ + MTP K=3 single-card. Disable via GENESIS_DISABLE_PN33_SPEC_DECODE_WARMUP_K=1 if warmup OOMs.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN32 — GDN _forward_core chunked-prefill v2 (Cliff 2 fix for single-24GB-GPU OOM) | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN32 GDN chunked-prefill (Cliff 2 fix) — opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN31 — FA varlen persistent out buffer (issue #15, sister to P38) | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_OUT=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN31 FA varlen persistent out (issue #15) — opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_OUT=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN30 — DS conv state layout + spec-decode AL>1 fix (issue #17) | opt-in only — set GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN30 DS conv state + spec-decode AL>1 (issue #17) — opt-in only — set GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN29 — GDN chunk_o scale-fold (vllm#41446 pattern (c)) | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN29 GDN chunk_o scale-fold (vllm#41446 pattern c) — opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN12 — FFN intermediate scratch pool — Cliff 1 fix on TQ3 path | opt-in only — set GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN12 FFN intermediate scratch pool (Cliff 1 fix) — opt-in only — set GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN28 — merge_attn_states NaN guard (vllm#39148 backport) | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN28 merge_attn_states NaN guard (vllm#39148 backport) — opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P15B — FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix) | opt-in only — set GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P15B FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix) — opt-in only — set GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P38B — P38 compile-safe in-source hook (Issue #14 fix) | opt-in only — set GENESIS_ENABLE_P38B_COMPILE_SAFE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P38B P38 compile-safe in-source hook (Issue #14 fix) — opt-in only — set GENESIS_ENABLE_P38B_COMPILE_SAFE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN26b sparse-V tile-skip Genesis kernel (BLASST lambda=a/L for SM86) — opt-in: set GENESIS_ENABLE_PN26_SPARSE_V=1 to enable sparse-V tile-skip kernel (BLASST λ=a/L formula by default)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN27 — Revert MoERunnerInterface PluggableLayer (vllm#41440) | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN27 revert MoERunnerInterface PluggableLayer (vllm#41440 backport) — opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN26 — TQ unified perf pack (centroids prebake + sparse V scaffold) | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN26 TQ unified perf pack (centroids prebake + sparse V scaffold) — opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN25 — SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile path) | opt-in only — set GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN25 SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile-path) — opt-in only — set GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN13 — CUDAGraphWrapper gc.collect/empty_cache lambda arity (vllm#41235) | opt-in only — set GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARITY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN13 CUDAGraphWrapper lambda arity (vllm#41235 backport) — opt-in only — set GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARITY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN14 — TQ decode IOOB safe_page_idx clamp (vllm#40074) | opt-in only — set GENESIS_ENABLE_PN14_TQ_DECODE_OOB_CLAMP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN14 TQ decode IOOB safe_page_idx clamp (vllm#40074 backport) — opt-in only — set GENESIS_ENABLE_PN14_TQ_DECODE_OOB_CLAMP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN19 Scoped max_split_size_mb during model load (vllm#41268) — GENESIS_ENABLE_PN19_SCOPED_MAX_SPLIT not set; default OFF. Backport of vllm#41268 (MatthewBonanni, OPEN). PyTorch 2.10+ introduces load-time fragmentation; this patch sets max_split_size_mb=20 during model load, restores on exit. Estimated win: 200-500 MiB on H100 (per #41268 author); unverified on Ampere — measure before relying on it.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN23 — DFlash combine_hidden_states dtype cast (vllm#40334) | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN23 DFlash combine_hidden_states dtype cast (vllm#40334 backport) — opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN21 — DFlash SWA support partial backport (vllm#40898) | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN21 DFlash SWA support partial backport (vllm#40898 backport) — opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN22 — Local argmax for TP draft (vllm#39419 backport) | opt-in only — set GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN22 Local argmax for TP draft (vllm#39419 backport) — opt-in only — set GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN24 — DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN24 DFlash aux layer +1 indexing fix (vllm#40727 backport) — opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN17 FA2 softmax_lse runtime clamp (Issue #11 Cliff 1 mechanism A) — GENESIS_ENABLE_PN17_FA2_LSE_CLAMP not set; default OFF. Enable on long-text-no-vision configs to close Cliff 1 mechanism A (FA2 softmax_lse over-allocation at long ctx). Diagnosis credit: noonghunna, Genesis Issue #11.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN16 — Lazy-reasoner request hook (per-request enable_thinking) | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN16 Lazy-reasoner request hook (per-request enable_thinking) — opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P86 — ngram batch_propose O(N*K) → O(N+K) direct-fill (vllm#40876) | opt-in only — set GENESIS_ENABLE_P86=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P86 ngram batch_propose O(N+K) direct-fill (vllm#40876 backport) — opt-in only — set GENESIS_ENABLE_P86=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P85 — Hybrid fine-shadow prefix cache (vllm#38182 followup, MambaManager fix) | opt-in only — set GENESIS_ENABLE_P85=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P85 Hybrid fine-shadow prefix cache (MambaManager fix for vllm#38182 followup) — opt-in only — set GENESIS_ENABLE_P85=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P75 — Auto-enable Suffix Decoding (Arctic Inference, vllm#25784) | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P75 Auto-enable Suffix Decoding (vllm#25784 Arctic Inference) — opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P74 — Auto chunk-clamp via long_prefill_token_threshold (P72 companion) | opt-in only — set GENESIS_ENABLE_P74_CHUNK_CLAMP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P74 Auto chunk-clamp via long_prefill_token_threshold (P72 companion) — opt-in only — set GENESIS_ENABLE_P74_CHUNK_CLAMP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P72 — profile_run M cap (unblocks --max-num-batched-tokens>4096 on MoE) | opt-in only — set GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P72 profile_run M cap (unblocks --max-num-batched-tokens>4096 on MoE) — opt-in only — set GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 to engage
[INFO:genesis.wiring.p67b_spec_verify_routing] [Genesis P67b B2] baked env at module load: USE_UPSTREAM=True NUM_KV_SPLITS=(default = self.max_num_kv_splits) (no per-dispatch env reads)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P67b — TurboQuant spec-verify forward() routing (FULL CG enable) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P67b turboquant_attn.py forward() spec-verify routing] applied 1 sub-patches: p67b_forward_spec_verify_branch
[INFO:genesis.apply_all] [Genesis] applied: P67b TurboQuant spec-verify forward() routing — P67b forward() spec-verify routing injected. K+1 batches now bypass _prefill_attention entirely (cudagraph-safe).
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P59 — Qwen3 reasoning embedded tool_call recovery | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P59 Qwen3 reasoning embedded tool_call recovery — opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P58 — Async-scheduler -1 placeholder fix | opt-in only — set GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P58 async-scheduler -1 placeholder fix — opt-in only — set GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P57 — TQ spec-decode capture-safe buffers (deprecated — research artifact) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P57_SPEC_DECODE_CAPTURE_SAFE=1 only for diagno
[INFO:genesis.apply_all] [Genesis] skipped: P57 TQ spec-decode capture-safe buffers — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P57_SPEC_DECODE_CAPTURE_SAFE=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P56 — TQ spec-decode safe-path guard (deprecated — superseded by P65) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P56_SPEC_DECODE_GUARD=1 only for diagnostics
[INFO:genesis.apply_all] [Genesis] skipped: P56 TQ spec-decode safe-path guard — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P56_SPEC_DECODE_GUARD=1 only for diagnostics
[INFO:genesis.wiring.text_patch] [P44 TQ mixed-batch attn_out pool] applied 1 sub-patches: p44_mixed_attn_out_alloc
[INFO:genesis.apply_all] [Genesis] applied: P44 TQ mixed-batch attn_out pool — text-patch applied — mixed-batch attn_out routed through TurboQuantBufferManager pool (~80 MB zero-init eliminated per mixed-batch forward in multi-user serving)
[INFO:genesis.wiring.text_patch] [P46 GDN gating buffer pool] applied 2 sub-patches: p46_g_buffer, p46_beta_buffer
[INFO:genesis.apply_all] [Genesis] applied: P46 GDN gating buffer pool — text-patch applied — fused_gdn_gating now uses GdnGatingBufferManager pool (eliminates ~24k allocs/sec on Qwen3.6-35B-A3B decode)
[INFO:genesis.apply_all] [Genesis] skipped: P7b GDN dual-stream via torch.library.custom_op (opt-in) — opt-in: set GENESIS_ENABLE_P7B=1 to enable GDN dual-stream via torch.library.custom_op (graph-safe alternative to P7; validates numeric equiv + compile-cache freshness before prod)
[INFO:genesis.apply_all] [Genesis] skipped: P40 TurboQuant GQA-grouped decode stage1 (opt-in) — opt-in: set GENESIS_ENABLE_P40=1 to enable GQA-grouped decode stage1 (port of upstream PR #40792)
[INFO:genesis.wiring.p39a_fla_kkt] [Genesis P39b] vllm_config fetch failed (Current vLLM config is not set. This typically means get_current_vllm_config() was called outside of a set_current_vllm_config() context, or a CustomOp was instantiated at module import time or model forward time when config is not set. For tests that directly test custom ops/modules, use the 'default_vllm_config' pytest fixture from tests/conftest.py.); defaults used
[INFO:genesis.wiring.p39a_fla_kkt] [Genesis P39a] rebound vllm.model_executor.layers.fla.ops.chunk_scaled_dot_kkt.chunk_scaled_dot_kkt_fwd (+1 caller mods: ['vllm.model_executor.layers.fla.ops.chunk'])
[INFO:genesis.apply_all] [Genesis] applied: P39a FLA chunk_scaled_dot_kkt persistent A pool — module-level fn replaced (1 caller module(s) also rebound — pool shared across GDN layers)
[INFO:genesis.wiring.p38_tq_continuation_memory] [Genesis P38] rebound TurboQuantAttentionImpl._continuation_prefill (persistent K_full/V_full buffers replace torch.cat peak)
[INFO:genesis.apply_all] [Genesis] applied: P38 TQ _continuation_prefill persistent workspace — class method replaced (persistent K_full/V_full workspace, no .contiguous()/torch.cat transient peaks)
[INFO:genesis.apply_all] [Genesis] skipped: P37 MoE intermediate cache pool (opt-in) — opt-in only — set GENESIS_ENABLE_P37=1 to engage. Manager API is registered and usable independently of this text-patch.
[INFO:genesis.apply_all] [Genesis] skipped: P36 TurboQuant shared decode buffers — redundant: upstream PR #40798 (workspace manager) active — skip
[INFO:genesis.apply_all] [Genesis] applied: P32/P33 TurboQuant cu_2 + synth_seq_lens preallocs — cu_2 + synth_seq_lens preallocs registered (invoked from ensure_turboquant_buffers, fires during profile_run)
[INFO:genesis.prealloc_budget] [Genesis P73] token budget resolved → 4096 (default fallback). Set GENESIS_PREALLOC_TOKEN_BUDGET to override.
[INFO:genesis.wiring.text_patch] [P28 GDN core_attn_out prealloc] applied 1 sub-patches: p28_core_attn_out_alloc
[INFO:genesis.wiring.p28_gdn_core_attn] [Genesis P28] wrapped GatedDeltaNetAttention.__init__ to attach _genesis_gdn_core_attn_buf on each instance
[INFO:genesis.apply_all] [Genesis] applied: P28 GDN core_attn_out prealloc — forward_cuda patched + __init__ wrapped
[INFO:genesis.gdn_dual_stream] [GDN dual-stream] CUDA aux stream initialized
[INFO:genesis.apply_all] [Genesis P7] dispatcher ready (parallel path)
[INFO:genesis.apply_all] [Genesis] skipped: P7 GDN dual-stream in_proj parallelism — deferred — incompatible with torch.compile fullgraph (CUDA streams not SymPy-graphable); custom op implementation required. Re-enable with GENESIS_ENABLE_P7=1 + --enforce-eager.
[INFO:genesis.apply_all] [Genesis] skipped: P17/P18 Marlin MoE per-SM tuning — no tuning entry for SM (12, 0) — upstream heuristic will be used
[INFO:genesis.wiring.text_patch] [P24 fused_moe num_warps/num_stages overlay] applied 2 sub-patches: p24_fp8_cfg_overlay, p24_general_cfg_overlay
[INFO:genesis.apply_all] [Genesis] applied: P24 fused_moe num_warps/num_stages overlay — num_warps / num_stages overlay wired into get_default_config (active only on Triton fused_moe path; Marlin unaffected)
[INFO:genesis.wiring.p14_block_table] [Genesis P14] rebound BlockTable.append_row + move_row (tail-zero-fill active for concurrent-request safety)
[INFO:genesis.apply_all] [Genesis] applied: P14 block_table tail zero-fill — BlockTable.append_row + move_row wrapped (effective in this process)
[INFO:genesis.tq_decode_tune] [Genesis P18b] TQ decode stage1 using upstream defaults (BLOCK_KV=4 num_warps=1 num_stages=1)
[INFO:genesis.apply_all] [Genesis] applied: P18b TurboQuant decode stage1 tune — no env override — using upstream defaults (4/1/1)
[INFO:genesis.apply_all] [Genesis P20] TQ _continuation_prefill FP16 helpers ready for TurboQuantAttentionImpl hook
[INFO:genesis.apply_all] [Genesis] applied: P20 TurboQuant continuation-prefill FP16 rotate — fp16-rotation helper ready for _continuation_prefill hook
[INFO:genesis.fp8_dispatcher] [Genesis FP8 dispatcher] SM (12, 0) → Native Triton FP8 (no override)
[INFO:genesis.apply_all] [Genesis] applied: P1/P2 FP8 kernel dispatcher — SM=(12, 0) → native Triton FP8 path selected
[INFO:genesis.apply_all] Genesis Results: 27 applied, 72 skipped, 0 failed, 2 ⚠️ partial-apply warning(s)
[WARNING:genesis.apply_all] [Genesis] 2 partial-apply warning(s) — patch(es) failed to match expected source pattern. Review below to confirm anchor drift vs upstream change vs config issue:
[WARNING:genesis.apply_all] [Genesis] ⚠️  P103 FLA Cliff 2 chunked fwd_h+fwd_o orchestrator — GENESIS_ENABLE_P103 not set or platform not NVIDIA SM 8.0+
[WARNING:genesis.apply_all] [Genesis] ⚠️  P17/P18 Marlin MoE per-SM tuning — no tuning entry for SM (12, 0) — upstream heuristic will be used
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] apply matrix:
Patch | Status | Title                                         | Reason                                                       | Credit                        
------+--------+-----------------------------------------------+--------------------------------------------------------------+-------------------------------
P61b  | SKIP   | Qwen3 streaming partial-tag overlap guard     | opt-in only — set GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1 to | ExtReMLapin (vllm#40783)      
P62   | SKIP   | Structured-output spec-decode reasoning-end t | opt-in only — set GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING= | sfbemerk (vllm#36138), ciciror
P61   | SKIP   | Qwen3 multi-tool first-occurrence (DEPRECATED | opt-in only AND empirically deprecated — keeping skip; set G | ExtReMLapin (vllm#40783) — P61
P60b  | SKIP   | GDN+ngram Triton kernel offset (Phase 2)      | opt-in only — set GENESIS_ENABLE_P60B_TRITON_KERNEL=1 to eng | tdoublep (vllm#40738)         
P60   | SKIP   | GDN+ngram state recovery (Phase 1: SSM pre-co | opt-in only — set GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1 to enga | tdoublep (vllm#40738), bhaktat
P63   | SKIP   | MTP/Eagle drafter GDN state recovery (depreca | opt-in only AND empirically deprecated — keeping skip; set G | Genesis-original (hypothesis d
P64   | APPLY  | qwen3coder MTP streaming early-return fix     | opt-in env (config: neutral)                                 | kotori-yan (vllm#39598)       
P65   | SKIP   | TurboQuant spec-decode cudagraph downgrade    | opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWN | Genesis-original (root cause f
P66   | APPLY  | cudagraph_capture_sizes spec-decode divisibil | opt-in env (config: neutral)                                 | Genesis-original (mirrors fhl2
P68   | SKIP   | Auto force tool_choice=required for long-cont | opt-in only — set GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to en | Genesis-original (long-ctx too
P69   | SKIP   | Long-context tool-format reminder injection   | opt-in only — set GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER= | Genesis-original (long-ctx too
P70   | SKIP   | Auto-strict-ngram (force prompt_lookup_min>=8 | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to  | Genesis-original (vllm#40875 e
P67   | APPLY  | TurboQuant multi-query kernel for spec-decode | opt-in env (config: neutral)                                 | Genesis-original (proper fix f
P71   | SKIP   | Block-verify rejection sampler (Sun 2024 ICLR | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engag | Backport of vllm#40819 (Z. Gol
P78   | SKIP   | TurboQuant .tolist() capture-guard (adapted f | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1  | Adapted from noonghunna's patc
P77   | SKIP   | Adaptive ngram K controller (EMA + hysteresis | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to e | Genesis-original (port of SGLa
P79b  | SKIP   | Async × spec-decode proposer-sync backport (v | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1  | Backport of vllm#40610 (OPEN d
P79c  | SKIP   | Stale spec_token_ids cleanup for unscheduled  | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEAN | Backport of vllm#37629 (OPEN, 
P81   | SKIP   | fp8 block-scaled MM low-M decode tuning (vllm | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8 | Backport of vllm#40925 (tonyli
P82   | SKIP   | SGLang threshold_single OR-clause acceptance  | opt-in only — set GENESIS_ENABLE_P82=1 to engage             | SGLang team (sgl-project/sglan
P83   | SKIP   | MTP keep-last-cached-block (vllm#38182 downst | opt-in only — set GENESIS_ENABLE_P83=1 to engage             | Root-cause analysis: vllm#3818
P84   | SKIP   | hash_block_size override (vllm#38182 actual r | opt-in only — set GENESIS_ENABLE_P84=1 to engage             | Genesis-original discovery 202
P100  | SKIP   | FlashInfer FULL CUDA graph for spec-decode (v | opt-in only — set GENESIS_ENABLE_P100=1 to engage            | Backport of vllm#41127 (open 2
P101  | SKIP   | TQ continuation 64-token slicing (vllm#41123  | opt-in only — set GENESIS_ENABLE_P101=1 to engage            | Selective backport of vllm#411
P99   | SKIP   | WorkspaceManager.get_simultaneous memoization | opt-in only — set GENESIS_ENABLE_P99=1 to engage             | Per Sander 2026-04-28: 'if rev
P98   | SKIP   | TQ WorkspaceManager revert (vllm#40941 perf h | opt-in only — set GENESIS_ENABLE_P98=1 to engage             | Reverts upstream PR #40941 (ME
P94   | SKIP   | Spec-decode prepare_next_token_ids_padded zer | opt-in only — set GENESIS_ENABLE_P94=1 to engage             | Backport of vllm#41043 (wanglu
P95   | SKIP   | Marlin TP cudagraph cap on Ampere (vllm#40385 | opt-in only — set GENESIS_ENABLE_P95=1 to engage             | Backport of vllm#40385 (OPEN a
P91   | SKIP   | AutoRound row-parallel group cdiv + start-idx | opt-in only — set GENESIS_ENABLE_P91=1 to engage             | Backport of non-MoE-specific p
P87   | SKIP   | Marlin W4A16/W8A16 sub-tile output dim pad-on | opt-in only — set GENESIS_ENABLE_P87=1 to engage             | Backport of vllm#40361 (OPEN).
PN8   | SKIP   | MTP/draft online-quant propagation (vllm#4084 | opt-in only — set GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT= | Backport of vllm#40849 (bhoomi
PN9   | SKIP   | Independent drafter attention backend (vllm#3 | opt-in only — set GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATT | Backport of vllm#39930 (Matthe
PN11  | SKIP   | GDN a/b contiguity in fix_query_key_value_ord | opt-in only — set GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 to | Backport of vllm#41142 (Yeuvoi
PN34  | SKIP   | WorkspaceManager runtime lock relaxation (PN3 | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 | Companion to PN33 — same root 
PN33  | APPLY  | Spec-decode warmup K-aware sizing (vllm#37521 | config_detect: neutral:                                      | Backport of vllm-project/vllm#
PN32  | SKIP   | GDN _forward_core chunked-prefill v2 (Cliff 2 | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1  | Genesis-original v7.69 v2 (202
PN31  | SKIP   | FA varlen persistent out buffer (issue #15, s | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_O | Genesis-original sister patch 
PN30  | SKIP   | DS conv state layout + spec-decode AL>1 fix ( | opt-in only — set GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE= | Genesis-original fix for issue
PN29  | SKIP   | GDN chunk_o scale-fold (vllm#41446 pattern (c | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to en | Backport of vllm#41446 (zobinH
PN12  | SKIP   | FFN intermediate scratch pool — Cliff 1 fix o | opt-in only — set GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL= | Genesis-original 2026-04-29 — 
PN28  | SKIP   | merge_attn_states NaN guard (vllm#39148 backp | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 | Backport of vllm#39148 (jasonk
P15B  | SKIP   | FA varlen max_seqlen_k clamp on TQ path (Issu | opt-in only — set GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 to e | Genesis-original 2026-05-01 fi
P38B  | SKIP   | P38 compile-safe in-source hook (Issue #14 fi | opt-in only — set GENESIS_ENABLE_P38B_COMPILE_SAFE=1 to enga | Genesis-original 2026-05-01 fi
PN27  | SKIP   | Revert MoERunnerInterface PluggableLayer (vll | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 | Backport of vllm#41440 (auto-g
PN26  | SKIP   | TQ unified perf pack (centroids prebake + spa | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage | Genesis-original 2026-05-01 un
PN25  | SKIP   | SiluAndMul.forward_native opaque-op pool (Cli | opt-in only — set GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 t | Genesis-original 2026-05-01 in
PN13  | SKIP   | CUDAGraphWrapper gc.collect/empty_cache lambd | opt-in only — set GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARIT | Backport of vllm#41235 (roikor
PN14  | SKIP   | TQ decode IOOB safe_page_idx clamp (vllm#4007 | opt-in only — set GENESIS_ENABLE_PN14_TQ_DECODE_OOB_CLAMP=1  | Backport of vllm#40074 (devara
PN23  | SKIP   | DFlash combine_hidden_states dtype cast (vllm | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to  | Backport of vllm#40334 (cipher
PN21  | SKIP   | DFlash SWA support partial backport (vllm#408 | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage | Partial backport of vllm#40898
PN22  | SKIP   | Local argmax for TP draft (vllm#39419 backpor | opt-in only — set GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 to e | Backport of vllm#39419 (EanWan
PN24  | SKIP   | DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 | Backport of vllm#40727 (benchi
PN16  | SKIP   | Lazy-reasoner request hook (per-request enabl | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to eng | Genesis-original 2026-04-29. H
P86   | SKIP   | ngram batch_propose O(N*K) → O(N+K) direct-fi | opt-in only — set GENESIS_ENABLE_P86=1 to engage             | Backport of vllm#40876 (aarona
P85   | SKIP   | Hybrid fine-shadow prefix cache (vllm#38182 f | opt-in only — set GENESIS_ENABLE_P85=1 to engage             | Genesis-original 2026-04-27 — 
P75   | SKIP   | Auto-enable Suffix Decoding (Arctic Inference | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to en | Backport-enabler of vllm#25784
P74   | SKIP   | Auto chunk-clamp via long_prefill_token_thres | opt-in only — set GENESIS_ENABLE_P74_CHUNK_CLAMP=1 to engage | Genesis-original (zero-VRAM-co
P72   | SKIP   | profile_run M cap (unblocks --max-num-batched | opt-in only — set GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 to en | Genesis-original (Dynamo fake-
P67b  | APPLY  | TurboQuant spec-verify forward() routing (FUL | opt-in env (config: neutral)                                 | Genesis-original (FULL CG enab
P59   | SKIP   | Qwen3 reasoning embedded tool_call recovery   | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 t | ZenoAFfectionate (vllm#39055) 
P58   | SKIP   | Async-scheduler -1 placeholder fix            | opt-in only — set GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1 | z1ying (vllm#40768)           
P57   | SKIP   | TQ spec-decode capture-safe buffers (deprecat | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40831), gdn_attn.
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated —  | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40807, #40831)   
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] validator: clean (no issues)
[INFO:genesis.apply_all] [Genesis compile-watchdog] apply_all elapsed: 6.2s
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] apply matrix:
Patch | Status | Title                                         | Reason                                                       | Credit                        
------+--------+-----------------------------------------------+--------------------------------------------------------------+-------------------------------
P61b  | SKIP   | Qwen3 streaming partial-tag overlap guard     | opt-in only — set GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1 to | ExtReMLapin (vllm#40783)      
P62   | SKIP   | Structured-output spec-decode reasoning-end t | opt-in only — set GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING= | sfbemerk (vllm#36138), ciciror
P61   | SKIP   | Qwen3 multi-tool first-occurrence (DEPRECATED | opt-in only AND empirically deprecated — keeping skip; set G | ExtReMLapin (vllm#40783) — P61
P60b  | SKIP   | GDN+ngram Triton kernel offset (Phase 2)      | opt-in only — set GENESIS_ENABLE_P60B_TRITON_KERNEL=1 to eng | tdoublep (vllm#40738)         
P60   | SKIP   | GDN+ngram state recovery (Phase 1: SSM pre-co | opt-in only — set GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1 to enga | tdoublep (vllm#40738), bhaktat
P63   | SKIP   | MTP/Eagle drafter GDN state recovery (depreca | opt-in only AND empirically deprecated — keeping skip; set G | Genesis-original (hypothesis d
P64   | APPLY  | qwen3coder MTP streaming early-return fix     | opt-in env (config: neutral)                                 | kotori-yan (vllm#39598)       
P65   | SKIP   | TurboQuant spec-decode cudagraph downgrade    | opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWN | Genesis-original (root cause f
P66   | APPLY  | cudagraph_capture_sizes spec-decode divisibil | opt-in env (config: neutral)                                 | Genesis-original (mirrors fhl2
P68   | SKIP   | Auto force tool_choice=required for long-cont | opt-in only — set GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to en | Genesis-original (long-ctx too
P69   | SKIP   | Long-context tool-format reminder injection   | opt-in only — set GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER= | Genesis-original (long-ctx too
P70   | SKIP   | Auto-strict-ngram (force prompt_lookup_min>=8 | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to  | Genesis-original (vllm#40875 e
P67   | APPLY  | TurboQuant multi-query kernel for spec-decode | opt-in env (config: neutral)                                 | Genesis-original (proper fix f
P71   | SKIP   | Block-verify rejection sampler (Sun 2024 ICLR | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engag | Backport of vllm#40819 (Z. Gol
P78   | SKIP   | TurboQuant .tolist() capture-guard (adapted f | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1  | Adapted from noonghunna's patc
P77   | SKIP   | Adaptive ngram K controller (EMA + hysteresis | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to e | Genesis-original (port of SGLa
P79b  | SKIP   | Async × spec-decode proposer-sync backport (v | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1  | Backport of vllm#40610 (OPEN d
P79c  | SKIP   | Stale spec_token_ids cleanup for unscheduled  | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEAN | Backport of vllm#37629 (OPEN, 
P81   | SKIP   | fp8 block-scaled MM low-M decode tuning (vllm | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8 | Backport of vllm#40925 (tonyli
P82   | SKIP   | SGLang threshold_single OR-clause acceptance  | opt-in only — set GENESIS_ENABLE_P82=1 to engage             | SGLang team (sgl-project/sglan
P83   | SKIP   | MTP keep-last-cached-block (vllm#38182 downst | opt-in only — set GENESIS_ENABLE_P83=1 to engage             | Root-cause analysis: vllm#3818
P84   | SKIP   | hash_block_size override (vllm#38182 actual r | opt-in only — set GENESIS_ENABLE_P84=1 to engage             | Genesis-original discovery 202
P100  | SKIP   | FlashInfer FULL CUDA graph for spec-decode (v | opt-in only — set GENESIS_ENABLE_P100=1 to engage            | Backport of vllm#41127 (open 2
P101  | SKIP   | TQ continuation 64-token slicing (vllm#41123  | opt-in only — set GENESIS_ENABLE_P101=1 to engage            | Selective backport of vllm#411
P99   | SKIP   | WorkspaceManager.get_simultaneous memoization | opt-in only — set GENESIS_ENABLE_P99=1 to engage             | Per Sander 2026-04-28: 'if rev
P98   | SKIP   | TQ WorkspaceManager revert (vllm#40941 perf h | opt-in only — set GENESIS_ENABLE_P98=1 to engage             | Reverts upstream PR #40941 (ME
P94   | SKIP   | Spec-decode prepare_next_token_ids_padded zer | opt-in only — set GENESIS_ENABLE_P94=1 to engage             | Backport of vllm#41043 (wanglu
P95   | SKIP   | Marlin TP cudagraph cap on Ampere (vllm#40385 | opt-in only — set GENESIS_ENABLE_P95=1 to engage             | Backport of vllm#40385 (OPEN a
P91   | SKIP   | AutoRound row-parallel group cdiv + start-idx | opt-in only — set GENESIS_ENABLE_P91=1 to engage             | Backport of non-MoE-specific p
P87   | SKIP   | Marlin W4A16/W8A16 sub-tile output dim pad-on | opt-in only — set GENESIS_ENABLE_P87=1 to engage             | Backport of vllm#40361 (OPEN).
PN8   | SKIP   | MTP/draft online-quant propagation (vllm#4084 | opt-in only — set GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT= | Backport of vllm#40849 (bhoomi
PN9   | SKIP   | Independent drafter attention backend (vllm#3 | opt-in only — set GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATT | Backport of vllm#39930 (Matthe
PN11  | SKIP   | GDN a/b contiguity in fix_query_key_value_ord | opt-in only — set GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 to | Backport of vllm#41142 (Yeuvoi
PN34  | SKIP   | WorkspaceManager runtime lock relaxation (PN3 | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 | Companion to PN33 — same root 
PN33  | APPLY  | Spec-decode warmup K-aware sizing (vllm#37521 | config_detect: neutral:                                      | Backport of vllm-project/vllm#
PN32  | SKIP   | GDN _forward_core chunked-prefill v2 (Cliff 2 | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1  | Genesis-original v7.69 v2 (202
PN31  | SKIP   | FA varlen persistent out buffer (issue #15, s | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_O | Genesis-original sister patch 
PN30  | SKIP   | DS conv state layout + spec-decode AL>1 fix ( | opt-in only — set GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE= | Genesis-original fix for issue
PN29  | SKIP   | GDN chunk_o scale-fold (vllm#41446 pattern (c | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to en | Backport of vllm#41446 (zobinH
PN12  | SKIP   | FFN intermediate scratch pool — Cliff 1 fix o | opt-in only — set GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL= | Genesis-original 2026-04-29 — 
PN28  | SKIP   | merge_attn_states NaN guard (vllm#39148 backp | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 | Backport of vllm#39148 (jasonk
P15B  | SKIP   | FA varlen max_seqlen_k clamp on TQ path (Issu | opt-in only — set GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 to e | Genesis-original 2026-05-01 fi
P38B  | SKIP   | P38 compile-safe in-source hook (Issue #14 fi | opt-in only — set GENESIS_ENABLE_P38B_COMPILE_SAFE=1 to enga | Genesis-original 2026-05-01 fi
PN27  | SKIP   | Revert MoERunnerInterface PluggableLayer (vll | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 | Backport of vllm#41440 (auto-g
PN26  | SKIP   | TQ unified perf pack (centroids prebake + spa | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage | Genesis-original 2026-05-01 un
PN25  | SKIP   | SiluAndMul.forward_native opaque-op pool (Cli | opt-in only — set GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 t | Genesis-original 2026-05-01 in
PN13  | SKIP   | CUDAGraphWrapper gc.collect/empty_cache lambd | opt-in only — set GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARIT | Backport of vllm#41235 (roikor
PN14  | SKIP   | TQ decode IOOB safe_page_idx clamp (vllm#4007 | opt-in only — set GENESIS_ENABLE_PN14_TQ_DECODE_OOB_CLAMP=1  | Backport of vllm#40074 (devara
PN23  | SKIP   | DFlash combine_hidden_states dtype cast (vllm | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to  | Backport of vllm#40334 (cipher
PN21  | SKIP   | DFlash SWA support partial backport (vllm#408 | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage | Partial backport of vllm#40898
PN22  | SKIP   | Local argmax for TP draft (vllm#39419 backpor | opt-in only — set GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 to e | Backport of vllm#39419 (EanWan
PN24  | SKIP   | DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 | Backport of vllm#40727 (benchi
PN16  | SKIP   | Lazy-reasoner request hook (per-request enabl | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to eng | Genesis-original 2026-04-29. H
P86   | SKIP   | ngram batch_propose O(N*K) → O(N+K) direct-fi | opt-in only — set GENESIS_ENABLE_P86=1 to engage             | Backport of vllm#40876 (aarona
P85   | SKIP   | Hybrid fine-shadow prefix cache (vllm#38182 f | opt-in only — set GENESIS_ENABLE_P85=1 to engage             | Genesis-original 2026-04-27 — 
P75   | SKIP   | Auto-enable Suffix Decoding (Arctic Inference | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to en | Backport-enabler of vllm#25784
P74   | SKIP   | Auto chunk-clamp via long_prefill_token_thres | opt-in only — set GENESIS_ENABLE_P74_CHUNK_CLAMP=1 to engage | Genesis-original (zero-VRAM-co
P72   | SKIP   | profile_run M cap (unblocks --max-num-batched | opt-in only — set GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 to en | Genesis-original (Dynamo fake-
P67b  | APPLY  | TurboQuant spec-verify forward() routing (FUL | opt-in env (config: neutral)                                 | Genesis-original (FULL CG enab
P59   | SKIP   | Qwen3 reasoning embedded tool_call recovery   | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 t | ZenoAFfectionate (vllm#39055) 
P58   | SKIP   | Async-scheduler -1 placeholder fix            | opt-in only — set GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1 | z1ying (vllm#40768)           
P57   | SKIP   | TQ spec-decode capture-safe buffers (deprecat | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40831), gdn_attn.
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated —  | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40807, #40831)   
[workspace_lock_disable] target: /usr/local/lib/python3.12/dist-packages/vllm/v1/worker/workspace.py
[workspace_lock_disable] applied (lock-violation now logs WARNING, allocates anyway)
[tolist_cudagraph_fix] target: /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/turboquant_attn.py
[tolist_cudagraph_fix] Site B (_prefill_attention): applied
[tolist_cudagraph_fix] Site A (forward mixed-batch): applied
[tolist_cudagraph_fix] Patched /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/turboquant_attn.py. Site A=applied, Site B=applied
WARNING 05-04 19:59:35 [argparse_utils.py:257] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in a future version.
(APIServer pid=1) INFO 05-04 19:59:35 [utils.py:299] 
(APIServer pid=1) INFO 05-04 19:59:35 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 05-04 19:59:35 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.1rc1.dev16+g7a1eb8ac2
(APIServer pid=1) INFO 05-04 19:59:35 [utils.py:299]   █▄█▀ █     █     █     █  model   /root/.cache/huggingface/qwen3.6-27b-nvfp4
(APIServer pid=1) INFO 05-04 19:59:35 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 05-04 19:59:35 [utils.py:299] 
(APIServer pid=1) INFO 05-04 19:59:35 [utils.py:233] non-default args: {'model_tag': '/root/.cache/huggingface/qwen3.6-27b-nvfp4', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'model': '/root/.cache/huggingface/qwen3.6-27b-nvfp4', 'trust_remote_code': True, 'dtype': 'float16', 'max_model_len': 96000, 'quantization': 'auto_round', 'served_model_name': ['qwen3.6-27b'], 'reasoning_parser': 'qwen3', 'kv_cache_dtype': 'fp8_e4m3', 'enable_prefix_caching': True, 'max_num_batched_tokens': 4128, 'max_num_seqs': 1, 'enable_chunked_prefill': True, 'scheduler_reserve_full_isl': False, 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 3}}
(APIServer pid=1) INFO 05-04 19:59:40 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
(APIServer pid=1) INFO 05-04 19:59:40 [nixl_utils.py:32] NIXL is available
(APIServer pid=1) INFO 05-04 19:59:40 [model.py:563] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1) WARNING 05-04 19:59:40 [model.py:2030] Casting torch.bfloat16 to torch.float16.
(APIServer pid=1) INFO 05-04 19:59:40 [model.py:1692] Using max model len 96000
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 92, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 678, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 692, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 124, in build_async_engine_client_from_engine_args
(APIServer pid=1)     vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=1)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1692, in create_engine_config
(APIServer pid=1)     model_config = self.create_model_config()
(APIServer pid=1)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1536, in create_model_config
(APIServer pid=1)     return ModelConfig(
(APIServer pid=1)            ^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
(APIServer pid=1)     s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=1) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=1)   Value error, Quantization method specified in the model config (compressed-tensors) does not match the quantization method specified in the `quantization` argument (auto_round). [type=value_error, input_value=ArgsKwargs((), {'model': ...nderer_num_workers': 1}), input_type=ArgsKwargs]
(APIServer pid=1)     For further information visit https://errors.pydantic.dev/2.13/v/value_error

Then I try changing the data-type value from float16 to bfloat16 and it died with the following log output:

[INFO:genesis.apply_all] Genesis platform: {"vendor": {"is_nvidia_cuda": true, "is_amd_rocm": false, "is_intel_xpu": false, "is_cpu_only": false}, "nvidia": {"compute_capability": [12, 0], "is_ampere_datacenter": false, "is_ampere_consumer": false, "is_ada_lovelace": false, "is_hopper": false, "is_blackwell": false, "has_native_fp8": true}, "amd": {}, "versions": {"torch": [2, 11], "transformers": [5, 6, 2], "vllm": [0, 20, 1], "flash_attn_major": 2}, "paths": {"vllm_install_root": "/usr/local/lib/python3.12/dist-packages/vllm"}}
[INFO:genesis.apply_all] Genesis Unified Patch v7.0 — Ampere FP8 + TQ + MoE + Hybrid + bugfixes. Philosophy: МЫ ЧИНИМ, НЕ ЛОМАЕМ.
[INFO:genesis.apply_all] [Genesis registry] 100 dispatcher entries — schema-clean, dependency graph consistent.
[INFO:genesis.apply_all] ==============================================================================
[INFO:genesis.apply_all] [Genesis GPU profile] detected: NVIDIA GeForce RTX 5090
[INFO:genesis.apply_all]   canonical: RTX 5090  cc: (12, 0)  SM: 170  L2: 88 MB  BW: 1792 GB/s  regime: compute
[INFO:genesis.apply_all] 
[INFO:genesis.apply_all] Per-patch recommendations:
[INFO:genesis.apply_all] ------------------------------------------------------------------------------
[INFO:genesis.apply_all]   [REC] P40                TQ k8v4 GQA grouping kernel (vllm#40792)
[INFO:genesis.apply_all]           gain: +5-15% (mixed regime), +15-30% (compute regime)
[INFO:genesis.apply_all]           → recommend: export GENESIS_ENABLE_P40=1
[INFO:genesis.apply_all]           why: Author measured +27% on H100. Empirically NS on A5000 (p=0.28). Cache-locality benefit needs L2 >= 24 MB.
[INFO:genesis.apply_all]   [ON ] P67                Multi-query verify kernel for spec-decode K+1
[INFO:genesis.apply_all]           gain: +25-35%
[INFO:genesis.apply_all]           why: +32% TPS on 35B-A3B-FP8 spec-decode K=3 verify (Genesis internal benchmark, all GPU classes tested).
[INFO:genesis.apply_all]   [REC] P82                SGLang-style acceptance threshold OR-clause
[INFO:genesis.apply_all]           gain: +8-12%
[INFO:genesis.apply_all]           → recommend: export GENESIS_ENABLE_P82=1
[INFO:genesis.apply_all]           why: Cross-rig confirmed: +12% on A5000 FP8, +10.5% on 3090 INT4.
[INFO:genesis.apply_all]   [REC] PN8                MTP/draft online-quant propagation (vllm#40849)
[INFO:genesis.apply_all]           gain: 0% TPS, but ~1-2 GiB total VRAM headroom
[INFO:genesis.apply_all]           → recommend: export GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1
[INFO:genesis.apply_all]           why: Verified ~1 GiB VRAM saved per GPU on 35B-A3B-FP8 + MTP K=3. Use freed VRAM for higher gpu-mem-util or longer ctx.
[INFO:genesis.apply_all]   [OFF] P83+P84+P85        Prefix-cache cake-and-eat patches (vllm#38182)
[INFO:genesis.apply_all]           gain: (currently negative)
[INFO:genesis.apply_all]           (correctly skipped on this regime)
[INFO:genesis.apply_all]           why: Tested 4-arm A/B 2026-04-29: -29% TPS regression even with full stack including root-cause P84. Possible vllm cache machinery fixed-overhead per-step. WAIT for v0.20.2 pin bump.
[INFO:genesis.apply_all] ==============================================================================
[INFO:genesis.wiring.p8_kv_hybrid_reporting] [P8] using V2 import anchor (post-MambaSpec layout)
[INFO:genesis.wiring.text_patch] [P8 KV hybrid reporting (kv_cache_utils)] applied 3 sub-patches: p8_kv_imports, p8_kv_helper_injection, p8_kv_callsite
[INFO:genesis.wiring.text_patch] [P8 KV hybrid reporting (scheduler)] applied 2 sub-patches: p8_sched_import, p8_sched_callsite
[INFO:genesis.apply_all] [Genesis] applied: P8 KV hybrid reporting (per-token capacity) — kv_cache_utils=applied(ok), scheduler=applied(ok)
[INFO:genesis.wiring.text_patch] [P3 TurboQuant BF16->FP8 cast (Ampere fix)] applied 1 sub-patches: p3_bf16_fp8_cast
[INFO:genesis.apply_all] [Genesis] applied: P3 TurboQuant BF16->FP8 cast (Ampere fix) — BF16->FP8 cast guard inserted
[INFO:genesis.wiring.text_patch] [P6 TQ-aware block size alignment] applied 2 sub-patches: p6_import_tqspec, p6_tq_branch
[INFO:genesis.apply_all] [Genesis] applied: P6 TurboQuant-aware attention page size — TQ-aware page-size branch inserted
[INFO:genesis.wiring.text_patch] [P15 Qwen3 None/null tool arg] applied 1 sub-patches: p15_none_null
[INFO:genesis.apply_all] [Genesis] applied: P15 Qwen3 None/null tool arg parser — None/none mapping added to tool param parser
[INFO:genesis.wiring.text_patch] [P12 Qwen3 <tool_call> implicit reasoning end] upstream marker '_tool_call_token_id' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] skipped: P12 Qwen3 <tool_call> implicit reasoning end — upstream_merged
[INFO:genesis.wiring.text_patch] [P27 Qwen3 BEFORE-THINK fallback/p27_nonstream_return_baseline] anchor not found — soft skip (sibling patches continue)
[INFO:genesis.wiring.text_patch] [P27 Qwen3 BEFORE-THINK fallback] applied 3 sub-patches: p27_nonstream_capture, p27_nonstream_return_pr35687, p27_stream_start
[INFO:genesis.apply_all] [Genesis] applied: P27 Qwen3 BEFORE-THINK fallback — BEFORE-THINK fallback wired (non-stream + stream)
[INFO:genesis.wiring.text_patch] [P34 Mamba zero-collapse deadlock guard] applied 1 sub-patches: p34_deadlock_guard
[INFO:genesis.apply_all] [Genesis] applied: P34 Mamba zero-collapse deadlock guard — zero-collapse deadlock guard inserted (fixes #40707 for hybrid Mamba + multimodal)
[INFO:genesis.apply_all] [Genesis] applied: P29 tool parser IndexError guard — upstream already contains bounded-index guards (no-op)
[INFO:genesis.marlin_fp32_reduce] [Genesis P23] Marlin FP32_REDUCE: disabled=False on SM=(12, 0) (auto-from-platform)
[INFO:genesis.apply_all] [Genesis] applied: P23 Marlin FP32_REDUCE env override — decision: fp32_reduce disabled=False (requires upstream wire into Marlin launcher to take effect)
[INFO:genesis.wiring.text_patch] [P4 TurboQuant hybrid model support] applied 2 sub-patches: p4_helper_fn, p4_tq_block
[INFO:genesis.apply_all] [Genesis] applied: P4 TurboQuant hybrid model support — text-patch succeeded
[INFO:genesis.wiring.text_patch] [P5 KV cache page size unification (v1_lcm_pad_max)/p5_import_math] anchor not found — soft skip (sibling patches continue)
[INFO:genesis.wiring.text_patch] [P5 KV cache page size unification (v1_lcm_pad_max)] applied 1 sub-patches: p5_v1_lcm_pad_max_from_baseline
[INFO:genesis.apply_all] [Genesis] applied: P5 KV cache page size unification — text-patch v2 succeeded (pad-smaller-to-max)
[INFO:genesis.apply_all] [Genesis] skipped: P5b KV page-size pad-smaller-to-max (env-opt-in) — opt-in: set GENESIS_ENABLE_P5B=1 to enable pad-smaller-to-max KV page-size strategy (saves ~34% per-block VRAM on Qwen3.6 hybrid vs P5 v1 LCM-pad-up; BLAST-RADIUS is KV allocator → benchmark on VM 100 before enabling in prod)
INFO 05-04 20:04:27 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
INFO 05-04 20:04:27 [nixl_utils.py:32] NIXL is available
[INFO:genesis.wiring.p31_router_softmax] [Genesis P31] wrapped grouped_topk_router.grouped_topk (fp32 upcast active for grouped-MoE models)
[INFO:genesis.apply_all] [Genesis] applied: P31 MoE router fp32 softmax — grouped_topk wrapped (effective in this process)
[INFO:genesis.wiring.p22_tq_prealloc] [P22] PR #40655 merged (bhoomit moved init out) — auto-skip — patch retired in favor of upstream
[INFO:genesis.apply_all] [Genesis] skipped: P22 TurboQuant shared dequant prealloc — PR #40655 merged (bhoomit moved init out) — auto-skip
[INFO:genesis.wiring.text_patch] [P26 TQ prefill output prealloc] upstream marker 'if not hasattr(self, "_cu_2")' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] skipped: P26 TurboQuant prefill output prealloc — upstream_merged
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P61b — Qwen3 streaming partial-tag overlap guard | opt-in only — set GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P61b Qwen3 streaming partial-tag overlap guard — opt-in only — set GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P62 — Structured-output spec-decode reasoning-end timing fix | opt-in only — set GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P62 structured-output spec-decode timing fix — opt-in only — set GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P61 — Qwen3 multi-tool first-occurrence (DEPRECATED — superseded by P12 v2) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P61_QWEN3_MULTI_TOOL=1 only for diagnostics
[INFO:genesis.apply_all] [Genesis] skipped: P61 Qwen3 multi-tool first-occurrence — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P61_QWEN3_MULTI_TOOL=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P60b — GDN+ngram Triton kernel offset (Phase 2) | opt-in only — set GENESIS_ENABLE_P60B_TRITON_KERNEL=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P60b GDN+ngram Triton kernel offset — opt-in only — set GENESIS_ENABLE_P60B_TRITON_KERNEL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P60 — GDN+ngram state recovery (Phase 1: SSM pre-copy) | opt-in only — set GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P60 GDN+ngram state recovery — opt-in only — set GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P63 — MTP/Eagle drafter GDN state recovery (deprecated — wrong layer) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P63_MTP_GDN_STATE_RECOVERY=1 only for diagnost
[INFO:genesis.apply_all] [Genesis] skipped: P63 MTP/Eagle drafter GDN state recovery — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P63_MTP_GDN_STATE_RECOVERY=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P64 — qwen3coder MTP streaming early-return fix | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P64 qwen3coder_tool_parser.py — MTP streaming early-return removal] applied 2 sub-patches: p64_remove_early_return, p64_unify_emit_at_fnend
[INFO:genesis.wiring.text_patch] [P64 serving.py — MTP safety-net + Pydantic null fix] applied 2 sub-patches: p64_safety_net_widen, p64_callsite_guard
[INFO:genesis.apply_all] [Genesis] applied: P64 qwen3coder MTP streaming early-return fix — P64 applied: 2 files modified, 0 idempotent. qwen3coder streaming parser no longer drops parameters when MTP bundles last param + </function> in same delta. Safety net widened to fire on finish_reason alone.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P65 — TurboQuant spec-decode cudagraph downgrade | opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P65 TurboQuant spec-decode cudagraph downgrade — opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P66 — cudagraph_capture_sizes spec-decode divisibility filter | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P66 config/vllm.py — cudagraph_capture_sizes spec-decode filter] applied 1 sub-patches: p66_size_filter
[INFO:genesis.apply_all] [Genesis] applied: P66 cudagraph_capture_sizes spec-decode divisibility filter — P66 applied: cudagraph_capture_sizes will be filtered to sizes divisible by uniform_decode_query_len when spec-decode is active. Boot 2-4x faster, less peak GPU memory, no mixed-q_len capture risks.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P68 — Auto force tool_choice=required for long-context tool calls | opt-in only — set GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P69 — Long-context tool-format reminder injection | opt-in only — set GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P68/P69 long-context tool-call adherence — neither P68 nor P69 enabled; hook injection skipped to keep serving.py pristine. P68: opt-in only — set GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to engage | P69: opt-in only — set GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P70 — Auto-strict-ngram (force prompt_lookup_min>=8) | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P70 Auto-strict-ngram (force prompt_lookup_min>=8) — opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to engage
[INFO:genesis.wiring.p67_tq_multi_query_kernel] [Genesis P67 H2] baked env at module load: MAX_PRIOR_LEN=4096 DEBUG_COMPARE=False (no per-dispatch env reads)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P67 — TurboQuant multi-query kernel for spec-decode K+1 | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P67 turboquant_attn.py — multi-query kernel hook] applied 1 sub-patches: p67_kernel_hook
[INFO:genesis.apply_all] [Genesis] applied: P67 TurboQuant multi-query kernel for spec-decode K+1 — P67 hook injected at top of _prefill_attention. Multi-query continuation prefill batches (spec-verify K+1 with prior cached KV) will route to Genesis Triton kernel when GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1. Diag: {'env_enabled': True, 'version': 'v7.39_aggressive_tune', 'sm': '12.0', 'fp8_mode': 'e4nv', 'autoconfig': {'BLOCK_KV': 32, 'num_warps': 8, 'num_stages': 3}, 'kernel_built': True}
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P71 — Block-verify rejection sampler (Sun 2024 ICLR) | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P71 Block-verify rejection sampler (vllm#40819 + gemini bug-fixes) — opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P78 — TurboQuant .tolist() capture-guard (adapted from noonghunna) | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P78 TurboQuant .tolist() capture-guard (adapted from noonghunna) — opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P77 — Adaptive ngram K controller (EMA + hysteresis + auto-disable) | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P77 Adaptive ngram K controller (EMA + hysteresis + auto-disable) — opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P79b — Async × spec-decode proposer-sync backport (vllm#40610) | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P79b Async × spec-decode proposer-sync backport (vllm#40610) — opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P79c — Stale spec_token_ids cleanup for unscheduled requests (vllm#37629) | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P79c Stale spec_token_ids cleanup for unscheduled requests (vllm#37629) — opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P81 — fp8 block-scaled MM low-M decode tuning (vllm#40925) | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P81 fp8 block-scaled MM low-M decode tuning (vllm#40925) — opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P82 — SGLang threshold_single OR-clause acceptance (BIASED — opt-in research) | opt-in only — set GENESIS_ENABLE_P82=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P82 SGLang threshold_single OR-clause acceptance (BIASED — opt-in research) — opt-in only — set GENESIS_ENABLE_P82=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P83 — MTP keep-last-cached-block (vllm#38182 downstream symptom — P84 is real fix) | opt-in only — set GENESIS_ENABLE_P83=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P83 MTP keep-last-cached-block (vllm#38182 mitigation) — opt-in only — set GENESIS_ENABLE_P83=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P84 — hash_block_size override (vllm#38182 actual root cause) | opt-in only — set GENESIS_ENABLE_P84=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P84 hash_block_size override (vllm#38182 ACTUAL root cause) — opt-in only — set GENESIS_ENABLE_P84=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P100 — FlashInfer FULL CUDA graph for spec-decode (vllm#41127) | opt-in only — set GENESIS_ENABLE_P100=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P100 FlashInfer FULL CUDA graph for spec-decode (vllm#41127) — opt-in only — set GENESIS_ENABLE_P100=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P103 FLA Cliff 2 chunked fwd_h+fwd_o orchestrator — GENESIS_ENABLE_P103 not set or platform not NVIDIA SM 8.0+
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P101 — TQ continuation 64-token slicing (vllm#41123 SELECTIVE) | opt-in only — set GENESIS_ENABLE_P101=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P101 TQ continuation 64-token slicing (vllm#41123 selective) — opt-in only — set GENESIS_ENABLE_P101=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P99 — WorkspaceManager.get_simultaneous memoization (perf hotfix) | opt-in only — set GENESIS_ENABLE_P99=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P99 WorkspaceManager memoize get_simultaneous (perf hotfix) — opt-in only — set GENESIS_ENABLE_P99=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P98 — TQ WorkspaceManager revert (vllm#40941 perf hotfix) | opt-in only — set GENESIS_ENABLE_P98=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P98 TQ WorkspaceManager revert (vllm#40941 perf hotfix) — opt-in only — set GENESIS_ENABLE_P98=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P94 — Spec-decode prepare_next_token_ids_padded zero-alloc (vllm#41043) | opt-in only — set GENESIS_ENABLE_P94=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P94 Spec-decode prepare_next_token_ids_padded zero-alloc (vllm#41043 backport) — opt-in only — set GENESIS_ENABLE_P94=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P95 — Marlin TP cudagraph cap on Ampere (vllm#40385) | opt-in only — set GENESIS_ENABLE_P95=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P95 Marlin TP cudagraph cap on Ampere (vllm#40385 backport) — opt-in only — set GENESIS_ENABLE_P95=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P91 — AutoRound row-parallel group cdiv + start-idx fix (vllm#39460) | opt-in only — set GENESIS_ENABLE_P91=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P91 AutoRound row-parallel group cdiv + start-idx fix (vllm#39460 backport) — opt-in only — set GENESIS_ENABLE_P91=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P87 — Marlin W4A16/W8A16 sub-tile output dim pad-on-load (vllm#40361) | opt-in only — set GENESIS_ENABLE_P87=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P87 Marlin sub-tile output dim pad-on-load (vllm#40361 backport) — opt-in only — set GENESIS_ENABLE_P87=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN8 — MTP/draft online-quant propagation (vllm#40849) | opt-in only — set GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN8 MTP/draft online-quant propagation (vllm#40849 backport) — opt-in only — set GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN9 — Independent drafter attention backend (vllm#39930) | opt-in only — set GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATTN=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN9 independent drafter attention backend (vllm#39930 backport) — opt-in only — set GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATTN=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN11 — GDN a/b contiguity in fix_query_key_value_ordering (vllm#41142) | opt-in only — set GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN11 GDN a/b contiguity (vllm#41142 backport) — opt-in only — set GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P67c sparse-V integration into P67 split-M kernel — opt-in: set GENESIS_ENABLE_P67_SPARSE_V=1 to engage per-q_t sparse-V skip in P67 split-M kernel
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN34 — WorkspaceManager runtime lock relaxation (PN33 companion for runtime decode) | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN34 workspace lock runtime relaxation — opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN33 — Spec-decode warmup K-aware sizing (vllm#37521 extended to MTP/ngram) | config_detect: neutral:
[INFO:genesis.wiring.text_patch] [PN33 v1/worker/gpu_model_runner.py — spec-decode warmup K-aware sizing (vllm#37521 extended to MTP/ngram)] applied 1 sub-patches: pN33_warmup_k_draft_tokens
[INFO:genesis.apply_all] [Genesis] applied: PN33 spec-decode warmup K-aware (vllm#37521 extended) — PN33 applied: spec-decode warmup uses real num_speculative_tokens instead of dummy K=1. Closes (a) ampersandru mid-stream OOM via propose_draft_token_ids and (b) noonghunna workspace-lock AssertionError on TQ + MTP K=3 single-card. Disable via GENESIS_DISABLE_PN33_SPEC_DECODE_WARMUP_K=1 if warmup OOMs.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN32 — GDN _forward_core chunked-prefill v2 (Cliff 2 fix for single-24GB-GPU OOM) | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN32 GDN chunked-prefill (Cliff 2 fix) — opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN31 — FA varlen persistent out buffer (issue #15, sister to P38) | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_OUT=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN31 FA varlen persistent out (issue #15) — opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_OUT=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN30 — DS conv state layout + spec-decode AL>1 fix (issue #17) | opt-in only — set GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN30 DS conv state + spec-decode AL>1 (issue #17) — opt-in only — set GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN29 — GDN chunk_o scale-fold (vllm#41446 pattern (c)) | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN29 GDN chunk_o scale-fold (vllm#41446 pattern c) — opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN12 — FFN intermediate scratch pool — Cliff 1 fix on TQ3 path | opt-in only — set GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN12 FFN intermediate scratch pool (Cliff 1 fix) — opt-in only — set GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN28 — merge_attn_states NaN guard (vllm#39148 backport) | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN28 merge_attn_states NaN guard (vllm#39148 backport) — opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P15B — FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix) | opt-in only — set GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P15B FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix) — opt-in only — set GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P38B — P38 compile-safe in-source hook (Issue #14 fix) | opt-in only — set GENESIS_ENABLE_P38B_COMPILE_SAFE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P38B P38 compile-safe in-source hook (Issue #14 fix) — opt-in only — set GENESIS_ENABLE_P38B_COMPILE_SAFE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN26b sparse-V tile-skip Genesis kernel (BLASST lambda=a/L for SM86) — opt-in: set GENESIS_ENABLE_PN26_SPARSE_V=1 to enable sparse-V tile-skip kernel (BLASST λ=a/L formula by default)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN27 — Revert MoERunnerInterface PluggableLayer (vllm#41440) | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN27 revert MoERunnerInterface PluggableLayer (vllm#41440 backport) — opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN26 — TQ unified perf pack (centroids prebake + sparse V scaffold) | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN26 TQ unified perf pack (centroids prebake + sparse V scaffold) — opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN25 — SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile path) | opt-in only — set GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN25 SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile-path) — opt-in only — set GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN13 — CUDAGraphWrapper gc.collect/empty_cache lambda arity (vllm#41235) | opt-in only — set GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARITY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN13 CUDAGraphWrapper lambda arity (vllm#41235 backport) — opt-in only — set GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARITY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN14 — TQ decode IOOB safe_page_idx clamp (vllm#40074) | opt-in only — set GENESIS_ENABLE_PN14_TQ_DECODE_OOB_CLAMP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN14 TQ decode IOOB safe_page_idx clamp (vllm#40074 backport) — opt-in only — set GENESIS_ENABLE_PN14_TQ_DECODE_OOB_CLAMP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN19 Scoped max_split_size_mb during model load (vllm#41268) — GENESIS_ENABLE_PN19_SCOPED_MAX_SPLIT not set; default OFF. Backport of vllm#41268 (MatthewBonanni, OPEN). PyTorch 2.10+ introduces load-time fragmentation; this patch sets max_split_size_mb=20 during model load, restores on exit. Estimated win: 200-500 MiB on H100 (per #41268 author); unverified on Ampere — measure before relying on it.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN23 — DFlash combine_hidden_states dtype cast (vllm#40334) | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN23 DFlash combine_hidden_states dtype cast (vllm#40334 backport) — opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN21 — DFlash SWA support partial backport (vllm#40898) | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN21 DFlash SWA support partial backport (vllm#40898 backport) — opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN22 — Local argmax for TP draft (vllm#39419 backport) | opt-in only — set GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN22 Local argmax for TP draft (vllm#39419 backport) — opt-in only — set GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN24 — DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN24 DFlash aux layer +1 indexing fix (vllm#40727 backport) — opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN17 FA2 softmax_lse runtime clamp (Issue #11 Cliff 1 mechanism A) — GENESIS_ENABLE_PN17_FA2_LSE_CLAMP not set; default OFF. Enable on long-text-no-vision configs to close Cliff 1 mechanism A (FA2 softmax_lse over-allocation at long ctx). Diagnosis credit: noonghunna, Genesis Issue #11.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN16 — Lazy-reasoner request hook (per-request enable_thinking) | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN16 Lazy-reasoner request hook (per-request enable_thinking) — opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P86 — ngram batch_propose O(N*K) → O(N+K) direct-fill (vllm#40876) | opt-in only — set GENESIS_ENABLE_P86=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P86 ngram batch_propose O(N+K) direct-fill (vllm#40876 backport) — opt-in only — set GENESIS_ENABLE_P86=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P85 — Hybrid fine-shadow prefix cache (vllm#38182 followup, MambaManager fix) | opt-in only — set GENESIS_ENABLE_P85=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P85 Hybrid fine-shadow prefix cache (MambaManager fix for vllm#38182 followup) — opt-in only — set GENESIS_ENABLE_P85=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P75 — Auto-enable Suffix Decoding (Arctic Inference, vllm#25784) | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P75 Auto-enable Suffix Decoding (vllm#25784 Arctic Inference) — opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P74 — Auto chunk-clamp via long_prefill_token_threshold (P72 companion) | opt-in only — set GENESIS_ENABLE_P74_CHUNK_CLAMP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P74 Auto chunk-clamp via long_prefill_token_threshold (P72 companion) — opt-in only — set GENESIS_ENABLE_P74_CHUNK_CLAMP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P72 — profile_run M cap (unblocks --max-num-batched-tokens>4096 on MoE) | opt-in only — set GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P72 profile_run M cap (unblocks --max-num-batched-tokens>4096 on MoE) — opt-in only — set GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 to engage
[INFO:genesis.wiring.p67b_spec_verify_routing] [Genesis P67b B2] baked env at module load: USE_UPSTREAM=True NUM_KV_SPLITS=(default = self.max_num_kv_splits) (no per-dispatch env reads)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P67b — TurboQuant spec-verify forward() routing (FULL CG enable) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P67b turboquant_attn.py forward() spec-verify routing] applied 1 sub-patches: p67b_forward_spec_verify_branch
[INFO:genesis.apply_all] [Genesis] applied: P67b TurboQuant spec-verify forward() routing — P67b forward() spec-verify routing injected. K+1 batches now bypass _prefill_attention entirely (cudagraph-safe).
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P59 — Qwen3 reasoning embedded tool_call recovery | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P59 Qwen3 reasoning embedded tool_call recovery — opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P58 — Async-scheduler -1 placeholder fix | opt-in only — set GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P58 async-scheduler -1 placeholder fix — opt-in only — set GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P57 — TQ spec-decode capture-safe buffers (deprecated — research artifact) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P57_SPEC_DECODE_CAPTURE_SAFE=1 only for diagno
[INFO:genesis.apply_all] [Genesis] skipped: P57 TQ spec-decode capture-safe buffers — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P57_SPEC_DECODE_CAPTURE_SAFE=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P56 — TQ spec-decode safe-path guard (deprecated — superseded by P65) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P56_SPEC_DECODE_GUARD=1 only for diagnostics
[INFO:genesis.apply_all] [Genesis] skipped: P56 TQ spec-decode safe-path guard — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P56_SPEC_DECODE_GUARD=1 only for diagnostics
[INFO:genesis.wiring.text_patch] [P44 TQ mixed-batch attn_out pool] applied 1 sub-patches: p44_mixed_attn_out_alloc
[INFO:genesis.apply_all] [Genesis] applied: P44 TQ mixed-batch attn_out pool — text-patch applied — mixed-batch attn_out routed through TurboQuantBufferManager pool (~80 MB zero-init eliminated per mixed-batch forward in multi-user serving)
[INFO:genesis.wiring.text_patch] [P46 GDN gating buffer pool] applied 2 sub-patches: p46_g_buffer, p46_beta_buffer
[INFO:genesis.apply_all] [Genesis] applied: P46 GDN gating buffer pool — text-patch applied — fused_gdn_gating now uses GdnGatingBufferManager pool (eliminates ~24k allocs/sec on Qwen3.6-35B-A3B decode)
[INFO:genesis.apply_all] [Genesis] skipped: P7b GDN dual-stream via torch.library.custom_op (opt-in) — opt-in: set GENESIS_ENABLE_P7B=1 to enable GDN dual-stream via torch.library.custom_op (graph-safe alternative to P7; validates numeric equiv + compile-cache freshness before prod)
[INFO:genesis.apply_all] [Genesis] skipped: P40 TurboQuant GQA-grouped decode stage1 (opt-in) — opt-in: set GENESIS_ENABLE_P40=1 to enable GQA-grouped decode stage1 (port of upstream PR #40792)
[INFO:genesis.wiring.p39a_fla_kkt] [Genesis P39b] vllm_config fetch failed (Current vLLM config is not set. This typically means get_current_vllm_config() was called outside of a set_current_vllm_config() context, or a CustomOp was instantiated at module import time or model forward time when config is not set. For tests that directly test custom ops/modules, use the 'default_vllm_config' pytest fixture from tests/conftest.py.); defaults used
[INFO:genesis.wiring.p39a_fla_kkt] [Genesis P39a] rebound vllm.model_executor.layers.fla.ops.chunk_scaled_dot_kkt.chunk_scaled_dot_kkt_fwd (+1 caller mods: ['vllm.model_executor.layers.fla.ops.chunk'])
[INFO:genesis.apply_all] [Genesis] applied: P39a FLA chunk_scaled_dot_kkt persistent A pool — module-level fn replaced (1 caller module(s) also rebound — pool shared across GDN layers)
[INFO:genesis.wiring.p38_tq_continuation_memory] [Genesis P38] rebound TurboQuantAttentionImpl._continuation_prefill (persistent K_full/V_full buffers replace torch.cat peak)
[INFO:genesis.apply_all] [Genesis] applied: P38 TQ _continuation_prefill persistent workspace — class method replaced (persistent K_full/V_full workspace, no .contiguous()/torch.cat transient peaks)
[INFO:genesis.apply_all] [Genesis] skipped: P37 MoE intermediate cache pool (opt-in) — opt-in only — set GENESIS_ENABLE_P37=1 to engage. Manager API is registered and usable independently of this text-patch.
[INFO:genesis.apply_all] [Genesis] skipped: P36 TurboQuant shared decode buffers — redundant: upstream PR #40798 (workspace manager) active — skip
[INFO:genesis.apply_all] [Genesis] applied: P32/P33 TurboQuant cu_2 + synth_seq_lens preallocs — cu_2 + synth_seq_lens preallocs registered (invoked from ensure_turboquant_buffers, fires during profile_run)
[INFO:genesis.prealloc_budget] [Genesis P73] token budget resolved → 4096 (default fallback). Set GENESIS_PREALLOC_TOKEN_BUDGET to override.
[INFO:genesis.wiring.text_patch] [P28 GDN core_attn_out prealloc] applied 1 sub-patches: p28_core_attn_out_alloc
[INFO:genesis.wiring.p28_gdn_core_attn] [Genesis P28] wrapped GatedDeltaNetAttention.__init__ to attach _genesis_gdn_core_attn_buf on each instance
[INFO:genesis.apply_all] [Genesis] applied: P28 GDN core_attn_out prealloc — forward_cuda patched + __init__ wrapped
[INFO:genesis.gdn_dual_stream] [GDN dual-stream] CUDA aux stream initialized
[INFO:genesis.apply_all] [Genesis P7] dispatcher ready (parallel path)
[INFO:genesis.apply_all] [Genesis] skipped: P7 GDN dual-stream in_proj parallelism — deferred — incompatible with torch.compile fullgraph (CUDA streams not SymPy-graphable); custom op implementation required. Re-enable with GENESIS_ENABLE_P7=1 + --enforce-eager.
[INFO:genesis.apply_all] [Genesis] skipped: P17/P18 Marlin MoE per-SM tuning — no tuning entry for SM (12, 0) — upstream heuristic will be used
[INFO:genesis.wiring.text_patch] [P24 fused_moe num_warps/num_stages overlay] applied 2 sub-patches: p24_fp8_cfg_overlay, p24_general_cfg_overlay
[INFO:genesis.apply_all] [Genesis] applied: P24 fused_moe num_warps/num_stages overlay — num_warps / num_stages overlay wired into get_default_config (active only on Triton fused_moe path; Marlin unaffected)
[INFO:genesis.wiring.p14_block_table] [Genesis P14] rebound BlockTable.append_row + move_row (tail-zero-fill active for concurrent-request safety)
[INFO:genesis.apply_all] [Genesis] applied: P14 block_table tail zero-fill — BlockTable.append_row + move_row wrapped (effective in this process)
[INFO:genesis.tq_decode_tune] [Genesis P18b] TQ decode stage1 using upstream defaults (BLOCK_KV=4 num_warps=1 num_stages=1)
[INFO:genesis.apply_all] [Genesis] applied: P18b TurboQuant decode stage1 tune — no env override — using upstream defaults (4/1/1)
[INFO:genesis.apply_all] [Genesis P20] TQ _continuation_prefill FP16 helpers ready for TurboQuantAttentionImpl hook
[INFO:genesis.apply_all] [Genesis] applied: P20 TurboQuant continuation-prefill FP16 rotate — fp16-rotation helper ready for _continuation_prefill hook
[INFO:genesis.fp8_dispatcher] [Genesis FP8 dispatcher] SM (12, 0) → Native Triton FP8 (no override)
[INFO:genesis.apply_all] [Genesis] applied: P1/P2 FP8 kernel dispatcher — SM=(12, 0) → native Triton FP8 path selected
[INFO:genesis.apply_all] Genesis Results: 27 applied, 72 skipped, 0 failed, 2 ⚠️ partial-apply warning(s)
[WARNING:genesis.apply_all] [Genesis] 2 partial-apply warning(s) — patch(es) failed to match expected source pattern. Review below to confirm anchor drift vs upstream change vs config issue:
[WARNING:genesis.apply_all] [Genesis] ⚠️  P103 FLA Cliff 2 chunked fwd_h+fwd_o orchestrator — GENESIS_ENABLE_P103 not set or platform not NVIDIA SM 8.0+
[WARNING:genesis.apply_all] [Genesis] ⚠️  P17/P18 Marlin MoE per-SM tuning — no tuning entry for SM (12, 0) — upstream heuristic will be used
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] apply matrix:
Patch | Status | Title                                         | Reason                                                       | Credit                        
------+--------+-----------------------------------------------+--------------------------------------------------------------+-------------------------------
P61b  | SKIP   | Qwen3 streaming partial-tag overlap guard     | opt-in only — set GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1 to | ExtReMLapin (vllm#40783)      
P62   | SKIP   | Structured-output spec-decode reasoning-end t | opt-in only — set GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING= | sfbemerk (vllm#36138), ciciror
P61   | SKIP   | Qwen3 multi-tool first-occurrence (DEPRECATED | opt-in only AND empirically deprecated — keeping skip; set G | ExtReMLapin (vllm#40783) — P61
P60b  | SKIP   | GDN+ngram Triton kernel offset (Phase 2)      | opt-in only — set GENESIS_ENABLE_P60B_TRITON_KERNEL=1 to eng | tdoublep (vllm#40738)         
P60   | SKIP   | GDN+ngram state recovery (Phase 1: SSM pre-co | opt-in only — set GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1 to enga | tdoublep (vllm#40738), bhaktat
P63   | SKIP   | MTP/Eagle drafter GDN state recovery (depreca | opt-in only AND empirically deprecated — keeping skip; set G | Genesis-original (hypothesis d
P64   | APPLY  | qwen3coder MTP streaming early-return fix     | opt-in env (config: neutral)                                 | kotori-yan (vllm#39598)       
P65   | SKIP   | TurboQuant spec-decode cudagraph downgrade    | opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWN | Genesis-original (root cause f
P66   | APPLY  | cudagraph_capture_sizes spec-decode divisibil | opt-in env (config: neutral)                                 | Genesis-original (mirrors fhl2
P68   | SKIP   | Auto force tool_choice=required for long-cont | opt-in only — set GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to en | Genesis-original (long-ctx too
P69   | SKIP   | Long-context tool-format reminder injection   | opt-in only — set GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER= | Genesis-original (long-ctx too
P70   | SKIP   | Auto-strict-ngram (force prompt_lookup_min>=8 | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to  | Genesis-original (vllm#40875 e
P67   | APPLY  | TurboQuant multi-query kernel for spec-decode | opt-in env (config: neutral)                                 | Genesis-original (proper fix f
P71   | SKIP   | Block-verify rejection sampler (Sun 2024 ICLR | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engag | Backport of vllm#40819 (Z. Gol
P78   | SKIP   | TurboQuant .tolist() capture-guard (adapted f | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1  | Adapted from noonghunna's patc
P77   | SKIP   | Adaptive ngram K controller (EMA + hysteresis | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to e | Genesis-original (port of SGLa
P79b  | SKIP   | Async × spec-decode proposer-sync backport (v | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1  | Backport of vllm#40610 (OPEN d
P79c  | SKIP   | Stale spec_token_ids cleanup for unscheduled  | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEAN | Backport of vllm#37629 (OPEN, 
P81   | SKIP   | fp8 block-scaled MM low-M decode tuning (vllm | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8 | Backport of vllm#40925 (tonyli
P82   | SKIP   | SGLang threshold_single OR-clause acceptance  | opt-in only — set GENESIS_ENABLE_P82=1 to engage             | SGLang team (sgl-project/sglan
P83   | SKIP   | MTP keep-last-cached-block (vllm#38182 downst | opt-in only — set GENESIS_ENABLE_P83=1 to engage             | Root-cause analysis: vllm#3818
P84   | SKIP   | hash_block_size override (vllm#38182 actual r | opt-in only — set GENESIS_ENABLE_P84=1 to engage             | Genesis-original discovery 202
P100  | SKIP   | FlashInfer FULL CUDA graph for spec-decode (v | opt-in only — set GENESIS_ENABLE_P100=1 to engage            | Backport of vllm#41127 (open 2
P101  | SKIP   | TQ continuation 64-token slicing (vllm#41123  | opt-in only — set GENESIS_ENABLE_P101=1 to engage            | Selective backport of vllm#411
P99   | SKIP   | WorkspaceManager.get_simultaneous memoization | opt-in only — set GENESIS_ENABLE_P99=1 to engage             | Per Sander 2026-04-28: 'if rev
P98   | SKIP   | TQ WorkspaceManager revert (vllm#40941 perf h | opt-in only — set GENESIS_ENABLE_P98=1 to engage             | Reverts upstream PR #40941 (ME
P94   | SKIP   | Spec-decode prepare_next_token_ids_padded zer | opt-in only — set GENESIS_ENABLE_P94=1 to engage             | Backport of vllm#41043 (wanglu
P95   | SKIP   | Marlin TP cudagraph cap on Ampere (vllm#40385 | opt-in only — set GENESIS_ENABLE_P95=1 to engage             | Backport of vllm#40385 (OPEN a
P91   | SKIP   | AutoRound row-parallel group cdiv + start-idx | opt-in only — set GENESIS_ENABLE_P91=1 to engage             | Backport of non-MoE-specific p
P87   | SKIP   | Marlin W4A16/W8A16 sub-tile output dim pad-on | opt-in only — set GENESIS_ENABLE_P87=1 to engage             | Backport of vllm#40361 (OPEN).
PN8   | SKIP   | MTP/draft online-quant propagation (vllm#4084 | opt-in only — set GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT= | Backport of vllm#40849 (bhoomi
PN9   | SKIP   | Independent drafter attention backend (vllm#3 | opt-in only — set GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATT | Backport of vllm#39930 (Matthe
PN11  | SKIP   | GDN a/b contiguity in fix_query_key_value_ord | opt-in only — set GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 to | Backport of vllm#41142 (Yeuvoi
PN34  | SKIP   | WorkspaceManager runtime lock relaxation (PN3 | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 | Companion to PN33 — same root 
PN33  | APPLY  | Spec-decode warmup K-aware sizing (vllm#37521 | config_detect: neutral:                                      | Backport of vllm-project/vllm#
PN32  | SKIP   | GDN _forward_core chunked-prefill v2 (Cliff 2 | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1  | Genesis-original v7.69 v2 (202
PN31  | SKIP   | FA varlen persistent out buffer (issue #15, s | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_O | Genesis-original sister patch 
PN30  | SKIP   | DS conv state layout + spec-decode AL>1 fix ( | opt-in only — set GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE= | Genesis-original fix for issue
PN29  | SKIP   | GDN chunk_o scale-fold (vllm#41446 pattern (c | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to en | Backport of vllm#41446 (zobinH
PN12  | SKIP   | FFN intermediate scratch pool — Cliff 1 fix o | opt-in only — set GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL= | Genesis-original 2026-04-29 — 
PN28  | SKIP   | merge_attn_states NaN guard (vllm#39148 backp | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 | Backport of vllm#39148 (jasonk
P15B  | SKIP   | FA varlen max_seqlen_k clamp on TQ path (Issu | opt-in only — set GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 to e | Genesis-original 2026-05-01 fi
P38B  | SKIP   | P38 compile-safe in-source hook (Issue #14 fi | opt-in only — set GENESIS_ENABLE_P38B_COMPILE_SAFE=1 to enga | Genesis-original 2026-05-01 fi
PN27  | SKIP   | Revert MoERunnerInterface PluggableLayer (vll | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 | Backport of vllm#41440 (auto-g
PN26  | SKIP   | TQ unified perf pack (centroids prebake + spa | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage | Genesis-original 2026-05-01 un
PN25  | SKIP   | SiluAndMul.forward_native opaque-op pool (Cli | opt-in only — set GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 t | Genesis-original 2026-05-01 in
PN13  | SKIP   | CUDAGraphWrapper gc.collect/empty_cache lambd | opt-in only — set GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARIT | Backport of vllm#41235 (roikor
PN14  | SKIP   | TQ decode IOOB safe_page_idx clamp (vllm#4007 | opt-in only — set GENESIS_ENABLE_PN14_TQ_DECODE_OOB_CLAMP=1  | Backport of vllm#40074 (devara
PN23  | SKIP   | DFlash combine_hidden_states dtype cast (vllm | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to  | Backport of vllm#40334 (cipher
PN21  | SKIP   | DFlash SWA support partial backport (vllm#408 | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage | Partial backport of vllm#40898
PN22  | SKIP   | Local argmax for TP draft (vllm#39419 backpor | opt-in only — set GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 to e | Backport of vllm#39419 (EanWan
PN24  | SKIP   | DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 | Backport of vllm#40727 (benchi
PN16  | SKIP   | Lazy-reasoner request hook (per-request enabl | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to eng | Genesis-original 2026-04-29. H
P86   | SKIP   | ngram batch_propose O(N*K) → O(N+K) direct-fi | opt-in only — set GENESIS_ENABLE_P86=1 to engage             | Backport of vllm#40876 (aarona
P85   | SKIP   | Hybrid fine-shadow prefix cache (vllm#38182 f | opt-in only — set GENESIS_ENABLE_P85=1 to engage             | Genesis-original 2026-04-27 — 
P75   | SKIP   | Auto-enable Suffix Decoding (Arctic Inference | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to en | Backport-enabler of vllm#25784
P74   | SKIP   | Auto chunk-clamp via long_prefill_token_thres | opt-in only — set GENESIS_ENABLE_P74_CHUNK_CLAMP=1 to engage | Genesis-original (zero-VRAM-co
P72   | SKIP   | profile_run M cap (unblocks --max-num-batched | opt-in only — set GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 to en | Genesis-original (Dynamo fake-
P67b  | APPLY  | TurboQuant spec-verify forward() routing (FUL | opt-in env (config: neutral)                                 | Genesis-original (FULL CG enab
P59   | SKIP   | Qwen3 reasoning embedded tool_call recovery   | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 t | ZenoAFfectionate (vllm#39055) 
P58   | SKIP   | Async-scheduler -1 placeholder fix            | opt-in only — set GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1 | z1ying (vllm#40768)           
P57   | SKIP   | TQ spec-decode capture-safe buffers (deprecat | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40831), gdn_attn.
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated —  | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40807, #40831)   
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] validator: clean (no issues)
[INFO:genesis.apply_all] [Genesis compile-watchdog] apply_all elapsed: 6.2s
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] apply matrix:
Patch | Status | Title                                         | Reason                                                       | Credit                        
------+--------+-----------------------------------------------+--------------------------------------------------------------+-------------------------------
P61b  | SKIP   | Qwen3 streaming partial-tag overlap guard     | opt-in only — set GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1 to | ExtReMLapin (vllm#40783)      
P62   | SKIP   | Structured-output spec-decode reasoning-end t | opt-in only — set GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING= | sfbemerk (vllm#36138), ciciror
P61   | SKIP   | Qwen3 multi-tool first-occurrence (DEPRECATED | opt-in only AND empirically deprecated — keeping skip; set G | ExtReMLapin (vllm#40783) — P61
P60b  | SKIP   | GDN+ngram Triton kernel offset (Phase 2)      | opt-in only — set GENESIS_ENABLE_P60B_TRITON_KERNEL=1 to eng | tdoublep (vllm#40738)         
P60   | SKIP   | GDN+ngram state recovery (Phase 1: SSM pre-co | opt-in only — set GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1 to enga | tdoublep (vllm#40738), bhaktat
P63   | SKIP   | MTP/Eagle drafter GDN state recovery (depreca | opt-in only AND empirically deprecated — keeping skip; set G | Genesis-original (hypothesis d
P64   | APPLY  | qwen3coder MTP streaming early-return fix     | opt-in env (config: neutral)                                 | kotori-yan (vllm#39598)       
P65   | SKIP   | TurboQuant spec-decode cudagraph downgrade    | opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWN | Genesis-original (root cause f
P66   | APPLY  | cudagraph_capture_sizes spec-decode divisibil | opt-in env (config: neutral)                                 | Genesis-original (mirrors fhl2
P68   | SKIP   | Auto force tool_choice=required for long-cont | opt-in only — set GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to en | Genesis-original (long-ctx too
P69   | SKIP   | Long-context tool-format reminder injection   | opt-in only — set GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER= | Genesis-original (long-ctx too
P70   | SKIP   | Auto-strict-ngram (force prompt_lookup_min>=8 | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to  | Genesis-original (vllm#40875 e
P67   | APPLY  | TurboQuant multi-query kernel for spec-decode | opt-in env (config: neutral)                                 | Genesis-original (proper fix f
P71   | SKIP   | Block-verify rejection sampler (Sun 2024 ICLR | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engag | Backport of vllm#40819 (Z. Gol
P78   | SKIP   | TurboQuant .tolist() capture-guard (adapted f | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1  | Adapted from noonghunna's patc
P77   | SKIP   | Adaptive ngram K controller (EMA + hysteresis | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to e | Genesis-original (port of SGLa
P79b  | SKIP   | Async × spec-decode proposer-sync backport (v | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1  | Backport of vllm#40610 (OPEN d
P79c  | SKIP   | Stale spec_token_ids cleanup for unscheduled  | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEAN | Backport of vllm#37629 (OPEN, 
P81   | SKIP   | fp8 block-scaled MM low-M decode tuning (vllm | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8 | Backport of vllm#40925 (tonyli
P82   | SKIP   | SGLang threshold_single OR-clause acceptance  | opt-in only — set GENESIS_ENABLE_P82=1 to engage             | SGLang team (sgl-project/sglan
P83   | SKIP   | MTP keep-last-cached-block (vllm#38182 downst | opt-in only — set GENESIS_ENABLE_P83=1 to engage             | Root-cause analysis: vllm#3818
P84   | SKIP   | hash_block_size override (vllm#38182 actual r | opt-in only — set GENESIS_ENABLE_P84=1 to engage             | Genesis-original discovery 202
P100  | SKIP   | FlashInfer FULL CUDA graph for spec-decode (v | opt-in only — set GENESIS_ENABLE_P100=1 to engage            | Backport of vllm#41127 (open 2
P101  | SKIP   | TQ continuation 64-token slicing (vllm#41123  | opt-in only — set GENESIS_ENABLE_P101=1 to engage            | Selective backport of vllm#411
P99   | SKIP   | WorkspaceManager.get_simultaneous memoization | opt-in only — set GENESIS_ENABLE_P99=1 to engage             | Per Sander 2026-04-28: 'if rev
P98   | SKIP   | TQ WorkspaceManager revert (vllm#40941 perf h | opt-in only — set GENESIS_ENABLE_P98=1 to engage             | Reverts upstream PR #40941 (ME
P94   | SKIP   | Spec-decode prepare_next_token_ids_padded zer | opt-in only — set GENESIS_ENABLE_P94=1 to engage             | Backport of vllm#41043 (wanglu
P95   | SKIP   | Marlin TP cudagraph cap on Ampere (vllm#40385 | opt-in only — set GENESIS_ENABLE_P95=1 to engage             | Backport of vllm#40385 (OPEN a
P91   | SKIP   | AutoRound row-parallel group cdiv + start-idx | opt-in only — set GENESIS_ENABLE_P91=1 to engage             | Backport of non-MoE-specific p
P87   | SKIP   | Marlin W4A16/W8A16 sub-tile output dim pad-on | opt-in only — set GENESIS_ENABLE_P87=1 to engage             | Backport of vllm#40361 (OPEN).
PN8   | SKIP   | MTP/draft online-quant propagation (vllm#4084 | opt-in only — set GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT= | Backport of vllm#40849 (bhoomi
PN9   | SKIP   | Independent drafter attention backend (vllm#3 | opt-in only — set GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATT | Backport of vllm#39930 (Matthe
PN11  | SKIP   | GDN a/b contiguity in fix_query_key_value_ord | opt-in only — set GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 to | Backport of vllm#41142 (Yeuvoi
PN34  | SKIP   | WorkspaceManager runtime lock relaxation (PN3 | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 | Companion to PN33 — same root 
PN33  | APPLY  | Spec-decode warmup K-aware sizing (vllm#37521 | config_detect: neutral:                                      | Backport of vllm-project/vllm#
PN32  | SKIP   | GDN _forward_core chunked-prefill v2 (Cliff 2 | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1  | Genesis-original v7.69 v2 (202
PN31  | SKIP   | FA varlen persistent out buffer (issue #15, s | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_O | Genesis-original sister patch 
PN30  | SKIP   | DS conv state layout + spec-decode AL>1 fix ( | opt-in only — set GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE= | Genesis-original fix for issue
PN29  | SKIP   | GDN chunk_o scale-fold (vllm#41446 pattern (c | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to en | Backport of vllm#41446 (zobinH
PN12  | SKIP   | FFN intermediate scratch pool — Cliff 1 fix o | opt-in only — set GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL= | Genesis-original 2026-04-29 — 
PN28  | SKIP   | merge_attn_states NaN guard (vllm#39148 backp | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 | Backport of vllm#39148 (jasonk
P15B  | SKIP   | FA varlen max_seqlen_k clamp on TQ path (Issu | opt-in only — set GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 to e | Genesis-original 2026-05-01 fi
P38B  | SKIP   | P38 compile-safe in-source hook (Issue #14 fi | opt-in only — set GENESIS_ENABLE_P38B_COMPILE_SAFE=1 to enga | Genesis-original 2026-05-01 fi
PN27  | SKIP   | Revert MoERunnerInterface PluggableLayer (vll | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 | Backport of vllm#41440 (auto-g
PN26  | SKIP   | TQ unified perf pack (centroids prebake + spa | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage | Genesis-original 2026-05-01 un
PN25  | SKIP   | SiluAndMul.forward_native opaque-op pool (Cli | opt-in only — set GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 t | Genesis-original 2026-05-01 in
PN13  | SKIP   | CUDAGraphWrapper gc.collect/empty_cache lambd | opt-in only — set GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARIT | Backport of vllm#41235 (roikor
PN14  | SKIP   | TQ decode IOOB safe_page_idx clamp (vllm#4007 | opt-in only — set GENESIS_ENABLE_PN14_TQ_DECODE_OOB_CLAMP=1  | Backport of vllm#40074 (devara
PN23  | SKIP   | DFlash combine_hidden_states dtype cast (vllm | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to  | Backport of vllm#40334 (cipher
PN21  | SKIP   | DFlash SWA support partial backport (vllm#408 | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage | Partial backport of vllm#40898
PN22  | SKIP   | Local argmax for TP draft (vllm#39419 backpor | opt-in only — set GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 to e | Backport of vllm#39419 (EanWan
PN24  | SKIP   | DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 | Backport of vllm#40727 (benchi
PN16  | SKIP   | Lazy-reasoner request hook (per-request enabl | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to eng | Genesis-original 2026-04-29. H
P86   | SKIP   | ngram batch_propose O(N*K) → O(N+K) direct-fi | opt-in only — set GENESIS_ENABLE_P86=1 to engage             | Backport of vllm#40876 (aarona
P85   | SKIP   | Hybrid fine-shadow prefix cache (vllm#38182 f | opt-in only — set GENESIS_ENABLE_P85=1 to engage             | Genesis-original 2026-04-27 — 
P75   | SKIP   | Auto-enable Suffix Decoding (Arctic Inference | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to en | Backport-enabler of vllm#25784
P74   | SKIP   | Auto chunk-clamp via long_prefill_token_thres | opt-in only — set GENESIS_ENABLE_P74_CHUNK_CLAMP=1 to engage | Genesis-original (zero-VRAM-co
P72   | SKIP   | profile_run M cap (unblocks --max-num-batched | opt-in only — set GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 to en | Genesis-original (Dynamo fake-
P67b  | APPLY  | TurboQuant spec-verify forward() routing (FUL | opt-in env (config: neutral)                                 | Genesis-original (FULL CG enab
P59   | SKIP   | Qwen3 reasoning embedded tool_call recovery   | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 t | ZenoAFfectionate (vllm#39055) 
P58   | SKIP   | Async-scheduler -1 placeholder fix            | opt-in only — set GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1 | z1ying (vllm#40768)           
P57   | SKIP   | TQ spec-decode capture-safe buffers (deprecat | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40831), gdn_attn.
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated —  | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40807, #40831)   
[workspace_lock_disable] target: /usr/local/lib/python3.12/dist-packages/vllm/v1/worker/workspace.py
[workspace_lock_disable] applied (lock-violation now logs WARNING, allocates anyway)
[tolist_cudagraph_fix] target: /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/turboquant_attn.py
[tolist_cudagraph_fix] Site B (_prefill_attention): applied
[tolist_cudagraph_fix] Site A (forward mixed-batch): applied
[tolist_cudagraph_fix] Patched /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/turboquant_attn.py. Site A=applied, Site B=applied
WARNING 05-04 20:04:36 [argparse_utils.py:257] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in a future version.
(APIServer pid=1) INFO 05-04 20:04:36 [utils.py:299] 
(APIServer pid=1) INFO 05-04 20:04:36 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 05-04 20:04:36 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.1rc1.dev16+g7a1eb8ac2
(APIServer pid=1) INFO 05-04 20:04:36 [utils.py:299]   █▄█▀ █     █     █     █  model   /root/.cache/huggingface/qwen3.6-27b-nvfp4
(APIServer pid=1) INFO 05-04 20:04:36 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 05-04 20:04:36 [utils.py:299] 
(APIServer pid=1) INFO 05-04 20:04:36 [utils.py:233] non-default args: {'model_tag': '/root/.cache/huggingface/qwen3.6-27b-nvfp4', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'model': '/root/.cache/huggingface/qwen3.6-27b-nvfp4', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 96000, 'quantization': 'auto_round', 'served_model_name': ['qwen3.6-27b'], 'reasoning_parser': 'qwen3', 'kv_cache_dtype': 'fp8_e4m3', 'enable_prefix_caching': True, 'max_num_batched_tokens': 4128, 'max_num_seqs': 1, 'enable_chunked_prefill': True, 'scheduler_reserve_full_isl': False, 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 3}}
(APIServer pid=1) INFO 05-04 20:04:42 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
(APIServer pid=1) INFO 05-04 20:04:42 [nixl_utils.py:32] NIXL is available
(APIServer pid=1) INFO 05-04 20:04:42 [model.py:563] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1) INFO 05-04 20:04:42 [model.py:1692] Using max model len 96000
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 92, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 678, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 692, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 124, in build_async_engine_client_from_engine_args
(APIServer pid=1)     vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=1)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1692, in create_engine_config
(APIServer pid=1)     model_config = self.create_model_config()
(APIServer pid=1)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1536, in create_model_config
(APIServer pid=1)     return ModelConfig(
(APIServer pid=1)            ^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
(APIServer pid=1)     s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=1) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=1)   Value error, Quantization method specified in the model config (compressed-tensors) does not match the quantization method specified in the `quantization` argument (auto_round). [type=value_error, input_value=ArgsKwargs((), {'model': ...nderer_num_workers': 1}), input_type=ArgsKwargs]
(APIServer pid=1)     For further information visit https://errors.pydantic.dev/2.13/v/value_error

0 replies

apnar · 2026-05-04T20:15:26Z

apnar
May 4, 2026
Author

Tried changing Quantization to 'compressed-tensors' which got further but died with the following:

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
[INFO:genesis.apply_all] Genesis platform: {"vendor": {"is_nvidia_cuda": true, "is_amd_rocm": false, "is_intel_xpu": false, "is_cpu_only": false}, "nvidia": {"compute_capability": [12, 0], "is_ampere_datacenter": false, "is_ampere_consumer": false, "is_ada_lovelace": false, "is_hopper": false, "is_blackwell": false, "has_native_fp8": true}, "amd": {}, "versions": {"torch": [2, 11], "transformers": [5, 6, 2], "vllm": [0, 20, 1], "flash_attn_major": 2}, "paths": {"vllm_install_root": "/usr/local/lib/python3.12/dist-packages/vllm"}}
[INFO:genesis.apply_all] Genesis Unified Patch v7.0 — Ampere FP8 + TQ + MoE + Hybrid + bugfixes. Philosophy: МЫ ЧИНИМ, НЕ ЛОМАЕМ.
[INFO:genesis.apply_all] [Genesis registry] 100 dispatcher entries — schema-clean, dependency graph consistent.
[INFO:genesis.apply_all] ==============================================================================
[INFO:genesis.apply_all] [Genesis GPU profile] detected: NVIDIA GeForce RTX 5090
[INFO:genesis.apply_all]   canonical: RTX 5090  cc: (12, 0)  SM: 170  L2: 88 MB  BW: 1792 GB/s  regime: compute
[INFO:genesis.apply_all] 
[INFO:genesis.apply_all] Per-patch recommendations:
[INFO:genesis.apply_all] ------------------------------------------------------------------------------
[INFO:genesis.apply_all]   [REC] P40                TQ k8v4 GQA grouping kernel (vllm#40792)
[INFO:genesis.apply_all]           gain: +5-15% (mixed regime), +15-30% (compute regime)
[INFO:genesis.apply_all]           → recommend: export GENESIS_ENABLE_P40=1
[INFO:genesis.apply_all]           why: Author measured +27% on H100. Empirically NS on A5000 (p=0.28). Cache-locality benefit needs L2 >= 24 MB.
[INFO:genesis.apply_all]   [ON ] P67                Multi-query verify kernel for spec-decode K+1
[INFO:genesis.apply_all]           gain: +25-35%
[INFO:genesis.apply_all]           why: +32% TPS on 35B-A3B-FP8 spec-decode K=3 verify (Genesis internal benchmark, all GPU classes tested).
[INFO:genesis.apply_all]   [REC] P82                SGLang-style acceptance threshold OR-clause
[INFO:genesis.apply_all]           gain: +8-12%
[INFO:genesis.apply_all]           → recommend: export GENESIS_ENABLE_P82=1
[INFO:genesis.apply_all]           why: Cross-rig confirmed: +12% on A5000 FP8, +10.5% on 3090 INT4.
[INFO:genesis.apply_all]   [REC] PN8                MTP/draft online-quant propagation (vllm#40849)
[INFO:genesis.apply_all]           gain: 0% TPS, but ~1-2 GiB total VRAM headroom
[INFO:genesis.apply_all]           → recommend: export GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1
[INFO:genesis.apply_all]           why: Verified ~1 GiB VRAM saved per GPU on 35B-A3B-FP8 + MTP K=3. Use freed VRAM for higher gpu-mem-util or longer ctx.
[INFO:genesis.apply_all]   [OFF] P83+P84+P85        Prefix-cache cake-and-eat patches (vllm#38182)
[INFO:genesis.apply_all]           gain: (currently negative)
[INFO:genesis.apply_all]           (correctly skipped on this regime)
[INFO:genesis.apply_all]           why: Tested 4-arm A/B 2026-04-29: -29% TPS regression even with full stack including root-cause P84. Possible vllm cache machinery fixed-overhead per-step. WAIT for v0.20.2 pin bump.
[INFO:genesis.apply_all] ==============================================================================
[INFO:genesis.wiring.p8_kv_hybrid_reporting] [P8] using V2 import anchor (post-MambaSpec layout)
[INFO:genesis.wiring.text_patch] [P8 KV hybrid reporting (kv_cache_utils)] applied 3 sub-patches: p8_kv_imports, p8_kv_helper_injection, p8_kv_callsite
[INFO:genesis.wiring.text_patch] [P8 KV hybrid reporting (scheduler)] applied 2 sub-patches: p8_sched_import, p8_sched_callsite
[INFO:genesis.apply_all] [Genesis] applied: P8 KV hybrid reporting (per-token capacity) — kv_cache_utils=applied(ok), scheduler=applied(ok)
[INFO:genesis.wiring.text_patch] [P3 TurboQuant BF16->FP8 cast (Ampere fix)] applied 1 sub-patches: p3_bf16_fp8_cast
[INFO:genesis.apply_all] [Genesis] applied: P3 TurboQuant BF16->FP8 cast (Ampere fix) — BF16->FP8 cast guard inserted
[INFO:genesis.wiring.text_patch] [P6 TQ-aware block size alignment] applied 2 sub-patches: p6_import_tqspec, p6_tq_branch
[INFO:genesis.apply_all] [Genesis] applied: P6 TurboQuant-aware attention page size — TQ-aware page-size branch inserted
[INFO:genesis.wiring.text_patch] [P15 Qwen3 None/null tool arg] applied 1 sub-patches: p15_none_null
[INFO:genesis.apply_all] [Genesis] applied: P15 Qwen3 None/null tool arg parser — None/none mapping added to tool param parser
[INFO:genesis.wiring.text_patch] [P12 Qwen3 <tool_call> implicit reasoning end] upstream marker '_tool_call_token_id' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] skipped: P12 Qwen3 <tool_call> implicit reasoning end — upstream_merged
[INFO:genesis.wiring.text_patch] [P27 Qwen3 BEFORE-THINK fallback/p27_nonstream_return_baseline] anchor not found — soft skip (sibling patches continue)
[INFO:genesis.wiring.text_patch] [P27 Qwen3 BEFORE-THINK fallback] applied 3 sub-patches: p27_nonstream_capture, p27_nonstream_return_pr35687, p27_stream_start
[INFO:genesis.apply_all] [Genesis] applied: P27 Qwen3 BEFORE-THINK fallback — BEFORE-THINK fallback wired (non-stream + stream)
[INFO:genesis.wiring.text_patch] [P34 Mamba zero-collapse deadlock guard] applied 1 sub-patches: p34_deadlock_guard
[INFO:genesis.apply_all] [Genesis] applied: P34 Mamba zero-collapse deadlock guard — zero-collapse deadlock guard inserted (fixes #40707 for hybrid Mamba + multimodal)
[INFO:genesis.apply_all] [Genesis] applied: P29 tool parser IndexError guard — upstream already contains bounded-index guards (no-op)
[INFO:genesis.marlin_fp32_reduce] [Genesis P23] Marlin FP32_REDUCE: disabled=False on SM=(12, 0) (auto-from-platform)
[INFO:genesis.apply_all] [Genesis] applied: P23 Marlin FP32_REDUCE env override — decision: fp32_reduce disabled=False (requires upstream wire into Marlin launcher to take effect)
[INFO:genesis.wiring.text_patch] [P4 TurboQuant hybrid model support] applied 2 sub-patches: p4_helper_fn, p4_tq_block
[INFO:genesis.apply_all] [Genesis] applied: P4 TurboQuant hybrid model support — text-patch succeeded
[INFO:genesis.wiring.text_patch] [P5 KV cache page size unification (v1_lcm_pad_max)/p5_import_math] anchor not found — soft skip (sibling patches continue)
[INFO:genesis.wiring.text_patch] [P5 KV cache page size unification (v1_lcm_pad_max)] applied 1 sub-patches: p5_v1_lcm_pad_max_from_baseline
[INFO:genesis.apply_all] [Genesis] applied: P5 KV cache page size unification — text-patch v2 succeeded (pad-smaller-to-max)
[INFO:genesis.apply_all] [Genesis] skipped: P5b KV page-size pad-smaller-to-max (env-opt-in) — opt-in: set GENESIS_ENABLE_P5B=1 to enable pad-smaller-to-max KV page-size strategy (saves ~34% per-block VRAM on Qwen3.6 hybrid vs P5 v1 LCM-pad-up; BLAST-RADIUS is KV allocator → benchmark on VM 100 before enabling in prod)
INFO 05-04 20:13:45 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
INFO 05-04 20:13:45 [nixl_utils.py:32] NIXL is available
[INFO:genesis.wiring.p31_router_softmax] [Genesis P31] wrapped grouped_topk_router.grouped_topk (fp32 upcast active for grouped-MoE models)
[INFO:genesis.apply_all] [Genesis] applied: P31 MoE router fp32 softmax — grouped_topk wrapped (effective in this process)
[INFO:genesis.wiring.p22_tq_prealloc] [P22] PR #40655 merged (bhoomit moved init out) — auto-skip — patch retired in favor of upstream
[INFO:genesis.apply_all] [Genesis] skipped: P22 TurboQuant shared dequant prealloc — PR #40655 merged (bhoomit moved init out) — auto-skip
[INFO:genesis.wiring.text_patch] [P26 TQ prefill output prealloc] upstream marker 'if not hasattr(self, "_cu_2")' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] skipped: P26 TurboQuant prefill output prealloc — upstream_merged
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P61b — Qwen3 streaming partial-tag overlap guard | opt-in only — set GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P61b Qwen3 streaming partial-tag overlap guard — opt-in only — set GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P62 — Structured-output spec-decode reasoning-end timing fix | opt-in only — set GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P62 structured-output spec-decode timing fix — opt-in only — set GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P61 — Qwen3 multi-tool first-occurrence (DEPRECATED — superseded by P12 v2) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P61_QWEN3_MULTI_TOOL=1 only for diagnostics
[INFO:genesis.apply_all] [Genesis] skipped: P61 Qwen3 multi-tool first-occurrence — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P61_QWEN3_MULTI_TOOL=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P60b — GDN+ngram Triton kernel offset (Phase 2) | opt-in only — set GENESIS_ENABLE_P60B_TRITON_KERNEL=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P60b GDN+ngram Triton kernel offset — opt-in only — set GENESIS_ENABLE_P60B_TRITON_KERNEL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P60 — GDN+ngram state recovery (Phase 1: SSM pre-copy) | opt-in only — set GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P60 GDN+ngram state recovery — opt-in only — set GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P63 — MTP/Eagle drafter GDN state recovery (deprecated — wrong layer) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P63_MTP_GDN_STATE_RECOVERY=1 only for diagnost
[INFO:genesis.apply_all] [Genesis] skipped: P63 MTP/Eagle drafter GDN state recovery — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P63_MTP_GDN_STATE_RECOVERY=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P64 — qwen3coder MTP streaming early-return fix | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P64 qwen3coder_tool_parser.py — MTP streaming early-return removal] applied 2 sub-patches: p64_remove_early_return, p64_unify_emit_at_fnend
[INFO:genesis.wiring.text_patch] [P64 serving.py — MTP safety-net + Pydantic null fix] applied 2 sub-patches: p64_safety_net_widen, p64_callsite_guard
[INFO:genesis.apply_all] [Genesis] applied: P64 qwen3coder MTP streaming early-return fix — P64 applied: 2 files modified, 0 idempotent. qwen3coder streaming parser no longer drops parameters when MTP bundles last param + </function> in same delta. Safety net widened to fire on finish_reason alone.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P65 — TurboQuant spec-decode cudagraph downgrade | opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P65 TurboQuant spec-decode cudagraph downgrade — opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P66 — cudagraph_capture_sizes spec-decode divisibility filter | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P66 config/vllm.py — cudagraph_capture_sizes spec-decode filter] applied 1 sub-patches: p66_size_filter
[INFO:genesis.apply_all] [Genesis] applied: P66 cudagraph_capture_sizes spec-decode divisibility filter — P66 applied: cudagraph_capture_sizes will be filtered to sizes divisible by uniform_decode_query_len when spec-decode is active. Boot 2-4x faster, less peak GPU memory, no mixed-q_len capture risks.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P68 — Auto force tool_choice=required for long-context tool calls | opt-in only — set GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P69 — Long-context tool-format reminder injection | opt-in only — set GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P68/P69 long-context tool-call adherence — neither P68 nor P69 enabled; hook injection skipped to keep serving.py pristine. P68: opt-in only — set GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to engage | P69: opt-in only — set GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P70 — Auto-strict-ngram (force prompt_lookup_min>=8) | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P70 Auto-strict-ngram (force prompt_lookup_min>=8) — opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to engage
[INFO:genesis.wiring.p67_tq_multi_query_kernel] [Genesis P67 H2] baked env at module load: MAX_PRIOR_LEN=4096 DEBUG_COMPARE=False (no per-dispatch env reads)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P67 — TurboQuant multi-query kernel for spec-decode K+1 | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P67 turboquant_attn.py — multi-query kernel hook] applied 1 sub-patches: p67_kernel_hook
[INFO:genesis.apply_all] [Genesis] applied: P67 TurboQuant multi-query kernel for spec-decode K+1 — P67 hook injected at top of _prefill_attention. Multi-query continuation prefill batches (spec-verify K+1 with prior cached KV) will route to Genesis Triton kernel when GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1. Diag: {'env_enabled': True, 'version': 'v7.39_aggressive_tune', 'sm': '12.0', 'fp8_mode': 'e4nv', 'autoconfig': {'BLOCK_KV': 32, 'num_warps': 8, 'num_stages': 3}, 'kernel_built': True}
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P71 — Block-verify rejection sampler (Sun 2024 ICLR) | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P71 Block-verify rejection sampler (vllm#40819 + gemini bug-fixes) — opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P78 — TurboQuant .tolist() capture-guard (adapted from noonghunna) | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P78 TurboQuant .tolist() capture-guard (adapted from noonghunna) — opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P77 — Adaptive ngram K controller (EMA + hysteresis + auto-disable) | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P77 Adaptive ngram K controller (EMA + hysteresis + auto-disable) — opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P79b — Async × spec-decode proposer-sync backport (vllm#40610) | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P79b Async × spec-decode proposer-sync backport (vllm#40610) — opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P79c — Stale spec_token_ids cleanup for unscheduled requests (vllm#37629) | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P79c Stale spec_token_ids cleanup for unscheduled requests (vllm#37629) — opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P81 — fp8 block-scaled MM low-M decode tuning (vllm#40925) | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P81 fp8 block-scaled MM low-M decode tuning (vllm#40925) — opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P82 — SGLang threshold_single OR-clause acceptance (BIASED — opt-in research) | opt-in only — set GENESIS_ENABLE_P82=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P82 SGLang threshold_single OR-clause acceptance (BIASED — opt-in research) — opt-in only — set GENESIS_ENABLE_P82=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P83 — MTP keep-last-cached-block (vllm#38182 downstream symptom — P84 is real fix) | opt-in only — set GENESIS_ENABLE_P83=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P83 MTP keep-last-cached-block (vllm#38182 mitigation) — opt-in only — set GENESIS_ENABLE_P83=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P84 — hash_block_size override (vllm#38182 actual root cause) | opt-in only — set GENESIS_ENABLE_P84=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P84 hash_block_size override (vllm#38182 ACTUAL root cause) — opt-in only — set GENESIS_ENABLE_P84=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P100 — FlashInfer FULL CUDA graph for spec-decode (vllm#41127) | opt-in only — set GENESIS_ENABLE_P100=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P100 FlashInfer FULL CUDA graph for spec-decode (vllm#41127) — opt-in only — set GENESIS_ENABLE_P100=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P103 FLA Cliff 2 chunked fwd_h+fwd_o orchestrator — GENESIS_ENABLE_P103 not set or platform not NVIDIA SM 8.0+
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P101 — TQ continuation 64-token slicing (vllm#41123 SELECTIVE) | opt-in only — set GENESIS_ENABLE_P101=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P101 TQ continuation 64-token slicing (vllm#41123 selective) — opt-in only — set GENESIS_ENABLE_P101=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P99 — WorkspaceManager.get_simultaneous memoization (perf hotfix) | opt-in only — set GENESIS_ENABLE_P99=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P99 WorkspaceManager memoize get_simultaneous (perf hotfix) — opt-in only — set GENESIS_ENABLE_P99=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P98 — TQ WorkspaceManager revert (vllm#40941 perf hotfix) | opt-in only — set GENESIS_ENABLE_P98=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P98 TQ WorkspaceManager revert (vllm#40941 perf hotfix) — opt-in only — set GENESIS_ENABLE_P98=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P94 — Spec-decode prepare_next_token_ids_padded zero-alloc (vllm#41043) | opt-in only — set GENESIS_ENABLE_P94=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P94 Spec-decode prepare_next_token_ids_padded zero-alloc (vllm#41043 backport) — opt-in only — set GENESIS_ENABLE_P94=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P95 — Marlin TP cudagraph cap on Ampere (vllm#40385) | opt-in only — set GENESIS_ENABLE_P95=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P95 Marlin TP cudagraph cap on Ampere (vllm#40385 backport) — opt-in only — set GENESIS_ENABLE_P95=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P91 — AutoRound row-parallel group cdiv + start-idx fix (vllm#39460) | opt-in only — set GENESIS_ENABLE_P91=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P91 AutoRound row-parallel group cdiv + start-idx fix (vllm#39460 backport) — opt-in only — set GENESIS_ENABLE_P91=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P87 — Marlin W4A16/W8A16 sub-tile output dim pad-on-load (vllm#40361) | opt-in only — set GENESIS_ENABLE_P87=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P87 Marlin sub-tile output dim pad-on-load (vllm#40361 backport) — opt-in only — set GENESIS_ENABLE_P87=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN8 — MTP/draft online-quant propagation (vllm#40849) | opt-in only — set GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN8 MTP/draft online-quant propagation (vllm#40849 backport) — opt-in only — set GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN9 — Independent drafter attention backend (vllm#39930) | opt-in only — set GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATTN=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN9 independent drafter attention backend (vllm#39930 backport) — opt-in only — set GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATTN=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN11 — GDN a/b contiguity in fix_query_key_value_ordering (vllm#41142) | opt-in only — set GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN11 GDN a/b contiguity (vllm#41142 backport) — opt-in only — set GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P67c sparse-V integration into P67 split-M kernel — opt-in: set GENESIS_ENABLE_P67_SPARSE_V=1 to engage per-q_t sparse-V skip in P67 split-M kernel
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN34 — WorkspaceManager runtime lock relaxation (PN33 companion for runtime decode) | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN34 workspace lock runtime relaxation — opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN33 — Spec-decode warmup K-aware sizing (vllm#37521 extended to MTP/ngram) | config_detect: neutral:
[INFO:genesis.wiring.text_patch] [PN33 v1/worker/gpu_model_runner.py — spec-decode warmup K-aware sizing (vllm#37521 extended to MTP/ngram)] applied 1 sub-patches: pN33_warmup_k_draft_tokens
[INFO:genesis.apply_all] [Genesis] applied: PN33 spec-decode warmup K-aware (vllm#37521 extended) — PN33 applied: spec-decode warmup uses real num_speculative_tokens instead of dummy K=1. Closes (a) ampersandru mid-stream OOM via propose_draft_token_ids and (b) noonghunna workspace-lock AssertionError on TQ + MTP K=3 single-card. Disable via GENESIS_DISABLE_PN33_SPEC_DECODE_WARMUP_K=1 if warmup OOMs.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN32 — GDN _forward_core chunked-prefill v2 (Cliff 2 fix for single-24GB-GPU OOM) | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN32 GDN chunked-prefill (Cliff 2 fix) — opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN31 — FA varlen persistent out buffer (issue #15, sister to P38) | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_OUT=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN31 FA varlen persistent out (issue #15) — opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_OUT=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN30 — DS conv state layout + spec-decode AL>1 fix (issue #17) | opt-in only — set GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN30 DS conv state + spec-decode AL>1 (issue #17) — opt-in only — set GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN29 — GDN chunk_o scale-fold (vllm#41446 pattern (c)) | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN29 GDN chunk_o scale-fold (vllm#41446 pattern c) — opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN12 — FFN intermediate scratch pool — Cliff 1 fix on TQ3 path | opt-in only — set GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN12 FFN intermediate scratch pool (Cliff 1 fix) — opt-in only — set GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN28 — merge_attn_states NaN guard (vllm#39148 backport) | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN28 merge_attn_states NaN guard (vllm#39148 backport) — opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P15B — FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix) | opt-in only — set GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P15B FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix) — opt-in only — set GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P38B — P38 compile-safe in-source hook (Issue #14 fix) | opt-in only — set GENESIS_ENABLE_P38B_COMPILE_SAFE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P38B P38 compile-safe in-source hook (Issue #14 fix) — opt-in only — set GENESIS_ENABLE_P38B_COMPILE_SAFE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN26b sparse-V tile-skip Genesis kernel (BLASST lambda=a/L for SM86) — opt-in: set GENESIS_ENABLE_PN26_SPARSE_V=1 to enable sparse-V tile-skip kernel (BLASST λ=a/L formula by default)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN27 — Revert MoERunnerInterface PluggableLayer (vllm#41440) | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN27 revert MoERunnerInterface PluggableLayer (vllm#41440 backport) — opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN26 — TQ unified perf pack (centroids prebake + sparse V scaffold) | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN26 TQ unified perf pack (centroids prebake + sparse V scaffold) — opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN25 — SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile path) | opt-in only — set GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN25 SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile-path) — opt-in only — set GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN13 — CUDAGraphWrapper gc.collect/empty_cache lambda arity (vllm#41235) | opt-in only — set GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARITY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN13 CUDAGraphWrapper lambda arity (vllm#41235 backport) — opt-in only — set GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARITY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN14 — TQ decode IOOB safe_page_idx clamp (vllm#40074) | opt-in only — set GENESIS_ENABLE_PN14_TQ_DECODE_OOB_CLAMP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN14 TQ decode IOOB safe_page_idx clamp (vllm#40074 backport) — opt-in only — set GENESIS_ENABLE_PN14_TQ_DECODE_OOB_CLAMP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN19 Scoped max_split_size_mb during model load (vllm#41268) — GENESIS_ENABLE_PN19_SCOPED_MAX_SPLIT not set; default OFF. Backport of vllm#41268 (MatthewBonanni, OPEN). PyTorch 2.10+ introduces load-time fragmentation; this patch sets max_split_size_mb=20 during model load, restores on exit. Estimated win: 200-500 MiB on H100 (per #41268 author); unverified on Ampere — measure before relying on it.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN23 — DFlash combine_hidden_states dtype cast (vllm#40334) | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN23 DFlash combine_hidden_states dtype cast (vllm#40334 backport) — opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN21 — DFlash SWA support partial backport (vllm#40898) | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN21 DFlash SWA support partial backport (vllm#40898 backport) — opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN22 — Local argmax for TP draft (vllm#39419 backport) | opt-in only — set GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN22 Local argmax for TP draft (vllm#39419 backport) — opt-in only — set GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN24 — DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN24 DFlash aux layer +1 indexing fix (vllm#40727 backport) — opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN17 FA2 softmax_lse runtime clamp (Issue #11 Cliff 1 mechanism A) — GENESIS_ENABLE_PN17_FA2_LSE_CLAMP not set; default OFF. Enable on long-text-no-vision configs to close Cliff 1 mechanism A (FA2 softmax_lse over-allocation at long ctx). Diagnosis credit: noonghunna, Genesis Issue #11.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN16 — Lazy-reasoner request hook (per-request enable_thinking) | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN16 Lazy-reasoner request hook (per-request enable_thinking) — opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P86 — ngram batch_propose O(N*K) → O(N+K) direct-fill (vllm#40876) | opt-in only — set GENESIS_ENABLE_P86=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P86 ngram batch_propose O(N+K) direct-fill (vllm#40876 backport) — opt-in only — set GENESIS_ENABLE_P86=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P85 — Hybrid fine-shadow prefix cache (vllm#38182 followup, MambaManager fix) | opt-in only — set GENESIS_ENABLE_P85=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P85 Hybrid fine-shadow prefix cache (MambaManager fix for vllm#38182 followup) — opt-in only — set GENESIS_ENABLE_P85=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P75 — Auto-enable Suffix Decoding (Arctic Inference, vllm#25784) | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P75 Auto-enable Suffix Decoding (vllm#25784 Arctic Inference) — opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P74 — Auto chunk-clamp via long_prefill_token_threshold (P72 companion) | opt-in only — set GENESIS_ENABLE_P74_CHUNK_CLAMP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P74 Auto chunk-clamp via long_prefill_token_threshold (P72 companion) — opt-in only — set GENESIS_ENABLE_P74_CHUNK_CLAMP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P72 — profile_run M cap (unblocks --max-num-batched-tokens>4096 on MoE) | opt-in only — set GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P72 profile_run M cap (unblocks --max-num-batched-tokens>4096 on MoE) — opt-in only — set GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 to engage
[INFO:genesis.wiring.p67b_spec_verify_routing] [Genesis P67b B2] baked env at module load: USE_UPSTREAM=True NUM_KV_SPLITS=(default = self.max_num_kv_splits) (no per-dispatch env reads)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P67b — TurboQuant spec-verify forward() routing (FULL CG enable) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P67b turboquant_attn.py forward() spec-verify routing] applied 1 sub-patches: p67b_forward_spec_verify_branch
[INFO:genesis.apply_all] [Genesis] applied: P67b TurboQuant spec-verify forward() routing — P67b forward() spec-verify routing injected. K+1 batches now bypass _prefill_attention entirely (cudagraph-safe).
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P59 — Qwen3 reasoning embedded tool_call recovery | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P59 Qwen3 reasoning embedded tool_call recovery — opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P58 — Async-scheduler -1 placeholder fix | opt-in only — set GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P58 async-scheduler -1 placeholder fix — opt-in only — set GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P57 — TQ spec-decode capture-safe buffers (deprecated — research artifact) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P57_SPEC_DECODE_CAPTURE_SAFE=1 only for diagno
[INFO:genesis.apply_all] [Genesis] skipped: P57 TQ spec-decode capture-safe buffers — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P57_SPEC_DECODE_CAPTURE_SAFE=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P56 — TQ spec-decode safe-path guard (deprecated — superseded by P65) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P56_SPEC_DECODE_GUARD=1 only for diagnostics
[INFO:genesis.apply_all] [Genesis] skipped: P56 TQ spec-decode safe-path guard — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P56_SPEC_DECODE_GUARD=1 only for diagnostics
[INFO:genesis.wiring.text_patch] [P44 TQ mixed-batch attn_out pool] applied 1 sub-patches: p44_mixed_attn_out_alloc
[INFO:genesis.apply_all] [Genesis] applied: P44 TQ mixed-batch attn_out pool — text-patch applied — mixed-batch attn_out routed through TurboQuantBufferManager pool (~80 MB zero-init eliminated per mixed-batch forward in multi-user serving)
[INFO:genesis.wiring.text_patch] [P46 GDN gating buffer pool] applied 2 sub-patches: p46_g_buffer, p46_beta_buffer
[INFO:genesis.apply_all] [Genesis] applied: P46 GDN gating buffer pool — text-patch applied — fused_gdn_gating now uses GdnGatingBufferManager pool (eliminates ~24k allocs/sec on Qwen3.6-35B-A3B decode)
[INFO:genesis.apply_all] [Genesis] skipped: P7b GDN dual-stream via torch.library.custom_op (opt-in) — opt-in: set GENESIS_ENABLE_P7B=1 to enable GDN dual-stream via torch.library.custom_op (graph-safe alternative to P7; validates numeric equiv + compile-cache freshness before prod)
[INFO:genesis.apply_all] [Genesis] skipped: P40 TurboQuant GQA-grouped decode stage1 (opt-in) — opt-in: set GENESIS_ENABLE_P40=1 to enable GQA-grouped decode stage1 (port of upstream PR #40792)
[INFO:genesis.wiring.p39a_fla_kkt] [Genesis P39b] vllm_config fetch failed (Current vLLM config is not set. This typically means get_current_vllm_config() was called outside of a set_current_vllm_config() context, or a CustomOp was instantiated at module import time or model forward time when config is not set. For tests that directly test custom ops/modules, use the 'default_vllm_config' pytest fixture from tests/conftest.py.); defaults used
[INFO:genesis.wiring.p39a_fla_kkt] [Genesis P39a] rebound vllm.model_executor.layers.fla.ops.chunk_scaled_dot_kkt.chunk_scaled_dot_kkt_fwd (+1 caller mods: ['vllm.model_executor.layers.fla.ops.chunk'])
[INFO:genesis.apply_all] [Genesis] applied: P39a FLA chunk_scaled_dot_kkt persistent A pool — module-level fn replaced (1 caller module(s) also rebound — pool shared across GDN layers)
[INFO:genesis.wiring.p38_tq_continuation_memory] [Genesis P38] rebound TurboQuantAttentionImpl._continuation_prefill (persistent K_full/V_full buffers replace torch.cat peak)
[INFO:genesis.apply_all] [Genesis] applied: P38 TQ _continuation_prefill persistent workspace — class method replaced (persistent K_full/V_full workspace, no .contiguous()/torch.cat transient peaks)
[INFO:genesis.apply_all] [Genesis] skipped: P37 MoE intermediate cache pool (opt-in) — opt-in only — set GENESIS_ENABLE_P37=1 to engage. Manager API is registered and usable independently of this text-patch.
[INFO:genesis.apply_all] [Genesis] skipped: P36 TurboQuant shared decode buffers — redundant: upstream PR #40798 (workspace manager) active — skip
[INFO:genesis.apply_all] [Genesis] applied: P32/P33 TurboQuant cu_2 + synth_seq_lens preallocs — cu_2 + synth_seq_lens preallocs registered (invoked from ensure_turboquant_buffers, fires during profile_run)
[INFO:genesis.prealloc_budget] [Genesis P73] token budget resolved → 4096 (default fallback). Set GENESIS_PREALLOC_TOKEN_BUDGET to override.
[INFO:genesis.wiring.text_patch] [P28 GDN core_attn_out prealloc] applied 1 sub-patches: p28_core_attn_out_alloc
[INFO:genesis.wiring.p28_gdn_core_attn] [Genesis P28] wrapped GatedDeltaNetAttention.__init__ to attach _genesis_gdn_core_attn_buf on each instance
[INFO:genesis.apply_all] [Genesis] applied: P28 GDN core_attn_out prealloc — forward_cuda patched + __init__ wrapped
[INFO:genesis.gdn_dual_stream] [GDN dual-stream] CUDA aux stream initialized
[INFO:genesis.apply_all] [Genesis P7] dispatcher ready (parallel path)
[INFO:genesis.apply_all] [Genesis] skipped: P7 GDN dual-stream in_proj parallelism — deferred — incompatible with torch.compile fullgraph (CUDA streams not SymPy-graphable); custom op implementation required. Re-enable with GENESIS_ENABLE_P7=1 + --enforce-eager.
[INFO:genesis.apply_all] [Genesis] skipped: P17/P18 Marlin MoE per-SM tuning — no tuning entry for SM (12, 0) — upstream heuristic will be used
[INFO:genesis.wiring.text_patch] [P24 fused_moe num_warps/num_stages overlay] applied 2 sub-patches: p24_fp8_cfg_overlay, p24_general_cfg_overlay
[INFO:genesis.apply_all] [Genesis] applied: P24 fused_moe num_warps/num_stages overlay — num_warps / num_stages overlay wired into get_default_config (active only on Triton fused_moe path; Marlin unaffected)
[INFO:genesis.wiring.p14_block_table] [Genesis P14] rebound BlockTable.append_row + move_row (tail-zero-fill active for concurrent-request safety)
[INFO:genesis.apply_all] [Genesis] applied: P14 block_table tail zero-fill — BlockTable.append_row + move_row wrapped (effective in this process)
[INFO:genesis.tq_decode_tune] [Genesis P18b] TQ decode stage1 using upstream defaults (BLOCK_KV=4 num_warps=1 num_stages=1)
[INFO:genesis.apply_all] [Genesis] applied: P18b TurboQuant decode stage1 tune — no env override — using upstream defaults (4/1/1)
[INFO:genesis.apply_all] [Genesis P20] TQ _continuation_prefill FP16 helpers ready for TurboQuantAttentionImpl hook
[INFO:genesis.apply_all] [Genesis] applied: P20 TurboQuant continuation-prefill FP16 rotate — fp16-rotation helper ready for _continuation_prefill hook
[INFO:genesis.fp8_dispatcher] [Genesis FP8 dispatcher] SM (12, 0) → Native Triton FP8 (no override)
[INFO:genesis.apply_all] [Genesis] applied: P1/P2 FP8 kernel dispatcher — SM=(12, 0) → native Triton FP8 path selected
[INFO:genesis.apply_all] Genesis Results: 27 applied, 72 skipped, 0 failed, 2 ⚠️ partial-apply warning(s)
[WARNING:genesis.apply_all] [Genesis] 2 partial-apply warning(s) — patch(es) failed to match expected source pattern. Review below to confirm anchor drift vs upstream change vs config issue:
[WARNING:genesis.apply_all] [Genesis] ⚠️  P103 FLA Cliff 2 chunked fwd_h+fwd_o orchestrator — GENESIS_ENABLE_P103 not set or platform not NVIDIA SM 8.0+
[WARNING:genesis.apply_all] [Genesis] ⚠️  P17/P18 Marlin MoE per-SM tuning — no tuning entry for SM (12, 0) — upstream heuristic will be used
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] apply matrix:
Patch | Status | Title                                         | Reason                                                       | Credit                        
------+--------+-----------------------------------------------+--------------------------------------------------------------+-------------------------------
P61b  | SKIP   | Qwen3 streaming partial-tag overlap guard     | opt-in only — set GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1 to | ExtReMLapin (vllm#40783)      
P62   | SKIP   | Structured-output spec-decode reasoning-end t | opt-in only — set GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING= | sfbemerk (vllm#36138), ciciror
P61   | SKIP   | Qwen3 multi-tool first-occurrence (DEPRECATED | opt-in only AND empirically deprecated — keeping skip; set G | ExtReMLapin (vllm#40783) — P61
P60b  | SKIP   | GDN+ngram Triton kernel offset (Phase 2)      | opt-in only — set GENESIS_ENABLE_P60B_TRITON_KERNEL=1 to eng | tdoublep (vllm#40738)         
P60   | SKIP   | GDN+ngram state recovery (Phase 1: SSM pre-co | opt-in only — set GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1 to enga | tdoublep (vllm#40738), bhaktat
P63   | SKIP   | MTP/Eagle drafter GDN state recovery (depreca | opt-in only AND empirically deprecated — keeping skip; set G | Genesis-original (hypothesis d
P64   | APPLY  | qwen3coder MTP streaming early-return fix     | opt-in env (config: neutral)                                 | kotori-yan (vllm#39598)       
P65   | SKIP   | TurboQuant spec-decode cudagraph downgrade    | opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWN | Genesis-original (root cause f
P66   | APPLY  | cudagraph_capture_sizes spec-decode divisibil | opt-in env (config: neutral)                                 | Genesis-original (mirrors fhl2
P68   | SKIP   | Auto force tool_choice=required for long-cont | opt-in only — set GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to en | Genesis-original (long-ctx too
P69   | SKIP   | Long-context tool-format reminder injection   | opt-in only — set GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER= | Genesis-original (long-ctx too
P70   | SKIP   | Auto-strict-ngram (force prompt_lookup_min>=8 | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to  | Genesis-original (vllm#40875 e
P67   | APPLY  | TurboQuant multi-query kernel for spec-decode | opt-in env (config: neutral)                                 | Genesis-original (proper fix f
P71   | SKIP   | Block-verify rejection sampler (Sun 2024 ICLR | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engag | Backport of vllm#40819 (Z. Gol
P78   | SKIP   | TurboQuant .tolist() capture-guard (adapted f | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1  | Adapted from noonghunna's patc
P77   | SKIP   | Adaptive ngram K controller (EMA + hysteresis | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to e | Genesis-original (port of SGLa
P79b  | SKIP   | Async × spec-decode proposer-sync backport (v | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1  | Backport of vllm#40610 (OPEN d
P79c  | SKIP   | Stale spec_token_ids cleanup for unscheduled  | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEAN | Backport of vllm#37629 (OPEN, 
P81   | SKIP   | fp8 block-scaled MM low-M decode tuning (vllm | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8 | Backport of vllm#40925 (tonyli
P82   | SKIP   | SGLang threshold_single OR-clause acceptance  | opt-in only — set GENESIS_ENABLE_P82=1 to engage             | SGLang team (sgl-project/sglan
P83   | SKIP   | MTP keep-last-cached-block (vllm#38182 downst | opt-in only — set GENESIS_ENABLE_P83=1 to engage             | Root-cause analysis: vllm#3818
P84   | SKIP   | hash_block_size override (vllm#38182 actual r | opt-in only — set GENESIS_ENABLE_P84=1 to engage             | Genesis-original discovery 202
P100  | SKIP   | FlashInfer FULL CUDA graph for spec-decode (v | opt-in only — set GENESIS_ENABLE_P100=1 to engage            | Backport of vllm#41127 (open 2
P101  | SKIP   | TQ continuation 64-token slicing (vllm#41123  | opt-in only — set GENESIS_ENABLE_P101=1 to engage            | Selective backport of vllm#411
P99   | SKIP   | WorkspaceManager.get_simultaneous memoization | opt-in only — set GENESIS_ENABLE_P99=1 to engage             | Per Sander 2026-04-28: 'if rev
P98   | SKIP   | TQ WorkspaceManager revert (vllm#40941 perf h | opt-in only — set GENESIS_ENABLE_P98=1 to engage             | Reverts upstream PR #40941 (ME
P94   | SKIP   | Spec-decode prepare_next_token_ids_padded zer | opt-in only — set GENESIS_ENABLE_P94=1 to engage             | Backport of vllm#41043 (wanglu
P95   | SKIP   | Marlin TP cudagraph cap on Ampere (vllm#40385 | opt-in only — set GENESIS_ENABLE_P95=1 to engage             | Backport of vllm#40385 (OPEN a
P91   | SKIP   | AutoRound row-parallel group cdiv + start-idx | opt-in only — set GENESIS_ENABLE_P91=1 to engage             | Backport of non-MoE-specific p
P87   | SKIP   | Marlin W4A16/W8A16 sub-tile output dim pad-on | opt-in only — set GENESIS_ENABLE_P87=1 to engage             | Backport of vllm#40361 (OPEN).
PN8   | SKIP   | MTP/draft online-quant propagation (vllm#4084 | opt-in only — set GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT= | Backport of vllm#40849 (bhoomi
PN9   | SKIP   | Independent drafter attention backend (vllm#3 | opt-in only — set GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATT | Backport of vllm#39930 (Matthe
PN11  | SKIP   | GDN a/b contiguity in fix_query_key_value_ord | opt-in only — set GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 to | Backport of vllm#41142 (Yeuvoi
PN34  | SKIP   | WorkspaceManager runtime lock relaxation (PN3 | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 | Companion to PN33 — same root 
PN33  | APPLY  | Spec-decode warmup K-aware sizing (vllm#37521 | config_detect: neutral:                                      | Backport of vllm-project/vllm#
PN32  | SKIP   | GDN _forward_core chunked-prefill v2 (Cliff 2 | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1  | Genesis-original v7.69 v2 (202
PN31  | SKIP   | FA varlen persistent out buffer (issue #15, s | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_O | Genesis-original sister patch 
PN30  | SKIP   | DS conv state layout + spec-decode AL>1 fix ( | opt-in only — set GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE= | Genesis-original fix for issue
PN29  | SKIP   | GDN chunk_o scale-fold (vllm#41446 pattern (c | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to en | Backport of vllm#41446 (zobinH
PN12  | SKIP   | FFN intermediate scratch pool — Cliff 1 fix o | opt-in only — set GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL= | Genesis-original 2026-04-29 — 
PN28  | SKIP   | merge_attn_states NaN guard (vllm#39148 backp | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 | Backport of vllm#39148 (jasonk
P15B  | SKIP   | FA varlen max_seqlen_k clamp on TQ path (Issu | opt-in only — set GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 to e | Genesis-original 2026-05-01 fi
P38B  | SKIP   | P38 compile-safe in-source hook (Issue #14 fi | opt-in only — set GENESIS_ENABLE_P38B_COMPILE_SAFE=1 to enga | Genesis-original 2026-05-01 fi
PN27  | SKIP   | Revert MoERunnerInterface PluggableLayer (vll | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 | Backport of vllm#41440 (auto-g
PN26  | SKIP   | TQ unified perf pack (centroids prebake + spa | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage | Genesis-original 2026-05-01 un
PN25  | SKIP   | SiluAndMul.forward_native opaque-op pool (Cli | opt-in only — set GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 t | Genesis-original 2026-05-01 in
PN13  | SKIP   | CUDAGraphWrapper gc.collect/empty_cache lambd | opt-in only — set GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARIT | Backport of vllm#41235 (roikor
PN14  | SKIP   | TQ decode IOOB safe_page_idx clamp (vllm#4007 | opt-in only — set GENESIS_ENABLE_PN14_TQ_DECODE_OOB_CLAMP=1  | Backport of vllm#40074 (devara
PN23  | SKIP   | DFlash combine_hidden_states dtype cast (vllm | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to  | Backport of vllm#40334 (cipher
PN21  | SKIP   | DFlash SWA support partial backport (vllm#408 | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage | Partial backport of vllm#40898
PN22  | SKIP   | Local argmax for TP draft (vllm#39419 backpor | opt-in only — set GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 to e | Backport of vllm#39419 (EanWan
PN24  | SKIP   | DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 | Backport of vllm#40727 (benchi
PN16  | SKIP   | Lazy-reasoner request hook (per-request enabl | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to eng | Genesis-original 2026-04-29. H
P86   | SKIP   | ngram batch_propose O(N*K) → O(N+K) direct-fi | opt-in only — set GENESIS_ENABLE_P86=1 to engage             | Backport of vllm#40876 (aarona
P85   | SKIP   | Hybrid fine-shadow prefix cache (vllm#38182 f | opt-in only — set GENESIS_ENABLE_P85=1 to engage             | Genesis-original 2026-04-27 — 
P75   | SKIP   | Auto-enable Suffix Decoding (Arctic Inference | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to en | Backport-enabler of vllm#25784
P74   | SKIP   | Auto chunk-clamp via long_prefill_token_thres | opt-in only — set GENESIS_ENABLE_P74_CHUNK_CLAMP=1 to engage | Genesis-original (zero-VRAM-co
P72   | SKIP   | profile_run M cap (unblocks --max-num-batched | opt-in only — set GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 to en | Genesis-original (Dynamo fake-
P67b  | APPLY  | TurboQuant spec-verify forward() routing (FUL | opt-in env (config: neutral)                                 | Genesis-original (FULL CG enab
P59   | SKIP   | Qwen3 reasoning embedded tool_call recovery   | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 t | ZenoAFfectionate (vllm#39055) 
P58   | SKIP   | Async-scheduler -1 placeholder fix            | opt-in only — set GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1 | z1ying (vllm#40768)           
P57   | SKIP   | TQ spec-decode capture-safe buffers (deprecat | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40831), gdn_attn.
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated —  | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40807, #40831)   
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] validator: clean (no issues)
[INFO:genesis.apply_all] [Genesis compile-watchdog] apply_all elapsed: 6.3s
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] apply matrix:
Patch | Status | Title                                         | Reason                                                       | Credit                        
------+--------+-----------------------------------------------+--------------------------------------------------------------+-------------------------------
P61b  | SKIP   | Qwen3 streaming partial-tag overlap guard     | opt-in only — set GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1 to | ExtReMLapin (vllm#40783)      
P62   | SKIP   | Structured-output spec-decode reasoning-end t | opt-in only — set GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING= | sfbemerk (vllm#36138), ciciror
P61   | SKIP   | Qwen3 multi-tool first-occurrence (DEPRECATED | opt-in only AND empirically deprecated — keeping skip; set G | ExtReMLapin (vllm#40783) — P61
P60b  | SKIP   | GDN+ngram Triton kernel offset (Phase 2)      | opt-in only — set GENESIS_ENABLE_P60B_TRITON_KERNEL=1 to eng | tdoublep (vllm#40738)         
P60   | SKIP   | GDN+ngram state recovery (Phase 1: SSM pre-co | opt-in only — set GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1 to enga | tdoublep (vllm#40738), bhaktat
P63   | SKIP   | MTP/Eagle drafter GDN state recovery (depreca | opt-in only AND empirically deprecated — keeping skip; set G | Genesis-original (hypothesis d
P64   | APPLY  | qwen3coder MTP streaming early-return fix     | opt-in env (config: neutral)                                 | kotori-yan (vllm#39598)       
P65   | SKIP   | TurboQuant spec-decode cudagraph downgrade    | opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWN | Genesis-original (root cause f
P66   | APPLY  | cudagraph_capture_sizes spec-decode divisibil | opt-in env (config: neutral)                                 | Genesis-original (mirrors fhl2
P68   | SKIP   | Auto force tool_choice=required for long-cont | opt-in only — set GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to en | Genesis-original (long-ctx too
P69   | SKIP   | Long-context tool-format reminder injection   | opt-in only — set GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER= | Genesis-original (long-ctx too
P70   | SKIP   | Auto-strict-ngram (force prompt_lookup_min>=8 | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to  | Genesis-original (vllm#40875 e
P67   | APPLY  | TurboQuant multi-query kernel for spec-decode | opt-in env (config: neutral)                                 | Genesis-original (proper fix f
P71   | SKIP   | Block-verify rejection sampler (Sun 2024 ICLR | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engag | Backport of vllm#40819 (Z. Gol
P78   | SKIP   | TurboQuant .tolist() capture-guard (adapted f | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1  | Adapted from noonghunna's patc
P77   | SKIP   | Adaptive ngram K controller (EMA + hysteresis | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to e | Genesis-original (port of SGLa
P79b  | SKIP   | Async × spec-decode proposer-sync backport (v | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1  | Backport of vllm#40610 (OPEN d
P79c  | SKIP   | Stale spec_token_ids cleanup for unscheduled  | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEAN | Backport of vllm#37629 (OPEN, 
P81   | SKIP   | fp8 block-scaled MM low-M decode tuning (vllm | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8 | Backport of vllm#40925 (tonyli
P82   | SKIP   | SGLang threshold_single OR-clause acceptance  | opt-in only — set GENESIS_ENABLE_P82=1 to engage             | SGLang team (sgl-project/sglan
P83   | SKIP   | MTP keep-last-cached-block (vllm#38182 downst | opt-in only — set GENESIS_ENABLE_P83=1 to engage             | Root-cause analysis: vllm#3818
P84   | SKIP   | hash_block_size override (vllm#38182 actual r | opt-in only — set GENESIS_ENABLE_P84=1 to engage             | Genesis-original discovery 202
P100  | SKIP   | FlashInfer FULL CUDA graph for spec-decode (v | opt-in only — set GENESIS_ENABLE_P100=1 to engage            | Backport of vllm#41127 (open 2
P101  | SKIP   | TQ continuation 64-token slicing (vllm#41123  | opt-in only — set GENESIS_ENABLE_P101=1 to engage            | Selective backport of vllm#411
P99   | SKIP   | WorkspaceManager.get_simultaneous memoization | opt-in only — set GENESIS_ENABLE_P99=1 to engage             | Per Sander 2026-04-28: 'if rev
P98   | SKIP   | TQ WorkspaceManager revert (vllm#40941 perf h | opt-in only — set GENESIS_ENABLE_P98=1 to engage             | Reverts upstream PR #40941 (ME
P94   | SKIP   | Spec-decode prepare_next_token_ids_padded zer | opt-in only — set GENESIS_ENABLE_P94=1 to engage             | Backport of vllm#41043 (wanglu
P95   | SKIP   | Marlin TP cudagraph cap on Ampere (vllm#40385 | opt-in only — set GENESIS_ENABLE_P95=1 to engage             | Backport of vllm#40385 (OPEN a
P91   | SKIP   | AutoRound row-parallel group cdiv + start-idx | opt-in only — set GENESIS_ENABLE_P91=1 to engage             | Backport of non-MoE-specific p
P87   | SKIP   | Marlin W4A16/W8A16 sub-tile output dim pad-on | opt-in only — set GENESIS_ENABLE_P87=1 to engage             | Backport of vllm#40361 (OPEN).
PN8   | SKIP   | MTP/draft online-quant propagation (vllm#4084 | opt-in only — set GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT= | Backport of vllm#40849 (bhoomi
PN9   | SKIP   | Independent drafter attention backend (vllm#3 | opt-in only — set GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATT | Backport of vllm#39930 (Matthe
PN11  | SKIP   | GDN a/b contiguity in fix_query_key_value_ord | opt-in only — set GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 to | Backport of vllm#41142 (Yeuvoi
PN34  | SKIP   | WorkspaceManager runtime lock relaxation (PN3 | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 | Companion to PN33 — same root 
PN33  | APPLY  | Spec-decode warmup K-aware sizing (vllm#37521 | config_detect: neutral:                                      | Backport of vllm-project/vllm#
PN32  | SKIP   | GDN _forward_core chunked-prefill v2 (Cliff 2 | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1  | Genesis-original v7.69 v2 (202
PN31  | SKIP   | FA varlen persistent out buffer (issue #15, s | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_O | Genesis-original sister patch 
PN30  | SKIP   | DS conv state layout + spec-decode AL>1 fix ( | opt-in only — set GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE= | Genesis-original fix for issue
PN29  | SKIP   | GDN chunk_o scale-fold (vllm#41446 pattern (c | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to en | Backport of vllm#41446 (zobinH
PN12  | SKIP   | FFN intermediate scratch pool — Cliff 1 fix o | opt-in only — set GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL= | Genesis-original 2026-04-29 — 
PN28  | SKIP   | merge_attn_states NaN guard (vllm#39148 backp | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 | Backport of vllm#39148 (jasonk
P15B  | SKIP   | FA varlen max_seqlen_k clamp on TQ path (Issu | opt-in only — set GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 to e | Genesis-original 2026-05-01 fi
P38B  | SKIP   | P38 compile-safe in-source hook (Issue #14 fi | opt-in only — set GENESIS_ENABLE_P38B_COMPILE_SAFE=1 to enga | Genesis-original 2026-05-01 fi
PN27  | SKIP   | Revert MoERunnerInterface PluggableLayer (vll | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 | Backport of vllm#41440 (auto-g
PN26  | SKIP   | TQ unified perf pack (centroids prebake + spa | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage | Genesis-original 2026-05-01 un
PN25  | SKIP   | SiluAndMul.forward_native opaque-op pool (Cli | opt-in only — set GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 t | Genesis-original 2026-05-01 in
PN13  | SKIP   | CUDAGraphWrapper gc.collect/empty_cache lambd | opt-in only — set GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARIT | Backport of vllm#41235 (roikor
PN14  | SKIP   | TQ decode IOOB safe_page_idx clamp (vllm#4007 | opt-in only — set GENESIS_ENABLE_PN14_TQ_DECODE_OOB_CLAMP=1  | Backport of vllm#40074 (devara
PN23  | SKIP   | DFlash combine_hidden_states dtype cast (vllm | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to  | Backport of vllm#40334 (cipher
PN21  | SKIP   | DFlash SWA support partial backport (vllm#408 | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage | Partial backport of vllm#40898
PN22  | SKIP   | Local argmax for TP draft (vllm#39419 backpor | opt-in only — set GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 to e | Backport of vllm#39419 (EanWan
PN24  | SKIP   | DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 | Backport of vllm#40727 (benchi
PN16  | SKIP   | Lazy-reasoner request hook (per-request enabl | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to eng | Genesis-original 2026-04-29. H
P86   | SKIP   | ngram batch_propose O(N*K) → O(N+K) direct-fi | opt-in only — set GENESIS_ENABLE_P86=1 to engage             | Backport of vllm#40876 (aarona
P85   | SKIP   | Hybrid fine-shadow prefix cache (vllm#38182 f | opt-in only — set GENESIS_ENABLE_P85=1 to engage             | Genesis-original 2026-04-27 — 
P75   | SKIP   | Auto-enable Suffix Decoding (Arctic Inference | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to en | Backport-enabler of vllm#25784
P74   | SKIP   | Auto chunk-clamp via long_prefill_token_thres | opt-in only — set GENESIS_ENABLE_P74_CHUNK_CLAMP=1 to engage | Genesis-original (zero-VRAM-co
P72   | SKIP   | profile_run M cap (unblocks --max-num-batched | opt-in only — set GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 to en | Genesis-original (Dynamo fake-
P67b  | APPLY  | TurboQuant spec-verify forward() routing (FUL | opt-in env (config: neutral)                                 | Genesis-original (FULL CG enab
P59   | SKIP   | Qwen3 reasoning embedded tool_call recovery   | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 t | ZenoAFfectionate (vllm#39055) 
P58   | SKIP   | Async-scheduler -1 placeholder fix            | opt-in only — set GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1 | z1ying (vllm#40768)           
P57   | SKIP   | TQ spec-decode capture-safe buffers (deprecat | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40831), gdn_attn.
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated —  | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40807, #40831)   
[workspace_lock_disable] target: /usr/local/lib/python3.12/dist-packages/vllm/v1/worker/workspace.py
[workspace_lock_disable] applied (lock-violation now logs WARNING, allocates anyway)
[tolist_cudagraph_fix] target: /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/turboquant_attn.py
[tolist_cudagraph_fix] Site B (_prefill_attention): applied
[tolist_cudagraph_fix] Site A (forward mixed-batch): applied
[tolist_cudagraph_fix] Patched /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/turboquant_attn.py. Site A=applied, Site B=applied
WARNING 05-04 20:13:54 [argparse_utils.py:257] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in a future version.
(APIServer pid=1) INFO 05-04 20:13:54 [utils.py:299] 
(APIServer pid=1) INFO 05-04 20:13:54 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 05-04 20:13:54 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.1rc1.dev16+g7a1eb8ac2
(APIServer pid=1) INFO 05-04 20:13:54 [utils.py:299]   █▄█▀ █     █     █     █  model   /root/.cache/huggingface/qwen3.6-27b-nvfp4
(APIServer pid=1) INFO 05-04 20:13:54 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 05-04 20:13:54 [utils.py:299] 
(APIServer pid=1) INFO 05-04 20:13:54 [utils.py:233] non-default args: {'model_tag': '/root/.cache/huggingface/qwen3.6-27b-nvfp4', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'model': '/root/.cache/huggingface/qwen3.6-27b-nvfp4', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 96000, 'quantization': 'compressed-tensors', 'served_model_name': ['qwen3.6-27b'], 'reasoning_parser': 'qwen3', 'kv_cache_dtype': 'fp8_e4m3', 'enable_prefix_caching': True, 'max_num_batched_tokens': 4128, 'max_num_seqs': 1, 'enable_chunked_prefill': True, 'scheduler_reserve_full_isl': False, 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 3}}
(APIServer pid=1) INFO 05-04 20:14:00 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
(APIServer pid=1) INFO 05-04 20:14:00 [nixl_utils.py:32] NIXL is available
(APIServer pid=1) INFO 05-04 20:14:00 [model.py:563] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1) INFO 05-04 20:14:00 [model.py:1692] Using max model len 96000
(APIServer pid=1) INFO 05-04 20:14:00 [cache.py:261] Using fp8_e4m3 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=1) INFO 05-04 20:14:06 [model.py:563] Resolved architecture: Qwen3_5MTP
(APIServer pid=1) INFO 05-04 20:14:06 [model.py:1692] Using max model len 262144
(APIServer pid=1) WARNING 05-04 20:14:06 [speculative.py:659] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1) INFO 05-04 20:14:06 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=4128.
(APIServer pid=1) WARNING 05-04 20:14:06 [config.py:367] Mamba cache mode is set to 'align' for Qwen3_5ForConditionalGeneration by default when prefix caching is enabled
(APIServer pid=1) INFO 05-04 20:14:06 [config.py:387] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=1) INFO 05-04 20:14:06 [vllm.py:841] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 05-04 20:14:06 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1) WARNING 05-04 20:14:06 [vllm.py:1403] max_num_scheduled_tokens is set to 4128 based on the speculative decoding settings. This may lead to suboptimal performance. Consider increasing max_num_batched_tokens to accommodate the additional draft token slots, or decrease num_speculative_tokens or max_num_seqs.
(APIServer pid=1) INFO 05-04 20:14:06 [vllm.py:1563] [Genesis P66] Filtered cudagraph_capture_sizes for spec-decode uniform_query_len=4: removed 2 non-divisible sizes [1, 2]; kept [4, 8]. Prevents mixed-q_len capture (vllm#28015 mechanism).
(APIServer pid=1) INFO 05-04 20:14:07 [compilation.py:303] Enabled custom fusions: act_quant
(APIServer pid=1) [transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(APIServer pid=1) [transformers] The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
INFO 05-04 20:14:16 [nixl_utils.py:32] NIXL is available
(EngineCore pid=84) INFO 05-04 20:14:16 [core.py:109] Initializing a V1 LLM engine (v0.20.1rc1.dev16+g7a1eb8ac2) with config: model='/root/.cache/huggingface/qwen3.6-27b-nvfp4', speculative_config=SpeculativeConfig(method='mtp', model='/root/.cache/huggingface/qwen3.6-27b-nvfp4', num_spec_tokens=3), tokenizer='/root/.cache/huggingface/qwen3.6-27b-nvfp4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=96000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen3.6-27b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [4128], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [4, 8], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 8, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=84) [transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(EngineCore pid=84) INFO 05-04 20:14:18 [parallel_state.py:1402] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.1.96.56:54055 backend=nccl
(EngineCore pid=84) INFO 05-04 20:14:18 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=84) INFO 05-04 20:14:18 [topk_topp_sampler.py:51] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=84) WARNING 05-04 20:14:18 [__init__.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(EngineCore pid=84) [transformers] The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
(EngineCore pid=84) INFO 05-04 20:14:22 [gpu_model_runner.py:4778] Starting to load model /root/.cache/huggingface/qwen3.6-27b-nvfp4...
(EngineCore pid=84) INFO 05-04 20:14:22 [cuda.py:423] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=84) INFO 05-04 20:14:22 [__init__.py:683] Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
(EngineCore pid=84) INFO 05-04 20:14:22 [mm_encoder_attention.py:372] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=84) INFO 05-04 20:14:22 [gdn_linear_attn.py:155] Using Triton/FLA GDN prefill kernel
(EngineCore pid=84) INFO 05-04 20:14:22 [cuda.py:368] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN'].
(EngineCore pid=84) INFO 05-04 20:14:23 [weight_utils.py:904] Filesystem type for checkpoints: ZFS. Checkpoint size: 26.59 GiB. Available RAM: 46.56 GiB.
(EngineCore pid=84) INFO 05-04 20:14:23 [weight_utils.py:927] Auto-prefetch is disabled because the filesystem (ZFS) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards:   0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/15 [00:00<00:02,  4.99it/s]
Loading safetensors checkpoint shards:  13% Completed | 2/15 [00:00<00:02,  4.81it/s]
Loading safetensors checkpoint shards:  20% Completed | 3/15 [00:00<00:02,  4.76it/s]
Loading safetensors checkpoint shards:  27% Completed | 4/15 [00:00<00:02,  4.70it/s]
Loading safetensors checkpoint shards:  33% Completed | 5/15 [00:01<00:02,  4.66it/s]
Loading safetensors checkpoint shards:  40% Completed | 6/15 [00:01<00:01,  4.68it/s]
Loading safetensors checkpoint shards:  47% Completed | 7/15 [00:01<00:01,  4.70it/s]
Loading safetensors checkpoint shards:  53% Completed | 8/15 [00:01<00:01,  4.71it/s]
Loading safetensors checkpoint shards:  60% Completed | 9/15 [00:01<00:01,  4.71it/s]
Loading safetensors checkpoint shards:  67% Completed | 10/15 [00:02<00:01,  4.71it/s]
(EngineCore pid=84) Process EngineCore:
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136] EngineCore failed to start.
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136] Traceback (most recent call last):
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1110, in run_engine_core
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 876, in __init__
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]     super().__init__(
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 118, in __init__
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 109, in __init__
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]     self._init_executor()
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 52, in _init_executor
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]     self.driver_worker.load_model()
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]     self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4794, in load_model
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]     self.model = model_loader.load_model(
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 64, in load_model
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]     self.load_weights(model, model_config)
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 381, in load_weights
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]     loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 709, in load_weights
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]     return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/reload/torchao_decorator.py", line 50, in patched_model_load_weights
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]     return original_load_weights(self, weights, *args, **kwargs)
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 355, in load_weights
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 302, in _load_module
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]     yield from self._load_module(
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 275, in _load_module
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]     loaded_params = module_load_weights(weights)
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_vl.py", line 856, in load_weights
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]     param = params_dict[name]
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136]             ~~~~~~~~~~~^^^^^^
(EngineCore pid=84) ERROR 05-04 20:14:25 [core.py:1136] KeyError: 'blocks.0.attn.proj.weight'
(EngineCore pid=84) Traceback (most recent call last):
(EngineCore pid=84)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=84)     self.run()
(EngineCore pid=84)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=84)     self._target(*self._args, **self._kwargs)
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1140, in run_engine_core
(EngineCore pid=84)     raise e
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1110, in run_engine_core
(EngineCore pid=84)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=84)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=84)     return func(*args, **kwargs)
(EngineCore pid=84)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 876, in __init__
(EngineCore pid=84)     super().__init__(
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 118, in __init__
(EngineCore pid=84)     self.model_executor = executor_class(vllm_config)
(EngineCore pid=84)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=84)     return func(*args, **kwargs)
(EngineCore pid=84)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 109, in __init__
(EngineCore pid=84)     self._init_executor()
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 52, in _init_executor
(EngineCore pid=84)     self.driver_worker.load_model()
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(EngineCore pid=84)     self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=84)     return func(*args, **kwargs)
(EngineCore pid=84)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4794, in load_model
(EngineCore pid=84)     self.model = model_loader.load_model(
(EngineCore pid=84)                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=84)     return func(*args, **kwargs)
(EngineCore pid=84)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 64, in load_model
(EngineCore pid=84)     self.load_weights(model, model_config)
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=84)     return func(*args, **kwargs)
(EngineCore pid=84)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 381, in load_weights
(EngineCore pid=84)     loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(EngineCore pid=84)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 709, in load_weights
(EngineCore pid=84)     return loader.load_weights(weights, mapper=self.hf_to_vllm_mapper)
(EngineCore pid=84)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/reload/torchao_decorator.py", line 50, in patched_model_load_weights
(EngineCore pid=84)     return original_load_weights(self, weights, *args, **kwargs)
(EngineCore pid=84)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 355, in load_weights
(EngineCore pid=84)     autoloaded_weights = set(self._load_module("", self.module, weights))
(EngineCore pid=84)                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 302, in _load_module
(EngineCore pid=84)     yield from self._load_module(
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 275, in _load_module
(EngineCore pid=84)     loaded_params = module_load_weights(weights)
(EngineCore pid=84)                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_vl.py", line 856, in load_weights
(EngineCore pid=84)     param = params_dict[name]
(EngineCore pid=84)             ~~~~~~~~~~~^^^^^^
(EngineCore pid=84) KeyError: 'blocks.0.attn.proj.weight'
Loading safetensors checkpoint shards:  67% Completed | 10/15 [00:02<00:01,  3.94it/s]
(EngineCore pid=84) 
[rank0]:[W504 20:14:26.235269772 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 92, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 678, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 692, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 217, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 146, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 900, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=1)     with launch_core_engines(
(APIServer pid=1)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1119, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1178, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

1 reply

noonghunna May 4, 2026
Maintainer

@apnar — progress! compressed-tensors got you past the quant-method validation error. The new failure is a vision-tower weight-name mismatch in vLLM's qwen3_vl loader:

File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_vl.py", line 856, in load_weights
    param = params_dict[name]
KeyError: 'blocks.0.attn.proj.weight'

blocks.0.attn.proj.weight is a Qwen2VL/Qwen3VL ViT block weight. The kaitchup quant probably packed the language_model layers in NVFP4 + compressed-tensors but renamed or restructured the vision tower weights in a way vLLM's qwen3_vl loader doesn't recognize. Either kaitchup quantized the ViT differently from the LM (different naming convention), or stripped the vision tower entirely without removing the qwen3_vl architecture marker from config.json.

Try `--language-model-only`

Add to your override compose command::

- --language-model-only

This tells vLLM to skip the vision tower entirely. You lose VL capability (no image inputs) but the language model should load. If your workload is text-only, this is the quick path forward.

If --language-model-only boots clean: ship the bench numbers, that's a real Blackwell NVFP4 datapoint and worth a BENCHMARKS row even text-only.

If it still fails: the kaitchup quant has a deeper structural issue with vLLM's loader — probably worth filing on the kaitchup HF repo. We could also try a different NVFP4 variant if any exist for Qwen3.6-27B (not aware of others).

Aside — Genesis NOW knows you have an RTX 5090

Your latest boot log shows:

[Genesis GPU profile] detected: NVIDIA GeForce RTX 5090
  canonical: RTX 5090  cc: (12, 0)  SM: 170  L2: 88 MB  BW: 1792 GB/s  regime: compute

Genesis's GPU-profile detector correctly identifies the 5090 (good — that's how it picks the recommendation matrix). But the platform-data dict still says is_blackwell: false. So Sander has the model in the matrix but the predicate function lags — exactly the gap I filed at genesis-vllm-patches#20. Two-line fix on Sander's side. When that lands you should see is_blackwell: true and any patches that gate on it engage.

Not a blocker for the NVFP4 attempt; just confirms our diagnosis on the upstream filing. Cross-linking the issue from your latest boot for context if you want to add a +1 there: Sander's been responsive on these.

Smaller asks

Your boot log also shows regime: compute — useful classification (Blackwell consumer is compute-bound vs memory-bound for our workload class). And BW: 1792 GB/s (vs 936 on 3090) is the headline gain you should see at decode time when NVFP4 actually runs through native FP4 tensor cores. Worth tracking.

When you do get NVFP4 working (via --language-model-only or another route), the canonical bench delta vs your earlier fp8_e5m2 + 48K result (122/162 TPS) is the most useful datapoint — that tells us whether NVFP4 is the real Blackwell win or whether more vLLM Blackwell-path optimization is needed.

apnar · 2026-05-04T23:20:35Z

apnar
May 4, 2026
Author

I got much further this time the language-model-only. Here was the output:

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
[INFO:genesis.apply_all] Genesis platform: {"vendor": {"is_nvidia_cuda": true, "is_amd_rocm": false, "is_intel_xpu": false, "is_cpu_only": false}, "nvidia": {"compute_capability": [12, 0], "is_ampere_datacenter": false, "is_ampere_consumer": false, "is_ada_lovelace": false, "is_hopper": false, "is_blackwell": false, "has_native_fp8": true}, "amd": {}, "versions": {"torch": [2, 11], "transformers": [5, 6, 2], "vllm": [0, 20, 1], "flash_attn_major": 2}, "paths": {"vllm_install_root": "/usr/local/lib/python3.12/dist-packages/vllm"}}
[INFO:genesis.apply_all] Genesis Unified Patch v7.0 — Ampere FP8 + TQ + MoE + Hybrid + bugfixes. Philosophy: МЫ ЧИНИМ, НЕ ЛОМАЕМ.
[INFO:genesis.apply_all] [Genesis registry] 100 dispatcher entries — schema-clean, dependency graph consistent.
[INFO:genesis.apply_all] ==============================================================================
[INFO:genesis.apply_all] [Genesis GPU profile] detected: NVIDIA GeForce RTX 5090
[INFO:genesis.apply_all]   canonical: RTX 5090  cc: (12, 0)  SM: 170  L2: 88 MB  BW: 1792 GB/s  regime: compute
[INFO:genesis.apply_all]
[INFO:genesis.apply_all] Per-patch recommendations:
[INFO:genesis.apply_all] ------------------------------------------------------------------------------
[INFO:genesis.apply_all]   [REC] P40                TQ k8v4 GQA grouping kernel (vllm#40792)
[INFO:genesis.apply_all]           gain: +5-15% (mixed regime), +15-30% (compute regime)
[INFO:genesis.apply_all]           → recommend: export GENESIS_ENABLE_P40=1
[INFO:genesis.apply_all]           why: Author measured +27% on H100. Empirically NS on A5000 (p=0.28). Cache-locality benefit needs L2 >= 24 MB.
[INFO:genesis.apply_all]   [ON ] P67                Multi-query verify kernel for spec-decode K+1
[INFO:genesis.apply_all]           gain: +25-35%
[INFO:genesis.apply_all]           why: +32% TPS on 35B-A3B-FP8 spec-decode K=3 verify (Genesis internal benchmark, all GPU classes tested).
[INFO:genesis.apply_all]   [REC] P82                SGLang-style acceptance threshold OR-clause
[INFO:genesis.apply_all]           gain: +8-12%
[INFO:genesis.apply_all]           → recommend: export GENESIS_ENABLE_P82=1
[INFO:genesis.apply_all]           why: Cross-rig confirmed: +12% on A5000 FP8, +10.5% on 3090 INT4.
[INFO:genesis.apply_all]   [REC] PN8                MTP/draft online-quant propagation (vllm#40849)
[INFO:genesis.apply_all]           gain: 0% TPS, but ~1-2 GiB total VRAM headroom
[INFO:genesis.apply_all]           → recommend: export GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1
[INFO:genesis.apply_all]           why: Verified ~1 GiB VRAM saved per GPU on 35B-A3B-FP8 + MTP K=3. Use freed VRAM for higher gpu-mem-util or longer ctx.
[INFO:genesis.apply_all]   [OFF] P83+P84+P85        Prefix-cache cake-and-eat patches (vllm#38182)
[INFO:genesis.apply_all]           gain: (currently negative)
[INFO:genesis.apply_all]           (correctly skipped on this regime)
[INFO:genesis.apply_all]           why: Tested 4-arm A/B 2026-04-29: -29% TPS regression even with full stack including root-cause P84. Possible vllm cache machinery fixed-overhead per-step. WAIT for v0.20.2 pin bump.
[INFO:genesis.apply_all] ==============================================================================
[INFO:genesis.wiring.p8_kv_hybrid_reporting] [P8] using V2 import anchor (post-MambaSpec layout)
[INFO:genesis.wiring.text_patch] [P8 KV hybrid reporting (kv_cache_utils)] applied 3 sub-patches: p8_kv_imports, p8_kv_helper_injection, p8_kv_callsite
[INFO:genesis.wiring.text_patch] [P8 KV hybrid reporting (scheduler)] applied 2 sub-patches: p8_sched_import, p8_sched_callsite
[INFO:genesis.apply_all] [Genesis] applied: P8 KV hybrid reporting (per-token capacity) — kv_cache_utils=applied(ok), scheduler=applied(ok)
[INFO:genesis.wiring.text_patch] [P3 TurboQuant BF16->FP8 cast (Ampere fix)] applied 1 sub-patches: p3_bf16_fp8_cast
[INFO:genesis.apply_all] [Genesis] applied: P3 TurboQuant BF16->FP8 cast (Ampere fix) — BF16->FP8 cast guard inserted
[INFO:genesis.wiring.text_patch] [P6 TQ-aware block size alignment] applied 2 sub-patches: p6_import_tqspec, p6_tq_branch
[INFO:genesis.apply_all] [Genesis] applied: P6 TurboQuant-aware attention page size — TQ-aware page-size branch inserted
[INFO:genesis.wiring.text_patch] [P15 Qwen3 None/null tool arg] applied 1 sub-patches: p15_none_null
[INFO:genesis.apply_all] [Genesis] applied: P15 Qwen3 None/null tool arg parser — None/none mapping added to tool param parser
[INFO:genesis.wiring.text_patch] [P12 Qwen3 <tool_call> implicit reasoning end] upstream marker '_tool_call_token_id' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] skipped: P12 Qwen3 <tool_call> implicit reasoning end — upstream_merged
[INFO:genesis.wiring.text_patch] [P27 Qwen3 BEFORE-THINK fallback/p27_nonstream_return_baseline] anchor not found — soft skip (sibling patches continue)
[INFO:genesis.wiring.text_patch] [P27 Qwen3 BEFORE-THINK fallback] applied 3 sub-patches: p27_nonstream_capture, p27_nonstream_return_pr35687, p27_stream_start
[INFO:genesis.apply_all] [Genesis] applied: P27 Qwen3 BEFORE-THINK fallback — BEFORE-THINK fallback wired (non-stream + stream)
[INFO:genesis.wiring.text_patch] [P34 Mamba zero-collapse deadlock guard] applied 1 sub-patches: p34_deadlock_guard
[INFO:genesis.apply_all] [Genesis] applied: P34 Mamba zero-collapse deadlock guard — zero-collapse deadlock guard inserted (fixes #40707 for hybrid Mamba + multimodal)
[INFO:genesis.apply_all] [Genesis] applied: P29 tool parser IndexError guard — upstream already contains bounded-index guards (no-op)
[INFO:genesis.marlin_fp32_reduce] [Genesis P23] Marlin FP32_REDUCE: disabled=False on SM=(12, 0) (auto-from-platform)
[INFO:genesis.apply_all] [Genesis] applied: P23 Marlin FP32_REDUCE env override — decision: fp32_reduce disabled=False (requires upstream wire into Marlin launcher to take effect)
[INFO:genesis.wiring.text_patch] [P4 TurboQuant hybrid model support] applied 2 sub-patches: p4_helper_fn, p4_tq_block
[INFO:genesis.apply_all] [Genesis] applied: P4 TurboQuant hybrid model support — text-patch succeeded
[INFO:genesis.wiring.text_patch] [P5 KV cache page size unification (v1_lcm_pad_max)/p5_import_math] anchor not found — soft skip (sibling patches continue)
[INFO:genesis.wiring.text_patch] [P5 KV cache page size unification (v1_lcm_pad_max)] applied 1 sub-patches: p5_v1_lcm_pad_max_from_baseline
[INFO:genesis.apply_all] [Genesis] applied: P5 KV cache page size unification — text-patch v2 succeeded (pad-smaller-to-max)
[INFO:genesis.apply_all] [Genesis] skipped: P5b KV page-size pad-smaller-to-max (env-opt-in) — opt-in: set GENESIS_ENABLE_P5B=1 to enable pad-smaller-to-max KV page-size strategy (saves ~34% per-block VRAM on Qwen3.6 hybrid vs P5 v1 LCM-pad-up; BLAST-RADIUS is KV allocator → benchmark on VM 100 before enabling in prod)
INFO 05-04 23:11:52 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
INFO 05-04 23:11:52 [nixl_utils.py:32] NIXL is available
[INFO:genesis.wiring.p31_router_softmax] [Genesis P31] wrapped grouped_topk_router.grouped_topk (fp32 upcast active for grouped-MoE models)
[INFO:genesis.apply_all] [Genesis] applied: P31 MoE router fp32 softmax — grouped_topk wrapped (effective in this process)
[INFO:genesis.wiring.p22_tq_prealloc] [P22] PR #40655 merged (bhoomit moved init out) — auto-skip — patch retired in favor of upstream
[INFO:genesis.apply_all] [Genesis] skipped: P22 TurboQuant shared dequant prealloc — PR #40655 merged (bhoomit moved init out) — auto-skip
[INFO:genesis.wiring.text_patch] [P26 TQ prefill output prealloc] upstream marker 'if not hasattr(self, "_cu_2")' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] skipped: P26 TurboQuant prefill output prealloc — upstream_merged
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P61b — Qwen3 streaming partial-tag overlap guard | opt-in only — set GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P61b Qwen3 streaming partial-tag overlap guard — opt-in only — set GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P62 — Structured-output spec-decode reasoning-end timing fix | opt-in only — set GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P62 structured-output spec-decode timing fix — opt-in only — set GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P61 — Qwen3 multi-tool first-occurrence (DEPRECATED — superseded by P12 v2) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P61_QWEN3_MULTI_TOOL=1 only for diagnostics
[INFO:genesis.apply_all] [Genesis] skipped: P61 Qwen3 multi-tool first-occurrence — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P61_QWEN3_MULTI_TOOL=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P60b — GDN+ngram Triton kernel offset (Phase 2) | opt-in only — set GENESIS_ENABLE_P60B_TRITON_KERNEL=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P60b GDN+ngram Triton kernel offset — opt-in only — set GENESIS_ENABLE_P60B_TRITON_KERNEL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P60 — GDN+ngram state recovery (Phase 1: SSM pre-copy) | opt-in only — set GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P60 GDN+ngram state recovery — opt-in only — set GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P63 — MTP/Eagle drafter GDN state recovery (deprecated — wrong layer) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P63_MTP_GDN_STATE_RECOVERY=1 only for diagnost
[INFO:genesis.apply_all] [Genesis] skipped: P63 MTP/Eagle drafter GDN state recovery — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P63_MTP_GDN_STATE_RECOVERY=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P64 — qwen3coder MTP streaming early-return fix | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P64 qwen3coder_tool_parser.py — MTP streaming early-return removal] applied 2 sub-patches: p64_remove_early_return, p64_unify_emit_at_fnend
[INFO:genesis.wiring.text_patch] [P64 serving.py — MTP safety-net + Pydantic null fix] applied 2 sub-patches: p64_safety_net_widen, p64_callsite_guard
[INFO:genesis.apply_all] [Genesis] applied: P64 qwen3coder MTP streaming early-return fix — P64 applied: 2 files modified, 0 idempotent. qwen3coder streaming parser no longer drops parameters when MTP bundles last param + </function> in same delta. Safety net widened to fire on finish_reason alone.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P65 — TurboQuant spec-decode cudagraph downgrade | opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P65 TurboQuant spec-decode cudagraph downgrade — opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P66 — cudagraph_capture_sizes spec-decode divisibility filter | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P66 config/vllm.py — cudagraph_capture_sizes spec-decode filter] applied 1 sub-patches: p66_size_filter
[INFO:genesis.apply_all] [Genesis] applied: P66 cudagraph_capture_sizes spec-decode divisibility filter — P66 applied: cudagraph_capture_sizes will be filtered to sizes divisible by uniform_decode_query_len when spec-decode is active. Boot 2-4x faster, less peak GPU memory, no mixed-q_len capture risks.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P68 — Auto force tool_choice=required for long-context tool calls | opt-in only — set GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P69 — Long-context tool-format reminder injection | opt-in only — set GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P68/P69 long-context tool-call adherence — neither P68 nor P69 enabled; hook injection skipped to keep serving.py pristine. P68: opt-in only — set GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to engage | P69: opt-in only — set GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P70 — Auto-strict-ngram (force prompt_lookup_min>=8) | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P70 Auto-strict-ngram (force prompt_lookup_min>=8) — opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to engage
[INFO:genesis.wiring.p67_tq_multi_query_kernel] [Genesis P67 H2] baked env at module load: MAX_PRIOR_LEN=4096 DEBUG_COMPARE=False (no per-dispatch env reads)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P67 — TurboQuant multi-query kernel for spec-decode K+1 | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P67 turboquant_attn.py — multi-query kernel hook] applied 1 sub-patches: p67_kernel_hook
[INFO:genesis.apply_all] [Genesis] applied: P67 TurboQuant multi-query kernel for spec-decode K+1 — P67 hook injected at top of _prefill_attention. Multi-query continuation prefill batches (spec-verify K+1 with prior cached KV) will route to Genesis Triton kernel when GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1. Diag: {'env_enabled': True, 'version': 'v7.39_aggressive_tune', 'sm': '12.0', 'fp8_mode': 'e4nv', 'autoconfig': {'BLOCK_KV': 32, 'num_warps': 8, 'num_stages': 3}, 'kernel_built': True}
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P71 — Block-verify rejection sampler (Sun 2024 ICLR) | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P71 Block-verify rejection sampler (vllm#40819 + gemini bug-fixes) — opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P78 — TurboQuant .tolist() capture-guard (adapted from noonghunna) | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P78 TurboQuant .tolist() capture-guard (adapted from noonghunna) — opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P77 — Adaptive ngram K controller (EMA + hysteresis + auto-disable) | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P77 Adaptive ngram K controller (EMA + hysteresis + auto-disable) — opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P79b — Async × spec-decode proposer-sync backport (vllm#40610) | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P79b Async × spec-decode proposer-sync backport (vllm#40610) — opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P79c — Stale spec_token_ids cleanup for unscheduled requests (vllm#37629) | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P79c Stale spec_token_ids cleanup for unscheduled requests (vllm#37629) — opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P81 — fp8 block-scaled MM low-M decode tuning (vllm#40925) | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P81 fp8 block-scaled MM low-M decode tuning (vllm#40925) — opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P82 — SGLang threshold_single OR-clause acceptance (BIASED — opt-in research) | opt-in only — set GENESIS_ENABLE_P82=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P82 SGLang threshold_single OR-clause acceptance (BIASED — opt-in research) — opt-in only — set GENESIS_ENABLE_P82=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P83 — MTP keep-last-cached-block (vllm#38182 downstream symptom — P84 is real fix) | opt-in only — set GENESIS_ENABLE_P83=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P83 MTP keep-last-cached-block (vllm#38182 mitigation) — opt-in only — set GENESIS_ENABLE_P83=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P84 — hash_block_size override (vllm#38182 actual root cause) | opt-in only — set GENESIS_ENABLE_P84=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P84 hash_block_size override (vllm#38182 ACTUAL root cause) — opt-in only — set GENESIS_ENABLE_P84=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P100 — FlashInfer FULL CUDA graph for spec-decode (vllm#41127) | opt-in only — set GENESIS_ENABLE_P100=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P100 FlashInfer FULL CUDA graph for spec-decode (vllm#41127) — opt-in only — set GENESIS_ENABLE_P100=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P103 FLA Cliff 2 chunked fwd_h+fwd_o orchestrator — GENESIS_ENABLE_P103 not set or platform not NVIDIA SM 8.0+
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P101 — TQ continuation 64-token slicing (vllm#41123 SELECTIVE) | opt-in only — set GENESIS_ENABLE_P101=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P101 TQ continuation 64-token slicing (vllm#41123 selective) — opt-in only — set GENESIS_ENABLE_P101=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P99 — WorkspaceManager.get_simultaneous memoization (perf hotfix) | opt-in only — set GENESIS_ENABLE_P99=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P99 WorkspaceManager memoize get_simultaneous (perf hotfix) — opt-in only — set GENESIS_ENABLE_P99=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P98 — TQ WorkspaceManager revert (vllm#40941 perf hotfix) | opt-in only — set GENESIS_ENABLE_P98=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P98 TQ WorkspaceManager revert (vllm#40941 perf hotfix) — opt-in only — set GENESIS_ENABLE_P98=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P94 — Spec-decode prepare_next_token_ids_padded zero-alloc (vllm#41043) | opt-in only — set GENESIS_ENABLE_P94=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P94 Spec-decode prepare_next_token_ids_padded zero-alloc (vllm#41043 backport) — opt-in only — set GENESIS_ENABLE_P94=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P95 — Marlin TP cudagraph cap on Ampere (vllm#40385) | opt-in only — set GENESIS_ENABLE_P95=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P95 Marlin TP cudagraph cap on Ampere (vllm#40385 backport) — opt-in only — set GENESIS_ENABLE_P95=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P91 — AutoRound row-parallel group cdiv + start-idx fix (vllm#39460) | opt-in only — set GENESIS_ENABLE_P91=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P91 AutoRound row-parallel group cdiv + start-idx fix (vllm#39460 backport) — opt-in only — set GENESIS_ENABLE_P91=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P87 — Marlin W4A16/W8A16 sub-tile output dim pad-on-load (vllm#40361) | opt-in only — set GENESIS_ENABLE_P87=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P87 Marlin sub-tile output dim pad-on-load (vllm#40361 backport) — opt-in only — set GENESIS_ENABLE_P87=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN8 — MTP/draft online-quant propagation (vllm#40849) | opt-in only — set GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN8 MTP/draft online-quant propagation (vllm#40849 backport) — opt-in only — set GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN9 — Independent drafter attention backend (vllm#39930) | opt-in only — set GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATTN=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN9 independent drafter attention backend (vllm#39930 backport) — opt-in only — set GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATTN=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN11 — GDN a/b contiguity in fix_query_key_value_ordering (vllm#41142) | opt-in only — set GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN11 GDN a/b contiguity (vllm#41142 backport) — opt-in only — set GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P67c sparse-V integration into P67 split-M kernel — opt-in: set GENESIS_ENABLE_P67_SPARSE_V=1 to engage per-q_t sparse-V skip in P67 split-M kernel
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN34 — WorkspaceManager runtime lock relaxation (PN33 companion for runtime decode) | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN34 workspace lock runtime relaxation — opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN33 — Spec-decode warmup K-aware sizing (vllm#37521 extended to MTP/ngram) | config_detect: neutral:
[INFO:genesis.wiring.text_patch] [PN33 v1/worker/gpu_model_runner.py — spec-decode warmup K-aware sizing (vllm#37521 extended to MTP/ngram)] applied 1 sub-patches: pN33_warmup_k_draft_tokens
[INFO:genesis.apply_all] [Genesis] applied: PN33 spec-decode warmup K-aware (vllm#37521 extended) — PN33 applied: spec-decode warmup uses real num_speculative_tokens instead of dummy K=1. Closes (a) ampersandru mid-stream OOM via propose_draft_token_ids and (b) noonghunna workspace-lock AssertionError on TQ + MTP K=3 single-card. Disable via GENESIS_DISABLE_PN33_SPEC_DECODE_WARMUP_K=1 if warmup OOMs.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN32 — GDN _forward_core chunked-prefill v2 (Cliff 2 fix for single-24GB-GPU OOM) | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN32 GDN chunked-prefill (Cliff 2 fix) — opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN31 — FA varlen persistent out buffer (issue #15, sister to P38) | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_OUT=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN31 FA varlen persistent out (issue #15) — opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_OUT=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN30 — DS conv state layout + spec-decode AL>1 fix (issue #17) | opt-in only — set GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN30 DS conv state + spec-decode AL>1 (issue #17) — opt-in only — set GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN29 — GDN chunk_o scale-fold (vllm#41446 pattern (c)) | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN29 GDN chunk_o scale-fold (vllm#41446 pattern c) — opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN12 — FFN intermediate scratch pool — Cliff 1 fix on TQ3 path | opt-in only — set GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN12 FFN intermediate scratch pool (Cliff 1 fix) — opt-in only — set GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN28 — merge_attn_states NaN guard (vllm#39148 backport) | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN28 merge_attn_states NaN guard (vllm#39148 backport) — opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P15B — FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix) | opt-in only — set GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P15B FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix) — opt-in only — set GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P38B — P38 compile-safe in-source hook (Issue #14 fix) | opt-in only — set GENESIS_ENABLE_P38B_COMPILE_SAFE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P38B P38 compile-safe in-source hook (Issue #14 fix) — opt-in only — set GENESIS_ENABLE_P38B_COMPILE_SAFE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN26b sparse-V tile-skip Genesis kernel (BLASST lambda=a/L for SM86) — opt-in: set GENESIS_ENABLE_PN26_SPARSE_V=1 to enable sparse-V tile-skip kernel (BLASST λ=a/L formula by default)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN27 — Revert MoERunnerInterface PluggableLayer (vllm#41440) | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN27 revert MoERunnerInterface PluggableLayer (vllm#41440 backport) — opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN26 — TQ unified perf pack (centroids prebake + sparse V scaffold) | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN26 TQ unified perf pack (centroids prebake + sparse V scaffold) — opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN25 — SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile path) | opt-in only — set GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN25 SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile-path) — opt-in only — set GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN13 — CUDAGraphWrapper gc.collect/empty_cache lambda arity (vllm#41235) | opt-in only — set GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARITY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN13 CUDAGraphWrapper lambda arity (vllm#41235 backport) — opt-in only — set GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARITY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN14 — TQ decode IOOB safe_page_idx clamp (vllm#40074) | opt-in only — set GENESIS_ENABLE_PN14_TQ_DECODE_OOB_CLAMP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN14 TQ decode IOOB safe_page_idx clamp (vllm#40074 backport) — opt-in only — set GENESIS_ENABLE_PN14_TQ_DECODE_OOB_CLAMP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN19 Scoped max_split_size_mb during model load (vllm#41268) — GENESIS_ENABLE_PN19_SCOPED_MAX_SPLIT not set; default OFF. Backport of vllm#41268 (MatthewBonanni, OPEN). PyTorch 2.10+ introduces load-time fragmentation; this patch sets max_split_size_mb=20 during model load, restores on exit. Estimated win: 200-500 MiB on H100 (per #41268 author); unverified on Ampere — measure before relying on it.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN23 — DFlash combine_hidden_states dtype cast (vllm#40334) | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN23 DFlash combine_hidden_states dtype cast (vllm#40334 backport) — opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN21 — DFlash SWA support partial backport (vllm#40898) | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN21 DFlash SWA support partial backport (vllm#40898 backport) — opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN22 — Local argmax for TP draft (vllm#39419 backport) | opt-in only — set GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN22 Local argmax for TP draft (vllm#39419 backport) — opt-in only — set GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN24 — DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN24 DFlash aux layer +1 indexing fix (vllm#40727 backport) — opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN17 FA2 softmax_lse runtime clamp (Issue #11 Cliff 1 mechanism A) — GENESIS_ENABLE_PN17_FA2_LSE_CLAMP not set; default OFF. Enable on long-text-no-vision configs to close Cliff 1 mechanism A (FA2 softmax_lse over-allocation at long ctx). Diagnosis credit: noonghunna, Genesis Issue #11.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN16 — Lazy-reasoner request hook (per-request enable_thinking) | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN16 Lazy-reasoner request hook (per-request enable_thinking) — opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P86 — ngram batch_propose O(N*K) → O(N+K) direct-fill (vllm#40876) | opt-in only — set GENESIS_ENABLE_P86=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P86 ngram batch_propose O(N+K) direct-fill (vllm#40876 backport) — opt-in only — set GENESIS_ENABLE_P86=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P85 — Hybrid fine-shadow prefix cache (vllm#38182 followup, MambaManager fix) | opt-in only — set GENESIS_ENABLE_P85=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P85 Hybrid fine-shadow prefix cache (MambaManager fix for vllm#38182 followup) — opt-in only — set GENESIS_ENABLE_P85=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P75 — Auto-enable Suffix Decoding (Arctic Inference, vllm#25784) | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P75 Auto-enable Suffix Decoding (vllm#25784 Arctic Inference) — opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P74 — Auto chunk-clamp via long_prefill_token_threshold (P72 companion) | opt-in only — set GENESIS_ENABLE_P74_CHUNK_CLAMP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P74 Auto chunk-clamp via long_prefill_token_threshold (P72 companion) — opt-in only — set GENESIS_ENABLE_P74_CHUNK_CLAMP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P72 — profile_run M cap (unblocks --max-num-batched-tokens>4096 on MoE) | opt-in only — set GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P72 profile_run M cap (unblocks --max-num-batched-tokens>4096 on MoE) — opt-in only — set GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 to engage
[INFO:genesis.wiring.p67b_spec_verify_routing] [Genesis P67b B2] baked env at module load: USE_UPSTREAM=True NUM_KV_SPLITS=(default = self.max_num_kv_splits) (no per-dispatch env reads)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P67b — TurboQuant spec-verify forward() routing (FULL CG enable) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P67b turboquant_attn.py forward() spec-verify routing] applied 1 sub-patches: p67b_forward_spec_verify_branch
[INFO:genesis.apply_all] [Genesis] applied: P67b TurboQuant spec-verify forward() routing — P67b forward() spec-verify routing injected. K+1 batches now bypass _prefill_attention entirely (cudagraph-safe).
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P59 — Qwen3 reasoning embedded tool_call recovery | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P59 Qwen3 reasoning embedded tool_call recovery — opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P58 — Async-scheduler -1 placeholder fix | opt-in only — set GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P58 async-scheduler -1 placeholder fix — opt-in only — set GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P57 — TQ spec-decode capture-safe buffers (deprecated — research artifact) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P57_SPEC_DECODE_CAPTURE_SAFE=1 only for diagno
[INFO:genesis.apply_all] [Genesis] skipped: P57 TQ spec-decode capture-safe buffers — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P57_SPEC_DECODE_CAPTURE_SAFE=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P56 — TQ spec-decode safe-path guard (deprecated — superseded by P65) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P56_SPEC_DECODE_GUARD=1 only for diagnostics
[INFO:genesis.apply_all] [Genesis] skipped: P56 TQ spec-decode safe-path guard — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P56_SPEC_DECODE_GUARD=1 only for diagnostics
[INFO:genesis.wiring.text_patch] [P44 TQ mixed-batch attn_out pool] applied 1 sub-patches: p44_mixed_attn_out_alloc
[INFO:genesis.apply_all] [Genesis] applied: P44 TQ mixed-batch attn_out pool — text-patch applied — mixed-batch attn_out routed through TurboQuantBufferManager pool (~80 MB zero-init eliminated per mixed-batch forward in multi-user serving)
[INFO:genesis.wiring.text_patch] [P46 GDN gating buffer pool] applied 2 sub-patches: p46_g_buffer, p46_beta_buffer
[INFO:genesis.apply_all] [Genesis] applied: P46 GDN gating buffer pool — text-patch applied — fused_gdn_gating now uses GdnGatingBufferManager pool (eliminates ~24k allocs/sec on Qwen3.6-35B-A3B decode)
[INFO:genesis.apply_all] [Genesis] skipped: P7b GDN dual-stream via torch.library.custom_op (opt-in) — opt-in: set GENESIS_ENABLE_P7B=1 to enable GDN dual-stream via torch.library.custom_op (graph-safe alternative to P7; validates numeric equiv + compile-cache freshness before prod)
[INFO:genesis.apply_all] [Genesis] skipped: P40 TurboQuant GQA-grouped decode stage1 (opt-in) — opt-in: set GENESIS_ENABLE_P40=1 to enable GQA-grouped decode stage1 (port of upstream PR #40792)
[INFO:genesis.wiring.p39a_fla_kkt] [Genesis P39b] vllm_config fetch failed (Current vLLM config is not set. This typically means get_current_vllm_config() was called outside of a set_current_vllm_config() context, or a CustomOp was instantiated at module import time or model forward time when config is not set. For tests that directly test custom ops/modules, use the 'default_vllm_config' pytest fixture from tests/conftest.py.); defaults used
[INFO:genesis.wiring.p39a_fla_kkt] [Genesis P39a] rebound vllm.model_executor.layers.fla.ops.chunk_scaled_dot_kkt.chunk_scaled_dot_kkt_fwd (+1 caller mods: ['vllm.model_executor.layers.fla.ops.chunk'])
[INFO:genesis.apply_all] [Genesis] applied: P39a FLA chunk_scaled_dot_kkt persistent A pool — module-level fn replaced (1 caller module(s) also rebound — pool shared across GDN layers)
[INFO:genesis.wiring.p38_tq_continuation_memory] [Genesis P38] rebound TurboQuantAttentionImpl._continuation_prefill (persistent K_full/V_full buffers replace torch.cat peak)
[INFO:genesis.apply_all] [Genesis] applied: P38 TQ _continuation_prefill persistent workspace — class method replaced (persistent K_full/V_full workspace, no .contiguous()/torch.cat transient peaks)
[INFO:genesis.apply_all] [Genesis] skipped: P37 MoE intermediate cache pool (opt-in) — opt-in only — set GENESIS_ENABLE_P37=1 to engage. Manager API is registered and usable independently of this text-patch.
[INFO:genesis.apply_all] [Genesis] skipped: P36 TurboQuant shared decode buffers — redundant: upstream PR #40798 (workspace manager) active — skip
[INFO:genesis.apply_all] [Genesis] applied: P32/P33 TurboQuant cu_2 + synth_seq_lens preallocs — cu_2 + synth_seq_lens preallocs registered (invoked from ensure_turboquant_buffers, fires during profile_run)
[INFO:genesis.prealloc_budget] [Genesis P73] token budget resolved → 4096 (default fallback). Set GENESIS_PREALLOC_TOKEN_BUDGET to override.
[INFO:genesis.wiring.text_patch] [P28 GDN core_attn_out prealloc] applied 1 sub-patches: p28_core_attn_out_alloc
[INFO:genesis.wiring.p28_gdn_core_attn] [Genesis P28] wrapped GatedDeltaNetAttention.__init__ to attach _genesis_gdn_core_attn_buf on each instance
[INFO:genesis.apply_all] [Genesis] applied: P28 GDN core_attn_out prealloc — forward_cuda patched + __init__ wrapped
[INFO:genesis.gdn_dual_stream] [GDN dual-stream] CUDA aux stream initialized
[INFO:genesis.apply_all] [Genesis P7] dispatcher ready (parallel path)
[INFO:genesis.apply_all] [Genesis] skipped: P7 GDN dual-stream in_proj parallelism — deferred — incompatible with torch.compile fullgraph (CUDA streams not SymPy-graphable); custom op implementation required. Re-enable with GENESIS_ENABLE_P7=1 + --enforce-eager.
[INFO:genesis.apply_all] [Genesis] skipped: P17/P18 Marlin MoE per-SM tuning — no tuning entry for SM (12, 0) — upstream heuristic will be used
[INFO:genesis.wiring.text_patch] [P24 fused_moe num_warps/num_stages overlay] applied 2 sub-patches: p24_fp8_cfg_overlay, p24_general_cfg_overlay
[INFO:genesis.apply_all] [Genesis] applied: P24 fused_moe num_warps/num_stages overlay — num_warps / num_stages overlay wired into get_default_config (active only on Triton fused_moe path; Marlin unaffected)
[INFO:genesis.wiring.p14_block_table] [Genesis P14] rebound BlockTable.append_row + move_row (tail-zero-fill active for concurrent-request safety)
[INFO:genesis.apply_all] [Genesis] applied: P14 block_table tail zero-fill — BlockTable.append_row + move_row wrapped (effective in this process)
[INFO:genesis.tq_decode_tune] [Genesis P18b] TQ decode stage1 using upstream defaults (BLOCK_KV=4 num_warps=1 num_stages=1)
[INFO:genesis.apply_all] [Genesis] applied: P18b TurboQuant decode stage1 tune — no env override — using upstream defaults (4/1/1)
[INFO:genesis.apply_all] [Genesis P20] TQ _continuation_prefill FP16 helpers ready for TurboQuantAttentionImpl hook
[INFO:genesis.apply_all] [Genesis] applied: P20 TurboQuant continuation-prefill FP16 rotate — fp16-rotation helper ready for _continuation_prefill hook
[INFO:genesis.fp8_dispatcher] [Genesis FP8 dispatcher] SM (12, 0) → Native Triton FP8 (no override)
[INFO:genesis.apply_all] [Genesis] applied: P1/P2 FP8 kernel dispatcher — SM=(12, 0) → native Triton FP8 path selected
[INFO:genesis.apply_all] Genesis Results: 27 applied, 72 skipped, 0 failed, 2 ⚠️ partial-apply warning(s)
[WARNING:genesis.apply_all] [Genesis] 2 partial-apply warning(s) — patch(es) failed to match expected source pattern. Review below to confirm anchor drift vs upstream change vs config issue:
[WARNING:genesis.apply_all] [Genesis] ⚠️  P103 FLA Cliff 2 chunked fwd_h+fwd_o orchestrator — GENESIS_ENABLE_P103 not set or platform not NVIDIA SM 8.0+
[WARNING:genesis.apply_all] [Genesis] ⚠️  P17/P18 Marlin MoE per-SM tuning — no tuning entry for SM (12, 0) — upstream heuristic will be used
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] apply matrix:
Patch | Status | Title                                         | Reason                                                       | Credit
------+--------+-----------------------------------------------+--------------------------------------------------------------+-------------------------------
P61b  | SKIP   | Qwen3 streaming partial-tag overlap guard     | opt-in only — set GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1 to | ExtReMLapin (vllm#40783)
P62   | SKIP   | Structured-output spec-decode reasoning-end t | opt-in only — set GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING= | sfbemerk (vllm#36138), ciciror
P61   | SKIP   | Qwen3 multi-tool first-occurrence (DEPRECATED | opt-in only AND empirically deprecated — keeping skip; set G | ExtReMLapin (vllm#40783) — P61
P60b  | SKIP   | GDN+ngram Triton kernel offset (Phase 2)      | opt-in only — set GENESIS_ENABLE_P60B_TRITON_KERNEL=1 to eng | tdoublep (vllm#40738)
P60   | SKIP   | GDN+ngram state recovery (Phase 1: SSM pre-co | opt-in only — set GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1 to enga | tdoublep (vllm#40738), bhaktat
P63   | SKIP   | MTP/Eagle drafter GDN state recovery (depreca | opt-in only AND empirically deprecated — keeping skip; set G | Genesis-original (hypothesis d
P64   | APPLY  | qwen3coder MTP streaming early-return fix     | opt-in env (config: neutral)                                 | kotori-yan (vllm#39598)
P65   | SKIP   | TurboQuant spec-decode cudagraph downgrade    | opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWN | Genesis-original (root cause f
P66   | APPLY  | cudagraph_capture_sizes spec-decode divisibil | opt-in env (config: neutral)                                 | Genesis-original (mirrors fhl2
P68   | SKIP   | Auto force tool_choice=required for long-cont | opt-in only — set GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to en | Genesis-original (long-ctx too
P69   | SKIP   | Long-context tool-format reminder injection   | opt-in only — set GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER= | Genesis-original (long-ctx too
P70   | SKIP   | Auto-strict-ngram (force prompt_lookup_min>=8 | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to  | Genesis-original (vllm#40875 e
P67   | APPLY  | TurboQuant multi-query kernel for spec-decode | opt-in env (config: neutral)                                 | Genesis-original (proper fix f
P71   | SKIP   | Block-verify rejection sampler (Sun 2024 ICLR | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engag | Backport of vllm#40819 (Z. Gol
P78   | SKIP   | TurboQuant .tolist() capture-guard (adapted f | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1  | Adapted from noonghunna's patc
P77   | SKIP   | Adaptive ngram K controller (EMA + hysteresis | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to e | Genesis-original (port of SGLa
P79b  | SKIP   | Async × spec-decode proposer-sync backport (v | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1  | Backport of vllm#40610 (OPEN d
P79c  | SKIP   | Stale spec_token_ids cleanup for unscheduled  | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEAN | Backport of vllm#37629 (OPEN,
P81   | SKIP   | fp8 block-scaled MM low-M decode tuning (vllm | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8 | Backport of vllm#40925 (tonyli
P82   | SKIP   | SGLang threshold_single OR-clause acceptance  | opt-in only — set GENESIS_ENABLE_P82=1 to engage             | SGLang team (sgl-project/sglan
P83   | SKIP   | MTP keep-last-cached-block (vllm#38182 downst | opt-in only — set GENESIS_ENABLE_P83=1 to engage             | Root-cause analysis: vllm#3818
P84   | SKIP   | hash_block_size override (vllm#38182 actual r | opt-in only — set GENESIS_ENABLE_P84=1 to engage             | Genesis-original discovery 202
P100  | SKIP   | FlashInfer FULL CUDA graph for spec-decode (v | opt-in only — set GENESIS_ENABLE_P100=1 to engage            | Backport of vllm#41127 (open 2
P101  | SKIP   | TQ continuation 64-token slicing (vllm#41123  | opt-in only — set GENESIS_ENABLE_P101=1 to engage            | Selective backport of vllm#411
P99   | SKIP   | WorkspaceManager.get_simultaneous memoization | opt-in only — set GENESIS_ENABLE_P99=1 to engage             | Per Sander 2026-04-28: 'if rev
P98   | SKIP   | TQ WorkspaceManager revert (vllm#40941 perf h | opt-in only — set GENESIS_ENABLE_P98=1 to engage             | Reverts upstream PR #40941 (ME
P94   | SKIP   | Spec-decode prepare_next_token_ids_padded zer | opt-in only — set GENESIS_ENABLE_P94=1 to engage             | Backport of vllm#41043 (wanglu
P95   | SKIP   | Marlin TP cudagraph cap on Ampere (vllm#40385 | opt-in only — set GENESIS_ENABLE_P95=1 to engage             | Backport of vllm#40385 (OPEN a
P91   | SKIP   | AutoRound row-parallel group cdiv + start-idx | opt-in only — set GENESIS_ENABLE_P91=1 to engage             | Backport of non-MoE-specific p
P87   | SKIP   | Marlin W4A16/W8A16 sub-tile output dim pad-on | opt-in only — set GENESIS_ENABLE_P87=1 to engage             | Backport of vllm#40361 (OPEN).
PN8   | SKIP   | MTP/draft online-quant propagation (vllm#4084 | opt-in only — set GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT= | Backport of vllm#40849 (bhoomi
PN9   | SKIP   | Independent drafter attention backend (vllm#3 | opt-in only — set GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATT | Backport of vllm#39930 (Matthe
PN11  | SKIP   | GDN a/b contiguity in fix_query_key_value_ord | opt-in only — set GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 to | Backport of vllm#41142 (Yeuvoi
PN34  | SKIP   | WorkspaceManager runtime lock relaxation (PN3 | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 | Companion to PN33 — same root
PN33  | APPLY  | Spec-decode warmup K-aware sizing (vllm#37521 | config_detect: neutral:                                      | Backport of vllm-project/vllm#
PN32  | SKIP   | GDN _forward_core chunked-prefill v2 (Cliff 2 | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1  | Genesis-original v7.69 v2 (202
PN31  | SKIP   | FA varlen persistent out buffer (issue #15, s | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_O | Genesis-original sister patch
PN30  | SKIP   | DS conv state layout + spec-decode AL>1 fix ( | opt-in only — set GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE= | Genesis-original fix for issue
PN29  | SKIP   | GDN chunk_o scale-fold (vllm#41446 pattern (c | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to en | Backport of vllm#41446 (zobinH
PN12  | SKIP   | FFN intermediate scratch pool — Cliff 1 fix o | opt-in only — set GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL= | Genesis-original 2026-04-29 —
PN28  | SKIP   | merge_attn_states NaN guard (vllm#39148 backp | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 | Backport of vllm#39148 (jasonk
P15B  | SKIP   | FA varlen max_seqlen_k clamp on TQ path (Issu | opt-in only — set GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 to e | Genesis-original 2026-05-01 fi
P38B  | SKIP   | P38 compile-safe in-source hook (Issue #14 fi | opt-in only — set GENESIS_ENABLE_P38B_COMPILE_SAFE=1 to enga | Genesis-original 2026-05-01 fi
PN27  | SKIP   | Revert MoERunnerInterface PluggableLayer (vll | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 | Backport of vllm#41440 (auto-g
PN26  | SKIP   | TQ unified perf pack (centroids prebake + spa | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage | Genesis-original 2026-05-01 un
PN25  | SKIP   | SiluAndMul.forward_native opaque-op pool (Cli | opt-in only — set GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 t | Genesis-original 2026-05-01 in
PN13  | SKIP   | CUDAGraphWrapper gc.collect/empty_cache lambd | opt-in only — set GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARIT | Backport of vllm#41235 (roikor
PN14  | SKIP   | TQ decode IOOB safe_page_idx clamp (vllm#4007 | opt-in only — set GENESIS_ENABLE_PN14_TQ_DECODE_OOB_CLAMP=1  | Backport of vllm#40074 (devara
PN23  | SKIP   | DFlash combine_hidden_states dtype cast (vllm | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to  | Backport of vllm#40334 (cipher
PN21  | SKIP   | DFlash SWA support partial backport (vllm#408 | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage | Partial backport of vllm#40898
PN22  | SKIP   | Local argmax for TP draft (vllm#39419 backpor | opt-in only — set GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 to e | Backport of vllm#39419 (EanWan
PN24  | SKIP   | DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 | Backport of vllm#40727 (benchi
PN16  | SKIP   | Lazy-reasoner request hook (per-request enabl | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to eng | Genesis-original 2026-04-29. H
P86   | SKIP   | ngram batch_propose O(N*K) → O(N+K) direct-fi | opt-in only — set GENESIS_ENABLE_P86=1 to engage             | Backport of vllm#40876 (aarona
P85   | SKIP   | Hybrid fine-shadow prefix cache (vllm#38182 f | opt-in only — set GENESIS_ENABLE_P85=1 to engage             | Genesis-original 2026-04-27 —
P75   | SKIP   | Auto-enable Suffix Decoding (Arctic Inference | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to en | Backport-enabler of vllm#25784
P74   | SKIP   | Auto chunk-clamp via long_prefill_token_thres | opt-in only — set GENESIS_ENABLE_P74_CHUNK_CLAMP=1 to engage | Genesis-original (zero-VRAM-co
P72   | SKIP   | profile_run M cap (unblocks --max-num-batched | opt-in only — set GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 to en | Genesis-original (Dynamo fake-
P67b  | APPLY  | TurboQuant spec-verify forward() routing (FUL | opt-in env (config: neutral)                                 | Genesis-original (FULL CG enab
P59   | SKIP   | Qwen3 reasoning embedded tool_call recovery   | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 t | ZenoAFfectionate (vllm#39055)
P58   | SKIP   | Async-scheduler -1 placeholder fix            | opt-in only — set GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1 | z1ying (vllm#40768)
P57   | SKIP   | TQ spec-decode capture-safe buffers (deprecat | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40831), gdn_attn.
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated —  | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40807, #40831)
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] validator: clean (no issues)
[INFO:genesis.apply_all] [Genesis compile-watchdog] apply_all elapsed: 6.4s
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] apply matrix:
Patch | Status | Title                                         | Reason                                                       | Credit
------+--------+-----------------------------------------------+--------------------------------------------------------------+-------------------------------
P61b  | SKIP   | Qwen3 streaming partial-tag overlap guard     | opt-in only — set GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1 to | ExtReMLapin (vllm#40783)
P62   | SKIP   | Structured-output spec-decode reasoning-end t | opt-in only — set GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING= | sfbemerk (vllm#36138), ciciror
P61   | SKIP   | Qwen3 multi-tool first-occurrence (DEPRECATED | opt-in only AND empirically deprecated — keeping skip; set G | ExtReMLapin (vllm#40783) — P61
P60b  | SKIP   | GDN+ngram Triton kernel offset (Phase 2)      | opt-in only — set GENESIS_ENABLE_P60B_TRITON_KERNEL=1 to eng | tdoublep (vllm#40738)
P60   | SKIP   | GDN+ngram state recovery (Phase 1: SSM pre-co | opt-in only — set GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1 to enga | tdoublep (vllm#40738), bhaktat
P63   | SKIP   | MTP/Eagle drafter GDN state recovery (depreca | opt-in only AND empirically deprecated — keeping skip; set G | Genesis-original (hypothesis d
P64   | APPLY  | qwen3coder MTP streaming early-return fix     | opt-in env (config: neutral)                                 | kotori-yan (vllm#39598)
P65   | SKIP   | TurboQuant spec-decode cudagraph downgrade    | opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWN | Genesis-original (root cause f
P66   | APPLY  | cudagraph_capture_sizes spec-decode divisibil | opt-in env (config: neutral)                                 | Genesis-original (mirrors fhl2
P68   | SKIP   | Auto force tool_choice=required for long-cont | opt-in only — set GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to en | Genesis-original (long-ctx too
P69   | SKIP   | Long-context tool-format reminder injection   | opt-in only — set GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER= | Genesis-original (long-ctx too
P70   | SKIP   | Auto-strict-ngram (force prompt_lookup_min>=8 | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to  | Genesis-original (vllm#40875 e
P67   | APPLY  | TurboQuant multi-query kernel for spec-decode | opt-in env (config: neutral)                                 | Genesis-original (proper fix f
P71   | SKIP   | Block-verify rejection sampler (Sun 2024 ICLR | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engag | Backport of vllm#40819 (Z. Gol
P78   | SKIP   | TurboQuant .tolist() capture-guard (adapted f | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1  | Adapted from noonghunna's patc
P77   | SKIP   | Adaptive ngram K controller (EMA + hysteresis | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to e | Genesis-original (port of SGLa
P79b  | SKIP   | Async × spec-decode proposer-sync backport (v | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1  | Backport of vllm#40610 (OPEN d
P79c  | SKIP   | Stale spec_token_ids cleanup for unscheduled  | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEAN | Backport of vllm#37629 (OPEN,
P81   | SKIP   | fp8 block-scaled MM low-M decode tuning (vllm | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8 | Backport of vllm#40925 (tonyli
P82   | SKIP   | SGLang threshold_single OR-clause acceptance  | opt-in only — set GENESIS_ENABLE_P82=1 to engage             | SGLang team (sgl-project/sglan
P83   | SKIP   | MTP keep-last-cached-block (vllm#38182 downst | opt-in only — set GENESIS_ENABLE_P83=1 to engage             | Root-cause analysis: vllm#3818
P84   | SKIP   | hash_block_size override (vllm#38182 actual r | opt-in only — set GENESIS_ENABLE_P84=1 to engage             | Genesis-original discovery 202
P100  | SKIP   | FlashInfer FULL CUDA graph for spec-decode (v | opt-in only — set GENESIS_ENABLE_P100=1 to engage            | Backport of vllm#41127 (open 2
P101  | SKIP   | TQ continuation 64-token slicing (vllm#41123  | opt-in only — set GENESIS_ENABLE_P101=1 to engage            | Selective backport of vllm#411
P99   | SKIP   | WorkspaceManager.get_simultaneous memoization | opt-in only — set GENESIS_ENABLE_P99=1 to engage             | Per Sander 2026-04-28: 'if rev
P98   | SKIP   | TQ WorkspaceManager revert (vllm#40941 perf h | opt-in only — set GENESIS_ENABLE_P98=1 to engage             | Reverts upstream PR #40941 (ME
P94   | SKIP   | Spec-decode prepare_next_token_ids_padded zer | opt-in only — set GENESIS_ENABLE_P94=1 to engage             | Backport of vllm#41043 (wanglu
P95   | SKIP   | Marlin TP cudagraph cap on Ampere (vllm#40385 | opt-in only — set GENESIS_ENABLE_P95=1 to engage             | Backport of vllm#40385 (OPEN a
P91   | SKIP   | AutoRound row-parallel group cdiv + start-idx | opt-in only — set GENESIS_ENABLE_P91=1 to engage             | Backport of non-MoE-specific p
P87   | SKIP   | Marlin W4A16/W8A16 sub-tile output dim pad-on | opt-in only — set GENESIS_ENABLE_P87=1 to engage             | Backport of vllm#40361 (OPEN).
PN8   | SKIP   | MTP/draft online-quant propagation (vllm#4084 | opt-in only — set GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT= | Backport of vllm#40849 (bhoomi
PN9   | SKIP   | Independent drafter attention backend (vllm#3 | opt-in only — set GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATT | Backport of vllm#39930 (Matthe
PN11  | SKIP   | GDN a/b contiguity in fix_query_key_value_ord | opt-in only — set GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 to | Backport of vllm#41142 (Yeuvoi
PN34  | SKIP   | WorkspaceManager runtime lock relaxation (PN3 | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 | Companion to PN33 — same root
PN33  | APPLY  | Spec-decode warmup K-aware sizing (vllm#37521 | config_detect: neutral:                                      | Backport of vllm-project/vllm#
PN32  | SKIP   | GDN _forward_core chunked-prefill v2 (Cliff 2 | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1  | Genesis-original v7.69 v2 (202
PN31  | SKIP   | FA varlen persistent out buffer (issue #15, s | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_O | Genesis-original sister patch
PN30  | SKIP   | DS conv state layout + spec-decode AL>1 fix ( | opt-in only — set GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE= | Genesis-original fix for issue
PN29  | SKIP   | GDN chunk_o scale-fold (vllm#41446 pattern (c | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to en | Backport of vllm#41446 (zobinH
PN12  | SKIP   | FFN intermediate scratch pool — Cliff 1 fix o | opt-in only — set GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL= | Genesis-original 2026-04-29 —
PN28  | SKIP   | merge_attn_states NaN guard (vllm#39148 backp | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 | Backport of vllm#39148 (jasonk
P15B  | SKIP   | FA varlen max_seqlen_k clamp on TQ path (Issu | opt-in only — set GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 to e | Genesis-original 2026-05-01 fi
P38B  | SKIP   | P38 compile-safe in-source hook (Issue #14 fi | opt-in only — set GENESIS_ENABLE_P38B_COMPILE_SAFE=1 to enga | Genesis-original 2026-05-01 fi
PN27  | SKIP   | Revert MoERunnerInterface PluggableLayer (vll | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 | Backport of vllm#41440 (auto-g
PN26  | SKIP   | TQ unified perf pack (centroids prebake + spa | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage | Genesis-original 2026-05-01 un
PN25  | SKIP   | SiluAndMul.forward_native opaque-op pool (Cli | opt-in only — set GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 t | Genesis-original 2026-05-01 in
PN13  | SKIP   | CUDAGraphWrapper gc.collect/empty_cache lambd | opt-in only — set GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARIT | Backport of vllm#41235 (roikor
PN14  | SKIP   | TQ decode IOOB safe_page_idx clamp (vllm#4007 | opt-in only — set GENESIS_ENABLE_PN14_TQ_DECODE_OOB_CLAMP=1  | Backport of vllm#40074 (devara
PN23  | SKIP   | DFlash combine_hidden_states dtype cast (vllm | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to  | Backport of vllm#40334 (cipher
PN21  | SKIP   | DFlash SWA support partial backport (vllm#408 | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage | Partial backport of vllm#40898
PN22  | SKIP   | Local argmax for TP draft (vllm#39419 backpor | opt-in only — set GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 to e | Backport of vllm#39419 (EanWan
PN24  | SKIP   | DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 | Backport of vllm#40727 (benchi
PN16  | SKIP   | Lazy-reasoner request hook (per-request enabl | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to eng | Genesis-original 2026-04-29. H
P86   | SKIP   | ngram batch_propose O(N*K) → O(N+K) direct-fi | opt-in only — set GENESIS_ENABLE_P86=1 to engage             | Backport of vllm#40876 (aarona
P85   | SKIP   | Hybrid fine-shadow prefix cache (vllm#38182 f | opt-in only — set GENESIS_ENABLE_P85=1 to engage             | Genesis-original 2026-04-27 —
P75   | SKIP   | Auto-enable Suffix Decoding (Arctic Inference | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to en | Backport-enabler of vllm#25784
P74   | SKIP   | Auto chunk-clamp via long_prefill_token_thres | opt-in only — set GENESIS_ENABLE_P74_CHUNK_CLAMP=1 to engage | Genesis-original (zero-VRAM-co
P72   | SKIP   | profile_run M cap (unblocks --max-num-batched | opt-in only — set GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 to en | Genesis-original (Dynamo fake-
P67b  | APPLY  | TurboQuant spec-verify forward() routing (FUL | opt-in env (config: neutral)                                 | Genesis-original (FULL CG enab
P59   | SKIP   | Qwen3 reasoning embedded tool_call recovery   | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 t | ZenoAFfectionate (vllm#39055)
P58   | SKIP   | Async-scheduler -1 placeholder fix            | opt-in only — set GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1 | z1ying (vllm#40768)
P57   | SKIP   | TQ spec-decode capture-safe buffers (deprecat | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40831), gdn_attn.
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated —  | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40807, #40831)
[workspace_lock_disable] target: /usr/local/lib/python3.12/dist-packages/vllm/v1/worker/workspace.py
[workspace_lock_disable] applied (lock-violation now logs WARNING, allocates anyway)
[tolist_cudagraph_fix] target: /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/turboquant_attn.py
[tolist_cudagraph_fix] Site B (_prefill_attention): applied
[tolist_cudagraph_fix] Site A (forward mixed-batch): applied
[tolist_cudagraph_fix] Patched /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/turboquant_attn.py. Site A=applied, Site B=applied
WARNING 05-04 23:12:00 [argparse_utils.py:257] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in a future version.
(APIServer pid=1) INFO 05-04 23:12:00 [utils.py:299]
(APIServer pid=1) INFO 05-04 23:12:00 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 05-04 23:12:00 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.1rc1.dev16+g7a1eb8ac2
(APIServer pid=1) INFO 05-04 23:12:00 [utils.py:299]   █▄█▀ █     █     █     █  model   /root/.cache/huggingface/qwen3.6-27b-nvfp4
(APIServer pid=1) INFO 05-04 23:12:00 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 05-04 23:12:00 [utils.py:299]
(APIServer pid=1) INFO 05-04 23:12:00 [utils.py:233] non-default args: {'model_tag': '/root/.cache/huggingface/qwen3.6-27b-nvfp4', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'model': '/root/.cache/huggingface/qwen3.6-27b-nvfp4', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 96000, 'quantization': 'compressed-tensors', 'served_model_name': ['qwen3.6-27b'], 'reasoning_parser': 'qwen3', 'kv_cache_dtype': 'fp8_e4m3', 'enable_prefix_caching': True, 'language_model_only': True, 'max_num_batched_tokens': 4128, 'max_num_seqs': 1, 'enable_chunked_prefill': True, 'scheduler_reserve_full_isl': False, 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 3}}
(APIServer pid=1) INFO 05-04 23:12:06 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
(APIServer pid=1) INFO 05-04 23:12:06 [nixl_utils.py:32] NIXL is available
(APIServer pid=1) INFO 05-04 23:12:06 [model.py:563] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1) INFO 05-04 23:12:06 [model.py:1692] Using max model len 96000
(APIServer pid=1) INFO 05-04 23:12:06 [cache.py:261] Using fp8_e4m3 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=1) INFO 05-04 23:12:11 [model.py:563] Resolved architecture: Qwen3_5MTP
(APIServer pid=1) INFO 05-04 23:12:11 [model.py:1692] Using max model len 262144
(APIServer pid=1) WARNING 05-04 23:12:11 [speculative.py:659] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1) INFO 05-04 23:12:11 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=4128.
(APIServer pid=1) WARNING 05-04 23:12:11 [config.py:367] Mamba cache mode is set to 'align' for Qwen3_5ForConditionalGeneration by default when prefix caching is enabled
(APIServer pid=1) INFO 05-04 23:12:11 [config.py:387] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=1) INFO 05-04 23:12:11 [vllm.py:841] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 05-04 23:12:11 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1) WARNING 05-04 23:12:11 [vllm.py:1403] max_num_scheduled_tokens is set to 4128 based on the speculative decoding settings. This may lead to suboptimal performance. Consider increasing max_num_batched_tokens to accommodate the additional draft token slots, or decrease num_speculative_tokens or max_num_seqs.
(APIServer pid=1) INFO 05-04 23:12:11 [vllm.py:1563] [Genesis P66] Filtered cudagraph_capture_sizes for spec-decode uniform_query_len=4: removed 2 non-divisible sizes [1, 2]; kept [4, 8]. Prevents mixed-q_len capture (vllm#28015 mechanism).
(APIServer pid=1) INFO 05-04 23:12:12 [compilation.py:303] Enabled custom fusions: act_quant
(APIServer pid=1) [transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(APIServer pid=1) INFO 05-04 23:12:13 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
INFO 05-04 23:12:17 [nixl_utils.py:32] NIXL is available
(EngineCore pid=84) INFO 05-04 23:12:17 [core.py:109] Initializing a V1 LLM engine (v0.20.1rc1.dev16+g7a1eb8ac2) with config: model='/root/.cache/huggingface/qwen3.6-27b-nvfp4', speculative_config=SpeculativeConfig(method='mtp', model='/root/.cache/huggingface/qwen3.6-27b-nvfp4', num_spec_tokens=3), tokenizer='/root/.cache/huggingface/qwen3.6-27b-nvfp4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=96000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen3.6-27b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [4128], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [4, 8], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 8, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=84) [transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(EngineCore pid=84) INFO 05-04 23:12:18 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=84) INFO 05-04 23:12:18 [parallel_state.py:1402] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.1.96.6:60099 backend=nccl
(EngineCore pid=84) INFO 05-04 23:12:18 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=84) INFO 05-04 23:12:19 [topk_topp_sampler.py:51] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=84) WARNING 05-04 23:12:19 [__init__.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(EngineCore pid=84) INFO 05-04 23:12:19 [gpu_model_runner.py:4778] Starting to load model /root/.cache/huggingface/qwen3.6-27b-nvfp4...
(EngineCore pid=84) INFO 05-04 23:12:19 [cuda.py:423] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=84) INFO 05-04 23:12:19 [__init__.py:683] Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
(EngineCore pid=84) INFO 05-04 23:12:19 [mm_encoder_attention.py:372] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=84) INFO 05-04 23:12:19 [gdn_linear_attn.py:155] Using Triton/FLA GDN prefill kernel
(EngineCore pid=84) INFO 05-04 23:12:19 [cuda.py:368] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN'].
(EngineCore pid=84) INFO 05-04 23:12:20 [weight_utils.py:904] Filesystem type for checkpoints: ZFS. Checkpoint size: 26.59 GiB. Available RAM: 51.14 GiB.
(EngineCore pid=84) INFO 05-04 23:12:20 [weight_utils.py:927] Auto-prefetch is disabled because the filesystem (ZFS) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards:   0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/15 [00:01<00:24,  1.74s/it]
Loading safetensors checkpoint shards:  13% Completed | 2/15 [00:02<00:18,  1.40s/it]
Loading safetensors checkpoint shards:  20% Completed | 3/15 [00:04<00:18,  1.57s/it]
Loading safetensors checkpoint shards:  27% Completed | 4/15 [00:06<00:18,  1.65s/it]
Loading safetensors checkpoint shards:  33% Completed | 5/15 [00:06<00:12,  1.25s/it]
Loading safetensors checkpoint shards:  40% Completed | 6/15 [00:07<00:08,  1.11it/s]
Loading safetensors checkpoint shards:  47% Completed | 7/15 [00:07<00:05,  1.48it/s]
Loading safetensors checkpoint shards:  53% Completed | 8/15 [00:07<00:03,  1.89it/s]
Loading safetensors checkpoint shards:  60% Completed | 9/15 [00:07<00:02,  2.33it/s]
Loading safetensors checkpoint shards:  67% Completed | 10/15 [00:08<00:01,  2.77it/s]
Loading safetensors checkpoint shards:  80% Completed | 12/15 [00:09<00:01,  1.68it/s]
Loading safetensors checkpoint shards:  93% Completed | 14/15 [00:11<00:00,  1.44it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:11<00:00,  1.31it/s]
(EngineCore pid=84)
(EngineCore pid=84) INFO 05-04 23:12:31 [default_loader.py:384] Loading weights took 11.48 seconds
(EngineCore pid=84) INFO 05-04 23:12:31 [gpu_model_runner.py:4802] Loading drafter model...
(EngineCore pid=84) INFO 05-04 23:12:31 [vllm.py:841] Asynchronous scheduling is enabled.
(EngineCore pid=84) INFO 05-04 23:12:31 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(EngineCore pid=84) WARNING 05-04 23:12:31 [vllm.py:1403] max_num_scheduled_tokens is set to 4128 based on the speculative decoding settings. This may lead to suboptimal performance. Consider increasing max_num_batched_tokens to accommodate the additional draft token slots, or decrease num_speculative_tokens or max_num_seqs.
(EngineCore pid=84) INFO 05-04 23:12:31 [compilation.py:303] Enabled custom fusions: act_quant
(EngineCore pid=84) INFO 05-04 23:12:31 [weight_utils.py:904] Filesystem type for checkpoints: ZFS. Checkpoint size: 26.59 GiB. Available RAM: 51.15 GiB.
Loading safetensors checkpoint shards:   0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  80% Completed | 12/15 [00:00<00:00, 42.50it/s]
(EngineCore pid=84) WARNING 05-04 23:12:32 [qwen3_5_mtp.py:325] Parameter fc.weight not found in params_dict, skip loading
(EngineCore pid=84) WARNING 05-04 23:12:32 [qwen3_5_mtp.py:325] Parameter layers.0.mlp.down_proj.weight not found in params_dict, skip loading
(EngineCore pid=84) WARNING 05-04 23:12:32 [qwen3_5_mtp.py:325] Parameter layers.0.mlp.gate_gate_up_proj.weight not found in params_dict, skip loading
(EngineCore pid=84) WARNING 05-04 23:12:32 [qwen3_5_mtp.py:325] Parameter layers.0.mlp.gate_up_proj.weight not found in params_dict, skip loading
(EngineCore pid=84) WARNING 05-04 23:12:32 [qwen3_5_mtp.py:325] Parameter layers.0.self_attn.qkqkv_proj.weight not found in params_dict, skip loading
(EngineCore pid=84) WARNING 05-04 23:12:32 [qwen3_5_mtp.py:325] Parameter layers.0.self_attn.o_proj.weight not found in params_dict, skip loading
(EngineCore pid=84) WARNING 05-04 23:12:32 [qwen3_5_mtp.py:325] Parameter layers.0.self_attn.qkv_proj.weight not found in params_dict, skip loading
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 26.93it/s]
(EngineCore pid=84)
(EngineCore pid=84) INFO 05-04 23:12:32 [default_loader.py:384] Loading weights took 0.56 seconds
(EngineCore pid=84) WARNING 05-04 23:12:32 [compressed_tensors_w4a4_nvfp4.py:97] In NVFP4 linear, the global scale for input or weight are different for parallel layers (e.g. q_proj, k_proj, v_proj). This  will likely result in reduced accuracy. Please verify the model accuracy. Consider using a checkpoint with a shared global NVFP4 scale for fused layers.
(EngineCore pid=84) INFO 05-04 23:12:32 [llm_base_proposer.py:1460] Detected MTP model. Sharing target model embedding weights with the draft model.
(EngineCore pid=84) INFO 05-04 23:12:32 [llm_base_proposer.py:1516] Detected MTP model. Sharing target model lm_head weights with the draft model.
(EngineCore pid=84) INFO 05-04 23:12:32 [gpu_model_runner.py:4880] Model loading took 25.29 GiB memory and 12.675088 seconds
(EngineCore pid=84) INFO 05-04 23:12:32 [interface.py:639] Setting attention block size to 1600 tokens to ensure that attention page size is >= mamba page size.
(EngineCore pid=84) INFO 05-04 23:12:32 [interface.py:663] Padding mamba page size by 0.25% to ensure that mamba page size and attention page size are exactly equal.
(EngineCore pid=84) INFO 05-04 23:12:41 [backends.py:1069] Using cache directory: /root/.cache/vllm/torch_compile_cache/23534d80b0/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=84) INFO 05-04 23:12:41 [backends.py:1128] Dynamo bytecode transform time: 8.58 s
(EngineCore pid=84) INFO 05-04 23:12:43 [backends.py:376] Cache the graph of compile range (1, 4128) for later use
(EngineCore pid=84) INFO 05-04 23:13:05 [backends.py:391] Compiling a graph for compile range (1, 4128) takes 23.48 s
(EngineCore pid=84) INFO 05-04 23:13:09 [decorators.py:668] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/b260e8928b29bd4f0790e59ebf73d651eec7ccb9104d6f918d0430187a926260/rank_0_0/model
(EngineCore pid=84) INFO 05-04 23:13:09 [monitor.py:53] torch.compile took 36.98 s in total
(EngineCore pid=84) INFO 05-04 23:13:47 [monitor.py:81] Initial profiling/warmup run took 37.94 s
(EngineCore pid=84) INFO 05-04 23:13:48 [backends.py:1069] Using cache directory: /root/.cache/vllm/torch_compile_cache/23534d80b0/rank_0_0/eagle_head for vLLM's torch.compile
(EngineCore pid=84) INFO 05-04 23:13:48 [backends.py:1128] Dynamo bytecode transform time: 0.47 s
(EngineCore pid=84) INFO 05-04 23:13:56 [backends.py:391] Compiling a graph for compile range (1, 4128) takes 8.21 s
(EngineCore pid=84) INFO 05-04 23:13:56 [decorators.py:668] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/d209b58e78088a3bcdc0cacb8f16ba4f7989bfb013699416110bf4059d23cc2b/rank_0_0/model
(EngineCore pid=84) INFO 05-04 23:13:56 [monitor.py:53] torch.compile took 8.98 s in total
(EngineCore pid=84) INFO 05-04 23:13:57 [monitor.py:81] Initial profiling/warmup run took 0.58 s
(EngineCore pid=84) WARNING 05-04 23:13:57 [kv_cache_utils.py:1181] Add 3 padding layers, may waste at most 6.25% KV cache memory
(EngineCore pid=84) WARNING 05-04 23:13:57 [compilation.py:1390] CUDAGraphMode.FULL_AND_PIECEWISE is not supported with spec-decode for attention backend FlashInferBackend (support: AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE); setting cudagraph_mode=PIECEWISE
(EngineCore pid=84) INFO 05-04 23:13:57 [gpu_model_runner.py:5983] Profiling CUDA graph memory: PIECEWISE=2 (largest=8)
(EngineCore pid=84) INFO 05-04 23:13:59 [gpu_model_runner.py:6062] Estimated CUDA graph memory: 0.04 GiB total
(EngineCore pid=84) INFO 05-04 23:13:59 [gpu_worker.py:440] Available KV cache memory: 0.67 GiB
(EngineCore pid=84) INFO 05-04 23:13:59 [gpu_worker.py:455] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.9200 is equivalent to --gpu-memory-utilization=0.9186 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9214. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(EngineCore pid=84) WARNING 05-04 23:13:59 [kv_cache_utils.py:1181] Add 3 padding layers, may waste at most 6.25% KV cache memory
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136] EngineCore failed to start.
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136] Traceback (most recent call last):
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1110, in run_engine_core
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 876, in __init__
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]     super().__init__(
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]     return func(*args, **kwargs)
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 261, in _initialize_kv_caches
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]     kv_cache_configs = get_kv_cache_configs(
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]                        ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 2065, in get_kv_cache_configs
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]     _check_enough_kv_cache_memory(
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 717, in _check_enough_kv_cache_memory
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136]     raise ValueError(
(EngineCore pid=84) ERROR 05-04 23:13:59 [core.py:1136] ValueError: To serve at least one request with the models's max seq len (96000), (3.89 GiB KV cache is needed, which is larger than the available KV cache memory (0.67 GiB). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.
(EngineCore pid=84) Process EngineCore:
(EngineCore pid=84) Traceback (most recent call last):
(EngineCore pid=84)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=84)     self.run()
(EngineCore pid=84)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=84)     self._target(*self._args, **self._kwargs)
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1140, in run_engine_core
(EngineCore pid=84)     raise e
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1110, in run_engine_core
(EngineCore pid=84)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=84)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=84)     return func(*args, **kwargs)
(EngineCore pid=84)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 876, in __init__
(EngineCore pid=84)     super().__init__(
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=84)     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=84)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=84)     return func(*args, **kwargs)
(EngineCore pid=84)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 261, in _initialize_kv_caches
(EngineCore pid=84)     kv_cache_configs = get_kv_cache_configs(
(EngineCore pid=84)                        ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 2065, in get_kv_cache_configs
(EngineCore pid=84)     _check_enough_kv_cache_memory(
(EngineCore pid=84)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 717, in _check_enough_kv_cache_memory
(EngineCore pid=84)     raise ValueError(
(EngineCore pid=84) ValueError: To serve at least one request with the models's max seq len (96000), (3.89 GiB KV cache is needed, which is larger than the available KV cache memory (0.67 GiB). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.
[rank0]:[W504 23:14:00.490662589 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 92, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 678, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 692, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=1)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 217, in from_vllm_config
(APIServer pid=1)     return cls(
(APIServer pid=1)            ^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 146, in __init__
(APIServer pid=1)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=1)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=1)     return AsyncMPClient(*client_args)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=1)     return func(*args, **kwargs)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 900, in __init__
(APIServer pid=1)     super().__init__(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=1)     with launch_core_engines(
(APIServer pid=1)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=1)     next(self.gen)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1119, in launch_core_engines
(APIServer pid=1)     wait_for_engine_startup(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1178, in wait_for_engine_startup
(APIServer pid=1)     raise RuntimeError(
(APIServer pid=1) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

I tried lowering the 96000 max-model-len back down to 48000, but it still failed with similar error.

(EngineCore pid=84) ERROR 05-04 23:19:02 [core.py:1136] ValueError: To serve at least one request with the models's max seq len (48000), (2.33 GiB KV cache is needed, which is larger than the available KV cache memory (0.67 GiB). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.

Any thoughts on where I can find a bit more memory?

1 reply

noonghunna May 4, 2026
Maintainer

@apnar — short answer: the OOM isn't where the easy memory is; the variant you picked is doing what its name says and it's structurally tight on a 32 GB card. Three concrete tweaks below that should boot it, plus an honest take on whether it's worth chasing.

What the boot log shows

[weight_utils] Checkpoint size: 26.59 GiB        ← what's on disk
[gpu_model_runner] Model loading took 25.29 GiB memory and 12.7s
[gpu_model_runner] Estimated CUDA graph memory: 0.04 GiB total
[gpu_worker]      Available KV cache memory: 0.67 GiB

That 25.29 GiB load is expected, not a vLLM bug. The quant is kaitchup/Qwen3.6-27B-autoround-nvfp4-linearattn-BF16 — read the suffix: NVFP4 is applied only to the standard-attention layers; the DeltaNet linear-attention layers stay in BF16. On Qwen3.6-27B that's ~75% of the layers in BF16 + ~25% in FP4 + BF16 MTP head + vision tower. ~25 GB resident is the right shape for that mix.

So at --gpu-memory-utilization 0.92 on a 32 GB card you have ~29.4 GB usable; 25.29 GB model + cudagraph + non-torch working buffers leave only ~0.67 GB for KV — and a single 48K-token request needs 2.33 GB. That's the OOM.

Three tweaks (in order of how much memory they free)

1. Drop MTP — likely +1.0-1.5 GB

- --speculative-config
- 'null'

MTP draft weights live in GPU memory (BF16 head). On 35B-A3B-FP8 we measured ~1 GiB per GPU; on 27B with BF16 head it's a similar order of magnitude. You lose the +30-40% TPS that MTP provides, but you regain the headroom that lets the model boot at all. This is the highest-yield single change.

2. Raise gpu-memory-utilization — likely +0.5-1 GB

- --gpu-memory-utilization
- '0.96'                  # from 0.92

The boot log explicitly told you the equivalent is 0.9214 with cudagraph profiling factored in, and 0.96 is well within RTX 5090 driver headroom. The risk is a brief peak-pressure activation hitting OOM; the reward is enough KV pool to actually fit a request.

3. Lower max-model-len to 16K-24K initially — proves boot works

- --max-model-len
- '16000'                 # from 96000

Once you confirm the model boots and serves one request, raise it as far as KV pool allows. Boot-log line Available KV cache memory: X GiB after the two tweaks above — that times ~24K tokens-per-GiB (linearattn-bf16 KV is roughly that on this model, fp8_e4m3) is your effective ctx ceiling.

Combined first attempt

- --max-model-len
- '24000'
- --gpu-memory-utilization
- '0.96'
- --speculative-config
- 'null'

If that boots: you've got a working NVFP4 baseline. Then experimentally raise max-model-len step-by-step.

The honest take — is NVFP4 worth chasing here?

Probably not as your daily driver, for these reasons:

(a) You don't get the Blackwell FP4 tensor core speedup you'd hope for. Native FP4 compute only kicks in on the layers that are FP4-quantized. On this variant 75% of the work is BF16 DeltaNet — running on regular Blackwell BF16 path, no FP4 win. So the speedup ceiling vs your current INT4 AutoRound is bounded by the 25% attention fraction, not the headline NVFP4 number.

(b) You're already getting most of the Blackwell win on the standard INT4 AutoRound config. Your earlier numbers (122 narr / 162 code TPS, 29.8/32 GB used) are ~1.6× the 3090 Ti numbers — which is roughly the memory-bandwidth ratio. That means you're near the bandwidth-bound ceiling already. There's not 1.5× more TPS to recover by switching to a partial-FP4 quant; there might be 1.05-1.15× at best, and you give up 12 GB of KV pool to get it.

(c) The "real" Blackwell-FP4 win is gated on a future quant that does FP4 on both attention paths. That requires either an NVFP4 DeltaNet kernel (doesn't exist in vLLM today) or a different model architecture without DeltaNet. Neither is something we can shortcut from the user side.

(d) NVFP4 + AL>1 spec-decode interactions on Genesis-patched vLLM aren't well-characterized yet. You're a sample size of one on this configuration; we're discovering things like "PN8 silent-empty interaction" only this week through stiggy's bisect on a different config (issue #43). NVFP4 surface area is even thinner.

What I'd suggest instead

Stick with the AutoRound INT4 config you've already validated (122/162 TPS at 60K verify-stress all green) as your daily driver. If you want to spend cycles experimenting, the higher-yield Blackwell exploration is:

Drop the Marlin RO-mount (Tweak A from my earlier reply) — could be +5-15% TPS on your INT4 config
Switch KV to fp8_e4m3 (Tweak B) — small accuracy/correctness win on Blackwell native FP8
Raise max-model-len to 96K on your INT4 config (Tweak C) — gives the first cross-rig 32 GB single-card Cliff 2 measurement

Those three are likely +10-20% TPS combined and unlock real new data points. NVFP4 with a partial-quant variant is a science experiment with a low ceiling on this hardware/model combo.

If you boot NVFP4 anyway, what we'd love to see

If you do get it booting with the three tweaks above:

Wall TPS narr/code via bench.sh (3 warmup + 5 measured)
Boot log line Model loading took X.XX GiB
Boot log line Available KV cache memory: X GiB
Whether MTP-disabled NVFP4 beats MTP-enabled INT4 on TPS (probably not, but useful confirmation)

That's the experiment that closes the question. Even a "no, INT4+MTP wins" result is real signal worth posting.

Side note: your boot log shows the Genesis platform detection picking up compute_capability: [12, 0] correctly via the Genesis GPU profile detected: NVIDIA GeForce RTX 5090 block, but the legacy is_blackwell: false boolean is wrong (filed as Sandermage/genesis-vllm-patches#20). It doesn't affect this OOM, but heads-up that some [OFF] patches in your log might actually have Blackwell-aware variants gated on a fixed is_blackwell flag. Worth a re-test once Sander's fix lands.

apnar · 2026-05-05T03:14:09Z

apnar
May 5, 2026
Author

Managed to get it running with the changes you suggested. Actually didn't need to drop MTP. Numbers don't look good though. Here is the bench output:


========== NARRATIVE (prompt=65 chars, max_tokens=1000) ==========
=== warmups (3) ===
  warm-1     wall= 31.78s  ttft=   836ms  toks=1000  wall_TPS= 31.47  decode_TPS= 32.32
  warm-2     wall= 31.01s  ttft=    99ms  toks=1000  wall_TPS= 32.25  decode_TPS= 32.35
  warm-3     wall= 31.09s  ttft=    99ms  toks=1000  wall_TPS= 32.16  decode_TPS= 32.27

=== measured (5) ===
  run-1      wall= 31.03s  ttft=   100ms  toks=1000  wall_TPS= 32.22  decode_TPS= 32.33
  run-2      wall= 31.09s  ttft=   103ms  toks=1000  wall_TPS= 32.17  decode_TPS= 32.28
  run-3      wall= 31.06s  ttft=   100ms  toks=1000  wall_TPS= 32.20  decode_TPS= 32.30
  run-4      wall= 31.07s  ttft=   102ms  toks=1000  wall_TPS= 32.18  decode_TPS= 32.29
  run-5      wall= 31.05s  ttft=   100ms  toks=1000  wall_TPS= 32.20  decode_TPS= 32.31

=== summary [narrative] (n=5) ===
  wall_TPS       mean=  32.19   std=  0.02   CV= 0.1%   min=32.17   max=32.22
  decode_TPS     mean=  32.30   std=  0.02   CV= 0.1%   min=32.28   max=32.33
  TTFT          mean=   101ms  std=    1ms  min=100ms  max=103ms

========== CODE (prompt=78 chars, max_tokens=800) ==========
=== warmups (3) ===
  warm-1     wall= 21.62s  ttft=   100ms  toks= 695  wall_TPS= 32.15  decode_TPS= 32.30
  warm-2     wall= 12.13s  ttft=    99ms  toks= 389  wall_TPS= 32.07  decode_TPS= 32.34
  warm-3     wall= 17.59s  ttft=    99ms  toks= 565  wall_TPS= 32.13  decode_TPS= 32.31

=== measured (5) ===
  run-1      wall= 11.00s  ttft=    99ms  toks= 353  wall_TPS= 32.10  decode_TPS= 32.39
  run-2      wall= 18.32s  ttft=   100ms  toks= 589  wall_TPS= 32.15  decode_TPS= 32.32
  run-3      wall= 10.73s  ttft=    99ms  toks= 344  wall_TPS= 32.07  decode_TPS= 32.37
  run-4      wall= 22.96s  ttft=   102ms  toks= 738  wall_TPS= 32.14  decode_TPS= 32.28
  run-5      wall= 19.14s  ttft=    99ms  toks= 615  wall_TPS= 32.13  decode_TPS= 32.29

=== summary [code] (n=5) ===
  wall_TPS       mean=  32.12   std=  0.03   CV= 0.1%   min=32.07   max=32.15
  decode_TPS     mean=  32.33   std=  0.05   CV= 0.1%   min=32.28   max=32.39
  TTFT          mean=   100ms  std=    2ms  min=99ms  max=102ms

=== GPU state ===
0, 94 %, 29660 MiB, 32607 MiB, 410.86 W, 60

These are the command line options I used:


    - --model
    - /root/.cache/huggingface/qwen3.6-27b-nvfp4
    - --language-model-only
    - --served-model-name
    - qwen3.6-27b-autoround
    - --quantization
    - compressed-tensors
    - --dtype
    - bfloat16
    - --tensor-parallel-size
    - '1'
    - --max-model-len
    - '24000'
    - --gpu-memory-utilization
    - '0.96'
    - --max-num-seqs
    - '1'
    - --max-num-batched-tokens
    - '4128'
    - --kv-cache-dtype
    - fp8_e4m3
    - --trust-remote-code
    - --reasoning-parser
    - qwen3
    - --enable-auto-tool-choice
    - --tool-call-parser
    - qwen3_coder
    - --enable-prefix-caching
    - --enable-chunked-prefill
    - --no-scheduler-reserve-full-isl
    - --speculative-config
    - '{"method":"mtp","num_speculative_tokens":3}'

Here are the logs from that run:

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
[INFO:genesis.apply_all] Genesis platform: {"vendor": {"is_nvidia_cuda": true, "is_amd_rocm": false, "is_intel_xpu": false, "is_cpu_only": false}, "nvidia": {"compute_capability": [12, 0], "is_ampere_datacenter": false, "is_ampere_consumer": false, "is_ada_lovelace": false, "is_hopper": false, "is_blackwell": false, "has_native_fp8": true}, "amd": {}, "versions": {"torch": [2, 11], "transformers": [5, 6, 2], "vllm": [0, 20, 1], "flash_attn_major": 2}, "paths": {"vllm_install_root": "/usr/local/lib/python3.12/dist-packages/vllm"}}
[INFO:genesis.apply_all] Genesis Unified Patch v7.0 — Ampere FP8 + TQ + MoE + Hybrid + bugfixes. Philosophy: МЫ ЧИНИМ, НЕ ЛОМАЕМ.
[INFO:genesis.apply_all] [Genesis registry] 100 dispatcher entries — schema-clean, dependency graph consistent.
[INFO:genesis.apply_all] ==============================================================================
[INFO:genesis.apply_all] [Genesis GPU profile] detected: NVIDIA GeForce RTX 5090
[INFO:genesis.apply_all]   canonical: RTX 5090  cc: (12, 0)  SM: 170  L2: 88 MB  BW: 1792 GB/s  regime: compute
[INFO:genesis.apply_all]
[INFO:genesis.apply_all] Per-patch recommendations:
[INFO:genesis.apply_all] ------------------------------------------------------------------------------
[INFO:genesis.apply_all]   [REC] P40                TQ k8v4 GQA grouping kernel (vllm#40792)
[INFO:genesis.apply_all]           gain: +5-15% (mixed regime), +15-30% (compute regime)
[INFO:genesis.apply_all]           → recommend: export GENESIS_ENABLE_P40=1
[INFO:genesis.apply_all]           why: Author measured +27% on H100. Empirically NS on A5000 (p=0.28). Cache-locality benefit needs L2 >= 24 MB.
[INFO:genesis.apply_all]   [ON ] P67                Multi-query verify kernel for spec-decode K+1
[INFO:genesis.apply_all]           gain: +25-35%
[INFO:genesis.apply_all]           why: +32% TPS on 35B-A3B-FP8 spec-decode K=3 verify (Genesis internal benchmark, all GPU classes tested).
[INFO:genesis.apply_all]   [REC] P82                SGLang-style acceptance threshold OR-clause
[INFO:genesis.apply_all]           gain: +8-12%
[INFO:genesis.apply_all]           → recommend: export GENESIS_ENABLE_P82=1
[INFO:genesis.apply_all]           why: Cross-rig confirmed: +12% on A5000 FP8, +10.5% on 3090 INT4.
[INFO:genesis.apply_all]   [REC] PN8                MTP/draft online-quant propagation (vllm#40849)
[INFO:genesis.apply_all]           gain: 0% TPS, but ~1-2 GiB total VRAM headroom
[INFO:genesis.apply_all]           → recommend: export GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1
[INFO:genesis.apply_all]           why: Verified ~1 GiB VRAM saved per GPU on 35B-A3B-FP8 + MTP K=3. Use freed VRAM for higher gpu-mem-util or longer ctx.
[INFO:genesis.apply_all]   [OFF] P83+P84+P85        Prefix-cache cake-and-eat patches (vllm#38182)
[INFO:genesis.apply_all]           gain: (currently negative)
[INFO:genesis.apply_all]           (correctly skipped on this regime)
[INFO:genesis.apply_all]           why: Tested 4-arm A/B 2026-04-29: -29% TPS regression even with full stack including root-cause P84. Possible vllm cache machinery fixed-overhead per-step. WAIT for v0.20.2 pin bump.
[INFO:genesis.apply_all] ==============================================================================
[INFO:genesis.wiring.p8_kv_hybrid_reporting] [P8] using V2 import anchor (post-MambaSpec layout)
[INFO:genesis.wiring.text_patch] [P8 KV hybrid reporting (kv_cache_utils)] applied 3 sub-patches: p8_kv_imports, p8_kv_helper_injection, p8_kv_callsite
[INFO:genesis.wiring.text_patch] [P8 KV hybrid reporting (scheduler)] applied 2 sub-patches: p8_sched_import, p8_sched_callsite
[INFO:genesis.apply_all] [Genesis] applied: P8 KV hybrid reporting (per-token capacity) — kv_cache_utils=applied(ok), scheduler=applied(ok)
[INFO:genesis.wiring.text_patch] [P3 TurboQuant BF16->FP8 cast (Ampere fix)] applied 1 sub-patches: p3_bf16_fp8_cast
[INFO:genesis.apply_all] [Genesis] applied: P3 TurboQuant BF16->FP8 cast (Ampere fix) — BF16->FP8 cast guard inserted
[INFO:genesis.wiring.text_patch] [P6 TQ-aware block size alignment] applied 2 sub-patches: p6_import_tqspec, p6_tq_branch
[INFO:genesis.apply_all] [Genesis] applied: P6 TurboQuant-aware attention page size — TQ-aware page-size branch inserted
[INFO:genesis.wiring.text_patch] [P15 Qwen3 None/null tool arg] applied 1 sub-patches: p15_none_null
[INFO:genesis.apply_all] [Genesis] applied: P15 Qwen3 None/null tool arg parser — None/none mapping added to tool param parser
[INFO:genesis.wiring.text_patch] [P12 Qwen3 <tool_call> implicit reasoning end] upstream marker '_tool_call_token_id' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] skipped: P12 Qwen3 <tool_call> implicit reasoning end — upstream_merged
[INFO:genesis.wiring.text_patch] [P27 Qwen3 BEFORE-THINK fallback/p27_nonstream_return_baseline] anchor not found — soft skip (sibling patches continue)
[INFO:genesis.wiring.text_patch] [P27 Qwen3 BEFORE-THINK fallback] applied 3 sub-patches: p27_nonstream_capture, p27_nonstream_return_pr35687, p27_stream_start
[INFO:genesis.apply_all] [Genesis] applied: P27 Qwen3 BEFORE-THINK fallback — BEFORE-THINK fallback wired (non-stream + stream)
[INFO:genesis.wiring.text_patch] [P34 Mamba zero-collapse deadlock guard] applied 1 sub-patches: p34_deadlock_guard
[INFO:genesis.apply_all] [Genesis] applied: P34 Mamba zero-collapse deadlock guard — zero-collapse deadlock guard inserted (fixes #40707 for hybrid Mamba + multimodal)
[INFO:genesis.apply_all] [Genesis] applied: P29 tool parser IndexError guard — upstream already contains bounded-index guards (no-op)
[INFO:genesis.marlin_fp32_reduce] [Genesis P23] Marlin FP32_REDUCE: disabled=False on SM=(12, 0) (auto-from-platform)
[INFO:genesis.apply_all] [Genesis] applied: P23 Marlin FP32_REDUCE env override — decision: fp32_reduce disabled=False (requires upstream wire into Marlin launcher to take effect)
[INFO:genesis.wiring.text_patch] [P4 TurboQuant hybrid model support] applied 2 sub-patches: p4_helper_fn, p4_tq_block
[INFO:genesis.apply_all] [Genesis] applied: P4 TurboQuant hybrid model support — text-patch succeeded
[INFO:genesis.wiring.text_patch] [P5 KV cache page size unification (v1_lcm_pad_max)/p5_import_math] anchor not found — soft skip (sibling patches continue)
[INFO:genesis.wiring.text_patch] [P5 KV cache page size unification (v1_lcm_pad_max)] applied 1 sub-patches: p5_v1_lcm_pad_max_from_baseline
[INFO:genesis.apply_all] [Genesis] applied: P5 KV cache page size unification — text-patch v2 succeeded (pad-smaller-to-max)
[INFO:genesis.apply_all] [Genesis] skipped: P5b KV page-size pad-smaller-to-max (env-opt-in) — opt-in: set GENESIS_ENABLE_P5B=1 to enable pad-smaller-to-max KV page-size strategy (saves ~34% per-block VRAM on Qwen3.6 hybrid vs P5 v1 LCM-pad-up; BLAST-RADIUS is KV allocator → benchmark on VM 100 before enabling in prod)
INFO 05-05 02:55:56 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
INFO 05-05 02:55:57 [nixl_utils.py:32] NIXL is available
[INFO:genesis.wiring.p31_router_softmax] [Genesis P31] wrapped grouped_topk_router.grouped_topk (fp32 upcast active for grouped-MoE models)
[INFO:genesis.apply_all] [Genesis] applied: P31 MoE router fp32 softmax — grouped_topk wrapped (effective in this process)
[INFO:genesis.wiring.p22_tq_prealloc] [P22] PR #40655 merged (bhoomit moved init out) — auto-skip — patch retired in favor of upstream
[INFO:genesis.apply_all] [Genesis] skipped: P22 TurboQuant shared dequant prealloc — PR #40655 merged (bhoomit moved init out) — auto-skip
[INFO:genesis.wiring.text_patch] [P26 TQ prefill output prealloc] upstream marker 'if not hasattr(self, "_cu_2")' detected — patch obsolete, skip
[INFO:genesis.apply_all] [Genesis] skipped: P26 TurboQuant prefill output prealloc — upstream_merged
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P61b — Qwen3 streaming partial-tag overlap guard | opt-in only — set GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P61b Qwen3 streaming partial-tag overlap guard — opt-in only — set GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P62 — Structured-output spec-decode reasoning-end timing fix | opt-in only — set GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P62 structured-output spec-decode timing fix — opt-in only — set GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P61 — Qwen3 multi-tool first-occurrence (DEPRECATED — superseded by P12 v2) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P61_QWEN3_MULTI_TOOL=1 only for diagnostics
[INFO:genesis.apply_all] [Genesis] skipped: P61 Qwen3 multi-tool first-occurrence — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P61_QWEN3_MULTI_TOOL=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P60b — GDN+ngram Triton kernel offset (Phase 2) | opt-in only — set GENESIS_ENABLE_P60B_TRITON_KERNEL=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P60b GDN+ngram Triton kernel offset — opt-in only — set GENESIS_ENABLE_P60B_TRITON_KERNEL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P60 — GDN+ngram state recovery (Phase 1: SSM pre-copy) | opt-in only — set GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P60 GDN+ngram state recovery — opt-in only — set GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P63 — MTP/Eagle drafter GDN state recovery (deprecated — wrong layer) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P63_MTP_GDN_STATE_RECOVERY=1 only for diagnost
[INFO:genesis.apply_all] [Genesis] skipped: P63 MTP/Eagle drafter GDN state recovery — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P63_MTP_GDN_STATE_RECOVERY=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P64 — qwen3coder MTP streaming early-return fix | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P64 qwen3coder_tool_parser.py — MTP streaming early-return removal] applied 2 sub-patches: p64_remove_early_return, p64_unify_emit_at_fnend
[INFO:genesis.wiring.text_patch] [P64 serving.py — MTP safety-net + Pydantic null fix] applied 2 sub-patches: p64_safety_net_widen, p64_callsite_guard
[INFO:genesis.apply_all] [Genesis] applied: P64 qwen3coder MTP streaming early-return fix — P64 applied: 2 files modified, 0 idempotent. qwen3coder streaming parser no longer drops parameters when MTP bundles last param + </function> in same delta. Safety net widened to fire on finish_reason alone.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P65 — TurboQuant spec-decode cudagraph downgrade | opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P65 TurboQuant spec-decode cudagraph downgrade — opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWNGRADE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P66 — cudagraph_capture_sizes spec-decode divisibility filter | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P66 config/vllm.py — cudagraph_capture_sizes spec-decode filter] applied 1 sub-patches: p66_size_filter
[INFO:genesis.apply_all] [Genesis] applied: P66 cudagraph_capture_sizes spec-decode divisibility filter — P66 applied: cudagraph_capture_sizes will be filtered to sizes divisible by uniform_decode_query_len when spec-decode is active. Boot 2-4x faster, less peak GPU memory, no mixed-q_len capture risks.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P68 — Auto force tool_choice=required for long-context tool calls | opt-in only — set GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P69 — Long-context tool-format reminder injection | opt-in only — set GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P68/P69 long-context tool-call adherence — neither P68 nor P69 enabled; hook injection skipped to keep serving.py pristine. P68: opt-in only — set GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to engage | P69: opt-in only — set GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P70 — Auto-strict-ngram (force prompt_lookup_min>=8) | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P70 Auto-strict-ngram (force prompt_lookup_min>=8) — opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to engage
[INFO:genesis.wiring.p67_tq_multi_query_kernel] [Genesis P67 H2] baked env at module load: MAX_PRIOR_LEN=4096 DEBUG_COMPARE=False (no per-dispatch env reads)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P67 — TurboQuant multi-query kernel for spec-decode K+1 | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P67 turboquant_attn.py — multi-query kernel hook] applied 1 sub-patches: p67_kernel_hook
[INFO:genesis.apply_all] [Genesis] applied: P67 TurboQuant multi-query kernel for spec-decode K+1 — P67 hook injected at top of _prefill_attention. Multi-query continuation prefill batches (spec-verify K+1 with prior cached KV) will route to Genesis Triton kernel when GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1. Diag: {'env_enabled': True, 'version': 'v7.39_aggressive_tune', 'sm': '12.0', 'fp8_mode': 'e4nv', 'autoconfig': {'BLOCK_KV': 32, 'num_warps': 8, 'num_stages': 3}, 'kernel_built': True}
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P71 — Block-verify rejection sampler (Sun 2024 ICLR) | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P71 Block-verify rejection sampler (vllm#40819 + gemini bug-fixes) — opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P78 — TurboQuant .tolist() capture-guard (adapted from noonghunna) | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P78 TurboQuant .tolist() capture-guard (adapted from noonghunna) — opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P77 — Adaptive ngram K controller (EMA + hysteresis + auto-disable) | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P77 Adaptive ngram K controller (EMA + hysteresis + auto-disable) — opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P79b — Async × spec-decode proposer-sync backport (vllm#40610) | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P79b Async × spec-decode proposer-sync backport (vllm#40610) — opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P79c — Stale spec_token_ids cleanup for unscheduled requests (vllm#37629) | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P79c Stale spec_token_ids cleanup for unscheduled requests (vllm#37629) — opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEANUP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P81 — fp8 block-scaled MM low-M decode tuning (vllm#40925) | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P81 fp8 block-scaled MM low-M decode tuning (vllm#40925) — opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P82 — SGLang threshold_single OR-clause acceptance (BIASED — opt-in research) | opt-in only — set GENESIS_ENABLE_P82=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P82 SGLang threshold_single OR-clause acceptance (BIASED — opt-in research) — opt-in only — set GENESIS_ENABLE_P82=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P83 — MTP keep-last-cached-block (vllm#38182 downstream symptom — P84 is real fix) | opt-in only — set GENESIS_ENABLE_P83=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P83 MTP keep-last-cached-block (vllm#38182 mitigation) — opt-in only — set GENESIS_ENABLE_P83=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P84 — hash_block_size override (vllm#38182 actual root cause) | opt-in only — set GENESIS_ENABLE_P84=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P84 hash_block_size override (vllm#38182 ACTUAL root cause) — opt-in only — set GENESIS_ENABLE_P84=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P100 — FlashInfer FULL CUDA graph for spec-decode (vllm#41127) | opt-in only — set GENESIS_ENABLE_P100=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P100 FlashInfer FULL CUDA graph for spec-decode (vllm#41127) — opt-in only — set GENESIS_ENABLE_P100=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P103 FLA Cliff 2 chunked fwd_h+fwd_o orchestrator — GENESIS_ENABLE_P103 not set or platform not NVIDIA SM 8.0+
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P101 — TQ continuation 64-token slicing (vllm#41123 SELECTIVE) | opt-in only — set GENESIS_ENABLE_P101=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P101 TQ continuation 64-token slicing (vllm#41123 selective) — opt-in only — set GENESIS_ENABLE_P101=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P99 — WorkspaceManager.get_simultaneous memoization (perf hotfix) | opt-in only — set GENESIS_ENABLE_P99=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P99 WorkspaceManager memoize get_simultaneous (perf hotfix) — opt-in only — set GENESIS_ENABLE_P99=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P98 — TQ WorkspaceManager revert (vllm#40941 perf hotfix) | opt-in only — set GENESIS_ENABLE_P98=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P98 TQ WorkspaceManager revert (vllm#40941 perf hotfix) — opt-in only — set GENESIS_ENABLE_P98=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P94 — Spec-decode prepare_next_token_ids_padded zero-alloc (vllm#41043) | opt-in only — set GENESIS_ENABLE_P94=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P94 Spec-decode prepare_next_token_ids_padded zero-alloc (vllm#41043 backport) — opt-in only — set GENESIS_ENABLE_P94=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P95 — Marlin TP cudagraph cap on Ampere (vllm#40385) | opt-in only — set GENESIS_ENABLE_P95=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P95 Marlin TP cudagraph cap on Ampere (vllm#40385 backport) — opt-in only — set GENESIS_ENABLE_P95=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P91 — AutoRound row-parallel group cdiv + start-idx fix (vllm#39460) | opt-in only — set GENESIS_ENABLE_P91=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P91 AutoRound row-parallel group cdiv + start-idx fix (vllm#39460 backport) — opt-in only — set GENESIS_ENABLE_P91=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P87 — Marlin W4A16/W8A16 sub-tile output dim pad-on-load (vllm#40361) | opt-in only — set GENESIS_ENABLE_P87=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P87 Marlin sub-tile output dim pad-on-load (vllm#40361 backport) — opt-in only — set GENESIS_ENABLE_P87=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN8 — MTP/draft online-quant propagation (vllm#40849) | opt-in only — set GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN8 MTP/draft online-quant propagation (vllm#40849 backport) — opt-in only — set GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN9 — Independent drafter attention backend (vllm#39930) | opt-in only — set GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATTN=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN9 independent drafter attention backend (vllm#39930 backport) — opt-in only — set GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATTN=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN11 — GDN a/b contiguity in fix_query_key_value_ordering (vllm#41142) | opt-in only — set GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN11 GDN a/b contiguity (vllm#41142 backport) — opt-in only — set GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P67c sparse-V integration into P67 split-M kernel — opt-in: set GENESIS_ENABLE_P67_SPARSE_V=1 to engage per-q_t sparse-V skip in P67 split-M kernel
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN34 — WorkspaceManager runtime lock relaxation (PN33 companion for runtime decode) | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN34 workspace lock runtime relaxation — opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY PN33 — Spec-decode warmup K-aware sizing (vllm#37521 extended to MTP/ngram) | config_detect: neutral:
[INFO:genesis.wiring.text_patch] [PN33 v1/worker/gpu_model_runner.py — spec-decode warmup K-aware sizing (vllm#37521 extended to MTP/ngram)] applied 1 sub-patches: pN33_warmup_k_draft_tokens
[INFO:genesis.apply_all] [Genesis] applied: PN33 spec-decode warmup K-aware (vllm#37521 extended) — PN33 applied: spec-decode warmup uses real num_speculative_tokens instead of dummy K=1. Closes (a) ampersandru mid-stream OOM via propose_draft_token_ids and (b) noonghunna workspace-lock AssertionError on TQ + MTP K=3 single-card. Disable via GENESIS_DISABLE_PN33_SPEC_DECODE_WARMUP_K=1 if warmup OOMs.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN32 — GDN _forward_core chunked-prefill v2 (Cliff 2 fix for single-24GB-GPU OOM) | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN32 GDN chunked-prefill (Cliff 2 fix) — opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN31 — FA varlen persistent out buffer (issue #15, sister to P38) | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_OUT=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN31 FA varlen persistent out (issue #15) — opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_OUT=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN30 — DS conv state layout + spec-decode AL>1 fix (issue #17) | opt-in only — set GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN30 DS conv state + spec-decode AL>1 (issue #17) — opt-in only — set GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN29 — GDN chunk_o scale-fold (vllm#41446 pattern (c)) | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN29 GDN chunk_o scale-fold (vllm#41446 pattern c) — opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN12 — FFN intermediate scratch pool — Cliff 1 fix on TQ3 path | opt-in only — set GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN12 FFN intermediate scratch pool (Cliff 1 fix) — opt-in only — set GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN28 — merge_attn_states NaN guard (vllm#39148 backport) | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN28 merge_attn_states NaN guard (vllm#39148 backport) — opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P15B — FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix) | opt-in only — set GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P15B FA varlen max_seqlen_k clamp on TQ path (Issue #15 fix) — opt-in only — set GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P38B — P38 compile-safe in-source hook (Issue #14 fix) | opt-in only — set GENESIS_ENABLE_P38B_COMPILE_SAFE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P38B P38 compile-safe in-source hook (Issue #14 fix) — opt-in only — set GENESIS_ENABLE_P38B_COMPILE_SAFE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN26b sparse-V tile-skip Genesis kernel (BLASST lambda=a/L for SM86) — opt-in: set GENESIS_ENABLE_PN26_SPARSE_V=1 to enable sparse-V tile-skip kernel (BLASST λ=a/L formula by default)
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN27 — Revert MoERunnerInterface PluggableLayer (vllm#41440) | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN27 revert MoERunnerInterface PluggableLayer (vllm#41440 backport) — opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN26 — TQ unified perf pack (centroids prebake + sparse V scaffold) | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN26 TQ unified perf pack (centroids prebake + sparse V scaffold) — opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN25 — SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile path) | opt-in only — set GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN25 SiluAndMul.forward_native opaque-op pool (Cliff 1 mech B compile-path) — opt-in only — set GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN13 — CUDAGraphWrapper gc.collect/empty_cache lambda arity (vllm#41235) | opt-in only — set GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARITY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN13 CUDAGraphWrapper lambda arity (vllm#41235 backport) — opt-in only — set GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARITY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN14 — TQ decode IOOB safe_page_idx clamp (vllm#40074) | opt-in only — set GENESIS_ENABLE_PN14_TQ_DECODE_OOB_CLAMP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN14 TQ decode IOOB safe_page_idx clamp (vllm#40074 backport) — opt-in only — set GENESIS_ENABLE_PN14_TQ_DECODE_OOB_CLAMP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN19 Scoped max_split_size_mb during model load (vllm#41268) — GENESIS_ENABLE_PN19_SCOPED_MAX_SPLIT not set; default OFF. Backport of vllm#41268 (MatthewBonanni, OPEN). PyTorch 2.10+ introduces load-time fragmentation; this patch sets max_split_size_mb=20 during model load, restores on exit. Estimated win: 200-500 MiB on H100 (per #41268 author); unverified on Ampere — measure before relying on it.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN23 — DFlash combine_hidden_states dtype cast (vllm#40334) | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN23 DFlash combine_hidden_states dtype cast (vllm#40334 backport) — opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN21 — DFlash SWA support partial backport (vllm#40898) | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN21 DFlash SWA support partial backport (vllm#40898 backport) — opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN22 — Local argmax for TP draft (vllm#39419 backport) | opt-in only — set GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN22 Local argmax for TP draft (vllm#39419 backport) — opt-in only — set GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN24 — DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN24 DFlash aux layer +1 indexing fix (vllm#40727 backport) — opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN17 FA2 softmax_lse runtime clamp (Issue #11 Cliff 1 mechanism A) — GENESIS_ENABLE_PN17_FA2_LSE_CLAMP not set; default OFF. Enable on long-text-no-vision configs to close Cliff 1 mechanism A (FA2 softmax_lse over-allocation at long ctx). Diagnosis credit: noonghunna, Genesis Issue #11.
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  PN16 — Lazy-reasoner request hook (per-request enable_thinking) | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: PN16 Lazy-reasoner request hook (per-request enable_thinking) — opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P86 — ngram batch_propose O(N*K) → O(N+K) direct-fill (vllm#40876) | opt-in only — set GENESIS_ENABLE_P86=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P86 ngram batch_propose O(N+K) direct-fill (vllm#40876 backport) — opt-in only — set GENESIS_ENABLE_P86=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P85 — Hybrid fine-shadow prefix cache (vllm#38182 followup, MambaManager fix) | opt-in only — set GENESIS_ENABLE_P85=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P85 Hybrid fine-shadow prefix cache (MambaManager fix for vllm#38182 followup) — opt-in only — set GENESIS_ENABLE_P85=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P75 — Auto-enable Suffix Decoding (Arctic Inference, vllm#25784) | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P75 Auto-enable Suffix Decoding (vllm#25784 Arctic Inference) — opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P74 — Auto chunk-clamp via long_prefill_token_threshold (P72 companion) | opt-in only — set GENESIS_ENABLE_P74_CHUNK_CLAMP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P74 Auto chunk-clamp via long_prefill_token_threshold (P72 companion) — opt-in only — set GENESIS_ENABLE_P74_CHUNK_CLAMP=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P72 — profile_run M cap (unblocks --max-num-batched-tokens>4096 on MoE) | opt-in only — set GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P72 profile_run M cap (unblocks --max-num-batched-tokens>4096 on MoE) — opt-in only — set GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 to engage
[INFO:genesis.wiring.p67b_spec_verify_routing] [Genesis P67b B2] baked env at module load: USE_UPSTREAM=True NUM_KV_SPLITS=(default = self.max_num_kv_splits) (no per-dispatch env reads)
[INFO:genesis.dispatcher] [Genesis Dispatcher] APPLY P67b — TurboQuant spec-verify forward() routing (FULL CG enable) | opt-in env (config: neutral)
[INFO:genesis.wiring.text_patch] [P67b turboquant_attn.py forward() spec-verify routing] applied 1 sub-patches: p67b_forward_spec_verify_branch
[INFO:genesis.apply_all] [Genesis] applied: P67b TurboQuant spec-verify forward() routing — P67b forward() spec-verify routing injected. K+1 batches now bypass _prefill_attention entirely (cudagraph-safe).
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P59 — Qwen3 reasoning embedded tool_call recovery | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P59 Qwen3 reasoning embedded tool_call recovery — opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P58 — Async-scheduler -1 placeholder fix | opt-in only — set GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1 to engage
[INFO:genesis.apply_all] [Genesis] skipped: P58 async-scheduler -1 placeholder fix — opt-in only — set GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1 to engage
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P57 — TQ spec-decode capture-safe buffers (deprecated — research artifact) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P57_SPEC_DECODE_CAPTURE_SAFE=1 only for diagno
[INFO:genesis.apply_all] [Genesis] skipped: P57 TQ spec-decode capture-safe buffers — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P57_SPEC_DECODE_CAPTURE_SAFE=1 only for diagnostics
[INFO:genesis.dispatcher] [Genesis Dispatcher] SKIP  P56 — TQ spec-decode safe-path guard (deprecated — superseded by P65) | opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P56_SPEC_DECODE_GUARD=1 only for diagnostics
[INFO:genesis.apply_all] [Genesis] skipped: P56 TQ spec-decode safe-path guard — opt-in only AND empirically deprecated — keeping skip; set GENESIS_ENABLE_P56_SPEC_DECODE_GUARD=1 only for diagnostics
[INFO:genesis.wiring.text_patch] [P44 TQ mixed-batch attn_out pool] applied 1 sub-patches: p44_mixed_attn_out_alloc
[INFO:genesis.apply_all] [Genesis] applied: P44 TQ mixed-batch attn_out pool — text-patch applied — mixed-batch attn_out routed through TurboQuantBufferManager pool (~80 MB zero-init eliminated per mixed-batch forward in multi-user serving)
[INFO:genesis.wiring.text_patch] [P46 GDN gating buffer pool] applied 2 sub-patches: p46_g_buffer, p46_beta_buffer
[INFO:genesis.apply_all] [Genesis] applied: P46 GDN gating buffer pool — text-patch applied — fused_gdn_gating now uses GdnGatingBufferManager pool (eliminates ~24k allocs/sec on Qwen3.6-35B-A3B decode)
[INFO:genesis.apply_all] [Genesis] skipped: P7b GDN dual-stream via torch.library.custom_op (opt-in) — opt-in: set GENESIS_ENABLE_P7B=1 to enable GDN dual-stream via torch.library.custom_op (graph-safe alternative to P7; validates numeric equiv + compile-cache freshness before prod)
[INFO:genesis.apply_all] [Genesis] skipped: P40 TurboQuant GQA-grouped decode stage1 (opt-in) — opt-in: set GENESIS_ENABLE_P40=1 to enable GQA-grouped decode stage1 (port of upstream PR #40792)
[INFO:genesis.wiring.p39a_fla_kkt] [Genesis P39b] vllm_config fetch failed (Current vLLM config is not set. This typically means get_current_vllm_config() was called outside of a set_current_vllm_config() context, or a CustomOp was instantiated at module import time or model forward time when config is not set. For tests that directly test custom ops/modules, use the 'default_vllm_config' pytest fixture from tests/conftest.py.); defaults used
[INFO:genesis.wiring.p39a_fla_kkt] [Genesis P39a] rebound vllm.model_executor.layers.fla.ops.chunk_scaled_dot_kkt.chunk_scaled_dot_kkt_fwd (+1 caller mods: ['vllm.model_executor.layers.fla.ops.chunk'])
[INFO:genesis.apply_all] [Genesis] applied: P39a FLA chunk_scaled_dot_kkt persistent A pool — module-level fn replaced (1 caller module(s) also rebound — pool shared across GDN layers)
[INFO:genesis.wiring.p38_tq_continuation_memory] [Genesis P38] rebound TurboQuantAttentionImpl._continuation_prefill (persistent K_full/V_full buffers replace torch.cat peak)
[INFO:genesis.apply_all] [Genesis] applied: P38 TQ _continuation_prefill persistent workspace — class method replaced (persistent K_full/V_full workspace, no .contiguous()/torch.cat transient peaks)
[INFO:genesis.apply_all] [Genesis] skipped: P37 MoE intermediate cache pool (opt-in) — opt-in only — set GENESIS_ENABLE_P37=1 to engage. Manager API is registered and usable independently of this text-patch.
[INFO:genesis.apply_all] [Genesis] skipped: P36 TurboQuant shared decode buffers — redundant: upstream PR #40798 (workspace manager) active — skip
[INFO:genesis.apply_all] [Genesis] applied: P32/P33 TurboQuant cu_2 + synth_seq_lens preallocs — cu_2 + synth_seq_lens preallocs registered (invoked from ensure_turboquant_buffers, fires during profile_run)
[INFO:genesis.prealloc_budget] [Genesis P73] token budget resolved → 4096 (default fallback). Set GENESIS_PREALLOC_TOKEN_BUDGET to override.
[INFO:genesis.wiring.text_patch] [P28 GDN core_attn_out prealloc] applied 1 sub-patches: p28_core_attn_out_alloc
[INFO:genesis.wiring.p28_gdn_core_attn] [Genesis P28] wrapped GatedDeltaNetAttention.__init__ to attach _genesis_gdn_core_attn_buf on each instance
[INFO:genesis.apply_all] [Genesis] applied: P28 GDN core_attn_out prealloc — forward_cuda patched + __init__ wrapped
[INFO:genesis.gdn_dual_stream] [GDN dual-stream] CUDA aux stream initialized
[INFO:genesis.apply_all] [Genesis P7] dispatcher ready (parallel path)
[INFO:genesis.apply_all] [Genesis] skipped: P7 GDN dual-stream in_proj parallelism — deferred — incompatible with torch.compile fullgraph (CUDA streams not SymPy-graphable); custom op implementation required. Re-enable with GENESIS_ENABLE_P7=1 + --enforce-eager.
[INFO:genesis.apply_all] [Genesis] skipped: P17/P18 Marlin MoE per-SM tuning — no tuning entry for SM (12, 0) — upstream heuristic will be used
[INFO:genesis.wiring.text_patch] [P24 fused_moe num_warps/num_stages overlay] applied 2 sub-patches: p24_fp8_cfg_overlay, p24_general_cfg_overlay
[INFO:genesis.apply_all] [Genesis] applied: P24 fused_moe num_warps/num_stages overlay — num_warps / num_stages overlay wired into get_default_config (active only on Triton fused_moe path; Marlin unaffected)
[INFO:genesis.wiring.p14_block_table] [Genesis P14] rebound BlockTable.append_row + move_row (tail-zero-fill active for concurrent-request safety)
[INFO:genesis.apply_all] [Genesis] applied: P14 block_table tail zero-fill — BlockTable.append_row + move_row wrapped (effective in this process)
[INFO:genesis.tq_decode_tune] [Genesis P18b] TQ decode stage1 using upstream defaults (BLOCK_KV=4 num_warps=1 num_stages=1)
[INFO:genesis.apply_all] [Genesis] applied: P18b TurboQuant decode stage1 tune — no env override — using upstream defaults (4/1/1)
[INFO:genesis.apply_all] [Genesis P20] TQ _continuation_prefill FP16 helpers ready for TurboQuantAttentionImpl hook
[INFO:genesis.apply_all] [Genesis] applied: P20 TurboQuant continuation-prefill FP16 rotate — fp16-rotation helper ready for _continuation_prefill hook
[INFO:genesis.fp8_dispatcher] [Genesis FP8 dispatcher] SM (12, 0) → Native Triton FP8 (no override)
[INFO:genesis.apply_all] [Genesis] applied: P1/P2 FP8 kernel dispatcher — SM=(12, 0) → native Triton FP8 path selected
[INFO:genesis.apply_all] Genesis Results: 27 applied, 72 skipped, 0 failed, 2 ⚠️ partial-apply warning(s)
[WARNING:genesis.apply_all] [Genesis] 2 partial-apply warning(s) — patch(es) failed to match expected source pattern. Review below to confirm anchor drift vs upstream change vs config issue:
[WARNING:genesis.apply_all] [Genesis] ⚠️  P103 FLA Cliff 2 chunked fwd_h+fwd_o orchestrator — GENESIS_ENABLE_P103 not set or platform not NVIDIA SM 8.0+
[WARNING:genesis.apply_all] [Genesis] ⚠️  P17/P18 Marlin MoE per-SM tuning — no tuning entry for SM (12, 0) — upstream heuristic will be used
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] apply matrix:
Patch | Status | Title                                         | Reason                                                       | Credit
------+--------+-----------------------------------------------+--------------------------------------------------------------+-------------------------------
P61b  | SKIP   | Qwen3 streaming partial-tag overlap guard     | opt-in only — set GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1 to | ExtReMLapin (vllm#40783)
P62   | SKIP   | Structured-output spec-decode reasoning-end t | opt-in only — set GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING= | sfbemerk (vllm#36138), ciciror
P61   | SKIP   | Qwen3 multi-tool first-occurrence (DEPRECATED | opt-in only AND empirically deprecated — keeping skip; set G | ExtReMLapin (vllm#40783) — P61
P60b  | SKIP   | GDN+ngram Triton kernel offset (Phase 2)      | opt-in only — set GENESIS_ENABLE_P60B_TRITON_KERNEL=1 to eng | tdoublep (vllm#40738)
P60   | SKIP   | GDN+ngram state recovery (Phase 1: SSM pre-co | opt-in only — set GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1 to enga | tdoublep (vllm#40738), bhaktat
P63   | SKIP   | MTP/Eagle drafter GDN state recovery (depreca | opt-in only AND empirically deprecated — keeping skip; set G | Genesis-original (hypothesis d
P64   | APPLY  | qwen3coder MTP streaming early-return fix     | opt-in env (config: neutral)                                 | kotori-yan (vllm#39598)
P65   | SKIP   | TurboQuant spec-decode cudagraph downgrade    | opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWN | Genesis-original (root cause f
P66   | APPLY  | cudagraph_capture_sizes spec-decode divisibil | opt-in env (config: neutral)                                 | Genesis-original (mirrors fhl2
P68   | SKIP   | Auto force tool_choice=required for long-cont | opt-in only — set GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to en | Genesis-original (long-ctx too
P69   | SKIP   | Long-context tool-format reminder injection   | opt-in only — set GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER= | Genesis-original (long-ctx too
P70   | SKIP   | Auto-strict-ngram (force prompt_lookup_min>=8 | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to  | Genesis-original (vllm#40875 e
P67   | APPLY  | TurboQuant multi-query kernel for spec-decode | opt-in env (config: neutral)                                 | Genesis-original (proper fix f
P71   | SKIP   | Block-verify rejection sampler (Sun 2024 ICLR | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engag | Backport of vllm#40819 (Z. Gol
P78   | SKIP   | TurboQuant .tolist() capture-guard (adapted f | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1  | Adapted from noonghunna's patc
P77   | SKIP   | Adaptive ngram K controller (EMA + hysteresis | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to e | Genesis-original (port of SGLa
P79b  | SKIP   | Async × spec-decode proposer-sync backport (v | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1  | Backport of vllm#40610 (OPEN d
P79c  | SKIP   | Stale spec_token_ids cleanup for unscheduled  | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEAN | Backport of vllm#37629 (OPEN,
P81   | SKIP   | fp8 block-scaled MM low-M decode tuning (vllm | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8 | Backport of vllm#40925 (tonyli
P82   | SKIP   | SGLang threshold_single OR-clause acceptance  | opt-in only — set GENESIS_ENABLE_P82=1 to engage             | SGLang team (sgl-project/sglan
P83   | SKIP   | MTP keep-last-cached-block (vllm#38182 downst | opt-in only — set GENESIS_ENABLE_P83=1 to engage             | Root-cause analysis: vllm#3818
P84   | SKIP   | hash_block_size override (vllm#38182 actual r | opt-in only — set GENESIS_ENABLE_P84=1 to engage             | Genesis-original discovery 202
P100  | SKIP   | FlashInfer FULL CUDA graph for spec-decode (v | opt-in only — set GENESIS_ENABLE_P100=1 to engage            | Backport of vllm#41127 (open 2
P101  | SKIP   | TQ continuation 64-token slicing (vllm#41123  | opt-in only — set GENESIS_ENABLE_P101=1 to engage            | Selective backport of vllm#411
P99   | SKIP   | WorkspaceManager.get_simultaneous memoization | opt-in only — set GENESIS_ENABLE_P99=1 to engage             | Per Sander 2026-04-28: 'if rev
P98   | SKIP   | TQ WorkspaceManager revert (vllm#40941 perf h | opt-in only — set GENESIS_ENABLE_P98=1 to engage             | Reverts upstream PR #40941 (ME
P94   | SKIP   | Spec-decode prepare_next_token_ids_padded zer | opt-in only — set GENESIS_ENABLE_P94=1 to engage             | Backport of vllm#41043 (wanglu
P95   | SKIP   | Marlin TP cudagraph cap on Ampere (vllm#40385 | opt-in only — set GENESIS_ENABLE_P95=1 to engage             | Backport of vllm#40385 (OPEN a
P91   | SKIP   | AutoRound row-parallel group cdiv + start-idx | opt-in only — set GENESIS_ENABLE_P91=1 to engage             | Backport of non-MoE-specific p
P87   | SKIP   | Marlin W4A16/W8A16 sub-tile output dim pad-on | opt-in only — set GENESIS_ENABLE_P87=1 to engage             | Backport of vllm#40361 (OPEN).
PN8   | SKIP   | MTP/draft online-quant propagation (vllm#4084 | opt-in only — set GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT= | Backport of vllm#40849 (bhoomi
PN9   | SKIP   | Independent drafter attention backend (vllm#3 | opt-in only — set GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATT | Backport of vllm#39930 (Matthe
PN11  | SKIP   | GDN a/b contiguity in fix_query_key_value_ord | opt-in only — set GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 to | Backport of vllm#41142 (Yeuvoi
PN34  | SKIP   | WorkspaceManager runtime lock relaxation (PN3 | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 | Companion to PN33 — same root
PN33  | APPLY  | Spec-decode warmup K-aware sizing (vllm#37521 | config_detect: neutral:                                      | Backport of vllm-project/vllm#
PN32  | SKIP   | GDN _forward_core chunked-prefill v2 (Cliff 2 | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1  | Genesis-original v7.69 v2 (202
PN31  | SKIP   | FA varlen persistent out buffer (issue #15, s | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_O | Genesis-original sister patch
PN30  | SKIP   | DS conv state layout + spec-decode AL>1 fix ( | opt-in only — set GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE= | Genesis-original fix for issue
PN29  | SKIP   | GDN chunk_o scale-fold (vllm#41446 pattern (c | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to en | Backport of vllm#41446 (zobinH
PN12  | SKIP   | FFN intermediate scratch pool — Cliff 1 fix o | opt-in only — set GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL= | Genesis-original 2026-04-29 —
PN28  | SKIP   | merge_attn_states NaN guard (vllm#39148 backp | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 | Backport of vllm#39148 (jasonk
P15B  | SKIP   | FA varlen max_seqlen_k clamp on TQ path (Issu | opt-in only — set GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 to e | Genesis-original 2026-05-01 fi
P38B  | SKIP   | P38 compile-safe in-source hook (Issue #14 fi | opt-in only — set GENESIS_ENABLE_P38B_COMPILE_SAFE=1 to enga | Genesis-original 2026-05-01 fi
PN27  | SKIP   | Revert MoERunnerInterface PluggableLayer (vll | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 | Backport of vllm#41440 (auto-g
PN26  | SKIP   | TQ unified perf pack (centroids prebake + spa | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage | Genesis-original 2026-05-01 un
PN25  | SKIP   | SiluAndMul.forward_native opaque-op pool (Cli | opt-in only — set GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 t | Genesis-original 2026-05-01 in
PN13  | SKIP   | CUDAGraphWrapper gc.collect/empty_cache lambd | opt-in only — set GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARIT | Backport of vllm#41235 (roikor
PN14  | SKIP   | TQ decode IOOB safe_page_idx clamp (vllm#4007 | opt-in only — set GENESIS_ENABLE_PN14_TQ_DECODE_OOB_CLAMP=1  | Backport of vllm#40074 (devara
PN23  | SKIP   | DFlash combine_hidden_states dtype cast (vllm | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to  | Backport of vllm#40334 (cipher
PN21  | SKIP   | DFlash SWA support partial backport (vllm#408 | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage | Partial backport of vllm#40898
PN22  | SKIP   | Local argmax for TP draft (vllm#39419 backpor | opt-in only — set GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 to e | Backport of vllm#39419 (EanWan
PN24  | SKIP   | DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 | Backport of vllm#40727 (benchi
PN16  | SKIP   | Lazy-reasoner request hook (per-request enabl | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to eng | Genesis-original 2026-04-29. H
P86   | SKIP   | ngram batch_propose O(N*K) → O(N+K) direct-fi | opt-in only — set GENESIS_ENABLE_P86=1 to engage             | Backport of vllm#40876 (aarona
P85   | SKIP   | Hybrid fine-shadow prefix cache (vllm#38182 f | opt-in only — set GENESIS_ENABLE_P85=1 to engage             | Genesis-original 2026-04-27 —
P75   | SKIP   | Auto-enable Suffix Decoding (Arctic Inference | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to en | Backport-enabler of vllm#25784
P74   | SKIP   | Auto chunk-clamp via long_prefill_token_thres | opt-in only — set GENESIS_ENABLE_P74_CHUNK_CLAMP=1 to engage | Genesis-original (zero-VRAM-co
P72   | SKIP   | profile_run M cap (unblocks --max-num-batched | opt-in only — set GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 to en | Genesis-original (Dynamo fake-
P67b  | APPLY  | TurboQuant spec-verify forward() routing (FUL | opt-in env (config: neutral)                                 | Genesis-original (FULL CG enab
P59   | SKIP   | Qwen3 reasoning embedded tool_call recovery   | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 t | ZenoAFfectionate (vllm#39055)
P58   | SKIP   | Async-scheduler -1 placeholder fix            | opt-in only — set GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1 | z1ying (vllm#40768)
P57   | SKIP   | TQ spec-decode capture-safe buffers (deprecat | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40831), gdn_attn.
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated —  | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40807, #40831)
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] validator: clean (no issues)
[INFO:genesis.apply_all] [Genesis compile-watchdog] apply_all elapsed: 6.7s
[INFO:genesis.dispatcher] [Genesis Dispatcher v2] apply matrix:
Patch | Status | Title                                         | Reason                                                       | Credit
------+--------+-----------------------------------------------+--------------------------------------------------------------+-------------------------------
P61b  | SKIP   | Qwen3 streaming partial-tag overlap guard     | opt-in only — set GENESIS_ENABLE_P61B_STREAMING_OVERLAP=1 to | ExtReMLapin (vllm#40783)
P62   | SKIP   | Structured-output spec-decode reasoning-end t | opt-in only — set GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING= | sfbemerk (vllm#36138), ciciror
P61   | SKIP   | Qwen3 multi-tool first-occurrence (DEPRECATED | opt-in only AND empirically deprecated — keeping skip; set G | ExtReMLapin (vllm#40783) — P61
P60b  | SKIP   | GDN+ngram Triton kernel offset (Phase 2)      | opt-in only — set GENESIS_ENABLE_P60B_TRITON_KERNEL=1 to eng | tdoublep (vllm#40738)
P60   | SKIP   | GDN+ngram state recovery (Phase 1: SSM pre-co | opt-in only — set GENESIS_ENABLE_P60_GDN_NGRAM_FIX=1 to enga | tdoublep (vllm#40738), bhaktat
P63   | SKIP   | MTP/Eagle drafter GDN state recovery (depreca | opt-in only AND empirically deprecated — keeping skip; set G | Genesis-original (hypothesis d
P64   | APPLY  | qwen3coder MTP streaming early-return fix     | opt-in env (config: neutral)                                 | kotori-yan (vllm#39598)
P65   | SKIP   | TurboQuant spec-decode cudagraph downgrade    | opt-in only — set GENESIS_ENABLE_P65_TURBOQUANT_SPEC_CG_DOWN | Genesis-original (root cause f
P66   | APPLY  | cudagraph_capture_sizes spec-decode divisibil | opt-in env (config: neutral)                                 | Genesis-original (mirrors fhl2
P68   | SKIP   | Auto force tool_choice=required for long-cont | opt-in only — set GENESIS_ENABLE_P68_AUTO_FORCE_TOOL=1 to en | Genesis-original (long-ctx too
P69   | SKIP   | Long-context tool-format reminder injection   | opt-in only — set GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER= | Genesis-original (long-ctx too
P70   | SKIP   | Auto-strict-ngram (force prompt_lookup_min>=8 | opt-in only — set GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM=1 to  | Genesis-original (vllm#40875 e
P67   | APPLY  | TurboQuant multi-query kernel for spec-decode | opt-in env (config: neutral)                                 | Genesis-original (proper fix f
P71   | SKIP   | Block-verify rejection sampler (Sun 2024 ICLR | opt-in only — set GENESIS_ENABLE_P71_BLOCK_VERIFY=1 to engag | Backport of vllm#40819 (Z. Gol
P78   | SKIP   | TurboQuant .tolist() capture-guard (adapted f | opt-in only — set GENESIS_ENABLE_P78_TOLIST_CAPTURE_GUARD=1  | Adapted from noonghunna's patc
P77   | SKIP   | Adaptive ngram K controller (EMA + hysteresis | opt-in only — set GENESIS_ENABLE_P77_ADAPTIVE_NGRAM_K=1 to e | Genesis-original (port of SGLa
P79b  | SKIP   | Async × spec-decode proposer-sync backport (v | opt-in only — set GENESIS_ENABLE_P79B_ASYNC_PROPOSER_SYNC=1  | Backport of vllm#40610 (OPEN d
P79c  | SKIP   | Stale spec_token_ids cleanup for unscheduled  | opt-in only — set GENESIS_ENABLE_P79C_STALE_SPEC_TOKEN_CLEAN | Backport of vllm#37629 (OPEN,
P81   | SKIP   | fp8 block-scaled MM low-M decode tuning (vllm | opt-in only — set GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8 | Backport of vllm#40925 (tonyli
P82   | SKIP   | SGLang threshold_single OR-clause acceptance  | opt-in only — set GENESIS_ENABLE_P82=1 to engage             | SGLang team (sgl-project/sglan
P83   | SKIP   | MTP keep-last-cached-block (vllm#38182 downst | opt-in only — set GENESIS_ENABLE_P83=1 to engage             | Root-cause analysis: vllm#3818
P84   | SKIP   | hash_block_size override (vllm#38182 actual r | opt-in only — set GENESIS_ENABLE_P84=1 to engage             | Genesis-original discovery 202
P100  | SKIP   | FlashInfer FULL CUDA graph for spec-decode (v | opt-in only — set GENESIS_ENABLE_P100=1 to engage            | Backport of vllm#41127 (open 2
P101  | SKIP   | TQ continuation 64-token slicing (vllm#41123  | opt-in only — set GENESIS_ENABLE_P101=1 to engage            | Selective backport of vllm#411
P99   | SKIP   | WorkspaceManager.get_simultaneous memoization | opt-in only — set GENESIS_ENABLE_P99=1 to engage             | Per Sander 2026-04-28: 'if rev
P98   | SKIP   | TQ WorkspaceManager revert (vllm#40941 perf h | opt-in only — set GENESIS_ENABLE_P98=1 to engage             | Reverts upstream PR #40941 (ME
P94   | SKIP   | Spec-decode prepare_next_token_ids_padded zer | opt-in only — set GENESIS_ENABLE_P94=1 to engage             | Backport of vllm#41043 (wanglu
P95   | SKIP   | Marlin TP cudagraph cap on Ampere (vllm#40385 | opt-in only — set GENESIS_ENABLE_P95=1 to engage             | Backport of vllm#40385 (OPEN a
P91   | SKIP   | AutoRound row-parallel group cdiv + start-idx | opt-in only — set GENESIS_ENABLE_P91=1 to engage             | Backport of non-MoE-specific p
P87   | SKIP   | Marlin W4A16/W8A16 sub-tile output dim pad-on | opt-in only — set GENESIS_ENABLE_P87=1 to engage             | Backport of vllm#40361 (OPEN).
PN8   | SKIP   | MTP/draft online-quant propagation (vllm#4084 | opt-in only — set GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT= | Backport of vllm#40849 (bhoomi
PN9   | SKIP   | Independent drafter attention backend (vllm#3 | opt-in only — set GENESIS_ENABLE_PN9_INDEPENDENT_DRAFTER_ATT | Backport of vllm#39930 (Matthe
PN11  | SKIP   | GDN a/b contiguity in fix_query_key_value_ord | opt-in only — set GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS=1 to | Backport of vllm#41142 (Yeuvoi
PN34  | SKIP   | WorkspaceManager runtime lock relaxation (PN3 | opt-in only — set GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 | Companion to PN33 — same root
PN33  | APPLY  | Spec-decode warmup K-aware sizing (vllm#37521 | config_detect: neutral:                                      | Backport of vllm-project/vllm#
PN32  | SKIP   | GDN _forward_core chunked-prefill v2 (Cliff 2 | opt-in only — set GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1  | Genesis-original v7.69 v2 (202
PN31  | SKIP   | FA varlen persistent out buffer (issue #15, s | opt-in only — set GENESIS_ENABLE_PN31_FA_VARLEN_PERSISTENT_O | Genesis-original sister patch
PN30  | SKIP   | DS conv state layout + spec-decode AL>1 fix ( | opt-in only — set GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE= | Genesis-original fix for issue
PN29  | SKIP   | GDN chunk_o scale-fold (vllm#41446 pattern (c | opt-in only — set GENESIS_ENABLE_PN29_GDN_SCALE_FOLD=1 to en | Backport of vllm#41446 (zobinH
PN12  | SKIP   | FFN intermediate scratch pool — Cliff 1 fix o | opt-in only — set GENESIS_ENABLE_PN12_FFN_INTERMEDIATE_POOL= | Genesis-original 2026-04-29 —
PN28  | SKIP   | merge_attn_states NaN guard (vllm#39148 backp | opt-in only — set GENESIS_ENABLE_PN28_MERGE_ATTN_NAN_GUARD=1 | Backport of vllm#39148 (jasonk
P15B  | SKIP   | FA varlen max_seqlen_k clamp on TQ path (Issu | opt-in only — set GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1 to e | Genesis-original 2026-05-01 fi
P38B  | SKIP   | P38 compile-safe in-source hook (Issue #14 fi | opt-in only — set GENESIS_ENABLE_P38B_COMPILE_SAFE=1 to enga | Genesis-original 2026-05-01 fi
PN27  | SKIP   | Revert MoERunnerInterface PluggableLayer (vll | opt-in only — set GENESIS_ENABLE_PN27_REVERT_PLUGGABLE_MOE=1 | Backport of vllm#41440 (auto-g
PN26  | SKIP   | TQ unified perf pack (centroids prebake + spa | opt-in only — set GENESIS_ENABLE_PN26_TQ_UNIFIED=1 to engage | Genesis-original 2026-05-01 un
PN25  | SKIP   | SiluAndMul.forward_native opaque-op pool (Cli | opt-in only — set GENESIS_ENABLE_PN25_SILU_INDUCTOR_SAFE=1 t | Genesis-original 2026-05-01 in
PN13  | SKIP   | CUDAGraphWrapper gc.collect/empty_cache lambd | opt-in only — set GENESIS_ENABLE_PN13_CUDA_GRAPH_LAMBDA_ARIT | Backport of vllm#41235 (roikor
PN14  | SKIP   | TQ decode IOOB safe_page_idx clamp (vllm#4007 | opt-in only — set GENESIS_ENABLE_PN14_TQ_DECODE_OOB_CLAMP=1  | Backport of vllm#40074 (devara
PN23  | SKIP   | DFlash combine_hidden_states dtype cast (vllm | opt-in only — set GENESIS_ENABLE_PN23_DFLASH_DTYPE_FIX=1 to  | Backport of vllm#40334 (cipher
PN21  | SKIP   | DFlash SWA support partial backport (vllm#408 | opt-in only — set GENESIS_ENABLE_PN21_DFLASH_SWA=1 to engage | Partial backport of vllm#40898
PN22  | SKIP   | Local argmax for TP draft (vllm#39419 backpor | opt-in only — set GENESIS_ENABLE_PN22_LOCAL_ARGMAX_TP=1 to e | Backport of vllm#39419 (EanWan
PN24  | SKIP   | DFlash aux layer +1 indexing fix (vllm#40727) | opt-in only — set GENESIS_ENABLE_PN24_DFLASH_AUX_LAYER_FIX=1 | Backport of vllm#40727 (benchi
PN16  | SKIP   | Lazy-reasoner request hook (per-request enabl | opt-in only — set GENESIS_ENABLE_PN16_LAZY_REASONER=1 to eng | Genesis-original 2026-04-29. H
P86   | SKIP   | ngram batch_propose O(N*K) → O(N+K) direct-fi | opt-in only — set GENESIS_ENABLE_P86=1 to engage             | Backport of vllm#40876 (aarona
P85   | SKIP   | Hybrid fine-shadow prefix cache (vllm#38182 f | opt-in only — set GENESIS_ENABLE_P85=1 to engage             | Genesis-original 2026-04-27 —
P75   | SKIP   | Auto-enable Suffix Decoding (Arctic Inference | opt-in only — set GENESIS_ENABLE_P75_SUFFIX_DECODING=1 to en | Backport-enabler of vllm#25784
P74   | SKIP   | Auto chunk-clamp via long_prefill_token_thres | opt-in only — set GENESIS_ENABLE_P74_CHUNK_CLAMP=1 to engage | Genesis-original (zero-VRAM-co
P72   | SKIP   | profile_run M cap (unblocks --max-num-batched | opt-in only — set GENESIS_ENABLE_P72_PROFILE_RUN_CAP=1 to en | Genesis-original (Dynamo fake-
P67b  | APPLY  | TurboQuant spec-verify forward() routing (FUL | opt-in env (config: neutral)                                 | Genesis-original (FULL CG enab
P59   | SKIP   | Qwen3 reasoning embedded tool_call recovery   | opt-in only — set GENESIS_ENABLE_P59_QWEN3_TOOL_RECOVERY=1 t | ZenoAFfectionate (vllm#39055)
P58   | SKIP   | Async-scheduler -1 placeholder fix            | opt-in only — set GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX=1 | z1ying (vllm#40768)
P57   | SKIP   | TQ spec-decode capture-safe buffers (deprecat | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40831), gdn_attn.
P56   | SKIP   | TQ spec-decode safe-path guard (deprecated —  | opt-in only AND empirically deprecated — keeping skip; set G | noonghunna (#40807, #40831)
[workspace_lock_disable] target: /usr/local/lib/python3.12/dist-packages/vllm/v1/worker/workspace.py
[workspace_lock_disable] applied (lock-violation now logs WARNING, allocates anyway)
[tolist_cudagraph_fix] target: /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/turboquant_attn.py
[tolist_cudagraph_fix] Site B (_prefill_attention): applied
[tolist_cudagraph_fix] Site A (forward mixed-batch): applied
[tolist_cudagraph_fix] Patched /usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/turboquant_attn.py. Site A=applied, Site B=applied
WARNING 05-05 02:56:05 [argparse_utils.py:257] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in a future version.
(APIServer pid=1) INFO 05-05 02:56:05 [utils.py:299]
(APIServer pid=1) INFO 05-05 02:56:05 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1) INFO 05-05 02:56:05 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.1rc1.dev16+g7a1eb8ac2
(APIServer pid=1) INFO 05-05 02:56:05 [utils.py:299]   █▄█▀ █     █     █     █  model   /root/.cache/huggingface/qwen3.6-27b-nvfp4
(APIServer pid=1) INFO 05-05 02:56:05 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 05-05 02:56:05 [utils.py:299]
(APIServer pid=1) INFO 05-05 02:56:05 [utils.py:233] non-default args: {'model_tag': '/root/.cache/huggingface/qwen3.6-27b-nvfp4', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'model': '/root/.cache/huggingface/qwen3.6-27b-nvfp4', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 24000, 'quantization': 'compressed-tensors', 'served_model_name': ['qwen3.6-27b-autoround'], 'reasoning_parser': 'qwen3', 'gpu_memory_utilization': 0.96, 'kv_cache_dtype': 'fp8_e4m3', 'enable_prefix_caching': True, 'language_model_only': True, 'max_num_batched_tokens': 4128, 'max_num_seqs': 1, 'enable_chunked_prefill': True, 'scheduler_reserve_full_isl': False, 'speculative_config': {'method': 'mtp', 'num_speculative_tokens': 3}}
(APIServer pid=1) INFO 05-05 02:56:11 [nixl_utils.py:20] Setting UCX_RCACHE_MAX_UNRELEASED to '1024' to avoid a rare memory leak in UCX when using NIXL.
(APIServer pid=1) INFO 05-05 02:56:11 [nixl_utils.py:32] NIXL is available
(APIServer pid=1) INFO 05-05 02:56:11 [model.py:563] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1) INFO 05-05 02:56:11 [model.py:1692] Using max model len 24000
(APIServer pid=1) INFO 05-05 02:56:11 [cache.py:261] Using fp8_e4m3 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=1) INFO 05-05 02:56:16 [model.py:563] Resolved architecture: Qwen3_5MTP
(APIServer pid=1) INFO 05-05 02:56:16 [model.py:1692] Using max model len 262144
(APIServer pid=1) WARNING 05-05 02:56:16 [speculative.py:659] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=1) INFO 05-05 02:56:16 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=4128.
(APIServer pid=1) WARNING 05-05 02:56:16 [config.py:367] Mamba cache mode is set to 'align' for Qwen3_5ForConditionalGeneration by default when prefix caching is enabled
(APIServer pid=1) INFO 05-05 02:56:16 [config.py:387] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=1) INFO 05-05 02:56:16 [vllm.py:841] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 05-05 02:56:16 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1) WARNING 05-05 02:56:16 [vllm.py:1403] max_num_scheduled_tokens is set to 4128 based on the speculative decoding settings. This may lead to suboptimal performance. Consider increasing max_num_batched_tokens to accommodate the additional draft token slots, or decrease num_speculative_tokens or max_num_seqs.
(APIServer pid=1) INFO 05-05 02:56:16 [vllm.py:1563] [Genesis P66] Filtered cudagraph_capture_sizes for spec-decode uniform_query_len=4: removed 2 non-divisible sizes [1, 2]; kept [4, 8]. Prevents mixed-q_len capture (vllm#28015 mechanism).
(APIServer pid=1) INFO 05-05 02:56:17 [compilation.py:303] Enabled custom fusions: act_quant
(APIServer pid=1) [transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(APIServer pid=1) INFO 05-05 02:56:17 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
INFO 05-05 02:56:22 [nixl_utils.py:32] NIXL is available
(EngineCore pid=84) INFO 05-05 02:56:22 [core.py:109] Initializing a V1 LLM engine (v0.20.1rc1.dev16+g7a1eb8ac2) with config: model='/root/.cache/huggingface/qwen3.6-27b-nvfp4', speculative_config=SpeculativeConfig(method='mtp', model='/root/.cache/huggingface/qwen3.6-27b-nvfp4', num_spec_tokens=3), tokenizer='/root/.cache/huggingface/qwen3.6-27b-nvfp4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=24000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen3.6-27b-autoround, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::gdn_attention_core_xpu', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [4128], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [4, 8], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 8, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=84) [transformers] `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(EngineCore pid=84) INFO 05-05 02:56:23 [registry.py:126] All limits of multimodal modalities supported by the model are set to 0, running in text-only mode.
(EngineCore pid=84) INFO 05-05 02:56:23 [parallel_state.py:1402] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.1.96.57:44897 backend=nccl
(EngineCore pid=84) INFO 05-05 02:56:23 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=84) INFO 05-05 02:56:24 [topk_topp_sampler.py:51] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=84) WARNING 05-05 02:56:24 [__init__.py:206] min_p and logit_bias parameters won't work with speculative decoding.
(EngineCore pid=84) INFO 05-05 02:56:24 [gpu_model_runner.py:4778] Starting to load model /root/.cache/huggingface/qwen3.6-27b-nvfp4...
(EngineCore pid=84) INFO 05-05 02:56:24 [cuda.py:423] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=84) INFO 05-05 02:56:24 [__init__.py:683] Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM
(EngineCore pid=84) INFO 05-05 02:56:24 [mm_encoder_attention.py:372] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=84) INFO 05-05 02:56:24 [gdn_linear_attn.py:155] Using Triton/FLA GDN prefill kernel
(EngineCore pid=84) INFO 05-05 02:56:24 [cuda.py:368] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN'].
(EngineCore pid=84) INFO 05-05 02:56:25 [weight_utils.py:904] Filesystem type for checkpoints: ZFS. Checkpoint size: 26.59 GiB. Available RAM: 60.56 GiB.
(EngineCore pid=84) INFO 05-05 02:56:25 [weight_utils.py:927] Auto-prefetch is disabled because the filesystem (ZFS) is not a recognized network FS (NFS/Lustre). If you want to force prefetching, start vLLM with --safetensors-load-strategy=prefetch.
Loading safetensors checkpoint shards:   0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/15 [00:01<00:23,  1.69s/it]
Loading safetensors checkpoint shards:  13% Completed | 2/15 [00:03<00:21,  1.66s/it]
Loading safetensors checkpoint shards:  20% Completed | 3/15 [00:03<00:12,  1.04s/it]
Loading safetensors checkpoint shards:  27% Completed | 4/15 [00:03<00:08,  1.34it/s]
Loading safetensors checkpoint shards:  33% Completed | 5/15 [00:04<00:07,  1.42it/s]
Loading safetensors checkpoint shards:  40% Completed | 6/15 [00:04<00:04,  1.85it/s]
Loading safetensors checkpoint shards:  47% Completed | 7/15 [00:05<00:03,  2.28it/s]
Loading safetensors checkpoint shards:  53% Completed | 8/15 [00:05<00:02,  2.70it/s]
Loading safetensors checkpoint shards:  60% Completed | 9/15 [00:05<00:01,  3.09it/s]
Loading safetensors checkpoint shards:  67% Completed | 10/15 [00:05<00:01,  3.41it/s]
Loading safetensors checkpoint shards:  80% Completed | 12/15 [00:06<00:00,  4.26it/s]
Loading safetensors checkpoint shards:  93% Completed | 14/15 [00:06<00:00,  5.21it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:06<00:00,  2.39it/s]
(EngineCore pid=84)
(EngineCore pid=84) INFO 05-05 02:56:31 [default_loader.py:384] Loading weights took 6.31 seconds
(EngineCore pid=84) INFO 05-05 02:56:31 [gpu_model_runner.py:4802] Loading drafter model...
(EngineCore pid=84) INFO 05-05 02:56:31 [vllm.py:841] Asynchronous scheduling is enabled.
(EngineCore pid=84) INFO 05-05 02:56:31 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(EngineCore pid=84) WARNING 05-05 02:56:31 [vllm.py:1403] max_num_scheduled_tokens is set to 4128 based on the speculative decoding settings. This may lead to suboptimal performance. Consider increasing max_num_batched_tokens to accommodate the additional draft token slots, or decrease num_speculative_tokens or max_num_seqs.
(EngineCore pid=84) INFO 05-05 02:56:31 [compilation.py:303] Enabled custom fusions: act_quant
Loading safetensors checkpoint shards:   0% Completed | 0/15 [00:00<?, ?it/s]
(EngineCore pid=84) INFO 05-05 02:56:31 [weight_utils.py:904] Filesystem type for checkpoints: ZFS. Checkpoint size: 26.59 GiB. Available RAM: 56.10 GiB.
Loading safetensors checkpoint shards:  80% Completed | 12/15 [00:00<00:00, 50.15it/s]
(EngineCore pid=84) WARNING 05-05 02:56:32 [qwen3_5_mtp.py:325] Parameter fc.weight not found in params_dict, skip loading
(EngineCore pid=84) WARNING 05-05 02:56:32 [qwen3_5_mtp.py:325] Parameter layers.0.mlp.down_proj.weight not found in params_dict, skip loading
(EngineCore pid=84) WARNING 05-05 02:56:32 [qwen3_5_mtp.py:325] Parameter layers.0.mlp.gate_gate_up_proj.weight not found in params_dict, skip loading
(EngineCore pid=84) WARNING 05-05 02:56:32 [qwen3_5_mtp.py:325] Parameter layers.0.mlp.gate_up_proj.weight not found in params_dict, skip loading
(EngineCore pid=84) WARNING 05-05 02:56:32 [qwen3_5_mtp.py:325] Parameter layers.0.self_attn.qkqkv_proj.weight not found in params_dict, skip loading
(EngineCore pid=84) WARNING 05-05 02:56:32 [qwen3_5_mtp.py:325] Parameter layers.0.self_attn.o_proj.weight not found in params_dict, skip loading
(EngineCore pid=84) WARNING 05-05 02:56:32 [qwen3_5_mtp.py:325] Parameter layers.0.self_attn.qkv_proj.weight not found in params_dict, skip loading
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 29.23it/s]
(EngineCore pid=84)
(EngineCore pid=84) INFO 05-05 02:56:32 [default_loader.py:384] Loading weights took 0.51 seconds
(EngineCore pid=84) WARNING 05-05 02:56:32 [compressed_tensors_w4a4_nvfp4.py:97] In NVFP4 linear, the global scale for input or weight are different for parallel layers (e.g. q_proj, k_proj, v_proj). This  will likely result in reduced accuracy. Please verify the model accuracy. Consider using a checkpoint with a shared global NVFP4 scale for fused layers.
(EngineCore pid=84) INFO 05-05 02:56:32 [llm_base_proposer.py:1460] Detected MTP model. Sharing target model embedding weights with the draft model.
(EngineCore pid=84) INFO 05-05 02:56:32 [llm_base_proposer.py:1516] Detected MTP model. Sharing target model lm_head weights with the draft model.
(EngineCore pid=84) INFO 05-05 02:56:32 [gpu_model_runner.py:4880] Model loading took 25.29 GiB memory and 7.450936 seconds
(EngineCore pid=84) INFO 05-05 02:56:32 [interface.py:639] Setting attention block size to 1600 tokens to ensure that attention page size is >= mamba page size.
(EngineCore pid=84) INFO 05-05 02:56:32 [interface.py:663] Padding mamba page size by 0.25% to ensure that mamba page size and attention page size are exactly equal.
(EngineCore pid=84) INFO 05-05 02:56:41 [backends.py:1069] Using cache directory: /root/.cache/vllm/torch_compile_cache/1fcbfaf4f0/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=84) INFO 05-05 02:56:41 [backends.py:1128] Dynamo bytecode transform time: 8.57 s
(EngineCore pid=84) INFO 05-05 02:56:42 [backends.py:376] Cache the graph of compile range (1, 4128) for later use
(EngineCore pid=84) INFO 05-05 02:56:57 [backends.py:391] Compiling a graph for compile range (1, 4128) takes 16.11 s
(EngineCore pid=84) INFO 05-05 02:57:02 [decorators.py:668] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/7cf4040fe442dbe097b3d86c4ed3e3bccfc13d4650ab782eff7880bbb554018f/rank_0_0/model
(EngineCore pid=84) INFO 05-05 02:57:02 [monitor.py:53] torch.compile took 29.62 s in total
(EngineCore pid=84) INFO 05-05 02:57:06 [monitor.py:81] Initial profiling/warmup run took 3.76 s
(EngineCore pid=84) INFO 05-05 02:57:06 [backends.py:1069] Using cache directory: /root/.cache/vllm/torch_compile_cache/1fcbfaf4f0/rank_0_0/eagle_head for vLLM's torch.compile
(EngineCore pid=84) INFO 05-05 02:57:06 [backends.py:1128] Dynamo bytecode transform time: 0.47 s
(EngineCore pid=84) INFO 05-05 02:57:10 [backends.py:391] Compiling a graph for compile range (1, 4128) takes 4.03 s
(EngineCore pid=84) INFO 05-05 02:57:10 [decorators.py:668] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/d8c9abc5bb8b9e0a40ac0e84a86fc04886e91e4127a27317ee7f989c6e3370bc/rank_0_0/model
(EngineCore pid=84) INFO 05-05 02:57:10 [monitor.py:53] torch.compile took 4.80 s in total
(EngineCore pid=84) INFO 05-05 02:57:11 [monitor.py:81] Initial profiling/warmup run took 0.58 s
(EngineCore pid=84) WARNING 05-05 02:57:11 [kv_cache_utils.py:1181] Add 3 padding layers, may waste at most 6.25% KV cache memory
(EngineCore pid=84) WARNING 05-05 02:57:11 [compilation.py:1390] CUDAGraphMode.FULL_AND_PIECEWISE is not supported with spec-decode for attention backend FlashInferBackend (support: AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE); setting cudagraph_mode=PIECEWISE
(EngineCore pid=84) INFO 05-05 02:57:11 [gpu_model_runner.py:5983] Profiling CUDA graph memory: PIECEWISE=2 (largest=8)
(EngineCore pid=84) INFO 05-05 02:57:13 [gpu_model_runner.py:6062] Estimated CUDA graph memory: 0.04 GiB total
(EngineCore pid=84) INFO 05-05 02:57:13 [gpu_worker.py:440] Available KV cache memory: 1.93 GiB
(EngineCore pid=84) INFO 05-05 02:57:13 [gpu_worker.py:455] CUDA graph memory profiling is enabled (default since v0.21.0). The current --gpu-memory-utilization=0.9600 is equivalent to --gpu-memory-utilization=0.9586 without CUDA graph memory profiling. To maintain the same effective KV cache size as before, increase --gpu-memory-utilization to 0.9614. To disable, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0.
(EngineCore pid=84) WARNING 05-05 02:57:13 [kv_cache_utils.py:1181] Add 3 padding layers, may waste at most 6.25% KV cache memory
(EngineCore pid=84) INFO 05-05 02:57:13 [kv_cache_utils.py:1778] GPU KV cache size: 59,200 tokens
(EngineCore pid=84) INFO 05-05 02:57:13 [kv_cache_utils.py:1783] Maximum concurrency for 24,000 tokens per request: 1.23x
(EngineCore pid=84) 2026-05-05 02:57:13,488 - INFO - autotuner.py:457 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
[AutoTuner]: Tuning fp4_gemm: 100%|██████████| 13/13 [00:00<00:00, 38.86profile/s]
[AutoTuner]: Tuning fp4_gemm: 100%|██████████| 13/13 [00:00<00:00, 63.23profile/s]
[AutoTuner]: Tuning fp4_gemm: 100%|██████████| 13/13 [00:00<00:00, 65.87profile/s]
[AutoTuner]: Tuning fp4_gemm: 100%|██████████| 13/13 [00:00<00:00, 97.49profile/s]
[AutoTuner]: Tuning fp4_gemm: 100%|██████████| 13/13 [00:00<00:00, 70.73profile/s]
(EngineCore pid=84) 2026-05-05 02:57:15,332 - INFO - autotuner.py:466 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 2/2 [00:00<00:00, 14.29it/s]
(EngineCore pid=84) INFO 05-05 02:57:16 [gpu_model_runner.py:6153] Graph capturing finished in 1 secs, took 0.03 GiB
(EngineCore pid=84) INFO 05-05 02:57:16 [gpu_worker.py:599] CUDA graph pool memory: 0.03 GiB (actual), 0.04 GiB (estimated), difference: 0.02 GiB (69.2%).
(EngineCore pid=84) INFO 05-05 02:57:16 [core.py:299] init engine (profile, create kv cache, warmup model) took 43.78 s (compilation: 34.42 s)
(EngineCore pid=84) INFO 05-05 02:57:16 [kernel.py:205] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1) INFO 05-05 02:57:16 [api_server.py:598] Supported tasks: ['generate']
(APIServer pid=1) INFO 05-05 02:57:16 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=1) WARNING 05-05 02:57:16 [model.py:1449] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 05-05 02:57:17 [hf.py:314] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) [transformers] The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
(APIServer pid=1) INFO 05-05 02:57:21 [base.py:233] Multi-modal warmup completed in 3.896s
(APIServer pid=1) INFO 05-05 02:57:21 [api_server.py:602] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:37] Available routes are:
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 05-05 02:57:21 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.

(APIServer pid=1) INFO:     192.168.88.2:30237 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=1) INFO:     192.168.88.2:18426 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 05-05 02:58:01 [loggers.py:271] Engine 000: Avg prompt throughput: 2.5 tokens/s, Avg generation throughput: 18.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 36.1%, Prefix cache hit rate: 0.0%
(APIServer pid=1) INFO 05-05 02:58:01 [metrics.py:101] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 12.23 tokens/s, Accepted: 0 tokens, Drafted: 552 tokens, Per-position acceptance rate: 0.000, 0.000, 0.000, Avg Draft acceptance rate: 0.0%

As an aside, might be worth adding the '--root-user-action' flag on the pip command in the entrypoint for compose files just to get rid the the initial warning.

0 replies

noonghunna · 2026-05-05T14:35:19Z

noonghunna
May 5, 2026
Maintainer

@apnar — thanks for sticking with the NVFP4 attempt and getting clean bench numbers out the other side. Honest read: 32 TPS is the expected envelope for that quant variant on this hardware right now, not a config issue you can tune around. Worth explaining why before you decide whether to keep it or fall back.

Why the numbers are this slow

Compare against your earlier INT4 AutoRound run on the same 5090:

Path	Narr TPS	Code TPS	Quant arch
AutoRound INT4 (`docker-compose.yml`)	122.7	162.3	All linear weights INT4-packed via Marlin
NVFP4 linearattn-BF16 (this run)	32.2	32.1	NVFP4 only on standard-attention; DeltaNet + linear-attn layers stay in BF16

The kaitchup variant's name is precise: Qwen3.6-27B-autoround-nvfp4-linearattn-BF16. NVFP4 is applied only to the standard-attention layers (~25% of the model on Qwen3.6-27B); the DeltaNet linear-attn layers (~75%) stay in BF16. So 75% of every forward step is doing BF16 GEMM on Blackwell, not FP4 tensor core math.

The CV 0.1% you observed is the giveaway — that's a textbook bandwidth-bound path: 32 TPS × ~26 GB residency reads per token ≈ 0.83 TB/s sustained, which is 46% of the 5090's 1.79 TB/s peak. Reasonable for a path that's neither using Marlin's packed kernel (which AutoRound INT4 has) nor Blackwell's native FP4 tensor cores (which a hypothetical full-NVFP4 quant would).

For context on what AutoRound INT4 is doing differently: it goes through vLLM's Marlin kernel which fuses dequant + GEMM in a packed format optimized for consumer GPU bandwidth. NVFP4 + compressed-tensors loader doesn't have that fusion path yet on consumer Blackwell — partly because of the genesis-vllm-patches#20 SM 12.0 detection bug we filed (Genesis can't fully recognize sm_120 as Blackwell, so platform-specific paths don't engage), and partly because vLLM's NVFP4 support is newer / less optimized than Marlin INT4.

Recommendation: AutoRound INT4 is the daily-driver on Blackwell consumer

For now:

Keep docker-compose.yml (AutoRound INT4) as your daily driver — 122/162 TPS, all 8/8 verify-full pass, mature kernel path
Park NVFP4 until vLLM ships an NVFP4-Marlin-equivalent kernel for consumer Blackwell. The marquee FP4 tensor core win the 5090 has is real, but software needs to catch up to expose it. We don't currently have visibility into when that lands — it depends on vLLM's Blackwell roadmap + Sandermage updating Genesis's SM 12.0 detection (per [bug] launch script handling of ports and container names is broken #20)

NVFP4 on Hopper (datacenter) is more mature today — the Hopper FP4 path has been in use longer and the loader is better tested on H100/H200. On consumer Blackwell (5090), it's preview-quality.

What would be useful from your side

If you have cycles for it:

Bench dual.yml or dual-turbo.yml on the 5090 single-card (TP=1) — these use AutoRound INT4 (which works well for you) but with a different KV format (fp8_e5m2 vs TQ3) and full ctx. Would give us the cleanest "5090 single-card AutoRound INT4 + various KV formats" comparison row in BENCHMARKS.
Try the v7.72.2-uplift branch (just merged, PR #59) on the AutoRound INT4 daily-driver — bumps the vLLM image to a newer nightly and the Genesis pin to v7.72.2. Worth a re-bench since the underlying compute kernels may have moved.

Either of those gives you a more useful "5090 owner" data point than the NVFP4 path can today. We'd love to add a 5090 single-card row to BENCHMARKS at the AutoRound INT4 numbers — that's a config we can confidently recommend to other 5090 owners.

Cross-link to genesis bug

Filing reference for #20: the SM 12.0 detection issue is at Sandermage/genesis-vllm-patches#20. Once that lands, several Blackwell-specific paths in Genesis (Marlin tuning, P40 grouping kernel) will start engaging cleanly on your rig, which should help even the AutoRound INT4 path. Worth watching that issue.

Genuinely thanks for being our 5090 cross-rig — the NVFP4 attempt was a worthwhile preview-class data point even though the numbers didn't justify daily-driver use. We're collecting these pin to the upstream tree as the Blackwell consumer envelope sharpens.

0 replies

noonghunna · 2026-05-05T15:13:30Z

noonghunna
May 5, 2026
Maintainer

A small meta-note as we close out today's loop on this thread, @apnar:

The back-and-forth here on the 5090 NVFP4 attempts is genuinely high-signal — first Blackwell consumer cross-rig data on club-3090, surfaced the Sandermage/genesis-vllm-patches#20 sm_120 detection bug, gave us real numbers to anchor "what does the AutoRound INT4 path look like on a 5090". All useful.

But we noticed the thread has accumulated several rounds of full-container-log dumps + Genesis dispatcher banners + bench output — content that's structurally a bug-tracking conversation, not the kind of open-ended hardware-questions discussion this channel is shaped for. We just sharpened the routing in disc #17 and CONTRIBUTING.md (PR #61) to make the convention explicit going forward:

Discussions = introductions, "is this expected?" before logs, design proposals, hardware-class questions
Issues = log dumps, tracebacks, report.sh output, reproducible failures — anything with structured rig context

Future iterations on this NVFP4 / 5090 work — would you mind taking them to a [bug] 5090 NVFP4 ... issue (using the bug-report template which leads with bash scripts/report.sh)? Two practical wins:

Findability — when the next 5090 owner shows up and hits the same OOM at NVFP4 model load, they'll find the issue via tracker search instead of digging through this thread
State machine — when genesis#20 lands and the sm_120 detection improves, the issue closes cleanly with a "fixed by upstream X, here's what changed" pointer. Discussions don't have that close-state surface.

This thread is fine as the design / hardware-class discussion for what 5090 ownership of club-3090 looks like — high-level "should we ship a 5090 default config?" sort of questions. The structured cross-rig bench data row (AutoRound INT4 daily-driver numbers) belongs in a Numbers from your rig issue when you have it. The bug reports (NVFP4 boot OOMs, slow throughput diagnoses, KV pool admission failures) belong in regular bug-report issues.

No judgment on what's already posted — this is a "going forward" convention, not a retroactive cleanup. We're not migrating this thread or asking you to re-file anything you've already shared. Genuinely thanks for the depth of cross-rig data here; it's been load-bearing.

0 replies

Initial data from 5090 run #51

Uh oh!

apnar May 4, 2026

club-3090 rig report

System

CPU + RAM

Disk

GPU hardware

NVLink

Topology

Full nvidia-smi

Display / desktop state

Container runtime

Stack version

Active container

verify-full.sh output

verify-stress.sh output

soak-test.sh (SOAK_MODE=continuous) output

bench.sh output

Replies: 10 comments · 4 replies

Uh oh!

apnar May 4, 2026 Author

bench.sh output

Uh oh!

Uh oh!

apnar May 4, 2026 Author

Uh oh!

noonghunna May 4, 2026 Maintainer

What the data says (5090 vs 3090 Ti, same system)

Tweak A (likely +5-15%): drop the Ampere-specific Marlin RO-mount

Tweak B (likely +5-10% accuracy, neutral TPS): swap KV format

Tweak C (lots more headroom): raise max-model-len

Tweak D (probably no-op but worth flagging): Genesis Blackwell auto-detection

On the soak-test docker error

What we'd love to see next

Uh oh!

Uh oh!

apnar May 4, 2026 Author

bench.sh output

Uh oh!

noonghunna May 4, 2026 Maintainer

Tweak A — you're right, NA for single-card

The Genesis sm_120 detection gap

Your B+C results — surprising and useful

NVFP4 — the real Blackwell experiment

Action items I owe you

Uh oh!

noonghunna May 4, 2026 Maintainer

Uh oh!

apnar May 4, 2026 Author

Uh oh!

apnar May 4, 2026 Author

Uh oh!

noonghunna May 4, 2026 Maintainer

Try --language-model-only

Aside — Genesis NOW knows you have an RTX 5090

Smaller asks

Uh oh!

apnar May 4, 2026 Author

Uh oh!

noonghunna May 4, 2026 Maintainer

What the boot log shows

Three tweaks (in order of how much memory they free)

1. Drop MTP — likely +1.0-1.5 GB

2. Raise gpu-memory-utilization — likely +0.5-1 GB

3. Lower max-model-len to 16K-24K initially — proves boot works

Combined first attempt

The honest take — is NVFP4 worth chasing here?

What I'd suggest instead

If you boot NVFP4 anyway, what we'd love to see

Uh oh!

apnar May 5, 2026 Author

Uh oh!

noonghunna May 5, 2026 Maintainer

Why the numbers are this slow

Recommendation: AutoRound INT4 is the daily-driver on Blackwell consumer

What would be useful from your side

Cross-link to genesis bug

Uh oh!

noonghunna May 5, 2026 Maintainer

apnar
May 4, 2026

Replies: 10 comments 4 replies

apnar
May 4, 2026
Author

apnar
May 4, 2026
Author

noonghunna
May 4, 2026
Maintainer

apnar
May 4, 2026
Author

noonghunna May 4, 2026
Maintainer

noonghunna May 4, 2026
Maintainer

apnar
May 4, 2026
Author

apnar
May 4, 2026
Author

noonghunna May 4, 2026
Maintainer

Try `--language-model-only`

apnar
May 4, 2026
Author

noonghunna May 4, 2026
Maintainer

apnar
May 5, 2026
Author

noonghunna
May 5, 2026
Maintainer

noonghunna
May 5, 2026
Maintainer