Genesis vLLM Patches v7.64 released — please test #19

Sandermage · 2026-04-30T22:49:01Z

Just shipped v7.64. Tried to address everything that came out of cross-rig
work over the last couple weeks, especially the cliffs you have been hitting
on the 3090s.

What's included

Bug fixes that close existing issues:

P67 now works on non-power-of-2 GQA — Qwen3.6-27B (GQA=24/4=6) no longer
falls through to the broken upstream path under TQ k8v4 + FULL_AND_PIECEWISE
cudagraph. Tool-call went 0/5 → 7/7 on the 2× A5000 validation. Closes the
GQA-pow-2 compile error class.
PN17 + PN19 actually apply now — both had a wiring kwargs bug and were
silently skipping. PN17 frees 50-100 MiB on long-context FA2 (resolves
Cliff 1 mech A from your diagnosis), PN19 frees 200-500 MiB during model load.
Per-model patch matrix bisected — 5 defensive patches help 27B (+9%
wall TPS) but regress 35B FP8 (−4%). 27B default carries them, 35B default
does not. Documented per-model so nobody auto-enables across configs.
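
For readers unfamiliar with the GQA issue above: the group size is the number of query heads divided by the number of KV heads, and 24/4 = 6 is not a power of two, which tiled GPU kernels often assume. The sketch below only shows the check and the nearest power of two above it. The helper names are illustrative, and this is not how P67 itself is implemented.

```python
def gqa_group_size(num_q_heads: int, num_kv_heads: int) -> int:
    """Query heads served by each KV head."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must be a multiple of KV heads")
    return num_q_heads // num_kv_heads


def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


group = gqa_group_size(24, 4)  # the Qwen3.6-27B shape quoted above
print(group, is_power_of_two(group), next_power_of_two(group))
```

A kernel that assumes power-of-two groups would need either padding up to 8 here or a code path that handles 6 directly.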

New launch script variants:

start_27b_int4_TQ_k8v4.sh           # main 27B PROD (5× KV pool, MTP K=3)
start_27b_int4_TQ_k8v4_NGRAM.sh     # tool-use heavy
start_27b_int4_DFLASH.sh            # coding agents (135 TPS code workload)
start_27b_int4_fp8_e5m2_short.sh    # short-to-mid ctx
start_27b_int4_fp8_e5m2_long_256K.sh
start_35b_fp8_PROD.sh
start_35b_fp8_DFLASH.sh

6 new docs files: docs/GLOSSARY.md (terms), docs/HARDWARE.md (VRAM
budget + GPU class), docs/FAQ.md, docs/CONFIGS.md (add-your-own-model
walkthrough), docs/CLIFFS.md (8 cliffs catalogued), CONTRIBUTING.md.

Repo structure cleanup — tried to make navigation obvious so future-me
does not get lost. Doc map in README, per-launch scripts named by KV dtype +
workload.

Asking for

Tests on dual-3090 — the 27B+TQ k8v4 variant should land within ~5% of
our A5000 numbers (95-100 TPS @ 256-512t). If it does not on 3090, that is
interesting and I want to know.
Feedback on docs — especially CONFIGS.md. Was the walkthrough enough
to add your own model? What was missing?
Backport priorities — top candidates I am tracking for next release:
PR #40898 (DFlash SWA, +25% acceptance length), PR #39419 (local argmax TP,
+9-30% on TP=2), PR #41306 mitigation (--moe-backend=triton). Which of
these would matter most for your workload?
A star if it has been useful — https://github.com/Sandermage/genesis-vllm-patches

If something looks off — please tell me. Tests on my side show no problem,
but I am open to being wrong if you have a counter-example.

Cheers and thanks for keeping this thing honest.

noonghunna · 2026-05-01T00:17:56Z

Maintainer

Just upgraded our pin to v7.64 today and started cross-rig validation on RTX 3090 single-card (vLLM 0.19.2rc1.dev205+g07351e088, our older 0.19 pin). Notes from the boot:

Anchor drift status on our pin:

PN12 still drift-skipped: drift detection caught it (the [Genesis] DRIFT skipped: PN12 ... required_anchor_missing log line is exactly what we needed; well done). Our local class-scoped anchor sidecar is still required on this pin to actually patch SiluAndMul.forward_cuda. Fully expected, since the underlying drift is dev205-specific (your closure of our PR #13 was correct).
PN17 applies cleanly with GENESIS_ENABLE_PN17_FA2_LSE_CLAMP=1. Replaces our P104 sidecar — diagnosis credit much appreciated. We have P104 disabled at the env level on long-text.
PN19 costs ~120 MiB KV pool on Ampere consumer (single 3090, 24 GB). With it enabled at 218K + 0.985 mem-util, engine init fails: 3.52 GiB KV cache is needed, available KV cache memory (3.4 GiB), estimated maximum model length is 206400. Disabled on our long-text variant; could re-enable if we drop ctx to ≤200K. Worth flagging in docs/CLIFFS.md that the documented "200-500 MiB win on H100" is actually negative on Ampere consumer (different allocator behavior under PyTorch 2.10+ load-time fragmentation, possibly).
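
The ~120 MiB figure lines up with the engine-init error quoted above. A quick arithmetic sketch, assuming the two logged GiB values are rounded as shown:

```python
GIB_TO_MIB = 1024


def kv_shortfall_mib(needed_gib: float, available_gib: float) -> float:
    """Gap between the KV cache the engine needs and what is available, in MiB."""
    return (needed_gib - available_gib) * GIB_TO_MIB


shortfall = kv_shortfall_mib(needed_gib=3.52, available_gib=3.4)
print(f"KV cache shortfall: {shortfall:.0f} MiB")
```

Because the log values are rounded to two digits, the true gap could differ by several MiB in either direction.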

Open issue v7.64 doesn't address — forward_native inductor bypass: PN12 only patches forward_cuda. With custom_ops=["none"] (default under Inductor), SiluAndMul.__call__ dispatches to forward_native, which Inductor inlines and lowers to empty_strided_cuda(...) — bypassing the pool entirely. We see Cliff 1 mech B fire on real OpenCode workloads (~30K sys+tools prefill) even with PN12 marked "applied". Tracking in club-3090#16. I'm currently testing a torch.library.custom_op-based sidecar that patches forward_native to route through an opaque op (so Inductor can't inline it). Happy to PR upstream if it works — the win is that PN12 closure becomes pin-independent (works under any custom_ops default).
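
The dispatch gap described here can be shown with a toy class. This is a deliberately simplified stand-in, not vLLM's SiluAndMul or its dispatch code; the `use_custom_op` flag is a hypothetical proxy for the custom_ops setting.

```python
class ToySiluAndMul:
    """Toy layer with two implementations chosen at call time."""

    def __init__(self, use_custom_op: bool):
        self.use_custom_op = use_custom_op

    def forward_cuda(self, x):
        return "cuda:unpooled"

    def forward_native(self, x):
        return "native:unpooled"

    def __call__(self, x):
        impl = self.forward_cuda if self.use_custom_op else self.forward_native
        return impl(x)


def pooled_forward_cuda(self, x):
    return "cuda:pooled"


# A patch that only replaces forward_cuda, as PN12 is described as doing.
ToySiluAndMul.forward_cuda = pooled_forward_cuda

eager = ToySiluAndMul(use_custom_op=True)
compiled = ToySiluAndMul(use_custom_op=False)  # mimics custom_ops=["none"]
print(eager(None))     # takes the patched path
print(compiled(None))  # still takes the unpatched native path
```

The patch reports as applied in both cases, which is why the second path can leak without any visible warning.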

Pin question: all of v7.64's empirical validation was on 0.20.1rc1.dev16+g7a1eb8ac2. We're considering a vLLM image bump to match yours — want to confirm: is that the recommended dev pin going forward, or are you tracking a moving target? If pinning, we'd update our compose to vllm/vllm-openai:nightly-7a1eb8ac2ec4ea69338c51dc7afd4b15010abfa8.

Backport priority feedback for v7.65:

PR #40898 (DFlash SWA, +25% AL) — biggest win on dual-card here, DFlash is our peak code-TPS path on TP=2. Strong yes.
PR #41306 mitigation (--moe-backend=triton) — useful for any future MoE we serve. Yes.
PR #39419 (local argmax TP) — limited value on PCIe-only TP=2 (our setup), more useful for NVLink rigs. Optional.

Also strongly +1 on the Cliff 8 hardening (partial_apply_warnings counter in boot summary). The silent-skip class is what bit us — drift detection is the right tool, and surfacing it in the summary so users can't miss it would close the loop.

Will share apples-to-apples 27B+TQ k8v4 numbers after the current bench finishes.

⭐ given.

Sandermage · 2026-05-01T01:02:01Z

Author

Hey @noonghunna — thanks for the detailed boot report, this is exactly the kind of cross-rig validation I was hoping for. Quick AI-translated notes from Odessa (apologies for any English roughness).

On the pin question — when I write a vLLM SHA in the README, that means I've actually rebuilt at that SHA, re-run the validator on it, re-baked the patcher under it, and re-validated the reproducer (35B FP8 tool-call + 27B Lorbus). It's not a "we should be on this someday" — it's "this is what I'm running right now, and it works".

Update cadence is value-vs-regression, not a calendar. Concrete examples for context:

PR #40941 (TurboQuant share buffers) merged → bumped pin same week, freed ~57 GB @ 1M ctx, no downside.
PR #39055 + #40738 merged → retired P59/P60, patcher shed weight.
PR #40662 (synthetic acceptance) landed +12% TPS on 35B but broke tool-calling through the OpenAI tools API. Did NOT bump despite the speedup. Speed without correctness is a regression.

So when you see 0.20.1rc1.dev16+g7a1eb8ac2 in our README, it has already cleared both gates. Safe to bump your image to match — that's what we're running on the A5000 fleet today (2026-05-01). Will flag in the README when we move.
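
The two-gate policy can be written down as a tiny decision function. This is only a sketch of the rule as stated in this reply; the `Candidate` type and its field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    value_gain: bool      # measurable benefit: speed, memory, or retired patches
    correctness_ok: bool  # validator, reproducer, and tool-call checks all pass


def should_bump_pin(c: Candidate) -> bool:
    """Bump only when there is a gain AND nothing regresses."""
    return c.value_gain and c.correctness_ok


candidates = [
    Candidate("PR #40941 (TurboQuant share buffers)", True, True),
    Candidate("PR #40662 (synthetic acceptance)", True, False),
]
decisions = {c.name: should_bump_pin(c) for c in candidates}
for name, bump in decisions.items():
    print(f"{name}: {'bump' if bump else 'hold'}")
```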

On PN25 (forward_native inductor bypass) — you're absolutely right that PN12 leaks past the compile path. Genesis stack has the same flaw; we don't hit it in PROD only because our 27B Lorbus + cudagraph FULL_AND_PIECEWISE config short-circuits the inductor pipeline on this kernel. Future inductor-default configs would expose it. Just landed PN25 on dev (2b239a8) — same torch.library.custom_op approach you're testing. Sister-patch to PN12: PN12 covers eager forward_cuda, PN25 covers compile forward_native via opaque op. Independent convergence is a good sign — happy to compare notes if you push your branch and we can A/B against ours. Full writeup with file links posted in club-3090#16.

On PN19 not transferring from H100: agreed, will flag in CLIFFS.md that the 200-500 MiB win is H100-specific. We saw similar non-transfer on P104 L2 persistence (a 16.2% regression on setups with 32+ layers where KV far exceeds L2). Generic allocator hints don't survive GPU-class jumps.

On Cliff 8 hardening: the partial_apply_warnings counter in the boot summary will ship in v7.65. The dispatcher already has the data; it just needs to be surfaced next to the applied/skipped/failed line.
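
A rough sketch of what such a summary line could look like. The status labels and the summary format are hypothetical; the real dispatcher's data model may differ.

```python
from collections import Counter

# Hypothetical per-patch outcomes; label names are assumptions.
results = {
    "P67": "applied",
    "PN17": "applied",
    "PN12": "drift_skipped",  # required anchor missing on this pin
    "P82": "opt_in_off",      # deliberately disabled
}


def boot_summary(results: dict[str, str]) -> str:
    counts = Counter(results.values())
    skipped = counts["opt_in_off"] + counts["drift_skipped"]
    return (
        f"applied={counts['applied']} skipped={skipped} "
        f"failed={counts['failed']} "
        f"partial_apply_warnings={counts['drift_skipped']}"
    )


summary = boot_summary(results)
print(summary)
```

Keeping drift skips in their own counter is the point: they stop hiding inside the same total as intentional opt-outs.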

On backports — your priorities track mine:

PN21 (#40898 partial) shipped opt-in OFF on dev (1ac34a8). 5-6/7 tool-call vs 7/7 baseline on partial backport — not viable until full model-class changes land or upstream merges. File is in tree for future activation.
PN22 (#39419) shipped, +9.8% on 27B DFlash code where it fires (limited on PCIe TP, agreed).
#41306 still investigating; A1 attempt failed on FP8 e4m3fn block-scaled (ValueError, Triton MoE not supported). Needs a different test model.

Numbers from your apples-to-apples 27B+TQ k8v4 bench will be very useful — esp. if 3090 lands within 5% of our A5000 reference. Hard data on that gap is one of the things I haven't been able to gather on my own rig.

⭐ much appreciated, and thanks for keeping the cross-rig pipeline honest.

noonghunna · 2026-05-01T01:12:09Z

Maintainer

Thanks for the pin clarity. Clear gate criteria (rebuilt + validator + reproducer + tools-API regression check) are exactly what make the SHA actionable rather than aspirational. Will track your README for future pin moves.

v0.20 result on our 3090 / Qwen3.6-27B + TQ3 + MTP K=3 config — different outcome from your A5000 fleet. Boot was clean (all v7.64 patches apply natively, including PN25 sister-pair), but engine crashed during MTP draft proposal at long prefill:

AssertionError: Workspace is locked but allocation from
'turboquant_attn.py:951:_decode_attention' requires 0.76 MB,
current size is 0.00 MB. Workspace growth is not allowed after locking.

Stack: propose_draft_token_ids → llm_base_proposer.propose → qwen3_5_mtp.forward → unified_attention_with_output → turboquant_attn.forward → _decode_attention → workspace.get_simultaneous → _ensure_workspace_size.

Root-caused to vllm#39226 — the strict WorkspaceManager.lock() semantics that landed in v0.20.0. TurboQuant _decode_attention requests 0.76 MB at runtime, but profile_run doesn't appear to exercise that path on our config so the workspace gets locked at 0.00 MB. The lock then refuses the grow.
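
The failure mode can be reproduced with a toy model of a workspace that is sized during profiling and then locked. This is not vLLM's WorkspaceManager; the class and method names below are invented for illustration.

```python
class ToyWorkspace:
    """Toy workspace: may grow until locked, never afterwards."""

    def __init__(self):
        self.size_mb = 0.0
        self.locked = False

    def ensure(self, required_mb: float, caller: str) -> None:
        if required_mb <= self.size_mb:
            return
        if self.locked:
            raise AssertionError(
                f"Workspace is locked but allocation from '{caller}' requires "
                f"{required_mb:.2f} MB, current size is {self.size_mb:.2f} MB."
            )
        self.size_mb = required_mb

    def lock(self) -> None:
        self.locked = True


def run(profile_exercises_decode: bool) -> str:
    ws = ToyWorkspace()
    if profile_exercises_decode:
        ws.ensure(0.76, "_decode_attention")  # sized during the profile run
    ws.lock()
    try:
        ws.ensure(0.76, "_decode_attention")  # first real decode call
        return "ok"
    except AssertionError as exc:
        return f"crash: {exc}"


print(run(profile_exercises_decode=True))
print(run(profile_exercises_decode=False))
```

Only the second call fails, matching the hypothesis that the profile run never touched the decode path on this config.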

Likely a config-shape mismatch between our setups (yours probably hits TQ decode during profile_run via different draft/cache geometry, ours doesn't — guessing TP=1 + INT4 group_size=128 Marlin + MTP K=3 vs A5000 TP=N + Lorbus). Tracking on a separate v0.20-experimental branch in our repo so master stays on dev205 until either (a) profile_run exercises the path on our config, or (b) WorkspaceManager gets a "protected callsite can grow" carve-out upstream.

On PN25 + P38 — both extremely relevant for our cliff investigation. Detailed reply on club-3090#16 but short version: PN25 is independent convergence on the same fix we just shipped locally; we'll plumb your dev version through and A/B against ours when next bench window opens. P38 directly addresses ampersandru's _continuation_prefill OOM — testing it is now top priority for our long-text variant.

Will also surface the H100-vs-Ampere PN19 footnote in our CLIFFS.md so users on consumer 3090 don't re-discover the negative.

⭐ — and "Speed without correctness is a regression" should be the tagline for this whole project tbh.

noonghunna · 2026-05-01T12:33:22Z

Maintainer

27B + TQ k8v4 dual-3090 bench + CONFIGS.md feedback (closing out the asks from your v7.64 ship post).

1. TQ k8v4 dual-3090 bench

Our compose: 2× RTX 3090 24 GB PCIe (no NVLink), TP=2, AutoRound INT4, vLLM dev205+g07351e088, Genesis v7.64 (64dd18b), kv-cache-dtype=turboquant_k8v4, MTP K=3, max-model-len 262144, mem-util 0.90, max-num-seqs 2, max-num-batched-tokens 4128, prefix-caching off (per your launch script). Genesis env vars enabled: P64, P65, P66, P67, P98, P101, PN13, plus VLLM_FLOAT32_MATMUL_PRECISION=high + VLLM_SSM_CONV_STATE_LAYOUT=DS. Bench: scripts/bench.sh (3 warm + 5 measured, narrative ~600-char prompt → max 1000 tokens, code ~80-char prompt → max 800 tokens).

Metric	Narrative (n=5)	Code (n=5)
wall_TPS mean	61.45 (CV 3.8%)	77.51 (CV 2.4%)
decode_TPS mean	62.04 (CV 3.9%)	79.50 (CV 2.9%)
TTFT mean	150ms	152ms
MTP AL	3.22	3.22
MTP per-pos accept	0.91 / 0.76 / 0.56	same range
Avg draft accept	74.0%	74.0%
VRAM per card	20.4 GB / 24 GB (mem-util 0.90)	same
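
The MTP rows in the table are mutually consistent, which is a useful sanity check on the bench. The sketch assumes the usual speculative-decoding bookkeeping: each step emits one guaranteed token plus however many draft tokens are accepted.

```python
def mean_acceptance_length(per_position_accept: list[float]) -> float:
    """One guaranteed token per step plus the acceptance rate of each draft slot."""
    return 1.0 + sum(per_position_accept)


def avg_draft_accept(acceptance_length: float, k: int) -> float:
    """Fraction of drafted tokens accepted, given AL and K draft tokens per step."""
    return (acceptance_length - 1.0) / k


al_from_positions = mean_acceptance_length([0.91, 0.76, 0.56])
accept_from_al = avg_draft_accept(3.22, k=3)
print(f"{al_from_positions:.2f} {accept_from_al:.1%}")
```

The per-position rates give 3.23 against the reported 3.22, a difference that rounding of the three rates can explain, and AL 3.22 at K=3 gives the reported 74.0%.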

Comparing to your A5000 reference (bare_metal_27b_int4_TQ_k8v4.sh header: "Tested: 2× RTX A5000, wall_TPS 89.23 over N=500 stress"):

Our 3090 code wall_TPS 77.51 vs your A5000 89.23 → ~13% below.
That's outside the ±5% band you flagged as "interesting if it doesn't land within."
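
The gap arithmetic, written out so the same check can be reused for later runs. The function names are illustrative.

```python
def gap_pct(measured: float, reference: float) -> float:
    """Signed gap of measured vs reference, in percent."""
    return (measured / reference - 1.0) * 100.0


def within_band(measured: float, reference: float, band_pct: float = 5.0) -> bool:
    return abs(gap_pct(measured, reference)) <= band_pct


gap = gap_pct(77.51, 89.23)
print(f"{gap:+.1f}% within_band={within_band(77.51, 89.23)}")
```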

Most likely contributors to the gap:

Genesis env-var subset. Our compose enables ~7 of the ~25+ env vars in your bare-metal launch. Notably missing: P58, P60/P60b, P61/P61b, P62, P72 + GENESIS_PROFILE_RUN_CAP_M=4096, P74, P82, P83, P85, P87, P91, P99, P100, PN11, PREALLOC_TOKEN_BUDGET=4096, BUFFER_MODE=shared. P82/P85/P87/P91 in particular are documented as PROD-on for your stack and we know P87 is documented as +24% on Ampere AutoRound INT4 (I'm assuming we silently lose the 24% by not enabling it). I expect adding the full set closes most of the gap; haven't run the A/B yet.
vLLM pin. We're stuck on dev205+g07351e088 (v0.19) because v0.20 hits the workspace-lock regression in vllm#39226 on our config — can't even boot long-text on v0.20 (detail in our docs/CLIFFS.md). Your A5000 numbers are on 0.20.1rc1.dev16+g7a1eb8ac2, which has TQ FA3/FA4 prefill paths (vllm#40092). That alone could be a few %.
Hardware nuance. A5000 has lower memory bandwidth (768 vs 936 GB/s on the 3090, roughly 18% less) but a steady-state ECC + PRO driver path; the 3090 is more bursty under sustained load. Probably ≤2-3% delta.

Side-by-side memory profile (might be useful for triage): at 0.90 mem-util we sit at 20.4 GB/24 GB per card, leaving ~3.5 GB headroom on each — plenty of activation room. So the gap isn't OOM-pressure; it's pure throughput.

Want me to A/B with the full env-var set? Happy to run if useful — would isolate whether the 13% is env-vars or pin/hardware.

2. CONFIGS.md walkthrough feedback

Walked through the doc end-to-end while the bench booted. Strong overall — quick decision tree at the top, "5 things to write down" before editing, per-bucket patch lists with 1-line "what does it do" each, and the worked Llama-3 70B example at the end ("generic patches work outside Qwen") all hit the target.

Friction points that surfaced when I tried to mentally re-execute it for our 27B/3090/Docker setup:

Script naming mismatch. Step 2 references start_27b_int4_fp8_e5m2_short_single_card.sh. Actual files in scripts/launch/ are bare_metal_27b_int4_no_TQ_short_single_card.sh, bare_metal_27b_int4_TQ_k8v4.sh, etc. Different prefix (start_* vs bare_metal_*) and one says no_TQ vs fp8_e5m2. New users will ls scripts/launch/ and not find what the doc references. The doc names are clearer; the script names are more accurate. Pick one and align — biggest-impact single change.
Docker compose path is invisible. The whole guide assumes bash scripts/start_<...>.sh. Many operators deploy via Docker compose (we do). Your bare-metal scripts don't run as-is inside containers (they pip install ...). A short Step 5b: "If you deploy via Docker, mirror the bare-metal env vars + flags into your compose's environment: block + command: list" with a link to a reference compose would close the loop.

TQ k8v4 deps are scattered. The "OOM → switch to k8v4" tip in Step 6 says "needs P4 + P67 + P98" but the Step 4 walkthrough lists P4, P67/P67b, P98, P101, PN8. P4 has no description in either doc. Suggested fix: a single copy-paste TQ-k8v4-minimal block:

export GENESIS_ENABLE_P4=1                      # required, removes hybrid TQ rejection
export GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL=1
export GENESIS_ENABLE_P98=1                     # required, WorkspaceManager fix vs vllm#40941
export GENESIS_ENABLE_P101=1                    # TQ packed-slot layout

API key is repo-baked. Step 5's smoke test uses Authorization: Bearer genesis-local. Outside-Genesis users won't have that. One-liner addition: "If you launched without --api-key, drop the Authorization header."
Step 5 bench tools/genesis_bench_suite.py works — confirmed it's in tree. But the doc doesn't say what shape input it expects. Add: "Edit NARR_PROMPTS / CODE_PROMPTS in the script if your model has a different chat template format."
Spec-decode trio (ngram / MTP / DFlash) listed without gating signals. Step 3's "pick one" doesn't repeat "MTP requires HF repo carrying mtp_* weights / DFlash requires the z-lab drafter checkpoint download". Step 1 mentions these briefly; Step 3 doesn't link back. Cheap fix: cross-reference Step 1 from Step 3.
DFlash on hybrid models has a vLLM PR backport requirement. Worth a one-line caveat in the DFlash spec-decode option that PR #40898 partial backport leaves +25% AL gap open.
Step 7 — submit-back flow doesn't mention CONTRIBUTING.md's bench-output format spec. Link the format spec.

Smallest single-change with biggest impact: fix script naming (#1). That's the first thing every new operator runs into and the doc directly disagrees with ls.
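
A mismatch like #1 is cheap to catch mechanically. The sketch below extracts script names from doc text and compares them against a set of filenames; the regex and function names are assumptions, and in practice the filename set would come from listing scripts/launch/.

```python
import re

# Matches names such as start_*.sh or bare_metal_*.sh in free text.
SCRIPT_RE = re.compile(r"\b(?:start|bare_metal)_\w+\.sh\b")


def referenced_scripts(doc_text: str) -> set[str]:
    return set(SCRIPT_RE.findall(doc_text))


def missing_from_tree(doc_text: str, files_on_disk: set[str]) -> set[str]:
    """Script names the doc mentions that do not exist on disk."""
    return referenced_scripts(doc_text) - files_on_disk


doc = "Step 2: run start_27b_int4_fp8_e5m2_short_single_card.sh"
on_disk = {
    "bare_metal_27b_int4_no_TQ_short_single_card.sh",
    "bare_metal_27b_int4_TQ_k8v4.sh",
}
missing = missing_from_tree(doc, on_disk)
print(missing)
```

Run as a docs check in CI, this would flag the doc the moment a script is renamed.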

Both items closed out. Backport priorities + ⭐ already in our previous reply. P38 silent-no-op trace filed separately as genesis-vllm-patches#14.

Sandermage · 2026-05-01T14:58:04Z

Author

Hey @noonghunna — thanks for the dual-3090 bench data + CONFIGS feedback. Gonna address both halves carefully so the facts stay grounded.

1. The 13% gap on TQ k8v4

Your gut is right that the env-var subset is the dominant contributor. Yes please run the A/B with the full set — this is the cleanest data point we can get for the doc.

One factual correction first: P82 is actually OFF in our 27B PROD launch script, not on. Verifiable in scripts/start_27b_int4_TQ_k8v4.sh:47 — GENESIS_ENABLE_P82=0. It's biased on small-batch single-stream Lorbus INT4 + MTP K=3 per memory feedback_p82_*; ships disabled. So your set should NOT include P82=1 if you're matching our PROD config.

The patches that are PROD-on for the 27B TQ k8v4 path (verified from current launch script):

P4=1 P58=1 P60=1 P60B=1 P61=1 P61B=1 P62=1 P64=1 P66=1 P67=1
P68=1 P69=1 P72=1 P74=1 P83=1 P85=1 P87=1 P91=1 P94=1 P98=1
P99=1 P100=1 P101=1 P103=1
PN8=1 PN9=1 PN11=1 PN12=1 PN13=1 PN14=1 PN17=1 PN19=1 PN22=1
PN26b=1 (with BLOCK_KV=8 num_warps=4 threshold=0.01)
P38B=1 P15B=1  ← NEW (issues #14 #15 fixes — see below)
GENESIS_PROFILE_RUN_CAP_M=4096
GENESIS_PREALLOC_TOKEN_BUDGET=4096
GENESIS_BUFFER_MODE=shared

P82 stays OFF; P78 stays OFF. (The launch script is the source of truth; copy from lines 36–53.)
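
To see exactly what the earlier dual-3090 subset was missing, the two lists from this thread can be diffed as sets. The short names below are shorthand; the real environment variables carry longer names such as GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL.

```python
# PROD-on patches as listed above (short names only).
PROD_ON = set("""
P4 P58 P60 P60B P61 P61B P62 P64 P66 P67
P68 P69 P72 P74 P83 P85 P87 P91 P94 P98
P99 P100 P101 P103
PN8 PN9 PN11 PN12 PN13 PN14 PN17 PN19 PN22
PN26b P38B P15B
""".split())

# Subset enabled in the first dual-3090 bench, per the earlier post.
SUBSET = {"P64", "P65", "P66", "P67", "P98", "P101", "PN13"}

missing = sorted(PROD_ON - SUBSET)
extra = sorted(SUBSET - PROD_ON)
print(f"missing {len(missing)}: {missing}")
print(f"extra: {extra}")
```

Note that the diff also reports P65 as enabled in the subset but absent from the PROD list, which may be worth confirming.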

On your three contributing factors:

Env-var subset — agreed dominant. P87 is documented +24% on AutoRound INT4 specifically (per memory project_genesis_int8_27b_optimization_recipe.md); the +24% number was measured on Marlin path (group_size=128), so AutoRound INT4 should reproduce. P85 is the prefix-cache-aware path that composes with P87.
vLLM pin (dev205 vs g7a1eb8ac2): agreed, a few %. The issue you cite (vllm#39226, the workspace lock blocking your v0.20 boot) is a real upstream regression. Our PROD on g7a1eb8ac2 doesn't hit it because we run --max-num-seqs=2 × MTP K=3, which shapes the workspace differently from your TP=1 + MTP K=3 single-card config. Investigated yesterday: see Genesis_internal_docs/workspace_lock_investigation_20260501.md (in our private notes). Short version: enabling GENESIS_ENABLE_P98=1 on long-text composes correctly with our pin and avoids the lock; you already have P98=1 per your env list.
Hardware nuance (3090 vs A5000) — agreed ≤2-3%.

So at full env-var set + same pin: expected ~5-8% residual gap at most, closer to your hardware-only floor.

2. CONFIGS.md feedback — all 8 friction points addressed

Every one was actionable. Pushed fixes to dev in 6f4b937. Per-point response:

#1 Script naming mismatch. Fixed. Updated table to reference real files (scripts/launch/start_27b_int4_no_TQ_short_single_card.sh etc.). Added explicit naming convention block: start_* = Docker, bare_metal_* = host install; _no_TQ_* is the historical name for fp8_e5m2 KV (without TurboQuant). Aligned the doc to filesystem reality rather than the other way round.

#2 Docker compose path invisible. Added new Step 2b — Docker compose mirror with worked compose snippet (~25 env vars from start_27b_int4_TQ_k8v4.sh translated to environment: + command: blocks). Explicit pointer to the script as source-of-truth.

#3 TQ k8v4 deps scattered + P4 has no description. P4 description was actually present at line 243 — "P4 — required, removes hybrid TQ rejection" — but you're right it was hard to find. Added two consolidated copy-paste blocks: "Required for boot" (P4 + P67 + P98 + P101 + PN8) and "Recommended PROD additions" (~25 env vars). The required block is now the first thing readers see in the TQ k8v4 section.

#4 API key repo-baked. Added explicit fallback note in Step 5's smoke test: "If you launched without --api-key, drop the Authorization header." Also notes the launch scripts include --api-key genesis-local.

#5 tools/genesis_bench_suite.py input shape. Added warning about chat-template mismatch as the silent killer (Qwen vs Llama vs Mistral): "wrong template causes the bench to look slow because every reply gets a 'I cannot help with that' stop after a few tokens."

#6 Spec-decode trio without gating signals. Step 3's spec-decode block now back-links Step 1 §4 for each method capability check (ngram always works / MTP needs mtp_* weights / DFlash needs z-lab drafter download).

#7 DFlash on hybrid models — PR #40898 caveat. Added 1-line caveat next to the DFlash spec-decode option pointing at vllm#40898 OPEN status + Genesis PN21 partial backport state. ~25% acceptance-length gap acknowledged until upstream merges.

#8 Step 7 — submit-back format spec. Step 7 already links CONTRIBUTING.md for bench-output format. No change needed beyond what's there.

Thanks for the prioritization: script naming was indeed the biggest first-touch friction and got fixed first.
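
For the Docker compose mirror in #2, the translation from launch-script exports to a compose environment: list is mechanical. The helper below is a hypothetical sketch, not a tool that ships in the repo, and the sample script text is abbreviated.

```python
import shlex


def exports_to_compose_env(script_text: str) -> list[str]:
    """Turn `export KEY=VALUE` lines into compose `environment:` list entries."""
    entries = []
    for line in script_text.splitlines():
        tokens = shlex.split(line, comments=True)  # drops trailing "# ..." comments
        if len(tokens) == 2 and tokens[0] == "export" and "=" in tokens[1]:
            entries.append(tokens[1])
    return entries


# Abbreviated sample; the pip line stands in for any non-export command.
script = """
export GENESIS_ENABLE_P4=1                      # required
export GENESIS_ENABLE_P98=1
pip install something
"""
env_entries = exports_to_compose_env(script)
print("environment:")
for entry in env_entries:
    print(f"  - {entry}")
```

Generating the block from the script keeps the compose file from drifting away from the source-of-truth launch script.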

3. Heads-up: fixes for your issues #14 + #15 have landed on `dev`

Independently of the v7.64 discussion thread, fixes for both of your bug reports have landed:

Issue #14 (P38 silent no-op on TQ KV) — fixed via P38B compile-safe in-source hook (f289e07). Diagnosis was exactly right: aot_compile_fullgraph captures the original method body; class-attribute rebind doesn't propagate. Fix: text-patch the source file to inject a delegate hook at the start of _continuation_prefill body, then install _genesis_p38_dispatch class attribute. Source-level edit means aot_compile captures the hook itself. Independent convergence with your patch_pn12_compile_safe_custom_op.py — same problem class, two viable mechanisms (custom_op vs in-source delegate). Detailed reply at issue #14 comment.
Issue #15 (FA varlen workspace cliff) — fixed via P15B FA varlen max_seqlen_k clamp (same commit). Direct backport of your suggestion path 1 — text-patch turboquant_attn.py:_flash_attn_varlen to compute actual max from cu_seqlens_k and clamp before invoking the FA wrapper. Trade-off (one GPU→CPU sync per call on infrequent path) is statistically invisible at our 100t output bench. Detailed reply at issue #15 comment.
Issue #9 (P68/P69 threshold) — closed. Default raised 8000→50000 in v7.65 (d73fa9d).
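
The issue #14 mechanism can be illustrated in plain Python. Capturing a function object before patching stands in for what aot_compile_fullgraph is described as doing; the classes are toys and the hook name is invented.

```python
class ToyAttn:
    """Plain method: a captured reference never sees later rebinding."""

    def prefill(self):
        return "original"


class ToyAttnHooked:
    _dispatch = None  # class attribute consulted at call time

    def prefill(self):
        hook = type(self)._dispatch
        if hook is not None:  # delegate hook at the top of the method body
            return hook(self)
        return "original"


# Stand-in for a compiler capturing the method bodies before any patching.
captured_plain = ToyAttn.prefill
captured_hooked = ToyAttnHooked.prefill

# Attempt 1: rebind the class attribute. The captured body is unaffected.
ToyAttn.prefill = lambda self: "patched"
# Attempt 2: install the target that the in-body hook looks up.
ToyAttnHooked._dispatch = lambda self: "patched"

print(captured_plain(ToyAttn()))
print(captured_hooked(ToyAttnHooked()))
```

The second form works because the lookup happens inside the captured body itself, which is the idea behind editing the source rather than rebinding the attribute.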

Both P38B + P15B are opt-in (GENESIS_ENABLE_P38B_COMPILE_SAFE=1 and GENESIS_ENABLE_P15B_FA_VARLEN_CLAMP=1); enabled in our PROD launch from the same v7.65 commits. Validated boot + sustained 50-req on 27B PROD: 7/7 tool-call, 0 errors, no observable runtime regression vs PN26b alone.

Worth A/B-ing against your config — P15B specifically targets the long-vision OOM you traced (50 MiB workspace alloc at flash_attn_interface.py:300).
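
The clamp idea behind P15B, sketched with plain lists instead of tensors. The function names and the sample batch are hypothetical; with GPU tensors, reading the maximum back to the host is where the one sync per call comes from.

```python
def actual_max_seqlen(cu_seqlens: list[int]) -> int:
    """Longest sequence implied by cumulative lengths [0, l0, l0+l1, ...]."""
    return max((b - a for a, b in zip(cu_seqlens, cu_seqlens[1:])), default=0)


def clamp_max_seqlen(declared_max: int, cu_seqlens: list[int]) -> int:
    """Never pass a max_seqlen larger than what the batch actually contains."""
    return min(declared_max, actual_max_seqlen(cu_seqlens))


cu_seqlens_k = [0, 1200, 1264, 5360]  # hypothetical batch: lengths 1200, 64, 4096
clamped = clamp_max_seqlen(262144, cu_seqlens_k)
print(clamped)
```

Sizing the workspace from 4096 rather than the 262144 model maximum is what would avoid the oversized allocation.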

4. New work landing on `dev` you may want to track

PN26b sparse-V kernel (866023c) — Genesis-original Triton fork of upstream PR #41422 design + BLASST λ=a/L scaffold + per-CTA atomic skip-rate counter. Per-model winner config: 35B PROD (BLOCK_KV=4, num_warps=4) +3.9% mean / +14.7% max; 27B PROD (BLOCK_KV=8, num_warps=4) +1.2% mean. First sparse-V kernel deployed for SM86 (Ampere consumer) — TRT-LLM #9821 + FlashInfer #2477 ship for SM90+ only.
Cliff 8 hardening (434c8ce) — partial_apply_warnings counter in boot summary. Surfaces silent anchor-drift skips that were previously buried in the same skipped count as opt-in OFF. (Promised in our previous reply.)
Discussion answer: pin policy + value-vs-regression criteria already in our prior reply. Holds.

When you run the A/B with the full env-var set, please share the wall_TPS — we'll cross-check against our A5000 reference and update bare_metal_27b_int4_TQ_k8v4.sh header if your dual-3090 number is consistent.

⭐ Thanks again for keeping the cross-rig pipeline honest and surfacing the doc gaps.

noonghunna · 2026-05-01T15:24:00Z

Maintainer

A/B bench results — full env-var set on TQ k8v4 dual-3090. ⭐

Following your reply: bumped Genesis pin to dev tip (d89a089 — includes P38B + P15B + PN26b + PN28), aligned env-var set to your start_27b_int4_TQ_k8v4.sh:36-53 (P82=0 corrected, full PROD list applied), kept the rest of the rig identical (vLLM dev205+g07351e088, RTX 3090 dual TP=2 PCIe-only, max_model_len=262144, mem_util=0.90, max_num_seqs=2, max_num_batched_tokens=4128, kv-cache-dtype=turboquant_k8v4, MTP K=3, no prefix-caching). All v7.65 dev patches active per dispatcher: P15B + P38B + PN26b sparse-V (BLOCK_KV=8 num_warps=4 threshold=0.01) all applied with explicit "sparse-V kernel deployed for SM86" log line.

Bench (n=5 measured, 3 warm, scripts/bench.sh canonical narrative + code prompts):

Metric	Earlier subset (Phase 1)	FULL env set + dev branch	Δ vs subset	A5000 ref (89.23)
Code wall_TPS mean	77.51 (CV 2.4%)	116.59 (CV 3.1%)	+50.4%	+30.7% over
Code decode_TPS mean	79.50	120.86 (CV 2.5%)	+52.0%	—
Narrative wall_TPS mean	61.45 (CV 3.8%)	92.12 (CV 2.3%)	+49.9%	within ~5% of 95-100 target
Narrative decode_TPS mean	62.04	93.33 (CV 2.3%)	+50.4%	—
MTP AL	3.22	3.12-3.40 (range across 3 metric ticks)	similar	—
TTFT	152ms	137ms (code) / 135ms (narrative)	-10%	—
GPU util	not captured	100% / 100%	TP=2 saturated	—
VRAM per card	20.4 GB	20.4 GB	same	—

Headline: code wall_TPS 116.59 on dual 3090, +30.7% over your A5000 89.23 reference. Narrative 92.12 lands about 3% under the lower edge of your 95-100 target band, so still inside the ~5% tolerance you asked about (likely the bench prompt class: narrative has more variable acceptance length, code is repetitive enough for MTP to amortize hard).

Variance analysis on the +50% jump (full vs subset):

The patches that were absent from our earlier subset and present now:

P87 (AutoRound INT4 row-parallel scales fix, +24% on Marlin path) — likely the single biggest contributor for AutoRound INT4
P85 (prefix-cache-aware path that composes with P87)
P98 (TQ WorkspaceManager revert — was already on, but the rest of the env stack lets it amortize better)
PN26b sparse-V kernel (BLOCK_KV=8 num_warps=4, your 27B-specific tuning) — first SM86 sparse-V kernel in tree
P58 / P60 / P60B / P61 / P61B / P62 — the spec-decode + tool-call quality stack
P72 + GENESIS_PROFILE_RUN_CAP_M=4096 + GENESIS_PREALLOC_TOKEN_BUDGET=4096 — profile/prealloc tuning that aligns workspace sizing
PN8 / PN9 / PN11 / PN12 / PN13 / PN14 — full hybrid GDN coverage stack
P38B + P15B (just landed in f289e07) — close the cliff cascade

If you want a per-patch ablation, I can run a few targeted A/Bs (e.g. +P87 only vs +PN26b only vs full set) — each takes ~10 min so a 5-cell sweep is feasible. Most useful for the doc would probably be (a) subset baseline (already have), (b) +P87 + P85 alone, (c) +PN26b alone, (d) full set (have). Let me know if you want that breakdown.

Notable: PN26b's "first sparse-V kernel deployed for SM86 (Ampere consumer)" log line is correct — Ampere consumer users now have a path to sparse-V tile-skip that doesn't exist anywhere upstream. That alone is a sizable contribution for the SM86 fleet beyond just our rig.

P38B + P15B both apply cleanly on our v0.20-blocked config → boot clean, sustained workload clean, no observable regression vs the pre-fix state.

Pin migration plan unblocked. With v7.65 carrying P38B + P15B + PN26b + PN25 + Cliff 8 hardening + P68/P69 threshold default, master can move to v0.20.1rc1.dev16 + Genesis v7.65 in one PR. Holding for v7.65 release tag — happy to test against any RC you cut.

CONFIGS.md fixes look great — pulled 6f4b937 and re-walked the doc; Step 2b Docker compose mirror block is exactly the pointer that was missing for non-bare-metal operators. The two consolidated copy-paste blocks (Required for boot / Recommended PROD additions) reorganize the TQ k8v4 section so a new operator can follow it without scrolling. P4 description is clearly the first thing in the new layout — good fix.

Update for the bare-metal launch header would be dual-3090: wall_TPS 116.59 code / 92.12 narrative over N=5 (MTP AL 3.12-3.40, CV ≤3.1%). Reproducer: noonghunna/club-3090 v0.20-experimental branch + models/qwen3.6-27b/vllm/compose/temp-k8v4-full.yml (will commit cleanly + push).

⭐ Thanks for landing #14 + #15 within hours.

Sandermage · 2026-05-01T18:01:56Z

Author

Thank you for the tests and data. They are extremely important to me and help make the project better and of higher quality.

I haven't finished with the new PN26b sparse-V kernel yet; in fact, I've been working on it for the last 6 hours while also fixing bugs.

I read all the comments and use a bot to track all repository activity, so I instantly see bug reports and suggestions from the community. I try to implement them as long as they don't distract too much from the project's main direction.

For the next 2-3 days, I don't plan on pushing anything to the main branch. Everything will go to dev for now. Once both you and I are confident that everything is solid, I will merge the updates into main. This is how our workflow will operate moving forward:

dev: for testing and new features

main: for stable releases

I apologize that I don't always reply. There's just not enough time for everything, so I dedicate most of it to the project and other personal priorities. If my lack of responses comes across as rude at times, please forgive me—that is absolutely not my intention.

Have a great time of day, everyone (whether it's morning, afternoon, or late evening like it is for me right now). I try to hear everyone out, though it isn't always possible since we all have different perspectives on certain things. But that doesn't stop us from doing good and creating something valuable for all of us.

Wishing everyone peace and a clear sky!

1 reply

noonghunna May 1, 2026
Maintainer

Thanks @Sandermage — and no apology needed on response time, the bot tracking + your turnaround on #14/#15 within hours has been remarkable. Really appreciate the workflow clarity in your reply (dev = testing/new features, main = stable releases) — that's exactly the right cadence and resolves any tension between "needs cross-rig validation" and "shipped to community." We're aligned on it.

Quick status from our side: the v0.20 + Genesis v7.65 dev tip migration just landed on our master (PR #23) — the +50% TPS bench from yesterday is now reproducible end-to-end on our compose stack, and verify-stress 33K + 50K both PASS on every variant (the cliff that fired on EVERY dev205 config no longer reproduces). Context restored: long-vision 140K → 198K, long-text 185K → 214K. Re-bench detail in the PR. While aligning env-vars verbatim against your start_27b_int4_TQ_k8v4.sh I caught and fixed three silent no-ops on our side (PN9 ATT→ATTN, PN22, PN26 sub-config naming) — these would have masked any A/B work going forward. Your Cliff 8 hardening (partial_apply_warnings) would have surfaced these immediately; strong +1 on shipping that.

One new bug from PN25 testing — filed as genesis-vllm-patches#16 — _register_op_once() doesn't survive vLLM worker-spawn (in-process flag resets, re-decoration calls infer_schema during dynamo trace). Suggested ~10-line fix in the issue. Doesn't repro on your PROD because your config short-circuits inductor for that kernel; bites our TQ3+TP=1+MTP K=3 path. With this fixed we can enable PN25 default-on and close our #16.

Workflow-wise: tracking your dev branch makes a lot of sense for us (we're explicitly the cross-rig validation target you mentioned); we'll re-pin to main once dev→main lands. Wishing you peace and clear sky too — the work is appreciated.

Sandermage · 2026-05-02T01:45:42Z

Sandermage
May 2, 2026
Author

Hey @noonghunna and everyone — substantial update.

Pushed v7.66 to dev (commits 1304c56..fc89395) and live-validated on 4 model configs on our 2× A5000 rig. Boot-tested all of them end-to-end against the actual vllm install, not just unit tests; that post-boot sanity check caught two real bugs that pytest missed.

The patches, by status

PN33 — root-cause spec-decode warmup fix (DEFAULT ON)

Backport of vllm-project/vllm#37521 (itailang) but EXTENDED beyond its use_eagle() gate to cover MTP and ngram. The vanilla warmup uses dummy K=1 draft tokens regardless of real num_speculative_tokens. That under-counts the rejection sampler memory at profile time, which causes BOTH (a) ampersandru's mid-stream OOM via propose_draft_token_ids and (b) the workspace lock AssertionError you hit on dev205 + MTP K=3 single-card. Same root cause, two symptoms.

Default ON when spec-decode is active. Disable via GENESIS_DISABLE_PN33_SPEC_DECODE_WARMUP_K=1 if the K-sized warmup itself OOMs on a tight rig (an OOM at warmup is still easier to diagnose than one mid-stream at runtime).

Live-verified: PN33 marker present in patched gpu_model_runner.py on all 4 boot configs. K-aware code (list(range(_genesis_pn33_K))) in place.
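
To make the under-counting concrete, here is a toy sketch in plain Python. It is not the vLLM warmup code: the helper names and the "one slot per draft token plus one bonus token" accounting are assumptions chosen only to show why profiling with K=1 reserves less than a K=3 run later needs.

```python
def dummy_draft_tokens(num_reqs: int, k: int) -> list[list[int]]:
    """Build one dummy draft-token list of length k per request."""
    return [list(range(k)) for _ in range(num_reqs)]


def sampler_slots(draft_tokens: list[list[int]]) -> int:
    """Toy proxy for rejection-sampler memory (assumption, not vLLM's formula):
    one slot per draft token plus one bonus token per request."""
    return sum(len(tokens) + 1 for tokens in draft_tokens)


num_reqs = 8
real_k = 3  # num_speculative_tokens used at runtime

# Vanilla warmup: always profiles with a single dummy draft token.
k1_slots = sampler_slots(dummy_draft_tokens(num_reqs, 1))

# K-aware warmup: profiles with the same K the runtime will use.
real_slots = sampler_slots(dummy_draft_tokens(num_reqs, real_k))

shortfall = real_slots - k1_slots  # memory the K=1 profile never accounted for
```

In this toy model the K=1 profile sees half of what the K=3 runtime touches, which is the shape of the problem even if the real numbers differ.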

PN25 + P7b v7.66 — direct_register_custom_op refactor

Switched silu_and_mul_pooled and dual_linear_parallel registration from @torch.library.custom_op to vLLM's canonical direct_register_custom_op from vllm/utils/torch_utils.py:899, with Library("genesis", "FRAGMENT") created at module level. Same fork-safe hasattr() pre-check guard as before. Schema introspection happens at module import (synchronous, before any Dynamo trace context), eliminating the "infer_schema skipped frame" Dynamo crash class.

Live-verified: torch.ops.genesis.silu_and_mul_pooled and torch.ops.genesis.dual_linear_parallel both register cleanly via the new path. Library(kind=FRAGMENT, ns=genesis) correctly created.

PN25 v7.67 — REJECTED on live test

Tried @torch.compiler.disable decorator on the staticmethod with direct pool acquire body (no custom_op dispatch). Reasoning: SGLang ships this pattern on FLA's chunk_gated_delta_rule. But on Qwen3.6-27B + TQ k8v4 + MTP K=3 boot:

torch._dynamo.exc.Unsupported: logging.Logger method not supported for non-export cases

Stack showed Dynamo tracing INTO forward_native body despite the decorator, hitting log.info() inside acquire_silu_out. Hypothesis: the decorator on a @staticmethod accessed through vLLM's custom_op._forward_method dispatcher (vllm/model_executor/custom_op.py:136) does NOT propagate — the dispatcher reaches the underlying function via getattr bypassing the decorator's frame guard.

SGLang's working @torch.compiler.disable patterns are on module-level functions, not @staticmethod on classes called via dispatchers. Pattern doesn't transfer here. Reverted to v7.66 (custom_op route). Three alternative mechanisms documented in source for future investigation.

PN32 — audit only, no code change

Confirmed num_tokens = hidden_states.size(0) is total batched tokens (continuous batching sum), not per-sequence. Threshold semantics correct as shipped.

Live-validation matrix

Config	Container	Boot	TPS @ 256t	Tool-call	Active patches
27B INT4 + TQ k8v4 + MTP K=3	UP	OK	98.3	clean	PN33+PN25(v7.66)+45 others
35B-A3B FP8 + MTP K=3	UP	OK	183.7	clean	PN33+PN25+PN26b+PN8
35B-A3B DFlash	UP	OK	155.0	clean	PN33+PN22+PN23+PN24+PN8
27B INT4 + DFlash drafter K=5	UP	OK	129.3	clean	PN33+PN22+PN23+PN24+PN12+PN17

All 4 configs: PN33 patch APPLY (verified in live gpu_model_runner.py markers post-boot). Tool-call clean across all 4.

The 27B INT4 + DFlash drafter result (129.3 TPS on 2× A5000) lines up well against your published 78 narr / 128 code TPS on 2× 3090 — same drafter recipe, comparable 24 GB Ampere hardware.

Known sharp edges

PN33 marker check tightened after a v1 false-positive that matched generic vllm use_eagle() references — current marker pins to the PR-#37521-specific line spec_decode_tokens = [i for i in range(self.num_spec_tokens)]. Won't false-skip again on innocent use_eagle() callsites.
DFlash configs need --dtype bfloat16 (not float16) because of dtype mismatch in combine_hidden_states — PN23 backports the upstream fix (vllm#40334) but the bfloat16 path is the shipped recipe.
PN25 v7.67 not shipped — design left as future direction in silu_and_mul_customop.py header. Three concrete alternative mechanisms documented for whoever picks it up next.

What I'd love help with

PN33 retest by ampersandru on the 1× 3090 + MTP K=3 + long-vision repro that hit mid-stream OOM. The root-cause fix should make it disappear; need empirical confirmation.
PN25 v7.66 (current path) on the OpenCode reproducer (5,900-char sys + 10 tool schemas) — does it close Cliff 1 mech B for real workloads now?
PN30 + PN31 cross-rig validation if you get a window — both still default OFF pending your data.
PN32 from any single-24GB-GPU users hitting >50K-token prompts.

Honest note — stepping back for a few days

Pretty wrung out from the last week. Going to read what people post but reply windows might be slower for 2-3 days. Keep the data coming whenever it shows up — every bench result and config detail matters.

Your dual-3090 wall_TPS 116.59 number (+30.7% over A5000 reference) is exactly the kind of validation that justifies the effort.

Thanks for the patient bug-hunting. Wishing peace and a clear sky to everyone.

— Sander, Ukraine, Odessa
(translated from Russian via AI; please forgive any awkward phrasing)

1 reply

noonghunna May 2, 2026
Maintainer

Cross-rig validation on 1×3090 TP=1 + 2×3090 TP=2 (Qwen3.6-27B Lorbus AutoRound INT4 + TQ3 KV + MTP K=3 + spawn workers, vLLM 0.20.1rc1.dev16+g7a1eb8ac2). v7.66 dev tip fc89395, all 4 TQ3 composes (long-text / long-vision / bounded-thinking / dual-turbo).

Quick context: take the rest you said you'd take — this is data, not action items. Bumping our pin to your v7.66 dev tip went smoothly, no regressions, but a few of the patches need more iteration on consumer Ampere TP=1. Detailed notes below.

PN33 — partial close on TP=1

You said: "PN33 closes both (a) ampersandru's mid-stream OOM via propose_draft_token_ids and (b) the workspace lock AssertionError you hit on dev205 + MTP K=3 single-card. Same root cause, two symptoms."

Our empirical result is that PN33 closes one of those two but not the other:

Test	PN33 result
Engine boot (`profile_run` workspace lock)	✅ closed
Runtime decode (`turboquant_attn.py:1350` lock)	❌ still fires

With PN33 enabled (and our patch_workspace_lock_disable.py sidecar disabled), the engine boots cleanly. But the first decode request fails with:

AssertionError: Workspace is locked but allocation from
'turboquant_attn.py:1350:_decode_attention' requires 0.76 MB,
current size is 0.00 MB. Workspace growth is not allowed after locking.

Same trace as before the bump. Net: we kept the patch_workspace_lock_disable.py sidecar mounted. The bug surface is narrowed (PN33 closed boot-time profile_run path) but not closed for runtime decode. Maybe TP=2 short-circuits this codepath; maybe ampersandru's repro hit only the boot-side issue. Worth flagging for whoever picks up the next pass.

PN25 v7.66 — still doesn't work on TP=1

You refactored from @torch.library.custom_op to direct_register_custom_op + Library("genesis", "FRAGMENT") at module level, with the goal of moving schema introspection ahead of the dynamo trace. Same goal, same idea — but the new failure mode on TP=1 is Library("genesis", "FRAGMENT") itself failing inside the trace at instantiate_user_defined_class_object:

File "vllm/_genesis/kernels/silu_and_mul_customop.py", line 224, in _make_genesis_lib
    _GENESIS_LIB = Library("genesis", "FRAGMENT")
  File "torch/_dynamo/polyfills/__init__.py", line 410, in instantiate_user_defined_class_object
    obj.__init__(*args, **kwargs)

Different mechanism from v7.65's @custom_op + infer_schema, same root cause: any torch.library construction call made from inside a Dynamo trace context is disallowed on our TP=1 spawn.

We still have to ship our patch_pn25_genesis_register_fix.py v3 backport — it text-patches activation.py to add a top-level _GENESIS_PN25_SILU_AND_MUL_OP = get_op_callable() block right after logger = init_logger(__name__). Registration runs at module import time, BEFORE any trace context exists, so neither @custom_op nor Library calls land inside a trace. forward_native reads the cached global; nothing custom-op-related happens during the trace.

The fact that this approach survives BOTH v7.65 AND v7.66 underlying mechanisms suggests it might be the right idiom for any TP=1 spawn target, regardless of how registration is implemented internally. Happy to PR upstream if useful.

PN30 — your `.contiguous()` is layout-incorrect; we wrote a corrected version

We owe you a heads-up here. Your PN30 a9977d8 materializes state[src_block_id, :, offset:].contiguous() and raw-memcpys it into state[dest_block_id]. That's source-correct but NOT destination-correct: for Qwen3.6-27B (dim=10240, state_len=6), with offset=1, you produce a compact 10240×5 buffer, then memcpy it into a destination whose rows are strided by state_len=6. Every row after row 0 lands at the wrong destination offset, corrupting the DS conv state row strides.

The TQ store CUDA assert at triton_turboquant_store.py:425 we filed earlier (you flagged it as the regression we found) was the eventual surfacing — corrupted state turning into a kernel-detectable invariant violation several layers downstream, not the root offender.

We wrote a corrected fix that lives in vllm/v1/worker/mamba_utils.py:collect_mamba_copy_meta (where both source AND destination block ids are known). For DS conv offset > 0:

tmp = state[dest_block_id].clone()                          # full dst-shaped temp
tmp[..., :tail].copy_(state[src_block_id, ..., offset:])    # tail of src into prefix of dst
# batch_memcpy gets a normal contiguous block-to-block copy

This preserves DS row stride. Reuses your existing _GENESIS_PN30_TEMP_TENSORS lifecycle (just changes the temp construction). The compact-temp approach can't preserve DS layout because the destination's stride information is lost when you flatten to a compact buffer.
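
A scaled-down illustration of the layout problem, using flat Python lists as stand-ins for raw GPU memory. The tiny dimensions (3 rows instead of 10240) and the helper names are made up for the demo; only state_len=6 and offset=1 come from the description above.

```python
DIM, STATE_LEN, OFFSET = 3, 6, 1   # DIM shrunk for readability; real dim is 10240
TAIL = STATE_LEN - OFFSET          # 5 columns survive the offset


def make_block(base: int) -> list[int]:
    """Row-major DIM x STATE_LEN block, flattened the way raw memory sees it."""
    return [base + 10 * r + c for r in range(DIM) for c in range(STATE_LEN)]


def row(block: list[int], r: int) -> list[int]:
    """Read row r using the destination row stride (STATE_LEN)."""
    return block[r * STATE_LEN:(r + 1) * STATE_LEN]


src = make_block(100)
dst = make_block(0)

# Compact approach: gather src[:, OFFSET:] into a DIM x TAIL buffer, then raw-copy.
compact = [v for r in range(DIM) for v in row(src, r)[OFFSET:]]
broken = list(dst)
broken[:len(compact)] = compact    # memcpy packs rows 5 apart; dst expects 6 apart

# Dst-shaped approach: start from a full destination-shaped temp, fill each row prefix.
fixed = list(dst)
for r in range(DIM):
    start = r * STATE_LEN
    fixed[start:start + TAIL] = row(src, r)[OFFSET:]
```

Row 0 looks fine in both versions, which is part of why the bug is easy to miss; from row 1 onward the compact copy drifts by one element per row.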

Diagnosis credit goes to ChatGPT/Codex CLI cross-check 2026-05-02 (we'd been chasing the wrong layer). Code in our repo: patch_pn30_dst_shaped_temp_fix.py.

Validation: probes 4 (multi-turn agent) and 5 (LCB-coding shape) — both crashed with your upstream PN30 — now pass cleanly with the corrected fix on all 4 TQ3 composes. DS layout retained → +6% TPS preserved.

Longer-term, vLLM upstream should still replace the temp block with a direct strided copy kernel (avoid the intermediate clone entirely). Dst-shaped temp is correct but allocates a full block per offset>0 call.

This should replace your existing PN30 part-1. Happy to PR.

PN31 — still doesn't fit on 24 GB

Same finding as our earlier reply. Per-shape persistent out buffer growth + PN12+PN25 pool residence outpaces activation budget at DeltaNet chunk_fwd_o on a 24 GB single GPU. Dropped 0.95 → 0.93 mem-util to give more headroom — still crashes when a new shape appears at 29K depth. Without PN31 + lower mem-util, the 25K tool-RETURN path passes anyway. So the bug PN31 was meant to fix has a simpler workaround (give activation room) on our hardware class.

If you do another pass: cap per-shape buffer count, lazy-alloc after N occurrences, or memory-budget guard might let it work on tighter rigs.

PN32 — not yet tested

Will test on a future pass. Need a clean Cliff 2 (>50K single-prompt) probe setup. Currently the architectural Cliff 2 fails even on TP=2 because the GDN forward state buffer is per-rank, not split across TP factor. If PN32 works it'd be a real unlock.

Validation matrix on v7.66

Compose	Probes pass	Failure
long-text	6/7	Cliff 2 architectural
long-vision	6/7	Cliff 2 architectural
bounded-thinking	6/7	Cliff 2 architectural
dual-turbo (TP=2)	6/7	Cliff 2 architectural

All non-architectural probes pass on every variant. The 7-probe suite (scripts/verify-stress.sh in our repo) covers:

1. small-rung longctx (10K + 30K)
2. 25K tool RETURN
3. IDE-agent one-shot (sys + 10 tool schemas + user, OpenCode reproducer shape)
4. multi-turn agent (sys + tools + 4-turn history)
5. LCB-coding (LeetCode problem + structured plan)
6. reasoning-heavy (math + max_tokens=8192)
7. large-rung longctx (60K + 90K; Cliff 2 territory, deferred to last so it doesn't cascade-fail probes 2-6)

If you run it on your 2× A5000 PROD (both TP=1 and TP=2 if you have time), probes 3 + 4 are the direct repro for PN25 worker-spawn; probes 4 + 5 for PN30; probe 2 for PN31. Single command:

bash scripts/verify-stress.sh

Net for our shipping configs

Four local sidecars retained on master:

patch_pn25_genesis_register_fix.py (PN25 v3 import-time, TP=1 only — drop when Sander's mechanism survives spawn here)
patch_pn30_dst_shaped_temp_fix.py (replaces your compact .contiguous() — should go upstream)
patch_workspace_lock_disable.py (PN33 narrowed it but didn't close the runtime decode path — drop when you have a fix that covers turboquant_attn.py:1350)
patch_tolist_cudagraph.py (cudagraph capture fix, unchanged)

Nothing actively broken; all shipped configs work for users. Just three patches that should land upstream so we can drop our sidecar weight.

Wishing you peace and clear sky. Take the days you need — the test mesh keeps running.

— club-3090 / Wasif

Sandermage · 2026-05-02T08:38:58Z

Sandermage
May 2, 2026
Author

@noonghunna and the @ChatGPT/Codex CLI team — thank you. Big update.

Pulled all three of your v7.66 cross-rig findings into Genesis directly as v7.68 (commit ab3f5ce on dev). Boot-validated on our 27B INT4 + TQ k8v4 + MTP K=3 + TP=2 PROD; ready for your 1×3090 + TP=1 retest whenever you have a window.

What landed in Genesis directly

PN30 v7.68 — dst-shaped temp (your 9af1a52)

Your diagnosis was correct end-to-end. v7.65 .contiguous() produced a compact buffer (10240×5) but the destination is strided by full state_len (10240×6) — memcpy packed compact rows into wrong destination offsets, corrupting DS conv state row strides on every offset>0 copy. The TQ store CUDA assert at multi-turn agent shapes was the surfacing point, not the root offender.

Ported your fix as PN30 part3 patching collect_mamba_copy_meta:

tmp = state[dest_block_id].clone()
tmp[..., :tail].copy_(state[src_block_id, ..., token_offset:token_offset + tail])

Plus part1 (the old compact path) now fails closed with a RuntimeError, so if anything ever reaches it we crash explicitly rather than silently corrupt. Reuses the existing _GENESIS_PN30_TEMP_TENSORS lifecycle from part2. Marker bumped to v7.68.

PN25 v7.68 — import-time registration (your a62ad78)

Your insight about activation.py module-import timing is the key — vLLM imports activation.py during model construction in each spawned worker, BEFORE profile_run enters aot_compile_fullgraph. So registration runs in eager Python, never inside a Dynamo trace. v7.66's direct_register_custom_op was the right mechanism but called from the wrong place (inside the trace).

Ported your patch_pn25_genesis_register_fix.py directly into Genesis as PN25 v7.68:

Sub-patch 1 inserts the import-time register+cache block at the top of activation.py
Sub-patch 2 rewrites forward_native body to read only the cached _GENESIS_PN25_SILU_AND_MUL_OP global
Added Dynamo trace-context guards in get_op_callable() as defense-in-depth (returns None if called from inside a trace)

Same pattern extended preventively to P7b (gdn_dual_stream_customop). P7b is opt-in / default OFF in our PROD so we hadn't hit this, but anyone flipping GENESIS_ENABLE_P7B_DUAL_STREAM_CUSTOM_OP=1 on TP=1 spawn would crash with the identical bug class. Same import-time pattern: register at module-import of gdn_linear_attn.py, cache as _GENESIS_P7B_DUAL_LINEAR_OP, in_proj call site reads only the global.

PN34 (NEW) — runtime workspace lock relaxation (your 2b5ab4d)

Your patch_workspace_lock_disable.py setup-time sidecar promoted into Genesis as PN34 — companion to PN33 for the runtime decode path that PN33's boot-time fix doesn't cover. Same code: relax the strict if self._locked: raise AssertionError(...) to one-shot WARN+grow-anyway. requires_patches: ["PN33"] declared in dispatcher.

Default OFF (it's relaxing a strict-debug assertion, so explicit opt-in via GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1). Retires when vllm#40706 lands.
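
A minimal sketch of the "one-shot WARN + grow anyway" behavior, assuming a simplified workspace object. The class, its fields, and the relax_lock parameter are illustrative only and do not reproduce vLLM's workspace manager.

```python
import warnings


class Workspace:
    """Toy workspace that can be locked after profiling."""

    def __init__(self, relax_lock: bool = False):
        self.size_mb = 0.0
        self._locked = False
        self._relax_lock = relax_lock   # stand-in for the PN34 opt-in switch
        self._warned = False

    def lock(self) -> None:
        self._locked = True

    def ensure(self, required_mb: float, site: str) -> None:
        """Make sure at least required_mb is available for the given call site."""
        if required_mb <= self.size_mb:
            return
        if self._locked:
            if not self._relax_lock:
                raise AssertionError(
                    f"Workspace is locked but allocation from '{site}' "
                    f"requires {required_mb:.2f} MB, current size is "
                    f"{self.size_mb:.2f} MB."
                )
            if not self._warned:        # warn once, then stay quiet
                warnings.warn(f"workspace grew after lock at {site}")
                self._warned = True
        self.size_mb = required_mb
```

The strict path stays the default so the assertion keeps catching genuine profiling gaps; the relaxed path trades that signal for a running server.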

Why I missed all three the first time — honest answer

I tested patch application (does the text-patch land cleanly), not patch correctness against the bug-triggering workload. Specifically:

Never tested PN25 on TP=1 spawn (you reported it was the repro target, I tested only TP=2)
Never tested PN30 against multi-turn agent / LCB-coding (the actual probes 4+5 that catch DS row stride corruption)
Never tested PN31 on a real 24 GB single GPU under PN12+PN25 pool pressure
Never traced PN30 layout semantics carefully — .contiguous() looked syntactically right, ChatGPT/Codex caught what I missed

The systemic fix is on me: I'm setting up another rig with a single 24 GB card next week to actually run your reproducers locally. No more arguing for workarounds when the right answer is to test against the actual bug surface.

Server validation (post-backport)

27B INT4 + TQ k8v4 + MTP K=3 + TP=2:

Boot OK, 49 patches APPLY, 0 failures
PN25 v7.68 marker present in patched activation.py — verified via grep on live install
PN33 marker present in patched gpu_model_runner.py
TPS @ 256t: 104.0 mean, CV 0.5% over 5 runs (vs 98.3 on v7.66 — slight improvement, no regression)
Tool-call: clean (get_weather({"city":"Paris"}), finish=tool_calls)

What I'd love help with next

When you have a window:

PN30 v7.68 retest on 1×3090 + 2×3090 — should match your local sidecar's behavior (same code path, just shipped from Genesis directly so no more setup.sh hook needed)
PN25 v7.68 retest on TP=1 spawn config — the import-time pattern should close the OpenCode reproducer cleanly
PN34 retest — if PN33 alone closes your runtime decode workspace_lock for the configs you tested, PN34 might be unnecessary. Worth knowing whether to keep it OFF-by-default or promote.
Drop your local sidecars — patch_pn30_dst_shaped_temp_fix.py, patch_pn25_genesis_register_fix.py, patch_workspace_lock_disable.py should all be redundant on v7.68 dev tip. Setup.sh hooks can come out.

If anything regresses I'd rather hear about it within a day than have you carry sidecars forever.

On the next-week test rig

Setting up a second box with a single A5000 (24 GB, SM86 Ampere workstation card, with the same memory budget and compute capability as the 3090 you're testing on) to actually run your reproducers locally instead of asking you to do all the cross-rig validation. Specifically:

1× RTX A5000 24 GB
vLLM 0.20.1rc1.dev16+g7a1eb8ac2 matching your pin
Will run verify-stress.sh from your repo as the gate before claiming a fix lands

Not a substitute for your cross-rig data on the 3090s themselves, but should mean fewer "works on TP=2, breaks on TP=1" round-trips through your bug filings — A5000 single-card hits the same TP=1 spawn config + 24 GB activation budget that triggered all three of the bugs you found.

— Sander, Ukraine, Odessa
(translated from Russian via AI; please forgive any awkward phrasing)

Sandermage · 2026-05-02T09:32:16Z

Sandermage
May 2, 2026
Author

Quick follow-up — ran two static-analysis audits today (Gemini + ChatGPT/Codex CLI) on the genesis-vllm-patches tree to catch latent issues that pytest + live-boot couldn't. Closing the loop on what they surfaced.

Real bugs caught + fixed

G-001 (Codex, Critical) — model_detect.py:185 had base undefined in an exception path. NameError on layer_types probe failures was masked by the dispatcher's conservative apply fallback (return True, "model_detect probe failed (...)") — meaning a genuine model-incompat exception could have flipped to "apply patches anyway" and applied hybrid GDN patches to a non-hybrid model. Fixed: base → source_label.

G-002 (Codex, High) — vllm/_genesis/__init__.py eagerly imported prealloc, which imports torch at module level. Result: every torch-less CLI / pre-commit / static-analysis tool failed with ModuleNotFoundError before reaching its entry point. Fixed: lazy __getattr__ for prealloc via importlib (with care to avoid from x import y re-entering __getattr__ infinitely).
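
For readers unfamiliar with the lazy-import idiom, here is a self-contained sketch of a module-level __getattr__ (PEP 562). The package name demo_pkg and the attribute heavy are hypothetical, and json stands in for the torch-dependent submodule so the example runs anywhere.

```python
import importlib
import sys
import types

# Hypothetical package object standing in for vllm/_genesis/__init__.py.
pkg = types.ModuleType("demo_pkg")

# Attribute name -> module to import on first access.
# "json" is a placeholder for a heavy dependency such as a torch-importing module.
_LAZY = {"heavy": "json"}


def _lazy_getattr(name: str):
    """Called only when normal attribute lookup on the module fails."""
    target = _LAZY.get(name)
    if target is None:
        raise AttributeError(f"module 'demo_pkg' has no attribute {name!r}")
    # importlib.import_module does not go back through this hook,
    # unlike a 'from demo_pkg import heavy' written inside it would.
    module = importlib.import_module(target)
    # Cache in the module dict so the hook is not consulted again.
    pkg.__dict__[name] = module
    return module


pkg.__getattr__ = _lazy_getattr
sys.modules["demo_pkg"] = pkg
```

Importing demo_pkg costs nothing until pkg.heavy is actually touched, which is the property torch-less tooling needs.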

G-003 + G-004 (Codex, High×2) — ResponseCacheMiddleware had two distinct ways to violate its "never raises / always returns response" contract: (a) float(temperature) on {"temperature":"abc"} leaked ValueError up to the client as 500; (b) corrupt JSON-non-serializable cached entry → _send_cached_response returned without sending → connection hang. Both fixed; _send_cached_response now returns bool and caller falls through to MISS path.
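
The two contract fixes can be sketched roughly as follows. This is not the ResponseCacheMiddleware code: the function names, the default temperature of 1.0, and the "None means bypass the cache" convention are assumptions made for the example.

```python
import json


def parse_temperature(body: dict, default: float = 1.0):
    """Return the request temperature, or None if it cannot be parsed.

    None signals the caller to skip the cache instead of raising to the client.
    """
    try:
        return float(body.get("temperature", default))
    except (TypeError, ValueError):
        return None


def send_cached_response(entry, send) -> bool:
    """Try to send a cached entry; report False so the caller can take the MISS path."""
    try:
        payload = json.dumps(entry)
    except (TypeError, ValueError):
        return False          # corrupt entry: nothing sent, caller must recompute
    send(payload)
    return True


def handle(body: dict, cached_entry, send, compute) -> str:
    """Serve from cache when safe, otherwise fall through to a fresh computation."""
    if parse_temperature(body) is not None and cached_entry is not None:
        if send_cached_response(cached_entry, send):
            return "HIT"
    send(json.dumps(compute(body)))
    return "MISS"
```

Returning a bool instead of silently returning is the key change: every path now ends with exactly one response being sent.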

G-006 (Codex, Medium) — apply_all_plugins() ran BEFORE the core patch loop despite the comment saying "After core patches finish, walk plugins". Community plugin authors relying on the documented contract would find post-modification anchors absent. Moved plugin apply to after the core loop.

G-007 (Codex, Medium) — validate_registry() ran before register_plugins() injected community entries → plugin entries skipped boot-time validation. Re-validation added after plugin apply pass.

G-008 (Codex, Medium) — 7 env-var references in PATCHES.md / INSTALL.md didn't match the actual dispatcher.py names (e.g. docs said GENESIS_ENABLE_PN30_DS_LAYOUT but dispatcher had GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE). Operators copy-pasting got no-op env vars while their patch silently stayed disabled. All 7 synced.
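
A check along these lines could catch this class of doc drift automatically. It is only a sketch: the regex, the function name, and the idea of passing the dispatcher's names in as a set are assumptions, not how the Genesis tooling works.

```python
import re

# Matches GENESIS_-prefixed environment variable names in free text.
ENV_RE = re.compile(r"\bGENESIS_[A-Z0-9_]+\b")


def unknown_env_vars(doc_text: str, known: set[str]) -> list[str]:
    """Return env-var names mentioned in docs that the dispatcher does not define."""
    return sorted(set(ENV_RE.findall(doc_text)) - known)


# The one mismatch quoted above, used as sample data.
dispatcher_names = {"GENESIS_ENABLE_PN30_DS_LAYOUT_SPEC_DECODE"}
docs = "Enable with GENESIS_ENABLE_PN30_DS_LAYOUT=1 before launching."

stale = unknown_env_vars(docs, dispatcher_names)
```

Run against PATCHES.md and INSTALL.md in CI, a non-empty result would fail the build before operators copy a no-op variable.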

P103 latent NameError (separate Gemini audit) — vllm/_genesis/wiring/hybrid/patch_103_fla_cliff2_chunked.py:197 used a bare T in the chunked-prefill loop without defining it (T = q.shape[1] was missing). The Cliff 2 chunked path has therefore raised NameError on every trigger since the v7.62.20 ship date. PROD didn't surface this because continuous batching keeps q.shape[1] ≤ max_num_batched_tokens (4096) — well under the _MAX_T threshold that gates the chunked branch. Fixed.

Plus cleanup pass

The same audits flagged G-005 (streaming docs lying about SSE replay), G-009 (PATCHES.md P72 row truncated), G-010 (rig-specific paths in scripts — partially closed with env-var override + README rationale), G-011 (bench_suffix_sweep speed_runs computed but unused — biased Pareto ranking), G-012 (Redis cache size off-by-N for stats keys), G-013 (PN16 broken doc reference to gitignored docs/_internal/...), G-014 (TextPatcher Python-only marker — documented), G-016 (195 ruff F401/F841/RUF059 → 0). All closed.

Numbers

1494 tests pass, 73 skipped, 0 failures — unchanged across the 4 audit-fix commits
195 ruff cleanup errors → 0
4 commits on dev: 82c64c1 + 6f9c5eb (Codex), 5743c03 (Gemini P103), ab3f5ce (your v7.68 cross-rig fixes from earlier today)

Honest note

The two latent bugs that hurt most (G-001 conservative apply override; P103 silent Cliff 2 skip) are exactly the class that pytest + live-boot don't catch — boot doesn't trigger the rare exception path, and PROD continuous batching never crosses the chunked-prefill threshold. Static analysis (ruff F821, name resolution) found them in 30 seconds.

Going to bake static analysis into a pre-commit hook so this is automated going forward.

Stepping away for the rest of today — eyes are tired. Will read what comes in but reply windows likely tomorrow. Whatever cross-rig data you turn around on PN30 v7.68 / PN25 v7.68 / PN34 will be valuable whenever it lands.

— Sander, Ukraine, Odessa
(translated from Russian via AI; please forgive any awkward phrasing)

noonghunna · 2026-05-02T10:43:04Z

noonghunna
May 2, 2026
Maintainer

@Sandermage — pulled v7.68 dev tip (18e65e3) on a fresh v7.68-cliff2-test branch and dropped our 3 sidecars (PN25 v3 + PN30 dst-shaped + workspace_lock_disable) to validate your accept-and-fold.

Three findings worth flagging before you cut v7.69. TL;DR: PN25 v7.68 ✅ + PN34 ✅. PN30 v7.68 ❌, P103 ❌, PN32 ❌ on TP=1 + 24GB.

✅ PN25 v7.68 — works clean on TP=1

direct_register_custom_op + Library("genesis", "FRAGMENT") survives worker spawn. FFN intermediate pool active across all dynamo traces. Cliff 1 mech B inductor leak: closed.

✅ PN34 — works clean (but default OFF caught us)

GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 required explicitly — without it, runtime decode hits AssertionError: Workspace is locked at turboquant_attn.py:1350. First boot crashed for exactly this reason; once enabled, replaces our patch_workspace_lock_disable.py cleanly. Maybe worth promoting to default-on for the TQ + MTP K=3 path?

❌ PN30 v7.68 — drift-marker false-positive breaks the patch

Your part3 has upstream_drift_markers=["[Genesis PN30"] (generic prefix). When part1 inserts marker "[Genesis PN30 issue #17] Module-level state ..." and part2 inserts "[Genesis PN30 issue #17] Even on n==0 ...", part3's idempotency check sees its prefix already in-file and treats it as upstream-merged.

[PN30 v1/worker/mamba_utils.py — collect_mamba_copy_meta dst-shaped DS temp
 (issue #17, v7.68)] upstream marker '[Genesis PN30' detected — patch obsolete, skip
[Genesis] FAILED: PN30 DS conv state + spec-decode AL>1 (issue #17)
 — PN30 part3 v1/worker/mamba_utils.py:collect_mamba_copy_meta did not apply safely:
   upstream_merged — marker '[Genesis PN30' present

apply_all then escalates to FAILED → vLLM aborts. Fix: change part3's drift marker to something part3-specific, e.g. "[Genesis PN30 v7.68 dst-shaped]" so it only matches its own insertion.
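
The collision is easy to reproduce with plain substring matching. The function below is a simplified stand-in for the drift check, not the TextPatcher implementation; the marker strings are the ones quoted above.

```python
def looks_upstream_merged(file_text: str, drift_markers: list[str]) -> bool:
    """Simplified drift check: any marker found in the file means 'skip as obsolete'."""
    return any(marker in file_text for marker in drift_markers)


# File contents after part1 and part2 have inserted their own comments.
patched_file = (
    "# [Genesis PN30 issue #17] Module-level state ...\n"
    "# [Genesis PN30 issue #17] Even on n==0 ...\n"
)

# Generic prefix: matches the sibling sub-patches' insertions (false positive).
generic_hit = looks_upstream_merged(patched_file, ["[Genesis PN30"])

# Part3-specific marker: only matches part3's own insertion, which is not there yet.
specific_hit = looks_upstream_merged(patched_file, ["[Genesis PN30 v7.68 dst-shaped]"])
```

With the generic prefix part3 is skipped on a file it has never touched; with the specific marker it applies.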

❌ P103 — wrap reports "rebound at 0 caller sites"

[Genesis] applied: P103 FLA Cliff 2 chunked fwd_h+fwd_o orchestrator
— P103 v7.62.20 applied: chunk.py::chunk_gated_delta_rule_fwd wrapped
  with chunked fwd_h+fwd_o (MAX_T=16384, rebound at 0 caller sites).

Confirmed broken: probe 7 (60K) hit Cliff 2 OOM and the trace went straight through vllm/model_executor/layers/fla/ops/chunk.py:71 chunk_gated_delta_rule_fwd — your wrap was never engaged. The 5743c03 NameError fix landed but the binding mechanism still has 0 callers on Qwen3.6-27B. Worth instrumenting: where does chunk_gated_delta_rule import chunk_gated_delta_rule_fwd from on hybrid models, and why isn't the rebind reaching it?

❌ PN32 alone doesn't close Cliff 2 on TP=1 + 24GB

After enabling PN32 + (broken) P103 with PN30 disabled (workaround for Finding 1) → Cliff 2 fired EARLIER than v7.66, at a 30K prompt instead of the usual 50-60K. PN32 chunks gdn_linear_attn.forward_cuda but the inner chunk_gated_delta_rule_fwd's k.new_empty(B, NT, H, V, K) still allocates the full h tensor.

OOM trace at 30K:

File "vllm/model_executor/layers/fla/ops/chunk.py", line 71, in chunk_gated_delta_rule_fwd
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB.
GPU 0 has a total capacity of 23.56 GiB of which 44.50 MiB is free.
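
A back-of-envelope sketch of why the h tensor tracks prompt length. The shape (B, NT, H, V, K) is from the allocation named above; the head count, head dims, chunk length of 64, and 2-byte dtype are placeholder assumptions, not Qwen3.6-27B's real values.

```python
import math


def fwd_h_bytes(T: int, B: int = 1, H: int = 16, K: int = 128, V: int = 128,
                chunk: int = 64, dtype_bytes: int = 2) -> int:
    """Bytes for an h tensor shaped (B, NT, H, V, K), with NT = ceil(T / chunk).

    All default dims are hypothetical placeholders for illustration.
    """
    nt = math.ceil(T / chunk)
    return B * nt * H * V * K * dtype_bytes


MAX_T = 16384  # P103's default cap on T per inner call, per the log line above

uncapped_30k = fwd_h_bytes(30_000)
uncapped_60k = fwd_h_bytes(60_000)
capped_60k = fwd_h_bytes(min(60_000, MAX_T))   # what one chunked call would allocate
```

Whatever the true dims are, the allocation is linear in T unless something caps T before the inner call, which is why slicing only the outer inputs does not shrink it.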

Two ways this could land:

P103 needs to actually rebind so the inner fwd_h chunks too — at which point PN32 + working P103 should compose to close Cliff 2
PN32 itself needs to internally chunk the FLA call, not delegate to a fully-allocated chunk_gated_delta_rule_fwd

The 2×A5000 PROD doesn't hit this because TP=2 splits the GDN forward state across ranks; on TP=1 the full state lands on one card.

What we did

Stayed on master (Genesis v7.66 fc89395 + our 3 local sidecars — they still produce the cleanest TP=1 + 24GB stack we have). The v7.68-cliff2-test branch is on disk locally as a snapshot for cross-rig data.

Happy to share full boot logs, dispatcher matrix, and OOM tracebacks for any of these — let me know which would be most useful for v7.69 triage.

Sandermage · 2026-05-02T11:25:25Z

Sandermage
May 2, 2026
Author

@noonghunna — pulled all three findings into v7.69 on dev (commit 2db18df). Full root-cause analysis + fixes + 18 new tests. 1512 pass / 73 skip / 0 fail.

F1 — PN30 part3 drift-marker false-positive ✅ FIXED

Your diagnosis was exact. part2 (separate TextPatcher in same apply() call) inserts [Genesis PN30 issue #17] into the same file as part3. Then part3's Layer 3 drift check matches part2's own insertion → SKIP upstream_merged → required-fail → vLLM aborts.

Tightened part3's drift markers to [Genesis PN30 v7.68 dst-shaped] — specific to part3's own replacement comment, can't collide with sibling sub-patches. Re-runs unaffected (Layer 2 idempotency marker fires before Layer 3). Two regression tests pin the fix.

F2 — P103 setattr lost on `exec vllm serve` ✅ FIXED

Read your docker-compose.long-text.yml entrypoint:

python3 -m vllm._genesis.patches.apply_all
exec vllm serve "$@"

That's the bug. apply_all ran setattr(chunk_mod, "chunk_gated_delta_rule_fwd", wrapper) inside the short-lived python3 process launched by the entrypoint script. That interpreter exited, and exec vllm serve then replaced the shell with a brand-new Python process. Setattr gone. Workers spawned fresh, original chunk_gated_delta_rule_fwd ran, OOM at chunk.py:71.

The "rebound at 0 caller sites" log was misleading — internal callers in chunk.py resolve via chunk_mod.__dict__ (which we DID setattr), so 0 external aliases is normal. The setattr WAS active in the apply_all process; it just didn't carry over into the vllm serve process. Log message rewritten in v7.69.
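
The underlying behavior is general Python, and a small experiment shows it: an attribute set on a module at runtime lives only in that interpreter's memory, so a freshly started interpreter never sees it. The attribute name genesis_demo_flag is invented for the demo and math stands in for chunk.py.

```python
import math
import subprocess
import sys

# Monkey-patch a module in THIS interpreter only (stand-in for the setattr on chunk_mod).
math.genesis_demo_flag = True

# Start a brand-new interpreter, as `vllm serve` or a spawned worker would be.
child = subprocess.run(
    [sys.executable, "-c", "import math; print(hasattr(math, 'genesis_demo_flag'))"],
    capture_output=True,
    text=True,
    check=True,
)

patched_here = hasattr(math, "genesis_demo_flag")
patched_in_child = child.stdout.strip() == "True"
```

Only changes that live on disk (a text-patch) or that re-run at import time reach the new process.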

Fix: text-patch on chunk.py itself appending a self-install hook at end of file:

try:
    import os as _genesis_p103_os
    if _genesis_p103_os.environ.get("GENESIS_ENABLE_P103", "").strip().lower() in ("1","true","yes","on"):
        from vllm._genesis.wiring.hybrid.patch_103_fla_cliff2_chunked import (
            _genesis_p103_install_at_import as _genesis_p103_install,
        )
        _genesis_p103_install(globals())
except Exception:
    pass

This runs every time chunk.py is imported (in any process — workers, fork, spawn, exec). New helper _genesis_p103_install_at_import(globals()) monkey-patches chunk_gated_delta_rule_fwd in the just-loaded module dict. Survives any startup mechanism. Legacy setattr step preserved as defense-in-depth for the current process.

11 new tests cover the helper, env gating, idempotency, missing-deps soft-fail, anchor structure, ordering, and module docstring rationale.

F3 — PN32 v1 chunked at wrong level ✅ FIXED

Your finding 3 was right end-to-end. PN32 v1 sliced mixed_qkv/b/a at the outer forward_cuda level, BUT inside _forward_core the call to self.chunk_gated_delta_rule(...) still got attn_metadata.non_spec_query_start_loc describing the FULL prompt — PN32 v1 didn't slice metadata. FLA allocated h based on full T regardless. That's why you OOM'd EARLIER (30K) with PN32 v1 than baseline (50-60K) — chunking overhead with no inner allocation reduction.

PN32 v2 (v7.69) — rewritten to patch _forward_core directly. Wraps the prefill branch's self.chunk_gated_delta_rule(...) call:

slice query/key/value/g/beta along T (dim=1)
build chunk-local cu_seqlens=torch.tensor([0, chunk_len])
pass chunk_indices=None, chunk_offsets=None (FLA recomputes from cu_seqlens)
thread initial_state via prior chunk's last_recurrent_state
concat outputs along T (dim=1)
single-seq prefill only — multi-seq bypasses to original (chunking across cu_seqlens boundaries needs inner state-cache surgery not exposed at this layer)

PN32 v2 composes with P103: v2 chunks the OUTER FLA call (the chunk_gated_delta_rule invocation); P103 chunks INSIDE FLA's chunk_gated_delta_rule_fwd (the inner h tensor). Together they give full Cliff 2 coverage. 19 tests rewritten for v2 semantics.
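
The state-threading idea behind the v2 chunking can be shown on a toy recurrence. This is not the gated delta rule: a scalar recurrence h_t = decay * h_(t-1) + x_t stands in for it, and the function names are made up. What carries over is the structure: slice along T, pass the previous chunk's final state as the next chunk's initial state, and concatenate outputs.

```python
def scan(xs, decay, initial_state=0.0):
    """Toy recurrence h_t = decay * h_(t-1) + x_t; returns (outputs, last_state)."""
    h = initial_state
    outputs = []
    for x in xs:
        h = decay * h + x
        outputs.append(h)
    return outputs, h


def chunked_scan(xs, decay, chunk_size):
    """Process xs in chunks along T, threading the recurrent state between chunks."""
    state = 0.0
    outputs = []
    for start in range(0, len(xs), chunk_size):
        chunk_out, state = scan(xs[start:start + chunk_size], decay, state)
        outputs.extend(chunk_out)       # concat along T
    return outputs, state


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
full_out, full_state = scan(xs, decay=0.5)
chunk_out, chunk_state = chunked_scan(xs, decay=0.5, chunk_size=2)
```

Because each chunk only sees its own slice, per-call working memory is bounded by the chunk length rather than the full prompt, while the result matches the unchunked pass.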

Recommended pairing for single-24GB-GPU Cliff 2:

GENESIS_ENABLE_P103=1                              # required (closes inner h)
GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL=1          # closes outer FLA call buffer
GENESIS_PN32_GDN_CHUNK_SIZE=8192                   # default
GENESIS_PN32_GDN_CHUNK_THRESHOLD=16384             # default
GENESIS_FLA_FWD_H_MAX_T=16384                      # P103 default

DEPENDENCIES section added to PN32 v2 module docstring + composition matrix in dispatcher credit. So genesis explain PN32 and genesis explain P103 now show the recipe.
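
For reference, a sketch of how these knobs might be read with the defaults listed in the recipe. `read_cliff2_settings` is a hypothetical helper, and treating the `GENESIS_ENABLE_*` flags as off unless set to `1` is an assumption; the patcher's own parsing may differ.

```python
import os

# Defaults as listed in the recipe above.
DEFAULTS = {
    "GENESIS_PN32_GDN_CHUNK_SIZE": 8192,
    "GENESIS_PN32_GDN_CHUNK_THRESHOLD": 16384,
    "GENESIS_FLA_FWD_H_MAX_T": 16384,
}


def read_cliff2_settings(env=None):
    """Collect the Cliff 2 knobs from an env mapping (os.environ by default)."""
    env = os.environ if env is None else env
    settings = {
        "p103": env.get("GENESIS_ENABLE_P103") == "1",
        "pn32": env.get("GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL") == "1",
    }
    for key, default in DEFAULTS.items():
        settings[key] = int(env.get(key, default))
    return settings


recipe = read_cliff2_settings({
    "GENESIS_ENABLE_P103": "1",
    "GENESIS_ENABLE_PN32_GDN_CHUNKED_PREFILL": "1",
})
override = read_cliff2_settings({
    "GENESIS_ENABLE_P103": "1",
    "GENESIS_FLA_FWD_H_MAX_T": "2048",
})
```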

What I'd love help with next

When you have a window:

Pull v7.69 dev tip (2db18df) and re-run probes 4 + 5 + 7 on the long-text compose. PN30 part3 should apply cleanly now (not skip with upstream_merged). Drop your patch_workspace_lock_disable.py sidecar if PN34 + PN33 cover it (worth confirming).
Probe 7 (60K) + GENESIS_ENABLE_P103=1 alone — should now work because the chunk.py self-install hook fires in workers. If P103 alone closes Cliff 2 at 60K on your rig, that's strong evidence the install model is correct.
Probe 7 + P103 + PN32 v2 — best memory profile. Should push the Cliff 2 ceiling well above your previous 50-60K limit.
Drop the --engine-plugins-or-pip-install question — v7.69 P103 doesn't need either. The chunk.py text-patch is durable. Same self-install pattern PN25 v7.68 uses for activation.py.

If anything regresses I'd rather hear about it within a day than have you carry sidecars forever. The cross-rig data you're producing is what catches class-of-bug failures my A5000 PROD never sees — already four turn-around fixes from your reports in the last 48 hours.

Thanks for the patient bug-hunting. Wishing peace and a clear sky to everyone.

— Sander, Ukraine, Odessa
(translated from Russian via AI; please forgive any awkward phrasing)

Sandermage · 2026-05-02T11:31:04Z

Sandermage
May 2, 2026
Author

If this version passes validation and doesn't crash on your end, I will merge it into the main branch and we can lock in the first stable release :).

It would be incredibly helpful if people could keep sharing their testing data...
But things will get even more interesting once I manage to acquire server (workstation) Blackwell hardware for myself.
I intentionally avoid consumer-grade solutions like the RTX 40x0 and 50x0 series for my own setup.
First, because consumer cards inherently aren't designed for 24/7 operation and lack ECC memory.
Second, their power consumption is way too high compared to specialized cards.
However, any data you can provide will always be useful.

noonghunna · 2026-05-02T15:40:42Z

noonghunna
May 2, 2026
Maintainer

@Sandermage, first off, the v7.69 turnaround speed is something else. F1 + F2 + F3 root causes all diagnosed correctly in your replies, 18 new tests, dispatcher composition matrix for P103 + PN32 v2: that's a tight loop. Pulled 2db18df to a fresh v7.69-cliff2-test branch and ran the long-text + dual-turbo probes 4/5/7 with the recommended env bundle, plus a multi-round bisect informed by ChatGPT/Codex. Full results below.

✅ F1 (PN30 part3 drift-marker) — confirmed working

DS layout active throughout. PN30 v7.68 part1+2+3 all APPLY clean, no upstream_merged false-positive. Apply_all elapsed: clean.

✅ F2 (P103 self-install hook in chunk.py) — confirmed firing on TP=1

Trace at runtime hits vllm/_genesis/wiring/hybrid/patch_103_fla_cliff2_chunked.py:218 chunked_fwd — wrap engages on every FLA call after exec vllm serve. The "rebound at 0 caller sites" log message in v7.68 was misleading; v7.69 corrects the message. Self-install pattern is solid.

⚠️ F3 (PN32 v2 + P103 chunked path) — engages but doesn't close 60K Cliff 2 on this config

This is where the cross-rig story gets interesting. We added a temp diagnostic log right before P103's gate to capture what's actually happening on real serving. Captured 442 P103 invocations on the 60K probe:

| T value | Invocations | Notes |
|---|---|---|
| 4128 | 394 | vLLM's chunked-prefill chunk size (capped by `max_num_batched_tokens=4128`) |
| 64 | 48 | cudagraph warmup or MTP verify path |
| > 4128 | 0 | Never seen |

q.shape[0] = 1 always. cu_shape = torch.Size([2]) always. So _single_seq_cu = True and _true_varlen_multi_seq = False for every invocation.

The P103 chunked path never engages because q.shape[1] <= _MAX_T (4128 ≤ 16384) is always true on real serving: vLLM's outer chunked prefill already caps T at 4128, well below MAX_T. PN32 v2's outer-level chunking is bypassed for the same reason (chunk size 8192, threshold 16384), since T never exceeds its threshold. Neither closure mechanism fires.

We tested forcing it via GENESIS_FLA_FWD_H_MAX_T=2048. Chunked path engaged, per-call allocation halved (50→24 MiB), but cumulative state grew slightly and OOM fired earlier in absolute call count. Confirmed Codex's hypothesis: the issue isn't a single allocation that needs splitting; it's accumulated activation residency that the 50 MiB late-stage allocation can't fit into.

The actual closure: vllm#35975 backport + mem-util tuning + bisect data

ChatGPT/Codex round 2 diagnosed it as headroom rather than gate logic, and called out vllm#35975 (open) as a directly-relevant fix — skips inputs_embeds GPU buffer for text-only models, claims ~64 MiB savings.

We backported it locally as a setup-time text-patch. Combined with mem-util tuning, the matrix:

| Config | Boot resident | 60K MTP-on | Wall | Notes |
|---|---|---|---|---|
| 0.95 (baseline) | 23,164 MiB | ❌ OOM 50/24.5 free | n/a | Cliff 2 fires |
| 0.95 + #35975 | 22,720 MiB | ❌ OOM 50/46.5 free | n/a | #35975 freed 444 MiB at boot, only 22 MiB margin at peak |
| 0.92 + #35975 | 21,980 MiB | ✅ HTTP 200 | 689s | Cliff 2 closed. ~580 MiB end-of-run margin, AL=4.00 |
| 0.93 + #35975 | 22,260 MiB | ✅ HTTP 200 | 623s | Cliff 2 closed. ~494 MiB margin. Best balanced point |

Plus MTP-off + 0.95: 60K passes in 504s with full 5+ GiB KV pool — different shipping variant for users who want max KV pool.
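
The arithmetic behind the matrix, as a small sketch. The numbers are copied from the table; reading "OOM 50/24.5 free" as "a 50 MiB request with 24.5 MiB free" is an interpretation on my part, not something the logs above spell out.

```python
BASELINE_BOOT_MIB = 23164  # 0.95 (baseline) boot resident

boot_resident_mib = {
    "0.95 + #35975": 22720,
    "0.92 + #35975": 21980,
    "0.93 + #35975": 22260,
}

# MiB freed at boot relative to the 0.95 baseline.
freed_at_boot = {name: BASELINE_BOOT_MIB - mib
                 for name, mib in boot_resident_mib.items()}

# (requested MiB, free MiB) at the failing allocation, per the assumed reading.
oom_rows = {
    "0.95 (baseline)": (50.0, 24.5),
    "0.95 + #35975": (50.0, 46.5),
}
shortfall = {name: req - free for name, (req, free) in oom_rows.items()}

# Extra free memory at the failure point that #35975 bought at 0.95.
peak_gain_mib = oom_rows["0.95 + #35975"][1] - oom_rows["0.95 (baseline)"][1]
```

Under that reading, #35975 alone leaves the 0.95 config only a few MiB short, which is consistent with a small mem-util reduction being enough to close the gap.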

Per Codex's framing post-bisect, three explicit variants to ship:

Balanced MTP — long-text.yml updated: Genesis v7.69 + your full env bundle + Codex P103 gate fix + #35975 sidecar + 0.93 mem-util + MTP K=3 retained. Cliff 2 closed at 60K with spec-decode acceleration. KV concurrency at 180K: ~1.4x.
Max-context safety — long-text-no-mtp.yml: Same minus --speculative-config, full 0.95 mem-util. For long-shot RAG/codebase prompts where slow decode is OK in exchange for max KV pool stability.
Future upstream win — vllm#37429 hybrid Mamba/attention KV cache sizing. If it applies cleanly, could free residency without trading mem-util. Untested, separate branch experiment.

90K probe at 0.93 + max_tokens=1 (prefill-only timing) is in flight as we draft this; result will update the recipe with a confirmed Cliff 2 ceiling figure.

On Codex's P103 gate fix recommendation

We applied the gate change anyway:

-if cu_seqlens is not None or q.shape[1] <= _MAX_T:
+_single_seq_cu = (cu_seqlens is not None and q.shape[0] == 1
+                  and cu_seqlens.shape[0] == 2)
+_true_varlen_multi_seq = cu_seqlens is not None and not _single_seq_cu
+if _true_varlen_multi_seq or q.shape[1] <= _MAX_T:

Plus canonicalize cu_seqlens=None inside the chunked path (since [0,T] is dense B=1 semantically). It's the right semantic fix — the previous gate blocked single-seq cu_seqlens unnecessarily, which would matter on configs without vLLM's outer chunked-prefill capping T. Worth shipping in v7.70 even though it's not what closes Cliff 2 on our specific config. Diff in this discussion's attached file (or I can open a PR if you'd prefer).
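
To make the gate change concrete, here is a sketch of the two predicates over plain integers. The function names are mine, and `cu_len` stands in for `cu_seqlens.shape[0]` (or `None` when `cu_seqlens` is `None`); `True` means "pass through to the original kernel", `False` means "take the chunked path".

```python
def old_gate_passthrough(batch, T, cu_len, max_t):
    """Previous gate: any cu_seqlens at all forces passthrough."""
    return cu_len is not None or T <= max_t


def new_gate_passthrough(batch, T, cu_len, max_t):
    """Revised gate: only true multi-sequence varlen forces passthrough."""
    single_seq_cu = cu_len is not None and batch == 1 and cu_len == 2
    true_varlen_multi_seq = cu_len is not None and not single_seq_cu
    return true_varlen_multi_seq or T <= max_t


cases = {
    # Shape observed on real serving: B=1, T=4128, cu_seqlens of length 2.
    "observed": (1, 4128, 2, 16384),
    # Same call with MAX_T forced down to 2048.
    "forced_2048": (1, 4128, 2, 2048),
    # Packed multi-sequence batch (cu_seqlens longer than 2).
    "multi_seq": (1, 40000, 3, 16384),
    # Dense long prompt with no cu_seqlens.
    "dense_long": (1, 40000, None, 16384),
}
old = {name: old_gate_passthrough(*args) for name, args in cases.items()}
new = {name: new_gate_passthrough(*args) for name, args in cases.items()}
```

The two gates differ only for a single-sequence cu_seqlens with T above MAX_T: the old gate still passes through, the new one chunks.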

PN32 v2's analogous gate (multi-seq bypass) has the same property. For users running spec-decode + long single-prompt, the right composition is: chunked-prefill at outer level (4128 cap) + #35975 freeing residency + 0.92 mem-util freeing activation budget. P103's chunked path is a defense-in-depth for cases where T > 16384 reaches FLA directly (synthetic benchmarks, non-chunked-prefill configs).

Cross-stack signal: PFlash from Luce-Org

@troymroberts surfaced this in club-3090#25 — Sandro Puppo's announcement + the lucebox blog. Not asking you to integrate, but flagging because the architectural overlap with PN26b is interesting:

PFlash is a long-context prefill accelerator (vs DFlash which is decode). Uses small drafter (Qwen3-0.6B) + block-sparse attention to score token importance, compresses 128K → ~6.5K tokens before target prefill runs.
Headline claim: TTFT 24.8s vs 257s vanilla llama.cpp at 128K (~10.4× speedup)
Block-sparse drafter attention on SM86 — the kernel surface is exactly what your PN26b targets. If a community vLLM port of PFlash emerges, your existing sparse-V infrastructure could be directly reusable.
C++/CUDA only today, lives in lucebox-hub. Tracked in our docs/UPSTREAM.md as a watch entry.

club-3090 plans to explore integration once lucebox-hub server stabilizes (currently has the daemon-mode + greedy-only quirks documented). Mentioning here in case you were unaware of it.

📱 Twitter handle?

Last housekeeping ask: what's your Twitter / X handle? We're starting to do more public posts (rewrote the pinned welcome to "club-3090 is open to all CUDA hardware" yesterday; likely more once v7.69 lands stable + we have NVLink + 4×3090 + modded-3080 cross-rig data points). We want to credit you properly when posting — your patches are the single biggest reason this stack performs the way it does.

If you'd rather stay off social platforms, that's fine — just say so and we'll credit you as @Sandermage on GitHub instead.

On Blackwell server > consumer

Your reasoning (24/7 reliability, ECC, lower power per useful FLOP) is the right framing for PROD. When the Blackwell server tier comes within reach, the cross-rig story flips — we become your test surface for Ampere consumer, you become the reference for Blackwell datacenter. Useful split.

— @noonghunna

Wishing peace and clear sky.

Sandermage · 2026-05-02T16:00:07Z

Sandermage
May 2, 2026
Author

Thanks for the insights. I’m also considering integrating Codex into my validation workflow on a permanent basis; it’s a great time-saver for catching things I might have missed or messed up.

Regarding PFlash, I’ve looked into it, but it’s not a top priority for now, even though the tech is interesting. We’ll see how it goes—I don’t want to make promises I can’t keep. Right now, my main focus is stabilizing the entire stack; new features and improvements will come after that.

I’m currently reworking the structure and developing an installer so that everything can be set up with a single click or command, bringing the whole project into a more logical and streamlined form.

I’ll be away from my desk until Monday. Taking a little break to recharge, otherwise my brain is going to go 'boom' :). All fixes and updates will resume on Monday.

I’m not very active on Twitter (X) yet—mostly just reading—but you can find me here:

X: https://x.com/AleksandrBarzov

Instagram: https://www.instagram.com/sander_odessa/

Facebook: https://www.facebook.com/sander.odessa/

I probably need to change my approach to social media, since the patcher and everything I’m doing is primarily aimed at the English-speaking community.

Thanks again for the feedback, and have a great weekend!

Sandermage · 2026-05-05T18:22:19Z

Sandermage
May 5, 2026
Author

I'll write to you as soon as I'm done.

This OOM issue really bothered me. To be honest, I wanted to close the subject, since it works on 2 GPUs and I don't see any problems on my end. But then I thought: wait. If I solve the problem on a single card and optimize memory usage so that Qwen3.6 27B and other models of that size run smoothly on a single 24 GB GPU with a decent context window, the same optimization and savings will carry over to 2 GPUs, as well as to cards with 32-48-72-96-142 GB of memory, and so on.

So, for now, I've rolled the patcher repo back to the state before I started experimenting, because the AI that handles my pushes decided to just dump everything in there. Right now I'm reworking the entire memory management stack.

For now I'm tailoring it to the versions and the type of caching I currently use; after that, I'll expand it based on what actually works out.

After that, it will strictly be fixes and tweaks to the current code to reach a truly stable version. Only then will I touch or try anything new, because dealing with memory and VRAM caching took up my whole day today.

I'll try not to drag it out.

The current version on the dev branch is working.

Sandermage · 2026-05-06T00:52:35Z

Sandermage
May 6, 2026
Author

The update is delayed, as I am currently doing a comprehensive rebuild of how the patcher handles memory and models.
I'm optimizing the entire patcher and fixing the bugs/errors I've found.
Once all the fixes and improvements are finished, I'll release the update with updated .md files that include new instructions and explanations of the changes.

Below is an example of the config for the models. Going forward, the patcher will operate and launch the model using this type of config.
This was done to eliminate clutter and errors in model deployment and to unify the approach to launching different models. This way, it will be enough to just share a ready-made config tailored for a specific GPU with the community, which will include all the necessary variables and data.
The paths in the config are mine; they will be substituted automatically during scanning.
I have also developed a patcher launcher that will take over starting all the required services.
There will be several launch options: in Docker, and on a system without Docker.
Instructions for it will be included as well.
I will also slightly rework the patcher's directory tree for a more logical and compact layout of all scripts, and clean out dead and unnecessary code, files, and scripts.

Next, as soon as we verify that everything is stable and working as intended, I'll move on to the full integration of gemma 4.

Best wishes to everyone. Peace and clear skies.
@noonghunna I will be online from 10:30 to 24:00 Kyiv time. If something is urgent, or for faster communication, message me on Telegram.

# SPDX-License-Identifier: Apache-2.0
# Genesis builtin model config — Sander's PROD reference
# Source: scripts/launch/start_35b_fp8_PROD.sh + bench 2026-05-05

key: a5000-2x-35b-prod
title: 2× RTX A5000 — 35B-A3B FP8 PROD
description: >-
  Sander's production reference rig. Qwen3.6-35B-A3B FP8 MoE +
  TurboQuant k8v4 KV + MTP K=3 spec-decode. Validated 192.6 TPS
  sustained on 2× A5000 24 GB.
schema_version: 1
maintainer: sandermage
last_validated: '2026-05-05'
genesis_pin: 991dc1a
vllm_pin_required: 0.20.2rc1.dev9+g01d4d1ad3
lifecycle: stable
workload_tag: balanced

hardware:
  gpu_match_keys: [rtx a5000]
  n_gpus: 2
  min_vram_per_gpu_mib: 22000
  cuda_capability_min: [8, 6]

model_path: /models/Qwen3.6-35B-A3B-FP8
served_model_name: qwen3.6-35b-a3b
quantization: null  # FP8 native, no extra arg
kv_cache_dtype: turboquant_k8v4

max_model_len: 320000
gpu_memory_utilization: 0.90
max_num_seqs: 2
max_num_batched_tokens: 4096
enable_chunked_prefill: true
dtype: float16
cudagraph_mode: FULL_AND_PIECEWISE   # vllm default — Genesis stack standard
enforce_eager: false
disable_custom_all_reduce: true
language_model_only: true
trust_remote_code: true

enable_auto_tool_choice: true
tool_call_parser: qwen3_coder
reasoning_parser: qwen3

spec_decode:
  method: mtp
  num_speculative_tokens: 3

# Genesis patcher env (P*, PN*, GENESIS_*)
genesis_env:
  GENESIS_ENABLE_P58_ASYNC_PLACEHOLDER_FIX: '1'
  GENESIS_ENABLE_P60_GDN_NGRAM_FIX: '1'
  GENESIS_ENABLE_P60B_TRITON_KERNEL: '1'
  GENESIS_ENABLE_P61_QWEN3_MULTI_TOOL: '1'
  GENESIS_ENABLE_P61B_STREAMING_OVERLAP: '1'
  GENESIS_ENABLE_P62_STRUCT_OUT_SPEC_TIMING: '1'
  GENESIS_ENABLE_P64_QWEN3CODER_MTP_STREAMING: '1'
  GENESIS_ENABLE_P66_CUDAGRAPH_SIZE_FILTER: '1'
  GENESIS_ENABLE_P67_TQ_MULTI_QUERY_KERNEL: '1'
  GENESIS_P67_USE_UPSTREAM: '1'
  GENESIS_P67_NUM_KV_SPLITS: '32'
  GENESIS_ENABLE_P68_AUTO_FORCE_TOOL: '1'
  GENESIS_ENABLE_P69_LONG_CTX_TOOL_REMINDER: '1'
  GENESIS_ENABLE_P70_AUTO_STRICT_NGRAM: '1'
  GENESIS_P68_P69_LONG_CTX_THRESHOLD_CHARS: '50000'
  GENESIS_ENABLE_P37: '1'
  GENESIS_TQ_MAX_MODEL_LEN: '320000'
  GENESIS_ENABLE_P72_PROFILE_RUN_CAP: '1'
  GENESIS_PROFILE_RUN_CAP_M: '4096'
  GENESIS_ENABLE_P74_CHUNK_CLAMP: '1'
  GENESIS_ENABLE_P81_FP8_BLOCK_SCALED_M_LE_8: '1'
  GENESIS_ENABLE_P82: '1'
  GENESIS_ENABLE_PN8_MTP_DRAFT_ONLINE_QUANT: '1'
  GENESIS_ENABLE_PN11_GDN_AB_CONTIGUOUS: '1'
  GENESIS_ENABLE_P99: '1'
  GENESIS_ENABLE_P101: '1'
  GENESIS_P82_THRESHOLD_SINGLE: '0.3'
  GENESIS_PREALLOC_TOKEN_BUDGET: '4096'
  GENESIS_ENABLE_PN19_SCOPED_MAX_SPLIT: '1'    # vllm#41268 backport — max_split_size_mb=20 scope-restored during model load
  GENESIS_BUFFER_MODE: shared

# System env (PYTORCH_*, VLLM_*, NCCL_*, OMP_*, CUDA_*, TRITON_*)
system_env:
  PYTORCH_CUDA_ALLOC_CONF: 'expandable_segments:True,max_split_size_mb:256'
  VLLM_NO_USAGE_STATS: '1'
  VLLM_LOGGING_LEVEL: WARNING
  VLLM_FLOAT32_MATMUL_PRECISION: high
  VLLM_USE_FLASHINFER_SAMPLER: '1'
  VLLM_USE_FUSED_MOE_GROUPED_TOPK: '1'
  VLLM_ALLOW_LONG_MAX_MODEL_LEN: '1'
  VLLM_WORKER_MULTIPROC_METHOD: spawn
  VLLM_MARLIN_USE_ATOMIC_ADD: '1'
  VLLM_MOE_USE_DEEP_GEMM: '0'
  VLLM_USE_DEEP_GEMM: '0'
  VLLM_USE_FLASHINFER_MOE_FP8: '0'
  VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS: '1'
  TRITON_CACHE_DIR: /root/.triton/cache
  NCCL_P2P_DISABLE: '1'
  OMP_NUM_THREADS: '1'
  CUDA_DEVICE_MAX_CONNECTIONS: '8'

vllm_extra_args:
  - --no-scheduler-reserve-full-isl
  - --performance-mode interactivity
  - --attention-config.flash_attn_version 2
  - --generation-config vllm

api_key: genesis-local
host: 0.0.0.0

docker:
  image: vllm/vllm-openai:nightly
  container_name: vllm-server-mtp-test
  port: 8000
  shm_size: 8g
  memory_limit: 64g
  network: genesis-vllm-patches_default
  gpus: all
  mounts:
    - /nfs/genesis/models:/models:ro
    - /home/sander/.cache/huggingface:/root/.cache/huggingface:ro
    - /home/sander/Genesis_Project/vllm_engine/triton-cache-mtp-test:/root/.triton/cache
    - /home/sander/Genesis_Project/vllm_engine/compile-cache-prod-mirror-test:/root/.cache/vllm/torch_compile_cache
    - /home/sander/genesis-vllm-patches/vllm/_genesis:/usr/local/lib/python3.12/dist-packages/vllm/_genesis:ro
    - /home/sander/genesis-vllm-patches/tools/genesis_vllm_plugin:/plugin:ro
  extra_run_flags:
    - --security-opt label=disable

reference_metrics:
  measured_at: '2026-05-05T18:35:00Z'
  bench_method: bench_35b.sh × 5 sections (short_gen×10 / long_gen×3 / tool×10 / stability×5 / concurrent×4)
  long_gen_sustained_tps: 192.6
  long_gen_mean_lat_s: 5.19
  short_gen_tps: 225.6
  tool_call_score: 9/10
  stability_mean_s: 1.387
  stability_cv_pct: 1.80
  concurrent_4_total_s: 5.14
  vram_used_mib_per_gpu: [22265, 21558]
  vram_total_mib: 43823
  genesis_pin: 991dc1a
  vllm_pin: 0.20.2rc1.dev9+g01d4d1ad3

verify_tolerances:
  tps_drop_pct_max: 5.0
  tool_call_min: 9/10
  stability_cv_pct_max: 6.0
  vram_increase_mib_max: 2000

verified_on:
  - 'sandermage/2xA5000-A2: 192.6 TPS sustained, 9/10 tool, 991dc1a, 2026-05-05'

notes:
  - 'ℹ Pin gate enforces vllm 0.20.2rc1.dev9+g01d4d1ad3 (KNOWN_GOOD_VLLM_PINS)'
  - 'ℹ MoE FP8 routes through Marlin (P87/P91) on Ampere SM 8.6'
  - '⚠ Do NOT enable --enable-prefix-caching with TQ k8v4 + spec_decode (DS conv state crash)'
  - '⚠ GQA=8 (pow-2) routes through P67 multi-query — DO NOT remove GENESIS_ENABLE_P67=1'
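
A sketch of how the `verify_tolerances` block could be applied to a fresh benchmark run. `check_against_reference` is a hypothetical helper, and the semantics assumed here (TPS drop relative to the reference, VRAM increase judged per GPU) are my reading of the key names, not confirmed behavior of the patcher.

```python
from fractions import Fraction

reference = {
    "long_gen_sustained_tps": 192.6,
    "stability_cv_pct": 1.80,
    "vram_used_mib_per_gpu": [22265, 21558],
}
tolerances = {
    "tps_drop_pct_max": 5.0,
    "tool_call_min": "9/10",
    "stability_cv_pct_max": 6.0,
    "vram_increase_mib_max": 2000,
}


def check_against_reference(measured, reference, tolerances):
    """Return the names of the checks that fail (empty list means pass)."""
    failures = []
    tps_floor = reference["long_gen_sustained_tps"] * (
        1 - tolerances["tps_drop_pct_max"] / 100)
    if measured["long_gen_sustained_tps"] < tps_floor:
        failures.append("tps")
    # Scores are strings such as "9/10"; Fraction parses them directly.
    if Fraction(measured["tool_call_score"]) < Fraction(tolerances["tool_call_min"]):
        failures.append("tool_call")
    if measured["stability_cv_pct"] > tolerances["stability_cv_pct_max"]:
        failures.append("stability_cv")
    growth = max(m - r for m, r in zip(measured["vram_used_mib_per_gpu"],
                                       reference["vram_used_mib_per_gpu"]))
    if growth > tolerances["vram_increase_mib_max"]:
        failures.append("vram")
    return failures


good_run = {"long_gen_sustained_tps": 185.0, "tool_call_score": "9/10",
            "stability_cv_pct": 2.5, "vram_used_mib_per_gpu": [22300, 21600]}
bad_run = {"long_gen_sustained_tps": 180.0, "tool_call_score": "8/10",
           "stability_cv_pct": 2.5, "vram_used_mib_per_gpu": [22300, 21600]}

good_failures = check_against_reference(good_run, reference, tolerances)
bad_failures = check_against_reference(bad_run, reference, tolerances)
```

With a 5% tolerance on 192.6 TPS the floor sits just under 183 TPS, so a 185 TPS run passes while a 180 TPS run is flagged.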

Sandermage · 2026-05-06T15:48:53Z

Sandermage
May 6, 2026
Author

The changes to the patcher are mostly ready. There are a few things left to finish, but primarily, I'm waiting for tomorrow's nightly build, as it will include the necessary bug fixes and improvements. Once it's out, I'll test the patcher on it one more time and update the repository.

Memory management has been improved; the next step is to experiment with the utilization value, fine-tuning it within the 0.85-0.92 range.

The complete list of new features and changes will be available after the release.

noonghunna · 2026-05-06T16:20:22Z

noonghunna
May 6, 2026
Maintainer

@Sandermage — appreciate the rolling update. The "rework over hotfix" call really is the right one for this — landing memory mgmt as a unified pass that benefits every VRAM class (24/32/48/72/96/142) is a much bigger win than another single-card sidecar.

We're aligned on the "waiting for tomorrow's nightly" cadence on our side too — vLLM PR #41745 (Gemma 4 MTP first-party support) merged today at 14:39 UTC at commit 27e0057, but the latest published vllm/vllm-openai:nightly-* tag is from 06:08 UTC (pre-merge). So the next nightly drop (~2026-05-07 06:08 UTC) is the gate for both your v7.73.x release-candidate validation and our overlay drop on the MTP path. Will rebase + retest in the same window.

The mem-util range note (0.85-0.92) is useful for us — our current TQ3 composes ship 0.92-0.95 (long-text/long-vision/dual-turbo), so when v7.73.x lands we'll re-validate those defaults against your tuned range. If your rework moves the cliff inward (more conservative mem-util as the new safe default in exchange for a stable 24 GB single-card path), that's a win we'd take in a heartbeat — we've been advising single-card users to switch engines (llamacpp/default) as a workaround, which costs them ~50% TPS. Closing the mem-util gap structurally is the better story.

Quick progress on our side while you've been heads-down

Two threads worth a heads-up:

Gemma 4 spec-decode fully shipped on club-3090: TP=2 numbers landed yesterday + today across both spec-decode methods. MTP (your eventual integration target) gives 109 narr / 142 code TPS, AL ~4.0, soak PASS. DFlash (Z-Lab's block-diffusion drafter via vLLM PR #41703, ChatGPT/Codex-rebased onto current main) gives 95 narr / 168 code TPS, AL 5.23, soak PASS at n=7 (Pareto-optimal vs n=8). Different operating regimes — DFlash wins code +18%, MTP wins narrative +15%. Full bench + n-sweep + comparison at discussion #67 (and the comparison table is the data you'd want as a starting point when you do the Gemma 4 integration on the patcher side). When you get to Gemma 4, the AutoRound INT4 + BF16 drafter combo is what holds together — drafter dtype doesn't have to match target.

TQ3 compose hygiene cleanup (#82 by @NHClimber87): one of our composes was missing GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1 despite our own docs/CLIFFS.md flagging it as required. Fixed in 3167497, then propagated to four other TQ3 composes (ab69f65) that had only P98 enabled (which auto-skips on v0.20 due to the UNIFORM_SINGLE_TOKEN_DECODE drift-marker false-positive — original side-note still pending). Net: every TQ3 compose now uniformly carries PN34 + P98 belt+suspenders. Hygiene-wise this is a reminder that whenever you tag a new release with patch promotions, we need to systematically grep every GENESIS_ENABLE_* opt-in across all composes and normalize. Adding that to our pre-flight checklist for the v7.73.x landing.
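
The pre-flight check described above could look roughly like this. It is a sketch that scans compose text held in memory; the file names and `missing_flags` are hypothetical, and a real version would read the compose files from disk and take the required set from docs/CLIFFS.md.

```python
import re

# Matches GENESIS_ENABLE_<NAME>=1 as well as "GENESIS_ENABLE_<NAME>: '1'".
FLAG_RE = re.compile(r"\b(GENESIS_ENABLE_[A-Z0-9_]+)\s*[=:]\s*['\"]?1['\"]?(?!\d)")


def enabled_flags(text):
    """Return the set of opt-in flags that are switched on in the text."""
    return set(FLAG_RE.findall(text))


def missing_flags(composes, required):
    """Map each compose name to the required flags it does not enable."""
    report = {}
    for name, text in composes.items():
        missing = sorted(required - enabled_flags(text))
        if missing:
            report[name] = missing
    return report


composes = {
    "compose-a.yml": """
      environment:
        - GENESIS_ENABLE_P103=1
        - GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX=1
    """,
    "compose-b.yml": """
      environment:
        GENESIS_ENABLE_P103: '1'
    """,
}
required = {"GENESIS_ENABLE_PN34_WORKSPACE_LOCK_RELAX"}
report = missing_flags(composes, required)
```

Running a check like this on every release that promotes a patch would catch a compose that silently lacks a required opt-in.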

Standing by

The one-line A/B (relax has_no_chunk_metadata to True) is still staged here — but happy to ditch it entirely in favor of validating your structural rework when v7.73.x ships. Will pin master to the new release-candidate the moment you publish, run the same verify-full + soak-continuous matrix across all five single-card variants (long-text, long-text-no-mtp, long-vision, bounded-thinking, default) at your recommended mem-util band, and post results on the issue and here.

Telegram noted — will reach out there if anything urgent surfaces during validation. Otherwise async on the discussion is fine and matches your work cadence.

Peace and clear skies.

Sandermage · 2026-05-08T01:40:48Z

Sandermage
May 8, 2026
Author

Updates and changes have taken longer than expected. I’m reworking a lot of elements to make them more intuitive and efficient. There are a massive number of changes within the project itself and the supporting utilities to ensure a higher quality and more streamlined user experience.
Additionally, starting with the next version, the project is being renamed to Sander Core Engine (or sndr_core).
The project remains open-source. However, there are certain components I develop for personal use or that cannot be made public for various reasons; these will be housed in sndr_engine. This restructuring isn't about monetization, but rather about a clear separation of concerns. It allows me to use my own developments in non-public projects within a single module—without compromising structural integrity or breaking patches and other features.
I am planning the next update for Tuesday or Wednesday. It will be fully tested on the latest available vLLM nightly build.
I will provide a detailed breakdown of all new features and changes in the documentation soon. For now, I can say that future upgrades will be much simpler—just a single command to pull all necessary updates.
Wishing everyone a great day and peace.

noonghunna · 2026-05-08T19:35:22Z

noonghunna
May 8, 2026
Maintainer

@Sandermage — appreciate the rolling status, and good name choice ("Sander Core Engine" / sndr_core is cleaner than genesis-vllm-patches and matches your shift toward a unified utility framework rather than a patch collection). The split between sndr_core (open) and sndr_engine (your private dev space) makes sense — keeps the open-source surface focused on what's actually testable cross-rig while letting you iterate on internal tooling without forced decisions about what to publish.

The single-command upgrade flow is going to be a real onboarding win — every new club-3090 user we've helped this week has hit some variant of "where do I git clone, what env vars do I set, which compose loads it" and the answer has been a 4-step walkthrough. Collapsing that to one command will cut a lot of friction for people just trying things out.

We'll re-test our composes against the v7.73.x release (whether labeled v7.73 or as an sndr_core first cut) on the day it lands — same v7.72.2 → v7.73.x uplift cycle as the prior pin migrations. If the new design moves the mem-util cliff inward (more conservative defaults in exchange for stable single-card 24 GB), that's a structurally better story we'd take in a heartbeat.

Tuesday/Wednesday lands clean — no rush, the rework warrants the time. Will mention on disc #67 (the Gemma 4 thread we just updated) when v7.73.x / sndr_core lands so the audience there knows where to look for the next pin bump. 👋

Sandermage · 2026-05-08T21:58:37Z

Sandermage
May 8, 2026
Author

The lighthouse was a symbol of Odesa, and it probably still remains one. It's a beautiful city, and the true natives of Odesa who were born and raised here are practically their own separate nation :)))
I was lucky enough to catch the people of that old Odesa, with their unique dialect.. :))
That's exactly why I use a lighthouse in my logos and images. It's a tribute to nostalgia and a way to emphasize strength and freedom..

And I myself am a native of Odesa going back several generations.. It is my beloved hometown. And probably the best place on Earth..

Sandermage · 2026-05-09T13:34:07Z

Sandermage
May 9, 2026
Author

genesis_env:
  GENESIS_ENABLE_PN95_TIER_AWARE_CACHE: '1'
  GENESIS_PN95_CONFIG_KEY: a5000-2x-tier-aware-example
  GENESIS_PN95_TICK_EVERY: '100'
  GENESIS_PN95_DEMOTE_FREE_MIB_THRESHOLD: '2048'

cache_config:
  tiers:
    - device: gpu
      capacity_gib: 22.0       # ~92% of 24 GiB
      low_water_pct: 0.75
    - device: cpu
      capacity_gib: 8.0        # per-worker, 16 GiB pinned RAM in total
      pinned: true
      vision_first: true
  exclude_mamba_ssm: true
  vision_demote_first: true

Sandermage · 2026-05-09T17:26:12Z

Sandermage
May 9, 2026
Author

To run the new version, the system must have 16+2 GB of free RAM!
These are not the requirements of the utility itself, but rather the requirements for running models with a large context window. This memory will be reserved for the needs of the system and the LLMs.
Detailed information will be provided after the release. I am still working on the system and refining it.

noonghunna · 2026-05-12T12:59:00Z

noonghunna
May 12, 2026
Maintainer

@Sandermage — the PN95 tier-aware-cache shape is exactly the right primitive set. Keeping Mamba SSM state on GPU (exclude_mamba_ssm: true) is non-negotiable on Qwen3-Next / hybrid-attention models — that state isn't recomputable cheaply. Demoting vision tokens first (vision_first: true) is the right pressure-release valve since vision tokens are long-context-distance-bound but rarely re-attended after the initial pass. The 16+2 GB host-RAM floor is reasonable; any rig running 27-31B above ~100K ctx already has the headroom.
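
One way to picture the demotion policy discussed here, as a sketch under stated assumptions. PN95's internals are not shown in this thread, so `Block`, `pick_demotions`, and the ordering rule (vision first, then least recently used, SSM state never demoted) are hypothetical illustrations of the config keys `exclude_mamba_ssm`, `vision_demote_first`, and the free-MiB threshold.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    kind: str        # "text", "vision", or "mamba_ssm"
    size_mib: int
    last_used: int   # smaller means older


def pick_demotions(blocks, free_mib, threshold_mib,
                   exclude_mamba_ssm=True, vision_demote_first=True):
    """Choose GPU blocks to move to the CPU tier until free memory recovers."""
    if free_mib >= threshold_mib:
        return []
    candidates = [b for b in blocks
                  if not (exclude_mamba_ssm and b.kind == "mamba_ssm")]
    # Vision blocks sort ahead of everything else; ties go to the oldest block.
    candidates.sort(key=lambda b: (
        not (vision_demote_first and b.kind == "vision"), b.last_used))
    chosen = []
    for block in candidates:
        if free_mib >= threshold_mib:
            break
        chosen.append(block.name)
        free_mib += block.size_mib
    return chosen


blocks = [
    Block("ssm0", "mamba_ssm", 512, last_used=0),
    Block("txt0", "text", 512, last_used=1),
    Block("img0", "vision", 512, last_used=5),
    Block("img1", "vision", 512, last_used=2),
]
demoted = pick_demotions(blocks, free_mib=1024, threshold_mib=2048)
```

In this toy run the oldest block is SSM state, yet it stays on the GPU, and the two vision blocks are demoted before any text block is touched.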

Two things from our side that might be worth threading into the sndr_core release cycle when you're ready:

1. vllm#41434 landed 2026-05-08 — eliminates several GPU↔CPU syncs in attention impls. Measuring ~15% vanilla-path TPS lift on Qwen3-Next between pre-#41434 (nightly-01d4d1ad3, your current v7.72.2 known-good) and post-#41434 (nightly-1acd67a79). On Genesis-loaded paths your patch stack already buys back more than that (P67 alone is ~30% on K+1 spec-verify), so net you're still ahead today — but stacking #41434 under the v7.72.2 / sndr_core patches should be additive once the allowlist refreshes. Wanted to flag so it's on the radar for the next pin bump.

2. Cross-model head-to-head this week — Discussion #119: Gemma 4 31B vs Qwen 3.6 27B on dual 3090 across 6 configs (INT8 PTH / bf16 / TQ3 patch-only / TQ3+MTP Genesis-backed), same vLLM pin + rebench-full harness for legs 1-5, Genesis leg pinned at your allowlist 01d4d1ad3. The empirical signal for P67 as the actual fix is sharp: leg 5 (patch-only TQ3+MTP) lands at 18/150 quality + 0/30 aider + 34/100 silent-empty in soak; leg 6 (Genesis-backed) at 86/150 + 18/30 + 0/100 — same intent, two paths, P67 closes it. Genesis-backed TQ3+MTP also delivers 1.22M KV pool / 4.66× concurrency at 262K, within ~5pp of INT8 PTH quality. Thought you'd find the cross-model framing useful.

No pressure on the rename/refactor timing — the framework rework is clearly the right architectural move. Happy to fold a sndr_core-backed leg into the matrix as soon as you have a candidate ready for cross-rig testing, and to A/B PN95 tier-aware-cache against the long-ctx prefill cliffs we've been tracking on Qwen3-Next single-card. 🍻

Sandermage · 2026-05-12T16:52:07Z

Sandermage
May 12, 2026
Author

As soon as I can release the new version, everything will become much simpler... and better.
The project will continue to provide patches and improvements as before, but it will also introduce a platform and guidelines allowing others to submit their own patches for review and integration.
Configs, tests, classes, and much more. Including the complete functionality for running and testing...
There's a lot to write about, but it will be better to just see it... I'm not announcing any dates yet, as I don't know them for sure myself.
But sndr_core is no longer just a collection of patches; it is a full system for deploying vLLM natively, in Docker, or in k8s using a set of configs and profiles. It really will be better to see it... As soon as the project and the new version are ready, I will release it immediately. For now, it's in the testing, refinement, and bug-fixing stage...
Overall, it's about 50-60% complete right now.

Sandermage · 2026-05-12T17:00:39Z

Sandermage
May 12, 2026
Author

It all started as a quick, makeshift project for myself—just tweaking things here and there, and suddenly it was published.
Today, I am building a comprehensive platform that will handle the entire spectrum of tasks required for the stable, full-scale operation and deployment of LLMs.
This involves patches, integrations, and the development of custom solutions and improvements. It includes adding new features to vLLM, memory management, handling models, optimization, and much more...
You could say it's now a real product, both for myself and for the community. I hope it will enhance—or, if I may dream big, completely revolutionize—the way we support and work with AI and large language models on both consumer and server platforms...

0 replies

noonghunna · 2026-05-12T17:17:20Z

noonghunna
May 12, 2026
Maintainer

@Sandermage — 50-60% on a project whose scope has grown like this is genuinely impressive. The pivot from "collection of patches" to "deployment platform with config profiles + a review pipeline for outside contributors" is the architectural shift that separates a hobby fork from infrastructure. Most community patch work caps out at the patch surface; you're past that.

Rooting for you. When sndr_core is ready for cross-rig testing, we have the rebench-full 5-phase pipeline (bench / verify-stress / quality / soak / aider-polyglot-30) wired up across dual 3090 + dual 4090 + Razer Core X TB3 + various mixed-arch eGPU rigs from the community — happy to fold it into your validation gate as a "did this profile regress on consumer Ampere" canary. Your profile system would also let us swap config sets and re-run the structured-CoT / TQ3+MTP head-to-head matrix without compose surgery, which is exactly the friction the platform targets.
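For concreteness, here is a minimal sketch of what a "did this profile regress" canary could look like. Only the phase names come from the rebench-full list above. The `PhaseResult` shape, the `regressed_phases` helper, the "higher is better" score convention, and the 5% tolerance are all assumptions for illustration, not the actual pipeline.

```python
from dataclasses import dataclass

# Phase names are taken from the rebench-full list above.
# Everything else in this sketch is hypothetical.
PHASES = ("bench", "verify-stress", "quality", "soak", "aider-polyglot-30")


@dataclass(frozen=True)
class PhaseResult:
    phase: str
    score: float  # assumed convention: higher is better (TPS, pass rate, ...)


def regressed_phases(baseline, candidate, tolerance=0.05):
    """Return the phases where candidate falls more than `tolerance` below baseline."""
    base = {r.phase: r.score for r in baseline}
    cand = {r.phase: r.score for r in candidate}
    failed = []
    for phase in PHASES:
        if phase not in base:
            continue  # no baseline for this phase, nothing to compare
        if phase not in cand:
            failed.append(phase)  # a phase that did not run counts as a failure
            continue
        if cand[phase] < base[phase] * (1.0 - tolerance):
            failed.append(phase)
    return failed
```

An empty list would mean the profile passes the gate on that rig. A per-phase tolerance would probably be more realistic than a single global one, since soak and quality numbers tend to be noisier than raw TPS.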

Two upstream-vLLM items we've been tracking — no action needed, just so they're on your radar for the next pin-allowlist refresh:

vllm#41434 (GPU↔CPU sync elimination in attention impls, merged 2026-05-08) — flagged earlier in this thread. ~15% vanilla Qwen3-Next TPS lift between pre-#41434 (01d4d1ad3, your current allowlist) and post (1acd67a79). Genesis stack buys back more than that on its own, so on Genesis paths you're still ahead today — stacking #41434 under the next sndr_core pin should be additive.
vllm#35936 (qwen3_coder + tool_choice="required" parser bypass) — independent cross-rig repro posted today. Upstream PR sits stale 4 weeks with merge conflicts; affects any client that sends tool_choice="required" to Qwen3-Coder family. Trivial workaround (stay on tool_choice="auto"), but worth a known-good gate in sndr_core's profile validation when you have the bandwidth.
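The `tool_choice` workaround above could be expressed as a small pre-flight check. This is a minimal sketch, assuming an OpenAI-style chat-completions request dict with `model` and `tool_choice` keys. `check_tool_choice`, `is_qwen3_coder`, and the substring match on the model name are hypothetical and not part of sndr_core or vLLM.

```python
import copy


def is_qwen3_coder(model_name: str) -> bool:
    # Assumed heuristic: match on the model name. A real gate would more
    # likely key off profile metadata than a substring.
    return "qwen3-coder" in model_name.lower()


def check_tool_choice(request: dict, rewrite: bool = False):
    """Flag tool_choice="required" on Qwen3-Coder models (see vllm#35936).

    Returns (request, warnings). With rewrite=True the returned request is a
    copy with tool_choice set to "auto"; the caller's dict is left untouched.
    """
    warnings = []
    model = request.get("model", "")
    if is_qwen3_coder(model) and request.get("tool_choice") == "required":
        warnings.append(
            'tool_choice="required" is affected by vllm#35936 on this model; '
            'use tool_choice="auto"'
        )
        if rewrite:
            request = copy.deepcopy(request)
            request["tool_choice"] = "auto"
    return request, warnings
```

Note that rewriting to `"auto"` changes semantics: the model is then allowed to answer without calling a tool, so a gate that only warns may be the safer default.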

Ship it when it's right. 🍻

0 replies

Sandermage · 2026-05-19T00:13:18Z

Sandermage
May 19, 2026
Author

Hi everyone,
The release will hopefully happen this week, though it’s not set in stone just yet. The updated project itself is ready. There are just a few rough edges left, which I’ll be fixing and polishing up in the near future.
The main "culprit" behind the delay is Gemma 4 — I'm currently working through its 31B and 26B models. As soon as I sort out the current tasks and am satisfied with the performance, quality, and the specific features I want it to support, I’ll roll out the release.
Plus, my goal right now is to bring the project up to a certain standard of quality. After that, updates will drop much more frequently, split into:
1. Patch and bug-fix releases: Not daily, but I’m thinking every 3 to 5 days.
2. Major releases: Feature updates, quality improvements, new developments, and so on — roughly every 2 to 3 weeks.
That is, of course, if plans don't change.

1 reply

noonghunna May 19, 2026
Maintainer

Thank you for the update, Sander! Can't wait to get going when it's ready.

Sandermage · 2026-06-17T23:05:32Z

Sandermage
Jun 17, 2026
Author

Hi everyone! If you're still following the project, the release should be out by the end of the week. There are a ton of changes and bug fixes under the hood.

What's coming:

• Full support for Qwen 3.6 and Gemma 4 models, including DiffusionGemma.
• Up next: integration with SGLang.

Right now, I'm mostly polishing things up and ironing out the remaining kinks. Sorry for the delay: it turns out I had to rewrite and tweak quite a lot to get it up to my standards. All the details will be available with the release. The screenshots below show the new Web GUI panel for the project, which will go live right along with the update!

1 reply

noonghunna Jun 17, 2026
Maintainer

welcome back! This looks really slick!

Replies: 40 comments · 8 replies
