Genesis vLLM Patches v7.64 released — please test #19
Replies: 40 comments 8 replies
Just upgraded our pin to v7.64 today and started cross-rig validation on RTX 3090 single-card (vLLM Anchor drift status on our pin:
Open issue v7.64 doesn't address — Pin question: all of v7.64's empirical validation was on Backport priority feedback for v7.65:
Also strongly +1 on the Cliff 8 hardening ( Will share apples-to-apples 27B+TQ k8v4 numbers after the current bench finishes. ⭐ given. |
Hey @noonghunna, thanks for the detailed boot report; this is exactly the kind of cross-rig validation I was hoping for. Quick AI-translated notes from Odessa (apologies for any English roughness).
On the pin question: when I write a vLLM SHA in the README, that means I've actually rebuilt at that SHA, re-run the validator on it, re-baked the patcher under it, and re-validated the reproducer (35B FP8 tool-call + 27B Lorbus). It's not a "we should be on this someday"; it's "this is what I'm running right now, and it works". Update cadence is value-vs-regression, not a calendar. Concrete examples for context:
So when you see
On PN25 (forward_native inductor bypass): you're absolutely right that PN12 leaks past the compile path. The Genesis stack has the same flaw; we don't hit it in PROD only because our 27B Lorbus + cudagraph FULL_AND_PIECEWISE config short-circuits the inductor pipeline on this kernel. Future inductor-default configs would expose it. Just landed PN25 on
On PN19 ≠ H100 ergonomics: agreed, will flag in CLIFFS.md that the 200-500 MiB win is H100-specific. We saw similar non-transfer on P104 L2 persistence (regressed -16.2% on 32+ layer KV >> L2 setups). Generic allocator hints don't survive class jumps.
On Cliff 8 hardening:
On backports: your priorities track mine:
Numbers from your apples-to-apples 27B+TQ k8v4 bench will be very useful, especially if the 3090 lands within 5% of our A5000 reference. Hard data on that gap is one of the things I haven't been able to gather on my own rig. ⭐ much appreciated, and thanks for keeping the cross-rig pipeline honest.
Thanks for the pin clarity — clear gate criteria (rebuilt + validator + reproducer + tools-API regression check) is exactly what makes the SHA actionable rather than aspirational. Will track your README for future pin moves. v0.20 result on our 3090 / Qwen3.6-27B + TQ3 + MTP K=3 config — different outcome from your A5000 fleet. Boot was clean (all v7.64 patches apply natively, including PN25 sister-pair), but engine crashed during MTP draft proposal at long prefill: Stack: Root-caused to vllm#39226 — the strict Likely a config-shape mismatch between our setups (yours probably hits TQ decode during profile_run via different draft/cache geometry, ours doesn't — guessing TP=1 + INT4 group_size=128 Marlin + MTP K=3 vs A5000 TP=N + Lorbus). Tracking on a separate On PN25 + P38 — both extremely relevant for our cliff investigation. Detailed reply on club-3090#16 but short version: PN25 is independent convergence on the same fix we just shipped locally; we'll plumb your Will also surface the H100-vs-Ampere PN19 footnote in our CLIFFS.md so users on consumer 3090 don't re-discover the negative. ⭐ — and "Speed without correctness is a regression" should be the tagline for this whole project tbh. |
27B + TQ k8v4 dual-3090 bench + CONFIGS.md feedback (closing out the asks from your v7.64 ship post).
1. TQ k8v4 dual-3090 bench
Our compose: 2× RTX 3090 24 GB PCIe (no NVLink), TP=2, AutoRound INT4, vLLM
Comparing to your A5000 reference (
Most likely contributors to the gap:
Side-by-side memory profile (might be useful for triage): at 0.90 mem-util we sit at 20.4 GB/24 GB per card, leaving ~3.5 GB headroom on each, which is plenty of activation room. So the gap isn't OOM-pressure; it's pure throughput.
Want me to A/B with the full env-var set? Happy to run if useful; it would isolate whether the 13% is env-vars or pin/hardware.
2. CONFIGS.md walkthrough feedback
Walked through the doc end-to-end while the bench booted. Strong overall: quick decision tree at the top, "5 things to write down" before editing, per-bucket patch lists with 1-line "what does it do" each, and the worked Llama-3 70B example at the end ("generic patches work outside Qwen") all hit the target. Friction points that surfaced when I tried to mentally re-execute it for our 27B/3090/Docker setup:
Smallest single-change with biggest impact: fix script naming (#1). That's the first thing every new operator runs into and the doc directly disagrees with Both items closed out. Backport priorities + ⭐ already in our previous reply. P38 silent-no-op trace filed separately as genesis-vllm-patches#14. |
Hey @noonghunna — thanks for the dual-3090 bench data + CONFIGS feedback. Gonna address both halves carefully so the facts stay grounded. 1. The 13% gap on TQ k8v4Your gut is right that the env-var subset is the dominant contributor. Yes please run the A/B with the full set — this is the cleanest data point we can get for the doc. One factual correction first: P82 is actually OFF in our 27B PROD launch script, not on. Verifiable in The patches that are PROD-on for the 27B TQ k8v4 path (verified from current launch script): P82 stays OFF; P78 stays OFF. (source-of-truth file — copy from line 36–53.) On your three contributing factors:
So at full env-var set + same pin: expected ~5-8% residual gap at most, closer to your hardware-only floor. 2. CONFIGS.md feedback — all 8 friction points addressedEvery one was actionable. Pushed fixes to dev in #1 Script naming mismatch. Fixed. Updated table to reference real files ( #2 Docker compose path invisible. Added new Step 2b — Docker compose mirror with worked compose snippet (~25 env vars from #3 TQ k8v4 deps scattered + P4 has no description. P4 description was actually present at line 243 — "P4 — required, removes hybrid TQ rejection" — but you're right it was hard to find. Added two consolidated copy-paste blocks: "Required for boot" (P4 + P67 + P98 + P101 + PN8) and "Recommended PROD additions" (~25 env vars). The required block is now the first thing readers see in the TQ k8v4 section. #4 API key repo-baked. Added explicit fallback note in Step 5's smoke test: "If you launched without #5 #6 Spec-decode trio without gating signals. Step 3's spec-decode block now back-links Step 1 §4 for each method capability check (ngram always works / MTP needs #7 DFlash on hybrid models — PR #40898 caveat. Added 1-line caveat next to the DFlash spec-decode option pointing at vllm#40898 OPEN status + Genesis PN21 partial backport state. ~25% acceptance-length gap acknowledged until upstream merges. #8 Step 7 — submit-back format spec. Step 7 already links Smallest-impact thanks for the prioritization — script naming was indeed the biggest first-touch friction and got fixed first. 3. Heads-up — your issues #14 + #15 are landed on
A/B bench results — full env-var set on TQ k8v4 dual-3090. ⭐ Following your reply: bumped Genesis pin to dev tip ( Bench (n=5 measured, 3 warm, scripts/bench.sh canonical narrative + code prompts):
Headline: code wall_TPS 116.59 on dual 3090, +30.7% over your A5000 89.23 reference. Narrative 92.12 lands at the bottom edge of your 95-100 target band (likely the bench prompt class — narrative has more variable acceptance length, code is repetitive enough for MTP to amortize hard). Variance analysis on the +50% jump (full vs subset): The patches that were absent from our earlier subset and present now:
If you want a per-patch ablation, I can run a few targeted A/Bs (e.g. Notable: PN26b's "first sparse-V kernel deployed for SM86 (Ampere consumer)" log line is correct — Ampere consumer users now have a path to sparse-V tile-skip that doesn't exist anywhere upstream. That alone is a sizable contribution for the SM86 fleet beyond just our rig. P38B + P15B both apply cleanly on our v0.20-blocked config → boot clean, sustained workload clean, no observable regression vs the pre-fix state. Pin migration plan unblocked. With v7.65 carrying P38B + P15B + PN26b + PN25 + Cliff 8 hardening + P68/P69 threshold default, master can move to v0.20.1rc1.dev16 + Genesis v7.65 in one PR. Holding for v7.65 release tag — happy to test against any RC you cut. CONFIGS.md fixes look great — pulled Update for the bare-metal launch header would be |
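For anyone re-deriving the headline percentage from the raw wall_TPS figures, a minimal Python sketch follows. The helper name `relative_gain_pct` is made up for illustration; the two inputs are the 116.59 (dual 3090, code prompts) and 89.23 (A5000 reference) values quoted in the post above.

```python
def relative_gain_pct(value: float, reference: float) -> float:
    """Percentage by which `value` exceeds `reference` (negative if lower)."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    return (value / reference - 1.0) * 100.0


# wall_TPS figures quoted in the post above.
dual_3090_code_tps = 116.59
a5000_reference_tps = 89.23

gain = relative_gain_pct(dual_3090_code_tps, a5000_reference_tps)
print(f"{gain:+.1f}%")  # prints +30.7%
```

The same helper applies to any other pair of bench numbers, as long as both come from the same prompt class and the same measurement window.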
Thank you for the tests and data. They are extremely important to me and help make the project better and of higher quality.
I haven't finished with the new PN26b sparse-V kernel yet; in fact, I've been working on it for the last 6 hours while also fixing bugs. I read all the comments and use a bot to track all repository activity, so I instantly see bug reports and suggestions from the community. I try to implement them as long as they don't distract too much from the project's main direction.
For the next 2-3 days, I don't plan on pushing anything to the main branch. Everything will go to dev for now. Once both you and I are confident that everything is solid, I will merge the updates into main. This is how our workflow will operate moving forward:
- dev: for testing and new features
- main: for stable releases
I apologize that I don't always reply. There's just not enough time for everything, so I dedicate most of it to the project and other personal priorities. If my lack of responses comes across as rude at times, please forgive me; that is absolutely not my intention.
Have a great time of day, everyone (whether it's morning, afternoon, or late evening like it is for me right now). I try to hear everyone out, though it isn't always possible since we all have different perspectives on certain things. But that doesn't stop us from doing good and creating something valuable for all of us. Wishing everyone peace and a clear sky!
Hey @noonghunna and everyone — substantial update. Pushed v7.66 to dev (commits 1304c56..fc89395) and live-validated on 4 model configs on our 2× A5000 rig. Boot-tested all of them end-to-end against the actual vllm install, not just unit tests — sanity check after caught two real bugs that pytest missed. The patches, by statusPN33 — root-cause spec-decode warmup fix (DEFAULT ON) Backport of vllm-project/vllm#37521 (itailang) but EXTENDED beyond its Default ON when spec-decode is active. Disable via Live-verified: PN33 marker present in patched PN25 + P7b v7.66 — direct_register_custom_op refactor Switched Live-verified: PN25 v7.67 — REJECTED on live test Tried Stack showed Dynamo tracing INTO SGLang's working PN32 — audit only, no code change Confirmed Live-validation matrix
All 4 configs: PN33 patch APPLY (verified in live The 27B INT4 + DFlash drafter result (129.3 TPS on 2× A5000) lines up well against your published 78 narr / 128 code TPS on 2× 3090 — same drafter recipe, similar consumer Ampere. Known sharp edges
What I'd love help with
Honest note: stepping back for a few days
Pretty wrung out from the last week. Going to read what people post but reply windows might be slower for 2-3 days. Keep the data coming whenever it shows up; every bench result and config detail matters. Your dual-3090 wall_TPS 116.59 number (+30.7% over A5000 reference) is exactly the kind of validation that justifies the effort. Thanks for the patient bug-hunting. Wishing peace and a clear sky to everyone.
Sander, Ukraine, Odessa
@noonghunna and the @ChatGPT/Codex CLI team — thank you. Big update. Pulled all three of your v7.66 cross-rig findings into Genesis directly as v7.68 (commit ab3f5ce on dev). Boot-validated on our 27B INT4 + TQ k8v4 + MTP K=3 + TP=2 PROD; ready for your 1×3090 + TP=1 retest whenever you have a window. What landed in Genesis directlyPN30 v7.68 — dst-shaped temp (your Your diagnosis was correct end-to-end. v7.65 Ported your fix as PN30 part3 patching Plus part1 (the old compact path) is now fail-closed RuntimeError so if anything ever reaches it we crash explicitly rather than silently corrupt. Reuses your existing PN25 v7.68 — import-time registration (your Your insight about activation.py module-import timing is the key — vLLM imports activation.py during model construction in each spawned worker, BEFORE profile_run enters aot_compile_fullgraph. So registration runs in eager Python, never inside a Dynamo trace. v7.66's Ported your
Same pattern extended preventively to P7b ( PN34 (NEW) — runtime workspace lock relaxation (your Your Default OFF (it's relaxing a strict-debug assertion, so explicit opt-in via Why I missed all three the first time — honest answerI tested patch application (does the text-patch land cleanly), not patch correctness against the bug-triggering workload. Specifically:
The system fix is on me: I'm setting up another rig with one consumer card next week to actually run your reproducers locally. No more arguing for workarounds when the right answer is to test against the actual bug surface. Server validation (post-backport)27B INT4 + TQ k8v4 + MTP K=3 + TP=2:
What I'd love help with nextWhen you have a window:
If anything regresses I'd rather hear about it within a day than have you carry sidecars forever. On the next-week test rigSetting up a second box with a single A5000 (24 GB, SM86 Ampere consumer — same memory budget + same compute capability as the 3090 you're testing on) to actually run your reproducers locally instead of asking you to do all the cross-rig validation. Specifically:
Not a substitute for your cross-rig data on the 3090s themselves, but should mean fewer "works on TP=2, breaks on TP=1" round-trips through your bug filings — A5000 single-card hits the same TP=1 spawn config + 24 GB activation budget that triggered all three of the bugs you found. — Sander, Ukraine, Odessa |
Quick follow-up — ran two static-analysis audits today (Gemini + ChatGPT/Codex CLI) on the genesis-vllm-patches tree to catch latent issues that pytest + live-boot couldn't. Closing the loop on what they surfaced. Real bugs caught + fixedG-001 (Codex, Critical) — G-002 (Codex, High) — G-003 + G-004 (Codex, High×2) — G-006 (Codex, Medium) — G-007 (Codex, Medium) — G-008 (Codex, Medium) — 7 env-var references in PATCHES.md / INSTALL.md didn't match the actual P103 latent NameError (separate Gemini audit) — Plus cleanup passThe same audits flagged G-005 (streaming docs lying about SSE replay), G-009 (PATCHES.md P72 row truncated), G-010 (rig-specific paths in scripts — partially closed with env-var override + README rationale), G-011 ( Numbers
Honest note
The two latent bugs that hurt most (G-001 conservative apply override; P103 silent Cliff 2 skip) are exactly the class that pytest + live-boot don't catch: boot doesn't trigger the rare exception path, and PROD continuous batching never crosses the chunked-prefill threshold. Static analysis (ruff F821, name resolution) found them in 30 seconds. Going to bake static analysis into a pre-commit hook so this is automated going forward.
Stepping away for the rest of today; eyes are tired. Will read what comes in but reply windows likely tomorrow. Whatever cross-rig data you turn around on PN30 v7.68 / PN25 v7.68 / PN34 will be valuable whenever it lands.
Sander, Ukraine, Odessa
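To illustrate why this class of bug slips past boot tests, here is a small stdlib-only Python sketch. The `apply_patch` function and the missing `logger` are hypothetical stand-ins, not the actual G-001 or P103 code, and `undefined_names` is only a rough approximation of an undefined-name check (it ignores scoping entirely), not ruff's F821 implementation.

```python
import ast
import builtins

# Hypothetical patch helper with a latent bug on its rarely taken error path.
SOURCE = '''
def apply_patch(text, marker):
    try:
        return text.index(marker)
    except ValueError:
        # Rare path: `logger` was never imported or defined anywhere.
        logger.warning("marker missing")
        return -1
'''


def undefined_names(source: str) -> list:
    """Names that are read somewhere but never bound (scope-insensitive)."""
    tree = ast.parse(source)
    defined = set(dir(builtins))
    loads = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                loads.append(node.id)
    return sorted({name for name in loads if name not in defined})


namespace = {}
exec(SOURCE, namespace)

# The happy path works, so a boot test that never misses a marker stays green.
assert namespace["apply_patch"]("abc", "b") == 1

# The static pass flags the problem without executing the error path.
print(undefined_names(SOURCE))
```

A real pre-commit hook would call an actual linter rather than this toy walker; the sketch only shows that the defect is visible in the syntax tree long before any workload reaches the failing branch.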
@Sandermage — pulled v7.68 dev tip ( Three findings worth flagging before you cut v7.69. TL;DR: PN25 v7.68 ✅ + PN34 ✅. PN30 v7.68 ❌, P103 ❌, PN32 ❌ on TP=1 + 24GB. ✅ PN25 v7.68 — works clean on TP=1
✅ PN34 — works clean (but default OFF caught us)
❌ PN30 v7.68 — drift-marker false-positive breaks the patchYour part3 has apply_all then escalates to FAILED → vLLM aborts. Fix: change part3's drift marker to something part3-specific, e.g. ❌ P103 — wrap reports "rebound at 0 caller sites"Confirmed broken: probe 7 (60K) hit Cliff 2 OOM and the trace went straight through ❌ PN32 alone doesn't close Cliff 2 on TP=1 + 24GBAfter enabling PN32 + (broken) P103 with PN30 disabled (workaround for Finding 1) → Cliff 2 fired EARLIER than v7.66, at a 30K prompt instead of the usual 50-60K. PN32 chunks OOM trace at 30K: Two ways this could land:
The 2×A5000 PROD doesn't hit this because TP=2 splits the GDN forward state across ranks; on TP=1 the full state lands on one card. What we didStayed on master (Genesis v7.66 Happy to share full boot logs, dispatcher matrix, and OOM tracebacks for any of these — let me know which would be most useful for v7.69 triage. |
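The PN30 part3 drift-marker false positive reported above can be sketched in a few lines of Python. Everything here is hypothetical: the `TextPatch` class, the marker strings, and the sample source are illustrations, not the Genesis patcher's code. The sketch assumes a patch is skipped whenever one of its drift markers is already present in the target text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextPatch:
    name: str
    anchor: str           # text the patch expects to find
    replacement: str      # text written in place of the anchor
    drift_markers: tuple  # strings whose presence means "do not apply"

    def apply(self, source: str):
        for marker in self.drift_markers:
            if marker in source:
                return source, f"SKIPPED: drift marker {marker!r} present"
        if self.anchor not in source:
            return source, "FAILED: anchor not found"
        return source.replace(self.anchor, self.replacement, 1), "APPLIED"


ORIGINAL = "buf = compact(kv)\nout = decode(buf)\n"

part2 = TextPatch("part2", "buf = compact(kv)",
                  "buf = genesis_temp(kv)", ("genesis_temp",))

# Too generic: this marker also matches the text that part2 just wrote.
part3_generic = TextPatch("part3", "out = decode(buf)",
                          "out = genesis_decode(buf)", ("genesis_",))

# Specific to part3's own replacement, so sibling patches cannot trip it.
part3_specific = TextPatch("part3", "out = decode(buf)",
                           "out = genesis_decode(buf)", ("genesis_decode",))

after_part2, status_part2 = part2.apply(ORIGINAL)
_, status_generic = part3_generic.apply(after_part2)
after_part3, status_specific = part3_specific.apply(after_part2)

print(status_part2)
print(status_generic)
print(status_specific)
```

The design point is that a drift marker should be unique to the patch that owns it; any string shared with a sibling patch's output turns patch ordering into a correctness dependency.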
@noonghunna — pulled all three findings into v7.69 on dev (commit F1 — PN30 part3 drift-marker false-positive ✅ FIXEDYour diagnosis was exact. part2 (separate Tightened part3's drift markers to F2 — P103 setattr lost on
If this version passes validation and doesn't crash on your end, I will merge it into the main branch and we can lock in the first stable release :). It would be incredibly helpful if people could keep sharing their testing data... |
@Sandermage — first off, v7.69 turnaround speed is something else. F1 + F2 + F3 all rooted causes diagnosed correctly in your replies, 18 new tests, dispatcher composition matrix for ✅ F1 (PN30 part3 drift-marker) — confirmed workingDS layout active throughout. PN30 v7.68 part1+2+3 all APPLY clean, no ✅ F2 (P103 self-install hook in chunk.py) — confirmed firing on TP=1Trace at runtime hits
| T value | Invocations | Notes |
|---|---|---|
| 4128 | 394 | vLLM's chunked-prefill chunk size (capped by max_num_batched_tokens=4128) |
| 64 | 48 | cudagraph warmup or MTP verify path |
| > 4128 | 0 | Never seen |
q.shape[0] = 1 always. cu_shape = torch.Size([2]) always. So _single_seq_cu = True and _true_varlen_multi_seq = False for every invocation.
The P103 chunked path never engages because q.shape[1] <= _MAX_T (4128 ≤ 16384) is always true on real serving — vLLM's outer chunked-prefill is already capping T at 4128, well below MAX_T. PN32 v2's outer-level chunking has the same effect (chunk size 8192, threshold 16384). Neither closure mechanism fires.
We tested forcing it via GENESIS_FLA_FWD_H_MAX_T=2048. Chunked path engaged, per-call allocation halved (50→24 MiB), but cumulative state grew slightly and OOM fired earlier in absolute call count. Confirmed Codex's hypothesis: the issue isn't a single allocation that needs splitting; it's accumulated activation residency that the 50 MiB late-stage allocation can't fit into.
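To make the "never engages" observation concrete, here is a rough Python sketch of how an outer chunked-prefill cap bounds the T that the inner threshold ever sees. The function names and the 60,000-token prompt length are illustrative assumptions, and the chunking model is simplified (one sequence, no decode tokens sharing the batch). The 4128 cap, the 16384 threshold, and the forced 2048 value are taken from the findings above.

```python
import math


def prefill_chunk_lengths(prompt_tokens: int, max_num_batched_tokens: int) -> list:
    """Token count of each outer prefill chunk for a single prompt."""
    if prompt_tokens <= 0 or max_num_batched_tokens <= 0:
        raise ValueError("both arguments must be positive")
    n_chunks = math.ceil(prompt_tokens / max_num_batched_tokens)
    lengths = [max_num_batched_tokens] * (n_chunks - 1)
    lengths.append(prompt_tokens - max_num_batched_tokens * (n_chunks - 1))
    return lengths


def inner_chunking_engages(chunk_lengths, max_t: int) -> bool:
    """True if any chunk is long enough to cross the inner threshold."""
    return any(t > max_t for t in chunk_lengths)


chunks = prefill_chunk_lengths(60_000, 4128)  # assumed prompt length

print(max(chunks))                            # largest T the kernel sees
print(inner_chunking_engages(chunks, 16384))  # default threshold
print(inner_chunking_engages(chunks, 2048))   # forced lower threshold
```

Under these assumptions the largest chunk equals the outer cap, so the default threshold is never crossed, while a threshold below the cap makes the inner chunked path engage on every full-size chunk.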
The actual closure: vllm#35975 backport + mem-util tuning + bisect data
ChatGPT/Codex round 2 diagnosed it as headroom rather than gate logic, and called out vllm#35975 (open) as a directly-relevant fix — skips inputs_embeds GPU buffer for text-only models, claims ~64 MiB savings.
We backported it locally as a setup-time text-patch. Combined with mem-util tuning, the matrix:
| Config | Boot resident | 60K MTP-on | Wall | Notes |
|---|---|---|---|---|
| 0.95 (baseline) | 23,164 MiB | ❌ OOM 50/24.5 free | n/a | Cliff 2 fires |
| 0.95 + #35975 | 22,720 MiB | ❌ OOM 50/46.5 free | n/a | #35975 freed 444 MiB at boot, only 22 MiB margin at peak |
| 0.92 + #35975 | 21,980 MiB | ✅ HTTP 200 | 689s | Cliff 2 closed. ~580 MiB end-of-run margin, AL=4.00 |
| 0.93 + #35975 | 22,260 MiB | ✅ HTTP 200 | 623s | Cliff 2 closed. ~494 MiB margin. Best balanced point |
Plus MTP-off + 0.95: 60K passes in 504s with full 5+ GiB KV pool — different shipping variant for users who want max KV pool.
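The arithmetic behind the matrix can be checked with a short Python sketch. It assumes that "OOM 50/24.5 free" means a 50 MiB allocation request against 24.5 MiB free, which is my reading of the table and not something the logs shown here confirm; the `Probe` class is a hypothetical container for the table rows.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Probe:
    label: str
    boot_resident_mib: int
    oom_request_mib: Optional[float] = None  # None means the 60K probe passed
    oom_free_mib: Optional[float] = None

    def shortfall_mib(self) -> Optional[float]:
        """MiB missing at the failing allocation, or None if the probe passed."""
        if self.oom_request_mib is None or self.oom_free_mib is None:
            return None
        return self.oom_request_mib - self.oom_free_mib


baseline = Probe("0.95 (baseline)", 23164, 50.0, 24.5)
with_fix = Probe("0.95 + #35975", 22720, 50.0, 46.5)
util_092 = Probe("0.92 + #35975", 21980)
util_093 = Probe("0.93 + #35975", 22260)

boot_saving = baseline.boot_resident_mib - with_fix.boot_resident_mib
peak_gain = with_fix.oom_free_mib - baseline.oom_free_mib

print(boot_saving)               # MiB freed at boot by the backport
print(peak_gain)                 # extra MiB free at the failing allocation
print(baseline.shortfall_mib())  # MiB short before the backport
print(with_fix.shortfall_mib())  # MiB short after the backport
for probe in (util_092, util_093):
    print(probe.label, with_fix.boot_resident_mib - probe.boot_resident_mib)
```

Read this way, the backport alone narrows the shortfall to a few MiB, and lowering mem-util releases several hundred MiB of boot residency on top, which is consistent with the two lower settings passing.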
Per Codex's framing post-bisect, three explicit variants to ship:
- Balanced MTP (long-text.yml, updated): Genesis v7.69 + your full env bundle + Codex P103 gate fix + #35975 sidecar + 0.93 mem-util + MTP K=3 retained. Cliff 2 closed at 60K with spec-decode acceleration. KV concurrency at 180K: ~1.4x.
- Max-context safety (long-text-no-mtp.yml): Same minus --speculative-config, full 0.95 mem-util. For long-shot RAG/codebase prompts where slow decode is OK in exchange for max KV pool stability.
- Future upstream win: vllm#37429 hybrid Mamba/attention KV cache sizing. If it applies cleanly, could free residency without trading mem-util. Untested, separate branch experiment.
90K probe at 0.93 + max_tokens=1 (prefill-only timing) is in flight as we draft this; result will update the recipe with a confirmed Cliff 2 ceiling figure.
On Codex's P103 gate fix recommendation
We applied the gate change anyway:
-if cu_seqlens is not None or q.shape[1] <= _MAX_T:
+_single_seq_cu = (cu_seqlens is not None and q.shape[0] == 1
+ and cu_seqlens.shape[0] == 2)
+_true_varlen_multi_seq = cu_seqlens is not None and not _single_seq_cu
+if _true_varlen_multi_seq or q.shape[1] <= _MAX_T:
Plus canonicalize cu_seqlens=None inside the chunked path (since [0,T] is dense B=1 semantically). It's the right semantic fix: the previous gate blocked single-seq cu_seqlens unnecessarily, which would matter on configs without vLLM's outer chunked-prefill capping T. Worth shipping in v7.70 even though it's not what closes Cliff 2 on our specific config. Diff in this discussion's attached file (or I can open a PR if you'd prefer).
PN32 v2's analogous gate (multi-seq bypass) has the same property. For users running spec-decode + long single-prompt, the right composition is: chunked-prefill at outer level (4128 cap) + #35975 freeing residency + 0.92 mem-util freeing activation budget. P103's chunked path is a defense-in-depth for cases where T > 16384 reaches FLA directly (synthetic benchmarks, non-chunked-prefill configs).
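The behavioural difference between the two gates can be sketched in plain Python using shapes only, with no tensors. This is my reading of the diff above: it assumes the `if` branch is the path that bypasses chunking, and the function names are invented for the illustration.

```python
from typing import Optional, Tuple


def old_gate_bypasses_chunking(q_shape: Tuple[int, ...],
                               cu_shape: Optional[Tuple[int, ...]],
                               max_t: int) -> bool:
    # Any cu_seqlens at all, or a short T, skips the chunked path.
    return cu_shape is not None or q_shape[1] <= max_t


def new_gate_bypasses_chunking(q_shape: Tuple[int, ...],
                               cu_shape: Optional[Tuple[int, ...]],
                               max_t: int) -> bool:
    # cu_seqlens of shape [2] with batch 1 describes one dense sequence [0, T].
    single_seq_cu = (cu_shape is not None and q_shape[0] == 1
                     and cu_shape[0] == 2)
    true_varlen_multi_seq = cu_shape is not None and not single_seq_cu
    return true_varlen_multi_seq or q_shape[1] <= max_t


MAX_T = 16384

# Single sequence with cu_seqlens and T above the threshold:
# the old gate bypasses chunking, the new gate lets it engage.
long_single = ((1, 20000), (2,))
print(old_gate_bypasses_chunking(*long_single, MAX_T))
print(new_gate_bypasses_chunking(*long_single, MAX_T))

# T = 4128, as observed in real serving: both gates bypass chunking.
served = ((1, 4128), (2,))
print(old_gate_bypasses_chunking(*served, MAX_T))
print(new_gate_bypasses_chunking(*served, MAX_T))
```

The second case is why the gate change is a semantic cleanup and not the Cliff 2 fix on this config: with T capped at 4128 by the outer scheduler, both gates take the same branch.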
Cross-stack signal: PFlash from Luce-Org
@troymroberts surfaced this in club-3090#25 — Sandro Puppo's announcement + the lucebox blog. Not asking you to integrate, but flagging because the architectural overlap with PN26b is interesting:
- PFlash is a long-context prefill accelerator (vs DFlash which is decode). Uses small drafter (Qwen3-0.6B) + block-sparse attention to score token importance, compresses 128K → ~6.5K tokens before target prefill runs.
- Headline claim: TTFT 24.8s vs 257s vanilla llama.cpp at 128K (~10.4× speedup)
- Block-sparse drafter attention on SM86 — the kernel surface is exactly what your PN26b targets. If a community vLLM port of PFlash emerges, your existing sparse-V infrastructure could be directly reusable.
- C++/CUDA only today, lives in lucebox-hub. Tracked in our docs/UPSTREAM.md as a watch entry.
club-3090 plans to explore integration once lucebox-hub server stabilizes (currently has the daemon-mode + greedy-only quirks documented). Mentioning here in case you were unaware of it.
📱 Twitter handle?
Last housekeeping ask: what's your Twitter / X handle? We're starting to do more public posts (rewrote the pinned welcome to "club-3090 is open to all CUDA hardware" yesterday; likely more once v7.69 lands stable + we have NVLink + 4×3090 + modded-3080 cross-rig data points). We want to credit you properly when posting — your patches are the single biggest reason this stack performs the way it does.
If you'd rather stay off social platforms, that's fine — just say so and we'll credit you as @Sandermage on GitHub instead.
On Blackwell server > consumer
Your reasoning (24/7 reliability, ECC, lower power per useful FLOP) is the right framing for PROD. When the Blackwell server tier comes within reach, the cross-rig story flips — we become your test surface for Ampere consumer, you become the reference for Blackwell datacenter. Useful split.
Wishing peace and clear sky.
Thanks for the insights. I’m also considering integrating Codex into my validation workflow on a permanent basis; it’s a great time-saver for catching things I might have missed or messed up. Regarding PFlash, I’ve looked into it, but it’s not a top priority for now, even though the tech is interesting. We’ll see how it goes—I don’t want to make promises I can’t keep. Right now, my main focus is stabilizing the entire stack; new features and improvements will come after that. I’m currently reworking the structure and developing an installer so that everything can be set up with a single click or command, bringing the whole project into a more logical and streamlined form. I’ll be away from my desk until Monday. Taking a little break to recharge, otherwise my brain is going to go 'boom' :). All fixes and updates will resume on Monday. I’m not very active on Twitter (X) yet—mostly just reading—but you can find me here: X: https://x.com/AleksandrBarzov Instagram: https://www.instagram.com/sander_odessa/ Facebook: https://www.facebook.com/sander.odessa/ I probably need to change my approach to social media, since the patcher and everything I’m doing is primarily aimed at the English-speaking community. Thanks again for the feedback, and have a great weekend! |
I'll write to you as soon as I'm done. This OOM issue really bothered me. To be honest, I wanted to close the subject, since it works on 2 GPUs and I don't see any problems on my end. But then I thought: wait, if I solve the problem on a single card and optimize memory usage so that Qwen3.6 27B and other models of that size can run smoothly on a single 24GB GPU with a decent context window, it will yield the same optimization and savings on 2 GPUs, as well as on cards with 32-48-72-96-142 GB of memory, and so on. So, for now, I've rolled back the repo patcher to before I started experimenting, because the AI that handles my pushes decided to just dump everything in there. Right now, I'm reworking the entire memory management stack. For now, I'm tailoring it to the versions and the type of caching I currently use, and after that I'll expand it based on what actually works out. After that, it will strictly be fixes and tweaks to the current code to reach a truly stable version. Only then will I touch or try anything new, because dealing with memory and VRAM caching really took up my whole day today. I'll try not to drag it out. The current version on the dev branch is working.
The update is delayed, as I am currently doing a comprehensive rebuild of how the patcher handles memory and models. Below is an example of the config for the models. Going forward, the patcher will operate and launch the model using this type of config. Next, as soon as we verify that everything is stable and working as intended, I'll move on to the full integration of gemma 4. Best wishes to everyone. Peace and clear skies.. |
The changes to the patcher are mostly ready. There are a few things left to finish, but primarily, I'm waiting for tomorrow's nightly build, as it will include the necessary bug fixes and improvements. Once it's out, I'll test the patcher on it one more time and update the repository. Memory management has been improved; the next step is to experiment with the utilization value, fine-tuning it within the 0.85-0.92 range. The complete list of new features and changes will be available after the release. |
@Sandermage — appreciate the rolling update. The "rework over hotfix" call really is the right one for this — landing memory mgmt as a unified pass that benefits every VRAM class (24/32/48/72/96/142) is a much bigger win than another single-card sidecar. We're aligned on the "waiting for tomorrow's nightly" cadence on our side too — vLLM PR #41745 (Gemma 4 MTP first-party support) merged today at 14:39 UTC at commit The mem-util range note (0.85-0.92) is useful for us — our current TQ3 composes ship 0.92-0.95 (long-text/long-vision/dual-turbo), so when v7.73.x lands we'll re-validate those defaults against your tuned range. If your rework moves the cliff inward (more conservative mem-util as the new safe default in exchange for a stable 24 GB single-card path), that's a win we'd take in a heartbeat — we've been advising single-card users to switch engines ( Quick progress on our side while you've been heads-downTwo threads worth a heads-up: Gemma 4 spec-decode fully shipped on club-3090: TP=2 numbers landed yesterday + today across both spec-decode methods. MTP (your eventual integration target) gives 109 narr / 142 code TPS, AL ~4.0, soak PASS. DFlash (Z-Lab's block-diffusion drafter via vLLM PR #41703, ChatGPT/Codex-rebased onto current main) gives 95 narr / 168 code TPS, AL 5.23, soak PASS at n=7 (Pareto-optimal vs n=8). Different operating regimes — DFlash wins code +18%, MTP wins narrative +15%. Full bench + n-sweep + comparison at discussion #67 (and the comparison table is the data you'd want as a starting point when you do the Gemma 4 integration on the patcher side). When you get to Gemma 4, the AutoRound INT4 + BF16 drafter combo is what holds together — drafter dtype doesn't have to match target. TQ3 compose hygiene cleanup (#82 by @NHClimber87): one of our composes was missing Standing byThe one-line A/B (relax Telegram noted — will reach out there if anything urgent surfaces during validation. 
Otherwise async on the discussion is fine and matches your work cadence. Peace and clear skies. |
Updates and changes have taken longer than expected. I’m reworking a lot of elements to make them more intuitive and efficient. There are a massive number of changes within the project itself and the supporting utilities to ensure a higher quality and more streamlined user experience. |
@Sandermage — appreciate the rolling status, and good name choice ("Sander Core Engine" / The single-command upgrade flow is going to be a real onboarding win — every new club-3090 user we've helped this week has hit some variant of "where do I We'll re-test our composes against the v7.73.x release (whether labeled v7.73 or as an Tuesday/Wednesday lands clean — no rush, the rework warrants the time. Will mention on disc #67 (the Gemma 4 thread we just updated) when v7.73.x / sndr_core lands so the audience there knows where to look for the next pin bump. 👋 |
To run the new version, the system must have 16+2GB of free RAM! |
@Sandermage — the PN95 tier-aware-cache shape is exactly the right primitive set. Keeping Mamba SSM state on GPU ( Two things from our side that might be worth threading into the 1. vllm#41434 landed 2026-05-08 — eliminates several GPU↔CPU syncs in attention impls. Measuring ~15% vanilla-path TPS lift on Qwen3-Next between pre-#41434 ( 2. Cross-model head-to-head this week — Discussion #119: Gemma 4 31B vs Qwen 3.6 27B on dual 3090 across 6 configs (INT8 PTH / bf16 / TQ3 patch-only / TQ3+MTP Genesis-backed), same vLLM pin + rebench-full harness for legs 1-5, Genesis leg pinned at your allowlist No pressure on the rename/refactor timing — the framework rework is clearly the right architectural move. Happy to fold a |
As soon as I can release the new version, everything will become much simpler... and better. |
It all started as a quick, makeshift project for myself—just tweaking things here and there, and suddenly it was published. |
@Sandermage — 50-60% on a project whose scope has grown like this is genuinely impressive. The pivot from "collection of patches" to "deployment platform with config profiles + a review pipeline for outside contributors" is the architectural shift that separates a hobby fork from infrastructure. Most community patch work caps out at the patch surface; you're past that. Rooting for you. When sndr_core is ready for cross-rig testing, we have the Two upstream-vLLM items we've been tracking — no action needed, just so they're on your radar for the next pin-allowlist refresh:
Ship it when it's right. 🍻 |
Hi everyone, |
Just shipped v7.64. Tried to address everything that came out of cross-rig
work over the last couple weeks, especially the cliffs you have been hitting
on the 3090s.
What is in
Bug fixes that close existing issues:
falls through to the broken upstream path under TQ k8v4 + FULL_AND_PIECEWISE
cudagraph. Tool-call went 0/5 → 7/7 on the 2× A5000 validation. Closes the
GQA-pow-2 compile error class.
silently skipping. PN17 frees 50-100 MiB on long-context FA2 (resolves
Cliff 1 mech A from your diagnosis), PN19 frees 200-500 MiB during model load.
wall TPS) but regress 35B FP8 (−4%). 27B default carries them, 35B default
does not. Documented per-model so nobody auto-enables across configs.
New launch script variants:
6 new docs files:
docs/GLOSSARY.md (terms), docs/HARDWARE.md (VRAM budget + GPU class),
docs/FAQ.md, docs/CONFIGS.md (add-your-own-model walkthrough),
docs/CLIFFS.md (8 cliffs catalogued), CONTRIBUTING.md.
Repo structure cleanup: tried to make navigation obvious so future-me
does not get lost. Doc map in README, per-launch scripts named by KV dtype +
workload.
Asking for
our A5000 numbers (95-100 TPS @ 256-512t). If it does not on 3090, that is
interesting and I want to know.
CONFIGS.md. Was the walkthrough enough to add your own model? What was missing?
PR #40898 (DFlash SWA, +25% acceptance length), PR #39419 (local argmax TP,
+9-30% on TP=2), PR #41306 mitigation (--moe-backend=triton). Which of
these would matter most for your workload?
If something looks off — please tell me. Tests on my side show no problem,
but I am open to being wrong if you have a counter-example.
Cheers and thanks for keeping this thing honest.