Add --stream-layers for streaming weights from CPU during generation #1576
Open
fszontagh wants to merge 3 commits into
Surface area for the unified-streaming design. No behaviour change
when `--stream-layers` is unset: the dispatch in `compute<T>()` (added in
the follow-up commit on `src/ggml_extend.hpp`) short-circuits to the
upstream walker.

- `sd_ctx_params_t::stream_layers` field (`include/stable-diffusion.h`).
- `--stream-layers` boolean CLI flag (`examples/common/common.{h,cpp}`).
- `sd::ggml_graph_cut::SegmentResidency` enum + `Segment::residency`
  field + `annotate_residency` declaration (`src/ggml_graph_cut.h`).
- `sd::layer_registry::Registry` with register/move primitives using
  the proven dup-copy-swap idiom on `tensor->buffer`/`data`/`extra`
  (new `src/layer_registry.{h,cpp}`).
- Conditioner subclasses gain a virtual `set_stream_layers_enabled`
  (`src/conditioner.hpp`). `UpscalerGGML` forwards to its inner runner
  (`src/upscaler.{h,cpp}`).
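The dup-copy-swap idiom named in the registry bullet can be sketched without ggml. This is a minimal illustration under stated assumptions, not the code in `src/layer_registry.cpp`: `FakeBuffer`, `FakeTensor`, `move_tensor`, and `restore_tensor` are hypothetical stand-ins, and the real primitives operate on ggml tensors and backend buffers. What it tries to show is that the tensor object keeps its address, so anything already holding a pointer to it sees the new placement after the swap.

```cpp
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Hypothetical stand-in for a backend buffer (CPU or GPU memory).
struct FakeBuffer {
    std::vector<unsigned char> bytes;
    int device;  // 0 = CPU, 1 = GPU in this sketch
};

// Hypothetical stand-in for the three tensor fields the commit names.
struct FakeTensor {
    FakeBuffer* buffer = nullptr;
    void* data = nullptr;
    void* extra = nullptr;
    std::size_t nbytes = 0;
};

// Moves `t` into `dst` at `offset` and returns a shadow tensor that
// remembers the old placement. The caller must ensure that `dst` has
// at least `offset + t.nbytes` bytes available.
FakeTensor move_tensor(FakeTensor& t, FakeBuffer& dst, std::size_t offset) {
    // 1. dup: a shadow tensor that points at the destination storage.
    FakeTensor dup = t;
    dup.buffer = &dst;
    dup.data = dst.bytes.data() + offset;
    dup.extra = nullptr;

    // 2. copy: bring the payload over to the destination.
    std::memcpy(dup.data, t.data, t.nbytes);

    // 3. swap: `t` keeps its address but now points at the destination,
    //    while `dup` ends up holding the original placement.
    std::swap(t.buffer, dup.buffer);
    std::swap(t.data, dup.data);
    std::swap(t.extra, dup.extra);
    return dup;
}

// Undoes move_tensor by swapping the saved placement back into `t`.
void restore_tensor(FakeTensor& t, FakeTensor& saved) {
    std::swap(t.buffer, saved.buffer);
    std::swap(t.data, saved.data);
    std::swap(t.extra, saved.extra);
}
```

In this sketch the value returned by `move_tensor` plays the role of one offload pair: keeping it around is what makes the later restore possible, which is presumably why the PR takes care to restore before buffers are freed.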
Builds on the foundation commit (--stream-layers + planner annotation + executor scaffolding) with:

## chunk-K residency

A parallel `resident_*` offload track on `GGMLRunner` keeps a fraction of the diffusion model's params on GPU permanently across sampling steps, amortising H2D over many invocations.

- Members: `resident_offload_ctx`, `resident_offload_pairs`, `resident_runtime_params_buffer`, `resident_param_set`, `resident_state_token` (parallel to the existing `partial_offload_*` per-segment track).
- `offload_resident_params(tensors)` / `restore_resident_params()` use the same dup-copy-swap idiom as `offload_partial_params` but write to the resident slot and persist across `compute()` calls.
- `offload_partial_params` filters tensors already in `resident_param_set` so per-segment offload skips them. `restore_resident_params` is hooked into `~GGMLRunner()` and `free_params_buffer()` to keep swap pointers valid through teardown.
- `compute_streaming_segments<T>` reads `graph_cut_plan_cache_.graph_cut_plan` (the unmerged base plan), annotates it, gathers the union of RESIDENT segments' param tensors, and offloads them once. Compute itself proceeds on the merged plan for fused-graph efficiency. A commutative pointer-hash state-token detects when a different plan is in play and rebuilds the resident set.

`annotate_residency` updates:

- "Any param-bearing segment exists" sanity replaces the `segments[0].input_param_bytes == 0` early-return (wrong for diffusion models whose first segment is a small prelude).
- Greedy cumulative-bytes loop handles heterogeneous segment sizes (small prelude + large transformer layers).
- Resets `seg.residency = STREAMED` at entry so cached plans don't carry forward stale RESIDENT marks from a previous larger-budget call.
- Don't reserve a `prefetch_segments * largest_segment` window; async prefetch is no longer used (see below).

## Multi-runner safety

- Per-runner free-VRAM clamp at compute time in `resolve_graph_cut_plan`. Each runner queries `ggml_backend_dev_memory(runtime_backend)` and clamps `effective_budget = min(max_vram, free - 512 MB)` per call. Without this, after the LLM committed ~7 GB chunk-K resident the diffusion runner still believed it had the whole budget and OOM'd.
- `--stream-layers` is restricted to diffusion runners only (diffusion_model + high_noise_diffusion_model). This matches PR leejet#1477's scope and avoids one-shot runners (LLM, VAE, clip_vision, upscaler) claiming permanent chunk-K state that starves the diffusion model.
- `GGMLRunner::release_streaming_residency()` (public trampoline to `restore_resident_params`) is called from `decode_first_stage()` on diffusion_model + high_noise_diffusion_model right before VAE decode. Without it the 6.5 GB chunk-K residency from sampling would starve VAE's compute buffer (~4.5 GB at full image resolution) and OOM at decode.

## `LORA_APPLY_AUTO` picks runtime when streaming or CPU-offload is on

Immediate mode bakes LoRA into weights at load time by running a forward pass over every weight tensor. That allocates a full-model-size (~11 GB on Z-Image bf16) compute buffer on the runtime backend in one shot and OOMs on any VRAM-constrained setup, which is the whole reason `--stream-layers` / `--offload-to-cpu` exist. AUTO previously only picked runtime for quantized models; now `stream_layers || offload_params_to_cpu` is also a trigger.

## Conservative streaming v4 (current shipping configuration)

After observing edge-case failures with prefetch + multi-LoRA / + non-default --guidance/--flow-shift / + batch_count > 1, the final configuration is:

- chunk-K residency skips itself when `weight_adapter != nullptr`. The state-token hashes tensor *pointers*, not data, so it can't detect MultiLoraAdapter modifications across batch images / steps; the symptom was colored static noise on batch image 2+.
- Async prefetch is hard-disabled (`prefetch_enabled = false`). The `compute_streaming_segments_prefetch<T>` implementation stays in the file for future reference but is currently dead. Two correctness problems forced the disable:
  * Multi-LoRA workloads: graph_compute_async + per-segment pending offload races MultiLoraAdapter's per-layer patch_weight reads.
  * batch_count > 1 + non-default --guidance + --flow-shift: the smaller merged segments required to fit two prefetched buffers in --max-vram accumulate FP error across the extra boundary-cache roundtrips → collapses to pure white frames.
- `resolve_graph_cut_plan` always passes the full `effective_budget` to the planner (no `/4` shrinking). Produces the upstream walker's large merged segments, the validated configuration.
- The chunk-K hook reserves room for the LARGEST merged segment's params: `chunk_k_budget = max_graph_vram_bytes - largest_merged_segment`. Without this reservation chunk-K could grow large enough that the active merged-segment offload OOMs.

## Verification

- Z-Image Q8 with `--stream-layers off`: byte-identical to upstream walker.
- Z-Image bf16 1024x688 + 2 LoRAs + `--cfg-scale 1.0 --guidance 3.5 --flow-shift 3.0` + batch_count=2 (the production REST API failure mode that motivated the correctness work): both batch images clean and distinct.
- Smoke-matrix-verified across Z-Image Q8/bf16, HiDream, Qwen, Flux schnell, SD3.5, SDXL, Anima, WAN.
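The budget arithmetic and the greedy residency pass described in this commit message can be sketched in isolation. This is a minimal sketch under stated assumptions, not the PR's code: the `Segment` struct is reduced to two fields, `effective_budget` and `chunk_k_budget` are hypothetical helper names for the two formulas quoted above, 512 MB is read as 512 MiB, and the greedy loop is assumed to mark a prefix of param-bearing segments and stop at the first one that does not fit.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

enum class SegmentResidency { STREAMED, RESIDENT };

// Reduced stand-in for a planner segment (assumption: only these
// two fields matter for the residency decision).
struct Segment {
    std::uint64_t input_param_bytes = 0;
    SegmentResidency residency = SegmentResidency::STREAMED;
};

constexpr std::uint64_t MiB = 1024ull * 1024ull;

// effective_budget = min(max_vram, free - 512 MB), saturating at zero
// so a nearly full device cannot underflow the unsigned subtraction.
std::uint64_t effective_budget(std::uint64_t max_vram_bytes,
                               std::uint64_t free_bytes) {
    const std::uint64_t headroom = 512 * MiB;
    const std::uint64_t usable = free_bytes > headroom ? free_bytes - headroom : 0;
    return std::min(max_vram_bytes, usable);
}

// chunk_k_budget = budget - largest merged segment, so the segment
// currently being streamed always has room next to the resident set.
std::uint64_t chunk_k_budget(std::uint64_t budget_bytes,
                             std::uint64_t largest_merged_segment_bytes) {
    return budget_bytes > largest_merged_segment_bytes
               ? budget_bytes - largest_merged_segment_bytes
               : 0;
}

// Greedy cumulative-bytes pass. Returns the bytes marked RESIDENT.
std::uint64_t annotate_residency(std::vector<Segment>& segments,
                                 std::uint64_t budget_bytes) {
    bool any_params = false;
    for (Segment& seg : segments) {
        // Reset first so a cached plan carries no stale RESIDENT marks.
        seg.residency = SegmentResidency::STREAMED;
        any_params = any_params || seg.input_param_bytes > 0;
    }
    if (!any_params) {
        return 0;  // nothing to keep resident
    }

    std::uint64_t used = 0;
    for (Segment& seg : segments) {
        if (seg.input_param_bytes == 0) {
            continue;  // param-free segments have nothing to pin
        }
        if (used + seg.input_param_bytes > budget_bytes) {
            break;  // assumed prefix policy: stop at the first misfit
        }
        seg.residency = SegmentResidency::RESIDENT;
        used += seg.input_param_bytes;
    }
    return used;
}
```

As a worked case: with an 8192 MiB `--max-vram` and 1536 MiB free, the clamp gives 1024 MiB; reserving a 300 MiB largest merged segment leaves 724 MiB for chunk-K; and a plan of one 10 MiB prelude followed by three 300 MiB layers pins the prelude and the first two layers (610 MiB), leaving the third layer streamed.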
Summary
This is the successor of #1477. That earlier PR did the same thing (stream model weights from CPU to GPU so larger models fit), but it ran as a parallel system alongside the existing graph-cut planner (#1476) and exposed many user-facing flags. Both points came up in the review.
This PR rebuilds the feature on top of the graph-cut planner instead of running alongside it. There is one new boolean flag, `--stream-layers`. When it is off, behavior is byte-identical to upstream master.

The change is split into two commits:

- `--stream-layers` flag, a `SegmentResidency` annotation pass on the existing planner, and a small layer registry used by the runner. No behavior change when the flag is off.
- [...] `weight_adapter` is attached, prefer the runtime LoRA mode when streaming is on because immediate mode OOMs).

Usage
Streaming kicks in when both `--max-vram` (or its auto sentinel `-1`) and `--stream-layers` are set. Weights need a place to stream from, so `--offload-to-cpu` is implicitly enabled if you forget it (with a log line).

`--max-vram -1` auto-detects free VRAM and reserves 1 GiB headroom. Pass a positive value (e.g. `--max-vram 9`) to set the budget explicitly.

Tested
- `--max-vram 4 --offload-to-cpu --stream-layers`: Z-Image Q8 and bf16, HiDream, Flux schnell, SD3.5 large, SDXL, WAN 2.2 5B. All generated valid output, no `cudaMalloc` failures.
- `--stream-layers` off: byte-identical PNG to upstream walker.
- [...] `--cfg-scale 1.0 --guidance 3.5 --flow-shift 3.0`. Both batch images come out correct.

Performance
The async-prefetch path that was in earlier iterations of this branch is intentionally disabled in this PR because of correctness regressions it caused with runtime LoRA and with `batch_count > 1` combined with non-default `--guidance` / `--flow-shift`. The implementation stays in the source for the follow-up to build on, but the engagement gate is `false`.

chunk-K residency is still active and saves H2D for the resident segments across all sampling steps. The wallclock benefit varies by model and `--max-vram` budget.

Future work
The headline perf win, keeping the GPU near 100 percent utilization during streaming, needs PR #1477's `chunk_graph.hpp` helper ported on top of this foundation. That caches a fused cgraph for K base layers so the host does not need to issue per-layer kernel launches between H2D copies. It is the planned next PR, written specifically against this branch.

Other items I have queued for follow-ups:

- `--stream-layers diffusion,llm,vae`, once cross-runner residency eviction is in place. The current PR scopes streaming to the diffusion runner only.
- `SDCPP_STREAM_PROFILE` env-var-gated per-stage timing breakdown, useful for tuning.

Checklist