Add `--stream-layers` for streaming weights from CPU during generation by fszontagh · Pull Request #1576 · leejet/stable-diffusion.cpp

fszontagh · 2026-05-29T06:36:24Z

Summary

This is the successor of #1477. That earlier PR did the same thing (stream model weights from CPU to GPU so larger models fit), but it ran as a parallel system alongside the existing graph-cut planner (#1476) and exposed many user-facing flags. Both points came up in the review.

This PR rebuilds the feature on top of the graph-cut planner instead of running alongside it. There is one new boolean flag, --stream-layers. When it is off, behavior is byte-identical to upstream master.

The change is split into two commits:

Foundation: adds the --stream-layers flag, a SegmentResidency annotation pass on the existing planner, and a small layer registry used by the runner. No behavior change when the flag is off.
Operational: chunk-K residency (keep the first K base segments on GPU across sampling steps), multi-runner safety (per-call free-VRAM clamp, restrict streaming to the diffusion runner, release residency before VAE decode), and runtime-LoRA correctness fallbacks (skip chunk-K when a weight_adapter is attached, prefer the runtime LoRA mode when streaming is on because immediate mode OOMs).
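To make the chunk-K idea concrete, here is a minimal sketch of a greedy residency pass under stated assumptions. The `Segment` struct, its `param_bytes` field, and `annotate_residency_sketch` are simplified, hypothetical stand-ins for illustration only; the real `annotate_residency` in src/ggml_graph_cut.h may differ in both signature and policy.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins for the planner types named in this PR.
enum class SegmentResidency { STREAMED, RESIDENT };

struct Segment {
    uint64_t param_bytes = 0;  // bytes of weights this segment reads (assumed field)
    SegmentResidency residency = SegmentResidency::STREAMED;
};

// Marks the leading segments RESIDENT until the byte budget is used up.
// Returns K, the number of segments that were marked RESIDENT.
size_t annotate_residency_sketch(std::vector<Segment>& segments, uint64_t budget_bytes) {
    // Reset first so a cached plan does not keep stale RESIDENT marks
    // from an earlier call that had a larger budget.
    for (Segment& seg : segments) {
        seg.residency = SegmentResidency::STREAMED;
    }

    uint64_t used = 0;
    size_t k = 0;
    for (Segment& seg : segments) {
        // Stop at the first segment that no longer fits; `used` never
        // exceeds `budget_bytes`, so the subtraction cannot underflow.
        if (seg.param_bytes > budget_bytes - used) {
            break;
        }
        used += seg.param_bytes;
        seg.residency = SegmentResidency::RESIDENT;
        ++k;
    }
    return k;
}
```

Accumulating bytes rather than counting segments is what lets a small prelude segment followed by large transformer layers share one budget.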

Usage

Streaming kicks in when both --max-vram (or its auto sentinel -1) and --stream-layers are set. Weights need a place to stream from, so --offload-to-cpu is implicitly enabled if you forget it (with a log line).

./sd-cli \
  --diffusion-model models/z_image_turbo_bf16.safetensors \
  --llm models/Qwen3-4b-Z-Engineer-V2.gguf \
  --vae models/ae.safetensors \
  -p "a cat" --cfg-scale 1.0 -H 1024 -W 1024 --steps 8 \
  --max-vram -1 --offload-to-cpu --stream-layers \
  -o out.png

--max-vram -1 auto-detects free VRAM and reserves 1 GiB headroom. Pass a positive value (e.g. --max-vram 9) to set the budget explicitly.
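The flag interaction described in this section can be sketched roughly as follows. Everything here is an assumption made for illustration: the `StreamOptions` and `StreamBudget` structs, the function name, the use of 0 to mean "--max-vram not set", treating any negative value as the auto sentinel, and reading a positive --max-vram value as GiB. It is not the code in this PR.

```cpp
#include <cstdint>

constexpr uint64_t GiB = 1024ull * 1024ull * 1024ull;

// Hypothetical option bundle; field names are illustrative only.
struct StreamOptions {
    double max_vram_gib = 0.0;    // assumed: 0 = unset, negative = auto, >0 = explicit
    bool stream_layers = false;   // --stream-layers
    bool offload_to_cpu = false;  // --offload-to-cpu
};

struct StreamBudget {
    bool active = false;        // true when streaming would engage
    uint64_t budget_bytes = 0;  // VRAM budget in bytes when active
};

// Decides whether streaming engages and which VRAM budget applies.
StreamBudget resolve_stream_budget(StreamOptions& opts, uint64_t free_vram_bytes) {
    StreamBudget result;
    // Streaming needs both --stream-layers and a --max-vram setting.
    if (!opts.stream_layers || opts.max_vram_gib == 0.0) {
        return result;
    }
    // Weights need a host-side copy to stream from, so force offload on.
    opts.offload_to_cpu = true;
    result.active = true;
    if (opts.max_vram_gib < 0.0) {
        // Auto mode: free VRAM minus 1 GiB headroom, floored at zero.
        result.budget_bytes = free_vram_bytes > GiB ? free_vram_bytes - GiB : 0;
    } else {
        result.budget_bytes =
            static_cast<uint64_t>(opts.max_vram_gib * static_cast<double>(GiB));
    }
    return result;
}
```

With 12 GiB free, the auto setting in this sketch yields an 11 GiB budget, while an explicit value of 9 yields 9 GiB regardless of what is free.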

Tested

7-architecture smoke matrix at --max-vram 4 --offload-to-cpu --stream-layers: Z-Image Q8 and bf16, HiDream, Flux schnell, SD3.5 large, SDXL, WAN 2.2 5B. All generated valid output with no cudaMalloc failures.
Z-Image Q8 with --stream-layers off: byte-identical PNG to upstream walker.
Multi-LoRA correctness check: Z-Image bf16 1024x688, batch_count=2, runtime LoRA mode with two LoRAs stacked, --cfg-scale 1.0 --guidance 3.5 --flow-shift 3.0. Both batch images come out correct.

Performance

The async-prefetch path from earlier iterations of this branch is intentionally disabled in this PR. It caused correctness regressions with runtime LoRA, and with batch_count > 1 combined with non-default --guidance / --flow-shift. The implementation stays in the source for the follow-up to build on, but the gate that engages it is hard-coded to false.

chunk-K residency is still active and saves the host-to-device (H2D) copies for the resident segments across all sampling steps. The wall-clock benefit varies by model and --max-vram budget.
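As a rough way to reason about the transfer savings, the sketch below estimates total H2D traffic for a sampling run. It is a back-of-the-envelope model, not a measurement: it assumes one model invocation per step, that every streamed byte is re-uploaded on every step, and that resident bytes are uploaded exactly once. The function name is hypothetical.

```cpp
#include <cstdint>

// Estimated host-to-device traffic in bytes over a whole sampling run.
// Assumptions (not from the PR): one model invocation per step, streamed
// weights re-uploaded on each step, resident weights uploaded once.
uint64_t estimate_h2d_bytes(uint64_t total_param_bytes,
                            uint64_t resident_bytes,
                            uint32_t steps) {
    // Resident bytes cannot exceed the model size.
    if (resident_bytes > total_param_bytes) {
        resident_bytes = total_param_bytes;
    }
    const uint64_t streamed_bytes = total_param_bytes - resident_bytes;
    return resident_bytes + streamed_bytes * steps;
}
```

Under this model the saving grows with the step count, since each resident byte avoids one upload on every step after the first. It says nothing about wall-clock time, which also depends on how well copies overlap with compute.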

Future work

The headline perf win, keeping the GPU near 100 percent utilization during streaming, needs PR #1477's chunk_graph.hpp helper ported on top of this foundation. That caches a fused cgraph for K base layers so the host does not need to issue per-layer kernel launches between H2D copies. It is the planned next PR, written specifically against this branch.

Other items I have queued for follow-ups:

Per-module flag form: --stream-layers diffusion,llm,vae, once cross-runner residency eviction is in place. The current PR scopes streaming to the diffusion runner only.
SDCPP_STREAM_PROFILE env-var-gated per-stage timing breakdown, useful for tuning.
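For the per-module flag form listed above, parsing the comma-separated value could look something like this. This is purely a hypothetical sketch of a feature that does not exist yet: the function name, the accepted module names, and the error handling are all assumptions.

```cpp
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical parser for a value such as "diffusion,llm,vae".
// Returns the set of requested modules; throws on an unknown name.
std::set<std::string> parse_stream_modules(const std::string& value) {
    // Assumed module names, taken from the example in the PR text.
    static const std::set<std::string> known = {"diffusion", "llm", "vae"};

    std::set<std::string> modules;
    std::stringstream ss(value);
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (item.empty()) {
            continue;  // tolerate stray commas such as "diffusion,,vae"
        }
        if (known.count(item) == 0) {
            throw std::invalid_argument("unknown module: " + item);
        }
        modules.insert(item);
    }
    return modules;
}
```

Returning a set makes repeated names harmless and keeps later membership checks simple.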

Checklist

I have read and confirmed this PR follows the contribution guidelines.

Surface area for the unified-streaming design. No behaviour change when --stream-layers is unset: the dispatch in compute<T>() (added in the follow-up commit on src/ggml_extend.hpp) short-circuits to the upstream walker.

- `sd_ctx_params_t::stream_layers` field (include/stable-diffusion.h).
- `--stream-layers` boolean CLI flag (examples/common/common.{h,cpp}).
- `sd::ggml_graph_cut::SegmentResidency` enum + Segment::residency field + annotate_residency declaration (src/ggml_graph_cut.h).
- `sd::layer_registry::Registry` with register/move primitives using the proven dup-copy-swap idiom on tensor->buffer/data/extra (new src/layer_registry.{h,cpp}).
- Conditioner subclasses gain a virtual set_stream_layers_enabled (src/conditioner.hpp). UpscalerGGML forwards to its inner runner (src/upscaler.{h,cpp}).

Builds on the foundation commit (--stream-layers + planner annotation + executor scaffolding) with:

## chunk-K residency

A parallel `resident_*` offload track on `GGMLRunner` keeps a fraction of the diffusion model's params on GPU permanently across sampling steps, amortising H2D over many invocations.

- Members: `resident_offload_ctx`, `resident_offload_pairs`, `resident_runtime_params_buffer`, `resident_param_set`, `resident_state_token` (parallel to the existing `partial_offload_*` per-segment track).
- `offload_resident_params(tensors)` / `restore_resident_params()` use the same dup-copy-swap idiom as `offload_partial_params` but write to the resident slot and persist across `compute()` calls.
- `offload_partial_params` filters tensors already in `resident_param_set` so per-segment offload skips them. `restore_resident_params` is hooked into `~GGMLRunner()` and `free_params_buffer()` to keep swap pointers valid through teardown.
- `compute_streaming_segments<T>` reads `graph_cut_plan_cache_.graph_cut_plan` (the unmerged base plan), annotates it, gathers the union of RESIDENT segments' param tensors, and offloads them once. Compute itself proceeds on the merged plan for fused-graph efficiency. A commutative pointer-hash state-token detects when a different plan is in play and rebuilds the resident set.

`annotate_residency` updates:

- "Any param-bearing segment exists" sanity replaces the `segments[0].input_param_bytes == 0` early-return (wrong for diffusion models whose first segment is a small prelude).
- Greedy cumulative-bytes loop handles heterogeneous segment sizes (small prelude + large transformer layers).
- Resets `seg.residency = STREAMED` at entry so cached plans don't carry forward stale RESIDENT marks from a previous larger-budget call.
- Don't reserve a `prefetch_segments * largest_segment` window; async prefetch is no longer used (see below).

## Multi-runner safety

- Per-runner free-VRAM clamp at compute time in `resolve_graph_cut_plan`. Each runner queries `ggml_backend_dev_memory(runtime_backend)` and clamps `effective_budget = min(max_vram, free - 512 MB)` per call. Without this, after the LLM committed ~7 GB chunk-K resident the diffusion runner still believed it had the whole budget and OOM'd.
- `--stream-layers` is restricted to diffusion runners only (diffusion_model + high_noise_diffusion_model). This matches PR leejet#1477's scope and avoids one-shot runners (LLM, VAE, clip_vision, upscaler) claiming permanent chunk-K state that starves the diffusion model.
- `GGMLRunner::release_streaming_residency()` (public trampoline to `restore_resident_params`) is called from `decode_first_stage()` on diffusion_model + high_noise_diffusion_model right before VAE decode. Without it the 6.5 GB chunk-K residency from sampling would starve VAE's compute buffer (~4.5 GB at full image resolution) and OOM at decode.

## `LORA_APPLY_AUTO` picks runtime when streaming or CPU-offload is on

Immediate mode bakes LoRA into weights at load time by running a forward pass over every weight tensor. That allocates a full-model-size (~11 GB on Z-Image bf16) compute buffer on the runtime backend in one shot and OOMs on any VRAM-constrained setup, which is the whole reason `--stream-layers` / `--offload-to-cpu` exist. AUTO previously only picked runtime for quantized models; now `stream_layers || offload_params_to_cpu` is also a trigger.

## Conservative streaming v4 (current shipping configuration)

After observing edge-case failures with prefetch + multi-LoRA / + non-default --guidance/--flow-shift / + batch_count > 1, the final configuration is:

- chunk-K residency skips itself when `weight_adapter != nullptr`. The state-token hashes tensor *pointers*, not data, so it can't detect MultiLoraAdapter modifications across batch images / steps; the symptom was colored static noise on batch image 2+.
- Async prefetch is hard-disabled (`prefetch_enabled = false`). The `compute_streaming_segments_prefetch<T>` implementation stays in the file for future reference but is currently dead. Two correctness problems forced the disable:
  * Multi-LoRA workloads: graph_compute_async + per-segment pending offload races MultiLoraAdapter's per-layer patch_weight reads.
  * batch_count > 1 + non-default --guidance + --flow-shift: the smaller merged segments required to fit two prefetched buffers in --max-vram accumulate FP error across the extra boundary-cache roundtrips → collapses to pure white frames.
- `resolve_graph_cut_plan` always passes the full `effective_budget` to the planner (no `/4` shrinking). Produces the upstream walker's large merged segments, which is the validated configuration.
- The chunk-K hook reserves room for the LARGEST merged segment's params: `chunk_k_budget = max_graph_vram_bytes - largest_merged_segment`. Without this reservation chunk-K could grow large enough that the active merged-segment offload OOMs.

## Verification

- Z-Image Q8 with `--stream-layers off`: byte-identical to upstream walker.
- Z-Image bf16 1024x688 + 2 LoRAs + `--cfg-scale 1.0 --guidance 3.5 --flow-shift 3.0` + batch_count=2 (the production REST API failure mode that motivated the correctness work): both batch images clean and distinct.
- Smoke-matrix-verified across Z-Image Q8/bf16, HiDream, Qwen, Flux schnell, SD3.5, SDXL, Anima, WAN.
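The two budget formulas quoted in this commit message, `effective_budget = min(max_vram, free - 512 MB)` and `chunk_k_budget = max_graph_vram_bytes - largest_merged_segment`, can be sketched as plain arithmetic. This is an illustration under assumptions: the function names are hypothetical, "512 MB" is read as 512 MiB, and the floor at zero is added here so the unsigned subtraction cannot wrap. The actual code in `resolve_graph_cut_plan` may handle these cases differently.

```cpp
#include <algorithm>
#include <cstdint>

constexpr uint64_t MiB = 1024ull * 1024ull;

// effective_budget = min(max_vram, free - 512 MB), floored at zero.
// `free_vram_bytes` stands in for whatever the backend reports per call.
uint64_t clamp_effective_budget(uint64_t max_vram_bytes, uint64_t free_vram_bytes) {
    const uint64_t reserve = 512 * MiB;  // assumed: "512 MB" means 512 MiB
    const uint64_t usable = free_vram_bytes > reserve ? free_vram_bytes - reserve : 0;
    return std::min(max_vram_bytes, usable);
}

// chunk_k_budget = max_graph_vram_bytes - largest_merged_segment, floored at zero.
// Leaves room for the largest merged segment's params next to the resident set.
uint64_t chunk_k_budget(uint64_t max_graph_vram_bytes,
                        uint64_t largest_merged_segment_bytes) {
    return max_graph_vram_bytes > largest_merged_segment_bytes
               ? max_graph_vram_bytes - largest_merged_segment_bytes
               : 0;
}
```

Re-evaluating the clamp on every call is the point: a runner that starts after another has committed VRAM sees the reduced free figure instead of the full configured budget.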

fszontagh added 2 commits May 29, 2026 02:13

fszontagh mentioned this pull request May 29, 2026

feat: cross-stage offload modes and layer-streaming for low-VRAM GPUs #1477

Closed

Snapshot+restore persistent EXTERNAL inputs so runtime LoRA scales correctly

5cd49ca

fszontagh force-pushed the feature/unified-streaming branch from 2575d97 to 5cd49ca May 29, 2026 12:54

fszontagh wants to merge 3 commits into leejet:master from fszontagh:feature/unified-streaming
