Add --stream-layers for streaming weights from CPU during generation #1576

Open
fszontagh wants to merge 3 commits into leejet:master from fszontagh:feature/unified-streaming
fszontagh commented May 29, 2026

Summary

This is the successor to #1477. That earlier PR did the same thing (streaming model weights from CPU to GPU so that larger models fit), but it ran as a parallel system alongside the existing graph-cut planner (#1476) and exposed many user-facing flags. Both points came up in the review.

This PR rebuilds the feature on top of the graph-cut planner instead of running alongside it. There is one new boolean flag, --stream-layers. When it is off, behavior is byte-identical to upstream master.

The change is split into two commits:

  1. Foundation: adds the --stream-layers flag, a SegmentResidency annotation pass on the existing planner, and a small layer registry used by the runner. No behavior change when the flag is off.
  2. Operational: chunk-K residency (keep the first K base segments on GPU across sampling steps), multi-runner safety (per-call free-VRAM clamp, restrict streaming to the diffusion runner, release residency before VAE decode), and runtime-LoRA correctness fallbacks (skip chunk-K when a weight_adapter is attached, prefer the runtime LoRA mode when streaming is on because immediate mode OOMs).

Usage

Streaming kicks in when both --max-vram (or its auto sentinel -1) and --stream-layers are set. Weights need a place to stream from, so --offload-to-cpu is enabled implicitly if you omit it, and a log line says so.

./sd-cli \
  --diffusion-model models/z_image_turbo_bf16.safetensors \
  --llm models/Qwen3-4b-Z-Engineer-V2.gguf \
  --vae models/ae.safetensors \
  -p "a cat" --cfg-scale 1.0 -H 1024 -W 1024 --steps 8 \
  --max-vram -1 --offload-to-cpu --stream-layers \
  -o out.png

--max-vram -1 auto-detects free VRAM and reserves 1 GiB headroom. Pass a positive value (e.g. --max-vram 9) to set the budget explicitly.
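
The flag interaction described in this section can be sketched as a small resolver. This is an illustration only, not the PR's code: `StreamOptions`, `ResolvedStreaming`, and `resolve_streaming` are hypothetical names, and the sketch assumes that `--max-vram` is given in GiB and that 0 means "unset".

```cpp
#include <cstdint>

// Hypothetical option bundle; field names are illustrative, not the PR's.
struct StreamOptions {
    double max_vram_gib   = 0.0;    // assumed: 0 = unset, -1 = auto, > 0 = explicit budget in GiB
    bool   stream_layers  = false;  // --stream-layers
    bool   offload_to_cpu = false;  // --offload-to-cpu
};

struct ResolvedStreaming {
    bool          streaming_active;  // true only if both flags are set
    bool          offload_to_cpu;    // forced on when streaming is active
    std::uint64_t budget_bytes;      // 0 when streaming is inactive
};

constexpr std::uint64_t kGiB = 1024ull * 1024ull * 1024ull;

ResolvedStreaming resolve_streaming(const StreamOptions& opt, std::uint64_t free_vram_bytes) {
    ResolvedStreaming r{false, opt.offload_to_cpu, 0};
    const bool auto_budget = (opt.max_vram_gib == -1.0);
    const bool has_budget  = auto_budget || opt.max_vram_gib > 0.0;
    if (!opt.stream_layers || !has_budget) {
        return r;  // flag off or no budget: nothing changes
    }
    r.streaming_active = true;
    r.offload_to_cpu   = true;  // weights need a host-side copy to stream from
    if (auto_budget) {
        // Auto sentinel: free VRAM minus 1 GiB headroom, clamped at zero.
        r.budget_bytes = free_vram_bytes > kGiB ? free_vram_bytes - kGiB : 0;
    } else {
        r.budget_bytes = static_cast<std::uint64_t>(opt.max_vram_gib * static_cast<double>(kGiB));
    }
    return r;
}
```

Under these assumptions, a device reporting 12 GiB free with `--max-vram -1 --stream-layers` resolves to an 11 GiB budget with CPU offload forced on.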

[Image: example output of the command above]

Tested

  • Smoke matrix of 7 model configurations at --max-vram 4 --offload-to-cpu --stream-layers: Z-Image Q8 and bf16, HiDream, Flux schnell, SD3.5 large, SDXL, WAN 2.2 5B. All generated valid output, with no cudaMalloc failures.
  • Z-Image Q8 with --stream-layers off: byte-identical PNG to upstream walker.
  • Multi-LoRA correctness check: Z-Image bf16 1024x688, batch_count=2, runtime LoRA mode with two LoRAs stacked, --cfg-scale 1.0 --guidance 3.5 --flow-shift 3.0. Both batch images come out correct.

Performance

The async-prefetch path from earlier iterations of this branch is intentionally disabled in this PR because it caused correctness regressions with runtime LoRA and with batch_count > 1 combined with non-default --guidance / --flow-shift. The implementation stays in the source for the follow-up to build on, but the flag that gates it (prefetch_enabled) is hard-coded to false.

chunk-K residency is still active and avoids repeated host-to-device (H2D) copies of the resident segments across all sampling steps. The wall-clock benefit varies by model and --max-vram budget.

Future work

The headline perf win, keeping the GPU near 100 percent utilization during streaming, needs PR #1477's chunk_graph.hpp helper ported on top of this foundation. That caches a fused cgraph for K base layers so the host does not need to issue per-layer kernel launches between H2D copies. It is the planned next PR, written specifically against this branch.

Other items I have queued for follow-ups:

  • Per-module flag form: --stream-layers diffusion,llm,vae, once cross-runner residency eviction is in place. The current PR scopes streaming to the diffusion runner only.
  • SDCPP_STREAM_PROFILE env-var-gated per-stage timing breakdown, useful for tuning.

Checklist

fszontagh added 2 commits May 29, 2026 02:13
Surface area for the unified-streaming design. No behaviour change
when --stream-layers is unset — the dispatch in compute<T>() (added in
the follow-up commit on src/ggml_extend.hpp) short-circuits to the
upstream walker.

- `sd_ctx_params_t::stream_layers` field (include/stable-diffusion.h).
- `--stream-layers` boolean CLI flag (examples/common/common.{h,cpp}).
- `sd::ggml_graph_cut::SegmentResidency` enum + Segment::residency
  field + annotate_residency declaration (src/ggml_graph_cut.h).
- `sd::layer_registry::Registry` with register/move primitives using
  the proven dup-copy-swap idiom on tensor->buffer/data/extra
  (new src/layer_registry.{h,cpp}).
- Conditioner subclasses gain a virtual set_stream_layers_enabled
  (src/conditioner.hpp). UpscalerGGML forwards to its inner runner
  (src/upscaler.{h,cpp}).

Builds on the foundation commit (--stream-layers + planner annotation
+ executor scaffolding) with:

## chunk-K residency

A parallel `resident_*` offload track on `GGMLRunner` keeps a fraction
of the diffusion model's params on GPU permanently across sampling
steps, amortising H2D over many invocations.

- Members: `resident_offload_ctx`, `resident_offload_pairs`,
  `resident_runtime_params_buffer`, `resident_param_set`,
  `resident_state_token` (parallel to the existing `partial_offload_*`
  per-segment track).
- `offload_resident_params(tensors)` / `restore_resident_params()` use
  the same dup-copy-swap idiom as `offload_partial_params` but write
  to the resident slot and persist across `compute()` calls.
- `offload_partial_params` filters tensors already in
  `resident_param_set` so per-segment offload skips them.
  `restore_resident_params` is hooked into `~GGMLRunner()` and
  `free_params_buffer()` to keep swap pointers valid through teardown.
- `compute_streaming_segments<T>` reads
  `graph_cut_plan_cache_.graph_cut_plan` (the unmerged base plan),
  annotates it, gathers the union of RESIDENT segments' param tensors,
  and offloads them once. Compute itself proceeds on the merged plan
  for fused-graph efficiency. A commutative pointer-hash state-token
  detects when a different plan is in play and rebuilds the resident
  set.
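
A minimal sketch of what a commutative pointer-hash state token could look like. The function name `resident_state_token_sketch` and the choice of mixer (the splitmix64 finalizer) are assumptions for illustration; the commit does not say which hash it uses. The only properties taken from the description above are that the token depends on tensor addresses and is independent of their order.

```cpp
#include <cstdint>
#include <vector>

// Order-independent token over a set of tensor addresses.
// Assumption: each pointer is mixed with the splitmix64 finalizer and the
// results are summed; addition commutes, so visiting order does not matter.
std::uint64_t resident_state_token_sketch(const std::vector<const void*>& tensors) {
    std::uint64_t token = 0;
    for (const void* p : tensors) {
        std::uint64_t h = static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(p));
        h = (h ^ (h >> 30)) * 0xbf58476d1ce4e5b9ULL;
        h = (h ^ (h >> 27)) * 0x94d049bb133111ebULL;
        h ^= h >> 31;
        token += h;  // wraps modulo 2^64, which is fine for a fingerprint
    }
    return token;
}
```

A runner would compare the token of the current plan's RESIDENT tensors against the stored one and rebuild the resident set on mismatch. Because only addresses are hashed, a token of this kind cannot notice changes to tensor data.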

`annotate_residency` updates:

- "Any param-bearing segment exists" sanity replaces the
  `segments[0].input_param_bytes == 0` early-return (wrong for
  diffusion models whose first segment is a small prelude).
- Greedy cumulative-bytes loop handles heterogeneous segment sizes
  (small prelude + large transformer layers).
- Resets `seg.residency = STREAMED` at entry so cached plans don't
  carry forward stale RESIDENT marks from a previous larger-budget
  call.
- Don't reserve a `prefetch_segments * largest_segment` window;
  async prefetch is no longer used (see below).
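
The annotation rules above can be sketched as follows. This is a simplified stand-in, not the PR's `annotate_residency`: the `Segment` struct is reduced to the two fields named in the commit, and the function name, its signature, and the "stop at the first segment that does not fit" policy are assumptions based on the "first K base segments" wording.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class SegmentResidency { STREAMED, RESIDENT };

// Reduced stand-in for the planner's segment type.
struct Segment {
    std::uint64_t    input_param_bytes = 0;
    SegmentResidency residency         = SegmentResidency::STREAMED;
};

// Marks the longest prefix of segments whose cumulative param bytes fit in
// the budget as RESIDENT. Returns the number of RESIDENT segments (K).
std::size_t annotate_residency_sketch(std::vector<Segment>& segments,
                                      std::uint64_t resident_budget_bytes) {
    bool any_params = false;
    for (Segment& seg : segments) {
        seg.residency = SegmentResidency::STREAMED;  // drop stale marks from cached plans
        any_params    = any_params || seg.input_param_bytes > 0;
    }
    if (!any_params) {
        return 0;  // sanity check: nothing to keep resident
    }

    std::uint64_t used = 0;
    std::size_t   k    = 0;
    for (Segment& seg : segments) {
        // used <= budget holds here, so the subtraction cannot underflow.
        if (seg.input_param_bytes > resident_budget_bytes - used) {
            break;
        }
        used += seg.input_param_bytes;
        seg.residency = SegmentResidency::RESIDENT;
        ++k;
    }
    return k;
}
```

With segment sizes of 10, 400, 400, and 400 bytes and a budget of 900, the first three segments (810 bytes) become RESIDENT and the fourth stays STREAMED, which shows why a small prelude segment must not trigger an early return.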

## Multi-runner safety

- Per-runner free-VRAM clamp at compute time in
  `resolve_graph_cut_plan`. Each runner queries
  `ggml_backend_dev_memory(runtime_backend)` and clamps
  `effective_budget = min(max_vram, free - 512 MB)` per call.
  Without this, after the LLM had committed ~7 GB of chunk-K resident
  params, the diffusion runner still believed it had the whole budget
  and OOM'd.
- `--stream-layers` is restricted to diffusion runners only
  (diffusion_model + high_noise_diffusion_model) — matches PR leejet#1477's
  scope and avoids one-shot runners (LLM, VAE, clip_vision, upscaler)
  claiming permanent chunk-K state that starves the diffusion model.
- `GGMLRunner::release_streaming_residency()` (public trampoline to
  `restore_resident_params`) is called from `decode_first_stage()` on
  diffusion_model + high_noise_diffusion_model right before VAE
  decode. Without it the 6.5 GB chunk-K residency from sampling would
  starve VAE's compute buffer (~4.5 GB at full image resolution) and
  OOM at decode.
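
The per-call clamp can be written out as a small helper. This is a sketch under stated assumptions: `effective_budget_sketch` is a hypothetical name, free memory is passed in as a parameter rather than queried from ggml, "512 MB" is read as 512 MiB, and the subtraction is clamped at zero, which the commit message does not spell out.

```cpp
#include <algorithm>
#include <cstdint>

// Assumed reading of the commit's "512 MB" safety margin.
constexpr std::uint64_t kHeadroomBytes = 512ull * 1024ull * 1024ull;

// effective_budget = min(max_vram, free - headroom), clamped at zero so a
// nearly full device yields 0 instead of wrapping around.
std::uint64_t effective_budget_sketch(std::uint64_t max_vram_bytes, std::uint64_t free_bytes) {
    const std::uint64_t usable = free_bytes > kHeadroomBytes ? free_bytes - kHeadroomBytes : 0;
    return std::min(max_vram_bytes, usable);
}
```

For instance, with a 9 GiB `--max-vram` budget and only 3 GiB actually free, the clamp yields 2.5 GiB, so a runner that starts after another runner has claimed memory plans against what is really available.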

## `LORA_APPLY_AUTO` picks runtime when streaming or CPU-offload is on

Immediate mode bakes LoRA into weights at load time by running a
forward pass over every weight tensor. That allocates a
full-model-size (~11 GB on Z-Image bf16) compute buffer on the runtime
backend in one shot and OOMs on any VRAM-constrained setup, which is
exactly the case `--stream-layers` / `--offload-to-cpu` exist for.
AUTO previously only picked runtime for quantized models; now
`stream_layers || offload_params_to_cpu` is also a trigger.
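
The selection rule reads naturally as a pure function. The enum and function below are hypothetical stand-ins written for this illustration, not the project's actual types; only the trigger conditions come from the description above.

```cpp
// Hypothetical stand-in for the LoRA apply-mode setting.
enum class LoraApplyMode { Auto, Immediately, AtRuntime };

LoraApplyMode resolve_lora_apply_mode(LoraApplyMode requested,
                                      bool model_is_quantized,
                                      bool stream_layers,
                                      bool offload_params_to_cpu) {
    if (requested != LoraApplyMode::Auto) {
        return requested;  // an explicit user choice is left alone
    }
    // Previously only quantization selected runtime mode; streaming and
    // CPU offload are the new triggers.
    if (model_is_quantized || stream_layers || offload_params_to_cpu) {
        return LoraApplyMode::AtRuntime;
    }
    return LoraApplyMode::Immediately;
}
```

Keeping the explicit-choice branch first means a user who deliberately requests immediate mode still gets it, even on a setup where it may run out of memory.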

## Conservative streaming v4 (current shipping configuration)

After observing edge-case failures when prefetch was combined with
multi-LoRA, non-default --guidance/--flow-shift, or batch_count > 1,
the final configuration is:

- chunk-K residency skips itself when `weight_adapter != nullptr`.
  The state-token hashes tensor *pointers*, not data, so it can't
  detect MultiLoraAdapter modifications across batch images / steps;
  the symptom was colored static noise on batch image 2+.
- Async prefetch is hard-disabled (`prefetch_enabled = false`). The
  `compute_streaming_segments_prefetch<T>` implementation stays in
  the file for future reference but is currently dead. Two correctness
  problems forced the disable:
    * Multi-LoRA workloads: graph_compute_async + per-segment pending
      offload races MultiLoraAdapter's per-layer patch_weight reads.
    * batch_count > 1 + non-default --guidance + --flow-shift: the
      smaller merged segments required to fit two prefetched buffers
      in --max-vram accumulate FP error across the extra boundary-
      cache roundtrips → collapses to pure white frames.
- `resolve_graph_cut_plan` always passes the full `effective_budget`
  to the planner (no `/4` shrinking). Produces the upstream walker's
  large merged segments — the validated configuration.
- The chunk-K hook reserves room for the LARGEST merged segment's
  params: `chunk_k_budget = max_graph_vram_bytes -
  largest_merged_segment`. Without this reservation chunk-K could
  grow large enough that the active merged-segment offload OOMs.
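
The reservation and the adapter skip from the list above can be combined in one hedged sketch. `chunk_k_budget_sketch` and its parameters are illustrative names; the zero clamp for the case where the largest merged segment already exceeds the budget is an assumption.

```cpp
#include <cstdint>

// Bytes that chunk-K residency may claim, after reserving room for the
// largest merged segment's params. Returns 0 when residency is skipped.
std::uint64_t chunk_k_budget_sketch(std::uint64_t max_graph_vram_bytes,
                                    std::uint64_t largest_merged_segment_bytes,
                                    bool has_weight_adapter) {
    if (has_weight_adapter) {
        return 0;  // runtime LoRA attached: chunk-K skips itself
    }
    if (largest_merged_segment_bytes >= max_graph_vram_bytes) {
        return 0;  // assumed clamp: nothing left over for resident params
    }
    return max_graph_vram_bytes - largest_merged_segment_bytes;
}
```

With a budget of 9 units and a largest merged segment of 3, residency may use up to 6; attaching a weight adapter drops that to 0 regardless of budget.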

## Verification

- Z-Image Q8 with `--stream-layers` unset: byte-identical to upstream
  walker.
- Z-Image bf16 1024x688 + 2 LoRAs + `--cfg-scale 1.0 --guidance 3.5
  --flow-shift 3.0` + batch_count=2 (the production REST API failure
  mode that motivated the correctness work): both batch images clean
  and distinct.
- Smoke-matrix-verified across Z-Image Q8/bf16, HiDream, Qwen,
  Flux schnell, SD3.5, SDXL, Anima, WAN.
fszontagh force-pushed the feature/unified-streaming branch from 2575d97 to 5cd49ca on May 29, 2026 12:54