Feat: Temporal tile custom size with overlap#1510
Merged
leejet reviewed May 18, 2026
This is a massive improvement, especially for those on Vulkan who are VAE-size limited!
Owner
Thank you for your contribution.
Contributor
Author
Thank you for completing it! I was planning to, but got lost in other side-quests first.
fszontagh added a commit to fszontagh/stable-diffusion.cpp that referenced this pull request May 22, 2026
25 new upstream commits since the previous sync. Highlights:

- 3a8788c refactor: unify extra argument parsing (leejet#1540)
- 449165c feat: stream LTX VAE temporal tile decoding (leejet#1539)
- adaa599 Feat: Temporal tile custom size with overlap (leejet#1510)
- 2e35146 perf: run LTX audio VAE decode in one ggml graph (leejet#1538)
- 47d8198 feat: add taeltx2_3_wide support (leejet#1535)
- ef92a00 feat: add graph cut markers for LTXAV transformer (leejet#1534)
- b3374e6 feat: add LTX spatial latent upscale hires support (leejet#1533)
- bdd937f feat: add taeltx2/taeltx2.3 support (leejet#1531)
- c51ec7c fix: always load runtime lora params on runtime backend (leejet#1532)
- e7eb92f feat: add Gradient Estimation sampler (leejet#1484)
- 50134e5 refactor: split guidance composition (leejet#1506)
- e43b24c feat: add ltx2.3 flf2v support (leejet#1505)
- b706d68 fix: restore singleton dims for LLM outputs (leejet#1518)
- b758b7d fix: only enable TAE after successful load (leejet#1517)
- f683c88 feat: make negative max_vram control the amount of spare vram (leejet#1503)
- baf7eda refactor: minify vocab files (leejet#1509)
- 22c8c40 sync: update ggml (leejet#1520)

plus 8 CI / docs / docker fixes.

Conflict resolution: src/stable-diffusion.cpp had a single conflict in the video-generation post-sampling block. Our HEAD had the smart-offload-for-VAE-decode hook (move diffusion model to CPU when free_params_immediately is false and VRAM is tight). Upstream added the LTX spatial latent upscale hires path that runs a second sampler invocation. Both pieces are needed and they're complementary: smart offload is video-agnostic and runs only on the non-upscale code path; the upscale block manages its own params lifecycle through its own sampler+free invocation.

Resolution: upstream's `if (latent_upscale_enabled)` block kept as-is, and our smart-offload + free_params_immediately handling moved into the matching `else` branch. No semantic change to either feature.

All other touched files (include/stable-diffusion.h, src/llm.hpp, src/ggml_extend.hpp, src/diffusion_model.hpp, examples/common/...) auto-merged cleanly. Our additions (friend declaration in ggml_extend for the streaming executor, forward_layer_block / forward_final_norm helpers on LLM::TextModel, offload_config field on sd_ctx_params_t) all interoperate with the upstream changes.

Build is clean. Smoke test: Z-Image-Turbo Q8 generates a valid cat image at 512x512 after the merge. Host CUDA driver currently shows NVML version mismatch (220s wallclock); requires driver reload to re-validate expected timings.
Summary
This PR improves temporal tiled decoding for the LTX2.3 video VAE by adding support for custom tile sizes (processing multiple latent frames per batch) and temporal overlap/padding. The overlap gives the VAE decoder access to the near future, which makes transitions between "tiles" smoother.
For now it uses the environment variables `VAE_TILE_FRAMES` and `VAE_TILE_PAD` to control the effect. In my experience, setting `VAE_TILE_FRAMES=4` and `VAE_TILE_PAD=1` seems to give very decent results (see comment).
Related Issue / Discussion
#1490
Additional Information
How it works:
Currently, when temporal tiling is enabled, decoding switches from processing all latent frames at once to processing them sequentially, one by one. However, this causes choppy transitions every 8 frames (corresponding to 1 latent frame).
To fix this, instead of processing one frame at a time, latent frames are now decoded in small, overlapping batches (tiles). Each batch has access to the cached context of all the previous ones (as is already the case when decoding frames sequentially). To ensure seamless transitions, the final frame(s) of a batch act as an overlap: they are discarded from the current output and re-processed at the start of the next batch.
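The batching described above can be sketched as a small planning function. This is only an illustration and is not taken from the PR: `TemporalTile` and `plan_temporal_tiles` are hypothetical names, and the sketch assumes that `VAE_TILE_FRAMES` counts the frames kept per batch while `VAE_TILE_PAD` adds extra lookahead frames on top. The actual implementation may count the padding differently.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical description of one decode batch (not a type from the PR).
struct TemporalTile {
    int start;  // first latent frame fed to the decoder
    int end;    // one past the last latent frame fed to the decoder
    int keep;   // leading frames whose output is kept; the rest is overlap
};

// Assumption: each batch decodes up to tile_frames + pad latent frames,
// keeps the output of the first tile_frames, and the next batch starts
// again at the first discarded (overlap) frame.
std::vector<TemporalTile> plan_temporal_tiles(int num_frames, int tile_frames, int pad) {
    std::vector<TemporalTile> tiles;
    if (num_frames <= 0 || tile_frames <= 0 || pad < 0) {
        return tiles;  // nothing to plan for invalid input
    }
    for (int start = 0; start < num_frames; start += tile_frames) {
        // Frames in [start, keep_end) are kept from this batch.
        int keep_end = std::min(start + tile_frames, num_frames);
        // Frames in [keep_end, end) are lookahead only and get re-decoded later.
        int end = std::min(keep_end + pad, num_frames);
        tiles.push_back({start, end, keep_end - start});
    }
    return tiles;
}
```

Under these assumptions, 10 latent frames with `tile_frames = 4` and `pad = 1` give three batches: decode [0, 5) and keep 4, decode [4, 9) and keep 4, decode [8, 10) and keep 2. The last batch has no future frames left, so it carries no overlap.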
Quick comparison
output.-.full.webm
output.-.tiling.webm
output.-.t4p1.webm
Checklist