Introduce cache-dit to community optimization #12366
Thanks for your contribution!
I think we can probably slim this down a bit and allow users to refer to your repo for all the finer details. Here, we can just focus on the most important and practical stuff :)
@stevhliu All suggestions have been committed. PTAL ~
Thanks, just a few more suggestions and then we can merge! :)
Updated the wording for clarity and consistency in the documentation. Adjusted sections on cache acceleration, automatic block adapter, patch functor, and hybrid cache configuration.
@stevhliu All suggestions have been committed. PTAL ~
Reopen: #12351
Hi~, I'm the maintainer of cache-dit. I'd like to introduce cache-dit: A Unified, Flexible and Training-free Cache Acceleration Framework for 🤗Diffusers. 🎉Now, cache-dit covers almost All Diffusers' DiT Pipelines🎉. I think this should be the first cache acceleration system in the community that fully supports 🤗 Diffusers.
@sayakpaul @stevhliu
CacheDiT
cache-dit is a Unified, Flexible, and Training-free cache acceleration framework designed for 🤗 Diffusers, enabling cache acceleration with just one line of code. It encompasses a range of key features including Unified Cache APIs, Forward Pattern Matching, Automatic Block Adapter, Hybrid Forward Pattern, DBCache, TaylorSeer Calibrator, and Cache CFG.
Notably, cache-dit now supports nearly all of Diffusers' DiT-based pipelines, such as Qwen-Image, FLUX.1, Qwen-Image-Lightning, Wan 2.1/2.2, HunyuanImage-2.1, HunyuanVideo, HunyuanDiT, HiDream, AuraFlow, CogView3Plus, CogView4, LTXVideo, CogVideoX/X 1.5, ConsisID, Cosmos, SkyReelsV2, VisualCloze, OmniGen 1/2, Lumina 1/2, PixArt, Chroma, Sana, Allegro, Mochi, SD 3/3.5, Amused, and DiT-XL, with relevant benchmarks available for Text2Image DrawBench and Text2Image Distillation DrawBench.
For more information, please refer to the following details.
A Unified, Flexible and Training-free Cache Acceleration Framework for 🤗Diffusers
♥️ Cache Acceleration with One-line Code ~ ♥️
📚Unified Cache APIs | 📚Forward Pattern Matching | 📚Automatic Block Adapter
📚Hybrid Forward Pattern | 📚DBCache | 📚TaylorSeer Calibrator | 📚Cache CFG
📚Text2Image DrawBench | 📚Text2Image Distillation DrawBench
🎉Now, cache-dit covers almost All Diffusers' DiT Pipelines🎉
🔥Qwen-Image | FLUX.1 | Qwen-Image-Lightning | Wan 2.1 | Wan 2.2 🔥
🔥HunyuanImage-2.1 | HunyuanVideo | HunyuanDiT | HiDream | AuraFlow🔥
🔥CogView3Plus | CogView4 | LTXVideo | CogVideoX | CogVideoX 1.5 | ConsisID🔥
🔥Cosmos | SkyReelsV2 | VisualCloze | OmniGen 1/2 | Lumina 1/2 | PixArt🔥
🔥Chroma | Sana | Allegro | Mochi | SD 3/3.5 | Amused | ... | DiT-XL🔥
| Model | +cache-dit speedup |
|---|---|
| Wan2.2 MoE | 2.0x↑🎉 |
| HunyuanVideo | 2.1x↑🎉 |
| Qwen-Image | 1.8x↑🎉 |
| FLUX.1-dev | 2.1x↑🎉 |
| FLUX-Kontext-dev | 1.3x↑ / 1.7x↑ / 2.0x↑🎉 |
| Qwen-Image-Lightning | 1.14x↑🎉 |
| HunyuanImage-2.1 | 1.7x↑🎉 |
| Qwen-Image-Edit | 1.6x↑ / 1.9x↑🎉 |
| HiDream-I1 | 1.9x↑🎉 |
| CogView4 | 1.4x↑ / 1.7x↑🎉 |
| CogView3 | 1.5x↑ / 2.0x↑🎉 |
| Chroma1-HD | 1.9x↑🎉 |
| Mochi-1-preview | 1.8x↑🎉 |
| SkyReelsV2 | 1.6x↑🎉 |
| VisualCloze-512 | 1.4x↑ / 1.7x↑🎉 |
| LTX-Video-0.9.7 | 1.7x↑🎉 |
| CogVideoX1.5 | 2.0x↑🎉 |
| OmniGen-v1 | 1.5x↑ / 3.3x↑🎉 |
| Lumina2 | 1.9x↑🎉 |
| Allegro | 1.36x↑🎉 |
| AuraFlow-v0.3 | 2.27x↑🎉 |
| Sana | 1.3x↑ / 1.6x↑🎉 |
| PixArt-Sigma | 2.3x↑🎉 |
| PixArt-Alpha | 1.6x↑ / 1.8x↑🎉 |
| SD 3.5 | 2.5x↑🎉 |
| Amused | 1.1x↑ / 1.2x↑🎉 |
| DiT-XL-256 | 1.8x↑🎉 |
♥️ Please consider leaving a ⭐️ Star to support us ~ ♥️
⚙️Installation
You can install the stable release of `cache-dit` from PyPI, or the latest development version from GitHub:
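A minimal sketch of both routes; the GitHub URL below is an assumption, so check the project page for the canonical repository:

```bash
# Stable release from PyPI
pip install -U cache-dit

# Latest development version from GitHub (repository URL is an assumption)
pip install git+https://github.com/vipshop/cache-dit.git
```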
🔥Supported Pipelines
Currently, the cache-dit library supports almost any Diffusion Transformer (one whose Transformer blocks match the supported input and output patterns). Please check 🎉Examples for more details. Here are just some of the tested models.
🔥Benchmarks
cache-dit will support more mainstream cache acceleration algorithms in the future, and more benchmarks will be released, so please stay tuned for updates. Only some of the precision and performance benchmark results are presented here; the test dataset is DrawBench. For the complete benchmarks, please refer to 📚Benchmarks.
📚Text2Image DrawBench: FLUX.1-dev
Comparisons between different FnBn compute block configurations show that more compute blocks result in higher precision. For example, the F8B0_W8MC0 configuration achieves the best Clip Score (33.007) and ImageReward (1.0333). Device: NVIDIA L20. F: Fn_compute_blocks, B: Bn_compute_blocks, 50 steps.
The comparison between cache-dit (DBCache) and algorithms such as Δ-DiT, Chipmunk, FORA, DuCa, TaylorSeer, and FoCa is shown below. Among methods with a speedup ratio below 3x, cache-dit achieves the best accuracy. Please check 📚How to Reproduce? for more details.
NOTE: Except for DBCache, the performance data are taken from the FoCa paper (arXiv:2508.16211).
📚Text2Image Distillation DrawBench: Qwen-Image-Lightning
Surprisingly, cache-dit's DBCache still works for extremely few-step distilled models. For example, on Qwen-Image-Lightning with 4 steps and the F16B16 configuration, the PSNR is 34.8163, the Clip Score is 35.6109, and the ImageReward is 1.2614, maintaining relatively high precision.
🎉Unified Cache APIs
📚Forward Pattern Matching
Currently, for any diffusion model with Transformer blocks that match the supported input/output patterns, you can use the Unified Cache APIs from cache-dit, namely the `cache_dit.enable_cache(...)` API. The Unified Cache APIs are currently experimental, so please stay tuned for updates. The supported patterns are listed as follows:

In most cases, you only need to call a single line of code: `cache_dit.enable_cache(...)`. After this API is called, just call the pipe as normal. The `pipe` param can be any Diffusion Pipeline; please refer to Qwen-Image as an example, or the sketch below.
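A minimal sketch of the one-line usage, assuming the FLUX.1-dev checkpoint; any other supported pipeline works the same way:

```python
import torch
from diffusers import FluxPipeline

import cache_dit

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# One-line cache acceleration: cache-dit matches the transformer's
# forward pattern and enables caching with the default settings.
cache_dit.enable_cache(pipe)

# Call the pipeline as normal; cached blocks are reused across steps.
image = pipe("A cat holding a sign that says hello world").images[0]
```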
🔥Automatic Block Adapter
But in some cases, you may have a modified Diffusion Pipeline or Transformer that does not live in the diffusers library or is not yet officially supported by cache-dit. The BlockAdapter can help you solve this problem. Please refer to 🔥Qwen-Image w/ BlockAdapter as an example.
For such situations, BlockAdapter lets you quickly apply the various cache acceleration features to your own Diffusion Pipelines and Transformers, roughly as in the sketch below. Please check 📚BlockAdapter.md for more details.
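A hedged sketch of describing a pipeline manually via BlockAdapter; the keyword arguments (`pipe`, `transformer`, `blocks`, `forward_pattern`) follow the cache-dit docs but should be treated as assumptions, so check BlockAdapter.md for the exact API:

```python
from cache_dit import BlockAdapter, ForwardPattern

# Describe where the transformer blocks live and which forward pattern
# they follow, instead of relying on automatic pipeline matching.
cache_dit.enable_cache(
    BlockAdapter(
        pipe=pipe,  # your (possibly modified) Diffusion Pipeline
        transformer=pipe.transformer,
        blocks=pipe.transformer.transformer_blocks,
        forward_pattern=ForwardPattern.Pattern_1,  # pattern choice is an assumption
    ),
)
```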
📚Hybrid Forward Pattern
Sometimes, a Transformer class will contain more than one list of transformer blocks. For example, FLUX.1 (and HiDream, Chroma, etc.) contains both transformer_blocks and single_transformer_blocks, which follow different forward patterns. The BlockAdapter can also handle this case; please refer to 📚FLUX.1 as an example, or the sketch after this paragraph.

Even more complex cases exist, such as Wan 2.2 MoE, which has more than one Transformer (namely `transformer` and `transformer_2`) in its structure. Fortunately, cache-dit handles this situation as well; please refer to 📚Wan 2.2 MoE as an example.
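A hedged sketch for the FLUX.1-style case with two block lists; passing lists for `blocks` and `forward_pattern` follows the cache-dit docs but is an assumption here, and the pattern choices are illustrative:

```python
from cache_dit import BlockAdapter, ForwardPattern

# Two block lists with different forward patterns, cached together.
cache_dit.enable_cache(
    BlockAdapter(
        pipe=pipe,  # e.g. a FluxPipeline
        transformer=pipe.transformer,
        blocks=[
            pipe.transformer.transformer_blocks,
            pipe.transformer.single_transformer_blocks,
        ],
        forward_pattern=[
            ForwardPattern.Pattern_1,  # pattern choices are assumptions
            ForwardPattern.Pattern_3,
        ],
    ),
)
```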
📚Implement Patch Functor
For any pattern not in {0...5}, we introduce the simple abstract concept of a Patch Functor. Users can implement a subclass of Patch Functor to convert an unknown pattern into a known one; for some models, users may also need to fuse the operations within the blocks' for-loop into the block forward.
Some Patch Functors are already provided in cache-dit: 📚HiDreamPatchFunctor, 📚ChromaPatchFunctor, etc. After implementing a Patch Functor, users need to set the `patch_functor` property of BlockAdapter, roughly as sketched below.
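A hypothetical sketch of wiring a provided Patch Functor into a BlockAdapter; only HiDreamPatchFunctor itself is named in the docs, while the import path, block attribute, and constructor below are assumptions:

```python
from cache_dit import BlockAdapter
from cache_dit import HiDreamPatchFunctor  # import path is an assumption

cache_dit.enable_cache(
    BlockAdapter(
        pipe=pipe,  # e.g. a HiDream pipeline
        transformer=pipe.transformer,
        blocks=pipe.transformer.double_stream_blocks,  # attribute name is an assumption
        patch_functor=HiDreamPatchFunctor(),  # rewrites the unknown pattern into a known one
    ),
)
```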
🤖Cache Acceleration Stats Summary
After finishing each inference of `pipe(...)`, you can call the `cache_dit.summary()` API on the pipe to get the details of the cache acceleration stats for that inference. You can set the `details` param to `True` to show per-block cache stats in markdown table format; sometimes, this may help you analyze what value of the residual diff threshold would work better.
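A short sketch of the stats API named above:

```python
# Run an inference, then inspect what the cache actually did.
image = pipe("A cat holding a sign that says hello world").images[0]

stats = cache_dit.summary(pipe)

# details=True prints per-block stats as a markdown table, useful when
# tuning the residual diff threshold.
cache_dit.summary(pipe, details=True)
```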
⚡️DBCache: Dual Block Cache
DBCache: Dual Block Caching for Diffusion Transformers. Different compute-block configurations (F8B12, etc.) can be customized in DBCache, enabling a balanced trade-off between performance and precision; moreover, it can be entirely training-free. Please check the DBCache.md docs for more design details, and see the sketch below.
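A hedged sketch of customizing the compute blocks; the keyword names follow the F/B legend in the benchmark section (`Fn_compute_blocks`, `Bn_compute_blocks`) but are assumptions, as is `residual_diff_threshold`:

```python
# F8B12: the first 8 and last 12 transformer blocks are always computed;
# the middle blocks are served from cache when the residual diff is small.
cache_dit.enable_cache(
    pipe,
    Fn_compute_blocks=8,           # keyword names are assumptions
    Bn_compute_blocks=12,
    residual_diff_threshold=0.08,  # threshold value is illustrative
)
```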
DBCache, L20x1 , Steps: 28, "A cat holding a sign that says hello world with complex background"
🔥TaylorSeer Calibrator
We support the TaylorSeer algorithm (From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers) to further improve the precision of DBCache in cases where the number of cached steps is large, namely Hybrid TaylorSeer + DBCache. At timesteps with significant intervals, the feature similarity in diffusion models decreases substantially, significantly harming generation quality.
TaylorSeer employs a differential method to approximate the higher-order derivatives of features and predicts features at future timesteps with a Taylor series expansion. The TaylorSeer implemented in cache-dit supports both hidden-state and residual cache types; that is, $\mathcal{F}_{\text{pred},m}\left(x_{t-k}^{l}\right)$ can be either a residual cache or a hidden-state cache.
Important
Please note that if you use TaylorSeer as the calibrator for approximate hidden states, the Bn param of DBCache can be set to 0. In essence, DBCache's Bn also acts as a calibrator, so you can choose either Bn > 0 or TaylorSeer. We recommend the TaylorSeer + DBCache FnB0 configuration, sketched below.
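A hedged sketch of the recommended F1B0 + TaylorSeer combination; the `enable_taylorseer` flag name is an assumption, so check the cache-dit README for the exact API of your installed version:

```python
# DBCache F1B0 with TaylorSeer as the calibrator (Bn = 0).
cache_dit.enable_cache(
    pipe,
    Fn_compute_blocks=1,     # keyword names are assumptions
    Bn_compute_blocks=0,     # let TaylorSeer calibrate instead of Bn
    enable_taylorseer=True,  # flag name is an assumption
)
```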
DBCache F1B0 + TaylorSeer, L20x1, Steps: 28,
"A cat holding a sign that says hello world with complex background"
⚡️Hybrid Cache CFG
cache-dit supports caching for CFG (classifier-free guidance). For models that fuse CFG and non-CFG into a single forward step, or models that do not use CFG in the forward step, please set the `enable_separate_cfg` param to False (the default is None). Otherwise, set it to True. For example:
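A short sketch using the param named above (which pipelines need it is model-specific):

```python
# For pipelines that run CFG as a separate forward pass, cache the
# conditional and unconditional branches separately.
cache_dit.enable_cache(pipe, enable_separate_cfg=True)
```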
⚙️Torch Compile
By the way, cache-dit is designed to be compatible with torch.compile, so you can easily combine the two for further performance gains. For example:
However, users intending to use cache-dit on DiTs with dynamic input shapes should consider increasing the recompile limit of `torch._dynamo`; otherwise, the recompile-limit error may be triggered, causing the module to fall back to eager mode. Please check perf.py for more details.
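A minimal sketch of both steps, assuming a recent PyTorch where `torch._dynamo.config.recompile_limit` is available (older versions call it `cache_size_limit`); the limit value is illustrative:

```python
import torch

cache_dit.enable_cache(pipe)

# Raise the Dynamo recompile limit for dynamic input shapes so compiled
# code does not fall back to eager mode.
torch._dynamo.config.recompile_limit = 96  # value is illustrative

# Compile the transformer after the cache hooks are in place.
pipe.transformer = torch.compile(pipe.transformer)
```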
🛠Metrics CLI
You can use the APIs provided by cache-dit to quickly evaluate the accuracy loss caused by different cache configurations. For example:
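A hypothetical sketch; `compute_psnr`, its import path, and its signature are assumptions, so check the cache-dit metrics docs for the real names:

```python
from cache_dit.metrics import compute_psnr  # import path is an assumption

# Compare an image generated with caching against the uncached baseline.
psnr = compute_psnr("baseline.png", "cached.png")  # signature is an assumption
print(f"PSNR vs. baseline: {psnr}")
```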
Or, you can use the `cache-dit-metrics-cli` tool.

👋Contribute
How to contribute? Star ⭐️ this repo to support us or check CONTRIBUTE.md.
©️Acknowledgements
The cache-dit codebase is adapted from FBCache. Over time, its codebase has diverged significantly, and the cache-dit API is no longer compatible with FBCache.
©️Citations