diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index 401a40d645da..e9cea85ffc0b 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -180,8 +180,6 @@
       title: Accelerate inference
     - local: optimization/memory
       title: Reduce memory usage
-    - local: optimization/torch2.0
-      title: PyTorch 2.0
     - local: optimization/xformers
       title: xFormers
     - local: optimization/tome
diff --git a/docs/source/en/api/pipelines/deepfloyd_if.md b/docs/source/en/api/pipelines/deepfloyd_if.md
index 006422281a1f..a00b248d63ce 100644
--- a/docs/source/en/api/pipelines/deepfloyd_if.md
+++ b/docs/source/en/api/pipelines/deepfloyd_if.md
@@ -347,7 +347,7 @@ pipe.to("cuda")
 image = pipe(image=image, prompt="", strength=0.3).images
 ```
 
-You can also use [`torch.compile`](../../optimization/torch2.0). Note that we have not exhaustively tested `torch.compile`
+You can also use [`torch.compile`](../../optimization/fp16#torchcompile). Note that we have not exhaustively tested `torch.compile`
 with IF and it might not give expected results.
 
 ```py
diff --git a/docs/source/en/optimization/fp16.md b/docs/source/en/optimization/fp16.md
index 97a1f5830a94..010b721536d7 100644
--- a/docs/source/en/optimization/fp16.md
+++ b/docs/source/en/optimization/fp16.md
@@ -150,6 +150,24 @@ pipeline(prompt, num_inference_steps=30).images[0]
 
 Compilation is slow the first time, but once compiled, it is significantly faster. Try to only use the compiled pipeline on the same type of inference operations. Calling the compiled pipeline on a different image size retriggers compilation which is slow and inefficient.
 
+### Regional compilation
+
+[Regional compilation](https://docs.pytorch.org/tutorials/recipes/regional_compilation.html) reduces the cold start compilation time by only compiling a specific repeated region (or block) of the model instead of the entire model. The compiler reuses the cached and compiled code for the other blocks.
+
+[Accelerate](https://huggingface.co/docs/accelerate/index) provides the [compile_regions](https://github.com/huggingface/accelerate/blob/273799c85d849a1954a4f2e65767216eb37fa089/src/accelerate/utils/other.py#L78) method for automatically compiling the repeated blocks of an `nn.Module` sequentially. The rest of the model is compiled separately.
+
+```py
+# pip install -U accelerate
+import torch
+from diffusers import StableDiffusionXLPipeline
+from accelerate.utils import compile_regions
+
+pipeline = StableDiffusionXLPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
+).to("cuda")
+pipeline.unet = compile_regions(pipeline.unet, mode="reduce-overhead", fullgraph=True)
+```
+
 ### Graph breaks
 
 It is important to specify `fullgraph=True` in torch.compile to ensure there are no graph breaks in the underlying model. This allows you to take advantage of torch.compile without any performance degradation. For the UNet and VAE, this changes how you access the return variables.
@@ -170,6 +188,12 @@ The `step()` function is [called](https://github.com/huggingface/diffusers/blob/
 
 In general, the `sigmas` should [stay on the CPU](https://github.com/huggingface/diffusers/blob/35a969d297cba69110d175ee79c59312b9f49e1e/src/diffusers/schedulers/scheduling_euler_discrete.py#L240) to avoid the communication sync and latency.
 
+### Benchmarks
+
+Refer to the [diffusers/benchmarks](https://huggingface.co/datasets/diffusers/benchmarks) dataset to see inference latency and memory usage data for compiled pipelines.
+
+The [diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao#benchmarking-results) repository also contains benchmarking results for compiled versions of Flux and CogVideoX.
+
 ## Dynamic quantization
 
 [Dynamic quantization](https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html) improves inference speed by reducing precision to enable faster math operations. This particular type of quantization determines how to scale the activations based on the data at runtime rather than using a fixed scaling factor. As a result, the scaling factor is more accurately aligned with the data.
diff --git a/docs/source/en/optimization/tome.md b/docs/source/en/optimization/tome.md
index 3e574efbfe1b..f379bc97f494 100644
--- a/docs/source/en/optimization/tome.md
+++ b/docs/source/en/optimization/tome.md
@@ -93,4 +93,4 @@ To reproduce this benchmark, feel free to use this [script](https://gist.github.
 | | | 2 | OOM | 13 | 10.78 |
 | | | 1 | OOM | 6.66 | 5.54 |
 
-As seen in the tables above, the speed-up from `tomesd` becomes more pronounced for larger image resolutions. It is also interesting to note that with `tomesd`, it is possible to run the pipeline on a higher resolution like 1024x1024. You may be able to speed-up inference even more with [`torch.compile`](torch2.0).
+As seen in the tables above, the speed-up from `tomesd` becomes more pronounced for larger image resolutions. It is also interesting to note that with `tomesd`, it is possible to run the pipeline on a higher resolution like 1024x1024. You may be able to speed-up inference even more with [`torch.compile`](fp16#torchcompile).
diff --git a/docs/source/en/optimization/torch2.0.md b/docs/source/en/optimization/torch2.0.md
deleted file mode 100644
index cc69eceff3af..000000000000
--- a/docs/source/en/optimization/torch2.0.md
+++ /dev/null
@@ -1,438 +0,0 @@
-
-
-# PyTorch 2.0
-
-🤗 Diffusers supports the latest optimizations from [PyTorch 2.0](https://pytorch.org/get-started/pytorch-2.0/) which include:
-
-1. A memory-efficient attention implementation, scaled dot product attention, without requiring any extra dependencies such as xFormers.
-2. [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html), a just-in-time (JIT) compiler to provide an extra performance boost when individual models are compiled.
-
-Both of these optimizations require PyTorch 2.0 or later and 🤗 Diffusers > 0.13.0.
-
-```bash
-pip install --upgrade torch diffusers
-```
-
-## Scaled dot product attention
-
-[`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention) (SDPA) is an optimized and memory-efficient attention (similar to xFormers) that automatically enables several other optimizations depending on the model inputs and GPU type. SDPA is enabled by default if you're using PyTorch 2.0 and the latest version of 🤗 Diffusers, so you don't need to add anything to your code.
-
-However, if you want to explicitly enable it, you can set a [`DiffusionPipeline`] to use [`~models.attention_processor.AttnProcessor2_0`]:
-
-```diff
- import torch
- from diffusers import DiffusionPipeline
-+ from diffusers.models.attention_processor import AttnProcessor2_0
-
- pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
-+ pipe.unet.set_attn_processor(AttnProcessor2_0())
-
- prompt = "a photo of an astronaut riding a horse on mars"
- image = pipe(prompt).images[0]
-```
-
-SDPA should be as fast and memory efficient as `xFormers`; check the [benchmark](#benchmark) for more details.
-
-In some cases - such as making the pipeline more deterministic or converting it to other formats - it may be helpful to use the vanilla attention processor, [`~models.attention_processor.AttnProcessor`]. To revert to [`~models.attention_processor.AttnProcessor`], call the [`~UNet2DConditionModel.set_default_attn_processor`] function on the pipeline:
-
-```diff
- import torch
- from diffusers import DiffusionPipeline
-
- pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
-+ pipe.unet.set_default_attn_processor()
-
- prompt = "a photo of an astronaut riding a horse on mars"
- image = pipe(prompt).images[0]
-```
-
-## torch.compile
-
-The `torch.compile` function can often provide an additional speed-up to your PyTorch code. In 🤗 Diffusers, it is usually best to wrap the UNet with `torch.compile` because it does most of the heavy lifting in the pipeline.
-
-```python
-from diffusers import DiffusionPipeline
-import torch
-
-pipe = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
-pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
-images = pipe(prompt, num_inference_steps=steps, num_images_per_prompt=batch_size).images[0]
-```
-
-Depending on GPU type, `torch.compile` can provide an *additional speed-up* of **5-300x** on top of SDPA! If you're using more recent GPU architectures such as Ampere (A100, 3090), Ada (4090), and Hopper (H100), `torch.compile` is able to squeeze even more performance out of these GPUs.
-
-Compilation requires some time to complete, so it is best suited for situations where you prepare your pipeline once and then perform the same type of inference operations multiple times. For example, calling the compiled pipeline on a different image size triggers compilation again which can be expensive.
-
-For more information and different options about `torch.compile`, refer to the [`torch_compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) tutorial.
-
-> [!TIP]
-> Learn more about other ways PyTorch 2.0 can help optimize your model in the [Accelerate inference of text-to-image diffusion models](../tutorials/fast_diffusion) tutorial.
-
-### Regional compilation
-
-Compiling the whole model usually has a big problem space for optimization. Models are often composed of multiple repeated blocks. [Regional compilation](https://pytorch.org/tutorials/recipes/regional_compilation.html) compiles the repeated block first (a transformer encoder block, for example), so that the Torch compiler would re-use its cached/optimized generated code for the other blocks, reducing (often massively) the cold start compilation time observed on the first inference call.
-
-Enabling regional compilation might require simple yet intrusive changes to the
-modeling code. However, 🤗 Accelerate provides a utility [`compile_regions()`](https://huggingface.co/docs/accelerate/main/en/usage_guides/compilation#how-to-use-regional-compilation) which automatically compiles
-the repeated blocks of the provided `nn.Module` sequentially, and the rest of the model separately. This helps with reducing cold start time while keeping most (if not all) of the speedup you would get from full compilation.
-
-```py
-# Make sure you're on the latest `accelerate`: `pip install -U accelerate`.
-from accelerate.utils import compile_regions
-
-pipe.unet = compile_regions(pipe.unet, mode="reduce-overhead", fullgraph=True)
-```
-
-As you may have noticed `compile_regions()` takes the same arguments as `torch.compile()`, allowing flexibility.
-
-## Benchmark
-
-We conducted a comprehensive benchmark with PyTorch 2.0's efficient attention implementation and `torch.compile` across different GPUs and batch sizes for five of our most used pipelines. The code is benchmarked on 🤗 Diffusers v0.17.0.dev0 to optimize `torch.compile` usage (see [here](https://github.com/huggingface/diffusers/pull/3313) for more details).
-
-Expand the dropdown below to find the code used to benchmark each pipeline:
-
-
-### Stable Diffusion text-to-image
-
-```python
-from diffusers import DiffusionPipeline
-import torch
-
-path = "stable-diffusion-v1-5/stable-diffusion-v1-5"
-
-run_compile = True # Set True / False
-
-pipe = DiffusionPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True)
-pipe = pipe.to("cuda")
-pipe.unet.to(memory_format=torch.channels_last)
-
-if run_compile:
-    print("Run torch compile")
-    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
-
-prompt = "ghibli style, a fantasy landscape with castles"
-
-for _ in range(3):
-    images = pipe(prompt=prompt).images
-```
-
-### Stable Diffusion image-to-image
-
-```python
-from diffusers import StableDiffusionImg2ImgPipeline
-from diffusers.utils import load_image
-import torch
-
-url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
-
-init_image = load_image(url)
-init_image = init_image.resize((512, 512))
-
-path = "stable-diffusion-v1-5/stable-diffusion-v1-5"
-
-run_compile = True # Set True / False
-
-pipe = StableDiffusionImg2ImgPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True)
-pipe = pipe.to("cuda")
-pipe.unet.to(memory_format=torch.channels_last)
-
-if run_compile:
-    print("Run torch compile")
-    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
-
-prompt = "ghibli style, a fantasy landscape with castles"
-
-for _ in range(3):
-    image = pipe(prompt=prompt, image=init_image).images[0]
-```
-
-### Stable Diffusion inpainting
-
-```python
-from diffusers import StableDiffusionInpaintPipeline
-from diffusers.utils import load_image
-import torch
-
-img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
-mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
-
-init_image = load_image(img_url).resize((512, 512))
-mask_image = load_image(mask_url).resize((512, 512))
-
-path = "runwayml/stable-diffusion-inpainting"
-
-run_compile = True # Set True / False
-
-pipe = StableDiffusionInpaintPipeline.from_pretrained(path, torch_dtype=torch.float16, use_safetensors=True)
-pipe = pipe.to("cuda")
-pipe.unet.to(memory_format=torch.channels_last)
-
-if run_compile:
-    print("Run torch compile")
-    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
-
-prompt = "ghibli style, a fantasy landscape with castles"
-
-for _ in range(3):
-    image = pipe(prompt=prompt, image=init_image, mask_image=mask_image).images[0]
-```
-
-### ControlNet
-
-```python
-from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
-from diffusers.utils import load_image
-import torch
-
-url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
-
-init_image = load_image(url)
-init_image = init_image.resize((512, 512))
-
-path = "stable-diffusion-v1-5/stable-diffusion-v1-5"
-
-run_compile = True # Set True / False
-controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16, use_safetensors=True)
-pipe = StableDiffusionControlNetPipeline.from_pretrained(
-    path, controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True
-)
-
-pipe = pipe.to("cuda")
-pipe.unet.to(memory_format=torch.channels_last)
-pipe.controlnet.to(memory_format=torch.channels_last)
-
-if run_compile:
-    print("Run torch compile")
-    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
-    pipe.controlnet = torch.compile(pipe.controlnet, mode="reduce-overhead", fullgraph=True)
-
-prompt = "ghibli style, a fantasy landscape with castles"
-
-for _ in range(3):
-    image = pipe(prompt=prompt, image=init_image).images[0]
-```
-
-### DeepFloyd IF text-to-image + upscaling
-
-```python
-from diffusers import DiffusionPipeline
-import torch
-
-run_compile = True # Set True / False
-
-pipe_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16, use_safetensors=True)
-pipe_1.to("cuda")
-pipe_2 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-II-M-v1.0", variant="fp16", text_encoder=None, torch_dtype=torch.float16, use_safetensors=True)
-pipe_2.to("cuda")
-pipe_3 = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16, use_safetensors=True)
-pipe_3.to("cuda")
-
-
-pipe_1.unet.to(memory_format=torch.channels_last)
-pipe_2.unet.to(memory_format=torch.channels_last)
-pipe_3.unet.to(memory_format=torch.channels_last)
-
-if run_compile:
-    pipe_1.unet = torch.compile(pipe_1.unet, mode="reduce-overhead", fullgraph=True)
-    pipe_2.unet = torch.compile(pipe_2.unet, mode="reduce-overhead", fullgraph=True)
-    pipe_3.unet = torch.compile(pipe_3.unet, mode="reduce-overhead", fullgraph=True)
-
-prompt = "the blue hulk"
-
-prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16)
-neg_prompt_embeds = torch.randn((1, 2, 4096), dtype=torch.float16)
-
-for _ in range(3):
-    image_1 = pipe_1(prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images
-    image_2 = pipe_2(image=image_1, prompt_embeds=prompt_embeds, negative_prompt_embeds=neg_prompt_embeds, output_type="pt").images
-    image_3 = pipe_3(prompt=prompt, image=image_1, noise_level=100).images
-```
-
-
-The graph below highlights the relative speed-ups for the [`StableDiffusionPipeline`] across five GPU families with PyTorch 2.0 and `torch.compile` enabled. The benchmarks for the following graphs are measured in *number of iterations/second*.
-
-![t2i_speedup](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/pt2_benchmarks/t2i_speedup.png)
-
-To give you an even better idea of how this speed-up holds for the other pipelines, consider the following
-graph for an A100 with PyTorch 2.0 and `torch.compile`:
-
-![a100_numbers](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/pt2_benchmarks/a100_numbers.png)
-
-In the following tables, we report our findings in terms of the *number of iterations/second*.
-
-### A100 (batch size: 1)
-
-| **Pipeline** | **torch 2.0 - <br> no compile** | **torch nightly - <br> no compile** | **torch 2.0 - <br> compile** | **torch nightly - <br> compile** |
-|:---:|:---:|:---:|:---:|:---:|
-| SD - txt2img | 21.66 | 23.13 | 44.03 | 49.74 |
-| SD - img2img | 21.81 | 22.40 | 43.92 | 46.32 |
-| SD - inpaint | 22.24 | 23.23 | 43.76 | 49.25 |
-| SD - controlnet | 15.02 | 15.82 | 32.13 | 36.08 |
-| IF | 20.21 / <br> 13.84 / <br> 24.00 | 20.12 / <br> 13.70 / <br> 24.03 | ❌ | 97.34 / <br> 27.23 / <br> 111.66 |
-| SDXL - txt2img | 8.64 | 9.9 | - | - |
-
-### A100 (batch size: 4)
-
-| **Pipeline** | **torch 2.0 - <br> no compile** | **torch nightly - <br> no compile** | **torch 2.0 - <br> compile** | **torch nightly - <br> compile** |
-|:---:|:---:|:---:|:---:|:---:|
-| SD - txt2img | 11.6 | 13.12 | 14.62 | 17.27 |
-| SD - img2img | 11.47 | 13.06 | 14.66 | 17.25 |
-| SD - inpaint | 11.67 | 13.31 | 14.88 | 17.48 |
-| SD - controlnet | 8.28 | 9.38 | 10.51 | 12.41 |
-| IF | 25.02 | 18.04 | ❌ | 48.47 |
-| SDXL - txt2img | 2.44 | 2.74 | - | - |
-
-### A100 (batch size: 16)
-
-| **Pipeline** | **torch 2.0 - <br> no compile** | **torch nightly - <br> no compile** | **torch 2.0 - <br> compile** | **torch nightly - <br> compile** |
-|:---:|:---:|:---:|:---:|:---:|
-| SD - txt2img | 3.04 | 3.6 | 3.83 | 4.68 |
-| SD - img2img | 2.98 | 3.58 | 3.83 | 4.67 |
-| SD - inpaint | 3.04 | 3.66 | 3.9 | 4.76 |
-| SD - controlnet | 2.15 | 2.58 | 2.74 | 3.35 |
-| IF | 8.78 | 9.82 | ❌ | 16.77 |
-| SDXL - txt2img | 0.64 | 0.72 | - | - |
-
-### V100 (batch size: 1)
-
-| **Pipeline** | **torch 2.0 - <br> no compile** | **torch nightly - <br> no compile** | **torch 2.0 - <br> compile** | **torch nightly - <br> compile** |
-|:---:|:---:|:---:|:---:|:---:|
-| SD - txt2img | 18.99 | 19.14 | 20.95 | 22.17 |
-| SD - img2img | 18.56 | 19.18 | 20.95 | 22.11 |
-| SD - inpaint | 19.14 | 19.06 | 21.08 | 22.20 |
-| SD - controlnet | 13.48 | 13.93 | 15.18 | 15.88 |
-| IF | 20.01 / <br> 9.08 / <br> 23.34 | 19.79 / <br> 8.98 / <br> 24.10 | ❌ | 55.75 / <br> 11.57 / <br> 57.67 |
-
-### V100 (batch size: 4)
-
-| **Pipeline** | **torch 2.0 - <br> no compile** | **torch nightly - <br> no compile** | **torch 2.0 - <br> compile** | **torch nightly - <br> compile** |
-|:---:|:---:|:---:|:---:|:---:|
-| SD - txt2img | 5.96 | 5.89 | 6.83 | 6.86 |
-| SD - img2img | 5.90 | 5.91 | 6.81 | 6.82 |
-| SD - inpaint | 5.99 | 6.03 | 6.93 | 6.95 |
-| SD - controlnet | 4.26 | 4.29 | 4.92 | 4.93 |
-| IF | 15.41 | 14.76 | ❌ | 22.95 |
-
-### V100 (batch size: 16)
-
-| **Pipeline** | **torch 2.0 - <br> no compile** | **torch nightly - <br> no compile** | **torch 2.0 - <br> compile** | **torch nightly - <br> compile** |
-|:---:|:---:|:---:|:---:|:---:|
-| SD - txt2img | 1.66 | 1.66 | 1.92 | 1.90 |
-| SD - img2img | 1.65 | 1.65 | 1.91 | 1.89 |
-| SD - inpaint | 1.69 | 1.69 | 1.95 | 1.93 |
-| SD - controlnet | 1.19 | 1.19 | OOM after warmup | 1.36 |
-| IF | 5.43 | 5.29 | ❌ | 7.06 |
-
-### T4 (batch size: 1)
-
-| **Pipeline** | **torch 2.0 - <br> no compile** | **torch nightly - <br> no compile** | **torch 2.0 - <br> compile** | **torch nightly - <br> compile** |
-|:---:|:---:|:---:|:---:|:---:|
-| SD - txt2img | 6.9 | 6.95 | 7.3 | 7.56 |
-| SD - img2img | 6.84 | 6.99 | 7.04 | 7.55 |
-| SD - inpaint | 6.91 | 6.7 | 7.01 | 7.37 |
-| SD - controlnet | 4.89 | 4.86 | 5.35 | 5.48 |
-| IF | 17.42 / <br> 2.47 / <br> 18.52 | 16.96 / <br> 2.45 / <br> 18.69 | ❌ | 24.63 / <br> 2.47 / <br> 23.39 |
-| SDXL - txt2img | 1.15 | 1.16 | - | - |
-
-### T4 (batch size: 4)
-
-| **Pipeline** | **torch 2.0 - <br> no compile** | **torch nightly - <br> no compile** | **torch 2.0 - <br> compile** | **torch nightly - <br> compile** |
-|:---:|:---:|:---:|:---:|:---:|
-| SD - txt2img | 1.79 | 1.79 | 2.03 | 1.99 |
-| SD - img2img | 1.77 | 1.77 | 2.05 | 2.04 |
-| SD - inpaint | 1.81 | 1.82 | 2.09 | 2.09 |
-| SD - controlnet | 1.34 | 1.27 | 1.47 | 1.46 |
-| IF | 5.79 | 5.61 | ❌ | 7.39 |
-| SDXL - txt2img | 0.288 | 0.289 | - | - |
-
-### T4 (batch size: 16)
-
-| **Pipeline** | **torch 2.0 - <br> no compile** | **torch nightly - <br> no compile** | **torch 2.0 - <br> compile** | **torch nightly - <br> compile** |
-|:---:|:---:|:---:|:---:|:---:|
-| SD - txt2img | 2.34s | 2.30s | OOM after 2nd iteration | 1.99s |
-| SD - img2img | 2.35s | 2.31s | OOM after warmup | 2.00s |
-| SD - inpaint | 2.30s | 2.26s | OOM after 2nd iteration | 1.95s |
-| SD - controlnet | OOM after 2nd iteration | OOM after 2nd iteration | OOM after warmup | OOM after warmup |
-| IF * | 1.44 | 1.44 | ❌ | 1.94 |
-| SDXL - txt2img | OOM | OOM | - | - |
-
-### RTX 3090 (batch size: 1)
-
-| **Pipeline** | **torch 2.0 - <br> no compile** | **torch nightly - <br> no compile** | **torch 2.0 - <br> compile** | **torch nightly - <br> compile** |
-|:---:|:---:|:---:|:---:|:---:|
-| SD - txt2img | 22.56 | 22.84 | 23.84 | 25.69 |
-| SD - img2img | 22.25 | 22.61 | 24.1 | 25.83 |
-| SD - inpaint | 22.22 | 22.54 | 24.26 | 26.02 |
-| SD - controlnet | 16.03 | 16.33 | 17.38 | 18.56 |
-| IF | 27.08 / <br> 9.07 / <br> 31.23 | 26.75 / <br> 8.92 / <br> 31.47 | ❌ | 68.08 / <br> 11.16 / <br> 65.29 |
-
-### RTX 3090 (batch size: 4)
-
-| **Pipeline** | **torch 2.0 - <br> no compile** | **torch nightly - <br> no compile** | **torch 2.0 - <br> compile** | **torch nightly - <br> compile** |
-|:---:|:---:|:---:|:---:|:---:|
-| SD - txt2img | 6.46 | 6.35 | 7.29 | 7.3 |
-| SD - img2img | 6.33 | 6.27 | 7.31 | 7.26 |
-| SD - inpaint | 6.47 | 6.4 | 7.44 | 7.39 |
-| SD - controlnet | 4.59 | 4.54 | 5.27 | 5.26 |
-| IF | 16.81 | 16.62 | ❌ | 21.57 |
-
-### RTX 3090 (batch size: 16)
-
-| **Pipeline** | **torch 2.0 - <br> no compile** | **torch nightly - <br> no compile** | **torch 2.0 - <br> compile** | **torch nightly - <br> compile** |
-|:---:|:---:|:---:|:---:|:---:|
-| SD - txt2img | 1.7 | 1.69 | 1.93 | 1.91 |
-| SD - img2img | 1.68 | 1.67 | 1.93 | 1.9 |
-| SD - inpaint | 1.72 | 1.71 | 1.97 | 1.94 |
-| SD - controlnet | 1.23 | 1.22 | 1.4 | 1.38 |
-| IF | 5.01 | 5.00 | ❌ | 6.33 |
-
-### RTX 4090 (batch size: 1)
-
-| **Pipeline** | **torch 2.0 - <br> no compile** | **torch nightly - <br> no compile** | **torch 2.0 - <br> compile** | **torch nightly - <br> compile** |
-|:---:|:---:|:---:|:---:|:---:|
-| SD - txt2img | 40.5 | 41.89 | 44.65 | 49.81 |
-| SD - img2img | 40.39 | 41.95 | 44.46 | 49.8 |
-| SD - inpaint | 40.51 | 41.88 | 44.58 | 49.72 |
-| SD - controlnet | 29.27 | 30.29 | 32.26 | 36.03 |
-| IF | 69.71 / <br> 18.78 / <br> 85.49 | 69.13 / <br> 18.80 / <br> 85.56 | ❌ | 124.60 / <br> 26.37 / <br> 138.79 |
-| SDXL - txt2img | 6.8 | 8.18 | - | - |
-
-### RTX 4090 (batch size: 4)
-
-| **Pipeline** | **torch 2.0 - <br> no compile** | **torch nightly - <br> no compile** | **torch 2.0 - <br> compile** | **torch nightly - <br> compile** |
-|:---:|:---:|:---:|:---:|:---:|
-| SD - txt2img | 12.62 | 12.84 | 15.32 | 15.59 |
-| SD - img2img | 12.61 | 12,.79 | 15.35 | 15.66 |
-| SD - inpaint | 12.65 | 12.81 | 15.3 | 15.58 |
-| SD - controlnet | 9.1 | 9.25 | 11.03 | 11.22 |
-| IF | 31.88 | 31.14 | ❌ | 43.92 |
-| SDXL - txt2img | 2.19 | 2.35 | - | - |
-
-### RTX 4090 (batch size: 16)
-
-| **Pipeline** | **torch 2.0 - <br> no compile** | **torch nightly - <br> no compile** | **torch 2.0 - <br> compile** | **torch nightly - <br> compile** |
-|:---:|:---:|:---:|:---:|:---:|
-| SD - txt2img | 3.17 | 3.2 | 3.84 | 3.85 |
-| SD - img2img | 3.16 | 3.2 | 3.84 | 3.85 |
-| SD - inpaint | 3.17 | 3.2 | 3.85 | 3.85 |
-| SD - controlnet | 2.23 | 2.3 | 2.7 | 2.75 |
-| IF | 9.26 | 9.2 | ❌ | 13.31 |
-| SDXL - txt2img | 0.52 | 0.53 | - | - |
-
-## Notes
-
-* Follow this [PR](https://github.com/huggingface/diffusers/pull/3313) for more details on the environment used for conducting the benchmarks.
-* For the DeepFloyd IF pipeline where batch sizes > 1, we only used a batch size of > 1 in the first IF pipeline for text-to-image generation and NOT for upscaling. That means the two upscaling pipelines received a batch size of 1.
-
-*Thanks to [Horace He](https://github.com/Chillee) from the PyTorch team for their support in improving our support of `torch.compile()` in Diffusers.*
diff --git a/docs/source/en/quantization/torchao.md b/docs/source/en/quantization/torchao.md
index 3ccca0282588..70d2cd13e8bc 100644
--- a/docs/source/en/quantization/torchao.md
+++ b/docs/source/en/quantization/torchao.md
@@ -56,7 +56,7 @@ image = pipe(
 image.save("output.png")
 ```
 
-TorchAO is fully compatible with [torch.compile](./optimization/torch2.0#torchcompile), setting it apart from other quantization methods. This makes it easy to speed up inference with just one line of code.
+TorchAO is fully compatible with [torch.compile](../optimization/fp16#torchcompile), setting it apart from other quantization methods. This makes it easy to speed up inference with just one line of code.
 
 ```python
 # In the above code, add the following after initializing the transformer
diff --git a/docs/source/en/stable_diffusion.md b/docs/source/en/stable_diffusion.md
index fc20d259f5f7..77610114ec85 100644
--- a/docs/source/en/stable_diffusion.md
+++ b/docs/source/en/stable_diffusion.md
@@ -256,6 +256,6 @@ make_image_grid(images, 2, 2)
 
 In this tutorial, you learned how to optimize a [`DiffusionPipeline`] for computational and memory efficiency as well as improving the quality of generated outputs. If you're interested in making your pipeline even faster, take a look at the following resources:
 
-- Learn how [PyTorch 2.0](./optimization/torch2.0) and [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html) can yield 5 - 300% faster inference speed. On an A100 GPU, inference can be up to 50% faster!
+- Learn how [PyTorch 2.0](./optimization/fp16) and [`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html) can yield 5 - 300% faster inference speed. On an A100 GPU, inference can be up to 50% faster!
 - If you can't use PyTorch 2, we recommend you install [xFormers](./optimization/xformers). Its memory-efficient attention mechanism works great with PyTorch 1.13.1 for faster speed and reduced memory consumption.
 - Other optimization techniques, such as model offloading, are covered in [this guide](./optimization/fp16).
diff --git a/docs/source/en/training/overview.md b/docs/source/en/training/overview.md
index 5396afc0b8fd..bcd855ccb5d9 100644
--- a/docs/source/en/training/overview.md
+++ b/docs/source/en/training/overview.md
@@ -59,5 +59,5 @@ pip install -r requirements_sdxl.txt
 
 To speedup training and reduce memory-usage, we recommend:
 
-- using PyTorch 2.0 or higher to automatically use [scaled dot product attention](../optimization/torch2.0#scaled-dot-product-attention) during training (you don't need to make any changes to the training code)
+- using PyTorch 2.0 or higher to automatically use [scaled dot product attention](../optimization/fp16#scaled-dot-product-attention) during training (you don't need to make any changes to the training code)
 - installing [xFormers](../optimization/xformers) to enable memory-efficient attention
\ No newline at end of file
diff --git a/docs/source/en/tutorials/using_peft_for_inference.md b/docs/source/en/tutorials/using_peft_for_inference.md
index f17113ecb830..7199361d5e5c 100644
--- a/docs/source/en/tutorials/using_peft_for_inference.md
+++ b/docs/source/en/tutorials/using_peft_for_inference.md
@@ -103,7 +103,7 @@ pipeline("A cute cnmt eating a slice of pizza, stunning color scheme, masterpiec
 
 ## torch.compile
 
-[torch.compile](../optimization/torch2.0#torchcompile) speeds up inference by compiling the PyTorch model to use optimized kernels. Before compiling, the LoRA weights need to be fused into the base model and unloaded first.
+[torch.compile](../optimization/fp16#torchcompile) speeds up inference by compiling the PyTorch model to use optimized kernels. Before compiling, the LoRA weights need to be fused into the base model and unloaded first.
 
 ```py
 import torch
diff --git a/docs/source/en/using-diffusers/conditional_image_generation.md b/docs/source/en/using-diffusers/conditional_image_generation.md
index b58b3b74b91a..0afbcaabe815 100644
--- a/docs/source/en/using-diffusers/conditional_image_generation.md
+++ b/docs/source/en/using-diffusers/conditional_image_generation.md
@@ -303,7 +303,7 @@ There are many types of conditioning inputs you can use, and 🤗 Diffusers supp
 
 Diffusion models are large, and the iterative nature of denoising an image is computationally expensive and intensive. But this doesn't mean you need access to powerful - or even many - GPUs to use them. There are many optimization techniques for running diffusion models on consumer and free-tier resources. For example, you can load model weights in half-precision to save GPU memory and increase speed or offload the entire model to the GPU to save even more memory.
 
-PyTorch 2.0 also supports a more memory-efficient attention mechanism called [*scaled dot product attention*](../optimization/torch2.0#scaled-dot-product-attention) that is automatically enabled if you're using PyTorch 2.0. You can combine this with [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) to speed your code up even more:
+PyTorch 2.0 also supports a more memory-efficient attention mechanism called [*scaled dot product attention*](../optimization/fp16#scaled-dot-product-attention) that is automatically enabled if you're using PyTorch 2.0. You can combine this with [`torch.compile`](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) to speed your code up even more:
 
 ```py
 from diffusers import AutoPipelineForText2Image
@@ -313,4 +313,4 @@ pipeline = AutoPipelineForText2Image.from_pretrained("stable-diffusion-v1-5/stab
 pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
 ```
 
-For more tips on how to optimize your code to save memory and speed up inference, read the [Memory and speed](../optimization/fp16) and [Torch 2.0](../optimization/torch2.0) guides.
+For more tips on how to optimize your code to save memory and speed up inference, read the [Accelerate inference](../optimization/fp16) and [Reduce memory usage](../optimization/memory) guides.
diff --git a/docs/source/en/using-diffusers/img2img.md b/docs/source/en/using-diffusers/img2img.md
index d9902081fde5..3175477f332c 100644
--- a/docs/source/en/using-diffusers/img2img.md
+++ b/docs/source/en/using-diffusers/img2img.md
@@ -35,7 +35,7 @@ pipeline.enable_xformers_memory_efficient_attention()
 
-You'll notice throughout the guide, we use [`~DiffusionPipeline.enable_model_cpu_offload`] and [`~DiffusionPipeline.enable_xformers_memory_efficient_attention`], to save memory and increase inference speed. If you're using PyTorch 2.0, then you don't need to call [`~DiffusionPipeline.enable_xformers_memory_efficient_attention`] on your pipeline because it'll already be using PyTorch 2.0's native [scaled-dot product attention](../optimization/torch2.0#scaled-dot-product-attention).
+You'll notice throughout the guide, we use [`~DiffusionPipeline.enable_model_cpu_offload`] and [`~DiffusionPipeline.enable_xformers_memory_efficient_attention`], to save memory and increase inference speed. If you're using PyTorch 2.0, then you don't need to call [`~DiffusionPipeline.enable_xformers_memory_efficient_attention`] on your pipeline because it'll already be using PyTorch 2.0's native [scaled-dot product attention](../optimization/fp16#scaled-dot-product-attention).
 
@@ -589,17 +589,17 @@ make_image_grid([init_image, depth_image, image_control_net, image_elden_ring],
 
 ## Optimize
 
-Running diffusion models is computationally expensive and intensive, but with a few optimization tricks, it is entirely possible to run them on consumer and free-tier GPUs. For example, you can use a more memory-efficient form of attention such as PyTorch 2.0's [scaled-dot product attention](../optimization/torch2.0#scaled-dot-product-attention) or [xFormers](../optimization/xformers) (you can use one or the other, but there's no need to use both). You can also offload the model to the GPU while the other pipeline components wait on the CPU.
+Running diffusion models is computationally expensive and intensive, but with a few optimization tricks, it is entirely possible to run them on consumer and free-tier GPUs. For example, you can use a more memory-efficient form of attention such as PyTorch 2.0's [scaled-dot product attention](../optimization/fp16#scaled-dot-product-attention) or [xFormers](../optimization/xformers) (you can use one or the other, but there's no need to use both). You can also offload the model to the GPU while the other pipeline components wait on the CPU.
 
 ```diff
 + pipeline.enable_model_cpu_offload()
 + pipeline.enable_xformers_memory_efficient_attention()
 ```
 
-With [`torch.compile`](../optimization/torch2.0#torchcompile), you can boost your inference speed even more by wrapping your UNet with it:
+With [`torch.compile`](../optimization/fp16#torchcompile), you can boost your inference speed even more by wrapping your UNet with it:
 
 ```py
 pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
 ```
 
-To learn more, take a look at the [Reduce memory usage](../optimization/memory) and [Torch 2.0](../optimization/torch2.0) guides.
+To learn more, take a look at the [Reduce memory usage](../optimization/memory) and [Accelerate inference](../optimization/fp16) guides.
diff --git a/docs/source/en/using-diffusers/inpaint.md b/docs/source/en/using-diffusers/inpaint.md
index 2b62eecfb4f4..e780cc3c4dba 100644
--- a/docs/source/en/using-diffusers/inpaint.md
+++ b/docs/source/en/using-diffusers/inpaint.md
@@ -35,7 +35,7 @@ pipeline.enable_xformers_memory_efficient_attention()
 
-You'll notice throughout the guide, we use [`~DiffusionPipeline.enable_model_cpu_offload`] and [`~DiffusionPipeline.enable_xformers_memory_efficient_attention`], to save memory and increase inference speed. If you're using PyTorch 2.0, it's not necessary to call [`~DiffusionPipeline.enable_xformers_memory_efficient_attention`] on your pipeline because it'll already be using PyTorch 2.0's native [scaled-dot product attention](../optimization/torch2.0#scaled-dot-product-attention).
+You'll notice throughout the guide, we use [`~DiffusionPipeline.enable_model_cpu_offload`] and [`~DiffusionPipeline.enable_xformers_memory_efficient_attention`], to save memory and increase inference speed. If you're using PyTorch 2.0, it's not necessary to call [`~DiffusionPipeline.enable_xformers_memory_efficient_attention`] on your pipeline because it'll already be using PyTorch 2.0's native [scaled-dot product attention](../optimization/fp16#scaled-dot-product-attention).
 
@@ -788,7 +788,7 @@ make_image_grid([init_image, mask_image, image, image_elden_ring], rows=2, cols=
 
 ## Optimize
 
-It can be difficult and slow to run diffusion models if you're resource constrained, but it doesn't have to be with a few optimization tricks. One of the biggest (and easiest) optimizations you can enable is switching to memory-efficient attention. If you're using PyTorch 2.0, [scaled-dot product attention](../optimization/torch2.0#scaled-dot-product-attention) is automatically enabled and you don't need to do anything else. For non-PyTorch 2.0 users, you can install and use [xFormers](../optimization/xformers)'s implementation of memory-efficient attention. Both options reduce memory usage and accelerate inference.
+It can be difficult and slow to run diffusion models if you're resource constrained, but it doesn't have to be with a few optimization tricks. One of the biggest (and easiest) optimizations you can enable is switching to memory-efficient attention. If you're using PyTorch 2.0, [scaled-dot product attention](../optimization/fp16#scaled-dot-product-attention) is automatically enabled and you don't need to do anything else. For non-PyTorch 2.0 users, you can install and use [xFormers](../optimization/xformers)'s implementation of memory-efficient attention. Both options reduce memory usage and accelerate inference.
 
 You can also offload the model to the CPU to save even more memory:
 
@@ -797,10 +797,10 @@ You can also offload the model to the CPU to save even more memory:
 + pipeline.enable_model_cpu_offload()
 ```
 
-To speed-up your inference code even more, use [`torch_compile`](../optimization/torch2.0#torchcompile). You should wrap `torch.compile` around the most intensive component in the pipeline which is typically the UNet:
+To speed-up your inference code even more, use [`torch_compile`](../optimization/fp16#torchcompile). You should wrap `torch.compile` around the most intensive component in the pipeline which is typically the UNet:
 
 ```py
 pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
 ```
 
-Learn more in the [Reduce memory usage](../optimization/memory) and [Torch 2.0](../optimization/torch2.0) guides.
+Learn more in the [Reduce memory usage](../optimization/memory) and [Accelerate inference](../optimization/fp16) guides.
diff --git a/docs/source/en/using-diffusers/marigold_usage.md b/docs/source/en/using-diffusers/marigold_usage.md
index b8e9a5838e8d..f66e47bada09 100644
--- a/docs/source/en/using-diffusers/marigold_usage.md
+++ b/docs/source/en/using-diffusers/marigold_usage.md
@@ -288,7 +288,7 @@ Speeding them up can be achieved by using a more efficient attention processor:
 depth = pipe(image, num_inference_steps=1)
 ```
 
-Finally, as suggested in [Optimizations](../optimization/torch2.0#torch.compile), enabling `torch.compile` can further enhance performance depending on
+Finally, as suggested in [Optimizations](../optimization/fp16#torchcompile), enabling `torch.compile` can further enhance performance depending on
 the target hardware. However, compilation incurs a significant overhead during the first pipeline invocation, making it
 beneficial only when the same pipeline instance is called repeatedly, such as within a loop.
diff --git a/docs/source/en/using-diffusers/svd.md b/docs/source/en/using-diffusers/svd.md
index 7852d81fa209..b7fe4df8f7b8 100644
--- a/docs/source/en/using-diffusers/svd.md
+++ b/docs/source/en/using-diffusers/svd.md
@@ -63,7 +63,7 @@ export_to_video(frames, "generated.mp4", fps=7)
 
 ## torch.compile
 
-You can gain a 20-25% speedup at the expense of slightly increased memory by [compiling](../optimization/torch2.0#torchcompile) the UNet.
+You can gain a 20-25% speedup at the expense of slightly increased memory by [compiling](../optimization/fp16#torchcompile) the UNet.
 
 ```diff
 - pipe.enable_model_cpu_offload()
diff --git a/docs/source/en/using-diffusers/text-img2vid.md b/docs/source/en/using-diffusers/text-img2vid.md
index 92e740bb579d..0098d61cbab4 100644
--- a/docs/source/en/using-diffusers/text-img2vid.md
+++ b/docs/source/en/using-diffusers/text-img2vid.md
@@ -547,7 +547,7 @@ Video generation requires a lot of memory because you're generating many video f
 + frames = pipeline(image, decode_chunk_size=2, generator=generator, num_frames=25).frames[0]
 ```
 
-If memory is not an issue and you want to optimize for speed, try wrapping the UNet with [`torch.compile`](../optimization/torch2.0#torchcompile).
+If memory is not an issue and you want to optimize for speed, try wrapping the UNet with [`torch.compile`](../optimization/fp16#torchcompile).
 
 ```diff
 - pipeline.enable_model_cpu_offload()