In [None]:
!pip install -qU diffusers transformers torch xformers tomesd tgate DeepCache accelerate xfuser

# Speed up inference

To optimize Diffusers for inference speed, we can
* reduce the computational burden by
  * lowering the data precision, or
  * using a lightweight distilled model.
* apply memory-efficient attention implementations, such as xFormers and scaled dot product attention in PyTorch 2.0.

For example, generating a single 512x512 image from the prompt "a photo of an astronaut riding a horse on mars" with 50 DDIM steps on an NVIDIA A100 gives the following inference times:

| setup | latency | speed-up |
| ----- | ------- | -------- |
| baseline | 5.27s | x1 |
| tf32 | 4.14s | x1.27 |
| fp16 | 3.51s | x1.50 |
| combined | 3.41s | x1.54 |

## TensorFloat-32

By default, PyTorch enables tf32 mode for convolutions but not matrix multiplications. It is recommended to enable tf32 for matrix multiplications to significantly speed up computations with typically negligible loss in numerical accuracy.

In [None]:
import torch

torch.backends.cuda.matmul.allow_tf32 = True
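To get a feel for why the accuracy loss is usually negligible, the sketch below emulates TF32 precision in plain Python. It assumes the commonly documented TF32 layout (8 exponent bits, 10 mantissa bits) and uses simple truncation, so real hardware rounding may differ slightly. `to_tf32` is a hypothetical helper for illustration, not a PyTorch function.

```python
import struct


def to_tf32(x: float) -> float:
    """Emulate TF32 by clearing the low 13 mantissa bits of a float32 (assumed layout)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # keep the sign, 8 exponent bits, and the top 10 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


values = [0.1, 3.14159, 1234.5678]
rel_errors = [abs(to_tf32(v) - v) / abs(v) for v in values]

for v, err in zip(values, rel_errors):
    print(f"{v:>10} -> {to_tf32(v):.6f} (relative error {err:.2e})")
```

Under these assumptions the relative error of each value stays below roughly one part in a thousand, which is why the loss is rarely visible in generated images.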

## Half-precision weights

To save GPU memory and gain speed, we can set `torch_dtype=torch.float16` to load and run the model directly with half-precision weights.

In [None]:
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    use_safetensors=True,
).to('cuda')

**Note**: Do NOT use `torch.autocast` in any of the pipelines as it can lead to black images and is always slower than pure float16 precision.
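As a rough illustration of the memory side of half precision, the sketch below estimates the size of the weights alone for a given parameter count. The parameter count is a hypothetical round number, not a measurement of any model above, and the estimate ignores activations and other runtime buffers.

```python
import struct

# Bytes per element, taken from the standard C sizes of float ("f") and half ("e")
BYTES_PER_DTYPE = {
    "float32": struct.calcsize("f"),
    "float16": struct.calcsize("e"),
}


def weight_memory_gib(n_params: int, dtype: str) -> float:
    """Return the memory needed to store n_params weights, in GiB."""
    return n_params * BYTES_PER_DTYPE[dtype] / 1024 ** 3


n_params = 1_000_000_000  # hypothetical parameter count
for dtype in BYTES_PER_DTYPE:
    print(f"{dtype}: {weight_memory_gib(n_params, dtype):.2f} GiB")
```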

## Distilled model

We could also use a distilled Stable Diffusion model and autoencoder to speed up inference.

For example, generating four 512x512 images from the prompt "a photo of an astronaut riding a horse on mars" with 25 PNDM steps on an NVIDIA A100 gives the following inference times:

| setup | latency | speed-up |
| ----- | ------- | -------- |
| baseline | 6.37s | x1 |
| distilled | 4.18s | x1.52 |
| distilled + tiny autoencoder | 3.83s | x1.66 |

In [None]:
from diffusers import StableDiffusionPipeline
import torch

distilled = StableDiffusionPipeline.from_pretrained(
    'nota-ai/bk-sdm-small',
    torch_dtype=torch.float16,
    use_safetensors=True,
).to('cuda')

In [None]:
prompt = "a golden vase with different flowers"
generator = torch.manual_seed(111)

image = distilled(
    prompt,
    num_inference_steps=25,
    generator=generator,
).images[0]
image

### Tiny AutoEncoder

In [None]:
from diffusers import AutoencoderTiny, StableDiffusionPipeline
import torch

distilled = StableDiffusionPipeline.from_pretrained(
    'nota-ai/bk-sdm-small',
    torch_dtype=torch.float16,
    use_safetensors=True,
).to('cuda')

distilled.vae = AutoencoderTiny.from_pretrained(
    'sayakpaul/taesd-diffusers',
    torch_dtype=torch.float16,
    use_safetensors=True,
).to('cuda')

In [None]:
prompt = "a golden vase with different flowers"
generator = torch.manual_seed(111)

image = distilled(
    prompt,
    num_inference_steps=25,
    generator=generator,
).images[0]
image

# Reduce memory usage

Optimizing for memory or speed often leads to improved performance in the other, so we should try to optimize for both whenever we can.

For example, generating a single 512x512 image from the prompt "a photo of an astronaut riding a horse on mars" with 50 DDIM steps on an NVIDIA Titan RTX gives the following inference times:

| setup | latency | speed-up |
| ----- | ------- | -------- |
| original | 9.50s | x1 |
| fp16 | 3.61s | x2.63 |
| channels last | 3.30s | x2.88 |
| traced UNet | 3.21s | x2.96 |
| memory-efficient attention | 2.63s | x3.61 |
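The speed-up column is simply the baseline latency divided by each setup's latency. A small sketch that recomputes it from the latencies in the table:

```python
# Latencies in seconds, copied from the table above
latencies = {
    "original": 9.50,
    "fp16": 3.61,
    "channels last": 3.30,
    "traced UNet": 3.21,
    "memory-efficient attention": 2.63,
}

baseline = latencies["original"]
speedups = {name: round(baseline / seconds, 2) for name, seconds in latencies.items()}

for name, factor in speedups.items():
    print(f"{name}: x{factor}")
```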

## Sliced VAE

**Sliced VAE** enables decoding large batches of images with limited VRAM or batches with 32 images or more by decoding the batches of latents one image at a time. We will likely want to couple this with `enable_xformers_memory_efficient_attention()` to reduce memory use further if we have xFormers installed.

To use sliced VAE, we need to call `enable_vae_slicing()` on our pipeline before inference:

In [None]:
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    use_safetensors=True,
).to('cuda')

# add here
pipe.enable_vae_slicing()
# if xFormers installed,
pipe.enable_xformers_memory_efficient_attention()

In [None]:
prompt = "a photo of an astronaut riding a horse on mars"

images = pipe([prompt] * 32).images  # list of 32 PIL images
images[0]

We may see a small performance boost in VAE decoding on multi-image batches, and there should be no performance impact on single-image batches.
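The idea behind slicing can be sketched without any model: decode the batch in small chunks so that only one chunk is in flight at a time. Here `decode_batch` is a stand-in for a real VAE decoder and `decode_sliced` is a hypothetical helper, so this only illustrates the control flow, not the Diffusers implementation.

```python
def decode_batch(latents):
    """Stand-in decoder: 'decodes' each latent by doubling it."""
    return [2 * z for z in latents]


def decode_sliced(latents, slice_size=1):
    """Decode latents slice by slice and track the largest slice seen."""
    images = []
    peak_slice = 0
    for start in range(0, len(latents), slice_size):
        chunk = latents[start:start + slice_size]
        peak_slice = max(peak_slice, len(chunk))  # proxy for peak decoder memory
        images.extend(decode_batch(chunk))
    return images, peak_slice


latents = list(range(32))
images, peak_slice = decode_sliced(latents)
print(len(images), peak_slice)
```

The output matches decoding the whole batch at once, but the decoder only ever sees one latent at a time.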

## Tiled VAE

**Tiled VAE** also enables working with large images on limited VRAM (for example, generating 4k images on 8GB of VRAM) by splitting the image into overlapping tiles, decoding the tiles, and then blending the outputs together to compose the final image. We should also use tiled VAE with `enable_xformers_memory_efficient_attention()` to reduce memory use further if we have xFormers installed.

To use tiled VAE, we need to call `enable_vae_tiling()` on our pipeline before inference:

In [None]:
from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to('cuda')

# add here
pipe.enable_vae_tiling()
# if xFormers installed
pipe.enable_xformers_memory_efficient_attention()

In [None]:
prompt = "a beautiful landscape photograph"

image = pipe(
    prompt,
    width=3840,
    height=2224,
    num_inference_steps=20
).images[0]
image

The output image has some tile-to-tile tone variation because the tiles are decoded separately, but we should not see any sharp and obvious seams between the tiles.

Tiling is turned off for images that are 512x512 or smaller.
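The split-and-blend idea can be sketched in one dimension. The helpers below (`tile_starts`, `blend_1d`) and the tile and overlap sizes are assumptions chosen for illustration; the actual tile sizes and blending used by Diffusers may differ.

```python
def tile_starts(length, tile, overlap):
    """Start offsets of overlapping tiles that cover [0, length)."""
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the edge
    return starts


def blend_1d(left, right, overlap):
    """Cross-fade the last `overlap` samples of left into the first of right."""
    blended = [
        left[len(left) - overlap + i] * (1 - i / overlap) + right[i] * (i / overlap)
        for i in range(overlap)
    ]
    return left[:-overlap] + blended + right[overlap:]


starts = tile_starts(3840, tile=512, overlap=128)
row = blend_1d([1.0] * 4, [3.0] * 4, overlap=2)
print(starts)
print(row)
```

The linear ramp in the overlap region is what smooths tone differences between neighbouring tiles instead of leaving a hard seam.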

## CPU offloading

Offloading the weights to the CPU and only loading them on the GPU when performing the forward pass can also save memory.

To perform CPU offloading, we can call `enable_sequential_cpu_offload()`:

In [None]:
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    use_safetensors=True
)
# do NOT move pipeline to CUDA

# add here
pipe.enable_sequential_cpu_offload()

In [None]:
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image

When using `enable_sequential_cpu_offload()`, do NOT move the pipeline to CUDA beforehand, or else the memory savings will be minimal.

CPU offloading works on submodules rather than whole models. This is the best way to minimize memory consumption, but inference is much slower due to the iterative nature of the diffusion process. The UNet component of the pipeline runs several times (as many as `num_inference_steps`); each time, the different UNet submodules are sequentially onloaded and offloaded as needed, resulting in a large number of memory transfers.

Consider using model offloading if we want to optimize for speed because it is much faster. The tradeoff is our memory savings will not be as large.

## Model offloading

As we saw in the previous section, sequential CPU offloading saves a lot of memory, but it makes inference slower because submodules are moved to the GPU as needed and are immediately returned to the CPU when a new module runs.

Full-model offloading moves whole models between the CPU and GPU, rather than handling each model's constituent submodules. There is a negligible impact on inference time (compared with moving the pipeline to CUDA), and it still provides some memory savings.

During model offloading, only one of the main components of the pipeline (typically the text encoder, UNet, and VAE) is placed on the GPU while the others wait on the CPU. Components like the UNet that run for multiple iterations stay on the GPU until they are no longer needed.

In [None]:
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    use_safetensors=True
)

# add here
pipe.enable_model_cpu_offload()

In [None]:
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image
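To see why the two offloading strategies differ so much in speed, we can count host-to-GPU transfers in an idealized model. The submodule counts below are hypothetical placeholders, and the formulas ignore transfer sizes, so this is only a back-of-the-envelope sketch.

```python
def sequential_offload_transfers(num_steps, unet_submodules, other_submodules):
    """Every UNet submodule is loaded again on each denoising step."""
    return num_steps * unet_submodules + other_submodules


def model_offload_transfers(num_components):
    """Each whole component (text encoder, UNet, VAE) is loaded once."""
    return num_components


# Hypothetical counts, chosen only for illustration
sequential = sequential_offload_transfers(num_steps=50, unet_submodules=100, other_submodules=20)
whole_model = model_offload_transfers(num_components=3)
print(sequential, whole_model)
```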

## Channels-last memory format

The **channels-last memory format** is used to order NCHW tensors in memory to preserve dimension ordering.

Channels-last tensors are ordered in such a way that the channels become the densest dimension (storing images pixel-per-pixel). Since not all operators currently support the channels-last format, it may result in worse performance, but we should still try it and see if it works for the model we choose.

In [None]:
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    use_safetensors=True
).to('cuda')

In [None]:
prompt = "a photo of an astronaut riding a horse on mars"
generator = torch.manual_seed(111)
image = pipe(
    prompt,
    generator=generator
).images[0]
image

For example, to set the pipeline's UNet to use the channels-last format:

In [None]:
print(pipe.unet.conv_out.state_dict()['weight'].stride())
pipe.unet.to(memory_format=torch.channels_last) # in-place operation
print(pipe.unet.conv_out.state_dict()['weight'].stride())
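The two strides printed above can be derived by hand. The sketch below computes them for a 4D tensor in both layouts; the shape `(4, 320, 3, 3)` is an assumption about the `conv_out` weight (4 output channels, 320 input channels, 3x3 kernel), and both helper functions are hypothetical.

```python
def contiguous_strides(shape):
    """Strides (in elements) of a default NCHW-contiguous tensor."""
    n, c, h, w = shape
    return (c * h * w, h * w, w, 1)


def channels_last_strides(shape):
    """Strides (in elements) when channels are the densest dimension."""
    n, c, h, w = shape
    return (h * w * c, 1, w * c, c)


shape = (4, 320, 3, 3)  # assumed shape of the conv_out weight
print(contiguous_strides(shape))
print(channels_last_strides(shape))
```

A stride of 1 in the second (channel) position indicates that the conversion took effect.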

In [None]:
generator = torch.manual_seed(111)  # re-seed so the result is comparable with the first run
image = pipe(
    prompt,
    generator=generator
).images[0]
image

## Tracing

**Tracing** runs an example input tensor through the model and captures the operations that are performed on it as that input makes its way through the model's layers.
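As a toy illustration of the idea (not how `torch.jit.trace` is implemented), the sketch below feeds placeholder inputs through a function and records each operation on a tape. The `Traced` class and `trace` function are hypothetical names introduced for this example.

```python
class Traced:
    """Placeholder value that records every operation applied to it."""

    def __init__(self, name, tape):
        self.name = name
        self.tape = tape

    def _record(self, op, other):
        out = Traced(f"v{len(self.tape)}", self.tape)
        other_name = other.name if isinstance(other, Traced) else repr(other)
        self.tape.append((out.name, op, self.name, other_name))
        return out

    def __add__(self, other):
        return self._record("add", other)

    def __mul__(self, other):
        return self._record("mul", other)


def trace(fn, n_inputs):
    """Run fn once on placeholder inputs and return the recorded operations."""
    tape = []
    inputs = [Traced(f"x{i}", tape) for i in range(n_inputs)]
    fn(*inputs)
    return tape


def model(x, scale):
    return x * scale + 1


tape = trace(model, 2)
print(tape)
```

Because only the operations executed for the example input are recorded, any Python control flow that depends on the input values is frozen into the trace.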

To trace a UNet:

In [None]:
import time
import torch
from diffusers import StableDiffusionPipeline
import functools

# torch disable grad
torch.set_grad_enabled(False)

# set variables
n_experiments = 2
unet_runs_per_experiment = 50

# load inputs
def generate_inputs():
    sample = torch.randn((2, 4, 64, 64), device='cuda', dtype=torch.float16)
    timestep = torch.rand(1, device='cuda', dtype=torch.float16) * 999
    encoder_hidden_states = torch.randn((2, 77, 768), device='cuda', dtype=torch.float16)

    return sample, timestep, encoder_hidden_states


pipe = StableDiffusionPipeline.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    use_safetensors=True
).to('cuda')

unet = pipe.unet
unet.eval()
unet.to(memory_format=torch.channels_last) # use channels_last memory format
unet.forward = functools.partial(unet.forward, return_dict=False) # set return_dict=False as default

In [None]:
# warmup
for _ in range(3):
    with torch.inference_mode():
        inputs = generate_inputs()
        orig_output = unet(*inputs)

In [None]:
# trace
print('tracing...')
unet_traced = torch.jit.trace(unet, inputs)
unet_traced.eval()
print('done tracing')

In [None]:
# warmup and optimize graph
for _ in range(5):
    with torch.inference_mode():
        inputs = generate_inputs()
        orig_output = unet_traced(*inputs)

In [None]:
# benchmarking
with torch.inference_mode():
    for _ in range(n_experiments):
        torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(unet_runs_per_experiment):
            orig_output = unet_traced(*inputs)
        torch.cuda.synchronize()
        print(f"unet traced inference took {time.time() - start_time:.2f} seconds")

    for _ in range(n_experiments):
        torch.cuda.synchronize()
        start_time = time.time()
        for _ in range(unet_runs_per_experiment):
            orig_output = unet(*inputs)
        torch.cuda.synchronize()
        print(f"unet inference took {time.time() - start_time:.2f} seconds")

# save the model
unet_traced.save('unet_traced.pt')

Replace the `unet` attribute of the pipeline with the traced model:

In [None]:
from diffusers import StableDiffusionPipeline
import torch
from dataclasses import dataclass

@dataclass
class UNet2DConditionOutput:
    sample: torch.Tensor


pipe = StableDiffusionPipeline.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    use_safetensors=True
).to('cuda')

# use jitted unet
unet_traced = torch.jit.load('unet_traced.pt')

# del pipe.unet
class TracedUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.in_channels = pipe.unet.config.in_channels
        self.device = pipe.unet.device

    def forward(self, latent_model_input, t, encoder_hidden_states):
        # apply unet_traced here
        sample = unet_traced(latent_model_input, t, encoder_hidden_states)[0]
        return UNet2DConditionOutput(sample=sample)


pipe.unet = TracedUNet()

In [None]:
prompt = "a photo of an astronaut riding a horse on mars"

with torch.inference_mode():
    image = pipe(
        prompt,
        num_inference_steps=50
    ).images[0]
image

## Memory-efficient attention

To use **Flash Attention**, we need the following:
* PyTorch > 1.12
* CUDA available
* xFormers

and then we can call `enable_xformers_memory_efficient_attention()`:

In [None]:
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
).to("cuda")

pipe.enable_xformers_memory_efficient_attention()

In [None]:
prompt = "a photo of an astronaut riding a horse on mars"
generator = torch.manual_seed(111)

with torch.inference_mode():
    image = pipe(
        prompt,
        generator=generator
    ).images[0]
image

# PyTorch 2.0

HuggingFace Diffusers supports the latest optimizations from PyTorch 2.0:
1. A memory-efficient attention implementation, scaled dot product attention, without requiring any extra dependencies such as xFormers.
2. `torch.compile`, a just-in-time (JIT) compiler to provide an extra performance boost when individual models are compiled.

## Scaled dot product attention

`torch.nn.functional.scaled_dot_product_attention` (SDPA) is an optimized and memory-efficient attention (similar to xFormers) that automatically enables several other optimizations depending on the model inputs and GPU type.

SDPA is enabled by default if we use PyTorch 2.0 and the latest version of Diffusers.
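For reference, the computation that SDPA performs is softmax(QK^T / sqrt(d)) V. The plain-Python sketch below spells this out for small lists; it is a naive reference version that materializes the full weight matrix, which is exactly what the fused PyTorch kernels try to avoid.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """Naive attention: q, k, v are lists of equal-length vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))])
    return out


q = [[1.0, 0.0]]
k = [[1.0, 0.0], [1.0, 0.0]]  # identical keys, so both values get equal weight
v = [[2.0, 0.0], [4.0, 8.0]]
out = scaled_dot_product_attention(q, k, v)
print(out)
```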

If we want to explicitly enable it, we can set a `DiffusionPipeline` to use `AttnProcessor2_0`:

In [None]:
import torch
from diffusers import DiffusionPipeline
# import AttnProcessor2_0
from diffusers.models.attention_processor import AttnProcessor2_0

pipe = DiffusionPipeline.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    use_safetensors=True,
).to('cuda')
# enable AttnProcessor2_0 in unet
pipe.unet.set_attn_processor(AttnProcessor2_0())

In [None]:
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image

In some cases - such as making the pipeline more deterministic or converting it to other formats - it may be helpful to use the vanilla attention processor, `AttnProcessor`. To revert to `AttnProcessor`, we need to call the `set_default_attn_processor()` function:

In [None]:
pipe.unet.set_default_attn_processor()

image = pipe(prompt).images[0]
image

## `torch.compile`

It is usually best to wrap the UNet with `torch.compile` because it does most of the heavy lifting in the pipeline.

In [None]:
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    use_safetensors=True,
).to('cuda')

pipe.unet = torch.compile(
    pipe.unet,
    mode='reduce-overhead',
    fullgraph=True
)

In [None]:
prompt = "a photo of an astronaut riding a horse on mars"
images = pipe(
    prompt,
    num_inference_steps=50,
    num_images_per_prompt=4,
).images
images[0]

Compilation requires some time to complete, so it is best suited for situations where we prepare our pipeline once and then perform the same type of inference operations multiple times. For example, calling the compiled pipeline on a different image size triggers compilation again which can be expensive.
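The cost pattern can be pictured as a cache keyed by input shape: the first call with a new shape pays for compilation, while repeated calls with a known shape do not. The class below is a toy model under that assumption, with hypothetical names; the real `torch.compile` caching and recompilation rules are more nuanced.

```python
class ShapeSpecializedFn:
    """Toy stand-in for a compiled function that specializes on input shape."""

    def __init__(self, fn):
        self.fn = fn
        self.compiled = {}
        self.compile_count = 0

    def __call__(self, height, width):
        key = (height, width)
        if key not in self.compiled:
            self.compile_count += 1  # stands in for the expensive compile step
            self.compiled[key] = self.fn
        return self.compiled[key](height, width)


generate = ShapeSpecializedFn(lambda h, w: f"{w}x{h} image")
for size in [(512, 512), (512, 512), (768, 768), (512, 512)]:
    generate(*size)

print(generate.compile_count)
```

Four calls trigger only two compilations here, one per distinct image size, which is why keeping input shapes fixed pays off.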

# xFormers

It is recommended to use xFormers for both inference and training.

After xFormers is installed (`pip install xformers`), we can use `enable_xformers_memory_efficient_attention()` for faster inference and reduced memory consumption.

# Token Merging

**Token Merging (ToMe)** progressively merges redundant tokens/patches in the forward pass of a Transformer-based network, which can reduce the inference latency of `StableDiffusionPipeline`.

To install ToMe, `pip install tomesd`.

We can use ToMe from the `tomesd` library with the `apply_patch` function:

In [None]:
from diffusers import StableDiffusionPipeline
import torch
import tomesd

pipe = StableDiffusionPipeline.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    use_safetensors=True,
).to('cuda')

# apply patch
tomesd.apply_patch(pipe, ratio=0.5)

In [None]:
image = pipe("a photo of an astronaut riding a horse on mars").images[0]
image

The `apply_patch` function exposes a number of arguments to help strike a balance between pipeline inference speed and the quality of the generated tokens. The most important argument is `ratio` which controls the number of tokens that are merged during the forward pass.

ToMe can largely preserve the quality of the generated images while boosting inference speed. By increasing the `ratio`, we can speed up inference even further, but at the cost of some degraded image quality.
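To make the role of `ratio` concrete, here is a heavily simplified merging sketch on plain Python lists. It splits the tokens into two alternating sets, matches each token in the first set to its most similar token in the second by cosine similarity, and averages the closest pairs. The `merge_tokens` helper is hypothetical and only loosely follows the matching idea; it is not the `tomesd` implementation.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def merge_tokens(tokens, ratio):
    """Merge roughly ratio * len(tokens) tokens into their nearest partners."""
    src = tokens[0::2]
    dst = [list(t) for t in tokens[1::2]]
    r = min(int(len(tokens) * ratio), len(src))  # at most half can be merged

    # Best partner in dst for every src token, most similar pairs first
    matches = []
    for i, s in enumerate(src):
        j = max(range(len(dst)), key=lambda idx: cosine(s, dst[idx]))
        matches.append((cosine(s, dst[j]), i, j))
    matches.sort(reverse=True)

    counts = [1] * len(dst)
    merged = set()
    for _, i, j in matches[:r]:
        # Running average of all tokens merged into dst[j]
        dst[j] = [(d * counts[j] + s) / (counts[j] + 1) for d, s in zip(dst[j], src[i])]
        counts[j] += 1
        merged.add(i)

    kept = [list(s) for i, s in enumerate(src) if i not in merged]
    return kept + dst


tokens = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]]
reduced = merge_tokens(tokens, ratio=0.5)
print(len(tokens), "->", len(reduced))
```

A higher `ratio` leaves fewer tokens for the attention layers to process, which is where the speed-up comes from, while averaging dissimilar tokens is what eventually costs image quality.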

# DeepCache

**DeepCache** accelerates `StableDiffusionPipeline` and `StableDiffusionXLPipeline` by strategically caching and reusing high-level features while efficiently updating low-level features by taking advantage of the UNet architecture.

To install DeepCache, `pip install DeepCache`.

Then we can load and enable the `DeepCacheSDHelper`:

In [None]:
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    use_safetensors=True,
).to('cuda')

In [None]:
from DeepCache import DeepCacheSDHelper

helper = DeepCacheSDHelper(pipe=pipe)
helper.set_params(
    cache_interval=3,
    cache_branch_id=0,
)
helper.enable()

In [None]:
image = pipe("a photo of an astronaut on a moon").images[0]
image

The `set_params` method accepts
* `cache_interval`, the frequency of feature caching, specified as the number of steps between each cache operation.
* `cache_branch_id`, identifying which branch of the network (ordered from the shallowest to the deepest layer) is responsible for executing the caching process.

A lower `cache_branch_id` or a larger `cache_interval` can lead to faster inference speed at the expense of reduced image quality.
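The effect of `cache_interval` can be sketched as a simple schedule: one full UNet pass refreshes the cache, and the following steps reuse it. This assumes a uniform schedule where a full pass happens every `cache_interval` steps; `deepcache_schedule` is a hypothetical helper, not part of the DeepCache API.

```python
def deepcache_schedule(num_inference_steps, cache_interval):
    """Label each step as a full UNet pass or a pass that reuses cached features."""
    return [
        "full" if step % cache_interval == 0 else "cached"
        for step in range(num_inference_steps)
    ]


schedule = deepcache_schedule(num_inference_steps=50, cache_interval=3)
full_steps = schedule.count("full")
cached_steps = schedule.count("cached")
print(full_steps, cached_steps)
```

With these settings only about a third of the steps run the full network, which illustrates why a larger interval is faster but relies on staler features.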

# T-GATE

**T-GATE** accelerates inference for Stable Diffusion, PixArt, and Latent Consistency Model pipelines by *skipping the cross-attention calculation* once it converges. This method does not require any additional training and it can speed up inference by 10-50%. T-GATE is also compatible with other optimization methods mentioned above.

To install T-GATE, `pip install tgate`.

To use T-GATE with a pipeline, we need to use its corresponding loader:

| Pipeline | T-GATE Loader |
| -------- | ------------- |
| PixArt | TgatePixArtLoader |
| Stable Diffusion XL | TgateSDXLLoader |
| Stable Diffusion XL + DeepCache | TgateSDXLDeepCacheLoader |
| Stable Diffusion | TgateSDLoader |
| Stable Diffusion + DeepCache | TgateSDDeepCacheLoader |

Next, we can create a `TgateLoader` with a pipeline, the gate step (the time step to stop calculating the cross attention), and the number of inference steps. Then we call the `tgate` method on the pipeline with a prompt, gate step, and the number of inference steps.
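The role of the gate step can be sketched as a per-step plan: cross-attention is computed up to the gate step, and afterwards its cached output is reused. `cross_attention_plan` is a hypothetical helper that only illustrates the schedule, not the `tgate` internals.

```python
def cross_attention_plan(num_inference_steps, gate_step):
    """Label each step by whether cross-attention is computed or reused."""
    return [
        "compute" if step < gate_step else "reuse"
        for step in range(num_inference_steps)
    ]


plan = cross_attention_plan(num_inference_steps=25, gate_step=10)
print(plan.count("compute"), plan.count("reuse"))
```

A smaller gate step skips more cross-attention work, at the risk of freezing the text conditioning before it has converged.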

##### PixArt

In [None]:
import torch
from diffusers import PixArtAlphaPipeline
from tgate import TgatePixArtLoader

pipe = PixArtAlphaPipeline.from_pretrained(
    'PixArt-alpha/PixArt-XL-2-1024-MS',
    torch_dtype=torch.float16,
)

gate_step = 8
inference_step = 25
pipe = TgatePixArtLoader(
    pipe,
    gate_step=gate_step,
    num_inference_steps=inference_step
).to('cuda')

In [None]:
image = pipe.tgate(
    "An alpaca made of colorful building blocks, cyberpunk.",
    gate_step=gate_step,
    num_inference_steps=inference_step,
).images[0]
image

##### SDXL

In [None]:
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
from tgate import TgateSDXLLoader

pipe = StableDiffusionXLPipeline.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    torch_dtype=torch.float16,
    variant='fp16',
    use_safetensors=True,
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

gate_step = 10
inference_step = 25
pipe = TgateSDXLLoader(
    pipe,
    gate_step=gate_step,
    num_inference_steps=inference_step
).to('cuda')

In [None]:
image = pipe.tgate(
    "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.",
    gate_step=gate_step,
    num_inference_steps=inference_step,
).images[0]
image

##### SDXL + DeepCache

In [None]:
import torch
from diffusers import StableDiffusionXLPipeline, DPMSolverMultistepScheduler
from tgate import TgateSDXLDeepCacheLoader

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

gate_step = 10
inference_step = 25
pipe = TgateSDXLDeepCacheLoader(
    pipe,
    cache_interval=3,
    cache_branch_id=0,
).to("cuda")

In [None]:
image = pipe.tgate(
    "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.",
    gate_step=gate_step,
    num_inference_steps=inference_step,
).images[0]
image

##### Latent Consistency Model

In [None]:
import torch
from diffusers import (
    StableDiffusionXLPipeline, UNet2DConditionModel, LCMScheduler
)
from tgate import TgateSDXLLoader

unet = UNet2DConditionModel.from_pretrained(
    'latent-consistency/lcm-sdxl',
    torch_dtype=torch.float16,
    variant='fp16',
)

pipe = StableDiffusionXLPipeline.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    unet=unet,
    torch_dtype=torch.float16,
    variant='fp16',
)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)

gate_step = 1
inference_step = 4
pipe = TgateSDXLLoader(
    pipe,
    gate_step=gate_step,
    num_inference_steps=inference_step,
    lcm=True
).to('cuda')

In [None]:
image = pipe.tgate(
    "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k.",
    gate_step=gate_step,
    num_inference_steps=inference_step,
).images[0]
image

# xDiT

**xDiT** is an inference engine designed for the large-scale parallel deployment of Diffusion Transformers (DiTs). xDiT provides a suite of efficient parallel approaches for diffusion models, as well as GPU kernel accelerations.

Parallel methods in xDiT:
* Unified Sequence Parallelism
* PipeFusion
* CFG Parallelism
* Data Parallelism

To install xDiT, `pip install xfuser`.

Example of using xDiT to accelerate inference of a Diffusers model:

In [None]:
import torch
from diffusers import StableDiffusion3Pipeline

from xfuser import xFuserArgs, xDiTParallel
from xfuser.config import FlexibleArgumentParser
from xfuser.core.distributed import get_world_group


def main():
    parser = FlexibleArgumentParser(description='xFuser Arguments')
    args = xFuserArgs.add_cli_args(parser).parse_args()
    engine_args = xFuserArgs.from_cli_args(args)
    engine_config, input_config = engine_args.create_config()

    local_rank = get_world_group().local_rank

    pipe = StableDiffusion3Pipeline.from_pretrained(
        pretrained_model_name_or_path=engine_config.model_config.model,
        torch_dtype=torch.float16,
    ).to(f"cuda:{local_rank}")

    pipe = xDiTParallel(
        pipe,
        engine_config,
        input_config
    )

    image = pipe(
        height=input_config.height,
        width=input_config.width,
        prompt=input_config.prompt,
        num_inference_steps=input_config.num_inference_steps,
        output_type=input_config.output_type,
        generator=torch.Generator(device="cuda").manual_seed(input_config.seed),
    )

    if input_config.output_type == 'pil':
        pipe.save('results', 'stable_diffusion_3')

if __name__ == '__main__':
    main()