In [None]:
!pip install -qU diffusers accelerate transformers huggingface_hub

In [None]:
from huggingface_hub import notebook_login
notebook_login()

# Distributed inference

On distributed setups, we can run inference across multiple GPUs with HuggingFace Accelerate or PyTorch Distributed, which is useful for generating images from multiple prompts in parallel.

## HuggingFace Accelerate

Accelerate is a library designed to make it easy to train or run inference across distributed setups.

To begin, we need to create a Python file and initialize an `accelerate.PartialState` to create a distributed environment; our setup is automatically detected so we do not need to explicitly define the `rank` or `world_size`. Move the `DiffusionPipeline` to `distributed_state.device` to assign a GPU to each process.

Now we can use the `split_between_processes` utility as a context manager to automatically distribute the prompts between the number of processes.

In [None]:
# the following snippet is saved as `run_distributed.py`

import torch
from accelerate import PartialState
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    use_safetensors=True,
)

distributed_state = PartialState()
pipeline.to(distributed_state.device)

with distributed_state.split_between_processes(['a dog', 'a cat']) as prompt:
    result = pipeline(prompt).images[0]
    result.save(f"result_{distributed_state.process_index}.png")
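
To make the splitting step concrete, here is a minimal pure-Python sketch of how a list of prompts could be divided between processes. The helper `split_for_process` is hypothetical and only approximates what `split_between_processes` does: it assumes items are divided as evenly as possible, with any remainder going to the lowest-indexed processes. Each process receives a list slice, which is why `prompt` in the script above is passed to the pipeline as a list.

```python
def split_for_process(items, num_processes, process_index):
    """Return the slice of `items` that one process would handle.

    Hypothetical helper (not part of Accelerate): items are divided as evenly
    as possible and any remainder goes to the lowest-indexed processes.
    """
    per_process, extras = divmod(len(items), num_processes)
    start = process_index * per_process + min(process_index, extras)
    end = start + per_process + (1 if process_index < extras else 0)
    return items[start:end]


prompts = ["a dog", "a cat", "a bird", "a fish", "a horse"]
for index in range(2):
    # Each simulated process prints the prompts it would be responsible for
    print(index, split_for_process(prompts, num_processes=2, process_index=index))
```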

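Use the `--num_processes` argument to specify the number of GPUs to use, and call `accelerate launch` to run the script. Launcher options go before the script name; anything after the script name is passed to the script itself:
```bash
accelerate launch --num_processes=2 run_distributed.py
```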

## PyTorch Distributed

PyTorch supports `DistributedDataParallel`, which enables data parallelism.

To start, create a Python file and import `torch.distributed` and `torch.multiprocessing` to set up the distributed process group and to spawn the processes for inference on each GPU.

In [None]:
# the following snippet is saved as `run_distributed.py`
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

from diffusers import DiffusionPipeline

sd = DiffusionPipeline.from_pretrained(
    'stable-diffusion-v1-5/stable-diffusion-v1-5',
    torch_dtype=torch.float16,
    use_safetensors=True,
)


def run_inference(rank, world_size):
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    # `init_process_group` handles creating a distributed environment with
    # the type of backend to use, the `rank` of the current process, and
    # the `world_size` or the number of processes participating.

    sd.to(rank)

    # Each process picks its prompt based on its rank
    if dist.get_rank() == 0:
        prompt = 'a dog'
    elif dist.get_rank() == 1:
        prompt = 'a cat'

    image = sd(prompt).images[0]
    # Replace spaces so that 'a dog' is saved as './a_dog.png'
    image.save(f"./{prompt.replace(' ', '_')}.png")


def main():
    # call `mp.spawn` to run the distributed inference
    world_size = 2 # 2 gpus
    mp.spawn(
        run_inference,
        args=(world_size,),
        nprocs=world_size,
        join=True,
    )


if __name__ == '__main__':
    main()

Once the inference script is complete, we can use the `--nproc_per_node` argument to specify the number of GPUs to use and call `torchrun` to run the script:
```bash
torchrun run_distributed.py --nproc_per_node=2
```
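
The `if`/`elif` chain in the script covers exactly two ranks. As a sketch of how the same idea could extend to more prompts or more processes, the hypothetical helper `prompts_for_rank` below assigns prompts round-robin by rank. It is plain Python and is not part of `torch.distributed`; inside `run_inference`, `rank` and `world_size` would be the values passed to `init_process_group`.

```python
def prompts_for_rank(prompts, rank, world_size):
    """Round-robin assignment: rank r takes prompts r, r + world_size, ...

    Hypothetical helper for illustration only.
    """
    return prompts[rank::world_size]


all_prompts = ["a dog", "a cat", "a bird", "a fish"]
world_size = 2
for rank in range(world_size):
    # Simulate what each rank would generate
    print(rank, prompts_for_rank(all_prompts, rank, world_size))
```

With this scheme, each process would loop over its own prompts and call the pipeline once per prompt (or once with the whole list).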

## Model sharding

Modern diffusion systems such as `Flux` are very large and have multiple models. For example, `Flux.1-Dev` is made up of two text encoders (`T5-XXL` and `CLIP-L`), a diffusion transformer, and a VAE. With a model this size, it can be challenging to run inference on consumer GPUs.

Model sharding is a technique that distributes models across GPUs when the models do not fit on a single GPU. The example below assumes two 16GB GPUs are available for inference.

Start by computing the text embeddings with the text encoders. Keep the text encoders on two GPUs by setting `device_map="balanced"`. The `balanced` strategy evenly distributes the model on all available GPUs. Use the `max_memory` parameter to set the maximum amount of memory the text encoders may use on each GPU.

In [None]:
from diffusers import FluxPipeline
import torch

prompt = 'a photo of a dog with cat-like look'

pipeline = FluxPipeline.from_pretrained(
    'black-forest-labs/FLUX.1-dev',
    transformer=None,
    vae=None,
    device_map='balanced',
    max_memory={0: "16GB", 1: "16GB"},
    torch_dtype=torch.float16,
)

with torch.no_grad():
    print('Encoding prompts')
    prompt_embeds, pooled_prompt_embeds, text_ids = pipeline.encode_prompt(
        prompt=prompt,
        prompt_2=None,
        max_sequence_length=512,
    )
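
The `max_memory` mapping above is written out by hand for two GPUs. As a small sketch, the hypothetical helper `build_max_memory` below produces the same dictionary for any number of GPUs, assuming every GPU gets the same limit. The helper name and its parameters are illustrative and not part of Diffusers or Accelerate.

```python
def build_max_memory(num_gpus, limit="16GB"):
    """Map each GPU index to the same memory limit string.

    Hypothetical helper for illustration only.
    """
    return {gpu_index: limit for gpu_index in range(num_gpus)}


max_memory = build_max_memory(2)
print(max_memory)
```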

Once the text embeddings are computed, remove them from the GPU to make space for the diffusion transformer:

In [None]:
import gc

def flush():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()


del pipeline.text_encoder
del pipeline.text_encoder_2
del pipeline.tokenizer
del pipeline.tokenizer_2
del pipeline

flush()
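
One plausible reason `flush()` calls `gc.collect()` is that `del` only removes a name; an object that is part of a reference cycle stays alive until the garbage collector runs. The sketch below shows this in plain Python with a hypothetical `FakeModel` class standing in for a pipeline component. It does not touch the GPU and makes no claim about how much memory PyTorch actually releases.

```python
import gc
import weakref


class FakeModel:
    """Hypothetical stand-in for a pipeline component."""

    def __init__(self):
        self.self_reference = self  # creates a reference cycle


gc.disable()  # keep the automatic collector from running mid-demo
model = FakeModel()
tracker = weakref.ref(model)  # lets us check whether the object still exists

del model
alive_after_del = tracker() is not None  # the cycle keeps the object alive

gc.collect()
alive_after_collect = tracker() is not None  # the cycle has been collected
gc.enable()

print(alive_after_del, alive_after_collect)
```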

Next, load the diffusion transformer, which has 12.5B parameters. This time, set `device_map="auto"` to automatically distribute the model across two 16GB GPUs. The `auto` strategy is backed by Accelerate and available as a part of the Big Model Inference feature.

In [None]:
from diffusers import FluxTransformer2DModel
import torch

transformer = FluxTransformer2DModel.from_pretrained(
    'black-forest-labs/FLUX.1-dev',
    subfolder='transformer',
    device_map='auto',
    torch_dtype=torch.float16,
)
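
A back-of-the-envelope estimate shows why this model needs to be sharded. The sketch below assumes 2 bytes per parameter (float16 or bfloat16) and counts only the weights, ignoring activations and other runtime overhead, so real usage will be higher. The helper `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter=2):
    """Approximate weight size in GB (10**9 bytes).

    Assumes 2 bytes per parameter, as for float16/bfloat16 weights.
    """
    return num_parameters * bytes_per_parameter / 1e9


transformer_gb = weight_memory_gb(12.5e9)  # 12.5B parameters
gpu_gb = 16  # memory per GPU assumed in this example

print(f"transformer weights: ~{transformer_gb:.0f} GB")
print("fits on one GPU:", transformer_gb <= gpu_gb)
print("fits across two GPUs:", transformer_gb <= 2 * gpu_gb)
```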

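Print `transformer.hf_device_map` to see how the transformer is sharded across devices. (`pipeline.hf_device_map` shows the same information for a pipeline loaded with `device_map`, but the earlier pipeline has already been deleted at this point.)

In [None]:
print(transformer.hf_device_map)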

Add the transformer model to the pipeline for denoising, but set the other model-level components like the text encoders and VAE to `None` because we do not need them yet.

In [None]:
pipeline = FluxPipeline.from_pretrained(
    'black-forest-labs/FLUX.1-dev',
    text_encoder=None,
    text_encoder_2=None,
    tokenizer=None,
    tokenizer_2=None,
    vae=None,
    transformer=transformer,
    torch_dtype=torch.float16,
)

print('Running denoising...')
height, width = 768, 1360
latents = pipeline(
    prompt_embeds=prompt_embeds,
    pooled_prompt_embeds=pooled_prompt_embeds,
    num_inference_steps=50,
    guidance_scale=3.5,
    height=height,
    width=width,
    output_type='latent',
).images

Remove the pipeline and transformer from memory as they're no longer needed.

In [None]:
del pipeline.transformer
del pipeline

flush()

Finally, decode the latents with the VAE into an image. The VAE is typically small enough to be loaded on a single GPU.

In [None]:
from diffusers import AutoencoderKL
from diffusers.image_processor import VaeImageProcessor
import torch

vae = AutoencoderKL.from_pretrained(
    'black-forest-labs/FLUX.1-dev',
    subfolder='vae',
    torch_dtype=torch.bfloat16,
).to('cuda')
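# Recent Diffusers releases define the Flux VAE scale factor with `- 1`
# (8x spatial downsampling); `_unpack_latents` accounts for the 2x2 packing.
vae_scale_factor = 2 ** (len(vae.config.block_out_channels) - 1)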
image_processor = VaeImageProcessor(vae_scale_factor=vae_scale_factor)

with torch.no_grad():
    print('Running decoding...')
    latents = FluxPipeline._unpack_latents(
        latents,
        height,
        width,
        vae_scale_factor,
    )
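    # Match the VAE's device and dtype (the latents come from a float16 pipeline)
    latents = latents.to(vae.device, dtype=vae.dtype)
    latents = (latents / vae.config.scaling_factor) + vae.config.shift_factor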

    image = vae.decode(latents, return_dict=False)[0]
    image = image_processor.postprocess(image, output_type='pil')
    image[0].save('split_transformer.png')
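
To show what `_unpack_latents` has to undo, the sketch below works out the shape of the packed latents for the resolution used above. It assumes the VAE downsamples by a factor of 8, that the latents have 16 channels, and that they are packed into 2x2 patches. The helper `packed_latent_shape` is hypothetical and only does the arithmetic.

```python
def packed_latent_shape(height, width, vae_scale_factor=8, latent_channels=16):
    """Return (num_tokens, features) of the packed latents for one image.

    Assumptions: 8x VAE downsampling, 16 latent channels, 2x2 patch packing.
    """
    latent_height = height // vae_scale_factor
    latent_width = width // vae_scale_factor
    # Every 2x2 patch of latent pixels becomes one token
    num_tokens = (latent_height // 2) * (latent_width // 2)
    features = latent_channels * 4
    return num_tokens, features


print(packed_latent_shape(768, 1360))
```

Under these assumptions, a 768x1360 image corresponds to a 96x170 latent grid, which is packed into 48 x 85 = 4080 tokens of 64 features each.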