# Understanding Diffusion

---

# Building Diffusion Systems with Diffusers

The `diffusers` library is designed to be a user-friendly and flexible toolbox for building diffusion systems tailored to your use-case. At the core of this toolbox are **models** and **schedulers**. While the `DiffusionPipeline` bundles these components together for convenience, you can also unbundle the pipeline and use the models and schedulers separately to create new diffusion systems.


## Diffusion Models: A Deep Dive

Diffusion models are a family of generative models built on a simple idea: take images that have been corrupted with noise and learn to denoise them. During training, the model sees images at varying noise levels. At inference time, it starts from pure noise and iteratively denoises it into an image that looks like it came from the training data.
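The "varying levels of noise" seen during training can be made concrete with a small sketch. This is a pure-Python illustration rather than the library's code: it assumes the linear beta schedule of the original DDPM setup (1000 steps, betas from 1e-4 to 0.02), and the helper names `linear_alpha_bars` and `add_noise` are hypothetical.

```python
import math
import random

def linear_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(pixels, t, alpha_bars, rng):
    """Return a noisy copy of `pixels` at timestep `t` (hypothetical helper)."""
    signal = math.sqrt(alpha_bars[t])        # how much of the image survives
    sigma = math.sqrt(1.0 - alpha_bars[t])   # how much Gaussian noise is mixed in
    return [signal * p + sigma * rng.gauss(0.0, 1.0) for p in pixels]

alpha_bars = linear_alpha_bars()
rng = random.Random(0)
clean = [0.8, -0.2, 0.5]  # toy "image" with values in [-1, 1]

slightly_noisy = add_noise(clean, 10, alpha_bars, rng)   # early step: mostly signal
mostly_noise = add_noise(clean, 999, alpha_bars, rng)    # last step: almost pure noise
```

At small `t` the signal coefficient is close to 1 and the sample still resembles the input. At the final timestep the signal coefficient is close to 0, which is why inference can start from pure Gaussian noise.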

### The Key Insight: Iterative Refinement

So, what makes diffusion models so effective? Unlike previous techniques like Variational Autoencoders (VAEs) or Generative Adversarial Networks (GANs), which generate images in a single pass, diffusion models create images through many steps. This iterative process allows the model to refine and correct its output gradually, leading to high-quality image generation. To see this in action, let's explore an example using the Hugging Face diffusers library.

### Loading a Pre-Trained Diffusion Model

We'll use the `DDPMPipeline` from the `diffusers` library to load a pre-trained diffusion model. Specifically, we'll use the `google/ddpm-celebahq-256` checkpoint, which was trained on CelebA-HQ, a collection of high-quality celebrity face images. Starting from pure noise, this model generates images resembling those in the dataset.

In this tutorial, you’ll learn how to use models and schedulers to assemble a diffusion system for inference. We'll start with a basic pipeline and then progress to the more complex Stable Diffusion pipeline.

## Deconstruct a Basic Pipeline

A pipeline is a quick and easy way to run a model for inference, requiring no more than four lines of code to generate an image:

In [None]:

from diffusers import DDPMPipeline

ddpm = DDPMPipeline.from_pretrained("google/ddpm-celebahq-256").to("cuda")

In [None]:
# Generate a celebrity face
image = ddpm(num_inference_steps=30).images[0]
image

This generates an image of a fake celebrity using the `DDPMPipeline`. But how does it work under the hood? Let’s break down the pipeline and see what’s happening.

### Understanding the Pipeline

In the example above, the pipeline contains a `UNet2DModel` model and a `DDPMScheduler`. The pipeline denoises an image by taking random noise the size of the desired output and passing it through the model several times. At each timestep, the model predicts the noise residual and the scheduler uses it to predict a less noisy image. This process repeats until it reaches the end of the specified number of inference steps.

To recreate the pipeline with the model and scheduler separately, let’s write our own denoising process.

### Load the Model and Scheduler

First, we need to load the model and scheduler:

In [None]:

from diffusers import DDPMScheduler, UNet2DModel

scheduler = DDPMScheduler.from_pretrained("google/ddpm-celebahq-256")
model = UNet2DModel.from_pretrained("google/ddpm-celebahq-256").to("cuda")


### Set the Number of Timesteps

Next, we set the number of timesteps to run the denoising process for:

In [None]:

scheduler.set_timesteps(50)


Setting the scheduler timesteps creates a tensor with evenly spaced elements, 50 in this example. Each element corresponds to a timestep at which the model denoises an image:

In [None]:

scheduler.timesteps
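The spacing can be reproduced by hand. The sketch below assumes the scheduler was trained with 1000 timesteps and uses "leading" spacing (integer multiples of the step ratio, in descending order); check `scheduler.config` for the values your checkpoint actually uses. The function name `evenly_spaced_timesteps` is hypothetical.

```python
def evenly_spaced_timesteps(num_train_timesteps, num_inference_steps):
    """Pick `num_inference_steps` evenly spaced timesteps, highest noise first."""
    step_ratio = num_train_timesteps // num_inference_steps
    ascending = [i * step_ratio for i in range(num_inference_steps)]
    return ascending[::-1]  # denoising runs from the noisiest timestep down to 0

timesteps = evenly_spaced_timesteps(1000, 50)
print(timesteps[:3], timesteps[-3:])  # [980, 960, 940] [40, 20, 0]
```

Under these assumptions, each of the 50 inference steps covers 20 training timesteps.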


### Create Random Noise

Create some random noise with the same shape as the desired output:

In [None]:
import torch

# Get the sample size from the model configuration
sample_size = model.config.sample_size

# Create random noise with the same shape as the desired output
noise = torch.randn((1, 3, sample_size, sample_size), device="cuda")


### Write the Denoising Loop

Now we write a loop to iterate over the timesteps. At each timestep, the model does a `UNet2DModel.forward()` pass and returns the noise residual. The scheduler's `step()` method takes the noise residual, the timestep, and the current input, and predicts the image at the previous timestep. That output becomes the next input to the model, and the loop repeats until it reaches the end of the `timesteps` array.

In each iteration:

* The model predicts the noise in the current image (`noisy_residual`).
* The scheduler's `step()` method uses this prediction to estimate the image at the previous timestep (`previous_noisy_sample`).
* This new estimate becomes the input for the next iteration.
* This process continues until we have stepped through all timesteps, progressively denoising the image.

In [None]:
# Initialize the input to be the random noise we created
input = noise

# Loop over each timestep
for t in scheduler.timesteps:
    with torch.no_grad():  # No gradient calculation is needed
        # Get the noisy residual from the model
        noisy_residual = model(input, t).sample
    # Predict the image at the previous timestep
    previous_noisy_sample = scheduler.step(noisy_residual, t, input).prev_sample
    # Update the input for the next iteration
    input = previous_noisy_sample


This is the entire denoising process. You can use this same pattern to write any diffusion system.
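To make the scheduler's role less opaque, here is a scalar sketch of the mean update that a DDPM-style `step()` computes. It is an illustration under stated assumptions, not the library's implementation: it assumes a linear beta schedule (1000 steps, 1e-4 to 0.02), omits the random noise term that a DDPM scheduler adds on every step except the last, and replaces the UNet with an oracle that knows the clean value. The names `linear_alpha_bars` and `ddpm_mean_step` are hypothetical.

```python
import math

def linear_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def ddpm_mean_step(x_t, eps_pred, t, t_prev, alpha_bars):
    """Mean of the DDPM reverse step from timestep t to t_prev (noise term omitted)."""
    abar_t = alpha_bars[t]
    abar_prev = alpha_bars[t_prev] if t_prev >= 0 else 1.0
    alpha_t = abar_t / abar_prev  # effective alpha when timesteps are skipped
    beta_t = 1.0 - alpha_t
    # Estimate of the clean sample implied by the predicted noise
    x0_pred = (x_t - math.sqrt(1.0 - abar_t) * eps_pred) / math.sqrt(abar_t)
    # Blend the clean estimate with the current noisy sample
    coef_x0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    return coef_x0 * x0_pred + coef_xt * x_t

alpha_bars = linear_alpha_bars()
timesteps = list(range(980, -1, -20))  # 50 evenly spaced timesteps, noisiest first
x0_true, eps_true = 0.5, 1.3           # toy scalar "image" and the noise mixed into it
x = math.sqrt(alpha_bars[980]) * x0_true + math.sqrt(1.0 - alpha_bars[980]) * eps_true

for t in timesteps:
    # Oracle standing in for the UNet: it knows x0_true, so its noise estimate is exact
    eps_pred = (x - math.sqrt(alpha_bars[t]) * x0_true) / math.sqrt(1.0 - alpha_bars[t])
    x = ddpm_mean_step(x, eps_pred, t, t - 20, alpha_bars)
```

With a perfect noise estimate, `x` ends up at `x0_true` after the last step. A real model's estimate is imperfect, which is why the process is spread over many small corrections instead of one large one.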

### Convert the Denoised Output to an Image

The last step is to convert the denoised output into an image.
Here’s what happens in this step:

* We map the tensor values from [-1, 1] to [0, 1], clamp them to that range, and squeeze out the batch dimension.
* We permute the dimensions from (Channels, Height, Width) to the (Height, Width, Channels) layout that PIL expects.
* The values are scaled to [0, 255], rounded, and converted to 8-bit unsigned integers.
* Finally, we convert the NumPy array to a PIL image and display it.

In [None]:
from PIL import Image
import numpy as np

# Normalize the image data to be between 0 and 1
image = (input / 2 + 0.5).clamp(0, 1).squeeze()
# Change the shape and type for image conversion
# PIL's Image.fromarray expects a (height, width, channels) array
image = (image.permute(1, 2, 0) * 255).round().to(torch.uint8).cpu().numpy()
# Create a PIL image
image = Image.fromarray(image)
image


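## Another Example: The Stable Diffusion Pipeline

Stable Diffusion is a text-to-image latent diffusion model: it denoises a compressed latent representation rather than pixels, so it needs four components instead of one. Load the VAE, the tokenizer, the text encoder, and the UNet from their subfolders in the `CompVis/stable-diffusion-v1-4` checkpoint: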

In [None]:
from PIL import Image
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", use_safetensors=True)
tokenizer = CLIPTokenizer.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="text_encoder", use_safetensors=True)
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet", use_safetensors=True)


### Use a Different Scheduler

Instead of the default `PNDMScheduler`, let's use the `UniPCMultistepScheduler`:

In [None]:

from diffusers import UniPCMultistepScheduler

scheduler = UniPCMultistepScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")


### Move Models to GPU

To speed up inference, move the models to a GPU:

In [None]:

torch_device = "cuda"
vae.to(torch_device)
text_encoder.to(torch_device)
unet.to(torch_device)


### Create Text Embeddings

Tokenize the text to generate embeddings. The text is used to condition the UNet model and steer the diffusion process towards something that resembles the input prompt:

In [None]:
torch.cuda.is_available()

In [None]:
prompt = ["a nice flower"]
height = 512  # default height of Stable Diffusion
width = 512  # default width of Stable Diffusion
num_inference_steps = 25  # Number of denoising steps
guidance_scale = 7.5  # Scale for classifier-free guidance

In [None]:
generator = torch.manual_seed(0)  # Seed generator to create the initial latent noise

In [None]:
batch_size = len(prompt)

text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt")

with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]

max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer([""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt")
with torch.no_grad():  # Inference only, so skip gradient tracking here as well
    uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]

text_embeddings = torch.cat([uncond_embeddings, text_embeddings])


### Create Random Noise

Generate some initial random noise as a starting point for the diffusion process:

In [None]:

latents = torch.randn((batch_size, unet.config.in_channels, height // 8, width // 8), device=torch_device)
latents = latents * scheduler.init_noise_sigma
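The `height // 8` and `width // 8` terms reflect that the denoising happens in the VAE's latent space, not in pixel space. The sketch below spells out the arithmetic. It assumes the VAE downsamples each spatial dimension by a factor of 8 and that `unet.config.in_channels` is 4 for this checkpoint; the function name `latent_shape` is hypothetical.

```python
def latent_shape(batch_size, latent_channels, height, width, vae_scale_factor=8):
    """Shape of the latent tensor the UNet denoises for a given output image size."""
    if height % vae_scale_factor or width % vae_scale_factor:
        raise ValueError("height and width must be multiples of the VAE scale factor")
    return (batch_size, latent_channels, height // vae_scale_factor, width // vae_scale_factor)

shape = latent_shape(1, 4, 512, 512)
print(shape)  # (1, 4, 64, 64)
```

Under these assumptions, a 512 x 512 RGB image (786,432 values) is represented by 16,384 latent values, which is a large part of why latent diffusion is cheaper than denoising pixels directly.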


### Denoise the Image

Create the denoising loop to progressively transform the pure noise in latents to an image described by your prompt:

In [None]:

from tqdm.auto import tqdm

scheduler.set_timesteps(num_inference_steps)

for t in tqdm(scheduler.timesteps):
    latent_model_input = torch.cat([latents] * 2)
    latent_model_input = scheduler.scale_model_input(latent_model_input, timestep=t)

    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    latents = scheduler.step(noise_pred, t, latents).prev_sample
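The two lines in the loop that split and recombine `noise_pred` implement classifier-free guidance: the prediction is pushed away from the unconditional result and towards the text-conditioned one. A minimal sketch of the same formula on plain Python lists, with made-up numbers and a hypothetical function name `apply_guidance`:

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Extrapolate from the unconditional prediction towards the text-conditioned one."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_text)]

noise_uncond = [0.0, 1.0]  # toy prediction for the empty prompt
noise_text = [1.0, 3.0]    # toy prediction for the actual prompt

no_text = apply_guidance(noise_uncond, noise_text, 0.0)    # [0.0, 1.0]: prompt ignored
text_only = apply_guidance(noise_uncond, noise_text, 1.0)  # [1.0, 3.0]: no extra guidance
guided = apply_guidance(noise_uncond, noise_text, 7.5)     # [7.5, 16.0]: pushed past the text prediction
```

This is also why the loop duplicates `latents` with `torch.cat([latents] * 2)`: the unconditional and text-conditioned predictions are computed in a single batched UNet pass and separated afterwards with `chunk(2)`.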


### Decode the Image

Finally, use the VAE to decode the latent representation into an image:

In [None]:

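# Undo the 0.18215 scaling applied to Stable Diffusion v1 latents before decoding
latents = 1 / 0.18215 * latents
with torch.no_grad():
    image = vae.decode(latents).sample

# Map from [-1, 1] to [0, 1], then to an 8-bit (height, width, channels) array for PIL
image = (image / 2 + 0.5).clamp(0, 1).squeeze()
image = (image.permute(1, 2, 0) * 255).round().to(torch.uint8).cpu().numpy()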
image = Image.fromarray(image)
image


And that's it! You've now created and understood a diffusion system using the `diffusers` library, both for a basic and a Stable Diffusion pipeline. Feel free to experiment with different models and settings to see what other amazing images you can generate!


In [None]:
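# Load FLUX.1-dev through the generic DiffusionPipeline and apply LoRA weights on top
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("black-forest-labs/FLUX.1-dev")
pipe.load_lora_weights("strangerzonehf/Flux-Midjourney-Mix2-LoRA")

# The Midjourney-style flags (--ar, --v, --style) are not parsed by diffusers;
# they reach the text encoders as ordinary prompt text
prompt = "street photography, dark green background --ar 47:64 --v 6.0 --style raw"
image = pipe(prompt).images[0]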