# Perturbed-Attention Guidance

**Perturbed-Attention Guidance (PAG)** is a diffusion sampling guidance technique that improves sample quality in both unconditional and conditional settings, without requiring further training or the integration of external modules.

PAG is designed to progressively enhance the structure of synthesized samples throughout the denoising process by exploiting the ability of self-attention to capture structural information. It generates intermediate samples with degraded structure by substituting the self-attention maps in the diffusion UNet with an identity matrix, and then guides the denoising process away from these degraded samples.
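As a rough illustration of the perturbation described above, the following is a minimal pure-Python sketch, not the diffusers implementation. The helper names `self_attention` and `perturbed_self_attention` and the toy inputs are hypothetical. When the softmax attention map is replaced with an identity matrix, each token attends only to itself, so the layer output is simply the value vectors:

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(q, k, v):
    """Standard scaled dot-product self-attention: softmax(Q K^T / sqrt(d)) V."""
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    attn_map = [softmax([s * scale for s in row]) for row in scores]
    return matmul(attn_map, v)


def perturbed_self_attention(q, k, v):
    """Perturbed variant: the attention map is replaced by an identity matrix."""
    n = len(v)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return matmul(identity, v)


# Toy example: 3 tokens with 2-dimensional features (values are arbitrary).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 2.0], [0.5, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

normal_out = self_attention(q, k, v)            # tokens mix information
perturbed_out = perturbed_self_attention(q, k, v)  # tokens stay isolated
```

Because the perturbed branch never mixes information between tokens, its prediction lacks global structure, which is what makes it useful as a "bad" reference to steer away from.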

## General tasks

To enable PAG, we can load the pipeline using the `AutoPipeline` API, passing `enable_pag=True` and the `pag_applied_layers` argument.

##### Text-to-image

In [None]:
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image, make_image_grid

pipeline = AutoPipelineForText2Image.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    enable_pag=True,
    pag_applied_layers=['mid'],
    torch_dtype=torch.float16,
)
pipeline.enable_model_cpu_offload()

If we already have a pipeline created and loaded, we can enable PAG on it using the `from_pipe` API with the `enable_pag` flag.

In [None]:
pipeline_sdxl = AutoPipelineForText2Image.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    torch_dtype=torch.float16,
)

pipeline = AutoPipelineForText2Image.from_pipe(
    pipeline_sdxl,
    enable_pag=True,
)

To generate an image, we need to pass a `pag_scale`.
* When `pag_scale` increases, images gain more semantically coherent structures and exhibit fewer artifacts.
* An overly large PAG guidance scale can lead to smoother textures and slight saturation in the images, similar to CFG.
* PAG is disabled when `pag_scale=0`.

`pag_scale=3.0` is used in the official demo and works well in most use cases.
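To make the role of `pag_scale` concrete, here is a minimal sketch of how the perturbed prediction might be combined with classifier-free guidance at each denoising step. It assumes the common formulation in which a PAG term is added alongside the CFG term. The function name `combine_guidance` and the toy numbers are hypothetical, and the actual pipeline operates on tensors and may differ in detail.

```python
def combine_guidance(uncond, cond, perturbed, guidance_scale, pag_scale):
    """Combine noise predictions element-wise (assumed formulation).

    uncond:    prediction without the text condition
    cond:      prediction with the text condition
    perturbed: conditional prediction with perturbed self-attention
    """
    return [
        u + guidance_scale * (c - u) + pag_scale * (c - p)
        for u, c, p in zip(uncond, cond, perturbed)
    ]


# Toy noise predictions (arbitrary numbers, for illustration only).
uncond = [0.0, 1.0]
cond = [1.0, 3.0]
perturbed = [0.5, 2.0]

cfg_only = combine_guidance(uncond, cond, perturbed, guidance_scale=7.0, pag_scale=0.0)
with_pag = combine_guidance(uncond, cond, perturbed, guidance_scale=7.0, pag_scale=3.0)
```

With `pag_scale=0` the PAG term vanishes and the result reduces to plain CFG. A larger `pag_scale` pushes the prediction further away from the structurally degraded one.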

In [None]:
prompt = 'an insect robot preparing a delicious meal, anime style'
generator = torch.Generator('cpu').manual_seed(111)

images = []
for pag_scale in [0, 1, 2, 3, 5, 10]:
    image = pipeline(
        prompt,
        num_inference_steps=25,
        guidance_scale=7.,
        generator=generator,
        pag_scale=pag_scale,
    ).images[0]
    images.append(image)

make_image_grid(images, rows=1, cols=(len(images)))

##### Image-to-image

In [None]:
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image, make_image_grid
import torch

pipeline = AutoPipelineForImage2Image.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    enable_pag=True,
    pag_applied_layers=['mid'],
    torch_dtype=torch.float16,
)
pipeline.enable_model_cpu_offload()

If we already have an image-to-image pipeline and would like to enable PAG on it:

In [None]:
pipeline_sdxl = AutoPipelineForImage2Image.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    torch_dtype=torch.float16,
)

pipeline = AutoPipelineForImage2Image.from_pipe(
    pipeline_sdxl,
    enable_pag=True,
)

To switch directly from a text-to-image pipeline to a PAG-enabled image-to-image pipeline:

In [None]:
from diffusers import AutoPipelineForText2Image

pipeline_t2i = AutoPipelineForText2Image.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    torch_dtype=torch.float16,
)

pipeline = AutoPipelineForImage2Image.from_pipe(
    pipeline_t2i,
    enable_pag=True,
)

If we have a PAG-enabled text-to-image pipeline, we can directly switch to an image-to-image pipeline with PAG still enabled:

In [None]:
pipeline_pag = AutoPipelineForText2Image.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    enable_pag=True,
    torch_dtype=torch.float16,
)

pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_pag)

In [None]:
pag_scale = 4.0
guidance_scale = 7.0

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png"
init_image = load_image(url)
prompt = "a dog catching a frisbee in the jungle"
generator = torch.Generator('cpu').manual_seed(111)

image = pipeline(
    prompt,
    image=init_image,
    strength=0.8,
    guidance_scale=guidance_scale,
    pag_scale=pag_scale,
    generator=generator,
).images[0]
make_image_grid([init_image, image], rows=1, cols=2)

##### Inpainting

In [None]:
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image, make_image_grid
import torch

pipeline = AutoPipelineForInpainting.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    enable_pag=True,
    pag_applied_layers=['mid'],
    torch_dtype=torch.float16,
)
pipeline.enable_model_cpu_offload()

To enable PAG on an existing inpainting pipeline:

In [None]:
pipeline_sdxl = AutoPipelineForInpainting.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    torch_dtype=torch.float16,
)

pipeline = AutoPipelineForInpainting.from_pipe(
    pipeline_sdxl,
    enable_pag=True,
)

Switching from another pipeline task:

In [None]:
from diffusers import AutoPipelineForText2Image

pipeline_t2i = AutoPipelineForText2Image.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    torch_dtype=torch.float16,
)

pipeline = AutoPipelineForInpainting.from_pipe(
    pipeline_t2i,
    enable_pag=True
)

In [None]:
img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
init_image = load_image(img_url).convert("RGB")
mask_image = load_image(mask_url).convert("RGB")

prompt = "A majestic tiger sitting on a bench"

pag_scale = 3.0
guidance_scale = 7.5
generator = torch.Generator('cpu').manual_seed(111)

image = pipeline(
    prompt,
    image=init_image,
    mask_image=mask_image,
    strength=0.8,
    num_inference_steps=50,
    guidance_scale=guidance_scale,
    generator=generator,
    pag_scale=pag_scale,
).images[0]
make_image_grid([init_image, mask_image, image], rows=1, cols=3)

## PAG with ControlNet

In [None]:
from diffusers import AutoPipelineForText2Image, ControlNetModel
import torch

controlnet = ControlNetModel.from_pretrained(
    'diffusers/controlnet-canny-sdxl-1.0',
    torch_dtype=torch.float16,
)

pipeline = AutoPipelineForText2Image.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    controlnet=controlnet,
    enable_pag=True,
    pag_applied_layers=['mid'],
    torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()

If we already have a ControlNet pipeline and want to enable PAG:

In [None]:
pipeline_controlnet = AutoPipelineForText2Image.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    controlnet=controlnet,
    torch_dtype=torch.float16,
)

pipeline = AutoPipelineForText2Image.from_pipe(
    pipeline_controlnet,
    enable_pag=True,
)

In [None]:
from diffusers.utils import load_image, make_image_grid

canny_image = load_image(
    "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/pag_control_input.png"
)
generator = torch.Generator('cpu').manual_seed(111)

images = []
images.append(canny_image)
for pag_scale in [0., 3.0]:
    image = pipeline(
        prompt="",
        controlnet_conditioning_scale=0.8,
        image=canny_image,
        num_inference_steps=50,
        guidance_scale=0,
        generator=generator,
        pag_scale=pag_scale,
    ).images[0]
    images.append(image)

make_image_grid(images, rows=1, cols=(len(images)))

## PAG with IP-Adapter

In [None]:
from diffusers import AutoPipelineForText2Image
from diffusers.utils import load_image, make_image_grid
from transformers import CLIPVisionModelWithProjection
import torch

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    'h94/IP-Adapter',
    subfolder='models/image_encoder',
    torch_dtype=torch.float16
)

pipeline = AutoPipelineForText2Image.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    image_encoder=image_encoder,
    enable_pag=True,
    torch_dtype=torch.float16,
).to('cuda')

pipeline.load_ip_adapter(
    'h94/IP-Adapter',
    subfolder='sdxl_models',
    weight_name='ip-adapter-plus_sdxl_vit-h.bin'
)

In [None]:
pag_scale = 5.0
ip_adapter_scale = 0.8

ip_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")
prompt = 'a polar bear sitting in a chair drinking a milkshake'
negative_prompt = "deformed, ugly, wrong proportion, low res, bad anatomy, worst quality, low quality"
generator = torch.Generator('cpu').manual_seed(111)

pipeline.set_ip_adapter_scale(ip_adapter_scale)
image = pipeline(
    prompt,
    negative_prompt=negative_prompt,
    ip_adapter_image=ip_image,
    num_inference_steps=25,
    guidance_scale=3.0,
    generator=generator,
    pag_scale=pag_scale,
).images[0]
make_image_grid([ip_image, image], rows=1, cols=2)

## Configure parameters

The `pag_applied_layers` argument allows us to specify which layers PAG is applied to. By default, it applies only to the mid blocks.

We can use the `set_pag_applied_layers` method to adjust the PAG layers after the pipeline is created.

In [None]:
from diffusers import AutoPipelineForText2Image
from diffusers.utils import make_image_grid
import torch

pipeline = AutoPipelineForText2Image.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    torch_dtype=torch.float16,
    enable_pag=True,
)
pipeline.enable_model_cpu_offload()

In [None]:
prompt = "an insect robot preparing a delicious meal, anime style"
generator = torch.Generator(device="cpu").manual_seed(0)
pag_layers = [
    ['mid'],
    ['down.block_1'],
    ['down.block_2', 'up.block_1.attentions_0'],
]

images = []
for pag_applied_layers in pag_layers:
    pipeline.set_pag_applied_layers(pag_applied_layers)
    image = pipeline(
        prompt,
        num_inference_steps=25,
        guidance_scale=5.,
        generator=generator,
        pag_scale=3
    ).images[0]
    images.append(image)

make_image_grid(images, rows=1, cols=(len(images)))