
pag model/pipeline review #13594

@hlky

Description


Commit tested: 0f1abc4ae8b0eb2a3b40e82a310507281144c423

This review was performed against the repository review rules and covers the PAG public exports/lazy imports, pipeline/runtime behavior, consistency with the related base pipelines, docs/examples, and tests/pipelines/pag.

Issue 1: pag_applied_layers="blocks.1" also matches blocks.10

Affected code:

def is_fake_integral_match(layer_id, name):
    layer_id = layer_id.split(".")[-1]
    name = name.split(".")[-1]
    return layer_id.isnumeric() and name.isnumeric() and layer_id == name

for layer_id in pag_applied_layers:
    # for each PAG layer input, we find corresponding self-attention layers in the unet model
    target_modules = []
    for name, module in model.named_modules():
        # Identify the following simple cases:
        #   (1) Self Attention layer existing
        #   (2) Whether the module name matches pag layer id even partially
        #   (3) Make sure it's not a fake integral match if the layer_id ends with a number
        #       For example, blocks.1, blocks.10 should be differentiable if layer_id="blocks.1"
        if (
            is_self_attn(module)
            and re.search(layer_id, name) is not None
            and not is_fake_integral_match(layer_id, name)
        ):

Problem:
re.search(layer_id, name) is used directly, and the numeric disambiguation only compares the last dot-separated tokens. For names like blocks.10.attn1, the last token is attn1, so blocks.1 still matches blocks.10. This contradicts the nearby comment and over-applies PAG to unintended transformer blocks.

Impact:
Users selecting a single DiT block can silently perturb additional blocks, changing quality/performance and making layer ablations unreliable.

Reproduction:

import torch.nn as nn
from diffusers.models.attention_processor import Attention
from diffusers.pipelines.pag.pag_utils import PAGMixin

class TinyTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Module() for _ in range(11)])
        for block in self.blocks:
            block.attn1 = Attention(query_dim=8, heads=1, dim_head=8)

    @property
    def attn_processors(self):
        return {f"{n}.processor": m.processor for n, m in self.named_modules() if isinstance(m, Attention)}

class Dummy(PAGMixin):
    def __init__(self):
        self.transformer = TinyTransformer()
        self.set_pag_applied_layers(["blocks.1"])

pipe = Dummy()
pipe._set_pag_attn_processor(pipe.pag_applied_layers, do_classifier_free_guidance=False)
print(sorted(pipe.pag_attn_processors))
# Includes both blocks.1.attn1.processor and blocks.10.attn1.processor

Relevant precedent:
No duplicate found in GitHub issue/PR searches for PAG pag_applied_layers blocks.1 blocks.10.

Suggested fix:

match = re.search(layer_id, name)
if match is None:
    continue
if layer_id[-1].isdigit() and match.end() < len(name) and name[match.end()].isdigit():
    continue
if is_self_attn(module):
    target_modules.append(module)
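
An equivalent, more compact guard is a negative lookahead that forbids a digit immediately after the matched span. A minimal sketch, assuming layer_id is intended as a literal module-path fragment (note that escaping also makes the dot literal, which slightly tightens the current regex semantics):

import re

def matches_layer(layer_id: str, name: str) -> bool:
    # Escape the fragment so "." matches literally, then forbid a trailing
    # digit so "blocks.1" cannot match inside "blocks.10".
    pattern = re.escape(layer_id) + r"(?!\d)"
    return re.search(pattern, name) is not None

assert matches_layer("blocks.1", "blocks.1.attn1")
assert not matches_layer("blocks.1", "blocks.10.attn1")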

Issue 2: PAG processors are not restored if generation raises

Affected code:

if self.do_perturbed_attention_guidance:
    original_attn_proc = self.unet.attn_processors
    self._set_pag_attn_processor(
        pag_applied_layers=self.pag_applied_layers,
        do_classifier_free_guidance=self.do_classifier_free_guidance,
    )

# Offload all models
self.maybe_free_model_hooks()
if self.do_perturbed_attention_guidance:
    self.unet.set_attn_processor(original_attn_proc)

Problem:
PAG pipelines save the original attention processors before the denoising loop and restore them only on the normal success path. If a callback, scheduler, VAE, or user interrupt raises after _set_pag_attn_processor, the model keeps PAG processors installed.

Impact:
A later pag_scale=0 call can fail or produce wrong results because the UNet/transformer still expects PAG-expanded batches.

Reproduction:

import torch
from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, DDIMScheduler, StableDiffusionPAGPipeline, UNet2DConditionModel

unet = UNet2DConditionModel(block_out_channels=(4, 8), layers_per_block=2, sample_size=32, in_channels=4, out_channels=4,
    down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"), up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
    cross_attention_dim=8, norm_num_groups=2)
vae = AutoencoderKL(block_out_channels=[4, 8], in_channels=3, out_channels=3, down_block_types=["DownEncoderBlock2D"]*2,
    up_block_types=["UpDecoderBlock2D"]*2, latent_channels=4, norm_num_groups=2)
text_encoder = CLIPTextModel(CLIPTextConfig(bos_token_id=0, eos_token_id=2, hidden_size=8, intermediate_size=16,
    num_attention_heads=2, num_hidden_layers=2, pad_token_id=1, vocab_size=1000))
pipe = StableDiffusionPAGPipeline(
    unet=unet, scheduler=DDIMScheduler(), vae=vae, text_encoder=text_encoder,
    tokenizer=CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip"),
    safety_checker=None, feature_extractor=None,
)
pipe.set_progress_bar_config(disable=True)
try:
    pipe("x", num_inference_steps=1, guidance_scale=5, pag_scale=1, output_type="latent",
         callback_on_step_end=lambda *a, **k: (_ for _ in ()).throw(RuntimeError("boom")))
except RuntimeError:
    pass
pipe("x", num_inference_steps=1, guidance_scale=5, pag_scale=0, output_type="latent")
# ValueError: not enough values to unpack (expected 3, got 2)

Relevant precedent:
No duplicate found for PAG callback exception attention processor restore.

Suggested fix:

original_attn_proc = None
try:
    if self.do_perturbed_attention_guidance:
        original_attn_proc = self.unet.attn_processors
        self._set_pag_attn_processor(self.pag_applied_layers, self.do_classifier_free_guidance)
    # denoise/decode body
finally:
    if original_attn_proc is not None:
        self.unet.set_attn_processor(original_attn_proc)
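
To avoid threading try/finally through every PAG pipeline body, the restore logic could also live in a small context manager. A minimal sketch, assuming pipe is any PAGMixin pipeline and denoiser is its UNet or transformer:

from contextlib import contextmanager

@contextmanager
def pag_attn_processors(pipe, denoiser):
    # Install the PAG attention processors, then guarantee restoration on
    # every exit path, including callback exceptions and user interrupts.
    original = denoiser.attn_processors
    pipe._set_pag_attn_processor(
        pag_applied_layers=pipe.pag_applied_layers,
        do_classifier_free_guidance=pipe.do_classifier_free_guidance,
    )
    try:
        yield
    finally:
        denoiser.set_attn_processor(original)

The denoise/decode body would then run inside "with pag_attn_processors(self, self.unet):" whenever do_perturbed_attention_guidance is set.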

Issue 3: SD3 PAG pipelines are stale versus base SD3

Affected code:

class StableDiffusion3PAGPipeline(DiffusionPipeline, SD3LoraLoaderMixin, FromSingleFileMixin, PAGMixin):
    r"""
    [PAG pipeline](https://huggingface.co/docs/diffusers/main/en/using-diffusers/pag) for text-to-image generation
    using Stable Diffusion 3.

    Args:
        transformer ([`SD3Transformer2DModel`]):
            Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
        scheduler ([`FlowMatchEulerDiscreteScheduler`]):
            A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
        vae ([`AutoencoderKL`]):
            Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
        text_encoder ([`CLIPTextModelWithProjection`]):
            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
            specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant,
            with an additional added projection layer that is initialized with a diagonal matrix with the `hidden_size`
            as its dimension.
        text_encoder_2 ([`CLIPTextModelWithProjection`]):
            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
            specifically the
            [laion/CLIP-ViT-bigG-14-laion2B-39B-b160k](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)
            variant.
        text_encoder_3 ([`T5EncoderModel`]):
            Frozen text-encoder. Stable Diffusion 3 uses
            [T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5EncoderModel), specifically the
            [t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl) variant.
        tokenizer (`CLIPTokenizer`):
            Tokenizer of class
            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
        tokenizer_2 (`CLIPTokenizer`):
            Second Tokenizer of class
            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
        tokenizer_3 (`T5TokenizerFast`):
            Tokenizer of class
            [T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
    """

    model_cpu_offload_seq = "text_encoder->text_encoder_2->text_encoder_3->transformer->vae"
    _optional_components = []
    _callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds", "negative_pooled_prompt_embeds"]

    def __call__(
        self,
        prompt: str | list[str] = None,
        prompt_2: str | list[str] | None = None,
        prompt_3: str | list[str] | None = None,
        height: int | None = None,
        width: int | None = None,
        num_inference_steps: int = 28,
        sigmas: list[float] | None = None,
        guidance_scale: float = 7.0,
        negative_prompt: str | list[str] | None = None,
        negative_prompt_2: str | list[str] | None = None,
        negative_prompt_3: str | list[str] | None = None,
        num_images_per_prompt: int | None = 1,
        generator: torch.Generator | list[torch.Generator] | None = None,
        latents: torch.FloatTensor | None = None,
        prompt_embeds: torch.FloatTensor | None = None,
        negative_prompt_embeds: torch.FloatTensor | None = None,
        pooled_prompt_embeds: torch.FloatTensor | None = None,
        negative_pooled_prompt_embeds: torch.FloatTensor | None = None,
        output_type: str | None = "pil",
        return_dict: bool = True,
        joint_attention_kwargs: dict[str, Any] | None = None,
        clip_skip: int | None = None,
        callback_on_step_end: Callable[[int, int], None] | None = None,
        callback_on_step_end_tensor_inputs: list[str] = ["latents"],
        max_sequence_length: int = 256,
        pag_scale: float = 3.0,
        pag_adaptive_scale: float = 0.0,
    ):

class StableDiffusion3PAGImg2ImgPipeline(DiffusionPipeline, SD3LoraLoaderMixin, FromSingleFileMixin, PAGMixin):
    r"""
    [PAG pipeline](https://huggingface.co/docs/diffusers/main/en/using-diffusers/pag) for image-to-image generation
    using Stable Diffusion 3.

    Args:
        transformer ([`SD3Transformer2DModel`]):
            Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
        scheduler ([`FlowMatchEulerDiscreteScheduler`]):
            A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
        vae ([`AutoencoderKL`]):
            Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
        text_encoder ([`CLIPTextModelWithProjection`]):
            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
            specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant,
            with an additional added projection layer that is initialized with a diagonal matrix with the `hidden_size`
            as its dimension.
        text_encoder_2 ([`CLIPTextModelWithProjection`]):
            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
            specifically the
            [laion/CLIP-ViT-bigG-14-laion2B-39B-b160k](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)
            variant.
        text_encoder_3 ([`T5EncoderModel`]):
            Frozen text-encoder. Stable Diffusion 3 uses
            [T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5EncoderModel), specifically the
            [t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl) variant.
        tokenizer (`CLIPTokenizer`):
            Tokenizer of class
            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
        tokenizer_2 (`CLIPTokenizer`):
            Second Tokenizer of class
            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
        tokenizer_3 (`T5TokenizerFast`):
            Tokenizer of class
            [T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
    """

    model_cpu_offload_seq = "text_encoder->text_encoder_2->text_encoder_3->transformer->vae"
    _optional_components = []
    _callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds", "negative_pooled_prompt_embeds"]

        negative_prompt: str | list[str] | None = None,
        negative_prompt_2: str | list[str] | None = None,
        negative_prompt_3: str | list[str] | None = None,
        num_images_per_prompt: int | None = 1,
        generator: torch.Generator | list[torch.Generator] | None = None,
        latents: torch.FloatTensor | None = None,
        prompt_embeds: torch.FloatTensor | None = None,
        negative_prompt_embeds: torch.FloatTensor | None = None,
        pooled_prompt_embeds: torch.FloatTensor | None = None,
        negative_pooled_prompt_embeds: torch.FloatTensor | None = None,
        output_type: str | None = "pil",
        return_dict: bool = True,
        joint_attention_kwargs: dict[str, Any] | None = None,
        clip_skip: int | None = None,
        callback_on_step_end: Callable[[int, int], None] | None = None,
        callback_on_step_end_tensor_inputs: list[str] = ["latents"],
        max_sequence_length: int = 256,
        pag_scale: float = 3.0,
        pag_adaptive_scale: float = 0.0,
    ):

Problem:
Base SD3 supports mu for dynamic shifting, IP-Adapter loading/call inputs, and (for text-to-image) skip-layer guidance; the PAG SD3 variants expose none of these APIs. They also list negative_pooled_prompt_embeds in _callback_tensor_inputs while omitting pooled_prompt_embeds, which the denoising loop actually consumes.

Impact:
SD3.5-style schedulers with use_dynamic_shifting=True fail because PAG cannot pass mu; IP-Adapter workflows cannot be used with SD3 PAG; callbacks cannot modify pooled conditioning.

Reproduction:

import inspect
from diffusers import FlowMatchEulerDiscreteScheduler, StableDiffusion3PAGPipeline, StableDiffusion3PAGImg2ImgPipeline

for cls in (StableDiffusion3PAGPipeline, StableDiffusion3PAGImg2ImgPipeline):
    params = inspect.signature(cls.__call__).parameters
    print(cls.__name__, "mu" in params, "ip_adapter_image" in params, cls._callback_tensor_inputs)

scheduler = FlowMatchEulerDiscreteScheduler(use_dynamic_shifting=True)
scheduler.set_timesteps(2)
# ValueError: `mu` must be passed when `use_dynamic_shifting` is set to be `True`

Relevant precedent:
Base SD3 implements these APIs:

    ip_adapter_image: PipelineImageInput | None = None,
    ip_adapter_image_embeds: torch.Tensor | None = None,
    output_type: str | None = "pil",
    return_dict: bool = True,
    joint_attention_kwargs: dict[str, Any] | None = None,
    clip_skip: int | None = None,
    callback_on_step_end: Callable[[int, int], None] | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
    max_sequence_length: int = 256,
    skip_guidance_layers: list[int] = None,
    skip_layer_guidance_scale: float = 2.8,
    skip_layer_guidance_stop: float = 0.2,
    skip_layer_guidance_start: float = 0.01,
    mu: float | None = None,

scheduler_kwargs = {}
if self.scheduler.config.get("use_dynamic_shifting", None) and mu is None:
    _, _, height, width = latents.shape
    image_seq_len = (height // self.transformer.config.patch_size) * (
        width // self.transformer.config.patch_size
    )
    mu = calculate_shift(
        image_seq_len,
        self.scheduler.config.get("base_image_seq_len", 256),
        self.scheduler.config.get("max_image_seq_len", 4096),
        self.scheduler.config.get("base_shift", 0.5),
        self.scheduler.config.get("max_shift", 1.16),
    )
    scheduler_kwargs["mu"] = mu
elif mu is not None:
    scheduler_kwargs["mu"] = mu

if XLA_AVAILABLE:
    timestep_device = "cpu"
else:
    timestep_device = device

timesteps, num_inference_steps = retrieve_timesteps(
    self.scheduler,
    num_inference_steps,
    timestep_device,
    sigmas=sigmas,
    **scheduler_kwargs,

No duplicate found in searches for StableDiffusion3PAGPipeline mu use_dynamic_shifting or for SD3 PAG IP-Adapter.

Suggested fix:
Port the current base SD3 __call__ API and logic into both SD3 PAG variants, then add the PAG batch expansion around the updated prompt/pooled/IP-Adapter conditioning. Include SD3IPAdapterMixin, the mu argument, and the base callback tensor contract.
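
For context, mu is the dynamic-shifting value the flow-match scheduler consumes. A self-contained sketch of the interpolation, mirroring the calculate_shift helper the base pipeline calls (default values taken from the scheduler config keys shown above):

def calculate_shift(
    image_seq_len: int,
    base_seq_len: int = 256,
    max_seq_len: int = 4096,
    base_shift: float = 0.5,
    max_shift: float = 1.16,
) -> float:
    # Linearly interpolate the timestep-schedule shift between base_shift
    # and max_shift as a function of the image token count.
    m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    b = base_shift - m * base_seq_len
    return image_seq_len * m + b

# A 1024x1024 SD3 image -> 128x128 latent, patch size 2 -> 64 * 64 = 4096 tokens
print(calculate_shift(4096))  # 1.16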

Issue 4: ControlNet PAG variants dropped guess_mode

Affected code:

    height: int | None = None,
    width: int | None = None,
    padding_mask_crop: int | None = None,
    strength: float = 1.0,
    num_inference_steps: int = 50,
    guidance_scale: float = 7.5,
    negative_prompt: str | list[str] | None = None,
    num_images_per_prompt: int | None = 1,
    eta: float = 0.0,
    generator: torch.Generator | list[torch.Generator] | None = None,
    latents: torch.Tensor | None = None,
    prompt_embeds: torch.Tensor | None = None,
    negative_prompt_embeds: torch.Tensor | None = None,
    ip_adapter_image: PipelineImageInput | None = None,
    ip_adapter_image_embeds: list[torch.Tensor] | None = None,
    output_type: str | None = "pil",
    return_dict: bool = True,
    cross_attention_kwargs: dict[str, Any] | None = None,
    controlnet_conditioning_scale: float | list[float] = 0.5,
    control_guidance_start: float | list[float] = 0.0,
    control_guidance_end: float | list[float] = 1.0,
    clip_skip: int | None = None,
    callback_on_step_end: Callable[[int, int], None] | PipelineCallback | MultiPipelineCallbacks | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
    pag_scale: float = 3.0,
    pag_adaptive_scale: float = 0.0,
):

        batch_size=batch_size * num_images_per_prompt,
        num_images_per_prompt=num_images_per_prompt,
        device=device,
        dtype=controlnet.dtype,
        crops_coords=crops_coords,
        resize_mode=resize_mode,
        do_classifier_free_guidance=self.do_classifier_free_guidance,
        guess_mode=False,
    )
elif isinstance(controlnet, MultiControlNetModel):
    control_images = []
    for control_image_ in control_image:
        control_image_ = self.prepare_control_image(
            image=control_image_,
            width=width,
            height=height,
            batch_size=batch_size * num_images_per_prompt,
            num_images_per_prompt=num_images_per_prompt,
            device=device,
            dtype=controlnet.dtype,
            crops_coords=crops_coords,
            resize_mode=resize_mode,
            do_classifier_free_guidance=self.do_classifier_free_guidance,
            guess_mode=False,
        )

def __call__(
    self,
    prompt: str | list[str] = None,
    prompt_2: str | list[str] | None = None,
    image: PipelineImageInput = None,
    height: int | None = None,
    width: int | None = None,
    num_inference_steps: int = 50,
    timesteps: list[int] = None,
    sigmas: list[float] = None,
    denoising_end: float | None = None,
    guidance_scale: float = 5.0,
    negative_prompt: str | list[str] | None = None,
    negative_prompt_2: str | list[str] | None = None,
    num_images_per_prompt: int | None = 1,
    eta: float = 0.0,
    generator: torch.Generator | list[torch.Generator] | None = None,
    latents: torch.Tensor | None = None,
    prompt_embeds: torch.Tensor | None = None,
    negative_prompt_embeds: torch.Tensor | None = None,
    pooled_prompt_embeds: torch.Tensor | None = None,
    negative_pooled_prompt_embeds: torch.Tensor | None = None,
    ip_adapter_image: PipelineImageInput | None = None,
    ip_adapter_image_embeds: list[torch.Tensor] | None = None,
    output_type: str | None = "pil",
    return_dict: bool = True,
    cross_attention_kwargs: dict[str, Any] | None = None,
    controlnet_conditioning_scale: float | list[float] = 1.0,
    control_guidance_start: float | list[float] = 0.0,
    control_guidance_end: float | list[float] = 1.0,
    original_size: tuple[int, int] = None,
    crops_coords_top_left: tuple[int, int] = (0, 0),
    target_size: tuple[int, int] = None,
    negative_original_size: tuple[int, int] | None = None,
    negative_crops_coords_top_left: tuple[int, int] = (0, 0),
    negative_target_size: tuple[int, int] | None = None,
    clip_skip: int | None = None,
    callback_on_step_end: Callable[[int, int], None] | PipelineCallback | MultiPipelineCallbacks | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
    pag_scale: float = 3.0,
    pag_adaptive_scale: float = 0.0,
):

down_block_res_samples, mid_block_res_sample = self.controlnet(
    control_model_input,
    t,
    encoder_hidden_states=controlnet_prompt_embeds,
    controlnet_cond=image,
    conditioning_scale=cond_scale,
    guess_mode=False,
    added_cond_kwargs=controlnet_added_cond_kwargs,

Problem:
StableDiffusionControlNetPAGInpaintPipeline and StableDiffusionXLControlNetPAGPipeline omit the public guess_mode argument and hard-code guess_mode=False in image preparation and ControlNet forward. They also skip the base pipeline’s global_pool_conditions fallback.

Impact:
ControlNet checkpoints that require guess mode/global pooling cannot be used through these PAG variants, even though the corresponding base pipelines support the feature.

Reproduction:

import inspect
from diffusers import StableDiffusionControlNetPAGInpaintPipeline, StableDiffusionXLControlNetPAGPipeline

for cls in (StableDiffusionControlNetPAGInpaintPipeline, StableDiffusionXLControlNetPAGPipeline):
    print(cls.__name__, "guess_mode" in inspect.signature(cls.__call__).parameters)
# False for both

Relevant precedent:
Base inpaint and SDXL ControlNet expose and normalize guess_mode:

    controlnet_conditioning_scale: float | list[float] = 0.5,
    guess_mode: bool = False,
    control_guidance_start: float | list[float] = 0.0,
    control_guidance_end: float | list[float] = 1.0,
    clip_skip: int | None = None,
    callback_on_step_end: Callable[[int, int], None] | PipelineCallback | MultiPipelineCallbacks | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
    **kwargs,

    cross_attention_kwargs: dict[str, Any] | None = None,
    controlnet_conditioning_scale: float | list[float] = 1.0,
    guess_mode: bool = False,
    control_guidance_start: float | list[float] = 0.0,
    control_guidance_end: float | list[float] = 1.0,
    original_size: tuple[int, int] = None,
    crops_coords_top_left: tuple[int, int] = (0, 0),
    target_size: tuple[int, int] = None,

A search surfaced the old base SDXL guess-mode issue #4709, but it is closed and not a duplicate of these PAG omissions.

Suggested fix:

# __call__ signature
guess_mode: bool = False,

# after resolving controlnet
global_pool_conditions = (
    controlnet.config.global_pool_conditions
    if isinstance(controlnet, ControlNetModel)
    else controlnet.nets[0].config.global_pool_conditions
)
guess_mode = guess_mode or global_pool_conditions

# replace every hard-coded guess_mode=False with guess_mode=guess_mode
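
Once exposed, guess mode would be requested exactly as in the base pipelines. A hypothetical usage sketch after the fix (checkpoints are the standard public ones; guess_mode is the newly added argument):

import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPAGPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPAGPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

canny_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny.png"
)
image = pipe(
    "a colorful bird",
    image=canny_image,
    guess_mode=True,  # previously hard-coded to False
    pag_scale=3.0,
).images[0]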

Issue 5: SanaPAGPipeline lost Sana LoRA and attention kwargs support

Affected code:

class SanaPAGPipeline(DiffusionPipeline, PAGMixin):
    r"""
    Pipeline for text-to-image generation using [Sana](https://huggingface.co/papers/2410.10629). This pipeline
    supports the use of [Perturbed Attention Guidance
    (PAG)](https://huggingface.co/docs/diffusers/main/en/using-diffusers/pag).
    """

    # fmt: off
    bad_punct_regex = re.compile(r"[" + "#®•©™&@·º½¾¿¡§~" + r"\)" + r"\(" + r"\]" + r"\[" + r"\}" + r"\{" + r"\|" + "\\" + r"\/" + r"\*" + r"]{1,}")
    # fmt: on

    model_cpu_offload_seq = "text_encoder->transformer->vae"
    _callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds"]

def __call__(
    self,
    prompt: str | list[str] = None,
    negative_prompt: str = "",
    num_inference_steps: int = 20,
    timesteps: list[int] = None,
    sigmas: list[float] = None,
    guidance_scale: float = 4.5,
    num_images_per_prompt: int | None = 1,
    height: int = 1024,
    width: int = 1024,
    eta: float = 0.0,
    generator: torch.Generator | list[torch.Generator] | None = None,
    latents: torch.Tensor | None = None,
    prompt_embeds: torch.Tensor | None = None,
    prompt_attention_mask: torch.Tensor | None = None,
    negative_prompt_embeds: torch.Tensor | None = None,
    negative_prompt_attention_mask: torch.Tensor | None = None,
    output_type: str | None = "pil",
    return_dict: bool = True,
    clean_caption: bool = False,
    use_resolution_binning: bool = True,
    callback_on_step_end: Callable[[int, int], None] | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
    max_sequence_length: int = 300,
    complex_human_instruction: list[str] = [
        "Given a user prompt, generate an 'Enhanced prompt' that provides detailed visual descriptions suitable for image generation. Evaluate the level of detail in the user prompt:",
        "- If the prompt is simple, focus on adding specifics about colors, shapes, sizes, textures, and spatial relationships to create vivid and concrete scenes.",
        "- If the prompt is already detailed, refine and enhance the existing details slightly without overcomplicating.",
        "Here are examples of how to transform or refine prompts:",
        "- User Prompt: A cat sleeping -> Enhanced: A small, fluffy white cat curled up in a round shape, sleeping peacefully on a warm sunny windowsill, surrounded by pots of blooming red flowers.",
        "- User Prompt: A busy city street -> Enhanced: A bustling city street scene at dusk, featuring glowing street lamps, a diverse crowd of people in colorful clothing, and a double-decker bus passing by towering glass skyscrapers.",
        "Please generate only the enhanced description for the prompt below and avoid including any additional commentary or evaluations:",
        "User Prompt: ",
    ],
    pag_scale: float = 3.0,
    pag_adaptive_scale: float = 0.0,
) -> ImagePipelineOutput | tuple:

noise_pred = self.transformer(
    latent_model_input,
    encoder_hidden_states=prompt_embeds,
    encoder_attention_mask=prompt_attention_mask,
    timestep=timestep,

Problem:
Base SanaPipeline inherits SanaLoraLoaderMixin, accepts attention_kwargs, uses attention_kwargs["scale"] for prompt encoding, and forwards the kwargs into the transformer. SanaPAGPipeline inherits only DiffusionPipeline and PAGMixin, and has no attention_kwargs argument or forwarding.

Impact:
Users cannot load or scale Sana LoRAs with the PAG pipeline, despite Sana supporting them and the PAG API page carrying a LoRA badge.

Reproduction:

import inspect
from diffusers import SanaPipeline, SanaPAGPipeline

print(hasattr(SanaPipeline, "load_lora_weights"), hasattr(SanaPAGPipeline, "load_lora_weights"))
print("attention_kwargs" in inspect.signature(SanaPipeline.__call__).parameters)
print("attention_kwargs" in inspect.signature(SanaPAGPipeline.__call__).parameters)
# True False
# True
# False

Relevant precedent:

class SanaPipeline(DiffusionPipeline, SanaLoraLoaderMixin):

    def __call__(
        self,
        prompt: str | list[str] = None,
        negative_prompt: str = "",
        num_inference_steps: int = 20,
        timesteps: list[int] = None,
        sigmas: list[float] = None,
        guidance_scale: float = 4.5,
        num_images_per_prompt: int | None = 1,
        height: int = 1024,
        width: int = 1024,
        eta: float = 0.0,
        generator: torch.Generator | list[torch.Generator] | None = None,
        latents: torch.Tensor | None = None,
        prompt_embeds: torch.Tensor | None = None,
        prompt_attention_mask: torch.Tensor | None = None,
        negative_prompt_embeds: torch.Tensor | None = None,
        negative_prompt_attention_mask: torch.Tensor | None = None,
        output_type: str | None = "pil",
        return_dict: bool = True,
        clean_caption: bool = False,
        use_resolution_binning: bool = True,
        attention_kwargs: dict[str, Any] | None = None,
        callback_on_step_end: Callable[[int, int], None] | None = None,
        callback_on_step_end_tensor_inputs: list[str] = ["latents"],

        self._guidance_scale = guidance_scale
        self._attention_kwargs = attention_kwargs
        self._interrupt = False

        # 2. Default height and width to transformer
        if prompt is not None and isinstance(prompt, str):
            batch_size = 1
        elif prompt is not None and isinstance(prompt, list):
            batch_size = len(prompt)
        else:
            batch_size = prompt_embeds.shape[0]

        device = self._execution_device
        lora_scale = self.attention_kwargs.get("scale", None) if self.attention_kwargs is not None else None

A related SanaPAG quality issue (#10241) exists, but it is not a duplicate of this LoRA/attention_kwargs API gap.

Suggested fix:

class SanaPAGPipeline(DiffusionPipeline, SanaLoraLoaderMixin, PAGMixin):
    ...

def __call__(..., attention_kwargs: dict[str, Any] | None = None, ...):
    self._attention_kwargs = attention_kwargs
    lora_scale = self.attention_kwargs.get("scale", None) if self.attention_kwargs is not None else None
    ...
    noise_pred = self.transformer(..., attention_kwargs=self.attention_kwargs, return_dict=False)[0]
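
With the mixin in place, LoRA usage would match the base Sana pipeline. A hypothetical usage sketch after the fix (the base checkpoint is the public Sana release; the LoRA repo ID is illustrative):

import torch
from diffusers import SanaPAGPipeline

pipe = SanaPAGPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_diffusers",
    torch_dtype=torch.float16,
    pag_applied_layers=["transformer_blocks.8"],
).to("cuda")

# Illustrative LoRA repo; any Sana LoRA in diffusers format works the same way.
pipe.load_lora_weights("some-user/sana-style-lora")

image = pipe(
    "a watercolor painting of a fox",
    pag_scale=2.0,
    attention_kwargs={"scale": 0.8},  # LoRA scale, forwarded into the transformer
).images[0]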

Issue 6: PAG docs have broken examples and omit Sana API docs

Affected code:

pag_scales = 4.0
guidance_scales = 7.0
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png"
init_image = load_image(url)
prompt = "a dog catching a frisbee in the jungle"
generator = torch.Generator(device="cpu").manual_seed(0)
image = pipeline(
    prompt,
    image=init_image,
    strength=0.8,
    guidance_scale=guidance_scale,
    pag_scale=pag_scale,
    generator=generator,
).images[0]

pipeline_t2i = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
pipeline = AutoPipelineForInpaiting.from_pipe(pipeline_t2i, enable_pag=True)
```
Let's generate an image!
```py
img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
init_image = load_image(img_url).convert("RGB")
mask_image = load_image(mask_url).convert("RGB")
prompt = "A majestic tiger sitting on a bench"
pag_scales = 3.0
guidance_scales = 7.5
generator = torch.Generator(device="cpu").manual_seed(1)
images = pipeline(
    prompt=prompt,
    image=init_image,
    mask_image=mask_image,
    strength=0.8,
    num_inference_steps=50,
    guidance_scale=guidance_scale,
    generator=generator,
    pag_scale=pag_scale,

## AnimateDiffPAGPipeline
[[autodoc]] AnimateDiffPAGPipeline
  - all
  - __call__

## HunyuanDiTPAGPipeline
[[autodoc]] HunyuanDiTPAGPipeline
  - all
  - __call__

## KolorsPAGPipeline
[[autodoc]] KolorsPAGPipeline
  - all
  - __call__

## StableDiffusionPAGInpaintPipeline
[[autodoc]] StableDiffusionPAGInpaintPipeline
  - all
  - __call__

## StableDiffusionPAGPipeline
[[autodoc]] StableDiffusionPAGPipeline
  - all
  - __call__

## StableDiffusionPAGImg2ImgPipeline
[[autodoc]] StableDiffusionPAGImg2ImgPipeline
  - all
  - __call__

## StableDiffusionControlNetPAGPipeline
[[autodoc]] StableDiffusionControlNetPAGPipeline

## StableDiffusionControlNetPAGInpaintPipeline
[[autodoc]] StableDiffusionControlNetPAGInpaintPipeline
  - all
  - __call__

## StableDiffusionXLPAGPipeline
[[autodoc]] StableDiffusionXLPAGPipeline
  - all
  - __call__

## StableDiffusionXLPAGImg2ImgPipeline
[[autodoc]] StableDiffusionXLPAGImg2ImgPipeline
  - all
  - __call__

## StableDiffusionXLPAGInpaintPipeline
[[autodoc]] StableDiffusionXLPAGInpaintPipeline
  - all
  - __call__

## StableDiffusionXLControlNetPAGPipeline
[[autodoc]] StableDiffusionXLControlNetPAGPipeline
  - all
  - __call__

## StableDiffusionXLControlNetPAGImg2ImgPipeline
[[autodoc]] StableDiffusionXLControlNetPAGImg2ImgPipeline
  - all
  - __call__

## StableDiffusion3PAGPipeline
[[autodoc]] StableDiffusion3PAGPipeline
  - all
  - __call__

## StableDiffusion3PAGImg2ImgPipeline
[[autodoc]] StableDiffusion3PAGImg2ImgPipeline
  - all
  - __call__

## PixArtSigmaPAGPipeline
[[autodoc]] PixArtSigmaPAGPipeline
  - all
  - __call__

Problem:
The guide defines pag_scales/guidance_scales but then passes pag_scale/guidance_scale, references the misspelled AutoPipelineForInpaiting (the class is AutoPipelineForInpainting), and uses pipeline_t2i where the preceding snippet defines pipeline_pag. The API page never documents SanaPAGPipeline even though the code exports it.

Impact:
Copy-pasted PAG guide snippets fail before generation, and Sana PAG users cannot find the API reference.

Reproduction:

from pathlib import Path

guide = Path("docs/source/en/using-diffusers/pag.md").read_text()
api = Path("docs/source/en/api/pipelines/pag.md").read_text()
# Each assertion below currently fails (execution stops at the first).
assert "AutoPipelineForInpaiting" not in guide
assert "pag_scales" not in guide or "pag_scale=pag_scale" not in guide
assert "## SanaPAGPipeline" in api

Relevant precedent:
No duplicate found for AutoPipelineForInpaiting pag_scale guidance_scale PAG docs.

Suggested fix:

pag_scale = 4.0
guidance_scale = 7.0
...
pipeline = AutoPipelineForInpainting.from_pipe(pipeline_t2i, enable_pag=True)

Also add:

## SanaPAGPipeline
[[autodoc]] SanaPAGPipeline
  - all
  - __call__

Issue 7: Slow coverage is missing for most PAG variants

Affected code:

class AnimateDiffPAGPipelineFastTests(

class StableDiffusionControlNetPAGPipelineFastTests(

class StableDiffusionControlNetPAGInpaintPipelineFastTests(

class StableDiffusionXLControlNetPAGPipelineFastTests(

class StableDiffusionXLControlNetPAGImg2ImgPipelineFastTests(

class HunyuanDiTPAGPipelineFastTests(PipelineTesterMixin, unittest.TestCase):

class KolorsPAGPipelineFastTests(

class PixArtSigmaPAGPipelineFastTests(PipelineTesterMixin, unittest.TestCase):

class SanaPAGPipelineFastTests(PipelineTesterMixin, unittest.TestCase):

class StableDiffusion3PAGPipelineFastTests(unittest.TestCase, PipelineTesterMixin):

Problem:
Fast tests exist across the family, but these PAG variants have no @slow integration tests. The repository review rules explicitly require missing slow tests to be reported.

Impact:
Real checkpoint/API drift is not covered for many PAG pipelines, including the stale SD3/Sana/ControlNet gaps above.

Reproduction:

from pathlib import Path

missing = [
    p.as_posix()
    for p in sorted(Path("tests/pipelines/pag").glob("test_pag_*.py"))
    if "@slow" not in p.read_text(encoding="utf-8")
]
print("\n".join(missing))

Relevant precedent:
Existing slow PAG tests are present for SD, SD img2img/inpaint, SDXL, SDXL img2img/inpaint, and SD3 img2img.

Suggested fix:
Add at least one @slow smoke/integration class per missing pipeline using the smallest stable public checkpoint available, covering pag_scale=0 parity and pag_scale>0 execution.
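
A minimal skeleton of such a class, modeled on the existing slow PAG tests (the checkpoint ID and pipeline mapping are illustrative and should be adapted per variant):

import gc
import unittest

import torch

from diffusers import AutoPipelineForText2Image
from diffusers.utils.testing_utils import require_torch_gpu, slow, torch_device


@slow
@require_torch_gpu
class SanaPAGPipelineIntegrationTests(unittest.TestCase):
    # Illustrative checkpoint; substitute the smallest stable public one.
    repo_id = "Efficient-Large-Model/Sana_600M_512px_diffusers"

    def tearDown(self):
        super().tearDown()
        gc.collect()
        torch.cuda.empty_cache()

    def test_pag_scale_zero_and_positive(self):
        pipe = AutoPipelineForText2Image.from_pretrained(
            self.repo_id, enable_pag=True, torch_dtype=torch.float16
        ).to(torch_device)
        # pag_scale=0 exercises the no-perturbation path; a fuller test would
        # compare this output against the base pipeline for numerical parity.
        generator = torch.Generator(device="cpu").manual_seed(0)
        out_off = pipe("a photo of a cat", num_inference_steps=2, pag_scale=0.0,
                       generator=generator, output_type="np").images
        # pag_scale>0 must at least execute end to end.
        generator = torch.Generator(device="cpu").manual_seed(0)
        out_on = pipe("a photo of a cat", num_inference_steps=2, pag_scale=3.0,
                      generator=generator, output_type="np").images
        assert out_off.shape == out_on.shape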
