PAG model/pipeline review
Commit tested: 0f1abc4ae8b0eb2a3b40e82a310507281144c423
Review performed against the repository review rules. Reviewed PAG public exports/lazy imports, pipeline/runtime behavior, related base-pipeline consistency, docs/examples, and tests/pipelines/pag.
Issue 1: pag_applied_layers="blocks.1" also matches blocks.10
Affected code:
def is_fake_integral_match(layer_id, name):
    layer_id = layer_id.split(".")[-1]
    name = name.split(".")[-1]
    return layer_id.isnumeric() and name.isnumeric() and layer_id == name

for layer_id in pag_applied_layers:
    # for each PAG layer input, we find corresponding self-attention layers in the unet model
    target_modules = []

    for name, module in model.named_modules():
        # Identify the following simple cases:
        # (1) Self Attention layer existing
        # (2) Whether the module name matches pag layer id even partially
        # (3) Make sure it's not a fake integral match if the layer_id ends with a number
        # For example, blocks.1, blocks.10 should be differentiable if layer_id="blocks.1"
        if (
            is_self_attn(module)
            and re.search(layer_id, name) is not None
            and not is_fake_integral_match(layer_id, name)
        ):
Problem:
re.search(layer_id, name) is used directly, and the numeric disambiguation only compares the last dot-separated tokens. For names like blocks.10.attn1, the last token is attn1, so blocks.1 still matches blocks.10. This contradicts the nearby comment and over-applies PAG to unintended transformer blocks.
Impact:
Users selecting a single DiT block can silently perturb additional blocks, changing quality/performance and making layer ablations unreliable.
Reproduction:
import torch.nn as nn
from diffusers.models.attention_processor import Attention
from diffusers.pipelines.pag.pag_utils import PAGMixin

class TinyTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Module() for _ in range(11)])
        for block in self.blocks:
            block.attn1 = Attention(query_dim=8, heads=1, dim_head=8)

    @property
    def attn_processors(self):
        return {f"{n}.processor": m.processor for n, m in self.named_modules() if isinstance(m, Attention)}

class Dummy(PAGMixin):
    def __init__(self):
        self.transformer = TinyTransformer()
        self.set_pag_applied_layers(["blocks.1"])

pipe = Dummy()
pipe._set_pag_attn_processor(pipe.pag_applied_layers, do_classifier_free_guidance=False)
print(sorted(pipe.pag_attn_processors))
# Includes both blocks.1.attn1.processor and blocks.10.attn1.processor
Relevant precedent:
No duplicate found in GitHub issue/PR searches for PAG pag_applied_layers blocks.1 blocks.10.
Suggested fix:
match = re.search(layer_id, name)
if match is None:
    continue
# Reject matches where the pattern ends in a digit and the very next character
# of the name is also a digit, e.g. "blocks.1" matched inside "blocks.10".
if layer_id[-1].isdigit() and match.end() < len(name) and name[match.end()].isdigit():
    continue
if is_self_attn(module):
    target_modules.append(module)
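The boundary check can be exercised in isolation. This sketch wraps the proposed logic in a small predicate (a hypothetical helper, named here only for illustration) and feeds it the problematic names:

```python
import re

def matches(layer_id, name):
    # Proposed matching rule: accept the regex match only if it does not end
    # in the middle of a larger number in the module name.
    match = re.search(layer_id, name)
    if match is None:
        return False
    if layer_id[-1].isdigit() and match.end() < len(name) and name[match.end()].isdigit():
        return False
    return True

print(matches("blocks.1", "blocks.1.attn1"))   # True  -> intended block
print(matches("blocks.1", "blocks.10.attn1"))  # False -> no longer over-matched
print(matches("blocks.1", "blocks.11.attn1"))  # False
```

Unlike the last-token comparison, this works for any submodule suffix because it inspects the character immediately after the regex match.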
Issue 2: PAG processors are not restored if generation raises
Affected code:
if self.do_perturbed_attention_guidance:
    original_attn_proc = self.unet.attn_processors
    self._set_pag_attn_processor(
        pag_applied_layers=self.pag_applied_layers,
        do_classifier_free_guidance=self.do_classifier_free_guidance,
    )

# Offload all models
self.maybe_free_model_hooks()

if self.do_perturbed_attention_guidance:
    self.unet.set_attn_processor(original_attn_proc)
Problem:
PAG pipelines save the original attention processors before the denoising loop and restore them only on the normal success path. If a callback, scheduler, VAE, or user interrupt raises after _set_pag_attn_processor, the model keeps PAG processors installed.
Impact:
A later pag_scale=0 call can fail or produce wrong results because the UNet/transformer still expects PAG-expanded batches.
Reproduction:
import torch
from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, DDIMScheduler, StableDiffusionPAGPipeline, UNet2DConditionModel

unet = UNet2DConditionModel(
    block_out_channels=(4, 8), layers_per_block=2, sample_size=32, in_channels=4, out_channels=4,
    down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"), up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
    cross_attention_dim=8, norm_num_groups=2,
)
vae = AutoencoderKL(
    block_out_channels=[4, 8], in_channels=3, out_channels=3, down_block_types=["DownEncoderBlock2D"] * 2,
    up_block_types=["UpDecoderBlock2D"] * 2, latent_channels=4, norm_num_groups=2,
)
text_encoder = CLIPTextModel(CLIPTextConfig(
    bos_token_id=0, eos_token_id=2, hidden_size=8, intermediate_size=16,
    num_attention_heads=2, num_hidden_layers=2, pad_token_id=1, vocab_size=1000,
))
pipe = StableDiffusionPAGPipeline(
    unet=unet, scheduler=DDIMScheduler(), vae=vae, text_encoder=text_encoder,
    tokenizer=CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip"),
    safety_checker=None, feature_extractor=None,
)
pipe.set_progress_bar_config(disable=True)
try:
    pipe("x", num_inference_steps=1, guidance_scale=5, pag_scale=1, output_type="latent",
         callback_on_step_end=lambda *a, **k: (_ for _ in ()).throw(RuntimeError("boom")))
except RuntimeError:
    pass
pipe("x", num_inference_steps=1, guidance_scale=5, pag_scale=0, output_type="latent")
# ValueError: not enough values to unpack (expected 3, got 2)
Relevant precedent:
No duplicate found for PAG callback exception attention processor restore.
Suggested fix:
original_attn_proc = None
try:
    if self.do_perturbed_attention_guidance:
        original_attn_proc = self.unet.attn_processors
        self._set_pag_attn_processor(self.pag_applied_layers, self.do_classifier_free_guidance)
    # denoise/decode body
finally:
    if original_attn_proc is not None:
        self.unet.set_attn_processor(original_attn_proc)
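The same guarantee can be packaged as a reusable guard. This is a sketch of an alternative shape for the fix, assuming a hypothetical helper (not a diffusers API) and a fake UNet stub so it runs standalone:

```python
from contextlib import contextmanager

@contextmanager
def restored_attn_processors(model):
    # Snapshot whatever processors are installed before PAG swaps them out,
    # and restore them even if the body raises.
    original = model.attn_processors
    try:
        yield
    finally:
        model.set_attn_processor(original)

# Minimal stand-in for a UNet/transformer, for demonstration only.
class FakeUNet:
    def __init__(self):
        self._procs = {"down.attn1.processor": "default"}
    @property
    def attn_processors(self):
        return dict(self._procs)
    def set_attn_processor(self, procs):
        self._procs = dict(procs)

unet = FakeUNet()
try:
    with restored_attn_processors(unet):
        unet.set_attn_processor({"down.attn1.processor": "pag"})
        raise RuntimeError("boom")  # simulates a failing step callback
except RuntimeError:
    pass
print(unet.attn_processors)  # original processors are back
```

A context manager would also keep the many PAG pipelines from each re-implementing the try/finally by hand.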
Issue 3: SD3 PAG pipelines are stale versus base SD3
Affected code:
class StableDiffusion3PAGPipeline(DiffusionPipeline, SD3LoraLoaderMixin, FromSingleFileMixin, PAGMixin):
    r"""
    [PAG pipeline](https://huggingface.co/docs/diffusers/main/en/using-diffusers/pag) for text-to-image generation
    using Stable Diffusion 3.

    Args:
        transformer ([`SD3Transformer2DModel`]):
            Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
        scheduler ([`FlowMatchEulerDiscreteScheduler`]):
            A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
        vae ([`AutoencoderKL`]):
            Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
        text_encoder ([`CLIPTextModelWithProjection`]):
            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
            specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant,
            with an additional added projection layer that is initialized with a diagonal matrix with the `hidden_size`
            as its dimension.
        text_encoder_2 ([`CLIPTextModelWithProjection`]):
            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
            specifically the
            [laion/CLIP-ViT-bigG-14-laion2B-39B-b160k](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)
            variant.
        text_encoder_3 ([`T5EncoderModel`]):
            Frozen text-encoder. Stable Diffusion 3 uses
            [T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5EncoderModel), specifically the
            [t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl) variant.
        tokenizer (`CLIPTokenizer`):
            Tokenizer of class
            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
        tokenizer_2 (`CLIPTokenizer`):
            Second Tokenizer of class
            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
        tokenizer_3 (`T5TokenizerFast`):
            Tokenizer of class
            [T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
    """

    model_cpu_offload_seq = "text_encoder->text_encoder_2->text_encoder_3->transformer->vae"
    _optional_components = []
    _callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds", "negative_pooled_prompt_embeds"]

    def __call__(
        self,
        prompt: str | list[str] = None,
        prompt_2: str | list[str] | None = None,
        prompt_3: str | list[str] | None = None,
        height: int | None = None,
        width: int | None = None,
        num_inference_steps: int = 28,
        sigmas: list[float] | None = None,
        guidance_scale: float = 7.0,
        negative_prompt: str | list[str] | None = None,
        negative_prompt_2: str | list[str] | None = None,
        negative_prompt_3: str | list[str] | None = None,
        num_images_per_prompt: int | None = 1,
        generator: torch.Generator | list[torch.Generator] | None = None,
        latents: torch.FloatTensor | None = None,
        prompt_embeds: torch.FloatTensor | None = None,
        negative_prompt_embeds: torch.FloatTensor | None = None,
        pooled_prompt_embeds: torch.FloatTensor | None = None,
        negative_pooled_prompt_embeds: torch.FloatTensor | None = None,
        output_type: str | None = "pil",
        return_dict: bool = True,
        joint_attention_kwargs: dict[str, Any] | None = None,
        clip_skip: int | None = None,
        callback_on_step_end: Callable[[int, int], None] | None = None,
        callback_on_step_end_tensor_inputs: list[str] = ["latents"],
        max_sequence_length: int = 256,
        pag_scale: float = 3.0,
        pag_adaptive_scale: float = 0.0,
    ):

class StableDiffusion3PAGImg2ImgPipeline(DiffusionPipeline, SD3LoraLoaderMixin, FromSingleFileMixin, PAGMixin):
    r"""
    [PAG pipeline](https://huggingface.co/docs/diffusers/main/en/using-diffusers/pag) for image-to-image generation
    using Stable Diffusion 3.

    Args:
        transformer ([`SD3Transformer2DModel`]):
            Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
        scheduler ([`FlowMatchEulerDiscreteScheduler`]):
            A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
        vae ([`AutoencoderKL`]):
            Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
        text_encoder ([`CLIPTextModelWithProjection`]):
            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
            specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant,
            with an additional added projection layer that is initialized with a diagonal matrix with the `hidden_size`
            as its dimension.
        text_encoder_2 ([`CLIPTextModelWithProjection`]):
            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
            specifically the
            [laion/CLIP-ViT-bigG-14-laion2B-39B-b160k](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)
            variant.
        text_encoder_3 ([`T5EncoderModel`]):
            Frozen text-encoder. Stable Diffusion 3 uses
            [T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5EncoderModel), specifically the
            [t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl) variant.
        tokenizer (`CLIPTokenizer`):
            Tokenizer of class
            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
        tokenizer_2 (`CLIPTokenizer`):
            Second Tokenizer of class
            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
        tokenizer_3 (`T5TokenizerFast`):
            Tokenizer of class
            [T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
    """

    model_cpu_offload_seq = "text_encoder->text_encoder_2->text_encoder_3->transformer->vae"
    _optional_components = []
    _callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds", "negative_pooled_prompt_embeds"]

        negative_prompt: str | list[str] | None = None,
        negative_prompt_2: str | list[str] | None = None,
        negative_prompt_3: str | list[str] | None = None,
        num_images_per_prompt: int | None = 1,
        generator: torch.Generator | list[torch.Generator] | None = None,
        latents: torch.FloatTensor | None = None,
        prompt_embeds: torch.FloatTensor | None = None,
        negative_prompt_embeds: torch.FloatTensor | None = None,
        pooled_prompt_embeds: torch.FloatTensor | None = None,
        negative_pooled_prompt_embeds: torch.FloatTensor | None = None,
        output_type: str | None = "pil",
        return_dict: bool = True,
        joint_attention_kwargs: dict[str, Any] | None = None,
        clip_skip: int | None = None,
        callback_on_step_end: Callable[[int, int], None] | None = None,
        callback_on_step_end_tensor_inputs: list[str] = ["latents"],
        max_sequence_length: int = 256,
        pag_scale: float = 3.0,
        pag_adaptive_scale: float = 0.0,
    ):
Problem:
Base SD3 supports mu for dynamic shifting, SD3 IP-Adapter loading/call inputs, and text-to-image skip-layer guidance. The PAG SD3 variants do not expose those APIs. They also expose callback tensor inputs for negative pooled embeds while omitting the consumed pooled_prompt_embeds.
Impact:
SD3.5-style schedulers with use_dynamic_shifting=True fail because PAG cannot pass mu; IP-Adapter workflows cannot be used with SD3 PAG; callbacks cannot modify pooled conditioning.
Reproduction:
import inspect
from diffusers import FlowMatchEulerDiscreteScheduler, StableDiffusion3PAGPipeline, StableDiffusion3PAGImg2ImgPipeline

for cls in (StableDiffusion3PAGPipeline, StableDiffusion3PAGImg2ImgPipeline):
    params = inspect.signature(cls.__call__).parameters
    print(cls.__name__, "mu" in params, "ip_adapter_image" in params, cls._callback_tensor_inputs)

scheduler = FlowMatchEulerDiscreteScheduler(use_dynamic_shifting=True)
scheduler.set_timesteps(2)
# ValueError: `mu` must be passed when `use_dynamic_shifting` is set to be `True`
Relevant precedent:
Base SD3 implements these APIs:
ip_adapter_image: PipelineImageInput | None = None,
ip_adapter_image_embeds: torch.Tensor | None = None,
output_type: str | None = "pil",
return_dict: bool = True,
joint_attention_kwargs: dict[str, Any] | None = None,
clip_skip: int | None = None,
callback_on_step_end: Callable[[int, int], None] | None = None,
callback_on_step_end_tensor_inputs: list[str] = ["latents"],
max_sequence_length: int = 256,
skip_guidance_layers: list[int] = None,
skip_layer_guidance_scale: float = 2.8,
skip_layer_guidance_stop: float = 0.2,
skip_layer_guidance_start: float = 0.01,
mu: float | None = None,

scheduler_kwargs = {}
if self.scheduler.config.get("use_dynamic_shifting", None) and mu is None:
    _, _, height, width = latents.shape
    image_seq_len = (height // self.transformer.config.patch_size) * (
        width // self.transformer.config.patch_size
    )
    mu = calculate_shift(
        image_seq_len,
        self.scheduler.config.get("base_image_seq_len", 256),
        self.scheduler.config.get("max_image_seq_len", 4096),
        self.scheduler.config.get("base_shift", 0.5),
        self.scheduler.config.get("max_shift", 1.16),
    )
    scheduler_kwargs["mu"] = mu
elif mu is not None:
    scheduler_kwargs["mu"] = mu

if XLA_AVAILABLE:
    timestep_device = "cpu"
else:
    timestep_device = device

timesteps, num_inference_steps = retrieve_timesteps(
    self.scheduler,
    num_inference_steps,
    timestep_device,
    sigmas=sigmas,
    **scheduler_kwargs,
No duplicate found for StableDiffusion3PAGPipeline mu use_dynamic_shifting or SD3 PAG IP-Adapter searches.
Suggested fix:
Port the current base SD3 __call__ API and logic into both SD3 PAG variants, then add the PAG batch expansion around the updated prompt/pooled/IP-Adapter conditioning. Include SD3IPAdapterMixin, mu, and the base callback tensor contract.
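For reference, the dynamic-shifting value the PAG variants currently cannot compute is a simple linear interpolation over the image token count. This sketch mirrors the shape of diffusers' calculate_shift (an assumption: constants are the scheduler-config defaults quoted in the precedent, and the function is re-implemented here so it runs standalone):

```python
def calculate_shift(image_seq_len, base_seq_len=256, max_seq_len=4096,
                    base_shift=0.5, max_shift=1.16):
    # mu grows linearly from base_shift to max_shift as the image token
    # count grows from base_seq_len to max_seq_len.
    m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    b = base_shift - m * base_seq_len
    return image_seq_len * m + b

# A 1024x1024 SD3 latent (128x128) with patch_size=2 gives 64*64 = 4096 tokens.
print(round(calculate_shift(4096), 2))
```

Without this value forwarded as `mu`, a scheduler configured with use_dynamic_shifting=True has nothing to shift with, which is exactly the failure in the reproduction above.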
Issue 4: ControlNet PAG variants dropped guess_mode
Affected code:
height: int | None = None,
width: int | None = None,
padding_mask_crop: int | None = None,
strength: float = 1.0,
num_inference_steps: int = 50,
guidance_scale: float = 7.5,
negative_prompt: str | list[str] | None = None,
num_images_per_prompt: int | None = 1,
eta: float = 0.0,
generator: torch.Generator | list[torch.Generator] | None = None,
latents: torch.Tensor | None = None,
prompt_embeds: torch.Tensor | None = None,
negative_prompt_embeds: torch.Tensor | None = None,
ip_adapter_image: PipelineImageInput | None = None,
ip_adapter_image_embeds: list[torch.Tensor] | None = None,
output_type: str | None = "pil",
return_dict: bool = True,
cross_attention_kwargs: dict[str, Any] | None = None,
controlnet_conditioning_scale: float | list[float] = 0.5,
control_guidance_start: float | list[float] = 0.0,
control_guidance_end: float | list[float] = 1.0,
clip_skip: int | None = None,
callback_on_step_end: Callable[[int, int], None] | PipelineCallback | MultiPipelineCallbacks | None = None,
callback_on_step_end_tensor_inputs: list[str] = ["latents"],
pag_scale: float = 3.0,
pag_adaptive_scale: float = 0.0,
):

    batch_size=batch_size * num_images_per_prompt,
    num_images_per_prompt=num_images_per_prompt,
    device=device,
    dtype=controlnet.dtype,
    crops_coords=crops_coords,
    resize_mode=resize_mode,
    do_classifier_free_guidance=self.do_classifier_free_guidance,
    guess_mode=False,
)
elif isinstance(controlnet, MultiControlNetModel):
    control_images = []

    for control_image_ in control_image:
        control_image_ = self.prepare_control_image(
            image=control_image_,
            width=width,
            height=height,
            batch_size=batch_size * num_images_per_prompt,
            num_images_per_prompt=num_images_per_prompt,
            device=device,
            dtype=controlnet.dtype,
            crops_coords=crops_coords,
            resize_mode=resize_mode,
            do_classifier_free_guidance=self.do_classifier_free_guidance,
            guess_mode=False,
        )

def __call__(
    self,
    prompt: str | list[str] = None,
    prompt_2: str | list[str] | None = None,
    image: PipelineImageInput = None,
    height: int | None = None,
    width: int | None = None,
    num_inference_steps: int = 50,
    timesteps: list[int] = None,
    sigmas: list[float] = None,
    denoising_end: float | None = None,
    guidance_scale: float = 5.0,
    negative_prompt: str | list[str] | None = None,
    negative_prompt_2: str | list[str] | None = None,
    num_images_per_prompt: int | None = 1,
    eta: float = 0.0,
    generator: torch.Generator | list[torch.Generator] | None = None,
    latents: torch.Tensor | None = None,
    prompt_embeds: torch.Tensor | None = None,
    negative_prompt_embeds: torch.Tensor | None = None,
    pooled_prompt_embeds: torch.Tensor | None = None,
    negative_pooled_prompt_embeds: torch.Tensor | None = None,
    ip_adapter_image: PipelineImageInput | None = None,
    ip_adapter_image_embeds: list[torch.Tensor] | None = None,
    output_type: str | None = "pil",
    return_dict: bool = True,
    cross_attention_kwargs: dict[str, Any] | None = None,
    controlnet_conditioning_scale: float | list[float] = 1.0,
    control_guidance_start: float | list[float] = 0.0,
    control_guidance_end: float | list[float] = 1.0,
    original_size: tuple[int, int] = None,
    crops_coords_top_left: tuple[int, int] = (0, 0),
    target_size: tuple[int, int] = None,
    negative_original_size: tuple[int, int] | None = None,
    negative_crops_coords_top_left: tuple[int, int] = (0, 0),
    negative_target_size: tuple[int, int] | None = None,
    clip_skip: int | None = None,
    callback_on_step_end: Callable[[int, int], None] | PipelineCallback | MultiPipelineCallbacks | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
    pag_scale: float = 3.0,
    pag_adaptive_scale: float = 0.0,
):

down_block_res_samples, mid_block_res_sample = self.controlnet(
    control_model_input,
    t,
    encoder_hidden_states=controlnet_prompt_embeds,
    controlnet_cond=image,
    conditioning_scale=cond_scale,
    guess_mode=False,
    added_cond_kwargs=controlnet_added_cond_kwargs,
Problem:
StableDiffusionControlNetPAGInpaintPipeline and StableDiffusionXLControlNetPAGPipeline omit the public guess_mode argument and hard-code guess_mode=False in image preparation and ControlNet forward. They also skip the base pipeline’s global_pool_conditions fallback.
Impact:
ControlNet checkpoints that require guess mode/global pooling cannot be used through these PAG variants, and users cannot request a feature supported by the corresponding base pipelines.
Reproduction:
import inspect
from diffusers import StableDiffusionControlNetPAGInpaintPipeline, StableDiffusionXLControlNetPAGPipeline

for cls in (StableDiffusionControlNetPAGInpaintPipeline, StableDiffusionXLControlNetPAGPipeline):
    print(cls.__name__, "guess_mode" in inspect.signature(cls.__call__).parameters)
# False for both
Relevant precedent:
Base inpaint and SDXL ControlNet expose and normalize guess_mode:
controlnet_conditioning_scale: float | list[float] = 0.5,
guess_mode: bool = False,
control_guidance_start: float | list[float] = 0.0,
control_guidance_end: float | list[float] = 1.0,
clip_skip: int | None = None,
callback_on_step_end: Callable[[int, int], None] | PipelineCallback | MultiPipelineCallbacks | None = None,
callback_on_step_end_tensor_inputs: list[str] = ["latents"],
**kwargs,

cross_attention_kwargs: dict[str, Any] | None = None,
controlnet_conditioning_scale: float | list[float] = 1.0,
guess_mode: bool = False,
control_guidance_start: float | list[float] = 0.0,
control_guidance_end: float | list[float] = 1.0,
original_size: tuple[int, int] = None,
crops_coords_top_left: tuple[int, int] = (0, 0),
target_size: tuple[int, int] = None,
Search found old base SDXL guess-mode issue #4709, but it is closed and not a duplicate of these PAG omissions.
Suggested fix:
# __call__ signature
guess_mode: bool = False,

# after resolving controlnet
global_pool_conditions = (
    controlnet.config.global_pool_conditions
    if isinstance(controlnet, ControlNetModel)
    else controlnet.nets[0].config.global_pool_conditions
)
guess_mode = guess_mode or global_pool_conditions
# replace every hard-coded guess_mode=False with guess_mode=guess_mode
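The normalization rule itself is a one-liner; this standalone sketch (with a minimal config stand-in, not a diffusers class) shows why it matters for globally pooled checkpoints:

```python
class Cfg:
    # Stand-in for controlnet.config; only carries the relevant flag.
    def __init__(self, global_pool_conditions):
        self.global_pool_conditions = global_pool_conditions

def resolve_guess_mode(user_guess_mode, config):
    # Checkpoints trained with global pooling must run in guess mode even
    # when the caller did not ask for it; otherwise honor the caller.
    return user_guess_mode or config.global_pool_conditions

print(resolve_guess_mode(False, Cfg(True)))   # True  -> forced on by the checkpoint
print(resolve_guess_mode(False, Cfg(False)))  # False
```

Hard-coding guess_mode=False skips exactly this fallback, which is why globally pooled ControlNets misbehave through the PAG variants.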
Issue 5: SanaPAGPipeline lost Sana LoRA and attention kwargs support
Affected code:
class SanaPAGPipeline(DiffusionPipeline, PAGMixin):
    r"""
    Pipeline for text-to-image generation using [Sana](https://huggingface.co/papers/2410.10629). This pipeline
    supports the use of [Perturbed Attention Guidance
    (PAG)](https://huggingface.co/docs/diffusers/main/en/using-diffusers/pag).
    """

    # fmt: off
    bad_punct_regex = re.compile(r"[" + "#®•©™&@·º½¾¿¡§~" + r"\)" + r"\(" + r"\]" + r"\[" + r"\}" + r"\{" + r"\|" + "\\" + r"\/" + r"\*" + r"]{1,}")
    # fmt: on

    model_cpu_offload_seq = "text_encoder->transformer->vae"
    _callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds"]

    def __call__(
        self,
        prompt: str | list[str] = None,
        negative_prompt: str = "",
        num_inference_steps: int = 20,
        timesteps: list[int] = None,
        sigmas: list[float] = None,
        guidance_scale: float = 4.5,
        num_images_per_prompt: int | None = 1,
        height: int = 1024,
        width: int = 1024,
        eta: float = 0.0,
        generator: torch.Generator | list[torch.Generator] | None = None,
        latents: torch.Tensor | None = None,
        prompt_embeds: torch.Tensor | None = None,
        prompt_attention_mask: torch.Tensor | None = None,
        negative_prompt_embeds: torch.Tensor | None = None,
        negative_prompt_attention_mask: torch.Tensor | None = None,
        output_type: str | None = "pil",
        return_dict: bool = True,
        clean_caption: bool = False,
        use_resolution_binning: bool = True,
        callback_on_step_end: Callable[[int, int], None] | None = None,
        callback_on_step_end_tensor_inputs: list[str] = ["latents"],
        max_sequence_length: int = 300,
        complex_human_instruction: list[str] = [
            "Given a user prompt, generate an 'Enhanced prompt' that provides detailed visual descriptions suitable for image generation. Evaluate the level of detail in the user prompt:",
            "- If the prompt is simple, focus on adding specifics about colors, shapes, sizes, textures, and spatial relationships to create vivid and concrete scenes.",
            "- If the prompt is already detailed, refine and enhance the existing details slightly without overcomplicating.",
            "Here are examples of how to transform or refine prompts:",
            "- User Prompt: A cat sleeping -> Enhanced: A small, fluffy white cat curled up in a round shape, sleeping peacefully on a warm sunny windowsill, surrounded by pots of blooming red flowers.",
            "- User Prompt: A busy city street -> Enhanced: A bustling city street scene at dusk, featuring glowing street lamps, a diverse crowd of people in colorful clothing, and a double-decker bus passing by towering glass skyscrapers.",
            "Please generate only the enhanced description for the prompt below and avoid including any additional commentary or evaluations:",
            "User Prompt: ",
        ],
        pag_scale: float = 3.0,
        pag_adaptive_scale: float = 0.0,
    ) -> ImagePipelineOutput | tuple:

noise_pred = self.transformer(
    latent_model_input,
    encoder_hidden_states=prompt_embeds,
    encoder_attention_mask=prompt_attention_mask,
    timestep=timestep,
Problem:
Base SanaPipeline inherits SanaLoraLoaderMixin, accepts attention_kwargs, uses attention_kwargs["scale"] for prompt encoding, and forwards the kwargs into the transformer. SanaPAGPipeline inherits only DiffusionPipeline, PAGMixin and has no attention_kwargs argument or forwarding.
Impact:
Users cannot load or scale Sana LoRAs with the PAG pipeline, despite Sana supporting them and the PAG API page carrying a LoRA badge.
Reproduction:
import inspect
from diffusers import SanaPipeline, SanaPAGPipeline
print(hasattr(SanaPipeline, "load_lora_weights"), hasattr(SanaPAGPipeline, "load_lora_weights"))
print("attention_kwargs" in inspect.signature(SanaPipeline.__call__).parameters)
print("attention_kwargs" in inspect.signature(SanaPAGPipeline.__call__).parameters)
# True False
# True
# False
Relevant precedent:
class SanaPipeline(DiffusionPipeline, SanaLoraLoaderMixin):
    def __call__(
        self,
        prompt: str | list[str] = None,
        negative_prompt: str = "",
        num_inference_steps: int = 20,
        timesteps: list[int] = None,
        sigmas: list[float] = None,
        guidance_scale: float = 4.5,
        num_images_per_prompt: int | None = 1,
        height: int = 1024,
        width: int = 1024,
        eta: float = 0.0,
        generator: torch.Generator | list[torch.Generator] | None = None,
        latents: torch.Tensor | None = None,
        prompt_embeds: torch.Tensor | None = None,
        prompt_attention_mask: torch.Tensor | None = None,
        negative_prompt_embeds: torch.Tensor | None = None,
        negative_prompt_attention_mask: torch.Tensor | None = None,
        output_type: str | None = "pil",
        return_dict: bool = True,
        clean_caption: bool = False,
        use_resolution_binning: bool = True,
        attention_kwargs: dict[str, Any] | None = None,
        callback_on_step_end: Callable[[int, int], None] | None = None,
        callback_on_step_end_tensor_inputs: list[str] = ["latents"],

        self._guidance_scale = guidance_scale
        self._attention_kwargs = attention_kwargs
        self._interrupt = False

        # 2. Default height and width to transformer
        if prompt is not None and isinstance(prompt, str):
            batch_size = 1
        elif prompt is not None and isinstance(prompt, list):
            batch_size = len(prompt)
        else:
            batch_size = prompt_embeds.shape[0]

        device = self._execution_device
        lora_scale = self.attention_kwargs.get("scale", None) if self.attention_kwargs is not None else None
Related SanaPAG quality issue #10241 exists, but it is not a duplicate of the LoRA/attention kwargs API gap.
Suggested fix:
class SanaPAGPipeline(DiffusionPipeline, SanaLoraLoaderMixin, PAGMixin):
    ...
    def __call__(..., attention_kwargs: dict[str, Any] | None = None, ...):
        self._attention_kwargs = attention_kwargs
        lora_scale = self.attention_kwargs.get("scale", None) if self.attention_kwargs is not None else None
        ...
        noise_pred = self.transformer(..., attention_kwargs=self.attention_kwargs, return_dict=False)[0]
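The plumbing being restored is small and self-contained; this sketch models it with an illustrative stand-in class (FakePipeline is hypothetical, not a diffusers type) to show how a single kwargs dict feeds both the text-encoder LoRA scale and the transformer:

```python
class FakePipeline:
    _attention_kwargs = None

    @property
    def attention_kwargs(self):
        return self._attention_kwargs

    def __call__(self, prompt, attention_kwargs=None):
        self._attention_kwargs = attention_kwargs
        # The LoRA scale for prompt encoding is read from the same dict
        # that is later forwarded to the transformer's attention layers.
        lora_scale = self.attention_kwargs.get("scale", None) if self.attention_kwargs is not None else None
        return lora_scale

pipe = FakePipeline()
print(pipe("a cat", attention_kwargs={"scale": 0.5}))  # 0.5
print(pipe("a cat"))  # None
```

Dropping the parameter, as SanaPAGPipeline currently does, severs both consumers at once.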
Issue 6: PAG docs have broken examples and omit Sana API docs
Affected code:
pag_scales = 4.0
guidance_scales = 7.0

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png"
init_image = load_image(url)
prompt = "a dog catching a frisbee in the jungle"

generator = torch.Generator(device="cpu").manual_seed(0)
image = pipeline(
    prompt,
    image=init_image,
    strength=0.8,
    guidance_scale=guidance_scale,
    pag_scale=pag_scale,
    generator=generator).images[0]

pipeline_t2i = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
pipeline = AutoPipelineForInpaiting.from_pipe(pipeline_t2i, enable_pag=True)
```

Let's generate an image!

```py
img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
init_image = load_image(img_url).convert("RGB")
mask_image = load_image(mask_url).convert("RGB")

prompt = "A majestic tiger sitting on a bench"

pag_scales = 3.0
guidance_scales = 7.5

generator = torch.Generator(device="cpu").manual_seed(1)
images = pipeline(
    prompt=prompt,
    image=init_image,
    mask_image=mask_image,
    strength=0.8,
    num_inference_steps=50,
    guidance_scale=guidance_scale,
    generator=generator,
    pag_scale=pag_scale,

## AnimateDiffPAGPipeline

[[autodoc]] AnimateDiffPAGPipeline
  - all
  - __call__

## HunyuanDiTPAGPipeline

[[autodoc]] HunyuanDiTPAGPipeline
  - all
  - __call__

## KolorsPAGPipeline

[[autodoc]] KolorsPAGPipeline
  - all
  - __call__

## StableDiffusionPAGInpaintPipeline

[[autodoc]] StableDiffusionPAGInpaintPipeline
  - all
  - __call__

## StableDiffusionPAGPipeline

[[autodoc]] StableDiffusionPAGPipeline
  - all
  - __call__

## StableDiffusionPAGImg2ImgPipeline

[[autodoc]] StableDiffusionPAGImg2ImgPipeline
  - all
  - __call__

## StableDiffusionControlNetPAGPipeline

[[autodoc]] StableDiffusionControlNetPAGPipeline

## StableDiffusionControlNetPAGInpaintPipeline

[[autodoc]] StableDiffusionControlNetPAGInpaintPipeline
  - all
  - __call__

## StableDiffusionXLPAGPipeline

[[autodoc]] StableDiffusionXLPAGPipeline
  - all
  - __call__

## StableDiffusionXLPAGImg2ImgPipeline

[[autodoc]] StableDiffusionXLPAGImg2ImgPipeline
  - all
  - __call__

## StableDiffusionXLPAGInpaintPipeline

[[autodoc]] StableDiffusionXLPAGInpaintPipeline
  - all
  - __call__

## StableDiffusionXLControlNetPAGPipeline

[[autodoc]] StableDiffusionXLControlNetPAGPipeline
  - all
  - __call__

## StableDiffusionXLControlNetPAGImg2ImgPipeline

[[autodoc]] StableDiffusionXLControlNetPAGImg2ImgPipeline
  - all
  - __call__

## StableDiffusion3PAGPipeline

[[autodoc]] StableDiffusion3PAGPipeline
  - all
  - __call__

## StableDiffusion3PAGImg2ImgPipeline

[[autodoc]] StableDiffusion3PAGImg2ImgPipeline
  - all
  - __call__

## PixArtSigmaPAGPipeline

[[autodoc]] PixArtSigmaPAGPipeline
  - all
  - __call__
Problem:
The guide defines pag_scales/guidance_scales but then passes pag_scale/guidance_scale, references the misspelled AutoPipelineForInpaiting, and uses pipeline_t2i where the preceding snippet defines pipeline_pag. The API page exports SanaPAGPipeline in code but never documents it.
Impact:
Copy-pasted PAG guide snippets fail before generation, and Sana PAG users cannot find the API reference.
Reproduction:
from pathlib import Path
guide = Path("docs/source/en/using-diffusers/pag.md").read_text()
api = Path("docs/source/en/api/pipelines/pag.md").read_text()
assert "AutoPipelineForInpaiting" not in guide
assert "pag_scales" not in guide or "pag_scale=pag_scale" not in guide
assert "## SanaPAGPipeline" in api
Relevant precedent:
No duplicate found for AutoPipelineForInpaiting pag_scale guidance_scale PAG docs.
Suggested fix:
pag_scale = 4.0
guidance_scale = 7.0
...
pipeline = AutoPipelineForInpainting.from_pipe(pipeline_t2i, enable_pag=True)
Also add:
## SanaPAGPipeline
[[autodoc]] SanaPAGPipeline
- all
- __call__
Issue 7: Slow coverage is missing for most PAG variants
Affected code:
|
class AnimateDiffPAGPipelineFastTests( |
|
class StableDiffusionControlNetPAGPipelineFastTests( |
|
class StableDiffusionControlNetPAGInpaintPipelineFastTests( |
|
class StableDiffusionXLControlNetPAGPipelineFastTests( |
|
class StableDiffusionXLControlNetPAGImg2ImgPipelineFastTests( |
|
class HunyuanDiTPAGPipelineFastTests(PipelineTesterMixin, unittest.TestCase): |
|
class KolorsPAGPipelineFastTests( |
|
class PixArtSigmaPAGPipelineFastTests(PipelineTesterMixin, unittest.TestCase): |
|
class SanaPAGPipelineFastTests(PipelineTesterMixin, unittest.TestCase): |
|
class StableDiffusion3PAGPipelineFastTests(unittest.TestCase, PipelineTesterMixin): |
Problem:
Fast tests exist across the family, but these PAG variants have no @slow integration tests. The prompt explicitly requires missing slow tests to be reported.
Impact:
Real checkpoint/API drift is not covered for many PAG pipelines, including the stale SD3/Sana/ControlNet gaps above.
Reproduction:
from pathlib import Path
missing = [
p.as_posix()
for p in sorted(Path("tests/pipelines/pag").glob("test_pag_*.py"))
if "@slow" not in p.read_text(encoding="utf-8")
]
print("\n".join(missing))
Relevant precedent:
Existing slow PAG tests are present for SD, SD img2img/inpaint, SDXL, SDXL img2img/inpaint, and SD3 img2img.
Suggested fix:
Add at least one @slow smoke/integration class per missing pipeline using the smallest stable public checkpoint available, covering pag_scale=0 parity and pag_scale>0 execution.
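One possible shape for such a smoke class, sketched with a deterministic stand-in pipeline so the skeleton runs without checkpoints. `load_pipeline`, `FakePipe`, and the class name are illustrative; the real version would load the smallest public checkpoint via `from_pretrained` and carry the suite's slow marker.

```python
import unittest


def load_pipeline():
    # Stand-in for loading a real checkpoint; deterministic so the skeleton
    # itself is runnable anywhere.
    class FakePipe:
        def __call__(self, prompt, pag_scale=0.0):
            base = sum(ord(c) for c in prompt)
            # PAG on perturbs the output; PAG off reproduces the baseline.
            return base + (1 if pag_scale > 0 else 0)

    return FakePipe()


class PAGSlowSmokeTests(unittest.TestCase):
    # In the real suite this class would be decorated with the slow marker
    # used by tests/pipelines (an assumption about the harness, not shown).
    def test_pag_scale_zero_matches_baseline(self):
        # pag_scale=0 parity: disabling PAG must reproduce the baseline output.
        pipe = load_pipeline()
        self.assertEqual(pipe("astronaut", pag_scale=0.0), pipe("astronaut", pag_scale=0.0))

    def test_pag_scale_positive_executes(self):
        # pag_scale>0 execution: the perturbed path runs and changes the output.
        pipe = load_pipeline()
        self.assertNotEqual(pipe("astronaut", pag_scale=3.0), pipe("astronaut", pag_scale=0.0))
```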
Issue 1: pag_applied_layers="blocks.1" also matches blocks.10 (Problem, Impact, and Reproduction are given in the writeup above)
Affected code:
diffusers/src/diffusers/pipelines/pag/pag_utils.py
Lines 58 to 77 in 0f1abc4
Relevant precedent:
No duplicate found in GitHub issue/PR searches for PAG pag_applied_layers blocks.1 blocks.10.
Suggested fix:
Anchor the layer_id match to dot-separated segment boundaries so a match must end at a "." or at the end of the module name; blocks.1 then no longer matches inside blocks.10.
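A minimal sketch of one possible fix (not the shipped implementation): keep treating layer_id as a regex fragment and preserve partial path matching, but require the match to end at a segment boundary. The helper name matches_layer is illustrative.

```python
import re


def matches_layer(layer_id: str, name: str) -> bool:
    # The user-supplied layer_id may still be a regex fragment, but the match
    # must now end at a "." or at the end of the module name, so "blocks.1"
    # cannot match inside "blocks.10".
    return re.search(rf"{layer_id}(\.|$)", name) is not None


assert matches_layer("blocks.1", "blocks.1.attn1")
assert matches_layer("blocks.10", "blocks.10.attn1")
assert not matches_layer("blocks.1", "blocks.10.attn1")
# Partial path matching is preserved: "blocks.1" still hits "down_blocks.1".
assert matches_layer("blocks.1", "down_blocks.1.attn1")
```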
Issue 2: PAG processors are not restored if generation raises
Affected code:
diffusers/src/diffusers/pipelines/pag/pipeline_pag_sd.py
Lines 996 to 1001 in 0f1abc4
diffusers/src/diffusers/pipelines/pag/pipeline_pag_sd.py
Lines 1073 to 1077 in 0f1abc4
Problem:
PAG pipelines save the original attention processors before the denoising loop and restore them only on the normal success path. If a callback, scheduler, VAE, or user interrupt raises after _set_pag_attn_processor, the model keeps PAG processors installed.
Impact:
A later pag_scale=0 call can fail or produce wrong results because the UNet/transformer still expects PAG-expanded batches.
Reproduction:
Relevant precedent:
No duplicate found for PAG callback exception attention processor restore.
Suggested fix:
Wrap the denoising loop in try/finally and restore the saved attention processors in the finally block so failures cannot leak PAG state.
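The fix direction can be sketched with stand-ins. FakeModel and generate_with_pag are illustrative, not the diffusers API; only the try/finally restoration pattern is the point.

```python
class FakeModel:
    # Minimal stand-in for a UNet/transformer with swappable attention processors.
    def __init__(self):
        self.attn_processors = {"attn1": "default"}

    def set_attn_processor(self, processors):
        self.attn_processors = dict(processors)


def generate_with_pag(model, denoise):
    # Snapshot the processors, install the PAG ones, and restore in `finally`
    # so an exception in a callback/scheduler/VAE step cannot leak PAG state.
    original = dict(model.attn_processors)
    model.set_attn_processor({"attn1": "pag"})
    try:
        return denoise(model)
    finally:
        model.set_attn_processor(original)


model = FakeModel()

def failing_denoise(m):
    raise RuntimeError("callback failed mid-loop")

try:
    generate_with_pag(model, failing_denoise)
except RuntimeError:
    pass

# The processors are back to the pre-call state despite the exception.
assert model.attn_processors == {"attn1": "default"}
```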
Issue 3: SD3 PAG pipelines are stale versus base SD3
Affected code:
diffusers/src/diffusers/pipelines/pag/pipeline_pag_sd_3.py
Lines 136 to 176 in 0f1abc4
diffusers/src/diffusers/pipelines/pag/pipeline_pag_sd_3.py
Lines 686 to 715 in 0f1abc4
diffusers/src/diffusers/pipelines/pag/pipeline_pag_sd_3_img2img.py
Lines 152 to 191 in 0f1abc4
diffusers/src/diffusers/pipelines/pag/pipeline_pag_sd_3_img2img.py
Lines 749 to 768 in 0f1abc4
Problem:
Base SD3 supports mu for dynamic shifting, SD3 IP-Adapter loading/call inputs, and text-to-image skip-layer guidance. The PAG SD3 variants do not expose those APIs. They also expose callback tensor inputs for negative pooled embeds while omitting the consumed pooled_prompt_embeds.
Impact:
SD3.5-style schedulers with use_dynamic_shifting=True fail because PAG cannot pass mu; IP-Adapter workflows cannot be used with SD3 PAG; callbacks cannot modify pooled conditioning.
Reproduction:
Relevant precedent:
Base SD3 implements these APIs:
diffusers/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py
Lines 794 to 807 in 0f1abc4
diffusers/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py
Lines 1012 to 1037 in 0f1abc4
No duplicate found for StableDiffusion3PAGPipeline mu use_dynamic_shifting or SD3 PAG IP-Adapter searches.
Suggested fix:
Port the current base SD3 __call__ API and logic into both SD3 PAG variants, then add the PAG batch expansion around the updated prompt/pooled/IP-Adapter conditioning. Include SD3IPAdapterMixin, mu, and the base callback tensor contract.
Issue 4: ControlNet PAG variants dropped guess_mode
Affected code:
diffusers/src/diffusers/pipelines/pag/pipeline_pag_controlnet_sd_inpaint.py
Lines 980 to 1006 in 0f1abc4
diffusers/src/diffusers/pipelines/pag/pipeline_pag_controlnet_sd_inpaint.py
Lines 1235 to 1260 in 0f1abc4
diffusers/src/diffusers/pipelines/pag/pipeline_pag_controlnet_sd_xl.py
Lines 1003 to 1044 in 0f1abc4
diffusers/src/diffusers/pipelines/pag/pipeline_pag_controlnet_sd_xl.py
Lines 1512 to 1519 in 0f1abc4
Problem:
StableDiffusionControlNetPAGInpaintPipeline and StableDiffusionXLControlNetPAGPipeline omit the public guess_mode argument and hard-code guess_mode=False in image preparation and the ControlNet forward. They also skip the base pipeline's global_pool_conditions fallback.
Impact:
ControlNet checkpoints that require guess mode/global pooling cannot be used through these PAG variants, and users cannot request a feature supported by the corresponding base pipelines.
Reproduction:
Relevant precedent:
Base inpaint and SDXL ControlNet expose and normalize guess_mode:
diffusers/src/diffusers/pipelines/controlnet/pipeline_controlnet_inpaint.py
Lines 1020 to 1027 in 0f1abc4
diffusers/src/diffusers/pipelines/controlnet/pipeline_controlnet_sd_xl.py
Lines 1025 to 1032 in 0f1abc4
Search found old base SDXL guess-mode issue #4709, but it is closed and not a duplicate of these PAG omissions.
Suggested fix:
Expose guess_mode in __call__, thread it through image preparation and the ControlNet forward, and restore the base pipelines' global_pool_conditions fallback.
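The skipped fallback can be illustrated with a small stand-in. resolve_guess_mode and the namespace configs are hypothetical, not diffusers code; the point is the normalization the base pipelines apply and the PAG variants omit.

```python
from types import SimpleNamespace


def resolve_guess_mode(guess_mode, controlnet):
    # A ControlNet trained with global pooled conditioning forces guess mode
    # on even when the caller left it False; otherwise the caller's choice wins.
    global_pool = getattr(controlnet.config, "global_pool_conditions", False)
    return guess_mode or global_pool


plain = SimpleNamespace(config=SimpleNamespace(global_pool_conditions=False))
pooled = SimpleNamespace(config=SimpleNamespace(global_pool_conditions=True))

assert resolve_guess_mode(False, plain) is False
assert resolve_guess_mode(False, pooled) is True
assert resolve_guess_mode(True, plain) is True
```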
Issue 5: SanaPAGPipeline lost Sana LoRA and attention kwargs support
Affected code:
diffusers/src/diffusers/pipelines/pag/pipeline_pag_sana.py
Lines 148 to 160 in 0f1abc4
diffusers/src/diffusers/pipelines/pag/pipeline_pag_sana.py
Lines 650 to 687 in 0f1abc4
diffusers/src/diffusers/pipelines/pag/pipeline_pag_sana.py
Lines 906 to 910 in 0f1abc4
Problem:
Base SanaPipeline inherits SanaLoraLoaderMixin, accepts attention_kwargs, uses attention_kwargs["scale"] for prompt encoding, and forwards the kwargs into the transformer. SanaPAGPipeline inherits only DiffusionPipeline, PAGMixin and has no attention_kwargs argument or forwarding.
Impact:
Users cannot load or scale Sana LoRAs with the PAG pipeline, despite Sana supporting them and the PAG API page carrying a LoRA badge.
Reproduction:
Relevant precedent:
diffusers/src/diffusers/pipelines/sana/pipeline_sana.py
Line 190 in 0f1abc4
diffusers/src/diffusers/pipelines/sana/pipeline_sana.py
Lines 729 to 753 in 0f1abc4
diffusers/src/diffusers/pipelines/sana/pipeline_sana.py
Lines 888 to 903 in 0f1abc4
Related SanaPAG quality issue #10241 exists, but it is not a duplicate of the LoRA/attention kwargs API gap.
Suggested fix:
Add SanaLoraLoaderMixin to the class bases, accept attention_kwargs in __call__, apply its LoRA scale during prompt encoding, and forward the kwargs to the transformer, matching base SanaPipeline.
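The missing forwarding can be sketched with stand-ins. call_with_attention_kwargs and fake_transformer are illustrative, not the Sana API; only the kwargs plumbing is the point.

```python
def call_with_attention_kwargs(transformer, prompt, attention_kwargs=None):
    # Pull the LoRA scale for prompt encoding and pass the kwargs through to
    # the transformer, as base SanaPipeline does and SanaPAGPipeline does not.
    attention_kwargs = attention_kwargs or {}
    lora_scale = attention_kwargs.get("scale", None)
    encoded = (prompt, lora_scale)  # stands in for encode_prompt(..., lora_scale=...)
    return transformer(encoded, attention_kwargs=attention_kwargs)


def fake_transformer(hidden_states, attention_kwargs=None):
    # Records what it received so the forwarding can be checked.
    return {"inputs": hidden_states, "attention_kwargs": attention_kwargs}


out = call_with_attention_kwargs(fake_transformer, "a cat", attention_kwargs={"scale": 0.7})
assert out["inputs"] == ("a cat", 0.7)
assert out["attention_kwargs"] == {"scale": 0.7}
```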
Issue 6 affected files (for the Problem/Impact and fixes described above):
diffusers/docs/source/en/using-diffusers/pag.md
Lines 124 to 138 and 167 to 193 in 0f1abc4
diffusers/docs/source/en/api/pipelines/pag.md
Lines 36 to 113 in 0f1abc4
Issue 7 affected files (for the Problem/Impact and fix described above), all in 0f1abc4:
diffusers/tests/pipelines/pag/test_pag_animatediff.py, line 40
diffusers/tests/pipelines/pag/test_pag_controlnet_sd.py, line 51
diffusers/tests/pipelines/pag/test_pag_controlnet_sd_inpaint.py, line 49
diffusers/tests/pipelines/pag/test_pag_controlnet_sdxl.py, line 51
diffusers/tests/pipelines/pag/test_pag_controlnet_sdxl_img2img.py, line 50
diffusers/tests/pipelines/pag/test_pag_hunyuan_dit.py, line 40
diffusers/tests/pipelines/pag/test_pag_kolors.py, line 47
diffusers/tests/pipelines/pag/test_pag_pixart_sigma.py, line 50
diffusers/tests/pipelines/pag/test_pag_sana.py, line 38
diffusers/tests/pipelines/pag/test_pag_sd3.py, line 33