
pag model/pipeline review #13594

@hlky

Description


Commit tested: 0f1abc4ae8b0eb2a3b40e82a310507281144c423

This review was performed against the repository review rules and covers the PAG public exports/lazy imports, pipeline/runtime behavior, consistency with the related base pipelines, docs/examples, and tests/pipelines/pag.

Issue 1: pag_applied_layers="blocks.1" also matches blocks.10

Affected code:

def is_fake_integral_match(layer_id, name):
    layer_id = layer_id.split(".")[-1]
    name = name.split(".")[-1]
    return layer_id.isnumeric() and name.isnumeric() and layer_id == name

for layer_id in pag_applied_layers:
    # for each PAG layer input, we find corresponding self-attention layers in the unet model
    target_modules = []
    for name, module in model.named_modules():
        # Identify the following simple cases:
        #   (1) Self Attention layer existing
        #   (2) Whether the module name matches pag layer id even partially
        #   (3) Make sure it's not a fake integral match if the layer_id ends with a number
        #       For example, blocks.1, blocks.10 should be differentiable if layer_id="blocks.1"
        if (
            is_self_attn(module)
            and re.search(layer_id, name) is not None
            and not is_fake_integral_match(layer_id, name)
        ):

Problem:
re.search(layer_id, name) is used directly, and the numeric disambiguation only compares the last dot-separated tokens. For names like blocks.10.attn1, the last token is attn1, so blocks.1 still matches blocks.10. This contradicts the nearby comment and over-applies PAG to unintended transformer blocks.

Impact:
Users selecting a single DiT block can silently perturb additional blocks, changing quality/performance and making layer ablations unreliable.

Reproduction:

import torch.nn as nn
from diffusers.models.attention_processor import Attention
from diffusers.pipelines.pag.pag_utils import PAGMixin

class TinyTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Module() for _ in range(11)])
        for block in self.blocks:
            block.attn1 = Attention(query_dim=8, heads=1, dim_head=8)

    @property
    def attn_processors(self):
        return {f"{n}.processor": m.processor for n, m in self.named_modules() if isinstance(m, Attention)}

class Dummy(PAGMixin):
    def __init__(self):
        self.transformer = TinyTransformer()
        self.set_pag_applied_layers(["blocks.1"])

pipe = Dummy()
pipe._set_pag_attn_processor(pipe.pag_applied_layers, do_classifier_free_guidance=False)
print(sorted(pipe.pag_attn_processors))
# Includes both blocks.1.attn1.processor and blocks.10.attn1.processor

Relevant precedent:
No duplicate found in GitHub issue/PR searches for PAG pag_applied_layers blocks.1 blocks.10.

Suggested fix:

match = re.search(layer_id, name)
if match is None:
    continue
if layer_id[-1].isdigit() and match.end() < len(name) and name[match.end()].isdigit():
    continue
if is_self_attn(module):
    target_modules.append(module)
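
An equivalent, more compact guard is a negative lookahead that forbids a digit immediately after the matched span. A minimal sketch, assuming layer_id is intended as a literal module-path fragment (note that escaping also makes the dot literal, which slightly tightens the current regex semantics):

import re

def matches_layer(layer_id: str, name: str) -> bool:
    # Escape the fragment so "." matches literally, then forbid a trailing
    # digit so "blocks.1" cannot match inside "blocks.10".
    pattern = re.escape(layer_id) + r"(?!\d)"
    return re.search(pattern, name) is not None

assert matches_layer("blocks.1", "blocks.1.attn1")
assert not matches_layer("blocks.1", "blocks.10.attn1")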

Issue 2: PAG processors are not restored if generation raises

Affected code:

if self.do_perturbed_attention_guidance:
    original_attn_proc = self.unet.attn_processors
    self._set_pag_attn_processor(
        pag_applied_layers=self.pag_applied_layers,
        do_classifier_free_guidance=self.do_classifier_free_guidance,
    )

# Offload all models
self.maybe_free_model_hooks()
if self.do_perturbed_attention_guidance:
    self.unet.set_attn_processor(original_attn_proc)

Problem:
PAG pipelines save the original attention processors before the denoising loop and restore them only on the normal success path. If a callback, scheduler, VAE, or user interrupt raises after _set_pag_attn_processor, the model keeps PAG processors installed.

Impact:
A later pag_scale=0 call can fail or produce wrong results because the UNet/transformer still expects PAG-expanded batches.

Reproduction:

import torch
from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, DDIMScheduler, StableDiffusionPAGPipeline, UNet2DConditionModel

unet = UNet2DConditionModel(block_out_channels=(4, 8), layers_per_block=2, sample_size=32, in_channels=4, out_channels=4,
    down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"), up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
    cross_attention_dim=8, norm_num_groups=2)
vae = AutoencoderKL(block_out_channels=[4, 8], in_channels=3, out_channels=3, down_block_types=["DownEncoderBlock2D"]*2,
    up_block_types=["UpDecoderBlock2D"]*2, latent_channels=4, norm_num_groups=2)
text_encoder = CLIPTextModel(CLIPTextConfig(bos_token_id=0, eos_token_id=2, hidden_size=8, intermediate_size=16,
    num_attention_heads=2, num_hidden_layers=2, pad_token_id=1, vocab_size=1000))
pipe = StableDiffusionPAGPipeline(
    unet=unet, scheduler=DDIMScheduler(), vae=vae, text_encoder=text_encoder,
    tokenizer=CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip"),
    safety_checker=None, feature_extractor=None,
)
pipe.set_progress_bar_config(disable=True)
try:
    pipe("x", num_inference_steps=1, guidance_scale=5, pag_scale=1, output_type="latent",
         callback_on_step_end=lambda *a, **k: (_ for _ in ()).throw(RuntimeError("boom")))
except RuntimeError:
    pass
pipe("x", num_inference_steps=1, guidance_scale=5, pag_scale=0, output_type="latent")
# ValueError: not enough values to unpack (expected 3, got 2)

Relevant precedent:
No duplicate found for PAG callback exception attention processor restore.

Suggested fix:

original_attn_proc = None
try:
    if self.do_perturbed_attention_guidance:
        original_attn_proc = self.unet.attn_processors
        self._set_pag_attn_processor(self.pag_applied_layers, self.do_classifier_free_guidance)
    # denoise/decode body
finally:
    if original_attn_proc is not None:
        self.unet.set_attn_processor(original_attn_proc)
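
To avoid threading try/finally through every PAG pipeline body, the restore logic could also live in a small context manager. A minimal sketch, assuming pipe is any PAGMixin pipeline and denoiser is its UNet or transformer:

from contextlib import contextmanager

@contextmanager
def pag_attn_processors(pipe, denoiser):
    # Install the PAG attention processors, then guarantee restoration on
    # every exit path, including callback exceptions and user interrupts.
    original = denoiser.attn_processors
    pipe._set_pag_attn_processor(
        pag_applied_layers=pipe.pag_applied_layers,
        do_classifier_free_guidance=pipe.do_classifier_free_guidance,
    )
    try:
        yield
    finally:
        denoiser.set_attn_processor(original)

The denoise/decode body would then run inside "with pag_attn_processors(self, self.unet):" whenever do_perturbed_attention_guidance is set.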

Issue 3: SD3 PAG pipelines are stale versus base SD3

Affected code:

class StableDiffusion3PAGPipeline(DiffusionPipeline, SD3LoraLoaderMixin, FromSingleFileMixin, PAGMixin):
    r"""
    [PAG pipeline](https://huggingface.co/docs/diffusers/main/en/using-diffusers/pag) for text-to-image generation
    using Stable Diffusion 3.

    Args:
        transformer ([`SD3Transformer2DModel`]):
            Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
        scheduler ([`FlowMatchEulerDiscreteScheduler`]):
            A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
        vae ([`AutoencoderKL`]):
            Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
        text_encoder ([`CLIPTextModelWithProjection`]):
            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
            specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant,
            with an additional added projection layer that is initialized with a diagonal matrix with the `hidden_size`
            as its dimension.
        text_encoder_2 ([`CLIPTextModelWithProjection`]):
            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
            specifically the
            [laion/CLIP-ViT-bigG-14-laion2B-39B-b160k](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)
            variant.
        text_encoder_3 ([`T5EncoderModel`]):
            Frozen text-encoder. Stable Diffusion 3 uses
            [T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5EncoderModel), specifically the
            [t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl) variant.
        tokenizer (`CLIPTokenizer`):
            Tokenizer of class
            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
        tokenizer_2 (`CLIPTokenizer`):
            Second Tokenizer of class
            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
        tokenizer_3 (`T5TokenizerFast`):
            Tokenizer of class
            [T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
    """

    model_cpu_offload_seq = "text_encoder->text_encoder_2->text_encoder_3->transformer->vae"
    _optional_components = []
    _callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds", "negative_pooled_prompt_embeds"]

    def __call__(
        self,
        prompt: str | list[str] = None,
        prompt_2: str | list[str] | None = None,
        prompt_3: str | list[str] | None = None,
        height: int | None = None,
        width: int | None = None,
        num_inference_steps: int = 28,
        sigmas: list[float] | None = None,
        guidance_scale: float = 7.0,
        negative_prompt: str | list[str] | None = None,
        negative_prompt_2: str | list[str] | None = None,
        negative_prompt_3: str | list[str] | None = None,
        num_images_per_prompt: int | None = 1,
        generator: torch.Generator | list[torch.Generator] | None = None,
        latents: torch.FloatTensor | None = None,
        prompt_embeds: torch.FloatTensor | None = None,
        negative_prompt_embeds: torch.FloatTensor | None = None,
        pooled_prompt_embeds: torch.FloatTensor | None = None,
        negative_pooled_prompt_embeds: torch.FloatTensor | None = None,
        output_type: str | None = "pil",
        return_dict: bool = True,
        joint_attention_kwargs: dict[str, Any] | None = None,
        clip_skip: int | None = None,
        callback_on_step_end: Callable[[int, int], None] | None = None,
        callback_on_step_end_tensor_inputs: list[str] = ["latents"],
        max_sequence_length: int = 256,
        pag_scale: float = 3.0,
        pag_adaptive_scale: float = 0.0,
    ):

class StableDiffusion3PAGImg2ImgPipeline(DiffusionPipeline, SD3LoraLoaderMixin, FromSingleFileMixin, PAGMixin):
    r"""
    [PAG pipeline](https://huggingface.co/docs/diffusers/main/en/using-diffusers/pag) for image-to-image generation
    using Stable Diffusion 3.

    Args:
        transformer ([`SD3Transformer2DModel`]):
            Conditional Transformer (MMDiT) architecture to denoise the encoded image latents.
        scheduler ([`FlowMatchEulerDiscreteScheduler`]):
            A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
        vae ([`AutoencoderKL`]):
            Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
        text_encoder ([`CLIPTextModelWithProjection`]):
            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
            specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant,
            with an additional added projection layer that is initialized with a diagonal matrix with the `hidden_size`
            as its dimension.
        text_encoder_2 ([`CLIPTextModelWithProjection`]):
            [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModelWithProjection),
            specifically the
            [laion/CLIP-ViT-bigG-14-laion2B-39B-b160k](https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k)
            variant.
        text_encoder_3 ([`T5EncoderModel`]):
            Frozen text-encoder. Stable Diffusion 3 uses
            [T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5EncoderModel), specifically the
            [t5-v1_1-xxl](https://huggingface.co/google/t5-v1_1-xxl) variant.
        tokenizer (`CLIPTokenizer`):
            Tokenizer of class
            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
        tokenizer_2 (`CLIPTokenizer`):
            Second Tokenizer of class
            [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
        tokenizer_3 (`T5TokenizerFast`):
            Tokenizer of class
            [T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
    """

    model_cpu_offload_seq = "text_encoder->text_encoder_2->text_encoder_3->transformer->vae"
    _optional_components = []
    _callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds", "negative_pooled_prompt_embeds"]

        negative_prompt: str | list[str] | None = None,
        negative_prompt_2: str | list[str] | None = None,
        negative_prompt_3: str | list[str] | None = None,
        num_images_per_prompt: int | None = 1,
        generator: torch.Generator | list[torch.Generator] | None = None,
        latents: torch.FloatTensor | None = None,
        prompt_embeds: torch.FloatTensor | None = None,
        negative_prompt_embeds: torch.FloatTensor | None = None,
        pooled_prompt_embeds: torch.FloatTensor | None = None,
        negative_pooled_prompt_embeds: torch.FloatTensor | None = None,
        output_type: str | None = "pil",
        return_dict: bool = True,
        joint_attention_kwargs: dict[str, Any] | None = None,
        clip_skip: int | None = None,
        callback_on_step_end: Callable[[int, int], None] | None = None,
        callback_on_step_end_tensor_inputs: list[str] = ["latents"],
        max_sequence_length: int = 256,
        pag_scale: float = 3.0,
        pag_adaptive_scale: float = 0.0,
    ):

Problem:
Base SD3 supports mu for dynamic shifting, IP-Adapter loading/call inputs, and (for text-to-image) skip-layer guidance; the PAG SD3 variants expose none of these APIs. They also list negative_pooled_prompt_embeds in _callback_tensor_inputs while omitting pooled_prompt_embeds, which the denoising loop actually consumes.

Impact:
SD3.5-style schedulers with use_dynamic_shifting=True fail because PAG cannot pass mu; IP-Adapter workflows cannot be used with SD3 PAG; callbacks cannot modify pooled conditioning.

Reproduction:

import inspect
from diffusers import FlowMatchEulerDiscreteScheduler, StableDiffusion3PAGPipeline, StableDiffusion3PAGImg2ImgPipeline

for cls in (StableDiffusion3PAGPipeline, StableDiffusion3PAGImg2ImgPipeline):
    params = inspect.signature(cls.__call__).parameters
    print(cls.__name__, "mu" in params, "ip_adapter_image" in params, cls._callback_tensor_inputs)

scheduler = FlowMatchEulerDiscreteScheduler(use_dynamic_shifting=True)
scheduler.set_timesteps(2)
# ValueError: `mu` must be passed when `use_dynamic_shifting` is set to be `True`

Relevant precedent:
Base SD3 implements these APIs:

    ip_adapter_image: PipelineImageInput | None = None,
    ip_adapter_image_embeds: torch.Tensor | None = None,
    output_type: str | None = "pil",
    return_dict: bool = True,
    joint_attention_kwargs: dict[str, Any] | None = None,
    clip_skip: int | None = None,
    callback_on_step_end: Callable[[int, int], None] | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
    max_sequence_length: int = 256,
    skip_guidance_layers: list[int] = None,
    skip_layer_guidance_scale: float = 2.8,
    skip_layer_guidance_stop: float = 0.2,
    skip_layer_guidance_start: float = 0.01,
    mu: float | None = None,

scheduler_kwargs = {}
if self.scheduler.config.get("use_dynamic_shifting", None) and mu is None:
    _, _, height, width = latents.shape
    image_seq_len = (height // self.transformer.config.patch_size) * (
        width // self.transformer.config.patch_size
    )
    mu = calculate_shift(
        image_seq_len,
        self.scheduler.config.get("base_image_seq_len", 256),
        self.scheduler.config.get("max_image_seq_len", 4096),
        self.scheduler.config.get("base_shift", 0.5),
        self.scheduler.config.get("max_shift", 1.16),
    )
    scheduler_kwargs["mu"] = mu
elif mu is not None:
    scheduler_kwargs["mu"] = mu

if XLA_AVAILABLE:
    timestep_device = "cpu"
else:
    timestep_device = device

timesteps, num_inference_steps = retrieve_timesteps(
    self.scheduler,
    num_inference_steps,
    timestep_device,
    sigmas=sigmas,
    **scheduler_kwargs,

No duplicate found in searches for StableDiffusion3PAGPipeline mu use_dynamic_shifting or for SD3 PAG IP-Adapter.

Suggested fix:
Port the current base SD3 __call__ API and logic into both SD3 PAG variants, then add the PAG batch expansion around the updated prompt/pooled/IP-Adapter conditioning. Include SD3IPAdapterMixin, the mu argument, and the base callback tensor contract.
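
For context, mu is the dynamic-shifting value the flow-match scheduler consumes. A self-contained sketch of the interpolation, mirroring the calculate_shift helper the base pipeline calls (default values taken from the scheduler config keys shown above):

def calculate_shift(
    image_seq_len: int,
    base_seq_len: int = 256,
    max_seq_len: int = 4096,
    base_shift: float = 0.5,
    max_shift: float = 1.16,
) -> float:
    # Linearly interpolate the timestep-schedule shift between base_shift
    # and max_shift as a function of the image token count.
    m = (max_shift - base_shift) / (max_seq_len - base_seq_len)
    b = base_shift - m * base_seq_len
    return image_seq_len * m + b

# A 1024x1024 SD3 image -> 128x128 latent, patch size 2 -> 64 * 64 = 4096 tokens
print(calculate_shift(4096))  # 1.16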

Issue 4: ControlNet PAG variants dropped guess_mode

Affected code:

    height: int | None = None,
    width: int | None = None,
    padding_mask_crop: int | None = None,
    strength: float = 1.0,
    num_inference_steps: int = 50,
    guidance_scale: float = 7.5,
    negative_prompt: str | list[str] | None = None,
    num_images_per_prompt: int | None = 1,
    eta: float = 0.0,
    generator: torch.Generator | list[torch.Generator] | None = None,
    latents: torch.Tensor | None = None,
    prompt_embeds: torch.Tensor | None = None,
    negative_prompt_embeds: torch.Tensor | None = None,
    ip_adapter_image: PipelineImageInput | None = None,
    ip_adapter_image_embeds: list[torch.Tensor] | None = None,
    output_type: str | None = "pil",
    return_dict: bool = True,
    cross_attention_kwargs: dict[str, Any] | None = None,
    controlnet_conditioning_scale: float | list[float] = 0.5,
    control_guidance_start: float | list[float] = 0.0,
    control_guidance_end: float | list[float] = 1.0,
    clip_skip: int | None = None,
    callback_on_step_end: Callable[[int, int], None] | PipelineCallback | MultiPipelineCallbacks | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
    pag_scale: float = 3.0,
    pag_adaptive_scale: float = 0.0,
):

        batch_size=batch_size * num_images_per_prompt,
        num_images_per_prompt=num_images_per_prompt,
        device=device,
        dtype=controlnet.dtype,
        crops_coords=crops_coords,
        resize_mode=resize_mode,
        do_classifier_free_guidance=self.do_classifier_free_guidance,
        guess_mode=False,
    )
elif isinstance(controlnet, MultiControlNetModel):
    control_images = []
    for control_image_ in control_image:
        control_image_ = self.prepare_control_image(
            image=control_image_,
            width=width,
            height=height,
            batch_size=batch_size * num_images_per_prompt,
            num_images_per_prompt=num_images_per_prompt,
            device=device,
            dtype=controlnet.dtype,
            crops_coords=crops_coords,
            resize_mode=resize_mode,
            do_classifier_free_guidance=self.do_classifier_free_guidance,
            guess_mode=False,
        )

def __call__(
    self,
    prompt: str | list[str] = None,
    prompt_2: str | list[str] | None = None,
    image: PipelineImageInput = None,
    height: int | None = None,
    width: int | None = None,
    num_inference_steps: int = 50,
    timesteps: list[int] = None,
    sigmas: list[float] = None,
    denoising_end: float | None = None,
    guidance_scale: float = 5.0,
    negative_prompt: str | list[str] | None = None,
    negative_prompt_2: str | list[str] | None = None,
    num_images_per_prompt: int | None = 1,
    eta: float = 0.0,
    generator: torch.Generator | list[torch.Generator] | None = None,
    latents: torch.Tensor | None = None,
    prompt_embeds: torch.Tensor | None = None,
    negative_prompt_embeds: torch.Tensor | None = None,
    pooled_prompt_embeds: torch.Tensor | None = None,
    negative_pooled_prompt_embeds: torch.Tensor | None = None,
    ip_adapter_image: PipelineImageInput | None = None,
    ip_adapter_image_embeds: list[torch.Tensor] | None = None,
    output_type: str | None = "pil",
    return_dict: bool = True,
    cross_attention_kwargs: dict[str, Any] | None = None,
    controlnet_conditioning_scale: float | list[float] = 1.0,
    control_guidance_start: float | list[float] = 0.0,
    control_guidance_end: float | list[float] = 1.0,
    original_size: tuple[int, int] = None,
    crops_coords_top_left: tuple[int, int] = (0, 0),
    target_size: tuple[int, int] = None,
    negative_original_size: tuple[int, int] | None = None,
    negative_crops_coords_top_left: tuple[int, int] = (0, 0),
    negative_target_size: tuple[int, int] | None = None,
    clip_skip: int | None = None,
    callback_on_step_end: Callable[[int, int], None] | PipelineCallback | MultiPipelineCallbacks | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
    pag_scale: float = 3.0,
    pag_adaptive_scale: float = 0.0,
):

down_block_res_samples, mid_block_res_sample = self.controlnet(
    control_model_input,
    t,
    encoder_hidden_states=controlnet_prompt_embeds,
    controlnet_cond=image,
    conditioning_scale=cond_scale,
    guess_mode=False,
    added_cond_kwargs=controlnet_added_cond_kwargs,

Problem:
StableDiffusionControlNetPAGInpaintPipeline and StableDiffusionXLControlNetPAGPipeline omit the public guess_mode argument and hard-code guess_mode=False in image preparation and ControlNet forward. They also skip the base pipeline’s global_pool_conditions fallback.

Impact:
ControlNet checkpoints that require guess mode/global pooling cannot be used through these PAG variants, even though the corresponding base pipelines support the feature.

Reproduction:

import inspect
from diffusers import StableDiffusionControlNetPAGInpaintPipeline, StableDiffusionXLControlNetPAGPipeline

for cls in (StableDiffusionControlNetPAGInpaintPipeline, StableDiffusionXLControlNetPAGPipeline):
    print(cls.__name__, "guess_mode" in inspect.signature(cls.__call__).parameters)
# False for both

Relevant precedent:
Base inpaint and SDXL ControlNet expose and normalize guess_mode:

    controlnet_conditioning_scale: float | list[float] = 0.5,
    guess_mode: bool = False,
    control_guidance_start: float | list[float] = 0.0,
    control_guidance_end: float | list[float] = 1.0,
    clip_skip: int | None = None,
    callback_on_step_end: Callable[[int, int], None] | PipelineCallback | MultiPipelineCallbacks | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
    **kwargs,

    cross_attention_kwargs: dict[str, Any] | None = None,
    controlnet_conditioning_scale: float | list[float] = 1.0,
    guess_mode: bool = False,
    control_guidance_start: float | list[float] = 0.0,
    control_guidance_end: float | list[float] = 1.0,
    original_size: tuple[int, int] = None,
    crops_coords_top_left: tuple[int, int] = (0, 0),
    target_size: tuple[int, int] = None,

A search surfaced the old base SDXL guess-mode issue #4709, but it is closed and not a duplicate of these PAG omissions.

Suggested fix:

# __call__ signature
guess_mode: bool = False,

# after resolving controlnet
global_pool_conditions = (
    controlnet.config.global_pool_conditions
    if isinstance(controlnet, ControlNetModel)
    else controlnet.nets[0].config.global_pool_conditions
)
guess_mode = guess_mode or global_pool_conditions

# replace every hard-coded guess_mode=False with guess_mode=guess_mode
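
Once exposed, guess mode would be requested exactly as in the base pipelines. A hypothetical usage sketch after the fix (checkpoints are the standard public ones; guess_mode is the newly added argument):

import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPAGPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPAGPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

canny_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/bird_canny.png"
)
image = pipe(
    "a colorful bird",
    image=canny_image,
    guess_mode=True,  # previously hard-coded to False
    pag_scale=3.0,
).images[0]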

Issue 5: SanaPAGPipeline lost Sana LoRA and attention kwargs support

Affected code:

class SanaPAGPipeline(DiffusionPipeline, PAGMixin):
    r"""
    Pipeline for text-to-image generation using [Sana](https://huggingface.co/papers/2410.10629). This pipeline
    supports the use of [Perturbed Attention Guidance
    (PAG)](https://huggingface.co/docs/diffusers/main/en/using-diffusers/pag).
    """

    # fmt: off
    bad_punct_regex = re.compile(r"[" + "#®•©™&@·º½¾¿¡§~" + r"\)" + r"\(" + r"\]" + r"\[" + r"\}" + r"\{" + r"\|" + "\\" + r"\/" + r"\*" + r"]{1,}")
    # fmt: on

    model_cpu_offload_seq = "text_encoder->transformer->vae"
    _callback_tensor_inputs = ["latents", "prompt_embeds", "negative_prompt_embeds"]

def __call__(
    self,
    prompt: str | list[str] = None,
    negative_prompt: str = "",
    num_inference_steps: int = 20,
    timesteps: list[int] = None,
    sigmas: list[float] = None,
    guidance_scale: float = 4.5,
    num_images_per_prompt: int | None = 1,
    height: int = 1024,
    width: int = 1024,
    eta: float = 0.0,
    generator: torch.Generator | list[torch.Generator] | None = None,
    latents: torch.Tensor | None = None,
    prompt_embeds: torch.Tensor | None = None,
    prompt_attention_mask: torch.Tensor | None = None,
    negative_prompt_embeds: torch.Tensor | None = None,
    negative_prompt_attention_mask: torch.Tensor | None = None,
    output_type: str | None = "pil",
    return_dict: bool = True,
    clean_caption: bool = False,
    use_resolution_binning: bool = True,
    callback_on_step_end: Callable[[int, int], None] | None = None,
    callback_on_step_end_tensor_inputs: list[str] = ["latents"],
    max_sequence_length: int = 300,
    complex_human_instruction: list[str] = [
        "Given a user prompt, generate an 'Enhanced prompt' that provides detailed visual descriptions suitable for image generation. Evaluate the level of detail in the user prompt:",
        "- If the prompt is simple, focus on adding specifics about colors, shapes, sizes, textures, and spatial relationships to create vivid and concrete scenes.",
        "- If the prompt is already detailed, refine and enhance the existing details slightly without overcomplicating.",
        "Here are examples of how to transform or refine prompts:",
        "- User Prompt: A cat sleeping -> Enhanced: A small, fluffy white cat curled up in a round shape, sleeping peacefully on a warm sunny windowsill, surrounded by pots of blooming red flowers.",
        "- User Prompt: A busy city street -> Enhanced: A bustling city street scene at dusk, featuring glowing street lamps, a diverse crowd of people in colorful clothing, and a double-decker bus passing by towering glass skyscrapers.",
        "Please generate only the enhanced description for the prompt below and avoid including any additional commentary or evaluations:",
        "User Prompt: ",
    ],
    pag_scale: float = 3.0,
    pag_adaptive_scale: float = 0.0,
) -> ImagePipelineOutput | tuple:

noise_pred = self.transformer(
    latent_model_input,
    encoder_hidden_states=prompt_embeds,
    encoder_attention_mask=prompt_attention_mask,
    timestep=timestep,

Problem:
Base SanaPipeline inherits SanaLoraLoaderMixin, accepts attention_kwargs, uses attention_kwargs["scale"] for prompt encoding, and forwards the kwargs into the transformer. SanaPAGPipeline inherits only DiffusionPipeline and PAGMixin, and has no attention_kwargs argument or forwarding.

Impact:
Users cannot load or scale Sana LoRAs with the PAG pipeline, despite Sana supporting them and the PAG API page carrying a LoRA badge.

Reproduction:

import inspect
from diffusers import SanaPipeline, SanaPAGPipeline

print(hasattr(SanaPipeline, "load_lora_weights"), hasattr(SanaPAGPipeline, "load_lora_weights"))
print("attention_kwargs" in inspect.signature(SanaPipeline.__call__).parameters)
print("attention_kwargs" in inspect.signature(SanaPAGPipeline.__call__).parameters)
# True False
# True
# False

Relevant precedent:

class SanaPipeline(DiffusionPipeline, SanaLoraLoaderMixin):

    def __call__(
        self,
        prompt: str | list[str] = None,
        negative_prompt: str = "",
        num_inference_steps: int = 20,
        timesteps: list[int] = None,
        sigmas: list[float] = None,
        guidance_scale: float = 4.5,
        num_images_per_prompt: int | None = 1,
        height: int = 1024,
        width: int = 1024,
        eta: float = 0.0,
        generator: torch.Generator | list[torch.Generator] | None = None,
        latents: torch.Tensor | None = None,
        prompt_embeds: torch.Tensor | None = None,
        prompt_attention_mask: torch.Tensor | None = None,
        negative_prompt_embeds: torch.Tensor | None = None,
        negative_prompt_attention_mask: torch.Tensor | None = None,
        output_type: str | None = "pil",
        return_dict: bool = True,
        clean_caption: bool = False,
        use_resolution_binning: bool = True,
        attention_kwargs: dict[str, Any] | None = None,
        callback_on_step_end: Callable[[int, int], None] | None = None,
        callback_on_step_end_tensor_inputs: list[str] = ["latents"],

        self._guidance_scale = guidance_scale
        self._attention_kwargs = attention_kwargs
        self._interrupt = False

        # 2. Default height and width to transformer
        if prompt is not None and isinstance(prompt, str):
            batch_size = 1
        elif prompt is not None and isinstance(prompt, list):
            batch_size = len(prompt)
        else:
            batch_size = prompt_embeds.shape[0]

        device = self._execution_device
        lora_scale = self.attention_kwargs.get("scale", None) if self.attention_kwargs is not None else None

A related SanaPAG quality issue (#10241) exists, but it is not a duplicate of this LoRA/attention_kwargs API gap.

Suggested fix:

class SanaPAGPipeline(DiffusionPipeline, SanaLoraLoaderMixin, PAGMixin):
    ...

def __call__(..., attention_kwargs: dict[str, Any] | None = None, ...):
    self._attention_kwargs = attention_kwargs
    lora_scale = self.attention_kwargs.get("scale", None) if self.attention_kwargs is not None else None
    ...
    noise_pred = self.transformer(..., attention_kwargs=self.attention_kwargs, return_dict=False)[0]
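
With the mixin in place, LoRA usage would match the base Sana pipeline. A hypothetical usage sketch after the fix (the base checkpoint is the public Sana release; the LoRA repo ID is illustrative):

import torch
from diffusers import SanaPAGPipeline

pipe = SanaPAGPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_1600M_1024px_diffusers",
    torch_dtype=torch.float16,
    pag_applied_layers=["transformer_blocks.8"],
).to("cuda")

# Illustrative LoRA repo; any Sana LoRA in diffusers format works the same way.
pipe.load_lora_weights("some-user/sana-style-lora")

image = pipe(
    "a watercolor painting of a fox",
    pag_scale=2.0,
    attention_kwargs={"scale": 0.8},  # LoRA scale, forwarded into the transformer
).images[0]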

Issue 6: PAG docs have broken examples and omit Sana API docs

Affected code:

pag_scales = 4.0
guidance_scales = 7.0
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png"
init_image = load_image(url)
prompt = "a dog catching a frisbee in the jungle"
generator = torch.Generator(device="cpu").manual_seed(0)
image = pipeline(
    prompt,
    image=init_image,
    strength=0.8,
    guidance_scale=guidance_scale,
    pag_scale=pag_scale,
    generator=generator,
).images[0]

pipeline_t2i = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
pipeline = AutoPipelineForInpaiting.from_pipe(pipeline_t2i, enable_pag=True)
```
Let's generate an image!
```py
img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
init_image = load_image(img_url).convert("RGB")
mask_image = load_image(mask_url).convert("RGB")
prompt = "A majestic tiger sitting on a bench"
pag_scales = 3.0
guidance_scales = 7.5
generator = torch.Generator(device="cpu").manual_seed(1)
images = pipeline(
    prompt=prompt,
    image=init_image,
    mask_image=mask_image,
    strength=0.8,
    num_inference_steps=50,
    guidance_scale=guidance_scale,
    generator=generator,
    pag_scale=pag_scale,

## AnimateDiffPAGPipeline
[[autodoc]] AnimateDiffPAGPipeline
  - all
  - __call__

## HunyuanDiTPAGPipeline
[[autodoc]] HunyuanDiTPAGPipeline
  - all
  - __call__

## KolorsPAGPipeline
[[autodoc]] KolorsPAGPipeline
  - all
  - __call__

## StableDiffusionPAGInpaintPipeline
[[autodoc]] StableDiffusionPAGInpaintPipeline
  - all
  - __call__

## StableDiffusionPAGPipeline
[[autodoc]] StableDiffusionPAGPipeline
  - all
  - __call__

## StableDiffusionPAGImg2ImgPipeline
[[autodoc]] StableDiffusionPAGImg2ImgPipeline
  - all
  - __call__

## StableDiffusionControlNetPAGPipeline
[[autodoc]] StableDiffusionControlNetPAGPipeline

## StableDiffusionControlNetPAGInpaintPipeline
[[autodoc]] StableDiffusionControlNetPAGInpaintPipeline
  - all
  - __call__

## StableDiffusionXLPAGPipeline
[[autodoc]] StableDiffusionXLPAGPipeline
  - all
  - __call__

## StableDiffusionXLPAGImg2ImgPipeline
[[autodoc]] StableDiffusionXLPAGImg2ImgPipeline
  - all
  - __call__

## StableDiffusionXLPAGInpaintPipeline
[[autodoc]] StableDiffusionXLPAGInpaintPipeline
  - all
  - __call__

## StableDiffusionXLControlNetPAGPipeline
[[autodoc]] StableDiffusionXLControlNetPAGPipeline
  - all
  - __call__

## StableDiffusionXLControlNetPAGImg2ImgPipeline
[[autodoc]] StableDiffusionXLControlNetPAGImg2ImgPipeline
  - all
  - __call__

## StableDiffusion3PAGPipeline
[[autodoc]] StableDiffusion3PAGPipeline
  - all
  - __call__

## StableDiffusion3PAGImg2ImgPipeline
[[autodoc]] StableDiffusion3PAGImg2ImgPipeline
  - all
  - __call__

## PixArtSigmaPAGPipeline
[[autodoc]] PixArtSigmaPAGPipeline
  - all
  - __call__

Problem:
The guide defines pag_scales/guidance_scales but then passes pag_scale/guidance_scale, references the misspelled AutoPipelineForInpaiting (the class is AutoPipelineForInpainting), and uses pipeline_t2i where the preceding snippet defines pipeline_pag. The API page never documents SanaPAGPipeline even though the code exports it.

Impact:
Copy-pasted PAG guide snippets fail before generation, and Sana PAG users cannot find the API reference.

Reproduction:

from pathlib import Path

guide = Path("docs/source/en/using-diffusers/pag.md").read_text()
api = Path("docs/source/en/api/pipelines/pag.md").read_text()
# Each assertion below currently fails (execution stops at the first).
assert "AutoPipelineForInpaiting" not in guide
assert "pag_scales" not in guide or "pag_scale=pag_scale" not in guide
assert "## SanaPAGPipeline" in api

Relevant precedent:
No duplicate found for AutoPipelineForInpaiting pag_scale guidance_scale PAG docs.

Suggested fix:

pag_scale = 4.0
guidance_scale = 7.0
...
pipeline = AutoPipelineForInpainting.from_pipe(pipeline_t2i, enable_pag=True)

Also add:

## SanaPAGPipeline
[[autodoc]] SanaPAGPipeline
  - all
  - __call__

Issue 7: Slow coverage is missing for most PAG variants

Affected code:

class AnimateDiffPAGPipelineFastTests(

class StableDiffusionControlNetPAGPipelineFastTests(

class StableDiffusionControlNetPAGInpaintPipelineFastTests(

class StableDiffusionXLControlNetPAGPipelineFastTests(

class StableDiffusionXLControlNetPAGImg2ImgPipelineFastTests(

class HunyuanDiTPAGPipelineFastTests(PipelineTesterMixin, unittest.TestCase):

class KolorsPAGPipelineFastTests(

class PixArtSigmaPAGPipelineFastTests(PipelineTesterMixin, unittest.TestCase):

class SanaPAGPipelineFastTests(PipelineTesterMixin, unittest.TestCase):

class StableDiffusion3PAGPipelineFastTests(unittest.TestCase, PipelineTesterMixin):

Problem:
Fast tests exist across the family, but these PAG variants have no @slow integration tests. The repository review rules explicitly require missing slow tests to be reported.

Impact:
Real checkpoint/API drift is not covered for many PAG pipelines, including the stale SD3/Sana/ControlNet gaps above.

Reproduction:

from pathlib import Path

missing = [
    p.as_posix()
    for p in sorted(Path("tests/pipelines/pag").glob("test_pag_*.py"))
    if "@slow" not in p.read_text(encoding="utf-8")
]
print("\n".join(missing))

Relevant precedent:
Existing slow PAG tests are present for SD, SD img2img/inpaint, SDXL, SDXL img2img/inpaint, and SD3 img2img.

Suggested fix:
Add at least one @slow smoke/integration class per missing pipeline using the smallest stable public checkpoint available, covering pag_scale=0 parity and pag_scale>0 execution.
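
A minimal skeleton of such a class, modeled on the existing slow PAG tests (the checkpoint ID and pipeline mapping are illustrative and should be adapted per variant):

import gc
import unittest

import torch

from diffusers import AutoPipelineForText2Image
from diffusers.utils.testing_utils import require_torch_gpu, slow, torch_device


@slow
@require_torch_gpu
class SanaPAGPipelineIntegrationTests(unittest.TestCase):
    # Illustrative checkpoint; substitute the smallest stable public one.
    repo_id = "Efficient-Large-Model/Sana_600M_512px_diffusers"

    def tearDown(self):
        super().tearDown()
        gc.collect()
        torch.cuda.empty_cache()

    def test_pag_scale_zero_and_positive(self):
        pipe = AutoPipelineForText2Image.from_pretrained(
            self.repo_id, enable_pag=True, torch_dtype=torch.float16
        ).to(torch_device)
        # pag_scale=0 exercises the no-perturbation path; a fuller test would
        # compare this output against the base pipeline for numerical parity.
        generator = torch.Generator(device="cpu").manual_seed(0)
        out_off = pipe("a photo of a cat", num_inference_steps=2, pag_scale=0.0,
                       generator=generator, output_type="np").images
        # pag_scale>0 must at least execute end to end.
        generator = torch.Generator(device="cpu").manual_seed(0)
        out_on = pipe("a photo of a cat", num_inference_steps=2, pag_scale=3.0,
                      generator=generator, output_type="np").images
        assert out_off.shape == out_on.shape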
