
Incorrect mask padding in Wan VACE pipeline #11779

@bennyguo

Description

Describe the bug

When using reference images in the Wan VACE pipeline, the reference-image latents are placed at the front of the other latents, and the corresponding mask padding should be placed at the front of the other masks as well. However, on the main branch the mask padding for the reference latents is concatenated after the masks for the other latents:

mask_ = torch.cat([mask_, mask_padding], dim=1)

It should be torch.cat([mask_padding, mask_], dim=1), which aligns with the official implementation:

https://github.com/ali-vilab/VACE/blob/0897c6d055d7d9ea9e191dce763006664d9780f8/vace/models/wan/wan_vace.py#L187

I've fixed this in this PR.
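
For clarity, here is a minimal sketch of the two concatenation orders. The tensor shapes and padding values are made up for illustration (assuming a batch, frames, height, width layout for the mask); the point is only that the padding for the reference frames has to sit at the front, mirroring where the reference latents are inserted.

import torch

# Illustrative shapes only (assumed layout: batch, frames, height, width).
num_ref_frames = 1                                       # one reference image
mask_ = torch.ones(1, 21, 60, 104)                       # masks for the video frames
mask_padding = torch.zeros(1, num_ref_frames, 60, 104)   # padding for the reference frames (value illustrative)

# Current (buggy) order: the reference padding ends up *behind* the video masks.
buggy = torch.cat([mask_, mask_padding], dim=1)

# Fixed order: the reference padding is prepended, matching the reference
# latents that sit at the front of the latent sequence.
fixed = torch.cat([mask_padding, mask_], dim=1)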

Reproduction

Running the pipeline with reference images should trigger the bug:

import torch
from diffusers import AutoencoderKLWan, WanVACEPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
from diffusers.utils import export_to_video, load_image


model_id = "Wan-AI/Wan2.1-VACE-1.3B-diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanVACEPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
flow_shift = 3.0  # 5.0 for 720P, 3.0 for 480P
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=flow_shift)
pipe.to("cuda")

prompt = "A penguin walks out of a building in a happy mood. The background is a busy street with people walking by. He looks joyful and is smiling and dancing with some crazy moves. The scene is bright. The lighting is warm and inviting, creating a cheerful atmosphere. The camera angle is slightly low, capturing the character from below."
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"

height = 480
num_frames = 21
width = 832
reference_image = load_image("penguin.png")

output = pipe(
    reference_images=reference_image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=num_frames,
    num_inference_steps=30,
    guidance_scale=5.0,
    generator=torch.Generator().manual_seed(42),
).frames[0]
export_to_video(output, "ref.mp4", fps=16)

The image I used:

[penguin.png reference image]

Results before fixing the bug:

current.mp4

Results after fixing the bug:

fixed.mp4

You can see that before the fix, the reference image somehow "leaked" into the generated video, and prompt-following is worse.

Logs

System Info

  • 🤗 Diffusers version: 0.34.0.dev0
  • Platform: Linux-3.10.0-1160.83.1.el7.x86_64-x86_64-with-glibc2.39
  • Running on Google Colab?: No
  • Python version: 3.12.3
  • PyTorch version (GPU?): 2.7.0a0+7c8ec84dab.nv25.03 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.32.2
  • Transformers version: 4.52.3
  • Accelerate version: 1.8.1
  • PEFT version: 0.15.2
  • Bitsandbytes version: not installed
  • Safetensors version: 0.5.3
  • xFormers version: not installed
  • Accelerator: NVIDIA A800-SXM4-80GB, 81920 MiB

Who can help?

No response
