
Incorrect mask padding in Wan VACE pipeline #11779

@bennyguo

Description

Describe the bug

When using reference images in the Wan VACE pipeline, the reference-image latents are placed at the front of the other latents, and the corresponding mask padding should be placed at the front of the other masks as well. However, on the main branch the mask padding for the reference latents is concatenated after the masks for the other latents:

mask_ = torch.cat([mask_, mask_padding], dim=1)

It should be torch.cat([mask_padding, mask_], dim=1), which aligns with the official implementation:

https://github.com/ali-vilab/VACE/blob/0897c6d055d7d9ea9e191dce763006664d9780f8/vace/models/wan/wan_vace.py#L187

I've fixed this in this PR.
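
For clarity, here is a minimal sketch of the two concatenation orders. The tensor shapes and padding values are made up for illustration (assuming a batch, frames, height, width layout for the mask); the point is only that the padding for the reference frames has to sit at the front, mirroring where the reference latents are inserted.

import torch

# Illustrative shapes only (assumed layout: batch, frames, height, width).
num_ref_frames = 1                                       # one reference image
mask_ = torch.ones(1, 21, 60, 104)                       # masks for the video frames
mask_padding = torch.zeros(1, num_ref_frames, 60, 104)   # padding for the reference frames (value illustrative)

# Current (buggy) order: the reference padding ends up *behind* the video masks.
buggy = torch.cat([mask_, mask_padding], dim=1)

# Fixed order: the reference padding is prepended, matching the reference
# latents that sit at the front of the latent sequence.
fixed = torch.cat([mask_padding, mask_], dim=1)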

Reproduction

Running the pipeline with reference images should trigger the bug:

import torch
from diffusers import AutoencoderKLWan, WanVACEPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
from diffusers.utils import export_to_video, load_image


model_id = "Wan-AI/Wan2.1-VACE-1.3B-diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanVACEPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
flow_shift = 3.0  # 5.0 for 720P, 3.0 for 480P
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=flow_shift)
pipe.to("cuda")

prompt = "A penguin walks out of a building in a happy mood. The background is a busy street with people walking by. He looks joyful and is smiling and dancing with some crazy moves. The scene is bright. The lighting is warm and inviting, creating a cheerful atmosphere. The camera angle is slightly low, capturing the character from below."
negative_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"

height = 480
num_frames = 21
width = 832
reference_image = load_image("penguin.png")

output = pipe(
    reference_images=reference_image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=num_frames,
    num_inference_steps=30,
    guidance_scale=5.0,
    generator=torch.Generator().manual_seed(42),
).frames[0]
export_to_video(output, "ref.mp4", fps=16)

The image I used:

[penguin.png reference image]

Results before fixing the bug:

current.mp4

Results after fixing the bug:

fixed.mp4

You can see that before the fix, the reference image somehow "leaked" into the generated video, and prompt-following is worse.

Logs

System Info

  • 🤗 Diffusers version: 0.34.0.dev0
  • Platform: Linux-3.10.0-1160.83.1.el7.x86_64-x86_64-with-glibc2.39
  • Running on Google Colab?: No
  • Python version: 3.12.3
  • PyTorch version (GPU?): 2.7.0a0+7c8ec84dab.nv25.03 (True)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Huggingface_hub version: 0.32.2
  • Transformers version: 4.52.3
  • Accelerate version: 1.8.1
  • PEFT version: 0.15.2
  • Bitsandbytes version: not installed
  • Safetensors version: 0.5.3
  • xFormers version: not installed
  • Accelerator: NVIDIA A800-SXM4-80GB, 81920 MiB

Who can help?

No response
