Wan 2.2 I2V condition shape mismatch #12499

@DefTruth

Describe the bug

RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 21 but got size 81 for tensor number 1 in the list

(cu129)➜ pipeline git:(dev) ✗ python3 run_wan_2.2_i2v.py
torch.Size([1, 1, 4, 90, 68])
torch.Size([1, 1, 84, 90, 68]) 4
torch.Size([1, 21, 4, 90, 68])
torch.Size([1, 4, 21, 90, 68]) torch.Size([1, 16, 81, 90, 68])
Traceback (most recent call last):
  File "/workspace/dev/vipshop/cache-dit/examples/pipeline/run_wan_2.2_i2v.py", line 154, in <module>
    video = run_pipe()
            ^^^^^^^^^^
  File "/workspace/dev/vipshop/cache-dit/examples/pipeline/run_wan_2.2_i2v.py", line 130, in run_pipe
    video = pipe(
            ^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/dev/vipshop/diffusers/src/diffusers/pipelines/wan/pipeline_wan_i2v.py", line 705, in __call__
    latents_outputs = self.prepare_latents(
                      ^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/dev/vipshop/diffusers/src/diffusers/pipelines/wan/pipeline_wan_i2v.py", line 487, in prepare_latents
    return latents, torch.concat([mask_lat_size, latent_condition], dim=1)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 21 but got size 81 for tensor number 1 in the list.
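
The shapes printed in the log can be checked by hand. The sketch below is pure Python with no torch dependency. `latent_frames` and `concat_mismatches` are hypothetical helper names, and the temporal compression factor of 4 is an assumption read off the `4` printed in the log. It suggests the mask has already been reduced to 21 latent frames while `latent_condition` still has 81, the same value as `num_frames`.

```python
def latent_frames(num_frames: int, temporal_scale: int = 4) -> int:
    """Latent frame count, assuming the first frame is kept and the
    remaining frames are compressed by `temporal_scale` (assumed to be 4)."""
    return (num_frames - 1) // temporal_scale + 1


def concat_mismatches(shape_a, shape_b, dim):
    """Return (axis, size_a, size_b) for every axis other than `dim`
    on which the two shapes disagree; concat requires this to be empty."""
    if len(shape_a) != len(shape_b):
        raise ValueError("shapes must have the same rank")
    return [
        (axis, a, b)
        for axis, (a, b) in enumerate(zip(shape_a, shape_b))
        if axis != dim and a != b
    ]


# Shapes copied from the debug output in this report.
mask_lat_size = (1, 4, 21, 90, 68)
latent_condition = (1, 16, 81, 90, 68)

print(latent_frames(81))  # 21
print(concat_mismatches(mask_lat_size, latent_condition, dim=1))  # [(2, 21, 81)]
```

If that reading is right, the condition tensor may not have been temporally compressed before the concat, but the log alone does not confirm the root cause.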

Reproduction

The script below is taken from the model card: https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B-Diffusers

import torch
import numpy as np
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

model_id = "Wan-AI/Wan2.2-I2V-A14B-Diffusers"
dtype = torch.bfloat16
device = "cuda"

pipe = WanImageToVideoPipeline.from_pretrained(model_id, torch_dtype=dtype)
pipe.to(device)


image = load_image(
    "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/wan_i2v_input.JPG"
)
max_area = 480 * 832
aspect_ratio = image.height / image.width
mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
image = image.resize((width, height))
prompt = "Summer beach vacation style, a white cat wearing sunglasses sits on a surfboard. The fluffy-furred feline gazes directly at the camera with a relaxed expression. Blurred beach scenery forms the background featuring crystal-clear waters, distant green hills, and a blue sky dotted with white clouds. The cat assumes a naturally relaxed posture, as if savoring the sea breeze and warm sunlight. A close-up shot highlights the feline's intricate details and the refreshing atmosphere of the seaside."

negative_prompt = "色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"
generator = torch.Generator(device=device).manual_seed(0)
output = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=81,
    guidance_scale=3.5,
    num_inference_steps=40,
    generator=generator,
).frames[0]
export_to_video(output, "i2v_output.mp4", fps=16)
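
For reference, the height and width rounding in the script can be reproduced without loading the model. This is a minimal sketch under stated assumptions: `fit_resolution` is a hypothetical helper, `mod_value = 16` assumes a spatial VAE factor of 8 and a patch size of 2, and the 720 x 544 input size is illustrative only, since the report does not state the image dimensions.

```python
import math


def fit_resolution(img_height, img_width, max_area, mod_value):
    """Scale to roughly `max_area` pixels at the image's aspect ratio,
    then round each side down to a multiple of `mod_value`."""
    aspect_ratio = img_height / img_width
    height = round(math.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(math.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    return height, width


# Assumed values: mod_value = 8 (VAE spatial factor) * 2 (patch size).
height, width = fit_resolution(720, 544, max_area=480 * 832, mod_value=16)
print(height, width)            # 720 544
print(height // 8, width // 8)  # 90 68
```

Under these assumptions the result is consistent with the 90 x 68 latent size printed in the log, so the spatial dimensions do not appear to be involved in the mismatch.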

System Info

diffusers 0.36.0.dev0 (main), pytorch 2.9.0

Who can help?

@yiyixuxu

Labels: bug