Closed
50 commits
3529a0a
template1
tolgacangoz Oct 6, 2025
4f2ee5e
temp2
tolgacangoz Oct 6, 2025
778fb54
up
tolgacangoz Oct 6, 2025
d77b6ba
up
tolgacangoz Oct 6, 2025
2fc6ac2
fix-copies
tolgacangoz Oct 6, 2025
d667d03
Add support for Wan2.2-Animate-14B model in convert_wan_to_diffusers.py
tolgacangoz Oct 7, 2025
6182d44
style
tolgacangoz Oct 7, 2025
8c9fd89
Refactor WanAnimate model components
tolgacangoz Oct 7, 2025
d01e941
Enhance `WanAnimatePipeline` with new parameters for mode and tempora…
tolgacangoz Oct 7, 2025
7af953b
Update `WanAnimatePipeline` to require additional video inputs and im…
tolgacangoz Oct 7, 2025
a0372e3
Add Wan 2.2 Animate 14B model support and introduce Wan-Animate frame…
tolgacangoz Oct 7, 2025
05a01c6
Add unit test template for `WanAnimatePipeline` functionality
tolgacangoz Oct 7, 2025
22b83ce
Add unit tests for `WanAnimateTransformer3DModel` in GGUF format
tolgacangoz Oct 7, 2025
7fb6732
style
tolgacangoz Oct 7, 2025
3e6f893
Improve the template of `transformer_wan_animate.py`
tolgacangoz Oct 7, 2025
624a314
Update `WanAnimatePipeline`
tolgacangoz Oct 7, 2025
fc0edb5
style
tolgacangoz Oct 7, 2025
eb7eedd
Refactor test for `WanAnimatePipeline` to include new input structure
tolgacangoz Oct 7, 2025
8968b42
from `einops` to `torch`
tolgacangoz Oct 8, 2025
dce83a8
Merge branch 'main' into integrations/wan2.2-animate
tolgacangoz Oct 8, 2025
75b2382
Add padding functionality to `WanAnimatePipeline` for video frames
tolgacangoz Oct 8, 2025
802896e
style
tolgacangoz Oct 8, 2025
e06098f
Enhance `WanAnimatePipeline` with additional input parameters for imp…
tolgacangoz Oct 8, 2025
84768f6
up
tolgacangoz Oct 8, 2025
06e6138
Refactor `WanAnimatePipeline` for improved tensor handling and mask g…
tolgacangoz Oct 8, 2025
5777ce0
Refactor `WanAnimatePipeline` to streamline latent tensor processing …
tolgacangoz Oct 9, 2025
b8337c6
style
tolgacangoz Oct 9, 2025
f4eb9a0
Add new layers and functions to `transformer_wan_animate.py` for enha…
tolgacangoz Oct 9, 2025
4e6651b
Merge branch 'main' into integrations/wan2.2-animate
tolgacangoz Oct 13, 2025
d80ae19
Refactor `transformer_wan_animate.py` to improve modularity and type …
tolgacangoz Oct 10, 2025
348a945
Refactor `transformer_wan_animate.py` to enhance modularity and updat…
tolgacangoz Oct 15, 2025
7774421
Update the `ConvLayer` class to conditionally apply bias based on act…
tolgacangoz Oct 17, 2025
a5536e2
Simplify
tolgacangoz Oct 17, 2025
6a8662d
refactor transformer
tolgacangoz Oct 17, 2025
96a126a
Enhance `convert_wan_to_diffusers.py` for Animate model integration
tolgacangoz Oct 17, 2025
050b313
Merge branch 'main' into integrations/wan2.2-animate
tolgacangoz Oct 18, 2025
0566e5d
Enhance `convert_wan_to_diffusers.py` and `WanAnimatePipeline` for im…
tolgacangoz Oct 20, 2025
fe02c25
simplify
tolgacangoz Oct 20, 2025
04ab262
Refactor `WanAnimatePipeline` to enhance reference image handling and…
tolgacangoz Oct 20, 2025
7bfbd93
Enhance weight conversion logic in `convert_wan_to_diffusers.py`
tolgacangoz Oct 21, 2025
7092a28
Enhance documentation and tests for WanAnimatePipeline, adding exampl…
tolgacangoz Oct 21, 2025
5d01574
Merge branch 'main' into integrations/wan2.2-animate
tolgacangoz Oct 21, 2025
9c0a65d
Clarify contribution of M. Tolga Cangöz
tolgacangoz Oct 21, 2025
28ac516
Update face_embedder key mappings in `convert_wan_to_diffusers.py`
tolgacangoz Oct 21, 2025
b71d3a9
up
tolgacangoz Oct 21, 2025
5818d71
up
tolgacangoz Oct 21, 2025
bfda25d
Fix image embedding extraction in WanAnimatePipeline to return the la…
tolgacangoz Oct 21, 2025
0ac259c
Adjust default parameters in WanAnimatePipeline for num_frames, num_i…
tolgacangoz Oct 21, 2025
e2e95ed
Update example docstring parameters for num_frames and guidance_scale…
tolgacangoz Oct 21, 2025
7146bb0
Refactor tests in WanAnimatePipeline: remove redundant assertions and…
tolgacangoz Oct 21, 2025
255 changes: 238 additions & 17 deletions docs/source/en/api/pipelines/wan.md
@@ -40,6 +40,7 @@ The following Wan models are supported in Diffusers:
- [Wan 2.2 T2V 14B](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B-Diffusers)
- [Wan 2.2 I2V 14B](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B-Diffusers)
- [Wan 2.2 TI2V 5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B-Diffusers)
- [Wan 2.2 Animate 14B](https://huggingface.co/Wan-AI/Wan2.2-Animate-14B-Diffusers)

> [!TIP]
> Click on the Wan models in the right sidebar for more examples of video generation.
@@ -95,15 +96,15 @@ pipeline = WanPipeline.from_pretrained(
pipeline.to("cuda")

prompt = """
The camera rushes from far to near in a low-angle shot,
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground.
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic
shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""
negative_prompt = """
Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality,
low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured,
misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
"""

@@ -150,15 +151,15 @@ pipeline.transformer = torch.compile(
)

prompt = """
The camera rushes from far to near in a low-angle shot,
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground.
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic
shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""
negative_prompt = """
Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality,
low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured,
misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
"""

@@ -249,6 +250,220 @@ The code snippets available in [this](https://github.com/huggingface/diffusers/p

The general rule of thumb to keep in mind when preparing inputs for the VACE pipeline is that the input images, or frames of a video that you want to use for conditioning, should have a corresponding mask that is black in color. The black mask signifies that the model will not generate new content for that area, and only use those parts for conditioning the generation process. For parts/frames that should be generated by the model, the mask should be white in color.
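
The mask convention above can be sketched as follows. This is only an illustration, not part of the VACE pipeline API: the helper names `mask_fill_values` and `to_pil_masks` are hypothetical, and the example assumes that the first and last frames are the conditioning frames.

```python
from typing import Iterable, List

BLACK = 0    # keep: the frame is used only for conditioning
WHITE = 255  # generate: the model produces new content here


def mask_fill_values(num_frames: int, conditioning_frames: Iterable[int]) -> List[int]:
    """Return one grayscale fill value per frame (hypothetical helper)."""
    keep = set(conditioning_frames)
    return [BLACK if index in keep else WHITE for index in range(num_frames)]


def to_pil_masks(fill_values: List[int], width: int, height: int):
    """Turn fill values into full-frame PIL masks (requires Pillow)."""
    from PIL import Image  # imported lazily so the helper above has no dependencies

    return [Image.new("L", (width, height), value) for value in fill_values]


# First and last frames condition the generation; the frames in between are generated.
values = mask_fill_values(num_frames=5, conditioning_frames=[0, 4])
print(values)  # [0, 255, 255, 255, 0]
```

Masks that cover only part of a frame follow the same rule per pixel: black where the input should be kept as conditioning, white where the model should generate.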

</hfoption>
</hfoptions>

### Wan-Animate: Unified Character Animation and Replacement with Holistic Replication

[Wan-Animate](https://huggingface.co/papers/2509.14055) by the Wan Team.

*We introduce Wan-Animate, a unified framework for character animation and replacement. Given a character image and a reference video, Wan-Animate can animate the character by precisely replicating the expressions and movements of the character in the video to generate high-fidelity character videos. Alternatively, it can integrate the animated character into the reference video to replace the original character, replicating the scene's lighting and color tone to achieve seamless environmental integration. Wan-Animate is built upon the Wan model. To adapt it for character animation tasks, we employ a modified input paradigm to differentiate between reference conditions and regions for generation. This design unifies multiple tasks into a common symbolic representation. We use spatially-aligned skeleton signals to replicate body motion and implicit facial features extracted from source images to reenact expressions, enabling the generation of character videos with high controllability and expressiveness. Furthermore, to enhance environmental integration during character replacement, we develop an auxiliary Relighting LoRA. This module preserves the character's appearance consistency while applying the appropriate environmental lighting and color tone. Experimental results demonstrate that Wan-Animate achieves state-of-the-art performance. We are committed to open-sourcing the model weights and its source code.*

The project page: https://humanaigc.github.io/wan-animate

This model was mostly contributed by [M. Tolga Cangöz](https://github.com/tolgacangoz).

#### Usage

The Wan-Animate pipeline supports two modes of operation:

1. **Animation Mode** (default): Animates a character image based on motion and expression from reference videos
2. **Replacement Mode**: Replaces a character in a background video with a new character while preserving the scene

##### Prerequisites

Before using the pipeline, you need to preprocess your reference video to extract:
- **Pose video**: Contains skeletal keypoints representing body motion
- **Face video**: Contains facial feature representations for expression control

For replacement mode, you additionally need:
- **Background video**: The original video containing the scene
- **Mask video**: A mask indicating where to generate content (white) vs. preserve original (black)
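
Before calling the pipeline, it may help to check that every input required by the chosen mode is present. The sketch below is a hypothetical pre-flight check, not something the pipeline provides: `REQUIRED_INPUTS` and `missing_inputs` are illustrative names, and the input names mirror the keyword arguments used in the examples on this page.

```python
from typing import Dict, List, Optional

# Required keyword arguments per mode, following the prerequisites listed above.
REQUIRED_INPUTS = {
    "animation": ["image", "pose_video", "face_video"],
    "replacement": ["image", "pose_video", "face_video", "background_video", "mask_video"],
}


def missing_inputs(mode: str, inputs: Dict[str, Optional[object]]) -> List[str]:
    """Return the names of required inputs that are absent or None (hypothetical helper)."""
    if mode not in REQUIRED_INPUTS:
        raise ValueError(f"Unknown mode {mode!r}; expected one of {sorted(REQUIRED_INPUTS)}")
    return [name for name in REQUIRED_INPUTS[mode] if inputs.get(name) is None]


inputs = {"image": "character.jpg", "pose_video": "pose.mp4", "face_video": "face.mp4"}
print(missing_inputs("animation", inputs))    # []
print(missing_inputs("replacement", inputs))  # ['background_video', 'mask_video']
```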

> [!NOTE]
> The preprocessing tools are available in the original Wan-Animate repository. Integration of these preprocessing steps into Diffusers is planned for a future release.

The example below demonstrates how to use the Wan-Animate pipeline:

<hfoptions id="Animate usage">
<hfoption id="Animation mode">

```python
import numpy as np
import torch
from diffusers import AutoencoderKLWan, WanAnimatePipeline
from diffusers.utils import export_to_video, load_image, load_video

model_id = "Wan-AI/Wan2.2-Animate-14B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanAnimatePipeline.from_pretrained(
    model_id, vae=vae, torch_dtype=torch.bfloat16
)
pipe.to("cuda")

# Load character image and preprocessed videos
image = load_image("path/to/character.jpg")
pose_video = load_video("path/to/pose_video.mp4")  # Preprocessed skeletal keypoints
face_video = load_video("path/to/face_video.mp4")  # Preprocessed facial features

# Resize image to match VAE constraints
def aspect_ratio_resize(image, pipe, max_area=720 * 1280):
    aspect_ratio = image.height / image.width
    # Height and width must be multiples of the VAE spatial factor times the patch size
    mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
    height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    image = image.resize((width, height))
    return image, height, width

image, height, width = aspect_ratio_resize(image, pipe)

prompt = "A person dancing energetically in a studio with dynamic lighting and professional camera work"
negative_prompt = "blurry, low quality, distorted, deformed, static, poorly drawn"

# Generate animated video
output = pipe(
    image=image,
    pose_video=pose_video,
    face_video=face_video,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=81,
    guidance_scale=5.0,
    mode="animation",  # Animation mode (default)
).frames[0]
export_to_video(output, "animated_character.mp4", fps=16)
```
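
To see what `aspect_ratio_resize` computes without loading the pipeline, the arithmetic can be reproduced on plain numbers. This is a sketch under one assumption: `mod_value` is taken to be 16 (a spatial VAE factor of 8 times a patch size of 2), which may not match the actual model config. The function name `target_size` is hypothetical.

```python
import math


def target_size(image_width: int, image_height: int, max_area: int = 720 * 1280, mod_value: int = 16):
    """Mirror the arithmetic of `aspect_ratio_resize` without loading a pipeline."""
    aspect_ratio = image_height / image_width
    # Scale to roughly `max_area` pixels, then round down to a multiple of `mod_value`.
    height = round(math.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(math.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    return height, width


print(target_size(1080, 1920))  # (1280, 720) for a 9:16 portrait image
print(target_size(1000, 1000))  # (960, 960) for a square image
```

Because both sides are rounded down to a multiple of `mod_value`, the output area can be slightly smaller than `max_area` and the aspect ratio can shift by a small amount.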

</hfoption>
<hfoption id="Replacement mode">

```python
import numpy as np
import torch
from diffusers import AutoencoderKLWan, WanAnimatePipeline
from diffusers.utils import export_to_video, load_image, load_video
from transformers import CLIPVisionModel

model_id = "Wan-AI/Wan2.2-Animate-14B-Diffusers"
image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float16)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanAnimatePipeline.from_pretrained(
    model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16
)
pipe.to("cuda")

# Load all required inputs for replacement mode
image = load_image("path/to/new_character.jpg")
pose_video = load_video("path/to/pose_video.mp4")  # Preprocessed skeletal keypoints
face_video = load_video("path/to/face_video.mp4")  # Preprocessed facial features
background_video = load_video("path/to/background_video.mp4")  # Original scene
mask_video = load_video("path/to/mask_video.mp4")  # Black: preserve, White: generate

# Resize the character image to a size compatible with the VAE and patch size
def aspect_ratio_resize(image, pipe, max_area=720 * 1280):
    aspect_ratio = image.height / image.width
    mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
    height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    image = image.resize((width, height))
    return image, height, width

image, height, width = aspect_ratio_resize(image, pipe)

prompt = "A person seamlessly integrated into the scene with consistent lighting and environment"
negative_prompt = "blurry, low quality, inconsistent lighting, floating, disconnected from scene"

# Replace character in background video
output = pipe(
    image=image,
    pose_video=pose_video,
    face_video=face_video,
    background_video=background_video,
    mask_video=mask_video,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=81,
    guidance_scale=5.0,
    mode="replacement",  # Replacement mode
).frames[0]
export_to_video(output, "character_replaced.mp4", fps=16)
```

</hfoption>
<hfoption id="Advanced options">

```python
import numpy as np
import torch
from diffusers import AutoencoderKLWan, WanAnimatePipeline
from diffusers.utils import export_to_video, load_image, load_video
from transformers import CLIPVisionModel

model_id = "Wan-AI/Wan2.2-Animate-14B-Diffusers"
image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float16)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanAnimatePipeline.from_pretrained(
    model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = load_image("path/to/character.jpg")
pose_video = load_video("path/to/pose_video.mp4")
face_video = load_video("path/to/face_video.mp4")

def aspect_ratio_resize(image, pipe, max_area=720 * 1280):
    aspect_ratio = image.height / image.width
    mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
    height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    image = image.resize((width, height))
    return image, height, width

image, height, width = aspect_ratio_resize(image, pipe)

prompt = "A person dancing energetically in a studio"
negative_prompt = "blurry, low quality"

# Advanced: Use temporal guidance and custom callback
def callback_fn(pipe, step_index, timestep, callback_kwargs):
    # You can modify latents or other tensors here
    print(f"Step {step_index}, Timestep {timestep}")
    return callback_kwargs

output = pipe(
    image=image,
    pose_video=pose_video,
    face_video=face_video,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=81,
    num_inference_steps=50,
    guidance_scale=5.0,
    num_frames_for_temporal_guidance=5,  # Use 5 frames for temporal guidance (1 or 5 recommended)
    callback_on_step_end=callback_fn,
    callback_on_step_end_tensor_inputs=["latents"],
).frames[0]
export_to_video(output, "animated_advanced.mp4", fps=16)
```
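
The callback above only prints progress. As a variation, a callable object with the same `(pipe, step_index, timestep, callback_kwargs)` signature can keep state between steps. The sketch below is illustrative: `StepRecorder` is a hypothetical helper, not a Diffusers class, and the loop at the end stands in for the pipeline calling it.

```python
from typing import Any, Dict, List


class StepRecorder:
    """Collects (step_index, timestep) pairs; hypothetical helper, not a Diffusers class."""

    def __init__(self, log_every: int = 10):
        self.log_every = log_every
        self.history: List[tuple] = []

    def __call__(self, pipe: Any, step_index: int, timestep: Any, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
        self.history.append((step_index, timestep))
        if step_index % self.log_every == 0:
            print(f"Step {step_index}, timestep {timestep}, tensors: {sorted(callback_kwargs)}")
        # Return the kwargs unchanged so the pipeline continues with the same tensors.
        return callback_kwargs


recorder = StepRecorder(log_every=2)
# Stand-in calls; in practice pass `callback_on_step_end=recorder` to the pipeline.
for step, t in enumerate([999, 750, 500]):
    recorder(None, step, t, {"latents": None})
print(len(recorder.history))  # 3
```

Returning `callback_kwargs` unchanged matches `callback_fn` in the example; a callback that edits `latents` would return the modified dictionary instead.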

</hfoption>
</hfoptions>

#### Key Parameters

- **mode**: Choose between `"animation"` (default) or `"replacement"`.
- **num_frames_for_temporal_guidance**: Number of frames for temporal guidance (1 or 5 recommended). Using 5 provides better temporal consistency but requires more memory.
- **guidance_scale**: Controls how closely the output follows the text prompt. Higher values (5-7) produce results more aligned with the prompt.
- **num_frames**: Total number of frames to generate. `num_frames - 1` should be divisible by `vae_scale_factor_temporal` (default: 4), as with the value 81 used in the examples above.
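
The examples on this page pass `num_frames=81`, which has the form `4 * k + 1`. The sketch below shows how a requested frame count could be snapped to that form. It assumes the Animate pipeline follows the `num_frames - 1` divisibility convention of the other Wan pipelines, which is not confirmed here, and the helper names `snap_num_frames` and `latent_frames` are hypothetical.

```python
def snap_num_frames(num_frames: int, temporal_factor: int = 4) -> int:
    """Round down to the nearest value of the form temporal_factor * k + 1."""
    return max(num_frames - 1, 0) // temporal_factor * temporal_factor + 1


def latent_frames(num_frames: int, temporal_factor: int = 4) -> int:
    """Number of latent frames under the same assumption."""
    return (num_frames - 1) // temporal_factor + 1


print(snap_num_frames(81))  # 81, already 4 * 20 + 1
print(snap_num_frames(80))  # 77
print(latent_frames(81))    # 21
print(81 / 16)              # 5.0625 seconds of video at fps=16
```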


## Notes

- Wan2.1 supports LoRAs with [`~loaders.WanLoraLoaderMixin.load_lora_weights`].
@@ -281,10 +496,10 @@ The general rule of thumb to keep in mind when preparing inputs for the VACE pip

# use "steamboat willie style" to trigger the LoRA
prompt = """
steamboat willie style, golden era animation, The camera rushes from far to near in a low-angle shot,
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground.
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic
shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""

@@ -359,6 +574,12 @@ The general rule of thumb to keep in mind when preparing inputs for the VACE pip
- all
- __call__

## WanAnimatePipeline

[[autodoc]] WanAnimatePipeline
- all
- __call__

## WanPipelineOutput

[[autodoc]] pipelines.wan.pipeline_output.WanPipelineOutput