        timesteps: list[int] | None = None,
        sigmas: list[float] | None = None,
        guidance_scale: float = 7.5,
        strength: float = 0.8,
        negative_prompt: str | list[str] | None = None,
        num_videos_per_prompt: int | None = 1,
        eta: float = 0.0,
        generator: torch.Generator | list[torch.Generator] | None = None,
        latents: torch.Tensor | None = None,
        prompt_embeds: torch.Tensor | None = None,
        negative_prompt_embeds: torch.Tensor | None = None,
        ip_adapter_image: PipelineImageInput | None = None,
        ip_adapter_image_embeds: list[torch.Tensor] | None = None,
        conditioning_frames: list[PipelineImageInput] | None = None,
        output_type: str | None = "pil",
        return_dict: bool = True,
        cross_attention_kwargs: dict[str, Any] | None = None,
        controlnet_conditioning_scale: float | list[float] = 1.0,
        guess_mode: bool = False,
        control_guidance_start: float | list[float] = 0.0,
        control_guidance_end: float | list[float] = 1.0,
        clip_skip: int | None = None,
        callback_on_step_end: Callable[[int, int], None] | None = None,
        callback_on_step_end_tensor_inputs: list[str] = ["latents"],
        decode_chunk_size: int = 16,
    ):
        r"""
        The call function to the pipeline for generation.

        Args:
            video (`list[PipelineImageInput]`):
                The input video to condition the generation on. Must be a list of images/frames of the video.
            prompt (`str` or `list[str]`, *optional*):
                The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`.
            height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                The height in pixels of the generated video.
            width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
                The width in pixels of the generated video.
            num_inference_steps (`int`, *optional*, defaults to 50):
                The number of denoising steps. More denoising steps usually lead to a higher quality video at the
                expense of slower inference.
            timesteps (`list[int]`, *optional*):
                Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument
                in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
                passed will be used. Must be in descending order.
            sigmas (`list[float]`, *optional*):
                Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in
                their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is
                passed will be used.
            strength (`float`, *optional*, defaults to 0.8):
                Higher strength leads to more differences between the original video and the generated video.
            guidance_scale (`float`, *optional*, defaults to 7.5):
                A higher guidance scale value encourages the model to generate images closely linked to the text
                `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`.
            negative_prompt (`str` or `list[str]`, *optional*):
                The prompt or prompts to guide what to not include in image generation. If not defined, you need to
                pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).
            eta (`float`, *optional*, defaults to 0.0):
                Corresponds to parameter eta (η) from the [DDIM](https://huggingface.co/papers/2010.02502) paper. Only
                applies to the [`~schedulers.DDIMScheduler`], and is ignored in other schedulers.
            generator (`torch.Generator` or `list[torch.Generator]`, *optional*):
                A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make
                generation deterministic.
            latents (`torch.Tensor`, *optional*):
                Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video
                generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
                tensor is generated by sampling using the supplied random `generator`. Latents should be of shape
                `(batch_size, num_channels, num_frames, height, width)`.
            prompt_embeds (`torch.Tensor`, *optional*):
                Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not
                provided, text embeddings are generated from the `prompt` input argument.
            negative_prompt_embeds (`torch.Tensor`, *optional*):
                Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If
                not provided, `negative_prompt_embeds` are generated from the `negative_prompt` input argument.
            ip_adapter_image (`PipelineImageInput`, *optional*):
                Optional image input to work with IP Adapters.
            ip_adapter_image_embeds (`list[torch.Tensor]`, *optional*):
                Pre-generated image embeddings for IP-Adapter. It should be a list of length same as number of
                IP-adapters. Each element should be a tensor of shape `(batch_size, num_images, emb_dim)`. It should
                contain the negative image embedding if `do_classifier_free_guidance` is set to `True`. If not
                provided, embeddings are computed from the `ip_adapter_image` input argument.
            conditioning_frames (`list[PipelineImageInput]`, *optional*):
                The ControlNet input condition to provide guidance to the `unet` for generation. If multiple
                ControlNets are specified, images must be passed as a list such that each element of the list can be
                correctly batched for input to a single ControlNet.
            output_type (`str`, *optional*, defaults to `"pil"`):
                The output format of the generated video. Choose between `torch.Tensor`, `PIL.Image` or `np.array`.
            return_dict (`bool`, *optional*, defaults to `True`):
                Whether or not to return an [`AnimateDiffPipelineOutput`] instead of a plain tuple.
            cross_attention_kwargs (`dict`, *optional*):
                A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in
                [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
            controlnet_conditioning_scale (`float` or `list[float]`, *optional*, defaults to 1.0):
                The outputs of the ControlNet are multiplied by `controlnet_conditioning_scale` before they are added
                to the residual in the original `unet`. If multiple ControlNets are specified in `init`, you can set
                the corresponding scale as a list.
            guess_mode (`bool`, *optional*, defaults to `False`):
                The ControlNet encoder tries to recognize the content of the input image even if you remove all
                prompts. A `guidance_scale` value between 3.0 and 5.0 is recommended.
            control_guidance_start (`float` or `list[float]`, *optional*, defaults to 0.0):
                The percentage of total steps at which the ControlNet starts applying.
            control_guidance_end (`float` or `list[float]`, *optional*, defaults to 1.0):
                The percentage of total steps at which the ControlNet stops applying.
            clip_skip (`int`, *optional*):
                Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that
                the output of the pre-final layer will be used for computing the prompt embeddings.
            callback_on_step_end (`Callable`, *optional*):
                A function that is called at the end of each denoising step during inference. The function is called
                with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int,
                callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by
                `callback_on_step_end_tensor_inputs`.
            callback_on_step_end_tensor_inputs (`list`, *optional*):
                The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list
                will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the
                `._callback_tensor_inputs` attribute of your pipeline class.
            decode_chunk_size (`int`, defaults to `16`):
                The number of frames to decode at a time when calling the `decode_latents` method.

        Examples:

        Returns:
            [`pipelines.animatediff.pipeline_output.AnimateDiffPipelineOutput`] or `tuple`:
                If `return_dict` is `True`, [`pipelines.animatediff.pipeline_output.AnimateDiffPipelineOutput`] is
                returned, otherwise a `tuple` is returned where the first element is a list with the generated frames.
        """

        controlnet = self.controlnet._orig_mod if is_compiled_module(self.controlnet) else self.controlnet

        # align format for control guidance
        if not isinstance(control_guidance_start, list) and isinstance(control_guidance_end, list):
            control_guidance_start = len(control_guidance_end) * [control_guidance_start]
        elif not isinstance(control_guidance_end, list) and isinstance(control_guidance_start, list):
            control_guidance_end = len(control_guidance_start) * [control_guidance_end]
        elif not isinstance(control_guidance_start, list) and not isinstance(control_guidance_end, list):
            mult = len(controlnet.nets) if isinstance(controlnet, MultiControlNetModel) else 1
            control_guidance_start, control_guidance_end = (
                mult * [control_guidance_start],
                mult * [control_guidance_end],
            )

        # 0. Default height and width to unet
        height = height or self.unet.config.sample_size * self.vae_scale_factor
        width = width or self.unet.config.sample_size * self.vae_scale_factor

        num_videos_per_prompt = 1
animatediff model/pipeline review
Commit tested: 0f1abc4ae8b0eb2a3b40e82a310507281144c423
Review performed against the repository review rules. Public exports/imports, config/load paths, dtype/device/offload behavior, model attention behavior, docs/examples, and fast/slow tests were checked. Reproductions were run with `.venv`.
Duplicate search: checked GitHub Issues and PRs for animatediff, the affected classes/files, `num_videos_per_prompt`, MultiControlNet validation, the stale `unet_motion_model` import, SparseControlNet tests, and slow-test coverage. No exact duplicates found. Related but not duplicates: #8664, #9326, #9508, #7378.
Issue 1: Video-to-video pipelines ignore `num_videos_per_prompt`
Affected code:
diffusers/src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video.py
Lines 752 to 864 in 0f1abc4
diffusers/src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video_controlnet.py
Lines 918 to 1062 in 0f1abc4
Problem:
Both public signatures expose `num_videos_per_prompt`, but `__call__` overwrites it with `1` before input validation, latent preparation, prompt expansion, and denoising. Users requesting multiple videos per prompt silently receive one video.
Impact:
Batch semantics are wrong and no test catches it. This also hides related latent-preparation gaps that need to duplicate/expand the input video latents for `num_videos_per_prompt > 1`.
Reproduction:
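The original reproduction snippet did not survive extraction; a minimal sketch of the idea follows. The checkpoint names and the input clip URL are illustrative assumptions, not part of the original report.

```python
import torch

from diffusers import AnimateDiffVideoToVideoPipeline, MotionAdapter
from diffusers.utils import load_video

# Illustrative checkpoints; any SD 1.5 base plus a v1.5 motion adapter shows the same behavior.
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
pipe = AnimateDiffVideoToVideoPipeline.from_pretrained(
    "emilianJR/epiCRealism", motion_adapter=adapter, torch_dtype=torch.float16
).to("cuda")

video = load_video(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-vid2vid-input-1.gif"
)
output = pipe(video=video, prompt="panda playing a guitar", strength=0.5, num_videos_per_prompt=2)

# Expected: 2 generated videos. Observed: 1, because __call__ resets num_videos_per_prompt to 1.
print(len(output.frames))
```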
Relevant precedent:
diffusers/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py
Lines 735 to 783 in 0f1abc4
Suggested fix:
Remove the hard reset in both pipelines and update `prepare_latents` to expand the input video batch the way img2img does; a minimal sketch of the expansion follows.
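This is a sketch only, assuming `prepare_latents` already computes `init_latents` from the encoded input video and receives the requested `batch_size` (as the img2img precedent does); it relies on the module's existing imports and is not the exact patch.

```python
# Inside prepare_latents, after encoding the input video to init_latents
# (mirrors pipeline_stable_diffusion_img2img.py):
if batch_size > init_latents.shape[0] and batch_size % init_latents.shape[0] == 0:
    # Duplicate the video latents for each additional generation requested per prompt.
    additional_videos_per_prompt = batch_size // init_latents.shape[0]
    init_latents = torch.cat([init_latents] * additional_videos_per_prompt, dim=0)
elif batch_size > init_latents.shape[0] and batch_size % init_latents.shape[0] != 0:
    raise ValueError(
        f"Cannot duplicate `video` of batch size {init_latents.shape[0]} to {batch_size} requested videos."
    )

# And in __call__, drop the line that forces:
# num_videos_per_prompt = 1
```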
Issue 2: AnimateDiff Multi-ControlNet validation can silently drop ControlNets
Affected code:
diffusers/src/diffusers/pipelines/animatediff/pipeline_animatediff_controlnet.py
Lines 562 to 592 in 0f1abc4
diffusers/src/diffusers/pipelines/animatediff/pipeline_animatediff_video2video_controlnet.py
Lines 686 to 719 in 0f1abc4
Problem:
For `MultiControlNetModel`, the conditioning-frame list length is not checked against `len(controlnet.nets)`. The scale length check is also unreachable because it is under `elif isinstance(controlnet_conditioning_scale, list)` after an `if isinstance(..., list)` branch.
Impact:
A user can pass two ControlNets but only one conditioning video or one scale. Validation succeeds, then `MultiControlNetModel.forward` zips the lists and silently skips the extra ControlNet.
Reproduction:
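The original reproduction block was lost; the following sketch illustrates the shape of the failure. The checkpoint names are placeholders and the blank PIL frames stand in for real pose/depth maps.

```python
import torch
from PIL import Image

from diffusers import AnimateDiffControlNetPipeline, ControlNetModel, MotionAdapter

# Placeholder checkpoints; the point is two ControlNets with only one conditioning sequence.
controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
]
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
pipe = AnimateDiffControlNetPipeline.from_pretrained(
    "emilianJR/epiCRealism", controlnet=controlnets, motion_adapter=adapter, torch_dtype=torch.float16
).to("cuda")

# Only one conditioning sequence for two ControlNets: check_inputs does not object,
# and the second ControlNet is silently skipped when the lists are zipped in forward.
frames = [Image.new("RGB", (512, 512)) for _ in range(16)]
output = pipe(prompt="a dancing robot", num_frames=16, conditioning_frames=[frames])
```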
Relevant precedent:
diffusers/src/diffusers/pipelines/controlnet_sd3/pipeline_stable_diffusion_3_controlnet_inpainting.py
Lines 732 to 752 in 0f1abc4
diffusers/src/diffusers/models/controlnets/multicontrolnet.py
Lines 47 to 48 in 0f1abc4
Suggested fix:
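The fix block itself did not survive extraction. Below is a sketch of the validation that could be added to `check_inputs` in both affected pipelines, using the SD3 ControlNet inpainting pipeline as the precedent; variable names follow the existing validation block and are otherwise assumptions.

```python
# Sketch for check_inputs:
if isinstance(controlnet, MultiControlNetModel):
    if not isinstance(conditioning_frames, list) or len(conditioning_frames) != len(controlnet.nets):
        raise ValueError(
            f"Expected one conditioning frame sequence per ControlNet: got "
            f"{len(conditioning_frames) if isinstance(conditioning_frames, list) else 1} "
            f"sequence(s) for {len(controlnet.nets)} ControlNets."
        )
    # Hoist the scale check out of the unreachable elif branch.
    if isinstance(controlnet_conditioning_scale, list) and len(controlnet_conditioning_scale) != len(
        controlnet.nets
    ):
        raise ValueError(
            "`controlnet_conditioning_scale` must have the same length as the number of ControlNets."
        )
```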
Issue 3: Community AnimateDiff image-to-video example imports a removed module path
Affected code:
diffusers/examples/community/pipeline_animatediff_img2video.py
Lines 29 to 34 in 0f1abc4
Problem:
The example imports `MotionAdapter` from `diffusers.models.unet_motion_model`, but that module path does not exist. The public import is available from `diffusers`.
Impact:
The community pipeline fails at import time before users can run it.
Reproduction:
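The reproduction is simply importing the module. A minimal sketch (the failing line mirrors what the community file does):

```python
# Equivalent to the import at the top of examples/community/pipeline_animatediff_img2video.py;
# fails on the tested commit because the module path no longer exists.
from diffusers.models.unet_motion_model import MotionAdapter
```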
Relevant precedent:
diffusers/examples/community/pipeline_animatediff_controlnet.py
Lines 26 to 29 in 0f1abc4
Suggested fix:
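A sketch of the one-line change, matching the import style of the community ControlNet pipeline cited above:

```python
# Before:
# from diffusers.models.unet_motion_model import MotionAdapter
# After:
from diffusers import MotionAdapter
```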
Issue 4: Missing slow coverage for most AnimateDiff variants and missing model tests for SparseControlNetModel
Affected code:
diffusers/tests/pipelines/animatediff/test_animatediff.py
Lines 560 to 562 in 0f1abc4
diffusers/tests/pipelines/animatediff/test_animatediff_controlnet.py
Lines 42 to 45 in 0f1abc4
diffusers/tests/pipelines/animatediff/test_animatediff_sparsectrl.py
Lines 41 to 44 in 0f1abc4
diffusers/tests/pipelines/animatediff/test_animatediff_sdxl.py
Lines 34 to 40 in 0f1abc4
diffusers/src/diffusers/models/controlnets/controlnet_sparsectrl.py
Lines 96 to 161 in 0f1abc4
Problem:
Only the base `AnimateDiffPipeline` has a slow test. There are no slow tests for ControlNet, SparseCtrl, SDXL, video-to-video, or video-to-video ControlNet. `SparseControlNetModel` also has no model-level test under `tests/models/controlnets`, so model save/load, config roundtrip, attention processor behavior, and gradient checkpointing are only indirectly covered by pipeline tests.
Impact:
Checkpoint-specific regressions and model serialization issues can ship without coverage. This is especially risky for SparseCtrl because the model is public and loadable independently from the pipeline.
Reproduction:
Relevant precedent:
diffusers/tests/models/unets/test_models_unet_motion.py
Lines 41 to 42 in 0f1abc4
diffusers/tests/pipelines/animatediff/test_animatediff.py
Lines 560 to 621 in 0f1abc4
Suggested fix:
Add slow smoke tests for the missing public pipelines using the documented small checkpoint paths where possible, and add `tests/models/controlnets/test_models_controlnet_sparsectrl.py` with the standard `ModelTesterMixin` coverage for forward shape, save/load, variant save/load, attention processors, and gradient checkpointing.
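As a sketch of the slow-test side only (not the exact tests to add), following the structure of the existing base-pipeline slow tests; the checkpoints, prompt, input clip, and assertion are placeholders.

```python
import unittest

import torch

from diffusers import AnimateDiffVideoToVideoPipeline, MotionAdapter
from diffusers.utils import load_video
from diffusers.utils.testing_utils import require_torch_gpu, slow


@slow
@require_torch_gpu
class AnimateDiffVideoToVideoPipelineSlowTests(unittest.TestCase):
    def test_video2video_smoke(self):
        # Placeholder checkpoints mirroring the documented AnimateDiff video-to-video example.
        adapter = MotionAdapter.from_pretrained(
            "guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16
        )
        pipe = AnimateDiffVideoToVideoPipeline.from_pretrained(
            "emilianJR/epiCRealism", motion_adapter=adapter, torch_dtype=torch.float16
        ).to("cuda")

        video = load_video(
            "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-vid2vid-input-1.gif"
        )
        frames = pipe(
            video=video, prompt="panda playing a guitar", strength=0.5, num_inference_steps=4
        ).frames[0]

        # Smoke-level assertion only; a real test would also pin a numeric slice.
        assert len(frames) == len(video)
```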