
[Core] AnimateDiff: Long context video generation #8275

Open · wants to merge 2 commits into main

Conversation

a-r-r-o-w (Member) commented May 26, 2024

What does this PR do?

Support for long-context and infinite-length video generation has been available for a long time in UIs and custom implementations. This is an attempt at adding the same to diffusers.

Partially fixes #6521 and a few other open discussions.

Code
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter, DDIMScheduler
from diffusers.pipelines.animatediff.context_utils import ContextScheduler
from diffusers.utils import export_to_gif
from PIL import Image

model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
scheduler = DDIMScheduler.from_pretrained(
    model_id, beta_schedule="linear", subfolder="scheduler", clip_sample=False, timestep_spacing="linspace", steps_offset=1
)
pipe = AnimateDiffPipeline.from_pretrained(
    model_id,
    motion_adapter=adapter,
    scheduler=scheduler,
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a robot walking dominantly standing over a destroyed planet, apocalyptic, surreal, high quality"
negative_prompt = "low quality, worst quality, unrealistic"

video = pipe(
    prompt=prompt, # or provide multiple in a list
    negative_prompt=negative_prompt,
    height=512,
    width=512,
    num_frames=32,
    guidance_scale=7,
    num_inference_steps=20,
    generator=torch.Generator().manual_seed(42),
    decode_batch_size=8,
    # length must not exceed the motion module's maximum sequence length (usually 32; the default of 16 works well)
    context_scheduler=ContextScheduler(length=16, stride=3, overlap=4, loop=True, type="uniform_constant"),
    clip_skip=2,
)

# If you want normal processing, without any of the new additions affecting AnimateDiff, don't pass
# anything for context_scheduler. The pipeline retains the old behaviour and you will be limited to a
# maximum of 24/32 frames depending on the motion module.
video = pipe(
    prompt=prompt, # or provide multiple in a list
    negative_prompt=negative_prompt,
    height=512,
    width=512,
    num_frames=16,
    guidance_scale=7,
    num_inference_steps=20,
    generator=torch.Generator().manual_seed(42),
    decode_batch_size=8, 
    clip_skip=2,
)

frames = video.frames[0]
export_to_gif(frames, "video.gif", fps=8)
[Result GIF: 32 frames, single prompt]
[Result GIF: 64 frames, two prompts (does not work too well yet)]

There are a few problems with the implementation at the moment and, I believe, with the original reference repositories as well:

  • You cannot set loop=False for all configurations. When certain frame indices are never processed, total_counts ends up with zeroed entries, which causes the latents to be filled with NaNs.
  • ordered_halving has some kind of sorcery going on. As far as I can tell, it acts as a pseudo-random number generator, so we could replace it with something more understandable (see my commented code, and the sketch after this list, for an example).
  • It does not seem to interpolate well between prompts or actually depict the "prompt travel" yet. This looks like a bug on my end.
  • I'm not sure why the original implementations use stride lengths of 8 or higher (context_stride >= 4); at that distance the frames are too far apart to benefit from temporal averaging, and I don't see much difference in the results. It just seems like unnecessary extra iterations.
  • I'm not too sure about the code design. Should the context scheduler be exposed through a callback implementation?
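
For reference, here is a minimal sketch of what ordered_halving appears to compute in the reference repositories: a base-2 bit-reversal (van der Corput) mapping from the denoising step index to a fraction in [0, 1), which is then used to offset the context windows. This is an illustration of the idea, not the exact code in this PR:

# Sketch (not the PR code): ordered_halving as a bit-reversal / van der Corput sequence.
def ordered_halving(step: int) -> float:
    bits = f"{step:064b}"                      # 64-bit binary representation of the step
    reversed_bits = bits[::-1]                 # reverse the bit order
    return int(reversed_bits, 2) / (1 << 64)   # map back to a fraction in [0, 1)

# The first few values: 0 -> 0.0, 1 -> 0.5, 2 -> 0.25, 3 -> 0.75, 4 -> 0.125, ...
for step in range(5):
    print(step, ordered_halving(step))

Because the output is a deterministic low-discrepancy sequence rather than true randomness, any simple, well-documented offset scheme should be able to replace it.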

Additionally, I'm noticing much better results when applying this to vid2vid/controlnet, but I will push those changes once we finalize the design and fix any remaining bugs.

Other methods to potentially look into in the future:

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@DN6 @sayakpaul @yiyixuxu

@a-r-r-o-w a-r-r-o-w changed the title [AnimateDiff] Long context video generation [Core] AnimateDiff: Long context video generation May 26, 2024
DN6 (Collaborator) commented May 27, 2024

Given the issues with the context scheduler (not having a good understanding of ordered_halving is an issue IMO, and the scheduler code is quite confusing to read), perhaps we look into FIFO as our first option for longer-context video?

a-r-r-o-w (Member, Author) commented May 27, 2024

Given the issues with the context scheduler (not having a good understanding of ordered_halving is an issue IMO, and the scheduler code is quite confusing to read), perhaps we look into FIFO as our first option for longer-context video?

I'm quite confident its role is similar to that of a PRNG. The context scheduler is essentially just trying to iterate through different batches of 16 frames (or whatever is set as the length) starting at different offset positions. Let's take a small example of what would yield a good video of 16 frames if we could only process 8 frames at a time using the motion module:

num_frames = 16
length = 8
overlap = 4
stride = 2 (strides = [2 ** 0, 2 ** 1] => [1, 2])

for current stride = 1,
[0, 1, 2, 3, 4, 5, 6, 7]
[4, 5, 6, 7, 8, 9, 10, 11]
[8, 9, 10, 11, 12, 13, 14, 15]
[12, 13, 14, 15, 0, 1, 2, 3] # in case we want a looping effect
# With the above overlaps we get enough smoothing that the transitions don't look too abrupt;
# processing the larger strides on top of this reduces the abruptness even further.

for current stride = 2,
[0, 2, 4, 6, 8, 10, 12, 14] # achieves temporal smoothing; it doesn't have to be exactly this window, [1, 3, 5, 7, 9, 11, 13, 15] would work as well

From my testing, I think it suffices to use a simple schedule that:

  • processes 16/24/32 frames at a time (based on what the motion module supports)
  • processes in strides of 1 and 2 (and maybe 4 as well)
  • considers overlap as above

because it gives decent results as well (a rough sketch follows).
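
To make this concrete, here is a hypothetical sketch of such a simple uniform context generator; the function name and defaults are illustrative only, and the ContextScheduler in this PR is more involved:

from typing import Iterator, List

def simple_context_windows(
    num_frames: int,
    length: int = 16,
    overlap: int = 4,
    max_stride: int = 2,
) -> Iterator[List[int]]:
    # Yield overlapping windows of frame indices at strides 1, 2, 4, ... up to max_stride.
    # Windows wrap around to the first frames (the loop=True case); handling loop=False
    # without leaving some frames unprocessed is exactly where the total_counts issue arises.
    stride = 1
    while stride <= max_stride and stride * length <= num_frames:
        step = (length - overlap) * stride
        for start in range(0, num_frames, step):
            yield [(start + stride * i) % num_frames for i in range(length)]
        stride *= 2

# Reproduces the 16-frame example above (length=8, overlap=4, strides 1 and 2)
for window in simple_context_windows(16, length=8, overlap=4, max_stride=2):
    print(window)

Each window would be denoised separately, and the overlapping noise predictions averaged, which is presumably where total_counts comes in.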

FIFODiffusion currently supports VideoCrafter, Open Sora and Zeroscope. We should definitely try integrating the VideoCrafter family of models, given how much research builds on it and on the methods around it. But, given the success of AnimateDiff in the community, I think having existing methods integrated for it is worth looking at too. Maybe FreeNoise would be a better first candidate to take up.

a-r-r-o-w (Member, Author) commented
I just finished reading the FIFODiffusion paper and have a reasonably good understanding of the implementation now; the proposed inference method is agnostic to the underlying video model. I'll hopefully have a PR with AnimateDiff by the weekend 🤞

a-r-r-o-w (Member, Author) commented Jun 2, 2024

@DN6 @sayakpaul I'm working on adding support for FIFODiffusion here: https://github.com/a-r-r-o-w/diffusers/tree/animatediff/fifodiffusion. It is not yet at a working stage.

The idea is quite straightforward to implement, but I'm facing difficulties with the internal modeling code and am not sure it supports what I'm trying to do. Is there a way to pass a list of timesteps to the unet instead of a single value, i.e. to use a different timestep value per frame of the video? A forward pass per frame, each with its own timestep, would work, but then the motion module won't, because it requires that all frames be passed together. I tried modifying it to my needs and failed; if it doesn't break in one place, it breaks elsewhere. For the scheduler, I'd also like to use a per-frame timestep (a list of timestep values) instead of a single value, but that is more easily doable with a loop and no modifications (see the sketch below).
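
For the scheduler side, here is a minimal sketch of the loop I mean, assuming a DDIM-like scheduler and latents of shape (B, C, F, H, W); the helper name is hypothetical:

import torch

def step_per_frame(scheduler, noise_pred, latents, frame_timesteps):
    # Step each frame's latent with its own timestep by looping over the frame axis.
    # noise_pred, latents: (B, C, F, H, W); frame_timesteps: sequence of length F.
    new_latents = torch.empty_like(latents)
    for f, t in enumerate(frame_timesteps):
        new_latents[:, :, f] = scheduler.step(
            noise_pred[:, :, f], t, latents[:, :, f]
        ).prev_sample
    return new_latents

The UNet side is the part without a clean solution yet, since a single timestep value is currently shared across all frames of a forward pass.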

As an alternative implementation idea, I also think it could work by denoising the first 16 frames completely, then using their latent predictions at timestep 15, 14, 13, ... for the next 16 frames, and so on. You would have to maintain a memory per denoising step per frame, updated every iteration, which seems tricky to implement; it adds an extra buffer of shape (B, C, motion_seq_length, num_inference_steps, H, W) and could quite easily run into OOM. I'm assuming the motion_seq_length generally used is 16 in the above example. Also, num_inference_steps would need to be n * motion_seq_length, where n is the number of latent partitions (set to 4 in the paper), because otherwise I'm not sure how you would interpolate the timesteps for the queue size to match a user-preferred num_inference_steps.
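
As a rough illustration of that queue/timestep coupling (hypothetical numbers, matching the n = 4 example above):

n_partitions = 4
motion_seq_length = 16
num_inference_steps = n_partitions * motion_seq_length  # 64

# One queue slot per inference step: slot 0 holds the frame that is almost fully
# denoised, the last slot holds the noisiest frame. The motion module processes
# one partition of 16 consecutive slots at a time.
queue_slots = list(range(num_inference_steps))
partitions = [
    queue_slots[i * motion_seq_length : (i + 1) * motion_seq_length]
    for i in range(n_partitions)
]
print(partitions[0])   # [0, 1, ..., 15]
print(partitions[-1])  # [48, 49, ..., 63]

If num_inference_steps were not a multiple of motion_seq_length, the slots would not split evenly into partitions, which is the interpolation problem mentioned above.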

Edit: My bad, I thought I was commenting on the FIFODiffusion issue by clarence but replied here instead :(

a-r-r-o-w (Member, Author) commented
cc @jjihwan Would love to hear your thoughts and whether what I said above looks correct.

jjihwan commented Jun 2, 2024

@a-r-r-o-w
Thank you for your effort.
I already have the source code for zeroscope, which is implemented based on diffusers.
If you need it for reference, please send me an email: kjh26720@snu.ac.kr
