
Stable video diffusion #5889

Closed
2 tasks done
zhw-zhang opened this issue Nov 22, 2023 · 17 comments
Labels
stale Issues that haven't received updates

Comments

@zhw-zhang

Model/Pipeline/Scheduler description

Hello, yesterday Stability AI open-sourced their image-to-video model, Stable Video Diffusion. When will it be merged into Diffusers, and if possible, could Diffusers also provide training code for it?

Open source status

  • The model implementation is available.
  • The model weights are available (Only relevant if addition is not a scheduler).

Provide useful links for the implementation

ref: https://github.com/Stability-AI/generative-models/tree/main

@drhead
Contributor

drhead commented Nov 22, 2023

I would like to add that the temporally-aware VAE decoder should be made into an easily accessible option for AnimateDiff and related pipelines as part of this. It is fully compatible with SD1.x and 2.x based models, and from my testing on AnimateDiff it greatly enhances the outputs, although it may be more memory hungry.
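
For context, a minimal sketch of that swap using the AutoencoderKLTemporalDecoder class that later shipped in diffusers alongside SVD support; the model IDs and the manual decode loop below are illustrative, not drhead's exact setup:

import torch
from diffusers import AnimateDiffPipeline, AutoencoderKLTemporalDecoder, MotionAdapter

# a standard AnimateDiff pipeline (model IDs are illustrative)
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    motion_adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")

# SVD's temporally-aware VAE decoder; the encoder side is the regular SD VAE,
# so its latent space matches SD1.x/2.x models
temporal_vae = AutoencoderKLTemporalDecoder.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    subfolder="vae",
    torch_dtype=torch.float16,
).to("cuda")

num_frames = 16
latents = pipe(
    "a rocket lifting off", num_frames=num_frames, output_type="latent"
).frames  # (batch, channels, frames, height, width)

# flatten frames into the batch dim and decode with the temporal decoder,
# which needs num_frames to apply its temporal layers correctly
latents = latents.permute(0, 2, 1, 3, 4).flatten(0, 1)
with torch.no_grad():
    frames = temporal_vae.decode(
        latents / temporal_vae.config.scaling_factor, num_frames=num_frames
    ).sample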

@ShashwatNigam99

@drhead do you mean (text-to-image Stable Diffusion) + (AnimateDiff) + (the new temporally-aware VAE) works pretty well? Could you share more details about your setup?

@tin2tin

tin2tin commented Nov 22, 2023

This is the img2vid model the open-source community has been waiting and hoping for since Runway's and Pika's img2vid took off.
I've tested SVD and it is very capable: https://youtu.be/aEAy24d8F6E?si=27nOXdxaP29Ncjwn

However, it comes with a requirement of 40 GB of VRAM.
There are some optimization tips here: https://twitter.com/timudk/status/1727064128223855087?t=lLeTOO8JYxuEcEiQm7WCWA&s=34
And Camenduru has found that if the NSFW filter is removed, usage can be brought down to 13.1 GB. But there is still a long way to go before it runs on consumer cards with 6 GB of VRAM. Maybe LCM, pruning, half precision, and the heavy Würstchen compression can help bring it down in size?
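
As a concrete starting point for those optimizations, a minimal sketch assuming the StableVideoDiffusionPipeline that later landed in diffusers, combining fp16 weights, CPU offload, and chunked VAE decoding (the input file name is a placeholder):

import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # keep submodules on GPU only while they run

image = load_image("input.png").resize((1024, 576))
# decode_chunk_size trades speed for VRAM: fewer frames per VAE decode pass
frames = pipe(image, num_frames=25, decode_chunk_size=2).frames[0]
export_to_video(frames, "generated.mp4", fps=7)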

@charchit7
Contributor

Looks nice! @tin2tin

@tumurzakov

tumurzakov commented Nov 23, 2023

@drhead do you mean (text-to-image Stable Diffusion) + (AnimateDiff) + (the new temporally-aware VAE) works pretty well? Could you share more details about your setup?

decoder.yaml

target: sgm.modules.autoencoding.temporal_ae.VideoDecoder
params:
  attn_type: vanilla-xformers
  double_z: True
  z_channels: 4
  resolution: 256   
  in_channels: 3  
  out_ch: 3
  ch: 128 
  ch_mult: [1, 2, 4, 4]
  num_res_blocks: 2
  attn_resolutions: []
  dropout: 0.0 
  video_kernel_size: [3, 1, 1]

load

from sgm.util import instantiate_from_config
from omegaconf import OmegaConf
import torch
import gc
from safetensors.torch import load_file as load_safetensors

# `models` is the local checkpoint directory (user-specific path)
sd = load_safetensors(f'{models}/stable_video/svd_xt.safetensors')

# keep only the temporal VAE decoder weights from the SVD checkpoint
prefix = 'first_stage_model.decoder.'
weights = {}
for key in sd.keys():
    if key.startswith(prefix):
        weights[key.replace(prefix, '')] = sd[key]
del sd
gc.collect()
torch.cuda.empty_cache()

config = OmegaConf.load("../repos/GenerativeModels/scripts/sampling/configs/decoder.yaml")
decoder = instantiate_from_config(config)
m, u = decoder.load_state_dict(weights, strict=False)
print("missing:", len(m), "unexpected:", len(u))
decoder = decoder.eval().to('cuda', dtype=torch.float16)

infer

import math
import torch
from einops import rearrange
from tqdm import tqdm
import mediapy as media

# `pipe` is an AnimateDiff-style pipeline returning latents of shape (b, c, f, h, w)
latents = pipe(output_type='latents', **kwargs)

video_length = latents.shape[2]
z = rearrange(latents, 'b c f h w -> (b f) c h w')
n_samples = 12  # frames decoded per chunk; lower this to save VRAM
n_rounds = math.ceil(z.shape[0] / n_samples)
scale_factor = 0.18215  # SD latent scaling factor
z = 1.0 / scale_factor * z
all_out = []
with torch.autocast("cuda", dtype=torch.float16):
    for n in tqdm(range(n_rounds)):
        chunk = z[n * n_samples : (n + 1) * n_samples]
        # the temporal decoder needs the number of frames in the chunk
        out = decoder(chunk, timesteps=len(chunk))
        all_out.append(out)

out = torch.cat(all_out, dim=0)
out = rearrange(out, '(b f) c h w -> b c f h w', f=video_length)
out = (out / 2 + 0.5).clamp(0, 1)
out = out.detach().cpu().float()

# mediapy expects (frames, height, width, channels)
media.show_video(rearrange(out[0], 'c f h w -> f h w c').numpy(), fps=8)

This gives much clearer results than the ordinary VAE.

@painebenjamin
Contributor

Just sharing useful resources.

ComfyUI just released built-in support for SVD, with a minimum VRAM requirement of a mere 8 GB for 25 frames at 1024x576.

Commit adding support to infrastructure: comfyanonymous/ComfyUI@871cc20

Commit adding nodes: comfyanonymous/ComfyUI@42dfae6

@tin2tin

tin2tin commented Nov 24, 2023

#5895

@ghost

ghost commented Nov 30, 2023

Any chance of a training script?

@patrickvonplaten
Contributor

Maybe best to ask directly on https://github.com/Stability-AI/generative-models

@ghost

ghost commented Dec 3, 2023

This is something they don't have. Is it possible to put together something similar from the existing Diffusers library?

@patrickvonplaten
Contributor

Sure, we'd more than welcome such a training script if the community is interested in creating one.

@antonioo-c

@patrickvonplaten Hi Patrick, I wonder if the diffusers team will work on the training code for the Stable Video Diffusion pipeline? Thank you.

@patrickvonplaten
Contributor

We haven't planned anything yet, but we'd be more than happy to sponsor a community effort here.

@pixeli99

pixeli99 commented Dec 7, 2023

I'd be quite happy to implement training code, but what I'm unsure about is the usage of the new noise scheduler in SVD. I don't have much experience with this; does anyone have suggestions for resources I could refer to?

@ghost

ghost commented Dec 7, 2023

It's just the EDM formulation from the same paper that k-diffusion is based on (Karras et al., 2022).
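
For anyone implementing this, a minimal sketch of the EDM noise sampling, preconditioning, and loss weighting from that paper (Karras et al., 2022, "Elucidating the Design Space of Diffusion-Based Generative Models"); the P_mean, P_std, and sigma_data values below are the paper's image-model defaults, not necessarily what SVD was trained with:

import torch

sigma_data = 0.5
P_mean, P_std = -1.2, 1.2  # parameters of the log-normal sigma distribution

def sample_sigmas(batch_size, device):
    # sigma ~ exp(N(P_mean, P_std^2))
    return torch.exp(P_mean + P_std * torch.randn(batch_size, device=device))

def edm_loss(model, x0, sigmas):
    sigma = sigmas.view(-1, 1, 1, 1)
    x_noisy = x0 + sigma * torch.randn_like(x0)

    # EDM preconditioning (Table 1 of the paper)
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / (sigma**2 + sigma_data**2).sqrt()
    c_in = 1.0 / (sigma**2 + sigma_data**2).sqrt()
    c_noise = 0.25 * sigmas.log()

    # model is the raw network F_theta; `denoised` is the denoiser D(x; sigma)
    F = model(c_in * x_noisy, c_noise)
    denoised = c_skip * x_noisy + c_out * F

    # lambda(sigma) weighting equalizes the loss across noise levels
    weight = (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2
    return (weight * (denoised - x0) ** 2).mean()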

@pixeli99

https://github.com/pixeli99/SVD_Xtend

I hope this will be helpful to those looking to fine-tune SVD. Please be aware that this is a beginner's setup and there may be hidden errors, so use it with caution.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Jan 15, 2024
This issue was closed.