
Stable video diffusion #5889

Closed
2 tasks done
zhw-zhang opened this issue Nov 22, 2023 · 17 comments
Labels
stale Issues that haven't received updates

Comments

@zhw-zhang

Model/Pipeline/Scheduler description

Hello, yesterday Stability AI open-sourced their image-to-video model, Stable Video Diffusion. When will it be merged into Diffusers, and if possible, could Diffusers also provide training code for it?

Open source status

  • The model implementation is available.
  • The model weights are available (Only relevant if addition is not a scheduler).

Provide useful links for the implementation

ref: https://github.com/Stability-AI/generative-models/tree/main

@drhead
Contributor

drhead commented Nov 22, 2023

I would like to add that the temporally-aware VAE decoder should be made into an easily accessible option for AnimateDiff and related pipelines as part of this. It is fully compatible with SD1.x and 2.x based models, and from my testing on AnimateDiff it greatly enhances the outputs, although it may be more memory hungry.
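
For context, a minimal sketch of that swap using the AutoencoderKLTemporalDecoder class that later shipped in diffusers alongside SVD support; the model IDs and the manual decode loop below are illustrative, not drhead's exact setup:

import torch
from diffusers import AnimateDiffPipeline, AutoencoderKLTemporalDecoder, MotionAdapter

# a standard AnimateDiff pipeline (model IDs are illustrative)
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    motion_adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")

# SVD's temporally-aware VAE decoder; the encoder side is the regular SD VAE,
# so its latent space matches SD1.x/2.x models
temporal_vae = AutoencoderKLTemporalDecoder.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    subfolder="vae",
    torch_dtype=torch.float16,
).to("cuda")

num_frames = 16
latents = pipe(
    "a rocket lifting off", num_frames=num_frames, output_type="latent"
).frames  # (batch, channels, frames, height, width)

# flatten frames into the batch dim and decode with the temporal decoder,
# which needs num_frames to apply its temporal layers correctly
latents = latents.permute(0, 2, 1, 3, 4).flatten(0, 1)
with torch.no_grad():
    frames = temporal_vae.decode(
        latents / temporal_vae.config.scaling_factor, num_frames=num_frames
    ).sample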

@ShashwatNigam99

@drhead do you mean (text-to-image Stable Diffusion) + (AnimateDiff) + (the new temporally-aware VAE) works pretty well? Could you share more details about your setup?

@tin2tin

tin2tin commented Nov 22, 2023

This is the img2vid model the open-source community has been waiting and hoping for since Runway's and Pika's img2vid took off.
I've tested SVD and it is very capable: https://youtu.be/aEAy24d8F6E?si=27nOXdxaP29Ncjwn

However, it comes with a requirement of 40 GB of VRAM.
There are some optimization tips here: https://twitter.com/timudk/status/1727064128223855087?t=lLeTOO8JYxuEcEiQm7WCWA&s=34
And Camenduru has found that if the NSFW filter is removed, usage can be brought down to 13.1 GB. But there is still a long way to go before it runs on consumer cards with 6 GB of VRAM. Maybe LCM, pruning, half precision, and the heavy Würstchen compression can help bring it down in size?
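
As a concrete starting point for those optimizations, a minimal sketch assuming the StableVideoDiffusionPipeline that later landed in diffusers, combining fp16 weights, CPU offload, and chunked VAE decoding (the input file name is a placeholder):

import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()  # keep submodules on GPU only while they run

image = load_image("input.png").resize((1024, 576))
# decode_chunk_size trades speed for VRAM: fewer frames per VAE decode pass
frames = pipe(image, num_frames=25, decode_chunk_size=2).frames[0]
export_to_video(frames, "generated.mp4", fps=7)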

@charchit7
Contributor

Looks nice! @tin2tin

@tumurzakov

tumurzakov commented Nov 23, 2023

@drhead do you mean (text-to-image Stable Diffusion) + (AnimateDiff) + (the new temporally-aware VAE) works pretty well? Could you share more details about your setup?

decoder.yaml

target: sgm.modules.autoencoding.temporal_ae.VideoDecoder
params:
  attn_type: vanilla-xformers
  double_z: True
  z_channels: 4
  resolution: 256   
  in_channels: 3  
  out_ch: 3
  ch: 128 
  ch_mult: [1, 2, 4, 4]
  num_res_blocks: 2
  attn_resolutions: []
  dropout: 0.0 
  video_kernel_size: [3, 1, 1]

load

from sgm.util import instantiate_from_config
from omegaconf import OmegaConf
import torch
import gc
from safetensors.torch import load_file as load_safetensors

# `models` is the local checkpoint directory (user-specific path)
sd = load_safetensors(f'{models}/stable_video/svd_xt.safetensors')

# keep only the temporal VAE decoder weights from the SVD checkpoint
prefix = 'first_stage_model.decoder.'
weights = {}
for key in sd.keys():
    if key.startswith(prefix):
        weights[key.replace(prefix, '')] = sd[key]
del sd
gc.collect()
torch.cuda.empty_cache()

config = OmegaConf.load("../repos/GenerativeModels/scripts/sampling/configs/decoder.yaml")
decoder = instantiate_from_config(config)
m, u = decoder.load_state_dict(weights, strict=False)
print("missing:", len(m), "unexpected:", len(u))
decoder = decoder.eval().to('cuda', dtype=torch.float16)

infer

import math
import torch
from einops import rearrange
from tqdm import tqdm
import mediapy as media

# `pipe` is an AnimateDiff-style pipeline returning latents of shape (b, c, f, h, w)
latents = pipe(output_type='latents', **kwargs)

video_length = latents.shape[2]
z = rearrange(latents, 'b c f h w -> (b f) c h w')
n_samples = 12  # frames decoded per chunk; lower this to save VRAM
n_rounds = math.ceil(z.shape[0] / n_samples)
scale_factor = 0.18215  # SD latent scaling factor
z = 1.0 / scale_factor * z
all_out = []
with torch.autocast("cuda", dtype=torch.float16):
    for n in tqdm(range(n_rounds)):
        chunk = z[n * n_samples : (n + 1) * n_samples]
        # the temporal decoder needs the number of frames in the chunk
        out = decoder(chunk, timesteps=len(chunk))
        all_out.append(out)

out = torch.cat(all_out, dim=0)
out = rearrange(out, '(b f) c h w -> b c f h w', f=video_length)
out = (out / 2 + 0.5).clamp(0, 1)
out = out.detach().cpu().float()

# mediapy expects (frames, height, width, channels)
media.show_video(rearrange(out[0], 'c f h w -> f h w c').numpy(), fps=8)

This gives much clearer results than the ordinary VAE.

@painebenjamin
Contributor

Just sharing useful resources.

ComfyUI just released built-in support for SVD, with a minimum VRAM requirement of a mere 8 GB for 25 frames at 1024x576.

Commit adding support to infrastructure: comfyanonymous/ComfyUI@871cc20

Commit adding nodes: comfyanonymous/ComfyUI@42dfae6

@tin2tin

tin2tin commented Nov 24, 2023

#5895

@ghost

ghost commented Nov 30, 2023

Any chance of a training script?

@patrickvonplaten
Contributor

Maybe best to ask directly on https://github.com/Stability-AI/generative-models

@ghost

ghost commented Dec 3, 2023

This is something they don't have. Is it possible to put together something similar from the existing Diffusers library?

@patrickvonplaten
Contributor

Sure, we'd more than welcome such a training script if the community is interested in creating one.

@antonioo-c

@patrickvonplaten Hi Patrick, I wonder if the diffusers team will work on the training code for the Stable Video Diffusion pipeline? Thank you.

@patrickvonplaten
Contributor

We haven't planned anything yet, but we'd be more than happy to sponsor a community effort here.

@pixeli99

pixeli99 commented Dec 7, 2023

I'd be quite happy to implement training code, but what I'm unsure about is the usage of the new noise scheduler in SVD. I don't have much experience with this; does anyone have suggestions for resources I could refer to?

@ghost

ghost commented Dec 7, 2023

It's just the EDM formulation from the same paper that k-diffusion is based on (Karras et al., 2022).
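
For anyone implementing this, a minimal sketch of the EDM noise sampling, preconditioning, and loss weighting from that paper (Karras et al., 2022, "Elucidating the Design Space of Diffusion-Based Generative Models"); the P_mean, P_std, and sigma_data values below are the paper's image-model defaults, not necessarily what SVD was trained with:

import torch

sigma_data = 0.5
P_mean, P_std = -1.2, 1.2  # parameters of the log-normal sigma distribution

def sample_sigmas(batch_size, device):
    # sigma ~ exp(N(P_mean, P_std^2))
    return torch.exp(P_mean + P_std * torch.randn(batch_size, device=device))

def edm_loss(model, x0, sigmas):
    sigma = sigmas.view(-1, 1, 1, 1)
    x_noisy = x0 + sigma * torch.randn_like(x0)

    # EDM preconditioning (Table 1 of the paper)
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / (sigma**2 + sigma_data**2).sqrt()
    c_in = 1.0 / (sigma**2 + sigma_data**2).sqrt()
    c_noise = 0.25 * sigmas.log()

    # model is the raw network F_theta; `denoised` is the denoiser D(x; sigma)
    F = model(c_in * x_noisy, c_noise)
    denoised = c_skip * x_noisy + c_out * F

    # lambda(sigma) weighting equalizes the loss across noise levels
    weight = (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2
    return (weight * (denoised - x0) ** 2).mean()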

@pixeli99

https://github.com/pixeli99/SVD_Xtend

I hope this will be helpful to those looking to fine-tune SVD. Please be aware that this is a beginner's setup and there may be hidden errors, so use it with caution.


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot added the stale Issues that haven't received updates label Jan 15, 2024
This issue was closed.