Stable video diffusion #5889
Comments
I would like to add that the temporally-aware VAE decoder should be made into an easily accessible option for AnimateDiff and related pipelines as part of this. It is fully compatible with SD1.x and 2.x based models, and from my testing on AnimateDiff it greatly enhances the outputs, although it may be more memory hungry. |
@drhead do you mean that text-to-image Stable Diffusion + AnimateDiff + the new temporally aware VAE works pretty well? Could you share more details about your setup? |
This is the img2vid model the open source community has been waiting and hoping for, since Runway and Pika img2vid took off. However, it comes with a requirement of 40 GB VRAM. |
Looks nice! @tin2tin |
decoder.yaml

target: sgm.modules.autoencoding.temporal_ae.VideoDecoder
params:
  attn_type: vanilla-xformers
  double_z: True
  z_channels: 4
  resolution: 256
  in_channels: 3
  out_ch: 3
  ch: 128
  ch_mult: [1, 2, 4, 4]
  num_res_blocks: 2
  attn_resolutions: []
  dropout: 0.0
  video_kernel_size: [3, 1, 1]

load

from sgm.util import instantiate_from_config, load_model_from_config
from omegaconf import OmegaConf
import torch
import gc
from safetensors.torch import load_file as load_safetensors
sd = load_safetensors(f'{models}/stable_video/svd_xt.safetensors')
prefix = 'first_stage_model.decoder.'
weights = {}
for key in sd.keys():
    # Keep only the decoder weights and strip the checkpoint prefix
    if key.startswith(prefix):
        weights[key[len(prefix):]] = sd[key]
del sd
gc.collect()
torch.cuda.empty_cache()
config = OmegaConf.load("../repos/GenerativeModels/scripts/sampling/configs/decoder.yaml")
decoder = instantiate_from_config(config)
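# load_state_dict returns (missing_keys, unexpected_keys)
missing, unexpected = decoder.load_state_dict(weights, strict=False)
print("missing:", len(missing), "unexpected:", len(unexpected))
decoder = decoder.eval().to('cuda', dtype=torch.float16)

infer

import math
from einops import rearrange
from tqdm import tqdm
import mediapy as media  # assumption: `media` below refers to mediapy
latents = pipe(output_type='latents', **kwargs)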
video_length = latents.shape[2]
z = rearrange(latents, 'b c f h w -> (b f) c h w')
n_samples = 12
n_rounds = math.ceil(z.shape[0] / n_samples)
scale_factor = 0.18215
z = 1.0 / scale_factor * z
all_out = []
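# Decode in chunks of n_samples frames; no_grad avoids keeping activations for backprop
with torch.no_grad(), torch.autocast("cuda", dtype=torch.float16):
    for n in tqdm(range(n_rounds)):
        chunk = z[n * n_samples : (n + 1) * n_samples]
        # The temporal decoder needs the number of frames in this chunk
        timesteps = len(chunk)
        out = decoder(chunk, timesteps=timesteps)
        all_out.append(out)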
out = torch.cat(all_out, dim=0)
out = rearrange(out, '(b f) c h w -> b c f h w', f=video_length)
out = (out / 2 + 0.5).clamp(0, 1)
out = out.detach().cpu().float()
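# mediapy expects frames shaped (f, h, w, c); this shows the first video in the batch
media.show_video(rearrange(out[0], 'c f h w -> f h w c').numpy(), fps=8)

This gives much clearer results than the ordinary VAE. |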
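A note on the chunked decoding loop above: because the frames are decoded `n_samples` at a time, the last chunk is usually shorter, which is why `timesteps` is recomputed per chunk. A minimal sketch of just that index arithmetic, in plain Python; the helper name `chunk_bounds` is hypothetical and not part of the snippet above or of any library:

```python
import math


def chunk_bounds(total, chunk_size):
    """Return (start, stop) pairs that cover range(total) in chunks of chunk_size.

    Hypothetical helper mirroring the slicing z[n * n_samples : (n + 1) * n_samples].
    """
    n_rounds = math.ceil(total / chunk_size)
    return [
        (n * chunk_size, min((n + 1) * chunk_size, total))
        for n in range(n_rounds)
    ]


# 16 frames decoded 12 at a time: one full chunk and one short chunk
bounds = chunk_bounds(16, 12)
sizes = [stop - start for start, stop in bounds]
print(bounds)  # [(0, 12), (12, 16)]
print(sizes)   # [12, 4]
```

Since each chunk is handed to the decoder as its own group of frames, it may be worth picking `n_samples` so that a chunk never spans two different videos when the batch holds more than one; that is an assumption about how the temporal layers group frames, not something verified here.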
Just sharing useful resources. Comfy released their built-in support for SVD just now, with a minimum VRAM requirement of a mere 8GB for 25 frames at 1024x576. Commit adding support to infrastructure: comfyanonymous/ComfyUI@871cc20 Commit adding nodes: comfyanonymous/ComfyUI@42dfae6 |
any chance for the training script? |
Maybe best to ask directly on https://github.com/Stability-AI/generative-models |
That is something they don't have. Is it possible to put together something similar from the existing diffusers library? |
Sure we'd more than welcome such a training script if the community is interested in creating one |
@patrickvonplaten Hi Patrick, I wonder if the diffusers team will work on the training code for Stable Video Diffusion pipeline? Thank you. |
We haven't planned anything yet, but we'd be more than happy to sponsor a community effort here |
I'm quite happy to implement the training code, but I'm unsure about the new noise scheduler used in SVD. I don't have much experience with it. Does anyone have suggestions for resources I could refer to? |
It's just from the same paper as k-diffusion. |
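For readers looking for a starting point: k-diffusion implements Karras et al. (2022), "Elucidating the Design Space of Diffusion-Based Generative Models". Below is a minimal sketch of that paper's noise-level (sigma) schedule in plain Python. The function name `karras_sigmas` is hypothetical, and the default values are the paper's illustrative defaults, not SVD's actual settings, which should be checked against the Stability-AI repository.

```python
def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels from sigma_max down to sigma_min (requires n >= 2).

    Implements sigma_i = (max^(1/rho) + i/(n-1) * (min^(1/rho) - max^(1/rho)))^rho.
    Defaults are the paper's example values, assumed here for illustration only.
    """
    min_inv_rho = sigma_min ** (1.0 / rho)
    max_inv_rho = sigma_max ** (1.0 / rho)
    return [
        (max_inv_rho + i / (n - 1) * (min_inv_rho - max_inv_rho)) ** rho
        for i in range(n)
    ]


sigmas = karras_sigmas(10)
# The schedule starts at sigma_max and decreases monotonically to sigma_min
print(round(sigmas[0], 3), round(sigmas[-1], 3))
```

A larger `rho` concentrates more of the steps at low noise levels, which is the main design knob of this schedule.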
https://github.com/pixeli99/SVD_Xtend I hope this will be helpful to those looking to fine-tune SVD. Please be aware that this is a setup from a beginner and there may be some hidden errors, so use it selectively. |
Model/Pipeline/Scheduler description
Hello, yesterday Stability AI open-sourced its image-to-video model, Stable Video Diffusion. When will it be merged into Diffusers? If possible, could Diffusers also provide the training code?
Open source status
Provide useful links for the implementation
ref: https://github.com/Stability-AI/generative-models/tree/main