
Is it possible to run text2video-zero with less than 15GB GPU? #2898

Closed
camenduru opened this issue Mar 30, 2023 · 28 comments
Labels
bug Something isn't working

Comments

@camenduru
Contributor

Describe the bug

Hi 👋 everybody, I have a question: is it possible to use xformers with text2video-zero?

When I add self.pipe.enable_xformers_memory_efficient_attention(), I get a temporal consistency problem.

[side-by-side images: with xformers vs. without xformers (the latter needs more than 20GB VRAM)]

The code is here: https://gitlab.com/camenduru/text2-video-zero/-/blob/dev/model.py

Reproduction

🧬

Logs

No response

System Info

🥔

@camenduru camenduru added the bug Something isn't working label Mar 30, 2023
@patrickvonplaten
Contributor

That's very interesting! Did you train text2video-zero? Or does it just use pretrained checkpoints?

The other possibility to reduce memory would be to make use of attention slicing: https://huggingface.co/docs/diffusers/api/diffusion_pipeline#diffusers.DiffusionPipeline.enable_attention_slicing
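For reference, here is a minimal sketch of the two memory-saving switches discussed in this thread, on a plain diffusers pipeline (this is not the demo's actual model.py; the model id and setup are purely illustrative):

```python
# Minimal sketch, not the demo's model.py; model id is illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Attention slicing: compute attention in chunks, trading a bit of speed
# for a much lower peak memory.
pipe.enable_attention_slicing()

# xFormers memory-efficient attention; this is the call that triggers the
# consistency problem reported above, because it swaps out the attention processors.
# pipe.enable_xformers_memory_efficient_attention()
```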

@camenduru
Contributor Author

Oops, maybe this is not text2video-zero itself, but it is part of the Pose Conditional demo: https://huggingface.co/spaces/PAIR/Text2Video-Zero

Using only self.pipe.enable_attention_slicing() also doesn't work 😭

[screenshot of the inconsistent result]

model: https://huggingface.co/plasmo/woolitize-768sd1-5/tree/main
controlnet model: https://huggingface.co/lllyasviel/sd-controlnet-openpose/tree/main

this is the colab you can test https://github.com/camenduru/text2video-zero-colab

How can I reduce the size of the model?

https://huggingface.co/webui/ControlNet-modules-safetensors/tree/main

https://huggingface.co/lllyasviel/sd-controlnet-openpose/tree/main

@camenduru
Contributor Author

Hi @sayakpaul 👋 perhaps the issue is in the diffusers core. Can you explain why this is happening, or whether it's expected behavior?

@sayakpaul
Member

[comparison videos: with xformers vs. without xformers]

Does this exist with the latest TextToVideoZeroPipeline?

The above was generated with the latest TextToVideoZeroPipeline. Here's my Colab.

Cc: @19and99

@camenduru
Contributor Author

Yep, same problem. Maybe switch the with-xformers and without-xformers videos?

@sayakpaul
Member

sayakpaul commented Apr 11, 2023

I didn't get you.

#2898 (comment) already has with and without xformers.

@camenduru
Contributor Author

The code uses ControlNet pose for consistency between frames: https://gitlab.com/camenduru/text2-video-zero/-/blob/dev/model.py#L150


@sayakpaul
Member

Could you try it out from here?

@camenduru
Contributor Author

woohoo 🥳 now it is working with enable_attention_slicing() thanks ❤


@camenduru camenduru reopened this Apr 12, 2023
@camenduru
Contributor Author

Unfortunately, this has not been solved. It just looks like it has been solved 😭 There is still a temporal consistency problem, even with the latest code.

@camenduru
Contributor Author

enable_attention_slicing() and enable_xformers_memory_efficient_attention() are causing the problem.

@camenduru
Contributor Author

camenduru commented Apr 12, 2023

Without enable_attention_slicing() or enable_xformers_memory_efficient_attention(), the results look like the official examples here:
https://github.com/Picsart-AI-Research/Text2Video-Zero#text-to-video-1

@sayakpaul
Member

sayakpaul commented Apr 12, 2023

I am still not sure why you think it's not working.

Please provide detailed examples showing why you think it's not working.

I will also let one of the authors @19and99 comment on this.

@camenduru
Contributor Author

camenduru commented Apr 12, 2023

Please fork https://github.com/Picsart-AI-Research/Text2Video-Zero, add self.pipe.enable_attention_slicing() or self.pipe.enable_xformers_memory_efficient_attention() here: https://github.com/Picsart-AI-Research/Text2Video-Zero/blob/44b1639baed624800cb9dce43a29e2549024bc76/model.py#L261-L288

and test with a free Colab T4 and a Colab Pro A100:

%cd /content
!git clone https://github.com/your_username/Text2Video-Zero
!pip install -q gradio==3.23.0 decord==0.6.0 diffusers==0.14.0 accelerate==0.17.0 safetensors==0.2.7 einops==0.6.0 transformers==4.26.0
!pip install -q torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 torchtext==0.14.1 torchdata==0.5.1 --extra-index-url https://download.pytorch.org/whl/cu116 -U
!pip install -q xformers==0.0.16 triton==2.0.0 -U
!pip install -q kornia==0.6 tomesd basicsr==1.4.2 timm==0.6.12
%cd /content/Text2Video-Zero
!python app.py --public_access

@sayakpaul
Member

Unfortunately, we can only debug stuff from diffusers and not from external libraries.

Could you replicate the issues you're facing with the text-to-video zero pipeline from diffusers as in #2898 (comment)?

@camenduru
Contributor Author

camenduru commented Apr 12, 2023

You have already replicated the problem at #2898 (comment)

@camenduru
Contributor Author

We are getting a slideshow with xformers 😐 #2898 (comment)

@sayakpaul
Member

I will let @19and99 comment further.

@19and99
Contributor

19and99 commented Apr 12, 2023

Thanks for the report @camenduru, @sayakpaul, @patrickvonplaten.
We are using a custom attention processor, while pipe.enable_xformers_memory_efficient_attention() resets all the attention processors to XFormersAttnProcessor, causing the consistency issue.
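To illustrate the point, a rough sketch of what happens (the import path and expected processor names are taken from diffusers around 0.15 and may differ between versions; the workaround at the end is an untested assumption, not the authors' fix):

```python
# Sketch: the pipeline installs a cross-frame attention processor on the UNet,
# and enable_xformers_memory_efficient_attention() overwrites it.
import torch
from diffusers import TextToVideoZeroPipeline
from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import (
    CrossFrameAttnProcessor,
)

pipe = TextToVideoZeroPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

print({type(p).__name__ for p in pipe.unet.attn_processors.values()})
# expect something like {'CrossFrameAttnProcessor'}

pipe.enable_xformers_memory_efficient_attention()
print({type(p).__name__ for p in pipe.unet.attn_processors.values()})
# now {'XFormersAttnProcessor'} -- the cross-frame processor is gone

# Untested workaround: re-install the cross-frame processor afterwards. This
# should restore consistency but gives up the xFormers memory savings, so a
# proper fix would be an xFormers-based cross-frame attention processor.
pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
```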

@evancasey

evancasey commented Apr 12, 2023

Sorry if I'm missing something here, but isn't a lot of the consistency machinery not enabled when you use the ControlNet text-to-video (which this implementation does)?

The TextToVideoZeroPipeline does the latent warping and cross frame attention, but the controlnet text-to-video is only doing cross frame attention.

So it would not be expected to have full consistency compared to the vanilla TextToVideoZeroPipeline example.

EDIT: nevermind, it looks like cross frame attention is all you need, wow!
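For context, a sketch of how the pose-conditioned variant wires in cross-frame attention, following the pattern described in the diffusers text2video-zero docs (model ids and the import path here are illustrative assumptions):

```python
# Illustrative sketch: a regular ControlNet pipeline whose attention processors
# are replaced with cross-frame attention on both the UNet and the ControlNet.
# There is no latent warping in this variant.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.pipelines.text_to_video_synthesis.pipeline_text_to_video_zero import (
    CrossFrameAttnProcessor,
)

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Cross-frame attention (attending to the first frame's keys/values) is what
# keeps the frames consistent here.
pipe.unet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
pipe.controlnet.set_attn_processor(CrossFrameAttnProcessor(batch_size=2))
```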

@camenduru
Contributor Author

Thanks @19and99 ❤ Is it possible to reduce the model size? #2898 (comment)

@sayakpaul
Member

We are using a custom attention processor, while pipe.enable_xformers_memory_efficient_attention() resets all the attention processors to XFormersAttnProcessor, causing the consistency issue.

Thanks for this hint, @19and99!

is it possible to reduce the model size?

@camenduru, I think for that we need to implement a corresponding xFormers attention processor. Ccing @patrickvonplaten in case he has any other suggestions.

@camenduru
Contributor Author

Hi @sayakpaul 👋 if someone has time, it would be super cool 🔥 to implement an xFormers attention processor or an attention-slicing attention processor.

is it possible to reduce the model size?

Is convert_original_controlnet_to_diffusers.py converting to fp32 or fp16? Is something like --half or pipe.to(torch_dtype=torch.float16) possible?

And what is this?

I tried this, but I don't understand the --original_config_file option 😐 please help:

!pip install -q git+https://github.com/huggingface/diffusers transformers omegaconf
!git clone https://github.com/huggingface/diffusers
!wget https://huggingface.co/lllyasviel/ControlNet/resolve/main/models/control_sd15_openpose.pth
!wget https://huggingface.co/lllyasviel/sd-controlnet-openpose/raw/main/config.json ?????
!python /content/diffusers/scripts/convert_original_controlnet_to_diffusers.py --checkpoint_path /content/control_sd15_openpose.pth --dump_path /content/cnet --original_config_file /content/config.json

@sayakpaul
Member

Regarding the conversion script:

It's used to convert the original ControlNet parameters to the format we use in diffusers. For the config file, @williamberman can provide more details.

Reducing the size of diffusion_pytorch_model.bin should be possible, pinging @patrickvonplaten for that.

Also, since the questions you're asking are not related to xFormers, we'd appreciate it if you could open a separate thread.

@camenduru camenduru changed the title Is it possible to use xformers with text2video-zero Is it possible to run text2video-zero with less than 15GB GPU? Apr 13, 2023
@camenduru
Contributor Author

In this thread, everything is connected to each other. In my opinion, we should continue here.

@sayakpaul
Member

The easiest options are attention slicing or an xFormers attention processor. These need to be implemented.

The second option would be to run the pipeline in half precision, which can be done by setting the pipeline's torch_dtype argument to torch.float16. For running on a 15GB card, one will likely need to combine half precision with attention slicing and / or xFormers attention processing.
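A minimal sketch of the half-precision option (this is only a sketch; whether it fits in 15GB also depends on resolution and video length):

```python
# Sketch only: half precision via torch_dtype. Peak memory still depends on
# resolution and video_length, so this alone may not be enough for 15GB.
import torch
from diffusers import TextToVideoZeroPipeline

pipe = TextToVideoZeroPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

result = pipe(prompt="a panda dancing in Antarctica", video_length=8)
frames = result.images  # list of frames that can be written out as a video
```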

@camenduru
Contributor Author

I converted it to diffusion_pytorch_model.bin 🥳 Is it possible to convert it to fp16 and then save, like controlnet.to(torch_dtype=torch.float16)?

[screenshot of the converted model files]
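If it helps, a sketch of one way to do that with diffusers (untested assumption; the fp16 output folder name is made up, and "/content/cnet" matches the --dump_path used above):

```python
# Untested sketch: load the converted ControlNet folder in fp16 and save it again.
import torch
from diffusers import ControlNetModel

controlnet = ControlNetModel.from_pretrained("/content/cnet", torch_dtype=torch.float16)
controlnet.save_pretrained("/content/cnet-fp16")
# diffusion_pytorch_model.bin in the new folder should be roughly half the size.
```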
