
xFormers attention op arg #2049

Merged — 11 commits merged into huggingface:main on Jan 24, 2023

Conversation

@takuma104 (Contributor) commented Jan 20, 2023

What does this PR do?

This PR adds an attention_op argument to enable_xformers_memory_efficient_attention(). The argument overrides the op argument of memory_efficient_attention() in xFormers. It was originally written by @patil-suraj on the xformers-attention-op-arg branch, and I have added some tweaks so that it can be merged into the current main branch. Short documentation has also been added.

Usage Example:

import xformers
import xformers.ops

# "pipe" is any DiffusionPipeline that has already been loaded and moved to the GPU
op = xformers.ops.MemoryEfficientAttentionFlashAttentionOp
pipe.enable_xformers_memory_efficient_attention(attention_op=op)

As an example of an application of this PR, using Flash Attention improves the reproducibility of Stable Diffusion image generation thanks to its deterministic behavior. This was discussed in #1997.

I verified this PR with the following code:
https://gist.github.com/takuma104/acc9ff3809e4b259bf24b8130c021823
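
For illustration, a minimal sketch of the kind of reproducibility check this enables (the model id, prompt, and seed are just placeholders; it assumes a CUDA GPU and an SD 2.x checkpoint, since SD 1.x head dims exceed Flash Attention's limit, as discussed later in this thread):

import torch
import numpy as np
from diffusers import DiffusionPipeline
from xformers.ops import MemoryEfficientAttentionFlashAttentionOp

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
pipe.enable_xformers_memory_efficient_attention(attention_op=MemoryEfficientAttentionFlashAttentionOp)
# The VAE attention shapes are not supported by Flash Attention, so use the default op there
pipe.vae.enable_xformers_memory_efficient_attention(attention_op=None)

def generate(seed):
    generator = torch.Generator("cuda").manual_seed(seed)
    return pipe("a photo of an astronaut", generator=generator, output_type="np").images[0]

# With a deterministic attention op, two runs from the same seed should match exactly
print(np.array_equal(generate(42), generate(42)))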

@HuggingFaceDocBuilderDev commented Jan 20, 2023

The documentation is not available anymore as the PR was closed or merged.

@patil-suraj (Contributor) left a comment

Very cool, and looks good to me. Thanks a lot for working on this!

@pcuenca (Member) left a comment

I agree with Patrick, a couple of examples would be nice.

takuma104 and others added 2 commits January 23, 2023 23:04
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
@takuma104 (Contributor, Author)

Thanks @patrickvonplaten! I've merged the suggested examples.

@takuma104 (Contributor, Author)

Hmm... I'm getting the following docstring error. Is the ... wrong, or is it something else?
Error message: Cannot parse: 2:21: from xformers import ... # some attention op

Review comments on src/diffusers/models/modeling_utils.py and src/diffusers/pipelines/pipeline_utils.py (outdated, resolved)
@patil-suraj (Contributor)

Hey @takuma104, thanks for adding the examples. The docstring error was because of the invalid syntax (...); the docstring checker expects a valid Python example. I left suggestions for how to fix this in the comments.

takuma104 and others added 3 commits January 24, 2023 00:32
Co-authored-by: Suraj Patil <surajp815@gmail.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
@takuma104 (Contributor, Author)

Thanks for your help, @patil-suraj! Wow, that's a very strict code-style checker. Nice.

@takuma104 (Contributor, Author)

I ran the example code to double-check: Flash Attention does not work with SD 1.4, but it does with SD 2.1. Can I change "CompVis/stable-diffusion-v1-4" to "stabilityai/stable-diffusion-2-1"?

@takuma104 (Contributor, Author)

Which one would be more appropriate as a code example?

  • It is not runnable as-is, but as an example it is simple and easy to understand. The only change is "CompVis/stable-diffusion-v1-4" to model_id.
from diffusers import DiffusionPipeline
from xformers.ops import MemoryEfficientAttentionFlashAttentionOp
pipe = DiffusionPipeline.from_pretrained(model_id).to("cuda")
pipe.enable_xformers_memory_efficient_attention(attention_op=MemoryEfficientAttentionFlashAttentionOp)
  • Fully functional code:
import torch
from diffusers import DiffusionPipeline
from xformers.ops import MemoryEfficientAttentionFlashAttentionOp
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16).to("cuda")
pipe.enable_xformers_memory_efficient_attention(attention_op=MemoryEfficientAttentionFlashAttentionOp)
# Workaround: Flash Attention does not accept the VAE's attention shape, so fall back to the default op for the VAE
pipe.vae.enable_xformers_memory_efficient_attention(attention_op=None)

@patil-suraj (Contributor)

Hey @takuma104, the second example looks great. It's always better to have fully functional examples in the docs so readers can just copy-paste them to try.

Also, out of curiosity, what do you mean by "Flash Attention does not work well with SD1.4"? I think it works, no?

@takuma104 (Contributor, Author)

Hi @patil-suraj

Ok, thanks. I got it. I'll update it very soon.

Also, out of curiosity, what do you mean by "Flash Attention does not work well with SD1.4"? I think it works, no?

The latest xFormers shows the error in detail; when I run U-Net inference with SD 1.x, I get the following:

ValueError: Operator `memory_efficient_attention` does not support inputs:
     query       : shape=(16, 256, 1, 160) (torch.float16)
     key         : shape=(16, 256, 1, 160) (torch.float16)
     value       : shape=(16, 256, 1, 160) (torch.float16)
     attn_bias   : <class 'NoneType'>
     p           : 0.0
`flshattF` is not supported because:
    max(query.shape[-1] != value.shape[-1]) > 128

In SD 1.x, some of the U-Net attentions have a K dimension (the K in [B, M, K]) that doesn't meet the requirement: it must be 128 or less.

In SD 2.x, I don't know whether this is intentional or not, but the K dimension is 128 or less in all cases.
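
A quick way to inspect this per model (a sketch, not part of the PR; it assumes the attribute names of diffusers' attention modules, i.e. heads and to_q, which may differ across versions, and the model id is just an example):

from diffusers import UNet2DConditionModel

# Print the per-head dimension of every attention module in the U-Net.
# Flash Attention in xFormers requires this to be 128 or less.
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")
for name, module in unet.named_modules():
    if hasattr(module, "heads") and hasattr(module, "to_q"):
        head_dim = module.to_q.out_features // module.heads
        print(name, head_dim)  # SD 1.x reports 160 for some blocks (as in the error above)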

@patil-suraj (Contributor)

Thanks for updating the doc!

@patil-suraj patil-suraj merged commit 16bb505 into huggingface:main Jan 24, 2023
@adhikjoshi commented Jan 31, 2023

I am getting this same error.

text2img works, but I get this error on img2img.

I even tried disabling memory_efficient_attention with

pipe.disable_xformers_memory_efficient_attention()

but it still doesn't work on img2img:

Operator `memory_efficient_attention` does not support inputs:
     query       : shape=(16, 25, 1, 160) (torch.float16)
     key         : shape=(16, 25, 1, 160) (torch.float16)
     value       : shape=(16, 25, 1, 160) (torch.float16)
     attn_bias   : <class 'NoneType'>
     p           : 0.0

`flshattF` is not supported because:
    max(query.shape[-1] != value.shape[-1]) > 128

@patil-suraj (Contributor)

@adhikjoshi Feel free to open an issue with a reproducible code snippet for this :)

@tianleiwu (Contributor) commented Feb 8, 2023

I also encountered the same error (max(query.shape[-1] != value.shape[-1]) > 128) on a T4 GPU with xformers 0.0.16 or 0.0.17.dev444 and the Stable Diffusion 1.5 model.

Is this supported only on Ampere or later GPUs (with CUDA compute capability >= 80)?

@takuma104 (Contributor, Author)

Hi @tianleiwu,

Do you want to use Flash Attention? As mentioned above, SD 1.x models are not supported by Flash Attention because they don't meet the dim <= 128 requirement. SD 2.x models do seem to be supported by Flash Attention.

According to the xFormers code, Tesla T4 (sm_75) meets the minimum requirement, so Tesla T4 is supported by Flash Attention.
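
As a quick sanity check of the compute capability (a sketch; per the xFormers requirement mentioned above, sm_75 or newer is enough, and Ampere corresponds to sm_80):

import torch

major, minor = torch.cuda.get_device_capability()
print(f"compute capability: sm_{major}{minor}")  # a T4 prints sm_75, which meets the requirement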

@adhikjoshi commented Feb 9, 2023

Hi @tianleiwu,

Do you want to use Flash Attention? As mentioned above, SD 1.x models are not supported by Flash Attention because they don't meet the dim <= 128 requirement. SD 2.x models do seem to be supported by Flash Attention.

According to the xFormers code, Tesla T4 (sm_75) meets the minimum requirement, so Tesla T4 is supported by Flash Attention.

Actually, it works with Flash Attention. I tried it with all models, including 1.x and 2.x; it works great on a 3090.

Only img2img and inpainting don't work, hence I had to drop it.

But maybe the solution is here; I'm going to try it:

#2234

yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
* allow passing op to xFormers attention

original code by @patil-suraj
huggingface/diffusers@ae0cc0b

* correct style by `make style`

* add attention_op arg documents

* add usage example to docstring

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* add usage example to docstring

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

* code style correction by `make style`

* Update docstring code to a valid python example

Co-authored-by: Suraj Patil <surajp815@gmail.com>

* Update docstring code to a valid python example

Co-authored-by: Suraj Patil <surajp815@gmail.com>

* style correction by `make style`

* Update code example to be fully functional

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
Co-authored-by: Suraj Patil <surajp815@gmail.com>