
Add FlashAttention2 for XLM-RoBERTa #28713

Conversation

@DavidAfonsoValente (Contributor) commented Jan 25, 2024

What does this PR do?

Fixes #27957

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@amyeroberts (Collaborator)

@DavidAfonsoValente Thanks for opening a PR and adding FA2 for this model! Let us know when the PR is ready to be reviewed 🤗

@DavidAfonsoValente (Contributor, Author)

> @DavidAfonsoValente Thanks for opening a PR and adding FA2 for this model! Let us know when the PR is ready to be reviewed 🤗

Thanks! I believe it's ready to be reviewed.

@DavidAfonsoValente (Contributor, Author) commented Jan 29, 2024

The tests that are failing are due to the changes I made in functions labeled as:
# Copied from transformers.models.roberta.modeling_roberta.RobertaSelfAttention with Roberta->XLMRoberta

What is the standard procedure when I have changed these functions to accommodate the new feature?

@amyeroberts (Collaborator)

@DavidAfonsoValente If you click on the CI runs, you'll see the error messages relating to the code quality failures. These detail how to resolve the issue: run make fix-copies and push the changes.

If another model is copying this model's attention classes, then it's great because you're adding FA2 for two models 🥳 You just need to apply the equivalent changes to the model that's copying.
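
For readers following the thread, the marker quoted above ties the XLM-RoBERTa class to its Roberta source, and the code-quality CI compares the two after applying the stated rename. A minimal sketch of how that reads in the modeling file (the class body is elided here; only the mechanism is shown):

```python
import torch.nn as nn


# Copied from transformers.models.roberta.modeling_roberta.RobertaSelfAttention with Roberta->XLMRoberta
class XLMRobertaSelfAttention(nn.Module):
    # `make fix-copies` regenerates this class from RobertaSelfAttention, applying the
    # Roberta->XLMRoberta rename. Any edit made only to this copy is overwritten on the
    # next run, so a divergent change means updating the source class (or removing the
    # marker so the two implementations are allowed to drift apart).
    ...
```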

@amyeroberts (Collaborator) left a review comment

Awesome work - thanks for adding!

Just a few tiny comments and then we're good to merge 🤗

Review threads on tests/models/xlm_roberta/test_modeling_xlm_roberta.py (two, outdated, resolved) and docs/source/en/model_doc/xlm-roberta.md (resolved).
@hackyon (Contributor) commented Feb 7, 2024

Hello there!

I'm working on integrating scaled_dot_product_attention to BERT #28802, and there might be some merge conflicts with this change.

Most of the changes I'm making will propagate through to XLM-RoBERTa (through fix-copies). From what I can tell, our approaches differ in 2 areas:

  1. I've kept the # Copied from line at the class level, so any changes to BERT's __init__() will be propagated through to the fix-copies classes. I've modified the # Copied from line to use the corresponding ATTENTION_CLASS as necessary.
  2. I've created a BertSdpaSelfAttention to go with BertSelfAttention, instead of creating a BertSdpaAttention. The rationale is that the code in BertSdpaSelfAttention is adapted from the code in BertSelfAttention, so I figured it's better to name it as such.

Let me know if you have any questions about these. I think we should discuss which approach is better and adopt it. Would you mind looking through #28802 and letting me know what you think?
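
The dispatch pattern described in the two points above looks roughly like the following self-contained sketch with stand-in classes (not the code from either PR; the real classes are built from a config object, and the SDPA/FA2 variants override forward()):

```python
import torch.nn as nn


class SelfAttention(nn.Module):
    """Stand-in for an eager BertSelfAttention / XLMRobertaSelfAttention."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)


class SdpaSelfAttention(SelfAttention):
    """Stand-in for a BertSdpaSelfAttention-style subclass: same weights and __init__,
    only forward() would differ (it would call F.scaled_dot_product_attention)."""


# The mapping referred to above as *_SELF_ATTENTION_CLASSES: the wrapping attention
# module looks up the implementation by name instead of hard-coding one class.
SELF_ATTENTION_CLASSES = {
    "eager": SelfAttention,
    "sdpa": SdpaSelfAttention,
}


class Attention(nn.Module):
    """Stand-in for BertAttention / XLMRobertaAttention: it picks the self-attention
    implementation from the mapping and keeps the shared output projection
    (dense + dropout + residual LayerNorm) outside the swapped class."""

    def __init__(self, attn_implementation: str = "eager", hidden_size: int = 768):
        super().__init__()
        self.self = SELF_ATTENTION_CLASSES[attn_implementation](hidden_size)
        self.output_dense = nn.Linear(hidden_size, hidden_size)  # shared output block,
        self.output_dropout = nn.Dropout(0.1)                    # simplified here
        self.output_norm = nn.LayerNorm(hidden_size)


attn = Attention(attn_implementation="sdpa")
print(type(attn.self).__name__)  # SdpaSelfAttention
```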

@DavidAfonsoValente (Contributor, Author)

> (quoting @hackyon's comment above)

It seems like your implementation doesn't create complicated conflicts, so it should be simple to merge. What do you think?

@hackyon (Contributor) commented Feb 8, 2024

It should be simple to merge. I think the main question is whether we want to use the XLM_ROBERTA_SELF_ATTENTION_CLASSES approach or the XLM_ROBERTA_ATTENTION_CLASSES approach.

@DavidAfonsoValente (Contributor, Author)

To be consistent with other models, I think we should keep the name XLM_ROBERTA_ATTENTION_CLASSES, since that is how other models name their attention classes.

@hackyon (Contributor) commented Feb 8, 2024

Thanks for your input.

I am leaning towards using the "SelfAttention" convention in this case, because XLMRobertaFlashAttention2 actually tries to copy the logic of both XLMRobertaAttention and XLMRobertaSelfAttention into one. I think it's cleaner to have an XLMRobertaSelfFlashAttention2 that mirrors XLMRobertaSelfAttention, and then reuse the logic inside XLMRobertaAttention for both types of self-attention.

Reusing the code in XLMRobertaAttention should help avoid some bugs. Judging from the existing code, I think there's already a bit of an issue with the way self.output() is called in XLMRobertaFlashAttention2 (I think you need more than the call to self.output.dense(), otherwise you're missing the dropout and LayerNorm).
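
For context on that last point: the output block in this model family applies a projection, then dropout, then LayerNorm over a residual connection, so calling only its dense layer drops two of the three steps. A simplified, self-contained stand-in (plain sizes instead of a config object):

```python
import torch
import torch.nn as nn


class SelfOutput(nn.Module):
    """Simplified stand-in for the XLMRobertaSelfOutput-style block: a projection,
    dropout, then LayerNorm over a residual connection with the block's input."""

    def __init__(self, hidden_size: int = 768, dropout: float = 0.1):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.dropout = nn.Dropout(dropout)
        self.LayerNorm = nn.LayerNorm(hidden_size)

    def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        # Calling only `self.output.dense(...)` from an FA2 attention class would skip
        # these two steps (dropout and the residual LayerNorm), which is the issue
        # being pointed out above.
        return self.LayerNorm(hidden_states + input_tensor)


attn_out = torch.randn(2, 4, 768)   # output of the attention kernel
residual = torch.randn(2, 4, 768)   # input that entered the attention block
print(SelfOutput()(attn_out, residual).shape)  # torch.Size([2, 4, 768])
```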

@DavidAfonsoValente (Contributor, Author)

OK, maybe it'll be easier if you merge your PR first and I work on top of your version.

@hackyon (Contributor) commented Feb 8, 2024

Thanks 👍 I'll iterate as fast as possible to get it merged, and will let you know as soon as it's done!

github-actions bot commented Mar 4, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@michaelshekasta

@amyeroberts @DavidAfonsoValente @hackyon Why hasn't this PR been merged yet? How can I help with that?

@hackyon (Contributor) commented May 8, 2024

@michaelshekasta

I have #30510 open to add scaled dot product attention (SDPA) to RoBERTa-based models, and it is still under review. Once available, SDPA can run on an FA2 backend. You could very well try to add FA2 to it as well.

@michaelshekasta

> @michaelshekasta I have #30510 open to add scaled dot product attention (SDPA) to RoBERTa-based models, and it is still under review. Once available, SDPA can run on an FA2 backend. You could very well try to add FA2 to it as well.

@hackyon Thank you for your feedback! As far as I'm aware, SDPA isn't Flash Attention 2, and it's currently in beta, correct? So, will HF support it? Or am I missing something?

I'd be happy to contribute, but I'm not sure how. Could we discuss this on the HF Discord server, perhaps in the NLP channel?

@hackyon (Contributor) commented May 8, 2024

Newer versions of SDPA actually have FA2 integration: https://pytorch.org/blog/pytorch2-2. That said, you could probably add an explicit FA2 option as well if you'd like (some models have that too), but I'm not an HF maintainer, so take my opinion with a grain of salt.

SDPA is already supported in a variety of HF models.
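
To make that concrete, here is a hedged usage sketch. It assumes SDPA support for XLM-RoBERTa has landed (e.g. via the linked PRs); until then, transformers will reject attn_implementation="sdpa" for this architecture. It also assumes a CUDA GPU and half precision, which the FlashAttention backend requires:

```python
import torch
# PyTorch >= 2.3; earlier 2.x releases expose torch.backends.cuda.sdp_kernel instead.
from torch.nn.attention import SDPBackend, sdpa_kernel
from transformers import AutoModel, AutoTokenizer

# Assumption: SDPA support for XLM-RoBERTa has landed (e.g. via #30510).
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained(
    "xlm-roberta-base",
    attn_implementation="sdpa",
    torch_dtype=torch.float16,
).to("cuda")

inputs = tokenizer("FlashAttention through SDPA", return_tensors="pt").to("cuda")

# Allow SDPA to dispatch to the FlashAttention-2 or memory-efficient kernels; which one
# is actually picked depends on dtype, attention mask, and head-size constraints.
with torch.no_grad(), sdpa_kernel([SDPBackend.FLASH_ATTENTION, SDPBackend.EFFICIENT_ATTENTION]):
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)
```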

@michaelshekasta

Thank you once more for your input, @hackyon!

I'm interested in implementing XLM-RoBERTa with FA2, and I believe integrating it into Hugging Face (HF) could be the way forward. From what I've gathered, this pull request appears to align with my goals.

While I understand you're not directly involved in HF maintenance, your prior work suggests you're knowledgeable in this area. Could you advise on the best course of action? Should I wait for your pull request, or would utilizing the flash_attn library be a viable alternative?

@hackyon (Contributor) commented May 8, 2024

I think you should reference how FA2 was implemented for other models, such as modeling_llama, and bring those changes over. I recommend using "Copied From" as much as possible.

My 2 cents would be that it'd be better to use XLM_ROBERTA_SELF_ATTENTION_CLASSES instead of XLM_ROBERTA_ATTENTION_CLASSES, otherwise there'd be a conflict with #30510 (and since they have a "Copied From" relationship, it wouldn't be easy to keep both copies in sync).

Unfortunately, I won't be able to discuss on discord. You might be able to find some of the core maintainers there though.
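
For reference, the modeling_llama-style FA2 path boils down to projecting q/k/v and calling the flash_attn kernel directly. Below is a rough, self-contained sketch using the flash_attn package; the class and parameter names are hypothetical, and padding/variable-length handling (which the real implementations do via the varlen kernels) is omitted:

```python
import torch
import torch.nn as nn

try:
    from flash_attn import flash_attn_func  # flash-attn 2.x
except ImportError:
    flash_attn_func = None


class FlashSelfAttention2(nn.Module):
    """Hypothetical sketch of an encoder-style FA2 self-attention block (bidirectional,
    no padding/varlen handling, no relative position embeddings)."""

    def __init__(self, hidden_size: int = 768, num_heads: int = 12, dropout: float = 0.1):
        super().__init__()
        if flash_attn_func is None:
            raise ImportError("this sketch requires the flash-attn 2.x package")
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
        self.dropout_p = dropout

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        bsz, seq_len, _ = hidden_states.shape
        # flash_attn_func expects (batch, seq_len, num_heads, head_dim) tensors in
        # fp16/bf16 on a CUDA device.
        q = self.query(hidden_states).view(bsz, seq_len, self.num_heads, self.head_dim)
        k = self.key(hidden_states).view(bsz, seq_len, self.num_heads, self.head_dim)
        v = self.value(hidden_states).view(bsz, seq_len, self.num_heads, self.head_dim)
        attn = flash_attn_func(
            q, k, v,
            dropout_p=self.dropout_p if self.training else 0.0,
            causal=False,  # encoder model: bidirectional attention
        )
        # The real modules then pass this through the shared output block
        # (dense + dropout + residual LayerNorm) discussed earlier in the thread.
        return attn.reshape(bsz, seq_len, -1)
```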

@michaelshekasta

@hackyon thanks again! You are awesome!
I don't think I really understand the issue with "Copied from". Can you explain a bit more or point me to a resource?

@hackyon (Contributor) commented May 8, 2024

> I don't think I really understand the issue with "Copied from". Can you explain a bit more or point me to a resource?

#30510 has an XLMROBERTAXL_SELF_ATTENTION_CLASSES mapping, so it's better to add the FA2 implementation there instead of creating a new XLM_ROBERTA_ATTENTION_CLASSES.

@michaelshekasta

@hackyon, upon revisiting the thread, I've come to the conclusion that bringing SDPA to XLM-R is the most appropriate course of action. Given that PyTorch has already implemented FA2 within SDPA, I don't see the necessity of replicating it separately for XLM-R. Is there anything I can assist with to facilitate the seamless integration of this pull request?

@hackyon (Contributor) commented May 19, 2024

@michaelshekasta if you're up for it, you can clone #30510 and try to get that submitted. I will be starting a new job in a week and won't really have time to follow through with that PR.

If this is your first change, you'll need to read through their contributions page and follow the instructions there. Specifically, pay attention to getting "make fix-copies" and "RUN_SLOW" tests to pass before opening the PR.

@michaelshekasta commented May 19, 2024

@hackyon Let's talk over email: shkasta@post.bgu.ac.il
