[BUG] Sequence Parallel is not All-gathered when using Flash Attention? #722

Closed
xingyaoww opened this issue Mar 7, 2024 · 2 comments

@xingyaoww

Describe the bug
Does the current implementation of FlashAttention account for sequence parallelism?

For example, here self.core_attention_flash is called, but the q, k, v passed into flash attention are not all-gathered (correct me if I'm wrong!). That means attention is only computed within each local chunk of the sequence (say the total seq_len is L and the sequence-parallel degree is 4; attention would then be restricted to a window of L/4), which would cause issues in the trained model (i.e., it might stop attending to content more than L/4 tokens back).

Their original implementation of ParallelAttention does not have this issue since they perform all-gather in the forward pass and reduce-scatter in the backward pass; see this issue for details.

To Reproduce
N/A

Expected behavior
We should perform an all-gather across the sequence-parallel dimension before calling flash attention, and a reduce-scatter in the backward pass (just like the ParallelAttention implementation); a sketch of this pattern follows.
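
For illustration, here is a minimal sketch of that pattern. This is not the Megatron-DeepSpeed code; the [local_seq_len, batch, hidden] layout, the process-group handle, and the helper names are assumptions.

```python
import torch
import torch.distributed as dist


class _GatherSeqScatterGrad(torch.autograd.Function):
    """All-gather along the sequence dim in forward, reduce-scatter grads in backward."""

    @staticmethod
    def forward(ctx, x, group, seq_dim):
        ctx.group, ctx.seq_dim = group, seq_dim
        world_size = dist.get_world_size(group)
        chunks = [torch.empty_like(x) for _ in range(world_size)]
        dist.all_gather(chunks, x.contiguous(), group=group)
        # Local L/p chunk -> full sequence of length L.
        return torch.cat(chunks, dim=seq_dim)

    @staticmethod
    def backward(ctx, grad_output):
        world_size = dist.get_world_size(ctx.group)
        # Every rank holds a gradient for the full sequence; sum the copies and
        # hand each rank back only the slice belonging to its local chunk.
        grad_chunks = [c.contiguous() for c in grad_output.chunk(world_size, dim=ctx.seq_dim)]
        grad_local = torch.empty_like(grad_chunks[0])
        dist.reduce_scatter(grad_local, grad_chunks, group=ctx.group)
        return grad_local, None, None


def gather_seq_before_attention(q, k, v, group, seq_dim=0):
    """Make q/k/v cover the full sequence before the attention call."""
    q = _GatherSeqScatterGrad.apply(q, group, seq_dim)
    k = _GatherSeqScatterGrad.apply(k, group, seq_dim)
    v = _GatherSeqScatterGrad.apply(v, group, seq_dim)
    return q, k, v
```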

Stack trace/logs
N/A

@xingyaoww
Author

Never mind. I dug deeper with some interactive debugging and found that the all-gather happens implicitly in self.query_key_value: that ColumnParallelLinear takes care of gathering across the sequence-parallel dimension, so flash attention already sees the full sequence. The current implementation should be fine. :)
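
For anyone hitting the same confusion, a rough sketch of why the gather is implicit; this is not the actual ColumnParallelLinear code, and the function name, shapes, and process-group argument are assumptions. With sequence parallelism enabled, the column-parallel QKV projection first all-gathers its input along the sequence dimension, so the q/k/v it hands to flash attention already span the full sequence; only the hidden/head dimension is split across tensor-parallel ranks.

```python
import torch
import torch.distributed as dist


def sequence_parallel_column_linear(x_local, weight_shard, group):
    """x_local: [seq_len / p, batch, hidden]; weight_shard: [out_features_per_rank, hidden]."""
    world_size = dist.get_world_size(group)

    # All-gather the sequence-parallel activations -> [seq_len, batch, hidden].
    chunks = [torch.empty_like(x_local) for _ in range(world_size)]
    dist.all_gather(chunks, x_local.contiguous(), group=group)
    x_full = torch.cat(chunks, dim=0)

    # Column-parallel matmul: each rank produces its shard of the output features,
    # but over the full sequence, so downstream attention sees every token.
    # (In the real layer the matching reduce-scatter happens in the backward pass.)
    return torch.nn.functional.linear(x_full, weight_shard)
```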

@wdrink

wdrink commented Jun 28, 2024

An interesting question. Does sequence parallelism in DeepSpeed also support scaled_dot_product_attention? Thanks!
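
A minimal sketch, assuming the hypothetical gather_seq_before_attention helper from the earlier sketch and a [batch, heads, local_seq_len, head_dim] layout, of how the same gather-before-attention pattern could be placed in front of torch.nn.functional.scaled_dot_product_attention; this does not imply DeepSpeed does this internally.

```python
import torch.nn.functional as F


def sdpa_with_seq_gather(q, k, v, group, is_causal=True):
    # q, k, v: [batch, heads, local_seq_len, head_dim] on each sequence-parallel rank.
    # Gather along the sequence dimension (dim=2) so SDPA attends over all tokens.
    q, k, v = gather_seq_before_attention(q, k, v, group, seq_dim=2)
    return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)
```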
