Describe the bug
Does the current implementation of FlashAttention account for the cases of sequence parallelism?
For example, here self.core_attention_flash is called, but the q, k, v passed into flash attention do not appear to be all-gathered (correct me if I'm wrong!!). That means attention is only computed within each local chunk of the sequence (with total seq_len L and a sequence-parallel size of 4, attention would only cover L/4 tokens), which would cause issues in the trained models (i.e., the model might stop attending to content more than L/4 tokens back).
Their original implementation of ParallelAttention does not have this issue since they perform all-gather in the forward pass and reduce-scatter in the backward pass; see this issue for details.
To Reproduce
N/A
Expected behavior
We should perform an all-gather across the sequence-parallel dimension before calling flash attention, and a reduce-scatter in the backward pass (just like the ParallelAttention implementation).
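For reference, here is a minimal sketch (not Megatron's actual code) of the pattern described above: all-gather the sequence-parallel shards in the forward pass and reduce-scatter the gradient back to shards in the backward pass. It assumes torch.distributed is already initialized, that tensors use a sequence-first [seq, batch, hidden] layout, and that the sequence dimension is sharded across `group`; the class name is hypothetical.

```python
import torch
import torch.distributed as dist


class GatherSeqParallel(torch.autograd.Function):
    """All-gather along the sharded sequence dim in forward,
    reduce-scatter the gradient back to local shards in backward."""

    @staticmethod
    def forward(ctx, x, group):
        ctx.group = group
        world_size = dist.get_world_size(group)
        # Output holds the full sequence: world_size * local_seq_len along dim 0.
        out = x.new_empty((x.shape[0] * world_size, *x.shape[1:]))
        dist.all_gather_into_tensor(out, x.contiguous(), group=group)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        group = ctx.group
        world_size = dist.get_world_size(group)
        # Each rank keeps (and sums) only its own sequence shard of the gradient.
        grad_in = grad_out.new_empty(
            (grad_out.shape[0] // world_size, *grad_out.shape[1:])
        )
        dist.reduce_scatter_tensor(grad_in, grad_out.contiguous(), group=group)
        return grad_in, None
```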
Stack trace/logs
N/A
Never mind, I dug in deeper with some interactive debugging and found that the all-gather happens implicitly in self.query_key_value, a ColumnParallelLinear that takes care of the gather. The current implementation should be fine. :)
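To make the resolution concrete, here is a rough sketch (hypothetical, not Megatron's actual ColumnParallelLinear) of why no extra gather is needed before flash attention: when sequence parallelism is enabled, the column-parallel linear gathers its input along the sequence dimension before the matmul, so the q/k/v it produces already span the full sequence.

```python
import torch
import torch.distributed as dist


def column_parallel_linear_fwd(x_shard, weight_shard, group):
    """x_shard: [local_seq, batch, hidden] shard of the full sequence.
    weight_shard: this rank's column slice of the weight, [out_per_rank, hidden]."""
    world_size = dist.get_world_size(group)
    # Gather the sequence shards so every rank sees the full sequence.
    x_full = x_shard.new_empty((x_shard.shape[0] * world_size, *x_shard.shape[1:]))
    dist.all_gather_into_tensor(x_full, x_shard.contiguous(), group=group)
    # Each rank computes its slice of the output features over the FULL sequence,
    # so the q/k/v handed to flash attention are not sequence-sharded.
    return torch.matmul(x_full, weight_shard.t())
```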