A ring attention with flash attention kernel implementation #4

Closed
zhuzilin opened this issue Feb 21, 2024 · 19 comments
@zhuzilin

zhuzilin commented Feb 21, 2024

Hi! Thank you for your work on implementing ring attention in PyTorch!

I've just tried to implement a ring_flash_attn_qkvpacked_func (corresponding to flash_attn_qkvpacked_func in flash attention) with the flash attention kernels here: https://github.com/zhuzilin/ring-flash-attention/

Maybe this can help :)


Updates:

  • ring_flash_attn_varlen_qkvpacked_func is also implemented.
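For anyone skimming the thread, here is a minimal usage sketch of the idea, assuming the function mirrors flash-attn's qkvpacked interface and that each rank holds its own shard of the sequence; the exact arguments may differ, so please check the linked repo.

```python
# Minimal usage sketch (assumptions: the function mirrors flash-attn's
# qkvpacked signature and each rank holds a contiguous shard of the
# sequence; check the repo for the exact arguments).
import torch
import torch.distributed as dist
from ring_flash_attn import ring_flash_attn_qkvpacked_func

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

batch, local_seqlen, nheads, headdim = 2, 1024, 8, 64  # sequence shard per rank
qkv = torch.randn(
    batch, local_seqlen, 3, nheads, headdim,
    device="cuda", dtype=torch.bfloat16, requires_grad=True,
)

# Each rank computes attention over the full sequence via ring passes of K/V.
out = ring_flash_attn_qkvpacked_func(qkv, causal=True)
out.sum().backward()
```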
@lucidrains
Owner

lucidrains commented Feb 21, 2024

@zhuzilin hey Zilin! this looks like a good start, and what I intended to do at the very end! i was imagining that the ring communication could be done within CUDA using IPC? (however, I am far from a CUDA expert, so I could be wrong and it may not be possible) Are you planning on upstreaming the finalized implementation to Tri Dao's official flash attention repository? That would be a big contribution!

@lucidrains
Owner

lucidrains commented Feb 21, 2024

@zhuzilin if you do embark on the pull request, the minimal features would be the ring IPC, the ability to specify the maximum number of ring passes (as I believe they must have curriculum learned from local attention to full global attention, or mixed local and global using a variable number of ring passes throughout the transformer), and finally, if you have the bandwidth, specialized masking logic for striped autoregressive attention to balance the workload
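On the striped attention point, here is a tiny illustrative sketch (not from either repo) of why striped sharding balances the causal workload: every world_size-th token goes to a rank, so each rank sees a similar mix of early and late positions, whereas contiguous block sharding leaves the last rank with far more unmasked work per ring pass.

```python
# Illustrative only: compare contiguous vs. striped sequence sharding.
# Under a causal mask, contiguous sharding gives later ranks much more
# attention work per ring pass; striped sharding spreads it evenly.
import torch

def contiguous_shard(x: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """x: (batch, seqlen, ...) -> this rank's contiguous block of tokens."""
    block = x.shape[1] // world_size
    return x[:, rank * block:(rank + 1) * block]

def striped_shard(x: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
    """x: (batch, seqlen, ...) -> every world_size-th token, offset by rank."""
    return x[:, rank::world_size]
```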

@zhuzilin changed the title from "A ring_flash_attn_qkvpacked_func implementation" to "A ring attention with flash attention kernel implementation" on Feb 21, 2024
@lucidrains
Owner

[screenshot: Screen Shot 2024-02-21 at 7:11:49 AM]

thank you! 🚀 ❤️

@lucidrains
Owner

@zhuzilin actually, after looking into CUDA IPC stuff, your approach may be the best for now

@zhuzilin
Author

zhuzilin commented Feb 22, 2024

> Are you planning on upstreaming the finalized implementation to Tri Dao's official flash attention repository?

I'll open an issue on the flash attention repo to see if they're interested in upstreaming it (or designing a better version) in the official repo :)

> after looking into CUDA IPC stuff, your approach may be the best for now

yeah, using NCCL-based p2p communication is at least an easier way to implement, with acceptable performance.
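For reference, a minimal sketch of what such an NCCL-based p2p ring pass could look like with torch.distributed; the helper name and structure are illustrative, not taken from the repo.

```python
# Rough sketch of an NCCL p2p ring pass: each rank sends its current K/V
# block to the next rank and receives the previous rank's block. In a real
# implementation this exchange is overlapped with the local attention compute.
import torch
import torch.distributed as dist

def ring_send_recv(tensor: torch.Tensor) -> torch.Tensor:
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    send_to = (rank + 1) % world_size
    recv_from = (rank - 1) % world_size
    recv_buf = torch.empty_like(tensor)
    # Issue send and recv together so NCCL can pair them without deadlocking.
    ops = [
        dist.P2POp(dist.isend, tensor, send_to),
        dist.P2POp(dist.irecv, recv_buf, recv_from),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    return recv_buf
```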

@andreaskoepf

andreaskoepf commented Feb 22, 2024

@zhuzilin awesome work, we'll organize a little hack today at 19:00 UTC on the cuda-mode discord to hack on your impl (do some testing, benchmarking, and discussion about the best comms options for single node, multi node, etc.) - just fyi https://x.com/neurosp1ke/status/1760558683136589983

@lucidrains
Owner

lucidrains commented Feb 22, 2024

Germany, Beijing, San Francisco

only in open source (and science)

@lucidrains
Owner

[meme image]

@lucidrains
Owner

i also wanted to do some LOTR references, but one meme is enough

@zhuzilin
Author

oh... sorry, I took a day off and missed all the notifications from GitHub....

@lucidrains
Owner

@zhuzilin i think my version is working too now, with a modified forward flash attention kernel to minimize ring passes

thanks for sharing your repo as a proof of concept!

@andreaskoepf

@lucidrains thanks a lot for your hard work & very interesting that you used a custom triton kernel! :-)

@lucidrains
Owner

lucidrains commented Feb 28, 2024

@andreaskoepf thanks! seems like there's still an issue with the backwards, but i'll leave it to someone or some team to fix. yup, i think the forwards requires the keys and values to be iterated on the outer loop (to save on extraneous ring passes), so the reduced outputs, row maxes, and lse need to be stored and passed back in on the next ring pass. but i could be wrong and there may be a simpler way
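A rough sketch of that loop structure, with hypothetical helper names: flash_attn_block_forward stands in for a flash-attention kernel call that also returns the per-row LSE, merge is the LSE-based combine discussed further down the thread, and ring_send_recv is the p2p exchange sketched earlier.

```python
# Sketch only: K/V rotate around the ring on the outer loop, while the running
# output and log-sum-exp (which subsumes the row max) are carried across
# passes and rescaled as each new K/V block arrives.
# Per-pass causal masking is omitted for brevity.
import torch.distributed as dist

def ring_attention_forward(q, k, v, flash_attn_block_forward, merge, ring_send_recv):
    out, lse = None, None
    for _ in range(dist.get_world_size()):
        # Local attention of this rank's queries against the current K/V block.
        block_out, block_lse = flash_attn_block_forward(q, k, v)
        # Fold the block result into the running (out, lse) accumulators.
        out, lse = merge(out, lse, block_out, block_lse)
        # Pass K and V to the next rank for the following ring step.
        k, v = ring_send_recv(k), ring_send_recv(v)
    return out, lse
```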

@ericauld

ericauld commented Feb 29, 2024

@lucidrains What is the issue you're referring to with the backward pass?

@lucidrains
Owner

it isn't correct, probably something small with regard to how i'm using the flash attention api

feel free to submit a PR, i likely won't be able to get to this as i'll be running around the bay area meeting people next month

@lucidrains
Owner

@ericauld ah, good news, the cuda backwards actually yielded the right gradients (full attention, no causal or key padding mask). it is my naive version that is broken

alright, i guess it is safe to remove the wip

@apaz-cli

apaz-cli commented Mar 4, 2024

@lucidrains Knowing the LSE doesn't actually help you compute the backwards for softmax though, correct? The derivative of LSE is softmax, not the other way around. What am I missing, and what is the utility of returning the LSE?

@andreaskoepf

> What am I missing, and what is the utility of returning the LSE?

The returned log-sum-exp is what allows flash-attention to be applied in a blockwise manner (i.e. without it, it wouldn't be possible to use flash-attn to implement ring-attn). See ring_flash_attn/utils.py#L19-L21.
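In other words, the LSE is what lets two partial outputs over disjoint key blocks be rescaled into one. A sketch of that standard combine follows (the linked utils.py lines are the actual implementation; this is only the formula, assuming out has shape (..., seqlen, headdim) and lse has shape (..., seqlen)).

```python
# Standard log-sum-exp combine of two partial attention results; shapes are
# assumed as out: (..., seqlen, headdim), lse: (..., seqlen).
import torch

def merge(out, lse, block_out, block_lse):
    if out is None:  # first block: nothing to combine yet
        return block_out, block_lse
    new_lse = torch.logaddexp(lse, block_lse)  # log(exp(lse) + exp(block_lse))
    out = (
        torch.exp(lse - new_lse)[..., None] * out
        + torch.exp(block_lse - new_lse)[..., None] * block_out
    )
    return out, new_lse
```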

@apaz-cli

apaz-cli commented Mar 4, 2024

Ah, alright. That's what I'm missing. Makes sense :)
