
Flash attention 2 support ? #2027

Closed
jorgeantonio21 opened this issue Apr 7, 2024 · 2 comments

@jorgeantonio21
Contributor

Does candle currently support flash attention 2 (as in https://arxiv.org/abs/2307.08691)? If not, how desirable is this at the moment? If desirable, any suggestions on how to start working on this feature?

I am also interested in knowing whether there are future plans to eventually support the Blockwise Parallel Transformer (https://arxiv.org/abs/2305.19370), as well as Ring Attention (https://arxiv.org/abs/2310.01889), in candle.

@LaurentMazare
Collaborator

Flash attention v2 is already supported in candle via the flash-attn feature flag and the candle-flash-attn directory; it's available for a number of the examples.
Besides this, candle is extensible, so it should be easy to plug in whichever accelerated layers you care about. You can find examples of this in text-embeddings-inference, which achieves pretty high performance by leveraging custom kernels.
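For anyone landing here later, a minimal sketch of calling the fused kernel directly (not taken from the repo; it assumes the `candle_flash_attn::flash_attn` entry point, a CUDA device, and f16 tensors in `(batch, seq_len, num_heads, head_dim)` layout; the exact signature and expected layout may differ between versions):

```rust
// Cargo.toml sketch (versions are placeholders):
//   candle-core = "0.4"
//   candle-flash-attn = "0.4"   # needs a CUDA toolchain to build
use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    let device = Device::new_cuda(0)?;
    let (batch, seq_len, num_heads, head_dim) = (1, 128, 8, 64);

    // Random q/k/v just to exercise the kernel; flash-attn wants f16/bf16 on CUDA.
    let q = Tensor::randn(0f32, 1., (batch, seq_len, num_heads, head_dim), &device)?
        .to_dtype(DType::F16)?;
    let k = q.clone();
    let v = q.clone();

    let softmax_scale = 1. / (head_dim as f32).sqrt();
    // Fused flash-attention v2 kernel; `true` enables causal masking.
    let out = candle_flash_attn::flash_attn(&q, &k, &v, softmax_scale, true)?;
    println!("{:?}", out.dims());
    Ok(())
}
```

For the bundled examples, the same path is exercised by building with the feature flag, e.g. something along the lines of `cargo run --example llama --release --features flash-attn` (check the example's README for the exact invocation).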

@jorgeantonio21
Contributor Author

This is great, thank you for the explanation @LaurentMazare!
