Does candle currently support flash attention 2 (as in https://arxiv.org/abs/2307.08691)? If not, how desirable is this at the moment? If it is desirable, any suggestions on how to start working on this feature?
Flash attention v2 is already supported in candle via the `flash-attn` feature flag and the `candle-flash-attn` crate; it is used by a number of the examples.
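For context, the usual pattern in the examples is to gate the call behind the feature flag, with a plain implementation as fallback. The sketch below is just an illustration of that pattern: the `candle_flash_attn::flash_attn` signature and the (batch, seq_len, num_heads, head_dim) layout shown here should be double-checked against the crate docs for your version, and the fallback skips causal masking for brevity.

```rust
use candle_core::{Result, Tensor};

// Use the flash-attn v2 kernel when the feature is enabled (CUDA only).
#[cfg(feature = "flash-attn")]
fn attention(q: &Tensor, k: &Tensor, v: &Tensor, softmax_scale: f32, causal: bool) -> Result<Tensor> {
    // q, k, v are assumed to be (batch, seq_len, num_heads, head_dim).
    candle_flash_attn::flash_attn(q, k, v, softmax_scale, causal)
}

// Naive fallback: softmax(q k^T * scale) v, with heads moved to dim 1.
#[cfg(not(feature = "flash-attn"))]
fn attention(q: &Tensor, k: &Tensor, v: &Tensor, softmax_scale: f32, _causal: bool) -> Result<Tensor> {
    let (q, k, v) = (q.transpose(1, 2)?, k.transpose(1, 2)?, v.transpose(1, 2)?);
    let att = (q.matmul(&k.transpose(2, 3)?)? * softmax_scale as f64)?;
    let att = candle_nn::ops::softmax_last_dim(&att)?;
    att.matmul(&v)?.transpose(1, 2)
}
```

Building one of the examples with the kernel enabled then looks something like `cargo run --example llama --release --features flash-attn` (requires a CUDA toolchain).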
Besides this, candle is extensible, so it should be easy to plug in the accelerated layers that you care about. You can find examples of this in text-embeddings-inference, which achieves pretty high performance by leveraging custom kernels.
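To illustrate the extensibility point, here is a minimal sketch of plugging a custom layer into a model through candle_nn's `Module` trait. `FusedAttention` and the body of `forward` are hypothetical placeholders for whatever accelerated kernel you want to call; only the trait itself comes from candle.

```rust
use candle_core::{Result, Tensor};
use candle_nn::Module;

/// Hypothetical drop-in replacement for a model's attention block.
struct FusedAttention {
    softmax_scale: f32,
}

impl Module for FusedAttention {
    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
        // Dispatch to an accelerated kernel here (e.g. a hand-written CUDA op);
        // the naive path below just keeps the sketch runnable on CPU.
        // xs is assumed to be (batch, seq_len, hidden_dim).
        let att = (xs.matmul(&xs.transpose(1, 2)?)? * self.softmax_scale as f64)?;
        let att = candle_nn::ops::softmax_last_dim(&att)?;
        att.matmul(xs)
    }
}
```

If I recall correctly, candle_core also exposes custom-op traits (`CustomOp1` and friends) for registering hand-written kernels below the tensor API; check the current docs for the exact trait shape before relying on it.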
I am also interested in knowing whether there are plans to eventually support the Blockwise Parallel Transformer (https://arxiv.org/abs/2305.19370), as well as Ring Attention (https://arxiv.org/abs/2310.01889), in candle.