
Consider using PyTorch 2.0 version of FlashAttention (remove dependency on flash-attn) #103

Closed
Sciumo opened this issue May 11, 2023 · 2 comments

Sciumo commented May 11, 2023

Per: Dao-AILab/flash-attention#203

> If you're using PyTorch 2.0, then FlashAttention is already available through torch.nn.functional.scaled_dot_product_attention.
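
For reference, this is roughly what the PyTorch 2.0 path looks like (a minimal sketch, assuming a CUDA device and fp16 inputs; shapes are illustrative):

```python
import torch
import torch.nn.functional as F

# With no attn_mask, scaled_dot_product_attention can dispatch to the
# fused FlashAttention kernel on supported hardware/dtypes.
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)  # (batch, heads, seq, head_dim)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Optionally restrict dispatch to the flash backend to check it is usable;
# the call errors if no enabled backend supports these inputs.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```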

The flash-attn project has build problems for many people.
Would you consider using the PyTorch 2.0 equivalent of FlashAttention instead?

vchiley self-assigned this May 11, 2023

vchiley commented May 11, 2023

For MPT we need to be able to use causal=True, and we also need attn_mask (aka attn_bias) to support ALiBi.

The variant exposed in the scaled_dot_product_attention docs does not allow both. From the docs:
[Screenshot of the scaled_dot_product_attention documentation, noting that an error is raised if both attn_mask and is_causal are set.]
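
Concretely, a rough sketch of the constraint (assuming PyTorch 2.0; attn_with_alibi and alibi_bias are hypothetical names for illustration). The ALiBi bias and the causal structure can be folded into a single additive attn_mask, but passing an arbitrary float mask means the call will not use the FlashAttention backend:

```python
import torch
import torch.nn.functional as F

def attn_with_alibi(q, k, v, alibi_bias):
    # q, k, v: (batch, n_heads, seq, head_dim)
    # alibi_bias: precomputed (1, n_heads, seq, seq) float tensor of ALiBi slopes * distances
    seq_len = q.size(-2)

    # What we'd like, but PyTorch 2.0 errors when both attn_mask and is_causal are set:
    # out = F.scaled_dot_product_attention(q, k, v, attn_mask=alibi_bias, is_causal=True)

    # Workaround: fold the causal structure into the additive mask ourselves.
    causal = torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device).tril()
    mask = alibi_bias.masked_fill(~causal, float("-inf"))

    # An arbitrary float mask falls back to the math / memory-efficient backends,
    # not the FlashAttention kernel.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```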


Sciumo commented May 11, 2023

So this issue needs to propagate upstream to PyTorch.
Until then, I'll work with flash-attn as is.
Thanks.

Sciumo closed this as completed May 11, 2023
bmosaicml pushed a commit that referenced this issue Jun 6, 2023
Single-line typo fix