
feat(attention_forward.cu): Gentle introduction to CuTe(cutlass) #233

Open · wants to merge 5 commits into master from feat/intro-to-flash-attention-with-cute
Conversation

@FeSens (Contributor) commented Apr 23, 2024

This is a very, very gentle introduction to Flash Attention 2 with CuTe (Cutlass v3).
It's gentle because it's not finished.

What I've got so far:

  • Work partitioned across query blocks, batches, and heads (matching Flash Attention 2's partitioning, to the best of my knowledge);
  • Efficient copying of Q and K tiles using CuTe;
  • CuTe primitives for the matrix multiply (gemm) and scalar multiplication (axpby); a rough sketch of these pieces is included below.

I'm converting this to a full PR because it may help people who want to start working with CuTe, and it may be less scary than jumping headfirst into a full flash attention implementation.
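To make those bullets concrete, here is a minimal sketch in the style of the CuTe sgemm tutorial. It is my own illustrative code, not the PR's kernel: `qk_tile_kernel`, the `BLOCK_M`/`BLOCK_N`/`HEAD_DIM` values, and the 16x16 thread layout are all assumptions. One CUDA block stages a Q tile and a K tile in shared memory with `cute::copy`, computes its tile of S = QKᵀ with `cute::gemm`, and scales by 1/√d with `cute::axpby`.

```cpp
#include <cute/tensor.hpp>

using namespace cute;

// One CUDA block computes one (BLOCK_M x BLOCK_N) tile of S = (Q K^T) / sqrt(d)
// for a single (batch, head). Assumptions: Q and K are already offset to that
// (batch, head) and stored row-major as (seq_len, HEAD_DIM); S is row-major
// (seq_len, seq_len); seq_len is divisible by BLOCK_M and BLOCK_N; the kernel
// is launched with 16*16 = 256 threads per block and a 2D grid over (M, N) tiles.
template <int BLOCK_M = 64, int BLOCK_N = 64, int HEAD_DIM = 64>
__global__ void qk_tile_kernel(const float* Q, const float* K, float* S, int seq_len)
{
    using X = Underscore;

    // Whole-problem tensors in global memory (head dim is a static extent).
    Tensor mQ = make_tensor(make_gmem_ptr(Q),
                            make_layout(make_shape(seq_len, Int<HEAD_DIM>{}),
                                        make_stride(Int<HEAD_DIM>{}, Int<1>{})));
    Tensor mK = make_tensor(make_gmem_ptr(K),
                            make_layout(make_shape(seq_len, Int<HEAD_DIM>{}),
                                        make_stride(Int<HEAD_DIM>{}, Int<1>{})));
    Tensor mS = make_tensor(make_gmem_ptr(S),
                            make_layout(make_shape(seq_len, seq_len),
                                        make_stride(seq_len, Int<1>{})));

    // The tiles this block owns.
    Tensor gQ = local_tile(mQ, Shape<Int<BLOCK_M>, Int<HEAD_DIM>>{}, make_coord(blockIdx.x, 0));
    Tensor gK = local_tile(mK, Shape<Int<BLOCK_N>, Int<HEAD_DIM>>{}, make_coord(blockIdx.y, 0));
    Tensor gS = local_tile(mS, Shape<Int<BLOCK_M>, Int<BLOCK_N>>{},  make_coord(blockIdx.x, blockIdx.y));

    // Shared-memory staging tensors with fully static layouts
    // (the cute::gemm call below wants static shapes for sQ and sK).
    __shared__ float smemQ[BLOCK_M * HEAD_DIM];
    __shared__ float smemK[BLOCK_N * HEAD_DIM];
    Tensor sQ = make_tensor(make_smem_ptr(smemQ), make_layout(make_shape(Int<BLOCK_M>{}, Int<HEAD_DIM>{})));
    Tensor sK = make_tensor(make_smem_ptr(smemK), make_layout(make_shape(Int<BLOCK_N>{}, Int<HEAD_DIM>{})));

    // 16x16 thread layout: each thread copies its slice of the Q and K tiles.
    auto tThr = make_layout(make_shape(Int<16>{}, Int<16>{}));
    Tensor tQgQ = local_partition(gQ, tThr, threadIdx.x);
    Tensor tQsQ = local_partition(sQ, tThr, threadIdx.x);
    Tensor tKgK = local_partition(gK, tThr, threadIdx.x);
    Tensor tKsK = local_partition(sK, tThr, threadIdx.x);
    copy(tQgQ, tQsQ);
    copy(tKgK, tKsK);
    __syncthreads();

    // Per-thread gemm operands: rows of sQ, rows of sK, and a register accumulator.
    Tensor tCsQ = local_partition(sQ, tThr, threadIdx.x, Step<_1, X>{});  // (THR_M, HEAD_DIM)
    Tensor tCsK = local_partition(sK, tThr, threadIdx.x, Step< X, _1>{}); // (THR_N, HEAD_DIM)
    Tensor tCgS = local_partition(gS, tThr, threadIdx.x, Step<_1, _1>{}); // (THR_M, THR_N)
    Tensor tCrS = make_fragment_like(tCgS);
    clear(tCrS);

    // CuTe's convention here is C(m,n) += sum_k A(m,k) * B(n,k), i.e. S = Q K^T.
    gemm(tCsQ, tCsK, tCrS);

    // Scale by 1/sqrt(d) on the way out with axpby: S = alpha * acc + 0 * S.
    axpby(rsqrtf(float(HEAD_DIM)), tCrS, 0.0f, tCgS);
}
```

A real flash attention kernel would additionally loop over K/V blocks and keep the online-softmax statistics per query row; the sketch only covers the tiling, copy, and gemm pieces listed above.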

I'll be hanging out on the CUDA MODE Discord if anyone wants to pair, or if you have a better understanding of CuTe and want to help =).

@FeSens (Contributor, Author) commented Apr 23, 2024

Ok, I've figured out that I need to make the layouts of sQ and sK static for now if I want to use gemm(sQ, sK, ...).
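For anyone following along, this is roughly what "static" means here (illustrative only, and the 64s are placeholder tile sizes, not the PR's values): the extents must be compile-time `Int<>` constants rather than runtime integers, so the compile-time checks and unrolling inside `cute::gemm` can see them.

```cpp
#include <cute/tensor.hpp>

void layout_example() {
    using namespace cute;
    // Dynamic extents: plain runtime ints; not accepted by the gemm call above for now.
    auto sQ_dynamic = make_layout(make_shape(64, 64));
    // Static extents: Int<64>{} is a compile-time constant, which is what
    // gemm(sQ, sK, ...) wants for its shared-memory operands.
    auto sQ_static  = make_layout(make_shape(Int<64>{}, Int<64>{}));
    (void)sQ_dynamic; (void)sQ_static;
}
```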

@ericauld

Is it compiling now?

@FeSens (Contributor, Author) commented Apr 24, 2024

It is! Not at the performance I'd like yet, but it definitely compiles.
Thanks for the CUTLASS class, @ericauld!
Any tips on how to speed up this kernel?

@FeSens force-pushed the feat/intro-to-flash-attention-with-cute branch from 8ffa323 to 0b5ae59 on April 24, 2024 at 07:56
@FeSens changed the title from "feat(attention_forward.cu): Gentle introduction to flash attention 2" to "feat(attention_forward.cu): Gentle introduction to cutlass" on April 24, 2024
@FeSens changed the title from "feat(attention_forward.cu): Gentle introduction to cutlass" to "feat(attention_forward.cu): Gentle introduction to CuTe(cutlass)" on April 24, 2024
@FeSens marked this pull request as ready for review on April 24, 2024 at 08:03
@FeSens force-pushed the feat/intro-to-flash-attention-with-cute branch from c4efb39 to 2dbf745 on April 24, 2024 at 15:43
@karpathy (Owner)

@FeSens can you post what kind of perf you're seeing for this?

@FeSens (Contributor, Author) commented Apr 24, 2024

It's still far from cuBLAS.
I'm working on getting the thread partitioning right before moving on to the other parts needed for flash attention.

| Kernel                      | Best time (ms) |
| --------------------------- | -------------- |
| attention_query_key_kernel2 | 54.376431      |
| cublasSgemmStridedBatched   | 4.497695       |

Once this part is at ~90% of cuBLAS's speed, we will probably see further improvements from implementing the missing flash attention pieces.

@FeSens (Contributor, Author) commented Apr 25, 2024

This is now at 65% of the speed of cuBLAS's cublasSgemmStridedBatched.

@ngc92 (Contributor) commented Apr 25, 2024

I believe you're running into the same trap I did: we currently don't enable tensor cores in this dev file, so cublasSgemmStridedBatched is much slower here than it will be in the real model. That 65% is measured against a handicapped baseline; against tensor-core cuBLAS the number would be lower.

@FeSens (Contributor, Author) commented Apr 25, 2024

@ngc92 Is that because of the variable types we're using, or do we need to turn on a flag explicitly?
My plan for today is to give the gemm() template function shapes that have tensor-op (MMA) support, so that we benefit from tensor cores once we change the variable types.

@ngc92 (Contributor) commented Apr 25, 2024

It's a flag that needs to be set for cuBLAS.
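For reference, this is presumably the knob in question (an assumption on my part; I haven't checked which mode the dev file's benchmark harness actually sets): opting cuBLAS into TF32 tensor-core math for FP32 GEMMs via `cublasSetMathMode`.

```cpp
#include <cublas_v2.h>

// Hypothetical snippet: switch an existing cuBLAS handle to TF32 tensor-core
// math so FP32 GEMMs such as cublasSgemmStridedBatched can use tensor cores.
void enable_tensor_cores(cublasHandle_t handle) {
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);
}
```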
