feat(attention_forward.cu): Gentle introduction to CuTe(cutlass) #233
Conversation
Ok, I've figured out that I need to make the layouts of sQ and sK static for now if I want to use gemm(sQ, sK, ...).
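For anyone following along, here's a minimal sketch of the static-vs-dynamic layout distinction in CuTe (the 64x64 tile size and row-major strides are illustrative assumptions, not necessarily this PR's actual values):

```cpp
#include <cute/tensor.hpp>
using namespace cute;

int main() {
  // Dynamic layout: extents are runtime values, so cute::gemm cannot
  // specialize on the tile shape at compile time.
  auto sQ_dynamic = make_layout(make_shape(64, 64));

  // Static layout: every extent is an Int<N> known at compile time,
  // which is what gemm(sQ, sK, ...) needs to dispatch a matching MMA.
  // Tile size and strides here are assumed for illustration.
  auto sQ_static = make_layout(make_shape(Int<64>{}, Int<64>{}),
                               make_stride(Int<64>{}, Int<1>{}));

  print(sQ_dynamic); print("\n");
  print(sQ_static);  print("\n");
}
```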
Is it compiling now?
It is! Not at the performance I wish it had, but it definitely is compiling.
@FeSens can you post what kind of perf you're seeing for this?
It's still far from cuBLAS.
Once this part reaches 90% of the speed of cuBLAS, we will probably see further improvements from implementing the missing parts.
This is at 65% of the speed of cuBLAS.
I believe you're running into the same trap I did: we currently don't enable tensor cores in this dev file, so the cuBLAS baseline isn't running as fast as it could.
@ngc92 Is this because of the variable types we are using, or do we need to turn on a flag explicitly?
It's a flag that needs to be set for cuBLAS.
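For reference, a minimal sketch of setting that flag (assuming the FP32 path, where TF32 tensor-core math is the relevant mode; the handle setup is illustrative, not this repo's exact code):

```cpp
#include <cublas_v2.h>

int main() {
  cublasHandle_t handle;
  cublasCreate(&handle);
  // Opt this handle into TF32 tensor-core math. Without this, FP32
  // GEMMs run on the regular FP32 pipeline and skip tensor cores.
  cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);
  // ... cublasSgemm / cublasGemmEx calls made with this handle can
  // now use TF32 tensor cores ...
  cublasDestroy(handle);
}
```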
This is a very, very gentle introduction to Flash Attention 2 with CuTe (Cutlass v3).
It's gentle because it's not finished.
What I've got so far:
I'm converting this to a full PR because it may help people who want to start working with CuTe, and it may be less scary than jumping headfirst into a full flash attention implementation.
I will be hanging out on the CudaMode Discord if anyone wants to pair, or has a better understanding of CuTe and wants to help =).