
feat(attention_forward.cu): Gentle introduction to CuTe(cutlass) #233

Open · wants to merge 5 commits into master from feat/intro-to-flash-attention-with-cute
Conversation

@FeSens (Contributor) commented Apr 23, 2024

This is a very, very gentle introduction to Flash Attention 2 with CuTe (Cutlass v3).
It's gentle because it's not finished.

What I've got so far:

  • Work partitioned across query blocks, batches, and heads (matching Flash Attention 2's partitioning, to the best of my knowledge);
  • Efficient copying of Q and K tiles using CuTe;
  • CuTe primitives for the matrix multiply (gemm) and scalar multiplication (axpby); a rough sketch of these pieces is included below.

I'm converting this to a full PR because it may help people who want to start working with CuTe, and it may be less scary than jumping headfirst into a full flash attention implementation.
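To make those bullets concrete, here is a minimal sketch in the style of the CuTe sgemm tutorial. It is my own illustrative code, not the PR's kernel: `qk_tile_kernel`, the `BLOCK_M`/`BLOCK_N`/`HEAD_DIM` values, and the 16x16 thread layout are all assumptions. One CUDA block stages a Q tile and a K tile in shared memory with `cute::copy`, computes its tile of S = QKᵀ with `cute::gemm`, and scales by 1/√d with `cute::axpby`.

```cpp
#include <cute/tensor.hpp>

using namespace cute;

// One CUDA block computes one (BLOCK_M x BLOCK_N) tile of S = (Q K^T) / sqrt(d)
// for a single (batch, head). Assumptions: Q and K are already offset to that
// (batch, head) and stored row-major as (seq_len, HEAD_DIM); S is row-major
// (seq_len, seq_len); seq_len is divisible by BLOCK_M and BLOCK_N; the kernel
// is launched with 16*16 = 256 threads per block and a 2D grid over (M, N) tiles.
template <int BLOCK_M = 64, int BLOCK_N = 64, int HEAD_DIM = 64>
__global__ void qk_tile_kernel(const float* Q, const float* K, float* S, int seq_len)
{
    using X = Underscore;

    // Whole-problem tensors in global memory (head dim is a static extent).
    Tensor mQ = make_tensor(make_gmem_ptr(Q),
                            make_layout(make_shape(seq_len, Int<HEAD_DIM>{}),
                                        make_stride(Int<HEAD_DIM>{}, Int<1>{})));
    Tensor mK = make_tensor(make_gmem_ptr(K),
                            make_layout(make_shape(seq_len, Int<HEAD_DIM>{}),
                                        make_stride(Int<HEAD_DIM>{}, Int<1>{})));
    Tensor mS = make_tensor(make_gmem_ptr(S),
                            make_layout(make_shape(seq_len, seq_len),
                                        make_stride(seq_len, Int<1>{})));

    // The tiles this block owns.
    Tensor gQ = local_tile(mQ, Shape<Int<BLOCK_M>, Int<HEAD_DIM>>{}, make_coord(blockIdx.x, 0));
    Tensor gK = local_tile(mK, Shape<Int<BLOCK_N>, Int<HEAD_DIM>>{}, make_coord(blockIdx.y, 0));
    Tensor gS = local_tile(mS, Shape<Int<BLOCK_M>, Int<BLOCK_N>>{},  make_coord(blockIdx.x, blockIdx.y));

    // Shared-memory staging tensors with fully static layouts
    // (the cute::gemm call below wants static shapes for sQ and sK).
    __shared__ float smemQ[BLOCK_M * HEAD_DIM];
    __shared__ float smemK[BLOCK_N * HEAD_DIM];
    Tensor sQ = make_tensor(make_smem_ptr(smemQ), make_layout(make_shape(Int<BLOCK_M>{}, Int<HEAD_DIM>{})));
    Tensor sK = make_tensor(make_smem_ptr(smemK), make_layout(make_shape(Int<BLOCK_N>{}, Int<HEAD_DIM>{})));

    // 16x16 thread layout: each thread copies its slice of the Q and K tiles.
    auto tThr = make_layout(make_shape(Int<16>{}, Int<16>{}));
    Tensor tQgQ = local_partition(gQ, tThr, threadIdx.x);
    Tensor tQsQ = local_partition(sQ, tThr, threadIdx.x);
    Tensor tKgK = local_partition(gK, tThr, threadIdx.x);
    Tensor tKsK = local_partition(sK, tThr, threadIdx.x);
    copy(tQgQ, tQsQ);
    copy(tKgK, tKsK);
    __syncthreads();

    // Per-thread gemm operands: rows of sQ, rows of sK, and a register accumulator.
    Tensor tCsQ = local_partition(sQ, tThr, threadIdx.x, Step<_1, X>{});  // (THR_M, HEAD_DIM)
    Tensor tCsK = local_partition(sK, tThr, threadIdx.x, Step< X, _1>{}); // (THR_N, HEAD_DIM)
    Tensor tCgS = local_partition(gS, tThr, threadIdx.x, Step<_1, _1>{}); // (THR_M, THR_N)
    Tensor tCrS = make_fragment_like(tCgS);
    clear(tCrS);

    // CuTe's convention here is C(m,n) += sum_k A(m,k) * B(n,k), i.e. S = Q K^T.
    gemm(tCsQ, tCsK, tCrS);

    // Scale by 1/sqrt(d) on the way out with axpby: S = alpha * acc + 0 * S.
    axpby(rsqrtf(float(HEAD_DIM)), tCrS, 0.0f, tCgS);
}
```

A real flash attention kernel would additionally loop over K/V blocks and keep the online-softmax statistics per query row; the sketch only covers the tiling, copy, and gemm pieces listed above.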

I'll be hanging out on the CUDA MODE Discord if anyone wants to pair, or if you have a better understanding of CuTe and want to help =).

@FeSens (Contributor, Author) commented Apr 23, 2024

Ok, I've figured out that I need to make the layouts of sQ and sK static for now if I want to use gemm(sQ, sK, ...).
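For anyone following along, this is roughly what "static" means here (illustrative only, and the 64s are placeholder tile sizes, not the PR's values): the extents must be compile-time `Int<>` constants rather than runtime integers, so the compile-time checks and unrolling inside `cute::gemm` can see them.

```cpp
#include <cute/tensor.hpp>

void layout_example() {
    using namespace cute;
    // Dynamic extents: plain runtime ints; not accepted by the gemm call above for now.
    auto sQ_dynamic = make_layout(make_shape(64, 64));
    // Static extents: Int<64>{} is a compile-time constant, which is what
    // gemm(sQ, sK, ...) wants for its shared-memory operands.
    auto sQ_static  = make_layout(make_shape(Int<64>{}, Int<64>{}));
    (void)sQ_dynamic; (void)sQ_static;
}
```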

@ericauld

Is it compiling now?

@FeSens (Contributor, Author) commented Apr 24, 2024

It is! Not at the performance I'd like yet, but it definitely compiles.
Thanks for the CUTLASS class, @ericauld!
Any tips on how to speed up this kernel?

@FeSens force-pushed the feat/intro-to-flash-attention-with-cute branch from 8ffa323 to 0b5ae59 on April 24, 2024 at 07:56
@FeSens changed the title from "feat(attention_forward.cu): Gentle introduction to flash attention 2" to "feat(attention_forward.cu): Gentle introduction to cutlass" on April 24, 2024
@FeSens changed the title from "feat(attention_forward.cu): Gentle introduction to cutlass" to "feat(attention_forward.cu): Gentle introduction to CuTe(cutlass)" on April 24, 2024
@FeSens marked this pull request as ready for review on April 24, 2024 at 08:03
@FeSens force-pushed the feat/intro-to-flash-attention-with-cute branch from c4efb39 to 2dbf745 on April 24, 2024 at 15:43
@karpathy (Owner)

@FeSens can you post what kind of perf you're seeing for this?

@FeSens (Contributor, Author) commented Apr 24, 2024

It's still far from cuBLAS.
I'm working on getting the thread partitioning right before moving on to the other parts needed for flash attention.

| Kernel                      | Best time (ms) |
| --------------------------- | -------------- |
| attention_query_key_kernel2 | 54.376431      |
| cublasSgemmStridedBatched   | 4.497695       |

Once this part is at ~90% of cuBLAS's speed, we will probably see further improvements from implementing the missing flash attention pieces.

@FeSens (Contributor, Author) commented Apr 25, 2024

This is now at 65% of the speed of cuBLAS's cublasSgemmStridedBatched.

@ngc92 (Contributor) commented Apr 25, 2024

I believe you're running into the same trap I did: we currently don't enable tensor cores in this dev file, so cublasSgemmStridedBatched is much slower here than it will be in the real model. That 65% is measured against a handicapped baseline; against tensor-core cuBLAS the number would be lower.

@FeSens (Contributor, Author) commented Apr 25, 2024

@ngc92 Is that because of the variable types we're using, or do we need to turn on a flag explicitly?
My plan for today is to give the gemm() template function shapes that have tensor-op (MMA) support, so that we benefit from tensor cores once we change the variable types.

@ngc92 (Contributor) commented Apr 25, 2024

It's a flag that needs to be set for cuBLAS.
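For reference, this is presumably the knob in question (an assumption on my part; I haven't checked which mode the dev file's benchmark harness actually sets): opting cuBLAS into TF32 tensor-core math for FP32 GEMMs via `cublasSetMathMode`.

```cpp
#include <cublas_v2.h>

// Hypothetical snippet: switch an existing cuBLAS handle to TF32 tensor-core
// math so FP32 GEMMs such as cublasSgemmStridedBatched can use tensor cores.
void enable_tensor_cores(cublasHandle_t handle) {
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);
}
```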
