cuDNN Flash Attention Forward & Backwards BF16 (+35% performance) #322
Benchmarked on an RTX 4090 with BF16 and a batch size of 24.
In the future we'd ideally want to implement our own version of this and avoid requiring cuDNN for maximum performance, but for now it allows the GPU to go brrrrrrrrrrrrrrrrr! :)
This is currently on by default via #define ENABLE_CUDNN at the top of train_gpt2.cu; it should probably move to a Makefile flag and default to off. Using cudnn-backend directly instead of cudnn-frontend could potentially reduce compile times, but that would be a lot of work, and the frontend is what NVIDIA recommends these days.
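For reference, a minimal sketch of what the toggle looks like (the surrounding code here is assumed, not copied from the PR):

```cuda
// Top of train_gpt2.cu (sketch). Moving this to the Makefile would mean
// deleting the #define and passing -DENABLE_CUDNN to nvcc instead.
#define ENABLE_CUDNN

#ifdef ENABLE_CUDNN
#include <cudnn_frontend.h>  // header-only C++ frontend recommended by NVIDIA
#endif
```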
There are 11 `#if(n)def ENABLE_CUDNN` lines in train_gpt2.cu; a typical guard is sketched below.
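As a rough illustration (not the exact code from this PR; the function names and argument lists here are hypothetical), each guard switches between the cuDNN path and the existing hand-written kernels:

```cuda
// Hypothetical sketch of the guard pattern; the real call sites and
// signatures in train_gpt2.cu may differ.
#ifdef ENABLE_CUDNN
    // cuDNN flash attention path (fused BF16 forward kernel)
    attention_forward_cudnn(l_atty, l_att_stats, l_qkvr, B, T, NH, C);
#else
    // existing hand-written attention kernels
    attention_forward(l_atty, l_qkvr, l_att, l_qkv, B, T, C, NH);
#endif
```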
There are also 2 ifdefs in each of test_gpt2.cu and profile_gpt2.cu, just to create/destroy the cuDNN handle and workspace memory.
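Something along these lines, assuming error-checking macros in the style of llm.c's cudaCheck (the cudnnCheck macro and the workspace size are assumptions):

```cuda
#ifdef ENABLE_CUDNN
static cudnnHandle_t cudnn_handle;
static void* cudnn_workspace = NULL;
static size_t cudnn_workspace_size = 32 * 1024 * 1024; // hypothetical size

void create_cudnn() {
    cudnnCheck(cudnnCreate(&cudnn_handle));  // cuDNN library context
    cudaCheck(cudaMalloc(&cudnn_workspace, cudnn_workspace_size));
}

void destroy_cudnn() {
    cudaCheck(cudaFree(cudnn_workspace));
    cudnnCheck(cudnnDestroy(cudnn_handle));
}
#endif
```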
The /dev/cuda/attention_backward.cu implementation is currently missing (only the forward pass is in /dev/cuda/), but it should be easy for someone else to add if needed, and hopefully this isn't a blocker to integrating the PR.