cuDNN Flash Attention Forward & Backwards BF16 (+35% performance) #322
Benchmarked on an RTX 4090 with BF16 and a batch size of 24.
In the future we'd ideally want to implement our own version of this and avoid requiring cuDNN for maximum performance, but for now it allows the GPU to go brrrrrrrrrrrrrrrrr! :)
This is currently on by default via #define ENABLE_CUDNN at the top of train_gpt2.cu; it should probably move to a Makefile flag and default to off. Using cudnn-backend directly instead of cudnn-frontend could potentially reduce compile times, but that would be a lot of work, and the frontend is what NVIDIA recommends these days.
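For reference, a minimal sketch of what the toggle looks like (the surrounding code here is assumed, not copied from the PR):

```cuda
// Top of train_gpt2.cu (sketch). Moving this to the Makefile would mean
// deleting the #define and passing -DENABLE_CUDNN to nvcc instead.
#define ENABLE_CUDNN

#ifdef ENABLE_CUDNN
#include <cudnn_frontend.h>  // header-only C++ frontend recommended by NVIDIA
#endif
```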
There are 11 `#if(n)def ENABLE_CUDNN` lines in train_gpt2.cu; a typical guard is sketched below.
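As a rough illustration (not the exact code from this PR; the function names and argument lists here are hypothetical), each guard switches between the cuDNN path and the existing hand-written kernels:

```cuda
// Hypothetical sketch of the guard pattern; the real call sites and
// signatures in train_gpt2.cu may differ.
#ifdef ENABLE_CUDNN
    // cuDNN flash attention path (fused BF16 forward kernel)
    attention_forward_cudnn(l_atty, l_att_stats, l_qkvr, B, T, NH, C);
#else
    // existing hand-written attention kernels
    attention_forward(l_atty, l_qkvr, l_att, l_qkv, B, T, C, NH);
#endif
```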
There are also 2 ifdefs in each of test_gpt2.cu and profile_gpt2.cu, just to create/destroy the cuDNN handle and workspace memory.
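Something along these lines, assuming error-checking macros in the style of llm.c's cudaCheck (the cudnnCheck macro and the workspace size are assumptions):

```cuda
#ifdef ENABLE_CUDNN
static cudnnHandle_t cudnn_handle;
static void* cudnn_workspace = NULL;
static size_t cudnn_workspace_size = 32 * 1024 * 1024; // hypothetical size

void create_cudnn() {
    cudnnCheck(cudnnCreate(&cudnn_handle));  // cuDNN library context
    cudaCheck(cudaMalloc(&cudnn_workspace, cudnn_workspace_size));
}

void destroy_cudnn() {
    cudaCheck(cudaFree(cudnn_workspace));
    cudnnCheck(cudnnDestroy(cudnn_handle));
}
#endif
```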
The /dev/cuda/attention_backward.cu implementation is currently missing (only the forward pass is in /dev/cuda/), but it should be easy for someone else to add if needed, and hopefully this isn't a blocker to integrating the PR.