cuDNN Forward Attention + FP16 non-cuDNN version in /dev/cuda/ #215

Closed · wants to merge 8 commits

Conversation

ademeure (Contributor)

| Kernel | Configuration | Time |
| --- | --- | --- |
| 4 (previous) | FP32 | 1.74 ms |
| 4 | TF32 | 1.70 ms |
| 5 | kernel 4 with BF16 I/O | 0.91 ms |
| 6 | kernel 5 without permute (not realistic) | 0.76 ms |
| 10 | cuDNN BF16, with FP32 conversion | 0.33 ms |
| 11 | cuDNN BF16, direct BF16 inputs | 0.13 ms |

This has been a mess to get working. For example, I wasted 3+ hours before realising that with cuBLASLt, even though the scale type is specified explicitly, alpha/beta need to be FP16 with CUBLAS_COMPUTE_16F and FP32 with CUBLAS_COMPUTE_32F, and there are zero warnings if you get it wrong: it just returns garbage :(
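To make the failure mode concrete, here is a minimal sketch (not code from this PR; the function name and dimensions are made up, error checking omitted) of the rule: `cublasLtMatmul` takes `alpha`/`beta` as `void*` and reads them through the scale type paired with the compute type, so `CUBLAS_COMPUTE_16F`/`CUDA_R_16F` means `__half` scalars while `CUBLAS_COMPUTE_32F`/`CUDA_R_32F` means `float` scalars, with no runtime diagnostic if they mismatch:

```c
#include <cublasLt.h>
#include <cuda_fp16.h>
#include <stddef.h>

// Hypothetical helper: C = A * B for column-major, non-transposed FP16 matrices.
void matmul_fp16(cublasLtHandle_t handle,
                 const __half *A, const __half *B, __half *C,
                 int m, int n, int k) {
    cublasLtMatmulDesc_t desc;
    // Scale type CUDA_R_16F: alpha/beta will be dereferenced as __half.
    cublasLtMatmulDescCreate(&desc, CUBLAS_COMPUTE_16F, CUDA_R_16F);

    cublasLtMatrixLayout_t aL, bL, cL;
    cublasLtMatrixLayoutCreate(&aL, CUDA_R_16F, m, k, m);
    cublasLtMatrixLayoutCreate(&bL, CUDA_R_16F, k, n, k);
    cublasLtMatrixLayoutCreate(&cL, CUDA_R_16F, m, n, m);

    __half alpha = __float2half(1.0f), beta = __float2half(0.0f);
    // float alpha = 1.0f, beta = 0.0f;  // WRONG with 16F compute:
    //                                   // no error, silently wrong results.

    cublasLtMatmul(handle, desc, &alpha, A, aL, B, bL, &beta,
                   C, cL, C, cL, NULL, NULL, 0, 0);

    cublasLtMatrixLayoutDestroy(aL);
    cublasLtMatrixLayoutDestroy(bL);
    cublasLtMatrixLayoutDestroy(cL);
    cublasLtMatmulDescDestroy(desc);
}
```

Switching the descriptor to `CUBLAS_COMPUTE_32F`/`CUDA_R_32F` would flip the requirement: the `__half` scalars above would then be the silent bug and the commented-out `float` ones the correct choice.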

I still haven't managed to get the cuDNN backward pass to give correct results, which means I can't integrate the forward pass as an option for the full training loop in train_gpt2.cu: our current backward pass requires the attention matrix "att", which cuDNN doesn't provide (it produces its own stats tensor instead), unfortunately.
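For context, here is a hedged sketch of what the forward graph looks like with the cudnn-frontend 1.x C++ API (the dimension names B/NH/T/HS and strides are illustrative, not taken from this PR). In training mode (`set_is_inference(false)`), `sdpa()` returns a second FP32 "stats" tensor of shape (B, NH, T, 1) holding the per-row softmax log-sum-exp, and the cuDNN backward consumes that instead of a full (T, T) "att" matrix:

```cpp
#include <cudnn_frontend.h>
namespace fe = cudnn_frontend;

// Sketch only. B = batch, NH = heads, T = sequence length, HS = head size;
// strides assume a (B, T, NH, HS) memory layout.
void build_sdpa_forward(int B, int NH, int T, int HS) {
    fe::graph::Graph graph;
    graph.set_io_data_type(fe::DataType_t::BFLOAT16)
         .set_intermediate_data_type(fe::DataType_t::FLOAT)
         .set_compute_data_type(fe::DataType_t::FLOAT);

    auto make_qkv = [&](const char* name) {
        return graph.tensor(fe::graph::Tensor_attributes()
                                .set_name(name)
                                .set_dim({B, NH, T, HS})
                                .set_stride({NH * T * HS, HS, NH * HS, 1}));
    };
    auto Q = make_qkv("Q");
    auto K = make_qkv("K");
    auto V = make_qkv("V");

    auto options = fe::graph::SDPA_attributes()
                       .set_name("flash_attention")
                       .set_is_inference(false)  // false => also emit stats
                       .set_causal_mask(true);
    // (a real graph would also set the attention scale 1/sqrt(HS); omitted)

    // Training-mode SDPA returns the output O plus the softmax stats.
    // No (T x T) attention matrix exists anywhere in the graph.
    auto [O, stats] = graph.sdpa(Q, K, V, options);
    O->set_output(true);
    stats->set_output(true).set_data_type(fe::DataType_t::FLOAT);
    // The backward graph (graph.sdpa_backward) takes Q, K, V, O, dO and this
    // stats tensor, recomputing the softmax on the fly instead of using "att".
}
```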

dagelf (Contributor) commented Apr 22, 2024

The CI failure isn't because of your commit; it's because HF is taking punishment today 🐳

#217 https://status.huggingface.co/
