Replies: 1 comment
Where are the cutlass calls coming from? The current code seems to only use cublas.
[April 22, 2024]
I will post here once in a while on where the code stands, focusing especially on the mainline CUDA code. These results can be reproduced by running

python profile_gpt2cu.py

(if you get a crash, add sudo). A minimal end-to-end invocation is sketched below.
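As a rough sketch of that workflow (the `make train_gpt2cu` build target and the reason for the sudo fallback are my assumptions, not from the post):

```bash
# Sketch: build the mainline CUDA binary, then run the profiler script.
make train_gpt2cu          # assumed build target from the llm.c Makefile
python profile_gpt2cu.py
# If this crashes (commonly a GPU performance-counter permissions
# error), escalate as the post suggests:
sudo python profile_gpt2cu.py
```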
runtime, DRAM traffic, instructions

We are spending 76% of the runtime in NVIDIA cutlass kernels, which is encouraging. This was run on an A10. On my A100 we are currently at ~73ms/iteration. The PyTorch comparison (fp32, no flash attention, slightly stale PyTorch) is 78.2ms/iteration, so we are ~6.4% faster than PyTorch in this constrained setting.
peak memory

In nvidia-smi we see a nice and constant 8753 MiB; this was heavily optimized by @ngc92. In comparison, the current PyTorch code goes up to 12879 MiB, so we are 32% lower. To reproduce, run the training binary while watching nvidia-smi, e.g. as sketched below.
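A minimal sketch of that check (the exact reproduce command did not survive extraction, so the build/run commands here are assumptions):

```bash
# Sketch: start training in one terminal...
make train_gpt2cu
./train_gpt2cu
# ...and poll GPU memory from a second terminal; usage should sit
# at a constant ~8753 MiB once training steps begin.
watch -n 1 nvidia-smi
```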
lines of code

train_gpt2.cu is at 2097 clean LOC; a quick way to check is sketched below.
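How "clean LOC" was counted is not specified; a plain physical line count is the simplest proxy:

```bash
# Sketch: physical line count of the mainline CUDA file.
wc -l train_gpt2.cu
# Excluding blank lines, for a slightly tighter count:
grep -cve '^[[:space:]]*$' train_gpt2.cu
```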
latency

nvcc compile latency: 2.4s
run latency (from ENTER to the first step): 2.2s
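A sketch of how one might reproduce the compile number (assuming the Makefile's clean and train_gpt2cu targets; run latency is simpler to eyeball):

```bash
# Sketch: time a from-scratch compile of the CUDA binary.
make clean
time make train_gpt2cu
# Run latency is best eyeballed: launch, and note the wall time
# until the first training step prints.
./train_gpt2cu
```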
"big stones" ongoing work:
major merged improvements last few days:
first notable forks appearing