Optimisations for layernorm_backward / matmul_backward_bias / fused_classifier #378
Conversation
… matmul_backward_bias.
… + add missing common.h changes
(The CI issue is spurious; I pushed a fix to master. Will try out the speed of this PR tomorrow, assuming it is a bit faster.)
…(hacky -> better way?)
It's +4% on A100, which is more than I expected! (Quite a bit less on RTX 4090.) Locking GPU clocks at 1275MHz on an A100 SXM4 40GB:

BEFORE: step 2/12: train loss 4.591615 (acc 4.591615) (127.596542 ms, 192607.046875 tok/s)

This is the full "after" report from profile_gpt2cu.py:
For comparison, here's the before:
At one point I managed to get layernorm an extra 20% faster on top of this by playing with occupancy settings and cache streaming hints, but I can't remember the exact settings and can't seem to replicate it anymore... it might be worth looking into that and how to generalise it at some point, but it's already so close to peak after these changes that it's very much diminishing returns.

After this, pretty much the only obvious things left on A100 are:
Confirmed, I also saw a ~5% lift on my end, very cool!!
These are fairly difficult optimisations to describe, hopefully the comments are helpful/enough! I'd focus on the changes in train_gpt2.cu rather than the similar ones in /dev/cuda/ (I didn't include a dev version of the new bias kernel, that file is very out of date and needs other changes).
layernorm_backward needed x128, but with the additional complexity that atomics are normally 32-bit rather than 128-bit, so naively implementing this resulted in an 8-way bank conflict and terrible performance! It required doing everything in a bank-friendly order, then reordering before the final write to global memory. This is kind of an annoying side-effect of x128; I think on Hopper there is a native 128-bit atomicAdd, though.
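For readers less familiar with llm.c internals: x128 is its 128-bit packed load/store helper (a typedef of its Packed128 struct). The snippet below is only a simplified sketch of the idea, not the repo's exact code; the names here are illustrative.

```cuda
// Simplified sketch of the x128 / Packed128 idea (not the exact llm.c definition):
// a 16-byte-aligned packet so each load/store moves 128 bits at once, i.e. 4 floats
// or 8 bf16 values per memory instruction. Pointers passed to load128 / store128
// must be 16-byte aligned.
template<class T>
struct alignas(16) Packet128 {
    static constexpr int size = 16 / sizeof(T);     // elements per 128-bit packet
    T payload[size];
    __device__ T& operator[](int i) { return payload[i]; }
    __device__ const T& operator[](int i) const { return payload[i]; }
};

// load/store a full packet with a single 128-bit instruction
template<class T>
__device__ Packet128<T> load128(const T* address) {
    return *reinterpret_cast<const Packet128<T>*>(address);
}

template<class T>
__device__ void store128(T* address, Packet128<T> value) {
    *reinterpret_cast<Packet128<T>*>(address) = value;
}
```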
matmul_backward_bias is roughly the same story, except that it was already the bottleneck in kernel6 and needed fixing... it was limited by shared memory bank conflicts.
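A hand-wavy sketch of that shared-memory trick follows; it is not either of the PR's kernels, and it assumes 4-element float4 packets (the 8-way conflict mentioned above corresponds to 8-element packets). The kernel name, shapes, and the single-block launch are illustrative only.

```cuda
// Illustrative sketch, not the PR's kernel: one block sums the columns of an (N x C)
// matrix. Partial sums go into shared memory via 32-bit atomicAdd, but the scratch
// buffer is indexed "transposed" (k * C4 + j instead of 4 * j + k), so consecutive
// lanes hit consecutive shared-memory addresses, i.e. distinct banks (cleanest when
// C/4 is a multiple of the 32 banks). Only at the end is the result put back into
// channel order for a single 128-bit (float4) store per packet.
// Launch as column_sum_x128_sketch<<<1, block_size, C * sizeof(float)>>>(...),
// with block_size a multiple of C/4 and C a multiple of 4.
__global__ void column_sum_x128_sketch(float4* out, const float* in, int N, int C) {
    extern __shared__ float scratch[];     // C floats, stored in the transposed layout
    const int C4 = C / 4;                  // number of float4 packets per row
    const int j  = threadIdx.x % C4;       // which packet this thread works on

    // zero the scratch buffer cooperatively
    for (int i = threadIdx.x; i < C; i += blockDim.x) { scratch[i] = 0.0f; }
    __syncthreads();

    // accumulate over rows in the bank-friendly order
    for (int row = threadIdx.x / C4; row < N; row += blockDim.x / C4) {
        const float* x = in + row * C + 4 * j;
        for (int k = 0; k < 4; k++) {
            atomicAdd(&scratch[k * C4 + j], x[k]);
        }
    }
    __syncthreads();

    // reorder back to natural channel order and write 16 bytes at a time
    if (threadIdx.x < C4) {
        float4 packed;
        packed.x = scratch[0 * C4 + threadIdx.x];
        packed.y = scratch[1 * C4 + threadIdx.x];
        packed.z = scratch[2 * C4 + threadIdx.x];
        packed.w = scratch[3 * C4 + threadIdx.x];
        out[threadIdx.x] = packed;         // a real kernel would also combine across blocks
    }
}
```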
The fused classifier changes are 4 separate optimisations:
(Plus a fix to blockReduce: its out_of_bounds fill value needed to be -FLT_MAX rather than the default 0.0f when reducing with max.)
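A minimal sketch of why that fill value matters (this is not llm.c's actual blockReduce, which is templated over the reduction op): threads that have no valid element must contribute the identity of the reduction, and for a max over values that can all be negative that identity is -FLT_MAX, not 0.0f.

```cuda
// Sketch of a block-wide max reduction with an explicit out-of-bounds fill value.
// Assumes blockDim.x is a multiple of 32.
#include <float.h>

__device__ float warp_reduce_max(float val) {
    for (int offset = 16; offset > 0; offset /= 2) {
        val = fmaxf(val, __shfl_xor_sync(0xffffffffu, val, offset));
    }
    return val;
}

__device__ float block_reduce_max(float val, float out_of_bounds = -FLT_MAX) {
    __shared__ float shared[32];                    // one partial result per warp
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;

    val = warp_reduce_max(val);                     // reduce within each warp
    if (lane == 0) { shared[warp] = val; }
    __syncthreads();

    int num_warps = blockDim.x / 32;
    // lanes beyond the number of warps must contribute the identity, not 0.0f,
    // otherwise an all-negative block would incorrectly reduce to 0.0f
    val = (lane < num_warps) ? shared[lane] : out_of_bounds;
    return warp_reduce_max(val);                    // every thread ends up with the block max
}

// example use: per-row max where C may not be a multiple of blockDim.x
__global__ void row_max_kernel(float* out, const float* in, int C) {
    float v = -FLT_MAX;                             // identity for threads with no element
    for (int i = threadIdx.x; i < C; i += blockDim.x) {
        v = fmaxf(v, in[blockIdx.x * C + i]);
    }
    v = block_reduce_max(v);
    if (threadIdx.x == 0) { out[blockIdx.x] = v; }
}
```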