Packing for Gelu backwards #306
This PR can't compare the previous kernel with the new one. Also, isn't there a compile bug? The x128 typedef doesn't exist.
Sorry, I must've deleted the definition for x128 while cleaning up my branch. Pushed the correction now.
I updated gelu_backward.cu to match the most recent changes in master for the other .cu files. Updating the other PR now.
}

void gelu_backward2(floatX* dinp, const floatX* inp, const floatX* dout, int N, const int block_size) {
    const int grid_size = ceil_div(N, block_size * x128::size);
Maybe add an assert(N % x128::size == 0) here? It documents the assumption, and we may get a better error message if we later call the kernel incorrectly.
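To make the suggestion concrete, here is a minimal host-side sketch of the grid-size computation with the assert in place. It is not the PR's code: `x128_sketch` is a hypothetical stand-in for `x128` that assumes 32-bit floats (4 elements per 128-bit pack), `ceil_div` is written out under the assumption that it is ordinary integer ceiling division, and `packed_grid_size` is an illustrative name.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the PR's x128 type, assuming float elements:
// 128 bits / 32 bits per float = 4 elements per pack.
struct x128_sketch {
    static constexpr std::size_t size = 16 / sizeof(float);
};

// Integer ceiling division (assumed behavior of the ceil_div helper).
constexpr int ceil_div(int dividend, int divisor) {
    return (dividend + divisor - 1) / divisor;
}

// Number of blocks needed when every thread processes one 128-bit pack.
int packed_grid_size(int N, int block_size) {
    const int pack = static_cast<int>(x128_sketch::size);
    // Documents the assumption that there is no partial pack at the tail.
    assert(N % pack == 0);
    return ceil_div(N, block_size * pack);
}
```

With these assumptions, each block covers `block_size * 4` elements, so `packed_grid_size(1024, 128)` needs 2 blocks. If N were not a multiple of the pack size, the last thread would read or write past the end of the buffers, which is the failure the assert is meant to catch early.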
Update the gelu backward kernel to do packing into 128 bits, and create the gelu backward CUDA file.
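For readers unfamiliar with the idea, the sketch below emulates 128-bit packing on the host in plain C++. It is an illustration under stated assumptions, not the kernel from this PR: `Packed128Sketch`, `load128`, `store128`, `gelu_grad`, and `gelu_backward_packed` are hypothetical names, the element type is assumed to be `float`, and the gradient assumes the tanh approximation of GELU. Each loop iteration corresponds to the work one GPU thread would do on one pack.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstring>

// Hypothetical 128-bit pack, standing in for the x128 type named in the PR.
template <typename T>
struct alignas(16) Packed128Sketch {
    static constexpr std::size_t size = 16 / sizeof(T);
    T payload[size];
    T& operator[](std::size_t i) { return payload[i]; }
    const T& operator[](std::size_t i) const { return payload[i]; }
};

// Copy 16 bytes in one step; on a GPU this would be a single 128-bit load.
template <typename T>
Packed128Sketch<T> load128(const T* address) {
    Packed128Sketch<T> pack;
    std::memcpy(pack.payload, address, sizeof(pack.payload));
    return pack;
}

// Copy 16 bytes back; on a GPU this would be a single 128-bit store.
template <typename T>
void store128(T* address, const Packed128Sketch<T>& pack) {
    std::memcpy(address, pack.payload, sizeof(pack.payload));
}

// Derivative of the tanh approximation of GELU (assumed variant).
inline float gelu_grad(float x) {
    const float k = 0.7978845608f;  // sqrt(2 / pi)
    const float u = k * (x + 0.044715f * x * x * x);
    const float t = std::tanh(u);
    const float sech2 = 1.0f - t * t;  // sech^2(u)
    return 0.5f * (1.0f + t)
         + 0.5f * x * sech2 * k * (1.0f + 3.0f * 0.044715f * x * x);
}

// dinp = gelu'(inp) * dout, processed one 128-bit pack at a time.
void gelu_backward_packed(float* dinp, const float* inp, const float* dout, std::size_t N) {
    using x128f = Packed128Sketch<float>;
    assert(N % x128f::size == 0);  // no partial pack at the tail
    for (std::size_t i = 0; i < N; i += x128f::size) {
        const x128f packed_inp = load128(inp + i);
        const x128f packed_dout = load128(dout + i);
        x128f packed_dinp;
        for (std::size_t k = 0; k < x128f::size; ++k) {
            packed_dinp[k] = gelu_grad(packed_inp[k]) * packed_dout[k];
        }
        store128(dinp + i, packed_dinp);
    }
}
```

The point of the pack is that each thread issues fewer, wider memory transactions; the arithmetic per element is unchanged.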
Previous kernel:
block_size 32 | time 0.1498 ms | bandwidth 503.99 GB/s
block_size 64 | time 0.0760 ms | bandwidth 993.32 GB/s
block_size 128 | time 0.0490 ms | bandwidth 1540.78 GB/s
block_size 256 | time 0.0487 ms | bandwidth 1548.88 GB/s
block_size 512 | time 0.0487 ms | bandwidth 1548.88 GB/s
block_size 1024 | time 0.0497 ms | bandwidth 1518.38 GB/s
total average iteration time: 39.030942 ms
New kernel:
block_size 32 | time 0.0328 ms | bandwidth 1535.18 GB/s
block_size 64 | time 0.0319 ms | bandwidth 1575.59 GB/s
block_size 128 | time 0.0333 ms | bandwidth 1509.35 GB/s
block_size 256 | time 0.0337 ms | bandwidth 1491.94 GB/s
block_size 512 | time 0.0340 ms | bandwidth 1478.92 GB/s
block_size 1024 | time 0.0352 ms | bandwidth 1430.92 GB/s
total average iteration time: 38.145030 ms
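One way to sanity-check the two tables is to multiply each row's time by its bandwidth, which gives the number of bytes the benchmark believes it moved. The helper below is a small sketch of that arithmetic; `implied_megabytes` is a hypothetical name, and it assumes that GB/s in the output means 1e9 bytes per second.

```cpp
#include <cassert>

// Megabytes (1e6 bytes) implied by one benchmark row.
// Assumes the reported bandwidth uses 1 GB/s = 1e9 bytes per second.
inline double implied_megabytes(double time_ms, double bandwidth_gbs) {
    assert(time_ms > 0.0 && bandwidth_gbs > 0.0);
    const double seconds = time_ms * 1e-3;
    const double bytes = seconds * bandwidth_gbs * 1e9;
    return bytes / 1e6;
}
```

Applied to the block_size 128 rows, this gives roughly 75.5 MB for the previous kernel (0.0490 ms at 1540.78 GB/s) and roughly 50.3 MB for the new kernel (0.0333 ms at 1509.35 GB/s). The other rows in each table agree with these figures, so the two runs appear to account for different amounts of data. The output does not say why, which is worth keeping in mind when comparing the times directly.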