
Packing for Gelu backwards #306

Merged 6 commits into karpathy:master on May 2, 2024
Conversation

JaneIllario (Contributor)

Update the gelu backward kernel to pack loads and stores into 128 bits, and create a gelu_backward CUDA file.

Previous kernel:
block_size 32 | time 0.1498 ms | bandwidth 503.99 GB/s
block_size 64 | time 0.0760 ms | bandwidth 993.32 GB/s
block_size 128 | time 0.0490 ms | bandwidth 1540.78 GB/s
block_size 256 | time 0.0487 ms | bandwidth 1548.88 GB/s
block_size 512 | time 0.0487 ms | bandwidth 1548.88 GB/s
block_size 1024 | time 0.0497 ms | bandwidth 1518.38 GB/s

total average iteration time: 39.030942 ms

New kernel:

block_size 32 | time 0.0328 ms | bandwidth 1535.18 GB/s
block_size 64 | time 0.0319 ms | bandwidth 1575.59 GB/s
block_size 128 | time 0.0333 ms | bandwidth 1509.35 GB/s
block_size 256 | time 0.0337 ms | bandwidth 1491.94 GB/s
block_size 512 | time 0.0340 ms | bandwidth 1478.92 GB/s
block_size 1024 | time 0.0352 ms | bandwidth 1430.92 GB/s

total average iteration time: 38.145030 ms
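For reference, the element-wise math the kernel computes (before any packing) follows the tanh-approximation GELU derivative. Below is a hedged CPU sketch in the style of llm.c's CPU reference functions; the names (`gelu_backward_cpu`, `GELU_SCALING_FACTOR`) mirror the repo's conventions, but this standalone version is illustrative, not the PR's exact code:

```cpp
#include <cassert>
#include <cmath>

// sqrt(2/pi), the scaling factor in the tanh approximation of GELU
#define GELU_SCALING_FACTOR 0.7978845608028654f

// CPU reference of the GELU backward pass (sketch, not the PR's exact code).
// gelu(x) = 0.5 * x * (1 + tanh(s * (x + 0.044715 * x^3))), s = sqrt(2/pi)
void gelu_backward_cpu(float* dinp, const float* inp, const float* dout, int N) {
    for (int i = 0; i < N; i++) {
        float x = inp[i];
        float cube = 0.044715f * x * x * x;
        float tanh_arg = GELU_SCALING_FACTOR * (x + cube);
        float tanh_out = tanhf(tanh_arg);
        float cosh_out = coshf(tanh_arg);
        float sech2 = 1.0f / (cosh_out * cosh_out); // sech^2 = d(tanh)/d(arg)
        // product rule + chain rule on 0.5 * x * (1 + tanh(tanh_arg))
        float local_grad = 0.5f * (1.0f + tanh_out)
            + x * 0.5f * sech2 * GELU_SCALING_FACTOR * (1.0f + 3.0f * 0.044715f * x * x);
        dinp[i] = local_grad * dout[i];
    }
}
```

The packed kernel applies exactly this scalar math, just to `x128::size` elements per thread so that each global load/store moves a full 128 bits.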

@karpathy (Owner)

With this PR we can't compare the previous kernel against the new one, and isn't there also a compile bug? The x128 typedef doesn't exist.

@JaneIllario (Contributor, Author)

Sorry, I must've deleted the definition for x128 while cleaning up my branch. Pushed the correction now.
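For context, a 128-bit pack type in the spirit of llm.c's `Packed128` / `x128` can be sketched as below. This is an illustrative host-side version under the assumption that `floatX` is `float`; the real one in the repo adds `__device__` qualifiers and reinterprets the payload as `int4` for vectorized loads and stores:

```cpp
#include <cassert>
#include <cstddef>

// Sketch of a 128-bit packed vector type (illustrative, not the repo's exact code).
template<class ElementType>
struct alignas(16) Packed128 {
    // number of elements that fit in one 128-bit (16-byte) load/store
    static constexpr size_t size = 16 / sizeof(ElementType);
    ElementType payload[size];
    ElementType& operator[](int index) { return payload[index]; }
    const ElementType& operator[](int index) const { return payload[index]; }
};

typedef float floatX;           // llm.c switches this between float / half / bf16
typedef Packed128<floatX> x128; // 4 floats, or 8 half-precision values, per pack
```

With `floatX = float`, `x128::size` is 4, so each thread processes 4 elements per 128-bit transaction.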

@JaneIllario (Contributor, Author)

I updated gelu_backward.cu to match the most recent changes to the other .cu files on master -- updating the other PR now.

Resolved review threads on train_gpt2.cu and dev/cuda/gelu_backward.cu. One comment on dev/cuda/gelu_backward.cu:
}

void gelu_backward2(floatX* dinp, const floatX* inp, const floatX* dout, int N, const int block_size) {
const int grid_size = ceil_div(N, block_size * x128::size);
Contributor:
Maybe add an assert(N % x128::size == 0) here? It documents the assumption, and we may get a better error message in case we call the kernel wrongly later.
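The suggested guard can be sketched host-side as follows. This is a minimal illustration assuming `floatX = float` (so the pack width is 4); `grid_size_for` is a hypothetical helper name, and `ceil_div` matches the rounding-up division helper used throughout llm.c's dev/cuda files:

```cpp
#include <cassert>

constexpr int x128_size = 4; // stand-in for x128::size with floatX = float

// round-up integer division, as used for CUDA grid sizing
int ceil_div(int dividend, int divisor) {
    return (dividend + divisor - 1) / divisor;
}

// hypothetical wrapper showing where the reviewer's assert would go
int grid_size_for(int N, int block_size) {
    // documents the assumption that N is a multiple of the pack width,
    // and fails loudly if the kernel is ever called with a ragged N
    assert(N % x128_size == 0);
    return ceil_div(N, block_size * x128_size);
}
```

Without the assert, a ragged `N` would silently make the last pack read and write past the end of the buffers; the assert turns that into an immediate, diagnosable failure.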

@karpathy karpathy merged commit 99f51ba into karpathy:master May 2, 2024
3 participants