Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster matmul_backward_bias using coalesced reads and shared memory in the kernel #221

Merged
merged 3 commits into from Apr 22, 2024

Conversation

al0vya
Copy link
Contributor

@al0vya al0vya commented Apr 22, 2024

This kernel seems to offer a <4x runtime improvement over matmul_backward_bias_kernel2 on an RTX 2070 Super GPU, runtime comparison shown below:

matmul_backward_bias_kernel2:
block_size 32 time 0.9027 ms
block_size 64 time 4.0396 ms
block_size 128 time 3.4077 ms
block_size 256 time 1.4159 ms
block_size 512 time 0.7302 ms
block_size 1024 time 0.9947 ms

matmul_backward_bias_kernel4:
block_size 32 time 0.6875 ms
block_size 64 time 0.4235 ms
block_size 128 time 0.2925 ms
block_size 256 time 0.2419 ms
block_size 512 time 0.2714 ms
block_size 1024 time 0.2519 ms

The kernel passes test_gpt2cu without any issue, and also the tests in dev/cuda/matmul_backward_bias.cu after increasing the tolerance from 1e-3 to 5e-3: with the tolerence set to 1e-3, the weights printed to the console are very similar to the CPU reference weights, so I am hazarding that this is an acceptable change -- I've seen a tolerance of up to 1e-2 in other parts of the code...

In test_gpt2cu, the runtime seems to reduce from ~42 ms/iter to ~39 ms/iter (again, using an RTX 2070 Super GPU).

I've included some comments at the start of the kernel to explain the philosophy behind it.

Happy to receive feedback to improve the PR, sorry if I've missed something obvious.

@al0vya al0vya marked this pull request as ready for review April 22, 2024 16:06
@al0vya al0vya changed the title Faster matmul_backward using coalesced reads and shared memory in the kernel Faster matmul_backward_bias using coalesced reads and shared memory in the kernel Apr 22, 2024
@ChrisDryden
Copy link
Contributor

Was thinking that the weight tolerance could be fixed by setting the summation variable to a double #144

@karpathy
Copy link
Owner

@ChrisDryden good idea i'm working on the kernel right now, will check

@karpathy karpathy merged commit a42f739 into karpathy:master Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants