Faster `matmul_backward_bias` using coalesced reads and shared memory in the kernel #221

al0vya · 2024-04-22T16:03:01Z

This kernel seems to offer a <4x runtime improvement over matmul_backward_bias_kernel2 on an RTX 2070 Super GPU, runtime comparison shown below:

matmul_backward_bias_kernel2:
block_size 32 time 0.9027 ms
block_size 64 time 4.0396 ms
block_size 128 time 3.4077 ms
block_size 256 time 1.4159 ms
block_size 512 time 0.7302 ms
block_size 1024 time 0.9947 ms

matmul_backward_bias_kernel4:
block_size 32 time 0.6875 ms
block_size 64 time 0.4235 ms
block_size 128 time 0.2925 ms
block_size 256 time 0.2419 ms
block_size 512 time 0.2714 ms
block_size 1024 time 0.2519 ms

The kernel passes test_gpt2cu without any issue, and also the tests in dev/cuda/matmul_backward_bias.cu after increasing the tolerance from 1e-3 to 5e-3: with the tolerence set to 1e-3, the weights printed to the console are very similar to the CPU reference weights, so I am hazarding that this is an acceptable change -- I've seen a tolerance of up to 1e-2 in other parts of the code...

In test_gpt2cu, the runtime seems to reduce from ~42 ms/iter to ~39 ms/iter (again, using an RTX 2070 Super GPU).

I've included some comments at the start of the kernel to explain the philosophy behind it.

Happy to receive feedback to improve the PR, sorry if I've missed something obvious.

…d shared memory

ChrisDryden · 2024-04-22T17:12:40Z

Was thinking that the weight tolerance could be fixed by setting the summation variable to a double #144

karpathy · 2024-04-22T17:14:03Z

@ChrisDryden good idea i'm working on the kernel right now, will check

al0vya added 2 commits April 22, 2024 16:35

add a faster matmul backward bias kernel that uses coalesced reads an…

55fbd7e

…d shared memory

add comment

b82ec20

al0vya marked this pull request as ready for review April 22, 2024 16:06

al0vya changed the title ~~Faster matmul_backward using coalesced reads and shared memory in the kernel~~ Faster matmul_backward_bias using coalesced reads and shared memory in the kernel Apr 22, 2024

add more comments to explain the philosophy behind the kernel

35393b4

karpathy merged commit a42f739 into karpathy:master Apr 22, 2024

ChrisDryden mentioned this pull request Apr 22, 2024

[todo] Accumulate in double instead of float #144

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster `matmul_backward_bias` using coalesced reads and shared memory in the kernel #221

Faster `matmul_backward_bias` using coalesced reads and shared memory in the kernel #221

al0vya commented Apr 22, 2024 •

edited

ChrisDryden commented Apr 22, 2024

karpathy commented Apr 22, 2024

Faster matmul_backward_bias using coalesced reads and shared memory in the kernel #221

Faster matmul_backward_bias using coalesced reads and shared memory in the kernel #221

Conversation

al0vya commented Apr 22, 2024 • edited

ChrisDryden commented Apr 22, 2024

karpathy commented Apr 22, 2024

Faster `matmul_backward_bias` using coalesced reads and shared memory in the kernel #221

Faster `matmul_backward_bias` using coalesced reads and shared memory in the kernel #221

al0vya commented Apr 22, 2024 •

edited