Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Matmul refactor using only cuBLASLt + GELU Fusion #653

Merged
merged 5 commits into from
Jul 1, 2024

Conversation

ademeure
Copy link
Contributor

In preparation for FP8, this replaces all cuBLAS calls by cuBLASLt which is now wrapped by a single matmul_cublaslt() function.

It also adds support for GELU fusion which can be controlled on the command line with "-ge": 0 for disabled, 1 for forward only, 2 for forward+backward. The default is 2 is for H100+ and 0 for older GPUs based on seeing regressions on RTX 4090 previously but you might want to consider disabling it by default before merging due to the following:

In terms of accuracy and validation loss, the fused GELU seems very slightly worse than ours (how/why?!) which is not ideal especially when combined with GELU recomputation since it means the activations used for the backward pass won't be bit-for-bit identical to the ones used in the forward pass.

It's hard to tell how much this is just noise because the loss is only slightly worse for fused GELU (and the tensor thresholds are still too aggressive by default, so out of sheer luck the fused GELU passes but the non-fused doesn't on my system!) - based on this data, I think it's probably real. But then again with the non-deterministic cuDNN performance runs below (before my other PR) the "best" val loss is seen with "-ge 1" and the worst with "-ge 0" so it's very much within the noise threshold in that case... so who knows?

The performance is noticeably improved (H100 with cuDNN enabled) - I did 2 runs of each since it wasn't deterministic due to cuDNN:

./train_gpt2cu -r 1 -ge 0 -b 24 -e "d48" ==> 31653 tok/s (val loss 6.893524)
./train_gpt2cu -r 1 -ge 0 -b 24 -e "d48" ==> 31464 tok/s (val loss 6.890060)

./train_gpt2cu -r 1 -ge 1 -b 24 -e "d48" ==> 32192 tok/s (val loss 6.889246)
./train_gpt2cu -r 1 -ge 1 -b 24 -e "d48" ==> 32227 tok/s (val loss 6.891509)

./train_gpt2cu -r 1 -ge 2 -b 24 -e "d48" ==> 32716 tok/s (val loss 6.891244)
./train_gpt2cu -r 1 -ge 2 -b 24 -e "d48" ==> 32851 tok/s (val loss 6.891099)

==> +3.4%! (for results of the 1st run)

-r 0 -ge 0 (BF16 cuDNN disabled so fully deterministic):

TENSOR OK, max diff: 3.901e-03, with rel error: 4.284e+00 (calculated=  0.002991, ref= -0.000911), 95.81% of maximum error
TENSOR OK, max diff: 1.192e-01, with rel error: 8.910e-02 (calculated=  1.218750, ref=  1.337970), 86.91% of maximum error
TENSOR OK, max diff: 3.112e-02, with rel error: 2.968e+00 (calculated= -0.020630, ref=  0.010485), 87.19% of maximum error
TENSOR OK, max diff: 2.367e-02, with rel error: 1.522e-01 (calculated= -0.131836, ref= -0.155501), 73.30% of maximum error
TENSOR OK, max diff: 2.460e-02, with rel error: 5.366e-01 (calculated= -0.021240, ref= -0.045837), 73.16% of maximum error
TENSOR NOT OK, max diff: 6.596e-02, with rel error: 5.363e-02 (calculated=  1.164062, ref=  1.230026), 103.83% of maximum error
TENSOR NOT OK, max diff: 6.596e-02, with rel error: 5.363e-02 (calculated=  1.164062, ref=  1.230026), 103.83% of maximum error
TENSOR NOT OK, max diff: 6.596e-02, with rel error: 5.363e-02 (calculated=  1.164062, ref=  1.230026), 103.83% of maximum error
TENSOR OK, max diff: 1.142e-02, with rel error: 1.212e+00 (calculated=  0.001999, ref= -0.009424), 72.55% of maximum error
TENSOR OK, max diff: 2.934e-04, with rel error: 4.422e-02 (calculated=  0.006927, ref=  0.006634), 41.86% of maximum error
TENSOR OK, max diff: 7.905e-03, with rel error: 1.528e+01 (calculated=  0.008423, ref=  0.000517), 98.31% of maximum error
TENSOR OK, max diff: 2.274e-03, with rel error: 1.098e-01 (calculated= -0.018433, ref= -0.020706), 86.76% of maximum error
TENSOR OK, max diff: 2.454e-03, with rel error: 4.789e-01 (calculated=  0.002670, ref=  0.005125), 84.49% of maximum error
TENSOR OK, max diff: 1.018e-01, with rel error: 7.858e-01 (calculated= -0.231445, ref= -0.129603), 92.38% of maximum error
TENSOR OK, max diff: 1.595e-02, with rel error: 5.161e+00 (calculated= -0.019043, ref= -0.003091), 78.80% of maximum error

loss ok at step 1: 5.242411 5.270009
loss ok at step 2: 4.045264 4.060681
loss ok at step 3: 3.299340 3.320085
loss ok at step 4: 2.719682 2.717550
loss ok at step 5: 2.199129 2.181066
loss ok at step 6: 1.658379 1.653923
loss ok at step 7: 1.178264 1.168050
loss ok at step 8: 0.757203 0.736873
loss ok at step 9: 0.409954 0.401021
loss ok at step 10: 0.197095 0.187493

Validation loss for ./train_gpt2cu -e "d48" on tinyshakespeare:
val loss 6.718741

-r 0 -ge 1:

TENSOR OK, max diff: 3.703e-03, with rel error: 4.067e+00 (calculated=  0.002792, ref= -0.000911), 90.94% of maximum error
TENSOR OK, max diff: 1.250e-01, with rel error: 1.096e-01 (calculated=  1.015625, ref=  1.140599), 87.85% of maximum error
TENSOR OK, max diff: 3.160e-02, with rel error: 3.014e+00 (calculated= -0.021118, ref=  0.010485), 89.61% of maximum error
TENSOR OK, max diff: 2.600e-02, with rel error: 1.756e-01 (calculated= -0.122070, ref= -0.148071), 82.03% of maximum error
TENSOR OK, max diff: 2.631e-02, with rel error: 5.739e-01 (calculated= -0.019531, ref= -0.045837), 78.24% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 1.361e-02, with rel error: 1.444e+00 (calculated=  0.004181, ref= -0.009424), 86.41% of maximum error
TENSOR OK, max diff: 2.415e-04, with rel error: 1.233e-01 (calculated= -0.001717, ref= -0.001958), 36.88% of maximum error
TENSOR OK, max diff: 7.966e-03, with rel error: 1.539e+01 (calculated=  0.008484, ref=  0.000517), 99.07% of maximum error
TENSOR OK, max diff: 2.518e-03, with rel error: 1.216e-01 (calculated= -0.018188, ref= -0.020706), 92.08% of maximum error
TENSOR OK, max diff: 2.622e-03, with rel error: 5.117e-01 (calculated=  0.002502, ref=  0.005125), 90.27% of maximum error
TENSOR OK, max diff: 1.018e-01, with rel error: 7.858e-01 (calculated= -0.231445, ref= -0.129603), 92.38% of maximum error
TENSOR OK, max diff: 1.607e-02, with rel error: 5.200e+00 (calculated= -0.019165, ref= -0.003091), 79.40% of maximum error

loss ok at step 1: 5.258774 5.270009
loss ok at step 2: 4.040064 4.060681
loss ok at step 3: 3.311543 3.320085
loss ok at step 4: 2.704113 2.717550
loss ok at step 5: 2.185935 2.181066
loss ok at step 6: 1.654894 1.653923
loss ok at step 7: 1.190601 1.168050
loss ok at step 8: 0.762185 0.736873
loss ok at step 9: 0.423399 0.401021
loss ok at step 10: 0.201602 0.187493

Validation loss for ./train_gpt2cu -e "d48" on tinyshakespeare:
val loss 6.725565

-r 0 -ge 2:

TENSOR OK, max diff: 3.733e-03, with rel error: 4.100e+00 (calculated=  0.002823, ref= -0.000911), 91.69% of maximum error
TENSOR OK, max diff: 1.250e-01, with rel error: 1.096e-01 (calculated=  1.015625, ref=  1.140599), 86.91% of maximum error
TENSOR OK, max diff: 3.185e-02, with rel error: 3.037e+00 (calculated= -0.021362, ref=  0.010485), 89.27% of maximum error
TENSOR OK, max diff: 2.659e-02, with rel error: 1.710e-01 (calculated= -0.128906, ref= -0.155501), 83.57% of maximum error
TENSOR OK, max diff: 2.606e-02, with rel error: 5.686e-01 (calculated= -0.019775, ref= -0.045837), 77.52% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 1.199e-02, with rel error: 1.272e+00 (calculated=  0.002563, ref= -0.009424), 76.14% of maximum error
TENSOR OK, max diff: 2.338e-04, with rel error: 1.194e-01 (calculated= -0.001724, ref= -0.001958), 35.72% of maximum error
TENSOR OK, max diff: 7.966e-03, with rel error: 1.539e+01 (calculated=  0.008484, ref=  0.000517), 99.07% of maximum error
TENSOR OK, max diff: 2.518e-03, with rel error: 1.216e-01 (calculated= -0.018188, ref= -0.020706), 93.12% of maximum error
TENSOR OK, max diff: 2.576e-03, with rel error: 5.027e-01 (calculated=  0.002548, ref=  0.005125), 88.69% of maximum error
TENSOR OK, max diff: 1.018e-01, with rel error: 7.858e-01 (calculated= -0.231445, ref= -0.129603), 92.38% of maximum error
TENSOR OK, max diff: 1.607e-02, with rel error: 5.200e+00 (calculated= -0.019165, ref= -0.003091), 79.40% of maximum error

loss ok at step 1: 5.258774 5.270009
loss ok at step 2: 4.061561 4.060681
loss ok at step 3: 3.322728 3.320085
loss ok at step 4: 2.727595 2.717550
loss ok at step 5: 2.185776 2.181066
loss ok at step 6: 1.665764 1.653923
loss ok at step 7: 1.181406 1.168050
loss ok at step 8: 0.760672 0.736873
loss ok at step 9: 0.411712 0.401021
loss ok at step 10: 0.202858 0.187493

Validation loss for ./train_gpt2cu -e "d48" on tinyshakespeare:
val loss 6.732559

-r 2 -ge 0:

TENSOR OK, max diff: 3.901e-03, with rel error: 4.284e+00 (calculated=  0.002991, ref= -0.000911), 95.81% of maximum error
TENSOR OK, max diff: 1.192e-01, with rel error: 8.910e-02 (calculated=  1.218750, ref=  1.337970), 86.91% of maximum error
TENSOR OK, max diff: 3.112e-02, with rel error: 2.968e+00 (calculated= -0.020630, ref=  0.010485), 87.19% of maximum error
TENSOR OK, max diff: 2.367e-02, with rel error: 1.522e-01 (calculated= -0.131836, ref= -0.155501), 73.30% of maximum error
TENSOR OK, max diff: 2.460e-02, with rel error: 5.366e-01 (calculated= -0.021240, ref= -0.045837), 73.16% of maximum error
TENSOR NOT OK, max diff: 6.596e-02, with rel error: 5.363e-02 (calculated=  1.164062, ref=  1.230026), 103.83% of maximum error
TENSOR NOT OK, max diff: 6.596e-02, with rel error: 5.363e-02 (calculated=  1.164062, ref=  1.230026), 103.83% of maximum error
TENSOR NOT OK, max diff: 6.596e-02, with rel error: 5.363e-02 (calculated=  1.164062, ref=  1.230026), 103.83% of maximum error
TENSOR OK, max diff: 1.142e-02, with rel error: 1.212e+00 (calculated=  0.001999, ref= -0.009424), 72.55% of maximum error
TENSOR OK, max diff: 2.934e-04, with rel error: 4.422e-02 (calculated=  0.006927, ref=  0.006634), 41.86% of maximum error
TENSOR OK, max diff: 7.905e-03, with rel error: 1.528e+01 (calculated=  0.008423, ref=  0.000517), 98.31% of maximum error
TENSOR OK, max diff: 2.274e-03, with rel error: 1.098e-01 (calculated= -0.018433, ref= -0.020706), 86.76% of maximum error
TENSOR OK, max diff: 2.454e-03, with rel error: 4.789e-01 (calculated=  0.002670, ref=  0.005125), 84.49% of maximum error
TENSOR OK, max diff: 1.018e-01, with rel error: 7.858e-01 (calculated= -0.231445, ref= -0.129603), 92.38% of maximum error
TENSOR OK, max diff: 1.595e-02, with rel error: 5.161e+00 (calculated= -0.019043, ref= -0.003091), 78.80% of maximum error

loss ok at step 1: 5.242411 5.270009
loss ok at step 2: 4.045264 4.060681
loss ok at step 3: 3.299340 3.320085
loss ok at step 4: 2.719682 2.717550
loss ok at step 5: 2.199129 2.181066
loss ok at step 6: 1.658379 1.653923
loss ok at step 7: 1.178264 1.168050
loss ok at step 8: 0.757203 0.736873
loss ok at step 9: 0.409954 0.401021
loss ok at step 10: 0.197095 0.187493

Validation loss for ./train_gpt2cu -e "d48" on tinyshakespeare:
val loss 6.718741

-r 2 -ge 1

TENSOR OK, max diff: 3.703e-03, with rel error: 4.067e+00 (calculated=  0.002792, ref= -0.000911), 90.94% of maximum error
TENSOR OK, max diff: 1.250e-01, with rel error: 1.096e-01 (calculated=  1.015625, ref=  1.140599), 87.85% of maximum error
TENSOR OK, max diff: 3.160e-02, with rel error: 3.014e+00 (calculated= -0.021118, ref=  0.010485), 89.61% of maximum error
TENSOR OK, max diff: 2.600e-02, with rel error: 1.756e-01 (calculated= -0.122070, ref= -0.148071), 82.03% of maximum error
TENSOR OK, max diff: 2.631e-02, with rel error: 5.739e-01 (calculated= -0.019531, ref= -0.045837), 78.24% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 1.361e-02, with rel error: 1.444e+00 (calculated=  0.004181, ref= -0.009424), 86.41% of maximum error
TENSOR OK, max diff: 2.415e-04, with rel error: 1.233e-01 (calculated= -0.001717, ref= -0.001958), 36.88% of maximum error
TENSOR OK, max diff: 7.966e-03, with rel error: 1.539e+01 (calculated=  0.008484, ref=  0.000517), 99.07% of maximum error
TENSOR OK, max diff: 2.518e-03, with rel error: 1.216e-01 (calculated= -0.018188, ref= -0.020706), 92.08% of maximum error
TENSOR OK, max diff: 2.622e-03, with rel error: 5.117e-01 (calculated=  0.002502, ref=  0.005125), 90.27% of maximum error
TENSOR OK, max diff: 1.018e-01, with rel error: 7.858e-01 (calculated= -0.231445, ref= -0.129603), 92.38% of maximum error
TENSOR OK, max diff: 1.607e-02, with rel error: 5.200e+00 (calculated= -0.019165, ref= -0.003091), 79.40% of maximum error

loss ok at step 1: 5.258774 5.270009
loss ok at step 2: 4.031170 4.060681
loss ok at step 3: 3.303350 3.320085
loss ok at step 4: 2.718124 2.717550
loss ok at step 5: 2.191558 2.181066
loss ok at step 6: 1.654259 1.653923
loss ok at step 7: 1.177093 1.168050
loss ok at step 8: 0.748835 0.736873
loss ok at step 9: 0.415033 0.401021
loss ok at step 10: 0.197646 0.187493

Validation loss for ./train_gpt2cu -e "d48" on tinyshakespeare:
val loss 6.732420

-r 2 -ge 2

TENSOR OK, max diff: 3.733e-03, with rel error: 4.100e+00 (calculated=  0.002823, ref= -0.000911), 91.69% of maximum error
TENSOR OK, max diff: 1.250e-01, with rel error: 1.096e-01 (calculated=  1.015625, ref=  1.140599), 86.91% of maximum error
TENSOR OK, max diff: 3.185e-02, with rel error: 3.037e+00 (calculated= -0.021362, ref=  0.010485), 89.27% of maximum error
TENSOR OK, max diff: 2.659e-02, with rel error: 1.710e-01 (calculated= -0.128906, ref= -0.155501), 83.57% of maximum error
TENSOR OK, max diff: 2.606e-02, with rel error: 5.686e-01 (calculated= -0.019775, ref= -0.045837), 77.52% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 1.199e-02, with rel error: 1.272e+00 (calculated=  0.002563, ref= -0.009424), 76.14% of maximum error
TENSOR OK, max diff: 2.338e-04, with rel error: 1.194e-01 (calculated= -0.001724, ref= -0.001958), 35.72% of maximum error
TENSOR OK, max diff: 7.966e-03, with rel error: 1.539e+01 (calculated=  0.008484, ref=  0.000517), 99.07% of maximum error
TENSOR OK, max diff: 2.518e-03, with rel error: 1.216e-01 (calculated= -0.018188, ref= -0.020706), 93.12% of maximum error
TENSOR OK, max diff: 2.576e-03, with rel error: 5.027e-01 (calculated=  0.002548, ref=  0.005125), 88.69% of maximum error
TENSOR OK, max diff: 1.018e-01, with rel error: 7.858e-01 (calculated= -0.231445, ref= -0.129603), 92.38% of maximum error
TENSOR OK, max diff: 1.607e-02, with rel error: 5.200e+00 (calculated= -0.019165, ref= -0.003091), 79.40% of maximum error

loss ok at step 1: 5.258774 5.270009
loss ok at step 2: 4.043187 4.060681
loss ok at step 3: 3.321911 3.320085
loss ok at step 4: 2.713270 2.717550
loss ok at step 5: 2.186564 2.181066
loss ok at step 6: 1.655319 1.653923
loss ok at step 7: 1.184024 1.168050
loss ok at step 8: 0.750603 0.736873
loss ok at step 9: 0.422000 0.401021
loss ok at step 10: 0.199289 0.187493

Validation loss for ./train_gpt2cu -e "d48" on tinyshakespeare:
val loss 6.725532

@@ -121,6 +121,7 @@ int main(int argc, char *argv[]) {
if (argv[i][0] != '-') { exit(EXIT_FAILURE); } // must start with dash
if (argv[i][1] == 'w') { model.use_master_weights = atoi(argv[i+1]); }
else if (argv[i][1] == 'r') { model.recompute = atoi(argv[i+1]); }
else if (argv[i][1] == 'g' && argv[i][2] == 'e') { model.gelu_fusion = atoi(argv[i+1]); }
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mildly scary indexing into [2] here
Should we add

if (!(strlen(argv[i]) == 2 || strlen(argv[i]) == 3)) { error_usage(); } // must be -x[y] (one dash, one or two letters)

like in train code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, hadn't noticed that was missing in test! Adding it now.

(it's a bit non-obvious that only works because strlen==2 means [2] is '\0', and that we'd need a better check if we wanted to use argv[i][3], but probably good enough for now)

@@ -1167,13 +1165,11 @@ void common_start(bool override_enable_tf32 = true, bool print_device_info = tru
nvtxNameCudaStreamA(main_stream, "main stream");

// set up cuBLAS and cuBLASLt
cublasCheck(cublasCreate(&cublas_handle));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@karpathy karpathy merged commit a876282 into karpathy:master Jul 1, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants