Matmul refactor using only cuBLASLt + GELU Fusion #653

ademeure · 2024-06-27T15:17:46Z

In preparation for FP8, this replaces all cuBLAS calls by cuBLASLt which is now wrapped by a single matmul_cublaslt() function.

It also adds support for GELU fusion which can be controlled on the command line with "-ge": 0 for disabled, 1 for forward only, 2 for forward+backward. The default is 2 is for H100+ and 0 for older GPUs based on seeing regressions on RTX 4090 previously but you might want to consider disabling it by default before merging due to the following:

In terms of accuracy and validation loss, the fused GELU seems very slightly worse than ours (how/why?!) which is not ideal especially when combined with GELU recomputation since it means the activations used for the backward pass won't be bit-for-bit identical to the ones used in the forward pass.

It's hard to tell how much this is just noise because the loss is only slightly worse for fused GELU (and the tensor thresholds are still too aggressive by default, so out of sheer luck the fused GELU passes but the non-fused doesn't on my system!) - based on this data, I think it's probably real. But then again with the non-deterministic cuDNN performance runs below (before my other PR) the "best" val loss is seen with "-ge 1" and the worst with "-ge 0" so it's very much within the noise threshold in that case... so who knows?

The performance is noticeably improved (H100 with cuDNN enabled) - I did 2 runs of each since it wasn't deterministic due to cuDNN:

./train_gpt2cu -r 1 -ge 0 -b 24 -e "d48" ==> 31653 tok/s (val loss 6.893524)
./train_gpt2cu -r 1 -ge 0 -b 24 -e "d48" ==> 31464 tok/s (val loss 6.890060)

./train_gpt2cu -r 1 -ge 1 -b 24 -e "d48" ==> 32192 tok/s (val loss 6.889246)
./train_gpt2cu -r 1 -ge 1 -b 24 -e "d48" ==> 32227 tok/s (val loss 6.891509)

./train_gpt2cu -r 1 -ge 2 -b 24 -e "d48" ==> 32716 tok/s (val loss 6.891244)
./train_gpt2cu -r 1 -ge 2 -b 24 -e "d48" ==> 32851 tok/s (val loss 6.891099)

==> +3.4%! (for results of the 1st run)

-r 0 -ge 0 (BF16 cuDNN disabled so fully deterministic):

TENSOR OK, max diff: 3.901e-03, with rel error: 4.284e+00 (calculated=  0.002991, ref= -0.000911), 95.81% of maximum error
TENSOR OK, max diff: 1.192e-01, with rel error: 8.910e-02 (calculated=  1.218750, ref=  1.337970), 86.91% of maximum error
TENSOR OK, max diff: 3.112e-02, with rel error: 2.968e+00 (calculated= -0.020630, ref=  0.010485), 87.19% of maximum error
TENSOR OK, max diff: 2.367e-02, with rel error: 1.522e-01 (calculated= -0.131836, ref= -0.155501), 73.30% of maximum error
TENSOR OK, max diff: 2.460e-02, with rel error: 5.366e-01 (calculated= -0.021240, ref= -0.045837), 73.16% of maximum error
TENSOR NOT OK, max diff: 6.596e-02, with rel error: 5.363e-02 (calculated=  1.164062, ref=  1.230026), 103.83% of maximum error
TENSOR NOT OK, max diff: 6.596e-02, with rel error: 5.363e-02 (calculated=  1.164062, ref=  1.230026), 103.83% of maximum error
TENSOR NOT OK, max diff: 6.596e-02, with rel error: 5.363e-02 (calculated=  1.164062, ref=  1.230026), 103.83% of maximum error
TENSOR OK, max diff: 1.142e-02, with rel error: 1.212e+00 (calculated=  0.001999, ref= -0.009424), 72.55% of maximum error
TENSOR OK, max diff: 2.934e-04, with rel error: 4.422e-02 (calculated=  0.006927, ref=  0.006634), 41.86% of maximum error
TENSOR OK, max diff: 7.905e-03, with rel error: 1.528e+01 (calculated=  0.008423, ref=  0.000517), 98.31% of maximum error
TENSOR OK, max diff: 2.274e-03, with rel error: 1.098e-01 (calculated= -0.018433, ref= -0.020706), 86.76% of maximum error
TENSOR OK, max diff: 2.454e-03, with rel error: 4.789e-01 (calculated=  0.002670, ref=  0.005125), 84.49% of maximum error
TENSOR OK, max diff: 1.018e-01, with rel error: 7.858e-01 (calculated= -0.231445, ref= -0.129603), 92.38% of maximum error
TENSOR OK, max diff: 1.595e-02, with rel error: 5.161e+00 (calculated= -0.019043, ref= -0.003091), 78.80% of maximum error

loss ok at step 1: 5.242411 5.270009
loss ok at step 2: 4.045264 4.060681
loss ok at step 3: 3.299340 3.320085
loss ok at step 4: 2.719682 2.717550
loss ok at step 5: 2.199129 2.181066
loss ok at step 6: 1.658379 1.653923
loss ok at step 7: 1.178264 1.168050
loss ok at step 8: 0.757203 0.736873
loss ok at step 9: 0.409954 0.401021
loss ok at step 10: 0.197095 0.187493

Validation loss for ./train_gpt2cu -e "d48" on tinyshakespeare:
val loss 6.718741

-r 0 -ge 1:

TENSOR OK, max diff: 3.703e-03, with rel error: 4.067e+00 (calculated=  0.002792, ref= -0.000911), 90.94% of maximum error
TENSOR OK, max diff: 1.250e-01, with rel error: 1.096e-01 (calculated=  1.015625, ref=  1.140599), 87.85% of maximum error
TENSOR OK, max diff: 3.160e-02, with rel error: 3.014e+00 (calculated= -0.021118, ref=  0.010485), 89.61% of maximum error
TENSOR OK, max diff: 2.600e-02, with rel error: 1.756e-01 (calculated= -0.122070, ref= -0.148071), 82.03% of maximum error
TENSOR OK, max diff: 2.631e-02, with rel error: 5.739e-01 (calculated= -0.019531, ref= -0.045837), 78.24% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 1.361e-02, with rel error: 1.444e+00 (calculated=  0.004181, ref= -0.009424), 86.41% of maximum error
TENSOR OK, max diff: 2.415e-04, with rel error: 1.233e-01 (calculated= -0.001717, ref= -0.001958), 36.88% of maximum error
TENSOR OK, max diff: 7.966e-03, with rel error: 1.539e+01 (calculated=  0.008484, ref=  0.000517), 99.07% of maximum error
TENSOR OK, max diff: 2.518e-03, with rel error: 1.216e-01 (calculated= -0.018188, ref= -0.020706), 92.08% of maximum error
TENSOR OK, max diff: 2.622e-03, with rel error: 5.117e-01 (calculated=  0.002502, ref=  0.005125), 90.27% of maximum error
TENSOR OK, max diff: 1.018e-01, with rel error: 7.858e-01 (calculated= -0.231445, ref= -0.129603), 92.38% of maximum error
TENSOR OK, max diff: 1.607e-02, with rel error: 5.200e+00 (calculated= -0.019165, ref= -0.003091), 79.40% of maximum error

loss ok at step 1: 5.258774 5.270009
loss ok at step 2: 4.040064 4.060681
loss ok at step 3: 3.311543 3.320085
loss ok at step 4: 2.704113 2.717550
loss ok at step 5: 2.185935 2.181066
loss ok at step 6: 1.654894 1.653923
loss ok at step 7: 1.190601 1.168050
loss ok at step 8: 0.762185 0.736873
loss ok at step 9: 0.423399 0.401021
loss ok at step 10: 0.201602 0.187493

Validation loss for ./train_gpt2cu -e "d48" on tinyshakespeare:
val loss 6.725565

-r 0 -ge 2:

TENSOR OK, max diff: 3.733e-03, with rel error: 4.100e+00 (calculated=  0.002823, ref= -0.000911), 91.69% of maximum error
TENSOR OK, max diff: 1.250e-01, with rel error: 1.096e-01 (calculated=  1.015625, ref=  1.140599), 86.91% of maximum error
TENSOR OK, max diff: 3.185e-02, with rel error: 3.037e+00 (calculated= -0.021362, ref=  0.010485), 89.27% of maximum error
TENSOR OK, max diff: 2.659e-02, with rel error: 1.710e-01 (calculated= -0.128906, ref= -0.155501), 83.57% of maximum error
TENSOR OK, max diff: 2.606e-02, with rel error: 5.686e-01 (calculated= -0.019775, ref= -0.045837), 77.52% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 1.199e-02, with rel error: 1.272e+00 (calculated=  0.002563, ref= -0.009424), 76.14% of maximum error
TENSOR OK, max diff: 2.338e-04, with rel error: 1.194e-01 (calculated= -0.001724, ref= -0.001958), 35.72% of maximum error
TENSOR OK, max diff: 7.966e-03, with rel error: 1.539e+01 (calculated=  0.008484, ref=  0.000517), 99.07% of maximum error
TENSOR OK, max diff: 2.518e-03, with rel error: 1.216e-01 (calculated= -0.018188, ref= -0.020706), 93.12% of maximum error
TENSOR OK, max diff: 2.576e-03, with rel error: 5.027e-01 (calculated=  0.002548, ref=  0.005125), 88.69% of maximum error
TENSOR OK, max diff: 1.018e-01, with rel error: 7.858e-01 (calculated= -0.231445, ref= -0.129603), 92.38% of maximum error
TENSOR OK, max diff: 1.607e-02, with rel error: 5.200e+00 (calculated= -0.019165, ref= -0.003091), 79.40% of maximum error

loss ok at step 1: 5.258774 5.270009
loss ok at step 2: 4.061561 4.060681
loss ok at step 3: 3.322728 3.320085
loss ok at step 4: 2.727595 2.717550
loss ok at step 5: 2.185776 2.181066
loss ok at step 6: 1.665764 1.653923
loss ok at step 7: 1.181406 1.168050
loss ok at step 8: 0.760672 0.736873
loss ok at step 9: 0.411712 0.401021
loss ok at step 10: 0.202858 0.187493

Validation loss for ./train_gpt2cu -e "d48" on tinyshakespeare:
val loss 6.732559

-r 2 -ge 0:

TENSOR OK, max diff: 3.901e-03, with rel error: 4.284e+00 (calculated=  0.002991, ref= -0.000911), 95.81% of maximum error
TENSOR OK, max diff: 1.192e-01, with rel error: 8.910e-02 (calculated=  1.218750, ref=  1.337970), 86.91% of maximum error
TENSOR OK, max diff: 3.112e-02, with rel error: 2.968e+00 (calculated= -0.020630, ref=  0.010485), 87.19% of maximum error
TENSOR OK, max diff: 2.367e-02, with rel error: 1.522e-01 (calculated= -0.131836, ref= -0.155501), 73.30% of maximum error
TENSOR OK, max diff: 2.460e-02, with rel error: 5.366e-01 (calculated= -0.021240, ref= -0.045837), 73.16% of maximum error
TENSOR NOT OK, max diff: 6.596e-02, with rel error: 5.363e-02 (calculated=  1.164062, ref=  1.230026), 103.83% of maximum error
TENSOR NOT OK, max diff: 6.596e-02, with rel error: 5.363e-02 (calculated=  1.164062, ref=  1.230026), 103.83% of maximum error
TENSOR NOT OK, max diff: 6.596e-02, with rel error: 5.363e-02 (calculated=  1.164062, ref=  1.230026), 103.83% of maximum error
TENSOR OK, max diff: 1.142e-02, with rel error: 1.212e+00 (calculated=  0.001999, ref= -0.009424), 72.55% of maximum error
TENSOR OK, max diff: 2.934e-04, with rel error: 4.422e-02 (calculated=  0.006927, ref=  0.006634), 41.86% of maximum error
TENSOR OK, max diff: 7.905e-03, with rel error: 1.528e+01 (calculated=  0.008423, ref=  0.000517), 98.31% of maximum error
TENSOR OK, max diff: 2.274e-03, with rel error: 1.098e-01 (calculated= -0.018433, ref= -0.020706), 86.76% of maximum error
TENSOR OK, max diff: 2.454e-03, with rel error: 4.789e-01 (calculated=  0.002670, ref=  0.005125), 84.49% of maximum error
TENSOR OK, max diff: 1.018e-01, with rel error: 7.858e-01 (calculated= -0.231445, ref= -0.129603), 92.38% of maximum error
TENSOR OK, max diff: 1.595e-02, with rel error: 5.161e+00 (calculated= -0.019043, ref= -0.003091), 78.80% of maximum error

loss ok at step 1: 5.242411 5.270009
loss ok at step 2: 4.045264 4.060681
loss ok at step 3: 3.299340 3.320085
loss ok at step 4: 2.719682 2.717550
loss ok at step 5: 2.199129 2.181066
loss ok at step 6: 1.658379 1.653923
loss ok at step 7: 1.178264 1.168050
loss ok at step 8: 0.757203 0.736873
loss ok at step 9: 0.409954 0.401021
loss ok at step 10: 0.197095 0.187493

Validation loss for ./train_gpt2cu -e "d48" on tinyshakespeare:
val loss 6.718741

-r 2 -ge 1

TENSOR OK, max diff: 3.703e-03, with rel error: 4.067e+00 (calculated=  0.002792, ref= -0.000911), 90.94% of maximum error
TENSOR OK, max diff: 1.250e-01, with rel error: 1.096e-01 (calculated=  1.015625, ref=  1.140599), 87.85% of maximum error
TENSOR OK, max diff: 3.160e-02, with rel error: 3.014e+00 (calculated= -0.021118, ref=  0.010485), 89.61% of maximum error
TENSOR OK, max diff: 2.600e-02, with rel error: 1.756e-01 (calculated= -0.122070, ref= -0.148071), 82.03% of maximum error
TENSOR OK, max diff: 2.631e-02, with rel error: 5.739e-01 (calculated= -0.019531, ref= -0.045837), 78.24% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 1.361e-02, with rel error: 1.444e+00 (calculated=  0.004181, ref= -0.009424), 86.41% of maximum error
TENSOR OK, max diff: 2.415e-04, with rel error: 1.233e-01 (calculated= -0.001717, ref= -0.001958), 36.88% of maximum error
TENSOR OK, max diff: 7.966e-03, with rel error: 1.539e+01 (calculated=  0.008484, ref=  0.000517), 99.07% of maximum error
TENSOR OK, max diff: 2.518e-03, with rel error: 1.216e-01 (calculated= -0.018188, ref= -0.020706), 92.08% of maximum error
TENSOR OK, max diff: 2.622e-03, with rel error: 5.117e-01 (calculated=  0.002502, ref=  0.005125), 90.27% of maximum error
TENSOR OK, max diff: 1.018e-01, with rel error: 7.858e-01 (calculated= -0.231445, ref= -0.129603), 92.38% of maximum error
TENSOR OK, max diff: 1.607e-02, with rel error: 5.200e+00 (calculated= -0.019165, ref= -0.003091), 79.40% of maximum error

loss ok at step 1: 5.258774 5.270009
loss ok at step 2: 4.031170 4.060681
loss ok at step 3: 3.303350 3.320085
loss ok at step 4: 2.718124 2.717550
loss ok at step 5: 2.191558 2.181066
loss ok at step 6: 1.654259 1.653923
loss ok at step 7: 1.177093 1.168050
loss ok at step 8: 0.748835 0.736873
loss ok at step 9: 0.415033 0.401021
loss ok at step 10: 0.197646 0.187493

Validation loss for ./train_gpt2cu -e "d48" on tinyshakespeare:
val loss 6.732420

-r 2 -ge 2

TENSOR OK, max diff: 3.733e-03, with rel error: 4.100e+00 (calculated=  0.002823, ref= -0.000911), 91.69% of maximum error
TENSOR OK, max diff: 1.250e-01, with rel error: 1.096e-01 (calculated=  1.015625, ref=  1.140599), 86.91% of maximum error
TENSOR OK, max diff: 3.185e-02, with rel error: 3.037e+00 (calculated= -0.021362, ref=  0.010485), 89.27% of maximum error
TENSOR OK, max diff: 2.659e-02, with rel error: 1.710e-01 (calculated= -0.128906, ref= -0.155501), 83.57% of maximum error
TENSOR OK, max diff: 2.606e-02, with rel error: 5.686e-01 (calculated= -0.019775, ref= -0.045837), 77.52% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 6.925e-02, with rel error: 6.878e-02 (calculated= -0.937500, ref= -1.006749), 93.92% of maximum error
TENSOR OK, max diff: 1.199e-02, with rel error: 1.272e+00 (calculated=  0.002563, ref= -0.009424), 76.14% of maximum error
TENSOR OK, max diff: 2.338e-04, with rel error: 1.194e-01 (calculated= -0.001724, ref= -0.001958), 35.72% of maximum error
TENSOR OK, max diff: 7.966e-03, with rel error: 1.539e+01 (calculated=  0.008484, ref=  0.000517), 99.07% of maximum error
TENSOR OK, max diff: 2.518e-03, with rel error: 1.216e-01 (calculated= -0.018188, ref= -0.020706), 93.12% of maximum error
TENSOR OK, max diff: 2.576e-03, with rel error: 5.027e-01 (calculated=  0.002548, ref=  0.005125), 88.69% of maximum error
TENSOR OK, max diff: 1.018e-01, with rel error: 7.858e-01 (calculated= -0.231445, ref= -0.129603), 92.38% of maximum error
TENSOR OK, max diff: 1.607e-02, with rel error: 5.200e+00 (calculated= -0.019165, ref= -0.003091), 79.40% of maximum error

loss ok at step 1: 5.258774 5.270009
loss ok at step 2: 4.043187 4.060681
loss ok at step 3: 3.321911 3.320085
loss ok at step 4: 2.713270 2.717550
loss ok at step 5: 2.186564 2.181066
loss ok at step 6: 1.655319 1.653923
loss ok at step 7: 1.184024 1.168050
loss ok at step 8: 0.750603 0.736873
loss ok at step 9: 0.422000 0.401021
loss ok at step 10: 0.199289 0.187493

Validation loss for ./train_gpt2cu -e "d48" on tinyshakespeare:
val loss 6.725532

karpathy · 2024-07-01T17:26:52Z

test_gpt2.cu

@@ -121,6 +121,7 @@ int main(int argc, char *argv[]) {
        if (argv[i][0] != '-') { exit(EXIT_FAILURE); } // must start with dash
        if (argv[i][1] == 'w') { model.use_master_weights = atoi(argv[i+1]); }
        else if (argv[i][1] == 'r') { model.recompute = atoi(argv[i+1]); }
+        else if (argv[i][1] == 'g' && argv[i][2] == 'e') { model.gelu_fusion = atoi(argv[i+1]); }


mildly scary indexing into [2] here
Should we add

if (!(strlen(argv[i]) == 2 || strlen(argv[i]) == 3)) { error_usage(); } // must be -x[y] (one dash, one or two letters)

like in train code?

Good catch, hadn't noticed that was missing in test! Adding it now.

(it's a bit non-obvious that only works because strlen==2 means [2] is '\0', and that we'd need a better check if we wanted to use argv[i][3], but probably good enough for now)

karpathy · 2024-07-01T17:27:54Z

train_gpt2.cu

@@ -1167,13 +1165,11 @@ void common_start(bool override_enable_tf32 = true, bool print_device_info = tru
    nvtxNameCudaStreamA(main_stream, "main stream");

    // set up cuBLAS and cuBLASLt
-    cublasCheck(cublasCreate(&cublas_handle));


…access

ademeure added 3 commits June 30, 2024 20:45

Refactor matmul to only use cuBLASLt and optionally use fused GELU

fa5a196

Make GELU fusion default 2 for H100+ and 0 on older GPUs for now

0f11a4e

fix default for test_gpt2cu

6e29dd0

ademeure force-pushed the cublaslt_refactor branch from 0073f5f to 6e29dd0 Compare June 30, 2024 19:46

merge fix

20e450c

karpathy reviewed Jul 1, 2024

View reviewed changes

Disable GELU fusion by deffault & fix potential argv[] out of bounds …

00247dc

…access

karpathy merged commit a876282 into karpathy:master Jul 1, 2024
13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Matmul refactor using only cuBLASLt + GELU Fusion #653

Matmul refactor using only cuBLASLt + GELU Fusion #653

ademeure commented Jun 27, 2024

karpathy Jul 1, 2024

ademeure Jul 1, 2024

karpathy Jul 1, 2024

Matmul refactor using only cuBLASLt + GELU Fusion #653

Matmul refactor using only cuBLASLt + GELU Fusion #653

Conversation

ademeure commented Jun 27, 2024

-r 0 -ge 0 (BF16 cuDNN disabled so fully deterministic):

-r 0 -ge 1:

-r 0 -ge 2:

-r 2 -ge 0:

-r 2 -ge 1

-r 2 -ge 2

karpathy Jul 1, 2024

Choose a reason for hiding this comment

ademeure Jul 1, 2024

Choose a reason for hiding this comment

karpathy Jul 1, 2024

Choose a reason for hiding this comment