
Optimisations for layernorm_backward / matmul_backward_bias / fused_classifier #378

Merged
merged 4 commits into karpathy:master on May 7, 2024

Conversation

@ademeure (Contributor) commented May 7, 2024

These are fairly difficult optimisations to describe; hopefully the comments are enough! I'd focus on the changes in train_gpt2.cu rather than the similar ones in /dev/cuda/ (I didn't include a dev version of the new bias kernel; that file is very out of date and needs other changes).

layernorm_backward needed x128, but with the additional complexity that atomics are normally 32-bit rather than 128-bit, so implementing this naively resulted in an 8-way shared-memory bank conflict and terrible performance! The fix is to do all the accumulation in a bank-friendly order and only reorder just before the final write to global memory. This is kind of an annoying side-effect of x128; I think Hopper has a native 128-bit atomicAdd, though.
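As a rough illustration of that pattern (and definitely not the PR's actual layernorm_backward_kernel8), here is a hedged sketch of a per-channel bias-gradient accumulation using plain float and a made-up VEC=4 vector width instead of llm.c's x128: accumulate into shared memory in a swizzled, bank-friendly layout with 32-bit atomics, then reorder just once before the wide write out.

```cuda
// Hedged sketch only: dbias_sketch, VEC and the trivial cross-block reduction are
// illustrative assumptions, not llm.c code. Assumes C is a multiple of VEC and that
// dout rows are 16-byte aligned.
#include <cuda_runtime.h>

constexpr int VEC = 4;  // floats per 128-bit load/store

__global__ void dbias_sketch(float* dbias, const float* dout, int N, int C) {
    extern __shared__ float scratch[];          // C floats, swizzled layout
    const int ngroups = C / VEC;                // number of VEC-wide channel groups

    for (int i = threadIdx.x; i < C; i += blockDim.x) scratch[i] = 0.0f;
    __syncthreads();

    // grid-stride over (row, channel-group) pairs; consecutive threads get
    // consecutive channel groups, which is what makes the layout below work
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < N * ngroups;
         idx += gridDim.x * blockDim.x) {
        const int row = idx / ngroups;
        const int cg  = idx % ngroups;
        const float4 d = *reinterpret_cast<const float4*>(dout + (size_t)row * C + cg * VEC);

        // Swizzled index scratch[k * ngroups + cg]: for a fixed k, a warp's 32 threads
        // hit 32 consecutive words, i.e. 32 different banks, so the 32-bit atomics do
        // not serialize. The naive scratch[cg * VEC + k] would hit only 32/VEC banks
        // (an 8-way conflict with an 8-element x128 of bf16).
        atomicAdd(&scratch[0 * ngroups + cg], d.x);
        atomicAdd(&scratch[1 * ngroups + cg], d.y);
        atomicAdd(&scratch[2 * ngroups + cg], d.z);
        atomicAdd(&scratch[3 * ngroups + cg], d.w);
    }
    __syncthreads();

    // reorder back to channel-major exactly once, just before leaving shared memory,
    // so each thread again owns a contiguous group of channels for the write out
    // (the cross-block reduction is kept as simple per-element global atomics here)
    for (int cg = threadIdx.x; cg < ngroups; cg += blockDim.x) {
        for (int k = 0; k < VEC; k++) {
            atomicAdd(&dbias[cg * VEC + k], scratch[k * ngroups + cg]);
        }
    }
}
// launch sketch: dbias_sketch<<<grid, 512, C * sizeof(float)>>>(dbias, dout, N, C);
```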

matmul_backward_bias is roughly the same story, except that kernel6 was already the bottleneck and needed fixing: it was limited by shared-memory bank conflicts.
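For completeness, the textbook way this class of shared-memory bank conflict is usually broken is padding. The PR's kernel instead fixes the order of the accumulation, but a padded-tile transpose is the simplest self-contained illustration of why column-wise shared accesses conflict and how one extra word per row removes the conflict (purely generic code, not matmul_backward_bias_kernel7):

```cuda
// Generic illustration, not the PR's kernel: a 32x32 tile transpose. Column reads of
// a [32][32] tile all land in the same bank; padding each row to 33 words spreads a
// warp's accesses across all 32 banks.
#include <cuda_runtime.h>

constexpr int TILE = 32;

__global__ void transpose_padded(float* out, const float* in, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];   // +1 word of padding per row kills the conflict

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height) {
        tile[threadIdx.y][threadIdx.x] = in[(size_t)y * width + x];  // coalesced load
    }
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;     // transposed block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width) {
        // reading tile[threadIdx.x][threadIdx.y] walks a column of the tile; with the
        // +1 padding the 32 threads of a warp hit 32 different banks instead of one
        out[(size_t)y * height + x] = tile[threadIdx.x][threadIdx.y];
    }
}
// launch: dim3 block(32, 32); dim3 grid((width + 31) / 32, (height + 31) / 32);
```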

The fused classifier is 4 separate optimisations:

  1. Increase the number of resident threads by using __launch_bounds__, which forces the compiler to limit register usage so that 2 blocks of 1024 threads fit per SM.
  2. Add the missing .cs (cache streaming / low persistence) modifier to the final store of the gradients (overwriting the logits). This massively reduces the cache footprint and prevents L1/L2 thrashing after increasing occupancy in (1).
  3. Split both loops into a "multiple-of-x128-size" part and a "bounds-checked remainder" part so the critical path is as clean as possible (unfortunately this results in a bit of code duplication, but not enough to justify splitting that logic into a small function imo).
  4. Use templates for the WriteLogits and WriteProbs conditions so they are known at compile time, avoiding any branching in the critical path (all four points are combined in the toy sketch below).

(Plus a fix to blockReduce: its out_of_bounds value needs to be -FLT_MAX, the identity for the max reduction, rather than the default 0.0f.)
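To make points (1)-(4) concrete, here is a hedged toy sketch (not fused_classifier_kernel5; the kernel name, VEC, and the padded row stride P are illustrative assumptions) that combines __launch_bounds__, a .cs streaming store via __stcs, the split main/remainder loops, and a compile-time template flag:

```cuda
// Hedged toy sketch, not fused_classifier_kernel5: it only scales each row of
// `logits`, but shows the four mechanisms from the list above in one place.
#include <cuda_runtime.h>

constexpr int VEC = 4;  // floats per 128-bit access (an x128 of bf16 would be 8 elements)

template <bool WriteDLogits>                    // (4) compile-time flag, no runtime branch
__global__ void __launch_bounds__(1024, 2)      // (1) cap registers so 2x1024 threads fit per SM
row_scale_sketch(float* logits, const float* scale, int V, int P) {  // P: padded stride, multiple of VEC
    const float s = scale[blockIdx.x];
    float* row = logits + (size_t)blockIdx.x * P;

    // (3a) main loop: whole 128-bit chunks only, no bounds checks on the critical path
    const int v_main = (V / VEC) * VEC;
    for (int i = threadIdx.x * VEC; i < v_main; i += blockDim.x * VEC) {
        float4 x = *reinterpret_cast<const float4*>(row + i);
        x.x *= s; x.y *= s; x.z *= s; x.w *= s;
        if (WriteDLogits) {
            // (2) .cs streaming store: written once and never re-read by this kernel,
            // so keep it out of L1/L2 now that occupancy is higher
            __stcs(reinterpret_cast<float4*>(row + i), x);
        }
    }
    // (3b) bounds-checked remainder, kept off the hot path
    for (int i = v_main + threadIdx.x; i < V; i += blockDim.x) {
        if (WriteDLogits) { __stcs(row + i, row[i] * s); }
    }
    // (with WriteDLogits == false this toy does no work at all; the real kernel also
    //  computes the loss, which is why the flag exists in the first place)
}
// e.g. row_scale_sketch<true><<<num_rows, 1024>>>(logits, scale, V, P);
```

Instantiating row_scale_sketch&lt;true&gt; vs row_scale_sketch&lt;false&gt; bakes the decision in at compile time, which is the same trick as the WriteLogits / WriteProbs template parameters in point (4).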

@karpathy (Owner) commented May 7, 2024

(The CI failure is spurious; I pushed a fix to master. I'll try out the speed of this PR tomorrow, assuming it is a bit faster.)

@ademeure (Contributor, Author) commented May 7, 2024

It's +4% on A100, which is more than I expected (quite a bit less on RTX 4090). With GPU clocks locked at 1275MHz on an A100 SXM4 40GB:

BEFORE: step 2/12: train loss 4.591615 (acc 4.591615) (127.596542 ms, 192607.046875 tok/s)
AFTER: step 2/12: train loss 4.592581 (acc 4.592581) (122.571777 ms, 200502.906250 tok/s)

This is the full "after" report from profile_gpt2cu.py (the four GiB columns are DRAM reads, DRAM writes, L2 reads, and L2 writes):

                                                          ms     GB/s   core %      GiB      GiB      GiB      GiB    MInst
...........................................................................................................................
00 enc×1   encoder_forward_kernel3                      0.06   1220.3      0.0     0.05     0.02     0.07     0.04     8.18
01 fwd×12  layernorm_forward_kernel3                    0.76    948.6      0.0     0.45     0.27     0.43     0.44   209.09
02 fwd×12  ampere_bf16                                  5.53    299.6     79.8     0.50     1.16    15.20     1.27   745.17
03 fwd×12  cudnn_generated_fort_native_sdpa             4.53    390.0     51.6     1.36     0.41     4.22     0.44   976.97
04 fwd×12  ampere_bf16                                  2.01    381.7     75.4     0.47     0.30     5.07     0.42   248.39
05 fwd×12  residual_forward_kernel                      0.86   1359.9      0.0     0.91     0.26     0.84     0.42    38.93
06 fwd×12  layernorm_forward_kernel3                    0.85   1080.3      0.0     0.45     0.46     0.43     0.44   209.09
07 fwd×12  ampere_bf16                                  7.25    293.9     80.6     0.52     1.61    20.26     1.69   993.56
08 fwd×12  gelu_forward_kernel2                         2.74   1247.5      4.4     1.81     1.60     1.69     1.69   913.05
09 fwd×12  ampere_bf16                                  7.36    341.4     88.6     2.14     0.37    15.19     0.42   653.60
10 fwd×12  residual_forward_kernel                      0.86   1362.8      0.0     0.91     0.26     0.84     0.42    38.93
11 cls×1   layernorm_forward_kernel3                    0.07   1084.1      0.0     0.04     0.04     0.04     0.04    17.42
12 cls×1   ampere_bf16                                  9.61    434.6     81.8     1.72     2.46    27.63     2.30  1225.41
13 cls×1   fused_classifier_kernel5                     3.74   1320.2      0.0     2.48     2.45     2.39     2.30   918.23
14 cls×1   ampere_bf16                                  8.36    391.5     95.1     3.09     0.19    20.87     0.18   764.03
15 cls×1   ampere_bf16                                  8.12    374.0     97.5     2.97     0.07    20.80     0.07   730.77
16 cls×1   layernorm_backward_kernel8                   0.15   1003.0      0.1     0.11     0.03     0.12     0.07    26.73
17 bwd×12  matmul_backward_bias_kernel7                 0.41   1160.8      0.0     0.45     0.03     0.42     0.00    33.96
18 bwd×12  cast_and_add_kernel                          0.04      3.2      0.0     0.00     0.00     0.00     0.00     0.01
19 bwd×12  ampere_bf16                                  6.99    304.6     83.6     0.51     1.62    20.25     1.69   872.35
20 bwd×12  ampere_bf16                                  6.26    481.9     93.7     2.93     0.09    15.35     0.16   556.36
21 bwd×12  gelu_backward_kernel                         4.48   1166.9      2.4     3.62     1.60     3.38     1.69  1645.61
22 bwd×12  matmul_backward_bias_kernel7                 1.30   1414.3      0.0     1.81     0.03     1.69     0.00   104.84
23 bwd×12  cast_and_add_kernel                          0.04     10.0      0.0     0.00     0.00     0.00     0.00     0.02
24 bwd×12  ampere_bf16                                  7.19    349.2     90.6     2.14     0.37    15.19     0.42   618.60
25 bwd×12  ampere_bf16                                  6.25    418.2     93.7     2.53     0.08    15.35     0.16   556.17
26 bwd×12  layernorm_backward_kernel8                   1.78    987.3      0.1     1.37     0.39     1.47     0.80   320.80
27 bwd×12  matmul_backward_bias_kernel7                 0.42   1158.6      0.0     0.45     0.03     0.42     0.00    34.04
28 bwd×12  cast_and_add_kernel                          0.04      3.2      0.0     0.00     0.00     0.00     0.00     0.01
29 bwd×12  ampere_bf16                                  1.96    390.8     77.6     0.47     0.30     5.06     0.42   218.09
30 bwd×12  ampere_bf16                                  1.62    575.6     91.4     0.91     0.03     5.06     0.04   145.89
31 bwd×12  cublasLt::splitKreduce_kernel                0.17    325.4      1.5     0.06     0.00     0.05     0.01    12.83
32 bwd×12  cudnn::fusion::compute_dot_do_o              0.78   1215.1      0.0     0.91     0.04     0.84     0.02    80.51
33 bwd×12  cudnn_generated_fort_native_sdpa            13.84    318.0     41.1     2.75     1.65     4.76     0.84  1862.59
34 bwd×12  cudnn::fusion::rearrange_n_convert_dq        1.73   1194.8      0.0     0.91     1.16     0.84     1.56   321.16
35 bwd×12  matmul_backward_bias_kernel7                 1.17   1351.0      0.0     1.36     0.23     1.27     0.00    81.07
36 bwd×12  cast_and_add_kernel                          0.04      6.1      0.0     0.00     0.00     0.00     0.00     0.02
37 bwd×12  ampere_bf16                                  5.46    352.0     89.6     1.55     0.37    11.39     0.42   491.53
38 bwd×12  ampere_bf16                                  4.65    405.0     94.6     1.85     0.03    11.47     0.08   409.59
39 bwd×12  layernorm_backward_kernel8                   1.76   1000.4      0.1     1.37     0.39     1.48     0.80   320.55
40 enc×1   encoder_backward_kernel                      0.21    492.1      0.0     0.08     0.03     0.11     0.00    65.18
41 init×0  copy_and_cast_kernel                         0.64   1136.5      0.0     0.25     0.48     0.23     0.46    62.24
42 opt×1   adamw_kernel3                                2.57   1348.3      0.0     1.74     1.73     1.62     1.62   350.09
...........................................................................................................................
           Total                                      134.02    536.4     64.7    49.74    22.14   253.56    23.37 17819.40

Kernel type summaries:
  name                                       time   frac   count
  ampere_bf16                               88.63  66.13%    147
  cudnn_generated_fort_native_sdpa          18.37  13.70%     24
  gelu_backward_kernel                       4.48   3.34%     12
  fused_classifier_kernel5                   3.74   2.79%      1
  layernorm_backward_kernel8                 3.69   2.75%     25
  matmul_backward_bias_kernel7               3.31   2.47%     48
  gelu_forward_kernel2                       2.74   2.04%     12
  adamw_kernel3                              2.57   1.92%      1
  cudnn::fusion::rearrange_n_convert_dq      1.73   1.29%     12
  residual_forward_kernel                    1.71   1.28%     24
  layernorm_forward_kernel3                  1.68   1.25%     25
  cudnn::fusion::compute_dot_do_o            0.78   0.58%     12
  copy_and_cast_kernel                       0.64   0.48%      0
  encoder_backward_kernel                    0.21   0.16%      1
  cublasLt::splitKreduce_kernel              0.17   0.13%     12
  cast_and_add_kernel                        0.17   0.13%     48
  encoder_forward_kernel3                    0.06   0.04%      1

In total, a training step takes 134.0ms, distributed as:
  0.3ms (0.2%) in the encoder,
  32.7ms (24.4%) in forward blocks,
  30.1ms (22.4%) in the classifier part,
  68.4ms (51.0%) in backward blocks, and
  2.6ms (1.9%) in the optimizer.

We read 49.7GiB (371.1GB/s) and write 22.1GiB (165.2GB/s) to DRAM,
read 253.6GiB (1891.9GB/s) and write 23.4GiB (174.4GB/s) to L2,
and execute 17.8 billion instructions (133.0 GInst/s).

Assuming that every kernel should be either fully DRAM bandwidth or tensor core limited,
with a peak DRAM bandwidth of 1414.3GB/s and a peak tensor throughput of 100.0%,
our overall efficiency is 81.6%.

For comparison, here's the before:

  name                                       time   frac   count
  ampere_bf16                               88.62  63.15%    147
  cudnn_generated_fort_native_sdpa          18.31  13.05%     24
  layernorm_backward_kernel7                 7.47   5.32%     25
  fused_classifier_kernel3                   4.91   3.50%      1
  matmul_backward_bias_kernel6               4.67   3.33%     48
  gelu_backward_kernel                       4.48   3.19%     12
  gelu_forward_kernel2                       2.75   1.96%     12
  adamw_kernel3                              2.57   1.83%      1
  residual_forward_kernel                    1.73   1.23%     24
  cudnn::fusion::rearrange_n_convert_dq      1.73   1.23%     12
  layernorm_forward_kernel3                  1.70   1.21%     25
  cudnn::fusion::compute_dot_do_o            0.77   0.55%     12
  copy_and_cast_kernel                       0.64   0.46%      0
  encoder_backward_kernel                    0.21   0.15%      1
  cublasLt::splitKreduce_kernel              0.18   0.13%     12
  cast_and_add_kernel                        0.17   0.12%     48
  encoder_forward_kernel3                    0.06   0.04%      1

Kernel-level gains from this PR:

  • layernorm_backward: +100% performance (~75% of DRAM BW, very occupancy-dependent, a bit hacky... is there a better way?)
  • matmul_backward_bias: +40% performance (~100% of DRAM BW)
  • fused_classifier: +30% performance (~95% of DRAM BW)

At one point I managed to get layernorm an extra 20% faster on top of this by playing with occupancy settings and cache-streaming hints, but I can't remember the exact settings and can't seem to replicate it anymore. It might be worth looking into that and how to generalise it at some point, but layernorm is already so close to peak after these changes that it's very much diminishing returns.

After this, pretty much the only obvious things left on A100 are:

  • Residual + Layernorm Forward merging that ngc92 is working on.
  • GELU merging (good luck with cuBLASLt + BF16, sigh, might have to wait).
  • Better use of CUDA streams to avoid CPU synchronisations and improve memset/memcpy overlap (a generic sketch follows below).
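On that last point, here is a generic sketch of the kind of stream usage meant (plain CUDA runtime API, not llm.c code; all buffer, stream, and event names are made up): keep memsets and host copies on a separate non-blocking stream, order them against compute with an event, and synchronise only the stream whose result the CPU actually needs instead of doing a full device sync.

```cuda
// Generic sketch, not llm.c code: overlap a memset and a device->host copy with
// compute, and avoid stalling the CPU with cudaDeviceSynchronize.
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    float *grads, *scratch, *host_buf;
    cudaMalloc(&grads, n * sizeof(float));
    cudaMalloc(&scratch, n * sizeof(float));
    cudaMallocHost(&host_buf, n * sizeof(float));   // pinned memory, needed for async copies

    cudaStream_t compute, copy;
    cudaStreamCreateWithFlags(&compute, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&copy, cudaStreamNonBlocking);
    cudaEvent_t grads_ready;
    cudaEventCreateWithFlags(&grads_ready, cudaEventDisableTiming);

    // ... enqueue backward kernels that produce `grads` on the `compute` stream here ...
    cudaEventRecord(grads_ready, compute);

    // overlap: zero the scratch buffer on the copy stream while compute keeps running
    cudaMemsetAsync(scratch, 0, n * sizeof(float), copy);
    // the copy stream (not the CPU) waits for the gradients, then pulls them to the host
    cudaStreamWaitEvent(copy, grads_ready, 0);
    cudaMemcpyAsync(host_buf, grads, n * sizeof(float), cudaMemcpyDeviceToHost, copy);

    // only synchronise the one stream whose result the CPU actually needs
    cudaStreamSynchronize(copy);

    cudaEventDestroy(grads_ready);
    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
    cudaFreeHost(host_buf);
    cudaFree(scratch);
    cudaFree(grads);
    return 0;
}
```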

karpathy merged commit 5b07090 into karpathy:master on May 7, 2024
6 checks passed
@karpathy (Owner) commented May 7, 2024

Confirming I also saw a ~5% lift on my end, very cool!!
