
Optimisations for layernorm_backward / matmul_backward_bias / fused_classifier #378

Merged
merged 4 commits into karpathy:master on May 7, 2024

Conversation

@ademeure (Contributor) commented May 7, 2024

These are fairly difficult optimisations to describe; hopefully the comments are enough! I'd focus on the changes in train_gpt2.cu rather than the similar ones in /dev/cuda/ (I didn't include a dev version of the new bias kernel; that file is very out of date and needs other changes).

layernorm_backward needed x128, but with the additional complexity that atomics are normally 32-bit rather than 128-bit, so implementing this naively resulted in an 8-way shared-memory bank conflict and terrible performance! The fix is to do all the accumulation in a bank-friendly order and only reorder just before the final write to global memory. This is kind of an annoying side-effect of x128; I think Hopper has a native 128-bit atomicAdd, though.
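As a rough illustration of that pattern (and definitely not the PR's actual layernorm_backward_kernel8), here is a hedged sketch of a per-channel bias-gradient accumulation using plain float and a made-up VEC=4 vector width instead of llm.c's x128: accumulate into shared memory in a swizzled, bank-friendly layout with 32-bit atomics, then reorder just once before the wide write out.

```cuda
// Hedged sketch only: dbias_sketch, VEC and the trivial cross-block reduction are
// illustrative assumptions, not llm.c code. Assumes C is a multiple of VEC and that
// dout rows are 16-byte aligned.
#include <cuda_runtime.h>

constexpr int VEC = 4;  // floats per 128-bit load/store

__global__ void dbias_sketch(float* dbias, const float* dout, int N, int C) {
    extern __shared__ float scratch[];          // C floats, swizzled layout
    const int ngroups = C / VEC;                // number of VEC-wide channel groups

    for (int i = threadIdx.x; i < C; i += blockDim.x) scratch[i] = 0.0f;
    __syncthreads();

    // grid-stride over (row, channel-group) pairs; consecutive threads get
    // consecutive channel groups, which is what makes the layout below work
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < N * ngroups;
         idx += gridDim.x * blockDim.x) {
        const int row = idx / ngroups;
        const int cg  = idx % ngroups;
        const float4 d = *reinterpret_cast<const float4*>(dout + (size_t)row * C + cg * VEC);

        // Swizzled index scratch[k * ngroups + cg]: for a fixed k, a warp's 32 threads
        // hit 32 consecutive words, i.e. 32 different banks, so the 32-bit atomics do
        // not serialize. The naive scratch[cg * VEC + k] would hit only 32/VEC banks
        // (an 8-way conflict with an 8-element x128 of bf16).
        atomicAdd(&scratch[0 * ngroups + cg], d.x);
        atomicAdd(&scratch[1 * ngroups + cg], d.y);
        atomicAdd(&scratch[2 * ngroups + cg], d.z);
        atomicAdd(&scratch[3 * ngroups + cg], d.w);
    }
    __syncthreads();

    // reorder back to channel-major exactly once, just before leaving shared memory,
    // so each thread again owns a contiguous group of channels for the write out
    // (the cross-block reduction is kept as simple per-element global atomics here)
    for (int cg = threadIdx.x; cg < ngroups; cg += blockDim.x) {
        for (int k = 0; k < VEC; k++) {
            atomicAdd(&dbias[cg * VEC + k], scratch[k * ngroups + cg]);
        }
    }
}
// launch sketch: dbias_sketch<<<grid, 512, C * sizeof(float)>>>(dbias, dout, N, C);
```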

matmul_backward_bias is roughly the same story, except that kernel6 was already the bottleneck and needed fixing: it was limited by shared-memory bank conflicts.
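For completeness, the textbook way this class of shared-memory bank conflict is usually broken is padding. The PR's kernel instead fixes the order of the accumulation, but a padded-tile transpose is the simplest self-contained illustration of why column-wise shared accesses conflict and how one extra word per row removes the conflict (purely generic code, not matmul_backward_bias_kernel7):

```cuda
// Generic illustration, not the PR's kernel: a 32x32 tile transpose. Column reads of
// a [32][32] tile all land in the same bank; padding each row to 33 words spreads a
// warp's accesses across all 32 banks.
#include <cuda_runtime.h>

constexpr int TILE = 32;

__global__ void transpose_padded(float* out, const float* in, int width, int height) {
    __shared__ float tile[TILE][TILE + 1];   // +1 word of padding per row kills the conflict

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height) {
        tile[threadIdx.y][threadIdx.x] = in[(size_t)y * width + x];  // coalesced load
    }
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;     // transposed block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width) {
        // reading tile[threadIdx.x][threadIdx.y] walks a column of the tile; with the
        // +1 padding the 32 threads of a warp hit 32 different banks instead of one
        out[(size_t)y * height + x] = tile[threadIdx.x][threadIdx.y];
    }
}
// launch: dim3 block(32, 32); dim3 grid((width + 31) / 32, (height + 31) / 32);
```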

The fused classifier is 4 separate optimisations:

  1. Increase the number of resident threads by using __launch_bounds__, which forces the compiler to limit register usage so that 2 blocks of 1024 threads fit per SM.
  2. Add the missing .cs (cache streaming / low persistence) modifier to the final store of the gradients (overwriting the logits). This massively reduces the cache footprint and prevents L1/L2 thrashing after increasing occupancy in (1).
  3. Split both loops into a "multiple-of-x128-size" part and a "bounds-checked remainder" part so the critical path is as clean as possible (unfortunately this results in a bit of code duplication, but not enough to justify splitting that logic into a small function imo).
  4. Use templates for the WriteLogits and WriteProbs conditions so they are known at compile time, avoiding any branching in the critical path (all four points are combined in the toy sketch below).

(Plus a fix to blockReduce: its out_of_bounds value needs to be -FLT_MAX, the identity for the max reduction, rather than the default 0.0f.)
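To make points (1)-(4) concrete, here is a hedged toy sketch (not fused_classifier_kernel5; the kernel name, VEC, and the padded row stride P are illustrative assumptions) that combines __launch_bounds__, a .cs streaming store via __stcs, the split main/remainder loops, and a compile-time template flag:

```cuda
// Hedged toy sketch, not fused_classifier_kernel5: it only scales each row of
// `logits`, but shows the four mechanisms from the list above in one place.
#include <cuda_runtime.h>

constexpr int VEC = 4;  // floats per 128-bit access (an x128 of bf16 would be 8 elements)

template <bool WriteDLogits>                    // (4) compile-time flag, no runtime branch
__global__ void __launch_bounds__(1024, 2)      // (1) cap registers so 2x1024 threads fit per SM
row_scale_sketch(float* logits, const float* scale, int V, int P) {  // P: padded stride, multiple of VEC
    const float s = scale[blockIdx.x];
    float* row = logits + (size_t)blockIdx.x * P;

    // (3a) main loop: whole 128-bit chunks only, no bounds checks on the critical path
    const int v_main = (V / VEC) * VEC;
    for (int i = threadIdx.x * VEC; i < v_main; i += blockDim.x * VEC) {
        float4 x = *reinterpret_cast<const float4*>(row + i);
        x.x *= s; x.y *= s; x.z *= s; x.w *= s;
        if (WriteDLogits) {
            // (2) .cs streaming store: written once and never re-read by this kernel,
            // so keep it out of L1/L2 now that occupancy is higher
            __stcs(reinterpret_cast<float4*>(row + i), x);
        }
    }
    // (3b) bounds-checked remainder, kept off the hot path
    for (int i = v_main + threadIdx.x; i < V; i += blockDim.x) {
        if (WriteDLogits) { __stcs(row + i, row[i] * s); }
    }
    // (with WriteDLogits == false this toy does no work at all; the real kernel also
    //  computes the loss, which is why the flag exists in the first place)
}
// e.g. row_scale_sketch<true><<<num_rows, 1024>>>(logits, scale, V, P);
```

Instantiating row_scale_sketch&lt;true&gt; vs row_scale_sketch&lt;false&gt; bakes the decision in at compile time, which is the same trick as the WriteLogits / WriteProbs template parameters in point (4).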

@karpathy (Owner) commented May 7, 2024

(The CI failure is spurious; I pushed a fix to master. I'll try out the speed of this PR tomorrow, assuming it is a bit faster.)

@ademeure (Contributor, Author) commented May 7, 2024

It's +4% on A100, which is more than I expected (quite a bit less on RTX 4090). With GPU clocks locked at 1275MHz on an A100 SXM4 40GB:

BEFORE: step 2/12: train loss 4.591615 (acc 4.591615) (127.596542 ms, 192607.046875 tok/s)
AFTER: step 2/12: train loss 4.592581 (acc 4.592581) (122.571777 ms, 200502.906250 tok/s)

This is the full "after" report from profile_gpt2cu.py (the four GiB columns are DRAM reads, DRAM writes, L2 reads, and L2 writes):

                                                          ms     GB/s   core %      GiB      GiB      GiB      GiB    MInst
...........................................................................................................................
00 enc×1   encoder_forward_kernel3                      0.06   1220.3      0.0     0.05     0.02     0.07     0.04     8.18
01 fwd×12  layernorm_forward_kernel3                    0.76    948.6      0.0     0.45     0.27     0.43     0.44   209.09
02 fwd×12  ampere_bf16                                  5.53    299.6     79.8     0.50     1.16    15.20     1.27   745.17
03 fwd×12  cudnn_generated_fort_native_sdpa             4.53    390.0     51.6     1.36     0.41     4.22     0.44   976.97
04 fwd×12  ampere_bf16                                  2.01    381.7     75.4     0.47     0.30     5.07     0.42   248.39
05 fwd×12  residual_forward_kernel                      0.86   1359.9      0.0     0.91     0.26     0.84     0.42    38.93
06 fwd×12  layernorm_forward_kernel3                    0.85   1080.3      0.0     0.45     0.46     0.43     0.44   209.09
07 fwd×12  ampere_bf16                                  7.25    293.9     80.6     0.52     1.61    20.26     1.69   993.56
08 fwd×12  gelu_forward_kernel2                         2.74   1247.5      4.4     1.81     1.60     1.69     1.69   913.05
09 fwd×12  ampere_bf16                                  7.36    341.4     88.6     2.14     0.37    15.19     0.42   653.60
10 fwd×12  residual_forward_kernel                      0.86   1362.8      0.0     0.91     0.26     0.84     0.42    38.93
11 cls×1   layernorm_forward_kernel3                    0.07   1084.1      0.0     0.04     0.04     0.04     0.04    17.42
12 cls×1   ampere_bf16                                  9.61    434.6     81.8     1.72     2.46    27.63     2.30  1225.41
13 cls×1   fused_classifier_kernel5                     3.74   1320.2      0.0     2.48     2.45     2.39     2.30   918.23
14 cls×1   ampere_bf16                                  8.36    391.5     95.1     3.09     0.19    20.87     0.18   764.03
15 cls×1   ampere_bf16                                  8.12    374.0     97.5     2.97     0.07    20.80     0.07   730.77
16 cls×1   layernorm_backward_kernel8                   0.15   1003.0      0.1     0.11     0.03     0.12     0.07    26.73
17 bwd×12  matmul_backward_bias_kernel7                 0.41   1160.8      0.0     0.45     0.03     0.42     0.00    33.96
18 bwd×12  cast_and_add_kernel                          0.04      3.2      0.0     0.00     0.00     0.00     0.00     0.01
19 bwd×12  ampere_bf16                                  6.99    304.6     83.6     0.51     1.62    20.25     1.69   872.35
20 bwd×12  ampere_bf16                                  6.26    481.9     93.7     2.93     0.09    15.35     0.16   556.36
21 bwd×12  gelu_backward_kernel                         4.48   1166.9      2.4     3.62     1.60     3.38     1.69  1645.61
22 bwd×12  matmul_backward_bias_kernel7                 1.30   1414.3      0.0     1.81     0.03     1.69     0.00   104.84
23 bwd×12  cast_and_add_kernel                          0.04     10.0      0.0     0.00     0.00     0.00     0.00     0.02
24 bwd×12  ampere_bf16                                  7.19    349.2     90.6     2.14     0.37    15.19     0.42   618.60
25 bwd×12  ampere_bf16                                  6.25    418.2     93.7     2.53     0.08    15.35     0.16   556.17
26 bwd×12  layernorm_backward_kernel8                   1.78    987.3      0.1     1.37     0.39     1.47     0.80   320.80
27 bwd×12  matmul_backward_bias_kernel7                 0.42   1158.6      0.0     0.45     0.03     0.42     0.00    34.04
28 bwd×12  cast_and_add_kernel                          0.04      3.2      0.0     0.00     0.00     0.00     0.00     0.01
29 bwd×12  ampere_bf16                                  1.96    390.8     77.6     0.47     0.30     5.06     0.42   218.09
30 bwd×12  ampere_bf16                                  1.62    575.6     91.4     0.91     0.03     5.06     0.04   145.89
31 bwd×12  cublasLt::splitKreduce_kernel                0.17    325.4      1.5     0.06     0.00     0.05     0.01    12.83
32 bwd×12  cudnn::fusion::compute_dot_do_o              0.78   1215.1      0.0     0.91     0.04     0.84     0.02    80.51
33 bwd×12  cudnn_generated_fort_native_sdpa            13.84    318.0     41.1     2.75     1.65     4.76     0.84  1862.59
34 bwd×12  cudnn::fusion::rearrange_n_convert_dq        1.73   1194.8      0.0     0.91     1.16     0.84     1.56   321.16
35 bwd×12  matmul_backward_bias_kernel7                 1.17   1351.0      0.0     1.36     0.23     1.27     0.00    81.07
36 bwd×12  cast_and_add_kernel                          0.04      6.1      0.0     0.00     0.00     0.00     0.00     0.02
37 bwd×12  ampere_bf16                                  5.46    352.0     89.6     1.55     0.37    11.39     0.42   491.53
38 bwd×12  ampere_bf16                                  4.65    405.0     94.6     1.85     0.03    11.47     0.08   409.59
39 bwd×12  layernorm_backward_kernel8                   1.76   1000.4      0.1     1.37     0.39     1.48     0.80   320.55
40 enc×1   encoder_backward_kernel                      0.21    492.1      0.0     0.08     0.03     0.11     0.00    65.18
41 init×0  copy_and_cast_kernel                         0.64   1136.5      0.0     0.25     0.48     0.23     0.46    62.24
42 opt×1   adamw_kernel3                                2.57   1348.3      0.0     1.74     1.73     1.62     1.62   350.09
...........................................................................................................................
           Total                                      134.02    536.4     64.7    49.74    22.14   253.56    23.37 17819.40

Kernel type summaries:
  name                                       time   frac   count
  ampere_bf16                               88.63  66.13%    147
  cudnn_generated_fort_native_sdpa          18.37  13.70%     24
  gelu_backward_kernel                       4.48   3.34%     12
  fused_classifier_kernel5                   3.74   2.79%      1
  layernorm_backward_kernel8                 3.69   2.75%     25
  matmul_backward_bias_kernel7               3.31   2.47%     48
  gelu_forward_kernel2                       2.74   2.04%     12
  adamw_kernel3                              2.57   1.92%      1
  cudnn::fusion::rearrange_n_convert_dq      1.73   1.29%     12
  residual_forward_kernel                    1.71   1.28%     24
  layernorm_forward_kernel3                  1.68   1.25%     25
  cudnn::fusion::compute_dot_do_o            0.78   0.58%     12
  copy_and_cast_kernel                       0.64   0.48%      0
  encoder_backward_kernel                    0.21   0.16%      1
  cublasLt::splitKreduce_kernel              0.17   0.13%     12
  cast_and_add_kernel                        0.17   0.13%     48
  encoder_forward_kernel3                    0.06   0.04%      1

In total, a training step takes 134.0ms, distributed as:
  0.3ms (0.2%) in the encoder,
  32.7ms (24.4%) in forward blocks,
  30.1ms (22.4%) in the classifier part,
  68.4ms (51.0%) in backward blocks, and
  2.6ms (1.9%) in the optimizer.

We read 49.7GiB (371.1GB/s) and write 22.1GiB (165.2GB/s) to DRAM,
read 253.6GiB (1891.9GB/s) and write 23.4GiB (174.4GB/s) to L2,
and execute 17.8 billion instructions (133.0 GInst/s).

Assuming that every kernel should be either fully DRAM bandwidth or tensor core limited,
with a peak DRAM bandwidth of 1414.3GB/s and a peak tensor throughput of 100.0%,
our overall efficiency is 81.6%.

For comparison, here's the before:

  name                                       time   frac   count
  ampere_bf16                               88.62  63.15%    147
  cudnn_generated_fort_native_sdpa          18.31  13.05%     24
  layernorm_backward_kernel7                 7.47   5.32%     25
  fused_classifier_kernel3                   4.91   3.50%      1
  matmul_backward_bias_kernel6               4.67   3.33%     48
  gelu_backward_kernel                       4.48   3.19%     12
  gelu_forward_kernel2                       2.75   1.96%     12
  adamw_kernel3                              2.57   1.83%      1
  residual_forward_kernel                    1.73   1.23%     24
  cudnn::fusion::rearrange_n_convert_dq      1.73   1.23%     12
  layernorm_forward_kernel3                  1.70   1.21%     25
  cudnn::fusion::compute_dot_do_o            0.77   0.55%     12
  copy_and_cast_kernel                       0.64   0.46%      0
  encoder_backward_kernel                    0.21   0.15%      1
  cublasLt::splitKreduce_kernel              0.18   0.13%     12
  cast_and_add_kernel                        0.17   0.12%     48
  encoder_forward_kernel3                    0.06   0.04%      1

Kernel-level gains from this PR:

  • layernorm_backward: +100% performance (~75% of DRAM BW, very occupancy-dependent, a bit hacky... is there a better way?)
  • matmul_backward_bias: +40% performance (~100% of DRAM BW)
  • fused_classifier: +30% performance (~95% of DRAM BW)

At one point I managed to get layernorm an extra 20% faster on top of this by playing with occupancy settings and cache-streaming hints, but I can't remember the exact settings and can't seem to replicate it anymore. It might be worth looking into that and how to generalise it at some point, but layernorm is already so close to peak after these changes that it's very much diminishing returns.

After this, pretty much the only obvious things left on A100 are:

  • Residual + Layernorm Forward merging that ngc92 is working on.
  • GELU merging (good luck with cuBLASLt + BF16, sigh, might have to wait).
  • Better use of CUDA streams to avoid CPU synchronisations and improve memset/memcpy overlap (a generic sketch follows below).
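On that last point, here is a generic sketch of the kind of stream usage meant (plain CUDA runtime API, not llm.c code; all buffer, stream, and event names are made up): keep memsets and host copies on a separate non-blocking stream, order them against compute with an event, and synchronise only the stream whose result the CPU actually needs instead of doing a full device sync.

```cuda
// Generic sketch, not llm.c code: overlap a memset and a device->host copy with
// compute, and avoid stalling the CPU with cudaDeviceSynchronize.
#include <cuda_runtime.h>

int main() {
    const size_t n = 1 << 20;
    float *grads, *scratch, *host_buf;
    cudaMalloc(&grads, n * sizeof(float));
    cudaMalloc(&scratch, n * sizeof(float));
    cudaMallocHost(&host_buf, n * sizeof(float));   // pinned memory, needed for async copies

    cudaStream_t compute, copy;
    cudaStreamCreateWithFlags(&compute, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&copy, cudaStreamNonBlocking);
    cudaEvent_t grads_ready;
    cudaEventCreateWithFlags(&grads_ready, cudaEventDisableTiming);

    // ... enqueue backward kernels that produce `grads` on the `compute` stream here ...
    cudaEventRecord(grads_ready, compute);

    // overlap: zero the scratch buffer on the copy stream while compute keeps running
    cudaMemsetAsync(scratch, 0, n * sizeof(float), copy);
    // the copy stream (not the CPU) waits for the gradients, then pulls them to the host
    cudaStreamWaitEvent(copy, grads_ready, 0);
    cudaMemcpyAsync(host_buf, grads, n * sizeof(float), cudaMemcpyDeviceToHost, copy);

    // only synchronise the one stream whose result the CPU actually needs
    cudaStreamSynchronize(copy);

    cudaEventDestroy(grads_ready);
    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
    cudaFreeHost(host_buf);
    cudaFree(scratch);
    cudaFree(grads);
    return 0;
}
```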

karpathy merged commit 5b07090 into karpathy:master on May 7, 2024
6 checks passed
@karpathy (Owner) commented May 7, 2024

Confirming I also saw a ~5% lift on my end, very cool!!
