
Support for FP16/BF16 in train_gpt2.cu (1.86x Perf) #218

Merged
merged 12 commits into karpathy:master on Apr 23, 2024

Conversation

@ademeure (Contributor) commented Apr 22, 2024

Now finished and reasonably happy with it!
1.86x speedup on my RTX 4090:

  • FP32: ~80ms
  • BF16: ~43ms (with layernorm params in FP32, but all activations in BF16)

This allows the same train_gpt2.cu to work as full FP32, full BF16, full FP16, or full (BF/FP)16 + FP32 layernorm, simply by changing the define at the top of the file. I also included stochastic rounding for the Adam kernel (but nowhere else at this point; possibly worth adding to gradients in general when we move to FP8?).
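
For intuition, here is a minimal sketch of what stochastic rounding from FP32 to BF16 can look like in CUDA. It is an illustration rather than the exact helper in this PR; the function name and the caller-supplied 16-bit random value (e.g. from a cheap hash of a seed and the element index) are assumptions.

#include <cuda_bf16.h>

// Sketch: round an FP32 value to BF16 so the result is unbiased in expectation.
// BF16 is the top 16 bits of the FP32 bit pattern, so adding a random offset in
// [0, 0xFFFF] to the discarded low bits before truncating rounds up with
// probability proportional to the discarded fraction. (NaN/Inf not handled.)
__device__ __nv_bfloat16 stochastic_round_bf16(float value, unsigned int rand16) {
    unsigned int bits = __float_as_uint(value);
    bits += (rand16 & 0xFFFFu);                        // may carry into the kept bits
    return __float2bfloat16_rz(__uint_as_float(bits)); // truncate the low 16 bits
}

// e.g. in the Adam kernel, compute the updated parameter in FP32, then:
//   params_bf16[i] = stochastic_round_bf16(param_fp32, rand16_for_element_i);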

To simplify the logic compared to the first version of the PR, all activation tensors are now always "floatX"; we cannot mix-and-match. However, because atomicAdd on 16-bit values in some of the backward kernels is HORRIBLY slow (10x slower or worse), and because this kind of flexibility seems useful in general for layernorm accuracy, layernorm is kept at FP32 by defining "floatN" as "float".

I reduced the amount of code duplication by using very lightweight templates for the kernel types. It's still a BIG change though; unfortunately, I don't think there's any way around that!
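
As a rough illustration of the template approach (a generic example, not one of the PR's kernels verbatim): the element type becomes a template parameter, loads are upcast to FP32 for the math, and the result is cast back on store, so one kernel body covers float, __half and __nv_bfloat16.

template <typename T>
__global__ void gelu_forward_kernel(T* out, const T* inp, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float x = (float)inp[i];   // upcast: compute in FP32 regardless of storage type
        float cube = 0.044715f * x * x * x;
        out[i] = (T)(0.5f * x * (1.0f + tanhf(0.7978845608f * (x + cube))));
    }
}

// host code then only ever instantiates it with the configured type, e.g.:
//   gelu_forward_kernel<floatX><<<grid, block>>>(out, inp, N);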

@ademeure ademeure marked this pull request as draft April 22, 2024 15:23
@ademeure (Contributor Author)

It is trivial to use the exact same code with everything in FP32: at the top of train_gpt2.cu, simply replace this:

typedef __nv_bfloat16 floatX;
#define CUBLAS_LOWP CUDA_R_16BF

with this:

typedef float floatX;
#define CUBLAS_LOWP CUDA_R_32F
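
One natural way to put both configurations behind the single define mentioned above is a small preprocessor switch. This is a sketch of the idea rather than the file's exact layout, and the ENABLE_BF16 / ENABLE_FP16 macro names are placeholders:

#if defined(ENABLE_BF16)
typedef __nv_bfloat16 floatX;
#define CUBLAS_LOWP CUDA_R_16BF
#elif defined(ENABLE_FP16)
typedef __half floatX;
#define CUBLAS_LOWP CUDA_R_16F
#else
typedef float floatX;
#define CUBLAS_LOWP CUDA_R_32F
#endif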


This is now able to train in BF16 for many layers, and it kinda-sorta works for test_gpt2.cu, though the loss converges much more slowly than FP32 for now (need to debug how to improve that afterwards):

LOSS MISMATCH AT STEP 1: 4.598247 4.059707
LOSS MISMATCH AT STEP 2: 4.152971 3.375123
LOSS MISMATCH AT STEP 3: 3.828835 2.800783
LOSS MISMATCH AT STEP 4: 3.538793 2.315382
LOSS MISMATCH AT STEP 5: 3.260888 1.849029
LOSS MISMATCH AT STEP 6: 3.000814 1.394656
LOSS MISMATCH AT STEP 7: 2.768756 0.999147
LOSS MISMATCH AT STEP 8: 2.557551 0.624080
LOSS MISMATCH AT STEP 9: 2.352901 0.376511

It does eventually converge:

step 99: loss 0.001294 (took 8.712021 ms)

@ademeure (Contributor Author)

Debugged the BF16 convergence issue and fixed it by adding stochastic rounding support. Also simplified the code by making all activations the same type (params can still be a different type, partly due to severe perf issues with atomicAdd otherwise).

The PR is now in a good state in my opinion where it's worth thinking about what it would take to integrate it.

@ademeure ademeure changed the title WIP support for FP16/BF16 in train_gpt2.cu (compiles, not correct yet) Support for FP16/BF16 in train_gpt2.cu (1.86x Perf) Apr 23, 2024
@ademeure ademeure marked this pull request as ready for review April 23, 2024 05:59
@ademeure (Contributor Author)

BTW this approach should work perfectly fine for FP8 as well. The main issue (besides loss scaling) to get that working is that cuBLAS non-Lt doesn't support FP8 at all, so we can't use StridedBatched GEMMs for attention, we need padding to move the other cuBLAS calls to Lt, etc.

By hacking things so all the "cannot be FP8" GEMMs stay at BF16 while halving k (obviously not functionally correct), I got FP8 to run at ~29.5ms (vs ~43ms for BF16). So it does seem to scale reasonably well despite suffering a little bit from Amdahl's Law.

It should be possible to keep gradients as e5m2 and everything else as e4m3 by just adding a "floatG" for e5m2 and casting appropriately, since the storage requirements are the same. What would not work without a LOT more complexity is using types with different sizes (e.g. activations at FP8 and gradients at BF16), but I think we agreed we shouldn't really need that anytime soon.
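
For example (a sketch of the idea, assuming the cuda_fp8.h types; the names and casts are placeholders, not code from this PR):

#include <cuda_fp8.h>

typedef __nv_fp8_e4m3 floatX;   // activations/weights: more mantissa bits
typedef __nv_fp8_e5m2 floatG;   // gradients: more exponent range

// Both types are 1 byte, so buffer sizes and indexing stay identical; only the
// casts at load/store time differ, e.g. inside a backward kernel:
//   float g = (float)grads[i];            // e5m2 -> FP32 for the math
//   grads[i] = (floatG)(g_new * scale);   // FP32 -> e5m2 (with loss scaling)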

@karpathy (Owner)

merging this. we'll iterate in master.

@karpathy karpathy merged commit 6b6ad35 into karpathy:master Apr 23, 2024
2 of 3 checks passed