Full BF16 including layernorms by default (minimising number of BF16 atomics) #272

Merged
8 commits merged into karpathy:master on Apr 29, 2024

Conversation

ademeure
Contributor

I added 4 different new versions of layernorm_backward_kernel; performance is best for:

  • Kernel 4: uses atomicCAS with no scratch memory, but rounds to BF16 many times, so numerical accuracy is probably worse (see the sketch below).
  • Kernel 6 ==> new default: FP32 scratchpad with a single final rounding to BF16 at the end (could be made stochastic in the future).
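
For reference, a rough sketch of the atomicCAS idea behind Kernel 4 (illustrative only; the helper name and exact code are not from this PR): the BF16 element is updated by CAS-looping on the aligned 32-bit word that contains it, so no FP32 scratch memory is needed, but the value is re-rounded to BF16 on every attempt.

#include <cuda_bf16.h>

// Illustrative sketch only: emulate an atomic add on a __nv_bfloat16 by
// CAS-looping on the aligned 32-bit word that contains it.
__device__ void atomic_add_bf16(__nv_bfloat16* addr, float val) {
    unsigned int* base = (unsigned int*)((size_t)addr & ~(size_t)3);  // aligned 32-bit word
    bool hi = ((size_t)addr & 2) != 0;                                // which 16-bit half we own
    unsigned int old = *base, assumed;
    do {
        assumed = old;
        unsigned short bits = hi ? (unsigned short)(assumed >> 16) : (unsigned short)(assumed & 0xFFFFu);
        float cur = __bfloat162float(__ushort_as_bfloat16(bits));
        unsigned short upd = __bfloat16_as_ushort(__float2bfloat16(cur + val));  // re-rounds on every retry
        unsigned int next = hi ? ((assumed & 0x0000FFFFu) | ((unsigned int)upd << 16))
                               : ((assumed & 0xFFFF0000u) | (unsigned int)upd);
        old = atomicCAS(base, assumed, next);
    } while (old != assumed);
}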

We probably just want to integrate Kernel 6, but we might want to add all of them to /dev/cuda/ in the future. I haven't fixed the BF16 atomics in encoder_backward yet; the performance penalty there is much smaller, but it might still be worth doing manually with atomicCAS so we can implement stochastic rounding there as well.

Performance on my RTX 4090 is a tiny bit faster (potentially noise) than with the previous mixed FP32/BF16.
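
For reference, a simplified sketch of the FP32-scratchpad idea behind Kernel 6 (illustrative only; the kernel name and signature are made up): blocks accumulate their partial sums into a global FP32 scratch buffer with ordinary float atomics, and a final step rounds to BF16 exactly once.

#include <cuda_bf16.h>

// Simplified sketch, not the PR's actual kernel. During the backward pass each block does
//   atomicAdd(&scratch_dbias[i], partial);   // FP32 atomics into a global scratch buffer
// and a final pass converts the FP32 accumulators to BF16 exactly once:
__global__ void finalize_to_bf16(__nv_bfloat16* dbias, __nv_bfloat16* dweight,
                                 const float* scratch_dbias, const float* scratch_dweight, int C) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < C) {
        // the only BF16 rounding step; this is where stochastic rounding could be added later
        dbias[i]   = __float2bfloat16(__bfloat162float(dbias[i])   + scratch_dbias[i]);
        dweight[i] = __float2bfloat16(__bfloat162float(dweight[i]) + scratch_dweight[i]);
    }
}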

@karpathy
Owner

I'm assuming this is not meant to merge as-is? Would it make sense to put most of this into dev/cuda and then cherry-pick the layernorm we want to use, as usual, into train_gpt2.cu?

train_gpt2.cu Outdated
__global__ void layernorm_backward_kernel3(Tdinp* dinp, Tparams* dweight, Tparams* dbias,
const Tdout* dout, const Trest* inp, const Tparams* weight, const Trest* mean, const Trest* rstd,
int B, int T, int C) {
extern __shared__ float shared[]; // size = 2 * C
Contributor

(2*C+1) * 4?
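
That is, if the kernel stores a uint flag after the 2*C floats, the dynamic shared memory request at the launch site presumably needs (2*C + 1) * 4 bytes; an illustrative launch (not the PR's exact code):

// illustrative only: reserve room for 2*C floats plus the trailing uint flag
size_t shared_mem_size = (2 * C + 1) * sizeof(float);
layernorm_backward_kernel3<<<grid_size, block_size, shared_mem_size>>>(
    dinp, dweight, dbias, dout, inp, weight, mean, rstd, B, T, C);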

karpathy marked this pull request as draft on April 28, 2024, 21:09
train_gpt2.cu Outdated
dbias_shared[i] = 0.0f;
dweight_shared[i] = 0.0f;
}
uint *tmp_flag = (uint*)(shared + C*2);
Contributor

is this actually used in this kernel?

train_gpt2.cu Outdated
dbias_shared[i] = 0.0f;
dweight_shared[i] = 0.0f;
}
uint *tmp_flag = (uint*)(shared + C*2);
Contributor

same here

@ademeure
Contributor Author

Thanks! Fixed those bugs, kept only Kernel 6 in train_gpt2.cu, and added all the kernels to /dev/cuda/.

I think this might be the first /dev/cuda change with BF16, so I added support for comparing CPU and GPU data of different types (converting GPU data to CPU type before comparison) in validate_result().

ademeure marked this pull request as ready for review on April 29, 2024, 03:29
template<class D, class T>
void validate_result(D* device_result, const T* cpu_reference, const char* name, std::size_t num_elements, T tolerance=1e-4) {
    D* out_gpu = (D*)malloc(num_elements * sizeof(T));
    cudaCheck(cudaMemcpy(out_gpu, device_result, num_elements * sizeof(T), cudaMemcpyDeviceToHost));
Contributor

sizeof(D)?

Contributor Author

Good catch, fixed!
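
For reference, after the fix the mixed-type comparison presumably looks roughly like this (a sketch, not the exact /dev/cuda code; cudaCheck is the repo's existing error-checking macro): the GPU buffer is copied back in its own type D, and each element is converted before comparing against the CPU reference.

#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

template<class D, class T>
void validate_result(D* device_result, const T* cpu_reference, const char* name,
                     std::size_t num_elements, T tolerance = 1e-4) {
    D* out_gpu = (D*)malloc(num_elements * sizeof(D));   // sizeof(D), as pointed out above
    cudaCheck(cudaMemcpy(out_gpu, device_result, num_elements * sizeof(D), cudaMemcpyDeviceToHost));
    for (std::size_t i = 0; i < num_elements; i++) {
        float gpu_val = (float)out_gpu[i];               // convert the GPU type (e.g. BF16) before comparing
        if (fabs(gpu_val - (float)cpu_reference[i]) > (float)tolerance) {
            printf("Mismatch of %s at %zu: CPU reference %f vs GPU %f\n",
                   name, i, (float)cpu_reference[i], gpu_val);
            exit(EXIT_FAILURE);
        }
    }
    free(out_gpu);
}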

karpathy merged commit 8be7370 into karpathy:master on Apr 29, 2024
5 checks passed