
Updated adamw to use packed data types #303

Open
wants to merge 3 commits into master

Conversation

ChrisDryden
Contributor

Before runtime:
total average iteration time: 38.547570 ms

After runtime:
total average iteration time: 37.901735 ms

Kernel development file specs (barely noticeable with the current test suite):
Before: time gpu 0.0098 ms
After:  time gpu 0.0097 ms

Contributor

@ademeure ademeure left a comment

I think we need feedback from others on what the right approach is here; we didn't think of these issues during the f128 discussion :(

x128 packed_params_memory = load128(params_memory+(i*x128::size));
f128 packed_m_memory = load128(m_memory+(i*f128::size));
f128 packed_v_memory = load128(v_memory+(i*f128::size));
for(int k = 0; k < packed_v_memory.size; ++k){
Contributor

This is iterating based on the size of an f128 = 4 elements, but packed_grads and packed_params are x128 = 8 elements (for BF16), so I think this means we are loading twice as much data as we need for the latter and wasting it (or, hopefully, the compiler optimises the loads away and we end up with LDG.64? That might be OK if so, tbh).

Ideally we'd assert that the number of elements in an x128 is an integer multiple of the number in an f128 (e.g. 1/2/4), and the kernel would work on the larger of the two element counts, with both an inner and an outer loop... more complicated than I expected, and one case where the fetch8 approach that always fetches 8 elements would have been very slightly simpler :(
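
As a minimal sketch, the divisibility check itself could just be a compile-time assertion (x128/f128 are the repo's existing Packed128 aliases; the exact wording is only illustrative):

// compile-time check that an x128 holds a whole number of f128 chunks (e.g. 8/4 = 2 for BF16 params)
static_assert(x128::size % f128::size == 0, "x128::size must be an integer multiple of f128::size");

The outer/inner loop structure that goes with it is what the full kernel later in this thread implements.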

@ngc92 do you have any thoughts on how this should work with your FP16 moments changes? Potentially we could just be lazy, combine both changes, and assert that the sizeof() of the params is the same as the sizeof() of the moments?

Contributor Author

Did some testing with hard-coding the load as 64-bit and didn't see a noticeable time difference; I think the majority of the latency is from the warp stalls, which appear to be the same in number whether it's a 64-bit or a 128-bit read.

Contributor

Let's not hard-code the assumption that these are the same datatype. Let's just code this kernel properly, so that it iterates correctly.

Contributor

we can generally assume sizeof(param) <= sizeof(moment), right? That should be enough for a simple implementation.

Contributor Author

I'll add that assumption and the boundary checks in the kernel instantiation.

f128 packed_m_memory = load128(m_memory+(i*f128::size));
f128 packed_v_memory = load128(v_memory+(i*f128::size));
for(int k = 0; k < packed_v_memory.size; ++k){
if (i*4 + k >= num_parameters) return; // guard
Contributor

Can we get rid of this guard by asserting "(num_parameters % 4) == 0" outside the kernel?

Contributor Author

I really like the idea of removing all of the bounds checks and bringing them outside the kernel. I'm fairly confident we have some kernels with incorrect sizing inputs that are only masked by having the bounds check in the kernel, the fused kernel with the softmax being one example.

Contributor

We really should codify our assumption in some prominent place, and ensure that any model we generate has nice enough shapes. Additional asserts won't hurt, though.
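
As one possible way to codify it, a check like the following could run once where the model is built (param_sizes and NUM_PARAMETER_TENSORS are the existing names in train_gpt2.cu; the check itself is only a hypothetical sketch):

// hypothetical shape check: every parameter tensor must hold a multiple of x128::size elements,
// so the vectorized optimizer kernels can drop their per-element bounds guards
for (int i = 0; i < NUM_PARAMETER_TENSORS; i++) {
    assert(model->param_sizes[i] % x128::size == 0);
}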

train_gpt2.cu Outdated
@@ -1917,7 +1927,7 @@ void gpt2_update(GPT2 *model, float learning_rate, float beta1, float beta2, flo
}

int block_size = 512;
int num_blocks = CEIL_DIV(model->num_parameters, block_size);
int num_blocks = CEIL_DIV(model->num_parameters, block_size)/x128::size;
Contributor

Should the division by x128::size be inside the CEIL_DIV?

Contributor Author

I was having difficulty getting it to compile in that format in the kernel file; if you know what's going on there and what's blocking it, I would love to know.

Contributor Author

I think I should just be able to typecast it; I will modify it to follow the format inside the CEIL_DIV.
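
For reference, the typecast version might look something like the following (a sketch only; CEIL_DIV is the repo's existing macro, and the cast is just one way to keep the mixed size_t/int arithmetic consistent):

int block_size = 512;
int num_blocks = CEIL_DIV(model->num_parameters, (size_t)(block_size * x128::size));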

@ademeure
Contributor

ademeure commented May 2, 2024

I think the problem was that it can't work with only 1 loop; it was skipping some of the elements for some of the arrays because of the different f128/x128 sizes. Here's my attempt at fixing that in Chris' kernel, which seems to work (haven't looked into perf yet):

__global__ void adamw_kernel4(floatX* params_memory, const floatX* grads_memory, float* m_memory, float* v_memory, size_t num_parameters,
                              float learning_rate, float beta1, float beta2, float beta1_correction, float beta2_correction, float eps, float weight_decay,
                              unsigned int seed) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idx_offset = idx*x128::size;
    if (idx_offset >= num_parameters) { return; }

    x128 packed_grads_memory = load128(grads_memory + idx_offset);
    x128 packed_params_memory = load128(params_memory + idx_offset);
    for (int n = 0; n < (x128::size / f128::size); n++) {
        int idx_n_offset = idx_offset + n*f128::size;
        f128 packed_m_memory = load128(m_memory + idx_n_offset);
        f128 packed_v_memory = load128(v_memory + idx_n_offset);
        for(int k = 0; k < f128::size; ++k){
            int k_n_offset = k + n*f128::size;
            float grad = (float)packed_grads_memory[k_n_offset];
            float m = packed_m_memory[k];
            float v = packed_v_memory[k];
            // update the first moment (momentum)
            m = lerp(grad, m, beta1);
            packed_m_memory[k] = m;
            // update the second moment (RMSprop)
            v = lerp(grad * grad, v, beta2);
            packed_v_memory[k] = v;
            m /= beta1_correction; // Setting these values explicitly due to compiler error for modifying
            v /= beta2_correction; // packed128 values when using
            // update the parameters (weight/bias)
            float param = (float)packed_params_memory[k_n_offset] - (learning_rate * (m / (sqrtf(v) + eps) + weight_decay * (float)packed_params_memory[k_n_offset]));
            unsigned int random = Get2dNoiseUint(threadIdx.x, blockIdx.x, seed);
            // todo - explain stochastic rounding here
            stochastic_rounding(param, &packed_params_memory[k_n_offset], random);
        }
        store128(m_memory + idx_n_offset, packed_m_memory);
        store128(v_memory + idx_n_offset, packed_v_memory);
    }
    store128(params_memory + idx_offset, packed_params_memory);
}

@ChrisDryden
Contributor Author

Updated the PR to show the new kernel; it does give a speedup in the train loop for me of:
total average iteration time: 38.287047 ms
to
total average iteration time: 37.143633 ms

params_memory[i] -= learning_rate * (m / (sqrtf(v) + eps) + weight_decay * (float) params_memory[i]);
}

// Optimized kernel to use lower precision data types for params memory and grads memory
Contributor

Is this comment accurate? kernel2 also uses floatX.

void adamw_dispatch4(floatX* params_memory, const floatX* grads_memory, float* m_memory, float* v_memory, long num_parameters,
                     float learning_rate, float beta1, float beta2, float beta1_correction, float beta2_correction, float eps, float weight_decay) {
    unsigned int block_size = 512;
    assert(num_parameters % 4 == 0 && f128::size <= x128::size); // asserting here to not require bounds check in kernel
Contributor

@ngc92 ngc92 May 2, 2024

f128::size <= x128::size is a compile-time property; best make that a static_assert inside the actual kernel. Also, num_parameters % x128::size == 0 would be the safer choice, I think.
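
A sketch of those two adjustments (hypothetical placement, not committed code):

// at the top of adamw_kernel4: the relationship between the packed types is known at compile time
static_assert(f128::size <= x128::size, "moment vectors must not be wider than param vectors");

// in adamw_dispatch4, instead of the % 4 check:
assert(num_parameters % x128::size == 0); // still correct if x128::size is ever != 4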

}
store128(m_memory+(i*f128::size), packed_m_memory);
store128(v_memory+(i*f128::size), packed_v_memory);
store128(params_memory+(i*x128::size), packed_params_memory);
Contributor

For sizeof(f128) != sizeof(x128), I believe this write might result in a race condition. Probably not in practice, because the optimizer ends up with just a 64-bit store.
