fp16 buffers for ADAM #289

Open

ngc92 wants to merge 2 commits into master from fp16-adam

Conversation

ngc92 (Contributor) commented Apr 29, 2024

First proof-of-concept implementation

ngc92 (Contributor, Author) commented Apr 29, 2024

Instead of having a single scale factor per tensor, we have one scale for each group of 32 values. This is less about getting more accuracy (though it might help with that) and more about ensuring that we don't need any form of cross-warp communication to handle the scales.
I'd expect the group size of 32 to increase once we switch to vectorized Adam kernels anyway.
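Roughly, the write side could look like the sketch below (illustrative only, not the kernel in this PR; the name `write_scaled_fp16` and the buffer layout are made up). Each warp reduces the absolute maximum of its 32-value group with shuffles, so there is no shared memory or cross-warp synchronization, and one fp16 scale is stored per group.

```cuda
// Illustrative sketch only: one fp16 scale per group of 32 values, computed
// entirely within a warp. Assumes blockDim.x is a multiple of 32; the kernel
// name and buffer layout are hypothetical.
#include <cuda_fp16.h>

__global__ void write_scaled_fp16(const float* src, half* dst, half* scales, long n) {
    long idx = (long)blockIdx.x * blockDim.x + threadIdx.x;
    float v = (idx < n) ? src[idx] : 0.0f;          // out-of-range lanes contribute 0
    // warp-wide max of |v| via shuffles; every lane in the group of 32 ends with the same max
    float absmax = fabsf(v);
    for (int offset = 16; offset > 0; offset /= 2) {
        absmax = fmaxf(absmax, __shfl_xor_sync(0xFFFFFFFFu, absmax, offset));
    }
    float scale = (absmax > 0.0f) ? absmax : 1.0f;  // avoid division by zero for all-zero groups
    if (idx < n) {
        dst[idx] = __float2half(v / scale);         // scaled values land in [-1, 1]
        if ((threadIdx.x & 31) == 0) {
            scales[idx / 32] = __float2half(scale); // one scale per group of 32
        }
    }
}
```

On the read side, the Adam kernel would reconstruct each value as `__half2float(dst[i]) * __half2float(scales[i / 32])` before updating it.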

karpathy (Owner) commented

I think I'm missing a bit of context on this PR. Is this following some paper / approach?

ngc92 (Contributor, Author) commented Apr 29, 2024

It comes from the appendix of "Efficient Large Scale Language Modeling with Mixtures of Experts", which in turn cites "Jukebox: A Generative Model for Music".

However, this is not actually a 1:1 implementation of that. If you want a single scaling factor per tensor, the Adam kernel needs to know which tensor it is currently in (see my other draft Adam PR). It also requires synchronization, because you need to process the entire tensor, determine the max, scale accordingly, and only then write to memory.

Having one scale factor per group requires more memory (though the amount should still be negligible, especially since I assume the group size will increase once we use vector loads here).
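Back-of-envelope: with one fp16 scale per group of 32 fp16 values, the extra storage is 2 bytes per 64 bytes, i.e. about 3% on top of the buffers themselves, and that fraction shrinks further if the group size grows with vectorized loads.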

ngc92 (Contributor, Author) commented Apr 29, 2024

Rebased on the latest changes from master.
I used #288 to generate a gpt2-large model. Without this patch, training at batch size 1 requires 12658MiB; with the fp16 buffers, this goes down to 9892MiB.

Sadly, that's still not enough to let me test gpt2-xl on my 16GB card, even at batch size 1.
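As a rough sanity check (taking gpt2-large at roughly 774M parameters): storing the two Adam buffers in fp16 instead of fp32 saves about 774e6 × 2 buffers × 2 bytes ≈ 2.9GiB, which is roughly in line with the observed drop of 12658MiB − 9892MiB ≈ 2.7GiB once the per-group scales are accounted for.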

ngc92 force-pushed the fp16-adam branch 2 times, most recently from 3252b88 to 1d96fe5 on April 29, 2024 at 23:47