Use proper GeLU on CPU #441

Open · wants to merge 1 commit into master
Conversation

@jart commented May 21, 2024

This change removes the tanh GeLU approximation in favor of the exact erf-based formula. This gives us better accuracy, roughly equal performance, and strict standard conformance, since we no longer need any compiler-specific tricks.
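
For reference, here's roughly what the two variants look like (a minimal sketch with illustrative names, not the exact llm.c code):

#include <math.h>

/* tanh approximation (before this change):
   0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3))) */
float gelu_tanh(float x) {
    const float sqrt_2_over_pi = 0.7978845608f;
    float cube = 0.044715f * x * x * x;
    return 0.5f * x * (1.0f + tanhf(sqrt_2_over_pi * (x + cube)));
}

/* exact GeLU (after this change): x * Phi(x), where Phi is the
   standard normal CDF, computed via the error function erff() */
float gelu_exact(float x) {
    return 0.5f * x * (1.0f + erff(x / sqrtf(2.0f)));
}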

Here's the last lines of train_gpt2 output before this change:

step 37: train loss 3.739647 (took 598.548076 ms)
step 38: train loss 4.611735 (took 596.626145 ms)
step 39: train loss 3.970751 (took 598.439552 ms)
val loss 4.016658
generating:
---
Come Running Away,
Greater conquer
With the Imperial blood
the heaviest host of the gods
into this wondrous world beyond.
I will not back thee, for how sweet after birth
Netflix against repounder,
will not
flourish against the earlocks of
Allay
---
step 40: train loss 4.377756 (took 592.704936 ms)

Here's the last lines of train_gpt2 output after this change:

step 37: train loss 3.731596 (took 594.893995 ms)
step 38: train loss 4.561646 (took 600.064035 ms)
step 39: train loss 3.933512 (took 599.666173 ms)
val loss 4.014135
generating:
---
Whether Hipocrates,
Bigon Nicinius, or rep'd
With Thy fair winter-tail your outraged hand,
The richness of the good smour
Nine years by turns covered my Member. Thou art
Nay, I fear be; but
Lets o' thee know, if it
---
step 40: train loss 4.358461 (took 597.594065 ms)

This change has the disadvantage of diverging from PyTorch. I view this as justified and worthwhile, for numerous reasons, e.g.

"I used the tanh approximation simply because the error function erf was slow in tensorflow some years ago. If the exact version is fast enough now and does not have numerical issues, I do not see a reason to use an inexact version." ──Quoth Dan Hendrycks

See pytorch/pytorch#39853

@dagelf (Contributor) commented May 21, 2024

Sure works! ... Is this applicable to the CUDA version? How will this affect fine tuning? (Or a hypothetical retraining run of the base model?) (I'm still learning A LOT here)

It's even a few ms faster 😅

Good way to run benchmarks:

( kill -STOP -1  # Stop all processes, NB don't run this outside a script or screen if remote!
timeout 40s ./train_gpt2
kill -CONT -1 )

Benchmark:

$ grep model /proc/cpuinfo |tail -1
model name	: Intel(R) Core(TM) i3-9100F CPU @ 3.60GHz

(this)
step 1: train loss 4.451209 (took 4816.851841 ms)
step 2: train loss 4.662212 (took 4816.346237 ms)
step 3: train loss 4.672174 (took 4817.421769 ms)
step 4: train loss 4.670977 (took 4810.751746 ms)
step 5: train loss 4.335294 (took 4807.962372 ms)

vs

(previous)
step 0: train loss 5.356185 (took 5332.631576 ms)
step 1: train loss 4.301033 (took 4840.134017 ms)
step 2: train loss 4.623316 (took 4828.423850 ms)
step 3: train loss 4.600415 (took 4828.398214 ms)
step 4: train loss 4.616777 (took 4829.080307 ms)
step 5: train loss 4.231482 (took 4858.988674 ms)

(Note to self, different activation functions and resources: #168 and optimizations: master...dagelf:llm.c:activation_function_tests_cpu)

Wow, what CPU is that?! Also, maybe this would pique your interest: #253
I'm curious what iteration speeds pytorch gets on your CPU.

@jart (Author) commented May 21, 2024

> Is this applicable to the CUDA version?

Haven't tried.

> How will this affect fine tuning?

No idea.

> Wow, what CPU is that?!

It's an AMD Ryzen Threadripper PRO 7995WX.

> different activation functions and resources

Your fastest activation function is going to be vectorized SiLU (https://news.ycombinator.com/item?id=40371612). erff() is a lot simpler than tanhf(), but SiLU uses expf(), which is even simpler and less branchy. Here's that erff() implementation, with a SiLU sketch after it:

#include <math.h>  /* for fmaf(), fabsf(), expf() */

/* Efficient implementation of erff()
   using either a pure polynomial approximation or
   the exponential of a polynomial.
   Worst-case error is 1.09ulps at 0x1.c111acp-1.
   From the Optimized Routines by Arm Limited. */
float erff(float x) {
    union {
        float f;
        unsigned i;
    } pun = {x};
    float r, x2, u;
    unsigned ix = pun.i;
    unsigned sign = ix >> 31;
    unsigned ia12 = (pun.i >> 20) & 0x7ff;
    if (ia12 < 0x3f6) {
        if (ia12 >= 0x318) {
            x2 = x * x;
            r = -0x1.3a1a82p-11f;
            r = fmaf(r, x2, +0x1.473f48p-08f);
            r = fmaf(r, x2, -0x1.b68bd2p-06f);
            r = fmaf(r, x2, +0x1.ce1a46p-04f);
            r = fmaf(r, x2, -0x1.8126e0p-02f);
            r = fmaf(r, x2, +0x1.06eba6p-03f);
            r = fmaf(r, x, x);
        } else {
            if (ia12 >= 0x040)
                r = x + 0x1.06eba8p-3f * x;
            else
                r = fmaf(0x1.06eba8p-3f, x, x);
        }
    } else if (ia12 < 0x408) {
        float a = fabsf(x);
        r = fmaf(0x1.222900p-16f, a, -0x1.91d2ccp-12f);
        u = fmaf(0x1.fd1336p-9f, a, -0x1.8d6300p-6f);
        x2 = x * x;
        r = fmaf(r, x2, u);
        r = fmaf(r, a, 0x1.b55cb0p-4f);
        r = fmaf(r, a, 0x1.450aa0p-1f);
        r = fmaf(r, a, 0x1.079d0cp-3f);
        r = fmaf(r, a, a);
        r = expf(-r);
        if (sign)
            r = -1.f + r;
        else
            r = 1.f - r;
    } else {
        if (ia12 < 0x7f8) {
            if (sign)
                r = -1.f;
            else
                r = 1.f;
        } else {
            r = (1.f - (float)((ix >> 31) << 1)) + 1.f / x;
        }
    }
    return r;
}
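
For comparison, SiLU itself is a one-liner on top of expf() (a minimal sketch; the function name is illustrative, not from llm.c):

#include <math.h>

/* SiLU (a.k.a. swish): x * sigmoid(x) = x / (1 + exp(-x)).
   Needs only expf(), no tanhf() or erff(). */
float silu(float x) {
    return x / (1.0f + expf(-x));
}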

> I'm curious what iteration speeds pytorch gets on your CPU.

iteration 1, loss: 5.2700, time: 347.339ms, tok/s: 737.03, norm: 60.996
iteration 2, loss: 4.0607, time: 295.749ms, tok/s: 865.60, norm: 17.079
iteration 3, loss: 3.3165, time: 294.709ms, tok/s: 868.65, norm: 14.776
iteration 4, loss: 2.7115, time: 294.072ms, tok/s: 870.54, norm: 13.203
iteration 5, loss: 2.1703, time: 295.474ms, tok/s: 866.40, norm: 12.374
iteration 6, loss: 1.6350, time: 296.282ms, tok/s: 864.04, norm: 10.551
iteration 7, loss: 1.1419, time: 295.474ms, tok/s: 866.41, norm: 9.788
iteration 8, loss: 0.7040, time: 294.553ms, tok/s: 869.11, norm: 7.976
iteration 9, loss: 0.3771, time: 294.848ms, tok/s: 868.24, norm: 6.243
iteration 10, loss: 0.1743, time: 294.942ms, tok/s: 867.97, norm: 3.609

@karpathy (Owner) commented
Hi @jart, it's nice to see you stop by! I don't think I can merge this because, for educational and historic reasons, I am trying to keep the current version of the code compatible with GPT-2 and the checkpoints that OpenAI has released. It's possible that in the future we'll diverge from exact GPT-2, and this change would make a lot more sense then, but in that case we'd probably also shift from GeLU to something that (probably?) works a bit better - GeGLU / SwiGLU etc.
