Use proper GeLU on CPU #441

Open · wants to merge 1 commit into master
Conversation

@jart commented May 21, 2024

This change removes the tanh GeLU approximation in favor of the exact erf-based formula. This gives us better accuracy, roughly equal performance, and strict standard conformance, since we no longer need any compiler-specific tricks.
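
For reference, here's roughly what the two variants look like (a minimal sketch with illustrative names, not the exact llm.c code):

#include <math.h>

/* tanh approximation (before this change):
   0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3))) */
float gelu_tanh(float x) {
    const float sqrt_2_over_pi = 0.7978845608f;
    float cube = 0.044715f * x * x * x;
    return 0.5f * x * (1.0f + tanhf(sqrt_2_over_pi * (x + cube)));
}

/* exact GeLU (after this change): x * Phi(x), where Phi is the
   standard normal CDF, computed via the error function erff() */
float gelu_exact(float x) {
    return 0.5f * x * (1.0f + erff(x / sqrtf(2.0f)));
}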

Here's the last lines of train_gpt2 output before this change:

step 37: train loss 3.739647 (took 598.548076 ms)
step 38: train loss 4.611735 (took 596.626145 ms)
step 39: train loss 3.970751 (took 598.439552 ms)
val loss 4.016658
generating:
---
Come Running Away,
Greater conquer
With the Imperial blood
the heaviest host of the gods
into this wondrous world beyond.
I will not back thee, for how sweet after birth
Netflix against repounder,
will not
flourish against the earlocks of
Allay
---
step 40: train loss 4.377756 (took 592.704936 ms)

Here's the last lines of train_gpt2 output after this change:

step 37: train loss 3.731596 (took 594.893995 ms)
step 38: train loss 4.561646 (took 600.064035 ms)
step 39: train loss 3.933512 (took 599.666173 ms)
val loss 4.014135
generating:
---
Whether Hipocrates,
Bigon Nicinius, or rep'd
With Thy fair winter-tail your outraged hand,
The richness of the good smour
Nine years by turns covered my Member. Thou art
Nay, I fear be; but
Lets o' thee know, if it
---
step 40: train loss 4.358461 (took 597.594065 ms)

This change has the disadvantage of diverging from PyTorch. I view this as justified and worthwhile, for numerous reasons, e.g.

"I used the tanh approximation simply because the error function erf was slow in tensorflow some years ago. If the exact version is fast enough now and does not have numerical issues, I do not see a reason to use an inexact version." ──Quoth Dan Hendrycks

See pytorch/pytorch#39853

@dagelf (Contributor) commented May 21, 2024

Sure works! ... Is this applicable to the CUDA version? How will this affect fine tuning? (Or a hypothetical retraining run of the base model?) (I'm still learning A LOT here)

It's even a few ms faster 😅

Good way to run benchmarks:

( kill -STOP -1  # Stop all processes, NB don't run this outside a script or screen if remote!
timeout 40s ./train_gpt2
kill -CONT -1 )

Benchmark:

$ grep model /proc/cpuinfo |tail -1
model name	: Intel(R) Core(TM) i3-9100F CPU @ 3.60GHz

(this)
step 1: train loss 4.451209 (took 4816.851841 ms)
step 2: train loss 4.662212 (took 4816.346237 ms)
step 3: train loss 4.672174 (took 4817.421769 ms)
step 4: train loss 4.670977 (took 4810.751746 ms)
step 5: train loss 4.335294 (took 4807.962372 ms)

vs

(previous)
step 0: train loss 5.356185 (took 5332.631576 ms)
step 1: train loss 4.301033 (took 4840.134017 ms)
step 2: train loss 4.623316 (took 4828.423850 ms)
step 3: train loss 4.600415 (took 4828.398214 ms)
step 4: train loss 4.616777 (took 4829.080307 ms)
step 5: train loss 4.231482 (took 4858.988674 ms)

(Note to self, different activation functions and resources: #168 and optimizations: master...dagelf:llm.c:activation_function_tests_cpu)

Wow, what CPU is that?! Also, maybe this would pique your interest: #253
I'm curious what iteration speeds pytorch gets on your CPU.

@jart (Author) commented May 21, 2024

> Is this applicable to the CUDA version?

Haven't tried.

> How will this affect fine tuning?

No idea.

> Wow, what CPU is that?!

It's an AMD Ryzen Threadripper PRO 7995WX.

> different activation functions and resources

Your fastest activation function is going to be vectorized SiLU (https://news.ycombinator.com/item?id=40371612). erff() is a lot simpler than tanhf(), but SiLU uses expf(), which is even simpler and less branchy. Here's that erff() implementation, with a SiLU sketch after it:

#include <math.h>  /* for fmaf(), fabsf(), expf() */

/* Efficient implementation of erff()
   using either a pure polynomial approximation or
   the exponential of a polynomial.
   Worst-case error is 1.09ulps at 0x1.c111acp-1.
   From the Optimized Routines by Arm Limited. */
float erff(float x) {
    union {
        float f;
        unsigned i;
    } pun = {x};
    float r, x2, u;
    unsigned ix = pun.i;
    unsigned sign = ix >> 31;
    unsigned ia12 = (pun.i >> 20) & 0x7ff;
    if (ia12 < 0x3f6) {
        if (ia12 >= 0x318) {
            x2 = x * x;
            r = -0x1.3a1a82p-11f;
            r = fmaf(r, x2, +0x1.473f48p-08f);
            r = fmaf(r, x2, -0x1.b68bd2p-06f);
            r = fmaf(r, x2, +0x1.ce1a46p-04f);
            r = fmaf(r, x2, -0x1.8126e0p-02f);
            r = fmaf(r, x2, +0x1.06eba6p-03f);
            r = fmaf(r, x, x);
        } else {
            if (ia12 >= 0x040)
                r = x + 0x1.06eba8p-3f * x;
            else
                r = fmaf(0x1.06eba8p-3f, x, x);
        }
    } else if (ia12 < 0x408) {
        float a = fabsf(x);
        r = fmaf(0x1.222900p-16f, a, -0x1.91d2ccp-12f);
        u = fmaf(0x1.fd1336p-9f, a, -0x1.8d6300p-6f);
        x2 = x * x;
        r = fmaf(r, x2, u);
        r = fmaf(r, a, 0x1.b55cb0p-4f);
        r = fmaf(r, a, 0x1.450aa0p-1f);
        r = fmaf(r, a, 0x1.079d0cp-3f);
        r = fmaf(r, a, a);
        r = expf(-r);
        if (sign)
            r = -1.f + r;
        else
            r = 1.f - r;
    } else {
        if (ia12 < 0x7f8) {
            if (sign)
                r = -1.f;
            else
                r = 1.f;
        } else {
            r = (1.f - (float)((ix >> 31) << 1)) + 1.f / x;
        }
    }
    return r;
}
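
For comparison, SiLU itself is a one-liner on top of expf() (a minimal sketch; the function name is illustrative, not from llm.c):

#include <math.h>

/* SiLU (a.k.a. swish): x * sigmoid(x) = x / (1 + exp(-x)).
   Needs only expf(), no tanhf() or erff(). */
float silu(float x) {
    return x / (1.0f + expf(-x));
}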

> I'm curious what iteration speeds pytorch gets on your CPU.

iteration 1, loss: 5.2700, time: 347.339ms, tok/s: 737.03, norm: 60.996
iteration 2, loss: 4.0607, time: 295.749ms, tok/s: 865.60, norm: 17.079
iteration 3, loss: 3.3165, time: 294.709ms, tok/s: 868.65, norm: 14.776
iteration 4, loss: 2.7115, time: 294.072ms, tok/s: 870.54, norm: 13.203
iteration 5, loss: 2.1703, time: 295.474ms, tok/s: 866.40, norm: 12.374
iteration 6, loss: 1.6350, time: 296.282ms, tok/s: 864.04, norm: 10.551
iteration 7, loss: 1.1419, time: 295.474ms, tok/s: 866.41, norm: 9.788
iteration 8, loss: 0.7040, time: 294.553ms, tok/s: 869.11, norm: 7.976
iteration 9, loss: 0.3771, time: 294.848ms, tok/s: 868.24, norm: 6.243
iteration 10, loss: 0.1743, time: 294.942ms, tok/s: 867.97, norm: 3.609

@karpathy (Owner) commented
Hi @jart, it's nice to see you stop by! I don't think I can merge this because, for educational and historic reasons, I am trying to keep the current version of the code compatible with GPT-2 and the checkpoints that OpenAI has released. It's possible that in the future we'll diverge from exact GPT-2, and this change would make a lot more sense then, but in that case we'd probably also shift from GeLU to something that (probably?) works a bit better - GeGLU / SwiGLU etc.
