
Gain another 10-20%+ on CPU performance on gcc by moving -fno-finite-math-only to only gelu_backwards #168

Closed

Conversation

@dagelf (Contributor) commented Apr 17, 2024

More targeted flag optimizations for gcc.

It's the tanhf function in gelu_backwards that causes the model to fail with -ffast-math on gcc on Linux.
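
For reference, the change pins the one problematic flag on the offending function only. A minimal sketch of the idea (the body is paraphrased from train_gpt2.c and may not match the PR exactly):

#include <math.h>
#define GELU_SCALING_FACTOR sqrtf(2.0f / M_PI)

// GCC only: compile just this function without the finite-math assumption,
// so tanhf/coshf keep correct NaN/Inf handling even when the rest of the
// build uses -Ofast.
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("no-finite-math-only")))
#endif
void gelu_backward(float* dinp, float* inp, float* dout, int N) {
    for (int i = 0; i < N; i++) {
        float x = inp[i];
        float cube = 0.044715f * x * x * x;
        float tanh_arg = GELU_SCALING_FACTOR * (x + cube);
        float tanh_out = tanhf(tanh_arg);
        float coshf_out = coshf(tanh_arg);
        float sech_out = 1.0f / (coshf_out * coshf_out);
        float local_grad = 0.5f * (1.0f + tanh_out)
                         + x * 0.5f * sech_out * GELU_SCALING_FACTOR
                         * (1.0f + 3.0f * 0.044715f * x * x);
        dinp[i] += local_grad * dout[i];
    }
}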

$ grep name /proc/cpuinfo | head -1
model name      : Intel(R) Core(TM) i3-9100F CPU @ 3.60GHz

Before:
step 0: train loss 5.356086 (took 6167.853384 ms)
step 1: train loss 4.300644 (took 5460.413776 ms)
step 2: train loss 4.623082 (took 5276.372294 ms)

After:
step 0: train loss 5.356185 (took 5714.622339 ms)
step 1: train loss 4.301033 (took 4814.820671 ms)
step 2: train loss 4.623316 (took 4813.711103 ms)

$ grep name /proc/cpuinfo | head -1
model name      : AMD Ryzen 5 3600 6-Core Processor

Before:
step 0: train loss 5.356085 (took 3397.901288 ms)
step 1: train loss 4.300644 (took 2810.743621 ms)
step 2: train loss 4.623083 (took 2813.287769 ms)

After:
step 0: train loss 5.356185 (took 2639.362407 ms)
step 1: train loss 4.301032 (took 2258.179942 ms)
step 2: train loss 4.623315 (took 2261.548428 ms)

Timings obtained with:

( kill -STOP -1  # Stop all processes, NB don't run this outside a script!
timeout 40s ./train_gpt2
kill -CONT -1 )

Also noted:

~$ gcc -Ofast -Q --help=optimizers|grep enabled > a
~$ gcc -O3 -Ofast -Q --help=optimizers|grep enabled > b
~$ diff a b
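
The line that matters in that diff is the finite-math assumption; roughly the following, though the exact listing varies by GCC version:

$ gcc -O3 -Q --help=optimizers | grep finite-math
  -ffinite-math-only                    [disabled]
$ gcc -Ofast -Q --help=optimizers | grep finite-math
  -ffinite-math-only                    [enabled]

-Ofast implies -ffast-math, which in turn enables -ffinite-math-only, letting GCC assume no NaNs or infinities ever occur.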

@dagelf (Contributor, Author) commented Apr 17, 2024

Also resolves #19 for good I think

@karpathy (Owner) commented:

So maybe this is ok to merge...
1. It looks a little funny; is there no way to combine the double nested #if into one condition?
2. I think a comment explaining this would go a long way.

@azret (Contributor) commented Apr 18, 2024

Please don't forget about the MSVC/Windows. MSVC uses pragma to turn off the optimization.

#pragma optimize( "", off )
/* unoptimized code section */
#pragma optimize( "", on )

This is really ugly. I know.

@rosslwheeler (Contributor) commented Apr 18, 2024

My issue with adding pragmas to source files (OpenMP excluded) is that you will keep adding more per platform/compiler. One suggestion was to split this function off into its own file; then you can use the Makefile to compile it with whatever flags are suitable for the platform/compiler, as sketched below. Makefiles typically have platform dependencies in them anyway. It might be easier from a maintenance standpoint to keep the source code as clean as possible.
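
A hypothetical sketch of that approach (file and target names assumed, not from the PR): move gelu_backward into its own translation unit and give only that object file the safe flag.

# Makefile fragment: fast math everywhere except gelu_backward.o
CC = gcc
CFLAGS = -O3 -Ofast -fopenmp

gelu_backward.o: gelu_backward.c
	$(CC) $(CFLAGS) -fno-finite-math-only -c $< -o $@

train_gpt2: train_gpt2.c gelu_backward.o
	$(CC) $(CFLAGS) $^ -lm -o $@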

@ent0n29 (Contributor) commented Apr 18, 2024

@dagelf I knew we could still go further with the CPU, thanks! Looking into it.

@ent0n29 (Contributor) commented Apr 18, 2024

> So maybe this is ok to merge... 1. It looks a little funny; is there no way to combine the double nested #if into one condition? 2. I think a comment explaining this would go a long way.

Yes, you can write it like this, @dagelf:

#if defined(__GNUC__) && !defined(__clang__)
    __attribute__((optimize("no-finite-math-only"))) 
#endif

@dagelf (Contributor, Author) commented Apr 18, 2024

@karpathy ifdefs squashed and comment added

@dagelf (Contributor, Author) commented Apr 18, 2024

> Please don't forget about the MSVC/Windows. MSVC uses pragma to turn off the optimization.
> #pragma optimize( "", off ) /* unoptimized code section */ #pragma optimize( "", on )
> This is really ugly. I know.

Does it bug out on MSVC with -Ofast too?

@azret (Contributor) commented Apr 18, 2024

> Does it bug out on MSVC with -Ofast too?

yep

@dagelf (Contributor, Author) commented Apr 19, 2024

Tested to work with, and speed up, MSVC too.

@karpathy (Owner) commented:

I'm sorry, this is too weird and ugly to merge, I think.
Can someone try alternative strategies? For example, tanh can be written as a function of exp quite trivially; maybe calling it that way makes it ok?
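
For what it's worth, the identity is tanh(x) = (e^(2x) - 1) / (e^(2x) + 1); a quick sketch of that idea (not the PR's code):

#include <math.h>

// tanh(x) = (e^(2x) - 1) / (e^(2x) + 1)
// Note: for large |x|, e2x overflows to infinity and the division yields
// inf/inf; under -ffinite-math-only GCC may assume that cannot happen,
// so expf would need the same treatment as tanhf (as noted below).
static inline float tanhf_via_expf(float x) {
    float e2x = expf(2.0f * x);
    return (e2x - 1.0f) / (e2x + 1.0f);
}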

@dagelf (Contributor, Author) commented Apr 20, 2024

Tried that; it will need to cover both tanhf and expf, and I'm busy with the latter... but it might be even uglier. It's really the MSVC part that makes it ugly, IMHO 😄

Simply adding:

 __attribute__((optimize("no-finite-math-only")))

Fixes it for gcc. clang always works, but is slow.

MSVC needs the pragmas before and after. The #ifdefs are just there to eliminate warnings about foreign pragmas when compiling.
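
Putting both compilers together, the guarded function ends up shaped roughly like this (a sketch; body omitted, and the exact form in the PR may differ):

#ifdef _MSC_VER
#pragma optimize("", off)   // MSVC: disable optimizations for the code below
#endif
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("no-finite-math-only")))   // GCC: drop only the finite-math assumption
#endif
void gelu_backward(float* dinp, float* inp, float* dout, int N) {
    /* tanhf/coshf math as in the PR description */
}
#ifdef _MSC_VER
#pragma optimize("", on)    // MSVC: restore optimizations for the rest of the file
#endif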

@dagelf (Contributor, Author) commented Apr 20, 2024

For now I'm just going to remove the ifdefs to get this down to only two lines, to keep it clean.

Going down the route of performant custom math functions means breaking cross-platform compatibility, unless we start exploring lookup tables for CPU inference, which I will explore next.

There sure is more performance to be gained. I quickly realized that a faster activation function might lead to slower convergence and more training steps, negating the benefits. This is my cue to learn more about what makes the activation function work so that I can develop a better intuition for it. (Any pointers appreciated!)

(screenshot: activation function graphs)

For the record, it's actually the exponential in the coshf that has the biggest influence on whatever makes gelu_backward break the model. Looking at the activation function graphs above, I think I can see why 😄

If anybody else wants to explore platform specific math function optimizations, here is a good start: https://github.com/bminor/glibc/tree/master/sysdeps/x86_64/fpu

Before playing with lookup tables, I'll compare performance of different activation functions.
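
For concreteness, a lookup-table tanhf might look something like this (hypothetical sketch, not part of this PR; the table size and range are arbitrary choices):

#include <math.h>

#define TANH_LUT_SIZE 1024
#define TANH_LUT_MAX  8.0f   // tanh is within float epsilon of +/-1 beyond ~8

static float tanh_lut[TANH_LUT_SIZE + 1];

// Call once at startup: sample tanhf uniformly on [-TANH_LUT_MAX, TANH_LUT_MAX].
static void tanh_lut_init(void) {
    for (int i = 0; i <= TANH_LUT_SIZE; i++) {
        float x = (2.0f * i / TANH_LUT_SIZE - 1.0f) * TANH_LUT_MAX;
        tanh_lut[i] = tanhf(x);
    }
}

// Clamp to the saturated region, otherwise linearly interpolate between samples.
static inline float tanhf_lut(float x) {
    if (x >= TANH_LUT_MAX) return 1.0f;
    if (x <= -TANH_LUT_MAX) return -1.0f;
    float t = (x / TANH_LUT_MAX + 1.0f) * 0.5f * TANH_LUT_SIZE;  // map to [0, SIZE)
    int i = (int)t;
    float frac = t - (float)i;
    return tanh_lut[i] + frac * (tanh_lut[i + 1] - tanh_lut[i]);
}

Whether something like this actually beats the libm call would depend on cache behavior and on how much accuracy the backward pass can tolerate.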

@azret (Contributor) commented Apr 20, 2024

Lookup tables are a great idea
