Gain another 10-20%+ on CPU performance on gcc by moving -fno-finite-math-only to only gelu_backwards #168
Conversation
Also resolves #19 for good I think
So maybe this is ok to merge...
Please don't forget about MSVC/Windows. MSVC uses a pragma to turn off the optimization: `#pragma optimize("", off)`. This is really ugly, I know.
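For reference, a minimal sketch of how that pragma might be applied (the guard and placement are assumptions; MSVC's `#pragma optimize` affects the function definitions that follow it, so the fragile body has to sit between the off/on pair):

```c
#include <math.h>

#if defined(_MSC_VER)
#pragma optimize("", off)   // disable optimization for the definitions below
#endif

// Hypothetical stand-in for the fragile tanhf-based code, not llm.c verbatim.
static float fragile_tanhf(float x) { return tanhf(x); }

#if defined(_MSC_VER)
#pragma optimize("", on)    // reset to the optimizations given by the /O options
#endif
```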
My issue with adding pragmas to source files (OpenMP excluded) is that you will keep adding more per platform/compiler. One suggestion was to split this function off into its own file; then you can use the Makefile to compile it with whatever flags are suitable for the platform/compiler. Makefiles typically have platform dependencies in them anyway. It might be easier from a maintenance standpoint to keep the source code as clean as possible. (See the sketch after this comment.)
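To make the suggestion concrete, here is a sketch of the split. The file name and signature are assumptions paraphrased from llm.c; the Makefile would compile just this unit with something like `-O3 -fno-finite-math-only` on gcc while the rest of the project keeps `-Ofast`:

```c
// gelu_backward.c -- hypothetical standalone translation unit, so the build
// can give this one file saner math flags than the rest of the project.
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define GELU_SCALING_FACTOR sqrtf(2.0f / (float)M_PI)

// Backward pass of the tanh-approximated GELU, paraphrased from llm.c.
void gelu_backward(float* dinp, const float* inp, const float* dout, int N) {
    for (int i = 0; i < N; i++) {
        float x = inp[i];
        float cube = 0.044715f * x * x * x;
        float tanh_arg = GELU_SCALING_FACTOR * (x + cube);
        float tanh_out = tanhf(tanh_arg);
        float cosh_out = coshf(tanh_arg);
        float sech_out = 1.0f / (cosh_out * cosh_out);   // sech^2 = 1/cosh^2
        float local_grad = 0.5f * (1.0f + tanh_out)
            + x * 0.5f * sech_out * GELU_SCALING_FACTOR
              * (1.0f + 3.0f * 0.044715f * x * x);
        dinp[i] += local_grad * dout[i];
    }
}
```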
@dagelf i knew we could still go further with the cpu, thanks! looking into it
yes, you can write this @dagelf:

```c
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("no-finite-math-only")))
#endif
```
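Applied to the definition, that might look like the sketch below. The guard is needed because clang also defines `__GNUC__` but does not support the `optimize` attribute:

```c
// Only real gcc understands __attribute__((optimize(...))); clang defines
// __GNUC__ too, so it has to be excluded explicitly.
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("no-finite-math-only")))
#endif
void gelu_backward(float* dinp, const float* inp, const float* dout, int N) {
    // ... tanhf/coshf-based backward pass, compiled without finite-math
    // assumptions even when the rest of the file is built with -Ofast ...
}
```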
@karpathy ifdefs squashed and comment added
Does it bug out on MSVC with -Ofast too?
yep
Tested to work, and to speed things up.
I'm sorry, this is too weird and ugly to merge, I think.
Tried that, will need to do both. Simply adding `__attribute__((optimize("no-finite-math-only")))` fixes it for gcc.
For now I'm just going to remove the custom function. Going down the route of performant custom math functions means breaking cross-platform compatibility, unless we start exploring lookup tables for CPU inference. Which I will explore next. There sure is more performance to be gained.

I quickly realized that a faster activation function might lead to slower convergence and more training steps, negating the benefits. This is my cue to learn more about what makes the activation function work, so that I can develop a better intuition for it. (Any pointers appreciated!)

For the record, it's actually the exponential inside `tanhf` that dominates the cost.

If anybody else wants to explore platform-specific math function optimizations, here is a good start: https://github.com/bminor/glibc/tree/master/sysdeps/x86_64/fpu

Before playing with lookup tables, I'll compare the performance of different activation functions.
Lookup tables are a great idea
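For what it's worth, a minimal sketch of the lookup-table direction. The table size, clamp range, and linear interpolation are all assumptions to tune, and the accuracy loss would need validating before using it anywhere near training:

```c
#include <math.h>

// Hypothetical lookup-table tanhf for CPU inference. Clamps to
// [-TANH_RANGE, TANH_RANGE], where tanhf has saturated to ~+/-1 anyway,
// and linearly interpolates between precomputed samples.
#define TANH_TABLE_SIZE 1024
#define TANH_RANGE 4.0f

static float tanh_table[TANH_TABLE_SIZE + 1];

static void tanh_table_init(void) {
    for (int i = 0; i <= TANH_TABLE_SIZE; i++) {
        float x = -TANH_RANGE + 2.0f * TANH_RANGE * i / TANH_TABLE_SIZE;
        tanh_table[i] = tanhf(x);
    }
}

static float tanhf_lut(float x) {
    if (x <= -TANH_RANGE) return -1.0f;
    if (x >=  TANH_RANGE) return  1.0f;
    float t = (x + TANH_RANGE) * (TANH_TABLE_SIZE / (2.0f * TANH_RANGE));
    int i = (int)t;         // index of the lower sample
    float frac = t - i;     // position between the two samples
    return tanh_table[i] + frac * (tanh_table[i + 1] - tanh_table[i]);
}
```

The experiment would be to call `tanh_table_init()` once at startup and swap `tanhf_lut` in for `tanhf` on the inference path; the interesting question is whether the approximation error is visible in sample quality.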
More targeted flag optimizations for `gcc`. It's the `tanhf` function in `gelu_backwards` that causes the model to fail with `-ffast-math` on `gcc` on Linux.

Before:

Timings obtained with:

Also noted:
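A plausible mechanism for the failure, offered as an assumption rather than a verified trace of the libm internals: `tanhf` saturates through exponentials, and `-ffast-math` implies `-ffinite-math-only`, which licenses the compiler to transform floating-point code as if infinities never occur. A toy illustration of the kind of code that is only correct under IEEE infinity semantics:

```c
#include <math.h>
#include <stdio.h>

// Naive tanh in a form that is correct under IEEE semantics: for large x,
// expf(2x) overflows to +inf and 2/(inf + 1) rounds to 0, giving tanh -> 1.
// -ffinite-math-only tells the compiler that overflow never happens, so it
// may rewrite the expression in ways that break exactly this saturation path.
static float naive_tanhf(float x) {
    return 1.0f - 2.0f / (expf(2.0f * x) + 1.0f);
}

int main(void) {
    printf("%f\n", naive_tanhf(100.0f));  // ~1.0 with -O3; suspect with -Ofast
    return 0;
}
```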