ENH: Add ability to runtime select ufunc loops, add AVX2 integer loops #7980
NPY_CPU_SUPPORTS_AVX2 checks at runtime if AVX2 is supported
Selected at runtime depending on CPU features.
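A runtime check of this kind can be sketched with a GCC builtin. This is only an illustration under assumptions: `MY_CPU_SUPPORTS_AVX2` and `cpu_supports_avx2` are hypothetical names, and the actual definition of `NPY_CPU_SUPPORTS_AVX2` in this PR may well differ.

```c
/* Hypothetical stand-in for a macro like NPY_CPU_SUPPORTS_AVX2.
 * Assumes GCC (or a compatible compiler) targeting x86; on any other
 * toolchain the check simply reports "not supported". */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define MY_CPU_SUPPORTS_AVX2 (__builtin_cpu_supports("avx2") != 0)
#else
#define MY_CPU_SUPPORTS_AVX2 0
#endif

/* Returns 1 if the CPU running this code supports AVX2, else 0. */
static int cpu_supports_avx2(void)
{
    return MY_CPU_SUPPORTS_AVX2;
}
```

The query happens when the program runs, not when it is compiled, so the same binary can take different paths on different machines.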
It might be a good idea to split loops.c.src into separate files, since it is getting quite large, e.g. one each for int, float and complex. The different types don't really share anything in the file besides some macros that can go into the header.
I'd love to see faster vector math loops. Is there any way I could help? I'm not familiar with the code base, but at least I could help with testing (no AVX2 though).
Any thoughts on this?
Force-pushed from 17dfcd1 to ae32e78.
I have added AVX macros; they are not used yet but could be in the future. I'll put it in tomorrow, so last chance for comments.
Changes LGTM. Adding these optimizations one by one is fine I think, we don't have the capacity to do a whole set at once.
One thing I looked at just now is how OPTIONAL_INTRINSICS gets triggered - there aren't enough comments/docs for it to be really clear, but it looks to me like for SSE/SSE2/SSE3 we use env vars when building Windows binaries (see _bdist_wininst in pavement.py), while on other platforms and for other builds everything else just gets turned on if it's detected on the build machine. Not sure I got that right though, otherwise we should have run into some issues already with manylinux and conda linux builds.
Linux always builds generic binaries; this PR doesn't change that, as the unsupported instructions will never be run if the CPU does not support them. Windows has a special case: it builds 3 variants (nosse, sse2 and sse3) and chooses at install time which one to use.
Added the ability in the umath generator to runtime-select loops depending on CPU capabilities. It is only for the basic loops, but just because that's all I currently needed.
As an example I added specializations for the auto-vectorized integer loops (GCC only).
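The general idea of runtime loop selection can be sketched as follows. This is a simplified illustration, not the code the umath generator emits: all names (`int32_loop`, `add_int32_avx2`, `add_int32_generic`, `select_add_int32`) are hypothetical, and it assumes GCC or a compatible compiler on x86 for the AVX2 path.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical loop signature; real ufunc inner loops take strided
 * char pointers instead of typed arrays. */
typedef void (*int32_loop)(const int32_t *, const int32_t *,
                           int32_t *, size_t);

/* Generic version, compiled with the baseline instruction set. */
static void add_int32_generic(const int32_t *a, const int32_t *b,
                              int32_t *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_AVX2_TARGET 1
/* Same source loop, but the compiler is allowed to use AVX2 when
 * auto-vectorizing this one function. */
__attribute__((target("avx2")))
static void add_int32_avx2(const int32_t *a, const int32_t *b,
                           int32_t *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
#endif

/* Pick the loop once, based on what the running CPU supports. */
static int32_loop select_add_int32(void)
{
#ifdef HAVE_AVX2_TARGET
    if (__builtin_cpu_supports("avx2")) {
        return add_int32_avx2;
    }
#endif
    return add_int32_generic;
}
```

Because the AVX2 variant is only ever reached through the selector, a binary built this way should still run on CPUs without AVX2; those machines get the generic loop.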
The results are not very impressive on my laptop (i5 Haswell), but you do get 10-20% better performance for CPU-cache-sized arrays.
Possible extensions to this might be for the vector math loops (sin, cos, exp, log, pow); here there are SSE4 and AVX2 variants available in glibc.