ENH: Add ability to runtime select ufunc loops, add AVX2 integer loops #7980

Merged
merged 5 commits into from
Sep 25, 2016

juliantaylor
Contributor

Added the ability in the umath generator to select loops at runtime depending on CPU capabilities. It only covers the basic loops, but just because that's all I currently needed.

As an example I added specializations for the auto-vectorized integer loops (GCC only).
The results are not very impressive on my Haswell i5 laptop, but you do get 10-20% better performance for arrays that fit in the CPU cache.
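
A minimal sketch of the idea, not the code in this PR: the same loop body is compiled twice, once with default flags and once with GCC's `target("avx2")` attribute so the auto-vectorizer can use AVX2, and the variant is picked at runtime with `__builtin_cpu_supports`. All names here (`TARGET_AVX2`, `ADD_LOOP_BODY`, `int_add_generic`, `int_add_avx2`, `select_int_add`) are made up for illustration, and the real ufunc loops take strided `char **` arguments rather than plain arrays.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical macros; they only enable the AVX2 path on GCC-compatible
 * compilers targeting x86, and compile to nothing elsewhere. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_AVX2_TARGET 1
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define HAVE_AVX2_TARGET 0
#define TARGET_AVX2
#endif

typedef void (*int_add_loop)(const int32_t *a, const int32_t *b,
                             int32_t *out, size_t n);

/* One loop body shared by both variants; the compiler vectorizes it
 * according to the target flags of the enclosing function. */
#define ADD_LOOP_BODY                      \
    for (size_t i = 0; i < n; i++) {       \
        out[i] = a[i] + b[i];              \
    }

/* Baseline variant: only uses instructions every supported CPU has. */
static void
int_add_generic(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    ADD_LOOP_BODY
}

#if HAVE_AVX2_TARGET
/* Same body, but this function alone may contain AVX2 instructions. */
TARGET_AVX2 static void
int_add_avx2(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    ADD_LOOP_BODY
}
#endif

/* Pick the loop once, e.g. when the ufunc loop table is filled in. */
static int_add_loop
select_int_add(void)
{
#if HAVE_AVX2_TARGET
    if (__builtin_cpu_supports("avx2")) {
        return int_add_avx2;
    }
#endif
    return int_add_generic;
}
```

A caller would store the result of `select_int_add()` once and then call through the pointer, so the CPU check is not repeated for every array.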

A possible extension would be the vector math loops (sin, cos, exp, log, pow); glibc has SSE4 and AVX2 variants of these available.

@juliantaylor
Contributor Author

It might be a good idea to split loops.c.src into separate files; it is getting quite large.

E.g. one each for int, float and complex. The different types don't really share anything in the file besides some macros that can go into the header.

@charris charris changed the title add ability to runtime select ufunc loops, add AVX2 integer loops ENH: Add ability to runtime select ufunc loops, add AVX2 integer loops Aug 28, 2016
@aeberspaecher

I'd love to see faster vector math loops. Is there any way I could help? I'm not familiar with the code base, but I could at least help with testing (no AVX2 here, though).

@juliantaylor
Contributor Author

Any thoughts on this?
It's probably not the best way to implement the ufunc generation, but it's simple. As it's internal, we can always change it later.

@juliantaylor
Contributor Author

I have added AVX macros; they are not used yet but could be in the future.

I'll put it in tomorrow, so last chance for comments.

Member

@rgommers rgommers left a comment

Changes LGTM. Adding these optimizations one by one is fine I think; we don't have the capacity to do a whole set at once.

One thing I looked at just now is how OPTIONAL_INTRINSICS gets triggered. There aren't enough comments/docs for it to be really clear, but it looks to me like for SSE/SSE2/SSE3 we use env vars when building Windows binaries (see _bdist_wininst in pavement.py), while on other platforms and for other builds everything else just gets turned on if it's detected on the build machine. Not sure I got that right though; otherwise we should have run into some issues already with manylinux and conda Linux builds.

@juliantaylor
Contributor Author

Linux always builds generic binaries. This PR doesn't change that, as the unsupported instructions will never be run if the CPU does not support them.

Windows has a special case: it builds three variants (nosse, sse2 and sse3) and chooses at install time which one to use.
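
For comparison, a rough sketch of what choosing among the three variants from CPU features could look like. This is not the actual installer logic, only an illustration under the assumption that the choice depends on SSE2 and SSE3 support; `variant`, `pick_variant` and `variant_for_this_cpu` are hypothetical names.

```c
/* Hypothetical enum mirroring the three variant names mentioned above. */
enum variant { VARIANT_NOSSE, VARIANT_SSE2, VARIANT_SSE3 };

/* Pure decision function: prefer the most capable variant the CPU can run.
 * SSE3 is only chosen when SSE2 is present as well. */
static enum variant
pick_variant(int has_sse2, int has_sse3)
{
    if (has_sse2 && has_sse3) {
        return VARIANT_SSE3;
    }
    if (has_sse2) {
        return VARIANT_SSE2;
    }
    return VARIANT_NOSSE;
}

/* Query the running CPU; falls back to the baseline variant when the
 * GCC x86 builtin is not available. */
static enum variant
variant_for_this_cpu(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    return pick_variant(__builtin_cpu_supports("sse2") != 0,
                        __builtin_cpu_supports("sse3") != 0);
#else
    return pick_variant(0, 0);
#endif
}
```

The difference from the runtime loop selection in this PR is when the check happens: here it is made once per machine at install time, whereas a generic binary carries every variant and checks each time the library is loaded.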


4 participants