ENH: umath: simplify pairwise sum #1
numpy/core/src/umath/loops.c.src
Note that I'm adding to the first three members of `r` here, instead of `r[0]`.
Thanks, it is a little simpler. I wanted to look at this again after 1.8 is out. The big issue is the blocking done by the reduction iterator, which reduces the usefulness of this change.
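A minimal sketch of why blocking limits the benefit, assuming the reduction iterator hands the inner loop fixed-size chunks and the chunk results are then added to the output one after another. `CHUNK = 8192`, `pairwise` and `blocked_sum` are hypothetical names and values for illustration, not NumPy internals.

```c
#include <stddef.h>

/* Hypothetical chunk length handed to the inner loop by the iterator. */
#define CHUNK 8192

/* Stand-in for the new inner loop: pairwise sum of one contiguous chunk. */
static double pairwise(const double *a, ptrdiff_t n)
{
    if (n <= 8) {
        /* Small pieces: plain sequential sum. */
        double s = 0.0;
        for (ptrdiff_t i = 0; i < n; i++) {
            s += a[i];
        }
        return s;
    }
    /* Split in two halves and add the two partial sums. */
    ptrdiff_t half = n / 2;
    return pairwise(a, half) + pairwise(a + half, n - half);
}

/* What a blocked reduction effectively computes: each chunk is summed
 * pairwise, but the chunk results are accumulated sequentially into the
 * output. Within a chunk the rounding-error bound grows roughly with
 * log(len); across chunks it still grows linearly with the chunk count. */
static double blocked_sum(const double *a, ptrdiff_t n)
{
    double out = 0.0;
    for (ptrdiff_t start = 0; start < n; start += CHUNK) {
        ptrdiff_t len = (n - start < CHUNK) ? n - start : CHUNK;
        out += pairwise(a + start, len);
    }
    return out;
}
```

Under this assumption, an array shorter than one chunk gets the full pairwise benefit, while a long array only gets it piecewise.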
You mean … But take your time.
`np.sum` goes through `np.add.reduce` too.
OK. The NumPy 1.7.1 I benchmarked against is the Ubuntu binary package, and I'm not sure whether that uses SIMD (probably not). I'll check whether I can get the SIMD loop into the base case of this algorithm.
I just checked the assembler output for this, and my GCC 4.7.3 is able to vectorize the loop itself at …
GCC will only vectorize at -O3 or with -ftree-vectorize. But it doesn't want to vectorize it for me with 4.7 and -O3: only scalar adds :(
Ah, I had mistaken the …
It can probably vectorize it if you set -funsafe-math-optimizations; GCC is very careful not to change the floating-point semantics, which reordering and vectorizing do.
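Vectorizing a serial sum like `s += a[i]` splits it into several lane-wise partial sums, which reassociates the additions, and floating-point addition is not associative. That is why GCC needs `-funsafe-math-optimizations` (specifically the `-fassociative-math` part it enables) before it will do this. A minimal illustration; the helper names are hypothetical:

```c
/* Left-to-right order, as written in a serial loop: (a + b) + c. */
static float sum_left(float a, float b, float c)
{
    return (a + b) + c;
}

/* Reassociated order, as a vectorized reduction might evaluate it: a + (b + c). */
static float sum_right(float a, float b, float c)
{
    return a + (b + c);
}
```

With `a = 1e20f`, `b = -1e20f`, `c = 1.0f`, `sum_left` returns `1.0f` but `sum_right` returns `0.0f`: `1.0f` is far below the spacing of floats near `1e20`, so it is lost in `b + c`.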
Added the commit to the original PR, thanks.
Adds a regression test that demonstrates the issue.
BUG: non-uint-aligned arrays were counted as uint-aligned
Simple recursive implementation with an unrolled base case. Also fixed signed/unsigned issues by making all indices signed.
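A sketch of a recursive pairwise sum with an unrolled base case, in the spirit of the commit message above. The cutoff `PW_BLOCKSIZE = 128`, the eight-accumulator base case and the name `pairwise_sum` are assumptions for illustration, not necessarily what the PR contains. Strides are in elements, and all indices are signed `ptrdiff_t`.

```c
#include <stddef.h>

/* Hypothetical cutoff below which the unrolled base case is used. */
#define PW_BLOCKSIZE 128

/* Recursive pairwise summation of n doubles spaced `stride` elements apart.
 * All indices are signed (ptrdiff_t) to avoid signed/unsigned mixing. */
static double pairwise_sum(const double *a, ptrdiff_t n, ptrdiff_t stride)
{
    if (n < 8) {
        /* Too short to unroll: plain sequential sum. */
        double res = 0.0;
        for (ptrdiff_t i = 0; i < n; i++) {
            res += a[i * stride];
        }
        return res;
    }
    else if (n <= PW_BLOCKSIZE) {
        /* Unrolled base case: eight independent accumulators. */
        double r[8];
        ptrdiff_t i;
        for (int j = 0; j < 8; j++) {
            r[j] = a[j * stride];
        }
        for (i = 8; i < n - (n % 8); i += 8) {
            r[0] += a[(i + 0) * stride];
            r[1] += a[(i + 1) * stride];
            r[2] += a[(i + 2) * stride];
            r[3] += a[(i + 3) * stride];
            r[4] += a[(i + 4) * stride];
            r[5] += a[(i + 5) * stride];
            r[6] += a[(i + 6) * stride];
            r[7] += a[(i + 7) * stride];
        }
        /* Combine the accumulators as a balanced tree. */
        double res = ((r[0] + r[1]) + (r[2] + r[3])) +
                     ((r[4] + r[5]) + (r[6] + r[7]));
        /* Add the remaining n % 8 elements. */
        for (; i < n; i++) {
            res += a[i * stride];
        }
        return res;
    }
    else {
        /* Split roughly in half, keeping the first part a multiple of 8. */
        ptrdiff_t n2 = n / 2;
        n2 -= n2 % 8;
        return pairwise_sum(a, n2, stride) +
               pairwise_sum(a + n2 * stride, n - n2, stride);
    }
}
```

Because each accumulator keeps its own additions in order, the eight chains are independent. A compiler can in principle map them onto SIMD lanes without changing the result, unlike a single serial chain.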
Added a unit test based on your example.
Performance seems unchanged: still about a third faster than before.