
@larsmans
Author

Simple recursive implementation with unrolled base case. Also fixed signed/unsigned issues by making all indices signed.

Added a unit test based on your example.

Performance seems unchanged: still about a third faster than before.
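
In outline, the shape of it (simplified to unit stride; the function name, the 128-element cutoff, and the 8-way unroll here are illustrative, not the exact diff):

    static double
    pairwise_sum(const double *a, long n)
    {
        if (n <= 128) {
            /* Unrolled base case: 8 independent partial sums break the
             * addition dependency chain and give the compiler a hint. */
            double r[8] = {0., 0., 0., 0., 0., 0., 0., 0.};
            long i;
            for (i = 0; i + 8 <= n; i += 8) {
                r[0] += a[i];     r[1] += a[i + 1];
                r[2] += a[i + 2]; r[3] += a[i + 3];
                r[4] += a[i + 4]; r[5] += a[i + 5];
                r[6] += a[i + 6]; r[7] += a[i + 7];
            }
            /* leftovers land in the first n % 8 members of r */
            for (; i < n; i++) {
                r[i % 8] += a[i];
            }
            return ((r[0] + r[1]) + (r[2] + r[3])) +
                   ((r[4] + r[5]) + (r[6] + r[7]));
        }
        else {
            long n2 = n / 2;
            n2 -= n2 % 8;   /* keep the split a multiple of 8 */
            return pairwise_sum(a, n2) + pairwise_sum(a + n2, n - n2);
        }
    }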

@larsmans
Author

Note that I'm adding to the first three members of r here, instead of r[0].

@juliantaylor
Owner

Thanks, it is a little simpler. I want to have a look at this again after 1.8 is out. The big issue is the blocking done by the reduction iterator, which reduces the usefulness of this change.
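
Schematically the problem is this (buffer size and structure are illustrative, not the actual iterator code): the inner sum only ever sees one buffered chunk at a time, and the per-chunk results still combine sequentially.

    #define BUFSIZE 8192   /* illustrative; not the real buffer size */

    double pairwise_sum(const double *a, long n);   /* as sketched above */

    static double
    blocked_reduce(const double *a, long n)
    {
        double res = 0.;
        for (long i = 0; i < n; i += BUFSIZE) {
            long chunk = (n - i < BUFSIZE) ? (n - i) : BUFSIZE;
            /* pairwise only within each chunk; chunks add up sequentially */
            res += pairwise_sum(a + i, chunk);
        }
        return res;
    }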

@larsmans
Author

You mean np.add.reduce? I do have a lot of code that calls plain old np.sum and could do with a speed boost :)

But take your time.

@juliantaylor
Owner

np.sum goes through np.add.reduce too.
For speed I also plan to vectorize this (which was very easy with the iterative code; I need to check how to fit it into the recursive one).
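
A vectorized base case might look roughly like this (a sketch with SSE2 intrinsics; the name, unit stride, and unaligned loads are assumptions, not what this will end up as):

    #include <emmintrin.h>   /* SSE2 */

    static double
    sum_sse2(const double *a, long n)
    {
        /* two vector accumulators of two doubles each give four
         * independent partial sums */
        __m128d v0 = _mm_setzero_pd();
        __m128d v1 = _mm_setzero_pd();
        long i;
        for (i = 0; i + 4 <= n; i += 4) {
            v0 = _mm_add_pd(v0, _mm_loadu_pd(a + i));       /* addpd */
            v1 = _mm_add_pd(v1, _mm_loadu_pd(a + i + 2));
        }
        double r[4];
        _mm_storeu_pd(r, v0);
        _mm_storeu_pd(r + 2, v1);
        double res = (r[0] + r[1]) + (r[2] + r[3]);
        for (; i < n; i++) {    /* scalar tail */
            res += a[i];
        }
        return res;
    }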

@larsmans
Author

Ok. The NumPy 1.7.1 I benchmarked against is the Ubuntu binary package, and I'm not sure whether that uses SIMD (probably not). I'll check whether I can get the SIMD loop into the base case of this algorithm.

@larsmans
Author

I just checked the assembler output for this, and my GCC 4.7.3 is able to vectorize the loop itself at -O2 on x86-64. The unrolling is enough of a hint for it to do that. (I'm surprised, as I thought it didn't do that except at -O3.)

@juliantaylor
Owner

gcc will only vectorize at -O3 or with -ftree-vectorize. But it doesn't want to vectorize it for me, GCC 4.7 at -O3:

  5,96 │ 80:   add    $0x4,%rax
  6,00 │       addsd  (%r8),%xmm3
  6,11 │       add    %r10,%r8
  6,11 │       addsd  (%rcx),%xmm2
  6,04 │       mov    %rax,%rsi
  5,70 │       addsd  (%rcx,%r9,1),%xmm1
  9,83 │       addsd  (%rcx,%r9,2),%xmm0
  7,21 │       add    %r10,%rcx
  5,39 │       cmp    %r11,%rax
  6,85 │     ↑ jb     80

only scalar adds :(

@larsmans
Author

Ah, I had mistaken addsd for a vector op since it's an SSE instruction; it's actually a scalar double add (addpd is the packed version). Must brush up on my asm skills.

@juliantaylor
Owner

It can probably vectorize it if you set -funsafe-math-optimizations; gcc is very conservative about changing the floating-point semantics (which reordering and vectorizing does).
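
The reordering issue is easy to demonstrate with a standalone example (values mine, chosen to make the difference visible):

    #include <stdio.h>

    /* Floating-point addition is not associative, which is why gcc
     * will not reorder it into vector lanes by default. */
    int main(void)
    {
        double big = 1e16, small = 1.0;
        printf("%.1f\n", big - big + small);    /* 1.0 */
        printf("%.1f\n", big + small - big);    /* 0.0: small is absorbed */
        return 0;
    }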

@juliantaylor
Owner

Added the commit to the original PR, thanks.
Note that I fixed a stride bug in the recursive call.
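
For illustration of the bug class (my reconstruction, not the actual diff): with a non-unit element stride, the second recursive call has to advance by n2 * stride elements, not by n2.

    /* stride is in elements here; NumPy's strides are in bytes */
    static double
    pairwise_sum_strided(const double *a, long n, long stride)
    {
        if (n <= 8) {
            double res = 0.;
            for (long i = 0; i < n; i++) {
                res += a[i * stride];
            }
            return res;
        }
        long n2 = n / 2;
        return pairwise_sum_strided(a, n2, stride) +
               /* the bug class: advancing by n2 instead of n2 * stride */
               pairwise_sum_strided(a + n2 * stride, n - n2, stride);
    }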

juliantaylor pushed a commit that referenced this pull request Nov 28, 2016
Adds a regression test that demonstrates the issue.
juliantaylor pushed a commit that referenced this pull request May 12, 2019
BUG: non-uint-aligned arrays were counted as uint-aligned