ENH: implement pairwise summation #3685

Merged
merged 4 commits into numpy:master from juliantaylor/pairwise on Dec 3, 2013

Conversation

4 participants
@juliantaylor
Contributor

juliantaylor commented Sep 4, 2013

Pairwise summation has an average error of O(log(n)) instead of the O(n) of
regular summation.
It is implemented by summing pairs of small blocks of regularly summed
values, in order to achieve the same performance as the old sum.

An example of data which profits greatly is
d = np.ones(500000)
(d / 10.).sum() - d.size / 10.

An alternative to pairwise summation is Kahan summation, but to keep its
performance penalty low one must unroll and vectorize it,
while pairwise summation reaches the same speed without any vectorization.
Kahan summation's better error bound of O(1) brings negligible benefit in many cases.
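The scheme can be sketched in pure Python (an illustration of the idea only, not the C implementation in numpy's inner loops; `pairwise_sum` and the block size of 128 are hypothetical):

```python
def pairwise_sum(a, block=128):
    """Recursive pairwise summation: runs of up to `block` elements
    are summed naively, so the recursion overhead stays negligible
    while the rounding error drops from O(n) to O(log n)."""
    n = len(a)
    if n <= block:
        total = 0.0
        for x in a:  # naive base case, same speed as a plain loop
            total += x
        return total
    half = n // 2
    # Each half accumulates independently, then the two partial sums
    # are combined -- this keeps the chain of dependent additions short.
    return pairwise_sum(a[:half], block) + pairwise_sum(a[half:], block)
```

On data like the example above (`[0.1]` repeated many times), this lands far closer to the exact value than a single running accumulator does.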

@juliantaylor


Contributor

juliantaylor commented Sep 4, 2013

Interestingly, it makes one test in scipy's test_lsmr.py fail;
apparently it has a set of data on which pairwise summation is slightly worse, by about 10e-6.

@njsmith


Member

njsmith commented Sep 4, 2013

To make sure I understand the API here: this just unconditionally uses
pairwise summation to implement np.add.reduce?

@juliantaylor


Contributor

juliantaylor commented Sep 4, 2013

All add.reduce calls go through float_add with IS_BINARY_REDUCE true,
so this also improves mean/std/var and anything else that uses sum.

@njsmith


Member

njsmith commented Sep 4, 2013

Neat. I assume this is just as fast as the current naive algorithm?

@juliantaylor


Contributor

juliantaylor commented Sep 4, 2013

It's even faster :) (about 30%),
probably due to the unrolling,
and it can be vectorized easily.

@njsmith


Member

njsmith commented Sep 5, 2013

Looks good modulo comments.

Some further questions:

  • It looks like this is only implemented for float/double/longdouble. What about half and complex types?
  • The loop unrolling and stack logic looks correct to me, but it's complicated enough, and with enough distinct execution paths, that I'd feel better knowing there were some careful tests checking lots of different array lengths. Do those exist?

And okay, this is just idle curiosity really, but: would it make any sense to do the same for *, and other commutative reduction operations? (I guess logaddexp.reduce would be a particularly obvious candidate, but that would of course require changes to the generic ufunc reduction loop. I guess the ufunc dispatch logic allows for types to override some reduction operations like add.reduce? We don't have two separate implementations of sum, do we?)
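A test of the kind asked for here might look roughly like this (a sketch only, not the tests that were eventually added; `check_sum_lengths` is a hypothetical helper, and `math.fsum` serves as the accurate reference):

```python
import math

def check_sum_lengths(sum_fn, max_n=260):
    # Exercise every length up to max_n so that short tails, exact
    # powers of two, and odd sizes all hit the different
    # unroll/recursion paths in the summation.
    for n in range(max_n + 1):
        data = [0.1] * n
        exact = math.fsum(data)
        assert abs(sum_fn(data) - exact) < 1e-10 * (n + 1), n
```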

@juliantaylor


Contributor

juliantaylor commented Sep 6, 2013

Half and complex types certainly should use it too; I didn't do it yet to simplify review.

I'll add some more tests for sum to make sure it works.

I'm not sure whether it is applicable to other operations; I'll have to look it up.

A big problem is that the reduce ufunc actually works in buffer-sized blocks, so the benefit of the pairwise summation is limited to blocks of 8192 elements by default.
Inner-loop growing for reductions is disabled with a TODO.
I tried simply enabling the inner-loop growing in the iterator and all tests still passed; I'm not sure what additional checks mentioned in the TODO are required.
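The buffering effect can be mimicked in pure Python (a sketch, assuming the default 8192-element buffer mentioned above; `buffered_reduce` is a hypothetical name, and `math.fsum` stands in for a perfectly accurate per-block sum):

```python
import math

def buffered_reduce(a, bufsize=8192):
    """Mimic the buffered reduce: each buffer-sized block is summed
    as accurately as possible, but the block results are folded into
    one running accumulator, so the error across blocks still grows
    linearly in the number of blocks."""
    total = 0.0
    for i in range(0, len(a), bufsize):
        total += math.fsum(a[i:i + bufsize])  # naive carry between blocks
    return total
```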

@charris


Member

charris commented Sep 6, 2013

The inner loop buffering might be historical. @mwiebe Let's ask Mark if he recalls anything about that.

@njsmith


Member

njsmith commented Sep 7, 2013

Note that there are good reasons to want to use a smaller buffer by default,
though - it reduces the memory overhead of casting, etc.

I guess one cheap option for commutative ops would be to just always do
pairwise reductions when combining inner loop results on different blocks.
I can't see off the top of my head how this could ever systematically
reduce accuracy, and it might even save copies compared to a strategy that
inserts the result from the previous loop at the beginning of the next loop
block.
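That cheap option could be sketched like so (hypothetical names, not NumPy's actual reduction machinery; `math.fsum` stands in for the inner-loop block sum):

```python
import math

def reduce_blocks_pairwise(a, bufsize=8192):
    # Sum each buffer-sized block (stand-in for one inner-loop call)...
    partials = [math.fsum(a[i:i + bufsize])
                for i in range(0, len(a), bufsize)]
    # ...then combine the per-block partial sums pairwise instead of
    # folding them into a single running accumulator, so the error
    # across blocks also grows only logarithmically.
    while len(partials) > 1:
        partials = [sum(partials[i:i + 2])
                    for i in range(0, len(partials), 2)]
    return partials[0] if partials else 0.0
```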

@larsmans


Contributor

larsmans commented Oct 16, 2013

Note: simplified recursive version and unit test here.

juliantaylor and others added some commits Sep 4, 2013

ENH: implement pairwise summation
pairwise summation has an average error of O(log(n)) instead of the O(n) of
the regular summation.
It is implemented by summing pairs of small blocks of regularly summed
values in order to achieve the same performance as the old sum.

An example of data which profits greatly is
d = np.ones(500000)
(d / 10.).sum() - d.size / 10.

An alternative to pairwise summation is Kahan summation, but in order to
have a low performance penalty one must unroll and vectorize it,
while pairwise summation has the same speed without any vectorization.
The better error bound of O(1) is negligible in many cases.
ENH: umath: simplify pairwise sum
Simple recursive implementation with unrolled base case. Also fixed
signed/unsigned issues by making all indices signed.

Added a unit test based on @juliantaylor's example.

Performance seems unchanged: still about a third faster than before.
ENH: fix stride issue and unroll 8 times to improve accuracy
Fix missing stride accounting in calling recursive function.
Unroll 8 times to improve accuracy and allow vectorizing with AVX
without changing summation order.
Add tests.
@njsmith


Member

njsmith commented Dec 2, 2013

Speaking of... what's the status of this? A quick skim suggests it's a win, but it would be a bigger win if we had pairwise block reduction?

@juliantaylor

Contributor

juliantaylor commented Dec 2, 2013

Yes, it's a win in accuracy and performance, but the former would be better if the blocks were added the same way.
I'm currently adding the pairwise sum for complex numbers; then I think it's ready to merge.

  • the blocking can be revisited later.
  • logaddexp needs a different approach (a new ufunc which uses the logsumexp scaling approach of scipy)
  • multiplication might profit too, but I would need to read up on floating point semantics (also it's probably less important)
  • vectorization can be done easily with how this code is written (but the gain is not so high with SSE, only about 20%), to be revisited later
@juliantaylor


Contributor

juliantaylor commented Dec 2, 2013

complex and half types added

@njsmith


Member

njsmith commented Dec 2, 2013

So now that you added those, you think this is ready to merge? (Even if
there's still more to do later?)


@juliantaylor


Contributor

juliantaylor commented Dec 2, 2013

I think it's ready. I just pushed a small update which removes a no-op line and changes the tests to not use data with a zero in the first element, which improves their ability to find issues with the first element (which is already initialized by the iterator in these loops).

@njsmith


Member

njsmith commented Dec 2, 2013

otherwise LGTM

njsmith added a commit that referenced this pull request Dec 3, 2013

Merge pull request #3685 from juliantaylor/pairwise
ENH: implement pairwise summation

@njsmith njsmith merged commit 05ab6f4 into numpy:master Dec 3, 2013

1 check passed

The Travis CI build passed