
Numerical-stable sum (similar to math.fsum) (Trac #1855) #2448

Closed

numpy-gitbot opened this issue Oct 19, 2012 · 12 comments

@numpy-gitbot

Original ticket http://projects.scipy.org/numpy/ticket/1855 on 2011-06-02 by trac user ling, assigned to unknown.

For some applications, having a numerically stable sum is critical. What I mean is well illustrated by the examples below:

>>> math.fsum([1, 1e-10] * 1000), numpy.sum([1, 1e-10] * 1000)
(1000.0000001, 1000.0000001000171)
>>> math.fsum([1, 1e100, 1, -1e100] * 10000), numpy.sum([1, 1e100, 1, -1e100] * 10000)
(20000.0, 0.0)

Here math.fsum is a numerically stable sum introduced in Python 2.6, which uses the Shewchuk algorithm (apparently the one described in http://code.activestate.com/recipes/393090/).

Other complaints about the plain vanilla numpy.sum can be found in a recent thread: http://article.gmane.org/gmane.comp.python.numeric.general/42756.

I hope numpy.sum can also implement something similar (e.g., the Shewchuk algorithm, Kahan summation, or partial sums). We could add an optional argument "stable=True/False" to numpy.sum and let the user decide whether they want the more accurate (but potentially slower) version.

@numpy-gitbot
Author

trac user ling wrote on 2011-06-02

This could be related to #1696. If the requested algorithm (or a better one) is implemented there, it could also be enabled by a "stable=True" parameter. I.e., users have the choice between the default (fast but may be inaccurate) and the stable one.

@cdeil

cdeil commented Jan 11, 2013

I need a fast and accurate sum for large numpy arrays and did a quick test that simply uses a higher-precision accumulator via data.sum(dtype="float128"):
https://gist.github.com/4509330
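
(For reference, a minimal sketch of that kind of comparison; the array size and dtypes here are illustrative, the actual benchmark is in the gist above.)

import time
import numpy as np

data = np.random.rand(10**7).astype(np.float32)  # illustrative size and dtype

for acc in (np.float32, np.float64, np.longdouble):
    t0 = time.perf_counter()
    total = data.sum(dtype=acc)  # accumulate in the given precision
    elapsed = time.perf_counter() - t0
    print(acc.__name__, total, "%.2e elements/s" % (data.size / elapsed))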

Questions to the numpy experts:

  • Why doesn't the speed of the sum depend on the data or accumulator float precision?
  • Will data.sum(dtype="float128") work with 128 bit precision on all or most machines, or does it sometimes give 64 bit precision?

In any case: this requested feature for a numerically stable sum in numpy would be very useful to me.

@njsmith
Member

njsmith commented Jan 11, 2013

Your speed measure is items/time, so larger is better? That is rather odd. I wouldn't necessarily expect that using a higher-precision accumulator would be slow, since like you say, the problem's probably bounded by the speed of reading out the original arrays -- but then I can't explain why it's just as fast to process the 128-bit input as the 32-bit input. I guess another explanation would be that it actually is bounded by the fp operation time, and Intel has just gone overboard and made 32/64/80-bit fp operations equally fast. (Possibly they all use the 80-bit units internally and then downcast on output.)

As for your second question, I think the best answer is a hollow laugh. dtype=float128 probably won't give 128-bit precision on any machine you care about. What numpy actually exposes is:

np.longdouble <- this is whatever your C compiler calls "long double"
np.float<8 * sizeof(long double)> <- an alias for the above, whose name is based on the memory size of a C "long double" on your machine

On x86-32, "long double" is an 80-bit fp number stored in 96 bits (so numpy calls it "float96"); on x86-64, it's an 80-bit (!) fp number stored in 128 bits (this is the float128 you're using). On other architectures it does other fun things, like on PPC where it's usually "double-double" arithmetic:
https://en.wikipedia.org/wiki/Long_double
https://en.wikipedia.org/wiki/Double-double_%28arithmetic%29#Double-double_arithmetic
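
(A quick, illustrative way to check what long double actually gives you on a given machine:)

import numpy as np

# On x86 this typically reports a 64-bit mantissa (80-bit extended precision),
# even though the dtype may be called float96 or float128.
print(np.longdouble)
print(np.finfo(np.longdouble).nmant, np.finfo(np.longdouble).eps)
print(np.finfo(np.float64).nmant, np.finfo(np.float64).eps)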

Nowadays we actually could expose a real float128, at least on systems where we have working modern compilers (cough_notwindows_cough). But no one's done the work to sort out the dumb existing naming scheme and add this support.

There have been several mailing list discussions about how messed up this all is, usually started by Matthew Brett discovering yet another wacky bug or confusing inconsistency in this code when running on Sparc or something.

My suggestions for using extended precision:

  • Use np.longdouble, not np.float128 (since it's defined to something
    useful more often)
  • Be aware that this is a dusty corner of numpy, so let us know if you
    run into bugs (though it should work and some people do use it)
  • Be prepared for it to be unavailable on some systems (though it
    should be some sort of extended precision whenever it exists)

@njsmith
Member

njsmith commented Jan 11, 2013

On the actual bug's request for a stable summation algorithm:

The thing that makes this less than perfectly obvious is that there actually is no np.sum implementation anywhere in numpy. Instead what we have is addition code (np.add), plus a general mechanism to turn any binary operation into a reduction operation (so the code for np.sum for floating point is also the code for np.prod, and the code for np.sum for integers, etc.). So we can't just go to the np.sum code directly and add Kahan summation or whatever.
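
(Concretely: np.sum is the generic ufunc reduction applied to np.add, and np.prod is the same machinery applied to np.multiply, e.g.:)

import numpy as np

a = np.arange(1.0, 11.0)
# np.sum / np.prod go through the generic reduction machinery:
print(np.sum(a) == np.add.reduce(a))        # True
print(np.prod(a) == np.multiply.reduce(a))  # True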

That said, it'd be lovely to have an exact algorithm like Shewchuk's (as used in Python's fsum, which might be steal-able code), or a bounded-given-condition-number algorithm like Kahan summation, or both, in numpy. The way to do it would be to define a special-purpose "generalized ufunc" for each, for floating point only. Details of how to do this:
http://jayvius.github.com/numpy-dtypes/newgufunc.html
http://docs.scipy.org/doc/numpy/reference/c-api.generalized-ufuncs.html
You'd want to generate one implementation for each version of floating point arithmetic numpy supports; if you look in loops.c.src (which would be a good place for the code to be added) then you can see how to write a kind of "code template" for the algorithm that then gets filled in with each type.

This wouldn't be too hard -- Kahan summation in particular is only about 10 lines of code, and hooking it up into numpy not much more. And then we could add a method= argument to np.sum and ndarray.sum that chooses whether you want to pass on to add.reduce (using the default reduction algorithm, which doesn't know anything about addition in particular), or to one of the new special-case generalized ufuncs.
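
(For illustration, a pure-Python sketch of Kahan / compensated summation -- not the proposed gufunc or C implementation, just the algorithm itself:)

import numpy as np

def kahan_sum(a):
    s = 0.0  # running sum
    c = 0.0  # compensation: carries the low-order bits lost in each addition
    for x in a:
        y = x - c
        t = s + y
        c = (t - s) - y  # (t - s) recovers the rounded-off part of y
        s = t
    return s

a = np.array([1.0, 1e-10] * 1000)
print(kahan_sum(a))     # ~1000.0000001, essentially the math.fsum result
print(sum(a.tolist()))  # a plain left-to-right loop accumulates noticeably more error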

The other thing we might want to consider is making add.reduce do pairwise summation by default:
https://en.wikipedia.org/wiki/Pairwise_summation
In principle this is just as fast as what we do now; not as good as Kahan or Shewchuk, but much better than the naive implementation. And the general principle could be implemented in the shared ufunc reduction code without giving it special knowledge of addition. We have a duty to our users to use good algorithms by default, since so very many people will not stop and think about these things.
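
(A recursive sketch of the idea, in Python for readability; a real implementation would live in the shared C reduction loop and use a larger base-case block:)

def pairwise_sum(a, block=8):
    # Below the block size, just do a plain loop; otherwise split the array
    # in half and add the two partial sums, so rounding error grows like
    # O(log n) rather than O(n).
    n = len(a)
    if n <= block:
        s = 0.0
        for x in a:
            s += x
        return s
    mid = n // 2
    return pairwise_sum(a[:mid], block) + pairwise_sum(a[mid:], block)

a = [1.0, 1e-10] * 1000
print(pairwise_sum(a))  # much closer to math.fsum(a) than a plain left-to-right loop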

@cdeil

cdeil commented Jan 11, 2013

Yes, my speed measure is summed elements in GHz. With any data and accumulator dtype I get 1 GHz on my 2.6 GHz Intel Core i7 MacBook. I ran the same test on a 2.6 GHz Intel Xeon machine (see here) and there the speeds are different (0.2 to 1 GHz) for different data / accumulator dtypes. I didn't try to understand what is going on.

I think for my application (adding the per-bin or per-event fit statistic for 1e6 or 1e9 or more bins / events) the speed of the sum might not be relevant, because computing the per-bin or per-event fit statistic is much slower. Precision, however, is an issue: it was pointed out to me that the Sherpa Python fitting package uses Kahan summation and that I should do the same.

Thanks for your detailed comments. I guess using a float128 dtype for the accumulator is not the answer then, but one of the stable adding methods you mentioned is. It would be awesome if this became available in numpy; unfortunately I don't have the expertise / time to contribute it myself.

@kyleabeauchamp

kyleabeauchamp commented Jul 18, 2013

So fsum is already available as np.math.fsum(). I wonder if we should just include some better documentation about using fsum when appropriate.

We could include a note about this function in the docstring (the See Also section?) of np.sum. I think that would be sufficient for many users.

@njsmith
Member

njsmith commented Jul 18, 2013

Please don't use np.math.fsum. That's just an accidental alias for math.fsum.


@kyleabeauchamp

kyleabeauchamp commented Jul 18, 2013

Obviously. The point is that there's already a function to do this, and it's already imported.

@njsmith
Member

njsmith commented Jul 18, 2013

It already being imported is probably a bug.

A reference to math.fsum would make sense in the np.sum docs, but it's not really a full solution: it doesn't support per-axis summation etc. like an np.fsum would.
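
(As a stopgap, per-axis fsum accuracy is possible today with np.apply_along_axis, though it is slow because it calls back into Python for every 1-D slice:)

import math
import numpy as np

x = np.array([[1.0, 1e-10] * 1000,
              [1.0, 1e100, 1.0, -1e100] * 500])
# Per-row sums via math.fsum: accurate, but far slower than x.sum(axis=1).
print(np.apply_along_axis(math.fsum, 1, x))  # [1000.0000001, 1000.0]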

@charris
Member

charris commented Feb 20, 2014

This has been improved by #3685.

In [5]: math.fsum([1, 1e-10] * 1000), numpy.sum([1, 1e-10] * 1000)
Out[5]: (1000.0000001, 1000.0000001)

In [6]: math.fsum([1, 1e100, 1, -1e100] * 10000), numpy.sum([1, 1e100, 1, -1e100] * 10000)
Out[6]: (20000.0, 0.0)

The second sum is heading towards the pathological.

Note that the fsum function could be cythonized pretty easily and made part of a specialized package. It is also a bit compiler/CPU dependent: given the way x87 and SSE registers can get used on 32-bit systems, the lo + hi trick can fail.
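
(For context, the "lo + hi" trick is the error-free transformation underlying fsum-style algorithms; a sketch of Knuth's two-sum, which assumes ordinary round-to-nearest double arithmetic and can indeed misbehave if intermediates are kept in 80-bit x87 registers:)

def two_sum(a, b):
    # Split a + b into the rounded result 'hi' and the exact rounding error 'lo',
    # so that hi + lo equals a + b exactly.
    hi = a + b
    b_virtual = hi - a
    lo = (a - (hi - b_virtual)) + (b - b_virtual)
    return hi, lo

print(two_sum(1e100, 1.0))  # (1e+100, 1.0): the 1.0 lost in 'hi' is recovered in 'lo'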

I'm going to close this as we probably have a good enough solution for most things.

@nschloe
Contributor

nschloe commented Feb 27, 2018

For everyone interested: I've just released accupy to deal with ill-conditioned sums and dot products. The sums are somewhat slower than numpy.sum, but also more accurate; see the README for details.

@cdeil

cdeil commented Feb 27, 2018

@nschloe - Thank you! Nice project name.
