
ufunc.at performance >10x too slow #5922

Closed
d1manson opened this issue May 27, 2015 · 19 comments
@d1manson

I have created a Matlab-like accumarray function which tries to squeeze as much performance from numpy as possible for a specific list of functions: sum, any, all, max, min, mean, etc.

The function is available as a gist here.

There is another accumarray implementation available on github here, which was originally created by @ml31415, but over the last couple of days has had many of my suggestions incorporated (I have also updated my own gist in line with my recent suggestions).

The original purpose of this other implementation was to use scipy.weave to get massively improved performance over what was assumed to be the best-possible raw-numpy version. However, it appears that most of the functions can be fairly heavily optimized without scipy.weave. [It's not really important here, but for reference, the remaining functions which are difficult to optimize are min, max, and prod.]

The main point, however, is that it should be simple to optimize by just using ufunc.at for the relevant ufunc (as this is exactly what ufunc.at is intended for), yet ufunc.at gets miserable performance: about 15x slower than scipy.weave and 10-25x slower than a bunch of carefully written alternative numpy algorithms (where such optimized algorithms exist). Surely ufunc.at could be improved?

Also, and on a separate note, would there be any interest in including accumarray itself in numpy?
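
For concreteness, here is a minimal sketch of the accumarray-style semantics under discussion (toy data, hypothetical variable names): accumulate vals into out at the positions given by idx, which is exactly the job ufunc.at is meant for.

import numpy as np

idx = np.array([0, 2, 2, 1, 0])             # group label for each value
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

out = np.zeros(3)
np.add.at(out, idx, vals)                   # accumulate: out[idx[k]] += vals[k]
# out is now array([6., 4., 5.])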


Here are some benchmarking stats produced by my function (testing and benchmarking code is included at the bottom of the gist). Note that the baseline times are obtained by sorting, splitting, and looping, using the named numpy function for each group, whereas the optimised functions do some kind of handcrafted vectorised operation in most cases, except max, min, and prod, which use ufunc.at (a sketch of both strategies follows the table). Note also that the actual observed speedup depends on a variety of properties of the input. Here we are using 100,000 indices uniformly picked from the interval [0, 1000). Specifically, about 25% of the values are 0 (for use with the bool tests); the remainder are uniformly distributed on [-50, 25).

std baseline 190.4 ms optimised 7.3 ms ... 26.2 x faster
all baseline 77.2 ms optimised 8.6 ms ... 9.0 x faster
min baseline 65.3 ms optimised 50.0 ms ... 1.3 x faster
max baseline 64.4 ms optimised 45.8 ms ... 1.4 x faster
sum baseline 64.6 ms optimised 2.4 ms ... 27.3 x faster
var baseline 173.4 ms optimised 7.7 ms ... 22.4 x faster
prod baseline 63.5 ms optimised 50.2 ms ... 1.3 x faster
any baseline 75.5 ms optimised 7.0 ms ... 10.8 x faster
mean baseline 100.2 ms optimised 3.7 ms ... 26.9 x faster
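
To make the baseline and optimised strategies concrete, here is a rough sketch for sum only, with synthetic data roughly matching the description above (the actual gist differs in details):

import numpy as np

np.random.seed(0)
idx = np.random.randint(0, 1000, 100000)
vals = np.random.rand(100000) * 75 - 50     # roughly uniform on [-50, 25)

# Baseline: sort by group, split into per-group arrays, loop with np.sum.
order = np.argsort(idx, kind='stable')
sidx, svals = idx[order], vals[order]
boundaries = np.flatnonzero(sidx[1:] != sidx[:-1]) + 1
baseline = np.array([g.sum() for g in np.split(svals, boundaries)])
# (baseline only covers the groups that actually occur in idx)

# Optimised: a single vectorised call; for sum, np.bincount does the job.
optimised = np.bincount(idx, weights=vals, minlength=1000)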
@mhvk
Contributor

mhvk commented May 27, 2015

Intrigued, I tried direct numpy examples, which confirm that np.bincount is almost 60 times faster than np.add.at. This seems surprising.

import numpy as np
test_idx = np.random.randint(0, 1000, 100000)
test_vals = np.random.rand(100000) * 100 - 50
res = np.zeros(1000)
%timeit np.add.at(res, test_idx, test_vals)
#--> 100 loops, best of 3: 13 ms per loop
%timeit res[:] += np.bincount(test_idx, weights=test_vals)
#--> 1000 loops, best of 3: 220 µs per loop

p.s. I confirmed both give identical results.

@ml31415

ml31415 commented May 27, 2015

+1 for fixing this. Would be great to have ufunc.at working with decent speed!

@jaimefrio
Member

It's what you pay for flexibility: np.bincount only does addition over an array of np.intp indices and another of np.double values, both 1D. ufunc.at works on multidimensional indices, for any type and almost any ufunc.

I looked into this some time ago, and ended up giving up: in all its generality, there's no easy way out of the current slowness of ufunc.at.

I can try to look into it once more, and report back with what, if anything, would be easy to speed-up through specialization, e.g. if all inputs are 1-D, then we may be able to streamline things...
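
To illustrate the flexibility gap with a sketch (not from the benchmarks above): ufunc.at accepts repeated indices, multidimensional targets and any ufunc, while np.bincount only sums 1-D weights over 1-D integer indices.

import numpy as np

# ufunc.at: any ufunc, repeated indices, multidimensional targets.
a = np.zeros((4, 5))
np.add.at(a, (np.array([0, 0, 3]), np.array([1, 1, 2])), 1.0)       # a[0, 1] ends up as 2.0

m = np.full(4, -np.inf)
np.maximum.at(m, np.array([0, 0, 2]), np.array([3.0, 7.0, 1.0]))    # group-wise max

# np.bincount: addition only, over 1-D integer indices and 1-D double weights.
sums = np.bincount(np.array([0, 0, 2]), weights=np.array([3.0, 7.0, 1.0]), minlength=4)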

@ml31415

ml31415 commented May 27, 2015

It may make sense to implement some specialized loops for common cases like mean, std etc. (if not done already). My C implementation is basically also just dispatching on the function's name and then choosing the corresponding inner loop. The generalization for different data types is done by scipy.weave there, but it should also be possible to precompile it for all relevant combinations.

@d1manson
Author

Surely it's easy to go from 1d to nd, because you can ravel and/or ravel_multi_index/reshape outside the loop (see the sketch below). I guess life becomes more complicated if you need to support broadcasting, but that's a pretty complicated usage case for this already well-hidden functionality (I hadn't even thought to try it, and I do use broadcasting a lot elsewhere).
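
A sketch of that flattening idea (hypothetical shapes), done once outside any inner loop so that a fast 1-D path could then be used:

import numpy as np

out = np.zeros((20, 30))
ii = np.random.randint(0, 20, 1000)
jj = np.random.randint(0, 30, 1000)
vals = np.random.rand(1000)

# Collapse the nd index to a flat index, accumulate in 1-D, reshape back.
flat = np.ravel_multi_index((ii, jj), out.shape)
out = np.bincount(flat, weights=vals, minlength=out.size).reshape(out.shape)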

As to the question of types, would it not be worth upgrading the types if it makes the computation go a lot faster? Then maybe in the long term, aim to reduce the extent of redundant upgrading.

The reality of implementing this may be messier than I imagine, but the potential benefits are clearly pretty high.

@jaimefrio
Member

I revisited this last night, here are my findings...

The way ufunc.at(op1, idx, op2) works is more or less:

  • It creates a PyArrayMapIterObject with op1 and idx. This lets you iterate over a data pointer into op1 in the order that the indexing in idx dictates.
  • It creates a PyArrayIterObject over op2, and broadcasts it to the shape of the other iterator.
  • We can now iterate simultaneously over op1 and op2, so all that would be left is to call the ufunc on each point until both iterators are exhausted. Unfortunately it isn't that easy, because to call the ufunc you have to have your data in an aligned memory location, with a non-swapped type, and cast to the right type (e.g. there is no inner loop to add double and int; it requires first casting the int to double).
  • The way ufuncs normally handle this is by filling a contiguous, aligned buffer with information from both operands, cast to the right type, operating on the buffers, then copying from the buffer into the output array. Iteration over the buffer is just a for loop, so it is really fast to run the ufunc over the whole buffer, and most of the time is typically spent in the buffer loading and unloading.
  • ufunc.at cannot use a large buffer: if an output entry comes up twice in the buffer, the operations will not be accumulated onto a single value, which would give wrong results. So it instead uses single-item buffers. And because all the casting and buffering operations are messy to implement, it reuses the functionality already present in the iterator. But to use it, it has to create yet another NpyIter object, set to iterate over the single entries pointed to by the two other iterators. And this iterator has to be reset for every point.
  • If you create a separate branch, checking that both operands are aligned, not byteswapped, and of the right dtype, so that none of the buffering or casting is needed, things run about 6x faster:
import numpy as np
a = np.zeros(1000)
idx = np.random.randint(1000, size=(10000))
b = np.random.rand(10000)
%timeit np.add.at(a, idx, b)
%timeit np.add.at(a, idx, b)
1000 loops, best of 3: 968 µs per loop  # <- current master
10000 loops, best of 3: 157 µs per loop  # <- new method
  • This is better, but not great either, and it doesn't kick in for some very simple cases, just because all of a sudden casting is required:
%timeit np.add.at(a, idx, 1)
1000 loops, best of 3: 1.11 ms per loop
%timeit np.add.at(a, idx, 1.0)
10000 loops, best of 3: 158 µs per loop

So in summary, we could maybe squeeze out a 5x performance improvement for some of the more typical use cases by being smart about making copies of full arrays to avoid one-at-a-time casting into buffers. I have a few other ideas I will not go into right now, that could perhaps squeeze out an additional 2x. The point is that the resulting function is still going to be some 10x slower than what one would expect from using np.bincount.

If anyone wants to try it out, this is the branch I have been playing around with:

https://github.com/jaimefrio/numpy/tree/ufunc_at_fast

Thoughts are welcome, but I am not 100% sure that the improvements justify the extra complexity. They probably do, but...

@seberg
Member

seberg commented May 28, 2015

Sounds good; I knew it was slow due to the buffering always being used. The other optimization would be to not use the MapIter functions interface, but to do the iteration at a low level (and possibly specialized).
However, I think it is silly to expect it to be as fast as bincount. Bincount is pretty specialized, and I think even limited to contiguous arrays.

EDIT: There is one-at-a-time casting, but it is done in a very inefficient way and would likely be relatively easy to speed up.

@jaimefrio
Member

Yes, one possibility that is very easy to specialize is the "all three items are 1-D arrays" case. Although even without any of the iterator overhead, we are still going to have two function calls per item (the one to innerloop, which in turn calls the one coming from innerloopdata), instead of one for every buffer plus one for every item in the buffer, so some slowdown is unavoidable.

Related to this, the other possible optimization I had in mind, if the arrays are aligned, is handing 2 items at a time to the innerloop function, by passing it the first data pointer, and strides calculated to send it to the next one.

What do you have in mind for the casting speed-up?

@seberg
Member

seberg commented May 28, 2015

The casting is currently done using an NpyIter-based hack; I think you could write it using the dtype casting functions, which should be much faster. I think I would have to look again at the new mapiter structure to say more. In the case of a non-empty subspace (i.e. slicing involved), the casting can possibly be moved into the mapiter part (if you do not use the public API).

I am not sure how you look at that "all three items are 1-D" optimization. I think one method would be to look at the 3 or so branches that MapIter uses and basically do all of those branches as well.

@d1manson
Author

Update: @ml31415 and I have created a package called numpy-groupies, which consists of a bunch of tools for doing group-indexing operations, including aggregation. The package is available on PyPI with pip install numpy-groupies.
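
A minimal usage sketch (assuming the package's aggregate entry point; see its README for the full interface):

import numpy as np
import numpy_groupies as npg

idx = np.random.randint(0, 1000, 100000)
vals = np.random.rand(100000) * 75 - 50

# Group-wise aggregation; func can be 'sum', 'min', 'max', 'mean', 'std', ...
sums = npg.aggregate(idx, vals, func='sum', size=1000)
means = npg.aggregate(idx, vals, func='mean', size=1000)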

The full list of benchmarks now looks like this:

function  pure-py  np-grouploop   np-ufuncat np-optimised    pandas        ratio
     std  1737.8ms       171.8ms     no-impl       7.0ms    no-impl   247.1: 24.4:  -  : 1.0 :  -  
     all  1280.8ms        62.2ms      41.8ms       6.6ms    550.7ms   193.5: 9.4 : 6.3 : 1.0 : 83.2
     min  1358.7ms        59.6ms      42.6ms      42.7ms     24.5ms    55.4: 2.4 : 1.7 : 1.7 : 1.0 
     max  1538.3ms        55.9ms      38.8ms      37.5ms     18.8ms    81.9: 3.0 : 2.1 : 2.0 : 1.0 
     sum  1532.8ms        62.6ms      40.6ms       1.9ms     20.4ms   808.5: 33.0: 21.4: 1.0 : 10.7
     var  1756.8ms       146.2ms     no-impl       6.3ms    no-impl   279.1: 23.2:  -  : 1.0 :  -  
    prod  1448.8ms        55.2ms      39.9ms      38.7ms     20.2ms    71.7: 2.7 : 2.0 : 1.9 : 1.0 
     any  1399.5ms        69.1ms      41.1ms       5.7ms    558.8ms   246.2: 12.2: 7.2 : 1.0 : 98.3
    mean  1321.3ms        88.3ms     no-impl       4.0ms     20.9ms   327.6: 21.9:  -  : 1.0 : 5.2 
Python 2.7.9, Numpy 1.9.2, Win7 Core i7.

Here we are using 100,000 indices uniformly picked from [0, 1000). Specifically, about 25% of the values are 0 (for use with bool operations); the remainder are uniformly distributed on [-50, 25).

  • pure-py - uses nothing but pure python, relying partly on itertools.groupby.
  • np-grouploop - uses numpy to sort values based on idx, then uses split to create separate arrays, and then loops over these arrays, running the relevant numpy function for each array.
  • np-ufuncat - uses the ufunc.at method under discussion here.
  • np-optimised - uses custom numpy indexing/other tricks to beat the above two implementations (except for min, max, and prod, which rely on ufunc.at).
  • pandas - does something like: pd.DataFrame({'idx': idx, 'vals': vals}).groupby('idx').sum().

Note that some of the no-impl entries may be unwarranted, but I haven't bothered to get them working yet. Also, note I haven't included the weave implementation here, which I believe is the fastest in almost all cases.

kevin-keraudren pushed a commit to kevin-keraudren/randomferns-python that referenced this issue Feb 25, 2016
…tween `a[indices] += 1` and `np.add.at(a,indices,1)`

and the trick of using `np.bincount`, which is faster than `ufunc.at`.

See: ml31415/numpy-groupies#1 and numpy/numpy#5922
@ml31415

ml31415 commented Mar 30, 2017

@jaimefrio any news on this? I guess your 5x improvements would be well worth integrating!

@seberg
Member

seberg commented Mar 30, 2017

@ml31415 I don't think doing this is impossibly complicated, but doing it does require quite a bit of time and spare brain cycles, which unfortunately a lot of us are notoriously short on.... And then there are often other things higher on our (personal) priority lists.

@nschloe
Contributor

nschloe commented Jun 25, 2018

Just for reference, fastfunc (a small pybind11 project of mine) speeds things up by a factor of about 40.

(benchmark plot)

@mattip
Member

mattip commented Jun 25, 2018

Only if things are limited to int64 data with no overflow or divide-by-0 handling and no broadcasting. It is small and clearly written, though. Note that pybind11 seems to incur a penalty for len(array) < 10 or so compared to NumPy's direct use of C extensions.

@nschloe
Contributor

nschloe commented Jul 15, 2019

Another hint: if you use numpy.add.at, an alternative that is about 8 times faster is numpy.bincount with its optional weights argument:

import perfplot
import numpy as np

np.random.seed(0)


def np_add_at(a, i):
    out0 = np.zeros(1000)
    np.add.at(out0, i, a)
    return out0


def np_bincount(a, i):
    return np.bincount(i, weights=a, minlength=1000)


b = perfplot.bench(
    setup=lambda n: (np.random.rand(n), np.random.randint(0, 1000, n)),
    kernels=[np_add_at, np_bincount],
    n_range=[2**k for k in range(24)],
    title=f"NumPy {np.__version__}",
)

b.save("out.png")
b.show()

(benchmark plot: out.png)

@sradc

sradc commented Feb 4, 2021

Anecdotally, I've found it ~6x faster to loop manually rather than use add.at (not a completely elementwise loop).

@mattip
Member

mattip commented Feb 8, 2023

Closing: #23136 was merged. Please rerun the groupies benchmarks.

@nschloe
Contributor

nschloe commented Feb 8, 2023

Redid the benchmark from above with the latest main:

(updated benchmark plot)

@seberg
Member

seberg commented Feb 8, 2023

Wow, nice! That we have a decent amount of additional overhead should be expected, but very cool that it levels off at the same place. (Although probably not too surprising, since this should be heavily memory bound at large sizes)
