
Suggestions on improving numpy.bincount #8495

Closed

rainwoodman opened this issue Jan 18, 2017 · 26 comments

@rainwoodman (Contributor)

  • Currently, numpy.bincount only allows scalar weights per bin. There was a stackoverflow question about supporting vector weights. Can we extend the function so that the shape of the weights propagates into the result?

  • Can we also add an 'out' argument to allow saving the result to an existing array? This will be useful for very sparsely populated bins and large input data.

rainwoodman changed the title from "Suggestions improving numpy.bincount" to "Suggestions on improving numpy.bincount" on Jan 18, 2017
@rgommers (Member)

An out keyword seems okay to me. For the vector weights, do you have a link to the stackoverflow question?

@seberg (Member) commented Jan 18, 2017

I admit it is way too slow and should be sped up, but for vector weights, np.add.at may be an OK option; it won't blow up the bincount implementation a lot.

@rainwoodman (Contributor, Author)

Yes, I looked up the example at

https://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.ufunc.at.html

An implementation on top of np.add.at could solve both issues. We should also document advanced uses of ufunc.at like this; for example, it appears we could find the minimum per bin by replacing np.add with np.fmin.
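A minimal sketch of both ideas (the arrays here are made-up example data, not anything from this thread):

import numpy as np

index = np.array([0, 1, 1, 2, 2, 2])                   # bin of each sample
weights = np.array([[1., 2.], [3., 4.], [5., 6.],
                    [7., 8.], [9., 10.], [11., 12.]])  # vector weight per sample
nbins = index.max() + 1

# vector-weights "bincount": accumulate rows of weights into bins
out = np.zeros((nbins,) + weights.shape[1:])
np.add.at(out, index, weights)        # out[i] == sum of weights where index == i

# per-bin minimum: same pattern, swapping np.add for np.fmin
mins = np.full(nbins, np.inf)         # start from +inf since fmin has no identity
np.fmin.at(mins, index, weights[:, 0])

The same pattern should work with any binary ufunc (np.maximum, np.multiply, ...).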

@rainwoodman (Contributor, Author)

So, should I open a PR for this, or ask on the mailing list first?

@seberg (Member) commented Jan 20, 2017

Not sure what exactly you mean, but you can probably do whatever you feel is right. As I said, the problem with ufunc.at is that it is very slow, so implementing an out argument for bincount by deferring to it would probably be pretty useless.

@rainwoodman (Contributor, Author)

I see. You mean ufunc.at is very slow and should not be used for vector bincount, so this is far from trivial.

For the record.
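(The setup for index and weight isn't shown in this comment; presumably something along the lines of the later snippets:)

index = numpy.arange(100000, dtype='i8') % 100
weight = numpy.arange(100000, dtype='f8')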

In [14]:
out = numpy.zeros(index.max() + 1)
%timeit add.at(out, index, weight)
100 loops, best of 3: 18 ms per loop

In [18]:
out = numpy.zeros(index.max() + 1)
%timeit numpy.bincount(index, weights=weight, minlength=len(out))
1000 loops, best of 3: 607 µs per loop

@rainwoodman (Contributor, Author)

I took a look at the cause of the slowness of ufunc.at. Here is a hypothesis -- I don't know whether it makes sense or not:

Buffering is requested at [1]. However, the comment at [2] suggests that buffering is very slow when the subspace is small (is it 1 in this case?).

[1] https://github.com/numpy/numpy/blob/master/numpy/core/src/umath/ufunc_object.c#L5384

[2] https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/mapping.c#L2534

@seberg (Member) commented Jan 21, 2017

Yes, that abuse is probably the biggest reason. I am aware of it, and maybe a bit at fault that it is there, but nobody ever bothered to try to get rid of it....

@seberg (Member) commented Jan 21, 2017

Note that in general you do need to convert types there, but you can do it much faster than that. You will never achieve bincount speeds, but I would not be surprised by a factor of 5-10 speedup for ufunc.at. More could be achieved if we somehow manage to give low-level access to the mapping iterator. But then you also need specializations, etc., etc.

@rainwoodman (Contributor, Author)

Indeed. After this patch [1], the no-cast path is 3x faster than the cast path. All test cases pass, but I dislike the catch-and-clear approach -- is there an NpyIter function that tests whether buffering is necessary for the given operands?

In [23]:
index = arange(100000, dtype='i8') % 100
weight = arange(100000, dtype='f4')
out = numpy.zeros(index.max() + 1, dtype='f8')
%timeit add.at(out, index, weight)
100 loops, best of 3: 7.91 ms per loop

In [24]:
index = arange(100000, dtype='i8') % 100
weight = arange(100000, dtype='f8')
out = numpy.zeros(index.max() + 1, dtype='f8')
%timeit add.at(out, index, weight)
100 loops, best of 3: 2.29 ms per loop

In [26]:
index = arange(100000, dtype='i8') % 100
weight = arange(100000, dtype='f4')
out = numpy.zeros(index.max() + 1, dtype='f4')
%timeit add.at(out, index, weight)
100 loops, best of 3: 2.3 ms per loop

[1]:

diff --git a/numpy/core/src/umath/ufunc_object.c b/numpy/core/src/umath/ufunc_object.c
index 52f11ee..db58268 100644
--- a/numpy/core/src/umath/ufunc_object.c
+++ b/numpy/core/src/umath/ufunc_object.c
@@ -5382,12 +5382,26 @@ ufunc_at(PyUFuncObject *ufunc, PyObject *args)
                         NPY_ITER_EXTERNAL_LOOP|
                         NPY_ITER_REFS_OK|
                         NPY_ITER_ZEROSIZE_OK|
+//                        NPY_ITER_BUFFERED|
+                        NPY_ITER_GROWINNER|
+                        NPY_ITER_DELAY_BUFALLOC,
+                        NPY_KEEPORDER, NPY_UNSAFE_CASTING,
+                        op_flags, dtypes,
+                        -1, NULL, NULL, buffersize);
+
+    if (iter_buffer == NULL) {
+        PyErr_Clear();
+        iter_buffer = NpyIter_AdvancedNew(nop, array_operands,
+                        NPY_ITER_EXTERNAL_LOOP|
+                        NPY_ITER_REFS_OK|
+                        NPY_ITER_ZEROSIZE_OK|
                         NPY_ITER_BUFFERED|
                         NPY_ITER_GROWINNER|
                         NPY_ITER_DELAY_BUFALLOC,
                         NPY_KEEPORDER, NPY_UNSAFE_CASTING,
                         op_flags, dtypes,
                         -1, NULL, NULL, buffersize);
+    }

@rainwoodman (Contributor, Author)

It is still 10x slower than bincount.

In [21]:
index = arange(100000, dtype='i8') % 10
weight = arange(100000, dtype='f4')
out = numpy.zeros(index.max() + 1)
%timeit numpy.bincount(index, weights=weight, minlength=len(out))
10000 loops, best of 3: 191 µs per loop

@seberg (Member) commented Jan 21, 2017

The correct approach there is to delete that iterator completely and replace it with individual element casts (at least for the case of individual elements), I believe (I did not read the code again, though). You will never get bincount speeds. To get close, you would have to iterate the way fancy indexing does, with at least a 1-D specialization. And that is not even possible too easily, because I doubt you have access to all you need through the public mapiter API.

@rainwoodman (Contributor, Author) commented Jan 25, 2017

I think we can probably avoid using MapIter at all. Are there other users of MapIter? Using MapIter here feels like a bit of a stretch. We can use an NpyIter on index and weights instead; roughly (see the Python sketch after this list):

  • To define bincount/at, random access into the out array must be possible; in the numpy data model, when it is possible it is fast (strides * index), and the dtype is also known.

  • Use a regular numpy iterator on the index and weights arrays, where the weights are cast to the dtype of out and the index is cast to an integer type (e.g. intp?).

  • We have to call the ufunc inner loop with a length of 1, because we do not have any better alternatives in the UFunc data structure.

  • Probably add a special code path for "+" and hope the compiler will hoist the condition out of the loop.
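A rough Python-level rendering of that plan (purely illustrative; ufunc_at_sketch is a made-up name, and the real implementation would live in C):

import numpy as np

def ufunc_at_sketch(ufunc, out, index, values):
    # cast once up front, as the NpyIter would do
    index = np.asarray(index, dtype=np.intp)
    values = np.asarray(values, dtype=out.dtype)
    for i, v in zip(index, values):
        # one length-1 "inner loop" call per element
        out[i] = ufunc(out[i], v)

# ufunc_at_sketch(np.add, out, index, weight) behaves like np.add.at(out, index, weight)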

@seberg (Member) commented Jan 25, 2017

You are not going to get advanced indexing easily without using MapIter; for a single index array, sure, you can/should avoid it for speed reasons.

@rainwoodman (Contributor, Author)

What does "single index array" mean? ndim = 1, or array[index]...?

@seberg (Member) commented Jan 25, 2017

Well, fancy indexing has some special cases for both (IIRC -- maybe not). Mostly the one-index, 1-D case is likely the important one, of course (if you assume that the part which is not indexed is large enough), but you can also branch off to all the cases, as advanced indexing does; that gives a lot of different inner loops, including some that avoid MapIter altogether.

Going all the way with that optimization will be quite a chunk of work, and may require getting more information out of MapIter. I don't know whether parts of it may be easier to do, such as the cases where advanced indexing avoids MapIter (maybe not avoiding it fully, but also not using it for iteration, etc.).

But you can get a good factor (similar to or bigger than what you already saw) just by fixing the buffering and specializing the single-item loop, still using MapIter...

@seberg (Member) commented Jan 25, 2017

I don't really want to get into too much detail, because I would have to reread the advanced indexing code myself, heh...

@rainwoodman (Contributor, Author)

Hi. Making the at method faster benefits all categorization operations, though, so rereading the advanced indexing code may be worthwhile.

On the other hand, I'd like to poke around at a relatively fast bincount with a signature like this:

def bincount(indices, weights, minlength=None, ufunc=np.add, out=None):

I'll start by avoiding MapIter. If the out array is really weird, we can resort to holding the result in a temporary before using MapIter or the like to copy it back to the output.
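A minimal sketch of what that signature could do, built on top of ufunc.at (this is the proposal being floated here, not anything in NumPy):

import numpy as np

def bincount(indices, weights, minlength=None, ufunc=np.add, out=None):
    indices = np.asarray(indices, dtype=np.intp)
    weights = np.asarray(weights)
    n = max(int(indices.max()) + 1, minlength or 0)
    if out is None:
        if ufunc.identity is None:
            # e.g. np.fmin has no identity; the caller must supply out
            raise ValueError("ufunc has no identity; pass out explicitly")
        # vector weights: trailing axes of weights propagate into the result
        out = np.full((n,) + weights.shape[1:], ufunc.identity,
                      dtype=weights.dtype)
    ufunc.at(out, indices, weights)
    return out

With ufunc=np.fmin and out=np.full(n, np.inf), this would give the per-bin minima discussed earlier.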

@seberg (Member) commented Jan 25, 2017

Honestly, I don't understand that signature. If you pass in the ufunc, you might as well use the np.add.at signature and put faster special cases into ufunc.at. I simply do not have the time to really think about how to efficiently implement ufunc.at in a nice way.

@seberg (Member) commented Jan 25, 2017

(Though I admit that it's a bit non-trivial that you need to prepare the out array first...)

@mhvk (Contributor) commented Jan 26, 2017

Just thought that I'd link back to some previous discussion (with many comments from @seberg ;-):

All of this suggests that both speed-ups of np.<ufunc>.at and generalizations of np.bincount (that keep its great performance) would be very welcome! (I'd certainly use a bincount version that could count with vector weights myself...)

@rainwoodman (Contributor, Author)

The reason I suggest tampering with bincount is that a copy of the out array may be necessary -- whereas ufunc.at is expected to be explicitly in-place.

If we work out a way to do bincount without the copy in the future, then ufunc.at can be replaced with the code in bincount.

Thanks @mhvk for the links. My proposal was actually raised in one of those threads, though it appears nobody actually implemented it.

@WarrenWeckesser (Member)

Another cross-reference: gh-9397

@mattip (Member) commented Feb 1, 2023

np.add.at got a performance boost in #23136. But it does not cover casting:

>>> import time, numpy as np
>>> index = np.arange(100000, dtype='i8') % 10
>>> weight = np.arange(100000, dtype='f4')
>>> out = np.zeros(index.max() + 1)
>>> t0 = time.perf_counter(); xxx = np.bincount(index, weights=weight, minlength=len(out)); t1 = time.perf_counter()
>>> t_bincount = t1 - t0; t_bincount
0.0013672089553438127
>>> t0 = time.perf_counter(); xxx = np.add.at(out, index, weight); t1 = time.perf_counter()
>>> t_at_casting = t1 - t0; t_at_casting
0.030268130998592824
>>> out = np.zeros(index.max() + 1, dtype='f4')
>>> t0 = time.perf_counter(); xxx = np.add.at(out, index, weight); t1 = time.perf_counter()
>>> t_at_fast = t1 - t0; t_at_fast
0.0010713650262914598
>>> t_at_casting / t_bincount, t_at_fast / t_bincount
(22.138628393478655, 0.7836146933531773)
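A workaround until casting is covered (assuming the same index/weight setup as above): cast the addend to the out dtype once up front, so np.add.at sees matching dtypes and takes the fast path:

>>> out = np.zeros(index.max() + 1)                    # f8 out with f4 weight would hit the slow path
>>> np.add.at(out, index, weight.astype(out.dtype))    # one explicit cast, then the fast path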

@mattip (Member) commented Feb 8, 2023

PR #23136 should make ufunc.at (for 1d, aligned arrays with no casting) as fast as bincount. Closing and opening a new issue to continue discussion there.
