ENH: pass counting array to bincount #22471

ntessore · 2022-10-23T11:15:02Z

Proposed new feature or change:

I would like to propose adding the output array as an optional parameter to bincount. This makes bincount very useful when iteratively tallying large amounts of data with large indices.

After a bit of discussion on the mailing list, I think that a good interface for this might be a bins parameter, name debatable, that can either be an integer, which fixes the number of bins, or an array, to which the bin counts are then added. If there are indices larger than allowed by the bins specification, the function should probably raise a ValueError.

To see where this can be very useful, consider this example of making a high-resolution map from large batches of data:

high_res_map = np.zeros(10000**2)
for indices, quantities in next_big_chunk_of_data():
    high_res_map += np.bincount(indices, quantities, 10000**2)  # slow: repeatedly adding large arrays

This would be trivially sped up:

high_res_map = np.zeros(10000**2)
for indices, quantities in next_big_chunk_of_data():
    np.bincount(indices, quantities, bins=high_res_map)  # fast: plain sum-loop in C

Loops like the above can be found a lot in e.g. astronomy.

As far as I know, there is no equivalent numpy functionality, and there isn't any fast alternative outside of writing the simplest of loops in C/Cython/numba/..., which is what some packages, including my own, have done many times over. But this exact loop is what bincount does under the hood, and the change still fits both the purpose and description of bincount nicely, without changing existing behaviour.

The only moral issue is of course what happens when both minlength and bins are given. I think minlength should then be silently ignored. (I actually cannot imagine a situation where you actually want to specify minlength and not bins, but that's beside the point.)

The text was updated successfully, but these errors were encountered:

mhvk · 2022-10-23T19:03:24Z

See also #7998 (which suggested an out argument -- seems slightly more logical to me). See also #9497 and #8495.

Overall, I think there is general agreement something like this would be great to have, but it needs someone to actually implement it (without slowing down bincount as it is); there was even a trial -- see #9424 -- but it was abandoned. There was also a different suggestion, in analogy with the reduce methods of ufuncs, to have both an initial and out argument -- and if initial is out, then use it.

Anyway, in the end I think it is mostly someone finding the time to do it...

ntessore · 2022-10-24T08:59:39Z

@mhvk Thanks for pointing out these earlier issues. So this has come up a couple of times.

It looks like there are four things that people want to change in bincount:

Flexibility in the weights data type,
Matching weights and output data types,
Providing a preallocated output array,
Providing an initial value array.

Implementing everything seems a bigger task to me. But implementing only points 3 and 4 while not attacking points 1 and 2 seems doable: accept only a 1d NPY_DOUBLE array for out, ignore minlength, check that the max index in list fits into out, set out to initial if initial is not out, proceed.

The question is: is there interest in the limited change? Or should a change try to cover everything?

mhvk · 2022-10-24T12:30:04Z

@ntessore - I think the limited change that you suggest is what most people have asked for, and could very easily stand on its own!

ntessore · 2022-10-24T19:49:47Z

Ok, I give it a go when I have a moment. If there are any existing C API functions which have a similar sort of out and initial logic to copy from, that would be helpful information, since none come to mind right now.

mhvk · 2022-10-24T19:56:45Z

The suggestion of initial and out comes from the ufuncs, though there initial is always a scalar, so it is an extension of that logic.

arthurfeeney · 2022-12-30T20:17:46Z

I think I've got something functional working for initial and out arrays

Just curious what y'all (@ntessore, @mhvk) think: rather than using initial, would it be preferable to just have a boolean parameter? Something like overwrite_out (defaulted to True). It would just indicate whether or not out should be overwritten. So, by default it preserves the normal semantics of an out parameter. If set to False, bincount will just accumulate on top of out.
I don't think there's a huge use for initial outside of setting initial=out? and a bool parameter makes things simpler. Though I'm also not sure if there's any precedent for a parameter like overwrite_out.

mhvk · 2022-12-30T21:01:04Z

Agreed that perhaps initial is not super useful beyond indicating that out should not be overwritten. Its main advantage is that it is also used for things like np.sum, so it feels a bit more familiar; I don't think there is a precedent for something like overwrite_out. Also, if we'd ever make this a gufunc, then using initial would make it easier. I can also see potentially useful expansions when we make bincount multi-dimensional (with initial e.g. giving initial values for each row; just making this up obviously!), and/or extend the use of out to np.histogram etc.

Overall, I think I'm still somewhat in favor of initial. But maybe more relevantly, if you have something working, I'd suggest to push it up so we can have a look!

ntessore · 2023-01-01T22:47:37Z

Sorry for dropping the ball on this. I managed to actually make some progress towards all of 1-4 using a rather trivial NpyIter implementation, except I haven't worked out how to support arbitrary dtypes without manually repeating the code, so it got stuck and forgotten.

I only mention this because playing around with the code, it's pretty clear that beyond the "common interface" points that @mhvk raises, all combinations of initial and out can be actually be useful; for example, you might set initial=x, out=y for fixed x and multiple y if you want to tally multiple data sets with the same initial condition.

ntessore changed the title ~~ENH: <Please write a comprehensive title after the 'ENH: ' prefix>~~ ENH: pass counting array to bincount Oct 23, 2022

mhvk mentioned this issue Nov 2, 2022

ENH: Move identity to the ArrayMethod to allow customization #20970

Merged

arthurfeeney mentioned this issue Dec 31, 2022

ENH: add initial and out parameters to bincount #22907

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: pass counting array to bincount #22471

ENH: pass counting array to bincount #22471

ntessore commented Oct 23, 2022

mhvk commented Oct 23, 2022

ntessore commented Oct 24, 2022

mhvk commented Oct 24, 2022

ntessore commented Oct 24, 2022

mhvk commented Oct 24, 2022

arthurfeeney commented Dec 30, 2022

mhvk commented Dec 30, 2022

ntessore commented Jan 1, 2023 •

edited

ENH: pass counting array to bincount #22471

ENH: pass counting array to bincount #22471

Comments

ntessore commented Oct 23, 2022

Proposed new feature or change:

mhvk commented Oct 23, 2022

ntessore commented Oct 24, 2022

mhvk commented Oct 24, 2022

ntessore commented Oct 24, 2022

mhvk commented Oct 24, 2022

arthurfeeney commented Dec 30, 2022

mhvk commented Dec 30, 2022

ntessore commented Jan 1, 2023 • edited

ntessore commented Jan 1, 2023 •

edited