Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: pass counting array to bincount #22471

Open
ntessore opened this issue Oct 23, 2022 · 8 comments
Open

ENH: pass counting array to bincount #22471

ntessore opened this issue Oct 23, 2022 · 8 comments

Comments

@ntessore
Copy link

Proposed new feature or change:

I would like to propose adding the output array as an optional parameter to bincount. This makes bincount very useful when iteratively tallying large amounts of data with large indices.

After a bit of discussion on the mailing list, I think that a good interface for this might be a bins parameter, name debatable, that can either be an integer, which fixes the number of bins, or an array, to which the bin counts are then added. If there are indices larger than allowed by the bins specification, the function should probably raise a ValueError.

To see where this can be very useful, consider this example of making a high-resolution map from large batches of data:

high_res_map = np.zeros(10000**2)
for indices, quantities in next_big_chunk_of_data():
    high_res_map += np.bincount(indices, quantities, 10000**2)  # slow: repeatedly adding large arrays

This would be trivially sped up:

high_res_map = np.zeros(10000**2)
for indices, quantities in next_big_chunk_of_data():
    np.bincount(indices, quantities, bins=high_res_map)  # fast: plain sum-loop in C

Loops like the above can be found a lot in e.g. astronomy.

As far as I know, there is no equivalent numpy functionality, and there isn't any fast alternative outside of writing the simplest of loops in C/Cython/numba/..., which is what some packages, including my own, have done many times over. But this exact loop is what bincount does under the hood, and the change still fits both the purpose and description of bincount nicely, without changing existing behaviour.

The only moral issue is of course what happens when both minlength and bins are given. I think minlength should then be silently ignored. (I actually cannot imagine a situation where you actually want to specify minlength and not bins, but that's beside the point.)

@ntessore ntessore changed the title ENH: <Please write a comprehensive title after the 'ENH: ' prefix> ENH: pass counting array to bincount Oct 23, 2022
@mhvk
Copy link
Contributor

mhvk commented Oct 23, 2022

See also #7998 (which suggested an out argument -- seems slightly more logical to me). See also #9497 and #8495.

Overall, I think there is general agreement something like this would be great to have, but it needs someone to actually implement it (without slowing down bincount as it is); there was even a trial -- see #9424 -- but it was abandoned. There was also a different suggestion, in analogy with the reduce methods of ufuncs, to have both an initial and out argument -- and if initial is out, then use it.

Anyway, in the end I think it is mostly someone finding the time to do it...

@ntessore
Copy link
Author

@mhvk Thanks for pointing out these earlier issues. So this has come up a couple of times.

It looks like there are four things that people want to change in bincount:

  1. Flexibility in the weights data type,
  2. Matching weights and output data types,
  3. Providing a preallocated output array,
  4. Providing an initial value array.

Implementing everything seems a bigger task to me. But implementing only points 3 and 4 while not attacking points 1 and 2 seems doable: accept only a 1d NPY_DOUBLE array for out, ignore minlength, check that the max index in list fits into out, set out to initial if initial is not out, proceed.

The question is: is there interest in the limited change? Or should a change try to cover everything?

@mhvk
Copy link
Contributor

mhvk commented Oct 24, 2022

@ntessore - I think the limited change that you suggest is what most people have asked for, and could very easily stand on its own!

@ntessore
Copy link
Author

Ok, I give it a go when I have a moment. If there are any existing C API functions which have a similar sort of out and initial logic to copy from, that would be helpful information, since none come to mind right now.

@mhvk
Copy link
Contributor

mhvk commented Oct 24, 2022

The suggestion of initial and out comes from the ufuncs, though there initial is always a scalar, so it is an extension of that logic.

@arthurfeeney
Copy link

I think I've got something functional working for initial and out arrays

Just curious what y'all (@ntessore, @mhvk) think: rather than using initial, would it be preferable to just have a boolean parameter? Something like overwrite_out (defaulted to True). It would just indicate whether or not out should be overwritten. So, by default it preserves the normal semantics of an out parameter. If set to False, bincount will just accumulate on top of out.
I don't think there's a huge use for initial outside of setting initial=out? and a bool parameter makes things simpler. Though I'm also not sure if there's any precedent for a parameter like overwrite_out.

@mhvk
Copy link
Contributor

mhvk commented Dec 30, 2022

Agreed that perhaps initial is not super useful beyond indicating that out should not be overwritten. Its main advantage is that it is also used for things like np.sum, so it feels a bit more familiar; I don't think there is a precedent for something like overwrite_out. Also, if we'd ever make this a gufunc, then using initial would make it easier. I can also see potentially useful expansions when we make bincount multi-dimensional (with initial e.g. giving initial values for each row; just making this up obviously!), and/or extend the use of out to np.histogram etc.

Overall, I think I'm still somewhat in favor of initial. But maybe more relevantly, if you have something working, I'd suggest to push it up so we can have a look!

@ntessore
Copy link
Author

ntessore commented Jan 1, 2023

Sorry for dropping the ball on this. I managed to actually make some progress towards all of 1-4 using a rather trivial NpyIter implementation, except I haven't worked out how to support arbitrary dtypes without manually repeating the code, so it got stuck and forgotten.

I only mention this because playing around with the code, it's pretty clear that beyond the "common interface" points that @mhvk raises, all combinations of initial and out can be actually be useful; for example, you might set initial=x, out=y for fixed x and multiple y if you want to tally multiple data sets with the same initial condition.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants