Skip to content

histogram with array-valued weights #4718

Closed
jsw-fnal opened this Issue May 15, 2014 · 9 comments

5 participants

@jsw-fnal

Sometimes, I have a sample of data which I want to divide into bins and then sum an associated vector over all samples within each bin. This could be accomplished by creating a histogram with array-valued weights. Perhaps there are other ways to do it, but this seems to be the best conceptual match to what I want to do, and is a very natural extension of np.histogram.

In the particular example that I am dealing with at the moment, I want to re-weight my samples according to a 4th-order polynomial of an associated quantity. That is, I have an array hist_variable with shape (n,), and another array reweighting_variable also of shape (n,). My weights are some polynomial function of reweighting_variable:

rv = reweighting_variable
weights = a0 + a1*rv + a2*rv**2 + a3*rv**3 + a4*rv**4
counts, edges = np.histogram(hist_variable, weights=weights)

This works fine, as long as a0, a1, etc. are known and fixed. If I instead want to vary a0, a1, etc., then I can create five different histograms:

counts0, edges = np.histogram(hist_variable)
counts1 = np.histogram(hist_variable, weights=rv, bins=edges)[0]
counts2 = np.histogram(hist_variable, weights=rv**2, bins=edges)[0]
counts3 = np.histogram(hist_variable, weights=rv**3, bins=edges)[0]
counts4 = np.histogram(hist_variable, weights=rv**4, bins=edges)[0]

def weighted_counts(a0, a1, a2, a3, a4):
    return a0*counts0 + a1*counts1 + a2*counts2 + a3*counts3 + a4*counts4

And then I can choose any a0, a1, etc. that I like, and get my weighted histogram with the function weighted_counts, without performing the expensive np.histogram step again. This works, but it is not ideal.

The reason that it is not ideal is that I must re-do the binning step of np.histogram many times. This is expensive. I've tried to create a solution using np.digitize and fancy indexing, but that is also quite slow. What I wish I could do is

weights = array([rv**0, rv**1, rv**2, rv**3, rv**4])
counts_array, edges = np.histogram(hist_variable, weights=weights)

Can this be implemented efficiently? It seems like summing array-valued weights shouldn't be drastically slower than summing scalar-valued weights.

@jaimefrio
NumPy member

This is probably better asked in the mailing list, rather than opened as an issue.

An issue np.digitize has is that it does a linear search of the bins. If you have many bins it can cripple performance. I took a look at it some time ago, but I never managed to put together a solution using binary search that didn't perform worse than np.digitize for small array sizes.

You can always rewrite a call to np.digitize in terms of np.searchsorted, which will be much faster for many bins:

In [15]: a = np.arange(10)

In [16]: b = np.random.randint(10, size=10)

In [17]: np.digitize(b, a)
Out[17]: array([ 8,  4,  5,  1,  3,  1,  1,  1, 10,  7])

In [18]: np.searchsorted(a, b, side='r')
Out[18]: array([ 8,  4,  5,  1,  3,  1,  1,  1, 10,  7])

In [19]: np.digitize(b, a, right=True)
Out[19]: array([7, 3, 4, 0, 2, 0, 0, 0, 9, 6])

In [20]: np.searchsorted(a, b, side='l')
Out[20]: array([7, 3, 4, 0, 2, 0, 0, 0, 9, 6])

Once you have your hist_variable binned, either with np.digitize or np.searchsorted, you can get your counts really fast with np.bincount, i.e. something like:

bins = np.digitize(hist_variable, edges)
counts0 = np.bincount(bins)
counts1 = np.bincount(bins, weights=rv)
...
countsj = np.bincount(bins, weights=rv**j)
...
@jsw-fnal

Hi Jaime,
Thanks for the swift response. I opened an issue because I was hoping that other people might also need this, and that it could be implemented into a future version of numpy. I intended it as a low-priority feature request.
Regards,
Jon

@lzkelley
lzkelley commented Nov 4, 2015

There is a function which does this in scipy.stats called binned_statistic. Personally, I would think that this functionality is general enough to be included in numpy, but perhaps the lack of responses here suggests otherwise...

@njsmith
NumPy member
@rgommers
NumPy member
rgommers commented Nov 4, 2015

I would not be in favor of moving binned_statistic. We have enough duplication between numpy and scipy to get rid of as it is. Let's not add more for functionality that fits perfectly fine in scipy.stats and would need to be deprecated there if it's copied to numpy.

@lzkelley
lzkelley commented Nov 4, 2015

@rgommers yeah, that makes a lot of sense. So this issue should probably just be closed then?

@jsw-fnal
jsw-fnal commented Nov 4, 2015

binned_statistic does not appear to do what I am (was) looking for at all. The first two arguments (x and values) are still required to be the same shape.

If I give, say, x with shape (5,) and values with shape (5, 2), then binned_statistic either raises an exception or exhibits a behavior other than what I would expect/desire, depending on the value of the statistic argument:

  • When statistic is "mean" or "sum", it raises an exception
  • When statistic is "median", it finds the median among all values that fall into each bin, as if values were flattened after binning
  • When statistic is "count", it ignores values completely
  • When statistic is a function, as long as the return value of that function is a scalar, things are fine. But this prohibits me from producing the behavior I originally described even via a function value for statistic

In the example above, I would expect the output to have shape (n_bins, 2). In general, if x has shape (k,) and values has shape (k, m), and there are n bins, then I am looking for an output with shape (n, m).

@lzkelley

@jsw-fnal sorry I initially misunderstood your issue. I've included this suggested enhancement into the Scipy PR #5497, please let me know if you have any feedback.

@rgommers
NumPy member

The Scipy PR has been merged; possible follow-ups also should be discussed on the scipy issue tracker or mailing list I think. So closing this. Thanks all.

@rgommers rgommers closed this Nov 24, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Something went wrong with that request. Please try again.