# histogram with array-valued weights #4718

Closed · opened this Issue May 15, 2014 · 9 comments

Sometimes, I have a sample of data which I want to divide into bins and then sum an associated vector over all samples within each bin. This could be accomplished by creating a histogram with array-valued weights. Perhaps there are other ways to do it, but this seems to be the best conceptual match to what I want to do, and is a very natural extension of `np.histogram`.

In the particular example that I am dealing with at the moment, I want to re-weight my samples according to a 4th-order polynomial of an associated quantity. That is, I have an array `hist_variable` with shape `(n,)`, and another array `reweighting_variable` also of shape `(n,)`. My weights are some polynomial function of `reweighting_variable`:

```
rv = reweighting_variable
weights = a0 + a1*rv + a2*rv**2 + a3*rv**3 + a4*rv**4
counts, edges = np.histogram(hist_variable, weights=weights)
```

This works fine, as long as `a0`, `a1`, etc. are known and fixed. If I instead want to vary `a0`, `a1`, etc., then I can create five different histograms:

```
counts0, edges = np.histogram(hist_variable)
counts1 = np.histogram(hist_variable, weights=rv, bins=edges)[0]
counts2 = np.histogram(hist_variable, weights=rv**2, bins=edges)[0]
counts3 = np.histogram(hist_variable, weights=rv**3, bins=edges)[0]
counts4 = np.histogram(hist_variable, weights=rv**4, bins=edges)[0]

def weighted_counts(a0, a1, a2, a3, a4):
    return a0*counts0 + a1*counts1 + a2*counts2 + a3*counts3 + a4*counts4
```

And then I can choose any `a0`, `a1`, etc. that I like, and get my weighted histogram with the function `weighted_counts`, without performing the expensive `np.histogram` step again. This works, but it is not ideal.

The reason it is not ideal is that the expensive binning step of `np.histogram` is repeated once for each weight array, even though the bin assignment is identical every time. I've tried to create a solution using `np.digitize` and fancy indexing, but that is also quite slow. What I wish I could do is

```
weights = np.array([rv**0, rv**1, rv**2, rv**3, rv**4])
counts_array, edges = np.histogram(hist_variable, weights=weights)
```

Can this be implemented efficiently? It seems like summing array-valued weights shouldn't be drastically slower than summing scalar-valued weights.

NumPy member

This is probably better asked in the mailing list, rather than opened as an issue.

One issue with `np.digitize` is that it does a linear search over the bins; with many bins this can cripple performance. I took a look at it some time ago, but I never managed to put together a binary-search solution that didn't perform worse than `np.digitize` for small array sizes.

You can always rewrite a call to `np.digitize` in terms of `np.searchsorted`, which will be much faster for many bins:

```
In [15]: a = np.arange(10)

In [16]: b = np.random.randint(10, size=10)

In [17]: np.digitize(b, a)
Out[17]: array([ 8,  4,  5,  1,  3,  1,  1,  1, 10,  7])

In [18]: np.searchsorted(a, b, side='r')
Out[18]: array([ 8,  4,  5,  1,  3,  1,  1,  1, 10,  7])

In [19]: np.digitize(b, a, right=True)
Out[19]: array([7, 3, 4, 0, 2, 0, 0, 0, 9, 6])

In [20]: np.searchsorted(a, b, side='l')
Out[20]: array([7, 3, 4, 0, 2, 0, 0, 0, 9, 6])
```

Once you have your `hist_variable` binned, either with `np.digitize` or `np.searchsorted`, you can get your counts really fast with `np.bincount`, i.e. something like:

```
bins = np.digitize(hist_variable, edges)
counts0 = np.bincount(bins)
counts1 = np.bincount(bins, weights=rv)
...
countsj = np.bincount(bins, weights=rv**j)
...
```
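
Putting those two pieces together, here is a minimal sketch of the single-binning-pass approach for the polynomial example above. The helper names `polynomial_binned_sums` and `weighted_counts` are illustrative rather than an existing API, and `np.searchsorted` is used in place of `np.digitize` as suggested above:

```
import numpy as np

def polynomial_binned_sums(hist_variable, rv, edges, max_degree=4):
    # One binary search per sample; side='right' matches np.digitize's
    # default convention (unlike np.histogram, a value exactly equal to
    # the last edge lands in the overflow slot rather than the last bin).
    bins = np.searchsorted(edges, hist_variable, side='right')
    n = len(edges) + 1  # indices 0 and len(edges) hold out-of-range samples
    # One bincount per power of rv, all reusing the same bin assignment.
    return [np.bincount(bins, weights=rv**j, minlength=n)
            for j in range(max_degree + 1)]

def weighted_counts(coeffs, binned_sums):
    # Recombining the precomputed per-power sums is cheap, so the
    # coefficients a0..a4 can be varied freely afterwards.
    return sum(a * c for a, c in zip(coeffs, binned_sums))
```

The in-range bins are the slice `[1:len(edges)]` of each returned array.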

Hi Jaime,
Thanks for the swift response. I opened an issue because I was hoping that other people might also need this, and that it could be implemented in a future version of numpy. I intended it as a low-priority feature request.
Regards,
Jon

commented Nov 4, 2015

There is a function which does this in `scipy.stats` called `binned_statistic`. Personally, I would think that this functionality is general enough to be included in numpy, but perhaps the lack of responses here suggests otherwise...
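
For reference, a minimal sketch of what `scipy.stats.binned_statistic` does in the scalar-weight case, using synthetic data purely for illustration:

```
import numpy as np
from scipy.stats import binned_statistic

rng = np.random.default_rng(0)
hist_variable = rng.normal(size=1000)
weights = rng.uniform(size=1000)

# statistic='sum' over per-sample weights reproduces
# np.histogram(hist_variable, weights=weights, bins=edges)
sums, edges, binnumber = binned_statistic(hist_variable, weights,
                                          statistic='sum', bins=10)
counts, _ = np.histogram(hist_variable, weights=weights, bins=edges)
assert np.allclose(sums, counts)
```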

NumPy member
commented Nov 4, 2015

I would not be in favor of moving `binned_statistic`. We have enough duplication between numpy and scipy to get rid of as it is. Let's not add more for functionality that fits perfectly fine in `scipy.stats` and would need to be deprecated there if it's copied to numpy.

commented Nov 4, 2015

@rgommers yeah, that makes a lot of sense. So this issue should probably just be closed then?

commented Nov 4, 2015

`binned_statistic` does not appear to do what I am (was) looking for at all. The first two arguments (`x` and `values`) are still required to be the same shape.

If I give, say, `x` with shape `(5,)` and `values` with shape `(5, 2)`, then `binned_statistic` either raises an exception or exhibits a behavior other than what I would expect/desire, depending on the value of the `statistic` argument:

- When `statistic` is "mean" or "sum", it raises an exception.
- When `statistic` is "median", it finds the median among all values that fall into each bin, as if `values` were flattened after binning.
- When `statistic` is "count", it ignores `values` completely.
- When `statistic` is a function, things are fine as long as the return value of that function is a scalar. But this prohibits me from producing the behavior I originally described even via a function value for `statistic`.

In the example above, I would expect the output to have shape `(n_bins, 2)`. In general, if `x` has shape `(k,)` and `values` has shape `(k, m)`, and there are `n` bins, then I am looking for an output with shape `(n, m)`.
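
As a concrete illustration of that shape contract, here is a minimal sketch that bins once with `np.digitize` and accumulates the rows of `values` with `np.add.at`; the function name `binned_sums_2d` is just for illustration, and `values` is assumed to be a 2-D float array:

```
import numpy as np

def binned_sums_2d(x, values, edges):
    # x has shape (k,), values has shape (k, m);
    # returns an array of shape (len(edges) - 1, m).
    n_bins = len(edges) - 1
    out = np.zeros((n_bins, values.shape[1]))
    idx = np.digitize(x, edges) - 1            # 0 .. n_bins-1 for in-range samples
    in_range = (idx >= 0) & (idx < n_bins)     # drop samples outside the edges
    # Unbuffered scatter-add: each row of values is added into its bin's row,
    # with repeated bin indices accumulated correctly.
    np.add.at(out, idx[in_range], values[in_range])
    return out
```

Note that, unlike `np.histogram`, a sample exactly equal to the last edge is dropped as out of range in this sketch.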

This was referenced Nov 12, 2015 in scipy/scipy#5497 (closed): "Enhancement to binned_statistic: option to unraveled returned bin-number mappings, and able to pass multiple data arrays at once".

@jsw-fnal sorry, I initially misunderstood your issue. I've included this suggested enhancement in the Scipy PR #5497; please let me know if you have any feedback.

NumPy member

The Scipy PR has been merged; possible follow-ups should also be discussed on the scipy issue tracker or mailing list, I think. So closing this. Thanks all.

closed this Nov 24, 2015