How will new histogram bin selectors work with range parameter? #7411

davidchall · 2016-03-12T16:33:15Z

I've seen that v1.11 will be introducing aliases for different histogram binning optimization methods. I'm really looking forward to this feature, so thank you for implementing this!

My question is how these methods will interact with the range parameter (this is not currently documented).

If the range is within (min, max), I imagine that the data sent to the binning method is restricted to this range.
If the range is extends beyond (min, max), I am not sure how this would be handled. Will methods that return a binwidth fill the entire range according to this binwidth? And will methods that return a number of bins be forced to expand their binwidth to cover the full range?

seberg · 2016-03-12T16:36:17Z

Uhh, very good point, I think they may just ignore it when estimating the number of bins. @nayyarv, @madphysicist what do you guys think?

madphysicist · 2016-03-12T17:49:59Z

If I am not mistaken, @nayyarv included range support in the last commit he made. All the estimators return a number of bins in the end, even the ones that compute a bin width. If I am not mistaken, the bin count is applied over the range. I will look into it.

madphysicist · 2016-03-13T06:41:43Z

I looked back at the modifications to histogram. Support for range was inroduced here: #7243. It effectively trims off any part of the array that is outside the range. However, the conversion of bin-width computations to bin counts is done using the ptp of the (remaining) array, not the requested range. Both the main function and the histogram estimators behave the same way. The next change, #7199 preserves that behavior.

I believe that the way bin counts are handled is a bug since the histogram is normalized to the requested range, not the ptp of the remaining data. This means that the bin width implied the the count is actually being ignored. There are two simple fixes:

Adjust the bin count to cover the range given that the count the estimators return is the ptp.
Modify the estimators to return the bin width.

I prefer option 2 because it will require less computation and will allow the estimators to be more self contained. Rather than recomputing the ptp and comparing it to the range, the main function can set the number of bins itself, given a width. The main function should not have to know about the internals of the estimators, like what range the returned number of counts applies to.

I can see a number of actions coming from this issue:

Correct the behavior in the presence of a range.
Update the documentation.
Move the code that computes the actual range up a few lines so that it is not done twice by two slightly different methods: https://github.com/madphysicist/numpy/blob/master/numpy/lib/function_base.py#L499 vs https://github.com/madphysicist/numpy/blob/master/numpy/lib/function_base.py#L524

On a side note, weights are coming soon. I am working out A) a new way of doing introselect, and B) some benchmarks to demonstrate that it is much better to do that than naive sort.

madphysicist · 2016-03-13T06:44:18Z

To sum up, @seberg the range is being handled, but there is a subtle bug.

davidchall · 2016-03-13T16:20:19Z

Thanks for investigating, @madphysicist

This addresses part 1 of my question, which is when the range is trimmed. Part 2 of my question is about when the range is extended beyond (min, max), or one side is trimmed while the other is extended.

My worry was that the bin selectors would set the number of bins according to the input data (as they should), and then these would be stretched across the full range. This would cause inflated bin widths in the range where the actual data lies.

I think that your solution of making the bin estimators return a bin width instead of a number of bins will also solve this second part.

nayyarv · 2016-03-13T16:33:29Z

@madphysicist is very much on top of this.

There is an implicit assumption that the number of bins returned is based on the range of the data. It's explicit in the fd and scott estimators, while more implicit for sqrt, sturges, rice, doane,

This means when the range is excessively large, the number of bins calculated will be non-optimal since they were calculated for the range of the data passed in, not the actual range set by the user.
This can even pop up in situations where the range filters out some data, but still results in larger gaps on either side of the range filtered data.

I agree that returning the binwidth instead is a very clean solution, and then using mn and mx to calculate the proper number of bins.

Here's a suggested PR: master...nayyarv:master

The methods that estimate bin numbers are decorated to return binwidth's instead, while those that calculate binwidth are simplified.
If the function get's a binwidth of 0, it just reverts to 1 bin, as was the default before.

madphysicist · 2016-03-13T17:07:28Z

@nayyarv Would you be OK with putting the decorator code directly into the functions?

madphysicist · 2016-03-13T17:09:44Z

Aside from that, LGTM. Moving the duplicated range computation up and documenting the behavior can be a separate PR.

madphysicist · 2016-03-13T17:12:44Z

@davidchall I did not explicitly acknowledge this in my original post, but your original post did in fact uncover a bug. Your worry that the range would be stretched was well founded, but @nayyarv has proposed what I think is the correct fix. With those changes, the optimal size will be computed based on the clipped data but correctly applied to the range regardless of how it overlaps the data.

madphysicist · 2016-03-14T15:34:44Z

@nayyarv. 'auto' should return the minimum width. Also, I am pretty sure that you would want to remove the ceil+int conversions in the estimators that compute bincounts. I am prepping a PR that handles that and a couple of extra corner cases.

madphysicist · 2016-03-14T21:11:01Z

@davidchall It looks like using widths did in fact solve the problem, as you can see from the updated tests. The simple range test now reports about twice the number of bins it used to because the range is approximately twice the PTP of the input.

davidchall · 2016-03-14T23:02:24Z

@madphysicist Thank you - this is the behaviour that I would expect. It might be a good idea to mention this interplay between range and the binning method in the docstring for the main histogram() function, but that is just my own opinion.

madphysicist · 2016-03-15T00:52:06Z

@davidchall Thanks for noticing that. It was what we discussed and was certainly my intent all along. I will update the docs shortly.

madphysicist · 2016-03-15T03:07:16Z

Fixed. I think we should move further discussion of the code and docs to the PR chat.

Fixes numpy#7411. Tests and documentation updated. Fixes other small issues with range and bin count computations.

madphysicist · 2016-03-16T18:23:30Z

@davidchall The PR has moved to #7423 in case you have any further comments. You can still leave comments in #7416 since the entire set of code changes is visible there.

Fixes numpy#7411. Tests and documentation updated. Fixes other small issues with range and bin count computations.

madphysicist mentioned this issue Mar 14, 2016

BUG: Incorrect handling of range in histogram with automatic bins. #7416

Merged

madphysicist added a commit to madphysicist/numpy that referenced this issue Mar 16, 2016

BUG: Incorrect handling of range in histogram with automatic bins.

127eb9e

Fixes numpy#7411. Tests and documentation updated. Fixes other small issues with range and bin count computations.

charris closed this as completed in #7416 Mar 16, 2016

madphysicist added a commit to madphysicist/numpy that referenced this issue Mar 16, 2016

BUG: Incorrect handling of range in histogram with automatic bins.

86f83fb

Fixes numpy#7411. Tests and documentation updated. Fixes other small issues with range and bin count computations.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How will new histogram bin selectors work with range parameter? #7411

How will new histogram bin selectors work with range parameter? #7411

davidchall commented Mar 12, 2016

seberg commented Mar 12, 2016

madphysicist commented Mar 12, 2016

madphysicist commented Mar 13, 2016

madphysicist commented Mar 13, 2016

davidchall commented Mar 13, 2016

nayyarv commented Mar 13, 2016

madphysicist commented Mar 13, 2016

madphysicist commented Mar 13, 2016

madphysicist commented Mar 13, 2016

madphysicist commented Mar 14, 2016

madphysicist commented Mar 14, 2016

davidchall commented Mar 14, 2016

madphysicist commented Mar 15, 2016

madphysicist commented Mar 15, 2016

madphysicist commented Mar 16, 2016

How will new histogram bin selectors work with range parameter? #7411

How will new histogram bin selectors work with range parameter? #7411

Comments

davidchall commented Mar 12, 2016

seberg commented Mar 12, 2016

madphysicist commented Mar 12, 2016

madphysicist commented Mar 13, 2016

madphysicist commented Mar 13, 2016

davidchall commented Mar 13, 2016

nayyarv commented Mar 13, 2016

madphysicist commented Mar 13, 2016

madphysicist commented Mar 13, 2016

madphysicist commented Mar 13, 2016

madphysicist commented Mar 14, 2016

madphysicist commented Mar 14, 2016

davidchall commented Mar 14, 2016

madphysicist commented Mar 15, 2016

madphysicist commented Mar 15, 2016

madphysicist commented Mar 16, 2016