Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How will new histogram bin selectors work with range parameter? #7411

Closed
davidchall opened this issue Mar 12, 2016 · 15 comments · Fixed by #7416
Closed

How will new histogram bin selectors work with range parameter? #7411

davidchall opened this issue Mar 12, 2016 · 15 comments · Fixed by #7416

Comments

@davidchall
Copy link

I've seen that v1.11 will be introducing aliases for different histogram binning optimization methods. I'm really looking forward to this feature, so thank you for implementing this!

My question is how these methods will interact with the range parameter (this is not currently documented).

  • If the range is within (min, max), I imagine that the data sent to the binning method is restricted to this range.
  • If the range is extends beyond (min, max), I am not sure how this would be handled. Will methods that return a binwidth fill the entire range according to this binwidth? And will methods that return a number of bins be forced to expand their binwidth to cover the full range?
@seberg
Copy link
Member

seberg commented Mar 12, 2016

Uhh, very good point, I think they may just ignore it when estimating the number of bins. @nayyarv, @madphysicist what do you guys think?

@madphysicist
Copy link
Contributor

If I am not mistaken, @nayyarv included range support in the last commit he made. All the estimators return a number of bins in the end, even the ones that compute a bin width. If I am not mistaken, the bin count is applied over the range. I will look into it.

@madphysicist
Copy link
Contributor

I looked back at the modifications to histogram. Support for range was inroduced here: #7243. It effectively trims off any part of the array that is outside the range. However, the conversion of bin-width computations to bin counts is done using the ptp of the (remaining) array, not the requested range. Both the main function and the histogram estimators behave the same way. The next change, #7199 preserves that behavior.

I believe that the way bin counts are handled is a bug since the histogram is normalized to the requested range, not the ptp of the remaining data. This means that the bin width implied the the count is actually being ignored. There are two simple fixes:

  1. Adjust the bin count to cover the range given that the count the estimators return is the ptp.
  2. Modify the estimators to return the bin width.

I prefer option 2 because it will require less computation and will allow the estimators to be more self contained. Rather than recomputing the ptp and comparing it to the range, the main function can set the number of bins itself, given a width. The main function should not have to know about the internals of the estimators, like what range the returned number of counts applies to.

I can see a number of actions coming from this issue:

  1. Correct the behavior in the presence of a range.
  2. Update the documentation.
  3. Move the code that computes the actual range up a few lines so that it is not done twice by two slightly different methods: https://github.com/madphysicist/numpy/blob/master/numpy/lib/function_base.py#L499 vs https://github.com/madphysicist/numpy/blob/master/numpy/lib/function_base.py#L524

On a side note, weights are coming soon. I am working out A) a new way of doing introselect, and B) some benchmarks to demonstrate that it is much better to do that than naive sort.

@madphysicist
Copy link
Contributor

To sum up, @seberg the range is being handled, but there is a subtle bug.

@davidchall
Copy link
Author

Thanks for investigating, @madphysicist

This addresses part 1 of my question, which is when the range is trimmed. Part 2 of my question is about when the range is extended beyond (min, max), or one side is trimmed while the other is extended.

My worry was that the bin selectors would set the number of bins according to the input data (as they should), and then these would be stretched across the full range. This would cause inflated bin widths in the range where the actual data lies.

I think that your solution of making the bin estimators return a bin width instead of a number of bins will also solve this second part.

@nayyarv
Copy link
Contributor

nayyarv commented Mar 13, 2016

@madphysicist is very much on top of this.

There is an implicit assumption that the number of bins returned is based on the range of the data. It's explicit in the fd and scott estimators, while more implicit for sqrt, sturges, rice, doane,

This means when the range is excessively large, the number of bins calculated will be non-optimal since they were calculated for the range of the data passed in, not the actual range set by the user.
This can even pop up in situations where the range filters out some data, but still results in larger gaps on either side of the range filtered data.

I agree that returning the binwidth instead is a very clean solution, and then using mn and mx to calculate the proper number of bins.

Here's a suggested PR: master...nayyarv:master

The methods that estimate bin numbers are decorated to return binwidth's instead, while those that calculate binwidth are simplified.
If the function get's a binwidth of 0, it just reverts to 1 bin, as was the default before.

@madphysicist
Copy link
Contributor

@nayyarv Would you be OK with putting the decorator code directly into the functions?

@madphysicist
Copy link
Contributor

Aside from that, LGTM. Moving the duplicated range computation up and documenting the behavior can be a separate PR.

@madphysicist
Copy link
Contributor

@davidchall I did not explicitly acknowledge this in my original post, but your original post did in fact uncover a bug. Your worry that the range would be stretched was well founded, but @nayyarv has proposed what I think is the correct fix. With those changes, the optimal size will be computed based on the clipped data but correctly applied to the range regardless of how it overlaps the data.

@madphysicist
Copy link
Contributor

@nayyarv. 'auto' should return the minimum width. Also, I am pretty sure that you would want to remove the ceil+int conversions in the estimators that compute bincounts. I am prepping a PR that handles that and a couple of extra corner cases.

@madphysicist
Copy link
Contributor

@davidchall It looks like using widths did in fact solve the problem, as you can see from the updated tests. The simple range test now reports about twice the number of bins it used to because the range is approximately twice the PTP of the input.

@davidchall
Copy link
Author

@madphysicist Thank you - this is the behaviour that I would expect. It might be a good idea to mention this interplay between range and the binning method in the docstring for the main histogram() function, but that is just my own opinion.

@madphysicist
Copy link
Contributor

@davidchall Thanks for noticing that. It was what we discussed and was certainly my intent all along. I will update the docs shortly.

@madphysicist
Copy link
Contributor

Fixed. I think we should move further discussion of the code and docs to the PR chat.

madphysicist added a commit to madphysicist/numpy that referenced this issue Mar 16, 2016
Fixes numpy#7411. Tests and documentation updated.
Fixes other small issues with range and bin count computations.
@madphysicist
Copy link
Contributor

@davidchall The PR has moved to #7423 in case you have any further comments. You can still leave comments in #7416 since the entire set of code changes is visible there.

madphysicist added a commit to madphysicist/numpy that referenced this issue Mar 16, 2016
Fixes numpy#7411. Tests and documentation updated.
Fixes other small issues with range and bin count computations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants