Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Incorrect handling of range in histogram with automatic bins. #7416

Merged
merged 2 commits into from
Mar 16, 2016

Conversation

madphysicist
Copy link
Contributor

Fixes #7411. Tests and documentation updated.

Note that the number of bins has nearly doubled in TestHistogramOptimBinNums.test_simple_range. This is the correct behavior since the data is generated over the range [-10, 10] but the requested range is [-20, 20]. The original fixed values were the same as for TestHistogramOptimBinNums.test_simple, which used the PTP of the data in the estimate.

By switching the results of the estimators to bin width instead of count, this PR uncovered and fixed the following problems that had gone previously unnoticed:

  • 'auto', 'rice', 'sturges' and 'sqrt' estimators returned non-one for the "No Variance" test. Clearly should be 1 in all cases. This was happening because of rounding in the bin count computation.
  • 'rice' estimator returned 2 for input with datasize 1. Clearly should be 1 as well.

Fixed another minor potential issue: The computation of mn, mx of the range was being done in two slightly different ways for the main function and histogram bin estimators. Aside from code duplication, this could cause discrepancies in corner cases.

Finally, replaced / with // in a couple of tests to make warnings go away. Turns out 500 / 5 is not always an integer in Python.


if isinstance(bins, basestring):
# if `bins` is a string for an automatic method,
# this will replace it with the number of bins calculated
if bins not in _hist_bin_selectors:
raise ValueError("{0} not a valid estimator for `bins`".format(bins))
raise ValueError("{} not a valid estimator for bins".format(bins))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of {} instead of {0} will break on 2.6, which I think we're still supporting for 1.11 -- were you hoping for this to be backported to 1.11.0 or 1.11.1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, I was not thinking that far ahead. Either one will be fine.

@njsmith
Copy link
Member

njsmith commented Mar 16, 2016

One small comment about 2.6 compatibility, but otherwise LGTM.

madphysicist and others added 2 commits March 16, 2016 05:49
Fixes numpy#7411. Tests and documentation updated.
Fixes other small issues with range and bin count computations.
Described ad nauseum the relationship between `range` parameter and bin estimation.
Updated formulas for estimators now that they are returning bin widths.
@madphysicist
Copy link
Contributor Author

Fixed.

@charris charris added this to the 1.11.0 release milestone Mar 16, 2016
charris added a commit that referenced this pull request Mar 16, 2016
BUG: Incorrect handling of range in `histogram` with automatic bins.
@charris charris merged commit 858b5b2 into numpy:master Mar 16, 2016
@charris
Copy link
Member

charris commented Mar 16, 2016

Thanks @madphysicist .

"""
return int(np.ceil(np.log2(x.size))) + 1
return x.ptp() / np.ceil(np.log2(x.size) + 1.0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the np.ceil be removed from this?

@nayyarv
Copy link
Contributor

nayyarv commented Mar 16, 2016

@madphysicist you're moving too fast for me to even have a chance to review things haha. I remember my original PR being far more sedate haha.

Other than my comment, it looks pretty good.

We're still going to get some funky behaviour when we have large binwidth's and excessively large range (if we have like 2 bins and 10 pieces of data tightly clustered, we could generate 2 bins of height 5 instead of 1 bin with height 10 ), but I think anyone doing stuff with small data sizes and stupidly large ranges shouldn't expect perfect behaviour.

@madphysicist
Copy link
Contributor Author

@nayyarv I fixed the ceil. I think it was a first attempt that I forgot to fix later. Since I can not reopen a PR, I made a new one: #7423. I was surprised by how quickly the PR was merged too.

Using bin widths that depend purely on the data rather than bin counts will actually prevent the histogram from looking funky. Basically, doubling the width of the range does not affect the way the data is binned since the width is computed from the actual PTP within the range. The only effect that the range can have is the positioning of the bins relative to the data. That is in fact a good use-case for range outside of trimming the data. However, to fully support it, there would have to be a way to pass in a fixed bin width rather than a bin count. That however, is a matter for another PR.

@charris
Copy link
Member

charris commented Mar 16, 2016

@madphysicist Could you make a PR against maintenance/1.11.x with the fixes for range? It isn't a straightforward backport due to other changes in master.

@charris
Copy link
Member

charris commented Mar 16, 2016

The problem is that the bin estimators are new in 1.11 and I don't want to release a buggy version. Enhancements I don't care about, but bugs are a no-no. So anything that is considered a bug should also be fixed in a PR against maintenance/1.11.x.

madphysicist added a commit to madphysicist/numpy that referenced this pull request Mar 16, 2016
@charris charris removed this from the 1.11.0 release milestone Mar 16, 2016
charris added a commit that referenced this pull request Mar 16, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

How will new histogram bin selectors work with range parameter?
4 participants