-
-
Notifications
You must be signed in to change notification settings - Fork 9.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Incorrect handling of range in histogram
with automatic bins.
#7416
Conversation
|
||
if isinstance(bins, basestring): | ||
# if `bins` is a string for an automatic method, | ||
# this will replace it with the number of bins calculated | ||
if bins not in _hist_bin_selectors: | ||
raise ValueError("{0} not a valid estimator for `bins`".format(bins)) | ||
raise ValueError("{} not a valid estimator for bins".format(bins)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The use of {}
instead of {0}
will break on 2.6, which I think we're still supporting for 1.11 -- were you hoping for this to be backported to 1.11.0 or 1.11.1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I was not thinking that far ahead. Either one will be fine.
One small comment about 2.6 compatibility, but otherwise LGTM. |
Fixes numpy#7411. Tests and documentation updated. Fixes other small issues with range and bin count computations.
Described ad nauseum the relationship between `range` parameter and bin estimation. Updated formulas for estimators now that they are returning bin widths.
a1ab4a8
to
8869c1a
Compare
Fixed. |
BUG: Incorrect handling of range in `histogram` with automatic bins.
Thanks @madphysicist . |
""" | ||
return int(np.ceil(np.log2(x.size))) + 1 | ||
return x.ptp() / np.ceil(np.log2(x.size) + 1.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't the np.ceil be removed from this?
@madphysicist you're moving too fast for me to even have a chance to review things haha. I remember my original PR being far more sedate haha. Other than my comment, it looks pretty good. We're still going to get some funky behaviour when we have large binwidth's and excessively large range (if we have like 2 bins and 10 pieces of data tightly clustered, we could generate 2 bins of height 5 instead of 1 bin with height 10 ), but I think anyone doing stuff with small data sizes and stupidly large ranges shouldn't expect perfect behaviour. |
@nayyarv I fixed the Using bin widths that depend purely on the data rather than bin counts will actually prevent the histogram from looking funky. Basically, doubling the width of the range does not affect the way the data is binned since the width is computed from the actual PTP within the range. The only effect that the range can have is the positioning of the bins relative to the data. That is in fact a good use-case for range outside of trimming the data. However, to fully support it, there would have to be a way to pass in a fixed bin width rather than a bin count. That however, is a matter for another PR. |
@madphysicist Could you make a PR against |
The problem is that the bin estimators are new in 1.11 and I don't want to release a buggy version. Enhancements I don't care about, but bugs are a no-no. So anything that is considered a bug should also be fixed in a PR against maintenance/1.11.x. |
Fixes #7411. Tests and documentation updated.
Note that the number of bins has nearly doubled in
TestHistogramOptimBinNums.test_simple_range
. This is the correct behavior since the data is generated over the range [-10, 10] but the requested range is [-20, 20]. The original fixed values were the same as forTestHistogramOptimBinNums.test_simple
, which used the PTP of the data in the estimate.By switching the results of the estimators to bin width instead of count, this PR uncovered and fixed the following problems that had gone previously unnoticed:
2
for input with datasize 1. Clearly should be 1 as well.Fixed another minor potential issue: The computation of mn, mx of the range was being done in two slightly different ways for the main function and histogram bin estimators. Aside from code duplication, this could cause discrepancies in corner cases.
Finally, replaced
/
with//
in a couple of tests to make warnings go away. Turns out500 / 5
is not always an integer in Python.