
ENH: Automatic number of bins for np.histogram #6029

Merged
merged 1 commit into from Aug 15, 2015

@nayyarv
Contributor

nayyarv commented Jun 30, 2015

Hi,
Brought this up on the mailing lists, and got some support. (It then turned into a discussion of the p-Square algorithm for dynamic histogram generation). I also had some support when I originally brought it up with the matplotlib guys (where I had initially planned to put it), you can find the thread here

I've added support for methods to automatically choose the number of bins based on the data provided. The default signature remains the same; however, users can now pass 'auto' or other strings to have an optimal number of bins chosen. This is especially useful for visualisation libraries.

An (out of date) notebook with my first code attempts can be found here. It discusses the reasoning behind the methods + justification and samples. I've tried to make it slightly simpler since then.

The implementation defines the estimator functions within the histogram function, as this allows for easy refactoring, and I didn't want to put another 5 functions into the base library that have no use anywhere else. The code is hopefully easy to refactor/change to better fit numpy's style guide/organisation.

I've included the docstring update as well, with some notes on the different methods and a quick example of how to use the new call. I've tested the code and it seems to work as expected so far. Passing in bins='s' happens to choose the Sturges estimator (as opposed to the Scott estimator), but I can't say for sure that this behaviour is guaranteed.
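For illustration, the new call looks like this (a sketch using the merged API; the two-peak data mirrors the docstring example discussed later in this thread):

```python
import numpy as np

# two-peak sample, as in the docstring example
rng = np.random.RandomState(1)
data = np.hstack((rng.randn(1000), rng.randn(1000) * 2 + 5))

# let numpy choose the bin count; 'fd', 'scott', 'rice' and 'sturges'
# select a specific estimator instead
hist, edges = np.histogram(data, bins='auto')
print(len(hist), "bins chosen")
```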

@nayyarv nayyarv changed the title from Automatic number of bins for np.histogram to ENH: Automatic number of bins for np.histogram Jun 30, 2015

@@ -84,11 +84,15 @@ def histogram(a, bins=10, range=None, normed=False, weights=None,
----------
a : array_like
Input data. The histogram is computed over the flattened array.
bins : int or sequence of scalars, optional
bins : int or sequence of scalars, str, optional

@rgommers

rgommers Jun 30, 2015

Member

formatting nitpick: or str instead of , str

If `bins` is an int, it defines the number of equal-width
bins in the given range (10, by default). If `bins` is a sequence,
it defines the bin edges, including the rightmost edge, allowing
for non-uniform bin widths.
If `bins` is a string, it selects an automatic bin estimator, either
["fd", "scott", "sturges", "rice", "auto"]. If no match found,
it reverts to the auto method. For visualisation libraries, we suggest

@rgommers

rgommers Jun 30, 2015

Member

If no match is found then raising a ValueError is preferable. Most likely the user made a typo.

auto - is the maximum of the sturges or fd estimators.
A compromise to get good automated bin choices. For small datasets the sturges value will be
chosen, while larger datasets will default to FD. Avoids the overly conservative values of
FD when small and Sturges when large. Switchover point is usually x.size~1000.

@rgommers

rgommers Jun 30, 2015

Member

For this block of text, the first three lines belong in the Notes section I think, and the rest as a reST definition list (http://sphinx-doc.org/rest.html#lists-and-quote-like-blocks) in the description of the bins parameter.

return 1 #all else fails
def default(x):

@rgommers

rgommers Jun 30, 2015

Member

Better to name this function auto, because that's what the string is called and because it's not actually the default.

"""
iqr = np.subtract(*np.percentile(x, [75, 25]))
if iqr ==0: #unlikely

@rgommers

rgommers Jun 30, 2015

Member

PEP8 nitpick: space after ==

if iqr ==0: #unlikely
iqr = mad(x) # replace with something useful
h = (2 * iqr * x.size**(-1.0/3)) # binwidth

@rgommers

rgommers Jun 30, 2015

Member

This can go inside if iqr > 0:

if iqr > 0:
return np.ceil(x.ptp()/h)
return 1 #all else fails

@rgommers

rgommers Jun 30, 2015

Member

Changing to an if-elif-else block and removing the blank lines would be better here.

@rgommers

Member

rgommers commented Jun 30, 2015

histogram has turned into a very long function now, and the implemented estimator functions might be reusable from histogram2d and histogramdd. So I suggest to create a separate function _hist_estimator(x, bins) that parses the strings (input validation) and returns the number of bins.

break
else:
# Maybe raise a Warning or something? Printing to stderr for now
print("Automatic method '{}' not found, reverting to default".format(bins), file=sys.stderr)

@rgommers

rgommers Jun 30, 2015

Member

print statements should never be used. Raise ValueError here.

@rgommers

rgommers Jun 30, 2015

Member

Also, normally you check for valid input first. So:

if name.lower() not in optimalityMethods.keys():
    raise ValueError("%s is not a valid method for `bins`" % name)

bins = estimator(a)
"""
h = 3.5 * x.std() * x.size **(-1.0/3)
if h>0:

@rgommers

rgommers Jun 30, 2015

Member

PEP8 : h > 0

@rgommers

Member

rgommers commented Jun 30, 2015

As a bonus, it would be nice if the docstring had an example that used 'auto' and actually created a plot with it. There are examples for other function that use matplotlib for plots that you could copy.

@rgommers

Member

rgommers commented Jun 30, 2015

Some style comments, but overall +1 on this addition.

# measure of variability - Mean Absolute Deviation (mad)
iqr = mad(x)
if iqr > 0:

@nayyarv

nayyarv Jun 30, 2015

Contributor

Can't really do an if-elif block. I replace the value of iqr with mad if the iqr is 0. If the mad is 0 too, then I return 1 as the number of bins, otherwise I calculate the FD estimator.
I did move the return 1 into the if iqr block though

@nayyarv

Contributor

nayyarv commented Jul 3, 2015

Thanks for the feedback, I've put together the commits that should have addressed the issues.

As it currently stands, _hist_bandwidth_estimator should not be used for the 2d or dd histograms; the estimators' binwidth suggestions depend on the number of dimensions (there is an implicit d=1). I have limited knowledge of higher-dimensional optimal bin measures, so I can't go beyond that just yet.

Some questions/thoughts

  1. Should I switch to a bunch of if-elif statements inside _hist_bandwidth_estimator as opposed to the current nested functions?
  2. Would it be better to put the information on the estimators in the notes and provide a link to the Notes heading? As it is I think the parameters section has become too verbose, and beyond the scope of the section.
@jaimefrio

Member

jaimefrio commented Jul 3, 2015

Shouldn't bandwith be replaced throughout with binwidth?

@nayyarv

Contributor

nayyarv commented Jul 13, 2015

Changed all mentions of bandwidth to either binwidth or optimal_numbins to better reflect the nature of the function.

I've included a picture here of what the example in the docstring produces, so it doesn't have to be left to the imagination.

[image: histogram plot]

@seberg

Member

seberg commented Jul 13, 2015

Good to see this converge, but unfortunately we will need tests before we can merge anything.

The tests should go into numpy/numpy/lib/tests/test_function_base.py.

@homu

Contributor

homu commented Jul 23, 2015

☔️ The latest upstream changes (presumably #6100) made this pull request unmergeable. Please resolve the merge conflicts.

@nayyarv nayyarv closed this Jul 23, 2015

@nayyarv nayyarv reopened this Jul 23, 2015

@nayyarv

Contributor

nayyarv commented Jul 23, 2015

I've dealt with the upstream changes and included a bunch of unit tests that test basic functionality of the new automated methods.
Furthermore I squashed all the past commits into a single commit so it's not all over the place like before.

@homu

Contributor

homu commented Jul 26, 2015

☔️ The latest upstream changes (presumably #6115) made this pull request unmergeable. Please resolve the merge conflicts.

@tacaswell

Contributor

tacaswell commented Jul 26, 2015

Please ping me when this gets merged so the matplotlib docs can be updated.

@nayyarv

Contributor

nayyarv commented Jul 26, 2015

Alright, up to date with master again, in a single commit (my git skills are getting a good workout)
As to the tests included:

  1. Check the methods can handle empty arrays
  2. Check that strings passed that aren't a provided method raise a correct ValueError.
  3. Basic check that the values make sense on some pre-checked data of sizes 50, 500 and 5000.
  4. Check that no variance in the data results in expected values for the estimators (especially for scott and fd)
  5. Check the performance of the Scott and FD estimators in the presence of outliers.
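A rough sketch of what a couple of those checks might look like (hypothetical and simplified; the actual tests live in numpy/lib/tests/test_function_base.py):

```python
import numpy as np

def test_invalid_estimator_name():
    # an unknown string should raise, not silently fall back
    try:
        np.histogram(np.arange(10), bins='not_a_method')
        assert False, "expected ValueError"
    except ValueError:
        pass

def test_constant_data():
    # zero-variance data should still produce a valid (single-bin) histogram
    hist, edges = np.histogram(np.ones(100), bins='auto')
    assert hist.sum() == 100

test_invalid_estimator_name()
test_constant_data()
```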
@homu

Contributor

homu commented Jul 28, 2015

☔️ The latest upstream changes (presumably #6126) made this pull request unmergeable. Please resolve the merge conflicts.

@nayyarv

Contributor

nayyarv commented Jul 29, 2015

Thanks @homu, appreciate the updates and you pointing out which PR merge is breaking my PR. Merged upstream changes again.

@njsmith

Member

njsmith commented Jul 29, 2015

@homu is an automated bot :-) pretty handy though!

Automated Bin Selection Methods example, using 2 peak random data with 2000 points
>>> from matplotlib import pyplot as plt
>>> a = np.hstack((np.random.randn(1000), np.random.randn(1000) * 2 + 5))

@shoyer

shoyer Aug 13, 2015

Member

Use a np.random.RandomState instance to make documentation builds deterministic.

Also -- do we actually generate plots from docstrings in the numpy documentation? It's worth checking on this.

@nayyarv

Contributor

nayyarv commented Aug 14, 2015

Thanks @shoyer, I took your suggestions on board for the latest version of my PR. Other things that I did in this commit

  1. The parameter section in the docstring was getting too large, so instead I moved the equations and explanations to the Notes while leaving a brief one-liner in parameters for users.
  2. The example section now uses plt.hist, since that method doesn't actually check its input; it simply passes things on to np.histogram, which makes things a little simpler. I've also used a RandomState to make the plot deterministic, as suggested.
  3. Added some extra information into the assert error messages, as my test setup made it difficult to work out which particular method was failing. Also added in a test to check behaviour with small values since FD uses percentile, which has limited meaning when there is only 1 datapoint.

I also made sure to generate the docfiles to check it looks ok, I've included screenshots here of the relevant sections, as generated on my computer

Parameters

[screenshot of the rendered Parameters section]

Notes

[screenshot of the rendered Notes section]

Examples

[screenshot of the rendered Examples section]

@ewmoore

Contributor

ewmoore commented Aug 14, 2015

I'm not sure we should be counting on plt.hist to do that. Is that documented somewhere? Also I think the example for a function should probably call that function.


@tacaswell

Contributor

tacaswell commented Aug 14, 2015

@ewmoore From the mpl side I am counting on plt.hist to behave that way. We do not do any validation on the bins input and just pass it through to np.histogram. Once this is merged and released this behavior will be documented on our side.

Part of the history of this is that @nayyarv put a patch into mpl to add the auto logic and I suggested he try putting it in numpy instead 😉 .

@seberg

Member

seberg commented Aug 14, 2015

Well, I guess you could argue that the docs are a bit backward, but we have plots in examples, and frankly showing some plotting might be nice for the user and does give the right idea. I would be fine with additionally adding the same/similar example without plotting.
The plot example also still has two PEP8 style issues (spaces around = in a function call, and two spaces before #).

In any case, since @shoyer has had a close look now too, I am willing to put it in as is.

@nayyarv, however, there are two more real things left, sorry :(. Could you add the .. versionadded:: 1.11.0 tag [1]? np.take has an example; the empty lines are important.
Could you also add it to the release notes (they are in numpy/doc/release)? I think it is a neat new feature that we should mention in the improvements section.

[1] or do we want to squeeze it into 1.10? tend to think rather not just out of principle, plus I somewhat hope 1.11 will be a fast one

@tacaswell

Contributor

tacaswell commented Aug 14, 2015

We are all friends here 😄 .

@nayyarv nayyarv closed this Aug 14, 2015

@seberg

Member

seberg commented Aug 14, 2015

Uh, what happened? I hope this still exists?

@nayyarv

Contributor

nayyarv commented Aug 14, 2015

Sorry, git mistake, I'm trying to squash the changes into a commit and it ended up going south.

While I have you: by "something additional", did you mean something like this

>>> heights, edges = np.histogram(a, bins='auto')
>>> heights
array([  5,  34, 135, 274, 333, 195,  94,  82,  98, 145, 143, 157, 121, 89,  50,  24,  16,   5])
>>> len(heights)
18

followed by the plot?

@seberg

Member

seberg commented Aug 14, 2015

Yeah, but I don't care much frankly, just pondering. Would seem fine, just leaving it is also fine. Or you could give the length of the heights returned, that shows quite obviously that the bin number was chosen somehow.

@nayyarv nayyarv reopened this Aug 14, 2015

ENH: Adding in automatic number of bins estimation for np.histogram. …
…Users can now pass in bins='auto' (or 'scott', 'fd', 'rice', 'sturges') and the corresponding rule-of-thumb estimator will provide a decent estimate of the optimal number of bins for the given data.
@nayyarv

Contributor

nayyarv commented Aug 14, 2015

Alright, all put together now and squashed into 1 commit. I didn't put in the extra example, I think the plot covers the idea that the number of bins have been chosen automatically.

I'd be keen to squeeze this into 1.10 (considering how often I use histograms), but I'm not too fussy.

@njsmith

Member

njsmith commented Aug 14, 2015

Let's stick with the process and keep the 1.10 branch to bug fixes only. We have to make a cutoff somewhere, and it will always be tempting to try and sneak just one more thing past it; people have already started QA'ing the beta... 1.11 will be along soon enough.

@@ -140,6 +252,48 @@ def histogram(a, bins=10, range=None, normed=False, weights=None,
second ``[2, 3)``. The last bin, however, is ``[3, 4]``, which *includes*
4.
.. versionadded:: 1.11.0

@seberg

seberg Aug 14, 2015

Member

I would say putting it only with the keyword is sufficient (but frankly we have reached bikeshedding). The other option we use is putting it only in the Notes section. So I will put this in soon (unless someone beats me to it). Let's not start breaking rules again, and rather hope that 1.11 is soon. I feel there may be some nice things coming up soon enough ;).

@nayyarv

Contributor

nayyarv commented Aug 15, 2015

No worries, @seberg, I'll wait for 1.11 then.

I didn't want the parameter section to be drowned out by discussion of the automatic methods (which is what happened originally) but wanted users to instantly know what the differences were without jumping to notes. This seemed like a decent compromise. I'm happy to let the docs be updated at a later date when someone has a better way of going about things.

@seberg

Member

seberg commented Aug 15, 2015

OK, putting this in. If someone still finds some minor doc things or such, just open a new pull request. Thanks a lot @nayyarv for your patience, and to everyone who helped review!

seberg added a commit that referenced this pull request Aug 15, 2015

Merge pull request #6029 from nayyarv/master
ENH: Automatic number of bins for np.histogram

@seberg seberg merged commit 6e8b869 into numpy:master Aug 15, 2015

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
@tacaswell

Contributor

tacaswell commented Aug 16, 2015

From the mpl side, I would like to see this in 1.10 as it is something that we get asked for often, but I understand holding the line on feature freeze (even if I am really bad at it).


@nayyarv

Contributor

nayyarv commented Aug 16, 2015

Having had a look at the places this PR/original mpl issue have been referenced, I have just realised the optimal bin methods don't consider weights or range. For example, if there were 1000 samples in total but only 100 inside range, the methods all assume the full data size and overestimate the number of bins.

Similarly with weighting: if there were 50 values with weight 20 each, a.size will be 50 instead of the actual 1000 the methods should be using.

I've gone through some quick-fixes and updates here: master...nayyarv:autobinsWeighted
though it's made things a little more confusing since the functions are no longer stand alone. I need to simplify things a bit and add testing before I make a PR - this is a heads up.

An aside,
np.percentile doesn't support weights, and neither does np.partition, which means making an O(n) np.percentile that supports weights requires a lot of work. I've written a simple one that uses np.sort and np.searchsorted, so it's about O(n log n) time and matches the output of np.percentile(np.repeat(a, weights), q) exactly, without the memory overheads (though it could be quicker for small weights). It still needs testing, though randomised testing hasn't thrown any errors yet.

What would be the best solution to this?

  1. Include my O(n log n) function and wait for np.percentile to support weights natively?
  2. Use np.percentile(np.repeat(a, weights), q) and put warnings up?
  3. Disallow FD for weighted data, and default to Scott (since the weighted standard deviation is easier to calculate)?
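For reference, a sketch of the sort + searchsorted approach described above; weighted_percentile is a hypothetical helper (not part of this PR) intended to match np.percentile(np.repeat(a, weights), q) for non-negative integer weights:

```python
import numpy as np

def weighted_percentile(a, q, weights):
    """Percentile of `a` under integer frequency `weights`, matching
    np.percentile(np.repeat(a, weights), q) without materialising the
    repeated array (O(n log n): sort + searchsorted)."""
    a = np.asarray(a, dtype=float)
    weights = np.asarray(weights)
    order = np.argsort(a)
    a, weights = a[order], weights[order]
    cum = np.cumsum(weights)      # cum[j]: count of virtual samples <= a[j]
    n = cum[-1]                   # effective sample size
    pos = q / 100.0 * (n - 1)     # fractional index, linear interpolation
    lo, hi = int(np.floor(pos)), int(np.ceil(pos))
    frac = pos - lo
    # the value at virtual index k is a[searchsorted(cum, k, side='right')]
    v_lo = a[np.searchsorted(cum, lo, side='right')]
    v_hi = a[np.searchsorted(cum, hi, side='right')]
    return v_lo + frac * (v_hi - v_lo)
```

This reproduces the examples discussed below: for x = [2, 3] with weights [1, 2] the median is 3, and with weights [1, 1] it is 2.5.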
@seberg

Member

seberg commented Aug 16, 2015

First of all, is it really correct that if you have a weight of 20 each you would multiply the number of samples by 20? Number 3 might be an option, though I dislike that weights=np.ones(...) would not give the same result.

@njsmith

Member

njsmith commented Aug 16, 2015

Are weights required to be integers, as the repeat(...) implementation assumes?

@shoyer

Member

shoyer commented Aug 16, 2015

In my experience, fractional weights are also somewhat common. They are useful for modeling discrete probability distributions.


@nayyarv

Contributor

nayyarv commented Aug 17, 2015

Well, the effective sample size is the sum of the weights, but this assumes the weights are counts, i.e. repeat(a, weights) would return the full sample. So in this case, 50 samples with a weight of 20 each do have an effective sample size of 1000 (and the difference between n^(1/3) being ~4 or 10).

np.histogram as a function doesn't care what the weights are - it simply adds the appropriate weight value into the bin, which allows for things like negative, complex and fractional weights.

In terms of the estimators, they need to account for variability and size. When you have data like [2,3,4] and weights of [4,1,2], you have an effective sample size of 7, not 3, and should calculate accordingly. Weights of [0.3, 0.4, 0.3], [1+i, 2+3i, 4-i], or [-5, 1, 3] don't really tell us about effective sample size, and I'm not sure what to do with something like [30.2, 29.8, 22.34].

Furthermore, using the weighted standard deviation or weighted percentiles also assumes the weights represent counts/frequencies of some kind.
For sd, having a negative weight is identical to having the positive weight and the additive inverse of the x-value. Having a complex weight sends histogram into complex space for the variability calculation. Fractional (0.3) or decimal (30.2) weights shouldn't be a problem for the weighted sd, as long as they're divided by the sum of the weights.
Similarly for percentile, it's a positional question. If the weights are negative, what does that mean when calculating a percentile: for x = [2,3] and weights [1, -2], what's the median? Even with decimal weights = [1.2, 1.8], what's the median? It corresponds to a cdf at count 1.5, so is it 3, or is it some linear interpolation of 2 and 3? If the weights were [1,1] the median would be 2.5, and if the weights are [1,2] then the median is clearly 3. Complex numbers don't help at all.

Since non-whole weights make no sense in the estimators, maybe restrict the optimal-bins calculation to weights that are positive integers (which should cover a large majority of use cases)?
Or maybe ignore non-whole weights and simply work with a and range, though in use cases like this the number of bins/bin edges is likely predetermined.
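A quick numerical illustration of that effect, using made-up data matching the 50-values-with-weight-20 example above:

```python
import numpy as np

# 50 distinct values, each observed 20 times (frequency weights)
rng = np.random.RandomState(42)
a = rng.randn(50)
weights = np.full(50, 20)

n_naive = a.size          # 50   -> 50**(1/3)   ~ 3.7
n_eff = weights.sum()     # 1000 -> 1000**(1/3) = 10.0

# Freedman-Diaconis binwidth under each notion of sample size
iqr = np.subtract(*np.percentile(a, [75, 25]))
h_naive = 2 * iqr * n_naive ** (-1.0 / 3)
h_eff = 2 * iqr * n_eff ** (-1.0 / 3)

# ignoring the weights makes the bins (1000/50)**(1/3) ~ 2.7x too wide,
# i.e. the bin count is underestimated by the same factor
print(h_naive / h_eff)
```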

@njsmith

Member

njsmith commented Aug 17, 2015

It wouldn't be the end of the world if bins="auto" gave an error for some otherwise valid weights= arguments. But handling fractional weights at least would be nice if there's a reasonable way; they are reasonably common. (And we should have some reasonable, tested behavior for all cases, even if that is just returning a nice error.)

@nayyarv

Contributor

nayyarv commented Aug 17, 2015

Fair enough. I can deal with fractional weighting by using a.size instead of weights.sum() for the sample size, and raise errors for unsupported weight types.

Though since np.percentile doesn't support weights, auto should only call FD for unweighted data, and Scott for weighted data (though like seberg, I'm not overly happy with this; maybe a quick check that np.all(weights == weights[0]) somewhere at the start?)

@seberg

Member

seberg commented Aug 17, 2015

Indeed, it is not like "auto" is a default, it is just the default suggested one (without bins). I do not like the explosion of combinations/complexity, and I definitely do not want to guess what type of weights we have based on the data type!

Either we ignore the problem and document that weights are ignored for these (which is a bit dubious), or we just throw an error.

If someone needs more, I think the only option would be to expose the estimator functions and add different types of weights to them (i.e. aweights/fweights), and that might be more the type of thing for a statistics package to handle.
Trying to fit that into histogram itself seems too much complexity for a function that does not need to care about where the weights came from.

Or we just pull them out of histogram completely again, expose the (more complicated) estimators somewhere and have matplotlib add the bins="method" syntactic sugar.
