Unbiased estimates of variance/ std deviation (Trac #502) #1100

numpy-gitbot opened this Issue Oct 19, 2012 · 8 comments



Original ticket http://projects.scipy.org/numpy/ticket/502 on 2007-04-18 by @pierregm, assigned to unknown.

Unbiased estimates of the variance and standard deviation are used far more often than their biased counterparts. Currently, the var/std methods/functions return biased estimates. An unbiased estimate of the variance can be obtained simply by multiplying the biased variance by n/float(n-1) (where n is the size of the array along a particular axis). This extra step (and the test on n it implies) quickly becomes tedious when used repeatedly.
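For illustration, the manual correction described above could look roughly like this. It is only a sketch: the sample data is made up, and returning NaN when n is 1 is an assumption about how the "test on n" might be handled, not something the ticket specifies.

```python
import numpy as np

# Made-up sample data, for illustration only.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = x.size

# Default behaviour: the sum of squared deviations is divided by n.
biased = x.var()

# Manual correction: rescale by n / (n - 1). The guard on n is the
# extra test mentioned above; NaN for n == 1 is an assumed convention.
if n > 1:
    unbiased = biased * n / float(n - 1)
else:
    unbiased = float("nan")

print(biased, unbiased)
```

Repeating this rescaling (and the guard) at every call site is the tedium the request is about.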

I suggest the introduction of 2 new methods (and the corresponding functions), varu and stdu, that would give direct access to the unbiased estimates.

Milestone changed to 1.1 by @alberts on 2007-05-12

@charris wrote on 2007-05-13

There have been several discussions of this, and, IIRC, the upshot was that the biased estimates were preferred for numerical reasons. People want the unbiased estimate often enough that I think we should add a flag in the call, something like bias={0,1}.

@rkern wrote on 2007-05-13

My preference is to not use the term "bias," since it's wrong. The standard definition of bias, E[X-Xtrue], is inappropriate for quantities like variance and standard deviation. Notably, the square root of the N-1, "unbiased" variance is not an unbiased estimate of the standard deviation. Using an appropriate definition of bias, E[log(X/Xtrue)], does give coherent estimates of variance and standard deviation. However, the factor is N-2, not N-1.
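The point that the square root of the N-1 variance is not an unbiased estimate of the standard deviation can be checked with a small simulation. This is a rough sketch under assumed conditions (standard normal data, samples of size 2, a fixed seed); it does not attempt to check the log-based definition or the N-2 factor mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2                                   # tiny samples make the effect obvious
samples = rng.normal(loc=0.0, scale=1.0, size=(200_000, n))  # true sigma = 1

# N-1 variance of each row, computed by hand.
dev = samples - samples.mean(axis=1, keepdims=True)
var_n1 = (dev ** 2).sum(axis=1) / (n - 1)
std_n1 = np.sqrt(var_n1)

mean_var = var_n1.mean()   # averages out close to the true variance, 1.0
mean_std = std_n1.mean()   # averages out clearly below the true sigma, 1.0

print(mean_var, mean_std)
```

For normal data with n = 2 the expected value of that square root is sqrt(2/pi), roughly 0.80, so the average standard deviation falls well short of 1 even though the average variance does not.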

My preference is to add a parameter that is subtracted from N rather than a flag that switches between N and N-1. The shortest name I can come up with is a bit obscure, though, "ddof" for "change in degrees of freedom".
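A minimal sketch of what such a parameter means, written by hand rather than taken from any patch: the sum of squared deviations is divided by N - ddof. The helper name var_with_ddof and the choice to raise when N - ddof is not positive are assumptions made for this illustration.

```python
import numpy as np

def var_with_ddof(x, ddof=0):
    """Variance of a 1-D sample with divisor N - ddof (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    if n - ddof <= 0:
        # Assumed behaviour for this sketch: refuse a non-positive divisor.
        raise ValueError("need more than ddof observations")
    dev = x - x.mean()
    return (dev ** 2).sum() / (n - ddof)

data = [1.0, 2.0, 4.0, 7.0]
print(var_with_ddof(data))          # divisor N
print(var_with_ddof(data, ddof=1))  # divisor N - 1
print(var_with_ddof(data, ddof=2))  # divisor N - 2
```

One parameter covers the N, N-1 and N-2 divisors discussed in this thread without committing to the word "bias".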

Attachment added by trac user zouave on 2007-07-12: diff-std

trac user zouave wrote on 2007-07-12

I've just posted a patch that adds a "ddof" parameter as described above (by rkern), as well as a "pop_mean" parameter that allows users to supply an out-of-sample population mean to be used instead of calculating the in-sample mean. It affects ndarrays, masked arrays and matrices; doesn't affect fromnumeric.py stuff. The appropriate on-line doc is also updated (to the best of my knowledge).

Any comments, questions, suggestions [or insults] are more than welcome. Honestly, my hope is to have this fixed by 1.0.4 (and that this release will come soon).

@mdehoon wrote on 2008-02-20

charris wrote:

something like bias={0,1}.

rkern wrote:

My preference is to not use the term "bias," since it's wrong.

One solution is to use ml={1,0}, with ml standing for maximum likelihood.
Dividing by N gives the maximum likelihood estimate, and this is true both for the variance and the standard deviation.

@teoliphant wrote on 2008-03-07

In r4853 the ddof feature was added to var and std. So, this ticket can be closed.
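For reference, usage of the ddof argument in current NumPy looks like the following. The array is made up for the example; ddof defaults to 0, which keeps the original divide-by-N behaviour.

```python
import numpy as np

# Made-up data: three observations (rows) of two variables (columns).
a = np.array([[1.0, 2.0],
              [3.0, 6.0],
              [5.0, 10.0]])

pop_var = np.var(a, axis=0)             # divisor N (ddof=0 is the default)
sample_var = np.var(a, axis=0, ddof=1)  # divisor N - 1
sample_std = a.std(axis=0, ddof=1)      # method form, also takes ddof

print(pop_var, sample_var, sample_std)
```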

Milestone changed to 1.0.5 by @stefanv on 2008-03-11
