Variance is not calculated correctly in some cases + inconsistent definition #10242
sinhrks added the Docs and Numeric labels May 30, 2015
Thanks for the report. The difference is caused by the default value of ddof:

```python
# numpy
dat['x'].values.var()
# 2.2448979591836733
dat['x'].values.var(ddof=True)
# 2.6190476190476186

# pandas
dat['x'].var()
# 2.6190476190476191
dat['x'].var(ddof=False)
# 2.2448979591836733
```

It is documented in the API reference, but it is easy to miss; it should be listed in the Parameters section. A PR is appreciated.
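To make the two defaults concrete: numpy's `ddof=0` divides by `n` (the biased estimator), pandas' `ddof=1` divides by `n - 1` (the unbiased estimator), so the two results differ only by the factor `n / (n - 1)`. A minimal sketch with numpy only (the sample data here is illustrative, not from this issue):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
n = len(x)

biased = x.var()          # ddof=0, numpy's default: divide by n
unbiased = x.var(ddof=1)  # pandas' default: divide by n - 1

# The two estimates differ only by the scale factor n / (n - 1).
assert np.isclose(unbiased, biased * n / (n - 1))
print(biased, unbiased)  # 5.25 and 7.0 for this data
```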
Ah, I missed the latter part. I think it should be as precise as numpy. Marked as a bug as well.
sinhrks added the Bug label May 30, 2015
nleehone commented May 31, 2015
Yeah, the latter part is the more concerning. The first was more of a remark, but it should definitely be added to the parameters list. It might also be a good idea to state explicitly in the docs that numpy assumes a different ddof, and perhaps add the example I gave as a caution to users.
nleehone commented May 31, 2015
I'm trying to trace this back and found the following:

```python
dat = pd.DataFrame({'x': [9.0692124699, 9.0692124692, 9.0692124702, 9.0692124686,
                          9.0692124687, 9.0692124707, 9.0692124679, 9.0692124685,
                          9.0692124698, 9.0692124719, 9.0692124698, 9.0692124692,
                          9.0692124689, 9.0692124673, 9.0692124707, 9.0692124714,
                          9.0692124714, 9.0692124734, 9.0692124719, 9.0692124710,
                          9.0692124694, 9.0692124705, 9.0692124713, 9.0692124717]})
values = dat['x'].values

# Formula used internally by pandas to calculate the variance
# (ignoring the final division by N-1):
print(np.sum(values)**2 / len(values) - np.sum(values**2))

# Normal variance formula (once again ignoring the final division by N-1):
print(np.sum((values - np.mean(values))**2))
```

As far as I can tell these should be exactly equivalent (up to sign), but for some reason they are not.
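The two formulas are equivalent algebraically but not in floating point: when the data has a large mean and a tiny spread, `sum(x**2)` and `sum(x)**2 / n` are two huge, nearly equal numbers, and subtracting them cancels almost every significant digit. A sketch of the effect on constructed data (not the values from this issue):

```python
import numpy as np

# Large offset, tiny spread: the worst case for the one-pass formula.
x = 1e8 + np.array([0.0, 0.1, 0.2, 0.3])

# One-pass numerator: difference of two numbers of order 4e16.
naive = np.sum(x**2) - np.sum(x)**2 / len(x)

# Two-pass numerator: subtract the mean first, then square.
two_pass = np.sum((x - np.mean(x))**2)

# two_pass recovers the true value (~0.05); naive has lost
# every significant digit to cancellation.
print(naive, two_pass)
```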
Do you have bottleneck installed?
@jreback It works fine if bottleneck is installed:

```python
# with bottleneck 1.0.0
dat['x'].var()
# 2.158098183211887e-18
dat['x'].var(ddof=False)
# 2.0681774255780584e-18
```

As @NickLH pointed out, the issue looks to be caused by the squared-sum calculation:

```python
# after fixing squared-sum to calculate mean first
# numpy
dat['x'].values.var()
# 2.0681774255780584e-18
dat['x'].values.var(ddof=True)
# 2.158098183211887e-18

# pandas
dat['x'].var()
# 2.158098183211887e-18
dat['x'].var(ddof=False)
# 2.0681774255780584e-18
```
nleehone commented May 31, 2015
No bottleneck... here's the full output of my system:

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.0.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-52-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
pandas: 0.16.1
nose: 1.3.1
Cython: None
numpy: 1.9.2
scipy: 0.15.1
statsmodels: None
IPython: 1.2.1
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: 0.8
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None

I think it's a problem with the algorithm that's being used. See this page on the wiki: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

The method I used that seems to work is the two-pass algorithm mentioned in the wiki article, which is apparently numerically stable except in the case of large n. The good formula (again according to the wiki) that is numerically stable even for large n:

```python
def compensated_variance(data):
    # First pass: compute the mean.
    n = 0
    sum1 = 0
    for x in data:
        n = n + 1
        sum1 = sum1 + x
    mean = sum1 / n

    # Second pass: sum of squared deviations, plus a compensation
    # term (sum3) that corrects for round-off in (x - mean).
    sum2 = 0
    sum3 = 0
    for x in data:
        sum2 = sum2 + (x - mean)**2
        sum3 = sum3 + (x - mean)
    variance = (sum2 - sum3**2 / n) / (n - 1)
    return variance
```
```python
print(dat['x'].values.var(ddof=1))
print(compensated_variance(dat['x']))
```

However, the 'correct' value (calculated with Mathematica) is:
Look at how roll_var was implemented - it uses Welford's method - IIRC there is a comment that var should be changed.
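For reference, Welford's algorithm updates the mean and the sum of squared deviations in a single pass while staying numerically stable. A minimal standalone sketch (not pandas' actual `roll_var` implementation):

```python
def welford_variance(data, ddof=1):
    """Single-pass, numerically stable variance via Welford's algorithm."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # note: uses the updated mean
    return m2 / (n - ddof)

# Matches numpy's var(ddof=1) on the same data.
print(welford_variance([1.0, 2.0, 4.0, 7.0]))
```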
nleehone commented May 31, 2015
Implementing it with Welford's method (or the modified one from Knuth that is on that wiki page) would probably slow down the computation, but it is probably worth the small performance drop since we should be trying to get rid of weird round-off errors such as those I found. I'm quite busy at the moment, so I cannot work on a patch, but if nobody has taken this on by the time my workload reduces, I'd be happy to contribute. In the meantime, I suggest that people just use the numpy version: dat['x'].values.var(ddof=1). That's what I'll be doing, at least...
jreback added this to the Next Major Release milestone Jun 1, 2015
This would have to be benchmarked, but it is very much an edge case (a difference of small numbers). As far as the
This was referenced Jun 26, 2015
I took the liberty of opening a small PR to implement Welford's method in
jreback modified the milestone: 0.17.0, Next Major Release Jun 30, 2015
This was referenced Jul 2, 2015
jreback modified the milestone: Next Major Release, 0.17.0 Aug 15, 2015
BrazilForever11 commented Aug 17, 2015
std has the same issue
jreback modified the milestone: 0.17.0, Next Major Release Sep 2, 2015
jreback added a commit that referenced this issue Sep 4, 2015
jvkersch + jreback 2d0d58c
closed by #10679
jreback closed this Sep 4, 2015
nickeubank added a commit to nickeubank/pandas that referenced this issue Sep 29, 2015
jvkersch + nickeubank 5c942ea
nleehone commented May 30, 2015
The var method acts inconsistently when used on the Series vs the numpy array returned from values:
This is due to numpy using ddof=0 as its default (the biased variance), whereas pandas calculates the unbiased variance (ddof=1) by default. These two should be consistent, so either pandas should adapt or numpy should (maybe numpy, since we should calculate the unbiased variance by default?).
The other problem is that pandas does not calculate the variance of this DataFrame properly.
Inconsistent definition aside, the variance should clearly not be zero when calculated by pandas.
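The zero-variance symptom can be reproduced without pandas in the loop by applying the one-pass sum-of-squares formula directly to the reported data; this is a numpy-only illustration of the cancellation, not pandas' actual code:

```python
import numpy as np

# The 24 near-identical values from the report (spread ~6e-9 around 9.069...).
v = np.array([9.0692124699, 9.0692124692, 9.0692124702, 9.0692124686,
              9.0692124687, 9.0692124707, 9.0692124679, 9.0692124685,
              9.0692124698, 9.0692124719, 9.0692124698, 9.0692124692,
              9.0692124689, 9.0692124673, 9.0692124707, 9.0692124714,
              9.0692124714, 9.0692124734, 9.0692124719, 9.0692124710,
              9.0692124694, 9.0692124705, 9.0692124713, 9.0692124717])
n = len(v)

# One-pass formula: difference of two numbers near 1974 -- total cancellation.
naive = (np.sum(v**2) - np.sum(v)**2 / n) / (n - 1)

# Two-pass formula: stable, giving the ~2e-18 variance numpy reports.
two_pass = np.sum((v - v.mean())**2) / (n - 1)

print(naive, two_pass)
```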
Here is the system information:
INSTALLED VERSIONS:
commit: None
python: 3.4.0.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-52-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
pandas: 0.16.1
numpy: 1.9.2