Variance is not calculated correctly in some cases + inconsistent definition #10242

Closed
nleehone opened this issue May 30, 2015 · 14 comments
Labels: Bug, Docs, Numeric Operations (Arithmetic, Comparison, and Logical operations)

@nleehone

The var method behaves inconsistently when called on a Series versus on the numpy array returned by its values attribute:

import pandas as pd

dat = pd.DataFrame({'x': [1,2,3,4,0,0,0]})
print(dat['x'].values.var())
print(dat['x'].var())
Prints:
2.24489795918
2.61904761905

This is due to numpy using ddof=0 as its default (the biased variance), whereas pandas calculates the unbiased variance (ddof=1) by default. The two should be consistent, so either pandas or numpy should adapt (maybe numpy, since the unbiased estimate is arguably the better default?).
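For illustration, being explicit about ddof makes the two agree:

dat['x'].values.var(ddof=1)  # unbiased; matches pandas' default
dat['x'].var(ddof=0)         # biased; matches numpy's default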

The other problem is that pandas does not calculate the variance of this DataFrame correctly.
The inconsistent definition aside, the variance should clearly not be zero when calculated by pandas.

dat = pd.DataFrame({'x': [9.0692124699,9.0692124692,9.0692124702,9.0692124686,
                          9.0692124687,9.0692124707,9.0692124679,9.0692124685,
                          9.0692124698,9.0692124719,9.0692124698,9.0692124692,
                          9.0692124689,9.0692124673,9.0692124707,9.0692124714,
                          9.0692124714,9.0692124734,9.0692124719,9.0692124710,
                          9.0692124694,9.0692124705,9.0692124713,9.0692124717
]})
print(dat['x'].values.var())
print(dat['x'].var())
Prints:
2.06817742558e-18
0.0

Here is the system information:
INSTALLED VERSIONS:
commit: None
python: 3.4.0.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-52-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8

pandas: 0.16.1
numpy: 1.9.2

@sinhrks added the Docs and Numeric Operations labels May 30, 2015
@sinhrks (Member) commented May 30, 2015

Thanks for the report. The difference is caused by the default of the ddof option, which specifies whether the sample (unbiased) variance is computed.

# numpy
dat['x'].values.var()
# 2.2448979591836733
dat['x'].values.var(ddof=1)
# 2.6190476190476186

# pandas
dat['x'].var()
# 2.6190476190476191
dat['x'].var(ddof=0)
# 2.2448979591836733

It is documented in the API reference, but easy to miss; it should be listed in the parameters section. A PR is appreciated.

@sinhrks (Member) commented May 30, 2015

Ah, I missed the latter part.

I think it should be as precise as numpy. Marked as a bug as well.

@sinhrks added the Bug label May 30, 2015
@nleehone (Author)

Yeah, the latter part is the more concerning one.

The first was more of a remark, but ddof should definitely be added to the parameters list. It might also be a good idea to state explicitly in the docs that numpy assumes a different ddof, and perhaps to add the example I gave as a caution to users.

@nleehone (Author)

I tried to trace this back and found the following:

import numpy as np
import pandas as pd

dat = pd.DataFrame({'x': [9.0692124699,9.0692124692,9.0692124702,9.0692124686,
                          9.0692124687,9.0692124707,9.0692124679,9.0692124685,
                          9.0692124698,9.0692124719,9.0692124698,9.0692124692,
                          9.0692124689,9.0692124673,9.0692124707,9.0692124714,
                          9.0692124714,9.0692124734,9.0692124719,9.0692124710,
                          9.0692124694,9.0692124705,9.0692124713,9.0692124717]})

values = dat['x'].values
# Formula used internally by pandas to calculate the variance
# (ignoring the final division by N - 1):
print(np.sum(values**2) - np.sum(values)**2 / len(values))
# Standard two-pass variance formula (again ignoring the final division by N - 1):
print(np.sum((values - np.mean(values))**2))
Returns:
0.0
4.96362582139e-17

As far as I can tell these should be exactly equivalent, but for some reason they are not.
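(Possibly floating-point cancellation? The two terms being subtracted are both ~1974 in size, where one ulp is about 2e-13, while the true difference is ~5e-17. A minimal illustration with made-up numbers:)

import numpy as np

x = np.array([1e8, 1e8 + 1, 1e8 + 2])  # large mean, tiny spread

# Naive formula: both terms are ~3e16, where one ulp is ~4,
# so the true difference (2.0) is lost to rounding:
print(np.sum(x**2) - np.sum(x)**2 / len(x))

# Two-pass formula works on the small deviations directly:
print(np.sum((x - np.mean(x))**2))  # 2.0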

@jreback (Contributor) commented May 31, 2015

Do you have bottleneck installed?

@sinhrks (Member) commented May 31, 2015

@jreback It works fine when bottleneck is installed in my environment.

# with bottleneck 1.0.0
dat['x'].var()
# 2.158098183211887e-18
dat['x'].var(ddof=0)
# 2.0681774255780584e-18

As @NickLH pointed out, the issue looks to be caused by nanops._nanvar, which drops small decimals during the squared sum. We should check that there are no side effects, though...

# after fixing squared-sum to calculate mean first

# numpy
dat['x'].values.var()
# 2.0681774255780584e-18
dat['x'].values.var(ddof=1)
# 2.158098183211887e-18

dat['x'].var()
# 2.158098183211887e-18
dat['x'].var(ddof=0)
# 2.0681774255780584e-18

@nleehone (Author)

No bottleneck... here's the full output of my system

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.0.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-52-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8

pandas: 0.16.1
nose: 1.3.1
Cython: None
numpy: 1.9.2
scipy: 0.15.1
statsmodels: None
IPython: 1.2.1
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: 0.8
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None

I think it's a problem with the algorithm being used. According to this Wikipedia page, http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance,
the algorithm in use is the naive one, which has problems when the values are similar to each other.

Quote:
Because SumSq and (Sum×Sum)⁄n can be very similar numbers, cancellation can lead to the precision of the result to be much less than the inherent precision of the floating-point arithmetic used to perform the computation. Thus this algorithm should not be used in practice.[1][2] This is particularly bad if the standard deviation is small relative to the mean. However, the algorithm can be improved by adopting the method of the assumed mean.
End Quote

The method I used that seems to work is the two-pass algorithm mentioned in the wiki article, which is apparently numerically stable except in the case of large n. This is also the algorithm that numpy uses under the hood. See: https://github.com/numpy/numpy/blob/4cbced22a8306d7214a4e18d12e651c034143922/numpy/core/_methods.py (look for '_var').
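Stripped down, the two-pass algorithm is just the following (a sketch with a hypothetical helper name, not numpy's actual _var):

import numpy as np

def two_pass_variance(values, ddof=1):
    # Pass 1: compute the mean.
    mean = np.mean(values)
    # Pass 2: sum the squared deviations from that mean.
    return np.sum((values - mean) ** 2) / (len(values) - ddof)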

The good formula (again according to the wiki), which is numerically stable even for large n, is:

def compensated_variance(data):
    # First pass: compute the mean.
    n = 0
    sum1 = 0
    for x in data:
        n = n + 1
        sum1 = sum1 + x
    mean = sum1/n

    # Second pass: sum of squared deviations, plus a compensation
    # term (sum3) that corrects for rounding error in the mean.
    sum2 = 0
    sum3 = 0
    for x in data:
        sum2 = sum2 + (x - mean)**2
        sum3 = sum3 + (x - mean)
    variance = (sum2 - sum3**2/n)/(n - 1)
    return variance

print('Normal:', dat['x'].values.var(ddof=1))
print('Compensated:', compensated_variance(dat['x']))
Returns:
Normal: 2.158098183211887e-18
Compensated: 2.1580981832106009e-18

However, the 'correct' value (calculated with Mathematica) is:
2.158098183211887e-18
So the compensated version sacrifices some numerical accuracy to get better stability with large values of n. I do not think we should implement it this way.

@jreback (Contributor) commented May 31, 2015

Look at how roll_var was implemented; it uses Welford's method. IIRC there is a comment that var should be changed.

@nleehone (Author)

Implementing it with Welford's method (or the modified version from Knuth that is on the wiki page) would probably slow down the computation, but it is probably worth the small performance drop, since we should be trying to get rid of round-off errors like the ones I found.
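For reference, a minimal sketch of Welford's method in plain Python (welford_variance is a hypothetical helper, not the pandas/Cython implementation):

def welford_variance(data, ddof=1):
    # Single pass: maintain a running mean and a running sum of
    # squared deviations (m2), both updated stably at each step.
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (n - ddof)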

I'm quite busy at the moment, so I cannot work on a patch, but if nobody has taken this on by the time my workload reduces, I'd be happy to contribute.

In the meantime, I suggest that people just use the numpy version:

dat['x'].values.var(ddof=1)

That's what I'll be doing at least...

@jreback added this to the Next Major Release milestone Jun 1, 2015
@jreback (Contributor) commented Jun 1, 2015

This would have to be benchmarked, but it is very much an edge case (differences of small numbers). As far as ddof is concerned, you can simply be explicit.

@jvkersch

I took the liberty of opening a small PR to implement Welford's method in group_var_float64 (which computes the variance after a groupby, and currently implements the sum-of-squares algorithm). This seems to work quite well, but I wonder whether the right way to go is to fix _nanvar in a similar way, or to have the two methods share the same implementation. I haven't looked at _nanvar in detail, so I don't know if that's feasible.
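Roughly, the change keeps one Welford accumulator per group. A hypothetical pure-Python sketch of the idea (the real group_var_float64 is a Cython routine):

import numpy as np

def grouped_welford_var(values, labels, ngroups, ddof=1):
    # labels[i] is the group index of values[i]; each group gets its
    # own running count, mean, and sum of squared deviations (m2).
    count = np.zeros(ngroups)
    mean = np.zeros(ngroups)
    m2 = np.zeros(ngroups)
    for val, g in zip(values, labels):
        count[g] += 1
        delta = val - mean[g]
        mean[g] += delta / count[g]
        m2[g] += delta * (val - mean[g])
    # Note: groups with count <= ddof would need special handling.
    return m2 / (count - ddof)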

@jreback (Contributor) commented Jun 29, 2015

@jvkersch

_nanvar could also use attention; however, its implementation doesn't involve a group indexer, so it can be a bit simpler (it is implemented as a vectorized method).

@jreback modified the milestones: 0.17.0, Next Major Release Jun 30, 2015
@jreback modified the milestones: Next Major Release, 0.17.0 Aug 15, 2015
@BrazilForever11

std has the same issue

@jreback modified the milestones: 0.17.0, Next Major Release Sep 2, 2015
@jreback (Contributor) commented Sep 4, 2015

closed by #10679

@jreback closed this as completed Sep 4, 2015