Variance is not calculated correctly in some cases + inconsistent definition #10242

Closed
nleehone opened this issue May 30, 2015 · 14 comments
Labels: Bug, Docs, Numeric Operations (Arithmetic, Comparison, and Logical operations)

@nleehone

The var method behaves inconsistently when called on a Series versus on the numpy array returned by its values attribute:

import pandas as pd

dat = pd.DataFrame({'x': [1,2,3,4,0,0,0]})
print(dat['x'].values.var())
print(dat['x'].var())
Prints:
2.24489795918
2.61904761905

This is due to numpy using ddof=0 as its default (the biased variance), whereas pandas calculates the unbiased variance (ddof=1) by default. The two should be consistent, so either pandas or numpy should adapt (maybe numpy, since the unbiased estimate is arguably the better default?).
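For illustration, being explicit about ddof makes the two agree:

dat['x'].values.var(ddof=1)  # unbiased; matches pandas' default
dat['x'].var(ddof=0)         # biased; matches numpy's default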

The other problem is that pandas does not calculate the variance of this DataFrame correctly.
The inconsistent definition aside, the variance should clearly not be zero when calculated by pandas.

dat = pd.DataFrame({'x': [9.0692124699,9.0692124692,9.0692124702,9.0692124686,
                          9.0692124687,9.0692124707,9.0692124679,9.0692124685,
                          9.0692124698,9.0692124719,9.0692124698,9.0692124692,
                          9.0692124689,9.0692124673,9.0692124707,9.0692124714,
                          9.0692124714,9.0692124734,9.0692124719,9.0692124710,
                          9.0692124694,9.0692124705,9.0692124713,9.0692124717
]})
print(dat['x'].values.var())
print(dat['x'].var())
Prints:
2.06817742558e-18
0.0

Here is the system information:
INSTALLED VERSIONS:
commit: None
python: 3.4.0.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-52-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8

pandas: 0.16.1
numpy: 1.9.2

@sinhrks added the Docs and Numeric Operations labels May 30, 2015
@sinhrks (Member) commented May 30, 2015

Thanks for the report. The difference is caused by the default of the ddof option, which specifies whether the sample (unbiased) variance is computed.

# numpy
dat['x'].values.var()
# 2.2448979591836733
dat['x'].values.var(ddof=1)
# 2.6190476190476186

# pandas
dat['x'].var()
# 2.6190476190476191
dat['x'].var(ddof=0)
# 2.2448979591836733

It is documented in the API reference, but easy to miss; it should be listed in the parameters section. A PR is appreciated.

@sinhrks (Member) commented May 30, 2015

Ah, I missed the latter part.

I think it should be as precise as numpy. Marked as a bug as well.

@sinhrks added the Bug label May 30, 2015
@nleehone (Author)

Yeah, the latter part is the more concerning one.

The first was more of a remark, but ddof should definitely be added to the parameters list. It might also be a good idea to state explicitly in the docs that numpy assumes a different ddof, and perhaps to add the example I gave as a caution to users.

@nleehone (Author)

I tried to trace this back and found the following:

import numpy as np
import pandas as pd

dat = pd.DataFrame({'x': [9.0692124699,9.0692124692,9.0692124702,9.0692124686,
                          9.0692124687,9.0692124707,9.0692124679,9.0692124685,
                          9.0692124698,9.0692124719,9.0692124698,9.0692124692,
                          9.0692124689,9.0692124673,9.0692124707,9.0692124714,
                          9.0692124714,9.0692124734,9.0692124719,9.0692124710,
                          9.0692124694,9.0692124705,9.0692124713,9.0692124717]})

values = dat['x'].values
# Formula used internally by pandas to calculate the variance
# (ignoring the final division by N - 1):
print(np.sum(values**2) - np.sum(values)**2 / len(values))
# Standard two-pass variance formula (again ignoring the final division by N - 1):
print(np.sum((values - np.mean(values))**2))
Returns:
0.0
4.96362582139e-17

As far as I can tell these should be exactly equivalent, but for some reason they are not.
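(Possibly floating-point cancellation? The two terms being subtracted are both ~1974 in size, where one ulp is about 2e-13, while the true difference is ~5e-17. A minimal illustration with made-up numbers:)

import numpy as np

x = np.array([1e8, 1e8 + 1, 1e8 + 2])  # large mean, tiny spread

# Naive formula: both terms are ~3e16, where one ulp is ~4,
# so the true difference (2.0) is lost to rounding:
print(np.sum(x**2) - np.sum(x)**2 / len(x))

# Two-pass formula works on the small deviations directly:
print(np.sum((x - np.mean(x))**2))  # 2.0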

@jreback (Contributor) commented May 31, 2015

Do you have bottleneck installed?

@sinhrks (Member) commented May 31, 2015

@jreback It works fine when bottleneck is installed in my environment.

# with bottleneck 1.0.0
dat['x'].var()
# 2.158098183211887e-18
dat['x'].var(ddof=0)
# 2.0681774255780584e-18

As @NickLH pointed out, the issue looks to be caused by nanops._nanvar, which drops small decimals during the squared sum. We should check that there are no side effects, though...

# after fixing squared-sum to calculate mean first

# numpy
dat['x'].values.var()
# 2.0681774255780584e-18
dat['x'].values.var(ddof=1)
# 2.158098183211887e-18

dat['x'].var()
# 2.158098183211887e-18
dat['x'].var(ddof=0)
# 2.0681774255780584e-18

@nleehone (Author)

No bottleneck... here's the full output of my system

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.0.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-52-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8

pandas: 0.16.1
nose: 1.3.1
Cython: None
numpy: 1.9.2
scipy: 0.15.1
statsmodels: None
IPython: 1.2.1
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: 0.8
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None

I think it's a problem with the algorithm being used. According to this Wikipedia page, http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance,
the algorithm in use is the naive one, which has problems when the values are similar to each other.

Quote:
Because SumSq and (Sum×Sum)⁄n can be very similar numbers, cancellation can lead to the precision of the result to be much less than the inherent precision of the floating-point arithmetic used to perform the computation. Thus this algorithm should not be used in practice.[1][2] This is particularly bad if the standard deviation is small relative to the mean. However, the algorithm can be improved by adopting the method of the assumed mean.
End Quote

The method I used that seems to work is the two-pass algorithm mentioned in the wiki article, which is apparently numerically stable except in the case of large n. This is also the algorithm that numpy uses under the hood. See: https://github.com/numpy/numpy/blob/4cbced22a8306d7214a4e18d12e651c034143922/numpy/core/_methods.py (look for '_var').
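Stripped down, the two-pass algorithm is just the following (a sketch with a hypothetical helper name, not numpy's actual _var):

import numpy as np

def two_pass_variance(values, ddof=1):
    # Pass 1: compute the mean.
    mean = np.mean(values)
    # Pass 2: sum the squared deviations from that mean.
    return np.sum((values - mean) ** 2) / (len(values) - ddof)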

The good formula (again according to the wiki), which is numerically stable even for large n, is:

def compensated_variance(data):
    # First pass: compute the mean.
    n = 0
    sum1 = 0
    for x in data:
        n = n + 1
        sum1 = sum1 + x
    mean = sum1/n

    # Second pass: sum of squared deviations, plus a compensation
    # term (sum3) that corrects for rounding error in the mean.
    sum2 = 0
    sum3 = 0
    for x in data:
        sum2 = sum2 + (x - mean)**2
        sum3 = sum3 + (x - mean)
    variance = (sum2 - sum3**2/n)/(n - 1)
    return variance

print('Normal:', dat['x'].values.var(ddof=1))
print('Compensated:', compensated_variance(dat['x']))
Returns:
Normal: 2.158098183211887e-18
Compensated: 2.1580981832106009e-18

However, the 'correct' value (calculated with Mathematica) is:
2.158098183211887e-18
So the compensated version sacrifices some numerical accuracy to get better stability with large values of n. I do not think we should implement it this way.

@jreback (Contributor) commented May 31, 2015

Look at how roll_var was implemented; it uses Welford's method. IIRC there is a comment that var should be changed.

@nleehone (Author)

Implementing it with Welford's method (or the modified version from Knuth that is on the wiki page) would probably slow down the computation, but it is probably worth the small performance drop, since we should be trying to get rid of round-off errors like the ones I found.
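For reference, a minimal sketch of Welford's method in plain Python (welford_variance is a hypothetical helper, not the pandas/Cython implementation):

def welford_variance(data, ddof=1):
    # Single pass: maintain a running mean and a running sum of
    # squared deviations (m2), both updated stably at each step.
    n = 0
    mean = 0.0
    m2 = 0.0
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return m2 / (n - ddof)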

I'm quite busy at the moment, so I cannot work on a patch, but if nobody has taken this on by the time my workload reduces, I'd be happy to contribute.

In the meantime, I suggest that people just use the numpy version:

dat['x'].values.var(ddof=1)

That's what I'll be doing at least...

@jreback added this to the Next Major Release milestone Jun 1, 2015
@jreback (Contributor) commented Jun 1, 2015

This would have to be benchmarked, but it is very much an edge case (differences of small numbers). As far as ddof is concerned, you can simply be explicit.

@jvkersch

I took the liberty of opening a small PR to implement Welford's method in group_var_float64 (which computes the variance after a groupby, and currently implements the sum-of-squares algorithm). This seems to work quite well, but I wonder whether the right way to go is to fix _nanvar in a similar way, or to have the two methods share the same implementation. I haven't looked at _nanvar in detail, so I don't know if that's feasible.
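Roughly, the change keeps one Welford accumulator per group. A hypothetical pure-Python sketch of the idea (the real group_var_float64 is a Cython routine):

import numpy as np

def grouped_welford_var(values, labels, ngroups, ddof=1):
    # labels[i] is the group index of values[i]; each group gets its
    # own running count, mean, and sum of squared deviations (m2).
    count = np.zeros(ngroups)
    mean = np.zeros(ngroups)
    m2 = np.zeros(ngroups)
    for val, g in zip(values, labels):
        count[g] += 1
        delta = val - mean[g]
        mean[g] += delta / count[g]
        m2[g] += delta * (val - mean[g])
    # Note: groups with count <= ddof would need special handling.
    return m2 / (count - ddof)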

@jreback (Contributor) commented Jun 29, 2015

@jvkersch

_nanvar could also use attention; however, its implementation doesn't involve a group indexer, so it can be a bit simpler (it is implemented as a vectorized method).

@jreback modified the milestones: 0.17.0, Next Major Release Jun 30, 2015
@jreback modified the milestones: Next Major Release, 0.17.0 Aug 15, 2015
@BrazilForever11

std has the same issue

@jreback modified the milestones: 0.17.0, Next Major Release Sep 2, 2015
@jreback (Contributor) commented Sep 4, 2015

closed by #10679

@jreback closed this as completed Sep 4, 2015