Features/514 var std numpylike #515

ClaudiaComito · 2020-03-31T10:31:50Z

Description

Working on #490 led me to opening several cans of worms and resulted in a rather messy PR (#508). I'm going to withdraw #508 and submit the main changes separately.

With this PR I'm making ht.var() and ht.std() numpy-compliant.

Issue/s resolved: #514 and partly #490 (because the numpy default is biased variance (no Bessel correction), ht.var() of a tensor of 1 element now returns 0 instead of nan).

Changes proposed:

kwarg bessel has been replaced by ddof, can be 0 or 1;
Default is ddof=0 like for numpy (equivalent to unbiased=False for torch and bessel=False for the current heat implementation).
argument bessel (bool) can still be passed so scripts shouldn't break, but remember the default is now no Bessel correction

Type of change

Remove irrelevant options:

Breaking change (fix or feature that would cause existing functionality to not work as expected)

Due Diligence

All split configurations tested
Multiple dtypes tested in relevant functions
Documentation updated (if needed)
Updated changelog.md under the title "Pending Additions"

Does this change modify the behaviour of other functions? If so, which?

functions that call var() or std() relying on unbiased variance will yield the wrong result (default is now biased variance, ddof=0).

…mpleteness

codecov · 2020-03-31T10:38:13Z

Codecov Report

Merging #515 into master will increase coverage by 0.01%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #515      +/-   ##
==========================================
+ Coverage   96.60%   96.61%   +0.01%     
==========================================
  Files          65       65              
  Lines       13991    14008      +17     
==========================================
+ Hits        13516    13534      +18     
+ Misses        475      474       -1

Impacted Files	Coverage Δ
heat/core/tests/test_indexing.py	`92.75% <ø> (-0.11%)`	⬇️
heat/core/dndarray.py	`96.76% <100.00%> (+0.17%)`	⬆️
heat/core/statistics.py	`97.39% <100.00%> (+0.04%)`	⬆️
heat/core/tests/test_statistics.py	`99.30% <100.00%> (+0.01%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ba5ae65...23dcc7e. Read the comment docs.

coquelin77 · 2020-03-31T10:40:00Z

i disagree with making a biased estimator the default of any statistical measure, regardless of how numpy does it.

ClaudiaComito · 2020-03-31T10:52:17Z

i disagree with making a biased estimator the default of any statistical measure, regardless of how numpy does it.

I have a lot of understanding and in my mind I've been calling this the "doof" argument all along.

Not sure it's worth jeopardizing our import heat as np stance for this though. @Markus-Goetz ?

coquelin77 · 2020-03-31T10:56:00Z

to further stir the pot, pytorch's default is an unbiased estimator

coquelin77 · 2020-03-31T11:07:06Z

to further stir the pot, pytorch's default is an unbiased estimator

and to extend that, this would need to be changed in the var function as well to calculate the correct estimator.

ClaudiaComito · 2020-04-01T04:51:22Z

to further stir the pot, pytorch's default is an unbiased estimator

and to extend that, this would need to be changed in the var function as well to calculate the correct estimator.

I think I've addressed all of this, but let me know once you had a look.

coquelin77 · 2020-04-01T06:13:05Z

i disagree with making a biased estimator the default of any statistical measure, regardless of how numpy does it.

I have a lot of understanding and in my mind I've been calling this the "doof" argument all along.

Not sure it's worth jeopardizing our import heat as np stance for this though. @Markus-Goetz ?

i do not think that we are jeopardizing this as the code will run clean with the bessel correction as default. I think that we should be using unbiased estimators statistics whenever possible. there are already differences which make it so that import heat as np does not distribute the data between processes. i think that the use of unbiased estimators as defaults can be put into this pile. i disagree with the default value setting in numpy in this matter and i do not think that we should follow numpy blindly.

ClaudiaComito · 2020-04-01T08:24:17Z

i disagree with making a biased estimator the default of any statistical measure, regardless of how numpy does it.

I have a lot of understanding and in my mind I've been calling this the "doof" argument all along.
Not sure it's worth jeopardizing our import heat as np stance for this though. @Markus-Goetz ?

i do not think that we are jeopardizing this as the code will run clean with the bessel correction as default. I think that we should be using unbiased estimators statistics whenever possible. there are already differences which make it so that import heat as np does not distribute the data between processes. i think that the use of unbiased estimators as defaults can be put into this pile. i disagree with the default value setting in numpy in this matter and i do not think that we should follow numpy blindly.

The code doesn't run clean with the Bessel correction in edge cases, where for example ht.var() of a 1-element tensor returns nan instead of 0.

I'm not here to make a statement about best practices in statistics. I think deviating from numpy and setting ddof=1 as default in this case is a bad idea:

numpy users will expect the exact same results from heat for the same function call; by the way, sklearn uses the default np.var() profusely as well;
we know for a fact that the current default (bessel=True) breaks the tests in the edge cases where splitting produces a 1-element tensor on 1 node;
we will have instances of this (1-element tensors) happening in real life, not just in tests;
finally but most importantly, being able to split tensors is the reason why HeAT exists! Unbiased estimators are not. We're talking about a 1/n factor for the variance, where n is usually very large. This is not at all the same pile. I don't see this as following numpy blindly, rather I feel that the opposite would be following torch blindly and setting us up for trouble unnecessarily.

Let's agree to disagree as far as I'm concerned, and maybe the others should chip in as well at this point.

coquelin77 · 2020-04-01T08:34:29Z

this does a better job of making the case as to why it should be 0 and not 1. they didnt chose it arbitrarily. Can you add something like this do the docs? this is pulled directly from numpy

The mean is normally calculated as x.sum() / N, where N = len(x).
If, however, ddof is specified, the divisor N - ddof is used
instead. In standard statistical practice, ddof=1 provides an
unbiased estimator of the variance of a hypothetical infinite population.
ddof=0 provides a maximum likelihood estimate of the variance for
normally distributed variables.

ClaudiaComito · 2020-04-01T08:38:08Z

this does a better job of making the case as to why it should be 0 and not 1. they didnt chose it arbitrarily.

Well, I never thought that they did it arbitrarily! By the way thanks for giving me the chance to refresh statistics 101...

coquelin77 · 2020-04-01T08:40:51Z

CHANGELOG.md

@@ -6,6 +6,7 @@
 - [#483](https://github.com/helmholtz-analytics/heat/pull/483) Bugfix: DNDarray.cpu() changes heat device to cpu
 - Update documentation theme to "Read the Docs"
 - [#499](https://github.com/helmholtz-analytics/heat/pull/499) Bugfix: MPI datatype mapping: `torch.int16` now maps to `MPI.SHORT` instead of `MPI.SHORT_INT`
+- [#515] (https://github.com/helmholtz-analytics/heat/pull/515) ht.var(), ht.std() numpy-compliant (ddof=0)


can you be more specific? default is now max likelyhood ... instead of unbiased estimator

OK done, both here and in the docs

coquelin77 · 2020-04-01T08:46:18Z

this does a better job of making the case as to why it should be 0 and not 1. they didnt chose it arbitrarily.

Well, I never thought that they did it arbitrarily! By the way thanks for giving me the chance to refresh statistics 101...

I was not sure why they chose it before. i still dont agree with it but if it has a logical reason behind it which i agree with then i cannot continue to oppose it simply because i have an opinion which leans the other way. (also this is not a major change, it was mostly a principle thing for me)

ClaudiaComito added 7 commits March 31, 2020 11:40

Replacing var(), std() argument "bessel" with ddof=0 for numpy API co…

b181103

…mpleteness

Replaced bessel with ddof attribute in test_var(), test_std()

25b3c3e

ht.var(), ht.std(), updated documentation and examples

c09490a

Attribute ddof=0 (!) replaces bessel in methods var(), std().

f4c0088

Removed forgotten print statement

d1f63d1

Informally allow for kwarg "bessel" within var() and std()

801e0e0

Inofficially allow for kwarg "bessel" in methods std() and var()

07033d2

ClaudiaComito added the enhancement New feature or request label Mar 31, 2020

ClaudiaComito requested review from mtar and coquelin77 March 31, 2020 10:31

ClaudiaComito added this to In progress in Gaussian Naive Bayes via automation Mar 31, 2020

Updated changelog

c963b3f

Increasing test coverage.

108d37f

ClaudiaComito added 2 commits April 1, 2020 06:08

Increasing test coverage

c7ee6be

Increasing test coverage

302d802

ClaudiaComito requested a review from Markus-Goetz April 1, 2020 04:46

coquelin77 previously approved these changes Apr 1, 2020

View reviewed changes

Expanded documentation on ddof

1efa36d

ClaudiaComito dismissed coquelin77’s stale review via 1efa36d April 1, 2020 10:01

ClaudiaComito and others added 2 commits April 1, 2020 12:09

Updated changelog

416f90e

Merge branch 'master' into features/514-var-std-numpylike

23dcc7e

ClaudiaComito merged commit 3b64c46 into master Apr 1, 2020

Gaussian Naive Bayes automation moved this from In progress to Done Apr 1, 2020

ClaudiaComito deleted the features/514-var-std-numpylike branch April 1, 2020 10:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Features/514 var std numpylike #515

Features/514 var std numpylike #515

ClaudiaComito commented Mar 31, 2020 •

edited

codecov bot commented Mar 31, 2020 •

edited

coquelin77 commented Mar 31, 2020

ClaudiaComito commented Mar 31, 2020

coquelin77 commented Mar 31, 2020

coquelin77 commented Mar 31, 2020

ClaudiaComito commented Apr 1, 2020

coquelin77 commented Apr 1, 2020

ClaudiaComito commented Apr 1, 2020

coquelin77 commented Apr 1, 2020 •

edited

ClaudiaComito commented Apr 1, 2020 •

edited

coquelin77 Apr 1, 2020

ClaudiaComito Apr 1, 2020

coquelin77 commented Apr 1, 2020

Features/514 var std numpylike #515

Features/514 var std numpylike #515

Conversation

ClaudiaComito commented Mar 31, 2020 • edited

Description

Changes proposed:

Type of change

Due Diligence

Does this change modify the behaviour of other functions? If so, which?

codecov bot commented Mar 31, 2020 • edited

Codecov Report

coquelin77 commented Mar 31, 2020

ClaudiaComito commented Mar 31, 2020

coquelin77 commented Mar 31, 2020

coquelin77 commented Mar 31, 2020

ClaudiaComito commented Apr 1, 2020

coquelin77 commented Apr 1, 2020

ClaudiaComito commented Apr 1, 2020

coquelin77 commented Apr 1, 2020 • edited

ClaudiaComito commented Apr 1, 2020 • edited

coquelin77 Apr 1, 2020

Choose a reason for hiding this comment

ClaudiaComito Apr 1, 2020

Choose a reason for hiding this comment

coquelin77 commented Apr 1, 2020

ClaudiaComito commented Mar 31, 2020 •

edited

codecov bot commented Mar 31, 2020 •

edited

coquelin77 commented Apr 1, 2020 •

edited

ClaudiaComito commented Apr 1, 2020 •

edited