Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Features/514 var std numpylike #515

Merged
merged 14 commits into from Apr 1, 2020
Merged

Conversation

ClaudiaComito
Copy link
Contributor

@ClaudiaComito ClaudiaComito commented Mar 31, 2020

Description

Working on #490 led me to opening several cans of worms and resulted in a rather messy PR (#508). I'm going to withdraw #508 and submit the main changes separately.

With this PR I'm making ht.var() and ht.std() numpy-compliant.

Issue/s resolved: #514 and partly #490 (because the numpy default is biased variance (no Bessel correction), ht.var() of a tensor of 1 element now returns 0 instead of nan).

Changes proposed:

  • kwarg bessel has been replaced by ddof, can be 0 or 1;
  • Default is ddof=0 like for numpy (equivalent to unbiased=False for torch and bessel=False for the current heat implementation).
  • argument bessel (bool) can still be passed so scripts shouldn't break, but remember the default is now no Bessel correction

Type of change

Remove irrelevant options:

  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Due Diligence

  • All split configurations tested
  • Multiple dtypes tested in relevant functions
  • Documentation updated (if needed)
  • Updated changelog.md under the title "Pending Additions"

Does this change modify the behaviour of other functions? If so, which?

functions that call var() or std() relying on unbiased variance will yield the wrong result (default is now biased variance, ddof=0).

@ClaudiaComito ClaudiaComito added the enhancement New feature or request label Mar 31, 2020
@ClaudiaComito ClaudiaComito added this to In progress in Gaussian Naive Bayes via automation Mar 31, 2020
@codecov
Copy link

codecov bot commented Mar 31, 2020

Codecov Report

Merging #515 into master will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #515      +/-   ##
==========================================
+ Coverage   96.60%   96.61%   +0.01%     
==========================================
  Files          65       65              
  Lines       13991    14008      +17     
==========================================
+ Hits        13516    13534      +18     
+ Misses        475      474       -1     
Impacted Files Coverage Δ
heat/core/tests/test_indexing.py 92.75% <ø> (-0.11%) ⬇️
heat/core/dndarray.py 96.76% <100.00%> (+0.17%) ⬆️
heat/core/statistics.py 97.39% <100.00%> (+0.04%) ⬆️
heat/core/tests/test_statistics.py 99.30% <100.00%> (+0.01%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ba5ae65...23dcc7e. Read the comment docs.

@coquelin77
Copy link
Member

i disagree with making a biased estimator the default of any statistical measure, regardless of how numpy does it.

@ClaudiaComito
Copy link
Contributor Author

i disagree with making a biased estimator the default of any statistical measure, regardless of how numpy does it.

I have a lot of understanding and in my mind I've been calling this the "doof" argument all along.

Not sure it's worth jeopardizing our import heat as np stance for this though. @Markus-Goetz ?

@coquelin77
Copy link
Member

to further stir the pot, pytorch's default is an unbiased estimator

@coquelin77
Copy link
Member

to further stir the pot, pytorch's default is an unbiased estimator

and to extend that, this would need to be changed in the var function as well to calculate the correct estimator.

@ClaudiaComito
Copy link
Contributor Author

to further stir the pot, pytorch's default is an unbiased estimator

and to extend that, this would need to be changed in the var function as well to calculate the correct estimator.

I think I've addressed all of this, but let me know once you had a look.

@coquelin77
Copy link
Member

i disagree with making a biased estimator the default of any statistical measure, regardless of how numpy does it.

I have a lot of understanding and in my mind I've been calling this the "doof" argument all along.

Not sure it's worth jeopardizing our import heat as np stance for this though. @Markus-Goetz ?

i do not think that we are jeopardizing this as the code will run clean with the bessel correction as default. I think that we should be using unbiased estimators statistics whenever possible. there are already differences which make it so that import heat as np does not distribute the data between processes. i think that the use of unbiased estimators as defaults can be put into this pile. i disagree with the default value setting in numpy in this matter and i do not think that we should follow numpy blindly.

@ClaudiaComito
Copy link
Contributor Author

i disagree with making a biased estimator the default of any statistical measure, regardless of how numpy does it.

I have a lot of understanding and in my mind I've been calling this the "doof" argument all along.
Not sure it's worth jeopardizing our import heat as np stance for this though. @Markus-Goetz ?

i do not think that we are jeopardizing this as the code will run clean with the bessel correction as default. I think that we should be using unbiased estimators statistics whenever possible. there are already differences which make it so that import heat as np does not distribute the data between processes. i think that the use of unbiased estimators as defaults can be put into this pile. i disagree with the default value setting in numpy in this matter and i do not think that we should follow numpy blindly.

The code doesn't run clean with the Bessel correction in edge cases, where for example ht.var() of a 1-element tensor returns nan instead of 0.

I'm not here to make a statement about best practices in statistics. I think deviating from numpy and setting ddof=1 as default in this case is a bad idea:

  • numpy users will expect the exact same results from heat for the same function call; by the way, sklearn uses the default np.var() profusely as well;
  • we know for a fact that the current default (bessel=True) breaks the tests in the edge cases where splitting produces a 1-element tensor on 1 node;
  • we will have instances of this (1-element tensors) happening in real life, not just in tests;
  • finally but most importantly, being able to split tensors is the reason why HeAT exists! Unbiased estimators are not. We're talking about a 1/n factor for the variance, where n is usually very large. This is not at all the same pile. I don't see this as following numpy blindly, rather I feel that the opposite would be following torch blindly and setting us up for trouble unnecessarily.

Let's agree to disagree as far as I'm concerned, and maybe the others should chip in as well at this point.

@coquelin77
Copy link
Member

coquelin77 commented Apr 1, 2020

this does a better job of making the case as to why it should be 0 and not 1. they didnt chose it arbitrarily. Can you add something like this do the docs? this is pulled directly from numpy

The mean is normally calculated as x.sum() / N, where N = len(x).
If, however, ddof is specified, the divisor N - ddof is used
instead. In standard statistical practice, ddof=1 provides an
unbiased estimator of the variance of a hypothetical infinite population.
ddof=0 provides a maximum likelihood estimate of the variance for
normally distributed variables.

@ClaudiaComito
Copy link
Contributor Author

ClaudiaComito commented Apr 1, 2020

this does a better job of making the case as to why it should be 0 and not 1. they didnt chose it arbitrarily.

Well, I never thought that they did it arbitrarily! By the way thanks for giving me the chance to refresh statistics 101...

coquelin77
coquelin77 previously approved these changes Apr 1, 2020
CHANGELOG.md Outdated
@@ -6,6 +6,7 @@
- [#483](https://github.com/helmholtz-analytics/heat/pull/483) Bugfix: DNDarray.cpu() changes heat device to cpu
- Update documentation theme to "Read the Docs"
- [#499](https://github.com/helmholtz-analytics/heat/pull/499) Bugfix: MPI datatype mapping: `torch.int16` now maps to `MPI.SHORT` instead of `MPI.SHORT_INT`
- [#515] (https://github.com/helmholtz-analytics/heat/pull/515) ht.var(), ht.std() numpy-compliant (ddof=0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you be more specific? default is now max likelyhood ... instead of unbiased estimator

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK done, both here and in the docs

@coquelin77
Copy link
Member

this does a better job of making the case as to why it should be 0 and not 1. they didnt chose it arbitrarily.

Well, I never thought that they did it arbitrarily! By the way thanks for giving me the chance to refresh statistics 101...

I was not sure why they chose it before. i still dont agree with it but if it has a logical reason behind it which i agree with then i cannot continue to oppose it simply because i have an opinion which leans the other way. (also this is not a major change, it was mostly a principle thing for me)

@ClaudiaComito ClaudiaComito merged commit 3b64c46 into master Apr 1, 2020
Gaussian Naive Bayes automation moved this from In progress to Done Apr 1, 2020
@ClaudiaComito ClaudiaComito deleted the features/514-var-std-numpylike branch April 1, 2020 10:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Make ht.var(), ht.std() NumPy-compatible
2 participants