cov and corrcoef: add support for weights #3864

Closed · wants to merge 1 commit

Conversation

@ndawe commented Oct 4, 2013

Given that `average` supports weights [1], and that there have been requests [2] for implementations of weighted covariance, I think it would be beneficial to also support weights in `cov` and `corrcoef` [3, 4].

[1] http://docs.scipy.org/doc/numpy/reference/generated/numpy.average.html
[2] http://stackoverflow.com/questions/11442962/python-package-that-supports-weighted-covariance-computation
[3] http://stats.stackexchange.com/a/61298/29084
[4] https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Weighted_sample_covariance

```python
if weights is not None:
    weights = array(weights, dtype=float)
    X -= average(X, axis=1-axis, weights=weights)[tup]
    N = weights.sum()
```
Contributor:

This doesn't look right.
If the sum of the weights is 1 and `ddof` and `bias` are left at their default values, you always get a division by zero in the scaling, since `fact = N - ddof`.
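For illustration, a minimal sketch of the failure mode described here (reconstructed for this discussion; the names merely mirror the snippet under review, this is not the PR's code):

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0]])        # one variable, three observations
weights = np.array([0.2, 0.3, 0.5])    # already normalized: weights.sum() == 1

N = weights.sum()   # N == 1.0
ddof = 1            # the default when bias=0
fact = N - ddof     # fact == 0.0, so scaling by 1/fact divides by zero
```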

Author:

@juliantaylor, very good point. The weights should first be normalized to sum to N (the unweighted number of observations), right? I am looking at http://stats.stackexchange.com/a/61298/29084

Author:

A `ValueError` should also be raised if the weights sum to zero.

@ndawe (Author) commented Oct 5, 2013

@juliantaylor, this should now be the correct implementation. If the weights are not repeat-type, i.e. they do not represent an integer number of occurrences of each observation, then only the biased definition applies. So this PR adds two new arguments: `weights` and `repeat_weights`. `repeat_weights` defaults to 0, in which case the covariance is always biased; if `repeat_weights` is 1, then either the biased or the unbiased equation is used, depending on the specified values of `bias` and `ddof`, and the weights must be integer frequencies of each observation.
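As a rough sketch of the two cases described above (illustrative only; `weighted_cov` is a hypothetical helper, not the PR's actual implementation):

```python
import numpy as np

def weighted_cov(x, y, weights, repeat_weights=0, ddof=1):
    """Weighted covariance of two 1-D arrays (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    if w.sum() == 0:
        raise ValueError("weights sum to zero")
    mx = np.average(x, weights=w)
    my = np.average(y, weights=w)
    s = np.sum(w * (x - mx) * (y - my))
    if repeat_weights:
        # Weights are integer frequencies: an unbiased estimate is possible.
        return s / (w.sum() - ddof)
    # General (non-frequency) weights: only the biased estimate.
    return s / w.sum()
```

With integer frequencies and `repeat_weights=1`, the result should agree with `np.cov` applied to the correspondingly replicated observations.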

```
weights : array-like, optional
    A 1-D array of weights with a length equal to the number of
    observations.
repeat_weights : int, optional
```
Member:

Sounds to me like it should be a boolean. Before adding new keyword arguments, a short note to the mailing list would be good; someone might have some good ideas about the API. I have never used such a weighted cov: are the actual weights integers when this is True, or do they just sum up to an integer (or, well, can they be interpreted as degrees of freedom or something like that)?

Author:

I wanted to make it a boolean, but then I saw that `bias` was an int, so I assumed that for some reason numpy preferred 0/1 switches over False/True. I also put this new kwarg last so it would not interfere with existing calls to this function that assume the current argument order. I will send an email to the list asking for more input and ideas. If `repeat_weights` is True, then the elements of `weights` must be integers for the results to make sense. In general, though, the weights are not integer frequencies, and in that case we only have the biased covariance. Maybe there is a better name for this argument... `frequency_weights`?

Member:

I bet the function is just so old that when it was written, Python did not have booleans.

Author:

Wow...

@seberg (Member) commented Oct 11, 2013

@juliantaylor, yes, it will conflict with gh-3882; that can't be helped... one of these will have to do an annoying merge.

@ndawe, this needs a ping to the mailing list and tests. I am a bit unsure about the "repeat-type" argument. It seems to me that all it does is force `bias = 1`/`ddof = 0`, but on the other hand, it might be a bit hidden that way.

@ndawe (Author) commented Oct 11, 2013

I don't mind doing a rebase on top of #3882 since I see that as a more important issue.

@josef-pkt commented:

@seberg Thanks for the ping, I will look at it.

In statsmodels I only have frequency weights (which I had to correct after the Stack Overflow question).
http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.weightstats.DescrStatsW.cov.html

I don't see unit tests yet in this PR.

@josef-pkt commented:

I think the bool `repeat_weights` could be replaced by allowing two different weights arguments. That's what Stata and SAS do; I have never tried that. Stata has fweights and pweights (and aweights).

What I remember from the SAS documentation is that they are just multiplied together but affect the total "sample size" differently.
In statsmodels, I have frequency weights that are unit tested with integer weights against `np.cov` on the replicated array, but I don't impose integer weights. The only important assumption is that `weights.sum()` reflects the total sample size.

I never looked at the details for the non-frequency weights in this case. If those have the interpretation of the inverse standard deviation of the individual observations, then they shouldn't affect the denominator, I guess. This should be compared with other packages (and unit tested against them).
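A sketch of that kind of consistency check, under the convention that `weights.sum()` is the total sample size (illustrative; this is not statsmodels code):

```python
import numpy as np

x = np.array([1.0, 2.5, 0.3, 4.1, 2.2])
y = np.array([0.5, 1.7, 2.2, 3.3, 0.1])
freq = np.array([1, 3, 2, 1, 4])          # integer frequency weights

# Frequency-weighted covariance, with sample size freq.sum()
mx, my = np.average(x, weights=freq), np.average(y, weights=freq)
cov_w = np.sum(freq * (x - mx) * (y - my)) / (freq.sum() - 1)

# Reference: replicate each observation freq[i] times and use np.cov
cov_rep = np.cov(np.repeat(x, freq), np.repeat(y, freq))[0, 1]

assert np.allclose(cov_w, cov_rep)
```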

@seberg (Member) commented Oct 11, 2013

Ah, that sounds good: having pweights and aweights (or some other/better names). One could really multiply them together; frequency weights would just replace N with `N = fweights.sum()`, while pweights would (might be wrong) change N by (also) using `N *= pweights.sum()**2 / (pweights**2).sum()`. After that, ddof can just be used as always to calculate `fact`.
EDIT: Changing N with pweights is of course only enough if they are normalized to 1 first; otherwise `fact` needs to take their sum into account too.
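For concreteness, the scale-invariant quantity this seems to be aiming at is the usual effective sample size for non-frequency weights (a sketch with an arbitrary example array; not code from this PR):

```python
import numpy as np

pweights = np.array([0.5, 1.0, 1.5, 2.0])

# Effective number of observations implied by non-frequency weights;
# unchanged if pweights is rescaled by a constant factor.
n_eff = pweights.sum() ** 2 / np.sum(pweights ** 2)

# Equivalently, with the weights normalized to sum to 1:
p = pweights / pweights.sum()
assert np.isclose(n_eff, 1.0 / np.sum(p ** 2))
```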

@josef-pkt commented:

I mixed them up: Stata has fweights (repeated observations) and aweights. I just tried, and it doesn't seem to allow them at the same time, although I don't see a reason why it shouldn't.

In Stata, scaling aweights has no effect on cov or variance, and corresponds to normalizing them to 1, AFAICS. However, the scale invariance conflicts with the usual interpretation in Stata that aweights correspond to the standard deviation, assuming that each observation is the mean of a subsample.
http://www.stata.com/support/faqs/statistics/weights-and-summary-statistics/
The last point is maybe not relevant here, just a reminder that I had forgotten what the interpretation of aweights is.

Doubling fweights doubles the number of assumed observations; that's easier to understand.

I haven't looked up the SAS or SPSS documentation. (SPSS has only one weight, fweights, in the formula collection, IIRC.) I have no idea what R does, since it doesn't have such nice documentation.

@josef-pkt commented:

SAS: I don't remember which procedures I was reading; it was never for cov specifically.

For example, here SAS has FREQ and WEIGHTS:
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473330.htm
Looking at the explanation for sigma^2, it looks like SAS has a separate VARDEF= option, which allows different denominators in the variance based on the weights.

I remember reading another SAS documentation page that had f_i and w_i at the same time in the formulas.

@seberg (Member) commented Oct 11, 2013

Graah. Assuming that Stata blog post is right, I see 4 different types of weights (but they can be squashed):

  1. frequencies
  2. pweights from the blog post: maybe something like (relative) probabilities, and according to that, the same as frequencies normed to N. I would currently interpret this as the estimator for "what I get if I sample N samples according to the probabilities".
  3. aweights according to the blog post: scaling of the variance given by the assumption that x_i = Norm(mu, var / aweight_i) (it is about variance, not covariance). It gives something like pweights, except that after the calculation you would factor the norming to N back in (i.e. like frequencies, but the correction factor is n/(n - ddof) rather than sum(freq)/(sum(freq) - ddof)).
  4. What R does by default and Wikipedia mentions: I assume this is given by an assumption of something like x_i = Norm(mu, var) + Norm(0, 1/weight_i), i.e. something that treats the weight as an uncertainty. From what I have read (I have access to neither), SAS and Stata do not implement this at all.

I suppose aweights may not be very logical for the covariance. pweights, I think, follow logically from frequencies (for these statistics at least). For std, SAS has f_i and w_i, but the usage of w_i seems to be either frequencies or pweights.

So, I guess having frequencies and 4. (the R default) next to each other in numpy may be OK (with good docs, e.g. saying how to norm frequencies correctly). If it should be more complex than that, I have to start wondering whether we should have weights in numpy at all.
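To make the contrast concrete, here is a sketch of the frequency correction (case 1) alongside the unbiased non-frequency correction given in the Wikipedia article linked in the description; these helpers are illustrative and not any package's implementation:

```python
import numpy as np

def cov_freq(x, y, f, ddof=1):
    # f are integer frequencies; the total sample size is f.sum().
    f = np.asarray(f, dtype=float)
    mx, my = np.average(x, weights=f), np.average(y, weights=f)
    return np.sum(f * (x - mx) * (y - my)) / (f.sum() - ddof)

def cov_reliability(x, y, w):
    # w are non-frequency (reliability-type) weights; the unbiased
    # denominator is sum(w) - sum(w**2)/sum(w), which is invariant
    # under rescaling w by a constant.
    w = np.asarray(w, dtype=float)
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    v1, v2 = w.sum(), np.sum(w ** 2)
    return np.sum(w * (x - mx) * (y - my)) / (v1 - v2 / v1)
```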

@ndawe (Author) commented Oct 15, 2013

Sorry for the silence. I'll get back to this with comments and suggested changes as soon as possible.

@seberg (Member) commented Oct 16, 2013

@ndawe, I just pushed the other pull request into master, so you might want to fetch it before you start working on it more (since I expect it is an ugly merge).

@ndawe (Author) commented Oct 20, 2013

@seberg, I rebased and resolved the conflicts. So have we decided on two arguments, `frequencies` and `weights`? For `frequencies` we use `sum(frequencies)` as the sample size, and with `weights` we estimate the sample size according to the weighted covariance formula?

@seberg (Member) commented Oct 20, 2013

@ndawe, it sounds reasonable to me, but please write to the list (since this is a new feature, that should be done anyway). Maybe some statisticians have comments; honestly, I am not deep enough into this to make such a call.

@charris (Member) commented Feb 18, 2014

@ndawe @seberg Looks like this is ready modulo list discussion. Is that right?

@seberg (Member) commented Feb 18, 2014

The list discussion may be the next logical step, but this is not ready. We basically decided that another approach is probably better, and that is not in here yet. However, since the list may disagree with that...

@ndawe (Author) commented Feb 25, 2014

I think the discussion on the different types of weights is making it more complicated than it needs to be. I only see the need for frequencies and weights, where weights is just the relative importance (like a probability).

@seberg (Member) commented Feb 26, 2014

Sorry, I probably have to think this through more at some point. If I recall correctly, I fully agree ;). I am not sure whether the way to implement it is what you have now, or to have frequencies and weights as keyword arguments. Did you fix the ddof usage in combination with weights? If you are not faster, I think I will mail the list sometime this weekend or so about which keyword arguments are good to add.

@charris (Member) commented Apr 26, 2015

Frankly, I'd just have one weight and let the user set up the correct value, but that may be a bit much...

@ndawe (Author) commented Apr 26, 2015

Yes, I quite frankly think that people are overthinking this. That's my feeling based on the discussion above.

@ndawe (Author) commented Apr 26, 2015

In the meantime I've moved on with my private implementation, since it became clear that getting this into numpy would take more time than it's worth. It's a weighted covariance... not that hard. Give users the ability to specify one type of weights (or just two, with frequencies) and let them set up the weight values according to what they need. Done.

@seberg (Member) commented Apr 27, 2015

Maybe... I am fine with just one weight as long as the documentation is clear about what it is and what it is not, and possibly includes a link to somewhere explaining how you would get the other variants.

@charris (Member) commented Apr 27, 2015

Still thinking about this and researching the topic.

@charris (Member) commented Apr 27, 2015

I'm thinking of a combination of two weights, repeat and reliability, plus ddof. The weights can be combined, but not by simple multiplication, at least for the correction factor in @ndawe's formula above. The sum of squared weights needs to be replaced by `ddof * sum(n * r**2)`, where `n` are the repeat weights and `r` are the reliability weights. I'm not sure what ddof does in the reliability case, but `ddof=1` will give the unbiased estimate in all cases.
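A sketch of that combined correction factor, interpreting the formula above (hypothetical helper; `n` are the repeat weights and `r` the reliability weights):

```python
import numpy as np

def combined_fact(n, r, ddof=1):
    # Total per-observation weights are the elementwise product of
    # repeat and reliability weights.
    n = np.asarray(n, dtype=float)
    r = np.asarray(r, dtype=float)
    w = n * r
    # Normalization factor: sum(w) minus ddof * sum(n * r**2) / sum(w),
    # i.e. the "sum of squared weights" term replaced as described above.
    return w.sum() - ddof * np.sum(n * r ** 2) / w.sum()
```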

@charris (Member) commented May 14, 2015

Closed by #4960.

@charris closed this May 14, 2015