cov and corrcoef: add support for weights #3864

Closed · wants to merge 1 commit

Conversation

@ndawe commented Oct 4, 2013

Given that `average` supports weights [1], and that there have been requests [2] for implementations of weighted covariance, I think it would be beneficial to also support weights in `cov` and `corrcoef` [3, 4].

[1] http://docs.scipy.org/doc/numpy/reference/generated/numpy.average.html
[2] http://stackoverflow.com/questions/11442962/python-package-that-supports-weighted-covariance-computation
[3] http://stats.stackexchange.com/a/61298/29084
[4] https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Weighted_sample_covariance

```python
if weights is not None:
    weights = array(weights, dtype=float)
    X -= average(X, axis=1-axis, weights=weights)[tup]
    N = weights.sum()
```
Contributor:

This doesn't look right.
If the sum of the weights is 1 and `ddof` and `bias` are left at their default values, you always get a division by zero in the scaling, since `fact = N - ddof`.
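For illustration, a minimal sketch of the failure mode described here (reconstructed for this discussion; the names merely mirror the snippet under review, this is not the PR's code):

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0]])        # one variable, three observations
weights = np.array([0.2, 0.3, 0.5])    # already normalized: weights.sum() == 1

N = weights.sum()   # N == 1.0
ddof = 1            # the default when bias=0
fact = N - ddof     # fact == 0.0, so scaling by 1/fact divides by zero
```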

Author:

@juliantaylor, very good point. The weights should first be normalized to sum to N (the unweighted number of observations), right? I am looking at http://stats.stackexchange.com/a/61298/29084

Author:

A `ValueError` should also be raised if the weights sum to zero.

@ndawe (Author) commented Oct 5, 2013

@juliantaylor, this should now be the correct implementation. If the weights are not repeat-type, i.e. they do not represent an integer number of occurrences of each observation, then only the biased definition applies. So this PR adds two new arguments: `weights` and `repeat_weights`. `repeat_weights` defaults to 0, in which case the covariance is always biased; if `repeat_weights` is 1, then either the biased or the unbiased equation is used, depending on the specified values of `bias` and `ddof`, and the weights must be integer frequencies of each observation.
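As a rough sketch of the two cases described above (illustrative only; `weighted_cov` is a hypothetical helper, not the PR's actual implementation):

```python
import numpy as np

def weighted_cov(x, y, weights, repeat_weights=0, ddof=1):
    """Weighted covariance of two 1-D arrays (illustrative sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(weights, dtype=float)
    if w.sum() == 0:
        raise ValueError("weights sum to zero")
    mx = np.average(x, weights=w)
    my = np.average(y, weights=w)
    s = np.sum(w * (x - mx) * (y - my))
    if repeat_weights:
        # Weights are integer frequencies: an unbiased estimate is possible.
        return s / (w.sum() - ddof)
    # General (non-frequency) weights: only the biased estimate.
    return s / w.sum()
```

With integer frequencies and `repeat_weights=1`, the result should agree with `np.cov` applied to the correspondingly replicated observations.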

```
weights : array-like, optional
    A 1-D array of weights with a length equal to the number of
    observations.
repeat_weights : int, optional
```
Member:

Sounds to me like it should be a boolean. Before adding new keyword arguments, a short note to the mailing list would be good; someone might have some good ideas about the API. I have never used such a weighted cov: are the actual weights integers when this is True, or do they just sum up to an integer (or, well, can they be interpreted as degrees of freedom or something like that)?

Author:

I wanted to make it a boolean, but then I saw that `bias` was an int, so I assumed that for some reason numpy preferred 0/1 switches over False/True. I also put this new kwarg last so it would not interfere with existing calls to this function that assume the current argument order. I will send an email to the list asking for more input and ideas. If `repeat_weights` is True, then the elements of `weights` must be integers for the results to make sense. In general, though, the weights are not integer frequencies, and in that case we only have the biased covariance. Maybe there is a better name for this argument... `frequency_weights`?

Member:

I bet the function is just so old that when it was written, Python did not have booleans.

Author:

Wow...

@seberg (Member) commented Oct 11, 2013

@juliantaylor, yes, it will conflict with gh-3882; that can't be helped... one of these will have to do an annoying merge.

@ndawe, this needs a ping to the mailing list and tests. I am a bit unsure about the "repeat-type" argument. It seems to me that all it does is force `bias = 1`/`ddof = 0`, but on the other hand, it might be a bit hidden that way.

@ndawe (Author) commented Oct 11, 2013

I don't mind doing a rebase on top of #3882 since I see that as a more important issue.

@josef-pkt commented:

@seberg Thanks for the ping, I will look at it.

In statsmodels I only have frequency weights (which I had to correct after the Stack Overflow question).
http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.weightstats.DescrStatsW.cov.html

I don't see unit tests yet in this PR.

@josef-pkt commented:

I think the bool `repeat_weights` could be replaced by allowing two different weights arguments. That's what Stata and SAS do; I have never tried that. Stata has fweights and pweights (and aweights).

What I remember from the SAS documentation is that they are just multiplied together but affect the total "sample size" differently.
In statsmodels, I have frequency weights that are unit tested with integer weights against `np.cov` on the replicated array, but I don't impose integer weights. The only important assumption is that `weights.sum()` reflects the total sample size.

I never looked at the details for the non-frequency weights in this case. If those have the interpretation of the inverse standard deviation of the individual observations, then they shouldn't affect the denominator, I guess. This should be compared with other packages (and unit tested against them).
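A sketch of that kind of consistency check, under the convention that `weights.sum()` is the total sample size (illustrative; this is not statsmodels code):

```python
import numpy as np

x = np.array([1.0, 2.5, 0.3, 4.1, 2.2])
y = np.array([0.5, 1.7, 2.2, 3.3, 0.1])
freq = np.array([1, 3, 2, 1, 4])          # integer frequency weights

# Frequency-weighted covariance, with sample size freq.sum()
mx, my = np.average(x, weights=freq), np.average(y, weights=freq)
cov_w = np.sum(freq * (x - mx) * (y - my)) / (freq.sum() - 1)

# Reference: replicate each observation freq[i] times and use np.cov
cov_rep = np.cov(np.repeat(x, freq), np.repeat(y, freq))[0, 1]

assert np.allclose(cov_w, cov_rep)
```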

@seberg (Member) commented Oct 11, 2013

Ah, that sounds good: having pweights and aweights (or some other/better names). One could really multiply them together; frequency weights would just replace N with `N = fweights.sum()`, while pweights would (might be wrong) change N by (also) using `N *= pweights.sum()**2 / (pweights**2).sum()`. After that, ddof can just be used as always to calculate `fact`.
EDIT: Changing N with pweights is of course only enough if they are normalized to 1 first; otherwise `fact` needs to take their sum into account too.
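For concreteness, the scale-invariant quantity this seems to be aiming at is the usual effective sample size for non-frequency weights (a sketch with an arbitrary example array; not code from this PR):

```python
import numpy as np

pweights = np.array([0.5, 1.0, 1.5, 2.0])

# Effective number of observations implied by non-frequency weights;
# unchanged if pweights is rescaled by a constant factor.
n_eff = pweights.sum() ** 2 / np.sum(pweights ** 2)

# Equivalently, with the weights normalized to sum to 1:
p = pweights / pweights.sum()
assert np.isclose(n_eff, 1.0 / np.sum(p ** 2))
```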

@josef-pkt commented:

I mixed them up: Stata has fweights (repeated observations) and aweights. I just tried, and it doesn't seem to allow them at the same time, although I don't see a reason why it shouldn't.

In Stata, scaling aweights has no effect on cov or variance, and corresponds to normalizing them to 1, AFAICS. However, the scale invariance conflicts with the usual interpretation in Stata that aweights correspond to the standard deviation, assuming that each observation is the mean of a subsample.
http://www.stata.com/support/faqs/statistics/weights-and-summary-statistics/
The last point is maybe not relevant here, just a reminder that I had forgotten what the interpretation of aweights is.

Doubling fweights doubles the number of assumed observations; that's easier to understand.

I haven't looked up the SAS or SPSS documentation. (SPSS has only one weight, fweights, in the formula collection, IIRC.) I have no idea what R does, since it doesn't have such nice documentation.

@josef-pkt commented:

SAS: I don't remember which procedures I was reading; it was never for cov specifically.

For example, here SAS has FREQ and WEIGHTS:
http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a002473330.htm
Looking at the explanation for sigma^2, it looks like SAS has a separate VARDEF= option, which allows different denominators in the variance based on the weights.

I remember reading another SAS documentation page that had f_i and w_i at the same time in the formulas.

@seberg (Member) commented Oct 11, 2013

Graah. Assuming that Stata blog post is right, I see 4 different types of weights (but they can be squashed):

  1. frequencies
  2. pweights from the blog post: maybe something like (relative) probabilities, and according to that, the same as frequencies normed to N. I would currently interpret this as the estimator for "what I get if I sample N samples according to the probabilities".
  3. aweights according to the blog post: scaling of the variance given by the assumption that x_i = Norm(mu, var / aweight_i) (it is about variance, not covariance). It gives something like pweights, except that after the calculation you would factor the norming to N back in (i.e. like frequencies, but the correction factor is n/(n - ddof) rather than sum(freq)/(sum(freq) - ddof)).
  4. What R does by default and Wikipedia mentions: I assume this is given by an assumption of something like x_i = Norm(mu, var) + Norm(0, 1/weight_i), i.e. something that treats the weight as an uncertainty. From what I have read (I have access to neither), SAS and Stata do not implement this at all.

I suppose aweights may not be very logical for the covariance. pweights, I think, follow logically from frequencies (for these statistics at least). For std, SAS has f_i and w_i, but the usage of w_i seems to be either frequencies or pweights.

So, I guess having frequencies and 4. (the R default) next to each other in numpy may be OK (with good docs, e.g. saying how to norm frequencies correctly). If it should be more complex than that, I have to start wondering whether we should have weights in numpy at all.
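To make the contrast concrete, here is a sketch of the frequency correction (case 1) alongside the unbiased non-frequency correction given in the Wikipedia article linked in the description; these helpers are illustrative and not any package's implementation:

```python
import numpy as np

def cov_freq(x, y, f, ddof=1):
    # f are integer frequencies; the total sample size is f.sum().
    f = np.asarray(f, dtype=float)
    mx, my = np.average(x, weights=f), np.average(y, weights=f)
    return np.sum(f * (x - mx) * (y - my)) / (f.sum() - ddof)

def cov_reliability(x, y, w):
    # w are non-frequency (reliability-type) weights; the unbiased
    # denominator is sum(w) - sum(w**2)/sum(w), which is invariant
    # under rescaling w by a constant.
    w = np.asarray(w, dtype=float)
    mx, my = np.average(x, weights=w), np.average(y, weights=w)
    v1, v2 = w.sum(), np.sum(w ** 2)
    return np.sum(w * (x - mx) * (y - my)) / (v1 - v2 / v1)
```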

@ndawe (Author) commented Oct 15, 2013

Sorry for the silence. I'll get back to this with comments and suggested changes as soon as possible.

@seberg (Member) commented Oct 16, 2013

@ndawe, I just pushed the other pull request into master, so you might want to fetch it before you start working on it more (since I expect it is an ugly merge).

@ndawe (Author) commented Oct 20, 2013

@seberg, I rebased and resolved the conflicts. So have we decided on two arguments, `frequencies` and `weights`? For `frequencies` we use `sum(frequencies)` as the sample size, and with `weights` we estimate the sample size according to the weighted covariance formula?

@seberg (Member) commented Oct 20, 2013

@ndawe, it sounds reasonable to me, but please write to the list (since this is a new feature, that should be done anyway). Maybe some statisticians have comments; honestly, I am not deep enough into this to make such a call.

@charris (Member) commented Feb 18, 2014

@ndawe @seberg Looks like this is ready modulo list discussion. Is that right?

@seberg (Member) commented Feb 18, 2014

The list discussion may be the next logical step, but this is not ready. We basically decided that another approach is probably better, and that is not in here yet. However, since the list may disagree with that...

@ndawe (Author) commented Feb 25, 2014

I think the discussion on the different types of weights is making it more complicated than it needs to be. I only see the need for frequencies and weights, where weights is just the relative importance (like a probability).

@seberg (Member) commented Feb 26, 2014

Sorry, I probably have to think this through more at some point. If I recall correctly, I fully agree ;). I am not sure whether the way to implement it is what you have now, or to have frequencies and weights as keyword arguments. Did you fix the ddof usage in combination with weights? If you are not faster, I think I will mail the list sometime this weekend or so about which keyword arguments are good to add.

@charris (Member) commented Apr 26, 2015

Frankly, I'd just have one weight and let the user set up the correct value, but that may be a bit much...

@ndawe (Author) commented Apr 26, 2015

Yes, I quite frankly think that people are overthinking this. That's my feeling based on the discussion above.

@ndawe (Author) commented Apr 26, 2015

In the meantime I've moved on with my private implementation, since it became clear that getting this into numpy would take more time than it's worth. It's a weighted covariance... not that hard. Give users the ability to specify one type of weights (or just two, with frequencies) and let them set up the weight values according to what they need. Done.

@seberg (Member) commented Apr 27, 2015

Maybe... I am fine with just one weight as long as the documentation is clear about what it is and what it is not, and possibly includes a link to somewhere explaining how you would get the other variants.

@charris (Member) commented Apr 27, 2015

Still thinking about this and researching the topic.

@charris (Member) commented Apr 27, 2015

I'm thinking of a combination of two weights, repeat and reliability, plus ddof. The weights can be combined, but not by simple multiplication, at least for the correction factor in @ndawe's formula above. The sum of squared weights needs to be replaced by `ddof * sum(n * r**2)`, where `n` are the repeat weights and `r` are the reliability weights. I'm not sure what ddof does in the reliability case, but `ddof=1` will give the unbiased estimate in all cases.
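A sketch of that combined correction factor, interpreting the formula above (hypothetical helper; `n` are the repeat weights and `r` the reliability weights):

```python
import numpy as np

def combined_fact(n, r, ddof=1):
    # Total per-observation weights are the elementwise product of
    # repeat and reliability weights.
    n = np.asarray(n, dtype=float)
    r = np.asarray(r, dtype=float)
    w = n * r
    # Normalization factor: sum(w) minus ddof * sum(n * r**2) / sum(w),
    # i.e. the "sum of squared weights" term replaced as described above.
    return w.sum() - ddof * np.sum(n * r ** 2) / w.sum()
```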

@charris (Member) commented May 14, 2015

Closed by #4960.

@charris closed this May 14, 2015