
# support for weighted aggregates #226

Closed · opened this issue Nov 17, 2016 · 8 comments
## Comments

### gdementen commented Nov 17, 2016

For percentile:

```
argsort value
cumsum weight
searchsorted(cumsum, cumsum[-1] * percentile)  # if percentile is between 0 and 1
```
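That sketch can be turned into a minimal runnable version (the function name is mine; note that without interpolation this behaves like a "lower"-style estimate):

```python
import numpy as np

def wpercentile_sketch(values, weights, p):
    """Naive weighted percentile following the argsort/cumsum/searchsorted
    sketch above; p must be between 0 and 1. No interpolation is done:
    this returns the first sorted value whose cumulative weight reaches
    a fraction p of the total weight."""
    order = np.argsort(values)
    sorted_values = values[order]
    cum_weight = np.cumsum(weights[order])
    idx = np.searchsorted(cum_weight, cum_weight[-1] * p)
    return sorted_values[min(idx, len(sorted_values) - 1)]

a = np.array([3.0, 1.0, 4.0, 2.0])
w = np.ones_like(a)
wpercentile_sketch(a, w, 0.5)   # -> 2.0 with unit weights
```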
added this to the 0.12 milestone Jan 24, 2018

### gdementen commented Feb 9, 2018

wpercentile(a, w, p) = a value x for which: sum(w | a <= x) / sum(w) = p

# Unweighted

According to Wikipedia (https://en.wikipedia.org/wiki/Percentile), for the unweighted case, there are three
variants of linear interpolation when the percentile(s) fall between two data points. In fact, according to the article about quantiles (https://en.wikipedia.org/wiki/Quantile), there are 9 different "usual" ways to compute quantiles, all of which are implemented by R.

But in the percentile article, the generalization to weighted percentiles is only given for the first "matlab" variant (C = 0.5). The other generalizations below are from me.

p is the percentile (for each data point), x is the rank, N is len(data)

variant 1: C = 0.5 (matlab without weight, nudomarinero/wquantiles, gmisclib.weighted_percentile.wp)
  unweighted: p = (x - 0.5) / N
  weighted:   p = (cum_sorted_weight - 0.5 * sorted_weights) / sum_weight

variant 2: C = 1 (numpy/old excel)
  unweighted: p = (x - 1) / (N - 1)
  weighted:   p = (cum_sorted_weight - sorted_weights) / (sum_weight - 1) ???

variant 3: C = 0 (new excel)
  unweighted: p = x / (N + 1)
  weighted:   p = cum_sorted_weight / (sum_weight + 1) ???
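All three unweighted variants are instances of the single formula p = (x - C) / (N + 1 - 2C) from the Wikipedia article. Here is a sketch (names are mine) of those positions, plus a variant-2 weighted percentile using the "???" generalization above, which at least reduces to numpy's linear interpolation for unit weights; the weighted case is otherwise unverified:

```python
import numpy as np

def positions(n, C):
    """Plotting position p for ranks x = 1..N: p = (x - C) / (N + 1 - 2C).
    C=0.5 gives variant 1 (matlab), C=1 variant 2 (numpy), C=0 variant 3."""
    x = np.arange(1, n + 1)
    return (x - C) / (n + 1 - 2 * C)

def wpercentile_v2(a, w, q):
    """Variant-2 weighted percentile using the guessed
    (cum_w - w) / (sum_w - 1) positions; q is in [0, 100]."""
    order = np.argsort(a)
    a, w = a[order], w[order]
    cum_w = np.cumsum(w)
    p = (cum_w - w) / (cum_w[-1] - 1)
    return np.interp(q / 100, p, a)
```

For unit weights, `wpercentile_v2` matches `np.percentile(a, q)` with the default linear interpolation.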

R
https://www.rdocumentation.org/packages/stats/versions/3.4.3/topics/quantile
AFAICT, np.percentile(interpolation='linear') is equivalent to R's "type 7"
numpy/numpy#4990

matlab documentation on percentile
http://nl.mathworks.com/help/stats/prctile.html
Note that Matlab does NOT use midpoint (as claimed by several sources on Stack Overflow) but linear interpolation

# Weighted

generic explanation of weighted percentiles
https://blogs.sas.com/content/iml/2016/08/29/weighted-percentiles.html

SAS uses its own 5th variant (=R2) when weights are present (see the link above).
It seems to be a mix between "higher" and "midpoint".

In the computational formulas that SAS uses for weighted percentiles, the weights are divided by the sum of the
weights. Therefore only relative weights are important, and the formulas simplify if you choose weights that sum
to 1.

statsmodels uses the same algorithm as SAS
https://github.com/statsmodels/statsmodels/pull/2707/files#diff-756dceb1c3b86a7ddc558834b3293a33R229

weighted quantile for R (hmisc)
https://www.rdocumentation.org/packages/Hmisc/versions/4.1-0/topics/wtd.stats
https://github.com/harrelfe/Hmisc/blob/master/R/wtd.stats.s#L46
note that reldist (which contains another wtd.quantile) uses hmisc

• weights - contain frequency counts that in effect expand x by these counts.
weights can also be sampling weights, in which case setting normwt to TRUE will often be appropriate.
This results in making weights sum to the length of the non-missing elements in x.
normwt=TRUE thus reflects the fact that the true sample size is the length of the x vector and not the sum of the
original values of weights (which would be appropriate had normwt=FALSE been used).
• normwt - specify normwt=TRUE to make weights sum to length(x) after deletion of NAs.
If weights are frequency weights, then normwt should be FALSE, and if weights are normalization (aka reliability)
weights, then normwt should be TRUE. In the case of the former, no check is made that weights are valid
frequencies.
• na.rm - set to FALSE to suppress checking for NAs

numpy issue for implementing weights for percentile
numpy/numpy#6326

numpy PR to implement weights for percentile. Very complex. Tries to support both
probability weights and frequency (int) weights.
numpy/numpy#9211

weighted version for matlab
https://nl.mathworks.com/matlabcentral/fileexchange/16920-returns-weighted-percentiles-of-a-sample
seems to handle probability weights, not fweights (it normalizes weights to len(w), and the example uses 0-1 weights)
uses (cumsum(sorted_w) - sorted_w) / (n - 1)

another version for matlab
https://github.com/IoSR-Surrey/MatlabToolbox/blob/master/%2Biosr/%2Bstatistics/quantile.m

If WEIGHTS is specified (WEIGHTS should be the same size as X), this index
is mapped to the cumulative weights (the weights are scaled to sum to
N(i) - see below), and a new weighted index is returned (using linear
interpolation) for the point where the cumulative weights equal the
unweighted index. The weighted index is used to calculate the P(i)-th
quantile. If the values in WEIGHTS are equal, then the weighted and
unweighted index (and corresponding quantile) are identical. The
default method R-8 is used if METHOD is specified as an empty array
([]).

```
wSorted = wSorted * N / sum(wSorted)   # normalize weights to sum to N
k = cumsum(wSorted)                    # cumulative weights
huw = ((N - 1) * p) + 1                # unweighted index
hw = wfunc(huw, k)                     # weighted index
h = hw
fh = floor(h)
q = x[fh] + (h - fh) * (x[fh + 1] - x[fh])
```
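A Python transcription of that pseudocode, as a sketch: wfunc is reconstructed here as plain linear interpolation of the cumulative weights, which is an assumption on my part (the toolbox uses its own mapping).

```python
import numpy as np

def iosr_style_wquantile(x, w, p):
    """Weighted quantile following the scheme above: scale weights to sum
    to N, then map the unweighted fractional index onto the cumulative
    weights. p is a probability in [0, 1]."""
    order = np.argsort(x)
    xs, ws = x[order], np.asarray(w, dtype=float)[order]
    n = len(xs)
    ws = ws * n / ws.sum()                       # normalize weights to sum to N
    k = np.cumsum(ws)                            # cumulative weights
    huw = (n - 1) * p + 1                        # unweighted (1-based) index
    hw = np.interp(huw, k, np.arange(1, n + 1))  # weighted index ("wfunc")
    fh = int(np.floor(hw))
    if fh >= n:                                  # clamp at the top
        return xs[-1]
    return xs[fh - 1] + (hw - fh) * (xs[fh] - xs[fh - 1])  # 1-based -> 0-based
```

With equal weights, the weighted index equals the unweighted index and this reduces to ordinary linear interpolation, matching `np.percentile`.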

JuliaStats/StatsBase distinguishes between frequency weights and non-frequency weights:

• With fweights, the function returns the same result as `quantile` for a vector with repeated values.
• With non-frequency weights, it uses linear interpolation. In particular, when `w` is a vector of ones,
the function returns the same result as `quantile`.
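The fweight behaviour is easy to pin down with numpy: a weighted quantile with frequency weights must match the plain quantile of the expanded array (the same `np.repeat` reference check is used in the tests further down):

```python
import numpy as np

# frequency weights: each value is counted w times, so a weighted
# percentile must equal the plain percentile of the expanded array
a = np.array([1.0, 2.0, 3.0])
w = np.array([2, 1, 3])
expanded = np.repeat(a, w)                # [1, 1, 2, 3, 3, 3]
reference = np.percentile(expanded, 50)   # -> 2.5
```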

pandas.quantile does not support weights, and the unweighted version
uses 'linear' interpolation by default (presumably via numpy underneath)

weightedcalcs (for Pandas) uses "midpoint" for cumw == q, and "lower"? if cumw > q
https://github.com/jsvine/weightedcalcs/blob/master/weightedcalcs/core.py

### gdementen commented Mar 8, 2018

Here is some test code I threw together, but it fails.

```python
def test_percentile(nruns=1000):
    """
    Examples
    --------
    >>> test_percentile(10)
    0
    """
    import matplotlib.pyplot as plt

    failed = 0
    for i in range(nruns):
        percentile = np.random.randint(101)
        # arr = np.array([0, 2, 3])
        arr = np.random.randint(100, size=8).astype(float)
        # arr = np.random.randint(10, size=np.random.randint(2, 5)).astype(float)
        # arr = np.random.randint(10, size=3).astype(float)
        # generate weights between 0 and 1
        # w = np.round(np.random.uniform(size=arr.size), 1)
        w = np.ones_like(arr)
        nonzero = w != 0
        if not nonzero.any():
            continue
        res = wpercentile(arr, w, q=percentile, weights_type='other')
        arr = arr[nonzero]
        w = w[nonzero]
        nw = w / w.sum()
        q = percentile / 100
        sort_inds = np.argsort(arr)
        a_sort = arr[sort_inds]
        w_sort = w[sort_inds]
        nw_sort = nw[sort_inds]
        cw = np.cumsum(nw_sort)
        cwmnw = cw - nw_sort
        plt.bar(cwmnw / cwmnw[-1], a_sort, width=0.0001, align='center')
        perc = np.arange(0, 101)
        plt.plot(perc / 100, np.percentile(arr, perc), color='red')
        # plt.plot(perc / 100, wpercentile(arr, w, perc, weights_type='other'), color='red')
        plt.show()
        # Since the check below assumes all values in arr/a_sort are different
        # (because of searchsorted), we skip the test if they are not.
        if len(np.unique(arr)) != len(arr):
            continue
        # Try to check the result is correct. Since the check fails, either
        # the check is wrong or the code is wrong, I am unsure.
        ind = np.searchsorted(a_sort, res)
        vind = a_sort[ind]
        nw_before = nw_sort[:ind].sum() if ind > 0 else 0
        nw_after = nw_sort[ind + 1:].sum() if ind < len(nw) - 1 else 0
        if (nw_before > q) or (nw_after > 1 - q):
            print("failed for", a_sort, nw_sort, q, "res:", res, "ind", ind,
                  "vind", vind, "nwbefore", nw_before, "nwafter", nw_after)
            failed += 1
    return failed
```

### gdementen commented Mar 8, 2018 • edited

Here are some tests which do pass:

```python
def test_percentile():
    """
    Examples
    --------
    >>> np.random.seed(0)
    >>> # for some reason, there is one failure with this seed:
    >>> # np.random.seed(1234)
    >>> failed = 0
    >>> for i in range(10000):
    ...     q = np.random.randint(101)
    ...     arr = np.random.randint(10, size=np.random.randint(2, 1500)).astype(float)
    ...     # generate weights between 0 and 1
    ...     w = np.full(arr.size, float(np.random.randint(1, 100)) / 100)
    ...     wmin = wpercentile(arr, w, 0, weights_type='other')
    ...     wmax = wpercentile(arr, w, 100, weights_type='other')
    ...     assert wmin == np.min(arr), "%s %s: wmin (%f) != min (%f)" % (arr, w, wmin, np.min(arr))
    ...     assert wmax == np.max(arr), "%s %s: wmax (%f) != max (%f)" % (arr, w, wmax, np.max(arr))
    ...     nw = np.percentile(arr, q)
    ...     ww = wpercentile(arr, w, q, weights_type='other')
    ...     if not np.isclose(nw, ww):
    ...         print("failed for", q, arr, w, "nw:", nw, "with weight:", ww)
    ...         failed += 1
    >>> failed
    0
    >>> failed = 0
    >>> for i in range(10000):
    ...     q = np.random.randint(101)
    ...     arr = np.random.randint(10, size=np.random.randint(2, 15)).astype(float)
    ...     # generate weights between 1 and 19
    ...     w = np.random.randint(1, 20, size=arr.size)
    ...     nw = np.percentile(np.repeat(arr, w), q)
    ...     ww = wpercentile(arr, w, q, weights_type='freq')
    ...     if round(nw, 5) != round(ww, 5):
    ...         print("failed for", q, arr, w, "nw:", nw, "with weight:", ww)
    ...         failed += 1
    >>> failed
    0
    """
    pass
```

### gdementen commented Mar 14, 2018

To compare results with R's Hmisc wtd.quantile:

```
> library("Hmisc")
> x = c(1, 2, 3, 4)
> w = c(.9, .2, .3, .1)
> q = c(0, .1, .4, .5, .6, .9, 1)
> wtd.quantile(x, weights=w, probs=q, normwt=TRUE)
  0%  10%  40%  50%  60%  90% 100%
 1.0  1.0  1.4  2.0  2.6  3.7  4.0
```
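For reference, here is a Python reconstruction of Hmisc's `wtd.quantile(type="quantile")` based on its R source (my transcription, not the implementation committed here; the step interpolation mirrors R's `approx(method="constant", f=1)`). It reproduces the R output above:

```python
import numpy as np

def hmisc_wtd_quantile(x, weights, probs, normwt=True):
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if normwt:
        weights = weights * len(x) / weights.sum()  # normwt=TRUE: sum to len(x)
    # aggregate weights of duplicate values (R's wtd.table)
    ux = np.unique(x)
    wts = np.array([weights[x == v].sum() for v in ux])
    cum_w = np.cumsum(wts)
    n = wts.sum()
    order = 1 + (n - 1) * np.atleast_1d(probs)      # fractional order statistics
    low = np.clip(np.floor(order), 1, n)
    high = np.minimum(low + 1, n)
    frac = order % 1

    # right-continuous step function, like approx(..., method='constant', f=1)
    def step(target):
        idx = np.clip(np.searchsorted(cum_w, target, side='left'), 0, len(ux) - 1)
        return ux[idx]

    return (1 - frac) * step(low) + frac * step(high)

hmisc_wtd_quantile([1, 2, 3, 4], [.9, .2, .3, .1], [0, .1, .4, .5, .6, .9, 1])
# -> [1.0, 1.0, 1.4, 2.0, 2.6, 3.7, 4.0], matching the R session above
```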

### gdementen commented Mar 14, 2018 • edited

Here is some code I used temporarily. It works well to replicate np.percentile(np.repeat(arr, w), q) for integral/frequency weights, but fails for other weights.

```python
n = np.sum(weights)
# compute target cumulative weight for requested percentile(s)
target_weight = q * (n - 1)
# find indices which bound this cumweight
idx_left = np.searchsorted(cum_sorted_weight, target_weight, side='right')
idx_right = np.searchsorted(cum_sorted_weight, target_weight + 1, side='right')
idx_right = np.minimum(idx_right, len(sorted_values) - 1)
# where are we between the two bounds?
frac = target_weight - np.floor(target_weight)
# get two values to interpolate between
v_left = sorted_values[idx_left]
v_right = sorted_values[idx_right]
return v_left + (v_right - v_left) * frac
```

### gdementen commented Mar 14, 2018

Here is some other code that I used temporarily which replicates Hmisc wtd.quantile(normwt=TRUE, type="(i-1)/(n-1)") and np.percentile for constant weights.

```python
n = len(a)
# normalize weights so that they sum to n (do NOT use an in-place op, to not modify the input)
weights = weights * n / np.sum(weights)
...
p = (cum_sorted_weight - 1) / (n - 1)
assert np.isclose(p[-1], 1), "p[-1] (%f) != 1" % p[-1]
return np.interp(q, p, sorted_values)
```
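Filling in the elided sorting steps, a complete sketch of that approach might look as follows (the sorting part is my reconstruction of what the `...` omits; verified only against np.percentile for constant weights):

```python
import numpy as np

def wpercentile_interp(a, weights, q):
    """(i-1)/(n-1)-style weighted percentile; q is a probability in [0, 1].
    Matches np.percentile(a, q * 100) for constant weights."""
    a = np.asarray(a, dtype=float)
    n = len(a)
    # normalize weights so that they sum to n (not in place, to keep the input intact)
    weights = np.asarray(weights, dtype=float) * n / np.sum(weights)
    order = np.argsort(a)
    sorted_values = a[order]
    cum_sorted_weight = np.cumsum(weights[order])
    p = (cum_sorted_weight - 1) / (n - 1)
    assert np.isclose(p[-1], 1), "p[-1] (%f) != 1" % p[-1]
    return np.interp(q, p, sorted_values)
```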

added a commit that referenced this issue Mar 14, 2018
``` implemented weighted percentile (see #226 for a lot of background on this) ```
``` 2f17120 ```
```
Honestly, even with all the time I spent on this (a *lot* more
than I am ready to admit), I am still not 100% confident about this.
See the failing test on issue #226.
```
added a commit that referenced this issue Mar 14, 2018
``` improved weighted percentile implementation (see #226 for a lot of background on this) ```
``` d0a5c12 ```
```
I am finally confident that this implementation is "reasonably good". It replicates the results of
Hmisc wtd.quantile type="quantile" for both normwt=TRUE and normwt=FALSE.
```

### gdementen commented Mar 14, 2018

I finally settled on replicating Hmisc type="quantile" for both normwt=TRUE and normwt=FALSE

added a commit that referenced this issue Mar 30, 2018
``` implemented weighted gini (#226) ```
``` e0a5626 ```
added a commit that referenced this issue Mar 30, 2018
``` implemented weighted median (#226) ```
``` 4fc3fb9 ```
added a commit that referenced this issue Mar 30, 2018
``` implemented weighted std (#226) ```
``` 678451b ```
added a commit that referenced this issue Apr 5, 2018
``` implemented weighted avg (#226) ```
``` ae2e074 ```
added a commit that referenced this issue Apr 6, 2018
``` implemented weights for link aggregates (#226) ```
``` a4008e8 ```
closed this as completed in ``` f79df46 ``` Apr 6, 2018
added a commit that referenced this issue Apr 6, 2018
``` last minute addition: implemented weights for count (#226) ```
``` 3a9f7d6 ```
added a commit that referenced this issue Apr 16, 2018
``` fixed avg(scalar) (regression from #226) ```
``` 33fe02d ```