
support for weighted aggregates #226

Closed
gdementen opened this Issue Nov 17, 2016 · 8 comments

@gdementen (Member) commented Nov 17, 2016

for percentile:

  • argsort value
  • cumsum weight
  • searchsorted(cumsum, cumsum[-1] * percentile) # if percentile is between 0 and 1

http://stackoverflow.com/questions/20601872/numpy-or-scipy-to-calculate-weighted-median
https://pypi.python.org/pypi/weightedstats/0.2
https://github.com/nudomarinero/wquantiles
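The three steps above can be sketched as follows (a minimal, interpolation-free sketch; `wpercentile_naive` is a hypothetical name, not the final implementation):

```python
import numpy as np

def wpercentile_naive(values, weights, p):
    """Naive weighted percentile following the three steps above.

    p is a fraction between 0 and 1. No interpolation is done: the first
    value whose cumulative weight reaches p * total_weight is returned.
    """
    values = np.asarray(values)
    order = np.argsort(values)                                  # 1. argsort values
    cumw = np.cumsum(np.asarray(weights, dtype=float)[order])   # 2. cumsum weights
    idx = np.searchsorted(cumw, cumw[-1] * p)                   # 3. searchsorted
    return values[order][min(idx, len(values) - 1)]
```

For example, `wpercentile_naive([1, 2, 3], [1, 1, 4], 0.5)` returns 3, matching the unweighted median of `[1, 2, 3, 3, 3, 3]`.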

gdementen added this to the 0.12 milestone on Jan 24, 2018

@gdementen (Member, Author) commented Feb 9, 2018

wpercentile(a, w, p) = a value x for which: sum(w | a <= x) / sum(w) = p

@gdementen (Member, Author) commented Mar 8, 2018

Unweighted

According to Wikipedia (https://en.wikipedia.org/wiki/Percentile), for the unweighted case, there are three
variants of linear interpolation when the percentile(s) fall between two data points. In fact, according to the article about quantiles (https://en.wikipedia.org/wiki/Quantile), there are 9 different "usual" ways to compute quantiles, all of which are implemented by R.

But the percentile article only gives the generalization to weighted percentiles for the first "matlab" variant (C = 0.5). The other generalizations below are my own.

p is percentile (for each data point), x is rank, N is len(data)

variant1: C = 0.5 (matlab without weight, nudomarinero/wquantiles, gmisclib.weighted_percentile.wp)
unweighted: p = (x - 0.5) / N
weighted: p = (cum_sorted_weight - 0.5 * sorted_weights) / sum_weight

variant2: C = 1 (numpy/old excel)
unweighted: p = (x - 1) / (N - 1)
weighted: p = (cum_sorted_weight - sorted_weights) / (sum_weight - 1) ???

variant3: C = 0 (new excel)
unweighted: p = x / (N + 1)
weighted: p = cum_sorted_weight / (sum_weight + 1) ???
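A quick numeric sanity check of the three unweighted rank formulas (my own sketch; inverting the C = 1 ranks with linear interpolation reproduces np.percentile's default):

```python
import numpy as np

N = 5
x = np.arange(1, N + 1)            # 1-based ranks

p_c05 = (x - 0.5) / N              # variant 1: C = 0.5 (matlab)
p_c1 = (x - 1) / (N - 1)           # variant 2: C = 1 (numpy / old excel)
p_c0 = x / (N + 1)                 # variant 3: C = 0 (new excel)

data = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
# interpolating the C = 1 ranks reproduces np.percentile's 'linear' default
assert np.isclose(np.interp(0.25, p_c1, data), np.percentile(data, 25))
```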

R
https://www.rdocumentation.org/packages/stats/versions/3.4.3/topics/quantile
AFAICT, np.percentile(interpolation='linear') is equivalent to R's "type 7"
numpy/numpy#4990

scipy
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.scoreatpercentile.html#scipy.stats.scoreatpercentile
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.mquantiles.html

numpy
https://docs.scipy.org/doc/numpy-dev/reference/generated/numpy.percentile.html

matlab documentation on percentile
http://nl.mathworks.com/help/stats/prctile.html
Note that Matlab does NOT use midpoint (as claimed by several sources on stackoverflow) but linear interpolation

SAS documentation
https://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_univariate_sect028.htm

Weighted

generic explanation of weighted percentiles
https://blogs.sas.com/content/iml/2016/08/29/weighted-percentiles.html

SAS uses its own 5th variant (=R2) when weights are present (see above link)
It seems to be a mix between "higher" and "midpoint".

In the computational formulas that SAS uses for weighted percentiles, the weights are divided by the sum of the
weights. Therefore only relative weights are important, and the formulas simplify if you choose weights that sum
to 1.

statsmodels uses the same algorithm as SAS
https://github.com/statsmodels/statsmodels/pull/2707/files#diff-756dceb1c3b86a7ddc558834b3293a33R229

weighted quantile for R (hmisc)
https://www.rdocumentation.org/packages/Hmisc/versions/4.1-0/topics/wtd.stats
https://github.com/harrelfe/Hmisc/blob/master/R/wtd.stats.s#L46
note that reldist (which contains another wtd.quantile) uses hmisc

  • weights contain frequency counts that in effect expand x by these counts.
    weights can also be sampling weights, in which setting normwt to TRUE will often be appropriate.
    This results in making weights sum to the length of the non-missing elements in x.
    normwt=TRUE thus reflects the fact that the true sample size is the length of the x vector and not the sum of the
    original values of weights (which would be appropriate had normwt=FALSE).
  • normwt - specify normwt=TRUE to make weights sum to length(x) after deletion of NAs.
    If weights are frequency weights, then normwt should be FALSE, and if weights are normalization (aka reliability)
    weights, then normwt should be TRUE. In the case of the former, no check is made that weights are valid
    frequencies.
  • na.rm - set to FALSE to suppress checking for NAs

numpy issue for implementing weights for percentile
numpy/numpy#6326

numpy PR to implement weights for percentile. Very complex. Tries to support both
probability weights and frequency (int) weights.
numpy/numpy#9211

weighted version for matlab
https://nl.mathworks.com/matlabcentral/fileexchange/16920-returns-weighted-percentiles-of-a-sample
seems to handle probability weights, not fweights (it normalizes weights to len(w), and the example uses 0-1 weights)
uses (cumsum(sorted_w) - sorted_w) / (n - 1)
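With unit weights, this formula reduces to the unweighted variant 2 ranks (x - 1) / (N - 1), which can be checked directly (my own sanity check):

```python
import numpy as np

w = np.ones(5)
n = len(w)
p = (np.cumsum(w) - w) / (n - 1)          # fileexchange formula
x = np.arange(1, n + 1)                   # 1-based ranks
assert np.allclose(p, (x - 1) / (n - 1))  # variant 2 (C = 1) ranks
```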

another version for matlab
https://github.com/IoSR-Surrey/MatlabToolbox/blob/master/%2Biosr/%2Bstatistics/quantile.m

If WEIGHTS is specified (WEIGHTS should be the same size as X), this index
is mapped to the cumulative weights (the weights are scaled to sum to
N(i) - see below), and a new weighted index is returned (using linear
interpolation) for the point where the cumulative weights equal the
unweighted index. The weighted index is used to calculate the P(i)-th
quantile. If the values in WEIGHTS are equal, then the weighted and
unweighted index (and corresponding quantile) are identical. The
default method R-8 is used if METHOD is specified as an empty array
([]).

wSorted = wSorted * N / sum(wSorted)        # normalize weights to sum to N
k = cumsum(wSorted)                         # cumulative weights
huw = ((N - 1) * p) + 1                     # unweighted (1-based) index
hw = wfunc(huw, k)                          # weighted index (linear interpolation on k)
h = hw
fh = floor(h)
q = x[fh] + (h - fh) * (x[fh + 1] - x[fh])  # 1-based indexing, as in Matlab
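The pseudocode above can be translated to numpy roughly as follows (a sketch assuming positive weights and a type-7 style unweighted index for simplicity; the toolbox's actual default is R-8, and Matlab's 1-based indexing is shifted to 0-based):

```python
import numpy as np

def iosr_style_wquantile(x, w, p):
    """Weighted quantile in the style of the iosr pseudocode above (p in [0, 1])."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x)
    xs, ws = x[order], np.asarray(w, dtype=float)[order]
    N = len(xs)
    ws = ws * N / ws.sum()                   # normalize weights to sum to N
    k = np.cumsum(ws)                        # cumulative weights
    huw = (N - 1) * p + 1                    # unweighted 1-based index (type 7)
    # weighted index: point where the cumulative weights equal the unweighted index
    h = np.interp(huw, k, np.arange(1, N + 1))
    fh = min(max(int(np.floor(h)), 1), N - 1)  # clamp so fh and fh+1 stay valid
    return xs[fh - 1] + (h - fh) * (xs[fh] - xs[fh - 1])
```

With equal weights, `k` equals the 1-based ranks, so this reduces to np.percentile's default linear interpolation.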

JuliaStats/StatsBase distinguishes between frequency weights and non-frequency weights:

  • With fweights, the function returns the same result as quantile for a vector with repeated values.
  • With non-fweights, linear interpolation is used. In particular, when w is a vector of ones,
    the function returns the same result as quantile.
    JuliaStats/StatsBase.jl#316

pandas.quantile does not support weights, and the unweighted version
uses 'linear' by default (presumably via numpy underneath)

weightedcalcs (for Pandas) uses "midpoint" when cumw == q, and (apparently) "lower" when cumw > q
https://github.com/jsvine/weightedcalcs/blob/master/weightedcalcs/core.py

@gdementen (Member, Author) commented Mar 8, 2018

Here is some test code I threw together, but it fails.

def test_percentile(nruns=1000):
    """
    Examples
    --------
    >>> test_percentile(10)
    0
    """
    import matplotlib.pyplot as plt
    import numpy as np  # wpercentile is assumed to be defined in the enclosing module
    failed = 0
    for i in range(nruns):
        percentile = np.random.randint(101)
        # arr = np.array([0, 2, 3])
        arr = np.random.randint(100, size=8).astype(float)
        # arr = np.random.randint(10, size=np.random.randint(2, 5)).astype(float)
        # arr = np.random.randint(10, size=3).astype(float)
        # generate weights between 0 and 1
        # w = np.round(np.random.uniform(size=arr.size), 1)
        w = np.ones_like(arr)
        nonzero = w != 0
        if not nonzero.any():
            continue
        res = wpercentile(arr, w, q=percentile, weights_type='other')
        arr = arr[nonzero]
        w = w[nonzero]
        nw = w / w.sum()

        q = percentile / 100
        sort_inds = np.argsort(arr)
        a_sort = arr[sort_inds]
        w_sort = w[sort_inds]
        nw_sort = nw[sort_inds]
        cw = np.cumsum(nw_sort)
        cwmnw = cw - nw_sort
        plt.bar(cwmnw / cwmnw[-1], a_sort, width=0.0001, align='center')
        perc = np.arange(0, 101)
        plt.plot(perc / 100, np.percentile(arr, perc), color='red')
        # plt.plot(perc / 100, wpercentile(arr, w, perc, weights_type='other'), color='red')
        plt.show()

        # Since the check below assumes all values in arr/a_sort are different (because of searchsorted), we skip the
        # test if they are not.
        if len(np.unique(arr)) != len(arr):
            continue
        # Try to check the result is correct. Since the check fails, either the check is wrong or the code is wrong,
        # I am unsure.
        ind = np.searchsorted(a_sort, res)
        vind = a_sort[ind]
        nw_before = nw_sort[:ind].sum() if ind > 0 else 0
        nw_after = nw_sort[ind + 1:].sum() if ind < len(nw) - 1 else 0
        if (nw_before > q) or (nw_after > 1 - q):
            print("failed for", a_sort, nw_sort, q, "res:", res, "ind", ind, "vind", vind,
                  "nwbefore", nw_before, "nwafter", nw_after)
            failed += 1
    return failed
@gdementen (Member, Author) commented Mar 8, 2018

Here are some tests which do pass:

def test_percentile():
    """
    Examples
    --------
    >>> np.random.seed(0)
    >>> # for some reason, there one failure with this seed:
    >>> # np.random.seed(1234)
    >>> failed = 0
    >>> for i in range(10000):
    ...     q = np.random.randint(101)
    ...     arr = np.random.randint(10, size=np.random.randint(2, 1500)).astype(float)
    ...     # generate weights between 0 and 1
    ...     w = np.full(arr.size, float(np.random.randint(1, 100)) / 100)
    ...     wmin = wpercentile(arr, w, 0, weights_type='other')
    ...     wmax = wpercentile(arr, w, 100, weights_type='other')
    ...     assert wmin == np.min(arr), "%s %s: wmin (%f) != min (%f)" % (arr, w, wmin, np.min(arr))
    ...     assert wmax == np.max(arr), "%s %s: wmax (%f) != max (%f)" % (arr, w, wmax, np.max(arr))
    ...     nw = np.percentile(arr, q)
    ...     ww = wpercentile(arr, w, q, weights_type='other')
    ...     if not np.isclose(nw, ww):
    ...         print("failed for", q, arr, w, "nw:", nw, "with weight:", ww)
    ...         failed += 1
    >>> failed
    0
    >>> failed = 0
    >>> for i in range(10000):
    ...     q = np.random.randint(101)
    ...     arr = np.random.randint(10, size=np.random.randint(2, 15)).astype(float)
    ...     # generate weights between 1 and 19
    ...     w = np.random.randint(1, 20, size=arr.size)
    ...     nw = np.percentile(np.repeat(arr, w), q)
    ...     ww = wpercentile(arr, w, q, weights_type='freq')
    ...     if round(nw, 5) != round(ww, 5):
    ...         print("failed for", q, arr, w, "nw:", nw, "with weight:", ww)
    ...         failed += 1
    >>> failed
    0
    """
    pass
@gdementen (Member, Author) commented Mar 14, 2018

To compare results with R's Hmisc wtd.quantile:

    > library("Hmisc")
    > x = c(1, 2, 3, 4)
    > w = c(.9, .2, .3, .1)
    > q = c(0, .1, .4, .5, .6, .9, 1)
    > wtd.quantile(x, weights=w, probs=q, normwt=TRUE)
      0%  10%  40%  50%  60%  90% 100%
     1.0  1.0  1.4  2.0  2.6  3.7  4.0
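For reference, the Hmisc algorithm (type="quantile") can be approximated in numpy as below. This is my own reading of the R source, simplified to assume no duplicate values in x (Hmisc first collapses duplicates via wtd.table):

```python
import numpy as np

def wtd_quantile(x, weights, probs, normwt=True):
    """Sketch of Hmisc wtd.quantile(type="quantile") for distinct x values."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(weights, dtype=float)
    order = np.argsort(x)
    x, w = x[order], w[order]
    if normwt:
        w = w * len(x) / w.sum()          # make weights sum to len(x)
    cumw = np.cumsum(w)
    n = cumw[-1]
    pos = 1 + (n - 1) * np.asarray(probs, dtype=float)
    low = np.maximum(np.floor(pos), 1.0)
    high = np.minimum(low + 1, n)
    frac = pos % 1

    def step(target):
        # smallest x whose cumulative weight reaches target (clamped at the ends,
        # mirroring R's approx(..., method='constant', f=1, rule=2))
        idx = np.minimum(np.searchsorted(cumw, target, side='left'), len(x) - 1)
        return x[idx]

    return (1 - frac) * step(low) + frac * step(high)
```

On the example above, `wtd_quantile([1, 2, 3, 4], [.9, .2, .3, .1], [0, .1, .4, .5, .6, .9, 1])` reproduces the R output.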
@gdementen (Member, Author) commented Mar 14, 2018

Here is some code I used temporarily. It works well to replicate np.percentile(np.repeat(arr, w), q) for integral/frequency weights, but fails for other weights.

    # sorted_values and cum_sorted_weight are assumed to be computed beforehand
    # (argsort of the values, cumsum of the correspondingly sorted weights)
    n = np.sum(weights)
    # compute target cumulative weight for requested percentile(s)
    target_weight = q * (n - 1)
    # find indices which bound this cumweight
    idx_left = np.searchsorted(cum_sorted_weight, target_weight, side='right')
    idx_right = np.searchsorted(cum_sorted_weight, target_weight + 1, side='right')
    idx_right = np.minimum(idx_right, len(sorted_values) - 1)
    # where are we between the two bounds?
    frac = target_weight - np.floor(target_weight)
    # get two values to interpolate between
    v_left = sorted_values[idx_left]
    v_right = sorted_values[idx_right]
    return v_left + (v_right - v_left) * frac
@gdementen (Member, Author) commented Mar 14, 2018

Here is some other code that I used temporarily which replicates Hmisc wtd.quantile(normwt=TRUE, type="(i-1)/(n-1)") and np.percentile for constant weights.

    # a is the input values; cum_sorted_weight and sorted_values are assumed
    # to be computed beforehand, as in the previous snippet
    n = len(a)
    # normalize weights so that they sum to n (do NOT use an inplace op to not modify input)
    weights = weights * n / np.sum(weights)
    ...
    p = (cum_sorted_weight - 1) / (n - 1)
    assert np.isclose(p[-1], 1), "p[-1] (%f) != 1" % p[-1]
    return np.interp(q, p, sorted_values)

gdementen added a commit that referenced this issue Mar 14, 2018

implemented weighted percentile (see #226 for a lot of background on this)

Honestly, even with all the time I spent on this (a *lot* more
than I am ready to admit), I am still not 100% confident about this.
See the failing test on issue #226.

gdementen added a commit that referenced this issue Mar 14, 2018

improved weighted percentile implementation
(see #226 for a lot of background on this)

I am finally confident that this implementation is "reasonably good". It replicates the results of
Hmisc wtd.quantile type="quantile" for both normwt=TRUE and normwt=FALSE.
@gdementen (Member, Author) commented Mar 14, 2018

I finally settled on replicating Hmisc type="quantile" for both normwt=TRUE and normwt=FALSE

gdementen added further commits that referenced this issue on Mar 30, Apr 5 and Apr 6, 2018

gdementen closed this in f79df46 on Apr 6, 2018

gdementen added a commit that referenced this issue Apr 6, 2018
