# Enhancement to percentile function #2970

Closed
wants to merge 8 commits

### 6 participants

commented Feb 7, 2013

This PR adds the limit and interpolation parameters to the percentile function in NumPy. This harmonizes the features of NumPy's percentile function and SciPy's stats.scoreatpercentile function, which was briefly discussed on the PR introducing similar features into SciPy (scipy/scipy#374).

 jjhelmus Enhancement to percentile function. e5d0518
referenced this pull request in scipy/scipy Feb 7, 2013
Merged

### ENH: stats.scoreatpercentile #374

numpy/lib/function_base.py
```diff
@@ -2994,7 +2994,9 @@ def median(a, axis=None, out=None, overwrite_input=False):
         # and check, use out array.
         return mean(sorted[indexer], axis=axis, out=out)

-def percentile(a, q, axis=None, out=None, overwrite_input=False):
+
+def percentile(a, q, limit=(), interpolation_method='fraction', axis=None,
```
NumPy member njsmith added a note Feb 8, 2013: limit=None would be clearer I think.
commented on the diff Feb 8, 2013
numpy/lib/function_base.py
```diff
@@ -3006,6 +3008,16 @@ def percentile(a, q, axis=None, out=None, overwrite_input=False):
         Input array or object that can be converted to an array.
     q : float in range of [0,100] (or sequence of floats)
         Percentile to compute which must be between 0 and 100 inclusive.
+    limit : tuple, optional
+        Tuple of two scalars, the lower and upper limits within which to
+        compute the percentile.
```
NumPy member njsmith added a note Feb 8, 2013: I have read this docstring and I still have no idea what limit does...
and 1 other commented on an outdated diff Feb 8, 2013
numpy/lib/function_base.py
```diff
@@ -3006,6 +3008,16 @@ def percentile(a, q, axis=None, out=None, overwrite_input=False):
         Input array or object that can be converted to an array.
     q : float in range of [0,100] (or sequence of floats)
         Percentile to compute which must be between 0 and 100 inclusive.
+    limit : tuple, optional
+        Tuple of two scalars, the lower and upper limits within which to
+        compute the percentile.
+    interpolation : {'fraction', 'lower', 'higher'}, optional
+        This optional parameter specifies the interpolation method to use,
+        when the desired quantile lies between two data points `i` and `j`:
+            - fraction: `i + (j - i)*fraction`, where `fraction` is the
+              fractional part of the index surrounded by `i` and `j`.
```
NumPy member njsmith added a note Feb 8, 2013: I think a clearer name for this would be interpolation="linear" -- has scipy nailed this down already, though?

jjhelmus added a note Feb 8, 2013: The interpolation_method keyword and syntax was added to SciPy with commit scipy/scipy@9650e63 about a year ago; I think it would be acceptable to change this.
numpy/lib/function_base.py
```diff
@@ -3074,6 +3086,9 @@ def percentile(a, q, axis=None, out=None, overwrite_input=False):
     """
     a = np.asarray(a)

+    if limit:
+        a = a[(limit[0] <= a) & (a <= limit[1])]
+
```
NumPy member njsmith added a note Feb 8, 2013: This code does tell me what limit does, but I still have no intuitive sense of why this belongs in a percentile function. Can you give an example that shows why this is a natural thing people will want in their percentile function?
 jjhelmus Syntax changes to percentile function. a44a251
commented Feb 8, 2013

Added a test and additional documentation to explain the limit parameter. It filters a prior to calculating the percentile. A masked array does not accomplish the same thing:

```python
In [7]: x = np.arange(10)

In [8]: np.percentile(x, 50, limit=(2, 5))
Out[8]: 3.5

In [9]: np.percentile([2, 3, 4, 5], 50)
Out[9]: 3.5

In [10]: np.percentile(np.ma.masked_outside(x, 2, 5), 50)
Out[10]: 4.5
```

Also, is this the expected/desired behavior for a masked array?
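For reference, the filtering that limit performs matches the boolean-indexing line in the diff above; a small standalone sketch of that equivalence:

```python
import numpy as np

x = np.arange(10)
lo, hi = 2, 5

# limit=(lo, hi) drops values outside [lo, hi] before the percentile is
# computed -- equivalent to an explicit boolean filter:
filtered = x[(lo <= x) & (x <= hi)]   # array([2, 3, 4, 5])
result = np.percentile(filtered, 50)
print(result)                         # 3.5
```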

numpy/lib/function_base.py
```diff
@@ -3006,6 +3008,17 @@ def percentile(a, q, axis=None, out=None, overwrite_input=False):
         Input array or object that can be converted to an array.
     q : float in range of [0,100] (or sequence of floats)
         Percentile to compute which must be between 0 and 100 inclusive.
+    limit : tuple, optional
+        Tuple of two scalars, the lower and upper limits within which to
+        compute the percentile. Values outside of this range are omitted
+        from the percentile calculation. None includes all values in the
+        calculation.
+    interpolation : {'linear', 'lower', 'higher'}, optional
+        This optional parameter specifies the interpolation method to use,
+        when the desired quantile lies between two data points `i` and `j`:
+            - linear: `i + (j - i) * fraction`, where `fraction` is the
```
NumPy member seberg added a note Feb 10, 2013: Don't know sphinx well, but I think the list may need to be indented (i.e. with two spaces) and `fractional` probably aligned with `linear`. Also it may need a blank line before the list? Maybe compare with can_cast or such.
numpy/lib/function_base.py
```diff
@@ -3111,16 +3128,29 @@ def _compute_qth_percentile(sorted, q, axis, out):
     indexer = [slice(None)] * sorted.ndim
     Nx = sorted.shape[axis]
     index = q*(Nx-1)
+
+    if int(index) != index:
+        # round fractional indices according to interpolation method
+        if interpolation == 'lower':
+            index = np.floor(index)
+        elif interpolation == 'higher':
+            index = np.ceil(index)
+        elif interpolation == 'linear':
+            pass  # keep index as fraction and interpolate
+        else:
+            raise ValueError("interpolation_method can only be 'linear', "
```
NumPy member seberg added a note Feb 10, 2013: there is one underscore too many here.
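The rounding logic in the hunk above can be sketched as a self-contained helper (qth_value is a hypothetical name for illustration; the PR implements this inside _compute_qth_percentile):

```python
import numpy as np

def qth_value(sorted_a, q, interpolation='linear'):
    """Return the q-th percentile (q in [0, 100]) of a sorted 1-D array."""
    nx = sorted_a.shape[0]
    index = q / 100.0 * (nx - 1)
    if interpolation == 'lower':
        index = np.floor(index)
    elif interpolation == 'higher':
        index = np.ceil(index)
    elif interpolation != 'linear':
        raise ValueError("interpolation can only be 'linear', "
                         "'lower' or 'higher'")
    i = int(np.floor(index))
    frac = index - i      # 0 for 'lower'/'higher', else fractional part
    if frac == 0:
        return sorted_a[i]
    return sorted_a[i] * (1 - frac) + sorted_a[i + 1] * frac

print(qth_value(np.arange(10.0), 50))             # 4.5
print(qth_value(np.arange(10.0), 50, 'lower'))    # 4.0
print(qth_value(np.arange(10.0), 50, 'higher'))   # 5.0
```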
NumPy member
commented Feb 10, 2013

Looks good; just wondering if something like interpolation='mid_point' or such should be added, since that is what R and Matlab use. I also wonder why the function is not properly vectorized along q; it seems that would be just a matter of using fancy indexing instead of slices (I admit it forces a copy, but that should hardly matter).

To achieve that, reshaping the input q = np.asarray(q); q.shape += (1,) and then reducing along axis+q.ndim-1 (since you may add new ones) should do it I think, plus it works for any q not just 1-d ones.

I don't think one can expect this kind of function to work for masked arrays. That would require a canonical way of getting the (number of) valid entries for such an object.

added some commits Feb 15, 2013
 jjhelmus fixed percentile docs d627ba6
 jjhelmus added midpoint interpolation to percentile c55b583

@seberg I'm not seeing the benefit to vectorize along q. The indexer needs to be a list of slice objects so either we do a list comprehension over q, or we have a comprehension to build up the indexer. Can you expand on the method you had in mind?

NumPy member
commented Feb 15, 2013

I think this is separate from this anyway (even if you are interested, it may be better in a different PR). The names and that limit argument are more interesting to the discussion here :).

I did not check it thoroughly, but what I meant was that you do not need to use slices, but can also use a fancy index. The slice is just slice(n, n+1) or slice(n, n+2). That is (but for a copy) equivalent to np.array([n], dtype=np.intp) or np.array([n, n+1], dtype=np.intp). But if you write it using such an array, you should be able to then add more dimensions to it to support arbitrary input shapes (and what I wrote on q would be how to get there).
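The slice/fancy-index equivalence described above can be checked directly (a small sketch; the array and n are arbitrary):

```python
import numpy as np

a = np.arange(10) * 2          # any 1-D array
n = 4

# slice-based selection of the two neighbouring values:
by_slice = a[n:n + 2]

# fancy-index equivalent; this copies, but the index array can be given
# extra dimensions, which is what allows vectorizing over many q values:
by_fancy = a[np.array([n, n + 1], dtype=np.intp)]

print(by_slice, by_fancy)      # [ 8 10] [ 8 10]
```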

Yes, I'm seeing how it would be possible now. But I think that would be a different PR

NumPy member
commented Feb 15, 2013

Sorry, I screwed that up there, all of it... midpoint, if anything, would mean that the knots are at the midpoint; it is still linear interpolation. I think I got fooled by some (probably also wrong) memory about boxplots. But the definition as is is wrong. R has a lot of other options of course, but if anyone thinks they matter to numpy, then interpolation might be a bad name. I don't know this stuff well enough anyway...

(An aside on the vectorized version:
http://mail.scipy.org/pipermail/scipy-user/2012-December/033904.html
handles only 1-D or 2-D IIRC; it would need to be checked whether it's really faster if we have only a few q.
I am not working on this.)

NumPy member
commented Feb 16, 2013

I vectorized it (based on the state in this PR) at: https://gist.github.com/seberg/4966984

Do not want to spam with a graphic, but for negligible size of the other array (avoiding sorting cost), it is a bit more than a factor of two slower for a single value. For a sequence (list), the crossover point is at 3-4 items, after which it gets vastly faster. At least the old scipy 0.10.1's percentile is much faster than numpy's. The crossover with it + list comprehension is at about 7-8 items. So it does seem worth it, unless it is very typical to require only a single value.

to the vectorized solution (looks nice): statsmodels currently uses it mainly for 2 values of q (for interquartile range), but if we know there is an efficient version for a large number of q, then we will start to use it at other places.

 jjhelmus q-vectorized percentile function Based on seberg gist (https://gist.github.com/seberg/4966984) 83e1524

Adopted the vectorized version based on seberg's gist. Had to add a check for values in i greater than a.shape[0] - 1, which occur when 100 is in q.
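Roughly what goes wrong at q = 100, sketched here with np.clip in place of the PR's where-based clamp:

```python
import numpy as np

sorted_a = np.arange(10.0)
nx = sorted_a.shape[0]

q = 1.0                              # q == 100 after scaling to [0, 1]
index = q * (nx - 1)                 # 9.0, the last valid position
i = np.intp(index) + np.arange(2)    # array([9, 10]); 10 is out of bounds
i = np.clip(i, 0, nx - 1)            # clamp back to array([9, 9])
print(sorted_a[i])                   # [9. 9.]
```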

NumPy member
commented Apr 3, 2013

ping travisbot.

@seberg, do you think this is ready?

closed this Apr 3, 2013
reopened this Apr 3, 2013
NumPy member
commented Apr 3, 2013

Oh, forgot about this... I would say pretty much yes (but did not go over it again carefully right now).

What bothers me a bit is that limit argument, but I am fine with adding it for consistency with scipy (I don't quite see the use, to be honest). In its current form it does however not support vectorization of the haystack (axis != None), and it may have bugs for axis=0. If that is caught (or checked that it gives an error already), I am happy though.
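Why limit clashes with axis: boolean filtering flattens the array, so per-row element counts can differ and an axis-wise percentile is no longer well defined. A minimal illustration:

```python
import numpy as np

a = np.array([[1, 10],
              [2,  3]])
lo, hi = 2, 5

filtered = a[(lo <= a) & (a <= hi)]
print(filtered)        # [2 3] -- 1-D; the row structure is lost
print(filtered.ndim)   # 1
```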

NumPy member
commented Apr 4, 2013

Ah, one other thing. I guess this should probably be mentioned in the release notes, since the output changes from list to array?

commented on an outdated diff Apr 4, 2013
numpy/lib/function_base.py
```diff
             weights = array(1)
             sumval = 1.0
         else:
-            indexer[axis] = slice(i, i+2)
-            j = i + 1
-            weights = array([(j - index), (index - i)], float)
-            wshape = [1]*sorted.ndim
-            wshape[axis] = 2
-            weights.shape = wshape
-            sumval = weights.sum()
-
+            i = index.astype(np.intp) + arange(2)
+            indexer = (i, Ellipsis)
+            weights = index - i[..., ::-1]
+            weights[..., 0] *= -1
+            weights.shape = weights.shape + (1,) * (sorted.ndim - 1)
+            sumval = weights.sum(i.ndim - 1)  # numerical accuracy reasons?
+            i[np.where(i > (len(sorted) - 1))] = len(sorted) - 1
```
NumPy member seberg added a note Apr 4, 2013: Just style, there is no need for the np.where here, the indexing will do that already internally.
was assigned Apr 7, 2013
NumPy member
commented Apr 7, 2013

:(, there is a small issue with the vectorization. I think this is the right way to vectorize it, but it is not compatible with some old cases:

```python
In [2]: np.percentile([[1,2], [3,4]], [1,2,3,4], axis=0)
Out[2]:
[array([ 1.02,  2.02]),
 array([ 1.04,  2.04]),
 array([ 1.06,  2.06]),
 array([ 1.08,  2.08])]
```

while the vectorized version would put the haystack vectorization first (which feels more correct to me). For starters I guess one could swap the order to keep better compatibility, but I wonder if it's worth it to try to change it for the future. (I doubt many even vectorize the haystack and needle at the same time right now, but just switching the axes order silently is certainly wrong.)

NumPy member
commented Apr 8, 2013

Nvm that, I guess the vectorization is compatible, even if I am not sure if maybe we should try to switch it around in the long run.

added some commits Apr 16, 2013
 jjhelmus removed extraneous np.where function 152fba2
 jjhelmus added release note about array output of np.percentile fc3dd2e

I removed the extra np.where and documented the output type change in the release docs. I'm not sure what should be done with the haystack and needle vectorization.

NumPy member
commented Apr 16, 2013

Not sure either. I think the current way is compatible, but it feels wrong to me; I certainly would expect the haystack vectorization to come first. I doubt it really affects many (since it only applies if both are vectorized), but maybe we should have a transition period, possibly with a temporary keyword argument like histogram? If you still have patience, I think this should go to the mailing list.

Sent an email to the numpy list asking for input on this PR

NumPy member
commented May 3, 2013

@jjhelmus @seberg I didn't catch the email. What is the status of this?

NumPy member
commented May 3, 2013

To be honest, not sure. I guess we could just decide on something; more or less, @josef-pkt was the only one who had an opinion. He also suggested the plausible thing to insert the quantile's dimensions at the axis. Isn't there any function that already has such vectorization? np.searchsorted unfortunately doesn't ;).

commented May 7, 2013

To summarize the discussion on the email list with @josef-pkt and @seberg the question is what should be the shape of the output of the following:

```python
a = np.arange(24).reshape(2, 3, 4)
q = np.array([15., 30., 45., 60., 75.])
np.percentile(a, q, axis=-1)
```

a, the haystack, has a shape of (2, 3, 4), q, the needle, has a shape of (5,).

The options are:

- (2, 3, 5): haystack (a) dimensions first.
- (5, 2, 3): needle (q) dimension first.
- Insert the q dimension at axis. This option was not liked.

The haystack dimension first version seems to be the most logical, and has the function acting like a reduceat, which might be considered the most "numpythonic".

The needle dimension first version is what is currently implemented. It allows for unrolling the percentiles, @josef-pkt liked this one the best.

I can see both ways, and if we can find a function that has similar vectorization I say we go with that.

commented May 7, 2013

The closest function I could find was np.take. It uses the third option, the indices are inserted at the axis dimension. Is it reasonable to think of np.percentile as performing "percentile" indexing along the given dimension?
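For comparison, np.take's shape behavior (the index array's dimensions replace the indexed axis, i.e. they are inserted at axis):

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)
idx = np.array([0, 1, 3])          # shape (3,)

# np.take replaces the indexed axis with the shape of the index array:
out = np.take(a, idx, axis=-1)
print(out.shape)                   # (2, 3, 3)
```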

NumPy member
commented May 7, 2013

That is true, that makes sense, somehow missed that example. We may need a kwarg to be able to warn about a default change and allow the old method of q's dimension coming first?

NumPy member
commented May 8, 2013

Yeah, but if axis=-1, inserting at the axis does too. Also I guess it simply isn't vectorized like generalized ufuncs (normal broadcasting), since, like in np.take, there is no broadcasting going on and instead it is "applied along the given axis". The new np.linalg.solve does something a little like it, but it is probably too complex anyway:

```python
def percentile(arr, q):
    # axis = -1 of arr is eaten away; the other dimensions can broadcast.
    # The core of the ufunc (the result) are those dimensions that were
    # not broadcast.
    ...
```
I.e. the gufunc signature would be (Ellipsis1, N), (Ellipsis1, Ellipsis) -> (Ellipsis), where Ellipsis1 could not expand the first operand (but it can expand the second operand, in which case core_shape = ()).

 jjhelmus np.percentile now inserts new axis at axis parameter 1607c2f

Percentile is now inserting the q dimension at axis. This does change the behaviour of percentile rather dramatically, namely an array is always returned.

One question is whether we should support q arrays with two or more dimensions. As written they are not supported, but no error is raised if they are passed. It might be possible to add support; weights_shape would need to be set to the shape of a with the shape of q replacing the axis dimension.

The percentile function is related to my attempt at getting a partitioning/selection function into NumPy, see
http://mail.scipy.org/pipermail/numpy-discussion/2013-May/066645.html
the code is here
https://github.com/juliantaylor/numpy/tree/select-median

percentile is much faster using iterative partitions than sorting once, even if you want many percentiles.

```python
prev = 0
r = []
for p in percentiles:
    idx = p - prev
    # np.partition returns a partitioned copy; ndarray.partition works
    # in place and returns None, so it cannot be chained directly.
    part = np.partition(data[prev:], idx)
    r.append(part[idx])
    prev = idx + 1
```

If the functionality is deemed worthwhile, this branch should probably be adapted and merged after the partition merge.

referenced this pull request in statsmodels/statsmodels May 31, 2013
Open

### add interquartile range, or scoreatpercentiles to statsmodels #596

I have a version of percentile which works with partition, https://github.com/jjhelmus/numpy/tree/percentile_with_partition. The version is messy (treat it like a private repo, I'll be rebasing), I'll clean it up when the partition PR is accepted.

referenced this pull request Aug 29, 2013
Merged

### ENH: percentile function with additional parameters and vecorization #3658

I rewrote the percentile function to use the new percentile functionality in PR #3658. It was easier to start from a clean fork than working off this branch.

NumPy member
commented Aug 29, 2013

@jjhelmus This should be closed then?

@charris Yes, I'll close this PR. New discussion should begin in PR #3658.

closed this Aug 29, 2013
referenced this pull request Sep 23, 2013
Merged