Enhancement to percentile function. #2970

Closed
wants to merge 8 commits into from

6 participants

@jjhelmus
jjhelmus commented Feb 7, 2013

This PR adds the limit and interpolation parameters to the percentile function in NumPy. This harmonizes the features of NumPy's percentile function and SciPy's stats.scoreatpercentile function, which was briefly discussed on the PR that introduced similar features into SciPy (scipy/scipy#374).

@jjhelmus jjhelmus referenced this pull request in scipy/scipy Feb 7, 2013
Merged

ENH: stats.scoreatpercentile #374

@njsmith njsmith commented on an outdated diff Feb 8, 2013
numpy/lib/function_base.py
@@ -2994,7 +2994,9 @@ def median(a, axis=None, out=None, overwrite_input=False):
# and check, use out array.
return mean(sorted[indexer], axis=axis, out=out)
-def percentile(a, q, axis=None, out=None, overwrite_input=False):
+
+def percentile(a, q, limit=(), interpolation_method='fraction', axis=None,
@njsmith
NumPy member
njsmith added a note Feb 8, 2013

limit=None would be clearer I think.

@njsmith njsmith commented on the diff Feb 8, 2013
numpy/lib/function_base.py
@@ -3006,6 +3008,16 @@ def percentile(a, q, axis=None, out=None, overwrite_input=False):
Input array or object that can be converted to an array.
q : float in range of [0,100] (or sequence of floats)
Percentile to compute which must be between 0 and 100 inclusive.
+ limit : tuple, optional
+ Tuple of two scalars, the lower and upper limits within which to
+ compute the percentile.
@njsmith
NumPy member
njsmith added a note Feb 8, 2013

I have read this docstring and I still have no idea what limit does...

@njsmith njsmith and 1 other commented on an outdated diff Feb 8, 2013
numpy/lib/function_base.py
@@ -3006,6 +3008,16 @@ def percentile(a, q, axis=None, out=None, overwrite_input=False):
Input array or object that can be converted to an array.
q : float in range of [0,100] (or sequence of floats)
Percentile to compute which must be between 0 and 100 inclusive.
+ limit : tuple, optional
+ Tuple of two scalars, the lower and upper limits within which to
+ compute the percentile.
+ interpolation : {'fraction', 'lower', 'higher'}, optional
+ This optional parameter specifies the interpolation method to use,
+ when the desired quantile lies between two data points `i` and `j`:
+ - fraction: `i + (j - i)*fraction`, where `fraction` is the
+ fractional part of the index surrounded by `i` and `j`.
@njsmith
NumPy member
njsmith added a note Feb 8, 2013

I think a clearer name for this would be interpolation="linear" -- has scipy nailed this down already, though?

@jjhelmus
jjhelmus added a note Feb 8, 2013

The interpolation_method keyword and syntax was added to SciPy with commit scipy/scipy@9650e63 about a year ago, I think it would be acceptable to change this.
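For reference, a minimal hand-rolled sketch of how the three proposed interpolation choices behave on a small array (my illustration, not the PR's code; the names follow the proposed keyword values):

```python
import numpy as np

# Hand-rolled illustration of the three interpolation choices being
# discussed; names follow the proposed keyword values, not NumPy's API.
a = np.array([1.0, 2.0, 3.0, 4.0])
q = 40.0                                  # the 40th percentile
index = q / 100.0 * (len(a) - 1)          # fractional index 1.2

lower = a[int(np.floor(index))]           # 'lower'  -> 2.0
higher = a[int(np.ceil(index))]           # 'higher' -> 3.0
frac = index - np.floor(index)            # fractional part, 0.2
linear = lower + (higher - lower) * frac  # 'linear' -> 2.2
```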

@njsmith njsmith commented on an outdated diff Feb 8, 2013
numpy/lib/function_base.py
@@ -3074,6 +3086,9 @@ def percentile(a, q, axis=None, out=None, overwrite_input=False):
"""
a = np.asarray(a)
+ if limit:
+ a = a[(limit[0] <= a) & (a <= limit[1])]
+
@njsmith
NumPy member
njsmith added a note Feb 8, 2013

This code does tell me what limit does, but I still have no intuitive sense of why this belongs in a percentile function. Can you give an example that shows why this is a natural thing people will want in their percentile function?

@jjhelmus
jjhelmus commented Feb 8, 2013

Added a test and additional documentation to explain the limit parameter. It filters a prior to calculating the percentile. A masked array does not accomplish the same thing:

In [7]: x = np.arange(10)

In [8]: np.percentile(x, 50, limit=(2, 5))
Out[8]: 3.5

In [9]: np.percentile([2, 3, 4, 5], 50)
Out[9]: 3.5

In [10]: np.percentile(np.ma.masked_outside(x, 2, 5), 50)
Out[10]: 4.5

Also, is this the expected/desired behavior for a masked array?
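A sketch of what limit=(lo, hi) amounts to, and the mask-aware way to get the matching answer (my illustration, not the PR's code):

```python
import numpy as np

# limit=(lo, hi) simply drops values outside [lo, hi] before computing the
# percentile.  The plain masked-array call differs because np.percentile is
# not mask-aware; compressing the mask first gives the matching answer.
x = np.arange(10)
lo, hi = 2, 5

filtered = x[(x >= lo) & (x <= hi)]        # array([2, 3, 4, 5])
via_filter = np.percentile(filtered, 50)   # 3.5, same as limit=(2, 5)

m = np.ma.masked_outside(x, lo, hi)
via_compressed = np.percentile(m.compressed(), 50)  # also 3.5
```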

@seberg seberg commented on an outdated diff Feb 10, 2013
numpy/lib/function_base.py
@@ -3006,6 +3008,17 @@ def percentile(a, q, axis=None, out=None, overwrite_input=False):
Input array or object that can be converted to an array.
q : float in range of [0,100] (or sequence of floats)
Percentile to compute which must be between 0 and 100 inclusive.
+ limit : tuple, optional
+ Tuple of two scalars, the lower and upper limits within which to
+ compute the percentile. Values outside of this range are ommitted from
+ the percentile calculation. None includes all values in calculation.
+ interpolation : {'linear', 'lower', 'higher'}, optional
+ This optional parameter specifies the interpolation method to use,
+ when the desired quantile lies between two data points `i` and `j`:
+ - linear: `i + (j - i) * fraction`, where `fraction` is the
@seberg
NumPy member
seberg added a note Feb 10, 2013

I don't know sphinx well, but I think the list may need to be indented (i.e. with two spaces), and the "fractional" continuation line probably aligned with "linear". It may also need a blank line before the list? Maybe compare with can_cast or such.

@seberg seberg commented on an outdated diff Feb 10, 2013
numpy/lib/function_base.py
@@ -3111,16 +3128,29 @@ def _compute_qth_percentile(sorted, q, axis, out):
indexer = [slice(None)] * sorted.ndim
Nx = sorted.shape[axis]
index = q*(Nx-1)
+
+ if int(index) != index:
+ # round fractional indices according to interpolation method
+ if interpolation == 'lower':
+ index = np.floor(index)
+ elif interpolation == 'higher':
+ index = np.ceil(index)
+ elif interpolation == 'linear':
+ pass # keep index as fraction and interpolate
+ else:
+ raise ValueError("interpolation_method can only be 'linear', "
@seberg
NumPy member
seberg added a note Feb 10, 2013

there is one underscore too many here.

@seberg
NumPy member
seberg commented Feb 10, 2013

Looks good. Just wondering if something like interpolation='mid_point' or such should be added, since that is what R and Matlab use? Also wondering a bit why the function is not properly vectorized along q; it seems that would just be a matter of using fancy indexing instead of slices (I admit it forces a copy, but that should hardly matter).

To achieve that, reshaping the input q = np.asarray(q); q.shape += (1,) and then reducing along axis+q.ndim-1 (since you may add new ones) should do it I think, plus it works for any q not just 1-d ones.

I don't think one can expect this kind of function to work for masked arrays. That would require a canonical way of getting the (number of) valid entries for such an object.
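A rough sketch of the vectorization idea above (percentile_vec is a hypothetical helper of mine, assuming 1-d data and linear interpolation; not the PR's implementation):

```python
import numpy as np

# Instead of slicing per scalar q, build integer index arrays for all q at
# once.  percentile_vec is a hypothetical helper, not NumPy's API.
def percentile_vec(a, q):
    a = np.sort(np.asarray(a, dtype=float))
    q = np.asarray(q, dtype=float) / 100.0
    index = q * (a.size - 1)             # fractional positions, any shape
    lo = np.floor(index).astype(np.intp)
    hi = np.minimum(lo + 1, a.size - 1)  # clip so q=100 stays in bounds
    frac = index - lo
    return a[lo] * (1 - frac) + a[hi] * frac

x = np.arange(10)
print(percentile_vec(x, [25, 50, 75]))   # same values as looping np.percentile
```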

@jjhelmus

@seberg I'm not seeing the benefit to vectorize along q. The indexer needs to be a list of slice objects so either we do a list comprehension over q, or we have a comprehension to build up the indexer. Can you expand on the method you had in mind?

@seberg
NumPy member
seberg commented Feb 15, 2013

I think this is separate from this anyway (even if you are interested, it may be better in a different PR). The names and that limit argument are more interesting to the discussion here :).

I did not check it thoroughly, but what I meant was that you do not need to use slices, but can also use a fancy index. The slice is just slice(n, n+1) or slice(n, n+2). That is (but for a copy) equivalent to np.array([n, n+1], dtype=np.intp) or np.array([n], dtype=np.intp). But if you write it using such an array, you should be able to then add more dimensions to it to support arbitrary input shapes (and what I wrote on q would be how to get there).
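A tiny illustration of the slice-vs-fancy-index equivalence described above:

```python
import numpy as np

# slice(n, n+2) and np.array([n, n+1]) pick the same two elements (the
# fancy index copies), but only the fancy index generalizes: extra
# dimensions can be added to vectorize over many n at once.
a = np.arange(10) * 10
n = 3

by_slice = a[slice(n, n + 2)]                      # view: [30, 40]
by_fancy = a[np.array([n, n + 1], dtype=np.intp)]  # copy, same values

# Vectorized over several n via broadcasting: shape (3, 1) + (2,) -> (3, 2)
ns = np.array([[1], [4], [7]], dtype=np.intp)
pairs = a[ns + np.arange(2)]                       # shape (3, 2)
```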

@jjhelmus

Yes, I'm seeing how it would be possible now. But I think that would be a different PR

@seberg
NumPy member
seberg commented Feb 15, 2013

Sorry, I screwed that up, all of it... midpoint, if anything, would mean that the knots are at the mid points; it is still linear interpolation. I think I was fooled by some (probably also wrong) memory about boxplots. But the definition as written is wrong. R has a lot of other options of course, but if anyone thinks they matter to numpy, then interpolation might be a bad name. I don't know this stuff well enough anyway...

@josef-pkt

(An aside on vectorization:
http://mail.scipy.org/pipermail/scipy-user/2012-December/033904.html
handles only 1d or 2d IIRC; it would need to be checked whether it's really faster when we have only a few q.
I am not working on this.)

@seberg
NumPy member
seberg commented Feb 16, 2013

I vectorized it (based on the state in this PR) at: https://gist.github.com/seberg/4966984

Do not want to spam with a graphic, but for negligible size of the other array (avoiding sorting cost), it is a bit more than a factor of two slower for a single value. For a sequence (list), the crossover point is at 3-4 items, after which it gets vastly faster. For what it's worth, the old scipy 0.10.1 percentile is much faster than numpy's; the crossover with it plus a list comprehension is at about 7-8 items. So it does seem worth it, unless requiring only a single value is very typical.

@josef-pkt

On the vectorized solution (looks nice): statsmodels currently uses percentile mainly for 2 values of q (for the interquartile range), but if we know there is an efficient version for a large number of q, then we will start to use it in other places.

@jjhelmus

Adopted the vectorized version based on seberg's version. Had to add a check for values in i > a.shape[0], which occur when 100 is in q.

@charris
NumPy member
charris commented Apr 3, 2013

ping travisbot.

@seberg, do you think this is ready?

@charris charris closed this Apr 3, 2013
@charris charris reopened this Apr 3, 2013
@seberg
NumPy member
seberg commented Apr 3, 2013

Oh, forgot about this... I would say pretty much yes (though I did not go over it again carefully just now).

What bothers me a bit is the limit argument, but I am fine with adding it for consistency with scipy (I don't quite see the use, to be honest). In its current form it does, however, not support vectorization of the haystack (axis != None), and it may be buggy for axis=0. If that is caught (or checked that it already gives an error), I am happy though.

@seberg
NumPy member
seberg commented Apr 4, 2013

Ah, one other thing. I guess this should probably be mentioned in the release notes, since the output changes from list to array?

@seberg seberg commented on an outdated diff Apr 4, 2013
numpy/lib/function_base.py
weights = array(1)
sumval = 1.0
else:
- indexer[axis] = slice(i, i+2)
- j = i + 1
- weights = array([(j - index), (index - i)],float)
- wshape = [1]*sorted.ndim
- wshape[axis] = 2
- weights.shape = wshape
- sumval = weights.sum()
-
+ i = index.astype(np.intp) + arange(2)
+ indexer = (i, Ellipsis)
+ weights = index - i[...,::-1]
+ weights[..., 0] *= -1
+ weights.shape = weights.shape + (1,) * (sorted.ndim - 1)
+ sumval = weights.sum(i.ndim-1) # numerical accuracy reasons?
+ i[np.where(i > (len(sorted) - 1))] = len(sorted) - 1
@seberg
NumPy member
seberg added a note Apr 4, 2013

Just style, there is no need for the np.where here, the indexing will do that already internally.
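The style point, spelled out (illustration only, not the PR's code):

```python
import numpy as np

# Boolean-mask assignment already does what the np.where wrapper does.
i = np.array([0, 5, 9, 12], dtype=np.intp)
limit = 9

j = i.copy()
j[np.where(j > limit)] = limit   # with the redundant np.where
k = i.copy()
k[k > limit] = limit             # plain boolean mask, same result
```

np.clip(i, None, limit) is a third equivalent.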

@charris charris was assigned Apr 7, 2013
@seberg
NumPy member
seberg commented Apr 7, 2013

:(, there is a small issue with the vectorization. I think this is the right way to vectorize it, but it is not compatible with some old cases:

In [2]: np.percentile([[1,2], [3,4]], [1,2,3,4], axis=0)
Out[2]: 
[array([ 1.02,  2.02]),
 array([ 1.04,  2.04]),
 array([ 1.06,  2.06]),
 array([ 1.08,  2.08])]

while the vectorized version would put the haystack vectorization first (which feels more correct to me). For starters I guess one could swap the order to keep better compatibility, but I wonder whether it's worth trying to change it for the future. (I doubt many even vectorize the haystack and needle at the same time right now, but just switching the axis order silently is certainly wrong.)
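The two candidate layouts can be reproduced with a per-q loop (np.moveaxis is used here only for illustration; this is not the PR's code):

```python
import numpy as np

# Needle-first layout (the old behavior shown above) vs. haystack-first.
a = np.array([[1, 2], [3, 4]])
q = [1, 2, 3, 4]

needle_first = np.array([np.percentile(a, p, axis=0) for p in q])  # (4, 2)
haystack_first = np.moveaxis(needle_first, 0, -1)                  # (2, 4)
```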

@seberg
NumPy member
seberg commented Apr 8, 2013

Nvm that, I guess the vectorization is compatible, even if I am not sure if maybe we should try to switch it around in the long run.

@jjhelmus

I removed the extra np.where and documented the output type change in the release docs. I'm not sure what should be done with the haystack and needle vectorization.

@seberg
NumPy member
seberg commented Apr 16, 2013

Not sure either. I think the current way is compatible, but it feels wrong to me; I would certainly expect the haystack vectorization to come first. I doubt it really affects many (since it only applies when both are vectorized), but maybe we should have a transition period, possibly with a temporary keyword argument like histogram's? If you still have patience, I think this should go to the mailing list.

@jjhelmus

Sent an email to the numpy list asking for input on this PR.

@charris
NumPy member
charris commented May 3, 2013

@jjhelmus @seberg I didn't catch the email. What is the status of this?

@seberg
NumPy member
seberg commented May 3, 2013

To be honest, not sure. I guess we could just decide on something; more or less, @josef-pkt was the only one who had an opinion. He also suggested the plausible option of inserting the quantile's dimensions at the axis. Isn't there any function that already has such vectorization? np.searchsorted unfortunately doesn't ;).

@jjhelmus
jjhelmus commented May 7, 2013

@charris

To summarize the discussion on the email list with @josef-pkt and @seberg the question is what should be the shape of the output of the following:

a = np.arange(24).reshape(2,3,4)
q = np.array([15., 30., 45., 60., 75.])
np.percentile(a, q, axis=-1)

a, the haystack, has a shape of (2, 3, 4); q, the needle, has a shape of (5,).

The options are:

- (2, 3, 5): haystack (a) dimensions first.
- (5, 2, 3): needle (q) dimension first.
- Insert the q dimension at axis. This option was not liked.

The haystack-dimensions-first version seems the most logical and has the function acting like a reduceat, which might be considered the most "numpythonic".

The needle-dimension-first version is what is currently implemented. It allows for unrolling the percentiles; @josef-pkt liked this one best.

I can see it both ways, and if we can find a function that has similar vectorization, I say we go with that.
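For concreteness, the shapes for these options can be reproduced with a per-q loop (np.moveaxis is only for the reshuffle; this is not the PR's code):

```python
import numpy as np

# Shapes for the options summarized above.
a = np.arange(24).reshape(2, 3, 4)
q = np.array([15., 30., 45., 60., 75.])

needle_first = np.array([np.percentile(a, p, axis=-1) for p in q])
haystack_first = np.moveaxis(needle_first, 0, -1)
# needle_first.shape   == (5, 2, 3)  -- needle (q) dimension first
# haystack_first.shape == (2, 3, 5)  -- haystack dimensions first; with
# axis=-1 this coincides with "insert q at axis".
```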

@jjhelmus
jjhelmus commented May 7, 2013

The closest function I could find was np.take. It uses the third option: the indices are inserted at the axis dimension. Is it reasonable to think of np.percentile as performing "percentile" indexing along the given dimension?
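np.take's convention, for comparison:

```python
import numpy as np

# np.take: the indices' shape replaces the axis dimension of the input.
a = np.arange(24).reshape(2, 3, 4)
idx = np.array([0, 3, 1, 2, 0])     # shape (5,)

out = np.take(a, idx, axis=-1)      # (2, 3, 4) -> (2, 3, 5)
```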

@seberg
NumPy member
seberg commented May 7, 2013

That is true, that makes sense; somehow I missed that example. We may need a kwarg to be able to warn about a default change and to allow the old method of q's dimension coming first?

@njsmith
NumPy member
@seberg
NumPy member
seberg commented May 8, 2013

Yeah, but if axis=-1, inserting at the axis does too. Also, I guess it simply isn't vectorized like generalized ufuncs (normal broadcasting), since, as in np.take, there is no broadcasting going on; instead it is "applied along the given axis". The new np.linalg.solve does something a little like it, but it is probably too complex anyway:

def percentile(arr, q):
    # axis = -1 of arr is eaten away, the other dimensions can broadcast
    broadcast_dims = arr.ndim - 1

    # hypothetical function, q is only broadcasted for its leading dimensions
    broadcast_shape = broadcast_shape(arr.shape[:-1], q.shape[:broadcast_dims])

    # The core of the ufunc (result) are those dimensions that were not broadcasted
    core_shape = q.shape[broadcast_dims:]

    result_shape = broadcast_shape + core_shape

I.e. the gufunc signature would be (Ellipsis1, N), (Ellipsis1, Ellipsis) -> (Ellipsis), where Ellipsis1 could not expand the first operand (but it can expand the second operand, in which case core_shape = ()).

@jjhelmus

Percentile now inserts the q dimension at axis. This changes the behaviour of percentile rather dramatically; namely, an array is always returned.

One question is whether we should support q arrays with two or more dimensions. As written they are not supported, but no error is raised if they are passed. It might be possible to add support; weights_shape would need to be set to the shape of a, with the shape of q replacing the axis dimension.

@juliantaylor

the percentile function is related to my attempt at getting a partitioning/selection function into numpy, see
http://mail.scipy.org/pipermail/numpy-discussion/2013-May/066645.html
the code is here
https://github.com/juliantaylor/numpy/tree/select-median

percentile is much faster using iterative partitions than sorting once, even if you want many percentiles.

prev = 0
r = []
for p in percentiles:  # sorted element indices of the wanted percentiles
  idx = p - prev
  data[prev:].partition(idx)  # in-place; ndarray.partition returns None
  r.append(data[p])
  prev = p + 1

If the functionality is deemed worthwhile, this branch should probably be adapted and merged after the partition merge.

@josef-pkt josef-pkt referenced this pull request in statsmodels/statsmodels May 31, 2013
Open

add interquartile range, or scoreatpercentiles to statsmodels #596

@jjhelmus

I have a version of percentile which works with partition: https://github.com/jjhelmus/numpy/tree/percentile_with_partition. The version is messy (treat it like a private repo; I'll be rebasing); I'll clean it up when the partition PR is accepted.

@jjhelmus

I rewrote the percentile function to use the new percentile functionality in PR #3658. It was easier to start from a clean fork than working off this branch.

@charris
NumPy member
charris commented Aug 29, 2013

@jjhelmus This should be closed then?

@jjhelmus

@charris Yes, I'll close this PR. New discussion should begin in PR #3658.

@jjhelmus jjhelmus closed this Aug 29, 2013