
ENH: added functionality nanmedian to numpy #4307

Merged: 1 commit merged into numpy:master from dfreese:feature/nanmedian, May 6, 2014

Conversation

@dfreese commented Feb 17, 2014

Implemented a nanmedian and associated tests as an
extension of np.median to complement the other
nanfunctions

Added negative values to the unit tests

Cleaned up documentation of nanmedian

@jaimefrio (Member)

If you take a look at the code of masked arrays' median, there are two very gross inefficiencies in there:

  1. It uses sorting, not selection. This means it is O(n log n) instead of O(n).
  2. It uses apply_along_axis for multidimensional cases, which is not any faster than a Python loop.

Ideally you would want to correct both situations, although I don't think there is an entirely satisfactory solution. A couple of ideas:

  1. If you stick to sorting, you can get rid of apply_along_axis. The basic insight is that NaNs get sorted to the end of an array. So if you count the NaNs in each 1D array over which you have to compute the median, you can figure out by how much you have to shift the position of the items you have to average to find the median. This shift does not have to be the same for all 1D arrays, so you won't be able to simply take a slice of the sorted array, but I think it can be done with fancy indexing (see the sketch below).
  2. If you prefer to use selection, then it is a pity that np.partition won't broadcast the indices, so you are basically going to be stuck with something similar to using apply_along_axis. A possible improvement would be, instead of iterating over all the 1D arrays, to iterate over all groups of 1D arrays that have the same number of NaNs. It would be more complicated, but for the typical case, where only a few of the 1D arrays have NaNs and there aren't many of them, it should make things much faster.

I think that 2 would be a better option than 1, although probably harder to code.
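
A minimal sketch of idea 1 (sorting plus per-slice NaN counts and fancy indexing); the 2-D shape, the axis-0 reduction, and all names here are illustrative assumptions, not code from this PR:

import numpy as np

def nanmedian_sorted(a):
    """Median over axis 0 of a 2-D float array, ignoring NaNs (sketch)."""
    s = np.sort(a, axis=0)                    # NaNs sort to the end of each column
    n = a.shape[0] - np.isnan(a).sum(axis=0)  # count of valid entries per column
    cols = np.arange(a.shape[1])
    lo = (n - 1) // 2                         # fancy indexing: per-column positions
    hi = n // 2                               # of the middle valid element(s)
    return 0.5 * (s[lo, cols] + s[hi, cols])

a = np.array([[1.0, np.nan], [3.0, 2.0], [np.nan, 4.0]])
print(nanmedian_sorted(a))                    # [2. 3.]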

@juliantaylor (Contributor)

The rationale for not supporting broadcasting, besides being too lazy to think about how that would work, is that partition is normally a very expensive operation, so the cost of an explicit loop is usually amortized quickly.

Instead of using the ma module, I would suggest using apply_along_axis directly on the 1d arrays, then moving the nans to the end using fancy indexing and partitioning the rest. This should be quite fast, as moving the nans within the axis slices should improve CPU cache utilization. See scipy/scipy#3335 for an existing implementation that can still be improved by moving nans.

One could also add a cutoff for when the looping cost gets high, beyond which one simply sorts the array and shifts the indices, but I don't think it's necessary for the first version.

@dfreese (Author) commented Feb 23, 2014

I implemented the formulation by @empeeu in #4287, which follows the method you mentioned, @juliantaylor. The implementation of the out functionality seems a little rough, but I wasn't sure of the best way to handle that with apply_along_axis.

@jaimefrio (Member)

I think there should be a check for the dtype at the beginning of the function that simply calls median if the type cannot have NaNs in it: no point calling isnan on an array of integers.
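
A hedged sketch of that shortcut (the signature and surrounding function are assumptions; np.issubdtype and np.inexact are the relevant NumPy APIs):

import numpy as np

def nanmedian(a, axis=None, out=None, overwrite_input=False):
    a = np.asanyarray(a)
    if not np.issubdtype(a.dtype, np.inexact):
        # integer and boolean arrays cannot hold NaN: plain median is exact
        return np.median(a, axis=axis, out=out, overwrite_input=overwrite_input)
    ...  # NaN-aware path continues here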

And I may be wrong, but I think that even a nan-function should be optimized for no nans in the input, as that is the most likely situation. I have made some timings on master, after Julian's shortcut to speed up partition for index -1:

>>> a = np.random.rand(1000)
>>> min(repeat('np.partition(a, -1)', 'from __main__ import a, np', repeat=100000, number=1))
7.3894142929020745e-06
>>> min(repeat('np.sum(np.isnan(a))', 'from __main__ import a, np', repeat=100000, number=1))
9.031506408518908e-06
>>> min(repeat('np.partition(a, (499, 500, -1))', 'from __main__ import a, np', repeat=100000, number=1))
1.1905167525583238e-05

I think you should assume there will be no nans. So for a 1000-item array, always start by making a call to np.partition(a, (499, 500, -1)). You then check to see if the last item is a nan. If it isn't, then you have computed the median almost 3x faster. If, on the other hand, there are indeed nans, you will have to count them and call partition again. Doing this naively, things would run at .75x the current speed. As long as less than two thirds of the 1D slices have nans, you are going to come out ahead.

But if you get a little creative, you can check whether after partial sorting a[500] is a nan, and if it isn't, count the nans only on the second half of the partially sorted array, then figure out the offsets you need, and call partition only on the first half of the array:

>>> min(repeat('np.sum(np.isnan(a[501:]))', 'from __main__ import a, np', repeat=100000, number=1))
7.799937293384573e-06
>>> min(repeat('np.partition(a[:501], (480, 481))', 'from __main__ import a, np', repeat=100000, number=1))
7.799937293384573e-06

You would need to code it and time it to see how well it does depending on the density of nans, but I think it will be faster for the most typical cases, and only do slightly worse for the rest.
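
A hedged 1-D sketch of this strategy; the function name and the even/odd handling are illustrative, and it relies on NaNs sorting past every number in partition:

import numpy as np

def median_assume_no_nans(a):
    n = a.size
    # partition the middle element(s) and the last position in one pass
    p = np.partition(a, [(n - 1) // 2, n // 2, n - 1])
    if not np.isnan(p[-1]):                      # max is not NaN: no NaNs at all
        return 0.5 * (p[(n - 1) // 2] + p[n // 2])
    m = n - np.count_nonzero(np.isnan(a))        # NaNs present: count and shift
    if m == 0:
        return np.nan                            # all-NaN input
    p = np.partition(a, [(m - 1) // 2, m // 2])  # NaNs partition past index m-1
    return 0.5 * (p[(m - 1) // 2] + p[m // 2])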

@empeeu (Contributor) commented Feb 26, 2014

I've been wondering, wouldn't a count_nan() function be the best option in this case? With count_nan() you just have to loop through the array once and increment an integer -- avoiding the memory allocation of a bool array when using isnan(). Alternatively, you could add another specialization to partition -- as it is partitioning an element, it keeps track of whether or not that element is a nan. Though, I don't know the codebase well enough to figure out where to make the changes... Anyway, just an idea.

@juliantaylor (Contributor)

count_nonzero(bool) is significantly faster than sum(bool), though I don't know how that's relevant; my initial proposal of doing that was flawed, as it would only work with full sorting, as pointed out by Jaime.

I think doing an isnan call once, moving all the nans to the end, and then doing a partition should be fast enough. It should not be much more than 20%-30% overhead, and it seems a lot of that actually originates from a np.where inefficiency that can be fixed.

@juliantaylor (Contributor)

In case it's unclear, I mean something along the lines of this to move the nans to the end for the 1d partition:

c = np.isnan(data)
# indices of the nans; fast if there are few of them
# (this assumes at least one nan: s.size > 0)
s = np.where(c)[0]
# select the non-nans within the last s.size positions of the array
enonan = data[-s.size:][~c[-s.size:]]
# overwrite the leading nan positions with those trailing non-nans;
# data[:-s.size] then holds exactly the non-nan values
data[s[:enonan.size]] = enonan
# we don't care about actually moving the nans to the end
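
For context, a hedged completion of this idea as a full 1-D helper; the function name, the overwrite handling, and the all-NaN behavior are assumptions in the spirit of the PR, not its exact code:

import numpy as np

def _nanmedian1d_sketch(arr1d, overwrite_input=False):
    x = arr1d if overwrite_input else arr1d.copy()
    c = np.isnan(x)
    s = np.where(c)[0]
    if s.size == 0:
        return np.median(x, overwrite_input=True)
    if s.size == x.size:
        return np.nan                  # all-NaN slice
    # replace the leading nan positions with the trailing non-nans,
    # then cut the tail off: order is irrelevant for a median
    enonan = x[-s.size:][~c[-s.size:]]
    x[s[:enonan.size]] = enonan
    return np.median(x[:-s.size], overwrite_input=True)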

@juliantaylor (Contributor)

Meh, now I ended up just implementing it fully in scipy :) scipy/scipy#3396

@empeeu (Contributor) commented Mar 4, 2014

@juliantaylor Re: "count_nonzero(bool) is significantly faster than sum(bool), though I don't know how that's relevant"

If we know how many nans are in the array, we just have to move the pivot for the partition to get the median while ignoring nans. Something like:

import numpy as np

a = np.arange(10, dtype=float)
a[:3] = np.nan
# np.count_nan is hypothetical; today: np.count_nonzero(np.isnan(a))
sz = a.size - np.count_nan(a)
part = np.partition(a, (sz - 1) // 2)   # nans partition to the end
nanmedian = part[(sz - 1) // 2]         # for sz odd

So, if we had an efficient count_nan, then nanmedian would basically be just as fast as median. Or am I missing something?

Well, I suppose this gets nastier when you're not dealing with 1D arrays, and I suppose that's the crux of the matter. But, I figured I'd ask anyway.

@charris (Member) commented Mar 24, 2014

@empeeu It would be nice to get this finished up.

@dfreese (Author) commented Mar 26, 2014

It's now up to date with the method @juliantaylor used, assuming small numbers of nans.

@charris (Member) commented Mar 27, 2014

Looking good. Could you squash the commits and use the prefixes in doc/source/dev/gitwash/development_workflow.rst? Also add a note in doc/release/1.9.0-notes.rst in the New Features section.

You can squash commits using git rebase -i HEAD~<number of commits> followed by a force push to origin.

@dfreese (Author) commented Mar 27, 2014

Should be together now. Let me know if I missed something.

@juliantaylor (Contributor)

Does this work with an extended axis? e.g. nanmedian(d, axis=(0,1))

@dfreese (Author) commented Mar 27, 2014

No, currently it raises an IndexError if a tuple is given.


@charris (Member) commented Mar 27, 2014

@dfreese See #3908.

@charris (Member) commented Mar 31, 2014

@dfreese Needs a rebase. Could you also add the extended axis? I'd like to get this in.

@juliantaylor (Contributor)

Then we still need a nanpercentile :/

@dfreese (Author) commented Apr 1, 2014

Extended axis and keepdims are in. Sorry for the excessive delay on the update.

if axis is None:
    part = a.ravel()
    if out is not None:
        out = _nanmedian1d(part, overwrite_input)
Review comment (Contributor):

I don't think this works: you are assigning to a local reference. It should probably be out[...] = ... to copy into out.
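
That is, something along these lines (a sketch; part and _nanmedian1d are from the PR's code above):

if out is not None:
    out[...] = _nanmedian1d(part, overwrite_input)  # write into the caller's buffer
    return out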

Review comment (Contributor):

The out test only succeeds by accident, so that needs improving too.

@juliantaylor (Contributor)

I am a bit confused: what happened to the approach of counting the nans and shifting the partition? I'm guessing it would duplicate too much np.median code?

@empeeu (Contributor) commented Apr 2, 2014

@charris Sorry for the late reply (trying to buy a house). I ceded control to dfreese on this one. I'm not up to date on the discussion here.

@dfreese (Author) commented Apr 8, 2014

I will look at getting the out fixed up here tonight. I tried to keep it somewhat tied to median for simplicity's sake, but I haven't tried to compare the two methods.

# fill nans in beginning of array with non-nans of end
x[s[:enonan.size]] = enonan
# slice nans away
return np.median(x[:-s.size], overwrite_input=True)
Review comment (Contributor):

It seems this duplication can be avoided by reducing the if to x = arr1d if overwrite_input else arr1d.copy()

@charris (Member) commented Apr 10, 2014

@juliantaylor Look good to you?

@juliantaylor (Contributor)

I think it would be good if nanmedian behaved like nanmean on empty arrays/axes; otherwise it's good.

@juliantaylor (Contributor)

An easy way to do so could be to just call nanmean if the array size is zero.

@juliantaylor (Contributor)

Or, as mean and median behave differently, maybe it is better to adapt nanmean to behave like the others instead.

@dfreese (Author) commented Apr 24, 2014

I was mistaken earlier when I looked at some of the handling of empty arrays. apply_along_axis() makes handling an empty axis a little awkward. I added a check for that prior to calling apply_along_axis, which should give the same result as mean/median/nanmean.

Side note: nanmean treats empty arrays and all-nan arrays the same way; in either case the warnings are the same.

else:
    # apply_along_axis doesn't handle empty arrays well,
    # so deal with them upfront
    if (np.array(a.shape) == 0).any():
Review comment (Contributor):

0 in a.shape is shorter and faster; ufunc reductions have enormous overheads.

@dfreese (Author) commented Apr 24, 2014

@juliantaylor I added a commit that does those things. Moving the axis=None handling to the main function is pretty straightforward, but the implementation for empty arrays ends up fairly messy in order to handle extended axis and keepdims. Not sure it's ideal to add that much code just to avoid the function call for an empty array.

@juliantaylor (Contributor)

Would this work?

if 0 in a.shape:
    return np.nanmean(a, **kwargs)

@dfreese (Author) commented Apr 24, 2014

yeah.

result = result * np.ones([1 for i in a.shape])
if out is not None:
    out[:] = result
return result
Review comment (Contributor):

Why is the axis=None part now both here and inside _nanmedian? This can be removed; let _ureduce handle the keepdims.
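
For reference, a hedged sketch of delegating to _ureduce the way np.median does; the exact _ureduce signature is assumed from that era's implementation:

def nanmedian(a, axis=None, out=None, overwrite_input=False, keepdims=False):
    a = np.asanyarray(a)
    # _ureduce handles extended axes and returns the shape needed for keepdims
    r, k = _ureduce(a, func=_nanmedian, axis=axis, out=out,
                    overwrite_input=overwrite_input)
    return r.reshape(k) if keepdims else r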

@juliantaylor (Contributor)

Thanks, merging. Please comment when you update a branch; there is no automatic notification otherwise.

juliantaylor added a commit that referenced this pull request on May 6, 2014: ENH: added functionality nanmedian to numpy
@juliantaylor merged commit d150103 into numpy:master on May 6, 2014
@juliantaylor (Contributor)

Now we still need nanpercentile, any takers?

@dfreese deleted the feature/nanmedian branch on May 9, 2014
@dfreese (Author) commented May 22, 2014

Thanks for the help and patience. I put together a nanpercentile in the same vein (#4734).
