
ENH: added functionality nanmedian to numpy #4307

Merged: 1 commit merged into numpy:master from dfreese:feature/nanmedian, May 6, 2014

Conversation

@dfreese commented Feb 17, 2014

Implemented a nanmedian and associated tests as an
extension of np.median to complement the other
nanfunctions

Added negative values to the unit tests

Cleaned up documentation of nanmedian

@jaimefrio (Member)

If you take a look at the code of masked arrays' median, there are two very gross inefficiencies in there:

  1. It uses sorting, not selection. This means it is O(n log n) instead of O(n).
  2. It uses apply_along_axis for multidimensional cases, which is not any faster than a Python loop.

Ideally you would want to correct both situations, although I don't think there is an entirely satisfactory solution. A couple of ideas:

  1. If you stick to sorting, you can get rid of apply_along_axis. The basic insight is that NaNs get sorted to the end of an array. So if you count the NaNs in each 1D array over which you have to compute the median, you can figure out by how much you have to shift the position of the items you have to average to find the median. This shift does not have to be the same for all 1D arrays, so you won't be able to simply take a slice of the sorted array, but I think it can be done with fancy indexing (see the sketch below).
  2. If you prefer to use selection, then it is a pity that np.partition won't broadcast the indices, so you are basically going to be stuck with something similar to using apply_along_axis. A possible improvement would be, instead of iterating over all the 1D arrays, to iterate over all groups of 1D arrays that have the same number of NaNs. It would be more complicated, but for the typical case, where only a few of the 1D arrays have NaNs and there aren't many of them, it should make things much faster.

I think that 2 would be a better option than 1, although probably harder to code.
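
A minimal sketch of idea 1 (sorting plus per-slice NaN counts and fancy indexing); the 2-D shape, the axis-0 reduction, and all names here are illustrative assumptions, not code from this PR:

import numpy as np

def nanmedian_sorted(a):
    """Median over axis 0 of a 2-D float array, ignoring NaNs (sketch)."""
    s = np.sort(a, axis=0)                    # NaNs sort to the end of each column
    n = a.shape[0] - np.isnan(a).sum(axis=0)  # count of valid entries per column
    cols = np.arange(a.shape[1])
    lo = (n - 1) // 2                         # fancy indexing: per-column positions
    hi = n // 2                               # of the middle valid element(s)
    return 0.5 * (s[lo, cols] + s[hi, cols])

a = np.array([[1.0, np.nan], [3.0, 2.0], [np.nan, 4.0]])
print(nanmedian_sorted(a))                    # [2. 3.]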

@juliantaylor (Contributor)

The rationale for not supporting broadcasting, besides being too lazy to think about how that would work, is that partition is normally a very expensive operation, so the cost of an explicit loop is usually amortized quickly.

Instead of using the ma module, I would suggest using apply_along_axis directly on the 1d arrays, then moving the nans to the end using fancy indexing and partitioning the rest. This should be quite fast, as moving the nans within the axis slices should improve CPU cache utilization. See scipy/scipy#3335 for an existing implementation that can still be improved by moving nans.

One could also add a cutoff for when the looping cost gets high, beyond which one simply sorts the array and shifts the indices, but I don't think it's necessary for the first version.

@dfreese (Author) commented Feb 23, 2014

I implemented the formulation by @empeeu in #4287, which follows the method you mentioned, @juliantaylor. The implementation of the out functionality seems a little rough, but I wasn't sure of the best way to handle that with apply_along_axis.

@jaimefrio (Member)

I think there should be a check for the dtype at the beginning of the function that simply calls median if the type cannot have NaNs in it: no point calling isnan on an array of integers.
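
A hedged sketch of that shortcut (the signature and surrounding function are assumptions; np.issubdtype and np.inexact are the relevant NumPy APIs):

import numpy as np

def nanmedian(a, axis=None, out=None, overwrite_input=False):
    a = np.asanyarray(a)
    if not np.issubdtype(a.dtype, np.inexact):
        # integer and boolean arrays cannot hold NaN: plain median is exact
        return np.median(a, axis=axis, out=out, overwrite_input=overwrite_input)
    ...  # NaN-aware path continues here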

And I may be wrong, but I think that even a nan-function should be optimized for no nans in the input, as that is the most likely situation. I have made some timings on master, after Julian's shortcut to speed up partition for index -1:

>>> a = np.random.rand(1000)
>>> min(repeat('np.partition(a, -1)', 'from __main__ import a, np', repeat=100000, number=1))
7.3894142929020745e-06
>>> min(repeat('np.sum(np.isnan(a))', 'from __main__ import a, np', repeat=100000, number=1))
9.031506408518908e-06
>>> min(repeat('np.partition(a, (499, 500, -1))', 'from __main__ import a, np', repeat=100000, number=1))
1.1905167525583238e-05

I think you should assume there will be no nans. So for a 1000-item array, always start by making a call to np.partition(a, (499, 500, -1)). You then check to see if the last item is a nan. If it isn't, then you have computed the median almost 3x faster. If, on the other hand, there are indeed nans, you will have to count them and call partition again. Doing this naively, things would run at .75x the current speed. As long as less than two thirds of the 1D slices have nans, you are going to come out ahead.

But if you get a little creative, you can check whether after partial sorting a[500] is a nan, and if it isn't, count the nans only on the second half of the partially sorted array, then figure out the offsets you need, and call partition only on the first half of the array:

>>> min(repeat('np.sum(np.isnan(a[501:]))', 'from __main__ import a, np', repeat=100000, number=1))
7.799937293384573e-06
>>> min(repeat('np.partition(a[:501], (480, 481))', 'from __main__ import a, np', repeat=100000, number=1))
7.799937293384573e-06

You would need to code it and time it to see how well it does depending on the density of nans, but I think it will be faster for the most typical cases, and only do slightly worse for the rest.
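
A hedged 1-D sketch of this strategy; the function name and the even/odd handling are illustrative, and it relies on NaNs sorting past every number in partition:

import numpy as np

def median_assume_no_nans(a):
    n = a.size
    # partition the middle element(s) and the last position in one pass
    p = np.partition(a, [(n - 1) // 2, n // 2, n - 1])
    if not np.isnan(p[-1]):                      # max is not NaN: no NaNs at all
        return 0.5 * (p[(n - 1) // 2] + p[n // 2])
    m = n - np.count_nonzero(np.isnan(a))        # NaNs present: count and shift
    if m == 0:
        return np.nan                            # all-NaN input
    p = np.partition(a, [(m - 1) // 2, m // 2])  # NaNs partition past index m-1
    return 0.5 * (p[(m - 1) // 2] + p[m // 2])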

@empeeu (Contributor) commented Feb 26, 2014

I've been wondering, wouldn't a count_nan() function be the best option in this case? With count_nan() you just have to loop through the array once and increment an integer -- avoiding the memory allocation of a bool array when using isnan(). Alternatively, you could add another specialization to partition -- as it is partitioning an element, it keeps track of whether or not that element is a nan. Though, I don't know the codebase well enough to figure out where to make the changes... Anyway, just an idea.

@juliantaylor (Contributor)

count_nonzero(bool) is significantly faster than sum(bool), though I don't know how that's relevant; my initial proposal of doing that was flawed, as it would only work with full sorting, as pointed out by Jaime.

I think doing an isnan call once, moving all the nans to the end, and then doing a partition should be fast enough. It should not be much more than 20%-30% overhead, and it seems a lot of that actually originates from a np.where inefficiency that can be fixed.

@juliantaylor (Contributor)

In case it's unclear, I mean something along the lines of this to move the nans to the end for the 1d partition:

c = np.isnan(data)
# indices of the nans; fast if there are few of them
# (this assumes at least one nan: s.size > 0)
s = np.where(c)[0]
# select the non-nans within the last s.size positions of the array
enonan = data[-s.size:][~c[-s.size:]]
# overwrite the leading nan positions with those trailing non-nans;
# data[:-s.size] then holds exactly the non-nan values
data[s[:enonan.size]] = enonan
# we don't care about actually moving the nans to the end
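
For context, a hedged completion of this idea as a full 1-D helper; the function name, the overwrite handling, and the all-NaN behavior are assumptions in the spirit of the PR, not its exact code:

import numpy as np

def _nanmedian1d_sketch(arr1d, overwrite_input=False):
    x = arr1d if overwrite_input else arr1d.copy()
    c = np.isnan(x)
    s = np.where(c)[0]
    if s.size == 0:
        return np.median(x, overwrite_input=True)
    if s.size == x.size:
        return np.nan                  # all-NaN slice
    # replace the leading nan positions with the trailing non-nans,
    # then cut the tail off: order is irrelevant for a median
    enonan = x[-s.size:][~c[-s.size:]]
    x[s[:enonan.size]] = enonan
    return np.median(x[:-s.size], overwrite_input=True)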

@juliantaylor (Contributor)

Meh, now I ended up just implementing it fully in scipy :) scipy/scipy#3396

@empeeu (Contributor) commented Mar 4, 2014

@juliantaylor Re: "count_nonzero(bool) is significantly faster than sum(bool), though I don't know how that's relevant"

If we know how many nans are in the array, we just have to move the pivot for the partition to get the median while ignoring nans. Something like:

import numpy as np

a = np.arange(10, dtype=float)
a[:3] = np.nan
# np.count_nan is hypothetical; today: np.count_nonzero(np.isnan(a))
sz = a.size - np.count_nan(a)
part = np.partition(a, (sz - 1) // 2)   # nans partition to the end
nanmedian = part[(sz - 1) // 2]         # for sz odd

So, if we had an efficient count_nan, then nanmedian would basically be just as fast as median. Or am I missing something?

Well, I suppose this gets nastier when you're not dealing with 1D arrays, and I suppose that's the crux of the matter. But, I figured I'd ask anyway.

@charris (Member) commented Mar 24, 2014

@empeeu It would be nice to get this finished up.

@dfreese (Author) commented Mar 26, 2014

It's now up to date with the method @juliantaylor used, assuming small numbers of nans.

@charris (Member) commented Mar 27, 2014

Looking good. Could you squash the commits and use the prefixes in doc/source/dev/gitwash/development_workflow.rst? Also add a note in doc/release/1.9.0-notes.rst in the New Features section.

You can squash commits using git rebase -i HEAD~<number of commits> followed by a force push to origin.

@dfreese (Author) commented Mar 27, 2014

Should be together now. Let me know if I missed something.

@juliantaylor (Contributor)

Does this work with an extended axis? e.g. nanmedian(d, axis=(0,1))

@dfreese (Author) commented Mar 27, 2014

No, currently it raises an IndexError if a tuple is given.


@charris (Member) commented Mar 27, 2014

@dfreese See #3908.

@charris (Member) commented Mar 31, 2014

@dfreese Needs a rebase. Could you also add the extended axis? I'd like to get this in.

@juliantaylor (Contributor)

Then we still need a nanpercentile :/

@dfreese (Author) commented Apr 1, 2014

Extended axis and keepdims are in. Sorry for the excessive delay on the update.

if axis is None:
    part = a.ravel()
    if out is not None:
        out = _nanmedian1d(part, overwrite_input)
Review comment (Contributor):

I don't think this works: you are assigning to a local reference. It should probably be out[...] = ... to copy into out.
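
That is, something along these lines (a sketch; part and _nanmedian1d are from the PR's code above):

if out is not None:
    out[...] = _nanmedian1d(part, overwrite_input)  # write into the caller's buffer
    return out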

Review comment (Contributor):

The out test only succeeds by accident, so that needs improving too.

@juliantaylor (Contributor)

I am a bit confused: what happened to the approach of counting the nans and shifting the partition? I'm guessing it would duplicate too much np.median code?

@empeeu (Contributor) commented Apr 2, 2014

@charris Sorry for the late reply (trying to buy a house). I ceded control to dfreese on this one. I'm not up to date on the discussion here.

@dfreese (Author) commented Apr 8, 2014

I will look at getting the out fixed up here tonight. I tried to keep it somewhat tied to median for simplicity's sake, but I haven't tried to compare the two methods.

# fill nans in beginning of array with non-nans of end
x[s[:enonan.size]] = enonan
# slice nans away
return np.median(x[:-s.size], overwrite_input=True)
Review comment (Contributor):

It seems this duplication can be avoided by reducing the if to x = arr1d if overwrite_input else arr1d.copy()

@charris (Member) commented Apr 10, 2014

@juliantaylor Look good to you?

@juliantaylor (Contributor)

I think it would be good if nanmedian behaved like nanmean on empty arrays/axes; otherwise it's good.

@juliantaylor (Contributor)

An easy way to do so could be to just call nanmean if the array size is zero.

@juliantaylor (Contributor)

Or, as mean and median behave differently, maybe it is better to adapt nanmean to behave like the others instead.

@dfreese (Author) commented Apr 24, 2014

I was mistaken earlier when I looked at some of the handling of empty arrays. apply_along_axis() makes handling an empty axis a little awkward. I added a check for that prior to calling apply_along_axis, which should give the same result as mean/median/nanmean.

Side note: nanmean treats empty arrays and all-nan arrays the same way; in either case the warnings are the same.

else:
    # apply_along_axis doesn't handle empty arrays well,
    # so deal with them upfront
    if (np.array(a.shape) == 0).any():
Review comment (Contributor):

0 in a.shape is shorter and faster; ufunc reductions have enormous overheads.

@dfreese (Author) commented Apr 24, 2014

@juliantaylor I added a commit that does those things. Moving the axis=None handling to the main function is pretty straightforward, but the implementation for empty arrays ends up fairly messy in order to handle extended axis and keepdims. Not sure it's ideal to add that much code just to avoid the function call for an empty array.

@juliantaylor (Contributor)

Would this work?

if 0 in a.shape:
    return np.nanmean(a, **kwargs)

@dfreese (Author) commented Apr 24, 2014

yeah.

result = result * np.ones([1 for i in a.shape])
if out is not None:
    out[:] = result
return result
Review comment (Contributor):

Why is the axis=None part now both here and inside _nanmedian? This can be removed; let _ureduce handle the keepdims.
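
For reference, a hedged sketch of delegating to _ureduce the way np.median does; the exact _ureduce signature is assumed from that era's implementation:

def nanmedian(a, axis=None, out=None, overwrite_input=False, keepdims=False):
    a = np.asanyarray(a)
    # _ureduce handles extended axes and returns the shape needed for keepdims
    r, k = _ureduce(a, func=_nanmedian, axis=axis, out=out,
                    overwrite_input=overwrite_input)
    return r.reshape(k) if keepdims else r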

@juliantaylor (Contributor)

Thanks, merging. Please comment when you update a branch; there is no automatic notification otherwise.

juliantaylor added a commit that referenced this pull request on May 6, 2014: ENH: added functionality nanmedian to numpy
@juliantaylor merged commit d150103 into numpy:master on May 6, 2014
@juliantaylor (Contributor)

Now we still need nanpercentile, any takers?

@dfreese deleted the feature/nanmedian branch on May 9, 2014
@dfreese (Author) commented May 22, 2014

Thanks for the help and patience. I put together a nanpercentile in the same vein (#4734).
