ENH: added functionality nanmedian to numpy #4307
If you take a look at the code of masked arrays' median, there are two very gross inefficiencies in there:
Ideally you would want to correct both situations, although I don't think there is an entirely satisfactory solution. A couple of ideas:
I think that 2 would be a better option than 1, although probably harder to code. |
The rationale for not supporting broadcasting, besides being too lazy to think about how that would work, is that partition is normally a very expensive operation, so the cost of an explicit loop is usually amortized quickly. I would suggest instead of using the one could also add a cutoff for when the looping cost gets high, at which one simply sorts the array and shifts the indices, but I don't think it's necessary for the first version. |
I implemented the formulation by @empeeu in #4287, which goes about the method you mentioned @juliantaylor. The implementation of the out functionality seems a little rough, but I wasn't sure the best way to go about that with the apply_along_axis. |
I think there should be a check for the dtype at the beginning of the function that simply calls And I may be wrong, but I think that even a nan-function should be optimized for no nans in the input as being the most likely situation. I have made some timings on master after Julian's shortcut to speed-up partition for index -1:
I think you should assume there will be no nans. So for a 1000 item array, always start by making a call to But if you get a little creative, you can check whether after partial sorting
You would need to code it and time it to see how well it does depending on the density of nans, but I think it will be faster for the most typical cases, and only do slightly worse for the rest. |
I've been wondering, wouldn't a |
I think doing an isnan call once, moving all nans to the end, and then doing a partition should be fast enough. It should not be much more than 20%-30% overhead, and it seems a lot of that actually originates from an np.where inefficiency that can be fixed. |
In case it's unclear, I mean something along the lines of this to move nans to the end for the 1d partition:
c = np.isnan(data)
# fast if few nans
s = np.where(c)[0]
# select non-nans at end of array
enonan = data[-s.size:][~c[-s.size:]]
# fill nans in beginning of array with non-nans of end
data[s[:enonan.size]] = enonan
# we don't care about moving nans to the end |
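The snippet above can be fleshed out into a self-contained 1-D function. This is only a sketch of the idea, not the code that was merged: the name `nanmedian_1d_sketch` is hypothetical, it always works on a copy, and it returns nan for an empty or all-nan input without emitting a warning.

```python
import numpy as np


def nanmedian_1d_sketch(arr1d):
    """Median of a 1-D array ignoring NaNs (hypothetical helper name)."""
    data = np.array(arr1d, dtype=float)  # copy, so the caller's data is untouched
    c = np.isnan(data)
    s = np.where(c)[0]  # indices of the NaNs; short when there are few of them
    if s.size == data.size:
        # empty or all-NaN input: there is nothing to take a median of
        return np.nan
    if s.size == 0:
        return np.median(data)
    # non-NaN values sitting in the last s.size slots of the array
    enonan = data[-s.size:][~c[-s.size:]]
    # overwrite the leading NaNs with those values
    data[s[:enonan.size]] = enonan
    # the first data.size - s.size slots now hold every non-NaN value
    return np.median(data[:-s.size])
```

The index arithmetic works because `s` is sorted: if the tail holds `m` non-nan values, then exactly `m` nans lie before the tail, and they are the first `m` entries of `s`.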
meh now I ended up just implementing it fully in scipy :) scipy/scipy#3396 |
@juliantaylor Re: "count_nonzero(bool) is significantly faster than sum(bool), though I don't know how that's relevant": if we know how many nans are in the array, we just have to move the pivot for the partition to get the median while ignoring nans. Something like
so, if we had an efficient Well, I suppose this gets nastier when you're not dealing with 1D arrays, and I suppose that's the crux of the matter. But, I figured I'd ask anyway. |
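One way to read the pivot-shifting idea for the 1-D case is sketched below. It is an illustration, not the commenter's original snippet: the function name is hypothetical, and it assumes that `np.partition` orders NaNs after all other values, as `np.sort` is documented to do.

```python
import numpy as np


def nanmedian_by_pivot_shift(arr1d):
    """1-D median ignoring NaNs, by shifting the partition index (hypothetical name)."""
    data = np.asarray(arr1d, dtype=float)
    # count the valid entries; count_nonzero on a boolean array is cheap
    n_valid = data.size - np.count_nonzero(np.isnan(data))
    if n_valid == 0:
        return np.nan
    mid = n_valid // 2
    if n_valid % 2:
        # odd count: a single middle element among the valid values
        return np.partition(data, mid)[mid]
    # even count: average the two middle elements among the valid values
    part = np.partition(data, [mid - 1, mid])
    return 0.5 * (part[mid - 1] + part[mid])
```

Because the NaNs are assumed to end up past position `n_valid - 1`, the median positions are computed from `n_valid` rather than from `data.size`. As the comment notes, extending this to an arbitrary axis is harder, since each 1-D slice can have a different nan count.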
@empeeu Be nice to get this finished up. |
It's up to date with the method @juliantaylor used assuming small numbers of nans. |
Looking good. Could you squash the commits and use the prefixes in You can squash commits using |
Should be together now. Let me know if I missed something. |
does this work with extended axis? e.g. nanmedian(d, axis=(0,1)) |
No, currently it raises an IndexError if a tuple is given. On Thu, Mar 27, 2014 at 1:31 PM, Julian Taylor notifications@github.com wrote:
|
@dfreese Needs a rebase. Could you also add the extended axis? I'd like to get this in. |
then we still need a nanpercentile :/ |
Extended axis and keep dims are in. Sorry for the excessive delay on the update. |
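For readers unfamiliar with the two features, here is a small usage sketch against the `np.nanmedian` that ships in current NumPy releases. The array `d` and its values are made up for illustration.

```python
import numpy as np

# a 2 x 3 x 4 array with a single NaN
d = np.arange(24, dtype=float).reshape(2, 3, 4)
d[0, 0, 0] = np.nan

# extended axis: reduce over the first two axes at once
r = np.nanmedian(d, axis=(0, 1))

# keepdims: reduced axes are kept with length 1, so the result broadcasts against d
rk = np.nanmedian(d, axis=(0, 1), keepdims=True)

# reducing over axes (0, 1) is the same as flattening them into one axis first
flat = np.nanmedian(d.reshape(6, 4), axis=0)
```

Here `r` has shape `(4,)` and `rk` has shape `(1, 1, 4)`.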
if axis is None:
    part = a.ravel()
    if out is not None:
        out = _nanmedian1d(part, overwrite_input)
I don't think this works: you are assigning to a local reference. It should probably be out[...] = ... to copy into out.
the out test only succeeds by accident so that needs improving too
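The difference between rebinding `out` and copying into it can be shown in a few lines. The function names below are hypothetical and the example uses `np.median` on a tiny array just to have a value to store.

```python
import numpy as np


def write_by_rebinding(values, out):
    # rebinds the local name only; the caller's array is never written
    out = np.median(values)
    return out


def write_in_place(values, out):
    # copies the result into the caller's array
    out[...] = np.median(values)
    return out


values = np.array([1.0, 2.0, 9.0])
buf_a = np.zeros(())  # 0-d output buffers
buf_b = np.zeros(())

ret_a = write_by_rebinding(values, buf_a)
ret_b = write_in_place(values, buf_b)
```

Both calls return the right median, so a test that only inspects the return value passes either way. Only a check on the buffer itself (`buf_a` is still 0.0, `buf_b` is 2.0) tells the two apart.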
I am a bit confused, what happened to the approach of counting nans and shifting the partition? |
@charris Sorry for the late reply (trying to buy a house). I ceded control to dfreese on this one. I'm not up to date on the discussion here. |
I will look at getting the out fixed up here tonight. I tried to keep it somewhat tied to median for simplicity's sake, but I haven't tried to compare the two methods. |
# fill nans in beginning of array with non-nans of end
x[s[:enonan.size]] = enonan
# slice nans away
return np.median(x[:-s.size], overwrite_input=True)
Seems this duplication can be avoided by reducing the if to: if overwrite: x = arr1d else: x = arr1d.copy()
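A minimal sketch of that reduction follows, assuming the intent is that the input is copied only when overwriting it is not allowed. The helper name `working_buffer` is hypothetical; in the pull request the choice would sit inline before the shared nan-shuffling code.

```python
import numpy as np


def working_buffer(arr1d, overwrite_input=False):
    """Pick the array the shared code may scribble on (hypothetical helper)."""
    if overwrite_input:
        x = arr1d          # caller allows clobbering: reuse the input
    else:
        x = arr1d.copy()   # otherwise protect the caller's data
    return x


data = np.array([1.0, np.nan, 3.0])
aliased = working_buffer(data, overwrite_input=True)
copied = working_buffer(data, overwrite_input=False)
```

With the branch reduced to choosing `x`, the fill-and-slice code only needs to be written once after it.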
@juliantaylor Look good to you? |
I think it would be good if nanmedian on empty arrays/axes behaved like nanmean; otherwise it's good. |
An easy way to do so could be to just call nanmean if the array size is zero. |
Or, as mean and median behave differently, maybe it is better to adapt nanmean to behave like the others instead. |
I was mistaken earlier when I looked at some of the handling of empty arrays. apply_along_axis() makes handling an empty axis a little awkward. I added in a check for that prior to calling apply_along_axis, which should cause the same result as mean/median/nanmean. sidenote: nanmean treats empty arrays and all-nan arrays the same way; in either case the warnings are the same. |
else:
    # apply_along_axis doesn't handle empty arrays well,
    # so deal with them upfront
    if (np.array(a.shape)==0).any():
0 in a.shape is shorter and faster; ufunc reductions have enormous overheads.
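The two checks give the same answer, as the sketch below shows. The function names are hypothetical and no timings are claimed here beyond the reviewer's remark.

```python
import numpy as np


def has_empty_axis_ufunc(a):
    # builds a temporary array from the shape tuple, then reduces it
    return bool((np.array(a.shape) == 0).any())


def has_empty_axis_tuple(a):
    # a.shape is a plain tuple of ints, so membership testing is enough
    return 0 in a.shape


empty = np.empty((3, 0, 2))
full = np.ones((3, 2))
```

Both functions return `True` for `empty` and `False` for `full`; the tuple version simply avoids creating and reducing a temporary array.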
@juliantaylor I added a commit that does those things. Moving axis == None to the main function is pretty straightforward, but the implementation for empty arrays ends up being fairly messy in order to handle extended axis and keepdims. Not sure if it's ideal to add that much code to miss the function call for an empty array. |
would this work?
|
yeah. |
result = result * np.ones([1 for i in a.shape])
if out is not None:
    out[:] = result
return result
Why is the axis None part now both here and inside _nanmedian? This can be removed, letting _ureduce handle the keepdims.
Implemented a nanmedian and associated tests as an extension of np.median to complement the other nanfunctions. Added negative values to the unit tests. Cleaned up documentation of nanmedian.
Thanks, merging. Please comment when you update a branch; there is no automatic notification otherwise. |
ENH: added functionality nanmedian to numpy
now we still need nanpercentile, any takers? |
Thanks for the help and patience. I put together a nanpercentile in the same vein. (#4734) |