BUG: handle unmasked NaN in ma.median like normal median #8364
urg no this doesn't actually work like that, I just haven't added -inf tests ...
40b3d49 to 3643355
checking for fully masked axes and sorting it out then seems to work, but man is it ugly.
normal indexing doesn't fail anything with the fully masked check, I should probably change it back as it's easier to understand.
I don't see why the masked values not sorting to the end matters. If we were doing an argsort, sure, but here I don't see that it matters where the infs come from.
it doesn't matter, but we do have to preserve whether the median comes from a masked value due to undefined ordering or from a fully masked axis. This is now handled by checking the mask again if the median is a masked value. how do you enable the deprecation warnings with the regular test scripts? just running them
Where does the undefined ordering come from? I still don't understand what that problem is, but I'm assuming that you know how many masked values there are in axis, which you need to know to find the median in any case. So where am I going wrong in the following:
Apropos deprecation warnings, I'm not clear what you want to do. In development versions the default is to raise an error for deprecation warnings, are you saying that isn't happening?
the undiscarded infs may be masked as the ordering is undefined. the deprecation errors are happening when running via
Why does that matter? An inf is an inf, it matters not where it came from, and since you know how many masked infs there are, all that remain can be considered unmasked no matter if they came from a masked value or not. I only see a problem if you are doing an argmedian type thing. I see test failures here running
But I also see them without the
you can't unconditionally return unmasked values as the result can be masked when the whole axis is masked. the warnings are probably Deprecation warnings from
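To make the last point concrete, here is a minimal sketch (the array values are made up for illustration) of why the result cannot be unconditionally unmasked: a row in which every element is masked has no defined median and should come back masked, while a row with valid elements yields an ordinary value.

```python
import numpy as np

# Illustrative data: row 0 is fully masked, row 1 has no masked values.
data = np.ma.array(
    [[1.0, 2.0], [3.0, 4.0]],
    mask=[[True, True], [False, False]],
)

result = np.ma.median(data, axis=1)

# Row 0 has no valid elements, so its median stays masked;
# row 1 is the plain median of 3.0 and 4.0.
row_masks = np.ma.getmaskarray(result).tolist()
print(row_masks)
print(result[1])
```

The mask of the result is what tells the two situations apart, which is presumably why the mask is checked again when the median comes out masked.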
3643355 to 025a65f
updated, reverting to the original indexing, seems to be enough with the additional mask check
hm is that python3.6 error in travis a bug in python? why would that work on all other pythons?
That's actually "only" a deprecation in 3.6 because you either need to use raw strings or may only use "explicitly specified" escapes together with
I think only the listed escape sequences are not deprecated: https://docs.python.org/2/reference/lexical_analysis.html#string-literals
interesting, so making it a raw string should work. Probably should also fix the python2 guarded use in nosetester in case this ever gets a real syntax error.
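The escape-sequence behaviour discussed above can be checked with a small sketch. The helper name `compile_warnings` and the sample literals are made up for illustration; the code only compiles two source snippets and reports which warning categories the compiler emits. Which category appears depends on the Python version, so none is hard-coded here.

```python
import warnings

def compile_warnings(source):
    """Compile `source` and return the names of the warning categories raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        try:
            compile(source, "<example>", "exec")
        except SyntaxError:
            # A future Python may reject the literal outright.
            return ["SyntaxError"]
    return [w.category.__name__ for w in caught]

# Source text containing the literal "\." (an unrecognized escape sequence).
plain_literal = 'pattern = "\\."'
# The same literal written as a raw string.
raw_literal = 'pattern = r"\\."'

print(compile_warnings(plain_literal))
print(compile_warnings(raw_literal))
```

The raw-string variant compiles without any warning, which matches the fix suggested in the thread.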
025a65f to 525942c
def _median_nancheck(data, result, axis, out):
    """
    Check median result from data for NaN values at the end and return NaN
    in that case. Also used for masked median
A standard docstring would be good.
    n = np.isnan(data[..., -1])
    # masked NaN are fine
    try:
        n = n.filled(False)
This is to deal with the possibility that n is not masked, no? Either a comment or explicit type check would help. Should also be mentioned in the docstring that it handles both.
hm yes an explicit MaskedArray type is more self-documenting
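A minimal sketch of what the explicit type check could look like, assuming the check operates on the last element along the sorted axis as in the diff fragment above. The helper name `unmasked_nan_at_end` is hypothetical and this is not the code that was merged.

```python
import numpy as np

def unmasked_nan_at_end(sorted_data):
    """Hypothetical helper: flag lanes whose last sorted element is an unmasked NaN."""
    n = np.isnan(sorted_data[..., -1])
    # Explicit type check instead of try/except: only masked results
    # have a mask to fill, and a masked NaN should not count.
    if isinstance(n, np.ma.MaskedArray):
        n = n.filled(False)
    return n

plain = np.array([[1.0, 2.0], [1.0, np.nan]])
masked = np.ma.array(
    [[1.0, np.nan], [1.0, np.nan]],
    mask=[[False, True], [False, False]],
)

print(unmasked_nan_at_end(plain))
print(unmasked_nan_at_end(masked))
```

The same function then serves plain arrays and masked arrays, and the `isinstance` branch documents which of the two it is dealing with.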
        np.true_divide(s, 2., casting='unsafe')
        s = np.lib.function_base._median_nancheck(asorted, s, axis, out)
    else:
        s = mid.mean(out=out)
Out of curiosity, what does this do for non-inexact types?
cast them to inexact, median always returns float, note this is not changed from before. That the old 1d case did use mean for inexact directly was a bug, np.ma.median([inf, inf]) was wrongly masked (test added)
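Both statements in this reply can be illustrated with a short sketch, assuming a NumPy version that includes this fix. The input values are made up for illustration.

```python
import numpy as np

# Integer input: the median of an even-length axis is the mean of the two
# middle values, so the result is a float.
int_result = np.ma.median(np.ma.array([1, 2, 3, 4]))
print(int_result)

# The case mentioned above: two unmasked infinities should give an
# unmasked inf rather than a masked result.
inf_result = np.ma.median(np.ma.array([np.inf, np.inf]))
print(inf_result is np.ma.masked)
print(inf_result)
```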
    else:
        s = mid.mean(out=out)

    # if result is masked either it is full of minimum fill or all masked
Where does minimum fill come from here?
should be: the array is full of the minimum fill value, or full enough that the median is the minimum fill value (= inf)
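For readers unfamiliar with the naming, a quick sketch of what "minimum fill value" means for floating-point data (the sample array is made up for illustration):

```python
import numpy as np

values = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

# For floating-point data the "minimum" fill value is +inf: it is the filler
# that can never win a minimum.
min_fill = np.ma.minimum_fill_value(values)
# Its counterpart for maximum computations is -inf.
max_fill = np.ma.maximum_fill_value(values)

print(min_fill, max_fill)
```

So a masked median result can mean either that the axis was fully masked or that enough entries equal to this fill value pushed it into the middle position.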
Just a couple of questions and a request for a complete docstring for
I'm tempted to suggest another private file in numpy/lib to contain the common function
I also don't like the use of the nancheck helper, but it just needs one extra operation to work for both and not repeating it is important as it is not trivial code.
I don't want people using the function as it is just used to implement a detail and we might want to change it. Keeping it explicitly marked private should help there.
OK.
This requires to base masked median on sort(endwith=False) as we need to distinguish Inf and NaN. Using Inf as filler element of the sort does not work as then the mask is not guaranteed to be at the end. Closes gh-8340 Also fixed 1d ma.median not handling np.inf correctly, the nd variant was ok.
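A small sketch of the sorting behaviour this commit message relies on, using made-up data: with `endwith=False` the masked entry is placed at the front of the sorted axis, so an unmasked NaN ends up as the last element where a NaN check can find it.

```python
import numpy as np

# Illustrative data: one masked entry (the 1.0) and one unmasked NaN.
data = np.ma.array([3.0, np.nan, 1.0, 2.0], mask=[False, False, True, False])

# endwith=False places the masked entry at the front of the sorted axis,
# leaving the unmasked NaN as the last element.
ordered = np.ma.sort(data, endwith=False)

print(np.ma.getmaskarray(ordered).tolist())
print(ordered.data[-1])
```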
Python 3.6 gets more strict about escape sequences, \. is invalid. As it could get a syntax error the version check would not work.
The apply_along_axis path is significantly more expensive than currently accounted for in the check. Increase the minimum axis size from 400 to 1000 elements. Either apply_along_axis got more expensive over time or the original benchmarking was flawed.
525942c to 3b31fa1
updated branch.
Tuning changes don't bother me. For the sorting routines I tuned the size at which fallback to insertion sort took place, but that was about 10 years ago and I would not be surprised if the best value has changed in that time, but a sort is still a sort.
Thanks Julian.
thanks,
Yes, please do the backport. A release note entry would be fine, I can forward port that, I will need to update the notes in any case for rc1.
You don't need to add anything to the list of PRs, I will be updating that.
urg I am an idiot, I didn't shuffle the data properly on the new benchmark, which distorts the result. The real cutoff is more around 550-600 and not 1000 for realistic data ...
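The benchmark itself is not shown in this thread. The following is only a sketch of the general idea of timing against shuffled rather than ordered input. The helper names, the sizes, and the use of `np.median` as the function under test are all assumptions made for illustration.

```python
import timeit
import numpy as np

def make_inputs(rows, axis_size, seed=0):
    """Return (ordered, shuffled) float arrays holding the same values per row."""
    rng = np.random.default_rng(seed)
    ordered = np.tile(np.arange(axis_size, dtype=float), (rows, 1))
    shuffled = ordered.copy()
    for row in shuffled:
        rng.shuffle(row)  # in-place shuffle of each row view
    return ordered, shuffled

def best_time(func, data, repeat=3, number=5):
    """Best-of-`repeat` time in seconds for `number` calls of func(data, axis=-1)."""
    timer = timeit.Timer(lambda: func(data, axis=-1))
    return min(timer.repeat(repeat=repeat, number=number))

ordered, shuffled = make_inputs(rows=50, axis_size=600)
for label, data in (("ordered", ordered), ("shuffled", shuffled)):
    print(label, best_time(np.median, data))
```

Timings vary by machine and NumPy version, so no expected numbers are given. The point is only that both inputs hold the same values and differ in ordering.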
BUG: handle unmasked NaN in ma.median like normal median
This requires to base masked median on sort(endwith=False) as we need to
distinguish Inf and NaN.
Using Inf as filler element of the sort does not work as then the mask is not
guaranteed to be at the end.
Closes gh-8340
Also fixed 1d ma.median not handling np.inf correctly, the nd variant
was ok.
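The behaviour described in this commit message can be sketched as follows, assuming a NumPy version that includes the fix. The helper `describe` and the sample values are made up for illustration.

```python
import numpy as np

def describe(value):
    """Render a scalar median result as 'masked', 'nan', or a float."""
    if value is np.ma.masked:
        return "masked"
    value = float(value)
    return "nan" if np.isnan(value) else value

unmasked_nan = np.ma.array([1.0, np.nan, 3.0])
masked_nan = np.ma.array([1.0, np.nan, 3.0], mask=[False, True, False])

# Plain median propagates NaN.
print(describe(np.median([1.0, np.nan, 3.0])))
# An unmasked NaN in a masked array is treated the same way.
print(describe(np.ma.median(unmasked_nan)))
# A masked NaN is ignored like any other masked value.
print(describe(np.ma.median(masked_nan)))
```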