BUG: handle unmasked NaN in ma.median like normal median #8364

juliantaylor · 2016-12-11T13:43:40Z

BUG: handle unmasked NaN in ma.median like normal median

This requires to base masked median on sort(endwith=False) as we need to
distinguish Inf and NaN.
Using Inf as filler element of the sort does not work as then the mask is not
guaranteed to be at the end.
Closes gh-8340

Also fixed 1d ma.median not handling np.inf correctly, the nd variant
was ok.

juliantaylor · 2016-12-11T13:46:31Z

urg no this doesn't actually work like that, I just haven't added -inf tests ...

juliantaylor · 2016-12-11T14:31:23Z

checking for fully masked axes and sorting it out then seems to work, but man is it ugly.
maybe the original indexing can be kept if doing the same, hmm

juliantaylor · 2016-12-11T15:47:46Z

normal indexing doesn't fail anything with the fully masked check, I should probably change it back as its easier to understand.

charris · 2016-12-11T16:56:01Z

I don't see why the masked values not sorting to the end matters. If we were doing an argsort, sure, but here I don't see that it matters where the infs come from.

juliantaylor · 2016-12-11T19:20:52Z

it doesn't matter but we do have to preserve whether the median comes from a masked value due to undefined ordering or from a fully masked axis. This is now handled by checking the mask again if the median is a masked value.

how do you enable the deprecation warnings with the regular test scripts? just running them -3 does not fail and I want to use the --pdb flag because I don't know where they are coming from.

charris · 2016-12-11T19:36:47Z

Where does the undefined ordering come from. I still don't understand what that problem is, but I'm assuming that you know how many masked values there are in axis, which you need to know to find the median in any case. So where am I going wrong in the following:

if the axis is all masked, whatever the relevant behaviot is
replace masked values with inf
sort
if the last value is nan, then it is an unmasked nan, so return nan
otherwise discard the number of masked values from the end. If there were unmasked infs there will be undiscarded inf remaining.

Apropos deprecation warnings, I'm not clear what you want to do. In development versions the default is to raise an error for deprecation warnings, are you saying that isn't happening?

juliantaylor · 2016-12-11T20:16:46Z

the undiscarded infs may be masked as the ordering is undefined.

the deprecation errors are happening when running via test() but not when running the test script as an executable. This is unfortunate as adding pdb does not work in the former. I just need to know how to trigger the warning outside of the testsuite.

charris · 2016-12-11T20:55:19Z

the undiscarded infs may be masked as the ordering is undefined.

Why does that matter? An inf is an inf, it matters not where it came from and since you know how many masked infs there are, all that remain can be consided unmasked no matter if they came from a masked value or not. I only see a problem is your are doing a argmedian type thing.

I see test failures here running runtests or directly:

$ python2.7 -3 runtests.py # tests fail
$ python2.7 -3 numpy/ma/tests/test_extras.py  # also shows a ton of other warnings

But I also see them without the -3 and they look like different errors than travis. Are you pushing new commits?

juliantaylor · 2016-12-11T21:07:22Z

you can't unconditionally return unmasked values as the result can be masked when the whole axis is masked.

the warnings are probably Deprecation warnings from __getslice__, I guess I need to use supress_warnings to get rid of them like the regular testsuite does.

juliantaylor · 2016-12-11T21:48:03Z

updated reverting to the original indexing, seems to be enough with the additional mask check
also found a bug in the new code due to not test for non-float, used the minimum_fill from the median result which is always float
the deprecation warning should also be fixed by ignoring.

juliantaylor · 2016-12-11T22:05:47Z

hm is that python3.6 error in travis a bug in python? why would that work on all other pythons

MSeifert04 · 2016-12-11T22:26:22Z

hm is that python3.6 error in travis a bug in python? why would that work on all other pythons

That's actually "only" a deprecation in 3.6 because you either need to use raw strings or may only use "explicitly specified" escapes together with \. For example \n (linebreak) is allowed but \. is not allowed. Simple fix is to use \\. there

MSeifert04 · 2016-12-11T22:27:43Z

I think only the listed escape sequences are not deprecated: https://docs.python.org/2/reference/lexical_analysis.html#string-literals

juliantaylor · 2016-12-11T22:43:16Z

interesting, so making it a raw string should work. Probably should also fix the python2 guarded use in nosetester in case this ever gets a real syntax error.

charris · 2016-12-12T01:05:56Z

numpy/lib/function_base.py

+def _median_nancheck(data, result, axis, out):
+    """
+    Check median result from data for NaN values at the end and return NaN
+    in that case. Also used for masked median


A standard docstring would be good.

charris · 2016-12-12T01:28:23Z

numpy/lib/function_base.py

+    n = np.isnan(data[..., -1])
+    # masked NaN are fine
+    try:
+        n = n.filled(False)


This is to deal with the possibility that n is not masked, no? Either a comment or explicit type check would help. Should also be mentioned in the docstring that it handles both.

hm yes an explicit MaskedArray type is more self-documenting

charris · 2016-12-12T01:43:47Z

numpy/ma/extras.py

+            np.true_divide(s, 2., casting='unsafe')
+            s = np.lib.function_base._median_nancheck(asorted, s, axis, out)
+        else:
+            s = mid.mean(out=out)


Out of curiousity, what does this do for not inexact types?

cast them to inexact, median always returns float, note this is not changed from before. That the old 1d case did use mean for inexact directly was a bug, np.ma.median([inf, inf]) was wrongly masked (test added)

charris · 2016-12-12T01:46:12Z

numpy/ma/extras.py

+        else:
+            s = mid.mean(out=out)
+
+        # if result is masked either it is full of minimum fill or all masked


Where does minimum fill come from here?

should be the array is full of the minimum fill value, or full enough so that the median is minimum full (= inf)

charris · 2016-12-12T01:55:35Z

Just a couple of questions and a request for a complete docstring for _median_nancheck. I'm a bit bothered by a dual purpose function in function_base but can't think of a clean solution that doesn't seem repetitive.

charris · 2016-12-12T04:06:45Z

I'm tempted to suggest another private file in numpy/lib to contain the common function _median_nancheck, perhaps something like _common.py, but I think that would only make sense if there was more than one function to put in it. Of course, we could also have two almost identical copies, but that presents something of a maintenance problem...

juliantaylor · 2016-12-12T20:41:02Z

I also don't like the use of the nancheck helper, but it just needs one extra operation to work for both and not repeating it is important as it is not trivial code.
Maybe it is better placed in lib/utils.py?

charris · 2016-12-12T21:16:05Z

lib/utils.py might be a good place, could take away the leading _ then. Is the function likely to be useful for more than the current two functions?

juliantaylor · 2016-12-12T21:49:27Z

I don't want people using the function as it is just use to implement a detail and we might want to change it. Keeping it explicitly marked private should help there.
It is quite specific to sorting like code, percentile has similar code but its probably to different to mash into the current median nancheck

charris · 2016-12-12T22:06:18Z

OK.

This requires to base masked median on sort(endwith=False) as we need to distinguish Inf and NaN. Using Inf as filler element of the sort does not work as then the mask is not guaranteed to be at the end. Closes numpygh-8340 Also fixed 1d ma.median not handling np.inf correctly, the nd variant was ok.

Python 3.6 gets more strict about escape sequences, \. is invalid. As it could get a syntax error the version check would not work.

The apply_along_axis path is significantly more expensive than currently accounted for in the check. Increase the minimum axis size from 400 to 1000 elements. Either apply_along_axis got more expensive over time or the original benchmarking was flawed.

juliantaylor · 2016-12-12T22:23:27Z

updated branch.
Also added a retuning of the small median threshold in nanmedian, it used to go into a apply_along_axis path with axes larger than 400, but that currently is still much slower. At around 1000 it gets profitable.
Not sure why the original value was 400, maybe stuff got slower or the original benchmark was flawed.
As this just changes a cutoff its probably also still suitable for 1.12, e.g. astropy messed around with similar tuning so it is relevant.

charris · 2016-12-12T23:10:07Z

Tuning changes don't bother me. For the sorting routines I tuned the size at which fallback to insertion sort took place, but that was about 10 years ago and I would not be surprised if the best value has changed in that time, but a sort is still a sort.

charris · 2016-12-12T23:11:25Z

Thanks Julian.

juliantaylor · 2016-12-12T23:14:27Z

thanks,
should I do the backport? it also needs a release not entry like the one we had in 1.10 for median

charris · 2016-12-12T23:16:48Z

Yes, please do the backport. A release note entry would be fine, I can forward port that, I will need to update the notes in any case for rc1.

charris · 2016-12-12T23:17:34Z

You don't need to add anything to the list of PRs, I will be updating that.

juliantaylor · 2016-12-13T19:17:35Z

urg I am an idiot, I didn't shuffle the data properly on the new benchmark, which distorts the result. The real cutoff is more around 550-600 and not 1000 for realistic data ...
I'll probably update it again later, doesn't need to hold up any rc.

juliantaylor force-pushed the masked-median-nan branch from 40b3d49 to 3643355 Compare December 11, 2016 14:29

juliantaylor force-pushed the masked-median-nan branch from 3643355 to 025a65f Compare December 11, 2016 21:46

juliantaylor force-pushed the masked-median-nan branch from 025a65f to 525942c Compare December 11, 2016 22:45

charris added 00 - Bug component: numpy.ma masked arrays labels Dec 12, 2016

charris added this to the 1.12.0 release milestone Dec 12, 2016

charris reviewed Dec 12, 2016

View reviewed changes

juliantaylor added 3 commits December 12, 2016 23:19

MAINT: make regex strings raw

82a1d81

Python 3.6 gets more strict about escape sequences, \. is invalid. As it could get a syntax error the version check would not work.

juliantaylor force-pushed the masked-median-nan branch from 525942c to 3b31fa1 Compare December 12, 2016 22:20

charris merged commit aab7993 into numpy:master Dec 12, 2016

juliantaylor mentioned this pull request Dec 12, 2016

BUG: Fixes for ma.median and nanpercentile. #8372

Merged

charris removed this from the 1.12.0 release milestone Dec 12, 2016

juliantaylor deleted the masked-median-nan branch December 13, 2016 19:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: handle unmasked NaN in ma.median like normal median #8364

BUG: handle unmasked NaN in ma.median like normal median #8364

juliantaylor commented Dec 11, 2016 •

edited

juliantaylor commented Dec 11, 2016

juliantaylor commented Dec 11, 2016

juliantaylor commented Dec 11, 2016

charris commented Dec 11, 2016

juliantaylor commented Dec 11, 2016

charris commented Dec 11, 2016

juliantaylor commented Dec 11, 2016

charris commented Dec 11, 2016

juliantaylor commented Dec 11, 2016

juliantaylor commented Dec 11, 2016

juliantaylor commented Dec 11, 2016

MSeifert04 commented Dec 11, 2016

MSeifert04 commented Dec 11, 2016

juliantaylor commented Dec 11, 2016

charris Dec 12, 2016

charris Dec 12, 2016

juliantaylor Dec 12, 2016

charris Dec 12, 2016

juliantaylor Dec 12, 2016

charris Dec 12, 2016

juliantaylor Dec 12, 2016

charris commented Dec 12, 2016

charris commented Dec 12, 2016

juliantaylor commented Dec 12, 2016

charris commented Dec 12, 2016

juliantaylor commented Dec 12, 2016

charris commented Dec 12, 2016

juliantaylor commented Dec 12, 2016

charris commented Dec 12, 2016

charris commented Dec 12, 2016

juliantaylor commented Dec 12, 2016

charris commented Dec 12, 2016

charris commented Dec 12, 2016

juliantaylor commented Dec 13, 2016

BUG: handle unmasked NaN in ma.median like normal median #8364

BUG: handle unmasked NaN in ma.median like normal median #8364

Conversation

juliantaylor commented Dec 11, 2016 • edited

juliantaylor commented Dec 11, 2016

juliantaylor commented Dec 11, 2016

juliantaylor commented Dec 11, 2016

charris commented Dec 11, 2016

juliantaylor commented Dec 11, 2016

charris commented Dec 11, 2016

juliantaylor commented Dec 11, 2016

charris commented Dec 11, 2016

juliantaylor commented Dec 11, 2016

juliantaylor commented Dec 11, 2016

juliantaylor commented Dec 11, 2016

MSeifert04 commented Dec 11, 2016

MSeifert04 commented Dec 11, 2016

juliantaylor commented Dec 11, 2016

charris Dec 12, 2016

Choose a reason for hiding this comment

charris Dec 12, 2016

Choose a reason for hiding this comment

juliantaylor Dec 12, 2016

Choose a reason for hiding this comment

charris Dec 12, 2016

Choose a reason for hiding this comment

juliantaylor Dec 12, 2016

Choose a reason for hiding this comment

charris Dec 12, 2016

Choose a reason for hiding this comment

juliantaylor Dec 12, 2016

Choose a reason for hiding this comment

charris commented Dec 12, 2016

charris commented Dec 12, 2016

juliantaylor commented Dec 12, 2016

charris commented Dec 12, 2016

juliantaylor commented Dec 12, 2016

charris commented Dec 12, 2016

juliantaylor commented Dec 12, 2016

charris commented Dec 12, 2016

charris commented Dec 12, 2016

juliantaylor commented Dec 12, 2016

charris commented Dec 12, 2016

charris commented Dec 12, 2016

juliantaylor commented Dec 13, 2016

juliantaylor commented Dec 11, 2016 •

edited