Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: handle unmasked NaN in ma.median like normal median #8364

Merged
merged 3 commits into from
Dec 12, 2016

Conversation

juliantaylor
Copy link
Contributor

@juliantaylor juliantaylor commented Dec 11, 2016

BUG: handle unmasked NaN in ma.median like normal median

This requires to base masked median on sort(endwith=False) as we need to
distinguish Inf and NaN.
Using Inf as filler element of the sort does not work as then the mask is not
guaranteed to be at the end.
Closes gh-8340

Also fixed 1d ma.median not handling np.inf correctly, the nd variant
was ok.

@juliantaylor
Copy link
Contributor Author

urg no this doesn't actually work like that, I just haven't added -inf tests ...

@juliantaylor
Copy link
Contributor Author

checking for fully masked axes and sorting it out then seems to work, but man is it ugly.
maybe the original indexing can be kept if doing the same, hmm

@juliantaylor
Copy link
Contributor Author

normal indexing doesn't fail anything with the fully masked check, I should probably change it back as its easier to understand.

@charris
Copy link
Member

charris commented Dec 11, 2016

I don't see why the masked values not sorting to the end matters. If we were doing an argsort, sure, but here I don't see that it matters where the infs come from.

@juliantaylor
Copy link
Contributor Author

it doesn't matter but we do have to preserve whether the median comes from a masked value due to undefined ordering or from a fully masked axis. This is now handled by checking the mask again if the median is a masked value.

how do you enable the deprecation warnings with the regular test scripts? just running them -3 does not fail and I want to use the --pdb flag because I don't know where they are coming from.

@charris
Copy link
Member

charris commented Dec 11, 2016

Where does the undefined ordering come from. I still don't understand what that problem is, but I'm assuming that you know how many masked values there are in axis, which you need to know to find the median in any case. So where am I going wrong in the following:

  • if the axis is all masked, whatever the relevant behaviot is
  • replace masked values with inf
  • sort
  • if the last value is nan, then it is an unmasked nan, so return nan
  • otherwise discard the number of masked values from the end. If there were unmasked infs there will be undiscarded inf remaining.

Apropos deprecation warnings, I'm not clear what you want to do. In development versions the default is to raise an error for deprecation warnings, are you saying that isn't happening?

@juliantaylor
Copy link
Contributor Author

the undiscarded infs may be masked as the ordering is undefined.

the deprecation errors are happening when running via test() but not when running the test script as an executable. This is unfortunate as adding pdb does not work in the former. I just need to know how to trigger the warning outside of the testsuite.

@charris
Copy link
Member

charris commented Dec 11, 2016

the undiscarded infs may be masked as the ordering is undefined.

Why does that matter? An inf is an inf, it matters not where it came from and since you know how many masked infs there are, all that remain can be consided unmasked no matter if they came from a masked value or not. I only see a problem is your are doing a argmedian type thing.

I see test failures here running runtests or directly:

$ python2.7 -3 runtests.py # tests fail
$ python2.7 -3 numpy/ma/tests/test_extras.py  # also shows a ton of other warnings

But I also see them without the -3 and they look like different errors than travis. Are you pushing new commits?

@juliantaylor
Copy link
Contributor Author

you can't unconditionally return unmasked values as the result can be masked when the whole axis is masked.

the warnings are probably Deprecation warnings from __getslice__, I guess I need to use supress_warnings to get rid of them like the regular testsuite does.

@juliantaylor
Copy link
Contributor Author

updated reverting to the original indexing, seems to be enough with the additional mask check
also found a bug in the new code due to not test for non-float, used the minimum_fill from the median result which is always float
the deprecation warning should also be fixed by ignoring.

@juliantaylor
Copy link
Contributor Author

hm is that python3.6 error in travis a bug in python? why would that work on all other pythons

@MSeifert04
Copy link
Contributor

hm is that python3.6 error in travis a bug in python? why would that work on all other pythons

That's actually "only" a deprecation in 3.6 because you either need to use raw strings or may only use "explicitly specified" escapes together with \. For example \n (linebreak) is allowed but \. is not allowed. Simple fix is to use \\. there

@MSeifert04
Copy link
Contributor

I think only the listed escape sequences are not deprecated: https://docs.python.org/2/reference/lexical_analysis.html#string-literals

@juliantaylor
Copy link
Contributor Author

interesting, so making it a raw string should work. Probably should also fix the python2 guarded use in nosetester in case this ever gets a real syntax error.

def _median_nancheck(data, result, axis, out):
"""
Check median result from data for NaN values at the end and return NaN
in that case. Also used for masked median
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A standard docstring would be good.

n = np.isnan(data[..., -1])
# masked NaN are fine
try:
n = n.filled(False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to deal with the possibility that n is not masked, no? Either a comment or explicit type check would help. Should also be mentioned in the docstring that it handles both.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm yes an explicit MaskedArray type is more self-documenting

np.true_divide(s, 2., casting='unsafe')
s = np.lib.function_base._median_nancheck(asorted, s, axis, out)
else:
s = mid.mean(out=out)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of curiousity, what does this do for not inexact types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cast them to inexact, median always returns float, note this is not changed from before. That the old 1d case did use mean for inexact directly was a bug, np.ma.median([inf, inf]) was wrongly masked (test added)

else:
s = mid.mean(out=out)

# if result is masked either it is full of minimum fill or all masked
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does minimum fill come from here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be the array is full of the minimum fill value, or full enough so that the median is minimum full (= inf)

@charris
Copy link
Member

charris commented Dec 12, 2016

Just a couple of questions and a request for a complete docstring for _median_nancheck. I'm a bit bothered by a dual purpose function in function_base but can't think of a clean solution that doesn't seem repetitive.

@charris
Copy link
Member

charris commented Dec 12, 2016

I'm tempted to suggest another private file in numpy/lib to contain the common function _median_nancheck, perhaps something like _common.py, but I think that would only make sense if there was more than one function to put in it. Of course, we could also have two almost identical copies, but that presents something of a maintenance problem...

@juliantaylor
Copy link
Contributor Author

I also don't like the use of the nancheck helper, but it just needs one extra operation to work for both and not repeating it is important as it is not trivial code.
Maybe it is better placed in lib/utils.py?

@charris
Copy link
Member

charris commented Dec 12, 2016

lib/utils.py might be a good place, could take away the leading _ then. Is the function likely to be useful for more than the current two functions?

@juliantaylor
Copy link
Contributor Author

I don't want people using the function as it is just use to implement a detail and we might want to change it. Keeping it explicitly marked private should help there.
It is quite specific to sorting like code, percentile has similar code but its probably to different to mash into the current median nancheck

@charris
Copy link
Member

charris commented Dec 12, 2016

OK.

This requires to base masked median on sort(endwith=False) as we need to
distinguish Inf and NaN.
Using Inf as filler element of the sort does not work as then the mask is not
guaranteed to be at the end.
Closes numpygh-8340

Also fixed 1d ma.median not handling np.inf correctly, the nd variant
was ok.
Python 3.6 gets more strict about escape sequences, \. is invalid.
As it could get a syntax error the version check would not work.
The apply_along_axis path is significantly more expensive than currently
accounted for in the check. Increase the minimum axis size from 400 to
1000 elements.
Either apply_along_axis got more expensive over time or the original
benchmarking was flawed.
@juliantaylor
Copy link
Contributor Author

updated branch.
Also added a retuning of the small median threshold in nanmedian, it used to go into a apply_along_axis path with axes larger than 400, but that currently is still much slower. At around 1000 it gets profitable.
Not sure why the original value was 400, maybe stuff got slower or the original benchmark was flawed.
As this just changes a cutoff its probably also still suitable for 1.12, e.g. astropy messed around with similar tuning so it is relevant.

@charris
Copy link
Member

charris commented Dec 12, 2016

Tuning changes don't bother me. For the sorting routines I tuned the size at which fallback to insertion sort took place, but that was about 10 years ago and I would not be surprised if the best value has changed in that time, but a sort is still a sort.

@charris charris merged commit aab7993 into numpy:master Dec 12, 2016
@charris
Copy link
Member

charris commented Dec 12, 2016

Thanks Julian.

@juliantaylor
Copy link
Contributor Author

thanks,
should I do the backport? it also needs a release not entry like the one we had in 1.10 for median

@charris
Copy link
Member

charris commented Dec 12, 2016

Yes, please do the backport. A release note entry would be fine, I can forward port that, I will need to update the notes in any case for rc1.

@charris
Copy link
Member

charris commented Dec 12, 2016

You don't need to add anything to the list of PRs, I will be updating that.

@charris charris removed this from the 1.12.0 release milestone Dec 12, 2016
@juliantaylor juliantaylor deleted the masked-median-nan branch December 13, 2016 19:16
@juliantaylor
Copy link
Contributor Author

urg I am an idiot, I didn't shuffle the data properly on the new benchmark, which distorts the result. The real cutoff is more around 550-600 and not 1000 for realistic data ...
I'll probably update it again later, doesn't need to hold up any rc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regression in ma.median handling of nan values, and calculates incorrect value
3 participants