Closes issue #586. Updates handling of nan behaviour in numpy.lib.function_base.median. #4287
Conversation
# Check if the array contains any nan's
ids = None
if axis is None or a.ndim == 1:
    if np.any(np.isnan(a)):
This is probably not the best option performance wise, consider the following:
In [14]: a = np.random.rand(1000)
In [15]: np.median(a)
Out[15]: 0.51385702922222753
In [16]: np.partition(a, (499, 500))[499:501].sum() / 2
Out[16]: 0.51385702922222753
In [17]: %timeit np.partition(a, (499, 500))[499:501].sum() / 2
10000 loops, best of 3: 21.9 µs per loop
In [18]: %timeit np.any(np.isnan(a))
100000 loops, best of 3: 14 µs per loop
In [19]: %timeit np.partition(a, (499, 500, -1))[499:501].sum() / 2
10000 loops, best of 3: 24.4 µs per loop
Calling np.isnan takes about 66% of the time of calling np.partition on this array, a significant slowdown. If, on the other hand, you add -1 to the list of indices to place in sorted position in the call to np.partition, as in line 19 above, the slowdown is much less noticeable. And because NaN is always placed at the end of an array when sorting, you can then check whether the array contains a NaN by checking whether its last item after the call to partition is one.
You may want to check for a wider variety of sizes of the input, to confirm if the advantage remains.
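The trick described above can be sketched as a standalone function (my illustration, not the PR's actual code; the function name is hypothetical): include the last position in the kth list so the maximum lands in sorted order, then a single scalar isnan check detects any NaN.

```python
import numpy as np

def median_with_nan_check(a):
    # Sketch of jaimefrio's suggestion: partitioning also at n - 1 puts
    # the array maximum (a NaN, if any is present) in the last slot, so
    # one scalar isnan call replaces a full np.isnan scan.
    a = np.asarray(a, dtype=float).ravel()
    n = a.size
    if n % 2 == 0:
        part = np.partition(a, (n // 2 - 1, n // 2, n - 1))
        if np.isnan(part[-1]):          # NaN always sorts to the end
            return np.nan
        return part[n // 2 - 1 : n // 2 + 1].mean()
    else:
        part = np.partition(a, (n // 2, n - 1))
        if np.isnan(part[-1]):
            return np.nan
        return part[n // 2]
```

For an even-length array this partitions three positions instead of two, which, per the timings above, is much cheaper than a separate isnan pass.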
Because of the performance penalty, it is probably also a good idea to only do something different if the dtype can hold a NAN, i.e. only for floating point types.
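The dtype gate suggested here is a one-liner; a minimal sketch (the helper name is mine, not NumPy API):

```python
import numpy as np

def needs_nan_check(a):
    # Only inexact dtypes (floating and complex floating) can hold a NaN,
    # so the check can be skipped entirely for integer/bool input.
    return np.issubdtype(np.asarray(a).dtype, np.inexact)
```

np.inexact is the common parent of all float and complex scalar types, so this covers float16 through float64 as well as the complex types in one test.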
I wonder if it would be better to have an explicit
@charris: nanmedian would also be good (see commit message), but IIRC this
I also don't like the performance impact of this. As @jaimefrio mentioned, NaN is always the maximum of a sort, so it is better to just add the maximum to the partition call and then check if the last element is a NaN; as the partition is iterative, the cost is not so high. This also allows implementing nanmedian trivially, by simply shifting the index by the number of NaNs in the axis. Note this diff probably conflicts with #3908.
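The index-shift idea for nanmedian can be sketched for the 1-D case as follows (my reconstruction, not the PR's code; the function name is hypothetical): since NaNs sort to the tail, counting them tells you how many leading slots hold valid values, and the median indices are computed over that count.

```python
import numpy as np

def nanmedian_1d(a):
    a = np.asarray(a, dtype=float).ravel()
    n_nan = np.count_nonzero(np.isnan(a))
    n = a.size - n_nan                  # number of valid (non-NaN) values
    if n == 0:
        return np.nan
    # NaNs compare greater than everything when partitioning, so the
    # k-th smallest valid value sits at sorted position k for k < n.
    kth = ((n - 1) // 2, n // 2)
    part = np.partition(a, kth)
    return 0.5 * (part[kth[0]] + part[kth[1]])
```

For odd valid counts the two indices coincide, so the average is a no-op; for even counts it is the usual midpoint of the two central values.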
Ok. I will re-write while keeping performance in mind, and I will check a wider variety of array sizes to see the performance impact. Adding the -1 index to the partition is brilliant! I wouldn't have guessed that isnan is so slow. @jaimefrio: since I'm new at this, is there a simple way to check if the dtype is floating? Something that would catch float, float32, etc.? @juliantaylor: shifting the index is exactly what I had in mind for
The types that can have NaNs are the same as those that return true from

Does the partition trick work for complex values? I worry that complex
the sort order for complex is:
so one needs to select the size - 3 kth element to catch all NaNs.

With maximum specialization I meant the selection code in selection.c.src, see dumbselect. I don't think there is a way to avoid an explicit loop for nanmedian, though maybe you can do a pseudo binary search. (searchsorted won't work, as the partitioned array is not fully sorted.)
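For reference, NumPy documents the complex sort order as lexicographic, real part first and then imaginary, with the three NaN-containing forms (R + NaNj, NaN + Rj, NaN + NaNj) ordered after all finite values, which is presumably where the size - 3 figure comes from. A quick illustration:

```python
import numpy as np

# Complex values sort by real part, then imaginary part; a value with a
# NaN in either component sorts after every finite value.
a = np.array([2 + 1j, 1 + 5j, 1 + 2j, complex(np.nan, 0)])
s = np.sort(a)
print(s)  # finite values in lexicographic order, the NaN entry last
```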
A procedural question (again, since I'm new here): Should I close this pull request until I make the discussed changes? Or is there a way to update this request (avoiding duplicates) once I have a newer version?

you can continue to push to the branch and this issue will update
I don't think
(You may have to repeat the above test several times, as it often does group all NaNs at the end, although not always.) Because we cannot group all NaNs at the end, it is irrelevant, but I think that a binary search for a needle, on a haystack that has been partitioned around that needle, would return correct results. An application that would require

I am not sure if it would have much use outside of counting NaNs, but you could run the quicksort partition algorithm with pivots supplied by the user on an unsorted array, and return a partially sorted array plus the positions where those pivots would have to be inserted in the sorted array to preserve order, kind of a
right, I was thinking of partial sort semantics, which would be much slower than running isnan first. How dumb of me...
Also added a flag "check_nan" that can be turned off to regain the original performance, and a nanmedian function to calculate the median while ignoring NaNs.
Couple of updates. I tried a number of approaches to speed up the NaN-checking in the median function, using a little benchmark code:
Results are here: So, while there was a cross-over array size where np.isnan(np.max(a)) was a little bit faster, I didn't implement a cross-over because I figured it would be different on different machines, and it makes the implementation uglier. Because of the basically unavoidable performance hit, I added a flag that you can turn off to no longer check for NaNs. At the same time, I'm only checking for NaNs for datatypes that support them (thanks @njsmith). Then I added a nanmedian function. Someone pointed out that scipy already had this, so I took some hints from their implementation, but mine seems to be considerably faster. To compare the performance of the original median, my new median, nanmedian, and scipy's nanmedian, I wrote this little script, where I put my development version in a 'numpy2' folder:
So, with the flag set to False I seem to recover the performance of the original median function. What do you guys think? (Oh, and I still need to update the comments on nanmedian.)
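The benchmark script itself is not shown above; a minimal sketch of the kind of comparison described (my reconstruction, not the author's script, with a fixed seed since partition performance depends on the data layout) might look like:

```python
import timeit
import numpy as np

# Time the two NaN-check strategies discussed above on one seeded array.
rng = np.random.RandomState(42)   # fixed seed so runs are comparable
a = rng.rand(10000)

t_any = timeit.timeit(lambda: np.any(np.isnan(a)), number=1000)
t_max = timeit.timeit(lambda: np.isnan(np.max(a)), number=1000)
print("np.any(np.isnan(a)): %.4f s" % t_any)
print("np.isnan(np.max(a)): %.4f s" % t_max)
```

Repeating this over a range of array sizes would expose the cross-over point mentioned above.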
This competes with #4307?
It looks like we both implemented a version of nanmedian, but I also dealt with the handling of NaNs in the plain-old median function. @dfreese, do you have any suggestions? |
it would be best if the nanmedian can be put into a separate PR; it is independent of checking for NaN in the regular median, and my tiny brain is getting confused :( Also, this should be put on hold until #3908 is resolved; any change to median will conflict with it. You aren't seeding the rng in your benchmarks, are you? Partition performance depends on the layout of the data array, so for better benchmarks you should use the same random seed.
@empeeu, your implementation uses apply_along_axis, which was suggested for mine, so your version should probably be used going forward. I do have unit tests for nanmedian that could prove useful.
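The apply_along_axis structure mentioned here can be sketched as follows (my illustration, not either PR's actual code): reduce each 1-D slice along the requested axis independently, dropping NaNs before taking the median.

```python
import numpy as np

def _nanmedian_1d(v):
    # Drop NaNs from one 1-D slice; an all-NaN slice yields NaN.
    v = v[~np.isnan(v)]
    return np.median(v) if v.size else np.nan

def nanmedian(a, axis=None):
    a = np.asarray(a, dtype=float)
    if axis is None:
        return _nanmedian_1d(a.ravel())
    return np.apply_along_axis(_nanmedian_1d, axis, a)
```

This is simple but slow, since apply_along_axis runs a Python-level loop over slices; the index-shift-into-partition approach discussed earlier avoids that.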
here the nan specialization I was talking about:

--- a/numpy/core/src/npysort/selection.c.src
+++ b/numpy/core/src/npysort/selection.c.src
@@ -322,6 +322,19 @@ int
store_pivot(kth, kth, pivots, npiv);
return 0;
}
+ else if (kth == num - 1) {
+ npy_intp k;
+ npy_intp maxidx = low;
+ @type@ maxval = v[IDX(low)];
+ for (k = low + 1; k < num; k++) {
+ if (!@TYPE@_LT(v[IDX(k)], maxval)) {
+ maxidx = k;
+ maxval = v[IDX(k)];
+ }
+ }
+ SWAP(SORTEE(kth), SORTEE(maxidx));
+ return 0;
+ }
/* dumb integer msb, float npy_log2 too slow for small parititions */
    {

that gets you the -1 partition almost for free; before:
after
@dfreese: I agree with juliantaylor that nanmedian should go in a separate PR, so I will remove it from this one. Since you have a PR open for nanmedian, why don't you take what you find useful from my nanmedian and combine it with yours? I'll just focus on median; nanmedian is all yours. @juliantaylor: Nice specialization. I ventured into the C code a little bit the other day but got lost, so I appreciate seeing where the partition code is (I was thinking of writing a count_nan function, so I went looking for count_nonzero). I'll benchmark to see what happens.
…dling of nans in median.
@empeeu: Sure, that sounds good.
the partition special case for maximum element has been merged into master
@juliantaylor How about we put this in and then revisit #3908 to see if we can thrash out the issues. |
I would prefer sorting out #3908 first; also, this PR is not ready to be merged yet.
… checking is now pretty much as fast as the old one. Hence, removed the check_nan flag from previous commits.
@juliantaylor You really know your stuff. Benchmark for the latest commit: I was a little worried by how close they were, but I checked, and it seems to be correct. Waiting on #3908 now. Let me know if there's anything else to do here.
@empeeu Needs a rebase.
Seems like I've completely messed up the rebase. I'm going to close this up and try again.
Please review and add to main branch. I also included a unit test to help with the review process.
(also, I'm considering writing a nanmedian function which would ignore nans, similar to the nansum, nanmax, etc. series of functions.)