
Closes issue #586. Updates handling of nan behaviour in numpy.lib.function_base.median. #4287

Closed
empeeu wants to merge 24 commits

Conversation

@empeeu (Contributor) commented Feb 13, 2014

Please review and add to the main branch. I also included a unit test to help with the review process.

(also, I'm considering writing a nanmedian function which would ignore nans, similar to the nansum, nanmax, etc. series of functions.)

#Check if the array contains any nan's
ids = None
if axis is None or a.ndim == 1:
    if np.any(np.isnan(a)):
@jaimefrio (Member) commented on the diff:

This is probably not the best option performance-wise; consider the following:

In [14]: a = np.random.rand(1000)

In [15]: np.median(a)
Out[15]: 0.51385702922222753

In [16]: np.partition(a, (499, 500))[499:501].sum() / 2
Out[16]: 0.51385702922222753

In [17]: %timeit np.partition(a, (499, 500))[499:501].sum() / 2
10000 loops, best of 3: 21.9 µs per loop

In [18]: %timeit np.any(np.isnan(a))
100000 loops, best of 3: 14 µs per loop

In [19]: %timeit np.partition(a, (499, 500, -1))[499:501].sum() / 2
10000 loops, best of 3: 24.4 µs per loop

Calling np.isnan takes about 66% of the time of the np.partition call on this array, a significant slowdown. If, on the other hand, you add -1 to the list of indices to place in sorted position in the call to np.partition, as in In [19] above, the slowdown is much less noticeable, and because NaN is always placed at the end of an array when sorting, you can now check whether there is a NaN in the array by checking whether its last item is NaN after the call to partition.

You may want to check a wider variety of input sizes to confirm that the advantage holds.
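
A rough 1-D sketch of that suggestion (illustrative only, not the code in this PR; the function name and the return-nan policy are assumptions):

import numpy as np

def median_1d_with_nan_check(a):
    # Partition the middle element(s) and the maximum into place in one call.
    # Sorting places nan at the end, so after the partition the last slot
    # holds a nan if and only if the array contains one.
    a = np.asarray(a, dtype=np.float64).ravel().copy()
    n = a.size
    kth = (n // 2 - 1, n // 2, n - 1) if n % 2 == 0 else (n // 2, n - 1)
    a.partition(kth)
    if np.isnan(a[-1]):
        return np.nan
    if n % 2 == 0:
        return 0.5 * (a[n // 2 - 1] + a[n // 2])
    return a[n // 2]

median_1d_with_nan_check([1.0, np.nan, 3.0])    # nan
median_1d_with_nan_check([1.0, 2.0, 3.0, 4.0])  # 2.5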

@jaimefrio (Member):

Because of the performance penalty, it is probably also a good idea to only do something different if the dtype can hold a NAN, i.e. only for floating point types.

@charris (Member) commented Feb 13, 2014

I wonder if it would be better to have an explicit nanmedian function?

@njsmith (Member) commented Feb 13, 2014

@charris: nanmedian would also be good (see commit message), but IIRC this
is about median's output being flat out wrong currently in the presence of
NaNs.

@juliantaylor (Contributor):
I also don't like the performance impact of this. As @jaimefrio mentioned, nan is always the maximum in a sort, so it is better to just add the maximum index to the partition call and then check whether the last element is a nan; as the partition is iterative, the cost is not so high.
It can be reduced even further by specializing the maximum case; currently only the minimum (small kth) is specialized.

This also allows implementing nanmedian trivially, by simply shifting the index by the number of nans along the axis (a sketch of that idea follows below).

Note that this diff probably conflicts with #3908.
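
A minimal 1-D sketch of that index-shift idea (assumed names, not the PR's implementation):

import numpy as np

def nanmedian_1d(a):
    # nans sort to the end, so after partitioning the valid (non-nan) values
    # occupy the leading positions; the median of the valid values therefore
    # sits at an index shifted down by the nan count.
    a = np.asarray(a, dtype=np.float64).ravel().copy()
    n_valid = a.size - np.count_nonzero(np.isnan(a))
    if n_valid == 0:
        return np.nan
    if n_valid % 2:                       # odd number of valid values
        a.partition(n_valid // 2)
        return a[n_valid // 2]
    a.partition((n_valid // 2 - 1, n_valid // 2))
    return 0.5 * (a[n_valid // 2 - 1] + a[n_valid // 2])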

@empeeu (Contributor, Author) commented Feb 13, 2014

Ok. I will rewrite while keeping performance in mind, and I will check a wider variety of array sizes to see the performance impact. Adding the -1 index to the partition is brilliant! I wouldn't have guessed that isnan is so slow.

@jaimefrio Since I'm new at this, is there a simple way to check whether the dtype is floating point? Something that would catch float, float32, etc.?

@juliantaylor Shifting the index is exactly what I had in mind for nanmedian. However, it gets a little tricky with multi-dimensional arrays, because each slice along the axis might have a different number of nans. Adding a loop was my best idea so far, but I can't imagine that being efficient, so I'll take suggestions there. I'll also look at specializing maximum, though I'm not sure what that means.

@njsmith (Member) commented Feb 13, 2014

The types that can have NaNs are the same as those that return true from
np.issubdtype(a.dtype, np.inexact). See
http://docs.scipy.org/doc/numpy/reference/arrays.scalars.html
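
A quick sketch of that gate (illustrative only; the helper name is made up):

import numpy as np

def can_hold_nan(dtype):
    # floating point and complex types are subtypes of np.inexact;
    # integer and boolean arrays can never contain a nan
    return np.issubdtype(dtype, np.inexact)

can_hold_nan(np.dtype(np.int64))       # False
can_hold_nan(np.dtype(np.float32))     # True
can_hold_nan(np.dtype(np.complex128))  # True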

Does the partition trick work for complex values? I worry that complex values with one nan component and one non-nan component might get sorted in with the rest of the values...

@juliantaylor (Contributor):
the sort order for complex is:

[R + Rj, R + nanj, nan + Rj, nan + nanj]

so one needs to select size - 3 as the kth element to catch all the nans

By maximum specialization I meant the selection code in selection.c.src; see dumbselect.
You can skip that for this PR if you are unfamiliar with C; it's an optimization we can add later.

I don't think there is a way to avoid an explicit loop for nanmedian, though maybe you can do a pseudo binary search. (searchsorted won't work as the partitioned array is not fully sorted)
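
One way to spell that explicit loop, sketched with np.apply_along_axis (names are illustrative, not either PR's implementation):

import numpy as np

def _nanmedian1d(v):
    # drop the nans in this slice, then take the ordinary median of the rest
    v = v[~np.isnan(v)]
    return np.median(v) if v.size else np.nan

def nanmedian_nd(a, axis=None):
    a = np.asarray(a, dtype=np.float64)
    if axis is None:
        return _nanmedian1d(a.ravel())
    # each slice along `axis` may hold a different number of nans,
    # so the slices are reduced one at a time
    return np.apply_along_axis(_nanmedian1d, axis, a)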

@empeeu (Contributor, Author) commented Feb 13, 2014

A procedural question (again, since I'm new here): should I close this pull request until I make the discussed changes? Or is there a way to update this request (avoiding duplicates) once I have a newer version?

@juliantaylor (Contributor):
You can continue to push to the branch and this PR will update.
If you amend or rebase you have to use git push -f.

@jaimefrio (Member):
I don't think np.nanmedian can avoid a call to np.isnan. The indices in the call to np.partition refer to positions in the array, not to positions in the relative ordering of unique items. So calling np.partition with -1 will make sure that the last item of the array is in its correct sorted position, but if there are duplicates of that largest value, there are no guarantees as to where those will end up:

>>> a = np.random.rand(100)
>>> a[np.random.randint(100, size=(10,))] = np.nan
>>> np.partition(a, -1)
array([ 0.15734533,  0.37424167,  0.2857852 ,  0.43048991,  0.02425887,
        ...
        0.90945659,  0.71758622,  0.99375881,         nan,  0.99419513,
               nan,         nan,         nan,  0.9975807 ,         nan,
               nan,         nan,         nan,         nan,         nan])

(You may have to repeat the above test several times, as it often does group all NANs at the end, although not always)

Because we cannot group all NaNs at the end this is irrelevant here, but I think that a binary search for a needle, on a haystack that has been partitioned around that needle, would return correct results. That application would also require searchsorted to take a multidimensional haystack, which it doesn't...

I am not sure if it would have much use outside of counting NANs, but you could run the quicksort partition algorithm with pivots supplied by the user, on an unsorted array, and return a partially sorted array, and the positions where those pivots would have to be inserted in the sorted array to preserve order, kind of a searchunsorted function. It may allow speeding up some of the set operations.

@juliantaylor (Contributor):
right, I was thinking of partial sort semantics which would be much slower than running isnan first, how dumb of me...

Also added a flag "check_nan" that can be turned off to regain original
performance.

Also added nanmedian function to calculate the median while ignoring nans.
@empeeu (Contributor, Author) commented Feb 17, 2014

A couple of updates: I tried a number of approaches to speed up the nan-checking in the median function, using this little benchmark script:

import timeit
import numpy as np
#%%
setup_even = lambda N: """
import numpy as np
import itertools
a = np.random.rand(%d)
a[a.size // 2] = np.nan
""" % N
setup_nonan = lambda N: """
import numpy as np
import itertools
a = np.random.rand(%d)
""" % N
stmt_min = """#min 
ids = np.isnan(np.min(a))
sz = a.size
if sz % 2 == 0:
    szh = sz // 2
    a.partition((szh - 1, szh))
else:
    a.partition((sz - 1) // 2)
"""
stmt_max = """#max 
ids = np.isnan(np.max(a))
sz = a.size
if sz % 2 == 0:
    szh = sz // 2
    a.partition((szh - 1, szh))
else:
    a.partition((sz - 1) // 2)
"""
stmt_sum = """#sum 
ids = np.isnan(np.sum(a))
sz = a.size
if sz % 2 == 0:
    szh = sz // 2
    a.partition((szh - 1, szh))
else:
    a.partition((sz - 1) // 2)
"""
stmt_dot = """#dot 
ids = np.isnan(np.dot(a, a))
sz = a.size
if sz % 2 == 0:
    szh = sz // 2
    a.partition((szh - 1, szh))
else:
    a.partition((sz - 1) // 2)
"""
stmt_any = """#any
ids = np.any(np.isnan(a))
sz = a.size
if sz % 2 == 0:
    szh = sz // 2
    a.partition((szh - 1, szh))
else:
    a.partition((sz - 1) // 2)
"""
stmt_iter = """#argwhere
ids = np.where(np.isnan(a))
sz = a.size
if sz % 2 == 0:
    szh = sz // 2
    a.partition((szh - 1, szh))
else:
    a.partition((sz - 1) // 2)
"""
stmt_part = """#part 
sz = a.size
if sz % 2 == 0:
    szh = sz // 2
    a.partition((szh - 1, szh, -1))
else:
    a.partition(((sz - 1) // 2, -1))
"""
stmt_base = """#base
sz = a.size
if sz % 2 == 0:
    szh = sz // 2
    a.partition((szh - 1, szh))
else:
    a.partition(((sz - 1) // 2))
"""
#%%
NN = np.logspace(1, 8)
NN = NN - NN % 2 #Make all of it even
even=[]
#odd=[]
nonan = []
stmts = [stmt_part, stmt_dot, stmt_sum, stmt_max, stmt_min, stmt_any, stmt_iter, stmt_base]
for N in NN:
    even.append([])
    #odd.append([])
    nonan.append([])
    setup_e = setup_even(N)
    setup_o = setup_nonan(N)
    number = int(10 ** (4 - np.log10(N) // 2))
    for stmt in stmts:
        even[-1].append(
            np.min(timeit.repeat(
                stmt=stmt, setup=setup_e, number=number, repeat=5)
            ) / number
        )
        nonan[-1].append(
            np.min(timeit.repeat(
                stmt=stmt, setup=setup_o, number=number, repeat=5)
            ) / number
        )
        print "N", N, stmt[:4], 'even', even[-1][-1], 'nonan', nonan[-1][-1]
#%%
even_arr = np.array(even)
nonan_arr = np.array(nonan)
import matplotlib.pyplot as plt
#%%
plt.figure(1)
plt.subplot(211)
plt.loglog(NN, (even_arr.T/ NN).T * 1e6)
plt.loglog(NN, (even_arr[:, -1]/ NN).T * 1e6, '--', linewidth=2)
plt.legend(['part', 'dot', 'sum', 'max', 'min', 'any', 'argwhere', 'base'], loc=0)
plt.xlabel('Size of "a" (number of elements)')
plt.ylabel('Execution time per element $\mu s$')
plt.title('With nans')
plt.subplot(212)
plt.loglog(NN, (nonan_arr.T/ NN).T * 1e6)
plt.loglog(NN, (nonan_arr[:, -1]/ NN).T * 1e6, '--', linewidth=2)
plt.legend(['part', 'dot', 'sum', 'max', 'min', 'any', 'argwhere', 'base'], loc=0)
plt.xlabel('Size of "a" (number of elements)')
plt.ylabel('Execution time per element $\mu s$')
plt.title('Without nans')
plt.savefig("Benchmarks.png")
plt.savefig("Benchmarks.svg")
#%%
plt.figure(2)
plt.semilogx(NN, (even_arr[:, [0, 3, -1]].T / (even_arr[:, -1].T)).T)
plt.legend(['part', 'max', 'base'], loc=0)
plt.ylim([0, 4])
plt.xlabel('Size of "a" (number of elements)')
plt.ylabel('Fractional penalty $(t / t_b)$')
plt.savefig("Benchmarks2.png")
plt.savefig("Benchmarks2.svg")

Results are in the attached Benchmarks plot, with the normalized close-up in Benchmarks2.

So, while there was a cross-over array size where np.isnan(np.max(a)) was a little faster, I didn't implement a cross-over because I figured it would differ between machines, and it makes the implementation uglier. Because of the basically unavoidable performance hit, I added a flag that can be turned off to skip the nan check entirely. At the same time, I only check for nans for datatypes that can hold them (thanks @njsmith).

Okay, then I added a nanmedian function. Someone pointed out that scipy already had this, so I took some hints from their implementation, but mine seems to be considerably faster. To compare the performance of the original median, my new median, nanmedian, and scipy's nanmedian, I wrote this little script, where I put my development version in a 'numpy2' folder:

import timeit
import numpy as np
import os
os.chdir('Projects/numpy')
#%%
setup_even = lambda N: """
import numpy as np
import numpy2 as np2
from scipy.stats import nanmedian
a = np.random.rand(%d)
a[a.size // 2] = np.nan
""" % N
setup_nonan = lambda N: """
import numpy as np
import numpy2 as np2
from scipy.stats import nanmedian
a = np.random.rand(%d)
""" % N
stmt_nochk = """#no check
np2.median(a, check_nan=False)
"""
stmt_dochk = """#do check 
np2.median(a)
"""
stmt_nanmed = """#nan median
np2.nanmedian(a)
"""
stmt_spnanmed = """#sp nan median
nanmedian(a)
"""
stmt_base = """#base
np.median(a)
"""
#%%
NN = np.logspace(1, 8, 25)
NN = NN - NN % 2 #Make all of it even
even=[]
nan = []
stmts = [stmt_nochk, stmt_dochk, stmt_nanmed, stmt_spnanmed, stmt_base]
for N in NN:
    even.append([])    
    nan.append([]) 
    setup_e = setup_even(N)    
    setup_o = setup_nonan(N)    
    number = int(10 ** (4 - np.log10(N) // 2))
    for stmt in stmts:
        even[-1].append(
            np.min(timeit.repeat(
                stmt=stmt, setup=setup_e, number=number, repeat=5)
            ) / number
        )
        nan[-1].append(
            np.min(timeit.repeat(
                stmt=stmt, setup=setup_o, number=number, repeat=5)
            ) / number
        )
        print "N", N, stmt[:4], 'even', even[-1][-1], 'nan', nan[-1][-1]
#%%
even_arr = np.array(even)
nan_arr = np.array(nan)
import matplotlib.pyplot as plt
#%%
plt.figure(1)
plt.subplot(211)
plt.loglog(NN, (even_arr.T/ NN).T * 1e6)
plt.loglog(NN, (even_arr[:, -1]/ NN).T * 1e6, '--', linewidth=2)
plt.legend(['no check', 'check nan', 'nan median', 'sp nan median', 'base'], loc=0)
plt.xlabel('Size of "a" (number of elements)')
plt.ylabel('Execution time per element $\mu s$')
plt.title('With nans')
plt.subplot(212)
plt.loglog(NN, (nan_arr.T/ NN).T * 1e6)
plt.loglog(NN, (nan_arr[:, -1]/ NN).T * 1e6, '--', linewidth=2)
plt.legend(['no check', 'check nan', 'nan median', 'sp nan median', 'base'], loc=0)
plt.xlabel('Size of "a" (number of elements)')
plt.ylabel('Execution time per element $\mu s$')
plt.title('Without nans')
plt.savefig("Benchmarksbb.png")
plt.savefig("Benchmarksbb.svg")
#%%
plt.figure(2)
plt.semilogx(NN, (even_arr.T / (even_arr[:, -1].T)).T)
plt.legend(['no check', 'check nan', 'nan median', 'sp nan median', 'base'], loc=0)
plt.ylim([0, 4])
plt.xlabel('Size of "a" (number of elements)')
plt.ylabel('Fractional penalty $(t / t_b)$')
plt.title('with nan')
plt.savefig("Benchmarksbb2a.png")
plt.savefig("Benchmarksbb2a.svg")
#%%
plt.figure(3)
plt.semilogx(NN, (nan_arr.T / (nan_arr[:, -1].T)).T)
plt.legend(['no check', 'check nan', 'nan median', 'sp nan median', 'base'], loc=0)
plt.ylim([0, 4])
plt.xlabel('Size of "a" (number of elements)')
plt.ylabel('Fractional penalty $(t / t_b)$')
plt.title('without nan')
plt.savefig("Benchmarksbb2b.png")
plt.savefig("Benchmarksbb2b.svg")

Results are in the attached plots (Benchmarksbb2a and Benchmarksbb2b).

So, with the flag set to False I seem to recover the performance of the original median function. What do you think?

(Oh, and I still need to update the comments on nanmedian)

@argriffing (Contributor):
This competes with #4307?

@empeeu (Contributor, Author) commented Feb 17, 2014

It looks like we both implemented a version of nanmedian, but I also dealt with the handling of NaNs in the plain-old median function. @dfreese, do you have any suggestions?

@juliantaylor (Contributor):

It would be best if nanmedian could go into a separate PR; it is independent of checking for nan in the regular median, and my tiny brain is getting confused :(

Also, this should be put on hold until #3908 is resolved; any change to median will conflict with it.
I imagine it could get tricky to get nan support into percentile, which is already very convoluted.
This may be an argument for not implementing median in terms of percentile, as it just makes modifying the more important function harder.

You aren't seeding the rng in your benchmarks, are you? Partition performance depends on the layout of the data array; for better benchmarks you should use the same random seed (see the sketch below).
Also, if we add a special max case to the partition code, the nan check should be almost free, so the benchmarks will be invalidated by it.
I'll try to add the specialization to master soon. Then meaningful comparisons can be made.
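
A sketch of what that seeding could look like in the timeit setup strings (the values here are arbitrary):

import timeit
import numpy as np

N = 100000
setup = """
import numpy as np
np.random.seed(1234)   # fixed seed: same data layout for every statement
a = np.random.rand(%d)
a[a.size // 2] = np.nan
""" % N
t = min(timeit.repeat("np.partition(a, (a.size // 2, -1))",
                      setup=setup, number=100, repeat=5)) / 100
print("%.1f us per loop" % (t * 1e6))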

@dfreese commented Feb 17, 2014

@empeeu, your implementation uses apply_along_axis, which was suggested for mine, so your version should probably be used going forward. I do have unit tests for nanmedian that could prove useful.

@juliantaylor (Contributor):

Here is the nan specialization I was talking about:

--- a/numpy/core/src/npysort/selection.c.src
+++ b/numpy/core/src/npysort/selection.c.src
@@ -322,6 +322,19 @@ int
         store_pivot(kth, kth, pivots, npiv);
         return 0;
     }
+    else if (kth == num - 1) {
+        npy_intp k;
+        npy_intp maxidx = low;
+        @type@ maxval = v[IDX(low)];
+        for (k = low + 1; k < num; k++) {
+            if (!@TYPE@_LT(v[IDX(k)], maxval)) {
+                maxidx = k;
+                maxval = v[IDX(k)];
+            }
+        }
+        SWAP(SORTEE(kth), SORTEE(maxidx));
+        return 0;
+    }

     /* dumb integer msb, float npy_log2 too slow for small parititions */
     {

That gets you the -1 partition almost for free. Before:

In [2]: np.random.seed(56); d = np.arange(200000); np.random.shuffle(d)
In [5]: %timeit np.partition(d, (d.size // 2))
1000 loops, best of 3: 1.91 ms per loop
In [6]: %timeit np.partition(d, (d.size // 2, -1))
100 loops, best of 3: 2.41 ms per loop

After:

In [2]: np.random.seed(56); d = np.arange(200000); np.random.shuffle(d)
In [5]: %timeit np.partition(d, (d.size // 2))
1000 loops, best of 3: 1.94 ms per loop
In [6]: %timeit np.partition(d, (d.size // 2, -1))
100 loops, best of 3: 2.03 ms per loop

@empeeu (Contributor, Author) commented Feb 18, 2014

@dfreese: I agree with juliantaylor that nanmedian should go in a separate PR, so I will remove it from this one. Since you have a PR open for nanmedian, why don't you take what you find useful from my nanmedian and combine it with yours? I'll just focus on median; nanmedian is all yours.

@juliantaylor Nice specialization. I ventured into the C code a little the other day but got lost, so I appreciate seeing where the partition code is (I was thinking of writing a count_nan function, so I went looking for count_nonzero). I'll benchmark to see what happens.

@dfreese commented Feb 18, 2014

@empeeu: Sure, that sounds good.

for ARGOUTVIEWM?_FARRAY[34] typemaps, the included require_fortran check was inverted -- failing if the resulting array had fortran ordering, not if it did not.  This modification inverts these checks.
@juliantaylor (Contributor):

The partition special case for the maximum element has been merged into master.

@charris (Member) commented Feb 22, 2014

@juliantaylor How about we put this in and then revisit #3908 to see if we can thrash out the issues.

@juliantaylor (Contributor):

I would prefer sorting out #3908 first; also, this PR is not ready to be merged yet.

… checking is now pretty much as fast as the old one. Hence, removed the check_nan flag from previous commits.
@empeeu (Contributor, Author) commented Feb 26, 2014

@juliantaylor You really know your stuff. Benchmark for the latest commit:
(final benchmark plot attached)

I was a little worried by how close they were, but I checked it, and it seems to be correct.

Waiting on 3908 now. Let me know if there's anything else to do here.

juliantaylor and others added 2 commits March 5, 2014 23:18
breaking the api breaks matplotlib build and pip installation.
Introduce npy_PyFile_Dup2 and npy_PyFile_DupClose2 as replacements
numpy.i bugfix: fortran ordering check
@charris (Member) commented Mar 5, 2014

@empeeu Needs a rebase.

charris and others added 15 commits March 5, 2014 16:40
BUG: restore api for file npy_PyFile_Dup and npy_PyFile_DupClose
Caused SciPy tests to fail when built with this NumPy.
BUG: Revert PR #4421: causes SciPy tests to fail
run manual apt-get update to pick up the latest py3 security update
DOC: Fixed documentation on lstsq function on when it return an empty re...
work around outdated travis boxes
ENH: Added an output argument for numpy.outer
Also added a flag "check_nan" that can be turned off to regain original
performance.

Also added nanmedian function to calculate the median while ignoring nans.
… checking is now pretty much as fast as the old one. Hence, removed the check_nan flag from previous commits.
@empeeu (Contributor, Author) commented Mar 8, 2014

Seems like I've completely messed up the rebase. I'm going to close this up and try again.
