Add bn.move_nanmedian() #42

Closed
kwgoodman opened this Issue Mar 16, 2012 · 6 comments

Owner

kwgoodman commented Mar 16, 2012

No description provided.

I would like to give my vote to this issue... The only thing missing from Bottleneck!

Owner

kwgoodman commented May 20, 2013

bn.move_median() is the one part of the code base I am not familiar with. So if anyone is interested in contributing a bn.move_nanmedian...

bn.move_median() code is here:

https://github.com/kwgoodman/bottleneck/blob/master/bottleneck/src/move/csrc/move_median.c

And here is some discussion:

https://groups.google.com/group/bottle-neck/browse_thread/thread/3cdcbeb3c0e4bccc/e18275488d1b7f85?lnk=gst&q=move_nanmedian#

Contributor

jennolsen84 commented Feb 11, 2016

Hi, I did some work on this. The implementation works and is available at https://github.com/jennolsen84/bottleneck/tree/movemedian . As a bonus, it supports min_count, which makes it consistent with the rest of bottleneck.

Next, I am going to rename some functions and add some documentation. I have tested it a lot, and the library seems to work fine.

Currently, the csrc for move_median hardcodes the data type to float64, which causes unnecessary conversions when operating on float32 data. See:

typedef npy_intp _size_t;
typedef npy_float64 value_t;

To avoid this, I am thinking about moving the code to C++ and using templates. Luckily, Cython has support for wrapping C++ templates.
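The template idea can be sketched roughly as follows. This is only an illustration of parameterizing the element type rather than hard-coding npy_float64; median_of_window is a hypothetical stand-in name, not a function from move_median.c:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: make the value type a template parameter instead of a typedef,
// so float32 input can be processed without converting to float64.
// median_of_window is illustrative only (it sorts a copy of the window,
// which the real streaming algorithm does not do).
template <typename value_t>
value_t median_of_window(std::vector<value_t> w) {
    std::sort(w.begin(), w.end());
    std::size_t n = w.size();
    value_t lo = w[(n - 1) / 2];  // lower middle element
    value_t hi = w[n / 2];        // upper middle element (same as lo if n is odd)
    return static_cast<value_t>((lo + hi) / 2);
}
```

The same instantiation mechanism is what Cython would wrap, one specialization per supported dtype.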

Would a C++ implementation be OK for a PR?

Any other concerns before I submit a PR? I can submit a PR very soon.

I intend to write high-level documentation on the algorithm too, since I had to introduce a doubly linked list to keep track of all the nans in each heap. So now the data structure has 2 heaps, a FIFO linked list for the window, and 2 doubly linked lists for nans! Luckily, everything is still O(n log w).
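As a much simpler illustration of the NaN-aware semantics (not the dual-heap implementation described above), a reference moving nanmedian with min_count could look like this. It uses a std::multiset instead of the two heaps, and the median lookup via std::next is O(w), so treat it as a sketch for checking behavior, not performance:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <deque>
#include <set>
#include <vector>

// Reference sketch of NaN-aware moving median with min_count semantics.
// Insert/erase are O(log w); the real algorithm keeps the same bound for
// the median lookup by balancing two heaps, which this sketch does not do.
std::vector<double> move_nanmedian(const std::vector<double>& a,
                                   std::size_t window,
                                   std::size_t min_count) {
    std::vector<double> out(a.size());
    std::deque<double> buf;        // FIFO of raw window values (NaNs included)
    std::multiset<double> sorted;  // non-NaN window values, kept ordered
    for (std::size_t i = 0; i < a.size(); ++i) {
        buf.push_back(a[i]);
        if (!std::isnan(a[i])) sorted.insert(a[i]);
        if (buf.size() > window) {
            double old = buf.front();
            buf.pop_front();
            if (!std::isnan(old)) sorted.erase(sorted.find(old));
        }
        std::size_t n = sorted.size();  // non-NaN count in the window
        if (n < min_count || n == 0) {
            out[i] = std::nan("");  // too few observations
        } else {
            auto mid = std::next(sorted.begin(), (n - 1) / 2);
            double lo = *mid;
            double hi = (n % 2 == 0) ? *std::next(mid) : lo;
            out[i] = 0.5 * (lo + hi);
        }
    }
    return out;
}
```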

Owner

kwgoodman commented Feb 12, 2016

Wow! This is a big improvement for bottleneck!

I expect that keeping track of nans will make move_median slower. But I don't have a feel for how much slower.

Master:

In [16]: a=np.random.rand(1000001)
In [17]: timeit bn.move_median(a,1)
100 loops, best of 3: 2.58 ms per loop
In [18]: timeit bn.move_median(a,2)
10 loops, best of 3: 18.8 ms per loop
In [19]: timeit bn.move_median(a,3)
10 loops, best of 3: 22.4 ms per loop
In [20]: timeit bn.move_median(a,4)
10 loops, best of 3: 29.3 ms per loop
In [21]: timeit bn.move_median(a,8)
10 loops, best of 3: 36.4 ms per loop
In [22]: timeit bn.move_median(a,16)
10 loops, best of 3: 43 ms per loop
In [23]: timeit bn.move_median(a,32)
10 loops, best of 3: 56.8 ms per loop

Your branch:

In [1]: a=np.random.rand(1000001)
In [2]: timeit bn.move_median(a,1)
100 loops, best of 3: 2.59 ms per loop
In [4]: timeit bn.move_median(a,2)
10 loops, best of 3: 61.3 ms per loop
In [5]: timeit bn.move_median(a,3)
10 loops, best of 3: 69.7 ms per loop
In [6]: timeit bn.move_median(a,4)
10 loops, best of 3: 76.5 ms per loop
In [7]: timeit bn.move_median(a,8)
10 loops, best of 3: 81.8 ms per loop
In [8]: timeit bn.move_median(a,16)
10 loops, best of 3: 89.8 ms per loop
In [9]: timeit bn.move_median(a,32)
10 loops, best of 3: 105 ms per loop

Perhaps you can compare timings with pandas to see if we are in the right ballpark. I tried to regain some performance by changing move.pyx to use mm_update_movemedian_nonan (which we can do automatically for int input arrays):

 -            mm_update_movemedian_possiblenan(mm, ai)
+            mm_update_movemedian_nonan(mm, ai)

That gives:

In [1]: a=np.random.rand(1000001)
In [2]: timeit bn.move_median(a,1)
100 loops, best of 3: 2.63 ms per loop
In [3]: timeit bn.move_median(a,2)
10 loops, best of 3: 40.9 ms per loop
In [4]: timeit bn.move_median(a,3)
10 loops, best of 3: 44.8 ms per loop
In [5]: timeit bn.move_median(a,4)
10 loops, best of 3: 55.7 ms per loop
In [6]: timeit bn.move_median(a,8)
10 loops, best of 3: 62.4 ms per loop
In [7]: timeit bn.move_median(a,16)
10 loops, best of 3: 69.2 ms per loop
In [8]: timeit bn.move_median(a,32)
10 loops, best of 3: 83.4 ms per loop

One option is to preserve the current code for the int input array case. It would also give us the option to make a quick variable name change, recompile, and ignore nans again. But I'm not sure the added complication is worth it.
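The automatic int dispatch could be sketched like this (in C++, since a C++ port is on the table). The names here are illustrative stand-ins for mm_update_movemedian_nonan / mm_update_movemedian_possiblenan, and the enum is hypothetical:

```cpp
#include <cassert>
#include <type_traits>

// Sketch: integer dtypes can never contain NaN, so they can always take
// the cheaper no-NaN update path; floating-point dtypes must check.
enum class Path { nonan, possiblenan };

template <typename T>
Path choose_update_path() {
    return std::is_integral<T>::value ? Path::nonan : Path::possiblenan;
}
```

With templates this decision is made at compile time, so the no-NaN path pays nothing for the check.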

My take:

  1. Make sure performance is about what we expect by comparing to pandas
  2. Make a PR with the C implementation
  3. Later, benchmark a C++ version for float32/float64
  4. If 3 looks good, a possible C++ PR after that

How does that sound to you?

Contributor

jennolsen84 commented Feb 12, 2016

Sounds good.

  1. I checked the pandas performance. Bottleneck is about 3.5 to 5 times faster in every case, except the trivial bn case of window_size=1 ;), where a copy of the array is returned.
  2. OK, I will do that in the next 24 hours.
  3. With C++, we could close the gap further too. We could have two different node classes, one with the doubly linked lists and one without, and then partially specialize the functions using templates to strip out the unnecessary logic as much as we can.
  4. I also intend to add the ability to use pthreads in the C/C++ code. I will use this code with pandas. I am unsure if we want to expose that in bottleneck (as you might want to expose this more generally in bn so that all functions can be parallelized in a similar manner).

Btw, I also have array functions that do the loops in C; those assume C-contiguous arrays (with stride=1) for now. Just FYI, I will need to generalize that before we start using it.

Owner

kwgoodman commented Mar 31, 2016

Closed by 3b2c722

@kwgoodman kwgoodman closed this Mar 31, 2016
