
EHN: consider multiple skiplist classes for different datatypes #12141

Closed
jennolsen84 opened this issue Jan 26, 2016 · 15 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement Internals Related to non-user accessible pandas implementation Performance Memory or execution speed performance

Comments

@jennolsen84
Contributor

Right now, when we calculate something like a rolling max, if a float32 dataset is being passed in, it upcasts it to a float64. This increases the memory requirements of a big dataset considerably.

This is due to the fact that the skiplist implementation only works with doubles. Would you be opposed if I added instantiations for int32, int64, float32, uint64, uint32?

This way, we can avoid copies for big datasets (which take up a lot of memory, and time).

cc @kawochen

xref #3007
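[Editorial note: a minimal illustration of the upcast being described — not code from the thread. The exact pandas API shown (`Series.rolling(...).median()`) is the modern spelling of the `rolling_median` discussed below.]

```python
import numpy as np
import pandas as pd

# A float32 input comes back as float64 from a rolling median,
# i.e. the data was upcast internally -- doubling memory for the
# intermediate on large datasets.
s = pd.Series(np.arange(8, dtype="float32"))
result = s.rolling(3).median()
print(s.dtype, "->", result.dtype)  # float32 -> float64
```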

@wesm
Member

wesm commented Jan 26, 2016

Why don't we adapt the algorithms from bottleneck (which are faster and do not require skiplists)?

@kawochen
Contributor

rolling_max and rolling_min are from bottleneck. I think skiplists are used in rolling_median and rolling_quantile (an alternative is a max-median-min heap).
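[Editorial note: a sketch of what the skiplist buys in the rolling median. A sorted window gives O(log n) search; the `bisect`-on-list version below pays O(n) for insertion where a skiplist would pay O(log n), but the structure of the algorithm is the same.]

```python
import bisect

def rolling_median(values, window):
    # Maintain the current window in sorted order. insort/pop stand in
    # for the skiplist's insert/delete; the median is then an index read.
    out, win = [], []
    for i, v in enumerate(values):
        bisect.insort(win, v)
        if i >= window:  # drop the element leaving the window
            win.pop(bisect.bisect_left(win, values[i - window]))
        if i >= window - 1:
            mid = window // 2
            m = win[mid] if window % 2 else (win[mid - 1] + win[mid]) / 2
            out.append(m)
    return out
```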

@jennolsen84
Contributor Author

Yeah, I was thinking of rolling_median. Bottleneck's implementation doesn't ignore NaNs (http://berkeleyanalytics.com/bottleneck/reference.html#bottleneck.move_median). I will see if I can contribute an implementation there that accepts NaNs.

This one and only commit comment doesn't look that encouraging though :) https://github.com/kwgoodman/bottleneck/commits/5a9ecb818b63ffcb8d6e787e51a236a109302c6a/bottleneck/src/auto_pyx/csrc/move_median.c
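[Editorial note: a naive reference implementation of the NaN semantics being proposed — skip NaNs within the window, and require at least `min_count` valid observations (bottleneck's name for pandas' `min_period`). This is O(window) per step; it illustrates the contract, not the fast algorithm.]

```python
import numpy as np

def move_median_nan(a, window, min_count=1):
    # Output is NaN until a full window is available, or when the
    # window holds fewer than min_count non-NaN values.
    out = np.full(len(a), np.nan)
    for i in range(window - 1, len(a)):
        w = a[i - window + 1 : i + 1]
        if np.count_nonzero(~np.isnan(w)) >= min_count:
            out[i] = np.nanmedian(w)
    return out
```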

@kawochen
Contributor

Wow, bottleneck is quite a bit faster. This should work for rolling_quantile too.

@jreback
Contributor

jreback commented Jan 26, 2016

@kawochen can you post any benchmarks? As an aside, we still need to keep these routines since bottleneck is optional (though that is a separate issue)

@jennolsen84
Contributor Author

When benchmarking, please remember to benchmark different window sizes and different axes.

When testing median, I came across implementations whose running times differed greatly depending on window_size, due to the algorithms being used. E.g. if window sizes are small, maintaining multiple heaps hurts performance compared to running quickselect repeatedly; with big window sizes, the big-O complexity starts to dominate and multiple-heap implementations start to win.

Also, with bigger matrices, and iterating across different axes, cache effects can matter too. So please be careful about these effects when you benchmark.
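[Editorial note: the small-window alternative mentioned above, sketched. Quickselect recomputed per window costs expected O(window) per step with no data structure to maintain, which can beat heap/skiplist schemes for small windows.]

```python
def quickselect(a, k):
    # Expected O(len(a)): k-th smallest (0-indexed) via 3-way partition.
    pivot = a[len(a) // 2]
    lows = [x for x in a if x < pivot]
    pivots = [x for x in a if x == pivot]
    if k < len(lows):
        return quickselect(lows, k)
    if k < len(lows) + len(pivots):
        return pivot
    return quickselect([x for x in a if x > pivot],
                       k - len(lows) - len(pivots))

def rolling_median_quickselect(values, window):
    # Recompute each window's median from scratch (lower median for
    # even windows, to keep the sketch short).
    return [quickselect(values[i:i + window], (window - 1) // 2)
            for i in range(len(values) - window + 1)]
```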

@kawochen
Contributor

Will work on that :)

@jennolsen84
Contributor Author

rolling_median is an operation that can be sped up using multi-threading easily. The C library can easily export a re-entrant function that doesn't use the GIL and takes in an input and an output buffer.

What is the pandas policy regarding this? E.g. numexpr evals automatically use multiple cores (unless explicitly told to use only one core), but bottleneck uses 1 core by default. If it is OK to use multiple threads, should I just create a concurrent.futures.ThreadPoolExecutor, and submit the tasks there?
(if the input size is big enough, and the window size is small enough of course)
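[Editorial note: a hypothetical sketch of the threading proposal — names are invented, and it assumes `func` releases the GIL (e.g. a C routine), otherwise Python threads buy nothing.]

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def parallel_rows(a, func, max_workers=4):
    # Split a 2-D array into row blocks and run `func` on each block
    # from a thread pool, writing into a shared output buffer. Only
    # worthwhile when the input is big enough relative to the window.
    out = np.empty_like(a)
    blocks = np.array_split(np.arange(a.shape[0]), max_workers)

    def work(rows):
        out[rows] = func(a[rows])

    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        list(ex.map(work, blocks))  # consume to surface exceptions
    return out
```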

@jennolsen84
Contributor Author

@jreback @wesm any feedback on the pandas policy? I would like to avoid doing unnecessary work if possible -- in this case, writing a threaded implementation only to have the PR rejected.

I have upgraded the move_median algorithm in bottleneck to accept NaNs and min_period (though it is called min_count in bottleneck).

I benchmarked different scenarios and am consistently getting ~3.5-5.5x the performance of pandas master [except for the trivial case of window_size=1, which bn optimizes]. So I am guessing that pandas' rolling median already uses an algorithm with O(n * lg window_size) running time.

The double heap algorithm implementation (contributed by someone else) in bottleneck is pretty efficient. Also, bottleneck generates code for double, float, int32, and int64 data types at compile time.

In case you're interested in the move median implementation:

https://github.com/jennolsen84/bottleneck/tree/movemedian

I have to clean up the function names a little and add more documentation. I have tested the implementation thoroughly, so it should be good. Test cases were a bit tough to generate -- I could use advice on this. I ended up hammering the implementation with random datasets, and that helped shake out a lot of bugs (hopefully all of them?).
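[Editorial note: the double-heap scheme mentioned above can be sketched in pure Python. This is the standard two-heap sliding-median technique with lazy deletion — a readable stand-in for bottleneck's C implementation, not that code.]

```python
import heapq
from collections import defaultdict

def move_median_heaps(a, k):
    # low: max-heap (negated) of the smaller half; high: min-heap of
    # the larger half. Departed elements are removed lazily when they
    # surface at a heap top, so each step is O(log k) amortized.
    low, high = [], []
    delayed = defaultdict(int)   # values scheduled for lazy removal
    low_size = high_size = 0     # counts of *valid* entries

    def prune(heap):
        sign = -1 if heap is low else 1
        while heap and delayed[sign * heap[0]]:
            delayed[sign * heap[0]] -= 1
            heapq.heappop(heap)

    def balance():
        nonlocal low_size, high_size
        if low_size > high_size + 1:
            heapq.heappush(high, -low[0]); heapq.heappop(low)
            low_size -= 1; high_size += 1
            prune(low)
        elif low_size < high_size:
            heapq.heappush(low, -high[0]); heapq.heappop(high)
            high_size -= 1; low_size += 1
            prune(high)

    def insert(v):
        nonlocal low_size, high_size
        if not low or v <= -low[0]:
            heapq.heappush(low, -v); low_size += 1
        else:
            heapq.heappush(high, v); high_size += 1
        balance()

    def erase(v):
        nonlocal low_size, high_size
        delayed[v] += 1
        if v <= -low[0]:
            low_size -= 1
            if v == -low[0]:
                prune(low)
        else:
            high_size -= 1
            if high and v == high[0]:
                prune(high)
        balance()

    out = []
    for i, v in enumerate(a):
        insert(v)
        if i >= k:
            erase(a[i - k])
        if i >= k - 1:
            out.append(-low[0] if k % 2 else (-low[0] + high[0]) / 2)
    return out
```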

@jennolsen84
Contributor Author

Here is the timing data (I stopped it after a while, as it wasn't that interesting):

https://gist.github.com/jennolsen84/70cf058cd0e02ef299d3

@wesm
Member

wesm commented Feb 9, 2016

@jennolsen84 is it possible to handle the multithreading using pthreads in C rather than via Python threads?

@jreback
Contributor

jreback commented Feb 9, 2016

@jennolsen84 so in general, to play nice, you should:

  • release the GIL whenever possible (though be careful, as this can add overhead in some cases)
  • avoid Python threads and Python multiprocessing, as they don't play nice with layers above pandas that want to use Python threads themselves.

Note that numexpr (IIRC in 2.5, just released) now actually plays nice by holding a lock around its own thread usage (before, it was not threadsafe).

@jennolsen84
Contributor Author

@jreback yep, I sent in the PR to numexpr for that lock.. :) pydata/numexpr#200 .

@wesm yes, can definitely use pthreads. How many threads should be used? I was thinking of peeking at numexpr settings, but numexpr is not required by pandas.

Bigger picture: I think the best option might be to expose something that returns a list of functions, which can then be scheduled as the user pleases. This lets the user reuse an existing thread pool (one that might be partially busy, so oversaturation of the CPU can be avoided, etc.). At the moment this cannot be implemented efficiently on top of pandas because of the function signatures. (The out parameter gets us part of the way there; we would also need a warm-up window.)

Thoughts? I can move the discussion of the last paragraph to a separate issue, if you would like.
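[Editorial note: a hypothetical sketch of the "list of functions" proposal — the name `make_rolling_median_tasks` and its signature are invented. Each chunk starts `window - 1` elements early (the warm-up window mentioned above) so chunk-boundary results match a single serial pass.]

```python
import numpy as np
import pandas as pd

def make_rolling_median_tasks(values, window, out, n_chunks=4):
    # Return plain callables the caller schedules on their own pool
    # (or runs serially). Each writes its slice of `out`.
    bounds = np.linspace(0, len(values), n_chunks + 1, dtype=int)
    tasks = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        start = max(0, lo - (window - 1))  # warm-up rows for this chunk

        def task(lo=lo, hi=hi, start=start):
            res = pd.Series(values[start:hi]).rolling(window).median()
            out[lo:hi] = res.to_numpy()[lo - start:]  # drop warm-up rows

        tasks.append(task)
    return tasks
```

A caller could then submit these to an existing `ThreadPoolExecutor` (or any scheduler) without pandas owning the threading policy.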

@jreback jreback added Enhancement Performance Memory or execution speed performance Dtype Conversions Unexpected or buggy dtype conversions Internals Related to non-user accessible pandas implementation labels Feb 17, 2016
@jennolsen84
Contributor Author

bottleneck 1.1 now has a good implementation of move median. It can handle NaNs, etc. There is talk of implementing parallel algorithms in bottleneck as well. Perhaps pandas can just use those.

@mroeschke
Member

Seems like the general feeling is that we want to move towards bottleneck (or maybe even numba) in the future, so extending the skiplist to handle more types is not really a priority. Closing.
