
EHN: consider multiple skiplist classes for different datatypes #12141

Closed
jennolsen84 opened this issue Jan 26, 2016 · 15 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement Internals Related to non-user accessible pandas implementation Performance Memory or execution speed performance

Comments

@jennolsen84
Contributor

Right now, when we calculate something like a rolling max, if a float32 dataset is being passed in, it upcasts it to a float64. This increases the memory requirements of a big dataset considerably.

This is due to the fact that the skiplist implementation only works with doubles. Would you be opposed if I added instantiations for int32, int64, float32, uint64, uint32?

This way, we can avoid copies for big datasets (which take up a lot of memory, and time).

cc @kawochen

xref #3007
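[Editorial note: a minimal illustration of the upcast being described — not code from the thread. The exact pandas API shown (`Series.rolling(...).median()`) is the modern spelling of the `rolling_median` discussed below.]

```python
import numpy as np
import pandas as pd

# A float32 input comes back as float64 from a rolling median,
# i.e. the data was upcast internally -- doubling memory for the
# intermediate on large datasets.
s = pd.Series(np.arange(8, dtype="float32"))
result = s.rolling(3).median()
print(s.dtype, "->", result.dtype)  # float32 -> float64
```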

@wesm
Member

wesm commented Jan 26, 2016

Why don't we adapt the algorithms from bottleneck (which are faster and do not require skiplists)?

@kawochen
Contributor

rolling_max and rolling_min are from bottleneck. I think skiplists are used in rolling_median and rolling_quantile (an alternative is a max-median-min heap).
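[Editorial note: a sketch of what the skiplist buys in the rolling median. A sorted window gives O(log n) search; the `bisect`-on-list version below pays O(n) for insertion where a skiplist would pay O(log n), but the structure of the algorithm is the same.]

```python
import bisect

def rolling_median(values, window):
    # Maintain the current window in sorted order. insort/pop stand in
    # for the skiplist's insert/delete; the median is then an index read.
    out, win = [], []
    for i, v in enumerate(values):
        bisect.insort(win, v)
        if i >= window:  # drop the element leaving the window
            win.pop(bisect.bisect_left(win, values[i - window]))
        if i >= window - 1:
            mid = window // 2
            m = win[mid] if window % 2 else (win[mid - 1] + win[mid]) / 2
            out.append(m)
    return out
```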

@jennolsen84
Contributor Author

Yeah, I was thinking of rolling_median. Bottleneck's implementation doesn't ignore NaNs (http://berkeleyanalytics.com/bottleneck/reference.html#bottleneck.move_median). I will see if I can contribute an implementation there that accepts NaNs.

This one and only commit comment doesn't look that encouraging though :) https://github.com/kwgoodman/bottleneck/commits/5a9ecb818b63ffcb8d6e787e51a236a109302c6a/bottleneck/src/auto_pyx/csrc/move_median.c
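[Editorial note: a naive reference implementation of the NaN semantics being proposed — skip NaNs within the window, and require at least `min_count` valid observations (bottleneck's name for pandas' `min_period`). This is O(window) per step; it illustrates the contract, not the fast algorithm.]

```python
import numpy as np

def move_median_nan(a, window, min_count=1):
    # Output is NaN until a full window is available, or when the
    # window holds fewer than min_count non-NaN values.
    out = np.full(len(a), np.nan)
    for i in range(window - 1, len(a)):
        w = a[i - window + 1 : i + 1]
        if np.count_nonzero(~np.isnan(w)) >= min_count:
            out[i] = np.nanmedian(w)
    return out
```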

@kawochen
Contributor

Wow, bottleneck is quite a bit faster. This should work for rolling_quantile too.

@jreback
Contributor

jreback commented Jan 26, 2016

@kawochen can you post any benchmarks? As an aside, we still need to keep these routines since bottleneck is optional (though that is a separate issue)

@jennolsen84
Contributor Author

When benchmarking, please remember to benchmark different window sizes and different axes.

When testing median, I came across implementations whose running times differed greatly depending on window_size, due to the algorithms being used. E.g. if window sizes are small, maintaining multiple heaps hurts performance compared to running quickselect repeatedly; with big window sizes, the big-O complexity starts to dominate and multiple-heap implementations start to win.

Also, with bigger matrices, and iterating across different axes, cache effects can matter too. So please be careful about these effects when you benchmark.
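[Editorial note: the small-window alternative mentioned above, sketched. Quickselect recomputed per window costs expected O(window) per step with no data structure to maintain, which can beat heap/skiplist schemes for small windows.]

```python
def quickselect(a, k):
    # Expected O(len(a)): k-th smallest (0-indexed) via 3-way partition.
    pivot = a[len(a) // 2]
    lows = [x for x in a if x < pivot]
    pivots = [x for x in a if x == pivot]
    if k < len(lows):
        return quickselect(lows, k)
    if k < len(lows) + len(pivots):
        return pivot
    return quickselect([x for x in a if x > pivot],
                       k - len(lows) - len(pivots))

def rolling_median_quickselect(values, window):
    # Recompute each window's median from scratch (lower median for
    # even windows, to keep the sketch short).
    return [quickselect(values[i:i + window], (window - 1) // 2)
            for i in range(len(values) - window + 1)]
```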

@kawochen
Contributor

Will work on that :)

@jennolsen84
Contributor Author

rolling_median is an operation that can be sped up using multi-threading easily. The C library can easily export a re-entrant function that doesn't use the GIL and takes in an input and an output buffer.

What is the pandas policy regarding this? E.g. numexpr evals automatically use multiple cores (unless explicitly told to use only one core), but bottleneck uses 1 core by default. If it is OK to use multiple threads, should I just create a concurrent.futures.ThreadPoolExecutor, and submit the tasks there?
(if the input size is big enough, and the window size is small enough of course)
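[Editorial note: a hypothetical sketch of the threading proposal — names are invented, and it assumes `func` releases the GIL (e.g. a C routine), otherwise Python threads buy nothing.]

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def parallel_rows(a, func, max_workers=4):
    # Split a 2-D array into row blocks and run `func` on each block
    # from a thread pool, writing into a shared output buffer. Only
    # worthwhile when the input is big enough relative to the window.
    out = np.empty_like(a)
    blocks = np.array_split(np.arange(a.shape[0]), max_workers)

    def work(rows):
        out[rows] = func(a[rows])

    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        list(ex.map(work, blocks))  # consume to surface exceptions
    return out
```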

@jennolsen84
Contributor Author

@jreback @wesm any feedback on the pandas policy? I would like to avoid doing unnecessary work if possible -- in this case, writing a threaded implementation only to have the PR rejected.

I have upgraded the move_median algorithm in bottleneck to accept NaNs and min_period (though it is called min_count in bottleneck).

I benchmarked different scenarios and am consistently getting ~3.5-5.5x the performance of pandas master [except for the trivial case of window_size=1, which bn optimizes]. So I am guessing that pandas' rolling median already uses an algorithm with O(n * lg window_size) running time.

The double heap algorithm implementation (contributed by someone else) in bottleneck is pretty efficient. Also, bottleneck generates code for double, float, int32, and int64 data types at compile time.

In case you're interested in the move median implementation:

https://github.com/jennolsen84/bottleneck/tree/movemedian

I have to clean up the function names a little and add more documentation. I have tested the implementation thoroughly, so it should be good. Test cases were a bit tough to generate -- I could use advice on this. I ended up hammering the implementation with random datasets, and that helped shake out a lot of bugs (hopefully all of them?).
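[Editorial note: the double-heap scheme mentioned above can be sketched in pure Python. This is the standard two-heap sliding-median technique with lazy deletion — a readable stand-in for bottleneck's C implementation, not that code.]

```python
import heapq
from collections import defaultdict

def move_median_heaps(a, k):
    # low: max-heap (negated) of the smaller half; high: min-heap of
    # the larger half. Departed elements are removed lazily when they
    # surface at a heap top, so each step is O(log k) amortized.
    low, high = [], []
    delayed = defaultdict(int)   # values scheduled for lazy removal
    low_size = high_size = 0     # counts of *valid* entries

    def prune(heap):
        sign = -1 if heap is low else 1
        while heap and delayed[sign * heap[0]]:
            delayed[sign * heap[0]] -= 1
            heapq.heappop(heap)

    def balance():
        nonlocal low_size, high_size
        if low_size > high_size + 1:
            heapq.heappush(high, -low[0]); heapq.heappop(low)
            low_size -= 1; high_size += 1
            prune(low)
        elif low_size < high_size:
            heapq.heappush(low, -high[0]); heapq.heappop(high)
            high_size -= 1; low_size += 1
            prune(high)

    def insert(v):
        nonlocal low_size, high_size
        if not low or v <= -low[0]:
            heapq.heappush(low, -v); low_size += 1
        else:
            heapq.heappush(high, v); high_size += 1
        balance()

    def erase(v):
        nonlocal low_size, high_size
        delayed[v] += 1
        if v <= -low[0]:
            low_size -= 1
            if v == -low[0]:
                prune(low)
        else:
            high_size -= 1
            if high and v == high[0]:
                prune(high)
        balance()

    out = []
    for i, v in enumerate(a):
        insert(v)
        if i >= k:
            erase(a[i - k])
        if i >= k - 1:
            out.append(-low[0] if k % 2 else (-low[0] + high[0]) / 2)
    return out
```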

@jennolsen84
Contributor Author

Here is the timing data (I stopped it after a while, as it wasn't that interesting):

https://gist.github.com/jennolsen84/70cf058cd0e02ef299d3

@wesm
Member

wesm commented Feb 9, 2016

@jennolsen84 is it possible to handle the multithreading using pthreads in C rather than via Python threads?

@jreback
Contributor

jreback commented Feb 9, 2016

@jennolsen84 so in general, to play nice, you should:

  • release the GIL whenever possible (though be careful, as this can add overhead in some cases)
  • avoid Python threads and Python multiprocessing, as they don't play nice with layers above pandas that want to use Python threads themselves.

Note that numexpr (IIRC in 2.5, just released) now actually plays nice by holding a lock around its own thread usage (before, it was not threadsafe).

@jennolsen84
Contributor Author

@jreback yep, I sent in the PR to numexpr for that lock.. :) pydata/numexpr#200 .

@wesm yes, can definitely use pthreads. How many threads should be used? I was thinking of peeking at numexpr settings, but numexpr is not required by pandas.

Bigger picture: I think the best option might be to expose something that returns a list of functions, which can then be scheduled as the user pleases. This lets the user reuse an existing thread pool (one that might be partially busy, so oversaturation of the CPU can be avoided, etc.). At the moment this cannot be implemented efficiently on top of pandas because of the function signatures. (The out parameter gets us part of the way there; we would also need a warm-up window.)

Thoughts? I can move the discussion of the last paragraph to a separate issue, if you would like.
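[Editorial note: a hypothetical sketch of the "list of functions" proposal — the name `make_rolling_median_tasks` and its signature are invented. Each chunk starts `window - 1` elements early (the warm-up window mentioned above) so chunk-boundary results match a single serial pass.]

```python
import numpy as np
import pandas as pd

def make_rolling_median_tasks(values, window, out, n_chunks=4):
    # Return plain callables the caller schedules on their own pool
    # (or runs serially). Each writes its slice of `out`.
    bounds = np.linspace(0, len(values), n_chunks + 1, dtype=int)
    tasks = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        start = max(0, lo - (window - 1))  # warm-up rows for this chunk

        def task(lo=lo, hi=hi, start=start):
            res = pd.Series(values[start:hi]).rolling(window).median()
            out[lo:hi] = res.to_numpy()[lo - start:]  # drop warm-up rows

        tasks.append(task)
    return tasks
```

A caller could then submit these to an existing `ThreadPoolExecutor` (or any scheduler) without pandas owning the threading policy.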

@jreback jreback added Enhancement Performance Memory or execution speed performance Dtype Conversions Unexpected or buggy dtype conversions Internals Related to non-user accessible pandas implementation labels Feb 17, 2016
@jennolsen84
Contributor Author

bottleneck 1.1 now has a good implementation of move median. It can handle NaNs, etc. There is talk of implementing parallel algorithms in bottleneck as well. Perhaps pandas can just use those.

@mroeschke
Member

Seems like the general feeling is that we want to move towards bottleneck (or maybe even numba) in the future, so extending the skiplist to handle more types is not really a priority. Closing.
