Performance of rolling_median()
#12609
Comments
Would welcome improvements here. The code is here: |
@etiennebacher can you take a look? |
Sorry, I'm barely a beginner in Rust, I just got news of these benchmarks because we implemented |
Taking a look. EDIT: didn't find a quick win. Someone/we can take a look at that paper. |
It is worth noting that polars is by no means slow in this area. Pandas is exceptionally fast for rolling functions relative to the rest of pandas, so being faster than pandas is already pretty good. For example, unoptimized base R (on 1e5 input) is far slower: Rdatatable/data.table#5692 (comment)
I implemented fast rolling for r-polars and it matches quite closely the |
Thank you @jangorecki. I am certainly not happy about our current state. The first goal was to get the functionality in, but after some looking we see that, especially for min, max and median, there is still much left on the table. For the rolling sum/mean I think there is still something left w.r.t. fast paths for the single window increment case. We currently made it generic for large window shifts, but that is not the most common case. I shall post back here when we are a bit further along.
@ritchie46 this article is closely related: https://duckdb.org/2021/11/12/moving-holistic.html For the median there are two approaches that scale well: min-max-heap and sort-median.
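To make the idea of maintaining the window incrementally concrete, here is a minimal Python sketch of a rolling median that keeps the current window in sorted order. This is an assumption-laden baseline for illustration only: it is neither the min-max-heap nor the sort-median algorithm mentioned above, and it is not how polars or data.table implement it. The function name `rolling_median` and its trailing-window convention (`None` until the window is full) are choices made for this sketch. Each step costs a binary search plus a list shift, so it is still O(window) per element in the worst case.

```python
from bisect import bisect_left, insort


def rolling_median(values, window):
    """Rolling median over a fixed-size trailing window (illustrative sketch).

    `values` must be an indexable sequence of numbers. Positions with fewer
    than `window` observations yield None.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    sorted_win = []  # current window contents, kept in ascending order
    out = []
    for i, v in enumerate(values):
        insort(sorted_win, v)  # insert the incoming value in sorted position
        if i >= window:
            # Drop one copy of the value that just left the window.
            del sorted_win[bisect_left(sorted_win, values[i - window])]
        if i + 1 < window:
            out.append(None)  # window not yet full
        else:
            mid = window // 2
            if window % 2:
                out.append(sorted_win[mid])
            else:
                out.append((sorted_win[mid - 1] + sorted_win[mid]) / 2)
    return out
```

The point of the sketch is only that each window reuses the ordering work of the previous one instead of re-sorting from scratch; the approaches named above go further and avoid the linear-time list shifts.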
@ritchie46 My simple prototype for fast rolling using vanilla rust-polars iter tools is as fast as |
@sorhawell I don't see a median implementation in your linked code? |
No, there is none. I was only referring to rolling_sum, rolling_mean, etc. The polars median is very much slower than the data.table impl and scales poorly. My point was that, besides rolling_median, polars is also a bit slower (2x) for rolling_mean etc. A vanilla rolling_sum re-implementation written with rust-polars is 2x faster than what rust-polars provides itself.
The rolling sum is something we can trivially improve. I think we should just add a fast path in our sum code, as we are now generic over different window jumps. Rolling median is underway. Rolling extrema will also be improved.
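The fast path for the single window increment case can be sketched as follows. This is a minimal Python illustration, not the polars code: the function name `rolling_sum` and the `None` padding for incomplete windows are assumptions made for the example. When the window advances by exactly one position, the new sum is the old sum plus the incoming value minus the outgoing one, so each step is O(1) regardless of window size.

```python
def rolling_sum(values, window):
    """Rolling sum over a fixed-size trailing window (illustrative sketch).

    Assumes the window always advances by exactly one element, which is
    the fast path discussed above. Positions with fewer than `window`
    observations yield None.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    out = []
    acc = 0
    for i, v in enumerate(values):
        acc += v  # value entering the window
        if i >= window:
            acc -= values[i - window]  # value leaving the window
        out.append(acc if i + 1 >= window else None)
    return out
```

For example, `rolling_sum([1, 2, 3, 4, 5], 3)` gives `[None, None, 6, 9, 12]`. With floating-point input the running accumulator can drift slightly compared with summing each window from scratch, which is one reason a real implementation may need extra care here.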
@jangorecki nice! I am almost finished. I do think that in the future we will want a different algorithm for smaller window sizes. |
FYI the benchmarks have been updated with the latest version of |
Description
An important contributor to data.table made a few benchmarks for rolling functions (only mean and median so far): https://github.com/jangorecki/rollbench
While the performance of rolling mean seems acceptable (still 2x slower than data.table though), the performance of rolling median degrades very quickly as the number of observations and/or window size increases. Note that r-polars is used (with bindings to rust-polars 0.35), not py-polars, but I got very similar timings with py-polars. The data.table implementation is not merged yet but can be found in this PR: Rdatatable/data.table#5692
The author links to the paper that implements the "sort-median" algorithm: https://arxiv.org/abs/1406.1717
I have no clue how the rolling median is implemented in polars, but I guess there's room for improvement here.
Here's what I have with py-polars 0.19.15: