Implement reduce() and fold() #403
Conversation
This polars fold is more dynamic than the macro-like fold I proposed in #400, as this fold folds over the Series at query time, whereas in #400 we folded over the "abstract" expressions. This fold is more related to map.

Interfacing polars and R lambdas / map is pretty hairy stuff. I will try to write a hint for how I think it should be done. This seems to be the py-polars call stack for fold. py-polars can benefit from pyo3 to interface lambdas, whereas we have our own extendr_concurrent.rs. It looks like you knew that :)

Our lambda interface has 4 "sections":
1 - the core polars-R interface module in extendr_concurrent.rs
…

It seems you have used snippets from 2, 3, 4. You should likely only need something like in 4. Perhaps you can derive $fold from $map() and just piggy-back on all the hairy stuff. This is how rpolars map looks (line 738 in 33a56d9).
The signature of binary_lambda is (Series, Series) -> Series, whereas it is (Series) -> Series for map. I hope we can reuse the map implementation fairly cheaply by using a struct to stitch the acc Series together with the x Series for input.
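To make the two shapes concrete at the R level (a trivial sketch; these function names are made up for illustration and are not part of the PR):

```r
# the kind of lambda $map() expects: one Series in, one Series out
unary_lambda <- \(x) x * 2

# the kind of lambda fold/reduce need: accumulator and current Series in, one Series out
binary_lambda_r <- \(acc, x) acc + x
```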
Thanks for all the details!
This is actually what I tried to do so far. The problem is that, as you say, the … Therefore, I thought the issue is that the …
I pursued the wrap-struct approach. I will also try to look into what your error is exactly. I'm on a long family visit and only have a few hours with a computer per day.

I have a panic in fold2 because the global threadcom is destroyed when folding over multiple exprs instead of just 1; otherwise it works. Not sure why this error occurs, it seems someone "hangs up the phone" a little too early. Will try to look more into it.
…ore debug print places
Merge remote-tracking branch 'origin/main' into reduce-fold

# Conflicts:
#	man/pl_pl.Rd
@etiennebacher The reasons your approach gave errors are: the host-side lambda (collect, fetch, ...) that calls serve_r and the client-side lambdas (map, fold, reduce) must have compatible signatures; also, there can only be one global storage CONFIG from which a thread can acquire a ThreadCom and send R jobs to the main thread.
I found a subtle bug in the concurrent handling: if a user writes a lambda that uses e.g. collect inside the concurrent handler, a new concurrent handler will start and drop the old one. The result is that no new jobs can be submitted and the query will stop. If all R jobs were already in the queue, the error would not manifest. With $fold(), new R jobs are submitted sequentially, and therefore the error showed. Now the collectors collect(), fetch() & profile() check whether there already is a concurrent handler running. I think any recursive polars query inside embedded R code inside a polars query ... and so on can still submit R jobs to the single outer concurrent handler. So it should be fine now.
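A hypothetical R-level illustration of the situation described above (the lambda body and query are invented for illustration, not taken from this PR): the R lambda passed to fold itself collects another polars query, so the inner collect() runs while the outer query's concurrent handler is already active.

```r
library(polars)

# hypothetical lambda: its body triggers a second, nested polars query.
# before the fix, this inner $collect() would start a new concurrent handler
# and drop the outer one; now collect()/fetch()/profile() reuse the running handler
nested_lambda <- \(acc, x) {
  pl$DataFrame(mtcars)$lazy()$select(pl$col("drat")$sum())$collect()
  acc + x
}

pl$fold2(pl$lit(1:3), nested_lambda, list(pl$lit(3:1)))$lit_to_s()
```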
@etiennebacher we need to decide if we want to go with enum signatures over struct wrappers. An enum has near-zero overhead, although the struct overhead is also quite small in any practical sense. An enum would allow more future flexibility. I don't know how cleanly the enum signatures can be implemented, or whether it would be mentally too abstract. I can give it a try. Otherwise we could leave it for another time.
pl$fold2(pl$lit(1:5), \(acc, x) acc + 2L * x, list(pl$lit(5:1), pl$lit(11:15)))$lit_to_s()
polars Series: shape: (5,)
Series: '' [i32]
[
	33
	34
	35
	36
	37
]
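For reference, the same accumulation written in plain base R (assuming the literal expressions materialize to the integer vectors shown) reproduces that output:

```r
acc <- 1:5
for (x in list(5:1, 11:15)) {
  acc <- acc + 2L * x  # same body as the lambda above
}
acc
#> [1] 33 34 35 36 37
```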
For info, the current implementation doesn't work with the simple example from …:

df = pl$DataFrame(mtcars)
# Make the rowwise sum of two columns and add 1 to it
df$with_columns(
  pl$fold2(
    acc = pl$lit(1), lambda = \(acc, x) acc + x, exprs = pl$col("mpg", "drat")
  )
)
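For reference, the rowwise result that example is after could be written in base R simply as (illustrative, not part of the PR):

```r
# 1 + mpg + drat for every row of mtcars
1 + mtcars$mpg + mtcars$drat
```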
Maybe I'm missing some info, but from what you describe it sounds like …

Is this bug important outside of this PR? If so, it should probably be fixed in a separate smaller PR because this one might take a while to be merged.
Makes sense. I will separate out the bugfix and the enum-signature work into one or two other PRs.
Merge remote-tracking branch 'origin/main' into reduce-fold

# Conflicts:
#	src/rust/Cargo.toml
#	src/rust/src/concurrent.rs
#	src/rust/src/rlib.rs
fold and reduce seem to work well now.
Thanks for taking over this PR, @sorhawell. This looks good to me but I can't find a nice way to adapt the rowwise-median example of #400. Do you want to update your answer there and close it at the same time?
Also, I can't approve this PR since I created it.
I will assign the issue to myself. The corresponding data.table example was not very efficient either, because the OP accessed a single row at a time, likely without making use of data.table indexing. Probably the best solution is to chunk the frame lazily, then transpose, compute medians on 100-1000 rows at a time, and then concat the results back. That is likely fairly cache friendly. Any computation which can be performed vectorized can be done with some fold.
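A rough base-R sketch of that chunk-transpose-median idea (the function name, chunk size, and the use of base R rather than polars are illustrative assumptions, not part of this PR):

```r
# illustrative only: chunk the frame, transpose each chunk, take the median of
# each column of the transposed chunk (i.e. the median of each original row),
# then concatenate the chunk results back together
rowwise_median_chunked <- function(df, chunk_size = 500L) {
  n <- nrow(df)
  starts <- seq(1L, n, by = chunk_size)
  out <- lapply(starts, function(i) {
    chunk <- as.matrix(df[i:min(i + chunk_size - 1L, n), , drop = FALSE])
    apply(t(chunk), 2L, median)  # column medians of t(chunk) = row medians of chunk
  })
  unlist(out, use.names = FALSE)
}

rowwise_median_chunked(mtcars, chunk_size = 10L)
```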
I started to look into how to implement those two functions, but for now I run into a panic in a sub-thread:

…

and I can't find a way to get the panic message. Also, this error comes from concurrent_handler() in Rust, but I don't understand when this function is actually called.

@sorhawell this is mostly me dabbling with Rust so this might be completely off. Let me know if you want to rewrite that from scratch, otherwise I'd greatly appreciate some pointers on this if you have the time 😄
Related to #400