Copy logic-plan from one LazyFrame to another LazyFrame? #16430
Comments
So far, there are two possible solutions.
In this case, we can hack the JSON path Parquetscan.input to create an lf that joins the server data with the client operations (the side effect is a redundant round trip of Parquet serialisation/deserialisation, which is still worthwhile).
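A hedged sketch of that hack, assuming a polars version where the plan can be serialised to JSON (`serialize(format="json")` / `LazyFrame.deserialize(..., format="json")`), and keeping in mind that the JSON layout of the plan is not a stable API. The placeholder path, the server path, and `rewrite_paths` are illustrative names, and the placeholder file may need to exist client-side with a matching schema:

```python
import io
import json
import polars as pl

PLACEHOLDER_PATH = "client_placeholder.parquet"  # small file the client scans while building the plan
SERVER_PATH = "/data/large_dataset.parquet"      # real path known only to the server

def rewrite_paths(node, old: str, new: str):
    """Recursively swap path strings inside the JSON-encoded plan (layout is version-dependent)."""
    if isinstance(node, dict):
        return {k: rewrite_paths(v, old, new) for k, v in node.items()}
    if isinstance(node, list):
        return [rewrite_paths(v, old, new) for v in node]
    return new if node == old else node

# Client: build the query against a placeholder parquet scan and serialise the plan.
client_lf = pl.scan_parquet(PLACEHOLDER_PATH).filter(pl.col("value") > 0)
plan_json = client_lf.serialize(format="json")  # availability/signature depend on the polars version

# Server: patch the scan path and rehydrate the LazyFrame against the real data.
patched = json.dumps(rewrite_paths(json.loads(plan_json), PLACEHOLDER_PATH, SERVER_PATH))
server_lf = pl.LazyFrame.deserialize(io.StringIO(patched), format="json")
result = server_lf.collect()
```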
There is an interesting test that was recently added: https://github.com/pola-rs/polars/blob/main/py-polars/tests/unit/lazyframe/cuda/test_node_visitor.py It hooks into the plan node iteration and replaces …
Another thing I was thinking about: if the DataFrame is to be embedded in JSON, could it be in Arrow IPC format instead of embedding the values as they are?
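For illustration, a small sketch of what embedding the frame as Arrow IPC inside a JSON payload could look like on the user side (the `"frame_ipc"` field name is made up; this is not an existing polars feature):

```python
import base64
import io
import json
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Encode the frame as Arrow IPC bytes and embed them in a JSON payload.
ipc_bytes = df.write_ipc(None).getvalue()  # write_ipc(None) returns an in-memory BytesIO
payload = json.dumps({"frame_ipc": base64.b64encode(ipc_bytes).decode("ascii")})

# The receiving side reverses the process.
decoded = base64.b64decode(json.loads(payload)["frame_ipc"])
df_roundtrip = pl.read_ipc(io.BytesIO(decoded))
assert df.equals(df_roundtrip)
```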
Hello @ritchie46, as you can see from the discussion on Discord and above, it is probably worthwhile to introduce serialisation/deserialisation of the pure logic plan (without data). Does this make sense to you? Alternatively, do you have any concerns if a PR with a similar feature is proposed?
This would be handy for anyone running Polars in a loop, i.e. you have a ring buffer that you create a DataFrame from on each iteration, and then you create a LazyFrame from an already optimised logical plan (I'm not sure whether the expensive part of optimisation is the logical-plan or the physical-plan optimisation). It might also mitigate the greatly increased cost of the resolution of … In other words, this lets you "emulate" what Flink/RisingWave/Arroyo do by letting you run Polars on a streaming data source. Obviously it's not ideal, but it might mitigate the cost a little if your actual computations aren't too expensive.
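A rough sketch of that loop, where each iteration rebuilds a DataFrame from the buffer and reapplies the same query (the `ring_buffer` source and the query itself are placeholders); the logical plan is currently rebuilt and re-optimised on every iteration, which is the cost being discussed:

```python
from collections import deque
import polars as pl

# Placeholder streaming source; in practice this is filled by the ingest loop.
ring_buffer: deque[dict] = deque(maxlen=10_000)
ring_buffer.extend([{"key": "a", "value": 1.0}, {"key": "a", "value": 3.0}])

def step() -> pl.DataFrame:
    # Rebuild a frame from the current buffer and reapply the same query. The logical
    # plan is reconstructed and re-optimised on every call; reusing an already
    # optimised plan is exactly what this issue asks for.
    df = pl.DataFrame(list(ring_buffer))
    return (
        df.lazy()
        .group_by("key")
        .agg(pl.col("value").mean().alias("mean_value"))
        .collect()
    )

print(step())
```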
Two possible ideas without hacking:
Sharing a solution developed last Friday. Client side:
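(A minimal sketch of the idea, not the original code: the client records method names and arguments as JSON-friendly values; `PlanRecorder` and the payload layout are hypothetical.)

```python
import json

class PlanRecorder:
    """Hypothetical recorder: captures LazyFrame-style method calls as (name, args, kwargs)."""

    def __init__(self) -> None:
        self.calls: list[dict] = []

    def __getattr__(self, name: str):
        def record(*args, **kwargs):
            self.calls.append({"method": name, "args": list(args), "kwargs": kwargs})
            return self  # allow chaining like a LazyFrame
        return record

    def to_json(self) -> str:
        return json.dumps(self.calls)

# Client: describe the operations without ever touching the server's data.
rec = PlanRecorder()
rec.select(["key", "value"]).limit(10)  # plain strings/ints keep the payload JSON-serialisable
payload = rec.to_json()                 # sent to the server
```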
Server side:
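And a matching sketch of the server side, which replays the recorded calls against the real LazyFrame (again hypothetical, and it assumes every recorded argument is a plain Python value; expressions need the encoder discussed further down):

```python
import json
import polars as pl

# Example payload in the shape produced by the client-side recorder above.
payload = json.dumps([
    {"method": "select", "args": [["key", "value"]], "kwargs": {}},
    {"method": "limit", "args": [10], "kwargs": {}},
])

def replay(lf: pl.LazyFrame, recorded: str) -> pl.LazyFrame:
    """Apply each recorded (method, args, kwargs) triple to the server-side LazyFrame."""
    for call in json.loads(recorded):
        lf = getattr(lf, call["method"])(*call["args"], **call["kwargs"])
    return lf

large_lf = pl.scan_parquet("/data/large_dataset.parquet")  # illustrative path
result = replay(large_lf, payload).collect()
```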
I've been looking into using this for a similar case where we want to save the logic plan and then reapply it at a different time, or on a different df with the same schema. With your plan recorder, did you handle the case where you want to do a column operation within the select function, for example?
Hey @GeorgeGibson01, do you mean supporting some Polars function inside LazyFrame.select? I cannot remember exactly, but it should be doable. I will try a case like lf.select(pl.count()) or something similar to check how many lines there are.
It should be similar to the serialisation of pl.Expr, if I remember correctly, and pl.Expr is JSON-serialisable.
A simple example would be:
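(Purely illustrative of the kind of case meant here, a column operation inside select; the expression is made up:)

```python
import polars as pl

lf = pl.LazyFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# A column operation inside select: the recorded call now carries a pl.Expr,
# which is not plain-JSON-serialisable and needs the custom encoder discussed below.
out = lf.select((pl.col("a") * 2).alias("a_doubled"), pl.col("b")).collect()
```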
Haha, my test case was simpler... let me have a try.

Hello @GeorgeGibson01, I just checked and it is doable on my side. We know that both strings and Polars expressions are JSON-serialisable, hence the whole structure recording our function calls and args is JSON-serialisable. In this case, we can create a JSON encoder that performs the following logic:
- If list/dict, encode each element/item one by one
- If pl.Expr, call the expr.meta.serialize() method
- Else, assume it is JSON-serialisable

Sorry that I am not allowed to share the real code I wrote; I hope this general idea helps. Feel free to let me know if you have any questions.
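A minimal sketch of that general idea (not the author's code; it assumes a polars version where `Expr.meta.serialize(format="json")` is available, and the `"__expr__"` wrapper key is an arbitrary choice):

```python
import json
import polars as pl

def encode(obj):
    """Recursively turn recorded args into JSON-friendly values."""
    if isinstance(obj, dict):
        return {k: encode(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [encode(v) for v in obj]
    if isinstance(obj, pl.Expr):
        # Wrap serialised expressions so the decoder can recognise them.
        return {"__expr__": obj.meta.serialize(format="json")}
    return obj  # assume it is already JSON-serialisable

recorded_call = {"method": "select", "args": [(pl.col("a") * 2).alias("a_doubled")], "kwargs": {}}
payload = json.dumps(encode(recorded_call))
```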
Also, when you want to deserialise JSON into a data class that holds a Polars expr, you can set up object_hook in json.loads. This is basically what I tried; feel free to let me know if you have any questions.
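The matching decode side with object_hook might look like this (again a sketch; it assumes `pl.Expr.deserialize` accepts a file-like object with `format="json"` in your polars version):

```python
import io
import json
import polars as pl

# Sample payload in the shape produced by the encoder sketch above.
payload = json.dumps(
    {"method": "select",
     "args": [{"__expr__": (pl.col("a") * 2).alias("a_doubled").meta.serialize(format="json")}],
     "kwargs": {}}
)

def decode_expr(obj: dict):
    """object_hook: rebuild pl.Expr values that were wrapped by the encoder."""
    if "__expr__" in obj:
        return pl.Expr.deserialize(io.StringIO(obj["__expr__"]), format="json")
    return obj

call = json.loads(payload, object_hook=decode_expr)
lf = pl.LazyFrame({"a": [1, 2, 3]})
result = getattr(lf, call["method"])(*call["args"], **call["kwargs"]).collect()
```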
Description
Is it possible to serialise/deserialise the logic plan only?
Possible use case:
Suppose that there is:
- a large LazyFrame on the server side, with ample memory and compute resources (denoted large_lf)
- a small LazyFrame on the client side, with limited resources (denoted small_df, whose schema is the same as large_lf's)

In this case, the user could define a few operations on the client and send a request to the server to apply those operations to large_lf.
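For illustration, one way to approximate "logic plan only" today is to build the plan against an empty client-side frame that shares the schema, so the serialised plan carries almost no data (a sketch; the JSON format and the serialize/deserialize signatures vary by polars version):

```python
import io
import polars as pl

# Client: an empty frame with the same schema as the server's data, so the
# serialised plan carries (almost) no data, only the logic.
schema = {"key": pl.Utf8, "value": pl.Float64}
small_lf = pl.LazyFrame(schema=schema)
plan = (
    small_lf.filter(pl.col("value") > 0)
    .group_by("key")
    .agg(pl.col("value").mean())
    .serialize(format="json")  # version-dependent; a binary format also exists
)

# Server: rehydrate the plan. Its scan source is still the (empty) client frame;
# swapping in large_lf is the missing piece this issue asks for.
server_lf = pl.LazyFrame.deserialize(io.StringIO(plan), format="json")
```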
I had a quick look at the documentation and Discord, and it looks like this is not yet supported. Is it possible to support it in the future (or would a related PR be welcomed)?