
Copy logic-plan from one LazyFrame to another LazyFrame? #16430

Open
linlol opened this issue May 23, 2024 · 6 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

linlol commented May 23, 2024

Description

Is it possible to serialise/deserialise the logical plan only?

Possible use case:

Suppose that there are

  1. a large LazyFrame on the server side with ample memory and compute resources (denoted as large_lf)

  2. a small LazyFrame on the client side with limited resources (denoted as small_df), whose schema is identical to large_lf's

In this case, the user could compose a few operations on small_df and send a request to the server side to apply those operations to large_lf.

I had a quick look at the documentation and Discord, and it looks like this is not yet supported. Is it possible to support it in the future (or would a related PR be welcome)?
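A rough sketch of the workflow I have in mind, using the existing serialize/deserialize methods (the path, the query and the transport are placeholders, and the exact serialisation format — JSON vs. binary — depends on the Polars version):

```python
import io

import polars as pl

# --- client side: limited resources, knows only the schema / path ----------
# Build the query against a scan of the server-side file. A scan-based plan
# references the path rather than embedding rows, but today a LazyFrame built
# from an in-memory DataFrame (DataFrameScan) embeds all of its data in the
# serialized plan -- which is exactly the gap this issue is about.
client_lf = (
    pl.scan_parquet("large.parquet")  # path as it resolves on the server
    .filter(pl.col("value") > 0)
    .group_by("key")
    .agg(pl.col("value").sum())
)
plan_json = client_lf.serialize()  # logical plan only, no row data

# ... send `plan_json` to the server over any transport (HTTP, a queue, ...)

# --- server side: large memory / compute, holds the actual data ------------
server_lf = pl.LazyFrame.deserialize(io.StringIO(plan_json))
result = server_lf.collect()  # executes next to the data
```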

linlol added the enhancement label May 23, 2024
linlol (Author) commented May 23, 2024

So far, two possible workarounds are:

  1. Hack the JSON generated by the serialize/deserialize methods. This definitely works for small datasets; however, the overhead can become large when the plan embeds a Polars DataFrame of more than 1M rows.

  2. Hack the Parquet path on disk. This is what I have tried so far: on both the client side and the server side, the LazyFrame is initialised via the scan_parquet method.

In that case, we can rewrite the Parquet scan's input path in the JSON (ParquetScan.input) to create a LazyFrame that combines the server's data with the client's operations (a rough sketch follows below).

(The side effect is a redundant round trip of Parquet serialisation/deserialisation, but the trade-off is worthwhile.)
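A rough sketch of workaround 2 (the key names in the JSON are illustrative only; the serialized plan layout is an internal detail of Polars and changes between versions, and this assumes serialize() returns JSON as in older Polars releases):

```python
import io
import json

import polars as pl

# Client builds the query against a small local Parquet stub with the same schema.
client_lf = pl.scan_parquet("client_stub.parquet").filter(pl.col("value") > 0)
plan = json.loads(client_lf.serialize())  # assumes JSON output


def rewrite_paths(node, old: str, new: str) -> None:
    """Recursively rewrite any Parquet scan path in the plan from `old` to `new`."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "paths" and value == [old]:  # key name is illustrative
                node[key] = [new]
            else:
                rewrite_paths(value, old, new)
    elif isinstance(node, list):
        for item in node:
            rewrite_paths(item, old, new)


rewrite_paths(plan, "client_stub.parquet", "/data/large.parquet")

# Server side: deserialize the patched plan and run it against the real data.
server_lf = pl.LazyFrame.deserialize(io.StringIO(json.dumps(plan)))
result = server_lf.collect()
```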

cmdlineluser (Contributor)

There is an interesting test that was recently added:

https://github.com/pola-rs/polars/blob/main/py-polars/tests/unit/lazyframe/cuda/test_node_visitor.py

It hooks into the plan node iteration and replaces DataFrameScan / Join nodes with custom callbacks.

eitsupi (Contributor) commented May 24, 2024

Another thing I was thinking about: if the DataFrame has to be embedded in the JSON, could it be embedded in Arrow IPC format instead of embedding the raw values as they are?
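Roughly what I mean (the payload layout here is just an example, not a Polars API):

```python
import base64
import io
import json

import polars as pl

df = pl.DataFrame({"key": [1, 2], "value": [0.5, 1.5]})

# Instead of dumping every value into the plan JSON, embed the frame once as a
# compact Arrow IPC blob, base64-encoded so it fits in a JSON string.
ipc_bytes = df.write_ipc(None).getvalue()
payload = json.dumps({"frame_ipc": base64.b64encode(ipc_bytes).decode("ascii")})

# Receiving side: decode and rebuild the DataFrame from the IPC blob.
decoded = base64.b64decode(json.loads(payload)["frame_ipc"])
round_tripped = pl.read_ipc(io.BytesIO(decoded))
```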

linlol (Author) commented May 28, 2024

Hello @ritchie46,

As you can see from the discussion on Discord and above, it is probably worthwhile to introduce serialisation/deserialisation of the pure logical plan (without data).

Does this make sense to you? Alternatively, do you have any concerns if a PR with this feature were proposed?

cmdlineluser (Contributor)

#16624 (comment)

> This is important as we want to be able to send the query to another machine.

kszlim (Contributor) commented Jun 17, 2024

This would be handy for anyone running Polars in a loop, i.e. you have a ring buffer that you create a DataFrame from on each iteration, and then you create a LazyFrame from an already-optimised logical plan (I'm not sure whether the expensive part of optimisation is the logical-plan or the physical-plan optimisation). It might also mitigate the greatly increased cost of schema resolution that has appeared in the last little while.

I.e. this lets you "emulate" what Flink/RisingWave/Arroyo do by letting you kind of run Polars on a streaming data source (see the sketch below). Obviously it's not ideal, but it might reduce the cost a little if your actual computations aren't too expensive.
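A sketch of the loop pattern I mean, as it works today: the same query is re-declared and re-optimised (including schema resolution) on every iteration, which is the overhead a reusable pre-optimised plan could skip (the ring-buffer source and the query are placeholders):

```python
import polars as pl


def read_ring_buffer() -> pl.DataFrame:
    # Placeholder for whatever drains the ring buffer on each tick.
    return pl.DataFrame({"sensor": ["a", "b"], "value": [1.0, 2.0]})


for _ in range(10):
    df = read_ring_buffer()
    # Today the query below is built, schema-resolved and optimised from
    # scratch every iteration, even though only the input data changes.
    result = (
        df.lazy()
        .group_by("sensor")
        .agg(pl.col("value").mean())
        .collect()
    )
    # Reusing an already-optimised plan would let this loop skip everything
    # except the final execution step.
```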
