Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copy logic-plan from one LazyFrame to another LazyFrame? #16430

Open
linlol opened this issue May 23, 2024 · 14 comments
Open

Copy logic-plan from one LazyFrame to another LazyFrame? #16430

linlol opened this issue May 23, 2024 · 14 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@linlol
Copy link

linlol commented May 23, 2024

Description

Is this possible to serialise/deserialise the logic plan only?

Possible use case:

Suppose that there is

  1. a large LazyFrame on server side with great memory and compute resource (denoted as large_lf)

  2. a small LazyFrame (denoted as small_df whose schema is indifferent with large_df) on client side with limited resource

In this case, user could implement a few actions, and send request to server side to apply those actions on large_df

Had a quick look at document and discord looks like it is not yet supported, is it possible to support it in the future (or is a PR related would be welcomed?)

@linlol linlol added the enhancement New feature or an improvement of an existing feature label May 23, 2024
@linlol
Copy link
Author

linlol commented May 23, 2024

So far, two possible solution is

  1. Hack the input on JSON generated from serialize/deserialize method, it definitly works for small dataset, however, the overhead could be great when we try to dump a polars df more than 1M

  2. hack the parquet path in disk? this is what I tried so far. Both client side and server side, lazyframe is inited via scan_parquet method,

in this case, we can hack the json path Parquetscan.input to create lf which joins server data and clinet operation

(the side-effect is redundant round-trip of parquet serialisation/deserialisation, which is worthwhile)

@cmdlineluser
Copy link
Contributor

There is an interesting test that was recently added:

https://github.com/pola-rs/polars/blob/main/py-polars/tests/unit/lazyframe/cuda/test_node_visitor.py

It hooks into the plan node iteration and replaces DataFrameScan / Join nodes with custom callbacks.

@eitsupi
Copy link
Contributor

eitsupi commented May 24, 2024

Another thing I was thinking about was if the DataFrame was to be embedded in json, could it be in Arrow IPC format instead of embedding the values as they are?

@linlol
Copy link
Author

linlol commented May 28, 2024

hello @ritchie46

As you can see from the discussion in discord and above, probably it's worthwhile to introduce the serialisation/deserialisation of pure logic-plan (without data)

does it make sense to you, or alternatively, do you have any concern if pr with similar feature proposed

@cmdlineluser
Copy link
Contributor

#16624 (comment)

This is important as we want to be able to send the query to another machine.

@kszlim
Copy link
Contributor

kszlim commented Jun 17, 2024

This would be handy for anyone running polars in a loop, ie. you have a ring buffer that you create a dataframe from on each iteration, and then you create a lazyframe from an already optimized logical plan (i'm not sure if the expensive part of optimization is from logical plan or physical plan optimization). But this might mitigate the greatly increased cost of the resolution of schema (that has occurred in the last little while).

Ie. this lets you "emulate" what flink/risingwave/arroyo do by letting you kind of run polars on a streaming data source. Obviously it's not ideal, but might mitigate the cost a little if your actual computations aren't too expensive.

@linlol
Copy link
Author

linlol commented Jun 22, 2024

Two possible idea without hacking

  1. pure Python watcher as a lib, e.g. a decorator/subclass of LazyFrame/a class with LazyFrame composited that would record every operation?

  2. A new LazyFrame API other than current serialize/deserialize, which displays/read DslPlan but have the Scan part ignored?

@linlol
Copy link
Author

linlol commented Jul 2, 2024

Share a solution developed last Fri.

Client Side:

  1. In our case, we have a class called LazyFrameClient, with a lazyframe _lf whose schema coincide with server side plan

  2. When lazy funciton like filter/agg/groupby e.t.c called, it mimic the operation on self._lf and record plan with below property abstracted as a pydantic.BaseModel

  1. function name
  2. function positional args (in terms of List[Any])
  3. function optional args (in terms of Dict[str Any])
  1. develop customised JSONEncoder/hook to serialise/deserialise the logic plan
  2. send request once client.collect() called, it serailsie the recorded plan and send to server

Server Side

  1. once get request, deserialise request JSON to function name/args/kwargs
  2. call function via reflection
  3. serialise the result to parquet, send back to client

@GeorgeGibson01
Copy link

Share a solution developed last Fri.

Client Side:

  1. In our case, we have a class called LazyFrameClient, with a lazyframe _lf whose schema coincide with server side plan

  2. When lazy funciton like filter/agg/groupby e.t.c called, it mimic the operation on self._lf and record plan with below property abstracted as a pydantic.BaseModel

  3. function name

  4. function positional args (in terms of List[Any])

  5. function optional args (in terms of Dict[str Any])

  6. develop customised JSONEncoder/hook to serialise/deserialise the logic plan

  7. send request once client.collect() called, it serailsie the recorded plan and send to server

Server Side

  1. once get request, deserialise request JSON to function name/args/kwargs
  2. call function via reflection
  3. serialise the result to parquet, send back to client

Been looking in to using this for a similar case where we want to save the logic plan and then reapply at a different time or on a different df with the same schema.

Via your plan recorder did you handle the case where you want to do a column operation within the select function for example

@linlol
Copy link
Author

linlol commented Aug 13, 2024

hey @GeorgeGibson01

Do you mean support some polars function inside LazyFrame.select?

I cannot remember correctly but it shall be doable. I shall tried case like

If.select(pl.count()) or something similar to check how many lines

@linlol
Copy link
Author

linlol commented Aug 13, 2024

Shall be similar to the serialise of pl.Expr if I remember correctly, while pl.expr is json-serialisable

@GeorgeGibson01
Copy link

simple example would be

lf= pl.LazyFrame(data)
lf.select([“x”, “y”, pl.col(“x”).alias(“x+y”) + pl.col(“y”)])

@linlol
Copy link
Author

linlol commented Aug 13, 2024

simple example would be

lf= pl.LazyFrame(data)
lf.select([“x”, “y”, pl.col(“x”).alias(“x+y”) + pl.col(“y”)])

haha my test case was simplier... let me have a try

hello, @GeorgeGibson01 I just checked and it is doable on my side.

We know that both string and polars expression are json-serialisable, hence the whole structure recording our function call and args are json serialisable

In this case, we can create a json encoder, and perform below logic:

If list/dict, encode element/item one by one

If pl.Expr, call expr.meta.serialize() method

Else, assume it is json serialisable

Sorry that I am not allowed to share the real code I wrote, hope this general idea helps, feel free to let me know if you have any questions

@linlol
Copy link
Author

linlol commented Aug 14, 2024

also, when you want to deserialise json to a data-class which has polars expr,

you can setup object_hook in json.loads

This is basically what I tried, feel free to let me know if you have any questions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

No branches or pull requests

5 participants