Copy logic-plan from one LazyFrame to another LazyFrame? #16430

linlol · 2024-05-23T09:05:13Z

Description

Is this possible to serialise/deserialise the logic plan only?

Possible use case:

Suppose that there is

a large LazyFrame on server side with great memory and compute resource (denoted as large_lf)
a small LazyFrame (denoted as small_df whose schema is indifferent with large_df) on client side with limited resource

In this case, user could implement a few actions, and send request to server side to apply those actions on large_df

Had a quick look at document and discord looks like it is not yet supported, is it possible to support it in the future (or is a PR related would be welcomed?)

linlol · 2024-05-23T11:22:19Z

So far, two possible solution is

Hack the input on JSON generated from serialize/deserialize method, it definitly works for small dataset, however, the overhead could be great when we try to dump a polars df more than 1M
hack the parquet path in disk? this is what I tried so far. Both client side and server side, lazyframe is inited via scan_parquet method,

in this case, we can hack the json path Parquetscan.input to create lf which joins server data and clinet operation

(the side-effect is redundant round-trip of parquet serialisation/deserialisation, which is worthwhile)

cmdlineluser · 2024-05-24T12:24:25Z

There is an interesting test that was recently added:

https://github.com/pola-rs/polars/blob/main/py-polars/tests/unit/lazyframe/cuda/test_node_visitor.py

It hooks into the plan node iteration and replaces DataFrameScan / Join nodes with custom callbacks.

eitsupi · 2024-05-24T12:37:08Z

Another thing I was thinking about was if the DataFrame was to be embedded in json, could it be in Arrow IPC format instead of embedding the values as they are?

linlol · 2024-05-28T05:53:32Z

hello @ritchie46

As you can see from the discussion in discord and above, probably it's worthwhile to introduce the serialisation/deserialisation of pure logic-plan (without data)

does it make sense to you, or alternatively, do you have any concern if pr with similar feature proposed

cmdlineluser · 2024-05-31T13:17:31Z

#16624 (comment)

This is important as we want to be able to send the query to another machine.

kszlim · 2024-06-17T09:28:05Z

This would be handy for anyone running polars in a loop, ie. you have a ring buffer that you create a dataframe from on each iteration, and then you create a lazyframe from an already optimized logical plan (i'm not sure if the expensive part of optimization is from logical plan or physical plan optimization). But this might mitigate the greatly increased cost of the resolution of schema (that has occurred in the last little while).

Ie. this lets you "emulate" what flink/risingwave/arroyo do by letting you kind of run polars on a streaming data source. Obviously it's not ideal, but might mitigate the cost a little if your actual computations aren't too expensive.

linlol · 2024-06-22T10:57:49Z

Two possible idea without hacking

pure Python watcher as a lib, e.g. a decorator/subclass of LazyFrame/a class with LazyFrame composited that would record every operation?
A new LazyFrame API other than current serialize/deserialize, which displays/read DslPlan but have the Scan part ignored?

linlol · 2024-07-02T16:59:59Z

Share a solution developed last Fri.

Client Side:

In our case, we have a class called LazyFrameClient, with a lazyframe _lf whose schema coincide with server side plan
When lazy funciton like filter/agg/groupby e.t.c called, it mimic the operation on self._lf and record plan with below property abstracted as a pydantic.BaseModel

function name
function positional args (in terms of List[Any])
function optional args (in terms of Dict[str Any])

develop customised JSONEncoder/hook to serialise/deserialise the logic plan
send request once client.collect() called, it serailsie the recorded plan and send to server

Server Side

once get request, deserialise request JSON to function name/args/kwargs
call function via reflection
serialise the result to parquet, send back to client

GeorgeGibson01 · 2024-08-13T10:07:29Z

Share a solution developed last Fri.

Client Side:

In our case, we have a class called LazyFrameClient, with a lazyframe _lf whose schema coincide with server side plan

When lazy funciton like filter/agg/groupby e.t.c called, it mimic the operation on self._lf and record plan with below property abstracted as a pydantic.BaseModel

function name

function positional args (in terms of List[Any])

function optional args (in terms of Dict[str Any])

develop customised JSONEncoder/hook to serialise/deserialise the logic plan

send request once client.collect() called, it serailsie the recorded plan and send to server

Server Side

once get request, deserialise request JSON to function name/args/kwargs

call function via reflection

serialise the result to parquet, send back to client

Been looking in to using this for a similar case where we want to save the logic plan and then reapply at a different time or on a different df with the same schema.

Via your plan recorder did you handle the case where you want to do a column operation within the select function for example

linlol · 2024-08-13T14:36:27Z

hey @GeorgeGibson01

Do you mean support some polars function inside LazyFrame.select?

I cannot remember correctly but it shall be doable. I shall tried case like

If.select(pl.count()) or something similar to check how many lines

linlol · 2024-08-13T14:40:32Z

Shall be similar to the serialise of pl.Expr if I remember correctly, while pl.expr is json-serialisable

GeorgeGibson01 · 2024-08-13T14:45:04Z

simple example would be

lf= pl.LazyFrame(data)
lf.select([“x”, “y”, pl.col(“x”).alias(“x+y”) + pl.col(“y”)])

linlol · 2024-08-13T14:59:12Z

simple example would be

lf= pl.LazyFrame(data)
lf.select([“x”, “y”, pl.col(“x”).alias(“x+y”) + pl.col(“y”)])

haha my test case was simplier... let me have a try

hello, @GeorgeGibson01 I just checked and it is doable on my side.

We know that both string and polars expression are json-serialisable, hence the whole structure recording our function call and args are json serialisable

In this case, we can create a json encoder, and perform below logic:

If list/dict, encode element/item one by one

If pl.Expr, call expr.meta.serialize() method

Else, assume it is json serialisable

Sorry that I am not allowed to share the real code I wrote, hope this general idea helps, feel free to let me know if you have any questions

linlol · 2024-08-14T06:56:12Z

also, when you want to deserialise json to a data-class which has polars expr,

you can setup object_hook in json.loads

This is basically what I tried, feel free to let me know if you have any questions

linlol added the enhancement New feature or an improvement of an existing feature label May 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Copy logic-plan from one LazyFrame to another LazyFrame? #16430

Copy logic-plan from one LazyFrame to another LazyFrame? #16430

linlol commented May 23, 2024 •

edited

Loading

linlol commented May 23, 2024 •

edited

Loading

cmdlineluser commented May 24, 2024

eitsupi commented May 24, 2024

linlol commented May 28, 2024

cmdlineluser commented May 31, 2024

kszlim commented Jun 17, 2024

linlol commented Jun 22, 2024

linlol commented Jul 2, 2024

GeorgeGibson01 commented Aug 13, 2024

linlol commented Aug 13, 2024

linlol commented Aug 13, 2024

GeorgeGibson01 commented Aug 13, 2024

linlol commented Aug 13, 2024 •

edited

Loading

linlol commented Aug 14, 2024

Copy logic-plan from one LazyFrame to another LazyFrame? #16430

Copy logic-plan from one LazyFrame to another LazyFrame? #16430

Comments

linlol commented May 23, 2024 • edited Loading

Description

Possible use case:

linlol commented May 23, 2024 • edited Loading

cmdlineluser commented May 24, 2024

eitsupi commented May 24, 2024

linlol commented May 28, 2024

cmdlineluser commented May 31, 2024

kszlim commented Jun 17, 2024

linlol commented Jun 22, 2024

linlol commented Jul 2, 2024

GeorgeGibson01 commented Aug 13, 2024

linlol commented Aug 13, 2024

linlol commented Aug 13, 2024

GeorgeGibson01 commented Aug 13, 2024

linlol commented Aug 13, 2024 • edited Loading

linlol commented Aug 14, 2024

linlol commented May 23, 2024 •

edited

Loading

linlol commented May 23, 2024 •

edited

Loading

linlol commented Aug 13, 2024 •

edited

Loading