Support CSE on python UDFs #16637

kszlim · 2024-05-31T18:42:31Z

Description

I'm guessing CSE isn't supported because Python UDFs can potentially be stateful. Could the map_* methods on pl.Expr take an is_pure parameter that lets these expressions get CSE'd?
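To make the statefulness concern concrete, here is a minimal pure-Python sketch. It does not use Polars; `make_stateful_udf` is a hypothetical helper, and the "with CSE" branch simply imitates what an optimizer would do by evaluating once and reusing the result.

```python
import itertools


def make_stateful_udf():
    """Return a UDF whose output depends on how often it has been called."""
    counter = itertools.count()

    def udf(values):
        offset = next(counter)  # hidden state: 0 on the first call, 1 on the second, ...
        return [v + offset for v in values]

    return udf


data = [1, 2, 3]

# Without CSE: the UDF runs once for every place it appears in the query.
udf = make_stateful_udf()
no_cse = (udf(data), udf(data))

# With CSE: the shared subexpression is evaluated once and reused.
udf = make_stateful_udf()
shared = udf(data)
with_cse = (shared, shared)

print(no_cse)    # ([1, 2, 3], [2, 3, 4])
print(with_cse)  # ([1, 2, 3], [1, 2, 3])
```

For a pure UDF the two strategies give identical results, which is why an opt-in flag such as the proposed is_pure would be enough to make the rewrite safe.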

ritchie46 · 2024-06-01T06:45:04Z

I am not sure about that. We will not call into Python to check functions for equality, and pointer checking has failed in the past.
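As an illustration of why function equality is hard to decide in general (this is a sketch of the Python-side problem, not a description of Polars internals), consider the following. All function names are made up for the example.

```python
def double_a(x):
    return x * 2


def double_b(x):
    return x * 2


# Same behaviour, but Python compares functions by identity only.
same_behaviour = double_a(21) == double_b(21)   # True
considered_equal = double_a == double_b          # False


def make_scaler(factor):
    def scale(x):
        return x * factor
    return scale


times_two = make_scaler(2)
times_three = make_scaler(3)

# Comparing code objects is not a safe shortcut either: both closures
# share one code object yet compute different results.
same_code = times_two.__code__ is times_three.__code__      # True
same_result = times_two(10) == times_three(10)               # False

print(considered_equal, same_code, same_result)
```

So identity is too strict (equivalent functions look different) and code comparison is too loose (different closures look the same), which leaves an explicit user-provided signal as the remaining option.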

kszlim · 2024-06-01T06:56:14Z

Can these be special-cased so that a user can say it's safe to CSE this expression? This ends up being a pretty big annoyance in some cases and makes certain programming patterns ugly.

kszlim · 2024-06-01T07:06:22Z

At work, I essentially provide a framework where users pass me expressions, and I apply them to a base table while also adding an over to the user-provided expressions. The only way to avoid UDFs from being recomputed would be by referencing them by name in a later context. That's okay if the UDF is cheap, but some of them are quite expensive, so having CSE work would be great.

ritchie46 · 2024-06-03T16:38:15Z

I think this can create many bugs, which I don't want to open up at this point in time. We can look at enabling it for UDFs later.

kszlim · 2024-06-03T16:42:55Z

Even if it's completely opt-in? This is a bit of a blocker for me. Would it be possible to bring back the pl.Expr.cache method as an alternative?

avimallu · 2024-06-03T17:31:15Z

> The only way to avoid UDFs from being recomputed would be by referencing them by name in a later context.

Does lru_cache not work for your case?

kszlim · 2024-06-03T17:52:05Z

> > The only way to avoid UDFs from being recomputed would be by referencing them by name in a later context.
>
> Does lru_cache not work for your case?

@avimallu I don't think you understand the feature request.

import polars as pl
ldf = pl.LazyFrame({"a": [1, 2, 3]})
udf_expr = pl.col("a").map_batches(lambda x: x*2).alias("b")
derived_expr_0 = udf_expr.mul(2).alias("c")
derived_expr_1 = udf_expr.mul(3).alias("d")
ldf = ldf.with_columns(udf_expr, derived_expr_0, derived_expr_1)
print(ldf.explain())

Running this prints:

 WITH_COLUMNS:
 [col("a").python_udf().alias("b"), [(col("a").python_udf()) * (2.cast(Unknown(Any)))].alias("c"), [(col("a").python_udf()) * (3.cast(Unknown(Any)))].alias("d")], []
  DF ["a"]; PROJECT */1 COLUMNS; SELECTION: None

The python_udf() call appears three times in the plan, meaning the UDF gets evaluated three times.

Contrast it with:

import polars as pl
ldf = pl.LazyFrame({"a": [1, 2, 3]})
udf_expr = pl.col("a").mul(2).alias("b")
derived_expr_0 = udf_expr.mul(2).alias("c")
derived_expr_1 = udf_expr.mul(3).alias("d")
ldf = ldf.with_columns(udf_expr, derived_expr_0, derived_expr_1)
print(ldf.explain())

Which will print out:

 WITH_COLUMNS:
 [col("__POLARS_CSER_0xd39686281a38356a").alias("b"), [(col("__POLARS_CSER_0xd39686281a38356a")) * (2)].alias("c"), [(col("__POLARS_CSER_0xd39686281a38356a")) * (3)].alias("d")], [[(col("a")) * (2)].alias("__POLARS_CSER_0xd39686281a38356a")]
  DF ["a"]; PROJECT */1 COLUMNS; SELECTION: None

Getting the same deduplication for UDFs requires changes on the Polars side. Otherwise you have to explicitly structure your query like this:

import polars as pl
ldf = pl.LazyFrame({"a": [1, 2, 3]})
udf_expr = pl.col("a").map_batches(lambda x: x*2).alias("b")
derived_expr_0 = pl.col("b").mul(2).alias("c")
derived_expr_1 = pl.col("b").mul(3).alias("d")
ldf = ldf.with_columns(udf_expr)
ldf = ldf.with_columns(derived_expr_0, derived_expr_1)
print(ldf.explain())

But in my use case, users build up trees of expressions which they pass to my framework to evaluate. Without CSE support this becomes very ugly, because splitting the query by hand breaks the abstraction.

deanm0000 · 2024-06-05T11:55:59Z

@avimallu I had the same thought about lru_cache, but it doesn't work because neither pl.Series nor np.ndarray is hashable, and the underlying cache needs hashable arguments.
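The hashing limitation can be reproduced with the standard library alone. In this sketch a plain list stands in for an unhashable column container (an assumption made so the example needs no third-party packages), and `expensive_udf` is a made-up name.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def expensive_udf(values):
    # lru_cache builds its cache key by hashing the arguments.
    return [v * 2 for v in values]


# A list is unhashable, so the call fails before the function body runs.
try:
    expensive_udf([1, 2, 3])
    error = None
except TypeError as exc:
    error = str(exc)

print(error)

# A hashable argument such as a tuple works and is cached on repeat calls.
first = expensive_udf((1, 2, 3))
second = expensive_udf((1, 2, 3))
print(expensive_udf.cache_info().hits)  # 1
```

Converting a column to a tuple just to make it hashable would likely cost more than it saves for large inputs, so this is not a practical workaround for the original problem.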

ritchie46 · 2024-06-05T14:00:19Z

Ping me next week. I will see if I can put something behind an env var.

kszlim · 2024-06-10T16:01:05Z

@ritchie46 if you're not swamped, this would be great! Or, if you can give me some high-level guidance, I can try to take a crack at it myself, provided you don't think it's too complicated.

kszlim added the enhancement New feature or an improvement of an existing feature label May 31, 2024
