Support CSE on python UDFs #16637
Comments
I am not sure about that. We will not call into python to check functions for equality, and pointer checking failed in the past. |
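For readers wondering why function equality is hard here, the following is a minimal pure-Python sketch, not polars code. It only shows standard Python behaviour: functions compare by identity, and two closures can share one code object while behaving differently. The helper names `make_udf` and `make_scaler` are hypothetical, and whether this is the exact reason pointer checking failed in polars is an assumption.

```python
def make_udf():
    # Each call builds a new function object from the same source.
    return lambda x: x * 2


def make_scaler(k):
    # The returned lambda closes over k, so behaviour depends on k.
    return lambda x: x * k


f = make_udf()
g = make_udf()

# Functions compare by identity, so textually identical UDFs are unequal.
identical_source_equal = f == g

double = make_scaler(2)
triple = make_scaler(3)

# Both closures share one code object, yet they compute different things,
# so comparing code objects alone would wrongly merge them.
share_code = double.__code__ is triple.__code__
same_behaviour = double(1) == triple(1)

print(identical_source_equal, share_code, same_behaviour)
```

Neither identity nor code-object comparison gives a safe notion of "same UDF", which is one reason an explicit opt-in from the user is attractive.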
Can these be special cased so that a user can say that it's safe to CSE this expression? This ends up being a pretty big annoyance in some cases and makes certain programming patterns ugly. |
At work, I essentially provide a framework where users pass me expressions and I apply them to a base table as well as adding an |
I think this can create many bugs, which I don't want to open up at this point in time. We can look at enabling it for UDFs later. |
Even if it's completely opt in? This is a bit of a blocker for me, I'm curious if it'd be possible to bring back the |
Does |
@avimallu I don't think you understand the feature request.
import polars as pl
ldf = pl.LazyFrame({"a": [1, 2, 3]})
udf_expr = pl.col("a").map_batches(lambda x: x*2).alias("b")
derived_expr_0 = udf_expr.mul(2).alias("c")
derived_expr_1 = udf_expr.mul(3).alias("d")
ldf = ldf.with_columns(udf_expr, derived_expr_0, derived_expr_1)
print(ldf.explain())
You'll notice that:
Will print out, meaning that the udf gets evaluated 3x. Contrast it with:
import polars as pl
ldf = pl.LazyFrame({"a": [1, 2, 3]})
udf_expr = pl.col("a").mul(2).alias("b")
derived_expr_0 = udf_expr.mul(2).alias("c")
derived_expr_1 = udf_expr.mul(3).alias("d")
ldf = ldf.with_columns(udf_expr, derived_expr_0, derived_expr_1)
print(ldf.explain())
Which will print out:
This requires polars side changes or you have to explicitly write your query/code like:
import polars as pl
ldf = pl.LazyFrame({"a": [1, 2, 3]})
udf_expr = pl.col("a").map_batches(lambda x: x*2).alias("b")
derived_expr_0 = pl.col("b").mul(2).alias("c")
derived_expr_1 = pl.col("b").mul(3).alias("d")
ldf = ldf.with_columns(udf_expr)
ldf = ldf.with_columns(derived_expr_0, derived_expr_1)
print(ldf.explain())
But in my use case, users build up trees of expressions which they pass to my framework to evaluate. Without CSE support this becomes very ugly and breaks the abstraction. |
@avimallu I had the same thought on the lru_cache but it doesn't work because neither |
Ping me next week. I will see if I can put something behind an env var. |
@ritchie46 if you're not swamped this would be great! Or if you can give me some high level guidance I can try to take a crack at this if you don't think it's too complicated. |
Description
I'm guessing CSE isn't supported because python UDFs can potentially be stateful. Could we make it so that map_* methods on pl.Expr can take in an is_pure parameter that will let these get CSE'd?
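To make the request concrete, here is a toy sketch of opt-in CSE over a tiny expression tree. It is a stand-alone model and not polars internals: the `Col`, `Mul`, and `Udf` classes, the `evaluate` function, and the per-node `is_pure` flag are all hypothetical names invented for illustration. A UDF node's result is reused only when the user has marked it pure.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Mul:
    inner: "Expr"
    factor: int


@dataclass(frozen=True)
class Udf:
    inner: "Expr"
    func: Callable[[int], int]
    is_pure: bool = False  # hypothetical opt-in flag


Expr = Union[Col, Mul, Udf]


def evaluate(expr: Expr, row: dict, cache: dict) -> int:
    """Evaluate expr on one row, reusing results of pure UDF nodes."""
    if isinstance(expr, Col):
        return row[expr.name]
    if isinstance(expr, Mul):
        return evaluate(expr.inner, row, cache) * expr.factor
    # Udf node: only consult the cache when the user opted in.
    if expr.is_pure and expr in cache:
        return cache[expr]
    value = expr.func(evaluate(expr.inner, row, cache))
    if expr.is_pure:
        cache[expr] = value
    return value


calls = []


def double(x: int) -> int:
    calls.append(x)  # record every real invocation of the UDF
    return x * 2


def run(is_pure: bool):
    calls.clear()
    udf = Udf(Col("a"), double, is_pure=is_pure)
    exprs = [udf, Mul(udf, 2), Mul(udf, 3)]  # mirrors columns b, c, d
    cache: dict = {}
    results = [evaluate(e, {"a": 1}, cache) for e in exprs]
    return results, len(calls)


print(run(False))  # UDF invoked once per expression that contains it
print(run(True))   # UDF invoked once, result shared
```

The default of `is_pure=False` keeps today's conservative behaviour for stateful UDFs, and only nodes the user explicitly vouches for are deduplicated.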