Bug: Projection pushdown optimization removes wrong column names #12917
Comments
Similar bug with join operations. Consider the following example:
import polars as pl

# Eager mode
df1 = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
df2 = pl.DataFrame({"y_s": [1, 2, 3], "y": ["d", "e", "f"]})
print(
    "Eager mode\n",
    df1.join(df2, left_on="x", right_on="y_s", suffix="_s", how="inner").select(
        [pl.col("y_s")]
    ),
)

# Lazy mode
df1 = df1.lazy()
df2 = df2.lazy()
print(
    "Lazy mode\n",
    df1.join(df2, left_on="x", right_on="y_s", suffix="_s", how="inner")
    .select([pl.col("y_s")])
    .collect(),
)
Error log:
Diagnosis:
Proposed fix:
I think this may also be the underlying cause of #12722 (I thought it was struct-specific at the time).
I tried this join example, and it works with 0.18.4 but not with 0.18.5. Here are the versions reported by pl.show_versions(). Working versions: ----Optional dependencies---- Failing versions: --------Version info--------- ----Optional dependencies----
This one is fixed.
Checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Reproducible example
Log output
Issue description
Description: Lazy mode panics with a "ColumnNotFound" error, while eager mode completes successfully.
Diagnosis: The error is caused by a bug in the projection pushdown optimization code, at projection.rs, line 84. At this line, the optimizer checks whether any alias expression has a name in projected_names and removes that name before propagating the projection down the tree. However, the check_double_projection function checks not only the top-level expression but also the expressions nested below it, so it removes the wrong name. Consider the example above with commentary notes (read in order 1 -> 2 -> 3):
Proposed fix: Only remove the names of alias expressions at the top level.
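To make the difference concrete, here is a minimal toy sketch in Python. It is not the Polars implementation: the `Col` and `Alias` classes and both helper functions are hypothetical stand-ins, and the only assumption carried over from the diagnosis is that alias names get removed from the set of projected names. It contrasts removing the names of every alias in an expression tree with removing only the top-level alias name.

```python
from dataclasses import dataclass
from typing import Set, Union


@dataclass(frozen=True)
class Col:
    """Hypothetical leaf expression: a reference to an input column."""
    name: str


@dataclass(frozen=True)
class Alias:
    """Hypothetical expression that renames the output of `inner`."""
    inner: "Expr"
    name: str


Expr = Union[Col, Alias]


def strip_alias_names_recursive(expr: Expr, projected_names: Set[str]) -> Set[str]:
    """Toy model of the reported behavior: every alias in the tree removes its name."""
    remaining = set(projected_names)
    node = expr
    while isinstance(node, Alias):
        remaining.discard(node.name)
        node = node.inner
    return remaining


def strip_alias_names_top_level(expr: Expr, projected_names: Set[str]) -> Set[str]:
    """Toy model of the proposed fix: only the outermost alias removes its name."""
    remaining = set(projected_names)
    if isinstance(expr, Alias):
        remaining.discard(expr.name)
    return remaining


# col("x").alias("y").alias("z"): the output column is "z"; "y" is only an
# intermediate name inside the expression.
expr = Alias(Alias(Col("x"), "y"), "z")

# Suppose the nodes above need "z" (produced by this expression) and "y"
# (a real input column requested by some other expression).
projected_names = {"y", "z"}

buggy = strip_alias_names_recursive(expr, projected_names)
fixed = strip_alias_names_top_level(expr, projected_names)
```

In this toy model the recursive version also drops "y", so the real input column "y" would no longer be requested from the nodes below, which is the kind of situation that could surface as a missing column. The top-level version keeps "y" in the set.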
Further proposal: To prevent bugs like this, should we make the code more functional? That is, instead of mutating projected_names, we could traverse the expressions that satisfy expr_is_projected_upstream and collect all columns touched by those expressions. Consider the pseudocode below:
Expected behavior
Lazy mode should return the same results as eager mode.
Installed versions
Version 0.19.19