Bug: Projection pushdown optimization remove wrong column names #12917

Mibu287 · 2023-12-06T15:22:44Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Eager mode
print(
    "Eager mode:\n",
    df.select(
        [
            pl.col("y"),
            pl.col("x") * pl.col("x").alias("y"),
        ]
    ).select([pl.col("y"), pl.col("x")]),
)

# Lazy mode
print(
    "Lazy mode:\n",
    df.lazy()
    .select(
        [
            pl.col("y"),
            pl.col("x") * pl.col("x").alias("y"),
        ]
    )
    .select([pl.col("y"), pl.col("x")])
    .collect(),
)

Log output

Eager mode:
 shape: (3, 2)
┌─────┬─────┐
│ y   ┆ x   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 1   │
│ b   ┆ 4   │
│ c   ┆ 9   │
└─────┴─────┘
thread '<unnamed>' panicked at crates/polars-plan/src/utils.rs:402:68:
called `Result::unwrap()` on an `Err` value: ColumnNotFound(ErrString("y"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/private/tmp/example.py", line 27, in <module>
    .collect(),
     ^^^^^^^^^
  File "/Users/minhbui/.pyenv/versions/science/lib/python3.11/site-packages/polars/utils/deprecation.py", line 100, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/minhbui/.pyenv/versions/science/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1788, in collect
    return wrap_df(ldf.collect())
                   ^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ColumnNotFound(ErrString("y"))

Issue description

Description: Lazy mode panic due to error "ColumnNotFound" while eager mode complete successfully.

Diagnosis: The error occur due to a bug in projection pushdown optimization code projection.rs - Line 84.

At this line, the optimizer performs check for any alias expression has name in projected_names to remove before propagating down the tree. However, check_double_projection function not only check for top-level expression but also expressions down below ==> It remove the wrong name.

Consider the examples above with commentary notes (Read in order 1 -> 2 -> 3):

  df.lazy()
  .select(
      [
          pl.col("y"), # 2. no change in projected_names ==> ["x", "y"]
          pl.col("x") * pl.col("x").alias("y"), #3. Remove the name "y" because "y" is the name of an alias sub-expression ==> ["x"]
      ]
  )
  .select([pl.col("y"), pl.col("x")]) #1. propagate projected_names = ["x", "y"]
  .collect()

Proposed fix: Only remove names of alias expression at top-level.

Further proposal: To prevent bugs like this, should we make the code more functional? i.e Instead of mutating projected_names, we traverse expressions which satisfied expr_is_projected_upstream and collect all columns touched by these expressions. Consider the pseudo code as below:

  let projected_names = exprs
      .iter()
      .filter(|e| expr_is_projected_upstream(e, input, lp_arena, expr_arena, &projected_names))
      .flat_map(|e| iter_columns(e))
      .collect();

Expected behavior

Lazy mode should return the same results as eager mode.

Installed versions

Version 0.19.19

The text was updated successfully, but these errors were encountered:

Mibu287 · 2023-12-07T16:14:09Z

Similar bug with join operations. Consider an examples as following:

import polars as pl

# Eager mode
df1 = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
df2 = pl.DataFrame({"y_s": [1, 2, 3], "y": ["d", "e", "f"]})
print(
    "Eager mode\n",
    df1.join(df2, left_on="x", right_on="y_s", suffix="_s", how="inner").select(
        [pl.col("y_s")]
    ),
)

# Lazy mode
df1 = df1.lazy()
df2 = df2.lazy()
print(
    "Lazy mode\n",
    df1.join(df2, left_on="x", right_on="y_s", suffix="_s", how="inner")
    .select([pl.col("y_s")])
    .collect(),
)

Error log:

pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ColumnNotFound(ErrString("y_s"))

Diagnosis:
When pushing down y_s for the right input, the optimizer notice that y_s is in right_on parameter ==> only y_s column in df2 is projected (the y column is mistakenly missed) ==> The error.

Proposed fix:
Consider the algorithm in the pseudo code below:

left_cols = All the columns in left schema
right_cols = All the columns in right schema (Except the right_on columns)
projected_cols = All the columns in projection expression
suffix = Join suffix

function maybe_suffix(col):
    Return col with suffix if col in left_cols otherwise return col

right_cols_maybe_suffix: Apply function maybe_suffix to every columns in right_cols

left_cols and right_cols_maybe_suffix are disjoint sets ==> each valid projection must belong to exactly one set.

Loop over projection expressions:
    Traverse the expression tree:
        At each leaf, get the column name => determine the column name in left or right expressions
            Build up the left and right projections

cmdlineluser · 2023-12-07T16:19:04Z

I think this may also be the underlying cause for #12722 (I thought it was struct specific at the time)

sami-ka · 2024-04-16T08:17:00Z

Similar bug with join operations. Consider an examples as following:

import polars as pl

# Eager mode
df1 = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
df2 = pl.DataFrame({"y_s": [1, 2, 3], "y": ["d", "e", "f"]})
print(
    "Eager mode\n",
    df1.join(df2, left_on="x", right_on="y_s", suffix="_s", how="inner").select(
        [pl.col("y_s")]
    ),
)

# Lazy mode
df1 = df1.lazy()
df2 = df2.lazy()
print(
    "Lazy mode\n",
    df1.join(df2, left_on="x", right_on="y_s", suffix="_s", how="inner")
    .select([pl.col("y_s")])
    .collect(),
)

Error log:

pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ColumnNotFound(ErrString("y_s"))

Diagnosis: When pushing down y_s for the right input, the optimizer notice that y_s is in right_on parameter ==> only y_s column in df2 is projected (the y column is mistakenly missed) ==> The error.

Proposed fix: Consider the algorithm in the pseudo code below:

left_cols = All the columns in left schema
right_cols = All the columns in right schema (Except the right_on columns)
projected_cols = All the columns in projection expression
suffix = Join suffix

function maybe_suffix(col):
    Return col with suffix if col in left_cols otherwise return col

right_cols_maybe_suffix: Apply function maybe_suffix to every columns in right_cols

left_cols and right_cols_maybe_suffix are disjoint sets ==> each valid projection must belong to exactly one set.

Loop over projection expressions:
    Traverse the expression tree:
        At each leaf, get the column name => determine the column name in left or right expressions
            Build up the left and right projections

I tried this join example and it seems that it works with 0.18.4 but not with 0.18.5.

Here are the versions I have with pl.show_versions().

Working versions:
--------Version info---------
Polars: 0.18.4
Index type: UInt32
Platform: macOS-14.4-arm64-arm-64bit
Python: 3.9.18 (main, Mar 7 2024, 12:26:15)
[Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
numpy: 1.26.4
pandas: 2.2.2
pyarrow: 15.0.2
connectorx:
deltalake:
fsspec:
matplotlib: 3.8.4
xlsx2csv:
xlsxwriter:

Failing versions:

--------Version info---------
Polars: 0.18.5
Index type: UInt32
Platform: macOS-14.4-arm64-arm-64bit
Python: 3.9.18 (main, Mar 7 2024, 12:26:15)
[Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_sqlite:
connectorx:
deltalake:
fsspec:
matplotlib: 3.8.4
numpy: 1.26.4
pandas: 2.2.2
pyarrow: 15.0.2
pydantic:
sqlalchemy: 2.0.29
xlsx2csv:
xlsxwriter:

ritchie46 · 2024-06-19T08:54:35Z

This one is fixed.

Mibu287 added bug Something isn't working rust Related to Rust Polars labels Dec 6, 2023

kszlim mentioned this issue Dec 7, 2023

feat: Enable partial predicate pushdown past window expressions #12710

Merged

cmdlineluser mentioned this issue Dec 28, 2023

Lazy queryplanner bug causes PanicException #13287

Closed

2 tasks

stinodego added the needs triage Awaiting prioritization by a maintainer label Jan 13, 2024

cmdlineluser mentioned this issue Jan 31, 2024

ColumnNotFoundError when running lazy query but not eager query #14141

Closed

2 tasks

cmdlineluser mentioned this issue Feb 8, 2024

Lazy processing throws error when equivalent nonlazy code does not #14382

Closed

2 tasks

owenprough-sift mentioned this issue Apr 5, 2024

ColumnNotFoundError on concat of LazyFrames; works fine if collected first #15467

Open

2 tasks

ericdoerheit mentioned this issue May 17, 2024

Problem with concat and hive partitioned data in combination with in-memory data frames #16285

Open

2 tasks

ritchie46 closed this as completed Jun 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: Projection pushdown optimization remove wrong column names #12917

Bug: Projection pushdown optimization remove wrong column names #12917

Mibu287 commented Dec 6, 2023 •

edited

Loading

Mibu287 commented Dec 7, 2023

cmdlineluser commented Dec 7, 2023

sami-ka commented Apr 16, 2024

ritchie46 commented Jun 19, 2024

Bug: Projection pushdown optimization remove wrong column names #12917

Bug: Projection pushdown optimization remove wrong column names #12917

Comments

Mibu287 commented Dec 6, 2023 • edited Loading

Checks

Reproducible example

Log output

Issue description

Expected behavior

Installed versions

Mibu287 commented Dec 7, 2023

cmdlineluser commented Dec 7, 2023

sami-ka commented Apr 16, 2024

ritchie46 commented Jun 19, 2024

Mibu287 commented Dec 6, 2023 •

edited

Loading