Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: Projection pushdown optimization remove wrong column names #12917

Closed
2 tasks done
Mibu287 opened this issue Dec 6, 2023 · 4 comments
Closed
2 tasks done

Bug: Projection pushdown optimization remove wrong column names #12917

Mibu287 opened this issue Dec 6, 2023 · 4 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer rust Related to Rust Polars

Comments

@Mibu287
Copy link

Mibu287 commented Dec 6, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Eager mode
print(
    "Eager mode:\n",
    df.select(
        [
            pl.col("y"),
            pl.col("x") * pl.col("x").alias("y"),
        ]
    ).select([pl.col("y"), pl.col("x")]),
)

# Lazy mode
print(
    "Lazy mode:\n",
    df.lazy()
    .select(
        [
            pl.col("y"),
            pl.col("x") * pl.col("x").alias("y"),
        ]
    )
    .select([pl.col("y"), pl.col("x")])
    .collect(),
)

Log output

Eager mode:
 shape: (3, 2)
┌─────┬─────┐
│ y   ┆ x   │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════╪═════╡
│ a   ┆ 1   │
│ b   ┆ 4   │
│ c   ┆ 9   │
└─────┴─────┘
thread '<unnamed>' panicked at crates/polars-plan/src/utils.rs:402:68:
called `Result::unwrap()` on an `Err` value: ColumnNotFound(ErrString("y"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/private/tmp/example.py", line 27, in <module>
    .collect(),
     ^^^^^^^^^
  File "/Users/minhbui/.pyenv/versions/science/lib/python3.11/site-packages/polars/utils/deprecation.py", line 100, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/minhbui/.pyenv/versions/science/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 1788, in collect
    return wrap_df(ldf.collect())
                   ^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ColumnNotFound(ErrString("y"))

Issue description

Description: Lazy mode panic due to error "ColumnNotFound" while eager mode complete successfully.

Diagnosis: The error occur due to a bug in projection pushdown optimization code projection.rs - Line 84.

At this line, the optimizer performs check for any alias expression has name in projected_names to remove before propagating down the tree. However, check_double_projection function not only check for top-level expression but also expressions down below ==> It remove the wrong name.

Consider the examples above with commentary notes (Read in order 1 -> 2 -> 3):

  df.lazy()
  .select(
      [
          pl.col("y"), # 2. no change in projected_names ==> ["x", "y"]
          pl.col("x") * pl.col("x").alias("y"), #3. Remove the name "y" because "y" is the name of an alias sub-expression ==> ["x"]
      ]
  )
  .select([pl.col("y"), pl.col("x")]) #1. propagate projected_names = ["x", "y"]
  .collect()

Proposed fix: Only remove names of alias expression at top-level.

Further proposal: To prevent bugs like this, should we make the code more functional? i.e Instead of mutating projected_names, we traverse expressions which satisfied expr_is_projected_upstream and collect all columns touched by these expressions. Consider the pseudo code as below:

  let projected_names = exprs
      .iter()
      .filter(|e| expr_is_projected_upstream(e, input, lp_arena, expr_arena, &projected_names))
      .flat_map(|e| iter_columns(e))
      .collect();

Expected behavior

Lazy mode should return the same results as eager mode.

Installed versions

Version 0.19.19

@Mibu287 Mibu287 added bug Something isn't working rust Related to Rust Polars labels Dec 6, 2023
@Mibu287
Copy link
Author

Mibu287 commented Dec 7, 2023

Similar bug with join operations. Consider an examples as following:

import polars as pl

# Eager mode
df1 = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
df2 = pl.DataFrame({"y_s": [1, 2, 3], "y": ["d", "e", "f"]})
print(
    "Eager mode\n",
    df1.join(df2, left_on="x", right_on="y_s", suffix="_s", how="inner").select(
        [pl.col("y_s")]
    ),
)

# Lazy mode
df1 = df1.lazy()
df2 = df2.lazy()
print(
    "Lazy mode\n",
    df1.join(df2, left_on="x", right_on="y_s", suffix="_s", how="inner")
    .select([pl.col("y_s")])
    .collect(),
)

Error log:

pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ColumnNotFound(ErrString("y_s"))

Diagnosis:
When pushing down y_s for the right input, the optimizer notice that y_s is in right_on parameter ==> only y_s column in df2 is projected (the y column is mistakenly missed) ==> The error.

Proposed fix:
Consider the algorithm in the pseudo code below:

left_cols = All the columns in left schema
right_cols = All the columns in right schema (Except the right_on columns)
projected_cols = All the columns in projection expression
suffix = Join suffix

function maybe_suffix(col):
    Return col with suffix if col in left_cols otherwise return col

right_cols_maybe_suffix: Apply function maybe_suffix to every columns in right_cols

left_cols and right_cols_maybe_suffix are disjoint sets ==> each valid projection must belong to exactly one set.

Loop over projection expressions:
    Traverse the expression tree:
        At each leaf, get the column name => determine the column name in left or right expressions
            Build up the left and right projections

@cmdlineluser
Copy link
Contributor

I think this may also be the underlying cause for #12722 (I thought it was struct specific at the time)

@sami-ka
Copy link

sami-ka commented Apr 16, 2024

Similar bug with join operations. Consider an examples as following:

import polars as pl

# Eager mode
df1 = pl.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
df2 = pl.DataFrame({"y_s": [1, 2, 3], "y": ["d", "e", "f"]})
print(
    "Eager mode\n",
    df1.join(df2, left_on="x", right_on="y_s", suffix="_s", how="inner").select(
        [pl.col("y_s")]
    ),
)

# Lazy mode
df1 = df1.lazy()
df2 = df2.lazy()
print(
    "Lazy mode\n",
    df1.join(df2, left_on="x", right_on="y_s", suffix="_s", how="inner")
    .select([pl.col("y_s")])
    .collect(),
)

Error log:

pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ColumnNotFound(ErrString("y_s"))

Diagnosis: When pushing down y_s for the right input, the optimizer notice that y_s is in right_on parameter ==> only y_s column in df2 is projected (the y column is mistakenly missed) ==> The error.

Proposed fix: Consider the algorithm in the pseudo code below:

left_cols = All the columns in left schema
right_cols = All the columns in right schema (Except the right_on columns)
projected_cols = All the columns in projection expression
suffix = Join suffix

function maybe_suffix(col):
    Return col with suffix if col in left_cols otherwise return col

right_cols_maybe_suffix: Apply function maybe_suffix to every columns in right_cols

left_cols and right_cols_maybe_suffix are disjoint sets ==> each valid projection must belong to exactly one set.

Loop over projection expressions:
    Traverse the expression tree:
        At each leaf, get the column name => determine the column name in left or right expressions
            Build up the left and right projections

I tried this join example and it seems that it works with 0.18.4 but not with 0.18.5.

Here are the versions I have with pl.show_versions().

Working versions:
--------Version info---------
Polars: 0.18.4
Index type: UInt32
Platform: macOS-14.4-arm64-arm-64bit
Python: 3.9.18 (main, Mar 7 2024, 12:26:15)
[Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
numpy: 1.26.4
pandas: 2.2.2
pyarrow: 15.0.2
connectorx:
deltalake:
fsspec:
matplotlib: 3.8.4
xlsx2csv:
xlsxwriter:

Failing versions:

--------Version info---------
Polars: 0.18.5
Index type: UInt32
Platform: macOS-14.4-arm64-arm-64bit
Python: 3.9.18 (main, Mar 7 2024, 12:26:15)
[Clang 15.0.0 (clang-1500.1.0.2.5)]

----Optional dependencies----
adbc_driver_sqlite:
connectorx:
deltalake:
fsspec:
matplotlib: 3.8.4
numpy: 1.26.4
pandas: 2.2.2
pyarrow: 15.0.2
pydantic:
sqlalchemy: 2.0.29
xlsx2csv:
xlsxwriter:

@ritchie46
Copy link
Member

This one is fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer rust Related to Rust Polars
Projects
None yet
Development

No branches or pull requests

5 participants