full join with coalesce=True panics if more key expressions are used than columns in a frame #16547

Closed
2 tasks done
wence- opened this issue May 28, 2024 · 0 comments · Fixed by #16551
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

Collaborator

wence- commented May 28, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
left = pl.DataFrame({"a": [1, 2, 3], "b": [3, 4, 5], "c": [5, 6, 7]})
right = pl.DataFrame({"a": [2, 3, 4], "c": [4, 5, 6]})

left.join(right, on=[pl.col("a"), pl.col("a") % 2 == 0, pl.col("a") + pl.col("c")], how="full", coalesce=True)
thread '<unnamed>' panicked at crates/polars-ops/src/frame/join/general.rs:90:25:
removal index (is 3) should be < len (is 3)
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/home/coder/doodles/python/polars-interp/bug2.py", line 5, in <module>
    left.join(right, on=[pl.col("a"), pl.col("a") % 2 == 0, pl.col("a") + pl.col("c")], how="full", coalesce=True)
  File "/home/coder/third-party/polars/py-polars/polars/dataframe/frame.py", line 6549, in join
    self.lazy()
  File "/home/coder/third-party/polars/py-polars/polars/lazyframe/frame.py", line 1855, in collect
    return wrap_df(ldf.collect(callback))
pyo3_runtime.PanicException: removal index (is 3) should be < len (is 3)

Log output

run JoinExec
join parallel: true

Issue description

This is a follow-up to #16289. That case (repeated identical column expressions in the keys) was fixed in #16329, but this slightly more general case still panics.

Expected behavior

No panic, and instead an appropriate error message.

FWIW, I think passing coalesce=False should also raise an error: because the key expressions all have overlapping names, it is (presumably) implementation-defined which of them becomes the concrete key value in the result.

Installed versions

--------Version info---------
Polars:               0.20.30
Index type:           UInt32
Platform:             Linux-6.5.0-35-generic-x86_64-with-glibc2.35
Python:               3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  1.0.0
cloudpickle:          3.0.0
connectorx:           0.3.3
deltalake:            0.17.4
fastexcel:            0.10.4
fsspec:               2024.5.0
gevent:               24.2.1
hvplot:               0.10.0
matplotlib:           3.9.0
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               2.2.2
pyarrow:              16.1.0
pydantic:             2.7.1
pyiceberg:            <not installed>
pyxlsb:               1.0.10
sqlalchemy:           2.0.30
torch:                2.3.0.post301
xlsx2csv:             0.8.2
xlsxwriter:           3.2.0
@wence- wence- added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels May 28, 2024
wence- added a commit to wence-/polars that referenced this issue May 28, 2024
The fix for pola-rs#16289 checked for expression identity when validating the
join keys, but if multiple expressions are not identical, they may
still produce matching output key names. Since this is ambiguous,
catch this more general case and raise.

- Fixes pola-rs#16547
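
The check described in the commit message can be sketched in plain Python. This is only an illustration of the idea (reject join keys whose output names collide), not the actual Rust change; the helper names `duplicated_key_names` and `validate_join_keys` are hypothetical. In Polars, the output names would presumably be gathered with `expr.meta.output_name()`, and the three key expressions in the reproducer should all resolve to "a".

```python
from collections import Counter
from typing import Iterable


def duplicated_key_names(output_names: Iterable[str]) -> list[str]:
    """Return the output names that occur more than once, sorted."""
    counts = Counter(output_names)
    return sorted(name for name, n in counts.items() if n > 1)


def validate_join_keys(output_names: Iterable[str]) -> None:
    """Raise if two or more join key expressions share an output name.

    Hypothetical helper: distinct expressions can still produce the same
    output name, which makes the resulting key column ambiguous.
    """
    dupes = duplicated_key_names(output_names)
    if dupes:
        raise ValueError(
            f"join keys produce duplicate output names: {dupes}"
        )


# Names assumed for the three key expressions in the reproducer:
# pl.col("a"), pl.col("a") % 2 == 0, pl.col("a") + pl.col("c")
try:
    validate_join_keys(["a", "a", "a"])
except ValueError as exc:
    print(exc)
```

Comparing output names rather than expression identity is what distinguishes this from the earlier fix for pola-rs#16289, which only caught keys that were the same expression.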