ColumnNotFoundError appears in lazy mode only in version 0.20.28 #16435

Closed
2 tasks done
Bonnevie opened this issue May 23, 2024 · 9 comments · Fixed by #16463
Labels
A-optimizer Area: plan optimization accepted Ready for implementation bug Something isn't working P-high Priority: high python Related to Python Polars regression Issue introduced by a new release

Comments

@Bonnevie

Bonnevie commented May 23, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

I am not sure how to reproduce this.

Log output

join parallel: true
INNER join dataframes finished
dataframe filtered
FOUND SORTED KEY: running default HASH AGGREGATION
FOUND SORTED KEY: running default HASH AGGREGATION
found multiple sources; run comm_subplan_elim
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
DATAFRAME < 1000 rows: running default HASH AGGREGATION
LEFT join dataframes finished
CACHE SET: cache id: 0
keys/aggregates are not partitionable: running default HASH AGGREGATION
join parallel: false
CACHE HIT: cache id: 0
join parallel: false
join parallel: false
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
OUTER join dataframes finished
INNER join dataframes finished
found multiple sources; run comm_subplan_elim
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
keys/aggregates are not partitionable: running default HASH AGGREGATION
LEFT join dataframes finished
CACHE SET: cache id: 0
join parallel: false
CACHE HIT: cache id: 0
join parallel: false
join parallel: false
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
OUTER join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
INNER join dataframes finished
found multiple sources; run comm_subplan_elim
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
DATAFRAME < 1000 rows: running default HASH AGGREGATION
LEFT join dataframes finished
CACHE SET: cache id: 0
keys/aggregates are not partitionable: running default HASH AGGREGATION
join parallel: false
CACHE HIT: cache id: 0
join parallel: false
join parallel: false
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
OUTER join dataframes finished
INNER join dataframes finished
found multiple sources; run comm_subplan_elim
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
DATAFRAME < 1000 rows: running default HASH AGGREGATION
LEFT join dataframes finished
CACHE SET: cache id: 0
join parallel: false
CACHE HIT: cache id: 0
join parallel: false
join parallel: false
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
OUTER join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
INNER join dataframes finished
found multiple sources; run comm_subplan_elim
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
DATAFRAME < 1000 rows: running default HASH AGGREGATION
LEFT join dataframes finished
CACHE SET: cache id: 0
join parallel: false
CACHE HIT: cache id: 0
join parallel: false
join parallel: false
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
OUTER join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
INNER join dataframes finished
found multiple sources; run comm_subplan_elim
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
DATAFRAME < 1000 rows: running default HASH AGGREGATION
LEFT join dataframes finished
CACHE SET: cache id: 0
join parallel: false
CACHE HIT: cache id: 0
join parallel: false
join parallel: false
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
OUTER join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
INNER join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
found multiple sources; run comm_subplan_elim
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
keys/aggregates are not partitionable: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
DATAFRAME < 1000 rows: running default HASH AGGREGATION
LEFT join dataframes finished
CACHE SET: cache id: 0
join parallel: false
CACHE HIT: cache id: 0
join parallel: false
join parallel: false
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
OUTER join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
INNER join dataframes finished
found multiple sources; run comm_subplan_elim
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
keys/aggregates are not partitionable: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
DATAFRAME < 1000 rows: running default HASH AGGREGATION
LEFT join dataframes finished
CACHE SET: cache id: 0
join parallel: false
CACHE HIT: cache id: 0
join parallel: false
join parallel: false
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
OUTER join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
INNER join dataframes finished
found multiple sources; run comm_subplan_elim
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
DATAFRAME < 1000 rows: running default HASH AGGREGATION
LEFT join dataframes finished
CACHE SET: cache id: 0
keys/aggregates are not partitionable: running default HASH AGGREGATION
join parallel: false
CACHE HIT: cache id: 0
join parallel: false
join parallel: false
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
OUTER join dataframes finished
INNER join dataframes finished
found multiple sources; run comm_subplan_elim
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
keys/aggregates are not partitionable: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
DATAFRAME < 1000 rows: running default HASH AGGREGATION
LEFT join dataframes finished
CACHE SET: cache id: 0
keys/aggregates are not partitionable: running default HASH AGGREGATION
join parallel: false
CACHE HIT: cache id: 0
join parallel: false
join parallel: false
LEFT join dataframes finished
keys/aggregates are not partitionable: running default HASH AGGREGATION
LEFT join dataframes finished
OUTER join dataframes finished
INNER join dataframes finished
found multiple sources; run comm_subplan_elim
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
join parallel: false
join parallel: false
join parallel: false
join parallel: false
join parallel: false
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
keys/aggregates are not partitionable: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
LEFT join dataframes finished
FOUND SORTED KEY: running default HASH AGGREGATION
LEFT join dataframes finished
LEFT join dataframes finished
keys/aggregates are not partitionable: running default HASH AGGREGATION
LEFT join dataframes finished
CACHE SET: cache id: 0
keys/aggregates are not partitionable: running default HASH AGGREGATION
join parallel: false
CACHE HIT: cache id: 0
join parallel: false
join parallel: false
LEFT join dataframes finished
keys/aggregates are not partitionable: running default HASH AGGREGATION
LEFT join dataframes finished
OUTER join dataframes finished
INNER join dataframes finished

Issue description

This is not a very strong bug report, I'm sorry; I just wanted to give a heads-up on a potential issue in the newest version, 0.20.28.

We have several complex queries that we can run either in lazy or eager mode, but on upgrade our tests started failing with polars.exceptions.ColumnNotFoundError. Intending to debug, I ran the same tests in eager mode, and then all tests passed.

I tried doing a manual bisect for the version where the error was introduced, and it seems to be in 0.20.27/0.20.28 (the former was yanked), as the tests pass without issue on 0.20.26.

I will downgrade for now, and will alert you if I find more actionable intelligence on the issue.

The exact error is:

[... trace of my own code calling df.collect()]
python3.10/site-packages/polars/lazyframe/frame.py", line 1817, in collect
    return wrap_df(ldf.collect(callback))
polars.exceptions.ColumnNotFoundError: column_name

Expected behavior

Running with LazyFrames should yield identical results to running on normal DataFrames.

Installed versions

--------Version info---------
Polars:               0.20.28
Index type:           UInt32
Platform:             Linux-5.10.214-202.855.amzn2.x86_64-x86_64-with-glibc2.35
Python:               3.10.11 (main, May 16 2023, 00:28:57) [GCC 11.2.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         1.5.8
numpy:                1.21.6
openpyxl:             3.0.3
pandas:               1.3.5
pyarrow:              15.0.2
pydantic:             2.6.3
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           3.2.0
@Bonnevie Bonnevie added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels May 23, 2024
@owenprough-sift

collect() has parameters to disable various optimizations. Can you determine which optimization is at fault?

@stinodego stinodego added regression Issue introduced by a new release A-optimizer Area: plan optimization labels May 23, 2024
@stinodego
Member

stinodego commented May 23, 2024

It would be much appreciated if you could work out a minimal reproducible example. These types of bugs have very high priority for us, but this report does not give us enough to go on.

@Bonnevie
Author

@stinodego I understand, but the query is quite complex and uses a lot of features, so it's hard for me to zero in on the offending part. It's obviously even harder for you without anything to go on, but I hoped that maybe the sparse details would ring a bell, given that it's at least particular to the latest version.
I will dig a bit and see if I can narrow it down.

@owenprough-sift I tried the optimization flags, but no luck; the issue still arose with

df.collect(
    type_coercion=True,
    predicate_pushdown=False,
    projection_pushdown=False,
    simplify_expression=False,
    slice_pushdown=False,
    comm_subplan_elim=False,
    comm_subexpr_elim=False,
    no_optimization=True,
)

I cannot get rid of type_coercion; if I set it to False, I get a number of other issues that I don't think are related.

@Bonnevie
Author

@stinodego Okay, I tried digging a bit, and I have a cursed LazyFrame.
It's an empty (0, 82) DataFrame and contains a column called visitNr.
If I do df.collect() all is fine. If I do

df.with_columns(pl.col("visitNr").alias("foo")).collect()

I get polars.exceptions.ColumnNotFoundError: foo,
but if I then add a select statement,

df.select(df.columns).with_columns(pl.col("visitNr").alias("foo")).collect()

it works again.
If I do

pl.LazyFrame({"visitNr": []}, schema={"visitNr": str}).with_columns(pl.col("visitNr").alias("foo")).collect()

it works fine, so it's still not easily reproducible.

@coastalwhite
Collaborator

This might be a regression caused by the cluster_with_columns optimization (#16274), but I am not sure.

@Bonnevie
Author

I checked, and df.collect().lazy().with_columns(pl.col("visitNr").alias("foo")).collect() also runs, so it seems like something goes wrong in query planning/optimization. Specifically, it seems to request a column before the alias is applied, because removing the alias and calling df.with_columns(pl.col("visitNr")).collect() also works.

@Bonnevie
Author

Bonnevie commented May 23, 2024

This might be a regression caused by the cluster_with_columns #16274, but I am not sure.

This looked like a very likely culprit, but I found out that there is a toggle for it (#16446), and setting it to False didn't help, unfortunately (edit: never mind, the toggle is not released yet; collect just has **kwargs and accepts every keyword). It's worth noting that using select in place of with_columns does seem to resolve the issue, so the problem must lie with with_columns.

@lukeshingles
Contributor

I believe I'm also seeing the same bug with ColumnNotFoundError when collecting a LazyFrame that has been constructed in a fairly complicated way (I'm also having trouble reducing it to a simple example). Collecting with cluster_with_columns=False prevents the problem.

@lukeshingles
Contributor

Here is the smallest example I can come up with:

import polars as pl

df = pl.DataFrame({"a": [1]}).lazy()

df = (
    df.with_columns(b=pl.col("a"))
    .with_columns(c=pl.col("b"))
    .with_columns(col_lit2=pl.lit(2))
    .with_columns(col_lit2_b=pl.col("col_lit2"))
    .with_columns(missingcol=pl.lit(3))
)

dfmodelcollect = df.collect(cluster_with_columns=True)
print(dfmodelcollect)

Result with cluster_with_columns=True:
polars.exceptions.ColumnNotFoundError: missingcol

Result with cluster_with_columns=False (or using DataFrame instead of LazyFrame):

shape: (1, 6)
┌─────┬─────┬─────┬──────────┬────────────┬────────────┐
│ a   ┆ b   ┆ c   ┆ col_lit2 ┆ col_lit2_b ┆ missingcol │
│ --- ┆ --- ┆ --- ┆ ---      ┆ ---        ┆ ---        │
│ i64 ┆ i64 ┆ i64 ┆ i32      ┆ i32        ┆ i32        │
╞═════╪═════╪═════╪══════════╪════════════╪════════════╡
│ 1   ┆ 1   ┆ 1   ┆ 2        ┆ 2          ┆ 3          │
└─────┴─────┴─────┴──────────┴────────────┴────────────┘

@stinodego stinodego added P-high Priority: high and removed needs triage Awaiting prioritization by a maintainer labels May 23, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog May 23, 2024
coastalwhite added a commit to coastalwhite/polars that referenced this issue May 24, 2024
@github-project-automation github-project-automation bot moved this from Ready to Done in Backlog May 24, 2024
@c-peters c-peters added the accepted Ready for implementation label May 27, 2024