Lazy processing throws error when equivalent nonlazy code does not #14382

Closed
2 tasks done
william-chu-github opened this issue Feb 8, 2024 · 1 comment · Fixed by #14437
Labels
bug (Something isn't working), P-high (Priority: high), python (Related to Python Polars)

Comments

william-chu-github commented Feb 8, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars

polars.Config.set_verbose(True)

DF = polars.DataFrame({
  "b": [1, 6, 8, 7]
})
DF2 = polars.DataFrame({
  "a": [1, 2, 4, 4],  "b": [True, True, True, True]
})
  
DF3 = DF2.lazy().\
  select(
    *["a", "b"], polars.lit("b").alias("b_name"),
    DF.get_column("b").alias("b_old")
  ).\
  filter(polars.col("b") == False).drop("b").collect()

Log output

File "C:\Program Files\Python38\lib\site-packages\polars\lazyframe\frame.py", line 1940, in collect
    return wrap_df(ldf.collect())

ComputeError: column 'b_old' not available in schema Schema:
name: a, data type: Int64
name: b, data type: Boolean

Issue description

Some lazy processing throws exceptions when the equivalent nonlazy code runs fine.

In the reproducible example above, if I remove the lazy processing, the code returns a dataframe with zero rows, as expected:

DF3 = DF2.\
  select(
    *["a", "b"], polars.lit("b").alias("b_name"),
    DF.get_column("b").alias("b_old")
  ).\
  filter(polars.col("b") == False).drop("b")

If I keep the lazy code but leave out the filter and everything after it, it runs fine:

DF3_nofilter = DF2.lazy().\
  select(
    *["a", "b"], polars.lit("b").alias("b_name"),
    DF.get_column("b").alias("b_old")
  ).collect()

DF3_nofilter is a dataframe in which b_old definitely exists, so I don't know why the original code complains that b_old isn't in the schema. More surprisingly, I don't know why it looks up b_old at all, since the filter and drop operations mention only b, not b_old.

shape: (4, 4)
┌─────┬──────┬────────┬───────┐
│ a   ┆ b    ┆ b_name ┆ b_old │
│ --- ┆ ---  ┆ ---    ┆ ---   │
│ i64 ┆ bool ┆ str    ┆ i64   │
╞═════╪══════╪════════╪═══════╡
│ 1   ┆ true ┆ b      ┆ 1     │
│ 2   ┆ true ┆ b      ┆ 6     │
│ 4   ┆ true ┆ b      ┆ 8     │
│ 4   ┆ true ┆ b      ┆ 7     │
└─────┴──────┴────────┴───────┘

Expected behavior

Output of the lazy and nonlazy processing should be the same, and the lazy processing should not throw an exception.

Installed versions

--------Version info---------
Polars:               0.20.7
Index type:           UInt32
Platform:             Windows-10-10.0.19041-SP0
Python:               3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 23:03:10) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.0
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               2023.4.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.6.2
numpy:                1.22.3
openpyxl:             <not installed>
pandas:               2.0.2
pyarrow:              11.0.0
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           1.4.45
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
william-chu-github added the bug, needs triage, and python labels Feb 8, 2024
cmdlineluser (Contributor) commented

Just for debugging purposes: it runs with projection_pushdown disabled.

.collect(projection_pushdown=False)
# shape: (0, 3)
# ┌─────┬────────┬───────┐
# │ a   ┆ b_name ┆ b_old │
# │ --- ┆ ---    ┆ ---   │
# │ i64 ┆ str    ┆ i64   │
# ╞═════╪════════╪═══════╡
# └─────┴────────┴───────┘

I'm not sure if they are all the same issue, but there are a few similar problems currently open:
