Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible bug at when().then().otherwise() #17405

Closed
2 tasks done
barak1412 opened this issue Jul 3, 2024 · 4 comments
Closed
2 tasks done

Possible bug at when().then().otherwise() #17405

barak1412 opened this issue Jul 3, 2024 · 4 comments
Labels
bug Something isn't working python Related to Python Polars

Comments

@barak1412
Copy link
Contributor

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.DataFrame({
    's': [{'d': 3, 'i':[1], 'v':[0.8]}]
})

i = 0
df2 = df.with_columns(
    pl.when(pl.col('s').struct.field('i').list.contains(i)) \
      .then(
        pl.col('s').struct.field('v').list.get(
            pl.col('s').struct.field('i').list.eval(pl.arg_where(pl.element() == i)).list.get(0)
        )
      ).otherwise(0)
)
print(df2)

Log output

Traceback (most recent call last):
  File "C:/Users/Barak/PycharmProjects/flowit_benchmark/new_stem.py", line 22, in <module>
    df2 = df.with_columns(
  File "C:\Users\Barak\AppData\Local\Programs\Python\Python38\lib\site-packages\polars\dataframe\frame.py", line 8763, in with_columns
    return self.lazy().with_columns(*exprs, **named_exprs).collect(_eager=True)
  File "C:\Users\Barak\AppData\Local\Programs\Python\Python38\lib\site-packages\polars\lazyframe\frame.py", line 1942, in collect
    return wrap_df(ldf.collect(callback))
polars.exceptions.ComputeError: get index is out of bounds

Issue description

It seems like the expression inside the then is evaluated (but not returned though), although the condition in the when expression is not met.
Is this intended? it seems like at least a performance issue, if not logical.

Expected behavior

I would expect to get:

shape: (1, 2)
┌───────────────┬─────┐
│ s             ┆ v   │
│ ---           ┆ --- │
│ struct[3]     ┆ f64 │
╞═══════════════╪═════╡
│ {3,[1],[0.8]} ┆ 0.0 │
└───────────────┴─────┘

Installed versions

--------Version info---------
Polars:               1.0.0
Index type:           UInt32
Platform:             Windows-10-10.0.19041-SP0
Python:               3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 23:11:46) [MSC v.1916 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          2.2.0
connectorx:           0.3.3a1
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2023.10.0
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           3.5.1
nest_asyncio:         1.5.1
numpy:                1.24.4
openpyxl:             <not installed>
pandas:               2.0.3
pyarrow:              13.0.0
pydantic:             1.8.2
pyiceberg:            <not installed>
sqlalchemy:           1.4.32
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

@barak1412 barak1412 added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jul 3, 2024
@henryharbeck
Copy link
Contributor

The expressions are all evaluated in parallel to increase performance. This means each expression must be valid on its own.

Docs reference

@stinodego
Copy link
Member

This is intended. We do more work than needed, but we can do it in parallel.

You're seeing an error due to this change:
#16841

Add null_on_oob=True to your get calls and you won't see an error anymore.

@stinodego stinodego closed this as not planned Won't fix, can't repro, duplicate, stale Jul 3, 2024
@stinodego stinodego removed the needs triage Awaiting prioritization by a maintainer label Jul 3, 2024
@barak1412
Copy link
Contributor Author

barak1412 commented Jul 3, 2024

Thanks!

@ritchie46
Copy link
Member

If we count cpu cycles we don't do more work than needed. This allows us to vectorize the expressions. If we would execute them concurrently this would be super branchy and not very optimizable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working python Related to Python Polars
Projects
None yet
Development

No branches or pull requests

4 participants