PanicException: index out of bounds: the len is 0 but the index is 0 when filter doesn't select any rows #12570

Closed
2 tasks done
ddutt opened this issue Nov 19, 2023 · 5 comments · Fixed by #12575
Labels: bug (Something isn't working), python (Related to Python Polars)

Comments

ddutt commented Nov 19, 2023

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

This is not a fully reproducible example, but the query has this shape:

pl.scan_parquet('/data/parquet/**/*.parquet').filter(pl.col.colA == "foo").select(['colA', 'colB', 'colC'])

This crashes when no rows have colA == "foo". Without the filter, everything works.

Log output

In [2]: pl.scan_parquet('/tmp/demo1/inventory/**/*.parquet').filter(pl.col.hostname=="foo").select(['namespace', 'hostname', 'timestamp']).collect()
thread '<unnamed>' panicked at crates/polars-lazy/src/physical_plan/executors/scan/parquet.rs:305:37:
index out of bounds: the len is 0 but the index is 0
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[2], line 1
----> 1 pl.scan_parquet('/tmp/demo1/inventory/**/*.parquet').filter(pl.col.hostname=="foo").select(['namespace', 'hostname', 'timestamp']).collect()

File ~/work/stardust/enterprise/.venv/lib/python3.11/site-packages/polars/utils/deprecation.py:100, in deprecate_renamed_parameter.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
     95 @wraps(function)
     96 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
     97     _rename_keyword_argument(
     98         old_name, new_name, kwargs, function.__name__, version
     99     )
--> 100     return function(*args, **kwargs)

File ~/work/stardust/enterprise/.venv/lib/python3.11/site-packages/polars/lazyframe/frame.py:1788, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, _eager)
   1775     comm_subplan_elim = False
   1777 ldf = self._ldf.optimization_toggle(
   1778     type_coercion,
   1779     predicate_pushdown,
   (...)
   1786     _eager,
   1787 )
-> 1788 return wrap_df(ldf.collect())

PanicException: index out of bounds: the len is 0 but the index is 0

Issue description

In a deeply nested parquet folder, if I call pl.scan_parquet on the top-level directory and then apply a filter that selects no rows, collect() crashes with the exception shown in the log. If the filter selects at least one row, or no filter is applied, everything works.

Expected behavior

An empty DataFrame should be returned, not a PanicException.

Installed versions

>>> pl.show_versions()
--------Version info---------
Polars:              0.19.14
Index type:          UInt32
Platform:            Linux-6.2.0-36-generic-x86_64-with-glibc2.35
Python:              3.11.6 (main, Oct 13 2023, 14:12:02) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_sqlite:  <not installed>
cloudpickle:         <not installed>
connectorx:          <not installed>
deltalake:           <not installed>
fsspec:              <not installed>
gevent:              <not installed>
matplotlib:          <not installed>
numpy:               <not installed>
openpyxl:            <not installed>
pandas:              <not installed>
pyarrow:             <not installed>
pydantic:            <not installed>
pyiceberg:           <not installed>
pyxlsb:              <not installed>
sqlalchemy:          <not installed>
xlsx2csv:            <not installed>
xlsxwriter:          <not installed>

ddutt added the bug and python labels on Nov 19, 2023

ritchie46 (Member) commented

Thanks, but have you got a repro? E.g. some code that creates the files that lead to the panic.

ddutt (Author) commented Nov 19, 2023

I have a public repository that you can use: https://github.com/netenglabs/suzieq/tree/develop/tests/data/parquet
From the repository root, you can run pl.scan_parquet('tests/data/parquet/inventory/**/*.parquet').filter(pl.col.namespace=="foo").select(['namespace', 'hostname', 'timestamp']).collect()

cmdlineluser (Contributor) commented

It does appear to run with predicate_pushdown disabled, if that helps with debugging.

(pl.scan_parquet('tests/data/parquet/inventory/**/*.parquet')
   .filter(pl.col.namespace == 'foo')
   .select('namespace', 'hostname', 'timestamp')
   .collect(predicate_pushdown=False)
)

# shape: (0, 3)
# ┌───────────┬──────────┬───────────┐
# │ namespace ┆ hostname ┆ timestamp │
# │ ---       ┆ ---      ┆ ---       │
# │ str       ┆ str      ┆ i64       │
# ╞═══════════╪══════════╪═══════════╡
# └───────────┴──────────┴───────────┘

ddutt (Author) commented Nov 19, 2023

Thanks. I knew that it works with predicate pushdown turned off, but the resulting performance penalty is not acceptable.

ritchie46 (Member) commented

Actually, this has nothing to do with parquet. Silly bug; patch coming.
