
Limit pushdown (scan_delta) results in error if dataframe has struct data types #7627

Closed
dominikpeter opened this issue Mar 18, 2023 · 8 comments
Labels: bug (Something isn't working), python (Related to Python Polars)

dominikpeter commented Mar 18, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

The limit pushdown feature (#7570) must have introduced a bug when using struct data types.

The example results in the following error message:

thread '<unnamed>' panicked at 'The column lengths in the DataFrame are not equal.',

Reproducible example

import polars as pl
from deltalake import write_deltalake

df = pl.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
    }
)
df = df.with_columns(pl.struct([df["A"], df["B"]]).alias("AB"))

write_deltalake("data/delta/test/fruits", df.to_pandas())

df = pl.scan_delta("data/delta/test/fruits")

# no error
print(df.collect().limit(3))

# error
print(df.limit(3).collect())

Expected behavior

Same result for both approaches

Installed versions

---Version info---
Polars: 0.16.14
Index type: UInt32
Platform: macOS-13.2.1-arm64-arm-64bit
Python: 3.11.1 (v3.11.1:a7a450f84a, Dec  6 2022, 15:24:06) [Clang 13.0.0 (clang-1300.0.29.30)]
---Optional dependencies---
numpy: 1.23.4
pandas: 1.5.3
pyarrow: 11.0.0
connectorx: <not installed>
deltalake: <version not detected>
fsspec: <not installed>
matplotlib: <not installed>
xlsx2csv: <not installed>
xlsxwriter: 3.0.9
dominikpeter (Author) commented

Adding .head() to the return DataFrame solves the problem:

    if n_rows:
        pa_tbl = ds.head(n_rows, columns=with_columns, filter=_filter)
        return cast(DataFrame, pl.from_arrow(pa_tbl).head(n_rows))
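
For illustration, the same pattern as a standalone sketch (my reading of the fix, not the actual Polars source; it reuses the delta table written in the reproducible example above and deltalake's DeltaTable.to_pyarrow_dataset):

import polars as pl
from deltalake import DeltaTable

n_rows = 3

# Open the delta table from the reproducer as a pyarrow dataset.
ds = DeltaTable("data/delta/test/fruits").to_pyarrow_dataset()

# pyarrow already returns a table with the requested number of rows ...
pa_tbl = ds.head(n_rows)

# ... and the extra .head(n_rows) on the Polars side makes sure every column,
# including the struct column "AB", ends up with the same length.
print(pl.from_arrow(pa_tbl).head(n_rows))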

ritchie46 (Member) commented

That's strange. So pyarrow doesn't respect the head call?

dominikpeter (Author) commented

The shape of the table looks good. If I want 5 rows the pyarrow table shows 5 rows.

With another dataset I can produce another error message:

pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Chunk require all its arrays to have an equal number of rows")

Adding .head() to the DataFrame solves the problem for me, without any downside I am aware of at the moment.

ritchie46 (Member) commented

That's strange. Does the arrow table have a different number of chunks per column?

The extra head() call doesn't change the final size, or does it?

dominikpeter (Author) commented Mar 21, 2023

Yes, very strange.
They have the same size. In my example, I just combine two columns into a struct:

df = df.with_columns(pl.struct([df["A"], df["B"]]).alias("AB"))

By adding head(), i.e. return cast(DataFrame, pl.from_arrow(pa_tbl).head(n_rows)), it returns the expected result; otherwise I get the error.

The old approach with the arrow batches also only worked because of the head() call on the DataFrame; otherwise it would result in the same error.

ritchie46 (Member) commented

Could you explore the struct columns in pyarrow? Do the struct columns have different chunk lengths?

dominikpeter (Author) commented Mar 21, 2023

No, they are the same:

print(pa_tbl.schema)

A: int64
fruits: string
B: int64
cars: string
AB: struct<A: int64, B: int64>
  child 0, A: int64
  child 1, B: int64
# using limit 4 in the example
for c in pa_tbl.itercolumns():
    print(c)

[
  [
    1,
    2,
    3,
    4
  ]
]
[
  [
    "banana",
    "banana",
    "apple",
    "apple"
  ]
]
[
  [
    5,
    4,
    3,
    2
  ]
]
[
  [
    "beetle",
    "audi",
    "beetle",
    "beetle"
  ]
]
[
  -- is_valid: all not null
  -- child 0 type: int64
    [
      1,
      2,
      3,
      4
    ]
  -- child 1 type: int64
    [
      5,
      4,
      3,
      2
    ]
]

The pyarrow table looks totally fine to me.
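
For reference, a quick way to double-check this (a small sketch, assuming pa_tbl is the table from the snippet above):

# Print the number of chunks and each chunk's length per column,
# including the struct column "AB".
for name, column in zip(pa_tbl.column_names, pa_tbl.columns):
    print(name, column.num_chunks, [len(chunk) for chunk in column.chunks])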

romanovacca (Contributor) commented

According to the MRE, the following caused an error:

>>> print(df.limit(3).collect())

But on the current main branch it gives the following:

shape: (3, 5)
┌─────┬────────┬─────┬────────┬───────────┐
│ A   ┆ fruits ┆ B   ┆ cars   ┆ AB        │
│ --- ┆ ---    ┆ --- ┆ ---    ┆ ---       │
│ i64 ┆ str    ┆ i64 ┆ str    ┆ struct[2] │
╞═════╪════════╪═════╪════════╪═══════════╡
│ 1   ┆ banana ┆ 5   ┆ beetle ┆ {1,5}     │
│ 2   ┆ banana ┆ 4   ┆ audi   ┆ {2,4}     │
│ 3   ┆ apple  ┆ 3   ┆ beetle ┆ {3,3}     │
└─────┴────────┴─────┴────────┴───────────┘

So I think this can be closed @stinodego.
