
Limit pushdown (scan_delta) results in error if dataframe has struct data types #7627

Closed
dominikpeter opened this issue Mar 18, 2023 · 8 comments
Labels: bug (Something isn't working), python (Related to Python Polars)

dominikpeter commented Mar 18, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

The limit pushdown feature (#7570) must have introduced a bug when using struct data types.

The example results in the following error message:

thread '<unnamed>' panicked at 'The column lengths in the DataFrame are not equal.',

Reproducible example

import polars as pl
from deltalake import write_deltalake

df = pl.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
    }
)
df = df.with_columns(pl.struct([df["A"], df["B"]]).alias("AB"))

write_deltalake("data/delta/test/fruits", df.to_pandas())

df = pl.scan_delta("data/delta/test/fruits")

# no error
print(df.collect().limit(3))

# error
print(df.limit(3).collect())

Expected behavior

Same result for both approaches

Installed versions

---Version info---
Polars: 0.16.14
Index type: UInt32
Platform: macOS-13.2.1-arm64-arm-64bit
Python: 3.11.1 (v3.11.1:a7a450f84a, Dec  6 2022, 15:24:06) [Clang 13.0.0 (clang-1300.0.29.30)]
---Optional dependencies---
numpy: 1.23.4
pandas: 1.5.3
pyarrow: 11.0.0
connectorx: <not installed>
deltalake: <version not detected>
fsspec: <not installed>
matplotlib: <not installed>
xlsx2csv: <not installed>
xlsxwriter: 3.0.9
dominikpeter (Author) commented

Adding .head() to the return DataFrame solves the problem:

    if n_rows:
        pa_tbl = ds.head(n_rows, columns=with_columns, filter=_filter)
        return cast(DataFrame, pl.from_arrow(pa_tbl).head(n_rows))
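
For illustration, the same pattern as a standalone sketch (my reading of the fix, not the actual Polars source; it reuses the delta table written in the reproducible example above and deltalake's DeltaTable.to_pyarrow_dataset):

import polars as pl
from deltalake import DeltaTable

n_rows = 3

# Open the delta table from the reproducer as a pyarrow dataset.
ds = DeltaTable("data/delta/test/fruits").to_pyarrow_dataset()

# pyarrow already returns a table with the requested number of rows ...
pa_tbl = ds.head(n_rows)

# ... and the extra .head(n_rows) on the Polars side makes sure every column,
# including the struct column "AB", ends up with the same length.
print(pl.from_arrow(pa_tbl).head(n_rows))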

ritchie46 (Member) commented

That's strange. So pyarrow doesn't respect the head call?

dominikpeter (Author) commented

The shape of the table looks good. If I want 5 rows the pyarrow table shows 5 rows.

With another dataset I can produce another error message:

pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Chunk require all its arrays to have an equal number of rows")

Adding .head() to the DataFrame solves the problem for me, without any downside I am aware of at the moment.

ritchie46 (Member) commented

That's strange. Does the arrow table have a different number of chunks per column?

The extra head() call doesn't change the final size, or does it?

dominikpeter (Author) commented Mar 21, 2023

Yes, very strange.
They have the same size. In my example, I just combine two columns into a struct:

df = df.with_columns(pl.struct([df["A"], df["B"]]).alias("AB"))

By adding head(), i.e. return cast(DataFrame, pl.from_arrow(pa_tbl).head(n_rows)), it returns the expected result; otherwise I get the error.

The old approach with the arrow batches also only worked because of the head() call on the DataFrame; otherwise it would result in the same error.

ritchie46 (Member) commented

Could you explore the struct columns in pyarrow? Do the struct columns have different chunk lengths?

dominikpeter (Author) commented Mar 21, 2023

No, they are the same:

print(pa_tbl.schema)

A: int64
fruits: string
B: int64
cars: string
AB: struct<A: int64, B: int64>
  child 0, A: int64
  child 1, B: int64
# using limit 4 in the example
for c in pa_tbl.itercolumns():
    print(c)

[
  [
    1,
    2,
    3,
    4
  ]
]
[
  [
    "banana",
    "banana",
    "apple",
    "apple"
  ]
]
[
  [
    5,
    4,
    3,
    2
  ]
]
[
  [
    "beetle",
    "audi",
    "beetle",
    "beetle"
  ]
]
[
  -- is_valid: all not null
  -- child 0 type: int64
    [
      1,
      2,
      3,
      4
    ]
  -- child 1 type: int64
    [
      5,
      4,
      3,
      2
    ]
]

The pyarrow table looks totally fine to me.
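
For reference, a quick way to double-check this (a small sketch, assuming pa_tbl is the table from the snippet above):

# Print the number of chunks and each chunk's length per column,
# including the struct column "AB".
for name, column in zip(pa_tbl.column_names, pa_tbl.columns):
    print(name, column.num_chunks, [len(chunk) for chunk in column.chunks])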

romanovacca (Contributor) commented

According to the MRE, the following caused an error:

>>> print(df.limit(3).collect())

But on the current main branch it gives the following:

shape: (3, 5)
┌─────┬────────┬─────┬────────┬───────────┐
│ A   ┆ fruits ┆ B   ┆ cars   ┆ AB        │
│ --- ┆ ---    ┆ --- ┆ ---    ┆ ---       │
│ i64 ┆ str    ┆ i64 ┆ str    ┆ struct[2] │
╞═════╪════════╪═════╪════════╪═══════════╡
│ 1   ┆ banana ┆ 5   ┆ beetle ┆ {1,5}     │
│ 2   ┆ banana ┆ 4   ┆ audi   ┆ {2,4}     │
│ 3   ┆ apple  ┆ 3   ┆ beetle ┆ {3,3}     │
└─────┴────────┴─────┴────────┴───────────┘

So I think this can be closed @stinodego.
