Limit pushdown (scan_delta) results in error if dataframe has struct data types #7627
Comments
Adding
if n_rows:
    pa_tbl = ds.head(n_rows, columns=with_columns, filter=_filter)
    return cast(DataFrame, pl.from_arrow(pa_tbl).head(n_rows))
That's strange. So pyarrow doesn't respect the head call?
The shape of the table looks good. If I want 5 rows the pyarrow table shows 5 rows. With another dataset I can produce another error message:
Adding |
That's strange. Does the arrow table have a different number of chunks per column? The extra
Yes, very strange.
By adding The old approach with the arrow batches also only worked because there was the |
Could you explore the struct columns in pyarrow? Do the struct columns have different chunk lengths?
No, they are the same:
print(pa_tbl.schema)
A: int64
fruits: string
B: int64
cars: string
AB: struct<A: int64, B: int64>
child 0, A: int64
child 1, B: int64
# using limit 4 in the example
for c in pa_tbl.itercolumns():
print(c)
[
[
1,
2,
3,
4
]
]
[
[
"banana",
"banana",
"apple",
"apple"
]
]
[
[
5,
4,
3,
2
]
]
[
[
"beetle",
"audi",
"beetle",
"beetle"
]
]
[
-- is_valid: all not null
-- child 0 type: int64
[
1,
2,
3,
4
]
-- child 1 type: int64
[
5,
4,
3,
2
]
]
The pyarrow table looks totally fine to me.
According to the MRE, the following caused an error:
But on the current main branch it gives the following:
So I think this can be closed, @stinodego.
Polars version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Issue description
The limit pushdown feature (#7570) must have introduced a bug when using struct data types.
The example results in the following error message:
thread '<unnamed>' panicked at 'The column lengths in the DataFrame are not equal.',
Reproducible example
Expected behavior
Same result for both approaches
Installed versions