scan_pyarrow_dataset not filtering on partitions #16300
Comments
You can check the plan with df.explain. You should see the filter being pushed down into the scan as a pyarrow compute expression. If the plan does show pushed-down pyarrow compute expressions, that points to an issue in pyarrow, where filters are not being converted to partition filters.
Yes, we just pass the predicates to pyarrow. So I think this should be taken upstream.
I don't think the issue is with pyarrow, as when running I suspect the issue is the predicates are not being passed in to The query plan looks correct to me however from the output of
So filtering on non-date/datetime columns works, see below. Run this code as-is:

import polars as pl

df = pl.DataFrame(
    {
        "foo": [1, 2, 3],
        "bar": [1, 2, 3],
        "baz": [1, 2, 3],
    },
    schema={"foo": pl.Int64, "bar": pl.Date, "baz": pl.Int64},
)
df.write_delta(
    "test_table_scan",
    mode="overwrite",
    delta_write_options={"partition_by": ["foo", "bar"], "engine": "rust"},
    overwrite_schema=True,
)
print(
    pl.scan_delta("test_table_scan").filter(pl.col("foo") == 2).collect()
)

However, a predicate that contains a date or datetime breaks the predicate pushdown into pyarrow. Similar issue: #16248

import polars as pl

df = pl.DataFrame(
    {
        "foo": [1, 2, 3],
        "bar": [1, 2, 2],
        "baz": [1, 2, 3],
    },
    schema={"foo": pl.Int64, "bar": pl.Date, "baz": pl.Int64},
)
df.write_delta(
    "test_table_scan",
    mode="overwrite",
    delta_write_options={"partition_by": ["foo", "bar"], "engine": "rust"},
    overwrite_schema=True,
)
print(
    pl.scan_delta("test_table_scan")
    .filter(pl.col("foo") == 2, pl.col("bar") == pl.date(1970, 1, 3))
    .collect()
)
Seems like the pushdown is not working when it includes date/datetimes @ritchie46
This issue is related: #11152
Thank you very much for the replies! Out of curiosity, what exactly is it about dates that breaks the predicate pushdown? This would be a very nice feature to have as it makes
Checks
Reproducible example
Log output
No response
Issue description
I have a large dataset on S3 consisting of a large number of .arrow files. We are using directory partitioning by an integer id and a date, which looks like this:
We are using pyarrow to write the entirety of this dataset. On the read side polars is much preferred because of its expressiveness. I want to use the scan_pyarrow_dataset function in order to read and perform filtering with predicate pushdown. However, it seems that polars is not filtering out the partitions defined in the polars query. When I run using pyarrow it takes less than a second to read in the data of a single file, but when I use polars scan_pyarrow_dataset, this never completes and hangs forever. I am assuming this is because it is not actually filtering out the partitions and is trying to read in everything.
Expected behavior
I would expect this to filter out the irrelevant partitions from the reads, and push any predicates down to the scan level just as pyarrow does, but that does not seem to be the case.
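For readers unfamiliar with the term, the sketch below shows in plain Python what "filtering out the irrelevant partitions" means for a directory layout keyed by an integer id and a date. The paths and the `<id>/<date>/<file>` structure are hypothetical, since the actual layout is not shown in the issue, and this is not how pyarrow or polars implement pruning internally.

```python
from datetime import date
from pathlib import PurePosixPath

# Hypothetical file listing laid out as <root>/<id>/<date>/<file>.
FILES = [
    "dataset/1/2024-01-01/part-0.arrow",
    "dataset/1/2024-01-02/part-0.arrow",
    "dataset/2/2024-01-01/part-0.arrow",
]


def partition_values(path, root="dataset"):
    """Parse the (id, date) partition key from the directory names."""
    rel = PurePosixPath(path).relative_to(root)
    return int(rel.parts[0]), date.fromisoformat(rel.parts[1])


def prune(paths, wanted_id, wanted_day):
    """Keep only files whose partition key matches the predicate.

    The decision uses the path alone, so pruned files are never opened.
    """
    kept = []
    for p in paths:
        pid, pday = partition_values(p)
        if pid == wanted_id and pday == wanted_day:
            kept.append(p)
    return kept


selected = prune(FILES, wanted_id=1, wanted_day=date(2024, 1, 2))
print(selected)  # ['dataset/1/2024-01-02/part-0.arrow']
```

With pruning in place, only the single matching file would be read; without it, every file under the root has to be opened and filtered row by row, which matches the hang described above.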
Installed versions