ENH: explicit filters parameter in pd.read_parquet #53212
Instead since we still support both engines, I think linking to the docs for both fastparquet and pyarrow in **kwargs is sufficient #52238 (comment)
Given that the |
@mroeschke OK with adding this?
Okay yeah that makes sense to add this then
Could you add a test for filters for both engines?
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
@mrastgoo would you have time to add a test? We have an existing test for the columns keyword (
I used partition cols to actually see the effect of the filter also for fastparquet (to ensure it is correctly passed through). And with that, the |
@jorisvandenbossche, yes, I will do it ASAP. Sorry for the delay on this issue, and thanks for the info; I will take that into account.
I just pushed the test.
The error raised for missing columns was not very clear. That is not a pandas issue, but I thought I would mention it.
Could you merge in main once more?
pandas/io/parquet.py
Outdated
@@ -483,6 +489,7 @@ def read_parquet(
    path: FilePath | ReadBuffer[bytes],
    engine: str = "auto",
    columns: list[str] | None = None,
    filters: list[tuple] | list[list[tuple]] | None = None,
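For readers unfamiliar with the `list[tuple] | list[list[tuple]]` shape: parquet filters are conventionally written in disjunctive normal form, where a flat list of `(column, op, value)` tuples is ANDed together and a list of such lists is ORed. The snippet below is only a pure-Python illustration of that convention, not code from pandas, pyarrow, or fastparquet; `row_matches`, `OPS`, and the sample rows are hypothetical names made up for this sketch.

```python
import operator

# Comparison operators commonly accepted in parquet filter tuples.
OPS = {
    "==": operator.eq,
    "=": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def row_matches(row, filters):
    """Hypothetical helper: evaluate DNF filters against one row (a dict)."""
    if not filters:
        return True  # no filters means every row is kept
    # A flat list of tuples is a single AND group; normalise to a list of lists.
    if isinstance(filters[0], tuple):
        filters = [filters]
    # Outer list: OR across groups. Inner list: AND within a group.
    return any(
        all(OPS[op](row[column], value) for column, op, value in group)
        for group in filters
    )


rows = [
    {"year": 2021, "country": "NL"},
    {"year": 2022, "country": "NL"},
    {"year": 2023, "country": "US"},
]

# list[tuple]: year >= 2022 AND country == "NL"
and_filter = [("year", ">=", 2022), ("country", "==", "NL")]
# list[list[tuple]]: year < 2022 OR country in ["US"]
or_filter = [[("year", "<", 2022)], [("country", "in", ["US"])]]

and_rows = [r for r in rows if row_matches(r, and_filter)]
or_rows = [r for r in rows if row_matches(r, or_filter)]
```

The same two filter values could be passed as `filters=` to `pd.read_parquet`; how strictly they are applied depends on the engine, as the docstring in this PR notes.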
Could you put this as the last argument (before **kwargs)?
pandas/io/parquet.py
Outdated
Using this argument will NOT result in row-wise filtering of the final
partitions unless ``engine="pyarrow"`` is also specified. For
other engines, filtering is only performed at the partition level, that is,
to prevent the loading of some row-groups and/or files.
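The distinction the docstring draws between partition-level and row-wise filtering can be shown with a toy model. This is a conceptual sketch only: the nested lists stand in for row groups or partition files, `max(group)` stands in for stored statistics, and `read_partition_level` and `read_row_wise` are hypothetical names, not functions of any parquet engine.

```python
# Each inner list stands in for one parquet row group (or partition file).
row_groups = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]


def read_partition_level(groups, lower):
    """Skip whole groups that cannot satisfy `value > lower`; keep the rest intact."""
    out = []
    for group in groups:
        # Stand-in for checking row-group statistics before loading the group.
        if max(group) > lower:
            out.extend(group)
    return out


def read_row_wise(groups, lower):
    """Prune groups first, then also drop individual rows that fail the filter."""
    return [v for v in read_partition_level(groups, lower) if v > lower]


partition_only = read_partition_level(row_groups, 4)
row_wise = read_row_wise(row_groups, 4)

print(partition_only)  # [4, 5, 6, 7, 8, 9]
print(row_wise)        # [5, 6, 7, 8, 9]
```

In the partition-level result the value 4 survives because its group was loaded as a whole; only the row-wise pass removes it. That is the behaviour difference a caller should expect between engines according to the docstring above.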
Could you add a .. versionadded:: 2.1.0 at the end?
Also needs a whatsnew entry in 2.1.0.rst in the IO section
Thanks @mrastgoo
doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
Added filters as a new parameter in pd.read_parquet.