ENH: explicit filters parameter in pd.read_parquet #53212
mroeschke
left a comment
Instead, since we still support both engines, I think linking to the docs for both fastparquet and pyarrow in **kwargs is sufficient (see #52238 (comment)).

Given that the

@mroeschke OK with adding this?

Okay, yeah, that makes sense to add this then.
mroeschke
left a comment
Could you add a test for filters for both engines?

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@mrastgoo would you have time to add a test? We have an existing test for the columns keyword ( I used partition cols to actually see the effect of the filter also for fastparquet (to ensure it is correctly passed through). And with that, the

@jorisvandenbossche, yes, I will do it ASAP. Sorry for the delay on this issue, and thanks for the info; I will take it into account.

I just pushed the test. The error that was raised for missing columns was not very clear; that is not a pandas issue, but I thought I'd mention it.

Could you merge in main once more?
pandas/io/parquet.py (outdated)
    path: FilePath | ReadBuffer[bytes],
    engine: str = "auto",
    columns: list[str] | None = None,
    filters: list[tuple] | list[list[tuple]] | None = None,
Could you put this as the last argument (before **kwargs)?
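
For reference, the annotation in the hunk above admits two shapes: a flat list of tuples, or a list of lists of tuples. The sketch below is a pure-Python toy evaluator for the disjunctive-normal-form convention these shapes follow (inner tuples are ANDed, outer lists are ORed). The names `matches` and `_OPS` are hypothetical and not part of pandas, pyarrow, or fastparquet, and only a handful of operators are modeled.

```python
import operator

# Hypothetical operator table; only a few comparison operators are modeled.
_OPS = {
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
    "in": lambda value, options: value in options,
}


def matches(row, filters):
    """Return True if ``row`` (a dict) satisfies ``filters`` given in DNF form."""
    if not filters:
        # No filters (None or empty list): every row passes.
        return True
    # A flat list of tuples is treated as a single AND group.
    if isinstance(filters[0], tuple):
        filters = [filters]
    # Outer list: OR over groups. Inner list: AND over (column, op, value).
    return any(
        all(_OPS[op](row[col], val) for col, op, val in group)
        for group in filters
    )


rows = [
    {"part": "a", "value": 1},
    {"part": "a", "value": 5},
    {"part": "b", "value": 9},
]

flat = [("part", "==", "a"), ("value", ">", 2)]        # part == "a" AND value > 2
nested = [[("part", "==", "b")], [("value", "<", 2)]]  # part == "b" OR value < 2

print([r["value"] for r in rows if matches(r, flat)])    # [5]
print([r["value"] for r in rows if matches(r, nested)])  # [1, 9]
```

This only illustrates how the two accepted shapes are interpreted; the engines evaluate filters against file metadata and column data rather than Python dicts.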
pandas/io/parquet.py (outdated)
    Using this argument will NOT result in row-wise filtering of the final
    partitions unless ``engine="pyarrow"`` is also specified. For
    other engines, filtering is only performed at the partition level, that is,
    to prevent the loading of some row-groups and/or files.
Could you add a .. versionadded:: 2.1.0 at the end?
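
To make the distinction in that docstring concrete, here is a toy model in plain Python. It is not how pyarrow or fastparquet is implemented. Each "row group" is just a list of values, and the function names are hypothetical. Group-level filtering skips whole groups whose statistics rule out a match, while row-wise filtering also drops the non-matching rows inside the groups that were kept.

```python
# Toy model: each "row group" is a plain list of values.
row_groups = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]


def group_may_match(group, threshold):
    """Use the group's max statistic to decide if any row can exceed threshold."""
    return max(group) > threshold


def read_group_level(groups, threshold):
    """Skip whole groups only; rows in kept groups are returned unfiltered."""
    return [v for g in groups if group_may_match(g, threshold) for v in g]


def read_row_wise(groups, threshold):
    """Skip groups, then also drop individual rows that fail the predicate."""
    return [v for v in read_group_level(groups, threshold) if v > threshold]


print(read_group_level(row_groups, 4))  # [4, 5, 6, 7, 8, 9]
print(read_row_wise(row_groups, 4))     # [5, 6, 7, 8, 9]
```

In the group-level result the value 4 survives even though it fails the predicate `value > 4`, which is the behavior the docstring warns about for engines that do not filter row-wise.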
mroeschke
left a comment
Also needs a whatsnew entry in 2.1.0.rst in the IO section

Thanks @mrastgoo
doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
Added filters as a new parameter in pd.read_parquet.