
ENH: explicit filters parameter in pd.read_parquet #53212

Merged
merged 9 commits into from Aug 2, 2023

Conversation

@mrastgoo (Contributor) commented May 13, 2023

Added `filters` as a new parameter in `pd.read_parquet`.

@jorisvandenbossche changed the title from "ENH:filters parameters in pd.read_parqeut" to "ENH: explicit filters parameter in pd.read_parqeut" on May 13, 2023
@mrastgoo changed the title from "ENH: explicit filters parameter in pd.read_parqeut" to "ENH: explicit filters parameter in pd.read_parquet" on May 14, 2023
@mroeschke (Member) left a comment

Instead, since we still support both engines, I think linking to the docs for both fastparquet and pyarrow in the `**kwargs` description is sufficient: #52238 (comment)

@jorisvandenbossche (Member) commented

Given that the filters keyword is supported by both engines and works the same, I think it would certainly be useful to explicitly document this keyword, and not hide it in the description of the kwargs. We actually already do the same for the columns keyword.

@jorisvandenbossche (Member) commented

@mroeschke OK with adding this?

@mroeschke (Member) commented

Okay, yeah, that makes sense to add this then.

@mroeschke (Member) left a comment

Could you add a test for filters for both engines?

@github-actions (bot) commented

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jun 30, 2023
@jorisvandenbossche (Member) commented

@mrastgoo would you have time to add a test?

We have an existing test for the `columns` keyword (`test_read_columns`), so you could add an equivalent `test_read_filters` test case next to it. Something like:

    def test_read_filters(self, engine, tmp_path):
        df = pd.DataFrame({"int": list(range(4)), "part": list("aabb")})

        expected = pd.DataFrame({"int": [0, 1]})
        check_round_trip(
            df,
            engine,
            path=tmp_path,
            expected=expected,
            write_kwargs={"partition_cols": ["part"]},
            read_kwargs={"filters": [("string", "==", "a")], "columns": ["int"]},
            repeat=1,
        )

I used `partition_cols` to actually see the effect of the filter for fastparquet as well (to ensure it is correctly passed through). And with that, `repeat=1` is needed so that writing a second time does not add additional files to the directory (with `partition_cols`, it does not simply overwrite a single file).

@mrastgoo (Contributor, author) commented Jul 4, 2023

@jorisvandenbossche, yes, I will do it as soon as possible. Sorry for the delay on this issue, and thanks for the info; I will take it into account.

@mrastgoo (Contributor, author) commented Jul 5, 2023

I just pushed the test. I changed the following line in the suggested test:

    read_kwargs={"filters": [("part", "==", "a")], "columns": ["int"]},

The error raised for the missing column was not very clear. It is not a pandas issue, but I thought I would mention it:

    pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(string) in int: int64
    part: dictionary<values=string, indices=int32, ordered=0>

@mroeschke (Member) commented

Could you merge in main once more?

@@ -483,6 +489,7 @@ def read_parquet(
path: FilePath | ReadBuffer[bytes],
engine: str = "auto",
columns: list[str] | None = None,
filters: list[tuple] | list[list[tuple]] | None = None,
@mroeschke (Member) left a review comment

Could you put this as the last argument (before `**kwargs`)?

Using this argument will NOT result in row-wise filtering of the final
partitions unless ``engine="pyarrow"`` is also specified. For
other engines, filtering is only performed at the partition level, that is,
to prevent the loading of some row-groups and/or files.
@mroeschke (Member) left a review comment

Could you add a `.. versionadded:: 2.1.0` at the end?

@mroeschke (Member) left a comment

Also needs a whatsnew entry in 2.1.0.rst in the IO section

@mroeschke mroeschke added this to the 2.1 milestone Aug 2, 2023
@mroeschke mroeschke merged commit 7cbf949 into pandas-dev:main Aug 2, 2023
34 of 37 checks passed
@mroeschke (Member) commented

Thanks @mrastgoo

Labels
Docs IO Parquet parquet, feather
Development

Successfully merging this pull request may close these issues.

DOC: Document the filters argument in read_parquet
3 participants