
ENH: explicit filters parameter in pd.read_parquet #53212

Merged
merged 9 commits into from Aug 2, 2023

Conversation

@mrastgoo (Contributor) commented May 13, 2023

Added `filters` as a new parameter in `pd.read_parquet`.

@jorisvandenbossche changed the title from "ENH:filters parameters in pd.read_parqeut" to "ENH: explicit filters parameter in pd.read_parqeut" on May 13, 2023
@mrastgoo changed the title from "ENH: explicit filters parameter in pd.read_parqeut" to "ENH: explicit filters parameter in pd.read_parquet" on May 14, 2023
@mroeschke (Member) left a comment

Instead, since we still support both engines, I think linking to the docs for both fastparquet and pyarrow in the `**kwargs` description is sufficient: #52238 (comment)

@jorisvandenbossche (Member) commented

Given that the filters keyword is supported by both engines and works the same, I think it would certainly be useful to explicitly document this keyword, and not hide it in the description of the kwargs. We actually already do the same for the columns keyword.

@jorisvandenbossche (Member) commented

@mroeschke OK with adding this?

@mroeschke (Member) commented

Okay, yeah, that makes sense to add this then.

@mroeschke (Member) left a comment

Could you add a test for filters for both engines?

@github-actions (bot) commented

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jun 30, 2023
@jorisvandenbossche (Member) commented

@mrastgoo would you have time to add a test?

We have an existing test for the `columns` keyword (`test_read_columns`), so you could add an equivalent `test_read_filters` test case next to it. Something like:

    def test_read_filters(self, engine, tmp_path):
        df = pd.DataFrame({"int": list(range(4)), "part": list("aabb")})

        expected = pd.DataFrame({"int": [0, 1]})
        check_round_trip(
            df,
            engine,
            path=tmp_path,
            expected=expected,
            write_kwargs={"partition_cols": ["part"]},
            read_kwargs={"filters": [("string", "==", "a")], "columns": ["int"]},
            repeat=1,
        )

I used `partition_cols` to actually see the effect of the filter for fastparquet as well (to ensure it is correctly passed through). And with that, `repeat=1` is needed so that writing a second time does not add additional files to the directory (with `partition_cols`, it does not simply overwrite a single file).

@mrastgoo (Contributor, author) commented Jul 4, 2023

@jorisvandenbossche, yes, I will do it as soon as possible. Sorry for the delay on this issue, and thanks for the info; I will take it into account.

@mrastgoo (Contributor, author) commented Jul 5, 2023

I just pushed the test. I changed the following line in the suggested test:

    read_kwargs={"filters": [("part", "==", "a")], "columns": ["int"]},

The error raised for the missing column was not very clear. It is not a pandas issue, but I thought I would mention it:

    pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(string) in int: int64
    part: dictionary<values=string, indices=int32, ordered=0>

@mroeschke (Member) commented

Could you merge in main once more?

@@ -483,6 +489,7 @@ def read_parquet(
path: FilePath | ReadBuffer[bytes],
engine: str = "auto",
columns: list[str] | None = None,
filters: list[tuple] | list[list[tuple]] | None = None,
@mroeschke (Member) left a review comment

Could you put this as the last argument (before `**kwargs`)?

Using this argument will NOT result in row-wise filtering of the final
partitions unless ``engine="pyarrow"`` is also specified. For
other engines, filtering is only performed at the partition level, that is,
to prevent the loading of some row-groups and/or files.
@mroeschke (Member) left a review comment

Could you add a `.. versionadded:: 2.1.0` at the end?

@mroeschke (Member) left a comment

Also needs a whatsnew entry in 2.1.0.rst in the IO section

@mroeschke mroeschke added this to the 2.1 milestone Aug 2, 2023
@mroeschke mroeschke merged commit 7cbf949 into pandas-dev:main Aug 2, 2023
34 of 37 checks passed
@mroeschke (Member) commented

Thanks @mrastgoo

Labels
Docs IO Parquet parquet, feather
Development

Successfully merging this pull request may close these issues.

DOC: Document the filters argument in read_parquet
3 participants