feat(python): extend filter capabilities with new support for *args predicates, **kwargs constraints, and chained boolean masks #11740

Merged
merged 9 commits into from Oct 16, 2023


alexander-beedie
Collaborator

@alexander-beedie alexander-beedie commented Oct 14, 2023

Closes #11642, supersedes #11674.

Ended up going a little deeper down the rabbit hole so moved it into my own branch to finish off as I was having issues rebasing / force-pushing to a branch I don't own (some sort of credentials/token issue? 🤷); kudos to @mcrumiller for getting things started with the initial implementation!

Miscellaneous

  • Turns out we can cleanly disambiguate new/old usage of the predicate keyword, so a warning is only raised for old-style usage.
  • Noticed a fast-path for recognising numpy-based boolean masks (and int sequences), so integrated this too.
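To make the disambiguation idea concrete, here is a minimal sketch of one way such a check could work. It is not the code from this PR: `Expr` is a hypothetical stand-in class, `split_predicate_keyword` is an illustrative name, and the rule "an expression passed as `predicate=` means old-style usage" is an assumption about how the two usages might be told apart.

```python
from __future__ import annotations

import warnings
from typing import Any


class Expr:
    """Hypothetical stand-in for an expression object such as pl.col("a") > 0."""


def split_predicate_keyword(
    constraints: dict[str, Any],
) -> tuple[list[Any], dict[str, Any]]:
    """Separate an old-style `predicate=` argument from genuine column constraints."""
    remaining = dict(constraints)
    predicates: list[Any] = []
    if isinstance(remaining.get("predicate"), Expr):
        # Assumption: an expression value signals the old keyword usage,
        # so warn and treat it like a positional predicate.
        warnings.warn(
            "pass the predicate positionally instead of as `predicate=`",
            DeprecationWarning,
            stacklevel=2,
        )
        predicates.append(remaining.pop("predicate"))
    return predicates, remaining


# Old style: an expression passed by keyword triggers the warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_preds, old_rest = split_predicate_keyword({"predicate": Expr(), "x": 1})

# New style: a plain value is read as a constraint on a column named "predicate".
with warnings.catch_warnings(record=True) as quiet:
    warnings.simplefilter("always")
    new_preds, new_rest = split_predicate_keyword({"predicate": 0, "x": 1})

print(len(caught), len(quiet))  # 1 0
```

Under this reading, a frame that really has a column called `predicate` can still be constrained by keyword without any warning.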

New Features

Adds support for:

  • *predicates (one or more predicates as positional arguments).
  • **constraints (col == value matches as keyword arguments).
  • resolving chains of boolean masks (previously only allowed one).

Examples

Pass more than one predicate as positional arguments.
The predicates are implicitly combined with all_horizontal:

df.filter(
    pl.col("a") >= 0,
    pl.col("b").is_not_null(),
    pl.col("c").str.starts_with("!"),
)

Splat predicates assembled elsewhere straight into filter without having to combine them yourself:

df.filter( *conditions )

Constrain rows using **kwargs format (essentially the same API that we offer in with_columns):

df.filter( 
    x = 0,
    y = "xxx",
    z = date.today(),
)

Mix and match the two formats freely:

df.filter( *conditions, value=0, start_date=date.today() )

Easily filter against a Pydantic/FastAPI Model object:

from pydantic import BaseModel
from datetime import date
import polars as pl

class DataModel(BaseModel):
    x: date
    y: str
    z: int
    
model_data = [
    DataModel( x=date.today(), y="aa", z=123 ),
    DataModel( x=date.today(), y="bb", z=456 ),
    DataModel( x=date.today(), y="cc", z=789 ),
]

df = pl.DataFrame( data=model_data )
# shape: (3, 3)
# ┌────────────┬─────┬─────┐
# │ x          ┆ y   ┆ z   │
# │ ---        ┆ --- ┆ --- │
# │ date       ┆ str ┆ i64 │
# ╞════════════╪═════╪═════╡
# │ 2023-10-14 ┆ aa  ┆ 123 │
# │ 2023-10-14 ┆ bb  ┆ 456 │
# │ 2023-10-14 ┆ cc  ┆ 789 │
# └────────────┴─────┴─────┘

df.filter( **model_data[1].model_dump() )
# shape: (1, 3)
# ┌────────────┬─────┬─────┐
# │ x          ┆ y   ┆ z   │
# │ ---        ┆ --- ┆ --- │
# │ date       ┆ str ┆ i64 │
# ╞════════════╪═════╪═════╡
# │ 2023-10-14 ┆ bb  ┆ 456 │
# └────────────┴─────┴─────┘

Chain boolean masks (previously only allowed one per filter invocation):

mask1 = [True, False, True]
mask2 = [False, True, True]

df.filter( mask1, mask2 )
# shape: (1, 3)
# ┌────────────┬─────┬─────┐
# │ x          ┆ y   ┆ z   │
# │ ---        ┆ --- ┆ --- │
# │ date       ┆ str ┆ i64 │
# ╞════════════╪═════╪═════╡
# │ 2023-10-14 ┆ cc  ┆ 789 │
# └────────────┴─────┴─────┘
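The single surviving row is consistent with the masks being combined element-wise with a logical AND before rows are selected. A minimal plain-Python check of that reading (no Polars involved; `combined`, `y_values`, and `kept` are illustrative names) could look like this:

```python
mask1 = [True, False, True]
mask2 = [False, True, True]

# Element-wise AND: a row survives only if every mask keeps it.
combined = [a and b for a, b in zip(mask1, mask2)]
print(combined)  # [False, False, True]

# Applied to the "y" column of the example frame, only the last row remains.
y_values = ["aa", "bb", "cc"]
kept = [y for y, keep in zip(y_values, combined) if keep]
print(kept)  # ['cc']
```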

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Oct 14, 2023
Member

@stinodego stinodego left a comment


I think this feature will be a nice quality-of-life upgrade to an often-used method!

I wonder if we can't just use parse_as_list_of_expressions here? There seems to be a lot of duplicated parsing logic here.

py-polars/polars/dataframe/frame.py (outdated, resolved)
py-polars/polars/dataframe/frame.py (outdated, resolved)
py-polars/polars/lazyframe/frame.py (outdated, resolved)
py-polars/polars/lazyframe/frame.py (outdated, resolved)
py-polars/polars/lazyframe/frame.py (resolved)
py-polars/polars/utils/_parse_expr_input.py (outdated, resolved)
@alexander-beedie
Collaborator Author

alexander-beedie commented Oct 15, 2023

I wonder if we can't just use parse_as_list_of_expressions here?

We can't hand off to that completely, as the boolean masks need to be identified separately (update: it can be used to unify parsing of a single-list passed to *predicates without splatting though, so I smoothed that out 👌).

Member

@stinodego stinodego left a comment


Great stuff, Alex!

@alexander-beedie alexander-beedie merged commit 3372808 into pola-rs:main Oct 16, 2023
12 checks passed
@alexander-beedie alexander-beedie deleted the filter-args-kwargs branch October 16, 2023 04:41
Successfully merging this pull request may close these issues.

Add .locate method as an affordance for filtering dataframes
2 participants