Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(python): additional multi-column support for pl.<function> entries #13336

Merged
merged 1 commit into from
Jan 10, 2024

Conversation

alexander-beedie
Copy link
Collaborator

@alexander-beedie alexander-beedie commented Dec 30, 2023

All of the following pl.<function> entries have been extended with support for multiple column names (and remain backwards compatible with existing single-column calling syntax, so there is no breakage):

  • approx_n_unique,
  • count
  • first
  • implode
  • last
  • mean
  • median
  • n_unique

In addition, all functions in the updated module (and vertical aggregations) can now take bare pl.col("name") expressions as well as "name" strings, bringing them a little closer to more generic functions such as pl.all_horizontal.

Added new docstring examples for each, and extended test coverage accordingly.

Examples

import polars as pl

df = pl.DataFrame({
    "a": [1, 2, None],
    "b": [9, 8, 7],
    "c": ["foo", "bar", "foo"],
})
df.select( pl.count("a", "b") )

Before:

TypeError: 
  count() takes from 0 to 1 positional arguments but 2 were given

After:

shape: (1, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 2   ┆ 3   │
└─────┴─────┘

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Dec 30, 2023
@alexander-beedie alexander-beedie changed the title feat(python): additional multi-column support for pl.<func> functions feat(python): additional multi-column support for pl.<function> entries Dec 30, 2023
Copy link
Member

@stinodego stinodego left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been meaning to add this support, so having this is nice!

With regards to the implementation: we already have the parse_as_expression / parse_as_list_of_expressions utils. Those should be utilized here.

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Dec 31, 2023

With regards to the implementation: we already have the parse_as_expression / parse_as_list_of_expressions utils. Those should be utilized here.

Nope; there's a reason I did this way - it keeps the return type and usage completely seamless, while still expanding it to handle multiple columns (including bare col exprs as a bonus, now that they can be trivially identified using the new expr.meta.is_column method) 😉

Using parse_as_list_of_expressions instead would break all follow-on usage (broadcast, chaining, etc).

Before and after this PR, all the enhanced functions continue to return a pl.col(…) expression, so...

✅ can chain
✅ can alias
✅ can broadcast
✅ can drop-in to existing code

Returning a list of expressions instead results in...

❌ cannot chain
❌ cannot alias
❌ cannot broadcast
❌ cannot always drop-in to existing code

...which would make these functions dramatically less seamless/useful; if they're going to return a list of expressions, you may as well just write and use those expressions directly instead of using these convenience methods.

We would need a new first-class primitive (something like pl.exprs(…)) that can bundle a series of expressions and continue to broadcast/chain/etc to them before supporting that here. This has been asked for before1, but until it exists these functions should continue to return pl.col(…) for maximum utility/compatibility with existing usage.

Footnotes

  1. https://github.com/pola-rs/polars/issues/12262

@stinodego
Copy link
Member

Right, that makes a lot of sense. Using parse_as_list_of_expressions isn't the way to go here.

However, between the existing expression parsing utils, the column selectors, and this new parsing utility, it feels like our input parsing is becoming way too disjointed. I don't really feel comfortable adding more complexity before we address some of this technical debt.

I also really dislike allowing some expressions but not others. That goes against the idea of expressions being composable. col("a") should not be treated differently than col("a") * 2.

To continue improvements to input parsing I think addressing #10594 would be most important as well as #12262 for this specific case.

@alexander-beedie
Copy link
Collaborator Author

alexander-beedie commented Dec 31, 2023

I also really dislike allowing some expressions but not others. That goes against the idea of expressions being composable. col("a") should not be treated differently than col("a") * 2.

If we want to keep it as a pure extension of what is currently there then we can omit the "added value" of supporting pl.col objects and just support multiple names; then there is no additional input-parsing complexity and we can drop normalise_column_names while retaining the primary benefit of supporting multiple col names.

I'm completely fine with that; supporting pl.col but not generic expressions does feel like an awkward half-measure (and possibly sets a bad precedent). We can further enhance these functions with more generic support as & when we have a pl.exprs bundled-expression option 🤔

@stinodego
Copy link
Member

I'm all in favor of accepting multiple columns where possible - indeed let's just take strings for now, and we'll revisit when we have a better way to bundle expressions.

@alexander-beedie alexander-beedie force-pushed the multicolumn-lazy-funcs branch 2 times, most recently from 5952d03 to d7a675c Compare January 4, 2024 12:02
@alexander-beedie
Copy link
Collaborator Author

indeed let's just take strings for now, and we'll revisit when we have a better way to bundle expressions.

Done ;)

Copy link
Member

@stinodego stinodego left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we accept that people should not use keyword arguments in these shortcut functions, the whole thing can be a lot simpler, I believe. Curious what you think.

py-polars/polars/utils/various.py Outdated Show resolved Hide resolved
py-polars/polars/functions/lazy.py Outdated Show resolved Hide resolved
py-polars/polars/functions/lazy.py Show resolved Hide resolved
py-polars/polars/functions/lazy.py Outdated Show resolved Hide resolved
py-polars/polars/functions/lazy.py Outdated Show resolved Hide resolved
@alexander-beedie
Copy link
Collaborator Author

Curious what you think.

The two-arg approach matches what we have in select & co., so it's consistent with what we do elsewhere; however I agree that a single *columns param is better. If you're ok with it I'll make the change; leads directly to a cleaner implementation (might need a new deprecation decorator going arg -> *arg; will see if what we have is sufficient) 🤔

Copy link
Member

@stinodego stinodego left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great - this is going to be very useful!

@ritchie46 ritchie46 merged commit 43a01a5 into pola-rs:main Jan 10, 2024
14 checks passed
@alexander-beedie alexander-beedie deleted the multicolumn-lazy-funcs branch January 10, 2024 10:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature python Related to Python Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants