Make selectors available under pl.cols
#13757
Comments
This is fabulous! Besides avoiding the need for an extra import, it makes the important functionality in the selectors module more understandable for new users. One potential counter-argument is that having
and so on. I still really like the idea of changing to |
@stinodego: Thanks for bringing this back to the fore! I've gone back and forth on this one but I think, on balance, we should probably do it, but... if we do, we should do two other things at the same time:
I do not think we should reach a point (outside of a deprecation period) where we have both I've already found it tricky to fully explain why selectors and col expressions are not the same thing and should not be merged into the expressions API, despite them being related (and one using the other under the hood in the current implementation). Having versions of So I conditionally support it, assuming we move to a much cleaner separation of responsibilities between Note that we could make the Python-side |
I was mulling this over while lying in bed last night (life of a Polars maintainer 😬), and maybe it can be even simpler: why don't we make everything available under
A selector is an expression that references existing columns and does not modify them. This gives it some special properties over non-selector expressions:
# Original uses of `col` - can be kept intact
pl.col("a")
pl.col("a", "b")
pl.col(pl.String)
pl.col(pl.String, pl.Datetime)
# Selector functionality
pl.col.by_name("a", "b")
pl.col.string()
pl.col.first()
...
# Attribute short syntax
pl.col.MyColumn
pl.col.start_date
...
# Combine them
pl.col("a") & pl.col.string()
pl.col("a") | pl.col("b")
pl.col.MyColumn | pl.col.string()
...
The only thing that's iffy here is that the attribute short syntax competes with the selectors now. That is unfortunate, but seeing as we have clearly documented that syntax as non-idiomatic, I think it's OK. You still have the parentheses to tell them apart, e.g.:
pl.col.time # Selector for "time" column name
pl.col.time() # Selector for Time data type
I think this would cleanly unify things. |
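To make the attribute-versus-call distinction above concrete, here is a minimal pure-Python sketch. It does not use Polars, and `ColNamespace`, `ByName` and `ByDtype` are hypothetical names invented for this illustration. As a simplification, calling any attribute produces a data type selector, whereas a real design would presumably only allow known selector names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ByDtype:
    """Toy stand-in for 'select all columns of this data type'."""

    dtype: str


@dataclass(frozen=True)
class ByName:
    """Toy stand-in for 'select the column with this name'."""

    name: str

    def __call__(self) -> ByDtype:
        # Adding parentheses switches the meaning from a column name
        # to a data type selector of the same name.
        return ByDtype(self.name)


class ColNamespace:
    """Hypothetical `col` object; an assumption, not the Polars implementation."""

    def __getattr__(self, name: str) -> ByName:
        # Only reached for attributes that are not found the normal way.
        return ByName(name)


col = ColNamespace()
print(col.time)    # ByName(name='time')
print(col.time())  # ByDtype(dtype='time')
```

The sketch also shows the cost mentioned above: a column that happens to share its name with a selector can only be told apart from that selector by the parentheses.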
I was just about to post this but then Stijn pre-empted me with a similar solution ;) If I'm understanding correctly, all the properties that make selectors unique (like set ops, being able to be passed to functions like I'd argue that a lot of the complications we're discussing in this thread stem from the fact that expressions can represent sets of columns that are actually in a dataframe, but they can also represent transformations of those columns. One option is to replace selectors with a subtype of expression that represents columns that are actually in a dataframe, which could for instance be called
The reason for having both
|
@alexander-beedie if you introduce pl.cols being the only one that can do multi cols, why would one ever use pl.col still? :) |
The conflict between set operations and boolean operations is something Stijn and I haven't discussed yet in this thread. It's important to be able to support both bitwise operations (especially for boolean columns) and set operations. Unfortunately, my original proposal of using
On another note, I really like this suggestion from Stijn:
pl.col.time # Selector for "time" column name
pl.col.time() # Selector for Time data type
As Stijn points out, that would avoid the need for having both |
Oof... I strongly disagree with the various suggestion(s) to blend everything into a single "it does all of the things" object 🤣 One of the reasons selectors took off is precisely because they separated themselves out from Selectors are based on expressions but are not exactly the same as expressions; they represent a late-bound set of potential expressions that are not resolved until evaluation. As such they also use operators in set context (unlike non-selector expressions): Selectors:
Expressions use these operators very differently: Conflating expressions and selectors into a single object means you now have to think more carefully about every usage of an operator to know what you're going to get. And then there's just the plain meaning of the word; having an object called "col" (singular) offering up lots of different ways to get multiple columns (plural) is just... odd. There isn't anything much simpler than:
Blending everything together looks like a recipe for edge-cases and lack of clarity 😅 |
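The difference between the two operator contexts can be sketched without Polars at all. The snippet below is a toy model under stated assumptions: `frame` is a plain dict standing in for a DataFrame, and `selector_or` and `expression_or` are hypothetical helper names, not library functions.

```python
# Toy frame: column name -> values (a plain dict, not a Polars DataFrame).
frame = {"a": [True, True, False], "b": [False, False, True]}


def selector_or(frame, left_names, right_names):
    """Set context: `|` yields the union of the selected column names."""
    wanted = set(left_names) | set(right_names)
    # Keep the frame's own column order, as a selector expansion would.
    return [name for name in frame if name in wanted]


def expression_or(frame, left_name, right_name):
    """Expression context: `|` yields one column of element-wise ORs."""
    return [x or y for x, y in zip(frame[left_name], frame[right_name])]


print(selector_or(frame, ["a"], ["b"]))  # ['a', 'b']
print(expression_or(frame, "a", "b"))    # [True, True, True]
```

The same symbol produces two columns in one context and a single combined column in the other, which is the ambiguity a merged object would have to resolve on every operator use.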
@alexander-beedie Agreed that the set vs boolean operations thing is crucial. If Similar to how we currently have It's definitely not as clean as |
Thanks @alexander-beedie, this definitely helps me understand the concept of selectors better. Again, things are not as simple as they seem at first.
Selectors literally are expressions, though. This has surprised me before when I did an
So they are expressions but override certain behavior to make the composability work. This leads to unexpected behavior as sometimes they behave like expressions and sometimes they don't:
import polars as pl
import polars.selectors as cs
df = pl.DataFrame({"a": [True, True, False], "b": [False, False, True]})
expr1 = pl.col("a")
expr2 = pl.col("b")
expr3 = cs.last()
# Equivalent
df.select(expr1, expr2)
df.select(expr1, expr3)
# Also equivalent
df.select(expr1.sum(), expr2.sum())
df.select(expr1.sum(), expr3.sum())
# Not equivalent ?!
df.select((expr1 | expr2).sum())
df.select((expr1 | expr3).sum())
I would argue this is problematic. We import and use the selector separately, but later on in my program, I don't really know whether I am dealing with a selector or with an expression. I see two ways to solve this:
If we were to go with 1, then they should be under I believe we must solve this before implementing things in Rust, as Rust is not as lenient with boundaries as Python is, and I think selectors currently operate in a grey area. |
Indeed; more than anything I wanted to draw attention to that :) They are expressions when it comes to the class hierarchy (a necessary part of the current implementation), but functionally speaking they are not "just" expressions. Might be tricky to segregate them completely from expressions without making everything a bit clunky and losing the clean syntax, hmm...🤔 Given that it's primarily operator usage/ambiguity that causes a problem I think we could achieve a reasonable balance of strictness vs convenience by raising an error if you try to combine a selector expression and a non-selector expression with an operator. In this case you would then have to call
We could implement this in the next breaking release; would lay some logical groundwork for any further changes, and is likely a good idea as-is. Would address the issue of inadvertently combining the two and the error could inform the user about the existence of (The |
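One way to picture the "raise on mixed operands" idea is the toy sketch below. It is an assumption-laden illustration, not the Polars implementation: the `Expr` and `Selector` classes here are standalone stand-ins, and `as_expr` is used only as a placeholder name for the explicit conversion.

```python
class Expr:
    """Toy expression that records its own textual form."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __or__(self, other: object) -> "Expr":
        if not isinstance(other, Expr):
            # Defer to the other operand so Selector.__ror__ can raise.
            return NotImplemented
        return Expr(f"({self.text} | {other.text})")


class Selector:
    """Toy selector: a set of column names with set-style operators."""

    def __init__(self, names: frozenset) -> None:
        self.names = names

    def __or__(self, other: object) -> "Selector":
        if not isinstance(other, Selector):
            raise TypeError(
                "cannot combine a selector with an expression; "
                "call .as_expr() on the selector first"
            )
        return Selector(self.names | other.names)

    # `expr | selector` ends up here and raises the same error.
    __ror__ = __or__

    def as_expr(self) -> Expr:
        # Hypothetical explicit conversion into a regular expression.
        return Expr("cols(" + ", ".join(sorted(self.names)) + ")")


expr = Expr("a")
first = Selector(frozenset({"a"}))
last = Selector(frozenset({"b"}))

print(sorted((first | last).names))    # ['a', 'b']
print((expr | last.as_expr()).text)    # (a | cols(b))
```

With this shape the error message is the natural place to point users at the explicit conversion, whichever name it ends up with.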
I was thinking along the same lines. A lot of the time, an explicit
import polars as pl
import polars.selectors as cs
df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
expr = pl.col("a") # Expr object
selector = cs.last() # Selector object
# 1. Binary operators between expressions and selectors throw an error
selector + selector # OK
expr + expr # OK
expr + selector # Error - ambiguous operation
expr + selector.to_expr() # OK
# 2. Selectors can be parsed as expressions when needed
df.select(expr, selector) # OK
pl.sum_horizontal(expr, selector) # OK
expr.clip(upper_bound=selector) # OK
# 3. Selectors no longer have expression methods
df.select((selector | selector).sum()) # Error - selector has no method `sum`
df.select((selector | selector).to_expr().sum()) # OK
selector.clip(upper_bound=expr) # Error - selector has no method `clip`
selector.to_expr().clip(upper_bound=expr) # OK
# 4. Selectors can be used where column names are expected
df.drop(selector) # OK
df.drop(expr) # Error
Point 3 does feel like a loss of usability, but it would be the consequence of having a separate Selector type that does not subclass Expr. There is probably a way we can make this work nicely while keeping Selector a subclass of Expr - you have probably already mapped out how 😄
Regarding implementation, it would probably be easy enough to nicely deprecate the ambiguous cases that should error. We have to detect these cases anyway.
Back on the original topic: seems like we agree that joining the selector functionality under
That is clear enough, just requires some documentation. |
Lol... I'll work something out ;)) |
I haven't really used the Selectors API much so it's possible I'm being silly.
expr + selector.as_expr()
df.select((selector | selector).as_expr().sum())
It seems like they are something that could be given to
expr + pl.col(selector)
df.select(pl.col(selector | selector).sum()) |
That would work 🤔 |
Like cmdlineluser, I don't use selectors much and I may be silly. But why
|
We would be able to do this, for example, which feels correct:
df = pl.DataFrame(
{
"foo": ["one", "one", "two", "two", "one", "two"],
"bar": ["y", "y", "y", "x", "x", "x"],
"baz": [1, 2, 3, 4, 5, 6],
}
)
df.pivot(values=pl.col("baz"), index=pl.col("foo"), columns=pl.col("bar"), aggregate_function="sum") |
Stijn's original solution here seems ideal to me. The one limitation is that you'd have to deprecate
Personally I don't like using
>>> str(~pl.selectors.string().sum())
'dtype_columns([String]).sum().not()'
>>> str((~pl.selectors.string()).sum())
'SELECTOR.sum()'
Here's a real-life example:
>>> pl.DataFrame({'name': ['one', 'two'], 'value': [1, 2]}).select(~pl.selectors.string().sum())
Traceback (most recent call last):
...
polars.exceptions.SchemaError: invalid series dtype: expected `Boolean`, got `str`
>>> pl.DataFrame({'name': ['one', 'two'], 'value': [1, 2]}).select((~pl.selectors.string()).sum())
shape: (1, 1)
┌───────┐
│ value │
│ --- │
│ i64 │
╞═══════╡
│ 3 │
└───────┘
This is why I always prefer |
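The difference in the two outputs above comes from Python's own operator precedence: a method call binds more tightly than the unary `~`. The toy class below (a hypothetical `Trace` helper, not part of Polars) simply records the order in which operations are applied.

```python
class Trace:
    """Toy object that records the order of operations applied to it."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __invert__(self) -> "Trace":
        return Trace(f"~({self.text})")

    def sum(self) -> "Trace":
        return Trace(f"{self.text}.sum()")

    def __repr__(self) -> str:
        return self.text


s = Trace("string()")
print(~s.sum())    # ~(string().sum())
print((~s).sum())  # ~(string()).sum()
```

In the first line the sum is computed before the inversion, so the `~` acts on the aggregated result rather than on the selection itself.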
@gab23r The issue is i.e.
df = pl.DataFrame({'a': [1, 2], 'b': [3, 4]})
df.select(pl.col('a') | pl.col('b'))
# shape: (2, 1)
# ┌─────┐
# │ a │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 3 │
# │ 6 │
# └─────┘
But as a Selector, it would be the union of the column names, i.e.
cs.expand_selector(df, cs.by_name('a') | cs.by_name('b'))
# ('a', 'b')
It would end up as
cs.expand_selector(df, cs.by_name('a') - cs.by_name('b'))
# ('a',) |
I think gab23r's point is that the current behavior, where pl.col doesn't return a selector, is unintuitive. It's unintuitive because, as Stijn points out, anything that "references existing columns and does not modify them" should be a selector. |
Ah yes I missed that, IMHO it would be worth |
We are talking about expressions vs. selectors, but I think we actually need to deal with 3 different cases. Here's some design talk.
Distinction of types
1 - Selectors
A selector is a way to effectively reference a set of existing columns.
Examples
import polars.selectors as cs
selector = cs.by_name("a")
selector = ~cs.first()
selector = cs.string() & cs.last()
Behavior
Selectors can be used wherever column names can be used, as they can be converted into a list of column names:
df.drop(selector)
df.select(selector)
pl.col(selector)
However, a selector cannot be treated as an expression, as they override some core functionality of expressions (
# NOT possible
df.select(selector.sum())
df.select(selector & pl.col("a"))
# Possible
df.select(selector.to_expr().sum())
df.select(selector.to_expr() & pl.col("a"))
2 - Column expressions
A column expression is an expression that references existing columns and does not modify them.
Examples
col = pl.col("a")
col = pl.all()
col = pl.first()
Behavior
Column expressions can be used wherever column names or expressions can be used, as they can be converted into a list of column names, but can also function as expressions.
df.drop(col)
df.select(col.sum())
Column expressions are expressions. They function like expressions in every way. However, because we have not called any operations on them, we get the additional benefit of being able to decompose them into existing column names. So we are able to pass a column expression to
3 - Expressions
An expression represents a set of operations to perform on a set of existing columns.
Examples
expr = pl.col("a", "b").sum()
expr = pl.lit(5)
expr = pl.col("a") & pl.col("b")
Behavior
Expressions no longer represent a clear reference to an existing column, as they may have been transformed or aliased in some way. As such, they cannot be converted into column names. We can use them anywhere expressions are allowed.
# NOT possible
df.drop(expr)
# Possible
df.select(expr)
df.with_columns(expr.clip(expr))
Proposed changes
With the distinction between these types in mind, we could do the following:
Selectors
Column expressions
Benefits
If we were to implement the proposed changes, you would gain a lot in strictness, discoverability and usability.
Thoughts are welcome! |
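The boundary between a "column expression" and a general "expression" can be sketched as a simple rule: an expression still maps back to column names only while no operation has been applied to it. The code below is a toy model under that assumption; `ToyExpr` and `drop` are hypothetical names and do not reflect how Polars tracks this internally.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ToyExpr:
    """Toy expression: the columns it reads plus the operations applied so far."""

    columns: Tuple[str, ...]
    operations: Tuple[str, ...] = ()

    def sum(self) -> "ToyExpr":
        # Applying any operation turns a column expression into a general one.
        return ToyExpr(self.columns, self.operations + ("sum",))

    def is_column_expression(self) -> bool:
        return not self.operations


def drop(frame_columns: List[str], target: ToyExpr) -> List[str]:
    """Toy `drop`: only accepts something that still maps to column names."""
    if not target.is_column_expression():
        raise TypeError("cannot drop a transformed expression")
    return [name for name in frame_columns if name not in target.columns]


columns = ["a", "b", "c"]
col_a = ToyExpr(("a",))
print(drop(columns, col_a))                # ['b', 'c']
print(col_a.sum().is_column_expression())  # False
```

Under this rule the distinction is a property of the value rather than of its type, which may be part of why the line between the two is hard to explain.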
Totally agree! A nitpick: I think Could
Do selectors have other functionality that would conflict with Like I mentioned above, the specific set operator syntax ( |
It is - I was just showcasing that you can directly call expression methods on a
I think there is a place for selectors still. Apparently you do not value the invert operator but that doesn't mean it's not useful. Adding
selector = ~(cs.string() | cs.time()) # with operators
selector = cs.string().union(cs.time()).invert() # without operators
Having them available as a separate entity can be useful - just probably not under |
I do value the invert operator! It's already implemented via Fair point about name clashes with |
I did my best to read through the whole thread but there's quite a bit. Right now we support
|
@mcrumiller Stijn suggested having |
I'm not sure how related this is, but I think I recently got tripped up by this. As a new user the dichotomy is confusing. |
It seems like the line between a "column expression" and an "expression" is blurry. For example, it is not clear to me (as in: might be difficult to teach users) whether |
Description
Currently, to use selectors, you need to import them separately. This feels awkward to me. An example of how we currently recommend using them:
The shorthand `cs` is nice and short, but it doesn't really tell you it will return an expression matching a number of columns.
I think it would make sense to have the selectors available under `pl.cols`. That way, the parallel with `pl.col` is immediately apparent. This is easily achievable by doing a re-export. The resulting syntax would be cleaner and more readable:
I have discussed this with @alexander-beedie before, but we never followed up on it. So I'm opening it for discussion here.