Make selectors available under pl.cols #13757

Open
stinodego opened this issue Jan 15, 2024 · 28 comments
Labels
A-api Area: changes to the public API A-selectors Area: column selectors enhancement New feature or an improvement of an existing feature needs decision Awaiting decision by a maintainer python Related to Python Polars

Comments

@stinodego
Member

stinodego commented Jan 15, 2024

Description

Currently, to use selectors, you need to import them separately. This feels awkward to me. An example of how we currently recommend using them:

import polars as pl
import polars.selectors as cs

df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

result = df.select(pl.col("a"), cs.string())

The shorthand cs is nice and short, but it doesn't really tell you it will return an expression matching a number of columns.

I think it would make sense to have the selectors available under pl.cols. That way, the parallel with pl.col is immediately apparent. This is easily achievable by doing a re-export. The resulting syntax would be cleaner and more readable:

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

result = df.select(pl.col("a"), pl.cols.string())

I have discussed this with @alexander-beedie before, but we never followed up on it. So I'm opening it for discussion here.

@stinodego stinodego added enhancement New feature or an improvement of an existing feature needs decision Awaiting decision by a maintainer python Related to Python Polars labels Jan 15, 2024
@Wainberg
Contributor

Wainberg commented Jan 15, 2024

This is fabulous! Besides avoiding the need for an extra import, it makes the important functionality in the selectors module more understandable for new users.

One potential counter-argument is that having pl.col and pl.cols as two parallel APIs for selection could potentially cause confusion. For instance, you'd have:

  • pl.col(pl.String) (expression) vs pl.cols.string() (selector)
  • pl.col('^foo$') (expression) vs pl.cols.matches('foo') (selector)
  • pl.col.a (expression) vs pl.cols.by_name('a') (selector)

and so on. I still really like the idea of changing to pl.cols, but maybe there would be a way to merge some of the functionality in these two soon-to-be-similarly-named APIs, as a future direction.

@stinodego stinodego added the A-api Area: changes to the public API label Jan 16, 2024
@alexander-beedie
Collaborator

alexander-beedie commented Jan 16, 2024

@stinodego: Thanks for bringing this back to the fore! I've gone back and forth on this one but I think, on balance, we should probably do it, but... if we do, we should do two other things at the same time:

  • Have a first-class selector object (that is not col) that lives in Rust.
  • Deprecate multi-column selection from col 🤔

I do not think we should reach a point (outside of a deprecation period) where we have both cols and a version of col that can also select multiple cols, as that is a really confusing place to be. The current state provides a cleaner (though not perfect) separation of responsibilities, with only the slight annoyance of requiring a separate import.

I've already found it tricky to fully explain why selectors and col expressions are not the same thing and should not be merged into the expressions API, despite them being related (and one using the other under the hood in the current implementation). Having versions of col and cols that have a giant overlap will only make things worse.

So I conditionally support it, assuming we move to a much cleaner separation of responsibilities between col and the proposed cols. If we get to that state I think it will be a clear win 👌


Note that we could make the Python-side cols object callable, taking as input some of what we currently allow for col to support a really straightforward migration path, eg: cols("a","b","c") would be a convenience shortcut for cols.by_name("a","b","c"), and cols(dtype) could redirect through to cols.by_dtype(dtype).
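
The callable-cols idea above can be sketched in plain Python. This is a toy model, not Polars code: the ColsNamespace class, the Int64 marker and the tagged-tuple return values are hypothetical stand-ins, and only the method names by_name and by_dtype come from the comment above.

```python
class ColsNamespace:
    """Toy stand-in for a callable `cols` object (hypothetical, not Polars API)."""

    def by_name(self, *names):
        # Stand-in result: a tagged tuple instead of a real selector object.
        return ("by_name", names)

    def by_dtype(self, *dtypes):
        return ("by_dtype", dtypes)

    def __call__(self, *args):
        # Strings are treated as column names; anything else as a dtype.
        if not args:
            raise TypeError("cols() requires at least one argument")
        if all(isinstance(a, str) for a in args):
            return self.by_name(*args)
        if not any(isinstance(a, str) for a in args):
            return self.by_dtype(*args)
        raise TypeError("cannot mix column names and dtypes in one call")


class Int64:
    """Hypothetical dtype marker used only for this sketch."""


cols = ColsNamespace()

# Calling the namespace redirects to the explicit selector methods.
print(cols("a", "b", "c") == cols.by_name("a", "b", "c"))
print(cols(Int64) == cols.by_dtype(Int64))
```

Rejecting mixed names and dtypes is a design choice of this sketch; the thread does not say how such a call should behave.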

@stinodego
Member Author

I was mulling this over while lying in bed last night (life of a Polars maintainer 😬), and maybe it can be even simpler: why don't we make everything available under pl.col?

col is really also just a selector if you think about it. I would use the following definition:

A selector is an expression that references existing columns and does not modify them

This gives it some special properties over non-selector expressions:

  • It can be cleanly deconstructed into a list of column names.
  • It can be combined with other selectors using & and |.

col does exactly that. In fact, it is the OG of selectors as it existed before selectors were a thing and selectors dispatch to it. So why don't we allow:

# Original uses of `col` - can be kept intact
pl.col("a")
pl.col("a", "b")
pl.col(pl.String)
pl.col(pl.String, pl.Datetime)

# Selector functionality
pl.col.by_name("a", "b")
pl.col.string()
pl.col.first()
...

# Attribute short syntax
pl.col.MyColumn
pl.col.start_date
...

# Combine them
pl.col("a") & pl.col.string()
pl.col("a") | pl.col("b")
pl.col.MyColumn | pl.col.string()
...

The only thing that's iffy here is that the attribute short syntax competes with the selectors now. That is unfortunate, but seeing as we have clearly documented that syntax as non-idiomatic, I think it's OK. You still have the parentheses to tell them apart, e.g.:

pl.col.time  # Selector for "time" column name
pl.col.time()  # Selector for Time data type 

I think this would cleanly unify things.

@Wainberg
Contributor

I was just about to post this but then Stijn pre-empted me with a similar solution ;)

If I'm understanding correctly, all the properties that make selectors unique (like set ops, being able to be passed to functions like drop, being able to use expand_selector on them, etc.) are a consequence of the fact that selectors represent sets of columns that are actually in a dataframe, rather than transformations of those columns.

I'd argue that a lot of the complications we're discussing in this thread stem from the fact that expressions can represent sets of columns that are actually in a dataframe, but they can also represent transformations of those columns.

One option is to replace selectors with a subtype of expression that represents columns that are actually in a dataframe, which could for instance be called Columns:

pl.col.a -> Columns
pl.col('a') -> Columns
pl.cols('a', 'b') -> Columns
pl.cols.string() -> Columns
pl.cols.matches('foo') -> Columns

Columns would have the same special behaviors that selectors currently have: they would support set ops, could be passed to functions like drop, and would support expand_selector (which could be renamed to e.g. expand_columns). Any transformation on a Columns (like pl.col.a.sum() or pl.cols.string().str.reverse()) would produce a regular Expr, which wouldn't support any of these special behaviors.

The reason for having both col and cols is to be able to distinguish pl.col.string from pl.cols.string(). If you're alright with the selector functions not being in their own namespace, you could also do:

pl.col.a -> Columns
pl.col('a') -> Columns
pl.col('a', 'b') -> Columns
pl.string() -> Columns
pl.matches('foo') -> Columns

@ion-elgreco
Contributor

@alexander-beedie if you introduce pl.cols being the only one that can do multi cols, why would one ever use pl.col still? :)

@Wainberg
Contributor

The conflict between set operations and boolean operations is something Stijn and I haven't discussed yet in this thread. It's important to be able to support both bitwise operations (especially for boolean columns) and set operations. Unfortunately, my original proposal of using and, or and not for set operations wouldn't work because those operators aren't overloadable in Python.
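
The point about and/or/not can be checked with a small plain-Python class. The Flag class below is purely illustrative and has nothing to do with Polars; it only shows that & dispatches to __and__ while the and keyword never does.

```python
class Flag:
    """Minimal class that overloads `&` and records how it was combined."""

    def __init__(self, label):
        self.label = label

    def __and__(self, other):
        return Flag(f"({self.label} & {other.label})")


a, b = Flag("a"), Flag("b")

# `&` dispatches to __and__, so a library can give it any meaning it likes.
combined = a & b
print(combined.label)

# `and` never calls __and__: it tests truthiness and returns one operand.
# Plain objects are truthy by default, so the result is simply `b`.
result = a and b
print(result is b)
```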

On another note, I really like this suggestion from Stijn:

pl.col.time  # Selector for "time" column name
pl.col.time()  # Selector for Time data type 

As Stijn points out, that would avoid the need for having both pl.col and pl.cols. It could be implemented by implementing __call__ on whatever class pl.col is going to be called.
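
A rough sketch of how attribute access and a call could be told apart, in plain Python. Everything here is hypothetical (the ColumnRef and ColNamespace classes, the _DTYPE_SELECTORS registry and the string return values); it is one possible shape, not how Polars implements pl.col.

```python
class ColumnRef:
    """Hypothetical result of `col.<name>`: a column reference that is also callable."""

    # Hypothetical registry of attribute names that double as dtype selectors.
    _DTYPE_SELECTORS = {"time": "Time", "string": "String"}

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f'col("{self.name}")'

    def __call__(self):
        # Called form: reinterpret the attribute name as a selector function.
        try:
            dtype = self._DTYPE_SELECTORS[self.name]
        except KeyError:
            raise AttributeError(f"no selector named {self.name!r}") from None
        return f"dtype_selector({dtype})"


class ColNamespace:
    """Hypothetical `col` object supporting col("x"), col.x and col.x()."""

    def __call__(self, name):
        return ColumnRef(name)

    def __getattr__(self, name):
        # Leave dunder lookups alone so introspection keeps working.
        if name.startswith("__"):
            raise AttributeError(name)
        return ColumnRef(name)


col = ColNamespace()

print(repr(col.time))  # column named "time"
print(col.time())      # selector for the Time data type
```

The cost of this design is visible in the sketch: a column whose name matches a selector is still reachable as col.time, but the only thing separating the two meanings is a pair of parentheses.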

@alexander-beedie
Collaborator

alexander-beedie commented Jan 16, 2024

Oof... I strongly disagree with the various suggestion(s) to blend everything into a single "it does all of the things" object 🤣

One of the reasons selectors took off is precisely because they separated themselves out from col and became their own thing. This is conceptually clean and provides a clearer separation of responsibilities; it is also easier to consistently reason about behaviour.

Selectors are based on expressions but are not exactly the same as expressions; they represent a late-bound set of potential expressions that are not resolved until evaluation. As such they also use operators in set context (unlike non-selector expressions):

Selectors:

  • ~cs.by_name("a","b") => the set of cols that are not "a" or "b"
  • cs.ends_with("usd") - cs.string() => set of cols ending 'usd' excluding strings
  • cs.temporal() | cs.float() => the set of cols that are temporal or float

Expressions use these operators very differently:
~ is a boolean inverse, - is arithmetic, | is logical or bitwise, etc.
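
To make the "late-bound set" reading concrete, here is a toy model in plain Python. It is not Polars code: the Sel class, the helper functions and the schema dict with dtype names as strings are all assumptions made for illustration. A selector is modelled as a predicate that is only resolved once a schema is supplied.

```python
class Sel:
    """Toy selector: a predicate over (name, dtype), resolved against a schema."""

    def __init__(self, pred):
        self.pred = pred

    def __invert__(self):
        # Set complement.
        return Sel(lambda n, d: not self.pred(n, d))

    def __or__(self, other):
        # Set union.
        return Sel(lambda n, d: self.pred(n, d) or other.pred(n, d))

    def __sub__(self, other):
        # Set difference.
        return Sel(lambda n, d: self.pred(n, d) and not other.pred(n, d))

    def expand(self, schema):
        # Late binding: nothing is resolved until a schema is supplied.
        return [n for n, d in schema.items() if self.pred(n, d)]


def by_name(*names):
    return Sel(lambda n, d: n in names)


def ends_with(suffix):
    return Sel(lambda n, d: n.endswith(suffix))


def by_dtype(*dtypes):
    return Sel(lambda n, d: d in dtypes)


schema = {
    "a": "Int64",
    "b": "String",
    "price_usd": "Float64",
    "note_usd": "String",
    "ts": "Datetime",
}

print((~by_name("a", "b")).expand(schema))                          # ['price_usd', 'note_usd', 'ts']
print((ends_with("usd") - by_dtype("String")).expand(schema))       # ['price_usd']
print((by_dtype("Datetime") | by_dtype("Float64")).expand(schema))  # ['price_usd', 'ts']
```

In this model the operators never touch column values, which is exactly the difference from expressions described above.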

Conflating expressions and selectors into a single object means you now have to think more carefully about every usage of an operator to know what you're going to get.

And then there's just the plain meaning of the word; having an object called "col" (singular) offering up lots of different ways to get multiple columns (plural) is just... odd.

There isn't anything much simpler than:

  • col: expression that gives me one column
  • cols: selector that gives me one or more columns :)

Blending everything together looks like a recipe for edge-cases and lack of clarity 😅

@Wainberg
Contributor

Wainberg commented Jan 16, 2024

@alexander-beedie Agreed that the set vs boolean operations thing is crucial. If a and b are boolean columns, then does pl.col.a | pl.col.b select two columns, or take the bitwise or of two columns?

Similar to how we currently have exclude(), what if we also had union() and intersection()? Then you could do pl.col.a.union(pl.col.string()) to select all string columns alongside the a column, and pl.col.a | pl.col.b for bitwise or.

It's definitely not as clean as pl.col.a | pl.col.string(), but I think this option would address your concerns about edge cases and clarity. It's also worth keeping in mind that having &, |, ^ and ~ mean different things for selectors and expressions is also confusing, and this option would avoid that confusion.

@stinodego
Member Author

stinodego commented Jan 16, 2024

Thanks @alexander-beedie, this definitely helps me understand the concept of selectors better. Again, things are not as simple as they seem at first.

Selectors are based on expressions but are not exactly the same as expressions

Selectors literally are expressions, though. This has surprised me before when I did an isinstance(input, Expr) and it returned True when a selector was passed.

So they are expressions but override certain behavior to make the composability work. This leads to unexpected behavior as sometimes they behave like expressions and sometimes they don't:

import polars as pl
import polars.selectors as cs

df = pl.DataFrame({"a": [True, True, False], "b": [False, False, True]})

expr1 = pl.col("a")
expr2 = pl.col("b")
expr3 = cs.last()

# Equivalent
df.select(expr1, expr2)
df.select(expr1, expr3)

# Also equivalent
df.select(expr1.sum(), expr2.sum())
df.select(expr1.sum(), expr3.sum())

# Not equivalent ?!
df.select((expr1 | expr2).sum())
df.select((expr1 | expr3).sum())

I would argue this is problematic. We import and use the selector separately, but later on in my program, I don't really know whether I am dealing with a selector or with an expression.

I see two ways to solve this:

  1. Selectors do not subclass expressions - they are really their own thing and not constrained by the Expr API. To use them as expressions you would first call .to_expr() or similar on them. Then the selector is turned into an expression and functions like one.
  2. Selectors do subclass expressions, but they do not override existing expression behavior. They add specialized methods for combining with other selectors/expressions, such as cs_union or cs_invert. Once you call a non-specialized method on them, they are no longer selectors.

If we were to go with 1, then they should be under cols or something different, definitely not under col. I think this is probably the best way to go.
If we were to go with 2, then grouping them under col would be fine as they really are expressions, but they have some additional functionality.

I believe we must solve this before implementing things in Rust, as Rust is not as lenient with boundaries as Python is, and I think selectors currently operate in a grey area.

@alexander-beedie
Collaborator

alexander-beedie commented Jan 16, 2024

I think selectors currently operate in a grey area.

Indeed; more than anything I wanted to draw attention to that :) They are expressions when it comes to the class hierarchy (a necessary part of the current implementation), but functionally speaking they are not "just" expressions.

Might be tricky to segregate them completely from expressions without making everything a bit clunky and losing the clean syntax, hmm...🤔

Given that it's primarily operator usage/ambiguity that causes a problem I think we could achieve a reasonable balance of strictness vs convenience by raising an error if you try to combine a selector expression and a non-selector expression with an operator. In this case you would then have to call as_expr() if that's the behaviour you want:

  • cs.first() - pl.col("xyz") → Error (attempt to combine expr/selector)
  • cs.numeric() - cs.last() → Selector
  • pl.col("abc") - pl.col("xyz") → Expr
  • (cs.last().as_expr() - pl.col("xyz")) → Expr

We could implement this in the next breaking release; would lay some logical groundwork for any further changes, and is likely a good idea as-is. Would address the issue of inadvertently combining the two and the error could inform the user about the existence of as_expr().
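
A minimal sketch of that rule in plain Python, assuming Selector stays a subclass of Expr as in the current class hierarchy. The two classes below are toy stand-ins with string labels, not the Polars classes, and the error message wording is made up; as_expr() is the method name used in this comment.

```python
class Expr:
    """Toy expression that only records a label."""

    def __init__(self, label):
        self.label = label

    def __sub__(self, other):
        if isinstance(other, Selector):
            raise TypeError(
                "cannot combine an expression with a selector; "
                "call as_expr() on the selector first"
            )
        return Expr(f"({self.label} - {other.label})")


class Selector(Expr):
    """Toy selector: `-` means set difference and only accepts selectors."""

    def as_expr(self):
        # Explicit opt-in to expression semantics.
        return Expr(self.label)

    def __sub__(self, other):
        if not isinstance(other, Selector):
            raise TypeError(
                "cannot combine a selector with an expression; "
                "call as_expr() on the selector first"
            )
        return Selector(f"({self.label} - {other.label})")


print(type(Selector("numeric") - Selector("last")).__name__)        # Selector
print(type(Expr("abc") - Expr("xyz")).__name__)                     # Expr
print(type(Selector("last").as_expr() - Expr("xyz")).__name__)      # Expr

try:
    Selector("first") - Expr("xyz")
except TypeError as exc:
    print("Error:", exc)
```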

(The cs import may be slightly annoying, but it does result in a natural segregation between cs.<something> and pl.<something> in typical use - unless, as indicated, you alias selectors to some other variable and lose that linkage).

@stinodego
Member Author

Given that it's primarily operator usage/ambiguity that causes a problem I think we could achieve a reasonable balance of strictness vs convenience by raising an error if you try to combine a selector expression and a non-selector expression with an operator. In this case you would then have to call as_expr() if that's the behaviour you want:

I was thinking along the same lines. A lot of the time, an explicit to_expr will not be necessary, as we can infer the intended usage. So the usability does not have to suffer too much. When there is ambiguity, we throw an error.

import polars as pl
import polars.selectors as cs

df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

expr = pl.col("a")  # Expr object
selector = cs.last()  # Selector object

# 1. Binary operators between expressions and selectors throw an error
selector + selector  # OK
expr + expr  # OK
expr + selector  # Error - ambiguous operation
expr + selector.to_expr()  # OK

# 2. Selectors can be parsed as expressions when needed
df.select(expr, selector)  # OK
pl.sum_horizontal(expr, selector)  # OK
expr.clip(upper_bound=selector)  # OK

# 3. Selectors no longer have expression methods
df.select((selector | selector).sum())  # Error - selector has no method `sum`
df.select((selector | selector).to_expr().sum())  # OK

selector.clip(upper_bound=expr)  # Error - selector has no method `clip`
selector.to_expr().clip(upper_bound=expr)  # OK

# 4. Selectors can be used where column names are expected
df.drop(selector)  # OK
df.drop(expr)  # Error

Point 3 does feel like a loss of usability, but it would be the consequence of having a separate Selector type that does not subclass Expr. There is probably a way we can make this work nicely while keeping Selector a subclass of Expr - you have probably already mapped out how 😄

Regarding implementation, it would probably be easy enough to nicely deprecate the ambiguous cases that should error. We have to detect these cases anyway.

Back on the original topic: seems like we agree that joining the selector functionality under col was a bad idea. I still think pl.cols is nice enough. We would have to document properly that it returns a Selector object (not an Expr). I don't see the need to deprecate pl.col("a", "b") though as this is still useful if you want an actual Expr and not a Selector.

  • pl.col returns Expr
  • pl.cols returns Selector

That is clear enough, just requires some documentation.

@alexander-beedie
Collaborator

There is probably a way we can make this work nicely while keeping Selector a subclass of Expr - you have probably already mapped out how 😄

Lol... I'll work something out ;))

@cmdlineluser
Contributor

I haven't really used the Selectors API much so it's possible I'm being silly.

expr + selector.as_expr()  
df.select((selector | selector).as_expr().sum())  

It seems like they are something that could be given to pl.col instead of replacing it?

expr + pl.col(selector)
df.select(pl.col(selector | selector).sum())

@stinodego
Copy link
Member Author

That would work 🤔

@gab23r
Contributor

gab23r commented Jan 16, 2024

Like cmdlineluser, I don't use selectors much, so I may be being silly. But why doesn't pl.col return a Selector?

pl.col('a') | pl.col('b') & pl.col.string() would return a Selector
pl.col('a').sum() would return an Expr

@gab23r
Contributor

gab23r commented Jan 16, 2024

We would be able to do this, for example, which feels correct:

df = pl.DataFrame(
    {
        "foo": ["one", "one", "two", "two", "one", "two"],
        "bar": ["y", "y", "y", "x", "x", "x"],
        "baz": [1, 2, 3, 4, 5, 6],
    }
)
df.pivot(values=pl.col("baz"), index=pl.col("foo"), columns=pl.col("bar"), aggregate_function="sum")

@Wainberg
Contributor

Stijn's original solution here seems ideal to me. The one limitation is that you'd have to deprecate ~, ^ and & for set operations for selectors, and define union and intersection to work alongside exclude.

Personally I don't like using ~, ^ and & for set operations on selectors anyway, because you always have to remember to put brackets around the selector before operating on it:

>>> str(~pl.selectors.string().sum())
'dtype_columns([String]).sum().not()'
>>> str((~pl.selectors.string()).sum())
'SELECTOR.sum()'

Here's a real-life example:

>>> pl.DataFrame({'name': ['one', 'two'], 'value': [1, 2]}).select(~pl.selectors.string().sum())
Traceback (most recent call last):
...
polars.exceptions.SchemaError: invalid series dtype: expected `Boolean`, got `str`
>>> pl.DataFrame({'name': ['one', 'two'], 'value': [1, 2]}).select((~pl.selectors.string()).sum())
shape: (1, 1)
┌───────┐
│ value │
│ ---   │
│ i64   │
╞═══════╡
│ 3     │
└───────┘

This is why I always prefer exclude() over ~ in my own data analysis code. For the same reason, I would prefer to use union and intersection over | and &, if they existed.
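
The bracket issue is plain Python operator precedence: attribute access and calls bind tighter than unary ~. The Trace class below is a toy that just records the order of operations; it is not Polars code and its labels are made up.

```python
class Trace:
    """Records the order in which operations are applied (toy, not Polars)."""

    def __init__(self, text):
        self.text = text

    def __invert__(self):
        return Trace(f"{self.text}.not()")

    def sum(self):
        return Trace(f"{self.text}.sum()")


s = Trace("string")

# Without brackets, `.sum()` runs first and `~` is applied to its result.
print((~s.sum()).text)    # string.sum().not()

# With brackets, `~` is applied to the selector before `.sum()`.
print(((~s).sum()).text)  # string.not().sum()
```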

@cmdlineluser
Contributor

pl.col('a') | pl.col('b') & pl.col.string() would return a Selector

@gab23r The issue is col already performs the operation on the values contained inside the columns.

i.e. bitwise_or(pl.col('a'), pl.col('b'))

df = pl.DataFrame({'a': [1, 2], 'b': [3, 4]})

df.select(pl.col('a') | pl.col('b')) 
# shape: (2, 1)
# ┌─────┐
# │ a   │
# │ --- │
# │ i64 │
# ╞═════╡
# │ 3   │
# │ 6   │
# └─────┘

But as a Selector, it would be the union of the column names i.e. pl.col('a', 'b')

cs.expand_selector(df, cs.by_name('a') | cs.by_name('b'))
# ('a', 'b')

pl.col('a') - pl.col('b') is perhaps a more obvious example.

It would end up as pl.col('a') if treated as a Selector:

cs.expand_selector(df, cs.by_name('a') - cs.by_name('b'))
# ('a',)

@Wainberg
Contributor

I think gab23r's point is that the current behavior, where pl.col doesn't return a selector, is unintuitive. It's unintuitive because, as Stijn points out, anything that "references existing columns and does not modify them" should be a selector.

@gab23r
Contributor

gab23r commented Jan 18, 2024

The one limitation is that you'd have to deprecate ~, ^ and & for set operations for selectors, and define union and intersection to work alongside exclude.

Ah yes, I missed that. IMHO it would be worth it.

@stinodego
Member Author

stinodego commented Jan 22, 2024

We are talking about expressions vs. selectors, but I think we actually need to deal with 3 different cases. Here's some design talk.

Distinction of types

1 - Selectors

A selector is a way to effectively reference a set of existing columns.

Examples

import polars.selectors as cs

selector = cs.by_name("a")
selector = ~cs.first()
selector = cs.string() & cs.last()

Behavior

Selectors can be used wherever column names can be used, as they can be converted into a list of column names:

df.drop(selector)
df.select(selector)
pl.col(selector)

However, a selector cannot be treated as an expression, as they override some core functionality of expressions (~ / & / | operators, among others). Explicit conversion to an expression is required. Possibly through calling selector.to_expr() (exact API to be determined).
(Currently selectors subclass Expr, this should be fixed, see comment above)

# NOT possible
df.select(selector.sum())
df.select(selector & pl.col("a"))

# Possible
df.select(selector.to_expr().sum())
df.select(selector.to_expr() & pl.col("a"))

2 - Column expressions

A column expression is an expression that references existing columns and does not modify them.

Examples

col = pl.col("a")
col = pl.all()
col = pl.first()

Behavior

Column expressions can be used wherever column names or expressions can be used, as they can be converted into a list of column names, but can also function as expressions.

df.drop(col)
df.select(col.sum())

Column expressions are expressions. They function like expressions in every way. However, because we have not called any operations on them, we get the additional benefit of being able to decompose them into existing column names. So we are able to pass a column expression to df.drop, for example.
(Currently this is not yet possible, as the concept of column expressions doesn't really exist yet. Though this behavior feels natural and seems like it should work.)

3 - Expressions

An expression represents a set of operations to perform on a set of existing columns.

Examples

expr = pl.col("a", "b").sum()
expr = pl.lit(5)
expr = pl.col("a") & pl.col("b")

Behavior

Expressions no longer represent a clear reference to an existing column, as they may have been transformed or aliased in some way. As such, they cannot be converted into column names. We can use them anywhere expressions are allowed.

# NOT possible
df.drop(expr)

# Possible
df.select(expr)
df.with_columns(expr.clip(expr))

Proposed changes

With the distinction between these types in mind, we could do the following:

Selectors

  • Update selectors to no longer subclass Expr, add a to_expr() method for conversion into an expression.
  • Update parsing logic to accept a Selector wherever column names are accepted.
  • Make selectors more easily available by including it under the pl import, e.g. pl.selector.first() or pl.sel.first() or pl.cols.first() (the original point of this issue).

Column expressions

  • Make pl.col return a specialized Expr, e.g. ColExpr. This class subclasses Expr and has a method for converting it into a list of column names, e.g. to_column_names().
  • Update methods like drop to allow column expressions as inputs (in addition to selectors and strings).
  • Update functions like pl.all(), pl.first() etc. to also return a ColExpr.
  • Make those functions available under col, e.g. pl.col.first(), pl.col.string(), etc. and deprecate the top level ones pl.first() etc.
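
The column-expression proposal can be sketched in plain Python. This is a toy model under stated assumptions: Expr, ColExpr and the free function drop are stand-ins written for this sketch, with to_column_names() taken from the bullet above. Selector inputs, which the proposal would also accept, are left out to keep it short.

```python
class Expr:
    """Toy expression that only records a label."""

    def __init__(self, label):
        self.label = label

    def sum(self):
        # Any transformation yields a plain Expr, never a ColExpr.
        return Expr(f"{self.label}.sum()")


class ColExpr(Expr):
    """Toy column expression: an Expr that still knows its column names."""

    def __init__(self, *names):
        super().__init__(f"col({', '.join(names)})")
        self._names = list(names)

    def to_column_names(self):
        return list(self._names)


def drop(columns, *inputs):
    """Toy `drop`: accepts strings and ColExpr, rejects transformed expressions."""
    to_drop = set()
    for item in inputs:
        if isinstance(item, str):
            to_drop.add(item)
        elif isinstance(item, ColExpr):
            to_drop.update(item.to_column_names())
        else:
            raise TypeError(
                f"cannot derive column names from {type(item).__name__}"
            )
    return [c for c in columns if c not in to_drop]


print(drop(["a", "b", "c"], ColExpr("a"), "c"))  # ['b']
print(type(ColExpr("a").sum()).__name__)         # Expr

try:
    drop(["a", "b"], ColExpr("a").sum())
except TypeError as exc:
    print("Error:", exc)
```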

Benefits

If we were to implement the proposed changes, you would gain a lot in strictness, discoverability and usability.

  • Selectors can no longer be confused with expressions and lead to surprising results. Explicit conversion is required.
  • pl.col as the central hub for creating column expressions, with some of the nice selector functionality such as cs.string() very accessible under pl.col.string()
  • Power users that require advanced column selection can still do so under a more accessible name pl.selector (or just do from polars import selector as cs if you prefer that shorthand).
  • The various parts of the API integrate more nicely, e.g. being able to do df.drop(pl.col("a")) makes sense.

Thoughts are welcome!

@Wainberg
Contributor

If we were to implement the proposed changes, you would gain a lot in strictness, discoverability and usability.

Totally agree!

A nitpick: I think col.sum() should be an Expr not a ColExpr, e.g. it doesn't make sense to df.drop(col.sum()).

Could ColExpr have union and intersection methods to go alongside the existing Expr.exclude method (which would presumably become ColExpr.exclude)? It seems like there's no particular reason not to support those operations for ColExpr.

However, a selector cannot be treated as an expression, as they override some core functionality of expressions (~ / & / | operators, among others).

Do selectors have other functionality that would conflict with ColExprs, or is it just the set operations? Because if it's just the set operations, then implementing ColExpr.union and ColExpr.intersection would avoid the conflict, allowing you to get rid of selectors entirely.

Like I mentioned above, the specific set operator syntax (~ / & / | etc.) is a footgun anyways because bitwise operators have such low precedence in Python. For instance, in your proposed formulation, you would never be able to use ~ in combination with to_expr() without enclosing it in brackets. ~pl.selectors.string().to_expr().sum() is a bug because the ~ isn't applied until the very end; you would have to do (~pl.selectors.string()).to_expr().sum().

@stinodego
Member Author

stinodego commented Jan 23, 2024

A nitpick: I think col.sum() should be an Expr not a ColExpr, e.g. it doesn't make sense to df.drop(col.sum()).

It is - I was just showcasing that you can directly call expression methods on a ColExpr.

I think there is a place for selectors still. Apparently you do not value the invert operator but that doesn't mean it's not useful. Adding union/intersection/invert methods is verbose and it potentially clashes with Expr methods if we wanted to add those in the future. So they could become sel_intersection / ... but then it becomes even more verbose.

selector = ~(cs.string() | cs.time())  # with operators
selector = cs.string().union(cs.time()).invert()  # without operators

Having them available as a separate entity can be useful - just probably not under pl.cols as to avoid confusion with pl.col.

@Wainberg
Contributor

I do value the invert operator! It's already implemented via pl.exclude and I use it fairly often. I prefer it to ~ because it reads cleaner without the brackets. I also prefer Expr.exclude over the - operator for clarity.

Fair point about name clashes with union/intersection. I think distinguishing between union_elements/intersect_elements and union_columns/intersect_columns would be clearest, by analogy with map_elements/map_batches (the latter of which I'd still argue should be called map_columns, even though Ritchie points out that sometimes you only get part of a column). It's verbose, but not much more so than with_columns which is used way more often.

@mcrumiller
Contributor

mcrumiller commented Jan 29, 2024

I did my best to read through the whole thread but there's quite a bit.

Right now we support pl.col.string to mean pl.col("string"). If we take the pandas approach and say "don't use dot notation to reference columns because it creates all sorts of issues," couldn't we just put all of the selectors operations in the pl.col namespace and be done with it? This is way simpler and provides no ambiguity as far as I can tell:

  • Single namespace
  • pl.selectors.x() becomes pl.col.x()
  • Virtually no code changes required (simply rename selectors namespace to col and move the current column function as a constructor).

@Wainberg
Contributor

@mcrumiller Stijn suggested having pl.col.string refer to a column named "string" and pl.col.string() select all string columns, which is admittedly a little weird on the surface but avoids issues. I think the dot notation was pretty overwhelmingly popular, so I'd expect that'll stay regardless.

@paddymul
Contributor

paddymul commented Feb 6, 2024

I'm not sure how related this is, but I think I recently got tripped up by this.
I can call lit(5) on an Expr but not a Selector. I'm still figuring out a workaround.

As a new user the dichotomy is confusing.

@legendre6891

It seems like the line between a "column expression" and an "expression" is blurry. For example, it is not clear to me (as in: might be difficult to teach users) whether pl.col("a").alias("b") is a column expression or not...?


9 participants