Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Following a selector with .exclude() is not considered a selector #16448

Closed
2 tasks done
machow opened this issue May 23, 2024 · 7 comments · Fixed by #16479
Closed
2 tasks done

Following a selector with .exclude() is not considered a selector #16448

machow opened this issue May 23, 2024 · 7 comments · Fixed by #16479
Labels
A-selectors Area: column selectors python Related to Python Polars

Comments

@machow
Copy link
Contributor

machow commented May 23, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Following a selector with .exclude() breaks cs.expand_selector():

import polars.selectors as cs
import polars as pl

df = pl.DataFrame({"a": [], "ab": []})

cs.expand_selector(df, cs.starts_with("a").exclude("ab"))
TypeError: expected a selector; found <Expr ['col("^(a).*$").exclude([Name("…'] at 0x1379DD250> instead.

However, this succeeds:

cs.expand_selector(df, cs.exclude("ab"))

Log output

~/.virtualenvs/gt-python/lib/python3.9/site-packages/polars/selectors.py:121, in expand_selector(target, selector)
    119 if not is_selector(selector):
    120     msg = f"expected a selector; found {selector!r} instead."
--> 121     raise TypeError(msg)
    123 if isinstance(target, Mapping):
    124     from polars.dataframe import DataFrame

Issue description

.exclude() currently returns a selector only if it's called off of cs.exclude(), but not if it follows a selector. I wonder if this is related to their also being a polars.exclude() method?

Expected behavior

It seems like .exclude() should return a selector, whether called from cs.exclude() (which currently returns a selector) or after a selector.

Installed versions

--------Version info---------
Polars:               0.20.28
Index type:           UInt32
Platform:             macOS-14.4.1-arm64-arm-64bit
Python:               3.9.5 (default, Jul 13 2022, 16:30:47) 
[Clang 13.0.0 (clang-1300.0.29.30)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.3.1
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
nest_asyncio:         1.5.8
numpy:                1.26.0
openpyxl:             3.1.2
pandas:               2.2.1
pyarrow:              14.0.0
pydantic:             2.5.2
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.22
torch:                <not installed>
xlsx2csv:             0.8.2
xlsxwriter:           <not installed>
@machow machow added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels May 23, 2024
@alexander-beedie alexander-beedie removed bug Something isn't working needs triage Awaiting prioritization by a maintainer labels May 23, 2024
@alexander-beedie
Copy link
Collaborator

alexander-beedie commented May 23, 2024

This is actually the expected behaviour as the expression-method .exclude(..) is not itself a selector, and does not return a selector. Calling a method on a selector will always result in an expression, and that expression will be broadcast over the selected columns.

To instead perform set operations with selectors (intersection, union, difference, complement) you can use the &, |, -, and ~ operators, as illustrated in the docs here (the operands must themselves both be selectors):

In your example what you actually need is the following:

cs.expand_selector(df, cs.starts_with("a") - cs.by_name("ab"))
# ('a',)

This keeps things consistent - you don't really want expression methods to change their return type based on what they were called on, but if you want to do set operations on selectors, you can absolutely do that 👍

@alexander-beedie alexander-beedie added the A-selectors Area: column selectors label May 23, 2024
@machow
Copy link
Contributor Author

machow commented May 23, 2024

you don't really want expression methods to change their return type based on what they were called on

Can you say a bit more about this? To my understanding, this is fairly common (e.g. made easy to hint through the Self type).

from typing import Self

class Expr:
    def exclude(self, ...) -> Self:
        return self.__class__(...)

edit: a good example is the suggested selection, where the return type of __sub__ depends on what it's called on, and the other object:

  • selector: cs.starts_with("a") - cs.by_name("ab")
  • expression: cs.starts_with("a") - 1, cs.starts_with("a") - pl.col("b")

The bigger picture for Great Tables

I think the challenge right now in Great Tables is that we want to do these things:

  • Accept some set of polars operations that perform simple selection
    • Let's say simple selection is something that specifies columns of data to return, without specifying any transformations/operations on them (either their data or names).
  • Return the names of the columns that would be selected.

I think we are running into three issues:

  • cognitive: there are operations I can do as a user that perform simple selection, but that will be rejected by Great Tables. (e.g. cs.starts_with("a").exclude("ab")).
  • cognitive: if someone uses the DataFrame.select() interface, the suggested method, cs.starts_with("a") - cs.by_name("ab") is one way to specify the selection, but might not be what they gravitate towards.
  • computer: Great Tables is using expand_selector(), which is maybe the wrong way to model this situation?

We just need to know they are not transforming data, and the column names they'd select. So if .exclude() pushes us out of being a selector, but there's a better way to still recognize it as a simple selection, we can def go that way!

Thanks for coming on this funky selection adventure

I realize a lot of this is a funkier use of the polars API. Thanks for being so quick to help with this stuff :)

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented May 23, 2024

Can you say a bit more about this? To my understanding, this is fairly common

In this case it's about API consistency more than anything technical - it's cleanest if expression methods on selectors consistently broadcast as expressions (a selector is an expression of course, but it's also a little special so... arguable).

I'm very amenable to finding ways to make this work for your use-case though; exclude is in a slight grey area as it doesn't really broadcast, it modifies the preceding multiple-output col expression (it existed before selectors, which is why it probably looks like it should work, but doesn't).

  • Accept some set of polars operations that perform simple selection
  • Let's say simple selection is something that specifies columns of data to return, without specifying any transformations/operations on them (either their data or names).

So... there are definitely no renaming operations happening at this point?
eg: no pl.col("xyz").alias("abc") or cs.starts_with("a").name.prefix("foo_")?

I realize a lot of this is a funkier use of the polars API. Thanks for being so quick to help with this stuff :)

😎👍

@machow
Copy link
Contributor Author

machow commented May 24, 2024

Simple selector rules of thumb

it's cleanest if expression methods on selectors consistently broadcast as expressions

Are the rules of thumb for selectors something like this?

  1. every top-level function in cs that can be used for selection returns a selector (e.g. cs.starts_with())
  2. in order to keep things as selectors, you need to use infix operators (e.g. cs.starts_with(...) - cs.by_name(...))
  3. infix operators are only guaranteed to return selectors if all operands are selectors
  • e.g. selector: cs.starts_with(...) - cs.by_name(...)
  • e.g. expression: cs.starts_with("...") - "a"
  1. method calls off selectors return expressions

The rule of thumb (1) explains why cs.exclude(...) is a selector, even though pl.exclude(...) isn't. Rule of thumbs (3) and (4) explain why cs.starts_with(...).exclude(...) isn't a selector.

(I'm not too worried about the model, so much as humans being able to guess what will work and what won't; hedging against the risk that they don't know how to rewrite the things they're used to in DataFrame.select() in less common results-in-a-selector-type code.)

Simple selection vs renaming

So... there are definitely no renaming operations happening at this point?

Yeah, exactly:

  1. no renaming operations in simple selector cases (and we'd be okay raising an error for renames)
  2. for the one renaming situation in Great Tables, we can just use pl.DataFrame.rename() as the contract ✨

Example

Simple selection is really helpful for cases like the coffee data table .

import polars as pl
import polars.selectors as cs
from great_tables import GT, loc, style

coffee_sales = pl.read_json("data/coffee-sales.json")

sel_rev = cs.starts_with("revenue")
sel_prof = cs.starts_with("profit")


coffee_table = (
    GT(coffee_sales)
    .tab_header("Sales of Coffee Equipment")

    # Case 1: simple selection ----
    # create spanner columns "Revenue" and "Profit" (super common activity)

    .tab_spanner(label="Revenue", columns=sel_rev)
    .tab_spanner(label="Profit", columns=sel_prof)

    # Case 2: renaming ----
    # renaming basically only happens here
    .cols_label(
        revenue_dollars="Amount",
        profit_dollars="Amount",
        revenue_pct="Percent",
        profit_pct="Percent",
        icon="",
        product="Product",
    )
)
image

@alexander-beedie
Copy link
Collaborator

You have nailed the rules of thumb - enough so that I am genuinely tempted to crib that breakdown for a small section in the selectors documentation to make it more widely available 🤣 Anyway, with the information above in mind, I think I have a solution, and have made a PR. Take a look and see if you think it'll cover everything?

@alexander-beedie
Copy link
Collaborator

alexander-beedie commented May 26, 2024

Available in the new release: 0.20.30 ✌️

@machow
Copy link
Contributor Author

machow commented May 29, 2024

Thanks! This is super helpful, and the addition of meta.is_selection_column() really nails our use-case! (def feel free to crib the rules of thumb :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-selectors Area: column selectors python Related to Python Polars
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants