feat(python): Expr.forget() and assertions #16311

Closed
KDruzhkin opened this issue May 18, 2024 · 2 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

@KDruzhkin
Contributor

KDruzhkin commented May 18, 2024

Description

Assertions by themselves are simple. But fitting them into data processing pipelines is subtle.

TL;DR

Conceptually, I propose to separate:

  • making assertions about data,
  • dropping data, sending them to /dev/null, if you want to be poetic.

Technically, I propose to add several methods to Expr:

  • some trivial assertion methods, which return their input unchanged,
  • a special method for dropping/forgetting/ignoring expression results.

In the future, we may find ergonomic ways to combine those, but for now I think we should introduce them separately.

This feature should be unstable/experimental and python-only, until we explore the solution space.

Some context

Data pipelines

There are two scenarios with assertions:

  • create a value, check its properties, pass it on;
  • create a value, check its properties, forget it.

For concreteness, let's take a look at this code:

Checking and passing on:

import polars as pl

s0 = pl.Series("s", ["a", "bb", "ccc", "dddd"])
s1 = s0.str.len_chars()  # [1, 2, 3, 4]
s2 = s1 > 3              # [False, False, False, True]
assert s2.any()
s3 = s2.cast(pl.Int32)
s0 -- s1 -- s2 -- [check and pass on] -- s3 -- ... [the work continues]

Checking and forgetting:

s0 = pl.Series("s", ["a", "bb", "ccc", "dddd"])
s1 = s0.str.len_chars()  # [1, 2, 3, 4]
s2 = s1 > 3              # [False, False, False, True]
assert s2.any()
# now forget `s1` and `s2`
# go on with `s0`
s3 = s0.str.to_uppercase()
    s1 -- s2  [check and forget]
   / 
s0
   \
    s3 -- ... [the work continues]

Fluent interfaces

Syntactically, data pipelines are presented as fluent interfaces: a sequence of expressions/methods connected with the dot operator.

We could re-imagine one of the examples above as:

(
    pl.Series("s", ["a", "bb", "ccc", "dddd"])
    .str.len_chars()
    .gt(3)
    .assert_any()  # [placeholder name]
    .cast(pl.Int32)
)

(But what do we do with the second example?)

(Of course, it really gets interesting when we go from Series to DataFrames and expressions.)

The proposal

Expr.forget()

This method works for expressions of any type.
The result is also an expression which evaluates to nothing.

For example, pl.col(pl.Boolean) evaluates to nothing if the context has no boolean columns.
The result of .forget() evaluates to nothing unconditionally.

Some trivial assertion methods

They return ("pass on") the original value.

Two methods for columns of data type Boolean (based on .any() and .all()):

  • assert_any,
  • assert_all.

One method for non-columnar values of data type Boolean:

  • assert.

An example

When we have this:

df = pl.DataFrame(
    {
        "s": ["a", "bb", "ccc", "dddd"]
    }
)

we can write this:

(
    df.with_columns(   # [abuse of notation, as no columns are added]
        pl.col("s")
        .str.len_chars()
        .gt(3)
        .assert_any()  # [check and pass on]
        .forget()      # [and forget]
    )
)
@KDruzhkin KDruzhkin added the enhancement New feature or an improvement of an existing feature label May 18, 2024
@KDruzhkin KDruzhkin changed the title feat(python): Expr.forget() feat(python): Expr.forget() and assertions May 18, 2024
@sm-Fifteen

In #11064 (which you cross-linked to this issue, I wouldn't have noticed otherwise, so thank you), everyone seemed to be going for some sort of raise_when/expect/assert_every dataframe/series method which gets passed a list of expressions, possibly with assertion error messages, but is fully transparent to the data and can be removed without changing the result. That seemed like it would take advantage of the lazily evaluated nature of Polars expressions and fit both the lazy and eager flows well, while leaving the door open for the optimizer to just take them out or change the evaluation method with a simple flag.

Here, though, you're suggesting making assertions into expressions, and I'm not so sure about that. Wouldn't that mean you have to include "phantom expressions" in any subsequent select() or with_columns() statement for the data validation to happen and make sure to not forget it yourself? The name .forget() conveys the intended meaning well, but I'm not sure about the ergonomics, especially if you're including multiple data checks, possibly at various steps of your pipeline.

@KDruzhkin
Contributor Author

Today I learned about pipes: #14620 (comment).

They definitely solve my problem with assertions.
