
Feature Proposal: Introducing pl.testing.raise_ in Polars #11064

Open
tmrusec opened this issue Sep 12, 2023 · 6 comments
Labels
A-api Area: changes to the public API enhancement New feature or an improvement of an existing feature

Comments


tmrusec commented Sep 12, 2023

Description

Objective

Introduce a new expression, pl.testing.raise_, that can be used within Polars operations to raise errors based on specified conditions.

Optimization Strategy

  • During the optimization phase, any pl.testing.raise condition would be prioritized.
  • If the condition for pl.testing.raise is met, Polars would halt the operation before it goes into the actual evaluation, ensuring computational resources are not wasted on a doomed process.

Proposed Syntax

pl.testing.raise_(ErrorType, "Error Message")

Example

Imagine you're calculating a derived column based on two other columns. If one of the columns has a zero and might cause a division error, you'd want to catch it.

col_name = "B"
df.with_columns(
    pl.when(pl.col(col_name) == 0)
    .then(pl.testing.raise_(ZeroDivisionError, f"Column '{col_name}' has a zero."))
    .otherwise(pl.col("A") / pl.col(col_name))
)

I think this matches well with Polars' vision of being the lower-level library that other libraries can build on top of. Let me know what you think :)

@tmrusec tmrusec added the enhancement New feature or an improvement of an existing feature label Sep 12, 2023
orlp (Collaborator) commented Sep 12, 2023

I've thought about this; the problem is that such a raise operator blocks most optimizations. For example, if you had a filter after this condition, we would no longer be able to reorder and place the filter before the computation.

tmrusec (Author) commented Sep 12, 2023

I'm actually not familiar with the optimization priority.

I was wondering whether it should be one of the methods of LazyFrame or DataFrame instead of an expression. And what if we still prioritized the filter above the raise method?

So it can be:

df = pl.LazyFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [1, 2, 0, 4, 5],
    "C": [1, 2, 0, 4, 5],
})

col_name = "B"

(
    df
    .raise_when(
        pl.col(col_name) == 0,
        ZeroDivisionError,
        f"Column '{col_name}' has a zero.",
    )
    .with_columns(
        pl.col("A") / pl.col(col_name)
    )
    .filter(pl.col("C") != 0)
    .collect()
)

But I don't know if it's going to work or improve the overall testing use case?

JulianCologne (Contributor) commented

Nice! I recently had something very similar in mind ;)

I was thinking about an assert or expect method on DataFrames and Series as follows:

df.with_columns(
    # ... some calculations ...
).expect(
    pl.col('a') > 0,
    pl.col('b') < 100,
).with_columns(
    # ... some calculations ...
).expect(
    pl.col('c').null_count() == 0,
).with_columns(
    # ... some calculations ...
)

Instead of

df1 = df.with_columns(
    # ... some calculations ...
)

assert (df1.get_column('a') <= 0).sum() == 0
assert (df1.get_column('b') >= 100).sum() == 0

df2 = df1.with_columns(
    # ... some calculations ...
)

assert df2.get_column('c').null_count() == 0

df3 = df2.with_columns(
    # ... some calculations ...
)

I was also thinking about a config option to specify how to handle assert/expect failures like
pl.Config.set_assert_mode('fail'|'warn'|'ignore')

  • fail: print/log problems and crash
  • warn: print/log problems but continue execution
  • ignore: completely ignore asserts (full optimization possible)

tmrusec (Author) commented Sep 13, 2023

@JulianCologne
Yes, and the purpose of this method is to utilize the query optimization. The position of the raise_when or expect method in the sequence of operations would be crucial. If placed early in a chain of operations, it could prevent unnecessary computations on data that would eventually trigger an exception. On the other hand, if placed at the end, it would act as a final check after all transformations. The optimizer would need to respect the position of raise_when to ensure that exceptions are raised at the expected times.

Immediate Evaluation vs Lazy Evaluation

If raise_when were to be executed immediately, it would break the laziness of the evaluation, as it would require an immediate check on the data. To coexist with lazy evaluation, raise_when would need to be integrated into the logical plan. It would represent an operation that, when the plan is executed, checks the data and raises an exception if the condition is met.

Optimization Considerations

The optimizer would need to be aware that any subsequent operations might not be executed if the raise_when condition is met. However, some optimizations might still be possible. For instance, if raise_when is checking a condition on a column that is later filtered out, the check can be moved after the filter operation in the optimized plan.

Interaction with Other Operations

Say there's a sort operation after raise_when, and the raise_when condition is met, the sort operation should never be executed. This might mean that raise_when acts as a "barrier" to certain optimizations, limiting the reordering of operations around it.

Performance Implications

Checking conditions on data can be computationally expensive, especially on large datasets. The optimizer would need to consider the cost of these checks when optimizing the query plan.

Consider the following example

I just want a simple way to check that the columns I'm considering are in a one-hot encoding format for this from_dummies function I made. If a column value is neither 1 nor 0, it should raise a ValueError. And I could do that as simply as using a raise_when operation.

def _coalesce_expr(col_value_pairs):
    return pl.coalesce(
        pl.when(pl.col(col) == 1).then(pl.lit(value))
        for col, value in col_value_pairs
    )

def from_dummies(
    df: pl.DataFrame, cols: list[str], separator: str = "_"
) -> pl.DataFrame:
    col_exprs: dict = {}

    for col in cols:
        name, value = col.rsplit(separator, maxsplit=1)
        col_exprs.setdefault(name, []).append((col, value))

    return (
        df
        .raise_when(
            pl.any_horizontal(pl.col(cols).ne(1).and_(pl.col(cols).ne(0))),
            ValueError,
            "Dummy DataFrame contains non-binary value(s)"
        )
        .select(
            pl.all().exclude(cols),
            *[
                _coalesce_expr(exprs).alias(name)
                for name, exprs in col_exprs.items()
            ],
        )
    )

sm-Fifteen commented Nov 2, 2023

I was also thinking about a config option to specify how to handle assert/expect failures like pl.Config.set_assert_mode('fail'|'warn'|'ignore')

  • fail: print/log problems and crash
  • warn: print/log problems but continue execution
  • ignore: completely ignore asserts (full optimization possible)

@JulianCologne: Would your "fail" mode be more of a "fail_eager" or a "fail_lazy" mode? When running checks on your data like this, it's often useful to know just how much of it fails such sanity checks, rather than immediately stopping on the first bad row. I'm thinking something like 500/123456 lines failed the assertion "Column B is an SQL count(*) and can never be non-zero".

I've thought about this, the problem is that such a raise operator blocks most optimizations. For example, if you had a filter after this condition we would not be able to reorder anymore and place the filter before the computation.

The optimisation barrier concern does sound reasonable, although most other such issues are usually just listed in the docs as suboptimal for most cases and to be used with care. Adding a similar warning to user errors/assertions would probably be good enough to mitigate that.


Personally I would go with a list of expressions that evaluate to booleans, similar to Julian's idea, where rows get tagged separately on whether they pass or fail each check, and the tally is reported as part of the exception raised if any of them have failed.

(
    lazy_df
    .operationA()
    .operationB()
    .assert_every(
        (pl.col("foo") > 0, "Foo must be positive"),
        (pl.col("my_list_cnt") == pl.col("my_list").list.len(), "Reported list length must match its actual length"),
        # What to do with None values is an open question
        (pl.when(pl.col("vehicle_type") == "car").then(pl.col("wheel_count") == 4), "All cars must have exactly 4 wheels"),
        ((pl.col("unitv_x") ** 2 + pl.col("unitv_y") ** 2 + pl.col("unitv_z") ** 2).is_between(0.99, 1.01), "Components of unit vector must add up to length 1."),
        (pl.col("sample_count").n_unique() == 1, "sample_count must be uniform for the entire dataframe")
    )
    .collect()
)

JulianCologne (Contributor) commented

@sm-Fifteen so "ignore" would just remove this branch entirely for full-speed optimization. You might use "fail"/"warn" in testing/staging and then switch the config to "ignore" in production if you only require speed (you might also keep "fail"/"warn" if your workflow is not stable enough). This way no code change is required other than the Polars config.

One might also consider an additional "threshold" for "fail", specifying a total (e.g. 10_000) or percentage (e.g. 2%) failure rate that is allowed before stopping.

All in all I think we have a very similar idea of how this feature might work ;)

Development

No branches or pull requests

5 participants