
Feature Proposal: Introducing pl.testing.raise_ in Polars #11064

Open
tmrusec opened this issue Sep 12, 2023 · 6 comments
Labels
A-api Area: changes to the public API enhancement New feature or an improvement of an existing feature

Comments


tmrusec commented Sep 12, 2023

Description

Objective

Introduce a new expression, pl.testing.raise_, that can be used within Polars operations to raise errors based on specified conditions.

Optimization Strategy

  • During the optimization phase, any pl.testing.raise condition would be prioritized.
  • If the condition for pl.testing.raise is met, Polars would halt the operation before it goes into the actual evaluation, ensuring computational resources are not wasted on a doomed process.

Proposed Syntax

pl.testing.raise_(ErrorType, "Error Message")

Example

Imagine you're calculating a derived column based on two other columns. If one of the columns has a zero and might cause a division error, you'd want to catch it.

col_name = "B"
df.with_columns(
    pl.when(pl.col(col_name) == 0)
    .then(pl.testing.raise_(ZeroDivisionError, f"Column '{col_name}' has a zero."))
    .otherwise(pl.col("A") / pl.col(col_name))
)

I think this matches well with Polars' vision of being the lower-level library that other libraries can build on top of. Let me know what you think :)

@tmrusec tmrusec added the enhancement New feature or an improvement of an existing feature label Sep 12, 2023
orlp (Collaborator) commented Sep 12, 2023

I've thought about this; the problem is that such a raise operator blocks most optimizations. For example, if you had a filter after this condition, we would no longer be able to reorder and place the filter before the computation.

tmrusec (Author) commented Sep 12, 2023

I'm actually not familiar with the optimization priority.

I was wondering whether it should be one of the methods of LazyFrame or DataFrame instead of an expression. And what if we still prioritized the filter above the raise method?

So it can be:

df = pl.LazyFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [1, 2, 0, 4, 5],
    "C": [1, 2, 0, 4, 5],
})

col_name = "B"

(
    df
    .raise_when(
        pl.col(col_name) == 0,
        ZeroDivisionError,
        f"Column '{col_name}' has a zero.",
    )
    .with_columns(
        pl.col("A") / pl.col(col_name)
    )
    .filter(pl.col("C") != 0)
    .collect()
)

But I don't know if it's going to work or improve the overall testing use case?

JulianCologne (Contributor) commented

Nice! I recently had something very similar in mind ;)

I was thinking about an assert or expect method on DataFrames and Series as follows:

df.with_columns(
    # ... some calculations ...
).expect(
    pl.col('a') > 0,
    pl.col('b') < 100,
).with_columns(
    # ... some calculations ...
).expect(
    pl.col('c').null_count() == 0,
).with_columns(
    # ... some calculations ...
)

Instead of

df1 = df.with_columns(
    # ... some calculations ...
)

assert (df1.get_column('a') <= 0).sum() == 0
assert (df1.get_column('b') >= 100).sum() == 0

df2 = df1.with_columns(
    # ... some calculations ...
)

assert df2.get_column('c').null_count() == 0

df3 = df2.with_columns(
    # ... some calculations ...
)

I was also thinking about a config option to specify how to handle assert/expect failures like
pl.Config.set_assert_mode('fail'|'warn'|'ignore')

  • fail: print/log problems and crash
  • warn: print/log problems but continue execution
  • ignore: completely ignore asserts (full optimization possible)

tmrusec (Author) commented Sep 13, 2023

@JulianCologne
Yes, and the purpose of this method is to utilize the query optimization. The position of the raise_when or expect method in the sequence of operations would be crucial. If placed early in a chain of operations, it could prevent unnecessary computations on data that would eventually trigger an exception. On the other hand, if placed at the end, it would act as a final check after all transformations. The optimizer would need to respect the position of raise_when to ensure that exceptions are raised at the expected times.

Immediate Evaluation vs Lazy Evaluation

If raise_when were to be executed immediately, it would break the laziness of the evaluation, as it would require an immediate check on the data. To coexist with lazy evaluation, raise_when would need to be integrated into the logical plan. It would represent an operation that, when the plan is executed, checks the data and raises an exception if the condition is met.

Optimization Considerations

The optimizer would need to be aware that any subsequent operations might not be executed if the raise_when condition is met. However, some optimizations might still be possible. For instance, if raise_when is checking a condition on a column that is later filtered out, the check can be moved after the filter operation in the optimized plan.

Interaction with Other Operations

Say there's a sort operation after raise_when, and the raise_when condition is met, the sort operation should never be executed. This might mean that raise_when acts as a "barrier" to certain optimizations, limiting the reordering of operations around it.

Performance Implications

Checking conditions on data can be computationally expensive, especially on large datasets. The optimizer would need to consider the cost of these checks when optimizing the query plan.

Consider the following example

I just want a simple way to check that the columns I'm considering are in a one-hot encoding format for this from_dummies function I made. If a column value is neither 1 nor 0, it should raise a ValueError. And I could do that as simply as using a raise_when operation.

def _coalesce_expr(col_value_pairs):
    return pl.coalesce(
        pl.when(pl.col(col) == 1).then(pl.lit(value))
        for col, value in col_value_pairs
    )

def from_dummies(
    df: pl.DataFrame, cols: list[str], separator: str = "_"
) -> pl.DataFrame:
    col_exprs: dict = {}

    for col in cols:
        name, value = col.rsplit(separator, maxsplit=1)
        col_exprs.setdefault(name, []).append((col, value))

    return (
        df
        .raise_when(
            pl.any_horizontal(pl.col(cols).ne(1).and_(pl.col(cols).ne(0))),
            ValueError,
            "Dummy DataFrame contains non-binary value(s)"
        )
        .select(
            pl.all().exclude(cols),
            *[
                _coalesce_expr(exprs).alias(name)
                for name, exprs in col_exprs.items()
            ],
        )
    )

sm-Fifteen commented Nov 2, 2023

I was also thinking about a config option to specify how to handle assert/expect failures like pl.Config.set_assert_mode('fail'|'warn'|'ignore')

  • fail: print/log problems and crash
  • warn: print/log problems but continue execution
  • ignore: completely ignore asserts (full optimization possible)

@JulianCologne: Would your "fail" mode be more of a "fail_eager" or a "fail_lazy" mode? When running checks on your data like this, it's often useful to know just how much of it fails such sanity checks, rather than immediately stopping on the first bad row. I'm thinking something like 500/123456 lines failed the assertion "Column B is an SQL count(*) and can never be non-zero".

I've thought about this, the problem is that such a raise operator blocks most optimizations. For example, if you had a filter after this condition we would not be able to reorder anymore and place the filter before the computation.

The optimisation barrier concern does sound reasonable, although most other such issues are usually just listed in the docs as suboptimal for most cases and to be used with care. Adding a similar warning to user errors/assertions would probably be good enough to mitigate that.


Personally I would go with a list of expressions that evaluate to booleans, similar to Julian's idea, where rows get tagged separately on whether they pass or fail each check, and the tally is reported as part of the exception raised if any of them have failed.

(
    lazy_df
    .operationA()
    .operationB()
    .assert_every(
        (pl.col("foo") > 0, "Foo must be positive"),
        (pl.col("my_list_cnt") == pl.col("my_list").list.len(), "Reported list length must match its actual length"),
        # What to do with None values is an open question
        (pl.when(pl.col("vehicle_type") == "car").then(pl.col("wheel_count") == 4), "All cars must have exactly 4 wheels"),
        ((pl.col("unitv_x") ** 2 + pl.col("unitv_y") ** 2 + pl.col("unitv_z") ** 2).is_between(0.99, 1.01), "Components of unit vector must add up to length 1."),
        (pl.col("sample_count").n_unique() == 1, "sample_count must be uniform for the entire dataframe")
    )
    .collect()
)

JulianCologne (Contributor) commented

@sm-Fifteen so "ignore" would just remove this branch entirely for full-speed optimization. You might use "fail"/"warn" in testing/staging and then switch the config to "ignore" in production if you only require speed (you might also keep "fail"/"warn" if your workflow is not stable enough). This way no code change is required other than the Polars config.

One might also consider an additional "threshold" for "fail", specifying a total (e.g. 10_000) or percentage (e.g. 2%) failure rate that is allowed before stopping.

All in all I think we have a very similar idea of how this feature might work ;)

Development

No branches or pull requests

5 participants