Feature Proposal: Introducing pl.testing.raise_ in Polars
#11064
I've thought about this; the problem is that such a raise operator blocks most optimizations. For example, if you had a filter after this condition, we would no longer be able to reorder the plan and place the filter before the computation.
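To make the concern concrete, here is a minimal pure-Python sketch (not Polars internals; all names are illustrative) of why a raise/assert node acts as an optimization barrier: pushing a filter below the check changes which rows the check sees, so the two plans are no longer equivalent.

```python
# Toy illustration of why predicate pushdown cannot move a filter
# past an assert-style node: the check observes different rows
# depending on whether the filter runs before or after it.
rows = [{"a": -1, "keep": False}, {"a": 2, "keep": True}]

def check_positive(data):
    """Hypothetical assert node: passes only if every row has a > 0."""
    return all(r["a"] > 0 for r in data)

# Plan 1: the check runs before the filter -> it sees the bad row and fails.
fails_before = not check_positive(rows)

# Plan 2: the filter is pushed below the check -> the bad row is already
# removed, so the same check passes.
filtered = [r for r in rows if r["keep"]]
passes_after = check_positive(filtered)

assert fails_before and passes_after  # the two plans are not equivalent
```

This is why reordering around such a node would silently change which violations get reported.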
I'm actually not familiar with the optimization priorities. I was wondering whether it should be a method on LazyFrame or DataFrame instead of an expression. And what if we still prioritized the filter above the raise method? So it could be:

But I don't know whether that would work or improve the overall testing use case.
Nice! I recently had something very similar in mind ;) I was thinking about

    df.with_columns(
        # ... some calculations ...
    ).expect(
        pl.col('a') > 0,
        pl.col('b') < 100,
    ).with_columns(
        # ... some calculations ...
    ).expect(
        pl.col('c').null_count() == 0,
    ).with_columns(
        # ... some calculations ...
    )

instead of

    df1 = df.with_columns(
        # ... some calculations ...
    )
    assert (df1.get_column('a') <= 0).sum() == 0
    assert (df1.get_column('b') >= 100).sum() == 0
    df2 = df1.with_columns(
        # ... some calculations ...
    )
    assert df2.get_column('c').null_count() == 0
    df3 = df2.with_columns(
        # ... some calculations ...
    )

I was also thinking about a config option to specify how to handle assert/expect failures, e.g. "fail", "warn", or "ignore".
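A minimal pure-Python sketch of how such a config option could dispatch on failure handling. This is not a real Polars API; the names ASSERT_MODE and run_checks are hypothetical, standing in for a config-driven mode switch.

```python
import warnings

# Hypothetical sketch: a global mode (imagined as coming from a Polars
# config option) decides what happens when an expect-style check fails.
ASSERT_MODE = "warn"

def run_checks(violations: int, message: str, mode: str = ASSERT_MODE) -> None:
    """Dispatch a check result according to the configured failure mode."""
    if violations == 0 or mode == "ignore":
        # "ignore" drops the check entirely, restoring full-speed plans
        return
    if mode == "warn":
        warnings.warn(f"{message}: {violations} row(s) failed")
    elif mode == "fail":
        raise AssertionError(f"{message}: {violations} row(s) failed")
```

The appeal of a config switch is that the pipeline code itself never changes between testing and production; only the mode does.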
@JulianCologne

Immediate Evaluation vs Lazy Evaluation

Optimization Considerations
The optimizer would need to be aware that any subsequent operations might not be executed if the check fails.

Interaction with Other Operations
Say there's a sort operation after the check.

Performance Implications
Checking conditions on data can be computationally expensive, especially on large datasets. The optimizer would need to consider the cost of these checks when optimizing the query plan.

Consider the following example
I just want to simply check whether the columns I'm considering are in a One-Hot Encoding format.
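For the one-hot example, here is a minimal pure-Python sketch of the condition one would want such a check to express (each row has only 0/1 values and exactly one 1); the function name is illustrative, not a proposed API.

```python
def is_one_hot(rows):
    """Check that every row is a valid one-hot vector:
    all values are 0 or 1, and exactly one of them is 1."""
    return all(
        set(row) <= {0, 1} and sum(row) == 1
        for row in rows
    )

assert is_one_hot([(1, 0, 0), (0, 0, 1)])
assert not is_one_hot([(1, 1, 0), (0, 0, 1)])  # two hot values in row 0
```

Expressed as an in-plan assertion, this is exactly the kind of cheap-per-row but dataset-wide check whose cost the optimizer would have to weigh.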
@JulianCologne: Would your "fail" mode be more of a "fail_eager" or a "fail_lazy" mode? When running checks on your data like this, it's often useful to know just how much of it fails such sanity checks, rather than immediately stopping on the first bad row. I'm thinking something like
The optimisation barrier concern does sound reasonable, although most other such issues are usually just listed in the docs as suboptimal for most cases and to be used with care. Adding a similar warning to user errors/assertions would probably be good enough to mitigate that. Personally, I would go with a list of expressions that evaluate to booleans, similar to Julian's idea, where rows get tagged separately on whether they pass or fail each check, and the tally is reported as part of the exception raised if any of them have failed.

    lazy_df
        .operationA()
        .operationB()
        .assert_every(
            (pl.col("foo") > 0, "Foo must be positive"),
            (pl.col("my_list_cnt") == pl.col("my_list").list.len(), "Reported list length must match its actual length"),
            # What to do with None values is an open question
            (pl.when(pl.col("vehicle_type") == "car").then(pl.col("wheel_count") == 4), "All cars must have exactly 4 wheels"),
            ((pl.col("unitv_x") ** 2 + pl.col("unitv_y") ** 2 + pl.col("unitv_z") ** 2).is_between(0.99, 1.01), "Squared components of a unit vector must sum to 1."),
            (pl.col("sample_count").n_unique() == 1, "sample_count must be uniform for the entire dataframe"),
        )
        .collect()
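The semantics sketched above (evaluate every predicate, tally failures per check, raise one exception reporting all failing checks) can be prototyped in plain Python. All names here are hypothetical stand-ins, not a Polars API:

```python
# Pure-Python sketch of the proposed "tag and tally" semantics:
# each (predicate, message) pair is evaluated over all rows, failures
# are counted per check, and a single exception reports every tally.
def assert_every(rows, *checks):
    tallies = []
    for predicate, message in checks:
        # None results (e.g. a non-matching when/then branch) count as passing
        failed = sum(1 for row in rows if predicate(row) is False)
        if failed:
            tallies.append(f"{message}: {failed}/{len(rows)} rows failed")
    if tallies:
        raise AssertionError("; ".join(tallies))
    return rows

rows = [{"foo": 1}, {"foo": -2}, {"foo": 0}]
try:
    assert_every(rows, (lambda r: r["foo"] > 0, "Foo must be positive"))
except AssertionError as exc:
    assert "2/3 rows failed" in str(exc)
```

Reporting how many rows failed each check, rather than stopping at the first bad row, is the "fail_lazy" behaviour described above.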
@sm-Fifteen so "ignore" would just remove this branch entirely for full-speed optimization. You might use "fail"/"warn" in testing/staging and then switch the config to "ignore" in production if you only need speed (or keep "fail"/"warn" if your workflow is not stable enough). This way no code change is required other than the Polars config. One might also consider an additional "threshold" for "fail", specifying an absolute (e.g. 10_000) or percentage (e.g. 2%) failure rate that is allowed before stopping. All in all, I think we have a very similar idea of how this feature might work ;)
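The threshold idea can be sketched in a few lines of plain Python; the helper name and the int-vs-float convention for absolute vs fractional thresholds are assumptions for illustration only.

```python
def within_threshold(failed: int, total: int, threshold) -> bool:
    """Hypothetical helper: allow failures up to either an absolute
    count (int, e.g. 10_000) or a fraction of rows (float, e.g. 0.02
    for a 2% failure rate) before the "fail" mode actually raises."""
    if isinstance(threshold, float):
        return failed / total <= threshold
    return failed <= threshold

assert within_threshold(150, 10_000, 0.02)            # 1.5% <= 2%
assert not within_threshold(11_000, 1_000_000, 10_000)  # over the absolute cap
```

A check would then raise only when the tallied failures exceed the configured threshold.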
Description
Objective
Introduce a new expression, pl.testing.raise_ (the trailing underscore avoids clashing with Python's raise keyword), that can be used within Polars operations to raise errors based on specified conditions.
Optimization Strategy
Proposed Syntax
pl.testing.raise_(ErrorType, "Error Message")
Example
Imagine you're calculating a derived column based on two other columns. If one of the columns contains a zero and might cause a division error, you'd want to catch it.
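A minimal pure-Python sketch of the semantics this example describes: before computing the derived column, raise the given error type with the given message if the condition holds. This is an illustration of the intended behaviour, not the proposed Polars API.

```python
# Hypothetical stand-in for the proposed pl.testing.raise_(ErrorType, msg):
# raise the supplied error type when the guard condition is true.
def raise_(condition: bool, error_type: type, message: str) -> None:
    if condition:
        raise error_type(message)

numerators = [10, 20, 30]
denominators = [2, 0, 5]

try:
    raise_(any(d == 0 for d in denominators), ZeroDivisionError,
           "Column 'denominator' contains zeros")
    derived = [n / d for n, d in zip(numerators, denominators)]
except ZeroDivisionError as exc:
    assert "zeros" in str(exc)
```

In the proposed expression form, the same guard would live inside the query plan instead of as a separate eager check.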
I think this matches well with Polars' vision of being the lower-level library that other libraries can build on top of. Let me know what you think :)