Non-deterministic behaviour when using is_null() in LazyFrame
#14595
Comments
Can you please create a reproducible example? There is nothing we can do without one.
Like I said, unfortunately I couldn't get a reproducible example (see the post for a few things I tried). If you or anyone else has suggestions that might help me get one, I'm definitely willing to try.
Could you also show the output of |
Yes of course! So, I changed the loop a little bit to add the explanation:
print("shapes")
for _ in range(10):
    tmp2_plan = tmp.filter(pl.col("code").is_not_null())
    tmp1 = tmp.collect().filter(pl.col("code").is_not_null())
    tmp2 = tmp2_plan.collect()
    tmp3 = tmp.filter(pl.col("code_not_null")).collect()
    print(tmp2_plan.explain())
    print(tmp1.shape, tmp2.shape, tmp3.shape)
    print('=' * 100)
Here's the (long) output (note that the data changed very slightly, but the issue persists): Long output
The plan seems to be the same every time; I'm not sure what can be concluded from that.
Thanks @dcferreira, the query plans show that predicate pushdown is optimizing differently; there are 2 variations in the SELECTION:
This is also likely the cause of the differing outputs. But in fact predicates should not be getting pushed past the |
Good spotting! Thanks so much for the PR @nameexhaustion, that was super fast! Your fix/everything around the planning goes a bit over my head, but while trying to understand it I ran into something a bit weird. This code outputs 2 different results, with the only difference being that the optimizations are enabled/disabled:
ids_set = {
    '02d0927b-77ea-400f-8adc-22474e45d6d5',
    '03094785-91c7-4e98-9072-3336ff67c222',
    '031d9dfb-38c2-4229-92b3-d397b6d0313b',
    '033d8ebd-07ca-467a-8833-f5c23138746b',
    '0347c17f-dd73-43fc-969d-2e46b6406dea'
}
df_filtered = df.filter(pl.col("label_id").is_in(ids_set))
tmp = df_filtered.select('label_id', 'code', pl.col("code").is_not_null().alias("code_not_null"))
print('"original" dataframe')
print(tmp.explain(optimized=False))
print(tmp.collect(no_optimization=True))
print("#" * 100)
print(tmp.explain(optimized=True))
print(tmp.collect(no_optimization=False))
Long output
Is this covered in the test you added with |
With the existing release, I expect if you run with |
Checks
Reproducible example
Unfortunately I couldn't get a reproducible example without my data (though I tried quite hard!), but I am willing to spend some time on this if someone has an idea of how to get one.
df is a LazyFrame read from a delta table.
Outputs:
Log output
No response
Issue description
Filtering by pl.col().is_null() or pl.col().is_not_null() before collecting gives me a non-deterministic wrong result.
I really tried to get a completely reproducible example, but did not succeed.
Here's what I tried:
pl.scan_delta -> run the code above also 1000s of times
In both these cases, the results were consistently correct.
However, for the example in my data, something is clearly wrong.
Expected behavior
In the code snippet above, I'm filtering a lazyframe by null values in a column, and printing out the shape of the output.
I'm doing that in 3 different ways:
I expected all 3 of these to give the exact same result.
However, the filtering in nr 2 only works sometimes.
Installed versions