-
Notifications
You must be signed in to change notification settings - Fork 595
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bug: merging selections combines filters in incorrect way #9058
Comments
Thanks for the report @NickCrews -- we'll probably disable merging filters across selects to avoid this happening. In the interim (and we won't release 9.0 until we sort this out), you can probably use |
The produced expression then the sql specific The minimal problem is the following: t.filter(_.votes != "*").filter(_.votes.cast("int64") > 0) gets reduced to t.filter(_.votes != "*", _.votes.cast("int64") > 0) and the latter fails due to the fallible cast operation. By not fusing the operations these two will behave differently depending on the data which might be considered a downside as well. This is not something we can make foolproof since the user may use the compound predicate directly The predicate which should be really used for this case is: t.filter((_.votes != "*").ifelse(_.votes.cast("int64") > 0), False) Or rather a new API allowing to chain ands more conveniently: t.filter((_.votes != "*").and_then(_.votes.cast("int64") > 0) |
Removing filters for this case removes assumptions, it doesn't add any. Are you saying something else? |
This can't be addressed statically (which means definitely not in any tool that generates code without knowing what the data are), since it's data dependent. Not fusing doesn't make that particular issue worse or better, but fusing definitely makes it worse. |
I kinda disagree with this, I think fusing makes it just as bad (not better not worse [semantically]). |
The workarounds aren't that clear IMO. It's not like we'd recognize the pattern and suggest an alternative. However, fusing makes the solution of "spell things out bit by bit" impossible to write. |
Correct, as it should. I don't think we need a new method here. An ibis So t.filter(_.x != "*").filter(_.x.cast("int64") > 0) should compile to something semantically identical to (all sql typos mine): SELECT * FROM (
SELECT * FROM t WHERE x <> '*'
) WHERE CAST(x AS int64) > 0 Whereas t.filter(_.x != "*", _.x.cast("int64") > 0) should be semantically identical to: SELECT * FROM t WHERE x <> '*" AND CAST(x as int64) > 0 These aren't semantically identical queries, and the latter may not work appropriately in this case since it's gating the We can fuse if we want to produce prettier sql, with the caveat that we don't want to change the semantics. In the case of an operation like |
I would prefer catching that my query is fallible early. Imagine that the nested query successfully executes then a new record gets inserted into the data source (not a parquet file, but a live database table) with |
Just checked and this has been the behaviour for earlier ibis versions as well: In [16]: t = ibis.read_parquet(
...: "https://github.com/NickCrews/election-data/releases/download/v2024-04-05_0/cleaned.parquet"
...: )
...: t = t.cache()
100% ▕████████████████████████████████████████████████████████████▏
In [17]: t.filter(_.votes != "*").filter(_.votes.cast("int64") > 0)
Out[17]:
r0 := DatabaseTable: ibis_cache_7s6eudf4vbfqzhs7dpm52nqafu
year int16
date date
state_po string
county_name string
county_fips string
jurisdiction_name string
jurisdiction_fips string
district string
office string
magnitude int64
special boolean
stage string
precinct string
writein boolean
candidate string
party_detailed string
mode string
votes string
readme_check boolean
county_fips2 string
jurisdiction_fips2 string
Selection[r0]
predicates:
r0.votes != '*'
Cast(r0.votes, to=int64) > 0
In [19]: print(t.filter(_.votes != "*").filter(_.votes.cast("int64") > 0).compile())
SELECT t0.year, t0.date, t0.state_po, t0.county_name, t0.county_fips, t0.jurisdiction_name, t0.jurisdiction_fips, t0.district, t0.office, t0.magnitude, t0.special, t0.stage, t0.precinct, t0.writein, t0.candidate, t0.party_detailed, t0.mode, t0.votes, t0.readme_check, t0.county_fips2, t0.jurisdiction_fips2
FROM ibis_cache_7s6eudf4vbfqzhs7dpm52nqafu AS t0
WHERE t0.votes != :param_1 AND CAST(t0.votes AS BIGINT) > CAST(:param_2 AS INTEGER) |
I don't think that matters too much, given that this is incorrect behavior given what the user has written. I'm still not understanding what the rationale is for keeping the merging given all the issues we have encountered so far. |
In [1]: import polars as pl
...:
...: lf = pl.LazyFrame(
...: {
...: "foo": [1, 2, 3],
...: "bar": [6, 7, 8],
...: "ham": ["a", "b", "c"],
...: }
...: )
...: hamint = pl.col("ham").cast(pl.Int64)
...: expr = lf.filter(pl.col("foo") > 1).filter(hamint > 2)
...:
...: print(expr.explain())
DF ["foo", "bar", "ham"]; PROJECT */3 COLUMNS; SELECTION: "[([(col(\"ham\").strict_cast(Int64)) > (2)]) & ([(col(\"foo\")) > (1)])]"
|
I'm not sure I understand, can you give a more concrete example? |
Consider def load():
...
t = t.distinct()
t = t.filter(t.votes != "*")
return t
# many lines
t = load()
t = t.filter(t.votes.cast(int) > 0)
|
Wouldn't it only succeed if the votes column contains valid integer strings or "*"? What I am trying to say that it is entirely data dependent and Do you have an example where the filters are not fallible end the merge alters the behaviour? |
Yes, that is the scenario I am considering here, the data only contains valid ints or "*". In that scenario:
|
The second part of your example looks fallible to me, so no surprise that I am getting a cast error since I cannot have any knowledge about what load does but I do know that t = load()
t = t.filter(t.votes.cast(int) > 0) Apparently this only appears to be an issue for duckdb. At least I am unable to reproduce the problem on other backends which suggest that our I would still like to have an example where the problem occurs without calling |
Reposting my comment from #9065 (comment) I took the weekend to think about some of the issues here and did a bit of research on the SQL standard specification of predicate short-circuiting.
For example, an engine can evaluate a fallible expression first, even if a subsequent expression with short circuiting would avoid the fallible expression evaluation failing. In practice, DuckDB merges a fallible cast with an infallible operation in the same way producing the exact same plan regardless of how the SQL is specified:
This means that even if Ibis decided to generate subqueries that map one-to-one to SQL selects it provides absolutely zero guarantee that generating the SQL in that way produces specific behavior. IMO this means that Ibis should be free to merge predicates, assuming our implementation of that process is correct. Now, I'm not trying to gloss over the fact that we have had a number of issues with select merging, and so I think we should provide a way to disable it if it can potentially help unblock users. I think the opt-in behavior would be kind of pointless, since people will never opt-in without lots of documentation and if we're going to go that route we should disable it. On the other hand, if we think this is long-term a thing we want enabled, then we definitely need feedback on it from users. The only way to get that feedback AFAICT is to enable this feature. That does mean we have to accept that the current implementation may not be 100% finished and will elicit bug reports. The opt-out behavior IMO seems like an acceptable balance here: people can disable select merging on an as-needed basis, and we can continuously evaluate whether keeping it around is causing overhead that isn't worth it. Disabling select merging it turns out is also not without issue: SQLite fails due to a query we generate being too large for the parser stack, and MS SQL loses the ability to specify order by in many cases as well as |
What happened?
I do a filter-mutate-filter, and the order needs to stay in that order to be correct, but ibis appears to jumble the order of these steps:
This gives the sql
I think we need to keep these as two separate relations. I think this is somewhat related to #9014 (comment)
What version of ibis are you using?
main
What backend(s) are you using, if any?
duckdb
Relevant log output
No response
Code of Conduct
The text was updated successfully, but these errors were encountered: