bug: merging selections combines filters in incorrect way #9058

NickCrews · 2024-04-25T23:26:02Z

What happened?

I do a filter-mutate-filter, and the order needs to stay in that order to be correct, but ibis appears to jumble the order of these steps:

import ibis
from ibis import _

ibis.options.interactive = True

t = ibis.read_parquet(
    "https://github.com/NickCrews/election-data/releases/download/v2024-04-05_0/cleaned.parquet"
)
t = t.cache()
t = t.filter(_.votes != "*").mutate(votes=_.votes.cast("int64")).filter(_.votes > 0)
# occasionally, but not always, I get
# ConversionException: Conversion Error: Could not convert string '*' to INT64
t.votes.mean()

This gives the sql

SELECT
  AVG("t1"."votes") AS "Mean(votes)"
FROM (
  SELECT
    "t0"."year",
    "t0"."date",
    "t0"."state_po",
    "t0"."county_name",
    "t0"."county_fips",
    "t0"."jurisdiction_name",
    "t0"."jurisdiction_fips",
    "t0"."district",
    "t0"."office",
    "t0"."magnitude",
    "t0"."special",
    "t0"."stage",
    "t0"."precinct",
    "t0"."writein",
    "t0"."candidate",
    "t0"."party_detailed",
    "t0"."mode",
    CAST("t0"."votes" AS BIGINT) AS "votes",
    "t0"."readme_check",
    "t0"."county_fips2",
    "t0"."jurisdiction_fips2"
  FROM "ibis_cache_hys22fcdbnbmpdoasprivdb3pm" AS "t0"
  WHERE
    "t0"."votes" <> '*' AND CAST("t0"."votes" AS BIGINT) > CAST(0 AS TINYINT)
) AS "t1"

I think we need to keep these as two separate relations. I think this is somewhat related to #9014 (comment)

What version of ibis are you using?

main

What backend(s) are you using, if any?

duckdb

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

gforsyth · 2024-04-26T17:59:25Z

Thanks for the report @NickCrews -- we'll probably disable merging filters across selects to avoid this happening. In the interim (and we won't release 9.0 until we sort this out), you can probably use try_cast to handle this without the failure mode.

kszucs · 2024-04-26T19:12:53Z

The produced expression then the sql specific Select operations looks correct to me. I understand that this data-dependent case works if the filters are not fused, but ibis expressions shouldn't have assumptions about the underlying data.

The minimal problem is the following:

t.filter(_.votes != "*").filter(_.votes.cast("int64") > 0)

gets reduced to

t.filter(_.votes != "*", _.votes.cast("int64") > 0)

and the latter fails due to the fallible cast operation.

By not fusing the operations these two will behave differently depending on the data which might be considered a downside as well. This is not something we can make foolproof since the user may use the compound predicate directly t.filter(_.votes != "*", _.votes.cast("int64") > 0).

The predicate which should be really used for this case is:

t.filter((_.votes != "*").ifelse(_.votes.cast("int64") > 0), False)

Or rather a new API allowing to chain ands more conveniently:

t.filter((_.votes != "*").and_then(_.votes.cast("int64") > 0)

cpcloud · 2024-04-26T19:20:18Z

I understand that this data-dependent case works if the filters are not fused, but ibis expressions shouldn't have assumptions about the underlying data.

Removing filters for this case removes assumptions, it doesn't add any. Are you saying something else?

cpcloud · 2024-04-26T19:22:50Z

By not fusing the operations these two will behave differently depending on the data which might be considered a downside as well.

This can't be addressed statically (which means definitely not in any tool that generates code without knowing what the data are), since it's data dependent. Not fusing doesn't make that particular issue worse or better, but fusing definitely makes it worse.

kszucs · 2024-04-26T19:24:43Z

but fusing definitely makes it worse

I kinda disagree with this, I think fusing makes it just as bad (not better not worse [semantically]).
Additionally there are clear workarounds to either use .ifelse() or .try_cast().

cpcloud · 2024-04-26T19:27:55Z

The workarounds aren't that clear IMO. It's not like we'd recognize the pattern and suggest an alternative. However, fusing makes the solution of "spell things out bit by bit" impossible to write.

jcrist · 2024-04-26T19:28:58Z

and the latter fails due to the fallible cast operation.

Correct, as it should. I don't think we need a new method here. An ibis .filter call should semantically appear to the user to run as a subquery with a separate WHERE clause. The backend is free to fuse steps as needed, but we need to ensure the semantically the results are the same.

So

t.filter(_.x != "*").filter(_.x.cast("int64") > 0)

should compile to something semantically identical to (all sql typos mine):

SELECT * FROM (
  SELECT * FROM t WHERE x <> '*'
) WHERE CAST(x AS int64) > 0

Whereas

t.filter(_.x != "*", _.x.cast("int64") > 0)

should be semantically identical to:

SELECT * FROM t WHERE x <> '*" AND CAST(x as int64) > 0

These aren't semantically identical queries, and the latter may not work appropriately in this case since it's gating the CAST operation on an AND in the same filter (and backends provide no guarantee on AND shortcircuiting).

We can fuse if we want to produce prettier sql, with the caveat that we don't want to change the semantics. In the case of an operation like CAST which can error we can't fuse filters like this. Enumerating the cases where we can fuse properly seems tricky to do and probably not something we should introduce in 9.0 given all the recent bugs. I agree with @cpcloud here - we should just disable fusing to avoid this issue.

kszucs · 2024-04-26T19:50:42Z

I would prefer catching that my query is fallible early.

Imagine that the nested query successfully executes then a new record gets inserted into the data source (not a parquet file, but a live database table) with _.x = "uncastable", the ibis query will fail the same way just later.

kszucs · 2024-04-26T21:46:46Z

Just checked and this has been the behaviour for earlier ibis versions as well:

In [16]: t = ibis.read_parquet(
    ...:     "https://github.com/NickCrews/election-data/releases/download/v2024-04-05_0/cleaned.parquet"
    ...: )
    ...: t = t.cache()
100% ▕████████████████████████████████████████████████████████████▏

In [17]: t.filter(_.votes != "*").filter(_.votes.cast("int64") > 0)
Out[17]:
r0 := DatabaseTable: ibis_cache_7s6eudf4vbfqzhs7dpm52nqafu
  year               int16
  date               date
  state_po           string
  county_name        string
  county_fips        string
  jurisdiction_name  string
  jurisdiction_fips  string
  district           string
  office             string
  magnitude          int64
  special            boolean
  stage              string
  precinct           string
  writein            boolean
  candidate          string
  party_detailed     string
  mode               string
  votes              string
  readme_check       boolean
  county_fips2       string
  jurisdiction_fips2 string

Selection[r0]
  predicates:
    r0.votes != '*'
    Cast(r0.votes, to=int64) > 0

In [19]: print(t.filter(_.votes != "*").filter(_.votes.cast("int64") > 0).compile())
SELECT t0.year, t0.date, t0.state_po, t0.county_name, t0.county_fips, t0.jurisdiction_name, t0.jurisdiction_fips, t0.district, t0.office, t0.magnitude, t0.special, t0.stage, t0.precinct, t0.writein, t0.candidate, t0.party_detailed, t0.mode, t0.votes, t0.readme_check, t0.county_fips2, t0.jurisdiction_fips2
FROM ibis_cache_7s6eudf4vbfqzhs7dpm52nqafu AS t0
WHERE t0.votes != :param_1 AND CAST(t0.votes AS BIGINT) > CAST(:param_2 AS INTEGER)

cpcloud · 2024-04-26T22:07:21Z

I don't think that matters too much, given that this is incorrect behavior given what the user has written.

I'm still not understanding what the rationale is for keeping the merging given all the issues we have encountered so far.

kszucs · 2024-04-27T13:29:03Z

I still think that merging is correct. For example polars does the same thing (and I assume other systems as well):

In [1]: import polars as pl
   ...:
   ...: lf = pl.LazyFrame(
   ...:     {
   ...:         "foo": [1, 2, 3],
   ...:         "bar": [6, 7, 8],
   ...:         "ham": ["a", "b", "c"],
   ...:     }
   ...: )
   ...: hamint = pl.col("ham").cast(pl.Int64)
   ...: expr = lf.filter(pl.col("foo") > 1).filter(hamint > 2)
   ...:
   ...: print(expr.explain())
DF ["foo", "bar", "ham"]; PROJECT */3 COLUMNS; SELECTION: "[([(col(\"ham\").strict_cast(Int64)) > (2)]) & ([(col(\"foo\")) > (1)])]"

Even if we consider it as an incorrect functionality, it is not a regression.
I tried disabling the merge in refactor(sql): make select merging opt-in #9066 and there are new issues arising from it, especially with mssql, so we need to implement workaround rewrites for those cases.

NickCrews · 2024-04-27T15:49:23Z

_.x = "uncastable"

I'm not sure I understand, can you give a more concrete example?

NickCrews · 2024-04-27T16:09:31Z

Consider

def load():
    ...
    t = t.distinct()
    t = t.filter(t.votes != "*")
    return t

# many lines

t = load()
t = t.filter(t.votes.cast(int) > 0)

If you look at each part of this program locally, it appears correct. It only breaks if these two distant parts of the program interact in an unlucky way.
If you switched the order of the .distinct() to be the last statement in load(), it would succeed, even though semantically that yields the same data out of load(). Correctness hinging upon an implementation detail like this smells bad.

kszucs · 2024-04-27T18:08:47Z

If you switched the order of the .distinct() to be the last statement in load(), it would succeed

Wouldn't it only succeed if the votes column contains valid integer strings or "*"? What I am trying to say that it is entirely data dependent and t.votes.cast(int) is fallible hence the query may or may not fail.

Do you have an example where the filters are not fallible end the merge alters the behaviour?

NickCrews · 2024-04-27T18:54:46Z

Wouldn't it only succeed if the votes column contains valid integer strings or "*"?

Yes, that is the scenario I am considering here, the data only contains valid ints or "*". In that scenario:

t.filter(_.votes != "*").distinct().filter(_.votes.cast(int) > 0): This SHOULD never error, and in the current implementation, it never errors.
t.distinct().filter(_.votes != "*").filter(_.votes.cast(int) > 0): again, this SHOULD never error (per my above argument of "things that look locally correct should be globally correct" argument, and phillip's "spell things out bit by bit" argument), but in the current implementation, it sometimes errors.

kszucs · 2024-04-29T07:22:51Z

things that look locally correct should be globally correct

The second part of your example looks fallible to me, so no surprise that I am getting a cast error since I cannot have any knowledge about what load does but I do know that t.votes is a string column.

t = load()
t = t.filter(t.votes.cast(int) > 0)

Apparently this only appears to be an issue for duckdb. At least I am unable to reproduce the problem on other backends which suggest that our .cast() behaviour if fairly inconsistent.

I would still like to have an example where the problem occurs without calling .cast().

cpcloud · 2024-04-29T14:01:53Z

Reposting my comment from #9065 (comment)

I took the weekend to think about some of the issues here and did a bit of research on the SQL standard specification of predicate short-circuiting.

According to this SO Q&A the SQL standard does not specify the evaluation order of predicates. That means that an engine is free to reorder (or not) predicates in any way AFAICT, including if it would change the semantics of the query.

For example, an engine can evaluate a fallible expression first, even if a subsequent expression with short circuiting would avoid the fallible expression evaluation failing.

In practice, DuckDB merges a fallible cast with an infallible operation in the same way producing the exact same plan regardless of how the SQL is specified:

D explain select * from (select * from t where cast(x as int) != 0) where x <> '*';

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│ ((x != '*') AND (CAST(x AS│
│       INTEGER) != 0))     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 1           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         SEQ_SCAN          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             t             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             x             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 5           │
└───────────────────────────┘
D explain select * from (select * from t where x <> '*') where cast(x as int) != 0;

┌─────────────────────────────┐
│┌───────────────────────────┐│
││       Physical Plan       ││
│└───────────────────────────┘│
└─────────────────────────────┘
┌───────────────────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│ ((x != '*') AND (CAST(x AS│
│       INTEGER) != 0))     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 1           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         SEQ_SCAN          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             t             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             x             │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 5           │
└───────────────────────────┘

This means that even if Ibis decided to generate subqueries that map one-to-one to SQL selects it provides absolutely zero guarantee that generating the SQL in that way produces specific behavior.

IMO this means that Ibis should be free to merge predicates, assuming our implementation of that process is correct.

Now, I'm not trying to gloss over the fact that we have had a number of issues with select merging, and so I think we should provide a way to disable it if it can potentially help unblock users.

I think the opt-in behavior would be kind of pointless, since people will never opt-in without lots of documentation and if we're going to go that route we should disable it.

On the other hand, if we think this is long-term a thing we want enabled, then we definitely need feedback on it from users. The only way to get that feedback AFAICT is to enable this feature.

That does mean we have to accept that the current implementation may not be 100% finished and will elicit bug reports.

The opt-out behavior IMO seems like an acceptable balance here: people can disable select merging on an as-needed basis, and we can continuously evaluate whether keeping it around is causing overhead that isn't worth it.

Disabling select merging it turns out is also not without issue: SQLite fails due to a query we generate being too large for the parser stack, and MS SQL loses the ability to specify order by in many cases as well as LIKE functionality in certain positions.

Add a global option to enable or disable merging select statements where possible. This can serve as an escape route if the optimization fails. It is currently enabled by default, we may change to make it opt-in. xref #9064 and #9058 Closes #9058.

NickCrews added the bug Incorrect behavior inside of ibis label Apr 25, 2024

cpcloud added this to the 9.0 milestone Apr 26, 2024

kszucs mentioned this issue Apr 26, 2024

fix(sql): don't fuse subsequent filter operations #9064

Closed

kszucs mentioned this issue Apr 26, 2024

feat(sql): add option to enable/disable select merging #9065

Merged

cpcloud closed this as completed in #9065 Apr 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: merging selections combines filters in incorrect way #9058

bug: merging selections combines filters in incorrect way #9058

NickCrews commented Apr 25, 2024

gforsyth commented Apr 26, 2024

kszucs commented Apr 26, 2024

cpcloud commented Apr 26, 2024

cpcloud commented Apr 26, 2024

kszucs commented Apr 26, 2024 •

edited

Loading

cpcloud commented Apr 26, 2024 •

edited

Loading

jcrist commented Apr 26, 2024 •

edited

Loading

kszucs commented Apr 26, 2024

kszucs commented Apr 26, 2024

cpcloud commented Apr 26, 2024

kszucs commented Apr 27, 2024

NickCrews commented Apr 27, 2024

NickCrews commented Apr 27, 2024

kszucs commented Apr 27, 2024 •

edited

Loading

NickCrews commented Apr 27, 2024

kszucs commented Apr 29, 2024 •

edited

Loading

cpcloud commented Apr 29, 2024

bug: merging selections combines filters in incorrect way #9058

bug: merging selections combines filters in incorrect way #9058

Comments

NickCrews commented Apr 25, 2024

What happened?

What version of ibis are you using?

What backend(s) are you using, if any?

Relevant log output

Code of Conduct

gforsyth commented Apr 26, 2024

kszucs commented Apr 26, 2024

cpcloud commented Apr 26, 2024

cpcloud commented Apr 26, 2024

kszucs commented Apr 26, 2024 • edited Loading

cpcloud commented Apr 26, 2024 • edited Loading

jcrist commented Apr 26, 2024 • edited Loading

kszucs commented Apr 26, 2024

kszucs commented Apr 26, 2024

cpcloud commented Apr 26, 2024

kszucs commented Apr 27, 2024

NickCrews commented Apr 27, 2024

NickCrews commented Apr 27, 2024

kszucs commented Apr 27, 2024 • edited Loading

NickCrews commented Apr 27, 2024

kszucs commented Apr 29, 2024 • edited Loading

cpcloud commented Apr 29, 2024

kszucs commented Apr 26, 2024 •

edited

Loading

cpcloud commented Apr 26, 2024 •

edited

Loading

jcrist commented Apr 26, 2024 •

edited

Loading

kszucs commented Apr 27, 2024 •

edited

Loading

kszucs commented Apr 29, 2024 •

edited

Loading