bug: inlining expressions leads to wrong results for non-pure functions #8921

NickCrews · 2024-04-09T21:27:57Z

What happened?

See this:

import ibis
import random

ibis.options.interactive = True


@ibis.udf.scalar.python(side_effects=True)
def my_random(x: float) -> float:
    return random.random()


t = ibis.memtable({"x": [1, 2, 3, 4, 5]})
# this happens with both the builtin random, and UDFs:
# t = t.mutate(y=ibis.random())
t = t.mutate(y=my_random(t.x))
t = t.mutate(size=(t.y > 0.5).ifelse("big", "small"))
t
# ┏━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┓
# ┃ x     ┃ y        ┃ size   ┃
# ┡━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
# │ int64 │ float64  │ string │
# ├───────┼──────────┼────────┤
# │     1 │ 0.208385 │ big    │
# │     2 │ 0.566183 │ big    │
# │     3 │ 0.836448 │ small  │
# │     4 │ 0.530235 │ small  │
# │     5 │ 0.526687 │ small  │
# └───────┴──────────┴────────┘

Notice that the size column is sometimes incorrect. This is because size is getting calculated from a totally uncorrelated RANDOM() in the generated SQL. ibis.to_sql(t) shows:

SELECT
  "t0"."x",
  RANDOM() AS "y",
  CASE WHEN RANDOM() > CAST(0.5 AS DOUBLE) THEN 'big' ELSE 'small' END AS "size"
FROM "ibis_pandas_memtable_m5jukuqv2rcrhkpg2vsdu6k5fe" AS "t0"

For this to work as expected, there can only be a single RANDOM() in the generated SQL. I think that the agressive inlining that we do to merge selects/mutates together is over-aggressive. After a short talk with @cpcloud, this inlining is only correct when the entire expression is pure. Either we need to somehow know when an expression is pure (most are, except for (some) UDFs, random(), maybe now(), and maybe others I'm not thinking of?), or just be conservative and only do this inlining when the subsequent selects/mutates are just reselecting/renaming columns with no other changes.

What version of ibis are you using?

main

What backend(s) are you using, if any?

I'm not sure if this is present on all backends, but I think so.

Relevant log output

No response

Code of Conduct

I agree to follow this project's Code of Conduct

The text was updated successfully, but these errors were encountered:

kszucs · 2024-04-09T21:30:56Z

We can add these impure types to https://github.com/ibis-project/ibis/blob/main/ibis/backends/sql/rewrites.py#L155 in order to prevent merging those selects.

NickCrews · 2024-04-09T21:38:13Z

That would work, if we are sure we can get all the types. What about ops.ArrayDistinct though? In duckdb that is an impure function, because the order of the result is not guaranteed. But in other backends it might be a pure function. Do we just have to be conservative and call that a blocking op?

NickCrews · 2024-04-12T00:23:59Z

ibis.uuid() is another impure expression to worry about.

What do you think is the counter-argument against only merging on re-selects? Is it just the uglier/more verbose SQL? Unless we have evidence that it is slower, I would want to lean towards the less-likely-to-be-incorrect route, even if the generated SQL is uglier for human consumption.

kszucs · 2024-04-13T09:28:00Z

The problem doesn't just occur in case of merging Select nodes. Consider the following case:

In [1]: import ibis

In [2]: uid = ibis.uuid()

In [3]: t = ibis.table({"a": "int", "b": "string"})

In [4]: t1 = t.select(t, uid)

In [7]: t1 = t.select(t, uid=uid)

In [8]: t2 = t1.select(uid)

In [9]: t2
Out[9]:
r0 := UnboundTable: unbound_table_0
  a int64
  b string

r1 := Project[r0]
  a:   r0.a
  b:   r0.b
  uid: RandomUUID()

Project[r1]
  RandomUUID(): r1.uid

In [10]: t3 = t1.select(ibis.uuid())

In [11]: t3
Out[11]:
r0 := UnboundTable: unbound_table_0
  a int64
  b string

r1 := Project[r0]
  a:   r0.a
  b:   r0.b
  uid: RandomUUID()

Project[r1]
  RandomUUID(): r1.uid

This "rewrite" happens at the API level as well during value dereferencing which we certainly would like to keep. In the first case we use the same variable so we mean the same value whereas in the latter we create a new uuid assuming that the user intention is to create a new one.

I think this issue can be resolved by better modelling of inpure expressions like uuid() and random() and making them actually unique by a virtual field, then we would need to mark those expressions solving both the dereferencing and Select merge cases:

class Unique(Value):
    _counter = itertools.count()
    uid: Optional[int] = None

    def __init__(self, uid, **kwargs):
        if uid is None:
            uid = next(self._counter)
        super().__init__(uid=uid, **kwargs)


class RandomUUID(Unique):
    dtype = dt.uuid


class RandomScalar(Unique):
    dtype = dt.float64

cpcloud · 2024-04-13T10:22:59Z

What's the issue with disabling the merging? Otherwise we'll be playing whack-a-mole with discovering impure functions.

kszucs · 2024-04-14T01:54:51Z

Disabling merging would solve the original problem in the issue but wouldn't solve the dereferencing one shown by me. My proposed solution would solve both.

NickCrews · 2024-04-18T03:23:19Z

Sorry I went on vacation for a few days, therefore radio silence.

I'm hesitant that the linked PR wasn't the right solution:

still doesn't address the ops.ArrayDistinct point above
still doesn't address the UDF point above
More significantly, the added test case isn't actually what I would expect. There is some "spooky action at a distance" happening here: as soon as I do t.mutate(y=impure), now all of a sudden impure has become tainted, it is now referencing the first table that it is bound to? I find that very surprising.

If I wanted y and z to be correlated, I would have referenced the parent column directly:

    impure = func()
    t = ibis.table({"x": "int64"}, name="t")
    t1 = t.mutate(y=impure)
    t2 = t1.mutate(z=t1.y.cast("string")) 
    expected = ops.Project(
        parent=t1,
        values={"x": t1.x, "y": t1.y, "z": t1.y.cast("string")},
    )

If I don't reference the parent table in subsequent selects, then I expect the columns to be uncorrelated:

    impure = func()
    t = ibis.table({"x": "int64"}, name="t")
    t1 = t.mutate(y=impure)
    t2 = t1.mutate(z=impure.cast("string"))
    expected = ops.Project(
        parent=t1,
        values={"x": t1.x, "y": t1.y, "z": impure.cast("string")},
    )

I think we need to rely on the user doing explicit .selects and .mutates to be the boundary locations where expressions are separated as correlated/uncorrelated.

NickCrews · 2024-04-18T03:30:38Z

The problem doesn't just occur in case of merging Select nodes. Consider the following case:

I see what you're saying here, that does look like a problem. Just to be sure here, what I would expect is for both cases to be something like the very simple

UnboundTable:
   RandomUUID(): RandomUUID()

(ie there is literally NO reference to the parent tables, because we only selected a single UUID expression, which doesn't depend on any parents).

Is this what you would expect too?

cpcloud · 2024-04-18T10:23:09Z

The ordering of ArrayDistinct is not part of its purity, because of the relational model, which doesn't guarantee the ordering of any output without an ORDER BY. The same is true for array aggregation (Ibis calls this collect). Under a model like the one you're alluding to, we'd also have to consider parallel floating point arithmetic to be impure, since the bitwise output can be different for summation (for example) depending on the order in which the summands are processed. I don't think we should consider array distinct impure.
UDFs will be done in a follow-up.

More significantly, the added test case isn't actually what I would expect.

Well, yeah :) See my comment here: #8967 (comment)

NickCrews · 2024-04-18T19:00:10Z

In this comment, @kszucs says:

# semantically this should produce two different columns
t.select(x=ibis.random(), y=ibis.random())
# semantically this should produce identical columns
rand = ibis.random()
t.select(x=rand, y=rand)

I think that both of these should actually produce two different columns. If you want to get two identical columns, I think you should have to explicitly re-select from a parent relation:

t.select(common=ibis.random()).select(x=_.common, y=_.common)

Consider this example (maybe I'm not understanding the implementation, correct me if so):

# @functools.cache()
def generate_noise(amplitude):
    return (ibis.random() - .5) * amplitude


def add_noise(t, columns: list[str]):
    return t.mutate(**{col:t[col] + generate_noise(100) for col in columns})

If there is a @functools.cache() on the helper function, then the id of the returned expression is the same, and so you get correlated noise in each of the columns. If there is no @functools.cache(), then the noise will be uncorrelated. I don't think this behavior should depend on such an implementation detail.

Need to fix the UDF test case. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.

cpcloud · 2024-04-18T20:36:40Z

@kszucs Perhaps we should consider restricting impurity further, to effectively require a projection to get the same value.

I agree with Nick that t.select(x=rand, y=rand) should produce two random() invocations, and I don't think we should make assumptions about impurity based on whether two variables are the same.

A variable isn't the same as a column in a projection.

kszucs · 2024-04-18T21:50:18Z

I can accept that as the desired behaviour, though we need to prevent dereferencing in that case. Going to take a look.

Need to fix the UDF test case. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.

NickCrews · 2024-04-19T01:09:07Z

This comment is maybe getting towards the use cases we need to support?

Need to fix the UDF test case. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.

Need to fix the a few broken cases. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.

NickCrews added the bug Incorrect behavior inside of ibis label Apr 9, 2024

kszucs mentioned this issue Apr 15, 2024

fix(ir): make impure ibis.random() and ibis.uuid() functions return unique node instances #8967

Merged

cpcloud closed this as completed in #8967 Apr 15, 2024

NickCrews reopened this Apr 18, 2024

NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 18, 2024

test: add test for impure function correlation behavior

0992a3f

Need to fix the UDF test case. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.

NickCrews mentioned this issue Apr 18, 2024

test: add test for impure function correlation behavior #9014

Open

NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 18, 2024

test: add test for impure function correlation behavior

48d4a51

Need to fix the UDF test case. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.

NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 19, 2024

test: add test for impure function correlation behavior

1fea07f

Need to fix the UDF test case. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.

NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 19, 2024

test: add test for impure function correlation behavior

5d9a1ec

Need to fix the UDF test case. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.

NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 19, 2024

test: add test for impure function correlation behavior

21b180c

Need to fix the UDF test case. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.

NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 23, 2024

test: add test for impure function correlation behavior

7700941

Need to fix the a few broken cases. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.

NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 23, 2024

test: add test for impure function correlation behavior

d14384c

Need to fix the a few broken cases. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.

NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 23, 2024

test: add test for impure function correlation behavior

2afcaee

Need to fix the a few broken cases. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.

NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 23, 2024

test: add test for impure function correlation behavior

53c6941

Need to fix the a few broken cases. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.

NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 23, 2024

test: add test for impure function correlation behavior

ee0ae0c

Need to fix the a few broken cases. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.

NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 23, 2024

test: add test for impure function correlation behavior

37ae5dd

Need to fix the a few broken cases. Related to ibis-project#8921, trying to write down exactly what the expected behavior is.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: inlining expressions leads to wrong results for non-pure functions #8921

bug: inlining expressions leads to wrong results for non-pure functions #8921

NickCrews commented Apr 9, 2024 •

edited

kszucs commented Apr 9, 2024

NickCrews commented Apr 9, 2024

NickCrews commented Apr 12, 2024

kszucs commented Apr 13, 2024

cpcloud commented Apr 13, 2024

kszucs commented Apr 14, 2024 •

edited

NickCrews commented Apr 18, 2024

NickCrews commented Apr 18, 2024 •

edited

cpcloud commented Apr 18, 2024

NickCrews commented Apr 18, 2024

cpcloud commented Apr 18, 2024

kszucs commented Apr 18, 2024 •

edited

NickCrews commented Apr 19, 2024

bug: inlining expressions leads to wrong results for non-pure functions #8921

bug: inlining expressions leads to wrong results for non-pure functions #8921

Comments

NickCrews commented Apr 9, 2024 • edited

What happened?

What version of ibis are you using?

What backend(s) are you using, if any?

Relevant log output

Code of Conduct

kszucs commented Apr 9, 2024

NickCrews commented Apr 9, 2024

NickCrews commented Apr 12, 2024

kszucs commented Apr 13, 2024

cpcloud commented Apr 13, 2024

kszucs commented Apr 14, 2024 • edited

NickCrews commented Apr 18, 2024

NickCrews commented Apr 18, 2024 • edited

cpcloud commented Apr 18, 2024

NickCrews commented Apr 18, 2024

cpcloud commented Apr 18, 2024

kszucs commented Apr 18, 2024 • edited

NickCrews commented Apr 19, 2024

NickCrews commented Apr 9, 2024 •

edited

kszucs commented Apr 14, 2024 •

edited

NickCrews commented Apr 18, 2024 •

edited

kszucs commented Apr 18, 2024 •

edited