Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: inlining expressions leads to wrong results for non-pure functions #8921

Open
1 task done
NickCrews opened this issue Apr 9, 2024 · 13 comments · Fixed by #8967
Open
1 task done

bug: inlining expressions leads to wrong results for non-pure functions #8921

NickCrews opened this issue Apr 9, 2024 · 13 comments · Fixed by #8967
Labels
bug Incorrect behavior inside of ibis

Comments

@NickCrews
Copy link
Contributor

NickCrews commented Apr 9, 2024

What happened?

See this:

import ibis
import random

ibis.options.interactive = True


@ibis.udf.scalar.python(side_effects=True)
def my_random(x: float) -> float:
    return random.random()


t = ibis.memtable({"x": [1, 2, 3, 4, 5]})
# this happens with both the builtin random, and UDFs:
# t = t.mutate(y=ibis.random())
t = t.mutate(y=my_random(t.x))
t = t.mutate(size=(t.y > 0.5).ifelse("big", "small"))
t
# ┏━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┓
# ┃ x     ┃ y        ┃ size   ┃
# ┡━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
# │ int64 │ float64  │ string │
# ├───────┼──────────┼────────┤
# │     1 │ 0.208385 │ big    │
# │     2 │ 0.566183 │ big    │
# │     3 │ 0.836448 │ small  │
# │     4 │ 0.530235 │ small  │
# │     5 │ 0.526687 │ small  │
# └───────┴──────────┴────────┘

Notice that the size column is sometimes incorrect. This is because size is getting calculated from a totally uncorrelated RANDOM() in the generated SQL. ibis.to_sql(t) shows:

SELECT
  "t0"."x",
  RANDOM() AS "y",
  CASE WHEN RANDOM() > CAST(0.5 AS DOUBLE) THEN 'big' ELSE 'small' END AS "size"
FROM "ibis_pandas_memtable_m5jukuqv2rcrhkpg2vsdu6k5fe" AS "t0"

For this to work as expected, there can only be a single RANDOM() in the generated SQL. I think that the agressive inlining that we do to merge selects/mutates together is over-aggressive. After a short talk with @cpcloud, this inlining is only correct when the entire expression is pure. Either we need to somehow know when an expression is pure (most are, except for (some) UDFs, random(), maybe now(), and maybe others I'm not thinking of?), or just be conservative and only do this inlining when the subsequent selects/mutates are just reselecting/renaming columns with no other changes.

What version of ibis are you using?

main

What backend(s) are you using, if any?

I'm not sure if this is present on all backends, but I think so.

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@NickCrews NickCrews added the bug Incorrect behavior inside of ibis label Apr 9, 2024
@kszucs
Copy link
Member

kszucs commented Apr 9, 2024

We can add these impure types to https://github.com/ibis-project/ibis/blob/main/ibis/backends/sql/rewrites.py#L155 in order to prevent merging those selects.

@NickCrews
Copy link
Contributor Author

That would work, if we are sure we can get all the types. What about ops.ArrayDistinct though? In duckdb that is an impure function, because the order of the result is not guaranteed. But in other backends it might be a pure function. Do we just have to be conservative and call that a blocking op?

@NickCrews
Copy link
Contributor Author

ibis.uuid() is another impure expression to worry about.

What do you think is the counter-argument against only merging on re-selects? Is it just the uglier/more verbose SQL? Unless we have evidence that it is slower, I would want to lean towards the less-likely-to-be-incorrect route, even if the generated SQL is uglier for human consumption.

@kszucs
Copy link
Member

kszucs commented Apr 13, 2024

The problem doesn't just occur in case of merging Select nodes. Consider the following case:

In [1]: import ibis

In [2]: uid = ibis.uuid()

In [3]: t = ibis.table({"a": "int", "b": "string"})

In [4]: t1 = t.select(t, uid)

In [7]: t1 = t.select(t, uid=uid)

In [8]: t2 = t1.select(uid)

In [9]: t2
Out[9]:
r0 := UnboundTable: unbound_table_0
  a int64
  b string

r1 := Project[r0]
  a:   r0.a
  b:   r0.b
  uid: RandomUUID()

Project[r1]
  RandomUUID(): r1.uid

In [10]: t3 = t1.select(ibis.uuid())

In [11]: t3
Out[11]:
r0 := UnboundTable: unbound_table_0
  a int64
  b string

r1 := Project[r0]
  a:   r0.a
  b:   r0.b
  uid: RandomUUID()

Project[r1]
  RandomUUID(): r1.uid

This "rewrite" happens at the API level as well during value dereferencing which we certainly would like to keep. In the first case we use the same variable so we mean the same value whereas in the latter we create a new uuid assuming that the user intention is to create a new one.

I think this issue can be resolved by better modelling of inpure expressions like uuid() and random() and making them actually unique by a virtual field, then we would need to mark those expressions solving both the dereferencing and Select merge cases:

class Unique(Value):
    _counter = itertools.count()
    uid: Optional[int] = None

    def __init__(self, uid, **kwargs):
        if uid is None:
            uid = next(self._counter)
        super().__init__(uid=uid, **kwargs)


class RandomUUID(Unique):
    dtype = dt.uuid


class RandomScalar(Unique):
    dtype = dt.float64

@cpcloud
Copy link
Member

cpcloud commented Apr 13, 2024

What's the issue with disabling the merging? Otherwise we'll be playing whack-a-mole with discovering impure functions.

@kszucs
Copy link
Member

kszucs commented Apr 14, 2024

Disabling merging would solve the original problem in the issue but wouldn't solve the dereferencing one shown by me. My proposed solution would solve both.

@NickCrews
Copy link
Contributor Author

Sorry I went on vacation for a few days, therefore radio silence.

I'm hesitant that the linked PR wasn't the right solution:

  • still doesn't address the ops.ArrayDistinct point above
  • still doesn't address the UDF point above
  • More significantly, the added test case isn't actually what I would expect. There is some "spooky action at a distance" happening here: as soon as I do t.mutate(y=impure), now all of a sudden impure has become tainted, it is now referencing the first table that it is bound to? I find that very surprising.

If I wanted y and z to be correlated, I would have referenced the parent column directly:

    impure = func()
    t = ibis.table({"x": "int64"}, name="t")
    t1 = t.mutate(y=impure)
    t2 = t1.mutate(z=t1.y.cast("string")) 
    expected = ops.Project(
        parent=t1,
        values={"x": t1.x, "y": t1.y, "z": t1.y.cast("string")},
    )

If I don't reference the parent table in subsequent selects, then I expect the columns to be uncorrelated:

    impure = func()
    t = ibis.table({"x": "int64"}, name="t")
    t1 = t.mutate(y=impure)
    t2 = t1.mutate(z=impure.cast("string"))
    expected = ops.Project(
        parent=t1,
        values={"x": t1.x, "y": t1.y, "z": impure.cast("string")},
    )

I think we need to rely on the user doing explicit .selects and .mutates to be the boundary locations where expressions are separated as correlated/uncorrelated.

@NickCrews NickCrews reopened this Apr 18, 2024
@NickCrews
Copy link
Contributor Author

NickCrews commented Apr 18, 2024

The problem doesn't just occur in case of merging Select nodes. Consider the following case:

I see what you're saying here, that does look like a problem. Just to be sure here, what I would expect is for both cases to be something like the very simple

UnboundTable:
   RandomUUID(): RandomUUID()

(ie there is literally NO reference to the parent tables, because we only selected a single UUID expression, which doesn't depend on any parents).

Is this what you would expect too?

@cpcloud
Copy link
Member

cpcloud commented Apr 18, 2024

  • The ordering of ArrayDistinct is not part of its purity, because of the relational model, which doesn't guarantee the ordering of any output without an ORDER BY. The same is true for array aggregation (Ibis calls this collect). Under a model like the one you're alluding to, we'd also have to consider parallel floating point arithmetic to be impure, since the bitwise output can be different for summation (for example) depending on the order in which the summands are processed. I don't think we should consider array distinct impure.
  • UDFs will be done in a follow-up.

More significantly, the added test case isn't actually what I would expect.

Well, yeah :) See my comment here: #8967 (comment)

@NickCrews
Copy link
Contributor Author

In this comment, @kszucs says:

# semantically this should produce two different columns
t.select(x=ibis.random(), y=ibis.random())
# semantically this should produce identical columns
rand = ibis.random()
t.select(x=rand, y=rand)

I think that both of these should actually produce two different columns. If you want to get two identical columns, I think you should have to explicitly re-select from a parent relation:

t.select(common=ibis.random()).select(x=_.common, y=_.common)

Consider this example (maybe I'm not understanding the implementation, correct me if so):

# @functools.cache()
def generate_noise(amplitude):
    return (ibis.random() - .5) * amplitude


def add_noise(t, columns: list[str]):
    return t.mutate(**{col:t[col] + generate_noise(100) for col in columns})

If there is a @functools.cache() on the helper function, then the id of the returned expression is the same, and so you get correlated noise in each of the columns. If there is no @functools.cache(), then the noise will be uncorrelated. I don't think this behavior should depend on such an implementation detail.

NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 18, 2024
Need to fix the UDF test case.

Related to ibis-project#8921,
trying to write down exactly what
the expected behavior is.
NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 18, 2024
Need to fix the UDF test case.

Related to ibis-project#8921,
trying to write down exactly what
the expected behavior is.
@cpcloud
Copy link
Member

cpcloud commented Apr 18, 2024

@kszucs Perhaps we should consider restricting impurity further, to effectively require a projection to get the same value.

I agree with Nick that t.select(x=rand, y=rand) should produce two random() invocations, and I don't think we should make assumptions about impurity based on whether two variables are the same.

A variable isn't the same as a column in a projection.

@kszucs
Copy link
Member

kszucs commented Apr 18, 2024

I can accept that as the desired behaviour, though we need to prevent dereferencing in that case. Going to take a look.

NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 19, 2024
Need to fix the UDF test case.

Related to ibis-project#8921,
trying to write down exactly what
the expected behavior is.
@NickCrews
Copy link
Contributor Author

This comment is maybe getting towards the use cases we need to support?

NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 19, 2024
Need to fix the UDF test case.

Related to ibis-project#8921,
trying to write down exactly what
the expected behavior is.
NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 19, 2024
Need to fix the UDF test case.

Related to ibis-project#8921,
trying to write down exactly what
the expected behavior is.
NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 23, 2024
Need to fix the a few broken cases.

Related to ibis-project#8921,
trying to write down exactly what the expected behavior is.
NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 23, 2024
Need to fix the a few broken cases.

Related to ibis-project#8921,
trying to write down exactly what the expected behavior is.
NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 23, 2024
Need to fix the a few broken cases.

Related to ibis-project#8921,
trying to write down exactly what the expected behavior is.
NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 23, 2024
Need to fix the a few broken cases.

Related to ibis-project#8921,
trying to write down exactly what the expected behavior is.
NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 23, 2024
Need to fix the a few broken cases.

Related to ibis-project#8921,
trying to write down exactly what the expected behavior is.
NickCrews added a commit to NickCrews/ibis that referenced this issue Apr 23, 2024
Need to fix the a few broken cases.

Related to ibis-project#8921,
trying to write down exactly what the expected behavior is.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior inside of ibis
Projects
Status: backlog
3 participants