Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(common): add a pure python egraph implementation #5781

Merged
merged 3 commits into from Apr 21, 2023

Conversation

kszucs
Copy link
Member

@kszucs kszucs commented Mar 20, 2023

Must haves:

  • support variadics when converting back from ENode to Node
  • remove newegraph.py implementation in favor of egraph.py
  • partially clean up the code

Should haves:

  • support conditions when matching (related feat(common): add pattern matchers #5824)
  • extensive memoization for better performance
  • rethink how to call dynamic rewrites (with enodes or with nodes?)
  • keep track of enode costs during union operation
  • add a backoff scheduler when matching patterns to speed up aggressively matching rules like commutativity and associativity

Could haves:

  • support custom cost function for extraction
  • would be great to implement the ENode interface in ops.Node so we don't need to convert ops.Nodes to ENodes, though we would still need to have a non-validated Node alternative for performance reasons

@saulshanabrook
Copy link
Contributor

Hey @kszucs! This is cool to see how you are integrating this with Ibis! I have also been experimenting with different ways of exposing e-graphs in Python (egraphs-good/egglog#126) as well. I had a few thoughts about the differences in API design. If you are interested in talking about it, let me know, I am happy to post here or jump on a call.

@kszucs
Copy link
Member Author

kszucs commented Mar 28, 2023

Hey @saulshanabrook!

I have also been experimenting with different ways of exposing e-graphs in Python (egraphs-good/egglog#126) as well.

I know, I have been following your work on the egg-smol python bindings :)
Before checking out your higher level API I used the lower level bindings to integrate egg-smol with ibis which actually worked, but was hard to express anything else than syntactic rewrite rules. Your freshly merged high-level API seems pretty promising, I especially like the tight integration with the typing system!

I started to work on a pure python implementation to support arbitrary python callables both when searching and when replacing enodes. It also makes it easier to customize other parts of the match-replace loop.

If you are interested in talking about it, let me know, I am happy to post here or jump on a call.

Definitely, though I'd need to first have a hand-on experiment with the new high-level API. Also @jcrist might be also interested.

@saulshanabrook
Copy link
Contributor

Before checking out your higher level API I used the lower level bindings to integrate egg-smol with ibis which actually worked, but was hard to express anything else than syntactic rewrite rules.

Oh cool! This was my first Rust project so don't look too close 🫣. It's funny to not know if anyone is looking at/trying something on Github.

Your freshly merged high-level API seems pretty promising, I especially like the tight integration with the typing system!

Yeah, I have gotten quite fond of that method of having Python functions with type annotations be the way to define operations instead of using classes... Since it integrates so nicely with the static typing! (As a side note, it's too bad that MyPy now complains at empty bodies with ... in them python/mypy#13729. They used to not complain, which was great for this sort of thing, hence all the # type: ignore lines which are so ugly!)

It's also fun how to make the Python typing match up with the egg-smol typing, so that an expression is allowed in python if and only if it is allowed in the egg-smol typing.

I started to work on a pure python implementation to support arbitrary python callables both when searching and when replacing enodes. It also makes it easier to customize other parts of the match-replace loop.

Ah yeah, that makes sense. I wonder more generally about that in egg-smol, can anything that was written as a custom replacement be represented in egg-smol in some way? If you have particular examples that seem tricky to do syntactically, I would be interested in those! Could then ask upstream what the strategy would be in egg-smol for them.

Also, I did have this idea at one point of adding Python objects as a builtin sort to egg-smol, since you can add new ones at runtime if you wanted other Python values besides ints/strings/floats as leaf nodes.

@kszucs
Copy link
Member Author

kszucs commented Mar 28, 2023

Before checking out your higher level API I used the lower level bindings to integrate egg-smol with ibis which actually worked, but was hard to express anything else than syntactic rewrite rules.

Oh cool! This was my first Rust project so don't look too close 🫣. It's funny to not know if anyone is looking at/trying something on Github.

Oh noo, the bindings are great! It was due to egg-smol library itself :)

If you have particular examples that seem tricky to do syntactically, I would be interested in those! Could then ask upstream what the strategy would be in egg-smol for them.

The egglog way of thinking requires more of a logical programming view which can be hard for a usual python developer, most of the time it's easier to calculate conditions in an imperative fashion (which could be memoized for better runtime performance).
Ideally ibis should be able to express rewrites rules similarly to cockroachdb's rules (they supposed to use equality saturation like solution too - at least according to their docstrings).

Also, I did have this idea at one point of adding Python objects as a builtin sort to egg-smol, since you can add new ones at runtime if you wanted other Python values besides ints/strings/floats as leaf nodes.

That would be pretty handy, would that allow calling arbitrary python functions?

@saulshanabrook
Copy link
Contributor

saulshanabrook commented Mar 28, 2023

Oh, one other thing I forgot to mention about the high level design... I think it's interesting if the node/operation definition can be used internally but also be a user-facing API. Like for example in the tests here there are things like:

class MyNode(Concrete, Node):
    pass

class MyLit(MyNode):
    value: int

class MyAdd(MyNode):
    a: MyNode
    b: MyNode

class MyMul(MyNode):
    a: MyNode
    b: MyNode

# number postfix highlights the depth of the node
one = MyLit(value=1)
two = MyLit(value=2)
two1 = MyAdd(a=one, b=one)
three1 = MyAdd(a=one, b=two)
six2 = MyMul(a=three1, b=two1)
seven2 = MyAdd(a=six2, b=one)

We wouldn't ask a user of Ibis to build up nodes in this manner, they would want some python-like API. So we would have to build another API on top of this one that is exposed to the user.

However, with the high level API of using Python functions as operations, this API could itself be exposed to the user:

from __future__ import annotations
from egg_smol import *

egraph = EGraph()


@egraph.class_
class MyNode(BaseExpr):
    def __init__(self, value: i64Like):
        ...
    def __add__(self, other: MyNode) -> MyNode:
        ...
    def __mul__(self, other: MyNode) -> MyNode:
        ...

one = MyNode(1)
two = MyNode(2)
two1 = one + one
three1 = one + two
six2 = three1 * two1
seven2 = six2 + one

Of course, you would have to provide all the right niceties, like an execute() method and a nice repr. But I think there is a big advantage to only having one level, since then it becomes easier for other folks to write their own expressions and rewrites between them. Like, let's say ibis has this MyNode class that defines addition and multiplication. Then I am a third-party library, which wants to add a power method and translate that to repeated multiplication. It would be great if I could just define that myself, without having to add a PR to ibis!

@egraph.function(cost=10)
def pow(a: MyNode, b: MyNode) -> MyNode:
    ...


a, b = vars_("a b", MyNode)
i = var("i", i64)
egraph.register(
    # x ^ 0 = 1
    rewrite(pow(a, MyNode(i64(0)))).to(MyNode(i64(1))),
    # x ^ 1 = x
    rewrite(pow(a, MyNode(i64(1)))).to(a),
    # x ^ i = x * x ^ (i - 1) iff i > 1
    rewrite(pow(a, MyNode(i))).to(a * pow(a, MyNode(i - i64(1))), i > i64(1)),
)

x = egraph.define("x", pow(MyNode(2), MyNode(3)), cost=20)
egraph.run(10)
print(egraph.extract(x))
# Prints "MyNode(2) * (MyNode(2) * MyNode(2))"

This to me would be a huge win! Then also if multiple libraries started using the same system to express their expressions and rewrites it would become feasible for third parties to write rewrites between them. Of course that's still a long way off, but I think it would be a way of making it much easier on users... Instead of having to say "will I use dask dataframes or ibis dataframes"? They could use whichever API maps more closely to their domain, and then write rewrites (provided no one else already has) to the backend they would like to use. And if that changes, then they don't have to rewrite their business logic, just write different rewrites.

Of course there are a lot of stepping stones before getting there, but that is what excited me about these systems! (besides also just having a nicer way in Python to express optimizations).


If you have particular examples that seem tricky to do syntactically, I would be interested in those! Could then ask upstream what the strategy would be in egg-smol for them.

The egglog way of thinking requires more of a logical programming view which can be hard for a usual python developer, most of the time it's easier to calculate conditions in an imperative fashion (which could be memoized for better runtime performance).
Ideally ibis should be able to express rewrites rules similarly to cockroachdb's rules (they supposed to use equality saturation like solution too - at least according to their docstrings).

Thanks for that link! Yeah it's interesting to think about the different ways of writing rules and their equivalences. I would love to dive into a precise example here to spell those out... I haven't worked with SQL rewrite rules much so it isn't as clear to what the different ways of expressing them are.

Also, I did have this idea at one point of adding Python objects as a builtin sort to egg-smol, since you can add new ones at runtime if you wanted other Python values besides ints/strings/floats as leaf nodes.

That would be pretty handy, would that allow calling arbitrary python functions?

I think so... Like in the same way that the builtin sorts to egg-smol call arbitrary Rust functions: https://github.com/mwillsey/egg-smol/blob/669ea92d3ac73b23bd8b0565fb0866cda0770a02/src/sort/f64.rs#L36

@kszucs kszucs marked this pull request as ready for review April 3, 2023 12:04
@kszucs kszucs changed the title feat(common): add a pure python egraph implementation [WIP] feat(common): add a pure python egraph implementation Apr 3, 2023
@kszucs
Copy link
Member Author

kszucs commented Apr 3, 2023

Sorry for the late response!

Of course, you would have to provide all the right niceties, like an execute() method and a nice repr. But I think there is a big advantage to only having one level, since then it becomes easier for other folks to write their own expressions and rewrites between them.

It's indeed a nice feature, however with ibis we have a bit more complicated situation which I'm not sure how to apply the typed interface of egg-smol-python. Basically we have two levels in ibis: the operation Nodes and Expressions.
The nodes are the terms actually while the expressions are wrappers on node hierarchies providing the API functions which serve as factories for new nodes.
I'm not sure how could that be expressed using python typehints.

Like, let's say ibis has this MyNode class that defines addition and multiplication. Then I am a third-party library, which wants to add a power method and translate that to repeated multiplication. It would be great if I could just define that myself, without having to add a PR to ibis!

@egraph.function(cost=10)
def pow(a: MyNode, b: MyNode) -> MyNode:
    ...

In ibis we already have the Power node to represent this case which we can add rules to rewrite.
For me a more interesting case would be to provide plain python code in the function body to actually produce the result node, there can be rules easy to express in imperative style of programming while hard to express egglog like logical relations.

I think so... Like in the same way that the builtin sorts to egg-smol call arbitrary Rust functions: https://github.com/mwillsey/egg-smol/blob/669ea92d3ac73b23bd8b0565fb0866cda0770a02/src/sort/f64.rs#L36

This would certainly help in this case, it's still in "enode-land". What I'm thinking is to support both e-node and plain "node-land" (the extracted e-nodes) if that makes sense. While the former would be more performant the latter would be close the the original way of thinking.

This to me would be a huge win! Then also if multiple libraries started using the same system to express their expressions and rewrites it would become feasible for third parties to write rewrites between them. Of course that's still a long way off, but I think it would be a way of making it much easier on users... Instead of having to say "will I use dask dataframes or ibis dataframes"? They could use whichever API maps more closely to their domain, and then write rewrites (provided no one else already has) to the backend they would like to use. And if that changes, then they don't have to rewrite their business logic, just write different rewrites.

Having a standard way for defining terms and term graphs would be great indeed, then the utilities (like egraphs) around these could be reused as well, it's more similar to what arrow provides for columnar data. Then EGraphs are "execution engines" on that hierarchic data.

Of course there are a lot of stepping stones before getting there, but that is what excited me about these systems! (besides also just having a nicer way in Python to express optimizations).

I'm also intrigued by these concepts! Once I manage to roll up my other PRs we can elaborate more on these.

Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave this a read through. I confess I don't fully understand the egraph implementation, but the test coverage looks good and I trust your work here.

My main concerns are:

  • The high use of operator overloading for defining patterns. IMO operators shouldn't be overloaded to do surprising things; if a user sees @ they should expect matrix multiplication, if they see Class[...] they should expect type parametrization. I'd prefer if we found a less-magical syntax for defining patterns.
  • Some simple docstrings for all the public methods on each type would aid a lot in helping others understand how this code works and how it should be used.

ibis/common/collections.py Outdated Show resolved Hide resolved
ibis/common/collections.py Outdated Show resolved Hide resolved
@@ -208,4 +208,113 @@ def __repr__(self):
return f"{self.__class__.__name__}({super().__repr__()})"


class DisjointSet(Mapping[K, Set[K]]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this datastructure is pretty specific to the egraph implementation and seems unlikely (IMO) to be used in other parts of Ibis, I'd prefer this was moved to ibis/common/egraph.py to keep the implementation together. It'd aid with readability of the code to have all these methods in the same file IMO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that this is unlikely to use in other places, but I have a slight preference of keeping it here. At first I kept both in the same file, and it was harder to comprehend which structure does what. Once I managed to properly factor it out and properly unit test it, now we can be certain that it complies with the specification of the union find algorithm.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still would like to see this moved to the egraph.py file. This is a unique enough datastructure that I'd expect most readers not to be familiar the meaning of operations available on a disjoint set (or the "union find" algorithm). Having all the code in the same file would really help with readability IMO.

Copy link
Member Author

@kszucs kszucs Apr 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, going to move it to egraph.py

ibis/common/egraph.py Show resolved Hide resolved
ibis/common/egraph.py Show resolved Hide resolved
ibis/common/egraph.py Show resolved Hide resolved
ibis/expr/operations/core.py Outdated Show resolved Hide resolved
ibis/common/tests/test_egraph.py Show resolved Hide resolved
ibis/common/egraph.py Show resolved Hide resolved
@kszucs kszucs force-pushed the pyegraph branch 2 times, most recently from 54ae015 to d8c77bb Compare April 19, 2023 14:25
ibis/common/graph.py Outdated Show resolved Hide resolved
@kszucs kszucs force-pushed the pyegraph branch 2 times, most recently from edc9aa4 to 0c260e7 Compare April 19, 2023 14:30
@kszucs
Copy link
Member Author

kszucs commented Apr 19, 2023

@jcrist I removed some leftovers and squashed into a single commit, should be ready to another look

ibis/common/egraph.py Outdated Show resolved Hide resolved
@cpcloud
Copy link
Member

cpcloud commented Apr 20, 2023

The high use of operator overloading for defining patterns. IMO operators shouldn't be overloaded to do surprising things; if a user sees @ they should expect matrix multiplication, if they see Class[...] they should expect type parametrization. I'd prefer if we found a less-magical syntax for defining patterns.

I guess I can go either way here.

Whether a thing is surprising is highly context- and person-dependent.

It may surprise some people that ~ encompasses both bitwise not operations as well as boolean negation. For people who are used to numpy and pandas it probably won't surprise them.

For SQL users, who knows? Many SQL dialects tend to prefer English words to operators, but some (Postgres for example) have a large variety of syntactic operators.

Matrix multiplication is such a niche operator (I personally still do not understand why it's part of Python's syntax) that restricting to just that seems like it's leaving a lot of shorthand convenience on the table. After all, it's not like it has multiple meanings in this context.

Similarly, if ibis ever grows something like a matmul operator that actually does do matrix multiplication, it's going to be mutually exclusive with how it's used in the pattern matching context.

I'm +0.5 on it.

Copy link
Member

@jcrist jcrist left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@kszucs kszucs merged commit fc89cc3 into ibis-project:master Apr 21, 2023
82 checks passed
@kszucs kszucs deleted the pyegraph branch May 16, 2023 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants