Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(clickhouse): move translate_val to toposorted compiler loop #7209

Merged
merged 5 commits into from
Sep 26, 2023

Conversation

cpcloud
Copy link
Member

@cpcloud cpcloud commented Sep 24, 2023

This PR moves translate_val to function the same way that translate_rel
does: inputs are compiled and then passed into translate_val instead of
translation rules having to compile their own inputs.

A number interesting things come out of this:

  1. Recursion is entirely removed from the compiler. This makes the logic it
    much easier to follow IMO.
  2. Most rules are greatly simplified: no need to call translate_val
    recursively, inputs are compiled and then passed to functions.
  3. A few features "just work" after this refactor because the translation rule
    for the op is simpler (window functions)

Additionally, I was able to remove most occurrences of dialect="clickhouse"
and convert the clickhouse translate rules to use sqlglot functions or or other
constructs (SQLStringView and Joins need the dialect parameter).

Not counting the newly generated SQL, the net lines of code are reduced as well:

❯ git diff --shortstat upstream/master HEAD~
 24 files changed, 975 insertions(+), 1258 deletions(-)

@cpcloud cpcloud added refactor Issues or PRs related to refactoring the codebase clickhouse The ClickHouse backend labels Sep 24, 2023
@cpcloud cpcloud force-pushed the clickhouse-translate-val-map branch 9 times, most recently from 034819a to d771fd4 Compare September 25, 2023 12:02
op = op.replace(
p.WindowFunction(p.Cumulative, ...)
>> (lambda op, _: cumulative_to_window(op.func, op.frame))
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern system is pretty darn sweet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You "should" be able to do the replacements in one pass by combining them into an AnyOf pattern:

    params = {param.op(): value for param, value in params.items()}
    scalar_rule = p.ScalarParameter >> (
        lambda op, _: ops.Literal(value=params[op], dtype=op.dtype)
    )
    window_rule = p.WindowFunction(p.Cumulative, ...) >> (
        lambda op, _: cumulative_to_window(op.func, op.frame)
    )
    op = op.replace(scalar_rule | window_rule)

Copy link
Member

@kszucs kszucs Sep 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should provide a way to spell the replacement rule as a decorated function:

@replace(ops.ScalarParameter)
def substitute_param(op, ctx):
    return ops.Literal(value=params[op], dtype=op.dtype)

Possibly worth a low-priority ticket.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's avoid building that until we have some more use cases as input

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single-pass replacement worked 🎉

Copy link
Member

@kszucs kszucs Sep 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this will work until the replacement rules have distinct matchers because the first matching pattern gets applied.

return new_expr.op()


def _translate_node(node, *args, **kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please remove _translate_node, translate_rel and translate_val in favor of a single flat overloaded function, e.g. translate()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do that in a follow up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think this adds unnecessary implicit behavior:

  1. I now have to deal with import cycles
  2. I now have to ensure that both values and relations translate is imported so that all the rules are registered.

I don't really see what we gain by making these into a single rule.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put all the translate implementations into a single file?

Copy link
Member Author

@cpcloud cpcloud Sep 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that they are separate now, it makes it easier to find things IMO.

)

# apply translate rules in topological order
results = op.map(fn, filter=(ops.TableNode, ops.Value))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restricting the traversal is required or just performance optimization?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tried it without. I'd rather keep it as is, if only for readability. It's easier to see that the code should only work with these node types than if there were no filter, which would suggest it handles generic Node instances.

@cpcloud cpcloud force-pushed the clickhouse-translate-val-map branch 3 times, most recently from a320152 to a2b4eaf Compare September 25, 2023 12:34
*map(partial(translate_val, **kw), op.values), dialect="clickhouse"
)
@translate_rel.register
def _dummy(op: ops.DummyTable, *, values, **_):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove the keyword-only marker * from the functions? I usually find it more distracting than useful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of like it as an indicator of compiled inputs. I'd rather keep it if that's okay.

Copy link
Member Author

@cpcloud cpcloud Sep 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also further enforces the op structure during compilation, you can't have any rogue arguments.

The error messages are also much more readable with this when an argument is missing. With positional-or-keyword you don't get the name of the argument that was missing when you fail to pass it in whereas with keyword-only you do.

I think it's better than having it positional-or-keyword, even if the * is a bit distracting.

ops.Multiply: "*",
ops.Divide: "/",
ops.Modulus: "%",
ops.Add: operator.add,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We repeat these mappings (along with their string infix representation) at multiple places, possible we should set them directly on the operation classes. Should create a follow-up ticket for this.

@cpcloud cpcloud force-pushed the clickhouse-translate-val-map branch 3 times, most recently from 9ec68ae to 85a56b2 Compare September 25, 2023 16:15
@cpcloud cpcloud force-pushed the clickhouse-translate-val-map branch 4 times, most recently from 18dff95 to 3c4ca70 Compare September 26, 2023 10:54
@cpcloud cpcloud added this to the 7.0 milestone Sep 26, 2023
@cpcloud cpcloud force-pushed the clickhouse-translate-val-map branch 2 times, most recently from 2c19f21 to 3bccfb1 Compare September 26, 2023 11:31
def cumulative_to_window(
func: ops.Cumulative, frame: ops.WindowFrame
) -> ops.WindowFunction:
klass = getattr(ops, func.__class__.__name__.replace("Cumulative", ""))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks a bit fragile to me but fine for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We haven't changed the name of cumulative operations ever ... maybe?

@@ -952,9 +949,6 @@ def agg(df):


@pytest.mark.notimpl(["datafusion", "polars"], raises=com.OperationNotDefinedError)
@pytest.mark.broken(
["clickhouse"], reason="clickhouse returns incorrect results", raises=AssertionError
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

alltypes[alltypes.int_col == 1]
.limit(n)
.int_col.collect()
.map(lambda x: my_add(x, 1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty nice!



@pytest.fixture
def translate():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to see this gone!

# rewrite cumulative functions to window functions, so that we don't have
# to think about handling them in the compiler, we need only compile window
# functions
replace_cumulative_ops = p.WindowFunction(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like it!

op.value,
ops.TableArrayView(
ops.Selection(
table=an.find_first_base_table(op.options), selections=(op.options,)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should switch to Topmost pattern to locate the base table, but in a follow-up.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like BaseTableOf = Topmost(ops.Relation)

c.ExistsSubquery(x)
)

op = op.replace(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Starting to have a ruleset here, great to see it in action!

Copy link
Member

@kszucs kszucs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great improvement, thanks @cpcloud!

@kszucs kszucs merged commit 7f1852e into ibis-project:master Sep 26, 2023
86 checks passed
@cpcloud cpcloud deleted the clickhouse-translate-val-map branch September 26, 2023 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clickhouse The ClickHouse backend refactor Issues or PRs related to refactoring the codebase
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants