Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect result in certain situations with multiple joins with difference partition schemes #12577

Closed
rongrong opened this issue Apr 2, 2019 · 5 comments
Assignees

Comments

@rongrong
Copy link
Contributor

rongrong commented Apr 2, 2019

This is a bug introduced by #12013. The result would be wrong if the following situation happens:

  • The query uses COALESCE(joinKey) on top of FULL OUTER JOIN with equi-join.
  • The children of the FullJoin node uses a different hash function to compute the partition from the join keys. For example, hash is computed on (a, constant) and join key is just a.
  • There is another JOIN with the result of FULL OUTER JOIN using equi-join on only the coalesced keys of the FULL OUTER JOIN.

In such situation, the newly introduced optimization would assume that the result of the FULL OUTER JOIN is already partitioned on COALESCE(a) thus there's no need for another shuffle before the next join. However, because the hash function is calculated on (a, constant), even if the data is "partitioned on a" it would be on a different node as a hash function computed with just a. Thus a shuffle would still be needed to produce correct result.

@rongrong
Copy link
Contributor Author

rongrong commented Apr 2, 2019

Theoretically it would also produce incorrect results if the input of the FULL OUTER JOIN is partitioned on more columns than just the join keys, for example, input is partitioned on (a, b) but only join on a. Though practically I don't know what query shape would make that happen.

@wenleix
Copy link
Contributor

wenleix commented Apr 8, 2019

@rongrong : Might be easier to understand the condition combination with an example query ? :) (e.g. TPCH)

@rongrong
Copy link
Contributor Author

rongrong commented Apr 9, 2019

Here's an example that can produce the wrong plan. It's harder to come up with a query that can produce meaningful & wrong results though:

SELECT * 
FROM customer t3
JOIN (
    SELECT coalesce(c1, c2) c
    FROM (
        SELECT custkey c1, name FROM customer WHERE name = 'a') t1 
    FULL OUTER JOIN (
        SELECT custkey c2, name FROM customer WHERE name = 'b' GROUP BY 1, 2) t2
    ON t1.c1 = t2.c2) t
ON t3.custkey = t.c;

@tooptoop4
Copy link

@rongrong did #12946 fix this issue? if so, does .226 have the fix?

@rongrong
Copy link
Contributor Author

@tooptoop4 Yes #12946 reintroduced FULL OUTER JOIN + COALESCE optimization and it should not have this bug. It will be released in 0.227.

@rongrong rongrong closed this as completed Dec 2, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants