Support multiple columns in IN predicate #6385
Comments
There's a PR in progress to support this: #5323 |
I'm starting to work on this. |
@arhimondr do you have a plan for how tuples with NULLs should be distributed? Also, how does it play with structural types? What should be the result of |
Basically any structural type that contains a null value (no matter how deeply it is enclosed) is considered |
I'm going to change that to be |
Does it mean |
Yes |
Just note that today:
and |
No, they have different sizes, so the comparison should return false. |
With respect to how they are distributed, yes. But not for the purpose of comparison for the IN test - that still behaves according to equality semantics |
Oh, sure. Would the answer be the same for |
IN should have the same semantics as = |
Right, so
And yeah,
that's how the array comparison is implemented now. It works only because we check the size first: if the sizes differ, the result is false. But when it comes to actually comparing the values, we do not support nulls.
PostgreSQL would just answer
I guess since we didn't implement this - it is not covered by the standard. Null comparisons for And it seems that semantics for
is equivalent to
is equivalent to
I didn't check the SQL standard yet (I will), but so far the idea is to change the |
*** Update: this comment is incorrect. See follow-up comments below. ***
|
Ah, I take that back. That first block talks about arrays being "identical", not "equal", so that's not the operation to be considered for equality. |
Here are the equality rules for arrays:
So, according to the standard, |
And for rows:
|
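For readers following the array and row equality rules discussed above, here is a minimal Python sketch of three-valued structural equality. It is an illustration only, not Presto's implementation: `sql_equals` is a hypothetical name, Python lists and tuples stand in for SQL arrays and rows, and `None` stands in for SQL NULL / unknown. It assumes the usual reading of the standard: different cardinality gives false, any unequal pair gives false, and otherwise any unknown pair makes the whole comparison unknown.

```python
from typing import Any, Optional


def sql_equals(left: Any, right: Any) -> Optional[bool]:
    """Three-valued equality: True, False, or None (unknown)."""
    if left is None or right is None:
        return None  # comparing with NULL is unknown
    if isinstance(left, (list, tuple)) and isinstance(right, (list, tuple)):
        if len(left) != len(right):
            return False  # size is checked first, before looking at elements
        saw_unknown = False
        for l_item, r_item in zip(left, right):
            result = sql_equals(l_item, r_item)
            if result is False:
                return False  # one definite mismatch decides the comparison
            if result is None:
                saw_unknown = True
        return None if saw_unknown else True
    return left == right


print(sql_equals([1, None], [1, 2]))     # unknown: the null could be 2
print(sql_equals([1, None], [2, 3]))     # definite mismatch on the first element
print(sql_equals([1, None], [1, 2, 3]))  # different sizes
```

Under these assumptions a nested null only matters when no other element already proves the values different, which is why the size check can answer false even for arrays that contain nulls.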
Yeah, thanks for digging into the standard. I was just about to do so. So, yeah,
We should leave |
And yeah, we cannot rely on |
A little bit of pseudocode =)
|
This algorithm is already implemented in the HashSemiJoinOperator. On the probe side we need no modification. We always operate on a single row. On the build side we need to broadcast all the values with nulls at any level (we call them In that PR we are also blocking null values on the probe side. That change was introduced mainly because we didn't broadcast an The only unresolved issue there is the After implementing a
|
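The build-side rule described above (broadcast every value that has a null at any level, partition the rest) can be sketched as follows. This is a hedged illustration, not the HashSemiJoinOperator code: `is_indeterminate`, `freeze`, and `target_partitions` are hypothetical names, Python lists and tuples stand in for arrays and rows, and plain hash partitioning is an assumption.

```python
from typing import Any, List


def is_indeterminate(value: Any) -> bool:
    """True if the value is null or contains a null at any nesting depth."""
    if value is None:
        return True
    if isinstance(value, (list, tuple)):
        return any(is_indeterminate(item) for item in value)
    return False


def freeze(value: Any) -> Any:
    """Convert nested lists to tuples so the value can be hashed."""
    if isinstance(value, (list, tuple)):
        return tuple(freeze(item) for item in value)
    return value


def target_partitions(value: Any, partition_count: int) -> List[int]:
    """Indeterminate build values go everywhere; others go to one partition."""
    if is_indeterminate(value):
        return list(range(partition_count))
    return [hash(freeze(value)) % partition_count]


print(target_partitions([1, None], 4))  # broadcast to all four partitions
print(target_partitions([1, 2], 4))     # a single hash partition
```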
@arhimondr -- this is correct in the sense that it doesn't produce a wrong result. It's not (yet) supported.
Which PR are you referring to?
Doesn't it mean we need to implement missing support for |
Yes, exactly. |
This isn't true unfortunately. Consider probe row: So it seems that probe rows need to have |
E.g probe should be similar to:
|
We already distribute an "any" row. Initially I thought we introduced that to cover the "empty in" case, but then I realized that it should cover the probe nulls case as well.
For your particular example it would return the correct result, because the "array[2, 1]" value is going to be broadcast as the "any" value.
It still feels like there's a case that may not be covered, although so far I couldn't come up with one.
Let me try to walk you through my thoughts.
No matter what the build side set is, for an indeterminate value on the probe side only "false" or "null" can be returned.
In SQL, the expression "null or false" evaluates to null.
A single comparison in the IN chain returning "null" is therefore enough to make the entire IN evaluate to null, e.g. value1 IN (null, value2, value3, ...) is null whenever no comparison returns true, no matter what the rest of the values are.
You can get a "null" only when matching an indeterminate with an indeterminate (a value that contains a null somewhere).
Since we replicate all the indeterminate values, we shouldn't miss a null.
Let me know if I'm missing something.
On Mon, May 7, 2018 at 10:51 AM Karol Sobczak wrote:
E.g probe should be similar to:
Aggregation[UniqueIdSymbol][aggregateSemiJoinSymbol:=OR(semiJoinSymbol)]
|
SemiJoin (broadcasts probe rows with nulls)
|
AssignUniqueID
|
UPD:
I'm slow. You also get a "null" for "array[1,null] in (array[1,2], array[2,3])". In case array[2,3] is replicated we get a "false", which is incorrect.
Yeah, you are totally right, we need a unique row id approach here.
|
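The unique row id approach from the plan quoted earlier (AssignUniqueID, then SemiJoin, then an OR aggregation per id) can be illustrated with a small Python sketch. It is an assumption-laden toy, not the planner's code: `sql_or` and `aggregate_semi_join` are hypothetical names, `None` stands in for SQL NULL, and the partial results are written by hand for the "array[1,null] in (array[1,2], array[2,3])" example, with each build value living on a different partition.

```python
from typing import Dict, Iterable, Optional, Tuple


def sql_or(left: Optional[bool], right: Optional[bool]) -> Optional[bool]:
    """Three-valued OR: true wins, then unknown, then false."""
    if left is True or right is True:
        return True
    if left is None or right is None:
        return None
    return False


def aggregate_semi_join(
    partials: Iterable[Tuple[int, Optional[bool]]]
) -> Dict[int, Optional[bool]]:
    """Combine per-partition semi join results for each unique probe row id."""
    combined: Dict[int, Optional[bool]] = {}
    for unique_id, result in partials:
        if unique_id in combined:
            combined[unique_id] = sql_or(combined[unique_id], result)
        else:
            combined[unique_id] = result
    return combined


# Probe row 0 is array[1, null] and is broadcast to both partitions.
partials = [
    (0, None),   # partition holding array[1, 2]: comparison is unknown
    (0, False),  # partition holding array[2, 3]: comparison is false
]
print(aggregate_semi_join(partials))
```

Looking at the second partition alone would report false for row 0; combining the partial results per unique id is what recovers the null.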
unless eg |
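Yes, but it adds overhead for cases where there are no nulls at all. Maybe instead, for now, we could just support semi joins which are used for row filtering (e.g. a FilterNode above a SemiJoin node)? That is a much easier issue, since we don't have to distinguish between a null and a false semi join result. For cases where the semi join is not used as a filter, we can simply keep the current semantics (e.g. fail when any row is indeterminate, via a filter below the semi join).
On the other hand, a semi join with the majority of build rows being indeterminate might have performance similar to a cross join.
Do you want to complete this issue because you want to fix complex type equality and this is a prerequisite? |
It is the other way around. I want to fix complex type equality as a prerequisite
to this.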
On Tue, May 8, 2018 at 4:38 AM Karol Sobczak wrote:
Yeah, you are totally right, we need a unique row id approach here.
Yes, but it is adds overhead for cases where there are no nulls at all.
Maybe instead for now we could just support semi joins which are used for
row filtering (e.g: FilterNode above SemiJoin node)? That is much easier
issue since we don't have to distinguish between null/false semi join
result. For cases where semi join is not used filter we can simply use
current semantics (e.g: fail when any row is indeterminate (e.g: via filter
below semi join)).
On the other hand semi join with majority of build rows being
indeterminate might have performance similar to cross join.
Do you want to complete this issue because you want to fix complex type
equality and this is prerequisite?
|
That's a good point. Actually in many cases you don't have to distinguish between For example
I'm not sure, but it seems that whenever we use an Our optimizer should be smart enough to detect whether we really care about nulls. |
An IN in a projection is uncommon. However, in WHERE the IN might be negated.
Then we need to distinguish between a false result and a null one. |
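A short Python sketch of why negation matters. It assumes standard SQL filter semantics (a WHERE clause keeps a row only when the predicate is true); `sql_not` and `where_keeps_row` are hypothetical helper names, and `None` stands in for SQL NULL.

```python
from typing import Optional


def sql_not(value: Optional[bool]) -> Optional[bool]:
    """Three-valued NOT: unknown stays unknown."""
    return None if value is None else (not value)


def where_keeps_row(predicate: Optional[bool]) -> bool:
    """A WHERE clause keeps a row only when the predicate is true."""
    return predicate is True


# Compare "WHERE x IN (...)" with "WHERE x NOT IN (...)" for each IN result.
for in_result in (True, False, None):
    kept_by_in = where_keeps_row(in_result)
    kept_by_not_in = where_keeps_row(sql_not(in_result))
    print(in_result, kept_by_in, kept_by_not_in)
```

For a plain IN filter, false and null both drop the row, so they can be collapsed. Under NOT IN, a false IN result keeps the row while a null one drops it, so the semi join has to report them separately.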
Hi there, just bumping this thread in case anyone has given it some more thought. Multiple columns in an IN predicate sounds like an interesting feature to have, as it can avoid a lot of complexity, especially when dealing with template-generated queries. |
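Update: there is a way to use multiple columns with WHERE (a, b) IN (SELECT ...), by wrapping the selected columns in parentheses so that they're effectively turned into one single anonymous row.
Example:
SELECT a, b
FROM table
WHERE (a, b)
IN (
SELECT (a, b) -- wrap a, b in parentheses here
FROM table
) |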
The problem is that Presto currently doesn't support the IN expression for
complex types. There were complications, mostly related to the `null` vs `false`
result of a semi join.
On Wed, Sep 11, 2019 at 2:45 AM Jivan Roquet wrote:
Update: there is a way to use multiple-columns with an WHERE (a, b) IN
(SELECT ...), by wrapping the selected rows in parenthesis so as they're
effectively turned into one single anonymous row.
Example:
SELECT a, b
FROM table
WHERE (a, b)
IN (
SELECT (a, b) --- wrap a, b in parenthesis here
FROM table
)
|
How come it works by simply wrapping multiple columns in parentheses, then? |
Support queries like:
It should be easy to implement once #6384 is implemented.