Support multiple columns in IN predicate #6385

kokosing · 2016-10-18T06:25:41Z

Support queries like:

presto:sf1> select count(*) from lineitem where (orderkey, linenumber) IN (SELECT orderkey, linenumber from lineitem);
Query 20161018_062422_00016_uqzsf failed: line 1:60: Multiple columns returned by subquery are not yet supported. Found 2

It should be easy to implement once #6384 got implemented.

The text was updated successfully, but these errors were encountered:

martint · 2016-10-18T13:41:33Z

There's a PR in progress to support this: #5323

arhimondr · 2018-04-30T19:24:27Z

I'm starting working on this

findepi · 2018-04-30T20:09:38Z

@arhimondr do you have plans how tuples with NULLs should be distributed?
IN currently employs "replicate any and null" distribution (unless it's broadcasted of course). In tuple case, at first it seems you need to replicate tuples that contain any NULL value. Is it possible to do better?

Also, how does it play with structural types? What should be the result of map(2 => null) IN ( map(2 => 3) ) ?

arhimondr · 2018-04-30T21:36:17Z

Basically any structural type that contains a null value (no matter how deeply it is enclosed) is considered indeterminete. And IN for indeterminete should behave exactly the same as for NULL

arhimondr · 2018-04-30T21:37:02Z

replicate any and null

I'm going to change that to be replicate any and indeterminete

findepi · 2018-05-01T07:16:06Z

Does it mean array[1, 2, 3] IN ( array[42, null] ) will return null and not false?

arhimondr · 2018-05-01T13:35:12Z

Yes

findepi · 2018-05-01T14:22:55Z

Just note that today:

presto> select array[1, 2, 3] = array[42, null];
 _col0
-------
 false

and IN is defined in terms of equality, so = and IN should remain consistent. (AFAIU... @martint can you comment?)

martint · 2018-05-01T14:39:33Z

Does it mean array[1, 2, 3] IN ( array[42, null] ) will return null and not false?

No, they have different size, so the comparison should return false.

martint · 2018-05-01T14:45:39Z

Basically any structural type that contains a null value (no matter how deeply it is enclosed) is considered indeterminete. And IN for indeterminete should behave exactly the same as for NULL

With respect to how they are distributed, yes. But not for the purpose of comparison for the IN test - that still behaves according to equality semantics

findepi · 2018-05-01T15:22:19Z

Does it mean array[1, 2, 3] IN ( array[42, null] ) will return null and not false?

No, they have different size, so the comparison should return false.

Oh, sure. Would the answer be the same for array[1, 2] and array[42, null]? (array = is left-to-right, so this comparison returns false)
What about array[1, 2] and array[null, 42]? This = comparison fails today, but multi-column IN can result in equivalent case ((1,2) IN ( select null, 42 ))

martint · 2018-05-01T15:33:34Z

IN should have the same semantics as =

arhimondr · 2018-05-02T16:10:05Z

IN should have the same semantics as =

Right, so

array[1, 2, 3] IN ( array[42, null] ) is equal to array[1, 2, 3] = array[42, null]

And yeah,

No, they have different size, so the comparison should return false.

that's how the array comparison is implemented now. It works only because we check the size first. If it is different - it is false.

But when it comes to actually compare the values, we do not support nulls

presto> SELECT ARRAY[1, null] = ARRAY[1, null];
Query 20180502_154031_00004_dwbpc failed: ARRAY comparison not supported for arrays with null elements

Postgree would just answer true in that case.

postgres=# SELECT ARRAY[1, null] = ARRAY[1, null];
 ?column?
----------
 t
(1 row)

I guess since we didn't implement this - it is not covered by the standard.

Null comparisons for null in row are also not supported https://github.com/prestodb/presto/blob/master/presto-spi/src/main/java/com/facebook/presto/spi/type/RowType.java#L263

And it seems that semantics for null in row and semantics for null in multicolumn in are different

postgres=# SELECT (1, cast(null as int)) IN (SELECT 1, cast(null as int));
 ?column?
----------

(1 row)

is equivalent to

postgres=# SELECT 1 = 1 AND null = null;
 ?column?
----------

(1 row)

postgres=# SELECT array[1, cast(null as int)] IN (SELECT array[1, cast(null as int)]);
 ?column?
----------
 t
(1 row)

is equivalent to

postgres=# SELECT array[1, cast(null as int)] = array[1, cast(null as int)];
 ?column?
----------
 t
(1 row)

I didn't check the SQL standard yet (i will), but so far the idea is to change the INDETERMINATE operator for ROW/MAP/ARRAY to return true only if there is a null on the first level. And than to just use it instead of isNull checks in SemiJoin related code.

martint · 2018-05-02T17:29:27Z

*** Update: this comment is incorrect. See follow-up comments below. ***



```
Let A1 and A2 be arrays of EDT. A1 and A2 are identical if and only if A1 and A2
have the same cardinality n and if, for all i in the range 1 (one) ≤ i ≤ n, the element 
at ordinal position i in A1 is identical to the element at ordinal position i in A2.
```

and

```
9.10 Determination of identical values
...

2) Case:
    a) If V1 and V2 are both the null value, then V1 is identical to V2.
```

martint · 2018-05-02T17:37:48Z

Ah, I take that back. That first block talks about arrays being "identical", not "equal", so that's not the operation to be considered for equality.

martint · 2018-05-02T17:43:32Z

Here are the equality rules for arrays:

ii) If the declared types of XV and YV are array types or distinct types whose source types 
    are array types and the cardinalities of XV and YV are N1 and N2, respectively, then 
    let Xi, 1 (one) ≤ i ≤ N1, denote a <value expression> whose value and declared type is 
    that of the i-th element of XV and let Yi, 1 (one) ≤ i ≤ N2, denote a <value expression> 
    whose value and declared type is that of the i-th element of YV. The result of

        X <comp op> Y

    is determined as follows:
    1) X = Y is True if N1 = 0 (zero) and N2 = 0 (zero).
    2) X = Y is True if N1 = N2 and, for all i, Xi = Yi is True.
    3) X = Y is False if and only if one of the following is true:
            A) N1 ≠ N2.
            B) N1 = N2 and for some i, 1 (one) ≤ i ≤ N1, NOT (Xi = Yi) is True.
    4) X <comp op> Y is Unknown if X <comp op> Y is neither True nor False.

So, according to the standard, array [1, null] = array[1, null] => null

martint · 2018-05-02T17:47:00Z

And for rows:

i) If the declared types of XV and YV are row types with degree N, then let Xi, 1 (one) ≤ i ≤ N,
    denote a <value expression> whose value and declared type is that of the i-th  field of XV and let Yi
    denote a <value expression> whose value and declared type is that of the i-th  field of YV.

    The result of

        X <comp op> Y

    is determined as follows:
    1) X = Y is True if and only if Xi = Yi is True for all i.
    2) X < Y is True if and only if Xi = Yi is True for all i < n and Xn < Yn for some n.
    3) X = Y is False if and only if NOT (Xi = Yi) is True for some i.
    4) X < Y is False if and only if X = Y is True or Y < X is True.
    5) X <comp op> Y is Unknown if X <comp op> Y is neither True nor False.

arhimondr · 2018-05-02T17:51:40Z

Yeah, thanks for digging into the standard. I was just about to do so.

So, yeah, PostgreSQL implements array equals not by the standard.

the idea is to change the INDETERMINATE operator for ROW/MAP/ARRAY to return true only if there is a null on the first level

We should leave INDETERMINATE operator implementation as is than.

arhimondr · 2018-05-02T17:59:17Z

And yeah, we cannot rely on INDETERMINATE when comparing, because due to the standard array [1, null] = array[1, null] => null, but array [1, null] = array[2, null] => FALSE

arhimondr · 2018-05-02T22:46:16Z

Little bit of a pseudo code =)

    Boolean in(List<Object> row, Set<List<Object>> searchRowSet)
    {
        if (searchRowSet.isEmpty()) {
            return false;
        }

        if (searchRowSet.containsExact(row)) {
            return true;
        }

        for (List<Object> setElement : searchRowSet) {
            if (nullableEquals(setElement, row) == null) {
                return null;
            }
        }

        return false;
    }

    Boolean nullableEquals(List<Object> setElement, List<Object> values)
    {
        boolean match = true;
        for (int i = 0; i < values.size(); i++) {
            Boolean result = nullableEquals(setElement.get(i), values.get(i));
            if (result == Boolean.FALSE) {
                return false;
            }
            if (result == null) {
                match = false;
            }
        }
        if (!match) {
            return null;
        }
        return true;
    }

    Boolean nullableEquals(Object first, Object second)
    {
        if (first == null || second == null) {
            return null;
        }
        return first.equals(second);
    }

arhimondr · 2018-05-02T22:59:19Z

This algorithm is already implemented in the HashSemiJoinOperator.

On the probe side we need no modification. We always operate on a single row.

On the build side we need to broadcast all the values with nulls at any level (we call them INDETERMINATE). This should be an easy change, since we already doing this for null's.

In that PR we also blocking null values on the probe side. That change was introduced mainly because we didn't broadcast an any row. But @findepi already added it. So, i guess i can just drop the very last commit.

The only unresolved issue there is the nullableEquals. When for basic types it is easy to take care about nulls outside of the equals method, for complex types it is not that easy. Currently equals for complex typed doesn't support null at all.

After implementing a nullableEquals for complex types we can add a rewrite:

SELECT (value1, value2) IN (SELECT column1, column2 FROM table) -> SELECT row(value1, value2) IN (SELECT row(column1, column2) FROM table)

@kokosing @martint @findepi

findepi · 2018-05-03T07:35:07Z

presto> SELECT ARRAY[1, null] = ARRAY[1, null];
Query 20180502_154031_00004_dwbpc failed: ARRAY comparison not supported for arrays with null elements

@arhimondr -- this is correct in a sense this doesn't produce wrong result. It's not (yet) supported.

In that PR we also blocking null values on the probe side. That change was introduced mainly because we didn't broadcast an any row. But @findepi already added it. So, i guess i can just drop the very last commit.

Which PR you refer to?

The only unresolved issue there is the nullableEquals. When for basic types it is easy to take care about nulls outside of the equals method, for complex types it is not that easy. Currently equals for complex typed doesn't support null at all.

Doesn't it mean we need to implement missing support for null elements in equality functions for complex types? I.e. replace "ARRAY comparison not supported for arrays with null elements" with something that follows the standard more closely?

arhimondr · 2018-05-03T18:26:04Z

Which PR you refer to?

#7792

Doesn't it mean we need to implement missing support for null elements in equality functions for complex types? I.e. replace "ARRAY comparison not supported for arrays with null elements" with something that follows the standard more closely?

Yes, exactly.

sopel39 · 2018-05-07T14:48:18Z

On the probe side we need no modification. We always operate on a single row.

This isn't true unfortunately. Consider probe row: array[null, 1] and build row: array[2, 1]. Now you need to make sure that both rows land on the same machine to deduce that the semi join result for array[null, 1] should be null instead of false. In order to do so, you most likely need to broadcast array[null, 1] row to each node. But then you need to reduce such broadcasted probe rows after semi join operator.

So it seems that probe rows need to have unique row id columns before broadcasting. After broadcast there should be an aggregation that will group by unique row id with OR aggregation on semi join symbol.

sopel39 · 2018-05-07T14:50:56Z

E.g probe should be similar to:

Aggregation[UniqueIdSymbol][aggregateSemiJoinSymbol:=OR(semiJoinSymbol)]
   |
SemiJoin (broadcasts probe rows with nulls)
   |
AssignUniqueID

arhimondr · 2018-05-08T01:25:03Z

We already distribute an "any" row. Initially i though we introduced that to cover the "empty in" case, but than i realized that it should cover the probe nulls case as well. For your particular example it would return the correct result. Because the "array[2, 1]" value is gonna be broadcasted as the "any" value. Although It still feels that there's a case that may not be covered, so far i couldn't come up with one. Let me try to follow you through my thoughts. No matter what the build side set is, for an indeterminate value on a probe side only "false" or "null" can be returned. In the SQL "null or false" expression evaluates to null. It is enough a single expression in the IN comparison chain to return "null" to evaluate the entire IN as null, e.g.: value1 IN (null, value2, value3 ... ) is always null. No matter what the rest of the values are. You can get a "null" only when matching an indeterminate with an indeterminate (a value that contains a null somewhere). Since we replicate all the indeterminate values, we shouldn't miss a null. Let me know if I'm missing something.

…

On Mon, May 7, 2018 at 10:51 AM Karol Sobczak ***@***.***> wrote: E.g probe should be similar to: Aggregation[UniqueIdSymbol][aggregateSemiJoinSymbol:=OR(semiJoinSymbol)] | SemiJoin (broadcasts probe rows with nulls) | AssignUniqueID — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#6385 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AFUBrEgaUwO_xoS3gLaebmamEb_pICHuks5twF9dgaJpZM4KZa0o> .

arhimondr · 2018-05-08T01:30:04Z

UPD: I'm slow. You get a "null" also for an "array[1,null] in (array[1,2], array[2,3])". In case array[2,3] is replicated we get a "false", what is incorrect. Yeah, you are totally right, we need a unique row id approach here.

…

On Mon, May 7, 2018 at 5:20 PM Andriy Rosa ***@***.***> wrote: We already distribute an "any" row. Initially i though we introduced that to cover the "empty in" case, but than i realized that it should cover the probe nulls case as well. For your particular example it would return the correct result. Because the "array[2, 1]" value is gonna be broadcasted as the "any" value. Although It still feels that there's a case that may not be covered, so far i couldn't come up with one. Let me try to follow you through my thoughts. No matter what the build side set is, for an indeterminate value on a probe side only "false" or "null" can be returned. In the SQL "null or false" expression evaluates to null. It is enough a single expression in the IN comparison chain to return "null" to evaluate the entire IN as null, e.g.: value1 IN (null, value2, value3 ... ) is always null. No matter what the rest of the values are. You can get a "null" only when matching an indeterminate with an indeterminate (a value that contains a null somewhere). Since we replicate all the indeterminate values, we shouldn't miss a null. Let me know if I'm missing something. On Mon, May 7, 2018 at 10:51 AM Karol Sobczak ***@***.***> wrote: > E.g probe should be similar to: > > Aggregation[UniqueIdSymbol][aggregateSemiJoinSymbol:=OR(semiJoinSymbol)] > | > SemiJoin (broadcasts probe rows with nulls) > | > AssignUniqueID > > — > You are receiving this because you were mentioned. > Reply to this email directly, view it on GitHub > <#6385 (comment)>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/AFUBrEgaUwO_xoS3gLaebmamEb_pICHuks5twF9dgaJpZM4KZa0o> > . >

findepi · 2018-05-08T07:10:00Z

It is enough a single expression in the IN comparison chain to return
"null" to evaluate the entire IN as null, e.g.: value1 IN (null, value2,
value3 ... ) is always null. No matter what the rest of the values are.

unless eg value1 = value3: null or false or true is true

sopel39 · 2018-05-08T09:37:46Z

Yeah, you are totally right, we need a unique row id approach here.

Yes, but it is adds overhead for cases where there are no nulls at all. Maybe instead for now we could just support semi joins which are used by EXISTS (e.g: FilterNode above SemiJoin node)? That is much easier since we don't have to distinguish between null/false semi join result. For cases where semi join is not used filter we can simply use current semantics (e.g: fail when any row is indeterminate [e.g: via filter below semi join]).

On the other hand semi join with majority of build rows being indeterminate might have performance similar to cross join.

Do you want to complete this issue because you want to fix complex type equality and this is prerequisite?

arhimondr · 2018-05-08T12:22:18Z

It is the other way. I want to fix complex type equation as a prerequisite to this.

…

On Tue, May 8, 2018 at 4:38 AM Karol Sobczak ***@***.***> wrote: Yeah, you are totally right, we need a unique row id approach here. Yes, but it is adds overhead for cases where there are no nulls at all. Maybe instead for now we could just support semi joins which are used for row filtering (e.g: FilterNode above SemiJoin node)? That is much easier issue since we don't have to distinguish between null/false semi join result. For cases where semi join is not used filter we can simply use current semantics (e.g: fail when any row is indeterminate (e.g: via filter below semi join)). On the other hand semi join with majority of build rows being indeterminate might have performance similar to cross join. Do you want to complete this issue because you want to fix complex type equality and this is prerequisite? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#6385 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AFUBrPI4cNpro52g7VAjjqrAc_Pcx8Vtks5twWd8gaJpZM4KZa0o> .

arhimondr · 2018-05-09T16:39:20Z

That is much easier since we don't have to distinguish between null/false semi join result.

That's a good point. Actually in many cases you don't have to distinguish between NULL and FALSE for IN query as well.

For example

SELECT *
FROM nation
WHERE nationkey in (SELECT nationkey FROM othertable)

I'm not sure, but it seems that whenever we use an IN subquery as a filter predicate - we shouldn't care about null vs false in many cases, since in predicate we only care if the entire expression is evaluated to true (there are tricky cases with negation though?).

Our optimizer should be smart enough to detect whether we really care about nulls.

findepi · 2018-05-09T16:44:07Z

IN in projection is uncommon. However, in WHERE the IN might be negated. Then we need to distinguish between false result and null

…

JivanRoquet · 2019-09-09T23:50:09Z

Hi there, just upping this thread in case anyone has given some more thoughts into it. Multiple columns in IN predicate sounds like an interesting feature to have as it can avoid a lot of complexity, especially when dealing with template-generated queries.

JivanRoquet · 2019-09-11T09:44:49Z

Update: there is a way to use multiple-columns with an WHERE (a, b) IN (SELECT ...), by wrapping the selected rows in parenthesis so as they're effectively turned into one single anonymous row.

Example:

SELECT a, b
FROM table
WHERE (a, b)
IN (
  SELECT (a, b) --- wrap a, b in parenthesis here
  FROM table
)

Credits: https://stackoverflow.com/a/57876281/2091169

arhimondr · 2019-09-11T16:23:23Z

The problem is that currently Presto doesn't support IN expression for complex types. There were complications mostly related to `null` vs `false` result for semi join.

…

On Wed, Sep 11, 2019 at 2:45 AM Jivan Roquet ***@***.***> wrote: Update: there is a way to use multiple-columns with an WHERE (a, b) IN (SELECT ...), by wrapping the selected rows in parenthesis so as they're effectively turned into one single anonymous row. Example: SELECT a, bFROM tableWHERE (a, b)IN ( SELECT (a, b) --- wrap a, b in parenthesis here FROM table ) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#6385?email_source=notifications&email_token=ABKQDLD7TR3EAORGCAVA7FLQJC427A5CNFSM4CTFVUUKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6N5LFY#issuecomment-530306455>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABKQDLDL62J2DXVVEAJMGXDQJC427ANCNFSM4CTFVUUA> .

JivanRoquet · 2019-09-12T00:40:18Z

The problem is that currently Presto doesn't support IN expression for complex types. There were complications mostly related to null vs false result for semi join.

How comes it works by simply wrapping multiple columns in parenthesis, then?

kokosing mentioned this issue Oct 18, 2016

Support for correlated subqueries #4017

Closed

geraint0923 self-assigned this Oct 18, 2016

kokosing mentioned this issue Apr 20, 2018

too long where clause cause error #10461

Closed

arhimondr assigned arhimondr and unassigned geraint0923 Apr 30, 2018

atris mentioned this issue Jun 27, 2018

Execute EXISTS predicate as SEMI JOIN #6384

Closed

kokosing closed this as completed Mar 3, 2021

Support multiple columns in IN predicate #6385

Support multiple columns in IN predicate #6385

Comments

kokosing commented Oct 18, 2016

martint commented Oct 18, 2016

arhimondr commented Apr 30, 2018

findepi commented Apr 30, 2018

arhimondr commented Apr 30, 2018

arhimondr commented Apr 30, 2018

findepi commented May 1, 2018

arhimondr commented May 1, 2018

findepi commented May 1, 2018

martint commented May 1, 2018

martint commented May 1, 2018

findepi commented May 1, 2018

martint commented May 1, 2018

arhimondr commented May 2, 2018 • edited

martint commented May 2, 2018 • edited

martint commented May 2, 2018

martint commented May 2, 2018 • edited

martint commented May 2, 2018

arhimondr commented May 2, 2018

arhimondr commented May 2, 2018

arhimondr commented May 2, 2018 • edited

arhimondr commented May 2, 2018 • edited

findepi commented May 3, 2018

arhimondr commented May 3, 2018

sopel39 commented May 7, 2018 • edited

sopel39 commented May 7, 2018

arhimondr commented May 8, 2018 via email

arhimondr commented May 8, 2018 via email

findepi commented May 8, 2018

sopel39 commented May 8, 2018 • edited

arhimondr commented May 8, 2018 via email

arhimondr commented May 9, 2018 • edited

findepi commented May 9, 2018 via email

JivanRoquet commented Sep 9, 2019

JivanRoquet commented Sep 11, 2019 • edited

arhimondr commented Sep 11, 2019 via email

JivanRoquet commented Sep 12, 2019 • edited

arhimondr commented May 2, 2018 •

edited

martint commented May 2, 2018 •

edited

martint commented May 2, 2018 •

edited

arhimondr commented May 2, 2018 •

edited

arhimondr commented May 2, 2018 •

edited

sopel39 commented May 7, 2018 •

edited

sopel39 commented May 8, 2018 •

edited

arhimondr commented May 9, 2018 •

edited

JivanRoquet commented Sep 11, 2019 •

edited

JivanRoquet commented Sep 12, 2019 •

edited