Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support multiple columns in IN predicate #6385

Closed
kokosing opened this issue Oct 18, 2016 · 36 comments
Closed

Support multiple columns in IN predicate #6385

kokosing opened this issue Oct 18, 2016 · 36 comments
Assignees

Comments

@kokosing
Copy link
Contributor

Support queries like:

presto:sf1> select count(*) from lineitem where (orderkey, linenumber) IN (SELECT orderkey, linenumber from lineitem);
Query 20161018_062422_00016_uqzsf failed: line 1:60: Multiple columns returned by subquery are not yet supported. Found 2

It should be easy to implement once #6384 got implemented.

@martint
Copy link
Contributor

martint commented Oct 18, 2016

There's a PR in progress to support this: #5323

@arhimondr
Copy link
Member

I'm starting working on this

@arhimondr arhimondr assigned arhimondr and unassigned geraint0923 Apr 30, 2018
@findepi
Copy link
Contributor

findepi commented Apr 30, 2018

@arhimondr do you have plans how tuples with NULLs should be distributed?
IN currently employs "replicate any and null" distribution (unless it's broadcasted of course). In tuple case, at first it seems you need to replicate tuples that contain any NULL value. Is it possible to do better?

Also, how does it play with structural types? What should be the result of map(2 => null) IN ( map(2 => 3) ) ?

@arhimondr
Copy link
Member

Basically any structural type that contains a null value (no matter how deeply it is enclosed) is considered indeterminete. And IN for indeterminete should behave exactly the same as for NULL

@arhimondr
Copy link
Member

replicate any and null

I'm going to change that to be replicate any and indeterminete

@findepi
Copy link
Contributor

findepi commented May 1, 2018

Does it mean array[1, 2, 3] IN ( array[42, null] ) will return null and not false?

@arhimondr
Copy link
Member

Yes

@findepi
Copy link
Contributor

findepi commented May 1, 2018

Just note that today:

presto> select array[1, 2, 3] = array[42, null];
 _col0
-------
 false

and IN is defined in terms of equality, so = and IN should remain consistent. (AFAIU... @martint can you comment?)

@martint
Copy link
Contributor

martint commented May 1, 2018

Does it mean array[1, 2, 3] IN ( array[42, null] ) will return null and not false?

No, they have different size, so the comparison should return false.

@martint
Copy link
Contributor

martint commented May 1, 2018

Basically any structural type that contains a null value (no matter how deeply it is enclosed) is considered indeterminete. And IN for indeterminete should behave exactly the same as for NULL

With respect to how they are distributed, yes. But not for the purpose of comparison for the IN test - that still behaves according to equality semantics

@findepi
Copy link
Contributor

findepi commented May 1, 2018

Does it mean array[1, 2, 3] IN ( array[42, null] ) will return null and not false?

No, they have different size, so the comparison should return false.

Oh, sure. Would the answer be the same for array[1, 2] and array[42, null]? (array = is left-to-right, so this comparison returns false)
What about array[1, 2] and array[null, 42]? This = comparison fails today, but multi-column IN can result in equivalent case ((1,2) IN ( select null, 42 ))

@martint
Copy link
Contributor

martint commented May 1, 2018

IN should have the same semantics as =

@arhimondr
Copy link
Member

arhimondr commented May 2, 2018

IN should have the same semantics as =

Right, so

array[1, 2, 3] IN ( array[42, null] ) is equal to array[1, 2, 3] = array[42, null]

And yeah,

No, they have different size, so the comparison should return false.

that's how the array comparison is implemented now. It works only because we check the size first. If it is different - it is false.

But when it comes to actually compare the values, we do not support nulls

presto> SELECT ARRAY[1, null] = ARRAY[1, null];
Query 20180502_154031_00004_dwbpc failed: ARRAY comparison not supported for arrays with null elements

Postgree would just answer true in that case.

postgres=# SELECT ARRAY[1, null] = ARRAY[1, null];
 ?column?
----------
 t
(1 row)

I guess since we didn't implement this - it is not covered by the standard.

Null comparisons for null in row are also not supported https://github.com/prestodb/presto/blob/master/presto-spi/src/main/java/com/facebook/presto/spi/type/RowType.java#L263

And it seems that semantics for null in row and semantics for null in multicolumn in are different

postgres=# SELECT (1, cast(null as int)) IN (SELECT 1, cast(null as int));
 ?column?
----------

(1 row)

is equivalent to

postgres=# SELECT 1 = 1 AND null = null;
 ?column?
----------

(1 row)
postgres=# SELECT array[1, cast(null as int)] IN (SELECT array[1, cast(null as int)]);
 ?column?
----------
 t
(1 row)

is equivalent to

postgres=# SELECT array[1, cast(null as int)] = array[1, cast(null as int)];
 ?column?
----------
 t
(1 row)

I didn't check the SQL standard yet (i will), but so far the idea is to change the INDETERMINATE operator for ROW/MAP/ARRAY to return true only if there is a null on the first level. And than to just use it instead of isNull checks in SemiJoin related code.

@martint
Copy link
Contributor

martint commented May 2, 2018

*** Update: this comment is incorrect. See follow-up comments below. ***



```
Let A1 and A2 be arrays of EDT. A1 and A2 are identical if and only if A1 and A2
have the same cardinality n and if, for all i in the range 1 (one) ≤ i ≤ n, the element 
at ordinal position i in A1 is identical to the element at ordinal position i in A2.
```

and

```
9.10 Determination of identical values
...

2) Case:
    a) If V1 and V2 are both the null value, then V1 is identical to V2.
```

@martint
Copy link
Contributor

martint commented May 2, 2018

Ah, I take that back. That first block talks about arrays being "identical", not "equal", so that's not the operation to be considered for equality.

@martint
Copy link
Contributor

martint commented May 2, 2018

Here are the equality rules for arrays:

ii) If the declared types of XV and YV are array types or distinct types whose source types 
    are array types and the cardinalities of XV and YV are N1 and N2, respectively, then 
    let Xi, 1 (one) ≤ i ≤ N1, denote a <value expression> whose value and declared type is 
    that of the i-th element of XV and let Yi, 1 (one) ≤ i ≤ N2, denote a <value expression> 
    whose value and declared type is that of the i-th element of YV. The result of

        X <comp op> Y

    is determined as follows:
    1) X = Y is True if N1 = 0 (zero) and N2 = 0 (zero).
    2) X = Y is True if N1 = N2 and, for all i, Xi = Yi is True.
    3) X = Y is False if and only if one of the following is true:
            A) N1 ≠ N2.
            B) N1 = N2 and for some i, 1 (one) ≤ i ≤ N1, NOT (Xi = Yi) is True.
    4) X <comp op> Y is Unknown if X <comp op> Y is neither True nor False.

So, according to the standard, array [1, null] = array[1, null] => null

@martint
Copy link
Contributor

martint commented May 2, 2018

And for rows:

i) If the declared types of XV and YV are row types with degree N, then let Xi, 1 (one) ≤ i ≤ N,
    denote a <value expression> whose value and declared type is that of the i-th  field of XV and let Yi
    denote a <value expression> whose value and declared type is that of the i-th  field of YV.

    The result of

        X <comp op> Y

    is determined as follows:
    1) X = Y is True if and only if Xi = Yi is True for all i.
    2) X < Y is True if and only if Xi = Yi is True for all i < n and Xn < Yn for some n.
    3) X = Y is False if and only if NOT (Xi = Yi) is True for some i.
    4) X < Y is False if and only if X = Y is True or Y < X is True.
    5) X <comp op> Y is Unknown if X <comp op> Y is neither True nor False.

@arhimondr
Copy link
Member

Yeah, thanks for digging into the standard. I was just about to do so.

So, yeah, PostgreSQL implements array equals not by the standard.

the idea is to change the INDETERMINATE operator for ROW/MAP/ARRAY to return true only if there is a null on the first level

We should leave INDETERMINATE operator implementation as is than.

@arhimondr
Copy link
Member

And yeah, we cannot rely on INDETERMINATE when comparing, because due to the standard array [1, null] = array[1, null] => null, but array [1, null] = array[2, null] => FALSE

@arhimondr
Copy link
Member

arhimondr commented May 2, 2018

Little bit of a pseudo code =)

    Boolean in(List<Object> row, Set<List<Object>> searchRowSet)
    {
        if (searchRowSet.isEmpty()) {
            return false;
        }

        if (searchRowSet.containsExact(row)) {
            return true;
        }

        for (List<Object> setElement : searchRowSet) {
            if (nullableEquals(setElement, row) == null) {
                return null;
            }
        }

        return false;
    }

    Boolean nullableEquals(List<Object> setElement, List<Object> values)
    {
        boolean match = true;
        for (int i = 0; i < values.size(); i++) {
            Boolean result = nullableEquals(setElement.get(i), values.get(i));
            if (result == Boolean.FALSE) {
                return false;
            }
            if (result == null) {
                match = false;
            }
        }
        if (!match) {
            return null;
        }
        return true;
    }

    Boolean nullableEquals(Object first, Object second)
    {
        if (first == null || second == null) {
            return null;
        }
        return first.equals(second);
    }

@arhimondr
Copy link
Member

arhimondr commented May 2, 2018

This algorithm is already implemented in the HashSemiJoinOperator.

On the probe side we need no modification. We always operate on a single row.

On the build side we need to broadcast all the values with nulls at any level (we call them INDETERMINATE). This should be an easy change, since we already doing this for null's.

In that PR we also blocking null values on the probe side. That change was introduced mainly because we didn't broadcast an any row. But @findepi already added it. So, i guess i can just drop the very last commit.

The only unresolved issue there is the nullableEquals. When for basic types it is easy to take care about nulls outside of the equals method, for complex types it is not that easy. Currently equals for complex typed doesn't support null at all.

After implementing a nullableEquals for complex types we can add a rewrite:

SELECT (value1, value2) IN (SELECT column1, column2 FROM table) -> SELECT row(value1, value2) IN (SELECT row(column1, column2) FROM table)

@kokosing @martint @findepi

@findepi
Copy link
Contributor

findepi commented May 3, 2018

presto> SELECT ARRAY[1, null] = ARRAY[1, null];
Query 20180502_154031_00004_dwbpc failed: ARRAY comparison not supported for arrays with null elements

@arhimondr -- this is correct in a sense this doesn't produce wrong result. It's not (yet) supported.

In that PR we also blocking null values on the probe side. That change was introduced mainly because we didn't broadcast an any row. But @findepi already added it. So, i guess i can just drop the very last commit.

Which PR you refer to?

The only unresolved issue there is the nullableEquals. When for basic types it is easy to take care about nulls outside of the equals method, for complex types it is not that easy. Currently equals for complex typed doesn't support null at all.

Doesn't it mean we need to implement missing support for null elements in equality functions for complex types? I.e. replace "ARRAY comparison not supported for arrays with null elements" with something that follows the standard more closely?

@arhimondr
Copy link
Member

Which PR you refer to?

#7792

Doesn't it mean we need to implement missing support for null elements in equality functions for complex types? I.e. replace "ARRAY comparison not supported for arrays with null elements" with something that follows the standard more closely?

Yes, exactly.

@sopel39
Copy link
Contributor

sopel39 commented May 7, 2018

On the probe side we need no modification. We always operate on a single row.

This isn't true unfortunately. Consider probe row: array[null, 1] and build row: array[2, 1]. Now you need to make sure that both rows land on the same machine to deduce that the semi join result for array[null, 1] should be null instead of false. In order to do so, you most likely need to broadcast array[null, 1] row to each node. But then you need to reduce such broadcasted probe rows after semi join operator.

So it seems that probe rows need to have unique row id columns before broadcasting. After broadcast there should be an aggregation that will group by unique row id with OR aggregation on semi join symbol.

@sopel39
Copy link
Contributor

sopel39 commented May 7, 2018

E.g probe should be similar to:

Aggregation[UniqueIdSymbol][aggregateSemiJoinSymbol:=OR(semiJoinSymbol)]
   |
SemiJoin (broadcasts probe rows with nulls)
   |
AssignUniqueID

@arhimondr
Copy link
Member

arhimondr commented May 8, 2018 via email

@arhimondr
Copy link
Member

arhimondr commented May 8, 2018 via email

@findepi
Copy link
Contributor

findepi commented May 8, 2018

It is enough a single expression in the IN comparison chain to return
"null" to evaluate the entire IN as null, e.g.: value1 IN (null, value2,
value3 ... ) is always null. No matter what the rest of the values are.

unless eg value1 = value3: null or false or true is true

@sopel39
Copy link
Contributor

sopel39 commented May 8, 2018

Yeah, you are totally right, we need a unique row id approach here.

Yes, but it is adds overhead for cases where there are no nulls at all. Maybe instead for now we could just support semi joins which are used by EXISTS (e.g: FilterNode above SemiJoin node)? That is much easier since we don't have to distinguish between null/false semi join result. For cases where semi join is not used filter we can simply use current semantics (e.g: fail when any row is indeterminate [e.g: via filter below semi join]).

On the other hand semi join with majority of build rows being indeterminate might have performance similar to cross join.

Do you want to complete this issue because you want to fix complex type equality and this is prerequisite?

@arhimondr
Copy link
Member

arhimondr commented May 8, 2018 via email

@arhimondr
Copy link
Member

arhimondr commented May 9, 2018

That is much easier since we don't have to distinguish between null/false semi join result.

That's a good point. Actually in many cases you don't have to distinguish between NULL and FALSE for IN query as well.

For example

SELECT *
FROM nation
WHERE nationkey in (SELECT nationkey FROM othertable)

I'm not sure, but it seems that whenever we use an IN subquery as a filter predicate - we shouldn't care about null vs false in many cases, since in predicate we only care if the entire expression is evaluated to true (there are tricky cases with negation though?).

Our optimizer should be smart enough to detect whether we really care about nulls.

@findepi
Copy link
Contributor

findepi commented May 9, 2018 via email

@JivanRoquet
Copy link

Hi there, just upping this thread in case anyone has given some more thoughts into it. Multiple columns in IN predicate sounds like an interesting feature to have as it can avoid a lot of complexity, especially when dealing with template-generated queries.

@JivanRoquet
Copy link

JivanRoquet commented Sep 11, 2019

Update: there is a way to use multiple-columns with an WHERE (a, b) IN (SELECT ...), by wrapping the selected rows in parenthesis so as they're effectively turned into one single anonymous row.

Example:

SELECT a, b
FROM table
WHERE (a, b)
IN (
  SELECT (a, b) --- wrap a, b in parenthesis here
  FROM table
)

Credits: https://stackoverflow.com/a/57876281/2091169

@arhimondr
Copy link
Member

arhimondr commented Sep 11, 2019 via email

@JivanRoquet
Copy link

JivanRoquet commented Sep 12, 2019

The problem is that currently Presto doesn't support IN expression for complex types. There were complications mostly related to null vs false result for semi join.

How comes it works by simply wrapping multiple columns in parenthesis, then?

@kokosing kokosing closed this as completed Mar 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants