Add array_union UDF to Presto #5644

aaronshan · 2016-07-13T11:20:46Z

add array_union udf function and unit tests, description

facebook-github-bot · 2016-07-13T11:21:05Z

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla - and if you have received this in error or have any questions, please drop us a line at cla@fb.com. Thanks!

ghost · 2016-07-13T18:01:50Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

cberner · 2016-07-13T21:58:46Z

I think we already have this functionality as array_distinct(concat(x, y)). Does that work for you?

aaronshan · 2016-07-14T02:12:50Z

@cberner When I use array_distinct(concat(x, y)) and array_union(x,y) to union an Array with an empty Array:

presto:default> select array_distinct(concat(arr1, arr2)) from (values (ARRAY[5, NULL], ARRAY[])) as t(arr1, arr2);
Query 20160714_023231_00018_spzxe failed: line 1:23: Could not choose a best candidate operator. Explicit type casts must be added.
Candidates are:
     * concat(array(integer),array(integer)):array(integer)
     * concat(array(integer),array(array(integer))):array(array(integer))

select array_distinct(concat(arr1, arr2)) from (values (ARRAY[5, NULL], ARRAY[])) as t(arr1, arr2)

presto:default> select array_union(arr1, arr2) from (values (ARRAY[5, NULL], ARRAY[])) as t(arr1, arr2);
   _col0
-----------
 [5, null]
(1 row)

Query 20160714_023238_00019_spzxe, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

And I think that array_union(x,y) function is more convenient and more unambiguous than array_distinct(concat(x, y)), if we can use one function resolve it, why we need use two? 😜

electrum · 2016-07-14T02:56:32Z

The opposite argument is that we would have N^2 or more functions. We avoid adding every combination of things unless there is a compelling reason, since it adds complexity for users and more code to maintain. Once we add a function, it can't be removed without breaking everyone using it.

aaronshan · 2016-07-14T04:37:43Z

@electrum Thanks for your replay. I agree with your opinion. maybe, I should use array_union by a plugin (I already do it). 😁
another, It also has a bug when concat an Array with an empty Array:

presto:default> select concat(arr1, arr2) from (values (ARRAY[5, NULL], ARRAY[])) as t(arr1, arr2);
Query 20160714_043222_00003_588eg failed: line 1:8: Could not choose a best candidate operator. Explicit type casts must be added.
Candidates are:
     * concat(array(integer),array(integer)):array(integer)
     * concat(array(integer),array(array(integer))):array(array(integer))

select concat(arr1, arr2) from (values (ARRAY[5, NULL], ARRAY[])) as t(arr1, arr2)

electrum · 2016-07-14T05:38:32Z

The empty array problem definitely looks like a bug, or something that could be fixed, though there might be difficult / impossible to fix generically. Can you file a separate issue for that?

I'm curious how that actually occurs in practice. It should only happen for a literal (i.e. the query contains exactly ARRAY[]), so you'd either have to type that or it could come from code generation (which could be modified to add a cast).

electrum · 2016-07-14T05:45:32Z

I also wanted to say thanks for submitting a great pull request, complete with tests and documentation.

We can certainly be convinced to add "redundant" functionality when the convenience outweighs the cost. For example, we added date(ts) for converting timestamp to date as a shorter version of CAST(ts AS date), since it's something that I personally use all the time when writing queries and missed from other databases (from a code perspective this merely required adding an alias for the existing operator).

From a performance perspective, this would be better, since it avoids creating the array twice. In theory, the other form could be optimized, though that adds even more code than just adding a new function.

martint · 2016-07-14T06:06:55Z

The empty array issue is because the concat function is ambiguous. It allows these two forms:

concat(array(T), array(T))
concat(T, array(T))

In the second case, if T is some array type (e.g., array(U)), the invocation looks like:

concat(array(U), array(array(U))

Since ARRAY[] is a valid form for the second argument in both versions, the engine doesn't know which one to use.

Arguably, this was a mistake when the functions were defined. We shouldn't allow that kind of overload.

martint · 2016-07-14T06:20:20Z

I filed #5662 to track the issue with empty array.

aaronshan · 2016-07-14T09:22:08Z

@electrum yeah, you are right. I agree with you completely. I learn more about design method from you. thanks!

aaronshan · 2016-07-14T09:24:00Z

@electrum If you use array_remove and array_intersect, maybe you will get ARRAY[]

presto:default> select arr1 = ARRAY[],arr2 = ARRAY[] from (values (array_remove(ARRAY['a'], 'a'), array_intersect(ARRAY['a'], ARRAY['b']))) as t(arr1, arr2);
 _col0 | _col1
-------+-------
 true  | true
(1 row)

Query 20160714_092119_00020_6pxur, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

aaronshan · 2016-07-14T09:36:33Z

@electrum @cberner
And when you use array_distinct(concat(arr1, arr2)), if arr1 or arr2 is null, it will also get error result.

presto:default> select array_distinct(concat(arr1, arr2)) from (values (ARRAY[5, NULL], null)) as t(arr1, arr2);
Query 20160714_093036_00025_6pxur failed: line 1:23: Could not choose a best candidate operator. Explicit type casts must be added.
Candidates are:
     * concat(array(integer),array(integer)):array(integer)
     * concat(array(integer),integer):array(integer)
     * concat(array(integer),array(array(integer))):array(array(integer))

select array_distinct(concat(arr1, arr2)) from (values (ARRAY[5, NULL], null)) as t(arr1, arr2)

and if you use array_union, it will get right result:

presto:default> select array_union(arr1, arr2) from (values (ARRAY[5, NULL], null)) as t(arr1, arr2);
   _col0
-----------
 [5, null]
(1 row)

Query 20160714_093051_00026_6pxur, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

cberner · 2016-07-14T22:33:40Z

presto-main/src/main/java/com/facebook/presto/operator/scalar/ArrayUnionFunction.java

+    @SqlType("array(E)")
+    public static Block union(
+            @TypeParameter("E") Type type,
+            @Nullable @SqlType("array(E)") Block leftArray,


I don't think skipping NULL s is the right semantics. Generally, scalar functions return NULL when any of their inputs are NULL because NULL represents an unknown value, and so in this case we don't know what's in the union when we don't know whats in the array on one side.

@cberner Thanks for your replay. I just think NULL represents an unknown value or nothing, when a not NULL array union with NULL， and it should be return not NULL array but NULL. This is different with array_intersect.
If we don't do this, to make sure to get right result, we will have to use like:

array_union(if(arr1 is null, ARRAY[], arr1), if(arr2 is null, ARRAY[], arr2))

This is so complex.

How does distinct(concat(...)) behave with nulls? This should be the same.

@electrum now, we can't concat array with null, it has some issue. you can see #5662 .

cberner · 2016-07-14T22:34:28Z

Seems reasonable to add this, and discussed with @martint. I don't think the NULL semantics are right though

aaronshan · 2016-07-15T10:33:23Z

@cberner @electrum I had already revise the code, because I find that if the two inputs all are NULL, array_union has some issue. So, to make sure get right result. I also have to need use it like:

array_union(if(arr1 is null, ARRAY[], arr1), if(arr2 is null, ARRAY[], arr2))

This is same with only one NULL input.😂😂😂

cberner · 2016-07-15T17:53:15Z

presto-main/src/main/java/com/facebook/presto/operator/scalar/ArrayUnionFunction.java

+                    blockBuilder.appendNull();
+                    continue;
+                }
+                long value = BIGINT.getLong(array, i);


You can reach this point even if array.isNull(i) == true, if there are multiple NULL values in the array, in which case you'll read garbage data

cberner · 2016-07-15T17:55:30Z

You can use coalesce instead of if which will save a little typing, but yes NULL generally need special handling, or should be filtered out in SQL queries.
I had some more comments. Also can you squash your commits together into a single commit?

cberner · 2016-07-19T16:34:20Z

presto-main/src/main/java/com/facebook/presto/operator/scalar/ArrayUnionFunction.java

+        int rightArrayCount = rightArray.getPositionCount();
+        LongSet set = new LongOpenHashSet(leftArrayCount + rightArrayCount);
+        BlockBuilder distinctElementBlockBuilder = BIGINT.createBlockBuilder(new BlockBuilderStatus(), leftArrayCount + rightArrayCount);
+        BooleanHolder containsNull = new BooleanHolder(false);


Lets not use this class. I'm not sure why we have CORBA in our classpath, but I don't think we should be using it. An AtomicBoolean is probably fine here

aaronshan · 2016-07-20T03:50:18Z

@cberner Thanks for your advice. I had already revise it.

cberner · 2016-07-20T17:42:51Z

Merged. Thanks!

ghost added the CLA Signed label Jul 13, 2016

geraint0923 assigned cberner Jul 13, 2016

ghost added the CLA Signed label Jul 13, 2016

cberner added the question label Jul 13, 2016

cberner removed their assignment Jul 13, 2016

electrum added syntax-needs-review and removed question labels Jul 14, 2016

ghost added the CLA Signed label Jul 14, 2016

cberner reviewed Jul 14, 2016
View reviewed changes

cberner added changes-requested and removed syntax-needs-review labels Jul 14, 2016

ghost added CLA Signed labels Jul 14, 2016

ghost added the CLA Signed label Jul 15, 2016

cberner reviewed Jul 15, 2016
View reviewed changes

ghost added the CLA Signed label Jul 16, 2016

aaronshan force-pushed the master branch from 9a50613 to a38bd73 Compare July 18, 2016 13:54

ghost added the CLA Signed label Jul 18, 2016

cberner reviewed Jul 19, 2016
View reviewed changes

aaronshan force-pushed the master branch 2 times, most recently from 39c4230 to 31c8439 Compare July 20, 2016 03:34

facebook-github-bot added the CLA Signed label Jul 20, 2016

aaronshan force-pushed the master branch from 31c8439 to 4965d87 Compare July 20, 2016 03:46

facebook-github-bot added the CLA Signed label Jul 20, 2016