Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add array_union UDF to Presto #5644

Closed
wants to merge 1 commit into from
Closed

Conversation

aaronshan
Copy link

add array_union udf function and unit tests, description

@facebook-github-bot
Copy link
Collaborator

Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla - and if you have received this in error or have any questions, please drop us a line at cla@fb.com. Thanks!

@ghost
Copy link

ghost commented Jul 13, 2016

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@ghost ghost added the CLA Signed label Jul 13, 2016
@cberner
Copy link
Contributor

cberner commented Jul 13, 2016

I think we already have this functionality as array_distinct(concat(x, y)). Does that work for you?

@cberner cberner removed their assignment Jul 13, 2016
@aaronshan
Copy link
Author

aaronshan commented Jul 14, 2016

@cberner When I use array_distinct(concat(x, y)) and array_union(x,y) to union an Array with an empty Array:

presto:default> select array_distinct(concat(arr1, arr2)) from (values (ARRAY[5, NULL], ARRAY[])) as t(arr1, arr2);
Query 20160714_023231_00018_spzxe failed: line 1:23: Could not choose a best candidate operator. Explicit type casts must be added.
Candidates are:
     * concat(array(integer),array(integer)):array(integer)
     * concat(array(integer),array(array(integer))):array(array(integer))

select array_distinct(concat(arr1, arr2)) from (values (ARRAY[5, NULL], ARRAY[])) as t(arr1, arr2)
presto:default> select array_union(arr1, arr2) from (values (ARRAY[5, NULL], ARRAY[])) as t(arr1, arr2);
   _col0
-----------
 [5, null]
(1 row)

Query 20160714_023238_00019_spzxe, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

And I think that array_union(x,y) function is more convenient and more unambiguous than array_distinct(concat(x, y)), if we can use one function resolve it, why we need use two? 😜

@electrum
Copy link
Contributor

electrum commented Jul 14, 2016 via email

@aaronshan
Copy link
Author

@electrum Thanks for your replay. I agree with your opinion. maybe, I should use array_union by a plugin (I already do it). 😁
another, It also has a bug when concat an Array with an empty Array:

presto:default> select concat(arr1, arr2) from (values (ARRAY[5, NULL], ARRAY[])) as t(arr1, arr2);
Query 20160714_043222_00003_588eg failed: line 1:8: Could not choose a best candidate operator. Explicit type casts must be added.
Candidates are:
     * concat(array(integer),array(integer)):array(integer)
     * concat(array(integer),array(array(integer))):array(array(integer))

select concat(arr1, arr2) from (values (ARRAY[5, NULL], ARRAY[])) as t(arr1, arr2)

@electrum
Copy link
Contributor

The empty array problem definitely looks like a bug, or something that could be fixed, though there might be difficult / impossible to fix generically. Can you file a separate issue for that?

I'm curious how that actually occurs in practice. It should only happen for a literal (i.e. the query contains exactly ARRAY[]), so you'd either have to type that or it could come from code generation (which could be modified to add a cast).

@electrum
Copy link
Contributor

I also wanted to say thanks for submitting a great pull request, complete with tests and documentation.

We can certainly be convinced to add "redundant" functionality when the convenience outweighs the cost. For example, we added date(ts) for converting timestamp to date as a shorter version of CAST(ts AS date), since it's something that I personally use all the time when writing queries and missed from other databases (from a code perspective this merely required adding an alias for the existing operator).

From a performance perspective, this would be better, since it avoids creating the array twice. In theory, the other form could be optimized, though that adds even more code than just adding a new function.

@martint
Copy link
Contributor

martint commented Jul 14, 2016

The empty array issue is because the concat function is ambiguous. It allows these two forms:

concat(array(T), array(T))
concat(T, array(T))

In the second case, if T is some array type (e.g., array(U)), the invocation looks like:

concat(array(U), array(array(U))

Since ARRAY[] is a valid form for the second argument in both versions, the engine doesn't know which one to use.

Arguably, this was a mistake when the functions were defined. We shouldn't allow that kind of overload.

@martint
Copy link
Contributor

martint commented Jul 14, 2016

I filed #5662 to track the issue with empty array.

@aaronshan
Copy link
Author

@electrum yeah, you are right. I agree with you completely. I learn more about design method from you. thanks!

@aaronshan
Copy link
Author

@electrum If you use array_remove and array_intersect, maybe you will get ARRAY[]

presto:default> select arr1 = ARRAY[],arr2 = ARRAY[] from (values (array_remove(ARRAY['a'], 'a'), array_intersect(ARRAY['a'], ARRAY['b']))) as t(arr1, arr2);
 _col0 | _col1
-------+-------
 true  | true
(1 row)

Query 20160714_092119_00020_6pxur, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

@aaronshan
Copy link
Author

@electrum @cberner
And when you use array_distinct(concat(arr1, arr2)), if arr1 or arr2 is null, it will also get error result.

presto:default> select array_distinct(concat(arr1, arr2)) from (values (ARRAY[5, NULL], null)) as t(arr1, arr2);
Query 20160714_093036_00025_6pxur failed: line 1:23: Could not choose a best candidate operator. Explicit type casts must be added.
Candidates are:
     * concat(array(integer),array(integer)):array(integer)
     * concat(array(integer),integer):array(integer)
     * concat(array(integer),array(array(integer))):array(array(integer))

select array_distinct(concat(arr1, arr2)) from (values (ARRAY[5, NULL], null)) as t(arr1, arr2)

and if you use array_union, it will get right result:

presto:default> select array_union(arr1, arr2) from (values (ARRAY[5, NULL], null)) as t(arr1, arr2);
   _col0
-----------
 [5, null]
(1 row)

Query 20160714_093051_00026_6pxur, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

@ghost ghost added the CLA Signed label Jul 14, 2016
@SqlType("array(E)")
public static Block union(
@TypeParameter("E") Type type,
@Nullable @SqlType("array(E)") Block leftArray,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think skipping NULL s is the right semantics. Generally, scalar functions return NULL when any of their inputs are NULL because NULL represents an unknown value, and so in this case we don't know what's in the union when we don't know whats in the array on one side.

Copy link
Author

@aaronshan aaronshan Jul 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cberner Thanks for your replay. I just think NULL represents an unknown value or nothing, when a not NULL array union with NULL, and it should be return not NULL array but NULL. This is different with array_intersect.
If we don't do this, to make sure to get right result, we will have to use like:

array_union(if(arr1 is null, ARRAY[], arr1), if(arr2 is null, ARRAY[], arr2))

This is so complex.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

@aaronshan aaronshan Jul 15, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@electrum now, we can't concat array with null, it has some issue. you can see #5662 .

@cberner
Copy link
Contributor

cberner commented Jul 14, 2016

Seems reasonable to add this, and discussed with @martint. I don't think the NULL semantics are right though

@ghost ghost added CLA Signed labels Jul 14, 2016
@aaronshan
Copy link
Author

@cberner @electrum I had already revise the code, because I find that if the two inputs all are NULL, array_union has some issue. So, to make sure get right result. I also have to need use it like:

array_union(if(arr1 is null, ARRAY[], arr1), if(arr2 is null, ARRAY[], arr2))

This is same with only one NULL input.😂😂😂

@ghost ghost added the CLA Signed label Jul 15, 2016
blockBuilder.appendNull();
continue;
}
long value = BIGINT.getLong(array, i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can reach this point even if array.isNull(i) == true, if there are multiple NULL values in the array, in which case you'll read garbage data

@cberner
Copy link
Contributor

cberner commented Jul 15, 2016

You can use coalesce instead of if which will save a little typing, but yes NULL generally need special handling, or should be filtered out in SQL queries.
I had some more comments. Also can you squash your commits together into a single commit?

int rightArrayCount = rightArray.getPositionCount();
LongSet set = new LongOpenHashSet(leftArrayCount + rightArrayCount);
BlockBuilder distinctElementBlockBuilder = BIGINT.createBlockBuilder(new BlockBuilderStatus(), leftArrayCount + rightArrayCount);
BooleanHolder containsNull = new BooleanHolder(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets not use this class. I'm not sure why we have CORBA in our classpath, but I don't think we should be using it. An AtomicBoolean is probably fine here

@aaronshan
Copy link
Author

@cberner Thanks for your advice. I had already revise it.

@cberner
Copy link
Contributor

cberner commented Jul 20, 2016

Merged. Thanks!

@cberner cberner closed this Jul 20, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants