Add array_union UDF to Presto #5644
Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. In order for us to review and merge your code, please sign up at https://code.facebook.com/cla - and if you have received this in error or have any questions, please drop us a line at cla@fb.com. Thanks! |
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks! |
I think we already have this functionality as |
@cberner When I use
And I think that |
The opposite argument is that we would have N^2 or more functions. We avoid adding every combination of things unless there is a compelling reason, since it adds complexity for users and more code to maintain. Once we add a function, it can't be removed without breaking everyone using it.
|
@electrum Thanks for your reply. I agree with your opinion. Maybe I should use
|
The empty array problem definitely looks like a bug, or something that could be fixed, though it might be difficult or impossible to fix generically. Can you file a separate issue for that? I'm curious how that actually occurs in practice. It should only happen for a literal (i.e. the query contains exactly |
I also wanted to say thanks for submitting a great pull request, complete with tests and documentation. We can certainly be convinced to add "redundant" functionality when the convenience outweighs the cost. For example, we added From a performance perspective, this would be better, since it avoids creating the array twice. In theory, the other form could be optimized, though that adds even more code than just adding a new function. |
The empty array issue is because the concat function is ambiguous. It allows these two forms:
In the second case, if T is some array type (e.g.,
Since ARRAY[] is a valid form for the second argument in both versions, the engine doesn't know which one to use. Arguably, this was a mistake when the functions were defined. We shouldn't allow that kind of overload. |
I filed #5662 to track the issue with empty array. |
@electrum Yeah, you are right. I agree with you completely. I learned a lot about design from you. Thanks! |
@electrum If you use
|
@electrum @cberner
and if you use
|
@SqlType("array(E)")
public static Block union(
@TypeParameter("E") Type type,
@Nullable @SqlType("array(E)") Block leftArray,
I don't think skipping NULLs is the right semantics. Generally, scalar functions return NULL when any of their inputs are NULL because NULL represents an unknown value, and so in this case we don't know what's in the union when we don't know what's in the array on one side.
@cberner Thanks for your reply. I think NULL represents an unknown value or nothing, so when a non-NULL array is unioned with NULL, the result should be the non-NULL array rather than NULL. This is different from array_intersect.
If we don't do this, then to be sure of getting the right result we would have to write something like:
array_union(if(arr1 is null, ARRAY[], arr1), if(arr2 is null, ARRAY[], arr2))
That is too complex.
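To make the two positions in this thread concrete, here is a minimal Java sketch. It is not the Presto implementation: `List<Long>` stands in for a SQL array, and the class and method names (`NullSemanticsSketch`, `unionStrict`, `unionLenient`) are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;

class NullSemanticsSketch {
    // Reviewer's position: a NULL input is an unknown array, so the union is unknown too.
    static List<Long> unionStrict(List<Long> left, List<Long> right) {
        if (left == null || right == null) {
            return null;
        }
        // LinkedHashSet removes duplicates while keeping first-seen order.
        LinkedHashSet<Long> distinct = new LinkedHashSet<>(left);
        distinct.addAll(right);
        return new ArrayList<>(distinct);
    }

    // Author's position: a NULL input behaves like an empty array,
    // which is what the if(... is null, ARRAY[], ...) workaround spells out by hand.
    static List<Long> unionLenient(List<Long> left, List<Long> right) {
        List<Long> safeLeft = (left == null) ? Collections.<Long>emptyList() : left;
        List<Long> safeRight = (right == null) ? Collections.<Long>emptyList() : right;
        return unionStrict(safeLeft, safeRight);
    }
}
```

With the strict form, callers who want the lenient behavior must wrap each argument themselves; with the lenient form, callers lose the ability to distinguish an unknown array from an empty one.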
Seems reasonable to add this, and discussed with @martint. I don't think the |
@cberner @electrum I have already revised the code, because I found that if the two inputs are both
This is the same as with only one |
blockBuilder.appendNull();
continue;
}
long value = BIGINT.getLong(array, i);
You can reach this point even if array.isNull(i) == true, if there are multiple NULL values in the array, in which case you'll read garbage data.
You can use |
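The failure mode described in this comment can be sketched in plain Java. This is an illustration under stated assumptions, not the code from the pull request: `Long[]` replaces Presto's `Block`, and `ArrayUnionSketch` is a hypothetical name. The point is only the control flow: the null check comes first and skips every NULL element, while a separate flag decides whether a NULL is emitted.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ArrayUnionSketch {
    // Hypothetical stand-in for the BIGINT union: boxed Long[] instead of Block.
    static List<Long> union(Long[] left, Long[] right) {
        Set<Long> seen = new HashSet<>();
        List<Long> result = new ArrayList<>();
        boolean containsNull = false;
        for (Long[] array : new Long[][] {left, right}) {
            for (Long element : array) {
                if (element == null) {
                    // Emit NULL only once, but skip on EVERY null element so that
                    // a second NULL never falls through to the value-reading path.
                    if (!containsNull) {
                        containsNull = true;
                        result.add(null);
                    }
                    continue;
                }
                // Set.add returns false for values that were already seen.
                if (seen.add(element)) {
                    result.add(element);
                }
            }
        }
        return result;
    }
}
```

Folding the "already saw a NULL" test into the same condition as the null check is what lets a repeated NULL reach the read of the element value.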
int rightArrayCount = rightArray.getPositionCount();
LongSet set = new LongOpenHashSet(leftArrayCount + rightArrayCount);
BlockBuilder distinctElementBlockBuilder = BIGINT.createBlockBuilder(new BlockBuilderStatus(), leftArrayCount + rightArrayCount);
BooleanHolder containsNull = new BooleanHolder(false);
Let's not use this class. I'm not sure why we have CORBA in our classpath, but I don't think we should be using it. An AtomicBoolean is probably fine here.
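As a rough sketch of that suggestion, `java.util.concurrent.atomic.AtomicBoolean` can serve as a mutable flag that a helper method updates on behalf of its caller. The class and helper names below (`NullFlagSketch`, `appendNullOnce`) are hypothetical and are not taken from the pull request; `List<Long>` stands in for the block builder.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

class NullFlagSketch {
    // Appends a single NULL the first time it is called for a given flag;
    // later calls with the same flag are no-ops.
    static void appendNullOnce(List<Long> out, AtomicBoolean containsNull) {
        // compareAndSet flips false -> true and reports whether this call did the flip.
        if (containsNull.compareAndSet(false, true)) {
            out.add(null);
        }
    }
}
```

The atomicity is not needed for a single-threaded scalar function; the class is used here only because it is a standard, mutable boolean wrapper.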
39c4230 to 31c8439
@cberner Thanks for your advice. I have already revised it. |
3f34f97 to b209842
Merged. Thanks! |
add array_union udf function and unit tests, description