BUG: Fix overwrite logic to account for DestructColumn inside mutate API #2636
this has user visible effects right? if so can you add a release note
Please add an overview to the PR
Left comments.
LGTM!
thanks @emilyreff7 very nice!
Overview
Currently, the mutate API method in ibis checks whether any assigned column overwrites an existing column in the table. It then uses the result to select, from the assignment exprs, the column expr that does the overwriting. This also ensures that no duplicate columns are detected during schema computation.
However, this logic does not properly account for an overwrite that occurs within a DestructColumn, which is used to specify the output of a multi-column UDF. Since a DestructColumn by design cannot be named (it can represent many columns internally), the current logic does not detect when a column inside a DestructColumn overwrites an existing column. The result is an error at expression creation time, because duplicate columns are detected in the schema.
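The failure mode can be modelled in plain Python. This is only an illustrative sketch, not the ibis implementation: `naive_mutate_schema`, the `(name, produced_columns)` tuples, and the `ValueError` are all hypothetical stand-ins, with `name=None` playing the role of an unnamed DestructColumn.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

# (name, produced_columns); name is None for a DestructColumn stand-in,
# because such an expr cannot be named.
Assignment = Tuple[Optional[str], Sequence[str]]


def naive_mutate_schema(
    table_columns: Sequence[str], assignments: Sequence[Assignment]
) -> List[str]:
    """Detect overwrites by expr name only, as the old logic is described."""
    overwriting = {name for name, _ in assignments if name in table_columns}

    schema = list(table_columns)  # overwritten columns keep their slot
    for name, produced in assignments:
        if name not in overwriting:
            # Unnamed exprs always land here, even if they produce
            # a column that already exists in the table.
            schema.extend(produced)

    duplicates = [col for col, count in Counter(schema).items() if count > 1]
    if duplicates:
        # Stand-in for the duplicate-column error raised at expression creation.
        raise ValueError(f"Duplicate column names: {duplicates}")
    return schema
```

With a named assignment such as `("B", ["B"])` the overwrite is detected, while `(None, ["B", "D", "E"])` slips past the check and trips the duplicate-column error.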
Proposed Change
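This PR modifies the logic inside mutate to account for overwrites that may happen inside a DestructColumn. First, it properly constructs the dict of assignment -> expr by checking for DestructColumn specifically. Then, it determines which assigned exprs overwrite an existing table column (collected in a dict) and which exprs are new (collected in a list).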
From the above, if an overwrite is detected, the mutation node can be constructed by iterating through all table columns, selecting the expr either from the overwritten dict or the table itself, and then appending all new exprs from the list above.
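A rough model of that construction, again in plain Python rather than ibis internals. The `Named` and `Destruct` classes, `plan_mutation`, and `schema_names` are hypothetical names introduced for illustration; emitting a DestructColumn only once when it overwrites several columns is an assumption of this sketch, not something stated in the PR.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union


@dataclass(frozen=True)
class Named:
    """Hypothetical stand-in for an ordinary named column expr."""
    name: str


@dataclass(frozen=True)
class Destruct:
    """Hypothetical stand-in for a DestructColumn: unnamed, yields several columns."""
    names: Tuple[str, ...]


Expr = Union[Named, Destruct]
Selection = Union[str, Named, Destruct]  # a str means "take this column from the table"


def output_names(expr: Expr) -> List[str]:
    """Column names an expr produces; a Destruct is checked specifically."""
    return list(expr.names) if isinstance(expr, Destruct) else [expr.name]


def plan_mutation(table_columns: Sequence[str], exprs: Sequence[Expr]) -> List[Selection]:
    overwriting = {}  # existing column name -> expr that overwrites it
    new_exprs = []    # exprs that overwrite nothing
    for expr in exprs:
        hits = [n for n in output_names(expr) if n in table_columns]
        for n in hits:
            overwriting[n] = expr
        if not hits:
            new_exprs.append(expr)

    selection: List[Selection] = []
    for col in table_columns:
        expr = overwriting.get(col)
        if expr is None:
            selection.append(col)  # untouched table column
        elif not any(expr is s for s in selection):
            selection.append(expr)  # assumption: place each overwriting expr once
    selection.extend(new_exprs)
    return selection


def schema_names(selection: Sequence[Selection]) -> List[str]:
    """Flatten a selection into the resulting column names."""
    names: List[str] = []
    for item in selection:
        names.extend([item] if isinstance(item, str) else output_names(item))
    return names
```

Calling `schema_names(plan_mutation(["A", "B", "C"], [Destruct(("B", "D", "E"))]))` yields each column name exactly once, so the duplicate-column check no longer fires.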
Testing
Added several new tests in test_vectorized_udf.py to assert that overwriting a column within a multi-col UDF (elementwise, analytic, and reduction) works correctly.
Known Limitations
Since a DestructColumn can represent many columns internally, if one of those columns overwrites an existing column and the remaining columns are new assignments, the result of the mutation will not respect the correct column ordering. That is, if we have a table with columns ['A', 'B', 'C'] and a mutation with DestructColumn ['B', 'D', 'E'], the output table schema after execution will be ['A', 'B', 'D', 'E', 'C'] rather than ['A', 'B', 'C', 'D', 'E'], since all columns of the DestructColumn are grouped together in the same expr.
A follow-up can explore destructing these columns inside the mutate API, which would change the execution path but ensure correct column ordering.
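The difference between the two orderings can be sketched with column names alone. Both functions below are hypothetical illustrations: `grouped_order` mimics the behaviour described in this limitation, and `destructed_order` shows what a follow-up that treats each destructed field as its own assignment might produce. Neither reflects actual ibis code.

```python
from typing import List, Sequence


def grouped_order(table_columns: Sequence[str], destruct_fields: Sequence[str]) -> List[str]:
    """All fields of the DestructColumn travel together as one expr."""
    out: List[str] = []
    placed = False
    for col in table_columns:
        if col in destruct_fields:
            if not placed:
                # The whole group lands at the first overwritten column.
                out.extend(destruct_fields)
                placed = True
        else:
            out.append(col)
    if not placed:
        out.extend(destruct_fields)  # nothing overwritten: append at the end
    return out


def destructed_order(table_columns: Sequence[str], destruct_fields: Sequence[str]) -> List[str]:
    """Hypothetical follow-up: each field is handled as a separate assignment."""
    out = list(table_columns)  # overwrites keep the existing position
    out.extend(f for f in destruct_fields if f not in table_columns)
    return out


table = ["A", "B", "C"]
fields = ["B", "D", "E"]
print(grouped_order(table, fields))     # ['A', 'B', 'D', 'E', 'C']
print(destructed_order(table, fields))  # ['A', 'B', 'C', 'D', 'E']
```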