Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix overwrite logic to account for DestructColumn inside mutate API #2636

Merged
merged 11 commits into from
Mar 2, 2021

Conversation

emilyreff7
Copy link
Contributor

@emilyreff7 emilyreff7 commented Feb 22, 2021

Overview

Currently ibis does some checking inside the mutate API method to determine whether any assigned columns are overwriting an existing column in the table. It then uses this logic to correctly select the column expr from the assignment expr that does the overwriting. This also ensures that there are no duplicate columns detected in schema computation.

However, this logic does not properly account for an overwrite that may occur within a DestructColumn, which are used to specify output of multi-column UDFs. Since DestructColumns by design cannot be named (since they can represent many columns internally), the current logic does not detect if a column inside a DestructColumn is overwriting an existing column. The result is an error at expression creation time as duplicate columns are detected in the schema.

Proposed Change

This PR modifies the logic inside mutate to account for overwrites that may happen inside a DestructColumn. First it properly constructs the dict of assignment -> expr by checking for DestructColumn specifically. Then, it determines:

  1. Dict of column name -> expr for all assignments that overwrite an existing column
  2. List of exprs for all assignments that do not represent any overwriting.

From the above, if an overwrite is detected, the mutation node can be constructed by iterating through all table columns, selecting the expr either from the overwritten dict or the table itself, and then appending all new exprs from the list above.

Testing

Added several new tests in test_vectorized_udf.py to assert that overwriting a column within a multi-col UDF (elementwise, analytic, and reduction) works correctly.

Known Limitations

Since a DestructColumn can represent many columns internally, if one of those columns is overwriting an existing column and the remaining columns are new assignments, the result from the mutation will not respect correct column ordering. That is, if we have a table with columns ['A', 'B', C'] and a mutation with DestructColumn ['B', 'D', 'E'], the output table schema after execution will be ['A', 'B', 'D', 'E', 'C'] since all the DestructColumns are grouped together in the same expr.

A follow-up can explore destructing these columns inside the mutate API, which would change the execution path but ensure correct column ordering.

@jreback jreback added the expressions Issues or PRs related to the expression API label Feb 23, 2021
ibis/expr/api.py Outdated Show resolved Hide resolved
ibis/expr/api.py Show resolved Hide resolved
ibis/expr/api.py Outdated Show resolved Hide resolved
ibis/expr/api.py Outdated Show resolved Hide resolved
@jreback jreback added this to the Next release milestone Feb 24, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this has user visible effects right? if so can you add a release note

ibis/expr/api.py Outdated Show resolved Hide resolved
@emilyreff7 emilyreff7 changed the title BUG: Destructure expressions inside mutate API layer BUG: Fix overwrite logic to account for DestructColumn inside mutate API Feb 24, 2021
ibis/expr/api.py Outdated Show resolved Hide resolved
@icexelloss
Copy link
Contributor

Please add an overview to the PR

Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left comments.

ibis/expr/api.py Outdated Show resolved Hide resolved
ibis/expr/api.py Outdated Show resolved Hide resolved
ibis/expr/api.py Show resolved Hide resolved
Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jreback jreback merged commit d91ad9e into ibis-project:master Mar 2, 2021
@jreback
Copy link
Contributor

jreback commented Mar 2, 2021

thanks @emilyreff7 very nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
expressions Issues or PRs related to the expression API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants