-
Notifications
You must be signed in to change notification settings - Fork 590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Allow expressions in pandas join predicates #1138
Conversation
ibis/pandas/execution.py
Outdated
| root_table, = column_op.root_tables() | ||
| return name, new_column, root_table | ||
|
|
||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't these be functions rather than computed at import time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This serves only to be a unique suffix that we can depend on to use for selecting out duplicate column names in a Selection operation. As a counterexample, if there were a unique suffix for every join operation then we wouldn't be able to select out any overlapping columns in downstream operations because we wouldn't know what to look for when pulling them out.
8f77ab1
to
1b67820
Compare
1b67820
to
45fb9c1
Compare
|
Having a look |
45fb9c1
to
16dddcb
Compare
|
I'll rebase this on top of #1149 once that is merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments, but otherwise looks good to me
ibis/pandas/execution.py
Outdated
| how=how, left_on=on[left_op], right_on=on[right_op], | ||
| suffixes=_JOIN_SUFFIXES, | ||
| ) | ||
| return result.drop(to_drop, axis=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you get away with not assigning new columns to left and right (and potentially avoiding some computation) and instead passing the join columns as Series arguments to left_on and right_on, or does that not work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that worked. Thanks for the suggestion.
| def test_join_with_duplicate_non_key_columns(how, left, right, df1, df2): | ||
| left = left.mutate(x=left.value * 2) | ||
| right = right.mutate(x=right.other_value * 3) | ||
| expr = left.join(right, left.key == right.key, how=how) | ||
|
|
||
| # This is undefined behavior because `x` is duplicated. This is difficult | ||
| # to detect | ||
| with pytest.raises(ValueError): | ||
| expr.execute() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, bummer. Is there any consensus amongst SQL engines about how to handle this case, or do they generally all error out?
|
I'm going to merge these changes into #1149 since they interact in a way that will introduce a bug if merged separately. |
Closes #1137.
The algorithm used to compute the individual expressions of an
Equalsoperation in aJoinpredicate is the following:dictwhose keys are the left and right tables and whose values are a list of join keys coming from the left or right table respectively. code.dictwhose keys are the left and right table and whose values aredicts mapping a generated column name to apandas.Seriesobject. code.Equalsoperation): code.Equalsoperation:Equals, returning: code.pd.Seriesif the expression is not aTableColumncoming from the root tables for the predicateonargument ofpd.merge. code.