ENH: Allow expressions in pandas join predicates #1138

Closed
wants to merge 2 commits into from

cpcloud
Member

cpcloud commented Aug 24, 2017

Closes #1137.

The algorithm used to compute the individual expressions of an Equals operation in a Join predicate is as follows:

  1. Initialize a dict whose keys are the left and right tables and whose values are lists of the join keys coming from the left and right table, respectively.
  2. Initialize a dict whose keys are the left and right tables and whose values are dicts mapping a generated column name to a pandas.Series object.
  3. For each join predicate (which is, by assertion, an Equals operation):
    1. For the left and right sides of the Equals operation:
      1. Compute the current side of the Equals, returning:
        1. a column name
        2. a pd.Series if the expression is not a TableColumn coming from the root tables for the predicate
        3. the root table of the predicate
      2. Append the column name to the list of keys to join on (values are keyed by the root table of the predicate).
      3. Insert any new columns into the dict of dicts for new columns.
  4. Pass the left and right keys to the left_on and right_on arguments of pd.merge.
  5. Drop any new columns.
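
For readers who want to see the shape of these steps, here is a minimal sketch in plain pandas. It is not the code from this PR: `join_on_expressions`, the `"left"`/`"right"` side labels (standing in for root tables), the callable-as-expression convention, the generated `__join_key_*` names, and the `JOIN_SUFFIXES` value are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical fixed suffix pair; the PR's real _JOIN_SUFFIXES value is not shown here.
JOIN_SUFFIXES = ("_left", "_right")


def join_on_expressions(left, right, predicates, how="inner"):
    """Join two DataFrames on (left_key, right_key) equality pairs.

    Each key is either an existing column name (str) or a callable that
    takes the DataFrame and returns a pd.Series, standing in for a
    computed expression.
    """
    frames = {"left": left, "right": right}
    # Step 1: per-side list of column names to join on.
    on = {"left": [], "right": []}
    # Step 2: per-side mapping of generated column name -> pd.Series.
    new_columns = {"left": {}, "right": {}}

    # Step 3: visit each equality predicate and each of its two sides.
    for i, (left_key, right_key) in enumerate(predicates):
        for side, key in (("left", left_key), ("right", right_key)):
            if callable(key):
                # Computed expression: generate a temporary column for it.
                name = f"__join_key_{side}_{i}"
                new_columns[side][name] = key(frames[side])
            else:
                # Plain column reference: nothing new to compute.
                name = key
            on[side].append(name)

    # Attach the temporary key columns to copies of the inputs.
    left_tmp = left.assign(**new_columns["left"])
    right_tmp = right.assign(**new_columns["right"])

    # Step 4: hand the collected keys to pd.merge.
    result = pd.merge(
        left_tmp, right_tmp,
        how=how, left_on=on["left"], right_on=on["right"],
        suffixes=JOIN_SUFFIXES,
    )

    # Step 5: drop the temporary key columns again.
    to_drop = [*new_columns["left"], *new_columns["right"]]
    return result.drop(to_drop, axis=1)


left = pd.DataFrame({"key": [1, 2, 3], "value": [10, 20, 30]})
right = pd.DataFrame({"key2": [2, 4, 6], "other_value": [1.0, 2.0, 3.0]})

# Join on the expression `left.key * 2 == right.key2`.
result = join_on_expressions(left, right, [(lambda df: df["key"] * 2, "key2")])
```

In this sketch only the left side of the predicate needs a generated column; the right side reuses the existing `key2` column, mirroring the TableColumn case in the list above.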

cpcloud self-assigned this on Aug 24, 2017
cpcloud added the feature (Features or general enhancements), pandas (The pandas backend), and ux (User experience related issues) labels on Aug 24, 2017
cpcloud added this to the 0.11.3 milestone on Aug 24, 2017
root_table, = column_op.root_tables()
return name, new_column, root_table


Contributor

Shouldn't these be functions rather than computed at import time?

Member Author

cpcloud commented Aug 29, 2017

This serves only as a unique suffix that we can rely on when selecting out duplicate column names in a Selection operation. As a counterexample, if every join operation had its own unique suffix, we wouldn't be able to select out any overlapping columns in downstream operations, because we wouldn't know what to look for when pulling them out.
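
To illustrate the point about a fixed suffix, here is a small sketch in plain pandas. The `JOIN_SUFFIXES` value is hypothetical (the actual `_JOIN_SUFFIXES` from the PR is not shown in this thread), and the example only demonstrates how `pd.merge` renames overlapping non-key columns, not how the Selection operation consumes them.

```python
import pandas as pd

# Hypothetical module-level constant; the PR's real suffix strings are not shown here.
JOIN_SUFFIXES = ("_ibis_left", "_ibis_right")

left = pd.DataFrame({"key": [1, 2], "x": [10, 20]})
right = pd.DataFrame({"key": [1, 2], "x": [0.1, 0.2]})

# Both inputs have a non-key column named `x`, so pd.merge appends the suffixes.
joined = pd.merge(left, right, on="key", suffixes=JOIN_SUFFIXES)

# Because the suffixes are known constants, a downstream step can rebuild
# the disambiguated names without knowing anything about this particular join.
left_x = joined["x" + JOIN_SUFFIXES[0]]
right_x = joined["x" + JOIN_SUFFIXES[1]]
```

If the suffix were generated per join instead, the downstream step would have no way to reconstruct `"x" + suffix` on its own.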

cpcloud changed the title from "WIP: Allow expressions in pandas join predicates" to "ENH: Allow expressions in pandas join predicates" on Aug 29, 2017
cpcloud requested a review from wesm on August 29, 2017 17:10
@wesm
Member

wesm commented Aug 29, 2017

Having a look

@cpcloud
Member Author

cpcloud commented Aug 29, 2017

I'll rebase this on top of #1149 once that is merged.

Member

wesm left a comment

Minor comments, but otherwise looks good to me

how=how, left_on=on[left_op], right_on=on[right_op],
suffixes=_JOIN_SUFFIXES,
)
return result.drop(to_drop, axis=1)
Member

Can you get away with not assigning new columns to left and right (and potentially avoiding some computation) and instead passing the join columns as Series arguments to left_on and right_on, or does that not work?

Member Author

Yep, that worked. Thanks for the suggestion.
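
For reference, a minimal sketch of the suggestion in plain pandas: `pd.merge` accepts array-likes (including Series of the same length as the frame) for `left_on` and `right_on`, so the computed key never has to be assigned as a column. The frames and column names below are made up for illustration and are not taken from the PR.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "value": [10, 20, 30]})
right = pd.DataFrame({"key2": [2, 4, 6], "other_value": [1.0, 2.0, 3.0]})

# Join keys computed as standalone Series; neither input frame is modified.
left_key = left["key"] * 2
right_key = right["key2"]

result = pd.merge(left, right, how="inner", left_on=left_key, right_on=right_key)

# Recent pandas versions add a `key_0` column holding the array keys;
# drop it if present so only the original columns remain.
result = result.drop(columns="key_0", errors="ignore")
```

This avoids both the temporary column assignment and the bookkeeping needed to drop those columns afterwards, apart from the synthetic key column that pandas itself may add.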

def test_join_with_duplicate_non_key_columns(how, left, right, df1, df2):
    left = left.mutate(x=left.value * 2)
    right = right.mutate(x=right.other_value * 3)
    expr = left.join(right, left.key == right.key, how=how)

    # This is undefined behavior because `x` is duplicated. This is difficult
    # to detect
    with pytest.raises(ValueError):
        expr.execute()
Member

Hm, bummer. Is there any consensus amongst SQL engines about how to handle this case, or do they generally all error out?

@cpcloud
Member Author

cpcloud commented Aug 31, 2017

I'm going to merge these changes into #1149 since they interact in a way that will introduce a bug if merged separately.

cpcloud closed this on Aug 31, 2017
cpcloud deleted the fix-join-filter branch on August 31, 2017 21:30