More general joining #68

Closed
jankislinger opened this issue Sep 4, 2018 · 5 comments

Comments

@jankislinger
Contributor

I have fixed a typo (right_in -> right_on) and reversed logic in one if statement in the function that creates the join parameters. See the changes here:
master...jankislinger:fix-join-multiple-by

Now it can be used to join tables on columns with different names:

import pandas as pd
from dfply import *

a = pd.DataFrame({
    'x1': ['A', 'B', 'C'],
    'x2': [1, 2, 3]
})
b = pd.DataFrame({
    'x4': ['A', 'B', 'D'],
    'x3': [True, False, True]
})

a >> inner_join(b, by=('x1', 'x4'))

It would also be convenient to be able to pass multiple by entries at once. For example, the expression

a >> inner_join(b, by=['x1', ('x2', 'x3')])

could be used as shorthand for

a.merge(b, left_on=['x1', 'x2'], right_on=['x1', 'x3'])
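To illustrate the mapping proposed above, here is a minimal sketch of how a mixed by list could be translated into merge arguments. The helper name by_to_merge_kwargs is hypothetical and this is not the code in the linked branch; it assumes every entry is either a column name shared by both tables or a (left_name, right_name) tuple.

```python
def by_to_merge_kwargs(by):
    """Translate a mixed `by` list into pandas-style merge keyword arguments.

    Hypothetical helper for illustration, not dfply's actual implementation.
    Each entry is either a column name shared by both tables or a
    (left_name, right_name) tuple.
    """
    left_on, right_on = [], []
    for entry in by:
        if isinstance(entry, tuple):
            # Unpacking raises ValueError unless the tuple has exactly two names.
            left, right = entry
        else:
            # A plain name is used on both sides of the join.
            left = right = entry
        left_on.append(left)
        right_on.append(right)
    return {"left_on": left_on, "right_on": right_on}


kwargs = by_to_merge_kwargs(['x1', ('x2', 'x3')])
print(kwargs)
# {'left_on': ['x1', 'x2'], 'right_on': ['x1', 'x3']}
```

The resulting dictionary could then be passed on as a.merge(b, **kwargs), provided the named columns exist in the respective tables.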

If you agree, I would modify the code and create a PR.

@sharpe5

sharpe5 commented Sep 4, 2018 via email

@jankislinger
Contributor Author

The only issue I can see arises when by is a list of length 2, e.g. by=['x1', 'x2']. It is not clear whether you want to use x1 from the left table and x2 from the right one, or both columns from both tables. I would suggest using a list for multiple columns and a tuple for different names of the same column. A third option is to use a dictionary (which is, by the way, the closest to the implementation in dplyr).

List (two columns to join by, same names in both data frames):

a >> inner_join(b, by=['x1', 'x2'])
a.merge(b, left_on=['x1', 'x2'], right_on=['x1', 'x2'])

Tuple (single column, different names):

a >> inner_join(b, by=('x1', 'x2'))
a.merge(b, left_on='x1', right_on='x2')
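The dictionary option mentioned above could be sketched as follows. This is only an assumption about how such an interface might look, modelled on dplyr's named by vector (by = c("x1" = "x4")); the helper name dict_by_to_merge_kwargs is hypothetical and nothing like it is claimed to exist in dfply.

```python
def dict_by_to_merge_kwargs(by):
    """Translate {left_name: right_name} into pandas-style merge arguments.

    Hypothetical helper for illustration only. Keys are columns of the left
    table, values are the matching columns of the right table.
    """
    if not isinstance(by, dict):
        raise TypeError("by must be a dict of {left_name: right_name}")
    # Dicts preserve insertion order (Python 3.7+), so the pairs stay aligned.
    return {"left_on": list(by.keys()), "right_on": list(by.values())}


# Single column with different names in the two tables.
print(dict_by_to_merge_kwargs({'x1': 'x4'}))
# {'left_on': ['x1'], 'right_on': ['x4']}

# Two columns, one shared name and one renamed.
print(dict_by_to_merge_kwargs({'x1': 'x1', 'x2': 'x3'}))
# {'left_on': ['x1', 'x2'], 'right_on': ['x1', 'x3']}
```

With a dictionary there is no length-2 ambiguity, since the left and right names are always spelled out, at the cost of repeating names that are identical in both tables.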

@sharpe5

sharpe5 commented Sep 4, 2018 via email

@jankislinger
Contributor Author

jankislinger commented Sep 4, 2018

I agree that by=['x1', 'x2'] should use both columns in both tables. But having to rename a column before the join is annoying. That's why I would use tuples for that case.

See this commit:
https://github.com/jankislinger/dfply/commit/2d892186eeda4f837e0f46a63ef45434b5c5b502

@kieferk
Owner

kieferk commented Sep 4, 2018

Cool, thank you. Will merge the PR now.
