feat: Add bind_rows and bind_cols #411

Open
wants to merge 14 commits into
base: main


Techzune

A very rudimentary implementation of dplyr's bind_rows and bind_cols.
As with join, you must pass every dataframe involved explicitly when piping.

e.g.: one >> bind_rows(_, two) or bind_rows(one, two)

@Techzune Techzune requested a review from machow as a code owner March 31, 2022 15:03
@machow
Owner

machow commented Apr 1, 2022

Hey--thanks for your PR! I hope it's okay, I added a couple commits for...

  • Allowing from siuba import bind_rows
  • Basic tests. I marked the dplyr behaviors that seem useful with pytest.mark.xfail!
  • A page in the docs, largely translated from the dplyr bind docs

Any chance you are interested in trying to implement the last couple pieces? :) It seems like there are just a couple dplyr behaviors left to get most of its bind_rows functionality!

I've listed out some of their key features below, and am happy to help with whatever is useful!

bind_rows

  • _id argument to create a column indicating which dataframe the row came from
  • support for dictionaries
  • support for lists of DataFrames (this seems unnecessary in python, which can unpack things using *[])
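As a rough illustration of the bind_rows features listed above, here is a minimal pandas sketch. It is not the code in this PR: `bind_rows_sketch` is a hypothetical helper, and it assumes that "support for dictionaries" means a dict mapping labels to DataFrames, and that unnamed inputs are labelled "1", "2", ... in the `_id` column.

```python
import pandas as pd


def bind_rows_sketch(*args, _id=None):
    """Stack DataFrames by row, matching columns by name (hypothetical helper)."""
    frames, labels = [], []
    for arg in args:
        if isinstance(arg, dict):
            # Assumption: a dict maps a label to a DataFrame.
            for key, df in arg.items():
                labels.append(str(key))
                frames.append(df)
        else:
            # Assumption: unnamed inputs are labelled by position, starting at 1.
            labels.append(str(len(frames) + 1))
            frames.append(arg)

    # Columns are matched by name; missing values are filled with NaN.
    out = pd.concat(frames, ignore_index=True)

    if _id is not None:
        # Repeat each label once per row of the frame it came from.
        ids = [label for label, df in zip(labels, frames) for _ in range(len(df))]
        out.insert(0, _id, ids)
    return out


one = pd.DataFrame({"x": [1, 2]})
two = pd.DataFrame({"x": [3], "y": ["a"]})

combined = bind_rows_sketch(one, two, _id="source")
by_name = bind_rows_sketch({"first": one, "second": two}, _id="source")
print(combined)
print(by_name)
```

A list of DataFrames needs no special handling in this sketch, since `bind_rows_sketch(*[one, two])` unpacks it.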

bind_cols

import pandas as pd

df1 = pd.DataFrame({'x': [0,1]}, index = [0, 1])
df2 = pd.DataFrame({'y': [1, 2]}, index = [1, 2])

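# note that this also doesn't work: with axis=1, ignore_index=True
# relabels the columns 0..n-1, but rows are still aligned on the index
# pd.concat([df1, df2], axis=1, ignore_index=True)

# pd.concat aligns rows on the index instead of binding by position
pd.concat([df1, df2], axis=1)
#      x    y
# 0  0.0  NaN
# 1  1.0  1.0
# 2  NaN  2.0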
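One possible way to bind by row position rather than by index is to reset each index before concatenating. This is only a sketch: `bind_cols_sketch` is a hypothetical helper, not the PR's implementation, and the row-count check is an assumption modelled on dplyr's behavior.

```python
import pandas as pd


def bind_cols_sketch(*dfs):
    """Bind DataFrames side by side by row position (hypothetical helper)."""
    # Assumption: like dplyr, refuse inputs with differing row counts.
    if len({len(df) for df in dfs}) > 1:
        raise ValueError("all DataFrames must have the same number of rows")

    # Dropping each index makes pd.concat line rows up by position.
    return pd.concat([df.reset_index(drop=True) for df in dfs], axis=1)


df1 = pd.DataFrame({"x": [0, 1]}, index=[0, 1])
df2 = pd.DataFrame({"y": [1, 2]}, index=[1, 2])

bound = bind_cols_sketch(df1, df2)
print(bound)
```

With the indexes dropped, the result has two rows and no missing values, at the cost of discarding the original index labels.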

Owner

@machow machow left a comment


Thanks for submitting this! Added some feedback in a comment. I'm still feeling out which dplyr bind_rows behaviors are most useful, and would love your feedback on which pieces matter most to you.

@Techzune
Author

Techzune commented Apr 9, 2022

Ignore those previous bind_rows commits. It was a long week, and I didn't read the docs you added! 😅
Let me implement that real quick.

@Techzune
Author

Techzune commented Apr 9, 2022

✨ there we go!

@Techzune Techzune requested a review from machow April 9, 2022 17:43
@machow
Owner

machow commented Apr 10, 2022

Ah, thanks a ton! I'm running the tests, and can take a closer look tonight or tomorrow!

I noticed there were a few places (like mutate) where the variable result was changed to df_result, was that to make it easier to understand at a glance?

@Techzune
Author

Ah, thanks a ton! I'm running the tests, and can take a closer look tonight or tomorrow!

I noticed there were a few places (like mutate) where the variable result was changed to df_result, was that to make it easier to understand at a glance?

Gah crud! That was a mistake on my end. I renamed my "result" variable, and I guess VS Code said "oooh rename this one!" I tend to write my variables as a definition of what they are in my data science work-- for example, a dataframe always starts with df_ and a list starts with list_. However, I'm not always perfect at it. Anyway, I'm sure that line change should be omitted from the commit.
