GroupBy enhancement unifies the return of iterating over GroupBy #42795 #47719

Closed
wants to merge 72 commits into from

ahmedibrhm
Contributor

@ahmedibrhm ahmedibrhm commented Jul 14, 2022

@pep8speaks

pep8speaks commented Jul 14, 2022

Hello @ahmedibrhm! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2022-07-26 22:07:40 UTC

@ahmedibrhm ahmedibrhm marked this pull request as ready for review July 19, 2022 22:53
@ahmedibrhm ahmedibrhm marked this pull request as draft July 19, 2022 23:38
Member

@rhshadrach rhshadrach left a comment

It can be useful to put up a proof of concept to demonstrate what a behavior change would look like once a deprecation is enforced. However, when doing so it would be helpful to mark the PR as a draft since we do not want to merge it yet.

There are also many things changing here that I would not expect. I think if you aren't iterating over the group, then there should be no change in behavior. I highlighted a few examples below.

@@ -806,7 +806,7 @@ def test_groupby_as_index_cython(df):
 msg = "The default value of numeric_only"
 with tm.assert_produces_warning(FutureWarning, match=msg):
 result = grouped.mean()
-expected = data.groupby(["A"]).mean()
+expected = data.groupby("A").mean()
Member

Why is this changing?

Contributor Author

As I mentioned in #47761,
I thought it's better to generalize the rule of not using a list when grouping by a single key, since groupby is also iterated over in other functions.

Member

I thought it's better to generalize the rule of...

I do not understand what this means. Can you expand on it? Also, it's not clear to me - is the result of data.groupby(["A"]).mean() different from what the main branch currently produces?
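For reference, the comparison asked about here can be checked directly. The following is a minimal sketch on a small made-up frame (the column names and values are illustrative, not taken from the pandas test suite); it suggests that for an aggregation such as `mean()`, grouping by `"A"` and by `["A"]` should give the same result, since no iteration over the groups is involved.

```python
import pandas as pd

# Illustrative data only; not the fixture used in test_groupby_as_index_cython.
data = pd.DataFrame(
    {"A": ["x", "y", "x", "y"], "B": [1.0, 2.0, 3.0, 4.0]}
)

# Aggregate with a scalar key and with a single-element list key.
by_scalar = data.groupby("A").mean()
by_list = data.groupby(["A"]).mean()

# Raises AssertionError if the two results differ in values, index, or dtypes.
pd.testing.assert_frame_equal(by_scalar, by_list)
```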

Comment on lines +64 to +65
bymodi = fix_groupby_singlelist_input(self.by)
grouped = self.data.groupby(bymodi)[self.columns]
Member

Why is this necessary?

Contributor Author

Because .hist and .box use groupby internally, one key at a time. For example, if I call hist with by=['a','b','c','d'], the keys will come back like (a,), (b,), (c,), (d,).
Some plotting functions and the pivot table actually iterate over groupby.

Member

But here, grouped is only being used in L66 immediately below, right?

self.bins = [self._calculate_bins(group) for key, group in grouped]

In this usage, only the group is being used and the key is ignored. So why is this needed if it's only the key changing?
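The point about the key being ignored can be illustrated with a small sketch. The frame below is made up for illustration, and the code deliberately does not inspect the key, because its type for a single-element list depends on the pandas version (pandas 1.5 also emits a FutureWarning when iterating in that case). Only the group frames are compared.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"a": [1, 1, 2], "b": [10, 20, 30]})

# Collect only the group frames; the key is discarded, as in the bins computation.
groups_scalar = [group for _key, group in df.groupby("a")]
groups_list = [group for _key, group in df.groupby(["a"])]

# The groups themselves should match pairwise regardless of how the key is shaped.
same_groups = len(groups_scalar) == len(groups_list) and all(
    g1.equals(g2) for g1, g2 in zip(groups_scalar, groups_list)
)
```

If `same_groups` is true, code that consumes only `group` should not need the grouping argument rewritten.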

@ahmedibrhm ahmedibrhm marked this pull request as draft July 26, 2022 15:32
@ahmedibrhm
Contributor Author

@rhshadrach
Do you think that, in changing the behaviour, we should change it at its root or only change the final result?

What I mean is that when iterating over groupby, the group_keys_seq variable appears. So do you think we should change the variable itself when the user passes a single element in a list, or just change the iter function to return the required behavior?
I am asking because group_keys_seq appears in several other functions, so changing it may affect them as well.
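The second option (adjusting only what iteration returns) can be sketched outside pandas. The helper below, `iter_groups_consistent`, is hypothetical and is not part of pandas or of this PR; it wraps the public `DataFrame.groupby` iteration and normalizes the key at the last step, leaving everything upstream untouched. It assumes that "consistent" means a tuple key whenever `by` is a list.

```python
import pandas as pd


def iter_groups_consistent(df, by):
    """Hypothetical helper: yield (key, group) pairs from df.groupby(by).

    Keys are tuples whenever `by` is a list, and scalars otherwise.
    Only the yielded key is adjusted; the grouping itself is unchanged.
    """
    by_is_list = isinstance(by, list)
    for key, group in df.groupby(by):
        # Older pandas yields a scalar for a single-element list; wrap it.
        if by_is_list and not isinstance(key, tuple):
            key = (key,)
        yield key, group


# Illustrative data only.
df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
keys_scalar = [key for key, _ in iter_groups_consistent(df, "a")]
keys_list = [key for key, _ in iter_groups_consistent(df, ["a"])]
```

Normalizing at the iteration boundary keeps the change local, which is the trade-off raised in the question: other code that reads the underlying key sequence would not see any difference.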

@github-actions
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Aug 27, 2022
@mroeschke
Member

@ahmedibrhm are you interested in continuing this PR and applying the deprecation? We are looking to enforce this for the next release.

@mroeschke
Member

Thanks for the pull request, but I believe this was handled by #50064, so closing. If I misunderstood, happy to reopen.

@mroeschke mroeschke closed this Dec 5, 2022
Development

Successfully merging this pull request may close these issues.

ENH: consistent types in output of df.groupby
4 participants