Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: groupby.nth should be a filter #49262

Merged
merged 8 commits into from
Nov 11, 2022
Merged

Conversation

rhshadrach
Copy link
Member

2 3.0
A B
1 1 2.0
2 2 3.0

NaNs denote group exhausted when using dropna
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This description probably needs updating

@@ -338,7 +338,7 @@ Groupby/resample/rolling
- Bug in :meth:`DataFrameGroupBy.sample` raises ``ValueError`` when the object is empty (:issue:`48459`)
- Bug in :meth:`Series.groupby` raises ``ValueError`` when an entry of the index is equal to the name of the index (:issue:`48567`)
- Bug in :meth:`DataFrameGroupBy.resample` produces inconsistent results when passing empty DataFrame (:issue:`47705`)
-
- Bug in :meth:`.DataFrameGroupBy.nth` and :meth:`.SeriesGroupBy.nth` would treat operation as a aggregation whereas it is a filtration; in particular, the result index no longer contains the groupers but rather is filtered from the original index of the input (:issue:`13666`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally like consistency now that nth is a filter, but I think this should be called out in it's own "notable bug fix" section

@rhshadrach
Copy link
Member Author

Thanks @mroeschke - ready for another look.


NaNs denote group exhausted when using dropna
When the specified ``n`` is larger than any of the groups, an
empty DataFrame is returned

>>> g.nth(3, dropna='any')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this example should be shown in the whatsnew? On the surface, this example appears to be quite different from before

@WillAyd
Copy link
Member

WillAyd commented Oct 25, 2022

Hmm I'm not sure I agree with this approach - doesn't this introduce an even larger inconsistency now between nth / first / last? Do we have other groupby functions that work this way?

@rhshadrach
Copy link
Member Author

Hmm I'm not sure I agree with this approach - doesn't this introduce an even larger inconsistency now between nth / first / last? Do we have other groupby functions that work this way?

first and last are reductions - they return 1 value per group. nth on the other hand can also return zero or multiple rows. It is because of this I think we should be treating nth as a filtration. The other filtrations, head, tail, and filter all behave as in this PR, namely they merely filter the rows from the input object. In particular, groupby(...).head() now behaves the same as groupby(...).nth([0, 1, 2, 3, 4]).

@WillAyd
Copy link
Member

WillAyd commented Oct 25, 2022

Yea I definitely see the argument. I think I'm +/-0 . IIUC this makes nth closer to filtering on rank, but with a predetermined option for a tiebreaker

@rhshadrach
Copy link
Member Author

rhshadrach commented Oct 25, 2022

IIUC this makes nth closer to filtering on rank

nth goes off of the input order, making it somewhat different from rank.

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with this change as nth as a filter makes sense to me and to align the corresponding output. IIRC there was an issue to write groupby().head/tail in terms with nth which this change would help IIUC.

@WillAyd WillAyd merged commit 9fefc8f into pandas-dev:main Nov 11, 2022
@WillAyd
Copy link
Member

WillAyd commented Nov 11, 2022

Awesome work @rhshadrach

@rhshadrach rhshadrach deleted the nth_filter branch November 11, 2022 13:00
codamuse pushed a commit to codamuse/pandas that referenced this pull request Nov 12, 2022
mliu08 pushed a commit to mliu08/pandas that referenced this pull request Nov 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

groupby.nth() labelling conventions changed from 0.17 -> 0.18
3 participants