BUG: Inconsistent behavior of Groupby with None values with filter (#… #63178
rhshadrach
left a comment
Thanks for the PR! Please always add tests. Does this also handle the tuple case on L667?
pandas/core/groupby/groupby.py
Outdated
        for name in names
    )

elif any(isna(k) for k in self.indices.keys()):
This check is expensive: this function is currently only ever called with names being a list of length 1, and the rest of the method is O(1) in terms of self.indices. It's called from the inner loop of DataFrameGroupBy.filter as we're iterating over each group. This seems avoidable.
I believe we could change this function to just accept a single name (rather than a list) and then have a special case:
if isna(name):
    return self.indices.get(np.nan, [])
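The suggested special case can be sketched outside pandas as follows. This is only an illustration under stated assumptions: `get_index` is a hypothetical standalone helper rather than the actual `_get_indices` method, `indices` is a plain dict standing in for `self.indices`, and `name` is assumed to be a scalar (tuple names are not handled here).

```python
import numpy as np
import pandas as pd


def get_index(indices, name):
    """Hypothetical single-name lookup; not pandas' actual _get_indices."""
    # Any missing scalar (None, NaN, pd.NA, NaT) is routed to one canonical key,
    # so the lookup stays O(1) and never scans the keys of `indices`.
    if pd.isna(name):
        return indices.get(np.nan, [])
    return indices.get(name, [])


# Assumes the mapping already stores its missing group under np.nan.
indices = {"a": [0, 2], np.nan: [1, 3]}
hit = get_index(indices, "a")
missing = get_index(indices, None)
absent = get_index(indices, "z")
```

The sketch only works if the missing group is stored under the `np.nan` object itself, which is the point raised in the reply that follows.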
I think self.indices.get(np.nan, []) won't work, as the NaN value in self.indices cannot be accessed reliably before changing the keys from NaN to np.nan. I think I have a working solution though. I will supply the updated version of the PR tomorrow.
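The lookup concern can be reproduced with a plain dict. This is a minimal sketch, assuming the missing key stored in the mapping is a NaN object other than `np.nan`; the dict below is a stand-in and not the real `self.indices`.

```python
import numpy as np

other_nan = float("nan")  # a NaN object that is distinct from np.nan
indices = {other_nan: [1, 3], "a": [0, 2]}

# Dict lookup falls back on identity, then equality. NaN never compares
# equal to anything, so only the very same object can find the entry.
same_object = other_nan is np.nan
equal = other_nan == np.nan
via_np_nan = indices.get(np.nan, [])
via_same_object = indices.get(other_nan, [])
```

Here `via_np_nan` comes back empty while `via_same_object` finds the group, which is why the keys would need to be normalized to `np.nan` before a lookup by `np.nan` can succeed.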
> I think self.indices.get(np.nan, []) won't work as the NaN value in self.indices cannot be accessed reliably before changing the keys from NaN to np.nan.
Isn't this what I suggested to do in #63178 (comment)?
I think I misread your first comment from two days ago. To make sure we are on the same page: we can change the function _get_indices(self, names) to _get_indices(self, name), replacing the list with a single name?
Yes! It is only ever used with a single name today.
pandas/core/groupby/groupby.py
Outdated
names = (converter(name) for name in names)

return [self.indices.get(name, []) for name in names]
indices = {np.nan if isna(k) else k: v for k, v in self.indices.items()}
It seems better to do this on the indices cached property directly, and only in the case where there is a NaN value, guarded by if not self.dropna and self.result_index.hasnans.
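One way to picture this suggestion is a toy class that normalizes its keys once, inside the cached property, and only when a missing key was actually seen. Everything here is hypothetical: `ToyGrouping` and its `has_missing` flag are illustrative stand-ins, not pandas internals, and the flag merely plays the role that `self.result_index.hasnans` plays in the comment above.

```python
from functools import cached_property

import numpy as np
import pandas as pd


class ToyGrouping:
    """Hypothetical stand-in for illustration; not pandas' GroupBy."""

    def __init__(self, keys, dropna=False):
        self.keys = list(keys)
        self.dropna = dropna

    @cached_property
    def indices(self):
        raw = {}
        has_missing = False
        for pos, key in enumerate(self.keys):
            if pd.isna(key):
                if self.dropna:
                    continue  # with dropna, missing keys form no group
                has_missing = True
            raw.setdefault(key, []).append(pos)
        if not has_missing:
            return raw  # common case: no extra pass over the groups
        # Rare case: fold every missing key into the single np.nan key.
        normalized = {}
        for key, positions in raw.items():
            target = np.nan if pd.isna(key) else key
            normalized.setdefault(target, []).extend(positions)
        return normalized


grouping = ToyGrouping(["a", None, "b", None])
nan_positions = grouping.indices.get(np.nan, [])
```

Because the property is cached, the normalization cost is paid at most once per object, and per-group lookups inside a loop stay constant time.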
Good point, will adjust.
…andas-dev#62501) - Add test cases - Add tuple support - Incorporate feedback
edd8a1f to 8d2126a
@koskampt - I opened #63202 to give some idea of what I'm thinking. If you like that, can incorporate it here. But still open to alternative solutions that do not iterate through Even with such a solution, will still want to see the result of running the groupby ASVs to evaluate performance impact. I can also assist here if desired.
…62501)
doc/source/whatsnew/v2.3.4.rst file if fixing a bug or adding a new feature.