Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix Grouper with a datetime key in conjunction with another key #51414

Merged
merged 9 commits into from Apr 8, 2023

Conversation

juleek
Copy link
Contributor

@juleek juleek commented Feb 15, 2023

Description of the bug

The crux of the issue was in BaseGrouper::groups:

to_groupby = zip(*(ping.grouping_vector for ping in self.groupings))

We are going to iterate over to_groupby, which is a zip of ping.grouping_vector.

groupings are populated in get_grouper, below is an excerpt of relevant code that we executed in case of grouping by date:

        # Taking this branch because grouping_vector is `TimeGrouper`
        elif isinstance(grouping_vector, Grouper):
            newgrouper, newobj = grouping_vector._get_grouper(self.obj, validate=False)
            # newgrouper is BinGrouper after the line above
            if isinstance(newgrouper, ops.BinGrouper):
                grouping_vector = newgrouper

after this, Grouping::grouping_vector is BinGrouper, and this is what zip() will try to iterate over.

Before the fix, when we were iterating over BinGrouper, its base class __iter__ was used, which was using self.indices, which was:

defaultdict(<class 'list'>, {Timestamp('2000-01-01 00:00:00'): [0, 1, 2], 
                             Timestamp('2000-01-02 00:00:00'): [3, 4, 5]})

When you iterate over a dictionary in Python, you iterate over the keys, and more importantly we have only two of them here. Second Grouping has array in its grouping vector consisting of 6 elements:

['a' 'b' 'a' 'b' 'a' 'b']

so, the zip() function just truncated it.

Fix

By adding __iter__ in BinGrouper, the zip() function now iterate over the elements from self.groupings[0].grouping_vector, which is, in our case:

DatetimeIndex(['2000-01-01', '2000-01-01', '2000-01-01', '2000-01-02',
               '2000-01-02', '2000-01-02'],
              dtype='datetime64[ns]', name='b', freq=None)

@juleek
Copy link
Contributor Author

juleek commented Feb 16, 2023

@jbrockmendel @rhshadrach Can you please have a look and/or advise? Thanks.

@@ -1085,6 +1085,9 @@ def groups(self):
}
return result

def __iter__(self) -> Iterator[Hashable]:
return iter(self.groupings[0].grouping_vector)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the parent class method wrong?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I forgot to add description. Please, see the description of the PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be better to refactor the indices here to align better with the base class rather than override iter. But I'm not certain if that's a good idea and I'm okay with this fix. cc @jbrockmendel

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel - friendly ping.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay. i had in mind something more like @rhshadrach's suggestion, that needing this here might indicate that the self.indices (which the parent class method iterates over) might be wrong

@juleek
Copy link
Contributor Author

juleek commented Feb 23, 2023

@jbrockmendel I fixed all your comments. Can you please review the PR? Thanks.

@juleek
Copy link
Contributor Author

juleek commented Feb 25, 2023

@jbrockmendel I am wondering whether you had time to have a look the PR? If not, should I add someone else to the PR to review it?

Also, am I missing something or the tests a quite flaky?
For example, right now NumPy is failing with this:

Fatal Python error: Segmentation fault

@jbrockmendel
Copy link
Member

Please be patient, there are a lot of PRs to review.

The segfault is unrelated and is affecting all PRs at the moment. You can ignore it.

@juleek
Copy link
Contributor Author

juleek commented Mar 9, 2023

@jbrockmendel have you had time to have a look?

@juleek
Copy link
Contributor Author

juleek commented Mar 18, 2023

@jbrockmendel Any news, please?

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm; I've opened #52468 to track a possible followup.

@rhshadrach rhshadrach added this to the 2.1 milestone Apr 8, 2023
@rhshadrach rhshadrach merged commit ebe484a into pandas-dev:main Apr 8, 2023
38 checks passed
@rhshadrach
Copy link
Member

Thanks @juleek!

@taytzehao taytzehao mentioned this pull request Aug 14, 2023
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: pd.Grouper with a datetime key in conjunction with another key generates incorrect number of group keys.
4 participants