-
-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: GroupBy.get_group doesnt work with TimeGrouper #6914
Conversation
for label, bin in zip(self.binlabels, self.bins): | ||
if i < bin: | ||
indices[label] = list(range(i, bin)) | ||
i = bin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this assume each group is contiguous / sorted?
I have a feeling there is a more efficient way to do this get this out, @jreback ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is what _groupby_indices
does (a cython routine). also make an example that has an unsorted bins for testing as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that binlabels
and bins
are always sorted before passed to BinGrouper
. Is it incorrect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so. in any event, does not _groupby_indices
work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following is the _groupby_indices
result. It returns correct index even if the input is not sorted?
>>> import pandas.algos as _algos
>>> import pandas.core.common as com
>>> values = ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'C']
>>> _algos.groupby_indices(com._ensure_object(values))
{'A': array([0, 1, 2, 3, 4]), 'C': array([7]), 'B': array([5, 6])}
>>> values = ['B', 'B', 'C', 'A', 'A', 'A', 'A', 'A']
>>> _algos.groupby_indices(com._ensure_object(values))
{'A': array([3, 4, 5, 6, 7]), 'C': array([2]), 'B': array([0, 1])}
But BinGrouper
doesn't know actual data index by itself, so I'm not sure what results are actually correct. Should it return the same indices regardless of bins
order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, different. BinGrouper.indices
must be a dict which key is timestamp. The logic cannot be replaced with _groupby_indices
.
>>> b.indices
defaultdict(<type 'list'>, {Timestamp('2013-12-31 00:00:00', offset='M'): [6, 7], Timestamp('2013-10-31 00:00:00', offset='M'): [2, 3, 4, 5], Timestamp('2013-01-31 00:00:00', offset='M'): [0, 1]})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use _get_indicies_dict
; don't reinvent the wheel here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using _get_indices_dict
is not easy, because BinGrouper
doesn't have information which can be passed to the function as it is.
>>> import pandas as pd
>>> import datetime
>>> from pandas.core.groupby import GroupBy, _get_indices_dict
>>> df = pd.DataFrame({'Branch' : 'A A A A A A A B'.split(),
'Buyer': 'Carl Mark Carl Carl Joe Joe Joe Carl'.split(),
'Quantity': [1,3,5,1,8,1,9,3],
'Date' : [
datetime.datetime(2013,1,1,13,0), datetime.datetime(2013,1,1,13,5),
datetime.datetime(2013,10,1,20,0), datetime.datetime(2013,10,2,10,0),
datetime.datetime(2013,10,1,20,0), datetime.datetime(2013,10,2,10,0),
datetime.datetime(2013,12,2,12,0), datetime.datetime(2013,12,2,14,0),]})
>>> grouped = df.groupby(pd.Grouper(freq='1M',key='Date'))
>>> grouped.grouper.bins
[2 2 2 2 2 2 2 2 2 6 6 8]
>>> grouped.grouper.binlabels
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-31, ..., 2013-12-31]
Length: 12, Freq: M, Timezone: None
Thus, I have to convert it using the similar logic as current implementation.
>>> indices = []
>>> i = 0
>>> for j, bin in enumerate(grouped.grouper.bins):
>>> if i < bin:
>>> indices.extend([j] * (bin - i))
>>> i = bin
>>> indices = np.array(indices)
>>> indices
[ 0 0 9 9 9 9 11 11]
And _get_indices_dict
returns keys as tuple, further conversion required.
>>> _get_indices_dict([indices], [grouped.grouper.binlabels])
{(numpy.datetime64('2013-10-31T09:00:00.000000000+0900'),): array([2, 3, 4, 5]), (numpy.datetime64('2013-12-31T09:00:00.000000000+0900'),): array([6, 7]), (numpy.datetime64('2013-01-31T09:00:00.000000000+0900'),): array([0, 1])}
Do you have better logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we not do something like (?):
{g.grouper.levels[k] for k, v in pd.core.groupby._groupby_indices(b.bins).iteritems()}
may be faster...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, but failed in test. I think simply passing bins
to existing method shouldn't work, because bins
are corresponding to frequencies to be split, not index. Thus its length can differ from index.
Using dataframe in above example, returnes values are different from expected.
>>> list(pd.core.groupby._groupby_indices(grouped.grouper.bins).iteritems())
[(8, array([11])), (2, array([0, 1, 2, 3, 4, 5, 6, 7, 8])), (6, array([ 9, 10]))]
Thanks, this is definitely a bug and tests look good! I reckon there's a more efficient way to get indices though. |
@sinhrks as a side note, pls make sure that your examples are easily copy-pasted (e.g. put the |
put this release note next to or with the one for #5267 |
@sinhrks ping when you update |
BUG: GroupBy.get_group doesnt work with TimeGrouper
ok...this is fine, if you think of someway to reuse the current groupby functions for that thanks! |
get_group
raisesAttributeError
when the group is created byTimeGrouper
.