BUG: GroupBy.get_group doesnt work with TimeGrouper #6914

sinhrks · 2014-04-19T18:23:39Z

get_group raises AttributeError when the group is created by TimeGrouper.

>>> df = pd.DataFrame({'Branch' : 'A A A A A A A B'.split(),
                   'Buyer': 'Carl Mark Carl Carl Joe Joe Joe Carl'.split(),
                   'Quantity': [1,3,5,1,8,1,9,3],
                   'Date' : [
                    datetime(2013,1,1,13,0), datetime(2013,1,1,13,5),
                    datetime(2013,10,1,20,0), datetime(2013,10,2,10,0),
                    datetime(2013,10,1,20,0), datetime(2013,10,2,10,0),
                    datetime(2013,12,2,12,0), datetime(2013,12,2,14,0),]})

>>> grouped = df.groupby(pd.Grouper(freq='1M',key='Date'))
>>> grouped.get_group(pd.Timestamp('2013-12-31'))
AttributeError: 'DataFrameGroupBy' object has no attribute 'indices'

hayd · 2014-04-21T06:03:02Z

pandas/core/groupby.py

+        for label, bin in zip(self.binlabels, self.bins):
+            if i < bin:
+                indices[label] = list(range(i, bin))
+                i = bin


Does this assume each group is contiguous / sorted?

I have a feeling there is a more efficient way to get this out, @jreback?

I think this is what _groupby_indices does (a cython routine). Also make an example that has unsorted bins for testing.

My understanding is that binlabels and bins are always sorted before being passed to BinGrouper. Is that incorrect?

I think so. In any event, doesn't _groupby_indices work?

The following is the _groupby_indices result. It seems to return the correct indices even if the input is not sorted:

>>> import pandas.algos as _algos
>>> import pandas.core.common as com
>>> values = ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'C']
>>> _algos.groupby_indices(com._ensure_object(values))
{'A': array([0, 1, 2, 3, 4]), 'C': array([7]), 'B': array([5, 6])}
>>> values = ['B', 'B', 'C', 'A', 'A', 'A', 'A', 'A']
>>> _algos.groupby_indices(com._ensure_object(values))
{'A': array([3, 4, 5, 6, 7]), 'C': array([2]), 'B': array([0, 1])}

But BinGrouper doesn't know the actual data index by itself, so I'm not sure which results are actually correct. Should it return the same indices regardless of the order of bins?
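To make the behaviour in question concrete, here is a minimal pure-Python sketch of what a "group positions by value" routine does. It is not the Cython `_groupby_indices` implementation; the name `groupby_indices_sketch` is hypothetical and plain lists stand in for NumPy arrays. It reproduces the two results quoted above, which shows why input order does not matter for this kind of routine.

```python
from collections import defaultdict


def groupby_indices_sketch(values):
    """Map each distinct value to the list of positions where it occurs."""
    positions = defaultdict(list)
    for pos, value in enumerate(values):
        positions[value].append(pos)
    return dict(positions)


sorted_values = ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'C']
unsorted_values = ['B', 'B', 'C', 'A', 'A', 'A', 'A', 'A']

sorted_result = groupby_indices_sketch(sorted_values)
unsorted_result = groupby_indices_sketch(unsorted_values)

print(sorted_result)
print(unsorted_result)
```

Because every position is recorded under the value found at that position, the routine needs one value per row. That is the piece of information BinGrouper does not carry.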

Yes, different. BinGrouper.indices must be a dict whose keys are timestamps, so the logic cannot be replaced with _groupby_indices.

>>> b.indices
defaultdict(<type 'list'>, {Timestamp('2013-12-31 00:00:00', offset='M'): [6, 7],
                            Timestamp('2013-10-31 00:00:00', offset='M'): [2, 3, 4, 5],
                            Timestamp('2013-01-31 00:00:00', offset='M'): [0, 1]})
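As a minimal sketch of the bin-edge logic in the diff under review, the indices dict can be built in plain Python. It assumes that `bins` holds cumulative right edges (one per period) and that rows are already sorted by date, so each group is contiguous. The function name `indices_from_bins` and the use of `datetime.date` labels in place of pandas Timestamps are illustrative assumptions, not pandas API. The bin values are the ones reported for the example dataframe in this thread.

```python
import calendar
from datetime import date


def indices_from_bins(binlabels, bins):
    """Build {label: [row positions]} from cumulative right bin edges.

    Assumes rows are sorted so that each bin covers a contiguous range.
    Empty bins (edge equal to the previous edge) get no entry.
    """
    indices = {}
    start = 0
    for label, stop in zip(binlabels, bins):
        if start < stop:
            indices[label] = list(range(start, stop))
            start = stop
    return indices


# Month-end labels for 2013, standing in for the 'M' frequency bin labels.
binlabels = [date(2013, m, calendar.monthrange(2013, m)[1]) for m in range(1, 13)]
bins = [2, 2, 2, 2, 2, 2, 2, 2, 2, 6, 6, 8]

indices = indices_from_bins(binlabels, bins)
print(indices)
```

Only January, October and December end up with entries, covering rows 0 to 1, 2 to 5 and 6 to 7, which matches the `b.indices` output quoted above.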

use _get_indices_dict; don't reinvent the wheel here

I think using _get_indices_dict is not easy, because BinGrouper doesn't hold anything that can be passed to the function as is.

>>> import pandas as pd
>>> import datetime
>>> from pandas.core.groupby import GroupBy, _get_indices_dict
>>> df = pd.DataFrame({'Branch' : 'A A A A A A A B'.split(),
...                    'Buyer': 'Carl Mark Carl Carl Joe Joe Joe Carl'.split(),
...                    'Quantity': [1,3,5,1,8,1,9,3],
...                    'Date' : [
...                     datetime.datetime(2013,1,1,13,0), datetime.datetime(2013,1,1,13,5),
...                     datetime.datetime(2013,10,1,20,0), datetime.datetime(2013,10,2,10,0),
...                     datetime.datetime(2013,10,1,20,0), datetime.datetime(2013,10,2,10,0),
...                     datetime.datetime(2013,12,2,12,0), datetime.datetime(2013,12,2,14,0),]})
>>> grouped = df.groupby(pd.Grouper(freq='1M',key='Date'))
>>> grouped.grouper.bins
[2 2 2 2 2 2 2 2 2 6 6 8]
>>> grouped.grouper.binlabels
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-31, ..., 2013-12-31]
Length: 12, Freq: M, Timezone: None

Thus, I have to convert it using logic similar to the current implementation.

>>> import numpy as np
>>> indices = []
>>> i = 0
>>> for j, bin in enumerate(grouped.grouper.bins):
...     if i < bin:
...         indices.extend([j] * (bin - i))
...         i = bin
...
>>> indices = np.array(indices)
>>> indices
[ 0 0 9 9 9 9 11 11]

Also, _get_indices_dict returns its keys as tuples, so further conversion is required.

>>> _get_indices_dict([indices], [grouped.grouper.binlabels])
{(numpy.datetime64('2013-10-31T09:00:00.000000000+0900'),): array([2, 3, 4, 5]),
 (numpy.datetime64('2013-12-31T09:00:00.000000000+0900'),): array([6, 7]),
 (numpy.datetime64('2013-01-31T09:00:00.000000000+0900'),): array([0, 1])}
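The "further conversion" mentioned here would amount to unwrapping the single-element tuple keys. A rough sketch in plain Python follows. The helper name `unwrap_single_keys` is hypothetical, and date strings and lists stand in for the `numpy.datetime64` keys and arrays shown above. Turning the keys back into Timestamps would be an additional step that is not shown.

```python
def unwrap_single_keys(indices):
    """Replace each 1-tuple key with its only element; values are untouched."""
    unwrapped = {}
    for key, positions in indices.items():
        if len(key) != 1:
            raise ValueError("expected single-element tuple keys, got %r" % (key,))
        unwrapped[key[0]] = positions
    return unwrapped


# Stand-in for the tuple-keyed result shown above.
tuple_keyed = {
    ('2013-10-31',): [2, 3, 4, 5],
    ('2013-12-31',): [6, 7],
    ('2013-01-31',): [0, 1],
}

scalar_keyed = unwrap_single_keys(tuple_keyed)
print(scalar_keyed)
```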

Do you have better logic?

Can we not do something like (?):

{g.grouper.levels[k] for k, v in pd.core.groupby._groupby_indices(b.bins).iteritems()}

may be faster...

Thanks, but it failed the tests. I don't think simply passing bins to an existing method can work, because bins correspond to the frequency periods being split, not to the index, so their length can differ from the index length.

Using the dataframe from the example above, the returned values differ from what is expected.

>>> list(pd.core.groupby._groupby_indices(grouped.grouper.bins).iteritems())
[(8, array([11])), (2, array([0, 1, 2, 3, 4, 5, 6, 7, 8])), (6, array([ 9, 10]))]
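The mismatch can be reproduced without pandas. In the sketch below, `positions_by_value` is a hypothetical pure-Python stand-in for a "group positions by value" routine, applied to the `bins` array reported earlier for the 8-row example dataframe. The positions it returns index into the 12 monthly periods, not into the 8 rows.

```python
from collections import defaultdict


def positions_by_value(values):
    """Map each distinct value to the list of positions where it occurs."""
    positions = defaultdict(list)
    for pos, value in enumerate(values):
        positions[value].append(pos)
    return dict(positions)


bins = [2, 2, 2, 2, 2, 2, 2, 2, 2, 6, 6, 8]  # one right edge per monthly period
n_rows = 8                                   # rows in the example dataframe

by_edge = positions_by_value(bins)
largest_position = max(max(pos) for pos in by_edge.values())

print(by_edge)
print(largest_position, n_rows)
```

The keys are bin edges (2, 6, 8) rather than labels, and the largest position is 11, which is past the last row position of 7. This agrees with the output quoted above.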

hayd · 2014-04-21T06:04:45Z

Thanks, this is definitely a bug and tests look good!

I reckon there's a more efficient way to get indices though.

jreback · 2014-04-21T12:16:20Z

@sinhrks as a side note, pls make sure that your examples are easily copy-pasted (e.g. put the from datetime import datetime), though IMHO, pls just import datetime

jreback · 2014-04-21T12:22:23Z

put this release note next to or with the one for #5267

jreback · 2014-04-21T18:00:38Z

@sinhrks ping when you update

jreback · 2014-04-28T14:08:08Z

ok...this is fine, if you think of some way to reuse the current groupby functions for those indices pls do a pr in the future

thanks!

hayd reviewed Apr 21, 2014
hayd added this to the 0.14.0 milestone Apr 21, 2014

jreback added Bug labels Apr 21, 2014

BUG: Groupby.get_group doesnt work with TimeGrouper

85157f0

jreback added a commit that referenced this pull request Apr 28, 2014

Merge pull request #6914 from sinhrks/getgroup

4614ac8

BUG: GroupBy.get_group doesnt work with TimeGrouper

jreback merged commit 4614ac8 into pandas-dev:master Apr 28, 2014

sinhrks deleted the getgroup branch April 29, 2014 01:30

hayd Apr 21, 2014

jreback Apr 21, 2014

sinhrks Apr 22, 2014

jreback Apr 22, 2014

sinhrks Apr 22, 2014

sinhrks Apr 25, 2014

jreback Apr 27, 2014

sinhrks Apr 28, 2014

hayd Apr 29, 2014

sinhrks Apr 30, 2014
