Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: GroupBy.get_group doesnt work with TimeGrouper #6914

Merged
merged 1 commit into from
Apr 28, 2014

Conversation

sinhrks
Copy link
Member

@sinhrks sinhrks commented Apr 19, 2014

get_group raises AttributeError when the group is created by TimeGrouper.

>>> df = pd.DataFrame({'Branch' : 'A A A A A A A B'.split(),
                   'Buyer': 'Carl Mark Carl Carl Joe Joe Joe Carl'.split(),
                   'Quantity': [1,3,5,1,8,1,9,3],
                   'Date' : [
                    datetime(2013,1,1,13,0), datetime(2013,1,1,13,5),
                    datetime(2013,10,1,20,0), datetime(2013,10,2,10,0),
                    datetime(2013,10,1,20,0), datetime(2013,10,2,10,0),
                    datetime(2013,12,2,12,0), datetime(2013,12,2,14,0),]})

>>> grouped = df.groupby(pd.Grouper(freq='1M',key='Date'))
>>> grouped.get_group(pd.Timestamp('2013-12-31'))
AttributeError: 'DataFrameGroupBy' object has no attribute 'indices'

for label, bin in zip(self.binlabels, self.bins):
if i < bin:
indices[label] = list(range(i, bin))
i = bin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this assume each group is contiguous / sorted?

I have a feeling there is a more efficient way to do this get this out, @jreback ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is what _groupby_indices does (a cython routine). also make an example that has an unsorted bins for testing as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that binlabels and bins are always sorted before passed to BinGrouper. Is it incorrect?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so. in any event, does not _groupby_indices work?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following is the _groupby_indices result. It returns correct index even if the input is not sorted?

>>> import pandas.algos as _algos
>>> import pandas.core.common as com
>>> values = ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'C']
>>> _algos.groupby_indices(com._ensure_object(values))
{'A': array([0, 1, 2, 3, 4]), 'C': array([7]), 'B': array([5, 6])}

>>> values = ['B', 'B', 'C', 'A', 'A', 'A', 'A', 'A']
>>> _algos.groupby_indices(com._ensure_object(values))
{'A': array([3, 4, 5, 6, 7]), 'C': array([2]), 'B': array([0, 1])}

But BinGrouper doesn't know actual data index by itself, so I'm not sure what results are actually correct. Should it return the same indices regardless of bins order?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, different. BinGrouper.indices must be a dict which key is timestamp. The logic cannot be replaced with _groupby_indices.

>>> b.indices
defaultdict(<type 'list'>, {Timestamp('2013-12-31 00:00:00', offset='M'): [6, 7], Timestamp('2013-10-31 00:00:00', offset='M'): [2, 3, 4, 5], Timestamp('2013-01-31 00:00:00', offset='M'): [0, 1]})

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use _get_indicies_dict; don't reinvent the wheel here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using _get_indices_dict is not easy, because BinGrouper doesn't have information which can be passed to the function as it is.

>>> import pandas as pd
>>> import datetime
>>> from pandas.core.groupby import GroupBy, _get_indices_dict

>>> df = pd.DataFrame({'Branch' : 'A A A A A A A B'.split(),
                   'Buyer': 'Carl Mark Carl Carl Joe Joe Joe Carl'.split(),
                   'Quantity': [1,3,5,1,8,1,9,3],
                   'Date' : [
                    datetime.datetime(2013,1,1,13,0), datetime.datetime(2013,1,1,13,5),
                    datetime.datetime(2013,10,1,20,0), datetime.datetime(2013,10,2,10,0),
                    datetime.datetime(2013,10,1,20,0), datetime.datetime(2013,10,2,10,0),
                    datetime.datetime(2013,12,2,12,0), datetime.datetime(2013,12,2,14,0),]})

>>> grouped = df.groupby(pd.Grouper(freq='1M',key='Date'))
>>> grouped.grouper.bins
[2 2 2 2 2 2 2 2 2 6 6 8]
>>> grouped.grouper.binlabels
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-31, ..., 2013-12-31]
Length: 12, Freq: M, Timezone: None

Thus, I have to convert it using the similar logic as current implementation.

>>> indices = []
>>> i = 0
>>> for j, bin in enumerate(grouped.grouper.bins):
>>>     if i < bin:
>>>         indices.extend([j] * (bin - i))
>>>         i = bin
>>> indices = np.array(indices)
>>> indices
[ 0  0  9  9  9  9 11 11]

And _get_indices_dict returns keys as tuple, further conversion required.

>>> _get_indices_dict([indices], [grouped.grouper.binlabels])
{(numpy.datetime64('2013-10-31T09:00:00.000000000+0900'),): array([2, 3, 4, 5]), (numpy.datetime64('2013-12-31T09:00:00.000000000+0900'),): array([6, 7]), (numpy.datetime64('2013-01-31T09:00:00.000000000+0900'),): array([0, 1])}

Do you have better logic?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not do something like (?):

{g.grouper.levels[k] for k, v in pd.core.groupby._groupby_indices(b.bins).iteritems()}

may be faster...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, but failed in test. I think simply passing bins to existing method shouldn't work, because bins are corresponding to frequencies to be split, not index. Thus its length can differ from index.

Using dataframe in above example, returnes values are different from expected.

>>> list(pd.core.groupby._groupby_indices(grouped.grouper.bins).iteritems())
[(8, array([11])), (2, array([0, 1, 2, 3, 4, 5, 6, 7, 8])), (6, array([ 9, 10]))]

@hayd hayd added this to the 0.14.0 milestone Apr 21, 2014
@hayd
Copy link
Contributor

hayd commented Apr 21, 2014

Thanks, this is definitely a bug and tests look good!

I reckon there's a more efficient way to get indices though.

@jreback
Copy link
Contributor

jreback commented Apr 21, 2014

@sinhrks as a side note, pls make sure that your examples are easily copy-pasted (e.g. put the from datetime import datetime), though IMHO, pls just import datetime

@jreback
Copy link
Contributor

jreback commented Apr 21, 2014

put this release note next to or with the one for #5267

@jreback
Copy link
Contributor

jreback commented Apr 21, 2014

@sinhrks ping when you update

jreback added a commit that referenced this pull request Apr 28, 2014
BUG: GroupBy.get_group doesnt work with TimeGrouper
@jreback jreback merged commit 4614ac8 into pandas-dev:master Apr 28, 2014
@jreback
Copy link
Contributor

jreback commented Apr 28, 2014

ok...this is fine, if you think of someway to reuse the current groupby functions for that indicies pls do a pr in the future

thanks!

@sinhrks sinhrks deleted the getgroup branch April 29, 2014 01:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants