BUG: Unable to aggregate TimeGrouper #7453

Closed
sinhrks opened this issue Jun 14, 2014 · 6 comments
Labels
Bug Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Testing pandas testing functions or related to the test suite

@sinhrks
Member

sinhrks commented Jun 14, 2014

Derived from #7373. There seem to be three issues related to TimeGrouper aggregation.

1. var, std, mean

var, std and mean raise ValueError when the group key contains NaT.

import datetime

import pandas as pd
import numpy as np

data = np.random.randn(20, 4)
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])
df['dt'] = [datetime.datetime(2013, 1, 1), datetime.datetime(2013, 1, 2),
            datetime.datetime(2013, 1, 3), datetime.datetime(2013, 1, 4),
            datetime.datetime(2013, 1, 5)] * 4
df['dt_nat'] = [datetime.datetime(2013, 1, 1), datetime.datetime(2013, 1, 2),
                pd.NaT, datetime.datetime(2013, 1, 4),
                datetime.datetime(2013, 1, 5)] * 4

df.groupby(pd.TimeGrouper(key='dt', freq='D')).mean()
# OK
df.groupby(pd.TimeGrouper(key='dt_nat', freq='D')).mean()
# ValueError: month must be in 1..12
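A minimal sketch for re-checking this case on a recent pandas release. It assumes `pd.Grouper(key=..., freq=...)` as the stand-in for `pd.TimeGrouper`, and it swaps the random data for a deterministic array so the means are easy to verify by hand; neither choice comes from the original report.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random data in the report (assumption).
df = pd.DataFrame(np.arange(80.0).reshape(20, 4), columns=["A", "B", "C", "D"])

# Same key layout as 'dt_nat': the third entry of every block of five is NaT.
df["dt_nat"] = pd.to_datetime(
    ["2013-01-01", "2013-01-02", None, "2013-01-04", "2013-01-05"] * 4
)

# pd.Grouper is assumed here in place of pd.TimeGrouper.
result = df.groupby(pd.Grouper(key="dt_nat", freq="D"))[["A", "B", "C", "D"]].mean()
print(result)
```

If the aggregation behaves as expected, rows whose key is NaT are left out and the daily bin for 2013-01-03 is present but empty.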
2. size (#7600)

size raises AttributeError whether or not the group key contains NaT.

df.groupby(pd.TimeGrouper(key='dt', freq='D')).size()
# AttributeError: 'BinGrouper' object has no attribute 'groupings'
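A small sketch of a validation check for `size`, again assuming `pd.Grouper` in place of `pd.TimeGrouper` and deterministic data instead of the random frame above. The `dt_nat` column is rebuilt here with `Series.where`, which is only one convenient way to blank out a single day.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(80.0).reshape(20, 4), columns=["A", "B", "C", "D"])
df["dt"] = pd.to_datetime(
    ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04", "2013-01-05"] * 4
)
# Replace every 2013-01-03 entry with NaT to mimic the 'dt_nat' column.
df["dt_nat"] = df["dt"].where(df["dt"] != pd.Timestamp("2013-01-03"))

# Group sizes with and without NaT in the key (pd.Grouper is an assumption).
sizes = df.groupby(pd.Grouper(key="dt", freq="D")).size()
sizes_nat = df.groupby(pd.Grouper(key="dt_nat", freq="D")).size()
print(sizes)
print(sizes_nat)
```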
3. first, last, nth

These appear to work, but TimeGrouper returns a different result from a normal groupby.

df.groupby('dt').first()
#                    A         B         C         D  key     dt_nat
# dt                                                                
#2013-01-01 -1.868691 -0.554116 -0.094949  0.009740    1 2013-01-01
#2013-01-02  0.272139 -0.106543  1.319331 -0.532377    2 2013-01-02
#2013-01-03 -1.637544  2.699557 -0.164414 -1.451295    3        NaT
#2013-01-04  1.642609 -0.313832  0.494468 -0.698104    4 2013-01-04
#2013-01-05 -1.554106  1.230299 -1.408515 -0.000722    5 2013-01-05


df.groupby(pd.TimeGrouper(key='dt', freq='D')).first()
#                    A         B         C         D  key     dt_nat
# dt                                                                
#2013-01-01 -1.868691 -0.554116 -0.094949  0.009740    1 2013-01-01
#2013-01-02  0.272139 -0.106543  1.319331 -0.532377    2 2013-01-02
#2013-01-03 -1.637544  2.699557 -0.164414 -1.451295    3        NaT
#2013-01-04  1.642609 -0.313832  0.494468 -0.698104    4 2013-01-04
#2013-01-05 -0.024332  1.668172 -0.328200  1.731480    5 2013-01-05

# Compare 5th row

I assume the difference arises because BinGrouper sorts rows differently from a normal groupby, so the results of a normal groupby and TimeGrouper can differ.

df.groupby('dt').get_group(datetime.datetime(2013, 1, 5))
#            A         B         C         D         dt     dt_nat
#4   0.632937  0.224670 -0.201186 -0.340428 2013-01-05 2013-01-05
#9  -1.238944 -0.031075 -1.173326 -0.314716 2013-01-05 2013-01-05
#14  2.108985  0.993430  1.300605  1.452049 2013-01-05 2013-01-05
#19  0.315452 -0.817634 -0.526728  0.201415 2013-01-05 2013-01-05

df.groupby(pd.TimeGrouper(key='dt', freq='D')).get_group(datetime.datetime(2013, 1, 5))
#            A         B         C         D         dt     dt_nat
#9  -1.238944 -0.031075 -1.173326 -0.314716 2013-01-05 2013-01-05
#4   0.632937  0.224670 -0.201186 -0.340428 2013-01-05 2013-01-05
#14  2.108985  0.993430  1.300605  1.452049 2013-01-05 2013-01-05
#19  0.315452 -0.817634 -0.526728  0.201415 2013-01-05 2013-01-05
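One way to check whether the two paths agree is to compare `first()` from both on the same frame. This is only a sketch: it assumes `pd.Grouper` in place of `pd.TimeGrouper`, uses deterministic data, and compares raw values because the two results may carry different index metadata.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(80.0).reshape(20, 4), columns=["A", "B", "C", "D"])
df["dt"] = pd.to_datetime(
    ["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04", "2013-01-05"] * 4
)

cols = ["A", "B", "C", "D"]
# First row of each group via a plain column key.
plain = df.groupby("dt")[cols].first()
# First row of each daily bin via a time-based grouper (assumed replacement).
binned = df.groupby(pd.Grouper(key="dt", freq="D"))[cols].first()

# Compare values only; the indexes may differ in metadata such as freq.
same = np.array_equal(plain.to_numpy(), binned.to_numpy())
print(same)
```

If rows within a bin keep their original order, both results pick the same row for each day.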
@jreback jreback added this to the 0.14.1 milestone Jun 14, 2014
@jreback jreback modified the milestones: 0.15.0, 0.14.1 Jun 26, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@jreback jreback added Testing pandas testing functions or related to the test suite Difficulty Novice labels Feb 17, 2016
@jreback
Contributor

jreback commented Feb 17, 2016

The first 2 look fixed; they just need validation tests. Then we can deal with the 3rd issue separately.

@jreback jreback modified the milestones: 0.18.1, Next Major Release Feb 17, 2016
@jreback
Contributor

jreback commented Apr 10, 2016

we still need tests for the first 2 parts of this issue (validation tests), yes?

@sinhrks
Member Author

sinhrks commented Apr 10, 2016

No, those are already tested here.

The only remaining item is nth, which I'll enable once #11039 is merged (then close #12839 and complete this issue).

@benrifkind

Not sure if this fits here, but I have another issue with nth when I group by a TimeGrouper and another categorical variable. The categorical variable gets dropped in the aggregation step.

Here's an example:

df = pd.DataFrame({'cat': ['cat0']*2 + ['cat1']*2,
                   'date': [pd.datetime(2016,1,1), pd.datetime(2016,1,2)]*2,
                   'val': np.arange(1,5)})

#     cat       date  val
# 0  cat0 2016-01-01    1
# 1  cat0 2016-01-02    2
# 2  cat1 2016-01-01    3
# 3  cat1 2016-01-02    4

This works like I would expect

df.set_index("date").groupby([pd.TimeGrouper("2D"), "cat"]).last()

# date    cat  val
# 2016-01-01  cat0    2
# 2016-01-01  cat1    4

But this does not

df.set_index("date").groupby([pd.TimeGrouper("2D"), "cat"]).nth(-1)

# date   val 
# 2016-01-02  2
# 2016-01-02  4
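For reference, a small sketch of how the `last()` case could be checked programmatically. It assumes `pd.Grouper(freq="2D")` in place of `pd.TimeGrouper("2D")` and builds the dates with `pd.to_datetime`; `nth` is left out because its return shape is not the same across pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cat": ["cat0"] * 2 + ["cat1"] * 2,
    "date": pd.to_datetime(["2016-01-01", "2016-01-02"] * 2),
    "val": np.arange(1, 5),
})

# Group by 2-day bins on the index plus the 'cat' column (pd.Grouper is assumed).
result = df.set_index("date").groupby([pd.Grouper(freq="2D"), "cat"]).last()

# The 'cat' key should survive as an index level alongside 'date'.
print(result.index.names)
print(result)
```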

@jreback
Contributor

jreback commented May 19, 2016

you might be using an older version

In [1]: df = pd.DataFrame({'cat': ['cat0']*2 + ['cat1']*2, 
              'date':[pd.datetime(2016,1,1), pd.datetime(2016,1,2)]*2,
             'val':np.arange(1,5)})

In [2]: df.set_index("date").groupby([pd.TimeGrouper("2D"), "cat"]).last()
Out[2]: 
                 val
date       cat      
2016-01-01 cat0    2
           cat1    4

In [3]: df.set_index("date").groupby([pd.TimeGrouper("2D"), "cat"]).nth(-1)
Out[3]: 
                 val
date       cat      
2016-01-01 cat0    2
           cat1    4

In [4]: pd.__version__
Out[4]: u'0.18.1'

@benrifkind

Yup. You're right. Just updated from 0.18.0 to 0.18.1 and it works. Thanks.

3 participants