groupby.mean, etc, doesn't recognize timedelta64 #5724

Closed
hsharrison opened this Issue Dec 17, 2013 · 9 comments


See http://stackoverflow.com/questions/20625982/split-apply-combine-on-pandas-timedelta-column
related as well: http://stackoverflow.com/questions/20789976/python-pandas-dataframe-1st-line-issue-with-datetime-timedelta/20802902#20802902

I have a DataFrame with a column of timedeltas (upon inspection, the dtype is timedelta64[ns], i.e. '<m8[ns]'), and I'd like to do a split-apply-combine, but the timedelta column is being dropped:

import pandas as pd

import numpy as np

pd.__version__
Out[3]: '0.13.0rc1'

np.__version__
Out[4]: '1.8.0'

data = pd.DataFrame(np.random.rand(10, 3), columns=['f1', 'f2', 'td'])

data['td'] *= 10000000

data['td'] = pd.Series(data['td'], dtype='<m8[ns]')

data
Out[8]: 
         f1        f2              td
0  0.990140  0.948313 00:00:00.003066
1  0.277125  0.993549 00:00:00.001443
2  0.016427  0.581129 00:00:00.009257
3  0.048662  0.512215 00:00:00.000702
4  0.846301  0.179160 00:00:00.000396
5  0.568323  0.419887 00:00:00.000266
6  0.328182  0.919897 00:00:00.006138
7  0.292882  0.213219 00:00:00.008876
8  0.623332  0.003409 00:00:00.000322
9  0.650436  0.844180 00:00:00.006873

[10 rows x 3 columns]

data.groupby(data.index < 5).mean()
Out[9]: 
             f1        f2
False  0.492631  0.480118
True   0.435731  0.642873

[2 rows x 2 columns]

Or, forcing pandas to try the operation on the 'td' column:

data.groupby(data.index < 5)['td'].mean()
---------------------------------------------------------------------------
DataError                                 Traceback (most recent call last)
<ipython-input-12-88cc94e534b7> in <module>()
----> 1 data.groupby(data.index < 5)['td'].mean()

/path/to/lib/python3.3/site-packages/pandas-0.13.0rc1-py3.3-linux-x86_64.egg/pandas/core/groupby.py in mean(self)
    417         """
    418         try:
--> 419             return self._cython_agg_general('mean')
    420         except GroupByError:
    421             raise

/path/to/lib/python3.3/site-packages/pandas-0.13.0rc1-py3.3-linux-x86_64.egg/pandas/core/groupby.py in _cython_agg_general(self, how, numeric_only)
    669 
    670         if len(output) == 0:
--> 671             raise DataError('No numeric types to aggregate')
    672 
    673         return self._wrap_aggregated_output(output, names)

DataError: No numeric types to aggregate

However, taking the mean of the column works fine, so numeric operations should be possible:

data['td'].mean()
Out[11]: 
0   00:00:00.003734
dtype: timedelta64[ns]
Contributor

hayd commented Dec 18, 2013

Worth mentioning that DataFrame's mean doesn't handle this either; I think it should (i.e., convert columns that were originally datelike back to a datelike dtype):

In [11]: data.mean()
Out[11]: 
f1          0.528609
f2          0.583264
td    5975598.700000
dtype: float64

Compare to data.mean(1), which ignores date columns (correctly, IMO, since you're aggregating across dtypes).

Contributor

cancan101 commented Dec 18, 2013

Interestingly, mean works on a Series but gives a nonsense result on a DataFrame (this is using the new timedelta formatting code, #5701):

In [10]: pd.to_timedelta(list(range(5)), unit='D')
Out[10]: 
0   0 days
1   1 days
2   2 days
3   3 days
4   4 days
dtype: timedelta64[ns]
In [9]: pd.to_timedelta(list(range(5)), unit='D').mean()
Out[9]: 
0   2 days
dtype: timedelta64[ns]
In [6]: pd.DataFrame(pd.to_timedelta(list(range(5)), unit='D')).mean()
Out[6]: 
0   55785 days, 14:53:21.659060
dtype: timedelta64[ns]
Contributor

jreback commented Dec 18, 2013

@hayd actually the results of data.mean() are correct. The result for td is in nanoseconds. It IS, however, possible to return an object array that is correct, e.g.:

In [34]: Series([0.1,0.2,timedelta(10)])
Out[34]: 
0                 0.1
1                 0.2
2    10 days, 0:00:00
dtype: object

so @hayd maybe create a separate issue for this type of inference (it's just a bit of inference detection in nanops.py/_reduce)
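jreback's point can be demonstrated directly: the float that data.mean() returns for td is a count of nanoseconds, which can be reinterpreted as a timedelta by hand. A minimal sketch (the example data here is hypothetical, not from the thread):

```python
import pandas as pd

# A frame with a timedelta64[ns] column (illustrative data)
df = pd.DataFrame({'f1': [0.1, 0.2],
                   'td': pd.to_timedelta([1, 3], unit='s')})

# Viewing the column as int64 gives nanosecond counts; their mean is the
# float-nanoseconds number that data.mean() reports for 'td'.
ns = df['td'].astype('int64').mean()

# pd.to_timedelta reinterprets that number as a timedelta again
td_mean = pd.to_timedelta(ns, unit='ns')
```

Here `td_mean` comes back as a proper `Timedelta` rather than a bare float of nanoseconds.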

Contributor

jreback commented Dec 18, 2013

@hsharrison as far as the groupby goes: this is just not implemented ATM in groupby.py; it's not that difficult, it just needs to follow basically what the datetime64 code does
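Until that lands, the same idea can be applied manually as a workaround: view the timedelta column as int64 nanoseconds so the aggregation sees a numeric dtype, then convert the result back. A sketch under that assumption (the data is illustrative):

```python
import pandas as pd

data = pd.DataFrame({'td': pd.to_timedelta([1, 2, 3, 4], unit='s')})
key = data.index < 2  # boolean grouping key, as in the report

# View the timedelta column as int64 nanoseconds so the cython mean
# path sees a numeric dtype, then reinterpret the aggregated result.
ns_mean = data['td'].astype('int64').groupby(key).mean()
td_mean = pd.to_timedelta(ns_mean, unit='ns')
```

The groups come back sorted (False, then True), each with a timedelta mean instead of being dropped.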


Contributor

danielballan commented Apr 15, 2014

See #6884.


Workaround: use .describe() (which includes the mean) rather than .mean().

See https://gist.github.com/tomfitzhenry/d36ebba697a1f3eeefcb for demo.
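For illustration, a sketch of the .describe() workaround applied to the example data from the top of the thread (using pd.to_timedelta for the dtype conversion; the exact shape of the describe() output varies across pandas versions):

```python
import numpy as np
import pandas as pd

# hsharrison's example data from the bug report
data = pd.DataFrame(np.random.rand(10, 3), columns=['f1', 'f2', 'td'])
data['td'] *= 10000000
data['td'] = pd.to_timedelta(data['td'], unit='ns')

# .describe() takes a different code path than .mean(), so the timedelta
# column survives and its 'mean' entry can be read off per group.
desc = data.groupby(data.index < 5)['td'].describe()
```

One row (or sub-series, depending on version) per group, with the mean available even though .mean() drops the column.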

Contributor

danielballan commented Oct 15, 2015

Wow. Any insight into why that works? I would not have expected a convenience method to take a different code path.

zaxliu commented Nov 3, 2015

I also have the same problem. @tomfitzhenry's solution works.

kjam commented Feb 3, 2016

Seeing as this keeps getting kicked to the next release: you can also compute the mean yourself from sum and count. Referencing @hsharrison's original dataframe from the bug report:

data = pd.DataFrame(np.random.rand(10, 3), columns=['f1', 'f2', 'td'])

data['td'] *= 10000000

data['td'] = pd.Series(data['td'], dtype='<m8[ns]')

grouped_df = data.groupby(data.index < 5)

mean_df = grouped_df.mean()

In [14]: mean_df
Out[14]: 
             f1        f2
False  0.271488  0.614299
True   0.535522  0.476918

mean_df['td'] = grouped_df['td'].sum() / grouped_df['td'].count()

In [16]: mean_df
Out[16]: 
             f1        f2              td
False  0.271488  0.614299 00:00:00.004187
True   0.535522  0.476918 00:00:00.003278


@jreback jreback added a commit to jreback/pandas that referenced this issue Jan 4, 2017

BUG: make sure that we are passing thru kwargs to groupby
BUG: allow timedelta64 to work in groupby with numeric_only=False

closes #5724
be7d2d2


jreback closed this in 8e13da2 Jan 18, 2017

@AnkurDedania AnkurDedania added a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017

jreback + AnkurDedania: BUG: make sure that we are passing thru kwargs to groupby
BUG: allow timedelta64 to work in groupby with numeric_only=False
closes #5724

Author: Jeff Reback <jeff@reback.net>

Closes #15054 from jreback/groupby_arg and squashes the following commits:

768fce1 [Jeff Reback] BUG: make sure that we are passing thru kwargs to groupby BUG: allow timedelta64 to work in groupby with numeric_only=False
d8d866f
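Per the commit message, the fix makes timedelta64 columns participate in groupby aggregation when numeric_only=False is passed. A sketch of the behavior after the fix landed in pandas 0.20 (the data is illustrative):

```python
import pandas as pd

data = pd.DataFrame({'f1': [0.1, 0.2, 0.3, 0.4],
                     'td': pd.to_timedelta([1, 2, 3, 4], unit='s')})

# With the fix merged, the timedelta column is no longer dropped when
# numeric_only=False (newer pandas versions include it by default).
out = data.groupby(data.index < 2).mean(numeric_only=False)
```

The result now has a 'td' column of group means, instead of silently omitting it or raising DataError.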