Resample category data with timedelta index #12169

Closed
mapa17 opened this Issue Jan 28, 2016 · 3 comments


mapa17 commented Jan 28, 2016

Hi,

I get very strange behavior when I try to resample categorical data with a timedelta index, as compared to a datetime index.

>> d1 = pd.DataFrame({'Group_obj': 'A'}, index=pd.date_range('2000-1-1', periods=20, freq='s'))
>> d1['Group'] = d1['Group_obj'].astype('category')
>> d1
                    Group_obj Group
2000-01-01 00:00:00         A     A
2000-01-01 00:00:01         A     A
2000-01-01 00:00:02         A     A
2000-01-01 00:00:03         A     A
2000-01-01 00:00:04         A     A
2000-01-01 00:00:05         A     A
2000-01-01 00:00:06         A     A
2000-01-01 00:00:07         A     A
2000-01-01 00:00:08         A     A
2000-01-01 00:00:09         A     A
2000-01-01 00:00:10         A     A
2000-01-01 00:00:11         A     A
2000-01-01 00:00:12         A     A
2000-01-01 00:00:13         A     A
2000-01-01 00:00:14         A     A
2000-01-01 00:00:15         A     A
2000-01-01 00:00:16         A     A
2000-01-01 00:00:17         A     A
2000-01-01 00:00:18         A     A
2000-01-01 00:00:19         A     A

>> corr = d1.resample('10s', how=lambda x: (x.value_counts().index[0]))
>> corr
                    Group_obj Group
2000-01-01 00:00:00         A     A
2000-01-01 00:00:10         A     A

>> corr.dtypes
Group_obj    object
Group        object
dtype: object

>> d2 = d1.set_index(pd.to_timedelta(list(range(20)), unit='s'))
>> fxx = d2.resample('10s', how=lambda x: (x.value_counts().index[0]))
>> fxx
         Group_obj  Group
00:00:00         A    NaN
00:00:10         A    NaN

>> fxx.dtypes
Group_obj     object
Group        float64
dtype: object

It seems to me that when a timedelta index is used, the aggregated result for the categorical column is always NaN.
Is this expected?
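
For anyone reproducing this on pandas 0.18.0 or later: the how= keyword was deprecated in that release in favor of calling an aggregation on the resampler object. The following is a minimal sketch of the same comparison in that style, not a definitive test case. The helper name most_common is made up for illustration; on a pandas version that includes the fix, both results should show 'A' rather than NaN in the Group column.

```python
import pandas as pd


def most_common(col):
    # Hypothetical helper: the most frequent value within one resampled bin.
    return col.value_counts().index[0]


# Same frames as in the report: one object column, one categorical column.
d1 = pd.DataFrame({'Group_obj': 'A'},
                  index=pd.date_range('2000-1-1', periods=20, freq='s'))
d1['Group'] = d1['Group_obj'].astype('category')
d2 = d1.set_index(pd.to_timedelta(list(range(20)), unit='s'))

# Method-style resampling instead of the deprecated how= argument.
by_datetime = d1.resample('10s').agg(most_common)
by_timedelta = d2.resample('10s').agg(most_common)

print(by_datetime)
print(by_timedelta)
```

Comparing the Group column of the two results is the quickest way to see whether the installed version is affected.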

Thx

PS: is there a way to specify the dtype for the aggregated columns?
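
One possible workaround for the dtype question, sketched under the assumption that casting after the aggregation is acceptable, is to call astype on the aggregated result. The variable names below are illustrative only.

```python
import pandas as pd

# A plain object-dtype Series on a timedelta index (illustrative data).
idx = pd.to_timedelta(list(range(20)), unit='s')
groups = pd.Series(['A'] * 20, index=idx)

# Aggregate first, then cast the result to the dtype you want.
agg = groups.resample('10s').agg(lambda x: x.value_counts().index[0])
as_category = agg.astype('category')

print(as_category.dtype)
```

This leaves dtype inference to pandas during the aggregation and makes the final dtype explicit in a separate step.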

Contributor

jreback commented Jan 28, 2016

Hmm, this does appear a little buggy.

You shouldn't need to specify the dtype on aggregations; they are inferred. Here I think there is an embedded exception which is caught instead of actually computing correctly.

jreback added this to the 0.18.0 milestone Jan 28, 2016

Contributor

jreback commented Jan 28, 2016

I'll look at this after #11841, as the timedelta resampling is tested a bit more there (but not enough!)

Contributor

BranYang commented Feb 4, 2016

The root cause of this issue is that when a Series is constructed from a dict with a TimedeltaIndex as the index, the values are treated as float64. See pandas/core/series.py, lines 172 to 185:

try:
    if isinstance(index, DatetimeIndex):
        if len(data):
            # coerce back to datetime objects for lookup
            data = _dict_compat(data)
            data = lib.fast_multiget(data, index.astype('O'),
                                     default=np.nan)
        else:
            data = np.nan
    elif isinstance(index, PeriodIndex):
        data = ([data.get(i, nan) for i in index]
                if data else np.nan)
    else:
        data = lib.fast_multiget(data, index.values,
                                 default=np.nan)

I believe just changing isinstance(index, PeriodIndex): to isinstance(index, (PeriodIndex, TimedeltaIndex)): would solve this issue.

Before

In [5]: fxx = d2.resample('10s', how=lambda x: (x.value_counts().index[0]))

In [6]: fxx
Out[6]:
         Group_obj  Group
00:00:00         A    NaN
00:00:10         A    NaN

After

In [5]: fxx = d2.resample('10s', how=lambda x: (x.value_counts().index[0]))

In [6]: fxx
Out[6]:
         Group_obj Group
00:00:00         A     A
00:00:10         A     A
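
The root cause described above can also be checked without resample at all. This is a small sketch, assuming a pandas version that includes the fix: it builds a Series from a dict keyed by Timedelta objects and passes a TimedeltaIndex as the index. The values 'A' and 'B' are arbitrary sample data.

```python
import pandas as pd

idx = pd.to_timedelta([0, 10], unit='s')

# Dict keyed by the Timedelta objects taken from the index itself.
data = {idx[0]: 'A', idx[1]: 'B'}

s = pd.Series(data, index=idx)
print(s)
print(s.dtype)
```

On an affected version the values would be expected to come back as NaN with dtype float64, matching the "Before" output above.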

@BranYang BranYang added a commit to BranYang/pandas that referenced this issue Feb 9, 2016

@BranYang BranYang Fix #12169 - Resample category data with timedelta index 7cf1be9

jreback closed this in e9558d3 Feb 10, 2016

@cldy cldy added a commit to cldy/pandas that referenced this issue Feb 11, 2016

@BranYang @cldy BranYang + cldy Fix #12169 - Resample category data with timedelta index
closes #12169

Author: Bran Yang <snowolfy@163.com>

Closes #12271 from BranYang/issue12169 and squashes the following commits:

4a5605f [Bran Yang] add tests to Series/test_constructors; and update whatsnew
7cf1be9 [Bran Yang] Fix #12169 - Resample category data with timedelta index
fa1e2c8