Resample category data with timedelta index #12169

Closed
mapa17 opened this issue Jan 28, 2016 · 3 comments
Labels
Bug Categorical Categorical Data Type Resample resample method

mapa17 commented Jan 28, 2016

Hi,

I get very strange behavior when I try to resample categorical data with a timedelta index, compared to a datetime index.

>> d1 = pd.DataFrame({'Group_obj': 'A'}, index=pd.date_range('2000-1-1', periods=20, freq='s'))
>> d1['Group'] = d1['Group_obj'].astype('category')
>> d1
                    Group_obj Group
2000-01-01 00:00:00         A     A
2000-01-01 00:00:01         A     A
2000-01-01 00:00:02         A     A
2000-01-01 00:00:03         A     A
2000-01-01 00:00:04         A     A
2000-01-01 00:00:05         A     A
2000-01-01 00:00:06         A     A
2000-01-01 00:00:07         A     A
2000-01-01 00:00:08         A     A
2000-01-01 00:00:09         A     A
2000-01-01 00:00:10         A     A
2000-01-01 00:00:11         A     A
2000-01-01 00:00:12         A     A
2000-01-01 00:00:13         A     A
2000-01-01 00:00:14         A     A
2000-01-01 00:00:15         A     A
2000-01-01 00:00:16         A     A
2000-01-01 00:00:17         A     A
2000-01-01 00:00:18         A     A
2000-01-01 00:00:19         A     A

>> corr = d1.resample('10s', how=lambda x: (x.value_counts().index[0]))
>> corr
                    Group_obj Group
2000-01-01 00:00:00         A     A
2000-01-01 00:00:10         A     A

>> corr.dtypes
Group_obj    object
Group        object
dtype: object

>> d2 = d1.set_index(pd.to_timedelta(list(range(20)), unit='s'))
>> fxx = d2.resample('10s', how=lambda x: (x.value_counts().index[0]))
>> fxx
         Group_obj  Group
00:00:00         A    NaN
00:00:10         A    NaN

>> fxx.dtypes
Group_obj     object
Group        float64
dtype: object

It seems that when a timedelta index is used, the aggregated result for the categorical column is always NaN.
Is this the intended behavior?

Thx

PS: Is there a way to specify the dtype for the aggregated columns?
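For readers on a newer pandas, where the `how=` keyword is no longer accepted by `resample`, the same comparison can be sketched with `.resample(...).agg(...)`. This is a minimal sketch, not the reporter's original code: the helper name `most_frequent` is hypothetical, and only the categorical `Group` column is aggregated. On a pandas version that includes the fix for this issue, both results should contain `'A'` rather than NaN.

```python
import pandas as pd

# Same data as in the report: 20 one-second samples, one categorical column.
d1 = pd.DataFrame({'Group_obj': 'A'},
                  index=pd.date_range('2000-01-01', periods=20, freq='s'))
d1['Group'] = d1['Group_obj'].astype('category')

# Same rows, but indexed by timedeltas instead of datetimes.
d2 = d1.set_index(pd.to_timedelta(list(range(20)), unit='s'))


def most_frequent(x):
    # Hypothetical helper: value_counts() sorts by count, so the first
    # index label is the most frequent value in the bin.
    return x.value_counts().index[0]


# Aggregate the categorical column in 10-second bins for both index types.
by_datetime = d1['Group'].resample('10s').agg(most_frequent)
by_timedelta = d2['Group'].resample('10s').agg(most_frequent)

print(by_datetime)
print(by_timedelta)
```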

Contributor

jreback commented Jan 28, 2016

Hmm, this does appear a little buggy.

You shouldn't need to specify the dtype on aggregations; they are inferred. Here I think there is an embedded exception that is caught instead of the result actually being computed correctly.
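On the PS about dtypes: if an explicit dtype is still wanted after an aggregation, one option is to cast the result with `astype`. The snippet below is only a sketch of that idea under assumed data (a single string column named `Group`); it is not something proposed in this thread, and it uses the newer `.resample(...).agg(...)` spelling rather than `how=`.

```python
import pandas as pd

# Assumed example data: 20 one-second samples of the string 'A'.
idx = pd.date_range('2000-01-01', periods=20, freq='s')
s = pd.Series('A', index=idx, name='Group')

# Most frequent value per 10-second bin; the result dtype is left to inference.
agg = s.resample('10s').agg(lambda x: x.value_counts().index[0])

# Cast explicitly afterwards if a categorical result is required.
agg_cat = agg.astype('category')

print(agg_cat.dtype)
```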

@jreback jreback added Bug Resample resample method Categorical Categorical Data Type Difficulty Intermediate labels Jan 28, 2016
@jreback jreback added this to the 0.18.0 milestone Jan 28, 2016
Contributor

jreback commented Jan 28, 2016

I'll look at this after #11841, as timedelta resampling is tested a bit more there (but not enough!).

Contributor

BranYang commented Feb 4, 2016

The root cause of this issue is that when a Series is constructed from a dict with a TimedeltaIndex as the index, the values are treated as float64 (they come back as NaN). See pandas/core/series.py, lines 172 to 185:

try:
    if isinstance(index, DatetimeIndex):
        if len(data):
            # coerce back to datetime objects for lookup
            data = _dict_compat(data)
            data = lib.fast_multiget(data, index.astype('O'),
                                     default=np.nan)
        else:
            data = np.nan
    elif isinstance(index, PeriodIndex):
        data = ([data.get(i, nan) for i in index]
                if data else np.nan)
    else:
        data = lib.fast_multiget(data, index.values,
                                 default=np.nan)

I believe just changing isinstance(index, PeriodIndex): to isinstance(index, (PeriodIndex, TimedeltaIndex)): would solve this issue.

Before

In [5]: fxx = d2.resample('10s', how=lambda x: (x.value_counts().index[0]))

In [6]: fxx
Out[6]:
         Group_obj  Group
00:00:00         A    NaN
00:00:10         A    NaN

After

In [5]: fxx = d2.resample('10s', how=lambda x: (x.value_counts().index[0]))

In [6]: fxx
Out[6]:
         Group_obj Group
00:00:00         A     A
00:00:10         A     A
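The root cause described above can be checked in isolation, without going through `resample`. The snippet below is a minimal sketch of such a check: it builds a Series from a dict keyed by timedeltas together with a matching TimedeltaIndex, and confirms that the values survive. The values 'A', 'B', 'C' and the variable names are illustrative assumptions; on a pandas version affected by this issue the result would presumably hold NaN instead.

```python
import pandas as pd

# Index to look the dict keys up against.
index = pd.to_timedelta([0, 10, 20], unit='s')

# Dict keyed by Timedelta objects that match the index entries.
data = {
    pd.to_timedelta(0, unit='s'): 'A',
    pd.to_timedelta(10, unit='s'): 'B',
    pd.to_timedelta(20, unit='s'): 'C',
}

# With the fix in place, each key is found and the values are kept.
result = pd.Series(data=data, index=index)
expected = pd.Series(data=['A', 'B', 'C'], index=index)

print(result)
```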

cldy pushed a commit to cldy/pandas that referenced this issue Feb 11, 2016
closes pandas-dev#12169

Author: Bran Yang <snowolfy@163.com>

Closes pandas-dev#12271 from BranYang/issue12169 and squashes the following commits:

4a5605f [Bran Yang] add tests to Series/test_constructors; and update whatsnew
7cf1be9 [Bran Yang] Fix pandas-dev#12169 - Resample category data with timedelta index