Extra Bin with Pandas Resample in 0.11.0 #4076

Closed
waltaskew opened this issue Jun 28, 2013 · 10 comments · Fixed by #6690
Labels
Bug Resample resample method

@waltaskew

I've got a pandas data frame defined like this, using pandas 0.11.0:

    import datetime

    import pandas

    # 28 consecutive daily timestamps starting 2001-05-04
    last_4_weeks_range = pandas.date_range(
            start=datetime.datetime(2001, 5, 4), periods=28)
    # Two restaurants (REST_KEY 1 and 2), one identical row per day each
    last_4_weeks = pandas.DataFrame(
        [{'REST_KEY': 1, 'DLY_TRN_QT': 80, 'DLY_SLS_AMT': 90,
            'COOP_DLY_TRN_QT': 30, 'COOP_DLY_SLS_AMT': 20}] * 28 +
        [{'REST_KEY': 2, 'DLY_TRN_QT': 70, 'DLY_SLS_AMT': 10,
            'COOP_DLY_TRN_QT': 50, 'COOP_DLY_SLS_AMT': 20}] * 28,
        index=last_4_weeks_range.append(last_4_weeks_range))
    # Sort by index (pandas 0.11.0 API)
    last_4_weeks.sort(inplace=True)

and when I go to resample it:

In [265]: last_4_weeks.resample('7D', how='sum')
Out[265]: 
            COOP_DLY_SLS_AMT  COOP_DLY_TRN_QT  DLY_SLS_AMT  DLY_TRN_QT  REST_KEY
2001-05-04               280              560          700        1050        21
2001-05-11               280              560          700        1050        21
2001-05-18               280              560          700        1050        21
2001-05-25               280              560          700        1050        21
2001-06-01                 0                0            0           0         0

I end up with an extra empty bin, 2001-06-01, that I wouldn't expect to see: my 28 days divide evenly into the 7-day bins I'm resampling to, so there should be exactly four. I've tried messing around with the closed kwarg, but I can't escape that extra bin. This seems like a bug, and it messes up my mean calculations when I try to do

In [266]: last_4_weeks.groupby('REST_KEY').resample('7D', how='sum').mean(level=0)
Out[266]: 
          COOP_DLY_SLS_AMT  COOP_DLY_TRN_QT  DLY_SLS_AMT  DLY_TRN_QT  REST_KEY
REST_KEY                                                                      
1                      112              168          504         448       5.6
2                      112              280           56         392      11.2

as the numbers are being divided by 5 rather than 4. (I also wouldn't expect REST_KEY to show up in the aggregation columns as it's part of the groupby, but that's really a smaller problem.)
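
To make the divide-by-5 effect concrete, here is a minimal sketch in plain Python (no pandas) that reproduces the REST_KEY 1 row above from the per-day values in the frame definition. The helper name `mean_of_weekly_sums` and its `extra_empty_bins` parameter are made up for illustration; the sketch only assumes that the extra bin contributes a sum of 0, as in the `how='sum'` output.

```python
# Per-day values for REST_KEY 1, copied from the frame definition above.
daily = {
    "COOP_DLY_SLS_AMT": 20,
    "COOP_DLY_TRN_QT": 30,
    "DLY_SLS_AMT": 90,
    "DLY_TRN_QT": 80,
}

DAYS_PER_BIN = 7
REAL_BINS = 4  # 28 days / 7 days per bin


def mean_of_weekly_sums(per_day, extra_empty_bins=0):
    """Mean of the weekly sums, optionally padded with all-zero bins.

    Hypothetical helper: it mimics taking the mean over the resampled
    sums, with `extra_empty_bins` standing in for the spurious bin.
    """
    weekly_sum = per_day * DAYS_PER_BIN
    sums = [weekly_sum] * REAL_BINS + [0] * extra_empty_bins
    return sum(sums) / len(sums)


for column, per_day in daily.items():
    expected = mean_of_weekly_sums(per_day)                      # 4 bins
    observed = mean_of_weekly_sums(per_day, extra_empty_bins=1)  # 5 bins
    print(column, expected, observed)
```

With one extra zero bin the values come out as 112, 168, 504 and 448, matching the REST_KEY 1 row in Out[266]; with four bins they would be 140, 210, 630 and 560.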

@waltaskew
Author

This is curiously not the case if I pass how='count' -- no extra bin is returned. This makes me suspect a bug:

In [8]: last_4_weeks.resample('7D', how='count')
Out[8]: 
2001-05-04  COOP_DLY_SLS_AMT    14
            COOP_DLY_TRN_QT     14
            DLY_SLS_AMT         14
            DLY_TRN_QT          14
            REST_KEY            14
2001-05-11  COOP_DLY_SLS_AMT    14
            COOP_DLY_TRN_QT     14
            DLY_SLS_AMT         14
            DLY_TRN_QT          14
            REST_KEY            14
2001-05-18  COOP_DLY_SLS_AMT    14
            COOP_DLY_TRN_QT     14
            DLY_SLS_AMT         14
            DLY_TRN_QT          14
            REST_KEY            14
2001-05-25  COOP_DLY_SLS_AMT    14
            COOP_DLY_TRN_QT     14
            DLY_SLS_AMT         14
            DLY_TRN_QT          14
            REST_KEY            14
dtype: int64

@cpcloud
Member

cpcloud commented Jul 3, 2013

A somewhat related issue in master is that the extra bin no longer contains zeros; it contains garbage values.

This is a bug in how the Python vs. Cythonized methods work; for example, passing a lambda works:

In [5]: last_4_weeks.resample('7D',how=lambda x:mean(x))
Out[5]:
            COOP_DLY_SLS_AMT  COOP_DLY_TRN_QT  DLY_SLS_AMT  DLY_TRN_QT  \
2001-05-04                20               40           50          75
2001-05-11                20               40           50          75
2001-05-18                20               40           50          75
2001-05-25                20               40           50          75

            REST_KEY
2001-05-04       1.5
2001-05-11       1.5
2001-05-18       1.5
2001-05-25       1.5

@ghost ghost assigned cpcloud Jul 3, 2013
@waltaskew
Author

This also seems to act differently with different resample frequencies. With a frequency of 'AS', how='mean' and how='sum' yield the correct answers while the equivalent lambdas (how=lambda x: numpy.mean(x) and how=lambda x: numpy.sum(x)) do not:

In [14]: last_4_weeks.resample('AS', how='mean')
Out[14]: 
            COOP_DLY_SLS_AMT  COOP_DLY_TRN_QT  DLY_SLS_AMT  DLY_TRN_QT  REST_KEY
2001-01-01                20               40           50          75       1.5

In [15]: last_4_weeks.resample('AS', how=lambda x: numpy.mean(x))
Out[15]: 
            COOP_DLY_SLS_AMT  COOP_DLY_TRN_QT  DLY_SLS_AMT  DLY_TRN_QT  REST_KEY
2001-01-01               NaN              NaN          NaN         NaN       NaN

In [16]: last_4_weeks.resample('AS', how='sum')
Out[16]: 
            COOP_DLY_SLS_AMT  COOP_DLY_TRN_QT  DLY_SLS_AMT  DLY_TRN_QT  REST_KEY
2001-01-01              1120             2240         2800        4200        84

In [17]: last_4_weeks.resample('AS', how=lambda x: numpy.sum(x))
Out[17]: 
            COOP_DLY_SLS_AMT  COOP_DLY_TRN_QT  DLY_SLS_AMT  DLY_TRN_QT  REST_KEY
2001-01-01                 0                0            0           0         0

@cpcloud
Member

cpcloud commented Aug 1, 2013

your last example is an issue with NaN handling

@krapfn

krapfn commented Aug 26, 2013

I have also been having issues with resample adding extra bins (also in 0.11.0), and just thought I'd add that I can also see it even when the number of rows is not evenly divisible by the bin width:

>>> x = pandas.DataFrame(numpy.random.randn(9, 3), index=pandas.date_range('2000-1-1', periods=9))
>>> x
                   0         1         2
2000-01-01 -1.191405  0.645320  1.308088
2000-01-02  1.229103 -0.727613  0.488344
2000-01-03  0.885808  1.381995 -0.955914
2000-01-04 -1.013526 -0.225070 -0.163507
2000-01-05  0.670316 -0.828281 -0.233381
2000-01-06  1.357537  1.446020 -0.661463
2000-01-07  0.335799  0.952127  0.591679
2000-01-08 -0.083534  1.025077 -0.146682
2000-01-09 -1.338294  1.919551  0.446385
>>> x.resample('5D')
                   0         1              2
2000-01-01  0.116059  0.049270   8.872589e-02
2000-01-06  0.067877  1.335694   5.747979e-02
2000-01-11  0.591679  0.146682  3.952525e-322

I don't have any particular insight to add, but maybe this extra info will help...
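
For reference, a rough sketch of which bin labels one would expect in both examples in this thread, using only the standard library. It assumes fixed-width bins anchored at the first timestamp and labelled by their left edge, which is what the labels in the outputs above suggest; it is not how pandas computes bins internally, and `expected_bin_labels` is a made-up helper name.

```python
from datetime import date, timedelta


def expected_bin_labels(first, last, width_days):
    """Left-edge labels of the bins that contain at least one day.

    Assumption: bins start at `first`, are `width_days` wide, and a bin
    is only expected if its left edge is not past the last data point.
    """
    labels = []
    edge = first
    while edge <= last:
        labels.append(edge)
        edge += timedelta(days=width_days)
    return labels


# Nine daily rows, 2000-01-01 through 2000-01-09, resampled to '5D'.
nine_days = expected_bin_labels(date(2000, 1, 1), date(2000, 1, 9), 5)
print(nine_days)

# 28 daily rows, 2001-05-04 through 2001-05-31, resampled to '7D'.
four_weeks = expected_bin_labels(date(2001, 5, 4), date(2001, 5, 31), 7)
print(four_weeks)
```

Under these assumptions the '5D' case has two bins (2000-01-01 and 2000-01-06) and the '7D' case has four, so the 2000-01-11 and 2001-06-01 rows in the outputs above both start after the last data point.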

@jreback
Contributor

jreback commented Sep 28, 2013

@cpcloud 0.13 or push?

@cpcloud
Member

cpcloud commented Sep 28, 2013

like to do 0.13 but got a lot on my plate already ... let me see if there's anything else i can push to 0.14 in favor of this

@jreback
Contributor

jreback commented Sep 28, 2013

up2u

@jreback
Contributor

jreback commented Oct 4, 2013

pushing for now...can always pull back!

@cpcloud
Member

cpcloud commented Oct 4, 2013

Ok
