New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Weird results grouping data by day #1580

Closed
danse opened this Issue Jul 7, 2012 · 7 comments

Comments

Projects
None yet
3 participants
@danse

danse commented Jul 7, 2012

Hi I was feeding pandas (0.8.0rc2) with dates and found some errors. The
amounts from following csv file are grouped by date, but the sums for some days
are wrong:

2011-02-02 resulting: 0 correct: 40
2011-08-21 resulting: 3 correct: 133
2012-10-22 resulting: 157 correct: 27

This is the script I am running:

import sys
import pandas

f=pandas.read_csv(sys.stdin, index_col=1, parse_dates=True)

f.sort()
f=f.resample('D', how='sum')
f['amount'] = f['amount'].fillna(0)

f.to_csv(sys.stdout)

Maybe I'm using the time series methods in a wrong way.

The file with data is not too long, it is hosted here: https://raw.github.com/danse/sparkles/master/cleaned.csv

@danse

This comment has been minimized.

danse commented Jul 7, 2012

This is weird because all other sums look correct. Notice that the 2011-08-21 date has an entry out of order in the .csv file, but this is why I sort the frame before resampling.

For the 2012-02-02 entry, the behaviour is as such: the correct value (40) is shifted to the next day, and from that day to the end of the file, all entries are shifted by one day.

@danse

This comment has been minimized.

danse commented Jul 7, 2012

I am using python 3.2.3

@changhiskhan

This comment has been minimized.

Contributor

changhiskhan commented Jul 11, 2012

By default, DataFrame.sort is NOT inplace.

try f.sort().resample('D', how='sum')

@wesm

This comment has been minimized.

Member

wesm commented Jul 11, 2012

This is definitely a bug; the resampling code was not checking for monotonicity (sortedness) in the data, thus the bug. I'm adding a check (and sorting if not), and this problem goes away.

@wesm wesm closed this in 9a8ad65 Jul 11, 2012

@danse

This comment has been minimized.

danse commented Jul 12, 2012

Thanks, using sorting not in place fixed the most of the errors, but the shift in resampling at a same point still remains. I reduced the data file to few rows. Processing the following rows:

amount,date
370,2012-02-01T00:00:00
0,2012-02-01T22:27:48.633911
20,2012-02-02T13:27:24.828871
20,2012-02-02T21:35:41.482386
20,2012-02-04T12:13:37.426859
20,2012-02-04T21:42:11.164113
200,2012-02-05T18:48:24.171116
50,2012-02-06T19:11:07.339103
18,2012-02-07T20:11:59.232420
21,2012-02-11T09:16:25.231366
40,2012-02-11T09:16:44.850074
35,2012-02-13T00:28:00.666619
40,2012-02-13T20:16:40.714301

With the following script:

import sys
import pandas

f=pandas.read_csv(sys.stdin, index_col=1, parse_dates=True)
f=f.sort().resample('D', how='sum')
f['amount'] = f['amount'].fillna(0)
f.to_csv(sys.stdout)

The result is:

,amount
2012-02-01 00:00:00,370.0
2012-02-02 00:00:00,0.0
2012-02-03 00:00:00,40.0
2012-02-04 00:00:00,0.0
2012-02-05 00:00:00,40.0
2012-02-06 00:00:00,200.0
2012-02-07 00:00:00,50.0
2012-02-08 00:00:00,18.0
2012-02-09 00:00:00,0.0
2012-02-10 00:00:00,0.0
2012-02-11 00:00:00,0.0
2012-02-12 00:00:00,61.0
2012-02-13 00:00:00,0.0
2012-02-14 00:00:00,75.0

So from 2012-02-02 all the sums are shifted by one day. I do not have permission to reopen the issue; should I fill in a new one?

@wesm wesm reopened this Jul 12, 2012

@wesm

This comment has been minimized.

Member

wesm commented Jul 12, 2012

I think what you're really looking for is one of:

In [9]: f.sort().resample('D', how='sum', kind='period')
Out[9]: 
             amount
01-Feb-2012     370
02-Feb-2012      40
03-Feb-2012     NaN
04-Feb-2012      40
05-Feb-2012     200
06-Feb-2012      50
07-Feb-2012      18
08-Feb-2012     NaN
09-Feb-2012     NaN
10-Feb-2012     NaN
11-Feb-2012      61
12-Feb-2012     NaN
13-Feb-2012      75

or

In [11]: from pandas.tseries.tools import normalize_date

In [12]: f.groupby(normalize_date).sum()
Out[12]: 
            amount
2012-02-01     370
2012-02-02      40
2012-02-04      40
2012-02-05     200
2012-02-06      50
2012-02-07      18
2012-02-11      61
2012-02-13      75

The thing about the resampling algorithm is that it segments the data by bin edges, then has to assign a label to each bin. This actually gives you what you want as timestamps:

In [16]: f.sort().resample('D', how='sum', label='left', closed='left')
Out[16]: 
            amount
2012-02-01     370
2012-02-02      40
2012-02-03     NaN
2012-02-04      40
2012-02-05     200
2012-02-06      50
2012-02-07      18
2012-02-08     NaN
2012-02-09     NaN
2012-02-10     NaN
2012-02-11      61
2012-02-12     NaN
2012-02-13      75
@danse

This comment has been minimized.

danse commented Jul 12, 2012

Thanks, kind="period" will be perfect. normalize_data is not a solution because I need to fill the gaps. I prefer using kind="period" rather than label left, closed left. I thought that it was a bin issue, but I was misleaded by the fact that the shift happened at a precise point in data. However on original data everything is correct using kind period. Many thanks! :)

@danse danse closed this Jul 12, 2012

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment