Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange Behavior of mean() with timedelta64 #9442

Closed
tjcrone opened this issue Feb 7, 2015 · 6 comments · Fixed by #10926
Closed

Strange Behavior of mean() with timedelta64 #9442

tjcrone opened this issue Feb 7, 2015 · 6 comments · Fixed by #10926
Labels
Bug Timedelta Timedelta data type
Milestone

Comments

@tjcrone
Copy link

tjcrone commented Feb 7, 2015

xref #6549

When a dataframe with a timedelta64 is very large, the mean() function does not work as expected. In the following code, the mean is incorrect but all the other stats are fine:

import numpy as np
import pandas as pd

dates1 = pd.date_range(start='20000101T000000', end='20150101T230000', freq='1H').values
dates2 = dates1 + np.random.randint(0*60*60*1000000000, 10*24*60*60*1000000000, len(dates1))
df = pd.DataFrame({'dates1':dates1, 'dates2':dates2})
df['tdiff'] = dates2-dates1
df['fdiff'] = df['tdiff'].apply(lambda x: float(x.item())/1000000000/3600/24)
df.describe()

By making the length smaller, by changing the start date in the above example:

dates1 = pd.date_range(start='20140101T000000', end='20150101T230000', freq='1H').values

The correct result is obtained. (The mean of the timedelta should be about 5 days.) Is this an open bug?

Version:

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 13.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: C
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.21.2
numpy: 1.9.1
scipy: 0.15.1
statsmodels: None
IPython: 2.3.1
sphinx: 1.2.3
patsy: None
dateutil: 2.4.0
pytz: 2014.10
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None
@shoyer
Copy link
Member

shoyer commented Feb 7, 2015

@tjcrone please add an example of the output you see from a command and what exactly you expected.

@tjcrone
Copy link
Author

tjcrone commented Feb 7, 2015

When the dataframe is very large, in the case of the first example, the output is:

In [8]: df.describe()
Out[8]: 
| tdiff | fdiff
count | 131520 | 131520.000000
mean | 0 days 03:14:29.151921 | 5.005132
std | 2 days 21:21:04.522906 | 2.889636
min | 0 days 00:00:03.949751 | 0.000046
25% | 2 days 12:09:09.231888 | 2.506357
50%  | 5 days 00:27:26.602058  | 5.019058
75% | 7 days 12:12:39.632537 | 7.508792
max | 9 days 23:59:55.066284 | 9.999943

I expect the mean in tdiff to mirror fdiff. When the dataframe is not large we get:

Out[15]: 
 |  tdiff  |  fdiff
count | 8784 | 8784.000000
mean | 4 days 22:56:31.430177 | 4.955919
std | 2 days 21:47:46.782711 | 2.908180
min | 0 days 00:00:01.128115 | 0.000013
25%  | 2 days 10:26:07.091579 | 2.434804
50% | 4 days 21:41:41.835346 | 4.903956
75% | 7 days 11:48:15.255344 | 7.491843
max  | 9 days 23:57:21.197158 |  9.998162

Which is the expected result. The second column is calculated by taking the seconds and converting them to floats. I believe that the timedelta64 object being used in the mean function may be overflowing.

@shoyer shoyer added Bug Timedelta Timedelta data type labels Feb 7, 2015
@shoyer
Copy link
Member

shoyer commented Feb 7, 2015

Ah, yes. This does look like an overflow issue.

In fact, we do not allow aggregations like mean for datetime64 objects for exactly these sort of reasons:

In [19]: df.dates1.mean()
TypeError: reduction operation 'mean' not allowed for this dtype

In this case, it looks like we have only inconsistently disabled it for timedelta64. Compare:

In [21]: df.describe()
Out[21]:
                        tdiff          fdiff
count                  131520  131520.000000
mean   0 days 02:46:03.645599       4.985393
std    2 days 21:10:21.517002       2.882193
min    0 days 00:00:12.732912       0.000147
25%    2 days 11:41:42.032980       2.487292
50%    4 days 23:23:59.916127       4.974999
75%    7 days 11:28:44.708439       7.478295
max    9 days 23:59:52.538085       9.999914

In [22]: df.mean()
Out[22]:
fdiff    4.985393
dtype: float64

So the immediate consistency fix would be to make aggregation on timedelta64 objects always raises an error/skips. In the long term, it would be nice to fully support appropriate aggregation operations for datetime and timedelta types. Let me if you're interested in working on that...

@tjcrone
Copy link
Author

tjcrone commented Feb 8, 2015

You showed that the mean of a datetime64 object is not returned. That seems appropriate. With timedelta64 objects, an incorrect value is returned when calling mean directly. I don't see any inconsistencies, just incorrect values:

In [2]: df.tdiff.mean()
Out[2]:
Timedelta('0 days 02:40:53.248336')

What is especially strange is that std() returns the correct value, because std() should require the correct mean value:

In [4]: df.tdiff.std()
Out[4]:
Timedelta('2 days 21:06:29.824063')

A correct calculation of the mean for timedeltas is a one-liner. Please feel free to incorporate:

In [12]: pd.to_timedelta(df.tdiff.apply(lambda x: float(x.item())).sum()/len(df.tdiff.index))
Out[12]:
Timedelta('4 days 23:33:47.520090')

@jreback
Copy link
Contributor

jreback commented Feb 8, 2015

the std is a marginal algo that will not overflow, mean however is simpy sum()/n, and sum() does overflow.

This is because precision is at the ns level. and when summing these overflow is pretty easy. To work-around you can change the precision and it will work.

pd.Timedelta(df['tdiff'].astype('timedelta64[ms]').mean(),unit='ms')
Out[21]: Timedelta('5 days 00:15:58.979021')

@tjcrone you generally do not want to use .apply if there are vectorized operations available.

@jreback
Copy link
Contributor

jreback commented Feb 8, 2015

@tjcrone if you are interested in doing a pull-request. Would be ok with doing something like the above (its actually easy as you don't have to worry about conversions, just divide the i8 by say 1e6 before and after mean), in core.internals.nansum only for the i8 (maybe look at the other functions to see if they need this fix)

@jreback jreback added this to the 0.16.1 milestone Mar 17, 2015
@jreback jreback modified the milestones: 0.17.0, 0.16.1 Apr 28, 2015
@jreback jreback modified the milestones: Next Major Release, 0.17.0 Aug 15, 2015
@jreback jreback modified the milestones: 0.17.0, Next Major Release Aug 29, 2015
jreback added a commit to jreback/pandas that referenced this issue Aug 29, 2015
jreback added a commit that referenced this issue Aug 29, 2015
BUG: Bug in incorrection computation of .mean() on timedelta64[ns] because of overflow #9442
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Timedelta Timedelta data type
Projects
None yet
3 participants