Strange Behavior of mean() with timedelta64 #9442

tjcrone · 2015-02-07T19:07:25Z

When a dataframe with a timedelta64 is very large, the mean() function does not work as expected. In the following code, the mean is incorrect but all the other stats are fine:

import numpy as np
import pandas as pd

dates1 = pd.date_range(start='20000101T000000', end='20150101T230000', freq='1H').values
dates2 = dates1 + np.random.randint(0*60*60*1000000000, 10*24*60*60*1000000000, len(dates1))
df = pd.DataFrame({'dates1':dates1, 'dates2':dates2})
df['tdiff'] = dates2-dates1
df['fdiff'] = df['tdiff'].apply(lambda x: float(x.item())/1000000000/3600/24)
df.describe()

By making the length smaller, by changing the start date in the above example:

dates1 = pd.date_range(start='20140101T000000', end='20150101T230000', freq='1H').values

The correct result is obtained. (The mean of the timedelta should be about 5 days.) Is this an open bug?

Version:

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 13.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: C
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.21.2
numpy: 1.9.1
scipy: 0.15.1
statsmodels: None
IPython: 2.3.1
sphinx: 1.2.3
patsy: None
dateutil: 2.4.0
pytz: 2014.10
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

shoyer · 2015-02-07T21:51:11Z

@tjcrone please add an example of the output you see from a command and what exactly you expected.

tjcrone · 2015-02-07T22:09:40Z

When the dataframe is very large, in the case of the first example, the output is:

In [8]: df.describe()
Out[8]: 
| tdiff | fdiff
count | 131520 | 131520.000000
mean | 0 days 03:14:29.151921 | 5.005132
std | 2 days 21:21:04.522906 | 2.889636
min | 0 days 00:00:03.949751 | 0.000046
25% | 2 days 12:09:09.231888 | 2.506357
50%  | 5 days 00:27:26.602058  | 5.019058
75% | 7 days 12:12:39.632537 | 7.508792
max | 9 days 23:59:55.066284 | 9.999943

I expect the mean in tdiff to mirror fdiff. When the dataframe is not large we get:

Out[15]: 
 |  tdiff  |  fdiff
count | 8784 | 8784.000000
mean | 4 days 22:56:31.430177 | 4.955919
std | 2 days 21:47:46.782711 | 2.908180
min | 0 days 00:00:01.128115 | 0.000013
25%  | 2 days 10:26:07.091579 | 2.434804
50% | 4 days 21:41:41.835346 | 4.903956
75% | 7 days 11:48:15.255344 | 7.491843
max  | 9 days 23:57:21.197158 |  9.998162

Which is the expected result. The second column is calculated by taking the seconds and converting them to floats. I believe that the timedelta64 object being used in the mean function may be overflowing.

shoyer · 2015-02-07T22:27:27Z

Ah, yes. This does look like an overflow issue.

In fact, we do not allow aggregations like mean for datetime64 objects for exactly these sort of reasons:

In [19]: df.dates1.mean()
TypeError: reduction operation 'mean' not allowed for this dtype

In this case, it looks like we have only inconsistently disabled it for timedelta64. Compare:

In [21]: df.describe()
Out[21]:
                        tdiff          fdiff
count                  131520  131520.000000
mean   0 days 02:46:03.645599       4.985393
std    2 days 21:10:21.517002       2.882193
min    0 days 00:00:12.732912       0.000147
25%    2 days 11:41:42.032980       2.487292
50%    4 days 23:23:59.916127       4.974999
75%    7 days 11:28:44.708439       7.478295
max    9 days 23:59:52.538085       9.999914

In [22]: df.mean()
Out[22]:
fdiff    4.985393
dtype: float64

So the immediate consistency fix would be to make aggregation on timedelta64 objects always raises an error/skips. In the long term, it would be nice to fully support appropriate aggregation operations for datetime and timedelta types. Let me if you're interested in working on that...

tjcrone · 2015-02-08T13:40:46Z

You showed that the mean of a datetime64 object is not returned. That seems appropriate. With timedelta64 objects, an incorrect value is returned when calling mean directly. I don't see any inconsistencies, just incorrect values:

In [2]: df.tdiff.mean()
Out[2]:
Timedelta('0 days 02:40:53.248336')

What is especially strange is that std() returns the correct value, because std() should require the correct mean value:

In [4]: df.tdiff.std()
Out[4]:
Timedelta('2 days 21:06:29.824063')

A correct calculation of the mean for timedeltas is a one-liner. Please feel free to incorporate:

In [12]: pd.to_timedelta(df.tdiff.apply(lambda x: float(x.item())).sum()/len(df.tdiff.index))
Out[12]:
Timedelta('4 days 23:33:47.520090')

jreback · 2015-02-08T13:53:43Z

the std is a marginal algo that will not overflow, mean however is simpy sum()/n, and sum() does overflow.

This is because precision is at the ns level. and when summing these overflow is pretty easy. To work-around you can change the precision and it will work.

pd.Timedelta(df['tdiff'].astype('timedelta64[ms]').mean(),unit='ms')
Out[21]: Timedelta('5 days 00:15:58.979021')

@tjcrone you generally do not want to use .apply if there are vectorized operations available.

jreback · 2015-02-08T13:55:45Z

@tjcrone if you are interested in doing a pull-request. Would be ok with doing something like the above (its actually easy as you don't have to worry about conversions, just divide the i8 by say 1e6 before and after mean), in core.internals.nansum only for the i8 (maybe look at the other functions to see if they need this fix)

…cause of overflow pandas-dev#9442

BUG: Bug in incorrection computation of .mean() on timedelta64[ns] because of overflow #9442

shoyer added Bug Timedelta Timedelta data type labels Feb 7, 2015

jreback added this to the 0.16.1 milestone Mar 17, 2015

This was referenced Mar 17, 2015

Overflow adding pd.timedelta to pd.Timestamp #6549

Closed

Numerically unstable mean calculation for Timedeltas. #9670

Closed

jreback added Difficulty Intermediate labels Apr 4, 2015

jreback modified the milestones: 0.17.0, 0.16.1 Apr 28, 2015

jreback mentioned this issue May 1, 2015

Median returns odd value when applied to timedeltas #10040

Closed

jreback mentioned this issue May 18, 2015

Mean overflows for integer dtypes #10155

Closed

jreback modified the milestones: Next Major Release, 0.17.0 Aug 15, 2015

jreback modified the milestones: 0.17.0, Next Major Release Aug 29, 2015

jreback added a commit to jreback/pandas that referenced this issue Aug 29, 2015

BUG: Bug in incorrection computation of .mean() on timedelta64[ns] be…

1da4a6d

…cause of overflow pandas-dev#9442

jreback mentioned this issue Aug 29, 2015

BUG: Bug in incorrection computation of .mean() on timedelta64[ns] because of overflow #9442 #10926

Merged

jreback closed this as completed in #10926 Aug 29, 2015

jreback added a commit that referenced this issue Aug 29, 2015

Merge pull request #10926 from jreback/timedelta_mean

ba0704f

BUG: Bug in incorrection computation of .mean() on timedelta64[ns] because of overflow #9442

jorisvandenbossche mentioned this issue Jan 31, 2019

implement+test mean for datetimelike EA/Index/Series #24757

Merged

4 tasks

patrickmckenna mentioned this issue Apr 23, 2022

Add more overflow tests for timedelta64 operations #46854

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Strange Behavior of mean() with timedelta64 #9442

Strange Behavior of mean() with timedelta64 #9442

tjcrone commented Feb 7, 2015

shoyer commented Feb 7, 2015

tjcrone commented Feb 7, 2015

shoyer commented Feb 7, 2015

tjcrone commented Feb 8, 2015

jreback commented Feb 8, 2015

jreback commented Feb 8, 2015

Strange Behavior of mean() with timedelta64 #9442

Strange Behavior of mean() with timedelta64 #9442

Comments

tjcrone commented Feb 7, 2015

shoyer commented Feb 7, 2015

tjcrone commented Feb 7, 2015

shoyer commented Feb 7, 2015

tjcrone commented Feb 8, 2015

jreback commented Feb 8, 2015

jreback commented Feb 8, 2015