BUG: df.apply handles np.timedelta64 as timestamp, should be timedelta #7778

Closed
stharrold opened this Issue Jul 17, 2014 · 7 comments

I think there may be a bug with the row-wise handling of numpy.timedelta64 data types when using DataFrame.apply. As a check, the problem does not appear when using DataFrame.applymap. The problem may be related to #4532, but I'm unsure. I've included an example below.

This is only a minor problem for my use-case, which is cross-checking timestamps from a counter/timer card. I can easily work around the issue with DataFrame.itertuples etc.
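A minimal sketch of that workaround, assuming a recent pandas release. The start timestamp is a simplified version of the one in the example below, and the variable names are illustrative only. `DataFrame.itertuples` reads each column with its own dtype, so the timedelta column is not coerced to a timestamp:

```python
import pandas as pd

# Simplified stand-ins for the test data used in the example below.
start = pd.Timestamp("2014-05-31 01:23:19.960034")
elapsed_us = [30053400, 40053249, 50053098]

df = pd.DataFrame({"datetimes": start + pd.to_timedelta(elapsed_us, unit="us")})
df["differential_timedeltas"] = df["datetimes"] - df["datetimes"].shift()

# itertuples yields one namedtuple per row; each field keeps the
# scalar type of its own column (timestamp vs. timedelta).
values = [row.differential_timedeltas for row in df.itertuples(index=False)]
for value in values:
    print(type(value).__name__, value)
```

The first row is a missing value because `shift()` leaves no earlier timestamp to subtract.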

Thank you for your time and for making such a useful package!

Example

Version

Import and check versions.

$ date
Thu Jul 17 16:28:38 CDT 2014
$ conda update pandas
Fetching package metadata: ..
# All requested packages already installed.
# packages in environment at /Users/harrold/anaconda:
#
pandas                    0.14.1               np18py27_0  
$ ipython
Python 2.7.8 |Anaconda 2.0.1 (x86_64)| (default, Jul  2 2014, 15:36:00) 
Type "copyright", "credits" or "license" for more information.

IPython 2.1.0 -- An enhanced Interactive Python.
Anaconda is brought to you by Continuum Analytics.
Please check out: http://continuum.io/thanks and https://binstar.org
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from __future__ import print_function

In [2]: import numpy as np

In [3]: import pandas as pd

In [4]: pd.util.print_versions.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.8.final.0
python-bits: 64
OS: Darwin
OS-release: 11.4.2
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.1
nose: 1.3.3
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.1.0
sphinx: 1.2.2
patsy: 0.2.1
scikits.timeseries: None
dateutil: 1.5
pytz: 2014.4
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: 0.999
httplib2: 0.8
apiclient: 1.2
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None

Create test data

Using subset of original raw data as example.

In [5]: datetime_start = np.datetime64(u'2014-05-31T01:23:19.9600345Z')

In [6]: timedeltas_elapsed = [30053400, 40053249, 50053098]

Compute datetimes from elapsed timedeltas, then create differential timedeltas from datetimes. All elements are either type numpy.datetime64 or numpy.timedelta64.

In [7]: df = pd.DataFrame(dict(datetimes = timedeltas_elapsed))

In [8]: df = df.applymap(lambda elt: np.timedelta64(elt, 'us'))

In [9]: df = df.applymap(lambda elt: np.datetime64(datetime_start + elt))

In [10]: df['differential_timedeltas'] = df['datetimes'] - df['datetimes'].shift()

In [11]: print(df)
                      datetimes  differential_timedeltas
0 2014-05-31 01:23:50.013434500                      NaT
1 2014-05-31 01:24:00.013283500          00:00:09.999849
2 2014-05-31 01:24:10.013132500          00:00:09.999849

Expected behavior

With element-wise handling using DataFrame.applymap, all elements are correctly identified as datetimes (timestamps) or timedeltas.

In [12]: print(df.applymap(lambda elt: type(elt)))
                          datetimes     differential_timedeltas
0  <class 'pandas.tslib.Timestamp'>  <type 'numpy.timedelta64'>
1  <class 'pandas.tslib.Timestamp'>  <type 'numpy.timedelta64'>
2  <class 'pandas.tslib.Timestamp'>  <type 'numpy.timedelta64'>

Bug

With row-wise handling using DataFrame.apply, all elements are type pandas.tslib.Timestamp. I expected 'differential_timedeltas' to be type numpy.timedelta64 or another type of timedelta, not a type of datetime (timestamp).

In [13]: # For 'datetimes':

In [14]: print(df.apply(lambda row: type(row['datetimes']), axis=1))
0    <class 'pandas.tslib.Timestamp'>
1    <class 'pandas.tslib.Timestamp'>
2    <class 'pandas.tslib.Timestamp'>
dtype: object

In [15]: # For 'differential_timedeltas':

In [16]: print(df.apply(lambda row: type(row['differential_timedeltas']), axis=1))
0      <class 'pandas.tslib.NaTType'>
1    <class 'pandas.tslib.Timestamp'>
2    <class 'pandas.tslib.Timestamp'>
dtype: object

Contributor

jreback commented Jul 17, 2014

actually this is a symptom of a more insidious issue.

try df.values. This is the 'common dtype' for the frame; unfortunately it's wrong: it should be object, and NOT datetime64[ns]. So some logic is messed up here: https://github.com/pydata/pandas/blob/master/pandas/core/internals.py#L3673
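A quick way to inspect that common dtype, sketched under the assumption of a recent pandas release (the frame is rebuilt here with a simplified start timestamp, so the names are illustrative). On a release that includes the fix, a frame mixing a datetime column and a timedelta column would be expected to interleave to object:

```python
import pandas as pd

# Rebuild a frame with one datetime column and one timedelta column.
start = pd.Timestamp("2014-05-31 01:23:19.960034")
elapsed_us = [30053400, 40053249, 50053098]

df = pd.DataFrame({"datetimes": start + pd.to_timedelta(elapsed_us, unit="us")})
df["differential_timedeltas"] = df["datetimes"] - df["datetimes"].shift()

# Per-column dtypes are distinct: one datetime64, one timedelta64.
print(df.dtypes)

# df.values interleaves all columns into a single array, so it needs
# one dtype that can hold both kinds of element.
common = df.values
print(common.dtype)
```

Row-wise `apply` builds each row from this interleaved array, which is why a wrong common dtype shows up as wrong element types.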

I think it needs a tiny tweak (and test cases of course!).
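One possible shape for such a test case, as a rough sketch only: this is not the test from the pandas suite, `row_types` is a hypothetical helper, and the data is a simplified copy of the report's example. It asserts that row-wise `apply` never hands back a timestamp for the timedelta column:

```python
import pandas as pd


def row_types(frame, column):
    """Return the element type seen for `column` in each row under apply(axis=1)."""
    return frame.apply(lambda row: type(row[column]), axis=1).tolist()


start = pd.Timestamp("2014-05-31 01:23:19.960034")
elapsed_us = [30053400, 40053249, 50053098]

df = pd.DataFrame({"datetimes": start + pd.to_timedelta(elapsed_us, unit="us")})
df["differential_timedeltas"] = df["datetimes"] - df["datetimes"].shift()

datetime_types = row_types(df, "datetimes")
timedelta_types = row_types(df, "differential_timedeltas")

# The datetime column should still come back as timestamps ...
assert all(issubclass(t, pd.Timestamp) for t in datetime_types)
# ... while the timedelta column must never be reported as a timestamp.
assert not any(issubclass(t, pd.Timestamp) for t in timedelta_types)
```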

interested in a pull-request?

jreback added this to the 0.15.0 milestone Jul 17, 2014

Thanks for the fast response!

I'll give it a go, but as a newbie this may be over my head to do in a time-efficient manner. Guess I'll start reading through http://pandas.pydata.org/developers.html ...

I'll reply to this thread when I have some progress on the patch. Thanks again.

Member

cpcloud commented Jul 17, 2014

If you have questions along the way, don't hesitate to ask; we don't bite

Contributor

jreback commented Jul 17, 2014

@stharrold turns out this was a bit non-trivial (and needed extra testing). fixed in #7779

timedelta == need_a_scalar_type! (though just using timedelta itself, so it's ok)

@jreback Wow, that's quite a patch! Thank you! I'm glad I could help with the bug report.

Contributor

jreback commented Jul 18, 2014

@stharrold thanks....found a couple of other odd conversions at the same time

jreback closed this in #7779 Jul 18, 2014

Contributor

jreback commented Jul 18, 2014

@stharrold thanks for the report!
