Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/ENH: cleanup for Timestamp arithmetic #8916

Closed
wants to merge 4 commits into from

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented Nov 28, 2014

Fixes #8865 (Timestamp - Timestamp -> Timedelta)

This PR cleans up and extends Timestamp arithmetic similarly to my treatment for Timedelta in #8884.

It includes a new to_datetime64() method, and arithmetic now works between Timestamp and ndarrays. I also ensure comparison operations work properly between all of (Timestamp, Timedelta, NaT) and ndarrays.

Implementation notes: wide use of the NotImplemented singleton let me cleanup some complex logic. I also strove to reduce the tight-coupling of Timestamp/Timedelta to pandas itself by removing use of the _typ property in tslib (I honestly don't quite understand why it needs to exist) and by not treating series/index any differently from any other ndarray-like objects.

CC @jreback @jorisvandenbossche @immerrr

Fixes GH8865 (Timestamp - Timestamp -> Timedelta)

This PR cleans up and extends `Timestamp` arithmetic similarly to the
treatment for `Timedelta` in GH8884.

It includes a new `to_datetime64()` method, and arithmetic now works between
Timestamp and ndarrays. I also ensured comparison operations work properly
between all of (Timestamp, Timedelta, NaT) and ndarrays.

Implementation notes: wide use of the `NotImplemented` singleton let me
cleanup many of these complex cases. I also strove to reduce the tight-
coupling of `Timestamp`/`Timedelta` to pandas itself by removing use of the
`_typ` property in tslib (I honestly don't quite understand why it needs to
exist) and by not treating series/index any differently from any other
ndarray-like object.
@jreback jreback added Bug Timedelta Timedelta data type Timeseries labels Nov 28, 2014
@jreback jreback added this to the 0.15.2 milestone Nov 28, 2014
@jreback
Copy link
Contributor

jreback commented Nov 28, 2014

So _typ is needed if you need type comparisons (isinstance) in cython when you can't import things (well you can, but you don' generally want to), e.g. TimedeltaIndex and such.

However you eliminated the need for much of that, so +1 (though I think you went too far and now accept clear wrong types, e.g. dti + dti which must raise a TypeError)

Need to check return types on a matrix of operations, they should all be pandas types (Timestamp/Timedelta/TimedeltaIndex/DateIndex), when operated with pandas types (e.g. a np.datetime64 + np.timedelta64) obviously will yield a np.datetime64

So need to wrap all return ops (which is a pass thru if its the right type already). The only wrinkle is that you need to determine array/scalars (I think simple enough to detect if one of the operands is an array-like (e.g. has a dtype), then you return DatetimeIndex/TimedeltaIndex).

I think this needs a systematic test (their are some in tseries/test_base, so can prob just add on
And fully-compare e.g. don't just compare the values, but test the types as well

ts = Timestamp('20130101')
In [5]: ts-pd.to_datetime(['1999-12-31']).values
Out[5]: array([86400000000000], dtype='timedelta64[ns]')

This must raise a TypeError. It is a clear operation that should simply not be allowed. This is the point of a TypeError.

In [6]: pd.date_range('20130101',periods=3)+ts                                 
NotImplementedError: 

Furthermore Series / Timedelta / Timestamp ops need to be consistent with these. So need to matrix tests these (these raise the appropriate TypeErrors). I am still big -1 on ALWAYS raising NotImplementedError. This is just not very useful. So i'd like you to enumerate all the cases so they can be discussed (e.g. td + tdi, td - tdi, ts + tdi)....etc. Indicate where you would put NotImplementedError and TypeError, etc.

We still have the side issue of:

dti+dti, dti-dti 'work' but are set operations. need to think about this again.

@jreback
Copy link
Contributor

jreback commented Nov 28, 2014

numpy does this and its completely useless. It should at the very least raise an error. I think this is just broken.

In [12]: pd.date_range('20130101',periods=3).values+ts.value
Out[12]: 
array(['2056-01-01T19:00:00.000000000-0500',
       '2056-01-02T19:00:00.000000000-0500',
       '2056-01-03T19:00:00.000000000-0500'], dtype='datetime64[ns]')

@shoyer
Copy link
Member Author

shoyer commented Nov 28, 2014

Travis was being very slow last night so I posted this PR anyways, but it looks like I still have a few fix-ups to do.

My intention was not to break any existing behavior with this change -- may need some more tests to ensure that. I don't think I changed dti + dti or changed the code pathways for any (1d, 1d) operations. I was intentionally sticking to only stuff involving a scalar.

I actually don't think we should ever raise NotImplementedError in arithmetic or comparison operations (that is old code, maybe the path was not exposed before). NotImplemented is totally distinct though, and manifests itself to users via a helpful TypeError.

Agreed RE the numpy bug. You'll notice I added explicit TypeErrors in tslib.pyx to avoid those.

@jreback
Copy link
Contributor

jreback commented Dec 3, 2014

when you have a chance can you rebase

@shoyer
Copy link
Member Author

shoyer commented Dec 3, 2014

This change turns out to be much trickier to get right than I initially expected, due to a combination of factors:

  1. Timestamp optionally supports timezones and offests, which cannot be represented in datetime64 objects.
  2. Thus, we need to support (Timestamp, array of Timestamp) operations (e.g., for tz aware Timestamps or for comparisons to NaT). The easiest way to do this is to let numpy ndararys do their automatic broadcasting thing.
  3. OTOH, to do vectorized operations, setting a higher array priority is necessary (this conflicts with automatic broadcasting).
  4. np.datetime64[ns] will silently convert to int64 under some conditions due to the numpy bugs (sigh), breaking some of the obvious solutions.
  5. DataFrame operations with dates are already sort of broken, but I would like to avoid making them worse

So I'm not sure I'll be able to get this ready in time for 0.15.2 this weekend.

All this makes me really look forward to switching to dynd for handling datetime arrays, instead of np.datetime64. datetime64 can't be relied on as much more than a boxed array of integers (actually, sometime I wonder if that would have been an easier choice...).

@jreback jreback modified the milestones: 0.16.0, 0.15.2 Dec 3, 2014
@jreback
Copy link
Contributor

jreback commented Dec 3, 2014

@shoyer I don't think we'll be switching for datetime to dynd anytime soon. These these are not implemented their either. Theoretically yes we will be able to as the broadcasting will be correct with tz's, but practically the implementer (pandas here needs to actually be aware) and potentially do things. So yes dynd could help with this. But this is very close to being correct now w/o using much numpy and avoiding the pitfalls.

@shoyer
Copy link
Member Author

shoyer commented Dec 4, 2014

@jreback I am going to make a separate PR to get some of the easy changes from here into 0.15.2 (e.g., adding the to_datetime64 method). You already fixed Timestamp - Timestamp in #8981.

@jreback
Copy link
Contributor

jreback commented Dec 4, 2014

sounds good.

@shoyer
Copy link
Member Author

shoyer commented Dec 9, 2014

Closing -- this needs a reboot/refactor

@shoyer shoyer closed this Dec 9, 2014
@shoyer shoyer removed this from the 0.16.0 milestone Dec 9, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Timestamp - Timestamp should be pd.Timedelta, not datetime.timedelta
2 participants