Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timestamp() is wrong if np.datetime64 < 1678-01-01 #4065

Closed
timcera opened this issue Jun 28, 2013 · 23 comments · Fixed by #4926
Closed

Timestamp() is wrong if np.datetime64 < 1678-01-01 #4065

timcera opened this issue Jun 28, 2013 · 23 comments · Fixed by #4926
Labels
API Design Datetime Datetime data dtype
Milestone

Comments

@timcera
Copy link
Contributor

timcera commented Jun 28, 2013

The following should work, but since it doesn't should raise an error...

In [116]: pd.Timestamp(np.datetime64('1677-01-01'),'D')
Out[116]: Timestamp('2261-07-22 23:34:33.709551616', tz=None)

In [117]: pd.Timestamp(np.datetime64('1678-01-01'),'D')
Out[117]: Timestamp('1678-01-01 00:00:00', tz=None)

In [121]: pd.version
Out[121]: '0.11.1.dev-06cd915'

In [125]: sys.version
Out[125]: '3.3.2 (default, May 21 2013, 15:40:45) \n[GCC 4.8.0 20130502 (prerelease)]'

@SleepingPills
Copy link
Contributor

I'll try to tackle this since I already mucked around with Timestamp.

@jreback
Copy link
Contributor

jreback commented Jul 2, 2013

great!

@danbirken
Copy link
Contributor

The heart of the bug was not doing a _check_dts_bounds() in _get_datetime64_nanos when converting from a datetime64 that wasn't in [ns] units.

But fixing this lead to another issue: the way _check_dts_bounds() worked previously, it did a broad check on the validity of a pandas_datetimestruct.year, and then assuming that triggered, it did a more fine grained check using the integer value of the given datetime64. However, if you have a datetime64 in another unit, the integer value isn't useful for comparison (because by definition it will map into a valid datetime64 somewhere in the [ns] range), so you don't really know if it is valid or not.

So if you have a datetime64 that might be any unit, the only way to verify if it is in bounds is by just looking at the pandas_datetimestruct and just seeing if the values are in bounds or not. So I changed _check_dts_bounds() to do bounds checking this way.

@jreback
Copy link
Contributor

jreback commented Sep 21, 2013

can you add some tests with to_datetime (which ultimtately call s array_to_datetime) with a mix of np.datetime64 that have some valid, some invalid dtypes? (like you are doing with Timestamp, but with an array of them); the code paths are different, but should yield the same results

e.g.

result1 = np.array([ Timestamp(v) for v in test_values ])
result2 = pd.to_datetime(test_values,box=False)

assume several cases, test_values are valid (with various types of dtypes), and another case of an invalid value

thxs..

this is great stuff getting all of these bugs out of the way!

@danbirken
Copy link
Contributor

Yeah this is a good point, I need to do a little more work here.

When you have a situation like:

In [10]: pd.to_datetime(['1000-01-01', '1000-01-02'])
Out[10]: array(['1000-01-01', '1000-01-02'], dtype=object)

You give it strings, it gives back strings because it can't convert them to valid datetime64[ns]-es because they are out of bounds.

Currently, if you give it datetime64s it can't deal with, it just maps them randomly into the datetime64[ns] range:

In [8]: pd.to_datetime([np.datetime64('1000-01-01')], box=False)
Out[8]: array(['2169-02-08T15:09:07.419103232-0800'], dtype='datetime64[ns]')

This is bad, but at least it has 2 nice properties: it doesn't throw errors and it always returns datetime64[ns]s.

So I think there are a few potential solutions. First of all, in all cases if coerce=True, we just convert the out-of-bounds dt64s into NaTs, so we'll ignore that case:

a) pd.to_datetime([np.datetime64('1000-01-01')], box=False) throws an error, pandas can't deal with it

b) pd.to_datetime([np.datetime64('1000-01-01')], box=False) returns back an array of datetime64[whatever]s. This will probably lead to weird situations elsewhere because I think a lot of code assumes it is dealing with datetime64[ns]es. I just did a quick test and if pd.to_datetime() returns the datetime64[whatevers]s right back, some other aspect of the code maps them to datetime64[ns]es. I didn't investigate too deeply because I assume it would be a big deal, if not pandas would just support arbitrary units and it doesn't.

c) pd.to_datetime([np.datetime64('1000-01-01')], box=False) returns back an array of strings ['1000-01-01']. IE, if we get an out-of-bounds datetime64, just convert it to a string and return that.

I like solution (c). Even though it is kind of surprising, it generally does the right thing and will make sure developers don't have to worry about non datetime64[ns]-es cropped up all over the place in these edge cases. Then in the future if pandas ever supports datetime64s of various units, it will be easy delete the code which converts the dt64s to strings and just return them.

Though I'm amenable to any of the above solutions or a different solution, no hugely strong feelings.

Also if this is confusing or I am missing something obvious, please let me know. I'm pretty new to this section of the code.

@jreback
Copy link
Contributor

jreback commented Sep 22, 2013

@danbirken

c) is the current/right way, but with one additional check

In [6]: x = np.datetime64('1000-01-01')

In [7]: x
Out[7]: numpy.datetime64('1000-01-01')

In [8]: x.dtype
Out[8]: dtype('<M8[D]')

There is a bug lurking that allows a np.datetime64 < ns range in on conversion (either the bounds are not checked or the conversion is wrong)

This should raise be NaT if coerce=True, or raise the bounds error (which ALWAYS raises)

better to let the user know that this is not an acceptable value

side issue is whether their are raises if coerce=True and the date is out of bounds (I don't know what it does currently, my thinking is that if you pass coerce=Truethen just make it NaTeven if out-of-bounds)

@jreback
Copy link
Contributor

jreback commented Sep 22, 2013

@danbirken another related issue: #3944

@jreback
Copy link
Contributor

jreback commented Sep 22, 2013

related is #4066 as well

@danbirken
Copy link
Contributor

Alright, I'm a little confused as to the exact interface you are looking for and while digging through the code and all of these related issues i've found a few more potential fixes, so here are some clarification statements. Let me know if any of these are wrong:

  1. pd.to_datetime(value) == pd.to_datetime([value])[0] should be true in all cases (there are some cases right now where this isn't true)

  2. pd.to_datetime('1000-01-01', errors='ignore', box=False) should return '1000-01-01'

  3. pd.to_datetime('1000-01-01', errors='raise', box=False) should throw ValueError

  4. pd.to_datetime(np.datetime64('1000-01-01'), errors='ignore', box=False) should return '1000-01-01' --- ie, datetime64 converted to a string because it can't be represented as a datetime64[ns]

  5. pd.to_datetime(np.datetime64('1000-01-01'), errors='raise', box=False) should throw ValueError

  6. pd.to_datetime('1000-01-01', coerce=True) should return a NaT

  7. pd.to_datetime(np.datetime64('1000-01-01'), coerce=True) should return a NaT

@jreback
Copy link
Contributor

jreback commented Sep 23, 2013

can you explain 4 again?

when is 1) not true?

all others look good

@danbirken
Copy link
Contributor

For 1)

In [8]: pd.to_datetime([np.datetime64('2000-01-01')], box=False)[0].dtype
Out[8]: dtype('<M8[ns]')

In [9]: pd.to_datetime(np.datetime64('2000-01-01'), box=False).dtype
Out[9]: dtype('<M8[D]')

IE pd.to_datetime() will not fix the datetime64[] range type if it isn't wrapped in an array. It is a simple fix, I'm just double checking the behavior.

For 4)

It is basically this case:

import pandas as pd
import numpy as np

ix = [np.datetime64('1000-01-01', 'D'), np.datetime64('1000-01-02', 'D')]
df = pd.DataFrame({'col1': [1, 2]}, index=ix)
print df

This is the current behavior:

                            col1
2169-02-08 23:09:07.419103     1
2169-02-09 23:09:07.419103     2

as far as I see, this could:

a) throw an error

b) don't change it, let the current behavior continue

c) convert the datetime64s to strings (or maybe datetime.datetimes?), so you get this output:

            col1
1000-01-01     1
1000-01-02     2

d) something else

@jreback
Copy link
Contributor

jreback commented Sep 24, 2013

ok....I understand 1); thought it was fixed, but good catch

for 4)

I think if you detect this (an 'invalid' np.datetime64):

if coerce is True, then obviously just NaT them
if errors='raise', then raise

otherwise, return convert to an object array of datetimes (these won't be further converted anywhere)

np.datetime64('1000-01-01').item() is the datetime

In [52]: ix2 = np.array([datetime.date(1000,1,1),datetime.date(1000,1,1)])

In [53]: df = pd.DataFrame({'col1': [1, 2]}, index=ix2)

In [54]: df
Out[54]: 
            col1
1000-01-01     1
1000-01-01     2

In [55]: df.index
Out[55]: Index([1000-01-01, 1000-01-01], dtype=object)

In [56]: df.index[0]
Out[56]: datetime.date(1000, 1, 1)

I am not 100% in love with this, becuase its 'probably' a mistake

but ok for now

@jreback
Copy link
Contributor

jreback commented Sep 26, 2013

@danbirken how's this coming?

@danbirken
Copy link
Contributor

I'm 90% done but I'm currently in Munich for Octoberfest which is slowing
my progress. I'll have an updated patch within a couple days. Sorry about
the delay.

-Dan

On Thu, Sep 26, 2013 at 1:35 PM, jreback notifications@github.com wrote:

@danbirken https://github.com/danbirken how's this coming?


Reply to this email directly or view it on GitHubhttps://github.com//issues/4065#issuecomment-25201401
.

@jreback
Copy link
Contributor

jreback commented Sep 27, 2013

@danbirken enjoy the beer!

@danbirken
Copy link
Contributor

Alright, well it seems my change has triggered some incompatibilities on other platforms which I need to investigate (I just setup python2.6 build env and reproduced them locally, so that is good). Unfortunately I'm still on vacation and having less time for work than expected so I can't get to this now.

So basically, you can ignore this change for the time being.

@jreback
Copy link
Contributor

jreback commented Oct 2, 2013

np...lmk....

danbirken added a commit to danbirken/pandas that referenced this issue Oct 4, 2013
To fix the bug, this change adds bounds checking to
_get_datetime64_nanos() for numpy datetimes that aren't already in [ns]
units.

Additionally, it updates _check_dts_bounds() to do the bound check just
based off the pandas_datetimestruct, by comparing to the minimum and
maximum valid pandas_datetimestructs for datetime64[ns].  It is simpler
and more accurate than the previous system.

Also includes a number of small refactors/fixes to deal with new error
cases that didn't exist when invalid datetime64s were just silently
coerced into the valid datetime64[ns] range.
@danbirken
Copy link
Contributor

Everything builds fine on windows.

This change is an improvement, but it also does add a bit of complexity. As I note in the commit message, the fact that bounds checking didn't exist for datetime64s was hiding other issues (like pickling/unpickling dt64s in numpy 1.6, putting datetime64s into DatetimeIndexes now can throw errors) which I need to also fix here.

I think I probably would like to also refactor array_to_datetime at some point to simplify it, but it would be easier to do that after this change was committed.

Also all of this datetime64 stuff is all much easier to deal with in numpy 1.7. numpy 1.6 has a bunch of annoying issues. Is pandas always going to support numpy 1.6 or is there some timetable to only support 1.7?

@cpcloud
Copy link
Member

cpcloud commented Oct 4, 2013

believe me @danbirken we feel your pain about numpy 1.6 ... we've jumped through crazy hoops to make sure it works. not sure about the timeline

@jreback
Copy link
Contributor

jreback commented Oct 7, 2013

@danbirken hows this coming? what are the remaining 1.6 issues?

@danbirken
Copy link
Contributor

Oh I don't think there are any remaining issues. This code should be good
for all versions and solve the issues in the bug. I was just saying that
dealing with numpy 1.6 was annoying :)

-Dan

On Mon, Oct 7, 2013 at 10:21 PM, jreback notifications@github.com wrote:

@danbirken https://github.com/danbirken hows this coming? what are the
remaining 1.6 issues?


Reply to this email directly or view it on GitHubhttps://github.com//issues/4065#issuecomment-25846373
.

@jreback
Copy link
Contributor

jreback commented Oct 7, 2013

@danbirken concur !

@jreback
Copy link
Contributor

jreback commented Oct 7, 2013

@danbirken thanks...another bug squashed!

feel free to take on issues as you can!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Datetime Datetime data dtype
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants