Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"high" datetime conversion incorrect #5863

Closed
CarstVaartjes opened this issue Jan 6, 2014 · 7 comments · Fixed by #5864

Comments

@CarstVaartjes
Copy link

commented Jan 6, 2014

Hi,

I have a situation where I have "far away" end dates (3 examples included below). Normal Python strptime conversion handles this correctly, but the to_datetime functionality does a Martin McFly and ends up in the 1800s. See this example that reproduces it:

# input
time_ser = pd.Series([np.nan, '2013-04-08 00:00:00.000', '2013-06-04 00:00:00.000', '2013-09-06 00:00:00.000', '2013-10-02 00:00:00.000', '2013-10-03 00:00:00.000', '2013-10-30 00:00:00.000', '2013-10-31 00:00:00.000', '2013-11-30 00:00:00.000', '2013-12-02 00:00:00.000', '2013-12-17 00:00:00.000', '2013-12-31 00:00:00.000', '2014-01-15 00:00:00.000', '2014-01-31 00:00:00.000', '2014-02-15 00:00:00.000', '2014-02-28 00:00:00.000', '2014-03-15 00:00:00.000', '2014-03-31 00:00:00.000', '2014-06-15 00:00:00.000', '2014-06-30 00:00:00.000', '2099-12-31 00:00:00.000', '2999-12-31 00:00:00.000', '9990-12-31 00:00:00.000', '9999-12-31 00:00:00.000'])

# see last 3 results
pd.to_datetime(time_ser, '%Y-%m-%d %H:%M:%S.%f')

# normal python
import math
for x in sorted(time_ser.unique()):
    if isinstance(x, basestring):
        print '{} converts to {}'.format( x, datetime.datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f'))

# This is the erroneous result:
21   1830-11-22 00:50:52.580896768
22   1807-03-30 05:56:08.066277376
23   1816-03-29 05:56:08.066277376

I'm running Pandas 0.12 on Ubuntu 12.04LTS with Python 2.7.3. I could not find this issue registered yet.

Kind regards,

Carst

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 6, 2014

try with master
should raise

@jorisvandenbossche

This comment has been minimized.

@CarstVaartjes

This comment has been minimized.

Copy link
Author

commented Jan 6, 2014

Thanks, just built and installed 0.13-dev, it does not give a raise a warning or error however, but returns the series as an object (still a lot better than 0.12!). try this in master:

time_ser = pd.Series([np.nan, '2013-04-08 00:00:00.000', '2013-06-04 00:00:00.000', '2013-09-06 00:00:00.000', '2013-10-02 00:00:00.000', '2013-10-03 00:00:00.000', '2013-10-30 00:00:00.000', '2013-10-31 00:00:00.000', '2013-11-30 00:00:00.000', '2013-12-02 00:00:00.000', '2013-12-17 00:00:00.000', '2013-12-31 00:00:00.000', '2014-01-15 00:00:00.000', '2014-01-31 00:00:00.000', '2014-02-15 00:00:00.000', '2014-02-28 00:00:00.000', '2014-03-15 00:00:00.000', '2014-03-31 00:00:00.000', '2014-06-15 00:00:00.000', '2014-06-30 00:00:00.000', '2099-12-31 00:00:00.000', '2999-12-31 00:00:00.000', '9990-12-31 00:00:00.000', '9999-12-31 00:00:00.000'])
pd.to_datetime(time_ser, '%Y-%m-%d %H:%M:%S.%f')[-3:]
pd.to_datetime(time_ser, '%Y-%m-%d %H:%M:%S.%f').dtype

time_ser = pd.Series([np.nan, '2013-04-08 00:00:00.000', '2013-06-04 00:00:00.000', '2013-09-06 00:00:00.000', '2013-10-02 00:00:00.000', '2013-10-03 00:00:00.000', '2013-10-30 00:00:00.000', '2013-10-31 00:00:00.000', '2013-11-30 00:00:00.000', '2013-12-02 00:00:00.000', '2013-12-17 00:00:00.000', '2013-12-31 00:00:00.000', '2014-01-15 00:00:00.000', '2014-01-31 00:00:00.000', '2014-02-15 00:00:00.000', '2014-02-28 00:00:00.000', '2014-03-15 00:00:00.000', '2014-03-31 00:00:00.000', '2014-06-15 00:00:00.000', '2014-06-30 00:00:00.000', '2099-12-31 00:00:00.000'])
pd.to_datetime(time_ser, '%Y-%m-%d %H:%M:%S.%f')
pd.to_datetime(time_ser, '%Y-%m-%d %H:%M:%S.%f').dtype

The second one (without out-of-range dates) will convert, the first one will not. (@joris: I will try to do the period index/frequency solution and see if that works out, also with HDF5 writing/reading)

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 6, 2014

sorry by default it will do that
you can pass errors='raise' to force it to fail
or coerce=True to make non valid dates be NaT

neither of your examples will be valid for HDF5 FYI

unless you use coerce=True

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 6, 2014

don't pass them as Series the. it will raise I believe
on e you make it a Series it will automatically be object

@CarstVaartjes

This comment has been minimized.

Copy link
Author

commented Jan 6, 2014

with errors='raise' I indeed get a nice: pandas.tslib.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2999-12-31 00:00:00. :)

0.13 handles this nicely enough to be honest with the object return (0.12 got me rather confused) so this can be closed for me! thank you very much for your help

I'll think on how to handle this for my own specific problem. The most elegant way would be to have a integer-based "date" class that pandas and hdf5 just see as int (but that gives issues with nans though).
I guess i'll go a bit down a dirty route: we made something to replace all attribute / non-numeric columns with an integer-index columns that refer to separately saved (unique) value series. Those value series sometimes get pickled anyway by HDF5 because of unicodes, but it also saves a lot of space in general as most of my data often has repeating values in the attribute columns. So I guess I will have to save dates as objects in a standard format there and then later transform them using standard python functionality when needed.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 7, 2014

was a slight error here in any event (the np.nan was raising when passed with a format, and a non-strng would fail)

here's the behavior (which matches current)

In [1]: ts = Series([np.nan, '2013-04-08 00:00:00.000', '9999-12-31 00:00:00.000'])

In [2]: ts
Out[2]: 
0                        NaN
1    2013-04-08 00:00:00.000
2    9999-12-31 00:00:00.000
dtype: object

In [4]: pd.to_datetime(ts, format='%Y-%m-%d %H:%M:%S.%f')
Out[4]: 
0                        NaN
1    2013-04-08 00:00:00.000
2    9999-12-31 00:00:00.000
dtype: object

In [5]: pd.to_datetime(ts, format='%Y-%m-%d %H:%M:%S.%f',coerce=True)
Out[5]: 
0                   NaT
1   2013-04-08 00:00:00
2                   NaT
dtype: datetime64[ns]

In [6]: pd.to_datetime(ts, format='%Y-%m-%d %H:%M:%S.%f',errors='raise')
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 9999-12-31 00:00:00

@jreback jreback closed this in #5864 Jan 7, 2014

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.