ENH: add origin to to_datetime #15828

Closed
wants to merge 6 commits into
from

Conversation

Projects
None yet
5 participants
Contributor

jreback commented Mar 28, 2017

closes #11276
closes #11745

superseded #11470

jreback added this to the 0.20.0 milestone Mar 28, 2017

Contributor

jreback commented Mar 28, 2017

@jorisvandenbossche @chris-b1

ok, rebase and updated original PR. pls have a look.

IIRC there were some more test cases needed, but dont' really remember

doc/source/timeseries.rst
+
+ pd.to_datetime([1, 2, 3], unit='D', origin=pd.Timestamp('1960-01-01'))
+
+The default is set at ``origin='epoch'``, which defaults to ``1970-01-01 00:00:00``.
@chris-b1

chris-b1 Mar 28, 2017

Contributor

'epoch' seems non-descriptive - in a sense, aren't all origins epochs? Maybe default should be 'unix' like @shoyer had suggested?

@jreback

jreback Mar 28, 2017

Contributor

Unix time (also known as POSIX time or epoch time) ?

https://en.wikipedia.org/wiki/Unix_time

@chris-b1

chris-b1 Mar 28, 2017

Contributor

Certainly the unix epoch date is implied by "epoch time" but it's also a general term, that could refer to epoch timekeeping in general.

https://en.wikipedia.org/wiki/Epoch_(reference_date)#Computing

pandas/tseries/tools.py
+ - If 'epoch', origin is set to 1970-01-01.
+ - If 'julian', unit must be 'D', and origin is set to beginning of
+ Julian Calendar. Julian day number 0 is assigned to the day starting
+ at noon on January 1, 4713 BC.
@chris-b1

chris-b1 Mar 28, 2017

Contributor

Not critical, but could expand to other semi-common origins @bashtage mentions here.
#11470 (comment)

@jreback

jreback Mar 28, 2017

Contributor

hmm, maybe, can always add if people really want this.

codecov bot commented Mar 28, 2017 edited

Codecov Report

Merging #15828 into master will decrease coverage by 0.03%.
The diff coverage is 91.11%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #15828      +/-   ##
==========================================
- Coverage   90.98%   90.95%   -0.04%     
==========================================
  Files         143      143              
  Lines       49449    49464      +15     
==========================================
- Hits        44993    44991       -2     
- Misses       4456     4473      +17
Flag Coverage Δ
#multiple ?
#single ?
Impacted Files Coverage Δ
pandas/tseries/tdi.py 90.37% <100%> (+0.03%) ⬆️
pandas/tseries/tools.py 85.2% <90.24%> (+0.26%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/io/packers.py 87.57% <0%> (-0.94%) ⬇️
pandas/core/common.py 90.96% <0%> (-0.34%) ⬇️
pandas/io/html.py 84.48% <0%> (-0.33%) ⬇️
pandas/core/series.py 94.86% <0%> (-0.11%) ⬇️
pandas/core/frame.py 97.56% <0%> (-0.1%) ⬇️
pandas/compat/numpy/function.py 94.19% <0%> (-0.04%) ⬇️
pandas/core/indexing.py 94.08% <0%> (-0.03%) ⬇️
... and 6 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 67cc021...ebb4acd. Read the comment docs.

Contributor

jreback commented Mar 28, 2017

ok changed epoch -> unix and updated docs.

pandas/tseries/tools.py
infer_datetime_format : boolean, default False
If True and no `format` is given, attempt to infer the format of the
datetime strings, and if it can be inferred, switch to a faster
method of parsing them. In some cases this can increase the parsing
speed by ~5-10x.
+ origin : scalar convertible to Timestamp / string ('julian', 'unix'),
+ default 'unix'.
@jorisvandenbossche

jorisvandenbossche Mar 29, 2017

Owner

The problem is that this does not work ... type explanation should be on one line (for good html docs) (ref numpy/numpydoc#87)

@jreback

jreback Mar 29, 2017

Contributor

so pretty much make this shorter then?

pandas/tseries/tools.py
@@ -297,8 +312,13 @@ def to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False,
>>> %timeit pd.to_datetime(s,infer_datetime_format=False)
1 loop, best of 3: 471 ms per loop
- """
+ Using non-epoch origins to parse date
+ >>> pd.to_datetime([1,2,3], unit='D', origin=pd.Timestamp('1960-01-01'))
@jorisvandenbossche

jorisvandenbossche Mar 29, 2017

Owner

blank line between the two lines above

pandas/tseries/tools.py
+ offset = tslib.Timestamp(origin) - tslib.Timestamp(0)
+ except tslib.OutOfBoundsDatetime:
+ raise ValueError(
+ "origin {} is Out of Bounds".format(origin))
@jorisvandenbossche

jorisvandenbossche Mar 29, 2017

Owner

We should add a todo item to solve this

@jreback

jreback Mar 29, 2017

Contributor

what part, OutOfBounds dates generally?

@jorisvandenbossche

jorisvandenbossche Mar 29, 2017

Owner

No, I mean for origin specifically (by circumventing the Timestamp creation), to have something like pd.to_datetime(.., origin='0001-01-01') working if the end timestamp is not out of bounds.

Eg by using dateutil.parser.parse(origin), but then we still need to convert that to the correct offset

pandas/tseries/tools.py
+ "to a Timestamp".format(origin))
+
+ if offset is not None:
+ result = result + offset
@jorisvandenbossche

jorisvandenbossche Mar 29, 2017

Owner

As I commented in the original PR (#11470 (comment)), I think it can solve some corner cases to do the origin handling before the actual parsing (as you do for the 'julian' case).

A bit an artificial example, but to show one of the problems:

In [11]: pd.to_datetime(200*365, unit='D')
Out[11]: Timestamp('2169-11-13 00:00:00')

In [12]: pd.to_datetime(200*365, unit='D', origin='1870-01-01')
Out[12]: Timestamp('2069-11-13 00:00:00')

In [13]: pd.to_datetime(300*365, unit='D', origin='1870-01-01')
...
OutOfBoundsDatetime: cannot convert input with unit 'D'

So the last one is actually not an OutOfBounds datetime, but just because we first parse the number as epoch, it raises this error. If we first subtract the correct value from the numeric argument, and then parse as epoch, the above will work fine.

@jreback

jreback Mar 29, 2017

Contributor

fixed! this was tricky.

Owner

jorisvandenbossche commented Mar 29, 2017 edited

Some cases where something goes wrong:

  • This should raise an error (if the origin is first processed and the argument adapted for it, then this will raise an error):

    In [2]: pd.to_datetime("2005-01-01", origin="1960-01-01")
    Out[2]: Timestamp('1995-01-01 00:00:00')
    
  • Not sure how this can return 12:00 as the hour? (EDIT -> this is probably just the definition? "starting at noon")

    In [4]: pd.to_datetime(2456658, origin='julian', unit='D')
    Out[4]: Timestamp('2013-12-31 12:00:00')
    
  • The error message in case of out of bounds here:

    In [3]: pd.to_datetime(1, origin="julian", unit='D')
    ....
    OutOfBoundsDatetime: cannot convert input -2440586.5 with the unit 'D'
    

    is a bit unfortunate (the "input -2440586.5"). But that is of course the consequence of processing the origin first and adapting the arg based on that, before passing to the actual parsing methods ... (so that's a clear drawback of this approach I suggested also for the other origins). Not sure whether this could easily be solved ..

doc/source/whatsnew/v0.20.0.txt
+to_datetime has gained an origin parameter
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``pd.to_datetime`` has gained a new parameter, ``origin``, to define an offset
@jorisvandenbossche

jorisvandenbossche Mar 29, 2017

Owner

let's use 'reference date' instead of 'offset' here as well (as you did in the docstring)

Contributor

jreback commented Mar 29, 2017

Here are the results with new push.

In [1]: pd.to_datetime(200*365, unit='D')
Out[1]: Timestamp('2169-11-13 00:00:00')

In [2]: pd.to_datetime(200*365, unit='D', origin='1870-01-01')
Out[2]: Timestamp('2069-11-13 00:00:00')

In [3]: pd.to_datetime(300*365, unit='D', origin='1870-01-01')
Out[3]: Timestamp('2169-10-20 00:00:00')

In [4]: pd.to_datetime("2005-01-01", origin="1960-01-01")
ValueError: '2005-01-01' is not compatible with origin='1960-01-01'; it must be numeric with a unit specified 

In [5]: pd.to_datetime(2456658, origin='julian', unit='D')
Out[5]: Timestamp('2013-12-31 12:00:00')

In [6]: pd.to_datetime(1, origin="julian", unit='D')
OutOfBoundsDatetime: 1 is Out of Bounds for origin='julian'

note [1], [2] (were correct before)
[5] seems fine as well.

pandas/tseries/tools.py
- origin : scalar convertible to Timestamp / string ('julian', 'unix'),
- default 'unix'.
- Define reference date. The numeric values would be parsed as number
- of units (defined by `unit`) since this reference date.
@jorisvandenbossche

jorisvandenbossche Apr 2, 2017

Owner

I would leave this sentence in, it is a good explanation of what an origin is.

@jorisvandenbossche

New changes look good!

+ if np.any(arg > j_max) or np.any(arg < j_min):
+ raise tslib.OutOfBoundsDatetime(
+ "{original} is Out of Bounds for "
+ "origin='julian'".format(original=original))
@jorisvandenbossche

jorisvandenbossche Apr 2, 2017

Owner

You can have the same problem with a custom defined origin (not unix of julian)

@jreback

jreback Apr 2, 2017

Contributor

yes, but that is handled below (this is only for the julian section). I do raise the same error (w/o the julian reference)

@jreback

jreback Apr 2, 2017

Contributor
    elif origin not in ['unix', 'julian']:

        # arg must be a numeric
        original = arg
        if not ((is_scalar(arg) and (is_integer(arg) or is_float(arg))) or
                is_numeric_dtype(np.asarray(arg))):
            raise ValueError(
                "'{arg}' is not compatible with origin='{origin}'; "
                "it must be numeric with a unit specified ".format(
                    arg=arg,
                    origin=origin))

        # we are going to offset back to unix / epoch time
        try:
            offset = tslib.Timestamp(origin) - tslib.Timestamp(0)
        except tslib.OutOfBoundsDatetime:
            raise tslib.OutOfBoundsDatetime(
                "origin {} is Out of Bounds".format(origin))
        except ValueError:
            raise ValueError("origin {} cannot be converted "
                             "to a Timestamp".format(origin))
pandas/tseries/tools.py
+ # this should be lossless in terms of precision
+ offset = offset // tslib.Timedelta(1, unit=unit)
+
+ arg = np.asarray(arg)
@jorisvandenbossche

jorisvandenbossche Apr 2, 2017

Owner

Wouldn't it be simpler to check if it is a list and then convert to array? (or check that is is not a scalar/series/index)

Then the arg = arg + offset does just what you want, and the conversion back as below is not needed

sumitbinnani and others added some commits Mar 28, 2017

@sumitbinnani @jreback sumitbinnani BUG: Series creation with datetime64 with non-ns unit as object dtype
closes #11275
closes #11745
ad7356e
@jreback jreback update docs / tests 6a8a779
@jreback jreback rename epoch -> unix a24e88c
@jreback jreback add Timedelta floordiv ops
provide out-of-bounds for julian dates
56663a5
@jreback jreback bug fix 209591a
@jreback jreback doc fixes & cleanup
ebb4acd

jreback closed this in cd24fa9 Apr 2, 2017

Contributor

jreback commented Apr 2, 2017

@jorisvandenbossche make those changes and merged. thanks for the review.

And thanks for picking up the PR!

@linebp linebp added a commit to linebp/pandas that referenced this pull request Apr 17, 2017

@jreback @linebp jreback + linebp ENH: add origin to to_datetime
closes #11276
closes #11745
superseded #11470

Author: Jeff Reback <jeff@reback.net>
Author: Sumit Binnani <sumit.binnani@gmail.com>

Closes #15828 from jreback/datetime-unit and squashes the following commits:

ebb4acd [Jeff Reback] doc fixes & cleanup
209591a [Jeff Reback] bug fix
56663a5 [Jeff Reback] add Timedelta floordiv ops
a24e88c [Jeff Reback] rename epoch -> unix
6a8a779 [Jeff Reback] update docs / tests
ad7356e [Sumit Binnani] BUG: Series creation with datetime64 with non-ns unit as object dtype
934625d
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment