ENH: add origin to to_datetime #15828
Conversation
jreback added the Enhancement and Timeseries labels on Mar 28, 2017
jreback added this to the 0.20.0 milestone on Mar 28, 2017
jreback referenced this pull request on Mar 28, 2017: ENH: Adding origin parameter in pd.to_datetime #11470 (Closed)
ok, rebased and updated the original PR. pls have a look. IIRC there were some more test cases needed, but I don't really remember.
jreback referenced this pull request on Mar 28, 2017: Add origin parameter to Timestamp/to_datetime epoch support. #11745 (Closed)
+
+   pd.to_datetime([1, 2, 3], unit='D', origin=pd.Timestamp('1960-01-01'))
+
+The default is set at ``origin='epoch'``, which defaults to ``1970-01-01 00:00:00``.
chris-b1
Mar 28, 2017
Contributor
'epoch' seems non-descriptive - in a sense, aren't all origins epochs? Maybe default should be 'unix' like @shoyer had suggested?
chris-b1
Mar 28, 2017
Contributor
Certainly the unix epoch date is implied by "epoch time" but it's also a general term, that could refer to epoch timekeeping in general.
https://en.wikipedia.org/wiki/Epoch_(reference_date)#Computing
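To make the naming discussion concrete, here is a minimal plain-Python sketch of what an origin means: each number is a count of `unit`s from a reference date, and 'unix' is just a name for the reference date 1970-01-01. The helper `from_origin` and the `NAMED_ORIGINS` table are hypothetical names used for illustration; this is not the pandas implementation.

```python
from datetime import datetime, timedelta

# Hypothetical table of named reference dates; only 'unix' is shown here.
NAMED_ORIGINS = {"unix": datetime(1970, 1, 1)}


def from_origin(values, unit=timedelta(days=1), origin="unix"):
    """Interpret each number as a count of `unit` since `origin`."""
    start = NAMED_ORIGINS[origin] if isinstance(origin, str) else origin
    return [start + v * unit for v in values]


# Days counted from 1960-01-01, mirroring the docs example in the hunk above.
print(from_origin([1, 2, 3], origin=datetime(1960, 1, 1)))
# The same numbers counted from the default Unix reference date.
print(from_origin([1, 2, 3]))
```

With a named default, any other string key in such a table would also be "an epoch", which is the ambiguity the comments above point at.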
+ - If 'epoch', origin is set to 1970-01-01.
+ - If 'julian', unit must be 'D', and origin is set to beginning of
+   Julian Calendar. Julian day number 0 is assigned to the day starting
+   at noon on January 1, 4713 BC.
chris-b1
Mar 28, 2017
Contributor
Not critical, but could expand to other semi-common origins @bashtage mentions here.
#11470 (comment)
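As a rough illustration of the 'julian' option described in the docstring hunk above, a Julian date can be rebased onto the Unix reference date by subtracting the Julian date of 1970-01-01 00:00:00, which is 2440587.5. The sketch below uses only the standard library; the function name `julian_to_datetime` is hypothetical and the code is not taken from the PR.

```python
from datetime import datetime, timedelta

# Julian date of the Unix reference date 1970-01-01 00:00:00.
UNIX_EPOCH_JULIAN_DATE = 2440587.5
UNIX_EPOCH = datetime(1970, 1, 1)


def julian_to_datetime(julian_date):
    """Convert a Julian date (in days) to a naive datetime."""
    return UNIX_EPOCH + timedelta(days=julian_date - UNIX_EPOCH_JULIAN_DATE)


print(julian_to_datetime(2440587.5))  # 1970-01-01 00:00:00
print(julian_to_datetime(2451545.0))  # 2000-01-01 12:00:00
```

Other fixed origins could be handled the same way, given their distance in days from 1970-01-01.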
codecov
bot
commented
Mar 28, 2017
•
Codecov Report
@@            Coverage Diff             @@
##           master   #15828      +/-   ##
==========================================
- Coverage   90.98%   90.95%   -0.04%
==========================================
  Files         143      143
  Lines       49449    49464     +15
==========================================
- Hits        44993    44991      -2
- Misses       4456     4473     +17
ok, changed
  infer_datetime_format : boolean, default False
      If True and no `format` is given, attempt to infer the format of the
      datetime strings, and if it can be inferred, switch to a faster
      method of parsing them. In some cases this can increase the parsing
      speed by ~5-10x.
+ origin : scalar convertible to Timestamp / string ('julian', 'unix'),
+     default 'unix'.
jorisvandenbossche
Mar 29, 2017
Owner
The problem is that this does not work ... type explanation should be on one line (for good html docs) (ref numpy/numpydoc#87)
@@ -297,8 +312,13 @@ def to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False,
  >>> %timeit pd.to_datetime(s,infer_datetime_format=False)
  1 loop, best of 3: 471 ms per loop
- """
+ Using non-epoch origins to parse date
+ >>> pd.to_datetime([1,2,3], unit='D', origin=pd.Timestamp('1960-01-01'))
+     offset = tslib.Timestamp(origin) - tslib.Timestamp(0)
+ except tslib.OutOfBoundsDatetime:
+     raise ValueError(
+         "origin {} is Out of Bounds".format(origin))
jorisvandenbossche
Mar 29, 2017
Owner
No, I mean for origin specifically (by circumventing the Timestamp creation), to have something like pd.to_datetime(.., origin='0001-01-01') working if the end timestamp is not out of bounds.
Eg by using dateutil.parser.parse(origin), but then we still need to convert that to the correct offset
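The idea in this comment can be sketched without pandas: measure the origin's distance from 1970-01-01 with a type that is not limited to the nanosecond-resolution range, then shift the numeric input by that distance. This sketch substitutes the standard library's `datetime.date` for `dateutil.parser.parse`, and `days_since_unix` is a hypothetical helper name, so treat it as an illustration of the approach rather than the code that was merged.

```python
from datetime import date, timedelta

UNIX_EPOCH = date(1970, 1, 1)


def days_since_unix(origin_iso):
    """Whole days from 1970-01-01 to an ISO-formatted origin (negative if earlier)."""
    return (date.fromisoformat(origin_iso) - UNIX_EPOCH).days


offset = days_since_unix("0001-01-01")
print(offset)  # -719162

# A day count relative to 0001-01-01 becomes a day count relative to 1970-01-01,
# which can then be parsed as an ordinary Unix-based value.
days_from_year_one = 736_000
print(UNIX_EPOCH + timedelta(days=days_from_year_one + offset))
```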
+         "to a Timestamp".format(origin))
+
+ if offset is not None:
+     result = result + offset
jorisvandenbossche
Mar 29, 2017
Owner
As I commented in the original PR (#11470 (comment)), I think it can solve some corner cases to do the origin handling before the actual parsing (as you do for the 'julian' case).
A bit of an artificial example, but to show one of the problems:
In [11]: pd.to_datetime(200*365, unit='D')
Out[11]: Timestamp('2169-11-13 00:00:00')
In [12]: pd.to_datetime(200*365, unit='D', origin='1870-01-01')
Out[12]: Timestamp('2069-11-13 00:00:00')
In [13]: pd.to_datetime(300*365, unit='D', origin='1870-01-01')
...
OutOfBoundsDatetime: cannot convert input with unit 'D'
So the last one is actually not an OutOfBounds datetime, but just because we first parse the number as epoch, it raises this error. If we first subtract the correct value from the numeric argument, and then parse as epoch, the above will work fine.
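The ordering problem can be reproduced with a simplified model in which timestamps are signed 64-bit nanosecond counts from 1970-01-01. The function names below are hypothetical, and the bounds are a simplification of what pandas enforces, but the arithmetic shows why shifting the numeric input first avoids the spurious error.

```python
from datetime import date

NS_PER_DAY = 86_400 * 10**9
INT64_MAX = 2**63 - 1
INT64_MIN = -(2**63)
UNIX_EPOCH = date(1970, 1, 1)


def days_to_ns(days):
    """Convert days to nanoseconds, failing if the result does not fit in int64."""
    ns = days * NS_PER_DAY
    if not INT64_MIN <= ns <= INT64_MAX:
        raise OverflowError("{} days does not fit in int64 nanoseconds".format(days))
    return ns


def parse_then_shift(days, origin):
    # The raw input is converted first, so it can overflow even when the
    # final, shifted result would have been representable.
    return days_to_ns(days) + days_to_ns((origin - UNIX_EPOCH).days)


def shift_then_parse(days, origin):
    # Rebase onto the Unix epoch while still in day units, then convert once.
    return days_to_ns(days + (origin - UNIX_EPOCH).days)


origin = date(1870, 1, 1)
print(shift_then_parse(300 * 365, origin) // NS_PER_DAY)  # 72976
try:
    parse_then_shift(300 * 365, origin)
except OverflowError as exc:
    print("overflow:", exc)
```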
Some cases where something goes wrong:
+to_datetime has gained an origin parameter
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+``pd.to_datetime`` has gained a new parameter, ``origin``, to define an offset
jorisvandenbossche
Mar 29, 2017
Owner
let's use 'reference date' instead of 'offset' here as well (as you did in the docstring)
This was referenced Mar 29, 2017
Here are the results with the new push.
Note [1], [2] (were correct before).
- origin : scalar convertible to Timestamp / string ('julian', 'unix'),
-     default 'unix'.
-     Define reference date. The numeric values would be parsed as number
-     of units (defined by `unit`) since this reference date.
jorisvandenbossche
Apr 2, 2017
Owner
I would leave this sentence in, it is a good explanation of what an origin is.
+ if np.any(arg > j_max) or np.any(arg < j_min):
+     raise tslib.OutOfBoundsDatetime(
+         "{original} is Out of Bounds for "
+         "origin='julian'".format(original=original))
jorisvandenbossche
Apr 2, 2017
Owner
You can have the same problem with a custom-defined origin (not unix or julian)
jreback
Apr 2, 2017
Contributor
yes, but that is handled below (this is only for the julian section). I do raise the same error (w/o the julian reference)
jreback
Apr 2, 2017
Contributor
elif origin not in ['unix', 'julian']:
    # arg must be a numeric
    original = arg
    if not ((is_scalar(arg) and (is_integer(arg) or is_float(arg))) or
            is_numeric_dtype(np.asarray(arg))):
        raise ValueError(
            "'{arg}' is not compatible with origin='{origin}'; "
            "it must be numeric with a unit specified ".format(
                arg=arg,
                origin=origin))
    # we are going to offset back to unix / epoch time
    try:
        offset = tslib.Timestamp(origin) - tslib.Timestamp(0)
    except tslib.OutOfBoundsDatetime:
        raise tslib.OutOfBoundsDatetime(
            "origin {} is Out of Bounds".format(origin))
    except ValueError:
        raise ValueError("origin {} cannot be converted "
                         "to a Timestamp".format(origin))
+ # this should be lossless in terms of precision
+ offset = offset // tslib.Timedelta(1, unit=unit)
+
+ arg = np.asarray(arg)
jorisvandenbossche
Apr 2, 2017
Owner
Wouldn't it be simpler to check if it is a list and then convert to array? (or check that it is not a scalar/series/index)
Then the arg = arg + offset does just what you want, and the conversion back as below is not needed
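The step under review, turning the origin's distance from 1970-01-01 into a whole count of `unit`s and adding it to the input, can be sketched with standard-library timedeltas. The explicit remainder check is an assumption added here to make the "lossless" expectation visible; the names `offset_in_units` and `values` are hypothetical and not from the PR.

```python
from datetime import datetime, timedelta

UNIX_EPOCH = datetime(1970, 1, 1)


def offset_in_units(origin, unit):
    """Express origin - 1970-01-01 as a whole number of `unit`s."""
    delta = origin - UNIX_EPOCH
    # Assumption for this sketch: refuse origins that floor division would truncate.
    if delta % unit:
        raise ValueError("origin is not a whole number of units from the Unix epoch")
    return delta // unit


offset = offset_in_units(datetime(1960, 1, 1), timedelta(days=1))
print(offset)  # -3653

# Shifting list input element-wise gives counts relative to the Unix epoch.
values = [1, 2, 3]
print([v + offset for v in values])  # [-3652, -3651, -3650]
```

With array-like input the same shift is a single `arg + offset`, which is the simplification suggested in the comment above.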
sumitbinnani and others added some commits on Mar 28, 2017
jreback closed this in cd24fa9 on Apr 2, 2017
@jorisvandenbossche made those changes and merged. thanks for the review.
Winand referenced this pull request on Apr 3, 2017: Support more sas7bdat date/datetime formats #15871 (Open)
And thanks for picking up the PR!
linebp added a commit to linebp/pandas that referenced this pull request on Apr 17, 2017
jreback + linebp, 934625d
jreback commented Mar 28, 2017
closes #11276
closes #11745
superseded #11470