Add origin parameter to Timestamp/to_datetime epoch support. #11745

Closed
jtkiley opened this Issue Dec 2, 2015 · 14 comments

Comments

Projects
None yet
3 participants
@jtkiley

jtkiley commented Dec 2, 2015

When using SAS or Stata data, dates are represented as the number of days since 1/1/1960, and other statistical software uses different origin dates. With that in mind, it would be nice to have an origin date that can be specified. See also, #3969.

It's a relatively simple thing, and not hard to work around, of course. However, I end up dealing with date formatting on just about every data set I import, and I imagine that lots of others do, too.

Currently, I do something like this:

import pandas as pd
import datetime

EPOCH1960 = datetime.date(1970, 1, 1) - datetime.date(1960, 1, 1)

data = pd.read_stata('./data.dta')
data['date'] = pd.to_datetime(data['date'],unit='D') - EPOCH1960

In R, the as.Date() function takes an origin parameter for numeric types (see, manual). So, in R, the date part would simply be:

data$date <- as.Date(data$date, origin = '1960-01-01')
@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Dec 2, 2015

Contributor

sure this could be a parameter to to_datetime (only) as that is the datetime converter
could work alongside the unit parameter (like #11276).

In fact that should be origin rather than unit now that I think about it. xref #11470

Contributor

jreback commented Dec 2, 2015

sure this could be a parameter to to_datetime (only) as that is the datetime converter
could work alongside the unit parameter (like #11276).

In fact that should be origin rather than unit now that I think about it. xref #11470

@jreback jreback added this to the Next Major Release milestone Dec 2, 2015

@jreback jreback modified the milestones: 0.18.0, Next Major Release Dec 13, 2015

@jreback jreback modified the milestones: Next Major Release, 0.18.0 Feb 2, 2016

@jreback jreback modified the milestones: 0.18.1, Next Major Release Apr 20, 2016

@jreback jreback modified the milestones: 0.18.2, 0.18.1 May 1, 2016

jreback added a commit to jreback/pandas that referenced this issue May 1, 2016

@jreback jreback modified the milestones: 0.18.1, 0.18.2 May 1, 2016

@jreback jreback modified the milestones: Next Major Release, 0.18.2 Jul 6, 2016

@bashtage

This comment has been minimized.

Show comment
Hide comment
@bashtage

bashtage Sep 6, 2016

Contributor

@jtkiley Is read_stata not returning the correct dates by default?

Contributor

bashtage commented Sep 6, 2016

@jtkiley Is read_stata not returning the correct dates by default?

@jtkiley

This comment has been minimized.

Show comment
Hide comment
@jtkiley

jtkiley Dec 16, 2016

It does not (pandas 0.19.1). Below, see one of my Stata datasets (originally written by R). In this example, I've formatted the date like my original post but without the epoch adjustment. I have some type stuff going on here that I'd fix in a real project, but you can see that the separate year variable is 10 years off from the date.

lpermno       date id_ticker    year  \
12060.0 2014-03-02        GE  2004.0   
86868.0 2016-11-05        GS  2006.0   
24643.0 2018-07-29        AA  2008.0   
NaN 2011-02-01       DAL  2001.0

Here's the same data in Stata, also without adjustment, though I formatted the date using format %td date.

lpermno	date	id_ticker	year
12060	01mar2004	GE	2004
86868	05nov2006	GS	2006
24643	28jul2008	AA	2008
.z	31jan2001	DAL	2001

jtkiley commented Dec 16, 2016

It does not (pandas 0.19.1). Below, see one of my Stata datasets (originally written by R). In this example, I've formatted the date like my original post but without the epoch adjustment. I have some type stuff going on here that I'd fix in a real project, but you can see that the separate year variable is 10 years off from the date.

lpermno       date id_ticker    year  \
12060.0 2014-03-02        GE  2004.0   
86868.0 2016-11-05        GS  2006.0   
24643.0 2018-07-29        AA  2008.0   
NaN 2011-02-01       DAL  2001.0

Here's the same data in Stata, also without adjustment, though I formatted the date using format %td date.

lpermno	date	id_ticker	year
12060	01mar2004	GE	2004
86868	05nov2006	GS	2006
24643	28jul2008	AA	2008
.z	31jan2001	DAL	2001

@bashtage

This comment has been minimized.

Show comment
Hide comment
@bashtage

bashtage Dec 22, 2016

Contributor

@jtkiley Could you post a small DTA that demonstrated this issue?

Contributor

bashtage commented Dec 22, 2016

@jtkiley Could you post a small DTA that demonstrated this issue?

@bashtage

This comment has been minimized.

Show comment
Hide comment
@bashtage

bashtage Jan 19, 2017

Contributor

@jtkiley Any chance for sharing a DTA with this issue?

Contributor

bashtage commented Jan 19, 2017

@jtkiley Any chance for sharing a DTA with this issue?

@jtkiley

This comment has been minimized.

Show comment
Hide comment
@jtkiley

jtkiley Jan 20, 2017

Sorry for the delay. Here's one that I reduced down (columns and rows) to what you see above. It was originally written by R and then reduced and saved using Stata. It continues to exhibit this issue.

It's also zipped to make Github happy.
data_epoch.zip

jtkiley commented Jan 20, 2017

Sorry for the delay. Here's one that I reduced down (columns and rows) to what you see above. It was originally written by R and then reduced and saved using Stata. It continues to exhibit this issue.

It's also zipped to make Github happy.
data_epoch.zip

@bashtage

This comment has been minimized.

Show comment
Hide comment
@bashtage

bashtage Jan 20, 2017

Contributor

I can't reproduce it. When I use read_stata, I get:

   lpermno     date id_ticker    year
0  12060.0  16131.0        GE  2004.0
1  86868.0  17110.0        GS  2006.0
2  24643.0  17741.0        AA  2008.0
3      NaN  15006.0       DAL  2001.0

which is identical to what Stata shows. When I convert date to a data column in Stata using format %td date and save it as `date_epoch_td.dta', reading this gets

   lpermno       date id_ticker    year
0  12060.0 2004-03-01        GE  2004.0
1  86868.0 2006-11-05        GS  2006.0
2  24643.0 2008-07-28        AA  2008.0
3      NaN 2001-01-31       DAL  2001.0

which seems to be correct.

Contributor

bashtage commented Jan 20, 2017

I can't reproduce it. When I use read_stata, I get:

   lpermno     date id_ticker    year
0  12060.0  16131.0        GE  2004.0
1  86868.0  17110.0        GS  2006.0
2  24643.0  17741.0        AA  2008.0
3      NaN  15006.0       DAL  2001.0

which is identical to what Stata shows. When I convert date to a data column in Stata using format %td date and save it as `date_epoch_td.dta', reading this gets

   lpermno       date id_ticker    year
0  12060.0 2004-03-01        GE  2004.0
1  86868.0 2006-11-05        GS  2006.0
2  24643.0 2008-07-28        AA  2008.0
3      NaN 2001-01-31       DAL  2001.0

which seems to be correct.

@jtkiley

This comment has been minimized.

Show comment
Hide comment
@jtkiley

jtkiley Jan 20, 2017

@bashtage Right. The problem is when you convert the epoch time using data['date'] = pd.to_datetime(data['date'],unit='D'). If you use my adjustment above, it's right. If not, you end up with the results I showed above. The interpretation problem results from Stata using 1/1/1960 and pandas using 1/1/1970 as the base of epoch time.

jtkiley commented Jan 20, 2017

@bashtage Right. The problem is when you convert the epoch time using data['date'] = pd.to_datetime(data['date'],unit='D'). If you use my adjustment above, it's right. If not, you end up with the results I showed above. The interpretation problem results from Stata using 1/1/1960 and pandas using 1/1/1970 as the base of epoch time.

@bashtage

This comment has been minimized.

Show comment
Hide comment
@bashtage

bashtage Jan 20, 2017

Contributor

@jtkiley I see. I thought it was a bug in read_stata. FWIW if you export your Stata dates as dates, and not integers/floats then read_stata will correctly use the 1960 epoch date when reading the data in.

Contributor

bashtage commented Jan 20, 2017

@jtkiley I see. I thought it was a bug in read_stata. FWIW if you export your Stata dates as dates, and not integers/floats then read_stata will correctly use the 1960 epoch date when reading the data in.

@jtkiley

This comment has been minimized.

Show comment
Hide comment
@jtkiley

jtkiley Jan 20, 2017

@bashtage That makes sense. I was thinking of a to_datetime parameter for setting the origin, as Stata formats aren't the only place that this occurs, and R has such a parameter (presumably for the same reason).

I often see it when moving data around or pulling it from sources that have a Stata export option, and those often don't come with the date formatting intact. I tend to use those export options (often with Stata for co-author accessibility), assemble data in pandas (R in the past), and then export it in Stata format for sharing and analysis.

jtkiley commented Jan 20, 2017

@bashtage That makes sense. I was thinking of a to_datetime parameter for setting the origin, as Stata formats aren't the only place that this occurs, and R has such a parameter (presumably for the same reason).

I often see it when moving data around or pulling it from sources that have a Stata export option, and those often don't come with the date formatting intact. I tend to use those export options (often with Stata for co-author accessibility), assemble data in pandas (R in the past), and then export it in Stata format for sharing and analysis.

@jtkiley

This comment has been minimized.

Show comment
Hide comment
@jtkiley

jtkiley Jan 20, 2017

It looks like #11470 has the origin parameter basically done. It just has some work left finishing it up.

jtkiley commented Jan 20, 2017

It looks like #11470 has the origin parameter basically done. It just has some work left finishing it up.

jreback added a commit to jreback/pandas that referenced this issue Mar 28, 2017

@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 28, 2017

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 28, 2017

Contributor

@bashtage note that #15828 just provides the tools to deal with this, should this be an additional parameter to those reader? (is there actually meta data that tells you the origin?). Could certainly just start this as a doc / post-processing step in any event.

Contributor

jreback commented Mar 28, 2017

@bashtage note that #15828 just provides the tools to deal with this, should this be an additional parameter to those reader? (is there actually meta data that tells you the origin?). Could certainly just start this as a doc / post-processing step in any event.

@bashtage

This comment has been minimized.

Show comment
Hide comment
@bashtage

bashtage Mar 28, 2017

Contributor

I don't see a strong reason to allow arbitrary offsets in the Stata interface code. The present version is very loyal to the Stata dta format spec and allowing a semi-random option to be internalized rather than chained seems like the wrong way to do things.

I suppose without explicit support one would have to do something like

pd.to_datetime(dates.astype(np.int64), origin='1-1-1960')

Maybe there would be an easier way to re-originate existing date-times.

Contributor

bashtage commented Mar 28, 2017

I don't see a strong reason to allow arbitrary offsets in the Stata interface code. The present version is very loyal to the Stata dta format spec and allowing a semi-random option to be internalized rather than chained seems like the wrong way to do things.

I suppose without explicit support one would have to do something like

pd.to_datetime(dates.astype(np.int64), origin='1-1-1960')

Maybe there would be an easier way to re-originate existing date-times.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 28, 2017

Contributor

@bashtage makes sense.

Contributor

jreback commented Mar 28, 2017

@bashtage makes sense.

jreback added a commit to jreback/pandas that referenced this issue Mar 28, 2017

jreback added a commit to jreback/pandas that referenced this issue Mar 29, 2017

jreback added a commit to jreback/pandas that referenced this issue Apr 2, 2017

@jreback jreback closed this in cd24fa9 Apr 2, 2017

linebp added a commit to linebp/pandas that referenced this issue Apr 17, 2017

ENH: add origin to to_datetime
closes #11276
closes #11745
superseded #11470

Author: Jeff Reback <jeff@reback.net>
Author: Sumit Binnani <sumit.binnani@gmail.com>

Closes #15828 from jreback/datetime-unit and squashes the following commits:

ebb4acd [Jeff Reback] doc fixes & cleanup
209591a [Jeff Reback] bug fix
56663a5 [Jeff Reback] add Timedelta floordiv ops
a24e88c [Jeff Reback] rename epoch -> unix
6a8a779 [Jeff Reback] update docs / tests
ad7356e [Sumit Binnani] BUG: Series creation with datetime64 with non-ns unit as object dtype
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment