
Support or default to less detailed datetime64 #7307

Closed
Tracked by #46587
CarstVaartjes opened this issue Jun 2, 2014 · 40 comments

Labels
Datetime Datetime data dtype Enhancement Non-Nano datetime64/timedelta64 with non-nanosecond resolution

Comments

@CarstVaartjes

Hi,

I regularly run into issues where I have dates that fall outside of Pandas's datetime bounds. Quite a few data sources use sentinel defaults such as "9999-12-31", which leads to problems in pandas.

This is because Pandas defaults to nanosecond precision, whose time span is quite limited.
See: http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html
Code   Meaning       Time span (relative)   Time span (absolute)
s      second        +/- 2.9e12 years       [2.9e9 BC, 2.9e9 AD]
ms     millisecond   +/- 2.9e9 years        [2.9e6 BC, 2.9e6 AD]
us     microsecond   +/- 2.9e6 years        [290301 BC, 294241 AD]
ns     nanosecond    +/- 292 years          [1678 AD, 2262 AD]

I first thought the unit='s' parameter of to_datetime would work (see: http://pandas.pydata.org/pandas-docs/version/0.14.0/generated/pandas.tseries.tools.to_datetime.html), but that parameter only governs how a numeric input is translated to nanoseconds (I think), and the "ns" detail level seems to be rather hard-coded.

I cannot imagine the majority of use cases needing nanoseconds; even going to microseconds extends the date range to something that, in my experience, should always work. The nanosecond limit of 2262 AD is really restrictive.

Imho, ideally one should be able to choose the detail level. Is this just me or is this a common issue?
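
For illustration, a minimal reproduction of the failure being described (the exact exception text varies by pandas version):

import pandas as pd

pd.to_datetime("9999-12-31")
# raises something like:
# OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 9999-12-31 00:00:00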

@jreback
Contributor

jreback commented Jun 2, 2014

http://pandas-docs.github.io/pandas-docs-travis/gotchas.html#timestamp-limitations

simply use periods and use NaT as appropriate for missing values
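
For example, a minimal sketch of this workaround (as it behaves in current pandas, where Period(None, ...) yields NaT):

import pandas as pd

pd.Period("9999-12-31", freq="D")  # Period('9999-12-31', 'D') -- far outside ns Timestamp bounds
pd.Period(None, freq="D")          # NaT marks a missing value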

@ifmihai

ifmihai commented Jul 22, 2014

it's a more common issue than believed

I bump into this issue more often than not

my guess is that at least 99% of pandas users don't need nanoseconds

if this is true,
then it would be common sense to use microseconds as the default
and then to have something like
df = P.DataFrame(index=nano_range(start, end))

nanosecond usage is a particular case, not the rule

I really don't understand the rationale behind all this nano planning

But I don't understand how hard it would be to change to microseconds, either

@jreback
Contributor

jreback commented Jul 22, 2014

@ifmihai

In [84]: period_range('201-1-1 10:20','9000-1-1 12:00',freq='H')
Out[84]: 
<class 'pandas.tseries.period.PeriodIndex'>
[201-01-01 10:00, ..., 9000-01-01 12:00]
Length: 77130459, Freq: H

you can simply use a PeriodIndex to represent any time span you want. DatetimeIndex and PeriodIndex have different use cases: DatetimeIndex provides maximum resolution within the most common timespans and is thus pretty general; if you need more range but less resolution, use a PeriodIndex. You CAN have it both ways. No need to change the defaults and try to conform everyone to one representation.

@ifmihai

ifmihai commented Jul 22, 2014

@jreback
"No need to change the defaults and try to conform everyone to one representation"

what about nanoseconds? isn't it the same?

I feel forced to conform to whoever decided nanoseconds were best (which is really not practical for most use cases).
I would surely want to hear the opinion of whoever decided on the nanosecond unit,
and listen to what they have to say.
I doubt it will be convincing, except perhaps that it was forced by numpy, but that's out of my league.

anyway, the nanosecond unit receives a big -1 from me
(I already started to hate nanoseconds btw :D)

@jreback
Contributor

jreback commented Jul 22, 2014

@ifmihai this was decided quite a long time ago, by @wesm (and I think it was numpy that drove it, not really sure). As I said before, that was one reason Periods were created. Changing the default is a non-starter.

@CarstVaartjes
Author

But how can we read a csv file, for instance, and convert dates to periods? Sincere question, as I'm really struggling with this and the documentation is really unclear here. There is

  1. to_datetime, where I can give a date format, but it will refuse to convert out-of-nanosecond-bounds datetimes, so I cannot use it (I would lose data); the units do not help (as in: unit='s' does not solve the out-of-bounds issue, while in numpy it does!)
  2. to_timestamp, but that does not do any string-to-date conversion
  3. to_period, also not a string-to-date conversion but "Convert TimeSeries from DatetimeIndex to PeriodIndex"; however, I cannot create a DatetimeIndex with my values in the first place because I lose all out-of-bounds values

See also this example:

import datetime

import numpy as np
from pandas import DataFrame

x_df = DataFrame([[20120101, 20121231], [20130101, 20131231], [20140101, 20141231], [20150101, 99991231]])
x_df.columns = ['date_from', 'date_to']
date_def = '%Y%m%d'
x_df['date_to_2'] = [datetime.datetime.strptime(str(date_val), date_def) for date_val in x_df['date_to']]
x_df['date_to_3'] = [np.datetime64(date_val, 's') for date_val in x_df['date_to_2']]  # the unit is the second positional argument
# it's all objects, even though date_to_3 is a full np.datetime64
x_df.dtypes
list(x_df['date_to_3'])

I cannot make date_to_3 into a period (as far as I know), and while it is a perfectly nice np.datetime64 (with unit 's' instead of 'ns'), Pandas refuses to see it as such.
It's really a Pandas limitation afaik; this really is an issue, and Periods do not solve it (in any way that I can see at the moment).
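
To make the limitation concrete, a sketch of the behaviour being described (pandas before 2.0; numpy accepts the value, pandas coerces to ns and raises):

import numpy as np
import pandas as pd

arr = np.array(['9999-12-31'], dtype='datetime64[s]')  # numpy happily stores this at second resolution
pd.Series(arr)  # pandas < 2.0 casts to datetime64[ns] and raises OutOfBoundsDatetime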

@jreback
Contributor

jreback commented Aug 4, 2014

Read them in as ints (you don't really need to do anything special to do this); just don't specify parse_dates.

In [27]: x_df.to_csv('test.csv',mode='w')

In [28]: !cat test.csv
,date_from,date_to
0,20120101,20121231
1,20130101,20131231
2,20140101,20141231
3,20150101,99991231

In [29]: df = read_csv('test.csv',index_col=0)

In [30]: df
Out[30]: 
   date_from   date_to
0   20120101  20121231
1   20130101  20131231
2   20140101  20141231
3   20150101  99991231

In [31]: df.dtypes
Out[31]: 
date_from    int64
date_to      int64
dtype: object

Define a converter that creates a period from an int

In [32]: def conv(x):
   ....:     return Period(year=x // 10000, month=(x // 100) % 100, day=x % 100, freq='D')
   ....: 

In [33]: converted = df.applymap(conv)

In [34]: converted
Out[34]: 
    date_from     date_to
0  2012-01-01  2012-12-31
1  2013-01-01  2013-12-31
2  2014-01-01  2014-12-31
3  2015-01-01  9999-12-31

In [35]: converted.iloc[3,0]
Out[35]: Period('2015-01-01', 'D')

In [36]: converted.iloc[3,1]
Out[36]: Period('9999-12-31', 'D')

Of course this could be vectorized / made more natural; e.g. a to_periods function is prob a good idea.
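
A vectorized version might look like this (a sketch only; the field-keyword PeriodIndex constructor was the contemporary API, and modern pandas exposes the same construction as PeriodIndex.from_fields):

import pandas as pd

def conv_vec(s):
    # integer arithmetic on the whole int64 column at once
    return pd.PeriodIndex(year=s // 10000, month=(s // 100) % 100, day=s % 100, freq='D')

df['date_to'] = conv_vec(df['date_to'])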

Give it a go!

@CarstVaartjes
Author

Thanks! It works perfectly like this, I will test HDF5 saving & queries this week too.

I would really add this snippet to http://pandas.pydata.org/pandas-docs/stable/timeseries.html (and yes, to_periods would be a really great idea :)

@jreback
Contributor

jreback commented Aug 4, 2014

ok, why don't you open a feature request for to_periods (with as complete doc-string as you can)

@CarstVaartjes
Author

I've put it on my to-do list for later this week (and will unscrupulously copy from to_datetime, with the additional Period frequency formats)!

@shoyer
Member

shoyer commented Sep 9, 2014

@jreback I know PeriodIndex is the suggested workaround for dates that don't fit in ns precision, but there is a difference in meaning between periods and datetimes as well -- periods refer to spans of time rather than points in time.

For what it's worth, I've played a bit with np.datetime64 and I don't think the nanosecond decision was driven by numpy. Numpy seems to prefer us/microsecond precision (there are a few more bugs with ns precision).

I do recognize that this may be too late to change now but I do think this is something worth considering. I suspect there are a lot more users who would be happy with a fixed choice of "us" rather than "ns" precision. Of course, it would also be a lot of work to actually do the implementation (and it's not the top priority for me).

@jreback
Contributor

jreback commented Sep 9, 2014

I think this would have to be a hybrid, e.g. carry the unit around on the Timestamp/DatetimeIndex. Not a big deal to add, BUT it would need some validation.

I think it IS possible, but prob a bit of work.

@shoyer
Member

shoyer commented May 18, 2016

Reopening this -- this is still a recurrent issue, despite the workaround of using PeriodIndex.

@shoyer shoyer reopened this May 18, 2016
@jorisvandenbossche jorisvandenbossche added this to the Someday milestone May 18, 2016
@rabernat

👍 From me. The date range limitation is a HUGE issue for many scientific communities. I find it very hard to believe that there are more pandas users who want nanoseconds than who want to use dates before 1678.

@jreback
Contributor

jreback commented May 18, 2016

As I have stated many times, this would require some major effort. It IS certainly possible, and the extension dtypes are able to support this. But it would really need to be spearheaded by someone for whom this would be very useful.

@jreback jreback added Difficulty Advanced Datetime Datetime data dtype labels May 18, 2016
@CarstVaartjes
Author

Hi @jreback! I definitely understand, as https://github.com/pydata/pandas/search?q=ns gives 150 matches!
But which solution is chosen matters a lot, I think. Replacing nanoseconds with microseconds would not be that difficult (thorough testing needed, of course), but it might break things (do we know if people really use nanosecond precision atm?).
Making it dynamic would be more difficult and still raises a series of questions (just off the top of my head):

  • how should you be able to specify it? (especially with functions like read_csv)
  • should it convert numpy datetime arrays when you create a DataFrame or use the current precision instead?
  • can one DataFrame contain datetime arrays of different precision?

Can you give some guidance as to what should be the best route for this and what kind of requirements you would have?

P.s. because of HDF5 and Bcolz we couldn't switch to time periods, so we still have this issue and have lots of catch procedures in place to work around it; solving this would be great for me personally.

@shoyer
Member

shoyer commented Jun 7, 2016

Thanks for sharing your perspective, @wesm. I'm pretty much in complete agreement with you.

@jreback
Contributor

jreback commented Jun 7, 2016

As a followup to @wesm's comments: Periods are a nice alternative for representing out-of-bounds timestamp ranges, see docs here, so you can pretty easily use these.

Recently (thanks to @sinhrks, @MaximilianR) these are becoming more and more of a first-class type.

Certainly, this introduces more user mental effort than using a single less-detailed Timestamp type (and it ignores that Timestamps are points in time while Periods are time spans), but it is a practical solution (and has been since 0.8).

and in fact these can represent really long ranges

In [15]: pd.Period(datetime.datetime(1,1,1),freq='us')
Out[15]: Period('0001-01-01 00:00:00.000000', 'U')

In [17]: pd.Period(datetime.datetime(3000,1,1),freq='us')
Out[17]: Period('3000-01-01 00:00:00.000000', 'U')

@CarstVaartjes
Author

Hi,

I understand that inside Pandas the Period works really well. But in terms of IO to/from Pandas I'm less sure; typical use cases for me are:

  • reading csv files and other sources that contain out-of-bounds dates (where we use workarounds, altering the data to 31-12-1999 etc., but at the cost of legibility and performance)
  • reading and writing to HDF5
  • reading and writing to bcolz/bquery tables

I think all of these default to numpy datetimes (numpy itself supports other units, but it is overruled by the standard ns casting); so there are lots of uncertainties there for me.
Mind you, I very much understand the headache involved here, and I love Pandas and you guys for all the effort you are putting into it. Just wish I had a time machine so I could go back and bribe Wes with beers over two things (us instead of ns datetimes + grouping not dropping nan values) :)

@ifmihai

ifmihai commented Jun 20, 2016

Following what @wesm said:

"All the more reason we may benefit from a pandas 1.0 breaking release (similar to the Python 3 break), with a pandas 0.X.Y LTS bugfix branch, at some point in the future."

This has been on my mind since the beginning of this issue.
I also see it as the most common-sense option.

Too bad that only a few people voted on the pydata google group.
10 users is not enough.

@jzwinck
Contributor

jzwinck commented Aug 3, 2016

Probably the people most likely to find this page (or the poll) and speak up are those who are dissatisfied with nanoseconds. But there are those of us who are happy with nanoseconds. I am.

Here are some uses of precision finer than 1 microsecond:

  • Microsoft FILETIME. 100 ns. One of the most common timestamp formats in the world.
  • Linux clock_gettime(), timer_settime(). 1 ns.
  • Endace ERF (similar to pcap). 233 ps (yes, picoseconds). Rounding to whole nanoseconds is conventional.

In the unlikely event that Pandas switched from always-nanoseconds to always-microseconds, we would have to stop using its time features and store nanoseconds as raw int64. Even if we stipulate that nanosecond precision has no practical use to humans, we need to be able to convey timestamps with full accuracy between systems (e.g. to do database queries).

I do sympathize with those whose timestamp needs exceed the life expectancy of people or nations, but systems are generating more sub-microsecond timestamps, and this trend will not reverse. Adding support for longer horizons is good, but we shouldn't lose our nanos.
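
As an illustration of the full-accuracy round trip mentioned above (a minimal sketch; an integer passed to Timestamp is interpreted as nanoseconds since the epoch):

import pandas as pd

ts = pd.Timestamp(1591234567123456789)  # int64 nanoseconds since the epoch
ts        # Timestamp('2020-06-04 01:36:07.123456789')
ts.value  # 1591234567123456789 -- the full ns precision survives the round trip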

@jbrockmendel jbrockmendel added the Non-Nano datetime64/timedelta64 with non-nanosecond resolution label Jan 10, 2022
@mroeschke mroeschke removed this from the Someday milestone Oct 13, 2022
@jbrockmendel
Member

In 2.0 we support "ns", "us", "ms", and "s". Closing as complete.

@Gabriel-Kissin

> In 2.0 we support "ns", "us", "ms", and "s". Closing as complete.

In case anyone else gets to the bottom of this thread:
Link to the timeseries user guide, and the what's new in pandas 2.0, which describe this.
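
For anyone landing here, a minimal example of the 2.0 behaviour (the unit is preserved when constructing from a non-nanosecond numpy array; see the linked docs for details):

import numpy as np
import pandas as pd  # pandas >= 2.0

s = pd.Series(np.array(["2015-01-01", "9999-12-31"], dtype="datetime64[s]"))
s.dtype  # datetime64[s] -- second resolution preserved, no OutOfBoundsDatetime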
