Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: add datetime64/timedelta64 support to linspace #17437

Closed
wants to merge 14 commits into from
Closed

ENH: add datetime64/timedelta64 support to linspace #17437

wants to merge 14 commits into from

Conversation

l-johnston
Copy link
Contributor

@l-johnston l-johnston commented Oct 3, 2020

Closes #10514
Closes #14700

I picked up PR #14700 and completed the work with improvements:

@seberg seberg self-requested a review October 4, 2020 16:29
@l-johnston
Copy link
Contributor Author

@seberg arange supports standard library datetime.

>>> from datetime import datetime, timedelta
>>> import numpy as np
>>> t1 = datetime(2020,1,1)
>>> t2 = datetime(2020,1,9)
>>> np.arange(t1, t2, timedelta(days=1))
array(['2020-01-01T00:00:00.000000', '2020-01-02T00:00:00.000000',
       '2020-01-03T00:00:00.000000', '2020-01-04T00:00:00.000000',
       '2020-01-05T00:00:00.000000', '2020-01-06T00:00:00.000000',
       '2020-01-07T00:00:00.000000', '2020-01-08T00:00:00.000000'],
      dtype='datetime64[us]')

Should I add this support to linspace?

@charris charris added 56 - Needs Release Note. Needs an entry in doc/release/upcoming_changes component: numpy._core labels Oct 11, 2020
@l-johnston
Copy link
Contributor Author

@charris I added the release note. Is the note sufficient?

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, I would feel slightly better if users of datetime64 could chime in (@jorisvandenbossche just in case you are interested and have time).

Do we worry about integer overflows which could create new NaTs or incorrect results? The integer code paths fall back to float64, so they will lose precision, but not give completely off results.
We do have the small thing that the timedelta64 step will lose precision compared to the internal float64. That is probably OK, I wonder if we should add it to Notes or so in the documentation?

@debsankha are you interested in having a look at this updated version of your old PR?

numpy/core/function_base.py Outdated Show resolved Hide resolved
numpy/core/tests/test_function_base.py Outdated Show resolved Hide resolved
numpy/core/function_base.py Show resolved Hide resolved
numpy/core/tests/test_function_base.py Show resolved Hide resolved
@eric-wieser
Copy link
Member

I don't agree with that - linspace on floats behaves as:

In [6]: np.linspace(0, np.nan, 5)
Out[6]: array([nan, nan, nan, nan, nan])

so passing in NaT should just result in an array of NaTs

@l-johnston
Copy link
Contributor Author

@eric-wieser When start is NaT, you'll get the expected result consistent with non-datetime types.

>>> np.linspace(np.datetime64("NaT"), np.datetime64("2020-01-01"), 3)
array([       'NaT',        'NaT', '2020-01-01'], dtype='datetime64[D]')
>>> np.linspace(np.nan, 3, 3)
array([nan, nan,  3.])

However, when stop is NaT, the returned array contains nonsense values.

>>> np.linspace(np.datetime64("2020-01-01"), np.datetime64("NaT"), 5)
array([              '2020-01-01',  '-6313183731939619-05-11',
       '-12626367463881258-09-19', '-18939551195822896-01-28',
                            'NaT'], dtype='datetime64[D]')

The internal computation of the sample deltas begins by casting delta, which is a timedelta64, to np.int64 and NaT casts to -9223372036854775808. There is no integer equivalent of NaN. This is the fundamental difference between the datetime code path and the non-datetime code path.

@eric-wieser
Copy link
Member

It sounds like you have a bug to fix for when stop is nat then :)

l-johnston and others added 2 commits October 15, 2020 07:02
@l-johnston
Copy link
Contributor Author

@eric-wieser now the behavior is equivalent to non-datetime types:

>>> np.linspace(np.timedelta64(1, "s"), np.timedelta64("NaT"), 5)
array(['NaT', 'NaT', 'NaT', 'NaT', 'NaT'], dtype='timedelta64[s]')

@l-johnston
Copy link
Contributor Author

@seberg is there anything else I need to do on this PR?

@seberg
Copy link
Member

seberg commented Nov 13, 2020

@l-johnston hmm, I guess I dropped the ball on this due to the retstep problem. We need to come to a decision for that. Whether that is an error, or just documenting that retstep has bigger precision issues with timedelta/datetime. Or maybe there is another solution?

@l-johnston
Copy link
Contributor Author

@seberg In the case of integer output, the returned retstep is a float and has the same precision issue as date-like. For example:

>>> arr, step = np.linspace(0, 1000, 33, dtype="int64", retstep=True)
>>> step
31.25
>>> np.unique(np.diff(arr))
array([31, 32])

I propose documenting the precision issue.

Base automatically changed from master to main March 4, 2021 02:05
@l-johnston
Copy link
Contributor Author

@melissawm @seberg
I believe I have addressed the issues:
Closes #10514
Closes #14700

I picked up PR #14700 and completed the work with improvements:

  • Separated the datetime64/timedelta64 code path from the original improving readability
  • Eliminated the need to specify dtype in the call to linspace
  • Solved the step size issue discussed in ENH: linspace works with datetime64 and timedelta64. ENH: linspace works with datetime64 and timedelta64. #14700
  • Implemented the tests @seberg requested
  • Implemented the NaT behavior @eric-wieser requested
  • Showed that the retstep precision behavior is no different than for integer types
  • Rebased as requested and confirmed the feature still works

Expanding on the retstep issue. Pandas offers the convenience function pandas.timedelta_range() that gives equivalent precision behavior:

>>> import pandas as pd
>>> tdi = pd.timedelta_range("0 s", "1000 s", 33).astype("timedelta64[s]")
>>> np.unique(np.diff(tdi.values))
array([31., 32.])

The difference is pandas function internally coerces start and stop to timedelta64[ns], but the issue is there as well.

@seberg
Copy link
Member

seberg commented Feb 1, 2022

Ping @jbrockmendel, not sure this is up your alley, but if it is any input/opinion is appreciated.

About retstep, I agree that integers also return a float, and share the issue. But at least a float is still strictly numerical, a timedelta is not numerical anymore, so while we can maybe argue this is just fine, I do think there is a bigger surprise here.

@l-johnston
Copy link
Contributor Author

@seberg Is this a matter of implementation or is date-like incompatible with np.linspace() such that it should never be included? It seems to me that np.arange() is used when exact step size matters and that 'np.linspace()` is used when an exact number of array elements matters.

@jbrockmendel
Copy link
Contributor

I haven't worked much with linspace, but we have code similar to this in pandas analogous to arange in pandas.core.arrays._ranges. Most of the code there is handling various overflow cases.

@akorman128
Copy link

Hello, has there been any more progress on this issue?

@l-johnston
Copy link
Contributor Author

l-johnston commented May 4, 2022 via email

@l-johnston
Copy link
Contributor Author

@seberg

About retstep, I agree that integers also return a float, and share the issue. But at least a float is still strictly numerical, a timedelta is not numerical anymore, so while we can maybe argue this is just fine, I do think there is a bigger surprise here.

What is meant by 'bigger surprise'?

@seberg
Copy link
Member

seberg commented May 18, 2022

Well, integer to float does not drop the "unit" information, timedelta to float does, which seems a bit weird to me? If retstep was really the only issue left and we can't resolve it easy (e.g. with others voicing a clear opinion), then can't we defer it and really just refuse to support retstep?

@l-johnston
Copy link
Contributor Author

In the current implementation, the returned retstep is a timedelta64 with unit set by the arguments to the function.

In [1]: import numpy as np

In [3]: np.linspace(np.timedelta64(1, "s"), np.timedelta64(3, "s"), 10, dtype="timedelta64[ms]", retstep=True)
Out[3]:
(array([1000, 1222, 1444, 1666, 1888, 2111, 2333, 2555, 2777, 3000],
       dtype='timedelta64[ms]'),
 numpy.timedelta64(222,'ms'))

In Pandas, they convert the inputs to nanoseconds.

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: pd.timedelta_range("1000 ms", "3000 ms", periods=10)
Out[3]:
TimedeltaIndex([          '0 days 00:00:01', '0 days 00:00:01.222222222',
                '0 days 00:00:01.444444444', '0 days 00:00:01.666666666',
                '0 days 00:00:01.888888888', '0 days 00:00:02.111111111',
                '0 days 00:00:02.333333333', '0 days 00:00:02.555555555',
                '0 days 00:00:02.777777777',           '0 days 00:00:03'],
               dtype='timedelta64[ns]', freq=None)
In [4]: np.diff(Out[3])
Out[6]:
array([222222222, 222222222, 222222222, 222222222, 222222223, 222222222,
       222222222, 222222222, 222222223], dtype='timedelta64[ns]')

@l-johnston
Copy link
Contributor Author

@seberg

then can't we defer it and really just refuse to support retstep?

By defer, would it be acceptable to return retstep=None for date-like inputs?

@seberg
Copy link
Member

seberg commented May 19, 2022

I though raising NotImplemented if retstep=True, is returning None more useful?

@l-johnston
Copy link
Contributor Author

@seberg I agree NotImplementedError would be better:

In [1]: import numpy as np
In [2]: np.linspace(np.timedelta64(1, "s"), np.timedelta64(3, "s"), 10, dtype="timedelta64[ms]", retstep=True)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Input In [2], in <cell line: 1>()
----> 1 np.linspace(np.timedelta64(1, "s"), np.timedelta64(3, "s"), 10, dtype="timedelta64[ms]", retstep=True)
File <__array_function__ internals>:180, in linspace(*args, **kwargs)
File .../numpy/numpy/core/function_base.py:130, in linspace(start, stop, num, endpoint, retstep, dtype, axis)
    128 if start.dtype.kind in "mM":
    129     if retstep:
--> 130         raise NotImplementedError("'step` output not supported for date-like inputs")
    131     from numpy.ma import MaskedArray, filled
    132     if dtype is None:
NotImplementedError: 'step` output not supported for date-like inputs

Should we solicit mail list feedback?

@InessaPawson
Copy link
Member

@seberg I agree NotImplementedError would be better:
Should we solicit mail list feedback?

@l-johnston Thank you for working on this PR! If you'd like to get feedback from the NumPy mailing list, please feel free to go ahead.

@l-johnston l-johnston closed this by deleting the head repository Dec 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Needs decision
Development

Successfully merging this pull request may close these issues.

ENH: support np.datetime64 in linspace
7 participants