-
-
Notifications
You must be signed in to change notification settings - Fork 9.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: linspace works with datetime64 and timedelta64. #14700
base: main
Are you sure you want to change the base?
Conversation
5268629
to
1830257
Compare
numpy/core/function_base.py
Outdated
if num > 1: | ||
step = delta / div | ||
if _nx.any(step == 0): | ||
if (not is_datelike) and _nx.any(step == 0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be better as if np.issubdtype(step.dtype, np.floating)
, since that's the only case where denormal numbers apply
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that, and it fails test_equivalent_to_arange
:
def test_equivalent_to_arange(self):
for j in range(1000):
assert_equal(linspace(0, j, j+1, dtype=int),
> arange(j+1, dtype=int))
E AssertionError:
E Arrays are not equal
E
E Mismatched elements: 1 / 23 (4.35%)
E Max absolute difference: 1
E Max relative difference: 0.06666667
E x: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 14, 16,
E 17, 18, 19, 20, 21, 22])
E y: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
E 17, 18, 19, 20, 21, 22])
In the last line of linspace
, the y[15] happens to be 14.999999999999998
, which becomes 14 when cast to int
, hence failing the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, interesting. Although I wonder if it is generally true that doing the division first is more precise. But it is slower.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line of code is no longer present in the PR any more, as I refactored it to use ndarray.view
numpy/core/function_base.py
Outdated
# excepting datetime64 and timedelta64 | ||
_mult_inplace = _nx.isscalar(delta) and (not is_datelike) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we cannot do y *= delta
when y
is int
and delta
is timedelta64
.
numpy/core/function_base.py
Outdated
# Convert float/complex array scalars to float, gh-3504 | ||
# and make sure one can use variables that have an __array_interface__, gh-6634 | ||
start = asanyarray(start) * 1.0 | ||
stop = asanyarray(stop) * 1.0 | ||
else: | ||
# Convert float/complex array scalars to float, gh-3504 | ||
start = start * 1.0 | ||
stop = stop * 1.0 | ||
dt = result_type(start, stop, float(num)) | ||
delta = stop - start | ||
y = _nx.arange(0, num, dtype=dt).reshape((-1,) + (1,) * ndim(delta)) | ||
|
||
dt = result_type(start, stop, float(num)) | ||
if dtype is None: | ||
dtype = dt | ||
if dtype is None: | ||
dtype = dt | ||
|
||
delta = stop - start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If would be nice if we could do this without any special casing. Does the following work?
start = asanyarray(start)
stop = asanyarray(stop)
dt = result_type(start, stop, float(num))
if dtype is None:
dtype = dt
delta = stop.astype(dt) - start.astype(dt)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately not. result_type(start, stop, float(num))
raises TypeError
if start/stop are datetime64
since type promotion between datetime64
and float
are not allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about
start = asanyarray(start)
stop = asanyarray(stop)
delta = stop - start
dt = result_type(delta, float(num))
delta = delta.astype(dt)
if dtype is None:
dtype = dt
Edit: this does not handle overflow correctly for signed integers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would have worked if type promotion between timedelta64
and float
resulted in float
, but it doesn't:
>>> np.result_type(np.timedelta64(1, 'D'), float(1))
dtype('<m8[D]')
Is this type promotion the intended behaviour? I do not know numpy's internals well enough to judge this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@seberg : Could you please comment on this if you have time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about ndarray.view
? Since timedelta64
is just int64
with units attached (essentially), we can get the desired float
by, e.g., np.asarray(np.timedelta64(1, 'D')).view(np.int64)
. And then dividing by the appropriate amount if we want to normalise to e.g. seconds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the result_type
not returning float is definitely right. However, one could argue that it should just fail in the generic case (and only work as promotion in certain ufuncs). Hameers suggestion to use the knowledge that it is int64 may work. Although have to be careful with NaT
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hameerabbasi @seberg : Thank you for your comments. I have tried the ndarray.view
approach and it works. Adopting this led to reducing the number of if/else
blocks and improved readability.
numpy/core/function_base.py
Outdated
# that it is at all possible to draw num equispaced | ||
# samples. | ||
assert dtype is not None, "linspace with {} objects cannot be called "\ | ||
"with dtype=None".format(start.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Asserts should not be used for checks that users can run into. This should be a ValueError probably. I could imagine writing it without negation may be a tiny bit nicer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this comment. I have changed this assert into a ValueError
.
numpy/core/function_base.py
Outdated
start = start.astype(dtype) | ||
stop = stop.astype(dtype) | ||
# y cannot be cast to datetime64/timedelta64, but left | ||
# as int64, since later on we will do y * delta, and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(just for you, technically, on 32bit platforms, arange probably returns int32. It does not matter though.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debsankha, do you know right away why making y
a datetime/timedelta here fails later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@seberg : Sorry for the late reply. Making y
datetime/timedelta doesn't work because of line 170/172, where we do y *= delta
or y = y * delta
. delta
is stop-start
, so timedelta anyway, and if y
is also datetime/timedelta, then the multiplication is not allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have adapted the PR following @hameerabbasi 's ndarray.view
idea. Now these lines are no longer present.
numpy/core/function_base.py
Outdated
# Convert float/complex array scalars to float, gh-3504 | ||
# and make sure one can use variables that have an __array_interface__, gh-6634 | ||
start = asanyarray(start) * 1.0 | ||
stop = asanyarray(stop) * 1.0 | ||
else: | ||
# Convert float/complex array scalars to float, gh-3504 | ||
start = start * 1.0 | ||
stop = stop * 1.0 | ||
dt = result_type(start, stop, float(num)) | ||
delta = stop - start | ||
y = _nx.arange(0, num, dtype=dt).reshape((-1,) + (1,) * ndim(delta)) | ||
|
||
dt = result_type(start, stop, float(num)) | ||
if dtype is None: | ||
dtype = dt | ||
if dtype is None: | ||
dtype = dt | ||
|
||
delta = stop - start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the result_type
not returning float is definitely right. However, one could argue that it should just fail in the generic case (and only work as promotion in certain ufuncs). Hameers suggestion to use the knowledge that it is int64 may work. Although have to be careful with NaT
.
numpy/core/function_base.py
Outdated
if num > 1: | ||
step = delta / div | ||
if _nx.any(step == 0): | ||
if (not is_datelike) and _nx.any(step == 0): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, interesting. Although I wonder if it is generally true that doing the division first is more precise. But it is slower.
@debsankha do you need help moving forward with this? |
Sorry for responding late to the code reviews. But I have adapted my PR taking into account the comments now and replied in the commend threads. |
numpy/core/function_base.py
Outdated
"dtype=None".format(start.dtype)) | ||
if np.isnat(start): | ||
raise ValueError("start is NaT, linspace won't work.") | ||
if np.isnat(stop): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately the spelling is _nx
in this file. Will review more later, but this causes tests to fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comment. I have fixed it and the testsuite now passes.
numpy/core/function_base.py
Outdated
# samples. | ||
if dtype is None: | ||
raise ValueError("linspace with {} objects cannot be called with" | ||
"dtype=None".format(start.dtype)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, more coming later maybe, so no need to jump on anything. A bit of nitpicking (I can fix it up also if you prefer):
The upper string needs a space at the end. Maybe linspace()
(python does it, but I do not think we currenty do this really, so no matter). I would slightly prefer to write something like requires the `dtype` argument to be set to the desired output dtype.
The ValueErrors below could also be phrased a bit nicer, but I I think they probably should just be removed or did I forget why the check was there? linspace
with NaN returns an array full of NaN.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review :)
The upper string needs a space at the end. Maybe
linspace()
(python does it, but I do not think we currenty do this really, so no matter). I would slightly prefer to write something likerequires the `dtype` argument to be set to the desired output dtype.
Done.
The ValueErrors below could also be phrased a bit nicer,
Done
but I I think they probably should just be removed or did I forget why the check was there?
linspace
with NaN returns an array full of NaN.
If we remove those checks, we get:
In [14]: np.linspace(start=np.array('2019-11-12', 'M8[s]'),
stop=np.array('NaT', 'M8[s]')
num=3,
dtype='M8[s]')
Out[14]:
array([ '2019-11-12T00:00:00', '146138514308-05-25T07:32:16',
'NaT'], dtype='datetime64[s]')
The reason is, stop(=NaT)
gets converted to -9223372036854775808
via .view(np.int64)
.
Another fact:
In [24]: np.linspace(start=np.nan, stop=100, num=3)
Out[24]: array([ nan, nan, 100.])
What do you think should be done? 'NaT' check for start
and stop
or no checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, maybe we should jump through the hoops and set everything to NaT? I am not sure, an error could work for now I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am OK with both. If we are returning an array of NaT
, I think for sake of consistency we should return an array of nan
for the non-datetime equivalent case as well.
Therefore I would (slightly) prefer to just raise an error for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I need to take a step back here (and probably mainly remember all the other things that were tricky). Why can't step
have the correct timedelta (not datetime of course) dtype? timedelta + timedelta works, timedelta * int works and datetime + timedelta works.
Sorry, this is probably just me forgetting an old discussion :/.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, y
would need special casing to ensure it is integer (in other words, it also needs to be unitless in any case). But it seems to me delta
does not itself need to be unit-less. And if it was not, y = y * delta
would have the unit, and the NaT
should drop out correctly (or at least identically to NaN
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. delta
does not need to be unitless and the NaT
's are automatically taken care of analogously to the NaN
s.
I got this error while trying this out:
In [15]: np.linspace(np.array('NaT', 'M8[s]'), t2, 10,dtype='M8[D]', retstep=True)
build/testenv/lib/python3.6/site-packages/numpy/core/function_base.py:158: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
if _nx.any(step == 0):
Out[15]:
(array([ 'NaT', 'NaT', 'NaT', 'NaT',
'NaT', 'NaT', 'NaT', 'NaT',
'NaT', '2019-11-12'], dtype='datetime64[D]'),
numpy.timedelta64('NaT','s'))
In [19]: np.array('NaT', 'M8[s]') == 0
runtests.py:1: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
#!/usr/bin/env python
Out[19]: False
Any ideas what to do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The warning is a consequence of the 0
not having the correct dtype (datetime with unit s
). It would go away if we do == np.array(0, step.dtype)
. And since the ufuncs do that internally anyway, I guess the overhead, while annoying, may not matter too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, then. I have refactored the stuff. The testsuite passes apart from one mysterious failure in test_all_modules_are_expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@seberg : I will fix the merge conflicts and rebase to master if you are happy with the PR otherwise.
45ce4ce
to
e0d483a
Compare
e0d483a
to
85be35b
Compare
393e85c
to
1d6a890
Compare
There are merge conflicts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks much cleaner to me now, thanks. @eric-wieser any opinions? I guess we may have to ping the mailing list again, its unfortunate that it requires special casing for datetime (due to it not being compatible with floating point).
We will need to add a few extra tests for sure. One for the NaT
behaviour(s) (maybe even multiple dimensions). And I would like one test of what happens for start
and stop
having different time units, and different time units from the dtype=...
argument.
"to be set to the desired output dtype".format(start.dtype)) | ||
# dt will be used as the dtype argument of arange after this if/else block ends, | ||
# to generate array y. Therefore it must be unitless, not timedelta or datetime. | ||
dt = _nx.integer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should use _nx.int64
since, probably defaults to something wrong on windows or 32bit linux.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, thanks for the suggestion!
|
||
if endpoint and num > 1: | ||
y[-1] = stop | ||
|
||
if axis != 0: | ||
y = _nx.moveaxis(y, 0, axis) | ||
|
||
if is_datelike: | ||
y = y.astype(dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this line still necessary? I wonder if it makes more sense to move the astype
up to cast start
already (or at least put it into that branch). I suppose since the unit could be different, it may be wrong to cast start
first, so I guess casting later is correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I had put in the line there because before we made step
unit-aware, this was necessary. Now it's redundant, since the return statement has the cast anyway. I remove it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah man, I messed up here I think :(. delta / div
has to be a floating point number, but if delta
is timedelta, it is not. That should break precision somewhere along the line. Since it happens anyway, I suppose we could cast delta
to float directly, only that does currently not convert NaT to NaN, which is a bug...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand what you mean, but is that really a problem? I have tried this test, that passes:
def test_datetime_precision(self):
t1 = array('1969-06-06 00:00:00', dtype='M8[s]')
dt = timedelta64(1, 's')
N = 100
t2 = t1 + dt*N
assert_((linspace(t1, t2, N+1, dtype='M8[s]') == t1+dt*arange(N+1)).all())
Of course if N
is so high that N
equally spaced samples cannot be drawn between start
and stop
with dtype
precision, it won't work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh no, I was wrong. There is a problem:
In [20]: t1 = np.array('2019-11-12 08:00:00', 'M8[s]')
In [21]: t2 = np.array('2019-11-12 08:00:01', 'M8[s]')
In [23]: np.linspace(t1, t2, 5, dtype='M8[ms]')
Out[23]:
array(['2019-11-12T08:00:00.000', '2019-11-12T08:00:00.000',
'2019-11-12T08:00:00.000', '2019-11-12T08:00:00.000',
'2019-11-12T08:00:01.000'], dtype='datetime64[ms]')
But this can be taken care of by casting start
and stop
to dtype
at the beginning.
Is there anything else I am missing? I will in any case add these tests to the testsuite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even then t2 - t1
will be numpy.timedelta64(1000,'ms')
if you have num=33
: (t2 - t1) / div
gives numpy.timedelta64(29,'ms')
, but (t2 - t1).astype(np.int64) / 34
(which gives a float value) is 29.41176470588235
, multiply that by 30 and you have a catastrophic failure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see there's a problem, but I am sorry to say I cannot figure out what the expected behaviour of linspace
in this case should be. Considering:
In [15]: np.linspace(1, 10, 7, dtype=np.int64)
Out[15]: array([ 1, 2, 4, 5, 7, 8, 10])
This datetime case yields, btw:
In [8]: np.linspace(t1, t2, 33, dtype='M8[ms]')
Out[8]:
array(['2019-11-12T08:00:00.000', '2019-11-12T08:00:00.031',
'2019-11-12T08:00:00.062', '2019-11-12T08:00:00.093',
'2019-11-12T08:00:00.124', '2019-11-12T08:00:00.155',
'2019-11-12T08:00:00.186', '2019-11-12T08:00:00.217',
'2019-11-12T08:00:00.248', '2019-11-12T08:00:00.279',
'2019-11-12T08:00:00.310', '2019-11-12T08:00:00.341',
'2019-11-12T08:00:00.372', '2019-11-12T08:00:00.403',
'2019-11-12T08:00:00.434', '2019-11-12T08:00:00.465',
'2019-11-12T08:00:00.496', '2019-11-12T08:00:00.527',
'2019-11-12T08:00:00.558', '2019-11-12T08:00:00.589',
'2019-11-12T08:00:00.620', '2019-11-12T08:00:00.651',
'2019-11-12T08:00:00.682', '2019-11-12T08:00:00.713',
'2019-11-12T08:00:00.744', '2019-11-12T08:00:00.775',
'2019-11-12T08:00:00.806', '2019-11-12T08:00:00.837',
'2019-11-12T08:00:00.868', '2019-11-12T08:00:00.899',
'2019-11-12T08:00:00.930', '2019-11-12T08:00:00.961',
'2019-11-12T08:00:01.000'], dtype='datetime64[ms]')
In [12]: np.diff(np.linspace(t1, t2, 33, dtype='M8[ms]'))
Out[12]:
array([31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31,
31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 31, 39],
dtype='timedelta64[ms]')
I am sorry if I failed to get your point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the point is exactly that due to the step of 31 here, the intermediate values will be off by more than 1ULP for most values. This can always happen to some degree, but as is, for integers it behaves much better:
In [1]: np.diff(np.linspace(0, 1000, 33, dtype='int64'))
Out[1]:
array([31, 31, 31, 32, 31, 31, 31, 32, 31, 31, 31, 32, 31, 31, 31, 32, 31,
31, 31, 32, 31, 31, 31, 32, 31, 31, 31, 32, 31, 31, 31, 32])
EDIT: And with no, I mean: yes of course that is exactly what I meant :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your previous solution behaved better/correctly here :/. We are bound to bake in some annoying cases (which is not nice, since once it is in, it is difficult to fix), but weird NaT handling seems better than this to me...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now I understand, thanks. So the options are:
- NaT handling consistent with NaN, but outputs may in many cases be more off than necessary.
- We must explicitly disallow either start or stop being NaT. Output as precise as
dtype
allows.
I find both these options annoying, but maybe not being able to call linspace
with NaT
's are not that big a disadvantage?
These are all valid points. I will add the tests soon. |
@rtatnet3 @seberg I have picked up this PR and submitted #17437 completing the work and with improvements:
|
@l-johnston this PR needs a rebase @seberg can you take a look and see if this is close to merging? Thanks! |
@melissawm I would prefer if someone familiar/heavy user of datetimes could check the past discussion and make a call/suggestion on any open questions (I think they still exists). I can see the weirder edge cases, but as a non-heavy user of this stuff, I find it difficult to make up my mind on them. |
This pull request is intended to partially fix the issue #10514.
The changes introduced in
np.linspace
does the following:Calling
linspace(start_datetime, stop_datetime, num, dtype=<dt>)
withdatetime64
andtimedelta64
objects with a supplieddtype
returnswithout raising any exception.
However, if the
dtype
supplied has so low precision thatnum
equispaced samples of that dtype cannot be drawn, then the array returned are not equispaced (just like the result ofnp.linspace(1, 10, 7, dtype=np.int64)
) are not equispaced.