BUG: prevent datetime64 span chokes #11873
Conversation
Close and reopen.
LGTM. Minor nit - there is a TODO on line 244, can it be removed? Related issues:
I also added a
Force-pushed from 1580c2e to a540172
If the other issues you mention are fixed by this that's a nice bonus, but closing those should probably be contingent on at least adding a unit test that confirms blocking the specified behaviors, and I'm not targeting those issues here. I did remove Casting with
Do we want to expand the scope of this PR to include overflow "guards" for all the time spans?
```c
if (abs(get_datetimestruct_seconds(dts)) >
        106 * 24 * 60 * 60) {
    /* reject with seconds resolution */
    PyErr_SetString(PyExc_ValueError,
```
Not clear to me whether `ValueError` or `OverflowError` is the better choice. Would be good to put some justification for the choice in the commit message, whichever way you go.
I was hoping I could replace `106 * 24 * 60 * 60` with a sufficiently high precision check using a formula with `NPY_INT64_MAX`, but if I understand correctly it could be too late by the time the calculation propagates down to seconds.
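One way such a seconds-level cut-off could be derived rather than hard-coded is to divide the int64 range by the number of sub-units per second before any multiplication happens, so the comparison itself cannot overflow. A rough Python sketch (the `PER_SECOND` table and names are illustrative assumptions, not taken from the NumPy C source):

```python
import numpy as np

INT64_MAX = int(np.iinfo(np.int64).max)  # 9223372036854775807

# Sub-units per second for the sub-nanosecond datetime64 units
# (hypothetical table for illustration, not lifted from the C source).
PER_SECOND = {"ns": 10**9, "ps": 10**12, "fs": 10**15, "as": 10**18}

# Dividing first keeps every value within int64 range, so the
# guard comparison itself cannot overflow.
limits_in_seconds = {unit: INT64_MAX // per for unit, per in PER_SECOND.items()}
```

For the picosecond case this works out to 9223372 seconds, i.e. about 106 days, which lines up with the `106 * 24 * 60 * 60` constant in the diff hunk above.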
```c
                    "too large for fs units");
    return -1;
};

ret = (((((days * 24 +
```
I don't know if this is a hot path. If it's not, can we use something like the following, which doesn't require us to hard-code these imprecise limits:
```c
ret = days;
err |= overflow_mult_then_add(&ret, 24, dts->hour);
err |= overflow_mult_then_add(&ret, 60, dts->min);
err |= overflow_mult_then_add(&ret, 60, dts->sec);
err |= overflow_mult_then_add(&ret, 1000000, dts->us);
err |= overflow_mult_then_add(&ret, 1000000, dts->ps);
err |= overflow_mult_then_add(&ret, 1000, dts->as / 1000);
```
Where
```c
bool overflow_mult_then_add(npy_int64 *val, npy_uint64 mult, npy_int64 add) {
    if (add > 0 && *val > (NPY_INT64_MAX - add) / mult) {
        return true;
    }
    else if (add < 0 && *val < (NPY_INT64_MIN - add) / mult) {
        return true;
    }
    else {
        *val = *val * mult + add;
        return false;
    }
}
```
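Because Python integers are arbitrary precision, a small model of this kind of guard can be validated against exact arithmetic. The helper below (`mult_then_add_checked` is a hypothetical Python model for illustration, not a port of the C helper's division-based predicate) flags any intermediate result that would leave the int64 range:

```python
INT64_MAX = 2**63 - 1
INT64_MIN = -2**63

def mult_then_add_checked(val, mult, add):
    """Return (overflowed, result); result is unchanged on overflow."""
    exact = val * mult + add  # arbitrary-precision, so this never wraps
    if exact > INT64_MAX or exact < INT64_MIN:
        return True, val
    return False, exact

# Chaining calls mirrors the days -> hours -> minutes -> ... accumulation
# above: the first step that would overflow is reported.
overflowed, ret = mult_then_add_checked(INT64_MAX // 24, 24, 23)
```

Here `INT64_MAX // 24` scaled back up by 24 plus 23 hours lands just past `INT64_MAX`, so the call reports an overflow and leaves the value untouched.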
I assume that by "imprecise" you mean that the time spans from the epoch that are tolerated for given return time units could be specified at a higher precision? And avoiding "hard coding" by incrementally checking for overflow as the return value is propagated through the conversions above, instead of, e.g., "+/- n seconds from the epoch are tolerated for these time units"?
Is this something we want in scope for this PR? If I do that, then wouldn't it be prudent to do it for all the return blocks & also add unit tests that probe the increased / improved precision?
We'd also likely want to update the docs to reflect the improved precision / limits on time since epoch tolerated if we do that, and for docs we'd still want some kind of hard-coded numbers for users to go by, right? I suppose we could switch to "approximate" wording in the docs, and just use the old values as rough guides & let users push those overflow limits as they see fit.
I'm certainly happy to do the work (we'd need benchmarks, too?), so just making sure we've thought about this.
What I mean is that this change will break existing code that uses values between the extremes you enforce here and the true extremes representable in the type.
Regarding the limits - we can document the limit as `np.timedelta64(np.iinfo(np.int64).max, 'ps') * 10**-12`, etc.
What about moving the overflow checks into each of the functions named `get_datetimestruct_minutes()` and so on? Then each switch case would just call a function coded for its units, with each preceding function in the conversion chain also protected from overflow.
Force-pushed from a540172 to fb90762
numpy/core/tests/test_datetime.py
```python
@pytest.mark.xfail(reason="work in progress")
@pytest.mark.parametrize("time_unit, time_val", [
    ('ns', np.datetime64(int(np.iinfo(np.int64).max * (10 ** -9)), 's') + np.timedelta64(0, 'ns')),
    ('ps', np.datetime64(int(np.iinfo(np.int64).max * (10 ** -12)), 's') + np.timedelta64(0, 'ps')),
```
This seems to detect the precision reduction Eric is worried about in this PR (this test now correctly fails), but I'm slightly concerned about the `int()` call here -- it truncates the float I get from the "division" to produce the number of seconds from the epoch. Obviously, I can't give a float to initialize the `datetime64`. Is there a better approach? The Python `decimal` module or something to flush through the strictest precision test possible, or is this going to be close enough?
Use `max // (10**12)` instead, perhaps?
Or perhaps `(max + 10**12 - 1) // (10**12)`, which rounds up, not down.
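For reference, the float round-trip is already lossy at this magnitude, while both integer divisions are exact. A quick check (plain Python ints; `via_float`, `floor_div`, and `ceil_div` are illustrative names):

```python
import numpy as np

max64 = int(np.iinfo(np.int64).max)  # 9223372036854775807 == 2**63 - 1

# float64 has a 53-bit mantissa, so max64 itself is not representable:
# it rounds up to 2**63, one past the true maximum.
via_float = int(float(max64))

floor_div = max64 // 10**12                  # exact, rounds down
ceil_div = (max64 + 10**12 - 1) // 10**12    # exact, rounds up
```

Since `max64` is not a multiple of `10**12`, the floor and ceiling results differ by exactly one second.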
My best attempts so far use `divmod` to recover the remainders and add them back in using appropriate units.
However, I've grown the unit test quite a bit now, and expanding to include symmetric time spans seems to indicate silent wrap-around for a subset of the time units:
```python
input_seconds, remainder_ps = divmod(np.iinfo(np.int64).max, 10**12)
input_str_minus = str(np.datetime64(0, 's') - np.timedelta64(input_seconds, 's') - np.timedelta64(remainder_ps, 'ps'))
input_str_plus = str(np.datetime64(0, 's') + np.timedelta64(input_seconds, 's') + np.timedelta64(remainder_ps, 'ps'))
print("minus:", np.datetime64(input_str_minus))
print("plus:", np.datetime64(input_str_plus))
```
Both end up after the epoch, even on the master branch:
```
minus: 1970-04-16T05:57:07.963145224193  # looks wrong
plus:  1970-04-17T18:02:52.036854775807  # looks about right
```
Is the asymmetry surprising? Our docs would suggest there should be no asymmetry. I originally noticed that pushing just slightly beyond the time spans in the positive direction tends to produce `NaT`, so I wanted to have a secondary check in the unit test to confirm that I was actually very close to the "precision" limit.
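One small, verifiable contributor to asymmetry is the int64 range itself: two's complement gives the negative side exactly one extra value, so perfectly symmetric spans around the epoch cannot both sit on the boundary. That alone cannot explain a months-scale wrap-around like the one above, but it is worth keeping in mind when writing symmetric limit tests:

```python
import numpy as np

info = np.iinfo(np.int64)

# Two's-complement int64: min is -2**63 but max is 2**63 - 1,
# so the negative range is longer by exactly one unit.
extra = -(info.min + info.max)
```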
* previously, datetime64 would accept input that exceeds allowable precision time spans; we now have time span guards at the sub-nanosecond unit level, with checks performed at the seconds level of resolution
Force-pushed from fb90762 to f2a4164
Any progress with fixing this?
@tylerjereddy Needs a rebase. Are you interested in pursuing this?
@charris Looks like there is some interest here. Maybe when I can get SciPy
Let me close some of my idle PRs like this one so that others feel free to take them over, etc. I may choose to focus on something different.
Fixes #5452.
I've added time span (relative to the epoch) checks / guards for sub-ns datetime64 units to fix the above issue. We may want to use higher-resolution rejection checks (currently seconds units are used to reject time spans) & expand the guards to include all units of datetime64.
For now / initially, I've focused on the sub-ns range discussed in that linked issue. It may also be more robust to use the C89 / portable `limits.h` header file for max int values instead of the hard-coded time-span limits I've taken from the docs.
I've also had to make minor adjustments to unit tests -- we had silent wrapping (garbage) for one test that happened to pass but didn't make sense, and for another test we now intercept unreasonable datetime spans before an overflow can occur, so we sometimes expect a different exception type.