
BUG: prevent datetime64 span chokes #11873

Closed
wants to merge 2 commits

Conversation

tylerjereddy
Contributor

Fixes #5452.

I've added time span (relative to the epoch) checks / guards for sub-ns datetime64 units to fix the above issue. We may want to use higher-resolution rejection checks (currently seconds units are used to reject time spans) & expand the guards to include all units of datetime64.

For now / initially, I've focused on the sub-ns range discussed in that linked issue. It may also be more robust to use the C89 / portable limits.h header file for max int values instead of the hard-coded time-span limits I've taken from the docs.

I've also had to make minor adjustments to the unit tests: one test relied on silent (garbage) wrapping that happened to work for that test but didn't make sense, and another now has unreasonable datetime spans intercepted before an overflow can occur, so we sometimes expect a different exception type.
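For reference, the hard-coded spans mentioned above can be derived directly from the int64 tick count; a quick sketch (the `spans_s` mapping and its keys are illustrative, not part of the PR):

```python
import numpy as np

INT64_MAX = np.iinfo(np.int64).max  # 2**63 - 1 ticks

# Approximate spans around the epoch implied by a 64-bit tick count,
# expressed in whole seconds for each sub-ns unit:
spans_s = {
    'ns': INT64_MAX // 10**9,   # ~292 years
    'ps': INT64_MAX // 10**12,  # ~106 days
    'fs': INT64_MAX // 10**15,  # ~2.6 hours
    'as': INT64_MAX // 10**18,  # ~9.2 seconds
}
print(spans_s['ps'] // 86400)  # → 106 (days)
```

These are the same figures quoted in the datetime64 docs, which is where the `106 * 24 * 60 * 60` style constants in the diff come from.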

@charris
Member

charris commented Sep 3, 2018

Close and reopen.

@charris charris closed this Sep 3, 2018
@charris charris reopened this Sep 3, 2018
@mattip
Member

mattip commented Sep 4, 2018

LGTM. Minor nit - there is a TODO on line 244, can it be removed? Related issues:

I also added a component: numpy.datetime64 label and searched all issues for datetime64 to add the label. There may be more issues that I missed ...

@tylerjereddy
Contributor Author

The TODO should not be removed -- I've only added sub-ns time span guards here, per the specific issue targeted by this PR; all other time / date units are treated as before and may overflow / underflow if the user doesn't check input carefully.

If the other issues you mention are fixed by this that's a nice bonus, but closing those should probably be contingent on at least adding a unit test that confirms blocking the specified behaviors, and I'm not targeting those issues here.

I did remove get_datetimestruct_hours(); that should hopefully bring the diff coverage to 100%.

Casting with astype is not specifically targeted here -- great if there are some bonus issues fixed there, but again that's not the intention; I'd like to keep this a focused / manageable PR.

@tylerjereddy
Contributor Author

Do we want to expand the scope of this PR to include overflow "guards" for all the time spans?

if (abs(get_datetimestruct_seconds(dts)) >
        106 * 24 * 60 * 60) {
    /* reject with seconds resolution */
    PyErr_SetString(PyExc_ValueError,
Member


Not clear to me whether ValueError or OverflowError is the better choice. Would be good to put some justification for the choice in the commit message, whichever way you go.

Contributor Author


I was hoping I could replace 106 * 24 * 60 * 60 with a sufficiently high precision check using a formula with NPY_INT64_MAX, but if I understand correctly it could be too late by the time the calculation propagates down to seconds.

            "too large for fs units");
    return -1;
}

ret = (((((days * 24 +
Member


I don't know if this is a hot path. If it's not, can we use something like the following, which doesn't require us to hard-code these imprecise limits:

ret = days;
err |= overflow_mult_then_add(&ret, 24, dts->hour);
err |= overflow_mult_then_add(&ret, 60, dts->min);
err |= overflow_mult_then_add(&ret, 60, dts->sec);
err |= overflow_mult_then_add(&ret, 1000000, dts->us);
err |= overflow_mult_then_add(&ret, 1000000, dts->ps);
err |= overflow_mult_then_add(&ret, 1000, dts->as / 1000);

Where

bool overflow_mult_then_add(npy_int64 *val, npy_int64 mult, npy_int64 add) {
    /* Guard *val * mult + add against int64 overflow before computing it.
       Assumes mult > 0; mult is signed so the divisions below stay in
       signed arithmetic. Both bounds are checked unconditionally --
       branching only on the sign of add would let *val * mult overflow
       unchecked when add == 0 or when val and add have opposite signs. */
    if (*val > (NPY_INT64_MAX - (add > 0 ? add : 0)) / mult) {
        return true;
    }
    else if (*val < (NPY_INT64_MIN - (add < 0 ? add : 0)) / mult) {
        return true;
    }
    else {
        *val = *val * mult + add;
        return false;
    }
}
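Since Python integers are unbounded, the conversion chain above can be modeled by computing first and range-checking after, which makes a handy oracle for the C bound logic (function and parameter names here are illustrative, not from the PR):

```python
INT64_MAX = 2**63 - 1
INT64_MIN = -2**63

def mult_then_add_checked(val, mult, add):
    """Oracle version: compute val * mult + add exactly, then
    range-check against int64. Returns (overflowed, value)."""
    new = val * mult + add
    if not (INT64_MIN <= new <= INT64_MAX):
        return True, val  # leave val unchanged, as in the C sketch
    return False, new

def dts_to_fs(days, hour, minute, sec, us, ps, as_):
    """Fold a datetimestruct down to femtoseconds, flagging overflow."""
    err, ret = False, days
    for mult, add in ((24, hour), (60, minute), (60, sec),
                      (10**6, us), (10**6, ps), (1000, as_ // 1000)):
        o, ret = mult_then_add_checked(ret, mult, add)
        err = err or o
    return err, ret

# 2 hours in fs fits in int64; 3 hours does not (the docs quote ~2.6 h):
print(dts_to_fs(0, 2, 0, 0, 0, 0, 0))  # (False, 7200000000000000000)
print(dts_to_fs(0, 3, 0, 0, 0, 0, 0))  # overflow flagged: (True, ...)
```

An oracle like this is also a convenient way to cross-check any hand-derived division bounds before committing them to C.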

Contributor Author


I assume that by "imprecise" you mean that the time spans from the epoch that are tolerated for given return time units could be specified at a higher precision? And avoiding "hard coding" by incrementally checking for overflow as the return value is propagated through the conversions above, instead of i.e., "+/- n seconds from the epoch are tolerated for these time units."?

Is this something we want in scope for this PR? If I do that, then wouldn't it be prudent to do it for all the return blocks & also add unit tests that probe the increased / improved precision?

We'd also likely want to update the docs to reflect the improved precision / limits on time since the epoch if we do that, and for docs we'd still want some kind of hard-coded numbers for users to go by, right? I suppose we could switch to "approximate" wording in the docs, and just use the old values as rough guides, letting users push those overflow limits as they see fit.

Contributor Author


I'm certainly happy to do the work (we'd need benchmarks, too?), so just making sure we've thought about this.

Member


What I mean is that this change will break existing code that uses values between the extremes you enforce here and the true extremes representable in the type.

Member


Regarding the limits - we can document the limit as np.timedelta64(np.iinfo(np.int64).max, 'ps'), i.e. (2**63 - 1) * 10**-12 seconds, etc.
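For instance, the exact ps-unit limit can be written out and converted to seconds by dividing two timedelta64 values, which yields a plain float ratio (a quick sketch, not code from the PR):

```python
import numpy as np

# The exact limit for ps units, as a timedelta64:
limit = np.timedelta64(np.iinfo(np.int64).max, 'ps')

# Dividing two timedelta64 values gives a float ratio:
seconds = limit / np.timedelta64(1, 's')
print(seconds)  # ~9.22e6 seconds, i.e. roughly 106.75 days
```

Documenting the limit this way keeps it exact while still letting users derive the familiar "about 106 days" figure.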

Contributor Author


What about moving the overflow checks into each of the functions named get_datetimestruct_minutes() and so on? Then each switch case would just call a function coded for its units, with each preceding function in the conversion chain also protected from overflow.

@pytest.mark.xfail(reason="work in progress")
@pytest.mark.parametrize("time_unit, time_val", [
('ns', np.datetime64(int(np.iinfo(np.int64).max * (10 ** -9)), 's') + np.timedelta64(0, 'ns')),
('ps', np.datetime64(int(np.iinfo(np.int64).max * (10 ** -12)), 's') + np.timedelta64(0, 'ps')),
Contributor Author

@tylerjereddy tylerjereddy Mar 4, 2019


This seems to detect the precision reduction Eric is worried about in this PR (this test now correctly fails), but I'm slightly concerned about the int() call here -- it truncates the float I get from the "division" to produce the number of seconds from the epoch.

Obviously, I can't give a float to initialize the datetime64. Is there a better approach? The Python decimal module or something, to flush through the strictest precision test possible, or is this going to be close enough?

Member


Use max // (10**12) instead, perhaps?

Member


Or perhaps (max + 10**12 - 1) // (10**12), which rounds up not down
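The two suggestions differ only when the numerator is not an exact multiple of 10**12; a quick check of floor versus ceiling division (variable names are illustrative):

```python
M = 2**63 - 1   # np.iinfo(np.int64).max
q = 10**12      # picoseconds per second

floor_div = M // q            # rounds down: last value inside the limit
ceil_div = (M + q - 1) // q   # rounds up: first value past the limit

print(floor_div, ceil_div)  # 9223372 9223373
```

Since 2**63 - 1 is not a multiple of 10**12, the ceiling form lands one second higher, which is what makes it suitable for probing just past the representable span.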

Contributor Author


My best attempts so far use divmod to recover the remainders and add them back in using appropriate units.

However, I've grown the unit test quite a bit now, and expanding to include symmetric time spans seems to indicate silent wrap-around for a subset of the time units:

input_seconds, remainder_ps = divmod(np.iinfo(np.int64).max, 10**12)
input_str_minus = str(np.datetime64(0, 's') - np.timedelta64(input_seconds, 's') - np.timedelta64(remainder_ps, 'ps'))
input_str_plus = str(np.datetime64(0, 's') + np.timedelta64(input_seconds, 's') + np.timedelta64(remainder_ps, 'ps'))
print("minus:", np.datetime64(input_str_minus))
print("plus:", np.datetime64(input_str_plus))

Both end up after the epoch, even on master branch:

minus: 1970-04-16T05:57:07.963145224193 # looks wrong
plus: 1970-04-17T18:02:52.036854775807 # looks about right

Is the asymmetry surprising? Our docs would suggest there should be no asymmetry. I originally noticed that pushing just slightly beyond the time spans in the positive directions tends to produce NaT, so I wanted to have a secondary check in the unit test to confirm that I was actually very close to the "precision" limit.
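Worth noting that int64 itself is asymmetric by exactly one tick (two's complement), though a single-tick imbalance cannot by itself explain a days-scale wrap like the one above:

```python
import numpy as np

info = np.iinfo(np.int64)
# Two's complement: the negative range holds one extra value.
print(info.min, info.max)  # -9223372036854775808 9223372036854775807
assert info.min == -info.max - 1
```

So a symmetric span check based on abs() of the tick count is already slightly conservative on the negative side; any larger observed asymmetry points to an overflow elsewhere in the conversion chain.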

* previously, datetime64 would accept
input that exceeds allowable precision
time spans; we now have time span guards
at the sub-nanosecond unit level, with
checks performed at the seconds level of
resolution
@kurtqq

kurtqq commented Feb 18, 2021

any progress with fixing this?

Base automatically changed from master to main March 4, 2021 02:04
@charris
Member

charris commented Jun 9, 2022

@tylerjereddy Needs a rebase. Are you interested in pursuing this?

@tylerjereddy
Contributor Author

@charris Looks like there is some interest here. Maybe when I can get SciPy 1.9.0 release candidates out I can circle back, but not before that I don't think.

@tylerjereddy
Contributor Author

Let me close some of my idle PRs like this one so that others feel free to take them over. I may choose to focus on something different.

Successfully merging this pull request may close these issues.

datetime64 chokes on sub-nanosecond input string
5 participants