
BUG: prevent datetime64 span chokes #11873

Closed
wants to merge 2 commits

Conversation

tylerjereddy
Contributor

Fixes #5452.

I've added time span (relative to the epoch) checks / guards for sub-ns datetime64 units to fix the above issue. We may want to use higher-resolution rejection checks (currently seconds units are used to reject time spans) & expand the guards to include all units of datetime64.

For now / initially, I've focused on the sub-ns range discussed in that linked issue. It may also be more robust to use the C89 / portable limits.h header file for max int values instead of the hard-coded time-span limits I've taken from the docs.

I've also had to make minor adjustments to the unit tests: one test relied on silent (garbage) wrapping that happened to work for that test but didn't make sense, and another now has unreasonable datetime spans intercepted before an overflow can occur, so we sometimes expect a different exception type.
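For reference, the hard-coded spans mentioned above can be derived directly from the int64 tick count; a quick sketch (the `spans_s` mapping and its keys are illustrative, not part of the PR):

```python
import numpy as np

INT64_MAX = np.iinfo(np.int64).max  # 2**63 - 1 ticks

# Approximate spans around the epoch implied by a 64-bit tick count,
# expressed in whole seconds for each sub-ns unit:
spans_s = {
    'ns': INT64_MAX // 10**9,   # ~292 years
    'ps': INT64_MAX // 10**12,  # ~106 days
    'fs': INT64_MAX // 10**15,  # ~2.6 hours
    'as': INT64_MAX // 10**18,  # ~9.2 seconds
}
print(spans_s['ps'] // 86400)  # → 106 (days)
```

These are the same figures quoted in the datetime64 docs, which is where the `106 * 24 * 60 * 60` style constants in the diff come from.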

@charris
Member

charris commented Sep 3, 2018

Close and reopen.

@charris charris closed this Sep 3, 2018
@charris charris reopened this Sep 3, 2018
@mattip
Member

mattip commented Sep 4, 2018

LGTM. Minor nit - there is a TODO on line 244, can it be removed? Related issues:

I also added a component: numpy.datetime64 label and searched all issues for datetime64 to add the label. There may be more issues that I missed ...

@tylerjereddy
Contributor Author

The TODO should not be removed -- I've only added sub-ns time span guards here, per the specific issue targeted by this PR; all other time / date units are treated as before and may overflow / underflow if the user doesn't check input carefully.

If the other issues you mention are fixed by this that's a nice bonus, but closing those should probably be contingent on at least adding a unit test that confirms blocking the specified behaviors, and I'm not targeting those issues here.

I did remove get_datetimestruct_hours(); that should hopefully bring the diff coverage to 100%.

Casting with astype is not specifically targeted here -- great if there are some bonus issues fixed there, but again that's not the intention; I'd like to keep this a focused / manageable PR.

@tylerjereddy
Contributor Author

Do we want to expand the scope of this PR to include overflow "guards" for all the time spans?

if (abs(get_datetimestruct_seconds(dts)) >
        106 * 24 * 60 * 60) {
    /* reject with seconds resolution */
    PyErr_SetString(PyExc_ValueError,
Member


Not clear to me whether ValueError or OverflowError is the better choice. Would be good to put some justification for the choice in the commit message, whichever way you go.

Contributor Author


I was hoping I could replace 106 * 24 * 60 * 60 with a sufficiently high precision check using a formula with NPY_INT64_MAX, but if I understand correctly it could be too late by the time the calculation propagates down to seconds.

            "too large for fs units");
    return -1;
}

ret = (((((days * 24 +
Member


I don't know if this is a hot path. If it's not, can we use something like the following, which doesn't require us to hard-code these imprecise limits:

ret = days;
err |= overflow_mult_then_add(&ret, 24, dts->hour);
err |= overflow_mult_then_add(&ret, 60, dts->min);
err |= overflow_mult_then_add(&ret, 60, dts->sec);
err |= overflow_mult_then_add(&ret, 1000000, dts->us);
err |= overflow_mult_then_add(&ret, 1000000, dts->ps);
err |= overflow_mult_then_add(&ret, 1000, dts->as / 1000);

Where

bool overflow_mult_then_add(npy_int64 *val, npy_int64 mult, npy_int64 add) {
    /* Guard *val * mult + add against int64 overflow before computing it.
       Assumes mult > 0; mult is signed so the divisions below stay in
       signed arithmetic. Both bounds are checked unconditionally --
       branching only on the sign of add would let *val * mult overflow
       unchecked when add == 0 or when val and add have opposite signs. */
    if (*val > (NPY_INT64_MAX - (add > 0 ? add : 0)) / mult) {
        return true;
    }
    else if (*val < (NPY_INT64_MIN - (add < 0 ? add : 0)) / mult) {
        return true;
    }
    else {
        *val = *val * mult + add;
        return false;
    }
}
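Since Python integers are unbounded, the conversion chain above can be modeled by computing first and range-checking after, which makes a handy oracle for the C bound logic (function and parameter names here are illustrative, not from the PR):

```python
INT64_MAX = 2**63 - 1
INT64_MIN = -2**63

def mult_then_add_checked(val, mult, add):
    """Oracle version: compute val * mult + add exactly, then
    range-check against int64. Returns (overflowed, value)."""
    new = val * mult + add
    if not (INT64_MIN <= new <= INT64_MAX):
        return True, val  # leave val unchanged, as in the C sketch
    return False, new

def dts_to_fs(days, hour, minute, sec, us, ps, as_):
    """Fold a datetimestruct down to femtoseconds, flagging overflow."""
    err, ret = False, days
    for mult, add in ((24, hour), (60, minute), (60, sec),
                      (10**6, us), (10**6, ps), (1000, as_ // 1000)):
        o, ret = mult_then_add_checked(ret, mult, add)
        err = err or o
    return err, ret

# 2 hours in fs fits in int64; 3 hours does not (the docs quote ~2.6 h):
print(dts_to_fs(0, 2, 0, 0, 0, 0, 0))  # (False, 7200000000000000000)
print(dts_to_fs(0, 3, 0, 0, 0, 0, 0))  # overflow flagged: (True, ...)
```

An oracle like this is also a convenient way to cross-check any hand-derived division bounds before committing them to C.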

Contributor Author


I assume that by "imprecise" you mean that the time spans from the epoch that are tolerated for given return time units could be specified at a higher precision? And avoiding "hard coding" by incrementally checking for overflow as the return value is propagated through the conversions above, instead of i.e., "+/- n seconds from the epoch are tolerated for these time units."?

Is this something we want in scope for this PR? If I do that, then wouldn't it be prudent to do it for all the return blocks & also add unit tests that probe the increased / improved precision?

We'd also likely want to update the docs to reflect the improved precision / limits on time since the epoch if we do that, and for docs we'd still want some kind of hard-coded numbers for users to go by, right? I suppose we could switch to "approximate" wording in the docs, and just use the old values as rough guides, letting users push those overflow limits as they see fit.

Contributor Author


I'm certainly happy to do the work (we'd need benchmarks, too?), so just making sure we've thought about this.

Member


What I mean is that this change will break existing code that uses values between the extremes you enforce here and the true extremes representable in the type.

Member


Regarding the limits - we can document the limit as np.timedelta64(np.iinfo(np.int64).max, 'ps'), i.e. (2**63 - 1) * 10**-12 seconds, etc.
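For instance, the exact ps-unit limit can be written out and converted to seconds by dividing two timedelta64 values, which yields a plain float ratio (a quick sketch, not code from the PR):

```python
import numpy as np

# The exact limit for ps units, as a timedelta64:
limit = np.timedelta64(np.iinfo(np.int64).max, 'ps')

# Dividing two timedelta64 values gives a float ratio:
seconds = limit / np.timedelta64(1, 's')
print(seconds)  # ~9.22e6 seconds, i.e. roughly 106.75 days
```

Documenting the limit this way keeps it exact while still letting users derive the familiar "about 106 days" figure.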

Contributor Author


What about moving the overflow checks into each of the functions named get_datetimestruct_minutes() and so on? Then each switch case would just call a function coded for its units, with each preceding function in the conversion chain also protected from overflow.

@pytest.mark.xfail(reason="work in progress")
@pytest.mark.parametrize("time_unit, time_val", [
('ns', np.datetime64(int(np.iinfo(np.int64).max * (10 ** -9)), 's') + np.timedelta64(0, 'ns')),
('ps', np.datetime64(int(np.iinfo(np.int64).max * (10 ** -12)), 's') + np.timedelta64(0, 'ps')),
Contributor Author

@tylerjereddy tylerjereddy Mar 4, 2019


This seems to detect the precision reduction Eric is worried about in this PR (this test now correctly fails), but I'm slightly concerned about the int() call here -- it truncates the float I get from the "division" to produce the number of seconds from the epoch.

Obviously, I can't give a float to initialize the datetime64. Is there a better approach? The Python decimal module or something, to flush through the strictest precision test possible, or is this going to be close enough?

Member


Use max // (10**12) instead, perhaps?

Member


Or perhaps (max + 10**12 - 1) // (10**12), which rounds up not down
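The two suggestions differ only when the numerator is not an exact multiple of 10**12; a quick check of floor versus ceiling division (variable names are illustrative):

```python
M = 2**63 - 1   # np.iinfo(np.int64).max
q = 10**12      # picoseconds per second

floor_div = M // q            # rounds down: last value inside the limit
ceil_div = (M + q - 1) // q   # rounds up: first value past the limit

print(floor_div, ceil_div)  # 9223372 9223373
```

Since 2**63 - 1 is not a multiple of 10**12, the ceiling form lands one second higher, which is what makes it suitable for probing just past the representable span.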

Contributor Author


My best attempts so far use divmod to recover the remainders and add them back in using appropriate units.

However, I've grown the unit test quite a bit now, and expanding to include symmetric time spans seems to indicate silent wrap-around for a subset of the time units:

input_seconds, remainder_ps = divmod(np.iinfo(np.int64).max, 10**12)
input_str_minus = str(np.datetime64(0, 's') - np.timedelta64(input_seconds, 's') - np.timedelta64(remainder_ps, 'ps'))
input_str_plus = str(np.datetime64(0, 's') + np.timedelta64(input_seconds, 's') + np.timedelta64(remainder_ps, 'ps'))
print("minus:", np.datetime64(input_str_minus))
print("plus:", np.datetime64(input_str_plus))

Both end up after the epoch, even on master branch:

minus: 1970-04-16T05:57:07.963145224193 # looks wrong
plus: 1970-04-17T18:02:52.036854775807 # looks about right

Is the asymmetry surprising? Our docs would suggest there should be no asymmetry. I originally noticed that pushing just slightly beyond the time spans in the positive directions tends to produce NaT, so I wanted to have a secondary check in the unit test to confirm that I was actually very close to the "precision" limit.
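Worth noting that int64 itself is asymmetric by exactly one tick (two's complement), though a single-tick imbalance cannot by itself explain a days-scale wrap like the one above:

```python
import numpy as np

info = np.iinfo(np.int64)
# Two's complement: the negative range holds one extra value.
print(info.min, info.max)  # -9223372036854775808 9223372036854775807
assert info.min == -info.max - 1
```

So a symmetric span check based on abs() of the tick count is already slightly conservative on the negative side; any larger observed asymmetry points to an overflow elsewhere in the conversion chain.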

* previously, datetime64 would accept
input that exceeds allowable precision
time spans; we now have time span guards
at the sub-nanosecond unit level, with
checks performed at the seconds level of
resolution
@kurtqq

kurtqq commented Feb 18, 2021

any progress with fixing this?

Base automatically changed from master to main March 4, 2021 02:04
@charris
Member

charris commented Jun 9, 2022

@tylerjereddy Needs a rebase. Are you interested in pursuing this?

@tylerjereddy
Contributor Author

@charris Looks like there is some interest here. Maybe when I can get SciPy 1.9.0 release candidates out I can circle back, but not before that I don't think.

@tylerjereddy
Contributor Author

Let me close some of my idle PRs like this one so that others feel free to take them over. I may choose to focus on something different.

Successfully merging this pull request may close these issues.

datetime64 chokes on sub-nanosecond input string
5 participants