Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: NaT now casts to NaN float64 #11890

Closed
wants to merge 3 commits into from

Conversation

tylerjereddy
Copy link
Contributor

Fixes #8449.

I think I used the "older" array iteration API if I'm reading the docs correctly; seems like this is still used in source though.


/* confine the NaT to NaN substitution checks by type cast */
if ((PyArray_DESCR(self)->type_num == NPY_TIMEDELTA) &&
(PyArray_DESCR(ret)->type_num == NPY_FLOAT64)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ought to handle other sizes of float too

@tylerjereddy
Copy link
Contributor Author

I tried to revise accordingly. Some notes / thoughts:

  • DECREF: not sure if it was needed, but adding it doesn't seem to cause test failures or build warnings for me locally
  • expanding to all FLOAT types: I'm not sure if you actually meant all of them, even NPY_FLOAT256 & so on, but that's what I've done here, albeit protecting with preprocessor directives in case the system doesn't have those defined; I've adjusted the unit test to iterate over the IEEE standard 16, 32, and 64 -- is there a clean way to get pytest to conditionally iterate over the "less standard" ones that NumPy defines if they're available on the system? -- also, is the value of NaN well-established / defined for those more unusual float types?

(PyArray_DESCR(ret)->type_num == NPY_FLOAT128) ||
#endif
#ifdef NPY_FLOAT256
(PyArray_DESCR(ret)->type_num == NPY_FLOAT256)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use NPY_HALF, NPY_FLOAT, NPY_DOUBLE, NPY_LONGDOUBLE here, or whatever the names really are. Never used the sized types when writing type-generic code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe to dispense with the preprocessor guards if I do that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it is - I think long double is always defined, even if its just the same size as double.

@tylerjereddy
Copy link
Contributor Author

I updated the type specifications and tried removing the preprocessor guards -- that works locally anyway.

while (self_iter->index < self_iter->size) {
/* GETITEM will retrieve Py_None for NaT */
item = PyArray_GETITEM(self, PyArray_ITER_DATA(self_iter));
Py_XDECREF(item);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ought to be after you're done with it, not before (although strictly it doesn't matter)

@tylerjereddy
Copy link
Contributor Author

Revised to move the decref, but I may need some guidance on the NULL comment.

@tylerjereddy
Copy link
Contributor Author

May need clarification re: revision requested by @eric-wieser

@tylerjereddy
Copy link
Contributor Author

Rebased & ping @shoyer for guidance

@shoyer
Copy link
Member

shoyer commented Oct 9, 2018

I have two concerns here:

  1. using the C Python API for this could make the inner loop here really slow. I wouldn't be surprised if something like np.where(np.isnat(source), np.nan, dest) is actually faster, just due to doing the loops at a lower level. It would be interesting to see a few micro-benchmark results.
  2. this is a strange place to be putting this code -- ideally we would not be adding special cases for datetime64 in core numpy functions. At the least, we should split out this addition into a helper function.

@tylerjereddy
Copy link
Contributor Author

Revised based on feedback from @shoyer -- I migrated the casting code right down to the inner type-type conversion loops that are code generated in "C++ style."

@@ -1086,6 +1086,95 @@ TIMEDELTA_setitem(PyObject *op, void *ov, void *vap)

/* Assumes contiguous, and aligned, from and to */

/* special case TIMEDELTA NPY_DATETIME_NAT handling */
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that a large amount of code duplication is introduced because of this specialization, but I think the choice is always going to be between highly-specialized loops for type-type conversion & more general loops where bystander type conversions suffer a performance hit from NaT checking.

numpy/core/src/multiarray/arraytypes.c.src Show resolved Hide resolved
numpy/core/src/multiarray/arraytypes.c.src Show resolved Hide resolved
}
else if (*op == (npy_float)(*op)) {
*op = (float)NPY_NAN;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are you trying to do with these ifs? You shouldn't ever read op, only write to it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A form of type checking; I think I had issues using the template @type@ style things in comparison operators for some reason.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a nice / clean way to inspect types on inner loops in our template language? I guess something involving PyArray_DESCR or similar might be cooked up, although the "ugly" way seems to work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do with a better way of attaching traits to types at loop-expansion time, for sure.

One approach we've used elsewhere is just to introduce another parallel variable containing 1/0 for each branch.

Another approach I'd like to investigate, but we have no precedent for, is to use something like:

#define float64_is_float64

And then in the loop use

# ifdef @type@_is_float64
...
#endif

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The highest performing approach is probably just to hard code the loops by type so that no check is needed at all, as alluded to above. Just means more code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now - can't you just write a separate loop for the non-floating point cases? I don't think that you need to distinguish between the different types of NaN anyway - C will cast for you just fine, although perhaps with a warning that makes travis unhappy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think that's the debate I was having with myself in the comments above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - but you were looking at it as a performance optimization, whereas in fact its required for correctness.

@ahaldane
Copy link
Member

I was having a debate with myself in #8325 over whether we should even allow float <-> datetime casts at all.

But I guess after reading this issue, and the example of converting to float and dividing by 356 to get years, maybe it's useful.

Then it does seem like we should separate it out into a separate loop as suggested above, and strictly check that the float value is in the range MAX_INT64 to MIN_INT64 and set it to NaT if not.

@tylerjereddy
Copy link
Contributor Author

Revised based on feedback to:

  • use separate inner loops for floating vs. non-floating casts
  • try to improve patch coverage by testing for float->m8 typecasts as well
  • added a test that probes i.e., MAX_INT64 limits as discussed above, but this already passes unless I'm misunderstanding--perhaps because there's a fair bit of overflow guard code littered throughout datetime64 infrastructure (albeit with various attached TODOs)

One other thought--I think we're limited to m8 (timedelta64) in this PR--not sure if we want to expand to datetime64 (can't remember if I had a good reason not to include that; maybe just to focus scope or something).

])
def test_float_cast_limits(self, float_type, cast_type):
arr = np.array([[np.iinfo(np.int64).max + 1],
[np.iinfo(np.int64).min - 1]], dtype=float_type)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test feels a bit tangential to my changes in this PR -- maybe I need an example for where usage of those max integer limits currently causes undesirable behavior in master or in my feature branch.

@tylerjereddy
Copy link
Contributor Author

Architecture specific test failure -- interesting! Looks like the max integer related test.

Base automatically changed from master to main March 4, 2021 02:04
@charris
Copy link
Member

charris commented Jun 9, 2022

@tylerjereddy Do you want to pursue this? I suppose it is reasonable to ask why one would convert a datetime type to float when the units will disappear. Needs a rebase in any case.

@tylerjereddy
Copy link
Contributor Author

@charris this one may be less exciting but I don't mind trying to wrap it up once the current SciPy 1.9.0 release candidates start moving out.

@tylerjereddy
Copy link
Contributor Author

Let me close this, priority may not be high, and I should perhaps focus on smaller things on the NumPy side for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Casting a NaT with a timedelta64or datetime64 to float64 return big negative number
5 participants