Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: introduce PyArray_SafeCast to fix issues around stringdtype views #26147

Merged
merged 7 commits into from Apr 3, 2024

Conversation

ngoldbaum
Copy link
Member

@ngoldbaum ngoldbaum commented Mar 27, 2024

Fixes #26140.

This is a minimal implementation of a new function that generalizes PyArray_EquivTypes.

@ngoldbaum ngoldbaum added the 09 - Backport-Candidate PRs tagged should be backported label Mar 27, 2024
if ((type1 != type2) && (type1->kind == 'T')) {
return 0;
}
return 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to dig slightly deeper and use this pattern:

if (dtypes[i] != PyArray_DESCR(op[i])) {
npy_intp view_offset;
NPY_CASTING safety = PyArray_GetCastInfo(
PyArray_DESCR(op[i]), dtypes[i], NULL, &view_offset);
if (safety < 0 && PyErr_Occurred()) {
/* A proper error during a cast check, should be rare */
return -1;
}
if (view_offset != 0) {
/* NOTE: Could possibly implement non-zero view offsets */
must_copy = 1;
}
if (force_cast_input && i < nin) {
/*
* ArrayMethod flagged to ignore casting (logical funcs
* can force cast to bool)
*/
}
else if (PyArray_MinCastSafety(safety, casting) != casting) {
return 0; /* the cast is not safe enough */
}
}

Making this not be string dtype specific!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah nice and then I can control how view_offset gets set in string_to_string_resolve_descriptors, that seems much more idiomatic

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this idea doesn't quite work as cleanly as I'd hoped. If I set view_offset != 0 in string_to_string_resolve_descriptors, there's an assert that triggers if casting == NPY_NO_CASTING:

/*
* Check for less harmful non-standard returns. The following two returns
* should never happen:
* 1. No-casting must imply a view offset of 0.
* 2. Equivalent-casting + 0 view offset is (usually) the definition
* of a "no" cast. However, changing the order of fields can also
* create descriptors that are not equivalent but views.
* Note that unsafe casts can have a view offset. For example, in
* principle, casting `<i8` to `<i4` is a cast with 0 offset.
*/
if (*view_offset != 0) {
assert(casting != NPY_NO_CASTING);
}
else {
assert(casting != NPY_EQUIV_CASTING
|| (PyDataType_HASFIELDS(from) && PyDataType_HASFIELDS(to)));
}
return casting;

In the latest version I'm about to push I guard that assert with an if statement that checks for stringdtype, but that seems unsatisfying. Maybe the correct thing is for stringdtype to set e.g. NPY_EQUIV_CASTING for casts with distinct allocators, but when I tried doing that it broke reductions and I couldn't see a clear path forward on fixing that, because the check for valid reductions relies on PyArray_EquivTypes, which enfoces NPY_NO_CASTING as the minimum casting level:

/*
* The first operand and output should be the same array, so they should
* be identical. The second argument can be different for reductions,
* but is checked to be identical for accumulate and reduceat.
* Ideally, the type-resolver ensures that all are identical, but we do
* not enforce this here strictly. Otherwise correct handling of
* byte-order changes (or metadata) requires a lot of care; see gh-20699.
*/
if (!PyArray_EquivTypes(out_descrs[0], out_descrs[2]) || (
enforce_uniform_args && !PyArray_EquivTypes(
out_descrs[0], out_descrs[1]))) {
PyErr_Format(PyExc_TypeError,
"the resolved dtypes are not compatible with %s.%s. "
"Resolved (%R, %R, %R)",
ufunc_get_name_cstr(ufunc), method,
out_descrs[0], out_descrs[1], out_descrs[2]);
goto fail;
}

Maybe we need a new casting level between NO_CASTING and EQUIV_CASTING for this case where data can live outside the array buffer on the descriptor?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can probably just remove the assert, just need to make sure to not document anywhere that no-casting implies view (I doubt we use it anywhere, but not sure). It used to, for these strings it can't since they use a whole new mechanism for storage... that seems fine.
(you could keep the assert together with checking that the dtype implements the hook for fixing the dtype in the array creation call, given that reason.)

* We ignore that possibility for simplicity; it really is not our bug.
*/
return 0;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still necessary here and in EquivTypes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove it, as it I am very certain it was fixed in boost python a while ago.

return 0;
}
/* If casting is "no casting" this dtypes are considered equivalent. */
return PyArray_MinCastSafety(safety, NPY_NO_CASTING) == NPY_NO_CASTING;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that most of this implementation is copied from PyArray_EquivTypes, to avoid calling GetCastInfo twice. If I just called EquivTypes, there would be an unnecessary second call to GetCastInfo to get the view_offset.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair, could add that as a code comment if you like.

@charris
Copy link
Member

charris commented Mar 29, 2024

What is the status of this?

@ngoldbaum
Copy link
Member Author

I think it just needs another round of review. It’s not critical for the RC but it would be good to have it in.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good, but the view offset needs fixing. I cannot review if other places using eqivtypes shojld be using this meaning, right now.

Its maybe subtle but only really affects strings, so with that fix happy to see it in. (Other comments could wait).

* We ignore that possibility for simplicity; it really is not our bug.
*/
return 0;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove it, as it I am very certain it was fixed in boost python a while ago.

return 0;
}
/* If casting is "no casting" this dtypes are considered equivalent. */
return PyArray_MinCastSafety(safety, NPY_NO_CASTING) == NPY_NO_CASTING;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair, could add that as a code comment if you like.

@@ -79,7 +79,8 @@ string_to_string_resolve_descriptors(PyObject *NPY_UNUSED(self),
return NPY_UNSAFE_CASTING;
}

*view_offset = 0;
// views are only legal between descriptors that share allocators (e.g. the same object)
*view_offset = descr0->allocator != descr1->allocator;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong. You must set it to 0 or leave it unchanged! (Intp min, I think)

* is true as well.
*/
NPY_NO_EXPORT unsigned char
PyArray_ViewableTypes(PyArray_Descr *type1, PyArray_Descr *type2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can encode the equiv in the name? Or pass in the cast safety? Passing it in would make it work in the ufunc code I think.

But if not its OK.

@@ -1609,7 +1652,11 @@ _array_fromobject_generic(

/* One more chance for faster exit if user specified the dtype. */
oldtype = PyArray_DESCR(oparr);
if (PyArray_EquivTypes(oldtype, dtype)) {
unsigned char viewable = PyArray_ViewableTypes(oldtype, dtype);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The equiv cast may not matter, but then a single field struct to non-struct could return views... not today, I guess

@ngoldbaum ngoldbaum changed the title BUG: introduce PyArray_ViewableTypes to fix issues around stringdtype views BUG: introduce PyArray_SafeCast to fix issues around stringdtype views Mar 29, 2024
@ngoldbaum
Copy link
Member Author

Thanks for the prompt to refactor so I could use it in ufunc_object.c, the result is a bit more understandable and hopefully more useful.

Now instead of PyArray_ViewableTypes, I'm calling the new function PyArray_SafeCast, which is a generalization of PyArray_EquivTypes that accepts the minimum casting safety and also allows callers to see the view_offset value. I was able to use this new function in the trivial ufunc loop setup code.

@ngoldbaum
Copy link
Member Author

Even though sebastian approved this, probably worth one more round of review before merging after the last commit.

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Mostly just queries/requests for clarification

@@ -1922,7 +1926,7 @@ PyArray_FromArray(PyArrayObject *arr, PyArray_Descr *newtype, int flags)
/* If a writeable array was requested, and arr is not */
((flags & NPY_ARRAY_WRITEABLE) &&
(!(arrflags & NPY_ARRAY_WRITEABLE))) ||
!PyArray_EquivTypes(oldtype, newtype);
!view_safe;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the actual work has now been done before defining copy, I think it should be the first item for copy, i.e.,

copy = !view_safe ||
         ...

Alternatively, maybe better, since this now makes every case where a copy was already required slower, one can remove the last bit, and do the calculation like,

if (!copy) {  /* Check a view is in fact possible */
    npy_intp view_offset
    npy_intp is_safe = ...
    copy = copy && is_safe && (view_offset == 0);
}

* *minimum_safety* casting level. Sets the view_offset if that is set
* for the cast. If ignore_error is 1, errors in cast setup are ignored.
*/
NPY_NO_EXPORT npy_intp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment should perhaps add

... are ignored;
otherwise the error is kept and -1 is returned

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, remnant of an older version that unconditionally ignored errors.

@@ -1908,6 +1908,10 @@ PyArray_FromArray(PyArrayObject *arr, PyArray_Descr *newtype, int flags)
}

arrflags = PyArray_FLAGS(arr);
npy_intp view_offset;
npy_intp is_safe = PyArray_SafeCast(oldtype, newtype, &view_offset, NPY_NO_CASTING, 1);
npy_intp view_safe = (is_safe && (view_offset == 0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably fine as is, but in the code proper, impossible views are marked with view_offset = NPY_MIN_INTP

if (PyArray_EquivTypes(oldtype, dtype)) {
npy_intp view_offset;
npy_intp is_safe = PyArray_SafeCast(oldtype, dtype, &view_offset, NPY_NO_CASTING, 1);
npy_intp view_safe = (is_safe && (view_offset == 0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, should one check view_offset != NPY_MIN_INTP instead?

@@ -21,4 +21,9 @@ NPY_VISIBILITY_HIDDEN extern PyObject * npy_ma_str_convert_if_no_array;
NPY_VISIBILITY_HIDDEN extern PyObject * npy_ma_str_cpu;
NPY_VISIBILITY_HIDDEN extern PyObject * npy_ma_str_array_err_msg_substr;

NPY_NO_EXPORT npy_intp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the right module to define it in? I know PyArray_EquivTypes is here, but maybe casts.c makes more sense?

numpy/_core/src/umath/ufunc_object.c Show resolved Hide resolved
numpy/_core/tests/test_stringdtype.py Show resolved Hide resolved
@ngoldbaum
Copy link
Member Author

All review comments should be resolved now

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! A suggestion for a tiny further improvement in-line, though probably barely worth the CI since the compiler will likely do it already.

if (!copy) {
npy_intp view_offset;
npy_intp is_safe = PyArray_SafeCast(oldtype, newtype, &view_offset, NPY_NO_CASTING, 1);
copy = copy || !(is_safe && (view_offset != NPY_MIN_INTP));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion was imperfect: this can be copy = !(is_safe... since we know copy is false.

numpy/_core/src/umath/ufunc_object.c Show resolved Hide resolved
@ngoldbaum
Copy link
Member Author

Pulling this one in, thanks for the review!

@ngoldbaum ngoldbaum merged commit ffb23cd into numpy:main Apr 3, 2024
62 checks passed
@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Apr 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: creating an array that shares memory with a stringdtype array can fail
4 participants