BUG: three string ufunc bugs, one leading to segfault #25515

mhvk · 2023-12-31T22:13:06Z

#25171 added string ufuncs, but unfortunately did not properly update the rather sparse test suite for chararray, in particular failing to test the various fast paths that were created. While one bug was fixed in #25484, not all were: as noted in #25513, the present version causes segfaults in astropy. It turned out there were multiple mistakes:

The memchr version of findchar always returns 0 for the offset;
In replacements, the length of the remaining buffer is not tracked, leading to buffer overflow.
The fast-track code for "no replacement possible" gives empty strings instead of a copy of the input;
These 3 commits fix those bugs, adding tests to avoid regressions.

I note that this only increases test coverage for these cases; a larger addition seems warranted.

Fixes #25513

EDIT: ordering changed.

mhvk · 2023-12-31T22:17:11Z

numpy/_core/src/umath/string_fastsearch.h

@@ -62,7 +62,7 @@ findchar(Buffer<enc> s, Py_ssize_t n, npy_ucs4 ch)
    Buffer<enc> p = s, e = s + n;

    if (n > MEMCHR_CUT_OFF) {
-        p = s.buffer_memchr(ch, n);
+        p = p.buffer_memchr(ch, n);


This solution works, but it sort-of works around the fact that buffer_memchr updates the position of the buffer in-place. My C++ skills were insufficient to trivially solve that.

That's true. I think it'd be preferrable to fix buffer_memchr to return a new buffer without side effects on the calling one.

This would do it probably:

inline Buffer<enc> buffer_memchr(npy_ucs4 ch, int len) { Buffer<enc> newbuf = *this; switch (enc) { case ENCODING::ASCII: newbuf.buf = (char *) memchr(buf, ch, len); break; case ENCODING::UTF32: newbuf.buf = (char *) wmemchr((wchar_t *) buf, ch, len); break; } return newbuf; }

Yes, that makes more sense, and seems to work.

OK, the first commit now changes buffer_memchr like you suggested.

rgommers · 2024-01-01T10:42:51Z

Thanks for the fixes @mhvk!

I note that this only increases test coverage for these cases; a larger addition seems warranted.

In the meantime, would it be useful/practical to run the AstroPy test suite (ideally only the relevant subset of it, so it doesn't take forever) when more string ufuncs are added?

Cc @lysnikolaou as the author of these new ufuncs.

mhvk · 2024-01-01T14:52:52Z

In the meantime, would it be useful/practical to run the AstroPy test suite (ideally only the relevant subset of it, so it doesn't take forever) when more string ufuncs are added?

I'm not sure that we do enough with strings to make that worthwhile - astropy checking with the nightlies is probably good enough. I think what happened here is understandable - one replaces one way of doing it with another, which is like a refactoring, so it is not illogical to assume the test coverage passing is good enough. The part missed was that the replacement here included quite a lot of special cases which were not tested. Ideally, this would have been asked in review.
Or, better still, it is noticed by test coverage runs, though I don't know how well coverage works for templated C++ code...

Actually, now having had to go over the code, a question for @lysnikolaou - I noticed special paths also for more complicated and very large strings. Are those covered by tests?

lysnikolaou

Thanks @mhvk! Left two comments, but this looks good in general. Also, I don't think the different fast paths are tested well enough (or maybe at all?) here, but I'm planning to add more tests after we move everything to np.strings.

lysnikolaou · 2024-01-02T11:43:34Z

numpy/_core/src/umath/string_fastsearch.h

@@ -62,7 +62,7 @@ findchar(Buffer<enc> s, Py_ssize_t n, npy_ucs4 ch)
    Buffer<enc> p = s, e = s + n;

    if (n > MEMCHR_CUT_OFF) {
-        p = s.buffer_memchr(ch, n);
+        p = p.buffer_memchr(ch, n);


That's true. I think it'd be preferrable to fix buffer_memchr to return a new buffer without side effects on the calling one.

This would do it probably:

inline Buffer<enc> buffer_memchr(npy_ucs4 ch, int len) { Buffer<enc> newbuf = *this; switch (enc) { case ENCODING::ASCII: newbuf.buf = (char *) memchr(buf, ch, len); break; case ENCODING::UTF32: newbuf.buf = (char *) wmemchr((wchar_t *) buf, ch, len); break; } return newbuf; }

lysnikolaou · 2024-01-02T11:44:24Z

numpy/_core/src/umath/string_ufuncs.cpp

-    if (len1 < len2
-        || (len2 == 0 && len3 == 0)
-        || count == 0
-        || buf2.strcmp(buf3) == 0) {
-        out.buffer_fill_with_zeros_after_index(0);
-        return;
-    }


Is this refactoring needed? If so, could you explain the difference?

Yes, before you returned an empty out if, e.g., the input string is shorter than the replacement string, so cannot possibly match. But in this case, the input string should unchanged, i.e., copied to the output. So, for any of those cases where it is obvious no match will be found, I now fall through directly to copying the input to the output.

Of course! Thanks for the explanation!

Aside, while looking through strcmp just to see whether a (len2 == len3) would be useful (it is and I'll include it), I noticed that the case for a longer buf3 would return the wrong sign. Almost certainly never matters, but maybe for a next fix:

numpy/numpy/_core/src/umath/string_buffer.h

Lines 299 to 306 in d4a66a3

while (tmp2.buf < tmp2.after) {

if (*tmp2 < 0) {

return -1;

}

if (*tmp2 > 0) {

return 1;

}

tmp2++;

(I don't think I should do it here, but can if you'd like me to).

While looking through the commits, I refactored this a little further. Hopefully, the intent of the if statement is clearer now.

As well as other cases where no replacement is possible, ensuring that we return a copy of the input (rather than an empty string).

ngoldbaum · 2024-01-02T17:12:33Z

It looks like this fixes another regression I ran into:

import numpy as np

string_list = ['AbcAbcAbc']

o_dtype = np.str_

oarr = np.array(string_list, dtype=o_dtype)

oarr2 = np.array([s + "2" for s in string_list], dtype=o_dtype)


op = np.greater

ores = op(oarr, oarr2)


print(ores)
print(str(oarr[0]) > str(oarr2[0]))

(adapted from the stringdtype tests). On current main this incorrectly prints [True]\nFalse, but on this PR I get the correct answer.

ngoldbaum · 2024-01-02T17:28:12Z

Pulling this in. Thanks @mhvk!

I know at one point recently we had a bot commenting about missing C coverage but I haven't seen it in a while, having that available would probably also help. I see @lysnikolaou has already pushed some string ufunc tests to his other PR adding an np.strings namespace.

If we add more ufuncs in the future I'll look at the C coverage manually before merging, keeping in mind that the np.char and chararray tests are patchy at best.

ngoldbaum · 2024-01-02T17:53:09Z

but on this PR I get the correct answer.

Actually not quite right, I think the test code I shared earlier is flaky due to memory corruption, will have a followup PR incoming...

mhvk · 2024-01-02T18:35:01Z

@ngoldbaum - sounds all good! The fact that the answer is not right perhaps is due to what I noted in #25515 (comment) (which I did not touch here since it seemed out of scope)

mhvk added 00 - Bug component: numpy._core labels Dec 31, 2023

mhvk added this to the 2.0.0 release milestone Dec 31, 2023

mhvk marked this pull request as ready for review December 31, 2023 22:13

This was referenced Dec 31, 2023

BUG: A segfault in chararray (2.0.0dev0 regression) #25513

Closed

BUG: segfault in devdeps CI due to numpy-dev bug astropy/astropy#15797

Closed

mhvk force-pushed the string-ufunc-bugs branch from 5823fa5 to a30aad8 Compare December 31, 2023 22:15

mhvk commented Dec 31, 2023

View reviewed changes

mattip requested a review from ngoldbaum January 1, 2024 08:17

lysnikolaou reviewed Jan 2, 2024

View reviewed changes

matheussouza88 approved these changes Jan 2, 2024

View reviewed changes

mhvk added 2 commits January 2, 2024 11:13

BUG: fix find of single-character with memchr

925471d

BUG: solve core dump in chararray.replace

a9e50b5

mhvk force-pushed the string-ufunc-bugs branch 2 times, most recently from 356f1c3 to 05516f5 Compare January 2, 2024 16:32

BUG: fix replacement with a string longer than input

5b5b568

As well as other cases where no replacement is possible, ensuring that we return a copy of the input (rather than an empty string).

mhvk force-pushed the string-ufunc-bugs branch from 05516f5 to 5b5b568 Compare January 2, 2024 16:34

ngoldbaum merged commit 7b42b52 into numpy:main Jan 2, 2024
63 checks passed

mhvk deleted the string-ufunc-bugs branch January 2, 2024 18:35

ngoldbaum mentioned this pull request Jan 2, 2024

BUG: fix incorrect strcmp implementation for unequal length strings #25522

Merged

neutrinoceros mentioned this pull request Feb 14, 2024

Deprecate use of chararray in io.fits astropy/astropy#3862

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: three string ufunc bugs, one leading to segfault #25515

BUG: three string ufunc bugs, one leading to segfault #25515

mhvk commented Dec 31, 2023 •

edited

Loading

mhvk Dec 31, 2023

lysnikolaou Jan 2, 2024

mhvk Jan 2, 2024

mhvk Jan 2, 2024

rgommers commented Jan 1, 2024

mhvk commented Jan 1, 2024

lysnikolaou left a comment

lysnikolaou Jan 2, 2024

lysnikolaou Jan 2, 2024

mhvk Jan 2, 2024

lysnikolaou Jan 2, 2024

mhvk Jan 2, 2024

mhvk Jan 2, 2024

ngoldbaum commented Jan 2, 2024

ngoldbaum commented Jan 2, 2024

ngoldbaum commented Jan 2, 2024

mhvk commented Jan 2, 2024

	while (tmp2.buf < tmp2.after) {
	if (*tmp2 < 0) {
	return -1;
	}
	if (*tmp2 > 0) {
	return 1;
	}
	tmp2++;

BUG: three string ufunc bugs, one leading to segfault #25515

BUG: three string ufunc bugs, one leading to segfault #25515

Conversation

mhvk commented Dec 31, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rgommers commented Jan 1, 2024

mhvk commented Jan 1, 2024

lysnikolaou left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ngoldbaum commented Jan 2, 2024

ngoldbaum commented Jan 2, 2024

ngoldbaum commented Jan 2, 2024

mhvk commented Jan 2, 2024

mhvk commented Dec 31, 2023 •

edited

Loading