-
-
Notifications
You must be signed in to change notification settings - Fork 9.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: three string ufunc bugs, one leading to segfault #25515
Conversation
5823fa5
to
a30aad8
Compare
@@ -62,7 +62,7 @@ findchar(Buffer<enc> s, Py_ssize_t n, npy_ucs4 ch) | |||
Buffer<enc> p = s, e = s + n; | |||
|
|||
if (n > MEMCHR_CUT_OFF) { | |||
p = s.buffer_memchr(ch, n); | |||
p = p.buffer_memchr(ch, n); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This solution works, but it sort-of works around the fact that buffer_memchr
updates the position of the buffer in-place. My C++ skills were insufficient to trivially solve that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true. I think it'd be preferrable to fix buffer_memchr
to return a new buffer without side effects on the calling one.
This would do it probably:
inline Buffer<enc>
buffer_memchr(npy_ucs4 ch, int len)
{
Buffer<enc> newbuf = *this;
switch (enc) {
case ENCODING::ASCII:
newbuf.buf = (char *) memchr(buf, ch, len);
break;
case ENCODING::UTF32:
newbuf.buf = (char *) wmemchr((wchar_t *) buf, ch, len);
break;
}
return newbuf;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that makes more sense, and seems to work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, the first commit now changes buffer_memchr
like you suggested.
Thanks for the fixes @mhvk!
In the meantime, would it be useful/practical to run the AstroPy test suite (ideally only the relevant subset of it, so it doesn't take forever) when more string ufuncs are added? Cc @lysnikolaou as the author of these new ufuncs. |
I'm not sure that we do enough with strings to make that worthwhile - astropy checking with the nightlies is probably good enough. I think what happened here is understandable - one replaces one way of doing it with another, which is like a refactoring, so it is not illogical to assume the test coverage passing is good enough. The part missed was that the replacement here included quite a lot of special cases which were not tested. Ideally, this would have been asked in review. Actually, now having had to go over the code, a question for @lysnikolaou - I noticed special paths also for more complicated and very large strings. Are those covered by tests? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @mhvk! Left two comments, but this looks good in general. Also, I don't think the different fast paths are tested well enough (or maybe at all?) here, but I'm planning to add more tests after we move everything to np.strings
.
@@ -62,7 +62,7 @@ findchar(Buffer<enc> s, Py_ssize_t n, npy_ucs4 ch) | |||
Buffer<enc> p = s, e = s + n; | |||
|
|||
if (n > MEMCHR_CUT_OFF) { | |||
p = s.buffer_memchr(ch, n); | |||
p = p.buffer_memchr(ch, n); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true. I think it'd be preferrable to fix buffer_memchr
to return a new buffer without side effects on the calling one.
This would do it probably:
inline Buffer<enc>
buffer_memchr(npy_ucs4 ch, int len)
{
Buffer<enc> newbuf = *this;
switch (enc) {
case ENCODING::ASCII:
newbuf.buf = (char *) memchr(buf, ch, len);
break;
case ENCODING::UTF32:
newbuf.buf = (char *) wmemchr((wchar_t *) buf, ch, len);
break;
}
return newbuf;
}
if (len1 < len2 | ||
|| (len2 == 0 && len3 == 0) | ||
|| count == 0 | ||
|| buf2.strcmp(buf3) == 0) { | ||
out.buffer_fill_with_zeros_after_index(0); | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this refactoring needed? If so, could you explain the difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, before you returned an empty out if, e.g., the input string is shorter than the replacement string, so cannot possibly match. But in this case, the input string should unchanged, i.e., copied to the output. So, for any of those cases where it is obvious no match will be found, I now fall through directly to copying the input to the output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course! Thanks for the explanation!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside, while looking through strcmp
just to see whether a (len2 == len3)
would be useful (it is and I'll include it), I noticed that the case for a longer buf3
would return the wrong sign. Almost certainly never matters, but maybe for a next fix:
numpy/numpy/_core/src/umath/string_buffer.h
Lines 299 to 306 in d4a66a3
while (tmp2.buf < tmp2.after) { | |
if (*tmp2 < 0) { | |
return -1; | |
} | |
if (*tmp2 > 0) { | |
return 1; | |
} | |
tmp2++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While looking through the commits, I refactored this a little further. Hopefully, the intent of the if
statement is clearer now.
356f1c3
to
05516f5
Compare
As well as other cases where no replacement is possible, ensuring that we return a copy of the input (rather than an empty string).
05516f5
to
5b5b568
Compare
It looks like this fixes another regression I ran into:
(adapted from the stringdtype tests). On current |
Pulling this in. Thanks @mhvk! I know at one point recently we had a bot commenting about missing C coverage but I haven't seen it in a while, having that available would probably also help. I see @lysnikolaou has already pushed some string ufunc tests to his other PR adding an If we add more ufuncs in the future I'll look at the C coverage manually before merging, keeping in mind that the |
Actually not quite right, I think the test code I shared earlier is flaky due to memory corruption, will have a followup PR incoming... |
@ngoldbaum - sounds all good! The fact that the answer is not right perhaps is due to what I noted in #25515 (comment) (which I did not touch here since it seemed out of scope) |
#25171 added string ufuncs, but unfortunately did not properly update the rather sparse test suite for
chararray
, in particular failing to test the various fast paths that were created. While one bug was fixed in #25484, not all were: as noted in #25513, the present version causes segfaults inastropy
. It turned out there were multiple mistakes:memchr
version offindchar
always returns 0 for the offset;These 3 commits fix those bugs, adding tests to avoid regressions.
I note that this only increases test coverage for these cases; a larger addition seems warranted.
Fixes #25513
EDIT: ordering changed.