Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: three string ufunc bugs, one leading to segfault #25515

Merged
merged 3 commits into from
Jan 2, 2024

Conversation

mhvk
Copy link
Contributor

@mhvk mhvk commented Dec 31, 2023

#25171 added string ufuncs, but unfortunately did not properly update the rather sparse test suite for chararray, in particular failing to test the various fast paths that were created. While one bug was fixed in #25484, not all were: as noted in #25513, the present version causes segfaults in astropy. It turned out there were multiple mistakes:

  1. The memchr version of findchar always returns 0 for the offset;
  2. In replacements, the length of the remaining buffer is not tracked, leading to buffer overflow.
  3. The fast-track code for "no replacement possible" gives empty strings instead of a copy of the input;
    These 3 commits fix those bugs, adding tests to avoid regressions.

I note that this only increases test coverage for these cases; a larger addition seems warranted.

Fixes #25513

EDIT: ordering changed.

@@ -62,7 +62,7 @@ findchar(Buffer<enc> s, Py_ssize_t n, npy_ucs4 ch)
Buffer<enc> p = s, e = s + n;

if (n > MEMCHR_CUT_OFF) {
p = s.buffer_memchr(ch, n);
p = p.buffer_memchr(ch, n);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This solution works, but it sort-of works around the fact that buffer_memchr updates the position of the buffer in-place. My C++ skills were insufficient to trivially solve that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true. I think it'd be preferrable to fix buffer_memchr to return a new buffer without side effects on the calling one.

This would do it probably:

inline Buffer<enc>
buffer_memchr(npy_ucs4 ch, int len)
{
    Buffer<enc> newbuf = *this;
    switch (enc) {
    case ENCODING::ASCII:
        newbuf.buf = (char *) memchr(buf, ch, len);
        break;
    case ENCODING::UTF32:
        newbuf.buf = (char *) wmemchr((wchar_t *) buf, ch, len);
        break;
    }
    return newbuf;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that makes more sense, and seems to work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, the first commit now changes buffer_memchr like you suggested.

@mattip mattip requested a review from ngoldbaum January 1, 2024 08:17
@rgommers
Copy link
Member

rgommers commented Jan 1, 2024

Thanks for the fixes @mhvk!

I note that this only increases test coverage for these cases; a larger addition seems warranted.

In the meantime, would it be useful/practical to run the AstroPy test suite (ideally only the relevant subset of it, so it doesn't take forever) when more string ufuncs are added?

Cc @lysnikolaou as the author of these new ufuncs.

@mhvk
Copy link
Contributor Author

mhvk commented Jan 1, 2024

In the meantime, would it be useful/practical to run the AstroPy test suite (ideally only the relevant subset of it, so it doesn't take forever) when more string ufuncs are added?

I'm not sure that we do enough with strings to make that worthwhile - astropy checking with the nightlies is probably good enough. I think what happened here is understandable - one replaces one way of doing it with another, which is like a refactoring, so it is not illogical to assume the test coverage passing is good enough. The part missed was that the replacement here included quite a lot of special cases which were not tested. Ideally, this would have been asked in review.
Or, better still, it is noticed by test coverage runs, though I don't know how well coverage works for templated C++ code...

Actually, now having had to go over the code, a question for @lysnikolaou - I noticed special paths also for more complicated and very large strings. Are those covered by tests?

Copy link
Contributor

@lysnikolaou lysnikolaou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mhvk! Left two comments, but this looks good in general. Also, I don't think the different fast paths are tested well enough (or maybe at all?) here, but I'm planning to add more tests after we move everything to np.strings.

@@ -62,7 +62,7 @@ findchar(Buffer<enc> s, Py_ssize_t n, npy_ucs4 ch)
Buffer<enc> p = s, e = s + n;

if (n > MEMCHR_CUT_OFF) {
p = s.buffer_memchr(ch, n);
p = p.buffer_memchr(ch, n);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true. I think it'd be preferrable to fix buffer_memchr to return a new buffer without side effects on the calling one.

This would do it probably:

inline Buffer<enc>
buffer_memchr(npy_ucs4 ch, int len)
{
    Buffer<enc> newbuf = *this;
    switch (enc) {
    case ENCODING::ASCII:
        newbuf.buf = (char *) memchr(buf, ch, len);
        break;
    case ENCODING::UTF32:
        newbuf.buf = (char *) wmemchr((wchar_t *) buf, ch, len);
        break;
    }
    return newbuf;
}

Comment on lines -346 to -352
if (len1 < len2
|| (len2 == 0 && len3 == 0)
|| count == 0
|| buf2.strcmp(buf3) == 0) {
out.buffer_fill_with_zeros_after_index(0);
return;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this refactoring needed? If so, could you explain the difference?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, before you returned an empty out if, e.g., the input string is shorter than the replacement string, so cannot possibly match. But in this case, the input string should unchanged, i.e., copied to the output. So, for any of those cases where it is obvious no match will be found, I now fall through directly to copying the input to the output.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course! Thanks for the explanation!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside, while looking through strcmp just to see whether a (len2 == len3) would be useful (it is and I'll include it), I noticed that the case for a longer buf3 would return the wrong sign. Almost certainly never matters, but maybe for a next fix:

while (tmp2.buf < tmp2.after) {
if (*tmp2 < 0) {
return -1;
}
if (*tmp2 > 0) {
return 1;
}
tmp2++;
(I don't think I should do it here, but can if you'd like me to).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While looking through the commits, I refactored this a little further. Hopefully, the intent of the if statement is clearer now.

@mhvk mhvk force-pushed the string-ufunc-bugs branch 2 times, most recently from 356f1c3 to 05516f5 Compare January 2, 2024 16:32
As well as other cases where no replacement is possible,
ensuring that we return a copy of the input (rather than
an empty string).
@ngoldbaum
Copy link
Member

It looks like this fixes another regression I ran into:

import numpy as np

string_list = ['AbcAbcAbc']

o_dtype = np.str_

oarr = np.array(string_list, dtype=o_dtype)

oarr2 = np.array([s + "2" for s in string_list], dtype=o_dtype)


op = np.greater

ores = op(oarr, oarr2)


print(ores)
print(str(oarr[0]) > str(oarr2[0]))

(adapted from the stringdtype tests). On current main this incorrectly prints [True]\nFalse, but on this PR I get the correct answer.

@ngoldbaum
Copy link
Member

Pulling this in. Thanks @mhvk!

I know at one point recently we had a bot commenting about missing C coverage but I haven't seen it in a while, having that available would probably also help. I see @lysnikolaou has already pushed some string ufunc tests to his other PR adding an np.strings namespace.

If we add more ufuncs in the future I'll look at the C coverage manually before merging, keeping in mind that the np.char and chararray tests are patchy at best.

@ngoldbaum ngoldbaum merged commit 7b42b52 into numpy:main Jan 2, 2024
63 checks passed
@ngoldbaum
Copy link
Member

but on this PR I get the correct answer.

Actually not quite right, I think the test code I shared earlier is flaky due to memory corruption, will have a followup PR incoming...

@mhvk
Copy link
Contributor Author

mhvk commented Jan 2, 2024

@ngoldbaum - sounds all good! The fact that the answer is not right perhaps is due to what I noted in #25515 (comment) (which I did not touch here since it seemed out of scope)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: A segfault in chararray (2.0.0dev0 regression)
5 participants