-
-
Notifications
You must be signed in to change notification settings - Fork 9.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: fix incorrect strcmp implementation for unequal length strings #25522
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so it was that part. But nicer realizing that after the end a string should be considered all 0, so the other one is always larger.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic seems sound now.
return -1; | ||
} | ||
if (*tmp1 > 0) { | ||
if (*tmp1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am confused, wouldn't this make:
In [1]: "a" == "a\0"
Out[1]: False
Return True
? For NumPy strings, this should be handled already earlier by the length, and for the new strings embedded 0 characters should be supported.
EDIT: Or I don't know maybe we need strcmp_padded
for legacy strings...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm learning new things about numpy strings every day. For some reason I thought that this was True
until now, but I'm clearly wrong:
In [1]: import numpy as np
In [2]: np.__version__
Out[2]: '1.26.3'
In [3]: np.str_("a") == np.str_("a\0")
Out[3]: False
The scalar repr doesn't include the trailing nulls, which I guess makes sense but definitely added to my confusion here:
In [4]: np.str_("a\0")
Out[4]: 'a'
In [5]: repr(np.str_("a\0")) == repr(np.str_("a"))
Out[5]: True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I guess the above is only true for scalars, for array elements trailing nulls are ignored in Numpy 1.26.3:
In [12]: np.array([np.str_("a\0")]) == np.array([np.str_("a")])
Out[12]: array([ True])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just shows that scalars are broken... they should strip nulls on construction to not inherit the unicode behavior, I guess.
Thanks Nathan. Sorry, I was confused thinking about the new stringdtype for which the trailing 0 character handling would have been incorrect. |
Pulling this in, we have a separate implementation for UTF-8 strings that doesn't need this |
Followup for #25515, the result of
strcmp
shouldn't depend on the characters of the longer string for characters after the length of the shorter string