Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix incorrect strcmp implementation for unequal length strings #25522

Merged
merged 1 commit into from
Jan 3, 2024

Conversation

ngoldbaum
Copy link
Member

@ngoldbaum ngoldbaum commented Jan 2, 2024

Followup for #25515, the result of strcmp shouldn't depend on the characters of the longer string for characters after the length of the shorter string

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so it was that part. But nicer realizing that after the end a string should be considered all 0, so the other one is always larger.

Copy link
Contributor

@lorentzenchr lorentzenchr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic seems sound now.

return -1;
}
if (*tmp1 > 0) {
if (*tmp1) {
Copy link
Member

@seberg seberg Jan 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused, wouldn't this make:

In [1]: "a" == "a\0"
Out[1]: False

Return True? For NumPy strings, this should be handled already earlier by the length, and for the new strings embedded 0 characters should be supported.

EDIT: Or I don't know maybe we need strcmp_padded for legacy strings...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm learning new things about numpy strings every day. For some reason I thought that this was True until now, but I'm clearly wrong:

In [1]: import numpy as np

In [2]: np.__version__
Out[2]: '1.26.3'

In [3]: np.str_("a") == np.str_("a\0")
Out[3]: False

The scalar repr doesn't include the trailing nulls, which I guess makes sense but definitely added to my confusion here:

In [4]: np.str_("a\0")
Out[4]: 'a'

In [5]: repr(np.str_("a\0")) == repr(np.str_("a"))
Out[5]: True

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I guess the above is only true for scalars, for array elements trailing nulls are ignored in Numpy 1.26.3:

In [12]: np.array([np.str_("a\0")]) == np.array([np.str_("a")])
Out[12]: array([ True])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just shows that scalars are broken... they should strip nulls on construction to not inherit the unicode behavior, I guess.

@seberg seberg merged commit 8da91c1 into numpy:main Jan 3, 2024
63 checks passed
@seberg
Copy link
Member

seberg commented Jan 3, 2024

Thanks Nathan. Sorry, I was confused thinking about the new stringdtype for which the trailing 0 character handling would have been incorrect.

@ngoldbaum
Copy link
Member Author

Pulling this in, we have a separate implementation for UTF-8 strings that doesn't need this strcmp implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants