BUG: fix incorrect strcmp implementation for unequal length strings #25522

ngoldbaum · 2024-01-02T18:39:35Z

Follow-up to #25515. The result of strcmp shouldn't depend on the characters of the longer string past the length of the shorter string.

mhvk

Ah, so it was that part. But it's nicer to realize that past its end a string should be considered all 0, so the other one is always larger if it still has a nonzero character.
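
To make the zero-padding rule concrete, here is a minimal Python sketch of a three-way comparison over code-point sequences. It is an illustration only, not the C++ code in `string_buffer.h`; the names `strcmp_padded` (borrowed from the discussion) and `codes` are hypothetical helpers, and the inputs are assumed to be non-negative code points.

```python
def strcmp_padded(a, b):
    """Three-way compare of two code-point sequences, shorter one zero-padded.

    Hypothetical helper for illustration; not NumPy's actual implementation.
    Assumes every element is a non-negative integer.
    """
    n = min(len(a), len(b))
    # Compare the common prefix element by element.
    for x, y in zip(a[:n], b[:n]):
        if x != y:
            return -1 if x < y else 1
    # Past its end the shorter input counts as all zeros, so any nonzero
    # element left in the longer input makes that input the larger one.
    if any(a[n:]):
        return 1
    if any(b[n:]):
        return -1
    return 0


def codes(s):
    """Convert a Python str to a list of code points (illustrative helper)."""
    return [ord(c) for c in s]
```

Under this rule `strcmp_padded(codes("a"), codes("a\0"))` compares equal, while `strcmp_padded(codes("a"), codes("ab"))` is negative whatever the value of the extra character, as long as it is nonzero.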

lorentzenchr

The logic seems sound now.

seberg · 2024-01-03T08:44:33Z

numpy/_core/src/umath/string_buffer.h

-                return -1;
-            }
-            if (*tmp1 > 0) {
+            if (*tmp1) {


I am confused, wouldn't this make:

In [1]: "a" == "a\0"
Out[1]: False

return True? For NumPy strings, this should already be handled earlier by the length, and for the new strings, embedded 0 characters should be supported.

EDIT: Or, I don't know, maybe we need strcmp_padded for legacy strings...

ngoldbaum · 2024-01-03

I'm learning new things about numpy strings every day. For some reason I thought that this was True until now, but I'm clearly wrong:

In [1]: import numpy as np
In [2]: np.__version__
Out[2]: '1.26.3'
In [3]: np.str_("a") == np.str_("a\0")
Out[3]: False

The scalar repr doesn't include the trailing nulls, which I guess makes sense but definitely added to my confusion here:

In [4]: np.str_("a\0")
Out[4]: 'a'
In [5]: repr(np.str_("a\0")) == repr(np.str_("a"))
Out[5]: True

Although I guess the above is only true for scalars; for array elements, trailing nulls are ignored in NumPy 1.26.3:

In [12]: np.array([np.str_("a\0")]) == np.array([np.str_("a")])
Out[12]: array([ True])

seberg · 2024-01-03

Just shows that scalars are broken... they should strip nulls on construction to not inherit the unicode behavior, I guess.
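
As a rough illustration of what stripping nulls on construction could look like, the sketch below uses a plain Python `str` subclass. The class name `PaddedStr` is hypothetical and this is not how `np.str_` is implemented; it only contrasts the inherited unicode comparison with a type that drops trailing NULs up front.

```python
class PaddedStr(str):
    """Hypothetical str subclass that drops trailing NULs when constructed."""

    def __new__(cls, value=""):
        # Only trailing NUL characters are removed; embedded ones are kept.
        return super().__new__(cls, str(value).rstrip("\0"))


# Plain Python str keeps the trailing NUL, so the two values differ.
plain_equal = "a" == "a\0"

# After stripping on construction, both sides hold just "a".
stripped_equal = PaddedStr("a") == PaddedStr("a\0")

# An embedded NUL survives; only the trailing one is dropped.
embedded_kept = PaddedStr("a\0b\0") == "a\0b"
```

With this approach the scalar comparison would agree with the array comparison shown above, at the cost of no longer being able to represent a trailing NUL in a scalar.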

seberg · 2024-01-03T18:45:27Z

Thanks, Nathan. Sorry, I was confused: I was thinking about the new stringdtype, for which the trailing 0 character handling would have been incorrect.

ngoldbaum · 2024-01-03T18:48:36Z

Pulling this in. We have a separate implementation for UTF-8 strings that doesn't need this strcmp implementation.

ngoldbaum requested a review from lysnikolaou January 2, 2024 18:39

github-actions bot added the 00 - Bug label Jan 2, 2024

BUG: fix incorrect strcmp implementation for unequal length strings

9980712

ngoldbaum force-pushed the fix-strcmp branch from 4be68cb to 9980712 Compare January 2, 2024 18:43

mhvk approved these changes Jan 3, 2024

lorentzenchr approved these changes Jan 3, 2024

seberg reviewed Jan 3, 2024

seberg merged commit 8da91c1 into numpy:main Jan 3, 2024
63 checks passed
