Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG/REG: Possible bug/regression in np.vectorize with strings #26269

Closed
seberg opened this issue Apr 12, 2024 · 4 comments · Fixed by #26270
Closed

BUG/REG: Possible bug/regression in np.vectorize with strings #26269

seberg opened this issue Apr 12, 2024 · 4 comments · Fixed by #26270

Comments

@seberg
Copy link
Member

seberg commented Apr 12, 2024

Copying from pydata/xarray#8932 (comment) just to at least check. I am not sure there is too much to be done here if the string length changes, to be honest (and we have better strings now).
It would seem that np.vectorize is probably just fundamentally tricky around it (i.e. why NumPy uses vectorize with object output and then casts the result, at least unless the result length is obvious).

On the other hand, I would feel a bit safer knowing why it changed, and I don' tknow right now:


Coping from @keewis, there:

The pure numpy reproducer is (which was kinda what @kmuehlbauer described, but now you can try yourself):

import numpy as np

arr = np.array(["SOme wOrd DŽ ß ᾛ ΣΣ ffi⁵Å Ç Ⅰ"]).astype(np.str_)
f = str.casefold
print(np.vectorize(f, otypes=[arr.dtype])(arr))

str.casefold returns a string that's four characters longer, and before numpy<2 the dtype would be extended to fit the string (<30U) while with numpy>=2 the old dtype would be used, truncating the string.

str.capitalize does not change the length of the string, which is why that doesn't fail.

FWIW, the new string dtype doesn't have that issue, and numpy.strings has a lot of the vectorized functions we've been creating using np.vectorize (i.e. those will most likely be much faster).

@ngoldbaum
Copy link
Member

Maybe because of #26136? Although looking at that now I don't think it was backported. If that's the reason, could add an exception there if dtype.kind is S or U.

@seberg
Copy link
Member Author

seberg commented Apr 12, 2024

Aha, that seems likely, the time frame of 2-3 week had come up! I am not sure that the CI run in question isn't using the nightlies (@keewis you probably know?).

But in that case, it is also good to know that we don't have to worry about it for 2.0.

@keewis
Copy link
Contributor

keewis commented Apr 12, 2024

we're indeed using the nightlies from scientific-python-nightly-wheels (not the release candidate), so yes, that might be it.

Edit: do you know what the upload frequency of that is? Once per week?
Edit2: in any case, thanks a lot for the very quick fix, @seberg, @ngoldbaum

@seberg
Copy link
Member Author

seberg commented Apr 12, 2024

Edit: do you know what the upload frequency of that is? Once per week?

Twice a week: on Wednesday and Sunday (although, now that 2.0 is out, I guess we could reduce it again to once a week), and occasionally we trigger it manually to unblock something or push a change out quickly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants