Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add ufunc for np.char.isalpha #24835

Merged
merged 14 commits into from
Oct 4, 2023
Merged

Conversation

lysnikolaou
Copy link
Contributor

No description provided.

@lysnikolaou
Copy link
Contributor Author

lysnikolaou commented Sep 29, 2023

One problem with the current state of this PR that I can't really find a solution to is this:

The isalpha ufunc itself works correctly, both when called directly, and when called from the np.char module with an ndarray.

$ export PYTHONPATH="/Users/lysnikolaou/repos/python/numpy/build-install/usr/lib/python3.11/site-packages"
🐍 Launching Python with PYTHONPATH="/Users/lysnikolaou/repos/python/numpy/build-install/usr/lib/python3.11/site-packages"
$ /usr/bin/env python -P
Python 3.11.3 (tags/v3.11.3:f3909b8bc8, May 29 2023, 12:51:44) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> a = np.array([['abc', 'def'], ['world', ' he']])
>>> np.core.multi
np.core.multiarray np.core.multiply( 
>>> np.core.multiarray.isalpha(a)
array([[ True,  True],
       [ True, False]])
>>> np.char.isalpha(a)
array([[ True,  True],
       [ True, False]])

However, I'm getting a value error when passing a chararray to it.

$ export PYTHONPATH="/Users/lysnikolaou/repos/python/numpy/build-install/usr/lib/python3.11/site-packages"
🐍 Launching Python with PYTHONPATH="/Users/lysnikolaou/repos/python/numpy/build-install/usr/lib/python3.11/site-packages"
$ /usr/bin/env python -P
Python 3.11.3 (tags/v3.11.3:f3909b8bc8, May 29 2023, 12:51:44) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> a = np.char.array([['abc', 'def'], ['world', ' he']])
>>> np.core.multiarray.isalpha(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/lysnikolaou/repos/python/numpy/build-install/usr/lib/python3.11/site-packages/numpy/core/defchararray.py", line 2099, in __array_finalize__
    raise ValueError("Can only create a chararray from string data.")
ValueError: Can only create a chararray from string data.

It seems that the bool array returned from the ufunc is somehow getting cast to a chararray, which is the input type. Is that something that should be happening? Am I missing something more obvious?

@seberg
Copy link
Member

seberg commented Sep 29, 2023

Is that something that should be happening?

Yes, that should be happening. Unlike a custom function, UFuncs will correctly preserve subclass information.

Could probably add something like (untested):

def __array_prepare__(self, arr, context=None):
    if arr.dtype in "US":
        return arr.view(type(self))
    return arr

to make it work, I doubt that has any negative side-effects. Maybe __array_wrap__ but I suspect __array_prepare__ might be more targeeted even. (Alternatively, there is a point why we don't just return a normal array when this happens.)

@lysnikolaou
Copy link
Contributor Author

Is there a way to tell the ufunc to force a specific return type? Because __array_finalize__ allows all of SUbc, I fear that it might break something if we change it in the chararray itself.

@seberg
Copy link
Member

seberg commented Sep 29, 2023

You can use subok=False only (or pass out=), but what is the problem with adding an array prepare like that? I never suggested to losen __array_finalize__.

EDIT: Or does __array_prepare__ kick in only after this error is given?

@lysnikolaou
Copy link
Contributor Author

The same is still happening after applying this:

numpy on  string-ufuncs-isalpha [$!?] is 📦 v2.0.0.dev0 via 🐍 pyenv mambaforge (venv) 
❯ git diff  
diff --git a/numpy/core/defchararray.py b/numpy/core/defchararray.py
index c3968bc43..2c2c7c955 100644
--- a/numpy/core/defchararray.py
+++ b/numpy/core/defchararray.py
@@ -2093,6 +2093,11 @@ def __new__(subtype, shape, itemsize=1, unicode=False, buffer=None,
         _globalvar = 0
         return self
 
+    def __array_prepare__(self, arr, context=None):
+        if arr.dtype.char in "US":
+            return arr.view(type(self))
+        return arr
+
     def __array_finalize__(self, obj):
         # The b is a special case because it is used for reconstructing.
         if not _globalvar and self.dtype.char not in 'SUbc':

@charris charris changed the title Add ufunc for np.char.isalpha ENH: Add ufunc for np.char.isalpha Sep 29, 2023
@lysnikolaou
Copy link
Contributor Author

@mattip @seberg This appears to be okay now. I added an __array_wrap__ which just returns the arr that's being passed in. I don't have a deep enough understanding of the ufunc mechanism yet to figure out if this is okay like this, so I'd like a review please.

@mattip
Copy link
Member

mattip commented Oct 2, 2023

Nice. I think the __array_wrap__ is correct. Could you rerun the benchmarks?

@lysnikolaou
Copy link
Contributor Author

Could you rerun the benchmarks?

Similar results:

| Change   | Before [eabefa4e] <main>   | After [54ef4672] <string-ufuncs-isalpha>   |   Ratio | Benchmark (Parameter)                                   |
|----------|----------------------------|--------------------------------------------|---------|---------------------------------------------------------|
| -        | 7.64±0.04μs                | 2.30±0.1μs                                 |     0.3 | bench_core.NumPyChar.time_isalpha_small_list_big_string |
| -        | 1.34±0.02ms                | 5.15±0.2μs                                 |     0   | bench_core.NumPyChar.time_isalpha_big_list_small_string |

@lysnikolaou
Copy link
Contributor Author

lysnikolaou commented Oct 3, 2023

@ngoldbaum Is stringdtype currently broken on main? I'm trying to test this and building stringdtype fails both with this branch and with the numpy nightly build. Maybe I'm missing something.

@ngoldbaum
Copy link
Member

I'm trying to test this and building stringdtype fails both with this branch and with the numpy nightly build

I don't see this when I build stringdtype using numpy's main branch or this PR. Can you open an issue over on the numpy-user-dtypes repo with more detail about what's happening on your setup? Maybe it's just a documentation thing but if it isn't I'd like to know what the problem is.

FWIW, this does break isalpha for stringdtype because the dtype doesn't have an implementation of the ufunc yet, but I think that's expected. No need for numpy to handle that I think, I can fix it in stringdtype once this is merged.

@lysnikolaou
Copy link
Contributor Author

Sounds good. I'll try to recreate and will open a PR.

@mattip @seberg I've got a follow-up PR ready that adds support for np.add. Can we merge this so that I can open that PR?

* however it may be that this should be moved into `auxdata` eventually,
* which may also be slightly faster/cleaner (but more involved).
*/
int len = steps[0] / sizeof(character);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had ignored it, assuming this was a first iteration for testing. Yes, I know it adds some boilerplate, but we should probably just create the ufunc without any loops and then add them the same way as the string equality is done.

Then change the signature of this loop and the above comment actually can make sense.

Right now, you should get crashes for some sliced arrays arr = arr[::10]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seberg I think I've got something better now. Can you check again and let me know whether I've missed anything else?

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM now, I would like the rstrip template parameter to be removed from isalpha mainly.
I will let Matti take a final quick look when you feel my comments are addressed well enough.

@@ -1137,6 +1144,13 @@ def english_upper(s):
TD(O),
signature='(n?,k),(k,m?)->(n?,m?)',
),
'isalpha' :
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure: this only shows up in the internal ufunc namespace? (actually I guess tests would fail if not)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct, yes.

docstrings.get('numpy.core.umath.isalpha'),
None,
[TypeDescription(U, EmptyFunctionTypeDescr, U, '?'),
TypeDescription(S, EmptyFunctionTypeDescr, S, '?')],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could just leave this empty, instead of the EmptyFunctionTypeDescr? If not, fine, but maybe add a comment to EmptyFunctionTypeDescr to explain what it is used for.

Yes, this ensures it shows up in .types and it may actually be a reasonable way to ensure that in practice. But TBH, it is half-private anyway, so maybe we can just ignore it for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wasn't possible before cause of a syntax error on empty arrays under MSVC, but I wrote some code so that it's okay.

numpy/core/src/umath/string_ufuncs.cpp Outdated Show resolved Hide resolved
numpy/core/src/umath/string_ufuncs.cpp Outdated Show resolved Hide resolved
numpy/core/defchararray.py Outdated Show resolved Hide resolved
@ngoldbaum
Copy link
Member

I'm glad it was relatively straightforward to add support for string ufuncs besides the logic operators, looking forward to improved string operator performance! 🚀

Copy link
Member

@mattip mattip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments, and some documentation nits.

numpy/core/code_generators/generate_umath.py Outdated Show resolved Hide resolved
return -1;
}}
f = PyUFunc_FromFuncAndDataAndSignatureAndIdentity(
NULL, NULL, NULL, {nloops},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only this line is different between the two stanzas. It should be possible to refactor this

  • to select on an attribute of uf rather than have a separate empty list, passing it around seems awkward. Something like what I suggested above?
  • extra points for tweaking fmt for an empty uf rather than rewriting the whole thing

numpy/core/code_generators/ufunc_docstrings.py Outdated Show resolved Hide resolved
numpy/core/code_generators/ufunc_docstrings.py Outdated Show resolved Hide resolved
numpy/core/code_generators/ufunc_docstrings.py Outdated Show resolved Hide resolved
@lysnikolaou
Copy link
Contributor Author

@mattip @seberg Thanks both for the reviews. I've addressed all comments.

@mattip mattip merged commit 64fc516 into numpy:main Oct 4, 2023
48 of 50 checks passed
@mattip
Copy link
Member

mattip commented Oct 4, 2023

Thanks @lysnikolaou

@lysnikolaou
Copy link
Contributor Author

@charris This was the first one of the PRs adding ufuncs for np.char and we'd like to backport this to 1.x. Is it okay to tag you in all of the other PRs that need backporting as well? I'm not sure about the exact process here, so I'd need a few pointers.

@seberg
Copy link
Member

seberg commented Dec 11, 2023

While it seems nice that these are backportable. I don't really think they should be backported for a 1.26.x bug-fix release. These are quite large changes, and chances are some behavior is changed.

@charris
Copy link
Member

charris commented Dec 11, 2023

Agree with @seberg here, we don't usually add new features in patch releases.

@lysnikolaou
Copy link
Contributor Author

Got it, thanks both for your input!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants