Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAINT: what should we do about string ufuncs that return lists? #26176

Open
ngoldbaum opened this issue Mar 29, 2024 · 5 comments
Open

MAINT: what should we do about string ufuncs that return lists? #26176

ngoldbaum opened this issue Mar 29, 2024 · 5 comments
Labels
component: numpy.strings String dtypes and functions

Comments

@ngoldbaum
Copy link
Member

We talked a little bit about this in #25993 but didn't resolve things.

Currently, the following functions don't have implementations in np.strings:

# Removed from namespace until behavior has been crystalized
# "join", "split", "rsplit", "splitlines",

We should figure out the appropriate API for adding fast ufuncs for these. I'm actually not sure if it's possible to write ufuncs for these given that they have an unknown number of outputs for each array element, you really need ragged arrays for this to make sense. Although that said, we could also return object arrays of python lists or just leave them with the _vec_string implementation.

@rgommers rgommers added the component: numpy.strings String dtypes and functions label Apr 1, 2024
@lysnikolaou
Copy link
Contributor

To me it doesn't make sense to create ufuncs for these. Even if we were to come up with a smart way to return arrays with empty strings wherever necessary, that would be incredibly memory-inefficient. Adding those to np.strings and explicitly mentioning in the docs that these use the old vec_string approach, which comes with performance implications, seems like the way to go.

@mhvk
Copy link
Contributor

mhvk commented Apr 2, 2024

Agreed that really we do not have good tools for this - it needs ragged arrays of some form (which, if sharing the allocator, might be able to indicate the strings themselves really efficiently, just by adjusting offsets and lengths).

My preference would to not implement these in np.strings for now, but rather document why we have not included them. Maybe someone else will come up with a good way to do it!? Or someone implements ragged arrays and we can use those. I think it may be better not to get stuck on the output given by vec_string.

@ngoldbaum
Copy link
Member Author

One thing we can do that does make sense for many applications is a padded split. We would need to implement a two-pass algorithm that finds the locations of all the splits, finds the entry with the most splits, and uses that to compute the output shape.

@mhvk
Copy link
Contributor

mhvk commented Apr 2, 2024

I guess this is where @lysnikolaou's comment comes in:

Even if we were to come up with a smart way to return arrays with empty strings wherever necessary, that would be incredibly memory-inefficient.

But maybe it is still worth doing?

@ngoldbaum
Copy link
Member Author

ngoldbaum commented Apr 2, 2024

np.char.split has about 750 code search results. That's not a ton but it's not nothing either. The only one in a "popular" repository is ONNX, which goes out of its way to implement a padded split:

https://github.com/onnx/onnx/blob/366ac64b5af0af2d02b019d84664e6b46b3ceb09/onnx/reference/ops/op_string_split.py#L25-L36

That said, this usage wouldn't find a padded split particularly useful:

https://github.com/mindspore-ai/mindspore/blob/5f329d13e3154083d7ee03afd544be64d5d4b765/mindspore/python/mindspore/profiler/parser/ascend_hccl_generator.py#L217

For comparison np.char.partition only has about 100 code search results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component: numpy.strings String dtypes and functions
Projects
None yet
Development

No branches or pull requests

4 participants