MAINT: what should we do about string ufuncs that return lists? #26176

ngoldbaum · 2024-03-29T21:28:54Z

We talked a little bit about this in #25993 but didn't resolve things.

Currently, the following functions don't have implementations in np.strings:

Lines 66 to 67 in ef6299c

    
           # Removed from namespace until behavior has been crystalized 
        
           # "join", "split", "rsplit", "splitlines",

We should figure out the appropriate API for adding fast ufuncs for these. I'm actually not sure if it's possible to write ufuncs for these given that they have an unknown number of outputs for each array element, you really need ragged arrays for this to make sense. Although that said, we could also return object arrays of python lists or just leave them with the _vec_string implementation.

The text was updated successfully, but these errors were encountered:

lysnikolaou · 2024-04-02T12:10:44Z

To me it doesn't make sense to create ufuncs for these. Even if we were to come up with a smart way to return arrays with empty strings wherever necessary, that would be incredibly memory-inefficient. Adding those to np.strings and explicitly mentioning in the docs that these use the old vec_string approach, which comes with performance implications, seems like the way to go.

mhvk · 2024-04-02T13:34:34Z

Agreed that really we do not have good tools for this - it needs ragged arrays of some form (which, if sharing the allocator, might be able to indicate the strings themselves really efficiently, just by adjusting offsets and lengths).

My preference would to not implement these in np.strings for now, but rather document why we have not included them. Maybe someone else will come up with a good way to do it!? Or someone implements ragged arrays and we can use those. I think it may be better not to get stuck on the output given by vec_string.

ngoldbaum · 2024-04-02T15:07:05Z

One thing we can do that does make sense for many applications is a padded split. We would need to implement a two-pass algorithm that finds the locations of all the splits, finds the entry with the most splits, and uses that to compute the output shape.

mhvk · 2024-04-02T18:42:17Z

I guess this is where @lysnikolaou's comment comes in:

Even if we were to come up with a smart way to return arrays with empty strings wherever necessary, that would be incredibly memory-inefficient.

But maybe it is still worth doing?

ngoldbaum · 2024-04-02T20:10:49Z

np.char.split has about 750 code search results. That's not a ton but it's not nothing either. The only one in a "popular" repository is ONNX, which goes out of its way to implement a padded split:

https://github.com/onnx/onnx/blob/366ac64b5af0af2d02b019d84664e6b46b3ceb09/onnx/reference/ops/op_string_split.py#L25-L36

That said, this usage wouldn't find a padded split particularly useful:

https://github.com/mindspore-ai/mindspore/blob/5f329d13e3154083d7ee03afd544be64d5d4b765/mindspore/python/mindspore/profiler/parser/ascend_hccl_generator.py#L217

For comparison np.char.partition only has about 100 code search results.

rgommers added the component: numpy.strings String dtypes and functions label Apr 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MAINT: what should we do about string ufuncs that return lists? #26176

MAINT: what should we do about string ufuncs that return lists? #26176

ngoldbaum commented Mar 29, 2024

lysnikolaou commented Apr 2, 2024

mhvk commented Apr 2, 2024

ngoldbaum commented Apr 2, 2024

mhvk commented Apr 2, 2024

ngoldbaum commented Apr 2, 2024 •

edited

MAINT: what should we do about string ufuncs that return lists? #26176

MAINT: what should we do about string ufuncs that return lists? #26176

Comments

ngoldbaum commented Mar 29, 2024

lysnikolaou commented Apr 2, 2024

mhvk commented Apr 2, 2024

ngoldbaum commented Apr 2, 2024

mhvk commented Apr 2, 2024

ngoldbaum commented Apr 2, 2024 • edited

ngoldbaum commented Apr 2, 2024 •

edited