ENH: Added np.char.slice_ #20694

madphysicist · 2022-01-01T13:33:31Z

String slicing is a frequently requested feature, as numerous examples show:

Given the existence of the char module, there is no reason not to include a basic slicing operation that is cheaper than making views and copies of a string, or switching to pandas for this one feature. This PR introduces such a function. It's written entirely in Python, and does its absolute best not to make a copy of any data.
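To make the no-copy idea concrete, here is a minimal sketch rather than the PR's actual code. It builds a new array over the same memory with the np.ndarray constructor, in the spirit of the diff excerpt quoted in the review below. The helper name slice_view is hypothetical, and the sketch assumes a C-contiguous array of fixed-width strings and a unit step.

```python
import numpy as np

def slice_view(arr, start, stop):
    """Hypothetical helper: view characters [start, stop) of every element.

    Assumes a C-contiguous array of dtype 'S' (or native-order 'U') and
    0 <= start < stop <= the string width. No data is copied.
    """
    # Size of one character in bytes: 1 for 'S', 4 for 'U'.
    char_size = np.dtype((arr.dtype.type, 1)).itemsize
    # The result is narrower: a 3-character slice of S5 becomes S3.
    new_dtype = np.dtype((arr.dtype.type, stop - start))
    return np.ndarray(
        shape=arr.shape,
        dtype=new_dtype,
        buffer=arr,                  # reuse the original memory
        offset=start * char_size,    # skip the leading characters
        strides=arr.strides,         # still step one full element at a time
    )

words = np.array([b"hello", b"world"])
middle = slice_view(words, 1, 4)
print(middle)
print(np.shares_memory(words, middle))
```

Because the element strides are kept while the itemsize shrinks, each output element simply ignores the characters outside the requested window.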

The original inspiration for this is my answer to the first question in the list above. I've added a couple of features since then, like the ability to have a meaningful non-unit step and the ability to set the length of non-unit-step chunks.

There are two things I'm not sure about with this PR:

1. The other methods in defchararray.py appear as methods of the subclass np.chararray. Should slice be added to those as well, or is it better to keep access to it only via np.char.slice, as the documentation for chararray indicates? I left it out for now since it's a quick change to make if desired.
2. I don't really understand the array_function_dispatch decorator. While I've done my best to emulate the other functions in the same module, I hope someone with the know-how can look it over.

(Thorough) tests are included. Mailing list thread for new feature starts here: https://mail.python.org/archives/list/numpy-discussion@python.org/thread/JIK2T5XJJPDFIJM5VRPDXGZFMUYCVV5H/

eric-wieser · 2022-01-01T13:38:27Z

numpy/core/defchararray.py

+        newarray = ndarray(buffer=base, offset=newoffset + realoffset,
+                           shape=newshape, strides=newstrides,
+                           dtype=newdtype)
+    except ValueError as e:


It should be possible to achieve this with stride_tricks so that a copy is never returned.

Good point. Let me see what stride tricks is doing.

@eric-wieser I have to ask: how is this line in as_strided fundamentally different from what you marked above: array = np.asarray(DummyArray(interface, base=x))? Is it because DummyArray is "trusted" because it's already an array subclass, vs. what I have here, which is attempting to make a new one from scratch? The only thing that won't work here is that as_strided does not accept an offset. I can hack in another parameter to do that.

@eric-wieser I tried it your way: madphysicist@e422186. However, as it turns out, it makes no difference. The issue is with changing dtype: both ndarray and as_strided complain equally when you try to do that unless the array is contiguous. Maybe that's something worth bothering about, maybe not. Within the scope of this PR, doing what I'm doing now seems to be a fix. I do need to add a test for an array created as a non-contiguous view into a bytes object or something.
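The dtype restriction described here can be reproduced in a few lines. This is an illustrative sketch, not code from the PR: it reinterprets single characters as wider strings with view, once on a contiguous array and once on a strided one. The exact error message differs between NumPy versions, so only the exception type is checked.

```python
import numpy as np

chars = np.array([[b"a", b"b", b"c", b"d"]], dtype="S1")

# Contiguous last axis: four 1-byte items can be reinterpreted
# as a single 4-byte item without copying.
joined = chars.view("S4")
print(joined)

# Taking every other character leaves a 2-byte stride between 1-byte
# items, so the same reinterpretation is rejected.
every_other = chars[:, ::2]
try:
    every_other.view("S2")
    view_failed = False
except ValueError as exc:
    view_failed = True
    print("view rejected:", exc)
```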

@eric-wieser Please take a look at the changes and additional tests I made to deal with non-contiguous base arrays. I basically made the assumption that somewhere there is a contiguous block of memory available, got its address and calculated the size out to the last element I need, then used that to create my view. This is a deficiency in the API: I left a ranty comment about it in the code. There is no reason a responsible user shouldn't be able to use as_strided to view with a new dtype. In the meantime, please enjoy my hack.
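The arithmetic behind that workaround can be sketched as follows. This is an assumption about the general approach, not the code in the PR: it reads the start addresses from __array_interface__['data'] and computes how far a non-contiguous view starts from, and extends into, its contiguous base.

```python
import numpy as np

base = np.zeros(10, dtype="S4")   # one contiguous 40-byte block
picked = base[2::3]               # elements 2, 5, 8: a non-contiguous view

# __array_interface__['data'] is a (address, read_only) tuple.
base_addr = base.__array_interface__["data"][0]
view_addr = picked.__array_interface__["data"][0]

# Bytes from the start of the block to the first viewed element.
offset = view_addr - base_addr
# Bytes from the first viewed element to the end of the last one.
extent = (picked.shape[0] - 1) * picked.strides[0] + picked.itemsize

print(offset, extent, base.nbytes)
```

The view fits inside the block as long as offset + extent does not exceed base.nbytes, which is the condition such a hack has to rely on.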

@eric-wieser Given that I basically agree with your preference for using as_strided, would you prefer that I move the dtype-modifying code out to as_strided, similarly to the private branch I linked above? It would definitely make this function cleaner and less error prone.

eric-wieser

Ideally we'd be able to implement this as:

import numpy as np

def slice(arr, sl):
    dt = arr.dtype
    char_dt = np.dtype((dt.type, 1))
    arr_chars = arr.view((char_dt, dt.itemsize // char_dt.itemsize))
    sliced = arr_chars[..., sl]
    # last axis is noncontiguous, view not possible
    if sliced.shape[-1] != 1 and sliced.strides[-1] != char_dt.itemsize:
        sliced = sliced.copy()
    return sliced.view((dt.type, sliced.shape[-1]))

But that doesn't seem to work (in my old numpy version). Changing the logic of view to allow it seems like the nicest fix, but that might not be viable.
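For comparison, a fallback that always copies sidesteps the view restriction entirely and also handles non-unit steps. The sketch below is an illustration under stated assumptions, not a proposal from the thread: the name slice_with_copy is hypothetical, the input is assumed to be at least 1-D with a fixed-width 'S' or 'U' dtype, and the slice is assumed to select at least one character.

```python
import numpy as np

def slice_with_copy(arr, sl):
    """Hypothetical helper: slice every string in `arr`, always copying."""
    arr = np.ascontiguousarray(arr)
    char_dt = np.dtype((arr.dtype.type, 1))          # S1 or U1
    width = arr.dtype.itemsize // char_dt.itemsize   # characters per element
    # Split each element into characters along a new last axis.
    chars = arr.view(char_dt).reshape(arr.shape + (width,))
    # Apply the slice, then copy so the characters are contiguous again.
    picked = np.ascontiguousarray(chars[..., sl])
    out_dt = np.dtype((arr.dtype.type, picked.shape[-1]))
    # Rejoin the characters; the view leaves a trailing axis of length 1.
    return picked.view(out_dt)[..., 0]

words = np.array([b"hello", b"world"])
print(slice_with_copy(words, np.s_[::2]))
print(slice_with_copy(words, np.s_[::-1]))
```

The cost is one copy of the selected characters, which is exactly what the view-based approaches in this thread try to avoid for the unit-step case.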

In general __array_interface__['data'] is in deep hack territory, and such deep hacks don't belong in a string-specific operation like the slice proposed in this PR: either we need to fix the problem causing the hack in the first place, or add some helper API like as_strided.

I haven't been following recent numpy development, so there may be a variant on what I do above that does work.

madphysicist · 2022-01-02T22:04:22Z

There is a bit of nuance here because you may want to actively change the dtype and offset (as I'm doing) since a 10-character slice of a <U50 string should be <U10, not <U50, and may start at a non-integer multiple of dtype('<U10').itemsize. My preference is to add offset and dtype functionality to as_strided. I'm working on a PR to split off a lower-level version of np.ndarray to use with as_strided, while the public interface stays the same. Will let you know when I have something lined up.
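The byte arithmetic behind that remark is easy to check. The start index of 3 below is an arbitrary value chosen for illustration.

```python
import numpy as np

wide = np.dtype("<U50")
narrow = np.dtype("<U10")
char_size = np.dtype("<U1").itemsize   # bytes per character for 'U'

start = 3                              # illustrative starting character
byte_offset = start * char_size

print(wide.itemsize, narrow.itemsize)  # element sizes in bytes
print(byte_offset, byte_offset % narrow.itemsize)
```

A slice starting at character 3 begins 12 bytes into the element, which is not a multiple of the 40-byte itemsize of the narrower dtype, so the offset cannot be expressed as a whole number of output elements.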

madphysicist · 2022-01-04T08:16:34Z

@eric-wieser I've taken a stab at changing the logic of view here: #20722. This might be all it takes to get this PR working. I'd still like to have overlapping chunks as a possible thing, but I'd like to see how #20722 goes before adding another PR to enhance as_strided.

eric-wieser · 2022-01-05T17:41:47Z

From that PR:

I still want to add offset and dtype arguments to as_strided. It's still affected by this contiguity check because it uses asarray under the hood, but I really don't think it should be. Do you have any thoughts on that?

Can you elaborate on what semantics you want these to have?

madphysicist · 2022-01-05T18:08:57Z

For this PR, I can probably get rid of chunksize. Right now, with the updates to dtype, assuming a.dtype == np.dtype('S1'), we can get the main functionality of slice_ implemented without any copying as

 a[..., None].view('S1')[start:end].view(f'S{end-start}').squeeze()

(The squeeze is figurative). I rather like the idea of being able to sample arbitrary characters from the string, in forward, backward and even reverse order. I think we can still do that with just view and slicing. I can see how potentially overlapping chunks of arbitrary size may be a bit of a stretch (though I really like the idea), so the modifications to as_strided are not necessarily a prerequisite for this PR.
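A concrete, simplified instance of the view-then-slice-then-view pattern, for a single string stored as an array of single characters, might look like this. It is a sketch only: the sample text and indices are made up, and the full per-element version in the snippet above depends on the relaxed dtype rules discussed in this thread.

```python
import numpy as np

# One string, exposed as an array of 1-byte characters (no copy).
text = np.frombuffer(b"hello world", dtype="S1")

start, end = 2, 7
# A unit-step slice is still contiguous, so it can be reinterpreted
# as a single string of length end - start.
piece = text[start:end].view(f"S{end - start}")

print(piece)
print(np.shares_memory(text, piece))
```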

offset: number of bytes to add to current array's base address when viewing. This seems like a natural addition when working with strings. I actually got the idea from np.ndarray. The challenge I see here is dealing with subclasses, but it's my understanding that all arrays implement the buffer protocol (I may be very very wrong about that).

dtype: datatype with which to view elements. This one seems pretty straightforward. Given that you can completely screw up just about everything just using strides and shape, I don't see any reason not to trust a competent user to be able to change the dtype as well.

With these two changes, the snippet above could become

as_strided(a, offset=start * a.dtype.itemsize, dtype=f'S{end - start}')

In some ways, I find this easier to understand, since it only makes a single coherent transformation rather than multiple changes.
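Since as_strided does not take offset or dtype arguments as of this discussion, the intended semantics can be approximated with the np.ndarray constructor, which does accept buffer, offset, strides and dtype. The following is a sketch of those semantics under that substitution, not the proposed as_strided API; the sample data is made up.

```python
import numpy as np

data = b"hello"
width = 3

# A single chunk: start one byte in and read `width` bytes as one string.
chunk = np.ndarray(shape=(1,), dtype=f"S{width}", buffer=data, offset=1)

# Overlapping chunks: advance one byte per element while each element
# still spans `width` bytes.
count = len(data) - width + 1
windows = np.ndarray(shape=(count,), dtype=f"S{width}",
                     buffer=data, offset=0, strides=(1,))

print(chunk)
print(windows)
```

Nothing is copied in either case; the overlapping elements alias the same bytes, which is why this kind of view needs the same care as any other as_strided result.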

github-actions bot added the 01 - Enhancement label Jan 1, 2022

eric-wieser reviewed Jan 1, 2022

madphysicist mentioned this pull request Jan 2, 2022

ENH? BUG? Unexpected restrictions in behavior of view creating when switching dtypes #20705

Open

eric-wieser reviewed Jan 2, 2022

madphysicist mentioned this pull request Jan 4, 2022

ENH: Removed requirement for C-contiguity when changing to dtype of different size #20722

Merged

madphysicist added 8 commits January 8, 2022 03:23

ENH: Added np.char.slice_

4d0db71

MAINT: Fixed linter errors

4d410ce

MAINT: Fixed remaining linter issues

4a1167b

ENH: Added support for non-numpy buffers

430adcb

MAINT: Fixed another linter mistake

0056fda

MAINT: Added release note

da404de

MAINT: I just want the linter to be happy!

89efa96

BUG: Added missing stacklevel

3fd48e9

madphysicist force-pushed the char_slice branch from 5fff2ef to 3fd48e9 Compare January 8, 2022 09:23

MAINT: Removed annoying swap of start and stop

01dc74c

Updated tests, docs. Updated docs in anticipation of updated as_strided
