ENH: Added np.char.slice_ #20694

Open
wants to merge 9 commits into
base: main

Contributor

@madphysicist madphysicist commented Jan 1, 2022

There are numerous examples of string slicing being a frequently requested feature:

Given the existence of the char module, there is no reason not to include a basic slicing operation that is cheaper than making views and copies of a string, or switching to pandas for this one feature. This PR introduces such a function. It is written entirely in Python and does its best to avoid copying any data.

The original inspiration for this is my answer to the first question in the list above. I've added a couple of features since then, like the ability to have a meaningful non-unit step and the ability to set the length of non-unit-step chunks.

There are two things I'm not sure about with this PR:

  1. The other methods in defchararray.py appear as methods of the subclass np.chararray. Should slice be added to those as well, or is it better to keep access to it only via np.char.slice, as the documentation for chararray indicates? I left it out for now since it's a quick change to make if desired.
  2. I don't really understand the array_function_dispatch decorator. While I've done my best to emulate the other functions in the same module, I hope someone with knowhow can look it over.

(Thorough) tests are included. Mailing list thread for new feature starts here: https://mail.python.org/archives/list/numpy-discussion@python.org/thread/JIK2T5XJJPDFIJM5VRPDXGZFMUYCVV5H/

    newarray = ndarray(buffer=base, offset=newoffset + realoffset,
                       shape=newshape, strides=newstrides,
                       dtype=newdtype)
except ValueError as e:
Member

It should be possible to achieve this with stride_tricks so that a copy is never returned.

Contributor Author

Good point. Let me see what stride tricks is doing.

Contributor Author

@eric-wieser, I have to ask: how is this line in as_strided fundamentally different from what you marked above: array = np.asarray(DummyArray(interface, base=x))? Is it because DummyArray is "trusted" since it's already an array subclass, whereas what I have here attempts to make a new one from scratch? The only thing that won't work here is that as_strided does not accept an offset. I can hack in another parameter to do that.

Contributor Author

@eric-wieser I tried it your way: madphysicist@e422186. However, as it turns out, it makes no difference. The issue is with changing dtype: both ndarray and as_strided complain equally when you try to do that unless the array is contiguous. Maybe that's something worth bothering about, maybe not. Within the scope of this PR, doing what I'm doing now seems to be a fix. I do need to add a test for an array created as a non-contiguous view into a bytes object or something.
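To make the contiguity complaint concrete, here is a minimal sketch of the behaviour described above. The helper name `can_view_as` is hypothetical and only used for illustration; the exact error message differs between NumPy versions, so the sketch only checks whether a `ValueError` is raised.

```python
import numpy as np

def can_view_as(arr, dtype):
    """Return True if arr.view(dtype) succeeds, False if NumPy raises ValueError."""
    try:
        arr.view(dtype)
    except ValueError:
        return False
    return True

words = np.array([b'abcd', b'efgh', b'ijkl', b'mnop'], dtype='S4')
every_other = words[::2]  # non-contiguous: stride is 8 bytes, itemsize is 4

# Contiguous base: splitting each element into single characters is allowed.
contiguous_ok = can_view_as(words, 'S1')
# Strided view: changing the itemsize is rejected.
strided_ok = can_view_as(every_other, 'S1')
# Strided view, same itemsize: no contiguity requirement applies.
same_size_ok = can_view_as(every_other, 'V4')
```

Only the itemsize change on the strided array fails; a same-size reinterpretation of the same strided array goes through.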

Contributor Author

@eric-wieser Please take a look at the changes and additional tests I made to deal with non-contiguous base arrays. I basically made the assumption that somewhere there is a contiguous block of memory available, got its address and calculated the size out to the last element I need, then used that to create my view. This is a deficiency in the API: I left a ranty comment about it in the code. There is no reason a responsible user shouldn't be able to use as_strided to view with a new dtype. In the meantime, please enjoy my hack.

Contributor Author

@eric-wieser Given that I basically agree with your preference for using as_strided, would you prefer that I move the dtype-modifying code out to as_strided, similarly to the private branch I linked above? It would definitely make this function cleaner and less error prone.

Member

@eric-wieser eric-wieser left a comment

Ideally we'd be able to implement this as:

def slice(arr, sl):
    dt = arr.dtype
    char_dt = np.dtype((dt.type, 1))
    arr_chars = arr.view((char_dt, dt.itemsize // char_dt.itemsize))
    sliced = arr_chars[..., sl]
    # last axis is noncontiguous, view not possible
    if sliced.shape[-1] != 1 and sliced.strides[-1] != char_dt.itemsize:
        sliced = sliced.copy()
    return sliced.view((dt.type, sliced.shape[-1]))

But that doesn't seem to work (in my old numpy version). Changing the logic of view to allow it seems like the nicest fix, but that might not be viable.

In general __array_interface__['data'] is in deep hack territory, and such deep hacks don't belong in a string-specific operation like the slice proposed in this PR: either we need to fix the problem causing the hack in the first place, or add some helper API like as_strided.

I haven't been following recent numpy development, so there may be a variant on what I do above that does work.

@madphysicist
Contributor Author

There is a bit of nuance here because you may want to actively change the dtype and offset (as I'm doing) since a 10-character slice of a <U50 string should be <U10, not <U50, and may start at a non-integer multiple of dtype('<U10').itemsize. My preference is to add offset and dtype functionality to as_strided. I'm working on a PR to split off a lower-level version of np.ndarray to use with as_strided, while the public interface stays the same. Will let you know when I have something lined up.
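As a rough illustration of the arithmetic involved (the function name `char_slice_layout` is hypothetical, and byte order is ignored for simplicity), the new dtype and the byte offset of a character slice could be computed like this:

```python
import numpy as np

def char_slice_layout(dtype, start, stop):
    """Return (new_dtype, byte_offset) for characters [start:stop] of a string dtype."""
    dtype = np.dtype(dtype)
    if dtype.kind not in ('S', 'U'):
        raise TypeError("expected a bytes ('S') or unicode ('U') dtype")
    # Size of a single character: 1 byte for 'S', 4 bytes for 'U'.
    char_size = np.dtype(dtype.kind + '1').itemsize
    new_dtype = np.dtype(f'{dtype.kind}{stop - start}')
    return new_dtype, start * char_size

# A 10-character slice of a <U50 string, starting at character 3.
new_dtype, offset = char_slice_layout('<U50', 3, 13)
```

Here the slice has a 40-byte itemsize but starts 12 bytes into each element, which is why an offset in bytes (rather than in elements of the new dtype) is needed.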

@madphysicist
Contributor Author

@eric-wieser I've taken a stab at changing the logic of view here: #20722. This might be all it takes to get this PR working. I'd still like to have overlapping chunks as a possible thing, but I'd like to see how #20722 goes before adding another PR to enhance as_strided.

@eric-wieser
Member

From that PR:

I still want to add offset and dtype arguments to as_strided. It's still affected by this contiguity check because it uses asarray under the hood, but I really don't think it should be. Do you have any thoughts on that?

Can you elaborate on what semantics you want these to have?

@madphysicist
Contributor Author

For this PR, I can probably get rid of chunksize. Right now, with the updates to dtype, assuming a.dtype == np.dtype('S1'), we can get the main functionality of slice_ implemented without any copying as

 a[..., None].view('S1')[start:end].view(f'S{end-start}').squeeze()

(The squeeze is figurative). I rather like the idea of being able to sample arbitrary characters from the string, in forward, backward and even reverse order. I think we can still do that with just view and slicing. I can see how potentially overlapping chunks of arbitrary size may be a bit of a stretch (though I really like the idea), so the modifications to as_strided are not necessarily a prerequisite for this PR.
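A runnable variant of the one-liner above might look like the following sketch. It assumes a bytes ('S') array and a unit step, slices the last (character) axis explicitly, and uses a hypothetical function name, `slice_bytes`. Whether the result is a view or a copy depends on whether the installed NumPy accepts a dtype change when only the last axis is contiguous, so the sketch falls back to a copy if `view` raises.

```python
import numpy as np

def slice_bytes(arr, start, stop):
    """Return element[start:stop] for every element of a bytes ('S') array."""
    arr = np.ascontiguousarray(arr)
    if arr.dtype.kind != 'S':
        raise TypeError("this sketch only handles bytes ('S') arrays")
    width = arr.dtype.itemsize
    start, stop, _ = slice(start, stop).indices(width)
    n = stop - start
    if n < 1:
        raise ValueError("empty slices are not handled in this sketch")
    # One extra trailing axis holding the individual characters.
    chars = arr.view('S1').reshape(arr.shape + (width,))
    piece = chars[..., start:stop]  # last axis still has stride 1
    try:
        # No copy if NumPy allows a dtype change with a contiguous last axis.
        joined = piece.view(f'S{n}')
    except ValueError:
        # Otherwise fall back to a copy.
        joined = np.ascontiguousarray(piece).view(f'S{n}')
    return joined[..., 0]  # drop the length-1 axis left by the view

a = np.array([b'hello', b'world', b'numpy'])
out = slice_bytes(a, 1, 4)
```

`np.shares_memory(out, a)` can be used to check whether a particular NumPy version took the no-copy path.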

offset: number of bytes to add to current array's base address when viewing. This seems like a natural addition when working with strings. I actually got the idea from np.ndarray. The challenge I see here is dealing with subclasses, but it's my understanding that all arrays implement the buffer protocol (I may be very very wrong about that).

dtype: datatype with which to view elements. This one seems pretty straightforward. Given that you can completely screw up just about everything just using strides and shape, I don't see any reason not to trust a competent user to be able to change the dtype as well.

With these two changes, the snippet above could become

as_strided(a, offset=start * a.dtype.itemsize, dtype=f'S{end - start}')

In some ways, I find this easier to understand, since it only makes a single coherent transformation rather than multiple changes.
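The offset and dtype arguments to as_strided are only a proposal in this thread, not an existing API. The same single transformation can be sketched today with the np.ndarray constructor, which already accepts buffer, offset, strides and dtype. The sketch below assumes a C-contiguous bytes ('S') array, where one character is one byte; `slice_via_offset` is a hypothetical name.

```python
import numpy as np

def slice_via_offset(a, start, stop):
    """View characters [start:stop] of each element of a bytes array without copying."""
    a = np.ascontiguousarray(a)  # returns `a` itself if it is already C-contiguous
    if a.dtype.kind != 'S':
        raise TypeError("this sketch only handles bytes ('S') arrays")
    if not 0 <= start < stop <= a.dtype.itemsize:
        raise ValueError("slice must be non-empty and lie within the element")
    return np.ndarray(
        shape=a.shape,
        dtype=f'S{stop - start}',  # narrower element type
        buffer=a,                  # reuse the memory of `a`
        offset=start,              # byte offset of the first character kept
        strides=a.strides,         # step between elements is unchanged
    )

a = np.array([b'hello', b'world', b'numpy'])
out = slice_via_offset(a, 1, 4)
```

Because the strides are kept while the itemsize shrinks, the result is a non-contiguous view into the original buffer.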

Updated tests, docs.
Updated docs in anticipation of updated as_strided