ENH: Added np.char.slice_ #20694
    newarray = ndarray(buffer=base, offset=newoffset + realoffset,
                       shape=newshape, strides=newstrides,
                       dtype=newdtype)
except ValueError as e:
It should be possible to achieve this with stride_tricks so that a copy is never returned.
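As a rough illustration of the suggestion above (not code from this PR), the sketch below uses numpy.lib.stride_tricks.as_strided to pick every other character out of a contiguous string array without copying. The array contents and the names words, chars, and every_other are made up for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Hypothetical input: a C-contiguous array of 6-character strings.
words = np.array(["abcdef", "ghijkl"])

# Reinterpret each string as six 1-character elements (still a view).
chars = words.view("U1").reshape(words.shape + (-1,))   # shape (2, 6)

# Keep every other character by doubling the stride along the last axis.
every_other = as_strided(
    chars,
    shape=(chars.shape[0], 3),
    strides=(chars.strides[0], 2 * chars.strides[1]),
)

print(every_other.tolist())                  # [['a', 'c', 'e'], ['g', 'i', 'k']]
print(np.shares_memory(every_other, words))  # True: no copy was made
```

Note that as_strided performs no bounds checking, so the shape and strides have to be computed carefully by the caller.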
Good point. Let me see what stride tricks is doing.
@eric-wieser. I have to ask. How is this line in as_strided fundamentally different from what you marked above: array = np.asarray(DummyArray(interface, base=x))? Is it because DummyArray is "trusted" because it's already an array subclass, vs what I have here, which is attempting to make a new one from scratch? The only thing that won't work here is that as_strided does not accept an offset. I can hack in another parameter to do that.
@eric-wieser I tried it your way: madphysicist@e422186. However, as it turns out, it makes no difference. The issue is with changing dtype: both ndarray and as_strided complain equally when you try to do that unless the array is contiguous. Maybe that's something worth bothering about, maybe not. Within the scope of this PR, doing what I'm doing now seems to be a fix. I do need to add a test for an array created as a non-contiguous view into a bytes object or something.
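A small reproduction of the contiguity restriction described above, as a sketch with made-up data. It uses ndarray.view rather than the PR's code path, and the exact error message differs between NumPy versions, so only the exception type is relied on here.

```python
import numpy as np

# Hypothetical input: four 4-character strings, C-contiguous.
words = np.array(["abcd", "efgh", "ijkl", "mnop"])

# Contiguous case: reinterpreting as single characters works.
chars = words.view("U1")
print(chars.shape)            # (16,)

# Non-contiguous case: every other string, so consecutive elements
# are 32 bytes apart while each element is only 16 bytes wide.
strided = words[::2]
try:
    strided.view("U1")
    failed = False
except ValueError as exc:
    failed = True
    print("view failed:", exc)
```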
@eric-wieser Please take a look at the changes and additional tests I made to deal with non-contiguous base arrays. I basically made the assumption that somewhere there is a contiguous block of memory available, got its address and calculated the size out to the last element I need, then used that to create my view. This is a deficiency in the API: I left a ranty comment about it in the code. There is no reason a responsible user shouldn't be able to use as_strided to view with a new dtype. In the meantime, please enjoy my hack.
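For the simple case where the base array is already contiguous, the idea of building a view from a buffer, an offset, and strides can be sketched with the public np.ndarray constructor. This is only an illustration under that assumption, not the PR's implementation; the names base, start, length, and view are hypothetical, and the non-contiguous handling discussed above is not covered.

```python
import numpy as np

# Hypothetical input: contiguous array of 6-character strings.
base = np.array(["abcdef", "ghijkl"])
char_size = np.dtype("U1").itemsize     # bytes per character

start, length = 1, 3                    # take characters 1..3 of every string

view = np.ndarray(
    shape=base.shape,
    dtype=f"U{length}",                 # narrower string dtype for the result
    buffer=base,                        # contiguous block backing the data
    offset=start * char_size,           # skip the first `start` characters
    strides=base.strides,               # still step one full string at a time
)

print(view.tolist())                    # ['bcd', 'hij']

# Writing to the base is visible through the view, so nothing was copied.
base[0] = "XYZdef"
print(view[0])                          # YZd
```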
@eric-wieser Given that I basically agree with your preference for using as_strided, would you prefer that I move the dtype-modifying code out to as_strided, similarly to the private branch I linked above? It would definitely make this function cleaner and less error prone.
Ideally we'd be able to implement this as:
def slice(arr, sl):
    dt = arr.dtype
    char_dt = np.dtype((dt.type, 1))
    arr_chars = arr.view((char_dt, dt.itemsize // char_dt.itemsize))
    sliced = arr_chars[..., sl]
    # last axis is noncontiguous, view not possible
    if sliced.shape[-1] != 1 and sliced.strides[-1] != char_dt.itemsize:
        sliced = sliced.copy()
    return sliced.view((dt.type, sliced.shape[-1]))
But that doesn't seem to work (in my old numpy version). Changing the logic of view to allow it seems like the nicest fix, but that might not be viable.
In general __array_interface__['data'] is in deep hack territory, and such deep hacks don't belong in a string-specific operation like the slice proposed in this PR: either we need to fix the problem causing the hack in the first place, or add some helper API like as_strided.
I haven't been following recent numpy development, so there may be a variant on what I do above that does work.
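One variant that should run on current NumPy is sketched below. It is not the reviewer's snippet and not the PR's code: it trades the no-copy goal for simplicity by making the sliced characters contiguous (which copies) before viewing them back as strings. The function name slice_chars is hypothetical, and the sketch assumes native byte order and a fixed-width str or bytes dtype.

```python
import numpy as np

def slice_chars(arr, sl):
    """Slice every string in `arr` with `sl`; returns a copy, not a view."""
    arr = np.asarray(arr)
    if arr.dtype.kind not in "US":
        raise TypeError("expected a fixed-width str or bytes array")

    char_dt = np.dtype((arr.dtype.type, 1))            # 'U1' or 'S1'
    width = arr.dtype.itemsize // char_dt.itemsize     # characters per element

    # View the data as individual characters along a new last axis.
    chars = np.ascontiguousarray(arr).view(char_dt).reshape(arr.shape + (width,))
    sliced = chars[..., sl]

    n = sliced.shape[-1]
    if n == 0:
        # Empty slice: every result is the empty string.
        return np.zeros(arr.shape, dtype=char_dt)

    # A contiguous last axis is needed to change the itemsize, so copy if required.
    sliced = np.ascontiguousarray(sliced)
    out_dt = np.dtype((arr.dtype.type, n))
    return sliced.view(out_dt).reshape(arr.shape)

words = np.array(["abcdef", "ghijkl"])
result = slice_chars(words, slice(1, 4))
print(result.tolist())                                 # ['bcd', 'hij']
```

Because the result is always built from a contiguous copy, negative and non-unit steps work the same way as unit steps, at the cost of the memory sharing the PR is aiming for.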
There is a bit of nuance here because you may want to actively change the dtype and offset (as I'm doing) since a 10-character slice of a
@eric-wieser I've taken a stab at changing the logic of
From that PR:
Can you elaborate on what semantics you want these to have?
For this PR, I can probably get rid of
(The squeeze is figurative). I rather like the idea of being able to sample arbitrary characters from the string, in forward, backward and even reverse order. I think we can still do that with just
With these two changes, the snippet above could become
In some ways, I find this easier to understand, since it only makes a single coherent transformation rather than multiple changes.
5fff2ef to 3fd48e9
Updated tests, docs. Updated docs in anticipation of updated as_strided
There are numerous examples of string slicing being a frequently requested feature:
Given the existence of the char module, there is no reason not to include a basic slicing operation that is cheaper than making views and copies of a string, or switching to pandas for this one feature. This PR introduces such a function. It's written entirely in Python, and does its absolute best not to make a copy of any data.
The original inspiration for this is my answer to the first question in the list above. I've added a couple of features since then, like the ability to have a meaningful non-unit step and the ability to set the length of non-unit-step chunks.
There are two things I'm not sure about with this PR:
1. Functions in defchararray.py appear as methods of the subclass np.chararray. Should slice be added to those as well, or is it better to keep access to it only via np.char.slice, as the documentation for chararray indicates? I left it out for now since it's a quick change to make if desired.
2. The array_function_dispatch decorator. While I've done my best to emulate the other functions in the same module, I hope someone with knowhow can look it over.
(Thorough) tests are included. Mailing list thread for new feature starts here: https://mail.python.org/archives/list/numpy-discussion@python.org/thread/JIK2T5XJJPDFIJM5VRPDXGZFMUYCVV5H/