Add round-trip casts between unicode and ASCIIDType #13
Conversation
Force-pushed from b775d32 to bc38f89
The resolvers are not quite correct, and I would suggest using a type alias for unsigned char (maybe just Py_UCS1). Otherwise just a few smaller comments.
    TypeError,
    match="Can only store ASCII text in a ASCIIDType array.",
):
    arr.astype(ASCIIDType(5))
I am thinking astype already supports arr.astype(ASCIIDType), but I am not sure. It is definitely reachable e.g. via ufuncs, but that is a bit trickier.
It doesn't support it yet, that's numpy/numpy#22756. I'm going to try poking at that once I finish here.
Ah, right... I didn't change it anywhere, I guess. Starting with astype() is probably good since it is likely clearer, or at least simpler, than np.array().
}
else {
    copy_size = out_size;
}
Ah, this type of setup could of course also be done in the get_loop if it helps a lot. But it hardly matters in practice... (I guess it might for HPy support or so, but that is another tricky thing to figure out one day.)
…when the output descriptor is abstract
Thanks so much for the comments @seberg, I learn a ton every time you give me code review. Would you mind taking another look at this when you have a chance?
asciidtype/asciidtype/src/casts.c
Outdated
UnicodeToASCIICastSpec->nout = 1;
UnicodeToASCIICastSpec->casting = NPY_UNSAFE_CASTING,
UnicodeToASCIICastSpec->flags =
    (NPY_METH_NO_FLOATINGPOINT_ERRORS | NPY_METH_REQUIRES_PYAPI);
I didn't know about NPY_METH_REQUIRES_PYAPI until I happened to come across it in the experimental dtype header today. Out of curiosity, what are the downsides of not releasing the GIL in a casting or ufunc loop? For this one I only need a Python API function for error handling, so I could re-acquire the GIL manually in the error condition as I originally had it.
You want to release the GIL as much as possible (except for very small amounts of work, since there is a cost to releasing; although maybe that cost got reduced recently).
N.B.: we are relying (right now) on the fact that in CPython (and PyPy cpyext) accessing the object is OK even without the GIL so long as we only get the elsize, etc. For general Python implementations, this may need to happen in the setup, or we have to think about how the descrs are passed exactly.
Just a few comments. I would undo the GIL-grabbing thingy and simplify the unicode -> ascii loop by also casting to Py_UCS4.
asciidtype/asciidtype/src/casts.c
Outdated
}
// UCS4 character is ascii, so copy first byte of character
// into output, ignoring the rest
*(out + i) = *(in + i * 4);
Similar to the second loop where you changed it, I think this would be simpler:

Py_UCS4 c = ((Py_UCS4 *)in)[i];
if (c > 127) {
    // not ascii
}
out[i] = c;
I don't think you are trying to support unaligned access anymore anyway. Plus, the current code is probably endianness-specific and would just fail on big-endian systems.
PyArray_Descr *unicode_descr = PyArray_DescrNewFromType(NPY_UNICODE);
// numpy stores unicode as UCS4 (4 bytes wide), so bitshift
// by 2 to get the number of bytes needed to store the UCS4 characters
unicode_descr->elsize = in_size << 2;
My very personal thing would be to use * 4 and / 4 and trust the compiler to know that it's just a bit shift. (Not that I checked that they do :).)
Since NumPy stores unicode data internally as UCS4, I can do these casts relatively straightforwardly without worrying about encoding details.

For unicode to ASCII, I check character by character (i.e. 4 bytes at a time) whether the UCS4 character is valid ASCII; if it is, I assign the first byte of the character to the corresponding byte of the output array. If we find an invalid ASCII character, we re-acquire the GIL, set a TypeError, release the GIL, and return.

For ASCII to unicode, no error checking is needed, so we just set the ASCII character to the first byte of the corresponding character in the output array, and set the rest of the bytes in the character to zero.

This also includes some other misc fixes I noticed along the way around error checking and handling reference counts.
Finally, mostly as a note to myself to check this tomorrow: I noticed that if I do a round-trip from unicode to ascii back to unicode, but I phrase it like this:

the resulting array will still have ASCIIDType(5) as the dtype. If I specify the output dtype as a string (ascii_arr.astype('U5'), as I've done in the round-trip test I added in this PR), it works fine. I need to look into the implementation of astype to understand why that's happening; I suspect it's a numpy bug.