Add round-trip casts between unicode and ASCIIDType #13

ngoldbaum · 2022-12-12T22:01:40Z

Since NumPy stores unicode data internally as UCS4, I can do these casts relatively straightforwardly without worrying about encoding details.

For unicode to ASCII, I check character by character (i.e. 4 bytes at a time) whether the UCS4 character is valid ASCII. If it is, I assign the first byte of the character to the corresponding byte of the output array. If we find a non-ASCII character, we re-acquire the GIL, set a TypeError, release the GIL, and return.
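A minimal sketch of the narrowing check described above, not the code from this PR. To stay compilable without Python.h it assumes a `uint32_t` stand-in for `Py_UCS4` and reads whole code points rather than individual bytes; the names `ucs4_t` and `ucs4_to_ascii` are hypothetical. Where the sketch returns -1, the real cast would set the Python exception.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for Py_UCS4 so the sketch builds without Python.h. */
typedef uint32_t ucs4_t;

/*
 * Narrow n UCS4 code points into n bytes of ASCII.
 * Returns 0 on success, or -1 at the first code point above 127.
 * Bytes already written to out before the failure are left in place.
 */
int
ucs4_to_ascii(const ucs4_t *in, size_t n, unsigned char *out)
{
    for (size_t i = 0; i < n; i++) {
        if (in[i] > 127) {
            /* not ASCII: the real loop would raise a TypeError here */
            return -1;
        }
        out[i] = (unsigned char)in[i];
    }
    return 0;
}
```

Returning a status code keeps the inner loop free of Python API calls, so error reporting can happen in one place after the loop exits.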

For ASCII to unicode, no error checking is needed, so we just set the ASCII character to the first byte of the corresponding character in the output array, and set the rest of the bytes in the character to zero.
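The widening direction can be sketched the same way. This is an illustration under the same assumptions as before (a `uint32_t` stand-in for `Py_UCS4`, hypothetical names `ucs4_t` and `ascii_to_ucs4`), not the code from this PR. Assigning the byte to a 4-byte integer zeroes the upper three bytes implicitly, whatever the host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for Py_UCS4 so the sketch builds without Python.h. */
typedef uint32_t ucs4_t;

/*
 * Widen n ASCII bytes into n UCS4 code points.
 * Every ASCII byte is already a valid code point, so nothing can fail.
 */
void
ascii_to_ucs4(const unsigned char *in, size_t n, ucs4_t *out)
{
    for (size_t i = 0; i < n; i++) {
        /* integer conversion fills the high bytes with zero */
        out[i] = (ucs4_t)in[i];
    }
}
```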

This also includes some other misc fixes I noticed along the way around error checking and handling reference counts.

Finally, mostly as a note to myself to check this tomorrow, I noticed that if I do a round-trip from unicode to ascii back to unicode, but I phrase it like this:

arr = np.array(["hello", "this", "is", "an", "array"])
ascii_arr = arr.astype(ASCIIDType(5))
round_trip_arr = ascii_arr.astype(np.unicode_)

the resulting array will still have ASCIIDType(5) as the dtype. If I specify the output dtype as a string (ascii_arr.astype('U5'), as I've done in the round-trip test I added in this PR), it works fine. I need to look into the implementation of astype to understand why that's happening; I suspect it's a NumPy bug.

seberg

The resolvers are not quite correct, and I would suggest using a type alias for unsigned char (maybe just Py_UCS1).

Otherwise just a few smaller comments.

asciidtype/asciidtype/src/casts.c

seberg · 2022-12-13T08:32:13Z

asciidtype/tests/test_asciidtype.py

+            TypeError,
+            match="Can only store ASCII text in a ASCIIDType array.",
+        ):
+            arr.astype(ASCIIDType(5))


I am thinking astype already supports arr.astype(ASCIIDType), but I am not sure. It is definitely reachable e.g. via ufuncs, but that is a bit trickier.

It doesn't support it yet, that's numpy/numpy#22756. I'm going to try poking at that once I finish here.

Ah, right... I didn't change it anywhere, I guess. Starting with astype() is probably good since it is likely clearer or at least simpler than np.array().

seberg · 2022-12-13T08:33:15Z

asciidtype/asciidtype/src/casts.c

+    }
+    else {
+        copy_size = out_size;
+    }


Ah, this type of setup could of course also be done in the get_loop if it helps a lot. But it hardly matters in practice...
(I guess it might for HPy support or so, but that is another tricky thing to figure out one day.)

ngoldbaum · 2022-12-13T20:36:12Z

Thanks so much for the comments @seberg, I learn a ton every time you give me code review.

Would you mind taking another look at this when you have a chance?

ngoldbaum · 2022-12-14T01:47:32Z

asciidtype/asciidtype/src/casts.c

+    UnicodeToASCIICastSpec->nout = 1;
+    UnicodeToASCIICastSpec->casting = NPY_UNSAFE_CASTING,
+    UnicodeToASCIICastSpec->flags =
+            (NPY_METH_NO_FLOATINGPOINT_ERRORS | NPY_METH_REQUIRES_PYAPI);


I didn’t know about NPY_METH_REQUIRES_PYAPI until I happened to come across it in the experimental dtype header today. Out of curiosity, what are the downsides of not releasing the GIL in a casting or ufunc loop? For this one I only need a Python API function for error handling, so I could re-acquire the GIL manually in the error condition as I originally had it.

You want to release the GIL as much as possible (except for a very small amount of work, since there is a cost to releasing, although maybe that cost has since been reduced).

N.B.: we are relying (right now) on the fact that in CPython (and PyPy cpyext) accessing the object is OK even without the GIL, so long as we only get the elsize, etc.
For general Python implementations, this may need to happen in the setup, or we have to think about how exactly the descrs are passed.

seberg

Just a few comments, I would undo the GIL grabbing thingy and simplify the unicode -> ascii loop by also casting to Py_UCS4.

seberg · 2022-12-13T21:08:33Z

asciidtype/asciidtype/src/casts.c

+            }
+            // UCS4 character is ascii, so copy first byte of character
+            // into output, ignoring the rest
+            *(out + i) = *(in + i * 4);


Similar to the second loop where you changed it, I think this would be simpler:

Py_UCS4 c = ((Py_UCS4 *)in)[i];
if (c > 127) {
    // not ascii
}
out[i] = c;

I don't think you are trying to support unaligned access anymore anyway. Plus, the current code is probably endianness-specific, and just fails on big-endian systems.
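To illustrate the endianness point, here is a small sketch that applies the byte-wise read to two hand-built buffers holding the same code point in each byte order. The helper name `first_byte` is hypothetical, and the buffers are written out explicitly so the result does not depend on the machine running the example.

```c
#include <stddef.h>

/*
 * Byte-wise read in the style of the original loop:
 * take byte 0 of the i-th 4-byte character.
 */
unsigned char
first_byte(const unsigned char *in, size_t i)
{
    return in[i * 4];
}

/* 'A' (U+0041) as one UCS4 character in each byte order. */
const unsigned char ucs4_a_little_endian[4] = {0x41, 0x00, 0x00, 0x00};
const unsigned char ucs4_a_big_endian[4] = {0x00, 0x00, 0x00, 0x41};
```

On the little-endian layout `first_byte` yields 0x41, but on the big-endian layout it yields 0x00, so the character is lost. Reading the whole character as a 4-byte integer and narrowing it avoids the problem, because the integer value is the same in both layouts once loaded in native byte order.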

seberg · 2022-12-14T09:56:42Z

asciidtype/asciidtype/src/casts.c

+        PyArray_Descr *unicode_descr = PyArray_DescrNewFromType(NPY_UNICODE);
+        // numpy stores unicode as UCS4 (4 bytes wide), so bitshift
+        // by 2 to get the number of bytes needed to store the UCS4 charaters
+        unicode_descr->elsize = in_size << 2;


My very personal thing would be to use * 4 and / 4 and trust the compiler to know that it's just a bit shift. (not that I checked that they do :)).

ngoldbaum added 6 commits December 12, 2022 14:33

add missing return for error case

f5d9204

add missing error checking in get_value

09a84ef

remove incorrect decref for borrowed reference

38630cc

add NPY_UNUSED for unused parameters to ascii_to_ascii_get_loop

f7bb0ba

increase maximum allowed line length

9e86760

add ascii to unicode and unicode to ascii casts

bc38f89

ngoldbaum force-pushed the add-asciidtype branch from b775d32 to bc38f89 Compare December 12, 2022 22:10

seberg reviewed Dec 13, 2022

ngoldbaum added 5 commits December 13, 2022 12:21

make new_asciidtype_instance take a long instead of PyObject*

d73cec2

ascii <-> unicode resolve_descriptors return correct descriptors for …

23ad336

…when the output descriptor is abstract

remove get_loop and fix casting safety

e4a019a

use unsigned char in ucs4_character_is_ascii

bcbc0c7

simplify ascii to unicode casting use PY_UCS types

3f21ba6

ngoldbaum commented Dec 14, 2022

seberg reviewed Dec 14, 2022

ngoldbaum added 2 commits December 14, 2022 11:15

simplify unicode to ascii cast

7b88321

don't use NPY_METH_REQUIRES_PYAPI

dae600b

ngoldbaum merged commit 92353e5 into numpy:main Dec 14, 2022

SwayamInSync added a commit that referenced this pull request Sep 9, 2025

Merge pull request #13 from SwayamInSync/void_cast

03b4ecb

void->quad gives error than segfault
