Skip to content

Conversation

ngoldbaum
Copy link
Member

@ngoldbaum ngoldbaum commented Dec 12, 2022

Since Numpy stores unicode data internally as UCS4, I can do these casts relatively straightforwardly without worrying about encoding details.

For unicode to ASCII, I check character by character (e.g. 4 bytes at a time) if the UCS4 character is valid ASCII, if it is I assign the first byte of the character to the corresponding byte of the output array. If we find an invalid ASCII character, we re-acquire the GIL, set a TypeError, release the GIL, and return.

For ASCII to unicode, no error checking is needed, so we just set the ASCII character to the first byte of the corresponding character in the output array, and set the rest of the bytes in the character to zero.

This also includes some other misc fixes I noticed along the way around error checking and handling reference counts.

Finally, mostly as a note to myself to check this tomorrow, I noticed that if I do a round-trip from unicode to ascii back to unicode, but I phrase it like this:

arr = np.array(["hello", "this", "is", "an", "array"])
ascii_arr = arr.astype(ASCIIDType(5))
round_trip_arr = ascii_arr.astype(np.unicode_)

the resulting array will still have ASCIIDType(5) as the dtype. If I specify the output dtype as a string (ascii_arr.astype('U5') as I've done in the round-trip test I added in this PR), it works fine. I need to look into the implementation of astype to understand why that's happening, I suspect it's a numpy bug.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The resolvers are not quite correct and I would suggest to use a type alias for unsigned char (maybe just Py_UCS1).

Otherwise just a few smaller comments.

TypeError,
match="Can only store ASCII text in a ASCIIDType array.",
):
arr.astype(ASCIIDType(5))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking astype already supports arr.astype(ASCIIDType), but I am not sure. It is definitely reachable e.g. via ufuncs, but that is a bit trickier.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't support it yet, that's numpy/numpy#22756. I'm going to try poking at that once I finish here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right... I didn't change it anywhere, I guess. Starting with astype() is probably good since it is likely clearer or at least simpler than np.array().

}
else {
copy_size = out_size;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this type of setup could of course also be done in the get_loop if it helps a lot. But it hardly matters in practice...
(I guess it might for HPy support or so, but that is another tricky thing to figure out one day.)

@ngoldbaum
Copy link
Member Author

Thanks so much for the comments @seberg, I learn a ton every time you give me code review.

Would you mind taking another look at this when you have a chance?

UnicodeToASCIICastSpec->nout = 1;
UnicodeToASCIICastSpec->casting = NPY_UNSAFE_CASTING,
UnicodeToASCIICastSpec->flags =
(NPY_METH_NO_FLOATINGPOINT_ERRORS | NPY_METH_REQUIRES_PYAPI);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn’t know about NPY_METH_REQUIRES_PYAPI until I happened to come across it in the experiemental dtype header today. Out of curiosity, what are the downsides of not releasing the GIL in a casting or ufunc loop? For this one I only need a python API function for error handling, so I could re-acquire the GIL manually in the error condition as I originally had it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You want to release the GIL as much as possible (except for very small amount of work, since there is a cost to releasing; although maybe that cost got reduced also).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N.B.: we are relying (right now) on the fact that in CPython (and pypy cpyext) accessing the object is OK even without the GIL so long we only get the elsize, etc.
For general Python implementations, this may need to happen in the setup, or we have to think about how the descrs are passed exactly.

Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few comments, I would undo the GIL grabbing thingy and simplify the unicode -> ascii loop by also casting to Py_UCS4.

}
// UCS4 character is ascii, so copy first byte of character
// into output, ignoring the rest
*(out + i) = *(in + i * 4);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the second loop where you changed it, I think this would be simpler:

Py_UCS4 c = ((Py_UCS4 *)in)[i];
if (c > 127) {
    // not ascii
}
out[i] = c

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you are trying to support unaligned access anyway anymore. Plus, the current code is probably endianess-specific anyway, and just fails on big endian systems.

UnicodeToASCIICastSpec->nout = 1;
UnicodeToASCIICastSpec->casting = NPY_UNSAFE_CASTING,
UnicodeToASCIICastSpec->flags =
(NPY_METH_NO_FLOATINGPOINT_ERRORS | NPY_METH_REQUIRES_PYAPI);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You want to release the GIL as much as possible (except for very small amount of work, since there is a cost to releasing; although maybe that cost got reduced also).

UnicodeToASCIICastSpec->nout = 1;
UnicodeToASCIICastSpec->casting = NPY_UNSAFE_CASTING,
UnicodeToASCIICastSpec->flags =
(NPY_METH_NO_FLOATINGPOINT_ERRORS | NPY_METH_REQUIRES_PYAPI);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N.B.: we are relying (right now) on the fact that in CPython (and pypy cpyext) accessing the object is OK even without the GIL so long we only get the elsize, etc.
For general Python implementations, this may need to happen in the setup, or we have to think about how the descrs are passed exactly.

PyArray_Descr *unicode_descr = PyArray_DescrNewFromType(NPY_UNICODE);
// numpy stores unicode as UCS4 (4 bytes wide), so bitshift
// by 2 to get the number of bytes needed to store the UCS4 charaters
unicode_descr->elsize = in_size << 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My very personal thing would be to use * 4 and / 4 and trust the compiler to know that it's just a bit shift. (not that I checked that they do :)).

@ngoldbaum ngoldbaum merged commit 92353e5 into numpy:main Dec 14, 2022
SwayamInSync added a commit that referenced this pull request Sep 9, 2025
void->quad gives error than segfault
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants