Handle array format fixes with C strings #18

jakirkham · 2018-11-24T21:53:23Z

Copies over format and itemsize information for Python 2 to start with. This largely removes the differences between Python 2/3 array format handling (with the exception of Python 2 only formats). Then handle format checks with C strings for all cases. As the strings should all by one char long in our case. Simply compare the char values directly.

As unicode arrays are represented with either a "u" or "w" format to indicate UCS2 or UCS4, check both of these independently at runtime. This drops the use of Py_UNICODE_SIZE in the code, which should make binaries a bit more portable as they are not dependent on the underlying unicode character size.

Checks for the Python 2 character array format case "c" explicitly. This was handled implicitly before as the Python 2 code path through the old buffer protocol casting everything to the unsigned bytes format "B". However now it needs to be handled explicitly as we want to have our data treated as unsigned bytes instead of the legacy character format.

All other cases simply leverage whatever the array exported either through the buffer protocol on Python 3 or what we patched in afterwards from the array's information on Python 2. This makes it a little clearer what the unusual cases are that need patching (i.e. unicode and character arrays). Also it makes it a little clearer that on Python 2 we are pulling this information from the array. Appears to improve performance as well. Not to mention binaries are a bit more portable.

Makes it a little clearer that these typecodes are intended to represent unsigned integer types as opposed to UCS types per se.

When handling the `array` case, go ahead and copy the `itemsize` and `format` over to start with on Python 2. This way Python 2/3 cases are handled basically the same after this point. Then use `_format` to perform string comparisons in C instead of Python. As the strings are all short, these are very fast. This avoids the overhead of Python string comparison. Update the unicode strategy to handle the fact that UCS2 is represented by the `"u"` format and UCS4 is represented by the `"w"` format. Specify the type casting for each case using the appropriate unsigned integer width type as before. As a result the format check now occurs at run-time instead of build-time, this should make binaries more portable. Explicitly check for the character array format `"c"` on Python 2. In this case patch the format, to be the unsigned char format `"B"`. This is basically what we did before. However now we do so explicitly. As this type only exists on Python 2 (not Python 3), there is no need to handle this case for other Python version. So include `PY2K` in the conditional to ensure this check is not included on Python 3.

As `Py_UNICODE_SIZE` is no longer being used in the comparisons, skip writing it to `config.pxi` as part of the build in `setup.py`.

Matches nicely with the ordering below. Also suggests implicitly what formats we start with and what they are mapped to.

Shortens the code a bit and improves readability.

In some cases the unicode typecode `"w"` shows up, but it is not well documented and is on its way out as well. That said, the code assumed this to be UCS4. This may very well be true, but it is a bit unclear. Similarly on Python 2, it seems that the unicode typecode `"u"` can mean either UCS2 or UCS4. Given the confusion about the unicode typecodes, lump all comparisons of them together and simply handle casting based on `itemsize`. This should avoid getting into any technical issues of these legacy types while still maintaining the intended behavior (casting to an appropriately sized unsigned integer). While we are at it, go ahead skip defining these legacy types as they don't have as a clear of a mapping as one might initially think.

Instead of using `strcmp`, simply access the first `char` of the `format` string and assign it to `fmt` a `char` type. Then compare the different values that `fmt` can take on. Cython turns this into a `switch`/`case` block with relatively little interference. So the compiler can easily optimize this into the appropriate form.

This makes it a little easier to read the code either in Cython or its C equivalent.

jakirkham force-pushed the use_c_str_cmp_array branch from a51503c to 453eb04 Compare November 24, 2018 21:58

jakirkham added 3 commits November 24, 2018 16:59

Rename typecode variables with unsigned int types

4a21540

Makes it a little clearer that these typecodes are intended to represent unsigned integer types as opposed to UCS types per se.

Skip adding Py_UNICODE_SIZE to config.pxi

b8da447

As `Py_UNICODE_SIZE` is no longer being used in the comparisons, skip writing it to `config.pxi` as part of the build in `setup.py`.

jakirkham force-pushed the use_c_str_cmp_array branch from 453eb04 to b30ad91 Compare November 24, 2018 22:03

jakirkham added 2 commits November 24, 2018 17:04

Swap text and unsigned int typecode definitions

b30ad91

Matches nicely with the ordering below. Also suggests implicitly what formats we start with and what they are mapped to.

Import strcmp directly into cybuffer.pyx

7a46e6a

Shortens the code a bit and improves readability.

jakirkham force-pushed the use_c_str_cmp_array branch from 1aebe99 to 7af1ecb Compare November 24, 2018 22:12

jakirkham force-pushed the use_c_str_cmp_array branch from 96bbe67 to 4cd0361 Compare November 24, 2018 22:34

jakirkham added 3 commits November 24, 2018 17:35

Move the Python 2 character array case first

4cd0361

This makes it a little easier to read the code either in Cython or its C equivalent.

Update comment about array type casting [ci skip]

4abde07

jakirkham merged commit 97ad376 into master Nov 24, 2018

jakirkham deleted the use_c_str_cmp_array branch November 24, 2018 22:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle array format fixes with C strings #18

Handle array format fixes with C strings #18

jakirkham commented Nov 24, 2018 •

edited

Loading

Handle array format fixes with C strings #18

Handle array format fixes with C strings #18

Conversation

jakirkham commented Nov 24, 2018 • edited Loading

jakirkham commented Nov 24, 2018 •

edited

Loading