Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle array format fixes with C strings #18

Merged
merged 9 commits into from
Nov 24, 2018
Merged

Conversation

jakirkham
Copy link
Owner

@jakirkham jakirkham commented Nov 24, 2018

Copies over format and itemsize information for Python 2 to start with. This largely removes the differences between Python 2/3 array format handling (with the exception of Python 2 only formats). Then handle format checks with C strings for all cases. As the strings should all by one char long in our case. Simply compare the char values directly.

As unicode arrays are represented with either a "u" or "w" format to indicate UCS2 or UCS4, check both of these independently at runtime. This drops the use of Py_UNICODE_SIZE in the code, which should make binaries a bit more portable as they are not dependent on the underlying unicode character size.

Checks for the Python 2 character array format case "c" explicitly. This was handled implicitly before as the Python 2 code path through the old buffer protocol casting everything to the unsigned bytes format "B". However now it needs to be handled explicitly as we want to have our data treated as unsigned bytes instead of the legacy character format.

All other cases simply leverage whatever the array exported either through the buffer protocol on Python 3 or what we patched in afterwards from the array's information on Python 2. This makes it a little clearer what the unusual cases are that need patching (i.e. unicode and character arrays). Also it makes it a little clearer that on Python 2 we are pulling this information from the array. Appears to improve performance as well. Not to mention binaries are a bit more portable.

Makes it a little clearer that these typecodes are intended to represent
unsigned integer types as opposed to UCS types per se.
When handling the `array` case, go ahead and copy the `itemsize` and
`format` over to start with on Python 2. This way Python 2/3 cases are
handled basically the same after this point. Then use `_format` to
perform string comparisons in C instead of Python. As the strings are
all short, these are very fast. This avoids the overhead of Python
string comparison.

Update the unicode strategy to handle the fact that UCS2 is represented
by the `"u"` format and UCS4 is represented by the `"w"` format. Specify
the type casting for each case using the appropriate unsigned integer
width type as before. As a result the format check now occurs at
run-time instead of build-time, this should make binaries more portable.

Explicitly check for the character array format `"c"` on Python 2. In
this case patch the format, to be the unsigned char format `"B"`. This
is basically what we did before. However now we do so explicitly. As
this type only exists on Python 2 (not Python 3), there is no need to
handle this case for other Python version. So include `PY2K` in the
conditional to ensure this check is not included on Python 3.
As `Py_UNICODE_SIZE` is no longer being used in the comparisons, skip
writing it to `config.pxi` as part of the build in `setup.py`.
Matches nicely with the ordering below. Also suggests implicitly what
formats we start with and what they are mapped to.
Shortens the code a bit and improves readability.
In some cases the unicode typecode `"w"` shows up, but it is not well
documented and is on its way out as well. That said, the code assumed
this to be UCS4. This may very well be true, but it is a bit unclear.
Similarly on Python 2, it seems that the unicode typecode `"u"` can mean
either UCS2 or UCS4. Given the confusion about the unicode typecodes,
lump all comparisons of them together and simply handle casting based on
`itemsize`. This should avoid getting into any technical issues of these
legacy types while still maintaining the intended behavior (casting to
an appropriately sized unsigned integer). While we are at it, go ahead
skip defining these legacy types as they don't have as a clear of a
mapping as one might initially think.
Instead of using `strcmp`, simply access the first `char` of the
`format` string and assign it to `fmt` a `char` type. Then compare the
different values that `fmt` can take on. Cython turns this into a
`switch`/`case` block with relatively little interference. So the
compiler can easily optimize this into the appropriate form.
This makes it a little easier to read the code either in Cython or its C
equivalent.
@jakirkham jakirkham merged commit 97ad376 into master Nov 24, 2018
@jakirkham jakirkham deleted the use_c_str_cmp_array branch November 24, 2018 22:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant