Simplify initialization of `cybuffer` with a builtin `array` #13

jakirkham · 2018-11-24T04:45:00Z

Makes some small adjustments to how cybuffer handles the array initialization case. Namely only special case handling of array on Python 3 or earlier (as Python 4 shouldn't have any of these issues). Also turn Py_UNICODE_SIZE into a Cython compile-time constant to cutdown on C code generated in relation to it. Finally cleanup and fuse together some no-op cases for some array types. Streamlines things a bit and makes it a bit clearer what is happening in the code.

When initializing `cybuffer` with an `array, ensure that we only patch up the `format` and associated attributes for the `array` buffer on Python 2/3. This workaround is needed for Python 2 as the `array` type does not support the (new) buffer protocol. So we need to use the old buffer protocol and then the (new) buffer protocol with that result. This loses information about the `array` in the process, which is the bulk of the reason this is needed. Python 3's `array` supports the (new) buffer protocol; so, we can largely ignore these workarounds. However on Python 2/3, we have the `"u"` (unicode) format for the `array`, which is a little broken. So we workaround those issues by casting the array to an unsigned integer of equivalent width to the unicode type. However the `"u"` format for `array` is deprecated and scheduled for removal in Python 4. So there should be no need to do any of this work on Python 4. As such, check whether Python 2/3 is in use when checking if the data is provided is an `array`. That way this whole branch will automatically be removed on Python 4 when it lands.

Previously we dealt with `Py_UNICODE_SIZE` as a C compile-time constant, which would inject the value during the C preprocessor pass. Later compilation optimizations by the C compiler likely removed unneeded branches caused by this. Though this is pretty needless as Cython is generating these branches that we later drop. Not to mention Cython sometimes assigns constants like these to intermediate or stack local variables, which can hinder compiler optimizations. It would be better yet to have Cython not define these branches at all. To that end, we make `Py_UNICODE_SIZE` a Cython compile-time constant that is set in `config.pxi` along with other compile-time constants. As a result, Cython injects these directly in `cybuffer.pxd` and `cybuffer.pxi` before even performing any compilation. Once it goes to compile the Cython code to C, it recognizes `Py_UNICODE_SIZE` is a compile-time constant and evaluates all expressions it is used in. As a result, the Cython branches with `Py_UNICODE_SIZE` are eliminated during compile-time leaving only one unconditional code path in C. In order to discern what `Py_UNICODE_SIZE` should be at the Python level, we inspect the `itemsize` of an empty unicode (i.e. `"u"`) `array`'s `itemsize` in `setup.py`. Based on this we can determine `Py_UNICODE_SIZE`'s value. We then inject this into `config.pxi`. This value is only inspected and added for Python 2/3 as the unicode `array` is scheduled for removal in Python 4. As we only check this value within a branch of code that should be removed on Python 4 anyways, leaving `Py_UNICODE_SIZE` undefined shouldn't be an issue for Python 4 builds. While `Py_UNICODE_SIZE` is defined in Python headers, we don't need to be worried about our definition of `Py_UNICODE_SIZE` clashing with it. The reason being Cython guarantees the assigned value will injected into the source. So the definition will have already been eliminated by the time the C code is generated. Further we only use `Py_UNICODE_SIZE` in comparisons to other constants. So Cython will inject the value and then evaluate these comparisons resulting in them being turned into always `True` or `False` branches. These in turn are optimized away.

On Python 2, the `format` of all `array`s is `uint8` (i.e. `"B"`). The reason being that all `array`s on Python 2 only support the old buffer protocol. As such they need to be converted through that path to get access to use them with the (new) buffer protocol. Doing so results in such objects being treated as if they are of `uint8` type. Since that is how we want character (i.e. `"c"`) arrays treated on Python 2 (they don't exist on Python 3), then there is nothing we need to here. This includes changing the `_format`. So drop that line.

Go ahead and join the different no-op cases with `array` types into one condition. Just makes it easier to see that all of these are just being skipped. Besides Cython will break this out however it sees fit in C code. So there is no benefit to us doing anything special here.

Go ahead and set `itemsize` at the same time that `format` is being set for the Python 2 `array` cases. These two properties are connected and naturally go together. So this makes a bit more sense from a code readability point of view.

On Python 2 when handling the `array` case, the only reason to worry about adjusting the shape and strides is if the `itemsize` changed as a consequence of casting back to the original type. If the itemsize is the same, which on Python 2 is 1 as a consequence of needing to use the old buffer protocol for `array`s, then there is no need to perform this casting. Hence we update the Python 2 shape and stride adjustment case to check to see if the `itemsize` is different from `1`, which is now possible as we assign the `itemsize` when handling the casting above. This is fast check since it is comparing to C integers and only matters for Python 2 as the code is removed for newer Python versions.

Drop the checks for `uint8` and character `array`s in the general case in the casting code. Instead favor handling these checks as part of the Python 2 `elif` case after handling the unicode `array`s. This consolidates the code a bit. Also simplifies the number of checks being done on Python 3. The Python 2 case winds up being about the same other than it makes better usage of its branches. Should improve readability and streamline things a bit.

As there is no Python 4+ yet, we do not need to worry about coverage on the `else` case here. So inform `coverage` to not worry about it. ref: https://coverage.readthedocs.io/en/coverage-4.4.2/branch.html#structurally-partial-branches

Make this comment a bit more informative. Particularly after we have included changing `itemsize`.

jakirkham added 9 commits November 23, 2018 22:42

Set itemsize and format for array on Python 2

7b09039

Go ahead and set `itemsize` at the same time that `format` is being set for the Python 2 `array` cases. These two properties are connected and naturally go together. So this makes a bit more sense from a code readability point of view.

Ignore coverage of Python 4+

321dc8b

As there is no Python 4+ yet, we do not need to worry about coverage on the `else` case here. So inform `coverage` to not worry about it. ref: https://coverage.readthedocs.io/en/coverage-4.4.2/branch.html#structurally-partial-branches

Update array casting comment

ae9f899

Make this comment a bit more informative. Particularly after we have included changing `itemsize`.

jakirkham merged commit 302de15 into master Nov 24, 2018

jakirkham deleted the simp_array_init branch November 24, 2018 06:34

jakirkham mentioned this pull request Nov 24, 2018

Drop UBYTE_TC #14

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify initialization of `cybuffer` with a builtin `array` #13

Simplify initialization of `cybuffer` with a builtin `array` #13

jakirkham commented Nov 24, 2018

Simplify initialization of cybuffer with a builtin array #13

Simplify initialization of cybuffer with a builtin array #13

Conversation

jakirkham commented Nov 24, 2018

Simplify initialization of `cybuffer` with a builtin `array` #13

Simplify initialization of `cybuffer` with a builtin `array` #13