Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify initialization of cybuffer with a builtin array #13

Merged
merged 9 commits into from
Nov 24, 2018

Conversation

jakirkham
Copy link
Owner

Makes some small adjustments to how cybuffer handles the array initialization case. Namely only special case handling of array on Python 3 or earlier (as Python 4 shouldn't have any of these issues). Also turn Py_UNICODE_SIZE into a Cython compile-time constant to cutdown on C code generated in relation to it. Finally cleanup and fuse together some no-op cases for some array types. Streamlines things a bit and makes it a bit clearer what is happening in the code.

When initializing `cybuffer` with an `array, ensure that we only patch
up the `format` and associated attributes for the `array` buffer on
Python 2/3. This workaround is needed for Python 2 as the `array` type
does not support the (new) buffer protocol. So we need to use the old
buffer protocol and then the (new) buffer protocol with that result.
This loses information about the `array` in the process, which is the
bulk of the reason this is needed. Python 3's `array` supports the (new)
buffer protocol; so, we can largely ignore these workarounds. However on
Python 2/3, we have the `"u"` (unicode) format for the `array`, which is
a little broken. So we workaround those issues by casting the array to
an unsigned integer of equivalent width to the unicode type. However the
`"u"` format for `array` is deprecated and scheduled for removal in
Python 4. So there should be no need to do any of this work on Python 4.
As such, check whether Python 2/3 is in use when checking if the data is
provided is an `array`. That way this whole branch will automatically be
removed on Python 4 when it lands.
Previously we dealt with `Py_UNICODE_SIZE` as a C compile-time constant,
which would inject the value during the C preprocessor pass. Later
compilation optimizations by the C compiler likely removed unneeded
branches caused by this. Though this is pretty needless as Cython is
generating these branches that we later drop. Not to mention Cython
sometimes assigns constants like these to intermediate or stack local
variables, which can hinder compiler optimizations. It would be better
yet to have Cython not define these branches at all.

To that end, we make `Py_UNICODE_SIZE` a Cython compile-time constant
that is set in `config.pxi` along with other compile-time constants. As
a result, Cython injects these directly in `cybuffer.pxd` and
`cybuffer.pxi` before even performing any compilation. Once it goes to
compile the Cython code to C, it recognizes `Py_UNICODE_SIZE` is a
compile-time constant and evaluates all expressions it is used in. As a
result, the Cython branches with `Py_UNICODE_SIZE` are eliminated during
compile-time leaving only one unconditional code path in C.

In order to discern what `Py_UNICODE_SIZE` should be at the Python
level, we inspect the `itemsize` of an empty unicode (i.e. `"u"`)
`array`'s `itemsize` in `setup.py`. Based on this we can determine
`Py_UNICODE_SIZE`'s value. We then inject this into `config.pxi`. This
value is only inspected and added for Python 2/3 as the unicode `array`
is scheduled for removal in Python 4. As we only check this value
within a branch of code that should be removed on Python 4 anyways,
leaving `Py_UNICODE_SIZE` undefined shouldn't be an issue for Python 4
builds.

While `Py_UNICODE_SIZE` is defined in Python headers, we don't need to
be worried about our definition of `Py_UNICODE_SIZE` clashing with it.
The reason being Cython guarantees the assigned value will injected into
the source. So the definition will have already been eliminated by the
time the C code is generated. Further we only use `Py_UNICODE_SIZE` in
comparisons to other constants. So Cython will inject the value and then
evaluate these comparisons resulting in them being turned into always
`True` or `False` branches. These in turn are optimized away.
On Python 2, the `format` of all `array`s is `uint8` (i.e. `"B"`). The
reason being that all `array`s on Python 2 only support the old buffer
protocol. As such they need to be converted through that path to get
access to use them with the (new) buffer protocol. Doing so results in
such objects being treated as if they are of `uint8` type. Since that is
how we want character (i.e. `"c"`) arrays treated on Python 2 (they
don't exist on Python 3), then there is nothing we need to here. This
includes changing the `_format`. So drop that line.
Go ahead and join the different no-op cases with `array` types into one
condition. Just makes it easier to see that all of these are just being
skipped. Besides Cython will break this out however it sees fit in C
code. So there is no benefit to us doing anything special here.
Go ahead and set `itemsize` at the same time that `format` is being set
for the Python 2 `array` cases. These two properties are connected and
naturally go together. So this makes a bit more sense from a code
readability point of view.
On Python 2 when handling the `array` case, the only reason to worry
about adjusting the shape and strides is if the `itemsize` changed as a
consequence of casting back to the original type. If the itemsize is the
same, which on Python 2 is 1 as a consequence of needing to use the old
buffer protocol for `array`s, then there is no need to perform this
casting. Hence we update the Python 2 shape and stride adjustment case
to check to see if the `itemsize` is different from `1`, which is now
possible as we assign the `itemsize` when handling the casting above.
This is fast check since it is comparing to C integers and only matters
for Python 2 as the code is removed for newer Python versions.
Drop the checks for `uint8` and character `array`s in the general case
in the casting code. Instead favor handling these checks as part of the
Python 2 `elif` case after handling the unicode `array`s. This
consolidates the code a bit. Also simplifies the number of checks being
done on Python 3. The Python 2 case winds up being about the same other
than it makes better usage of its branches. Should improve readability
and streamline things a bit.
As there is no Python 4+ yet, we do not need to worry about coverage on
the `else` case here. So inform `coverage` to not worry about it.

ref: https://coverage.readthedocs.io/en/coverage-4.4.2/branch.html#structurally-partial-branches
Make this comment a bit more informative. Particularly after we have
included changing `itemsize`.
@jakirkham jakirkham merged commit 302de15 into master Nov 24, 2018
@jakirkham jakirkham deleted the simp_array_init branch November 24, 2018 06:34
@jakirkham jakirkham mentioned this pull request Nov 24, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant