Python 3 Pitfalls
We use the future library and only support Python >= 3.3. The motivation for Python 3 is best summarized in this article. Porting-to-python-3 is a nice description of how to support Python 3 via a single code base.
str vs. bytes
On Python 3, strings are now internally Unicode; bytes are their encoded representation (see e.g. this brilliant video).
lxml can deal with strings and bytes but treats them differently. Strings must not have an XML encoding declaration; lxml raises an error otherwise. Bytes, on the other hand, can have an encoding declaration and probably should (otherwise lxml defaults to utf-8), see also the lxml FAQ.
The idea is to represent raw content like waveforms, XML files or URL requests by bytes and BytesIO. We internally convert them to Unicode / str such that the user of the library will only deal with strings (e.g.
hasattr() expects an AttributeError to be raised if the attribute is not found. Thus, for hasattr() to work on an object with a custom __getattr__, that __getattr__ must raise an AttributeError instead of a KeyError. This was not the case in the past. This change affects many places in our code and might also break some third party code.
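As a minimal sketch of the rule above, the following uses a hypothetical dict subclass (the class name `AttrDict` and the keys are made up for illustration and are not taken from our code base). It shows a `__getattr__` that translates the KeyError into an AttributeError so that hasattr() returns False instead of propagating the exception.

```python
class AttrDict(dict):
    """Hypothetical dict subclass that exposes its keys as attributes."""

    def __getattr__(self, name):
        # __getattr__ is only called when the normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            # hasattr() on Python 3 only swallows AttributeError, so the
            # KeyError has to be translated here.
            raise AttributeError(name)


stats = AttrDict(network="XX")
has_network = hasattr(stats, "network")
has_channel = hasattr(stats, "channel")
print(has_network, has_channel)
```

If `__getattr__` let the KeyError escape, the second hasattr() call would raise on Python 3 instead of returning False.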
doctests, unicode prefix
Doctests are a pain to maintain for a code base that is compatible with both Python 2 and Python 3. The problem is that the Python 3 aware str / the compatibility future.builtins.str will result in:
# python 3
>>> from future.builtins import str
>>> str("hello")
"hello"

# python 2
>>> from future.builtins import str
>>> str("hello")
u"hello"
Thus there is a "u" prefix in Python 2, which causes the doctests to fail. Solutions range from using print everywhere, to using #doctest: +ELLIPSIS (e.g. ..."hello"), to just skipping the test. They are all quite awful.
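The "use print everywhere" workaround can be sketched as follows. The function `greeting` is a hypothetical example, not part of our code base. Because the doctest compares the printed text rather than the repr of the string, no "u" prefix can show up in the expected output.

```python
import doctest


def greeting():
    """
    Return a text string.

    >>> print(greeting())
    hello
    """
    return u"hello"


# Parse and run the docstring example directly, so the sketch does not
# depend on which module the function lives in.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    greeting.__doc__, {"greeting": greeting}, "greeting", None, 0)
results = doctest.DocTestRunner(verbose=False).run(test)
print(results.failed, results.attempted)
```

The price is that the doctest no longer checks the type of the return value, which is one reason this workaround is unsatisfying.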
arbitrary order dictionaries
In contrast to Python 2, dictionaries now really have arbitrary order, and the order often changes between test runs. sorted is your friend here.
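A small sketch of what this means in practice, with made-up keys and values: whenever output derived from a dictionary ends up in a test or a doctest, iterate over the sorted keys so the result does not depend on the dictionary's internal order.

```python
counts = {"Z": 3, "N": 1, "E": 2}

# Iterating over sorted(counts) yields the keys in a reproducible order.
parts = ["%s=%d" % (key, counts[key]) for key in sorted(counts)]
summary = ", ".join(parts)
print(summary)  # E=2, N=1, Z=3
```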
accessing single characters in byte strings
In [1]: s = b"hallo"

In [2]: s[3]  # intuitively I expect "l"
Out[2]: 108

In [3]: chr(s[3])
Out[3]: 'l'

In [4]: s[3:4]  # I often used this
Out[4]: b'l'
Some libraries have trouble with future.builtins.str in a single code base that supports Python 2 and 3. For these cases, native_str is your friend, e.g.:
np.ctypeslib.ndpointer(dtype='int32', ndim=1, flags=native_str('C_CONTIGUOUS'))
decode and encode
It's pretty simple: always use str internally, and use bytes with a definite encoding whenever you communicate externally. In other words, everything read from a file should be decoded (either through text mode, or by loading binary and decoding manually) and everything written to a file should be encoded. Python 2 allows you to be a bit lazy and mix the two; Python 3 does not.
The typical pitfall is: Our tests are reading bytes but not decoding them, and the writes in the new code are not encoding the strings to bytes.
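The rule can be illustrated with a small sketch that uses in-memory BytesIO objects in place of real files. The content and variable names are made up for illustration. Bytes are decoded right after reading, all processing happens on str, and the result is encoded again just before writing.

```python
import io

# Stand-in for a file opened in binary mode.
raw = io.BytesIO(b"b\xc3\xa4h\n")

# Decode at the input boundary, with an explicit encoding.
text = raw.read().decode("utf-8")

# Internal processing only ever sees text.
processed = text.strip().upper()

# Encode at the output boundary, again with an explicit encoding.
out = io.BytesIO()
out.write(processed.encode("utf-8"))
written = out.getvalue()
```

Passing the encoding explicitly to both decode() and encode() keeps the sketch valid on Python 2 as well, where the default codec is ASCII.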
Supporting bytes and str in a single code base for Python 2 and 3 is quite a challenge. The following examples show how this Python 3 decode / encode usage can be made to work with Python 2:
>>> byte = b'b\xc3\xa4h'
>>> print(byte.decode())
bäh
>>> string = "bäh"
>>> print(string.encode())
b'b\xc3\xa4h'
python 2 (python 3 compatible via future library)
>>> from __future__ import print_function
>>> from future.builtins import str, bytes
>>> byte = bytes(b'b\xc3\xa4h')
>>> print(byte.decode())
bäh
>>> string = str(u"bäh")
>>> string.encode()
b'b\xc3\xa4h'
python 2 (python 3 compatible via 'utf-8' argument)
>>> from __future__ import print_function
>>> byte = b'b\xc3\xa4h'
>>> print(byte.decode('utf-8'))
bäh
>>> string = u"bäh"
>>> string.encode('utf-8')
'b\xc3\xa4h'