unicode_literals considered harmful #10017

Closed
embray opened this issue Dec 15, 2017 · 2 comments

embray commented Dec 15, 2017

from __future__ import unicode_literals often does more harm than good. My experience with Python 3 porting has been that while it's often tempting to start out by saying "yes, unicode everywhere on Python 2", it turns out to be more of a problem than one might immediately expect--in fact one runs into exactly the kinds of problems that motivated Python 3's backwards-compatibility breaking in the first place. It causes Python 2 interfaces that previously returned str instances to now return unicode instances.

This is fine up to the point where you pass those unicode instances to other interfaces that don't deal with unicode well, whether third-party or in the Python standard library itself. For example, if the user's home directory contains non-ASCII characters, matplotlib crashes very early on at import time due to a call to os.path.expanduser('~'). Because unicode_literals means we're actually passing in u'~', this results, due to the implementation of os.path.expanduser, in a concatenation of a str with a unicode. The legacy unicode coercion behavior is such that Python will try to decode the str as ASCII, resulting in a UnicodeDecodeError. This can be easily demonstrated, for example, by running something like:

HOME="$HOME/☃" python -c 'import matplotlib'

And that's just the start. Problems related to concatenating unicode and non-unicode strings are pervasive.
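To illustrate the failure mode, here is a rough sketch that imitates Python 2's implicit coercion on Python 3, where the decode step has to be written out by hand. The helper name py2_style_concat and the example path are made up for illustration, and the path is assumed to be UTF-8 encoded; this is not the actual os.path.expanduser code.

```python
# Sketch only: imitate, on Python 3, what Python 2 does implicitly when a
# byte string (str) is concatenated with a unicode string.

def py2_style_concat(raw, text):
    """Hypothetical helper: decode the bytes as ASCII, then concatenate,
    which is what Python 2's legacy coercion does behind the scenes."""
    return raw.decode('ascii') + text

# Hypothetical home directory with a non-ASCII character, as bytes
# (assumed here to be UTF-8 encoded by the operating system).
home = u'/home/user/\u2603'.encode('utf-8')

try:
    py2_style_concat(home, u'')
    outcome = 'ok'
except UnicodeDecodeError:
    outcome = 'UnicodeDecodeError'

print(outcome)
```

With a pure-ASCII home directory the same call succeeds, which is why this kind of bug tends to go unnoticed until a user with a non-ASCII path hits it.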

Because of this it's actually often safer, on Python 2, to leave str as str and only use unicode strings in places where one is explicitly representing non-ASCII text (e.g. in string literals). While it's true that leaving str as str on Python 2 runs a risk of mojibake, that only tends to be an issue when combining strings from multiple sources that may have different encodings. In the most common cases (e.g. combining paths from the same filesystem) this won't be an issue, and mojibake issues are better addressed at the source--typically some system-level interface. In fact simply using unicode strings everywhere on Python 2 does not reduce the likelihood of encoding problems if encodings aren't already handled carefully at system boundaries.
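As a small sketch of what "handling encodings at the boundary" means: decode the bytes once, with the encoding the source actually uses, at the point where they enter the program. The function name decode_at_boundary is hypothetical, and the choice of UTF-8 versus Latin-1 is just an assumption for the demonstration.

```python
# Sketch only: mojibake comes from decoding with the wrong encoding,
# not from the choice of str vs. unicode as such.

def decode_at_boundary(raw, encoding):
    """Hypothetical helper: decode bytes once, where they enter the program."""
    return raw.decode(encoding)

raw_utf8 = u'\u2603'.encode('utf-8')  # a snowman, encoded as UTF-8 bytes

good = decode_at_boundary(raw_utf8, 'utf-8')       # correct encoding
garbled = decode_at_boundary(raw_utf8, 'latin-1')  # wrong guess: mojibake

print(good == u'\u2603')
print(garbled == good)
```

The second decode does not raise an error; it silently produces three unrelated characters, which is why the fix belongs where the bytes come in rather than in a blanket switch to unicode literals.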

If you don't want to take it from me, here's a more authoritative source on this: https://mail.python.org/pipermail/python-dev/2016-December/147009.html The end result of that thread was that recommendations to use unicode_literals were removed from the official Python 3 porting guide. I would suggest matplotlib also remove unicode_literals, at least in most modules where it isn't strictly necessary, and instead (now that Python 3.3+ supports it) use u'' explicitly for the rare unicode literals in the source code and tests. I'll have a pull request for this ready soon.
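Roughly, the style being proposed looks like the following. This is an illustrative sketch, not code from matplotlib; the variable names and the label text are made up.

```python
# -*- coding: utf-8 -*-
# Sketch only: no "from __future__ import unicode_literals" here.
import os.path

# Plain literal: a native str on both Python 2 and Python 3, so it matches
# whatever type the standard library works with internally.
home = os.path.expanduser('~')

# Explicit u'' prefix only where the literal really is non-ASCII text.
# The prefix is accepted again as of Python 3.3.
label = u'Temperature (\u00b0C)'

print(type(home).__name__)
print(label)
```

On Python 3 both values are str; on Python 2 only the label is unicode, and the path stays a native str.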

@tacaswell tacaswell added this to the v2.2 milestone Dec 19, 2017

slel commented Oct 31, 2018

Is this fixed by #10044 and ready to be closed?

jklymak commented Oct 31, 2018

I think so....
