New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Require the system encoding to be UTF-8 #6305

Merged
merged 3 commits into from Aug 6, 2018

Conversation

Projects
None yet
4 participants
@stuhood
Copy link
Member

stuhood commented Aug 4, 2018

Problem

As seen in #6295, python 3's open call defaults to decoding from locale.getpreferredencoding() when opened in non-binary mode, which can cause odd misconfigurations to be detected late in the game.

One solution to that problem would be to explicitly specify encodings on all open calls where the encoding is known ahead of time, but that game of whack a mole is hard to win, since libraries might also open files indirectly.

Solution

Require a preferred encoding of UTF-8, and address a few locations where pants' own CI was not providing it.

Result

If folks are using other encodings for valid reasons, hopefully they will file issues; if not, they will have a workaround.

@stuhood

This comment has been minimized.

Copy link
Member

stuhood commented Aug 4, 2018

Also, to be clear: I'm not certain that this is the very best idea. It's possible that:

  • since we're explicitly calling bytes.encode('UTF-8') all over the place, rather than banning other encodings, we could just require UTF-8
  • we could actually use the locale.getpreferredencoding() in all of the locations where we currently specify utf-8
  • that we should actually specify utf-8 to all of the open calls where it is relevant
@Eric-Arellano

This comment has been minimized.

Copy link
Contributor

Eric-Arellano commented Aug 4, 2018

Is there another option to force the system to use UTF 8? I think I remember seeing a way to do this in Python. Although I’m not sure the ramifications and if that’s safe.

The simplest option to me seems to be requiring UTF-8 because it reduces the complexity of development and simplifies our assumptions. If someone using Pants complains after trying that this breaks their usage, then we could invest in supporting their encoding. Alternative of supporting multiple encodings now may be a premature optimization.

Thanks for looking into this all!

@illicitonion

This comment has been minimized.

Copy link
Contributor

illicitonion commented Aug 5, 2018

My instinct is that we should just require everyone to use UTF-8 so that we can assume UTF-8 everywhere. Simplifies our world a lot.

In particular, I'm scared about the idea of things crossing the python/rust boundary with other encodings; we very strongly assume utf-8 on that boundary.

@@ -17,6 +17,14 @@ class PantsLoader(object):
ENTRYPOINT_ENV_VAR = 'PANTS_ENTRYPOINT'
DEFAULT_ENTRYPOINT = 'pants.bin.pants_exe:main'

ENCODING_IGNORE_ENV_VAR = 'PANTS_IGNORE_UNRECOGNIZED_ENCODING'
BLACKLISTED_ENCODINGS = (

This comment has been minimized.

@benjyw

benjyw Aug 6, 2018

Contributor

I'd be in favor of being even stricter, and requiring UTF-8. Do we know that this might cause issues?

@benjyw

This comment has been minimized.

Copy link
Contributor

benjyw commented Aug 6, 2018

Adding an explicit encoding to every open seems like the right thing to do in some sense, but doesn't feel very pythonic somehow? Part of the joy of python is how easy it is to mess with files...

In Python 3.7 we can actually force UTF-8, by settings the env var PYTHONUTF8=1 or adding the cmd line flag -X utf8 to the cpython invocation. See https://docs.python.org/3/using/cmdline.html#envvar-PYTHONUTF8.

So perhaps we should require Python 3.7 for running Pants once we're fully on py3, and until then make this check more strict so that it fails for any preferred encoding that isn't UTF-8?

@Eric-Arellano

This comment has been minimized.

Copy link
Contributor

Eric-Arellano commented Aug 6, 2018

Definitely agreed having to specify encoding=utf-8 is very unpythonic and easy for us to mess up. There’s no way our testing will catch every usage of open, so unless we create a linter for this use case we can easily mess up.

Requiring Python 3.7 seems like a step up in terms of restrictiveness compared to only requiring UTF-8. Because Python 3.7 came out only a month ago and some libraries still have issues with it, I’m not very comfortable with requiring 3.7.

P.S. Benjy’s comment brings up our need to decide which python 3 languages we’ll support. The most common for libraries in 2018 is to support 3.5+. Which version we choose determines which new features we can use, such as 3.7’s dataclasses. This might also not be the best venue for deciding.

@stuhood

This comment has been minimized.

Copy link
Member

stuhood commented Aug 6, 2018

It sounds like there is agreement around requiring UTF-8, so I'll just go ahead and do that.

@stuhood stuhood changed the title Blacklist known-problematic system encodings. Require the system encoding to be UTF-8 Aug 6, 2018

@stuhood

This comment has been minimized.

Copy link
Member

stuhood commented Aug 6, 2018

This now requires UTF-8, and I'm expecting it to pass CI. Please take a look.

@benjyw

benjyw approved these changes Aug 6, 2018

@Eric-Arellano
Copy link
Contributor

Eric-Arellano left a comment

Thanks Stu!

@stuhood stuhood merged commit fde209f into pantsbuild:master Aug 6, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@stuhood stuhood deleted the twitter:stuhood/guarantee-encoding branch Aug 6, 2018

CMLivingston pushed a commit to CMLivingston/pants that referenced this pull request Aug 27, 2018

Require the system encoding to be UTF-8 (pantsbuild#6305)
### Problem

As seen in pantsbuild#6295, python 3's `open` call defaults to decoding from `locale.getpreferredencoding()` when opened in non-binary mode, which can cause odd misconfigurations to be detected late in the game.

One solution to that problem would be to explicitly specify encodings on all `open` calls where the encoding is known ahead of time, but that game of whack a mole is hard to win, since libraries might also open files indirectly.

### Solution

Require a preferred encoding of `UTF-8`, and address a few locations where pants' own CI was not providing it.

### Result

If folks are using other encodings for valid reasons, hopefully they will file issues; if not, they will have a workaround.

Eric-Arellano added a commit to Eric-Arellano/pants that referenced this pull request Nov 27, 2018

Improve error message for locale check
In pantsbuild#6305, we began to require UTF-8 for Pants to work properly.

After helping a Pants user via Slack to figure out the error message and how to respond to it, we identified some small tweaks to the error message that can make it even more clear and human-readable:

* Use second-person to make message friendlier.
* Explain that we resolve encoding via `locale.getpreferredencoding()`. This helps to clarify that while LC_ALL and LANG are likely what need to be changed, those are means to the end, not the end itself.
* Clarify that the bypass is valid to use, only means we cannot guarantee valid behavior.

Eric-Arellano added a commit that referenced this pull request Dec 6, 2018

Improve error message for locale check (#6821)
In #6305, we began to require UTF-8 for Pants to work properly.

After helping a Pants user via Slack to figure out the error message and how to respond to it, we identified some small tweaks to the error message that can make it even more clear and human-readable:

* Use second-person to make message friendlier.
* Explain that we resolve encoding via `locale.getpreferredencoding()`. This helps to clarify that while LC_ALL and LANG are likely what need to be changed, those are means to the end, not the end itself.
* Clarify that the bypass is valid to use, only means we cannot guarantee valid behavior.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment