New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python 3 fixes - specify binary vs unicode behavior of temporary_file() and stdio_as() #6226

Merged
merged 6 commits into from Jul 25, 2018

Conversation

Projects
None yet
3 participants
@Eric-Arellano
Copy link
Contributor

Eric-Arellano commented Jul 24, 2018

Problem

Python 2 allows us to freely mix bytes and unicode with IO. Python 3 forces us to choose between bytes vs unicode.

This led to failing unit tests for contextutil.

Solution

Our code base has mixed use of stdio_as() and temporary_file(). In some instances, we write unicode and it clearly seems to be a good case for text. In others, like downloads, bytes make sense.

Hence, I think the best approach is to enable turning on or off binary mode, with a default given.

The defaults should emulate Python 3's native implementation (principle of least surprise), as discussed below.

temporary_file()

In both Py2 and Py3, tempfile defaults to w+b, as it does not presume what you are writing to the file (docs). E.g. with our contrib code for downloads, bytes makes sense.

So, I kept the default to bytes.

stdio_as()

In Py3, stdin, stderr, and stdout are unicode streams, meaning stdin.write(b'hi') will fail. If you want to access bytes, you use buffer, e.g. stdin.buffer.write(b'hi') (docs).

So, I changed stdio_as() to default to writing unicode with the option to write in binary mode.

Incremental port

This change shouldn't break anything in Python 2, because it allows for intermixing string types. That's great, because it means that we can incrementally choose whether something should be text vs bytes.

@stuhood stuhood requested review from stuhood and kwlzn Jul 24, 2018

@stuhood
Copy link
Member

stuhood left a comment

I believe that this is accurate. Thanks.

There is one failure in CI.

@Eric-Arellano

This comment has been minimized.

Copy link
Contributor

Eric-Arellano commented Jul 24, 2018

Thanks Stu.

Should I also add unit tests around this new behavior? I'm not sure what type of test would be meaningful - something like asserting that you can't write unicode to a binary file seems repetitive to me as Python provides this functionality and error handling.

There are 4 combinations we're introducing:

  • binary files, binary std
  • binary files, unicode std
  • unicode files, binary std
  • unicode files, unicode std

Some of these don't seem to work together, like binary files replacing unicode std.

decoded_members.append(m)
tar.extractall(outdir, members=decoded_members)
else:
tar.extractall(outdir)

This comment has been minimized.

@Eric-Arellano

Eric-Arellano Jul 24, 2018

Contributor

This change is necessary because of removing suffix=b'' in temporary_dir(), which means temporary_dir will return a unicode file name instead of bytes—a good thing and necessary for Py3 to work.

In Py2, TarFile manges unicode in file names. See https://superuser.com/questions/60379/how-can-i-create-a-zip-tgz-in-linux-such-that-windows-has-proper-filenames/190786#190786 for more information and the basis of this fix.

@stuhood

This comment has been minimized.

Copy link
Member

stuhood commented Jul 24, 2018

Should I also add unit tests around this new behavior? I'm not sure what type of test would be meaningful - something like asserting that you can't write unicode to a binary file seems repetitive to me as Python provides this functionality and error handling.

Rather than adding new tests, getting the existing tests running in Python3 is probably the highest priority. If we don't already have functionality exercising the case, then it's a "tree falling in the forest" situation.

@kwlzn

kwlzn approved these changes Jul 25, 2018

Copy link
Member

kwlzn left a comment

lgtm w one question

@@ -123,7 +123,7 @@ def _stdio_stream_as(src_fd, dst_fd, dst_sys_attribute, mode):


@contextmanager
def stdio_as(stdout_fd, stderr_fd, stdin_fd):
def stdio_as(stdout_fd, stderr_fd, stdin_fd, binary_mode=False):

This comment has been minimized.

@kwlzn

kwlzn Jul 25, 2018

Member

afaict, the binary_mode=True case here is never exercised? should there be some if PY3 conditional logic somewhere consuming this for when pants is executed with a 3.x interpreter?

This comment has been minimized.

@Eric-Arellano

Eric-Arellano Jul 25, 2018

Contributor

In the use cases so far (i.e. util tests), I haven't had to turn on binary_mode=True. We in general will want to use text mode rather than binary_mode, because this is how std streams work in Python 3 by default.

Further, typically in Py3, if you do need to write bytes, you can access bytes through stdin.buffer.

The flag is only to give us the option if in porting everything else we find some use case where we want to write bytes directly to stdin etc, rather than unicode. For example, if we override stdin with a file that doesn't have this buffer functionality, and we want to write bytes, we'll have to trigger binary_mode=True.

This comment has been minimized.

@stuhood

stuhood Jul 25, 2018

Member

Mm. Good point Kris.

@Eric-Arellano : This is opening files that wrap fds, so there are no python objects before this method is entered. So the "override stdin with a file that doesn't have this buffer functionality" case might not be a case.

@stuhood
Copy link
Member

stuhood left a comment

Thanks!

@stuhood stuhood merged commit 88b07dd into pantsbuild:master Jul 25, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@Eric-Arellano Eric-Arellano deleted the Eric-Arellano:py3-fixes_contextutil-tempfile-unicode branch Jul 26, 2018

stuhood added a commit that referenced this pull request Jul 26, 2018

Python 3 fixes - fix dirutil, fileutil, and xml_parser tests (#6229)
### Problem
Dirutil tests failing due to unicode vs bytes. 

`dirutil.py` itself is fine. I decided `safe_file_dump()` and `read_file()` should still work with bytes, not unicode.

### Related PRs
To get completely green on Py3, requires #6226 and #6228. This can be merged before them both, though, as it shouldn't break Py2 and we aren't running Py3 yet.

CMLivingston pushed a commit to CMLivingston/pants that referenced this pull request Aug 27, 2018

Python 3 fixes - specify binary vs unicode behavior of temporary_file…
…() (pantsbuild#6226)

## Problem
Python 2 allows us to freely mix bytes and unicode with IO. Python 3 forces us to choose between bytes vs unicode.

This led to failing unit tests for `contextutil`.

## Solution
Our code base has mixed use of `temporary_file()`. In some instances, we write unicode and it clearly seems to be a good case for text. In others, like downloads, bytes make sense.

Hence, I think the best approach is to enable turning on or off binary mode, with a default given. 

The defaults should emulate Python 3's native implementation (principle of least surprise), as discussed below.

### temporary_file()
In both Py2 and Py3, `tempfile` defaults to `w+b`, as it does not presume what you are writing to the file ([docs](https://docs.python.org/3/library/tempfile.html#tempfile.TemporaryFile)). E.g. with our contrib code for downloads, bytes makes sense.

So, I kept the default to bytes.

## Incremental port
This change shouldn't break anything in Python 2, because it allows for intermixing string types. That's great, because it means that we can incrementally choose whether something should be text vs bytes.

CMLivingston pushed a commit to CMLivingston/pants that referenced this pull request Aug 27, 2018

Python 3 fixes - fix dirutil, fileutil, and xml_parser tests (pantsbu…
…ild#6229)

### Problem
Dirutil tests failing due to unicode vs bytes. 

`dirutil.py` itself is fine. I decided `safe_file_dump()` and `read_file()` should still work with bytes, not unicode.

### Related PRs
To get completely green on Py3, requires pantsbuild#6226 and pantsbuild#6228. This can be merged before them both, though, as it shouldn't break Py2 and we aren't running Py3 yet.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment