New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python 3 fixes - fix scm bytes vs unicode issues #6257

Merged
merged 5 commits into from Jul 30, 2018

Conversation

Projects
None yet
2 participants
@Eric-Arellano
Copy link
Contributor

Eric-Arellano commented Jul 28, 2018

This folder had a lot of issues with intermixing unicode and bytes.

While I think some of it can be refactored to use unicode (simpler to deal with), I tried to stop myself from doing this and to stay as true to the original semantics as possible.

Tests now pass locally in Py2 and Py3.

sha = tree_data[i + 1:i + 1 + GIT_HASH_LENGTH].encode('hex')
name = b''.join(tree_data[start:i])
sha = b''.join(tree_data[i + 1:i + 1 + GIT_HASH_LENGTH])
sha_hex = binascii.hexlify(sha)

This comment has been minimized.

@Eric-Arellano

Eric-Arellano Jul 28, 2018

Contributor

Original had a couple problems:

  • calling encode() on a list
  • using 'hex' encoding, which has been dropped from the encode() function in Py3

Instead, binascii.hexlify() will get the result we want, and it must be passed a byte string instead of a list.

i += 1
mode = tree_data[start:i]
mode = b''.join(tree_data[start:i])

This comment has been minimized.

@Eric-Arellano

Eric-Arellano Jul 28, 2018

Contributor

Originally, mode and the below would be a List[bytes], rather than bytes. In Py2, it wouldn't complain about comparing a list to a byte string, but Py3 does.

raise self.MissingFileException(rev, relpath)

_, object_type, object_len = parts

# Read the object data
blob = self._cat_file_process.stdout.read(int(object_len))
blob = [bytes([b]) for b in bytes(blob)]

This comment has been minimized.

@Eric-Arellano

Eric-Arellano Jul 28, 2018

Contributor

This addition is to consistently use Py3 semantics for iterating through a byte string, which is very different than Py2. See http://python-future.org/compatible_idioms.html#byte-string-literals

This comment has been minimized.

@stuhood

stuhood Jul 28, 2018

Member

It's not clear to me why we're doing this... we're reading a precise number of bytes off of the wire, and what we want out of this method is a byte string. So why is the transform necessary?

If the output of the process is currently set as unicode, it shouldn't be.

This comment has been minimized.

@Eric-Arellano

Eric-Arellano Jul 28, 2018

Contributor

So the reason I did this is to allow us to consistently iterate through the bytestring between Py2 and Py3.

In Py2, you can do this

for c in b'test':
  assert type(c) == bytes

In Py3, if you try doing this, it will actually return ints, instead of bytechars. If you want to iterate through the byte string like originally intended, you would have to do this

for i in b'test':
  assert type(i) == int
  bytechar = bytes([i])
  assert type(bytechar) == bytes

According to the Future cheat sheet, to ensure consistent behavior, you're supposed to wrap the string literal like this

for i in bytes(b'test'):
  bytechar = bytes([i])

I originally had the above line where it was directly called, e.g. in _read_tree(). I moved it here to reduce duplication.

But I think you're right - this should still return a byte string, not a list of byte chars. I'll restore it to returning a byte string, while wrapping it in bytes() to get the same semantics in both Py2 and Py3.

P.S. Thanks again for the review! 100% of these changes in this new project phase are manual, so are much harder to reason about, and I appreciate you looking over this all!

if link_to.startswith('../') or link_to[0] == '/':
# Points outside the repo.
raise self.ExternalSymlinkException(self.rev, relpath)

# Restart our search at the top with the new path.
# Git stores symlinks in terms of Unix paths, so split on '/' instead of os.path.sep
components = link_to.split(SLASH) + components
components = link_to.split(SLASH.decode('utf-8')) + components

This comment has been minimized.

@stuhood

stuhood Jul 28, 2018

Member

Should change the constant instead. Despite being ostensibly public, it's only used locally.

raise self.MissingFileException(rev, relpath)

_, object_type, object_len = parts

# Read the object data
blob = self._cat_file_process.stdout.read(int(object_len))
blob = [bytes([b]) for b in bytes(blob)]

This comment has been minimized.

@stuhood

stuhood Jul 28, 2018

Member

It's not clear to me why we're doing this... we're reading a precise number of bytes off of the wire, and what we want out of this method is a byte string. So why is the transform necessary?

If the output of the process is currently set as unicode, it shouldn't be.

@@ -571,6 +573,7 @@ def _read_tree(self, path):
object_type, tree_data = self._read_object_from_repo(rev=self.rev, relpath=path)
assert object_type == b'tree'
# The tree data here is (mode ' ' filename \0 20-byte-sha)*
tree_data = [bytes([b]) for b in tree_data]

This comment has been minimized.

@Eric-Arellano

Eric-Arellano Jul 28, 2018

Contributor

@stuhood This commit 597c32e is what code looks like when having _read_object_from_repo return a bytestring instead of List[byte char].

Let me know if you think I should revert this to before. It looks like the function is private to the repo, so I'm in favor of reverting to before as it results in 3 less lines of duplicated code.

This comment has been minimized.

@stuhood

stuhood Jul 28, 2018

Member

Between:

    blob = self._cat_file_process.stdout.read(int(object_len))
    blob = [bytes([b]) for b in bytes(blob)]

...and:

    blob = bytes(self._cat_file_process.stdout.read(int(object_len)))

... definitely the latter, right?

This comment has been minimized.

@Eric-Arellano

Eric-Arellano Jul 28, 2018

Contributor

With the latter, you have to have the 3 instances of statements like tree_data = [bytes([b]) for b in tree_data] throughout the codebase, rather than having the two lines together. I prefer keeping them closer together for proximity.

If we go with the former, I think you're revealing the need for a brief comment explaining why blob is redefined or another name for the second value like blob_as_bytechar_list

Either way, the code works and isn't a public API so this isn't a big deal.

This comment has been minimized.

@stuhood

stuhood Jul 29, 2018

Member

From what I can tell, most/all of those tree_data = [bytes([b]) for b in tree_data] transforms are because we're calling join on bytes, which doesn't seem necessary.

@stuhood

This comment has been minimized.

Copy link
Member

stuhood commented on src/python/pants/scm/git.py in 597c32e Jul 29, 2018

The join here is unnecessary: if you remove that you should be able to directly wrap the bytes in a BytesIO.

@stuhood

This comment has been minimized.

Copy link
Member

stuhood commented on src/python/pants/scm/git.py in 597c32e Jul 29, 2018

Ditto remove join... not sure why it's here.

@Eric-Arellano

This comment has been minimized.

Copy link
Contributor

Eric-Arellano commented Jul 29, 2018

Good catch with the unnecessary joins, you're definitely right and it works with those removed. Which then means I 100% agree with you that _read_object_from_repo should return bytes, not List[byte char].

We do still need to convert into the list within _read_tree because we iterate one character at a time and Py3 changes the semantics. I added a comment and link explaining why, as it probably seems unnecessary to most readers.

Thanks for reviewing this closely!

@stuhood
Copy link
Member

stuhood left a comment

Thanks!

@stuhood stuhood merged commit a4010de into pantsbuild:master Jul 30, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@Eric-Arellano Eric-Arellano deleted the Eric-Arellano:py3-fixes_scm branch Jul 30, 2018

CMLivingston pushed a commit to CMLivingston/pants that referenced this pull request Aug 27, 2018

Python 3 fixes - fix scm bytes vs unicode issues (pantsbuild#6257)
This folder had a lot of issues with intermixing unicode and bytes. 

While I think some of it can be refactored to use unicode (simpler to deal with), I tried to stop myself from doing this and to stay as true to the original semantics as possible.

Tests now pass locally in Py2 and Py3.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment