Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patch: Add __str__ and __bytes__ for undecoded content. #790

Closed
wants to merge 1 commit into from

Conversation

erikvanzijst
Copy link
Contributor

Patch.patch assumes all content to be encoded in UTF-8 and forcefully
replaces any non-decodable sequences. This can lead to corruption for
content that either does not conform to any specific encoding altogether, or
uses an encoding that is incompatible with, or ambinuous to UTF-8.

This change adds __str__ and __bytes__ implementations to Patch that
return the unmodified, raw bytes.

@erikvanzijst erikvanzijst force-pushed the erik/patch_bytes branch 2 times, most recently from 5ff0818 to 95e4180 Compare April 27, 2018 21:02
@erikvanzijst
Copy link
Contributor Author

Would you mind having a look at this pr, @brandonio21?

@erikvanzijst
Copy link
Contributor Author

We're using PyGit2 to power parts of Bitbucket, but up until now that didn't include patch generation (for which we still just shelled out git). As we're moving more functionality onto pygit, we ran into the forced unicode decoding issue. Bitbucket needs access to the raw, untouched diff/patch bytes and so it seemed appropriate to put that under str and bytes.

If I overlooked something and the raw byte contents of patches are already accessible, please let me know.

@jdavid
Copy link
Member

jdavid commented Apr 29, 2018

We should deprecate .patch

So what we need is a way to get bytes and a way to get text:

  • bytes()
  • .decode()

Since .decode() will always return text, we can use it to deprecate .patch ; don't need to implement str()

Don't need to add new tests for .decode(), better to replace .patch by .decode() in the tests.

Bonus points if .patch raises a deprecation warning.

Thanks!

Copy link
Contributor

@brandonio21 brandonio21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good. I do agree that .patch should probably be deprecated (simply because we don't want many ways to do this lying around).

In regards to naming, I like str and bytes.

src/patch.c Outdated
#if PY_MAJOR_VERSION == 2
ret = Py_BuildValue("s#", buf.ptr, buf.size);
#else
ret = to_unicode(buf.ptr, NULL, NULL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the documentation of Py_BuildValue returns a str using UTF8 for us. Do we even need to use to_unicode here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, looks like to_unicode has some extra logic we might want. Carry on!

@jdavid
Copy link
Member

jdavid commented Apr 29, 2018

The problem with str() is that it should return text in Python 3 and bytes in Python 2, exactly as @erikvanzijst has implemented. So it cannot replace .patch; while .decode() is expected to return text in both Python 2 and 3.

So I think we should have bytes() and .decode() ; and we may have str()

Eventually decode could accept an optional extra parameter for the encoding.

The case for str is about the output we want for print(patch), which I think is a different issue.

$ python2
>>> b''.decode()
u''
>>> str(b'')
''

$ python3
>>> b''.decode()
''
>>> str(b'')
"b''"

@erikvanzijst
Copy link
Contributor Author

Yeah, I can add an explicit .decode(encoding="utf-8", errors="strict") to Patch, mimicking str.decode / bytes.decode, to make it a little easier to write py2/3-compatible clients.

I propose to also keep str, as it's so ubiquitous.

I'll also add a runtime deprecation warning to Patch.patch.

Patch.patch assumes all content to be encoded in UTF-8 and forcefully
replaces any non-decodable sequences. This can lead to corruption for
content that either does not conform to any specific encoding altogether, or
uses an encoding that is incompatible with, or ambinuous to UTF-8.

This change adds __str__(), __bytes__() methods to Patch that return
the unmodified, raw bytes. It also adds a new .decode() method that
gives greater control over how text decoding happens, deprecating
Patch.patch.
PyErr_WarnEx(PyExc_DeprecationWarning, "`Patch.patch` assumes UTF-8 encoding and can have unexpected results on "
"other encodings. If decoded text is needed, use `Patch.decode()` "
"instead. Otherwise use `bytes(Patch)`.", 1);
return decode(self, "utf-8", "replace");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These hardwired parameters are equivalent to the original logic of to_unicode(buf.ptr, NULL, NULL).

@jdavid
Copy link
Member

jdavid commented May 1, 2018

It's true that str() is more obvious than .decode(), but still don't like the idea for print(patch) to print the text string.

What about .as_bytes() and .as_string(), or .to_bytes() and .to_string()? Or something else like that.

Here we're defining the policy that eventually should be used across the whole lib.
For reference, right now only Oid implements str() and only Blob implements bytes().

@jdavid
Copy link
Member

jdavid commented May 1, 2018

Or to be more explicit as/to_text() instead of as/to_string()

@erikvanzijst
Copy link
Contributor Author

It's true that str() is more obvious than .decode(), but still don't like the idea for print(patch) to print the text string.

That's true only on Python 3 though. What probably shaped my thinking a bit is that most of my day to day work is still Python 2 where str() returns raw bytes, not text.

However, I agree that it might be best to put Python 3 idioms ahead of Python 2 and I'm happy to axe __str__ in that light. Doing that probably also diminishes the value of having a Python 3-specific __bytes__.

AFAICT, Blob implements the buffer protocol, but bytes(blob) only works on Python 3. On Python 2 it relies on Blob.__str__ which doesn't return the blob's contents. The only 2/3-compatible way to access a blob's bytes seems to be through Blob.data, or something like memoryview(blob).tobytes() through the buffer protocol interface directly.

What would you think of carrying Blob's interface forward and adding Patch.data instead of as_bytes()? Additionally I could also implement the buffer protocol onto Patch, so that it behaves identically to Blob.

@jdavid
Copy link
Member

jdavid commented May 1, 2018

Objects such as blobs also have .read_raw(), though only blobs have .data ; not nice to have 2 ways for the same thing. Ideally I would like a consistent way to get bytes and text strings through the library, even if it's not implemented now, I would like it to be defined.

#610 is a similar issue, concerning DiffLine. The names chosen here will apply there. Again we see right now 2 different ways, to get the text string, .patch and .content for Patch and DiffLine respectively.

So let leave alone str, and bytes reserved for objects implementing the buffer protocol.

So, .data looks good enough to me. Here an updated proposal:

  • .data for the bytes string
  • .text for the text string

If somebody wants a text string with an encoding different from utf-8 it would be as easy as:

.data.decode(...)

So no need for .text to be a method, it would be just a shorthand.

The buffer protocol I understand is an optimization to reduce memory copies. Whether to implement it or not is up to you, don't know whether it will make a difference for your use case.

These could be the commits:

  1. Add .data
  2. Add .text (and deprecate .patch)
  3. Implement the buffer protocol

I understand the first one is enough to solve your issue. It's up to you how far do you want to go 😄

Please don't hesitate if you have better ideas than .data and .text

@erikvanzijst
Copy link
Contributor Author

Yeah, I'm happy to add Patch.data, as well as a read-only buffer protocol interface. The latter could be helpful when dealing with very large diffs, similar to very large blobs. Buffers probably make less sense for DiffLine.

Though, would .text be hardwired to replace invalid code sequences (errors='replace', like Patch.patch does now)? If we do that, making it seemingly trivial to convert between bytes and text, then I worry that we might to continue to feed into the common misunderstandings around text decoding.

I defined .encode() to default to errors='strict' so as to blow up on non-UTF8 data. A method or attribute that silently truncates its input doesn't seem attractive.

Do we really need a shorthand for decoding a patch/blob/line into text? With Git's lack of encoding information, this is an inherently tricky and unreliable proposition.

@jdavid
Copy link
Member

jdavid commented May 2, 2018

This replace usage was introduced in #77 ;
the commit given as example is gitster/git@c31820c

That commit doesn't have an encoding header. Github prefers to print bad data that not printing at all. And so does git log or git show for instance.

So I would stick to that behaviour and keep replace. It will also make a smoother transition from .patch to .text, and through all these years I don't remember anyone raising an issue because of this.

I think it's convenient to have .text, probably we could add it in other places as well, so you can type things like print(commit.text)

For reference there're a few more places where we get the bytes and text with different naming: Commit.message, Commit.raw_message, Tag.message, Tag._message

In some cases there are several attributes available and here we see 2 conventions:

  • TreeEntry.name and TreeEntry._name
  • Signature.name and Signature.raw_name

We need to consider these to have a coherent convention... I like Patch.text and .data looks good enough, though it could be .raw or .raw_text, ??

@jnrbsn
Copy link
Contributor

jnrbsn commented Mar 30, 2019

@jdavid @erikvanzijst This can probably be closed now that #893 is merged.

@jdavid
Copy link
Member

jdavid commented Mar 31, 2019

yes, follow up in issue #895

Thank you both!

@jdavid jdavid closed this Mar 31, 2019
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request May 1, 2019
0.28.1 (2019-04-19)
-------------------------

- Now works with pycparser 2.18 and above
  `#846 <https://github.com/libgit2/pygit2/issues/846>`_

- Now ``Repository.write_archive(..)`` keeps the file mode
  `#616 <https://github.com/libgit2/pygit2/issues/616>`_
  `#898 <https://github.com/libgit2/pygit2/pull/898>`_

- New ``Patch.data`` returns the raw contents of the patch as a byte string
  `#790 <https://github.com/libgit2/pygit2/pull/790>`_
  `#893 <https://github.com/libgit2/pygit2/pull/893>`_

- New ``Patch.text`` returns the contents of the patch as a text string,
  deprecates `Patch.patch`
  `#790 <https://github.com/libgit2/pygit2/pull/790>`_
  `#893 <https://github.com/libgit2/pygit2/pull/893>`_

Deprecations:

- ``Patch.patch`` is deprecated, use ``Patch.text`` instead
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request May 19, 2019
0.28.1 (2019-04-19)
-------------------------

- Now works with pycparser 2.18 and above
  `#846 <https://github.com/libgit2/pygit2/issues/846>`_

- Now ``Repository.write_archive(..)`` keeps the file mode
  `#616 <https://github.com/libgit2/pygit2/issues/616>`_
  `#898 <https://github.com/libgit2/pygit2/pull/898>`_

- New ``Patch.data`` returns the raw contents of the patch as a byte string
  `#790 <https://github.com/libgit2/pygit2/pull/790>`_
  `#893 <https://github.com/libgit2/pygit2/pull/893>`_

- New ``Patch.text`` returns the contents of the patch as a text string,
  deprecates `Patch.patch`
  `#790 <https://github.com/libgit2/pygit2/pull/790>`_
  `#893 <https://github.com/libgit2/pygit2/pull/893>`_

Deprecations:

- ``Patch.patch`` is deprecated, use ``Patch.text`` instead
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants