Patch: Add str and bytes for undecoded content. #790

erikvanzijst · 2018-04-27T20:27:50Z

Patch.patch assumes all content to be encoded in UTF-8 and forcefully
replaces any non-decodable sequences. This can lead to corruption for
content that either does not conform to any specific encoding altogether, or
uses an encoding that is incompatible with, or ambinuous to UTF-8.

This change adds __str__ and __bytes__ implementations to Patch that
return the unmodified, raw bytes.

erikvanzijst · 2018-04-27T21:04:34Z

Would you mind having a look at this pr, @brandonio21?

erikvanzijst · 2018-04-27T21:07:56Z

We're using PyGit2 to power parts of Bitbucket, but up until now that didn't include patch generation (for which we still just shelled out git). As we're moving more functionality onto pygit, we ran into the forced unicode decoding issue. Bitbucket needs access to the raw, untouched diff/patch bytes and so it seemed appropriate to put that under str and bytes.

If I overlooked something and the raw byte contents of patches are already accessible, please let me know.

jdavid · 2018-04-29T14:00:38Z

We should deprecate .patch

So what we need is a way to get bytes and a way to get text:

bytes()
.decode()

Since .decode() will always return text, we can use it to deprecate .patch ; don't need to implement str()

Don't need to add new tests for .decode(), better to replace .patch by .decode() in the tests.

Bonus points if .patch raises a deprecation warning.

Thanks!

brandonio21

I think this looks good. I do agree that .patch should probably be deprecated (simply because we don't want many ways to do this lying around).

In regards to naming, I like str and bytes.

brandonio21 · 2018-04-29T15:54:14Z

src/patch.c

+#if PY_MAJOR_VERSION == 2
+    ret = Py_BuildValue("s#", buf.ptr, buf.size);
+#else
+    ret = to_unicode(buf.ptr, NULL, NULL);


It looks like the documentation of Py_BuildValue returns a str using UTF8 for us. Do we even need to use to_unicode here?

Ah, looks like to_unicode has some extra logic we might want. Carry on!

jdavid · 2018-04-29T18:20:58Z

The problem with str() is that it should return text in Python 3 and bytes in Python 2, exactly as @erikvanzijst has implemented. So it cannot replace .patch; while .decode() is expected to return text in both Python 2 and 3.

So I think we should have bytes() and .decode() ; and we may have str()

Eventually decode could accept an optional extra parameter for the encoding.

The case for str is about the output we want for print(patch), which I think is a different issue.

$ python2
>>> b''.decode()
u''
>>> str(b'')
''

$ python3
>>> b''.decode()
''
>>> str(b'')
"b''"

erikvanzijst · 2018-04-30T18:19:22Z

Yeah, I can add an explicit .decode(encoding="utf-8", errors="strict") to Patch, mimicking str.decode / bytes.decode, to make it a little easier to write py2/3-compatible clients.

I propose to also keep str, as it's so ubiquitous.

I'll also add a runtime deprecation warning to Patch.patch.

Patch.patch assumes all content to be encoded in UTF-8 and forcefully replaces any non-decodable sequences. This can lead to corruption for content that either does not conform to any specific encoding altogether, or uses an encoding that is incompatible with, or ambinuous to UTF-8. This change adds __str__(), __bytes__() methods to Patch that return the unmodified, raw bytes. It also adds a new .decode() method that gives greater control over how text decoding happens, deprecating Patch.patch.

erikvanzijst · 2018-05-01T01:30:33Z

src/patch.c

+    PyErr_WarnEx(PyExc_DeprecationWarning, "`Patch.patch` assumes UTF-8 encoding and can have unexpected results on "
+                                           "other encodings. If decoded text is needed, use `Patch.decode()` "
+                                           "instead. Otherwise use `bytes(Patch)`.", 1);
+    return decode(self, "utf-8", "replace");


These hardwired parameters are equivalent to the original logic of to_unicode(buf.ptr, NULL, NULL).

jdavid · 2018-05-01T08:55:36Z

It's true that str() is more obvious than .decode(), but still don't like the idea for print(patch) to print the text string.

What about .as_bytes() and .as_string(), or .to_bytes() and .to_string()? Or something else like that.

Here we're defining the policy that eventually should be used across the whole lib.
For reference, right now only Oid implements str() and only Blob implements bytes().

jdavid · 2018-05-01T11:23:55Z

Or to be more explicit as/to_text() instead of as/to_string()

erikvanzijst · 2018-05-01T17:15:46Z

It's true that str() is more obvious than .decode(), but still don't like the idea for print(patch) to print the text string.

That's true only on Python 3 though. What probably shaped my thinking a bit is that most of my day to day work is still Python 2 where str() returns raw bytes, not text.

However, I agree that it might be best to put Python 3 idioms ahead of Python 2 and I'm happy to axe __str__ in that light. Doing that probably also diminishes the value of having a Python 3-specific __bytes__.

AFAICT, Blob implements the buffer protocol, but bytes(blob) only works on Python 3. On Python 2 it relies on Blob.__str__ which doesn't return the blob's contents. The only 2/3-compatible way to access a blob's bytes seems to be through Blob.data, or something like memoryview(blob).tobytes() through the buffer protocol interface directly.

What would you think of carrying Blob's interface forward and adding Patch.data instead of as_bytes()? Additionally I could also implement the buffer protocol onto Patch, so that it behaves identically to Blob.

jdavid · 2018-05-01T18:39:18Z

Objects such as blobs also have .read_raw(), though only blobs have .data ; not nice to have 2 ways for the same thing. Ideally I would like a consistent way to get bytes and text strings through the library, even if it's not implemented now, I would like it to be defined.

#610 is a similar issue, concerning DiffLine. The names chosen here will apply there. Again we see right now 2 different ways, to get the text string, .patch and .content for Patch and DiffLine respectively.

So let leave alone str, and bytes reserved for objects implementing the buffer protocol.

So, .data looks good enough to me. Here an updated proposal:

.data for the bytes string
.text for the text string

If somebody wants a text string with an encoding different from utf-8 it would be as easy as:

.data.decode(...)

So no need for .text to be a method, it would be just a shorthand.

The buffer protocol I understand is an optimization to reduce memory copies. Whether to implement it or not is up to you, don't know whether it will make a difference for your use case.

These could be the commits:

Add .data
Add .text (and deprecate .patch)
Implement the buffer protocol

I understand the first one is enough to solve your issue. It's up to you how far do you want to go 😄

Please don't hesitate if you have better ideas than .data and .text

erikvanzijst · 2018-05-01T20:17:59Z

Yeah, I'm happy to add Patch.data, as well as a read-only buffer protocol interface. The latter could be helpful when dealing with very large diffs, similar to very large blobs. Buffers probably make less sense for DiffLine.

Though, would .text be hardwired to replace invalid code sequences (errors='replace', like Patch.patch does now)? If we do that, making it seemingly trivial to convert between bytes and text, then I worry that we might to continue to feed into the common misunderstandings around text decoding.

I defined .encode() to default to errors='strict' so as to blow up on non-UTF8 data. A method or attribute that silently truncates its input doesn't seem attractive.

Do we really need a shorthand for decoding a patch/blob/line into text? With Git's lack of encoding information, this is an inherently tricky and unreliable proposition.

jdavid · 2018-05-02T08:42:59Z

This replace usage was introduced in #77 ;
the commit given as example is gitster/git@c31820c

That commit doesn't have an encoding header. Github prefers to print bad data that not printing at all. And so does git log or git show for instance.

So I would stick to that behaviour and keep replace. It will also make a smoother transition from .patch to .text, and through all these years I don't remember anyone raising an issue because of this.

I think it's convenient to have .text, probably we could add it in other places as well, so you can type things like print(commit.text)

For reference there're a few more places where we get the bytes and text with different naming: Commit.message, Commit.raw_message, Tag.message, Tag._message

In some cases there are several attributes available and here we see 2 conventions:

TreeEntry.name and TreeEntry._name
Signature.name and Signature.raw_name

We need to consider these to have a coherent convention... I like Patch.text and .data looks good enough, though it could be .raw or .raw_text, ??

jnrbsn · 2019-03-30T19:34:01Z

@jdavid @erikvanzijst This can probably be closed now that #893 is merged.

jdavid · 2019-03-31T11:46:01Z

yes, follow up in issue #895

Thank you both!

0.28.1 (2019-04-19) ------------------------- - Now works with pycparser 2.18 and above `#846 <https://github.com/libgit2/pygit2/issues/846>`_ - Now ``Repository.write_archive(..)`` keeps the file mode `#616 <https://github.com/libgit2/pygit2/issues/616>`_ `#898 <https://github.com/libgit2/pygit2/pull/898>`_ - New ``Patch.data`` returns the raw contents of the patch as a byte string `#790 <https://github.com/libgit2/pygit2/pull/790>`_ `#893 <https://github.com/libgit2/pygit2/pull/893>`_ - New ``Patch.text`` returns the contents of the patch as a text string, deprecates `Patch.patch` `#790 <https://github.com/libgit2/pygit2/pull/790>`_ `#893 <https://github.com/libgit2/pygit2/pull/893>`_ Deprecations: - ``Patch.patch`` is deprecated, use ``Patch.text`` instead

erikvanzijst force-pushed the erik/patch_bytes branch 2 times, most recently from 5ff0818 to 95e4180 Compare April 27, 2018 21:02

brandonio21 reviewed Apr 29, 2018

View reviewed changes

erikvanzijst force-pushed the erik/patch_bytes branch from 95e4180 to 19561ee Compare May 1, 2018 01:25

erikvanzijst commented May 1, 2018

View reviewed changes

jdavid mentioned this pull request May 1, 2018

DiffLine.content is represented as unicode #610

Closed

jnrbsn mentioned this pull request Mar 25, 2019

Deprecate Patch.patch and add Patch.text and Patch.data #893

Merged

jdavid mentioned this pull request Mar 31, 2019

Consistent interface to get text and bytes #895

Open

7 tasks

jdavid closed this Mar 31, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Patch: Add str and bytes for undecoded content. #790

Patch: Add str and bytes for undecoded content. #790

erikvanzijst commented Apr 27, 2018

erikvanzijst commented Apr 27, 2018

erikvanzijst commented Apr 27, 2018

jdavid commented Apr 29, 2018

brandonio21 left a comment

brandonio21 Apr 29, 2018

brandonio21 Apr 29, 2018

jdavid commented Apr 29, 2018

erikvanzijst commented Apr 30, 2018

erikvanzijst May 1, 2018

jdavid commented May 1, 2018

jdavid commented May 1, 2018

erikvanzijst commented May 1, 2018

jdavid commented May 1, 2018

erikvanzijst commented May 1, 2018

jdavid commented May 2, 2018

jnrbsn commented Mar 30, 2019

jdavid commented Mar 31, 2019

Patch: Add __str__ and __bytes__ for undecoded content. #790

Patch: Add __str__ and __bytes__ for undecoded content. #790

Conversation

erikvanzijst commented Apr 27, 2018

erikvanzijst commented Apr 27, 2018

erikvanzijst commented Apr 27, 2018

jdavid commented Apr 29, 2018

brandonio21 left a comment

Choose a reason for hiding this comment

brandonio21 Apr 29, 2018

Choose a reason for hiding this comment

brandonio21 Apr 29, 2018

Choose a reason for hiding this comment

jdavid commented Apr 29, 2018

erikvanzijst commented Apr 30, 2018

erikvanzijst May 1, 2018

Choose a reason for hiding this comment

jdavid commented May 1, 2018

jdavid commented May 1, 2018

erikvanzijst commented May 1, 2018

jdavid commented May 1, 2018

erikvanzijst commented May 1, 2018

jdavid commented May 2, 2018

jnrbsn commented Mar 30, 2019

jdavid commented Mar 31, 2019

Patch: Add str and bytes for undecoded content. #790

Patch: Add str and bytes for undecoded content. #790