Patch: Add __str__ and __bytes__ for undecoded content. #790
Would you mind having a look at this PR, @brandonio21?
We're using PyGit2 to power parts of Bitbucket, but until now that didn't include patch generation (for which we still just shelled out to git). As we move more functionality onto pygit, we ran into the forced unicode decoding issue. Bitbucket needs access to the raw, untouched diff/patch bytes, so it seemed appropriate to put that under __str__ and __bytes__. If I overlooked something and the raw byte contents of patches are already accessible, please let me know.
We should deprecate So what we need is a way to get bytes and a way to get text:
Since Don't need to add new tests for Bonus points if Thanks!
I think this looks good. I do agree that .patch should probably be deprecated (simply because we don't want many ways to do this lying around).
In regards to naming, I like str and bytes.
src/patch.c
#if PY_MAJOR_VERSION == 2
ret = Py_BuildValue("s#", buf.ptr, buf.size);
#else
ret = to_unicode(buf.ptr, NULL, NULL);
According to its documentation, Py_BuildValue already returns a str decoded as UTF-8 for us. Do we even need to use to_unicode here?
Ah, looks like to_unicode has some extra logic we might want. Carry on!
The problem with So I think we should have Eventually The case for
Yeah, I can add an explicit I propose to also keep I'll also add a runtime deprecation warning to
Patch.patch assumes all content to be encoded in UTF-8 and forcefully replaces any non-decodable sequences. This can lead to corruption for content that either does not conform to any specific encoding at all, or uses an encoding that is incompatible with, or ambiguous relative to, UTF-8. This change adds __str__() and __bytes__() methods to Patch that return the unmodified, raw bytes. It also adds a new .decode() method that gives greater control over how text decoding happens, deprecating Patch.patch.
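To make the corruption concrete, here is a small sketch in plain Python. It does not use pygit2 at all; the byte string is made-up sample data standing in for a diff line whose content is Latin-1 encoded. It only illustrates what decoding as UTF-8 with the "replace" error handler does to such bytes.

```python
# Made-up sample: a diff line whose content is Latin-1 encoded, not UTF-8.
raw = "+caf\u00e9\n".encode("latin-1")  # b'+caf\xe9\n'

# Forced UTF-8 decoding with replacement, as described for Patch.patch.
text = raw.decode("utf-8", "replace")

# The undecodable byte became U+FFFD, so re-encoding cannot restore it.
roundtrip = text.encode("utf-8")

print(raw)               # the original bytes
print(roundtrip)         # the bytes after a decode/encode round trip
print(raw == roundtrip)  # False: the content was altered
```

Once the replacement character is in the text, nothing downstream can tell which byte it used to be, which is why access to the raw bytes matters for a consumer that has to reproduce the patch exactly.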
PyErr_WarnEx(PyExc_DeprecationWarning, "`Patch.patch` assumes UTF-8 encoding and can have unexpected results on "
"other encodings. If decoded text is needed, use `Patch.decode()` "
"instead. Otherwise use `bytes(Patch)`.", 1);
return decode(self, "utf-8", "replace");
These hardwired parameters are equivalent to the original logic of to_unicode(buf.ptr, NULL, NULL).
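For readers more at home in Python than in the C API, the shape of this hunk can be sketched roughly as follows. `PatchSketch` is a hypothetical stand-in class, not pygit2's actual `Patch`, and the default arguments of `decode()` are an assumption; only the hardwired `"utf-8"` / `"replace"` pair and the warning text come from the diff above.

```python
import warnings


class PatchSketch:
    """Hypothetical stand-in for the C-level Patch object (not pygit2's API)."""

    def __init__(self, raw: bytes):
        self._raw = raw

    def __bytes__(self) -> bytes:
        # Raw, untouched patch content.
        return self._raw

    def decode(self, encoding: str = "utf-8", errors: str = "strict") -> str:
        # The caller chooses the encoding and error handling (defaults assumed).
        return self._raw.decode(encoding, errors)

    @property
    def patch(self) -> str:
        # Deprecated accessor: warn, then keep the old hardwired behaviour.
        warnings.warn(
            "`Patch.patch` assumes UTF-8 encoding and can have unexpected "
            "results on other encodings. If decoded text is needed, use "
            "`Patch.decode()` instead. Otherwise use `bytes(Patch)`.",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller of .patch
        )
        return self.decode("utf-8", "replace")


# Usage: the old attribute still works, but emits a DeprecationWarning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = PatchSketch(b"+caf\xe9\n").patch
```

Keeping the old decoding behaviour behind the warning means existing callers see no change in output while being pointed at the explicit alternatives.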
It's true that What about Here we're defining the policy that eventually should be used across the whole lib.
Or to be more explicit
That's true only on Python 3 though. What probably shaped my thinking a bit is that most of my day to day work is still Python 2 where However, I agree that it might be best to put Python 3 idioms ahead of Python 2 and I'm happy to axe AFAICT, What would you think of carrying Blob's interface forward and adding
Objects such as blobs also have #610 is a similar issue, concerning So let leave alone So,
If somebody wants a text string with an encoding different from utf-8 it would be as easy as:
So no need for The buffer protocol I understand is an optimization to reduce memory copies. Whether to implement it or not is up to you, don't know whether it will make a difference for your use case. These could be the commits:
I understand the first one is enough to solve your issue. It's up to you how far you want to go 😄 Please don't hesitate if you have better ideas than
Yeah, I'm happy to add Though, would I defined Do we really need a shorthand for decoding a patch/blob/line into text? With Git's lack of encoding information, this is an inherently tricky and unreliable proposition.
This That commit doesn't have an So I would stick to that behaviour and keep I think it's convenient to have For reference there're a few more places where we get the bytes and text with different naming: In some cases there are several attributes available and here we see 2 conventions:
We need to consider these to have a coherent convention... I like
@jdavid @erikvanzijst This can probably be closed now that #893 is merged.
Yes, follow-up in issue #895. Thank you both!
0.28.1 (2019-04-19)
-------------------------

- Now works with pycparser 2.18 and above
  `#846 <https://github.com/libgit2/pygit2/issues/846>`_

- Now ``Repository.write_archive(..)`` keeps the file mode
  `#616 <https://github.com/libgit2/pygit2/issues/616>`_
  `#898 <https://github.com/libgit2/pygit2/pull/898>`_

- New ``Patch.data`` returns the raw contents of the patch as a byte string
  `#790 <https://github.com/libgit2/pygit2/pull/790>`_
  `#893 <https://github.com/libgit2/pygit2/pull/893>`_

- New ``Patch.text`` returns the contents of the patch as a text string,
  deprecates ``Patch.patch``
  `#790 <https://github.com/libgit2/pygit2/pull/790>`_
  `#893 <https://github.com/libgit2/pygit2/pull/893>`_

Deprecations:

- ``Patch.patch`` is deprecated, use ``Patch.text`` instead
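As a rough illustration of why a byte-string attribute is useful alongside a text one, the sketch below uses a made-up byte string in place of what ``Patch.data`` would return. No repository is opened and pygit2 is not imported, so this shows only the standard ``bytes.decode`` options a caller has once the raw bytes are available.

```python
# Stand-in for the byte string Patch.data would return; sample data only.
data = "-na\u00efve\n+na\u00efve caf\u00e9\n".encode("latin-1")

# Decode with an encoding the caller knows (or guesses) to be right.
as_latin1 = data.decode("latin-1")

# Or keep undecodable bytes recoverable with the surrogateescape handler.
lossless = data.decode("utf-8", "surrogateescape")
restored = lossless.encode("utf-8", "surrogateescape")

print(as_latin1)
print(restored == data)  # the surrogateescape round trip preserves the bytes
```

Which of the two approaches is appropriate depends on whether the caller actually knows the encoding; Git itself does not record one for file content.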