Explicitly use utf-8 when decoding bytestrings #768
Merged
While Python 3 defaults to utf-8 in `bytes.decode()`, Python 2's equivalent (`str.decode()`) will use the default encoding as set by site.py (which is almost always ascii).
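As a rough illustration of the difference, the sketch below reproduces the failure by naming the ascii codec explicitly, which is what a bare `str.decode()` amounts to on a typical Python 2 install. The branch name is made up for the example.

```python
# -*- coding: utf-8 -*-
# Hypothetical branch name containing a non-ascii character, as UTF-8 bytes.
raw = u"feature/caf\u00e9".encode("utf-8")

# On Python 2, raw.decode() with no argument usually behaves like this,
# because the default encoding is almost always ascii.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Naming the codec gives the same result on Python 2 and Python 3.
name = raw.decode("utf-8")
```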
From looking at the code, it seems that these decodes have just sort of
been fixed piecemeal (likely when someone realized that pygit2 was
failing to handle unicode properly), but any decodes which run on Python
2 that don't specify utf-8 as the encoding are a ticking time bomb. I
personally noticed this was a problem when I encountered a traceback in
the RemoteCallbacks while fetching a new branch which contained utf-8
characters. During the fetch, when `pygit2.remote.maybe_string()` was
invoked by `_update_tips_cb()` with a pointer to a bytestring containing
unicode, the decode failed because the default encoding is ascii. As it
turns out, this was fixed in master, but there are a number of other
decodes which still have no explicit encoding.
This commit explicitly uses utf-8 for all remaining bytestring decodes
which do not have an encoding specified, aside from one in PY3-specific
code where doing so would be redundant.
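The pattern applied here can be sketched as a small helper that always passes the encoding. `to_text` is a hypothetical name used only for illustration; it is not a pygit2 function and does not reflect how pygit2 structures its own decodes.

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: decode bytes with an explicit encoding.

    None and already-decoded text are returned unchanged.
    """
    if value is None:
        return None
    if isinstance(value, bytes):  # on Python 2, bytes is an alias for str
        return value.decode(encoding)
    return value
```

Because the encoding is never left to the interpreter default, the helper behaves the same on Python 2 and Python 3.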