Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DiffLine.content is represented as unicode #610

Closed
mrh1997 opened this issue Mar 24, 2016 · 7 comments
Closed

DiffLine.content is represented as unicode #610

mrh1997 opened this issue Mar 24, 2016 · 7 comments
Labels

Comments

@mrh1997
Copy link
Contributor

mrh1997 commented Mar 24, 2016

The field DiffLine.content contains a unicode line. As git does not know anything about the encoding of the files to be diffed (they are blobs), I expect this object to be of type str in py2 and bytes in py3.

Even worse if a file is i.e. latin-1 encoded and contains latin-1 specific characters, all these characters are mapped to '\xfffd'. Thus is impossible to diff non-ascii encoded text files correctly.

I suppose this is a pygit2 bug, as the libgit2.h interface works correctly, as it exposes this field as const char * (see https://github.com/libgit2/libgit2/blob/HEAD/include/git2/diff.h#L555)

@arcsector
Copy link

When you say "The field DiffLine.content" are you talking about in the docs or in your file?

Python has weird issues with encodings; I wouldn't be surprised if it was the language itself.

@mrh1997
Copy link
Contributor Author

mrh1997 commented Apr 23, 2018

The docs do not mention the type at all. But (at least until 0.24.0, which was the last version I tested) the actual implementation of DiffLine used a variable of type unicode for the attribute content.

My file is a unicode file. But as git does not know anything about encodings, from git's point of view it is a stream of bytes. Thus I expect DiffLine.content to be of type bytes (or alternatively that you have to specifiy a encoding somewhere).

What do you mean with "weird issues with encodings"?!? This is definitely a bug in pygit2 (as it uses unicode instead of str/bytes).

@arcsector
Copy link

arcsector commented Apr 23, 2018

  1. For me all I had to do was wrap diffLine.content with str() and everything worked fine. The reason it's in unicode is because git is type agnostic, so if you have a 你 in your code, you don't get busted. I'm not sure why you see this as an issue, other than the fact that it's absent from the docs.

  2. I mean system languages don't usually have the issues with encodings that python has. The errors flag in python 3.6.5 doesn't work properly like it does in other languages, which means you have to use try/except blocks. But that's out of the scope of this issue.

@mrh1997
Copy link
Contributor Author

mrh1997 commented Apr 23, 2018

If the sourcecode contains a non-ascii characters (i.e. "Ö" or 'Ä') in a line, all of them will be replaced by '\xfffd'. This makes pygit2 diff unusable for all codebases that are using non-ascii text (which should be quite a lot).

If pygit2 would return str (in python2) or bytes (in python3):

  • you won't get busted, too
  • you would not loose information (which is the case if all characters > 0x80 are mapped to \xfffd)
  • it would match the behaviour of lib2git2 (which also returns char* instead of unicode)

@arcsector
Copy link

Make a PR for bytes type, then; you can't convert non-ascii to string on the fly.

@mrh1997
Copy link
Contributor Author

mrh1997 commented Apr 23, 2018

This is not about "on-the-fly" convertion. This is about loosing all characters > 0x80.

@jdavid
Copy link
Member

jdavid commented May 1, 2018

This is very much like #790 but concerning DiffLine instead of Patch ; a similar solution should apply. Both bytes and text should be available. Discussion on the specific naming is being discussed in #790 , if you want to give your opinion.

Besides DiffLine.content is a member attribute, I would prefer to use a getter or a method.

Thanks @mrh1997 for raising the issue, thanks @arcsector for bringing it back in my radar.

@jdavid jdavid added the bug label May 1, 2018
@jdavid jdavid closed this as completed in c61acaf May 6, 2018
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Jun 5, 2018
0.27.1 (2018-06-02)
-------------------------

Breaking changes:

- Now ``discover_repository`` returns ``None`` if repository not found, instead
  of raising ``KeyError``
  `#531 <https://github.com/libgit2/pygit2/issues/531>`_

Other changes:

- New ``DiffLine.raw_content``
  `#610 <https://github.com/libgit2/pygit2/issues/610>`_

- Fix tests failing in some cases
  `#795 <https://github.com/libgit2/pygit2/issues/795>`_

- Automatize wheels upload to pypi
  `#563 <https://github.com/libgit2/pygit2/issues/563>`_
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Dec 1, 2018
0.27.2 (2018-09-16)
-------------------------

- Add support for Python 3.7
  `#809 <https://github.com/libgit2/pygit2/issues/809>`_

- New ``Object.short_id``
  `#799 <https://github.com/libgit2/pygit2/issues/799>`_
  `#806 <https://github.com/libgit2/pygit2/pull/806>`_
  `#807 <https://github.com/libgit2/pygit2/pull/807>`_

- New ``Repository.descendant_of`` and ``Repository.branches.with_commit``
  `#815 <https://github.com/libgit2/pygit2/issues/815>`_
  `#816 <https://github.com/libgit2/pygit2/pull/816>`_

- Fix repository initialization in ``clone_repository(...)``
  `#818 <https://github.com/libgit2/pygit2/issues/818>`_

- Fix several warnings and errors, commits
  `cd896ddc <https://github.com/libgit2/pygit2/commit/cd896ddc>`_
  and
  `dfa536a3 <https://github.com/libgit2/pygit2/commit/dfa536a3>`_

- Documentation fixes and improvements
  `#805 <https://github.com/libgit2/pygit2/pull/805>`_
  `#808 <https://github.com/libgit2/pygit2/pull/808>`_


0.27.1 (2018-06-02)
-------------------------

Breaking changes:

- Now ``discover_repository`` returns ``None`` if repository not found, instead
  of raising ``KeyError``
  `#531 <https://github.com/libgit2/pygit2/issues/531>`_

Other changes:

- New ``DiffLine.raw_content``
  `#610 <https://github.com/libgit2/pygit2/issues/610>`_

- Fix tests failing in some cases
  `#795 <https://github.com/libgit2/pygit2/issues/795>`_

- Automatize wheels upload to pypi
  `#563 <https://github.com/libgit2/pygit2/issues/563>`_
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants