
Encoding issue with commit name #387

Open
Haacked opened this Issue · 26 comments

7 participants

@Haacked
Owner

Not sure if this is a LibGit2 or LibGit2Sharp bug but thought I'd start here. :)

Try cloning this repository here: https://github.com/CoolDevs/Mets-palopeli

git show d527f20

Shows the author as: Lassi Hämäläinen

But if you load the commit in LibGit2Sharp, you get: Lassi H�m�l�inen

Phil

@ethomson
Owner

Just to be clear: what do you mean when you say "load it"? Are you looking at this in the debugger? Printing it to the console?

@Haacked
Owner

Looking in the debugger.

@Haacked
Owner

It was reported to us because GHfW displays it wrong. But we're just binding to the value given to us.

@carlosmn
Owner

Author and committer information must be in UTF-8, as they're compulsory fields in the commit, and the encoding header can only come afterwards and applies to the commit message (backwards compatibility and whatnot). In this case, that header isn't provided either, so you can't even use it as a hint to take another look at the author/committer and see whether another encoding should be used.

It's hard to see this as a bug: the libgit2(sharp) layers are doing what they should, the input just happens to be mis-encoded, and GHfW displays it wrong because it has the wrong encoding; git-core in a Unicode environment also shows it incorrectly.

There is no way to display this name correctly other than checking every single one to see if it really is unicode, or guess the encoding otherwise, which isn't something the library should be trying to do.
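
For reference, a minimal sketch of what that "check whether the bytes really are UTF-8" test could look like in C# (the helper name is made up; it relies on a strict decoder fallback, the same trick used later in this thread):

    using System.Text;

    static bool IsValidUtf8(byte[] bytes)
    {
        // Clone UTF-8 and make the decoder throw instead of substituting U+FFFD,
        // so malformed sequences are detected rather than silently replaced.
        var strictUtf8 = (Encoding)Encoding.UTF8.Clone();
        strictUtf8.DecoderFallback = new DecoderExceptionFallback();
        try
        {
            strictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }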

@nulltoken
Owner

@carlosmn I came to the same conclusion. Not UTF-8 encoded.

git-core in a unicode environment also shows it incorrectly.

Surprisingly, git show and git cat-file produce different outputs:

$ git show d527f20
commit d527f20e7347544c55b1ec665f02a706284c476d
Author: Lassi H<E4>m<E4>l<E4>inen <Latexi95@gmail.com>
Date:   Mon Apr 1 07:09:25 2013 +0300

$ git show --format=raw d527
commit d527f20e7347544c55b1ec665f02a706284c476d
tree 2bba9fdb73a885b4710d725536cfdcef41ce0b69
parent 5be375d5a5b39bd207719c6409fcba1e46f1c93c
author Lassi H<E4>m<E4>l<E4>inen <Latexi95@gmail.com> 1364789365 +0300
committer Lassi H<E4>m<E4>l<E4>inen <Latexi95@gmail.com> 1364789365 +0300

versus

$ git cat-file commit d527
tree 2bba9fdb73a885b4710d725536cfdcef41ce0b69
parent 5be375d5a5b39bd207719c6409fcba1e46f1c93c
author Lassi Hämäläinen <Latexi95@gmail.com> 1364789365 +0300
committer Lassi Hämäläinen <Latexi95@gmail.com> 1364789365 +0300

Added sfx, but not. Hahaa
@carlosmn
Owner

Not that surprising. git show calls your pager, and less shows "unknown" chars that way. The cat-file output is going more-or-less straight to the terminal window and it's then free to interpret it however it feels like.

@nulltoken
Owner

There is no way to display this name correctly other than checking every single one to see if it really is unicode, or guess the encoding otherwise, which isn't something the library should be trying to do.

FWIW, I've found some traces of "encoding guessing" in JGit

@Haacked
Owner

@nulltoken: What's interesting is when I run git show --format=raw d527 I see the following:

(screenshot: git show --format=raw output, with the name displayed correctly)

git cat-file commit d527 shows it the same way you saw it.

So I guess the question is: why are the characters displayed correctly in gitk and in git show? Should LibGit2 try encoding guessing or leave it to each client?

@carlosmn
Owner

Likely your codepage happens to coincide with the one Lassi has on their computer (I'd assume tk doesn't use the wide family of the WinAPI). Interestingly, the PowerShell I get from GHfW manages to mangle both this name and a properly encoded one.

Guessing encodings is its own black art; the JGit code is wrapping the ones from Java's NIO and that's all we could realistically do. We'd have to find some library on each platform we could link against and wrap or carry the code with us. I'm not sure either of them is really better than letting a higher layer guess if the user feels they need to.

@Haacked
Owner

@carlosmn In PowerShell, what do you get when you run the following?

    [System.Text.Encoding]::Default.CodePage

I get 1252.

So I thought I could try to convert the string to this encoding just to see what happens by doing this:

var utf8 = Encoding.UTF8;
var destination = Encoding.GetEncoding(1252);
var bytes = Encoding.UTF8.GetBytes(commit.Author.Name);
destination.GetString(Encoding.Convert(utf8, destination, bytes));
// Results in "Lassi Hýÿmýÿlýÿinen"

Is there information that's lost when libgit2 does the conversion? Any ideas?

@carlosmn
Owner

I get the same codepage. There is no information lost in libgit2, but when you get the field through lg2#, it gets marshalled from an (assumed UTF-8) C string to a C# Unicode string, which is where some information might get lost, depending on what bytes were in the original string and what you use to marshal them (but let's assume it's fine for now).

In this snippet, however, you are converting from UTF-8, but the problem is that the string isn't encoded that way, so that won't work. I'd expect it to work with a commit from me or Vicent, though.
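
For illustration, a sketch of what does work once the raw bytes are in hand (the byte values below are simply "Lassi Hämäläinen" written out in CP-1252 by hand): decode with the codepage the author actually used, rather than re-encoding the already-mangled managed string.

    // Raw bytes of "Lassi Hämäläinen" in CP-1252 / Latin-1 (0xE4 = 'ä').
    byte[] raw =
    {
        0x4C, 0x61, 0x73, 0x73, 0x69, 0x20,        // "Lassi "
        0x48, 0xE4, 0x6D, 0xE4, 0x6C, 0xE4, 0x69,  // "Hämäläi"
        0x6E, 0x65, 0x6E                           // "nen"
    };

    var name = Encoding.GetEncoding(1252).GetString(raw);
    Console.WriteLine(name); // "Lassi Hämäläinen"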

@Haacked
Owner

but when you get the field through lg2#, it gets marshalled from a (assumed utf-8) C string to a C# unicode string

Is there a way from lg2# to get the actual bytes of the name? Perhaps there should be a structure for getting that?

@carlosmn
Owner

I don't think lg2# lets you do that currently, but it'd just be a matter of adding a function to Native.cs as lg2 gives you the bytes directly.
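
As a rough sketch of what such an addition might look like on the managed side (the class and method names here are hypothetical, not existing lg2# API), copying the NUL-terminated bytes out of a native char* without decoding them:

    using System;
    using System.Runtime.InteropServices;

    internal static class NativeBytes
    {
        // Copies the bytes of a NUL-terminated native string without decoding them,
        // so a caller can pick the encoding itself.
        public static byte[] FromNativePointer(IntPtr pNativeData)
        {
            if (pNativeData == IntPtr.Zero)
            {
                return null;
            }

            int length = 0;
            while (Marshal.ReadByte(pNativeData, length) != 0)
            {
                length++;
            }

            var bytes = new byte[length];
            Marshal.Copy(pNativeData, bytes, 0, length);
            return bytes;
        }
    }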

@yishaigalatzer

Detecting (and particularly auto-detecting) encodings is a rather complex issue; there are a bunch of scenarios to consider beyond the simple UTF-8 case:

  1. The bytes may be in a local codepage that is the default on the user's computer. In that case auto-detection is very hard, because there is very little context to guess from.
  2. The bytes may be in UTF-8, with or without a BOM (the former is common on Windows machines, the latter on Unix machines). But auto-detecting UTF-8 is fairly easy and should be the common fallback (if the encoding class doesn't throw when trying to decode the data).
  3. The bytes may be in another encoding that has an identifier (BOM), which is easy to detect.

I think cases 2 & 3 should be supported, as the user of lg2# gets back a string and cannot anticipate that the string will be malformed.
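
For what it's worth, a minimal sketch of BOM-based detection for case 3 (class and method names are mine; only the byte patterns come from the Unicode BOM conventions):

    using System.Text;

    internal static class BomSniffer
    {
        // Returns the encoding indicated by a leading BOM, or null if there is none.
        public static Encoding DetectFromBom(byte[] bytes)
        {
            if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            {
                return Encoding.UTF8;               // UTF-8 BOM
            }
            if (bytes.Length >= 2 && bytes[0] == 0xFF && bytes[1] == 0xFE)
            {
                return Encoding.Unicode;            // UTF-16 little-endian BOM
            }
            if (bytes.Length >= 2 && bytes[0] == 0xFE && bytes[1] == 0xFF)
            {
                return Encoding.BigEndianUnicode;   // UTF-16 big-endian BOM
            }
            return null;                            // no BOM: fall back to UTF-8 validation
        }
    }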

We were dealing with lots of similar issues when the WebMatrix editor was developed. So if folks are interested in a fix I can take a deeper look.

@nulltoken
Owner

We were dealing with lots of similar issues when the WebMatrix editor was developed. So if folks are interested in a fix I can take a deeper look.

That would be amazing!

@yishaigalatzer

So looking at the raw libgit2 memory, it looks like he is using a simple code page and not UTF-8. The only approach I know to deal with that for something this short is to try to decode using UTF-8; if that fails, we can fall back to a local or Latin-1 code page. There is just not enough information to reliably do anything better.

But fortunately in this case the solution above resolves the problem quite nicely (assuming you have the same default code page as the author).

Since I can't contribute code directly here I'm providing a sample below:

In the Utf8Marshaler class, add a marshallerEncoding field:

    // UTF-8 decoder that throws DecoderFallbackException on invalid byte
    // sequences instead of silently substituting U+FFFD.
    private static readonly Encoding marshallerEncoding = GetEncodingWithExceptionFallbacks(Encoding.UTF8);

    public static Encoding GetEncodingWithExceptionFallbacks(Encoding encoding)
    {
        // Clone so the shared Encoding.UTF8 instance is not modified.
        var tempEncoding = (Encoding)encoding.Clone();
        tempEncoding.DecoderFallback = new DecoderExceptionFallback();
        return tempEncoding;
    }

Then, wherever you want to fall back to the default code page, add a try/catch block like this:

    public static unsafe String FromNative(IntPtr pNativeData, int length)
    {
        if (IntPtr.Zero == pNativeData)
        {
            return null;
        }

        if (0 == length)
        {
            return String.Empty;
        }

        try
        {
            // Decode as UTF-8 first; the strict decoder throws on invalid sequences.
            return new String((sbyte*)pNativeData.ToPointer(), 0, length, marshallerEncoding);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8: fall back to the system default code page.
            return new String((sbyte*)pNativeData.ToPointer(), 0, length, Encoding.Default);
        }
    }

This will produce the correct string automatically in this case.

(screenshot showing the correctly decoded name)
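
A quick way to exercise that fallback path (just a sketch; it assumes FromNative lives on a class named Utf8Marshaler as above, and that the system default code page matches the one the name was written in):

    // Simulate a native, non-UTF-8 string: CP-1252 bytes in unmanaged memory.
    var cp1252Bytes = Encoding.GetEncoding(1252).GetBytes("Lassi Hämäläinen");
    var ptr = Marshal.AllocHGlobal(cp1252Bytes.Length);
    Marshal.Copy(cp1252Bytes, 0, ptr, cp1252Bytes.Length);

    // UTF-8 decoding throws, so FromNative falls back to Encoding.Default.
    var decoded = Utf8Marshaler.FromNative(ptr, cp1252Bytes.Length);
    Console.WriteLine(decoded); // "Lassi Hämäläinen" when the default code page is 1252

    Marshal.FreeHGlobal(ptr);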

@martinwoodward referenced this issue from a commit in martinwoodward/libgit2sharp:
Test case to repro libgit2/libgit2sharp#387
When an invalid encoding is in the author name or other commit metadata,
libgit2 always returns the text as if it were UTF-8 encoded (which it
should be). JGit will attempt to parse using the current system's
default encoder. Core git simply returns the bytes, leaving it
up to the client shell to interpret (which posh-git does by
defaulting to cp-1252, but Git Bash in msysgit shows it differently).
36aa3bb
@martinwoodward
Collaborator

Added a test case that repros this bug. Basically, the encoding of the username in the commit is not UTF-8 as it should be, but CP-1252 (the default codepage on Windows). I'd love to know how that happened, as it was a recent commit, but I cannot make it happen using the Git command line - only by doing it programmatically. @Haacked - if you can get more data on how the user name was entered, that would be great (maybe a copy of the local .gitconfig, and details of which version of which implementation they used to create the commits).

The method @yishaigalatzer proposes above works because it just so happens that the commit is in the same codepage as the default encoding on the system running that code (i.e. cp-1252). At first glance this feels wrong: if, for example, your default encoding were a Japanese or Chinese codepage, then this code would not work. Similarly, if the encoding used to create the commit had been MacRoman or something, then defaulting to cp-1252 wouldn't have worked either. There is no real way to guess the encoding for these sub-strings.

But, as @nulltoken points out, JGit actually has similar logic that falls back to a default codepage, because in their experience, if it's not been correctly encoded in UTF-8, then it's often been encoded by the same user using the system default codepage. This suggests that this is a known problem worth coding for.

So. Shall we fall back to the current system default encoding, or leave things as they are today? If we fall back to the current system default, that logic only kicks in when we would have returned a bad result anyway, so it might make it better or it could still be bad - but that's a higher probability of success than without the work-around.

What do folks think?

@Haacked
Owner

If we fall back to the current system default, that logic only kicks in when we would have returned a bad result anyway, so it might make it better or it could still be bad

But it can never really make it worse, right? It's strictly better than today's behavior? I vote for this.

@ethomson
Owner

Can we identify the cases in which UTF-8 will fail to parse (and thus throw an exception and allow us to fall back)? What byte sequences exist in some other encoding but not in UTF-8?

@ethomson
Owner

I see - I forgot how UTF-8 does multi-byte sequences. In this case, you have a two-byte sequence:

11100100 01101001

In Latin1 this is:

äi

But when we try to parse this as UTF-8, it is illegal. The first byte, 11100100, is the lead byte of a 3-byte UTF-8 sequence. The subsequent 2 continuation bytes must each be of the form 10xxxxxx. If they are not, this is not valid UTF-8.

Thus, we are eligible to fall back in this case (and probably most others where we see CP-1252 data).
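
A quick worked check of that analysis (a standalone C# sketch, not project code):

    var bytes = new byte[] { 0xE4, 0x69 };                  // 11100100 01101001

    // Latin-1 maps each byte directly to a character.
    Console.WriteLine(Encoding.GetEncoding("ISO-8859-1").GetString(bytes)); // "äi"

    // A strict UTF-8 decoder rejects the sequence: the lead byte announces a
    // 3-byte sequence, but the next byte is not a 10xxxxxx continuation byte.
    var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
    try
    {
        strictUtf8.GetString(bytes);
    }
    catch (DecoderFallbackException)
    {
        Console.WriteLine("not valid UTF-8");
    }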

@martinwoodward
Collaborator

But if we fall back to the platform default encoding, we would get the correct result when running the code under the same codepage as the user (in this case Windows), but running the test on a Mac, where the default codepage is MacRoman, would fail.

@martinwoodward referenced this issue from a commit in martinwoodward/libgit2sharp:
Create a test for libgit2/libgit2sharp#387 against a repo with funky commit encodings
7326361
@martinwoodward
Collaborator

Hey @Haacked - that last commit (0032af2) shows how to add the fallback to the Utf8Marshaler. Talking with @nulltoken, we're still wondering if this is the correct thing to do. However, I wanted to push up the code in case you needed it.

@Haacked
Owner

Thanks!

@jbialobr

I'd love to know how that happened as it was a recent commit but I cannot make it happen using Git command line - only by doing it programatically.

The most common case is editing your config file and saving it with an encoding other than UTF-8.
If I am not mistaken, git uses the system default code page to save the config file, so in that case it is enough to set user.name with the git config command.

@jbialobr

@Haacked Do you have two cloned repositories? I see that some commits have good encoding and some have bad.
It is weird, as they were committed in a short time span with both good and bad encoding.

(screenshot of the commit history showing both correctly and incorrectly encoded names)

@nulltoken referenced this issue from a commit in GitTools/GitVersion:
@SimonCropp remove fody dependency d980913
@Haacked
Owner

@Haacked Do you have two cloned repositories? I see that some commits have good encoding and some have bad. It is weird, as they were committed in a short time span with both good and bad encoding.

Sorry I missed this all these years. :stuck_out_tongue:

This is not my repository, so I don't know much about the context for why there are good and bad commits.
