I mean, solve all of the encoding problems in commit messages, blobs, diffs, etc.
What do you mean? Can you be a bit more specific?
Sorry, let me explain more.
As far as I know, Grit/Rugged doesn't deal with the encoding of the repo content. So when you cat-file from a repo, you always get an ASCII-8BIT encoded string (or the external encoding, if you have already set one in Ruby).
You can't force_encoding everything to UTF-8, because the codebase may not be UTF-8 encoded.
So you need to detect the file's encoding and convert it to UTF-8 if your page is UTF-8 (for display).
To solve this problem, I just wrote a gem called grit_ext, but it doesn't seem to work perfectly.
If the file is too small to detect, charlock_holmes will detect the wrong encoding.
So I want to know how you guys solve this problem.
Hi @SaitoWu - so if charlock_holmes isn't able to detect the encoding, or detects it wrong, there isn't much else we can do. This is indeed a tough problem to deal with. I personally wish git forced an encoding to be set for every object that is stored in its database, but it's too late for that, as adding it would break compatibility.
One option would be to first (before passing the data through charlock_holmes) check if it's valid UTF-8. It sounds like you're using Ruby 1.9 so you could try doing this:
original_encoding = content.encoding
# looks like it's valid UTF-8, just return it as-is
# well, it's not valid UTF-8 so let's give charlock_holmes a try here
# if detection fails, our last resort could be to forcibly clean `content` to
# ensure it's valid UTF-8. Meaning it'll replace invalid UTF-8 sequences
# with a replacement character. You can look into the docs for String#encode
# for how to do that
If you're using Ruby 1.8, for the UTF-8 validity check and cleaning take a look at my utf8 gem.
Lastly, you could try skipping charlock_holmes detection if the data you're trying to detect is less than, say, 256 bytes. In that case you might as well assume it's UTF-8, because unfortunately there isn't much else you can do to reliably detect its actual encoding. You could also try force-cleaning the string to be valid UTF-8.
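For reference, the steps above could be sketched roughly like this. It is only a sketch under a few assumptions: `to_utf8` and `MIN_DETECT_BYTES` are hypothetical names, not part of Grit, Rugged, or charlock_holmes; the 256-byte threshold is just the number suggested above; and the cleaning step uses `String#scrub`, which needs Ruby 2.1 or later (on 1.9 you would use `String#encode` with replacement options instead).

```ruby
# Optional dependency: if charlock_holmes is not installed, detection is skipped.
begin
  require 'charlock_holmes'
rescue LoadError
  nil
end

# Assumed threshold, taken from the "say, 256 bytes" suggestion above.
MIN_DETECT_BYTES = 256

# Hypothetical helper: returns a valid UTF-8 copy of `content`
# without mutating the argument.
def to_utf8(content)
  utf8 = content.dup.force_encoding(Encoding::UTF_8)

  # Step 1: the bytes are already valid UTF-8, so return them as-is.
  return utf8 if utf8.valid_encoding?

  # Step 2: try detection, but only when there is enough data to trust it.
  if defined?(CharlockHolmes) && content.bytesize >= MIN_DETECT_BYTES
    detection = CharlockHolmes::EncodingDetector.detect(content)
    if detection && detection[:encoding]
      begin
        converted = CharlockHolmes::Converter.convert(content, detection[:encoding], 'UTF-8')
        converted.force_encoding(Encoding::UTF_8)
        return converted if converted.valid_encoding?
      rescue StandardError
        # Conversion failed; fall through to the last resort below.
      end
    end
  end

  # Step 3: last resort, replace invalid byte sequences with U+FFFD.
  utf8.scrub("\uFFFD")
end
```

With this, `to_utf8("caf\xC3\xA9".b)` comes back unchanged as UTF-8, while a short Latin-1 input such as `to_utf8("caf\xE9".b)` skips detection and has its stray byte replaced with U+FFFD. Whether losing that character is acceptable depends on how the output is used (display vs. round-tripping).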
Hopefully that helps clear it up a little ;)
Hi, @brianmario Thank you very much for your answer!
@holman, your feedback repo is awesome.
Cheers! If I can't answer it, I'll find someone who can. :)