How to work with Grit/Rugged like a boss? #191
Sorry, let me explain more.
As far as I know, Grit/Rugged doesn't deal with the encoding of the repo's content. So when you cat-file from a repo, you always get an ASCII-8BIT string (or a string in the external encoding, if you have already set one in Ruby).
You can't just force_encoding everything to UTF-8, because the codebase may not be UTF-8 encoded.
So you need to detect each file's encoding and convert it to UTF-8 if your page is UTF-8 (for display).
To solve this problem, I wrote a gem called grit_ext, but it doesn't seem to work perfectly.
If the file is too small to detect, charlock_holmes will detect the wrong encoding.
So I'd like to know: how do you guys solve this problem?
Hi @SaitoWu - so if charlock_holmes isn't able to detect the encoding, or detects it wrong, there isn't much else we can do. This is indeed a tough problem to deal with. I personally wish git forced an encoding to be set for every object that is stored in its database, but it's too late for that, as adding it would break compatibility.
One option would be to first (before passing the data through charlock_holmes) check if it's valid UTF-8. It sounds like you're using Ruby 1.9 so you could try doing this:
    original_encoding = content.encoding
    content.force_encoding('UTF-8')
    if content.valid_encoding?
      # looks like it's valid UTF-8, just return it as-is
    else
      # well, it's not valid UTF-8 so let's give charlock_holmes a try here
      #
      # if detection fails, our last resort could be to forcibly clean `content` to
      # ensure it's valid UTF-8. Meaning it'll replace invalid UTF-8 sequences
      # with a replacement character. You can look into the docs for String#encode
      # for how to do that
    end
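To make the branches above a bit more concrete, here is a rough sketch of how they could be filled in. The method name `to_utf8` and the `detector` argument are made up for illustration: `detector` is assumed to be any callable that takes the raw bytes and returns an encoding name or `nil` (with charlock_holmes that might be something like `->(c) { (CharlockHolmes::EncodingDetector.detect(c) || {})[:encoding] }`). The cleaning step round-trips through UTF-16LE with `String#encode` so that invalid bytes get replaced; treat it as one possible approach, not the only one.

```ruby
# Hypothetical helper: returns a valid UTF-8 copy of `content`.
# `detector` is an assumed callable (raw bytes -> encoding name or nil).
def to_utf8(content, detector = nil)
  utf8 = content.dup.force_encoding('UTF-8')
  # Bytes that are already valid UTF-8 only need to be re-tagged.
  return utf8 if utf8.valid_encoding?

  guess = detector && detector.call(content)
  if guess
    begin
      # Trust the detected encoding and transcode to UTF-8.
      return content.dup.force_encoding(guess).encode('UTF-8')
    rescue EncodingError, ArgumentError
      # Unknown encoding name or bytes that don't fit it: fall through.
    end
  end

  # Last resort: replace invalid sequences with U+FFFD. Going through
  # UTF-16LE forces a real conversion instead of a same-encoding no-op.
  # On Ruby 2.1+, `utf8.scrub` is a shorter alternative.
  utf8.encode('UTF-16LE', invalid: :replace, undef: :replace).encode('UTF-8')
end
```

Calling `to_utf8(blob)` without a detector still gives you something safe to render; passing a detector only matters when the bytes are not valid UTF-8 to begin with.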
If you're using Ruby 1.8, for the UTF-8 validity check and cleaning take a look at my utf8 gem.
Lastly, you could try skipping charlock_holmes detection if the data you're trying to detect is less than, say, 256 bytes. In that case you might as well assume it's UTF-8, because unfortunately there isn't much else you can do to reliably detect its actual encoding. You could also try force-cleaning the string to be valid UTF-8.
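A minimal sketch of that size check might look like the following. The names `MIN_DETECTABLE_BYTES` and `guess_encoding` are hypothetical, the 256 is just the ballpark figure from above, and `detector` is again assumed to be a callable that wraps whatever detection library you use and returns an encoding name or `nil`.

```ruby
# Assumed cutoff below which detection is considered unreliable.
MIN_DETECTABLE_BYTES = 256

# Hypothetical helper: returns the encoding name to assume for `content`.
# `detector` is an assumed callable (raw bytes -> encoding name or nil).
def guess_encoding(content, detector)
  # Too little data to detect reliably, so don't even ask the detector.
  return 'UTF-8' if content.bytesize < MIN_DETECTABLE_BYTES

  # Fall back to UTF-8 when the detector has no answer either.
  detector.call(content) || 'UTF-8'
end
```

Whatever name comes back is still only a guess, so the result should go through a validity check or a cleaning step before it is rendered.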
Hopefully that helps clear it up a little ;)