How to work with Grit/Rugged like a boss? #191

Closed
SaitoWu opened this Issue May 16, 2012 · 5 comments

SaitoWu commented May 16, 2012

I mean, solving all of the encoding problems in commit messages, blobs, diffs, etc.

Cheers,
Saito

holman May 16, 2012

Owner

What do you mean? Can you be a bit more specific?

SaitoWu May 16, 2012

Sorry, let me explain more.

As far as I know, Grit/Rugged doesn't deal with the encoding of the repo's content. So when you cat-file from a repo, you always get an ASCII-8BIT encoded string (or a string in Ruby's external encoding, if you have already set one).

You can't force_encoding everything to UTF-8, because the codebase may not be UTF-8 encoded.

So you need to detect the file's encoding and convert it to UTF-8 if your page is UTF-8 (for display).
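
To illustrate the problem, here is a minimal sketch that uses only Ruby's standard String API. The GBK sample bytes are an assumption standing in for repo content, and nothing here calls Grit or Rugged.

```ruby
# Hypothetical sample: the bytes of "中文" encoded as GBK, tagged as binary
# (ASCII-8BIT), the way raw repo content typically arrives.
raw = "\xD6\xD0\xCE\xC4".b
raw_encoding = raw.encoding

# Relabeling the bytes as UTF-8 converts nothing, so the string is invalid.
forced = raw.dup.force_encoding('UTF-8')
forced_ok = forced.valid_encoding?

# Once the real encoding is known (assumed to be GBK here), label the bytes
# correctly and transcode them to UTF-8.
converted = raw.dup.force_encoding('GBK').encode('UTF-8')
```

The hard part is the step this sketch assumes away: knowing that the source encoding is GBK in the first place.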

To solve this problem, I wrote a gem called grit_ext, but it doesn't seem to work perfectly.

If the file is too small to detect, charlock_holmes will detect the wrong encoding.

So I want to know: how do you guys solve this problem?

Thanks. :)

brianmario May 17, 2012

Hi @SaitoWu - so if charlock_holmes isn't able to detect the encoding, or detects it wrong, there isn't much else we can do. This is indeed a tough problem to deal with. I personally wish git forced an encoding to be set for every object stored in its database, but it's too late for that, as adding it would break compatibility.

One option would be to first (before passing the data through charlock_holmes) check if it's valid UTF-8. It sounds like you're using Ruby 1.9 so you could try doing this:

original_encoding = content.encoding
content.force_encoding('UTF-8')
if content.valid_encoding?
  # looks like it's valid UTF-8, just return it as-is
else
  # well, it's not valid UTF-8, so put the original encoding tag back
  content.force_encoding(original_encoding)
  # and let's give charlock_holmes a try here
  #
  # if detection fails, our last resort could be to forcibly clean `content` to
  # ensure it's valid UTF-8, meaning invalid UTF-8 sequences get replaced
  # with a replacement character. You can look into the docs for String#encode
  # for how to do that
end
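
For the last-resort cleaning mentioned in the comments above, one possible sketch follows. It uses String#scrub, which assumes Ruby 2.1 or later and is a swapped-in alternative to the String#encode route the comments point to. The helper name clean_utf8 is made up for illustration.

```ruby
# Hypothetical helper: treat the bytes as UTF-8 and replace anything invalid.
def clean_utf8(content)
  utf8 = content.dup.force_encoding('UTF-8') # leave the caller's string untouched
  return utf8 if utf8.valid_encoding?

  utf8.scrub("\uFFFD") # U+FFFD REPLACEMENT CHARACTER
end

cleaned = clean_utf8("abc\xFFdef".b)
untouched = clean_utf8("hello".b)
```

Cleaning this way is lossy: the replaced bytes are gone, so it only makes sense once detection has been ruled out.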

If you're using Ruby 1.8, for the UTF-8 validity check and cleaning take a look at my utf8 gem.

Lastly, you could try skipping charlock_holmes detection if the data you're trying to detect is less than, say, 256 bytes. In that case you might as well assume it's UTF-8, because unfortunately there isn't much else you can do to reliably detect its actual encoding. You could also try force-cleaning the string to be valid UTF-8.
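
Putting these suggestions together, a rough sketch might look like the following. The method name to_utf8, the min_bytes parameter, and the detector callable are all assumptions. The detector stands in for a charlock_holmes call and is expected to return an encoding name or nil. String#scrub assumes Ruby 2.1 or later.

```ruby
# detector: any callable that takes the raw bytes and returns an encoding
# name such as "GBK", or nil. It is a stand-in for a real detection library.
def to_utf8(content, detector, min_bytes: 256)
  utf8 = content.dup.force_encoding('UTF-8')
  return utf8 if utf8.valid_encoding? # already valid UTF-8, nothing to do

  # Only consult the detector when there is enough data to go on.
  if content.bytesize >= min_bytes && (name = detector.call(content))
    begin
      return content.dup.force_encoding(name).encode('UTF-8')
    rescue ArgumentError, EncodingError
      # Unknown encoding name, or bytes that don't transcode: fall through.
    end
  end

  # Too small to detect, or detection failed: assume UTF-8 and force-clean.
  utf8.scrub("\uFFFD")
end

fake_detector = ->(_bytes) { 'GBK' }     # hypothetical detector result
small = "\xD6\xD0".b                     # 2 bytes: detection is skipped
large = ("\xD6\xD0\xCE\xC4" * 100).b     # 400 bytes: detector is consulted

small_out = to_utf8(small, fake_detector)
large_out = to_utf8(large, fake_detector)
ascii_out = to_utf8("plain".b, fake_detector)
```

Checking UTF-8 validity first keeps the detector out of the common case, and the size threshold keeps it away from inputs where it is most likely to guess wrong.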

Hopefully that helps clear it up a little ;)

holman closed this May 17, 2012

SaitoWu May 17, 2012

Hi @brianmario, thank you very much for your answer! 👍

@holman, your feedback repo is awesome. 🍻

holman May 17, 2012

Owner

Cheers! If I can't answer it, I'll find someone who can. :)
