Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Loading…

Add support for filenames with special characters in git diffs. #129

Closed
wants to merge 4 commits into from

5 participants

@jonmagic

When there are special (eg. UTF-8) characters in filenames,
git double-quotes them in diff output.

  • New regular expression to grab quoted filenames
  • Then gsub gets rid of the double escaped bytes
  • Force UTF-8 encoding on the string
Jonathan Hoy... and others added some commits
Jonathan Hoyt and Nick Hengeveld Add support for filenames with special characters in git diffs.
When there are special (eg. UTF-8) characters in filenames,
git double-quotes them in diff output.

* New regular expression to grab quoted filenames
* Then gsub gets rid of the double escaped bytes
* Force UTF-8 encoding on the string
88ccad1
@nickh nickh Use CharlockHolmes to detect encoding/transcode to UTF-8
For compatibility with both Ruby 1.8 and 1.9.
9d7c3e6
lib/grit/diff.rb
@@ -29,7 +29,10 @@ def self.list_from_string(repo, text)
diffs = []
while !lines.empty?
- m, a_path, b_path = *lines.shift.match(%r{^diff --git a/(.+?) b/(.+)$})
+ m, q1, a_path, q2, b_path = *lines.shift.match(%r{^diff --git ("?)a/(.+?)\1 ("?)b/(.+?)\3$})
+
+ a_path = a_path.gsub(/\\([0-9]{3})/){|c| c[1..-1].oct.chr}.force_encoding('UTF-8')
+ b_path = b_path.gsub(/\\([0-9]{3})/){|c| c[1..-1].oct.chr}.force_encoding('UTF-8')
@rtomayko Collaborator

Weird. How does this even work on Ruby 1.8.7? I thought force_encoding was 1.9 only.

@rtomayko Collaborator

Also, files in git aren't necessarily utf-8 encoded. They could be in some other character set. My understanding is that there's no way to get that info either.

/cc @brianmario Can you take a look at this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@rtomayko
Collaborator

I'd love to see this merged if it works and doesn't cause a bunch of other issues. Definitely need to deploy from a branch for a good while. 30 minutes at least to ensure the fs machines cycle over and have enough time to surface issues.

@nickh

@rtomayko you're right - there's another commit here that uses CharlockHolmes instead, and should work in both 1.8 and 1.9.

@jonmagic could we get that into this PR on your branch?

@jonmagic
lib/grit/diff.rb
@@ -29,7 +32,10 @@ def self.list_from_string(repo, text)
diffs = []
while !lines.empty?
- m, a_path, b_path = *lines.shift.match(%r{^diff --git a/(.+?) b/(.+)$})
+ m, q1, a_path, q2, b_path = *lines.shift.match(%r{^diff --git ("?)a/(.+?)\1 ("?)b/(.+?)\3$})
+
+ a_path = transcode( a_path.gsub(/\\([0-9]{3})/){|c| c[1..-1].oct.chr} )
+ b_path = transcode( b_path.gsub(/\\([0-9]{3})/){|c| c[1..-1].oct.chr} )

would love if there was a more efficient way to do this down in C via ruby-core...

@nickh
nickh added a note

Do you think it's worth using something like icu4r instead? Or were you thinking of changing CharlockHolmes to use the C API to ICU?

CharlockHolmes does use libicu4c, if that's what you're asking?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
lib/grit/diff.rb
@@ -74,6 +80,22 @@ def self.list_from_string(repo, text)
diffs
end
+
+ private
+
+ def self.transcode(string, target_encoding = 'UTF-8')
+ detected = string.detect_encoding
+
+ if detected.nil? || detected[:encoding].nil? || detected[:encoding] == target_encoding
+ string
+ else
+ begin
+ CharlockHolmes::Converter.convert( string, detected[:encoding], target_encoding )

we might consider setting the encoding flag (for 1.9 only) on the return value here based on what was detected

@nickh
nickh added a note

For 1.9, convert seems to return an object with the correct target encoding:

irb(main):058:0> i2
=> "El presidente del Congreso amonesta a Fabra, que pidi\xF3 excusas."
irb(main):059:0> i2.encoding
=> #<Encoding:ISO-8859-1>
irb(main):060:0> d = i2.detect_encoding
=> {:type=>:text, :encoding=>"ISO-8859-1", :confidence=>95, :language=>"es"}
irb(main):061:0> t = CharlockHolmes::Converter.convert( i2, d[:encoding], 'UTF-8' )
=> "El presidente del Congreso amonesta a Fabra, que pidió excusas."
irb(main):062:0> t.encoding
=> #<Encoding:UTF-8>

Is there a different encoding setting that we should be using in addition? Or is that to cover the case where convert is unable to transcode to the target encoding?

Ah yeah I forgot about that... ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
@nickh

@rtomayko do you have an example of git not using UTF-8 for quoted filenames in diffs? I just tested locally with a file named "市長村長選挙" and it worked correctly on 1.8.7.

Oddly, it looks like the current change will not work correctly on 1.9.3 unless we force_encoding though:

irb(main):116:0> a_path
=> "\\345\\270\\202\\351\\225\\267\\346\\235\\221\\351\\225\\267\\351\\201\\270\\346\\214\\231"
irb(main):117:0> ap = a_path.gsub(/\\([0-9]{3})/){|c| c[1..-1].oct.chr}
=> "\xE5\xB8\x82\xE9\x95\xB7\xE6\x9D\x91\xE9\x95\xB7\xE9\x81\xB8\xE6\x8C\x99"
irb(main):118:0> ap.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):119:0> ap.force_encoding('UTF-8')
=> "市長村長選挙"

The complete diff in question looks like this:

nictocat:jp github$ git diff --full-index -M 38e9151715724d6bb4c894e1d7d1a5c258201300 11e0844b5bd844014f258faefc74616f4dc40643
diff --git "a/\345\270\202\351\225\267\346\235\221\351\225\267\351\201\270\346\214\231" "b/\345\270\202\351\225\267\346\235\221\351\225\267\351\201\270\346\214\231"
index 9f4b6d8bfeaf44aaa69872286163784706d1b053..00fa644a1057f8b3ce4c566cc273d39f4d71ef8b 100644
--- "a/\345\270\202\351\225\267\346\235\221\351\225\267\351\201\270\346\214\231"
+++ "b/\345\270\202\351\225\267\346\235\221\351\225\267\351\201\270\346\214\231"
@@ -1 +1,3 @@
 This is a test file
+
+Now with changes

/cc @brianmario

@nickh

To follow up on my previous comment, I just tested with a file named "Olímpicos.txt". I created the git repo and created/modified the file in a terminal with ISO-8859-1 character encoding, then checked the output of git diff in that terminal and in another running with UTF-8 character encoding.

In both cases, the results look like this:

diff --git "a/Oli\314\201mpicos.txt" "b/Oli\314\201mpicos.txt"
index 5cccbef31a18f2fe2661ff164d3d41dfce9738b9..16fe1507f26ffc69d0e56d5035bd29b9
--- "a/Oli\314\201mpicos.txt"
+++ "b/Oli\314\201mpicos.txt"
@@ -1 +1 @@
-Test Olímpicos file
+Test Olímpicos text file

In the reverse case, where the repo was created/modified in UTF-8 and viewed in ISO8859-1, a diff looks thusly:

diff --git "a/\345\270\202\351\225\267\346\235\221\351\225\267\351\201\270\346\2
index e9fe68f684ad479851e3144374a6371ce743b1e3..a8b452a3935176ed6013d8166c9c9978
--- "a/\345\270\202\351\225\267\346\235\221\351\225\267\351\201\270\346\214\231"
+++ "b/\345\270\202\351\225\267\346\235\221\351\225\267\351\201\270\346\214\231"
@@ -1 +1 @@
-This is a test ?<82>?<95>??<9D><91>?<95>??<81>??<8C><99> file
+This is a test ?<82>?<95>??<9D><91>?<95>??<81>??<8C><99> text file

The encoding for the file name is handled differently than the encoding for the changes, and I can't figure out how to make the filename encoding be anything other than UTF-8. I tried setting i18n.logOutputEncoding and i18n.commitEncoding to ISO-8859-1 and neither config setting seemed to make any difference.

If we can expect quoted filenames to be UTF-8, it makes this change quite a bit more straightforward. After the gsub to convert escaped characters, for ruby 1.8.7 we don't need to do anything and for ruby 1.9.3 we need to .force_encoding('UTF-8')

/cc @jonmagic @rtomayko @brianmario

@brianmario
@nickh

@brianmario ftw. Discussion in TheLibrary determined that if a filename in a diff is quoted, there's no way to know what encoding to use; and while it's certainly possible for users with eg. old/misconfigured systems to create files with non-UTF8 filenames, the most reasonable thing to do is to treat them as UTF-8.

/cc @carlosmn

@carlosmn

You know that game where you're blinded, spun a few times and then have to put the donkey's tail in the right place? That's what determining the filename encoding in git is like.

The only people who have a an excuse for not using utf-8 are people on Windows using git < 1.7.10 and that should stop being an issue on its own. UTF-8 is also the only way you can get your repo to behave across OSs and localized versions of the OSs so showing the encoded string should also help people realize that they're restricting where their repo can work.

@jonmagic jonmagic closed this
@jonmagic jonmagic deleted the branch
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Jul 2, 2012
  1. @jonmagic

    Add support for filenames with special characters in git diffs.

    Jonathan Hoyt and Nick Hengeveld authored jonmagic committed
    When there are special (eg. UTF-8) characters in filenames,
    git double-quotes them in diff output.
    
    * New regular expression to grab quoted filenames
    * Then gsub gets rid of the double escaped bytes
    * Force UTF-8 encoding on the string
Commits on Jul 9, 2012
  1. @nickh

    Use CharlockHolmes to detect encoding/transcode to UTF-8

    nickh authored
    For compatibility with both Ruby 1.8 and 1.9.
Commits on Jul 20, 2012
  1. @nickh
  2. @nickh

    No longer need charlock_holmes

    nickh authored
This page is out of date. Refresh to see the latest.
View
18 lib/grit/diff.rb
@@ -29,7 +29,22 @@ def self.list_from_string(repo, text)
diffs = []
while !lines.empty?
- m, a_path, b_path = *lines.shift.match(%r{^diff --git a/(.+?) b/(.+)$})
+ m, q1, a_path, q2, b_path = *lines.shift.match(%r{^diff --git ("?)a/(.+?)\1 ("?)b/(.+?)\3$})
+
+ # If the filename(s) are quoted, they contain special characters that may be escaped.
+ #
+ # While it's possible for systems that are eg. old or badly configured to store
+ # filenames that are not UTF-8 encoded, the most reasonable thing to do is assume
+ # that they are UTF-8 encoded.
+ if !q1.empty? || !q2.empty?
+ a_path = a_path.gsub(/\\([0-9]{3})/){|c| c[1..-1].oct.chr}
+ b_path = b_path.gsub(/\\([0-9]{3})/){|c| c[1..-1].oct.chr}
+
+ # The above gsub returns a string with ASCII-8BIT encoding in Ruby 1.9,
+ # even though a_path and b_path are UTF-8 encoded.
+ a_path.force_encoding( 'UTF-8' ) if a_path.respond_to?( :force_encoding )
+ b_path.force_encoding( 'UTF-8' ) if b_path.respond_to?( :force_encoding )
+ end
if lines.first =~ /^old mode/
m, a_mode = *lines.shift.match(/^old mode (\d+)/)
@@ -74,6 +89,7 @@ def self.list_from_string(repo, text)
diffs
end
+
end # Diff
end # Grit
View
6 test/fixtures/diff_tree_i18n
@@ -0,0 +1,6 @@
+diff --git "a/Administrac\314\247a\314\203o-e-Autorizac\314\247a\314\203o.md" "b/Administrac\314\247a\314\203o-e-Autorizac\314\247a\314\203o.md"
+new file mode 100644
+index 0000000..e69de29
+diff --git a/unquoted.md b/unquoted.md
+new file mode 100644
+index 0000000..e69de29
View
15 test/test_diff.rb
@@ -1,3 +1,4 @@
+# encoding: UTF-8
require File.dirname(__FILE__) + '/helper'
class TestDiff < Test::Unit::TestCase
@@ -53,4 +54,18 @@ def test_list_from_string_with_renames
assert_equal 'foobar', diffs[4].b_path
assert diffs[4].new_file
end
+
+ def test_list_from_string_quoted_filenames
+ output = fixture('diff_tree_i18n')
+ diffs = Grit::Diff.list_from_string(@r, output)
+
+ assert_equal "Administração-e-Autorização.md", diffs[0].b_path
+ end
+
+ def test_list_from_string_unquoted_filenames
+ output = fixture('diff_tree_i18n')
+ diffs = Grit::Diff.list_from_string(@r, output)
+
+ assert_equal 'unquoted.md', diffs[1].b_path
+ end
end
Something went wrong with that request. Please try again.