This repository has been archived by the owner on Nov 9, 2017. It is now read-only.

Add a Clean/Smudge Filter for Windows UTF-16 files #113

Closed
kismert opened this issue Apr 12, 2013 · 6 comments

Comments

@kismert

kismert commented Apr 12, 2013

The native text encoding on Microsoft Windows is UTF-16 little-endian (UCS-2 in older versions). The Windows distribution of msysGit does not handle files in this encoding as text and instead treats them as binary.
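
A likely reason for the binary classification, offered as an assumption and not something confirmed in this thread: Git's built-in heuristic appears to treat content containing NUL bytes as binary, and UTF-16 stores every ASCII character with a zero byte next to it. The Python sketch below only approximates that idea; the function name `contains_nul` and the window size are hypothetical.

```python
import codecs

def contains_nul(data: bytes, window: int = 8000) -> bool:
    """Rough stand-in for a NUL-byte binary check (window size is an assumption)."""
    return b"\x00" in data[:window]

text = "Hello, world\r\n"
utf8_bytes = text.encode("utf-8")
# UTF-16 little-endian with a byte order mark, as Notepad's 'Unicode' option writes it.
utf16_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(contains_nul(utf8_bytes))   # False
print(contains_nul(utf16_bytes))  # True
```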

As a fix, msysGit Google group members suggested a clean/smudge filter, which does a very good job of handling UTF-16. This would take little effort to include in msysGit:

  1. Distribute iconv.exe and its support files in \Program Files\Git\bin for the Windows version.
  2. Add the following to ~\Git\etc\gitconfig:
[filter "winutf16"]
        clean = iconv -f utf-16 -t utf-8
        smudge = iconv -f utf-8 -t utf-16
        required
  3. Add documentation to tell users how to set up their global ~/Git/etc/gitattributes or local ~/.gitattributes files, for example:
    *.txt filter=winutf16
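
To illustrate what the two filter commands do to file contents, here is a minimal Python sketch of the same round trip. It is not part of the proposal: the function names `clean` and `smudge` simply mirror the config keys, and writing little-endian output with a byte order mark is an explicit choice made for this sketch (what `iconv -t utf-16` emits can depend on the iconv build).

```python
import codecs

def clean(working_tree_bytes: bytes) -> bytes:
    """UTF-16 (BOM honoured if present) -> UTF-8, roughly `iconv -f utf-16 -t utf-8`."""
    return working_tree_bytes.decode("utf-16").encode("utf-8")

def smudge(repo_bytes: bytes) -> bytes:
    """UTF-8 -> UTF-16 little-endian with a BOM (assumed output form for this sketch)."""
    return codecs.BOM_UTF16_LE + repo_bytes.decode("utf-8").encode("utf-16-le")

# A small UTF-16 LE sample with a non-ASCII character and Windows line endings.
original = codecs.BOM_UTF16_LE + 'Caption = "Übersicht"\r\n'.encode("utf-16-le")

stored = clean(original)            # what would be stored in the repository
assert b"\x00" not in stored        # UTF-8 form has no NUL bytes, so it diffs as text
assert smudge(stored) == original   # checkout restores the original bytes
```

The round trip is only lossless when the working-tree file already uses the byte order and BOM convention that the smudge step writes.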

I think this would be a valuable enhancement for msysGit that would allow Windows users to quickly configure their repositories to better work with the increasingly common UTF-16 format.

Note that Mercurial and Bazaar do not handle UTF-16 properly either, so this would give msysGit-based products a leg up on their peers.

Thanks,
kismert

@sschuberth
Contributor

Just out of curiosity, what Windows program stores text files in UTF-16 by default? The only occasion where I saw this recently for the first time is when you pipe e.g. the output of "wmic process get executablepath,processid" to a file. Also, is there a risk of this filter breaking the user experience for non-UTF-16 text files?

@dolmen

dolmen commented Apr 13, 2013

Regedit.

@sschuberth
Contributor

Well, yeah. If you choose to export to .txt files instead of the usual .reg files (note that the suggested filter only applies to .txt files). Quite an academic example.

@kismert
Author

kismert commented Apr 15, 2013

@sschuberth:
I'm trying to write a source export plug-in for Microsoft Access. Starting with 2007, Form and Report files are exported as UTF-16.
Visual Studio 2012 supports UTF-16 as a file format.
The default Unicode conversion option in SQL Server Management Studio 2008 is UTF-16.
If you save a text file as 'Unicode' in Notepad, you get UTF-16.
Windows uses UTF-16 encoding for its internal functions.
Obviously, Microsoft doesn't use UTF-16 as a file-encoding standard, and UTF-8 has increasing support. Still, there is a fair chance you will run into UTF-16 files in Windows development.

Regarding breaking the user experience, the filter would be off by default. The user would have to manually configure their gitattributes to handle UTF-16 files as required.
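
Since the filter is opt-in per pattern, a user may want to check which files actually are UTF-16 before listing them in gitattributes. The sketch below is one hedged way to do that by looking for a byte order mark; the helper name `has_utf16_bom` is hypothetical, and files saved without a BOM would not be detected by it.

```python
import codecs

def has_utf16_bom(data: bytes) -> bool:
    """Return True if the data starts with a UTF-16 byte order mark.

    The UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM,
    so it is excluded explicitly.
    """
    if data.startswith(codecs.BOM_UTF32_LE):
        return False
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

utf16_sample = codecs.BOM_UTF16_LE + "Form1".encode("utf-16-le")
utf8_sample = "Form1".encode("utf-8")

print(has_utf16_bom(utf16_sample))  # True
print(has_utf16_bom(utf8_sample))   # False
```

Only files for which such a check succeeds would be candidates for a `filter=winutf16` pattern; plain UTF-8 or ANSI text files would stay outside the filter.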

patthoyts added a commit that referenced this issue Jun 3, 2013
As discussed in issue #113 it is useful to have a method for converting
between unicode and utf-8 for writing smudge/clean filters. To assist in
this we shall include a functional iconv.exe in the release.
A side effect is we now also ship libintl-8.dll.

Signed-off-by: Pat Thoyts <patthoyts@users.sourceforge.net>

@kusma
Member

kusma commented Jun 26, 2013

@zachriggle No, they are not, at least not in Visual Studio 2008.

@dscho
Member

dscho commented Dec 30, 2013

@kismert since you did almost all the work (apart from @patthoyts' patch to include iconv.exe by default), how about opening a pull request?
