daff breaks horribly if file is not utf8 #71

Open
SonOfLilit opened this issue Sep 15, 2016 · 5 comments

@SonOfLilit

On Windows, tried both with cmd and a git bash shell:

csv_windows-1255.zip

$ daff.py version
1.3.18
$ daff.py 1.csv 2.csv
Traceback (most recent call last):
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 11304, in <module>
    Coopy.main()
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 3447, in main
    return coopy.coopyhx(io)
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 3333, in coopyhx
    return self.run(args,io)
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 3284, in run
    a = self.loadTable(aname)
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 2640, in loadTable
    txt = self.io.getContent(name)
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 9752, in getContent
    return sys_io_File.getContent(name)
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 11018, in getContent
    content = f.read(-1)
  File "C:\users\sonoflilit\.virtualenvs\analysts\lib\codecs.py", line 668, in read
    return self.reader.read(size)
  File "C:\users\sonoflilit\.virtualenvs\analysts\lib\codecs.py", line 474, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 4: invalid continuation byte
$ which daff
/c/Program Files/nodejs/daff
$ daff version
1.3.18
$ daff 1.csv 2.csv
@@,a,b

Of course, the reason I care is that Excel works notoriously badly with UTF-8 CSVs, so my git repository is full of CSVs in other encodings, and I can't convert them as part of git diff...

P.S. does anyone here know why git would accept my .gitattributes entry for *.tsv but would silently ignore the identical entry for *.csv?

@paulfitz
Owner

Thanks for reporting this @SonOfLilit. For daff.py, a hack to make this work is to edit it by hand, replacing codecs.open(path,"r","utf-8") with codecs.open(path,"r","iso-8859-1"). With that change, I see a diff of:

@@,a,b
→, à,á→â

You may need to change more if you want the diff itself to be produced in the same encoding rather than utf-8.
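A minimal sketch of that hand edit, generalized so the encoding is an argument. The helper name `get_content` and its `encoding` parameter are hypothetical, not daff's actual API, and the sample bytes are made up to resemble the traceback (byte 0xe0 at position 4), not taken from the attached zip.

```python
import codecs
import os
import tempfile


def get_content(path, encoding="utf-8"):
    """Read a whole file as text in the given encoding (hypothetical helper)."""
    with codecs.open(path, "r", encoding) as f:
        return f.read()


# Made-up windows-1255 sample: header "a,b", then two Hebrew letters.
sample = "a,b\n\u05d0,\u05d1\n"
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write(sample.encode("cp1255"))

try:
    get_content(path)  # strict utf-8 cannot decode byte 0xe0
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# iso-8859-1 maps every byte to a character, so it never fails,
# but the cells show up as accented Latin letters.
as_latin1 = get_content(path, "iso-8859-1")
# The encoding the file was written in recovers the intended text.
as_cp1255 = get_content(path, "cp1255")
os.remove(path)
```

The iso-8859-1 reading is lossless at the byte level, which is why the diff still comes out right even though the displayed characters are not the ones the author typed.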

How ideally should this work? A parameter specifying encoding? An attempt at autodetection?

@dogmatic69

A parameter would be best. You can't rely on what the file says, as you can have latin1 in a utf8 file 👎

I guess you could use auto-detection as a default, but there will need to be a way to specify the encoding when things are crazy.

@SonOfLilit
Author

Ideally there should be a command-line parameter, because some poor people need to use utf16, which can't be made sense of without very special treatment.

But more importantly, the default behavior should be to work on raw, undecoded bytes. As long as you never try to split cell contents (e.g. you must output "[abc->aBc]" and not "a[b->B]c", which might split a character in the middle in utf8), every other encoding I'm aware of would work just fine, including utf8, DOS codepages, ISO codepages and Windows codepages. (I must admit I have no idea how pre-Unicode Chinese/Japanese codepages work, but they would probably be fine too.)
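The raw-bytes idea could be sketched roughly as follows. This is an illustration only: `split_rows` and `diff_cells` are hypothetical names, the splitting ignores CSV quoting, and both inputs are assumed to have the same number of rows and columns.

```python
def split_rows(data):
    """Naively split raw CSV bytes into rows of cells (no quoting support)."""
    return [line.rstrip(b"\r").split(b",") for line in data.split(b"\n") if line]


def diff_cells(old, new):
    """Mark changed cells as b'old->new' without ever decoding the bytes."""
    out = []
    for row_a, row_b in zip(split_rows(old), split_rows(new)):
        # A changed cell is reported whole, so no character is cut in half.
        out.append([a if a == b else a + b"->" + b for a, b in zip(row_a, row_b)])
    return out


# The same logical tables in two different encodings.
old_text = "a,b\n\u05d0,\u05d1\n"
new_text = "a,b\n\u05d0,\u05d2\n"

result_cp1255 = diff_cells(old_text.encode("cp1255"), new_text.encode("cp1255"))
result_utf8 = diff_cells(old_text.encode("utf-8"), new_text.encode("utf-8"))
```

Because only whole cells are copied into the output, the result can be decoded afterwards in whatever encoding the inputs used.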

@paulfitz
Owner

Ok, sounds like a parameter is important since there'll always be those who need it.

I'm not sure I can completely avoid touching cell contents. There are options for whitespace-insensitive and case-insensitive diffs, for example. These obviously get wacky in the general case, but people want them for the common special case of plain old ASCII. Do you think auto-detection via delegation to e.g. chardet [1] in Python would be adequate, @SonOfLilit?

[1] https://github.com/chardet/chardet
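One way the two proposals might combine is sketched below, assuming a hypothetical function `decode_table`: an explicit encoding always wins, strict UTF-8 is tried next, then chardet's guess if the package happens to be installed, and iso-8859-1 as a last resort. None of this is daff's actual behavior.

```python
def decode_table(data, encoding=None):
    """Decode raw bytes and return (text, encoding_used)."""
    if encoding is not None:
        # An explicit parameter always overrides any guessing.
        return data.decode(encoding), encoding
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    guess = None
    try:
        import chardet  # third-party and optional in this sketch
        guess = chardet.detect(data).get("encoding")
    except ImportError:
        pass
    if guess:
        try:
            return data.decode(guess), guess
        except (UnicodeDecodeError, LookupError):
            pass  # the guess was unusable; fall through
    # iso-8859-1 maps every byte to a character, so this cannot fail.
    return data.decode("iso-8859-1"), "iso-8859-1"


plain = decode_table(b"a,b\n")
forced = decode_table(b"\xe0", "cp1255")
guessed_text, guessed_encoding = decode_table(b"a,b\n\xe0,\xe1\n")
```

Detection on a few bytes of a short table is a guess at best, which is the argument in this thread for keeping the parameter.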

@SonOfLilit
Author

As long as you're only touching characters that are ASCII (commas, double quotes, tabs, spaces), you should be fine with all the encodings I listed as not needing a parameter. The reason they don't need one is that they only differ in the non-ASCII code points.
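A small check of that claim, using a made-up one-row sample and a handful of codecs chosen for illustration: in the ASCII-compatible encodings the comma is always the single byte 0x2C, while in UTF-16 a naive byte split goes wrong.

```python
text = "x,\u05d0\n"

# Encodings that agree with ASCII on bytes 0x00-0x7F (illustrative selection).
ASCII_COMPATIBLE = ("utf-8", "cp1255", "iso-8859-8", "cp862")

# Splitting the raw bytes on b"," yields the same first cell in each of them.
first_cells = {enc: text.encode(enc).split(b",")[0] for enc in ASCII_COMPATIBLE}

# In UTF-16 every ASCII character carries an extra zero byte, so the
# same split leaves a stray b"\x00" attached to the cell.
utf16_first_cell = text.encode("utf-16-le").split(b",")[0]

# Worse, unrelated characters can contain the comma byte: U+012C is
# encoded as 0x2C 0x01, so a byte-level split would cut it in half.
comma_byte_inside_char = b"," in "\u012c".encode("utf-16-le")
```

This is why UTF-16 is the case that needs an explicit parameter rather than byte-level handling.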
