daff breaks horribly if file is not utf8 #71

SonOfLilit · 2016-09-15T22:22:48Z

On Windows, tried both with cmd and a git bash shell:

$ daff.py version
1.3.18
$ daff.py 1.csv 2.csv
Traceback (most recent call last):
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 11304, in <module>
    Coopy.main()
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 3447, in main
    return coopy.coopyhx(io)
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 3333, in coopyhx
    return self.run(args,io)
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 3284, in run
    a = self.loadTable(aname)
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 2640, in loadTable
    txt = self.io.getContent(name)
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 9752, in getContent
    return sys_io_File.getContent(name)
  File "C:/Users/sonoflilit/.virtualenvs/analysts/Scripts/daff.py", line 11018, in getContent
    content = f.read(-1)
  File "C:\users\sonoflilit\.virtualenvs\analysts\lib\codecs.py", line 668, in read
    return self.reader.read(size)
  File "C:\users\sonoflilit\.virtualenvs\analysts\lib\codecs.py", line 474, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe0 in position 4: invalid continuation byte

$ which daff
/c/Program Files/nodejs/daff
$ daff version
1.3.18
$ daff 1.csv 2.csv
@@,a,b

Of course, the reason I care is that Excel works notoriously badly with UTF-8 CSVs, so my git repository is full of CSVs in other encodings, and I can't convert them as part of git diff...

P.S. Does anyone here know why git would accept my .gitattributes entry for *.tsv but silently ignore the identical entry for *.csv?

paulfitz · 2016-09-16T21:19:32Z

Thanks for reporting this @SonOfLilit. For daff.py, a hack to make this work is to edit it by hand, replacing codecs.open(path,"r","utf-8") with codecs.open(path,"r","iso-8859-1"). With that change, I see a diff of:

@@,a,b
→, à,á→â

You may need to change more if you want the diff itself to be produced in the same encoding rather than utf-8.
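
To see why the swap helps, here is a minimal sketch (not daff's actual code) that reproduces the failure on a hypothetical two-line CSV and then reads the same bytes as ISO-8859-1. It uses Python 3's built-in open() rather than codecs.open(); the sample data and the read_text helper are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample data: "à" encoded as ISO-8859-1 is the single byte 0xE0,
# which is not valid UTF-8 when the next byte is an ASCII character.
raw = "a,b\n\u00e0,1\n".encode("iso-8859-1")

fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write(raw)


def read_text(path, encoding):
    """Read a whole file as text using the given encoding (illustrative helper)."""
    # newline="" keeps line endings exactly as they are in the file.
    with open(path, "r", encoding=encoding, newline="") as f:
        return f.read()


try:
    read_text(path, "utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    # Same class of error as in the traceback above.
    utf8_failed = True

# ISO-8859-1 maps every byte to a code point, so this read cannot fail,
# although it may produce the wrong characters if the file is in another encoding.
latin1_text = read_text(path, "iso-8859-1")
os.remove(path)
```

The ISO-8859-1 read always succeeds, which is why it works as a quick hack, but it only gives the right characters when the file really is in that encoding.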

How ideally should this work? A parameter specifying encoding? An attempt at autodetection?

dogmatic69 · 2016-09-16T21:40:53Z

A param would be best; you can't rely on what the file says, as you can have latin1 in a utf8 file 👎

I guess you could use auto-detection as a default, but you will need some way to specify the encoding when things are crazy.
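
One way to picture "parameter first, guess otherwise" is the sketch below. It is only an illustration: decode_table, its encoding argument, and the fallback list are hypothetical names, not daff options, and trying a fixed list of encodings is a much cruder stand-in for real auto-detection.

```python
def decode_table(data, encoding=None, fallbacks=("utf-8", "cp1252")):
    """Decode raw file bytes, honouring an explicit encoding if one is given.

    Returns (text, encoding_used). All names here are hypothetical.
    """
    if encoding is not None:
        # An explicit choice always wins; let any decode error propagate.
        return data.decode(encoding), encoding
    for candidate in fallbacks:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort never fails.
    return data.decode("latin-1"), "latin-1"


guessed_text, guessed_encoding = decode_table(b"a,\xe0\n")
forced_text, forced_encoding = decode_table(b"a,\xe0\n", encoding="iso-8859-1")
```

The explicit argument is the escape hatch for files where guessing goes wrong, such as a file that mixes encodings.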

SonOfLilit · 2016-09-17T18:25:00Z

Ideally there should be a cmd parameter because some poor people need to use utf16, which can't be made sense of without very special treatment.

But more importantly, default behavior should be to work on raw, undecoded bytes. As long as you never try to split cell contents (e.g. you must output "[abc->aBc]" and not "a[b->B]c", which might split a character in the middle in utf8), every other encoding I'm aware of would work just fine, including utf8, DOS codepages, ISO codepages and Windows codepages (I must admit I have no idea how pre-Unicode Chinese/Japanese codepages work, but they would probably be fine too).
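
A rough sketch of that idea, assuming an ASCII-compatible encoding in which the comma byte never appears inside a multi-byte character: split rows on the comma byte and compare whole cells as bytes, never looking inside a cell. split_cells and diff_row are hypothetical helpers, the output format is made up, and quoting is ignored for brevity.

```python
def split_cells(line):
    """Split one CSV line into raw cells on the ASCII comma byte (no quoting support)."""
    return line.rstrip(b"\r\n").split(b",")


def diff_row(old, new):
    """Compare two rows cell by cell, emitting whole-cell changes as old->new."""
    out = []
    for a, b in zip(split_cells(old), split_cells(new)):
        # Cells are compared and emitted as opaque byte strings,
        # so a multi-byte character is never cut in half.
        out.append(a if a == b else a + b"->" + b)
    return b",".join(out)


# The same logic works for a Windows codepage and for UTF-8 without decoding.
cp1252_diff = diff_row("1,\u00e0\n".encode("cp1252"), "1,\u00e1\n".encode("cp1252"))
utf8_diff = diff_row("1,\u00e0\n".encode("utf-8"), "1,\u00e1\n".encode("utf-8"))
```

The diff comes out in whatever encoding the inputs used, so the caller still has to know that encoding to display it correctly.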

paulfitz · 2016-09-19T21:17:37Z

Ok, sounds like a parameter is important since there'll always be those who need it.

I'm not sure I can completely avoid touching cell contents. There are options for whitespace-insensitive and case-insensitive diffs, for example. These obviously get wacky in the general case, but people want them for the common special case of plain old ascii. Would auto-detection via delegation to e.g. chardet [1] in python be adequate, do you think, @SonOfLilit?

[1] https://github.com/chardet/chardet

SonOfLilit · 2016-09-20T00:19:18Z

As long as you're only touching characters that are ASCII (commas, double quotes, tabs, spaces) you should be fine with all the encodings I listed as not needing a parameter - the reason they don't is that they only differ in the non-ASCII code points.
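
As an illustration of that point, Python's bytes methods already behave this way: bytes.lower() only changes ASCII letters, and stripping spaces and tabs only removes those ASCII bytes. The normalize_cell helper below is a hypothetical sketch of a whitespace- and case-insensitive comparison key, not something daff provides.

```python
def normalize_cell(cell):
    """Build an ASCII-only case/whitespace-insensitive key for a raw cell."""
    # strip() removes only leading/trailing space and tab bytes;
    # lower() converts only ASCII A-Z and leaves bytes >= 0x80 untouched.
    return cell.strip(b" \t").lower()


cp1252_key = normalize_cell(" ABC\u00c0 ".encode("cp1252"))
utf8_key = normalize_cell(" ABC\u00c0 ".encode("utf-8"))
```

Note that the non-ASCII "À" is left as-is in both encodings, so the comparison is case-insensitive for plain ASCII only.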
