
Unsupported encodings? #59

Open
kengruven opened this issue Jul 1, 2015 · 5 comments

Comments

@kengruven

I wanted to parse a UTF-16 CSV file, so I did something like this:

r = unicodecsv.reader(f, encoding='UTF-16')

Unfortunately, this just raises an exception when I try to read from it. I looked at the unicodecsv source code, and I don't think the unicodecsv approach can ever work for this case. It tries loading the input stream as 8-bit characters, and then decodes each cell value. Python's 'csv' module can't handle NUL bytes, which are common in UTF-16, so this fails.

I think the answer is that the 'unicodecsv' library only works for encodings like UTF-8 or Latin-1, which are supersets of ASCII and never produce 0x00 bytes. Is this true? If so, we should say it in the documentation.

(Also, I think this means I should really upgrade to Python 3!)
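For what it's worth, the Python 3 route is indeed straightforward: the stdlib csv module there operates on unicode text, so UTF-16 only needs the right encoding at the I/O layer. A sketch (not unicodecsv; `io.BytesIO`/`TextIOWrapper` stand in for a real file opened with `open(path, encoding="utf-16", newline="")`):

```python
import csv
import io

# In Python 3 decoding happens in the I/O layer, so the csv module
# never sees the NUL-laden UTF-16 bytes -- only decoded text.
utf16_bytes = "a,b\n1,2\n".encode("utf-16")
text = io.TextIOWrapper(io.BytesIO(utf16_bytes), encoding="utf-16", newline="")
rows = list(csv.reader(text))
print(rows)  # [['a', 'b'], ['1', '2']]
```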

@jdunck
Owner

jdunck commented Jul 1, 2015

You're right that the underlying reader is byte-centric and the wrapper approach falls down on null bytes. None of CSV's control characters are outside the ASCII range -- are you saying that UTF-16 encodings of those control characters include null bytes? (I've not encountered UTF-16 CSVs in my work.)

@kengruven
Author

Yes. The UTF-16 encoding of an ASCII file is simply that ASCII file with a null byte inserted after each byte (UTF-16LE) or before each byte (UTF-16BE).
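A quick illustration of the claim (any Python will show the same bytes):

```python
# Encode an ASCII CSV fragment as UTF-16: every original byte,
# including the commas that csv cares about, gains a NUL neighbor.
cell = "a,b,c"
assert cell.encode("utf-16-le") == b"a\x00,\x00b\x00,\x00c\x00"
assert cell.encode("utf-16-be") == b"\x00a\x00,\x00b\x00,\x00c"
```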

@jdunck
Owner

jdunck commented Jul 1, 2015

... I'm a bit surprised I haven't heard complaints about this before. :)

You're right that the approach would need to change to fix this -- namely wrapping the given file in a decoder before handing it to the underlying csv.reader.
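A hedged sketch of that fix: wrap the byte stream in a decoder so the csv layer only ever sees unicode. Demonstrated here against Python 3's csv, which accepts text directly; the actual patch would have to do something equivalent inside unicodecsv.

```python
import codecs
import csv
import io

# codecs.getreader wraps a byte stream in a decoder, yielding unicode
# lines; the NUL bytes disappear before any csv machinery runs.
raw = io.BytesIO("name,city\nken,portland\n".encode("utf-16"))
decoded = codecs.getreader("utf-16")(raw)  # file-like, yields str lines
rows = list(csv.reader(decoded))
print(rows)  # [['name', 'city'], ['ken', 'portland']]
```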

@ryanhiebert
Collaborator

Does this only affect the reader, or does it also affect the writer?

I think the "right" solution is to create a backport of the Python 3 csv module, which only works on unicode, and wrap the file being read in a decoder. However, that is, at best, a ways off.

One possible, but terrible for performance, approach would be to wrap the file in a decoder, and then in an encoder for some acceptable encoding (utf-8), and feed the result into the underlying csv.reader.

@ryanhiebert
Collaborator

> One possible, but performance terrible, approach that we could take would be to wrap the file in a decoder, and also an encoder in some acceptable encoding (utf-8), which then would get fed into the underlying csv.reader.

It turns out the Python 2 csv module actually includes exactly that approach as an example at the end of its documentation: https://docs.python.org/2/library/csv.html#examples
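The core of that recoder idea can be sketched in a few lines: transcode each line from the source encoding to utf-8 before the byte-oriented csv reader sees it. `recode` below is a hypothetical helper for illustration, not part of the unicodecsv API:

```python
import codecs
import io

def recode(stream, source_encoding, target_encoding="utf-8"):
    # Decode bytes from source_encoding, re-encode as utf-8, so a
    # byte-oriented csv reader only ever sees NUL-free utf-8 lines.
    reader = codecs.getreader(source_encoding)(stream)
    for line in reader:
        yield line.encode(target_encoding)

raw = io.BytesIO("a,b\n1,2\n".encode("utf-16"))
lines = list(recode(raw, "utf-16"))
print(lines)  # [b'a,b\n', b'1,2\n']
assert all(b"\x00" not in line for line in lines)
```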

I've now written a pure-Python backport of the Python 3 csv module, so we could take that approach instead: use the same code as the Python 3 version of unicodecsv whenever we detect an encoding other than ascii or utf-8.

But because that implementation is pure Python, it would likely be slower than the encoding wrapper that keeps csv dealing only with utf-8 bytes.

@jdunck : I'm interested in writing up a solution to this problem, but I'm not sure which approach would be better. Is it better to use the decoder, or to use the backport?
