
Unsupported encodings? #59

Open
kengruven opened this issue Jul 1, 2015 · 5 comments

Comments

@kengruven

I wanted to parse a UTF-16 CSV file, so I did something like this:

r = unicodecsv.reader(f, encoding='UTF-16')

Unfortunately, this just raises an exception when I try to read from it. I looked at the unicodecsv source code, and I don't think the unicodecsv approach can ever work for this case. It tries loading the input stream as 8-bit characters, and then decodes each cell value. Python's 'csv' module can't handle NUL bytes, which are common in UTF-16, so this fails.

I think the answer is that the 'unicodecsv' library only works for encodings like UTF-8 or Latin-1, which are supersets of ASCII and never produce 0x00 bytes. Is this true? If so, we should say it in the documentation.

(Also, I think this means I should really upgrade to Python 3!)
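For what it's worth, the Python 3 route is indeed straightforward: the stdlib csv module there operates on unicode text, so UTF-16 only needs the right encoding at the I/O layer. A sketch (not unicodecsv; `io.BytesIO`/`TextIOWrapper` stand in for a real file opened with `open(path, encoding="utf-16", newline="")`):

```python
import csv
import io

# In Python 3 decoding happens in the I/O layer, so the csv module
# never sees the NUL-laden UTF-16 bytes -- only decoded text.
utf16_bytes = "a,b\n1,2\n".encode("utf-16")
text = io.TextIOWrapper(io.BytesIO(utf16_bytes), encoding="utf-16", newline="")
rows = list(csv.reader(text))
print(rows)  # [['a', 'b'], ['1', '2']]
```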

@jdunck
Owner

jdunck commented Jul 1, 2015

You're right that the underlying reader is byte-centric and the wrapper approach falls down on null bytes. None of CSV's control characters are outside the ASCII range -- are you saying that UTF-16 encodings of those control characters include null bytes? (I've not encountered UTF-16 CSVs in my work.)

@kengruven
Author

Yes. The UTF-16 encoding of an ASCII file is simply that ASCII file with a null byte inserted after each byte (UTF-16LE) or before each byte (UTF-16BE).
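A quick illustration of the claim (any Python will show the same bytes):

```python
# Encode an ASCII CSV fragment as UTF-16: every original byte,
# including the commas that csv cares about, gains a NUL neighbor.
cell = "a,b,c"
assert cell.encode("utf-16-le") == b"a\x00,\x00b\x00,\x00c\x00"
assert cell.encode("utf-16-be") == b"\x00a\x00,\x00b\x00,\x00c"
```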

@jdunck
Owner

jdunck commented Jul 1, 2015

... I'm a bit surprised I haven't heard complaints about this before. :)

You're right that the approach would need to change to fix this -- namely wrapping the given file in a decoder before handing it to the underlying csv.reader.
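A hedged sketch of that fix: wrap the byte stream in a decoder so the csv layer only ever sees unicode. Demonstrated here against Python 3's csv, which accepts text directly; the actual patch would have to do something equivalent inside unicodecsv.

```python
import codecs
import csv
import io

# codecs.getreader wraps a byte stream in a decoder, yielding unicode
# lines; the NUL bytes disappear before any csv machinery runs.
raw = io.BytesIO("name,city\nken,portland\n".encode("utf-16"))
decoded = codecs.getreader("utf-16")(raw)  # file-like, yields str lines
rows = list(csv.reader(decoded))
print(rows)  # [['name', 'city'], ['ken', 'portland']]
```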

@ryanhiebert
Collaborator

Does this only affect the reader, or does it also affect the writer?

I think the "right" solution is to create a backport of the Python 3 csv module, which only works on unicode, and wrap the file being read in a decoder. However, that is, at best, a ways off.

One possible, but terrible for performance, approach would be to wrap the file in a decoder, and then in an encoder for some acceptable encoding (utf-8), and feed the result into the underlying csv.reader.

@ryanhiebert
Collaborator

> One possible, but performance terrible, approach that we could take would be to wrap the file in a decoder, and also an encoder in some acceptable encoding (utf-8), which then would get fed into the underlying csv.reader.

It turns out the Python 2 csv module actually includes exactly that approach as an example at the end of its documentation: https://docs.python.org/2/library/csv.html#examples
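The core of that recoder idea can be sketched in a few lines: transcode each line from the source encoding to utf-8 before the byte-oriented csv reader sees it. `recode` below is a hypothetical helper for illustration, not part of the unicodecsv API:

```python
import codecs
import io

def recode(stream, source_encoding, target_encoding="utf-8"):
    # Decode bytes from source_encoding, re-encode as utf-8, so a
    # byte-oriented csv reader only ever sees NUL-free utf-8 lines.
    reader = codecs.getreader(source_encoding)(stream)
    for line in reader:
        yield line.encode(target_encoding)

raw = io.BytesIO("a,b\n1,2\n".encode("utf-16"))
lines = list(recode(raw, "utf-16"))
print(lines)  # [b'a,b\n', b'1,2\n']
assert all(b"\x00" not in line for line in lines)
```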

I've now written a pure-Python backport of the Python 3 csv module, so we could take that approach instead: use the same code as the Python 3 version of unicodecsv whenever we detect an encoding other than ascii or utf-8.

But because that implementation is pure Python, it would likely be slower than the encoding wrapper that keeps csv dealing only with utf-8 bytes.

@jdunck : I'm interested in writing up a solution to this problem, but I'm not sure which approach would be better. Is it better to use the decoder, or to use the backport?
