Binary vs. text gzip files (Python 3) #183

larsmans · 2014-03-27T13:50:10Z

gzip files don't work yet in Python 3. The test suite throws a lot of the following errors:

======================================================================
ERROR: test_serialize_compressed (gensim.test.test_corpora.TestMalletCorpus)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/larsb/src/gensim/gensim/test/test_corpora.py", line 77, in test_serialize_compressed
    self.corpus_class.serialize(fname, corpus)
  File "/home/larsb/src/gensim/gensim/corpora/indexedcorpus.py", line 91, in serialize
    offsets = serializer.save_corpus(fname, corpus, id2word, metadata=metadata)
  File "/home/larsb/src/gensim/gensim/corpora/malletcorpus.py", line 106, in save_corpus
    fout.write('%s %s %s\n' % (doc_id, doc_lang, ' '.join(words)))
  File "/tmp/py/lib/python3.4/gzip.py", line 343, in write
    self.crc = zlib.crc32(data, self.crc) & 0xffffffff
TypeError: 'str' does not support the buffer interface

It turns out that gzip.open differs from the built-in open in that it defaults to binary mode ("rb"). In Python 3 you can't write a str to a binary file without explicitly encoding it first. There is text mode support, gzip.open(path, "wt"), but only in Py3 (3.3 and later), not in 2.7.
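A minimal sketch of the difference, for anyone who wants to reproduce it outside the test suite. The file name and the sample line are made up for illustration; only gzip.open and its mode strings come from the standard library. The text-mode read at the end assumes Python 3.3 or later.

```python
import gzip
import os
import tempfile

# Illustrative data only; mimics the line format written in save_corpus.
line = u'%s %s %s\n' % (1, 'en', ' '.join(['foo', 'bar']))
path = os.path.join(tempfile.mkdtemp(), 'corpus.mallet.gz')

# Binary mode is the default for gzip.open. On Python 3, writing a str
# to a binary gzip file raises TypeError.
with gzip.open(path, 'wb') as fout:
    try:
        fout.write(line)
        str_write_failed = False
    except TypeError:
        str_write_failed = True

# Encoding explicitly works in binary mode.
with gzip.open(path, 'wb') as fout:
    fout.write(line.encode('utf-8'))

with gzip.open(path, 'rb') as fin:
    raw = fin.read()  # bytes

# Text mode (Python 3.3+ only): gzip decodes for us.
with gzip.open(path, 'rt', encoding='utf-8') as fin:
    text = fin.read()  # str
```

In binary mode the caller chooses the encoding; in text mode gzip.open takes an encoding argument and, like open, falls back to a platform-dependent default if none is given.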

If I understand the situation correctly, then to be consistent with open we should encode in the system encoding, locale.getpreferredencoding(False). But it might be a wiser idea to switch to UTF-8 for everything, because then behavior is consistent between Py2 and Py3 (which on my box have different preferred encodings).

piskvorky · 2014-03-27T15:13:20Z

I remember pains around using getpreferredencoding(), so a preliminary -1 on that.

Keeping in-memory stuff as text (unicode) and explicitly serializing to binary (utf8) on I/O seems like the way to go to me.
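A rough sketch of that approach, assuming UTF-8 is the agreed on-disk encoding. The helpers write_lines and read_lines are hypothetical names for illustration, not existing gensim functions. The file is always opened in binary mode, so the same code should behave identically on 2.7 and 3.x.

```python
import gzip
import os
import tempfile


def write_lines(path, lines, encoding='utf-8'):
    """Hypothetical helper: take unicode lines, encode only when writing."""
    with gzip.open(path, 'wb') as fout:
        for line in lines:
            fout.write((line + u'\n').encode(encoding))


def read_lines(path, encoding='utf-8'):
    """Hypothetical helper: decode bytes to unicode right after reading."""
    with gzip.open(path, 'rb') as fin:
        return [raw.decode(encoding).rstrip(u'\n') for raw in fin]


# Illustrative documents, including non-ASCII text.
docs = [u'0 en hello world', u'1 fr caf\u00e9 cr\u00e8me']
path = os.path.join(tempfile.mkdtemp(), 'docs.txt.gz')
write_lines(path, docs)
restored = read_lines(path)
```

Since the encoding is fixed in the code rather than taken from the locale, the bytes on disk do not depend on the machine or the Python version that wrote them.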

This weekend I have to work, but the next weekend I want to install python3 and play around with the changes you've made, and help with the py3k port. Basically start converting gensim interfaces to unicode (=mostly update the docs really, there isn't much code).

Probably in a separate branch, because I want to release "0.9.1 bug-fix" soon, without major changes.

Thanks again Lars, I appreciate your reports and fixes.

piskvorky · 2014-04-06T11:36:43Z

More work this weekend; unicode conversion postponed till next time.

piskvorky · 2014-04-21T10:07:44Z

Fixed in #196.

piskvorky closed this as completed Apr 21, 2014
