Binary vs. text gzip files (Python 3) #183

Closed
larsmans opened this issue Mar 27, 2014 · 3 comments

Comments

@larsmans
Contributor

gzip files don't work yet in Python 3. The test suite throws a lot of the following errors:

======================================================================
ERROR: test_serialize_compressed (gensim.test.test_corpora.TestMalletCorpus)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/larsb/src/gensim/gensim/test/test_corpora.py", line 77, in test_serialize_compressed
    self.corpus_class.serialize(fname, corpus)
  File "/home/larsb/src/gensim/gensim/corpora/indexedcorpus.py", line 91, in serialize
    offsets = serializer.save_corpus(fname, corpus, id2word, metadata=metadata)
  File "/home/larsb/src/gensim/gensim/corpora/malletcorpus.py", line 106, in save_corpus
    fout.write('%s %s %s\n' % (doc_id, doc_lang, ' '.join(words)))
  File "/tmp/py/lib/python3.4/gzip.py", line 343, in write
    self.crc = zlib.crc32(data, self.crc) & 0xffffffff
TypeError: 'str' does not support the buffer interface

It turns out that gzip.open differs from the built-in open in that it defaults to binary mode. In Python 3 you can't write a string to a binary file without explicitly encoding it first. There is text-mode support, gzip.open(path, "wt"), but only in Python 3 (3.3 and later), not in 2.7.
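A minimal sketch of the behavior described above, assuming Python 3.3 or later. The file name and the sample line are made up for illustration and are not taken from gensim; the sample line only mimics the `doc_id doc_lang words` layout seen in the traceback.

```python
import gzip
import os
import tempfile

# Hypothetical scratch file and sample line, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "example.txt.gz")
line = "0 en caf\u00e9 au lait\n"

# gzip.open defaults to binary mode, so writing a str raises TypeError on Python 3.
try:
    with gzip.open(path, "wb") as fout:
        fout.write(line)
    binary_write_failed = False
except TypeError:
    binary_write_failed = True

# Encoding explicitly to bytes works in binary mode.
with gzip.open(path, "wb") as fout:
    fout.write(line.encode("utf-8"))

# Text mode also works on Python 3.3+, with the encoding stated explicitly.
with gzip.open(path, "wt", encoding="utf-8") as fout:
    fout.write(line)

# Read the file back in text mode to check the round trip.
with gzip.open(path, "rt", encoding="utf-8") as fin:
    round_tripped = fin.read()
```

Only the explicit `.encode("utf-8")` variant carries over to Python 2.7, since text mode is not available there.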

If I understand the situation correctly, then to be consistent with open we should encode in the system encoding, locale.getpreferredencoding(False). But it might be a wiser idea to switch to UTF-8 for everything, because then behavior is consistent between Py2 and Py3 (which on my box have different preferred encodings).

@piskvorky
Owner

I remember pains around using getpreferredencoding(), so a preliminary -1 on that.

Keeping in-memory stuff as text (unicode) and explicitly serializing to binary (utf8) on I/O seems like the way to go to me.
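One way this "unicode in memory, UTF-8 bytes on I/O" approach could look is sketched below. The helper names `save_lines` and `load_lines` are hypothetical and are not part of gensim; the sketch only assumes the standard `gzip` module, opened in binary mode so that the same code path would apply on Python 2.7 and Python 3.

```python
import gzip
import os
import tempfile


def save_lines(path, lines):
    """Hypothetical helper: write unicode lines to a gzip file as UTF-8 bytes."""
    with gzip.open(path, "wb") as fout:
        for line in lines:
            # Encode at the I/O boundary; everything in memory stays unicode.
            fout.write((line + u"\n").encode("utf-8"))


def load_lines(path):
    """Hypothetical helper: read UTF-8 bytes from a gzip file back into unicode."""
    with gzip.open(path, "rb") as fin:
        # Decode at the I/O boundary and strip the trailing newline.
        return [raw.decode("utf-8").rstrip(u"\n") for raw in fin]


# Made-up sample documents containing non-ASCII characters.
docs = [u"0 en caf\u00e9", u"1 cs \u017elu\u0165ou\u010dk\u00fd k\u016f\u0148"]
path = os.path.join(tempfile.mkdtemp(), "corpus.txt.gz")
save_lines(path, docs)
restored = load_lines(path)
```

Because the encoding is fixed to UTF-8 rather than taken from the locale, the bytes on disk should not depend on the machine or the Python version that wrote them.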

This weekend I have to work, but the next weekend I want to install python3 and play around with the changes you've made, and help with the py3k port. Basically start converting gensim interfaces to unicode (=mostly update the docs really, there isn't much code).

Probably in a separate branch, because I want to release "0.9.1 bug-fix" soon, without major changes.

Thanks again Lars, I appreciate your reports and fixes.

@piskvorky
Owner

More work this weekend; unicode conversion postponed till next time.

@piskvorky
Owner

Fixed in #196.
