gzip files don't work yet in Python 3. The test suite throws many errors like the following:
======================================================================
ERROR: test_serialize_compressed (gensim.test.test_corpora.TestMalletCorpus)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/larsb/src/gensim/gensim/test/test_corpora.py", line 77, in test_serialize_compressed
    self.corpus_class.serialize(fname, corpus)
  File "/home/larsb/src/gensim/gensim/corpora/indexedcorpus.py", line 91, in serialize
    offsets = serializer.save_corpus(fname, corpus, id2word, metadata=metadata)
  File "/home/larsb/src/gensim/gensim/corpora/malletcorpus.py", line 106, in save_corpus
    fout.write('%s %s %s\n' % (doc_id, doc_lang, ' '.join(words)))
  File "/tmp/py/lib/python3.4/gzip.py", line 343, in write
    self.crc = zlib.crc32(data, self.crc) & 0xffffffff
TypeError: 'str' does not support the buffer interface
It turns out that gzip.open is different from ordinary open in that it defaults to binary mode. You can't write a string to a binary file in Python 3 without first explicitly encoding it. There is text file support, gzip.open(path, "wt"), but only in Py3, not in 2.7.
If I understand the situation correctly, then to be consistent with open we should encode in the system encoding, locale.getpreferredencoding(False). But it might be a wiser idea to switch to UTF-8 for everything, because then behavior is consistent between Py2 and Py3 (which on my box have different preferred encodings).
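A minimal sketch of the UTF-8 option, assuming the writer keeps text as unicode and encodes each line just before it reaches the gzip stream. The helper names `write_lines_gzip` and `read_lines_gzip` are hypothetical, not existing gensim functions. Binary mode is used on both sides, so the same code should behave identically on 2.7 and 3.x.

```python
import gzip


def write_lines_gzip(path, lines, encoding="utf-8"):
    """Write an iterable of unicode strings to a gzip file, one per line."""
    # "wb" is spelled out: gzip.open is binary by default, so it must get bytes.
    with gzip.open(path, "wb") as fout:
        for line in lines:
            fout.write((line + u"\n").encode(encoding))


def read_lines_gzip(path, encoding="utf-8"):
    """Read the file back, decoding each line to unicode."""
    with gzip.open(path, "rb") as fin:
        return [raw.decode(encoding).rstrip(u"\n") for raw in fin]
```

Because the encoding is fixed rather than taken from the locale, a file written on one machine or Python version should decode the same way on another.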
I remember pains around using getpreferredencoding(), so a preliminary -1 on that.
Keeping in-memory stuff as text (unicode) and explicitly serializing to binary (utf8) on I/O seems like the way to go to me.
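As a rough illustration of that boundary discipline, the sketch below converts only at the edges. `to_unicode`, `to_utf8` and `save_docs` are hypothetical names used for illustration, and the line format is merely modelled on the Mallet line in the traceback above. It is not the actual gensim code.

```python
import io


def to_unicode(value, encoding="utf-8"):
    """Return text: decode bytes, pass text through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_utf8(value):
    """Return UTF-8 bytes: encode text, pass bytes through unchanged."""
    if isinstance(value, bytes):
        return value
    return value.encode("utf-8")


def save_docs(stream, docs):
    """Serialize (doc_id, lang, words) triples to a binary stream."""
    for doc_id, lang, words in docs:
        # Everything is unicode while the line is being built...
        text = u" ".join(to_unicode(w) for w in words)
        line = u"%s %s %s\n" % (doc_id, to_unicode(lang), text)
        # ...and becomes bytes only at the moment of writing.
        stream.write(to_utf8(line))


buf = io.BytesIO()  # stands in for any binary file object, e.g. from gzip.open
save_docs(buf, [(0, u"en", [u"caf\u00e9", b"bar"])])
data = buf.getvalue()
```

Accepting both bytes and text on input is a design choice that eases a gradual port. A stricter version could reject bytes outright once all callers pass unicode.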
This weekend I have to work, but the next weekend I want to install python3 and play around with the changes you've made, and help with the py3k port. Basically start converting gensim interfaces to unicode (=mostly update the docs really, there isn't much code).
Probably in a separate branch, because I want to release "0.9.1 bug-fix" soon, without major changes.
Thanks again Lars, I appreciate your reports and fixes.