When I have an instance of SparseMatrixSimilarity, and I try to save() it, I get this:
INFO:gensim.utils:saving Similarity object to 846c57f-dirty--CHNK100-EB0-FW1-FW_NA0.5-FW_NB5-M0-NFdata__bom-nerfile-withmediaobjectfragmentids-NUMBEST10-PATRN_INCL-S_L0-S_P0-SQ_K5-SQ_R1-TFIDF0_sim_dense_disk
Traceback (most recent call last):
File "./build-models.py", line 250, in <module>
File "./build-models.py", line 118, in rebuild_data_files
File "/usr/lib/python2.7/site-packages/gensim/utils.py", line 118, in save
File "/usr/lib/python2.7/site-packages/gensim/utils.py", line 414, in pickle
cPickle.dump(obj, fout, protocol=protocol)
cPickle.PicklingError: Can't pickle <type 'ellipsis'>: attribute lookup __builtin__.ellipsis failed
Interestingly, I have done this hundreds of times before without issues.
I wonder if it has anything to do with an update to Python or NumPy, but I don't think so: I did upgrade scipy and python2 a week ago, but reverting didn't fix it.
I'm currently working around this issue by using http://jsonpickle.github.com/
diff --git a/src/gensim/utils.py b/src/gensim/utils.py
index 817f3b7..3d797a9 100644
--- a/src/gensim/utils.py
+++ b/src/gensim/utils.py
@@ -1,4 +1,4 @@
 # -*- coding: utf-8 -*-
 # Copyright (C) 2010 Radim Rehurek <firstname.lastname@example.org>
@@ -13,6 +13,7 @@ from __future__ import with_statement
 from functools import wraps # for `synchronous` function lock
+import jsonpickle
@@ -421,16 +422,24 @@ def chunkize(corpus, chunks, maxsize=0):
     for chunk in chunkize_serial(corpus, chunks):
 
 def pickle(obj, fname, protocol=-1):
     """Pickle object `obj` to file `fname`."""
-    with open(fname, 'wb') as fout: # 'b' for binary, needed on Windows
-        cPickle.dump(obj, fout, protocol=protocol)
+    with open(fname, 'w') as fout:
+        fout.write(jsonpickle.encode(obj))
 
 def unpickle(fname):
     """Load pickled object from `fname`"""
-    return cPickle.load(open(fname, 'rb'))
+    with open(fname, 'r') as fin:
+        return jsonpickle.decode(fin.read())
Can jsonpickle handle very large objects (reasonably memory efficient during save/load)? Dedan had another issue with cPickle, see #31, so perhaps completely switching from pickle to json would solve both at the same time...
Radim, your question triggered this little experiment: http://dieter.plaetinck.be/poor_mans_pickle_implementations_benchmark.html
I shall check out your numpy-based approach, it is probably better than my jsonpickle approach.
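For context, the numpy-based approach mentioned here boils down to persisting arrays in numpy's native binary `.npy` format instead of pickling them; a minimal sketch (the file path is arbitrary):

```python
import os
import tempfile
import numpy as np

# Save a dense matrix with numpy's native binary format,
# sidestepping pickle entirely for the array payload.
arr = np.arange(6, dtype=np.float64).reshape(2, 3)
path = os.path.join(tempfile.mkdtemp(), "arr.npy")
np.save(path, arr)

loaded = np.load(path)  # memory-maps are also possible via mmap_mode
```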
Nice! I like benchmarks :)
How about the standard json package? (simplejson in python <2.6)
What do you mean? what about it?
the jsonpickle page says "The standard Python libraries for encoding Python into JSON, such as the stdlib’s json, simplejson, and demjson, can only handle Python primitives that have a direct JSON equivalent (e.g. dicts, lists, strings, ints, etc.). jsonpickle builds on top of these libraries"
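The limitation the jsonpickle docs describe is easy to demonstrate with the stdlib alone: `json` serializes primitives fine but raises `TypeError` on an arbitrary object (the `Point` class below is just an illustration):

```python
import json

class Point(object):
    """Arbitrary object with no direct JSON equivalent."""
    def __init__(self, x, y):
        self.x, self.y = x, y

# Primitives with a direct JSON equivalent work out of the box...
ok = json.dumps({"a": [1, 2, 3]})

# ...but arbitrary objects don't; this gap is what jsonpickle
# fills by building on top of the stdlib encoder.
try:
    json.dumps(Point(1, 2))
    serializable = True
except TypeError:
    serializable = False
```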
Oh, I didn't know it builds on json. In that case its performance is probably nearly identical, no need to test.
Btw I remembered reading about json speed on metaoptimize some time ago; I managed to google it up: http://metaoptimize.com/blog/2009/03/22/fast-deserialization-in-python/
Well, the article explicitly discourages using it because it's buggy and unmaintained.
?? It's part of the standard python library. You probably mean cjson.
Yes, I meant cjson. Anyway, I don't feel the need to test more things right now, as the numpy native persistence thing you did is probably best anyway. Or am I missing something?
For numpy arrays, I think you're right :) Numpy is also very actively developed/maintained, so there's a good chance potential bugs will be fixed quickly. The core numpy guys are very good engineers.