cPickle Problem #31

dedan opened this Issue · 5 comments

3 participants


As described in this thread on the mailing list, I have a problem when pickling models that contain very large matrices. As an intermediate solution, I would like to contribute something that stores the matrices using the numpy methods.

@piskvorky: We wanted to discuss this when you were in Berlin but then forgot about it. Do you have some ideas on how to implement this nicely? I would like to hear your suggestions so that the solution fits your ideas for gensim. If you have no time or no ideas, I will just implement something and send you a pull request.


Hmm, so the goal here is to avoid the cPickle bug? Then maybe only override the save/load methods, to store a .npy file (for large matrices) and a .pkl file (everything else). Or store both in a single archive, with zipfile, which could have a positive impact on file size. But on the other hand, zipfile would mean no mmap = no sharing of memory for the same model between multiple processes...

A more general solution would be using PyTables, like Matt Goodman suggested in the thread you link to. I'm not sure we need another dependency yet, and it's certainly more work, but up to you :)
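
To make the .npy versus zip archive trade-off above concrete, here is a minimal sketch. It is not code from gensim; the file names and the small example matrix are made up for illustration. It uses `np.save` with `np.load(..., mmap_mode="r")` for the plain .npy case, and `np.savez_compressed` (which writes a zip archive) for the archive case.

```python
import os
import tempfile

import numpy as np

# Hypothetical file locations, for illustration only.
tmpdir = tempfile.mkdtemp()
npy_path = os.path.join(tmpdir, "matrix.npy")
npz_path = os.path.join(tmpdir, "matrix.npz")

# A small stand-in for the "very large matrix" discussed in the thread.
matrix = np.arange(12, dtype=np.float32).reshape(3, 4)

# Plain .npy file: can be memory-mapped on load, so several processes
# can share the same pages instead of each holding a private copy.
np.save(npy_path, matrix)
mapped = np.load(npy_path, mmap_mode="r")

# Compressed zip archive: possibly smaller on disk, but the array is
# decompressed into ordinary process memory when it is accessed.
np.savez_compressed(npz_path, matrix=matrix)
with np.load(npz_path) as archive:
    unzipped = archive["matrix"]

print(type(mapped).__name__, type(unzipped).__name__)
```

The array read from the archive is a regular in-memory `ndarray`, while the one read from the .npy file is a `numpy.memmap` backed by the file on disk.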


dedan, you could also try jsonpickle.
All you need to do (IIRC) is apply this patch: #30 (comment)


@dedan: I changed (Sparse)MatrixSimilarity to use numpy binary format instead of cPickle. Maybe that solution works for you too?

It's really simple: I override the object's save/load to store the large matrix separately (so that it can be mmap'ed back later) and store the rest of the object normally, with cPickle.
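
The approach described here could look roughly like the sketch below. This is an illustration under stated assumptions, not gensim's actual implementation: the class `MatrixModel`, its attributes `index` and `num_best`, and the file naming scheme are all hypothetical, and it uses Python 3's `pickle` in place of `cPickle`.

```python
import os
import pickle
import tempfile

import numpy as np


class MatrixModel:
    """Hypothetical stand-in for a model that holds one large dense matrix."""

    def __init__(self, index, num_best=None):
        self.index = index        # the large matrix
        self.num_best = num_best  # example of a small, ordinary attribute

    def save(self, fname):
        # Store the large matrix on its own, in NumPy binary format.
        np.save(fname + ".npy", self.index)
        # Pickle everything else, leaving the matrix out of the pickled state.
        state = {k: v for k, v in self.__dict__.items() if k != "index"}
        with open(fname, "wb") as fout:
            pickle.dump(state, fout, protocol=pickle.HIGHEST_PROTOCOL)

    @classmethod
    def load(cls, fname, mmap_mode="r"):
        # Restore the small attributes from the pickle ...
        with open(fname, "rb") as fin:
            state = pickle.load(fin)
        obj = cls.__new__(cls)
        obj.__dict__.update(state)
        # ... and memory-map the matrix back instead of reading it all in.
        obj.index = np.load(fname + ".npy", mmap_mode=mmap_mode)
        return obj


# Usage: round-trip a model through a temporary directory.
fname = os.path.join(tempfile.mkdtemp(), "model.pkl")
original = MatrixModel(np.random.rand(100, 50).astype(np.float32), num_best=10)
original.save(fname)
restored = MatrixModel.load(fname)
```

Because the matrix never passes through the pickler, its size does not affect pickling, and passing `mmap_mode=None` to `load` would read it fully into memory instead of mapping it.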


Thank you @piskvorky. I'll have a look at this solution later or tomorrow. I think it is similar to my current solution.


@dedan, does the new code (based on the numpy binary format instead of cPickle) fix your issue?
The code is actually already included in gensim 0.8.0.

@piskvorky piskvorky closed this