BleiCorpus after serialize cannot be loaded #1171
Comments
@vincentmajor Thanks for reporting it. Are you trying to load the model with 0.13.3 or an older version? There is a known issue that is currently being worked on: #1082 |
@tmylk Yes, gensim version 0.13.2 on Python 2.7.12. |
Agree, LdaModel is not related to BleiCorpus. There haven't been any changes. Are you serializing and loading on the same platform, say Linux? |
Yep, OS X. It doesn't even work loading straight after serializing. |
Do you have a small text and code example to reproduce it? |
Not really, I'll generate one now. |
Okay here is what I am doing. Apologies about the nltk import.
|
Hi, I am looking to fix this issue. Could I get some directions as to where these particular functions are implemented in the code base? |
So I looked around and it seems that the load() function loads a pickle, whereas the serialize() function calls save_corpus(), which does not store a pickle at all but writes plain text along with a .vocab and an .index file, if I am not wrong. The fix would be either to change the load() function to read the plain-text format, or to have serialize() call save() instead of save_corpus(), since save() stores in pickle format. |
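To make the plain-text side of this concrete: Blei's LDA-C format stores one document per line as "N id:count id:count ...", so it can be read back without pickle. The following is only a minimal stdlib sketch of such a reader, not gensim's implementation; read_ldac and the sample file name are hypothetical names chosen for illustration.

```python
import os
import tempfile


def read_ldac(path):
    """Parse an LDA-C style text file into a list of bag-of-words documents.

    Each line looks like "N id:count id:count ...", where N is the number
    of distinct terms in that document. (Hypothetical helper, not gensim API.)
    """
    docs = []
    with open(path, "r") as handle:
        for line in handle:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            declared = int(parts[0])
            doc = []
            for pair in parts[1:]:
                term_id, count = pair.split(":")
                doc.append((int(term_id), int(count)))
            if declared != len(doc):
                raise ValueError("term count mismatch in line: %r" % line)
            docs.append(doc)
    return docs


# Write a tiny two-document sample and read it back.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.lda-c")
with open(sample_path, "w") as handle:
    handle.write("2 0:1 1:2\n1 2:3\n")

parsed = read_ldac(sample_path)
print(parsed)
```

The point is only that the serialized file is ordinary text, which is why a pickle-based load() cannot make sense of it.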
Everything works fine if, instead of using the load() function, you use the init method from BleiCorpus, i.e. |
Does this mean the issue is resolved? Can we close? @pranaydeep-af what would have helped you avoid this mistake? I mean, how would you change the documentation or API so that the behaviour is clear, avoiding the issue? Or at least finding the solution more quickly. |
@piskvorky Maybe we can change the save() and load() functions to save_as_pickle() and load_pickle() so the behaviour is clear. |
-1 on There is only one way to load/save objects in gensim (called |
Ah, so how about we catch the unpickling error in the load() function and write an error message that suggests the user try the default constructor for loading instead? |
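A rough sketch of what this proposal could look like, using only the standard library. This is an assumption about the general shape, not the code from the eventual PR; load_with_hint, UNPICKLE_ERRORS, and the sample file are hypothetical names.

```python
import os
import pickle
import tempfile

# Exceptions the pickle documentation lists as possible when unpickling bad data.
UNPICKLE_ERRORS = (pickle.UnpicklingError, AttributeError, EOFError,
                   ImportError, IndexError)


def load_with_hint(path):
    """Unpickle `path`, or fail with a message pointing at the constructor.

    Hypothetical helper for illustration; not part of gensim.
    """
    try:
        with open(path, "rb") as handle:
            return pickle.load(handle)
    except UNPICKLE_ERRORS as err:
        raise IOError(
            "%s could not be unpickled (%s). If it was written by serialize(), "
            "try loading it with the corpus constructor instead." % (path, err)
        )


# A plain-text LDA-C style file, which is not a pickle.
text_path = os.path.join(tempfile.mkdtemp(), "corpus.lda-c")
with open(text_path, "w") as handle:
    handle.write("2 0:1 1:2\n")

try:
    load_with_hint(text_path)
    message = None
except IOError as err:
    message = str(err)
print(message)
```

The trade-off is that the user still hits a failure, only with a clearer pointer to the alternative.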
A warning on load, the same way as on save, would be good. @piskvorky What is the use case for save/load of an iterator for a corpus? Can it be better isolated? |
Should I add a warning and send a PR or are we looking for a better solution? |
@tmylk I don't think there's any use case for pickling iterators... in fact, I don't think it can even be done in Python. Pickling a "streamed object" itself is pretty much always an error, I'd say. That's not what a user wants (they want to serialize the content inside, not the streaming object itself). |
@pranaydeep-af Can you experiment and see what is actually saved by the save() method, to confirm that it is not a useful method? If confirmed, then please add an exception to the load and save methods. |
Sure thing, @tmylk |
The exception in load() would be for an Unpickling Error. |
@tmylk Basic PR with try-catch block for load() function has been submitted. |
@pranaydeep-af What is actually saved and loaded with the save/load methods in the corpus? Is it useful? If they are not useful, then my request above was to throw (not catch) an exception in the corpus load and save methods, to alert the user that they are not useful methods. |
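For contrast with catching the error, here is a minimal sketch of the "throw" variant requested here: a corpus class that overrides the inherited pickle-based methods so they fail immediately with a pointer to the right API. PickleSaveLoad and StreamedCorpus are hypothetical stand-ins, not gensim classes.

```python
import pickle


class PickleSaveLoad(object):
    """Stand-in for a pickle-based save/load base class (hypothetical)."""

    def save(self, fname):
        with open(fname, "wb") as handle:
            pickle.dump(self, handle)

    @classmethod
    def load(cls, fname):
        with open(fname, "rb") as handle:
            return pickle.load(handle)


class StreamedCorpus(PickleSaveLoad):
    """Stand-in corpus that refuses the inherited pickle methods."""

    def __init__(self, fname):
        self.fname = fname

    def save(self, *args, **kwargs):
        raise NotImplementedError(
            "save() pickles the streaming object, not its documents; "
            "use serialize() instead.")

    @classmethod
    def load(cls, *args, **kwargs):
        raise NotImplementedError(
            "load() expects a pickle; open a serialized corpus with the "
            "class constructor instead.")


# Both inherited entry points now fail loudly.
errors = []
for action in (lambda: StreamedCorpus("corpus.lda-c").save("out.pkl"),
               lambda: StreamedCorpus.load("corpus.lda-c")):
    try:
        action()
    except NotImplementedError as err:
        errors.append(str(err))
print(errors)
```

Under this design, other subclasses of the base class keep their save/load behaviour; only the corpus opts out.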
@tmylk Apologies for the delay. Had to deal with some college course work. |
@pranaydeep-af Could you please be more explicit and provide code samples showing when save/load are useful and when they are not? I am having difficulty understanding you. |
@tmylk What I meant was that the save() and load() methods may not be useful for a corpus, but they are part of the larger SaveLoad class, which many derived classes use. For example, the DocVecsArray() class, which saves Doc2Vec arrays after the model is generated, is also derived from utils.SaveLoad.
Thus, if save/load are deprecated in general, a corpus might work fine with save_corpus(), but other things might have trouble. |
Are you saying that corpus should not inherit from SaveLoad? |
@tmylk Yes, that would work, because corpora already have independent functions for saving and loading. Should I go ahead with that then? |
@tmylk How do I proceed with this? |
@pranaydeep-af Could you please investigate what is actually saved and loaded with the save/load methods in the corpus? |
@tmylk I did that in a comment 9 days ago. Here's the comment again:
|
Python version 2.7.12
gensim version 0.13.2
I'm serializing my corpus into Blei LDA-C format using corpora.BleiCorpus.serialize(filename, corpus), which is then later used in a dynamic topic model, and not in Python. (I know I can use the DTMModel wrapper, unrelated.)
If I need to come back and load the corpus back into Python, I tried corpora.BleiCorpus.load(filename), and I get an unpickling error.
The only other argument to load() is mmap, but I don't believe the arrays were stored separately, and using mmap='r' doesn't help anyway.