Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BleiCorpus after serialize cannot be loaded #1171

Open
vincentmajor opened this issue Feb 24, 2017 · 30 comments
Open

BleiCorpus after serialize cannot be loaded #1171

vincentmajor opened this issue Feb 24, 2017 · 30 comments
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills good first issue Issue for new contributors (not required gensim understanding + very simple) Hacktoberfest Issues marked for hacktoberfest

Comments

@vincentmajor
Copy link

Python version 2.7.12
gensim version 0.13.2

I'm serializing my corpus into Blei LDA-C format using corpora.BleiCorpus.serialize(filename, corpus) which is then later used in a dynamic topic model, and not in python. (I know I can use the DTMModel wrapper, unrelated.)

If I need to come back and load the corpus back into Python I tried corpora.BleiCorpus.load(filename), I get an unpickling error:

 ---------------------------------------------------------------------------
 UnpicklingError                           Traceback (most recent call last)
<ipython-input-28-0dd15393a941> in <module>()
----> 1 test = corpora.BleiCorpus.load('corpus-mult.dat')

/Applications/anaconda/envs/py27/lib/python2.7/site-packages/gensim/utils.pyc in load(cls, fname, mmap)
    246         compress, subname = SaveLoad._adapt_by_suffix(fname)
    247 
--> 248         obj = unpickle(fname)
    249         obj._load_specials(fname, mmap, compress, subname)
    250         return obj

/Applications/anaconda/envs/py27/lib/python2.7/site-packages/gensim/utils.pyc in unpickle(fname)
    909     with smart_open(fname) as f:
    910         # Because of loading from S3 load can't be used (missing readline in smart_open)
--> 911         return _pickle.loads(f.read())
    912 
    913 

UnpicklingError: invalid load key, '7'.

The only other argument to load() is mmap but I don't believe the arrays were stored separately and using mmap='r' doesn't help anyway.

@tmylk
Copy link
Contributor

tmylk commented Feb 24, 2017

@vincentmajor Thanks for reporting it. Are you trying to load the model with 0.13.3 or older version? There is a know issue that is currently being worked on #1082

@vincentmajor
Copy link
Author

@tmylk Yes, gensim version 0.13.2 on Python 2.7.12.
But, that issue referenced LDA models whereas I am serializing and loading a corpus without a model. I don't know if the same absence of a random_state variable influences loading of a corpus which should be random state invariant.

@tmylk
Copy link
Contributor

tmylk commented Feb 24, 2017

Agree, LdaModel is not related to BleiCorpus. There haven't been any changes. Are you serializing and loading on the same platform, say Linux?

@vincentmajor
Copy link
Author

Yep, OS X. It doesn't even work loading straight after serializing.

@tmylk
Copy link
Contributor

tmylk commented Feb 24, 2017

do you have a small text and code example to reproduce?

@vincentmajor
Copy link
Author

Not really, I'll generate one now.

@vincentmajor
Copy link
Author

Okay here is what I am doing. Apologies about the nltk import.

import numpy as np
import pandas as pd

df = pd.DataFrame(["The quick brown fox jumps over the lazy dog.", "This is a test."], columns=["text"])
#df.head()

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+') 

token_list = [tokenizer.tokenize(document) for document in df["text"]]
#print token_list

from gensim import corpora
dictionary = corpora.Dictionary(token_list)

## using the dictionary to convert from document to bag of words (doc2bow) encoding
corpus = [dictionary.doc2bow(document) for document in token_list]
## store the corpus into lda-c format, same as used by DTM,
corpora.BleiCorpus.serialize('~/Desktop/test_corpus-mult.dat', corpus)

## fails here
test_corpus = corpora.BleiCorpus.load('~/Desktop/test_corpus-mult.dat')

@piskvorky piskvorky added the bug Issue described a bug label Mar 3, 2017
@pranaydeeps
Copy link

pranaydeeps commented Mar 4, 2017

Hi, I am looking to fix this issue. Could I get some directions as to where these particular functions are implemented in the code base?
Thank you.

@pranaydeeps
Copy link

So I looked around and it seems that the load() function loads a pickle whereas the serialize() function calls save_corpus() which does not store a pickle at all but stores in plain text with a .vocab and a .index file if I am not wrong. The fix would be to either change the load() function to read the plain text format or the serialize function calls save() instead of save_corpus() as save() stores in pickle format.
Tell me if I am going wrong anywhere @tmylk

@pranaydeeps
Copy link

Everything works fine if instead of using the load() function you use the init method from BleiCorpus ie.
a = corpora.BleiCorpus('/home/artemis/Desktop/test_corpus-mult.dat')
This uses the same format to load as the save_corpus() function. So to sum it up, the save_corpus() is analogous to the default constructor of the class and load() function stores in pickle and is analogous to the save() method.

@piskvorky
Copy link
Owner

piskvorky commented Mar 4, 2017

Does this mean the issue is resolved? Can we close?

@pranaydeep-af what would have helped you avoid this mistake? I mean, how would you change the documentation or API so that the behaviour is clear, avoiding the issue? Or at least finding the solution more quickly.

@pranaydeeps
Copy link

@piskvorky Maybe we can change the save() and load() functions to save_as_pickle() and load_pickle() so the behaviour is clear.
Also, if both ways to store are not needed. Maybe we can implement one common way to save and load?
Awaiting confirmation of my findings from @tmylk

@piskvorky
Copy link
Owner

piskvorky commented Mar 4, 2017

-1 on save_pickle and load_pickle. The pickle format is only incidental. And it's not even true it's always pickle -- we already store (sub)attributes in various ways, such as direct .npy files to allow memory-mapping.

There is only one way to load/save objects in gensim (called .load() and .save()). Serializing corpora (lossy text formats) is not the same as serializing objects (binary lossless formats).

@pranaydeeps
Copy link

Ah, So how about if we catch the unpickling error in the load() function and write an error message which suggests the user to try the default constructor for loading instead?
@piskvorky

@tmylk
Copy link
Contributor

tmylk commented Mar 5, 2017

A warning on the load, the same way as on the save would be good. @piskvorky What is the use case of the save/load of an iterator for a corpus? can it be better isolated?

@pranaydeeps
Copy link

Should I add a warning and send a PR or are we looking for a better solution?

@piskvorky
Copy link
Owner

piskvorky commented Mar 5, 2017

@tmylk I don't think there's any use case for pickling iterators... in fact, I don't think it can even be done in Python.

Pickling a "streamed object" itself is pretty much always an error, I'd say. That's not what a user wants (they want to serialize the content inside, not the streaming object itself).

@tmylk
Copy link
Contributor

tmylk commented Mar 5, 2017

@pranaydeep-af Can you experiment and see what is actually saved in the save() method? To confirm that it is not a useful method. If confirmed then please add an exception to load and save methods.

@pranaydeeps
Copy link

Sure thing, @tmylk

@pranaydeeps
Copy link

The exception in load() would be for an Unpickling Error.
What sort of exception is desired in the save() method since it does not generate an error?
@tmylk

@pranaydeeps
Copy link

@tmylk Basic PR with try-catch block for load() function has been submitted.

@tmylk
Copy link
Contributor

tmylk commented Mar 6, 2017

@pranaydeep-af What is actualy saved and loaded with the save/load methods in the corpus? Is it useful?

If they are not useful, then my request above was to throw(not catch) an exception in the corpus load and save methods to alert the user that they are not useful methods.

pranaydeeps pushed a commit to pranaydeeps/gensim that referenced this issue Mar 15, 2017
pranaydeeps pushed a commit to pranaydeeps/gensim that referenced this issue Mar 15, 2017
@pranaydeeps
Copy link

@tmylk Apologies for the delay. Had to deal with some college course work.
I went through the save() and load() methods and did a few experiments. The save() method stores a pickle dump of the particular object. So in the case of a corpora.dictionary object that is rendered useless due to the presence of a superior method ie. save_corpus(), however the function may be handy when used for saving pickles of some other objects.
So I have added a check to see if the function was called from a corpora object and if it was, I have raised an exception. Otherwise the function works as usual.
For the load function I have added a warning to not use it for loading corpora objects.
Do tell me if some naive mistake has been made.

@tmylk
Copy link
Contributor

tmylk commented Mar 16, 2017

@pranaydeep-af Could you please be more explicit and provide code samples when save/load are useful and when they are not? Having difficulty understanding you

@pranaydeeps
Copy link

pranaydeeps commented Mar 18, 2017

@tmylk What I meant was, save() load() methods are not useful for corpus maybe, but they are part of the larger SaveLoad class which many derived classes use. As an example, the DocVecsArray() class which saves Doc2Vec arrays after the model is generated also is a base class of utils.SaveLoad.

model = Doc2Vec(sentences)
model.save('/tmp/my_model.doc2vec')
model_loaded = Doc2Vec.load('/tmp/my_model.doc2vec')

Thus, if save/load are deprecated in general, corpus might work fine with save_corpus() but other things might have trouble.

@tmylk
Copy link
Contributor

tmylk commented Mar 19, 2017

Are you saying that corpus should not inherit from SaveLoad?

@pranaydeeps
Copy link

@tmylk Yes, that would work because corpus already have independent functions for saving loading. Should I go ahead with that then?

@pranaydeeps
Copy link

@tmylk How do I proceed with this?

@tmylk
Copy link
Contributor

tmylk commented Mar 24, 2017

@pranaydeep-af Could you please investigate what is actualy saved and loaded with the save/load methods in the corpus?

@pranaydeeps
Copy link

@tmylk I did that in a comment 9 days ago. Here's the comment again:

@tmylk Apologies for the delay. Had to deal with some college course work.
I went through the save() and load() methods and did a few experiments. The save() method stores a pickle dump of the particular object. So in the case of a corpora.dictionary object that is rendered useless due to the presence of a superior method ie. save_corpus(), however the function may be handy when used for saving pickles of some other objects.

@menshikh-iv menshikh-iv added difficulty medium Medium issue: required good gensim understanding & python skills test before incubator labels Oct 2, 2017
@menshikh-iv menshikh-iv added good first issue Issue for new contributors (not required gensim understanding + very simple) needs domain knowledge and removed test before incubator labels Oct 16, 2017
@mpenkov mpenkov added the Hacktoberfest Issues marked for hacktoberfest label Sep 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills good first issue Issue for new contributors (not required gensim understanding + very simple) Hacktoberfest Issues marked for hacktoberfest
Projects
None yet
Development

No branches or pull requests

6 participants