MmCorpus.load --> UnpicklingError: invalid load key, '%'. #1889

Closed
obeavers opened this issue Feb 9, 2018 · 10 comments
Labels
need info Not enough information for reproduce an issue, need more info from author

Comments

@obeavers

obeavers commented Feb 9, 2018

Description

I'm getting an error when using MmCorpus.load('file.mm'), even immediately after saving with MmCorpus.serialize('file.mm', corpus). I am using Windows 10.

Steps/Code/Corpus to Reproduce

Corpus created with:

corpus = [dictionary.doc2bow(text) for text in texts]
MmCorpus.serialize('file.mm', corpus)

corpus = MmCorpus.serialize('file.mm') #breaks here

Expected Results

Expecting corpus to load as called.

Actual Results

1 c = MmCorpus.load(str(path))

c:\users\user.virtualenvs\key_log-v5coq-ss\lib\site-packages\gensim\utils.py in load(cls, fname, mmap)
393 compress, subname = SaveLoad._adapt_by_suffix(fname)
394
--> 395 obj = unpickle(fname)
396 obj._load_specials(fname, mmap, compress, subname)
397 logger.info("loaded %s", fname)

c:\users\user.virtualenvs\key_log-v5coq-ss\lib\site-packages\gensim\utils.py in unpickle(fname)
1300 # Because of loading from S3 load can't be used (missing readline in smart_open)
1301 if sys.version_info > (3, 0):
-> 1302 return _pickle.load(f, encoding='latin1')
1303 else:
1304 return _pickle.loads(f.read())

UnpicklingError: invalid load key, '%'.

Versions

Windows-10-10.0.16299-SP0
Python 3.6.3 |Anaconda, Inc.| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)]
NumPy 1.14.0
SciPy 1.0.0
gensim 3.3.0
FAST_VERSION 0

@menshikh-iv
Contributor

menshikh-iv commented Feb 9, 2018

Thanks for the report @obeavers. Can you share your corpus? It is needed to reproduce the error.

@menshikh-iv menshikh-iv added the need info Not enough information for reproduce an issue, need more info from author label Feb 9, 2018
@arlenk
Contributor

arlenk commented Feb 12, 2018

MmCorpus.load('file.mm')

Are you definitely calling MmCorpus.load('file.mm') or are you calling MmCorpus('file.mm')?
It should be the latter per https://radimrehurek.com/gensim/tut1.html#corpus-formats

@menshikh-iv
Contributor

@obeavers also, you said that

corpus = MmCorpus.serialize('file.mm') #breaks here

but in your stack trace I see a different line

c = MmCorpus.load(str(path))

This looks strange. Can you fix your first message and share the file?

Also, as @arlenk suggested, if you call serialize, you should load the result with MmCorpus(path), not MmCorpus.load.

@menshikh-iv
Contributor

I investigated this again. The problem really is the serialize + load combination, which is incorrect usage.

You should call MmCorpus.serialize("file.mm", corpus) and afterwards load it with MmCorpus("file.mm"). Don't use save/load here; that makes no sense.
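
As a rough illustration of why mixing serialize and load produces this exact message: the serialized corpus is a Matrix Market text file whose first bytes are `%%MatrixMarket`, while load goes through pickle (as the traceback above shows). The sketch below uses only the standard library. The file content is a hand-written stand-in rather than real gensim output, and `try_unpickle` is a hypothetical helper name.

```python
import pickle

# Hand-written stand-in for a small Matrix Market file (an assumption for
# illustration, not actual gensim output): header, dimensions, then entries.
MM_TEXT = (
    b"%%MatrixMarket matrix coordinate real general\n"
    b"2 3 3\n"
    b"1 1 1.0\n"
    b"1 3 2.0\n"
    b"2 2 1.0\n"
)


def try_unpickle(data: bytes) -> str:
    """Return 'ok' if the bytes unpickle, else the UnpicklingError message."""
    try:
        pickle.loads(data)
        return "ok"
    except pickle.UnpicklingError as err:
        return str(err)


# The unpickler reads the first byte as an opcode; '%' is not a valid one,
# so it gives up immediately on a Matrix Market header.
message = try_unpickle(MM_TEXT)
print(message)
```

The unpickler never gets past the first character of the header, which is why the reported load key is '%'.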

@sreenathelloti

Error while executing following command
self.train_data = pickle.load(f,encoding='cp1252')
UnpicklingError: invalid load key, '\xd9'.

@menshikh-iv
Contributor

@sreenathelloti how is this related to the current thread? What is this code?

@aristila

aristila commented Sep 21, 2021

Here's my two cents: I had serialized a corpus on a Linux server and transferred the .mm and .mm.index files into my Windows 10 environment, then tried to load the corpus.

try #1:
corpus = corpora.MmCorpus(path_to_mm_file)
(NB! this worked fine in the original Linux environment)

resulting error:
Traceback (most recent call last):
File [path_to_code], line 178, in <module>
corpus = corpora.MmCorpus(ser_path)
File ".......\gensim\corpora\mmcorpus.py", line 55, in __init__
matutils.MmReader.__init__(self, fname)
File "gensim/corpora/_mmreader.pyx", line 55, in gensim.corpora._mmreader.MmReader.__init__
self.input, self.transposed = input, transposed
File "gensim/corpora/_mmreader.pyx", line 70, in gensim.corpora._mmreader.MmReader.__init__
if not line.startswith('%'):
ValueError: need more than 0 values to unpack

try #2:
corpus = corpora.MmCorpus.load(ser_path)

resulting error:
Traceback (most recent call last):
File [path_to_code], line 178, in <module>
corpus = corpora.MmCorpus.load(ser_path)
File "......\gensim\utils.py", line 486, in load
obj = unpickle(fname)
File ".......\gensim\utils.py", line 1458, in unpickle
return _pickle.load(f, encoding='latin1') # needed because loading from S3 doesn't support readline()
_pickle.UnpicklingError: invalid load key, '%'.

@arlenk
Contributor

arlenk commented Sep 21, 2021

Here's my two cents: I had serialized a corpus on a Linux server and transferred the .mm and .mm.index files into my Windows 10 environment, then tried to load the corpus.

The MmCorpus file (path_to_mm_file) should just be a plain text file. Have you tried looking at the file to make sure the transfer from Linux didn't somehow corrupt the file?
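
One way to do that check without opening the file in an editor is to look at its first bytes. This is a minimal sketch, assuming the file is expected to begin with the Matrix Market header `%%MatrixMarket`. The helper name `describe_corpus_file` and the demo file are hypothetical and not part of gensim.

```python
import os
import tempfile
from pathlib import Path

MM_MAGIC = b"%%MatrixMarket"  # expected first bytes of a Matrix Market file


def describe_corpus_file(path):
    """Classify a file by its first bytes (hypothetical helper)."""
    p = Path(path)
    if not p.exists():
        return "missing"
    with p.open("rb") as f:
        head = f.read(len(MM_MAGIC))
    if not head:
        return "empty"
    if head == MM_MAGIC:
        return "matrix-market"
    return "other"


# Demo on a throwaway file; substitute the real path_to_mm_file.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.mm")
with open(demo_path, "wb") as f:
    f.write(b"%%MatrixMarket matrix coordinate real general\n1 1 1\n1 1 1.0\n")

print(describe_corpus_file(demo_path))
```

A result of "empty" or "other" would point at the file itself rather than at the loading code.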

@aristila

Have you tried looking at the file to make sure the transfer from Linux didn't somehow corrupt the file?

I compared checksums and they match, at least.
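
For anyone repeating that comparison, a checksum can be computed the same way on both machines with the standard library. This is only a sketch: the choice of SHA-256 and the helper name `sha256_of_file` are assumptions, since the thread does not say which checksum tool was used.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large corpora need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file; run the same function on the .mm and .mm.index
# files on each machine and compare the hex digests.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.mm")
demo_bytes = b"%%MatrixMarket matrix coordinate real general\n1 1 1\n1 1 1.0\n"
with open(demo_path, "wb") as f:
    f.write(demo_bytes)

print(sha256_of_file(demo_path))
```

Reading in binary mode matters here, so that line-ending translation on Windows cannot change the bytes being hashed.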

One thing that caught my eye is encoding='latin1' in the second try. I have been using encoding='utf-8' everywhere I can, but here the code seems to just guess. That's probably not the cause of this problem though, just thought I'd mention.

@piskvorky
Owner

piskvorky commented Sep 21, 2021

The first way is correct and should work:

corpus = corpora.MmCorpus(path_to_mm_file)

The error ValueError: need more than 0 values to unpack is weird though, I cannot reproduce it. Which Python version are you using?

Seems unrelated to this ticket. Please open a new ticket, with the necessary info (incl. a minimal example, if possible), thanks.

Repository owner locked as resolved and limited conversation to collaborators Sep 21, 2021