
Process case for dictionary id mismatch in LsiModel (instead of breaking the interpreter) #1732

Closed
nmoran opened this issue Nov 21, 2017 · 7 comments


nmoran commented Nov 21, 2017

Description

Attempting to train LSI models in online mode results in a segmentation fault.

Steps/Code/Corpus to Reproduce

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)
from gensim import models

corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]

# train on the first five documents, then add the remaining ones in online mode
lsi2 = models.lsimodel.LsiModel(corpus[:5], num_topics=5)
lsi2.add_documents(corpus[5:])

Expected Results

The model is trained successfully and gives the same topics as if the full corpus had been supplied in one go.

Actual Results

The interpreter crashes when the additional documents are processed, inside the stochastic_svd function of matutils.py. Output is:

2017-11-21 16:50:24,213 : DEBUG : Fast version of gensim.models.doc2vec is being used
2017-11-21 16:50:24,219 : INFO : 'pattern' package not found; tag filters are not available for English
2017-11-21 16:50:24,220 : WARNING : no word id mapping provided; initializing from corpus, assuming identity
2017-11-21 16:50:24,220 : INFO : using serial LSI version on this node
2017-11-21 16:50:24,220 : INFO : updating model with new documents
2017-11-21 16:50:24,220 : INFO : preparing a new chunk of documents
2017-11-21 16:50:24,220 : DEBUG : converting corpus to csc format
2017-11-21 16:50:24,220 : INFO : using 100 extra samples and 2 power iterations
2017-11-21 16:50:24,220 : INFO : 1st phase: constructing (9, 105) action matrix
2017-11-21 16:50:24,220 : INFO : orthonormalizing (9, 105) action matrix
2017-11-21 16:50:24,220 : DEBUG : computing QR of (9, 105) dense matrix
2017-11-21 16:50:24,221 : DEBUG : running 2 power iterations
2017-11-21 16:50:24,221 : DEBUG : computing QR of (9, 9) dense matrix
2017-11-21 16:50:24,221 : DEBUG : computing QR of (9, 9) dense matrix
2017-11-21 16:50:24,221 : INFO : 2nd phase: running dense svd on (9, 5) matrix
2017-11-21 16:50:24,221 : INFO : computing the final decomposition
2017-11-21 16:50:24,221 : INFO : keeping 5 factors (discarding 0.000% of energy spectrum)
2017-11-21 16:50:24,222 : INFO : processed documents up to #5
2017-11-21 16:50:24,222 : INFO : topic #0(3.333): 0.650*"4" + 0.405*"3" + 0.305*"7" + 0.264*"5" + 0.264*"6" + 0.240*"2" + 0.225*"0" + 0.200*"1" + 0.180*"8"
2017-11-21 16:50:24,222 : INFO : topic #1(2.363): 0.448*"6" + 0.448*"5" + -0.390*"4" + -0.356*"7" + 0.350*"3" + -0.309*"0" + 0.225*"8" + 0.174*"2" + -0.148*"1"
2017-11-21 16:50:24,222 : INFO : topic #2(1.644): 0.594*"2" + 0.557*"1" + 0.413*"0" + -0.337*"4" + -0.187*"7" + -0.091*"3" + -0.070*"6" + -0.070*"5" + 0.016*"8"
2017-11-21 16:50:24,222 : INFO : topic #3(1.365): 0.562*"1" + 0.503*"3" + -0.324*"0" + 0.297*"7" + -0.293*"2" + -0.263*"8" + -0.261*"4" + -0.088*"6" + -0.088*"5"
2017-11-21 16:50:24,222 : INFO : topic #4(0.858): -0.583*"0" + 0.555*"8" + -0.323*"5" + -0.323*"6" + 0.322*"2" + 0.177*"4" + 0.089*"1" + -0.028*"7" + -0.001*"3"
2017-11-21 16:50:24,222 : INFO : updating model with new documents
2017-11-21 16:50:24,222 : INFO : preparing a new chunk of documents
2017-11-21 16:50:24,222 : DEBUG : converting corpus to csc format
2017-11-21 16:50:24,222 : INFO : using 100 extra samples and 2 power iterations
2017-11-21 16:50:24,222 : INFO : 1st phase: constructing (9, 105) action matrix
2017-11-21 16:50:24,222 : INFO : orthonormalizing (9, 105) action matrix
*** Error in `python': double free or corruption (!prev): 0x0000000001785490 ***

Full backtrace output found at https://gist.github.com/nmoran/09a13599156e56d6b40a0e7627642e7a

Versions

Linux-4.4.0-97-generic-x86_64-with-Ubuntu-16.04-xenial
('Python', '2.7.12 (default, Nov 19 2016, 06:48:10) \n[GCC 5.4.0 20160609]')
('NumPy', '1.13.3')
('SciPy', '1.0.0')
('gensim', '3.1.0')
('FAST_VERSION', 1)


menshikh-iv commented Nov 22, 2017

Thanks @nmoran, this looks like a serious bug (though I'm not sure the bug is on our side; it may be a numpy/scipy problem).

Please add your code and dataset so that we can reproduce this bug.

menshikh-iv added the bug and difficulty hard labels Nov 22, 2017

nmoran commented Nov 22, 2017

Apologies, I thought I had included the code, but it had ended up inside a comment. This is fixed now. It's just the example dataset from the tutorial.


menshikh-iv commented Nov 22, 2017

I found the problem.
The first time, you pass corpus[:5], which contains token ids in the range 0-8; the second time, you pass corpus[5:], which contains ids 8-11. Ids 9-11 are not known to the LsiModel, and that is what produces this problem.

To avoid this, specify id2word when constructing the LsiModel, e.g.:

id2word = {x: str(x) for x in range(12)}   # 12 is the number of distinct token ids in your corpus
lsi2 = models.lsimodel.LsiModel(corpus[:5], num_topics=5, id2word=id2word)
lsi2.add_documents(corpus[5:])

In a real task you would fit a gensim.corpora.Dictionary, convert all documents (lists of tokens) with Dictionary.doc2bow first, and pass the dictionary as the id2word parameter; then this problem does not happen.
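For reference, a minimal sketch of that Dictionary-based workflow (the tokenized texts below are illustrative only, not part of this issue):

from gensim.corpora import Dictionary
from gensim.models import LsiModel

texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system", "response", "time"],
         ["graph", "minors", "survey"]]

dictionary = Dictionary(texts)                     # maps every token to an id up front
bow_corpus = [dictionary.doc2bow(t) for t in texts]

lsi = LsiModel(bow_corpus[:2], id2word=dictionary, num_topics=2)
lsi.add_documents(bow_corpus[2:])                  # safe: all ids are already known to the model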


nmoran commented Nov 22, 2017

Thanks @menshikh-iv for the speedy resolution. Perhaps this should exit a little more gracefully, maybe by checking that the number of ids matches? In my particular case I am not applying this to language data, so I am probably not going to use gensim.corpora.Dictionary.

I am also wondering whether this means that adding more words in online mode is not supported?


menshikh-iv commented Nov 22, 2017

@nmoran you are correct: the dictionary must be defined before training (or use something based on the hashing trick, with a hash dictionary).

I agree with you, a clean exception would be better for this situation; reopening the issue.
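As an illustration of the kind of check being discussed, here is a minimal user-side sketch (the check_ids helper is hypothetical, not gensim's eventual fix), assuming the new chunk is a list of bag-of-words vectors:

def check_ids(model, chunk):
    # largest term id appearing anywhere in the new chunk
    max_id = max(term_id for doc in chunk for term_id, _ in doc)
    if max_id >= model.num_terms:
        raise ValueError("chunk contains term id %d, but the model only knows %d terms"
                         % (max_id, model.num_terms))

check_ids(lsi2, corpus[5:])     # would raise a clear error instead of crashing the interpreter
lsi2.add_documents(corpus[5:])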

menshikh-iv reopened this Nov 22, 2017
menshikh-iv added the documentation and difficulty easy labels and removed the difficulty hard label Nov 22, 2017
menshikh-iv changed the title from "LSI online training is crashing" to "Process case for dictionary id mismatch in LsiModel (instead of breaking the interpreter)" Nov 22, 2017

piskvorky commented Nov 22, 2017

Yes, online feature expansion is not supported. The feature set must be known in advance. The features don't have to be words, of course (no need to use Dictionary); "words" is just NLP terminology for "features".

The hashing trick sort of bypasses this limitation, by having a pre-determined, static way to convert the features into ids, using a limited id range: https://radimrehurek.com/gensim/corpora/hashdictionary.html
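A small sketch of what that looks like (the token lists are illustrative only; HashDictionary exposes a fixed id space of size id_range, so later tokens always hash into ids the model already knows about):

from gensim.corpora import HashDictionary
from gensim.models import LsiModel

hash_dict = HashDictionary(id_range=1000)    # fixed id space, no fitting pass over the data needed

first_chunk = [hash_dict.doc2bow(t) for t in [["human", "interface"], ["user", "system"]]]
later_chunk = [hash_dict.doc2bow(t) for t in [["graph", "minors", "survey"]]]

lsi = LsiModel(first_chunk, id2word=hash_dict, num_topics=2)
lsi.add_documents(later_chunk)               # unseen tokens still map into the known id range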

We have an open issue, #74, about checking whether the supplied feature ids match the expected ids, but we have decided against implementing it so far for performance reasons: one extra "checking" pass through the data can be quite costly, especially for non-repeatable streams.

@menshikh-iv

@piskvorky OK, I'll close this issue and I've added a comment about it to #74.
