
Process case for dictionary id mismatch in LsiModel (instead of breaking the interpreter) #1732

Closed
nmoran opened this issue Nov 21, 2017 · 7 comments


nmoran commented Nov 21, 2017

Description

Attempting to train LSI models in online mode results in a segmentation fault.

Steps/Code/Corpus to Reproduce

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.DEBUG)
from gensim import models

corpus = [[(0, 1.0), (1, 1.0), (2, 1.0)],
          [(2, 1.0), (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (8, 1.0)],
          [(1, 1.0), (3, 1.0), (4, 1.0), (7, 1.0)],
          [(0, 1.0), (4, 2.0), (7, 1.0)],
          [(3, 1.0), (5, 1.0), (6, 1.0)],
          [(9, 1.0)],
          [(9, 1.0), (10, 1.0)],
          [(9, 1.0), (10, 1.0), (11, 1.0)],
          [(8, 1.0), (10, 1.0), (11, 1.0)]]

# train on the first five documents, then add the remaining ones in online mode
lsi2 = models.lsimodel.LsiModel(corpus[:5], num_topics=5)
lsi2.add_documents(corpus[5:])

Expected Results

The model is trained successfully and gives the same topics as if the full corpus had been supplied in one go.

Actual Results

The interpreter crashes when the additional documents are processed, inside the stochastic_svd function of matutils.py. Output is:

2017-11-21 16:50:24,213 : DEBUG : Fast version of gensim.models.doc2vec is being used
2017-11-21 16:50:24,219 : INFO : 'pattern' package not found; tag filters are not available for English
2017-11-21 16:50:24,220 : WARNING : no word id mapping provided; initializing from corpus, assuming identity
2017-11-21 16:50:24,220 : INFO : using serial LSI version on this node
2017-11-21 16:50:24,220 : INFO : updating model with new documents
2017-11-21 16:50:24,220 : INFO : preparing a new chunk of documents
2017-11-21 16:50:24,220 : DEBUG : converting corpus to csc format
2017-11-21 16:50:24,220 : INFO : using 100 extra samples and 2 power iterations
2017-11-21 16:50:24,220 : INFO : 1st phase: constructing (9, 105) action matrix
2017-11-21 16:50:24,220 : INFO : orthonormalizing (9, 105) action matrix
2017-11-21 16:50:24,220 : DEBUG : computing QR of (9, 105) dense matrix
2017-11-21 16:50:24,221 : DEBUG : running 2 power iterations
2017-11-21 16:50:24,221 : DEBUG : computing QR of (9, 9) dense matrix
2017-11-21 16:50:24,221 : DEBUG : computing QR of (9, 9) dense matrix
2017-11-21 16:50:24,221 : INFO : 2nd phase: running dense svd on (9, 5) matrix
2017-11-21 16:50:24,221 : INFO : computing the final decomposition
2017-11-21 16:50:24,221 : INFO : keeping 5 factors (discarding 0.000% of energy spectrum)
2017-11-21 16:50:24,222 : INFO : processed documents up to #5
2017-11-21 16:50:24,222 : INFO : topic #0(3.333): 0.650*"4" + 0.405*"3" + 0.305*"7" + 0.264*"5" + 0.264*"6" + 0.240*"2" + 0.225*"0" + 0.200*"1" + 0.180*"8"
2017-11-21 16:50:24,222 : INFO : topic #1(2.363): 0.448*"6" + 0.448*"5" + -0.390*"4" + -0.356*"7" + 0.350*"3" + -0.309*"0" + 0.225*"8" + 0.174*"2" + -0.148*"1"
2017-11-21 16:50:24,222 : INFO : topic #2(1.644): 0.594*"2" + 0.557*"1" + 0.413*"0" + -0.337*"4" + -0.187*"7" + -0.091*"3" + -0.070*"6" + -0.070*"5" + 0.016*"8"
2017-11-21 16:50:24,222 : INFO : topic #3(1.365): 0.562*"1" + 0.503*"3" + -0.324*"0" + 0.297*"7" + -0.293*"2" + -0.263*"8" + -0.261*"4" + -0.088*"6" + -0.088*"5"
2017-11-21 16:50:24,222 : INFO : topic #4(0.858): -0.583*"0" + 0.555*"8" + -0.323*"5" + -0.323*"6" + 0.322*"2" + 0.177*"4" + 0.089*"1" + -0.028*"7" + -0.001*"3"
2017-11-21 16:50:24,222 : INFO : updating model with new documents
2017-11-21 16:50:24,222 : INFO : preparing a new chunk of documents
2017-11-21 16:50:24,222 : DEBUG : converting corpus to csc format
2017-11-21 16:50:24,222 : INFO : using 100 extra samples and 2 power iterations
2017-11-21 16:50:24,222 : INFO : 1st phase: constructing (9, 105) action matrix
2017-11-21 16:50:24,222 : INFO : orthonormalizing (9, 105) action matrix
*** Error in `python': double free or corruption (!prev): 0x0000000001785490 ***

Full backtrace output found at https://gist.github.com/nmoran/09a13599156e56d6b40a0e7627642e7a

Versions

Linux-4.4.0-97-generic-x86_64-with-Ubuntu-16.04-xenial
('Python', '2.7.12 (default, Nov 19 2016, 06:48:10) \n[GCC 5.4.0 20160609]')
('NumPy', '1.13.3')
('SciPy', '1.0.0')
('gensim', '3.1.0')
('FAST_VERSION', 1)


menshikh-iv commented Nov 22, 2017

Thanks @nmoran, this looks like a serious bug (though I'm not sure the bug is on our side; it may be a numpy/scipy problem).

Please add your code and dataset so that we can reproduce this bug.

menshikh-iv added the bug and difficulty hard labels Nov 22, 2017

nmoran commented Nov 22, 2017

Apologies, I thought I had included the code, but it had ended up inside a comment. This is fixed now. It's just the example dataset from the tutorial.


menshikh-iv commented Nov 22, 2017

I found the problem.
The first time, you pass corpus[:5], which contains token ids in the range 0-8; the second time, you pass corpus[5:], which contains ids 8-11. Ids 9-11 are not known to the LsiModel, and that is what produces this problem.

To avoid this, specify id2word when constructing the LsiModel, e.g.:

id2word = {x: str(x) for x in range(12)}   # 12 is the number of distinct token ids in your corpus
lsi2 = models.lsimodel.LsiModel(corpus[:5], num_topics=5, id2word=id2word)
lsi2.add_documents(corpus[5:])

In a real task you would fit a gensim.corpora.Dictionary, convert all documents (lists of tokens) with Dictionary.doc2bow first, and pass the dictionary as the id2word parameter; then this problem does not happen.
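For reference, a minimal sketch of that Dictionary-based workflow (the tokenized texts below are illustrative only, not part of this issue):

from gensim.corpora import Dictionary
from gensim.models import LsiModel

texts = [["human", "interface", "computer"],
         ["survey", "user", "computer", "system", "response", "time"],
         ["graph", "minors", "survey"]]

dictionary = Dictionary(texts)                     # maps every token to an id up front
bow_corpus = [dictionary.doc2bow(t) for t in texts]

lsi = LsiModel(bow_corpus[:2], id2word=dictionary, num_topics=2)
lsi.add_documents(bow_corpus[2:])                  # safe: all ids are already known to the model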


nmoran commented Nov 22, 2017

Thanks @menshikh-iv for the speedy resolution. Perhaps this should exit a little more gracefully, maybe by checking that the number of ids matches? In my particular case I am not applying this to language data, so I am probably not going to use gensim.corpora.Dictionary.

I am also wondering whether this means that adding more words in online mode is not supported?


menshikh-iv commented Nov 22, 2017

@nmoran you are correct: the dictionary must be defined before training (or use something based on the hashing trick, with a hash dictionary).

I agree with you, a clean exception would be better for this situation; reopening the issue.
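As an illustration of the kind of check being discussed, here is a minimal user-side sketch (the check_ids helper is hypothetical, not gensim's eventual fix), assuming the new chunk is a list of bag-of-words vectors:

def check_ids(model, chunk):
    # largest term id appearing anywhere in the new chunk
    max_id = max(term_id for doc in chunk for term_id, _ in doc)
    if max_id >= model.num_terms:
        raise ValueError("chunk contains term id %d, but the model only knows %d terms"
                         % (max_id, model.num_terms))

check_ids(lsi2, corpus[5:])     # would raise a clear error instead of crashing the interpreter
lsi2.add_documents(corpus[5:])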

menshikh-iv reopened this Nov 22, 2017
menshikh-iv added the documentation and difficulty easy labels and removed the difficulty hard label Nov 22, 2017
menshikh-iv changed the title from "LSI online training is crashing" to "Process case for dictionary id mismatch in LsiModel (instead of breaking the interpreter)" Nov 22, 2017

piskvorky commented Nov 22, 2017

Yes, online feature expansion is not supported. The feature set must be known in advance. The features don't have to be words, of course (no need to use Dictionary); "words" is just NLP terminology for "features".

The hashing trick sort of bypasses this limitation, by having a pre-determined, static way to convert the features into ids, using a limited id range: https://radimrehurek.com/gensim/corpora/hashdictionary.html
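A small sketch of what that looks like (the token lists are illustrative only; HashDictionary exposes a fixed id space of size id_range, so later tokens always hash into ids the model already knows about):

from gensim.corpora import HashDictionary
from gensim.models import LsiModel

hash_dict = HashDictionary(id_range=1000)    # fixed id space, no fitting pass over the data needed

first_chunk = [hash_dict.doc2bow(t) for t in [["human", "interface"], ["user", "system"]]]
later_chunk = [hash_dict.doc2bow(t) for t in [["graph", "minors", "survey"]]]

lsi = LsiModel(first_chunk, id2word=hash_dict, num_topics=2)
lsi.add_documents(later_chunk)               # unseen tokens still map into the known id range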

We have an open issue, #74, about checking whether the supplied feature ids match the expected ids, but we have decided against implementing it so far for performance reasons: one extra "checking" pass through the data can be quite costly, especially for non-repeatable streams.

@menshikh-iv

@piskvorky OK, I'll close this issue and I've added a comment about it to #74.
