Process case for dictionary id mismatch in LsiModel (instead of break the interpreter) #1732
Comments
Thanks @nmoran, this looks like a serious bug (though I'm not sure the bug is on our side; it may be a numpy/scipy problem). Please add your code and dataset so that we can reproduce it.
Apologies, I thought I had included the code, but it was inside the comment. This is fixed now. It's just the example dataset from the tutorial.
I found the problem. To avoid this, please specify id2word:
id2word = {x: str(x) for x in range(12)}  # 12 is the number of tokens in your corpus
lsi2 = models.lsimodel.LsiModel(corpus[:5], num_topics=5, id2word=id2word)
lsi2.add_documents(corpus[5:])
In the real task, you will fit
Thanks @menshikh-iv for the speedy resolution. Perhaps this should exit a little more gracefully, maybe by checking that the number of ids matches? In my particular case I am not applying this to language data so probably not going to use Also wondering if this means that adding more words in online mode is not supported?
@nmoran you are correct, the dictionary must be defined before training (or you can use something based on the hashing trick, with a hash dictionary). I agree that a clean exception would be better in this situation, so I am reopening the issue.
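The clean exception suggested here could look roughly like the following sketch. It is not gensim code: `check_feature_ids` and its `num_terms` parameter are hypothetical names, and the corpus is assumed to be in bag-of-words form (lists of `(feature_id, weight)` pairs). Note that it walks every document in the chunk once more before the model would be updated.

```python
def check_feature_ids(chunk, num_terms):
    """Raise ValueError if any feature id in `chunk` falls outside [0, num_terms).

    Hypothetical helper for illustration only; not part of gensim.
    """
    for doc_no, doc in enumerate(chunk):
        for feature_id, _weight in doc:
            if not 0 <= feature_id < num_terms:
                raise ValueError(
                    "document %d uses feature id %d, but the model was built "
                    "for %d features" % (doc_no, feature_id, num_terms)
                )


# Hypothetical usage: a model built for features 0..4 later receives id 7.
first_chunk = [[(0, 1.0), (4, 2.0)]]
later_chunk = [[(1, 1.0)], [(7, 1.0)]]

check_feature_ids(first_chunk, num_terms=5)  # passes silently
try:
    check_feature_ids(later_chunk, num_terms=5)
except ValueError as err:
    message = str(err)  # a readable error instead of a crash
```

Calling such a check before each incremental update would turn the mismatch into an ordinary Python exception, at the price of one more iteration over the chunk.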
Yes, online feature expansion is not supported. The feature set must be known in advance. Though they don't have to be words of course (no need to use The hashing trick sort of bypasses this limitation, by having a pre-determined, static way to convert the features into ids, using a limited id range: https://radimrehurek.com/gensim/corpora/hashdictionary.html We have an issue open, #74, for checking whether the supplied feature ids match the expected ids, but have decided against implementing it so far, for performance reasons: one extra "checking" pass through the data can be quite costly, especially for non-repeatable streams.
@piskvorky OK, I closed this PR and added a comment to #74 about this issue.
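To make the hashing trick concrete, here is a minimal pure-Python sketch. It is not the implementation behind the linked HashDictionary page: the helper names `hashed_id` and `doc2bow_hashed`, the use of `zlib.adler32`, and the `id_range` values are all assumptions chosen for illustration.

```python
import zlib


def hashed_id(feature, id_range=32000):
    """Map a feature string to a fixed id in [0, id_range) without any fitted dictionary."""
    return zlib.adler32(feature.encode("utf-8")) % id_range


def doc2bow_hashed(features, id_range=32000):
    """Convert a list of feature strings into sorted (id, count) pairs."""
    counts = {}
    for feature in features:
        fid = hashed_id(feature, id_range)
        counts[fid] = counts.get(fid, 0) + 1  # colliding features share one id
    return sorted(counts.items())


# Features that first appear in a later chunk still land inside the same id range.
early = doc2bow_hashed(["human", "interface", "human"], id_range=100)
late = doc2bow_hashed(["human", "unseen_feature"], id_range=100)
```

Because the id range is fixed up front, the number of features is known before training even when new features show up later. The trade-off is that distinct features can collide on the same id.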
Description
Attempting to train LSI models in online mode results in a segmentation fault.
Steps/Code/Corpus to Reproduce
Expected Results
The model trains successfully and gives the same topics as if the full corpus had been supplied in one go.
Actual Results
The interpreter crashes when documents are added, in the stochastic_svd function of matutils.py. Output is:
The full backtrace is available at https://gist.github.com/nmoran/09a13599156e56d6b40a0e7627642e7a
Versions
Linux-4.4.0-97-generic-x86_64-with-Ubuntu-16.04-xenial
('Python', '2.7.12 (default, Nov 19 2016, 06:48:10) \n[GCC 5.4.0 20160609]')
('NumPy', '1.13.3')
('SciPy', '1.0.0')
('gensim', '3.1.0')
('FAST_VERSION', 1)