Potential bug in doc2vec.py related to max_rawint and doctags string lookup #577
Comments
Yes, there's a bug that is fixed by pending PR #560. (It's ok for You could apply that PR as a fix, or separately convert the returned indices, as described in a recent forum post: https://groups.google.com/forum/#!msg/gensim/qBKh8DHFn-A/ByOPSEUFAQAJ
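The second workaround, converting the returned indices yourself, might look roughly like the sketch below. It is plain Python and does not call gensim; `indices_to_tags` and `index_to_tag` are hypothetical names, and it assumes you have kept your string tags in a list in the order the model first saw them.

```python
import numbers


def indices_to_tags(results, index_to_tag):
    """Map (index, similarity) pairs to (tag, similarity) pairs.

    `index_to_tag` is a hypothetical list of string tags in first-seen
    order. Keys that are already strings, or indices that fall outside
    the list, are passed through unchanged.
    """
    converted = []
    for key, similarity in results:
        # numbers.Integral also matches NumPy integer types.
        if isinstance(key, numbers.Integral) and 0 <= key < len(index_to_tag):
            key = index_to_tag[key]
        converted.append((key, similarity))
    return converted


# Made-up tags and scores, only to show the shape of the data.
tags_in_order = ["doc_a", "doc_b", "doc_c"]
raw_results = [(2, 0.91), (0, 0.90)]
print(indices_to_tags(raw_results, tags_in_order))
```

Whether the positions in such a list line up with the indices an affected model returns is an assumption to check against your own data before relying on it.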
This is now fixed in the 'develop' branch.
I installed the 'develop' branch using
Are you by chance reloading a model from before this was fixed?
That is almost certain. Do I need to retrain?
If you can easily retrain, that'd be best: you'd be sure to have a model that's in sync with the latest code. But it's also likely the old model could be patched to match the current code expectations. I haven't tested this, but the two key changes are: (a) the field formerly called So I would try, after loading your older model:
That might adapt the old model properly; please let me know if it seems to work.
Summary of the problem
When using string tags with doc2vec, calling model.docvecs.most_similar("string_tag") returns the documents' internal numeric indices instead of their string tags.
Code to easily reproduce the problem
Expected result: [("string_tag_x", 0.9093217849731445), ("string_tag_y", 0.9033896327018738), ...]
Actual result: [(9997, 0.9093217849731445), (3530, 0.9033896327018738), ...]
My investigation in the code
I apologize in advance if I am wrong, since this is the first time I am even using gensim.
I have looked a bit into the code and I think the problem is related to the fact that max_rawint is -1 when the most_similar function calls self.index_to_doctag(sim):
Looking further in the code, I think the reason why max_rawint is -1 might be the function below.
Note that max_rawint is not changed if the tag is a string.
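A toy sketch of the behaviour described here: integer tags raise max_rawint, string tags leave it alone. The class `ToyDocTags` and its `note_tag` method are illustrative assumptions written for this issue, not gensim's code, and the sketch makes no claim about whether this is the actual cause of the wrong return values.

```python
class ToyDocTags:
    """Hypothetical stand-in for a doctag registry (not gensim's class)."""

    def __init__(self):
        self.max_rawint = -1    # highest integer tag seen so far
        self.string_tags = []   # string tags in first-seen order

    def note_tag(self, tag):
        if isinstance(tag, int):
            # Integer tags only move the high-water mark.
            self.max_rawint = max(self.max_rawint, tag)
        elif tag not in self.string_tags:
            # String tags are recorded; max_rawint is left untouched.
            self.string_tags.append(tag)


registry = ToyDocTags()
for tag in ["string_tag_x", "string_tag_y"]:
    registry.note_tag(tag)
print(registry.max_rawint)  # -1
```

With only string tags registered, the counter keeps its initial value of -1, which matches what was observed above.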