Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential bug in doc2vec.py related to max_rawint and doctags string lookup #577

Closed
zmarty opened this issue Jan 10, 2016 · 6 comments
Closed

Comments

@zmarty
Copy link

zmarty commented Jan 10, 2016

Summary of the problem

When using string tags with doc2vec, calling model.docvecs.most_similar("string_tag") returns the internal numeric indices of the documents, instead of the respective string tag.

Code to easily reproduce the problem

import gensim
from gensim.models.doc2vec import TaggedDocument, Doc2Vec

documents = []

# Downloaded from http://www.cs.umb.edu/~smimarog/textmining/datasets/20ng-train-all-terms.txt
with open('20ng-train-all-terms.txt') as alldata:
  for line_no, line in enumerate(alldata):
    tokens = gensim.utils.to_unicode(line).split()
    words = tokens[1:]
    tags = ["myline" + str(line_no)]
    documents.append(TaggedDocument(words=words, tags=tags))

model = Doc2Vec(documents, size=10, window=8, min_count=3, workers=8)

print (model.docvecs.most_similar("myline2"))

Expected result: [("string_tag_x", 0.9093217849731445), ("string_tag_y", 0.9033896327018738), ...]
Actual result: [(9997, 0.9093217849731445), (3530, 0.9033896327018738), ...]

My investigation in the code

I apologize in advance if I am wrong, since this is the first time I am even using gensim.

I have looked a bit into the code and I think the problem is related to the fact that max_rawint is -1 when the most_similar function calls self.index_to_doctag(sim):

def index_to_doctag(self, i_index):
    """Return string key for given i_index, if available. Otherwise return raw int doctag (same int)."""
    candidate_offset = self.max_rawint - i_index - 1
    if 0 <= candidate_offset < len(self.offset2doctag):
        return self.offset2doctag[candidate_offset]
    else:
        return i_index

Looking further in the code, I think the reason why max_rawint is -1 might be the function below.

Note max_rawint is not changed if the tag is a string.

def note_doctag(self, key, document_no, document_length):
    """Note a document tag during initial corpus scan, for structure sizing."""
    if isinstance(key, int):
        self.max_rawint = max(self.max_rawint, key)
    else:
        if key in self.doctags:
            self.doctags[key] = self.doctags[key].repeat(document_length)
        else:
            self.doctags[key] = Doctag(len(self.offset2doctag), document_length, 1)
            self.offset2doctag.append(key)
    self.count = self.max_rawint + 1 + len(self.offset2doctag)
@gojomo
Copy link
Collaborator

gojomo commented Jan 10, 2016

Yes, there's a bug that is fixed by pending PR #560.

(It's ok for max_rawint to be -1, if only string doctags were encountered during the `build_vocab() survey... but the calculation later of when string tags should be returned instead was wrong...)

You could apply that PR as a fix, or separately convert the returned indexs, as described in a recent forum post: https://groups.google.com/forum/#!msg/gensim/qBKh8DHFn-A/ByOPSEUFAQAJ

@gojomo
Copy link
Collaborator

gojomo commented Jan 14, 2016

This is now fixed in the 'develop' branch.

@gojomo gojomo closed this as completed Jan 14, 2016
@jberwald
Copy link

jberwald commented Feb 8, 2016

I installed the 'develop' branch using pip install --upgrade https://github.com/piskvorky/gensim/archive/develop.zip. I'm still seeing the 'max_rawint' exception (AttributeError: 'DocvecsArray' object has no attribute 'max_rawint'). Could you confirm that this is fixed? Thanks for you help.

@gojomo
Copy link
Collaborator

gojomo commented Feb 8, 2016

Are you by chance reloading a model from before this was fixed?

@jberwald
Copy link

jberwald commented Feb 8, 2016

That is almost certain. Do I need to retrain?

@gojomo
Copy link
Collaborator

gojomo commented Feb 8, 2016

If you can easily retrain, that'd be best: you'd be sure to have a model that's in sync with the latest code.

But it's also likely the old model could be patched to match the current code expectations. I haven't tested this, but the two key changes are: (a) the field formerly called index2doctag is now called offsetToDoctag; (b) there's the new field max_rawint.

So I would try, after loading your older model:

model.offset2doctag = model.index2doctag
model.max_rawint = model.count - len(model.offset2doctag) - 1

That might adapt the old model properly; please let me know if it seems to work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants