Potential bug in doc2vec.py related to max_rawint and doctags string lookup #577

zmarty · 2016-01-10T02:38:44Z

Summary of the problem

When using string tags with doc2vec, calling model.docvecs.most_similar("string_tag") returns the internal numeric indices of the documents, instead of the respective string tag.

Code to easily reproduce the problem

import gensim
from gensim.models.doc2vec import TaggedDocument, Doc2Vec

documents = []

# Downloaded from http://www.cs.umb.edu/~smimarog/textmining/datasets/20ng-train-all-terms.txt
with open('20ng-train-all-terms.txt') as alldata:
  for line_no, line in enumerate(alldata):
    tokens = gensim.utils.to_unicode(line).split()
    words = tokens[1:]
    tags = ["myline" + str(line_no)]
    documents.append(TaggedDocument(words=words, tags=tags))

model = Doc2Vec(documents, size=10, window=8, min_count=3, workers=8)

print (model.docvecs.most_similar("myline2"))

Expected result: [("string_tag_x", 0.9093217849731445), ("string_tag_y", 0.9033896327018738), ...]
Actual result: [(9997, 0.9093217849731445), (3530, 0.9033896327018738), ...]

My investigation in the code

I apologize in advance if I am wrong, since this is the first time I am even using gensim.

I have looked a bit into the code and I think the problem is related to the fact that max_rawint is -1 when the most_similar function calls self.index_to_doctag(sim):

def index_to_doctag(self, i_index):
    """Return string key for given i_index, if available. Otherwise return raw int doctag (same int)."""
    candidate_offset = self.max_rawint - i_index - 1
    if 0 <= candidate_offset < len(self.offset2doctag):
        return self.offset2doctag[candidate_offset]
    else:
        return i_index

Looking further in the code, I think the reason why max_rawint is -1 might be the function below.

Note max_rawint is not changed if the tag is a string.

def note_doctag(self, key, document_no, document_length):
    """Note a document tag during initial corpus scan, for structure sizing."""
    if isinstance(key, int):
        self.max_rawint = max(self.max_rawint, key)
    else:
        if key in self.doctags:
            self.doctags[key] = self.doctags[key].repeat(document_length)
        else:
            self.doctags[key] = Doctag(len(self.offset2doctag), document_length, 1)
            self.offset2doctag.append(key)
    self.count = self.max_rawint + 1 + len(self.offset2doctag)

gojomo · 2016-01-10T09:08:58Z

Yes, there's a bug that is fixed by pending PR #560.

(It's ok for max_rawint to be -1, if only string doctags were encountered during the `build_vocab() survey... but the calculation later of when string tags should be returned instead was wrong...)

You could apply that PR as a fix, or separately convert the returned indexs, as described in a recent forum post: https://groups.google.com/forum/#!msg/gensim/qBKh8DHFn-A/ByOPSEUFAQAJ

gojomo · 2016-01-14T07:15:33Z

This is now fixed in the 'develop' branch.

jberwald · 2016-02-08T03:21:35Z

I installed the 'develop' branch using pip install --upgrade https://github.com/piskvorky/gensim/archive/develop.zip. I'm still seeing the 'max_rawint' exception (AttributeError: 'DocvecsArray' object has no attribute 'max_rawint'). Could you confirm that this is fixed? Thanks for you help.

gojomo · 2016-02-08T03:29:17Z

Are you by chance reloading a model from before this was fixed?

jberwald · 2016-02-08T03:33:36Z

That is almost certain. Do I need to retrain?

gojomo · 2016-02-08T04:58:59Z

If you can easily retrain, that'd be best: you'd be sure to have a model that's in sync with the latest code.

But it's also likely the old model could be patched to match the current code expectations. I haven't tested this, but the two key changes are: (a) the field formerly called index2doctag is now called offsetToDoctag; (b) there's the new field max_rawint.

So I would try, after loading your older model:

model.offset2doctag = model.index2doctag
model.max_rawint = model.count - len(model.offset2doctag) - 1

That might adapt the old model properly; please let me know if it seems to work.

gojomo closed this as completed Jan 14, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Potential bug in doc2vec.py related to max_rawint and doctags string lookup #577

Potential bug in doc2vec.py related to max_rawint and doctags string lookup #577

zmarty commented Jan 10, 2016

gojomo commented Jan 10, 2016

gojomo commented Jan 14, 2016

jberwald commented Feb 8, 2016

gojomo commented Feb 8, 2016

jberwald commented Feb 8, 2016

gojomo commented Feb 8, 2016

Potential bug in doc2vec.py related to max_rawint and doctags string lookup #577

Potential bug in doc2vec.py related to max_rawint and doctags string lookup #577

Comments

zmarty commented Jan 10, 2016

Summary of the problem

Code to easily reproduce the problem

My investigation in the code

gojomo commented Jan 10, 2016

gojomo commented Jan 14, 2016

jberwald commented Feb 8, 2016

gojomo commented Feb 8, 2016

jberwald commented Feb 8, 2016

gojomo commented Feb 8, 2016