Confusing "TypeError: '<' not supported between … 'str' and 'int'" when doc-tag not present for `most_similar()` #1737

souveekbose · 2017-11-23T08:29:44Z

I am facing an issue while using the model.docvecs.most_similar() function.
gensim: 2.3.0
Python 3.6.0
Error message: '<' not supported between instances of 'str' and 'int'

My code follows:

def vectorize_doc2vec():
    sentence1 = 'Dogo is a dog'
    sentence2 = 'dog is a pet'
    sentence3 = 'pets are cool'
    sentence={sentence1:'j1', sentence2:'j2', sentence3:'j3'}
    
    sentences = LabeledLineSentence(sentence)
    
    model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
    model.build_vocab(sentences.to_array())
    
    model.train(sentences.sentences_perm(), total_examples=model.corpus_count, epochs=10)
    
    print(model.docvecs.most_similar('dog', topn=1))

The class LabeledLineSentence is as follows:

class LabeledLineSentence(object):
    def __init__(self, sources):
        self.sources = sources
        
        flipped = {}
        
        # make sure that keys are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')
    
    def __iter__(self):
        for description, id in self.sources.items():
            yield LabeledSentence(description, id)

    def to_array(self):
        self.sentences = []
        for description, id in self.sources.items():
            self.sentences.append(LabeledSentence(description, id))
        return self.sentences
    
    def sentences_perm(self):
        shuffle(self.sentences)
        return self.sentences

Error StackTrace:

INFO:gensim.models.doc2vec:collecting all words and their counts
WARNING:gensim.models.doc2vec:Each 'words' should be a list of words (usually unicode strings).First 'words' here is instead plain <class 'str'>.
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 14 word types and 4 unique tags from a corpus of 3 examples and 38 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=1 retains 14 unique words (100% of original 14, drops 0)
INFO:gensim.models.word2vec:min_count=1 leaves 38 word corpus (100% of original 38, drops 0)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 14 items
INFO:gensim.models.word2vec:sample=0.0001 downsamples 14 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 1 word corpus (3.7% of prior 38)
INFO:gensim.models.word2vec:estimated required memory for 14 words and 100 dimensions: 20600 bytes
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.word2vec:training model with 8 workers on 14 vocabulary and 100 features, using sg=0 hs=0 sample=0.0001 negative=5 window=10
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 7 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 6 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 5 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 4 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 3 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.word2vec:training on 380 raw words (77 effective words) took 0.0s, 3374 effective words/s
WARNING:gensim.models.word2vec:under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
INFO:gensim.models.doc2vec:precomputing L2-norms of doc weight vectors
Traceback (most recent call last):

File "", line 17, in #
 vectorize_doc2vec()
File "", line 14, in vectorize_doc2vec
print(model.docvecs.most_similar('dog', topn=1))

File "C:\Users\humblebee\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\models\doc2vec.py", line 461, in most_similar
elif doc in self.doctags or doc < self.count:

TypeError: '<' not supported between instances of 'str' and 'int'

#1586


gojomo · 2017-11-26T09:28:49Z

The tag dog is not in your model. An example's tags should be a list-of-tags, not a plain string. (Similarly, an example's words should be a list-of-words, not a plain string – as indicated by the WARNING in your 2nd log-line.)
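To make the point above concrete, here is a minimal sketch of why plain strings go wrong. It does not use gensim: `TaggedDoc` is a hypothetical stand-in namedtuple with `words` and `tags` fields, chosen only so the snippet runs on its own. Iterating a plain string yields single characters, which matches the counts in the log above (14 word types, 4 unique tags, 38 words).

```python
from collections import namedtuple

# Hypothetical stand-in for a (words, tags) training example; an assumption
# made so this sketch runs without gensim installed.
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])

raw = {
    "Dogo is a dog": "j1",
    "dog is a pet": "j2",
    "pets are cool": "j3",
}

# What the reported code effectively passed in: plain strings for both fields.
wrong = [TaggedDoc(words=text, tags=tag) for text, tag in raw.items()]

# Iterating a plain string yields single characters, so every character
# ends up being counted as a separate "word" or "tag".
word_types = {w for doc in wrong for w in doc.words}
tag_types = {t for doc in wrong for t in doc.tags}
total_words = sum(len(doc.words) for doc in wrong)
print(len(word_types), len(tag_types), total_words)  # 14 4 38

# What was intended: a list of tokens and a list of tags per example.
right = [TaggedDoc(words=text.split(), tags=[tag]) for text, tag in raw.items()]
right_tags = {t for doc in right for t in doc.tags}
print(sorted(right_tags))  # ['j1', 'j2', 'j3']
```

Note that even with well-formed examples, `'dog'` is a word rather than a doc-tag, so a doc-tag lookup would need one of `'j1'`, `'j2'`, or `'j3'`.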

We should in fact be showing a better error here, by verifying doc is an int before checking if it works as an int-index:

https://github.com/RaRe-Technologies/gensim/blob/7fabdbd281cfb4f09a7ff26f7da82223f192f766/gensim/models/doc2vec.py#L473
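One way the suggested check could look, as a standalone sketch rather than the actual gensim patch: `resolve_doc_index`, `doctags`, and `count` are hypothetical names used for illustration. The idea is to compare against `count` only when `doc` really is an integer, and to raise a readable error otherwise.

```python
import numbers


def resolve_doc_index(doc, doctags, count):
    """Hypothetical helper: map a doc-tag or a plain-int index to a row index."""
    # Known tags are looked up first.
    if doc in doctags:
        return doctags[doc]
    # Only treat `doc` as an index if it is an integer (bool excluded),
    # so a str never reaches the `<` comparison.
    is_int = isinstance(doc, numbers.Integral) and not isinstance(doc, bool)
    if is_int and 0 <= doc < count:
        return int(doc)
    raise KeyError(f"doc tag {doc!r} not present in the model")


doctags = {"j1": 0, "j2": 1, "j3": 2}
print(resolve_doc_index("j2", doctags, count=3))  # 1
print(resolve_doc_index(2, doctags, count=3))     # 2
try:
    resolve_doc_index("dog", doctags, count=3)
except KeyError as err:
    print("lookup failed:", err)
```

With a check of this shape, the reporter's call would fail with a message naming the missing tag instead of a `TypeError` about `'<'`.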

(Additionally, just as we're showing a WARNING when a plain-string is offered as words, we should do the same if a plain-string is offered as tags – so that at least people watching their log WARNINGs will have better chance of resolving such issues.)
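A rough sketch of such a warning, assuming a hypothetical `check_example` validator and logger name (neither is part of gensim): it flags a plain string in either field and logs one warning per problem.

```python
import logging

logger = logging.getLogger("doc_checks")  # hypothetical logger name


def check_example(words, tags):
    """Hypothetical validator: return (and log) problems with one example."""
    problems = []
    if isinstance(words, str):
        problems.append("'words' should be a list of words, got a plain str")
    if isinstance(tags, str):
        problems.append("'tags' should be a list of tags, got a plain str")
    for problem in problems:
        logger.warning(problem)
    return problems


# The shape used in the report triggers both warnings.
print(len(check_example("dog is a pet", "j2")))  # 2
# A well-formed example triggers none.
print(check_example(["dog", "is", "a", "pet"], ["j2"]))  # []
```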

piskvorky · 2017-11-26T09:34:17Z

Yes, that error message is terrible, for such a (relatively) common and easy-to-make mistake.

piskvorky · 2017-11-26T09:37:36Z

In fact, that whole int-vs-string design is clearly confusing; users report issues there all the time. At the same time, the "expected contract" behaviour is complex to explain, which is a possible code smell.

What can we do to make the API saner? Drop the ints? Drop strings? What was the original rationale for this design?

CC @manneshiva

menshikh-iv · 2017-11-27T04:16:16Z

@piskvorky one possible solution: write a clear docstring that describes the types and includes usage examples

piskvorky · 2017-11-27T13:35:04Z

Sure, better documentation always helps. But in this case, I wonder if we can do better with the API itself.

gojomo · 2017-11-30T20:36:11Z

The motivation for allowing plain-int tags, and handling them specially (in fact more-simply), is to allow sophisticated users the option of avoiding the string-to-int lookup dict overhead. That overhead could be significant for giant training sets. (The much more common issue, and the primary problem here, is failing to understand words should be a list-of-tokens, and tags a list-of-tags, rather than plain strings.)
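The trade-off described above can be illustrated with a toy sketch. This is not gensim's internal code; `index_string_tags` and `count_int_tags` are hypothetical helpers. String tags need a dict entry per distinct tag, while plain-int tags can serve directly as row indexes, so only the largest value has to be tracked.

```python
def index_string_tags(corpus_tags):
    """String tags: build a tag -> row-index dict (one entry per distinct tag)."""
    tag_to_index = {}
    for tags in corpus_tags:
        for tag in tags:
            tag_to_index.setdefault(tag, len(tag_to_index))
    return tag_to_index


def count_int_tags(corpus_tags):
    """Plain-int tags: the tag is the row index, so no dict is needed."""
    return max(tag for tags in corpus_tags for tag in tags) + 1


print(index_string_tags([["j1"], ["j2"], ["j1", "j3"]]))  # {'j1': 0, 'j2': 1, 'j3': 2}
print(count_int_tags([[0], [1], [2]]))                    # 3
```

For a corpus with many millions of documents, the dict in the first function is the overhead that int tags avoid.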

piskvorky · 2017-11-30T21:34:22Z

Sure, I remember that int optimization.

But can we streamline the API so that users stop failing to understand that?

Disable parameter type overloading, provide better checks, better error messages? "Too many users misunderstanding" = API needs some re-thinking.

gojomo · 2017-11-30T21:36:04Z

My suggestions are as in #1737 (comment)

gojomo changed the title ~~model.docvecs.most_similar not working~~ Confusing "TypeError: '<' not supported between … 'str' and 'int'" when doc-tag not present for most_similar() Nov 26, 2017

menshikh-iv added the "documentation" label Nov 27, 2017

gojomo added the "difficulty easy" and "good first issue" labels Nov 4, 2018

menshikh-iv assigned mpenkov Dec 14, 2018

mpenkov added the "Hacktoberfest" label Sep 28, 2019

gojomo added the "bug" label Jan 11, 2020
