Confusing "TypeError: '<' not supported between … 'str' and 'int'" when doc-tag not present for most_similar() #1737

Open
souveekbose opened this issue Nov 23, 2017 · 8 comments
Labels: bug, difficulty easy, documentation, good first issue, Hacktoberfest

Comments

@souveekbose

souveekbose commented Nov 23, 2017

I am facing an issue while using the model.docvecs.most_similar() function.
gensim: 2.3.0
Python 3.6.0
Error message: '<' not supported between instances of 'str' and 'int'

My code follows:

def vectorize_doc2vec():
    sentence1 = 'Dogo is a dog'
    sentence2 = 'dog is a pet'
    sentence3 = 'pets are cool'
    sentence={sentence1:'j1', sentence2:'j2', sentence3:'j3'}
    
    sentences = LabeledLineSentence(sentence)
    
    model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
    model.build_vocab(sentences.to_array())
    
    model.train(sentences.sentences_perm(), total_examples=model.corpus_count, epochs=10)
    
    print(model.docvecs.most_similar('dog', topn=1))

The class LabeledLineSentence is as follows:

class LabeledLineSentence(object):
    def __init__(self, sources):
        self.sources = sources
        
        flipped = {}
        
        # make sure that keys are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')
    
    def __iter__(self):
        for description, id in self.sources.items():
            yield LabeledSentence(description, id)

    def to_array(self):
        self.sentences = []
        for description, id in self.sources.items():
            self.sentences.append(LabeledSentence(description, id))
        return self.sentences
    
    def sentences_perm(self):
        shuffle(self.sentences)
        return self.sentences

Error StackTrace:

INFO:gensim.models.doc2vec:collecting all words and their counts
WARNING:gensim.models.doc2vec:Each 'words' should be a list of words (usually unicode strings).First 'words' here is instead plain <class 'str'>.
INFO:gensim.models.doc2vec:PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
INFO:gensim.models.doc2vec:collected 14 word types and 4 unique tags from a corpus of 3 examples and 38 words
INFO:gensim.models.word2vec:Loading a fresh vocabulary
INFO:gensim.models.word2vec:min_count=1 retains 14 unique words (100% of original 14, drops 0)
INFO:gensim.models.word2vec:min_count=1 leaves 38 word corpus (100% of original 38, drops 0)
INFO:gensim.models.word2vec:deleting the raw counts dictionary of 14 items
INFO:gensim.models.word2vec:sample=0.0001 downsamples 14 most-common words
INFO:gensim.models.word2vec:downsampling leaves estimated 1 word corpus (3.7% of prior 38)
INFO:gensim.models.word2vec:estimated required memory for 14 words and 100 dimensions: 20600 bytes
INFO:gensim.models.word2vec:resetting layer weights
INFO:gensim.models.word2vec:training model with 8 workers on 14 vocabulary and 100 features, using sg=0 hs=0 sample=0.0001 negative=5 window=10
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 7 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 6 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 5 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 4 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 3 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 2 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 1 more threads
INFO:gensim.models.word2vec:worker thread finished; awaiting finish of 0 more threads
INFO:gensim.models.word2vec:training on 380 raw words (77 effective words) took 0.0s, 3374 effective words/s
WARNING:gensim.models.word2vec:under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
INFO:gensim.models.doc2vec:precomputing L2-norms of doc weight vectors
Traceback (most recent call last):
  File "", line 17, in <module>
    vectorize_doc2vec()
  File "", line 14, in vectorize_doc2vec
    print(model.docvecs.most_similar('dog', topn=1))
  File "C:\Users\humblebee\AppData\Local\Continuum\Anaconda3\lib\site-packages\gensim\models\doc2vec.py", line 461, in most_similar
    elif doc in self.doctags or doc < self.count:
TypeError: '<' not supported between instances of 'str' and 'int'

#1586

@gojomo
Collaborator

gojomo commented Nov 26, 2017

The tag dog is not in your model. An example's tags should be a list-of-tags, not a plain string. (Similarly, an example's words should be a list-of-words, not a plain string – as indicated by the WARNING in your 2nd log-line.)
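
The counts in the log line "collected 14 word types and 4 unique tags from a corpus of 3 examples and 38 words" are consistent with plain strings being iterated one character at a time. The sketch below uses only the standard library to illustrate that; `Example` and `scan` are hypothetical stand-ins, not gensim's classes or its vocabulary scan.

```python
from collections import namedtuple

# Hypothetical stand-in for a labeled example; only the field names matter here.
Example = namedtuple("Example", ["words", "tags"])

corpus = {
    "Dogo is a dog": "j1",
    "dog is a pet": "j2",
    "pets are cool": "j3",
}


def scan(examples):
    """Iterate over words and tags, returning (word types, tag types, total words)."""
    word_types, tag_types, total_words = set(), set(), 0
    for example in examples:
        for word in example.words:
            word_types.add(word)
            total_words += 1
        for tag in example.tags:
            tag_types.add(tag)
    return len(word_types), len(tag_types), total_words


# As in the report: plain strings, so iteration yields single characters.
as_strings = [Example(text, tag) for text, tag in corpus.items()]

# As suggested above: a list of words and a list of tags.
as_lists = [Example(text.split(), [tag]) for text, tag in corpus.items()]

print(scan(as_strings))  # (14, 4, 38), matching the numbers in the log
print(scan(as_lists))    # (8, 3, 11)
```

With lists, the three tags j1, j2 and j3 survive intact, so they are the keys a tag lookup could find; 'dog' is a word here, not a tag, in either form.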

We should in fact be showing a better error here, by verifying doc is an int before checking if it works as an int-index:

https://github.com/RaRe-Technologies/gensim/blob/7fabdbd281cfb4f09a7ff26f7da82223f192f766/gensim/models/doc2vec.py#L473
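
As a rough sketch of what such a guard might look like (this is not gensim's code: `check_doc` is a hypothetical helper, and `doctags` and `count` are plain stand-ins for the attributes named in the traceback):

```python
def check_doc(doc, doctags, count):
    """Return doc if it is a known tag or a valid int index, else raise KeyError."""
    if doc in doctags:
        return doc
    # Only compare against count when doc really is an int, so an
    # unknown string tag never reaches the '<' comparison.
    if isinstance(doc, int) and 0 <= doc < count:
        return doc
    raise KeyError("doc %r is neither a known tag nor a valid int index" % (doc,))


doctags = {"j1": 0, "j2": 1, "j3": 2}  # assumed tag-to-offset mapping

print(check_doc("j1", doctags, 3))  # a known string tag passes through
print(check_doc(2, doctags, 3))     # an in-range int index passes through

try:
    check_doc("dog", doctags, 3)
except KeyError as err:
    print("KeyError:", err)  # unknown tag gives a readable error instead of TypeError
```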

(Additionally, just as we're showing a WARNING when a plain-string is offered as words, we should do the same if a plain-string is offered as tags – so that at least people watching their log WARNINGs will have better chance of resolving such issues.)
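
A warning of that kind could be as small as the following sketch. The helper name `warn_if_plain_string_tags` and the message wording are assumptions modelled on the existing 'words' warning, not an existing gensim function.

```python
import logging

logger = logging.getLogger(__name__)


def warn_if_plain_string_tags(tags):
    """Log a warning and return True when tags is a plain string rather than a list."""
    if isinstance(tags, str):
        logger.warning(
            "Each 'tags' should be a list of tags. "
            "First 'tags' here is instead plain %s.",
            type(tags),
        )
        return True
    return False


print(warn_if_plain_string_tags("j1"))    # True, and a WARNING is logged
print(warn_if_plain_string_tags(["j1"]))  # False, nothing is logged
```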

@gojomo gojomo changed the title model.docvecs.most_similar not working Confusing "TypeError: '<' not supported between … 'str' and 'int'" when doc-tag not present for most_similar() Nov 26, 2017
@piskvorky
Owner

piskvorky commented Nov 26, 2017

Yes, that error message is terrible, for such a (relatively) common and easy-to-make mistake.

@piskvorky
Owner

piskvorky commented Nov 26, 2017

In fact, that whole int-vs-string design is clearly confusing; users report issues there all the time. At the same time, the "expected contract" behaviour is complex to explain, which is a possible code smell.

What can we do to make the API saner? Drop the ints? Drop strings? What was the original rationale for this design?

CC @manneshiva

@menshikh-iv added the documentation label Nov 27, 2017
@menshikh-iv
Contributor

@piskvorky one possible solution: write a clear docstring that describes the types and gives examples of usage

@piskvorky
Owner

Sure, better documentation always helps. But in this case, I wonder if we can do better with the API itself.

@gojomo
Collaborator

gojomo commented Nov 30, 2017

The motivation for allowing plain-int tags, and handling them specially (in fact more-simply), is to allow sophisticated users the option of avoiding the string-to-int lookup dict overhead. That overhead could be significant for giant training sets. (The much more common issue, and the primary problem here, is failing to understand words should be a list-of-tokens, and tags a list-of-tags, rather than plain strings.)
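
To illustrate the overhead being avoided, here is a simplified sketch, not gensim's actual storage layout: `vectors`, `tag_to_row` and `row_for` are hypothetical names. An int tag can be used as the row index directly, while a string tag needs one extra dict entry per document.

```python
# One vector per document, stored by row.
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Needed only for string tags: one dict entry per document.
tag_to_row = {"j1": 0, "j2": 1, "j3": 2}


def row_for(tag):
    """Map a tag to its row: ints are used as-is, strings go through the dict."""
    if isinstance(tag, int):
        return tag
    return tag_to_row[tag]


print(vectors[row_for(1)])     # [0.3, 0.4], no lookup dict involved
print(vectors[row_for("j2")])  # [0.3, 0.4], via the dict
```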

@piskvorky
Owner

piskvorky commented Nov 30, 2017

Sure, I remember that int optimization.

But can we streamline the API so that users stop failing to understand that?

Disable parameter type overloading, provide better checks, better error messages? "Too many users misunderstanding" = API needs some re-thinking.

@gojomo
Collaborator

gojomo commented Nov 30, 2017

My suggestions are as in #1737 (comment)

@gojomo added the difficulty easy and good first issue labels Nov 4, 2018
@mpenkov added the Hacktoberfest label Sep 28, 2019
@gojomo added the bug label Jan 11, 2020