[WIP] Added method to restrict vocab of Word2Vec most similar search #481

jimgoo · 2015-10-14T22:02:11Z

I've added a method to gensim.models.word2vec.py:

def most_similar_in_list(self, positive=[], negative=[], topn=10, restrict_vocab=None):

which allows restrict_vocab to be a list containing words to restrict the search over.

For example, these are the top 10 most similar results using the original most_similar method:

from gensim.models import Word2Vec
from nltk.corpus import brown

model = Word2Vec(brown.sents())
model.most_similar('vector')

[(u'V', 0.9310650825500488),
 (u'Q', 0.9126538634300232),
 (u'**zg', 0.9065112471580505),
 (u'null', 0.9064960479736328),
 (u'subspace', 0.9033916592597961),
 (u'intersection', 0.8994101881980896),
 (u'T', 0.8956964015960693),
 (u'staining', 0.8929149508476257),
 (u'secant', 0.8926326036453247),
 (u'concentration', 0.883994460105896)]

And we can restrict the search to a list of words with the new most_similar_in_list method:

model.most_similar_in_list('vector', restrict_vocab=['subspace', 'intersection', 'secant'])

[(u'subspace', 0.9033915996551514),
 (u'intersection', 0.8994101881980896),
 (u'secant', 0.8926326036453247)]

Passing an integer for restrict_vocab behaves the same as the original method:

model.most_similar('vector', restrict_vocab=3) == model.most_similar_in_list('vector', restrict_vocab=3)

True

For large vocabularies, there is some benefit to reducing the number of rows in the limited matrix when you're only interested in a subset of words:

dists = dot(limited, mean)

The number of rows is len(restrict_vocab) rather than the total number of words in the vocab.
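To make the row-count point concrete, here is a minimal NumPy sketch on toy data. It is not the code from this PR: the vocabulary, the random vectors, and the word_to_row mapping are all made up for illustration, and syn0norm here is just a local array standing in for the model's unit-normalized vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (hypothetical data, not a trained model).
vocab = ['vector', 'subspace', 'intersection', 'secant', 'staining', 'null']
syn0norm = rng.normal(size=(len(vocab), 8))
syn0norm /= np.linalg.norm(syn0norm, axis=1, keepdims=True)  # unit-length rows
word_to_row = {word: i for i, word in enumerate(vocab)}

mean = syn0norm[word_to_row['vector']]  # query vector
restrict_vocab = ['subspace', 'intersection', 'secant']

# Select only the rows for the requested words, then take one dot product.
rows = [word_to_row[w] for w in restrict_vocab]
limited = syn0norm[rows]        # shape: (len(restrict_vocab), 8)
dists = np.dot(limited, mean)   # one similarity per requested word

# Reference: similarities against the whole vocabulary.
full = np.dot(syn0norm, mean)
```

The restricted scores match the corresponding entries of the full product, but only len(restrict_vocab) rows take part in the multiplication.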

piskvorky · 2015-10-15T03:18:15Z

Nice, that sounds really useful!

I'm thinking this should even be the "default" (promoted) way of using the restrict_vocab parameter: you pass in a list of words you care about. The "int" version is kind of opaque and non-explicit, I like this "list of words" better. Same goes for the model.accuracy() method.

It's also easy to simulate "int" by means of "list of words", but not the other way round, so "list of words" is more flexible.
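As a rough sketch of that simulation: the integer form only looks at the first n vectors in vocabulary order, so the equivalent word list is just the first n entries of the index-ordered vocabulary. The helper name and the toy list below are hypothetical; in a real model the list would come from the model's own index-ordered vocabulary.

```python
def words_for_int_restriction(index2word, n):
    """Return the word list equivalent to an integer restriction of size n."""
    return list(index2word[:n])


# Toy index-ordered vocabulary (hypothetical; most frequent words first).
index2word = ['the', 'of', 'vector', 'subspace', 'secant']

restrict_vocab = words_for_int_restriction(index2word, 3)
print(restrict_vocab)  # ['the', 'of', 'vector']
```

Going the other way is not possible in general, because an arbitrary word list need not be a prefix of the vocabulary order.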

We probably don't need an extra most_similar_in_list method though; is there a reason not to add this functionality directly to most_similar?

jimgoo · 2015-10-15T05:31:40Z

Cool, if you like it then there is no reason to have both methods. I made it separate so I could test against the original and make sure they matched. I'll rename most_similar_in_list to most_similar and commit the change.

tmylk · 2015-10-15T08:29:06Z

@jimgoo Speaking of testing, could you please add some tests for the new feature to /test/test_word2vec.py#L175?

tmylk · 2016-01-10T08:43:53Z

@jimgoo Please fix the Python 3 syntax issues, and add a CHANGELOG entry and a test.
Then it can go into the January Gensim release!

tmylk · 2016-01-23T21:20:54Z

Hey @jimgoo, please post another commit to trigger the Travis build. And ignore appveyor test failures for now - we are working to fix them. But I would expect Travis unix tests to be green after the next commit.

atran · 2016-03-01T02:20:53Z

+1 @jimgoo, I think it's just a print statement.

hitochan777 · 2016-09-02T05:02:10Z

It seems that this PR has not been incorporated into the latest gensim. Is there any update?

gojomo · 2016-09-07T19:40:59Z

Note that the current implementation is very inefficient if used with large lists of eligible-words. (It takes the argument and converts to a set then a list. Then it calculates all distances, then does linear-probes against the restricted-list to test if every word is in the list.)

Also, the PR seems to include other unrelated code for other features.

I would suggest splitting such functionality off into a different method, perhaps most_similar_among(), and refactoring to truly limit the collection-duplication and distance-calculations. (This could also avoid the clunky type-based overloading of the restrict_vocab parameter.)
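One possible shape for such a method, as a sketch only: most_similar_among below is a standalone function with a made-up signature, not an existing gensim API, and it assumes the vectors are already unit-normalized and that a word-to-row dict is available. It looks up each candidate once, computes similarities only for those rows, and never scans the restricted list per result.

```python
import numpy as np


def most_similar_among(vectors, word_to_row, query_word, candidates, topn=10):
    """Rank `candidates` by cosine similarity to `query_word`.

    Hypothetical helper: assumes rows of `vectors` are unit-normalized.
    """
    # Deduplicate while keeping order; skip unknown words and the query itself.
    seen = set()
    words = []
    for word in candidates:
        if word in word_to_row and word != query_word and word not in seen:
            seen.add(word)
            words.append(word)
    if not words:
        return []

    rows = np.array([word_to_row[w] for w in words], dtype=np.intp)
    query = vectors[word_to_row[query_word]]
    sims = vectors[rows] @ query          # only len(words) dot products
    order = np.argsort(-sims)[:topn]      # best matches first
    return [(words[i], float(sims[i])) for i in order]


# Toy unit vectors (hypothetical data).
vectors = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]])
word_to_row = {'a': 0, 'b': 1, 'c': 2, 'd': 3}

result = most_similar_among(vectors, word_to_row, 'a',
                            ['c', 'd', 'b', 'zzz', 'd'], topn=2)
```

Because the candidate list is its own argument, restrict_vocab could keep its integer-only meaning on most_similar.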

shubhvachher · 2017-03-22T02:11:17Z

I had built this out for a recent project. Should I complete this issue?

tmylk · 2017-05-02T21:44:45Z

Duplicate of #1229

Added method to restrict vocab of Word2Vec most similar search

9b8ade9

jimgoo added 2 commits October 27, 2015 15:13

Removed old most_similar method, renamed new method

51d0bc2

Added support for pretrained word2vec model.

d389db1

Removed unwanted cast to ASCII.

3972f9c

jimgoo added 2 commits February 9, 2016 10:25

Triggering Travis build.

3462e60

Merge branch 'develop' of https://github.com/jimgoo/gensim into develop

122055a

piskvorky assigned tmylk Sep 2, 2016

tmylk added the "feature" and "difficulty easy" labels Sep 27, 2016

tmylk changed the title ~~Added method to restrict vocab of Word2Vec most similar search~~ [WIP] Added method to restrict vocab of Word2Vec most similar search Oct 4, 2016

shubhvachher mentioned this pull request Mar 22, 2017

[WIP] Add new restrict_vocab functionality, most_similar_among #1229

Closed

tmylk closed this May 2, 2017

gojomo mentioned this pull request Oct 16, 2017

Add "most_similar_to_given" method for KeyedVectors #1582

Merged
