[WIP] Added method to restrict vocab of Word2Vec most similar search #481

Closed
wants to merge 6 commits into from



@jimgoo jimgoo commented Oct 14, 2015

I've added a method to gensim.models.word2vec.py:

def most_similar_in_list(self, positive=[], negative=[], topn=10, restrict_vocab=None):

which allows restrict_vocab to be a list of words that the search is restricted to.

For example, these are the top 10 most similar results using the original most_similar method:

from gensim.models import Word2Vec
from nltk.corpus import brown

model = Word2Vec(brown.sents())
model.most_similar('vector')
[(u'V', 0.9310650825500488),
 (u'Q', 0.9126538634300232),
 (u'**zg', 0.9065112471580505),
 (u'null', 0.9064960479736328),
 (u'subspace', 0.9033916592597961),
 (u'intersection', 0.8994101881980896),
 (u'T', 0.8956964015960693),
 (u'staining', 0.8929149508476257),
 (u'secant', 0.8926326036453247),
 (u'concentration', 0.883994460105896)]

And we can restrict the search to a list of words with the new most_similar_in_list method:

model.most_similar_in_list('vector', restrict_vocab=['subspace', 'intersection', 'secant'])
[(u'subspace', 0.9033915996551514),
 (u'intersection', 0.8994101881980896),
 (u'secant', 0.8926326036453247)]

Passing an integer for restrict_vocab gives the same behavior as the original method:

model.most_similar('vector', restrict_vocab=3) == model.most_similar_in_list('vector', restrict_vocab=3)
True

For large vocabularies, there is some benefit to reducing the number of rows in the limited matrix when you're only interested in a subset of words:

dists = dot(limited, mean) 

The number of rows is len(restrict_vocab) rather than the total number of words in the vocab.
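
A minimal NumPy sketch of the row-count point, using a toy unit-normalized matrix in place of the model's vectors. The names vocab, vectors, word_to_row and restricted_words are illustrative assumptions, not code from the PR:

```python
import numpy as np

# Toy stand-ins for the model's vocabulary and unit-normalized vectors (assumed data).
vocab = ["vector", "subspace", "intersection", "secant", "staining"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(vocab), 8))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

word_to_row = {word: row for row, word in enumerate(vocab)}
mean = vectors[word_to_row["vector"]]  # query vector

# Full search: one dot product per vocabulary word.
full_dists = np.dot(vectors, mean)

# Restricted search: select only the rows of interest before the dot product.
restricted_words = ["subspace", "intersection", "secant"]
rows = [word_to_row[w] for w in restricted_words]
limited = vectors[rows]
dists = np.dot(limited, mean)

print(full_dists.shape, dists.shape)  # (5,) (3,)
```

The restricted scores are the same values the full search produces for those words; only the amount of work changes.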

@piskvorky
Owner

Nice, that sounds really useful!

I'm thinking this should even be the "default" (promoted) way of using the restrict_vocab parameter: you pass in a list of words you care about. The "int" version is kind of opaque and non-explicit, I like this "list of words" better. Same goes for the model.accuracy() method.

It's also easy to simulate "int" by means of "list of words", but not the other way round, so "list of words" is more flexible.
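
A small sketch of that asymmetry, assuming the integer form means "search only the first N words in vocabulary order". The index2word list and both helper functions below are hypothetical, for illustration only:

```python
# Hypothetical vocabulary in index order, most frequent word first.
index2word = ["the", "of", "vector", "subspace", "secant"]


def first_n_words(index2word, n):
    """Return the word list equivalent to the integer form restrict_vocab=n."""
    return list(index2word[:n])


def as_prefix_length(index2word, words):
    """Return n if `words` is exactly the first n vocabulary words, else None."""
    n = len(words)
    return n if list(words) == list(index2word[:n]) else None


print(first_n_words(index2word, 3))                          # ['the', 'of', 'vector']
print(as_prefix_length(index2word, ["subspace", "secant"]))  # None
```

Any integer maps to a word list, but an arbitrary word list only maps back to an integer when it happens to be a prefix of the vocabulary.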

We probably don't need an extra most_similar_in_list method though; is there a reason not to add this functionality directly to most_similar?

@jimgoo
Author

jimgoo commented Oct 15, 2015

Cool, if you like it then there is no reason to have both methods. I made it separate so I could test against the original and make sure they matched. I'll rename most_similar_in_list to most_similar and commit the change.

@tmylk
Contributor

tmylk commented Oct 15, 2015

@jimgoo Speaking of testing, could you please add some tests for the new feature to /test/test_word2vec.py#L175?

@tmylk
Contributor

tmylk commented Jan 10, 2016

@jimgoo Please fix the Python 3 syntax issues, and add a CHANGELOG entry and a test.
Then it can go into the January Gensim release!

@tmylk
Contributor

tmylk commented Jan 23, 2016

Hey @jimgoo, please post another commit to trigger the Travis build. Ignore the AppVeyor test failures for now; we are working to fix them. I would expect the Travis Unix tests to be green after the next commit.

@atran

atran commented Mar 1, 2016

+1 @jimgoo, I think it's just a print statement.

@hitochan777

It seems that this PR has not been incorporated into the latest gensim. Is there any update?

@gojomo
Collaborator

gojomo commented Sep 7, 2016

Note that the current implementation is very inefficient when used with large lists of eligible words. (It converts the argument to a set and then to a list. It then calculates all distances, and finally does a linear probe of the restricted list for every word to test whether that word is in the list.)

Also, the PR seems to include other unrelated code for other features.

I would suggest splitting such functionality off into a different method, perhaps most_similar_among(), and refactoring to truly limit the collection-duplication and distance-calculations. (This could also avoid the clunky type-based overloading of the restrict_vocab parameter.)
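
One way such a refactoring could look, as a rough sketch only: the standalone function below borrows the suggested name most_similar_among, but its signature, the word_to_row mapping, and the toy vectors are all assumptions made for illustration and are not gensim's API. It indexes the eligible rows first, so distances are computed only for the candidates and no per-word list probing is needed:

```python
import numpy as np


def most_similar_among(vectors, word_to_row, query, candidates, topn=10):
    """Rank `candidates` by cosine similarity to `query`.

    Hypothetical standalone helper, not gensim's API. `vectors` must be
    unit-normalized so that a dot product equals cosine similarity.
    """
    # Deduplicate while preserving order; drop unknown words and the query itself.
    eligible = [w for w in dict.fromkeys(candidates)
                if w in word_to_row and w != query]
    if not eligible:
        return []
    rows = np.array([word_to_row[w] for w in eligible], dtype=np.intp)
    # Similarities are computed only for the eligible rows.
    sims = vectors[rows] @ vectors[word_to_row[query]]
    order = np.argsort(-sims)[:topn]
    return [(eligible[i], float(sims[i])) for i in order]


# Toy unit vectors (assumed data): similarity to "q" is 0.6, 0.8 and 0.0.
word_to_row = {"q": 0, "a": 1, "b": 2, "c": 3}
vectors = np.array([[1.0, 0.0], [0.6, 0.8], [0.8, 0.6], [0.0, 1.0]])

result = most_similar_among(vectors, word_to_row, "q", ["a", "c", "b", "a", "zzz"])
print([word for word, _ in result])  # ['b', 'a', 'c']
```

Taking the candidate list as its own parameter also sidesteps the type-based overloading of restrict_vocab mentioned above.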

@tmylk tmylk added the "feature" and "difficulty easy" labels Sep 27, 2016
@tmylk tmylk changed the title from "Added method to restrict vocab of Word2Vec most similar search" to "[WIP] Added method to restrict vocab of Word2Vec most similar search" Oct 4, 2016
@shubhvachher
Contributor

I built this out for a recent project. Should I complete this issue?

@tmylk
Contributor

tmylk commented May 2, 2017

Duplicate of #1229
