Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WMD Earth Mover distance problems #981

Closed
josepablog opened this issue Oct 28, 2016 · 7 comments
Closed

WMD Earth Mover distance problems #981

josepablog opened this issue Oct 28, 2016 · 7 comments

Comments

@josepablog
Copy link

josepablog commented Oct 28, 2016

I'm not able to successfully run the sample WMD code:

#Import libraries:
import gensim
from gensim.models.word2vec import Word2Vec
import nltk
stopwords = nltk.corpus.stopwords.words('english')

# Load pre-trained vectors:
wv = Word2Vec.load_word2vec_format("data/GoogleNews-vectors-negative300.bin.gz",binary=True)
wv.init_sims(replace=True)
wv.save("cache_wv")

# Sample sentences:
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()

# Remove stopwords:
stopwords = nltk.corpus.stopwords.words('english')
sentence_obama = [w for w in sentence_obama if w not in stopwords]
sentence_president = [w for w in sentence_president if w not in stopwords]

# Print distance
print wv.wmdistance(sentence_obama, sentence_president)

This returns 1.25. This is troublesome because it is more than 1, and because the correct answer should be ~0.74

@josepablog
Copy link
Author

josepablog commented Oct 28, 2016

Mmm... I've been looking at the code

In word2vec.py

def wmdistance(self, document1, document2):

...
distance_matrix = zeros((vocab_len, vocab_len), dtype=double)
for i, t1 in dictionary.items():
    for j, t2 in dictionary.items():
      if not t1 in docset1 or not t2 in docset2:
          continue
       # Compute Euclidean distance between word vectors.
       distance_matrix[i, j] = sqrt(np_sum((self[t1] - self[t2])**2))

If I replace it with:

def wmdistance(self, document1, document2):

...
W_ = [self[w] for w in dictionary.values()]
distance_matrix = euclidean_distances(W_).astype(np.double)
distance_matrix /= distance_matrix.max()

The return value is 0.75

Do we need to normalize the distance matrix?

@josepablog
Copy link
Author

Argh. I just noticed that the blog is deviating from the paper, and this implementation returns the value from the paper.

Though, the code would be simpler if used euclidean_distances

@tmylk
Copy link
Contributor

tmylk commented Oct 28, 2016

@josepablog Thanks for your suggestion. We don't have a dependency on sklearn so unfortunately can't use euclidean_distances

@piskvorky
Copy link
Owner

piskvorky commented Oct 29, 2016

Computing euclidean distances is trivial (one-liner) and should not need any external dependencies or inefficiencies.

@tmylk .items() in the code above will create unnecessary lists -- why doesn't this use lazy iteration (via six for py2/py3 compatibility)?

Also, since the distance is symmetric (and zero along diagonal), isn't this doing double work?

The whole nested loop looks inefficient and should be using numpy broadcasting and array operations.

@narvind2003
Copy link

Thanks for this awesome implementation!
I'm trying to use pre-trained embeddings (glove, fasttext) to compute wmdistance between sentences.

@josepablog, @tmylk , @piskvorky - I see that this issue is closed but the wmdistance is still doing the ineffcient loops. Could you please share any updates since your last post?

Have you or anyone else been able to make wmdistance work more accurately and faster(say, using numpy vectorized operations)?

@piskvorky
Copy link
Owner

piskvorky commented Sep 21, 2017

@narvind2003 not that I know of. CC @menshikh-iv

Such optimizations look trivial though, do you want to give it a try?

@menshikh-iv
Copy link
Contributor

@narvind2003 yes, we can help.
Please start your improvements and create PR, in PR you can ping us and we'll discuss any problems/ideas:+1:

simonwiles added a commit to simonwiles/gensim that referenced this issue Nov 17, 2020
I'm not sure if this fully addresses the
`# TODO: Update to better match & share code with most_similar()`
at line piskvorky#981 or not, so I've left it in.
mpenkov added a commit that referenced this issue Aug 13, 2021
* Allow supplying a string-key as the negative arg. to most_similar()

* Allow a single vector as a positive or negative arg. to most_similar()

* Update comments

* Accept single arguments when positive and negative are both supplied

* Update most_similar_cosmul to match most_similar

I'm not sure if this fully addresses the
`# TODO: Update to better match & share code with most_similar()`
at line #981 or not, so I've left it in.

* minor code cleanup

* add unit tests

* Update CHANGELOG.md

* remove redundant variable declaration

* enforce consistency

* respond to review feedback

* Update keyedvectors.py

Co-authored-by: Michael Penkov <misha.penkov@gmail.com>
Co-authored-by: Michael Penkov <m@penkov.dev>
tbbharaj pushed a commit to tbbharaj/gensim that referenced this issue Aug 19, 2021
* Allow supplying a string-key as the negative arg. to most_similar()

* Allow a single vector as a positive or negative arg. to most_similar()

* Update comments

* Accept single arguments when positive and negative are both supplied

* Update most_similar_cosmul to match most_similar

I'm not sure if this fully addresses the
`# TODO: Update to better match & share code with most_similar()`
at line piskvorky#981 or not, so I've left it in.

* minor code cleanup

* add unit tests

* Update CHANGELOG.md

* remove redundant variable declaration

* enforce consistency

* respond to review feedback

* Update keyedvectors.py

Co-authored-by: Michael Penkov <misha.penkov@gmail.com>
Co-authored-by: Michael Penkov <m@penkov.dev>
tbbharaj pushed a commit to tbbharaj/gensim that referenced this issue Aug 19, 2021
* Allow supplying a string-key as the negative arg. to most_similar()

* Allow a single vector as a positive or negative arg. to most_similar()

* Update comments

* Accept single arguments when positive and negative are both supplied

* Update most_similar_cosmul to match most_similar

I'm not sure if this fully addresses the
`# TODO: Update to better match & share code with most_similar()`
at line piskvorky#981 or not, so I've left it in.

* minor code cleanup

* add unit tests

* Update CHANGELOG.md

* remove redundant variable declaration

* enforce consistency

* respond to review feedback

* Update keyedvectors.py

Co-authored-by: Michael Penkov <misha.penkov@gmail.com>
Co-authored-by: Michael Penkov <m@penkov.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants