WMD Earth Mover distance problems #981

josepablog · 2016-10-28T01:11:19Z

I'm not able to successfully run the sample WMD code:

# Import libraries:
import gensim
from gensim.models.word2vec import Word2Vec
import nltk

# Load pre-trained vectors:
wv = Word2Vec.load_word2vec_format("data/GoogleNews-vectors-negative300.bin.gz", binary=True)
wv.init_sims(replace=True)
wv.save("cache_wv")

# Sample sentences:
sentence_obama = 'Obama speaks to the media in Illinois'.lower().split()
sentence_president = 'The president greets the press in Chicago'.lower().split()

# Remove stopwords:
stopwords = nltk.corpus.stopwords.words('english')
sentence_obama = [w for w in sentence_obama if w not in stopwords]
sentence_president = [w for w in sentence_president if w not in stopwords]

# Print distance:
print(wv.wmdistance(sentence_obama, sentence_president))

This returns 1.25. That is troubling because it is greater than 1, and because the correct answer should be ~0.74.


josepablog · 2016-10-28T04:39:19Z

Mmm... I've been looking at the code.

In word2vec.py:

def wmdistance(self, document1, document2):
    ...
    distance_matrix = zeros((vocab_len, vocab_len), dtype=double)
    for i, t1 in dictionary.items():
        for j, t2 in dictionary.items():
            if t1 not in docset1 or t2 not in docset2:
                continue
            # Compute Euclidean distance between word vectors.
            distance_matrix[i, j] = sqrt(np_sum((self[t1] - self[t2])**2))

If I replace it with:

def wmdistance(self, document1, document2):
    ...
    W_ = [self[w] for w in dictionary.values()]
    distance_matrix = euclidean_distances(W_).astype(np.double)
    distance_matrix /= distance_matrix.max()

then the return value is 0.75.

Do we need to normalize the distance matrix?
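
On the normalization question: the transport cost is linear in the distance matrix, so dividing the matrix by its maximum should rescale the final distance by the same factor rather than change which words get matched. Below is a minimal sketch of that effect. It assumes equal-length documents with uniform word weights, in which case an optimal plan can be taken to be a one-to-one matching and found by brute force. The function toy_wmd and the 2-D vectors are hypothetical and are not gensim code.

```python
import itertools
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def toy_wmd(doc1_vecs, doc2_vecs, normalize=False):
    """Brute-force WMD for two equal-length docs with uniform word weights.

    Hypothetical helper for illustration only; not the gensim implementation.
    """
    n = len(doc1_vecs)
    if n != len(doc2_vecs):
        raise ValueError("this sketch assumes documents of equal length")

    # Cost of moving word i of doc1 onto word j of doc2.
    cost = [[euclidean(u, v) for v in doc2_vecs] for u in doc1_vecs]

    if normalize:
        # Mirror `distance_matrix /= distance_matrix.max()`: the maximum is
        # taken over all pairs of words that appear in either document.
        all_vecs = list(doc1_vecs) + list(doc2_vecs)
        max_dist = max(euclidean(u, v) for u in all_vecs for v in all_vecs)
        cost = [[c / max_dist for c in row] for row in cost]

    # Each word carries mass 1/n, so try every one-to-one matching.
    return min(
        sum(cost[i][perm[i]] for i in range(n)) / n
        for perm in itertools.permutations(range(n))
    )


# Made-up 2-D "word vectors", purely for illustration.
doc1 = [(0.0, 0.0), (1.0, 0.0)]
doc2 = [(0.0, 1.0), (3.0, 4.0)]

raw = toy_wmd(doc1, doc2)
scaled = toy_wmd(doc1, doc2, normalize=True)
print(raw, scaled, raw / scaled)
```

In this toy case the ratio raw / scaled is the largest pairwise distance, which suggests the normalized variant reports the same matching on a different scale.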

josepablog · 2016-10-28T04:57:02Z

Argh. I just noticed that the blog deviates from the paper, and this implementation returns the value from the paper.

Still, the code would be simpler if it used euclidean_distances.

tmylk · 2016-10-28T07:37:30Z

@josepablog Thanks for your suggestion. We don't have a dependency on sklearn, so unfortunately we can't use euclidean_distances.

piskvorky · 2016-10-29T02:11:04Z

Computing Euclidean distances is trivial (a one-liner) and should not require any external dependencies or introduce inefficiencies.

@tmylk .items() in the code above will create unnecessary lists -- why doesn't this use lazy iteration (via six for py2/py3 compatibility)?

Also, since the distance is symmetric (and zero along the diagonal), isn't this doing double work?

The whole nested loop looks inefficient and should be using numpy broadcasting and array operations.
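
The broadcasting approach could look roughly like the sketch below. It is not the gensim implementation: pairwise_euclidean is a hypothetical helper name, and it assumes the word vectors for the dictionary have already been stacked into a single 2-D array, one row per word.

```python
import numpy as np


def pairwise_euclidean(vectors):
    """Return the (n, n) matrix of Euclidean distances between rows.

    Hypothetical helper for illustration; `vectors` has shape (n_words, dim).
    """
    vectors = np.asarray(vectors, dtype=np.double)
    # Shape (n, 1, dim) minus shape (1, n, dim) broadcasts to (n, n, dim).
    diff = vectors[:, np.newaxis, :] - vectors[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Made-up vectors, purely for illustration.
example = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(pairwise_euclidean(example))
```

Both halves of the symmetric matrix are still computed, but in compiled array code rather than in Python loops. The intermediate array holds n * n * dim values, which is small for a pair of sentences.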

narvind2003 · 2017-09-21T17:53:06Z

Thanks for this awesome implementation!
I'm trying to use pre-trained embeddings (GloVe, fastText) to compute wmdistance between sentences.

@josepablog, @tmylk, @piskvorky - I see that this issue is closed, but wmdistance is still using the inefficient loops. Could you please share any updates since your last post?

Have you or anyone else been able to make wmdistance more accurate and faster (say, using numpy vectorized operations)?

piskvorky · 2017-09-21T18:57:52Z

@narvind2003 not that I know of. CC @menshikh-iv

Such optimizations look trivial though, do you want to give it a try?

menshikh-iv · 2017-09-22T05:09:11Z

@narvind2003 yes, we can help.
Please start on your improvements and create a PR. In the PR you can ping us and we'll discuss any problems/ideas :+1:


josepablog closed this as completed Oct 28, 2016
