[WIP] Implement Word Mover's Distance (WMD) in Gensim #521

olavurmortensen · 2015-11-10T16:05:11Z

In response to issue #482.

We want to compute WMD between documents, based on "From Word Embeddings To Document Distances" by Matt Kusner et al. (link to paper). Using word2vec embeddings, WMD finds the minimum "traveling distance" (in a manner of speaking) between two documents, i.e. the minimum cumulative distance the words of one document must travel in embedding space to match the words of the other. The method shows great promise in kNN classification, as shown by Vlad Niculae and Matt Kusner in a blog post using Scikit-Learn.

Our implementation will be based on pyemd (link to GitHub repo), Will Mayner's Python wrapper of a C implementation of EMD (Earth Mover's Distance). pyemd will be included directly in Gensim.

After the simple implementation of a method that computes the WMD, a kNN classifier will be implemented, inspired by the aforementioned blog post of Vlad Niculae and Matt Kusner. Gensim's docsim module will be used for this purpose (link to docsim).

At the moment, this PR only implements WMD between two documents, and it does so by importing pyemd. The wrapper still needs to be bundled with Gensim so that the external module is no longer required.
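To make the two-document computation concrete, here is a minimal sketch of the inputs an EMD solver needs for WMD: a normalized bag-of-words vector per document and a matrix of Euclidean distances between word vectors. The function name `nbow_and_distances` and the toy 2-D embeddings are assumptions made up for illustration; this is not the code in this PR.

```python
import math
from collections import Counter


def nbow_and_distances(doc1, doc2, embeddings):
    """Return (d1, d2, distance_matrix) over the joint vocabulary of two token lists."""
    # Keep only tokens that have an embedding; sort for a deterministic order.
    vocab = sorted({w for w in doc1 + doc2 if w in embeddings})

    def nbow(doc):
        counts = Counter(w for w in doc if w in embeddings)
        total = sum(counts.values())
        # A document with no known words has no mass; leave it as all zeros.
        return [counts[w] / total if total else 0.0 for w in vocab]

    # Pairwise Euclidean distances between word vectors (the EMD "ground distance").
    distance_matrix = [
        [math.dist(embeddings[a], embeddings[b]) for b in vocab]
        for a in vocab
    ]
    return nbow(doc1), nbow(doc2), distance_matrix


# Toy 2-D vectors, invented for illustration only (hypothetical, not word2vec output).
toy_embeddings = {
    "obama": (1.0, 0.0),
    "president": (1.0, 1.0),
    "speaks": (0.0, 2.0),
    "greets": (0.0, 3.0),
}
d1, d2, dist = nbow_and_distances(
    ["obama", "speaks"], ["president", "greets"], toy_embeddings
)
```

With pyemd installed, these three values, converted to float64 NumPy arrays, are the kind of histogram and distance-matrix arguments that `pyemd.emd` expects.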

@tmylk @piskvorky

olavurmortensen · 2015-11-12T14:33:30Z

@tmylk @piskvorky I'm done implementing the wmdistance method, using pyemd natively in Gensim.

tmylk · 2015-11-12T16:21:30Z

Great. Could you please add unit tests for this new functionality?

piskvorky · 2015-11-13T01:58:36Z

Good job! @tmylk please review.

@olavurmortensen how do I use it? Please provide a concrete example.

tmylk · 2016-01-23T20:36:04Z

@mkusner Should we expect RWMD to be much faster than WMD?
It gives very little speedup in these tests, maybe because RWMD is implemented in numpy while EMD is implemented in C. If we can't improve the RWMD runtime, I suggest not using it at all.

The RWMD code here is almost the same as the matlab RWMD code (https://github.com/mkusner/wmd/blob/master/compute_rwmd.m), just without a square root.
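For readers unfamiliar with RWMD, a minimal pure-Python sketch of the relaxation follows: each word sends all of its mass to the nearest word in the other document, and the larger of the two one-sided costs is used as a lower bound on WMD. The function name `relaxed_wmd` and the toy embeddings are assumptions for illustration; this is neither the numpy code under review nor the matlab code.

```python
import math
from collections import Counter


def relaxed_wmd(doc1, doc2, embeddings):
    """Lower bound on WMD: the larger of the two one-sided relaxations."""
    words1 = [w for w in doc1 if w in embeddings]
    words2 = [w for w in doc2 if w in embeddings]
    if not words1 or not words2:
        # Assumption: follow the 'inf' convention for empty documents
        # mentioned in this PR's commit log.
        return float("inf")

    def one_sided(source, target):
        weights = Counter(source)
        targets = set(target)
        total = len(source)
        # Every source word moves its whole weight to its nearest target word.
        return sum(
            (count / total)
            * min(math.dist(embeddings[w], embeddings[t]) for t in targets)
            for w, count in weights.items()
        )

    return max(one_sided(words1, words2), one_sided(words2, words1))


# Toy 2-D vectors, invented for illustration only.
toy_embeddings = {
    "obama": (1.0, 0.0),
    "president": (1.0, 1.0),
    "speaks": (0.0, 2.0),
    "greets": (0.0, 3.0),
}
bound = relaxed_wmd(["obama", "speaks"], ["president", "greets"], toy_embeddings)
```

Because each one-sided cost only needs a nearest-neighbour lookup per word, no transport problem has to be solved, which is where the expected speedup over EMD comes from.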

tmylk · 2016-01-24T09:14:06Z

gensim/models/word2vec.py

+            # Compute RWMD.
+            with warnings.catch_warnings():
+                # Ignore Numpy warning: "All-NaN axis encountered".
+                warnings.filterwarnings('ignore', r'All-NaN axis encountered')


How much time does this take? Where are the NaNs coming from?

mkusner · 2016-01-25T00:13:27Z

Hi all! Thanks a lot for spending so much careful time on this. I just looked through the thread.

How much RWMD speeds up WMD depends on the dataset; we saw speedups between 1.1x and 16x (Figure 7 of the paper: http://mkusner.github.io/publications/WMD.pdf). If EMD is in C and RWMD is in numpy, this could certainly shrink the speedup. RWMD is even more effective at speeding up document distances when used to prune documents for approximate WMD computation (the prefetch and prune algorithm at the end of Section 4 of the paper). I'd be happy to give more details about this if people would want it included.

In any case, thanks again!
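The prefetch and prune idea mentioned above can be sketched roughly as follows: sort candidates by a cheap distance, compute exact WMD for the first k, then skip any later candidate whose lower bound already exceeds the current k-th best distance. The function `prefetch_and_prune` and its `wcd`, `rwmd` and `wmd` callables are hypothetical placeholders supplied by the caller, not Gensim APIs, and the numeric stand-ins in the example exist only to exercise the control flow.

```python
import heapq


def prefetch_and_prune(query, docs, k, wcd, rwmd, wmd):
    """Return (neighbours, exact_calls); neighbours are (distance, index) pairs."""
    if k < 1:
        raise ValueError("k must be at least 1")

    # 1. Prefetch: order candidates by the cheapest distance (e.g. word centroid distance).
    order = sorted(range(len(docs)), key=lambda i: wcd(query, docs[i]))

    best = []  # max-heap via negated distances: best[0] is the current k-th best
    exact_calls = 0
    for i in order:
        if len(best) == k:
            kth_best = -best[0][0]
            # 2. Prune: a lower bound above the k-th best rules this document out.
            if rwmd(query, docs[i]) > kth_best:
                continue
        exact_calls += 1
        distance = wmd(query, docs[i])
        if len(best) < k:
            heapq.heappush(best, (-distance, i))
        elif distance < -best[0][0]:
            heapq.heapreplace(best, (-distance, i))

    neighbours = sorted((-neg, i) for neg, i in best)
    return neighbours, exact_calls


# Numeric stand-ins: "documents" are numbers and the bounds are scaled distances.
docs = [5.0, 1.0, 3.0, 9.0, 2.0]
neighbours, exact_calls = prefetch_and_prune(
    0.0,
    docs,
    k=2,
    wcd=lambda q, d: 0.5 * abs(q - d),   # cheapest, loosest bound
    rwmd=lambda q, d: 0.9 * abs(q - d),  # tighter lower bound
    wmd=lambda q, d: abs(q - d),         # stand-in for the exact distance
)
```

In this toy run only the first two candidates need the exact distance; the remaining three are pruned by the lower bound. The approximate variant described in the paper would additionally stop after a fixed number of candidates.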

tmylk · 2016-02-03T22:14:02Z

gensim/similarities/docsim.py

+        self.normalize = False
+
+        # index is simply an array from 0 to size of corpus.
+        self.index = numpy.array(range(len(corpus)))


Where is this variable used later?

olavurmortensen · 2016-02-15

It is used in gensim.interfaces.SimilarityABC, which WmdSimilarity extends. The Similarity base class uses index to sort the documents.

menshikh-iv · 2017-06-13T08:32:51Z

Ping @olavurmortensen, what is the status of this PR? Will you finish it soon?

ghost · 2017-06-13T08:41:48Z

I believe part of this PR was merged in #659 (just WMD), and that the optimizations (RWMD and WCD) were taken up by @RishabGoel in #800.

I'm no longer working on this PR and haven't been for over a year. So the PR should be closed.

olavurmortensen · 2017-06-13T09:25:08Z

Just want to note that I used the wrong account in my previous comment (olavurfargen), but it was me.

Added wmdistance Word2Vec class method.

4830477

olavurmortensen mentioned this pull request Nov 10, 2015

add 'Word Mover's Distance' implementation to gensim? #482

Closed

olavurmortensen added 3 commits November 12, 2015 11:41

Trying to add pyemd as an extension to Gensim.

8da46c6

Fixed setup.py, pyemd now works natively in Gensim.

478cd12

Updated wmdistance docstring. Removed some TODOs and FIXMEs.

5eef47d

olavurmortensen added 3 commits November 12, 2015 20:43

Added tests from the pyemd repo.

1ff24c1

Changed the pyemd tests to use unittest.

2ef3051

Fixed import mistake in test_pyemd.py

d9a5038

olavurmortensen added 3 commits November 15, 2015 11:52

Changed pyemd test case names.

8a0d9c9

Added tests of wmdistance to test_word2vec.py

5e8fb0c

Small change to wmdistance docstring example.

88516d2

piskvorky changed the title ~~Implement Word Mover's Distance (WMD) in Gensim~~ [WIP] Implement Word Mover's Distance (WMD) in Gensim Nov 17, 2015

olavurmortensen added 15 commits November 17, 2015 11:41

Added a draft for a WMD tutorial.

30f0196

Changed one wmdistance unit test.

0d574e0

Added draft for pure Python wmdistance method. No guarantee it works just yet.

b0a0d1a

Minor changes.

66354a8

Changed how wmdistance falls back on pure Python version. Updated tutorial. Added .ipynb_checkpoints to .gitignore

6f7a42a

Check difference within some bound instead of equal in unit test.

2b34dd9

Fixed syntax error in test.

fe7b6ad

Fixed pure Python version. Added unittest.

7a4ca68

Added information about license to pyemd files.

5ae7350

Using corpora.dictionary for BOW.

21bf200

Removed code duplication from wmdistance_slow.

4bb0f83

Added word2vec_inner.c from clean develop branch.

48d4797

Fixed error in unit test. Some formatting in wmdistance.

3e8c388

Updated tutorial.

09b0683

Added symmetry test.

b093f07

olavurmortensen added 2 commits January 23, 2016 16:42

Accidentally broke num_best functionality, fixed now.

7e53970

Fixed prefetch and prune in WmdSimilarity, it now works. wmdistance now returns 'inf' instead of 'nan' when a document is empty.

bbb0ffc

tmylk reviewed Jan 24, 2016

olavurmortensen added 5 commits January 27, 2016 17:02

Normalizing embeddings in WmdSimilarity. Improves similarity queries.

e7713f5

Updated tutorial (work in progress).

286438e

Simplified distance matrix computation loop. Removed pure Python wmdistance (brute force doesn't work, need optimization method in Python).

06fb2fe

Some improvements to the tutorial (text).

725827d

Pre-computing distances between words in vocab (word2vec.init_distances), using it in WmdSimilarity, speeds up wmdistance.

3c92d25

tmylk reviewed Feb 3, 2016

olavurmortensen added 6 commits February 15, 2016 14:03

In process of adding timings to tutorial.

f059ce9

Added timings to tutorial.

106b14b

Updated tutorial.

27af837

Prefetch and prune requires pp=True. Other minor changes.

dcc68c1

Updated notebook. Timings of fast cells removed, added information about data (histogram and more), updated some text.

23ccf1e

Merge branch 'word_movers_distance' of https://github.com/olavurmortensen/gensim into word_movers_distance

3dc765b

olavurmortensen mentioned this pull request Feb 28, 2016

[WIP] wmd WCD and RWMD optimisations #619

Closed

olavurmortensen added 4 commits March 3, 2016 13:46

Fixed some mistakes. force_pure_python option in wmdistance removed, a call to word2vec.init_sims in WmdSimilarity removed.

2913475

Merge branch 'word_movers_distance' of https://github.com/olavurmortensen/gensim into word_movers_distance

c78811f

Removed unit test that tests the pure Python version (which was removed some time ago).

cae9acb

Updated notebook.

e92c9ee

tmylk added a commit that referenced this pull request Apr 5, 2016

Word Movers Distance #482 , #521 and #619

a355c69

olavurmortensen added 2 commits April 15, 2016 17:58

Wrote a report on the WMD implementation. Replaced the tutorial with the one prepared for the upcoming WMD release (piskvorky#619). Changed the pre-computed distances to a dictionary instead of a matrix. Added my notebook with all kinds of tests of WMD, including line profiling.

abae44c

Added note about table of contents in report and test notebooks.

9d909e8

RishabGoel mentioned this pull request Jul 26, 2016

[WIP] Potential Word Movers Distance performance improvement: WCD and RWMD #800

Closed

menshikh-iv closed this Jun 13, 2017
