[WIP] Implement Word Mover's Distance (WMD) in Gensim #521

Closed

olavurmortensen
Contributor

In response to issue #482.

We want to compute WMD between documents, based on "From Word Embeddings To Document Distances" by Matt Kusner et al. (link to paper). Using word2vec embeddings, WMD finds the minimum "traveling distance" (in a manner of speaking) between two documents: the least total distance the embedded words of one document have to travel to match the words of the other. The method shows great promise in kNN classification, as shown by Vlad Niculae and Matt Kusner in a blog post using Scikit-Learn.

Our implementation will be based on pyemd (link to GitHub repo) by Will Mayner, a Python wrapper around a C implementation of EMD (Earth Mover's Distance). The wrapper will be included directly in Gensim.

Once a simple method that computes WMD is in place, a kNN classifier will be implemented, inspired by the aforementioned blog post by Vlad Niculae and Matt Kusner. Gensim's docsim module (link to docsim) will be used for this purpose.

At the moment, this PR only implements WMD between two documents, and it does so by importing pyemd. The wrapper still needs to be included directly in Gensim so that the external module is no longer required.
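To make the quantity concrete, here is a minimal sketch of WMD for a deliberately restricted case. It is not the code in this PR and does not use pyemd. It assumes both documents contain the same number of distinct words, each occurring once; under that assumption the transport problem reduces to an assignment problem, so the optimum can be found by trying every one-to-one matching. The names `TOY_VECTORS` and `toy_wmd` and the 2-d vectors are made up for illustration.

```python
import itertools
import math

# Hypothetical 2-d "embeddings"; real word2vec vectors have hundreds of dimensions.
TOY_VECTORS = {
    "obama": (1.0, 0.0),
    "president": (0.9, 0.1),
    "speaks": (0.0, 1.0),
    "greets": (0.1, 0.9),
    "media": (0.5, 0.5),
    "press": (0.6, 0.4),
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def toy_wmd(doc1, doc2, vectors=TOY_VECTORS):
    """WMD for two documents of n distinct words each (every word has mass 1/n).

    With uniform masses the optimal flow is a one-to-one matching, so we
    brute-force all permutations. Only feasible for very short documents.
    """
    if len(set(doc1)) != len(doc1) or len(set(doc2)) != len(doc2):
        raise ValueError("this sketch assumes no repeated words")
    if len(doc1) != len(doc2):
        raise ValueError("this sketch assumes documents of equal length")
    n = len(doc1)
    best_total = min(
        sum(euclidean(vectors[w1], vectors[w2]) for w1, w2 in zip(doc1, perm))
        for perm in itertools.permutations(doc2)
    )
    return best_total / n


distance = toy_wmd(["obama", "speaks", "media"], ["president", "greets", "press"])
```

For documents of different lengths or with repeated words, a general EMD solver such as the one wrapped by pyemd is needed instead of the brute-force matching.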

@tmylk @piskvorky

@olavurmortensen
Contributor Author

@tmylk @piskvorky I'm done implementing the wmdistance method, using pyemd natively in Gensim.

@tmylk
Contributor

tmylk commented Nov 12, 2015

Great. Could you please add unit tests for this new functionality?

@piskvorky
Owner

Good job! @tmylk please review.

@olavurmortensen how do I use it? Please provide a concrete example.

@piskvorky changed the title from "Implement Word Mover's Distance (WMD) in Gensim" to "[WIP] Implement Word Mover's Distance (WMD) in Gensim" on Nov 17, 2015

@tmylk
Contributor

tmylk commented Jan 23, 2016

@mkusner Should we expect RWMD to be much faster than WMD?
It gives very little speedup in these tests, maybe because RWMD is implemented in numpy while EMD is implemented in C. If we can't improve the RWMD runtime, then I suggest not using it at all.

The RWMD code here is almost the same as the matlab RWMD code, just without a square root.
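For readers unfamiliar with RWMD, the following is a minimal pure-Python sketch of the relaxed distance as described in the paper, not the numpy code under review. Dropping one of the two flow constraints lets every word send all of its mass to its nearest word in the other document; doing this in both directions and taking the maximum gives a lower bound on WMD. The toy vectors and the names `nbow` and `rwmd` are assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical 2-d "embeddings", for illustration only.
TOY_VECTORS = {
    "obama": (1.0, 0.0),
    "president": (0.9, 0.1),
    "speaks": (0.0, 1.0),
    "greets": (0.1, 0.9),
    "media": (0.5, 0.5),
    "press": (0.6, 0.4),
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nbow(doc):
    """Normalized bag-of-words: word -> relative frequency in the document."""
    counts = Counter(doc)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def rwmd(doc1, doc2, vectors=TOY_VECTORS):
    """Relaxed WMD: the larger of the two one-sided nearest-word costs."""
    d1, d2 = nbow(doc1), nbow(doc2)

    def one_sided(src, dst):
        # Each source word moves all of its mass to its closest target word.
        return sum(
            weight * min(euclidean(vectors[w], vectors[v]) for v in dst)
            for w, weight in src.items()
        )

    return max(one_sided(d1, d2), one_sided(d2, d1))


bound = rwmd(["obama", "obama", "speaks"], ["president", "greets"])
```

Each one-sided cost needs only a row-wise or column-wise minimum over the word distance matrix, which is why RWMD is expected to be cheaper than solving the full transport problem.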

# Compute RWMD.
with warnings.catch_warnings():
    # Ignore Numpy warning: "All-NaN axis encountered".
    warnings.filterwarnings('ignore', r'All-NaN axis encountered')
Contributor

How much time does this take? Where are the NaNs coming from?

@mkusner

mkusner commented Jan 25, 2016

Hi all! Thanks a lot for spending so much careful time on this; I just looked through the thread!

How much RWMD speeds up WMD depends a bit on the dataset; we saw speedups between 1.1x and 16x (Figure 7 of the paper: http://mkusner.github.io/publications/WMD.pdf). If EMD is in C and RWMD is in numpy, this could certainly shrink the speedup. RWMD is even more effective at speeding up document distances when used to prune documents for approximate WMD computation (in the prefetch and prune algorithm at the end of Section 4 in the paper). I'd be happy to give more details about this if this is something that people would want included.

In any case, thanks again!
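The prefetch and prune idea mentioned above can be sketched roughly as follows. This is only an outline of the control flow described in the paper, not code from this PR: candidates are sorted by the cheap WCD bound, exact WMD is computed for the first k, and each remaining candidate is skipped when its RWMD lower bound already exceeds the current k-th best WMD. The function name and its parameters (`wcd`, `rwmd`, `wmd`, `m`) are assumptions, and the demo uses plain numbers with stand-in distance functions instead of real documents.

```python
import heapq


def prefetch_and_prune(query, docs, k, wcd, rwmd, wmd, m=None):
    """Return the k nearest documents as sorted (distance, index) pairs.

    wcd, rwmd and wmd are caller-supplied functions (query, doc) -> float,
    with rwmd(q, d) <= wmd(q, d). If m is given, only the m candidates
    closest by WCD are examined, which makes the result approximate.
    """
    order = sorted(range(len(docs)), key=lambda i: wcd(query, docs[i]))
    if m is not None:
        order = order[:m]

    heap = []  # max-heap of the k best exact distances, stored negated
    for i in order:
        if len(heap) < k:
            heapq.heappush(heap, (-wmd(query, docs[i]), i))
            continue
        kth_best = -heap[0][0]
        if rwmd(query, docs[i]) >= kth_best:
            continue  # lower bound is already too large: prune
        dist = wmd(query, docs[i])
        if dist < kth_best:
            heapq.heapreplace(heap, (-dist, i))
    return sorted((-neg, i) for neg, i in heap)


# Stand-in "documents" (numbers) and distances, purely to exercise the logic.
wmd_calls = []


def fake_wmd(q, d):
    wmd_calls.append(d)  # record how often the expensive distance is computed
    return abs(q - d)


def fake_rwmd(q, d):
    return 0.5 * abs(q - d)  # a valid lower bound on fake_wmd


def fake_wcd(q, d):
    return 0.25 * abs(q - d)  # an even looser, cheaper bound


neighbours = prefetch_and_prune(
    3.0, [10.0, 1.0, 7.0, 2.5, 4.0], k=2,
    wcd=fake_wcd, rwmd=fake_rwmd, wmd=fake_wmd,
)
```

In this toy run the expensive distance is evaluated only for the two prefetched candidates; the other three are pruned by the lower bound. How much is saved on real data depends on how tight RWMD is for the dataset.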

On Sat, Jan 23, 2016 at 3:36 PM, Lev Konstantinovskiy <notifications@github.com> wrote:

> @mkusner (https://github.com/mkusner) Should we expect RWMD to be much faster than WMD?
> It gives very little speed up in these tests, maybe because it is numpy vs C implementation of EMD. If we can't improve RWMD runtime then suggest not using it at all.
>
> RWMD code here is almost the same as in matlab RWMD code (https://github.com/mkusner/wmd/blob/master/compute_rwmd.m), just without a square root.


self.normalize = False

# index is simply an array from 0 to size of corpus.
self.index = numpy.array(range(len(corpus)))
Contributor

Where is this variable used later?

Contributor Author

In gensim.interfaces.SimilarityABC, which WmdSimilarity extends. The Similarity base class uses index to sort the documents.

tmylk added a commit that referenced this pull request Apr 5, 2016
…the one prepared for the upcoming WMD release (piskvorky#619). Changed the pre-computed distances to a dictionary instead of a matrix. Added my notebook with all kinds of tests of WMD, including line profiling.
@menshikh-iv
Contributor

menshikh-iv commented Jun 13, 2017

Ping @olavurmortensen, what is the status of this PR? Will you finish it soon?

@ghost

ghost commented Jun 13, 2017

I believe part of this PR was merged in #659 (just WMD), and that the optimizations (RWMD and WCD) were taken up by @RishabGoel in #800.

I'm no longer working on this PR and haven't been for over a year. So the PR should be closed.

@olavurmortensen
Contributor Author

Just want to note that I used the wrong account in my previous comment (olavurfargen), but it was me.
