[WIP] Implement Word Mover's Distance (WMD) in Gensim #521
@tmylk @piskvorky I'm done implementing the
Great. Could you please add unit tests for this new functionality?
Good job! @tmylk please review. @olavurmortensen how do I use it? Please provide a concrete example.
…orial. Added .ipynb_checkpoints to .gitignore
…ow returns 'inf' instead of 'nan' when a document is empty.
@mkusner Should we expect RWMD to be much faster than WMD? The RWMD code here is almost the same as the MATLAB RWMD code, just without a square root.
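For readers following the RWMD discussion, here is a minimal pure-Python sketch of the relaxed bound as described in the Kusner et al. paper: each word sends all of its weight to the nearest word in the other document, and the larger of the two one-sided costs is taken. This is not the code from this PR. The names `nbow`, `one_sided_cost` and `rwmd`, the toy vectors, and the use of plain Euclidean distance are assumptions made for illustration.

```python
import math


def nbow(tokens):
    """Normalized bag-of-words: map each token to its relative frequency."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    total = float(len(tokens))
    return {token: count / total for token, count in counts.items()}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def one_sided_cost(source, target, vectors):
    """Cost when every source word moves all its weight to the nearest target word."""
    target_words = set(target)
    return sum(
        weight * min(euclidean(vectors[word], vectors[other]) for other in target_words)
        for word, weight in nbow(source).items()
    )


def rwmd(doc_a, doc_b, vectors):
    """Relaxed WMD: the larger of the two one-sided lower bounds.

    Returning inf for an empty document is an assumption of this sketch.
    """
    if not doc_a or not doc_b:
        return float('inf')
    return max(one_sided_cost(doc_a, doc_b, vectors),
               one_sided_cost(doc_b, doc_a, vectors))
```

Because each one-sided cost only needs a nearest-neighbour lookup per word rather than a full transport problem, this bound is cheap to evaluate, which is why it is normally used to prune candidates before computing exact WMD.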
# Compute RWMD.
with warnings.catch_warnings():
    # Ignore Numpy warning: "All-NaN axis encountered".
    warnings.filterwarnings('ignore', r'All-NaN axis encountered')
How much time does this take? Where are the NaNs coming from?
Hi all! Thanks a lot for spending so much careful time on this, I just […] It depends a bit on the dataset how much RWMD speeds up WMD; we saw […] In any case, thanks again!
…stance (brute force doesn't work, need optimization method in Python).
…es), using it in WmdSimilarity, speeds up wmdistance.
self.normalize = False

# index is simply an array from 0 to size of corpus.
self.index = numpy.array(range(len(corpus)))
Where is this variable used later?
In `gensim.interfaces.SimilarityABC`, which `WmdSimilarity` extends. The Similarity base class uses `index` to sort the documents.
…out data (histogram and more), updated some text.
…nsen/gensim into word_movers_distance
…a call to word2vec.init_sims in WmdSimilarity removed.
…nsen/gensim into word_movers_distance
…ed some time ago).
…the one prepared for the upcoming WMD release (piskvorky#619). Changed the pre-computed distances to a dictionary instead of a matrix. Added my notebook with all kinds of tests of WMD, including line profiling.
Ping @olavurmortensen, what is the status of this PR? Will you finish it soon?
I believe part of this PR was merged in #659 (just WMD), and that the optimizations (RWMD and WCD) were taken up by @RishabGoel in #800. I'm no longer working on this PR and haven't been for over a year. So the PR should be closed.
Just want to note that I used the wrong account in my previous comment (olavurfargen), but it was me.
In response to issue #482.
We want to compute WMD between documents, based on "From Word Embeddings To Document Distances" by Matt Kusner et al. (link to paper). Using word2vec embeddings, WMD finds the minimum "traveling distance" (in a manner of speaking) between two documents. The method shows great promise in kNN classification, as shown by Vlad Niculae and Matt Kusner in a blog post using scikit-learn.
Our implementation will be based on `pyemd`, a Python wrapper by Will Mayner around a C implementation of EMD (Earth Mover's Distance) (link to GitHub repo), which will be included directly in Gensim.

After the simple implementation of a method that computes the WMD, a kNN classifier will be implemented, inspired by the aforementioned blog post by Vlad Niculae and Matt Kusner. Gensim's `docsim` module will be used for this purpose (link to docsim).

At the moment, only WMD between two documents is implemented in this PR, and it is done by importing `pyemd`. This Python wrapper needs to be included directly in Gensim, so that the external module is no longer required.

@tmylk @piskvorky
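To make the description above concrete, here is a rough sketch of how the two inputs to an EMD solver can be built for a pair of documents: a normalized bag-of-words histogram per document over their joint vocabulary, and a matrix of pairwise Euclidean distances between the word vectors. The helper name `wmd_inputs` and the toy vectors are hypothetical and are not taken from this PR; the sketch uses only the standard library and stops short of calling the solver.

```python
import math


def wmd_inputs(doc_a, doc_b, vectors):
    """Build the histograms and distance matrix an EMD solver would need.

    doc_a, doc_b: non-empty lists of tokens.
    vectors: dict mapping each token to its embedding (a list of floats).
    """
    # Joint vocabulary of both documents, in a fixed order.
    vocab = sorted(set(doc_a) | set(doc_b))

    def histogram(doc):
        # Relative frequency of every vocabulary word in the document.
        return [doc.count(word) / float(len(doc)) for word in vocab]

    def euclidean(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    # distance_matrix[i][j] is the cost of moving mass from vocab[i] to vocab[j].
    distance_matrix = [
        [euclidean(vectors[w1], vectors[w2]) for w2 in vocab] for w1 in vocab
    ]
    return histogram(doc_a), histogram(doc_b), distance_matrix


# Toy embeddings, invented for illustration only.
toy_vectors = {'obama': [0.0, 0.0], 'president': [3.0, 4.0]}
d1, d2, dist = wmd_inputs(['obama', 'obama'], ['president'], toy_vectors)

# Converted to float64 NumPy arrays, these three values could then be handed to
# pyemd.emd(first_histogram, second_histogram, distance_matrix) to get the WMD.
```

Restricting the vocabulary to the words that occur in the two documents keeps the distance matrix small, which matters because solving the transport problem is the expensive step.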