# Word Mover's Distance

Word Mover's Distance (WMD) measures the distance between two sets of words (documents, sentences, etc.). It represents each word by its word2vec embedding, and the rest is quite intuitive: the words of one document are matched to nearby words of the other, and the distances those words have to "travel" are summed. In other words, WMD is the minimum cumulative distance that the words of one document must travel to reach the words of the other.

WMD is illustrated below for two very similar sentences. Apart from stopwords, the sentences have no words in common, but by matching related words WMD still captures how similar the two sentences are.

<img src='https://vene.ro/images/wmd-obama.png' height='700' width='700'>


This method comes from the article "From Word Embeddings To Document Distances" by Matt Kusner et al. ([link to PDF](http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf)). It is inspired by the "Earth Mover's Distance", and employs a solver of the "transportation problem".
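
To make the "transportation problem" concrete, here is a minimal sketch for one special case: two documents with the same number of distinct words, each word carrying equal weight. In that case an optimal transport plan can always be chosen as a one-to-one matching, so a brute-force search over matchings gives the exact answer for tiny inputs. The 2-D vectors and the `toy_wmd` helper are invented for illustration; this is not how Gensim's solver works, and it does not handle documents of different lengths or repeated words.

```python
from itertools import permutations
from math import sqrt

# Hypothetical 2-D "embeddings", made up for this sketch only.
toy_vectors = {
    'obama': (1.0, 1.0),
    'president': (1.0, 2.0),
    'media': (4.0, 1.0),
    'press': (4.0, 2.0),
    'illinois': (7.0, 1.0),
    'chicago': (7.0, 3.0),
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def toy_wmd(doc1, doc2, vectors):
    """Brute-force WMD for two documents with equally many distinct words.

    Every word has weight 1/n, so the cheapest one-to-one matching,
    divided by n, is the minimum total travel cost.
    """
    n = len(doc1)
    if n == 0 or n != len(doc2):
        raise ValueError('documents must be non-empty and of equal length')
    if len(set(doc1)) != n or len(set(doc2)) != n:
        raise ValueError('words within a document must be distinct')
    # Try every way of pairing the words and keep the cheapest pairing.
    best = min(
        sum(euclidean(vectors[a], vectors[b]) for a, b in zip(doc1, perm))
        for perm in permutations(doc2)
    )
    return best / n


print(toy_wmd(['obama', 'media', 'illinois'],
              ['president', 'press', 'chicago'],
              toy_vectors))
```

The search grows factorially with document length, which is why a real implementation relies on a dedicated transportation solver rather than enumeration.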

This short tutorial shows how to use Gensim's `wmdistance` method with a set of word2vec embeddings. Stay tuned for another tutorial on kNN classification using Gensim's [docsim](http://radimrehurek.com/gensim/similarities/docsim.html), when this functionality is implemented!

## Using WMD

To use WMD, we first need some word embeddings. You could train a word2vec model on some corpus (see the tutorial [here](http://rare-technologies.com/word2vec-tutorial/)), but in this tutorial we will simply download some pre-trained word2vec embeddings, available [here](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit).

Let's take some sentences to compute the distance between.

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')

sentence1 = 'Obama speaks to the media in Illinois'
sentence2 = 'The president greets the press in Chicago'
sentence1 = sentence1.lower().split()
sentence2 = sentence2.lower().split()

These sentences have very similar content, so the WMD should be low. Before we compute the WMD, we remove stopwords ("the", "to", etc.), as these contribute little to the information in the sentences.

In [2]:
# Import and download stopwords from NLTK.
from nltk.corpus import stopwords
from nltk import download
download('stopwords')

# Remove stopwords.
stop_words = stopwords.words('english')
sentence1 = [w for w in sentence1 if w not in stop_words]
sentence2 = [w for w in sentence2 if w not in stop_words]

[nltk_data] Downloading package stopwords to /home/olavur/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
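
Since the same lowercasing, splitting, and filtering is applied to every sentence, the steps can be wrapped in a small helper. The sketch below is one possible way to do that; `preprocess` and `TOY_STOP_WORDS` are hypothetical names, and the tiny hand-picked stop-word set is only a stand-in so the example runs without NLTK.

```python
# Tiny stand-in for NLTK's English stop-word list (an assumption for this sketch).
TOY_STOP_WORDS = {'the', 'to', 'in', 'are', 'my', 'of'}


def preprocess(sentence, stop_words=TOY_STOP_WORDS):
    """Lowercase a sentence, split it on whitespace, and drop stop words."""
    return [w for w in sentence.lower().split() if w not in stop_words]


print(preprocess('Obama speaks to the media in Illinois'))
print(preprocess('The president greets the press in Chicago'))
```

In the tutorial's setting you would pass `set(stopwords.words('english'))` as `stop_words`; converting the list to a set makes each membership check faster.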


Now, as mentioned earlier, we will be using some downloaded pre-trained embeddings. We load these with Gensim's `KeyedVectors` class; older Gensim releases exposed the same loader as `Word2Vec.load_word2vec_format`, which has since been deprecated.

In [3]:
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('/data/w2v_googlenews/GoogleNews-vectors-negative300.bin.gz', binary=True)

So let's compute the WMD using the `wmdistance` method.

In [4]:
distance = model.wmdistance(sentence1, sentence2)

print(distance)

1.01746462593


Let's try the same thing with two completely unrelated sentences. Notice that the distance is larger (about 1.36, compared with about 1.02 for the similar pair).

In [5]:
sentence1 = 'Obama speaks to the media in Illinois'
sentence2 = 'Oranges are my favorite type of fruit'
sentence1 = sentence1.lower().split()
sentence2 = sentence2.lower().split()
stop_words = stopwords.words('english')
sentence1 = [w for w in sentence1 if w not in stop_words]
sentence2 = [w for w in sentence2 if w not in stop_words]

distance = model.wmdistance(sentence1, sentence2)

print(distance)

1.36042560915
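
Comparing distances like this extends naturally to ranking several candidate sentences against a query. The sketch below shows one way this might look; `rank_by_distance` is a hypothetical helper, and `toy_distance` is a simple word-overlap stand-in (not WMD) so the example runs without loading the embeddings.

```python
def rank_by_distance(query, candidates, distance_fn):
    """Return (distance, candidate) pairs sorted from closest to farthest."""
    scored = [(distance_fn(query, c), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0])


def toy_distance(doc1, doc2):
    """Stand-in distance: share of distinct words the two token lists do not have in common."""
    a, b = set(doc1), set(doc2)
    if not (a | b):
        return 0.0  # two empty documents are treated as identical
    return 1.0 - len(a & b) / float(len(a | b))


query = ['obama', 'speaks', 'media']
candidates = [['oranges', 'fruit'], ['obama', 'media', 'press']]
for dist, doc in rank_by_distance(query, candidates, toy_distance):
    print(dist, doc)
```

With the model loaded as in this tutorial, passing `model.wmdistance` as `distance_fn` would rank the candidates by WMD instead of word overlap.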
