# Word Mover's Distance

Word Mover's Distance (WMD) is a method for measuring the distance between two sets of words (documents, sentences, etc.). It uses word embeddings, such as those produced by word2vec, and the rest is quite intuitive: match the words of one document to nearby words in the other and sum the distances the words have to travel. In other words, we find the "minimum traveling distance" from one document to another.
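To make the "minimum traveling distance" idea concrete, here is a minimal sketch using made-up 2-D vectors. The names `toy_vectors` and `toy_wmd` are hypothetical, and this is not how Gensim computes WMD. It only covers the special case of two documents of equal length with distinct, equally weighted words, where the cheapest way to move the words can be found by trying every one-to-one matching.

```python
import itertools
import math

# Made-up 2-D "embeddings", purely for illustration; real word2vec
# vectors have hundreds of dimensions.
toy_vectors = {
    'obama': (1.0, 1.0),
    'president': (1.2, 0.9),
    'media': (4.0, 1.0),
    'press': (4.1, 1.3),
    'illinois': (2.0, 5.0),
    'chicago': (2.2, 5.1),
}


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def toy_wmd(doc1, doc2, vectors):
    """Minimum average travel distance between two equal-length documents
    of distinct words, found by trying every one-to-one matching."""
    assert len(doc1) == len(doc2)
    best = min(
        sum(euclidean(vectors[a], vectors[b]) for a, b in zip(doc1, perm))
        for perm in itertools.permutations(doc2)
    )
    return best / len(doc1)  # each word carries weight 1/n


doc1 = ['obama', 'media', 'illinois']
doc2 = ['president', 'press', 'chicago']
print(toy_wmd(doc1, doc2, toy_vectors))
```

Trying every matching is only feasible for a handful of words, which is why real implementations rely on a dedicated transportation solver.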

WMD is illustrated below for two very similar sentences. Apart from stopwords, the sentences have no words in common, but by matching the relevant words, WMD is able to accurately measure the similarity between them.

<img src='https://vene.ro/images/wmd-obama.png' height='700' width='700'>


This method comes from the article "From Word Embeddings To Document Distances" by Matt Kusner et al. ([link to PDF](http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf)). It is inspired by the "Earth Mover's Distance", and employs a solver of the "transportation problem".
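For reference, the transportation problem solved in the paper can be written as below. The notation follows the paper rather than this tutorial: $d$ and $d'$ are the normalized bag-of-words vectors of the two documents over a vocabulary of $n$ words, $x_i$ is the embedding of word $i$, and $T_{ij}$ is the amount of word $i$ in the first document that "travels" to word $j$ in the second.

$$
\min_{T \ge 0} \; \sum_{i,j=1}^{n} T_{ij} \, \lVert x_i - x_j \rVert_2
\quad \text{subject to} \quad
\sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i,
\qquad
\sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j.
$$

The first constraint says that all of the mass of each word in the first document must be sent somewhere, and the second that each word in the second document receives exactly its own mass.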

This short tutorial shows the use of the `wmdistance` method of the Gensim `Word2Vec` class. Stay tuned for another tutorial on kNN classification using Gensim's [docsim](http://radimrehurek.com/gensim/similarities/docsim.html), once this functionality is implemented!

## Using WMD

To use WMD, we first need some word embeddings. You could train a word2vec model on some corpus (see the tutorial [here](http://rare-technologies.com/word2vec-tutorial/)), but in this tutorial we will simply use pre-trained word2vec embeddings, which you can download [here](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit).

Let's take some sentences to compute the distance between.

In [1]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')

sentence1 = 'Obama speaks to the media in Illinois'
sentence2 = 'The president greets the press in Chicago'
sentence1 = sentence1.lower().split()
sentence2 = sentence2.lower().split()

These sentences have very similar content, and as such the WMD should be low. Before we compute the WMD, we want to remove stopwords ("the", "to", etc.), as these do not contribute a lot to the information in the sentences.

In [2]:
# Import and download stopwords from NLTK.
from nltk.corpus import stopwords
from nltk import download
download('stopwords')

# Remove stopwords.
stop_words = stopwords.words('english')
sentence1 = [w for w in sentence1 if w not in stop_words]
sentence2 = [w for w in sentence2 if w not in stop_words]

[nltk_data] Downloading package stopwords to /home/olavur/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
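Since the same lowercase, split and filter steps are applied to every sentence, they could be wrapped in a small helper. The sketch below is one way to do that; `preprocess` and `demo_stop_words` are hypothetical names, and the tiny stopword set is hand-picked only so the example runs without NLTK.

```python
def preprocess(sentence, stop_words):
    """Lowercase a sentence, split it on whitespace, and drop stopwords."""
    stop_words = set(stop_words)  # set membership tests are fast
    return [w for w in sentence.lower().split() if w not in stop_words]


# A tiny hand-picked stopword set, used here only so the example runs
# without NLTK; in the tutorial you would pass stopwords.words('english').
demo_stop_words = {'the', 'to', 'in'}

print(preprocess('Obama speaks to the media in Illinois', demo_stop_words))
print(preprocess('The president greets the press in Chicago', demo_stop_words))
```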


Now, as mentioned earlier, we will be using some downloaded pre-trained embeddings. We load these into a Gensim Word2Vec model class. Note that the embeddings we have chosen here require a lot of memory.

In [3]:
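from gensim.models import Word2Vec
# Note: Gensim 1.0 and later expose this loader as
# gensim.models.KeyedVectors.load_word2vec_format instead.
model = Word2Vec.load_word2vec_format('/data/w2v_googlenews/GoogleNews-vectors-negative300.bin.gz', binary=True)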

So let's compute the WMD using the `wmdistance` method.

In [4]:
distance = model.wmdistance(sentence1, sentence2)

print(distance)

1.01746462593


Let's try the same thing with two completely unrelated sentences. Notice that the distance is larger.

In [5]:
sentence1 = 'Obama speaks to the media in Illinois'
sentence2 = 'Oranges are my favorite type of fruit'
sentence1 = sentence1.lower().split()
sentence2 = sentence2.lower().split()
stop_words = stopwords.words('english')
sentence1 = [w for w in sentence1 if w not in stop_words]
sentence2 = [w for w in sentence2 if w not in stop_words]

distance = model.wmdistance(sentence1, sentence2)

print(distance)

1.36042560915
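When comparing one sentence against several others, it can be convenient to sort the candidates by their distance. The helper below is a minimal sketch of that idea; `rank_by_wmd` is a hypothetical name, and the only assumption about `model` is that it exposes a `wmdistance(tokens1, tokens2)` method, as the object loaded above does.

```python
def rank_by_wmd(model, query, candidates):
    """Return (distance, candidate) pairs sorted from closest to farthest.

    `model` is assumed to expose wmdistance(tokens1, tokens2), as the
    embeddings object loaded earlier in this tutorial does.
    """
    scored = [(model.wmdistance(query, cand), cand) for cand in candidates]
    return sorted(scored, key=lambda pair: pair[0])
```

Here `query` and every entry of `candidates` are lists of tokens that have already been lowercased and stripped of stopwords.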


## WmdSimilarity

You can use WMD to get the most similar documents to a query, using `WmdSimilarity`. Its interface is similar to what is described in the [Similarity Queries](https://radimrehurek.com/gensim/tut3.html) Gensim tutorial.

We need some documents to try it out on, and for that we'll be using Yelp reviews, available at http://www.yelp.com/dataset_challenge. We are going to keep using the GoogleNews embeddings, although we could also have trained a word2vec model on the Yelp data.

Below, a JSON file with Yelp reviews is read line by line; the text is extracted and tokenized, and stopwords and punctuation are removed.

In [27]:
import json
from nltk import word_tokenize
download('punkt')  # Download data for tokenizer.

# Load some text data.
n_items = 20
corpus = []
with open('/home/olavur/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_review.json') as data_file:
    for i, line in enumerate(data_file):
        if i == n_items:
            break
            
        json_line = json.loads(line)
        text = json_line['text'].lower()  # Lower the text.
        text = word_tokenize(text)  # Split into words.
        text = [w for w in text if w not in stop_words]  # Remove stopwords.
        text = [w for w in text if w.isalpha()]  # Remove numbers and punctuation.
        corpus.append(text)

[nltk_data] Downloading package punkt to /home/olavur/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
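The "read the first `n_items` lines" part of the loop can also be written with `itertools.islice` instead of `enumerate` and `break`. The sketch below shows only that step on made-up records; `read_texts` and `fake_file` are hypothetical names, and tokenization and stopword removal are left out.

```python
import io
import itertools
import json


def read_texts(lines, n_items):
    """Yield the lowercased 'text' field of the first n_items JSON lines."""
    for line in itertools.islice(lines, n_items):
        yield json.loads(line)['text'].lower()


# Made-up records standing in for the Yelp review file.
fake_file = io.StringIO(
    '{"text": "Great food!"}\n'
    '{"text": "Slow service."}\n'
    '{"text": "Never again."}\n'
)
print(list(read_texts(fake_file, 2)))
```

Because `read_texts` is a generator, lines beyond the first `n_items` are never parsed.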


Now we want to initialize the similarity class with a corpus and a word2vec model (which provides the embeddings and the `wmdistance` method itself).

In [28]:
from gensim.similarities import WmdSimilarity
instance = WmdSimilarity(corpus, model, num_best=10)

The `num_best` parameter decides how many results the queries return. Now let's try making a query.

In [29]:
sent = 'the food in this restaurant is quite delicious'
sent = sent.split()
sims = instance[sent]
print(sims)

[(13, -1.1387773003809611), (10, -1.1666546691779864), (12, -1.1725773024431625), (11, -1.1919036977021324), (14, -1.1993376932711148), (2, -1.2032962138672465), (18, -1.2063037700971175), (0, -1.2096674475548443), (6, -1.2106174098981202), (9, -1.2137328878539591)]
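Each result is a pair of a document index and a similarity score, so the index can be used to look the matching document up again. The sketch below shows this with made-up data; `describe_matches`, `demo_corpus` and `demo_sims` are hypothetical names standing in for the tutorial's `corpus` and `sims`.

```python
def describe_matches(sims, documents):
    """Pair each (index, score) result with the document it points to."""
    return [(score, ' '.join(documents[idx])) for idx, score in sims]


# Made-up stand-ins for the corpus and a query result.
demo_corpus = [['great', 'food'], ['slow', 'service'], ['tasty', 'pizza']]
demo_sims = [(2, -0.8), (0, -0.9)]

for score, text in describe_matches(demo_sims, demo_corpus):
    print('%.2f  %s' % (score, text))
```

Note that the corpus holds preprocessed tokens, so the text recovered this way is the filtered token list rather than the original review.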


We can also compare every document in the corpus against all the others, simply by iterating over the similarity object.

In [31]:
for result in instance:
    print(result[0][0], result[0][1])  # Print index and similarity to the most similar document.

18 -1.02450615166
5 -0.927757094667
7 -0.946903190483
7 -1.01046151423
2 -0.970021707804
1 -0.927757094667
3 -1.01838309872
1 -0.934173622083
5 -0.93840869294
5 -0.967889142254
11 -0.844034032851
10 -0.844034032851
11 -0.90776322182
10 -0.977599374215
15 -0.99086094287
17 -0.980916173889
15 -1.0030764677
15 -0.980916173889
14 -1.02268712348
8 -1.03511599213


As mentioned previously, for more information about Gensim's similarity classes in general, turn to [this tutorial](https://radimrehurek.com/gensim/tut3.html).