# Word Mover's Distance

Word Mover's Distance (WMD) is a method for measuring the distance between two text documents, based on the embeddings of the words they contain. WMD can be used for document retrieval, and has been shown to outperform many state-of-the-art methods in $k$-nearest neighbors classification [1]. As we shall see, the benefit of WMD is that it can meaningfully assess the similarity of two documents even when they have no words in common.

WMD uses word2vec vector embeddings [2] of words (read about word2vec in Gensim [here](http://rare-technologies.com/deep-learning-with-word2vec-and-gensim/) and [here](http://rare-technologies.com/word2vec-tutorial/)).

WMD is illustrated below for two very similar sentences. The sentences have no words in common, but by matching the relevant words, WMD is able to accurately measure the similarity between the two sentences. The method also uses the bag-of-words representation of the documents (simply put, the word frequencies in the documents), denoted $d$ in the figure below. The intuition behind the method is that we find the minimum "traveling distance" between documents, that is, the most efficient way to "move" the distribution of document 1 to the distribution of document 2.

<img src='https://vene.ro/images/wmd-obama.png' height='600' width='600'>
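
As a minimal sketch of the bag-of-words representation $d$ shown in the figure, the snippet below computes normalized word frequencies for a tokenized document. The helper name `nbow` is hypothetical and not part of Gensim; it only illustrates the idea.

```python
from collections import Counter


def nbow(tokens):
    """Map each distinct word to its relative frequency in the document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # float() keeps the division exact under both Python 2 and Python 3.
    return {word: float(count) / total for word, count in counts.items()}


# A tokenized sentence with stopwords already removed.
doc = ['obama', 'speaks', 'media', 'illinois']
d = nbow(doc)
print(d)
```

Each of the four words carries a quarter of the document's "mass", which is the quantity that WMD moves between documents.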


This method was introduced in the article "From Word Embeddings To Document Distances" by Matt Kusner et al. ([link to PDF](http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf)). It is inspired by the "Earth Mover's Distance", and employs a solver of the "transportation problem".
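
For reference, the transportation problem can be written as follows, in the notation of Kusner et al. [1]. Here $d$ and $d'$ are the normalized bag-of-words vectors of the two documents, $x_i$ is the embedding of word $i$, $n$ is the vocabulary size, and $T_{ij}$ is the amount of word $i$ in the first document that "travels" to word $j$ in the second. Apart from $d$, these symbols are not used elsewhere in this tutorial.

$$
\min_{T \ge 0} \; \sum_{i,j=1}^{n} T_{ij}\, c(i,j)
\quad \text{subject to} \quad
\sum_{j=1}^{n} T_{ij} = d_i \;\; \forall i,
\qquad
\sum_{i=1}^{n} T_{ij} = d'_j \;\; \forall j,
\qquad \text{where } c(i,j) = \lVert x_i - x_j \rVert_2 .
$$

The constraints say that all of the mass of each word in the first document must leave, and exactly the mass of each word in the second document must arrive. The minimum total cost is the WMD.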

In this tutorial, we will learn how to use Gensim's WMD functionality, which consists of the `wmdistance` method for distance computation, and the `WmdSimilarity` class for corpus-based similarity queries.

> **Note**:
>
> If you use this software, please consider citing the following papers:
>
> Ofir Pele and Michael Werman, "A linear time histogram metric for improved SIFT matching".
> 
> Ofir Pele and Michael Werman, "Fast and robust earth mover's distances".
>
> Matt Kusner et al. "From Word Embeddings To Document Distances".

> **Running this notebook:**
>
> You can download this [iPython Notebook](http://ipython.org/notebook.html) (**FIXME:** how?), and run it on your own computer, provided you have installed Gensim and NLTK, and downloaded the necessary data.
>
> The notebook was run on an Ubuntu machine with an Intel core i7-4770 CPU 3.40GHz (8 cores) and 32 GB memory. Running the entire notebook on this machine takes XYZ minutes (**FIXME**).

## Using WMD

To use WMD, we first need some word embeddings. You could train a word2vec model on some corpus (see the tutorial [here](http://rare-technologies.com/word2vec-tutorial/)), but we will start by downloading some pre-trained word2vec embeddings, available [here](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit). Training your own embeddings can be beneficial, but to simplify this tutorial, we will use pre-trained embeddings at first.

Let's start with two sentences whose distance we want to compute.

In [1]:
# Initialize logging.
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')

sentence1 = 'Obama speaks to the media in Illinois'
sentence2 = 'The president greets the press in Chicago'
sentence1 = sentence1.lower().split()
sentence2 = sentence2.lower().split()

These sentences have very similar content, and as such the WMD should be low. Before we compute the WMD, we want to remove stopwords ("the", "to", etc.), as these do not contribute a lot to the information in the sentences.

In [2]:
# Import and download stopwords from NLTK.
from nltk.corpus import stopwords
from nltk import download
download('stopwords')  # Download stopwords list.

# Remove stopwords.
stop_words = set(stopwords.words('english'))  # A set makes membership tests fast.
sentence1 = [w for w in sentence1 if w not in stop_words]
sentence2 = [w for w in sentence2 if w not in stop_words]

[nltk_data] Downloading package stopwords to /home/olavur/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Now, as mentioned earlier, we will be using some downloaded pre-trained embeddings. We load these into a Gensim Word2Vec model class. Note that the embeddings we have chosen here require a lot of memory.

In [4]:
from gensim.models import Word2Vec
model = Word2Vec.load_word2vec_format('/data/w2v_googlenews/GoogleNews-vectors-negative300.bin.gz', binary=True)

So let's compute WMD using the `wmdistance` method.

In [4]:
distance = model.wmdistance(sentence1, sentence2)

print(distance)

1.01746462593


Let's try the same thing with two completely unrelated sentences. Notice that the distance is larger.

In [5]:
sentence1 = 'Obama speaks to the media in Illinois'
sentence2 = 'Oranges are my favorite type of fruit'
sentence1 = sentence1.lower().split()
sentence2 = sentence2.lower().split()
sentence1 = [w for w in sentence1 if w not in stop_words]
sentence2 = [w for w in sentence2 if w not in stop_words]

distance = model.wmdistance(sentence1, sentence2)

print(distance)

1.36042560915
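
To make the idea of "moving" words concrete, here is a toy sketch that does not use Gensim. It rests on two assumptions that should be read loudly: the 2-D vectors in `toy_vectors` are invented for illustration and are not real word2vec embeddings, and the brute-force search in the hypothetical helper `toy_wmd` is only valid when both documents contain the same number of distinct words, each occurring once. In that special case some optimal flow pairs every word with exactly one word of the other document, so trying all pairings is enough. The general case needs a transportation solver.

```python
from itertools import permutations
from math import sqrt

# Invented 2-D "embeddings", for illustration only.
toy_vectors = {
    'obama': (1.0, 1.0), 'president': (1.1, 0.9),
    'speaks': (2.0, 3.0), 'greets': (2.2, 3.1),
    'media': (4.0, 1.0), 'press': (4.1, 1.2),
    'illinois': (5.0, 4.0), 'chicago': (5.1, 3.8),
    'oranges': (9.0, 9.0), 'favorite': (8.0, 7.0),
    'type': (7.0, 9.5), 'fruit': (9.5, 8.5),
}


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def toy_wmd(doc1, doc2):
    """Brute-force WMD for equally sized documents of distinct words."""
    sizes = {len(doc1), len(doc2), len(set(doc1)), len(set(doc2))}
    if len(sizes) != 1:
        raise ValueError('documents must have equally many distinct words')
    weight = 1.0 / len(doc1)  # every word carries the same mass
    # Try every one-to-one pairing and keep the cheapest total cost.
    return min(
        sum(weight * euclidean(toy_vectors[w1], toy_vectors[w2])
            for w1, w2 in zip(doc1, pairing))
        for pairing in permutations(doc2)
    )


s1 = ['obama', 'speaks', 'media', 'illinois']
s2 = ['president', 'greets', 'press', 'chicago']
s3 = ['oranges', 'favorite', 'type', 'fruit']

print(toy_wmd(s1, s2))  # related sentences: small distance
print(toy_wmd(s1, s3))  # unrelated sentences: larger distance
```

The numbers themselves mean nothing, since the vectors are made up; the point is that the cost is driven by how far each word has to travel to its best partner.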


## Similarity queries

You can use WMD to get the most similar documents to a query, using the `WmdSimilarity` class. Its interface is similar to what is described in the [Similarity Queries](https://radimrehurek.com/gensim/tut3.html) Gensim tutorial.

> **Important note:**
>
> WMD is a measure of *distance*, which is in fact the opposite of similarity. The similarities in `WmdSimilarity` are simply the *negative distance*. Be careful not to confuse distances and similarities.
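
A small sketch of the relationship described in the note, using made-up distance values: negating the distances turns "smaller is better" into "larger is better", so ranking by descending similarity gives the same order as ranking by ascending distance.

```python
# Made-up WMD values for three hypothetical documents.
distances = [1.36, 0.75, 1.02]

# Similarity as described in the note: the negative distance.
similarities = [-d for d in distances]

# Document indices, closest first.
by_distance = sorted(range(len(distances)), key=lambda i: distances[i])

# Document indices, most similar first.
by_similarity = sorted(range(len(similarities)),
                       key=lambda i: similarities[i], reverse=True)

print(by_distance)    # [1, 2, 0]
print(by_similarity)  # [1, 2, 0]
```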

Let's try similarity queries using some real world data. For that we'll be using Yelp reviews, available at http://www.yelp.com/dataset_challenge. This time around, we are going to train the Word2Vec embeddings on the data ourselves.

Below, a JSON file with Yelp reviews is read line by line; the text is extracted and tokenized, and stopwords and punctuation are removed. We use the restaurant that has the most reviews to run our queries against, and train Word2Vec on the top 6 restaurants in terms of number of reviews.

In [3]:
import json
from nltk import word_tokenize
download('punkt')  # Download data for tokenizer.

# Business IDs of some restaurants.
ids = ['4bEjOyTaDG24SY5TxsaUNQ', '2e2e7WgqU1BnpxmQL5jbfw', 'zt1TpTuJ6y9n551sw9TaEg',
      'Xhg93cMdemu5pAMkDoEdtQ', 'sIyHTizqAiGu12XMLX3N3g', 'YNQgak-ZLtYJQxlDwN-qIg']

# Load some text data.
w2v_corpus = []  # Documents to train word2vec on.
wmd_corpus = []  # Documents to run queries against.
documents = []  # wmd_corpus, with no pre-processing.
with open('/home/olavur/yelp_dataset_challenge_academic_dataset/yelp_academic_dataset_review.json') as data_file:
    for line in data_file:
        json_line = json.loads(line)
        
        if json_line['business_id'] not in ids:
            # Not interested in this business.
            continue
        
        # Pre-process document.
        text = json_line['text'].lower()  # Lower the text.
        text = word_tokenize(text)  # Split into words.
        text = [w for w in text if w not in stop_words]  # Remove stopwords.
        text = [w for w in text if w.isalpha()]  # Remove numbers and punctuation.
        
        # Add to corpus for training Word2Vec.
        w2v_corpus.append(text)
        
        if json_line['business_id'] == ids[0]:
            # Only use this restaurant in corpus for WmdSimilarity.
            wmd_corpus.append(text)
            documents.append(json_line['text'])

[nltk_data] Downloading package punkt to /home/olavur/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
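
The same pre-processing steps are applied again to every query later in this tutorial, so it can be convenient to wrap them in a function. The following is a sketch only: `preprocess` is a hypothetical helper, a tiny hard-coded stopword set and a simple regular expression stand in for NLTK's stopword list and `word_tokenize` so that the example runs without any downloads, and the two JSON lines and their business IDs are invented.

```python
import json
import re

# Stand-in for the NLTK stopword list; deliberately tiny.
STOP_WORDS = {'the', 'to', 'in', 'a', 'i', 'it', 'was', 'you', 'should', 'very'}


def preprocess(text, stop_words=STOP_WORDS):
    """Lower-case, tokenize, and drop stopwords and non-alphabetic tokens."""
    # Crude tokenizer: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    tokens = [w for w in tokens if w not in stop_words]
    return [w for w in tokens if w.isalpha()]


# Invented sample in the same one-JSON-object-per-line layout.
sample_lines = [
    '{"business_id": "A", "text": "It was good, I like the outside."}',
    '{"business_id": "B", "text": "Not relevant."}',
]
wanted_ids = {'A'}  # Hypothetical ID of the business we care about.

corpus = [
    preprocess(record['text'])
    for record in map(json.loads, sample_lines)
    if record['business_id'] in wanted_ids
]
print(corpus)
```

With the real data, the body of the loop above and the query pre-processing below could both call one such function, which keeps the corpus and the queries tokenized in exactly the same way.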


In [4]:
# Train Word2Vec on all the restaurants.
model = Word2Vec(w2v_corpus, workers=7, size=100)

Now we want to initialize the similarity class with a corpus and a word2vec model (which provides the embeddings and the `wmdistance` method itself).

In [8]:
from gensim.similarities import WmdSimilarity
instance = WmdSimilarity(wmd_corpus, model, num_best=10)

The `num_best` parameter decides how many results the queries return. Now let's try making a query. The output is a list of indices and similarities of documents in the corpus, sorted by similarity. Note that the output format is slightly different when `num_best` is `None` (the default, i.e. when it is not assigned).
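
To make the two output shapes concrete, here is a sketch in plain Python; it is not Gensim's implementation. It assumes that, with `num_best=None`, a query yields one similarity per corpus document, and shows how a list of `(index, similarity)` pairs like the one described above can be derived from such a sequence. The helper `top_n` and the similarity values are made up.

```python
def top_n(similarities, num_best):
    """Return the num_best highest (index, similarity) pairs, best first."""
    ranked = sorted(enumerate(similarities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:num_best]


# One made-up similarity per corpus document (the num_best=None shape).
all_sims = [-1.20, -0.75, -0.98, -0.85]

# The shape returned when num_best is set: (document index, similarity).
best = top_n(all_sims, num_best=3)
print(best)  # [(1, -0.75), (3, -0.85), (2, -0.98)]
```

The first element of each pair is the position of the document in the corpus, which is why the code further down can use it to look up the original review text.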

The query below is taken directly from one of the reviews in the corpus. Let's see if there are other reviews that are similar to this one.

In [9]:
sent = 'Very good, you should seat outdoor.'

# Pre-process query.
query = sent.lower()  # Lower the text.
query = word_tokenize(query)  # Split into words.
query = [w for w in query if w not in stop_words]  # Remove stopwords.
query = [w for w in query if w.isalpha()]  # Remove numbers and punctuation.

sims = instance[query]  # A query is simply a "look-up" in the similarity class.

The query and the most similar documents, together with the similarities, are printed below. We see that the retrieved documents use the word "outside", while the query uses the word "outdoor"; these documents are close to the query because the two words have similar meanings.

In [10]:
# Print the query and the retrieved documents, together with their similarities.
print('Query:')
print(sent)
for i in range(3):
    print('')
    print(sims[i][1])
    print(documents[sims[i][0]])

Query:
Very good, you should seat outdoor.

-0.750829606837
It was good I like the outside

-0.755294813249
It's a great place if you can sit outside in good weather.

-0.852301178568
Sat outside under heat lamps.  Good service and good food.  Wonderful place


Let's try a different query.

In [11]:
sent = 'I felt that the prices were extremely reasonable for the Strip'

# Pre-process query.
query = sent.lower()  # Lower the text.
query = word_tokenize(query)  # Split into words.
query = [w for w in query if w not in stop_words]  # Remove stopwords.
query = [w for w in query if w.isalpha()]  # Remove numbers and punctuation.

sims = instance[query]  # A query is simply a "look-up" in the similarity class.

print('Query:')
print(sent)
for i in range(3):
    print('')
    print(sims[i][1])
    print(documents[sims[i][0]])

Query:
I felt that the prices were extremely reasonable for the Strip

-0.710589861037
Reasonable prices. Makes for a nice dinner out in the town.

-0.831299321403
Exceptional food at reasonable prices.  Reservations are a must.

-0.858231103992
Had lunch here, food price was very reasonable for vegas and the atmosphere was great.


This time around, the results are more straightforward; the retrieved documents basically contain the same words as the query.

## References

1. Matt Kusner et al. *From Word Embeddings To Document Distances*, 2015.
2. Tomas Mikolov et al. *Efficient Estimation of Word Representations in Vector Space*, 2013.