The **Word Mover's Distance (WMD)** builds on the observation that distances between embedded word vectors are semantically meaningful. It treats each document as a weighted point cloud of embedded words.
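
To make the "weighted point cloud" idea concrete, here is a minimal sketch of how the weights could be computed, assuming each word's weight is its normalized frequency in the document (the normalized bag-of-words weighting). The helper name `nbow_weights` is made up for this illustration and is not part of gensim.

```python
from collections import Counter


def nbow_weights(tokens):
    """Return each word's share of the document (weights sum to 1)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Hypothetical example document, already lowercased and tokenized
doc = "the prime minister met the tv host".split()
weights = nbow_weights(doc)
print(weights)
```

Each key of `weights` corresponds to one point in the cloud (the word's embedding), and the value is the mass sitting on that point.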

The distance between two text documents A and B is the minimum cumulative distance that the words of document A need to travel to exactly match the point cloud of document B.
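
The "minimum cumulative distance" is the optimum of a transport problem. The sketch below solves it for two tiny, made-up point clouds with `scipy.optimize.linprog`. It only illustrates the idea and is not how gensim computes the distance internally. The function name `toy_wmd`, the 2-D "embeddings", and the weights are all assumptions for the example.

```python
import numpy as np
from scipy.optimize import linprog


def toy_wmd(vectors_a, weights_a, vectors_b, weights_b):
    """Solve the transport problem for two small weighted point clouds."""
    n, m = len(vectors_a), len(vectors_b)
    # cost[i, j]: Euclidean distance between word i of A and word j of B
    cost = np.linalg.norm(vectors_a[:, None, :] - vectors_b[None, :, :], axis=2)

    # The flow matrix T (n x m) is flattened row by row into n * m variables.
    constraints = []
    for i in range(n):
        # Word i of A must send out exactly its own weight.
        row = np.zeros((n, m))
        row[i, :] = 1
        constraints.append(row.ravel())
    for j in range(m):
        # Word j of B must receive exactly its own weight.
        col = np.zeros((n, m))
        col[:, j] = 1
        constraints.append(col.ravel())

    result = linprog(
        cost.ravel(),
        A_eq=np.array(constraints),
        b_eq=np.concatenate([weights_a, weights_b]),
        bounds=(0, None),  # flows cannot be negative
    )
    return result.fun, result.x.reshape(n, m)


# Made-up 2-D "embeddings": two words per document, equal weights
vectors_a = np.array([[0.0, 0.0], [1.0, 0.0]])
vectors_b = np.array([[0.0, 1.0], [1.0, 1.0]])
weights_a = np.array([0.5, 0.5])
weights_b = np.array([0.5, 0.5])

distance, flow = toy_wmd(vectors_a, weights_a, vectors_b, weights_b)
print(distance)
print(flow)
```

In this toy case the cheapest plan moves each word of A straight to the word of B directly above it, so the distance is 1.0 up to solver tolerance. The flow matrix shows which word pairs carry the mass, which is where the interpretability of WMD comes from.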

WMD exploits the semantic and syntactic relationships captured by the word embeddings to measure how similar two text documents are.

Some properties of WMD:
* no hyperparameters
* interpretable: the distance between documents breaks down into sparse distances between a few individual words
* incorporates knowledge encoded in a word embedding

To use WMD, you need some word embeddings.

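In gensim, word-vector objects (`gensim.models.KeyedVectors`, also exposed as the `.wv` attribute of a `gensim.models.Word2Vec` model) have a `wmdistance` method, so you can load embeddings and compute the distance with this metric directly.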

In [1]:
sentence_a = "Modi had a chat with Bear Grylls in Jim Corbett"
sentence_b = "The prime minister met the TV host in a National Park"

sentence_a = sentence_a.lower().split()
sentence_b = sentence_b.lower().split()

In [2]:
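from nltk.corpus import stopwords

# Needs the NLTK stop word list; run nltk.download("stopwords") once if it is missing.
stop_words = set(stopwords.words("english"))  # set for fast membership tests
sentence_a = [word for word in sentence_a if word not in stop_words]
sentence_b = [word for word in sentence_b if word not in stop_words]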

Running `wmdistance` below fails with a runtime error because `pyemd` is not installed in my environment.

You can install it with `conda install -c conda-forge pyemd`.
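
If you want to check for the dependency before loading the embeddings, a small guard like the following can help. This is a sketch that assumes `pyemd` is the missing backend, as in the traceback in this notebook. `has_module` is a hypothetical helper, not a gensim function.

```python
import importlib.util


def has_module(name):
    """Return True if the top-level module `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


if not has_module("pyemd"):
    print("pyemd is not installed; wmdistance will raise ModuleNotFoundError")
```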

In [3]:
from gensim.models import KeyedVectors

# GloVe vectors in word2vec text format (as the file name suggests)
glove_file = "data/glove_wikipedia_embeddings/glove.6B.100d.txt.word2vec"

model = KeyedVectors.load_word2vec_format(glove_file)
distance = model.wmdistance(sentence_a, sentence_b)
print("Sentence 1: {}\nSentence 2: {}\nDistance: {}".format(sentence_a, sentence_b, distance))

ModuleNotFoundError: No module named 'pyemd'

Now we try a sentence whose meaning is further from the first one:

In [4]:
new_sentence = "Leos are born in august".lower().split()
new_sentence = [word for word in new_sentence if word not in stop_words]
new_distance = model.wmdistance(sentence_a, new_sentence)
print(new_distance)

ModuleNotFoundError: No module named 'pyemd'