## Finding similar words and analogies with word embeddings

In [None]:
!pip install -qU gluonnlp mxnet awscli botocore boto3

In [None]:
import mxnet as mx
from mxnet import nd      # NDArray API
import gluonnlp as nlp

### Cosine similarity

The cosine similarity of two vectors is their dot product divided by the product of their norms. This value is the cosine of the angle between the two vectors, hence the name.

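  * vectors pointing in the same direction (collinear) --> close to +1 : words are used in similar contexts
  * vectors pointing in opposite directions --> close to -1 : words are used in different contexts
  * orthogonal vectors --> close to 0 : words show little or no relationship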
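
Written out for two vectors $a$ and $b$ separated by an angle $\theta$ (these symbols are ours, chosen to match the `a` and `b` used in the code cells):

$$
\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \, \sqrt{\sum_i b_i^2}}
$$

Because both norms appear in the denominator, rescaling either vector by a positive factor leaves the value unchanged: only the direction matters.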

In [None]:
from IPython.display import Image
Image("cosine.png")
# Source: Wikipedia

In [None]:
# Almost collinear: b ≈ 2a
a = mx.nd.array([1, 2])
b = mx.nd.array([2.1, 3.9])
print(mx.nd.dot(a, b) / (a.norm() * b.norm()))

# Almost opposite: b ≈ -2a
a = mx.nd.array([1, 2])
b = mx.nd.array([-2.1, -3.9])
print(mx.nd.dot(a, b) / (a.norm() * b.norm()))

# Almost orthogonal
a = mx.nd.array([1, 2])
b = mx.nd.array([2.1, -0.9])
print(mx.nd.dot(a, b) / (a.norm() * b.norm()))

Let's define a function that looks up the embeddings of two words and computes their cosine similarity.

In [None]:
def cos_similarity(embedding, word1, word2):
    vec1, vec2 = embedding[word1], embedding[word2]
    return mx.nd.dot(vec1, vec2) / (vec1.norm() * vec2.norm())
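
To make the contract of `cos_similarity` concrete without downloading anything, here is a minimal plain-Python sketch of the same computation. The dictionary `toy_embedding`, its three made-up vectors, and the name `cos_similarity_plain` are assumptions for illustration only; they are not part of GluonNLP.

```python
import math

# Hypothetical 3-dimensional "embeddings", invented for illustration only.
toy_embedding = {
    "burger": [0.9, 0.1, 0.3],
    "fries": [0.8, 0.2, 0.4],
    "burger_x10": [9.0, 1.0, 3.0],  # same direction as "burger", 10x longer
}

def cos_similarity_plain(embedding, word1, word2):
    """Plain-Python counterpart of cos_similarity (no MXNet needed)."""
    vec1, vec2 = embedding[word1], embedding[word2]
    dot = sum(x * y for x, y in zip(vec1, vec2))
    norm1 = math.sqrt(sum(x * x for x in vec1))
    norm2 = math.sqrt(sum(y * y for y in vec2))
    return dot / (norm1 * norm2)

sim = cos_similarity_plain(toy_embedding, "burger", "fries")
sim_scaled = cos_similarity_plain(toy_embedding, "burger_x10", "fries")
print(sim, sim_scaled)
```

Both calls print the same value up to floating-point rounding, which illustrates that cosine similarity ignores vector length and compares directions only.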

### Comparing word embeddings

GloVe is a popular algorithm for learning word embeddings. Let's see which pre-trained embeddings are available in GluonNLP.

In [None]:
nlp.embedding.list_sources('glove')     # we could also use 'fasttext' and 'word2vec'

Let's download 50-dimensional embeddings built from a 6-billion-word corpus.

In [None]:
glove = nlp.embedding.create('glove', source='glove.6B.50d')

Let's build a vocabulary from the tokens of that embedding, and attach the embedding to it.

In [None]:
vocab = nlp.Vocab(nlp.data.Counter(glove.idx_to_token))
vocab.set_embedding(glove)

In [None]:
len(vocab)

These are the two words we'd like to compare.

In [None]:
word1 = 'burger'
word2 = 'fries'

Let's print their embeddings.

In [None]:
print(vocab.embedding[word1])
print(vocab.embedding[word2])

In [None]:
print('Similarity:', cos_similarity(glove, word1, word2).asnumpy()[0])

### Finding words that have a similar meaning

We need a simple way to compute the dot product of a given embedding with all other embeddings, and select the top values.

First, we normalize all embeddings to unit length, so that the dot product of two normalized vectors is their cosine similarity, in the [-1, +1] range.

In [None]:
def norm_vecs_by_row(x):
    return x / nd.sqrt(nd.sum(x * x, axis=1) + 1E-10).reshape((-1,1))
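
As a sanity check on what row normalization buys us, the sketch below redoes it in plain Python on three small rows. The helper name `norm_rows` and the sample rows are assumptions for illustration. The `eps` term mirrors the `1E-10` above, which presumably guards against division by zero for all-zero rows (for example, tokens that have no pre-trained vector).

```python
import math

def norm_rows(rows, eps=1e-10):
    """Scale each row to (almost exactly) unit length, like norm_vecs_by_row."""
    return [[x / math.sqrt(sum(v * v for v in row) + eps) for x in row]
            for row in rows]

rows = [[3.0, 4.0], [1.0, 0.0], [0.0, 0.0]]  # the last row is all zeros
unit = norm_rows(rows)

# Non-zero rows end up with length ~1; the zero row stays zero instead of
# raising a division-by-zero error.
lengths = [math.sqrt(sum(x * x for x in row)) for row in unit]

# The dot product of two normalized rows is the cosine of the original rows:
# cos([3, 4], [1, 0]) = 3 / 5.
cosine = sum(x * y for x, y in zip(unit[0], unit[1]))
print(lengths, cosine)
```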

In [None]:
Image("normalized vectors.png")
# Source: "Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation", Xing et al, 2015

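This function computes the dot products for a given word, and returns the top 'k' ones. The query vector itself is not normalized, but that only scales every score by the same positive factor, so the ranking is unchanged. We use the NDArray API from Apache MXNet.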

In [None]:
def get_k_nearest_words(vocab, k, word):
    word_vec = vocab.embedding[word].reshape((-1, 1))
    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)
    dot_prod = nd.dot(vocab_vecs, word_vec)
    indices = nd.topk(dot_prod.reshape((len(vocab), )), k=k+1, ret_typ='indices')
    indices = [int(i.asscalar()) for i in indices]
    # Drop the first hit, which is the input word itself (it has the highest score).
    return vocab.to_tokens(indices[1:])

In [None]:
get_k_nearest_words(vocab, 5, 'burger')
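
The ranking logic can be tried on a tiny vocabulary without MXNet. The sketch below is a plain-Python approximation under stated assumptions: the four 2-D vectors are invented, `k_ranked_plain` is a hypothetical helper, and it skips the query word by name rather than dropping the first hit as the function above does.

```python
import math

# Hypothetical 2-D vectors, invented for illustration only.
toy_vectors = {
    "burger": [1.0, 0.2],
    "fries": [0.9, 0.3],
    "pizza": [0.8, 0.5],
    "opera": [-0.7, 0.6],
}

def unit(vec):
    """Return vec scaled to unit length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def k_ranked_plain(vectors, k, word, nearest=True):
    """Rank all other words by cosine similarity to `word`."""
    query = unit(vectors[word])
    scores = {
        other: sum(q * x for q, x in zip(query, unit(vec)))
        for other, vec in vectors.items()
        if other != word  # skip the query word explicitly
    }
    # nearest=True sorts by descending score, nearest=False by ascending score.
    return sorted(scores, key=scores.get, reverse=nearest)[:k]

nearest = k_ranked_plain(toy_vectors, 2, "burger")
furthest = k_ranked_plain(toy_vectors, 1, "burger", nearest=False)
print(nearest, furthest)
```

On a real vocabulary of several hundred thousand words, a full sort like this would be wasteful, which is why the notebook relies on `nd.topk` instead.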

Of course, we could do the opposite and pick the 'k' smallest values to find the least similar words.

In [None]:
def get_k_furthest_words(vocab, k, word):
    word_vec = vocab.embedding[word].reshape((-1, 1))
    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)
    dot_prod = nd.dot(vocab_vecs, word_vec)
    # Ascending order: the k smallest dot products come first.
    indices = nd.topk(dot_prod.reshape((len(vocab), )), k=k, ret_typ='indices', is_ascend=True)
    indices = [int(i.asscalar()) for i in indices]
    # The input word has the largest score, so it never shows up here: nothing to drop.
    return vocab.to_tokens(indices)

In [None]:
get_k_furthest_words(vocab, 5, 'burger')

### Finding analogies

If the offset from vector1 to vector2 (that is, vector2 - vector1) is close to the offset from vector3 to vector4, then (word1, word2) and (word3, word4) illustrate a similar relationship.

This function takes three words as input: it literally computes vector2 - vector1 + vector3, computes the dot product of the resulting vector with all other embeddings, and returns the top 'k' values.

In [None]:
def get_top_k_by_analogy(vocab, k, word1, word2, word3):
    word_vecs = vocab.embedding[word1, word2, word3]
    word_diff = (word_vecs[1] - word_vecs[0] + word_vecs[2]).reshape((-1, 1))
    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)
    dot_prod = nd.dot(vocab_vecs, word_diff)
    indices = nd.topk(dot_prod.reshape((len(vocab), )), k=k, ret_typ='indices')
    indices = [int(i.asscalar()) for i in indices]
    return vocab.to_tokens(indices)
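
The vector arithmetic is easier to follow on a hand-made example. In the plain-Python sketch below, the 2-D vectors are invented so that axis 0 roughly encodes temperature and axis 1 the comparative form; `analogy_plain` is a hypothetical helper. Unlike `get_top_k_by_analogy`, it excludes the three input words from the candidates, a common convention when evaluating analogies.

```python
import math

# Hypothetical 2-D vectors: axis 0 ~ "temperature", axis 1 ~ "comparative form".
toy_vectors = {
    "cold": [-1.0, 0.0],
    "colder": [-1.0, 1.0],
    "warm": [1.0, 0.0],
    "warmer": [1.0, 1.0],
    "hot": [2.0, 0.0],
}

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def analogy_plain(vectors, k, word1, word2, word3):
    """Return the k words closest to vector2 - vector1 + vector3."""
    target = [b - a + c for a, b, c in
              zip(vectors[word1], vectors[word2], vectors[word3])]
    scores = {
        word: cosine(target, vec)
        for word, vec in vectors.items()
        if word not in (word1, word2, word3)  # assumption: skip the input words
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

result = analogy_plain(toy_vectors, 1, "cold", "colder", "warm")
print(result)
```

Here the target vector is [1, 1], which coincides with the toy vector for "warmer". Real embeddings are much noisier, so the top answer is not always the expected one.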

'Cold' is to 'colder' what 'warm' is to ?

In [None]:
get_top_k_by_analogy(vocab, 1, 'cold', 'colder', 'warm')

'King' is to 'man' what 'queen' is to ?

In [None]:
get_top_k_by_analogy(vocab, 1, 'king', 'man', 'queen')

'Cars' is to 'ferrari' what 'jewelry' is to ?

In [None]:
get_top_k_by_analogy(vocab, 1, 'cars', 'ferrari', 'jewelry')