# Finding Synonyms and Analogies

## Using Pre-trained Word Vectors

In [0]:
!pip install mxnet

# Background on Apache MXNet: https://en.wikipedia.org/wiki/Apache_MXNet

Collecting mxnet
  Downloading https://files.pythonhosted.org/packages/81/f5/d79b5b40735086ff1100c680703e0f3efc830fa455e268e9e96f3c857e93/mxnet-1.6.0-py2.py3-none-any.whl (68.7MB)
Collecting graphviz<0.9.0,>=0.8.1
  Downloading https://files.pythonhosted.org/packages/53/39/4ab213673844e0c004bed8a0781a0721a3f6bb23eb8854ee75c236428892/graphviz-0.8.4-py2.py3-none-any.whl
Installing collected packages: graphviz, mxnet
  Found existing installation: graphviz 0.10.1
    Uninstalling graphviz-0.10.1:
      Successfully uninstalled graphviz-0.10.1
Successfully installed graphviz-0.8.4 mxnet-1.6.0


In [0]:
from mxnet import nd
from mxnet.contrib import text

text.embedding.get_pretrained_file_names().keys()

dict_keys(['glove', 'fasttext'])

List all pre-trained files available for the `glove` embedding.

In [0]:
print(text.embedding.get_pretrained_file_names('glove'))

['glove.42B.300d.txt', 'glove.6B.50d.txt', 'glove.6B.100d.txt', 'glove.6B.200d.txt', 'glove.6B.300d.txt', 'glove.840B.300d.txt', 'glove.twitter.27B.25d.txt', 'glove.twitter.27B.50d.txt', 'glove.twitter.27B.100d.txt', 'glove.twitter.27B.200d.txt']


Load a particular pre-trained embedding. We bind it to `glove_6b50d`, the name used by the remaining cells.

In [0]:
glove_6b50d = text.embedding.create(
    '...', pretrained_file_name='...')

Print the vocabulary size. It covers 400,000 words plus one special unknown token.

In [0]:
len(glove_6b50d)

400001

We can look up a word's index in the vocabulary, or recover the word from its index.

In [0]:
glove_6b50d.token_to_idx['...'], glove_6b50d.idx_to_token[...]

(3367, 'beautiful')
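
The two-way lookup can be pictured with a toy vocabulary. The following is a minimal sketch in plain Python, not the MXNet implementation: the four-word list, the `lookup` helper, and the choice of index 0 for an `'<unk>'` token are all assumptions made for illustration.

```python
# Toy vocabulary (hypothetical); index 0 is assumed to be reserved
# for the unknown token.
idx_to_token = ['<unk>', 'beautiful', 'baby', 'amazon']

# Invert the list to get the word -> index direction.
token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}

def lookup(token):
    """Return the index of `token`, or the unknown index if it is absent."""
    return token_to_idx.get(token, 0)

known = lookup('baby')            # index of a word in the vocabulary
round_trip = idx_to_token[known]  # back from the index to the word
unseen = lookup('zebra')          # a word outside the vocabulary
print(known, round_trip, unseen)
```

Keeping a list for index-to-word and a dictionary for word-to-index makes both directions constant-time lookups.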

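## Applying Pre-trained Word Vectors

### Finding Synonyms

To find the words closest in meaning to a query word, we rank every vector in the vocabulary by its cosine similarity to the query vector and keep the top k, that is, a k-nearest-neighbors (kNN) search.
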
In [0]:
def knn(W, x, k):
    # The added 1e-9 is for numerical stability
    cos = nd.dot(W, x.reshape((-1,))) / (
        (nd.sum(..., axis=1)).sqrt() * nd.sum(...).sqrt())
    topk = nd.topk(cos, k=k, ret_typ='indices').asnumpy().astype('int32')
    return topk, [cos[i].asscalar() for i in topk]
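
For readers who want to see the whole computation, here is a minimal sketch of a cosine-similarity kNN search in plain Python rather than MXNet. The name `cosine_knn`, the toy matrix, and the placement of the `1e-9` term (added to each row's squared norm) are assumptions for illustration, not a reconstruction of the cell above.

```python
import math

def cosine_knn(W, x, k):
    """Return indices and cosine similarities of the k rows of W closest to x."""
    eps = 1e-9  # assumed placement: keeps the division defined for all-zero rows
    x_norm = math.sqrt(sum(v * v for v in x))  # x is assumed to be non-zero
    sims = []
    for row in W:
        dot = sum(a * b for a, b in zip(row, x))
        row_norm = math.sqrt(sum(a * a for a in row) + eps)
        sims.append(dot / (row_norm * x_norm))
    # Sort row indices by similarity, largest first, and keep the top k.
    order = sorted(range(len(W)), key=lambda i: sims[i], reverse=True)[:k]
    return order, [sims[i] for i in order]

# Hypothetical 2-D "embedding matrix"; row 0 is all zeros.
W = [[0.0, 0.0],
     [1.0, 0.0],
     [0.9, 0.1],
     [0.0, 1.0]]
topk, cos = cosine_knn(W, [1.0, 0.0], k=2)
print(topk, cos)
```

The all-zero row shows why a small constant helps: without it, that row's norm would be zero and the division would fail.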

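Then we search for synonyms using the pre-trained word vector instance `embed`. The query word is always its own nearest neighbor, so we request `k+1` neighbors and drop the first.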

In [0]:
def get_similar_tokens(query_token, k, embed):
    topk, cos = knn(embed.idx_to_vec,
                    embed.get_vecs_by_tokens([query_token]), k+1)
    for i, c in zip(topk[1:], cos[1:]):  # Remove input words
        print('cosine sim=%.3f: %s' % (c, (embed.idx_to_token[i])))

Search the 400,000-word vocabulary for the three words most similar to "amazon".

In [0]:
get_similar_tokens('amazon', 3, glove_6b50d)

cosine sim=0.663: unbox
cosine sim=0.653: amazon.com
cosine sim=0.647: palm


In [0]:
get_similar_tokens('baby', 3, glove_6b50d)

cosine sim=0.839: babies
cosine sim=0.800: boy
cosine sim=0.792: girl


In [0]:
get_similar_tokens('beautiful', 3, glove_6b50d)

cosine sim=0.921: lovely
cosine sim=0.893: gorgeous
cosine sim=0.830: wonderful


### Finding Analogies

Word vectors can also complete analogies. For an analogy a : b :: c : d where the first three words are given, we look for the word d whose vector is most similar to vec(c) + vec(b) - vec(a), as in "man" : "woman" :: "boy" : "girl".

In [0]:
def get_analogy(token_a, token_b, token_c, embed):
    vecs = embed.get_vecs_by_tokens([token_a, token_b, token_c])
    x = ...
    topk, cos = knn(...)
    return embed.idx_to_token[topk[0]]  
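
The vector arithmetic behind analogy search can be tried on a toy example. The sketch below is plain Python under stated assumptions: the 2-D vectors are hand-made (one axis loosely standing for gender, the other for age), and `cosine` and `analogy` are hypothetical helpers, not the MXNet cell above.

```python
import math

def cosine(u, v):
    """Cosine similarity with a small constant guarding against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-9)

def analogy(token_a, token_b, token_c, vectors):
    """Return the word whose vector is closest to vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc for va, vb, vc in
              zip(vectors[token_a], vectors[token_b], vectors[token_c])]
    # This sketch returns the single best match and does not exclude
    # the three input words from the candidates.
    return max(vectors, key=lambda w: cosine(vectors[w], target))

# Hand-made toy vectors (hypothetical values).
vectors = {
    'man':   [1.0, 1.0],
    'woman': [-1.0, 1.0],
    'boy':   [1.0, -1.0],
    'girl':  [-1.0, -1.0],
}
result = analogy('man', 'woman', 'boy', vectors)
print(result)
```

With real embeddings the target vector rarely coincides with any word vector, so the nearest neighbor is only an approximation and can miss, as some outputs in this section show.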

In [0]:
get_analogy('man', 'woman', 'boy', glove_6b50d)

'girl'

In [0]:
get_analogy('china', 'beijing', 'japan', glove_6b50d)

'tokyo'

In [0]:
get_analogy('bad', 'worst', 'nice', glove_6b50d)

'place'

In [0]:
get_analogy('do', 'did', 'go', glove_6b50d)

'went'