# Using Pre-trained Word Embeddings

This notebook demonstrates how to use pre-trained word embeddings in GluonNLP.

To see why word embeddings are useful, it's worth comparing them to the alternative.
Without word embeddings, we might represent each word with a one-hot vector `[0, ..., 0, 1, 0, ..., 0]`
that takes value `1` at the index corresponding to that word in the vocabulary and value `0` everywhere else.
The weight matrices connecting our word-level inputs to the network's hidden layers would each be $v \times h$,
where $v$ is the size of the vocabulary and $h$ is the size of the hidden layer. 
With 100,000 words feeding into an LSTM layer with $1000$ nodes, the model would need to learn
$4$ different weight matrices (one for each of the LSTM gates), each with 100M weights, and thus 400 million parameters in total.
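
To make the arithmetic concrete, the short sketch below recomputes this count in plain Python and compares it with a model that first maps words to dense vectors. The embedding size `d = 100` and the function names are assumptions chosen for illustration, and the count covers input-to-hidden weights only (recurrent weights and biases are ignored).

```python
def one_hot_input_params(vocab_size, hidden_size, num_gates=4):
    """Input-to-hidden weights when one-hot vectors feed the LSTM directly."""
    return num_gates * vocab_size * hidden_size


def embedded_input_params(vocab_size, embed_dim, hidden_size, num_gates=4):
    """Embedding table plus input-to-hidden weights on the embedded input."""
    embedding_table = vocab_size * embed_dim
    gate_weights = num_gates * embed_dim * hidden_size
    return embedding_table + gate_weights


v, h, d = 100_000, 1000, 100  # d = 100 is an assumed embedding size
one_hot_total = one_hot_input_params(v, h)
embedded_total = embedded_input_params(v, d, h)
print(one_hot_total)   # 400000000
print(embedded_total)  # 10400000
```

Under these assumptions, most of the remaining parameters sit in the embedding table itself, which is exactly the part that pre-trained embeddings can supply.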

Fortunately, it turns out that a number of efficient techniques 
can quickly discover broadly useful word embeddings in an *unsupervised* manner.
These embeddings map each word onto a low-dimensional vector $w \in \mathbb{R}^d$ with $d$ commonly chosen to be roughly $100$.
Intuitively, these embeddings are chosen based on the contexts in which words appear. 
Words that appear in similar contexts (like "tennis" and "racquet") should have similar embeddings,
while words that do not (like "rat" and "gourmet") should have dissimilar embeddings.

Practitioners of deep learning for NLP typically initialize their models
using *pretrained* word embeddings, which brings in outside information and reduces the number of parameters that a neural network needs to learn from scratch.


Two popular word embeddings are Word2Vec and fastText. 
The following examples use pre-trained word embeddings drawn from the following sources:

* Word2Vec: https://arxiv.org/abs/1301.3781
* fastText project website: https://fasttext.cc/

To begin, let's first import the packages that we'll need for this example:

In [None]:
from mxnet import nd
import gluonnlp as nlp

## Pre-trained Word Embeddings

GluonNLP provides a number of pre-trained word embeddings. The cell below lists the first five available fastText sources.

In [None]:
nlp.embedding.list_sources('fasttext')[:5]

For simplicity of demonstration, we load a single fastText source, `wiki.en`;
the cell after it prints the vocabulary size and dimension of the loaded embedding.

In [None]:
emb = nlp.embedding.create('fasttext', source='wiki.en')

In [None]:
vocab_size, dim = emb.idx_to_vec.shape
print('Pre-trained embedding vocabulary size: {}, dimension: {}'.format(vocab_size, dim))

### Word Similarity

Given an input word, we can find its nearest words in the vocabulary by similarity.
The similarity between any pair of words can be measured by the cosine similarity
of their vectors.
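
As a minimal sketch of what cosine similarity measures, the plain-Python example below uses three toy vectors invented for illustration (they are not real embeddings, and `cosine_similarity` is a hypothetical helper). Vectors pointing in the same direction score close to 1 regardless of length, while orthogonal vectors score 0.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


# Toy 3-dimensional vectors, invented for illustration only.
toy = {
    'baby':   [1.0, 2.0, 0.0],
    'babies': [2.0, 4.0, 0.0],  # same direction as 'baby', twice the length
    'car':    [0.0, 0.0, 3.0],  # orthogonal to 'baby'
}

sim_same = cosine_similarity(toy['baby'], toy['babies'])  # close to 1
sim_orth = cosine_similarity(toy['baby'], toy['car'])     # exactly 0
print(sim_same, sim_orth)
```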

In [None]:
def norm_vecs_by_row(x):
    # Divide each row by its L2 norm; the small constant avoids division by zero.
    return x / nd.sqrt(nd.sum(x * x, axis=1) + 1E-10).reshape((-1, 1))

In [None]:
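def get_knn(emb, k, word):
    # Column vector for the query word.
    word_vec = emb[word].reshape((-1, 1))
    # Unit-normalize every vocabulary vector so the dot product ranks by cosine similarity.
    vocab_vecs = norm_vecs_by_row(emb.idx_to_vec)
    dot_prod = nd.dot(vocab_vecs, word_vec)
    # Request k+1 results because the query word itself is expected to rank first.
    indices = nd.topk(dot_prod.reshape((len(emb.idx_to_token), )), k=k+1, ret_typ='indices')
    indices = [int(i.asscalar()) for i in indices]
    return [emb.idx_to_token[i] for i in indices[1:]]  # Drop the input word.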
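
The function above normalizes the vocabulary vectors, scores each one by its dot product with the query vector, and keeps the top results after the query itself. The sketch below replays those steps in plain Python on toy vectors invented for illustration; `toy_knn` and `normalize` are hypothetical names, and unlike `get_knn` this version removes the query by name rather than by rank. Leaving the query vector unnormalized scales every score by the same positive factor, so the ranking is unchanged.

```python
import heapq
import math


def normalize(vec, eps=1e-10):
    """Scale a vector to (almost exactly) unit length."""
    norm = math.sqrt(sum(a * a for a in vec) + eps)
    return [a / norm for a in vec]


def toy_knn(vectors, k, word):
    """Return the k tokens whose normalized vectors best match the query."""
    query = vectors[word]
    scores = {
        token: sum(a * b for a, b in zip(normalize(vec), query))
        for token, vec in vectors.items()
    }
    ranked = heapq.nlargest(k + 1, scores, key=scores.get)
    return [token for token in ranked if token != word][:k]


# Toy 2-dimensional vectors, invented for illustration only.
toy = {
    'baby':    [1.0, 0.9],
    'babies':  [0.9, 1.0],
    'toddler': [1.0, 0.5],
    'engine':  [-1.0, 0.2],
}

neighbors = toy_knn(toy, 2, 'baby')
print(neighbors)  # ['babies', 'toddler']
```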

Let us find the 5 words most similar to 'baby' in the vocabulary
(its size was printed above when the embedding was loaded).

In [None]:
get_knn(emb, 5, 'baby')

We can verify the cosine similarity of vectors of 'baby' and 'babies'.

In [None]:
from mxnet import nd
def cos_sim(x, y):
    return nd.dot(x, y) / (nd.norm(x) * nd.norm(y))

cos_sim(emb['baby'], emb['babies'])

### Word Analogy

We can also apply pre-trained word embeddings to the word
analogy problem. 

For instance, "man : woman :: son : daughter" is an analogy.

The word analogy completion problem is defined as: for analogy 'a : b :: c : d',
given the first three words 'a', 'b', 'c', find 'd'. The idea is to find the
most similar word vector for vec('c') + (vec('b')-vec('a')).
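
The vector arithmetic can be checked by hand on a toy example. The sketch below uses four 2-dimensional vectors invented for illustration, where one axis loosely stands for gender and the other for generation; `complete_analogy` and `cosine` are hypothetical helper names, and real embeddings are far noisier than this.

```python
import math

# Toy 2-D vectors invented for illustration: (gender axis, generation axis).
toy = {
    'man':      [1.0, 1.0],
    'woman':    [-1.0, 1.0],
    'son':      [1.0, -1.0],
    'daughter': [-1.0, -1.0],
}


def cosine(x, y):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def complete_analogy(vectors, a, b, c):
    """Return the word most similar to vec(c) + (vec(b) - vec(a))."""
    target = [vc + vb - va for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    return max(vectors, key=lambda word: cosine(vectors[word], target))


answer = complete_analogy(toy, 'man', 'woman', 'son')
print(answer)  # daughter
```

Here the target vector is `[1, -1] + ([-1, 1] - [1, 1]) = [-1, -1]`, which coincides with the toy vector for 'daughter'.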

In this example, we will find words by analogy among all the words indexed in `emb`.

In [None]:
def get_top_k_by_analogy(emb, k, word1, word2, word3):
    word_vecs = emb[word1, word2, word3]
    # Target vector: vec(word3) + (vec(word2) - vec(word1)).
    word_diff = (word_vecs[1] - word_vecs[0] + word_vecs[2]).reshape((-1, 1))
    # Unit-normalize the vocabulary so the dot product ranks by cosine similarity.
    vocab_vecs = norm_vecs_by_row(emb.idx_to_vec)
    dot_prod = nd.dot(vocab_vecs, word_diff)
    indices = nd.topk(dot_prod.reshape((len(emb.idx_to_token), )), k=k, ret_typ='indices')
    indices = [int(i.asscalar()) for i in indices]
    # Note: the input words are not filtered out of the results.
    return [emb.idx_to_token[i] for i in indices]

Complete the word analogy 'man : woman :: son : ?'.

In [None]:
get_top_k_by_analogy(emb, 1, 'man', 'woman', 'son')

## API Docs

- [gluonnlp.embedding](https://gluon-nlp.mxnet.io/v0.9.x/api/modules/embedding.html) module, including
  - [`list_sources`](http://gluon-nlp.mxnet.io/v0.9.x/api/modules/embedding.html#gluonnlp.embedding.list_sources) function.
  - [`create`](http://gluon-nlp.mxnet.io/v0.9.x/api/modules/embedding.html#gluonnlp.embedding.create) function.
  - [`TokenEmbedding`](http://gluon-nlp.mxnet.io/v0.9.x/api/modules/embedding.html#gluonnlp.embedding.TokenEmbedding) class.
- [Vocabulary and Embedding API](http://gluon-nlp.mxnet.io/v0.9.x/api/notes/vocab_emb.html) notes.
- [`mxnet.ndarray`](https://mxnet.apache.org/api/python/docs/api/ndarray/index.html) module.


## Exercise

- Try a few other words in the similarity search and see whether the results make sense. Pick some analogous word pairs and see whether the embedding gets them right.
- Replace the embedding with another pre-trained embedding and compare the results with those from the fastText embedding.
  - Pick one of the available pre-trained GloVe embeddings.
  - You can find them with `gluonnlp.embedding.list_sources('glove')`.
- Can you find any bias in the embeddings? Look at the top-k results for `doctor - man + woman = ?`.

In [None]:
get_knn(emb, 5, 'neural')

In [None]:
get_top_k_by_analogy(emb, 5, 'man', 'doctor', 'woman')

In [None]:
get_top_k_by_analogy(emb, 5, 'man', 'clever', 'woman')