# Word Embeddings

### Introduction

In this lesson, we'll consider different ways that we can represent an NLP dataset so that it can be used by a machine learning algorithm.  So far, we have seen relatively simple ways of translating text into something numerical.  Essentially, we represented each word in our corpus by a different feature, and then indicated how important that word was to a document either by its frequency (as in bag of words), or by its `term frequency` * `inverse document frequency`.

With either our bag of words or `tfidf` representation, each word is a separate basis vector, and thus linearly independent of each of the other words.

<img src="./car-nlp.png" width="60%">

This technique has a couple of significant issues:

1. Sparse vectors

With each additional word in our corpus, we need an additional feature.  More problematic is that some of these words will rarely occur, so it can become difficult to determine the significance of a word with relatively few training examples.

2. Unrelated words

Another issue is that with each word represented as a separate dimension, we cannot tell the relationship between words.  But we *do* think of words as being related to other words, and even as combinations of other words or qualities.

In this lesson, we'll see how we can represent each *word* as a vector of attributes with word embeddings. 

### Representing Words

Now the idea of word embeddings is to represent words by some deeper underlying meaning.  For example, let's say we represent the words `king`, `queen`, `president`, and `prime_minister` with the following vectors:

In [5]:
king = [1, 1400, 2]
queen = [0, 1400, 3]
president = [.7, 1900, 1]
prime_minister = [.6, 1900, 2]

Above, the first coordinate could represent the sex associated with the word, with 1 for male and 0 for female.  The second coordinate represents the time period associated with the word, and the third represents the geography associated with the word (U.S. vs Europe).

One benefit of this representation is that our vectors are no longer completely independent of each other.  Instead, they share components with the other words.  So now, we can see that `king` and `queen` appear more related than `king` and `prime_minister`.

In addition, we can now represent our corpus with fewer dimensions.  Previously, we essentially had a one hot encoding to represent each word, where every word is assigned an index.  With English having, by some estimates, over a million words, this leads to a lot of dimensions.  With word embeddings, each word is typically represented with between 50 and 300 dimensions.  And because our vectors are not perpendicular to one another, we can calculate how similar words are by calculating their closeness to each other.
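To make "closeness" concrete, here is a minimal sketch that measures the Euclidean distance between the toy vectors above.  The helper name `euclidean_distance` is our own, not part of any library.

```python
import math

# Toy vectors from the lesson: [sex, time period, geography]
king = [1, 1400, 2]
queen = [0, 1400, 3]
prime_minister = [0.6, 1900, 2]

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

king_queen = euclidean_distance(king, queen)
king_pm = euclidean_distance(king, prime_minister)

print(king_queen)  # about 1.41
print(king_pm)     # about 500.0
```

Notice that the time period coordinate dominates the distance here because it is on a much larger scale than the other two.  With hand-made features like these, we would likely want to rescale each coordinate first.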

### Real World Embeddings

Now let's see how we can work with embeddings in PyTorch.  The torchtext library makes available several sets of pretrained embeddings that allow us to represent our words as vectors.  Ok, let's once again download our IMDB dataset with torchtext.

In [40]:
# Field and LabelField belong to torchtext's legacy API (older torchtext releases)
from torchtext import datasets, data
import torch

TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.float)

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)



Ok, now this time when we build our vocabulary, we'll pass in the argument `vectors = "glove.6B.100d"`.

In [41]:
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)

In [15]:
LABEL.build_vocab(train_data)

> Downloading the GloVe vectors takes both time and a significant amount of disk space.  If you wish to delete the file, look in the hidden `.vector_cache` folder, located in your current folder.  If you wish to reuse your local GloVe file, simply specify the name of the file to reference.

Now the way embeddings work is a two-step process.  Each word is assigned a separate index, just like before.  But then torchtext will use this index to find the 100-dimensional vector related to the word.  Let's see this.

> We can again use `stoi` (string to integer) to find the related index.

In [73]:
TEXT.vocab.stoi['dog']

1124

And then from there, we can use this index to find the related vector.

In [74]:
TEXT.vocab.vectors[1124].shape

torch.Size([100])
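The two-step lookup can be sketched in plain Python.  The three-word vocabulary and the 4-dimensional vectors below are made up for illustration, and the `lookup` helper with its fallback to `<unk>` is our own assumption rather than torchtext code.

```python
# Hypothetical vocabulary with made-up 4-dimensional vectors
itos = ["<unk>", "dog", "puppy"]                     # index -> word
stoi = {word: idx for idx, word in enumerate(itos)}  # word -> index

vectors = [
    [0.0, 0.0, 0.0, 0.0],  # <unk>
    [0.9, 0.1, 0.4, 0.3],  # dog
    [0.8, 0.2, 0.5, 0.3],  # puppy
]

def lookup(word):
    """Step 1: word -> index.  Step 2: index -> vector."""
    idx = stoi.get(word, stoi["<unk>"])  # unseen words fall back to <unk>
    return vectors[idx]

print(stoi["dog"])      # 1
print(lookup("dog"))    # [0.9, 0.1, 0.4, 0.3]
print(lookup("zebra"))  # [0.0, 0.0, 0.0, 0.0]
```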

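We can calculate how similar a word is to another word by comparing their vectors.  Here we use cosine similarity, which measures the angle between two vectors: values near 1 mean the vectors point in nearly the same direction.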

In [75]:
puppy_vec = TEXT.vocab.vectors[TEXT.vocab.stoi['puppy']]
dog_vec = TEXT.vocab.vectors[TEXT.vocab.stoi['dog']]

torch.cosine_similarity(puppy_vec.unsqueeze(0), dog_vec.unsqueeze(0))

tensor([0.7236])

In [76]:
banana_vec = TEXT.vocab.vectors[TEXT.vocab.stoi['banana']]

In [77]:
torch.cosine_similarity(puppy_vec.unsqueeze(0), banana_vec.unsqueeze(0))

tensor([0.2448])
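To see what cosine similarity computes, here is a minimal sketch of the formula (the dot product divided by the product of the two vector lengths) in plain Python.  The function below is our own illustration, not the `torch` implementation, and the input vectors are made up.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction (one is a multiple of the other): similarity close to 1
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors, like two one hot encoded words: similarity of 0
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])

# Opposite directions: similarity close to -1
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same_direction, perpendicular, opposite)
```

The perpendicular case is why one hot encoded words can never appear related: every pair of distinct words has a similarity of 0.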

If we were to plot different words, we would see that words we think of as closely related also sit close together in our multidimensional space.

<img src="./word2vec-img.png" width="60%">

### Summary

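In this lesson, we saw the benefits of using word vectors as opposed to one hot encoding to numericalize our text.  We saw that by using word vectors, we replace our sparse vectors with dense vectors.  And we saw that words with related meanings sit closer together in this vector space, which we can measure with cosine similarity.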

# Using Embeddings in a Neural Network

Now we may be wondering how we can incorporate an embedding into a neural network.  It turns out that PyTorch has an embedding layer, `nn.Embedding`, already built for us.  Let's see how it works.

We'll initialize an embedding layer with a size equal to the number of words in our vocabulary, and an embedding dimension of 100 for the 100-length vectors that we have.

In [9]:
input_dim = len(TEXT.vocab.itos)
embedding_dim = 100

In [10]:
import torch.nn as nn

embedding = nn.Embedding(input_dim, embedding_dim)

Now, what's interesting is that if we pass through some text to the embedding layer we get back the embedded vectors.  Let's see this.

> First we'll batch our data, and then select the first batch.

In [16]:
train_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, test_data), 
    batch_size = 64)

for batch in train_iterator:
    first_batch = batch
    break

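We see that our first batch has 64 documents, each represented with 832 tokens (shorter documents are padded to the length of the longest).  Note that the sequence length comes first in the shape.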

In [36]:
first_batch.text.shape

torch.Size([832, 64])

Then notice how the shape changes if we pass this through our embedding.

In [27]:
embedded = embedding(first_batch.text.unsqueeze(0))
embedded.shape

torch.Size([1, 832, 64, 100])

So we can see that originally we had 64 documents of 832 tokens each, and now each of those tokens is represented by a vector of length 100.  (The leading dimension of 1 comes from the `unsqueeze(0)` call.)  So for example, if we wished to see how just the first document was represented, it would look like the following.

In [37]:
first_document = embedded[0][:, 0, :]

In [38]:
first_document.shape

torch.Size([832, 100])
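The same shape change can be reproduced with a small self-contained sketch.  The vocabulary size, embedding dimension, and random token ids below are hypothetical stand-ins for our IMDB batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small sizes standing in for 25,002 words and 100 dimensions
vocab_size, embedding_dim = 10, 4
embedding = nn.Embedding(vocab_size, embedding_dim)

# [sequence length, batch size] = [5, 3], the same token-first layout as above
token_ids = torch.randint(0, vocab_size, (5, 3))

embedded = embedding(token_ids)
print(embedded.shape)  # torch.Size([5, 3, 4])

# The layer is a table lookup: each token id selects one row of the weight matrix
first_token_id = token_ids[0, 0]
rows_match = torch.equal(embedded[0, 0], embedding.weight[first_token_id])
print(rows_match)  # True
```

So the embedding layer simply appends one dimension of size `embedding_dim` to whatever shape of indices we pass in.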

### Using the GloVe Model

Now there's just one issue with the above: our embedding is full of randomly initialized vectors.

In [45]:
embedding.weight.data.shape

torch.Size([25002, 100])

Instead, we should use the GloVe vectors, so that the 100-length vector we get back for each word actually represents that word.

Ok, we can select the GloVe vectors from our `TEXT.vocab` object.

In [46]:
text_vectors = TEXT.vocab.vectors

And then we can replace the weights in our embedding layer with these vectors.

In [59]:
# Overwrite the randomly initialized weights with the pretrained GloVe vectors
embedding.weight.data.copy_(text_vectors)

In [58]:
embedding.weight

Parameter containing:
tensor([[ 9.9937e-01, -3.7601e-01,  3.7198e-01,  ...,  3.2650e-01,
          6.4078e-01,  6.6549e-01],
        [-3.6829e-01, -2.8898e+00, -7.7163e-01,  ..., -2.9713e+00,
          1.5644e+00, -1.0140e+00],
        [-3.8194e-02, -2.4487e-01,  7.2812e-01,  ..., -1.4590e-01,
          8.2780e-01,  2.7062e-01],
        ...,
        [ 3.1549e-01,  4.7876e-01, -7.9116e-03,  ...,  1.9896e-02,
          1.5833e+00, -2.9216e-01],
        [ 2.0895e+00, -6.2094e-01, -1.0965e-01,  ..., -1.1243e-01,
         -9.0293e-01,  4.8347e-01],
        [-3.3756e-01,  1.1619e-03,  2.1478e-01,  ...,  5.2835e-01,
         -1.7724e+00, -1.3728e+00]], requires_grad=True)
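As a minimal sketch, here are two ways of loading pretrained vectors into an embedding layer.  The 3 x 3 `pretrained` tensor is made up and stands in for `TEXT.vocab.vectors`.  The second option, `nn.Embedding.from_pretrained`, is an alternative that this lesson does not otherwise use.

```python
import torch
import torch.nn as nn

# Made-up "pretrained" vectors: 3 words, 3 dimensions each
pretrained = torch.tensor([[0.0, 0.0, 0.0],
                           [0.1, 0.2, 0.3],
                           [0.4, 0.5, 0.6]])

# Option 1: create the layer, then copy the vectors over its random weights
embedding = nn.Embedding(3, 3)
with torch.no_grad():
    embedding.weight.copy_(pretrained)

# Option 2: build the layer directly from the vectors.
# freeze=True keeps the vectors fixed during training.
frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)

print(embedding.weight.requires_grad)  # True: these vectors can be fine-tuned
print(frozen.weight.requires_grad)     # False: these vectors stay as loaded
print(frozen(torch.tensor([1])))       # tensor([[0.1000, 0.2000, 0.3000]])
```

Whether to freeze the vectors is a design choice: freezing keeps the pretrained meaning intact, while leaving them trainable lets the network adapt them to the task.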

It's also a good idea to zero out the unknown token vector and the padding vector, as neither token carries meaning that we would want spread across 100 different features.

In [61]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

In [63]:
embedding.weight.data[UNK_IDX] = torch.zeros(embedding_dim)
embedding.weight.data[PAD_IDX] = torch.zeros(embedding_dim)
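For the padding token specifically, `nn.Embedding` also accepts a `padding_idx` argument.  The sketch below uses a hypothetical 5-word vocabulary to show that the padding row starts at zero and receives no gradient, so it is not updated by training.

```python
import torch
import torch.nn as nn

PAD_IDX = 1  # hypothetical index of the padding token

embedding = nn.Embedding(5, 3, padding_idx=PAD_IDX)
print(embedding.weight[PAD_IDX])  # a row of zeros

# Backpropagate through a sequence that contains the padding token twice
token_ids = torch.tensor([0, 1, 2, 1])
embedding(token_ids).sum().backward()

pad_grad = embedding.weight.grad[PAD_IDX]
word_grad = embedding.weight.grad[0]
print(pad_grad)   # zeros: the padding row gets no gradient
print(word_grad)  # ones: ordinary rows do
```

Note that copying pretrained vectors into the layer afterwards overwrites the padding row, so zeroing it by hand, as we did above, is still needed in that case.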

Now let's pass our batch through the updated embedding layer and take a look at the vectors that are returned.

In [67]:
embedded = embedding(first_batch.text)
embedded.shape

torch.Size([832, 64, 100])

In [68]:
first_document_update = embedded[:, 0, :]

In [72]:
first_document_update.shape

torch.Size([832, 100])

We could even visualize the words in a document across different dimensions as a sort of heat map.  Below are the first ten words of the document, across the first 20 dimensions.

In [77]:
import pandas as pd
# detach() drops the gradient history so the tensor can be converted to numpy
df = pd.DataFrame(first_document_update[:10, :20].detach().numpy()).astype('float')
df.style.set_properties(**{'font-size':'6pt'}).background_gradient('Greys')

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.747637 | -1.22769 | 0.75619 | -0.741466 | -1.513559 | 0.564932 | -0.461237 | 0.572898 | 0.786169 | -1.831817 | -0.316089 | -1.058399 | -1.902547 | 0.787707 | -0.72695 | 0.058335 | 0.828292 | 1.505496 | -0.221799 | -0.452337 |
| 1 | -0.1529 | -0.24279 | 0.89837 | 0.16996 | 0.53516 | 0.48784 | -0.58826 | -0.17982 | -1.3581 | 0.42541 | 0.15377 | 0.24215 | 0.13474 | 0.41193 | 0.67043 | -0.56418 | 0.42985 | -0.012183 | -0.11677 | 0.31781 |
| 2 | -0.21823 | 0.69199 | 0.70441 | -0.59642 | -0.21818 | 0.55387 | -0.32052 | 0.52602 | -0.31667 | -0.19129 | 0.76109 | 0.047439 | 0.43199 | 0.12232 | 0.25664 | -0.52453 | 0.048994 | 0.81621 | -0.53336 | 0.53093 |
| 3 | -1.65267 | 1.223435 | 0.029972 | 1.079892 | -0.748489 | 1.891953 | 0.192535 | 1.10728 | -0.047635 | 0.828955 | 1.262526 | 1.094922 | 0.659926 | 0.518608 | -1.246645 | -1.651591 | 1.247685 | -0.987246 | -0.712199 | 1.274366 |
| 4 | 0.053583 | 0.70823 | 0.39167 | -0.51792 | -0.74798 | 0.012321 | 0.095655 | 0.35316 | -0.19545 | -0.39133 | 0.80921 | 0.10338 | 0.55577 | -0.076189 | -0.041986 | -0.37032 | -0.47886 | 0.7972 | -0.8908 | 0.45737 |
| 5 | 0.54635 | 0.19018 | 0.51298 | -0.76729 | -0.23824 | -0.065802 | 0.24464 | 0.32024 | -0.13215 | -0.51083 | 0.69103 | 0.24462 | 0.075322 | 0.34058 | 0.37736 | -0.27647 | -0.22937 | 0.32059 | -0.43115 | 0.37238 |
| 6 | -0.1897 | 0.050024 | 0.19084 | -0.049184 | -0.089737 | 0.21006 | -0.54952 | 0.098377 | -0.20135 | 0.34241 | -0.092677 | 0.161 | -0.13268 | -0.2816 | 0.18737 | -0.42959 | 0.96039 | 0.13972 | -1.0781 | 0.40518 |
| 7 | -0.037608 | 0.15683 | 0.4796 | -0.03854 | -0.1747 | 0.15475 | -0.38234 | 0.24254 | -0.16832 | -0.3267 | -0.064252 | -0.023559 | -0.19514 | -0.015289 | -0.14977 | -0.010363 | 0.06325 | 0.18948 | -0.55627 | 0.88542 |
| 8 | -0.30664 | 0.16821 | 0.98511 | -0.33606 | -0.2416 | 0.16186 | -0.053496 | 0.4301 | 0.57342 | -0.071569 | 0.36101 | 0.26729 | 0.27789 | -0.072268 | 0.13838 | -0.26714 | 0.12999 | 0.22949 | -0.18311 | 0.50163 |
| 9 | -0.1897 | 0.050024 | 0.19084 | -0.049184 | -0.089737 | 0.21006 | -0.54952 | 0.098377 | -0.20135 | 0.34241 | -0.092677 | 0.161 | -0.13268 | -0.2816 | 0.18737 | -0.42959 | 0.96039 | 0.13972 | -1.0781 | 0.40518 |


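### Summary

In this lesson, we saw how to use word embeddings in a neural network with PyTorch's `nn.Embedding` layer.  We saw that the layer maps a batch of token indices to a 100-length vector per token, and that we can replace its randomly initialized weights with the pretrained GloVe vectors, zeroing out the vectors for the unknown and padding tokens.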

### Resources

[TorchText Embeddings U Toronto](https://www.cs.toronto.edu/~lczhang/360/lec/w06/w2v.html)

[Stanford - Intro to Embeddings](https://www.youtube.com/watch?v=8rXD5-xhemo&list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z)

[illustrated guide word2vec](http://jalammar.github.io/illustrated-word2vec/)