# Word Embeddings

### Introduction

In this lesson, we'll consider different ways to represent an NLP dataset so that it can be used by a machine learning algorithm.  So far, we have seen relatively simple ways of translating text into something numerical.  Essentially, we represented each word in our corpus by a different feature, and then indicated how important that word was to a document either by its frequency (as in bag of words), or by its `term frequency` * `inverse document frequency`.

With either our bag of words or `tfidf` representation, each word is a separate basis vector, and thus orthogonal to (and linearly independent of) each of the other words.
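To make the "separate basis vector" idea concrete, here is a minimal sketch in plain Python. The three-word vocabulary and the helper names `one_hot` and `dot` are made up for illustration and are not part of the lesson's dataset.

```python
# Hypothetical three-word vocabulary: each word gets its own dimension.
vocab = ["car", "truck", "banana"]

def one_hot(word, vocab):
    """Return a vector with a 1 at the word's index and 0 everywhere else."""
    return [1 if w == word else 0 for w in vocab]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

car = one_hot("car", vocab)
truck = one_hot("truck", vocab)
banana = one_hot("banana", vocab)

print(car, truck, banana)
print(dot(car, truck))   # dot product of two different one-hot vectors
print(dot(car, banana))  # same result, even though the words are less alike
```

Both dot products come out to zero, so under this representation `car` is exactly as unrelated to `truck` as it is to `banana`.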

<img src="./car-nlp.png" width="60%">

This technique has a couple of significant issues:

1. Sparse vectors

* With each additional word in our corpus, we need an additional feature.  More problematic is that some of these words will rarely occur, so it becomes difficult to determine the significance of a word from relatively few training examples.

2. Unrelated words

* With each word represented as a separate dimension, we cannot capture the relationship between words. Yet we *do* think of words as being related to other words, and even as combinations of other words or qualities.

In this lesson, we'll see how we can represent each *word* as a vector of attributes with word embeddings. 

### Representing Words

Now the idea of word embeddings is to represent words by some deeper underlying meaning.  For example, let's say we represent the words `king`, `queen`, `president`, and `prime_minister` with the following vectors:

In [5]:
king = [1, 1400, 2]
queen = [0, 1400, 3]
president = [.7, 1900, 1]
prime_minister = [.6, 1900, 2]

Above, the first coordinate could represent the sex associated with the word, with 1 for male and 0 for female.  The second coordinate represents the time period associated with the word, and the third represents the geography associated with the word (U.S. vs Europe).

One benefit of this representation is that our vectors are no longer completely independent of each other.  Instead, each word shares attributes with other words.  So now we can see that `king` and `queen` appear more related than `king` and `prime_minister`.

In addition, we can now represent our corpus with fewer dimensions.  Previously, we essentially had a one hot encoding to represent each word, where every word is assigned an index.  With English having, by some estimates, over 1 million words, this leads to a lot of dimensions.  With word embeddings, the number of dimensions typically ranges between 50 and 300 for each word.  And because our vectors are not perpendicular to one another, we can calculate how similar words are by calculating their closeness to each other.
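As a rough sketch of "closeness", we can compute the straight-line (Euclidean) distance between the toy vectors from above. The choice of Euclidean distance and the helper name `euclidean_distance` are assumptions made for illustration; other measures are possible.

```python
import math

# The toy vectors from this lesson: [sex, time period, geography].
king = [1, 1400, 2]
queen = [0, 1400, 3]
president = [.7, 1900, 1]
prime_minister = [.6, 1900, 2]

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# A smaller distance means the two words are closer in this toy space.
print(euclidean_distance(king, queen))
print(euclidean_distance(king, prime_minister))
```

Here `king` lands much closer to `queen` than to `prime_minister`. Note that the time period coordinate dominates the result because its values are far larger than the other two, so a toy example like this is sensitive to how each attribute is scaled.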

### Real World Embeddings

Now let's see how we can work with embeddings in PyTorch.  Through torchtext, PyTorch makes available different libraries of pretrained embeddings that allow us to represent our words as vectors.  Let's once again download our IMDB dataset with torchtext.

In [4]:
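# Note: Field, LabelField, and datasets.IMDB.splits belong to torchtext's older API.
# In torchtext 0.9 through 0.11 they live under torchtext.legacy; later releases removed them.
from torchtext import datasets, data
import torch

# Tokenize reviews with spaCy; labels are stored as floats.
TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.float)

# Download IMDB (if needed) and split it into training and test sets.
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)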



downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:46<00:00, 1.83MB/s]


This time, when we build the vocab, we'll pass the argument `vectors = "glove.6B.100d"`.  The name refers to GloVe vectors trained on a corpus of 6 billion tokens, with 100 dimensions per word.

In [41]:
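MAX_VOCAB_SIZE = 25_000

# Keep the most frequent training words and attach a pretrained GloVe vector to each.
# Words in our vocab that GloVe lacks get a vector drawn from a normal distribution.
TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", 
                 unk_init = torch.Tensor.normal_)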

In [15]:
LABEL.build_vocab(train_data)

> Downloading the GloVe vectors takes both time and a significant amount of disk space.  If you wish to delete the file, look in the hidden `.vector_cache` folder, located in your current folder.  If you wish to reuse your local GloVe file, simply specify the name of the file to reference.

Now the way embeddings work is a two-step process.  Each word is assigned a separate index, just like before.  But then torchtext uses this index to look up the 100-dimensional vector related to the word.  Let's see this.

> We can again use `stoi` (string to integer) to find the related index.

In [73]:
TEXT.vocab.stoi['dog']

1124

And then from there, we can use this index to find the related vector.

In [74]:
TEXT.vocab.vectors[1124].shape

torch.Size([100])
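The same two-step lookup can be sketched without torchtext. Everything below is a made-up stand-in: the tiny `stoi` dictionary, the 3-dimensional `vectors` table, the `embed` helper, and the fallback to an `<unk>` row for unknown words are all assumptions for illustration, not the real vocab contents.

```python
# Step 1 data: a hypothetical word -> index mapping.
stoi = {"<unk>": 0, "dog": 1, "puppy": 2}

# Step 2 data: one row per index (3 dimensions here instead of 100).
vectors = [
    [0.0, 0.0, 0.0],   # row 0: <unk>
    [0.9, 0.1, 0.3],   # row 1: dog
    [0.8, 0.2, 0.4],   # row 2: puppy
]

def embed(word):
    """Map a word to its index, then use the index to fetch its vector."""
    index = stoi.get(word, stoi["<unk>"])  # unknown words fall back to <unk>
    return vectors[index]

print(stoi["dog"])     # the index assigned to the word
print(embed("dog"))    # the vector stored at that index
print(embed("zebra"))  # not in the vocabulary, so we get the <unk> row
```

The point of the design is that the vocabulary only stores one integer per word, while the dense vectors live in a single table that is indexed by that integer.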

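We can calculate how similar one word is to another by measuring how close their vectors are.  Here we use cosine similarity, which ranges from -1 to 1, with higher values meaning the vectors point in more similar directions.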

In [75]:
puppy_vec = TEXT.vocab.vectors[TEXT.vocab.stoi['puppy']]
dog_vec = TEXT.vocab.vectors[TEXT.vocab.stoi['dog']]

torch.cosine_similarity(puppy_vec.unsqueeze(0), dog_vec.unsqueeze(0))

tensor([0.7236])

In [76]:
banana_vec = TEXT.vocab.vectors[TEXT.vocab.stoi['banana']]

In [77]:
torch.cosine_similarity(puppy_vec.unsqueeze(0), banana_vec.unsqueeze(0))

tensor([0.2448])
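For intuition about what `torch.cosine_similarity` measures, here is a minimal plain-Python sketch of the standard formula: the dot product divided by the product of the two vector lengths. The 3-dimensional vectors are invented for illustration and are not the GloVe values used above.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors: puppy and dog point in similar directions, banana does not.
puppy = [0.8, 0.2, 0.4]
dog = [0.9, 0.1, 0.3]
banana = [-0.1, 0.9, 0.2]

print(cosine_similarity(puppy, dog))
print(cosine_similarity(puppy, banana))
```

As with the real embeddings, the first value is larger than the second. Because the formula divides by the vector lengths, it compares direction only, so a long vector and a short vector pointing the same way still score 1.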

If we were to plot different words, we would see that words we think of as closely related also sit close together in our multidimensional space.

<img src="./word2vec-img.png" width="60%">

### Summary

In this lesson, we saw the benefits of using word vectors as opposed to one hot encoding to numericalize our text.  We saw that by using word vectors, we replace our sparse vectors with dense vectors.  Remember that sparse vectors are an issue because most features in a sparse vector have a value of zero, so it's hard to determine each feature's impact.  And we saw that with word embeddings we are able to show the *relation* between words, which we could not do otherwise.

Then, we saw how to use these embeddings with torchtext's vocab object, which maps each word in our corpus to an index, and then maps that index to the related vector.

### Resources

[Chris McCormick Embeddings](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/)

[Chris Olah Embeddings](https://colah.github.io/posts/2014-07-NLP-RNNs-Representations/)

[TorchText Embeddings U Toronto](https://www.cs.toronto.edu/~lczhang/360/lec/w06/w2v.html)

[Stanford - Intro to Embeddings](https://www.youtube.com/watch?v=8rXD5-xhemo&list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z)

[illustrated guide word2vec](http://jalammar.github.io/illustrated-word2vec/)

[Skipgram scratch](https://medium.com/district-data-labs/forward-propagation-building-a-skip-gram-net-from-the-ground-up-9578814b221)

[Skipgram again](https://iksinc.online/tag/skip-gram-model/)