### Word Embeddings


### Training the network: Skip-gram

#### The Fake Task

We’re going to build the neural network to perform a “fake” task, which indirectly gives us the word vectors that we are really after.

We’re going to train the neural network to do the following. Given a specific word in the middle of a sentence (the input word), look at the words nearby and pick one at random. The network is going to tell us the probability for every word in our vocabulary of being the “nearby word” that we chose.
What counts as "nearby" is set by a "window size" parameter of the algorithm. A typical window size might be 5, meaning 5 words behind and 5 words ahead (10 in total).

The output probabilities relate to how likely it is to find each vocabulary word near our input word. For example, if you gave the trained network the input word “Soviet”, the output probabilities would be much higher for words like “Union” and “Russia” than for unrelated words like “watermelon” and “kangaroo”.

We’ll train the neural network to do this by feeding it word pairs found in our training documents. The example below shows some of the training samples (word pairs) we would take from the sentence “The quick brown fox jumps over the lazy dog.” The word highlighted in blue is the input word.

<img src="Figures/training_data_word2vec.png" width="75%">
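The pair extraction described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `skipgram_pairs`, the window size of 2, and the lowercase whitespace tokenization are assumptions made for this example, not part of any word2vec implementation.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (input_word, nearby_word) pairs for every position in tokens."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at the sentence boundaries.
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own neighbour
                pairs.append((center, tokens[j]))
    return pairs


# Simplified tokenization: lowercase, split on whitespace, no punctuation.
sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence, window_size=2)

for pair in pairs[:5]:
    print(pair)
```

Each pair is one training sample: the first element is fed to the network as input, and the second is the "nearby word" the network should assign a high probability to.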

### Architecture 

We’re going to represent an input word like “ants” as a one-hot vector. This vector will have 10,000 components (one for every word in our vocabulary) and we’ll place a “1” in the position corresponding to the word “ants”, and 0s in all of the other positions.

The output of the network is a single vector (also with 10,000 components) containing, for every word in our vocabulary, the probability that a randomly selected nearby word is that vocabulary word.

Here’s the architecture of our neural network.

<img src="Figures/skip_gram_net_arch.png" width="80%">

#### The Hidden Layer

For our example, we’re going to say that we’re learning word vectors with 300 features. So the hidden layer is going to be represented by a weight matrix with 10,000 rows (one for every word in our vocabulary) and 300 columns (one for every hidden neuron).

*300 features is what Google used in their published model trained on the Google News dataset (you can download it from [here](https://code.google.com/archive/p/word2vec/)). The number of features is a hyperparameter that you have to tune for your application (that is, try different values and see what yields the best results).*

If you look at the rows of this weight matrix, these are actually what will be our word vectors!

<img src="Figures/word2vec_weight_matrix_lookup_table.png" width="80%">

##### Effect of matrix multiplication with a one-hot vector

That one-hot vector is almost all zeros… If you multiply a 1 x 10,000 one-hot vector by a 10,000 x 300 matrix, it will effectively just select the matrix row corresponding to the “1”. Here’s a small example to give you a visual.

<img src="Figures/matrix_mult_w_one_hot.png" width="70%">

This means that the hidden layer of this model is really just operating as a lookup table. The output of the hidden layer is just the “word vector” for the input word.
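To check the lookup-table claim on a small scale, here is a sketch with a 5-word vocabulary and 3 features instead of 10,000 and 300. The vocabulary, the weight values, and the helper names (`one_hot`, `vec_mat_mul`) are made up for illustration.

```python
vocab = ["ants", "car", "fox", "dog", "quick"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# Toy hidden-layer weight matrix: one row per vocabulary word, 3 features each.
W = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
    [1.3, 1.4, 1.5],
]


def one_hot(word):
    """Vector of zeros with a single 1 at the word's vocabulary index."""
    vec = [0.0] * len(vocab)
    vec[word_to_index[word]] = 1.0
    return vec


def vec_mat_mul(x, matrix):
    """Multiply a row vector (1 x n) by a matrix (n x m)."""
    n_cols = len(matrix[0])
    return [sum(x[r] * matrix[r][c] for r in range(len(matrix))) for c in range(n_cols)]


hidden = vec_mat_mul(one_hot("fox"), W)   # full matrix multiplication
lookup = W[word_to_index["fox"]]          # plain row lookup

print(hidden)
print(lookup)
```

Both lines print the same row of `W`, which is why practical implementations skip the multiplication and index into the matrix directly.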

#### The Output Layer

The 1 x 300 word vector for “ants” then gets fed to the output layer. The output layer is a softmax regression classifier. The gist of softmax regression is that each output neuron (one per word in our vocabulary!) produces an output between 0 and 1, and all of these output values sum to 1.

Specifically, each output neuron has a weight vector which it multiplies against the word vector from the hidden layer, then it applies the function exp(x) to the result. Finally, in order to get the outputs to sum up to 1, we divide this result by the sum of the results from all 10,000 output nodes.

Here’s an illustration of calculating the output of the output neuron for the word “car”.

<img src="Figures/output_weights_function.png"  width="80%">
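The output-layer computation described above can be written out as a short sketch. The word vector and the four output weight vectors are toy values chosen for illustration, and `softmax_output` is a hypothetical helper name. Subtracting the maximum score before exponentiating is a standard numerical-stability trick that is not mentioned in the text; it does not change the resulting probabilities.

```python
import math


def softmax_output(word_vector, output_weights):
    """Return one probability per output neuron."""
    # Each output neuron multiplies its weight vector against the word vector.
    scores = [
        sum(w * x for w, x in zip(weights, word_vector))
        for weights in output_weights
    ]
    # Shift by the max score for numerical stability (result is unchanged).
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    # Divide by the sum over all output neurons so the outputs sum to 1.
    total = sum(exps)
    return [e / total for e in exps]


word_vector = [0.7, 0.8, 0.9]   # toy hidden-layer output (3 features)
output_weights = [              # toy weights: one row per vocabulary word
    [0.2, -0.1, 0.4],
    [1.0, 0.5, -0.3],
    [-0.6, 0.3, 0.1],
    [0.0, 0.0, 0.0],
]

probs = softmax_output(word_vector, output_weights)
print(probs)
print(sum(probs))
```

With a real 10,000-word vocabulary the sum in the denominator runs over all 10,000 output neurons for every training sample.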

##### Intuition

If two different words have very similar “contexts” (that is, what words are likely to appear around them), then our model needs to output very similar results for these two words. And one way for the network to output similar context predictions for these two words is if the word vectors are similar. So, **if two words have similar contexts, then our network is motivated to learn similar word vectors for these two words!** 

And what does it mean for two words to have similar contexts? I think you could expect that **synonyms** like “intelligent” and “smart” would have very similar contexts. Or that words that are related, like “engine” and “transmission”, would probably have similar contexts as well.

This can also handle stemming for you – the network will likely learn similar word vectors for the words “ant” and “ants” because these should have similar contexts.

#### Another View as autoencoder

We have seen autoencoders before.

<img src="Figures/word2vec-skip-gram.png" width="80%">

### CBOW

A language model can only base its predictions on past words, because it is assessed on its ability to predict each next word in the corpus. A model that only aims to produce accurate word embeddings is not subject to this restriction. Mikolov et al. therefore use both the n words before and the n words after the target word to predict it, as shown in the figure below. This is known as a **continuous bag of words** (CBOW), because it uses continuous representations whose order is of no importance.

<img src="Figures/word2vec-cbow.png" width="80%" >
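As a rough sketch of how CBOW training examples differ from skip-gram pairs, the code below collects the n words on each side of every target word and averages their vectors. The helper names and the 2-dimensional toy vectors are assumptions for illustration; averaging the context vectors is one common way to combine them so that their order does not matter.

```python
def cbow_examples(tokens, n=2):
    """Return (context_words, target_word) for every position in tokens."""
    examples = []
    for i, target in enumerate(tokens):
        # Up to n words before and n words after the target.
        context = tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n]
        examples.append((context, target))
    return examples


def average_vectors(vectors):
    """Element-wise mean; the result is independent of word order."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


tokens = "the quick brown fox jumps".split()
examples = cbow_examples(tokens, n=2)
for context, target in examples:
    print(context, "->", target)

# Hypothetical 2-dimensional word vectors, just to show the averaging step.
toy_vectors = {
    "the": [1.0, 0.0],
    "quick": [0.0, 1.0],
    "brown": [0.5, 0.5],
    "fox": [1.0, 1.0],
    "jumps": [2.0, 0.0],
}
context, target = examples[2]
hidden = average_vectors([toy_vectors[w] for w in context])
print(target, hidden)
```

Where skip-gram produces one training sample per (input word, nearby word) pair, CBOW produces one sample per target word, which is consistent with it being faster to train.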

According to Mikolov et al., the differences between Skip-gram and CBOW are:

- Skip-gram: works well with a small amount of training data and represents even rare words or phrases well

- CBOW: several times faster to train than Skip-gram, with slightly better accuracy for frequent words

Sometimes the different architectures are depicted by this graph:

<img src="Figures/CBOW-vs-SkipGram.png" width="85%" >

### GloVe

Original paper: GloVe: Global Vectors for Word Representation https://nlp.stanford.edu/pubs/glove.pdf

GloVe is a count-based model: essentially a log-bilinear model with a weighted least-squares objective. The model rests on the rather simple idea that ratios of word-word co-occurrence probabilities have the potential to encode some form of meaning, which can be captured as vector differences. The training objective is therefore to learn word vectors such that their dot product equals the logarithm of the words’ probability of co-occurrence. Loss function:

$$
J = \sum_{i, j=1}^V f(X_{ij}) \, (w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2
$$

where $w_i$ and $b_i$ are the word vector and bias of word $i$, $\tilde{w}_j$ and $\tilde{b}_j$ are the context word vector and bias of word $j$, $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$, and $f$ is a weighting function that assigns relatively lower weight to rare and frequent co-occurrences.
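To make the objective concrete, here is a minimal sketch that evaluates $J$ for a tiny co-occurrence matrix. The weighting function follows the form given in the GloVe paper, $f(x) = (x / x_{max})^{\alpha}$ for $x < x_{max}$ and $1$ otherwise, with the paper's reported values $x_{max} = 100$ and $\alpha = 0.75$. The function names and the toy inputs are assumptions for illustration, and the code only evaluates the loss; it does not train anything.

```python
import math


def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f: small for rare pairs, capped at 1 for frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_loss(X, W, W_ctx, b, b_ctx):
    """Weighted least-squares loss J over all non-zero co-occurrence counts."""
    total = 0.0
    for i, row in enumerate(X):
        for j, x_ij in enumerate(row):
            if x_ij == 0:
                continue  # f(0) = 0, so zero counts contribute nothing
            dot = sum(a * c for a, c in zip(W[i], W_ctx[j]))
            diff = dot + b[i] + b_ctx[j] - math.log(x_ij)
            total += weight(x_ij) * diff ** 2
    return total


# Toy setup: 2 words, 2-dimensional vectors, words 0 and 1 co-occur 10 times.
X = [[0, 10], [10, 0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# All parameters at zero: every non-zero pair is off by log(10).
loss_untrained = glove_loss(X, zeros, zeros, [0.0, 0.0], [0.0, 0.0])

# Biases chosen so that b_i + b~_j = log(X_ij): the loss drops to zero.
b = [math.log(10), 0.0]
b_ctx = [math.log(10), 0.0]
loss_fitted = glove_loss(X, zeros, zeros, b, b_ctx)

print(loss_untrained, loss_fitted)
```

Skipping the zero entries matters in practice, because most entries of a real co-occurrence matrix are zero and $\log 0$ is undefined.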

When we control for all the training hyperparameters, the embeddings generated by GloVe and word2vec tend to perform very similarly in downstream NLP tasks. An additional benefit of GloVe over word2vec is that its implementation is easier to parallelize, which makes it easier to train on more data, which, with these models, is always A Good Thing.

- In word2vec, skip-gram models try to capture co-occurrence one window at a time.
- GloVe tries to capture the global co-occurrence counts over the whole corpus.


### word2vec in spaCy

- *spaCy* can compare two objects and predict similarity
- *Doc.similarity()*, *Span.similarity()* and *Token.similarity()*
- Take another object and return a similarity score (typically between 0 and 1)
- Important: needs a model that has word vectors included, for example:
    * *en_core_web_md* (medium model)
    * *en_core_web_lg* (large model)
    * NOT *en_core_web_sm* (small model)
    

- Similarity is determined using word vectors
- Multi-dimensional meaning representations of words
- Generated using an algorithm like word2vec and lots of text
- Can be added to spaCy's statistical models
- Default: cosine similarity, but can be adjusted
- Doc and Span vectors default to average of token vectors
- Short phrases are better than long documents with many irrelevant words
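The two ideas in the list above, cosine similarity and averaging token vectors, can be sketched in plain Python. The 3-dimensional vectors below are invented for illustration (real spaCy vectors have hundreds of dimensions), and `cosine_similarity` and `mean_vector` are hypothetical helper names, not spaCy functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Element-wise average, as used for Doc and Span vectors by default."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Invented toy vectors: the two foods point in a similar direction.
toy = {
    "pizza": [0.9, 0.1, 0.0],
    "pasta": [0.8, 0.2, 0.1],
    "soap": [0.0, 0.1, 0.9],
}

print(cosine_similarity(toy["pizza"], toy["pasta"]))
print(cosine_similarity(toy["pizza"], toy["soap"]))

# A phrase vector is the average of its token vectors.
phrase = mean_vector([toy["pizza"], toy["pasta"]])
print(cosine_similarity(phrase, toy["soap"]))
```

Because a phrase vector is an average, every additional irrelevant word pulls it away from the words that matter, which is why short phrases tend to compare better than long documents.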

In [1]:
import spacy
nlp = spacy.load('en_core_web_md')

In [2]:
# Compare two tokens
doc = nlp("I like pizza and pasta")
token1 = doc[2]
token2 = doc[4]
print(token1.similarity(token2))

0.73695457


In [None]:
# Compare two documents
doc1 = nlp("I like fast food")
doc2 = nlp("I like pizza")
print(doc1.similarity(doc2))

In [None]:
# Compare a document with a token
doc = nlp("I like pizza")
token = nlp("soap")[0]
print(doc.similarity(token))

In [3]:
# Compare a span with a document
span = nlp("I like pizza and pasta")[2:5]
doc = nlp("McDonalds sells burgers")
print(span.similarity(doc))

0.6199091710787739


### Print the word vector

In [5]:
doc = nlp("king equal queen minus woman plus man ?")
# Access the vector via the token.vector attribute
print(doc[2].vector)
#print(doc[2])

[ 0.4095    -0.22693    0.25362   -0.36055   -0.37095   -0.35181
  0.50669   -0.77897   -0.32571    1.4895     0.052438  -0.36751
 -0.074025   0.37078    0.063077   0.32274    0.346      0.64214
 -0.09583    0.14303   -0.33826    0.79005   -0.7136    -0.050134
 -0.46467   -0.067917  -0.32107    0.042919   0.018576   0.59272
 -0.032392   0.72779    0.26002    0.30401    0.43033    0.25546
 -0.37986   -0.14398   -0.54399   -0.46181    0.11046   -0.034391
 -0.10458   -0.069689   0.091839  -0.19097   -0.057108   0.61218
 -0.19544   -0.31698   -0.46372    0.088749  -0.052501  -0.27969
  0.025125  -0.42097   -0.069404  -0.038672  -0.26489    0.10911
 -0.084848  -0.23826    0.61538    0.0039223  0.20285    0.56085
  0.015419   0.30707    0.19435   -0.20358   -0.18724   -0.10311
 -0.46468   -0.16804    0.22614    0.040657  -0.5147     0.46701
  0.61985   -0.46281   -0.8657     0.26458   -0.015476   0.12292
  0.084031  -0.07936    0.58967   -0.011092  -0.3795    -0.053612
  0.21134    0.46996  

### Tasks: 
- compare with word2vec
- find a function that computes the closest neighbor
- check the "arithmetic" relations such as "king - man + woman is queen"
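As a starting point for the last two tasks, here is a sketch of a nearest-neighbor search and an analogy check on hand-crafted toy vectors. The vectors are constructed so that the "king - man + woman" relation holds exactly, so this only demonstrates the mechanics; with real embeddings you would substitute the `token.vector` values, and the result is not guaranteed. The helper names `nearest` and `analogy` are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def nearest(target, vectors, exclude=()):
    """Word whose vector has the highest cosine similarity to target."""
    candidates = [word for word in vectors if word not in exclude]
    return max(candidates, key=lambda word: cosine(target, vectors[word]))


def analogy(a, b, c, vectors):
    """Find the word closest to vector(a) - vector(b) + vector(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words, which are otherwise often the closest matches.
    return nearest(target, vectors, exclude={a, b, c})


# Hand-crafted toy vectors: dimensions loosely mean (royalty, gender, other).
toy = {
    "king": [0.9, 0.3, 0.1],
    "queen": [0.9, -0.3, 0.1],
    "man": [0.1, 0.3, 0.1],
    "woman": [0.1, -0.3, 0.1],
    "apple": [0.0, 0.0, 1.0],
}

print(nearest(toy["king"], toy, exclude={"king"}))
print(analogy("king", "man", "woman", toy))
```

The search here is a brute-force scan over the whole vocabulary, which is fine for a handful of words but slow for a full vocabulary.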

#### Similarity depends on the application context

- Useful for many applications: recommendation systems, flagging duplicates etc.
- There's no objective definition of "similarity"
- Depends on the context and what the application needs to do

In [6]:
doc1 = nlp("I like cats")
doc2 = nlp("I hate cats")
print(doc1.similarity(doc2))

0.9501446702124066



