# More on W2V (for clarification)
Taken from: http://adventuresinmachinelearning.com/word2vec-keras-tutorial/ and http://adventuresinmachinelearning.com/word2vec-tutorial-tensorflow/

### Word embedding
- Word embeddings try to “compress” large one-hot word vectors into much smaller vectors (a few hundred elements) which preserve some of the meaning and context of the word. Word2Vec is the most common process of word embedding and will be explained below.

### Context, Word2Vec and the skip-gram model
- The context of the word is the key measure of meaning that is utilized in Word2Vec.  
    - The context of the word “sat” in the sentence “the cat sat on the mat” is (“the”, “cat”, “on”, “the”, “mat”).  In other words, it is the words which commonly occur around the target word “sat”. 
    - Words which have similar contexts share meaning under Word2Vec, and their reduced vector representations will be similar.  
    - In the **skip-gram model version of Word2Vec (more on this later), the goal is to take a target word, e.g. “sat”, and predict the surrounding context words.**  This involves an iterative learning process.
- The **end product of this learning will be an embedding layer in a network**
    - this embedding layer is a kind of lookup table – **the ROWS are vector representations of each word in our vocabulary.**  Here’s a simplified example (using dummy values) of what this looks like, where vocabulary_size=7 and embedding_size=3:
![](pictures/nlp_10_clairf_embedd.jpg)
- As you can see, each word (row) is represented by a vector of size 3. 
- Learning this embedding layer/lookup table can be performed using a **simple neural network and an output softmax layer:**
![](pictures/nlp_10_clairf_embedd2.jpg)


The idea of the neural network above is to:

1. Supply our input target words as one-hot vectors.  
2. Then, via a hidden layer, we want to train the neural network to increase the probability of valid context words, while decreasing the probability of invalid context words (i.e. words that never show up in the surrounding context of the target words).  **This involves using a softmax function on the output layer.**  
3. Once training is complete, **the output layer is discarded, and our embedding vectors are the weights of the hidden layer.**

There are two variants of the Word2Vec paradigm – skip-gram and CBOW.  The skip-gram variant takes a target word and tries to predict the surrounding context words, while the CBOW (continuous bag of words) variant takes a set of context words and tries to predict a target word.  **In this case, we will be considering the skip-gram variant.**
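
The difference between the two variants is easiest to see in the training examples each one produces. The sketch below is a minimal illustration, not the tutorial's own code: the helper names (`windows`, `skipgram_examples`, `cbow_examples`) and the `window` parameter are hypothetical and chosen only for this example.

```python
def windows(tokens, window=2):
    """Yield (target, context_words) for every position in the token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


def skipgram_examples(tokens, window=2):
    # Skip-gram: the input is the target word, the label is one context word.
    return [(target, c) for target, ctx in windows(tokens, window) for c in ctx]


def cbow_examples(tokens, window=2):
    # CBOW: the input is the list of context words, the label is the target word.
    return [(ctx, target) for target, ctx in windows(tokens, window)]


tokens = "the cat sat on the mat".split()
print(skipgram_examples(tokens)[:3])   # first few (input, label) pairs
print(cbow_examples(tokens)[2])        # the example built around "sat"
```

With `window=3` the context of “sat” covers the whole example sentence, which matches the context listed earlier for “the cat sat on the mat”.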

### What is the 'gram' in skip gram

- A gram is a group of n words, where n is the gram window size.  
- So for the sentence **“The cat sat on the mat”**, a 3-gram representation of this sentence would be:
    - “The cat sat”, “cat sat on”, “sat on the”, “on the mat”.  
    - The **“skip” part refers to the number of times an input word is repeated in the data-set with different context words (more on this later).**
- These grams are fed into the Word2Vec context prediction system. For instance, assume the input word is “cat”: Word2Vec tries to predict the context (“the”, “sat”) from this supplied input word.  
#### The Word2Vec system will move through all the supplied grams and input words and attempt to learn appropriate mapping vectors (embeddings) which produce high probabilities for the right context given the input words.
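
As a rough sketch of the gram construction described above, the snippet below builds the 3-grams of the example sentence and splits each one into an input word and its context. The function names `ngrams` and `center_and_context` are made up for this illustration.

```python
def ngrams(tokens, n=3):
    """Return every run of n consecutive tokens."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def center_and_context(gram):
    """Split an odd-length gram into its middle word and the words around it."""
    mid = len(gram) // 2
    return gram[mid], gram[:mid] + gram[mid + 1:]


tokens = "The cat sat on the mat".split()
for gram in ngrams(tokens, n=3):
    word, context = center_and_context(gram)
    print(" ".join(gram), "->", word, context)
```

For the first gram, “The cat sat”, the input word is “cat” and the context is (“The”, “sat”), as in the example above.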

### Skipgram training
- With respect to the neural net diagram above, if we take the word “cat” it will be one of the words in the 10,000 word vocabulary.  Therefore we can represent it as a 10,000 length one-hot vector.  
- We then connect this input vector to a 300 node hidden layer.
- The weights connecting the input to this hidden layer will be our new word vectors.
- The activations of the nodes in this hidden layer are simply linear summations of the weighted inputs **(i.e. no non-linear activation, like a sigmoid or tanh, is applied).**  
- These nodes are then fed into a softmax output layer.  
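
The steps above can be sketched as a single forward pass in NumPy. This is an untrained toy network under stated assumptions: the weights are random, the names `W_in` and `W_out` are illustrative, and the index used for “cat” is hypothetical.

```python
import numpy as np

vocab_size, hidden_size = 10_000, 300
rng = np.random.default_rng(0)

# Untrained weights: input->hidden (the future embeddings) and hidden->output.
W_in = rng.normal(scale=0.01, size=(vocab_size, hidden_size))
W_out = rng.normal(scale=0.01, size=(hidden_size, vocab_size))

cat_index = 42                      # hypothetical vocabulary index for "cat"
x = np.zeros(vocab_size)
x[cat_index] = 1.0                  # one-hot input vector

hidden = x @ W_in                   # linear hidden layer, no activation applied
scores = hidden @ W_out             # one score per vocabulary word
scores -= scores.max()              # shift for numerical stability
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over all 10,000 words

print(hidden.shape, probs.shape)
```

Training would adjust `W_in` and `W_out` so that `probs` puts more mass on the words that actually surround “cat”.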

#### During training, we want to change the weights of this neural network so that words surrounding “cat” have a higher probability in the softmax output layer.  So, for instance, if our text data set has a lot of Dr Seuss books, we would want our network to assign large probabilities to words like “the”, “sat” and “on” (given lots of sentences like “the cat sat on the mat”).

- By training this network, we would be **creating a 10,000 x 300 weight matrix connecting the 10,000 length input with the 300 node hidden layer.**  Each row in this matrix corresponds to a word in our 10,000 word vocabulary – so we have effectively reduced 10,000 length one-hot vector representations of our words to 300 length vectors.  

#### The weight matrix essentially becomes a look-up or encoding table of our words.  Not only that, but these weight values contain context information due to the way we’ve trained our network.  Once we’ve trained the network, we abandon the softmax layer and just use the 10,000 x 300 weight matrix as our word embedding lookup table.
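
To see why the weight matrix behaves like a lookup table, note that multiplying a one-hot vector by the matrix simply selects one row. The sketch below uses the dummy sizes from earlier (vocabulary_size=7, embedding_size=3); the word list and the random values are placeholders, not the ones in the picture.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # hypothetical 7-word vocabulary
word_to_index = {word: i for i, word in enumerate(vocab)}

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 3))   # dummy 7 x 3 weight matrix

idx = word_to_index["cat"]
one_hot = np.zeros(len(vocab))
one_hot[idx] = 1.0

via_matmul = one_hot @ embeddings   # what the linear hidden layer computes
via_lookup = embeddings[idx]        # plain row indexing, no multiplication

print(np.allclose(via_matmul, via_lookup))
```

Because both routes give the same vector, an embedding layer can skip the one-hot multiplication and index the row directly.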

### The Softmax issue and negative sampling
- The problem with using a full softmax output layer is that it is very computationally expensive.  Consider the definition of the softmax function:

$$P(y = j \mid x) = \frac{e^{x^T w_j}}{\sum_{k=1}^K e^{x^T w_k}}$$

- Here the probability of the output being class j is calculated by taking the exponential of the product of the hidden layer output x and the weights w_j connecting to the class j output (the numerator), and dividing it by the sum of the same exponentiated product over all K output classes, class j included (the denominator).  
- When the output is a 10,000-word one-hot vector and the hidden layer has 300 nodes, that is 300 x 10,000 = 3 million weights that need to be updated in any gradient based training of the output layer.
- **Enter negative sampling:** 
    - It works by reinforcing the strength of weights which link a target word to its context words, but rather than reducing the value of all those weights which aren’t in the context, it simply samples a small number of them – these are called the “negative samples”.
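
A minimal sketch of how negative samples could be drawn for one (target, context) pair is shown below. It makes two simplifying assumptions: negatives are drawn uniformly from the vocabulary (the original Word2Vec draws them from a smoothed word-frequency distribution), and only the pair's own two words are excluded rather than every true context word. The function name and parameters are hypothetical.

```python
import random


def negative_sampling_examples(target, context, vocab_size, num_negative=5, seed=0):
    """Return ((target, word), label) pairs: one positive, num_negative negatives."""
    rng = random.Random(seed)
    examples = [((target, context), 1)]          # the true context word gets label 1
    while len(examples) < num_negative + 1:
        candidate = rng.randrange(vocab_size)
        # Uniform sampling is a simplification; skip the true pair's own words.
        if candidate not in (target, context):
            examples.append(((target, candidate), 0))   # negative sample, label 0
    return examples


examples = negative_sampling_examples(target=42, context=7, vocab_size=10_000)
for pair, label in examples:
    print(pair, label)
```

Each training step then works with one positive and a handful of negatives instead of all 10,000 output classes.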

- To train the embedding layer using negative samples in Keras, we can re-imagine the way we train our network.  Instead of constructing our network so that the output layer is a multi-class softmax layer, we can change it into a simple binary classifier.  For words that are in the context of the target word, we want our network to output a 1, and for our negative samples, we want our network to output a 0. Therefore, the output layer of our Word2Vec Keras network is simply a single node with a sigmoid activation function.
 
We also need a way of ensuring that, as the network trains, words which are similar end up having similar embedding vectors.  Therefore, we want to ensure that the trained network will always output a 1 when it is supplied words which are in the same context, but 0 when it is supplied words which are never in the same context. Therefore, we need a vector similarity score supplied to the output sigmoid layer, with similar vectors outputting a high score and dissimilar vectors outputting a low score.  The most typical similarity measure used between two vectors is the cosine similarity score:

$$\text{similarity} = \cos(\theta) = \frac{\mathbf{A}\cdot\mathbf{B}}{\lVert\mathbf{A}\rVert_2 \, \lVert\mathbf{B}\rVert_2}$$

The denominator of this measure acts to normalize the result – the real similarity operation is on the numerator: the dot product between vectors A and B.  In other words, to get a simple, non-normalized measure of similarity between two vectors, you simply apply a dot product operation between them.
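
The relationship between the two measures can be checked numerically. The vectors below are arbitrary illustrative values, and `cosine_similarity` is a small helper written for this sketch.

```python
import numpy as np


def cosine_similarity(a, b):
    """Dot product divided by the product of the two L2 norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction as a, twice the length
c = np.array([-3.0, 0.0, 1.0])   # orthogonal to a

print(a @ b, cosine_similarity(a, b))
print(a @ c, cosine_similarity(a, c))
```

The dot product grows with the length of the vectors, while the cosine score depends only on their direction. That is the normalization the denominator provides.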

So with all that in mind, our new negative sampling network for the planned Word2Vec Keras implementation features:

- An (integer) input of a target word and a real or negative context word
- An embedding layer lookup (i.e. looking up the integer index of the word in the embedding matrix to get the word vector)
- The application of a dot product operation
- The output sigmoid layer

The architecture of this implementation looks like:

![](pictures/nlp_10_clairf_embedd3.jpg)

Let’s go through this architecture more carefully.  First, each of the words in our vocabulary is assigned an integer index between 0 and the size of our vocabulary minus one (in this case, 0 to 9,999).  We pass two words into the network, one the target word and the other either a word from the surrounding context or a negative sample.  We “look up” these indexes as the rows of our embedding layer (10,000 x 300 weight tensor) to retrieve our 300 length word vectors.  We then perform a dot product operation between these vectors to get the similarity.  Finally, we pass the similarity through a sigmoid layer to give us a value between 0 and 1, which we can compare with the label given to the context word (1 for a true context word, 0 for a negative sample).
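
The forward pass just described can be sketched in NumPy as follows. This is not the Keras implementation: the embedding values are random, the word indexes are hypothetical, and binary cross-entropy is assumed as the loss used to compare the output with the label.

```python
import numpy as np

vocab_size, embedding_size = 10_000, 300
rng = np.random.default_rng(0)
# Single embedding table shared by target and context words, as described above.
embeddings = rng.normal(scale=0.1, size=(vocab_size, embedding_size))


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def predict(target_idx, context_idx):
    """Look up both word vectors, take their dot product, squash with a sigmoid."""
    target_vec = embeddings[target_idx]
    context_vec = embeddings[context_idx]
    return sigmoid(target_vec @ context_vec)


def binary_cross_entropy(prob, label):
    # Assumed loss: small when prob is close to the 0/1 label.
    return -(label * np.log(prob) + (1 - label) * np.log(1 - prob))


p = predict(42, 7)      # hypothetical word indexes
print(p, binary_cross_entropy(p, 1), binary_cross_entropy(p, 0))
```

With untrained embeddings the output sits near 0.5 for any pair. Training is what pushes it toward 1 for true context words and toward 0 for negative samples.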

The back-propagation of our errors will work to update the embedding layer to ensure that words which are truly similar to each other (i.e. share contexts) have vectors such that they return high similarity scores. Let’s now implement this architecture in Keras and we can test whether this turns out to be the case.
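
Before moving to Keras, the effect of back-propagation on the embedding table can be illustrated with a hand-written gradient step. This is a toy sketch under stated assumptions: a tiny 50 x 8 table, a binary cross-entropy loss, plain SGD, and hypothetical word indexes (3 as the target, 5 as a true context word, 9 as a negative sample).

```python
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(scale=0.1, size=(50, 8))   # small toy table: 50 words, 8 dimensions


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_step(target_idx, context_idx, label, lr=0.5):
    """One SGD step on binary cross-entropy for a single (target, context) pair."""
    t = embeddings[target_idx].copy()
    c = embeddings[context_idx].copy()
    error = sigmoid(t @ c) - label          # d(loss)/d(score)
    embeddings[target_idx] -= lr * error * c
    embeddings[context_idx] -= lr * error * t


def score(i, j):
    return sigmoid(embeddings[i] @ embeddings[j])


before_pos, before_neg = score(3, 5), score(3, 9)
for _ in range(200):
    train_step(3, 5, label=1)   # word 5 is a true context word of word 3
    train_step(3, 9, label=0)   # word 9 is a negative sample
after_pos, after_neg = score(3, 5), score(3, 9)

print("true pair:    ", before_pos, "->", after_pos)
print("negative pair:", before_neg, "->", after_neg)
```

The score for the true pair rises while the score for the negative pair falls, which is the behavior the paragraph above attributes to back-propagation.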