# Training Word2Vec

### Introduction

### The n-gram model

Word2vec iteratively adjusts the values of its word vectors, asking: given one word and its current vector, how well can we predict the words in its context? One initial way to frame this is: given the previous words in a sentence, what is the next word?

$p(w_{t + 1}|w_1,w_2,...w_t)$

Consider what this looks like in practice, using the sentence `Word2vec is a group of related models`.  If we want to calculate the probability of the word `models` given the phrase `Word2vec is a group of related`, we have the following:

$p(w_{t + 1}|w_1,w_2,...w_t) = \frac{COUNT(Word2vec\_is\_a\_group\_of\_related\_models)}{COUNT(Word2vec\_is\_a\_group\_of\_related)}$

The phrase `Word2vec is a group of related` probably occurs only once in the entire corpus, so we won't have enough data to make reliable predictions with this approach.

So instead of using all of the previous words in a sentence to predict a word, we can use only the word immediately before it.  This is an approximation:

$p(w_{t + 1}|w_1,w_2,...w_t) \approx p(w_{t+1} | w_t)$

So using our example above, the probability of `models` given only the previous word `related` is:

$p(models|related) = \frac{COUNT(related\_models)}{COUNT(related)}$

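We can extend the window of previous words to include more context, with the tradeoff that we have fewer examples from which to calculate each probability.  This is where the `n` in `n-gram` comes from: `n` is the length of the word sequence being counted, so an n-gram model predicts a word from the $n - 1$ words before it.  The example above, which uses one previous word, is a bigram ($n = 2$) model.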
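The count-based estimate with a single previous word can be sketched in a few lines of Python. This is a minimal illustration, not a production language model: the toy corpus and the function name `bigram_probability` are made up for the example, and no smoothing is applied.

```python
from collections import Counter


def bigram_probability(tokens, prev_word, next_word):
    """Estimate p(next_word | prev_word) from raw counts in a token list."""
    unigram_counts = Counter(tokens)
    # Count every pair of adjacent tokens, e.g. ("related", "models").
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    if unigram_counts[prev_word] == 0:
        return 0.0  # unseen previous word: counts alone give no estimate
    return bigram_counts[(prev_word, next_word)] / unigram_counts[prev_word]


# Toy corpus, invented for illustration only.
tokens = "word2vec is a group of related models and related ideas".split()

# COUNT(related_models) / COUNT(related)
p = bigram_probability(tokens, "related", "models")
print(p)
```

Here `related` appears twice and is followed by `models` once, so the estimate is one half. With a longer history the numerator and denominator both shrink quickly, which is the data sparsity problem described above.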

### Skip Gram and Continuous Bag of Words

One way to relax the requirement of using only previous words is to consider the surrounding words, both before and after.

####  Continuous bag of words

The continuous bag of words model predicts the current word based on the surrounding words.

Suppose we want to predict the word `related`, and we are training on the phrase:

`is a group of related models that are used` 

With a window size of $w = 1$ (one word on each side), we have:


* $P(related|of)$
* $P(related|models)$


With a window size of $w = 2$ (two words on each side), we have:

* $P(related|of)$
* $P(related|models)$
* $P(related|group)$
* $P(related|that)$
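To make the windowing concrete, here is a small Python sketch that collects the surrounding words for each position in the phrase. The function name `cbow_examples` and the `(context_words, target_word)` layout are assumptions made for illustration; they are not taken from any particular word2vec implementation.

```python
def cbow_examples(tokens, window):
    """Return (context_words, target_word) pairs for a symmetric window."""
    examples = []
    for t, target in enumerate(tokens):
        # Up to `window` words before the target (fewer near the start).
        left = tokens[max(0, t - window):t]
        # Up to `window` words after the target (fewer near the end).
        right = tokens[t + 1:t + 1 + window]
        examples.append((left + right, target))
    return examples


tokens = "is a group of related models that are used".split()
examples = cbow_examples(tokens, window=2)

# Look up the example whose target is the word `related`.
context, target = examples[tokens.index("related")]
print(context, "->", target)
```

For `related` with a window of 2, the context words are `group`, `of`, `models` and `that`, matching the four probabilities listed above.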

#### Skipgram model

The skipgram model does the opposite.  It predicts the context words given the current word.  

<img src="word-2-vec-pw.png" width="40%">

We'll define the following terms:

* $w_t$ = current word
* $w_{t + j}$ = context word
* $c$ = context window size

Such that $j\neq 0$ and $|j| \leq c $.

So if we have the phrase: 

`is a group of related models that are used` 

We can let the word `related` be our current word, which will be used to predict each of our context words.  

So $w_t = related$, and we try to predict the following.

With $c = 1$, we have:

* $P(w_{t-1}|w_t) = P(of|related)$
* $P(w_{t+1}|w_t) = P(models|related)$

With $c = 2$, we have:

* $P(w_{t-1}|w_t) = P(of|related)$
* $P(w_{t+1}|w_t) = P(models|related)$
* $P(w_{t-2}|w_t) = P(group|related)$
* $P(w_{t+2}|w_t) = P(that|related)$

Or in other words, for each word in the context window, we try to calculate: 

* $P(w_{t+j}| w_t)$

Such that $j\neq 0$ and $|j| \leq c $.
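The index conditions $j \neq 0$ and $|j| \leq c$ translate directly into a loop. The following Python sketch enumerates the (current word, context word) pairs for the example phrase; the function name `skipgram_pairs` is hypothetical, and skipping positions that fall off either end of the text is an assumption about how to handle the boundaries.

```python
def skipgram_pairs(tokens, c):
    """Return (current_word, context_word) pairs with j != 0 and |j| <= c."""
    pairs = []
    for t, current in enumerate(tokens):
        for j in range(-c, c + 1):
            # Skip the current word itself and positions outside the text.
            if j == 0 or not 0 <= t + j < len(tokens):
                continue
            pairs.append((current, tokens[t + j]))
    return pairs


tokens = "is a group of related models that are used".split()
all_pairs = skipgram_pairs(tokens, c=2)

# Keep only the pairs whose current word is `related`.
related_pairs = [pair for pair in all_pairs if pair[0] == "related"]
print(related_pairs)
```

Each pair `(current, context)` corresponds to one term $P(w_{t+j}| w_t)$, so for `related` with $c = 2$ there are four pairs: `group`, `of`, `models` and `that`.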

### Predicting a context window

We saw that the probability of a word $w_{t + j}$ in the context window of our current word $w_t$ can be written as:

$P(w_{t+j}| w_t)$

Now let's calculate the probability of the entire context window, given a current word.  If we assume the context words are independent of one another given the current word, the probability of the window is the product of the individual probabilities:

* $P(group, of, models, that|related) = P(group|related)*P(of|related)*P(models|related)* P(that|related)$

Or written another way, if we let $c = 2$, then the probability of the context window, $cw$, given our current word, $w_t$, is:

$P(cw|w_t) = P(w_{t-2}| w_t)*P(w_{t-1}| w_t) * P(w_{t+1}| w_t) * P(w_{t+2}| w_t)$

Which equals:

$P(cw|w_t) = {\displaystyle \prod_{|j| \leq c; j \neq 0}} P(w_{t + j}| w_t)$

### Predicting a corpus

The next task is to predict an entire corpus.  This is simply the probability of all of the context windows occurring together.  How many context windows do we have for a given corpus?  Each word in the corpus gets its own context window, so $COUNT(w_t) = COUNT(cw)$.

So calculating the probability of an entire corpus of $T$ words means multiplying the probabilities of the context windows, one for each word in the corpus.

$P(corpus) = P(cw_1)*P(cw_2)* ... *P(cw_T)$

$P(corpus) = {\displaystyle \prod_{t=1}^T} P(cw_t)$

Then, via substitution we have:

$P(corpus) = {\displaystyle \prod_{t=1}^T}  {\displaystyle \prod_{|j| \leq c; j \neq 0}} P(w_{t + j}| w_t)$

So that is the likelihood of the entire corpus, and the task of word2vec is to choose the parameters $\theta$ (the values of the word vectors) that maximize this likelihood.  In other words:

$\underset{\theta}{argmax} {\displaystyle \prod_{t=1}^T}  {\displaystyle \prod_{|j| \leq c; j \neq 0}} P(w_{t + j}| w_t; \theta)$
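As a rough sketch of what evaluating this objective could look like, the Python below computes the log of the corpus likelihood for the example phrase. Several things here are assumptions that the text above does not specify: each probability is modeled as a softmax over dot products between a vector for the current word and a vector for each candidate context word, there are two separate vector tables (`in_vecs` and `out_vecs`), and the vectors are just random initial values rather than trained ones. Taking logs turns the product into a sum, which has the same maximizing $\theta$ because the logarithm is increasing, and avoids multiplying many small numbers together.

```python
import math
import random


def softmax_prob(context_id, current_id, in_vecs, out_vecs):
    """P(context | current) as a softmax over dot products (assumed model)."""
    v = in_vecs[current_id]
    scores = [sum(a * b for a, b in zip(u, v)) for u in out_vecs]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    return exps[context_id] / sum(exps)


def corpus_log_likelihood(token_ids, c, in_vecs, out_vecs):
    """Sum of log P(w_{t+j} | w_t) over all t and all j != 0 with |j| <= c."""
    total = 0.0
    for t, current_id in enumerate(token_ids):
        for j in range(-c, c + 1):
            if j == 0 or not 0 <= t + j < len(token_ids):
                continue
            p = softmax_prob(token_ids[t + j], current_id, in_vecs, out_vecs)
            total += math.log(p)
    return total


tokens = "is a group of related models that are used".split()
vocab = sorted(set(tokens))
word_to_id = {word: i for i, word in enumerate(vocab)}
token_ids = [word_to_id[word] for word in tokens]

# theta: randomly initialized vectors, one pair of tables (hypothetical setup).
rng = random.Random(0)
dim = 5
in_vecs = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
out_vecs = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

log_likelihood = corpus_log_likelihood(token_ids, 2, in_vecs, out_vecs)
print(log_likelihood)
```

Training would then adjust the values in `in_vecs` and `out_vecs` to push this number up, for example with gradient ascent; that step is not shown here.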


### Overview

1. Start with a corpus: a big pile of continuous text (sentences).
2. Represent every word by a vector, initialized with random values.
3. Pick a word in the text as the current word, look at the words around it, and use the current word's vector to predict each word in its context.

Then we move on to the next word in the corpus and repeat, predicting the words that occur around it.

We go through the whole corpus this way, adjusting the vectors to maximize the likelihood of the surrounding words given the vector representation of each current word.

### Resources

[Word2Vec](https://rohanvarma.me/Word2Vec/)

[Stanford CS224d lecture notes](https://cs224d.stanford.edu/lecture_notes/notes1.pdf)

[d2l.ai](https://d2l.ai/chapter_preliminaries/probability.html)

[babylon - word and context words](http://building-babylon.net/2016/05/12/skipgram-isnt-matrix-factorisation/)

[Natural Language Processing in Action (MEAP book)](https://www.amazon.com/Natural-Language-Processing-Action-Understanding/dp/1617294632/ref=sr_1_3?keywords=Natural+Language+Processing+in+Action&qid=1573581726&sr=8-3)