# Word2Vec

The principle of word2vec is that it is possible to establish the meaning of each word based on the contexts in which it is used. Meanings are encoded as dense vectors.

Two words with similar vectors are expected to have similar meanings. 
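The text does not name a similarity measure, so the sketch below assumes cosine similarity, a common choice for comparing word vectors. The three vectors are made up for illustration and do not come from a trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional vectors, for illustration only (assumption).
vectors = {
    "bank": [0.9, 0.1, 0.3],
    "banking": [0.8, 0.2, 0.4],
    "banana": [-0.2, 0.9, -0.5],
}

sim_close = cosine_similarity(vectors["bank"], vectors["banking"])
sim_far = cosine_similarity(vectors["bank"], vectors["banana"])
print(sim_close > sim_far)
```

With these toy values, "bank" is closer to "banking" than to "banana", which is the behavior we hope trained vectors will show.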

## Terms

### Vocabulary
A vocabulary is the set of words in a corpus for which vectors are calculated.

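### Center Word
A center word is the word at the position currently being processed. The words around it, within the window, are its context words.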



# Skip-gram Algorithm

## Principle
Go through each word position `t` in the text. The word at position `t` is the `center word`. Each word surrounding the center word is a `context word`. The algorithm considers only words within a fixed distance of the center word; that fixed distance is the `window`.

Let `c` be a center word and `o` be a context word.

Use the similarity of word vectors for `c` and `o` to calculate `P(o|c)` or `P(c|o)`. Adjust the word vectors to maximize the probability.

### Example: Window of Size 2

Let `w` be a word at the subscripted position. 

$ P(w_{t-2}|w_{t}) \; P(w_{t-1}|w_{t}) \;\;\;\;\;\;\;\;\;\;  P(w_{t+1}|w_{t}) \; P(w_{t+2}|w_{t})    $<br>
$problems \;\; turning \;\;\;\;\; into \;\;\;\;\; banking \;\;\;\; crisis$<br>
$window \;\;\;\;  window \;\;\;\; center \;\; window \;\;\;\; window$<br>

$ P(w_{t-2}|w_{t}) \; P(w_{t-1}|w_{t}) \;\;\;\;\;\;\;\;\;\;\;\;\;  P(w_{t+1}|w_{t}) \; P(w_{t+2}|w_{t})    $<br>
$turning \;\;\;\;\; into \;\;\;\;\;\;\;\;\; banking \;\; crisis \;\;\;\;\;\;\; as$<br>
$window \;\;\;\;  window \;\;\;\; center \;\;\;\; window \;\;\;\; window$<br>

The goal is to maximize the product of these probabilities.
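The sliding window can be sketched in a few lines of Python. The function name `skipgram_pairs` is hypothetical, and the sketch assumes that positions near the start or end of the text simply get fewer context words.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs for every position t in tokens."""
    for t, center in enumerate(tokens):
        # Context positions run from t - window to t + window, excluding t,
        # and are clipped to the bounds of the text (assumption).
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:
                yield center, tokens[j]

tokens = "problems turning into banking crisis as".split()
pairs = list(skipgram_pairs(tokens, window=2))
print([context for center, context in pairs if center == "into"])
# ['problems', 'turning', 'banking', 'crisis']
```

Each pair contributes one factor `P(context | center)` to the product being maximized.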

# Word2vec: Likelihood

For each position `t = 1`, ..., `T`, predict the context words within a window of fixed size `m`, given the center word $w_t$.

$T = number \; of \; words \; in \; a \; corpus $ <br>
$t = position \; of \; current \; word  $ <br>
$m = size \; of \; window, \; \pm m \; words $ <br>
$w = word \; at \; a \; given \; position $ <br>
$\theta = all \; word \; vectors \; (the \; model \; parameters), \; which \; are \; adjusted \; during \; training $ <br>
$L = likelihood$


For each position in the corpus, and for each context word in the window around it, compute the probability of that context word given the center word. The likelihood is the product of all these probabilities.

$$ L(\theta ) = \prod_{t = 1}^{T}\prod_{-m \leq j \leq m, j \ne 0}^{} P(w_{t+j}|w_{t};\theta) $$


# Word2vec: Objective Function
The objective function is the average negative log likelihood. It is also called the cost function or loss function. We minimize the objective function in order to maximize predictive accuracy.

$$ J(\theta ) = -\frac{1}{T}log  \; L(\theta) =  -\frac{1}{T} \sum_{t = 1}^{T}\sum_{-m \leq j \leq m, j \ne 0}^{} log \; P(w_{t+j}|w_{t};\theta) $$
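As a rough sketch, $J(\theta)$ can be computed on a toy corpus as follows. It assumes that $P(w_{t+j}|w_{t};\theta)$ is a softmax over dot products of word vectors, and that window positions falling outside the corpus are skipped. The names `objective`, `context_vecs` and `center_vecs`, the word ids and the random vectors are all hypothetical.

```python
import numpy as np

def objective(corpus_ids, context_vecs, center_vecs, m):
    """Average negative log likelihood J(theta) for a corpus of word ids."""
    T = len(corpus_ids)
    total = 0.0
    for t, c in enumerate(corpus_ids):
        scores = context_vecs @ center_vecs[c]   # dot product with every word
        scores = scores - scores.max()           # shift for numerical stability
        log_probs = scores - np.log(np.exp(scores).sum())  # log P(w | c)
        for j in range(-m, m + 1):
            # Skip the center itself and positions outside the corpus (assumption).
            if j != 0 and 0 <= t + j < T:
                total += log_probs[corpus_ids[t + j]]
    return -total / T

rng = np.random.default_rng(0)
vocab_size, dim = 5, 3
context_vecs = rng.normal(scale=0.1, size=(vocab_size, dim))
center_vecs = rng.normal(scale=0.1, size=(vocab_size, dim))
corpus_ids = [0, 1, 2, 3, 4, 1, 2]   # toy corpus, already mapped to word ids

J = objective(corpus_ids, context_vecs, center_vecs, m=2)
print(J)
```

Taking the log turns the double product into a double sum, which is why the loop accumulates log probabilities instead of multiplying probabilities.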

For center word `c` and context word `o`

$w = a \; given \; word $ <br>
$v_{w} = vector \; of \; word \; when \; word \; is \; a \; center \; word $ <br>
$u_{w} = vector \; of \; word \; when \; word \; is \; a \; context \; word $ <br>
$c = center \; word  $ <br>
$o = context \; word  $ <br>
$u_o^Tv_c = dot \; product = u_o \cdot v_c$ <br>
$u_w^Tv_c = dot \; product = u_w \cdot v_c$ <br>


## Question 
How to calculate? 
$$  P(w_{t+j}|w_{t};\theta) $$<br>

## Answer
We will use two vectors per word $w$:
*  $v_w$ when w is a center word
*  $u_w$ when w is a context word


Then for a center word `c` and context word `o`:

$$ P(o|c) = \frac{exp(u_o^{\intercal}v_c)}{\sum_{w \in V} exp(u_w^{\intercal} v_c)} $$

Characteristics of the equation:

$exp$: Exponentiation makes any value positive.<br>
$u_o^{\intercal}v_c$: <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The dot product measures the similarity of `o` and `c`. It is the sum of the element-wise products.<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; $u^{\intercal}v = u \cdot v = \sum_{i=1}^{n} u_iv_i$ <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Larger dot product = larger probability<br>
$\sum_{w \in V} exp(u_w^{\intercal} v_c)$: Normalize over the entire vocabulary to give a probability distribution.<br>
$w \in V$: $w$ ranges over every word in the vocabulary $V$.

This is an example of the softmax function. 

$softmax(x_i) =  \frac{exp(x_{i})}{\sum_{j=1}^{n} exp(x_{j})} =  p_{i}$

Softmax in code.

```python
import numpy as np

a = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0]
np.exp(a) / np.sum(np.exp(a))
# array([0.02364054, 0.06426166, 0.1746813 , 0.474833  , 0.02364054,
#        0.06426166, 0.1746813 ])
```
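Applying the same softmax to the dot products $u_w^{\intercal}v_c$ gives $P(o|c)$. The sketch below assumes the vectors are stored as rows of two matrices, indexed by word id. The names `context_vecs`, `center_vecs` and `prob_context_given_center` are hypothetical. Subtracting the maximum before exponentiating is a standard trick to avoid overflow and does not change the result.

```python
import numpy as np

def softmax(x):
    """Softmax that stays finite for large inputs."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())   # shifting by the max leaves the ratios unchanged
    return e / e.sum()

def prob_context_given_center(o, c, context_vecs, center_vecs):
    """P(o | c): softmax over the dot products of every u_w with v_c."""
    scores = context_vecs @ center_vecs[c]
    return softmax(scores)[o]

rng = np.random.default_rng(0)
context_vecs = rng.normal(size=(4, 3))   # one u_w per row (toy values)
center_vecs = rng.normal(size=(4, 3))    # one v_w per row (toy values)

probs = [prob_context_given_center(o, 1, context_vecs, center_vecs) for o in range(4)]
print(probs, sum(probs))   # the four probabilities sum to 1 up to rounding
```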

We need to minimize $J(\theta)$. Consider the contribution of a single pair of center word `c` and context word `o`, which is $-log \; P(o|c)$, and start with its derivative w.r.t. $v_{c}$. The sums run over all $V$ words of the vocabulary.

$$\frac{\partial J(\theta)}{\partial v_{c}} = -\frac{\partial}{\partial v_{c}}(log(exp(u_{o}^{\intercal}v_{c}))) + \frac{\partial}{\partial v_{c}}(log \sum_{w=1}^{V}exp(u_{w}^{\intercal}v_{c}))$$

Break the equation into two parts and solve them individually. For the first part, since log(exp(x)) = x:

$$\frac{\partial}{\partial v_{c}}(log(exp(u_{o}^{\intercal}v_{c}))) = \frac{\partial}{\partial v_{c}}(u_{o}^{\intercal}v_{c}) = u_{o}$$

## Second Part of the Equation
Take the derivative of the log and move the derivative inside the summation. First recall that:

$\frac{d}{dx}ln(x) = \frac{1}{x}$<br>
$\frac{d}{dx}ln[f(x)] = \frac{f'(x)}{f(x)}$<br>
$\frac{d}{dx}\sum_{i} f_{i}(x) = \sum_{i} \frac{d}{dx}f_{i}(x)$<br>
Chain rule: $\frac{d}{dx}f(g(x)) = f'(g(x)) * g'(x)$

Applying these rules to the second part:

$$\frac{\partial}{\partial v_{c}}(log \sum_{w=1}^{V}exp(u_{w}^{\intercal}v_{c})) = \frac{1}{\sum_{w=1}^{V}exp(u_{w}^{\intercal}v_{c})}\sum_{x=1}^{V}\frac{\partial}{\partial v_{c}}exp(u_{x}^{\intercal}v_{c})$$

$$= \sum_{x=1}^{V} \frac{exp(u_{x}^{\intercal}v_{c})}{\sum_{w=1}^{V}exp(u_{w}^{\intercal}v_{c})} * u_{x} $$

$$= \sum_{x=1}^{V} P(x|c) * u_{x}$$

Combining both parts, the gradient is the expected context vector under the model minus the observed context vector:

$$\frac{\partial J(\theta)}{\partial v_{c}} = -u_{o} + \sum_{x=1}^{V} P(x|c) * u_{x}$$
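The result $\frac{\partial J}{\partial v_{c}} = -u_{o} + \sum_{x} P(x|c) * u_{x}$ can be sanity-checked numerically. The sketch below compares it against central finite differences for a single (center, context) pair. The function names, vector sizes and random values are hypothetical.

```python
import numpy as np

def loss(v_c, context_vecs, o):
    """-log P(o | c) for one (center, context) pair."""
    scores = context_vecs @ v_c
    return -scores[o] + np.log(np.exp(scores).sum())

def grad_v_c(v_c, context_vecs, o):
    """Analytic gradient: -u_o + sum_x P(x | c) * u_x."""
    scores = context_vecs @ v_c
    p = np.exp(scores) / np.exp(scores).sum()   # P(x | c) for every word x
    return -context_vecs[o] + p @ context_vecs

rng = np.random.default_rng(0)
context_vecs = rng.normal(size=(5, 3))   # toy u_w vectors, one per row
v_c = rng.normal(size=3)                 # toy center vector
o = 2                                    # id of the observed context word

# Central finite differences, one coordinate of v_c at a time.
eps = 1e-6
numeric = np.zeros_like(v_c)
for i in range(v_c.size):
    step = np.zeros_like(v_c)
    step[i] = eps
    numeric[i] = (loss(v_c + step, context_vecs, o)
                  - loss(v_c - step, context_vecs, o)) / (2 * eps)

analytic = grad_v_c(v_c, context_vecs, o)
print(np.allclose(analytic, numeric, atol=1e-6))
```

Agreement between the two gradients suggests that the signs and the softmax weights in the derivation are right.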

For the context vectors $u_{w}$ there are two cases.

If $w \ne o$:
$$ \frac{\partial J(\theta)}{\partial u_{w}} = P(w|c) * v_{c}$$

If $w = o$:
$$ \frac{\partial J(\theta)}{\partial u_{o}} = -v_{c} + P(o|c) * v_{c}$$

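The gradients below have the form obtained for the negative-sampling loss $J = -log \; \sigma(u_{o}^{\intercal}v_{c}) - \sum_{k=1}^{K} log \; \sigma(-u_{k}^{\intercal}v_{c})$, where $\sigma$ is the sigmoid function and $u_{1}, ..., u_{K}$ are the vectors of $K$ sampled negative words:

$$ \frac{\partial J(\theta)}{\partial v_{c}} = - \sigma(-u_{o}^{\intercal}v_{c})u_{o} + \sum_{k=1}^{K} \sigma(u_{k}^{\intercal} v_{c})u_{k}$$

$$ \frac{\partial J(\theta)}{\partial u_{o}}  = - \sigma(-u_{o}^{\intercal}v_{c})v_{c} $$

For each negative sample $k = 1, ..., K$:

$$ \frac{\partial J(\theta)}{\partial u_{k}} = \sigma(u_{k}^{\intercal}v_{c})v_{c}$$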
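The sigmoid-based gradients can be checked the same way. The sketch below assumes they come from the negative-sampling loss $J = -log \; \sigma(u_{o}^{\intercal}v_{c}) - \sum_{k=1}^{K} log \; \sigma(-u_{k}^{\intercal}v_{c})$, with $K$ distinct negative samples, none of which is `o`. The text does not state this loss explicitly, so treat it as an assumption. All names and values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, U_neg):
    """Assumed loss: -log s(u_o . v_c) - sum_k log s(-u_k . v_c)."""
    return -np.log(sigmoid(u_o @ v_c)) - np.log(sigmoid(-(U_neg @ v_c))).sum()

def neg_sampling_grads(v_c, u_o, U_neg):
    pos = sigmoid(-(u_o @ v_c))        # scalar weight for the true context word
    neg = sigmoid(U_neg @ v_c)         # one weight per negative sample, shape (K,)
    grad_v_c = -pos * u_o + neg @ U_neg
    grad_u_o = -pos * v_c
    grad_U_neg = np.outer(neg, v_c)    # row k is s(u_k . v_c) * v_c
    return grad_v_c, grad_u_o, grad_U_neg

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of the scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
v_c = rng.normal(size=3)
u_o = rng.normal(size=3)
U_neg = rng.normal(size=(4, 3))        # K = 4 toy negative samples, one per row

grad_v_c, grad_u_o, grad_U_neg = neg_sampling_grads(v_c, u_o, U_neg)
num_v_c = numeric_grad(lambda v: neg_sampling_loss(v, u_o, U_neg), v_c)
num_u_o = numeric_grad(lambda u: neg_sampling_loss(v_c, u, U_neg), u_o)
print(np.allclose(grad_v_c, num_v_c, atol=1e-6),
      np.allclose(grad_u_o, num_u_o, atol=1e-6))
```

Unlike the softmax version, each update touches only the center vector, the true context vector and the $K$ sampled vectors, so the cost no longer grows with the size of the vocabulary.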