# Word Embeddings
Image and audio processing systems work with rich, high-dimensional datasets encoded as vectors of numbers. However, natural language processing systems traditionally treat words as discrete atomic symbols, so 'cat' may be represented as Id537 and 'dog' as Id143. These encodings are very sparse and provide no useful information about the relationships that may exist between the individual symbols.

Vector space models represent words in a continuous vector space where semantically similar words are mapped to nearby points (they are embedded near each other). In this series of notebooks, we look at a few word embedding techniques and compare them:

* Skip-gram with [Negative Sampling](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
* Skip-gram with [Noise Contrastive Estimation](http://papers.nips.cc/paper/5165-learning-word-embeddings-efficiently-with-noise-contrastive-estimation.pdf)
* GloVe: [Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf), with more resources available [here](https://nlp.stanford.edu/projects/glove/)

# Skip-gram 
The Skip-gram model learns word representations that are useful for predicting the surrounding words in a sentence or a document. More formally, given a sequence of training words $w_1,w_2,\ldots,w_T$ and a context window of size $c$, the objective of the Skip-gram model is to maximize the average log probability
$$
\frac{1}{T} \sum_{t=1}^T\sum_{-c\leq j \leq c, j\neq 0}\log p(w_{t+j}|w_t)
$$
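The double sum above runs over every position $t$ and every offset $j$ inside the window. As a minimal sketch (the helper name `skipgram_pairs` and the toy sentence are illustrative assumptions, not part of the original notebook), the training pairs it ranges over could be enumerated like this:

```python
def skipgram_pairs(tokens, c):
    """Return (center, outside) pairs for each position t and offset -c <= j <= c, j != 0."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):
            # Skip the center word itself and offsets that fall outside the sequence.
            if j == 0 or not 0 <= t + j < len(tokens):
                continue
            pairs.append((center, tokens[t + j]))
    return pairs


tokens = "the quick brown fox jumps".split()  # hypothetical toy corpus
pairs = skipgram_pairs(tokens, c=1)
```

Each pair contributes one $\log p(w_{t+j}|w_t)$ term to the objective. Words near the sequence boundaries simply produce fewer pairs in this sketch.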
The Skip-gram model defines $p(w_{t+j}|w_t)$ using the softmax function
$$
p(w_{o}|w_{i}) = \frac{\exp\left(u_{w_o}^Tv_{w_{i}}\right)}{\sum_{w=1}^V\exp\left(u_w^Tv_{w_{i}}\right)}
$$
where $V$ is the size of the vocabulary and
* $w_o$ is the output word (the outside, or surrounding, word)
* $w_i$ is the input word (the center word)
* $u_w$ is the output vector representation of word $w$
* $v_w$ is the input vector representation of word $w$

This formulation is impractical because computing the denominator costs $O(V)$ for every training pair, and $V$ is often large ($10^5$ to $10^7$ words).
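To make the cost concrete, here is a small NumPy sketch of the full softmax. The array names `U_out` and `V_in`, the sizes, and the random initialisation are assumptions chosen for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 50  # assumed vocabulary size and embedding dimension
U_out = rng.normal(scale=0.1, size=(V, d))  # output vectors u_w, one row per word
V_in = rng.normal(scale=0.1, size=(V, d))   # input vectors v_w, one row per word


def softmax_prob(o, i):
    """p(w_o | w_i) under the full softmax."""
    scores = U_out @ V_in[i]          # u_w^T v_{w_i} for every word w in the vocabulary
    scores = scores - scores.max()    # shift for numerical stability; the ratio is unchanged
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()  # the sum touches all V words


p = softmax_prob(o=3, i=7)
```

A single probability needs a dot product with every row of `U_out`, so the work per training pair grows linearly with the vocabulary size.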

# Skip-gram with Negative sampling
Mikolov et al. introduce an efficient alternative called negative sampling (NEG). NEG replaces each $\log p(w_o|w_i)$ term in the Skip-gram objective with
$$
\log \sigma\left(u_{w_o}^Tv_{w_{i}}\right) + \sum_{i=1}^k \mathbb{E}_{j_i\sim P_n(w)}\log\sigma\left(-u_{j_i}^Tv_{w_{i}}\right)
$$
where
$$
P_n(w) = U(w)^{3/4}/Z
$$
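is the unigram distribution $U(w)$ raised to the 3/4 power and then normalized by the constant $Z$, $\sigma(x) = 1/(1+e^{-x})$ is the sigmoid function, and $k$ is the number of negative samples drawn for each pair. Raising to the 3/4 power flattens the distribution, so less frequent words are sampled more often than they would be under $U(w)$ itself.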
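The effect of the 3/4 power is easy to check numerically. The sketch below uses made-up word counts (an assumption for illustration) to build $P_n(w)$ and draw a few negative samples from it:

```python
import numpy as np

counts = np.array([1000, 100, 10, 1], dtype=float)  # hypothetical word counts
unigram = counts / counts.sum()   # U(w)

noise = unigram ** 0.75           # U(w)^{3/4}
noise /= noise.sum()              # divide by Z so the probabilities sum to 1

rng = np.random.default_rng(0)
negatives = rng.choice(len(counts), size=5, p=noise)  # indices of 5 sampled noise words
```

With these counts, the most frequent word drops from a probability of roughly 0.90 to roughly 0.82, while the rarest word rises from roughly 0.0009 to roughly 0.0046.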

The idea here is to
* maximize the probability that the real outside word $w_o$ appears around the center word $w_i$
* minimize the probability that the random words $j_i$ appear around the center word $w_i$
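The two goals above map directly onto the two terms of the NEG objective. Below is a minimal NumPy sketch for a single (center, outside) pair. The function name `neg_objective`, the dimensions, and the random vectors are illustrative assumptions, and each expectation is approximated by one drawn sample, which is the usual practical shortcut:

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def neg_objective(v_center, u_outside, u_negatives):
    """log sigma(u_o^T v_i) + sum over k samples of log sigma(-u_j^T v_i); to be maximized."""
    positive = np.log(sigmoid(u_outside @ v_center))             # real outside word
    negative = np.log(sigmoid(-(u_negatives @ v_center))).sum()  # k sampled noise words
    return positive + negative


rng = np.random.default_rng(0)
d, k = 50, 5  # assumed embedding dimension and number of negative samples
v_center = rng.normal(scale=0.1, size=d)
u_outside = rng.normal(scale=0.1, size=d)
u_negatives = rng.normal(scale=0.1, size=(k, d))  # one row per sampled noise word

value = neg_objective(v_center, u_outside, u_negatives)
```

Only $k + 1$ dot products are needed per pair, compared with $V$ for the full softmax, which is where the efficiency gain comes from.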

 