# Word Embeddings
Image and audio processing systems work with rich, high-dimensional datasets encoded as vectors of numbers. However, natural language processing systems traditionally treat words as discrete atomic symbols, so 'cat' may be represented as Id537 and 'dog' as Id143. These encodings are arbitrary and very sparse, and they provide no useful information about the relationships that may exist between the individual symbols.
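
To make the last point concrete, here is a minimal sketch with a made-up five-word vocabulary (the words and ids are illustrative assumptions, not taken from the corpus used later). With one-hot encodings every pair of distinct words has cosine similarity zero, so 'cat' is no closer to 'dog' than to 'car'.

```python
import numpy as np

# Hypothetical 5-word vocabulary; the ids are arbitrary labels.
vocab = ['cat', 'dog', 'car', 'tree', 'house']
word2id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector of a word."""
    vec = np.zeros(len(vocab))
    vec[word2id[word]] = 1.0
    return vec

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Distinct one-hot vectors are orthogonal, whatever the words mean.
sim_cat_dog = cosine(one_hot('cat'), one_hot('dog'))
sim_cat_car = cosine(one_hot('cat'), one_hot('car'))
print(sim_cat_dog, sim_cat_car)
```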

Vector space models represent words in a continuous vector space where semantically similar words are mapped to nearby points (they are embedded near each other). In this series of notebooks, we look at a few word embedding techniques and compare them:

* Skip-gram with [Negative Sampling](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)
* GloVe: [Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf), with more resources [here](https://nlp.stanford.edu/projects/glove/)

# Skip-gram 
The Skip-gram model aims to find word representations that are useful for predicting the surrounding words in a sentence or a document. More formally, given a sequence of training words $w_1,w_2,\ldots,w_T$, the objective of the Skip-gram model is to maximize the average log probability
$$
\frac{1}{T} \sum_{t=1}^T\sum_{-c\leq j \leq c, j\neq 0}\log p(w_{t+j}|w_t)
$$
where $c$ is the size of the context window. The Skip-gram model defines $p(w_{t+j}|w_t)$ using the softmax function
$$
p(w_{o}|w_{i}) = \frac{\exp\left(u_{w_o}^Tv_{w_{i}}\right)}{\sum_{w=1}^V\exp\left(u_w^Tv_{w_{i}}\right)}
$$
where $V$ is the size of the vocabulary and
* $w_o$ is the output word (the outside or surrounding word)
* $w_i$ is the input word (the center word)
* $u_w$ is the output vector representation of word $w$
* $v_w$ is the input vector representation of word $w$

This formulation is impractical because the cost of computing the denominator is $O(V)$ for every training pair, and $V$ is often large ($10^5$ to $10^7$).
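
The cost is easy to see in a small NumPy sketch of the full softmax. The sizes, the random initialisation and the names `U`, `V_in` and `softmax_prob` are assumptions made for illustration; the point is only that one probability requires a dot product with every row of the output matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 16                            # toy vocabulary size and dimension (assumed)
U = rng.normal(scale=0.1, size=(V, D))     # output vectors u_w, one row per word
V_in = rng.normal(scale=0.1, size=(V, D))  # input vectors v_w, one row per word

def softmax_prob(center_id, U, V_in):
    """Return p(w | w_i) for every word w, given the id of the center word."""
    scores = U @ V_in[center_id]   # V dot products: this is the O(V) cost
    scores = scores - scores.max() # stabilise exp() without changing the result
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

probs = softmax_prob(3, U, V_in)
print(probs.shape, probs.sum())
```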

# Skip-gram with Negative sampling
Mikolov et al. introduce an efficient technique called negative sampling (NEG). NEG redefines the objective for each (center, outside) pair as
$$
\log \sigma\left(u_{w_o}^Tv_{w_{i}}\right) + \sum_{i=1}^k \mathbb{E}_{j_i\sim P_n(w)}\log\sigma\left(-u_{j_i}^Tv_{w_{i}}\right)
$$
where
$$
P_n(w) = U(w)^{3/4}/Z
$$
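is the unigram distribution $U(w)$ raised to the 3/4 power (then normalized by $Z$), $\sigma$ is the sigmoid function and $k$ is the number of negative samples. Compared with sampling from $U(w)$ directly, the 3/4 power makes less frequent words be sampled relatively more often.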
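
A small sketch of the effect of the 3/4 power, using made-up word counts (an assumption for illustration only): the most frequent word loses probability mass and the rarest word gains some.

```python
import numpy as np

# Hypothetical word counts, from most to least frequent.
counts = np.array([1000, 100, 10, 1], dtype=np.float64)

unigram = counts / counts.sum()   # U(w)
noise = unigram ** 0.75
noise /= noise.sum()              # P_n(w) = U(w)^{3/4} / Z

for u, p in zip(unigram, noise):
    print('U(w) = {:.4f}   P_n(w) = {:.4f}'.format(u, p))
```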

The idea here is to
* maximize the probability that the real outside word $w_o$ appears around the center word $w_i$
* minimize the probability that the random words $j_i$ appear around the center word $w_i$
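
The two bullet points can be checked numerically. The sketch below evaluates the NEG objective for one pair with hand-picked 2-D vectors (the vectors and the function names are assumptions for illustration). The objective is higher when the true outside vector aligns with the center vector and the negatives point away from it.

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(sigma(x)) = -log(1 + exp(-x))."""
    return -np.logaddexp(0.0, -x)

def neg_objective(v_center, u_outside, u_negatives):
    """NEG objective for one (center, outside) pair and k sampled negatives."""
    positive = log_sigmoid(u_outside @ v_center)
    negative = log_sigmoid(-(u_negatives @ v_center)).sum()
    return float(positive + negative)

v_center = np.array([1.0, 0.0])
u_outside = np.array([2.0, 0.0])                    # aligned with the center word
u_negatives = np.array([[-2.0, 0.0], [-1.0, 1.0]])  # pointing away from it

good = neg_objective(v_center, u_outside, u_negatives)
# Flipping all output vectors makes the true word unlikely and the negatives likely.
bad = neg_objective(v_center, -u_outside, -u_negatives)
print(good, bad)
```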

## Implementation planning
Before implementing, we list the required steps:
0. Choose a dataset: 
    * which corpus to use for training
    * which test set to use for testing
1. Pre-process the raw text:
    * extract the set of all words (vocab)
    * map vocab <-> integer id
    * compute word frequencies (we might sub-sample to remove some very frequent words such as 'the', 'a', 'an', etc.); we also need the word frequencies to compute $P_n(w)$
    * convert the raw text to a list of word ids
2. Assemble a graph:
    * Define inputs and targets, taking mini-batches into account
    * Define trainable variables
    * Define a loss function with negative sampling
    * Define an optimizer (we might need to apply gradient clipping)
3. Training:
    * How to feed the input/target data
    * How to measure training performance
    * How to tune hyper-parameters
4. Evaluation:
    * How to measure word2vec quality (hard)
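
Step 1 of the plan might look like the following plain-Python sketch. It is not the `Word2VecInput` class used below, whose internals are not shown here; the function names, the `min_count` parameter and the threshold value are assumptions. The discard rule is the sub-sampling formula from Mikolov et al., $1 - \sqrt{t / f(w)}$.

```python
from collections import Counter
import math

def preprocess(text, min_count=1):
    """Build the vocab, id maps, relative frequencies and the id sequence."""
    words = text.split()
    counts = Counter(words)
    # Sort by decreasing count (ties alphabetically) so id 0 is the most frequent word.
    vocab = sorted((w for w in counts if counts[w] >= min_count),
                   key=lambda w: (-counts[w], w))
    word2id = {w: i for i, w in enumerate(vocab)}
    id2word = {i: w for w, i in word2id.items()}
    total = sum(counts[w] for w in vocab)
    freqs = [counts[w] / total for w in vocab]
    word_ids = [word2id[w] for w in words if w in word2id]
    return vocab, word2id, id2word, freqs, word_ids

def discard_prob(freq, threshold=1e-5):
    """Probability of dropping a word with relative frequency freq."""
    return max(0.0, 1.0 - math.sqrt(threshold / freq))

vocab, word2id, id2word, freqs, word_ids = preprocess('the cat sat on the mat the end')
print(vocab)
print(word_ids)
```

Sorting the vocabulary by decreasing frequency is a design choice that also keeps the log-uniform sampler usable later.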
    
## Choose dataset
We use the cleaned Wikipedia datasets from Matt Mahoney's [website](http://mattmahoney.net/dc/textdata.html):
* [text8](http://mattmahoney.net/dc/text8.zip) is a small dataset (100 MB)
* [enwik9](http://mattmahoney.net/dc/enwik9.zip) is a bigger dataset (1 GB)

We use the script from Matt Mahoney's page to create the text9 data from enwik9.

First we load the modules for this notebook.

In [1]:
import numpy as np
import tensorflow as tf
from time import time

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

import sys
if '../common' not in sys.path:
    sys.path.insert(0, '../common')

Note that the text data is clean text (i.e. no punctuation and no new lines). Let's view the first 100 characters of our text input:

In [2]:
import getpass

text_file = '/home/%s/workplaces/tf_datas/nltk/text8' % getpass.getuser()
preprocess_file = '/home/%s/workplaces/tf_datas/nltk/text8.pkl' % getpass.getuser()
with open(text_file, 'r') as f:
    text = f.read()
    print (text[:100])

 anarchism originated as a term of abuse first used against early working class radicals including t


## Pre-processing data
We implement the pre-processing in **Word2VecInput**

In [40]:
from nlp.preprocess_input import Word2VecInput

ts = time()
w2v_input = Word2VecInput(text_file)
print ('Pre-processing took {:.2f} seconds'.format(time() - ts))

Pre-processing took 20.42 seconds


Since pre-processing takes quite a long time, we dump the pre-processed data into a pickle file, which includes vocabs, word2id, id2word, word frequencies and train_wordids.

In [41]:
# dump pre-processing data to file
w2v_input.dump(preprocess_file)

## Assemble a graph
We need to define inputs and targets. Recall that the Skip-gram model predicts the surrounding words given a center word, so the input is the center word and the targets are the surrounding words. Let's look at an example:

In [5]:
from IPython.display import IFrame
IFrame('./skipgram-demo/index.html', width=500, height=750)
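
As a plain-text counterpart to the demo, the sketch below generates (center, outside) pairs with a fixed symmetric window. The function name and the fixed window are assumptions; the batching code in `nlp.word2vec` is not shown here and may differ, for example by sampling the window size per position.

```python
def skipgram_pairs(tokens, window_size):
    """Yield (center, outside) pairs for a fixed symmetric window."""
    for t, center in enumerate(tokens):
        start = max(0, t - window_size)
        stop = min(len(tokens), t + window_size + 1)
        for j in range(start, stop):
            if j != t:                 # skip the center word itself
                yield center, tokens[j]

# Words are used instead of ids only to keep the output readable.
sentence = ['the', 'quick', 'brown', 'fox']
pairs = list(skipgram_pairs(sentence, window_size=1))
print(pairs)
```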

So the inputs/targets can be defined as `tf.placeholder`s of type `tf.int32` holding word ids (integers). The tricky parts are to 
* define the embedding layer
* define the sampling procedure
* define the loss function

### Embedding layer
For each word $w$, we have two embedding vectors $u_w$ and $v_w$ of dimension $D$, which we model as follows:
* $v_w$ is input embedding-weight
* $u_w$ is output softmax-weight

We can define embedding-weight and softmax-weight as [`tf.Variable`](https://www.tensorflow.org/api_docs/python/tf/Variable) with shape $[V,D]$. 

Note that the embedding-weight $v_w$ is often initialized from a random uniform distribution on [-1,1], while the softmax-weight $u_w$ is initialized from a truncated normal with $\sigma=\frac{1.0}{\sqrt{D}}$.

Note that since $V$ can be very large, we need an efficient way to look up $u_w, v_w$ for only the words in a mini-batch; this can be done via [`tf.nn.embedding_lookup`](https://www.tensorflow.org/api_docs/python/tf/nn/embedding_lookup).
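
The following NumPy sketch is a stand-in for the TensorFlow variables, not the notebook's graph code. It shows the two initialisations (uniform for the embedding-weight, truncated normal for the softmax-weight) and a row look-up by integer id, which is the operation an embedding look-up performs. The sizes are assumptions, and truncation is done here by resampling values that fall more than two standard deviations from the mean.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 1000, 200   # assumed vocabulary size and embedding dimension

# Embedding-weight: uniform in [-1, 1).
embedding = rng.uniform(-1.0, 1.0, size=(V, D))

# Softmax-weight: normal with sigma = 1/sqrt(D), resampled until within 2 sigma.
sigma = 1.0 / np.sqrt(D)
softmax_w = rng.normal(0.0, sigma, size=(V, D))
outside = np.abs(softmax_w) > 2 * sigma
while outside.any():
    softmax_w[outside] = rng.normal(0.0, sigma, size=int(outside.sum()))
    outside = np.abs(softmax_w) > 2 * sigma

# Look-up: integer indexing gathers only the rows needed for a mini-batch.
batch_ids = np.array([3, 17, 3, 999])
batch_vectors = embedding[batch_ids]
print(batch_vectors.shape)
```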

### Sampling procedure
The sampling method is tricky since $V$ can be very large. Fortunately, TensorFlow implements several [candidate-sampling](https://www.tensorflow.org/api_guides/python/nn#Candidate_Sampling) methods. Here we will use
* [tf.nn.fixed_unigram_candidate_sampler](https://www.tensorflow.org/api_docs/python/tf/nn/fixed_unigram_candidate_sampler): to sample from $P_n(w)$ as described above
* [tf.nn.log_uniform_candidate_sampler](https://www.tensorflow.org/api_docs/python/tf/nn/log_uniform_candidate_sampler): to sample from a log-uniform distribution; note that this should be used **only if our word ids are sorted by decreasing frequency**
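
The reason for the sorting requirement is visible in the distribution the log-uniform sampler is documented to use, $P(id) = \frac{\log(id+2) - \log(id+1)}{\log(range\_max+1)}$: the probability depends only on the id and decreases with it, so small ids must belong to frequent words. A small sketch (the function name and `range_max` value are assumptions):

```python
import math

def log_uniform_prob(word_id, range_max):
    """Probability of sampling word_id under the log-uniform (Zipfian) distribution."""
    return (math.log(word_id + 2) - math.log(word_id + 1)) / math.log(range_max + 1)

range_max = 1000   # assumed vocabulary size
probs = [log_uniform_prob(i, range_max) for i in range(range_max)]
print(probs[0], probs[9], probs[999])
```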

### Loss function
TensorFlow already implements the following losses (see [source](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/nn_impl.py) for the implementation):
* [tf.nn.sampled_softmax_loss](https://www.tensorflow.org/api_docs/python/tf/nn/sampled_softmax_loss): sampled softmax training loss
* [tf.nn.nce_loss](https://www.tensorflow.org/api_docs/python/tf/nn/nce_loss): sampled logistic training loss

Let's review the implementation of the two loss functions above:
* Sampled softmax computes
$$
-\log\left(\frac{\exp(u_{w_o}^Tv_{w_{i}})}{\exp(u_{w_o}^Tv_{w_{i}}) + \sum_{i=1}^k \exp(u_{j_i}^Tv_{w_{i}})} \right) 
$$
via [tf.nn.softmax_cross_entropy_with_logits](https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits)
* Sampled logistic computes the negative of the NEG objective, so that it can be minimized as a loss:
$$
-\log \sigma\left(u_{w_o}^Tv_{w_{i}}\right) - \sum_{i=1}^k \log \sigma\left(-u_{j_i}^Tv_{w_{i}}\right)
$$
via [tf.nn.sigmoid_cross_entropy_with_logits](https://www.tensorflow.org/api_docs/python/tf/nn/sigmoid_cross_entropy_with_logits)
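
Both quantities are easy to reproduce for a single example once the logits $u^Tv$ are known. The NumPy sketch below writes them as losses to minimize, so the logistic one is the negative of the NEG objective. The logit values and function names are assumptions, and the sketch omits the optional correction by the sampling probability that the `subtract_log_q` setting refers to.

```python
import numpy as np

def sampled_softmax_loss(true_logit, neg_logits):
    """-log softmax of the true class over the true class plus sampled negatives."""
    logits = np.concatenate(([true_logit], neg_logits))
    return float(np.logaddexp.reduce(logits) - true_logit)

def sampled_logistic_loss(true_logit, neg_logits):
    """Sigmoid cross-entropy with label 1 for the true class and 0 for negatives."""
    # -log sigma(x) = log(1 + exp(-x))  and  -log sigma(-x) = log(1 + exp(x))
    return float(np.logaddexp(0.0, -true_logit) + np.logaddexp(0.0, neg_logits).sum())

true_logit = 2.0
neg_logits = np.array([-1.0, 0.5, -3.0])
print(sampled_softmax_loss(true_logit, neg_logits))
print(sampled_logistic_loss(true_logit, neg_logits))
```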

We implement the above steps inside the **`Word2vecSampling`** object and add training to it.

## Training
Let's build a word2vec model using **`Word2vecSampling`**
and train it with the pre-processed data

In [59]:
import pickle
from nlp.word2vec import Word2vecSampling

# load the pre-processed data dumped earlier (closes the file when done)
with open(preprocess_file, 'rb') as f:
    vocabs, word2id, id2word, freqs, train_wordids = pickle.load(f)

w2v_model = Word2vecSampling(vocabs, word2id, id2word, freqs, train_wordids)

In [24]:
settings = {'embed_dim'       : 200,
            'nb_neg_sample'   : 100,
            'learning_rate'   : 0.01,
            'sampling_method' : 'fixed_unigram',
            'loss_func'       : 'nce',
            'subtract_log_q'  : True,
            'use_tf_loss'     : False}

w2v_model.build_graph(settings)

We create directories for checkpoints and logs to save the training progress

In [21]:
# If the checkpoints directory doesn't exist:
!mkdir checkpoints
!mkdir logs

mkdir: cannot create directory ‘checkpoints’: File exists


Let's train word2vec

In [None]:
epochs = 1
batch_size = 1024
window_size = 5
w2v_model.train(epochs, batch_size, window_size, max_iters = 200)

### Hyperparameter tuning
In this section we summarize some results from training word2vec. We have the following choices of sampling method and loss function:

| Sampling method | Loss function   | 
| :-------------: |:---------------:| 
| fixed unigram   | sampled softmax | 
| log_uniform     | nce             |

Note that by default `tf.nn.sampled_softmax_loss` and `tf.nn.nce_loss` use the `log_uniform` sampling method. We want to keep our tests close to the original paper, so we use `fixed_unigram` as the default.

We test with the following settings:


In [None]:
# hyper-parameter for testing
test_lr     = [0.02] #[0.1, 0.01, 0.001]
test_lf     = ['nce']
test_use_tf = [True]#[True, False]

settings = {'embed_dim'       : 200,
            'nb_neg_sample'   : 100,
            'learning_rate'   : 0.01,
            'sampling_method' : 'fixed_unigram',
            'loss_func'       : 'nce',
            'subtract_log_q'  : True,
            'use_tf_loss'     : False}

epochs      = 5
batch_size  = 1024
window_size = 10
max_iters   = None

for lr in test_lr:
    settings['learning_rate'] = lr
    for lf in test_lf:
        settings['loss_func'] = lf
        for use_tf in test_use_tf:
            settings['use_tf_loss'] = use_tf
            
            ## rebuild with new-setting
            w2v_model.build_graph(settings)
            
            ## train and logs
            w2v_model.train(epochs, batch_size, window_size, summary_path='nce_1', max_iters = max_iters)
            
            

Epoch (1/5) Batch (  100/4518 ) Iteration:      100 Avg. Training loss: 479.1447 0.1214 sec/batch
Epoch (1/5) Batch (  200/4518 ) Iteration:      200 Avg. Training loss: 413.2743 0.1208 sec/batch
Epoch (1/5) Batch (  300/4518 ) Iteration:      300 Avg. Training loss: 373.8611 0.1212 sec/batch
Epoch (1/5) Batch (  400/4518 ) Iteration:      400 Avg. Training loss: 323.8119 0.1213 sec/batch
Epoch (1/5) Batch (  500/4518 ) Iteration:      500 Avg. Training loss: 297.6261 0.1209 sec/batch
Epoch (1/5) Batch (  600/4518 ) Iteration:      600 Avg. Training loss: 272.2038 0.1210 sec/batch
Epoch (1/5) Batch (  700/4518 ) Iteration:      700 Avg. Training loss: 261.1461 0.1212 sec/batch
Epoch (1/5) Batch (  800/4518 ) Iteration:      800 Avg. Training loss: 247.5380 0.1209 sec/batch
Epoch (1/5) Batch (  900/4518 ) Iteration:      900 Avg. Training loss: 223.1556 0.1211 sec/batch
Epoch (1/5) Batch ( 1000/4518 ) Iteration:     1000 Avg. Training loss: 223.4641 0.1208 sec/batch
Epoch (1/5) Batch ( 

Epoch (2/5) Batch ( 4082/4518 ) Iteration:     8600 Avg. Training loss: 47.0055 0.1214 sec/batch
Epoch (2/5) Batch ( 4182/4518 ) Iteration:     8700 Avg. Training loss: 45.4049 0.1211 sec/batch
Epoch (2/5) Batch ( 4282/4518 ) Iteration:     8800 Avg. Training loss: 44.3155 0.1217 sec/batch
Epoch (2/5) Batch ( 4382/4518 ) Iteration:     8900 Avg. Training loss: 43.6555 0.1213 sec/batch
Epoch (2/5) Batch ( 4482/4518 ) Iteration:     9000 Avg. Training loss: 43.4577 0.1218 sec/batch
Epoch (3/5) Batch (   64/4518 ) Iteration:     9100 Avg. Training loss: 43.7408 0.0780 sec/batch
Epoch (3/5) Batch (  164/4518 ) Iteration:     9200 Avg. Training loss: 43.8538 0.1213 sec/batch
Epoch (3/5) Batch (  264/4518 ) Iteration:     9300 Avg. Training loss: 44.3655 0.1211 sec/batch
Epoch (3/5) Batch (  364/4518 ) Iteration:     9400 Avg. Training loss: 43.9025 0.1212 sec/batch
Epoch (3/5) Batch (  464/4518 ) Iteration:     9500 Avg. Training loss: 43.3287 0.1212 sec/batch
Epoch (3/5) Batch (  564/4518 

Looking at the results, we can see that a learning rate of 0.1 doesn't work and a learning rate of 0.001 is too low, so we set the learning rate to 0.02 and train for 5 epochs.

# Testing
It's not trivial to test the quality of word vectors. As noted in the introduction, we use a dense representation of words so that the similarity of words can be modeled by the distance between points in the embedding space. To evaluate the quality of word vectors, one can use the following intrinsic tasks:
* `Nearest neighbors`: find the closest words (in Euclidean distance or cosine similarity) to a given word 
* `Word analogy`: answer questions of the form `a` is to `b` as `c` is to `__`, for example:
<center>
good:better rough:__ (expect rougher)
</center>
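
Both intrinsic tasks reduce to a few lines of linear algebra. The sketch below uses hand-made 4-D vectors, an assumption for illustration and not trained embeddings. The names `nearby` and `analogy` echo the notebook's evaluation methods but are independent re-implementations whose details may differ.

```python
import numpy as np

# Hand-made vectors with axes (male, female, royal, fruit); illustration only.
words = ['man', 'woman', 'king', 'queen', 'apple']
E = np.array([[1.0, 0.0, 0.0, 0.0],    # man
              [0.0, 1.0, 0.0, 0.0],    # woman
              [1.0, 0.0, 1.0, 0.0],    # king
              [0.0, 1.0, 1.0, 0.0],    # queen
              [0.0, 0.0, 0.0, 1.0]])   # apple
word2id = {w: i for i, w in enumerate(words)}
E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit rows: dot = cosine

def nearby(word, k=3):
    """Return the k words with the highest cosine similarity to word."""
    sims = E_norm @ E_norm[word2id[word]]
    order = np.argsort(-sims)[:k]
    return [(words[i], float(sims[i])) for i in order]

def analogy(a, b, c):
    """a is to b as c is to ?: nearest word to b - a + c, excluding a, b and c."""
    target = E_norm[word2id[b]] - E_norm[word2id[a]] + E_norm[word2id[c]]
    sims = E_norm @ (target / np.linalg.norm(target))
    for w in (a, b, c):
        sims[word2id[w]] = -np.inf
    return words[int(np.argmax(sims))]

print(nearby('king'))
print(analogy('man', 'king', 'woman'))
```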

Alternatively, one can use extrinsic tasks:
* `Sentiment classification`: for example, classifying movie reviews

Let's load a checkpoint and try this out

In [68]:
# build eval model and load a check-point
w2v_eval = Word2vecSampling(vocabs, word2id, id2word, freqs, train_wordids)
w2v_eval.build_graph()
w2v_eval.build_eval_graph()
sess = w2v_eval.load_checkpoint('./checkpoints/sg_lr=(0.02,),lf=nce,sampling=fixed_unigram,use_tf=True-45000')

In [73]:
# test some analogy
w2v_eval.analogy(sess, 'man', 'king', 'woman')

w2v_eval.analogy(sess, 'good', 'better', 'rough')

# test nearest nearby word
w2v_eval.nearby(sess, 'london')

predict man-king as woman-?
answer: leader

predict big-bigger as smart-?
answer: poultry


Nearest neighbours of [london]
london               1.0000
other                0.5790
places               0.5779
a                    0.5779
meditations          0.5769
actually             0.5764
and                  0.5735
others               0.5699
but                  0.5634
different            0.5632


# Conclusion
As we can see, the evaluation doesn't work very well, since we train on a limited corpus and Skip-gram with negative sampling needs more data. In the next notebooks of this series, we look at GloVe and then at how to use pre-trained word vectors.