<a href="https://colab.research.google.com/github/DataScienceUB/DeepLearningMaster20192020/blob/master/8.%20Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Embeddings

Word games:

+ [Talk to books](https://books.google.com/talktobooks/)

+ [Semantris](https://research.google.com/semantris)

Deep Learning algorithms require the input to be represented as (sequences of) fixed-length feature vectors. 

+ Words in documents and other categorical features such as user/product ids in recommenders, names of places, visited URLs, etc. are usually represented by using a one-of-K scheme (**one-hot encoding**). 

+ Phrases are represented by bag-of-words or bag-of-ngrams features, losing the ordering of words and ignoring semantics. 

Are these good representations for deep learning?

Let's see how to represent **words**, but this line of reasoning can be extended to other items.

+ There are an estimated 13 million tokens for the English language. 

+ One possible strategy is to encode word tokens each into some vector that represents a point in some sort of *word space* that *represents* language semantics. 

+ The most intuitive reason is that perhaps there actually exists some $N$-dimensional space (such that $N \ll 13$ million) that is sufficient to encode all semantics of our language. 

+ Each dimension would encode some meaning that we transfer using speech.

### One-hot encoding

If we represent every word as an $\mathbb{R}^{|V|\times 1}$ vector with all $0$s and one $1$ at the index of that word in the sorted English vocabulary, word vectors in this type of encoding would appear as follows:

<centering>
$$w^{aardvark} = \left[ \begin{array}{c} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{array} \right], w^{a} = \left[ \begin{array}{c} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{array} \right] , w^{at} = \left[ \begin{array}{c} 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{array} \right] , \cdots,  w^{zebra} = \left[ \begin{array}{c} 0 \\ 0 \\ 0 \\ \vdots \\ 1 \end{array} \right] $$
</centering>


We represent each word as a completely independent entity:

$$(w^{hotel})^Tw^{motel} = (w^{hotel})^Tw^{cat} = 0$$
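
The orthogonality above can be checked numerically. This is a minimal sketch, assuming a toy three-word vocabulary; the word list and the helper `one_hot` are illustrative and not part of any library.

```python
import numpy as np

# Toy vocabulary; the words and their order are illustrative assumptions.
vocab = ["cat", "hotel", "motel"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the |V|-dimensional one-hot vector of `word`."""
    vec = np.zeros(len(vocab))
    vec[index[word]] = 1.0
    return vec

# Any two distinct words are orthogonal, however related their meanings are.
hotel_motel = one_hot("hotel") @ one_hot("motel")
hotel_cat = one_hot("hotel") @ one_hot("cat")
hotel_hotel = one_hot("hotel") @ one_hot("hotel")
print(hotel_motel, hotel_cat, hotel_hotel)
```

The dot product only tells us whether two words are identical, so this encoding carries no notion of similarity.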

What other alternatives are there?

### Semantics from word-document matrix

As our first attempt, we make the bold conjecture that words that are related will often appear in the same documents (or phrases, paragraphs, etc.). 

For instance, "banks", "bonds", "stocks", "money", etc. are likely to appear together. But "banks", "octopus", "banana", and "hockey" would probably not consistently appear together. 

We use this fact to build a word-document matrix, $X$ in the following manner: 

+ Loop over billions of documents and for each time word $i$ appears in document $j$, we add one to entry $X_{ij}$. 

This is obviously a very large matrix ($\mathbb{R}^{|V|\times M}$) and it scales with the number of documents ($M$). 

So perhaps we can try something better, such as building a window based co-occurrence matrix.

In this method we count the number of times each word appears inside a window of a particular size around the word of interest, which gives a sparse $|V|\times |V|$ matrix. We calculate this count for all the words in the corpus. 

Let our corpus contain just three sentences and the window size be 1:

+ ``I enjoy flying``.
+ ``I like NLP``.
+ ``I like deep learning``.

The resulting counts matrix will then be:

$$X=\left[ \begin{array}{ccccccccc}
        & I & like & enjoy & deep & learning & NLP & flying & . \\
     I & 0 & 2 & 1 & 0 & 0 & 0 & 0 & 0\\
    like & 2 & 0 & 0 & 1 & 0 & 1 & 0 & 0\\
    enjoy & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
    deep & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 0 \\
    learning & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1\\
    NLP & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1\\
    flying & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1\\
    . & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 0 \\
  \end{array} \right]$$
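
The counting procedure can be sketched in a few lines. This is an illustrative implementation under two assumptions: the final period is treated as a token of its own, and windows do not cross sentence boundaries. The function name `cooccurrence_matrix` is hypothetical.

```python
import numpy as np

# The three-sentence corpus, with the final period as a separate token.
corpus = ["I enjoy flying .", "I like NLP .", "I like deep learning ."]
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
index = {word: i for i, word in enumerate(vocab)}

def cooccurrence_matrix(sentences, window=1):
    """Count how often each word appears within `window` positions of another."""
    X = np.zeros((len(vocab), len(vocab)), dtype=int)
    for sentence in sentences:
        tokens = sentence.split()
        for pos, word in enumerate(tokens):
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for ctx in range(lo, hi):
                if ctx != pos:
                    X[index[word], index[tokens[ctx]]] += 1
    return X

X = cooccurrence_matrix(corpus)
print(X)
```

With `window=1` the result is symmetric, because every neighbouring pair is counted once from each side.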

Once the $|V|\times|V|$ co-occurrence matrix $X$ has been generated, we can apply SVD to $X$ to get $X = USV^T$ and select the first $k$ columns of $U$ to get $k$-dimensional word vectors. 

The ratio $\frac{\sum_{i = 1}^{k}\sigma_i}{\sum_{i = 1}^{|V|}\sigma_i}$ gives a rough indication of how much of the structure of $X$ is captured by the first $k$ dimensions (the share of variance proper is computed from the squared singular values $\sigma_i^2$).
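
As a hedged illustration, the SVD step can be applied to the small co-occurrence matrix of the three-sentence corpus with `np.linalg.svd`. The choice $k = 2$ is arbitrary and only meant to show the mechanics.

```python
import numpy as np

# Co-occurrence matrix of the three-sentence corpus (window size 1).
# Row/column order: I, like, enjoy, deep, learning, NLP, flying, "."
X = np.array([
    [0, 2, 1, 0, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
], dtype=float)

# np.linalg.svd returns U, the singular values (in descending order) and V^T.
U, S, Vt = np.linalg.svd(X)

k = 2                              # illustrative embedding size
word_vectors = U[:, :k]            # one k-dimensional row per word
ratio = S[:k].sum() / S.sum()      # share of the singular values kept
print(word_vectors.shape, ratio)
```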

These vectors encode some kind of semantics but they have some problems:

+ The dimensions of the matrix can change very often (new words are added very frequently and corpus changes in size).
+ SVD based methods do not scale well for big matrices and it is hard to incorporate new words or documents. 
+ The matrix is extremely sparse since most words do not co-occur.
+ The matrix is very high dimensional in general ($\approx 10^6 \times 10^6$)
+ Quadratic cost to train (i.e. to perform SVD)
+ Requires the incorporation of some hacks on $X$ to account for the drastic imbalance in word frequency

Some solutions exist to resolve some of the issues discussed above:
+ Ignore function words such as "the", "he", "has", etc.
+ Apply a ramp window -- i.e. weight the co-occurrence count based on distance between the words in the document. 
+ Use Pearson correlation and set negative counts to 0 instead of using just raw count.

But a neural network method can solve many of these issues in a far more elegant manner...



## ``word2vec``

Instead of computing and storing global information about some huge dataset (which might be billions of sentences), we can try to create a model that will be able to learn one iteration at a time and eventually be able **to encode the probability of a word given its context** (or, alternatively, the probability of the context given a word). 

> The **context of a word** is the set of $m$ surrounding words. For instance, the $m = 2$ context of the word ``fox`` in the sentence ``The quick brown fox jumped over the lazy dog`` is \{``quick``, ``brown``, ``jumped``, ``over``\}.
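
The definition of context translates directly into list slicing. The helper `context` below is a hypothetical name used only for this sketch; near the edges of a sentence it simply returns fewer than $2m$ words.

```python
def context(tokens, position, m):
    """Return up to m tokens on each side of tokens[position]."""
    left = tokens[max(0, position - m):position]
    right = tokens[position + 1:position + 1 + m]
    return left + right

sentence = "The quick brown fox jumped over the lazy dog".split()
fox_context = context(sentence, sentence.index("fox"), m=2)
print(fox_context)  # ['quick', 'brown', 'jumped', 'over']
```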

The idea is to design a model whose parameters are the word vectors. Then, train the model on a certain objective related to representing the probability model. 

At every iteration we run our model, evaluate the errors, and follow an update rule that has some notion of penalizing the model parameters that caused the error. Thus, we learn our word vectors. 

Mikolov presented a simple, probabilistic model in 2013 that is known as ``word2vec``. In fact, ``word2vec`` includes 2 algorithms (**CBOW** and **skip-gram**) and 2 training methods (negative sampling and hierarchical softmax).

> This model relies on a very important hypothesis in linguistics, *distributional semantics*. The basic idea of distributional semantics can be summed up in the so-called distributional hypothesis: linguistic items with similar distributions have similar meanings.

> The distributional hypothesis can be applied to other data than words: items in shopping baskets, neural activations in wet neural networks, etc.

First, we need to create such a model that will assign a probability to a sequence of tokens. Let us start with an example: ``The cat jumped over the puddle``. 

A **good language model will give this sentence a high probability because this is a completely valid sentence**, syntactically and semantically. Similarly, the sentence ``stock boil fish is toy`` should have a very low probability because it makes no sense. 

Mathematically, we can call this probability on any given sequence of $n$ words:

$$P(w_{1}, w_{2}, \cdots, w_{n})$$

### Language Models

We know that 

$$P(w_{1}, w_{2}, \cdots, w_{n}) = P(w_{1}) P(w_{2} | w_{1}) \dots P(w_{n} | w_{1}, w_{2}, \cdots, w_{n-1})$$

but we also know that we cannot compute these terms from a corpus by **counting**, since most long word histories never occur in any corpus. All we can do is approximate them.

We can take the **unigram language model** approach and break apart this probability by assuming the word occurrences are completely independent:

$$P(w_{1}, w_{2}, \cdots, w_{n}) \approx \prod_{i=1}^n P(w_{i})$$


However, we know the next word is highly contingent upon the previous sequence of words. So perhaps we let the probability of the sequence depend on the pairwise probability of a word in the sequence and the word next to it. We call this the **bigram model** and represent it as:

$$P(w_{1}, w_{2}, \cdots, w_{n}) \approx P(w_{1}) \prod_{i=2}^n P(w_{i} | w_{i-1})$$

Again this is certainly a bit naive since we are only concerning ourselves with pairs of neighboring words rather than evaluating a whole sentence, but as we will see, this representation gets us pretty far along. Note in the Word-Word Matrix with a context of size 1, we basically can learn these pairwise probabilities. But again, this would require computing and storing global information about a massive dataset.
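
To make the counting approach concrete, here is a minimal sketch that estimates unigram and bigram probabilities from a tiny made-up token list. The corpus and the function names are assumptions for illustration; the sentence probability starts from the unigram probability of the first word and then multiplies the pairwise terms.

```python
from collections import Counter

# Tiny illustrative corpus (an assumption, not real data).
tokens = "the cat sat on the mat the cat ran".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p_unigram(word):
    """Relative frequency of `word` in the corpus."""
    return unigram_counts[word] / len(tokens)

def p_bigram(word, previous):
    """Maximum-likelihood estimate of P(word | previous)."""
    return bigram_counts[(previous, word)] / unigram_counts[previous]

def sentence_probability(words):
    """Bigram approximation: P(w_1) times the product of P(w_i | w_{i-1})."""
    prob = p_unigram(words[0])
    for previous, word in zip(words, words[1:]):
        prob *= p_bigram(word, previous)
    return prob

print(p_bigram("cat", "the"))
print(sentence_probability(["the", "cat", "sat"]))
```

Any word pair that never appears in the corpus gets probability zero, which is one reason pure counting does not scale to real language.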

### Skip-gram model

The skip-gram approach is to create a model such that given the center word ``jumped``, the model will be able to predict or generate the surrounding words ``The``, ``cat``, ``over``, ``the``, ``puddle``. 

How can we learn this model? Well, we need to create an **objective function**. 

Let's suppose we have a text composed of $T$ words. In this case, for each position $t = 1, \dots , T$, our task is to predict the surrounding words within a window of fixed size $m$, given the center word $w_t$:

$$
L = \prod^T_{t=1} \prod_{-m \leq j \leq m ; j \neq 0} P(w_{t+j} | w_t)
$$

The objective function is the average negative log likelihood:

$$
J =  - \frac{1}{T} \sum^T_{t=1} \sum_{-m \leq j \leq m ; j \neq 0} \log P(w_{t+j} | w_t)
$$

How to calculate $P(w_{t+j} | w_t)$?

**We will use a model that uses two vectors per word**. 

We create two matrices, $\mathcal{V} \in \mathbb{R}^{n\times|V|}$ and $\mathcal{U} \in \mathbb{R}^{|V|\times n}$, where $n$ is an arbitrary size which defines the size of our embedding space. 

$\mathcal{V}$ is the input word matrix such that the $i$-th column of $\mathcal{V}$ is the $n$-dimensional **embedded vector** for word $w_{i}$ when it is an input to this model. We denote this $n\times1$ vector as $v_{i}$. 

Similarly, $\mathcal{U}$ is the output word matrix. The $j$-th row of $\mathcal{U}$ is an $n$-dimensional embedded vector for word $w_{j}$ when it is an output of the model. We denote this row of $\mathcal{U}$ as $u_{j}$. 

Then, for a center word $c$ and a context word $o$, we assume the following model:

$$
P(o | c) = \frac{\exp(u_o^T v_c)}{\sum_{w \in V} \exp(u_w^T v_c)}
$$

where $u_i = \mathcal{U}^T w_i$, $v_i = \mathcal{V} w_i$, and $w_i$ is the one hot encoding of word $i$. Thus, our objective function is:

$$
J = - \frac{1}{T} \sum^T_{c=1} \sum_{-m \leq j \leq m ; j \neq 0} \log \frac{\exp(u_{c+j}^T v_c)}{\sum_{w \in V} \exp(u_w^T v_c)} = \frac{1}{T} \sum^T_{c=1} \left(\sum_{-m \leq j \leq m ; j \neq 0} - (u_{c+j}^T v_c) + 2m \log \sum_{w \in V} \exp(u_w^T v_c) \right)
$$

It is important to note that the second term involves a large number of vector products!

The model works in 5 steps:

+ We generate our one hot vector for the input (center) word, $w_c$.
+ Then, we get its corresponding embedded word vector $v_c = \mathcal{V} w_c$.
+ We generate a score vector $z = \mathcal{U} v_c$, whose entry for word $o$ is $u_o^T v_c$, where $u_o = \mathcal{U}^T w_o$.
+ We turn the scores into probabilities with a softmax (which involves a large number of vector products). 
+ Our objective is to make this probability vector match the one hot vectors of the $2m$ actual context words.
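
The steps above can be sketched with NumPy. This is a toy forward pass under stated assumptions: the sizes, the random initialisation and the function names are illustrative, and no gradient update is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n = 6, 4                      # toy sizes (assumptions)
V = rng.normal(size=(n, vocab_size))      # input matrix: one column per word
U = rng.normal(size=(vocab_size, n))      # output matrix: one row per word

def softmax(z):
    z = z - z.max()                       # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def skipgram_forward(center, context_ids):
    """Return P(. | center) and the negative log likelihood of the context words."""
    w_c = np.zeros(vocab_size)
    w_c[center] = 1.0                     # one hot vector of the center word
    v_c = V @ w_c                         # same as the column V[:, center]
    scores = U @ v_c                      # u_w^T v_c for every word w
    probs = softmax(scores)
    loss = -np.log(probs[context_ids]).sum()
    return probs, loss

probs, loss = skipgram_forward(center=2, context_ids=[0, 1, 3, 4])
print(probs, loss)
```

Note that `U @ v_c` touches every row of $\mathcal{U}$, which is the expensive part for a realistic vocabulary.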

This is the graphical representation of our model ($ W $ encodes $ \mathcal{V}$ and $ W'$ encodes $ \mathcal{U}$). The left part involves only one matrix-vector multiplication (in practice, a column lookup), but the right one is much heavier!

<center>
<img src="https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/fword2vec-sg.png?raw=1" alt="" style="width: 400px;"/> 
</center>

The computational complexity of this algorithm, computed in a straightforward fashion, is proportional to the size of our vocabulary, $O(|V|)$, for each training pair. This is because of the term $\sum_{w \in V} \exp(u_w^T v_c)$. This denominator computes the similarity between every possible context vector $u_w$ and the center word vector $v_c$. 

### Negative sampling

The loss function $ J $ is expensive to compute because of the softmax normalization, where we sum over all $ |V| $ scores! A simple idea is to approximate it instead.

While negative sampling is based on the Skip-Gram model, it is in fact optimizing a different objective. 

Consider a pair $(w, c)$ of word and context. Did this pair come from the training data? Let's denote by $P(D = 1|w, c)$ the probability that $(w, c)$ came from the corpus data. Correspondingly, $P(D = 0|w, c)$ will be the probability that $(w, c)$ did not come from the corpus data. 

First, let's model $P(D = 1|w, c)$ with the sigmoid function:

$$ P(D = 1|w, c, \theta) = \sigma (u_w^T v_c) = \frac{1}{1+ e^{-u_w^T v_c}}$$

Now, we build a new objective function that tries to maximize the probability of a word and context being in the corpus data if it indeed is, and to maximize the probability of a word and context not being in the corpus data if it indeed is not. We take a simple maximum likelihood approach to these two probabilities. (Here we take $\theta$ to be the parameters of the model, which in our case are $\mathcal{V}$ and $\mathcal{U}$.)
\begin{align*}
\theta &= \mbox{argmax}_{\theta} \prod_{(w,c) \in D} P(D = 1|w, c, \theta) \prod_{(w,c) \in \tilde{D}} P(D = 0|w, c, \theta) \\
&= \mbox{argmax}_{\theta} \prod_{(w,c) \in D} P(D = 1|w, c, \theta) \prod_{(w,c) \in \tilde{D}} (1 - P(D = 1|w, c, \theta))\\
&= \mbox{argmax}_{\theta} \sum_{(w,c) \in D} \log P(D = 1|w, c, \theta) + \sum_{(w,c) \in \tilde{D}} \log(1 - P(D = 1|w, c, \theta))\\
&= \mbox{argmax}_{\theta} \sum_{(w,c) \in D} \log \frac{1}{1 + \exp(-u_w^Tv_c)} + \sum_{(w,c) \in \tilde{D}} \log(1 - \frac{1}{1 + \exp(-u_w^Tv_c)} )\\
&= \mbox{argmax}_{\theta} \sum_{(w,c) \in D} \log \frac{1}{1 + \exp(-u_w^Tv_c)} + \sum_{(w,c) \in \tilde{D}} \log(\frac{1}{1 + \exp(u_w^Tv_c)} )\\
\end{align*}
Note that maximizing the likelihood is the same as minimizing the negative log likelihood

$$
J = - \sum_{(w,c) \in D} \log \frac{1}{1 + \exp(-u_w^Tv_c)} - \sum_{(w,c) \in \tilde{D}} \log(\frac{1}{1 + \exp(u_w^Tv_c)} )
$$

Note that $\tilde{D}$ is a "false" or "negative" corpus.
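
The loss $J$ above can be written down almost literally. The following is a sketch, assuming that rows of `U` hold the vectors $u_w$, that columns of `V` hold the vectors $v_c$, and that the positive and negative pairs are given as lists of index pairs; all names are illustrative. It uses the identity $\frac{1}{1 + \exp(x)} = \sigma(-x)$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(U, V, positive_pairs, negative_pairs):
    """J for lists of (w, c) index pairs; U[w] is u_w and V[:, c] is v_c."""
    loss = 0.0
    for w, c in positive_pairs:            # pairs observed in the corpus
        loss -= np.log(sigmoid(U[w] @ V[:, c]))
    for w, c in negative_pairs:            # pairs from the "negative" corpus
        loss -= np.log(sigmoid(-(U[w] @ V[:, c])))
    return loss

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 3))                # toy sizes (assumptions)
V = rng.normal(size=(3, 5))
J = negative_sampling_loss(U, V, [(0, 1)], [(2, 1), (4, 1)])
print(J)
```

Each term needs a single dot product, so the cost depends on the number of sampled pairs and not on $|V|$.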

For skip-gram, our new objective function for observing the context word $ c-m+j$ given the center word $ c $ would be


$$ - \log \sigma (u_{c-m+j}^{T}\cdot v_{c}) -\sum_{k = 1}^K \log \sigma (- \tilde{u}_{k}^{T}\cdot v_{c}) $$


In the above formulation, $\{\tilde{u}_{k} | k = 1\dots K\}$ are the output vectors of $K$ words sampled from a noise distribution $P_n(w)$. What should $P_n(w)$ be? While there is much discussion of what makes the best approximation, what seems to work best is the unigram distribution raised to the power of 3/4 (and then renormalized). Why 3/4? Here's an example that might help gain some intuition:

+ ``is``: $0.9^{3/4} = 0.92$
+ ``Constitution``: $0.09^{3/4} = 0.16$
+ ``bombastic``: $0.01^{3/4} = 0.032$

``Bombastic`` is now 3x more likely to be sampled while ``is`` only went up marginally.
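
The effect of the 3/4 power can be reproduced numerically. This sketch reuses the three illustrative frequencies from the example and, as an additional step, renormalises the result so that it can be used as a sampling distribution with NumPy's `Generator.choice`.

```python
import numpy as np

words = ["is", "Constitution", "bombastic"]
unigram = np.array([0.9, 0.09, 0.01])      # illustrative frequencies from the text

smoothed = unigram ** 0.75                 # raise every frequency to the power 3/4
noise = smoothed / smoothed.sum()          # renormalise into a distribution P_n(w)

rng = np.random.default_rng(0)
negatives = rng.choice(len(words), size=5, p=noise)   # K = 5 negative samples
print(smoothed.round(3), noise.round(3))
print([words[i] for i in negatives])
```

After renormalising, the rare word gains probability mass and the frequent word loses some, which is the intended flattening.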


### Continuous Bag of Words Model (CBOW)

The CBOW approach is to treat {``The``, ``cat``, ``over``, ``the``, ``puddle``} as a context and from these words, be able to predict or generate the center word ``jumped``:

+ We generate our one hot vectors for the input context ($w_j, \forall j$).
+ We get our word vectors for the input context ($v_j = \mathcal{V} w_j, \forall j$).
+ Average these vectors to get a single vector ($\hat v = \frac{1}{2m} \sum_j v_j$).
+ Get the score vector ($z = \mathcal{U} \hat v$).
+ Turn the score into probabilities.
+ Our objective is to match this probability vector to the one hot vector of the actual word.
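
The CBOW steps can be sketched in the same way as skip-gram. Sizes, initialisation and names are assumptions for illustration, and only the forward pass for one training position is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n = 6, 4                      # toy sizes (assumptions)
V = rng.normal(size=(n, vocab_size))      # input matrix: one column per word
U = rng.normal(size=(vocab_size, n))      # output matrix: one row per word

def cbow_forward(context_ids, center):
    """Return the averaged context vector, P(. | context) and the loss."""
    v_hat = V[:, context_ids].mean(axis=1)    # average of the context vectors
    z = U @ v_hat                             # score for every word
    e = np.exp(z - z.max())                   # stable softmax
    probs = e / e.sum()
    loss = -np.log(probs[center])             # negative log likelihood of the center word
    return v_hat, probs, loss

v_hat, probs, loss = cbow_forward(context_ids=[0, 1, 3, 4], center=2)
print(probs, loss)
```

The loss for one position equals $-u_c^T \hat v + \log \sum_{w \in V} \exp(u_w^T \hat v)$, the term that is averaged over positions in the objective.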

Our objective function is:

$$
J =  \frac{1}{T} \sum^T_{c=1} \left(- (u_{c}^T \hat v) +  \log \sum_{w \in V} \exp(u_{w}^T \hat v) \right)
$$

and the graphical representation of the model:

<center>
<img src="https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/word2vec-cbow.png?raw=1" alt="" style="width: 400px;"/> 
</center>

## A ``word2vec`` implementation in Keras

A word embedding layer is usually regarded as a mapping from a discrete set of objects (words) to a real valued vector, i.e. 

$$k\in\{1..|V|\} \rightarrow \mathbb{R}^{n}$$

Thus, we can represent the *Embedding layer* as $|V|\times n$ matrix, or just a table/dictionary.

$$
\begin{matrix}
word_1: \\
word_2:\\
\vdots\\
word_{|V|}: \\
\end{matrix}
\left[
\begin{matrix}
x_{1,1}&x_{1,2}& \dots &x_{1,n}\\
x_{2,1}&x_{2,2}& \dots &x_{2,n}\\
\vdots&&\\
x_{{|V|},1}&x_{{|V|},2}& \dots &x_{{|V|},n}\\
\end{matrix}
\right]
$$

In this sense, the basic operation that an embedding layer has to accomplish is that given a certain word it returns the assigned code. And the goal in learning is to learn the values in the matrix.
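
A short NumPy sketch shows why an embedding layer is just a table lookup: indexing a row of the matrix gives the same result as multiplying the matrix by a one hot vector, without the cost of the product. The matrix values here are arbitrary placeholders.

```python
import numpy as np

vocab_size, n = 5, 3
# Placeholder embedding matrix: row k holds the code of word k.
E = np.arange(vocab_size * n, dtype=float).reshape(vocab_size, n)

word_id = 2
lookup = E[word_id]                        # table lookup

one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0
product = one_hot @ E                      # equivalent one hot multiplication

print(lookup, product)
```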


To **train** our model using negative sampling and the skip-gram method, we need to create data samples for both valid context words and for negative samples. 

This involves scanning through the data set and picking target words, then randomly selecting context words from within the window of words around the target word (i.e. if the target word is “on” from “the cat sat on the mat”, with a window size of 2 the words “cat”, “sat”, “the”, “mat” could all be randomly selected as valid context words).  

It also involves randomly selecting negative samples outside of the selected target word context. 

Finally, we also need to set a label of 1 or 0, depending on whether the supplied context word is a true context word or a negative sample.  
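
To make the procedure concrete, here is a minimal pure-Python sketch of the sampling just described. It is not the Keras implementation: the function name, the one-negative-per-positive ratio and the word ids are assumptions chosen for illustration.

```python
import random

def make_skipgram_pairs(ids, vocab_size, window, seed=0):
    """Return (target, context) pairs with label 1 and sampled negatives with label 0."""
    rng = random.Random(seed)
    pairs, labels = [], []
    for pos, target in enumerate(ids):
        lo = max(0, pos - window)
        hi = min(len(ids), pos + window + 1)
        in_window = {ids[j] for j in range(lo, hi)}   # includes the target itself
        for j in range(lo, hi):
            if j == pos:
                continue
            pairs.append((target, ids[j]))
            labels.append(1)
            # One negative per positive: resample until the word lies outside
            # the window. This assumes the vocabulary is larger than the window.
            negative = rng.randrange(vocab_size)
            while negative in in_window:
                negative = rng.randrange(vocab_size)
            pairs.append((target, negative))
            labels.append(0)
    return pairs, labels

# "the cat sat on the mat" with hypothetical ids; ids 5..9 do not occur in the text.
ids = [0, 1, 2, 3, 0, 4]
pairs, labels = make_skipgram_pairs(ids, vocab_size=10, window=2)
print(pairs[:6], labels[:6])
```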

Thankfully, Keras has a function (``skipgrams``) which does all that for us.

In [1]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Reshape, dot
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import skipgrams
from tensorflow.keras.preprocessing import sequence

import urllib.request
import collections
import os
import zipfile

import numpy as np
import tensorflow as tf

def maybe_download(filename, url, expected_bytes):
    """Download a file if not present, and make sure it's the right size."""
    if not os.path.exists(filename):
        filename, _ = urllib.request.urlretrieve(url + filename, filename)
    statinfo = os.stat(filename)
    if statinfo.st_size == expected_bytes:
        print('Found and verified', filename)
    else:
        print(statinfo.st_size)
        raise Exception(
            'Failed to verify ' + filename + '. Can you get to it with a browser?')
    return filename


# Read the data into a list of strings.
def read_data(filename):
    """Extract the first file enclosed in a zip file as a list of words."""
    with zipfile.ZipFile(filename) as f:
        data = tf.compat.as_str(f.read(f.namelist()[0])).split()
    return data


def build_dataset(words, n_words):
    """Process raw inputs into a dataset."""
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(n_words - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reversed_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reversed_dictionary

def collect_data(vocabulary_size=10000):
    url = 'http://mattmahoney.net/dc/'
    filename = maybe_download('text8.zip', url, 31344016)
    vocabulary = read_data(filename)
    print('First words of the dataset:',vocabulary[:7])
    data, count, dictionary, reverse_dictionary = build_dataset(vocabulary,
                                                                vocabulary_size)
    del vocabulary  # Hint to reduce memory.
    return data, count, dictionary, reverse_dictionary

vocab_size = 10000
data, count, dictionary, reverse_dictionary = collect_data(vocabulary_size=vocab_size)
print('First words representation:',data[:7])


Found and verified text8.zip
First words of the dataset: ['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse']
First words representation: [5234, 3081, 12, 6, 195, 2, 3134]


In [0]:
window_size = 3
vector_dim = 300
epochs = 200000     # number of single-sample training iterations, not passes over the data

valid_size = 16     # Random set of words to evaluate similarity on.
valid_window = 100  
valid_examples = np.random.choice(valid_window, valid_size, replace=False)

# Generates a word rank-based probabilistic sampling table.
sampling_table = sequence.make_sampling_table(vocab_size)

# Generates skipgram word pairs.
# This function transforms a sequence of word indexes 
# (list of integers) into tuples of words of the form:
# (word, word in the same window), with label 1 (positive samples).
# (word, random word from the vocabulary), with label 0 (negative samples).
couples, labels = skipgrams(data, vocab_size, window_size=window_size, sampling_table=sampling_table)

word_target, word_context = zip(*couples)
word_target = np.array(word_target, dtype="int32")
word_context = np.array(word_context, dtype="int32")

print(couples[:10], labels[:10])

[[1207, 139], [7, 267], [876, 4029], [1991, 7486], [406, 418], [4071, 8], [9821, 1130], [3187, 2], [780, 5], [470, 6]] [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]


In [0]:
# create some input variables
input_target = Input((1,))
input_context = Input((1,))

embedding = Embedding(vocab_size, vector_dim, input_length=1, name='embedding')
target = embedding(input_target)
target = Reshape((vector_dim, 1))(target)

context = embedding(input_context)
context = Reshape((vector_dim, 1))(context)

# set up a cosine similarity operation (normalized dot product over the
# embedding axis) which will be output in a secondary model
similarity = dot([target, context], normalize=True, axes=1)

# now perform the dot product operation to get a similarity measure
dot_product = dot([target, context], axes=1)
dot_product = Reshape((1,))(dot_product)

# add the sigmoid output layer
output = Dense(1, activation='sigmoid')(dot_product)

# create the primary training model
model = Model(inputs=[input_target, input_context], outputs=output)
model.compile(loss='binary_crossentropy', optimizer='rmsprop')

# create a secondary validation model to run our similarity checks during training
validation_model = Model(inputs=[input_target, input_context], outputs=similarity)

class SimilarityCallback:
    def run_sim(self):
        for i in range(valid_size):
            valid_word = reverse_dictionary[valid_examples[i]]
            top_k = 8  # number of nearest neighbors
            sim = self._get_sim(valid_examples[i])
            nearest = (-sim).argsort()[1:top_k + 1]
            log_str = 'Nearest to %s:' % valid_word
            for k in range(top_k):
                close_word = reverse_dictionary[nearest[k]]
                log_str = '%s %s,' % (log_str, close_word)
            print(log_str)

    @staticmethod
    def _get_sim(valid_word_idx):
        sim = np.zeros((vocab_size,))
        in_arr1 = np.zeros((1,))
        in_arr2 = np.zeros((1,))
        in_arr1[0,] = valid_word_idx
        for i in range(vocab_size):
            in_arr2[0,] = i
            out = validation_model.predict_on_batch([in_arr1, in_arr2])
            sim[i] = float(np.squeeze(out))  # predict_on_batch returns an array with one element
        return sim
      
sim_cb = SimilarityCallback()

arr_1 = np.zeros((1,))
arr_2 = np.zeros((1,))
arr_3 = np.zeros((1,))
for cnt in range(epochs):
    idx = np.random.randint(0, len(labels))  # upper bound is exclusive
    arr_1[0,] = word_target[idx]
    arr_2[0,] = word_context[idx]
    arr_3[0,] = labels[idx]
    loss = model.train_on_batch([arr_1, arr_2], arr_3)
    if cnt % 100 == 0:
        print("Iteration {}, loss={}".format(cnt, loss))
    if cnt % 10000 == 0:
        sim_cb.run_sim()

Iteration 0, loss=0.6924847364425659
Nearest to the: exotic, accidental, website, gettysburg, outbreak, emotions, monkey, strait,
Nearest to has: bias, yale, joke, surveillance, occasions, acknowledge, exponential, tips,
Nearest to three: court, batman, sixty, failure, leadership, accused, montgomery, riemann,
Nearest to if: chapters, sky, jargon, nominee, residence, singers, baseball, bullet,
Nearest to their: came, queen, cuisine, jacob, sixth, believers, infant, sean,
Nearest to with: naming, practically, metal, avoided, guild, storyline, fingers, her,
Nearest to d: explicit, register, guiana, scale, zeus, ptolemy, honduras, census,
Nearest to are: outstanding, bright, dog, luthor, anime, because, layers, exhibits,
Nearest to will: peter, co, moderate, living, stake, morphology, journalism, walking,
Nearest to during: marvel, interesting, gdp, agave, do, centred, mountains, authority,
Nearest to states: influenced, mrna, variant, alone

## ``par2vec``

What about a vector representation for phrases/paragraphs/documents?

The ``par2vec`` approach for learning paragraph vectors is inspired by the methods for learning the word vectors. The inspiration is that the word vectors are asked to contribute to a prediction task about the next word in the sentence.

We will consider a *paragraph* vector. The paragraph vectors are also
asked to contribute to the prediction task of the next word
given many contexts sampled from the paragraph.

In the ``par2vec`` framework, every paragraph is mapped to a unique vector, represented by a column in matrix D, and every word is also mapped to a unique vector, represented by a column in matrix W. The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context.

<center>
<img src="https://github.com/DataScienceUB/DeepLearningMaster2019/blob/master/images/par2vec.png?raw=1" alt="" style="width: 500px;"/> 
(Source: https://cs.stanford.edu/~quocle/paragraph_vector.pdf)
</center>

The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the
current context – or the topic of the paragraph. For this reason, we often call this model the Distributed Memory
Model of Paragraph Vectors (PV-DM).

The contexts are fixed-length and sampled from a sliding window over the paragraph. 

The paragraph vector is shared across all contexts generated from the same paragraph but not across paragraphs. 

The word vector matrix W, however, is shared across paragraphs. 

At prediction time, one needs to perform an inference step to compute the paragraph vector for a new paragraph. This
is also obtained by gradient descent. In this step, the parameters for the rest of the model, the word vectors and
the softmax weights, are fixed.
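
The inference step can be illustrated with a small NumPy sketch. It assumes the averaging variant of PV-DM, random stand-ins for the already trained word vectors `W` and softmax weights `U`, and plain gradient descent; the sizes, learning rate and function names are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n = 8, 4                      # toy sizes (assumptions)
W = rng.normal(size=(n, vocab_size))      # frozen word vectors, one column per word
U = rng.normal(size=(vocab_size, n))      # frozen softmax weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def infer_paragraph_vector(contexts, targets, steps=200, lr=0.05):
    """Learn the vector d of a new paragraph while W and U stay fixed."""
    d = np.zeros(n)
    losses = []
    for _ in range(steps):
        grad = np.zeros(n)
        loss = 0.0
        for ctx, target in zip(contexts, targets):
            k = len(ctx) + 1
            h = (d + W[:, ctx].sum(axis=1)) / k    # average paragraph and word vectors
            p = softmax(U @ h)                     # predicted next-word distribution
            loss -= np.log(p[target])
            err = p.copy()
            err[target] -= 1.0                     # gradient of the loss w.r.t. the scores
            grad += U.T @ err / k                  # chain rule back to d
        d -= lr * grad                             # only d is updated
        losses.append(loss)
    return d, losses

# Sliding-window contexts (word ids) and the word that follows each one.
contexts = [[0, 1], [1, 2], [2, 3]]
targets = [2, 3, 4]
d, losses = infer_paragraph_vector(contexts, targets)
print(losses[0], losses[-1])
```

Because only `d` changes, the problem is a small convex optimisation per paragraph, which is why inference is cheap compared with training.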

### Using GloVe pre-trained word embeddings for classification. 1-D Convolutions.

GloVe (https://nlp.stanford.edu/pubs/glove.pdf) consists of a weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of corpus statistics. The model produces a word vector space with meaningful sub-structure. In the original paper, it achieved state-of-the-art performance on the word analogy task and outperformed the other methods it was compared with on several word similarity tasks.





In [0]:
'''This script loads pre-trained word embeddings (GloVe embeddings)
into a frozen Keras Embedding layer, and uses it to
train a text classification model on the 20 Newsgroup dataset
(classification of newsgroup messages into 20 different categories).
GloVe embedding data can be found at:
http://nlp.stanford.edu/data/glove.6B.zip (822MB)
(source page: http://nlp.stanford.edu/projects/glove/)
20 Newsgroup data can be found at:
http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html
'''

from __future__ import print_function

import os
import sys
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, Flatten
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.models import Model

def to_categorical(y, num_classes=None):
    """Converts a class vector (integers) to binary class matrix.
    E.g. for use with categorical_crossentropy.
    # Arguments
        y: class vector to be converted into a matrix
            (integers from 0 to num_classes).
        num_classes: total number of classes.
    # Returns
        A binary matrix representation of the input.
    """
    y = np.array(y, dtype='int').ravel()
    if not num_classes:
        num_classes = np.max(y) + 1
    n = y.shape[0]
    categorical = np.zeros((n, num_classes))
    categorical[np.arange(n), y] = 1
    return categorical

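BASE_DIR = '.'  # assumption: glove.6B/ and 20_newsgroup/ are unpacked in the working directory
GLOVE_DIR = os.path.join(BASE_DIR, 'glove.6B')
TEXT_DATA_DIR = os.path.join(BASE_DIR, '20_newsgroup')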
MAX_SEQUENCE_LENGTH = 1000
MAX_NB_WORDS = 20000
EMBEDDING_DIM = 100
VALIDATION_SPLIT = 0.2

# first, build index mapping words in the embeddings set
# to their embedding vector

print('Indexing word vectors.')

embeddings_index = {}
# each line of the GloVe file holds a word followed by its coefficients
with open(os.path.join(GLOVE_DIR, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

# second, prepare text samples and their labels
print('Processing text dataset')

texts = []  # list of text samples
labels_index = {}  # dictionary mapping label name to numeric id
labels = []  # list of label ids
for name in sorted(os.listdir(TEXT_DATA_DIR)):
    path = os.path.join(TEXT_DATA_DIR, name)
    if os.path.isdir(path):
        label_id = len(labels_index)
        labels_index[name] = label_id
        for fname in sorted(os.listdir(path)):
            if fname.isdigit():
                fpath = os.path.join(path, fname)
                if sys.version_info < (3,):
                    f = open(fpath)
                else:
                    f = open(fpath, encoding='latin-1')
                t = f.read()
                i = t.find('\n\n')  # skip header
                if 0 < i:
                    t = t[i:]
                texts.append(t)
                f.close()
                labels.append(label_id)

print('Found %s texts.' % len(texts))

# finally, vectorize the text samples into a 2D integer tensor
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)

labels = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# split the data into a training set and a validation set
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * data.shape[0])

x_train = data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]

print('Preparing embedding matrix.')

# prepare embedding matrix
num_words = min(MAX_NB_WORDS, len(word_index) + 1)  # +1 because Tokenizer indices start at 1
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if i >= MAX_NB_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector

# load pre-trained word embeddings into an Embedding layer
# note that we set trainable = False so as to keep the embeddings fixed
embedding_layer = Embedding(num_words,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

print('Training model.')

# train a 1D convnet with global maxpooling
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

model.fit(x_train, y_train,
          batch_size=128,
          epochs=10,
          validation_data=(x_val, y_val))


## Are word embeddings still useful?

The word embedding models we have seen have several limitations:

+ **Word2Vec** and **GloVe** handle whole words, and can't easily handle words they haven't seen before. 
+ Words can be ambiguous, but we are assigning only one embedding per word. Embeddings don't depend on the context.

**FastText** (based on Word2Vec) works with character n-grams (word fragments) and can usually handle unseen words, although it still generates one vector per word. 

More recently, several "context-aware" models have been proposed.

**ELMo** and **BERT** incorporate context, handling polysemy and nuance much better (e.g. sentences like "Time flies like an arrow. Fruit flies like bananas"). This generally improves performance notably on downstream tasks.

For natural language tasks, **ELMo** and **BERT** were the best options at the time of writing. For other kinds of tasks (for example, item embeddings for recommender systems), **Word2Vec** is still an alternative. 



## Bibliography

On word embeddings (Part I, II and III): http://ruder.io/word-embeddings-1/