### The skip-gram word2vec model
The skip-gram model is trained to predict the surrounding words given the current word. To
understand how the skip-gram word2vec model works, consider the following example sentence:

    I love green eggs and ham.

Assuming a window size of three (the center word plus one word on either side), this sentence can be
broken down into the following (context, word) pairs:
- ([I, green], love)
- ([love, eggs], green)
- ([green, and], eggs)
- ...

Since the skip-gram model predicts a context word given the center word, we can convert the
preceding dataset to one of (input, output) pairs. That is, given an input word, we expect the skip-gram
model to predict the output word:

    (love, I), (love, green), (green, love), (green, eggs), (eggs, green), (eggs, and), ...
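
The conversion from a sentence to (input, output) pairs can be sketched in plain Python. This is only an illustration: the helper name `skipgram_pairs` and its `half_window` parameter are made up here, and unlike the listing above it also emits pairs for the edge words (I and ham), which have a neighbor on only one side.

```python
def skipgram_pairs(tokens, half_window=1):
    """Return (input, output) pairs: each word paired with its neighbors.

    half_window=1 corresponds to a total window size of three
    (one word to the left, the center word, one word to the right).
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - half_window)
        hi = min(len(tokens), i + half_window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs


tokens = "I love green eggs and ham".split()
pairs = skipgram_pairs(tokens)
print(pairs)
```

Increasing `half_window` yields more pairs per center word, which is the same effect a larger window has on the training set.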

We can also generate additional negative samples by pairing each input word with some random
word in the vocabulary. For example:
    
    (love, Sam), (love, zebra), (green, thing), ...

Finally, we generate positive and negative examples for our classifier:
    
    ((love, I), 1), ((love, green), 1), ..., ((love, Sam), 0), ((love, zebra), 0), ...
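
As a rough sketch of this labeling step, the following pure-Python snippet attaches label 1 to observed pairs and label 0 to randomly drawn ones. The function name `label_examples`, the `negatives_per_positive` parameter, and the small vocabulary list are assumptions made for the example, not part of any library.

```python
import random


def label_examples(positive_pairs, vocabulary, negatives_per_positive=1, seed=0):
    """Return ((input, output), label) tuples: 1 = observed pair, 0 = random pair."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    examples = [(pair, 1) for pair in positive_pairs]
    for input_word, _output_word in positive_pairs:
        for _ in range(negatives_per_positive):
            # A random word may occasionally be a true context word;
            # this simple sketch does not filter such collisions out.
            examples.append(((input_word, rng.choice(vocabulary)), 0))
    return examples


positive_pairs = [("love", "I"), ("love", "green"), ("green", "love")]
vocabulary = ["I", "love", "green", "eggs", "and", "ham", "Sam", "zebra", "thing"]
examples = label_examples(positive_pairs, vocabulary)
for example in examples:
    print(example)
```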

We can now train a classifier that takes in a word vector and a context vector and learns to predict
one or zero depending on whether it sees a positive or negative sample. The deliverables from this
trained network are the weights of the word embedding layer (the gray box in the following figure):
<img src="skip-gram.jpg">

In [4]:
#from keras.layers import Merge # out of date
#from keras.layers import dot
#from keras.layers.core import Dense, Reshape
#from keras.layers.embeddings import Embedding
#from keras.models import Sequential
from keras import layers
from keras import models
from keras import Input, Model

In [2]:
vocab_size = 5000   # vocabulary size is set at 5000
embed_size = 300    # output embedding size is 300

We first build the word model. Its input is the ID of a word in the vocabulary, which the embedding
layer maps to a vector of size embed_size. The embedding weights are initially set to small random
values, and during training the model updates them using backpropagation. The next layer reshapes
the embedding output from (1, embed_size) to (embed_size,).

The other model that we need is a model for the context words. For each of our skip-gram
pairs, we have a single context word corresponding to the target word, so this model is identical to
the word model.

The outputs of the two models are each a vector of size (embed_size). **These outputs are merged into one
using a dot product and fed into a dense layer**, which has a single output wrapped in a sigmoid
activation layer. The **Sigmoid Activation** squashes its input into the range (0, 1): inputs above 0 map to
outputs above 0.5 that rapidly approach 1 and flatten out, and inputs below 0 map to outputs below 0.5 that
rapidly approach 0 and also flatten out.
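
A quick numerical check of the sigmoid's shape, using only the standard library. The sample inputs are arbitrary and chosen purely for illustration.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# An input of 0 sits exactly at the midpoint 0.5; the output saturates
# toward 0 or 1 as the input moves away from 0 in either direction.
for x in (-6, -2, 0, 2, 6):
    print(f"sigmoid({x:+d}) = {sigmoid(x):.4f}")
```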

The following code no longer works because the *Merge API* has been deprecated and removed from Keras:

        model = Sequential()
        model.add(Merge([word_model, context_model], mode="dot"))
        model.add(Dense(1, init="glorot_uniform", activation="sigmoid"))
        model.compile(loss="mean_squared_error", optimizer="adam")

The following code translates the model from the Keras Sequential API to the Functional API, which works around the missing Merge API. For now, the objective is to merge the two vectors into a single tensor using a dot product.

<img src="concatenate_keras2.png">

In [8]:
"""
#Sequential API of Keras
word_model = Sequential()
word_model.add(Embedding(vocab_size, embed_size,
                         embeddings_initializer="glorot_uniform",
                         input_length=1))                            # window size is 1

word_model.add(Reshape((embed_size, )))
"""

# Functional API of Keras 
word_input = Input(shape=(1,))
word_x = layers.Embedding(vocab_size, 
                          embed_size, 
                          embeddings_initializer='glorot_uniform')(word_input)
word_reshape = layers.Reshape((embed_size,))(word_x)
word_model = Model(word_input, word_reshape)    

"""
#Sequential API of Keras 
context_model = Sequential()
context_model.add(Embedding(vocab_size, embed_size,
                            embeddings_initializer="glorot_uniform",
                            input_length=1))

context_model.add(Reshape((embed_size,)))
"""
# Functional API of Keras 
context_input = Input(shape=(1,))
context_x = layers.Embedding(vocab_size, 
                             embed_size, 
                             embeddings_initializer='glorot_uniform')(context_input)
context_reshape = layers.Reshape((embed_size,))(context_x)
context_model = Model(context_input, context_reshape)


dot_output = layers.dot([word_model.output, context_model.output], axes=1, normalize=False)
model_output = layers.Dense(1, kernel_initializer='glorot_uniform', activation='sigmoid')(dot_output)
model = Model([word_input, context_input], model_output)

The loss function used is mean_squared_error, computed between the sigmoid output and the 0/1 label; the
idea is to maximize the dot product for positive examples and minimize it for negative examples. If you
recall, the dot product multiplies corresponding elements of two vectors and sums up the result. This
causes similar vectors to have higher dot products than dissimilar vectors, since the former have more
overlapping elements.
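
The claim that similar vectors have higher dot products can be checked with a few hand-written vectors. The values below are invented for illustration and are not taken from a trained model.

```python
def dot(u, v):
    """Multiply corresponding elements of two vectors and sum the results."""
    return sum(a * b for a, b in zip(u, v))


word = [0.9, 0.1, 0.8]
similar = [0.8, 0.2, 0.9]       # points in roughly the same direction as `word`
dissimilar = [-0.7, 0.9, -0.6]  # points in a largely opposite direction

print("similar:   ", dot(word, similar))
print("dissimilar:", dot(word, dissimilar))
```

The first dot product is positive and the second is negative, so after the sigmoid the first pair would score above 0.5 and the second below it.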

In [None]:
model.compile(loss="mean_squared_error", optimizer="rmsprop")

In [25]:
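# In a functional model, layers[0] is an InputLayer, so look layers up by type or position.
merge_layer = next(l for l in model.layers if isinstance(l, layers.Dot))
word_embed_layer = word_model.layers[1]        # word_model is [InputLayer, Embedding, Reshape]
weights = word_embed_layer.get_weights()[0]    # embedding matrix, shape (vocab_size, embed_size)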
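
Once training has finished, each row of the weight matrix is the embedding of the word with that ID. The following sketch shows one common way to use such a matrix, comparing words by cosine similarity. The matrix here is a tiny hand-written stand-in (`toy_weights`, with invented values, and a hypothetical `toy_word2id` mapping), not output from the model above.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, word2id, weights):
    """Return the other word whose embedding row is closest to `word`."""
    query = weights[word2id[word]]
    scores = {w: cosine_similarity(query, weights[i])
              for w, i in word2id.items() if w != word}
    return max(scores, key=scores.get)


# Invented values standing in for a trained embedding matrix.
toy_word2id = {"green": 1, "eggs": 2, "zebra": 3}
toy_weights = [
    [0.0, 0.0, 0.0],    # row 0: no word is assigned ID 0 in this toy mapping
    [0.9, 0.1, 0.3],    # green
    [0.8, 0.2, 0.4],    # eggs
    [-0.5, 0.9, -0.2],  # zebra
]

print(most_similar("green", toy_word2id, toy_weights))
```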

### Extracting skip-grams for a text that has been converted to a list of word indices

In [27]:
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import skipgrams

text = "I love green eggs and ham ."





We first declare the tokenizer and fit it on the text, which builds the tokenizer's vocabulary:

In [31]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])

The tokenizer creates a dictionary mapping each unique word to an integer ID and makes it available
in the word_index attribute. We extract this and create a two-way lookup table:

In [32]:
word2id = tokenizer.word_index
id2word = {v: k for k, v in word2id.items()}
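
To make the mapping concrete, here is a simplified pure-Python imitation of how such a word index can be built. It is a sketch, not the Keras implementation: the helper `build_word_index` is hypothetical, it drops any token that is not purely alphanumeric instead of applying a configurable filter string, and it assumes IDs start at 1 and are assigned by descending frequency, with ties broken by first appearance.

```python
from collections import Counter


def build_word_index(texts):
    """Map each distinct lowercase word to an integer ID, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(w for w in text.lower().split() if w.isalnum())
    # sorted() is stable, so words with equal counts keep first-seen order.
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: idx for idx, word in enumerate(ranked, start=1)}


word2id_sketch = build_word_index(["I love green eggs and ham ."])
id2word_sketch = {v: k for k, v in word2id_sketch.items()}
print(word2id_sketch)
```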

Finally, we convert our input list of words to a list of IDs and pass it to the skipgrams function. We then
print the first 10 of the 56 (pair, label) skip-gram tuples generated:

In [33]:
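wids = [word2id[w] for w in text_to_word_sequence(text)]
# vocabulary_size should be the largest word ID + 1 (ID 0 is reserved); otherwise
# the last word ("ham") can never be drawn as a negative sample.
pairs, labels = skipgrams(wids, len(word2id) + 1)
print(len(pairs), len(labels))
for i in range(10):
    print("({:s} ({:d}), {:s} ({:d})) -> {:d}".format(
        id2word[pairs[i][0]], pairs[i][0],
        id2word[pairs[i][1]], pairs[i][1],
        labels[i]))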

56 56
(love (2), and (5)) -> 0
(eggs (4), ham (6)) -> 1
(love (2), love (2)) -> 0
(and (5), i (1)) -> 1
(and (5), green (3)) -> 1
(and (5), i (1)) -> 0
(and (5), green (3)) -> 0
(i (1), and (5)) -> 0
(i (1), eggs (4)) -> 1
(ham (6), and (5)) -> 1


Negative sampling, which generates the negative examples, pairs words from the text with randomly
chosen words from the vocabulary. As the size of the input text (and hence the vocabulary) increases, a
random pair is more likely to be genuinely unrelated. In our example, the text is so short that a random
pair can coincide with a real context pair: in the output above, (and, i) and (and, green) each appear
once with label 1 and once with label 0.
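
The count of 56 can be reproduced with a simplified re-implementation of the sampling logic. This sketch assumes a window of 4 words on each side and one negative sample per positive pair (the documented defaults of `skipgrams` at the time of writing; check your installed version). The function `skipgrams_sketch` is a hypothetical stand-in, not the Keras code, and it passes a vocabulary size of 7 (largest word ID plus one).

```python
import random


def skipgrams_sketch(sequence, vocabulary_size, window_size=4, seed=0):
    """Return shuffled (pairs, labels) with one random negative per positive."""
    rng = random.Random(seed)
    couples, labels = [], []
    # Positive pairs: every word within window_size positions of the center.
    for i, center in enumerate(sequence):
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                couples.append((center, sequence[j]))
                labels.append(1)
    # Negative pairs: reuse each center word with a random ID in
    # 1..vocabulary_size-1; collisions with true pairs are not filtered.
    centers = [center for center, _ in couples]
    for center in centers:
        couples.append((center, rng.randint(1, vocabulary_size - 1)))
        labels.append(0)
    # Shuffle pairs and labels with the same permutation.
    order = list(range(len(couples)))
    rng.shuffle(order)
    return [couples[k] for k in order], [labels[k] for k in order]


wids_sketch = [1, 2, 3, 4, 5, 6]  # IDs for: i love green eggs and ham
pairs_sketch, labels_sketch = skipgrams_sketch(wids_sketch, vocabulary_size=7)
print(len(pairs_sketch), sum(labels_sketch))
```

With six words, 28 of the 30 ordered word pairs fall within four positions of each other, so there are 28 positive pairs and 28 negative ones, for a total of 56.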