### The CBOW word2vec model

Like the skip-gram model, the CBOW model is a classifier. It takes the **context words as input**
and predicts the target word: the current word given a window of surrounding words.
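
To make the idea of a context window concrete, here is a minimal sketch of how (context, target) training pairs could be built from a sentence that has already been mapped to word IDs. The helper name `cbow_pairs` and the sample IDs are hypothetical and chosen only for illustration; positions without a full window on both sides are simply skipped.

```python
def cbow_pairs(word_ids, window_size):
    """Return (context, target) pairs for every position with a full window."""
    pairs = []
    for i in range(window_size, len(word_ids) - window_size):
        # window_size words to the left plus window_size words to the right
        context = word_ids[i - window_size:i] + word_ids[i + 1:i + window_size + 1]
        pairs.append((context, word_ids[i]))
    return pairs


# Hypothetical sentence, already converted to word IDs.
sentence_ids = [12, 7, 431, 58, 9]
pairs = cbow_pairs(sentence_ids, window_size=1)
for context, target in pairs:
    print(context, "->", target)
```

With a window size of 1, each target word is paired with exactly two context words, which matches the `2*window_size` input length used later in the model.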

The **input to the model is the word IDs** of the context words. These word IDs are fed into a shared **embedding layer** that is initialized with **small random weights**. The embedding layer transforms
each word ID into a vector of size `embed_size`, so **each row of the input context** becomes
a **matrix of size** `(2*window_size, embed_size)`. This matrix is then fed into a lambda
layer, which averages the embeddings into a single vector of size `embed_size`. This **average is then fed to a dense layer**,
which produces a vector of size `vocab_size` for each row. The **activation function** on the dense
layer is a **softmax**, which normalizes the **output vector into a probability distribution** over the vocabulary. The ID
with the **maximum probability** corresponds to the predicted **target word**.
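
The data flow described above can be sketched without any deep learning library. The following is an illustrative forward pass in plain Python, not the Keras implementation: the toy sizes, the uniform initialization range, and the function name `cbow_forward` are all assumptions made for the example.

```python
import math
import random

random.seed(0)

# Toy sizes for illustration only (the notebook uses much larger values).
vocab_size, embed_size, window_size = 10, 4, 1

# Embedding matrix: one row of small random weights per word ID.
embedding = [[random.uniform(-0.05, 0.05) for _ in range(embed_size)]
             for _ in range(vocab_size)]
# Dense layer: weights of shape (embed_size, vocab_size) plus one bias per word.
dense_w = [[random.uniform(-0.05, 0.05) for _ in range(vocab_size)]
           for _ in range(embed_size)]
dense_b = [0.0] * vocab_size


def cbow_forward(context_ids):
    # 1. Embedding lookup: a (2*window_size, embed_size) matrix.
    vectors = [embedding[i] for i in context_ids]
    # 2. Average over the context words: a single (embed_size,) vector.
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    # 3. Dense layer: one score per vocabulary entry.
    logits = [sum(mean[k] * dense_w[k][j] for k in range(embed_size)) + dense_b[j]
              for j in range(vocab_size)]
    # 4. Softmax: turn the scores into a probability distribution.
    top = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = cbow_forward([3, 7])  # two context word IDs, since window_size = 1
predicted_id = max(range(vocab_size), key=lambda j: probs[j])
```

Because the weights are untrained, `predicted_id` is arbitrary here; the point is only that the output has one probability per vocabulary entry and that the probabilities sum to 1.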

The deliverable for the CBOW model is the weights from the embedding layer shown in gray in the
following figure:

<img src="cbow.JPG">

Assume a vocabulary size of 5000, an embedding size of 300, and a context window size of 1.
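
As a quick sanity check on these sizes, the number of trainable parameters can be worked out by hand. The sketch below assumes the dense layer has a bias term (the Keras default) and that the embedding layer has none; the lambda layer only averages and has no weights of its own.

```python
vocab_size, embed_size, window_size = 5000, 300, 1

context_len = 2 * window_size                        # context words per input row
embedding_params = vocab_size * embed_size           # one embed_size vector per word
dense_params = embed_size * vocab_size + vocab_size  # weights plus biases
total_params = embedding_params + dense_params

print("context length:  ", context_len)
print("embedding params:", embedding_params)
print("dense params:    ", dense_params)
print("total params:    ", total_params)
```

Under these assumptions the embedding matrix holds 1,500,000 weights and the dense layer 1,505,000, for 3,005,000 in total.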

In [1]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, Lambda
import keras.backend as K

Using TensorFlow backend.


In [3]:
vocab_size = 5000
embed_size = 300
window_size = 1

We then construct a sequential model, to which we add an **embedding layer** whose **weights are
initialized with small random values**. Note that the `input_length` of this embedding layer equals the
number of context words (`2*window_size`). All context words pass through this same layer, so its **weights are
updated jointly during backpropagation**. The output of this layer is a matrix of context word embeddings,
which the **lambda layer** averages into a single vector (per row of input). Finally, the dense
layer converts each averaged vector into a vector of size `vocab_size`. The predicted target word is the one whose
ID has the maximum value in the dense output vector:

In [4]:
model = Sequential()
model.add(Embedding(input_dim=vocab_size, 
                    output_dim=embed_size,
                    embeddings_initializer='glorot_uniform',
                    input_length=window_size*2))

# Average the context word embeddings along the context axis.
model.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(embed_size,)))
# Softmax over the vocabulary gives a probability for each candidate target word.
model.add(Dense(vocab_size, kernel_initializer='glorot_uniform', activation='softmax'))
# categorical_crossentropy is a common loss when there are two or more
# (in our case, vocab_size) categories; it expects one-hot encoded targets.
model.compile(loss='categorical_crossentropy', optimizer='adam')



In [5]:
# Get the embedding matrix, shape (vocab_size, embed_size); row i is the vector for word ID i.
weights = model.layers[0].get_weights()[0]
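
Once trained, the rows of the embedding matrix are typically compared with cosine similarity. The following is a minimal sketch of that comparison; `toy_weights` is a hypothetical three-row stand-in for `weights`, and `cosine_similarity` is a helper written for this example rather than a Keras function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical stand-in for three rows of the embedding matrix.
toy_weights = [
    [1.0, 0.0, 1.0],  # word ID 0
    [2.0, 0.0, 2.0],  # word ID 1: same direction as word ID 0
    [0.0, 1.0, 0.0],  # word ID 2: orthogonal to word ID 0
]

sim_01 = cosine_similarity(toy_weights[0], toy_weights[1])
sim_02 = cosine_similarity(toy_weights[0], toy_weights[2])
print(sim_01, sim_02)
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, regardless of their lengths.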