# Implementing the Continuous Bag of Words (CBOW) Model

The CBOW model architecture tries to predict the current target word (the center word) based on the source context words (surrounding words). 

Consider the simple sentence “the quick brown fox jumps over the lazy dog”. It can be broken into (context_window, target_word) pairs. With a context of two words, one on each side of the target, we get examples like ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy) and so on.
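The pairing described above can be sketched in a few lines of plain Python. This is only an illustration: the function name `context_target_pairs` is hypothetical, and it assumes that positions without a full window on both sides are skipped. It takes `window_size` words on each side of the target, so `window_size=1` reproduces the example pairs for the sentence.

```python
def context_target_pairs(tokens, window_size):
    """Return (context, target) pairs, taking window_size words on each side."""
    pairs = []
    # Assumption: skip positions that lack a full window on both sides.
    for i in range(window_size, len(tokens) - window_size):
        context = tokens[i - window_size:i] + tokens[i + 1:i + window_size + 1]
        pairs.append((context, tokens[i]))
    return pairs


sentence = "the quick brown fox jumps over the lazy dog".split()
for context, target in context_target_pairs(sentence, window_size=1):
    print(context, "->", target)
```

Calling it with `window_size=2` gives four context words per target, which is the setting used in the rest of this notebook.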

Here is the step-by-step method for the implementation:

**1. Build the corpus vocabulary**

**2. Build a CBOW (context, target) generator**

**3. Build the CBOW model architecture**

**4. Train the Model** 

Importing the necessary packages

word2vec_utils is the utility file, which contains functions for downloading the data, building the vocabulary, building the sample generator, etc.

In [1]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'

import numpy as np
from tensorflow.contrib.tensorboard.plugins import projector
import tensorflow as tf

import utils
import word2vec_utils

In [2]:
download_url = 'http://mattmahoney.net/dc/text8.zip'
local_dest = 'data/text8.zip'
expected_byte = 31344016
vocab_size = 50000
visual_fld = 'visualization'
window_size = 2

## STEP 1: Build the corpus vocabulary
Download the corpus and convert it to a list of individual words. The unique words in the list (at most vocab_size of them) are then collected, and each one is assigned an index.

In [3]:
utils.download_one_file(download_url, local_dest, expected_byte)
words = word2vec_utils.read_data(local_dest)
dictionary, _ = word2vec_utils.build_vocab(words, vocab_size, visual_fld)


data/text8.zip already exists
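The utility file is not shown here, so the following is only a rough sketch of what a vocabulary builder of this kind might do, not the actual code of `word2vec_utils.build_vocab`. The names `build_vocab_sketch` and `words_to_index` are hypothetical, and reserving index 0 for an `'UNK'` token that stands in for out-of-vocabulary words is an assumption.

```python
from collections import Counter


def build_vocab_sketch(words, vocab_size):
    """Map the most frequent words to integer ids; id 0 is reserved for 'UNK'."""
    # Keep vocab_size - 1 real words so that, with 'UNK', there are vocab_size ids.
    most_common = Counter(words).most_common(vocab_size - 1)
    word_to_index = {'UNK': 0}  # assumption: rare words share this id
    for word, _count in most_common:
        word_to_index[word] = len(word_to_index)
    index_to_word = {index: word for word, index in word_to_index.items()}
    return word_to_index, index_to_word


def words_to_index(words, word_to_index):
    """Replace each word by its id, falling back to the 'UNK' id."""
    return [word_to_index.get(word, 0) for word in words]


toy_words = "the quick brown fox jumps over the lazy dog the fox".split()
word_to_index, index_to_word = build_vocab_sketch(toy_words, vocab_size=4)
print(word_to_index)
print(words_to_index(toy_words, word_to_index))
```

Keeping both directions of the mapping is convenient: the forward dictionary encodes the corpus, and the reverse one decodes indices back to words when inspecting samples.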


## STEP 2: Build a CBOW (context, target) generator
Here each word in the corpus is replaced by its index from the dictionary created in the previous step.

We need pairs that consist of a target center word and its surrounding context words. In our implementation, the target is a single word and the surrounding context has length 2 x window_size: we take window_size words before and window_size words after the target word in the corpus.

In [4]:
index_words = word2vec_utils.convert_words_to_index(words, dictionary)
train = word2vec_utils.generate_cbow_sample(index_words,window_size)
print(train)

<generator object generate_cbow_sample at 0x0000013F8A360A20>


Example pairs of context words and target word:

In [5]:
# dictionary maps word -> index, so invert it to decode the sampled indices
index_to_word = {index: word for word, index in dictionary.items()}
iteration = 0
for x,y in train:
    if iteration >= 10:
        break
    print('------ sample ',iteration+1,' ------')
    print('X:context ->',[index_to_word[int(i)] for i in x])
    print('Y:target ->',index_to_word[int(y)])
    iteration +=1

------ sample  1  ------
X:context -> ['anarchism', 'originated', 'a', 'term']
Y:target -> as
------ sample  2  ------
X:context -> ['originated', 'as', 'term', 'of']
Y:target -> a
------ sample  3  ------
X:context -> ['as', 'a', 'of', 'abuse']
Y:target -> term
------ sample  4  ------
X:context -> ['a', 'term', 'abuse', 'first']
Y:target -> of
------ sample  5  ------
X:context -> ['term', 'of', 'first', 'used']
Y:target -> abuse
------ sample  6  ------
X:context -> ['of', 'abuse', 'used', 'against']
Y:target -> first
------ sample  7  ------
X:context -> ['abuse', 'first', 'against', 'early']
Y:target -> used
------ sample  8  ------
X:context -> ['first', 'used', 'early', 'working']
Y:target -> against
------ sample  9  ------
X:context -> ['used', 'against', 'working', 'class']
Y:target -> early
------ sample  10  ------
X:context -> ['against', 'early', 'class', 'radicals']
Y:target -> working


## STEP 3: Build the CBOW model architecture

The **cbow_batch_gen** generator groups the individual samples into batches.

In [6]:

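def cbow_batch_gen(batch_size,train_new):
    # Yield (context_batch, target_batch) pairs forever, drawing samples from train_new
    while True:
        context_batch = np.zeros([batch_size,2*window_size], dtype=np.int32)
        # int32, to match the dtype declared for the dataset
        target_batch = np.zeros([batch_size, 1], dtype=np.int32)
        for i in range(batch_size):
            context_batch[i],target_batch[i] = next(train_new)
        yield context_batch,target_batch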
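To see the shapes that such a batching generator produces, here is a small self-contained sketch. `toy_samples` is a hypothetical stand-in for the real sample generator (it simply emits the sample number as every index), and `batch_gen` mirrors the batching logic with the window size passed in explicitly.

```python
import itertools

import numpy as np


def toy_samples(window_size):
    """Hypothetical stand-in for the sample generator: endless (context, target) pairs."""
    for target in itertools.count():
        context = np.full(2 * window_size, target, dtype=np.int32)
        yield context, target


def batch_gen(batch_size, window_size, samples):
    """Group single samples into (context_batch, target_batch) arrays."""
    while True:
        context_batch = np.zeros([batch_size, 2 * window_size], dtype=np.int32)
        target_batch = np.zeros([batch_size, 1], dtype=np.int32)
        for i in range(batch_size):
            context_batch[i], target_batch[i] = next(samples)
        yield context_batch, target_batch


batches = batch_gen(batch_size=3, window_size=2, samples=toy_samples(2))
context_batch, target_batch = next(batches)
print(context_batch.shape, target_batch.shape)  # (3, 4) (3, 1)
print(target_batch.ravel())                     # [0 1 2]
```

Each batch therefore holds one row of 2 x window_size context indices per sample, and one target index per sample in a column of width 1.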

Loading data in batches with the generator

In [8]:
BATCH_SIZE = 100
WINDOW_SIZE = 2
VOCAB_SIZE = 50000
EMBED_SIZE = 100
train = word2vec_utils.generate_cbow_sample(index_words,window_size)
def gen():
    yield from cbow_batch_gen(BATCH_SIZE,train)

dataset = tf.data.Dataset.from_generator(gen, 
                                (tf.int32, tf.int32), 
                                (tf.TensorShape([BATCH_SIZE,2*WINDOW_SIZE]), tf.TensorShape([BATCH_SIZE, 1])))


In [10]:
with tf.name_scope('data'):
        iterator = dataset.make_initializable_iterator()
        context_words, target_word = iterator.get_next()

Creating the embedding matrix, in which each word is represented by a vector of length EMBED_SIZE

In [11]:
with tf.name_scope('embed'):
        embed_matrix = tf.get_variable('embed_matrix1', 
                                        shape=[VOCAB_SIZE, EMBED_SIZE],
                                        initializer=tf.random_uniform_initializer())
        embed = tf.nn.embedding_lookup(embed_matrix, context_words, name='embedding1')

Taking the mean of the context embeddings

In [12]:
# No session is needed to define this op; the graph is run in the training session in STEP 4.
# Average the 2*WINDOW_SIZE context embeddings of each sample: [BATCH_SIZE, EMBED_SIZE]
embed_mean = tf.reduce_mean(embed, axis=1)
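The lookup-then-average step can be checked on a toy example in NumPy. This is a sketch with made-up sizes and random values, not the TensorFlow graph itself: indexing the embedding matrix with a batch of context indices plays the role of the embedding lookup, and the mean over axis 1 collapses the context words of each sample into a single vector.

```python
import numpy as np

vocab_size, embed_size = 6, 3          # toy sizes, far smaller than in the notebook
rng = np.random.default_rng(0)
embed_matrix = rng.uniform(size=(vocab_size, embed_size))

# Two samples, each with 2 * window_size = 4 context word ids.
context_words = np.array([[0, 1, 3, 4],
                          [2, 2, 5, 0]])

embed = embed_matrix[context_words]    # lookup: shape (2, 4, 3)
embed_mean = embed.mean(axis=1)        # average over the context axis: shape (2, 3)

print(embed.shape, embed_mean.shape)
```

Because the context vectors are averaged, the order of the context words does not matter, which is where the "bag of words" in the model's name comes from.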

Defining the weights and biases of the output layer

In [13]:
with tf.name_scope('loss'):
        nce_weight = tf.get_variable('nce_weight', shape=[VOCAB_SIZE, EMBED_SIZE],
                        initializer=tf.truncated_normal_initializer(stddev=1.0 / (EMBED_SIZE ** 0.5)))
        nce_bias = tf.get_variable('nce_bias', initializer=tf.zeros([VOCAB_SIZE]))

The NCE (noise-contrastive estimation) loss function is used to estimate the loss:

http://proceedings.mlr.press/v9/gutmann10a/gutmann10a.pdf
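The idea behind NCE is to turn the expensive softmax over the whole vocabulary into a binary classification problem: tell the true target word apart from a handful of randomly drawn noise words. The NumPy sketch below illustrates this for a single example. It is a simplification under stated assumptions, not the TensorFlow implementation: the function name `nce_loss_sketch` is hypothetical, and noise words are drawn uniformly, whereas `tf.nn.nce_loss` uses a log-uniform candidate sampler by default.

```python
import numpy as np


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -np.logaddexp(0.0, -x)


def nce_loss_sketch(hidden, target, weights, biases, num_sampled, rng):
    """NCE loss for one example, with noise words drawn uniformly at random."""
    vocab_size = weights.shape[0]
    noise = rng.integers(0, vocab_size, size=num_sampled)
    # log(k * Q(w)) for the uniform noise distribution Q(w) = 1 / vocab_size.
    log_kq = np.log(num_sampled / vocab_size)

    true_logit = weights[target] @ hidden + biases[target] - log_kq
    noise_logits = weights[noise] @ hidden + biases[noise] - log_kq

    # The true word should be classified as "data", the noise words as "noise".
    return -(log_sigmoid(true_logit) + log_sigmoid(-noise_logits).sum())


rng = np.random.default_rng(0)
vocab_size, embed_size = 50, 8
weights = rng.normal(scale=1.0 / np.sqrt(embed_size), size=(vocab_size, embed_size))
biases = np.zeros(vocab_size)
hidden = rng.normal(size=embed_size)   # plays the role of embed_mean for one sample

print(nce_loss_sketch(hidden, target=7, weights=weights, biases=biases,
                      num_sampled=5, rng=rng))
```

The cost of one evaluation depends on the number of sampled words rather than on the vocabulary size, which is why this kind of loss is practical with a 50000-word vocabulary.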

In [14]:
NUM_SAMPLED = 64            
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weight, 
                                     biases=nce_bias, 
                                     labels=target_word, 
                                     inputs=embed_mean, 
                                     num_sampled=NUM_SAMPLED, 
                                     num_classes=VOCAB_SIZE), name='loss')



A gradient descent optimizer is used to minimize the loss

In [16]:
LEARNING_RATE = 0.5
with tf.name_scope('optimizer'):
        optimizer = tf.train.GradientDescentOptimizer(LEARNING_RATE).minimize(loss)

## STEP 4: Training the model

In [17]:
NUM_TRAIN_STEPS = 100000
SKIP_STEP = 5000
with tf.Session() as sess:
        sess.run(iterator.initializer)
        sess.run(tf.global_variables_initializer())

        total_loss = 0.0 # accumulated loss, used to report the average loss over the last SKIP_STEP steps
        writer = tf.summary.FileWriter('graphs/cbow_word2vec', sess.graph)

        for index in range(NUM_TRAIN_STEPS):
            try:
                loss_batch, _ = sess.run([loss, optimizer])
                total_loss += loss_batch
                if (index + 1) % SKIP_STEP == 0:
                    print('Average loss at step {}: {:5.1f}'.format(index, total_loss / SKIP_STEP))
                    total_loss = 0.0
            except tf.errors.OutOfRangeError:
                sess.run(iterator.initializer)
        writer.close()

Average loss at step 4999:  63.3
Average loss at step 9999:  17.4
Average loss at step 14999:   9.1
Average loss at step 19999:   6.3
Average loss at step 24999:   5.1
Average loss at step 29999:   4.8
Average loss at step 34999:   4.6
Average loss at step 39999:   4.5
Average loss at step 44999:   4.4
Average loss at step 49999:   4.4
Average loss at step 54999:   4.4
Average loss at step 59999:   4.4
Average loss at step 64999:   4.4
Average loss at step 69999:   4.4
Average loss at step 74999:   4.3
Average loss at step 79999:   4.3
Average loss at step 84999:   4.3
Average loss at step 89999:   4.3
Average loss at step 94999:   4.3
Average loss at step 99999:   4.3
