# Word Embeddings with Neural Networks
We will train word embeddings using the continuous-bag-of-words (CBOW) method.

# Continuous-Bag-of-Words (CBOW) Embeddings

We will use dense word embeddings based on the word2vec paradigm. In particular, we will use the continuous-bag-of-words approach, which trains a model to predict a word from the embeddings of its surrounding words. For example, in the sentence "the man walks the dog in the park", the embeddings for the words ("man", "walks", "dog", "in") will be used to predict the second "the" (with a context size of 2 on each side of the target word).

## Download & Preprocess the Data
First we will download the dataset using [torchtext](https://torchtext.readthedocs.io/en/latest/index.html), which is a package that supports NLP for PyTorch.

In [None]:
! pip install torch==1.13.0
! pip install torchtext==0.14.0
! pip install torchdata==0.5.0

In [3]:
### DO NOT EDIT ###

import torch
import torch.nn as nn
import torch.nn.functional as F

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

if __name__=='__main__':
    print('Using device:', DEVICE)

Using device: cuda


We will use WikiText-2, a corpus of high-quality Wikipedia articles. The dataset was originally introduced in the following paper: https://arxiv.org/pdf/1609.07843v1.pdf. A raw version of the data can easily be viewed here: https://github.com/pytorch/examples/tree/master/word_language_model/data/wikitext-2.preprocess

After downloading the data, we preprocess the text.

* <b>Sentence splitting:</b>&nbsp;&nbsp;&nbsp;&nbsp; Here, we are interested in modeling individual sentences, rather than longer chunks of text such as paragraphs or documents. The WikiText dataset provides paragraphs; thus, we write a simple method to identify individual sentences by splitting paragraphs at punctuation tokens (".",  "!",  "?").

* <b>Sentence markers:</b>&nbsp;&nbsp;&nbsp;&nbsp; For both training and testing corpora, each sentence must be surrounded by a start-of-sentence marker (`<s>`) and an end-of-sentence marker (`</s>`). These markers will allow our models to generate sentences that have realistic beginnings and endings.

* <b>Unknown words:</b>&nbsp;&nbsp;&nbsp;&nbsp; In order to deal with unknown words, all words that do not appear in the vocabulary must be replaced with a special token for unknown words (`<UNK>`). The WikiText dataset has already done this as mentioned in the paper above. When unknown words are encountered in the test corpus, they should be treated as that special token instead.
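
The three steps above can be sketched on a toy paragraph. This is a simplified illustration, not the notebook's `cbow_preprocess` (which also handles quotation marks and capitalization); the function name `toy_preprocess` and the tiny vocabulary are assumptions made for the example.

```python
START, END, UNK = "<s>", "</s>", "<UNK>"

def toy_preprocess(paragraph, vocab):
    """Split a whitespace-tokenized paragraph into marked sentences."""
    sentences, current = [], []
    for token in paragraph.split():
        # Unknown words: anything outside the vocabulary becomes UNK
        current.append(token if token in vocab else UNK)
        # Sentence splitting: close the sentence at a punctuation token
        if token in {".", "!", "?"}:
            # Sentence markers: wrap the finished sentence
            sentences.append([START] + current + [END])
            current = []
    if current:  # trailing words with no closing punctuation
        sentences.append([START] + current + [END])
    return sentences

vocab = {"the", "dog", "barks", "runs", ".", "!"}  # hypothetical vocabulary
result = toy_preprocess("the dog barks . the cat runs !", vocab)
for sentence in result:
    print(sentence)
```

Here "cat" is not in the toy vocabulary, so it is replaced by `<UNK>` in the second sentence.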

In [4]:
### SENTENCE ###

# Constants
CBOW_START = "<s>"   # Start-of-sentence token
CBOW_END = "</s>"    # End-of-sentence token
CBOW_UNK = "<UNK>"   # Unknown word token

In [5]:
### DATA PREPROCESSING ###

import torchtext
import random
import sys

def cbow_preprocess(data, vocab=None, do_lowercase=True):
    final_data = []
    lowercase = "abcdefghijklmnopqrstuvwxyz"
    for paragraph in data:
        paragraph = [x if x != '<unk>' else CBOW_UNK for x in paragraph.split()]
        if vocab is not None:
            paragraph = [x if x in vocab else CBOW_UNK for x in paragraph]
        if paragraph == [] or paragraph.count('=') >= 2: continue
        sen = []
        prev_punct, prev_quot = False, False
        for word in paragraph:
            if prev_quot:
                if word[0] not in lowercase:
                    final_data.append(sen)
                    sen = []
                    prev_punct, prev_quot = False, False
            if prev_punct:
                if word == '"':
                    prev_punct, prev_quot = False, True
                else:
                    if word[0] not in lowercase:
                        final_data.append(sen)
                        sen = []
                        prev_punct, prev_quot = False, False
            if word in {'.', '?', '!'}: prev_punct = True
            sen += [word]
        if sen[-1] not in {'.', '?', '!', '"'}: continue # Prevent a lot of short sentences
        final_data.append(sen)
    vocab_was_none = vocab is None
    if vocab is None:
        vocab = {}
    for i in range(len(final_data)):
        # Make words lowercase for this assignment
        final_data[i] = [x.lower() if do_lowercase and x != CBOW_UNK else x for x in final_data[i]]
        final_data[i] = [CBOW_START] + final_data[i] + [CBOW_END]
        if vocab_was_none:
            for word in final_data[i]:
                vocab[word] = vocab.get(word, 0) + 1
    return final_data, vocab

def getDataset():
    dataset = torchtext.datasets.WikiText2(root='.data', split=('train',))
    train_dataset, vocab = cbow_preprocess(dataset[0])
    return train_dataset, vocab

if __name__=='__main__':
    sentences, vocab = getDataset()

Run the next cell to see 10 random sentences of the data.

In [6]:
### PRINT DATA ###

if __name__ == '__main__':
    for x in random.sample(sentences, 10):
        print(x)

['<s>', 'cherry', 'had', 'originally', 'wanted', 'to', 'do', 'a', 'ten', '@-@', 'year', 'jump', ',', 'mostly', 'to', 'age', 'the', 'young', 'characters', 'into', 'their', 'teenage', 'years', 'in', 'order', 'to', 'open', 'up', 'more', 'storyline', 'possibilities', '.', '</s>']
['<s>', '"', 'sweet', 'love', '"', 'garnered', 'positive', 'reviews', 'from', 'music', 'critics', '.', '</s>']
['<s>', 'david', '<UNK>', ',', 'in', '1851', ',', 'used', 'the', 'even', 'of', 'st', 'agnes', 'to', 'claim', ',', '"', 'we', 'have', 'here', 'a', 'specimen', 'of', 'descriptive', 'power', '<UNK>', 'rich', 'and', 'original', ';', 'but', 'the', 'following', 'lines', ',', 'from', 'the', "'", 'ode', 'to', 'a', 'nightingale', ',', "'", 'flow', 'from', 'a', 'far', 'more', 'profound', 'fountain', 'of', 'inspiration', '.', '"', '</s>']
['<s>', 'the', 'film', 'was', 'featured', 'in', 'an', 'exhibit', 'in', 'vienna', ',', 'examining', 'the', 'nature', 'of', 'pornography', '.', '</s>']
['<s>', 'in', 'his', 'first', 

## Define the Dataset Class
In the following cell, we will define the <b>dataset</b> class. The dataset contains input-output pairs for each training example we will provide to the model. We will implement the following functions:

*   <b>` make_training_examples(self)`:</b>  Each training example will be a list of <em>context</em> words along with a <em>target</em> word. The context words consist of $c$ words on either side of the target word; hence, each list of context words has size $2c$. The goal will be to have our model predict the target word from the context words. Thus, we must convert each sentence into a series of context-target pairs, as follows:
<ul>
<li>For each sentence $s=[w_1,w_2,...,w_n]$ and a context size $c$, compute the following (context, target) pairs:<br>&emsp;&emsp;&emsp;&emsp;$([w_1,...,w_c,w_{c+2},...,w_{2c+1}]$, $w_{c+1}$)<br>&emsp;&emsp;&emsp;&emsp;$([w_2,...,w_{c+1},w_{c+3},...,w_{2c+2}]$, $w_{c+2}$)<br>&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;$\vdots$<br>&emsp;&emsp;&emsp;&emsp;$([w_{n-2c},...,w_{n-c-1},w_{n-c+1},...,w_{n}]$, $w_{n-c}$)<br>For example, suppose our sentence is "the man walks the dog in the park" and the context size is $c=2$, our method should find the following training pairs:<br>&emsp;&emsp;&emsp;&emsp;(["the", "man", "the", "dog"], "walks")<br>&emsp;&emsp;&emsp;&emsp;(["man", "walks", "dog", "in"], "the")<br>&emsp;&emsp;&emsp;&emsp;(["walks", "the", "in", "the"], "dog")<br>&emsp;&emsp;&emsp;&emsp;(["the", "dog", "the", "park"], "in")<br>Of course, the sentences in our dataset have start-of-sentence and end-of-sentence tokens as well, which we can treat as any other word.
</ul>
This function will return a list of <b>all</b> such training pairs.

*   <b>` build_dictionaries(self, vocab)`:</b>  Creates the dictionaries `word2idx` and `idx2word`. We will represent each word in the vocabulary with a unique index, and keep track of this in these dictionaries. The input `vocab` is a list of words: we must assign indexes in the order the words appear in this list.

* <b>`get_context_vector(self, idx)`:</b> Returns a vector representing the <em>context</em> of the `idx`th training example. Specifically, if the context size is $c$, this should be a tensor of $2c$ word indices corresponding to the context words of the `idx`th example.

   <font color='green'><b>Note:</b> We may want to pre-compute and save all context vectors (using word indices rather than the words themselves) in `__init__(...)`, and then access these in `get_context_vector(self, idx)`. This would give us a slight speedup at train time.</font>

*   <b>`get_target_index(self, idx) `</b>: Return the target word index for the `idx`th training example.

*  <b> ` __len__(self) `: </b> Return the total number of training examples in the dataset as an `int`.

*   <b>` __getitem__(self, idx)`:</b> Return the `idx`th training example as a tuple of `(context_vector, target_word_index)`. We can use the ` get_context_vector(self, idx) ` and ` get_target_index(self, idx) ` functions here.
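
As a check on the specification above, here is a minimal standalone sketch that builds the (context, target) pairs for the example sentence and converts one context to word indices. The helper names `make_pairs` and `build_index` are hypothetical and are not part of the dataset class.

```python
def make_pairs(sentence, c):
    """Return (context, target) pairs with c words on each side of the target."""
    pairs = []
    # Only positions with c words before and c words after can be targets
    for t in range(c, len(sentence) - c):
        context = sentence[t - c:t] + sentence[t + 1:t + c + 1]
        pairs.append((context, sentence[t]))
    return pairs

def build_index(words):
    """Assign indices in the order the words appear in the list."""
    word2idx = {}
    for word in words:
        if word not in word2idx:
            word2idx[word] = len(word2idx)
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word

sentence = "the man walks the dog in the park".split()
pairs = make_pairs(sentence, c=2)
for context, target in pairs:
    print(context, target)

# Context vector and target index for the first training example
word2idx, idx2word = build_index(sorted(set(sentence)))
context_indices = [word2idx[w] for w in pairs[0][0]]
print(context_indices, word2idx[pairs[0][1]])
```

A sentence of length $n$ yields $n-2c$ pairs, so the 8-word example with $c=2$ produces the four pairs listed above.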

In [7]:
from torch.utils import data
from collections import defaultdict

# Module-level copies of the dictionaries, filled in by build_dictionaries.
word2idx = {}
idx2word = {}

class CbowDataset(data.Dataset):
    def __init__(self, sentences, vocab, context_size):
        ##### INITIALIZATION #####

        assert CBOW_START in vocab and CBOW_END in vocab and CBOW_UNK in vocab
        self.sentences = sentences
        self.context_size = context_size

        self.training_examples = []
        self.make_training_examples()

        self.word2idx = {} # Mapping of word to index
        self.idx2word = {} # Mapping of index to word
        self.build_dictionaries(sorted(vocab.keys()))
        self.vocab_size = len(self.word2idx)

    def make_training_examples(self):
        '''
        Builds a list of context-target_word pairs that will be used as training examples for the model and stores them in
        self.training_examples.
        Each example is a (context, target_word) tuple, where context is a list of strings of size 2*context_size and
        target_word is simply a string.
        Returns nothing.
        '''

        ##### MAKE TRAINING EXAMPLE #####
        # For each sentence, loop over each word in the sentence. If there are c words before and c words after the word,
        # make a (context, word) pair, where context is a list made up of the c words before the word and the c words
        # after the word (in the same order they appear in the sentence). Append this (context, word) pair to self.training_examples.


        for sentence in self.sentences:
            for i in range(len(sentence)- 2*self.context_size):
                target_word = sentence[i+self.context_size]
                context1 = sentence[i : i+self.context_size]
                context2 = sentence[i+self.context_size+1 : i+2*self.context_size+1]
                context = context1 + context2
                self.training_examples.append((context, target_word))


    def build_dictionaries(self, vocab):
        '''
        Builds the dictionaries self.idx2word and self.word2idx. Make sure that we assign indices
        in the order the words appear in vocab (a list of words).
        Returns nothing.
        '''
        ##### BUILD DICTIONARIES #####

        idx = 0
        for word in vocab:
            if word not in self.word2idx:
                self.word2idx[word] = idx
                self.idx2word[idx] = word
                word2idx[word] = idx
                idx2word[idx] = word
                idx += 1

    def get_context_vector(self, idx):
        '''
        Returns the context vector (as a torch.tensor) for the training example at index idx.
        This is a tensor containing the indices of each word in the context.
        '''
        assert len(self.training_examples) > 0

        ##### GET CONTEXT VECTOR #####

        example = self.training_examples[idx]
        context = example[0]

        vec = []

        for word in context:
            vec.append(self.word2idx[word])

        return torch.tensor(vec)

    def get_target_index(self, idx):
        '''
        Returns the index of the target word (as type int) of the training example at index idx.
        '''
        ##### GET TARGET INDEX #####

        example = self.training_examples[idx]
        target_word = example[1]
        return int(self.word2idx[target_word])

    def __len__(self):
        '''
        Returns the number of training examples (as type int) in the dataset
        '''
        ##### GET LENGTH #####

        return int(len(self.training_examples))

    def __getitem__(self, idx):
        '''
        Returns the context vector (as a torch.tensor) and target index (as type int) of the training example at index idx.
        '''
        ##### GET ITEM #####

        return self.get_context_vector(idx), self.get_target_index(idx)

## Define the CBOW Model

Here, we will define a simple feed-forward neural network that takes in a context vector and predicts the word that completes the context. We create a `CbowModel` class with `__init__(...)` and `forward(...)` functions.

In [9]:
class CbowModel(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, context_size):
        '''
        vocab_size: Size of the vocabulary
        embed_size: Size of your embedding vectors
        hidden_size: Size of hidden layer of neural network
        context_size: The size of your context window used to generate training examples
        '''
        super(CbowModel, self).__init__()

        self.context_size = context_size

        ##### CREATING THE MODEL LAYERS #####
        # 1. Create an embedding layer using nn.Embedding, that will take an index in our vocabulary as input
        #    (referring to a word) and return a vector of size embed_size (i.e. your embedding vector).
        #    Note that providing a word index to nn.Embedding is the same (conceptually) as providing a one-hot
        #    vector to nn.Linear (however, nn.Embedding takes sparsity into account, so is more efficient)

        # 2. Create a linear layer that projects our embedding vector to a vector of size hidden_size.

        # 3. Create an output linear layer, that projects our hidden vector to a vector the size of our vocabulary.

        self.embed_dim = embed_size
        self.embedding = nn.Embedding(num_embeddings=vocab_size,embedding_dim=self.embed_dim)
        self.linear1 = nn.Linear(self.embed_dim, hidden_size)
        self.output = nn.Linear(hidden_size, vocab_size)
        self.activation_function1 = nn.ReLU()

    def forward(self, inputs):
        '''
        inputs: Tensor of size [batch_size, 2*context_size]

        Returns output: Tensor of size [batch_size, vocab_size]
        '''

        ##### CREATE FORWARD #####
        # 1. Feed the inputs through our embedding layer to get a tensor of size [batch_size, 2*context size, embed_size]
        # 2. Average the embedding vectors of each of our context word embeddings (for each example in your batch).
        #    Expected size: [batch_size, embed_size]
        # 3. Feed this through our linear layer and then a ReLU activation. Expected size: [batch_size, hidden_size]
        # 4. Feed this through our output layer and return the result. Expected size [batch_size, vocab_size]
        #    Do NOT apply a softmax to the final output - this is done in the training method!

        embeds = self.embedding(inputs)
        avg_embeds = torch.mean(embeds, dim=1)
        out = self.linear1(avg_embeds)
        out = self.activation_function1(out)
        out = self.output(out)
        return out
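
The forward pass above can be traced shape by shape on random inputs. The sketch below uses untrained layers and arbitrary toy sizes (all of the numbers are assumptions chosen for illustration), and it only checks tensor shapes, not prediction quality.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration
vocab_size, embed_size, hidden_size, context_size, batch_size = 50, 8, 16, 2, 4

embedding = nn.Embedding(vocab_size, embed_size)
hidden = nn.Linear(embed_size, hidden_size)
output = nn.Linear(hidden_size, vocab_size)

# A batch of context vectors: 2*context_size word indices per example
inputs = torch.randint(0, vocab_size, (batch_size, 2 * context_size))

embeds = embedding(inputs)          # [batch_size, 2*context_size, embed_size]
averaged = embeds.mean(dim=1)       # [batch_size, embed_size]
h = torch.relu(hidden(averaged))    # [batch_size, hidden_size]
logits = output(h)                  # [batch_size, vocab_size], no softmax applied

print(embeds.shape, averaged.shape, h.shape, logits.shape)
```

Because the context embeddings are averaged, the order of the context words does not affect the prediction, which is where the "bag-of-words" in the name comes from.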


## Train the CBOW Model

Now, we initialize the <b>dataloader</b>. A dataloader is responsible for providing batches of data to our model. Notice how we first instantiate the dataset and then pass it to the dataloader.

In [12]:
### TRAINING ###

BATCH_SIZE = 1000 #can change
CONTEXT_SIZE = 6  #can change

if __name__=='__main__':
    cbow_dataset = CbowDataset(sentences, vocab, CONTEXT_SIZE)
    cbow_dataloader = torch.utils.data.DataLoader(cbow_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2, drop_last=True)
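
The effect of `batch_size` and `drop_last=True` can be seen on a toy dataset. This is a minimal sketch with made-up examples; a plain Python list stands in for `CbowDataset`, and `num_workers` is left at its default.

```python
import torch
from torch.utils.data import DataLoader

# Five toy (context_vector, target_index) examples; the values are made up
examples = [(torch.tensor([i, i + 1, i + 3, i + 4]), i + 2) for i in range(5)]

loader = DataLoader(examples, batch_size=2, shuffle=False, drop_last=True)

batches = list(loader)
for contexts, targets in batches:
    # contexts: [batch_size, 2*context_size], targets: [batch_size]
    print(contexts.shape, targets)
```

With five examples and a batch size of 2, only two full batches are produced; the leftover fifth example is dropped because it cannot fill a batch.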

Now we write a function that takes our model and trains it on the data.

In [13]:
### DO NOT EDIT ###

from tqdm.notebook import tqdm
from torch import optim

def train_cbow_model(model, num_epochs, data_loader, optimizer, criterion):
    print("Training CBOW model....")
    for epoch in range(num_epochs):
        epoch_loss, n = 0, 0
        for context, target in tqdm(data_loader):
            optimizer.zero_grad()
            logits = model(context.long().to(DEVICE))  # raw scores; CrossEntropyLoss applies log-softmax
            loss = criterion(logits, target.to(DEVICE))
            loss.backward()
            optimizer.step()
            n += context.shape[0]
            epoch_loss += loss.item() * context.shape[0]  # accumulate as a Python float

        epoch_loss = epoch_loss/n
        print('[TRAIN]\t Epoch: {:2d}\t Loss: {:.4f}'.format(epoch+1, epoch_loss))
    print('CBOW Model Trained!\n')

Now we can instantiate our model.

In [14]:
count_parameters = lambda model: sum(p.numel() for p in model.parameters() if p.requires_grad)

if __name__=='__main__':
    cbow_model = CbowModel(vocab_size = cbow_dataset.vocab_size, # Don't change this
                embed_size = 300, # can change
                hidden_size = 300, # can change
                context_size = CONTEXT_SIZE)

    # Put our model on the device (cuda or cpu)
    cbow_model = cbow_model.to(DEVICE)

    print('The model has {:,d} trainable parameters'.format(count_parameters(cbow_model)))

The model has 17,446,579 trainable parameters
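
The reported count can be reproduced by hand. The sketch below sums the parameters of the three layers defined in `CbowModel` and solves for the vocabulary size implied by the printed total; the helper name `cbow_param_count` is hypothetical.

```python
def cbow_param_count(vocab_size, embed_size, hidden_size):
    embedding = vocab_size * embed_size                 # nn.Embedding weight
    linear1 = embed_size * hidden_size + hidden_size    # weight + bias
    output = hidden_size * vocab_size + vocab_size      # weight + bias
    return embedding + linear1 + output

reported_total = 17_446_579
embed_size = hidden_size = 300

# total = V * (embed + hidden + 1) + embed * hidden + hidden, solved for V
fixed = embed_size * hidden_size + hidden_size
implied_vocab_size, remainder = divmod(reported_total - fixed, embed_size + hidden_size + 1)

print(implied_vocab_size, remainder)
print(cbow_param_count(implied_vocab_size, embed_size, hidden_size))
```

Almost all of the parameters sit in the embedding and output layers, both of which scale with the vocabulary size; the hidden layer contributes only 90,300.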


Next, we create the **criterion**, which is our loss function: it is a measure of how well the model matches the empirical distribution of the data. We use cross-entropy loss (https://en.wikipedia.org/wiki/Cross_entropy).

We also define the **optimizer**, which performs gradient descent. We use the Adam optimizer (https://arxiv.org/pdf/1412.6980.pdf), which has been shown to work well on these types of models.
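
For intuition about the loss values printed during training, here is a minimal pure-Python sketch of cross-entropy for a single example: a softmax over the raw scores followed by the negative log-probability of the target. `nn.CrossEntropyLoss` performs both steps internally (averaged over the batch by default), which is why the model's `forward` returns unnormalized scores. The logits below are made-up numbers.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax probability of the target index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

# A confident, correct prediction versus a uniform guess over 4 classes
loss_confident = cross_entropy([5.0, 0.0, 0.0, 0.0], target=0)
loss_uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], target=0)

print(loss_confident)
print(loss_uniform, math.log(4))
```

A model that predicts uniformly over $V$ words has a loss of $\ln V$, which is a useful baseline when reading the training log.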

In [15]:
import torch.optim as optim

if __name__=='__main__':
    LEARNING_RATE = 0.01 # can try other learning rates

    # Define the loss function
    criterion = nn.CrossEntropyLoss().to(DEVICE)

    # Define the optimizer
    optimizer = optim.Adam(cbow_model.parameters(), lr=LEARNING_RATE)

Finally, we can train the model.

In [16]:
if __name__=='__main__':
    N_EPOCHS = 10 # can change

    # Train model for N_EPOCHS epochs
    train_cbow_model(cbow_model, N_EPOCHS, cbow_dataloader, optimizer, criterion)

Training CBOW model....
[TRAIN]	 Epoch:  1	 Loss: 7.0209
[TRAIN]	 Epoch:  2	 Loss: 6.6275
[TRAIN]	 Epoch:  3	 Loss: 6.4770
[TRAIN]	 Epoch:  4	 Loss: 6.3840
[TRAIN]	 Epoch:  5	 Loss: 6.3133
[TRAIN]	 Epoch:  6	 Loss: 6.2464
[TRAIN]	 Epoch:  7	 Loss: 6.1933
[TRAIN]	 Epoch:  8	 Loss: 6.1587
[TRAIN]	 Epoch:  9	 Loss: 6.1241
[TRAIN]	 Epoch: 10	 Loss: 6.0951
CBOW Model Trained!



In [18]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [22]:
#Saving and Loading Model
PATH = "drive/MyDrive/......./cbow_model"

# torch.save(cbow_model, PATH)
model = torch.load(PATH, map_location=torch.device(DEVICE))

## Visualize Word Embeddings

Now that we have a trained model, we can extract the word embeddings and visualize them. The word embeddings are the rows of the embedding layer's weight matrix, which maps each index in the vocabulary to a dense vector of size `embed_size`.
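
To illustrate how an embedding is read out of a weight matrix, the following sketch uses a tiny hand-written matrix and vocabulary (both made up for the example). It also shows that selecting a row by index gives the same result as multiplying a one-hot vector by the matrix. In the trained model, the matrix is `model.embedding.weight`.

```python
# Toy embedding matrix: one row per vocabulary index (values are made up)
weights = [
    [0.1, 0.2, 0.3],   # index 0
    [0.4, 0.5, 0.6],   # index 1
    [0.7, 0.8, 0.9],   # index 2
]
word2idx = {"cat": 0, "dog": 1, "park": 2}

def lookup(word):
    """Embedding lookup: select the row for the word's index."""
    return weights[word2idx[word]]

def one_hot_product(word):
    """Same result via a one-hot vector multiplied by the weight matrix."""
    one_hot = [1.0 if i == word2idx[word] else 0.0 for i in range(len(weights))]
    return [sum(one_hot[i] * weights[i][j] for i in range(len(weights)))
            for j in range(len(weights[0]))]

print(lookup("dog"))
print(one_hot_product("dog"))
```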

Since we cannot easily visualize such high-dimensional vectors, we use t-SNE (t-distributed stochastic neighbor embedding), which reduces the vectors to a 2-dimensional space so that we can plot them. For more information on t-SNE, see https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding. Note that this method is not deterministic, so running the cell multiple times will give a different visualization each time.

The cell below will run t-SNE and plot the word embeddings corresponding to the 1,000 most frequent words on a 2-dimensional plot. You are welcome to increase this threshold if you'd like to see the vectors for more words.

In [20]:
if __name__=='__main__':
    from sklearn.manifold import TSNE
    import numpy as np
    import plotly.express as px
    import pandas as pd
    import warnings
    warnings.filterwarnings("ignore", category=FutureWarning)

    THRESHOLD = 1000
    words = [x[0] for x in sorted(vocab.items(), key = lambda x: -x[1])[:THRESHOLD]]
    idxes = [cbow_dataset.word2idx[word] for word in words]
    vectors = np.array([model.embedding.weight[i].tolist() for i in idxes])

    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, verbose=False)
    new_vectors = tsne_model.fit_transform(vectors)

    df = pd.DataFrame(data={'x': new_vectors[:,0], 'y': new_vectors[:,1], 'word':words})

    fig = px.scatter(df, x='x', y='y', text='word')
    fig.update_traces(textposition='top center')
    fig.update_layout(height=600, title_text='Word Embedding 2D Visualization')
    fig.show()

At a high level, we should see words with similar meanings clustering together. We should also see mini-clusters within this plot.

## Predicting with the Model

In [27]:
# Set the model to evaluation mode
model.to(DEVICE)
model.eval()

# Define our input sentence (as a list of words)
input_sentence = ['you', 'need', 'to', 'define']


input_contexts = []

expected_sentence = []
for i in range(len(input_sentence)):
    context = input_sentence[max(i - CONTEXT_SIZE, 0):i] + input_sentence[i + 1:i + CONTEXT_SIZE + 1]
    expected_sentence.append(input_sentence[max(i - CONTEXT_SIZE, 0):i] + ["___"] + input_sentence[i + 1:i + CONTEXT_SIZE + 1])
    input_contexts.append(context)

# Convert input contexts to tensors
input_tensors = []
for context in input_contexts:
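    # Map out-of-vocabulary words to the unknown token instead of raising a KeyError
    context_indices = [cbow_dataset.word2idx.get(word, cbow_dataset.word2idx[CBOW_UNK]) for word in context]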
    input_tensors.append(torch.tensor(context_indices).to(DEVICE))

# Make predictions using the model
predicted_words = []
for input_tensor in input_tensors:
    with torch.no_grad():
        input_tensor = input_tensor.unsqueeze(0)
        output = model(input_tensor)
        _, predicted_index = output.max(1)
        predicted_word = cbow_dataset.idx2word[predicted_index.item()]
        predicted_words.append(predicted_word)

# Print the predicted words
for i in range(0, len(predicted_words)):
    print(expected_sentence[i], " : ", predicted_words[i])

['___', 'need', 'to', 'define']  :  the
['you', '___', 'to', 'define']  :  is
['you', 'need', '___', 'define']  :  to
['you', 'need', 'to', '___']  :  want
