### Continuous Bag of Words (CBOW) Overview

The Continuous Bag of Words (CBOW) model is a neural network-based approach used in natural language processing (NLP) to generate word embeddings. It is part of the Word2Vec family of models introduced by Mikolov et al. in 2013. The primary goal of the CBOW model is to predict a target word given its surrounding context words within a specified window size.

#### Concept and Understanding

1. **Context and Target Words**:
    - **Context Words**: These are the words surrounding the target word within a specified window size. For example, in the sentence "The sun was setting behind the mountains," if the target word is "was" and the window size is 2, the context words are ["The", "sun", "setting", "behind"].
    - **Target Word**: This is the word that the model aims to predict based on the context words.

2. **Model Architecture**:
    - **Input Layer**: The input layer consists of the context words represented as one-hot encoded vectors.
    - **Hidden Layer**: The hidden layer is a dense layer that projects the input vectors into a lower-dimensional space, creating word embeddings.
    - **Output Layer**: The output layer is a softmax layer that predicts the probability distribution over the entire vocabulary for the target word.

3. **Training Objective**:
    - The CBOW model is trained to maximize the probability of predicting the correct target word given the context words. This is achieved by minimizing the cross-entropy loss between the predicted and actual target words.

4. **Advantages**:
    - **Efficiency**: CBOW is computationally efficient and can be trained on large datasets.
    - **Quality of Embeddings**: The embeddings generated by CBOW capture semantic relationships between words, making them useful for various NLP tasks.

5. **Applications**:
    - **Word Similarity**: Finding similar words based on their embeddings.
    - **Text Classification**: Using word embeddings as features for classification tasks.
    - **Machine Translation**: Improving translation quality by leveraging word embeddings.

Overall, the CBOW model is a powerful tool for generating word embeddings that capture the semantic meaning of words based on their context in a corpus.


### Build sample example CBOW with torch

In [18]:
# Prepare dataset for CBOW model 

# Example dataset generated from https://chatgpt.com/
example_data = [
    "The sun was setting behind the mountains, casting a golden glow over the valley.",
    "She opened the book and found an old letter tucked between the pages.",
    "The cat sat on the windowsill, watching the birds outside with keen interest.",
    "A sudden gust of wind blew the papers off his desk and onto the floor.",
    "They decided to take a road trip along the coast, stopping at every small town.",
    "The scientist carefully recorded the experiment's results in her notebook.",
    "He could hear the distant sound of thunder as dark clouds gathered overhead.",
    "The bakery on the corner fills the street with the smell of fresh bread every morning.",
    "She practiced the piano for hours, determined to perfect the piece before the recital.",
    "A tiny kitten was curled up in a basket by the fireplace, sleeping peacefully.",
    "The detective studied the clues, trying to connect the dots in the mystery case.",
    "They watched as the fireworks exploded in brilliant colors across the night sky.",
    "He found an old photograph of his grandparents when they were young and in love.",
    "The children built a sandcastle at the beach, decorating it with seashells.",
    "The mountain trail was steep and rocky, but the view from the top was breathtaking.",
    "She received a letter from an old friend she hadn't spoken to in years.",
    "The library was quiet except for the sound of pages turning and pencils scratching.",
    "The farmer woke up early to tend to his fields before the sun got too hot.",
    "A soft melody played on the radio as she sipped her coffee and stared out the window.",
    "The ancient ruins stood as a reminder of a once-great civilization.",
]

In [68]:
import json
import torch 
import torch.nn as nn 
import torch.optim as optim

class CBOWDataset:

    def __init__(self, data, window_size=2): 

        """
        Args:
            data: list of sentences
            window_size: number of words to consider before and after the target word
        """
        self.data = data
        self.window_size = window_size
        self.word_to_idx = {}  
        self.idx_to_word = {}
        self.vocab = set()
        self.create_vocab()
        self.create_word_to_idx()
        self.create_data()
        

    def create_vocab(self):
        for sentence in self.data:
            words = sentence.split()
            for word in words:
                self.vocab.add(word)

    def create_word_to_idx(self):
        for idx, word in enumerate(self.vocab):
            self.word_to_idx[word] = idx
            self.idx_to_word[idx] = word

    def create_data(self):
        self.context_words = []
        self.target_word = []

        for sentence in self.data:
            words = sentence.split()
            for i in range(self.window_size, len(words) - self.window_size):
                context = [words[j] for j in range(i - self.window_size, i + self.window_size + 1) if j != i]
                target = words[i]
                self.context_words.append(context)
                self.target_word.append(target)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        context = self.context_words[idx]
        target = self.target_word[idx]
        context_idx = torch.tensor([self.word_to_idx[word] for word in context])
        target_idx = torch.tensor(self.word_to_idx[target])
        return context_idx, target_idx
    
    def get_vocab_size(self):
        return len(self.vocab)
    
    def save_vocab(self, path: str) -> None:
        """
        Save vocabulary and related mappings to JSON file
        
        Args:
            path (str): Path to save vocabulary
        """
        # Save to a JSON file
        with open(path, "w") as json_file:
            json.dump(self.word_to_idx, json_file, indent=4) 

    @staticmethod
    def load_vocab(path: str) -> None:
        """
        Load vocabulary and related mappings from JSON file
        
        Args:
            path (str): Path to load vocabulary
        """
        # Load from a JSON file
        with open(path, "r") as json_file:
            word_to_idx = json.load(json_file)
            idx_to_word = {v: k for k, v in word_to_idx.items()}
            vocab = set(word_to_idx.keys())
        

In [61]:
# Example usage
dataset = CBOWDataset(example_data, window_size=2)
context_words, target_word = dataset.__getitem__(0)
print(context_words, target_word)

tensor([ 48,   7,  54, 175]) tensor(99)


In [62]:
## Define CBOW model 
class CBOW(nn.Module):
    def __init__(self, vocab_size: int, embedding_dim: int):
        """
        Initialize CBOW model
        
        Args:
            vocab_size (int): Size of vocabulary
            embedding_dim (int): Dimension of word embeddings
        """
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)
    
    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        """
        Forward pass of the model
        
        Args:
            inputs (torch.Tensor): Context word indices [batch_size, context_size]
            
        Returns:
            torch.Tensor: Probability distribution over vocabulary
        """
        embeds = self.embeddings(inputs)  # [batch_size, context_size, embed_dim]
        hidden = torch.mean(embeds, dim=1)  # [batch_size, embed_dim]
        output = self.linear(hidden)  # [batch_size, vocab_size]
        return output


In [63]:
# Define model and parameters
vocab_size = dataset.get_vocab_size()
embedding_dim = 100
batch_size = 32
learning_rate = 0.001
num_epochs = 100

In [64]:
# Train CBOW model
dataloader = torch.utils.data.DataLoader(dataset, 
                                        batch_size=batch_size, 
                                        shuffle=True)
    
# Initialize model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = CBOW(vocab_size, embedding_dim).to(device)
print(model)


CBOW(
  (embeddings): Embedding(179, 100)
  (linear): Linear(in_features=100, out_features=179, bias=True)
)


In [65]:
# Training setup
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(num_epochs):
    total_loss = 0
    for batch_idx, (context, target) in enumerate(dataloader):
        
        context = context.to(device)
        target = target.to(device)
        
        # Forward pass
        optimizer.zero_grad()
        output = model(context)
        loss = criterion(output, target)
        
        # Backward pass
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        
    avg_loss = total_loss / len(dataloader)
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}')

Epoch 1/100, Loss: 5.3613
Epoch 2/100, Loss: 5.2994
Epoch 3/100, Loss: 5.2376
Epoch 4/100, Loss: 5.1759
Epoch 5/100, Loss: 5.1144
Epoch 6/100, Loss: 5.0531
Epoch 7/100, Loss: 4.9919
Epoch 8/100, Loss: 4.9308
Epoch 9/100, Loss: 4.8698
Epoch 10/100, Loss: 4.8090
Epoch 11/100, Loss: 4.7484
Epoch 12/100, Loss: 4.6879
Epoch 13/100, Loss: 4.6275
Epoch 14/100, Loss: 4.5672
Epoch 15/100, Loss: 4.5071
Epoch 16/100, Loss: 4.4472
Epoch 17/100, Loss: 4.3874
Epoch 18/100, Loss: 4.3277
Epoch 19/100, Loss: 4.2682
Epoch 20/100, Loss: 4.2089
Epoch 21/100, Loss: 4.1497
Epoch 22/100, Loss: 4.0906
Epoch 23/100, Loss: 4.0318
Epoch 24/100, Loss: 3.9731
Epoch 25/100, Loss: 3.9146
Epoch 26/100, Loss: 3.8562
Epoch 27/100, Loss: 3.7981
Epoch 28/100, Loss: 3.7402
Epoch 29/100, Loss: 3.6825
Epoch 30/100, Loss: 3.6250
Epoch 31/100, Loss: 3.5678
Epoch 32/100, Loss: 3.5108
Epoch 33/100, Loss: 3.4541
Epoch 34/100, Loss: 3.3977
Epoch 35/100, Loss: 3.3416
Epoch 36/100, Loss: 3.2857
Epoch 37/100, Loss: 3.2302
Epoch 38/1

In [55]:
from typing import List, Tuple

def get_similar_words(word: str, 
                     model: CBOW, 
                     dataset: CBOWDataset, 
                     top_k: int = 5) -> List[Tuple[str, float]]:
    """
    Find similar words based on cosine similarity of embeddings
    
    Args:
        word (str): Query word
        model (CBOW): Trained CBOW model
        dataset (CBOWDataset): Dataset containing vocabulary
        top_k (int): Number of similar words to return
        
    Returns:
        List[Tuple[str, float]]: List of (word, similarity) pairs
    """
    if word not in dataset.vocab:
        raise ValueError(f"Word '{word}' not in vocabulary")
    
    # Get word embedding
    word_idx = torch.tensor([dataset.word_to_idx[word]])
    word_embedding = model.embeddings(word_idx)
    
    # Calculate similarities with all words
    all_embeddings = model.embeddings.weight.detach()
    similarities = torch.cosine_similarity(word_embedding, all_embeddings)
    
    # Get top-k similar words
    top_indices = torch.argsort(similarities, descending=True)[1:top_k+1]
    similar_words = [(dataset.idx_to_word[idx.item()], 
                     similarities[idx].item()) 
                    for idx in top_indices]
    
    return similar_words


In [56]:
get_similar_words('sun', model, dataset)

[('ruins', 0.2600654363632202),
 ('off', 0.2595617175102234),
 ('reminder', 0.2342543751001358),
 ('gust', 0.2105642557144165),
 ('distant', 0.2053515762090683)]

In [57]:
get_similar_words('I', model, dataset)

ValueError: Word 'I' not in vocabulary

In [69]:
## Save & load model 
import os
from typing import Dict

def save_model(model: CBOW, dataset: CBOWDataset, save_dir: str) -> None:
    """
    Save CBOW model and vocabulary
    
    Args:
        model (CBOW): Trained CBOW model
        dataset (CBOWDataset): Dataset containing vocabulary
        save_dir (str): Directory to save model and vocabulary
    """
    os.makedirs(save_dir, exist_ok=True)
    
    # Save model state
    model_path = os.path.join(save_dir, 'cbow_model.pt')
    torch.save(model.state_dict(), model_path)
    
    # Save vocabulary
    vocab_path = os.path.join(save_dir, 'vocabulary.json')
    dataset.save_vocab(vocab_path)

def load_model(save_dir: str, embedding_dim: int) -> Tuple[CBOW, Dict]:
    """
    Load saved CBOW model and vocabulary
    
    Args:
        save_dir (str): Directory containing saved model and vocabulary
        embedding_dim (int): Dimension of word embeddings
        
    Returns:
        Tuple[CBOW, Dict]: Loaded model and vocabulary data
    """
    # Load vocabulary
    vocab_path = os.path.join(save_dir, 'vocabulary.json')
    word_to_idx, idx_to_word, vocab = CBOWDataset.load_vocab(vocab_path)
    
    # Initialize and load model
    model = CBOW(len(vocab), embedding_dim)
    model_path = os.path.join(save_dir, 'cbow_model.pt')
    model.load_state_dict(torch.load(model_path))
    model.eval()
    
    return model, word_to_idx

def inference(model, word, word_to_idx):
    word_idx = torch.tensor([word_to_idx[word]])
    word_embedding = model.embeddings(word_idx)
    return word_embedding


In [66]:
save_model(model, dataset, 'cbow_model')

In [None]:
# Example inference using saved model
model, word_to_idx = load_model('cbow_model', embedding_dim)
inference(model, 'sun', word_to_idx)

tensor([[ 8.5920e-01, -4.0341e-02,  1.5539e+00,  3.2055e-01,  9.5291e-01,
          5.9976e-01,  5.4465e-01, -1.6439e+00, -1.5701e+00,  1.1009e-01,
          7.8184e-01, -1.0384e+00,  7.7920e-01,  1.2740e+00, -1.0114e+00,
          7.6585e-01, -2.7919e-01,  4.0247e-01,  1.0488e+00,  1.9079e+00,
         -2.1199e-01, -4.4335e-01, -1.8806e-01,  8.7044e-01, -8.1496e-01,
         -1.5402e-01,  2.3751e+00,  1.9763e-01,  2.7805e-01,  3.4879e-01,
          7.8925e-01,  1.4039e+00, -1.6240e+00, -1.0708e-01,  2.3802e-01,
         -1.2223e+00,  1.1615e+00, -9.0425e-01,  6.5808e-01,  1.8840e-01,
          7.5056e-01, -2.1386e-01, -8.5958e-01,  8.0674e-01, -9.8270e-01,
          1.4049e-01, -3.6771e-01, -9.4613e-01,  2.1960e-01,  7.4400e-01,
          6.8973e-01,  1.6451e+00, -1.3060e+00,  8.5354e-01, -1.1818e+00,
         -1.2987e+00, -6.9507e-01,  1.3776e+00,  9.9991e-01, -7.0496e-01,
         -1.7938e-03,  2.0736e-01, -2.5317e-01,  4.1819e-01,  5.8782e-02,
          2.4086e-01,  2.4027e-01, -4.