# Continuous Bag of Words (CBOW) Model Implementation:


.   This example demonstrates a Python implementation of the Continuous Bag of Words (CBOW) model, a key component of the Word2Vec framework used for generating word embeddings. The CBOW model predicts a target word based on its surrounding context words, capturing semantic relationships in text. This  

In [79]:
# Import necessary libraries for the CBOW model implementation
import unicodedata
import re
import torch
import torch.nn as nn
from torch import optim
import torch.nn.functional as F
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [48]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# Setting Hyperparameters for the CBOW Model:

In [91]:
hidden_layer=90
context_size=2

# Data Preparation:

In [50]:
sentences =" The ancient castle stood tall against the backdrop of mist-covered mountains, its weathered stone walls bearing witness to centuries of history. Inside, torches flickered dimly, casting dancing shadows along the corridors lined with tapestries depicting epic battles and noble quests. Scholars huddled in corners, poring over ancient manuscripts filled with arcane knowledge. The aroma of hearty stew drifted from the castle kitchen, where cooks bustled about preparing meals fit for kings and knights. Outside, in the castle courtyard, knights practiced their swordsmanship under the watchful eye of seasoned masters. It was a place where legends were born and whispered secrets echoed through the halls."

In [112]:
def unicodeToAscii(s):
    """
    Convert Unicode string to plain ASCII by removing diacritics.
    """
    return "".join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

def normalizeString(s):
    """
    Normalize string: lowercase, remove non-letters except punctuation, add space around punctuation.
    """
    s = unicodeToAscii(s.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z!?]+", r" ", s)
    return s.strip()

# Normalize the 'sentences' string
sentences = normalizeString(sentences)


In [52]:
# Create a sorted list of unique words from the sentences
vocab = sorted(list(set(sentences.split())))

# Get the length of the vocabulary
vocab_len = len(vocab)

# Create dictionaries to map indices to words and words to indices
index2word = {i: j for i, j in enumerate(vocab)}
word2index = {j: i for i, j in enumerate(vocab)}



# Building CBOW Data:
The CBOW model requires context-target pairs for training. We define functions to generate context windows and corresponding target words from the tokenized sequences.

In [None]:
# Prepare data and target lists
data = []
target = []

# Loop through the sentence to create context and center words
for i in range(2, len(sentences.split()) - 2):
    context = [word2index[sentences.split()[i-2]], word2index[sentences.split()[i-1]], word2index[sentences.split()[i+1]], word2index[sentences.split()[i+2]]]
    center = [word2index[sentences.split()[i]]]

    # Convert context and center words to one-hot encoded tensors
    a = torch.nn.functional.one_hot(torch.tensor(context), num_classes=vocab_len)
    b = torch.nn.functional.one_hot(torch.tensor(center), num_classes=vocab_len)

    # Append center word to target list and context to data list
    target.append(center)
    data.append(a)

# Convert target and data to tensors
target = torch.tensor(target).squeeze()
data = np.array(data)
input = torch.from_numpy(data)

input, target

# Model Architecture:
The CBOW model architecture consists of an embedding layer followed by a linear layer. We define it by inheriting the nn.Module Pytorch Class as WordEmbeddings. The output of the this model is going to be a word embedding of side of vocabulary.

In [54]:
class cbow(nn.Module):
    def __init__(self, input_size, n):
        """
        Initialize the CBOW model with input and hidden layer sizes.

        Args:
        input_size (int): The size of the input layer.
        n (int): The size of the hidden layer (embedding dimension).
        """
        super(cbow, self).__init__()
        self.n = n
        self.linear1 = nn.Linear(input_size, n)  # Embedding layer
        self.linear2 = nn.Linear(n, input_size)  # Prediction layer

    def forward(self, input):
        """
        Forward pass of the CBOW model.

        Args:
        input (Tensor): The input tensor.

        Returns:
        Tensor: The predicted output.
        """
        input = input.float()
        layer_1 = self.linear1(input)  # Project to embedding space
        ave = torch.sum(layer_1, dim=1) / (2 * context_size)  # Average embeddings
        layer_2 = self.linear2(ave)  # Map to vocabulary space
        y_pred = torch.softmax(layer_2, dim=1)  # Softmax for prediction
        return y_pred

# Note: The weights of the linear1 layer will be used for word embeddings.
# After training, the weights of this layer represent the learned word vectors.

# Training the Model:

In [None]:

# Instantiate the CBOW model
model = cbow(vocab_len, hidden_layer)

# Define the optimizer and loss criterion
optimizer = optim.Adam(model.parameters(), lr=0.1)
criterion = nn.NLLLoss()  # Assuming NLLLoss is the appropriate criterion for this task

# Set the model to training mode
model.train()

# Number of epochs for training
epochs = 10000

# Training loop
for epoch in range(epochs):
    # Forward pass: Compute predicted y by passing x to the model
    y = model(input)

    # Compute the loss
    loss = criterion(y, target)

    # Zero the gradients
    optimizer.zero_grad()

    # Backward pass: Compute gradient of the loss with respect to model parameters
    loss.backward()

    # Update the model parameters
    optimizer.step()

    # Print the loss value for the current epoch
    print(f'Epoch {epoch + 1}, Loss: {loss.item()}')

# Results and Analysis:

In [56]:
# Convert model output probabilities to predicted class indices
_, predicted_classes = torch.max(y, 1)  # Get the indices of the max values along the class dimension

# Calculate the number of correct predictions
correct_predictions = (predicted_classes == target).sum().item()

# Calculate total number of samples
total_samples = len(target)

# Compute accuracy
accuracy = (correct_predictions / total_samples) * 100

print(f'Accuracy: {accuracy:.2f}%')

Accuracy: 98.02%


In [57]:

# Extract the learned word embeddings from the CBOW model.
# `model.linear1.weight.data` contains the weights of the linear layer `linear1`,
# which represent the word embeddings. These embeddings are learned during training
# and are used to capture the semantic meaning of words in the vector space.
word_embeddings = model.linear1.weight.data


# Exploring Word Embeddings in Vector Space:
This section visualizes word embeddings obtained from the CBOW model in a reduced 2D vector space. By applying dimensionality reduction techniques like PCA, we can project high-dimensional word vectors into two dimensions. The resulting scatter plot illustrates how words are distributed and clustered based on their semantic similarities. Interactive tools like Plotly enhance the exploration process, allowing for zooming and panning to examine relationships between words and better understand the underlying structure of the embedding space.

In [108]:
# Fill in the index2word dictionary with "<NaW>" (Not a Word) tokens to match the length of index2word with the hidden layer size
for i in range(vocab_len, hidden_layer):
    index2word[i] = "<NaW>"


In [110]:
import plotly.express as px
import pandas as pd

# Prepare the data for Plotly
df = pd.DataFrame(embeddings_2d, columns=['Component 1', 'Component 2'])
df['Word'] = [index2word[i] for i in range(hidden_layer)]

# Create the scatter plot
fig = px.scatter(df, x='Component 1', y='Component 2', text='Word', title='Word Embeddings Visualization (PCA)')

# Update layout for better readability
fig.update_traces(textposition='top center')
fig.update_layout(
    xaxis_title='Component 1',
    yaxis_title='Component 2',
    title='Word Embeddings Visualization (PCA)',
    template='plotly_white'
)


# Conclusion:
In this example, we implemented the Continuous Bag of Words (CBOW) model using PyTorch. The model takes context words as input and predicts the center word, which is useful for creating word embeddings.

# Summary:
1). Preprocessing: We normalized the input sentences and created mappings between words and indices.                                 
2). Model Implementation: We built a CBOW model using linear layers in PyTorch.              
3). Training: We trained the model using a specified number of epochs and calculated the loss at each step.                                                           
4). Visualization: We visualized the word embeddings using PCA and Plotly for better interpretability.