## Word2Vec Overview
Word2Vec is a model introduced by Mikolov et al. (2013) for learning dense, low-dimensional vector representations (embeddings) of words. The model captures semantic and syntactic relationships between words based on their co-occurrence in a corpus.

Word2Vec offers two main architectures for training:

### 1. Skip-gram Model
Objective: Given a target word, predict the surrounding context words.

Training Data: (Target, Context) pairs.

Mathematical Formulation:

Maximize the probability of context words given a target word:
$$
P(\text{context} \mid \text{target}) = \prod_{w_c \in \text{context}} P(w_c \mid w_t)
$$

The model optimizes the likelihood that a word $w_c$ appears in the context window of the target word $w_t$.
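
Each factor in this product is commonly modeled as a softmax over the vocabulary, computed from the dot product of a target vector and a context vector. The following is a minimal sketch of that computation under assumed toy values: the helper `skipgram_probabilities`, the 5-word vocabulary, the random vectors, and the context indices are all illustrative and do not come from the notebook.

```python
import numpy as np

def skipgram_probabilities(target_idx, W, W_prime):
    """Softmax distribution over all vocabulary words as context of one target word."""
    scores = W_prime @ W[target_idx]   # one dot product per vocabulary word
    scores = scores - scores.max()     # shift scores for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Toy setup (assumed values): 5-word vocabulary, 3-dimensional vectors
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(5, 3))        # target word vectors
W_prime = rng.normal(scale=0.1, size=(5, 3))  # context word vectors

probs = skipgram_probabilities(target_idx=2, W=W, W_prime=W_prime)
context_indices = [1, 3]                      # hypothetical context window
window_probability = np.prod(probs[context_indices])
print(probs, window_probability)
```

Computing the full softmax touches every word in the vocabulary, which is why practical implementations often replace it with an approximation such as negative sampling.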

Characteristics:
- Generates more training samples per word, as it creates a pair for each target-context combination.
- Works well for rare words, as it trains on every occurrence of the target word independently.

Use Cases:

- Tasks where capturing detailed, fine-grained relationships between words is important.
- Smaller datasets, where learning high-quality embeddings for rare words is crucial.

### 2. CBOW (Continuous Bag of Words) Model
Objective: Given a set of context words, predict the target word.

Training Data: (Context, Target) pairs.

Mathematical Formulation:

Maximize the probability of the target word given its context words:
$$
P(\text{target} \mid \text{context}) = P(w_t \mid w_{c_1}, w_{c_2}, \ldots, w_{c_n})
$$

The model optimizes the likelihood that a word $w_t$ is the center of a given set of context words.
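
A minimal sketch of how this probability might be computed, assuming the context vectors are averaged into a single hidden vector before a softmax over the vocabulary. The helper `cbow_probabilities` and the toy vectors are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def cbow_probabilities(context_indices, W, W_prime):
    """Softmax over the vocabulary given the averaged context word vectors."""
    hidden = W[context_indices].mean(axis=0)  # average the context word vectors
    scores = W_prime @ hidden                 # one score per candidate target word
    scores = scores - scores.max()            # shift scores for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Toy setup (assumed values): 5-word vocabulary, 3-dimensional vectors
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(5, 3))        # input (context) word vectors
W_prime = rng.normal(scale=0.1, size=(5, 3))  # output (target) word vectors

probs_forward = cbow_probabilities([0, 1, 3], W, W_prime)
probs_reversed = cbow_probabilities([3, 1, 0], W, W_prime)
print(probs_forward)
```

Because the context vectors are averaged, the order of the context words does not affect the prediction, which is where the "bag of words" in the name comes from.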

Characteristics:
- Aggregates multiple context words to predict the target, which results in fewer training samples compared to Skip-gram.
- Trains faster and is more efficient for frequent words.

Use Cases:
- Tasks where speed and computational efficiency are important.
- Large datasets, where learning embeddings quickly for common words suffices.

### Why Two Architectures?
The choice of Skip-gram or CBOW depends on:

#### Dataset Size:
- Skip-gram works better with small datasets and rare words.
- CBOW is faster and works well with large datasets.

#### Task Requirements:
- Use Skip-gram if capturing fine-grained semantic relationships is crucial.
- Use CBOW for faster, general-purpose embeddings.

#### Computational Resources:
- CBOW requires less computational power and trains faster.
- Skip-gram can be more computationally intensive due to the larger number of training pairs.
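
The difference in the number of training examples can be counted directly. The sketch below uses a hypothetical helper, `count_training_examples`, and assumes one Skip-gram pair per (target, context word) combination and one CBOW example per token position.

```python
def count_training_examples(tokens, window_size):
    """Returns (Skip-gram pair count, CBOW example count) for one tokenized sentence."""
    skipgram_pairs = 0
    cbow_examples = 0
    for idx in range(len(tokens)):
        start = max(0, idx - window_size)
        end = min(len(tokens), idx + window_size + 1)
        context_size = (idx - start) + (end - idx - 1)
        skipgram_pairs += context_size  # one pair per context word
        cbow_examples += 1              # one example per target position
    return skipgram_pairs, cbow_examples

tokens = "the quick brown fox jumps over the lazy dog".split()
counts = count_training_examples(tokens, window_size=2)
print(counts)  # (30, 9)
```

For this nine-word sentence with a window of 2, Skip-gram produces 30 pairs while CBOW produces 9 examples.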

In [3]:
import re

In [4]:
def generate_training_pairs(corpus, word_to_index, window_size=2, model="skipgram"):
    """
    Generates and displays training pairs for Skip-gram and CBOW models.
    
    Parameters:
    - corpus (list of str): A list of documents (each document is a string).
    - word_to_index (dict): A dictionary mapping words to their indices.
    - window_size (int): The size of the context window.
    - model (str): The model type ('skipgram' or 'cbow').
    
    Returns:
    - pairs (list of tuples): A list of training pairs for the specified model.
    """
    pairs = []
    
    # Tokenize the corpus and generate pairs
    for doc in corpus:
        tokens = [word for word in re.findall(r'\w+', doc.lower()) if word in word_to_index]
        for idx, word in enumerate(tokens):
            start = max(0, idx - window_size)
            end = min(len(tokens), idx + window_size + 1)
            context = tokens[start:idx] + tokens[idx + 1:end]
            
            if model == "skipgram":
                # Skip-gram: (target, context)
                pairs.extend([(word, context_word) for context_word in context])
            elif model == "cbow":
                # CBOW: (context, target)
                pairs.append((context, word))
            else:
                raise ValueError("Invalid model type. Choose 'skipgram' or 'cbow'.")
    
    # Display pairs
    for pair in pairs[:10]:  # Display only the first 10 pairs for brevity
        print(pair)
    
    return pairs

In [6]:
corpus = [
    "The quick brown fox jumps over the lazy dog",
    "Never jump over the lazy dog quickly",
    "A quick movement of the enemy will jeopardize six gunboats"
]

# Example vocabulary and word-to-index mapping
word_to_index = {
    "the": 0, "quick": 1, "brown": 2, "fox": 3, "jumps": 4, 
    "over": 5, "lazy": 6, "dog": 7, "never": 8, "jump": 9,
    "quickly": 10, "movement": 11, "enemy": 12, "jeopardize": 13,
    "six": 14, "gunboats": 15
}

print("\n--- Skip-gram Training Pairs ---")
generate_training_pairs(corpus, word_to_index, window_size=3, model="skipgram")

print("\n--- CBOW Training Pairs ---")
generate_training_pairs(corpus, word_to_index, window_size=3, model="cbow")



--- Skip-gram Training Pairs ---
('the', 'quick')
('the', 'brown')
('the', 'fox')
('quick', 'the')
('quick', 'brown')
('quick', 'fox')
('quick', 'jumps')
('brown', 'the')
('brown', 'quick')
('brown', 'fox')

--- CBOW Training Pairs ---
(['quick', 'brown', 'fox'], 'the')
(['the', 'brown', 'fox', 'jumps'], 'quick')
(['the', 'quick', 'fox', 'jumps', 'over'], 'brown')
(['the', 'quick', 'brown', 'jumps', 'over', 'the'], 'fox')
(['quick', 'brown', 'fox', 'over', 'the', 'lazy'], 'jumps')
(['brown', 'fox', 'jumps', 'the', 'lazy', 'dog'], 'over')
(['fox', 'jumps', 'over', 'lazy', 'dog'], 'the')
(['jumps', 'over', 'the', 'dog'], 'lazy')
(['over', 'the', 'lazy'], 'dog')
(['jump', 'over', 'the'], 'never')


[(['quick', 'brown', 'fox'], 'the'),
 (['the', 'brown', 'fox', 'jumps'], 'quick'),
 (['the', 'quick', 'fox', 'jumps', 'over'], 'brown'),
 (['the', 'quick', 'brown', 'jumps', 'over', 'the'], 'fox'),
 (['quick', 'brown', 'fox', 'over', 'the', 'lazy'], 'jumps'),
 (['brown', 'fox', 'jumps', 'the', 'lazy', 'dog'], 'over'),
 (['fox', 'jumps', 'over', 'lazy', 'dog'], 'the'),
 (['jumps', 'over', 'the', 'dog'], 'lazy'),
 (['over', 'the', 'lazy'], 'dog'),
 (['jump', 'over', 'the'], 'never'),
 (['never', 'over', 'the', 'lazy'], 'jump'),
 (['never', 'jump', 'the', 'lazy', 'dog'], 'over'),
 (['never', 'jump', 'over', 'lazy', 'dog', 'quickly'], 'the'),
 (['jump', 'over', 'the', 'dog', 'quickly'], 'lazy'),
 (['over', 'the', 'lazy', 'quickly'], 'dog'),
 (['the', 'lazy', 'dog'], 'quickly'),
 (['movement', 'the', 'enemy'], 'quick'),
 (['quick', 'the', 'enemy', 'jeopardize'], 'movement'),
 (['quick', 'movement', 'enemy', 'jeopardize', 'six'], 'the'),
 (['quick', 'movement', 'the', 'jeopardize', 'six', 'g

In [8]:
import numpy as np
from collections import defaultdict
from itertools import chain
import random

class Word2Vec:
    def __init__(self, vector_size=50, window_size=2, learning_rate=0.01, epochs=10, negative_samples=5):
        """
        Initializes the Word2Vec model.
        
        Parameters:
        - vector_size (int): Dimensionality of word vectors.
        - window_size (int): Context window size.
        - learning_rate (float): Learning rate for gradient descent.
        - epochs (int): Number of training iterations.
        - negative_samples (int): Number of negative samples for each positive sample.
        """
        self.vector_size = vector_size
        self.window_size = window_size
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.negative_samples = negative_samples
        self.vocab = {}
        self.word_to_index = {}
        self.index_to_word = {}
        self.word_counts = defaultdict(int)
        self.W = None  # Target word vectors
        self.W_prime = None  # Context word vectors
    
    def preprocess(self, text):
        """
        Preprocesses the input text (tokenization, lowercasing).
        
        Parameters:
        - text (str): The raw text string.
        
        Returns:
        - tokens (list): A list of preprocessed tokens.
        """
        text = text.lower()
        text = re.sub(r'\W+', ' ', text)
        tokens = text.split()
        return tokens
    
    def build_vocab(self, corpus):
        """
        Builds the vocabulary from the corpus and initializes word vectors.
        
        Parameters:
        - corpus (list of str): A list of documents (each document is a string).
        """
        tokens = list(chain.from_iterable([self.preprocess(doc) for doc in corpus]))
        
        # Count word frequencies
        for token in tokens:
            self.word_counts[token] += 1
        
        # Build vocab with unique words and their indices
        self.vocab = {word for word in self.word_counts}
        self.word_to_index = {word: idx for idx, word in enumerate(self.vocab)}
        self.index_to_word = {idx: word for word, idx in self.word_to_index.items()}
        
        # Initialize word vectors
        vocab_size = len(self.vocab)
        self.W = np.random.rand(vocab_size, self.vector_size)  # Target word vectors
        self.W_prime = np.random.rand(vocab_size, self.vector_size)  # Context word vectors
    
    def generate_training_data(self, corpus):
        """
        Generates training data (target-context pairs) for the Skip-gram model.
        
        Parameters:
        - corpus (list of str): A list of documents.
        
        Returns:
        - training_data (list of tuples): A list of (target, context) word pairs.
        """
        pairs = []
        for doc in corpus:
            tokens = self.preprocess(doc)
            for idx, word in enumerate(tokens):
                target_idx = self.word_to_index[word]
                start = max(0, idx - self.window_size)
                end = min(len(tokens), idx + self.window_size + 1)
                for context_word in tokens[start:idx] + tokens[idx+1:end]:
                    pairs.append((target_idx, self.word_to_index[context_word]))
        return pairs
    
    def sigmoid(self, x):
        """
        Computes the sigmoid of x.
        
        Parameters:
        - x (float): The input value.
        
        Returns:
        - sigmoid (float): Sigmoid output.
        """
        return 1 / (1 + np.exp(-x))
    
    def train(self, corpus):
        """
        Trains the Word2Vec model using the Skip-gram architecture and negative sampling.
        
        Parameters:
        - corpus (list of str): A list of documents.
        """
        training_data = self.generate_training_data(corpus)
        vocab_size = len(self.vocab)
        
        for epoch in range(self.epochs):
            loss = 0
            for target_idx, context_idx in training_data:
                # Positive sample. Copy the target vector: self.W[target_idx] is a
                # NumPy view, so in-place updates would otherwise leak into later gradients.
                target_vector = self.W[target_idx].copy()
                context_vector = self.W_prime[context_idx]
                target_grad = np.zeros(self.vector_size)  # Accumulated update for the target vector

                positive_score = self.sigmoid(np.dot(target_vector, context_vector))
                loss += -np.log(positive_score)

                # Gradient update for positive sample
                grad = self.learning_rate * (1 - positive_score)
                target_grad += grad * context_vector
                self.W_prime[context_idx] += grad * target_vector

                # Negative sampling (drawn uniformly from the vocabulary)
                for _ in range(self.negative_samples):
                    negative_idx = random.randint(0, vocab_size - 1)
                    if negative_idx == context_idx:
                        continue
                    negative_vector = self.W_prime[negative_idx]

                    negative_score = self.sigmoid(-np.dot(target_vector, negative_vector))
                    loss += -np.log(negative_score)

                    # Gradient update for negative sample
                    grad_neg = self.learning_rate * (1 - negative_score)
                    target_grad -= grad_neg * negative_vector
                    self.W_prime[negative_idx] -= grad_neg * target_vector

                # Apply the accumulated update to the target vector once per pair
                self.W[target_idx] += target_grad
            
            print(f"Epoch {epoch + 1}/{self.epochs}, Loss: {loss:.4f}")
    
    def get_embedding(self, word):
        """
        Retrieves the embedding for a given word.
        
        Parameters:
        - word (str): The target word.
        
        Returns:
        - embedding (numpy array): The word embedding vector.
        """
        idx = self.word_to_index.get(word)
        if idx is not None:
            return self.W[idx]
        else:
            raise ValueError(f"Word '{word}' not in vocabulary.")
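
The update rules in `train` follow from the negative-sampling loss for a single (target, context) pair. The derivation below is a sketch in assumed notation: $v_t$ is the row of `W` for the target word, $u_c$ is the row of `W_prime` for the context word, $u_k$ are the rows of `W_prime` for the $K$ negative samples, and $\sigma$ is the sigmoid function.

$$
L = -\log \sigma(v_t \cdot u_c) - \sum_{k=1}^{K} \log \sigma(-v_t \cdot u_k)
$$

Using $\frac{d}{dx}\left[-\log \sigma(x)\right] = -(1 - \sigma(x))$, the gradients are:

$$
\frac{\partial L}{\partial u_c} = -\bigl(1 - \sigma(v_t \cdot u_c)\bigr)\, v_t,
\qquad
\frac{\partial L}{\partial u_k} = \bigl(1 - \sigma(-v_t \cdot u_k)\bigr)\, v_t
$$

$$
\frac{\partial L}{\partial v_t} = -\bigl(1 - \sigma(v_t \cdot u_c)\bigr)\, u_c + \sum_{k=1}^{K} \bigl(1 - \sigma(-v_t \cdot u_k)\bigr)\, u_k
$$

Gradient descent subtracts `learning_rate` times each gradient. This is why the positive update adds `learning_rate * (1 - positive_score)` times the paired vector, while the negative update subtracts `learning_rate * (1 - negative_score)` times the paired vector.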

In [9]:
# Initialize and train Word2Vec model
w2v_model = Word2Vec(vector_size=10, window_size=2, epochs=10)
w2v_model.build_vocab(corpus)
w2v_model.train(corpus)

# Retrieve embeddings
embedding = w2v_model.get_embedding("quick")
print(f"Embedding for 'quick':\n{embedding}")

Epoch 1/10, Loss: 929.5796
Epoch 2/10, Loss: 679.6919
Epoch 3/10, Loss: 542.0443
Epoch 4/10, Loss: 474.1075
Epoch 5/10, Loss: 422.4208
Epoch 6/10, Loss: 397.3668
Epoch 7/10, Loss: 370.7108
Epoch 8/10, Loss: 354.4496
Epoch 9/10, Loss: 338.7151
Epoch 10/10, Loss: 322.8259
Embedding for 'quick':
[ 0.35401816 -0.26008658  0.41378339 -0.18623707 -0.12941121 -0.25738425
  0.1542014  -0.57640416 -0.52509664 -0.14805118]
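
Embeddings such as the one printed above are usually compared with cosine similarity. The following is a minimal sketch using a hypothetical helper, `cosine_similarity`, and toy vectors that stand in for trained embeddings. The values are illustrative only.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1 = same direction, -1 = opposite)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors standing in for trained embeddings (illustrative values only)
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # same direction as a
c = np.array([-1.0, -2.0, -3.0])  # opposite direction to a

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(sim_ab, sim_ac)
```

With the model trained above, the same helper could be applied to `w2v_model.get_embedding("quick")` and `w2v_model.get_embedding("quickly")`. On a three-sentence corpus the resulting similarities should not be expected to be meaningful.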


## Notes

What are the consequences of increasing the context window?

#### 1. Broader Contextual Information
Effect: A larger window size means the model will consider more words around the target word as part of its context.

Consequence:
- Captures Broader Semantic Relationships:
    - With a larger window, the model learns more about general topics or themes. For example, it can relate "dog" to "park" or "walk" even if they’re farther apart in a sentence.
    - Useful for tasks that benefit from capturing document-level semantics, such as topic modeling or text classification.
- Less Focus on Local Context:
    - Smaller window sizes capture syntactic relationships (e.g., subject-verb-object), while larger windows focus more on semantic relationships (e.g., broader themes).

#### 2. Increased Noise in Context
Effect: With a larger window, more context words are included, but not all of them may be strongly related to the target word.

Consequence:
- Potential for Noisier Training Data:
    - Words that are less relevant to the target (e.g., function words or distant topics) might dilute the training signal.
    - For instance, in the sentence "The quick brown fox jumps over the lazy dog," a window size of 5 might pair "fox" with "dog," which is semantically weaker compared to "jumps" or "brown."
- Harder to Learn Precise Relationships:
    - The model may struggle to distinguish between strongly and weakly related words if the window size includes too many irrelevant pairs.

#### 3. Word Embedding Generalization
Effect: Larger windows help capture high-level, general associations between words.

Consequence:
- Embeddings Capture Broader Themes:
    - Word embeddings become more generalized, representing concepts that are coarser but applicable across broader contexts.
    - Example: With a large window size, the embeddings for "king" and "queen" might show strong similarity not only due to gender-specific words in their immediate vicinity but also due to broader topics like "monarchy" or "royalty."

#### 4. Computational and Memory Costs
Effect: More context words lead to more training pairs.

Consequence:
- Increased Computational Complexity:
    - For each target word, a larger window size generates more (target, context) pairs, which means more updates during training.
    - This increases the time and computational resources required for training.
- Higher Memory Usage:
    - Storing and processing a larger number of training pairs can consume more memory, especially in large corpora.
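
The growth in training pairs can be estimated with a short count. This sketch uses a hypothetical helper, `skipgram_pair_count`, and assumes a single sentence of 100 tokens with a symmetric window clipped at the sentence boundaries.

```python
def skipgram_pair_count(num_tokens, window_size):
    """Number of (target, context) pairs produced by one sentence of num_tokens words."""
    total = 0
    for idx in range(num_tokens):
        left = min(idx, window_size)                    # context words to the left
        right = min(num_tokens - 1 - idx, window_size)  # context words to the right
        total += left + right
    return total

pair_counts = {w: skipgram_pair_count(num_tokens=100, window_size=w) for w in (1, 2, 5, 10)}
for window_size, count in pair_counts.items():
    print(window_size, count)
```

For this sentence length the counts are 198, 394, 970, and 1890, so the number of pairs grows roughly linearly with the window size.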

#### 5. Impact on Skip-gram and CBOW
Skip-gram:
- Larger windows increase the number of context words it tries to predict for each target word.
- Effect:
    - Potentially more accurate embeddings for rare words due to increased training data.
    - May result in noisier predictions for each context word if distant words are less related.

CBOW:
- Larger windows mean the model will aggregate more context words to predict the target.
- Effect:
    - May dilute the influence of strongly related context words.
    - Requires efficient handling of larger input contexts during each training iteration.

#### Task-specific window size recommendations:

| Task | Recommended window size | Reason |
| :---- | :-----: | :------ |
| Syntactic analysis | Small (1-2) | Captures word order and grammatical roles |
| Semantic similarity | Medium (3-5) | Balances local semantics with broader context |
| Topic modeling | Large (5-10+) | Focuses on document-level themes |
| Recommendation systems | Large (5-10+) | Captures weak signals and broad relationships |

#### In practice
The right window size is domain-specific. A reasonable approach is to start small (2-3) so the model focuses on strong local relationships, then gradually increase the window size and evaluate how it affects downstream tasks such as classification, similarity, or clustering.

With technical texts, smaller windows are preferred to capture precise relationships. Narrative texts may benefit from larger windows that capture themes and broader context.