# Word2Vec (PyTorch Guide)

Word2Vec is a set of techniques for representing words as numerical vectors.  
These vectors are positioned in a high-dimensional space so that **similar words are close together**.  
These vectors are called **word embeddings**.

For example:  
- Words like *"king"* and *"queen"* will have vectors near each other.  
- A word like *"book"* will be positioned farther away.  
- Simple vector arithmetic often captures analogies, e.g., **king − man + woman ≈ queen**.  

This representation allows machine learning models to capture **semantic relationships** between words.
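
To make "close together" concrete: closeness between embeddings is commonly measured with cosine similarity. The sketch below uses hand-picked 3-dimensional toy vectors (an assumption for illustration only, not trained embeddings) to show how a similarity lookup and the analogy arithmetic could be computed. The names `toy_embeddings`, `cosine_similarity`, `analogy`, and `nearest` are hypothetical helpers.

```python
import math

# Hypothetical toy vectors chosen by hand for illustration; real embeddings are learned.
toy_embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "book":  [-0.7, 0.1, 0.1],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Element-wise a - b + c."""
    return [x - y + z for x, y, z in zip(a, b, c)]

def nearest(vector, exclude=()):
    """Word whose toy vector has the highest cosine similarity to `vector`."""
    candidates = [w for w in toy_embeddings if w not in exclude]
    return max(candidates, key=lambda w: cosine_similarity(vector, toy_embeddings[w]))

sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_book = cosine_similarity(toy_embeddings["king"], toy_embeddings["book"])
print(sim_king_queen > sim_king_book)

query = analogy(toy_embeddings["king"], toy_embeddings["man"], toy_embeddings["woman"])
print(nearest(query, exclude={"king", "man", "woman"}))
```

With these toy values the analogy lands on `queen` by construction; trained embeddings only approximate this behavior.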

## Setup

### Step 1: Familiarize yourself with the training data.

In this notebook, we use 100 sentences about ML that I scraped from Reddit.

In [11]:
import pandas as pd

df = pd.read_csv("../../datasets/ML_Sentences.csv")

print(df.head())

                                            sentence
0                 Make computer understand patterns.
1                      It's glorified curve fitting.
2  Algorithms that make predictions based on prev...
3  Machine Learning is using data to answer quest...
4  Iterative problem solving using computers for ...


### Step 2: Build the vocabulary. 

In [33]:
# load a simple tokenizer
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer("basic_english")

# build_vocab_from_iterator expects an iterable of token lists
def tokenize(sentences):
    for s in sentences:
        yield tokenizer(s)
sentences = df["sentence"].astype(str).tolist()
tokenized_data = tokenize(sentences)

# build vocabulary 
from torchtext.vocab import build_vocab_from_iterator
vocab = build_vocab_from_iterator(tokenized_data, specials=['unk'])
vocab.set_default_index(vocab["unk"])

# demonstration
sample = "I wish I were an ML expert." 
tokens = tokenizer(sample)
print(tokens)

# translate a list of tokens into vocabulary indices
def text_pipeline(tokens):
    return [vocab[token] for token in tokens]
print(text_pipeline(tokens))

['i', 'wish', 'i', 'were', 'an', 'ml', 'expert', '.']
[17, 0, 17, 0, 22, 472, 0, 1]
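
In the output above, `wish`, `were`, and `expert` map to index 0: they do not occur in the corpus, so the lookup falls back to the default index, which is the `unk` special token. The following is a minimal pure-Python stand-in for that behavior. It is not torchtext's implementation; `build_toy_vocab`, `toy_corpus`, and `toy_pipeline` are hypothetical names, and the frequency-then-alphabetical ordering is an assumption made for this sketch.

```python
from collections import Counter

def build_toy_vocab(tokenized_sentences, unk_token="unk"):
    """Map tokens to indices: unk_token gets 0, the rest are ordered by frequency."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    # Sort by descending count, breaking ties alphabetically (an assumption of this sketch)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = [unk_token] + [tok for tok, _ in ordered if tok != unk_token]
    return {tok: idx for idx, tok in enumerate(itos)}

toy_corpus = [["i", "like", "ml", "."], ["ml", "is", "fun", "."]]
toy_vocab = build_toy_vocab(toy_corpus)

def toy_pipeline(tokens):
    """Look up each token, falling back to the unk index for unseen tokens."""
    return [toy_vocab.get(tok, toy_vocab["unk"]) for tok in tokens]

print(toy_pipeline(["i", "wish", "ml", "."]))  # [4, 0, 2, 1]
```

Here `wish` is unseen in `toy_corpus`, so it maps to 0, mirroring the zeros in the notebook output.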


## Continuous Bag of Words (CBOW)

The **Continuous Bag of Words (CBOW)** model predicts a **target word** from a fixed-size **context window** of surrounding words. The context consists of the words that appear before and after the target.

**Example:**  
Sentence:  
`I wish I were an ML expert`  

With a **context window size of 2**, the context for predicting the second `I` (the third word) is:  `["I", "wish", "were", "an"]`.
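
The example above can be reproduced with a few lines of plain Python. This is only a sketch: `context_for` is a hypothetical helper, and it truncates the context at sentence boundaries, which is an assumption (positions without a full window could also simply be skipped).

```python
def context_for(tokens, position, window=2):
    """Return up to `window` tokens on each side of `position`, excluding the target."""
    before = tokens[max(0, position - window):position]
    after = tokens[position + 1:position + 1 + window]
    return before + after

example_tokens = ["I", "wish", "I", "were", "an", "ML", "expert"]

# Target is the second "I" at index 2
print(context_for(example_tokens, 2))  # ['I', 'wish', 'were', 'an']

# At the start of the sentence only the right-hand context exists
print(context_for(example_tokens, 0))  # ['wish', 'I']
```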

**Training Data:**  
The training data is structured as pairs **(x, y)**, where:  
- **x** is the input context: $(w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2})$
- **y** is the target word to predict: $w_t$

The model learns to estimate:  
$$
P(w_t \mid w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2})
$$
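
One common way to model this probability is to average the context word embeddings, project the result onto the vocabulary, and apply a softmax. The sketch below follows that formulation; it is an illustrative assumption rather than necessarily the model used later in this notebook, and `TinyCBOW`, the vocabulary size, and the embedding dimension are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCBOW(nn.Module):
    """Hypothetical minimal CBOW: average context embeddings, then score every word."""

    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.output = nn.Linear(embed_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch_size, 2 * window)
        embedded = self.embeddings(context_ids)  # (batch_size, 2 * window, embed_dim)
        averaged = embedded.mean(dim=1)          # (batch_size, embed_dim)
        logits = self.output(averaged)           # (batch_size, vocab_size)
        return F.log_softmax(logits, dim=1)      # log P(w_t | context)

torch.manual_seed(0)
model = TinyCBOW(vocab_size=10, embed_dim=4)

# Two hypothetical contexts, each holding 4 token ids (window of 2 on each side)
context_ids = torch.tensor([[1, 2, 4, 5], [3, 4, 6, 7]])
log_probs = model(context_ids)

print(log_probs.shape)              # torch.Size([2, 10])
print(log_probs.exp().sum(dim=1))   # each row sums to approximately 1
```

Returning log-probabilities pairs naturally with `nn.NLLLoss`; returning raw logits and using `nn.CrossEntropyLoss` would be an equivalent design.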

In [37]:
CONTEXT_WINDOW = 2

def setup_training_data(tokenized_sentences):
    training_data = []
    for s in tokenized_sentences:
        for i in range(CONTEXT_WINDOW, len(s) - CONTEXT_WINDOW):
            # CONTEXT_WINDOW tokens on each side of position i (works for any window size)
            before = s[i - CONTEXT_WINDOW:i]
            after = s[i + 1:i + CONTEXT_WINDOW + 1]
            y = s[i]
            training_data.append((tuple(before + after), y))
    return training_data

# Demonstration
training_data = setup_training_data(tokenize(sentences))
print(training_data[:5])

[(('make', 'computer', 'patterns', '.'), 'understand'), (('it', "'", 'glorified', 'curve'), 's'), (("'", 's', 'curve', 'fitting'), 'glorified'), (('s', 'glorified', 'fitting', '.'), 'curve'), (('algorithms', 'that', 'predictions', 'based'), 'make')]


The `collate_batch` function prepares training batches by converting each `(context, target)` pair into vocabulary indices.  
It returns **two tensors**:  
- A **context tensor** of shape `(batch_size, context_window * 2)`  
- A **target tensor** of shape `(batch_size,)`  
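
The two shapes listed above can be checked on a few hypothetical token ids. The sketch also contrasts `torch.stack`, which adds a batch dimension, with `torch.cat`, which joins 1-D tensors end to end and therefore loses the per-example rows. The ids and variable names are made up for illustration.

```python
import torch

# Hypothetical token ids: a batch of 3 examples with a context window of 2
toy_contexts = [[5, 9, 2, 1], [7, 3, 8, 1], [4, 4, 6, 2]]
toy_targets = [11, 12, 13]

context_rows = [torch.tensor(c, dtype=torch.int64) for c in toy_contexts]

stacked = torch.stack(context_rows)   # adds a new leading batch dimension
flattened = torch.cat(context_rows)   # concatenates along the existing dimension
toy_target_tensor = torch.tensor(toy_targets, dtype=torch.int64)

print(stacked.shape)            # torch.Size([3, 4])
print(flattened.shape)          # torch.Size([12])
print(toy_target_tensor.shape)  # torch.Size([3])
```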

In [44]:
import torch

# Device for training: use GPU (CUDA) if available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    context_list, target_list = [], []
    for context, target in batch:
        _context = torch.tensor(text_pipeline(context), dtype=torch.int64)
        context_list.append(_context)
        target_list.append(vocab[target])

    # stack (not cat) keeps one row per example: (batch_size, 2 * CONTEXT_WINDOW)
    context_tensor = torch.stack(context_list).to(device)
    target_tensor = torch.tensor(target_list, dtype=torch.int64).to(device)
    
    return context_tensor, target_tensor

# Demonstration
context_tensor, target_tensor = collate_batch(training_data[:5])
print(