# Word Embedding
- What is Word Embedding?
    - A technique to represent **words as dense vectors** in a lower-dimensional space.
    - Captures **semantic relationships** between words based on their usage in a corpus.
    - Similar words have close vector representations.
        - E(King) - E(Man) + E(Woman) ≈ E(Queen)
- Why Use Word Embeddings Instead of One-Hot Encoding?
    - One-hot encoding is sparse and high-dimensional (e.g., for 50K words, it needs a 50K-dimensional vector).
    - Word embeddings are dense and lower-dimensional (e.g., 100-300 dimensions), making them efficient.
    - Embeddings capture meaning—words with similar meanings have closer vectors.
    - One-hot encoding treats all words as independent, whereas embeddings learn relationships.
- Common Word Embedding Models
    - Word2Vec (CBOW & Skip-Gram) – Predicts words based on context.
    - GloVe – Uses word co-occurrence matrices to find relationships.
    - FastText – Embeds subword information, useful for rare words.
    - Transformer-based Embeddings (BERT, GPT) – Contextual embeddings, meaning changes based on sentence structure.
- How Word Embeddings are Learned?
    - Word2Vec (CBOW & Skip-Gram): Predicts missing words in a context.
    - Matrix Factorization (GloVe): Learns word relationships from co-occurrence statistics.
    - Neural Networks (BERT, GPT): Uses deep learning models for contextual understanding.




In [229]:
import builtins
import torch
import torchtext
import collections
from collections import Counter
import os
from torch.utils.data import Dataset, DataLoader
from datasets import load_dataset
import re
import itertools
import torch.nn as nn
from tqdm import tqdm
import torch.optim as optim
import torch.nn.functional as F

In [208]:
# define constant
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = torchtext.data.utils.get_tokenizer("basic_english")
BATCH_SIZE = 128
VOCAB_SIZE = 10_000
EMBEDDING_DIM = 256
TOTAL_EPOCH = 5
MAX_TRAINING_EXAMPLE = 50_000

# Step 1: Load and preprocess the data

In [69]:
print("Loading dataset....")
train_iter = load_dataset("embedding-data/simple-wiki", split='train')
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # Remove punctuation
    tokens = tokenizer(text) # tokenize words
    
    return tokens
corpus = [preprocess_text(".".join(single_example["set"])) for single_example in train_iter]

# Step 2: Build vocabulary

In [209]:
all_words = list(itertools.chain(*corpus))
# get unique words
word_frequncy = Counter(all_words)
all_words_sorted_by_frequency = sorted(all_words, key=lambda x: word_frequncy[x])
vocabulary = all_words_sorted_by_frequency[:VOCAB_SIZE - 1] # last one we will keep for unknown words

unique_words = list(set(all_words))
print("Total words:", len(all_words))
print("Total unique words:", len(unique_words))
print("Vocabulary size: ", len(vocabulary) + 1)

# word to index mapper and index to word mapper
word_to_index = {i: word for word, i in enumerate(vocabulary)}
index_to_word = {i: word for word, i in word_to_index.items()}

Total words: 4515153
Total unique words: 118134
Vocabulary size:  10000


# Step 3: Generate training data

![alt text](https://i.sstatic.net/Urqj0.png)

In [210]:
# CoB
WINDOW_SIZE = 2

def get_training_pairs(algorithm="cbow", window_size=WINDOW_SIZE):
    training_pairs = []

    for sentence in corpus:
        # convert the word into index
        sentence_with_index = [word_to_index.get(word, VOCAB_SIZE -  1) for word in sentence]        
        # Algorithm for Continious bag of words
        # ---?
        if algorithm == "cbow":
            for index in range(window_size, len(sentence_with_index)):
                training_pair = sentence_with_index[index-window_size:index+1]
                training_pairs.append(training_pair)
        else: # skip_gram
            # Algorithm for skip_gram
            # ??_??
            for index in range(window_size, len(sentence_with_index) - window_size):
                training_pair = sentence_with_index[index-window_size : index] + sentence_with_index[index + 1 : index + window_size+1] + [sentence_with_index[index]]
                training_pairs.append(training_pair)
            
    return training_pairs

training_pairs = get_training_pairs()
# Convert training pairs to tensors
train_data = torch.tensor(training_pairs, dtype=torch.long)

In [211]:
# create data loader
class TextDataset(Dataset):
    def __init__(self, data):
        self.data = data[:MAX_TRAINING_EXAMPLE]
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, index):
        return self.data[index][:-1], self.data[index][-1]
    

training_dataset = TextDataset(train_data)
    
train_dataloader = DataLoader(training_dataset, batch_size=BATCH_SIZE)

# Step 4: Define word embedding model

- Embedding layer is just linear layer with efficient mat multiplications and without bias term
- Why embedding layer doesn't uses bias terms:
    - Any bias term would apply **the same shift to all words**, which isn’t useful.

In [212]:
class Word2Vec(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(Word2Vec, self).__init__()
        # Instead of an embedding layer, we could use a Linear layer without the bias term
        # However, using a Linear layer requires inputting one-hot encoded vectors, 
        # which are sparse and inefficient for large vocabularies.
        # The embedding layer directly maps word indices to dense vectors, making it memory-efficient.
        self.embedding_layer = nn.Embedding(vocab_size, embedding_dim)
        # The output layer projects the embeddings to a probability distribution over the vocabulary.
        # This is similar to a softmax layer in a classification model.
        self.output_layer = nn.Linear(embedding_dim, vocab_size)
    
    def forward(self, x):
        # The embedding layer returns a tensor of shape (batch_size, window, embedding_dim).
        # We take the mean along the context dimension to get a single vector per batch element.
        embed = self.embedding_layer(x).mean(dim=1)
        output = self.output_layer(embed)
        return output

model = Word2Vec(VOCAB_SIZE, EMBEDDING_DIM)

# Step 5: Train the model

In [214]:
model.train()
total_loss = 0.0
# 
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()
for epoch in range(TOTAL_EPOCH):
    epoch_loss = 0.0
    total_batch = 0
    with tqdm(train_dataloader, desc=f"Training Epoch {epoch + 1}", leave=False) as tbar:
        for inputs_batch, label_batch in tbar:
            
            # Forward pass
            predictions = model(inputs_batch)
            loss = loss_fn(predictions, label_batch)
            
            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            
            epoch_loss += loss.item()
            total_batch += 1
            tbar.set_postfix(loss=epoch_loss/total_batch)
    # print("Epoch: ", epoch + 1, "/", TOTAL_EPOCH)
    # print("\t\t Loss: ", round(total_loss, 2))


# Return average loss
# return total_loss / len(data_loader)

                                                                                

# Step 5: Find similar words

In [237]:
def find_most_similar_words(model, word, word_to_index, top_k = 5):
    if word not in word_to_index: 
        print("Word not found in the vocabulary")
        return 
    # step 1: Get the index of the word from the vocbulary. That index is also the index in the embedding layer
    word_index = torch.tensor([word_to_index[word]], dtype=torch.long)
    # step 2: Get the embeddings of that word index from the embedding layers
    model.eval()
    with torch.no_grad():
        word_embedding = model.embedding_layer(word_index) # shape: (1, EMBEDDING_DIM)
        word_embedding = word_embedding.squeeze(0) # shape(EMBEDDING_DIM, )
    
    # step 3: Get all the embedding from the model
    all_embeddings = model.embedding_layer.weight # shape (Vocab_size, embedding_dim)

    # step 4: Find cosine similarity of the given words with all embeddings
    similarities = F.cosine_similarity(word_embedding, all_embeddings) # shape: (Vocab_size, )
    # Get top k words
    similar_indices = torch.argsort(similarities, descending=True).tolist()
    # remove the given word
    similar_indices.remove(word_index.item())
    # Get top K word:
    similar_words = [index_to_word[index] for index in similar_indices[:top_k]]
    similar_words_txt = ", ".join(similar_words)
    print("Given word:", word)
    print("Similar words: ", similar_words_txt)

find_most_similar_words(model, "comforts", word_to_index, 5)

Given word: comforts
Similar words:  naujamiestis, halogen, takur, saville, orientalis
