## Word2Vec from Scratch

We will build a Word2Vec model from scratch.
- What is it? A method for learning a vector representation of each word, called a word embedding.
- Why is it useful? The vectors aim to capture the meaning of words and their relationships to other words, so in the famous example 'king' and 'queen' end up close together in the vector space.
- It might feel like a feed-forward neural network (FFNN), and structurally it is one. The difference is the goal: instead of training a classifier for its predictions, we train it to learn the word embeddings (the semantics of words).

There are two ways to do it:
1. Continuous Bag of Words (CBOW) tries to predict a word in a sentence, given its surrounding neighbour words.
eg: the quick brown _____ jumps over the lazy dog. We use the surrounding words 'brown' and 'jumps' to predict the missing word 'fox'.

2. Skip-gram is the reverse: given a word, it predicts that word's neighbours.
eg: the quick ______ fox ______ over the lazy dog. Given 'fox', we try to predict the missing neighbours 'brown' and 'jumps'.
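
To make the difference concrete, here is a minimal sketch of the training examples each approach would build around 'fox'. The helper `context_of` and the window of 1 are assumptions chosen to match the 'brown'/'jumps' example, not part of the notebook's pipeline.

```python
from typing import List, Tuple

sentence = "the quick brown fox jumps over the lazy dog".split()
window = 1  # assumption: one neighbour on each side, as in the example above


def context_of(tokens: List[str], i: int, window: int) -> List[str]:
    """Return the words within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


i = sentence.index("fox")
context = context_of(sentence, i, window)

# CBOW: all context words together -> one target word
cbow_example: Tuple[List[str], str] = (context, sentence[i])

# Skip-gram: one center word -> each context word as a separate example
skip_gram_examples: List[Tuple[str, str]] = [(sentence[i], c) for c in context]

print(cbow_example)        # (['brown', 'jumps'], 'fox')
print(skip_gram_examples)  # [('fox', 'brown'), ('fox', 'jumps')]
```

CBOW produces one example per position, while Skip-gram produces one example per (center, neighbour) pair, which is the form used in the rest of this notebook.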


## Step 1: Set up the dataset

I'll go with the Reddit text corpus from ConvoKit today. Its reported stats are:

- Number of Utterances: 297132
- Number of Speakers: 119889
- Number of Conversations: 8286

We'll import the dataset, then see what we're working with.

In [1]:
from convokit import Corpus, download
corpus = Corpus(filename=download("reddit-corpus-small"))


Dataset already exists at C:\Users\Eunha\.convokit\saved-corpora\reddit-corpus-small


In [30]:
from itertools import islice
from typing import Dict, List, Sequence, Tuple
MAX_UTTERANCES = 10000
chosen_utts = [utt.text for utt in islice(corpus.iter_utterances(), MAX_UTTERANCES)]

print(len(chosen_utts), "utterances loaded")

10000 utterances loaded


Okay, we have loaded 10,000 utterances. Before training, let's tokenise them to remove variations of the same word that might trip up the model, e.g. lowercasing so that 'Talk' and 'talk' aren't treated as different words during training.

In [None]:
import re
def tokenise(text: str) -> List[str]:
    # Lowercase, then keep alphanumeric runs, allowing one internal apostrophe (e.g. "don't")
    return re.findall(r"[a-z0-9]+'[a-z0-9]+|[a-z0-9]+", text.lower(), flags=re.I)
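
As a quick check of what the pattern keeps and drops, here is the same function run on a made-up sentence (the sample text is hypothetical, not taken from the corpus).

```python
import re
from typing import List


def tokenise(text: str) -> List[str]:
    # Same pattern as the cell above: alphanumeric runs, optionally with one internal apostrophe
    return re.findall(r"[a-z0-9]+'[a-z0-9]+|[a-z0-9]+", text.lower(), flags=re.I)


sample = "Don't TALK to me... talk to u/Someone_Else!"
sample_tokens = tokenise(sample)
print(sample_tokens)
# ["don't", 'talk', 'to', 'me', 'talk', 'to', 'u', 'someone', 'else']
```

Note that punctuation and underscores act as separators, so a username like `Someone_Else` is split into two tokens.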

Now that each utterance is cleaned and broken down into simple tokens, we can build a vocabulary that maps every unique word to an integer index and back.

In [None]:
def build_vocab(tokenised_utts: List[List[str]]) -> Tuple[List[str], Dict[str, int], Dict[int, str]]:
    # Flatten tokenised utterances into a single list of words
    words: List[str] = [token for utt in tokenised_utts for token in utt]
    # Unique words; set order is arbitrary, so the indices can differ between runs
    vocab = set(words)
    word_to_idx: Dict[str, int] = {word: idx for idx, word in enumerate(vocab)}
    idx_to_word: Dict[int, str] = {idx: word for word, idx in word_to_idx.items()}
    return words, word_to_idx, idx_to_word


def tokens_to_ids(tokenised_utts: List[List[str]], word_to_idx: Dict[str, int]) -> List[List[int]]:
    return [[word_to_idx[w] for w in utt if w in word_to_idx] for utt in tokenised_utts]

tokenised_utts = [tokenise(utt) for utt in chosen_utts]
words, word_to_idx, idx_to_word = build_vocab(tokenised_utts)

print(idx_to_word[105])

utt_ids = tokens_to_ids(tokenised_utts, word_to_idx)



gloss
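
Because the vocabulary is built from a `set` of strings, the word at a given index (such as 105 above) is not guaranteed to be the same in another Python session. If reproducible indices matter, one option is to sort the vocabulary first. The sketch below uses a hypothetical `build_sorted_vocab` helper and a toy corpus to show the idea; it is an alternative, not what the notebook ran.

```python
from typing import Dict, List, Tuple


def build_sorted_vocab(tokenised_utts: List[List[str]]) -> Tuple[Dict[str, int], Dict[int, str]]:
    # Hypothetical variant of build_vocab: sorting fixes the index order across runs
    vocab = sorted({token for utt in tokenised_utts for token in utt})
    w2i = {word: idx for idx, word in enumerate(vocab)}
    i2w = {idx: word for word, idx in w2i.items()}
    return w2i, i2w


toy_utts = [["the", "quick", "fox"], ["the", "lazy", "dog"]]
toy_word_to_idx, toy_idx_to_word = build_sorted_vocab(toy_utts)
print(toy_word_to_idx)  # {'dog': 0, 'fox': 1, 'lazy': 2, 'quick': 3, 'the': 4}

# Round trip: words -> ids -> words
toy_ids = [[toy_word_to_idx[w] for w in utt] for utt in toy_utts]
print(toy_ids)  # [[4, 3, 1], [4, 2, 0]]
toy_back = [[toy_idx_to_word[i] for i in utt] for utt in toy_ids]
```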


## Step 2: Define the Skip-gram pairs

Let's say, for example, we have a window size of 2:

tokens = ["quick", "brown", "fox", "jumps", "over"]
                0        1      2       3        4

For this iteration, we choose 'fox' (index 2) as the center word.
The window size tells us how far left and right of the center we look for context words, so with a window of 2 we can step as far as index 0 on the left and index 4 on the right.

All the words from index 0 to index 4, excluding the center word itself, are our **context** words.
We then make a pair for each of them, in the form **(center, context)**.

Eg: (fox, quick), (fox, brown), (fox, jumps), (fox, over)

In [34]:

import random
def make_skip_gram_pairs(tokens: Sequence[int], window: int, dynamic_window: bool = False) -> List[Tuple[int, int]]:
    pairs: List[Tuple[int, int]] = []
    for i, center in enumerate(tokens):
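        # Optionally sample a smaller window per center word, so nearby words are paired more often
        w = random.randint(1, window) if dynamic_window else window
        # Clamp the window to the utterance boundaries
        left_pointer = max(0, i - w)
        right_pointer = min(len(tokens) - 1, i + w)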
        for j in range(left_pointer, right_pointer + 1):
            if j == i:
                continue
            pairs.append((center, tokens[j]))
    return pairs


def make_pairs_for_all_utterances(utt_ids: List[List[int]], window: int, dynamic_window: bool = False) -> List[Tuple[int, int]]:
    all_pairs: List[Tuple[int, int]] = []
    for utt in utt_ids:
        if not utt:
            continue
        all_pairs.extend(make_skip_gram_pairs(utt, window, dynamic_window))
    return all_pairs
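
To check the logic against the 'fox' example, here is a small self-contained sketch that applies the same fixed-window pairing to raw strings. `pairs_fixed_window` is a hypothetical stand-in for `make_skip_gram_pairs` with `dynamic_window=False`.

```python
from typing import List, Sequence, Tuple


def pairs_fixed_window(tokens: Sequence[str], window: int) -> List[Tuple[str, str]]:
    # Same logic as make_skip_gram_pairs with a fixed window, on strings instead of ids
    pairs: List[Tuple[str, str]] = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens) - 1, i + window)
        pairs.extend((center, tokens[j]) for j in range(lo, hi + 1) if j != i)
    return pairs


toy_tokens = ["quick", "brown", "fox", "jumps", "over"]
toy_pairs = pairs_fixed_window(toy_tokens, window=2)

fox_pairs = [p for p in toy_pairs if p[0] == "fox"]
print(fox_pairs)       # [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]
print(len(toy_pairs))  # 14
```

Words near the edges get fewer pairs (2 for 'quick' and 'over', 3 for 'brown' and 'jumps'), which is why five tokens yield 14 pairs and not 20.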


We load the pairs into a PyTorch DataLoader so we can train in batches later.

In [35]:
from torch.utils.data import Dataset, DataLoader
import torch
class Word2VecDataset(Dataset):
    def __init__(self, data: List[Tuple[int, int]]):
        self.data = data

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int) -> Tuple[int, int]:
        return self.data[idx]


WINDOW = 2
DYNAMIC_WINDOW = True
BATCH_SIZE = 256
training_data = make_pairs_for_all_utterances(utt_ids, WINDOW, DYNAMIC_WINDOW)
dataset = Word2VecDataset(training_data)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dataloader = DataLoader(
    dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    pin_memory=torch.cuda.is_available(),
)
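
Conceptually, each batch the DataLoader produces regroups `BATCH_SIZE` (center, context) tuples column-wise: one sequence of centers and one of contexts. The plain-Python sketch below imitates only that regrouping on made-up pairs; the real DataLoader also shuffles and returns tensors, and the `batches` helper is hypothetical.

```python
from typing import Iterator, List, Tuple

demo_pairs: List[Tuple[int, int]] = [(2, 0), (2, 1), (2, 3), (2, 4), (3, 1), (3, 2)]
demo_batch_size = 4


def batches(data: List[Tuple[int, int]], size: int) -> Iterator[Tuple[List[int], List[int]]]:
    # Yield consecutive slices of `data`, regrouped into (centers, contexts)
    for start in range(0, len(data), size):
        chunk = data[start:start + size]
        centers, contexts = zip(*chunk)  # column-wise regrouping
        yield list(centers), list(contexts)


all_batches = list(batches(demo_pairs, demo_batch_size))
print(all_batches[0])  # ([2, 2, 2, 2], [0, 1, 3, 4])
print(all_batches[1])  # ([3, 3], [1, 2])
```

This is why a training loop over the DataLoader can unpack each batch into a centers tensor and a contexts tensor. The last batch may be smaller than the batch size.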


## Step 4: Construct the actual model

In [None]:
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size: int, embedding_dim: int):
        super().__init__()
        # Embedding lookup tables: each row is the vector for one word.
        # input_embeddings holds center-word vectors, output_embeddings holds context-word vectors.
        self.input_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.output_embeddings = nn.Embedding(vocab_size, embedding_dim)

    def forward(self, center_indices: torch.Tensor) -> torch.Tensor:
        # (batch,) -> (batch, embedding_dim)
        center_vecs = self.input_embeddings(center_indices)
        # Dot product with every output vector gives (batch, vocab_size) logits
        scores = center_vecs @ self.output_embeddings.weight.t()
        return scores
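
To see what the forward pass and the cross-entropy loss compute for a single (center, context) pair, here is the same arithmetic in plain Python on tiny made-up tables (3 words, 2 dimensions). The names `toy_in`, `toy_out`, `scores_for` and `cross_entropy` are illustrative only.

```python
import math
from typing import List

# Toy embedding tables, made up for illustration: one row per word
toy_in: List[List[float]] = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_out: List[List[float]] = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]


def scores_for(center: int) -> List[float]:
    # Dot product of the center's input vector with every output vector
    v = toy_in[center]
    return [sum(a * b for a, b in zip(v, u)) for u in toy_out]


def cross_entropy(scores: List[float], target: int) -> float:
    # -log softmax(scores)[target], using log-sum-exp for numerical stability
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]


toy_scores = scores_for(center=0)
print(toy_scores)  # [1.0, 0.0, -1.0]

toy_loss = cross_entropy(toy_scores, target=0)
print(round(toy_loss, 4))  # 0.4076
```

The loss is small here because the true context word (index 0) already has the highest score. Training pushes the scores of observed (center, context) pairs up relative to all other words.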


In [37]:
vocab_size = len(word_to_idx)

#this is a hyperparam, configure as needed
embedding_dim = 50

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SkipGram(vocab_size=vocab_size, embedding_dim=embedding_dim).to(device)

#standard loss fn and optimiser
criterion = nn.CrossEntropyLoss()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)

print(f"Vocab size: {vocab_size}, Using device: {device}")


Vocab size: 17650, Using device: cuda


In [None]:
# Full softmax over the whole vocabulary; no negative sampling yet
EPOCHS = 5
LOG_EVERY = 500  # print every N steps

model.train()
for epoch in range(1, EPOCHS + 1):
    print(f"\n=== Epoch {epoch}/{EPOCHS} ===")
    total_loss = 0.0
    num_steps = 0

    for step, (centers_tensor, contexts_tensor) in enumerate(dataloader, start=1):
        centers_tensor  = centers_tensor.to(device).long()
        contexts_tensor = contexts_tensor.to(device).long()

        logits = model(centers_tensor)             
        loss   = criterion(logits, contexts_tensor)  

        optimiser.zero_grad() 
        loss.backward()
        optimiser.step()

       
        num_steps += 1
        total_loss += loss.item()

        if step % LOG_EVERY == 0 or step == 1:
            avg_so_far = total_loss / num_steps
            print(f"[Epoch {epoch} | Step {step}] loss={avg_so_far:.4f}")

    epoch_avg = total_loss / max(1, num_steps)
    print(f"Epoch {epoch} completed. Avg loss: {epoch_avg:.4f}")



=== Epoch 1/5 ===
[Epoch 1 | Step 1] loss=29.8818
[Epoch 1 | Step 500] loss=24.1885
[Epoch 1 | Step 1000] loss=22.0120
[Epoch 1 | Step 1500] loss=20.4950
[Epoch 1 | Step 2000] loss=19.3149
[Epoch 1 | Step 2500] loss=18.3104
[Epoch 1 | Step 3000] loss=17.4336
[Epoch 1 | Step 3500] loss=16.6655
Epoch 1 completed. Avg loss: 16.6264

=== Epoch 2/5 ===
[Epoch 2 | Step 1] loss=10.3970
[Epoch 2 | Step 500] loss=11.0480
[Epoch 2 | Step 1000] loss=10.7564
[Epoch 2 | Step 1500] loss=10.4984
[Epoch 2 | Step 2000] loss=10.2817
[Epoch 2 | Step 2500] loss=10.0879
[Epoch 2 | Step 3000] loss=9.9032
[Epoch 2 | Step 3500] loss=9.7355
Epoch 2 completed. Avg loss: 9.7267

=== Epoch 3/5 ===
[Epoch 3 | Step 1] loss=8.7862
[Epoch 3 | Step 500] loss=8.3544
[Epoch 3 | Step 1000] loss=8.2742
[Epoch 3 | Step 1500] loss=8.1990
[Epoch 3 | Step 2000] loss=8.1274
[Epoch 3 | Step 2500] loss=8.0652
[Epoch 3 | Step 3000] loss=8.0023
[Epoch 3 | Step 3500] loss=7.9446
Epoch 3 completed. Avg loss: 7.9423

=== Epoch 4/5 =
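
For a sense of scale on the loss values above, a model that assigns equal probability to every word in the vocabulary has a cross-entropy of ln(V). With the logged vocabulary size this is a one-line calculation:

```python
import math

logged_vocab_size = 17650  # from the "Vocab size" line printed earlier
uniform_loss = math.log(logged_vocab_size)
print(f"{uniform_loss:.2f}")  # 9.78
```

The first logged loss (29.88) is well above this baseline. A likely explanation is that `nn.Embedding` initialises its weights from a standard normal distribution by default, so the initial dot products are large and the model starts out confidently wrong. By epoch 3 the average loss (7.94) has dropped below the uniform baseline.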

In [None]:
import torch.nn.functional as F
NEIGHBOUR_TOP_K = 5


@torch.no_grad()
def nearest_neighbors(
    model: SkipGram,
    word_to_idx: Dict[str, int],
    idx_to_word: Dict[int, str],
    device: torch.device,
    query_word: str,
    top_k: int = NEIGHBOUR_TOP_K,
):
    model.eval()
    if query_word not in word_to_idx:
        print(f"'{query_word}' not in vocab")
        return []

    # We use the input embeddings matrix because training looks up the center word there
    # and scores it against every candidate context word, so its rows are the trained
    # word vectors we compare when searching for nearest neighbours.
    embed = model.input_embeddings.weight
    q_idx = word_to_idx[query_word]
    q_vec = embed[q_idx].unsqueeze(0)  # shape (1, embedding_dim), broadcasts against every row

    sims = F.cosine_similarity(q_vec, embed, dim=1)  # similarity of the query word to every row
    sims[q_idx] = float("-inf")  # don't return the query word itself
    top_vals, top_inds = torch.topk(sims, k=min(top_k, embed.size(0) - 1))  # top-k values and indices


    return [(idx_to_word[i.item()], top_vals[j].item()) for j, i in enumerate(top_inds)]


example='day'
print(nearest_neighbors(model, word_to_idx, idx_to_word, device, example, top_k=NEIGHBOUR_TOP_K))


[('before', 0.5946093797683716), ('ignore', 0.5721971392631531), ('back', 0.5666858553886414), ('thread', 0.5606597065925598), ('through', 0.5575634241104126)]
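
For reference, the cosine similarity used above is the dot product of two vectors divided by the product of their lengths. The sketch below runs the same nearest-neighbour search in plain Python on made-up 2-dimensional vectors; `toy_embed` and `toy_neighbours` are hypothetical names, and the numbers have nothing to do with the trained model.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Toy embeddings, made up for illustration
toy_embed: Dict[str, List[float]] = {
    "day": [1.0, 0.0],
    "night": [2.0, 0.0],
    "week": [1.0, 1.0],
    "cat": [0.0, 1.0],
}


def toy_neighbours(query: str, top_k: int) -> List[Tuple[str, float]]:
    # Score every other word against the query, then keep the top_k most similar
    sims = [(w, cosine(toy_embed[query], v)) for w, v in toy_embed.items() if w != query]
    return sorted(sims, key=lambda p: p[1], reverse=True)[:top_k]


toy_result = toy_neighbours("day", top_k=2)
print([(w, round(s, 3)) for w, s in toy_result])  # [('night', 1.0), ('week', 0.707)]
```

Cosine similarity ignores vector length: 'night' is twice as long as 'day' but points in the same direction, so its similarity is 1.0.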
