## Word2Vec from Scratch

We will be building Word2Vec from scratch.
- What is it? A method that learns a vector representation for each word, called a word embedding.
- Why is it useful? The vectors we learn aim to capture the semantic meaning of words and their relationships to other words, so related words such as 'king' and 'queen' (the famous example) end up close together in the vector space.
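
To make "close together" a bit more concrete: closeness between embeddings is commonly measured with cosine similarity. The sketch below uses hand-picked toy vectors (they are made up for illustration and are NOT real embeddings), and `cosine_similarity` is just a helper name chosen for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional vectors, chosen by hand purely for illustration
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "banana": [0.1, 0.0, 0.9],
}

# Related words should score higher than unrelated ones
print(cosine_similarity(toy_vectors["king"], toy_vectors["queen"]))
print(cosine_similarity(toy_vectors["king"], toy_vectors["banana"]))
```

A trained model would hopefully show the same pattern: a higher score for 'king' vs 'queen' than for 'king' vs an unrelated word.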

There are two ways to train it:
1. Continuous Bag of Words (CBOW) tries to predict a word in the sentence, given its surrounding neighbour words.
eg: the quick brown _____ jumps over the lazy dog. We use the surrounding words 'brown' and 'jumps' to try to predict the missing word 'fox'.

2. Skip-gram is the reverse: it predicts the neighbours of a given word.
eg: the quick ______ fox ______ over the lazy dog. Given the center word 'fox', we try to predict its neighbours 'brown' and 'jumps'.
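
The two framings can be sketched on the example sentence like this. It is a minimal illustration with a window of 1 and hypothetical helper names (`cbow_examples`, `skip_gram_examples`), not the code used later in this notebook.

```python
def cbow_examples(tokens, window):
    """One (context_words, target) example per position: neighbours -> center."""
    examples = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        examples.append((left + right, target))
    return examples

def skip_gram_examples(tokens, window):
    """One (center, context_word) example per neighbour: center -> neighbour."""
    examples = []
    for context, center in cbow_examples(tokens, window):
        examples.extend((center, c) for c in context)
    return examples

sentence = "the quick brown fox jumps over the lazy dog".split()

# CBOW: the neighbours of 'fox' are the input, 'fox' is the target
print(cbow_examples(sentence, 1)[3])
# Skip-gram: 'fox' is the input, each neighbour is a separate target
print(skip_gram_examples(sentence, 1)[5:7])
```

Both framings see the same windows; they only differ in which side is the input and which side is the prediction target.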


## Step 1: Set up the dataset

I'll go with the Reddit text corpus (`reddit-corpus-small`) from ConvoKit today. Its reported stats are:

- Number of Utterances: 297132
- Number of Speakers: 119889
- Number of Conversations: 8286

We'll import the dataset, then see what we're working with.

In [None]:
from convokit import Corpus, download
corpus = Corpus(filename=download("reddit-corpus-small"))


In [51]:
chosen_utts = []
for i, utt in enumerate(corpus.iter_utterances()):
    if i >= 4:
        break
    chosen_utts.append(utt.text)
print(chosen_utts)

['Talk about your day. Anything goes, but subreddit rules still apply. Please be polite to each other! \n', 'I went to visit a few days ago and Ioved it. I can’t find any negatives other than how small the place is. I’m also just a visitor so the perspective is entirely different from someone who lives there. ', 'One time, my family and I had just returned from Japan and we needed a big cab to load up all our baggage. So this prime MPV turned up and he refused to take us because we live in Tampines. On top  that, he was extremely rude. Plus, he started arguing with the neighbouring taxi Drivers and he airport Marshalls promptly told him to leave, which he did.    \n   \nLuckily another MPV taxi turned up, and the driver his round was SUPER friendly.', 'Talk about your day. Anything goes, but subreddit rules still apply. Please be polite to each other! \n']


Okay, we have chosen the first four utterances (note that the fourth is a duplicate of the first). Let's work with these to get a feel for how this works. First we tokenise them to remove variations of the same word that might trip up the model, eg: lowercase the words so 'Talk' and 'talk' aren't treated as different words during training.

In [52]:
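import re

def tokenize(text: str):
    # Lowercase, then keep runs of letters/digits, allowing one internal ASCII apostrophe (e.g. "don't").
    # Note: a typographic apostrophe (’) is not matched, so "can’t" splits into 'can', 't' in the output below.
    return re.findall(r"[a-z0-9]+'[a-z0-9]+|[a-z0-9]+", text.lower(), flags=re.I)

# Tokenize every chosen utterance
chosen_utts_cleaned = [tokenize(utt) for utt in chosen_utts]
print(chosen_utts_cleaned)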



[['talk', 'about', 'your', 'day', 'anything', 'goes', 'but', 'subreddit', 'rules', 'still', 'apply', 'please', 'be', 'polite', 'to', 'each', 'other'], ['i', 'went', 'to', 'visit', 'a', 'few', 'days', 'ago', 'and', 'ioved', 'it', 'i', 'can', 't', 'find', 'any', 'negatives', 'other', 'than', 'how', 'small', 'the', 'place', 'is', 'i', 'm', 'also', 'just', 'a', 'visitor', 'so', 'the', 'perspective', 'is', 'entirely', 'different', 'from', 'someone', 'who', 'lives', 'there'], ['one', 'time', 'my', 'family', 'and', 'i', 'had', 'just', 'returned', 'from', 'japan', 'and', 'we', 'needed', 'a', 'big', 'cab', 'to', 'load', 'up', 'all', 'our', 'baggage', 'so', 'this', 'prime', 'mpv', 'turned', 'up', 'and', 'he', 'refused', 'to', 'take', 'us', 'because', 'we', 'live', 'in', 'tampines', 'on', 'top', 'that', 'he', 'was', 'extremely', 'rude', 'plus', 'he', 'started', 'arguing', 'with', 'the', 'neighbouring', 'taxi', 'drivers', 'and', 'he', 'airport', 'marshalls', 'promptly', 'told', 'him', 'to', 'leave
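
The output above shows "can’t" and "I’m" being split into 'can', 't' and 'i', 'm', because the corpus text uses a typographic apostrophe (U+2019) that the regex does not match. One possible fix, sketched here under the assumption that we want contractions kept whole, is to normalise that character before matching. `tokenize_normalized` is a hypothetical name, and the rest of this walkthrough keeps the original `tokenize` so the printed ids stay consistent.

```python
import re

def tokenize_normalized(text: str):
    # Map the typographic apostrophe (U+2019) to a plain ASCII one before matching
    text = text.lower().replace("\u2019", "'")
    return re.findall(r"[a-z0-9]+'[a-z0-9]+|[a-z0-9]+", text)

sample = "I can\u2019t find any negatives. I\u2019m also just a visitor."
print(tokenize_normalized(sample))
```

Whether contractions should stay whole or be split is a modelling choice; keeping them whole gives a slightly larger vocabulary but avoids stray tokens like 't' and 'm'.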

Now that each utterance is cleaned and broken down into simple tokens, we can map every unique token to an integer id (and back).

In [53]:
def build_vocab(chosen_utts_cleaned):
    # Flatten tokenized utterances into a single list of words
    words = [token for utterance in chosen_utts_cleaned for token in utterance]

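    # Note: set order for strings can differ between Python runs, so the exact ids may vary per session
    vocab = set(words)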
    word_to_idx = {word: idx for idx, word in enumerate(vocab)}
    idx_to_word = {idx: word for word, idx in word_to_idx.items()}
    return words, word_to_idx, idx_to_word


# Convert tokens to integer ids using the built mapping
def tokens_to_ids(chosen_utts_cleaned, word_to_idx):
    return [[word_to_idx[w] for w in utt if w in word_to_idx] for utt in chosen_utts_cleaned]

words, word_to_idx, idx_to_word = build_vocab(chosen_utts_cleaned)
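# Spot-check both directions of the mapping (97 is the id 'talk' received in this run; it may differ in yours)
print(word_to_idx['days'])
print(idx_to_word[97])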
chosen_utts_ids = tokens_to_ids(chosen_utts_cleaned, word_to_idx)
print(chosen_utts_ids)


2
talk
[[97, 41, 77, 57, 96, 56, 9, 86, 17, 67, 24, 75, 31, 104, 100, 42, 29], [78, 79, 100, 71, 68, 38, 2, 5, 33, 36, 50, 78, 88, 4, 43, 47, 16, 29, 40, 46, 21, 69, 102, 55, 78, 80, 22, 0, 68, 3, 76, 69, 13, 55, 84, 103, 20, 52, 92, 99, 25], [58, 89, 48, 98, 33, 78, 1, 0, 59, 20, 34, 33, 61, 11, 68, 106, 32, 100, 28, 12, 87, 74, 30, 76, 6, 35, 83, 95, 12, 33, 62, 39, 100, 18, 53, 60, 61, 49, 45, 91, 7, 90, 10, 62, 64, 63, 85, 73, 62, 93, 72, 65, 69, 70, 105, 15, 33, 62, 44, 54, 19, 101, 26, 100, 82, 94, 62, 14, 37, 51, 83, 105, 95, 12, 33, 69, 27, 81, 23, 64, 66, 8], [97, 41, 77, 57, 96, 56, 9, 86, 17, 67, 24, 75, 31, 104, 100, 42, 29]]
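
Because the vocabulary above comes from a plain `set`, the ids (such as 97 for 'talk') depend on the iteration order of that set, which can change from one Python session to the next. If reproducible ids matter, one option is to sort the vocabulary first. The sketch below is an alternative under that assumption, with a hypothetical name `build_vocab_sorted` and a tiny made-up input; it also keeps token counts, which are not used elsewhere in this walkthrough.

```python
from collections import Counter

def build_vocab_sorted(tokenized_utts):
    # Count every token across all utterances
    counts = Counter(tok for utt in tokenized_utts for tok in utt)
    # Sorting makes the id assignment independent of set/dict ordering
    vocab = sorted(counts)
    word_to_idx = {word: idx for idx, word in enumerate(vocab)}
    idx_to_word = {idx: word for word, idx in word_to_idx.items()}
    return counts, word_to_idx, idx_to_word

# Tiny made-up input, just to show the shape of the result
toy_utts = [["talk", "about", "your", "day"], ["your", "day", "was", "good"]]
counts, w2i, i2w = build_vocab_sorted(toy_utts)
print(w2i)
print(counts["your"], i2w[w2i["day"]])
```

With sorted ids, a lookup like `i2w[0]` returns the same word on every run for the same input.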


## Step 2: Define the skip-gram pairs

Let's say, for example, we have a window size of 2:

tokens = ["quick", "brown", "fox", "jumps", "over"]
                0        1      2       3        4

For this iteration, we choose 'fox' (index 2) as the center word.
The window size tells us how far left and right of the center we can look for context words. With a window of 2, we can take 2 steps left to index 0 and 2 steps right to index 4.

All the words from index 0 to index 4 are then our **context** words (not including the center word itself). Near the start or end of an utterance, the window is simply cut off at the boundary.
We then make a pair for each of them, in the form **(center, context)**.

Eg: (fox, quick), (fox, brown), (fox, jumps), (fox, over)

In [55]:

import random
def make_skip_gram_pairs(tokens, window, dynamic_window: bool = False):
    pairs = []
    for i, center in enumerate(tokens):
        # sample a random window size in [1, window] if dynamic_window is true, else use the full window
        w = random.randint(1, window) if dynamic_window else window

        # clamp the window to the utterance boundaries, then pair the center with every other token inside it
        left_pointer = max(0, i - w)
        right_pointer = min(len(tokens) - 1, i + w)
        for j in range(left_pointer, right_pointer + 1):
            if j == i:
                continue
            pairs.append((center, tokens[j]))
    return pairs


# make pairs for every utterance and return them as one flat list, since the Dataset class below wraps a flat list of samples
def make_pairs_for_all_utterances(chosen_utts_ids, window, dynamic_window: bool = False):
    all_pairs = []
    for utt in chosen_utts_ids:
        all_pairs.extend(make_skip_gram_pairs(utt, window, dynamic_window))
    return all_pairs


window = 2
first_cleaned_utterance =  chosen_utts_cleaned[0]
print(f"this is what the pairs look like for the first utterance: {make_skip_gram_pairs(first_cleaned_utterance, window)}")
print(f"this is what it looks like for all utterances and in ID form: {make_pairs_for_all_utterances(chosen_utts_ids, window)}")

this is what the pairs look like for the first utterance: [('talk', 'about'), ('talk', 'your'), ('about', 'talk'), ('about', 'your'), ('about', 'day'), ('your', 'talk'), ('your', 'about'), ('your', 'day'), ('your', 'anything'), ('day', 'about'), ('day', 'your'), ('day', 'anything'), ('day', 'goes'), ('anything', 'your'), ('anything', 'day'), ('anything', 'goes'), ('anything', 'but'), ('goes', 'day'), ('goes', 'anything'), ('goes', 'but'), ('goes', 'subreddit'), ('but', 'anything'), ('but', 'goes'), ('but', 'subreddit'), ('but', 'rules'), ('subreddit', 'goes'), ('subreddit', 'but'), ('subreddit', 'rules'), ('subreddit', 'still'), ('rules', 'but'), ('rules', 'subreddit'), ('rules', 'still'), ('rules', 'apply'), ('still', 'subreddit'), ('still', 'rules'), ('still', 'apply'), ('still', 'please'), ('apply', 'rules'), ('apply', 'still'), ('apply', 'please'), ('apply', 'be'), ('please', 'still'), ('please', 'apply'), ('please', 'be'), ('please', 'polite'), ('be', 'apply'), ('be', 'please'), (
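
The `dynamic_window` flag is not exercised above, so here is a small sketch of what it changes. My reading is that sampling the window size from 1 to `window` always keeps the nearest neighbours but only sometimes keeps the farther ones, so distant context is effectively down-weighted. The helper below is a hypothetical variant (`skip_gram_pairs_with_distance`) that also records the distance of each pair, and the `rng` parameter is an addition for reproducibility.

```python
import random
from collections import Counter

def skip_gram_pairs_with_distance(tokens, window, dynamic_window=False, rng=random):
    """Like make_skip_gram_pairs, but each pair also carries |center - context| distance."""
    pairs = []
    for i in range(len(tokens)):
        # Either the full window, or a random size between 1 and window (inclusive)
        w = rng.randint(1, window) if dynamic_window else window
        for j in range(max(0, i - w), min(len(tokens) - 1, i + w) + 1):
            if j != i:
                pairs.append((tokens[i], tokens[j], abs(i - j)))
    return pairs

tokens = ["quick", "brown", "fox", "jumps", "over"]
fixed = skip_gram_pairs_with_distance(tokens, window=2)
dynamic = skip_gram_pairs_with_distance(
    tokens, window=2, dynamic_window=True, rng=random.Random(0)
)

# Count how many pairs exist at each distance from the center word
print(Counter(d for _, _, d in fixed))
print(Counter(d for _, _, d in dynamic))
```

Distance-1 pairs are present in both cases, while the number of distance-2 pairs in the dynamic run can only be equal or smaller.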

We load the pairs into a PyTorch Dataset class so we can iterate over them in batches later.

In [58]:
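from torch.utils.data import Dataset, DataLoader

class Word2VecDataset(Dataset):
    """Thin wrapper that exposes the list of (center_id, context_id) pairs to a DataLoader."""
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        # Each sample is one (center_id, context_id) tuple
        return self.data[idx]


training_data = make_pairs_for_all_utterances(chosen_utts_ids, window)
dataset = Word2VecDataset(training_data)
# With the default collate function, each batch is a pair of tensors: center ids and context ids
dataloader = DataLoader(dataset, batch_size=10, shuffle=True)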

