### word2vec
- paper: https://arxiv.org/pdf/1301.3781
- https://towardsdatascience.com/word2vec-with-pytorch-implementing-original-paper-2cd7040120b0/
##### notes
- two architecture types: CBOW (Continuous Bag-of-Words), and Skip-Gram
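- CBOW predicts the middle word from its surrounding context: for "machine learning is a method" it would produce "is" from the inputs "machine", "learning", "a", "method". Skip-Gram does the opposite: given "is", it predicts the surrounding context words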
- word2vec has 2 layers: an embedding layer and a linear layer
- embedding layer: takes a word ID and returns its 300-dimensional vector. the embedding layer is basically a linear layer without bias or activation (applied to a one-hot encoding of the word ID)
- linear (dense) layer: has a softmax activation. this basically sets up a multi-class classification task, with the number of classes being the number of words in the vocabulary
- CBOW takes multiple words as input; each goes through the embedding layer separately, and the resulting word embedding vectors are averaged before going into the linear layer. Skip-Gram just takes a single word
- the model directly mirrors its training data: if you feed it a bunch of machine learning papers, it's going to learn those words very well, but it won't be very good with words from a book like Harry Potter. words also have different meanings in different contexts: for example, a "model" in machine learning is an algorithm, while a "model" in fashion is a person
- step 1: encode all the words into their IDs; an ID is an integer index into a vocabulary built from the corpus
- how to choose a vocabulary? usually, take the top N most common words from the training corpus
- vocabulary is usually stored as a dictionary like vocab = { "a": 1, "analysis": 2 }
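
The encoding step and the CBOW / Skip-Gram difference from the notes above can be sketched in plain Python. The toy sentence, `TOP_N`, `WINDOW`, and the helper names are made up for illustration, and reserving index 0 for `<unk>` is an assumption rather than something the notes specify.

```python
from collections import Counter

corpus = "machine learning is a method of data analysis".split()  # toy corpus (illustrative)
TOP_N = 5    # hypothetical vocabulary size
WINDOW = 2   # hypothetical number of context words on each side

# keep the TOP_N most common words; index 0 is reserved for unknown words (assumption)
counts = Counter(corpus)
vocab = {"<unk>": 0}
for word, _ in counts.most_common(TOP_N):
    vocab[word] = len(vocab)

def encode(words):
    """Map words to integer IDs, falling back to <unk> for out-of-vocabulary words."""
    return [vocab.get(w, vocab["<unk>"]) for w in words]

def cbow_pairs(tokens, window=WINDOW):
    """(context words, middle word) pairs: CBOW predicts the middle word."""
    pairs = []
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs

def skipgram_pairs(tokens, window=WINDOW):
    """(middle word, context word) pairs: Skip-Gram predicts each context word."""
    return [(target, c) for context, target in cbow_pairs(tokens, window) for c in context]

ids = encode(corpus)
first_context, first_target = cbow_pairs(corpus)[0]
print(first_context, first_target)  # ['machine', 'learning', 'a', 'method'] is
```

Words outside the top `TOP_N` ("of", "data", "analysis" here) all collapse to the `<unk>` ID, which is the price of capping the vocabulary.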

In [1]:
import torch.nn as nn

In [2]:
EMBED_DIMENSION = 300
EMBED_MAX_NORM = 1

In [3]:
class CBOW(nn.Module):
    def __init__(self, vocab_size: int):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=EMBED_DIMENSION,
            max_norm=EMBED_MAX_NORM,
        )
        self.linear = nn.Linear(
            in_features=EMBED_DIMENSION,
            out_features=vocab_size,
        )
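    def forward(self, X):
        # X: (batch, context_size) word IDs -> (batch, context_size, EMBED_DIMENSION)
        X = self.embeddings(X)
        # average the context word vectors -> (batch, EMBED_DIMENSION)
        X = X.mean(dim=1)
        # one raw score (logit) per vocabulary word -> (batch, vocab_size)
        X = self.linear(X)
        return X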
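
A shape walk-through of the forward pass above, as a sketch with deliberately tiny sizes. `vocab_size`, `embed_dim`, the batch, and the targets are made-up values. Pairing the raw scores with `F.cross_entropy` (which applies log-softmax internally) is an assumption about how the softmax from the notes would be applied during training, since the notebook does not show a loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embed_dim = 10, 4  # tiny illustrative sizes (the notes use 300 dimensions)

embeddings = nn.Embedding(vocab_size, embed_dim)
linear = nn.Linear(embed_dim, vocab_size)

# a batch of 2 examples, each with 4 context word IDs
context_ids = torch.tensor([[1, 2, 4, 5], [2, 3, 5, 6]])
targets = torch.tensor([3, 4])  # IDs of the middle words

vectors = embeddings(context_ids)  # (2, 4, embed_dim): one vector per context word
averaged = vectors.mean(dim=1)     # (2, embed_dim): context vectors averaged
logits = linear(averaged)          # (2, vocab_size): one score per vocabulary word

probs = F.softmax(logits, dim=1)         # scores turned into a distribution over the vocabulary
loss = F.cross_entropy(logits, targets)  # takes raw scores; log-softmax happens inside

# the embedding lookup matches a bias-free linear map applied to one-hot inputs
one_hot = F.one_hot(context_ids, num_classes=vocab_size).float()
same_as_lookup = torch.allclose(one_hot @ embeddings.weight, vectors)
```

The last two lines check the claim from the notes that the embedding layer behaves like a linear layer without bias or activation.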

In [6]:
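import torch
from torchtext.vocab import build_vocab_from_iterator

MIN_WORD_FREQUENCY = 50
MAX_SEQUENCE_LENGTH = 256
CBOW_N_WORDS = 4  # context words on each side of the target; assumed value, not defined in the original notebook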

def build_vocab(data_iter, tokenizer):
    vocab = build_vocab_from_iterator(
        map(tokenizer, data_iter),
        specials=["<unk>"],
        min_freq=MIN_WORD_FREQUENCY,
    )
    vocab.set_default_index(vocab["<unk>"])
    return vocab

def collate_cbow(batch, text_pipeline):
    batch_input, batch_output = [], []
    for text in batch:
        text_tokens_ids = text_pipeline(text)

        if len(text_tokens_ids) < CBOW_N_WORDS * 2 + 1:
            continue

        if MAX_SEQUENCE_LENGTH:
            text_tokens_ids = text_tokens_ids[:MAX_SEQUENCE_LENGTH]

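        # slide a window of CBOW_N_WORDS * 2 + 1 tokens over the text
        for idx in range(len(text_tokens_ids) - CBOW_N_WORDS * 2):
            token_id_sequence = text_tokens_ids[idx : (idx + CBOW_N_WORDS * 2 + 1)]
            # the middle token is the target; the remaining tokens are the context
            output = token_id_sequence.pop(CBOW_N_WORDS)
            input_ = token_id_sequence
            batch_input.append(input_)
            batch_output.append(output)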
    batch_input = torch.tensor(batch_input, dtype=torch.long)
    batch_output = torch.tensor(batch_output, dtype=torch.long)
    return batch_input, batch_output

OSError: /home/ln/Downloads/venv/lib/python3.12/site-packages/torchtext/lib/libtorchtext.so: undefined symbol: _ZN5torch3jit17parseSchemaOrNameERKSs

In [8]:
from torch.utils.data import DataLoader
from functools import partial

dataloader = DataLoader(
    data_iter,
    batch_size=batch_size,
    shuffle=True,
    collate_fn=partial(collate_cbow, text_pipeline=text_pipeline),
)

NameError: name 'data_iter' is not defined
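
The `NameError` comes from `data_iter` never being defined in the notebook; `batch_size` and `text_pipeline` are missing as well. A minimal sketch of how the pieces might be wired together without torchtext, under loud assumptions: the toy sentences, the whitespace tokenizer, `WINDOW`, the batch size, and the simplified `collate_cbow_toy` are all hypothetical stand-ins, not the original data or pipeline.

```python
from functools import partial

import torch
from torch.utils.data import DataLoader

WINDOW = 2  # hypothetical stand-in for CBOW_N_WORDS

# toy stand-in for data_iter: a list of raw text strings
data_iter = [
    "machine learning is a method of data analysis",
    "a model is a method",  # exactly 2 * WINDOW + 1 tokens, yields one example
    "too short",            # shorter than the window, contributes nothing
]

# toy vocabulary: index 0 is <unk>, other words numbered in order of first appearance
vocab = {"<unk>": 0}
for line in data_iter:
    for word in line.split():
        vocab.setdefault(word, len(vocab))

def text_pipeline(text):
    """Whitespace-tokenize and map each token to its ID (hypothetical pipeline)."""
    return [vocab.get(tok, 0) for tok in text.split()]

def collate_cbow_toy(batch, text_pipeline):
    """Turn a batch of texts into (context IDs, target ID) tensors."""
    inputs, outputs = [], []
    for text in batch:
        ids = text_pipeline(text)
        for i in range(len(ids) - 2 * WINDOW):
            window = ids[i : i + 2 * WINDOW + 1]
            outputs.append(window.pop(WINDOW))  # middle word is the target
            inputs.append(window)               # remaining words are the context
    return torch.tensor(inputs, dtype=torch.long), torch.tensor(outputs, dtype=torch.long)

dataloader = DataLoader(
    data_iter,
    batch_size=3,
    shuffle=False,
    collate_fn=partial(collate_cbow_toy, text_pipeline=text_pipeline),
)

batch_input, batch_output = next(iter(dataloader))
```

Note that the batch size counts texts, not training examples: three texts go in, and the collate function expands them into however many windows they contain.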