# Word2Vec

In this notebook, you will implement and train two different flavors of word2vec models.

## Instructions

Complete the parts identified with `TODO: Implement`.  
Your code should go between the `START CODE HERE` and `END CODE HERE` tags.

⛔ **Note** You are NOT allowed to use external libraries.  
You are NOT allowed to use autograd libraries like PyTorch either.

**Acknowledgement**: This assignment was designed with the help of Jake Tae, a former TA of the course, and was adapted for this semester by the current course staff.

## Setup

In [1]:
# install
%pip install datasets

# imports
import os
import pathlib
import pickle
import random
import re
from collections import Counter
from typing import Optional

import nltk
import numpy as np
from datasets import load_dataset
from matplotlib import pyplot as plt
from sklearn.decomposition import PCA
from tqdm import tqdm

nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/natejly/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# some initializations
SEED = 42
random.seed(SEED)
os.environ["PYTHONHASHSEED"] = str(SEED)
np.random.seed(SEED)

In [3]:
# data loading. We use a small subset of OpenWebText
def get_corpus(size: int = 1000) -> list[str]:
    # NOTE: Loading from a specific revision where openwebtext has been updated
    # to Parquet format since new versions of datasets don't allow scripts
    dataset = load_dataset("stas/openwebtext-10k", revision="refs/pr/3")
    corpus = dataset["train"]["text"][:size]
    return corpus


# for testing purposes we can also use a tiny corpus
def get_tiny_corpus() -> list[str]:
    document1 = "Natural language processing is a subfield of computer science, linguistics, and machine learning."
    document2 = "It is concerned with giving computers the ability to support and manipulate natural language."
    document3 = "It involves processing natural language datasets using rule-based or probabilistic machine learning approaches."
    document4 = "The goal is a computer capable of understanding the contents of documents through machine learning."
    corpus = [document1, document2, document3, document4]
    return corpus

## Part 1: Preprocessing and Data Preparation

In this part, we will implement the functions needed to preprocess the data and load it into our model.

In [4]:
def preprocess(corpus: list[str]) -> list[list[str]]:
    """
    Applies:
    1. lowercase transformation
    2. leading & trailing whitespace removal
    3. removing any non-alphanumeric symbols (e.g., punctuation)
    4. splitting each document by whitespace

    Hint: You can (but don't have to) use regular expressions.
    Also consider .strip() and .split() functions.
    """

    result = []

    # TODO: Implement
    for document in tqdm(corpus, desc="Preprocessing"):
        # START OF YOUR CODE
        text = document.lower()
        text = re.sub(r'[^\w\s]', '', text)
        text = text.split()
        result.append(text)
        # END OF YOUR CODE
    return result
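
To make the four steps concrete, here is a minimal standalone sketch on a single sentence. The helper name `tokenize_document` is hypothetical and is not part of the assignment code. It assumes that "alphanumeric" means the regex class `\w` (letters, digits, and underscore).

```python
import re


def tokenize_document(document: str) -> list[str]:
    """Hypothetical helper: lowercase, strip, drop punctuation, split on whitespace."""
    text = document.lower().strip()
    # Remove every character that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    return text.split()


tokens = tokenize_document("  It involves rule-based, probabilistic approaches.  ")
print(tokens)
```

Because the hyphen is deleted rather than replaced by a space, `rule-based` becomes the single token `rulebased`. Replacing punctuation with a space instead would split it into two tokens; either choice is defensible as long as it is applied consistently.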

In [5]:
def get_vocabulary(corpus: list[str], vocab_size: int = 2000) -> list[str]:
    """
    Tokenizes each document in the corpus and returns a list of distinct words.
    Sort the words by most frequent to least frequent. If there are
    more words than `vocab_size`, cut off the remaining infrequent words.
    The output should not contain duplicate tokens.
    You should also handle unseen tokens with a special <unk> token,
    so make sure to include this special token in your vocabulary.

    Hint: Consider using collections.Counter (already imported).
    """
    corpus = preprocess(corpus)

    # TODO: Implement (~3 lines)
    # START OF YOUR CODE
    word_counts = Counter(word for document in corpus for word in document)
    word_counts.pop("<unk>", None)
    vocabulary = ["<unk>"] + [word for word, _ in word_counts.most_common(vocab_size-1)]
    vocabulary = vocabulary[:vocab_size]

    # END OF YOUR CODE

    return vocabulary


# sanity check
vocab_size = 10
test_vocabulary = get_vocabulary(get_tiny_corpus(), vocab_size=vocab_size)
assert "<unk>" in test_vocabulary
assert len(test_vocabulary) <= vocab_size

Preprocessing: 100%|██████████| 4/4 [00:00<00:00, 10638.69it/s]
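
The following sketch shows, on a made-up three-document corpus, how a frequency-ranked vocabulary with an `<unk>` slot maps words to indices. All names (`docs`, `toy_vocab`, `toy_vocab2idx`) are illustrative assumptions and are independent of the functions in this notebook.

```python
from collections import Counter

# Toy tokenized corpus (made up for illustration).
docs = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog"]]
counts = Counter(word for doc in docs for word in doc)

vocab_size = 3  # this budget includes the <unk> slot
toy_vocab = ["<unk>"] + [w for w, _ in counts.most_common(vocab_size - 1)]
toy_vocab2idx = {w: i for i, w in enumerate(toy_vocab)}

# Words that did not make the cut fall back to the <unk> index.
unk = toy_vocab2idx["<unk>"]
encoded = [toy_vocab2idx.get(w, unk) for w in ["the", "dog", "cat"]]
print(toy_vocab, encoded)
```

Here `dog` appears only once, so it is cut from the vocabulary and encoded with the same index as `<unk>`.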


In [6]:
# helper function
def get_vocab2idx(vocabulary: list[str]) -> dict[str, int]:
    """
    Returns a dictionary mapping vocabulary to its index (zero-indexed).
    Example input/output shown below.

    >>> get_vocab2idx(['a', 'b'])
    {'a': 0, 'b': 1}
    """

    # TODO: Implement (~1 line)
    # START OF YOUR CODE
    vocab2idx = {}
    for i, word in enumerate(vocabulary):
        vocab2idx[word] = i
    # END OF YOUR CODE

    return vocab2idx

In [7]:
def generate_training_data(
    corpus: list[str], vocab2idx, window_size: int, k: int
) -> list[tuple[int, list[int], list[int]]]:
    """
    Generates the training data as a list. Each element of the list
    follows the format:

        (center word index, list of context word indices, list of negative word indices)

    The context word indices are the indices of the words within the window size, inclusive.
    The negative indices are sampled from the entire vocabulary,
    excluding the center and context indices. To keep things simple, we use
    uniform sampling. However, the original word2vec paper uses weighted sampling.

    Hint: Consider using functions in the built-in random library,
    such as random.choice(s).

    Example output shown below (note that the numbers are not correct):
    [
        (10, [1, 2, 3], [0, 4, 5, 8, 9]),
        ...
    ]
    """

    result = []
    # TODO: Implement
    # START OF YOUR CODE
    tokenized_corpus = preprocess(corpus)
    unk_idx = vocab2idx["<unk>"]
    vocab_indices = list(vocab2idx.values())

    for tokens in tokenized_corpus:
        token_indices = [vocab2idx.get(token, unk_idx) for token in tokens]
        n_tokens = len(token_indices)

        for center_pos, center_idx in enumerate(token_indices):
            left = max(0, center_pos - window_size)
            right = min(n_tokens, center_pos + window_size + 1)

            context_indices = [
                token_indices[pos] for pos in range(left, right) if pos != center_pos
            ]
            if not context_indices:
                continue

            blocked = set(context_indices)
            blocked.add(center_idx)
            candidates = [idx for idx in vocab_indices if idx not in blocked]

            if not candidates:
                negative_indices = []
            elif k <= len(candidates):
                negative_indices = random.sample(candidates, k)
            else:
                negative_indices = random.choices(candidates, k=k)

            result.append((center_idx, context_indices, negative_indices))
    # END OF YOUR CODE

    return result
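
The docstring above mentions that the original word2vec paper uses weighted rather than uniform sampling. As a hedged sketch of that idea, the snippet below draws negatives from the unigram distribution raised to the 3/4 power. The counts are made up, and `sample_negatives` is a hypothetical helper that is not used anywhere in this assignment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up word counts indexed by vocabulary id.
counts = np.array([50.0, 20.0, 10.0, 5.0, 1.0])

# Unigram distribution raised to the 3/4 power, then renormalized.
weights = counts ** 0.75
probs = weights / weights.sum()


def sample_negatives(blocked: set[int], k: int) -> list[int]:
    """Hypothetical helper: draw k negatives, never returning a blocked index."""
    p = probs.copy()
    p[list(blocked)] = 0.0  # exclude the center and context words
    p /= p.sum()
    return rng.choice(len(p), size=k, replace=True, p=p).tolist()


negatives = sample_negatives(blocked={0, 2}, k=4)
print(negatives)
```

The 3/4 exponent flattens the distribution: rare words are sampled more often than their raw frequency suggests, and very frequent words less often.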



## Part 2: Skipgram Model

### Part 2.1: Activation Functions

In this part, we implement the activation functions (or non-linearities) that we need for our model.

In [8]:
def sigmoid(x: np.ndarray) -> np.ndarray:
    """
    Computes the sigmoid activation function.
    """

    # TODO: Implement (~1 line)
    # START OF YOUR CODE
    return 1 / (1 + np.exp(-x))
    # END OF YOUR CODE


def softmax(x: np.ndarray) -> np.ndarray:
    """
    Computes the softmax activation function.
    """
    # TODO: Implement (~1 line)
    # START OF YOUR CODE
    exps = np.exp(x - np.max(x))  # subtract the max to avoid overflow; result is unchanged
    return exps / np.sum(exps)
    # END OF YOUR CODE
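
A direct translation of the textbook formulas can overflow in `np.exp` when the inputs have large magnitude. The sketch below shows one common way to guard against that. The names `stable_softmax` and `stable_sigmoid` are hypothetical and are offered as an illustration, not as the required solution.

```python
import numpy as np


def stable_softmax(x: np.ndarray) -> np.ndarray:
    """Softmax with the max subtracted first; mathematically the same result."""
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / np.sum(exps)


def stable_sigmoid(x: np.ndarray) -> np.ndarray:
    """Sigmoid that never exponentiates a large positive number."""
    x = np.asarray(x, dtype=float)
    exp_neg_abs = np.exp(-np.abs(x))  # always in [0, 1]
    return np.where(x >= 0, 1.0 / (1.0 + exp_neg_abs), exp_neg_abs / (1.0 + exp_neg_abs))


big = np.array([1000.0, 1001.0, 1002.0])
print(stable_softmax(big))
print(stable_sigmoid(np.array([-1000.0, 0.0, 1000.0])))
```

Softmax is invariant to adding a constant to every input, so shifting by the maximum changes nothing mathematically while keeping every exponent at or below zero.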

### Part 2.2: Implementing the Skipgram model

We implement two methods. The first uses the full softmax loss. The second is a more efficient implementation that uses negative sampling.
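
For reference, the two losses and their gradients can be written as follows. The notation is an assumption chosen to match the argument names used in the code: $v_c$ is the center vector, $U$ the matrix of outside vectors with rows $u_w$, $o$ the index of the true outside word (`u_idx`), $y$ the one-hot vector for $o$, and $K$ the collection of sampled negative indices.

```latex
\begin{aligned}
\hat{y} &= \operatorname{softmax}(U v_c), \qquad
J_{\text{softmax}} = -\log \hat{y}_o \\
\frac{\partial J_{\text{softmax}}}{\partial v_c} &= U^\top (\hat{y} - y), \qquad
\frac{\partial J_{\text{softmax}}}{\partial U} = (\hat{y} - y)\, v_c^\top \\[6pt]
J_{\text{neg}} &= -\log \sigma(u_o^\top v_c) - \sum_{k \in K} \log \sigma(-u_k^\top v_c) \\
\frac{\partial J_{\text{neg}}}{\partial v_c} &= \bigl(\sigma(u_o^\top v_c) - 1\bigr)\, u_o + \sum_{k \in K} \sigma(u_k^\top v_c)\, u_k \\
\frac{\partial J_{\text{neg}}}{\partial u_o} &= \bigl(\sigma(u_o^\top v_c) - 1\bigr)\, v_c, \qquad
\frac{\partial J_{\text{neg}}}{\partial u_k} = \sigma(u_k^\top v_c)\, v_c
\end{aligned}
```

The softmax loss touches every row of $U$ on each step, whereas the negative sampling loss only touches the row for $o$ and the rows for the sampled negatives, which is where the efficiency gain comes from.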

In [9]:
# Softmax loss and gradients function


def get_softmax_loss_and_gradients(
    v_c: np.ndarray,
    u_idx: int,
    U: np.ndarray,
    negative_samples: Optional[list[int]] = None,
):
    """
    This part implements the softmax loss and returns the gradients with respect to the input.

    Given the center word v_c, the index of the context word u_idx,
    and the word embedding matrix U, compute the softmax loss and
    gradients for both the center word and the context word.

    Args:
      v_c: np.ndarray shape (dim)
      u_idx: int
      U: np.ndarray shape (V, dim)
      negative_samples: Not used (ignore for this part)

    Returns:
      loss: float
      grad_v_c: np.ndarray
      grad_outside_vectors: np.ndarray
    """

    # TODO: Implement (~6 lines)

    # START OF YOUR CODE
    scores = U @ v_c
    y_hat = softmax(scores)
    loss = -np.log(y_hat[u_idx])

    y_hat_minus_y = y_hat.copy()
    y_hat_minus_y[u_idx] -= 1
    grad_v_c = U.T @ y_hat_minus_y
    grad_outside_vectors = np.outer(y_hat_minus_y, v_c)
    # END OF YOUR CODE

    return loss, grad_v_c, grad_outside_vectors



In [10]:
# test your implementation on sample input with expected output

v_c = np.asarray([-1.16867804, 1.14282281, 0.75193303])
u_idx = 2
U = np.asarray(
    [
        [-0.34271452, -0.80227727, -0.16128571],
        [0.40405086, 1.8861859, 0.17457781],
        [0.25755039, -0.07444592, -1.91877122],
        [-0.02651388, 0.06023021, 2.46324211],
        [-0.19236096, 0.30154734, -0.03471177],
    ]
)

expected_loss = 4.575654879938905
expected_grads_vc = np.asarray([-0.14065423, 0.84958569, 3.07103537])
expected_grad_U = np.asarray(
    [
        [-0.03961545, 0.03873902, 0.02548877],
        [-0.46011424, 0.44993491, 0.2960397],
        [1.15664118, -1.13105225, -0.74418846],
        [-0.52786724, 0.51618898, 0.33963231],
        [-0.12904425, 0.12618934, 0.08302769],
    ]
)

loss, grads_vc, grads_U = get_softmax_loss_and_gradients(v_c, u_idx, U)
assert np.allclose(loss, expected_loss)
assert np.allclose(grads_vc, expected_grads_vc)
assert np.allclose(grads_U, expected_grad_U)
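
Besides comparing against fixed expected values, analytic gradients can be sanity-checked with central finite differences. The snippet below is a self-contained sketch of that technique on random inputs. The helpers `softmax_loss` and `numerical_grad` are hypothetical names introduced here and do not depend on the functions defined above.

```python
import numpy as np


def softmax_loss(v_c: np.ndarray, u_idx: int, U: np.ndarray) -> float:
    """Cross-entropy of softmax(U @ v_c) against the true outside word u_idx."""
    scores = U @ v_c
    scores = scores - scores.max()  # shift for numerical stability
    return float(np.log(np.exp(scores).sum()) - scores[u_idx])


def numerical_grad(f, x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Central finite differences of a scalar function f at x."""
    grad = np.zeros_like(x)
    for i in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
v = rng.normal(size=3)
U_toy = rng.normal(size=(5, 3))
target = 2

# Analytic gradients: U^T (y_hat - y) and the outer product (y_hat - y) v^T.
probs = np.exp(U_toy @ v - (U_toy @ v).max())
probs /= probs.sum()
delta = probs.copy()
delta[target] -= 1.0
analytic_v = U_toy.T @ delta
analytic_U = np.outer(delta, v)

numeric_v = numerical_grad(lambda x: softmax_loss(x, target, U_toy), v)
numeric_U = numerical_grad(lambda M: softmax_loss(v, target, M), U_toy)

print(np.allclose(analytic_v, numeric_v, atol=1e-6))
print(np.allclose(analytic_U, numeric_U, atol=1e-6))
```

The same `numerical_grad` helper can be pointed at any scalar loss, which makes it a cheap way to catch sign or transpose mistakes before training.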

In [11]:
# negative sampling loss and corresponding gradients


def get_negative_sampling_loss_and_gradients(
    v_c: np.ndarray,
    u_idx: int,
    U: np.ndarray,
    negative_samples_idx: list[int],
) -> tuple[float, np.ndarray, np.ndarray]:
    """
    This part implements the negative sampling loss and also returns the gradients.

    Given the center word v_c, the index of the context word u_idx,
    and the word embedding matrix U, compute the negative sampling loss and
    gradients for both the center word and the context word.

    Args:
        v_c: np.ndarray shape (dim)
        u_idx: int, index of the true context (outside) word
        U: np.ndarray shape (V, dim)
        negative_samples_idx: list[int], indices of the negative samples

    Returns:
        loss: float
        grad_v_c: np.ndarray
        grad_outside_vectors: np.ndarray
    """

    # TODO: Implement (~8 lines)

    # START OF YOUR CODE
    grad_outside_vectors = np.zeros_like(U)

    u_o = U[u_idx]
    pos_score = np.dot(u_o, v_c)
    pos_sigmoid = sigmoid(pos_score)
    loss = -np.log(pos_sigmoid)
    grad_v_c = (pos_sigmoid - 1) * u_o
    grad_outside_vectors[u_idx] += (pos_sigmoid - 1) * v_c

    for neg_idx in negative_samples_idx:
        u_k = U[neg_idx]
        neg_score = np.dot(u_k, v_c)
        neg_sigmoid = sigmoid(neg_score)
        loss += -np.log(sigmoid(-neg_score))
        grad_v_c += neg_sigmoid * u_k
        grad_outside_vectors[neg_idx] += neg_sigmoid * v_c
    # END OF YOUR CODE

    return loss, grad_v_c, grad_outside_vectors



In [12]:
# test your implementation on sample input with expected output

negative_samples_idx = [0, 3]

expected_loss = 4.486897078520242
expected_grads_vc = np.asarray([-0.36363528, -0.16053028, 3.75446939])
expected_grad_U = np.asarray(
    [
        [-0.40411265, 0.39517227, 0.26000801],
        [0.0, 0.0, 0.0],
        [1.00696315, -0.98468561, -0.6478849],
        [-1.02337143, 1.00073089, 0.65844207],
        [0.0, 0.0, 0.0],
    ]
)

loss, grads_vc, grads_U = get_negative_sampling_loss_and_gradients(
    v_c, u_idx, U, negative_samples_idx
)
assert np.allclose(loss, expected_loss)
assert np.allclose(grads_vc, expected_grads_vc)
assert np.allclose(grads_U, expected_grad_U)

## Part 3: Word2vec Class and Training

In this part, we will implement a Word2vec class that actually trains the model.

> **Note:**
> We've done this part for you. You don't need to change anything here.
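
The training loop in the class below describes its batching as gradient accumulation. As a generic illustration of that idea on a toy problem (a sketch under its own assumptions, not a copy of `Word2Vec.train`), per-example gradients are summed and an averaged update is applied once every `batch_size` examples.

```python
import numpy as np

# Toy problem: fit a scalar weight w so that w * x matches y = 3 * x.
xs = np.linspace(-1.0, 1.0, 64)
ys = 3.0 * xs

w = 0.0
learning_rate = 0.5
batch_size = 8  # number of per-example gradients to accumulate per update

for epoch in range(50):
    grad_sum = 0.0
    for step, (x, y) in enumerate(zip(xs, ys), start=1):
        # Per-example gradient of 0.5 * (w * x - y) ** 2 with respect to w.
        grad_sum += (w * x - y) * x
        if step % batch_size == 0:
            w -= learning_rate * grad_sum / batch_size  # averaged update
            grad_sum = 0.0  # reset the accumulator for the next batch

print(round(w, 3))
```

Accumulating this way gives the same update as a true mini-batch of size `batch_size`, without requiring the loss function to support a batch dimension.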

In [13]:
# now we implement the class that will train the model using our defined losses
# DO NOT CHANGE THIS CLASS


class Word2Vec:
    def __init__(
        self,
        corpus: list[str],
        embedding_dim: int = 20,
        learning_rate: float = 0.01,
        num_negative_samples: int = 10,
        window_size: int = 5,
        loss_method: str = "softmax",  # or "negative_sampling"
        save_freq: int = 1000,
        cache_path: str = "data/",
        log_freq: int = 1000,
        vocab_size: int = 2000,
    ) -> None:
        """
        Initializes the model parameters and learning hyperparameters.

        Args:
            corpus: list[str], list of documents
            embedding_dim: int, dimension of the word embedding
            learning_rate: float, learning rate for SGD
            num_negative_samples: int, number of negative samples to use for negative sampling
            window_size: int, window size for context words
            loss_method: str, "softmax" or "negative_sampling"
            save_freq: int, how often to save the model
            cache_path: str, where to save the model
            log_freq: int, how often to print the progress
            vocab_size: int, size of the vocabulary
        """
        self.corpus = corpus
        self.vocab = get_vocabulary(self.corpus, vocab_size=vocab_size)
        self.vocab2idx = get_vocab2idx(self.vocab)
        self.idx2vocab = {v: k for k, v in self.vocab2idx.items()}
        self.embedding_dim = embedding_dim
        self.learning_rate = learning_rate
        self.num_negative_samples = num_negative_samples
        self.window_size = window_size
        self.log_freq = log_freq
        self.save_freq = save_freq

        # the main parameters of the model that we are going to train
        self.center_vecs = (np.random.rand(len(self.vocab), embedding_dim) - 0.5) / embedding_dim
        self.outside_vecs = np.zeros((len(self.vocab), embedding_dim))

        assert loss_method in ["softmax", "negative_sampling"]
        if loss_method == "softmax":
            self.loss_and_grad_fn = get_softmax_loss_and_gradients
        else:
            self.loss_and_grad_fn = get_negative_sampling_loss_and_gradients

        # cache the training data
        self.cache_path = cache_path
        data_cache_path = f"{cache_path}/data.npy"
        if not os.path.exists(data_cache_path):
            pathlib.Path(cache_path).mkdir(parents=True, exist_ok=True)
            self.data = generate_training_data(
                corpus, self.vocab2idx, window_size, num_negative_samples
            )
            # save the training data with pickle
            with open(data_cache_path, "wb") as f:
                pickle.dump(self.data, f)
        else:
            # load the cached training data
            with open(data_cache_path, "rb") as f:
                self.data = pickle.load(f)

    def train_step(
        self,
        center_word_idx: int,
        outside_words_indexes: list[int],
        negative_idxs: Optional[list[int]] = None,  # only used for negative sampling
    ) -> tuple[float, np.ndarray, np.ndarray]:
        """Run the train step for a given center word, outside (context) words, and negative samples (if applicable)"""

        loss = 0.0
        grad_center_vectors = np.zeros(self.center_vecs.shape)
        gradoutside_vectors = np.zeros(self.outside_vecs.shape)

        for ow_idx in outside_words_indexes:
            center_word_idx = center_word_idx
            loss_j, grad_v_c, grad_outside_vector_j = self.loss_and_grad_fn(
                self.center_vecs[center_word_idx],
                ow_idx,
                self.outside_vecs,
                negative_idxs,
            )
            loss += loss_j
            grad_center_vectors[center_word_idx] += grad_v_c
            gradoutside_vectors += grad_outside_vector_j

        return loss, grad_center_vectors, gradoutside_vectors

    def save(self, current_step: int) -> None:
        np.save(f"{self.cache_path}/center_vecs_{current_step}.npy", self.center_vecs)
        np.save(f"{self.cache_path}/outside_vecs_{current_step}.npy", self.outside_vecs)

    def load(self, checkpoint_path: str, step: int) -> None:
        # load the latest checkpoint
        self.center_vecs = np.load(f"{checkpoint_path}/center_vecs_{step}.npy")
        self.outside_vecs = np.load(f"{checkpoint_path}/outside_vecs_{step}.npy")

    def get_embeddings_avg(self) -> np.ndarray:
        return (self.center_vecs + self.outside_vecs) / 2

    def get_embeddigns_concat(self) -> np.ndarray:
        return np.concatenate((self.center_vecs, self.outside_vecs), axis=1)

    def train(
        self, batch_size: int = 32, num_epochs: int = 10, num_steps: Optional[int] = None
    ) -> None:
        """
        The training loop of the model
        Batch size is simulated by gradient accumulation (code doesn't support batch dimension)

        """
        gradient_accumulation_steps = batch_size

        global_steps = 0
        steps = 0
        loss = 0.0
        grad_center = np.zeros(self.center_vecs.shape)
        grad_outside = np.zeros(self.outside_vecs.shape)
        total_batches = len(self.data) // batch_size
        if num_steps is not None:
            stop_at = num_steps
            num_epochs = 10000  # not used anymore
        else:
            stop_at = total_batches * num_epochs
        done = False

        for epoch in range(num_epochs):
            if done:
                break
            local_step = 0
            for center_idx, outside_word_indexes, negative_idxs in self.data:
                current_loss, center_grad, outside_grad = self.train_step(
                    center_idx,
                    outside_word_indexes,
                    negative_idxs,
                )
                if steps % gradient_accumulation_steps == 0:
                    grad_center += center_grad
                    grad_center /= batch_size
                    self.center_vecs -= self.learning_rate * grad_center

                    grad_outside += outside_grad
                    grad_outside /= batch_size
                    self.outside_vecs -= self.learning_rate * grad_outside

                    # zero out the gradients
                    grad_center = np.zeros(self.center_vecs.shape)
                    grad_outside = np.zeros(self.outside_vecs.shape)
                    global_steps += 1

                    if global_steps % self.log_freq == 0:
                        progress_percent = round(global_steps / stop_at * 100, 2)
                        print(
                            f"ep {epoch} step {local_step} global step {global_steps} n_batches {total_batches} progress {progress_percent}%: loss {(loss / steps):.3f}"
                        )

                    if global_steps % self.save_freq == 0:
                        self.save(global_steps)

                loss += current_loss
                steps += 1
                local_step += 1

                if global_steps >= stop_at:
                    done = True
                    break

In [14]:
# Training
# DO NOT MODIFY THIS CELL

# Setting the hyperparameters

embedding_dim = 20
learning_rate = 0.7
training_method = "softmax"
num_negative_samples = 10
window_size = 2
batch_size = 64
num_epochs = 1
save_freq = 1000000
cache_path = "data"
vocab_size = 5000
limit_data_size = 100

corpus = get_corpus(size=limit_data_size)


def get_model(training_method: str = "softmax") -> Word2Vec:
    word2vec = Word2Vec(
        corpus,
        embedding_dim=embedding_dim,
        learning_rate=learning_rate,
        loss_method=training_method,
        num_negative_samples=num_negative_samples,
        window_size=window_size,
        save_freq=save_freq,
        vocab_size=vocab_size,
        cache_path=cache_path,
    )
    word2vec.train(batch_size=batch_size, num_epochs=num_epochs)
    return word2vec


word2vec_softmax = get_model(training_method="softmax")
word2vec_negative_sampling = get_model(training_method="negative_sampling")
word2vec_softmax.save("final-softmax")
word2vec_negative_sampling.save("final-negative_sampling")


def load_stopwords() -> set[str]:
    import nltk

    stopwords = nltk.corpus.stopwords.words("english")
    stopwords = set(stopwords)
    return stopwords


stopwords = load_stopwords()


def get_nearest_neighbors(
    word: str, embeddings: np.ndarray, vocab2idx: dict[str, int], idx2vocab: dict[int, str], k: int = 5
) -> list[tuple[str, float]]:
    """
    Get the k nearest neighbors for a given word.
    """
    idx = vocab2idx[word]
    embedding = embeddings[idx]
    distances = np.linalg.norm(embedding - embeddings, axis=1)
    sorted_distances = np.argsort(distances)
    return [(idx2vocab[idx], distances[idx]) for idx in sorted_distances[:k]]


def print_nearest_neighbors(model: Word2Vec, k: int = 5, num_print: int = 20) -> None:
    # print nearest neighbors
    j = 0
    for word in model.vocab:
        j += 1
        if word in stopwords:
            continue
        print(
            word,
            get_nearest_neighbors(
                word,
                model.get_embeddings_avg(),
                model.vocab2idx,
                model.idx2vocab,
            ),
        )
        if j > num_print:
            break
    print(
        word,
        get_nearest_neighbors(
            word, model.get_embeddings_avg(), model.vocab2idx, model.idx2vocab, k=k
        ),
    )


print("Nearest neighbors for softmax model")
print_nearest_neighbors(word2vec_softmax)
print("\n---\nNearest neighbors for negative sampling model")
print_nearest_neighbors(word2vec_negative_sampling)

Preprocessing: 100%|██████████| 100/100 [00:00<00:00, 9025.44it/s]
Preprocessing: 100%|██████████| 100/100 [00:00<00:00, 10883.55it/s]


ep 0 step 63936 global step 1000 n_batches 1167 progress 85.69%: loss 34.005


Preprocessing: 100%|██████████| 100/100 [00:00<00:00, 5579.09it/s]


ep 0 step 63936 global step 1000 n_batches 1167 progress 85.69%: loss 30.442
Nearest neighbors for softmax model
<unk> [('<unk>', 0.0), ('were', 0.058117743106800936), ('dobson', 0.058529777830171145), ('liga', 0.05955627904550008), ('dinner', 0.060636233692665076)]
said [('said', 0.0), ('jail', 0.022853871360644512), ('bristol', 0.025014559215132293), ('shriberg', 0.025257632039757946), ('38', 0.02624368798740035)]
said [('said', 0.0), ('jail', 0.022853871360644512), ('bristol', 0.025014559215132293), ('shriberg', 0.025257632039757946), ('38', 0.02624368798740035)]

---
Nearest neighbors for negative sampling model
<unk> [('<unk>', 0.0), ('north', 0.0265327013088875), ('online', 0.028876199263734546), ('maximum', 0.030212628992952673), ('hell', 0.030900694951005492)]
said [('said', 0.0), ('waste', 0.022140011124549486), ('losing', 0.02375417395954198), ('reasonably', 0.02444103867124599), ('fan', 0.02462374798086869)]
said [('said', 0.0), ('waste', 0.022140011124549486), ('losing', 0.