Your task this week is to train a skip-gram Word2Vec model. You may use the code from lecture as a starting point. However, since the lecture focused on the CBOW model, you will need to modify the model architecture and the training data preparation.

To complete the assignment:

- Use the following text to train your model: gutenberg.org/cache/epub/7370/pg7370.txt
- Write code to:
  - Process your data
  - Create the training examples and labels
  - Train your model
  - Compare a few words to evaluate how well the model learned word representations (are they better than random?)
- Describe how the Skip-gram model architecture is different from CBOW, making direct references to your code

Note that the skip-gram model takes longer to train, so it is a good idea to use a GPU. Google Colab offers a free GPU instance.
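
To make the difference between the two training setups concrete, here is a minimal pure-Python sketch on a toy sentence. The helper names `cbow_examples` and `skipgram_examples` are hypothetical, and unlike the notebook code this toy version truncates the window at sentence edges instead of padding.

```python
from typing import List, Tuple


def cbow_examples(tokens: List[str], window: int) -> List[Tuple[List[str], str]]:
    """One example per position: (all context words, center word)."""
    examples = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        examples.append((left + right, center))
    return examples


def skipgram_examples(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """One example per (center word, single context word) pair."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = ["the", "state", "of", "nature"]
cbow = cbow_examples(tokens, window=1)
skip = skipgram_examples(tokens, window=1)

print(cbow[1])   # (['the', 'of'], 'state')
print(skip[:3])  # [('the', 'state'), ('state', 'the'), ('state', 'of')]
```

CBOW yields one example per position with the whole context as input, while skip-gram yields one example per word pair, which is why the skip-gram dataset is several times larger.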

In [2]:
# import modules & set up logging
import os
import re
import json
import pickle
import random
from typing import List
from collections import Counter, OrderedDict
from itertools import chain

import requests
import pandas as pd
from pandarallel import pandarallel
import gensim

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt_tab')
nltk.download('stopwords')

import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', 200)

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hp\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
import multiprocessing

num_processors = multiprocessing.cpu_count()
print(f'Available CPUs: {num_processors}')

Available CPUs: 22


# PyTorch

## Configuration

In [4]:
if torch.cuda.is_available():
    device = "cuda"
elif hasattr(torch, "xpu") and torch.xpu.is_available():  # Intel GPU backend; torch.xpu may be absent in older PyTorch builds
    device = "xpu"
else:
    device = "cpu"

print(f"Using {device} as device.")

Using xpu as device.


In [5]:
data_dir = "./data"
model_dir = "./models"

In [41]:
os.makedirs(data_dir, exist_ok=True)
os.makedirs(model_dir, exist_ok=True)

In [6]:
debug = False # set to false for full training run

In [8]:
if debug:
    CONTEXT_WINDOW = 2 # the number of words on either side of target word
    EMBEDDING_SIZE = 5
    MIN_FREQ = 5 # dropping words that appear less than 5 times
    BATCH_SIZE = 3
    N_EPOCHS = 1
else:
    CONTEXT_WINDOW = 4 # the number of words on either side of target word
    EMBEDDING_SIZE = 100
    MIN_FREQ = 1 # keep every word (no rare-word filtering in the full run)
    BATCH_SIZE = 64
    N_EPOCHS = 10

## Process Data

In [12]:
url = "https://www.gutenberg.org/cache/epub/7370/pg7370.txt"
response = requests.get(url)
data: List[str] = response.text.splitlines()

In [13]:
data = [sentence.split() for sentence in data]
data = [line for line in data if line and any(word.strip() for word in line)]  # Remove empty lines
print(f"Number of lines in the data: {len(data)}")

Number of lines in the data: 4972
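
The whitespace split above keeps case and punctuation attached to tokens, so `WARE,` and `reflection,` become vocabulary entries of their own. As a hedged alternative, the sketch below lowercases each line and keeps only alphabetic tokens. `simple_tokenize` and `TOKEN_RE` are hypothetical names, and applying this to the corpus would change the vocabulary size and every result recorded in this notebook.

```python
import re
from typing import List

# runs of letters, optionally followed by an apostrophe part (e.g. "locke's")
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def simple_tokenize(line: str) -> List[str]:
    """Lowercase a line and return its alphabetic tokens, dropping punctuation."""
    return TOKEN_RE.findall(line.lower())


print(simple_tokenize("To this reflection, viz. that there cannot be done"))
# ['to', 'this', 'reflection', 'viz', 'that', 'there', 'cannot', 'be', 'done']
```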


In [14]:
class Vocab:
    def __init__(
        self,
        word_counts: OrderedDict, # vocabulary is based on word counts
        min_freq: int = 1, # min times a word must appear in corpus (rare words might not be worth considering)
        max_size: int = None, # we can limit the number of words as well
        specials: List[str] = None, # any other special tokens we may want to add, like padding tokens
        unk_token: str = "<unk>" # reserved token for when we run into words not in the vocabulary
    ):
        self.word_counts = word_counts
        self.min_freq = min_freq
        self.max_size = max_size
        self.unk_token = unk_token
        self.specials = list(specials) if specials else []

        if self.unk_token not in self.specials:
            self.specials.insert(0, self.unk_token) # unknown token should always be included

        self.token2idx = {}
        self.idx2token = []

        self._prepare_vocab()


    def __len__(self):
        return len(self.idx2token)
    

    def __contains__(self, value):
        return value in self.idx2token


    def _prepare_vocab(self):
        """Processes input OrderedDict: Filters based on min_freq & adds special tokens."""
        vocab_list = self.specials.copy()  # Copy specials to avoid modifying original list

        # filter words based on min_freq and add to vocab
        filtered_words = [
            word
            for word, freq in self.word_counts.items()
            if freq >= self.min_freq and word not in self.specials
        ]

        # enforcing max vocab size constraint
        if self.max_size is not None:
            n_to_keep = self.max_size - len(self.specials) # special tokens take up spaces
            filtered_words = filtered_words[:n_to_keep]

        # creating final vocab list
        vocab_list.extend(word for word in filtered_words)

        # create look up tables
        self.idx2token = vocab_list
        self.token2idx = {word: idx for idx, word in enumerate(vocab_list)}


    def get_token(self, idx: int) -> str:
        """Returns the token corresponding to an index. Raises error if index is out of range."""
        if 0 <= idx < len(self.idx2token):
            return self.idx2token[idx]
        raise IndexError(f"Index {idx} is out of range for vocabulary size {len(self.idx2token)}")


    def get_index(self, token: str) -> int:
        """Returns the index corresponding to a token. Defaults to unk_token if missing."""
        return self.token2idx.get(token, self.token2idx[self.unk_token])  # return unk_token index if word is not in vocab


    def get_tokens(self, indices: List[int]) -> List[str]:
        """Converts a list of indices into a list of tokens."""
        return [self.get_token(idx) for idx in indices]


    def get_indices(self, tokens: List[str]) -> List[int]:
        """Converts a list of tokens into a list of indices."""
        return [self.get_index(token) for token in tokens]

In [15]:
def pad_sentences(sentences: List[List[str]], context_length: int, pad_token: str = "<pad>") -> List[List[str]]:
    """
    Pads both ends of each sentence with `context_length` copies of `pad_token`,
    so that every original token has a full context window.
    
    Args:
        sentences: A list of sentences, where each sentence is a list of tokens.
        context_length: The number of tokens to either side of the target token.
        pad_token: The token used for padding (defaults to "<pad>").

    Returns:
        A list of padded sentences.
    """
    padded_sentences = []
    for sentence in sentences:
        padded_sentence = [pad_token] * context_length + sentence + [pad_token] * context_length
        padded_sentences.append(padded_sentence)
    
    return padded_sentences

In [16]:
sentences = pad_sentences(data, CONTEXT_WINDOW)

In [17]:
vocab = Vocab(
    word_counts=OrderedDict(Counter(chain.from_iterable(sentences))),
    min_freq=MIN_FREQ,
    specials=["<pad>"]
)

In [18]:
# size of the vocabulary, including the <unk> and <pad> special tokens
print(f"Size of Vocabulary: {len(vocab):,}")

Size of Vocabulary: 7,844


In [19]:
for idx in [0, 1, 5, 100, 200, 276]:
    print(f"Index {idx} corresponds to `{vocab.get_token(idx)}`")

Index 0 corresponds to `<unk>`
Index 1 corresponds to `<pad>`
Index 5 corresponds to `eBook`
Index 100 corresponds to `JOHN`
Index 200 corresponds to `WARE,`
Index 276 corresponds to `what`
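
The lookups above rely on two behaviours of the `Vocab` class: words below `min_freq` are dropped, and any unknown word maps to the `<unk>` index. The following stand-alone sketch reproduces just those two behaviours on toy data. It is a simplified stand-in, not the class itself, and the threshold of 2 is chosen only for illustration.

```python
from collections import Counter

tokens = ["the", "state", "of", "nature", "the", "state", "the"]
counts = Counter(tokens)  # the: 3, state: 2, of: 1, nature: 1

min_freq = 2  # illustrative threshold
# special tokens first, then every word that is frequent enough
idx2token = ["<unk>", "<pad>"] + [w for w, c in counts.items() if c >= min_freq]
token2idx = {w: i for i, w in enumerate(idx2token)}


def get_index(token: str) -> int:
    """Return the token's index, falling back to the <unk> index."""
    return token2idx.get(token, token2idx["<unk>"])


print(idx2token)            # ['<unk>', '<pad>', 'the', 'state']
print(get_index("state"))   # 3
print(get_index("nature"))  # 0, because "nature" was filtered out
```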


### Create the Training Data and Labels

In [20]:
def generate_skipgram_pairs(sentences: List[List[str]], context_length: int, vocab: Vocab):
    """
    Generate (target, context) pairs for Skip-gram model.
    
    In Skip-gram, we predict context words from the target word.
    Each target word generates multiple training examples (one for each context word).

    Args:
        sentences: A list of sentences, where each sentence is a list of tokens.
        context_length: The number of tokens to either side of the target token.
        vocab: a vocab object that maps words to indices and vice versa

    Returns:
        A tuple of two 1-D tensors, (targets, contexts), aligned so that
        position i holds one (target token index, context token index) pair.
    """
    targets = []
    contexts = []

    for sentence in sentences:
        # Convert sentence to indices
        enc_sentence = vocab.get_indices(sentence)
        
        # Each target word will generate multiple (target, context) pairs
        for target_idx in range(context_length, len(enc_sentence) - context_length):
            target = enc_sentence[target_idx]
            
            # Words to the left of the target
            for j in range(target_idx - context_length, target_idx):
                contexts.append(enc_sentence[j])
                targets.append(target)
            
            # Words to the right of the target
            for j in range(target_idx + 1, target_idx + context_length + 1):
                contexts.append(enc_sentence[j])
                targets.append(target)

    return torch.tensor(targets), torch.tensor(contexts)

In [21]:
def generate_cbow_pairs(sentences: List[List[str]], context_length: int, vocab: Vocab):
    """
    Generate (context, target) pairs for CBOW model.

    Args:
        sentences: A list of sentences, where each sentence is a list of tokens.
        context_length: The number of tokens to either side of the target token.
        vocab: a Vocab object that maps words to indices and vice versa

    Returns:
        A tuple of tensors (contexts, targets): contexts has shape
        (n_examples, 2 * context_length) and targets has shape (n_examples,).
    """

    contexts = []
    targets = []

    for sentence in sentences:
        
        # use the vocab object to convert the sentence from a list of tokens to a list of integers
        enc_sentence = vocab.get_indices(sentence)
        
        # each sentence can potentially generate several training examples
        # target_idx iterates over every position of a target word in the sentence
        # <context-word> <context-word> target-idx <context-word> <context-word>
        for target_idx in range(context_length, len(enc_sentence) - context_length):

            # Create context list and remove target token from it
            target = enc_sentence[target_idx]
            context = (
                enc_sentence[target_idx - context_length : target_idx] # words to the left of the target word
                + enc_sentence[target_idx + 1 : target_idx + context_length + 1] # words to the right of the target word
            )

            # Append the (context, target) pair to the list
            contexts.append(context)
            targets.append(target)

    return torch.tensor(contexts), torch.tensor(targets)

In [22]:
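# NOTE: generate_skipgram_pairs returns (targets, contexts). With the unpacking order
# below, `skip_contexts` therefore holds the center (target) words and `skip_targets`
# holds their context words.
skip_contexts, skip_targets = generate_skipgram_pairs(sentences, CONTEXT_WINDOW, vocab)

# note that there are far more training examples than sentences
# because every target word yields one example per context position (2 * CONTEXT_WINDOW)
print(f"Number of training examples: {skip_targets.shape}")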

Number of training examples: torch.Size([476448])


In [26]:
# converting a few context-target pairs back to strings
for idx in [10,20,33,446,121]:
    print("context:", vocab.get_tokens([skip_contexts[idx].item()]))
    print("target:", vocab.get_tokens([skip_targets[idx].item()]))
    print()

context: ['Project']
target: ['<pad>']

context: ['Gutenberg']
target: ['eBook']

context: ['of']
target: ['Project']

context: ['License']
target: ['this']

context: ['of']
target: ['for']



In [28]:
cbow_contexts, cbow_targets = generate_cbow_pairs(sentences, CONTEXT_WINDOW, vocab)

# note that there are more training examples than sentences
# because one sentence, if long enough, can provide multiple training examples
print(f"Number of training examples: {cbow_targets.shape}")

Number of training examples: torch.Size([59556])
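
Because padding gives every original token a full window, each token serves as a target exactly once. CBOW therefore produces one example per token, and Skip-gram produces `2 * CONTEXT_WINDOW` pairs per token. The sketch below checks that relation against the two counts printed in this notebook. `count_examples` is a hypothetical helper, not part of the notebook.

```python
from typing import List, Tuple


def count_examples(sentences: List[List[str]], context_length: int) -> Tuple[int, int]:
    """Return (n_cbow_examples, n_skipgram_pairs) for sentences padded on both sides."""
    n_tokens = sum(len(sentence) for sentence in sentences)
    return n_tokens, n_tokens * 2 * context_length


toy = [["the", "state", "of", "nature"], ["of", "property"]]
print(count_examples(toy, context_length=4))  # (6, 48)

# The counts reported in this notebook follow the same relation:
# 59,556 CBOW examples * 2 * 4 context positions = 476,448 Skip-gram pairs
assert 59_556 * 2 * 4 == 476_448
```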


In [31]:
for idx in [6, 27, 1000]:
    print("context:", vocab.get_tokens(cbow_contexts[idx].tolist()))
    print("target:", vocab.get_tokens([cbow_targets[idx].item()]))
    print()

context: ['Gutenberg', 'eBook', 'of', 'Second', 'of', 'Government', '<pad>', '<pad>']
target: ['Treatise']

context: ['most', 'other', 'parts', 'of', 'world', 'at', 'no', 'cost']
target: ['the']

context: ['to', 'this', 'reflection,', 'viz.', 'there', 'cannot', 'be', 'done']
target: ['that']



### Dataset

In [32]:
class CBOWDataset(Dataset): # subclassing Dataset is required here
    
    def __init__(self, contexts, targets): # necessary method / function
        self.contexts = contexts
        self.targets = targets

    def __len__(self): # necessary method / function
        return len(self.contexts)

    def __getitem__(self, idx): # necessary method / function
        return self.contexts[idx], self.targets[idx]

In [33]:
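# Dataset class (similar to CBOW, but each example is a single word pair)
# NOTE: __getitem__ returns (context, target), so a training loop that feeds the first
# element to the model predicts the center word from one context word; return
# (self.targets[idx], self.contexts[idx]) for the usual center -> context direction.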
class SkipGramDataset(Dataset):
    def __init__(self, targets, contexts):
        self.targets = targets
        self.contexts = contexts

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        return self.contexts[idx], self.targets[idx]

## Create and Train Model

### CBOW

In [34]:
class CBOW(nn.Module):

    def __init__(self, vocab_size, dims=EMBEDDING_SIZE):
        super().__init__()
        self.embeddings = nn.Embedding(num_embeddings=vocab_size, embedding_dim=dims)
        self.linear = nn.Linear(in_features=dims, out_features=vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).mean(dim=1) # note which dimension we are taking the mean of
        out = self.linear(embeds) # outputting raw logits, number of outputs == vocabulary size
        return out
    
    def debug_forward(self, inputs):
        embeds = self.embeddings(inputs)
        print("\nembeddings shape:", embeds.shape)
        print(embeds)
        agg = embeds.mean(dim=1)
        print("\nembeddings shape after aggregating:", agg.shape)
        print(agg)
        out = self.linear(agg)
        print("\nshape of logits:", out.shape)
        print(out)
        return out


In [36]:
cbow_model = CBOW(vocab_size=len(vocab)).to(device)
print(cbow_model)
print(f"Size of Vocabulary: {len(vocab):,}")

CBOW(
  (embeddings): Embedding(7844, 100)
  (linear): Linear(in_features=100, out_features=7844, bias=True)
)
Size of Vocabulary: 7,844


In [None]:
# setting up loss and optimizer
criterion = nn.CrossEntropyLoss(ignore_index=vocab.get_index(vocab.unk_token))
optimizer = optim.Adam(cbow_model.parameters(), lr=0.001)

# setting up dataloader
dataset = CBOWDataset(cbow_contexts, cbow_targets)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

# number of passes through the data
for epoch in range(N_EPOCHS):

    epoch_loss = 0

    for batch_contexts, batch_targets in dataloader:
        
        batch_contexts, batch_targets = batch_contexts.to(device), batch_targets.to(device)
        
        if debug:
            print(f"{batch_contexts=}")
            print(f"{batch_targets=}")
            
            if torch.isnan(batch_contexts).any() or torch.isinf(batch_contexts).any():
                print("NaN or Inf detected in batch_contexts")
            if torch.isnan(batch_targets).any() or torch.isinf(batch_targets).any():
                print("NaN or Inf detected in batch_targets")


        # zero gradients
        optimizer.zero_grad()

        # forward pass
        if debug:
            pred = cbow_model.debug_forward(batch_contexts) # use regular forward for training run
        else:
            pred = cbow_model.forward(batch_contexts)
        
        loss = criterion(pred, batch_targets)

        # backward pass
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

        if debug: break

    if debug: break

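    # NOTE: epoch_loss sums per-batch mean losses, so dividing by len(dataset) reports
    # roughly (mean loss / BATCH_SIZE); divide by len(dataloader) for the mean loss per batch
    print(f'Epoch {epoch+1}/{N_EPOCHS}, Loss: {epoch_loss/len(dataset):.4f}')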

Epoch 1/10, Loss: 0.1115
Epoch 2/10, Loss: 0.0983
Epoch 3/10, Loss: 0.0932
Epoch 4/10, Loss: 0.0887
Epoch 5/10, Loss: 0.0845
Epoch 6/10, Loss: 0.0805
Epoch 7/10, Loss: 0.0768
Epoch 8/10, Loss: 0.0733
Epoch 9/10, Loss: 0.0701
Epoch 10/10, Loss: 0.0670


### SkipGram

In [35]:
# Skip-gram model 
class SkipGram(nn.Module):
    def __init__(self, vocab_size, dims=EMBEDDING_SIZE):
        super().__init__()
        self.embeddings = nn.Embedding(num_embeddings=vocab_size, embedding_dim=dims)
        self.linear = nn.Linear(in_features=dims, out_features=vocab_size)

    def forward(self, inputs):
        # Unlike CBOW, we don't average embeddings in Skip-gram
        # We simply look up the embedding of the target word
        embeds = self.embeddings(inputs)
        out = self.linear(embeds)  # outputting raw logits
        return out
    
    def debug_forward(self, inputs):
        embeds = self.embeddings(inputs)
        print("\nembeddings shape:", embeds.shape)
        print(embeds)
        out = self.linear(embeds)
        print("\nshape of logits:", out.shape)
        print(out)
        return out

In [37]:
skip_model = SkipGram(vocab_size=len(vocab)).to(device)
print(skip_model)
print(f"Size of Vocabulary: {len(vocab):,}")

SkipGram(
  (embeddings): Embedding(7844, 100)
  (linear): Linear(in_features=100, out_features=7844, bias=True)
)
Size of Vocabulary: 7,844


In [39]:
### training time takes about 20 minutes on T4 gpu

# setting up loss and optimizer
criterion = nn.CrossEntropyLoss(ignore_index=vocab.get_index(vocab.unk_token))
optimizer = optim.Adam(skip_model.parameters(), lr=0.001)

# setting up dataloader
dataset = SkipGramDataset(skip_contexts, skip_targets)
dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

# number of passes through the data
for epoch in range(N_EPOCHS):

    epoch_loss = 0

    for batch_contexts, batch_targets in dataloader:
        
        batch_contexts, batch_targets = batch_contexts.to(device), batch_targets.to(device)
        
        if debug:
            print(f"{batch_contexts=}")
            print(f"{batch_targets=}")
            
            if torch.isnan(batch_contexts).any() or torch.isinf(batch_contexts).any():
                print("NaN or Inf detected in batch_contexts")
            if torch.isnan(batch_targets).any() or torch.isinf(batch_targets).any():
                print("NaN or Inf detected in batch_targets")


        # zero gradients
        optimizer.zero_grad()

        # forward pass
        if debug:
            pred = skip_model.debug_forward(batch_contexts) # use regular forward for training run
        else:
            pred = skip_model.forward(batch_contexts)
        
        loss = criterion(pred, batch_targets)

        # backward pass
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

        if debug: break

    if debug: break

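    # NOTE: this is the sum of per-batch mean losses divided by the number of examples,
    # i.e. roughly (mean loss / BATCH_SIZE); use len(dataloader) for the mean loss per batch
    print(f'Epoch {epoch+1}/{N_EPOCHS}, Loss: {epoch_loss/len(dataset):.4f}')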

Epoch 1/10, Loss: 0.1076
Epoch 2/10, Loss: 0.0993
Epoch 3/10, Loss: 0.0970
Epoch 4/10, Loss: 0.0954
Epoch 5/10, Loss: 0.0942
Epoch 6/10, Loss: 0.0933
Epoch 7/10, Loss: 0.0925
Epoch 8/10, Loss: 0.0919
Epoch 9/10, Loss: 0.0913
Epoch 10/10, Loss: 0.0909
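
The losses printed by both training loops are small because each loop sums per-batch mean losses and then divides by the number of examples rather than the number of batches. As a rough sketch, multiplying the printed values by `BATCH_SIZE` recovers an approximate mean cross-entropy per example, which can then be compared with the chance level `ln(vocab_size)` of a uniform guess. The conversion is approximate because the last batch may be smaller, and `per_example_loss` is a hypothetical helper.

```python
import math

BATCH_SIZE = 64
VOCAB_SIZE = 7_844


def per_example_loss(reported: float, batch_size: int = BATCH_SIZE) -> float:
    """Undo the extra division: sum(batch means) / n_examples ~= mean loss / batch_size."""
    return reported * batch_size


chance = math.log(VOCAB_SIZE)  # cross-entropy of a uniform guess over the vocabulary
print(f"chance level: {chance:.2f}")

# first and last epoch values copied from the training logs above
for name, first, last in [("CBOW", 0.1115, 0.0670), ("Skip-gram", 0.1076, 0.0909)]:
    print(f"{name}: {per_example_loss(first):.2f} -> {per_example_loss(last):.2f}")
```

On this scale both models start below the chance level of about 8.97 and keep improving, with CBOW ending at a lower value than Skip-gram. The two numbers are not directly comparable, since predicting a single word pair is a harder task than predicting from a full context window.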


## Save Models

In [45]:
torch.save(cbow_model.embeddings.weight.data, f"{model_dir}/cbow_weights.pt")
torch.save(skip_model.embeddings.weight.data, f"{model_dir}/skip_weights.pt")

In [43]:
# save vocab locally
with open(f"{model_dir}/vocab.pkl", "wb") as f:
    pickle.dump(vocab, f)
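
Pickling the `Vocab` instance means the class definition must be importable wherever the file is loaded. As a hedged alternative, the token list alone can be stored as JSON and the lookup table rebuilt on load. The sketch below uses a toy list as a stand-in for `vocab.idx2token` and writes to a temporary directory rather than `model_dir`.

```python
import json
import os
import tempfile

idx2token = ["<unk>", "<pad>", "the", "state"]  # stand-in for vocab.idx2token

path = os.path.join(tempfile.mkdtemp(), "vocab.json")

# save: the list order encodes the indices
with open(path, "w", encoding="utf-8") as f:
    json.dump(idx2token, f, ensure_ascii=False)

# load: rebuild the token -> index table from the list order
with open(path, encoding="utf-8") as f:
    restored = json.load(f)
token2idx = {token: idx for idx, token in enumerate(restored)}

print(token2idx["state"])  # 3
```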

## Evaluating Model: compare the trained model with the untrained model

In [44]:
def closest_words(embeddings, vocab, word, n=10):
    """
    Find the closest words in terms of cosine similarity.

    Args:
    embeddings: A (vocab_size, dims) tensor of word embeddings.
    vocab: The vocabulary object.
    word: The target word.
    n: Number of closest words to find (default is 10).

    Returns:
    A list of tuples containing the closest words and their cosine similarities.
    """
    if word not in vocab.idx2token:
        raise ValueError(f"'{word}' not in vocabulary")

    # Get the index of the word
    word_idx = vocab.get_index(word)

    # Compute cosine similarity between the word and all other words
    word_embedding = embeddings[word_idx]
    similarities = F.cosine_similarity(word_embedding.unsqueeze(0), embeddings, dim=1)

    # Exclude the word itself
    similarities[word_idx] = -1

    # Find the top n similar words
    closest_idxs = similarities.topk(n).indices

    return [(vocab.get_token(idx), similarities[idx].item()) for idx in closest_idxs]

In [55]:
# cbow_weights_from_class = torch.load(f"{model_dir}/cbow_weights.pt", weights_only=True, map_location=torch.device(device))
skip_weights_from_class = torch.load(f"{model_dir}/skip_weights.pt", weights_only=True, map_location=torch.device(device))

In [56]:
with open(f"{model_dir}/vocab.pkl", "rb") as f:
    vocab_from_class = pickle.load(f)

In [58]:
closest_words(
    embeddings=skip_weights_from_class, # can substitute own weights here
    vocab=vocab_from_class, # can substitute own vocab here
    word="Project", 
    n=10
)

[('Gutenberg™', 0.6182439923286438),
 ('group', 0.46330147981643677),
 ('Gutenberg', 0.45303237438201904),
 ('attached', 0.4118044972419739),
 ('Foundation', 0.38001516461372375),
 ('trademark.', 0.3743043839931488),
 ('included', 0.3677965998649597),
 ('electronic', 0.3648824989795685),
 ('License', 0.36455070972442627),
 ('concept', 0.35457512736320496)]

In [60]:
skip_model_untrained = SkipGram(vocab_size=len(vocab))

closest_words(skip_model_untrained.embeddings.weight.data, vocab, "Project", n=10)

[('ministerially', 0.3565547466278076),
 ('Chap.', 0.3304622173309326),
 ('wrought', 0.32428815960884094),
 ('find,', 0.32320037484169006),
 ('pleases,', 0.3195438086986542),
 ('farther,', 0.31941407918930054),
 ('Rome', 0.31889426708221436),
 ('different', 0.3183806836605072),
 ('BEFORE', 0.31094738841056824),
 ('each', 0.3072778880596161)]
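
The untrained neighbours above score between roughly 0.31 and 0.36 even though nothing has been learned. A rough way to see that chance alone can produce such values is to draw random vectors and take the largest cosine similarity to one query vector. The pure-Python sketch below does this, assuming the default `nn.Embedding` initialisation (standard normal) and the vocabulary size and embedding dimension used in this notebook. The helper names are hypothetical.

```python
import math
import random
from typing import List


def unit_gaussian(dim: int, rng: random.Random) -> List[float]:
    """Draw a standard-normal vector and scale it to unit length."""
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def max_random_cosine(n_words: int = 7_844, dim: int = 100, seed: int = 0) -> float:
    """Largest cosine similarity between one random query and n_words - 1 random vectors."""
    rng = random.Random(seed)
    query = unit_gaussian(dim, rng)
    best = -1.0
    for _ in range(n_words - 1):
        other = unit_gaussian(dim, rng)
        # both vectors have unit length, so the dot product is the cosine similarity
        best = max(best, sum(a * b for a, b in zip(query, other)))
    return best


print(f"largest similarity among random vectors: {max_random_cosine():.3f}")
```

If the printed value lands near the untrained similarities, trained neighbours that score in the same range should be read with caution, while clearly higher scores, such as the 0.62 for "Gutenberg™", are harder to attribute to chance.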

In [61]:
def compare_trained_untrained(embeddings_trained, embeddings_untrained, vocab, target_words, topn=5):
    """
    Compare the closest words for trained and untrained embeddings.

    Args:
        embeddings_trained: Trained model embeddings.
        embeddings_untrained: Untrained model embeddings.
        vocab: Vocabulary object.
        target_words: List of target words to evaluate.
        topn: Number of closest words to retrieve.

    Returns:
        A DataFrame showing the closest words for both trained and untrained embeddings.
    """
    rows = []

    for word in target_words:
        row = [word]

        # Get closest words for trained embeddings
        try:
            trained_sim = closest_words(embeddings_trained, vocab, word, n=topn)
            row.extend([f"{w} ({sim:.6f})" for w, sim in trained_sim])
        except ValueError:
            row.extend([None] * topn)

        # Get closest words for untrained embeddings
        try:
            untrained_sim = closest_words(embeddings_untrained, vocab, word, n=topn)
            row.extend([f"{w} ({sim:.6f})" for w, sim in untrained_sim])
        except ValueError:
            row.extend([None] * topn)

        rows.append(row)

    columns = (
        ["Target Word"] +
        [f"Trained Top {i+1}" for i in range(topn)] +
        [f"Untrained Top {i+1}" for i in range(topn)]
    )

    return pd.DataFrame(rows, columns=columns)


In [63]:
target_words = ["Project", "slavery", "property", "war", "state", "love", "land", "owner", "child", "history", "health",
                "legislative", "federative", "prerogative", "paternal", "political", "power", "tyrant", "king", "dissolution"]
# Compare trained and untrained embeddings
comparison_df = compare_trained_untrained(
    embeddings_trained=skip_weights_from_class,
    embeddings_untrained=skip_model_untrained.embeddings.weight.data,
    vocab=vocab,
    target_words=target_words,
    topn=5
)

# Display the comparison DataFrame
comparison_df

| | Target Word | Trained Top 1 | Trained Top 2 | Trained Top 3 | Trained Top 4 | Trained Top 5 | Untrained Top 1 | Untrained Top 2 | Untrained Top 3 | Untrained Top 4 | Untrained Top 5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Project | Gutenberg™ (0.618244) | group (0.463301) | Gutenberg (0.453032) | attached (0.411804) | Foundation (0.380015) | ministerially (0.356555) | Chap. (0.330462) | wrought (0.324288) | find, (0.323200) | pleases, (0.319544) |
| 1 | slavery | inhabitants. (0.416616) | enters (0.405078) | flattered (0.380136) | chanced (0.375924) | 190. (0.372523) | deference (0.338117) | appears (0.330488) | These, (0.322855) | company, (0.321896) | odium (0.319615) |
| 2 | property | wonderful (0.375763) | him? (0.366444) | direct; (0.345035) | writ (0.344350) | favourite (0.339717) | otherwise, (0.351606) | thought (0.349163) | on. (0.328751) | violates, (0.326966) | inroads (0.326171) |
| 3 | war | distinction (0.360338) | orders, (0.353231) | 1.F.1. (0.339296) | weal, (0.333891) | pure (0.313938) | inclinations (0.410378) | eBooks. (0.338991) | 164. (0.333776) | agreement, (0.333652) | unity (0.330755) |
| 4 | state | hundredth (0.398169) | prevent, (0.373378) | law: (0.365230) | another: (0.356500) | continued, (0.348488) | dangerous, (0.388810) | conjugal (0.376794) | yellow (0.359922) | left; (0.353840) | But, (0.330724) |
| 5 | love | determination (0.441058) | beware (0.412676) | agreeing (0.381267) | consenting (0.364311) | stickler (0.363374) | wilderness (0.375820) | comprehending (0.357659) | children’s (0.343036) | choose (0.342096) | winter, (0.334080) |
| 6 | land | raise (0.370643) | straitening (0.365629) | latter (0.360880) | throne; (0.358619) | Almighty: (0.349587) | slavery, (0.325534) | sacredness (0.317318) | (any (0.317094) | said (0.314947) | defends (0.307193) |
| 7 | owner | Literary (0.449295) | 1980. (0.432703) | indemnify (0.403222) | Archive (0.401090) | 162. (0.388566) | security: (0.345496) | Justin, (0.342210) | finding (0.332893) | uncultivated (0.325989) | miserable (0.317754) |
| 8 | child | though (0.379086) | lion (0.375819) | devised; (0.365077) | villany. (0.361852) | nourish, (0.358377) | Self-defence (0.402027) | returned (0.372312) | ferro, (0.372173) | comes, (0.353006) | honest (0.327401) |
| 9 | history | For (0.384496) | venison (0.377063) | careful (0.372775) | Cum (0.355754) | righteous (0.355308) | potestate (0.386430) | settled, (0.308138) | offices (0.306163) | 222. (0.305195) | secretly (0.300081) |


The trained Skip-gram model produces more meaningful word embeddings than the untrained model, most clearly for words tied to the Project Gutenberg license text:

**Trained Model**: For "Project", the five closest words include "Gutenberg™", "Gutenberg", and "Foundation", all of which appear alongside "Project" in the Project Gutenberg license text, and the top similarity (0.62) is well above any untrained similarity in the table. "owner" shows the same pattern, with "Literary", "indemnify", and "Archive" also coming from the license boilerplate. For other words such as "love" ("determination", "beware", "agreeing"), the neighbours are harder to interpret, and their similarities (roughly 0.36 to 0.44) are only modestly higher than the untrained ones.

**Untrained Model**: The closest words are essentially random and lack meaningful semantic connections. For "Project", the untrained model suggests "ministerially," "Chap.," and "wrought"; for "slavery," it suggests "deference" and "appears." None of these reflects how the target word is used in the text.

# Gensim

## Data Processing

Sentence tokenization and word tokenization:

In [4]:
url = "https://www.gutenberg.org/cache/epub/7370/pg7370.txt"
text = requests.get(url).text.lower()

sentences = []
for sent in sent_tokenize(text):
    tokens = word_tokenize(sent)
    tokens = [word for word in tokens if word.isalpha()]
    if tokens:
        sentences.append(tokens)

print(f"Total sentences: {len(sentences)}")
print(f"Example sentence tokens: {sentences[0][:10]}")

Total sentences: 1460
Example sentence tokens: ['project', 'gutenberg', 'ebook', 'of', 'second', 'treatise', 'of', 'government', 'this', 'ebook']


In [5]:
sentence_texts = [" ".join(tokens) for tokens in sentences]
book_df = pd.DataFrame(sentence_texts, columns=["text"])

# Memory usage and shape
print(f"Memory used by DF {book_df.memory_usage().sum()}")
print(f"Read rows: {book_df.shape[0]}, columns: {book_df.shape[1]}")
book_df.head()

Memory used by DF 11812
Read rows: 1460, columns: 1


| | text |
|---|---|
| 0 | project gutenberg ebook of second treatise of government this ebook is for the use of anyone anywhere in the united states and most other parts of the world at no cost and with almost no restricti... |
| 1 | you may copy it give it away or it under the terms of the project gutenberg license included with this ebook or online at |
| 2 | if you are not located in the united states you will have to check the laws of the country where you are located before using this ebook |
| 3 | title second treatise of government author john locke release date january ebook most recently updated december language english credits dave gowan and chuck greif start of the project gutenberg e... |
| 4 | john locke s second treatise of government was published in the complete unabridged text has been republished several times in edited commentaries |


In [6]:
book_df.tail()

| | text |
|---|---|
| 1455 | for forty years he produced and distributed project ebooks with only a loose network of volunteer support |
| 1456 | project ebooks are often created from several printed editions all of which are confirmed as not protected by copyright in the unless a copyright notice is included |
| 1457 | thus we do not necessarily keep ebooks in compliance with any particular paper edition |
| 1458 | most people start at our website which has the main pg search facility |
| 1459 | this website includes information about project including how to make donations to the project gutenberg literary archive foundation how to help produce our new ebooks and how to subscribe to our ... |


Clean-up text:

In [7]:
pandarallel.initialize()
stop_words = set(nltk.corpus.stopwords.words('english'))  # a set gives fast membership tests

book_df['clean_sentences'] = book_df['text'].parallel_apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
# keep letters, digits, spaces and @ . , : _ - (the hyphen goes last so it is treated literally)
book_df['clean_sentences'] = book_df['clean_sentences'].parallel_apply(lambda x: re.sub(r'[^a-zA-Z0-9 @.,:_-]', '', x))

book_df = book_df[['clean_sentences']]
book_df.head()

INFO: Pandarallel will run on 1 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


| | clean_sentences |
|---|---|
| 0 | project gutenberg ebook second treatise government ebook use anyone anywhere united states parts world cost almost restrictions whatsoever |
| 1 | may copy give away terms project gutenberg license included ebook online |
| 2 | located united states check laws country located using ebook |
| 3 | title second treatise government author john locke release date january ebook recently updated december language english credits dave gowan chuck greif start project gutenberg ebook second treatis... |
| 4 | john locke second treatise government published complete unabridged text republished several times edited commentaries |


In [8]:
sentences = [row.split() for row in book_df['clean_sentences']]
sentences[:2]

[['project',
  'gutenberg',
  'ebook',
  'second',
  'treatise',
  'government',
  'ebook',
  'use',
  'anyone',
  'anywhere',
  'united',
  'states',
  'parts',
  'world',
  'cost',
  'almost',
  'restrictions',
  'whatsoever'],
 ['may',
  'copy',
  'give',
  'away',
  'terms',
  'project',
  'gutenberg',
  'license',
  'included',
  'ebook',
  'online']]

## Create the training examples and labels

Manual training pair generation:

The following code manually creates Skip-gram training pairs (center word, context word) using a sliding window. While this step is not required when using `gensim.Word2Vec`, we include it here to demonstrate how training examples and labels are constructed in Skip-gram.

Build vocabulary and index mapping:

In [9]:
# Flatten all words and count
word_counts = Counter(word for sentence in sentences for word in sentence)

# Optional: filter rare words (min frequency = 5)
min_freq = 5
vocab = [word for word, count in word_counts.items() if count >= min_freq]

# Create mappings
word2idx = {word: idx for idx, word in enumerate(vocab)}
idx2word = {idx: word for word, idx in word2idx.items()}
vocab_size = len(word2idx)

print(f"Vocab size: {vocab_size}")

Vocab size: 1028


Create skip-gram training pairs:

In [10]:
window_size = 2
training_pairs = []

for sentence in sentences:
    sentence = [word for word in sentence if word in word2idx]  # filter unknown words
    for idx, center_word in enumerate(sentence):
        center_idx = word2idx[center_word]
        for w in range(-window_size, window_size + 1):
            context_pos = idx + w
            if w != 0 and 0 <= context_pos < len(sentence):
                context_word = sentence[context_pos]
                context_idx = word2idx[context_word]
                training_pairs.append((center_idx, context_idx))

print(f"Total training pairs: {len(training_pairs)}")
print("Example:", training_pairs[:5])

# Convert first 5 training pairs back to words
for i in range(5):
    center_idx, context_idx = training_pairs[i]
    center_word = idx2word[center_idx]
    context_word = idx2word[context_idx]
    print(f"Pair {i+1}: ({center_word}, {context_word})")

Total training pairs: 74194
Example: [(0, 1), (0, 2), (1, 0), (1, 2), (1, 3)]
Pair 1: (project, gutenberg)
Pair 2: (project, ebook)
Pair 3: (gutenberg, project)
Pair 4: (gutenberg, ebook)
Pair 5: (gutenberg, second)
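
For contrast with the CBOW setup from lecture (not shown in this notebook), the sketch below builds both kinds of training examples from the same toy sentence. The helper names `skipgram_pairs` and `cbow_examples` are hypothetical and exist only for this illustration. The skip-gram helper mirrors the sliding-window loop above, but works on words instead of indices.

```python
def skipgram_pairs(tokens, window):
    """Skip-gram: one (center, context) pair per neighbour inside the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def cbow_examples(tokens, window):
    """CBOW: one (context_words, center) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:  # skip positions that have no neighbours at all
            examples.append((context, center))
    return examples


toy = ["project", "gutenberg", "ebook", "second"]
sg = skipgram_pairs(toy, window=2)
cbow = cbow_examples(toy, window=2)

print(len(sg), "skip-gram pairs, first:", sg[0])
print(len(cbow), "CBOW examples, first:", cbow[0])
```

The same four words give ten skip-gram pairs but only four CBOW examples, which is why the skip-gram label tensor `y` holds a single context index per row while a CBOW input row would hold several context indices.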


Convert to tensors:

In [11]:
X = torch.tensor([pair[0] for pair in training_pairs])
y = torch.tensor([pair[1] for pair in training_pairs])

print("X (center word indices):", X[:5])
print("y (context word indices):", y[:5])

# Convert first 5 tensor pairs to word format, split into X_words and y_words
X_words = [idx2word[X[i].item()] for i in range(5)]
y_words = [idx2word[y[i].item()] for i in range(5)]

print("X_words (center):", X_words)
print("y_words (context):", y_words)

X (center word indices): tensor([0, 0, 1, 1, 1])
y (context word indices): tensor([1, 2, 0, 2, 3])
X_words (center): ['project', 'project', 'gutenberg', 'gutenberg', 'gutenberg']
y_words (context): ['gutenberg', 'ebook', 'project', 'ebook', 'second']


## Training

Skip-gram:

In [12]:
%%time

workers = max(1, num_processors - 1)  # leave one core free, but never drop below one worker

sg_model = gensim.models.Word2Vec(
    sentences=sentences,
    vector_size=100,
    window=2,
    min_count=5,
    sg=1,                    # 1 = Skip-gram
    compute_loss=True,
    workers=workers,
    epochs=10
)

CPU times: user 950 ms, sys: 8.5 ms, total: 958 ms
Wall time: 1.07 s


Training loss computation:

In [13]:
%%time

# getting the training loss value
sg_training_loss = sg_model.get_latest_training_loss()
print(sg_training_loss)

1401080.375
CPU times: user 188 µs, sys: 1.02 ms, total: 1.2 ms
Wall time: 1.28 ms


CBOW:

In [14]:
%%time

workers = max(1, num_processors - 1)  # leave one core free, but never drop below one worker

cbow_model = gensim.models.Word2Vec(
    sentences=sentences,
    vector_size=100,
    window=2,
    min_count=5,
    sg=0,                    # 0 = CBOW
    compute_loss=True,
    workers=workers,
    epochs=10
)

CPU times: user 545 ms, sys: 12.2 ms, total: 557 ms
Wall time: 773 ms


In [15]:
%%time

# getting the training loss value
cbow_training_loss = cbow_model.get_latest_training_loss()
print(cbow_training_loss)

564519.375
CPU times: user 410 µs, sys: 0 ns, total: 410 µs
Wall time: 367 µs
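
As far as I understand gensim's behaviour, `get_latest_training_loss()` reports a running total accumulated over the epochs of a training call, not a per-example average, so the two numbers printed above are hard to compare directly. A minimal sketch of turning such cumulative readings into per-epoch losses follows. `per_epoch_losses` is a hypothetical helper, and the readings are made-up placeholders, not values from this notebook.

```python
def per_epoch_losses(cumulative):
    """Convert running-total loss readings (one per epoch) into per-epoch losses."""
    losses = []
    previous = 0.0
    for total in cumulative:
        losses.append(total - previous)  # loss added during this epoch only
        previous = total
    return losses


# Made-up cumulative readings for three epochs (illustration only).
readings = [300.0, 520.0, 700.0]
print(per_epoch_losses(readings))
```

In practice the readings could be collected once per epoch, for example from a `gensim.models.callbacks.CallbackAny2Vec` subclass whose `on_epoch_end` method calls `model.get_latest_training_loss()`. A falling per-epoch loss is usually more informative than the final total.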


## Evaluating

Compare a few words to evaluate how well the model learned word representations (are they better than random?)

In [16]:
def sg_cbow_similar_words_df(sim_model_sg, sim_model_cbow, word_list, topn=5):
    rows = []

    for word in word_list:
        row = [word]

        # Get Skip-gram results
        try:
            sg_sim = sim_model_sg.wv.most_similar(word, topn=topn)
            row.extend([f"{w} ({sim:.6f})" for w, sim in sg_sim])
        except KeyError:
            row.extend([None] * topn)

        # Get CBOW results
        try:
            cbow_sim = sim_model_cbow.wv.most_similar(word, topn=topn)
            row.extend([f"{w} ({sim:.6f})" for w, sim in cbow_sim])
        except KeyError:
            row.extend([None] * topn)

        rows.append(row)

    columns = (
        ["Target Word"] +
        [f"SG Top{i+1}" for i in range(topn)] +
        [f"CBOW Top{i+1}" for i in range(topn)]
    )

    return pd.DataFrame(rows, columns=columns)

In [17]:
target_words = ["government", "slavery", "property", "war", "state", "love", "land", "owner", "child", "history", "health",
                "legislative", "federative", "prerogative", "paternal", "political", "power", "tyrant", "king", "dissolution"]
topn = 5

In [18]:
sg_cbow_similar_words_df(sg_model, cbow_model, target_words, topn)

|  | Target Word | SG Top1 | SG Top2 | SG Top3 | SG Top4 | SG Top5 | CBOW Top1 | CBOW Top2 | CBOW Top3 | CBOW Top4 | CBOW Top5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | government | ends (0.993854) | members (0.993708) | forms (0.993450) | societies (0.993133) | secure (0.993059) | upon (0.999460) | set (0.999302) | together (0.999293) | till (0.999262) | society (0.999259) |
| 1 | slavery | pay (0.998278) | continued (0.998122) | representatives (0.998082) | former (0.998056) | food (0.998037) | bound (0.998777) | set (0.998758) | able (0.998742) | agreement (0.998711) | long (0.998634) |
| 2 | property | become (0.995551) | pleases (0.995544) | human (0.995401) | restraint (0.995378) | members (0.995198) | left (0.999322) | upon (0.999304) | thereby (0.999291) | together (0.999286) | even (0.999285) |
| 3 | war | puts (0.984027) | man (0.977977) | still (0.977045) | though (0.976346) | common (0.976180) | still (0.998885) | people (0.998856) | though (0.998792) | mankind (0.998736) | private (0.998731) |
| 4 | state | law (0.973860) | puts (0.960767) | man (0.959877) | liberty (0.956544) | war (0.955352) | law (0.998331) | man (0.997872) | makes (0.997748) | made (0.997732) | king (0.997712) |
| 5 | love | notwithstanding (0.998241) | hinder (0.998150) | last (0.998102) | commission (0.998042) | doctrine (0.998034) | makes (0.998787) | great (0.998703) | whose (0.998700) | certain (0.998699) | far (0.998697) |
| 6 | land | conditions (0.993983) | ground (0.993326) | supposing (0.993027) | inhabitants (0.992895) | subdued (0.992872) | made (0.999334) | yet (0.999317) | together (0.999293) | force (0.999282) | makes (0.999274) |
| 7 | owner | agree (0.997564) | ebooks (0.997431) | volunteers (0.997204) | including (0.997112) | defect (0.997032) | others (0.997341) | possession (0.997324) | set (0.997245) | ends (0.997229) | might (0.997201) |
| 8 | child | equality (0.997602) | tie (0.997448) | education (0.997336) | useful (0.997278) | understood (0.997271) | either (0.999282) | community (0.999275) | people (0.999261) | find (0.999261) | far (0.999258) |
| 9 | history | speak (0.997824) | head (0.997790) | slaves (0.997787) | immediately (0.997742) | wholly (0.997720) | beginning (0.999207) | must (0.999155) | condition (0.999132) | yet (0.999116) | thing (0.999105) |


In [19]:
vocab_words = list(sg_model.wv.index_to_key)
random_words = random.sample(vocab_words, 20)
print("Random words from corpus:", random_words)

Random words from corpus: ['facto', 'information', 'remedy', 'since', 'consequently', 'guilty', 'wise', 'fruits', 'government', 'grown', 'change', 'depending', 'magistrates', 'works', 'apt', 'greatest', 'world', 'easy', 'resisting', 'employed']


In [20]:
sg_cbow_similar_words_df(sg_model, cbow_model, random_words, topn)

|  | Target Word | SG Top1 | SG Top2 | SG Top3 | SG Top4 | SG Top5 | CBOW Top1 | CBOW Top2 | CBOW Top3 | CBOW Top4 | CBOW Top5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | facto | belongs (0.997854) | words (0.997834) | obey (0.997764) | hence (0.997759) | kings (0.997744) | far (0.997855) | great (0.997734) | still (0.997662) | received (0.997662) | hath (0.997652) |
| 1 | information | paragraph (0.997715) | copies (0.997364) | ebook (0.997334) | distribution (0.997250) | donations (0.996666) | foundation (0.998346) | title (0.998315) | people (0.998300) | king (0.998295) | whilst (0.998264) |
| 2 | remedy | open (0.997710) | mean (0.997468) | lies (0.997372) | answer (0.997331) | judges (0.997160) | much (0.999154) | country (0.999152) | either (0.999131) | little (0.999114) | ought (0.999106) |
| 3 | since | keep (0.997601) | advantage (0.997580) | independent (0.997537) | owe (0.997500) | seize (0.997463) | always (0.999280) | first (0.999267) | way (0.999262) | must (0.999248) | people (0.999248) |
| 4 | consequently | tyranny (0.998103) | danger (0.998099) | doth (0.998076) | seize (0.997997) | kingdom (0.997977) | thing (0.998590) | king (0.998553) | execution (0.998513) | better (0.998470) | must (0.998465) |
| 5 | guilty | room (0.998304) | legislators (0.998151) | food (0.998130) | remain (0.998115) | join (0.998111) | still (0.998307) | upon (0.998277) | well (0.998265) | answer (0.998262) | till (0.998215) |
| 6 | wise | affairs (0.998067) | monarchies (0.998058) | keep (0.998046) | looked (0.998041) | finds (0.998030) | distinct (0.996497) | members (0.996390) | say (0.996340) | evident (0.996335) | make (0.996312) |
| 7 | fruits | speak (0.998002) | accordingly (0.997996) | last (0.997960) | conjunction (0.997943) | conquered (0.997885) | also (0.997175) | parents (0.997134) | cases (0.996999) | copyright (0.996997) | family (0.996987) |
| 8 | government | ends (0.993854) | members (0.993708) | forms (0.993450) | societies (0.993133) | secure (0.993059) | upon (0.999460) | set (0.999302) | together (0.999293) | till (0.999262) | society (0.999259) |
| 9 | grown | commonly (0.997983) | ambition (0.997965) | degree (0.997822) | concerning (0.997820) | danger (0.997795) | form (0.996268) | little (0.996217) | necessary (0.996153) | far (0.996136) | commonly (0.996111) |
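
One way to approach the "better than random?" question is to compare neighbour similarities against the similarity of randomly paired words. The sketch below does this on toy vectors in plain Python. `cosine`, `random_pair_baseline`, and the vectors themselves are hypothetical stand-ins for illustration; with the trained models the vectors would come from `sg_model.wv` or `cbow_model.wv`.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def random_pair_baseline(vectors, n_pairs=100, seed=0):
    """Mean cosine similarity over randomly drawn pairs of distinct words."""
    rng = random.Random(seed)
    words = list(vectors)
    sims = []
    for _ in range(n_pairs):
        a, b = rng.sample(words, 2)
        sims.append(cosine(vectors[a], vectors[b]))
    return sum(sims) / len(sims)


# Toy vectors, invented for this example only.
toy_vectors = {
    "king": [0.9, 0.1, 0.0],
    "prince": [0.8, 0.2, 0.1],
    "land": [0.1, 0.9, 0.2],
    "acres": [0.0, 0.8, 0.3],
}

baseline = random_pair_baseline(toy_vectors)
neighbour = cosine(toy_vectors["land"], toy_vectors["acres"])
print(f"random-pair baseline: {baseline:.3f}, land~acres: {neighbour:.3f}")
```

With gensim, `sg_model.wv.similarity(a, b)` returns the same cosine measure for two vocabulary words. If the random-pair baseline is itself close to 0.99, then top-5 scores of 0.99 in the tables above say little about whether the neighbours are meaningful.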


# Skip-gram Model Evaluation Summary

Overall, the Skip-gram model trained on the Gutenberg text shows moderate performance. It performs well on many target words, especially those related to social or political context. For example:

- `"slavery"` returns similar words like `'pay'`, `'continued'`, and `'representatives'`, which are contextually aligned with the narrative around governance and rights.
- `"history"` brings up `'speak'`, `'head'`, and `'slaves'`, which are loosely connected to the book's themes.
- `"owner"` and `"love"` also show some plausible top results (for example `'agree'` for `"owner"` and `'doctrine'` for `"love"`), although `"owner"` also picks up Project Gutenberg license terms such as `'ebooks'` and `'volunteers'`.

However, there are also cases where the model doesn't return strong semantic matches:

- `"war"` shows top results like `'puts'`, `'man'`, and `'still'`, which are too generic.
- `"tyrant"` and `"dissolution"` return contextually weak terms such as `'hundred'`, `'presently'`, and `'occasions'`, which lack clear relevance.
- `"prerogative"` returns verbs like `'hurt'` and `'entered'`, which may be grammatically related but don't reflect deeper meaning.

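This variation is likely due to the limited size and thematic scope of the training corpus (a single book, Locke's *Second Treatise of Government*, plus the Project Gutenberg license text). The narrow context window (`window=2`) and relatively short training (10 epochs) also restrict how well the model can capture broader or abstract relationships. Nearly all reported cosine similarities are above 0.95, which suggests the vectors have not separated much, so the neighbour rankings rest on very small differences.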

To improve the model, using a larger and more diverse dataset, increasing the window size, or training longer could help strengthen the word embeddings.


# CBOW Model Evaluation Summary

The CBOW model trained on the same Gutenberg text demonstrates reasonable performance but tends to return more general or syntactically frequent terms. It captures some meaningful relationships, but overall the results are more abstract or less thematically grounded compared to the Skip-gram model.

For example:

- `"slavery"` returns similar words like `'bound'`, `'set'`, and `'agreement'`, which are loosely connected but not as contextually deep as in the Skip-gram model.
- `"history"` brings up terms like `'beginning'`, `'must'`, and `'condition'`, which are common in narrative structure but less semantically rich.
- `"property"` and `"government"` return words like `'left'`, `'upon'`, and `'set'`, which may reflect common collocations but not deeper meaning.

Some results also include broad or ambiguous terms:

- `"tyrant"` returns words like `'way'`, `'present'`, and `'done'`, which are vague.
- `"dissolution"` shows words like `'presently'`, `'prince'`, and `'form'`, which lack thematic clarity.
- `"prerogative"` returns `"people"`, `"force"`, and `"governments"`, which are thematically nearby but quite generic.

This is expected, as CBOW averages context words to predict the center word, which can dilute the specificity of the representation. Like Skip-gram, CBOW performance is also affected by the small and domain-specific training corpus.

Training on a larger and more diverse dataset or adjusting training parameters (like window size or epochs) would likely improve the quality of results.

# Skip-gram vs. CBOW Comparison
## 1. From the PyTorch function we can see that:


 **Input Handling**
   - **CBOW:** In CBOW, the model takes the context words (surrounding words) and tries to predict the target word. The input to the model is a collection of context words (e.g., a sliding window around a target word), and the average of their embeddings is computed to make a prediction for the target word. In the code, the line:
     ```python
     embeds = self.embeddings(inputs).mean(dim=1)
     ```
     indicates that the embeddings for the input words are averaged along dimension 1 (which corresponds to the context words). This averaged embedding is then passed to the linear layer for prediction.

   - **Skip-gram:** In Skip-gram, the model works in the opposite direction. It takes the target word as input and tries to predict the surrounding context words. The input is a single word (the target word), and its embedding is looked up directly without any averaging. In the code, the line:
     ```python
     embeds = self.embeddings(inputs)
     ```
     shows that the embedding for the target word is directly retrieved (not averaged). This embedding is then passed to the linear layer for the prediction of the context words.

 **Embedding Aggregation**
   - **CBOW:** The embeddings of the context words are aggregated (averaged) before passing them through the linear layer. This aggregation step is essential for CBOW to combine information from multiple context words into a single vector that is used to predict the target word.
   - **Skip-gram:** In Skip-gram, there is no aggregation. The model uses the embedding of the single target word as is, and directly predicts the context words based on that single embedding.
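
The difference between the two forward passes can be illustrated without PyTorch. The sketch below uses a tiny hand-written embedding table. `EMBEDDINGS`, `cbow_hidden`, and `skipgram_hidden` are hypothetical names for this illustration and are not the lecture code.

```python
# Toy embedding table: row i is the 2-dimensional vector for word index i.
EMBEDDINGS = [
    [1.0, 0.0],  # word 0
    [0.0, 1.0],  # word 1
    [1.0, 1.0],  # word 2
]


def cbow_hidden(context_ids):
    """CBOW: look up every context word and average the rows
    (the role played by `.mean(dim=1)` in the PyTorch model)."""
    rows = [EMBEDDINGS[i] for i in context_ids]
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


def skipgram_hidden(center_id):
    """Skip-gram: look up the single center word, with no aggregation."""
    return list(EMBEDDINGS[center_id])


print(cbow_hidden([0, 1]))   # several input indices -> one averaged vector
print(skipgram_hidden(2))    # one input index -> its own vector
```

In both cases the resulting vector is what the output layer over the vocabulary receives. What changes is how many input words contribute to it: several for CBOW, exactly one for Skip-gram.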






## 2. From gensim's Word2Vec results we can see that:

> Skip-gram performs better when capturing meaningful semantic relationships. For instance, it identifies connections between "slavery" and words like "representatives" and "continued", while CBOW returns more generic alternatives like "bound" and "set". Similarly, Skip-gram's results for "history" and "owner" are more focused and contextually aligned with the themes in the book.

> After randomly selecting 20 target words, the results still show that **Skip-gram** produces more **semantically meaningful** outputs, while **CBOW** tends to return more **syntactically frequent or general-purpose** words.

> In terms of training loss, the Skip-gram model reported a higher value (1,401,080.375) than the CBOW model (564,519.375). gensim reports this value as a running total over training rather than a per-example average, so a lower total for CBOW does not by itself indicate better embeddings, especially on a small, domain-specific corpus. Skip-gram makes a separate prediction for every (center, context) pair (**one-to-many**), which naturally leads to a higher total loss and **longer training** (1.07 s vs. 773 ms wall time here). CBOW is **faster** because it averages the context words to predict a single center word (**many-to-one**), which means fewer updates per position. Loss and training time should therefore be compared with each model's architecture and number of training examples in mind.

In summary, Skip-gram showed stronger performance for this task due to its ability to learn from specific word pairs, which is particularly important when the corpus is small and the domain is narrow.


# Further Training

Increase window from 2 to 5:

Skip-gram:

In [21]:
%%time

workers = max(1, num_processors - 1)  # leave one core free, but never drop below one worker

sg_model = gensim.models.Word2Vec(
    sentences=sentences,
    vector_size=100,
    window=5,
    min_count=5,
    sg=1,                    # 1 = Skip-gram
    compute_loss=True,
    workers=workers,
    epochs=10
)

CPU times: user 1.54 s, sys: 7.52 ms, total: 1.55 s
Wall time: 2.26 s


In [22]:
%%time

# getting the training loss value
sg_training_loss = sg_model.get_latest_training_loss()
print(sg_training_loss)

2466048.75
CPU times: user 106 µs, sys: 12 µs, total: 118 µs
Wall time: 124 µs


CBOW:

In [23]:
%%time

workers = max(1, num_processors - 1)  # leave one core free, but never drop below one worker

cbow_model = gensim.models.Word2Vec(
    sentences=sentences,
    vector_size=100,
    window=5,
    min_count=5,
    sg=0,                    # 0 = CBOW
    compute_loss=True,
    workers=workers,
    epochs=10
)

CPU times: user 390 ms, sys: 6.53 ms, total: 397 ms
Wall time: 396 ms


In [24]:
%%time

# getting the training loss value
cbow_training_loss = cbow_model.get_latest_training_loss()
print(cbow_training_loss)

536882.5625
CPU times: user 86 µs, sys: 0 ns, total: 86 µs
Wall time: 91.1 µs


Similar words:

In [25]:
sg_cbow_similar_words_df(sg_model, cbow_model, target_words, topn)

|  | Target Word | SG Top1 | SG Top2 | SG Top3 | SG Top4 | SG Top5 | CBOW Top1 | CBOW Top2 | CBOW Top3 | CBOW Top4 | CBOW Top5 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | government | ends (0.967691) | form (0.966334) | representatives (0.966092) | necessary (0.964567) | require (0.964526) | upon (0.999645) | together (0.999551) | set (0.999532) | either (0.999526) | new (0.999516) |
| 1 | slavery | forfeited (0.996781) | subjected (0.996269) | sovereignty (0.995833) | fortunes (0.995743) | proved (0.995729) | able (0.999205) | violence (0.999192) | set (0.999156) | bound (0.999149) | execution (0.999146) |
| 2 | property | gave (0.968743) | without (0.967010) | could (0.966045) | consent (0.963188) | appropriate (0.961546) | left (0.999470) | without (0.999435) | made (0.999414) | labour (0.999387) | thereby (0.999384) |
| 3 | war | state (0.952422) | puts (0.950416) | right (0.926274) | man (0.923274) | force (0.916896) | people (0.999341) | mankind (0.999280) | rule (0.999269) | though (0.999256) | still (0.999252) |
| 4 | state | puts (0.964353) | nature (0.959098) | war (0.952422) | man (0.943646) | whether (0.938932) | man (0.999139) | makes (0.999062) | yet (0.999040) | law (0.999034) | war (0.999026) |
| 5 | love | principles (0.996739) | food (0.996523) | notwithstanding (0.996247) | accordingly (0.996131) | evil (0.996075) | long (0.999037) | great (0.999034) | far (0.999028) | certain (0.999028) | received (0.999024) |
| 6 | land | value (0.957648) | acres (0.942017) | part (0.939996) | money (0.938780) | would (0.938316) | made (0.999466) | thing (0.999465) | together (0.999457) | might (0.999451) | yet (0.999436) |
| 7 | owner | anyone (0.996454) | tax (0.995962) | including (0.995249) | distribute (0.994986) | section (0.994975) | set (0.997984) | agree (0.997965) | possession (0.997945) | agreement (0.997850) | ends (0.997840) |
| 8 | child | honour (0.992755) | subjection (0.992333) | mother (0.991918) | minority (0.990299) | education (0.987576) | either (0.999485) | find (0.999465) | see (0.999430) | respect (0.999420) | things (0.999419) |
| 9 | history | examples (0.994785) | families (0.994317) | instances (0.994194) | practice (0.994048) | speak (0.994033) | must (0.999399) | beginning (0.999379) | way (0.999350) | thing (0.999348) | condition (0.999327) |


# Further Exploration

When increasing the window size from 2 to 5, the **Skip-gram model's training loss increased** from 1,401,080.375 to 2,466,048.75, while **CBOW's fell slightly, from 564,519.375 to 536,882.5625**. This difference is expected because Skip-gram makes more individual predictions as the window expands. However, despite the higher loss, Skip-gram shows clear semantic improvements with the larger context window, while CBOW's results remain largely unchanged. This highlights Skip-gram's advantage in capturing richer relationships, even with increased computational cost.
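
The growth in Skip-gram predictions with window size can be made concrete by counting training examples for one sentence. This is a rough sketch: `count_skipgram_pairs` is a hypothetical helper that assumes the fixed window used in the manual pair-generation code earlier, whereas gensim, as far as I know, samples a smaller effective window at each position by default, so its real counts would be lower.

```python
def count_skipgram_pairs(n_tokens, window):
    """Number of (center, context) pairs for one sentence with a fixed window."""
    total = 0
    for i in range(n_tokens):
        left = min(i, window)                  # neighbours available on the left
        right = min(n_tokens - 1 - i, window)  # neighbours available on the right
        total += left + right
    return total


n = 18  # token count of the first cleaned sentence shown earlier
for window in (2, 5):
    pairs = count_skipgram_pairs(n, window)
    print(f"window={window}: {pairs} skip-gram pairs, {n} CBOW examples")
```

For this 18-token sentence the pair count goes from 66 to 150 when the window grows from 2 to 5, while CBOW still builds one example per position. More predictions per sentence means more terms summed into Skip-gram's total loss, independent of embedding quality.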

To be more specific, after increasing the context window, the **Skip-gram model shows noticeably improved semantic results**. For example, the word "slavery" now returns more relevant terms such as "forfeited", "subjected", and "sovereignty", which align well with historical and political themes. Similarly, "war" ("state", "right", "force"), "land" ("value", "acres", "money"), and "child" ("mother", "minority", "education") are now linked to stronger and more interpretable semantic neighbors. "owner" is an exception: its neighbors ("anyone", "tax", "distribute") still come mostly from the Project Gutenberg license text.

In contrast, **CBOW's outputs remain relatively unchanged**. It continues to return frequent or structurally common words such as "able", "together", "thing", and "must", with limited variation across different target terms. This reflects the limitations of CBOW's context-averaging approach, which tends to smooth over the semantic specificity gained from a wider context.

These findings reinforce that **Skip-gram benefits more from an increased window size**. Although it incurs **higher computational cost and loss**, the trade-off results in richer and more interpretable word embeddings.