# Compare Word2Vec Embeddings (CBOW vs. SkipGram)

**Goal:** Load the trained CBOW and SkipGram model states and the shared vocabulary to compare the quality of their learned word embeddings.

**Evaluations:**
1.  Find nearest neighbors for given words for both models.
2.  Perform word analogy tasks for both models.

## ⚙️ Setup and Imports

In [11]:
import torch
import torch.nn as nn
import numpy as np
import os
import sys
from typing import Union, Optional, Dict, List # Added typing

# Add project root to path
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.append(project_root)
    print(f"Appended project root: {project_root}")

# Import project modules
from utils import logger, get_device, format_num_words # Import helper
from src.word2vec.vocabulary import Vocabulary
# No need to import model definitions (CBOW/SkipGram) if only loading state_dict

# Configure display
%matplotlib inline 
import pandas as pd
pd.options.display.max_rows = 100

## ⚙️ Configuration: Define Runs to Compare

Specify the parameters for the CBOW and SkipGram runs you want to load and compare. **Ensure these match the parameters used during training!**

In [8]:
# --- Shared Parameters --- (Must match for vocabulary)
RUN_CORPUS_NAME = "text8"
RUN_NUM_WORDS = -1 # Set to -1 for 'All' if that's what you used, or e.g., 10000000
RUN_MIN_FREQ = 5
BASE_MODEL_DIR = os.path.join(project_root, "models/word2vec")

# --- CBOW Run Specific Parameters --- (Update these to match your CBOW run)
CBOW_EMBED_DIM = 128
CBOW_WINDOW_SIZE = 5 # Or 3 if you used 3 for the run being loaded
CBOW_EPOCHS = 15 # Or 5 if loading the 5 epoch run
CBOW_LR = 0.001
CBOW_BATCH_SIZE = 512
CBOW_NEG_SAMPLES = 5 # Or 0 if the run used full softmax (CrossEntropyLoss); not used in the run name below

# --- SkipGram Run Specific Parameters --- (Update these to match your SkipGram run)
SG_EMBED_DIM = 128
SG_WINDOW_SIZE = 5
SG_EPOCHS = 3 # Or 15 if loading the 15 epoch run
SG_LR = 0.001
SG_BATCH_SIZE = 512
SG_NEG_SAMPLES = 5

# --- Determine Device --- 
device = get_device()

# --- Construct Paths --- 
nw_str = format_num_words(RUN_NUM_WORDS)
vocab_filename = f"{RUN_CORPUS_NAME}_vocab_NW{nw_str}_MF{RUN_MIN_FREQ}.json"
VOCAB_PATH = os.path.join(BASE_MODEL_DIR, vocab_filename)

cbow_run_name = f"CBOW_D{CBOW_EMBED_DIM}_W{CBOW_WINDOW_SIZE}_NW{nw_str}_MF{RUN_MIN_FREQ}_E{CBOW_EPOCHS}_LR{CBOW_LR}_BS{CBOW_BATCH_SIZE}"
CBOW_MODEL_STATE_PATH = os.path.join(BASE_MODEL_DIR, cbow_run_name, "model_state.pth")

sg_run_name = f"SkipGram_D{SG_EMBED_DIM}_W{SG_WINDOW_SIZE}_NW{nw_str}_MF{RUN_MIN_FREQ}_E{SG_EPOCHS}_LR{SG_LR}_BS{SG_BATCH_SIZE}"
SG_MODEL_STATE_PATH = os.path.join(BASE_MODEL_DIR, sg_run_name, "model_state.pth")

logger.info(f"Comparing CBOW Run: {cbow_run_name}")
logger.info(f"Comparing SkipGram Run: {sg_run_name}")
logger.info(f"Using Vocabulary: {VOCAB_PATH}")

2025-04-16 22:55:15 | DropoutDisco | INFO     | [device_setup.py:39] | ✅ MPS device found and available (Built: True). Selecting MPS.
2025-04-16 22:55:15 | DropoutDisco | INFO     | [device_setup.py:51] | ✨ Selected compute device: MPS
2025-04-16 22:55:15 | DropoutDisco | INFO     | [684270727.py:37] | Comparing CBOW Run: CBOW_D128_W5_NWAll_MF5_E15_LR0.001_BS512
2025-04-16 22:55:15 | DropoutDisco | INFO     | [684270727.py:38] | Comparing SkipGram Run: SkipGram_D128_W5_NWAll_MF5_E3_LR0.001_BS512
2025-04-16 22:55:15 | DropoutDisco | INFO     | [684270727.py:39] | Using Vocabulary: /Users/Oks_WORKSPACE/Desktop/DEV/W1_project/Dropout_Disco/models/word2vec/text8_vocab_NWAll_MF5.json
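
Because each run name is assembled from many parameters, one mismatched value produces a path that does not exist. As a minimal sketch (the helper name `find_missing_artifacts` is hypothetical and not part of the project), the expected files could be checked in one place before any loading is attempted:

```python
from pathlib import Path
from typing import Dict, List


def find_missing_artifacts(paths: Dict[str, str]) -> List[str]:
    """Return the labels of all expected files that do not exist on disk."""
    return [label for label, path in paths.items() if not Path(path).is_file()]
```

In this notebook it might be called as `find_missing_artifacts({"vocab": VOCAB_PATH, "cbow": CBOW_MODEL_STATE_PATH, "skipgram": SG_MODEL_STATE_PATH})`; an empty list would mean all three files were found.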


## 💾 Load Vocabulary and Model Embeddings

In [9]:
# --- Load Shared Vocabulary --- 
vocab = None
vocab_size = 0
try:
    if not os.path.exists(VOCAB_PATH):
        logger.error(f"❌ Vocabulary file not found: {VOCAB_PATH}")
    else:
        vocab = Vocabulary.load_vocab(VOCAB_PATH)
        vocab_size = len(vocab)
except Exception as e:
    logger.error(f"❌ Failed to load vocabulary: {e}", exc_info=True)

# --- Function to Load Embeddings --- 
def load_embedding_matrix(model_state_path: str, expected_vocab_size: int, expected_embed_dim: int) -> Optional[torch.Tensor]:
    """Loads state dict and extracts/validates the embedding matrix."""
    embedding_matrix = None
    if not os.path.exists(model_state_path):
        logger.error(f"Model state file not found: {model_state_path}")
        return None
        
    try:
        logger.info(f"🧠 Loading model state from: {model_state_path}")
        state_dict = torch.load(model_state_path, map_location=torch.device('cpu'))
        logger.info(f"  Keys: {list(state_dict.keys())}")
        
        # --- Determine which embedding key to use --- 
        # Prioritize 'in_embed.weight' (for NS models), fallback to 'embeddings.weight' (old CBOW)
        embedding_key = None
        if 'in_embed.weight' in state_dict:
            embedding_key = 'in_embed.weight'
        elif 'embeddings.weight' in state_dict:
            embedding_key = 'embeddings.weight'
            logger.warning("Using fallback 'embeddings.weight' key.")
        
        if embedding_key:
            embedding_matrix = state_dict[embedding_key].clone().to(device)
            logger.info(f"  Extracted '{embedding_key}'. Shape: {embedding_matrix.shape}")
            
            # Validate shape
            loaded_vocab_size, loaded_embed_dim = embedding_matrix.shape
            if loaded_vocab_size != expected_vocab_size:
                logger.error(f"❌ Vocab size mismatch! Expected {expected_vocab_size}, got {loaded_vocab_size}.")
                return None
            if loaded_embed_dim != expected_embed_dim:
                logger.error(f"❌ Embed dim mismatch! Expected {expected_embed_dim}, got {loaded_embed_dim}.")
                return None
            logger.info("✅ Embedding matrix loaded and validated.")
            return embedding_matrix
        else:
            logger.error("❌ No recognized embedding weight key found in state dictionary!")
            return None
            
    except Exception as e:
        logger.error(f"❌ Failed to load model state: {e}", exc_info=True)
        return None

# --- Load Both Embeddings --- 
embedding_matrix_cbow = None
embedding_matrix_sg = None

if vocab:
    logger.info("--- Loading CBOW Embeddings ---")
    embedding_matrix_cbow = load_embedding_matrix(CBOW_MODEL_STATE_PATH, vocab_size, CBOW_EMBED_DIM)
    
    logger.info("--- Loading SkipGram Embeddings ---")
    embedding_matrix_sg = load_embedding_matrix(SG_MODEL_STATE_PATH, vocab_size, SG_EMBED_DIM)
    
# Final Check
if vocab and embedding_matrix_cbow is not None and embedding_matrix_sg is not None:
    logger.info("✅✅ Vocabulary, CBOW, and SkipGram embeddings loaded successfully!")
else:
    logger.error("🚨 Failed to load all required artifacts for comparison.")

2025-04-16 22:55:15 | DropoutDisco | INFO     | [vocabulary.py:157] | Attempting to load vocabulary from: /Users/Oks_WORKSPACE/Desktop/DEV/W1_project/Dropout_Disco/models/word2vec/text8_vocab_NWAll_MF5.json
2025-04-16 22:55:15 | DropoutDisco | INFO     | [vocabulary.py:198] | 📚 Vocab loaded (71,291 words) from /Users/Oks_WORKSPACE/Desktop/DEV/W1_project/Dropout_Disco/models/word2vec/text8_vocab_NWAll_MF5.json
2025-04-16 22:55:15 | DropoutDisco | INFO     | [3633925237.py:62] | --- Loading CBOW Embeddings ---
2025-04-16 22:55:15 | DropoutDisco | INFO     | [3633925237.py:22] | 🧠 Loading model state from: /Users/Oks_WORKSPACE/Desktop/DEV/W1_project/Dropout_Disco/models/word2vec/CBOW_D128_W5_NWAll_MF5_E15_LR0.001_BS512/model_state.pth
2025-04-16 22:55:15 | DropoutDisco | INFO     | [3633925237.py:24] |   Keys: ['embeddings.weight', 'linear.weight', 'linear.bias']
2025-04-16 22:55:16 | DropoutDisco | INFO     | [3633925237.py:37] |   Extracted 'embeddings.weight'. Shape: torch.Size([71291,

## 🛠️ Evaluation Utility Functions (Modified)

The utility functions below take the vocabulary and the embedding matrix as explicit arguments, so the same code can evaluate either model.

In [12]:
# --- Evaluation Utility Functions (Modified for Comparison) ---

def get_embedding_vector(word: str, vocab_obj: Vocabulary, embedding_mat: torch.Tensor) -> Optional[torch.Tensor]:
    """Retrieves the learned embedding vector for a word from a specific matrix."""
    if not vocab_obj or embedding_mat is None: return None
    word_idx = vocab_obj.get_index(word)
    if word_idx == vocab_obj.unk_index and word != vocab_obj.unk_token:
        logger.warning(f"Word '{word}' is not in the vocabulary; using the UNK embedding.")
    if 0 <= word_idx < embedding_mat.shape[0]:
        return embedding_mat[word_idx]
    else:
        logger.error(f"Invalid index {word_idx} for word '{word}'.")
        return None

def get_nearest_neighbors_data(
    input_word: str,
    embedding_mat: torch.Tensor,
    vocab_obj: Vocabulary,
    top_n: int = 10
) -> List[Dict[str, Union[str, float]]]:
    """
    Finds and returns the top_n most similar words and their similarities.

    Args:
        input_word (str): The word to find neighbors for.
        embedding_mat (torch.Tensor): The embedding matrix to use.
        vocab_obj (Vocabulary): The vocabulary object.
        top_n (int): Number of neighbors to return.

    Returns:
        List[Dict[str, Union[str, float]]]: A list of dictionaries,
            each like {'Word': neighbor_word, 'Similarity': score}.
            Returns empty list on error.
    """
    results = []
    if embedding_mat is None or not vocab_obj:
        logger.error("Cannot find neighbors: Embeddings/Vocab not loaded.")
        return results

    input_vector = get_embedding_vector(input_word, vocab_obj, embedding_mat)
    if input_vector is None:
        logger.error(f"Could not get embedding for '{input_word}'.")
        return results

    try:
        # Calculate similarities
        cos = nn.CosineSimilarity(dim=1)
        similarities = cos(embedding_mat.float(), input_vector.float().unsqueeze(0))

        # Get top indices
        k = min(top_n + 1, len(vocab_obj))
        top_indices = torch.argsort(similarities, descending=True)[:k]

        # Collect results, skipping input word
        for idx_tensor in top_indices:
            idx = idx_tensor.item()
            word = vocab_obj.get_word(idx)
            if word.lower() == input_word.lower():
                continue
            sim = similarities[idx].item()
            results.append({'Word': word, 'Similarity': sim})
            if len(results) == top_n:
                break
        return results

    except Exception as e:
        logger.error(f"Error finding neighbors for '{input_word}': {e}", exc_info=True)
        return []


def compare_nearest_neighbors(
    input_word: str,
    embed_cbow: torch.Tensor,
    embed_sg: torch.Tensor,
    vocab_obj: Vocabulary,
    top_n: int = 10
):
    """
    Compares nearest neighbors from CBOW and SkipGram models side-by-side.

    Args:
        input_word (str): Word to evaluate.
        embed_cbow (torch.Tensor): CBOW embedding matrix.
        embed_sg (torch.Tensor): SkipGram embedding matrix.
        vocab_obj (Vocabulary): Shared vocabulary object.
        top_n (int): Number of neighbors to show.
    """
    logger.info(f"Comparing nearest neighbors for '{input_word}'...")

    # Get results from both models
    results_cbow = get_nearest_neighbors_data(input_word, embed_cbow, vocab_obj, top_n)
    results_sg = get_nearest_neighbors_data(input_word, embed_sg, vocab_obj, top_n)

    # Combine into a DataFrame
    max_len = max(len(results_cbow), len(results_sg), top_n)
    data_for_df = {
        'Rank': list(range(1, max_len + 1)),
        'CBOW Word': [res.get('Word', '') for res in results_cbow] + [''] * (max_len - len(results_cbow)),
        'CBOW Sim': [f"{res.get('Similarity', 0):.4f}" for res in results_cbow] + [''] * (max_len - len(results_cbow)),
        'SkipGram Word': [res.get('Word', '') for res in results_sg] + [''] * (max_len - len(results_sg)),
        'SkipGram Sim': [f"{res.get('Similarity', 0):.4f}" for res in results_sg] + [''] * (max_len - len(results_sg)),
    }
    comparison_df = pd.DataFrame(data_for_df).head(top_n) # Ensure only top_n rows

    print(f"\n--- Nearest Neighbors Comparison for '{input_word}' ---")
    display(comparison_df)


# --- Word Analogy Functions (Remain largely the same, just need modification for comparison) ---

def get_analogy_result(
    word_a: str, word_b: str, word_c: str,
    embedding_mat: torch.Tensor, vocab_obj: Vocabulary, top_n: int = 5
) -> List[Dict[str, Union[str, float]]]:
    """Performs analogy and returns results list."""
    results = []
    if embedding_mat is None or not vocab_obj: return results

    vec_a = get_embedding_vector(word_a, vocab_obj, embedding_mat)
    vec_b = get_embedding_vector(word_b, vocab_obj, embedding_mat)
    vec_c = get_embedding_vector(word_c, vocab_obj, embedding_mat)
    if vec_a is None or vec_b is None or vec_c is None: return results

    try:
        target_vector = (vec_a - vec_b + vec_c).float()
        cos = nn.CosineSimilarity(dim=1)
        similarities = cos(embedding_mat.float(), target_vector.unsqueeze(0))
        k = min(top_n + 3, len(vocab_obj))
        top_indices = torch.argsort(similarities, descending=True)[:k]

        input_words = {word_a.lower(), word_b.lower(), word_c.lower()}
        for idx_tensor in top_indices:
            idx = idx_tensor.item()
            word = vocab_obj.get_word(idx)
            if word.lower() in input_words: continue
            sim = similarities[idx].item()
            results.append({'Word': word, 'Similarity': sim})
            if len(results) == top_n: break
        return results
    except Exception as e:
        logger.error(f"Error performing analogy '{word_a}-{word_b}+{word_c}': {e}", exc_info=True)
        return []

def compare_word_analogies(
    word_a: str, word_b: str, word_c: str,
    embed_cbow: torch.Tensor, embed_sg: torch.Tensor,
    vocab_obj: Vocabulary, top_n: int = 5
):
    """Compares analogy results from CBOW and SkipGram side-by-side."""
    logger.info(f"Comparing analogy: '{word_a}' - '{word_b}' + '{word_c}' = ?")

    results_cbow = get_analogy_result(word_a, word_b, word_c, embed_cbow, vocab_obj, top_n)
    results_sg = get_analogy_result(word_a, word_b, word_c, embed_sg, vocab_obj, top_n)

    max_len = max(len(results_cbow), len(results_sg), top_n)
    data_for_df = {
        'Rank': list(range(1, max_len + 1)),
        'CBOW Word': [res.get('Word', '') for res in results_cbow] + [''] * (max_len - len(results_cbow)),
        'CBOW Sim': [f"{res.get('Similarity', 0):.4f}" for res in results_cbow] + [''] * (max_len - len(results_cbow)),
        'SkipGram Word': [res.get('Word', '') for res in results_sg] + [''] * (max_len - len(results_sg)),
        'SkipGram Sim': [f"{res.get('Similarity', 0):.4f}" for res in results_sg] + [''] * (max_len - len(results_sg)),
    }
    comparison_df = pd.DataFrame(data_for_df).head(top_n)

    print(f"\n--- Analogy Comparison for '{word_a} - {word_b} + {word_c}' ---")
    display(comparison_df)
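
To make the neighbor search concrete, here is a toy version in plain Python. It assumes a hypothetical four-word vocabulary with hand-picked 2-D vectors (not values from either model), and the function names are illustrative only. It follows the same steps as `get_nearest_neighbors_data`: score every word by cosine similarity to the query vector, sort in descending order, and skip the query word itself.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(
    word: str, vectors: Dict[str, List[float]], top_n: int = 2
) -> List[Tuple[str, float]]:
    """Return the top_n most similar words to `word`, excluding `word` itself."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical toy embeddings, chosen by hand for illustration.
toy_vectors = {
    "king": [1.0, 0.9],
    "queen": [0.9, 1.0],
    "apple": [-1.0, 0.2],
    "throne": [1.0, 0.5],
}
neighbors = nearest_neighbors("king", toy_vectors)
print(neighbors)  # 'queen' ranks first, 'throne' second
```

The real functions do the same work on the full embedding matrix with `nn.CosineSimilarity` and `torch.argsort`.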

## ▶️ Run Comparative Evaluations

Now run the nearest neighbor and analogy tasks for both CBOW and SkipGram models.

### Nearest Neighbors Comparison

In [13]:
# --- Test Nearest Neighbors Comparison ---
test_words = ['king', 'computer', 'france', 'history', 'running', 'apple', 'man']

if vocab and embedding_matrix_cbow is not None and embedding_matrix_sg is not None:
    for word in test_words:
        # Call the comparison function
        compare_nearest_neighbors(
            word,
            embedding_matrix_cbow,
            embedding_matrix_sg,
            vocab,
            top_n=10
        )
else:
    logger.error("Cannot run comparisons: Embeddings/Vocab not fully loaded.")

2025-04-16 22:55:59 | DropoutDisco | INFO     | [4155698150.py:88] | Comparing nearest neighbors for 'king'...

--- Nearest Neighbors Comparison for 'king' ---


| Rank | CBOW Word | CBOW Sim | SkipGram Word | SkipGram Sim |
|---|---|---|---|---|
| 1 | kings | 0.6316 | kings | 0.6425 |
| 2 | son | 0.5550 | iii | 0.6323 |
| 3 | prince | 0.5230 | son | 0.6270 |
| 4 | monarch | 0.5016 | queen | 0.6193 |
| 5 | queen | 0.4942 | prince | 0.5995 |
| 6 | regent | 0.4759 | throne | 0.5870 |
| 7 | throne | 0.4745 | reign | 0.5726 |
| 8 | duke | 0.4696 | henry | 0.5700 |
| 9 | monarchs | 0.4436 | emperor | 0.5656 |
| 10 | vassal | 0.4299 | duke | 0.5519 |


2025-04-16 22:55:59 | DropoutDisco | INFO     | [4155698150.py:88] | Comparing nearest neighbors for 'computer'...

--- Nearest Neighbors Comparison for 'computer' ---


| Rank | CBOW Word | CBOW Sim | SkipGram Word | SkipGram Sim |
|---|---|---|---|---|
| 1 | computers | 0.7149 | computers | 0.7722 |
| 2 | console | 0.5243 | computing | 0.7087 |
| 3 | computing | 0.5101 | hardware | 0.7069 |
| 4 | digital | 0.5020 | software | 0.7045 |
| 5 | device | 0.4889 | graphics | 0.6611 |
| 6 | hardware | 0.4809 | interface | 0.6532 |
| 7 | processor | 0.4796 | graphical | 0.6530 |
| 8 | portable | 0.4679 | console | 0.6520 |
| 9 | microprocessor | 0.4583 | digital | 0.6368 |
| 10 | software | 0.4583 | programming | 0.6305 |


2025-04-16 22:56:00 | DropoutDisco | INFO     | [4155698150.py:88] | Comparing nearest neighbors for 'france'...

--- Nearest Neighbors Comparison for 'france' ---


| Rank | CBOW Word | CBOW Sim | SkipGram Word | SkipGram Sim |
|---|---|---|---|---|
| 1 | french | 0.6235 | spain | 0.7153 |
| 2 | spain | 0.6193 | belgium | 0.6788 |
| 3 | italy | 0.5325 | italy | 0.6604 |
| 4 | luxembourg | 0.5291 | germany | 0.6572 |
| 5 | portugal | 0.5247 | portugal | 0.6538 |
| 6 | belgium | 0.5199 | netherlands | 0.6489 |
| 7 | canada | 0.5025 | luxembourg | 0.6415 |
| 8 | corsica | 0.4957 | switzerland | 0.6312 |
| 9 | calais | 0.4828 | paris | 0.6278 |
| 10 | paris | 0.4800 | hungary | 0.6269 |


2025-04-16 22:56:00 | DropoutDisco | INFO     | [4155698150.py:88] | Comparing nearest neighbors for 'history'...

--- Nearest Neighbors Comparison for 'history' ---


| Rank | CBOW Word | CBOW Sim | SkipGram Word | SkipGram Sim |
|---|---|---|---|---|
| 1 | origins | 0.5696 | origins | 0.5773 |
| 2 | chronology | 0.5338 | historical | 0.5757 |
| 3 | overview | 0.4930 | article | 0.5737 |
| 4 | annals | 0.4861 | references | 0.5624 |
| 5 | timeline | 0.4853 | culture | 0.5154 |
| 6 | histories | 0.4850 | geography | 0.4989 |
| 7 | handbook | 0.4786 | prehistory | 0.4959 |
| 8 | geography | 0.4731 | overview | 0.4953 |
| 9 | literature | 0.4247 | timeline | 0.4860 |
| 10 | culture | 0.4226 | chronology | 0.4848 |


2025-04-16 22:56:00 | DropoutDisco | INFO     | [4155698150.py:88] | Comparing nearest neighbors for 'running'...

--- Nearest Neighbors Comparison for 'running' ---


| Rank | CBOW Word | CBOW Sim | SkipGram Word | SkipGram Sim |
|---|---|---|---|---|
| 1 | run | 0.6474 | run | 0.7206 |
| 2 | ran | 0.5622 | runs | 0.6214 |
| 3 | stretch | 0.4714 | ran | 0.6155 |
| 4 | going | 0.4369 | platforms | 0.5138 |
| 5 | throttle | 0.4269 | drivers | 0.4993 |
| 6 | operating | 0.4096 | operating | 0.4835 |
| 7 | platform | 0.3976 | off | 0.4770 |
| 8 | watching | 0.3956 | lane | 0.4767 |
| 9 | hack | 0.3929 | segment | 0.4739 |
| 10 | boot | 0.3833 | window | 0.4704 |


2025-04-16 22:56:00 | DropoutDisco | INFO     | [4155698150.py:88] | Comparing nearest neighbors for 'apple'...

--- Nearest Neighbors Comparison for 'apple' ---


| Rank | CBOW Word | CBOW Sim | SkipGram Word | SkipGram Sim |
|---|---|---|---|---|
| 1 | intel | 0.5700 | macintosh | 0.8007 |
| 2 | macintosh | 0.5636 | ibm | 0.7220 |
| 3 | atari | 0.5356 | intel | 0.7209 |
| 4 | microsoft | 0.5102 | microsoft | 0.7160 |
| 5 | imac | 0.5085 | pc | 0.6858 |
| 6 | amiga | 0.5053 | mac | 0.6605 |
| 7 | microcomputer | 0.4970 | iic | 0.6596 |
| 8 | ibm | 0.4922 | amiga | 0.6559 |
| 9 | compaq | 0.4751 | chip | 0.6333 |
| 10 | hypercard | 0.4677 | atari | 0.6319 |


2025-04-16 22:56:00 | DropoutDisco | INFO     | [4155698150.py:88] | Comparing nearest neighbors for 'man'...

--- Nearest Neighbors Comparison for 'man' ---


| Rank | CBOW Word | CBOW Sim | SkipGram Word | SkipGram Sim |
|---|---|---|---|---|
| 1 | woman | 0.5926 | woman | 0.6101 |
| 2 | girl | 0.5434 | my | 0.5542 |
| 3 | men | 0.5048 | himself | 0.5417 |
| 4 | serpent | 0.4784 | heroically | 0.5123 |
| 5 | eyes | 0.4686 | god | 0.4954 |
| 6 | person | 0.4684 | mr | 0.4942 |
| 7 | creature | 0.4608 | creature | 0.4924 |
| 8 | farmer | 0.4515 | me | 0.4906 |
| 9 | goat | 0.4440 | love | 0.4905 |
| 10 | lover | 0.4361 | devil | 0.4818 |


## ✨ Nearest Neighbor Comparison: CBOW vs. SkipGram



Comparing the top 10 most similar words for several test words reveals both differences and similarities between the embeddings learned by CBOW and SkipGram. Both models were trained on the full text8 corpus with 128-dimensional embeddings, window size 5, minimum frequency 5, learning rate 0.001, and batch size 512. CBOW ran for 15 epochs; SkipGram ran for 3 epochs with negative sampling (NS).



**Observations:**

*   **👑 `king`:**
    *   **CBOW:** Finds related royalty/leadership terms (`kings`, `son`, `prince`, `monarch`, `queen`, `regent`, `throne`, `duke`, `monarchs`, `vassal`). Good semantic clustering.
    *   **SkipGram:** Also finds royalty (`kings`, `son`, `queen`, `prince`, `throne`, `reign`, `henry`, `emperor`, `duke`), plus `iii` (likely the Roman numeral that often follows a king's name). SkipGram ranks `queen` slightly higher (#4 vs. #5 for CBOW). Both are strong.
*   **💻 `computer`:**
    *   **CBOW:** Focuses on plurals, related concepts, and components (`computers`, `console`, `computing`, `digital`, `device`, `hardware`, `processor`, `portable`, `microprocessor`, `software`). Very relevant.
    *   **SkipGram:** Also very relevant (`computers`, `computing`, `hardware`, `software`), but brings in terms like `graphics`, `interface`, `graphical`, `programming`. SkipGram seems slightly better at capturing related *actions* or *fields* alongside hardware/concepts. Similarities are generally higher.
*   **🇫🇷 `france`:**
    *   **CBOW:** Excellent geographical clustering (`french`, `spain`, `italy`, `luxembourg`, `portugal`, `belgium`, `canada`, `corsica`, `calais`, `paris`). Captures neighboring countries and related places very well.
    *   **SkipGram:** Also excellent (`spain`, `belgium`, `italy`, `germany`, `portugal`, `netherlands`, `luxembourg`, `switzerland`, `paris`, `hungary`). Captures neighbors, major European countries, and the capital. SkipGram's list might feel slightly more focused on peer countries.
*   **📜 `history`:**
    *   **CBOW:** Related concepts (`origins`, `chronology`, `overview`, `annals`, `timeline`, `histories`, `handbook`, `geography`, `literature`, `culture`). Good abstract connections.
    *   **SkipGram:** Also good (`origins`, `historical`, `article`, `references`, `culture`, `geography`, `prehistory`, `overview`, `timeline`, `chronology`). Includes `historical` (adjective form) and terms related to *studying* history (`article`, `references`). Seems solid.
*   **🏃 `running`:**
    *   **CBOW:** Finds inflections of the same verb (`run`, `ran`), other `-ing` verbs (`going`, `operating`, `watching`), a physical action (`stretch`), and some less related terms (`throttle`, `platform`, `hack`, `boot`). Decent, but noisy.
    *   **SkipGram:** Finds inflections (`run`, `runs`, `ran`), related concepts (`platforms`, `drivers`, `operating`, `lane`, `segment`, `window`), and `off`. Seems slightly less focused on physical action and more on states or related concepts. Similarity scores are higher.
*   **🍎 `apple`:**
    *   **CBOW:** Clearly captures the *company* meaning (`intel`, `macintosh`, `atari`, `microsoft`, `imac`, `amiga`, `microcomputer`, `ibm`, `compaq`, `hypercard`). Excellent tech company clustering.
    *   **SkipGram:** Also captures the company (`macintosh`, `ibm`, `intel`, `microsoft`, `pc`, `mac`, `iic`, `amiga`, `chip`, `atari`). Similarity scores are much higher, potentially indicating stronger associations learned by SkipGram in this context.
*   **👨 `man`:**
    *   **CBOW:** Finds the direct counterpart `woman` first, then related concepts (`girl`, `men`, `person`, `creature`) but also noise (`serpent`, `eyes`, `farmer`, `goat`, `lover`). Mixed bag.
    *   **SkipGram:** Also finds `woman` first, but then diverges into pronouns (`my`, `himself`, `me`), the title `mr`, religious or abstract terms (`god`, `devil`, `love`), `creature`, and an odd one (`heroically`). Seems less focused on simple human categories than CBOW here, but captures strong associations from the text.



**Small Conclusion:**

*   Both CBOW (15 epochs) and SkipGram (3 epochs NS) learned **meaningful semantic relationships**! 🥳 They cluster related concepts like countries, tech companies, and royalty well.
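*   **SkipGram shows higher similarity scores** for the top neighbor of every test word. This may indicate tighter clusters, but raw cosine values are not directly comparable across models trained with different objectives, so the rank order within each model is the more reliable signal. SkipGram reached these scores in fewer epochs, although an epoch is not necessarily the same amount of training in the two setups.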
*   **Differences:** They sometimes capture slightly different *types* of similarity. CBOW might be slightly better at broader related concepts (e.g., `history`), while SkipGram might be slightly better at direct peers or more abstract/frequent co-occurrences (e.g., `apple`, `man`).
*   **Task Dependency:** Which one is "better" might depend on the specific downstream task (our HN prediction model). SkipGram is often preferred for capturing rare word meanings, while CBOW can be faster and smoother for frequent words.

Overall, both models produced promising embeddings, with SkipGram reaching higher neighbor similarities after fewer epochs. 👍
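
One way to compare the two models without relying on raw similarity values is to measure how much their neighbor lists overlap. The sketch below is a hedged illustration: the helper `jaccard_overlap` is hypothetical (not part of the project), and the two lists are copied from the `king` table above.

```python
from typing import Iterable


def jaccard_overlap(a: Iterable[str], b: Iterable[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Top-10 neighbors of 'king', as recorded in the comparison table.
cbow_king = ["kings", "son", "prince", "monarch", "queen",
             "regent", "throne", "duke", "monarchs", "vassal"]
sg_king = ["kings", "iii", "son", "queen", "prince",
           "throne", "reign", "henry", "emperor", "duke"]

shared = sorted(set(cbow_king) & set(sg_king))
overlap = jaccard_overlap(cbow_king, sg_king)
print(shared)             # words that both models return
print(f"{overlap:.3f}")   # 6 shared words out of 14 distinct ones
```

An overlap near 1 would mean the models agree on the neighborhood; a low value flags words where the two embedding spaces disagree and a closer look is worthwhile.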

### Word Analogy Comparison

In [16]:
# --- Test Word Analogy Comparison ---
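analogies = [
    # Each tuple (a, b, c) is evaluated as vec(a) - vec(b) + vec(c).
    ('king', 'man', 'woman'),           # Target: queen
    ('paris', 'france', 'germany'),     # Target: berlin
    # The next three tuples compute base - inflected + c. Reaching the intended
    # targets would require swapping a and b, e.g. ('walked', 'walking', 'swimming').
    ('walking', 'walked', 'swimming'),  # Intended target: swam
    ('big', 'bigger', 'small'),         # Intended target: smaller
    ('cold', 'colder', 'hot')           # Intended target: hotter
]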

if vocab and embedding_matrix_cbow is not None and embedding_matrix_sg is not None:
    for a, b, c in analogies:
         # Call the comparison function
         compare_word_analogies(
             a, b, c,
             embedding_matrix_cbow,
             embedding_matrix_sg,
             vocab,
             top_n=10
         )
else:
    logger.error("Cannot run analogies: Embeddings/Vocab not fully loaded.")

2025-04-16 22:58:34 | DropoutDisco | INFO     | [4155698150.py:150] | Comparing analogy: 'king' - 'man' + 'woman' = ?

--- Analogy Comparison for 'king - man + woman' ---


| Rank | CBOW Word | CBOW Sim | SkipGram Word | SkipGram Sim |
|---|---|---|---|---|
| 1 | isabella | 0.5062 | queen | 0.6498 |
| 2 | queen | 0.4857 | prince | 0.6150 |
| 3 | son | 0.4759 | husband | 0.6053 |
| 4 | wife | 0.4567 | daughter | 0.5858 |
| 5 | wives | 0.4555 | wife | 0.5850 |
| 6 | duke | 0.4452 | mary | 0.5823 |
| 7 | kings | 0.4428 | throne | 0.5811 |
| 8 | monarch | 0.4323 | son | 0.5615 |
| 9 | daughter | 0.4319 | princess | 0.5590 |
| 10 | husband | 0.4205 | mother | 0.5527 |


2025-04-16 22:58:34 | DropoutDisco | INFO     | [4155698150.py:150] | Comparing analogy: 'paris' - 'france' + 'germany' = ?

--- Analogy Comparison for 'paris - france + germany' ---


| Rank | CBOW Word | CBOW Sim | SkipGram Word | SkipGram Sim |
|---|---|---|---|---|
| 1 | berlin | 0.6104 | berlin | 0.7401 |
| 2 | dresden | 0.4647 | munich | 0.7337 |
| 3 | leipzig | 0.4422 | vienna | 0.7209 |
| 4 | bonn | 0.4381 | leipzig | 0.6919 |
| 5 | vienna | 0.4239 | frankfurt | 0.6609 |
| 6 | taganrog | 0.4200 | dresden | 0.6403 |
| 7 | german | 0.4111 | nuremberg | 0.6243 |
| 8 | belgrade | 0.4065 | hamburg | 0.6151 |
| 9 | basel | 0.3965 | bonn | 0.6132 |
| 10 | frankfurt | 0.3821 | cologne | 0.6116 |


2025-04-16 22:58:34 | DropoutDisco | INFO     | [4155698150.py:150] | Comparing analogy: 'walking' - 'walked' + 'swimming' = ?

--- Analogy Comparison for 'walking - walked + swimming' ---


| Rank | CBOW Word | CBOW Sim | SkipGram Word | SkipGram Sim |
|---|---|---|---|---|
| 1 | outdoor | 0.5449 | indoor | 0.6288 |
| 2 | indoor | 0.5277 | outdoor | 0.6119 |
| 3 | cycling | 0.5053 | cycling | 0.5852 |
| 4 | climbing | 0.5024 | pool | 0.5762 |
| 5 | backpacking | 0.4774 | recreation | 0.5732 |
| 6 | bicycles | 0.4624 | diving | 0.5717 |
| 7 | jumping | 0.4551 | recreational | 0.5688 |
| 8 | recreational | 0.4514 | climbing | 0.5582 |
| 9 | sport | 0.4492 | riding | 0.5455 |
| 10 | bouldering | 0.4476 | golf | 0.5444 |


2025-04-16 22:58:34 | DropoutDisco | INFO     | [4155698150.py:150] | Comparing analogy: 'big' - 'bigger' + 'small' = ?

--- Analogy Comparison for 'big - bigger + small' ---


| Rank | CBOW Word | CBOW Sim | SkipGram Word | SkipGram Sim |
|---|---|---|---|---|
| 1 | large | 0.4134 | laminated | 0.3335 |
| 2 | cedar | 0.3387 | bumpers | 0.3306 |
| 3 | huge | 0.3330 | large | 0.3244 |
| 4 | slight | 0.3300 | straws | 0.3155 |
| 5 | sizeable | 0.3288 | gmsk | 0.3122 |
| 6 | madeline | 0.3287 | pentatonic | 0.3072 |
| 7 | tiny | 0.3271 | walleye | 0.3063 |
| 8 | long | 0.3266 | sidewalk | 0.3004 |
| 9 | thin | 0.3239 | columnar | 0.2963 |
| 10 | vernon | 0.3232 | teeing | 0.2951 |


2025-04-16 22:58:34 | DropoutDisco | INFO     | [4155698150.py:150] | Comparing analogy: 'cold' - 'colder' + 'hot' = ?

--- Analogy Comparison for 'cold - colder + hot' ---


| Rank | CBOW Word | CBOW Sim | SkipGram Word | SkipGram Sim |
|---|---|---|---|---|
| 1 | thin | 0.4536 | newsreel | 0.3370 |
| 2 | dirty | 0.3868 | toothbrush | 0.3364 |
| 3 | flame | 0.3763 | lantern | 0.3341 |
| 4 | balloon | 0.3723 | rainbow | 0.3324 |
| 5 | mug | 0.3705 | squeaky | 0.3263 |
| 6 | chili | 0.3678 | cliffe | 0.3238 |
| 7 | shorts | 0.3673 | stanley | 0.3176 |
| 8 | breakfast | 0.3672 | parkin | 0.3135 |
| 9 | darin | 0.3600 | big | 0.3134 |
| 10 | noodle | 0.3505 | foley | 0.3122 |


## 🧠 Word Analogy Comparison: CBOW vs. SkipGram



The vector arithmetic `vec(a) - vec(b) + vec(c) ≈ vec(?)` tests whether the embeddings capture linear relationships between concepts: the offset from `b` to `a` is added to `c`, and the nearest word to the result is the model's answer.
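
The role of each term in `vec(a) - vec(b) + vec(c)` is easiest to see on a toy example. The sketch below assumes hypothetical hand-built 3-D vectors (dimension 0 for "walk", 1 for "swim", 2 for tense; not values from either model), and `solve_analogy` is an illustrative helper, not a project function. It shows that swapping `a` and `b` reverses the offset and changes the answer.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def solve_analogy(a: str, b: str, c: str, vectors: Dict[str, List[float]]) -> str:
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b, c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in {a, b, c}}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))


# Hypothetical toy embeddings: [walk-ness, swim-ness, tense], with
# tense +1 for the -ing form, -1 for past tense, and 0 for the base form.
toy = {
    "walk": [1.0, 0.0, 0.0],
    "walking": [1.0, 0.0, 1.0],
    "walked": [1.0, 0.0, -1.0],
    "swim": [0.0, 1.0, 0.0],
    "swimming": [0.0, 1.0, 1.0],
    "swam": [0.0, 1.0, -1.0],
}

# Offset from -ing form to past tense, applied to 'swimming'.
print(solve_analogy("walked", "walking", "swimming", toy))
# Reversed offset: pushes 'swimming' further away from the past tense.
print(solve_analogy("walking", "walked", "swimming", toy))
```

In this toy space only the first ordering lands on `swam`; the reversed one returns `swim`.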



**Observations:**

1.  **`king - man + woman = ?` (Target: `queen`)** 👑
    *   **CBOW:** Top result is `isabella`, followed by `queen` (#2), `son`, `wife`, `wives`, `duke`. It captures related royalty/family concepts, and `queen` is highly ranked! 👍
    *   **SkipGram:** Top result is `queen`! 🎉 Followed by `prince`, `husband`, `daughter`, `wife`. SkipGram nails the target and finds other very relevant royalty/family terms. Seems slightly better/cleaner here.
2.  **`paris - france + germany = ?` (Target: `berlin`)** 🗺️
    *   **CBOW:** Top result is `berlin`! ✅ Followed by other German/European cities (`dresden`, `leipzig`, `bonn`, `frankfurt`) and related terms. Excellent performance.
    *   **SkipGram:** Top result is `berlin`! ✅ Followed by other major German and Austrian cities (`munich`, `vienna`, `leipzig`, `frankfurt`). Again, excellent performance. SkipGram's list stays within German-speaking cities, while CBOW's drifts to `taganrog` and `belgrade`. Both models clearly captured this relationship.
3.  **`walking - walked + swimming = ?` (Intended target: `swam`)** 🏊‍♂️
    *   **Query order:** This expression adds the offset from past tense to the `-ing` form (`walking - walked`) to `swimming`, which is already an `-ing` form. Reaching `swam` needs the opposite offset, `walked - walking + swimming`, so the results below do not test the tense relationship.
    *   **CBOW:** Top results are `outdoor`, `indoor`, `cycling`, `climbing`, `backpacking`, `bicycles`, `jumping`, `recreational`, `sport`. These are activity/sport terms close to `swimming`; `swam` does not appear. ❌
    *   **SkipGram:** Top results are `indoor`, `outdoor`, `cycling`, `pool`, `recreation`, `diving`, `recreational`, `climbing`, `riding`, `golf`. Similar to CBOW, it focuses on related activities/locations; `swam` does not appear. ❌
4.  **`big - bigger + small = ?` (Intended target: `smaller`)** 📏
    *   **Query order:** Same issue; the target requires `bigger - big + small`.
    *   **CBOW:** Top result is `large`, followed by a mix of size adjectives (`huge`, `slight`, `sizeable`, `tiny`, `long`, `thin`) and unrelated words (`cedar`, `madeline`, `vernon`). It stays in the size domain, but `smaller` does not appear. ❌
    *   **SkipGram:** Top results are mostly unrelated (`laminated`, `bumpers`, `straws`, `gmsk`, `pentatonic`), with `large` at rank #3. `smaller` does not appear. ❌
5.  **`cold - colder + hot = ?` (Intended target: `hotter`)** 🔥
    *   **Query order:** Same issue; the target requires `colder - cold + hot`.
    *   **CBOW:** Results are largely unrelated (`thin`, `dirty`, `flame`, `balloon`, `mug`). ❌
    *   **SkipGram:** Results are also unrelated (`newsreel`, `toothbrush`, `lantern`, `rainbow`, `squeaky`). ❌



**Conclusion:**

*   **Semantic Success:** Both CBOW and SkipGram performed **remarkably well** on the semantic analogies (`king`/`queen`, `paris`/`berlin`). SkipGram ranked `queen` first, while CBOW ranked it second. This shows they learned strong country-capital and gender-role relationships. ✅✅
*   **Syntactic Results Inconclusive:** Neither model returned the intended word for the verb-tense (`walking`/`swam`) or comparative-adjective (`bigger`/`smaller`, `colder`/`hotter`) queries, and most results were noise. 📉 However, these three queries were evaluated as `base - inflected + c` rather than `inflected - base + c`, so they do not actually test the intended relationships.
*   **CBOW vs SkipGram:** For semantic tasks, both were strong, with SkipGram perhaps slightly cleaner. On the three syntactic queries, CBOW at least stayed in the size domain for `big`, while SkipGram returned mostly unrelated words.
*   **Training Limitations:** Even with correctly ordered queries, text8 is a small corpus (roughly 17 million tokens), and inflected forms such as `swam` or `hotter` may be too rare in it for either model to place them reliably. The two runs also differ in more than architecture (3 epochs of SkipGram with NS versus 15 epochs of CBOW with full softmax), so differences cannot be attributed to the architecture alone.

Overall, the embeddings capture semantic relationships well, while their handling of grammar/syntax remains untested until the queries are re-run in the corrected order. For using them in our HN prediction model (which likely relies more on semantics), they should be quite effective! 👍
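
The qualitative comparison can be condensed into a hit rate. As a rough sketch, the code below records the 1-based rank at which each intended target appears in the top-10 tables above (`None` when it is absent) and computes hits@k; the helper `hits_at_k` is hypothetical and not part of the project, and five queries are far too few for a reliable score.

```python
from typing import Dict, Optional

# Rank of the intended target in each recorded top-10 list; None = not present.
recorded_ranks: Dict[str, Dict[str, Optional[int]]] = {
    "CBOW": {"queen": 2, "berlin": 1, "swam": None, "smaller": None, "hotter": None},
    "SkipGram": {"queen": 1, "berlin": 1, "swam": None, "smaller": None, "hotter": None},
}


def hits_at_k(ranks: Dict[str, Optional[int]], k: int) -> float:
    """Fraction of queries whose target appears at rank k or better."""
    hits = sum(1 for rank in ranks.values() if rank is not None and rank <= k)
    return hits / len(ranks)


summary = {
    model: {"hits@1": hits_at_k(ranks, 1), "hits@5": hits_at_k(ranks, 5)}
    for model, ranks in recorded_ranks.items()
}
for model, scores in summary.items():
    print(model, scores)
```

The two models differ only at hits@1, where SkipGram's first-place `queen` counts and CBOW's second-place `queen` does not; at hits@5 they are tied.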