# Lab 7: Inserting Word Embeddings with a Genetic Algorithm

Welcome to the lab! Our goal is to insert new words (e.g., from the CIFAR-100 dataset) into a pre-trained Skip-Gram model *without* having to retrain the entire model from scratch.

This is a common problem: how do you update a massive, trained model with new vocabulary?

---

### üéØ Our Goal & Method

We will implement a **(1+$\lambda$) Evolution Strategy (ES)**, a simple Genetic Algorithm. This algorithm will "evolve" a new vector for each missing word.

The evolution will be guided by a custom **fitness function** that scores how "good" a candidate vector is. A good vector should:

1.  **Fit the Corpus:** The vector should have a high dot product with its *context* words (words it appears near) and a low dot product with *negative samples* (random words). This is the core idea of Skip-Gram.
2.  **Match the Space:** The vector's **norm** (length) should be similar to the average norm of existing words. This helps it "fit in" to the pre-existing geometric structure.
3.  **Match Semantics:** The vector should be close to known "anchor" words (e.g., we can manually specify that 'beaver' should be near 'animal').

### üë©‚Äçüíª Your Task

You will implement the core functions in `lab6.py`. This notebook will then use your functions to:
* Load the pre-trained model and data.
* Analyze the vocabulary overlap to find missing words.
* Extract contexts for the missing words.
* Run your Genetic Algorithm to evolve new embeddings.
* Analyze and visualize the newly inserted words.

In [1]:
# =============================================================================
# SETUP & IMPORTS
# =============================================================================
import os
import random
import numpy as np
import torch

# --- Imports from Previous Labs ---
# We rely on text processing tools from Lab 2 and Lab 6
from src.lab2 import process_text_network
from src.lab6_current_best import (
    prepare_visual_genome_text,
    filter_punctuation_from_network,
    analyze_embeddings,
    find_similar_words
)

# --- Imports for Lab 7 (Genetic Algorithm) ---
# These functions handle the core Evolutionary Strategy logic,
# data loading for CIFAR-100, and visualization tools.
from src.lab7 import (
    load_trained_model,
    create_mappings,
    compute_embedding_stats,
    get_cifar100_vocabulary,
    analyze_vocabulary_overlap,
    extract_word_contexts,
    evolve_embedding,           # <--- The core GA function you will implement/use
    visualize_with_inserted_words,
    run_sanity_checks
)

In [9]:
# =============================================================================
# CONFIGURATION & HYPERPARAMETERS
# =============================================================================
# This cell defines all the key parameters for our experiment.
#
# -----------------------------------------------------------------------------
# üìñ TUNING NOTE:
# The `ga_...` parameters and `fitness_weights` are the most
# interesting ones to experiment with. Small changes here can
# significantly alter the final position and quality of the
# inserted embeddings.
# -----------------------------------------------------------------------------


# --- 1. File & Data Paths ---
# Defines where to load the pre-trained model from and where to find
# the corpus text data.
model_path = 'best_model.pth'
text_file = 'vg_text.txt'
zip_url = "https://homes.cs.washington.edu/~ranjay/visualgenome/data/dataset/region_descriptions.json.zip"

# --- 2. Pre-trained Model Parameters ---
# These parameters MUST match the architecture of the 'baseline_model.pth'
# that we are loading.
rare_threshold = 0.00025    # Filtering threshold used to train the model
embedding_dim = 96         # Dimensionality of the embedding vectors
dropout = 0.3              # Dropout rate of the loaded model
punctuation_tokens = {'.', ',', '<RARE>', "'"} # Tokens to ignore

# --- 4. Genetic Algorithm (ES) Parameters ---
context_window = 5         # (Window for context extraction, from Lab 6)

ga_pop_size = 100          # Population size (Œª in 1+Œª).
                           # We create 100 "offspring" per generation.
ga_generations = 300       # Number of evolutionary cycles to run.
ga_mutation_factor = 0.05  # The "step size" of the mutation. This is
                           # scaled by the embedding space's global
                           # standard deviation.

# --- 5. Fitness Function Weights ---
# **This is the most critical part!**
# These weights define the "goal" of the evolution. They control the
# trade-off between the three parts of our fitness function.
fitness_weights = {
    # 60% priority: Fit the corpus (Skip-Gram objective).
    'corpus': 0.70,

    # 15% priority: Match the "shape" of the space.
    # (i.e., new vectors should have a similar norm/length).
    'norm': 0.15,

    # 25% priority: Be close to our semantic anchors.
    'anchor': 0.15 #0.15
}

print("‚úì All configurations set.")
print(f"  GA Config: {ga_pop_size} population, {ga_generations} generations")
print(f"  Fitness Weights: Corpus={fitness_weights['corpus']}, "
      f"Norm={fitness_weights['norm']}, "
      f"Anchor={fitness_weights['anchor']}")

‚úì All configurations set.
  GA Config: 100 population, 300 generations
  Fitness Weights: Corpus=0.7, Norm=0.15, Anchor=0.15


In [10]:
# =============================================================================
# STAGE 1: LOAD CORPUS & BUILD VOCABULARY
# =============================================================================
# In this stage, we load the raw text corpus (Visual Genome) and process
# it to exactly match the vocabulary used to train our baseline model.
#
# We need this vocabulary ('nodes') to:
# 1. Load the pre-trained model (which requires the 'vocab_size').
# 2. Create the 'word_to_idx' and 'idx_to_word' mappings.
# 3. Extract contexts for our new words.

print("\n[STAGE 1] Loading Data & Building Vocabulary")
print("-" * 70)

# --- 1. Download or load text data ---
# We use the Visual Genome dataset as our corpus.
if not os.path.exists(text_file):
    print(f"Text file not found. Downloading corpus from {zip_url}...")
    # This function (from lab6) downloads, extracts, and cleans the text.
    text_file = prepare_visual_genome_text(
        zip_url=zip_url,
        zip_path="region_descriptions.json.zip",
        json_path="region_descriptions.json",
        output_path=text_file
    )
else:
    print(f"‚úì Text file '{text_file}' already exists.")

# --- 2. Build co-occurrence network ---
# This step (from lab2) processes the entire text file, counts word
# frequencies, filters rare words, and builds the co-occurrence network.
print("\nProcessing text to build co-occurrence network...")
network_data = process_text_network(
    text_file,
    rare_threshold=rare_threshold,  # Must match the model's training
    rare_token="<RARE>",
    distance_mode="inverted",
    verbose=True  # Set to True to see progress
)


# --- 3. Filter punctuation ---
# We clean up the network by removing punctuation tokens.
network_data = filter_punctuation_from_network(
    network_data,
    punctuation_tokens=punctuation_tokens
)

# --- 4. Finalize Vocabulary ---
# The 'nodes' list from the network IS our final vocabulary.
nodes = network_data['nodes']
vocab_size = len(nodes)



print(f"\n{'-'*70}")
print(f"‚úì STAGE 1 Complete: Vocabulary built.")
print(f"  Total vocabulary size (Vocab Size): {vocab_size} nodes")
print(f"  Total graph edges: {network_data['graph'].number_of_edges():,}")


[STAGE 1] Loading Data & Building Vocabulary
----------------------------------------------------------------------
‚úì Text file 'vg_text.txt' already exists.

Processing text to build co-occurrence network...
Loaded text: 154199868 characters
Tokenized: 27533256 tokens
Sample tokens: ['wlall', 'awi', 'walks', 'refloor', 'keysia', 'leyboard', 'enhibit', 'adelaida', 'elights', 'referre', 'beacch', 'johann', 'stackd', 'chartreuse', 'clu', 'eric', 'moody', 'toeach', 'graffit', 'novak']
Replaced 62380 rare tokens (threshold=0.00025)
Final vocabulary: 456 unique tokens
Sample tokens: ['between', 'player', 'picture', 'couch', 'windshield', 'stone', 'striped', 'many', 'boat', 'pair', 'flowers', 'people', 'mouse', 'chain', 'wearing', 'pants', 'eyes', 'bowl', 'teddy', 'bed']
Graph: 456 nodes, 84581 edges
Top tokens by frequency:
   1. '<RARE>' (freq=4416304)
   2. 'a' (freq=2220903)
   3. 'the' (freq=2155082)
   4. 'on' (freq=1396037)
   5. 'of' (freq=980462)
   6. 'is' (freq=787911)
   7. 'i

In [11]:
import math
import re
from collections import Counter
from typing import Dict, Set, Optional


DEFAULT_STOPWORDS = {
    "a","an","the","and","or","but","if","then","else","when","while",
    "to","of","in","on","at","by","for","with","about","against","between",
    "into","through","during","before","after","above","below","from","up","down",
    "out","over","under","again","further","once",
    "is","am","are","was","were","be","been","being",
    "have","has","had","do","does","did",
    "it","its","it's","this","that","these","those",
    "he","him","his","she","her","hers","they","them","their","theirs",
    "i","me","my","mine","we","us","our","ours","you","your","yours",
    "as","not","no","nor","so","too","very","can","could","should","would","will","just",
    "there","here","than","such"
}


def compute_idf_from_vg_text(
    vg_text_path: str,
    vocab: Optional[Set[str]] = None,
    smooth: bool = True
) -> Dict[str, float]:
    """
    Compute IDF scores using the SAME tokenizer as extract_word_contexts:
    regex: r'\\b[a-z]+\\b'
    
    Each line is treated as a document (same assumption as extract_word_contexts).
    """

    df = Counter()
    N = 0
    vocab_set = set(vocab) if vocab is not None else None

    with open(vg_text_path, "r", encoding="utf-8") as f:
        for line in f:
            tokens = re.findall(r'\b[a-z]+\b', line.lower())
            if not tokens:
                continue

            if vocab_set is not None:
                doc_terms = set(t for t in tokens if t in vocab_set)
            else:
                doc_terms = set(tokens)

            if not doc_terms:
                continue

            N += 1
            for t in doc_terms:
                df[t] += 1

    if N == 0:
        return {}

    idf = {}
    if smooth:
        for t, dft in df.items():
            idf[t] = math.log((N + 1.0) / (dft + 1.0)) + 1.0
    else:
        for t, dft in df.items():
            idf[t] = math.log(N / float(dft))

    return idf

In [12]:
# =============================================================================
# STAGE 2: LOAD PRE-TRAINED MODEL & COMPUTE STATS
# =============================================================================
# Now we load the Skip-Gram model that was trained on the vocabulary we
# just built. We also compute the vital statistics of this embedding space.

print("\n[STAGE 2] Loading Model & Analyzing Embedding Space")
print("-" * 70)

# --- 1. Load Model & Embeddings ---
# This function loads the .pth file, initializes the SkipGramModel
# with the *exact* same architecture, and loads the trained weights.
# It then extracts the final (input) embedding matrix as a NumPy array.
model, embeddings = load_trained_model(
    model_path=model_path,
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    dropout=dropout
)

# --- 2. Create Mappings ---
# We need fast lookups between words and their corresponding index
# in the embedding matrix.
word_to_idx, idx_to_word = create_mappings(nodes)

# --- 3. Compute Embedding Space Statistics ---
# **Crucial Step for the GA:** We compute the statistics of the *existing*
# embedding space. Our fitness function will use these values to
# ensure that new, evolved vectors "fit in" with the original vectors.
#
# - 'mean_norm': The average length (L2 norm) of all existing vectors.
# - 'std_norm': The standard deviation of the vector lengths.
# - 'global_std': The standard deviation of *all* embedding values.
#                 (Used to scale the GA's mutation strength).
embedding_stats = compute_embedding_stats(embeddings)

print(f"\n{'-'*70}")
print(f"‚úì STAGE 2 Complete: Model loaded and stats computed.")
print(f"  Embedding matrix shape: {embeddings.shape}")
print(f"  Key Stats for Fitness Function:")
print(f"    ‚îú‚îÄ Mean Vector Norm: {embedding_stats['mean_norm']:.4f}")
print(f"    ‚îú‚îÄ Std. Dev of Norms: {embedding_stats['std_norm']:.4f}")
print(f"    ‚îî‚îÄ Global Std. Dev: {embedding_stats['global_std']:.4f} (for mutation)")


[STAGE 2] Loading Model & Analyzing Embedding Space
----------------------------------------------------------------------
‚úì Loaded model: 455 embeddings, dim=96

----------------------------------------------------------------------
‚úì STAGE 2 Complete: Model loaded and stats computed.
  Embedding matrix shape: (455, 96)
  Key Stats for Fitness Function:
    ‚îú‚îÄ Mean Vector Norm: 1.7074
    ‚îú‚îÄ Std. Dev of Norms: 0.3664
    ‚îî‚îÄ Global Std. Dev: 0.1782 (for mutation)


In [13]:
#CLEAN ANCHORS

import numpy as np
from typing import Dict, List, Tuple, Optional

def _l2_normalize(mat: np.ndarray, eps: float = 1e-10) -> np.ndarray:
    return mat / (np.linalg.norm(mat, axis=1, keepdims=True) + eps)

def clean_anchors_for_word(
    word: str,
    anchors: Dict[str, List[str]],
    embeddings: np.ndarray,
    word_to_idx: Dict[str, int],
    lambda_mad: float = 1.5,
    k_min: int = 2,
    verbose: bool = False
) -> Tuple[List[str], Dict[str, float]]:
    """
    Returns:
      cleaned_anchors: list of anchors kept for `word`
      scores: coherence score per anchor (median cosine to other anchors)
    """
    cand = anchors.get(word, [])
    cand = [a for a in cand if a in word_to_idx]
    if len(cand) <= 2:
        # nothing meaningful to clean
        scores = {a: 1.0 for a in cand}
        return cand, scores

    # Build normalized anchor matrix
    A = np.array([embeddings[word_to_idx[a]] for a in cand], dtype=np.float32)
    A = _l2_normalize(A)

    # Pairwise cosine matrix
    S = A @ A.T  # (m, m)
    m = S.shape[0]

    # coherence score per anchor = median cosine to others
    scores_list = []
    for i in range(m):
        others = np.concatenate([S[i, :i], S[i, i+1:]])
        scores_list.append(float(np.median(others)))

    scores = {cand[i]: scores_list[i] for i in range(m)}

    # Robust threshold using MAD
    med = float(np.median(scores_list))
    mad = float(np.median(np.abs(np.array(scores_list) - med)))
    mad = max(mad, 0.01) 
    thr = med - lambda_mad * mad

    kept = [a for a, sc in scores.items() if sc >= thr]

    # Fallback: keep top-k_min by score if too few
    if len(kept) < k_min:
        kept = sorted(scores.keys(), key=lambda a: scores[a], reverse=True)[:k_min]

    if verbose:
        removed = [a for a in cand if a not in kept]
        print(f"[anchor-clean] {word}: kept={kept}, removed={removed}")
        print(f"  scores={ {a: round(scores[a], 3) for a in cand} }")
        print(f"  med={med:.3f}, mad={mad:.3f}, thr={thr:.3f}")

    return kept, scores


In [14]:
# =============================================================================
# STAGE 3: SANITY CHECKS (VERIFICATION)
# =============================================================================
# Before we start the complex GA process, let's verify that the model
# we loaded is sane and produces reasonable results.
#
# This function will:
# 1. Check model properties (e.g., it's in eval mode).
# 2. Check embedding matrix stats (e.g., no NaNs/Infs).
# 3. Print nearest neighbors for a few common words ('man', 'dog', etc.)
#    using the `find_similar_words` function from Lab 6.
#
# If the neighbors look sensible, our embedding space is good!

run_sanity_checks(model, embeddings, nodes, word_to_idx)


SANITY CHECKS

1. Model Configuration:
   Training mode: False
   Device: cpu

2. Embedding Quality:
   Shape: (455, 96)
   Mean: -0.001574, Std: 0.178225
   Min: -0.877702, Max: 0.860647
   Contains NaN: False, Contains Inf: False

3. Embedding Norms:
   Mean: 1.7074, Std: 0.3664
   Range: [1.2575, 3.1674]

4. Vocabulary Test:
   'man       ' ‚Üí idx=   9, norm=1.4181
      Similar: woman(0.955), person(0.943), guy(0.923), lady(0.910), boy(0.889)
   'woman     ' ‚Üí idx=  19, norm=1.4987
      Similar: man(0.955), lady(0.950), guy(0.947), person(0.936), girl(0.934)
   'dog       ' ‚Üí idx=  64, norm=1.5306
      Similar: bear(0.936), cat(0.918), animal(0.913), cow(0.894), horse(0.886)
   'car       ' ‚Üí idx=  45, norm=1.5639
      Similar: vehicle(0.967), van(0.954), truck(0.938), bus(0.931), train(0.911)
   'blue      ' ‚Üí idx=  11, norm=1.3646
      Similar: red(0.934), white(0.931), purple(0.907), yellow(0.906), colorful(0.906)

‚úì SANITY CHECKS COMPLETE


In [15]:
# =============================================================================
# STAGE 4: DEFINE TARGET WORDS
# =============================================================================
# Our goal is to insert new words. But *which* words?
#
# This notebook is set up for two potential experiments:
# 1. Insert a small, manually-defined list ('wolf', 'tiger', etc.)
# 2. Insert all the words from CIFAR-100 that are missing from our vocab.
#
# For this lab, we will do a mix: we'll load the CIFAR-100 list to
# perform an analysis, but our *actual* target list for the GA
# will be the small, manual list we defined in the config cell.
# This lets us test the algorithm quickly on a few interesting words.

print("\n[STAGE 4] Vocabulary Analysis & Target Selection")
print("-" * 70)

# --- 1. Analysis: Compare against CIFAR-100 ---
# Let's see how many CIFAR-100 class names are missing from our
# Visual Genome vocabulary.
cifar_vocab = get_cifar100_vocabulary()

# This function prints a detailed analysis and returns the list of
# words that are in CIFAR-100 but NOT in our 'nodes' (model vocab).
missing_cifar_words = analyze_vocabulary_overlap(cifar_vocab, nodes)


[STAGE 4] Vocabulary Analysis & Target Selection
----------------------------------------------------------------------

Loading CIFAR-100 vocabulary...
‚úì CIFAR-100 vocabulary loaded: 100 classes

VOCABULARY OVERLAP ANALYSIS
CIFAR-100 vocabulary: 100 classes
Network vocabulary: 455 words
Overlapping words: 32 (32.0%)
Missing from network: 68

Found: apple, baby, bear, bed, bicycle, bottle, bowl, boy, bridge, bus, can, chair, clock, cloud, couch, cup, elephant, girl, house, keyboard, lamp, man, motorcycle, mountain, mouse, orange, plate, road, table, tank, train, woman

Missing: aquarium_fish, beaver, bee, beetle, butterfly, camel, castle, caterpillar, cattle, chimpanzee, cockroach, crab, crocodile, dinosaur, dolphin, flatfish, forest, fox, hamster, kangaroo, lawn_mower, leopard, lion, lizard, lobster, maple_tree, mushroom, oak_tree, orchid, otter, palm_tree, pear, pickup_truck, pine_tree, plain, poppy, porcupine, possum, rabbit, raccoon, ray, rocket, rose, sea, seal, shark, shrew, s

In [12]:
## ŒòŒïŒõŒ© ŒùŒë ŒîŒ© Œ†ŒüŒôŒïŒ£ ŒõŒïŒûŒïŒôŒ£ ŒõŒïŒôŒ†ŒüŒ•ŒùŒï ŒöŒëŒô Œ†ŒüŒ£ŒïŒ£ ŒïŒôŒùŒëŒô.
print(missing_cifar_words)
print(len(missing_cifar_words))


['aquarium_fish', 'beaver', 'bee', 'beetle', 'butterfly', 'camel', 'castle', 'caterpillar', 'cattle', 'chimpanzee', 'cockroach', 'crab', 'crocodile', 'dinosaur', 'dolphin', 'flatfish', 'forest', 'fox', 'hamster', 'kangaroo', 'lawn_mower', 'leopard', 'lion', 'lizard', 'lobster', 'maple_tree', 'mushroom', 'oak_tree', 'orchid', 'otter', 'palm_tree', 'pear', 'pickup_truck', 'pine_tree', 'plain', 'poppy', 'porcupine', 'possum', 'rabbit', 'raccoon', 'ray', 'rocket', 'rose', 'sea', 'seal', 'shark', 'shrew', 'skunk', 'skyscraper', 'snail', 'snake', 'spider', 'squirrel', 'streetcar', 'sunflower', 'sweet_pepper', 'telephone', 'television', 'tiger', 'tractor', 'trout', 'tulip', 'turtle', 'wardrobe', 'whale', 'willow_tree', 'wolf', 'worm']
68


In [13]:
# =============================================================================
# STAGE 5: EXTRACT CORPUS CONTEXTS
# =============================================================================
# This is a critical data-gathering step for our Genetic Algorithm.
#
# To evaluate the 'corpus' part of our fitness function, we need to
# know which words our new target words co-occur with in the text.
#
# This function will read the entire 'vg_text.txt' corpus and find
# all co-occurrences for our 'target_words' within the specified window.
#
# It returns a dictionary, e.g.:
# contexts = {
#     'wolf': Counter({'animal': 12, 'forest': 8, ...}),
#     'tiger': Counter({'stripes': 20, 'cat': 15, ...})
# }
#
# This Counter will be used to select the positive samples (ctx_vecs)
# in the fitness function.

# We need the 'set(nodes)' to quickly check if a context word is
# actually in our vocabulary. We ignore any context words that aren't.
vocab_set = set(nodes)

idf = compute_idf_from_vg_text(
    vg_text_path=text_file,
    vocab=vocab_set
)

  # Œ±œÅœáŒπŒ∫ŒÆ conservative œÑŒπŒºŒÆ

print(idf["man"])


1.7524530945187917


In [14]:
idf_threshold = 1.65

IDF_EXEMPT = {
    # 0 contexts
    "flatfish",
    "possum",
    "shrew",

    # very low contexts (<=10)
    "aquarium_fish",
    "skunk",
    "sweet_pepper",

    # low contexts (<=30)
    "cockroach",
    "porcupine",
    "trout",

    # borderline / phrase-based / sparse
    "chimpanzee",
    "maple_tree",
    "lawn_mower",
    "caterpillar",
    "snail",
    "worm"
}

contexts = extract_word_contexts(
    text_file=text_file,
    target_words=missing_cifar_words,
    vocab_set=vocab_set,
    window=context_window,
    stopwords=DEFAULT_STOPWORDS,
    idf = idf,
    idf_threshold = 1.8,
    idf_exempt_targets = IDF_EXEMPT
)

print(f"\n{'-'*70}")
print("‚úì STAGE 5 Complete: Contexts extracted.")

 Complete 

Context statistics: 
    aquarium_fish  :      7 contexts,   7 unique words
    beaver         :    164 contexts,  69 unique words
    bee            :    226 contexts,  92 unique words
    beetle         :    219 contexts,  85 unique words
    butterfly      :   2179 contexts, 284 unique words
    camel          :    634 contexts, 175 unique words
    castle         :   2470 contexts, 271 unique words
    caterpillar    :    127 contexts,  59 unique words
    cattle         :   2498 contexts, 248 unique words
    chimpanzee     :     57 contexts,  23 unique words
    cockroach      :     27 contexts,  15 unique words
    crab           :    552 contexts, 160 unique words
    crocodile      :    113 contexts,  62 unique words
    dinosaur       :    747 contexts, 188 unique words
    dolphin        :    415 contexts, 136 unique words
    flatfish       :      0 contexts,   0 unique words
    forest         :  13789 contexts, 377 unique words
    fox            :    436 cont

In [None]:
anchors = {
    # =====================
    # context < 100 ‚Üí 8
    # =====================

    "aquarium_fish": [
        "water", "ocean", "waves", "beach", "boat", "sand"
    ],

    "chimpanzee": [
        "animal", "face", "eyes", "mouth", "hands", "arm", "body", "hair"
    ],

    "cockroach": [
        "bug", "legs", "floor", "trash", "ground", "dirt"
    ],

    "maple_tree": [
        "trees", "branches", "leaf", "leaves", "trunks", "grass", "ground", "growing"
    ],

    "lawn_mower": [
        "grass", "ground", "wheels", "engine", "metal", "pushes", "dirt", "parked"
    ],

    "porcupine": [
        "animal", "furry", "tail", "legs", "bushes", "trees"
    ],

    "raccoon": [
        "animal", "furry", "tail", "ears", "eyes", "face", "ground", "trees"
    ],

    "skunk": [
        "animal", "furry", "tail", "legs", "ground", "bushes", "trees", "dirt"
    ],

    "sweet_pepper": [
        "broccoli", "food", "plate", "bowl", "knife", "fork", "sandwich", "cheese"
    ],

    "trout": [
        "boat", "water", "ocean", "waves", "beach", "sand", "plate", "knife"
    ],

    # =====================
    # context = 0 ‚Üí 10
    # =====================

    "flatfish": [
        "ocean", "water", "sand", "bottom", "under",
        "rocks", "covered", "shadow", "deep", "ground"
    ],

    "possum": [
        "animal", "furry", "tail", "legs", "ears",
        "nose", "ground", "bushes", "trees", "branches"
    ],

    "shrew": [
        "animal", "furry", "nose", "ears", "legs",
        "ground", "grass", "dirt", "bushes", "branches"
    ],

    # =====================
    # context > 100 ‚Üí 5
    # =====================

    "beaver": [
        "animal", "tail", "water", "trees", "branches"
    ],

    "bee": [
        "bug", "flying", "flowers"
    ],

    "beetle": [
        "bug", "trees", "branches"
    ],

    "butterfly": [
        "wing", "flying", "flowers", "colored", "plant"
    ],

    "camel": [
        "animal", "horse", "sand", "ground", "tail"
    ],

    "castle": [
        "building", "tower", "windows", "wall", "bridge"
    ],

    "caterpillar": [
        "bug", "leaf", "leaves", "branches", "plant"
    ],

    "cattle": [
        "cow", "horses", "grass", "field", "animal"
    ],

    "crab": [
        "legs", "rocks", "sand", "beach", "water"
    ],

    "crocodile": [
        "animal", "tail", "legs", "mouth", "water"
    ],

    "dinosaur": [
        "animal", "tail", "legs", "head", "body"
    ],

    "dolphin": [
        "swimming", "waves", "water", "ocean", "surfboard"
    ],

    "forest": [
        "trees", "branches", "leaves", "grass", "dirt"
    ],

    "fox": [
        "animal", "dog", "tail", "furry", "ears"
    ],

    "hamster": [
        "animal", "furry", "ears", "legs", "ground"
    ],

    "kangaroo": [
        "animal", "legs", "tail", "ground", "standing"
    ],

    "leopard": [
        "cat", "spot", "furry", "tail", "legs"
    ],

    "lion": [
        "cat", "furry", "tail", "face", "legs"
    ],

    "lizard": [
        "animal", "tail", "legs", "head", "ground"
    ],

    "lobster": [
        "plate", "knife", "food", "restaurant", "ocean"
    ],

    "mushroom": [
        "ground", "dirt", "wet", "grass", "plant"
    ],

    "oak_tree": [
        "trees", "branches", "trunks", "leaves", "grass"
    ],

    "orchid": [
        "flowers", "colored", "pink", "green", "vase"
    ],

    "otter": [
        "animal", "water", "ocean", "furry", "tail"
    ],

    "palm_tree": [
        "beach", "sand", "ocean", "sky", "trees"
    ],

    "pear": [
        "fruit", "apple", "bananas", "bowl", "basket"
    ],

    "pickup_truck": [
        "truck", "wheels", "engine", "license", "parked"
    ],

    "pine_tree": [
        "trees", "branches", "trunks", "grass", "mountain"
    ],

    "plain": [
        "grass", "field", "ground", "sky", "clouds"
    ],

    "poppy": [
        "flowers", "colored", "grass", "field", "green"
    ],

    "rabbit": [
        "animal", "ears", "tail", "legs", "furry"
    ],

    "ray": [
        "wing", "sand", "bottom", "shadow", "ocean"
    ],

    "rocket": [
        "sky", "engine", "metal", "air","airplane", "jet"
    ],

    "rose": [
        "flowers", "pink", "vase", "green", "plant"
    ],

    "sea": [
        "ocean", "waves", "sky", "beach", "boat"
    ],

    "seal": [
        "beach", "sand", "wet", "furry", "rocks"
    ],

    "shark": [
        "teeth", "mouth", "tail", "under", "ocean"
    ],

    "skyscraper": [
        "building", "tower", "windows", "street", "lights"
    ],

    "snail": [
        "ground", "grass", "leaf", "wet", "bug"
    ],

    "snake": [
        "animal", "body", "head", "tail", "ground"
    ],

    "spider": [
        "bug", "legs", "ground", "branches"
    ],

    "squirrel": [
        "animal", "tail", "furry", "trees", "branches"
    ],

    "streetcar": [
        "train", "bus", "street", "track", "wheels"
    ],

    "sunflower": [
        "flowers", "yellow", "green", "field", "sky"
    ],

    "telephone": [
        "phone", "cell", "computer", "monitor", "keyboard"
    ],

    "television": [
        "tv", "remote", "monitor", "computer", "desk"
    ],

    "tiger": [
        "cat", "striped", "furry", "tail", "legs"
    ],

    "tractor": [
        "vehicle", "wheels", "engine", "dirt", "field"
    ],

    "tulip": [
        "flowers", "colored", "vase", "plant", "green"
    ],

    "turtle": [
        "sand", "beach", "water", "legs", "head"
    ],

    "wardrobe": [
        "shirt", "pants", "jacket", "shoes", "suitcase"
    ],

    "whale": [
        "sea", "waves", "boat", "sky", "ocean"
    ],

    "willow_tree": [
        "trees", "branches", "leaves", "water", "grass"
    ],

    "wolf": [
        "animal", "dog", "tail", "legs", "furry"
    ],

    "worm": [
        "ground", "dirt", "grass", "plant", "bug"
    ],
}


# =====================
# PATCH ONLY: problematic anchors
# =====================

anchors.update({

    # ---- aquatic cluster is too "beach/sea" dominated ----
    # aquarium_fish was glued to sea/whale/palm_tree
    # Add "fish" + container-ish tokens that exist in your vocab (container shows up in report),
    # and REMOVE some beach-only pull.
    "aquarium_fish": [
        "fish", "water", "ocean", "waves", "container", "bowl", "glass", "tank"
    ],

    # palm_tree was essentially "sea/beach" so it becomes sea-like.
    # Remove sea/ocean/beach/sand pull and make it TREE-like.
    "palm_tree": [
        "trees", "tree", "trunks", "branches", "leaves"
    ],

    # whale/sea glue is expected, but palm_tree being pulled there is not.
    # (No need to change sea/whale unless you want to reduce their similarity too.)

    # ---- animals collapsing into one blob ----
    # beaver was glued to squirrel/forest. Make it more "wood/wet" and less generic forest.
    "beaver": [
        "animal", "tail", "water", "wet", "wooden"
    ],

    # camel was ending up near reptiles/rodents. Push it back toward large mammals/hoofed.
    "camel": [
        "animal", "horse", "cow", "zebra", "giraffes"
    ],

    # shark was drifting toward snake/crocodile due to body-part anchors.
    # Keep one body cue + make it clearly aquatic.
    "shark": [
        "ocean", "water", "waves", "teeth", "tail"
    ],

    # ---- insects being pulled into plants/air/design ----
    # bee was near "air/design/lines". Anchor it to flowers cluster deliberately.
    "bee": [
        "bug", "wing", "flowers", "sunflower", "poppy"
    ],

    # beetle was anchored to trees/branches so it became a tree.
    # Pull it to "bug/ground/legs" instead (closer to cockroach/spider/worm side).
    "beetle": [
        "bug", "legs", "ground", "dirt", "spider"
    ],
})

# keep your filter
for w in anchors:
    anchors[w] = [a for a in anchors[w] if a in word_to_idx]




In [22]:

print(anchors)
missing = {}
for w, lst in anchors.items():
    miss = [a for a in lst if a not in word_to_idx]
    if miss:
        missing[w] = miss

print(missing)

print(contexts)


for w, arr in anchors.items():
    if w in {"aquarium_fish","palm_tree","beaver","camel","shark","bee","beetle"}:
        print(w, len(arr), arr)


{'aquarium_fish': ['water', 'ocean', 'waves', 'container', 'bowl', 'glass', 'tank'], 'chimpanzee': ['animal', 'face', 'eyes', 'mouth', 'hands', 'arm', 'body', 'hair'], 'cockroach': ['legs', 'floor', 'trash', 'ground', 'dirt'], 'maple_tree': ['trees', 'branches', 'leaf', 'leaves', 'grass', 'ground', 'growing'], 'lawn_mower': ['grass', 'ground', 'wheels', 'engine', 'metal', 'dirt', 'parked'], 'porcupine': ['animal', 'tail', 'legs', 'bushes', 'trees'], 'raccoon': ['animal', 'tail', 'ears', 'eyes', 'face', 'ground', 'trees'], 'skunk': ['animal', 'tail', 'legs', 'ground', 'bushes', 'trees', 'dirt'], 'sweet_pepper': ['broccoli', 'food', 'plate', 'bowl', 'knife', 'fork', 'sandwich', 'cheese'], 'trout': ['boat', 'water', 'ocean', 'waves', 'beach', 'sand', 'plate', 'knife'], 'flatfish': ['ocean', 'water', 'sand', 'bottom', 'under', 'rocks', 'covered', 'shadow', 'ground'], 'possum': ['animal', 'tail', 'legs', 'ears', 'nose', 'ground', 'bushes', 'trees', 'branches'], 'shrew': ['animal', 'nose', '

# STAGE 6: THE (1+$\lambda$) EVOLUTION STRATEGY IN DETAIL

Welcome to the core of our solution: the **(1+$\lambda$) Evolution Strategy (ES)**. This powerful, yet conceptually simple, Genetic Algorithm will "evolve" a new embedding vector for each of our target words. Think of it as a guided search through the high-dimensional embedding space, aiming to find the perfect location for a new word.

---

### üß¨ Evolutionary Algorithms: A Quick Primer

Evolutionary Algorithms (EAs) are a family of optimization techniques inspired by natural evolution. They operate on a *population* of candidate solutions, which iteratively "evolve" over *generations* through processes like:

* **Initialization:** Creating the first set of candidate solutions.
* **Selection:** Choosing which solutions get to "reproduce" or survive.
* **Mutation:** Introducing random changes to create new solutions.
* **Crossover (Recombination):** Combining parts of parent solutions (not used in (1+$\lambda$) ES).
* **Fitness Evaluation:** Assigning a "quality score" to each solution.

### The (1+$\lambda$) ES: Parent-Centric Evolution

The (1+$\lambda$) ES is a specific type of EA, where:

* **1:** Represents the single **parent** solution. This is the best solution found so far.
* **$\lambda$:** Represents the number of **offspring** generated from the parent in each generation.

The process is straightforward:

1.  **Start with a Parent:** Begin with an initial candidate solution (our embedding vector).
2.  **Generate Offspring:** Create $\lambda$ new candidate solutions by adding random "noise" (mutation) to the parent.
3.  **Evaluate All:** Calculate the *fitness* of the parent and all $\lambda$ offspring.
4.  **Select New Parent:** The best solution (parent or any of the offspring) becomes the parent for the next generation. This ensures that the population always improves or stays the same.

---

### üéØ Our Application: Evolving Word Embeddings

For our task, each "solution" is a `D`-dimensional embedding vector for a target word.

#### 1. Initialization: Bootstrapping from Contexts

How do we start a new word's journey in the embedding space? We give it a head start! The `initialize_embedding` function performs a **corpus bootstrap**:

* It finds the most frequent context words for our target word (e.g., for 'wolf', its contexts might be 'animal', 'forest', 'howl').
* It then calculates a **weighted average** of the *existing* embedding vectors of these context words.

This creates an initial vector that is already close to where it "should" be, based on its linguistic usage. If no contexts are found, it initializes to the mean of all existing embeddings.

In [23]:
# =============================================================================
# STAGE 6: RUN THE GENETIC ALGORITHM
# =============================================================================

print("\n[STAGE 6] Evolving New Embeddings via (1+Œª) ES")
print("-" * 70)
print(f"Running (1 + {ga_pop_size}) Evolution Strategy for {len(missing_cifar_words)} words...")
print(f"Generations per word: {ga_generations}")

inserted_embeddings_list = []
ga_config = {
    "ga_pop_size": ga_pop_size,
    "ga_generations": ga_generations,
    "ga_mutation_factor": ga_mutation_factor,
    "fitness_weights": fitness_weights
}

# --- Main Evolution Loop ---
for word in missing_cifar_words:
    # Evolve the embedding for this single word
    evolved_vec = evolve_embedding(
        word,
        contexts,
        embeddings,       # The original embedding matrix
        word_to_idx,
        nodes,            # The original vocab list (for neg. sampling)
        embedding_stats,  # Stats for the 'norm' fitness term
        anchors,          # Dict of anchors for the 'anchor' fitness term
        ga_config,         # All GA hyperparameters
        idf = idf,
        idf_threshold = 1.8,
        idf_power = 1.0,
        num_negatives = 15

    )
    inserted_embeddings_list.append(evolved_vec)

# --- Collect Results ---
# We now have a list of new vectors. We stack them into a NumPy
# matrix for easy analysis.
inserted_embeddings = np.array(inserted_embeddings_list)

print(f"\n{'-'*70}")
print(f"‚úì STAGE 6 Complete: Evolved {len(missing_cifar_words)} new embeddings.")


[STAGE 6] Evolving New Embeddings via (1+Œª) ES
----------------------------------------------------------------------
Running (1 + 100) Evolution Strategy for 68 words...
Generations per word: 300

  Evolving: 'aquarium_fish' G0=0.5988 G50=0.6083 G100=0.6086 G150=0.6086 G200=0.6086 G250=0.6086 ‚úì Final=0.6087

  Evolving: 'beaver' G0=0.6151 G50=0.6181 G100=0.6183 G150=0.6184 G200=0.6184 G250=0.6184 ‚úì Final=0.6184

  Evolving: 'bee' G0=0.6164 G50=0.6208 G100=0.6210 G150=0.6211 G200=0.6211 G250=0.6211 ‚úì Final=0.6211

  Evolving: 'beetle' G0=0.6246 G50=0.6342 G100=0.6346 G150=0.6347 G200=0.6347 G250=0.6347 ‚úì Final=0.6347

  Evolving: 'butterfly' G0=0.6128 G50=0.6152 G100=0.6152 G150=0.6153 G200=0.6153 G250=0.6153 ‚úì Final=0.6154

  Evolving: 'camel' G0=0.6287 G50=0.6404 G100=0.6408 G150=0.6408 G200=0.6408 G250=0.6408 ‚úì Final=0.6409

  Evolving: 'castle' G0=0.6325 G50=0.6348 G100=0.6350 G150=0.6350 G200=0.6350 G250=0.6351 ‚úì Final=0.6351

  Evolving: 'caterpillar' G0=0.6313 G5

In [24]:
# =============================================================================
# STAGE 7: MERGE EMBEDDINGS
# =============================================================================
# We now have two sets of embeddings:
# 1. `embeddings`: The original (N x D) matrix for the original `nodes`.
# 2. `inserted_embeddings`: Our new (K x D) matrix for the `target_words`.
#
# To analyze and visualize them together, we simply stack them into
# one large matrix and one large vocabulary list.

print("\n[STAGE 7] Combining Original and Evolved Embeddings")
print("-" * 70)

# `np.vstack` stacks the two matrices vertically, creating a
# new matrix of shape (N + K, D).
all_embeddings = np.vstack([embeddings, inserted_embeddings])

# We concatenate the vocabulary lists in the *same order*.
# - Indices 0 to N-1 correspond to the original `nodes`.
# - Indices N to N+K-1 correspond to our new `target_words`.
all_vocab = nodes + missing_cifar_words

print(f"‚úì Combined embedding matrix shape: {all_embeddings.shape}")
print(f"‚úì Combined vocabulary size: {len(all_vocab)} words")
print(f"  ‚îú‚îÄ Original words: {len(nodes)}")
print(f"  ‚îî‚îÄ Newly inserted words: {len(missing_cifar_words)}")


[STAGE 7] Combining Original and Evolved Embeddings
----------------------------------------------------------------------
‚úì Combined embedding matrix shape: (523, 96)
‚úì Combined vocabulary size: 523 words
  ‚îú‚îÄ Original words: 455
  ‚îî‚îÄ Newly inserted words: 68


In [25]:
# =============================================================================
# STAGE 8: ANALYSIS & VERIFICATION
# =============================================================================
# This is the moment of truth!
#
# We will use the `analyze_embeddings` function from Lab 6 to run a
# standard set of quality checks on our *newly combined* embedding space.
#
# We will specifically test our newly inserted words to see if they
# behave logically.
#
# 1. Similarity: We'll find the nearest neighbors for our new words.
#    (e.g., are 'dog' and 'animal' near 'wolf'?).
# 2. Analogies: We'll dynamically create analogies using our anchor list.
#    (e.g., "dog is to cat as wolf is to ???")
# 3. Clustering: We'll use our new words as seeds for clustering.

print("\n[STAGE 8] Analyzing Combined Embedding Space")
print("-" * 70)

# --- 1. Setup Similarity Test ---
# We'll just use the first few target words as examples.
similarity_examples = missing_cifar_words[:]
print(f"Running similarity checks for: {similarity_examples}")

# --- 2. Setup Analogy Test ---
# Let's auto-create analogy tasks based on our anchor list.
# e.g., if anchors['wolf'] = ['dog', 'animal', ...],
# we create the analogy ('dog', 'animal', 'wolf').
# The model should solve: "dog" - "animal" + "wolf" approx ???
analogy_examples = []
for word in missing_cifar_words:
    if word in anchors and len(anchors[word]) >= 2:
        # Create an analogy: (anchor1, anchor2, new_word)
        analogy_examples.append((
            anchors[word][0],  # e.g., 'dog'
            anchors[word][1],  # e.g., 'animal'
            word               # e.g., 'wolf'
        ))
    if len(analogy_examples) >= 3:
        break  # Limit to 3 analogies for brevity
print(f"Running analogy checks for: {analogy_examples}")

# --- 3. Setup Clustering Test ---
# We'll use our newly inserted words as the seeds for clustering.
# This helps us see which original words group around our new ones.
cluster_seeds = missing_cifar_words
print(f"Running clustering checks for: {cluster_seeds}")

# --- 4. Run Full Analysis ---
analyze_embeddings(
    nodes=all_vocab,  # <--- FIX: Renamed 'vocabulary' to 'nodes'
    embeddings=all_embeddings,
    similarity_examples=similarity_examples,
    analogy_examples=analogy_examples,
    cluster_seeds=cluster_seeds
)


[STAGE 8] Analyzing Combined Embedding Space
----------------------------------------------------------------------
Running similarity checks for: ['aquarium_fish', 'beaver', 'bee', 'beetle', 'butterfly', 'camel', 'castle', 'caterpillar', 'cattle', 'chimpanzee', 'cockroach', 'crab', 'crocodile', 'dinosaur', 'dolphin', 'flatfish', 'forest', 'fox', 'hamster', 'kangaroo', 'lawn_mower', 'leopard', 'lion', 'lizard', 'lobster', 'maple_tree', 'mushroom', 'oak_tree', 'orchid', 'otter', 'palm_tree', 'pear', 'pickup_truck', 'pine_tree', 'plain', 'poppy', 'porcupine', 'possum', 'rabbit', 'raccoon', 'ray', 'rocket', 'rose', 'sea', 'seal', 'shark', 'shrew', 'skunk', 'skyscraper', 'snail', 'snake', 'spider', 'squirrel', 'streetcar', 'sunflower', 'sweet_pepper', 'telephone', 'television', 'tiger', 'tractor', 'trout', 'tulip', 'turtle', 'wardrobe', 'whale', 'willow_tree', 'wolf', 'worm']
Running analogy checks for: [('water', 'ocean', 'aquarium_fish'), ('animal', 'tail', 'beaver'), ('wing', 'flowers'

In [30]:
# =============================================================================
# STAGE 9: SAVE & VERIFY FINAL MODEL
# =============================================================================
# Our experiment is complete! We have a new, larger embedding matrix
# and vocabulary list. Let's save these to a new .pth file so we
# can use them in other applications without re-running the GA.
#
# We will then immediately reload the file to verify it saved correctly.

print("\n[STAGE 9] Saving and Verifying Final Artifacts")
print("-" * 70)

# --- 1. Define Save Path ---
save_path = "best_skipgram_523words.pth"

print(all_vocab)

# --- 2. Combine and Save ---
# We are not saving the original PyTorch model, but rather the *results*
# of our process: the combined embedding matrix and the new vocab list.

# Combine the original 'embeddings' and our new 'inserted_embeddings'
full_embeddings = np.vstack([embeddings, inserted_embeddings])
full_embeddings_tensor = torch.tensor(full_embeddings, dtype=torch.float32)

#MEGETHOS
assert full_embeddings.shape[0] == len(all_vocab)

#2) Original vocab mapping
for i, w in enumerate(nodes[:10]):
    assert np.allclose(full_embeddings[i], embeddings[i])

#3) Inserted words mapping
N = len(nodes)
for i, w in enumerate(missing_cifar_words[:10]):
    assert np.allclose(full_embeddings[N + i], inserted_embeddings[i])

#4) Round-trip vocab check
vocab = {w:i for i,w in enumerate(all_vocab)}
for i, w in enumerate(all_vocab[:10]):
    assert vocab[w] == i

print(f"Saving combined model to {save_path}...")
torch.save({
    # The (N+K, D) embedding matrix
    'embeddings': full_embeddings_tensor,

    # The (N+K) vocabulary list
    'vocab': all_vocab,

    # Just the list of words we inserted
    'inserted_words': missing_cifar_words,

    # Metadata for quick loading
    'metadata': {
        'n_original': len(nodes),
        'n_inserted': len(missing_cifar_words),
        'vocab_size': len(all_vocab),
        'embedding_dim': inserted_embeddings.shape[1]
    }
}, save_path)

print(f"‚úì Model saved to: {save_path}")

# ============================================================================
# VERIFICATION: RELOAD AND TEST
# ============================================================================
# Let's confirm the file is valid by loading it and running the
# neighbor-finding test on both original *and* inserted words.

print("\nReloading model for verification...")

# --- 1. Reload Checkpoint ---
checkpoint = torch.load(save_path, map_location='cpu')
loaded_embeddings = checkpoint['embeddings'].numpy()
loaded_vocab = checkpoint['vocab']
loaded_inserted = checkpoint['inserted_words']

print(f"‚úì Model reloaded. Vocab: {len(loaded_vocab)}, Inserted: {len(loaded_inserted)}")

# --- 2. Run Vocabulary Test ---
# This test now includes our newly evolved words.
print("\n" + "="*70)
print("Final Vocabulary Test (Original + Inserted)")
print("="*70)

# We'll test a mix of original words and our new target words
test_words = [
    'man', 'woman', 'dog', 'car', 'blue',  # Original words
] + missing_cifar_words  # Our new words

for word in test_words:
    if word not in loaded_vocab:
        print(f"   '{word:10s}' ‚Üí Not in vocabulary.")
        continue

    # Find the word's index and vector
    idx = loaded_vocab.index(word)
    norm = np.linalg.norm(loaded_embeddings[idx])

    # Use lab6's 'find_similar_words' on our fully combined data
    similar = find_similar_words(word, loaded_vocab, loaded_embeddings, top_k=5)

    if not similar:
        print(f"   '{word:10s}' ‚Üí idx={idx:4d}, norm={norm:.4f} (No neighbors found?)")
        continue

    neighbor_str = ', '.join([f"{w}({s:.3f})" for w, s in similar])

    print(f"   '{word:10s}' ‚Üí idx={idx:4d}, norm={norm:.4f}")
    print(f"      Similar: {neighbor_str}\n")

print("\n" + "="*70)
print("‚úì Vocabulary Test Complete")
print("="*70)


[STAGE 9] Saving and Verifying Final Artifacts
----------------------------------------------------------------------
['a', 'the', 'on', 'of', 'is', 'in', 'white', 'black', 'and', 'man', 'with', 'blue', 'red', 'green', 'wearing', 'brown', 'building', 'are', 'person', 'woman', 'this', 'wall', 'sky', 'window', 'yellow', 'shirt', 'sign', 'water', 'table', 'to', 'has', 'tree', 'light', 'train', 'two', 'grass', 'an', 'side', 'large', 'small', 'street', 'front', 'ground', 'top', 'plate', 'car', 'part', 'orange', 'head', 'clouds', 'wooden', 'standing', 'bus', 'pole', 'sitting', 'metal', 'behind', 'holding', 'color', 'trees', 'silver', 'snow', 'gray', 'people', 'dog', 'hand', 'road', 'tennis', 'hair', 'grey', 'dark', 'glass', 'at', 'plane', 'back', 'floor', 'cat', 'background', 'fence', 'clock', 'door', 'giraffe', 'leaves', 'boy', 'left', 'field', 'right', 'next', 'long', 'by', 'bear', 'chair', 'elephant', 'tall', 'pink', "man's", 'girl', 'horse', 'pizza', 'for', 'baseball', 'zebra', 'pants',

In [6]:
# ============================================================================
# VERIFICATION: RELOAD AND TEST
# ============================================================================
# Let's confirm the file is valid by loading it and running the
# neighbor-finding test on both original *and* inserted words.

print("\nReloading model for verification...")

# --- 1. Reload Checkpoint ---
save_path = "best_skipgram_523words.pth"
checkpoint = torch.load(save_path, map_location='cpu')
loaded_embeddings = checkpoint['embeddings'].numpy()
loaded_vocab = checkpoint['vocab']
loaded_inserted = checkpoint['inserted_words']

print(f"‚úì Model reloaded. Vocab: {len(loaded_vocab)}, Inserted: {len(loaded_inserted)}")

# --- 2. Run Vocabulary Test ---
# This test now includes our newly evolved words.
print("\n" + "="*70)
print("Final Vocabulary Test (Original + Inserted)")
print("="*70)

# We'll test a mix of original words and our new target words
test_words = [
    'man', 'woman', 'dog', 'car', 'blue',  # Original words
] + missing_cifar_words  # Our new words

for word in test_words:
    if word not in loaded_vocab:
        print(f"   '{word:10s}' ‚Üí Not in vocabulary.")
        continue

    # Find the word's index and vector
    idx = loaded_vocab.index(word)
    norm = np.linalg.norm(loaded_embeddings[idx])

    # Use lab6's 'find_similar_words' on our fully combined data
    similar = find_similar_words(word, loaded_vocab, loaded_embeddings, top_k=5)

    if not similar:
        print(f"   '{word:10s}' ‚Üí idx={idx:4d}, norm={norm:.4f} (No neighbors found?)")
        continue

    neighbor_str = ', '.join([f"{w}({s:.3f})" for w, s in similar])

    print(f"   '{word:10s}' ‚Üí idx={idx:4d}, norm={norm:.4f}")
    print(f"      Similar: {neighbor_str}\n")

print("\n" + "="*70)
print("‚úì Vocabulary Test Complete")
print("="*70)


Reloading model for verification...
‚úì Model reloaded. Vocab: 523, Inserted: 68

Final Vocabulary Test (Original + Inserted)


NameError: name 'missing_cifar_words' is not defined

In [16]:
# =============================================================================
# STAGE 10: VISUALIZATION & FINAL SUMMARY
# =================================_===========================================
#
# Our final step is to visualize the combined embedding space using t-SNE.
# This special visualization function will highlight our newly inserted
# words in red, allowing us to *visually inspect* whether they landed in
# semantically appropriate regions of the space.
#
# e.g., Did 'wolf' and 'tiger' land near other 'animal' words?
#       Did 'rocket' land near 'plane'?
#       Did 'castle' land near 'building' or 'house'?

print("\n[STAGE 10] Visualizing Combined Embedding Space")
print("-" * 70)

visualize_with_inserted_words(
    nodes=all_vocab,
    embeddings=all_embeddings,
    inserted_words=missing_cifar_words,
    output_file="embeddings_with_inserted.png",
    sample_size=500  # Sample 500 words, including all our inserted ones
)

# --- 11. Final Summary ---
# Finally, let's print a summary of the entire run.

print("\n" + "="*70)
print("‚úì PIPELINE COMPLETE")
print("="*70)

print(f"Inserted words: {', '.join(missing_cifar_words)}")
print(f"New vocabulary size: {len(all_vocab)} words")

print("\nüí° Key Insights:")
print(f"  ‚Ä¢ (1+{ga_pop_size}) ES successfully evolved {len(missing_cifar_words)} embeddings.")
print("  ‚Ä¢ The Lab 6 analysis report (Stage 8) quantitatively checks quality.")
print("  ‚Ä¢ The t-SNE plot above (Stage 10) visually confirms semantic coherence.")
print("="*70)


[STAGE 10] Visualizing Combined Embedding Space
----------------------------------------------------------------------


NameError: name 'all_vocab' is not defined

In [3]:
from src.cw2 import build_my_embeddings
from src.lab6_current_best import find_similar_words


import os
import random
import numpy as np
import torch

import math
import re
from collections import Counter
from typing import Dict, Set, Optional


print("\n[TEST] Loading embeddings...")
vocab, E = build_my_embeddings("best_skipgram_523words.pth")

nodes = list(vocab.keys()) 

print(f"‚úì Loaded vocab size: {len(vocab)}")
print(f"‚úì Embedding matrix shape: {E.shape}")

# ------------------------------------------------------------
# CIFAR-100 sanity check (subset)
# ------------------------------------------------------------
cifar_probe = [
        "airplane", "apple", "bear", "camel",
        "clock", "keyboard", "train", "tractor"
]

print("\n[TEST] CIFAR words presence + norms")
for w in cifar_probe:
    if w not in vocab:
        raise ValueError(f"CIFAR word missing: {w}")
    idx = vocab[w]
    print(f"  {w:12s} idx={idx:4d} norm={np.linalg.norm(E[idx]):.4f}")

# ------------------------------------------------------------
# Random round-trip check
# ------------------------------------------------------------
print("\n[TEST] Random round-trip index check")
for w in list(vocab.keys())[:10]:
    idx = vocab[w]
    assert np.allclose(E[idx], E[vocab[w]])
print("‚úì Round-trip mapping OK")

# ------------------------------------------------------------
# Cosine similarity sanity (no NaNs)
# ------------------------------------------------------------
sims = np.dot(E[:50], E[:50].T)
if np.isnan(sims).any():
    raise ValueError("NaNs detected in similarity matrix")

print("\n‚úì ALL CHECKS PASSED ‚Äî MODEL IS SUBMISSION-READY ‚úÖ")



# Sample 100 random words from vocabulary
random.seed(42)  # reproducibility
test_words = random.sample(nodes, min(100, len(nodes)))


N = 455  # number of original Visual Genome words
inserted_words = nodes[N:]

print("=" * 80)
print("NEAREST NEIGHBORS (100 RANDOM WORDS)")
print("=" * 80)

for word in nodes:
    print(f"\nMost similar to '{word}':")
    neighbors = find_similar_words(word, nodes, E, top_k=8)

    if not neighbors:
        print("  (No neighbors found)")
        continue

    for nb, sim in neighbors:
        print(f"  {nb:<18s} similarity={sim:.4f}")




[TEST] Loading embeddings...
‚úì Loaded vocab size: 523
‚úì Embedding matrix shape: (523, 96)

[TEST] CIFAR words presence + norms
  airplane     idx= 183 norm=1.8580
  apple        idx= 373 norm=1.7192
  bear         idx=  90 norm=1.7016
  camel        idx= 460 norm=1.7074
  clock        idx=  79 norm=1.6388
  keyboard     idx= 262 norm=1.9388
  train        idx=  33 norm=1.6313
  tractor      idx= 514 norm=1.7100

[TEST] Random round-trip index check
‚úì Round-trip mapping OK

‚úì ALL CHECKS PASSED ‚Äî MODEL IS SUBMISSION-READY ‚úÖ
NEAREST NEIGHBORS (100 RANDOM WORDS)

Most similar to 'a':
  can                similarity=0.7962
  he                 similarity=0.7072
  her                similarity=0.6918
  has                similarity=0.6392
  to                 similarity=0.6270
  is                 similarity=0.6077
  these              similarity=0.6040
  by                 similarity=0.5970

Most similar to 'the':
  this               similarity=0.7267
  his                simi