# Lesson 2 — Embeddings & Similarity (Co-occurrence → SVD)
**Goal:** Build *your own* word embeddings from a mini corpus using co-occurrence counts + SVD, then explore nearest neighbors with cosine similarity.

**Why this matters**
- LLMs don't look words up in a dictionary. They represent each word as a vector (a list of numbers) that captures meaning based on the contexts the word appears in.
- Word embeddings are the bridge from raw tokens to math. Similar words should land near each other in vector space.
- By building embeddings from scratch you see how "meaning" emerges from simple counting.

**Vocabulary check**
- **Embedding:** A dense vector (e.g., length 16) that represents a word. Distance between vectors tells you how related two words are.
- **Co-occurrence window:** Slide a window over the text and count how often each pair of words appears within `window` positions of each other. Words that share many of the same neighbors end up with similar rows of counts, and therefore similar embeddings.
- **SVD (Singular Value Decomposition):** A matrix factorization that compresses a big count table into smaller, informative vectors, like picking out the main themes in your text.
- **Cosine similarity:** A measure of how closely two vectors point in the same direction. 1.0 = same direction, 0 = perpendicular (no shared direction), -1 = opposite.
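
To make the three reference values for cosine similarity concrete, here is a minimal sketch on hand-picked 2D vectors. The function name `cosine_similarity` and the example vectors are illustrative and not part of the lesson's code.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

same = cosine_similarity([1, 2], [2, 4])        # same direction, different length
orthogonal = cosine_similarity([1, 0], [0, 1])  # perpendicular vectors
opposite = cosine_similarity([1, 2], [-1, -2])  # pointing the opposite way
print(same, orthogonal, opposite)
```

Note that vector length does not matter: `[1, 2]` and `[2, 4]` count as pointing the same way.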

In [None]:

from pathlib import Path
import re, math
import numpy as np
import matplotlib.pyplot as plt

data_dir = Path("../data")
text = ""
for fname in ["space.txt","animals.txt","minecraft.txt"]:
    text += (data_dir / fname).read_text(encoding="utf-8") + "\n"

# basic cleanup
tokens = re.findall(r"[a-zA-Z']+", text.lower())
vocab = sorted(set(tokens))
word2idx = {w:i for i,w in enumerate(vocab)}
idx2word = {i:w for w,i in word2idx.items()}
len(vocab), tokens[:20]


## Build the co-occurrence matrix
1. Choose a **window size** (start with 2). For each center word, look `window` words to the left and right.
2. Count every pair (center, neighbor). This builds a big square matrix where entry `(i, j)` is "how often word *i* sees word *j*."
3. Normalize if you want (optional). For a beginner run, raw counts are fine.
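
Before running this on the full corpus, it can help to trace the counting on a sentence small enough to check by hand. The sketch below uses a hypothetical five-word sentence and a window of 1; the names `toy_tokens`, `toy_window`, and `pair_counts` are made up for this example so they do not clash with the notebook's variables.

```python
from collections import Counter

toy_tokens = "the dog saw the dog".split()
toy_window = 1  # look one word to the left and one to the right

pair_counts = Counter()
for i, center in enumerate(toy_tokens):
    lo = max(0, i - toy_window)
    hi = min(len(toy_tokens), i + toy_window + 1)
    for j in range(lo, hi):
        if j != i:  # a word is not its own neighbor
            pair_counts[(center, toy_tokens[j])] += 1

for pair, count in sorted(pair_counts.items()):
    print(pair, count)
```

Each (center, neighbor) pair is counted from both sides, so the resulting table is symmetric.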

**Things to notice**
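- Common words like "the" co-occur with almost everything, so their raw counts dominate the matrix. SVD on raw counts does not remove that effect by itself; the optional normalization in step 3 is one way to dampen it.
- Because the corpus includes Minecraft text, words like "pickaxe" and "diamond" should co-occur frequently, which strengthens their connection.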
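
The lesson leaves the normalization step open. One simple option, shown here purely as an assumption, is log scaling with `np.log1p`, which keeps zeros at zero and compresses very large counts. The count values below are invented for illustration.

```python
import numpy as np

# Hypothetical counts: row 0 is a very frequent word, rows 1 and 2 are rarer words.
toy_counts = np.array([[0, 100, 100],
                       [100, 0, 3],
                       [100, 3, 0]], dtype=np.float32)

# log1p(x) = log(1 + x): zeros stay zero, large counts shrink the most.
toy_scaled = np.log1p(toy_counts)

# How much bigger is the "frequent word" count than the "rare pair" count?
raw_ratio = toy_counts[1, 0] / toy_counts[1, 2]
scaled_ratio = toy_scaled[1, 0] / toy_scaled[1, 2]
print(raw_ratio, scaled_ratio)
```

In the notebook, the equivalent would be applying `np.log1p` to `cooc` before the SVD cell; whether it improves the neighbors on this corpus is something to test rather than assume.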

In [None]:

window = 2
V = len(vocab)
cooc = np.zeros((V,V), dtype=np.float32)

for i, w in enumerate(tokens):
    wi = word2idx[w]
    for j in range(max(0,i-window), min(len(tokens), i+window+1)):
        if j == i: continue
        wj = word2idx[tokens[j]]
        cooc[wi, wj] += 1.0

print("Co-occurrence built:", cooc.shape)


## Use SVD to get low-dimensional embeddings
1. Run SVD on the co-occurrence matrix and keep only the top components (e.g., the top 16).
2. Each row of the resulting matrix (the kept columns of `U`, scaled by their singular values) is the vector for one word.
3. You can plot the first two dimensions to see clusters on a 2D scatter plot.

**Intuition booster**
- Think of SVD as asking: "If I could describe each word using only a few secret themes, what would those themes be?"
- Those themes often end up being concepts like fantasy vs. science, animals vs. tools, etc., depending on your corpus.
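
The "secret themes" idea can be seen on a tiny hand-made table. In this sketch the rows are four words and the columns are four hypothetical context words (say "rocket", "orbit", "fur", "howl"); all counts are invented, and the `toy_` names are illustrative only.

```python
import numpy as np

toy_words = ["star", "moon", "dog", "wolf"]
# Rows: words. Columns: hypothetical contexts (rocket, orbit, fur, howl).
toy_counts = np.array([[5, 4, 0, 0],
                       [4, 5, 0, 0],
                       [0, 0, 3, 4],
                       [0, 0, 4, 3]], dtype=float)

toy_U, toy_S, toy_VT = np.linalg.svd(toy_counts, full_matrices=False)

k = 2  # keep the two strongest "themes"
toy_emb = toy_U[:, :k] * toy_S[:k]

# Share of the table's squared magnitude captured by the kept components.
kept = float((toy_S[:k] ** 2).sum() / (toy_S ** 2).sum())

print("singular values:", toy_S)
print("fraction kept:", kept)
for w, vec in zip(toy_words, toy_emb):
    print(w, vec)
```

With only two components, "star" and "moon" receive the same 2D vector, "dog" and "wolf" receive the same 2D vector, and the two groups are perpendicular to each other. Real corpora are far noisier, so expect looser clusters.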

In [None]:

# SVD
U, S, VT = np.linalg.svd(cooc + 1e-6, full_matrices=False)
dims = 16
emb = U[:, :dims] * S[:dims]

# 2D for plotting
emb2 = emb[:, :2]

# Pick some words to label (space/dogs/minecraft themed)
focus = [w for w in ["dog", "dogs", "wolf", "wolves", "creeper","village","portal",
                     "star","stars","rocket","ship","moon","falcon","cheetahs","pandas"]
         if w in word2idx]

plt.figure(figsize=(6,6))
plt.scatter(emb2[:,0], emb2[:,1], alpha=0.1)
for w in focus:
    i = word2idx[w]
    plt.text(emb2[i,0], emb2[i,1], w)
plt.title("2D Embeddings (SVD of co-occurrence)")
Path("../images").mkdir(parents=True, exist_ok=True)  # make sure the output folder exists
plt.savefig("../images/embeddings_2d.png", bbox_inches="tight")
plt.show()


## Cosine similarity & nearest neighbors
- Write a helper that computes cosine similarity between any two word vectors.
- For a chosen query word, sort all other words by similarity and print the top `k` neighbors (the helper below defaults to `k=8`).
- Also try negative examples: which words are least similar? (They’ll have near-zero or negative cosine.)
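
As a rough sketch of looking at both ends of the ranking, the helper below normalizes every vector once and gets all similarities with a single matrix product. The function `rank_by_cosine` and the 2D toy vectors are hypothetical and stand in for the notebook's `emb`.

```python
import numpy as np

def rank_by_cosine(emb_matrix, words, query):
    """Return (similarity, word) pairs for every other word, most similar first."""
    # Scale each row to unit length so a dot product equals cosine similarity.
    unit = emb_matrix / (np.linalg.norm(emb_matrix, axis=1, keepdims=True) + 1e-9)
    qi = words.index(query)
    sims = unit @ unit[qi]
    order = np.argsort(-sims)  # indices sorted by descending similarity
    return [(float(sims[i]), words[i]) for i in order if i != qi]

toy_words = ["star", "moon", "dog", "wolf"]
toy_emb = np.array([[1.0, 0.1],
                    [0.9, 0.2],
                    [0.1, 1.0],
                    [-0.2, 0.9]])

ranked = rank_by_cosine(toy_emb, toy_words, "star")
print("most similar :", ranked[0])
print("least similar:", ranked[-1])
```

Reading the list from the end gives the least similar words without a second pass.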

**Experiment ideas**
- Compare neighbors for the same word when trained on two different corpora (space vs. Minecraft).
- Pick a word with multiple meanings ("bank"). Does your tiny corpus give it one sense or mix them up?

In [None]:

def cosine(a,b):
    return float(np.dot(a,b) / (np.linalg.norm(a)*np.linalg.norm(b) + 1e-9))

def neighbors(query, k=8):
    if query not in word2idx:
        return []
    qi = word2idx[query]
    sims = []
    for i in range(len(vocab)):
        if i == qi: continue
        sims.append((cosine(emb[qi], emb[i]), idx2word[i]))
    sims.sort(reverse=True)
    return sims[:k]

for q in ["dog","creeper","star","village"]:
    print(q, "->", neighbors(q))


### Challenge
- **Window sweep:** Train embeddings with window sizes 1, 2, 4. Chart how neighbor lists change. Smaller windows tend to capture grammatical role (words that can stand in for each other); larger windows tend to capture topic.
- **Dimensionality tweak:** Try SVD dimensions of 2, 8, 32. Does the 2D plot lose information compared to 32D cosine neighbors?
- **Story mix:** Combine two themed corpora and see if the embedding space separates the themes (e.g., one cluster for space words, another for dogs).
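
One possible starting point for the window sweep is to wrap the counting and SVD steps in a function and loop over window sizes. This is a sketch under stated assumptions: `build_embeddings`, `top_neighbors`, and the toy sentence are invented for illustration, and a corpus this small will not show the grammar-versus-topic effect reliably.

```python
import numpy as np

def build_embeddings(tokens, window, dims):
    """Co-occurrence counts followed by SVD, mirroring the cells above."""
    vocab = sorted(set(tokens))
    word2idx = {w: i for i, w in enumerate(vocab)}
    cooc = np.zeros((len(vocab), len(vocab)), dtype=np.float32)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                cooc[word2idx[w], word2idx[tokens[j]]] += 1.0
    U, S, _ = np.linalg.svd(cooc, full_matrices=False)
    return vocab, U[:, :dims] * S[:dims]

def top_neighbors(vocab, emb, query, k=3):
    """The k words whose vectors have the highest cosine similarity to the query."""
    unit = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-9)
    qi = vocab.index(query)
    sims = unit @ unit[qi]
    order = [i for i in np.argsort(-sims) if i != qi]
    return [vocab[i] for i in order[:k]]

# Hypothetical mini corpus mixing two themes.
toy_tokens = ("the dog chased the wolf and the wolf chased the dog "
              "the rocket reached the moon and the star lit the moon").split()

sweep = {}
for w_size in (1, 2, 4):
    toy_vocab, toy_emb = build_embeddings(toy_tokens, w_size, dims=4)
    sweep[w_size] = top_neighbors(toy_vocab, toy_emb, "dog")
    print(f"window={w_size}: dog -> {sweep[w_size]}")
```

Swapping `toy_tokens` for the notebook's `tokens` and changing `dims` turns the same loop into the dimensionality experiment.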