
# Lesson 2 — Embeddings & Similarity (Co-occurrence → SVD)
**Goal:** Build *your own* word embeddings from a small corpus using co-occurrence counts + SVD, then explore neighbors with cosine similarity.

**What you'll learn**
- How one-hot vectors differ from dense embeddings
- How co-occurrence windows capture semantic structure
- How SVD compresses each word's co-occurrence row into a short dense vector
- How to visualize 2D embeddings and find nearest neighbors
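
To make the first bullet concrete, here is a minimal sketch comparing one-hot and dense vectors. The three-word vocabulary and the dense values are made up for illustration; they are not learned from the lesson's corpus.

```python
import numpy as np

toy_vocab = ["dog", "wolf", "rocket"]   # hypothetical toy vocabulary
one_hot = np.eye(len(toy_vocab))        # row i is the one-hot vector for word i

# Any two different one-hot vectors have dot product 0, so one-hot encoding
# treats "dog" as equally unrelated to "wolf" and to "rocket".
dog_wolf_onehot = float(one_hot[0] @ one_hot[1])
dog_rocket_onehot = float(one_hot[0] @ one_hot[2])

# Illustrative dense vectors (assumed values, not real embeddings).
dense = np.array([
    [0.9, 0.1],   # dog
    [0.8, 0.2],   # wolf
    [0.1, 0.9],   # rocket
])
dog_wolf_dense = float(dense[0] @ dense[1])
dog_rocket_dense = float(dense[0] @ dense[2])

print("one-hot:", dog_wolf_onehot, dog_rocket_onehot)
print("dense:  ", dog_wolf_dense, dog_rocket_dense)
```

With dense vectors, related words can point in similar directions, which is what the rest of this lesson tries to learn from data.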


In [None]:

from pathlib import Path
import re
import numpy as np
import matplotlib.pyplot as plt

data_dir = Path("../data")
text = ""
for fname in ["space.txt","animals.txt","minecraft.txt"]:
    text += (data_dir / fname).read_text(encoding="utf-8") + "\n"

# Basic cleanup: lowercase, then keep runs of letters and apostrophes as tokens
tokens = re.findall(r"[a-zA-Z']+", text.lower())
vocab = sorted(set(tokens))
word2idx = {w:i for i,w in enumerate(vocab)}
idx2word = {i:w for w,i in word2idx.items()}
len(vocab), tokens[:20]



## Build co-occurrence matrix
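For each token, count every other token that appears within `window` positions on either side (here `window = 2`). Because the window is symmetric, the resulting matrix is symmetric too.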
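
Before running this on the full corpus, it can help to trace the counting on one tiny sentence. The sketch below uses a hypothetical helper, `build_cooc`, and a made-up five-word sentence; it follows the same windowing idea but is not the notebook's own cell.

```python
import numpy as np

def build_cooc(tokens, window=2):
    """Count, for each token, the tokens within `window` positions on either side."""
    vocab = sorted(set(tokens))
    word2idx = {w: i for i, w in enumerate(vocab)}
    cooc = np.zeros((len(vocab), len(vocab)), dtype=np.float32)
    for i, w in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own neighbor
                cooc[word2idx[w], word2idx[tokens[j]]] += 1.0
    return cooc, word2idx

toy_tokens = "the dog chased the wolf".split()
toy_cooc, toy_idx = build_cooc(toy_tokens, window=2)

print(sorted(toy_idx, key=toy_idx.get))  # row/column order
print(toy_cooc)
```

In this toy sentence "dog" and "wolf" are never within two positions of each other, so their direct count is zero, yet both co-occur with "the" and "chased". Shared context of that kind is what the SVD step can pick up on.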


In [None]:

window = 2
V = len(vocab)
cooc = np.zeros((V,V), dtype=np.float32)

for i, w in enumerate(tokens):
    wi = word2idx[w]
    for j in range(max(0,i-window), min(len(tokens), i+window+1)):
        if j == i: continue
        wj = word2idx[tokens[j]]
        cooc[wi, wj] += 1.0

print("Co-occurrence built:", cooc.shape)



## SVD to get embeddings
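We compute the SVD of the co-occurrence matrix and keep only the top `dims` singular directions (16 here) as each word's embedding. The first 2 of those dimensions are used for the plot.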
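
To see why keeping only a few dimensions still preserves most of the structure, the sketch below measures how well a rank-k reconstruction approximates a small matrix. The matrix `M` is random stand-in data, not the lesson's co-occurrence matrix, and `rank_k_error` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((6, 6))
M = M + M.T   # symmetric, like a co-occurrence matrix built with a symmetric window

U, S, VT = np.linalg.svd(M, full_matrices=False)

def rank_k_error(k):
    """Frobenius-norm error of rebuilding M from its top k singular components."""
    approx = (U[:, :k] * S[:k]) @ VT[:k, :]
    return float(np.linalg.norm(M - approx))

errors = [rank_k_error(k) for k in range(1, 7)]
for k, err in enumerate(errors, start=1):
    print(f"k={k}: reconstruction error {err:.4f}")
```

The singular values come back sorted from largest to smallest, so each extra component adds less than the one before it, and the error reaches roughly zero once all six are used.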


In [None]:

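# Full SVD of the co-occurrence matrix (fine for a small vocabulary).
# The tiny constant is added to every entry, so no row or column is all zeros.
U, S, VT = np.linalg.svd(cooc + 1e-6, full_matrices=False)
dims = 16
# Keep the top `dims` components; scaling by S weights each axis by its singular value.
emb = U[:, :dims] * S[:dims]

# First two components, used only for plotting
emb2 = emb[:, :2]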

# Pick some words to label (space/animals/minecraft themed)
focus = [w for w in ["dog", "dogs", "wolf", "wolves", "creeper","village","portal",
                     "star","stars","rocket","ship","moon","falcon","cheetahs","pandas"]
         if w in word2idx]

plt.figure(figsize=(6,6))
plt.scatter(emb2[:,0], emb2[:,1], alpha=0.1)
for w in focus:
    i = word2idx[w]
    plt.text(emb2[i,0], emb2[i,1], w)
plt.title("2D Embeddings (SVD of co-occurrence)")
Path("../images").mkdir(parents=True, exist_ok=True)  # make sure the output folder exists
plt.savefig("../images/embeddings_2d.png", bbox_inches="tight")
plt.show()



## Cosine similarity & nearest neighbors
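
Cosine similarity compares the direction of two vectors and ignores their length: it is the dot product divided by the product of the two norms. A minimal sketch with hand-picked vectors (the name `cosine_sim` and the values are illustrative only):

```python
import numpy as np

def cosine_sim(a, b):
    """Dot product divided by the product of the vector lengths."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0])

same_direction = cosine_sim(a, 10 * a)   # scaling a vector does not change its direction
orthogonal = cosine_sim(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine_sim(a, -a)

print(same_direction, orthogonal, opposite)
```

Values near 1 mean the vectors point the same way, values near 0 mean they are unrelated, and values near -1 mean they point in opposite directions. Since frequent words tend to have longer embedding vectors, ignoring length keeps word frequency from dominating the comparison.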


In [None]:

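def cosine(a,b):
    # Cosine of the angle between a and b; the 1e-9 guards against division by zero
    return float(np.dot(a,b) / (np.linalg.norm(a)*np.linalg.norm(b) + 1e-9))

def neighbors(query, k=8):
    # Return the k words most similar to `query` as (similarity, word) pairs
    if query not in word2idx:
        return []  # unknown word: no neighbors
    qi = word2idx[query]
    sims = []
    for i in range(len(vocab)):
        if i == qi: continue  # skip the query word itself
        sims.append((cosine(emb[qi], emb[i]), idx2word[i]))
    sims.sort(reverse=True)  # highest similarity first
    return sims[:k]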

for q in ["dog","creeper","star","village"]:
    print(q, "->", neighbors(q))
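
The word-by-word loop is easy to read, but the same ranking can be computed with one matrix product. The sketch below is an optional alternative, assuming a hypothetical helper `top_k_cosine` and a tiny hand-made embedding matrix rather than the `emb` built earlier.

```python
import numpy as np

def top_k_cosine(emb, query_idx, k=3):
    """Rank all rows of `emb` by cosine similarity to row `query_idx`."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.maximum(norms, 1e-9)   # unit-length rows; guard against zero rows
    sims = unit @ unit[query_idx]          # cosine similarity to every row at once
    sims[query_idx] = -np.inf              # exclude the query itself
    order = np.argsort(-sims)[:k]          # indices of the k largest similarities
    return [(int(i), float(sims[i])) for i in order]

# Illustrative 2D embeddings (assumed values).
toy_emb = np.array([
    [1.0, 0.0],    # row 0: query
    [0.9, 0.1],    # row 1: nearly the same direction as row 0
    [0.0, 1.0],    # row 2: orthogonal to row 0
    [-1.0, 0.0],   # row 3: opposite to row 0
])
result = top_k_cosine(toy_emb, 0, k=2)
print(result)
```

Normalizing every row once means each cosine similarity reduces to a plain dot product, which can matter as the vocabulary grows.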



### Challenge
- Add more themed sentences to the corpus files and re-run. Do the neighbors change?
- Try different window sizes, such as 1 and 5. How do the nearest neighbors change?
