# Lab 5, Module 0: Introduction to Embeddings

**Estimated time:** 20 minutes

---

## **Opening: How Does Your Brain Know Meaning?**

Think about this for a moment: How does your brain know that "cat" and "dog" are similar, but "cat" and "galaxy" are completely different? You've never explicitly been taught a rule that says "pets go together" and "space objects go separately." Yet somehow, through experience, your brain has learned these relationships.

**Modern AI systems like ChatGPT rely on a similar idea.** They don't store word definitions or rules. Instead, they represent words and sentences as **vectors** (lists of numbers), where similar meanings point in similar directions in a high-dimensional space.

This module gives you an intuitive, "behind the curtain" understanding of how that works.

---

# 📘 **How Do Embeddings Work?**  
### *A simple, concrete explanation*

Modern AI systems (like ChatGPT and other foundation models) represent **words** and **sentences** as **vectors**, which are lists of numbers. These vectors encode meaning based on how language is used across millions of sentences.

---

## **1. Words become vectors using "distributed meaning"**

Every word (e.g., **"galaxy"**) becomes a long list of numbers:

$[0.12, -0.88, 0.43, \ldots, 0.04]$

Each number captures a tiny statistical association learned from language use.

You can think of each dimension as asking a vague, fuzzy question like:

- **"How strongly does this idea tend to appear in scientific or technical contexts?"**
- **"How much does this concept appear in discussions about animals?"**
- **"How much does this idea align with everyday activities or objects?"**

No single dimension has a clean human meaning.  
But **together**, they form a rich representation of the word's usage patterns.

Words used in similar ways end up with **similar vectors**.

---

## **2. The model learns meaning by predicting missing words**

Embedding models are typically trained with a simple game:

> **Look at a sentence with a missing word and guess what goes there.**

Example:

*"The _____ orbits the Sun every year."*

The model is rewarded for predicting words like:
- *Earth*
- *planet*
- *object*

and penalized for predicting:
- *banana*
- *giraffe*

After doing this **millions of times**, the model learns patterns such as:

- Which words appear in similar contexts  
- Which words are interchangeable in certain situations  
- How tone, topic, and structure influence meaning  

This process pulls related words together in vector space.
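
The fill-in-the-blank game can be sketched as a way of turning ordinary sentences into training examples. The snippet below is only an illustration: `masked_examples` is a hypothetical helper (not part of any library), and real models hide tokens and score guesses in more elaborate ways.

```python
# Hypothetical helper: hide each word of a sentence in turn and pair the
# remaining context with the word that was hidden.
def masked_examples(sentence, mask="_____"):
    tokens = sentence.split()
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[:i] + [mask] + tokens[i + 1:]
        examples.append((" ".join(context), target))
    return examples

# Each pair is (sentence with a blank, word the model should guess).
for context, target in masked_examples("The Earth orbits the Sun"):
    print(f"{context!r} -> {target}")
```

A five-word sentence yields five (context, target) pairs, so a large text collection supplies training examples without any hand labeling.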

---

## **3. Sentences become vectors too**

Embedding models for sentences (like the one you'll use in Module 2) work by:

1. Converting each word into a vector  
2. Processing the whole sentence through a small transformer  
3. Producing one final vector that represents the meaning of the sentence  

Two sentences with the *same meaning* end up very close, even though the wording is different:

- "The Earth orbits the Sun."  
- "The Sun is orbited by the Earth."
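
One common way to carry out step 3 is mean pooling: average the per-token vectors into a single vector. The sketch below assumes mean pooling and uses made-up numbers for the token vectors. In a real model those vectors would come out of the transformer in step 2, and the model you use later may pool differently.

```python
# Made-up token vectors for a 4-token sentence (4 tokens x 3 dimensions).
# These values are placeholders, not output from any real model.
token_vectors = [
    [0.2, 0.1, 0.0],
    [0.9, 0.3, 0.1],
    [0.4, 0.8, 0.2],
    [0.1, 0.0, 0.3],
]

# Mean pooling: average each dimension across all tokens.
sentence_vector = [sum(column) / len(token_vectors) for column in zip(*token_vectors)]

print(sentence_vector)  # one vector with 3 dimensions for the whole sentence
```

However many tokens the sentence has, the result always has the same number of dimensions, which is what makes sentences of different lengths comparable.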

---

## **4. Measuring similarity: cosine similarity**

To compare meanings, we use **cosine similarity**, which is the cosine of the angle between two vectors. As a rough guide:

- **1.0** → almost identical meaning  
- **0.8** → very similar  
- **0.4** → loosely related  
- **0.0** → unrelated  

This is the core idea behind semantic search:  
> Find the corpus sentences whose vectors are closest to the query vector.
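
As a minimal sketch, cosine similarity is the dot product of two vectors divided by the product of their lengths. The function name `cosine` is chosen here for illustration; the notebook itself uses scikit-learn's `cosine_similarity` later on.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v: dot product over product of lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)  # undefined if either vector is all zeros

print(cosine([1, 0], [2, 0]))    # same direction
print(cosine([1, 0], [0, 3]))    # perpendicular
print(cosine([1, 0], [-1, 0]))   # opposite direction
```

The three calls give 1.0, 0.0, and -1.0. Length does not matter, only direction. Negative values are possible in general, but the count-based vectors built later in this module have no negative entries, so their similarities stay between 0 and 1.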

---

## **5. The big picture: why this matters**

- Embeddings are **numerical fingerprints of meaning**.  
- Similar meanings → similar vectors.  
- The model learns this automatically through massive exposure to language.  
- Geometry (distances and directions) encodes semantic relationships.  
- This is the same idea as **hidden representations** from Lab 4,  
  just scaled up to hundreds of millions of parameters.

**Connection to Lab 4:** Remember how hidden layers in neural networks created new representations that made problems solvable? Embeddings do the same thing, but for language. They transform words into a space where "meaning" becomes measurable geometry.

# Building Our Own Tiny Embedding System

## Why Build from Scratch?

You might wonder: "If professional embedding models exist, why build our own?"

**Answer:** Building the simplest possible version helps you understand the core idea. Once you see how a tiny 27-sentence embedding system works, you'll understand what GloVe, BERT, and GPT are doing, just at a much larger scale.

Think of this as building a bicycle before learning to fly a plane. Same basic principles, different scale.

---

## 🧱 How We Create an Embedding Matrix (Simple Example)

To understand how word embeddings work, we start by building a **co-occurrence matrix**.  
This matrix captures *how often words appear near each other* in our tiny corpus.

### 1. Build an empty matrix
We create a matrix with one row and one column for every word in the vocabulary.

If we have 92 words, this becomes a **92 × 92** table.

- **Rows** represent the "target" word  
- **Columns** represent the "context" word  
- Entries store **how many times** they appear together

At the start, everything is zero.

---

### 2. Fill the matrix by counting co-occurrences
For each sentence:

1. Split it into words  
2. For every ordered pair of words at different positions in that sentence, add **+1** to the cell for (word₁, word₂)

Example sentence: *"cats chase mice"*

We add +1 to: (cats, chase), (cats, mice), (chase, cats), (chase, mice), (mice, cats), (mice, chase)
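
The counting for this one sentence can be sketched in a few lines of plain Python. The variable names (`vocab`, `index`, `counts`) are chosen for this illustration only.

```python
tokens = "cats chase mice".split()
vocab = sorted(set(tokens))                    # ['cats', 'chase', 'mice']
index = {w: i for i, w in enumerate(vocab)}    # word -> row/column number

# Start with an all-zero table, one row and one column per word.
counts = [[0] * len(vocab) for _ in vocab]

for i, w1 in enumerate(tokens):
    for j, w2 in enumerate(tokens):
        if i != j:                             # a word is not its own context
            counts[index[w1]][index[w2]] += 1

for word, row in zip(vocab, counts):
    print(f"{word:6s}", row)
```

Each row ends up with a 1 for the other two words and a 0 on the diagonal, matching the six (word₁, word₂) pairs listed above.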

---

### 3. The matrix becomes a simple embedding
After processing all sentences:

- Row **i** contains the "context fingerprint" of word *i*  
- Words that appear in **similar contexts** have **similar rows**  
- These rows *are* our first version of embeddings

For example:

- "cats" and "dogs" may have similar rows because they appear next to "pets," "chase," etc.  
- "stars" and "galaxies" may have similar rows because both appear near science words  
- "neural" and "networks" will strongly co-occur

This creates meaningful structure without any neural networks.

---

## ⭐ 4. Why real embeddings reduce dimension (PCA, SVD, neural models)

Our tiny embedding matrix is **vocab_size × vocab_size**.  
With a vocabulary of 50,000 words, that becomes a **50,000 × 50,000** matrix, which is far too big and noisy.

Real embedding systems (word2vec, GloVe, MiniLM, BERT, GPT) therefore:

### ✔ Learn a **much smaller number of dimensions**  
usually **50–1000**, instead of tens of thousands.

### ✔ Use mathematical tools to compress the co-occurrence structure:
- **SVD (Singular Value Decomposition)**, as in latent semantic analysis (LSA)  
- **Weighted least-squares factorization** of log co-occurrence counts, as in GloVe  
- **PCA-like dimensionality reduction**  
- **Neural networks (skip-gram, transformers)** that learn compact vectors directly  

### ✔ The goal is to keep the **important patterns**  
and discard the noise.

You can think of this like:

> "Boiling down all the ways a word is used into a small, dense fingerprint of meaning."

So instead of each word having a 50,000-dimensional sparse vector,  
a model might learn a **384-dimensional** vector that captures the same semantic relationships.

This is why real embeddings:
- are smaller  
- generalize better  
- work fast  
- encode meaning in a compact geometric space  
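
To make the compression idea concrete, here is a minimal sketch using truncated SVD from NumPy. The small matrix `M` and the choice `k = 2` are made up for illustration. Real systems work with far larger matrices and may not use SVD at all.

```python
import numpy as np

# Made-up co-occurrence counts: 4 words (rows) x 5 context words (columns).
# Rows 0-1 share one group of contexts, rows 2-3 share another.
M = np.array([
    [0, 3, 2, 0, 0],
    [3, 0, 2, 0, 0],
    [0, 0, 0, 4, 1],
    [0, 0, 0, 1, 4],
], dtype=float)

k = 2  # number of dimensions to keep (an arbitrary choice for this sketch)

# SVD factors M into U * diag(S) * Vt, with singular values sorted largest first.
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Keep only the top-k directions: one dense k-dimensional vector per word.
dense = U[:, :k] * S[:k]

print(M.shape, "->", dense.shape)
print("Reconstruction error:", np.linalg.norm(M - dense @ Vt[:k]))
```

The shapes go from (4, 5) to (4, 2). The reconstruction error measures how much of the original table is discarded when only `k` directions are kept. Raising `k` lowers the error but makes the vectors longer.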

Now let's build our tiny embedding system and see this in action!

In [None]:
!pip install plotly -q

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity
import random


import plotly.express as px
import pandas as pd

---

## 📝 Question 1 (Prediction)

Before running the code below, make a prediction:

**Q1.** Will "cats" and "dogs" have similar embedding vectors? Why or why not?

*Think about: Do these words appear in similar sentences in the corpus? What contexts do they share?*

**Write your prediction in the answer sheet, then run the code to see if you were correct!**

---

In [None]:
#@title ### 🧱 Build Your Own Tiny Word Embedding Model (Expanded Corpus + Clear Plot)


# -------------------------------------
# 1. Expanded Corpus (~3× larger)
# -------------------------------------
corpus = [
    # Pets / animals
    "cats are great pets",
    "dogs are loyal pets",
    "cats chase mice",
    "dogs chase balls",
    "hamsters run on wheels",
    "fish swim in aquariums",
    "birds can mimic speech",
    "rabbits eat vegetables",
    "turtles move slowly",

    # Astronomy / science
    "astronomy studies stars",
    "stars produce light",
    "galaxies contain billions of stars",
    "planets orbit the sun",
    "the moon causes ocean tides",
    "telescopes help astronomers observe galaxies",
    "gravity pulls objects together",
    "astronauts travel to space",

    # Technology / AI
    "computers run neural networks",
    "neural networks learn patterns",
    "machines can recognize images",
    "algorithms solve problems",
    "data scientists analyze information",
    "robots move using instructions",

    # Daily life / misc
    "music concerts bring people together",
    "cooking at home is relaxing",
    "running is good exercise",
    "video games can be played with friends"
]

print("Corpus size:", len(corpus), "sentences\n")


# -------------------------------------
# 2. Build vocabulary - make a list of words, an index, and note the vocabulary size
# -------------------------------------
words = sorted({w for s in corpus for w in s.split()})
word_to_idx = {w: i for i, w in enumerate(words)}
vocab_size = len(words)

print("Vocabulary size:", vocab_size)
print("Words:", words, "\n")


# -------------------------------------
# 3. Build co-occurrence matrix
# -------------------------------------
cooc = np.zeros((vocab_size, vocab_size), dtype=float)

for sentence in corpus:
    tokens = sentence.split()
    for i, w1 in enumerate(tokens):
        for j, w2 in enumerate(tokens):
            if i != j:
                cooc[word_to_idx[w1], word_to_idx[w2]] += 1


# -------------------------------------
# 4. Normalize rows → simple embeddings
# -------------------------------------
embeddings = cooc / (cooc.sum(axis=1, keepdims=True) + 1e-6)


# -------------------------------------
# 5. Measure the similarity between sample word pairs
# -------------------------------------
def similarity(w1, w2):
    v1 = embeddings[word_to_idx[w1]].reshape(1, -1)
    v2 = embeddings[word_to_idx[w2]].reshape(1, -1)
    return cosine_similarity(v1, v2)[0, 0]




Corpus size: 27 sentences

Vocabulary size: 92
Words: ['algorithms', 'analyze', 'aquariums', 'are', 'astronauts', 'astronomers', 'astronomy', 'at', 'balls', 'be', 'billions', 'birds', 'bring', 'can', 'cats', 'causes', 'chase', 'computers', 'concerts', 'contain', 'cooking', 'data', 'dogs', 'eat', 'exercise', 'fish', 'friends', 'galaxies', 'games', 'good', 'gravity', 'great', 'hamsters', 'help', 'home', 'images', 'in', 'information', 'instructions', 'is', 'learn', 'light', 'loyal', 'machines', 'mice', 'mimic', 'moon', 'move', 'music', 'networks', 'neural', 'objects', 'observe', 'ocean', 'of', 'on', 'orbit', 'patterns', 'people', 'pets', 'planets', 'played', 'problems', 'produce', 'pulls', 'rabbits', 'recognize', 'relaxing', 'robots', 'run', 'running', 'scientists', 'slowly', 'solve', 'space', 'speech', 'stars', 'studies', 'sun', 'swim', 'telescopes', 'the', 'tides', 'to', 'together', 'travel', 'turtles', 'using', 'vegetables', 'video', 'wheels', 'with'] 



In [2]:
#@title ### 🧱 Experiment 1: look at how closely linked pairs of words are based on their embeddings


pairs = [
    ("cats", "dogs"),
    ("stars", "galaxies"),
    ("neural", "networks"),
    ("pets", "stars"),
    ("astronomy", "galaxies"),
    ("games", "music"),
    ("cats","galaxies"),
    ("dogs","networks"),
    
]

print("Cosine similarities:\n")
for a, b in pairs:
    print(f"{a:10s} ~ {b:10s} → {similarity(a,b):.3f}")
print()



Cosine similarities:

cats       ~ dogs       → 0.600
stars      ~ galaxies   → 0.375
neural     ~ networks   → 0.500
pets       ~ stars      → 0.000
astronomy  ~ galaxies   → 0.250
games      ~ music      → 0.000
cats       ~ galaxies   → 0.000
dogs       ~ networks   → 0.000



## Understanding the Results

Just from the sentence context, the very simple embedding system has generated a set of vectors that show how words are related to each other. We can use cosine similarity to see that this system works in this simple context. 

**Notice the patterns:**
- "cats" and "dogs" have a similarity of **0.600**: fairly similar!
- "stars" and "galaxies" have a similarity of **0.375**: related
- "neural" and "networks" have a similarity of **0.500**: closely linked
- But "pets" and "stars" have a similarity of **0.000**: completely unrelated
- "cats" and "galaxies" also have **0.000**: no overlap in usage

Dogs are not related to networks, but cats are related to dogs. The embedding system learned this purely from which words appear together in sentences.
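
The 0.600 for "cats" and "dogs" can be checked by hand. The sketch below lists the words that share a sentence with each of them in the corpus above. Every one of those co-occurrence counts is 1, and cosine similarity ignores the row normalization because it depends only on direction, not length.

```python
import math

# Words that share a sentence with "cats" and with "dogs" in the corpus.
# Each co-occurs exactly once, so every nonzero entry in these rows is 1.
cats_context = {"are", "great", "pets", "chase", "mice"}
dogs_context = {"are", "loyal", "pets", "chase", "balls"}

shared = cats_context & dogs_context          # contexts the two words have in common
dot = len(shared)                             # 1 * 1 for each shared context word
lengths = math.sqrt(len(cats_context)) * math.sqrt(len(dogs_context))

print(sorted(shared))
print(round(dot / lengths, 3))
```

Three shared context words out of five each gives 3 / 5 = 0.6, matching the printed output.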

---

## 📝 Questions 2-3 (Observation)

**Q2.** Looking at the cosine similarities in the output above, which word pair is most similar? Does this match your intuition?

**Q3.** Why do "cats" and "galaxies" have a similarity of 0.000? What does this tell you about their co-occurrence in the corpus?

*Record your answers in the answer sheet.*

---

Now let's visualize this embedding space to see the clusters more clearly!

(92, 92)

In [9]:

# -------------------------------------
# 6. PCA for visualization (2D)
# -------------------------------------
pca = PCA(n_components=2)
points = pca.fit_transform(embeddings)

# Optional: Light topic-based coloring (simple heuristic)
def guess_topic(w):
    if w in {"cat","cats","dogs","dog","hamsters","fish","birds","rabbits","turtles","mice","balls"}:
        return "pets"
    if w in {"stars","astronomy","galaxies","planets","sun","moon","gravity","astronauts","telescopes"}:
        return "space"
    if w in {"computers","neural","networks","machines","algorithms","data","scientists","robots"}:
        return "tech"
    return "other"

# Colors for each rough topic category (used by the Plotly figure below)
color_map = {
    "pets": "blue",
    "space": "red",
    "tech": "green",
    "other": "gray"
}

# -------------------------------------
# 7. Interactive PCA Visualization with Hover Labels (Plotly)
# -------------------------------------

# Build a DataFrame for Plotly: one row per word with its 2D PCA coordinates
df = pd.DataFrame({
    "word": words,
    "pc1": points[:, 0],
    "pc2": points[:, 1],
    "topic": [guess_topic(w) for w in words]
})

fig = px.scatter(
    df,
    x="pc1",
    y="pc2",
    color="topic",
    text=None,
    hover_name="word",
    hover_data={"topic": True, "pc1": False, "pc2": False},
    color_discrete_map=color_map,
    width=800,
    height=800
)

fig.update_layout(
    title="Tiny Word Embedding Space (Co-occurrence + PCA)",
    xaxis_title="PCA Component 1",
    yaxis_title="PCA Component 2",
)

fig.show()


## 🔍 Understanding the PCA Plot: What You're Seeing

The interactive plot shows a **2-dimensional picture** of word embeddings that actually live in a **much higher-dimensional space** (92 dimensions in our case, one per vocabulary word). Because we can't easily visualize high-dimensional geometry, we use a tool called **Principal Component Analysis**, or **PCA**, to create a simplified view.

### **What PCA Does (Intuition Only)**
PCA looks at all of the high-dimensional word vectors and asks:

> *"If I had to draw these points on a flat piece of paper, what are the two directions that preserve the most structure?"*

It then finds the two directions along which the word vectors vary the most (the directions of greatest variance). These become the **x-axis** and **y-axis** of the plot.

Think of PCA as:
- flattening a crumpled map onto a table,  
- while trying to keep neighborhoods and directions as faithful as possible.

### **Why This Helps**
Even though words actually live in a 92-dimensional space, PCA lets us see:

- **clusters** of related words  
  (e.g., pets cluster together: cats, dogs, hamsters)
- **separation** between different topics  
  (e.g., "cats/dogs" far from "stars/galaxies")
- **relative similarity**  
  (close points = similar contexts; far points = different meanings)

### **What the Plot Represents**
- Each **dot** is a word.  
- Dots placed close together tend to appear in **similar contexts** in the corpus.  
- Dots far apart rarely appear in similar contexts.  
- Colors help highlight rough topic categories (pets=blue, space=red, tech=green, other=gray).

**Notice how:**
- Pet words (blue) cluster in one region
- Space/astronomy words (red) cluster in another region
- Technology words (green) form their own group
- The clusters are separated from each other!

### **Important Note**
PCA doesn't capture *all* the meaning relationships, only the most visible ones.  
Words might be similar in dimensions we can't visualize, but PCA gives us a useful snapshot that reveals the **general structure** of the embedding space.

In short:

> **PCA gives us a 2D window into a high-dimensional world of meaning.**  
> It's not perfect, but it's extremely helpful for seeing patterns and relationships at a glance.
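
One way to see how much a 2D view leaves out is to measure the share of total variance that the top two directions keep. The sketch below does PCA by hand with NumPy on random toy data (the array `X` is made up, not the notebook's embeddings), so the exact percentage it prints says nothing about the word plot.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))            # toy data: 50 points in 10 dimensions

Xc = X - X.mean(axis=0)                  # PCA works on mean-centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

points_2d = Xc @ Vt[:2].T                # coordinates along the top two directions
variance_ratio = S**2 / np.sum(S**2)     # share of total variance per direction

print(points_2d.shape)
print(f"Variance kept by 2 components: {variance_ratio[:2].sum():.1%}")
```

For the notebook's own plot, the fitted scikit-learn object exposes the same quantity as `pca.explained_variance_ratio_`. Whatever is not kept is structure the 2D picture cannot show.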

---

## 📝 Questions 4-5 (Analysis)

**Q4.** In the PCA visualization above, which words cluster together? Why do you think they form these groups?

*Hint: Look at the colored regions. What do words in the same cluster have in common?*

**Q5.** The original embedding space has 92 dimensions (one per word). PCA reduces this to 2D. What information might be lost in this reduction?

*Think about: Can all word relationships be perfectly represented in just 2 dimensions?*

*Record your answers in the answer sheet.*

---

## 📝 Question 6 (Synthesis - Connection to Lab 4)

**Q6.** How is this co-occurrence approach similar to what you learned about hidden layers in Lab 4? 

*Hint: Both create new representations. In Lab 4, hidden layers transformed input features into a new space. What does the co-occurrence matrix transform words into?*

*Record your answer in the answer sheet.*

---

## ✅ Module 0 Complete!

You've just built a tiny embedding system from scratch! Here's what you learned:

- **Words become vectors** based on which other words they appear with
- **Similar contexts → similar vectors** (cats and dogs both appear with "pets")
- **Cosine similarity** measures how close meanings are
- **PCA** lets us visualize high-dimensional spaces in 2D
- **This is the same core idea** that powers modern AI systems, just scaled up

**In Module 1**, you'll explore pre-trained embeddings trained on BILLIONS of words from Wikipedia. You'll see how these professional embeddings capture fascinating relationships like analogies (king - man + woman ≈ queen).

**Ready?** Move on to **Module 1: Word Embeddings & Vector Arithmetic**!