# Lab 5, Module 1: Word Embeddings & Vector Arithmetic

**Estimated time:** 25 minutes

---

## From Tiny to Massive: Scaling Up Embeddings

In Module 0, you built an embedding system from 27 sentences and 92 words. That was enough to see the core idea: words that appear in similar contexts get similar vectors.

**Now let's scale up.**

In this module, you'll work with **GloVe** (Global Vectors for Word Representation), a professional embedding model trained on:
- **6 billion tokens** from Wikipedia and newswire text (the Gigaword corpus)
- **400,000 vocabulary words**
- **50-dimensional vectors** (much more compact than our 92×92 matrix!)

These embeddings capture fascinating patterns in language, including:
- Individual dimensions that loosely track concepts like "science-ness" or "formality"
- **Vector arithmetic** that solves analogies: *paris - france + italy ≈ rome*
- Relationships between word families, tenses, and grammatical forms

**What you'll explore:**
1. Load pre-trained GloVe word vectors
2. Examine individual word embeddings
3. Investigate what individual dimensions capture
4. Use vector arithmetic to solve analogies
5. Try your own custom analogies

Let's dive in!

In [None]:
# This cell:
#   • Imports the libraries used in this module
#   • Installs gensim first if it is not already available
#
# NOTE: The GloVe model itself (~70MB) is downloaded the first time you run
#       the loading cell further down, which may take up to a minute in Colab.

import numpy as np
import matplotlib.pyplot as plt

try:
    import gensim.downloader as api
except ImportError:
    !pip install -q gensim
    import gensim.downloader as api

from sklearn.metrics.pairwise import cosine_similarity

### Loading Pre-Trained Embeddings

The gensim downloader loads a large set of words that have already been embedded as vectors.

**This happens automatically** when you run the loading cell below. The first time, it downloads ~70MB (takes about a minute). After that, the model is cached.

Once loaded, you'll see:
- **Vocabulary size:** ~400,000 words
- **Vector dimension:** 50 dimensions per word

Compare this to Module 0:
- Your tiny system: 92 words, 92 dimensions
- GloVe: 400,000 words, 50 dimensions!

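GloVe does not store raw co-occurrence counts. Instead, it learns each word's 50 numbers by fitting them to corpus-wide co-occurrence statistics. The effect resembles dimensionality reduction (similar in spirit to PCA): the co-occurrence information is compressed into just 50 numbers per word while preserving the important relationships.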
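
The compression idea can be sketched on a toy co-occurrence matrix. Note the assumptions: the four words and their counts are made up, and the sketch uses truncated SVD (a PCA-like factorization), which is not GloVe's actual training procedure. It only illustrates that a few numbers per word can keep similar words close together.

```python
import numpy as np

# Made-up co-occurrence counts: rows are words, columns are six context words.
words = ["cat", "dog", "atom", "molecule"]
cooc = np.array([
    [9, 8, 7, 0, 1, 0],   # cat
    [8, 9, 6, 1, 0, 0],   # dog
    [0, 1, 0, 9, 8, 7],   # atom
    [1, 0, 0, 8, 9, 6],   # molecule
], dtype=float)

# Truncated SVD: keep only the k strongest directions in the count matrix.
U, S, Vt = np.linalg.svd(cooc, full_matrices=False)
k = 2
compact = U[:, :k] * S[:k]   # one k-dimensional vector per word


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


cat, dog, atom, molecule = compact
print("shape before:", cooc.shape, "after:", compact.shape)
print("cat vs dog :", round(cosine(cat, dog), 3))
print("cat vs atom:", round(cosine(cat, atom), 3))
```

Even with only 2 numbers per word, "cat" stays much closer to "dog" than to "atom" in this toy example.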

---

## 📝 Question 7 (Observation)

**Q7.** How many dimensions does each GloVe word vector have? How does this compare to Module 0's co-occurrence embeddings?

*Record your answer in the answer sheet.*

---

In [None]:
# ============================================================
#  Module 1, Activity 0: Word-level Embedding
#  DATA 1010 – Artificial Intelligence in Action
# ============================================================



# -----------------------------
# 1. Load a pre-trained word embedding model
# -----------------------------
# Check if the model is already loaded to avoid reloading
if 'w2v' not in locals():
    print("Loading GloVe word vectors (glove-wiki-gigaword-50)...")
    w2v = api.load("glove-wiki-gigaword-50")  # 50-dimensional GloVe
    print("Model loaded!")
    print(f"Vocabulary size: {len(w2v.index_to_key):,} words")
    print(f"Vector dimension: {w2v.vector_size} dimensions\n")
else:
    print("GloVe word vectors (w2v) already loaded.\n")



### Let's take a look at a few of the words and how they are represented

In [None]:

np.set_printoptions(precision=3, linewidth=60)
w = "galaxy"
print(f"word: {w:20s}")
print(f"Length of embedding: {len(w2v[w])}")
print(f"embedding:\n{w2v[w]}\n")

### Visualizing the Dimensions

Now let's look at how different words are represented across the 50 dimensions.

We'll plot the vectors of a few words and see how they look on a 2D chart. Each line represents the embedding of a different word. 

**What to look for:** Notice the peak around parameter 30 (we use "parameter" and "dimension" interchangeably for a position in the vector). That value seems to be higher for everyday words like "person" and "table" than for scientific words like "galaxy" and "atom".

Could individual dimensions capture specific concepts? Let's investigate!

---

## 📝 Question 8 (Observation)

**Q8.** Looking at the dimension plots for "galaxy", "person", "table", and "atom", what do you notice about parameter 30?

*Hint: Which words have high values? Which have low values?*

*Record your answer in the answer sheet.*

---

In [None]:
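# Despite the variable name, this list mixes science words ("galaxy", "atom")
# with everyday words ("person", "table") so we can compare them in one plot.
science_words = ["galaxy", "person", "table", "atom"]
x = list(range(50))
np.set_printoptions(precision=3, linewidth=60)
for w in science_words:
    print(f"word: {w:20s}")
    y = w2v[w]
    plt.plot(x, y, label=w)

# Mark the region around dimension 30 discussed in the text above
plt.annotate('Peak', xy=(30, 2), xytext=(25, 2.5),
             arrowprops=dict(facecolor='black', shrink=0.05),
             ha='right', va='bottom')
plt.legend()
plt.xlabel("Dimension")
plt.ylabel("value")
plt.show()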

### Let's Explore Parameter 33

We noticed something interesting in the plot above. Now let's investigate more systematically.

**Experiment:**
1) Take a bunch of science words - find all their vectors and average them
2) Take a bunch of non-science words - find all their vectors and average them
3) Subtract the non-science from the science, and plot the results
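
The three steps above can be sketched on toy data before running them on GloVe. Everything in this sketch is an assumption for illustration: the 4-dimensional vectors are made up (real vectors would come from `w2v[word]`), and `group_mean` is a hypothetical helper.

```python
import numpy as np

# Made-up 4-dimensional vectors standing in for real GloVe embeddings.
toy_vectors = {
    "atom":     np.array([0.2, -1.5, 0.4, 0.1]),
    "molecule": np.array([0.1, -1.7, 0.3, 0.3]),
    "galaxy":   np.array([0.3, -1.3, 0.5, 0.0]),
    "cat":      np.array([0.2,  0.4, 0.4, 0.2]),
    "pizza":    np.array([0.1,  0.6, 0.5, 0.1]),
    "house":    np.array([0.3,  0.2, 0.3, 0.3]),
}


def group_mean(words, vectors):
    """Average the vectors of a list of words, dimension by dimension."""
    return np.mean([vectors[w] for w in words], axis=0)


# Steps 1 and 2: average each group.
science_mean = group_mean(["atom", "molecule", "galaxy"], toy_vectors)
other_mean = group_mean(["cat", "pizza", "house"], toy_vectors)

# Step 3: subtract, then find the dimension with the largest gap.
displacement = science_mean - other_mean
strongest = int(np.argmax(np.abs(displacement)))

print("displacement:", np.round(displacement, 3))
print("dimension with the largest gap:", strongest)
```

In this toy data only one dimension separates the two groups; real embeddings are usually messier.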

**Question:** Will we find a dimension that consistently captures "science-ness"?

Let's find out!

In [None]:
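# Define the science list from scratch so it contains only science words
# (no "person" or "table", no duplicates), even if this cell is run more than once.
science_words = ["galaxy", "atom", "molecule", "quantum", "telescope", "cell", "nucleus", "research", "experiment"]
nonscience_words = ["cat", "dog", "pizza", "music", "tree", "happy", "running", "house"]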


# Average the vectors in each group (one mean vector per group)
science_average = np.mean([w2v[w] for w in science_words], axis=0)
nonscience_average = np.mean([w2v[w] for w in nonscience_words], axis=0)

# Per-dimension difference between the two group averages
science_displacement = science_average - nonscience_average
x = np.arange(len(science_displacement))

plt.plot(x, science_displacement, "*")
plt.xlabel("embedding parameter")
plt.ylabel("displacement from non-science words to science words")
plt.show()


### What happened?

There are clear differences between science words and non-science words. The biggest displacement appears at parameter 33, where science words average about 1.7 lower than non-science words (the exact value depends on which words are in each list).

Let's explore this with some other science words and see what happens with parameter 33 and a few other random parameters.

In [None]:
new_science_words = ["economics", "microbiology","zoology","biochemistry","oceanography","science","chemistry","physics","biology","meteorology","geology","mathematics","astronomy","astrophysics"]

# Pick two random comparison parameters: one below 33 and one above it
p1 = np.random.randint(32)        # 0..31
p2 = np.random.randint(16) + 34   # 34..49

print(f"  word                  P: 33     P: {p1:02d}    P: {p2:02d}")
wlist = []
for w in new_science_words:
  wa = w2v[w]
  wlist.append(wa)
  print(f"{w:20s} {wa[33]-nonscience_average[33]:>8.3f}  " \
    + f"{wa[p1]-nonscience_average[p1]:>8.3f} " \
    + f"{wa[p2]-nonscience_average[p2]:>8.3f} ")

### What Did We Find?

**Key observation:** Parameter 33 shows a clear pattern. Science words consistently have LOWER values than non-science words at dimension 33.

- Physics: very "science-ey" (P33 ≈ -2.6)
- Economics: moderately "science-ey" (P33 ≈ -1.5)
- Most science fields: negative displacement

**But is parameter 33 literally "science-ness"?**

No. The actual meaning of parameter 33 is more complex and abstract. It just happens to correlate somewhat with what we humans call "science." 

The model learned this dimension automatically by analyzing billions of words. It discovered that certain words cluster together in usage patterns, and parameter 33 captures part of that structure.

**The important insight:** Even though we can't perfectly interpret each dimension, they collectively encode meaningful semantic relationships.

---

## 📝 Question 9 (Analysis)

**Q9.** Why do science words have lower values at parameter 33 compared to non-science words? What does this dimension seem to capture?

*Think about: Is it exactly "science"? Or something more abstract that correlates with science?*

*Record your answer in the answer sheet.*

---

# Vector Arithmetic & Analogies

## The Magic of Embedding Geometry

Here's where embeddings get really interesting. Because words are represented as vectors in a geometric space, we can do **arithmetic** on them!

**The key insight:** Relationships between words are preserved as directional patterns in the embedding space.

For example:
- The direction from "france" to "paris" (country → capital)
- Should be similar to the direction from "italy" to "rome"

So if we compute: **paris - france + italy**, we should get a vector close to **rome**!
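
The idea of "similar directions" can be checked numerically. This is a minimal sketch on made-up 3-dimensional vectors (not real GloVe values): it compares the two offsets with cosine similarity and measures how close the computed point lands to "rome".

```python
import numpy as np

# Made-up 3-dimensional vectors, chosen so both capitals sit roughly
# one step along the second axis from their countries.
france = np.array([2.0, 0.5, 1.0])
paris = np.array([2.1, 1.5, 1.0])
italy = np.array([1.0, 0.4, 2.0])
rome = np.array([1.0, 1.5, 2.1])


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Direction from each country to its capital.
offset_fr = paris - france
offset_it = rome - italy
direction_similarity = cosine(offset_fr, offset_it)

# Apply the France -> Paris direction to Italy.
predicted = paris - france + italy
distance_to_rome = float(np.linalg.norm(predicted - rome))

print("similarity of the two directions:", round(direction_similarity, 3))
print("distance from prediction to rome:", round(distance_to_rome, 3))
```

When the two offsets point the same way, the prediction lands near the right capital. With real embeddings the match is only approximate, which is why we search for the nearest word.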

## How Analogies Work

We use analogies of the form: **A − B + C ≈ ?**

where:
- **A** is a *changed* form (plural, past, comparative, capital, etc.)
- **B** is the *base* form (singular, present, base adjective, country, etc.)
- **C** is a *new base* you want to transform

Examples of relationships that work well:
- **Country ↔ Capital:** paris - france + italy ≈ rome
- **Comparatives:** smaller - small + big ≈ bigger
- **Verb tenses:** walked - walk + swim ≈ swam
- **Pluralization:** children - child + person ≈ people
- **Family roles:** aunt - uncle + brother ≈ sister
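
The A − B + C recipe can be sketched end to end on a toy vocabulary. The vectors and the `solve_analogy` helper here are hypothetical, made up for illustration. One design choice worth noting: the sketch skips the three input words when ranking candidates, which is a common convention but an assumption of this sketch rather than a requirement.

```python
import numpy as np

# Made-up vectors: the second coordinate acts like a "comparative" switch.
toy = {
    "small":   np.array([-1.0, 0.0, 0.1]),
    "smaller": np.array([-1.0, 1.0, 0.1]),
    "big":     np.array([1.0, 0.0, 0.2]),
    "bigger":  np.array([1.0, 1.0, 0.2]),
    "warm":    np.array([0.2, 0.0, 1.0]),
    "warmer":  np.array([0.2, 1.0, 1.0]),
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def solve_analogy(a, b, c, vectors, topn=3):
    """Rank words by cosine similarity to vectors[a] - vectors[b] + vectors[c]."""
    target = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]  # skip the inputs
    scored = [(w, cosine(vectors[w], target)) for w in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


results = solve_analogy("smaller", "small", "big", toy)
for word, score in results:
    print(f"{word:8s} {score:.3f}")
```

Here "bigger" ranks first because the toy vectors were built so that the comparative offset is identical for every adjective.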

---

## 📝 Question 10 (Prediction)

Before running the code below, make a prediction:

**Q10.** What word should complete this analogy: **paris - france + italy = ?**

*Think about the relationship between paris and france. Apply that same relationship to italy.*

**Write your prediction in the answer sheet, then run the code to check!**

---

In [None]:
# ============================================================
#  Module 1, Activity 3: Vector Arithmetic & Analogies
#  DATA 1010 – Artificial Intelligence in Action
# ============================================================

# This cell:
#   • Defines show_analogy(), which computes A - B + C and lists the nearest words
#   • Demonstrates analogies like: paris - france + italy ≈ rome
#

# -----------------------------
# 1. Check that the word embedding model is loaded
# -----------------------------
if 'w2v' not in locals():
    # Raise an error instead of calling exit(), which can stop the notebook kernel
    raise RuntimeError("Make sure to execute the top of the notebook before trying this cell.")
else:
    print("GloVe word vectors (w2v) already loaded.\n")

# -----------------------------
# 2. Helper function: show analogy
# -----------------------------
def show_analogy(word_a, word_b, word_c, topn=5):
    """
    Compute:  word_a - word_b + word_c  ≈  ?
    and print the top similar words.
    """
    print("===============================================")
    print(f"Analogy:  {word_a}  -  {word_b}  +  {word_c}  ≈  ?")
    print("===============================================")

    # Check vocabulary
    for w in [word_a, word_b, word_c]:
        if w not in w2v:
            print(f"  • The word '{w}' is not in the model vocabulary.")
            return

    # Vector arithmetic
    result_vec = w2v[word_a] - w2v[word_b] + w2v[word_c]

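    # Find the words most similar to result_vec.
    # Note: similar_by_vector does not exclude the input words, so one of
    # word_a, word_b, or word_c may appear among the results.
    sims = w2v.similar_by_vector(result_vec, topn=topn)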

    for rank, (word, score) in enumerate(sims, start=1):
        print(f"{rank}. {word:15s}  (cosine similarity: {score:.4f})")

    print("\n")




# -----------------------------
# 3. Reliable Example Analogies
# -----------------------------
print("### Reliable Analogy Examples ###\n")

# Capital–Country (A = capital, B = country, C = new country)
show_analogy("paris",   "france",  "italy")     # → rome
show_analogy("berlin",  "germany", "spain")     # → madrid

# Comparatives (A = comparative, B = base adj, C = new base adj)
show_analogy("smaller", "small",   "big")       # → bigger
show_analogy("colder",  "cold",    "warm")      # → warmer

# Verb tenses (A = past, B = present, C = new present)
show_analogy("walked",  "walk",    "swim")      # → swam
show_analogy("made",    "make",    "think")     # → thought

# Plurals (A = plural, B = singular, C = new singular)
show_analogy("children","child",   "person")    # → people
show_analogy("dogs",    "dog",     "cat")       # → cats

# Family roles (A = female, B = male, C = new male)
show_analogy("aunt",    "uncle",   "brother")   # → sister
show_analogy("mother",  "father",  "son")       # → daughter



### What Makes Analogies Work (and Why Some Fail)

**This model is particularly good at:**
- Capital–country relationships
- Singular–plural transformations
- Present–past verb forms
- Base–comparative adjectives
- Family relationships (gender pairs)

**It is not very good at:**
- Animal → sound (e.g., dog → bark)
- Abstract "vibes" (e.g., cozy, spooky)
- Pop culture or memes
- Very specific or rare relationships

**Why?** Analogies work when the relationship is represented as a consistent directional pattern across many examples in the training data. If a relationship isn't encoded consistently in the corpus, the model won't capture it.

So if your custom analogy fails, it doesn't mean you did it wrong. It often means the relationship isn't represented as a single consistent direction in this embedding space.
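
One way to make "consistent directional pattern" concrete is to compare the offsets of several word pairs with each other. The sketch below is an illustration under assumptions: the vectors are made up, and `offset_consistency` is a hypothetical helper (mean pairwise cosine similarity between offsets), not a standard metric from any library.

```python
from itertools import combinations

import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def offset_consistency(pairs, vectors):
    """Mean pairwise cosine similarity between the (changed - base) offsets."""
    offsets = [vectors[changed] - vectors[base] for base, changed in pairs]
    sims = [cosine(u, v) for u, v in combinations(offsets, 2)]
    return float(np.mean(sims))


# Made-up vectors for illustration only.
toy = {
    # Comparatives: every pair moves in almost the same direction.
    "small": np.array([-1.0, 0.0, 0.0]), "smaller": np.array([-1.0, 1.0, 0.1]),
    "big":   np.array([1.0, 0.0, 0.0]),  "bigger":  np.array([1.0, 1.0, 0.0]),
    "warm":  np.array([0.0, 0.0, 1.0]),  "warmer":  np.array([0.1, 1.0, 1.0]),
    # Animal sounds: each pair moves in a different direction.
    "dog": np.array([1.0, 0.0, 0.0]), "bark": np.array([1.0, 1.0, 0.0]),
    "cat": np.array([0.0, 2.0, 0.0]), "meow": np.array([1.0, 2.0, 0.0]),
    "cow": np.array([0.0, 0.0, 2.0]), "moo":  np.array([0.0, 0.0, 1.0]),
}

comparative_score = offset_consistency(
    [("small", "smaller"), ("big", "bigger"), ("warm", "warmer")], toy)
sound_score = offset_consistency(
    [("dog", "bark"), ("cat", "meow"), ("cow", "moo")], toy)

print("comparatives :", round(comparative_score, 3))
print("animal sounds:", round(sound_score, 3))
```

A score near 1 means the relationship behaves like one shared direction, so A − B + C has a good chance of working. A score near 0 means each pair has its own direction, so the arithmetic has nothing consistent to reuse.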

---

## 📝 Questions 11-15 (Analysis & Experimentation)

**Q11.** After running the analogy code, was your prediction for Q10 correct? What was the top result?

**Q12.** Try your own analogy in the interactive section below. Record your input (A, B, C) and the top result. Did it work as expected?

**Q13.** Which type of analogy works best in this model: capital-country, comparative adjectives, verb tenses, or family relationships? Look at the cosine similarity scores.

**Q14.** Why do you think some analogies (like animal → sound) don't work well in word embeddings?

*Hint: Think about how often these relationships appear in similar sentence contexts.*

**Q15.** How does vector arithmetic (like king - man + woman = queen) demonstrate that embeddings capture meaning relationships rather than just word similarity?

*Think about: What does the direction from "man" to "king" represent? Why does adding that direction to "woman" give "queen"?*

*Record your answers in the answer sheet.*

---

Now try your own analogy below!

In [None]:
# -----------------------------
# 4. Try your own analogy
# -----------------------------
print("Now try your own analogy!")
print("Enter three words to compute:  A - B + C  ≈  ?")
print("Example:  A = king, B = man, C = woman\n")

word_a = input("Enter word A: ").strip().lower()
word_b = input("Enter word B: ").strip().lower()
word_c = input("Enter word C: ").strip().lower()

show_analogy(word_a, word_b, word_c)


## ✅ Module 1 Complete!

Excellent work! You've explored pre-trained word embeddings and discovered some fascinating patterns:

- **GloVe vectors** are trained on billions of words but use only 50 dimensions
- **Individual dimensions** can correlate with abstract concepts (like parameter 33 and "science-ness"), even though no dimension maps cleanly onto a single idea
- **Vector arithmetic** solves analogies: paris - france + italy ≈ rome
- **Geometric relationships** encode meaning: similar directions = similar relationships
- **Not all analogies work** - only relationships consistently represented in training data

**What's next?**

In Module 0, you embedded **individual words**.
In Module 1, you explored **word-level relationships**.

**In Module 2**, you'll embed **entire sentences** and build a semantic search engine that finds documents by meaning, not just keywords. This is the foundation of modern information retrieval systems and RAG (Retrieval-Augmented Generation).

**Ready?** Move on to **Module 2: Sentence Embeddings & Semantic Search**!