# VL06 – Embeddings and Vector Semantics

In this seminar, we move beyond sparse count-based models (like Bag-of-Words and TF-IDF)  
to explore **dense vector representations** of language.  

We will experiment hands-on with pre-trained embeddings to:
- measure **similarity** and **analogy** between words,  
- visualize semantic neighborhoods, and  
- build simple **document representations** using embedding averages and TF-IDF weighting.

## 1. Loading Pre-trained Word Embeddings

Pre-trained embeddings are word vectors that have already been learned from a very large corpus (e.g., Wikipedia, Gigaword).  
We’ll use them to represent words as **dense vectors** without training anything ourselves.

**Step 1 – Install and prepare Gensim**

Activate your Conda environment and install **Gensim**, a popular Python library for NLP vector models:

```bash
$ pip install gensim
```

**Step 2 – Download the model (only once)**

If your notebook **has internet access**, you can download directly with Gensim’s built-in downloader.  
Otherwise, download from the terminal using the provided script:

```bash
$ python scripts/download_gensim.py glove-wiki-gigaword-50 models/
```

This script saves the model under `models/glove-wiki-gigaword-50/`.

**Step 3 – Set the environment variable**

We tell Gensim where to find the downloaded models by setting the environment variable `GENSIM_DATA_DIR`. We do this in the code below, right before importing `gensim`.

In [None]:
import os

# Point GENSIM_DATA_DIR at the expected local path
GENSIM_DATA_DIR = os.path.abspath("../../models")
os.environ["GENSIM_DATA_DIR"] = GENSIM_DATA_DIR

# Verify that the model is present locally
model_name = "glove-wiki-gigaword-50"
model_path = os.path.join(GENSIM_DATA_DIR, model_name)

if os.path.exists(model_path):
    print(f"Model found at: {model_path}. We will load the local model.")
else:
    print(f"Model not found at {model_path}. Attempting to download  `{model_name}`.")


# Load the model (imported only now, after GENSIM_DATA_DIR has been set)
import gensim.downloader as api

model = api.load(model_name)
model.most_similar("apricot")

## 2. Inspecting the Model and Vectors

We’ll quickly **summarize the embedding model** (vocab size, dimensionality, memory), then **probe the vocabulary** and **peek at a word vector**.  
Where available, we’ll also check the **token count** stored by the model.

In [None]:
# Summary helper: prints key facts about a Gensim KeyedVectors model
import numpy as np

def summarize_kv(model, name="(unknown)"):    
    print(f"=== Model info: {name} ===")
    print("Type:            ", type(model).__name__)
    print("Vector size (d): ", model.vector_size)
    print("Vocab size (|V|):", len(model))
    
    # matrix shapes (if present)
    if hasattr(model, "vectors"):
        print("Matrix shape:    ", model.vectors.shape, "(rows = |V|, cols = d)")
        print("Memory (approx): ", f"{model.vectors.nbytes/1e6:.1f} MB (vectors only)")
    # first few tokens (sorted by frequency order if available)
    print("Top 10 tokens:   ", model.index_to_key[:10])

    # norms cache info: is the model pre-computing ||v||
    try:
        nv = model.get_normed_vectors()  # computes/caches if missing
        print("Norms cached:    ", "yes" if nv is not None else "no")
    except Exception:
        print("Norms cached:     (n/a)")

summarize_kv(model, model_name)

### 2.1 Is a word in the vocabulary?

Check presence and, if present, its integer index in the model’s vocab.

In [None]:
word = "apricot"  #  <- try changing this
in_vocab = word in model.key_to_index

print(f"'{word}' in vocabulary?  {in_vocab}")
if in_vocab:
    print("Index:", model.key_to_index[word])

### 2.2 Access the vector
Vectors are NumPy arrays of length **d**. Let's look up the vector for the `word` chosen above.

In [None]:
# Look up the word's vector representation
v = model[word]
print(v)
print("Vector shape:", v.shape)

### 2.3 (If available) How frequent is this token?

Some pretrained packages include token **counts** from the training corpus.

In [None]:
try:
    count = model.get_vecattr(word, "count")
    print(f"Word '{word}' appears {count} times in the training corpus")
except Exception:
    print("No counts available")

## 2. Similarity

We measure how similar two word vectors are by the **cosine of the angle** between them:  
$$
\cos(\theta)=\frac{\mathbf{v}_a \cdot \mathbf{v}_b}{\|\mathbf{v}_a\|\,\|\mathbf{v}_b\|}
$$
Cosine ≈ 1 → very similar (vectors point in the same direction); Cosine ≈ 0 → unrelated (vectors are orthogonal); Cosine ≈ −1 → vectors point in opposite directions.

### 2.1 Computing cosine similarity between vectors

In [None]:
# Manual cosine to demystify what the library computes
def cosine_manual(a: np.ndarray, b: np.ndarray) -> float:
    dot = float(np.dot(a, b))
    na  = float(np.linalg.norm(a))
    nb  = float(np.linalg.norm(b))
    return dot / (na * nb)

w1, w2 = "apricot", "peach"    # <- try changing to ("apricot","car")
v1, v2 = model[w1], model[w2]

cos_man = cosine_manual(v1, v2)
print(f"Manual cosine({w1}, {w2}) = {cos_man:.4f}")

**Check against Gensim’s API.**  
`model.similarity(a, b)` computes the same cosine using the model’s (possibly cached) normalized vectors.
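
Why normalized vectors give the same number: once both vectors are scaled to length 1, the denominator of the cosine formula is 1, so the cosine reduces to a plain dot product. The sketch below checks this on two toy vectors (the values are made up for illustration and are not taken from the GloVe model).

```python
import numpy as np

# Toy vectors (hypothetical values, not taken from the GloVe model)
a = np.array([3.0, 4.0, 0.0])
b = np.array([4.0, 3.0, 0.0])

# Route 1: the cosine formula applied to the raw vectors
cos_raw = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Route 2: scale each vector to length 1 first, then take a plain dot product
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
cos_unit = float(np.dot(a_unit, b_unit))

print(f"raw formula:  {cos_raw:.4f}")   # 24 / (5 * 5) = 0.9600
print(f"unit vectors: {cos_unit:.4f}")  # same value
```

Normalizing once and reusing the unit vectors is what makes repeated similarity queries cheap.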

In [None]:
# Let's compare it to the built-in function
cos_api = model.similarity(w1, w2)
print(f"Gensim similarity({w1}, {w2}) = {cos_api:.4f}")

# They should match up to tiny floating-point differences:
np.testing.assert_allclose(cos_man, cos_api, rtol=1e-6, atol=1e-7) # if not, it will throw an assertion error
print("Manual and built-in cosine match.")

### 2.2 Nearest Neighbors

Given a word, `most_similar()` returns the **top-N nearest neighbors** by cosine similarity.
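
Conceptually, a nearest-neighbor query is a brute-force ranking: compute the cosine between the query vector and every other vector, then keep the best N. The sketch below shows that idea on a hand-made toy vocabulary. The helper `nearest_neighbors` and all vectors are hypothetical, and this is not a description of how Gensim implements the search internally.

```python
import numpy as np

# Hypothetical 2-D "embeddings", chosen by hand for illustration only
toy_vocab = ["king", "queen", "prince", "car", "truck"]
toy_vectors = np.array([
    [0.9, 0.8],
    [0.8, 0.9],
    [0.9, 0.5],
    [-0.7, 0.2],
    [-0.8, 0.1],
])

def nearest_neighbors(word, vocab, vectors, topn=3):
    """Rank all other words by cosine similarity to `word` (brute force)."""
    # Scale every row to unit length so a dot product equals the cosine
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    query = unit[vocab.index(word)]
    sims = unit @ query            # cosine to every word at once
    order = np.argsort(-sims)      # highest similarity first
    ranked = [(vocab[i], float(sims[i])) for i in order if vocab[i] != word]
    return ranked[:topn]

toy_neighbors = nearest_neighbors("king", toy_vocab, toy_vectors, topn=2)
print(toy_neighbors)
```

In this toy space, *queen* and *prince* rank above the vehicle words because their vectors point in nearly the same direction as *king*.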

In [None]:
# neareast neighboards
model.most_similar("king", topn=10)

## 3. Analogies and Simple Geometry

“A : B :: C : ?” ≈ find the word nearest to **v(B) − v(A) + v(C)** (using cosine similarity).  
Many relations appear as **consistent directions** in the embedding space (e.g., gender, capital–country, plant–produce).

For example, given the analogy *man : king :: woman : ?*, we can express it as vector operations:

$$
v(\text{king}) - v(\text{man}) = v(x) - v(\text{woman})
$$

$$
v(x) = v(\text{king}) - v(\text{man}) + v(\text{woman})
$$

In a Gensim model, this can be computed as:

```python
model.most_similar(positive=["king", "woman"], negative=["man"])
```

This retrieves the word(s) whose vector **v(x)** is closest (by cosine similarity) to the computed target vector. We expect to see **queen** as a result — illustrating how linear relations can capture semantic structure.
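
To make the arithmetic concrete, here is a minimal sketch of the same procedure on hand-picked 2-D toy vectors, where one axis loosely stands for "royalty" and the other for "gender". The helper `solve_analogy` and all values are hypothetical. Like Gensim's `most_similar`, the sketch leaves the query words out of the candidates; otherwise an input word could itself come out as the nearest match.

```python
import numpy as np

# Hypothetical 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender" (hand-picked)
toy = {
    "man":   np.array([0.1, -1.0]),
    "woman": np.array([0.1,  1.0]),
    "king":  np.array([1.0, -1.0]),
    "queen": np.array([1.0,  1.0]),
    "apple": np.array([-1.0, 0.1]),
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(a, b, c, vectors):
    """Return the word closest to v(b) - v(a) + v(c), skipping the query words."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = {w: cosine(target, v)
                  for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=candidates.get)

toy_answer = solve_analogy("man", "king", "woman", toy)
print(toy_answer)  # queen
```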


In [None]:
model.most_similar(positive=["king","woman"], negative=["man"])

### 3.1 Inspect the following relationships

| Vector Operation | Geometric Expression | Relation Type | Expected Result |
|------------------|----------------------|----------------|-----------------|
| `["grape", "tree"] – ["apple"]` | **v(grape) + v(tree) − v(apple)** | “X grows on Y” → *grape : vine* as *apple : tree* | **vine** |
| `["king", "woman"] – ["man"]` | **v(king) − v(man) + v(woman)** | **Gender** swap | **queen** |
| `["paris", "italy"] – ["france"]` | **v(paris) − v(france) + v(italy)** | **Capital–country** | **rome** |
| `["doctor", "woman"] – ["man"]` | **v(doctor) − v(man) + v(woman)** | **Gender** direction applied to occupation | *(may show stereotype)* |

---

**Interpretation tips:**
- Results reflect **usage patterns** (distributional similarity).  
- Unexpected outputs can indicate **polysemy**, **frequency effects**, or **bias** in the training data.

In [None]:
tests = [
    # --- PLANT / PRODUCE RELATION (grow-on) ---
    (["grape", "tree"], ["apple"]),        # grape : vine :: apple : tree  → expect "vine"

    # --- GENDER DIRECTION / ROLE SWAP ---
    (["king", "woman"], ["man"]),          # gender swap -> expect "queen"
    (["doctor", "woman"], ["man"]),        # occupation with gender direction -> may reveal bias

    # --- GEO-POLITICAL RELATIONS ---
    (["paris", "italy"], ["france"]),      # capital–country -> expect "rome"
    (["german", "spain"], ["germany"]),    # demonym/language transfer -> expect "spanish" (or "spaniard")

    # --- MORPHOLOGY / GRAMMAR ---
    (["faster", "slow"], ["fast"]),        # comparative transfer -> expect "slower"
    (["walked", "run"], ["walk"]),         # past tense transfer -> expect "ran"
    (["cups", "chair"], ["cup"]),          # pluralization transfer -> expect "chairs"

    # --- SEMANTIC HIERARCHY / HYPERNYM TRANSFER (noisier) ---
    (["animal", "tree"], ["dog"]),         # dog -> animal applied to tree -> expect "plant" 
    (["apple", "iphone"], ["fruit"]),      # tricky one with static vectors (apple has two meanings but one vector)

    # --- PART–WHOLE / MERONYMY TRANSFER (often noisy) ---
    (["wheel", "arm"], ["car"]),           # car→wheel applied to arm -> expect "hand" 
]

for pos, neg in tests:
    ans = model.most_similar(positive=pos, negative=neg, topn=5)
    print(f"\n{pos} - {neg}  ->  {[f'{w}:{s:.2f}' for w,s in ans]}")

### 3.2 Reflect on analogies
1. Which types of relations (e.g., gender, grammar, geography) seem to work best in these results, and which ones break down or produce noisy answers?  
2. Some results (like *doctor − man + woman → nurse*) reflect social or cultural patterns rather than logic.  
   What does this tell us about the data used to train word embeddings?  
3. When the model makes errors (e.g., *wheel − car + arm → sling*), what does that reveal about what word embeddings capture, and what they *don’t*?

## 4. From words to documents

**Goal.** Turn each FAQ into a single **document vector** so we can do semantic retrieval.  
We’ll try two approaches:

1) **Mean pooling** of word embeddings  
2) **TF-IDF–weighted** mean pooling (to emphasize informative terms)

We’ll use spaCy for tokenization/lemmatization and Gensim for word vectors.

In [None]:
import pandas as pd

import spacy
nlp = spacy.load("en_core_web_sm")

faqs = [
    "How do I reset my password, and are passwords case-sensitive?", 
    "How to update my email address, and is updating necessary for security?", 
    "I can't log in to my account. What are the common login issues?", 
    "How do I change my payment method for the upcoming billing cycle?", 
    "I want to cancel my active monthly subscription and stop future payments.",
    "What is the return policy if a product is still under warranty?",
    "How to change my username if I am currently logging in with my email?",
    "How do I enable two-factor authentication for my account?",
    "How to update the billing address associated with my credit card.",
    "How do I account for changes in my order? Can I cancel it?",
    "I forgot my credentials—how can I recover access to my account?",
    "How do I change the card on file for future charges?",
    "How can I stop auto-renewal and end my membership?",
    "Do you offer refunds or exchanges, and how are returns handled?",
    "Can I switch the email associated with my profile?",
    "Where can I activate two-step verification for extra security?",
    "How do I modify the invoice address linked to my account?",
    "My sign-in keeps failing—how do I troubleshoot login problems?",
    "Can I change the name that appears on my account?",
    "I need to update my payment details before the next bill—how do I do that?"
]


def tokenize(text):
    doc = nlp(text)
    return [t.lemma_.lower() for t in doc
            if not t.is_stop and not t.is_punct and t.is_alpha]

### 4.1 Mean-Pooled Embedding per Document

**Idea.** Average all in-vocab word vectors in a document.  
**Pros:** simple, fast. **Cons:** treats all words equally.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

def doc_vector(text, model=model):
    toks = tokenize(text)
    vecs = [model[t] for t in toks if t in model.key_to_index]
    if not vecs:
        return np.zeros(model.vector_size, dtype=np.float32)
    return np.mean(vecs, axis=0)

# Build matrix once
emb_matrix = np.vstack([doc_vector(d) for d in faqs])

def embed_search(query_text, topk=3):
    q_vec = doc_vector(query_text)
    sims = cosine_similarity(q_vec.reshape(1, -1), emb_matrix)[0]
    return pd.DataFrame({"similarity": sims, "faq": faqs}).sort_values("similarity", ascending=False).head(topk)


embed_search("How do I change my password?", topk=3)

### 4.2 TF-IDF–Weighted Embedding per Document

**Idea.** Weight each word vector by its **TF-IDF** before averaging:  
$$
v_{\text{doc}} = \frac{\sum_{w \in d} \text{tfidf}(w)\, v(w)}{\sum_{w \in d} \text{tfidf}(w)}
$$
**Benefit.** Emphasizes informative terms (e.g., *refund*, *authentication*) while keeping semantic generalization.
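
One detail of the formula is easy to miss. If the average runs over token *occurrences* and each occurrence is weighted by its IDF, a word that appears twice is counted twice, so its total weight is (count in document) × IDF. That is a raw-count TF-IDF weight. The sketch below checks this equivalence on toy data; the vectors and IDF values are made up, and it uses raw counts rather than scikit-learn's normalized TF-IDF scores.

```python
import numpy as np

# Hypothetical vectors and IDF values, for illustration only
toy_vecs = {"refund": np.array([1.0, 0.0]), "order": np.array([0.0, 1.0])}
toy_idf = {"refund": 3.0, "order": 1.0}

tokens = ["refund", "order", "order"]   # "order" occurs twice in this document

# Route 1: one weight per token occurrence, each weight = IDF
per_token = np.average([toy_vecs[t] for t in tokens], axis=0,
                       weights=[toy_idf[t] for t in tokens])

# Route 2: one weight per distinct word, weight = raw count * IDF
types = sorted(set(tokens))
tfidf = [tokens.count(w) * toy_idf[w] for w in types]
per_type = np.average([toy_vecs[w] for w in types], axis=0, weights=tfidf)

# Unweighted mean for comparison
plain_mean = np.mean([toy_vecs[t] for t in tokens], axis=0)

print("plain mean:      ", plain_mean)
print("IDF per token:   ", per_token)
print("TF*IDF per type: ", per_type)
```

Compared with the plain mean, both weighted versions pull the document vector toward the rarer word *refund*.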

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit TF-IDF on the same corpus (important: fit before reading idf values)
vectorizer_tfidf = TfidfVectorizer(tokenizer=tokenize, token_pattern=None)
tfidf_matrix_dummy = vectorizer_tfidf.fit_transform(faqs)  # fit to populate vocabulary + IDF
idf = dict(zip(vectorizer_tfidf.get_feature_names_out(), vectorizer_tfidf.idf_))


def doc_vector_tfidf_weighted(text, model=model, idf=idf):
    toks = tokenize(text)
    vecs, weights = [], []
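    for t in toks:
        # Skip tokens missing from the embedding vocab or from the fitted TF-IDF vocab
        if t in model.key_to_index and t in idf:
            vecs.append(model[t])
            # One IDF weight per occurrence: a repeated token is counted each time,
            # so its total weight is (count in document) * IDF
            weights.append(idf[t])
    if not vecs:
        return np.zeros(model.vector_size, dtype=np.float32)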
    weights = np.array(weights)
    return np.average(vecs, axis=0, weights=weights)

# Precompute hybrid document vectors
emb_tfidf_matrix = np.vstack([doc_vector_tfidf_weighted(doc) for doc in faqs])

# Query function
def hybrid_search(query_text, topk=3):
    q_vec = doc_vector_tfidf_weighted(query_text)
    sims = cosine_similarity(q_vec.reshape(1, -1), emb_tfidf_matrix)[0]
    return pd.DataFrame({"similarity": sims, "faq": faqs}).sort_values("similarity", ascending=False).head(topk)

### 4.3 Compare Mean vs. TF-IDF–Weighted

Try a few queries that have **low keyword overlap** but **high semantic overlap**.

In [None]:
from IPython.display import display  # explicit import; notebooks provide display() automatically

def compare_doc_retrieval(query_text, topk=3):
    print(f"\nQUERY: {query_text}\n" + "-"*60)
    print("[Embeddings: mean pooling]")
    display(embed_search(query_text, topk))
    print("\n[Embeddings: TF-IDF weighted]")
    display(hybrid_search(query_text, topk))

queries = [
    "I can’t remember my password and need to get back into my account.",
    "Please end my membership so I’m not charged again next month.",
    "How can I turn on additional security when signing in?",
    "Set up two-step verification for logins"
]

# Run the comparison for each query
for q in queries:
    compare_doc_retrieval(q, topk=3)

## 5. (Extra) Visualizing Embeddings
Each word in the model we worked with in this seminar is a point in a **50-dimensional** space. We can’t plot 50D, so we use **PCA (Principal Component Analysis)** to find the **two directions** that explain the most variation in these points, and then **project** the points onto those two directions. This gives a 2D picture that preserves as much of the global structure as possible with a *linear* projection.

**How PCA works (intuition).**
- Center the data, i.e. the vectors of the chosen words (subtract the mean vector).
- Find orthogonal axes that capture the **largest variance** in the data (PC1, then PC2).
- Project each 50D vector onto these two axes to get 2D coordinates.

In the resulting graph, points that are **close** in 2D are often semantically related (e.g., animals vs. vehicles). But note that with only a small set of words, the view can change a lot depending on which words you include.
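
The three steps above can be sketched by hand with NumPy's SVD. This is only an illustration on random stand-in data (the array `X_toy` is an assumption, not real word vectors). The cells below use scikit-learn's `PCA`, which should give the same 2D coordinates up to a possible sign flip per axis.

```python
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(12, 50))   # stand-in for 12 word vectors of dimension 50

# 1) Center: subtract the mean vector
X_centered = X_toy - X_toy.mean(axis=0)

# 2) Principal axes: right singular vectors of the centered data,
#    returned by SVD in order of decreasing variance
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
axes = Vt[:2]                       # PC1 and PC2, each of length 50

# 3) Project each 50-D vector onto the two axes
Z_toy = X_centered @ axes.T         # shape (12, 2)

# Share of the total variance captured by each of the two components
explained = (S[:2] ** 2) / np.sum(S ** 2)
print("2-D shape:", Z_toy.shape)
print("variance explained by PC1, PC2:", np.round(explained, 3))
```

The explained-variance share is worth checking on real data: if the first two components capture only a small fraction, distances in the 2D plot should be read with caution.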

### 5.1 Semantic Neighborhoods 

We project 50-D word vectors to 2-D with **PCA** to glimpse their geometry. Notice how **animals**, **vehicles**, and **people/royalty** form separate clusters—nearer points ≈ higher semantic similarity (with some distortion from the 2-D projection).

In [None]:
# Semantic similarity between different words

from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt

words = ["dog","cat","wolf","tiger","car","truck","bus","train","king","queen","man","woman"]
X = np.vstack([model[w] for w in words])
Z = PCA(n_components=2).fit_transform(X)

plt.figure(figsize=(6,5))
plt.scatter(Z[:,0], Z[:,1])
for (x,y), w in zip(Z, words):
    plt.text(x+0.01, y+0.01, w, fontsize=9)
plt.title("PCA projection of selected words")
plt.show()

### 5.2 Visualizing a Semantic Direction
We can project a few words onto 2D (using PCA) to see if certain relationships form clear geometric patterns.  
Here, we visualize the **gender direction** in the embedding space — words like *man–woman* or *king–queen* often align along a similar axis.
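
A plot is only suggestive, so it can help to put a number on "align along a similar axis": take the difference vector of each pair and compute the cosine between the two differences. The sketch below does this on hand-picked toy vectors; the helper `offset_alignment` and all values are hypothetical. Since it only indexes `vectors[word]`, passing the loaded `model` instead of the toy dictionary should work the same way.

```python
import numpy as np

# Hypothetical 3-D vectors, hand-picked so the last coordinate acts as a "gender" axis
toy_vecs = {
    "man":   np.array([0.2, 0.1, -1.0]),
    "woman": np.array([0.2, 0.1,  1.0]),
    "king":  np.array([1.0, 0.3, -0.9]),
    "queen": np.array([1.0, 0.2,  0.9]),
}

def offset_alignment(pair_a, pair_b, vectors):
    """Cosine between the difference vectors v(a2) - v(a1) and v(b2) - v(b1)."""
    d_a = vectors[pair_a[1]] - vectors[pair_a[0]]
    d_b = vectors[pair_b[1]] - vectors[pair_b[0]]
    return float(np.dot(d_a, d_b) / (np.linalg.norm(d_a) * np.linalg.norm(d_b)))

alignment = offset_alignment(("man", "woman"), ("king", "queen"), toy_vecs)
print(f"alignment = {alignment:.3f}")
```

A value near 1 means the two pairs differ in almost the same direction; values near 0 mean the offsets are unrelated.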

In [None]:
# Visualizing the "gender direction" in embedding space

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

# Choose words along the gender / role axis
words = ["king", "queen", "man", "woman", "doctor", "nurse", "actor", "actress", "father", "mother"]

# Stack their vectors
X = np.vstack([model[w] for w in words])

# Project from 50D → 2D with PCA
Z = PCA(n_components=2).fit_transform(X)

# Plot
plt.figure(figsize=(6,5))
plt.scatter(Z[:,0], Z[:,1], color="steelblue")

# Annotate
for (x,y), w in zip(Z, words):
    plt.text(x+0.01, y+0.01, w, fontsize=9)

plt.title("Gender Direction in Embedding Space (PCA Projection)")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.axhline(0, color="gray", lw=0.5)
plt.axvline(0, color="gray", lw=0.5)
plt.show()

## References
- Gensim API reference: https://radimrehurek.com/gensim/apiref.html