# Week 6: From Autoencoders to Embeddings - Homework

**ML2: Advanced Machine Learning**

**Estimated Time**: 1 hour

---

This homework combines programming exercises and knowledge-based questions to reinforce this week's concepts.

## Setup

Run this cell to import necessary libraries:

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn

# Set random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print('✓ Libraries imported successfully')

---
## Part 1: Programming Exercises (60%)

Complete the following programming tasks. Read each description carefully and implement the requested functionality.

### Exercise 1: Word Embeddings Capture Semantics

**Time**: 10 min

Explore how word embeddings encode semantic relationships through vector arithmetic.

In [None]:
import numpy as np

# Simplified word embeddings (real Word2Vec vectors typically have ~300 dimensions; 3 are used here for clarity)
embeddings = {
    'king': np.array([0.9, 0.1, 0.8]),
    'queen': np.array([0.9, 0.9, 0.7]),
    'man': np.array([0.8, 0.1, 0.3]),
    'woman': np.array([0.8, 0.9, 0.2]),
    'prince': np.array([0.85, 0.15, 0.75]),
    'princess': np.array([0.85, 0.85, 0.65])
}

def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def find_closest(target_vec, embeddings, exclude=()):  # tuple default avoids the mutable-default pitfall
    best_word, best_sim = None, -1
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        sim = cosine_similarity(target_vec, vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word, best_sim

# Analogy: king - man + woman ≈ ?
result_vec = embeddings['king'] - embeddings['man'] + embeddings['woman']
word, similarity = find_closest(result_vec, embeddings, exclude=['king', 'man', 'woman'])

print(f"king - man + woman ≈ {word} (similarity: {similarity:.3f})")

# TODO: Try other analogies. What pattern emerges?

---
## Part 2: Knowledge Questions (40%)

Answer the following questions to test your conceptual understanding.

### Question 1 (Short Answer)

**Question 1 - Distributional Hypothesis**

"You shall know a word by the company it keeps" - Firth (1957)

Word embeddings are learned from word co-occurrence patterns.

Explain:
1. Why do words with similar MEANING end up with similar VECTORS?
2. How does this differ from a one-hot encoding?
3. What does this tell you about what language models learn?

**Hint**: Words used in similar contexts (appear near similar words) get similar embeddings.
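
As a starting point for part 2, the sketch below contrasts the two representations. It reuses the toy 3-dim vectors from Exercise 1 (illustrative values, not trained embeddings); the four-word `vocab` and the variable names are assumptions made for this example.

```python
import numpy as np

def cosine_similarity(v1, v2):
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

vocab = ['king', 'queen', 'man', 'woman']

# One-hot: every word gets its own axis, so two distinct words never share a nonzero entry.
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

# Toy dense vectors copied from Exercise 1.
dense = {
    'king': np.array([0.9, 0.1, 0.8]),
    'queen': np.array([0.9, 0.9, 0.7]),
}

one_hot_sim = cosine_similarity(one_hot['king'], one_hot['queen'])
dense_sim = cosine_similarity(dense['king'], dense['queen'])

print(f"one-hot similarity(king, queen): {one_hot_sim:.3f}")
print(f"dense similarity(king, queen):   {dense_sim:.3f}")
```

Try other word pairs in the one-hot table and compare the similarities you get.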

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 2 (Short Answer)

**Question 2 - Vector Arithmetic Magic**

king - man + woman ≈ queen

This works because embeddings capture semantic relationships as directions in vector space.

Explain:
1. What direction in vector space does (king - man) represent?
2. Why does adding 'woman' give you 'queen'?
3. What does this reveal about how embeddings encode relationships?

**Hint**: (king - man) ≈ 'royalty' direction. Adding 'woman' = royal + female.
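
One way to explore the idea of "directions" is to compare difference vectors directly. The sketch below uses the toy vectors from Exercise 1, which were hand-built so that the offsets line up exactly; in trained embeddings the match is only approximate. The variable names are chosen for this example.

```python
import numpy as np

# Toy vectors from Exercise 1 (illustrative, not trained embeddings).
embeddings = {
    'king': np.array([0.9, 0.1, 0.8]),
    'queen': np.array([0.9, 0.9, 0.7]),
    'man': np.array([0.8, 0.1, 0.3]),
    'woman': np.array([0.8, 0.9, 0.2]),
}

# Offset between a royal word and its non-royal counterpart.
royal_from_male = embeddings['king'] - embeddings['man']
royal_from_female = embeddings['queen'] - embeddings['woman']

# Offset between a female word and its male counterpart.
gender_from_common = embeddings['woman'] - embeddings['man']
gender_from_royal = embeddings['queen'] - embeddings['king']

print("king - man    =", np.round(royal_from_male, 2))
print("queen - woman =", np.round(royal_from_female, 2))
print("woman - man   =", np.round(gender_from_common, 2))
print("queen - king  =", np.round(gender_from_royal, 2))

# The analogy is the same statement rearranged: man + (king - man) + (woman - man).
analogy = embeddings['king'] - embeddings['man'] + embeddings['woman']
print("king - man + woman =", np.round(analogy, 2))
```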

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 3 (Multiple Choice)

**Question 3 - Static vs Contextual Embeddings**

Word2Vec gives the word "bank" the SAME embedding whether it means "river bank" or "financial bank".

What problem does this create?

A) Too much memory usage
B) Loss of word-sense disambiguation
C) Slower computation
D) Unable to handle rare words

**Hint**: Static = one vector per word, ignoring context. This is a problem for polysemous words.
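
A minimal sketch of what "static" means in practice: the embedding table is a plain lookup keyed by the word alone. The table values, the two example sentences, and the helper `embed_sentence` are hypothetical and exist only for illustration.

```python
# A static embedding table: the key is the word, nothing else.
static_table = {
    'bank': (0.2, 0.7, 0.1),   # hypothetical values for illustration
    'river': (0.1, 0.9, 0.3),
    'money': (0.8, 0.2, 0.4),
}

def embed_sentence(tokens, table):
    """Look up each known token; the surrounding words play no role."""
    return [table[tok] for tok in tokens if tok in table]

river_sentence = ['river', 'bank']
finance_sentence = ['money', 'bank']

bank_in_river = embed_sentence(river_sentence, static_table)[-1]
bank_in_finance = embed_sentence(finance_sentence, static_table)[-1]

# Compare the vector for 'bank' in the two sentences.
print(bank_in_river == bank_in_finance)
```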

**Your Answer**: [Write your answer here - e.g., 'B']

**Explanation**: [Explain why this is correct]

### Question 4 (Short Answer)

**Question 4 - Skip-gram vs CBOW**

Word2Vec has two architectures:
- Skip-gram: Predict context from target word
- CBOW: Predict target word from context

Explain:
1. Which would work better for rare words?
2. Which is computationally more efficient?
3. Why might they learn slightly different embeddings?

**Hint**: Skip-gram gets more training examples per occurrence (multiple context words).
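
To make the two setups concrete, the sketch below builds the training examples each architecture would see from one short sentence. It is a simplified illustration: the window size of 2, the example sentence, and the helper names `skipgram_pairs` and `cbow_examples` are assumptions, and real implementations add details such as subsampling and dynamic windows.

```python
def skipgram_pairs(tokens, window=2):
    """One (target, context_word) pair per context word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """One (context_words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, target))
    return examples

tokens = "the quick brown fox jumps".split()
sg = skipgram_pairs(tokens)
cb = cbow_examples(tokens)

print("skip-gram examples:", len(sg))
print("CBOW examples:     ", len(cb))
print("skip-gram for 'brown':", [p for p in sg if p[0] == 'brown'])
print("CBOW for 'brown':     ", cb[2])
```

Count how many examples involve a single occurrence of `'brown'` in each case, and think about what that means for a word that appears only a handful of times in the corpus.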

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 5 (Short Answer)

**Question 5 - Negative Sampling Efficiency**

Naive Word2Vec would compute softmax over 50,000+ vocabulary words for each training example. This is slow.

Negative sampling instead draws a few random "negative" words for each training pair and trains a binary classifier to tell the true context word apart from them.

Explain: How does this make training tractable?

**Hint**: Binary classification on 5-10 words vs softmax over 50,000 words.
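
The sketch below compares the work done per training pair under the two objectives. It is an illustration under stated assumptions, not the reference Word2Vec implementation: the weights are random and untrained, the embedding size of 16 is arbitrary, and negatives are drawn uniformly, whereas Word2Vec samples them from a smoothed unigram distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, num_negatives = 50_000, 16, 5

# Randomly initialised input and output embedding tables (untrained).
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))

target_id, context_id = 10, 20  # arbitrary word ids for the example

def full_softmax_loss(target_id, context_id):
    """Cross-entropy over the whole vocabulary: vocab_size dot products."""
    scores = W_out @ W_in[target_id]            # shape (vocab_size,)
    scores -= scores.max()                      # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return -log_probs[context_id], scores.size

def negative_sampling_loss(target_id, context_id, negative_ids):
    """Binary logistic loss on 1 positive + k negatives: k + 1 dot products."""
    v = W_in[target_id]
    pos_score = W_out[context_id] @ v
    neg_scores = W_out[negative_ids] @ v
    # -log(sigmoid(x)) == logaddexp(0, -x); negatives use sigmoid(-x).
    loss = np.logaddexp(0, -pos_score) + np.logaddexp(0, neg_scores).sum()
    return loss, 1 + len(negative_ids)

# Uniform sampling is a simplification (see the note above).
negative_ids = rng.integers(0, vocab_size, size=num_negatives)

softmax_loss, softmax_dots = full_softmax_loss(target_id, context_id)
ns_loss, ns_dots = negative_sampling_loss(target_id, context_id, negative_ids)

print(f"full softmax:      {softmax_dots} dot products per pair")
print(f"negative sampling: {ns_dots} dot products per pair")
```

The same ratio applies to the gradient step: only the rows of `W_out` that were scored need to be updated.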

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 6 (Multiple Choice)

**Question 6 - Embedding Dimension Selection**

Word2Vec embeddings are typically 300-dimensional. What happens if you use 5 dimensions? What about 10,000?

A) 5-dim loses semantic information, 10,000-dim overfits
B) 5-dim is better (simpler), 10,000-dim is worse
C) Dimension doesn't matter
D) Larger is always better

**Hint**: Too small = can't capture complexity. Too large = overfitting and computational cost.

**Your Answer**: [Write your answer here - e.g., 'B']

**Explanation**: [Explain why this is correct]

### Question 7 (Short Answer)

**Question 7 - GloVe vs Word2Vec**

Word2Vec: Learn from local context windows
GloVe: Learn from global co-occurrence statistics

Explain: What's the conceptual difference? Why might GloVe capture certain relationships better?

**Hint**: Global statistics = counts of how often words appear together across entire corpus.
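
To see what "global statistics" look like, the sketch below counts co-occurrences across a whole toy corpus in one pass. GloVe fits its vectors to the logarithm of counts like these, while Word2Vec updates its vectors one context window at a time. The three-sentence corpus, the window size, and the helper `cooccurrence_counts` are assumptions for this example, and the sketch omits GloVe's distance weighting.

```python
from collections import Counter

# Tiny made-up corpus for illustration.
corpus = [
    "ice is cold and solid",
    "steam is hot and gas",
    "ice and steam are water",
]

def cooccurrence_counts(sentences, window=2):
    """Count symmetric word pairs within `window` positions, over the whole corpus."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            for j in range(i + 1, min(len(tokens), i + window + 1)):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts

counts = cooccurrence_counts(corpus)

for pair in [("ice", "cold"), ("steam", "hot"), ("ice", "hot"), ("is", "and")]:
    print(pair, counts[pair])
```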

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 8 (Short Answer)

**Question 8 - Bias in Embeddings**

Word embeddings trained on web text show gender bias:
"doctor - man + woman ≈ nurse"

Explain:
1. Why do embeddings encode societal biases?
2. Is this a problem? When?
3. How might you mitigate this?

**Hint**: Embeddings learn from data. If data contains bias, embeddings will too.
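
For part 3, one approach discussed in the literature is to estimate a bias direction and project it out of words that should be neutral. The sketch below shows only that projection step. All vectors are hypothetical 3-dim values (the `doctor` vector is made up), and a single word pair is a crude estimate of the direction; projection of this kind is known to reduce measured bias rather than remove it.

```python
import numpy as np

# Hypothetical 3-dim vectors for illustration only (not trained embeddings).
man = np.array([0.8, 0.1, 0.3])
woman = np.array([0.8, 0.9, 0.2])
doctor = np.array([0.6, 0.3, 0.9])   # made-up vector

# Estimate a gender direction from one definitional pair and normalise it.
gender_direction = woman - man
gender_direction = gender_direction / np.linalg.norm(gender_direction)

def neutralize(vec, direction):
    """Subtract the component of `vec` that lies along the unit vector `direction`."""
    return vec - np.dot(vec, direction) * direction

doctor_debiased = neutralize(doctor, gender_direction)

print("projection before:", float(np.dot(doctor, gender_direction)))
print("projection after: ", float(np.dot(doctor_debiased, gender_direction)))
```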

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 9 (Short Answer)

**Question 9 - Subword Embeddings (FastText)**

FastText learns embeddings for CHARACTER N-GRAMS, not just whole words.

Example (4-grams, with `<` and `>` marking the word boundaries): "running" = ["<run", "runn", "unni", "nnin", "ning", "ing>"]

Explain: How does this help with rare words or typos?

**Hint**: Even if the model has never seen 'joggen', it can compose a vector for it from the n-grams it shares with known words.
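
The sketch below extracts character n-grams with `<` and `>` boundary markers and measures how many n-grams two words share. The range of 3 to 6 characters matches FastText's default settings; the helper names and the Jaccard overlap measure are assumptions for this example, and the sketch omits the whole-word token that FastText also stores.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of `word` wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

def ngram_overlap(word_a, word_b):
    """Jaccard overlap between the n-gram sets of two words."""
    a, b = char_ngrams(word_a), char_ngrams(word_b)
    return len(a & b) / len(a | b)

print(sorted(char_ngrams("running", n_min=4, n_max=4)))
print("running vs runing:", round(ngram_overlap("running", "runing"), 3))
print("running vs table: ", round(ngram_overlap("running", "table"), 3))
```

A misspelling such as "runing" still shares many n-grams with "running", so a vector built by summing n-gram vectors lands near the correct word.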

**Your Answer**:

[Write your answer here in 2-4 sentences]

### Question 10 (Short Answer)

**Question 10 - Real Application**

Neural machine translation systems such as Google Translate map words (or subword units) to embeddings as a first step before translation.

Explain:
1. Why are embeddings better than one-hot encodings for translation?
2. How do embeddings help with zero-shot translation (translating between a language pair that never appeared together in the training data)?

**Hint**: Embeddings capture semantic similarity. Similar concepts in different languages cluster together.

**Your Answer**:

[Write your answer here in 2-4 sentences]

---
## Submission

Before submitting:
1. Run all cells to ensure code executes without errors
2. Check that all questions are answered
3. Review your explanations for clarity

**To Submit**:
- File → Download → Download .ipynb
- Submit the notebook file to your course LMS

**Note**: Make sure your name is in the filename (e.g., homework_06_yourname.ipynb)