# Word Embeddings: Classical vs. Contextual

**An introductory guide to understanding different types of word embeddings**

This notebook will help you understand:
- What word embeddings are and why they matter
- The difference between classical (Word2Vec) and contextual (BERT) embeddings
- When to use each approach
- Where to go for deeper dives into each method

## Part 1: Introduction to Word Embeddings

### What are Word Embeddings?

Word embeddings are **numerical representations of words**. Instead of treating words as discrete symbols, we represent them as vectors (lists of numbers) that capture their meaning.

### Why Do We Need Vector Representations?

Computers can't understand text directly. We need to convert words into numbers that:
1. **Capture meaning**: Similar words should have similar vectors
2. **Enable computation**: We can measure similarity, perform arithmetic, and use them in machine learning models
3. **Are efficient**: A few hundred dense dimensions per word, instead of one sparse dimension per vocabulary entry as in one-hot encoding

### Historical Evolution

```
One-Hot Encoding (1950s-2000s)
    ↓
    "cat" = [1, 0, 0, 0, ...]
    "dog" = [0, 1, 0, 0, ...]
    Problem: No notion of similarity!

Word2Vec (2013)
    ↓
    "cat" = [0.2, -0.5, 0.8, ...]
    "dog" = [0.3, -0.4, 0.7, ...]
    Better: Similar words have similar vectors!
    Problem: Same vector for all contexts

BERT (2018)
    ↓
    "bank" in "river bank" = [0.1, 0.9, -0.3, ...]
    "bank" in "savings bank" = [0.8, -0.2, 0.5, ...]
    Best: Different vectors depending on context!
```
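To make the contrast in the timeline concrete, here is a minimal sketch that compares cosine similarity for one-hot vectors and for the illustrative dense values shown above. The three-number vectors are the toy values from the diagram, not real Word2Vec output, and `cosine_similarity` is a helper name introduced here for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot: every word gets its own axis, so distinct words are orthogonal
cat_onehot = [1, 0, 0, 0]
dog_onehot = [0, 1, 0, 0]

# Dense: toy values copied from the timeline above (not real model output)
cat_dense = [0.2, -0.5, 0.8]
dog_dense = [0.3, -0.4, 0.7]

onehot_sim = cosine_similarity(cat_onehot, dog_onehot)
dense_sim = cosine_similarity(cat_dense, dog_dense)

print(f"one-hot cat vs dog: {onehot_sim:.3f}")
print(f"dense   cat vs dog: {dense_sim:.3f}")
```

Any two different one-hot vectors score exactly zero, so the encoding cannot express that "cat" is closer to "dog" than to "bank". Dense vectors can.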

### Key Concepts

- **Semantic Similarity**: Words with similar meanings have similar vectors
- **Vector Arithmetic**: We can do math with words ("king" - "man" + "woman" ≈ "queen")
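- **Dimensionality**: Embeddings typically have a few hundred dimensions (300 for the Word2Vec model used below, 768 for BERT-base)
- **Context**: Whether a word's vector stays fixed (static embeddings) or changes with the surrounding words (contextual embeddings)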
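The vector arithmetic idea can be illustrated with a hand-built toy example. Everything below is hypothetical: the two-dimensional vectors are invented so that one axis loosely means "royalty" and the other "gender", and `analogy` is a helper written for this sketch. Real embeddings have hundreds of dimensions and no such clean labels.

```python
import numpy as np

# Hypothetical 2-D vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king":   np.array([0.9,  0.9]),
    "queen":  np.array([0.9, -0.9]),
    "man":    np.array([0.1,  0.9]),
    "woman":  np.array([0.1, -0.9]),
    "prince": np.array([0.7,  0.8]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vectors):
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = vectors[a] - vectors[b] + vectors[c]
    # Exclude the three input words, as analogy tools usually do
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

answer = analogy("king", "man", "woman", toy_vectors)
print(f"king - man + woman ≈ {answer}")
```

Subtracting "man" and adding "woman" flips the gender axis while leaving the royalty axis alone, which is why the result lands on "queen" in this toy space.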

## Part 2: Classical Embeddings (Word2Vec)

### How Word2Vec Works (High-Level)

Word2Vec learns embeddings by predicting words from their context:

**Two approaches:**
1. **CBOW (Continuous Bag of Words)**: Predict center word from surrounding words
   - Input: "The", "cat", "on", "the" → Output: "sat" (from "The cat sat on the mat", window of 2)

2. **Skip-gram**: Predict surrounding words from center word
   - Input: "sat" → Output: "The", "cat", "on", "the"
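The two approaches differ only in how training examples are built from a sliding window. The sketch below generates those examples for a single sentence. The function names `skipgram_pairs` and `cbow_pairs` and the window size of 2 are choices made for this illustration. Actual Word2Vec training adds further machinery (such as negative sampling) that is not shown here.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """(context_words, center) pairs: the neighbors jointly predict the center."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        pairs.append((context, center))
    return pairs

tokens = "the cat sat on the mat".split()

sg = skipgram_pairs(tokens)
cb = cbow_pairs(tokens)

# Skip-gram examples whose center word is "sat"
print([pair for pair in sg if pair[0] == "sat"])
# CBOW example for the third token ("sat")
print(cb[2])
```

Skip-gram produces one example per (center, neighbor) pair, while CBOW produces one example per position with all neighbors bundled together.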

### Key Feature: Static Embeddings

**One vector per word**, regardless of context:
- "bank" always gets the same vector
- "bank" in "river bank" = same as "bank" in "savings bank"
- This is both a strength (simple, fast) and a limitation (misses context)
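Under the hood, a static embedding is a lookup table: each word maps to one row of a matrix. The following sketch uses a hypothetical nine-word vocabulary with random 4-dimensional vectors (a trained model would learn the values, but the lookup works the same way) to show that the surrounding words have no influence on the result.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny vocabulary with random 4-D vectors
vocab = ["i", "sat", "by", "the", "river", "bank", "deposited", "money", "at"]
word_to_row = {word: row for row, word in enumerate(vocab)}
embedding_table = rng.normal(size=(len(vocab), 4))

def embed(sentence):
    """Look up one row per token; the neighbors play no role."""
    tokens = sentence.lower().split()
    rows = [word_to_row[token] for token in tokens]
    return tokens, embedding_table[rows]

tokens1, vecs1 = embed("I sat by the river bank")
tokens2, vecs2 = embed("I deposited money at the bank")

bank1 = vecs1[tokens1.index("bank")]
bank2 = vecs2[tokens2.index("bank")]

# Both lookups hit the same row, so the vectors are identical
print(np.array_equal(bank1, bank2))
```

Because both sentences index the same row for "bank", the comparison prints `True` no matter what the other words are.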

In [None]:
# Install required packages (run once)
# !pip install gensim numpy matplotlib scikit-learn transformers torch

In [None]:
# Simple Word2Vec Demo
import gensim.downloader as api
import numpy as np

# Load pre-trained Word2Vec model (this may take a minute)
print("Loading Word2Vec model...")
word2vec_model = api.load('word2vec-google-news-300')
print("Model loaded!")

In [None]:
# Example 1: Find similar words
print("Words most similar to 'cat':")
for word, similarity in word2vec_model.most_similar('cat', topn=5):
    print(f"  {word}: {similarity:.3f}")

In [None]:
# Example 2: Word arithmetic (the famous king - man + woman = queen)
result = word2vec_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)
print("king - man + woman =")
for word, similarity in result:
    print(f"  {word}: {similarity:.3f}")

In [None]:
# Example 3: The "bank" problem
# Word2Vec gives the same vector for "bank" regardless of context
bank_vector = word2vec_model['bank']
print(f"Vector for 'bank': {bank_vector[:5]}...  (showing first 5 of 300 dimensions)")
print(f"\nThis is the SAME vector whether we're talking about:")
print("  - A river bank")
print("  - A savings bank")
print("  - A blood bank")
print("\nThis is a limitation of static embeddings!")

In [None]:
# Example 4: Simple t-SNE visualization
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Select some words to visualize
words = ['cat', 'dog', 'kitten', 'puppy', 'animal', 'pet',
         'king', 'queen', 'prince', 'princess', 'royal', 'crown',
         'bank', 'river', 'water', 'money', 'finance', 'loan']

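# Get vectors for these words (as a 2-D NumPy array; recent scikit-learn
# versions of TSNE raise an error when given a plain Python list)
vectors = np.array([word2vec_model[word] for word in words])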

# Reduce to 2D for visualization
tsne = TSNE(n_components=2, random_state=42, perplexity=5)
vectors_2d = tsne.fit_transform(vectors)

# Plot
plt.figure(figsize=(12, 8))
plt.scatter(vectors_2d[:, 0], vectors_2d[:, 1], alpha=0.5)

for i, word in enumerate(words):
    plt.annotate(word, xy=(vectors_2d[i, 0], vectors_2d[i, 1]), 
                fontsize=12, alpha=0.8)

plt.title('Word2Vec Embeddings Visualized (t-SNE)')
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.grid(True, alpha=0.3)
plt.show()

print("Notice how semantically related words cluster together!")

### Want to Learn More About Word2Vec?

See **[Word2Vec.ipynb](./Word2Vec.ipynb)** for a comprehensive deep dive including:
- Training your own Word2Vec model from scratch
- Advanced visualization techniques
- Using Word2Vec for text classification
- Document embeddings
- Performance optimization

## Part 3: Contextual Embeddings (BERT)

### How BERT Embeddings Differ

BERT (Bidirectional Encoder Representations from Transformers) creates **context-dependent embeddings**:

- The same word gets **different vectors** in different contexts
- BERT looks at the entire sentence to understand each word
- More computationally expensive, but much better at capturing which sense of a word is meant

### The "Bank" Example Solved

```
Word2Vec:
  "bank" → [0.1, 0.5, -0.3, ...]  (always the same)

BERT:
  "I sat by the river bank"  → "bank" = [0.2, 0.8, -0.1, ...]
  "I deposited money at the bank" → "bank" = [0.7, -0.3, 0.5, ...]
  
Different contexts → Different embeddings!
```
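Why does looking at the whole sentence change a word's vector? The toy sketch below gives the intuition: each token's output is a weighted mixture of all token vectors in the sentence, so different neighbors yield a different mixture. This is a deliberately simplified, single-step version of self-attention with no learned weights. It is not BERT's actual computation, which uses learned query/key/value projections, multiple attention heads, and many stacked layers. The vocabulary and random vectors are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

# Hypothetical static vectors, one per word (stand-ins for input embeddings)
vocab = ["i", "sat", "by", "the", "river", "bank", "deposited", "money", "at"]
static = {word: rng.normal(size=dim) for word in vocab}

def softmax(x):
    """Row-wise softmax; subtracting the max keeps exp() numerically stable."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def contextualize(sentence):
    """One simplified self-attention step with no learned weights."""
    tokens = sentence.lower().split()
    X = np.stack([static[t] for t in tokens])  # shape: (n_tokens, dim)
    scores = X @ X.T / np.sqrt(dim)            # token-to-token affinity
    weights = softmax(scores)                  # each row sums to 1
    return tokens, weights @ X                 # context-weighted mixtures

tokens1, out1 = contextualize("I sat by the river bank")
tokens2, out2 = contextualize("I deposited money at the bank")

bank1 = out1[tokens1.index("bank")]
bank2 = out2[tokens2.index("bank")]

print("identical contextual vectors:", np.allclose(bank1, bank2))
```

Both sentences start from the same static vector for "bank", yet the outputs differ because each one blends in a different set of neighbors.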

In [None]:
# Simple BERT Demo
from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT model
print("Loading BERT model...")
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
print("Model loaded!")

In [None]:
# Function to get BERT embedding for a word in context
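def get_word_embedding(sentence, word):
    """Get BERT embedding for a specific word in a sentence.

    Assumes `word` is a single WordPiece token; uses its first occurrence.
    """
    # Tokenize once; the encoded input already includes [CLS] and [SEP]
    inputs = tokenizer(sentence, return_tensors='pt')
    tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0].tolist())
    
    # Get word position (bert-base-uncased lowercases its input)
    target = word.lower()
    if target not in tokens:
        raise ValueError(f"'{word}' is not a single token in: {tokens}")
    word_idx = tokens.index(target)
    
    # Get embeddings
    with torch.no_grad():
        outputs = model(**inputs)
    
    # Extract embedding for the specific word
    word_embedding = outputs.last_hidden_state[0, word_idx, :]
    return word_embedding.numpy()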

# Example: "bank" in different contexts
sentence1 = "I sat by the river bank"
sentence2 = "I deposited money at the bank"

embedding1 = get_word_embedding(sentence1, 'bank')
embedding2 = get_word_embedding(sentence2, 'bank')

print(f"Embedding for 'bank' in '{sentence1}':")
print(f"  {embedding1[:5]}... (showing first 5 of 768 dimensions)\n")

print(f"Embedding for 'bank' in '{sentence2}':")
print(f"  {embedding2[:5]}... (showing first 5 of 768 dimensions)\n")

# Calculate cosine similarity
from numpy.linalg import norm
similarity = np.dot(embedding1, embedding2) / (norm(embedding1) * norm(embedding2))
print(f"Similarity between the two 'bank' embeddings: {similarity:.3f}")
print("Note: They're somewhat similar but NOT identical!")

In [None]:
# Example: Sentence-level embeddings
def get_sentence_embedding(sentence):
    """Get BERT embedding for entire sentence using [CLS] token."""
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    # Use [CLS] token (first token) as a rough sentence representation.
    # Without fine-tuning, treat the resulting similarities as approximate.
    return outputs.last_hidden_state[0, 0, :].numpy()

# Compare sentence similarity
sentences = [
    "The cat sat on the mat",
    "The kitten rested on the rug",
    "I need to deposit money at the bank"
]

embeddings = [get_sentence_embedding(s) for s in sentences]

print("Sentence similarity matrix:\n")
for i, sent1 in enumerate(sentences):
    for j, sent2 in enumerate(sentences):
        if i <= j:  # Only show upper triangle
            sim = np.dot(embeddings[i], embeddings[j]) / (norm(embeddings[i]) * norm(embeddings[j]))
            print(f"'{sent1}' <-> '{sent2}'")
            print(f"  Similarity: {sim:.3f}\n")

### Want to Learn More About BERT?

Check out these notebooks:
- **[BERT-For-Humanists_Word-Similarity_English-Public-Domain-Poetry.ipynb](./BERT-For-Humanists_Word-Similarity_English-Public-Domain-Poetry.ipynb)** - Word similarity analysis with BERT on poetry
- **[Transformers-Complete-Guide.ipynb](../transformers/Transformers-Complete-Guide.ipynb)** - Comprehensive guide to transformer models including BERT, GPT, and more

## Part 4: Side-by-Side Comparison

| Feature | Word2Vec (Classical) | BERT (Contextual) |
|---------|---------------------|-------------------|
| **Embedding Type** | Static | Dynamic/Contextual |
| **Context Awareness** | No - same vector always | Yes - changes with context |
| **Vector per Word** | One | Many (depends on context) |
| **Dimensions** | 100-300 | 768-1024 |
| **Speed** | Fast | Slower |
| **Memory** | Low | High |
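| **Training** | Self-supervised; practical to train on your own corpus | Self-supervised pre-training on massive data, then optional fine-tuning |
| **Word Arithmetic** | Yes (king - man + woman ≈ queen) | Not directly |
| **Sentence Embeddings** | By averaging words | From the [CLS] token or by pooling token vectors |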
| **Best For** | Large vocabularies, speed | Ambiguous words, context |
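The "By averaging words" entry in the table can be sketched in a few lines. The 3-dimensional vectors below are invented for illustration, and `sentence_vector` is a hypothetical helper. In practice the vectors would come from a trained model such as the Word2Vec model loaded earlier, and out-of-vocabulary words would be skipped in the same way.

```python
import numpy as np

# Hypothetical 3-D static vectors (not real model output)
toy_vectors = {
    "the":    np.array([0.0, 0.1, 0.0]),
    "cat":    np.array([0.9, 0.1, 0.2]),
    "kitten": np.array([0.8, 0.2, 0.3]),
    "sat":    np.array([0.1, 0.7, 0.1]),
    "rested": np.array([0.2, 0.6, 0.1]),
    "bank":   np.array([0.1, 0.0, 0.9]),
    "money":  np.array([0.0, 0.1, 0.8]),
}

def sentence_vector(sentence, vectors):
    """Mean of the vectors of in-vocabulary tokens; None if none are known."""
    known = [vectors[t] for t in sentence.lower().split() if t in vectors]
    if not known:
        return None
    return np.mean(known, axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

s1 = sentence_vector("The cat sat", toy_vectors)
s2 = sentence_vector("The kitten rested", toy_vectors)
s3 = sentence_vector("Money bank", toy_vectors)

print(f"cat vs kitten sentence: {cosine(s1, s2):.3f}")
print(f"cat vs bank sentence:   {cosine(s1, s3):.3f}")
```

Averaging ignores word order and context, which is the trade-off for its simplicity and speed.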

### Visual Comparison

```
Word2Vec: One Representation
┌─────────────────────────────────┐
│ "bank" → [0.1, 0.5, -0.3, ...]  │
└─────────────────────────────────┘
        ↓         ↓         ↓
   river bank | savings bank | blood bank
   (all use the same vector)

BERT: Context-Dependent
┌────────────────────────────────────────────┐
│ "I sat by the river bank"                  │
│    → "bank" = [0.2, 0.8, -0.1, ...]        │
├────────────────────────────────────────────┤
│ "I deposited money at the bank"            │
│    → "bank" = [0.7, -0.3, 0.5, ...]        │
├────────────────────────────────────────────┤
│ "The blood bank needs donors"              │
│    → "bank" = [0.4, 0.1, 0.9, ...]         │
└────────────────────────────────────────────┘
(different vectors for different contexts)
```

In [None]:
# Direct comparison: Word2Vec vs BERT for polysemous words
print("=" * 60)
print("COMPARING Word2Vec vs BERT: The 'bank' example")
print("=" * 60)

# Word2Vec: Always the same
print("\n1. WORD2VEC (Static Embedding)")
print("-" * 60)
w2v_bank = word2vec_model['bank']
print(f"'bank' vector (always): {w2v_bank[:5]}...")
print("\n  Used in: 'river bank', 'savings bank', 'blood bank'")
print("  Result: IDENTICAL vector in all contexts")

# BERT: Context-dependent
print("\n2. BERT (Contextual Embedding)")
print("-" * 60)

contexts = [
    "I sat by the river bank",
    "I deposited money at the bank",
    "The blood bank needs donors"
]

bert_embeddings = []
for ctx in contexts:
    emb = get_word_embedding(ctx, 'bank')
    bert_embeddings.append(emb)
    print(f"\n  '{ctx}'")
    print(f"  'bank' vector: {emb[:5]}...")

# Calculate pairwise similarities for BERT embeddings
print("\n3. BERT Embedding Similarities")
print("-" * 60)
for i in range(len(contexts)):
    for j in range(i + 1, len(contexts)):
        sim = np.dot(bert_embeddings[i], bert_embeddings[j]) / \
              (norm(bert_embeddings[i]) * norm(bert_embeddings[j]))
        print(f"\n  Context {i+1} vs Context {j+1}: {sim:.3f}")

print("\n" + "=" * 60)
print("KEY INSIGHT: BERT creates different embeddings for 'bank'")
print("based on context, while Word2Vec always uses the same one.")
print("=" * 60)

## Part 5: Practical Decision Guide

### Use Word2Vec When:

1. **You need fast embeddings**
   - Real-time applications
   - Processing large volumes of text
   - Limited computational resources

2. **Your vocabulary is fixed**
   - Domain-specific terminology
   - You're working with a specialized corpus

3. **You're working with large corpora**
   - Training on millions of documents
   - News articles, books, Wikipedia

4. **You need word arithmetic**
   - Analogies (king - man + woman = queen)
   - Exploring semantic relationships
   - Word algebra operations

5. **Simple similarity is sufficient**
   - Finding related words
   - Basic semantic search
   - Document clustering

### Use BERT When:

1. **Context matters**
   - Polysemous words (multiple meanings)
   - Disambiguating word sense
   - Understanding nuanced text

2. **You need sentence-level representations**
   - Semantic similarity between sentences
   - Text classification
   - Question answering

3. **You have computational resources**
   - GPU available
   - Can handle longer processing times
   - Memory is not a constraint

4. **You're fine-tuning for specific tasks**
   - Named entity recognition
   - Sentiment analysis
   - Custom classification tasks

5. **State-of-the-art performance is required**
   - Production systems
   - Research applications
   - Competitive benchmarks

### Rule of Thumb:

```
Start with Word2Vec if:
  - You're exploring data
  - Speed matters more than accuracy
  - Your use case is straightforward

Move to BERT if:
  - Word2Vec isn't accurate enough
  - Context is crucial for your task
  - You need state-of-the-art results
```

## Part 6: Next Steps

### Deep Dive Notebooks

1. **[Word2Vec.ipynb](./Word2Vec.ipynb)**
   - Complete guide to Word2Vec
   - Training your own models
   - Advanced techniques and optimizations
   - Document embeddings with Doc2Vec
   - Visualization and analysis

2. **[BERT-For-Humanists_Word-Similarity_English-Public-Domain-Poetry.ipynb](./BERT-For-Humanists_Word-Similarity_English-Public-Domain-Poetry.ipynb)**
   - BERT word similarity on literary texts
   - Working with poetry and historical texts
   - Practical examples for humanities research

3. **[Transformers-Complete-Guide.ipynb](../transformers/Transformers-Complete-Guide.ipynb)**
   - Comprehensive transformer tutorial
   - BERT, GPT, T5, and more
   - Fine-tuning for custom tasks
   - Advanced transformer techniques

### Related Topics

- **Topic Modeling**: Use embeddings for better topic discovery
- **Text Classification**: Leverage embeddings as features
- **Semantic Search**: Build search engines with embeddings
- **Information Retrieval**: Find similar documents efficiently

### Learning Path

```
Beginner:
  1. This notebook (concepts and basics)
  2. Word2Vec.ipynb (classical embeddings)
  3. BERT-For-Humanists (simple BERT examples)

Intermediate:
  4. Transformers-Complete-Guide.ipynb
  5. Apply embeddings to your specific domain
  6. Experiment with fine-tuning

Advanced:
  7. Train custom embeddings on your corpus
  8. Implement hybrid approaches
  9. Optimize for production use
```

## Summary

### Key Takeaways

1. **Word embeddings convert text to numbers** that capture meaning

2. **Word2Vec (classical)**:
   - Fast and efficient
   - One vector per word
   - Great for exploration and simple tasks
   - Limited by lack of context

3. **BERT (contextual)**:
   - Slower, but usually more accurate when context matters
   - Different vectors for different contexts
   - Handles ambiguity well
   - Requires more resources

4. **Choose based on your needs**:
   - Speed vs. accuracy trade-off
   - Context importance
   - Available computational resources

5. **Both have their place**:
   - Word2Vec isn't obsolete
   - BERT isn't always necessary
   - Use the right tool for the job

### Where to Go From Here

Start with the notebook that matches your immediate need:
- Need fast, simple embeddings? → **Word2Vec.ipynb**
- Working with ambiguous text? → **BERT-For-Humanists**
- Want comprehensive knowledge? → **Transformers-Complete-Guide.ipynb**

Happy embedding!