# Understanding Word2Vec Mathematics



## The Big Picture: What Are We Computing?

Before diving into formulas, let's establish a mental framework. Word2Vec is essentially asking: "How likely is it that two words appear together?" But instead of counting, we're predicting - and this prediction process teaches our model the meaning of words.

### The Fundamental Question

Given a target word `w` and a potential context word `c`, we want to compute:

**P(+|w,c)** = Probability that c is actually a context word for w

Think of this like your brain recognizing patterns. When you see "coffee," your brain automatically activates related concepts like "cup," "hot," and "morning." Word2Vec mimics this neural activation pattern.

## Step 1: From Words to Numbers (Embeddings)

First, we need to convert words into vectors. This is like how your brain encodes concepts - each word gets a unique pattern of activation across neurons.

```
Word "apple" → Vector [0.2, -0.5, 0.8, ...]
Word "fruit" → Vector [0.3, -0.4, 0.7, ...]
```

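**Neuroscience Hack**: Visualize each dimension as a different "neural feature detector." One dimension might respond to "edibility," another to "color," etc. In a trained model, individual dimensions are rarely this interpretable, so treat the picture as a memory aid rather than a description of what each number means.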
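To make the lookup concrete, here is a minimal sketch of an embedding table in plain Python. The vocabulary, the dimension, and the helper names (`random_table`, `embed`) are made up for illustration. It also assumes the usual skip-gram setup in which each word has two vectors: one used when the word is the target `w` and one used when it is the context `c`.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

vocab = ["apple", "fruit", "coffee", "cup"]  # hypothetical toy vocabulary
word_to_id = {word: i for i, word in enumerate(vocab)}
dim = 3  # toy size; real models use far more dimensions


def random_table(rows, cols):
    """Small random starting values; training will move them."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


target_vectors = random_table(len(vocab), dim)   # rows used when a word is the target w
context_vectors = random_table(len(vocab), dim)  # rows used when a word is the context c


def embed(word, table):
    """Look up the row of `table` that belongs to `word`."""
    return table[word_to_id[word]]


w = embed("apple", target_vectors)
c = embed("fruit", context_vectors)
print("w =", w)
print("c =", c)
```

Before training, these vectors are just noise. Everything in the following steps is about nudging them until related words line up.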

## Step 2: Measuring Word Similarity

The key insight is using the **dot product** to measure similarity:

**Similarity(w,c) ≈ c·w**

Let's break this down:
- If vectors point in similar directions → large positive dot product
- If vectors are perpendicular → dot product near zero
- If vectors point in opposite directions → negative dot product

**Visual Analogy**: Imagine two flashlight beams. The dot product measures how much they overlap:
- Parallel beams = maximum overlap (high similarity)
- Perpendicular beams = no overlap (no relationship)
- Opposite directions = negative overlap (the model expects the words to rarely share a context, which is not the same as opposite meanings)
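The three cases above can be checked with a few lines of Python. The vectors here are hypothetical 2-D examples chosen so the arithmetic is easy to follow by hand.

```python
def dot(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


w = [1.0, 2.0]               # hypothetical target vector
same = [2.0, 4.0]            # same direction as w
perpendicular = [-2.0, 1.0]  # at a right angle to w
opposite = [-1.0, -2.0]      # opposite direction to w

print(dot(same, w))           # 10.0
print(dot(perpendicular, w))  # 0.0
print(dot(opposite, w))       # -5.0
```

Note that the dot product also grows with vector length, not only with alignment: `same` is twice as long as `w`, which is part of why its score is so large.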

## Step 3: Converting Similarity to Probability

The dot product gives us a number from -∞ to +∞, but we need a probability (0 to 1). Enter the **sigmoid function**:

**σ(x) = 1 / (1 + exp(-x))**

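**Neuroscience Hack**: The sigmoid loosely resembles a neuron's response curve: it squashes a continuous input into a value between 0 and 1 that you can read as a firing probability. Real neurons are far more complex, so use this as a memory aid. Here's what happens: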
- Large positive input (x >> 0) → σ(x) ≈ 1 (neuron fires)
- Zero input (x = 0) → σ(x) = 0.5 (uncertain)
- Large negative input (x << 0) → σ(x) ≈ 0 (neuron silent)

So our probability becomes:

**P(+|w,c) = σ(c·w) = 1 / (1 + exp(-c·w))**
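Here is a small sketch that turns a dot product into P(+|w,c). The two vectors reuse the first three numbers of the "apple" and "fruit" examples from Step 1 as stand-ins, and the function names are illustrative. The sigmoid is written in two branches only to avoid overflow for very negative inputs.

```python
import math


def sigmoid(x):
    """Logistic function, split into two branches to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def prob_context(c, w):
    """P(+|w,c) = sigmoid(c . w)."""
    return sigmoid(dot(c, w))


w = [0.2, -0.5, 0.8]  # stand-in for "apple"
c = [0.3, -0.4, 0.7]  # stand-in for "fruit"

print(sigmoid(0.0))        # 0.5
print(prob_context(c, w))  # roughly 0.69, since c . w = 0.82
```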

## Step 4: The Learning Process

Word2Vec learns by adjusting vectors to increase P(+|w,c) for actual context pairs and decrease it for random pairs.

### The Loss Function

The loss function is what we minimize during training:

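**L = -log σ(cpos·w) - Σ log σ(-cneg·w)**

Both terms carry a minus sign, because each log-probability is something we want to make large. Let's decode this:
1. **-log σ(cpos·w)**: We want to maximize probability for real context words
   - Taking -log converts "maximize probability" to "minimize loss"
   - When probability is high (near 1), -log is small (near 0)

2. **-Σ log σ(-cneg·w)**: We want to minimize probability for random words
   - The negative sign in (-cneg·w) flips our similarity measure: σ(-cneg·w) = 1 - σ(cneg·w) is the probability that cneg is *not* a context word
   - This term is small when σ(cneg·w) is near 0
   - We sum over k negative samples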

**Memory Trick**: Think "PULL-PUSH"
- PULL real context words closer (minimize first term)
- PUSH random words away (minimize second term)
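To see the loss react to good and bad vectors, here is a minimal sketch. It computes L = -log σ(cpos·w) - Σ log σ(-cneg·w), so both terms are subtracted and a lower value means a better fit. All vectors and function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    """log(sigmoid(x)), computed in a numerically stable way."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sgns_loss(w, c_pos, c_negs):
    """-log sigmoid(c_pos . w) - sum over negatives of log sigmoid(-c_neg . w)."""
    loss = -log_sigmoid(dot(c_pos, w))
    for c_neg in c_negs:
        loss -= log_sigmoid(-dot(c_neg, w))
    return loss


w = [0.2, -0.5, 0.8]
c_pos = [0.3, -0.4, 0.7]
far_negs = [[-0.3, 0.4, -0.7], [-0.1, 0.6, -0.5]]  # negatives pointing away from w
near_negs = [[0.3, -0.4, 0.7], [0.1, -0.6, 0.5]]   # negatives pointing toward w

loss_far = sgns_loss(w, c_pos, far_negs)
loss_near = sgns_loss(w, c_pos, near_negs)
print(loss_far, loss_near)
```

The loss is lower when the negative samples point away from `w`, which is exactly the state that training is trying to reach.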

## Step 5: Gradient Descent Updates

The update rules tell us how to adjust our vectors:

### For positive context word:
**∂L/∂cpos = [σ(cpos·w) - 1]w**

**Intuition**: 
- If σ(cpos·w) ≈ 1 (already good), gradient ≈ 0 (small update)
- If σ(cpos·w) ≈ 0 (bad), gradient ≈ -w (big update toward w)

### For negative samples:
**∂L/∂cneg = [σ(cneg·w)]w**

**Intuition**:
- If σ(cneg·w) ≈ 0 (already good), gradient ≈ 0 (small update)
- If σ(cneg·w) ≈ 1 (bad), gradient ≈ w (big update away from w)
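A quick way to build trust in these formulas is to compare them with a numerical derivative. The sketch below does this for a single negative sample (k = 1), assuming the loss L = -log σ(cpos·w) - log σ(-cneg·w). The vectors and helper names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))  # fine for the small inputs used here


def loss(w, c_pos, c_neg):
    return -math.log(sigmoid(dot(c_pos, w))) - math.log(sigmoid(-dot(c_neg, w)))


def analytic_grads(w, c_pos, c_neg):
    """Gradients from the formulas above: [sigma - 1] w and [sigma] w."""
    g_pos = [(sigmoid(dot(c_pos, w)) - 1.0) * x for x in w]
    g_neg = [sigmoid(dot(c_neg, w)) * x for x in w]
    return g_pos, g_neg


def numeric_grad(f, vec, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for i in range(len(vec)):
        up = vec[:i] + [vec[i] + eps] + vec[i + 1:]
        down = vec[:i] + [vec[i] - eps] + vec[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


w = [0.2, -0.5, 0.8]
c_pos = [0.3, -0.4, 0.7]
c_neg = [0.1, 0.6, -0.5]

g_pos, g_neg = analytic_grads(w, c_pos, c_neg)
num_pos = numeric_grad(lambda c: loss(w, c, c_neg), c_pos)
num_neg = numeric_grad(lambda c: loss(w, c_pos, c), c_neg)

max_err = max(abs(a - b) for a, b in zip(g_pos + g_neg, num_pos + num_neg))
print("largest mismatch:", max_err)
```

If the formulas and the loss agree, the largest mismatch should be tiny, limited only by floating-point error.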

### The Update Rules:
```
cpos(t+1) = cpos(t) - η[σ(cpos·w) - 1]w
cneg(t+1) = cneg(t) - η[σ(cneg·w)]w
w(t+1) = w(t) - η{[σ(cpos·w) - 1]cpos + Σ[σ(cneg·w)]cneg}
```

Where η (eta) is the learning rate - think of it as "step size" in learning.
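Putting the three update rules together, one training step might look like the sketch below. The names `sgd_step` and `lr` (standing in for η), and the example vectors, are hypothetical. All gradients are computed from the old vectors before anything is overwritten, matching the (t) and (t+1) notation above.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))  # fine for the small inputs used here


def sgns_loss(w, c_pos, c_negs):
    loss = -math.log(sigmoid(dot(c_pos, w)))
    for c_neg in c_negs:
        loss -= math.log(sigmoid(-dot(c_neg, w)))
    return loss


def sgd_step(w, c_pos, c_negs, lr=0.1):
    """Return updated (w, c_pos, c_negs) after one gradient descent step."""
    err_pos = sigmoid(dot(c_pos, w)) - 1.0             # [sigma(cpos . w) - 1]
    err_negs = [sigmoid(dot(c, w)) for c in c_negs]    # [sigma(cneg . w)]

    # Gradient for w: err_pos * cpos + sum of err_neg * cneg
    grad_w = [err_pos * x for x in c_pos]
    for err, c in zip(err_negs, c_negs):
        grad_w = [g + err * x for g, x in zip(grad_w, c)]

    new_c_pos = [ci - lr * err_pos * x for ci, x in zip(c_pos, w)]
    new_c_negs = [
        [ci - lr * err * x for ci, x in zip(c, w)]
        for err, c in zip(err_negs, c_negs)
    ]
    new_w = [wi - lr * g for wi, g in zip(w, grad_w)]
    return new_w, new_c_pos, new_c_negs


w = [0.2, -0.5, 0.8]
c_pos = [0.3, -0.4, 0.7]
c_negs = [[0.1, 0.6, -0.5], [0.4, 0.1, 0.3]]

before = sgns_loss(w, c_pos, c_negs)
w2, c_pos2, c_negs2 = sgd_step(w, c_pos, c_negs)
after = sgns_loss(w2, c_pos2, c_negs2)
print(before, "->", after)
```

With a small step size the loss should drop after the update. A learning rate that is too large can overshoot and make it rise instead.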

## Negative Sampling: The Efficiency Trick

Instead of computing probabilities over all words (expensive!), we:
1. Take one positive example
2. Sample k negative examples
3. Update only the vectors involved: the target vector w plus these k+1 context vectors

The sampling probability uses a special formula:

**Pα(w) = count(w)^α / Σ count(w')^α**

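With α = 0.75, compared with sampling in proportion to raw counts, this:
- Boosts rare words (they are drawn as negatives more often, so they get more training)
- Reduces very common words (they are drawn less often, and they already get enough training)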

**Brain Analogy**: Your brain pays more attention to unusual events (rare words) than everyday occurrences (common words).
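The effect of α is easy to see on a toy example. The counts below are invented, and `sampling_distribution` is an illustrative helper. The sketch compares plain frequency sampling (α = 1) with α = 0.75, then draws k negatives with the standard library's `random.choices`.

```python
import random

counts = {"the": 1000, "coffee": 50, "espresso": 5}  # hypothetical corpus counts


def sampling_distribution(counts, alpha=0.75):
    """P_alpha(w) = count(w)^alpha / sum over w' of count(w')^alpha."""
    weights = {word: count ** alpha for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


unigram = sampling_distribution(counts, alpha=1.0)
smoothed = sampling_distribution(counts, alpha=0.75)

for word in counts:
    print(word, round(unigram[word], 4), round(smoothed[word], 4))

# Draw k negative samples according to the smoothed distribution.
random.seed(0)
k = 5
words = list(smoothed)
negatives = random.choices(words, weights=[smoothed[w] for w in words], k=k)
print(negatives)
```

In this toy corpus the share of "the" falls from about 0.95 to about 0.89, while "espresso" rises from about 0.005 to about 0.017. This sketch does not filter out the true context word if it happens to be drawn, which a fuller implementation would need to handle.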

## Quiz Time! 🧠

### Question 1: Dot Product Intuition
If two word vectors have a dot product of -5, what does this suggest about their relationship?

<details>
<summary>Click for answer</summary>

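A dot product of -5 means the model considers the pair very unlikely to appear together: σ(-5) ≈ 0.007. The vectors point away from each other (the angle between them is greater than 90°). This reflects how the words are used, not whether they are antonyms. Antonyms like "hot" and "cold" tend to appear in similar contexts, so their vectors often end up close together rather than opposed.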
</details>

### Question 2: Sigmoid Function
What is σ(0), and what does this value mean in the context of Word2Vec?

<details>
<summary>Click for answer</summary>

σ(0) = 1/(1+exp(0)) = 1/(1+1) = 0.5

This means when the dot product between two word vectors is 0 (they're perpendicular/unrelated), the probability that one is a context word for the other is exactly 0.5 - complete uncertainty.
</details>

### Question 3: Gradient Understanding
If σ(cpos·w) = 0.9 for a positive example, will the gradient update be large or small? Why?

<details>
<summary>Click for answer</summary>

The gradient will be small. The gradient is [σ(cpos·w) - 1]w = [0.9 - 1]w = -0.1w, so the update moves cpos only a small step (0.1η times w) toward w.

Since we're already close to the target (probability near 1), we only need a small adjustment. This prevents overshooting and helps convergence.
</details>

### Question 4: Negative Sampling
Why do we use count(w)^0.75 instead of just count(w) for negative sampling?

<details>
<summary>Click for answer</summary>

Using count(w)^0.75 makes the distribution more uniform:
- It increases the relative probability of sampling rare words
- It decreases the relative probability of sampling very common words
- This gives rare words more training opportunities and prevents common words from dominating the negative samples
</details>

### Question 5: Loss Function Interpretation
In the loss function L = -log σ(cpos·w) - Σ log σ(-cneg·w), why do we use -cneg·w (with a negative sign) in the second term?

<details>
<summary>Click for answer</summary>

The negative sign flips the similarity measure. We want:
- High probability for positive pairs: σ(cpos·w) → 1
- Low probability for negative pairs: σ(cneg·w) → 0

Using -cneg·w in σ(-cneg·w) effectively computes 1 - σ(cneg·w), giving us the probability that cneg is NOT a context word. We want this to be high (near 1) for negative samples.
</details>

## Key Takeaways

1. **Dot products measure semantic similarity** - the foundation of Word2Vec
2. **Sigmoid converts similarities to probabilities** - just like neural activation
3. **The loss function implements "attract-repel"** - pull related words together, push unrelated apart
4. **Negative sampling makes training efficient** - update only a few vectors per step
5. **Gradients guide learning** - bigger errors lead to bigger updates

Remember: Word2Vec is teaching a neural network to predict context, and the side effect is learning meaningful word representations!