# What Is an Embedding?

## Summary: Embeddings in NLP (Traditional → Neural → Transformer)

### 1. What is an Embedding?
An **embedding** is a numerical (dense vector) representation of a discrete object (e.g., a word) that captures useful information for machine learning models.

- Purpose: Convert symbols → numbers
- Meaning is encoded in **geometry (distances & directions)**, not individual values
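
As a rough illustration of "symbols → numbers", an embedding layer is essentially a lookup table indexed by word id. The sketch below uses a made-up three-word vocabulary and hand-picked 2D vectors; the names `vocab`, `embedding_table`, and `embed` are hypothetical, not from any library.

```python
# Hypothetical vocabulary and hand-picked 2D vectors, for illustration only.
vocab = {"i": 0, "like": 1, "cats": 2}
embedding_table = [
    [0.2, 0.1],  # i
    [0.0, 0.3],  # like
    [0.4, 0.2],  # cats
]

def embed(tokens):
    """Map each token to its dense vector via its integer id."""
    return [embedding_table[vocab[token]] for token in tokens]

print(embed(["i", "like", "cats"]))
```

In a real model the table rows are learned parameters rather than constants.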

---

### 2. What is a Word Embedding?
A **word embedding** maps each word to a vector such that:
- Semantically similar words are close in vector space
- Dissimilar words are far apart

Examples:
- Word2Vec
- GloVe
- FastText
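
"Close in vector space" is usually measured with cosine similarity. A minimal sketch, assuming three hand-picked 3D vectors (these are not real Word2Vec, GloVe, or FastText outputs) and a hypothetical helper `most_similar`:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors, chosen so that "king" and "queen" point in similar directions.
vectors = {
    "king": [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "banana": [0.1, 0.0, 0.9],
}

def most_similar(word):
    """Return the other word whose vector has the highest cosine similarity."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))

print(most_similar("king"))
```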

---

### 3. Embedding Dimension: Minimum and Maximum

#### Minimum
- **Theoretical minimum:** 1 (not useful)
- **Practical minimum:** ~10–50 (toy or simple tasks)

#### Maximum
- **Theoretical maximum:** Unlimited
- **Practical range:**
  - Static embeddings: 100–300
  - Transformers: 768–12,000+

Trade-off:
- Too small → insufficient capacity
- Too large → overfitting, inefficiency
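
The inefficiency side of the trade-off is easy to quantify: an embedding table has `vocab_size × dim` parameters. A back-of-the-envelope sketch, assuming a 50,000-word vocabulary and 4-byte floats (both numbers are assumptions for illustration):

```python
def embedding_parameters(vocab_size, dim):
    """Number of learned values in a vocab_size x dim embedding table."""
    return vocab_size * dim

def size_megabytes(vocab_size, dim, bytes_per_value=4):
    """Approximate memory footprint, assuming 4-byte (float32) values."""
    return embedding_parameters(vocab_size, dim) * bytes_per_value / 1e6

for dim in (50, 300, 768):
    params = embedding_parameters(50_000, dim)
    print(f"dim={dim}: {params:,} parameters, ~{size_megabytes(50_000, dim):.0f} MB")
```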

---

### 4. Meaning of Each Number in an Embedding

Key truth:
> **Individual embedding dimensions have no fixed human-interpretable meaning.**

- Embeddings are **distributed representations**
- Meaning emerges from:
  - Distances
  - Angles
  - Directions between vectors
- The axes are arbitrary: rotating the whole space changes every coordinate but leaves distances and angles unchanged (**rotation invariance**)

Meaning lives in **relationships**, not coordinates.
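
A quick way to convince yourself: rotate two vectors by the same angle and compare. The coordinates change, the dot product does not. The vectors and the 37° angle below are arbitrary choices for illustration.

```python
import math

def rotate(vector, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * vector[0] - s * vector[1], s * vector[0] + c * vector[1]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

a, b = [0.2, 0.1], [0.4, 0.2]
angle = math.radians(37)
rotated_a, rotated_b = rotate(a, angle), rotate(b, angle)

print(a, rotated_a)                          # individual coordinates differ
print(dot(a, b), dot(rotated_a, rotated_b))  # dot products agree up to rounding
```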

---

### 5. How Are Embeddings Trained?

#### Core principle
**Distributional hypothesis**:
> Words appearing in similar contexts have similar meanings.

#### Neural embeddings (most common)
- Trained via **gradient descent**
- Objective: predict context words (Word2Vec, FastText), masked tokens (BERT), or the next token (GPT)
- Examples:
  - Word2Vec (Skip-gram, CBOW)
  - FastText
  - BERT, GPT

#### Non-gradient embeddings
- LSA (SVD-based)
- Spectral embeddings
- TF-IDF (sparse weighted counts, not really embeddings)
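
To see what "no gradient descent" looks like, here is a minimal TF-IDF sketch: the vectors come from counting, with no training loop. It assumes the plain `tf × log(N / df)` weighting (one common variant among several) and a made-up three-document corpus.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    ["i", "like", "cats"],
    ["i", "like", "dogs"],
    ["cats", "chase", "mice"],
]

def tf_idf(docs):
    """Return (vocab, vectors) with one TF-IDF vector per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(doc_freq)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[term] / len(doc)) * math.log(n_docs / doc_freq[term])
            for term in vocab
        ])
    return vocab, vectors

vocab, vectors = tf_idf(docs)
print(vocab)
print(vectors[0])  # mostly zeros: one slot per vocabulary term
```

Each vector has one slot per vocabulary term and is mostly zeros, which is why TF-IDF is usually described as a sparse representation rather than a dense embedding.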

---

### 6. Traditional NLP vs Transformer NLP

#### Traditional NLP
- Representations:
  - Bag-of-Words
  - TF-IDF
  - N-grams
- Position:
  - N-grams
  - Sliding windows
- Interaction:
  - Hand-crafted features
  - RNN/LSTM recurrence
- Pipelines were **modular, not end-to-end**

#### Transformer NLP
- Token embeddings
- Positional embeddings
- Attention mechanism
- End-to-end learned representations
- Contextual embeddings

**Key difference:**
> Traditional NLP separates meaning, position, and interaction; transformers learn them jointly.
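
A rough sketch of how a transformer input combines meaning and position: each token vector is added to a position vector before attention sees it. This example assumes fixed sinusoidal positions (the scheme from the original Transformer paper); many models learn positional embeddings instead. The token vectors are made up.

```python
import math

def positional_encoding(position, dim):
    """Sinusoidal position vector: sin on even indices, cos on odd indices."""
    encoding = []
    for i in range(dim):
        angle = position / (10000 ** (2 * (i // 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

# Hypothetical 4D token embeddings, for illustration only.
token_embeddings = {
    "i": [0.2, 0.1, 0.0, 0.3],
    "like": [0.0, 0.3, 0.1, 0.2],
    "cats": [0.4, 0.2, 0.3, 0.1],
}

def embed_sequence(tokens):
    """Add each token's embedding to the encoding of its position."""
    dim = len(next(iter(token_embeddings.values())))
    return [
        [t + p for t, p in zip(token_embeddings[token], positional_encoding(pos, dim))]
        for pos, token in enumerate(tokens)
    ]

inputs = embed_sequence(["cats", "like", "cats"])
print(inputs[0])
print(inputs[2])  # same word, different position, different input vector
```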

---

### 7. Are All Embeddings Trained with Gradient Descent?

**No.**

| Embedding Type | Gradient Descent |
|---|---|
| Word2Vec | Yes |
| FastText | Yes |
| GloVe | Yes |
| BERT / GPT | Yes |
| LSA | No |
| TF-IDF | No |
| Spectral embeddings | No |
| Random embeddings | No |

Modern NLP is dominated by gradient-based methods due to scalability and flexibility.

---

### 8. Embeddings in RNN / LSTM Models

Key point:
> **Embeddings are crucial in RNN/LSTM-based NLP models.**

- Embeddings provide semantic signal
- RNN/LSTM:
  - Models order
  - Aggregates information over time
- Cannot compensate for poor embeddings (GIGO principle)

Frequently reported (and dataset-dependent):
- Pre-trained embeddings can contribute more than the RNN itself, especially when labeled data is limited
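
The division of labor can be sketched with a bare-bones recurrent step: the embedding supplies the per-word signal, and the recurrence only mixes those signals in order. The weights and embeddings below are hand-set assumptions, not trained values.

```python
import math

# Hypothetical 2D embeddings and hand-set weights, for illustration only.
embeddings = {"I": [0.2, 0.1], "like": [0.0, 0.3], "cats": [0.4, 0.2]}
W_x = [[0.5, -0.3], [0.1, 0.8]]   # input-to-hidden weights
W_h = [[0.2, 0.4], [-0.6, 0.1]]   # hidden-to-hidden weights

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def run_rnn(tokens):
    """Return the final hidden state after reading the tokens in order."""
    hidden = [0.0, 0.0]
    for token in tokens:
        x = embeddings[token]  # the embedding supplies the semantic signal
        pre = [a + b for a, b in zip(matvec(W_x, x), matvec(W_h, hidden))]
        hidden = [math.tanh(p) for p in pre]  # the recurrence aggregates over time
    return hidden

forward = run_rnn(["I", "like", "cats"])
backward = run_rnn(["cats", "like", "I"])
print(forward)
print(backward)  # same embeddings, different order, different state
```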

---

### 9. Pre-trained Embeddings: Frozen vs Fine-Tuned

Pre-trained embeddings are **initializations**, not fixed by default.

#### Option 1: Frozen
- No updates during training
- Good for small datasets
- Faster, more stable

#### Option 2: Fine-tuned
- Updated via backpropagation
- Adapts to task/domain
- Risk of overfitting or forgetting

Best practice:
- Small data → freeze
- Larger or domain-specific data → fine-tune (often with a smaller learning rate)
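
In code, the difference comes down to whether the optimizer is allowed to touch the embedding. A minimal sketch with plain SGD; `sgd_update`, the vector, and the gradient are all hypothetical values for illustration.

```python
def sgd_update(vector, gradient, learning_rate, frozen=False):
    """Return the updated vector; a frozen vector is returned unchanged."""
    if frozen:
        return list(vector)
    return [w - learning_rate * g for w, g in zip(vector, gradient)]

pretrained = [0.5, -0.2]   # hypothetical pre-trained embedding
gradient = [1.0, -2.0]     # hypothetical gradient from the downstream task

frozen_result = sgd_update(pretrained, gradient, learning_rate=0.1, frozen=True)
fine_tuned = sgd_update(pretrained, gradient, learning_rate=0.01)  # small LR: gentle adaptation

print(frozen_result)
print(fine_tuned)
```

Deep learning frameworks typically expose the same choice as a per-parameter "requires gradient" switch plus per-group learning rates.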

---

### 10. Big Picture Takeaways

- Embeddings turn language into geometry
- Individual dimensions are meaningless; geometry is everything
- Gradient descent dominates modern embedding learning
- In RNN/LSTM models, embeddings carry most semantic power
- Transformers reduce reliance on static embeddings via attention
- Pre-trained embeddings can (and often should) be fine-tuned

---

### One-Sentence Summary

**Embeddings are learned geometric representations of language; how powerful your NLP model is largely depends on how well those vectors encode meaning and context.**




# Skip-gram with a Simple Numerical Example

I’ll tell it like it is:

Real skip-gram training uses gradient descent and a softmax over a large vocabulary (or an approximation such as negative sampling). Doing full training by hand is ugly.  
So we’ll use a **tiny toy corpus** and **2D embeddings**, and walk through **one concrete skip-gram prediction with numbers**, followed by the intuition behind the update.

---

## 1️⃣ Tiny corpus (toy data)

**Sentence:**

> “I like cats”

**Vocabulary:** I, like, cats

**Window size = 1**

### Skip-gram pairs (center → context)

| Center | Context |
|------|--------|
| I | like |
| like | I |
| like | cats |
| cats | like |

The walkthrough below focuses on the two pairs centered on **“like”**, so the model should learn:

> If the center word is **“like”**, it should predict **“I”** and **“cats”**
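
Pair generation is mechanical: for every position, pair the center word with each neighbor inside the window. A small sketch (the function name `skipgram_pairs` is just a label used here); note that it also emits the pairs centered on "I" and "cats".

```python
def skipgram_pairs(tokens, window=1):
    """Return (center, context) pairs for every position in the sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["I", "like", "cats"], window=1)
print(pairs)
```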

---

## 2️⃣ What skip-gram actually learns (no magic)

Skip-gram learns **two embeddings per word**:

- **Input vector** (center word)
- **Output vector** (context word)

We’ll use **2D embeddings** so the vectors are easy to visualize.

---

## 3️⃣ Initialize embeddings (random, small numbers)

### Input vectors (V)

| Word | Vector |
|----|------|
| I | `[0.2, 0.1]` |
| like | `[0.0, 0.3]` |
| cats | `[0.4, 0.2]` |

### Output vectors (U)

| Word | Vector |
|----|------|
| I | `[0.1, 0.0]` |
| like | `[0.0, 0.2]` |
| cats | `[0.3, 0.1]` |

---

## 4️⃣ Skip-gram prediction (core calculation)

We take **center word = "like"**

### Step 1: Dot product with each context word

**Formula:**

\[
\text{score}(w_c, w_o) = \mathbf{v}_{w_c} \cdot \mathbf{u}_{w_o}
\]

### Dot products

**score(like, I):**

\[
[0.0, 0.3] \cdot [0.1, 0.0] = 0.0
\]

**score(like, cats):**

\[
[0.0, 0.3] \cdot [0.3, 0.1] = 0.03
\]

**score(like, like):**

\[
[0.0, 0.3] \cdot [0.0, 0.2] = 0.06
\]
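
The three dot products can be checked in a few lines. `V` and `U` hold the input and output vectors from the tables in section 3; the helper name `score` mirrors the formula.

```python
# Input (center) and output (context) vectors from the initialization tables.
V = {"I": [0.2, 0.1], "like": [0.0, 0.3], "cats": [0.4, 0.2]}
U = {"I": [0.1, 0.0], "like": [0.0, 0.2], "cats": [0.3, 0.1]}

def score(center, context):
    """Dot product of the center word's input vector and the context word's output vector."""
    return sum(v * u for v, u in zip(V[center], U[context]))

scores = {word: score("like", word) for word in U}
for word, value in scores.items():
    print(f"score(like, {word}) = {value:.2f}")
```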

---

## 5️⃣ Convert scores to probabilities (softmax)

\[
P(w_o \mid w_c) =
\frac{\exp(\mathbf{v}_{w_c} \cdot \mathbf{u}_{w_o})}{\sum_{w \in \text{vocab}} \exp(\mathbf{v}_{w_c} \cdot \mathbf{u}_{w})}
\]

### Exponentials

| Word | exp(score) |
|----|----|
| I | \(e^0 = 1\) |
| cats | \(e^{0.03} \approx 1.03\) |
| like | \(e^{0.06} \approx 1.06\) |

**Sum ≈ 3.09**

### Probabilities

| Word | Probability |
|----|----|
| I | \(1 / 3.09 \approx 0.32\) |
| cats | \(1.03 / 3.09 \approx 0.33\) |
| like | \(1.06 / 3.09 \approx 0.34\) |

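❌ **Problem:**  
The highest probability goes to **“like”**, which never appears as a context of “like” in this corpus, and the distribution is almost uniform: the untrained model knows nothing yet.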
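
The softmax step, sketched in plain Python. The scores are the three dot products from section 4; subtracting the maximum score before exponentiating is a standard numerical-stability trick and does not change the result.

```python
import math

def softmax(scores):
    """Turn a dict of raw scores into a dict of probabilities that sum to 1."""
    max_score = max(scores.values())  # shift for numerical stability
    exps = {word: math.exp(s - max_score) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

scores = {"I": 0.0, "cats": 0.03, "like": 0.06}
probs = softmax(scores)
for word, p in probs.items():
    print(f"P({word} | like) = {p:.3f}")
```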

---

## 6️⃣ What learning does (plain English)

Skip-gram will:

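- Pull the input vector of **“like”** toward the output vectors of **“I”** and **“cats”** (raising their dot products)
- Push it away from the output vectors of words that are not observed contexts (lowering those dot products)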
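
To make "pull" and "push" concrete, here is a sketch of a single gradient step on the pair (like → cats), assuming the full-softmax cross-entropy loss and a learning rate of 0.5 (both are assumptions for this toy; real word2vec typically uses negative sampling or hierarchical softmax). For each word `w`, the error term is `P(w | center) − 1` if `w` is the true context and `P(w | center)` otherwise.

```python
import math

# Initial vectors from section 3.
V = {"I": [0.2, 0.1], "like": [0.0, 0.3], "cats": [0.4, 0.2]}  # input vectors
U = {"I": [0.1, 0.0], "like": [0.0, 0.2], "cats": [0.3, 0.1]}  # output vectors

def context_probs(center):
    """Softmax over dot products between the center's input vector and all output vectors."""
    scores = {w: sum(a * b for a, b in zip(V[center], U[w])) for w in U}
    total = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / total for w, s in scores.items()}

def sgd_step(center, context, lr=0.5):
    """One gradient step on -log P(context | center)."""
    probs = context_probs(center)
    v_c = list(V[center])  # keep the pre-update center vector
    errors = {w: probs[w] - (1.0 if w == context else 0.0) for w in U}
    # Gradient w.r.t. the center vector uses the pre-update output vectors.
    grad_v = [sum(errors[w] * U[w][k] for w in U) for k in range(len(v_c))]
    for w in U:
        U[w] = [u - lr * errors[w] * x for u, x in zip(U[w], v_c)]
    V[center] = [x - lr * g for x, g in zip(v_c, grad_v)]

before = context_probs("like")["cats"]
sgd_step("like", "cats")
after = context_probs("like")["cats"]
print(f"P(cats | like): {before:.3f} -> {after:.3f}")
```

The true context has a negative error, so its output vector moves toward the center vector; every other word has a positive error and moves away.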

After many updates, embeddings might look like this:

---

## 7️⃣ Final learned 2D embeddings (intuitive result)

### Input embeddings (after training)

| Word | Vector |
|----|------|
| I | `[-0.2, 0.3]` |
| like | `[0.0, 0.5]` |
| cats | `[0.2, 0.4]` |

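### Visual intuition (2D space)

Plotted in the plane, **“like”** at `[0.0, 0.5]` sits between **“I”** at `[-0.2, 0.3]` and **“cats”** at `[0.2, 0.4]`, closer to each of them than they are to each other.

✔ Words appearing together are **geometrically close**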
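
The closeness claim can be checked directly from the table of trained input vectors, using Euclidean distance (`math.dist`, available in Python 3.8+). Keep in mind these "final" vectors are illustrative values, not the output of an actual training run.

```python
import math

# Illustrative "after training" input vectors from the table above.
final = {"I": [-0.2, 0.3], "like": [0.0, 0.5], "cats": [0.2, 0.4]}

def distance(a, b):
    """Euclidean distance between the vectors of two words."""
    return math.dist(final[a], final[b])

for a, b in [("like", "I"), ("like", "cats"), ("I", "cats")]:
    print(f"distance({a}, {b}) = {distance(a, b):.3f}")
```

"like" co-occurs with both other words and ends up nearer to each of them than "I" and "cats" (which never co-occur within the window) are to each other.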

---

## 8️⃣ Simple analogy (remember this)

- Skip-gram is like **learning a map of words**
- Words that appear together are **neighbors on the map**
- Training = repeatedly nudging words **closer or farther apart**

No linguistics.  
No grammar rules.  
Just **geometry + statistics**.

---

## 9️⃣ Key takeaway (don’t sugar-coat it)

- Word embeddings are **not semantic by design**
- Meaning **emerges from co-occurrence**
- Skip-gram is just:

> “Adjust vectors so dot products predict nearby words”

---

## 📚 References (foundational, authoritative)

- Mikolov et al., *Efficient Estimation of Word Representations in Vector Space*, 2013  
- Goldberg & Levy, *word2vec Explained*, 2014  
- Jurafsky & Martin, *Speech and Language Processing*, Chapter on Vector Semantics  

