# 📚 Word Embeddings & Representation Learning

---

## 🧠 What Are Word Embeddings?

- **Embeddings** are dense vector representations of discrete tokens (e.g., words, subwords, items).
- They capture **semantic similarity**: similar words end up close together in vector space.
- Used to convert categorical/textual data into **numeric form** for neural networks.

---

## 🔢 From Text to Embeddings: Step-by-Step

### 1. **Tokenization**
Break text into tokens:
- Word-level: `"the cat sleeps"` → `["the", "cat", "sleeps"]`
- Subword-level (WordPiece-style, as used by BERT; the split shown is illustrative): `"sleeping"` → `["sleep", "##ing"]`

### 2. **Vocabulary Mapping**
Each token maps to a unique integer ID using a vocab dictionary:
```python
vocab = {
  "[PAD]": 0,
  "[UNK]": 1,
  "the": 2,
  "cat": 3,
  "sleeps": 4
}
```
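The mapping step can be sketched as a small helper. The `encode` function and its whitespace tokenizer are assumptions made for illustration; real tokenizers are more involved.

```python
# Toy vocabulary copied from the example above.
vocab = {"[PAD]": 0, "[UNK]": 1, "the": 2, "cat": 3, "sleeps": 4}

def encode(text, vocab):
    """Whitespace-tokenize `text` and map each token to its ID ([UNK] if missing)."""
    unk_id = vocab["[UNK]"]
    return [vocab.get(token, unk_id) for token in text.lower().split()]

ids = encode("The cat sleeps", vocab)
oov_ids = encode("the dog sleeps", vocab)  # "dog" is not in the vocab

print(ids)      # [2, 3, 4]
print(oov_ids)  # [2, 1, 4]
```

Any token missing from the vocab falls back to the `[UNK]` ID instead of raising an error.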

### 3. **Embedding Lookup**
Tokens → IDs → vectors:
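```python
import torch.nn as nn
embedding = nn.Embedding(num_embeddings=10000, embedding_dim=300)
```
- Input: token IDs `[2, 3, 4]` (an integer tensor)
- Output: float tensor of shape `(3, 300)`, one 300-dimensional vector per token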
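A minimal runnable version of the lookup, assuming PyTorch is installed. The `padding_idx=0` argument is an addition not shown above: it ties row 0 to `[PAD]` and keeps that row at zero.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # reproducible random initialization

# 10,000-token vocabulary, 300-dimensional vectors; row 0 is reserved for [PAD].
embedding = nn.Embedding(num_embeddings=10000, embedding_dim=300, padding_idx=0)

token_ids = torch.tensor([2, 3, 4])   # "the", "cat", "sleeps"
vectors = embedding(token_ids)        # one 300-d row per token ID

print(vectors.shape)                  # torch.Size([3, 300])
```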

---

## 🧊 Static ("Frozen") vs 🧠 Contextual Embeddings

"Frozen" here means *static*: each token has one fixed vector regardless of context. The vectors themselves can still be updated during training.

| Feature                     | Static ("Frozen") Embeddings   | Contextual Embeddings                  |
|----------------------------|--------------------------------|----------------------------------------|
| Meaning changes by context | ❌ No                          | ✅ Yes                                 |
| Token granularity          | Usually word-level (FastText adds character n-grams) | Usually subword-level    |
| Examples                   | Word2Vec, GloVe, FastText      | BERT, GPT, RoBERTa                     |
| Computation                | Precomputed (lookup)          | Computed dynamically (transformers)    |
| Representation type        | One fixed vector per token    | Depends on full sentence context       |

### 🔄 Example (semantic vector math):
If the model captures relational meaning:
$$
\text{embedding}("king") - \text{embedding}("man") + \text{embedding}("woman") \approx \text{embedding}("queen")
$$
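The arithmetic can be tried on hand-made toy vectors. Everything below is hypothetical: the 2-D vectors (axes roughly "royalty" and "maleness") are invented for illustration and are not learned embeddings. Squared Euclidean distance is used for the nearest-neighbor search to keep the sketch short.

```python
# Hand-made 2-D toy vectors: [royalty, maleness]. NOT learned embeddings.
toy = {
    "king":  [1.0, 1.0],
    "queen": [1.0, 0.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, 0.0],
    "child": [0.0, 0.5],
}

def analogy(a, b, c, vectors):
    """Return the word whose vector is nearest to vectors[a] - vectors[b] + vectors[c]."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]  # exclude the query words

    def sq_dist(word):
        return sum((p - q) ** 2 for p, q in zip(vectors[word], target))

    return min(candidates, key=sq_dist)

result = analogy("king", "man", "woman", toy)
print(result)  # queen
```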

---

## 🛠 Special Tokens

| Token     | Purpose                                        |
|-----------|------------------------------------------------|
| `[PAD]`   | Used to equalize sequence lengths in batching  |
| `[UNK]`   | Represents unknown/out-of-vocab tokens         |
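A sketch of how `[PAD]` equalizes lengths in a batch. The `pad_batch` helper is hypothetical, and the 1/0 mask it returns is a common companion to padding that these notes do not cover.

```python
PAD_ID = 0  # ID of "[PAD]" in the toy vocab used earlier

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad each ID list to the longest length; also return a 1/0 mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask

padded, mask = pad_batch([[2, 3, 4], [2, 3]])
print(padded)  # [[2, 3, 4], [2, 3, 0]]
print(mask)    # [[1, 1, 1], [1, 1, 0]]
```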

---

## 📏 Similarity Measures

To measure similarity between embeddings:
- **Cosine similarity**:
$$
\text{cosine\_similarity}(u, v) = \frac{u \cdot v}{\|u\| \|v\|}
$$
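The formula translates directly into a few lines of standard-library Python. The zero-vector check is an added safeguard, since the ratio is undefined when either norm is 0.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between equal-length vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0.0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # -1.0
```

The result depends only on direction, not on vector length, which is why `[1, 2]` and `[2, 4]` score as (almost exactly) identical.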

---

## 🧠 Embedding Layer Behavior

- It's just a **lookup table** with learnable weights:
$$
\text{EmbeddingMatrix} \in \mathbb{R}^{V \times D}
$$
Where:
- $V$ = vocabulary size  
- $D$ = embedding dimension

Each token ID $i$ retrieves the vector from row $i$ of this matrix.
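To check the lookup-table view, the sketch below compares three ways of getting the vector for one token ID: calling the layer, indexing the weight matrix, and multiplying a one-hot vector by the matrix. The tiny sizes `V = 5` and `D = 3` are chosen only for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
V, D = 5, 3                      # tiny vocabulary and embedding size
embedding = nn.Embedding(V, D)   # embedding.weight has shape (V, D)

i = 3
looked_up = embedding(torch.tensor(i))        # lookup through the layer
row = embedding.weight[i]                     # direct row indexing
one_hot = F.one_hot(torch.tensor(i), num_classes=V).float()
via_matmul = one_hot @ embedding.weight       # one-hot vector times the matrix

print(torch.equal(looked_up, row))            # True
print(torch.allclose(looked_up, via_matmul))  # True
```

The lookup gives the same result as the one-hot matrix product without ever building the one-hot vector.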

---

## ✅ Summary

- Embeddings convert symbolic tokens into trainable dense vectors.
- They allow models to **learn internal representations** of language, categories, users, etc.
- Static ("frozen") embeddings = one fixed vector per token, learned during pretraining  
- Contextual embeddings = dynamic vectors based on full sentence meaning  
- Modern models tokenize flexibly (subwords) to minimize `[UNK]` tokens


# 🔁 Transfer Learning, Pretraining & Fine-Tuning

---

## 🧠 What is Pretraining?

- **Pretraining** = train a model on a large general-purpose task (e.g., masked language modeling, next-token prediction).
- Models learn broad features like syntax, grammar, and real-world facts before being used for specific tasks.

---

## 🧰 Leveraging Pretrained Models

### 1. **Pretrained Embeddings Only**
- Use pretrained static embeddings like Word2Vec, GloVe.
- Can be frozen or fine-tuned.

### 2. **Pretrained Transformers (BERT, GPT, etc.)**
- Use the full stack (embeddings + encoder layers).
- Typically followed by fine-tuning on your downstream task (e.g., classification, NER).

---

## 🔧 Fine-Tuning Pretrained Models

### 🔹 Workflow:
1. Load a pretrained model.
2. Add a task-specific head (e.g., linear classifier).
3. Fine-tune the full model (or parts of it) using your labeled data.
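
The workflow above can be sketched end to end with a stand-in model. `TinyEncoder`, the made-up batch, and the two-class head are all hypothetical. A real run would restore actual pretrained weights in step 1.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 1. "Load" a pretrained model. Stand-in only: real code would restore saved weights.
class TinyEncoder(nn.Module):
    def __init__(self, vocab_size=100, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.layer = nn.Linear(dim, dim)

    def forward(self, token_ids):
        hidden = torch.tanh(self.layer(self.embed(token_ids)))  # (batch, seq, dim)
        return hidden.mean(dim=1)                               # mean-pool over tokens

encoder = TinyEncoder()

# 2. Add a task-specific head (a two-class linear classifier here).
head = nn.Linear(16, 2)

# 3. Fine-tune encoder and head together with a small learning rate.
params = list(encoder.parameters()) + list(head.parameters())
optimizer = torch.optim.AdamW(params, lr=2e-5)

token_ids = torch.tensor([[2, 3, 4], [5, 6, 0]])  # made-up batch of token IDs
labels = torch.tensor([1, 0])                     # made-up labels

logits = head(encoder(token_ids))                 # shape (2, 2)
loss = nn.functional.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()                                  # one fine-tuning step
```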

### ✅ When to fine-tune:
- You have enough labeled data.
- Your task or domain differs from the pretraining data (e.g., medical text).

---

## 🆚 Feature Extraction vs Fine-Tuning

| Strategy           | Updates Pretrained Weights? | Use Case                        |
|--------------------|-----------------------------|----------------------------------|
| Feature extraction | ❌ No                        | Fast, low-data, prototyping      |
| Fine-tuning        | ✅ Yes                       | Max performance, custom tasks   |

---

## ⚙️ Learning Rate Best Practices

- Fine-tuning typically uses **lower LRs** than training from scratch (e.g., $2 \times 10^{-5}$ to $5 \times 10^{-5}$).
- Too high = risk of catastrophic forgetting (pretrained knowledge gets overwritten).
- Too low = underfitting or slow training.

---

## 🔥 Warmup: What & Why

- **Warmup** = gradually increase the LR at the start of training.
- Prevents unstable updates early on.
- Commonly used with transformers.

### 🔢 Linear Warmup Formula:

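$$
\text{LR}(t) = \text{initial\_lr} \times \frac{t}{\text{warmup\_steps}} \quad \text{for } t \leq \text{warmup\_steps}
$$

Here $t$ is the current training step and $\text{initial\_lr}$ is the peak LR, reached at $t = \text{warmup\_steps}$.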
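
A minimal sketch of the warmup formula in plain Python. Holding the LR at `initial_lr` once warmup ends is an assumption made here. The formula itself only covers `t <= warmup_steps`.

```python
def linear_warmup_lr(step, initial_lr, warmup_steps):
    """LR during linear warmup; stays at `initial_lr` once warmup is over."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return initial_lr
    return initial_lr * step / warmup_steps

# LR rises from 0 to 2e-5 over the first 4 steps, then stays flat.
lrs = [linear_warmup_lr(t, initial_lr=2e-5, warmup_steps=4) for t in range(6)]
print(lrs)
```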

---

## 📉 Learning Rate Decay Strategies

### 🔸 Linear Decay:
$$
\text{LR}(t) = \text{initial\_lr} \times \left(1 - \frac{t}{T}\right)
$$

### 🔸 Cosine Decay:
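$$
\text{LR}(t) = \text{initial\_lr} \times \frac{1}{2} \left(1 + \cos\left(\frac{t \pi}{T}\right)\right)
$$

In both formulas, $t$ is the current step and $T$ is the total number of decay steps, so the LR reaches $0$ at $t = T$.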

| Strategy       | Behavior                          | Use Case                       |
|----------------|-----------------------------------|--------------------------------|
| Linear         | Steady decrease                   | Most common default            |
| Cosine         | Slow at start and end, fastest mid-run | Can generalize better on some tasks |
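
Both schedules are easy to compare numerically. The sketch below uses `initial_lr = 1.0` so the printed values read as multipliers. The function names are illustrative, not from any library.

```python
import math

def linear_decay_lr(step, initial_lr, total_steps):
    """Linear decay from initial_lr at step 0 to 0 at total_steps."""
    return initial_lr * (1 - step / total_steps)

def cosine_decay_lr(step, initial_lr, total_steps):
    """Cosine decay from initial_lr at step 0 to 0 at total_steps."""
    return initial_lr * 0.5 * (1 + math.cos(math.pi * step / total_steps))

T = 100
for t in (0, 25, 50, 75, 100):
    print(t, round(linear_decay_lr(t, 1.0, T), 4), round(cosine_decay_lr(t, 1.0, T), 4))
```

At a quarter of the way through, the cosine schedule is still above the linear one. At three quarters it is below, and both hit zero at `t = T`.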

---

## 🧊 Layer Freezing Strategies

### ✅ Why Freeze Layers?
- Preserve general knowledge from pretraining.
- Reduce overfitting and speed up training.

| Layer Type           | Freeze?  | Reason                        |
|----------------------|----------|-------------------------------|
| Embeddings / Early   | ✅ Often | Store general syntax/semantics |
| Middle / Top Layers  | ❌       | Needed for task adaptation    |
| Classifier Head      | ❌       | Task-specific output layer    |
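
In PyTorch, freezing comes down to setting `requires_grad = False` on the chosen parameters. The model below is a hypothetical stand-in (embeddings, two "encoder" layers, a classifier head), used only to show the pattern.

```python
import torch.nn as nn

# Stand-in model: embeddings -> two "encoder" layers -> classifier head.
model = nn.ModuleDict({
    "embeddings": nn.Embedding(100, 16),
    "layer_0": nn.Linear(16, 16),
    "layer_1": nn.Linear(16, 16),
    "classifier": nn.Linear(16, 2),
})

# Freeze the embeddings and the earliest layer; leave the rest trainable.
for name in ("embeddings", "layer_0"):
    for param in model[name].parameters():
        param.requires_grad = False

trainable = {n for n, p in model.named_parameters() if p.requires_grad}
frozen = {n for n, p in model.named_parameters() if not p.requires_grad}
print(sorted(trainable))
print(sorted(frozen))
```

Frozen parameters receive no gradients, so the optimizer leaves them unchanged.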

---

## 🔁 Progressive Unfreezing (Optional)

- Start with all layers frozen.
- Gradually unfreeze top → bottom one layer at a time.
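
One way to express the schedule is a helper that says which layers are trainable at a given epoch. The helper name and the one-layer-per-epoch pace are assumptions. The task head is assumed to stay trainable throughout and is not part of the indices.

```python
def trainable_layers(epoch, num_layers, layers_per_epoch=1):
    """Indices of unfrozen layers at `epoch` (0-based); index num_layers-1 is the top."""
    num_unfrozen = min(num_layers, epoch * layers_per_epoch)
    return list(range(num_layers - num_unfrozen, num_layers))

for epoch in range(4):
    print(epoch, trainable_layers(epoch, num_layers=4))
# 0 []
# 1 [3]
# 2 [2, 3]
# 3 [1, 2, 3]
```

In PyTorch, the returned indices would then be applied by toggling `requires_grad` on the matching layers' parameters.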

---

## 🎯 Layer-wise Learning Rate Decay (LLRD)

### 🔹 Concept:
Use **lower learning rates** for lower layers and **higher LRs** for upper layers and the head.

### 🔢 Example:
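If the base LR is $2 \times 10^{-5}$ and the decay factor is $0.95$:

$$
\text{LR}_i = \text{base\_lr} \times (0.95)^{\text{depth}_{\text{top}} - i}
$$

Where $i$ is the layer index ($0$ = bottom) and $\text{depth}_{\text{top}}$ is the index of the top layer. The top layer trains at the full base LR, and each layer below it trains at $0.95\times$ the rate of the layer above.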

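A coarser, hand-set alternative assigns one multiplier per group of layers (illustrative values, not derived from the $0.95$ formula above):

| Layer     | Learning Rate Scaling |
|-----------|------------------------|
| Bottom    | $0.25 \times$ base LR |
| Middle    | $0.5 \times$ base LR  |
| Top       | $1.0 \times$ base LR  |
| Classifier| $2.0 \times$ base LR  |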

> 🎯 This gives you **fine-grained control** over how much each layer learns.
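
The decay formula can be evaluated for every layer in a few lines. The 12-layer depth is an assumption chosen for illustration, and `llrd_learning_rates` is a hypothetical helper.

```python
def llrd_learning_rates(base_lr, decay, num_layers):
    """LRs for layers 0..num_layers-1; the top layer (index num_layers-1) gets base_lr."""
    top = num_layers - 1
    return [base_lr * decay ** (top - i) for i in range(num_layers)]

lrs = llrd_learning_rates(base_lr=2e-5, decay=0.95, num_layers=12)
print(f"bottom layer: {lrs[0]:.3e}")
print(f"top layer:    {lrs[-1]:.3e}")
```

In PyTorch, these values would typically be passed to the optimizer as per-parameter groups: a list of dicts, each with its own `"params"` and `"lr"`.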

---

## 🧠 Freezing vs LLRD: A Mental Model

| Technique | Control over learning                        | Activation-function analogy    |
|-----------|----------------------------------------------|--------------------------------|
| Freezing  | Binary: a layer either trains or it doesn't  | Like ReLU: a hard on/off gate  |
| LLRD      | Continuous: each layer's LR is scaled        | Like GELU: a smooth gate       |

---

## ✅ TL;DR

- Fine-tuning pretrained models requires **careful learning rate control**.
- Combine **warmup**, **decay**, **freezing**, and **LLRD** to train safely and effectively.
- Try different combinations and use validation loss to guide strategy.



# 🔄 Tensor Shape Manipulation: `unsqueeze()` vs `squeeze()`

---

## 🧠 Overview

- **`unsqueeze(dim)`** → adds a new dimension of size `1` at the specified `dim` index
- **`squeeze(dim=None)`** → removes all dimensions of size `1`, or only `dim` if you provide it (a no-op when that dimension is not size `1`)

These functions are used to **reshape tensors**, especially for:
- Batching
- Broadcasting
- Compatibility with model input/output shapes

---

## 🔹 `unsqueeze(dim)` — Add dimension

### 📌 Usage:
```python
import torch

x = torch.tensor([1, 2, 3])        # Shape: [3]
x.unsqueeze(0).shape               # [1, 3] — adds a batch dimension
x.unsqueeze(1).shape               # [3, 1] — makes it a column vector
```

### 🧠 Intuition:
- Think of `unsqueeze(0)` as: "wrap this tensor in a batch"
- Think of `unsqueeze(1)` as: "turn this 1D row into a 2D column"

### 📐 Shape Rule:
If tensor has shape `[D1, D2, ..., Dn]`:
- You can insert a new dimension at any index from `0` to `n`
- Negative indices count from the end (just like Python lists)

| Input Shape | Operation          | Output Shape | Meaning                      |
|-------------|--------------------|--------------|------------------------------|
| `[3]`       | `.unsqueeze(0)`    | `[1, 3]`     | Add batch dimension          |
| `[3]`       | `.unsqueeze(1)`    | `[3, 1]`     | Convert to column shape      |
| `[3, 4]`     | `.unsqueeze(-1)`   | `[3, 4, 1]`   | Add trailing singleton dim   |
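
A short sketch of why the extra axis matters for broadcasting: a `[3, 1]` column plus a `[1, 3]` row broadcasts to a `[3, 3]` grid of pairwise sums.

```python
import torch

x = torch.tensor([1, 2, 3])          # shape [3]

col = x.unsqueeze(1)                 # shape [3, 1]
row = x.unsqueeze(0)                 # shape [1, 3]
outer_sum = col + row                # broadcasts to shape [3, 3]

print(outer_sum)
# tensor([[2, 3, 4],
#         [3, 4, 5],
#         [4, 5, 6]])
print(x.unsqueeze(-1).shape)         # torch.Size([3, 1])
```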

---

## 🔻 `squeeze(dim=None)` — Remove dimension(s)

### 📌 Usage:
```python
import torch

x = torch.tensor([[[1], [2], [3]]])  # Shape: [1, 3, 1]
x.squeeze().shape                    # [3] — removes all singleton dimensions
x.squeeze(0).shape                   # [3, 1] — removes only dim 0 if it's size 1
```

### 🧠 Intuition:
- Use `.squeeze()` to get rid of unnecessary `[1]`s in your tensor shape
- Common when reducing outputs like `[1, 1, N]` → `[N]`
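
One thing to watch for: a bare `.squeeze()` also removes a batch dimension of size `1`. The sketch below uses a hypothetical `[1, 5, 1]` output to contrast it with squeezing a specific axis.

```python
import torch

logits = torch.zeros(1, 5, 1)        # e.g. batch of 1, 5 items, 1 score each

print(logits.squeeze().shape)        # torch.Size([5])        batch dim is gone too
print(logits.squeeze(-1).shape)      # torch.Size([1, 5])     batch dim preserved
print(logits.squeeze(1).shape)       # torch.Size([1, 5, 1])  dim 1 has size 5: no-op
```

Passing an explicit `dim` is the safer choice when batch size can vary.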

---

## 🧩 When to use

| Use Case                      | Recommendation            |
|-------------------------------|----------------------------|
| One input → batch model       | Use `.unsqueeze(0)`        |
| Fix shape mismatch errors     | Use `.unsqueeze(dim)` or `.squeeze(dim)` on the mismatched axis |
| Collapse outputs              | Use `.squeeze()` to simplify |

---

## ✅ TL;DR

- `unsqueeze(n)` = **insert** a new axis of size `1` at index `n`
- `squeeze(n)` = **remove** axis `n` if it has size `1`; `squeeze()` with no argument removes every axis of size `1`
- These are your tools for making tensor shapes compatible with model input/output expectations
