# 🧩 Word Embeddings: From Words to Vectors

---

## 1️⃣ What & Why: High-level Intuition

**Goal:** Represent each word $ w $ in a vocabulary $ V $ as a vector so that **semantic** and **syntactic** relationships between words are captured geometrically (e.g., similar words are close in space).

Two broad families of feature extraction techniques:

1. **Count / Frequency-based Representations**
   - **Bag of Words (BoW):** Counts of token occurrences.
   - **One-Hot Encoding:** A unique binary vector per token.
   - **TF-IDF:** Weighted frequency emphasizing informative terms.

2. **Neural / Deep Learning-based Representations**
   - **Word2Vec:** Learns dense word vectors by predicting a target word from its context (CBOW) or the context from a target word (Skip-Gram).

---

## 2️⃣ Conceptual Hierarchy of Word Embeddings

The diagram below traces word representation techniques from frequency-based encodings to learned neural models.

$$
\begin{array}{c}
\boxed{\Large \textbf{Word Embedding}} \\[10pt]

% Main outer arrows to both grouped boxes
\begin{array}{cc}
\swarrow & \searrow
\end{array}
\\[8pt]

% Two main grouped boxes side-by-side
\begin{array}{cc}

% ---------- LEFT GROUP ----------
\color{#2E8B57}{
\boxed{
\begin{array}{c}
\textbf{Count / Frequency-based Models} \\[8pt]
\Downarrow \\[6pt]
\boxed{\text{One-Hot}}
\;\longrightarrow\;
\boxed{\text{BoW}}
\;\longrightarrow\;
\boxed{\text{TF-IDF}}
\end{array}
}}

&
% ---------- RIGHT GROUP ----------
\color{#4682B4}{
\boxed{
\begin{array}{c}
\textbf{Deep Learning-based Models} \\[8pt]
\Downarrow \\[6pt]
\boxed{\text{Word2Vec}} \\[6pt]
\Downarrow \\[4pt]
\boxed{\text{CBOW (Continuous BoW)}} 
\;\longleftrightarrow\;
\boxed{\text{Skip-Gram}}
\end{array}
}}
\end{array}
\\[12pt]


% connecting label
\color{gray}{
\text{(Traditional statistical approaches} \;\;\longrightarrow\;\; \text{Learned neural representations)}
}
\end{array}
$$


---

### 🔍 Explanation

#### 🌿 **Count / Frequency-based Models**
| Model | Description |
|--------|--------------|
| **One-Hot** | Each word represented by a unique binary vector (no relation captured). |
| **BoW** | Counts word occurrences per document (order ignored). |
| **TF-IDF** | Weights terms by importance: downweights frequent but less informative words. |

➡️ These are **sparse**, **non-contextual**, and depend purely on frequency statistics.

---

#### 💠 **Deep Learning-based Models**
| Model | Description |
|--------|--------------|
| **Word2Vec** | Learns word embeddings using a shallow neural network. |
| **CBOW (Continuous BoW)** | Predicts a word from its context words. |
| **Skip-Gram** | Predicts context words from the center word. |

➡️ These are **dense** and **semantic**, and they are **learned from context** (each word still gets a single static vector), capturing relationships like:  
$$
\text{king} - \text{man} + \text{woman} \approx \text{queen}
$$

---

### 🧩 Summary of the Evolution
1. **Start:** Discrete vectors → *(One-Hot)*  
2. **Add frequency counts:** *(BoW)*  
3. **Add weighting for importance:** *(TF-IDF)*  
4. **Add learning and semantics:** *(Word2Vec)*  
5. **Choose a training architecture:** predict the center word from its context *(CBOW)* or the context from the center word *(Skip-Gram)*  

---

## 3️⃣ Formal Definitions

### 🔹 One-Hot Encoding
For vocabulary size $ |V| $, the one-hot vector for word $ w $ is defined as:

$$
\mathbf{x}_w \in \{0,1\}^{|V|}, \quad
(\mathbf{x}_w)_i =
\begin{cases}
1, & \text{if } i = \operatorname{index}(w) \\
0, & \text{otherwise}
\end{cases}
$$

- **Pros:** Simple, unambiguous.
- **Cons:** Extremely sparse; no similarity between words.
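
The definition above can be sketched in a few lines of plain Python. The vocabulary and the helper names `one_hot` and `dot` are illustrative assumptions, not part of any library.

```python
def one_hot(word, vocab):
    """Return the one-hot vector of `word` as a list of 0/1 ints."""
    index = vocab.index(word)  # raises ValueError for out-of-vocabulary words
    return [1 if i == index else 0 for i in range(len(vocab))]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


vocab = ["king", "queen", "apple"]  # illustrative vocabulary
x_king = one_hot("king", vocab)
x_queen = one_hot("queen", vocab)

print(x_king)                # [1, 0, 0]
print(dot(x_king, x_queen))  # 0: distinct words are always orthogonal
```

The zero dot product is the "no similarity" drawback in concrete form: every pair of distinct words looks equally unrelated.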

---

### 🔹 Bag-of-Words (BoW)
A document $ d $ is represented as a count vector:

$$
\mathbf{c}_d \in \mathbb{R}^{|V|}, \quad
(\mathbf{c}_d)_i = \text{count of token } v_i \text{ in } d.
$$

- **Pros:** Straightforward and effective for linear models.
- **Cons:** Ignores word order and context.
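
A minimal sketch of the count vector, assuming a naive lowercase whitespace tokenizer (real pipelines usually tokenize more carefully). The function name `bow_vector` is hypothetical.

```python
from collections import Counter


def bow_vector(document, vocab):
    """Count how often each vocabulary token occurs in `document`."""
    counts = Counter(document.lower().split())  # naive tokenizer (assumption)
    return [counts[token] for token in vocab]


vocab = ["the", "cat", "sat", "mat"]
c_d = bow_vector("The cat sat on the mat", vocab)
print(c_d)  # [2, 1, 1, 1]

# Shuffling the words gives the same vector: word order is lost.
print(bow_vector("mat the sat cat on the", vocab) == c_d)  # True
```

Tokens outside the vocabulary ("on" here) are simply dropped in this sketch.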

---

### 🔹 TF-IDF (Term Frequency-Inverse Document Frequency)
Weights terms based on their importance across documents:

$$
\operatorname{tfidf}(t, d) = \operatorname{tf}(t, d) \cdot \operatorname{idf}(t)
$$
$$
\operatorname{idf}(t) = \log\!\left( \frac{N}{1 + \operatorname{df}(t)} \right)
$$

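where:
- $ N $: total number of documents  
- $ \operatorname{df}(t) $: number of documents containing term $ t $
- the $ 1 + $ in the denominator is a smoothing term that avoids division by zero for unseen terms; it makes $ \operatorname{idf}(t) $ zero or negative when $ \operatorname{df}(t) \ge N - 1 $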

**Common similarity metric:**

$$
\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a}^\top \mathbf{b}}{\lVert \mathbf{a}\rVert_2 \lVert \mathbf{b}\rVert_2}
$$

- **Pros:** Reduces weight of common words.
- **Cons:** Sparse, still ignores semantic meaning.
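
The two formulas above can be sketched as follows. This assumes raw counts for $ \operatorname{tf} $, a base-10 logarithm, and pre-tokenized documents; the corpus and function names are made up for illustration.

```python
import math
from collections import Counter


def idf(term, documents):
    """Smoothed idf from the formula above: log10(N / (1 + df))."""
    n_docs = len(documents)
    df = sum(1 for doc in documents if term in doc)
    return math.log10(n_docs / (1 + df))


def tfidf_vector(doc, documents, vocab):
    """Raw term count times idf for every vocabulary term."""
    tf = Counter(doc)
    return [tf[term] * idf(term, documents) for term in vocab]


# Documents are pre-tokenized lists of lowercase tokens (assumption).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "bird", "sang"],
    ["a", "fish", "swam"],
]
vocab = ["the", "cat", "dog"]
weights = tfidf_vector(docs[0], docs, vocab)
print([round(w, 3) for w in weights])  # [0.0, 0.301, 0.0]
```

The common word "the" (in 3 of 4 documents) gets weight 0, while the rarer "cat" keeps a positive weight.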

---

### 🔹 Word2Vec (Learned Dense Embeddings)
Word2Vec learns dense vector representations through a neural network.

Each word $ w $ has:
- **Input vector:** $ \mathbf{v}_w $
- **Output vector:** $ \mathbf{u}_w $

---

#### 🟢 Skip-Gram Model
Predicts context words $ w_c $ from a center word $ w_t $:

$$
\max_\Theta \sum_{t} \sum_{w_c \in \mathcal{C}_t} 
\log p(w_c | w_t)
$$

$$
p(w_c | w_t) =
\frac{\exp(\mathbf{u}_{w_c}^{\top}\mathbf{v}_{w_t})}
{\sum_{w' \in V}\exp(\mathbf{u}_{w'}^{\top}\mathbf{v}_{w_t})}
$$
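
As a rough sketch, the softmax above can be computed directly for a tiny vocabulary. The vectors are hand-picked and hypothetical, not trained, and `skipgram_probs` is an illustrative name.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def skipgram_probs(center, input_vecs, output_vecs):
    """Return p(w | center) for every word w via the full softmax."""
    v_t = input_vecs[center]
    scores = {w: dot(u_w, v_t) for w, u_w in output_vecs.items()}
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - max_score) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


# Hypothetical 2-D vectors, chosen by hand for illustration only.
input_vecs = {"drink": [1.0, 0.5]}
output_vecs = {"water": [1.2, 0.4], "tea": [0.9, 0.6], "rock": [-1.0, 0.2]}

probs = skipgram_probs("drink", input_vecs, output_vecs)
print(max(probs, key=probs.get))  # water
```

The sum in the denominator runs over every word in the vocabulary, which is why the full softmax becomes expensive for realistic vocabulary sizes.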

---

#### 🟣 Continuous Bag-of-Words (CBOW)
Predicts the target word from the average of its surrounding context words:

$$
\bar{\mathbf{v}}_{\mathcal{C}_t} =
\frac{1}{|\mathcal{C}_t|} \sum_{w_c \in \mathcal{C}_t} \mathbf{v}_{w_c}
$$

$$
\max_\Theta \sum_t \log p(w_t | \bar{\mathbf{v}}_{\mathcal{C}_t})
$$
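
A minimal sketch of the CBOW forward pass. It assumes $ p(w_t \mid \bar{\mathbf{v}}_{\mathcal{C}_t}) $ is a softmax over output vectors, mirroring the Skip-Gram formula; the vectors and function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cbow_probs(context, input_vecs, output_vecs):
    """p(w | averaged context) for every candidate word w (softmax assumed)."""
    v_bar = mean_vector([input_vecs[w] for w in context])
    scores = {
        w: sum(u * v for u, v in zip(u_w, v_bar))
        for w, u_w in output_vecs.items()
    }
    max_score = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - max_score) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


# Hand-picked, untrained vectors for illustration only.
input_vecs = {"the": [0.4, 0.2], "sat": [0.8, 0.6]}
output_vecs = {"cat": [1.0, 0.8], "sky": [-0.6, 0.1]}

probs = cbow_probs(["the", "sat"], input_vecs, output_vecs)
print(max(probs, key=probs.get))  # cat
```

Because the context vectors are averaged, the order of the context words has no effect on the prediction.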

---

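#### ⚙️ Negative Sampling (Efficient Approximation)

The softmax denominator sums over the whole vocabulary, which costs $ O(|V|) $ per training pair. Negative sampling replaces it with a binary objective: score the observed pair $ (w_t, w_c) $ high and $ k $ sampled noise words $ w_i^{-} $ low.
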
$$
\mathcal{L}_{\text{NS}} =
- \Big[
\log \sigma(\mathbf{u}_{w_c}^{\top}\mathbf{v}_{w_t})
+ \sum_{i=1}^{k} \log \sigma(-\mathbf{u}_{w_i^{-}}^{\top}\mathbf{v}_{w_t})
\Big]
$$
$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

- **Pros:** Dense, compact, captures analogies and semantics.  
- **Cons:** Needs a training corpus; assigns one vector per word, so different senses of a polysemous word (e.g., "bank") are merged.
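
The negative-sampling loss $ \mathcal{L}_{\text{NS}} $ for a single training pair can be sketched as below. The vectors are hypothetical and the negatives are passed in directly; how they are drawn from a noise distribution is left out of this sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_loss(v_center, u_context, u_negatives):
    """Loss for one (center, context) pair with k sampled negatives."""
    positive = math.log(sigmoid(dot(u_context, v_center)))
    negative = sum(
        math.log(sigmoid(-dot(u_neg, v_center))) for u_neg in u_negatives
    )
    return -(positive + negative)


v_t = [0.2, 0.8]                          # center word (input vector)
u_c = [0.3, 0.9]                          # observed context (output vector)
negatives = [[-0.5, -0.4], [0.1, -0.7]]   # k = 2 hypothetical noise words

good = negative_sampling_loss(v_t, u_c, negatives)
bad = negative_sampling_loss(v_t, u_c, [u_c, u_c])  # negatives that look like the context
print(good < bad)  # True
```

The loss is lower when the noise words score poorly against the center word, which is the behavior training pushes toward. Its cost depends on $ k $, not on $ |V| $.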

---

## 4️⃣ Conceptual Comparison (Feature Types)

| **Aspect** | **One-Hot** | **BoW** | **TF-IDF** | **Word2Vec (CBOW/SG)** |
|-------------|--------------|----------|-------------|---------------------------|
| **Dimensionality** | $ \lvert V \rvert $ | $ \lvert V \rvert $ | $ \lvert V \rvert $ | $ d \ll \lvert V \rvert $ |
| **Sparsity** | Very high | High | High | **Low (dense)** |
| **Context Awareness** | ✗ | ✗ | ✗ | **✓** (learned from context windows; vectors are static) |
| **Semantic Similarity** | ✗ | Limited | Limited | **✓✓✓** |
| **Training Needed** | No | No | No | **Yes** |
| **Order / Syntax** | ✗ | ✗ | ✗ | **Partial (via context window)** |

---

## 5️⃣ Mathematical Relationships (KaTeX Visuals)

**Cosine Similarity:**
$$
\text{similarity}(w_i, w_j) = 
\cos(\mathbf{e}_{w_i}, \mathbf{e}_{w_j}) =
\frac{\mathbf{e}_{w_i}^{\top}\mathbf{e}_{w_j}}
{\lVert \mathbf{e}_{w_i}\rVert_2 \lVert \mathbf{e}_{w_j}\rVert_2}
$$
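
A small sketch of the cosine formula in plain Python. The three 2-D embeddings are invented for illustration and do not come from a trained model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, hand-picked so that cat and dog point the same way.
e_cat = [0.9, 0.1]
e_dog = [0.8, 0.3]
e_car = [-0.2, 0.9]

print(cosine(e_cat, e_dog) > cosine(e_cat, e_car))  # True
```

Cosine depends only on direction, so scaling a vector leaves its similarity to other vectors unchanged.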

**Analogy Relationship:**
$$
\mathbf{e}_{\text{king}} - \mathbf{e}_{\text{man}} + \mathbf{e}_{\text{woman}}
\approx \mathbf{e}_{\text{queen}}
$$
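
The analogy can be checked mechanically with a nearest-neighbor search. The sketch below uses hand-made 2-D vectors whose axes loosely stand for "royalty" and "maleness"; these are an assumption for illustration, and real embeddings only satisfy the relation approximately.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def analogy(a, b, c, embeddings):
    """Return the word closest to e_b - e_a + e_c, excluding a, b and c."""
    target = [
        eb - ea + ec
        for ea, eb, ec in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


# Hand-made toy vectors: [royalty, maleness]. Not from a trained model.
embeddings = {
    "king": [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}

print(analogy("man", "king", "woman", embeddings))  # queen
```

Excluding the three query words from the candidates is a common convention, since one of them is often the nearest neighbor of the target vector.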

**Context window (size $ m $) around position $ t $:**
$$
\mathcal{C}_t = \{ w_{t-m}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+m} \}
$$
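
A minimal sketch of extracting $ \mathcal{C}_t $ from a token list. Truncating the window at the sentence boundaries is an assumption of this sketch, and the function names are illustrative.

```python
def context_window(tokens, t, m):
    """Return up to m tokens on each side of position t (center excluded)."""
    left = tokens[max(0, t - m):t]
    right = tokens[t + 1:t + 1 + m]
    return left + right


def skipgram_pairs(tokens, m):
    """All (center, context) training pairs for a window of size m."""
    return [
        (tokens[t], c)
        for t in range(len(tokens))
        for c in context_window(tokens, t, m)
    ]


tokens = "the quick brown fox jumps".split()
print(context_window(tokens, 2, 2))  # ['the', 'quick', 'fox', 'jumps']
print(context_window(tokens, 0, 1))  # ['quick']
print(skipgram_pairs(["I", "love", "dogs"], 1))
# [('I', 'love'), ('love', 'I'), ('love', 'dogs'), ('dogs', 'love')]
```

Skip-Gram trains on the individual pairs, while CBOW would group each center word with its whole window.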

---

## 6️⃣ Practical Notes

- Use **TF-IDF** for classical ML models (SVM, Logistic Regression) when interpretability matters.  
- Use **Word2Vec** for deep learning tasks requiring semantic understanding.  
- Hybrid approaches combine both:  
  $$
  \mathbf{v}_{\text{doc}} = [\text{TF-IDF} \; \| \; \text{Mean(Word2Vec)}]
  $$
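
The hybrid representation is a plain concatenation, which might be sketched as follows. The input values are placeholders rather than outputs of a real TF-IDF or Word2Vec model, and `hybrid_doc_vector` is a hypothetical helper.

```python
def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def hybrid_doc_vector(tfidf_vec, word_vecs):
    """Concatenate a document's TF-IDF vector with its mean word embedding."""
    return list(tfidf_vec) + mean_vector(word_vecs)


tfidf_vec = [0.0, 0.3, 0.3]            # placeholder sparse part, |V| = 3
word_vecs = [[0.2, 0.8], [0.4, 0.6]]   # placeholder embeddings, d = 2

doc_vec = hybrid_doc_vector(tfidf_vec, word_vecs)
print(len(doc_vec))  # 5, i.e. |V| + d
```

In practice the two parts often live on different scales, so normalizing each part before concatenating is worth considering.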




# 🧩 Word Embeddings: From Words to Vectors (with Examples)

---

## 1️⃣ Understanding the Landscape

Word embeddings transform **words → numeric vectors**, so that models can compare words by meaning and similarity.

Two major categories:

- **Count/Frequency-based Representations**
  - Represent text using counts or weighted frequencies.
- **Neural/Deep Learning-based Representations**
  - Learn continuous dense vectors using training objectives.

---

## 2️⃣ Step-by-Step Examples with Substituted Values

Let's take a **toy corpus** of two simple sentences:

> 🗂️ Corpus:  
> D₁ = "I love dogs"  
> D₂ = "I love cats"

Vocabulary $ V = \{ \text{I}, \text{love}, \text{dogs}, \text{cats} \} \Rightarrow |V| = 4 $

---

### 🔹 Example 1: One-Hot Encoding

Each word is represented as a binary vector of size $ |V| = 4 $.

$$
\begin{aligned}
\text{I}      &\rightarrow [1, 0, 0, 0] \\
\text{love}   &\rightarrow [0, 1, 0, 0] \\
\text{dogs}   &\rightarrow [0, 0, 1, 0] \\
\text{cats}   &\rightarrow [0, 0, 0, 1]
\end{aligned}
$$

Each position corresponds to a vocabulary index.
No relationship between "dogs" and "cats" is captured: their vectors are orthogonal.

---

### 🔹 Example 2: Bag-of-Words (BoW)

Represent each document as counts over the vocabulary:

| Document | I | love | dogs | cats |
|--------|---|------|------|------|
| D₁     | 1 | 1 | 1 | 0 |
| D₂     | 1 | 1 | 0 | 1 |

$$
\mathbf{c}_{D_1} = [1, 1, 1, 0], \quad
\mathbf{c}_{D_2} = [1, 1, 0, 1]
$$

---

### 🔹 Example 3: TF-IDF Calculation (with substitution)

Compute TF-IDF weights for each word:

$$
\operatorname{tfidf}(t, d) = \operatorname{tf}(t, d) \times \operatorname{idf}(t)
$$
$$
\operatorname{idf}(t) = \log\!\left( \frac{N}{1 + \operatorname{df}(t)} \right)
$$

Where $ N = 2 $ (number of documents) and $ \log $ is base 10.

| Term | D₁ tf | D₂ tf | df(t) | idf(t) = log₁₀(2 / (1 + df)) |
|------|--------|--------|-------|-----------------------------|
| I | 1 | 1 | 2 | log₁₀(2 / 3) = -0.176 |
| love | 1 | 1 | 2 | log₁₀(2 / 3) = -0.176 |
| dogs | 1 | 0 | 1 | log₁₀(2 / 2) = 0 |
| cats | 0 | 1 | 1 | log₁₀(2 / 2) = 0 |

$$
\text{TF-IDF}(D_1) = [1 \times (-0.176), 1 \times (-0.176), 1 \times 0, 0 \times 0]
$$
$$
\text{TF-IDF}(D_2) = [1 \times (-0.176), 1 \times (-0.176), 0 \times 0, 1 \times 0]
$$

Result (rounded):

$$
\mathbf{v}_{D_1} = [-0.18, -0.18, 0, 0], \quad
\mathbf{v}_{D_2} = [-0.18, -0.18, 0, 0]
$$

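⚠️ Both documents get identical vectors here. With only two documents, the smoothed idf assigns weight 0 to the distinguishing terms ("dogs", "cats") and a negative weight to the shared ones, so the collision is an artifact of the tiny corpus and this idf variant. More generally, TF-IDF treats "dogs" and "cats" as unrelated dimensions, so it cannot express that they are semantically close (semantic gap).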
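
For comparison, here is the same calculation with the unsmoothed variant $ \operatorname{idf}(t) = \log_{10}\!\left( N / \operatorname{df}(t) \right) $. This variant is an assumption introduced for illustration and is not the formula used above.

$$
\operatorname{idf}(\text{I}) = \operatorname{idf}(\text{love}) = \log_{10}\!\left(\tfrac{2}{2}\right) = 0, \quad
\operatorname{idf}(\text{dogs}) = \operatorname{idf}(\text{cats}) = \log_{10}\!\left(\tfrac{2}{1}\right) \approx 0.301
$$

$$
\mathbf{v}_{D_1} = [0, 0, 0.301, 0], \quad
\mathbf{v}_{D_2} = [0, 0, 0, 0.301]
$$

Under this variant the two documents are distinguishable, but their cosine similarity is 0: "dogs" and "cats" still occupy unrelated dimensions.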

---

### 🔹 Example 4: Word2Vec (Conceptual Substitution)

**Idea:** Learn embedding vectors such that the target word predicts its context (Skip-Gram) or the context predicts the target word (CBOW).

Let's assume:
$$
\text{Sentence: "I love dogs"}
$$
and window size $ m = 1 $.

**Context windows:**

| Target | Context |
|---------|----------|
| I | [love] |
| love | [I, dogs] |
| dogs | [love] |

---

#### (a) Skip-Gram Objective Example

For pair $ (w_t = \text{love}, w_c = \text{dogs}) $:

$$
p(w_c | w_t) = \frac{\exp(\mathbf{u}_{w_c}^{\top}\mathbf{v}_{w_t})}
{\sum_{w' \in V} \exp(\mathbf{u}_{w'}^{\top}\mathbf{v}_{w_t})}
$$

Assume 2D embeddings:

$$
\mathbf{v}_{\text{love}} = [0.2, 0.8], \quad
\mathbf{u}_{\text{dogs}} = [0.3, 0.9]
$$

Then:

$$
\mathbf{u}_{\text{dogs}}^{\top}\mathbf{v}_{\text{love}} =
(0.3)(0.2) + (0.9)(0.8) = 0.06 + 0.72 = 0.78
$$

$$
p(\text{dogs}|\text{love}) \propto e^{0.78} \approx 2.18
$$

Normalizing over the vocabulary turns this score into a probability. Training raises the score of pairs that co-occur within the window, such as ("love", "dogs"), relative to all other words, which is exactly what we want semantically.
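
To make the normalization step concrete, suppose the remaining output vectors are $ \mathbf{u}_{\text{I}} = [0.1, 0.1] $, $ \mathbf{u}_{\text{love}} = [0.0, 0.2] $ and $ \mathbf{u}_{\text{cats}} = [0.2, 0.7] $. These values are hypothetical and chosen only for illustration. Their scores against $ \mathbf{v}_{\text{love}} = [0.2, 0.8] $ are $ 0.10 $, $ 0.16 $ and $ 0.60 $, so:

$$
\sum_{w' \in V} \exp(\mathbf{u}_{w'}^{\top}\mathbf{v}_{\text{love}})
= e^{0.10} + e^{0.16} + e^{0.60} + e^{0.78}
\approx 1.11 + 1.17 + 1.82 + 2.18 = 6.28
$$

$$
p(\text{dogs}|\text{love}) \approx \frac{2.18}{6.28} \approx 0.35
$$

With these assumed vectors "dogs" is the most probable context word, and "cats" is close behind because its hypothetical output vector points in a similar direction.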

---

#### (b) CBOW Objective Example

Predict center word $ w_t = \text{love} $ from context [I, dogs]:

$$
\bar{\mathbf{v}}_{\mathcal{C}_t} =
\frac{\mathbf{v}_{\text{I}} + \mathbf{v}_{\text{dogs}}}{2}
$$

Assume:
$$
\mathbf{v}_{\text{I}} = [0.1, 0.2], \quad
\mathbf{v}_{\text{dogs}} = [0.3, 0.9]
$$

Then:

$$
\bar{\mathbf{v}}_{\mathcal{C}_t} = \frac{[0.1 + 0.3, 0.2 + 0.9]}{2} = [0.2, 0.55]
$$

The model then predicts the word "love" using this averaged context embedding.

---

## 4️⃣ Summary Table: Intuition and Representation

| **Aspect** | **One-Hot** | **BoW** | **TF-IDF** | **Word2Vec** |
|-------------|--------------|----------|-------------|---------------|
| **Representation** | Binary vector | Count vector | Weighted count | Learned dense vector |
| **Captures Context?** | ✗ | ✗ | ✗ | ✓ |
| **Handles Synonyms?** | ✗ | ✗ | ✗ | ✓ |
| **Dimensionality** | $ \lvert V \rvert $ | $ \lvert V \rvert $ | $ \lvert V \rvert $ | $ d \ll \lvert V \rvert $ |
| **Sparsity** | High | High | High | Low |
| **Example Vector ($ \lvert V \rvert = 4 $)** | [1,0,0,0] | [1,1,1,0] | [-0.18,-0.18,0,0] | [0.2,0.8] |
| **Semantic Power** | ❌ | ⚪️ | ⚪️ | ✅✅✅ |

---

## 5️⃣ Takeaway

- **Count-based** methods rely on surface frequency: simple and interpretable, but blind to meaning.  
- **Word2Vec** learns meaning from context, geometrically encoding relationships:
  $$
  \mathbf{e}_{\text{king}} - \mathbf{e}_{\text{man}} + \mathbf{e}_{\text{woman}}
  \approx \mathbf{e}_{\text{queen}}
  $$
- This allows downstream models to generalize better and understand analogy.

---
