## 📊 One-Hot Encoding

| Document | Text              |
|-----------|------------------|
| D1 | The food is good |
| D2 | The food is bad |
| D3 | Pizza is amazing |

---

## 🧱 Vocabulary (Unique Words)

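`['The', 'food', 'is', 'good', 'bad', 'Pizza', 'amazing']`

Each row of the table below is the one-hot vector of one vocabulary word, in the order listed above.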

| The | food | is | good | bad | Pizza | amazing |
|-----|------|----|------|-----|--------|----------|
|  1  |  0   | 0  |  0   | 0   |  0     | 0        |
|  0  |  1   | 0  |  0   | 0   |  0     | 0        |
|  0  |  0   | 1  |  0   | 0   |  0     | 0        |
|  0  |  0   | 0  |  1   | 0   |  0     | 0        |
|  0  |  0   | 0  |  0   | 1   |  0     | 0        |
|  0  |  0   | 0  |  0   | 0   |  1     | 0        |
|  0  |  0   | 0  |  0   | 0   |  0     | 1        |
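
The vocabulary above can be reproduced with a few lines of Python. This is a minimal sketch that assumes whitespace tokenization, case-sensitive matching, and first-seen ordering. The names `build_vocabulary` and `word_to_index` are illustrative and do not come from any library.

```python
# Toy corpus from the document table.
documents = {
    "D1": "The food is good",
    "D2": "The food is bad",
    "D3": "Pizza is amazing",
}

def build_vocabulary(texts):
    """Collect unique tokens in first-seen order (whitespace split, case-sensitive)."""
    vocab = []
    seen = set()
    for text in texts:
        for token in text.split():
            if token not in seen:
                seen.add(token)
                vocab.append(token)
    return vocab

vocab = build_vocabulary(documents.values())

# Each word's position in the vocabulary is the index of its single 1.
word_to_index = {word: i for i, word in enumerate(vocab)}

print(vocab)
# ['The', 'food', 'is', 'good', 'bad', 'Pizza', 'amazing']
print(word_to_index["Pizza"])
# 5
```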


---

## 🔢 One-Hot Encoding Representation

### 📄 D1: “The food is good”

| Word | The | food | is | good | bad | Pizza | amazing |
|------|-----|------|----|------|-----|--------|----------|
| The  | 1   | 0    | 0  | 0    | 0   | 0      | 0        |
| food | 0   | 1    | 0  | 0    | 0   | 0      | 0        |
| is   | 0   | 0    | 1  | 0    | 0   | 0      | 0        |
| good | 0   | 0    | 0  | 1    | 0   | 0      | 0        |

**Shape:** `4 × 7`

---

### 📄 D2: “The food is bad”

| Word | The | food | is | good | bad | Pizza | amazing |
|------|-----|------|----|------|-----|--------|----------|
| The  | 1   | 0    | 0  | 0    | 0   | 0      | 0        |
| food | 0   | 1    | 0  | 0    | 0   | 0      | 0        |
| is   | 0   | 0    | 1  | 0    | 0   | 0      | 0        |
| bad  | 0   | 0    | 0  | 0    | 1   | 0      | 0        |

**Shape:** `4 × 7`

---

### 📄 D3: “Pizza is amazing”

| Word | The | food | is | good | bad | Pizza | amazing |
|------|-----|------|----|------|-----|--------|----------|
| Pizza   | 0   | 0    | 0  | 0    | 0   | 1      | 0        |
| is      | 0   | 0    | 1  | 0    | 0   | 0      | 0        |
| amazing | 0   | 0    | 0  | 0    | 0   | 0      | 1        |

**Shape:** `3 × 7`
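
The three document matrices can be generated programmatically. The sketch below assumes the vocabulary order used in the tables and whitespace tokenization. `encode_document` is a hypothetical helper name.

```python
vocab = ["The", "food", "is", "good", "bad", "Pizza", "amazing"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def encode_document(text):
    """Return one one-hot row per token, so the result is (num_tokens x vocab_size)."""
    matrix = []
    for token in text.split():
        row = [0] * len(vocab)
        row[word_to_index[token]] = 1
        matrix.append(row)
    return matrix

documents = {
    "D1": "The food is good",
    "D2": "The food is bad",
    "D3": "Pizza is amazing",
}

# Record the (rows, columns) shape of each encoded document.
shapes = {}
for name, text in documents.items():
    matrix = encode_document(text)
    shapes[name] = (len(matrix), len(matrix[0]))
    print(name, shapes[name])
# D1 (4, 7)
# D2 (4, 7)
# D3 (3, 7)
```

Note that D3 has a different number of rows from D1 and D2, because the row count follows the number of words in the document.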

---

## 🧮 Summary

- Total Vocabulary Size: **7**
- One-hot vector length per word = **7**
- Each document is represented as a **matrix** where:  
  ➤ **Rows = Words in document**  
  ➤ **Columns = Vocabulary terms**

---

## 💡 Insight

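> One-hot encoding represents each word as a binary vector whose length equals the vocabulary size.  
> However, it becomes **inefficient for large vocabularies**, because every new word adds one more dimension to every vector while each vector still holds a single `1`. Modern NLP models therefore use dense embeddings such as **Word2Vec** and **GloVe**, or contextual models such as **BERT**.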
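
To make the inefficiency concrete, the sketch below counts how many cells a dense one-hot matrix needs and what fraction of them are zeros. The token count of 1,000 and the larger vocabulary sizes are arbitrary illustrative values, and `one_hot_stats` is a hypothetical helper.

```python
def one_hot_stats(num_tokens, vocab_size):
    """Return (total cells, fraction of zero cells) for a dense one-hot matrix."""
    total_cells = num_tokens * vocab_size
    nonzero_cells = num_tokens  # exactly one 1 per encoded token
    sparsity = 1 - nonzero_cells / total_cells
    return total_cells, sparsity

for vocab_size in (7, 10_000, 100_000):
    total, sparsity = one_hot_stats(num_tokens=1_000, vocab_size=vocab_size)
    print(f"vocab={vocab_size:>7,}  cells={total:>12,}  zeros={sparsity:.4%}")
```

The number of cells grows linearly with the vocabulary size, while the number of informative entries stays fixed at one per token.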



---

## ⚙️ How It Works
Each unique word in the vocabulary is represented as a **binary vector**.  
- The length of each vector equals the number of unique words in the vocabulary.
- A `1` marks the index assigned to the word, and every other position is `0`.
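
The two rules above fit in a single function. This is a minimal sketch, not a library API: `one_hot` is a hypothetical name, and it assumes the vocabulary is an ordered list of unique words.

```python
def one_hot(word, vocab):
    """Return a list of len(vocab) zeros with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1  # list.index raises ValueError for unknown words
    return vector

vocab = ["The", "food", "is", "good", "bad", "Pizza", "amazing"]

print(one_hot("good", vocab))
# [0, 0, 0, 1, 0, 0, 0]
print(one_hot("Pizza", vocab))
# [0, 0, 0, 0, 0, 1, 0]
```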

---

## ✅ Advantages

1. **Easy to Implement**: supported out of the box by Python libraries such as `sklearn.preprocessing.OneHotEncoder` and `pandas.get_dummies()`.

2. **Simple Representation**: no complex math required.

---

## ❌ Disadvantages

1. **Sparse Matrix Problem**: results in high-dimensional vectors that are almost entirely zeros, which wastes memory and, when training data is limited, can contribute to **overfitting**.  
2. **Variable Input Size**: documents with different word counts produce matrices of different shapes (for example `4 × 7` versus `3 × 7`), while most machine learning models require inputs of a fixed size.  
3. **No Semantic Understanding**: words like *“good”* and *“great”* are treated as completely unrelated.  
4. **OOV Issue (Out-of-Vocabulary)**: fails when a new word appears that wasn’t in the training vocabulary.

> 🧠 **Example:**  
> If the vocabulary = [Food, Pizza, Burger],  
> then the new word *"Fries"* cannot be encoded because it doesn’t exist in the predefined vocabulary.
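
The sketch below reproduces the failure for this example, then shows one common workaround: reserving an extra slot for unknown words. The `<UNK>` token and both function names are assumptions made for illustration. They are not part of the original vocabulary or of any specific library.

```python
UNK = "<UNK>"  # hypothetical placeholder token for unseen words

def one_hot_strict(word, vocab):
    """Encode a word, failing loudly if it is not in the vocabulary."""
    if word not in vocab:
        raise KeyError(f"{word!r} is not in the vocabulary")
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

def one_hot_with_unk(word, vocab):
    """Encode a word, mapping unseen words to the UNK slot (vocab must contain UNK)."""
    target = word if word in vocab else UNK
    vector = [0] * len(vocab)
    vector[vocab.index(target)] = 1
    return vector

vocab = ["Food", "Pizza", "Burger"]

try:
    one_hot_strict("Fries", vocab)
except KeyError as err:
    print("strict encoder failed:", err)

vocab_with_unk = vocab + [UNK]
print(one_hot_with_unk("Fries", vocab_with_unk))
# [0, 0, 0, 1]
print(one_hot_with_unk("Pizza", vocab_with_unk))
# [0, 1, 0, 0]
```

The fallback keeps the encoder from crashing, but every unseen word collapses into the same vector, so their identities are lost.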

---

## 💡 Summary

- **Type:** Categorical Text Representation  
- **Use Case:** Small datasets or quick prototyping  
- **Limitations:** Doesn’t scale well for large corpora  
- **Better Alternatives:** TF-IDF, Word2Vec, GloVe, BERT

---