# 🧠 Word Embedding in NLP

## 📖 Introduction

In **Natural Language Processing (NLP)**, a **word embedding** is a technique that represents words as numerical vectors.  
Instead of treating words as discrete symbols, embeddings capture the **semantic meaning** — words with similar meanings end up having **similar vector representations** in a continuous vector space.

In simple terms:
> **Word Embedding = Representation of words as numerical vectors that capture meaning and relationships.**

---

## 🧩 Why Do We Need Word Embedding?

Computers can’t understand text directly — they work with numbers.  
Traditional text representations (like one-hot encoding) assign a unique index to each word, but this has limitations:
- It doesn’t capture **meaning** or **context**.
- It produces **sparse**, high-dimensional vectors.
- It can't express **semantic similarity**: the vectors for *king* and *queen* are no closer to each other than the vectors for *king* and *pizza*.

Word embeddings solve these issues by mapping words to **dense, low-dimensional vectors** that reflect their meanings.

---

## 🧮 Word Embedding Approaches

There are two major categories:

### 1️⃣ **Count or Frequency-Based Methods**

These are **statistical** techniques built from word identity and occurrence counts in documents, with no learned representation.

#### a) 🟩 One-Hot Encoding

- Each word is represented as a **binary vector** whose length equals the vocabulary size.
- The position that corresponds to the word is set to `1`; every other position is `0`.

Example:

| Word  | Pizza | Burger | Pasta |
|--------|--------|--------|--------|
| Pizza  | 1 | 0 | 0 |
| Burger | 0 | 1 | 0 |
| Pasta  | 0 | 0 | 1 |

**Advantages:**
- Simple and easy to implement.

**Disadvantages:**
- **No semantic meaning captured**: every pair of distinct words is equally distant, so "dog" is no closer to "cat" than to "pizza".
- **Sparse vectors** → leads to inefficient computation.
- Vocabulary grows rapidly → large dimensionality.
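
As a minimal sketch of the idea, the snippet below builds one-hot vectors for the three-word vocabulary from the table and checks that all pairs are equally far apart. The helper names `one_hot` and `euclidean_distance` are illustrative, not part of any library.

```python
import math

def one_hot(word, vocabulary):
    """Return a list with 1 at the word's vocabulary index and 0 elsewhere."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

vocabulary = ["pizza", "burger", "pasta"]
vectors = {word: one_hot(word, vocabulary) for word in vocabulary}

print(vectors["pizza"])   # [1, 0, 0]
print(vectors["burger"])  # [0, 1, 0]

# Every pair of distinct words is sqrt(2) apart, so the encoding
# says nothing about which words are similar.
print(euclidean_distance(vectors["pizza"], vectors["burger"]))
print(euclidean_distance(vectors["pizza"], vectors["pasta"]))
```

Each vector is as long as the vocabulary, which is why dimensionality grows with every new word.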

---

#### b) 🟨 Bag of Words (BoW)

- Represents a document as a vector of **word counts**, with one entry per vocabulary word.
- Order of words is **ignored**.

Example:

> Sentence 1: “I love pizza”  
> Sentence 2: “I love burger”

Vocabulary = [I, love, pizza, burger]

| Sentence | I | love | pizza | burger |
|-----------|---|------|--------|---------|
| 1 | 1 | 1 | 1 | 0 |
| 2 | 1 | 1 | 0 | 1 |

**Advantages:**
- Easy to create and interpret.
- Works well for simple models.

**Disadvantages:**
- Ignores word order and context.
- Produces sparse and high-dimensional data.
- No semantic meaning captured.
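
The table above can be reproduced with a few lines of Python. This is a sketch that assumes lowercase, whitespace-separated tokens; `bag_of_words` is a hypothetical helper written for this example.

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.lower().split())
    # Counter returns 0 for words that never occurred.
    return [counts[word] for word in vocabulary]

vocabulary = ["i", "love", "pizza", "burger"]
sentences = ["I love pizza", "I love burger"]
bow_vectors = [bag_of_words(s, vocabulary) for s in sentences]

print(bow_vectors[0])  # [1, 1, 1, 0]
print(bow_vectors[1])  # [1, 1, 0, 1]

# Word order is ignored: a shuffled sentence gives the same vector.
print(bag_of_words("pizza love I", vocabulary) == bow_vectors[0])  # True
```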

---

#### c) 🟦 TF-IDF (Term Frequency – Inverse Document Frequency)

TF-IDF improves on BoW by weighting each word by its **importance**: a word scores high when it is frequent within a document but rare across the document collection.

**Formula:**

\[
\text{TF-IDF} = \text{TF} \times \text{IDF}
\]

Where:
- **TF (Term Frequency)** = Frequency of a word in a document.  
  \[
  TF = \frac{\text{No. of times word appears in a document}}{\text{Total words in the document}}
  \]
- **IDF (Inverse Document Frequency)** = How rare the word is across all documents (rarer words get a higher value).  
  \[
  IDF = \log_e\left(\frac{\text{Total number of documents}}{\text{Number of documents containing the word}}\right)
  \]

**Advantages:**
- Reduces the weight of common words like “the”, “is”, etc.
- Highlights important and rare words.

**Disadvantages:**
- Still sparse and ignores semantic relationships.
- Cannot capture context or word order.
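
The formulas above translate directly into code. The sketch below uses the unsmoothed definitions exactly as written in this section and assumes the word occurs in at least one document (otherwise the IDF denominator is zero). Library implementations often add smoothing, so their numbers may differ.

```python
import math

def term_frequency(word, document):
    """Occurrences of the word divided by the total words in the document."""
    return document.count(word) / len(document)

def inverse_document_frequency(word, documents):
    """Natural log of (total documents / documents containing the word)."""
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / containing)

def tf_idf(word, document, documents):
    return term_frequency(word, document) * inverse_document_frequency(word, documents)

# Each document is a list of tokens.
documents = [
    "i love pizza".split(),
    "i love burger".split(),
]

# "love" occurs in every document, so IDF = ln(2/2) = 0.
print(tf_idf("love", documents[0], documents))   # 0.0

# "pizza" occurs in one of two documents: (1/3) * ln(2), about 0.231.
print(tf_idf("pizza", documents[0], documents))
```

Words shared by every document are weighted down to zero, while the word that distinguishes the first document keeps a positive score.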

---

### 2️⃣ **Deep Learning-Based Models (Neural Embeddings)**

Unlike count-based models, these use **neural networks** to learn embeddings that capture **semantic and syntactic relationships**.

---

#### 🌐 Word2Vec

Developed at **Google** (Mikolov et al., 2013), Word2Vec learns a dense vector for each word from the **contexts** in which it appears.

It uses a **shallow neural network** to learn embeddings in one of two ways:

---

##### a) 🧩 CBOW (Continuous Bag of Words)

**Goal:** Predict the **target word** using the **context words**.

Example:  
> “The cat sat on the ___.”

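Given the context words ["The", "cat", "sat", "on", "the"], CBOW is trained to predict the missing word “mat”. In practice the context is a fixed-size window of words on both sides of the target; here only the left side exists because the target ends the sentence.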

**Advantages:**
- Trains faster than Skip-Gram, which makes it efficient on large datasets.
- Works well for frequent words.
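
To make the training setup concrete, the sketch below generates the (context, target) pairs that a CBOW model would learn from. It covers only data preparation, not the neural network itself; the function name `cbow_pairs` and the window size of 2 are assumptions made for illustration.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) pairs, one per token position."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]    # up to `window` words before
        right = tokens[i + 1:i + 1 + window]   # up to `window` words after
        pairs.append((left + right, target))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = cbow_pairs(tokens, window=2)

print(pairs[2])  # (['the', 'cat', 'on', 'the'], 'sat')
print(pairs[5])  # (['on', 'the'], 'mat')
```

The model sees the context list as input and is trained to output the target word.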

---

##### b) ⚙️ Skip-Gram

**Goal:** Predict **context words** given a **target word**.

Example:  
> For the target word “cat”, predict nearby words like ["The", "sat", "on"].

**Advantages:**
- Represents rare words better than CBOW, at the cost of slower training.
- Captures complex word relationships.
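
Skip-Gram reverses the direction of the pairs: each target word is paired with every word in its window. The sketch below again shows only data preparation, with a hypothetical helper `skip_gram_pairs` and an assumed window size of 2.

```python
def skip_gram_pairs(tokens, window=2):
    """Build (target_word, context_word) pairs within the given window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
cat_pairs = [p for p in skip_gram_pairs(tokens, window=2) if p[0] == "cat"]

print(cat_pairs)  # [('cat', 'the'), ('cat', 'sat'), ('cat', 'on')]
```

Each target produces several training pairs, which is one reason Skip-Gram takes longer to train than CBOW on the same text.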

---

### 📊 Visualization of Word2Vec

Word embeddings map words into vector space where **semantic relationships** are preserved.

For example:

```
vector("King") - vector("Man") + vector("Woman") ≈ vector("Queen")
```

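The offset from *Man* to *King* is roughly the same as the offset from *Woman* to *Queen*, so relationships such as **gender** show up as consistent directions in the vector space. The relation is approximate: *Queen* is the nearest remaining word to the result, not an exact match.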
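
The arithmetic can be tried on hand-made vectors. The two-dimensional values below are invented for illustration (roughly a "royalty" axis and a "gender" axis) and are not output from a trained Word2Vec model; `analogy` is a hypothetical helper.

```python
import math

# Toy vectors, made up for this example: [royalty, gender].
toy_vectors = {
    "king":  [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1, 0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.0],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, vectors):
    """Return the word nearest to vector(a) - vector(b) + vector(c).

    The three input words are excluded from the candidates.
    """
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(query, vectors[w]))

print(analogy("king", "man", "woman", toy_vectors))  # queen
```

With real embeddings the vectors have hundreds of dimensions and the match is only approximate, but the lookup works the same way.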

---

## 🧱 Summary Table

| Method | Type | Captures Meaning | Handles Context | Dimensionality | Example |
|---------|------|------------------|-----------------|----------------|----------|
| One-Hot Encoding | Count-Based | ❌ | ❌ | High | [1, 0, 0, 0, ...] |
| Bag of Words | Count-Based | ❌ | ❌ | High | [2, 1, 0, ...] |
| TF-IDF | Count-Based | ⚠️ Partial | ❌ | High | [0.45, 0.00, 0.87, ...] |
| Word2Vec (CBOW / Skip-Gram) | Neural | ✅ | ⚠️ Partial (learned from context, one fixed vector per word) | Low | [0.23, -0.12, 0.54, ...] |

---

## 💬 Conclusion

Word embeddings have revolutionized NLP by providing a **numerical way to represent text meaningfully**.  
They form the foundation for modern NLP tasks like:
- Sentiment Analysis  
- Machine Translation  
- Text Classification  
- Chatbots & Question Answering  

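**Contextual models** such as **ELMo**, **BERT**, and **GPT** take this idea further: instead of storing one fixed vector per word, they compute a vector for each occurrence of a word **based on the sentence around it**, so *bank* in "river bank" and in "bank account" gets different representations.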

---

⭐ **In summary:**
> Word embeddings bridge the gap between human language and machine understanding — transforming words into meaningful mathematical representations.