# 🌐 Word2Vec — Word Embedding Using Neural Networks

## 📖 Introduction

**Word2Vec** is a powerful technique for natural language processing developed by **Tomas Mikolov and his team at Google in 2013** for learning **word embeddings** — vector representations of words that capture **semantic meaning and relationships**.

The **Word2Vec** algorithm uses a neural network model to learn word associations from a large corpus of text. Once trained, such a model can detect synonymous words or suggest additional words for a **partial sentence**. As the name implies, Word2Vec represents each distinct word with a particular list of numbers called a vector.

It is a **shallow, two-layer neural network** that takes a large corpus of text as input and produces a **vector space** (typically 100–300 dimensions), where each unique word is represented by a corresponding vector.

In this space:
- Words that **share common contexts** in the corpus are located **close to one another**.
- Similar words (e.g., *car* and *vehicle*) have similar vector representations.

---

## 💡 Intuition Behind Word2Vec

Traditional models like **Bag of Words (BoW)** or **TF-IDF** represent words as independent features — they **don’t capture relationships** between words.

Word2Vec overcomes this by learning **from context**:
> “You shall know a word by the company it keeps.” — *J.R. Firth (1957)*

In other words, the **meaning of a word** can be derived from its **surrounding words (context)**.

---

## 🧠 Working Principle

Word2Vec learns embeddings by **predicting words based on their context** (or vice versa).  
There are two main model architectures:

### 1️⃣ Continuous Bag of Words (CBOW)

**Goal:** Predict the **target word** given its **context words**.

---

#### 🧩 Example

> Sentence: “The cat sat on the mat”

If we choose a **window size of 2**, then for the target word **"sat"**, the context words are:
```

["The", "cat", "on", "the"]

```

The model tries to **predict “sat”** based on these surrounding words.

---

#### ⚙️ How It Works

1. Input: Context words  
2. Output: Target word  
3. The model averages the context word vectors to predict the target.  
4. During training, weights are adjusted so that words appearing in similar contexts have similar vectors.

---

#### 🧮 Example in Action

| Context Words | Target Word |
|----------------|-------------|
| ["The", "cat", "on", "the"] | sat |
| ["cat", "sat", "the", "mat"] | on |

Over time, the network learns that **“cat”**, **“mat”**, and **“sat”** appear together frequently — so their embeddings move closer in vector space.
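
The (context, target) rows in the table above can be generated mechanically. The following is a minimal sketch, assuming simple whitespace tokenization and a window of 2 words on each side; the helper name `cbow_pairs` is hypothetical and not part of any library.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for every position in tokens."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after the target
        pairs.append((left + right, target))
    return pairs


sentence = "The cat sat on the mat".split()
pairs = cbow_pairs(sentence, window=2)
for context, target in pairs:
    print(context, "->", target)
```

Words near the sentence boundaries simply get fewer context words; real implementations may pad or handle boundaries differently.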

---

#### ⚡ Advantages
- Fast to train.
- Performs well for frequent words.

#### ⚠️ Disadvantages
- Doesn’t perform as well on rare words (low frequency).

---

### 2️⃣ Skip-Gram Model

**Goal:** Predict the **context words** given a **target word**.

---

#### 🧩 Example

> Sentence: “The cat sat on the mat”

For the target word **“sat”** with a window size of 2:
```

Context Words → ["The", "cat", "on", "the"]

```

The model tries to predict each of these from the target "sat":
```

Input: "sat"
Output: ["The", "cat", "on", "the"]

```

---

#### ⚙️ How It Works

1. The model takes one **target word** as input.
2. It predicts all possible **context words** within the window.
3. The neural network adjusts weights to maximize the probability of correct predictions.
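
The steps above rely on turning a sentence into (target, context) training pairs. Below is a small sketch of that pair generation, assuming whitespace tokenization and a window of 2 words on each side; `skipgram_pairs` is a hypothetical helper name.

```python
def skipgram_pairs(tokens, window=2):
    """Return one (center, context) pair per context word in the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:                      # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs


sentence = "The cat sat on the mat".split()
sat_pairs = [pair for pair in skipgram_pairs(sentence) if pair[0] == "sat"]
print(sat_pairs)
```

Each center word contributes several pairs, which is one reason Skip-Gram training involves more updates than CBOW on the same text.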

---

#### ⚡ Advantages
- Works well with **small datasets**.
- Captures **rare words** better than CBOW.

#### ⚠️ Disadvantages
- Training is **slower** for large corpora.

---

## 🧱 Word2Vec Architecture (Simplified)

Word2Vec consists of **three main layers**:

1. **Input Layer**  
   - Represents one-hot encoded target or context word.

2. **Hidden Layer**  
   - A linear layer with no activation function.
   - Learns word embeddings — each word gets a dense vector representation.

3. **Output Layer**  
   - Predicts the target or context words using **softmax** probabilities.

---

## 🧮 Mathematical Representation

Let:
- \( V \) = Vocabulary size  
- \( N \) = Embedding vector dimension  

Each word \( w \) is represented as a **one-hot vector** of size \( V \).

### Hidden Layer:
\[
h = W^T \times x
\]
where \( W \) is the weight matrix of size \( V \times N \), and \( x \) is the one-hot vector.

### Output Layer:
\[
u = W'^T \times h
\]
and the softmax function gives the probability distribution over the vocabulary:
\[
P(w_t | w_c) = \frac{e^{u_t}}{\sum_{j=1}^{V} e^{u_j}}
\]

The model learns the **weights (W, W′)**, where \( W' \) has size \( N \times V \). After training, the rows of \( W \) are typically used as the **word embeddings**.
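
To make the three formulas concrete, here is a minimal NumPy sketch of one forward pass. It takes \( W' \) to have shape \( N \times V \), uses toy sizes (V = 6, N = 3) and random weights; all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 6, 3                                    # toy vocabulary size and embedding dimension
W = rng.normal(scale=0.1, size=(V, N))         # input -> hidden weights
W_prime = rng.normal(scale=0.1, size=(N, V))   # hidden -> output weights (assumed shape)

x = np.zeros(V)
x[2] = 1.0                                     # one-hot vector for the word with index 2

h = W.T @ x                                    # hidden layer: identical to row 2 of W
u = W_prime.T @ h                              # one raw score per vocabulary word
p = np.exp(u - u.max())                        # subtract the max for numerical stability
p /= p.sum()                                   # softmax probabilities over the vocabulary

print(h)
print(p)
```

Because `x` is one-hot, the product `W.T @ x` only selects a row of `W`, which is why implementations use a table lookup instead of a full matrix multiplication.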

---

## 🔎 Example of Learned Relationships

Word2Vec captures **semantic and syntactic** relationships:

| Relationship | Example | Result |
|---------------|----------|--------|
| Gender | King - Man + Woman | ≈ Queen |
| Verb tense | Walking - Walked + Swam | ≈ Swimming |
| Country–Capital | Paris - France + Italy | ≈ Rome |
| Singular–Plural | Cat - Cats + Dogs | ≈ Dog |

📘 **Meaning:** Word2Vec understands the relationship “King is to Man as Queen is to Woman” **mathematically!**

---

## 🎯 Key Features

✅ Captures **semantic** (meaning-based) and **syntactic** (grammar-based) relationships.  
✅ Produces **dense** low-dimensional vectors.  
✅ Improves performance in **downstream NLP tasks** like:
- Sentiment Analysis
- Named Entity Recognition
- Machine Translation
- Chatbots and Q&A

---

## 🧰 Implementation Example (with Gensim in Python)

```python
from gensim.models import Word2Vec

# Sample sentences
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["the", "bird", "flew", "over", "the", "mat"]
]

# Train Word2Vec model (Skip-Gram)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

# Display the vector for a word
print(model.wv['cat'])

# Find most similar words
print(model.wv.most_similar('cat'))
```

---

## 📊 Visualization (Conceptual)

```
         ┌─────────────────────────────┐
         │         Word2Vec Model       │
         ├─────────────────────────────┤
         │ 1️⃣ Input Word  →  One-Hot    │
         │ 2️⃣ Hidden Layer → Embedding  │
         │ 3️⃣ Output → Context Prediction│
         └─────────────────────────────┘

Example:
Input: “King”
Output Predictions: [“Queen”, “Prince”, “Monarch”, “Royalty”, …]
```

---

## 🧩 Summary — CBOW vs Skip-Gram

| Feature       | CBOW                       | Skip-Gram                  |
| ------------- | -------------------------- | -------------------------- |
| Input         | Context Words              | Target Word                |
| Output        | Target Word                | Context Words              |
| Speed         | Faster                     | Slower                     |
| Rare Words    | Poor                       | Better                     |
| Training Data | Large Corpus               | Small Corpus               |
| Example       | Predict “sat” from context | Predict context from “sat” |

---

## 🧠 Intuitive Understanding of Context

Context in Word2Vec refers to the **surrounding words** of a target word within a fixed-size **window**.

Example sentence:

> “The dog barked loudly at night.”

With **window size = 2**, for the target word **“barked”**, the context words are:

```
["The", "dog", "loudly", "at"]
```

These contexts help Word2Vec learn that:

* “barked” often appears with “dog”.
* So, “barked” and “dog” should be **semantically close** in vector space.

---

## 🌟 Conclusion

Word2Vec revolutionized NLP by introducing **dense, meaningful word representations** that preserve **semantic relationships**.

It laid the foundation for later embedding methods:

* **GloVe (Global Vectors)** and **FastText**, which also produce static (one vector per word) embeddings
* **BERT**, **GPT**, and other Transformer-based models, which produce **contextual embeddings**

> 🗣️ In short: *Word2Vec taught machines to understand the meaning of words through context.*

---

## 📚 References

* Mikolov et al., “Efficient Estimation of Word Representations in Vector Space”, Google Research (2013)
* [Gensim Word2Vec Documentation](https://radimrehurek.com/gensim/models/word2vec.html)
* [TensorFlow Word2Vec Tutorial](https://www.tensorflow.org/tutorials/text/word2vec)

# 🧩 Understanding Word Embedding Vectors — Google Word2Vec Example

When Google introduced **Word2Vec**, one of the most powerful discoveries was that the learned **word vectors** captured real-world **semantic relationships**.

For example:
> **vector("King") - vector("Man") + vector("Woman") ≈ vector("Queen")**

This means the model **learned the concept of gender and royalty** — purely from reading text, without being explicitly told about these relationships!

---

## 📘 Step-by-Step Concept

1️⃣ **Vocabulary (Unique Words)**  
   - These are all the unique words in the corpus (training text).

   Example vocabulary:

   ```
   [Boy, Girl, King, Queen, Apple, Mango]
   ```

2️⃣ **Each word** is represented as a **dense vector**, not just 0s and 1s.  
   - Each dimension captures a certain **hidden feature** (e.g., Gender, Royalty, Age, Food, etc.).

3️⃣ These dimensions are **not predefined**: the model **learns** them automatically during training.

---

## 🧮 Dummy Example: Word2Vec Feature Space

Below is a **simplified (dummy)** representation of how Word2Vec might internally encode meaning:

| Word  | Gender | Royalty | Age | Food | Sweetness | Objectness |
|--------|:-------:|:-------:|:----:|:----:|:----------:|:------------:|
| **Boy**   | -0.95 |  0.05 |  0.70 |  0.00 |  0.00 |  0.10 |
| **Girl**  |  0.97 |  0.04 |  0.68 |  0.00 |  0.00 |  0.12 |
| **King**  | -0.90 |  0.95 |  0.80 |  0.00 |  0.00 |  0.15 |
| **Queen** |  0.92 |  0.96 |  0.78 |  0.00 |  0.00 |  0.14 |
| **Apple** |  0.00 |  0.00 |  0.00 |  0.98 |  0.95 |  0.05 |
| **Mango** |  0.00 |  0.00 |  0.00 |  0.97 |  0.96 |  0.06 |

---

## 🧠 Interpretation

- **Gender Dimension:**  
Positive values → feminine (e.g., *Girl*, *Queen*)  
Negative values → masculine (e.g., *Boy*, *King*)

- **Royalty Dimension:**  
High for *King* and *Queen*, low for *Boy*, *Girl*, *Apple*, etc.

- **Food Dimension:**  
High for *Apple* and *Mango*, nearly zero for human-related words.

- **Sweetness Dimension:**  
High for fruits, zero for non-food items.
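
The dummy table can also be used to try out vector arithmetic. The sketch below copies the six vectors from the table and evaluates `King - Boy + Girl`, picking the closest remaining word by cosine similarity. The table has no *Man* or *Woman* rows, so *Boy* and *Girl* stand in for them here (an assumption made only for this illustration).

```python
import numpy as np

# Dummy vectors from the table: [Gender, Royalty, Age, Food, Sweetness, Objectness]
vectors = {
    "Boy":   np.array([-0.95, 0.05, 0.70, 0.00, 0.00, 0.10]),
    "Girl":  np.array([0.97, 0.04, 0.68, 0.00, 0.00, 0.12]),
    "King":  np.array([-0.90, 0.95, 0.80, 0.00, 0.00, 0.15]),
    "Queen": np.array([0.92, 0.96, 0.78, 0.00, 0.00, 0.14]),
    "Apple": np.array([0.00, 0.00, 0.00, 0.98, 0.95, 0.05]),
    "Mango": np.array([0.00, 0.00, 0.00, 0.97, 0.96, 0.06]),
}


def cosine(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


query = vectors["King"] - vectors["Boy"] + vectors["Girl"]

# As is common for analogy queries, the three input words are excluded from the candidates.
candidates = [w for w in vectors if w not in ("King", "Boy", "Girl")]
best = max(candidates, key=lambda w: cosine(query, vectors[w]))
print(best, round(cosine(query, vectors[best]), 3))
```

With these hand-made numbers the closest candidate is *Queen*; real embeddings are learned, so their dimensions are not individually interpretable like this.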

---

## 🧭 Semantic Relationships Captured

| Relationship | Vector Math | Meaning |
|---------------|--------------|----------|
| Gender | **King - Man + Woman ≈ Queen** | The model understands gender analogy. |
| Age | **Boy - Young + Adult ≈ Man** | Learns age-based transformation. |
| Category | **Apple - Fruit + Country ≈ Japan** | Learns contextual category shift. |
| Synonyms | **Big ≈ Large**, **Fast ≈ Quick** | Close vectors for similar words. |

---

## 📊 Visual Analogy (2D Projection Example)

Word2Vec embeddings exist in **hundreds of dimensions**, but when visualized in **2D** (via PCA or t-SNE), you can see meaningful clusters:

```
        Royalty ↑
                │                Queen
                │             King
                │        Prince
                │
                │
Gender  ────────┼────────────────────────→
                │
                │          Girl
                │     Boy
                │
                │
```

- Words like *Boy–Girl* and *King–Queen* align in similar directions along the **Gender axis**.
- Words like *King–Queen–Prince* cluster along the **Royalty axis**.

---

## 💡 Key Takeaway

Word2Vec doesn’t just memorize words —  
it **understands** them through **contextual relationships**.

The model learns:
- Similar words appear in similar contexts.
- Context defines meaning.
- Arithmetic operations on vectors reveal hidden semantic patterns.

> 🗣️ Example:  
> `King - Man + Woman ≈ Queen`  
> demonstrates that **Word2Vec captures real-world logic** in numerical form.

---

## 🧱 Real-World Embedding Space Snapshot (from Google’s Paper)

Google's model, trained on the **Google News dataset (about 100 billion words)**, produced embeddings that naturally aligned like this:

| Relationship | Example Pairs | Observed Result |
|---------------|---------------|-----------------|
| Gender | (man → woman), (king → queen) | Consistent vector direction |
| Verb tense | (walk → walked), (eat → ate) | Captures grammatical rules |
| Country–Capital | (France → Paris), (Italy → Rome) | Geographic understanding |
| Comparative | (big → bigger), (fast → faster) | Learns adjective relationships |

---

## 🧩 Summary

| Concept | Description |
|----------|--------------|
| **Vocabulary** | Set of all unique words in the corpus. |
| **Corpus** | The entire collection of text used for training. |
| **Embedding Vector** | Numeric representation of a word in multi-dimensional space. |
| **Context** | Surrounding words that define meaning. |
| **Semantic Space** | The multi-dimensional field where similar words lie close together. |

---

✨ **In essence:**
> Word2Vec transforms raw words into a structured semantic world — where relationships, categories, and analogies are all embedded in the geometry of vector space.

# 📏 Cosine Similarity in Word Embedding

## 📖 What is Cosine Similarity?

When we represent words as **vectors**, we can measure how **similar** two words are by checking the **angle** between their vectors — not their absolute distance.

That’s where **cosine similarity** comes in.

It measures the **cosine of the angle** between two vectors in the embedding space.

---

## 🧮 Formula

For two vectors **A** and **B**, the **cosine similarity** is defined as:

\[
\text{Cosine Similarity (A, B)} = \frac{A \cdot B}{||A|| \times ||B||}
\]

Where:
- \( A \cdot B \) = dot product of vectors A and B  
- \( ||A|| \) and \( ||B|| \) = magnitude (length) of A and B

---

## 🎯 Intuition

- Cosine similarity only depends on the **angle between vectors**, not their length.
- Two words are **similar** if their vectors point in the **same direction**.

### Example (illustrative values):

| Pair | Cosine Similarity | Meaning |
|-------|-------------------|----------|
| King – Queen | 0.92 | Very similar (shared royalty context) |
| King – Man | 0.75 | Moderately similar (gender relation) |
| King – Apple | 0.12 | Not similar (different semantic fields) |

---

## 🧠 Visual Representation

```
   ↑ Royalty
   │
   │       Queen
   │      /
   │     /
   │    /
   │   /   (small angle → high similarity)
   │  /
   │ /
   │/________________→ Gender
  King
```

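In this diagram:
- The **angle** between "King" and "Queen" is **small**, so **cosine similarity ≈ 1**.
- If the vectors were perpendicular (90°), similarity = 0 (no relation).
- If they pointed in opposite directions (180°), similarity = -1. In practice this rarely signals opposite meanings: antonyms such as *hot* and *cold* appear in similar contexts, so their vectors often have a fairly high cosine similarity.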

---

## ⚖️ Cosine Similarity vs Euclidean Distance

| Aspect | Cosine Similarity | Euclidean Distance |
|---------|-------------------|--------------------|
| **Definition** | Measures **angle** between two vectors | Measures **straight-line distance** between points |
| **Range** | -1 to 1 | 0 to ∞ |
| **Focus** | Orientation (direction) | Magnitude (distance) |
| **When Useful** | When vector length doesn’t matter (e.g., text) | When absolute distance matters (e.g., coordinates) |
| **In Word2Vec** | ✅ Preferred (semantic comparison) | ⚠️ Not suitable (scale varies) |

---

### 🔍 Why Word2Vec Uses Cosine Similarity

- In Word2Vec, vector magnitudes carry little semantic information (they are influenced by factors such as word frequency); what matters is the **relative direction**.
- Two words with **similar meanings** (e.g., “strong” and “powerful”) appear in **similar contexts**, resulting in **vectors pointing in similar directions**.
- Thus, cosine similarity effectively measures **semantic closeness**.
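
A quick way to see the difference between the two metrics is to scale a vector: its direction stays the same, so cosine similarity does not change, while Euclidean distance grows. The sketch below uses made-up vectors and a hypothetical helper `cosine_similarity`.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; independent of their lengths."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 3.0])
b = 10 * a                                   # same direction, ten times longer

cos_ab = cosine_similarity(a, b)             # unaffected by the scaling
dist_ab = float(np.linalg.norm(a - b))       # grows with the scaling

print(cos_ab, dist_ab)
```

Here the cosine similarity stays at 1 while the Euclidean distance is large, even though both vectors point the same way.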

---

## 🧮 Example Calculation

Let:
\[
A = [1, 2, 3], \quad B = [2, 3, 4]
\]

Then:
\[
A \cdot B = (1)(2) + (2)(3) + (3)(4) = 20
\]

\[
||A|| = \sqrt{1^2 + 2^2 + 3^2} \approx 3.742, \quad ||B|| = \sqrt{2^2 + 3^2 + 4^2} \approx 5.385
\]

\[
\text{Cosine Similarity} = \frac{20}{3.742 \times 5.385} \approx 0.993
\]

✅ Result: Vectors are almost in the same direction → **Highly similar words**.
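
The same calculation can be checked numerically. This short NumPy sketch recomputes the dot product, the two norms, and their ratio for the vectors A and B above; computed without intermediate rounding, the value comes out to about 0.9926.

```python
import numpy as np

A = np.array([1.0, 2.0, 3.0])
B = np.array([2.0, 3.0, 4.0])

dot = A @ B                          # 1*2 + 2*3 + 3*4
norm_a = np.linalg.norm(A)           # sqrt(14)
norm_b = np.linalg.norm(B)           # sqrt(29)
cos_sim = dot / (norm_a * norm_b)

print(dot, round(float(norm_a), 3), round(float(norm_b), 3), round(float(cos_sim), 4))
```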

---

## 🧭 Example in Word2Vec (Python - Gensim)

```python
from gensim.models import Word2Vec

sentences = [
    ["king", "queen", "man", "woman", "apple", "mango"]
]

# Train model
model = Word2Vec(sentences, vector_size=10, min_count=1, sg=1)

# Cosine similarity between words
print(model.wv.similarity('king', 'queen'))
print(model.wv.similarity('king', 'apple'))
```

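**Illustrative output** (values of this kind come from a model trained on a large corpus; a single six-word sentence is far too little data, so the numbers this snippet actually prints will be close to random and vary between runs):

```
king vs queen: 0.89
king vs apple: 0.10
```

👉 Words closer in **meaning** → higher cosine similarity.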

---

## 📊 Summary: Vector Distance Metrics

| Metric                 | Measures                    | Range  | Word2Vec Use | Comment                                |
| ---------------------- | --------------------------- | ------ | ------------ | -------------------------------------- |
| **Cosine Similarity**  | Angle (direction)           | -1 → 1 | ✅ Yes        | Best for comparing semantic similarity |
| **Euclidean Distance** | Magnitude difference        | 0 → ∞  | ⚠️ No        | Sensitive to vector scale              |
| **Manhattan Distance** | Sum of absolute differences | 0 → ∞  | ❌ Rare       | Not rotation invariant                 |

---

## 🌟 Real-Life Analogy

Imagine each word as a **direction** in meaning space.

* “King” and “Queen” point toward **royalty**.
* “Apple” and “Mango” point toward **fruits**.
* “King” and “Apple” point in **very different directions**, so the angle is large → low similarity.

```
 👑 Queen     👑 King
     ↖          ↑
       \        |
         \      |
           \    |
             \  |
            (origin) ──────────────→ 🍏 Apple
                  \
                    \──────────────→ 🍎 Mango
```

Even though all vectors have roughly the same length (magnitude), **direction** defines meaning — that’s why **cosine similarity** is ideal for **text embeddings**.

---

## 🧩 Key Takeaways

| Concept                          | Description                                                  |
| -------------------------------- | ------------------------------------------------------------ |
| **Cosine Similarity**            | Measures the angle between two word vectors.                 |
| **Higher Value (close to 1)**    | Words have similar meanings and contexts.                    |
| **Lower Value (close to 0)**     | Words are unrelated.                                         |
| **Negative Value (close to -1)** | Vectors point in opposite directions (rare; antonyms often still score high). |
| **Used In**                      | Word2Vec, Doc2Vec, BERT, GPT, and almost all NLP embeddings. |

---

> 🗣️ **In summary:**
> Cosine similarity gives Word2Vec the ability to **compare meanings geometrically**, not just by frequency — turning language into measurable mathematics.


# CBOW (Continuous Bag of Words) — Neural Network Diagram & Full Details

> Corpus example used in this document:
>
> `iNeruon company is related to data science`
>
> Window size = 5 (context size). Target word predicted from context words (CBOW).

---

## 1 — Quick overview

CBOW predicts the target word given surrounding context words. Each context word is converted to a one-hot vector (size = vocabulary size V), mapped to an embedding space (embedding size = D). The embeddings for the context words are averaged (or summed) and fed into a fully connected layer followed by a softmax over the vocabulary to produce a probability distribution for the target word.

Key components in this doc:

* Vocabulary and one-hot encoding example from the sample corpus
* Detailed architecture diagram (ASCII + Mermaid)
* Shapes of weight matrices and forward/backward pass math
* Loss, optimization, and training details
* Implementation sketch (Keras)
* Variants (negative sampling, skip-gram), tips and hyperparameters

---

## 2 — Vocabulary & One-Hot encoding (example)

Assume after tokenization/normalization our vocabulary (V = 7) is:

```
0: ineruon
1: company
2: is
3: related
4: to
5: data
6: science
```

One-hot vectors (length V = 7) for the first four vocabulary words:

```
ineruon   -> [1 0 0 0 0 0 0]
company   -> [0 1 0 0 0 0 0]
is        -> [0 0 1 0 0 0 0]
related   -> [0 0 0 1 0 0 0]
... and so on
```

> Note: Real datasets typically have much larger V (thousands to millions). For educational diagrams we use V = 7.

---

## 3 — CBOW architecture (high-level)

Mermaid diagram (can be rendered by a renderer that supports Mermaid):

```mermaid
flowchart LR
  subgraph Input["Input (context words)"]
    C1["one-hot w_{t-2}"]
    C2["one-hot w_{t-1}"]
    C3["one-hot w_{t+1}"]
    C4["one-hot w_{t+2}"]
  end

  C1 -->|multiply by| E1["Embedding lookup: E (V x D)"]
  C2 -->|multiply by| E2["Embedding lookup: E"]
  C3 -->|multiply by| E3["Embedding lookup: E"]
  C4 -->|multiply by| E4["Embedding lookup: E"]

  E1 --> Avg["Average / Sum"]
  E2 --> Avg
  E3 --> Avg
  E4 --> Avg

  Avg --> FC{"Fully connected: W' (D x V) + b'"}
  FC --> Softmax["Softmax over V"]
  Softmax --> Output[("Predicted target word distribution")]
```

### ASCII-style neural net diagram (small example)

```
 Context one-hot vectors (4 context words, V=7)         Hidden/embedding layer (D=50 shown abstractly)
 [1 0 0 0 0 0 0]--> [embedding vector e1]  \            [avg e] ---> [W'.T] --> softmax (size V)
 [0 1 0 0 0 0 0]--> [embedding vector e2]  /           
 [0 0 0 1 0 0 0]--> [embedding vector e3]
 [0 0 0 0 1 0 0]--> [embedding vector e4]

 Final predicted distribution over vocabulary -> target: 'is'
```
---

## 4 — Shapes and parameters (concrete)

Let:

* V = vocabulary size (7 in our toy example)
* D = embedding size (we pick D = 50 for demonstration; you can choose 50, 100, 200, etc.)
* C = number of context words (window size minus target; for window size 5 centered, C = 4 context words)

**Weight matrices**:

* Embedding matrix `W_e` (also called `E`): shape `(V, D)` — maps one-hot (V) to embedding (D). In implementation it's typically a lookup table, not computed as full matrix multiplication for efficiency.
* Output weight matrix `W_out` (sometimes `W'`): shape `(D, V)` — maps averaged embedding to un-normalized scores for each vocabulary word.
* Bias `b_out`: shape `(V,)`.

**Forward pass math** (for one training sample):

1. Convert each context word `w_i` to one-hot `x_i` (V-dim). Embedding lookup: `e_i = W_e^T x_i` (shape D).
2. Average embeddings: `e = (1/C) * sum_{i=1..C} e_i` (shape D).
3. Score vector: `u = W_out^T e + b_out` (shape V).
4. Probability by softmax: `y_hat = softmax(u)`.
5. Loss for target word index t: `L = -log(y_hat[t])`.

**Parameter counts**:

* Embeddings: `V * D` parameters.
* Output: `D * V + V` (weights + bias).
* Total ~ `2*V*D + V`.

For toy numbers: V=7, D=50 -> embeddings = 350 params, output weights = 350 params, bias = 7 -> total ~ 707 parameters.
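
The shapes and the parameter count can be verified with a few lines of NumPy. This is a minimal sketch of the forward pass listed above, using random weights and the toy sizes V = 7, D = 50, C = 4; the context indices `[0, 1, 3, 4]` and target index `2` correspond to the sample `ineruon, company, related, to -> is`.

```python
import numpy as np

rng = np.random.default_rng(42)
V, D, C = 7, 50, 4

W_e = rng.normal(scale=0.1, size=(V, D))     # embedding matrix
W_out = rng.normal(scale=0.1, size=(D, V))   # output weights
b_out = np.zeros(V)                          # output bias

context_ids = [0, 1, 3, 4]                   # ineruon, company, related, to
target_id = 2                                # is

e = W_e[context_ids].mean(axis=0)            # lookup + average, shape (D,)
u = W_out.T @ e + b_out                      # scores, shape (V,)
y_hat = np.exp(u - u.max())
y_hat /= y_hat.sum()                         # softmax
loss = -np.log(y_hat[target_id])             # negative log-likelihood of the target

n_params = W_e.size + W_out.size + b_out.size
print(e.shape, u.shape, float(loss), n_params)
```

Indexing `W_e[context_ids]` is the lookup-table shortcut for multiplying each one-hot vector by `W_e`.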

---

## 5 — Example forward pass using your sample

Corpus window example (window size = 5): the target is the center word of each window.

Inputs (context) -> Target

* `ineruon, company, related, to` -> `is`
* `company, is, to, data` -> `related`
* `is, related, data, science` -> `to`

Take the first example: context word indices [0, 1, 3, 4] -> embeddings e0, e1, e3, e4 -> average `e` -> compute `u = W_out^T e + b_out` -> softmax -> ideally the highest probability falls on index 2 (`is`).

---

## 6 — Training details

**Loss**: Cross-entropy negative log-likelihood: `L = -log(y_hat[target])`.

**Optimization**: SGD, Adam, RMSprop. Typical learning rates:

* SGD: 0.01–0.5 (with careful scheduling)
* Adam: 1e-3 (good default)

**Batching**: Use minibatches (e.g., 256–2048 samples per batch depending on memory).

**Regularization**: dropout is rarely used on embeddings; L2 on output weights sometimes used.

**Speedups for large V**: softmax over very large vocab is expensive. Alternatives:

* Negative sampling (skip-gram negative sampling adapted to CBOW) — sample a few negative words and train with logistic loss.
* Hierarchical softmax — tree-based decomposition of softmax.

**Epochs**: Several passes over the corpus — often large corpora need just a few epochs; toy corpora may overfit quickly.
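
To show how the loss and the SGD updates fit together, here is a small training-loop sketch for the three toy samples from the corpus. It uses full softmax with plain SGD (no negative sampling or hierarchical softmax); the embedding size D = 8, the learning rate, and the epoch count are arbitrary choices for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, lr = 7, 8, 0.1

# (context indices, target index) for the three windows of the toy corpus
samples = [([0, 1, 3, 4], 2), ([1, 2, 4, 5], 3), ([2, 3, 5, 6], 4)]

W_e = rng.normal(scale=0.1, size=(V, D))
W_out = rng.normal(scale=0.1, size=(D, V))
b_out = np.zeros(V)


def forward(context_ids):
    """Return the averaged context embedding and the softmax output."""
    e = W_e[context_ids].mean(axis=0)
    u = W_out.T @ e + b_out
    y = np.exp(u - u.max())
    return e, y / y.sum()


def mean_loss():
    return float(np.mean([-np.log(forward(c)[1][t]) for c, t in samples]))


initial_loss = mean_loss()
for epoch in range(200):
    for context_ids, target in samples:
        e, y_hat = forward(context_ids)
        du = y_hat.copy()
        du[target] -= 1.0                    # gradient of the loss w.r.t. the scores u
        grad_e = W_out @ du                  # gradient w.r.t. the averaged embedding
        W_out -= lr * np.outer(e, du)
        b_out -= lr * du
        for idx in context_ids:              # each context row gets 1/C of the gradient
            W_e[idx] -= lr * grad_e / len(context_ids)
final_loss = mean_loss()

print(initial_loss, final_loss)
```

Only the rows of `W_e` that belong to the context words are touched in each step, which is why embedding updates are sparse.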

---

## 7 — Variants & notes

* **CBOW vs Skip-gram**: CBOW predicts a word from its context; skip-gram predicts the context from a word. Skip-gram with negative sampling (SGNS), introduced in the follow-up Word2Vec paper on distributed representations of words and phrases, is popular.
* **Averaging vs Summation**: You can sum embeddings or average. Averaging normalizes by context size.
* **Position weighting**: Optionally weight embeddings based on distance from center word.
* **Sub-sampling frequent words**: For large datasets, subsample very frequent words (e.g., `the`, `is`) to speed up training and improve embedding quality.
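
As an illustration of sub-sampling, the sketch below applies the discard rule described in the follow-up Word2Vec paper, `P(discard w) = 1 - sqrt(t / f(w))`, where `f(w)` is the relative frequency of the word and `t` is a threshold. The toy sentence and the large threshold `t = 0.1` are assumptions chosen so the effect is visible; on real corpora a much smaller value (around 1e-5) is typical, and library implementations may use a slightly different formula.

```python
import math
from collections import Counter


def discard_probability(freq, t):
    """1 - sqrt(t / freq), clipped at 0 for words rarer than the threshold."""
    return max(0.0, 1.0 - math.sqrt(t / freq))


tokens = "the cat sat on the mat and the dog sat on the rug".split()
counts = Counter(tokens)
total = len(tokens)
t = 0.1   # toy threshold (assumption); far smaller in practice

probs = {word: discard_probability(count / total, t) for word, count in counts.items()}
for word, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{word:>4}: {p:.3f}")
```

Frequent words such as `the` are dropped most often, while words below the threshold are always kept.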

---

## 8 — Implementation sketch (Keras)

```python
# Simple CBOW-like model in Keras (toy, illustrative)
from tensorflow.keras.layers import Input, Embedding, GlobalAveragePooling1D, Dense
from tensorflow.keras.models import Model

V = 10000  # vocab size (toy)
D = 50     # embedding size
C = 4      # number of context words

# Input: C integer indices (the context words) per training sample
context = Input(shape=(C,), dtype='int32')

# W_e: embedding lookup, output shape (batch, C, D)
embedded = Embedding(input_dim=V, output_dim=D)(context)

# Average the C context embeddings -> shape (batch, D)
avg = GlobalAveragePooling1D()(embedded)

# W_out and b_out followed by softmax over the vocabulary -> shape (batch, V).
# The output layer and compile settings are illustrative choices, not the only option.
output = Dense(V, activation='softmax')(avg)

model = Model(inputs=context, outputs=output)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```


![CBOW Neural Network](assets/Word2Vec-CBOW-1.png)
![CBOW Neural Network](assets/Word2Vec-CBOW-2.png)


## SkipGram
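Corpus: `iNeruon company is related to data science.`

Window size = 5

For Skip-Gram, the input is the center (target) word and the outputs are its context words:

| Input (target) | Output (context) |
|----------------|------------------|
| is | iNeruon, company, related, to |
| related | company, is, to, data |
| to | is, related, data, science |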

One-hot encoding (V = 7):

| Word | One-hot vector |
|---------|----------------|
| iNeruon | 1 0 0 0 0 0 0 |
| company | 0 1 0 0 0 0 0 |
| related | 0 0 0 1 0 0 0 |
| to | 0 0 0 0 1 0 0 |

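Weights (network layout):

| Layer | Units | Weights into the next layer |
|-------|-------|-----------------------------|
| Input (I/P), one-hot | 7 (= V) | 7 × 5 |
| Hidden, sized to the window size (W.S.) | 5 | 5 × 7 |
| Output (O/P), softmax | 7 (= V) | none |

Here the hidden layer has 5 units, matching the window size used in these notes; in general the embedding dimension is chosen independently of the window size.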


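### When to apply CBOW or Skip-Gram

- Small dataset (corpus) → CBOW
- Huge dataset (corpus) → Skip-Gram

This is a common classroom rule of thumb. The original Word2Vec authors, by contrast, note that Skip-Gram works well even with a small amount of training data and represents rare words well, while CBOW is several times faster to train.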


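### How to improve CBOW or Skip-Gram

1) Increase the amount of training data.
2) Increase the window size so that each word sees more context, and/or increase the vector dimension. These are separate hyperparameters (`window` and `vector_size` in Gensim); a larger window does not by itself change the vector dimension.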

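### Google Word2Vec

- Pre-trained on the Google News dataset (about 100 billion words); the released model covers about 3 million words and phrases.
- Each word is represented by a 300-dimensional feature vector, e.g. `cricket` -> `[ ... 300 values ... ]`.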

# Word2Vec — CBOW and Skip-Gram (Complete Guide)

> A compact, runnable Markdown guide covering intuition, math, training, practical tips, and a fully-connected ANN diagram to understand CBOW and Skip-Gram. Includes a small worked example (the sample corpus) and useful code snippets for Jupyter.

--- 

## Table of contents

1. Overview: what is Word2Vec?
2. One-hot encoding and the input/output setup
3. CBOW — intuition, math, forward/backward pass, training objective
4. Skip-Gram — intuition, math, forward/backward pass, training objective
5. Softmax, hierarchical softmax, negative sampling (why & when)
6. Worked example (your corpus) — step-by-step calculation
7. Fully connected ANN diagram (ASCII + HTML-friendly SVG snippet)
8. When to use CBOW vs Skip-Gram
9. How to improve results (hyperparams, data, tricks)
10. Google Word2Vec facts
11. Quick code examples (gensim + pure NumPy toy model)
12. Practical tips for Jupyter and embedding images in Markdown

---

## 1. Overview: what is Word2Vec?

Word2Vec is a family of two-layer neural-network models that learn dense vector representations (embeddings) for words such that words with similar contexts have similar vectors. The two primary architectures are:

* **CBOW (Continuous Bag of Words):** predict the center word from surrounding context words.
* **Skip-Gram:** predict surrounding context words given the center word.

Both map one-hot encoded words (very high-dimensional, sparse) into a low-dimensional dense vector space (e.g., 100–300 dims).

---

## 2. One-hot encoding and I/O setup

Given a vocabulary of size `V`, each word is a one-hot vector of length `V` (all zeros except a single 1 at the word index).

Example small vocab and one-hot matrix (vertical = indices):

```
Vocab: [iNeruon, company, related, to, data, science, is]
One-hot:
 iNeruon  [1 0 0 0 0 0 0]
 company  [0 1 0 0 0 0 0]
 related  [0 0 1 0 0 0 0]
 to       [0 0 0 1 0 0 0]
 data     [0 0 0 0 1 0 0]
 science  [0 0 0 0 0 1 0]
 is       [0 0 0 0 0 0 1]
```

**Weight matrices:**

* `W1` (V × N): input-to-hidden weight matrix (rows correspond to input words, columns to embedding dims).
* `W2` (N × V): hidden-to-output weight matrix.

For CBOW, multiple input one-hot vectors are averaged (or summed) to produce the hidden activation. For Skip-Gram, a single input one-hot flows through to hidden.

---

## 3. CBOW — intuition & math

**Intuition:** Given the context words (surrounding words inside a window), predict the center word. CBOW is faster to train than Skip-Gram and tends to be slightly more accurate for frequent words.

**Forward pass (single training example):**

1. Convert each context word into its one-hot vector and multiply by `W1` to obtain embeddings, or equivalently select the corresponding rows of `W1`.
2. Compute the average (or sum) of those embeddings: `h = (1 / C) * sum_{i=1..C} v_{context_i}`.
3. Compute scores for each vocabulary word: `u = W2^T h` (shape V).
4. Apply softmax over `u` to get probabilities `y_hat`.

**Loss:** negative log-likelihood (cross-entropy):

```
L = -log( softmax(u)[center] )
```

**Backprop:** compute gradient wrt `W2` and `W1` (only rows corresponding to context words get updated). When implementing, you can use efficient matrix operations to update multiple rows.

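**Batching:** CBOW batches naturally: each training example reduces its context words to a single averaged vector `h`, so a minibatch is just a matrix of such vectors multiplied by `W2`.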

---

## 4. Skip-Gram — intuition & math

**Intuition:** Given the target (center) word, predict each context word (many predictions per center). Works very well with large corpora and captures rare-word representations better.

**Forward pass (one center word, multiple output predictions):**

1. Center word one-hot → select row of `W1` → `h` (embedding of center).
2. For each context position, compute `u = W2^T h` → softmax → probability for that context word.
3. Loss is sum of cross-entropies over context words.

**Training:** Skip-Gram produces one training pair per (center, context) combination, so every word occurrence contributes several updates. This helps rare words but makes training slower than CBOW. Naive softmax over the full vocabulary is also expensive, which is where negative sampling and hierarchical softmax come in.

---

## 5. Softmax, hierarchical softmax, negative sampling

**Full softmax:** computes probabilities over entire vocabulary `V`. Complexity = `O(V)` per prediction — expensive for big vocabularies.

**Hierarchical softmax:** represents words as leaves of a binary tree; computing word probability takes `O(log V)`.

**Negative sampling (very common):** for each positive center-context pair, sample `k` negative words from a noise distribution and instead of softmax use logistic loss for positive and negative pairs. Complexity becomes `O(k)` per pair (k often between 5 and 20).

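Objective for negative sampling for a positive pair (center `w_c`, context `w_o`), to be maximized (the loss is its negative):

```
log σ(v'_{w_o}^T v_{w_c}) + sum_{i=1..k} E_{w_i~P_n(w)}[ log σ(-v'_{w_i}^T v_{w_c}) ]
```

where `σ` is the sigmoid function, `v` denotes input (center) vectors from `W1`, `v'` denotes output (context) vectors from `W2`, and `P_n(w)` is the noise distribution (often the unigram distribution raised to the power 0.75).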
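
The expression above can be evaluated directly for one positive pair. The following sketch returns the negated objective (so lower is better) for one center vector, one true context vector, and `k` negative vectors. For brevity the negatives are random vectors rather than samples drawn from `P_n(w)`; that simplification and all names are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def negative_sampling_loss(v_center, v_context, v_negatives):
    """-[ log σ(v_o·v_c) + Σ_i log σ(-v_i·v_c) ] for one positive pair and k negatives."""
    positive = np.log(sigmoid(v_context @ v_center))
    negative = np.sum(np.log(sigmoid(-(v_negatives @ v_center))))
    return float(-(positive + negative))


rng = np.random.default_rng(1)
N, k = 3, 5
v_center = rng.normal(size=N)
v_context = rng.normal(size=N)
v_negatives = rng.normal(size=(k, N))   # k rows, one per negative word

loss = negative_sampling_loss(v_center, v_context, v_negatives)
print(loss)
```

Only `k + 1` dot products are needed per pair, compared with `V` scores for the full softmax.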

---

## 6. Worked example (your corpus)

Corpus sentence: `iNeruon company is related to data science.`
Window size = 2 words on each side (kept small so the steps stay short). Vocabulary: `[iNeruon, company, related, to, data, science, is]` (V=7)

### Example Skip-Gram pair generation (window=2):

Take center `related` (index 2, counting from 0); the context words within window 2 are `company, is, to, data`.
So training pairs: `(related -> company)`, `(related -> is)`, `(related -> to)`, `(related -> data)`

One-hot for `related` = `[0,0,1,0,0,0,0]`.

If embedding dim `N = 3` (toy):
`W1` (7×3) and `W2` (3×7) initialized randomly. Forward: `h = W1[row_index_of_related]` (shape 3). Compute `u = W2^T h` (shape 7). Apply softmax, compute loss for target e.g., `company`.

The remaining steps are ordinary matrix multiplications followed by a gradient update of `W1` and `W2`.
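
For readers who want to see numbers, here is a small NumPy sketch of this example. It derives the context of `related` from the sentence order `iNeruon company is related to data science` with a window of 2 on each side, then runs one forward pass with randomly initialized `W1` (7×3) and `W2` (3×7). The random seed and the initialization scale are arbitrary assumptions.

```python
import numpy as np

vocab = ["iNeruon", "company", "related", "to", "data", "science", "is"]
word_to_id = {word: i for i, word in enumerate(vocab)}
sentence = ["iNeruon", "company", "is", "related", "to", "data", "science"]

window, N = 2, 3
pos = sentence.index("related")
context = [
    sentence[j]
    for j in range(max(0, pos - window), min(len(sentence), pos + window + 1))
    if j != pos
]

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(len(vocab), N))   # input -> hidden
W2 = rng.normal(scale=0.1, size=(N, len(vocab)))   # hidden -> output

h = W1[word_to_id["related"]]        # row lookup instead of one-hot multiplication
u = W2.T @ h                         # scores for all 7 vocabulary words
y_hat = np.exp(u - u.max())
y_hat /= y_hat.sum()                 # softmax

# Skip-Gram loss: sum of cross-entropies over the context words
loss = -sum(np.log(y_hat[word_to_id[word]]) for word in context)
print(context, float(loss))
```

With untrained weights every word gets a probability near 1/7, so the loss starts close to `4 * log(7)`.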

---

## 7. Fully-connected ANN diagram (ASCII + embeddable SVG)

Below is a simple fully-connected ANN representation for the CBOW/Skip-Gram one-hidden-layer architecture used by Word2Vec.

**ASCII diagram (CBOW, averaging 4 context words → hidden → output):**

```
Context one-hot vectors (V)   Context one-hot vectors (V)   ...   Context one-hot vectors (V)
      |                             |                              | 
      v                             v                              v
   [select row]                 [select row]                   [select row]
      \           (sum/avg)   /      
        \        /-----\     /       
          \    /  W1   \  /         
            -->| (VxN) |-->  h (N-d embedding)
                 \-----/
                    |
                    v
                  W2^T (V x N)  => raw scores (V)
                    |
                    v
                  softmax
                    |
                    v
                 output probs (V)
```

**Interpretation:** The hidden layer is the embedding. For CBOW we aggregate context embeddings; for Skip-Gram the input is a single one-hot that selects a single embedding which is used to predict multiple outputs.
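The "select row, then average" step in the diagram can be sketched as follows. The context indices and random weights are placeholders. The last lines check that averaging selected rows of `W1` gives the same result as multiplying an averaged one-hot vector by `W1`.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 7, 3
W1 = rng.normal(scale=0.1, size=(V, N))   # input embeddings (V x N)
W2 = rng.normal(scale=0.1, size=(N, V))   # output weights (N x V)

context_ids = [1, 6, 3, 4]                # four context word indices (hypothetical)

# CBOW hidden layer: average the embedding rows of the context words.
h = W1[context_ids].mean(axis=0)          # shape (N,)

# Output layer: raw scores over the vocabulary, then softmax.
u = W2.T @ h                              # shape (V,)
probs = np.exp(u - u.max())
probs /= probs.sum()

# Equivalent view: an averaged one-hot vector multiplied by W1.
avg_one_hot = np.zeros(V)
for i in context_ids:
    avg_one_hot[i] += 1.0 / len(context_ids)
h_via_matmul = avg_one_hot @ W1
```

For Skip-Gram the same code applies with a single index in `context_ids`, so `h` is just one row of `W1`.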

**HTML/SVG snippet (paste into an HTML cell in Jupyter or an `.html` file):**

```html
<!-- Small SVG to visualize the single-hidden-layer Word2Vec network -->
<svg width="700" height="260" xmlns="http://www.w3.org/2000/svg">
  <rect x="10" y="30" width="120" height="200" fill="#f3f4f6" stroke="#ccc" />
  <text x="22" y="50">Input (one-hot)</text>
  <!-- the remaining shapes are cut off in the source -->
</svg>
```


## 🟩 Advantages of Word2Vec

1. **Sparse Matrix → Dense Matrix**

   * Traditional models like Bag of Words or TF-IDF produce large **sparse matrices** (mostly zeros).
   * Word2Vec learns **dense, low-dimensional representations** (e.g., 100–300 dimensions), which are compact and computationally efficient.

2. **Captures Semantic Relationships**

   * Words with similar meanings or used in similar contexts have **similar vector representations**.
   * Example:

     ```
     vec("king") - vec("man") + vec("woman") ≈ vec("queen")
     ```
   * Captures semantic similarity (e.g., *good*, *honest*, *kind*) and syntactic relations (e.g., *walking*, *walked*, *walks*).

3. **Fixed Vector Dimension for All Words**

   * Every word in the vocabulary is represented by a **vector of fixed dimension** (e.g., 300-D in Google Word2Vec).
   * This consistency simplifies model input for downstream ML tasks.

4. **Handles Large Vocabulary Efficiently**

   * Word2Vec uses **negative sampling** and **hierarchical softmax** to train on millions of words efficiently.

5. **Improves NLP Performance**

   * Pretrained embeddings can significantly improve the accuracy of NLP models for classification, translation, and sentiment analysis tasks.

6. **Generalization**

   * Embeddings generalize well across tasks; pretrained models like Google’s Word2Vec can be reused for many NLP problems.
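The analogy arithmetic from item 2 can be illustrated with a toy example. The 2-D vectors below are hand-made for this sketch (roughly "royalty" and "gender" axes) and are not learned embeddings; the helper names `cosine` and `analogy` are also made up here.

```python
import numpy as np

# Hand-made toy vectors, NOT real Word2Vec output.
vectors = {
    "king":     np.array([1.0, 1.0]),
    "queen":    np.array([1.0, -1.0]),
    "man":      np.array([0.0, 1.0]),
    "woman":    np.array([0.0, -1.0]),
    "prince":   np.array([0.8, 0.9]),
    "princess": np.array([0.8, -0.9]),
    "apple":    np.array([-1.0, 0.1]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    query = vectors[a] - vectors[b] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

result = analogy("king", "man", "woman", vectors)
print(result)  # queen
```

With a real model loaded through gensim, `KeyedVectors.most_similar(positive=["king", "woman"], negative=["man"])` runs the same kind of query over the full vocabulary.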

---

## 🟥 Disadvantages of Word2Vec

1. **OOV (Out-of-Vocabulary) Words Not Handled**

   * If a word wasn’t seen during training, Word2Vec **can’t produce an embedding** for it.
   * It doesn’t handle new or rare words gracefully (later models like **FastText** address this with subword embeddings).

2. **Context-Independent Representations**

   * Each word has **only one vector**, regardless of context.

     * Example: The word *“bank”* means different things in *“river bank”* and *“bank account”*, but Word2Vec gives it a single vector.
   * Contextual models like **BERT** or **ELMo** overcome this.

3. **Requires Large Training Data**

   * To learn meaningful embeddings, it needs a **large text corpus**; the widely used pretrained models were trained on billions of words.
   * Results are often poor on small datasets.

4. **Domain Dependence**

   * If trained on a specific domain (e.g., news), it may not perform well on another (e.g., medical or legal text).

5. **Black-box Nature**

   * The embeddings are not directly interpretable — you can’t easily tell why two words are close in vector space without analysis.

---

### ✅ Summary Table

| Aspect                        | Word2Vec                |
| ----------------------------- | ----------------------- |
| **Matrix Type**               | Dense (efficient)       |
| **Semantic Meaning**          | Captured                |
| **Vector Dimension**          | Fixed (e.g., 300)       |
| **OOV Words**                 | ❌ Not handled           |
| **Context Awareness**         | ❌ Not context-dependent |
| **Training Data Requirement** | Large                   |
| **Interpretability**          | Limited                 |

---


# Word2Vec — CBOW, Skip-Gram, and Average Word2Vec (Complete Guide)

> This document explains Word2Vec architectures (CBOW and Skip-Gram), their advantages/disadvantages, and how **Average Word2Vec** is used for text classification tasks.

---

## 🧠 Recap: What is Word2Vec?

Word2Vec transforms words into **dense vector representations** (embeddings) that capture semantic and syntactic meaning. Each word is represented by a fixed-size vector (e.g., 100–300 dimensions).

Two training approaches:

* **CBOW (Continuous Bag of Words):** Predict target word from context.
* **Skip-Gram:** Predict context words from the target word.

Both learn embeddings via a shallow neural network (one hidden layer).

---

## ⚙️ Average Word2Vec

### 📖 Concept

**Average Word2Vec** is a simple yet effective way to represent an entire sentence or document as a single vector.

* Each word in the sentence is represented by its **Word2Vec vector** (e.g., 300-dim).
* The sentence vector is the **average of all word vectors** in that sentence.
* This representation is then used as input for a **machine learning model** (e.g., Logistic Regression, SVM, Neural Network) for classification tasks like sentiment analysis.

---

### 🧩 Example Dataset

| Document | Text             | Sentiment label |
| -------- | ---------------- | --------------- |
| D1       | The food is good | 1 (Positive)    |
| D2       | The food is bad  | 0 (Negative)    |
| D3       | Pizza is amazing | 1 (Positive)    |

---

### 🧮 Step-by-Step Example Using Google’s Pretrained Word2Vec (300 Dimensions)

**Input:** Google News pretrained Word2Vec model (3 million words and phrases, 300-dimensional vectors)

| Word | Embedding (illustrative values, truncated) | Dimensions |
| ---- | ------------------------------------------ | ---------- |
| The  | [0.1, 0.05, -0.02, … , 0.08]               | 300        |
| food | [-0.03, 0.09, 0.15, … , -0.04]             | 300        |
| is   | [0.02, 0.01, 0.00, … , 0.05]               | 300        |
| good | [0.21, 0.14, 0.06, … , 0.18]               | 300        |

#### Step 1: Get each word’s vector from pretrained model

* `v(The)` → 300-d vector
* `v(food)` → 300-d vector
* `v(is)` → 300-d vector
* `v(good)` → 300-d vector

#### Step 2: Compute Average Vector for the Sentence

For sentence `D1: The food is good`:

$$
V_{D1} = \frac{v(\text{The}) + v(\text{food}) + v(\text{is}) + v(\text{good})}{4}
$$

Result: `V_D1` = a single **300-dimensional vector** representing the entire sentence.

#### Step 3: Use Average Vector as Input to Classifier

* `V_D1` → Input to ML model
* Output label = `1` (Positive)
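Steps 1 to 3 can be sketched end to end without downloading the pretrained model. Everything below is a stand-in: the 4-dimensional embeddings in `emb` are hand-made rather than taken from Google's model, and the classifier is a bare-bones logistic regression trained by gradient descent, chosen only to keep the example dependency-free.

```python
import numpy as np

# Hypothetical 4-d embeddings standing in for the 300-d pretrained vectors.
emb = {
    "the":     [0.1, 0.0, 0.0, 0.1],
    "food":    [0.0, 0.5, 0.2, 0.0],
    "is":      [0.0, 0.0, 0.1, 0.1],
    "good":    [0.9, 0.1, 0.0, 0.2],
    "bad":     [-0.9, 0.1, 0.0, 0.2],
    "pizza":   [0.1, 0.6, 0.3, 0.0],
    "amazing": [1.0, 0.0, 0.1, 0.3],
}

def sentence_vector(text):
    """Steps 1 and 2: look up each word, then average the vectors."""
    return np.mean([emb[w] for w in text.lower().split() if w in emb], axis=0)

docs = ["The food is good", "The food is bad", "Pizza is amazing"]
y = np.array([1.0, 0.0, 1.0])
X = np.array([sentence_vector(d) for d in docs])   # shape (3, 4)

# Step 3: logistic regression fitted with plain gradient descent.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= 1.0 * (X.T @ grad) / len(y)
    b -= 1.0 * grad.mean()

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) >= 0.5).astype(int)
print(preds)  # [1 0 1]
```

With three training sentences this only demonstrates the data flow. A real experiment would need a proper dataset and a held-out test split.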

---

### 🧠 Intuition

Averaging word embeddings gives a rough summary of the **overall semantic meaning** of a sentence, smoothing out noise from individual words.

* Works well for short texts.
* Simpler than RNNs or Transformers.
* Fast and effective for classical ML tasks.

---

### 🧰 Code Example (Using Gensim)

```python
from gensim.models import KeyedVectors
import numpy as np

# Load Google's pretrained model (binary format); the file must be downloaded separately
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

def avg_word2vec(sentence, model):
    """Average the vectors of all in-vocabulary words in a sentence."""
    # Keep only tokens the model knows; out-of-vocabulary tokens are skipped
    words = [w for w in sentence.lower().split() if w in model]
    if not words:
        # No known words: fall back to a zero vector of the right size
        return np.zeros(model.vector_size)
    vectors = [model[w] for w in words]
    return np.mean(vectors, axis=0)

sentences = ["The food is good", "The food is bad", "Pizza is amazing"]
labels = [1, 0, 1]  # sentiment labels for a downstream classifier

X = np.array([avg_word2vec(s, model) for s in sentences])
print(X.shape)  # (3, 300)
```

---

### 🧩 Why Average Word2Vec?

✅ **Advantages:**

* Simple, fast, and effective for small datasets.
* Captures semantic meaning through averaged vectors.
* Fixed-length representation for variable-length sentences.

❌ **Disadvantages:**

* Loses word order information.
* Contextual meaning is averaged out.
* Not suitable for long or complex sentences.
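The loss of word order is easy to demonstrate: two sentences with the same words in a different order get the same averaged vector. The random vectors below are stand-ins for real embeddings, and the example sentences are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["dog", "bites", "man"]
emb = {w: rng.normal(size=5) for w in vocab}   # random stand-in vectors

def avg_vector(text):
    return np.mean([emb[w] for w in text.split()], axis=0)

a = avg_vector("dog bites man")
b = avg_vector("man bites dog")
same = np.allclose(a, b)    # True: a sum does not depend on the order of its terms
```

Any classifier built on these averages cannot tell the two sentences apart, whatever embeddings are used.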

---

### 🧮 Visualization Table Example

Below is a visual representation of how each word vector contributes to the averaged embedding.

| Word | Word2Vec Vector (300 dims) | Contributes to Average |
| ---- | -------------------------- | ---------------------- |
| The  | [x₁, x₂, …, x₃₀₀]          | ✔                      |
| food | [y₁, y₂, …, y₃₀₀]          | ✔                      |
| is   | [z₁, z₂, …, z₃₀₀]          | ✔                      |
| good | [g₁, g₂, …, g₃₀₀]          | ✔                      |

Final vector → `(v(The) + v(food) + v(is) + v(good)) / 4` → `[avg₁, avg₂, …, avg₃₀₀]`

---

## 🟩 Advantages of Word2Vec

1. **Sparse → Dense Representation** – reduces memory & improves model performance.
2. **Semantic Meaning Captured** – words like *good*, *honest*, *nice* cluster together.
3. **Fixed-Dimension Vectors** – every word has equal-length representation (e.g., 300 dims).
4. **Efficient Training** – negative sampling speeds up large-vocab training.
5. **Transfer Learning Ready** – pretrained embeddings can be reused in various NLP tasks.

---

## 🟥 Disadvantages of Word2Vec

1. **OOV Problem** – cannot represent unseen words.
2. **Context-Independent** – same vector for *“bank”* (river vs finance).
3. **Requires Large Data** – poor results on small corpora.
4. **Limited Interpretability** – embeddings are not human-readable.
5. **Domain Dependence** – may not generalize across domains.

---

### 📘 Summary

| Aspect              | Word2Vec                   | Average Word2Vec               |
| ------------------- | -------------------------- | ------------------------------ |
| **Output Type**     | Word embeddings            | Sentence embeddings            |
| **Dimension**       | Fixed per word (e.g., 300) | Fixed per sentence (e.g., 300) |
| **Handles OOV**     | ❌                          | ❌                              |
| **Preserves Order** | ❌                          | ❌                              |
| **Ease of Use**     | ⭐⭐⭐⭐                       | ⭐⭐⭐⭐⭐                          |

---

**Google Word2Vec Model:**

* Trained on about **100 billion words** from part of the Google News dataset.
* **3 million words and phrases** in the vocabulary.
* **300-dimensional feature vectors.**

---
