
# 🧩 N-Gram Representation with Binary Encoding (Unigram → 4-Gram)

## 📘 What is an N-Gram?

An **N-gram** is a sequence of **N consecutive words** (tokens) in a text.  
N-grams help capture **context**, **patterns**, and **relationships** between neighboring words in Natural Language Processing (NLP).

---

## 🧠 Example Sentence

> **Sentence:** `I love natural language processing`

---

## 🔢 Step 1: Generate N-Grams

| Type | N | Extracted N-grams |
|------|---|------------------|
| **Unigram** | 1 | ['I', 'love', 'natural', 'language', 'processing'] |
| **Bigram** | 2 | ['I love', 'love natural', 'natural language', 'language processing'] |
| **Trigram** | 3 | ['I love natural', 'love natural language', 'natural language processing'] |
| **4-gram** | 4 | ['I love natural language', 'love natural language processing'] |

---
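A sentence with `L` words contains `L - N + 1` N-grams of size `N`, which is why the table above shrinks from 5 unigrams to two 4-grams. Here is a minimal sketch that checks this for the example sentence, assuming plain whitespace tokenization; `sliding_ngrams` is a hypothetical helper name, not part of any library.

```python
def sliding_ngrams(tokens, n):
    """Return the N-grams of `tokens` as space-joined strings."""
    # zip over n staggered views of the token list; zip stops at the
    # shortest view, so len(tokens) - n + 1 windows are produced
    # (or none at all when n is larger than the sentence).
    return [" ".join(window) for window in zip(*(tokens[i:] for i in range(n)))]

tokens = "I love natural language processing".split()

# Number of N-grams for N = 1..4
counts = {n: len(sliding_ngrams(tokens, n)) for n in range(1, 5)}
print(counts)                     # {1: 5, 2: 4, 3: 3, 4: 2}
print(sliding_ngrams(tokens, 4))  # the two 4-grams from the table
```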

## ⚙️ Step 2: Represent N-Grams in Binary Form

Let’s consider **two sentences**:

1. **S1:** `I love natural language processing`  
2. **S2:** `I love coding in python`

We’ll now represent N-grams using **binary encoding (0 and 1)**. The rows of each table below list every distinct N-gram found in either sentence (the vocabulary):  
`1` means the N-gram appears in the sentence, `0` means it doesn’t.

---

### 🔹 1️⃣ Unigram Binary Representation

| Unigram | S1 | S2 |
|----------|----|----|
| I | 1 | 1 |
| love | 1 | 1 |
| natural | 1 | 0 |
| language | 1 | 0 |
| processing | 1 | 0 |
| coding | 0 | 1 |
| in | 0 | 1 |
| python | 0 | 1 |

---

### 🔹 2️⃣ Bigram Binary Representation

| Bigram | S1 | S2 |
|---------|----|----|
| I love | 1 | 1 |
| love natural | 1 | 0 |
| natural language | 1 | 0 |
| language processing | 1 | 0 |
| love coding | 0 | 1 |
| coding in | 0 | 1 |
| in python | 0 | 1 |

---

### 🔹 3️⃣ Trigram Binary Representation

| Trigram | S1 | S2 |
|----------|----|----|
| I love natural | 1 | 0 |
| love natural language | 1 | 0 |
| natural language processing | 1 | 0 |
| I love coding | 0 | 1 |
| love coding in | 0 | 1 |
| coding in python | 0 | 1 |

---

### 🔹 4️⃣ 4-Gram Binary Representation

| 4-Gram | S1 | S2 |
|--------|----|----|
| I love natural language | 1 | 0 |
| love natural language processing | 1 | 0 |
| I love coding in | 0 | 1 |
| love coding in python | 0 | 1 |

---
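The rows marked `1` in both columns of the tables above are the N-grams the two sentences share, and there are fewer of them as N grows. The sketch below recomputes that overlap with set intersection, assuming whitespace tokenization and case-sensitive matching; `ngram_set` is an illustrative helper name.

```python
def ngram_set(sentence, n):
    """Set of space-joined N-grams in a whitespace-tokenized sentence."""
    words = sentence.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

s1 = "I love natural language processing"
s2 = "I love coding in python"

# N-grams that would get a 1 in both the S1 and S2 columns
shared = {n: sorted(ngram_set(s1, n) & ngram_set(s2, n)) for n in range(1, 5)}
for n, grams in shared.items():
    print(n, grams)
# 1 ['I', 'love']
# 2 ['I love']
# 3 []
# 4 []
```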

## ⚙️ Step 3: Python Implementation

### ✅ Using `CountVectorizer` (Scikit-learn)
```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "I love natural language processing",
    "I love coding in python"
]

# Generate unigram to 4-gram binary features.
# CountVectorizer lowercases the text, and its default token_pattern
# (r"(?u)\b\w\w+\b") drops one-character tokens such as "I", so a
# pattern that also accepts single characters is passed explicitly.
vectorizer = CountVectorizer(
    ngram_range=(1, 4),
    binary=True,
    token_pattern=r"(?u)\b\w+\b",
)
X = vectorizer.fit_transform(sentences)

print("Features (N-grams):")
print(vectorizer.get_feature_names_out())
print("\nBinary Representation:\n", X.toarray())
```

**Output (example; line wrapping may differ):**

```
Features (N-grams):
['coding' 'coding in' 'coding in python' 'i' 'i love' 'i love coding'
 'i love coding in' 'i love natural' 'i love natural language' 'in'
 'in python' 'language' 'language processing' 'love' 'love coding'
 'love coding in' 'love coding in python' 'love natural'
 'love natural language' 'love natural language processing' 'natural'
 'natural language' 'natural language processing' 'processing' 'python']

Binary Representation:
 [[0 0 0 1 1 0 0 1 1 0 0 1 1 1 0 0 0 1 1 1 1 1 1 1 0]
 [1 1 1 1 1 1 1 0 0 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0 1]]
```

💡 Each **row** corresponds to a sentence, and each **column** corresponds to an N-gram.
`1` = present, `0` = absent.

---
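In `CountVectorizer`, tokenization happens before any N-grams are built: the text is lowercased and then split by the `token_pattern` regular expression. The library's default pattern, `r"(?u)\b\w\w+\b"`, only matches tokens of two or more word characters, so a one-letter word like `I` survives only if a more permissive pattern is supplied. The sketch below imitates just that tokenization step with the standard `re` module. It is an approximation for illustration, not scikit-learn's actual code path, and `tokenize` and the two constant names are made up here.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # scikit-learn's documented default
ONE_CHAR_PATTERN = r"(?u)\b\w+\b"   # also keeps one-letter tokens

def tokenize(text, pattern):
    """Lowercase the text, then return every match of the token pattern."""
    return re.findall(pattern, text.lower())

sentence = "I love natural language processing"
default_tokens = tokenize(sentence, DEFAULT_PATTERN)
one_char_tokens = tokenize(sentence, ONE_CHAR_PATTERN)

print(default_tokens)   # ['love', 'natural', 'language', 'processing']
print(one_char_tokens)  # ['i', 'love', 'natural', 'language', 'processing']
```

Because N-grams are assembled from these tokens, dropping `i` also removes every N-gram that starts with it, such as `i love`.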

### ✅ Manual Implementation (Without Libraries)

```python
def generate_ngrams(words, n):
    return [' '.join(words[i:i+n]) for i in range(len(words)-n+1)]

sentences = [
    "I love natural language processing",
    "I love coding in python"
]

# Extract all N-grams (1 to 4)
all_ngrams = set()
for s in sentences:
    words = s.split()
    for n in range(1, 5):  # 1 to 4
        ngrams = generate_ngrams(words, n)
        all_ngrams.update(ngrams)

all_ngrams = sorted(all_ngrams)  # fixed column order for the matrix

# Binary matrix: one row per sentence, one column per N-gram
binary_matrix = []
for s in sentences:
    words = s.split()
    sentence_ngrams = set()  # a set gives fast membership tests
    for n in range(1, 5):
        sentence_ngrams.update(generate_ngrams(words, n))
    row = [1 if gram in sentence_ngrams else 0 for gram in all_ngrams]
    binary_matrix.append(row)

print("All N-grams (1 to 4):")
print(all_ngrams)
print("\nBinary Matrix:")
for row in binary_matrix:
    print(row)
```

**Output Example** (the list is wrapped here for readability):

```
All N-grams (1 to 4):
['I', 'I love', 'I love coding', 'I love coding in', 'I love natural',
 'I love natural language', 'coding', 'coding in', 'coding in python',
 'in', 'in python', 'language', 'language processing', 'love',
 'love coding', 'love coding in', 'love coding in python', 'love natural',
 'love natural language', 'love natural language processing', 'natural',
 'natural language', 'natural language processing', 'processing', 'python']

Binary Matrix:
[1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0]
[1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
```
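Once the vocabulary is fixed, a sentence that was not used to build it can be encoded against the same columns; any N-gram outside the vocabulary simply has no column and is ignored. A minimal sketch of that step, assuming whitespace tokenization; `ngrams_up_to`, `encode`, and the sentence `I love python` are hypothetical and chosen only for illustration.

```python
def ngrams_up_to(sentence, max_n=4):
    """All 1..max_n-grams of a whitespace-tokenized sentence, as a set."""
    words = sentence.split()
    return {
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    }

training = [
    "I love natural language processing",
    "I love coding in python",
]
# Vocabulary: sorted union of the N-grams of both sentences
vocabulary = sorted(set().union(*(ngrams_up_to(s) for s in training)))

def encode(sentence, vocabulary):
    """Binary vector over the fixed vocabulary (unknown N-grams are dropped)."""
    present = ngrams_up_to(sentence)
    return [1 if gram in present else 0 for gram in vocabulary]

new_sentence = "I love python"  # hypothetical unseen sentence
vector = encode(new_sentence, vocabulary)
active = [gram for gram, bit in zip(vocabulary, vector) if bit]
print(len(vocabulary))  # 25
print(active)           # ['I', 'I love', 'love', 'python']
```

Here `love python` and `I love python` occur in the new sentence but not in the vocabulary, so they do not appear in the vector.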

---

## 📊 Step 4: Visual Summary

| N-Gram      | Captures                | Example                       |
| ----------- | ----------------------- | ----------------------------- |
| **Unigram** | Single words            | `["I", "love"]`               |
| **Bigram**  | Pair of words           | `["I love", "love natural"]`  |
| **Trigram** | Three consecutive words | `["I love natural"]`          |
| **4-gram**  | Four consecutive words  | `["I love natural language"]` |

---

## 🚀 Step 5: Applications

* ✅ **Text Prediction** (keyboard suggestions)
* ✅ **Spam Detection** ("free money", "win prize")
* ✅ **Machine Translation**
* ✅ **Search Engines** (phrase-based queries)
* ✅ **Sentiment Analysis**

---
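As a toy illustration of the spam-detection bullet, the sketch below checks a message for flagged bigrams. The two phrases come from the list above; the messages, the `bigrams` and `flagged_phrases` helpers, and the idea of a fixed phrase list are assumptions made for this example, not a real spam filter.

```python
FLAGGED_BIGRAMS = {"free money", "win prize"}  # phrases from the bullet list

def bigrams(text):
    """Set of lowercase bigrams in a whitespace-tokenized text."""
    words = text.lower().split()
    return {" ".join(pair) for pair in zip(words, words[1:])}

def flagged_phrases(message):
    """Flagged bigrams that occur in the message, in sorted order."""
    return sorted(bigrams(message) & FLAGGED_BIGRAMS)

# Hypothetical messages, made up for illustration
print(flagged_phrases("Claim your free money now"))  # ['free money']
print(flagged_phrases("Money is not free"))          # []
```

The second message contains both `free` and `money` as unigrams, but not the bigram `free money`, which is the kind of word-order information that bigram features add.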

## 🧠 Summary

| Step | Description                                                         |
| ---- | ------------------------------------------------------------------- |
| 1️⃣  | Extract N-grams from text (1 to 4)                                  |
| 2️⃣  | Create a vocabulary of unique N-grams                               |
| 3️⃣  | Represent sentences as binary vectors (1 if N-gram present, else 0) |
| 4️⃣  | Feed vectors into ML/NLP models                                     |

> 💡 **In short:**
> N-grams represent **text context as numbers**: each binary vector records which word sequences occur in a sentence, so models can learn from short word sequences, not just isolated words.

---



#### ***Note: For a better understanding of the implementation, refer to the 12-Bag of Words notebook.***
> http://localhost:8888/notebooks/Gen%20AI/NLP/Text%20Preprocessing/12-Bag%20of%20Words.ipynb