#### Advanced Vectorization

 - Bag of words (simple frequency, relative frequency, and term frequency-inverse document frequency, TF-IDF)
 - Words may be unigrams or n-grams with n = 2, 3, 4, ...


## What Are N‑grams?

An **n‑gram** is a contiguous sequence of *n* items (tokens) from a given sample of text or speech.
- **Items** can be **words** or **characters**.
- **n** can be any positive integer (1, 2, 3, …).
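
The examples in this notebook use word n‑grams, so here is a minimal sketch of the character case for comparison. The helper name `char_ngrams` is made up for illustration and is not part of any library.

```python
def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text` (hypothetical helper)."""
    # A window of width n starts at every index from 0 to len(text) - n
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("cake", 2))  # ['ca', 'ak', 'ke']
print(char_ngrams("cake", 3))  # ['cak', 'ake']
```

If `n` is larger than the length of the text, the list is simply empty.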

---

## Why Use N‑grams?

- **Language modeling** (predict next word)
- **Text classification** (features in Naïve Bayes, SVM, etc.)
- **Plagiarism detection** (match overlapping sequences)
- **Spelling correction** (detect likely sequences)
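
To make the language-modeling use concrete, the following is a rough sketch of next-word prediction from bigram counts. The toy sentences and the `predict_next` helper are assumptions invented for this illustration; a real language model would add smoothing and far more data.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only
sentences = [
    "i love chocolate cake",
    "i love chocolate milk",
    "i love cake",
]

# For each word, count which words follow it (bigram counts)
following = defaultdict(Counter)
for sentence in sentences:
    words = sentence.split()
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word`, or None if the word is unseen."""
    if word not in following:
        return None
    return following[word].most_common(1)[0][0]

print(predict_next("love"))   # chocolate
print(predict_next("pizza"))  # None
```

Here "love" is followed by "chocolate" twice and by "cake" once, so "chocolate" is the prediction.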


# N‑grams example

Imagine the sentence:

> **“I love chocolate cake.”**

---

## 1‑grams (Unigrams)

Each single word by itself:

- I
- love
- chocolate
- cake

---

## 2‑grams (Bigrams)

Every pair of words in a row:

- I love
- love chocolate
- chocolate cake

---

## 3‑grams (Trigrams)

Every three words in a row:

- I love chocolate
- love chocolate cake

---

*An n‑gram is simply “n” things stuck together exactly as they appeared in the text, sliding one step at a time. This helps computers guess what comes next or spot patterns in writing.*
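
The sliding window can be sketched in plain Python without any library. The function name `sliding_ngrams` is hypothetical, and the sentence is split on whitespace with the final period dropped to keep the example short.

```python
def sliding_ngrams(tokens, n):
    """Slide a window of size n over `tokens`, moving one step at a time."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I love chocolate cake".split()

for n in (1, 2, 3):
    grams = sliding_ngrams(tokens, n)
    # A sequence of N tokens yields N - n + 1 n-grams (for n <= N)
    assert len(grams) == len(tokens) - n + 1
    print(f"{n}-grams:", grams)
```

With 4 tokens this gives 4 unigrams, 3 bigrams, and 2 trigrams, matching the lists above.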


```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# Ensure the tokenizer data is downloaded
# (newer NLTK releases may require 'punkt_tab' instead of 'punkt')
nltk.download('punkt')

# Sample sentence
text = "The quick brown fox jumps over the lazy dog"

# Tokenize into words
tokens = word_tokenize(text)

# Generate and display n-grams; ngrams() yields tuples of n tokens
for n in (1, 2, 3):
    ng = list(ngrams(tokens, n))
    print(f"{n}-grams ({len(ng)}):")
    for gram in ng:
        print(" ", gram)
    print()
```


```text
1-grams (9):
  ('The',)
  ('quick',)
  ('brown',)
  ('fox',)
  ('jumps',)
  ('over',)
  ('the',)
  ('lazy',)
  ('dog',)

2-grams (8):
  ('The', 'quick')
  ('quick', 'brown')
  ('brown', 'fox')
  ('fox', 'jumps')
  ('jumps', 'over')
  ('over', 'the')
  ('the', 'lazy')
  ('lazy', 'dog')

3-grams (7):
  ('The', 'quick', 'brown')
  ('quick', 'brown', 'fox')
  ('brown', 'fox', 'jumps')
  ('fox', 'jumps', 'over')
  ('jumps', 'over', 'the')
  ('over', 'the', 'lazy')
  ('the', 'lazy', 'dog')
```





## Building a Bag‑of‑Words from N‑grams

Below are two ways to turn your n‑grams into a BoW representation in Python.

---

### 1. Manual BoW with `collections.Counter`



```python
from collections import Counter

import nltk
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# Download tokenizer data
# (newer NLTK releases may require 'punkt_tab' instead of 'punkt')
nltk.download('punkt')

# Sample text
text = "The quick brown fox jumps over the lazy dog"

# Tokenize
tokens = word_tokenize(text)

# Generate all 1-, 2-, and 3-grams as space-joined strings
all_ngrams = []
for n in (1, 2, 3):
    all_ngrams += [' '.join(gram) for gram in ngrams(tokens, n)]

# Build BoW counts: each distinct n-gram maps to its number of occurrences
bow = Counter(all_ngrams)

# Display result
for gram, count in bow.items():
    print(f"{gram!r}: {count}")
```

```text
'The': 1
'quick': 1
'brown': 1
'fox': 1
'jumps': 1
'over': 1
'the': 1
'lazy': 1
'dog': 1
'The quick': 1
'quick brown': 1
'brown fox': 1
'fox jumps': 1
'jumps over': 1
'over the': 1
'the lazy': 1
'lazy dog': 1
'The quick brown': 1
'quick brown fox': 1
'brown fox jumps': 1
'fox jumps over': 1
'jumps over the': 1
'over the lazy': 1
'the lazy dog': 1
```
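
Note that `'The'` and `'the'` are counted as different entries in the output above, because the counts are case-sensitive. One possible remedy, sketched below, is to lowercase the text before counting. As an assumption to keep the snippet dependency-free, `str.split()` stands in for `word_tokenize`, and only unigrams and bigrams are built.

```python
from collections import Counter

text = "The quick brown fox jumps over the lazy dog"

# Assumption: str.split() replaces word_tokenize so the sketch needs no downloads
tokens = text.lower().split()

# Unigrams plus space-joined bigrams
grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]

bow = Counter(grams)
print(bow["the"])    # 2
print("The" in bow)  # False
```

Whether to lowercase depends on the task: it shrinks the vocabulary but discards case information such as sentence starts and proper nouns.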




## Term Frequency (TF) and TF‑IDF for N‑grams

With scikit‑learn you can compute **Term Frequency (TF)** and **TF‑IDF** over n‑grams in a few lines. Below are two examples using the same sample corpus.

---

### 1. Term Frequency (TF) with `CountVectorizer`

```python
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Sample corpus
corpus = [
    "I love chocolate cake",
    "Chocolate cake is delicious",
    "Do you love cake too"
]

# Create a CountVectorizer for unigrams and bigrams
cv = CountVectorizer(ngram_range=(1, 2))

# Fit to corpus and transform into a sparse document-term matrix
X_counts = cv.fit_transform(corpus)

# Raw counts (one row per document, one column per n-gram)
feature_names = cv.get_feature_names_out()
print("Features (n‑grams):\n", feature_names, "\n")

print("Raw counts matrix:\n", X_counts.toarray(), "\n")

# Divide each row by its total count to get term frequency per document
tf = X_counts.toarray().astype(float)
tf = tf / tf.sum(axis=1, keepdims=True)

print("Term Frequency (TF) matrix:\n", tf)
```

```text
Features (n‑grams):
 ['cake' 'cake is' 'cake too' 'chocolate' 'chocolate cake' 'delicious' 'do'
 'do you' 'is' 'is delicious' 'love' 'love cake' 'love chocolate' 'too'
 'you' 'you love'] 

Raw counts matrix:
 [[1 0 0 1 1 0 0 0 0 0 1 0 1 0 0 0]
 [1 1 0 1 1 1 0 0 1 1 0 0 0 0 0 0]
 [1 0 1 0 0 0 1 1 0 0 1 1 0 1 1 1]] 

Term Frequency (TF) matrix:
 [[0.2        0.         0.         0.2        0.2        0.
  0.         0.         0.         0.         0.2        0.
  0.2        0.         0.         0.        ]
 [0.14285714 0.14285714 0.         0.14285714 0.14285714 0.14285714
  0.         0.         0.14285714 0.14285714 0.         0.
  0.         0.         0.         0.        ]
 [0.11111111 0.         0.11111111 0.         0.         0.
  0.11111111 0.11111111 0.         0.         0.11111111 0.11111111
  0.         0.11111111 0.11111111 0.11111111]]
```
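
The TF values above can be checked by hand. With its default settings, `CountVectorizer` lowercases the text and keeps only tokens of two or more word characters, which is why "I" does not appear among the features. The first document therefore contributes 3 unigrams and 2 bigrams, and each gets a TF of 1/5 = 0.2. The sketch below reproduces this arithmetic in plain Python; the helper `term_frequencies` and the regular expression are assumptions that approximate the default tokenization, not scikit‑learn's own code.

```python
import re
from collections import Counter

corpus = [
    "I love chocolate cake",
    "Chocolate cake is delicious",
    "Do you love cake too",
]

def term_frequencies(doc):
    """Relative frequency of each unigram and bigram in one document (hypothetical helper)."""
    # Approximate CountVectorizer defaults: lowercase, tokens of 2+ word characters
    tokens = re.findall(r"\b\w\w+\b", doc.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    counts = Counter(tokens + bigrams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

for doc in corpus:
    tf_doc = term_frequencies(doc)
    print(f"{doc!r}: {len(tf_doc)} n-grams, TF of 'cake' = {tf_doc['cake']:.8f}")
```

The three documents contain 5, 7, and 9 n‑grams, so the TF of "cake" is 1/5, 1/7, and 1/9, which agrees with the first column of the matrix.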
