# Vectorization Techniques
| Method       | Unit of Representation                           | Captures Semantics? | Dense or Sparse |
| ------------ | ------------------------------------------------ | ------------------- | --------------- |
| **BoW**      | Document → vector (word counts)                  | ❌ No                | Sparse          |
| **TF-IDF**   | Document → vector (weighted counts)              | ❌ No                | Sparse          |
| **n-grams**  | Document → vector of n-word sequences            | ❌ No                | Sparse          |
| **Word2Vec** | Word → dense vector (pretrained or learned)      | ✅ Yes               | Dense           |
| **Doc2Vec**  | Document → dense vector                          | ✅ Yes               | Dense           |
| **BERT**     | Word, sentence, or doc → context-aware embedding | ✅ Yes               | Dense           |

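The sparse methods (BoW, TF-IDF, n-grams) are **count-based**, while the dense methods (Word2Vec, Doc2Vec, BERT) are **prediction-based**: their vectors are learned by training a model to predict words from their context.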

## BoW
Bag of Words (BoW) is a vectorization technique that:
- Takes tokenized words (usually from a word-level tokenizer)
  
- Builds a vocabulary (set of unique tokens)
  
- Converts each document into a vector counting how many times each word appears

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

docs = [
    "I love NLP and machine learning",
    "NLP is fun and exciting",
    "I love coding in Python"
]

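# Create BoW vectorizer (by default it lowercases text and drops
# single-character tokens such as "I", so "I" is not in the vocabulary)
bow_vectorizer = CountVectorizer()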
X_bow = bow_vectorizer.fit_transform(docs)

# Vocabulary
print("BoW Vocabulary:", bow_vectorizer.get_feature_names_out())

# BoW Matrix
bow_df = pd.DataFrame(X_bow.toarray(), columns=bow_vectorizer.get_feature_names_out())
print(bow_df)


BoW Vocabulary: ['and' 'coding' 'exciting' 'fun' 'in' 'is' 'learning' 'love' 'machine'
 'nlp' 'python']
   and  coding  exciting  fun  in  is  learning  love  machine  nlp  python
0    1       0         0    0   0   0         1     1        1    1       0
1    1       0         1    1   0   1         0     0        0    1       0
2    0       1         0    0   1   0         0     1        0    0       1
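
The three steps listed above (tokenize, build a vocabulary, count) can also be sketched without scikit-learn. This is a minimal illustration, not how `CountVectorizer` is implemented internally; the helper names `tokenize` and `to_bow_vector` are made up for this example, and the regular expression is chosen to mimic the vectorizer's default token pattern.

```python
import re
from collections import Counter

docs = [
    "I love NLP and machine learning",
    "NLP is fun and exciting",
    "I love coding in Python",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # mimicking CountVectorizer's default behaviour (so "I" is dropped).
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 1: tokenize every document.
tokenized = [tokenize(doc) for doc in docs]

# Step 2: build a sorted vocabulary of unique tokens.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

def to_bow_vector(tokens):
    # Step 3: count how often each vocabulary word occurs in one document.
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

bow_matrix = [to_bow_vector(tokens) for tokens in tokenized]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

The rows printed here should match the rows of the DataFrame above, one list per document.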


## TF-IDF
TF-IDF (Term Frequency–Inverse Document Frequency) is a vectorization technique that:
- Takes tokenized words from each document
  
- Builds a vocabulary (like BoW), but weights each word by its importance
  
- Calculates TF (how often a word appears in a document)
  
- Calculates IDF (how rare the word is across all documents)
  
- Converts each document into a vector in which words that are frequent in that document but rare across the corpus (and therefore informative) get higher scores

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

docs = [
    "I love NLP and machine learning",
    "NLP is fun and exciting",
    "I love coding in Python"
]

# Create TF-IDF vectorizer and fit it on the documents
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Vocabulary
print(vectorizer.get_feature_names_out())

# TF-IDF matrix (displayed as the cell's output)
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())


['and' 'coding' 'exciting' 'fun' 'in' 'is' 'learning' 'love' 'machine'
 'nlp' 'python']


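|   | and      | coding   | exciting | fun      | in       | is       | learning | love     | machine | nlp      | python   |
| - | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | ------- | -------- | -------- |
| 0 | 0.393511 | 0.0      | 0.0      | 0.0      | 0.0      | 0.0      | 0.51742  | 0.393511 | 0.51742 | 0.393511 | 0.0      |
| 1 | 0.373022 | 0.0      | 0.490479 | 0.490479 | 0.0      | 0.490479 | 0.0      | 0.0      | 0.0     | 0.373022 | 0.0      |
| 2 | 0.0      | 0.528635 | 0.0      | 0.0      | 0.528635 | 0.0      | 0.0      | 0.40204  | 0.0     | 0.0      | 0.528635 |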
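
To see where these numbers come from, the scores can be recomputed by hand. The sketch below assumes the defaults of scikit-learn's `TfidfVectorizer`: raw counts as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The helper names (`tokenize`, `tfidf_vector`) are illustrative only.

```python
import math
import re
from collections import Counter

docs = [
    "I love NLP and machine learning",
    "NLP is fun and exciting",
    "I love coding in Python",
]

def tokenize(text):
    # Same tokenization assumption as the vectorizer's default:
    # lowercase, tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in docs]
vocabulary = sorted({token for tokens in tokenized for token in tokens})
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = {term: sum(term in tokens for tokens in tokenized) for term in vocabulary}

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_vector(tokens):
    counts = Counter(tokens)
    # TF (raw count) times IDF for every vocabulary term.
    raw = [counts[term] * idf[term] for term in vocabulary]
    # L2-normalize so every document vector has unit length.
    norm = math.sqrt(sum(weight * weight for weight in raw))
    return [weight / norm for weight in raw]

tfidf_matrix = [tfidf_vector(tokens) for tokens in tokenized]

first = dict(zip(vocabulary, tfidf_matrix[0]))
print(round(first["and"], 6), round(first["machine"], 6))
```

In the first document, "and" appears in two of the three documents while "machine" appears in only one, which is why "machine" ends up with the higher weight even though both occur once.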


## n-grams
The n-gram approach is a vectorization technique that:
- Extracts contiguous sequences of n words (e.g., bigrams = 2 words, trigrams = 3 words) from each document

- Builds a vocabulary of all observed n-grams

- Converts each document into a vector by counting how many times each n-gram appears

- Captures limited word order and phrase patterns, unlike unigram BoW and TF-IDF, which treat each word independently

In [3]:
# Reuses `docs`, `CountVectorizer`, and `pd` from the BoW cell above
# Create n-gram vectorizer (bigrams only: ngram_range=(2, 2))
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X_bigrams = bigram_vectorizer.fit_transform(docs)

# Vocabulary
print("Bigram Vocabulary:", bigram_vectorizer.get_feature_names_out())

# Bigram Matrix
bigram_df = pd.DataFrame(X_bigrams.toarray(), columns=bigram_vectorizer.get_feature_names_out())
print(bigram_df)


Bigram Vocabulary: ['and exciting' 'and machine' 'coding in' 'fun and' 'in python' 'is fun'
 'love coding' 'love nlp' 'machine learning' 'nlp and' 'nlp is']
   and exciting  and machine  coding in  fun and  in python  is fun  \
0             0            1          0        0          0       0   
1             1            0          0        1          0       1   
2             0            0          1        0          1       0   

   love coding  love nlp  machine learning  nlp and  nlp is  
0            0         1                 1        1       0  
1            0         0                 0        0       1  
2            1         0                 0        0       0  
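
The extraction step itself is just a sliding window over the token list. A minimal sketch, assuming the same tokenization as above (lowercase, tokens of two or more characters); `extract_ngrams` is a hypothetical helper, not a scikit-learn function.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def extract_ngrams(tokens, n):
    # Slide a window of length n over the tokens and join each window
    # into a single space-separated string.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("I love NLP and machine learning")
bigrams = extract_ngrams(tokens, 2)
trigrams = extract_ngrams(tokens, 3)

print(bigrams)   # ['love nlp', 'nlp and', 'and machine', 'machine learning']
print(trigrams)  # ['love nlp and', 'nlp and machine', 'and machine learning']

# Counting the n-grams gives the non-zero entries of the document's row.
print(Counter(bigrams))
```

A document with fewer than n tokens yields an empty list, so very short texts contribute nothing to a pure bigram or trigram vocabulary.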


## Word2Vec
Word2Vec is a vectorization technique that:
- Learns dense vector representations of individual words based on surrounding word context

- Uses architectures like CBOW (predict word from context) or Skip-gram (predict context from word)

- Embeds similar words close together in vector space (e.g., “king” and “queen”)

- Converts text into word vectors, often averaged or pooled to represent entire sentences/documents
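
The difference between the two architectures is easiest to see in the training examples they are fed. The sketch below only generates those examples from a token list; it does not train any vectors, and real implementations add details (such as subsampling and negative sampling) that are left out here. The function names and the `window` parameter are illustrative.

```python
def skipgram_pairs(tokens, window=2):
    # Skip-gram: each (center, context) pair asks the model to predict
    # one surrounding word from the center word.
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    # CBOW: each (context_words, target) example asks the model to predict
    # the center word from all of its surrounding words at once.
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, target))
    return examples

tokens = ["love", "nlp", "and", "machine", "learning"]

for pair in skipgram_pairs(tokens, window=1):
    print(pair)

for example in cbow_examples(tokens, window=1):
    print(example)
```

With `window=1`, "nlp" produces the Skip-gram pairs `("nlp", "love")` and `("nlp", "and")`, while CBOW produces the single example `(["love", "and"], "nlp")`.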

## Doc2Vec
Doc2Vec is a vectorization technique that:
- Extends Word2Vec to learn dense vector representations for entire documents or paragraphs

- Associates each document with a unique ID (tag) whose vector is trained alongside the word vectors

- Captures document-level semantics rather than averaging word embeddings
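
For contrast, the averaging baseline mentioned in the last point can be sketched as follows. The 3-dimensional vectors are made-up toy values, not the output of a trained model, and `mean_pool` is a hypothetical helper. The example also shows one limitation of averaging: word order is lost.

```python
import math

# Hypothetical toy word vectors (a trained model would supply these).
word_vectors = {
    "dog": [1.0, 0.0, 0.25],
    "bites": [0.5, 0.5, 0.0],
    "man": [0.0, 0.25, 1.0],
}

def mean_pool(tokens, vectors):
    # Average the vectors of all known tokens, dimension by dimension.
    known = [vectors[token] for token in tokens if token in vectors]
    if not known:
        raise ValueError("no known tokens to pool")
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]

doc_a = mean_pool(["dog", "bites", "man"], word_vectors)
doc_b = mean_pool(["man", "bites", "dog"], word_vectors)

print(doc_a)

# Both documents contain the same words, so their averages coincide
# even though the sentences mean different things.
same = all(math.isclose(x, y) for x, y in zip(doc_a, doc_b))
print(same)  # True
```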

## BERT
BERT (Bidirectional Encoder Representations from Transformers) is a vectorization technique that:
- Uses a pretrained Transformer model to generate context-aware dense embeddings

- Represents words, sentences, or entire documents based on their surrounding context

- Considers the meaning of a word within its sentence, unlike Word2Vec or TF-IDF (e.g., "bank" as a financial institution vs. a river bank)

- Outputs high-dimensional vectors (e.g., 768 dimensions for BERT-base) that can be used for classification, similarity, clustering, etc.
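
The "bank" example can be illustrated with a toy version of self-attention, the mechanism Transformers use to mix context into each token's vector. This is not BERT: there are no learned query/key/value projections, no multiple heads or layers, and no subword tokens, and the 2-dimensional embeddings are invented for the example. It only shows how the same static vector turns into different output vectors depending on its neighbours.

```python
import math

# Hypothetical static embeddings: one fixed vector per word.
static_embeddings = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]

def contextualize(tokens, embeddings):
    # Simplified single-head self-attention where queries, keys, and
    # values are all the raw embeddings (no learned weights).
    vectors = [embeddings[token] for token in tokens]
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Each output is a weighted mix of every vector in the sentence.
        outputs.append(
            [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]
        )
    return outputs

bank_near_river = contextualize(["river", "bank"], static_embeddings)[1]
bank_near_money = contextualize(["money", "bank"], static_embeddings)[1]

print(bank_near_river)
print(bank_near_money)
```

The static vector for "bank" is identical in both inputs, but the contextualized vector leans toward the "river" direction in one case and the "money" direction in the other.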

## 🔍 Expanded Comparison of Text Vectorization Techniques

| Feature                   | **BoW**                           | **TF-IDF**                         | **n-grams**                     | **Word2Vec**                              | **Doc2Vec**                      | **BERT**                                            |
| ------------------------- | --------------------------------- | ---------------------------------- | ------------------------------- | ----------------------------------------- | -------------------------------- | --------------------------------------------------- |
| **Granularity**           | Document                          | Document                           | Document                        | Word                                      | Document                         | Word, Sentence, or Document                         |
| **Type**                  | Sparse                            | Sparse                             | Sparse                          | Dense                                     | Dense                            | Dense                                               |
| **Captures Word Order?**  | ❌ No                              | ❌ No                               | ✅ Partial (via sequence)        | ❌ No (context window only)                | ✅ Some via training context      | ✅ Yes (via Transformer attention)                   |
| **Captures Context?**     | ❌ No                              | ❌ No                               | ❌ No                            | ✅ Local context                           | ✅ Global context                 | ✅ Deep contextual understanding                     |
| **Vocabulary-Dependent?** | ✅ Yes                             | ✅ Yes                              | ✅ Yes                           | ✅ Yes                                     | ✅ Yes                            | ✅ Partly (fixed subword vocabulary; unseen words are split into known subwords) |
| **Handles Polysemy?**     | ❌ No                              | ❌ No                               | ❌ No                            | ❌ No                                      | ❌ No                             | ✅ Yes (context-specific embeddings)                 |
| **Trainable?**            | ❌ No (purely statistical)         | ❌ No                               | ❌ No                            | ✅ Yes (unsupervised pretraining)          | ✅ Yes (unsupervised pretraining) | ✅ Yes (pretrained + fine-tuning optional)           |
| **Interpretability**      | ✅ High (clear term mapping)       | ✅ Moderate (weighted terms)        | ✅ Moderate                      | ❌ Low (black-box embeddings)              | ❌ Low                            | ❌ Low                                               |
| **Dimensionality**        | High (vocab size)                 | High (vocab size)                  | High (vocab size of n-grams)    | Lower (e.g., 100–300)                     | Lower (e.g., 100–300)            | Medium–High (e.g., 768, 1024)                       |
| **Memory Efficient?**     | ❌ No                              | ❌ No                               | ❌ No                            | ✅ Yes                                     | ✅ Yes                            | ❌ No (moderate–high memory use)                     |
| **Typical Use Cases**     | Baseline NLP, text classification | IR, keyword extraction, clustering | Phrase detection, n-gram models | Semantic similarity, clustering, NN input | Document similarity, retrieval   | QA, summarization, classification, RAG, etc.        |


### 🔑 Quick Summary
BoW/TF-IDF/n-grams:

- Simple and interpretable, but ignore word meaning and context

- Best for classical ML (SVM, NB, LR) on small to medium corpora

Word2Vec/Doc2Vec:

- Introduce semantics via learned vectors

- Suitable for downstream ML tasks such as clustering, similarity search, and recommendation

BERT:

- Most expressive and flexible of the three groups

- Context-aware, pretrained, and adaptable to most NLP tasks

- Highest compute and memory cost, but typically the strongest results when context matters