# Bag of Words (BoW) - Text to Vector Representation

## 📚 Table of Contents
1. Introduction to Bag of Words
2. How Bag of Words Works
3. Prerequisites and Setup
4. Basic Implementation
5. Advanced Implementations
6. Applications and Use Cases
7. Advantages and Limitations
8. Comparison with Other Techniques

---

## 1. Introduction to Bag of Words

**Bag of Words (BoW)** is one of the simplest and most widely used techniques for converting text into numerical vectors that machine learning algorithms can understand.

### Key Concepts:
- **Text Vectorization**: Converting text data into numerical format
- **Feature Extraction**: Extracting meaningful features from text
- **Vocabulary**: Collection of unique words in the corpus
- **Document-Term Matrix**: Matrix representation of documents and word frequencies

### Why Bag of Words?
- Simple to understand and implement
- Computationally efficient
- Works well for many NLP tasks
- Foundation for understanding advanced techniques

---

## 2. How Bag of Words Works

### Step-by-Step Process:

#### Step 1: **Tokenization**
Break down text into individual words (tokens)
```
Text: "The cat sat on the mat"
Tokens: ["The", "cat", "sat", "on", "the", "mat"]
```

#### Step 2: **Build Vocabulary**
Create a list of unique words from all documents
```
Vocabulary: ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # lowercased, so "The" and "the" count as one word
```

#### Step 3: **Create Vector Representation**
Count the frequency of each word in each document
```
Document: "The cat sat on the mat"
Vector: [2, 1, 1, 1, 1, 0, 0]  # [the, cat, sat, on, mat, dog, ran]
```

### Important Note:
- **"Bag"** means we **ignore the order** of words
- We only care about **word frequency**
- Different documents can be compared using these vectors
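
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used later in the notebook: it assumes naive whitespace tokenization, and the helper name `bow_vector` and the two-document corpus are made up for the example.

```python
from collections import Counter

def bow_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())  # naive whitespace tokenization
    return [counts[word] for word in vocabulary]

corpus = ["The cat sat on the mat", "The dog ran"]

# Build the vocabulary in first-seen order from the lowercased corpus
vocabulary = []
for doc in corpus:
    for token in doc.lower().split():
        if token not in vocabulary:
            vocabulary.append(token)

print(vocabulary)                         # ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'ran']
print(bow_vector(corpus[0], vocabulary))  # [2, 1, 1, 1, 1, 0, 0]

# Word order is ignored: both sentences map to the same vector
order_vocab = ["the", "dog", "chased", "cat"]
a = bow_vector("the dog chased the cat", order_vocab)
b = bow_vector("the cat chased the dog", order_vocab)
print(a == b)  # True
```

The last comparison shows the cost of the "bag" assumption: two sentences with opposite meanings become indistinguishable.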

---

## 3. Prerequisites and Setup

Let's import all necessary libraries for our experiments.

In [2]:
# Import necessary libraries
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import numpy as np
import pandas as pd
from collections import Counter
import re

# Download required NLTK data
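nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # newer NLTK releases load word_tokenize data from 'punkt_tab'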
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

print("✅ All libraries imported successfully!")
print("✅ NLTK data downloaded successfully!")

✅ All libraries imported successfully!
✅ NLTK data downloaded successfully!


---

## 4. Basic Implementation - Experiment 1

### 🔬 Experiment 1: Simple Bag of Words (Manual Implementation)

**Objective**: Understand the basic concept by manually creating a Bag of Words representation.

**Dataset**: Simple sentences about animals

In [3]:
# Sample documents (corpus)
documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs are great pets",
    "I love my cat and dog"
]

print("📄 Original Documents:")
print("-" * 50)
for i, doc in enumerate(documents, 1):
    print(f"Document {i}: {doc}")

📄 Original Documents:
--------------------------------------------------
Document 1: The cat sat on the mat
Document 2: The dog sat on the log
Document 3: Cats and dogs are great pets
Document 4: I love my cat and dog


In [4]:
# Step 1: Tokenization and Preprocessing
def preprocess_text(text):
    """Convert text to lowercase and tokenize"""
    text = text.lower()  # Convert to lowercase
    tokens = word_tokenize(text)  # Tokenize using NLTK
    return tokens

# Tokenize all documents
tokenized_docs = [preprocess_text(doc) for doc in documents]

print("\n🔤 Tokenized Documents:")
print("-" * 50)
for i, tokens in enumerate(tokenized_docs, 1):
    print(f"Document {i}: {tokens}")


🔤 Tokenized Documents:
--------------------------------------------------
Document 1: ['the', 'cat', 'sat', 'on', 'the', 'mat']
Document 2: ['the', 'dog', 'sat', 'on', 'the', 'log']
Document 3: ['cats', 'and', 'dogs', 'are', 'great', 'pets']
Document 4: ['i', 'love', 'my', 'cat', 'and', 'dog']


In [5]:
# Step 2: Build Vocabulary
def build_vocabulary(tokenized_documents):
    """Create a vocabulary of unique words from all documents"""
    vocabulary = set()
    for tokens in tokenized_documents:
        vocabulary.update(tokens)
    return sorted(list(vocabulary))  # Sort for consistency

vocabulary = build_vocabulary(tokenized_docs)

print("\n📖 Vocabulary (Unique Words):")
print("-" * 50)
print(f"Total unique words: {len(vocabulary)}")
print(f"Vocabulary: {vocabulary}")


📖 Vocabulary (Unique Words):
--------------------------------------------------
Total unique words: 16
Vocabulary: ['and', 'are', 'cat', 'cats', 'dog', 'dogs', 'great', 'i', 'log', 'love', 'mat', 'my', 'on', 'pets', 'sat', 'the']


In [6]:
# Step 3: Create Bag of Words Vectors
def create_bow_vector(tokens, vocabulary):
    """Create a BoW vector for a document"""
    vector = []
    for word in vocabulary:
        vector.append(tokens.count(word))  # Count frequency of each word
    return vector

# Create BoW vectors for all documents
bow_vectors = [create_bow_vector(tokens, vocabulary) for tokens in tokenized_docs]

print("\n🎯 Bag of Words Vectors:")
print("-" * 50)
for i, vector in enumerate(bow_vectors, 1):
    print(f"Document {i}: {vector}")


🎯 Bag of Words Vectors:
--------------------------------------------------
Document 1: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 2]
Document 2: [0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 2]
Document 3: [1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
Document 4: [1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0]


In [7]:
# Step 4: Create a Document-Term Matrix (DataFrame for better visualization)
bow_df = pd.DataFrame(bow_vectors, columns=vocabulary)
bow_df.index = [f"Doc {i}" for i in range(1, len(documents) + 1)]

print("\n📊 Document-Term Matrix (Bag of Words Representation):")
print("-" * 80)
print(bow_df)

# Show dimensions
print(f"\n📏 Matrix Shape: {bow_df.shape}")
print(f"   - Rows (Documents): {bow_df.shape[0]}")
print(f"   - Columns (Vocabulary): {bow_df.shape[1]}")


📊 Document-Term Matrix (Bag of Words Representation):
--------------------------------------------------------------------------------
       and  are  cat  cats  dog  dogs  great  i  log  love  mat  my  on  pets  \
Doc 1    0    0    1     0    0     0      0  0    0     0    1   0   1     0   
Doc 2    0    0    0     0    1     0      0  0    1     0    0   0   1     0   
Doc 3    1    1    0     1    0     1      1  0    0     0    0   0   0     1   
Doc 4    1    0    1     0    1     0      0  1    0     1    0   1   0     0   

       sat  the  
Doc 1    1    2  
Doc 2    1    2  
Doc 3    0    0  
Doc 4    0    0  

📏 Matrix Shape: (4, 16)
   - Rows (Documents): 4
   - Columns (Vocabulary): 16


### 📝 Observations - Experiment 1:

1. **Tokenization**: Text was broken down into individual words (tokens)
2. **Vocabulary Size**: The four documents produced a vocabulary of 16 unique words
3. **Vector Representation**: Each document is now represented as a numerical vector
4. **Word Order Lost**: The order of words is not preserved (that's why it's called "Bag" of words)
5. **Sparse Matrix**: Many zeros in the matrix (words that don't appear in certain documents)
6. **Word Frequency**: The numbers represent how many times each word appears in each document

**Key Insight**: Each document is now a vector of numbers, making it suitable for machine learning algorithms!
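
One thing these vectors make possible is comparing documents numerically. The sketch below computes cosine similarity between rows copied from the document-term matrix above; the helper `cosine_similarity` is written here for illustration (scikit-learn ships its own version, which is not used in this example).

```python
import numpy as np

# Rows copied from the Experiment 1 document-term matrix
doc1 = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 2])  # "The cat sat on the mat"
doc2 = np.array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 2])  # "The dog sat on the log"
doc3 = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0])  # "Cats and dogs are great pets"

def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (1.0 = same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_12 = cosine_similarity(doc1, doc2)
sim_13 = cosine_similarity(doc1, doc3)

print(round(sim_12, 2))  # 0.75 (shared words: "on", "sat", "the")
print(sim_13)            # 0.0  (no words in common)
```

Note that most of the similarity between Doc 1 and Doc 2 comes from "the" and "on", which motivates the stop word removal in the next experiment.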

---

## 🔬 Experiment 2: Bag of Words with Stop Words Removal

**Objective**: Improve the BoW representation by removing common words (stop words) that don't add much meaning.

**Why Remove Stop Words?**
- Words like "the", "is", "on", "and" appear frequently but carry little meaning
- Removing them reduces vocabulary size and focuses on meaningful words
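- Usually improves efficiency and often helps accuracy, although it can hurt when stop words carry meaning (for example, "not" in sentiment analysis)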

In [8]:
# Get English stop words from NLTK
stop_words = set(stopwords.words('english'))

print("🛑 Stop Words Sample:")
print("-" * 50)
print(f"Total stop words: {len(stop_words)}")
print(f"Sample stop words: {list(stop_words)[:20]}")

🛑 Stop Words Sample:
--------------------------------------------------
Total stop words: 198
Sample stop words: ['didn', 'it', 'once', "they're", 'am', 'as', 'i', 'shouldn', 'm', 'don', "she'd", 'theirs', 'not', 'yours', "you've", "it's", 'why', "she'll", "doesn't", 'having']


In [9]:
# Function to preprocess with stop words removal
def preprocess_with_stopwords_removal(text):
    """Tokenize, lowercase, and remove stop words"""
    text = text.lower()
    tokens = word_tokenize(text)
    # Remove stop words and punctuation
    filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
    return filtered_tokens

# Apply preprocessing with stop words removal
tokenized_docs_filtered = [preprocess_with_stopwords_removal(doc) for doc in documents]

print("\n🔤 Tokenized Documents (After Stop Words Removal):")
print("-" * 50)
for i, tokens in enumerate(tokenized_docs_filtered, 1):
    print(f"Document {i}: {tokens}")
    
print("\n📊 Comparison:")
print("-" * 50)
print(f"Original tokens in Doc 1: {len(tokenized_docs[0])}")
print(f"Filtered tokens in Doc 1: {len(tokenized_docs_filtered[0])}")


🔤 Tokenized Documents (After Stop Words Removal):
--------------------------------------------------
Document 1: ['cat', 'sat', 'mat']
Document 2: ['dog', 'sat', 'log']
Document 3: ['cats', 'dogs', 'great', 'pets']
Document 4: ['love', 'cat', 'dog']

📊 Comparison:
--------------------------------------------------
Original tokens in Doc 1: 6
Filtered tokens in Doc 1: 3


In [10]:
# Build vocabulary and create BoW vectors
vocabulary_filtered = build_vocabulary(tokenized_docs_filtered)
bow_vectors_filtered = [create_bow_vector(tokens, vocabulary_filtered) for tokens in tokenized_docs_filtered]

# Create DataFrame
bow_df_filtered = pd.DataFrame(bow_vectors_filtered, columns=vocabulary_filtered)
bow_df_filtered.index = [f"Doc {i}" for i in range(1, len(documents) + 1)]

print("\n📊 Document-Term Matrix (After Stop Words Removal):")
print("-" * 80)
print(bow_df_filtered)

print(f"\n📏 Matrix Shape Comparison:")
print(f"   - Original: {bow_df.shape} (Docs × Vocabulary)")
print(f"   - Filtered: {bow_df_filtered.shape} (Docs × Vocabulary)")
print(f"   - Vocabulary reduced by: {bow_df.shape[1] - bow_df_filtered.shape[1]} words")


📊 Document-Term Matrix (After Stop Words Removal):
--------------------------------------------------------------------------------
       cat  cats  dog  dogs  great  log  love  mat  pets  sat
Doc 1    1     0    0     0      0    0     0    1     0    1
Doc 2    0     0    1     0      0    1     0    0     0    1
Doc 3    0     1    0     1      1    0     0    0     1    0
Doc 4    1     0    1     0      0    0     1    0     0    0

📏 Matrix Shape Comparison:
   - Original: (4, 16) (Docs × Vocabulary)
   - Filtered: (4, 10) (Docs × Vocabulary)
   - Vocabulary reduced by: 6 words


### 📝 Observations - Experiment 2:

1. **Vocabulary Reduction**: Vocabulary size dropped from 16 to 10 words after removing stop words
2. **More Meaningful Features**: Remaining words carry more semantic meaning (cat, dog, sat, log, pets, love)
3. **Cleaner Representation**: Less noise from common words
4. **Focused Analysis**: Model can focus on words that actually differentiate documents
5. **Computational Efficiency**: Smaller vocabulary means faster processing and less memory

**Key Insight**: Stop words removal is a common and often useful preprocessing step in text analysis, but it should be checked against the task at hand!
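
A short caveat is worth seeing in code: NLTK's English list contains "not" (it appears in the sample printed above), so negation can disappear during filtering. The sketch below uses a tiny hand-written stop list, which is an assumption made to keep the example self-contained; it is not NLTK's full list.

```python
# Tiny illustrative stop list (hypothetical; NLTK's real list is much longer)
STOP_WORDS = {"the", "was", "is", "not", "a"}

def content_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

positive = content_words("The movie was good")
negative = content_words("The movie was not good")

print(positive)              # ['movie', 'good']
print(negative)              # ['movie', 'good']
print(positive == negative)  # True: the negation is gone
```

For tasks such as sentiment analysis it can therefore be reasonable to keep negation words, for instance by removing them from the stop list before filtering.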

---

## 🔬 Experiment 3: Bag of Words with Stemming

**Objective**: Apply stemming to reduce words to their root form and further reduce vocabulary.

**What is Stemming?**
- Reduces words to their base/root form (e.g., "running" → "run", "cats" → "cat")
- Helps group related words together
- Further reduces vocabulary size

In [11]:
# Initialize stemmer
stemmer = PorterStemmer()

# Function to preprocess with stemming
def preprocess_with_stemming(text):
    """Tokenize, lowercase, remove stop words, and apply stemming"""
    text = text.lower()
    tokens = word_tokenize(text)
    # Remove stop words and punctuation
    filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
    # Apply stemming
    stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
    return stemmed_tokens

# Apply preprocessing with stemming
tokenized_docs_stemmed = [preprocess_with_stemming(doc) for doc in documents]

print("🔤 Tokenized Documents (After Stemming):")
print("-" * 50)
for i, tokens in enumerate(tokenized_docs_stemmed, 1):
    print(f"Document {i}: {tokens}")

# Show stemming examples
print("\n🌱 Stemming Examples:")
print("-" * 50)
example_words = ["cats", "dogs", "sitting", "loved", "running", "playing"]
for word in example_words:
    print(f"{word} → {stemmer.stem(word)}")

🔤 Tokenized Documents (After Stemming):
--------------------------------------------------
Document 1: ['cat', 'sat', 'mat']
Document 2: ['dog', 'sat', 'log']
Document 3: ['cat', 'dog', 'great', 'pet']
Document 4: ['love', 'cat', 'dog']

🌱 Stemming Examples:
--------------------------------------------------
cats → cat
dogs → dog
sitting → sit
loved → love
running → run
playing → play


In [12]:
# Build vocabulary and create BoW vectors
vocabulary_stemmed = build_vocabulary(tokenized_docs_stemmed)
bow_vectors_stemmed = [create_bow_vector(tokens, vocabulary_stemmed) for tokens in tokenized_docs_stemmed]

# Create DataFrame
bow_df_stemmed = pd.DataFrame(bow_vectors_stemmed, columns=vocabulary_stemmed)
bow_df_stemmed.index = [f"Doc {i}" for i in range(1, len(documents) + 1)]

print("\n📊 Document-Term Matrix (After Stemming):")
print("-" * 80)
print(bow_df_stemmed)

print(f"\n📏 Matrix Shape Evolution:")
print(f"   - Original:            {bow_df.shape} (Docs × Vocabulary)")
print(f"   - Stop Words Removed:  {bow_df_filtered.shape} (Docs × Vocabulary)")
print(f"   - With Stemming:       {bow_df_stemmed.shape} (Docs × Vocabulary)")


📊 Document-Term Matrix (After Stemming):
--------------------------------------------------------------------------------
       cat  dog  great  log  love  mat  pet  sat
Doc 1    1    0      0    0     0    1    0    1
Doc 2    0    1      0    1     0    0    0    1
Doc 3    1    1      1    0     0    0    1    0
Doc 4    1    1      0    0     1    0    0    0

📏 Matrix Shape Evolution:
   - Original:            (4, 16) (Docs × Vocabulary)
   - Stop Words Removed:  (4, 10) (Docs × Vocabulary)
   - With Stemming:       (4, 8) (Docs × Vocabulary)


### 📝 Observations - Experiment 3:

1. **Word Normalization**: "cats" and "cat" are now treated as the same word ("cat")
2. **Vocabulary Reduction**: Similar words are grouped under their root form
3. **Loss of Meaning**: Stemming can produce non-words (e.g., "leaves" → "leav") and can merge words with different meanings
4. **Better Generalization**: Model treats related words similarly
5. **Trade-off**: Gain efficiency but may lose some semantic information

**Key Insight**: Stemming is useful for grouping related words, but it's a crude approach that may lose some meaning!
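
To make the "crude approach" point concrete, the sketch below shows a classic case of over-stemming with the same `PorterStemmer` class used above: three words with different meanings collapse to a single stem. The word list is chosen for illustration only.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["universe", "university", "universal"]
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(f"{word} → {stem}")

# All three words share one stem, so a BoW model can no longer tell them apart
print(len(set(stems)))  # 1
```

In a Bag of Words matrix these three words would share a single column, which helps generalization in some tasks and hides useful distinctions in others.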

---

## 🔬 Experiment 4: Bag of Words with Lemmatization

**Objective**: Use lemmatization as a better alternative to stemming.

**What is Lemmatization?**
- Reduces words to their dictionary form (lemma)
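- More intelligent than stemming - looks words up in WordNet and can use the part of speech (POS)
- Produces actual words (not stems)
- Example: "running" → "run" (as a verb), "better" → "good" (as an adjective); NLTK's `lemmatize()` treats every word as a noun unless a `pos` argument is passed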

In [13]:
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to preprocess with lemmatization
def preprocess_with_lemmatization(text):
    """Tokenize, lowercase, remove stop words, and apply lemmatization"""
    text = text.lower()
    tokens = word_tokenize(text)
    # Remove stop words and punctuation
    filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
    # Apply lemmatization
    lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]
    return lemmatized_tokens

# Apply preprocessing with lemmatization
tokenized_docs_lemmatized = [preprocess_with_lemmatization(doc) for doc in documents]

print("🔤 Tokenized Documents (After Lemmatization):")
print("-" * 50)
for i, tokens in enumerate(tokenized_docs_lemmatized, 1):
    print(f"Document {i}: {tokens}")

# Compare stemming vs lemmatization
print("\n🌱 Stemming vs Lemmatization Comparison:")
print("-" * 50)
comparison_words = ["cats", "dogs", "running", "better", "caring", "leaves"]
for word in comparison_words:
    print(f"{word:12} → Stem: {stemmer.stem(word):12} | Lemma: {lemmatizer.lemmatize(word)}")

🔤 Tokenized Documents (After Lemmatization):
--------------------------------------------------
Document 1: ['cat', 'sat', 'mat']
Document 2: ['dog', 'sat', 'log']
Document 3: ['cat', 'dog', 'great', 'pet']
Document 4: ['love', 'cat', 'dog']

🌱 Stemming vs Lemmatization Comparison:
--------------------------------------------------
cats         → Stem: cat          | Lemma: cat
dogs         → Stem: dog          | Lemma: dog
running      → Stem: run          | Lemma: running
better       → Stem: better       | Lemma: better
caring       → Stem: care         | Lemma: caring
leaves       → Stem: leav         | Lemma: leaf


In [14]:
# Build vocabulary and create BoW vectors
vocabulary_lemmatized = build_vocabulary(tokenized_docs_lemmatized)
bow_vectors_lemmatized = [create_bow_vector(tokens, vocabulary_lemmatized) for tokens in tokenized_docs_lemmatized]

# Create DataFrame
bow_df_lemmatized = pd.DataFrame(bow_vectors_lemmatized, columns=vocabulary_lemmatized)
bow_df_lemmatized.index = [f"Doc {i}" for i in range(1, len(documents) + 1)]

print("\n📊 Document-Term Matrix (After Lemmatization):")
print("-" * 80)
print(bow_df_lemmatized)

print(f"\n📏 Complete Matrix Shape Evolution:")
print(f"   - Original:            {bow_df.shape} (Docs × Vocabulary)")
print(f"   - Stop Words Removed:  {bow_df_filtered.shape} (Docs × Vocabulary)")
print(f"   - With Stemming:       {bow_df_stemmed.shape} (Docs × Vocabulary)")
print(f"   - With Lemmatization:  {bow_df_lemmatized.shape} (Docs × Vocabulary)")


📊 Document-Term Matrix (After Lemmatization):
--------------------------------------------------------------------------------
       cat  dog  great  log  love  mat  pet  sat
Doc 1    1    0      0    0     0    1    0    1
Doc 2    0    1      0    1     0    0    0    1
Doc 3    1    1      1    0     0    0    1    0
Doc 4    1    1      0    0     1    0    0    0

📏 Complete Matrix Shape Evolution:
   - Original:            (4, 16) (Docs × Vocabulary)
   - Stop Words Removed:  (4, 10) (Docs × Vocabulary)
   - With Stemming:       (4, 8) (Docs × Vocabulary)
   - With Lemmatization:  (4, 8) (Docs × Vocabulary)


### 📝 Observations - Experiment 4:

1. **Better Quality**: Lemmatization produces actual dictionary words
2. **Context Aware**: Can use the part of speech, but only if it is passed in; the calls above use the default (noun), which is why "running" and "better" came back unchanged
3. **More Accurate**: Preserves more semantic meaning compared to stemming
4. **Computational Cost**: Slower than stemming but produces better results
5. **Preferred Method**: Generally preferred over stemming in modern NLP applications

**Key Insight**: Lemmatization is the preferred normalization technique when quality matters more than speed!

---

## 🔬 Experiment 5: Using Sklearn's CountVectorizer

**Objective**: Use scikit-learn's CountVectorizer for more efficient and feature-rich BoW implementation.

**Why CountVectorizer?**
- Industry-standard implementation
- Built-in preprocessing options
- Efficient and optimized
- Many configurable parameters
- Easy to use and integrate with ML pipelines

In [15]:
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Method 1: Basic CountVectorizer (no preprocessing)
print("=" * 80)
print("METHOD 1: Basic CountVectorizer")
print("=" * 80)

cv_basic = CountVectorizer()
bow_sklearn_basic = cv_basic.fit_transform(documents)

# Convert to DataFrame for visualization
bow_sklearn_basic_df = pd.DataFrame(
    bow_sklearn_basic.toarray(),
    columns=cv_basic.get_feature_names_out(),
    index=[f"Doc {i}" for i in range(1, len(documents) + 1)]
)

print("\n📊 Document-Term Matrix (Basic CountVectorizer):")
print("-" * 80)
print(bow_sklearn_basic_df)

print(f"\n📏 Vocabulary size: {len(cv_basic.get_feature_names_out())}")
print(f"📖 Vocabulary: {list(cv_basic.get_feature_names_out())}")

METHOD 1: Basic CountVectorizer

📊 Document-Term Matrix (Basic CountVectorizer):
--------------------------------------------------------------------------------
       and  are  cat  cats  dog  dogs  great  log  love  mat  my  on  pets  \
Doc 1    0    0    1     0    0     0      0    0     0    1   0   1     0   
Doc 2    0    0    0     0    1     0      0    1     0    0   0   1     0   
Doc 3    1    1    0     1    0     1      1    0     0    0   0   0     1   
Doc 4    1    0    1     0    1     0      0    0     1    0   1   0     0   

       sat  the  
Doc 1    1    2  
Doc 2    1    2  
Doc 3    0    0  
Doc 4    0    0  

📏 Vocabulary size: 15
📖 Vocabulary: ['and', 'are', 'cat', 'cats', 'dog', 'dogs', 'great', 'log', 'love', 'mat', 'my', 'on', 'pets', 'sat', 'the']


In [16]:
# Method 2: CountVectorizer with Stop Words Removal
print("\n" + "=" * 80)
print("METHOD 2: CountVectorizer with Stop Words Removal")
print("=" * 80)

cv_stopwords = CountVectorizer(stop_words='english')
bow_sklearn_stopwords = cv_stopwords.fit_transform(documents)

# Convert to DataFrame
bow_sklearn_stopwords_df = pd.DataFrame(
    bow_sklearn_stopwords.toarray(),
    columns=cv_stopwords.get_feature_names_out(),
    index=[f"Doc {i}" for i in range(1, len(documents) + 1)]
)

print("\n📊 Document-Term Matrix (With Stop Words Removal):")
print("-" * 80)
print(bow_sklearn_stopwords_df)

print(f"\n📏 Vocabulary size: {len(cv_stopwords.get_feature_names_out())}")
print(f"📖 Vocabulary: {list(cv_stopwords.get_feature_names_out())}")


METHOD 2: CountVectorizer with Stop Words Removal

📊 Document-Term Matrix (With Stop Words Removal):
--------------------------------------------------------------------------------
       cat  cats  dog  dogs  great  log  love  mat  pets  sat
Doc 1    1     0    0     0      0    0     0    1     0    1
Doc 2    0     0    1     0      0    1     0    0     0    1
Doc 3    0     1    0     1      1    0     0    0     1    0
Doc 4    1     0    1     0      0    0     1    0     0    0

📏 Vocabulary size: 10
📖 Vocabulary: ['cat', 'cats', 'dog', 'dogs', 'great', 'log', 'love', 'mat', 'pets', 'sat']


In [17]:
# Method 3: CountVectorizer with Advanced Parameters
print("\n" + "=" * 80)
print("METHOD 3: CountVectorizer with Advanced Parameters")
print("=" * 80)

cv_advanced = CountVectorizer(
    stop_words='english',
    lowercase=True,
    max_features=10,  # Keep only top 10 most frequent words
    ngram_range=(1, 2),  # Include both unigrams and bigrams
    min_df=1,  # Minimum document frequency
    max_df=1.0  # Maximum document frequency
)

bow_sklearn_advanced = cv_advanced.fit_transform(documents)

# Convert to DataFrame
bow_sklearn_advanced_df = pd.DataFrame(
    bow_sklearn_advanced.toarray(),
    columns=cv_advanced.get_feature_names_out(),
    index=[f"Doc {i}" for i in range(1, len(documents) + 1)]
)

print("\n📊 Document-Term Matrix (Advanced Parameters with Bigrams):")
print("-" * 80)
print(bow_sklearn_advanced_df)

print(f"\n📏 Vocabulary size: {len(cv_advanced.get_feature_names_out())}")
print(f"📖 Vocabulary: {list(cv_advanced.get_feature_names_out())}")
print("\n💡 Note: Includes bigrams (2-word combinations) like 'cat dog', 'great pets', etc.")


METHOD 3: CountVectorizer with Advanced Parameters

📊 Document-Term Matrix (Advanced Parameters with Bigrams):
--------------------------------------------------------------------------------
       cat  dog  great  great pets  log  love  love cat  mat  pets  sat
Doc 1    1    0      0           0    0     0         0    1     0    1
Doc 2    0    1      0           0    1     0         0    0     0    1
Doc 3    0    0      1           1    0     0         0    0     1    0
Doc 4    1    1      0           0    0     1         1    0     0    0

📏 Vocabulary size: 10
📖 Vocabulary: ['cat', 'dog', 'great', 'great pets', 'log', 'love', 'love cat', 'mat', 'pets', 'sat']

💡 Note: Includes bigrams (2-word combinations) like 'cat dog', 'great pets', etc.


### 📝 Observations - Experiment 5:

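1. **Widely Used**: CountVectorizer is a standard, well-tested way to create BoW representations
2. **Automatic Preprocessing**: Built-in lowercase conversion, tokenization, and optional stop words removal. The default tokenizer keeps only tokens of two or more alphanumeric characters, which is why "i" is missing and Method 1 has 15 words instead of the 16 found manually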
3. **N-grams Support**: Can capture multi-word phrases (bigrams, trigrams, etc.)
4. **Sparse Matrix**: Returns efficient sparse matrices for large datasets
5. **Vocabulary Control**: Parameters like `max_features`, `min_df`, `max_df` help control vocabulary
6. **Easy Integration**: Works seamlessly with scikit-learn ML pipelines
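
Two of these points can be checked directly: the result is a SciPy sparse matrix, and words that were not seen during `fit` are silently ignored by `transform`. The sketch below repeats the four sample documents so that it runs on its own; the unseen sentence is invented for the example.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs are great pets",
    "I love my cat and dog",
]

cv = CountVectorizer()
X = cv.fit_transform(documents)

# Only non-zero counts are stored
print(issparse(X), X.shape, X.nnz)  # True (4, 15) 21

# "bird" and "fence" are not in the fitted vocabulary, so they are dropped
unseen = cv.transform(["The bird sat on the fence"])
unseen_total = int(unseen.toarray().sum())
print(unseen_total)  # 4 -> "the" (2), "sat" (1), "on" (1)
```

Dropping out-of-vocabulary words is convenient, but it also means a new document made up entirely of unseen words becomes an all-zero vector.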

**Key Parameters Explained**:
- `stop_words='english'`: Remove English stop words
- `lowercase=True`: Convert all text to lowercase
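- `max_features=10`: Keep only the 10 most frequent terms. With `ngram_range=(1, 2)`, unigrams and bigrams compete for these slots, which is why a bigram such as 'cat dog' does not appear in the vocabulary printed above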
- `ngram_range=(1,2)`: Include single words and 2-word combinations
- `min_df=1`: Word must appear in at least 1 document
- `max_df=1.0`: Word can appear in up to 100% of documents

**Key Insight**: Prefer CountVectorizer over a hand-written implementation for production code - it's optimized, feature-rich, and widely used!

---

## 🔬 Experiment 6: Real-World Application - Sentiment Analysis Dataset

**Objective**: Apply BoW on a realistic text classification scenario.

**Dataset**: Movie reviews (positive and negative sentiments)

In [18]:
# Create a realistic movie review dataset
movie_reviews = [
    "This movie was absolutely amazing! Great acting and wonderful story.",
    "Terrible film. Waste of time and money. Very disappointing.",
    "Excellent cinematography and brilliant performances by the cast.",
    "Boring plot with poor character development. Not recommended.",
    "A masterpiece! One of the best movies I have ever seen.",
    "Awful experience. The worst movie of the year.",
    "Fantastic storyline with unexpected twists. Highly entertaining!",
    "Dull and uninteresting. Could not finish watching it."
]

# Labels: 1 = Positive, 0 = Negative
sentiments = [1, 0, 1, 0, 1, 0, 1, 0]

print("🎬 Movie Reviews Dataset:")
print("=" * 80)
for i, (review, sentiment) in enumerate(zip(movie_reviews, sentiments), 1):
    label = "POSITIVE ✅" if sentiment == 1 else "NEGATIVE ❌"
    print(f"{i}. [{label}] {review}")

🎬 Movie Reviews Dataset:
1. [POSITIVE ✅] This movie was absolutely amazing! Great acting and wonderful story.
2. [NEGATIVE ❌] Terrible film. Waste of time and money. Very disappointing.
3. [POSITIVE ✅] Excellent cinematography and brilliant performances by the cast.
4. [NEGATIVE ❌] Boring plot with poor character development. Not recommended.
5. [POSITIVE ✅] A masterpiece! One of the best movies I have ever seen.
6. [NEGATIVE ❌] Awful experience. The worst movie of the year.
7. [POSITIVE ✅] Fantastic storyline with unexpected twists. Highly entertaining!
8. [NEGATIVE ❌] Dull and uninteresting. Could not finish watching it.


In [19]:
# Create BoW representation
vectorizer = CountVectorizer(
    stop_words='english',
    lowercase=True,
    max_features=20  # Keep top 20 features
)

X = vectorizer.fit_transform(movie_reviews)
y = np.array(sentiments)

# Convert to DataFrame for visualization
bow_reviews_df = pd.DataFrame(
    X.toarray(),
    columns=vectorizer.get_feature_names_out(),
    index=[f"Review {i}" for i in range(1, len(movie_reviews) + 1)]
)

# Add sentiment column
bow_reviews_df['Sentiment'] = ['Positive' if s == 1 else 'Negative' for s in sentiments]

print("\n📊 Bag of Words Representation of Movie Reviews:")
print("=" * 100)
print(bow_reviews_df)

print(f"\n📏 Matrix Shape: {X.shape}")
print(f"📖 Vocabulary Size: {len(vectorizer.get_feature_names_out())}")
print(f"📖 Features: {list(vectorizer.get_feature_names_out())}")


📊 Bag of Words Representation of Movie Reviews:
          absolutely  masterpiece  money  movie  movies  performances  plot  \
Review 1           1            0      0      1       0             0     0   
Review 2           0            0      1      0       0             0     0   
Review 3           0            0      0      0       0             1     0   
Review 4           0            0      0      0       0             0     1   
Review 5           0            1      0      0       1             0     0   
Review 6           0            0      0      1       0             0     0   
Review 7           0            0      0      0       0             0     0   
Review 8           0            0      0      0       0             0     0   

          poor  recommended  seen  ...  storyline  terrible  time  twists  \
Review 1     0            0     0  ...          0         0     0       0   
Review 2     0            0     0  ...          0         1     1       0   
Review 3

In [20]:
# Analyze which words are most indicative of positive vs negative sentiment
positive_reviews = X[np.array(sentiments) == 1].toarray().sum(axis=0)
negative_reviews = X[np.array(sentiments) == 0].toarray().sum(axis=0)

word_sentiment_df = pd.DataFrame({
    'Word': vectorizer.get_feature_names_out(),
    'Positive_Count': positive_reviews,
    'Negative_Count': negative_reviews,
    'Total_Count': positive_reviews + negative_reviews
})

word_sentiment_df['Sentiment_Indicator'] = word_sentiment_df.apply(
    lambda row: 'POSITIVE' if row['Positive_Count'] > row['Negative_Count'] 
    else ('NEGATIVE' if row['Negative_Count'] > row['Positive_Count'] else 'NEUTRAL'),
    axis=1
)

# Sort by total count
word_sentiment_df = word_sentiment_df.sort_values('Total_Count', ascending=False)

print("\n📈 Word Frequency Analysis by Sentiment:")
print("=" * 80)
print(word_sentiment_df.to_string(index=False))

print("\n🎯 Key Sentiment Indicators:")
print("-" * 80)
positive_words = word_sentiment_df[word_sentiment_df['Sentiment_Indicator'] == 'POSITIVE']['Word'].tolist()
negative_words = word_sentiment_df[word_sentiment_df['Sentiment_Indicator'] == 'NEGATIVE']['Word'].tolist()

print(f"✅ Positive Words: {positive_words}")
print(f"❌ Negative Words: {negative_words}")


📈 Word Frequency Analysis by Sentiment:
         Word  Positive_Count  Negative_Count  Total_Count Sentiment_Indicator
        movie               1               1            2             NEUTRAL
   absolutely               1               0            1            POSITIVE
    storyline               1               0            1            POSITIVE
     watching               0               1            1            NEGATIVE
        waste               0               1            1            NEGATIVE
uninteresting               0               1            1            NEGATIVE
   unexpected               1               0            1            POSITIVE
       twists               1               0            1            POSITIVE
         time               0               1            1            NEGATIVE
     terrible               0               1            1            NEGATIVE
        story               1               0            1            POSITIVE
  masterpie

In [21]:
# Simple Machine Learning Application
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# For this small dataset, we'll use the same data for training and testing (just for demonstration)
# In real scenarios, you'd have separate train/test sets

# Train a Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X, y)

# Make predictions
predictions = classifier.predict(X)

# Evaluate
accuracy = accuracy_score(y, predictions)

print("\n🤖 Machine Learning Classification Results:")
print("=" * 80)
print(f"Model: Multinomial Naive Bayes")
print(f"Accuracy: {accuracy * 100:.2f}%")
print(f"\nClassification Report:")
print(classification_report(y, predictions, target_names=['Negative', 'Positive']))

# Test on new reviews
new_reviews = [
    "Amazing movie with great performances!",
    "Terrible waste of my time."
]

print("\n🔮 Predictions on New Reviews:")
print("=" * 80)
new_reviews_bow = vectorizer.transform(new_reviews)
new_predictions = classifier.predict(new_reviews_bow)

for review, pred in zip(new_reviews, new_predictions):
    sentiment = "POSITIVE ✅" if pred == 1 else "NEGATIVE ❌"
    print(f"Review: \"{review}\"")
    print(f"Predicted Sentiment: {sentiment}\n")


🤖 Machine Learning Classification Results:
Model: Multinomial Naive Bayes
Accuracy: 100.00%

Classification Report:
              precision    recall  f1-score   support

    Negative       1.00      1.00      1.00         4
    Positive       1.00      1.00      1.00         4

    accuracy                           1.00         8
   macro avg       1.00      1.00      1.00         8
weighted avg       1.00      1.00      1.00         8


🔮 Predictions on New Reviews:
Review: "Amazing movie with great performances!"
Predicted Sentiment: POSITIVE ✅

Review: "Terrible waste of my time."
Predicted Sentiment: NEGATIVE ❌



### 📝 Observations - Experiment 6:

1. **Real-World Application**: Demonstrated how BoW is used in sentiment analysis
2. **Feature Analysis**: Identified which words indicate positive vs negative sentiment
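3. **ML Integration**: Trained a Multinomial Naive Bayes classifier on BoW features
4. **Prediction Capability**: The model labeled the two new reviews as expected. The 100% accuracy above is measured on the same 8 reviews used for training, so it does not estimate performance on unseen data
5. **Interpretability**: Easy to understand which words contribute to predictions
6. **Scalability**: Count vectors are stored as sparse matrices, so the same approach extends to much larger document collections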

**Key Insights**:
- Words like "amazing", "excellent", "great" → Positive sentiment
- Words like "terrible", "worst", "poor" → Negative sentiment
- BoW + Naive Bayes is a simple yet effective baseline for text classification
- This is the foundation for more advanced NLP applications
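
Because the experiment scores the model on its own training data, a held-out split gives a more honest picture. The sketch below is only an illustration: `toy_reviews` and `toy_labels` are hypothetical stand-ins, not the notebook's dataset, and with so few examples the resulting score is very noisy.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy reviews (assumption, not the notebook's data); 1 = positive, 0 = negative
toy_reviews = [
    "amazing film with a great cast",
    "great story and excellent acting",
    "excellent direction and amazing visuals",
    "a great and moving film",
    "terrible plot and poor acting",
    "worst film of the year",
    "poor pacing and a terrible script",
    "the worst acting and a poor story",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out 25% of the reviews, keeping both classes in each split
train_texts, test_texts, y_train, y_test = train_test_split(
    toy_reviews, toy_labels, test_size=0.25, stratify=toy_labels, random_state=42
)

# Learn the vocabulary from the training texts only, then reuse it for the test texts
toy_vectorizer = CountVectorizer()
X_train = toy_vectorizer.fit_transform(train_texts)
X_test = toy_vectorizer.transform(test_texts)

toy_classifier = MultinomialNB()
toy_classifier.fit(X_train, y_train)

held_out_accuracy = accuracy_score(y_test, toy_classifier.predict(X_test))
print(f"Held-out accuracy on {len(test_texts)} reviews: {held_out_accuracy:.2f}")
```

Fitting the vectorizer on the training split alone keeps test-set words from leaking into the vocabulary.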

---

## 🔬 Experiment 7: Binary vs Frequency Representation

**Objective**: Compare binary (presence/absence) vs frequency-based BoW representations.

**Difference**:
- **Frequency**: Count how many times each word appears (1, 2, 3, ...)
- **Binary**: Only record if word is present (0 or 1)

In [22]:
# Sample documents with repeated words
repeated_docs = [
    "good good good excellent",
    "bad bad bad terrible",
    "good excellent amazing",
    "bad terrible awful"
]

print("📄 Sample Documents with Repeated Words:")
print("=" * 80)
for i, doc in enumerate(repeated_docs, 1):
    print(f"Doc {i}: {doc}")

# Frequency-based BoW
cv_frequency = CountVectorizer()
bow_frequency = cv_frequency.fit_transform(repeated_docs)

bow_frequency_df = pd.DataFrame(
    bow_frequency.toarray(),
    columns=cv_frequency.get_feature_names_out(),
    index=[f"Doc {i}" for i in range(1, len(repeated_docs) + 1)]
)

print("\n📊 Frequency-Based Representation:")
print("=" * 80)
print(bow_frequency_df)
print("\n💡 Values represent how many times each word appears")

📄 Sample Documents with Repeated Words:
Doc 1: good good good excellent
Doc 2: bad bad bad terrible
Doc 3: good excellent amazing
Doc 4: bad terrible awful

📊 Frequency-Based Representation:
       amazing  awful  bad  excellent  good  terrible
Doc 1        0      0    0          1     3         0
Doc 2        0      0    3          0     0         1
Doc 3        1      0    0          1     1         0
Doc 4        0      1    1          0     0         1

💡 Values represent how many times each word appears


In [23]:
# Binary BoW (presence/absence only)
cv_binary = CountVectorizer(binary=True)
bow_binary = cv_binary.fit_transform(repeated_docs)

bow_binary_df = pd.DataFrame(
    bow_binary.toarray(),
    columns=cv_binary.get_feature_names_out(),
    index=[f"Doc {i}" for i in range(1, len(repeated_docs) + 1)]
)

print("\n📊 Binary Representation:")
print("=" * 80)
print(bow_binary_df)
print("\n💡 Values are 0 (absent) or 1 (present), regardless of frequency")

# Comparison
print("\n⚖️ Comparison:")
print("=" * 80)
print("Document 1: 'good good good excellent'")
print(f"Frequency representation of 'good': {bow_frequency_df.iloc[0]['good']}")
print(f"Binary representation of 'good': {bow_binary_df.iloc[0]['good']}")
print("\nDocument 2: 'bad bad bad terrible'")
print(f"Frequency representation of 'bad': {bow_frequency_df.iloc[1]['bad']}")
print(f"Binary representation of 'bad': {bow_binary_df.iloc[1]['bad']}")


📊 Binary Representation:
       amazing  awful  bad  excellent  good  terrible
Doc 1        0      0    0          1     1         0
Doc 2        0      0    1          0     0         1
Doc 3        1      0    0          1     1         0
Doc 4        0      1    1          0     0         1

💡 Values are 0 (absent) or 1 (present), regardless of frequency

⚖️ Comparison:
Document 1: 'good good good excellent'
Frequency representation of 'good': 3
Binary representation of 'good': 1

Document 2: 'bad bad bad terrible'
Frequency representation of 'bad': 3
Binary representation of 'bad': 1


### 📝 Observations - Experiment 7:

1. **Frequency Representation**: 
   - Captures how many times a word appears
   - Useful when word frequency matters (e.g., emphasis)
   - More informative but sensitive to word repetition

2. **Binary Representation**: 
   - Only indicates presence or absence
   - Treats "good" and "good good good" the same way
   - Less sensitive to word repetition
   - Better for some classification tasks where presence matters more than frequency

3. **When to Use Each**:
   - **Frequency**: Topic modeling, information retrieval, document similarity
   - **Binary**: Text classification, spam detection (word presence often enough)

**Key Insight**: Choose binary vs frequency based on whether word repetition carries meaningful information!
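
One way to see the practical effect is to compare document similarity under both representations. The sketch below reuses Doc 1 and Doc 3 from this experiment; the variable names are illustrative, and cosine similarity is just one possible similarity measure.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Doc 1 and Doc 3 from the experiment above
pair = ["good good good excellent", "good excellent amazing"]

freq_vectors = CountVectorizer().fit_transform(pair)
binary_vectors = CountVectorizer(binary=True).fit_transform(pair)

# cosine_similarity returns a 2x2 matrix; entry [0, 1] compares the two documents
freq_similarity = cosine_similarity(freq_vectors)[0, 1]
binary_similarity = cosine_similarity(binary_vectors)[0, 1]

print(f"Frequency-based similarity: {freq_similarity:.2f}")  # 0.73
print(f"Binary similarity:          {binary_similarity:.2f}")  # 0.82
```

With counts, the repeated "good" pulls Doc 1 away from Doc 3; with binary values, the two documents look more alike because only shared words matter.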

---

## 🔬 Experiment 8: Handling Large Vocabulary with Parameters

**Objective**: Learn to control vocabulary size in real-world scenarios.

**Problem**: Real documents can have thousands of unique words, creating huge, sparse matrices.

**Solution**: Use `min_df`, `max_df`, and `max_features` parameters.

In [24]:
# Larger corpus with diverse vocabulary
large_corpus = [
    "Python is a great programming language for data science",
    "Machine learning requires understanding of algorithms and mathematics",
    "Deep learning is a subset of machine learning",
    "Natural language processing helps computers understand human language",
    "Data science involves statistics, programming, and domain knowledge",
    "Python libraries like NumPy and Pandas are essential for data analysis",
    "Neural networks are inspired by biological neurons",
    "Text mining extracts insights from unstructured text data"
]

print("📄 Large Corpus Sample:")
print("=" * 80)
for i, doc in enumerate(large_corpus, 1):
    print(f"{i}. {doc}")

# Method 1: No constraints (all words)
print("\n" + "=" * 80)
print("METHOD 1: No Constraints (All Words)")
print("=" * 80)

cv_all = CountVectorizer(stop_words='english')
bow_all = cv_all.fit_transform(large_corpus)

print(f"Vocabulary Size: {len(cv_all.get_feature_names_out())}")
print(f"Matrix Shape: {bow_all.shape}")
print(f"Vocabulary: {list(cv_all.get_feature_names_out())}")

📄 Large Corpus Sample:
1. Python is a great programming language for data science
2. Machine learning requires understanding of algorithms and mathematics
3. Deep learning is a subset of machine learning
4. Natural language processing helps computers understand human language
5. Data science involves statistics, programming, and domain knowledge
6. Python libraries like NumPy and Pandas are essential for data analysis
7. Neural networks are inspired by biological neurons
8. Text mining extracts insights from unstructured text data

METHOD 1: No Constraints (All Words)
Vocabulary Size: 40
Matrix Shape: (8, 40)
Vocabulary: ['algorithms', 'analysis', 'biological', 'computers', 'data', 'deep', 'domain', 'essential', 'extracts', 'great', 'helps', 'human', 'insights', 'inspired', 'involves', 'knowledge', 'language', 'learning', 'libraries', 'like', 'machine', 'mathematics', 'mining', 'natural', 'networks', 'neural', 'neurons', 'numpy', 'pandas', 'processing', 'programming', 'python', 'requir

In [25]:
# Method 2: Using min_df (minimum document frequency)
print("\n" + "=" * 80)
print("METHOD 2: min_df=2 (word must appear in at least 2 documents)")
print("=" * 80)

cv_min_df = CountVectorizer(stop_words='english', min_df=2)
bow_min_df = cv_min_df.fit_transform(large_corpus)

print(f"Vocabulary Size: {len(cv_min_df.get_feature_names_out())}")
print(f"Matrix Shape: {bow_min_df.shape}")
print(f"Vocabulary: {list(cv_min_df.get_feature_names_out())}")
print("💡 Removed rare words that appear in only 1 document")


METHOD 2: min_df=2 (word must appear in at least 2 documents)
Vocabulary Size: 7
Matrix Shape: (8, 7)
Vocabulary: ['data', 'language', 'learning', 'machine', 'programming', 'python', 'science']
💡 Removed rare words that appear in only 1 document


In [26]:
# Method 3: Using max_df (maximum document frequency)
print("\n" + "=" * 80)
print("METHOD 3: max_df=0.5 (word can appear in max 50% of documents)")
print("=" * 80)

cv_max_df = CountVectorizer(stop_words='english', max_df=0.5)
bow_max_df = cv_max_df.fit_transform(large_corpus)

print(f"Vocabulary Size: {len(cv_max_df.get_feature_names_out())}")
print(f"Matrix Shape: {bow_max_df.shape}")
print(f"Vocabulary: {list(cv_max_df.get_feature_names_out())}")
print("💡 Removed very common words that appear in >50% of documents")


METHOD 3: max_df=0.5 (word can appear in max 50% of documents)
Vocabulary Size: 40
Matrix Shape: (8, 40)
Vocabulary: ['algorithms', 'analysis', 'biological', 'computers', 'data', 'deep', 'domain', 'essential', 'extracts', 'great', 'helps', 'human', 'insights', 'inspired', 'involves', 'knowledge', 'language', 'learning', 'libraries', 'like', 'machine', 'mathematics', 'mining', 'natural', 'networks', 'neural', 'neurons', 'numpy', 'pandas', 'processing', 'programming', 'python', 'requires', 'science', 'statistics', 'subset', 'text', 'understand', 'understanding', 'unstructured']
💡 Removed very common words that appear in >50% of documents


In [27]:
# Method 4: Using max_features (keep only top N features)
print("\n" + "=" * 80)
print("METHOD 4: max_features=10 (keep only top 10 most frequent words)")
print("=" * 80)

cv_max_features = CountVectorizer(stop_words='english', max_features=10)
bow_max_features = cv_max_features.fit_transform(large_corpus)

print(f"Vocabulary Size: {len(cv_max_features.get_feature_names_out())}")
print(f"Matrix Shape: {bow_max_features.shape}")
print(f"Vocabulary: {list(cv_max_features.get_feature_names_out())}")
print("💡 Kept only the 10 most frequent words")

# Method 5: Combining parameters
print("\n" + "=" * 80)
print("METHOD 5: Combined (min_df=2, max_df=0.7, max_features=15)")
print("=" * 80)

cv_combined = CountVectorizer(
    stop_words='english',
    min_df=2,
    max_df=0.7,
    max_features=15
)
bow_combined = cv_combined.fit_transform(large_corpus)

print(f"Vocabulary Size: {len(cv_combined.get_feature_names_out())}")
print(f"Matrix Shape: {bow_combined.shape}")
print(f"Vocabulary: {list(cv_combined.get_feature_names_out())}")
print("💡 Applied multiple filters for optimal vocabulary")


METHOD 4: max_features=10 (keep only top 10 most frequent words)
Vocabulary Size: 10
Matrix Shape: (8, 10)
Vocabulary: ['data', 'language', 'learning', 'machine', 'natural', 'networks', 'programming', 'python', 'science', 'text']
💡 Kept only the 10 most frequent words

METHOD 5: Combined (min_df=2, max_df=0.7, max_features=15)
Vocabulary Size: 7
Matrix Shape: (8, 7)
Vocabulary: ['data', 'language', 'learning', 'machine', 'programming', 'python', 'science']
💡 Applied multiple filters for optimal vocabulary


### 📝 Observations - Experiment 8:

1. **min_df (Minimum Document Frequency)**:
   - Removes rare words that appear in very few documents
   - Helps eliminate typos and uncommon words
   - Reduces noise and vocabulary size

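2. **max_df (Maximum Document Frequency)**:
   - Removes very common words (like domain-specific stop words)
   - Like `min_df`, can be an integer (document count) or a fraction (0.0 to 1.0)
   - Helps focus on distinctive words
   - In Method 3, `max_df=0.5` removed nothing: the most widespread word, "data", appears in 4 of 8 documents (exactly 50%), which does not exceed the threshold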

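3. **max_features**:
   - Keeps only the top N words ranked by total term frequency across the corpus
   - Simple way to control vocabulary size
   - Good for quick prototyping
   - In Method 4, only 8 words occur more than once; "natural" and "networks" fill the last two slots from a tie among words that occur once, so their selection is an implementation detail rather than a sign of importance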

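4. **Combined Approach**:
   - Parameters can be combined; each filter is applied to the same vocabulary
   - Balance between vocabulary size and information retention
   - Can improve model performance and efficiency, but verify the effect on validation data
   - In Method 5 the result matches Method 2: only `min_df=2` had any effect, while `max_df=0.7` and `max_features=15` changed nothing on this small corpus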

**Key Insight**: Proper vocabulary filtering is crucial for real-world NLP applications!

**Recommended Settings** (rules of thumb to start from; tune them on validation data):
- Small dataset (< 1000 docs): `min_df=1`, `max_df=0.8`
- Medium dataset (1000-10000 docs): `min_df=5`, `max_df=0.7`, `max_features=5000`
- Large dataset (> 10000 docs): `min_df=10`, `max_df=0.5`, `max_features=10000`

---

## 📊 Summary and Comparison

### Complete Overview of All Experiments

In [28]:
# Create a comprehensive comparison summary
experiments_summary = pd.DataFrame({
    'Experiment': [
        'Exp 1: Basic BoW',
        'Exp 2: Stop Words Removal',
        'Exp 3: Stemming',
        'Exp 4: Lemmatization',
        'Exp 5: CountVectorizer',
        'Exp 6: Real-World App',
        'Exp 7: Binary vs Frequency',
        'Exp 8: Vocabulary Control'
    ],
    'Technique': [
        'Manual implementation',
        'NLTK stopwords',
        'PorterStemmer',
        'WordNetLemmatizer',
        'Sklearn CountVectorizer',
        'Sentiment Analysis',
        'Binary parameter',
        'min_df, max_df, max_features'
    ],
    'Key Learning': [
        'Basic concept & implementation',
        'Reducing noise from common words',
        'Normalizing words to root form',
        'Intelligent word normalization',
        'Professional implementation',
        'ML application with BoW',
        'Presence vs frequency matters',
        'Controlling vocabulary size'
    ],
    'Vocabulary_Reduction': [
        'Baseline',
        '↓ Medium',
        '↓ Medium',
        '↓ Medium',
        '↓ Configurable',
        'N/A',
        'Same size',
        '↓ High'
    ],
    'Use_Case': [
        'Learning/Understanding',
        'Preprocessing',
        'Text Normalization',
        'Text Normalization (Better)',
        'Production Systems',
        'Classification Tasks',
        'Spam Detection, Classification',
        'Large-scale Applications'
    ]
})

print("=" * 120)
print("COMPREHENSIVE EXPERIMENTS SUMMARY")
print("=" * 120)
print(experiments_summary.to_string(index=False))

print("\n" + "=" * 120)
print("VOCABULARY SIZE EVOLUTION (on original 4 documents)")
print("=" * 120)
vocab_comparison = pd.DataFrame({
    'Method': [
        'Original (All words)',
        'Stop Words Removed',
        'With Stemming',
        'With Lemmatization'
    ],
    'Vocabulary_Size': [
        bow_df.shape[1],
        bow_df_filtered.shape[1],
        bow_df_stemmed.shape[1],
        bow_df_lemmatized.shape[1]
    ],
    'Reduction': [
        '0%',
        f'{((bow_df.shape[1] - bow_df_filtered.shape[1]) / bow_df.shape[1] * 100):.1f}%',
        f'{((bow_df.shape[1] - bow_df_stemmed.shape[1]) / bow_df.shape[1] * 100):.1f}%',
        f'{((bow_df.shape[1] - bow_df_lemmatized.shape[1]) / bow_df.shape[1] * 100):.1f}%'
    ]
})

print(vocab_comparison.to_string(index=False))

COMPREHENSIVE EXPERIMENTS SUMMARY
                Experiment                    Technique                     Key Learning Vocabulary_Reduction                       Use_Case
          Exp 1: Basic BoW        Manual implementation   Basic concept & implementation             Baseline         Learning/Understanding
 Exp 2: Stop Words Removal               NLTK stopwords Reducing noise from common words             ↓ Medium                  Preprocessing
           Exp 3: Stemming                PorterStemmer   Normalizing words to root form             ↓ Medium             Text Normalization
      Exp 4: Lemmatization            WordNetLemmatizer   Intelligent word normalization             ↓ Medium    Text Normalization (Better)
    Exp 5: CountVectorizer      Sklearn CountVectorizer      Professional implementation       ↓ Configurable             Production Systems
     Exp 6: Real-World App           Sentiment Analysis          ML application with BoW                  N/A           

---

## 🎯 Advantages of Bag of Words

### ✅ Strengths:

1. **Simplicity**: Easy to understand and implement
2. **Interpretability**: Clear relationship between words and features
3. **Efficiency**: Fast computation, especially with sparse matrices
4. **Effectiveness**: Works well for many NLP tasks
5. **Foundation**: Base for understanding advanced techniques
6. **Scalability**: Can handle large documents with proper parameters
7. **Flexibility**: Many parameters to tune (stop words, n-grams, etc.)
8. **Tool Support**: Excellent library support (sklearn, nltk)

---

## ⚠️ Limitations of Bag of Words

### ❌ Weaknesses:

1. **Loss of Word Order**: "Dog bites man" vs "Man bites dog" are treated identically
2. **No Semantic Understanding**: Doesn't understand word meanings or context
3. **Sparse Matrices**: Large vocabularies create mostly zero values
4. **Ignores Word Relationships**: Can't capture synonyms or related words
5. **Fixed Vocabulary**: Can't handle new words not seen during training (OOV problem)
6. **No Context**: Same word in different contexts treated identically
7. **Size Issues**: Vocabulary can become very large with real datasets
8. **Frequency Bias**: Very frequent words can dominate less frequent but important words
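
The fixed-vocabulary limitation is easy to reproduce. In this minimal sketch (the sentences are made up for illustration), words that were not seen during `fit` are silently dropped by `transform`:

```python
from sklearn.feature_extraction.text import CountVectorizer

oov_vectorizer = CountVectorizer()
oov_vectorizer.fit(["the cat sat on the mat", "the dog ran"])

# None of these words are in the learned vocabulary, so the vector is all zeros
unseen_vector = oov_vectorizer.transform(["zebras gallop quickly"])
print("Non-zero entries for unseen text:", unseen_vector.nnz)

# Only "the" and "sat" are known; "zebra" contributes nothing
partial_vector = oov_vectorizer.transform(["the zebra sat"])
print("Total count for partly known text:", partial_vector.sum())
```

No error or warning is raised, so out-of-vocabulary loss can go unnoticed unless it is checked explicitly.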

### 💡 Solutions to Limitations:

| Limitation | Solution |
|-----------|----------|
| Loss of order | Use n-grams (bigrams, trigrams) |
| Sparse matrices | Use dimensionality reduction (e.g., truncated SVD) or dense embeddings |
| No semantics | Use Word2Vec, GloVe, BERT embeddings |
| Fixed vocabulary | Use subword tokenization, character n-grams |
| Frequency bias | Use TF-IDF weighting |
| Large vocabulary | Use vocabulary filtering (min_df, max_df) |

---

## 🎓 Applications and Use Cases

### Where Bag of Words Works Well:

1. **Text Classification**
   - Spam detection
   - Sentiment analysis
   - Topic categorization
   - Document categorization

2. **Information Retrieval**
   - Search engines (basic)
   - Document similarity
   - Content recommendation

3. **Baseline Models**
   - Quick prototyping
   - Benchmarking
   - Feature engineering baseline

4. **Small to Medium Datasets**
   - When you don't have millions of documents
   - When training time matters
   - When interpretability is important

### When to Use Alternatives:

- **TF-IDF**: When word importance matters more than raw frequency
- **Word Embeddings (Word2Vec, GloVe)**: When semantic similarity is crucial
- **Transformers (BERT, GPT)**: For state-of-the-art performance on complex tasks
- **Character/Subword models**: For handling OOV words, multilingual text

---

## 🚀 Best Practices and Recommendations

### ✨ Production-Ready BoW Pipeline:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

# Recommended configuration
vectorizer = CountVectorizer(
    lowercase=True,           # Normalize case
    stop_words='english',     # Remove common words
    min_df=5,                 # Word must appear in 5+ documents
    max_df=0.7,               # Word can't appear in >70% of documents
    max_features=5000,        # Keep top 5000 features
    ngram_range=(1, 2),       # Include unigrams and bigrams
    token_pattern=r'\b\w+\b'  # Keep single-character tokens too (default drops them)
)

# Create ML pipeline
pipeline = Pipeline([
    ('vectorizer', vectorizer),
    ('classifier', MultinomialNB())
])
```
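
A pipeline like this is fitted and used as a single estimator. The sketch below is a scaled-down illustration: `demo_texts` and `demo_labels` are hypothetical, and `min_df` is relaxed to 1 because a setting such as `min_df=5` would prune every term from a corpus this small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Relaxed settings for a tiny demo corpus (assumption: not the configuration above)
demo_pipeline = Pipeline([
    ("vectorizer", CountVectorizer(stop_words="english", ngram_range=(1, 2), min_df=1)),
    ("classifier", MultinomialNB()),
])

# Hypothetical training data; 1 = positive, 0 = negative
demo_texts = [
    "great acting and a great story",
    "excellent film",
    "terrible script and poor pacing",
    "worst film ever",
]
demo_labels = [1, 1, 0, 0]

# fit() learns the vocabulary and the classifier in one call
demo_pipeline.fit(demo_texts, demo_labels)

# predict() reuses the learned vocabulary on raw strings
demo_predictions = demo_pipeline.predict(["excellent story", "terrible pacing"])
print(list(demo_predictions))
```

Passing raw text through the pipeline avoids a common mistake: re-fitting the vectorizer on new data and ending up with a vocabulary that no longer matches the trained classifier.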

### 📋 Preprocessing Checklist:

1. ☑️ **Lowercase conversion**: Normalize case
2. ☑️ **Remove stop words**: Eliminate common words
3. ☑️ **Handle punctuation**: Remove or tokenize properly
4. ☑️ **Lemmatization/Stemming**: Normalize word forms
5. ☑️ **Remove special characters**: Clean text
6. ☑️ **Handle numbers**: Remove or keep based on use case
7. ☑️ **Set vocabulary limits**: Use min_df, max_df, max_features
8. ☑️ **Consider n-grams**: Include bigrams/trigrams if needed

### 🎯 Parameter Selection Guide:

| Dataset Size | min_df | max_df | max_features | ngram_range |
|-------------|--------|--------|--------------|-------------|
| Small (<1K docs) | 1 | 0.8 | 1000 | (1, 1) |
| Medium (1K-10K) | 5 | 0.7 | 5000 | (1, 2) |
| Large (>10K) | 10 | 0.5 | 10000 | (1, 2) |

---

## 📚 Comparison with Other Techniques

### BoW vs Other Text Vectorization Methods:

| Technique | Word Order | Semantics | Vector Size | Complexity | Use Case |
|-----------|-----------|-----------|-------------|------------|----------|
| **Bag of Words** | ❌ No | ❌ No | Large, sparse | Low | Baseline, simple classification |
| **TF-IDF** | ❌ No | ❌ No | Large, sparse | Low | Document ranking, search |
| **Word2Vec** | ⚠️ Partial | ✅ Yes | Small, dense | Medium | Semantic similarity, analogies |
| **GloVe** | ⚠️ Partial | ✅ Yes | Small, dense | Medium | Similar to Word2Vec |
| **FastText** | ⚠️ Partial | ✅ Yes | Small, dense | Medium | Handles OOV, subwords |
| **BERT/Transformers** | ✅ Yes | ✅ Yes | Medium, dense | High | State-of-art, context-aware |

### Quick Comparison:

**Bag of Words**:
- ✅ Simple, fast, interpretable
- ❌ No semantics, no word order
- 🎯 Best for: Simple classification, baseline models

**TF-IDF** (Next step from BoW):
- ✅ Weights words by importance
- ❌ Still no semantics or order
- 🎯 Best for: Search, document ranking

**Word Embeddings**:
- ✅ Captures semantics and relationships
- ❌ Loses word order, fixed representation
- 🎯 Best for: Semantic tasks, document similarity

**Transformers**:
- ✅ Context-aware, understands semantics and order
- ❌ Computationally expensive
- 🎯 Best for: State-of-art performance

---

## 🎬 Final Thoughts and Key Takeaways

### 🌟 What You Learned:

1. ✅ **Core Concept**: Converting text to numerical vectors using word frequency
2. ✅ **Manual Implementation**: Built BoW from scratch to understand the fundamentals
3. ✅ **Preprocessing Techniques**: Stop words removal, stemming, lemmatization
4. ✅ **Professional Tools**: Used sklearn's CountVectorizer for production-ready code
5. ✅ **Real Applications**: Applied BoW to sentiment analysis
6. ✅ **Optimization**: Learned to control vocabulary with various parameters
7. ✅ **Trade-offs**: Understood advantages and limitations
8. ✅ **Best Practices**: Learned industry-standard approaches

### 🚀 Next Steps:

1. **TF-IDF**: Learn about term frequency-inverse document frequency
2. **Word Embeddings**: Explore Word2Vec, GloVe, FastText
3. **Advanced Techniques**: Study BERT, GPT, and transformer models
4. **Practice Projects**: Build a spam classifier, sentiment analyzer, or document classifier
5. **Large Datasets**: Work with real-world datasets (IMDB reviews, news articles)

### 💡 Remember:

> "Bag of Words is simple but powerful. It's the foundation of NLP and still widely used in production. Master the basics before moving to advanced techniques!"

### 🎯 When to Use BoW:

- ✅ Quick prototyping
- ✅ Baseline models
- ✅ When interpretability matters
- ✅ Small to medium datasets
- ✅ Simple classification tasks
- ✅ When computational resources are limited

### 📖 Key Formula:

For a document $d$ and vocabulary $V$:

$$\text{BoW}(d) = [count(w_1), count(w_2), \ldots, count(w_n)]$$

Where $n = |V|$ (vocabulary size) and $count(w_i)$ is the frequency of word $w_i$ in document $d$.
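
The formula maps directly onto a few lines of plain Python. This is a minimal sketch that assumes lowercase, whitespace-based tokenization; `bow_vector` is a hypothetical helper name, and the vocabulary order is the one used in Section 2.

```python
from collections import Counter


def bow_vector(document, vocabulary):
    """Return [count(w_1), ..., count(w_n)] for the words in `vocabulary`."""
    # Assumption: lowercase the text and split on whitespace
    counts = Counter(document.lower().split())
    # Counter returns 0 for words that never occur in the document
    return [counts[word] for word in vocabulary]


vocabulary = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
vector = bow_vector("The cat sat on the mat", vocabulary)
print(vector)  # [2, 1, 1, 1, 1, 0, 0]
```

The vector has one entry per vocabulary word, so its length is $|V|$ regardless of how long the document is.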

---
