# Classical Text Representations: Bag-of-Words (BoW) and TF-IDF

This notebook covers classical text representation techniques that transform text into numerical vectors for machine learning.

## Topics Covered:
1. **Bag-of-Words (BoW)**
   - Basic concepts and intuition
   - Implementation from scratch
   - Using sklearn's CountVectorizer
   - Binary BoW and N-gram BoW
   
2. **TF-IDF (Term Frequency-Inverse Document Frequency)**
   - Understanding TF, IDF, and TF-IDF
   - Implementation from scratch
   - Using sklearn's TfidfVectorizer
   - Variants and normalization
   
3. **Practical Applications**
   - Text classification
   - Document similarity
   - Information retrieval
   - Feature importance analysis

In [None]:
# Install required packages (uncomment if needed)
# !pip install scikit-learn nltk pandas numpy matplotlib seaborn

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter, defaultdict
import re
import warnings
warnings.filterwarnings('ignore')

# Sklearn imports
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# NLTK imports
import nltk
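# Check each resource separately so a cached 'punkt' does not skip the others
for resource, path in [('punkt', 'tokenizers/punkt'),
                       ('stopwords', 'corpora/stopwords'),
                       ('wordnet', 'corpora/wordnet')]:
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(resource)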

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Set visualization style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")

---
## Part 1: Bag-of-Words (BoW)

### What is Bag-of-Words?

Bag-of-Words is a text representation technique that:
- Treats text as an unordered collection of words
- Ignores grammar and word order
- Represents documents as vectors of word counts

**Example:**
```
Documents:
  D1: "I love machine learning"
  D2: "I love deep learning"
  
Vocabulary: [I, love, machine, learning, deep]

BoW Vectors:
  D1: [1, 1, 1, 1, 0]
  D2: [1, 1, 0, 1, 1]
```

### Advantages:
- ✅ Simple and easy to understand
- ✅ Works well as a baseline for tasks such as topic classification and spam filtering
- ✅ Fast to compute

### Disadvantages:
- ❌ Ignores word order and context
- ❌ High-dimensional and sparse
- ❌ Doesn't capture semantic meaning
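
To make the first disadvantage concrete, here is a minimal sketch showing that two sentences with opposite meanings get the same representation. The `bow` helper is hypothetical and assumes plain whitespace tokenization.

```python
from collections import Counter

def bow(text):
    """Count lowercase, whitespace-separated tokens (illustrative helper)."""
    return Counter(text.lower().split())

a = bow("dog bites man")
b = bow("man bites dog")

# Both sentences contain the same three words once each,
# so their bag-of-words counts are indistinguishable.
print(a == b)
```

Any model that sees only these counts cannot tell the two sentences apart, which is one motivation for the n-gram features in section 1.4.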

### 1.1 Implementing Bag-of-Words from Scratch

In [None]:
class SimpleBagOfWords:
    """Simple Bag-of-Words implementation from scratch"""
    
    def __init__(self, lowercase=True, remove_punctuation=True):
        self.lowercase = lowercase
        self.remove_punctuation = remove_punctuation
        self.vocabulary = None
        self.word_to_idx = None
    
    def preprocess(self, text):
        """Preprocess text"""
        if self.lowercase:
            text = text.lower()
        if self.remove_punctuation:
            text = re.sub(r'[^\w\s]', '', text)
        return text.split()
    
    def fit(self, documents):
        """Build vocabulary from documents"""
        vocab_set = set()
        for doc in documents:
            words = self.preprocess(doc)
            vocab_set.update(words)
        
        self.vocabulary = sorted(vocab_set)
        self.word_to_idx = {word: idx for idx, word in enumerate(self.vocabulary)}
        return self
    
    def transform(self, documents):
        """Transform documents to BoW vectors"""
        if self.vocabulary is None:
            raise ValueError("Vocabulary not built. Call fit() first.")
        
        vectors = []
        for doc in documents:
            words = self.preprocess(doc)
            vector = [0] * len(self.vocabulary)
            
            for word in words:
                if word in self.word_to_idx:
                    idx = self.word_to_idx[word]
                    vector[idx] += 1
            
            vectors.append(vector)
        
        return np.array(vectors)
    
    def fit_transform(self, documents):
        """Fit and transform in one step"""
        return self.fit(documents).transform(documents)
    
    def get_feature_names(self):
        """Get vocabulary words"""
        return self.vocabulary

In [None]:
# Sample documents
sample_docs = [
    "I love machine learning",
    "I love deep learning",
    "Machine learning is awesome",
    "Deep learning uses neural networks",
]

# Create and fit BoW model
bow = SimpleBagOfWords()
bow_vectors = bow.fit_transform(sample_docs)

# Display results
print("Bag-of-Words Representation\n")
print(f"Vocabulary size: {len(bow.vocabulary)}")
print(f"Vocabulary: {bow.vocabulary}\n")

# Create DataFrame for visualization
bow_df = pd.DataFrame(bow_vectors, 
                      columns=bow.get_feature_names(),
                      index=[f"Doc {i+1}" for i in range(len(sample_docs))])

print("BoW Matrix:")
print(bow_df)

# Show original documents
print("\nOriginal Documents:")
for i, doc in enumerate(sample_docs, 1):
    print(f"  Doc {i}: {doc}")

In [None]:
# Visualize BoW matrix
plt.figure(figsize=(14, 6))
sns.heatmap(bow_df, annot=True, fmt='d', cmap='YlOrRd', 
            cbar_kws={'label': 'Word Count'}, linewidths=0.5)
plt.title('Bag-of-Words Matrix Heatmap', fontsize=14, fontweight='bold')
plt.xlabel('Words (Vocabulary)', fontsize=12)
plt.ylabel('Documents', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

### 1.2 Using sklearn's CountVectorizer

sklearn provides a more feature-rich implementation with:
- Built-in preprocessing
- N-gram support
- Stop word filtering
- Min/max document frequency filtering
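
As a rough illustration of what stop word filtering and a vocabulary cap do, the sketch below keeps the `k` most frequent non-stop words in a corpus. It is plain Python rather than sklearn's implementation. The stop list, the `top_k_vocabulary` helper, and the alphabetical tie-breaking are all assumptions made for the example.

```python
from collections import Counter

STOP_WORDS = {"is", "a", "of", "the"}  # tiny hypothetical stop list

def top_k_vocabulary(docs, k, stop_words=STOP_WORDS):
    """Return the k most frequent non-stop words, sorted alphabetically."""
    totals = Counter()
    for doc in docs:
        totals.update(w for w in doc.lower().split() if w not in stop_words)
    # Rank by descending corpus frequency; ties are broken alphabetically (an assumption)
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return sorted(word for word, _ in ranked[:k])

docs = [
    "machine learning is fun",
    "deep learning is powerful",
    "learning is useful",
]
vocab = top_k_vocabulary(docs, k=2)
print(vocab)
```

Here "learning" is kept because it is the most frequent term, and "deep" is the second entry only because of the alphabetical tie-break among words that occur once.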

In [None]:
# Create CountVectorizer
count_vectorizer = CountVectorizer(
    lowercase=True,
    stop_words='english',  # Remove common English stop words
    max_features=20,       # Keep only the 20 most frequent terms in the corpus
)

# Larger corpus for demonstration
corpus = [
    "Machine learning is a subset of artificial intelligence",
    "Deep learning is a subset of machine learning",
    "Neural networks are the foundation of deep learning",
    "Natural language processing is an important AI application",
    "Computer vision is another important AI application",
    "Reinforcement learning helps agents learn from environment",
    "Supervised learning requires labeled training data",
    "Unsupervised learning finds patterns in unlabeled data",
]

# Fit and transform
bow_matrix = count_vectorizer.fit_transform(corpus)

# Convert to DataFrame
bow_df_sklearn = pd.DataFrame(
    bow_matrix.toarray(),
    columns=count_vectorizer.get_feature_names_out(),
    index=[f"Doc {i+1}" for i in range(len(corpus))]
)

print("sklearn CountVectorizer Results\n")
print(f"Vocabulary size: {len(count_vectorizer.vocabulary_)}")
print(f"Shape: {bow_matrix.shape}\n")
print("BoW Matrix:")
print(bow_df_sklearn)

In [None]:
# Visualize word frequencies across all documents
word_counts = bow_df_sklearn.sum(axis=0).sort_values(ascending=False)

plt.figure(figsize=(12, 6))
word_counts.plot(kind='bar', color='steelblue', alpha=0.8)
plt.title('Word Frequency Across All Documents', fontsize=14, fontweight='bold')
plt.xlabel('Words', fontsize=12)
plt.ylabel('Total Count', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

### 1.3 Binary Bag-of-Words

Binary BoW only tracks presence/absence of words (0 or 1), not counts.
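
A minimal sketch of the difference, using a small hypothetical vocabulary and plain Python: the count vector records how often each word occurs, while the binary vector records only whether it occurs.

```python
from collections import Counter

vocab = ["deep", "learning", "machine"]  # hypothetical fixed vocabulary
tokens = "learning learning learning machine machine".split()
counts = Counter(tokens)

count_vector = [counts[word] for word in vocab]            # occurrences per word
binary_vector = [int(counts[word] > 0) for word in vocab]  # 1 if present, else 0

print(count_vector)
print(binary_vector)
```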

In [None]:
# Binary BoW
binary_vectorizer = CountVectorizer(binary=True, stop_words='english')
binary_bow = binary_vectorizer.fit_transform(corpus)

# Compare regular vs binary
comparison_docs = [
    "learning learning learning machine machine",
    "learning machine"
]

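# Reuse the vectorizers fitted on `corpus`. Their vocabularies differ
# (count_vectorizer keeps at most 20 terms), so compare the shared
# 'learning' and 'machine' columns.
regular_vec = count_vectorizer.transform(comparison_docs)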
binary_vec = binary_vectorizer.transform(comparison_docs)

print("Regular BoW vs Binary BoW\n")
print("Documents:")
for i, doc in enumerate(comparison_docs, 1):
    print(f"  Doc {i}: {doc}")

print("\nRegular BoW (counts):")
print(pd.DataFrame(regular_vec.toarray(), 
                   columns=count_vectorizer.get_feature_names_out()))

print("\nBinary BoW (presence/absence):")
print(pd.DataFrame(binary_vec.toarray(), 
                   columns=binary_vectorizer.get_feature_names_out()))

### 1.4 N-gram Bag-of-Words

N-grams capture sequences of words:
- **Unigram**: Single words ("machine", "learning")
- **Bigram**: Two consecutive words ("machine learning")
- **Trigram**: Three consecutive words ("natural language processing")
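
The sliding-window idea behind n-grams can be sketched in a few lines. The `ngrams` helper below is hypothetical and assumes the text is already tokenized.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "deep learning neural networks".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigrams)
print(bigrams)
print(trigrams)
```

A document with `k` tokens yields `k - n + 1` n-grams, so the vocabulary grows quickly as `n` increases.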

In [None]:
# N-gram examples
ngram_docs = [
    "natural language processing",
    "machine learning algorithms",
    "deep learning neural networks",
]

# Unigram (1,1)
unigram_vec = CountVectorizer(ngram_range=(1, 1))
unigram_bow = unigram_vec.fit_transform(ngram_docs)

# Bigram (2,2)
bigram_vec = CountVectorizer(ngram_range=(2, 2))
bigram_bow = bigram_vec.fit_transform(ngram_docs)

# Unigram + Bigram (1,2)
combined_vec = CountVectorizer(ngram_range=(1, 2))
combined_bow = combined_vec.fit_transform(ngram_docs)

print("N-gram Bag-of-Words Examples\n")
print("Documents:")
for i, doc in enumerate(ngram_docs, 1):
    print(f"  {i}. {doc}")

print(f"\nUnigrams (1,1): {unigram_vec.get_feature_names_out().tolist()}")
print(f"Bigrams (2,2): {bigram_vec.get_feature_names_out().tolist()}")
print(f"Combined (1,2): {combined_vec.get_feature_names_out().tolist()}")

In [None]:
# Visualize combined n-grams
combined_df = pd.DataFrame(
    combined_bow.toarray(),
    columns=combined_vec.get_feature_names_out(),
    index=[f"Doc {i+1}" for i in range(len(ngram_docs))]
)

plt.figure(figsize=(14, 5))
sns.heatmap(combined_df, annot=True, fmt='d', cmap='Blues', 
            cbar_kws={'label': 'Count'}, linewidths=1)
plt.title('Unigram + Bigram BoW Matrix', fontsize=14, fontweight='bold')
plt.xlabel('Features', fontsize=12)
plt.ylabel('Documents', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

---
## Part 2: TF-IDF (Term Frequency-Inverse Document Frequency)

### What is TF-IDF?

TF-IDF is a statistical measure that evaluates the importance of a word in a document relative to a corpus.

### Components:

1. **Term Frequency (TF)**: How often a word appears in a document
   ```
   TF(t, d) = (Count of term t in document d) / (Total terms in document d)
   ```

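2. **Inverse Document Frequency (IDF)**: How rare a word is across the corpus (rarer words score higher)
   ```
   IDF(t) = log(Total documents / Documents containing term t)
   ```
   The from-scratch implementation in 2.1 uses the smoothed variant `log((N + 1) / (df + 1)) + 1`, which avoids division by zero and keeps every IDF value positive.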

3. **TF-IDF**: Combination of both
   ```
   TF-IDF(t, d) = TF(t, d) × IDF(t)
   ```

### Intuition:
- ✅ High TF-IDF: Word appears frequently in a document but rarely in others → **Important**
- ❌ Low TF-IDF: Word is common across all documents → **Less important** (e.g., "the", "is")
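
The formulas above can be checked by hand on a tiny corpus. The sketch below uses a made-up three-document corpus and hypothetical helper names. It applies the unsmoothed definitions, and also shows the smoothed IDF `log((N + 1) / (df + 1)) + 1` for comparison.

```python
import math
from collections import Counter

# Made-up corpus, already tokenized
docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the bird flew".split(),
]

def tf(term, doc):
    """Count of the term divided by the number of tokens in the document."""
    return Counter(doc)[term] / len(doc)

def idf(term, docs):
    """Unsmoothed IDF; assumes the term occurs in at least one document."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def smoothed_idf(term, docs):
    """Smoothed IDF, which never drops to zero."""
    df = sum(1 for doc in docs if term in doc)
    return math.log((len(docs) + 1) / (df + 1)) + 1

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_the = tfidf("the", docs[0], docs)  # "the" is in every document
score_cat = tfidf("cat", docs[0], docs)  # "cat" is in one document only

print(f"the: {score_the:.3f}")
print(f"cat: {score_cat:.3f}")
print(f"smoothed IDF of 'the': {smoothed_idf('the', docs):.3f}")
```

With the unsmoothed formula, a word that appears in every document gets an IDF of `log(1) = 0` and drops out entirely. The smoothed variant keeps it at a small positive weight instead.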

### 2.1 Implementing TF-IDF from Scratch

In [None]:
class SimpleTFIDF:
    """Simple TF-IDF implementation from scratch"""
    
    def __init__(self, lowercase=True, remove_punctuation=True):
        self.lowercase = lowercase
        self.remove_punctuation = remove_punctuation
        self.vocabulary = None
        self.word_to_idx = None
        self.idf_values = None
        self.n_docs = 0
    
    def preprocess(self, text):
        """Preprocess text"""
        if self.lowercase:
            text = text.lower()
        if self.remove_punctuation:
            text = re.sub(r'[^\w\s]', '', text)
        return text.split()
    
    def compute_tf(self, words):
        """Compute term frequency"""
        tf = {}
        word_count = len(words)
        word_freq = Counter(words)
        
        for word, count in word_freq.items():
            tf[word] = count / word_count
        
        return tf
    
    def compute_idf(self, documents):
        """Compute inverse document frequency"""
        self.n_docs = len(documents)
        doc_freq = Counter()
        
        # Count document frequency for each word
        for doc in documents:
            words = set(self.preprocess(doc))
            doc_freq.update(words)
        
        # Compute IDF
        idf = {}
        for word, freq in doc_freq.items():
            idf[word] = np.log((self.n_docs + 1) / (freq + 1)) + 1  # Smoothed IDF
        
        return idf
    
    def fit(self, documents):
        """Build vocabulary and compute IDF"""
        vocab_set = set()
        for doc in documents:
            words = self.preprocess(doc)
            vocab_set.update(words)
        
        self.vocabulary = sorted(vocab_set)
        self.word_to_idx = {word: idx for idx, word in enumerate(self.vocabulary)}
        self.idf_values = self.compute_idf(documents)
        return self
    
    def transform(self, documents):
        """Transform documents to TF-IDF vectors"""
        if self.vocabulary is None:
            raise ValueError("Vocabulary not built. Call fit() first.")
        
        tfidf_vectors = []
        
        for doc in documents:
            words = self.preprocess(doc)
            tf = self.compute_tf(words)
            
            # Create TF-IDF vector
            vector = [0.0] * len(self.vocabulary)
            for word, tf_value in tf.items():
                if word in self.word_to_idx:
                    idx = self.word_to_idx[word]
                    idf_value = self.idf_values.get(word, 0)
                    vector[idx] = tf_value * idf_value
            
            tfidf_vectors.append(vector)
        
        return np.array(tfidf_vectors)
    
    def fit_transform(self, documents):
        """Fit and transform in one step"""
        return self.fit(documents).transform(documents)
    
    def get_feature_names(self):
        """Get vocabulary words"""
        return self.vocabulary

In [None]:
# Sample documents for TF-IDF
tfidf_docs = [
    "The cat sat on the mat",
    "The dog sat on the log",
    "Cats and dogs are great pets",
    "Python is a great programming language",
]

# Create and fit TF-IDF model
tfidf = SimpleTFIDF()
tfidf_vectors = tfidf.fit_transform(tfidf_docs)

# Create DataFrame
tfidf_df = pd.DataFrame(
    tfidf_vectors,
    columns=tfidf.get_feature_names(),
    index=[f"Doc {i+1}" for i in range(len(tfidf_docs))]
)

print("TF-IDF from Scratch\n")
print(f"Vocabulary size: {len(tfidf.vocabulary)}\n")
print("TF-IDF Matrix:")
print(tfidf_df.round(3))

print("\nOriginal Documents:")
for i, doc in enumerate(tfidf_docs, 1):
    print(f"  Doc {i}: {doc}")

In [None]:
# Visualize IDF values
idf_series = pd.Series(tfidf.idf_values).sort_values(ascending=False)

plt.figure(figsize=(12, 6))
idf_series.plot(kind='bar', color='coral', alpha=0.8)
plt.title('IDF Values for Each Word', fontsize=14, fontweight='bold')
plt.xlabel('Words', fontsize=12)
plt.ylabel('IDF Value', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("  - Higher IDF = Rare word (more informative)")
print("  - Lower IDF = Common word (less informative)")

### 2.2 Step-by-Step TF-IDF Calculation

In [None]:
# Detailed TF-IDF calculation for one document
example_doc = "The cat sat on the mat"
print(f"Document: '{example_doc}'\n")

words = tfidf.preprocess(example_doc)
print(f"Preprocessed words: {words}\n")

# Compute TF
tf_dict = tfidf.compute_tf(words)
print("Term Frequency (TF):")
for word, tf_val in sorted(tf_dict.items()):
    print(f"  {word:12s}: {tf_val:.3f}  (appears {words.count(word)} times out of {len(words)} words)")

# Show IDF
print("\nInverse Document Frequency (IDF):")
for word in sorted(tf_dict.keys()):
    idf_val = tfidf.idf_values.get(word, 0)
    doc_count = sum(1 for doc in tfidf_docs if word in tfidf.preprocess(doc))
    print(f"  {word:12s}: {idf_val:.3f}  (appears in {doc_count}/{len(tfidf_docs)} documents)")

# Compute TF-IDF
print("\nTF-IDF (TF × IDF):")
for word in sorted(tf_dict.keys()):
    tf_val = tf_dict[word]
    idf_val = tfidf.idf_values.get(word, 0)
    tfidf_val = tf_val * idf_val
    print(f"  {word:12s}: {tf_val:.3f} × {idf_val:.3f} = {tfidf_val:.3f}")

### 2.3 Using sklearn's TfidfVectorizer

In [None]:
# Create TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words='english',
    max_features=20,
    ngram_range=(1, 2),  # Unigrams and bigrams
    norm='l2'            # L2 normalization
)

# Use the larger corpus
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)

# Convert to DataFrame
tfidf_df_sklearn = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=tfidf_vectorizer.get_feature_names_out(),
    index=[f"Doc {i+1}" for i in range(len(corpus))]
)

print("sklearn TfidfVectorizer Results\n")
print(f"Shape: {tfidf_matrix.shape}")
print(f"Vocabulary size: {len(tfidf_vectorizer.vocabulary_)}\n")
print("TF-IDF Matrix (rounded to 3 decimals):")
print(tfidf_df_sklearn.round(3))

In [None]:
# Visualize TF-IDF matrix
plt.figure(figsize=(14, 8))
sns.heatmap(tfidf_df_sklearn, annot=True, fmt='.2f', cmap='RdYlGn', 
            cbar_kws={'label': 'TF-IDF Score'}, linewidths=0.5)
plt.title('TF-IDF Matrix Heatmap', fontsize=14, fontweight='bold')
plt.xlabel('Features', fontsize=12)
plt.ylabel('Documents', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

### 2.4 Comparing BoW and TF-IDF
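
One reason the two representations differ on repeated text is L2 normalization, which `TfidfVectorizer` applies by default. The sketch below uses a hypothetical `l2_normalize` helper and assumes both terms carry the same IDF weight. It shows that repeating every word three times leaves the normalized vector unchanged.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)

once = l2_normalize([1, 1])    # each word appears once
thrice = l2_normalize([3, 3])  # each word appears three times

print([round(x, 3) for x in once])
print([round(x, 3) for x in thrice])
```

Normalization keeps only the relative proportions of terms within a document, so the raw-count difference that BoW shows does not carry over.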

In [None]:
# Compare BoW vs TF-IDF on same document
comparison_corpus = [
    "machine learning is great",
    "machine learning machine learning machine learning",  # Repeated words
]

# BoW
bow_comp = CountVectorizer()
bow_result = bow_comp.fit_transform(comparison_corpus)

# TF-IDF
tfidf_comp = TfidfVectorizer()
tfidf_result = tfidf_comp.fit_transform(comparison_corpus)

# Display comparison
print("Comparison: BoW vs TF-IDF\n")
print("Documents:")
for i, doc in enumerate(comparison_corpus, 1):
    print(f"  Doc {i}: {doc}")

print("\nBag-of-Words (Raw Counts):")
bow_comp_df = pd.DataFrame(bow_result.toarray(), 
                           columns=bow_comp.get_feature_names_out(),
                           index=["Doc 1", "Doc 2"])
print(bow_comp_df)

print("\nTF-IDF (Normalized Importance):")
tfidf_comp_df = pd.DataFrame(tfidf_result.toarray(), 
                             columns=tfidf_comp.get_feature_names_out(),
                             index=["Doc 1", "Doc 2"])
print(tfidf_comp_df.round(3))

print("\nKey Observation:")
print("  - BoW: Doc 2 has much higher counts for 'machine' and 'learning'")
print("  - TF-IDF: Each row is L2-normalized, so repeating the same words does not inflate the values")

---
## Part 3: Practical Applications

### 3.1 Document Similarity with Cosine Similarity
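
Cosine similarity measures the angle between two vectors: the dot product divided by the product of their lengths. The sketch below is a plain-Python version for small dense vectors. The `cosine` helper is hypothetical and is not how sklearn computes it internally.

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

same_direction = cosine([1, 2, 0], [2, 4, 0])  # one vector is a multiple of the other
no_overlap = cosine([1, 0, 0], [0, 1, 1])      # no shared non-zero dimensions

print(round(same_direction, 3))
print(no_overlap)
```

Because TF-IDF values are non-negative, similarities between documents fall between 0 (no shared terms) and 1 (same term proportions).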

In [None]:
# Documents for similarity comparison
docs_similarity = [
    "Machine learning is a branch of artificial intelligence",
    "Deep learning is a subset of machine learning",
    "Natural language processing deals with text and speech",
    "Computer vision enables machines to understand images",
    "AI and machine learning are transforming industries",
]

# Compute TF-IDF
tfidf_sim = TfidfVectorizer(stop_words='english')
tfidf_matrix_sim = tfidf_sim.fit_transform(docs_similarity)

# Compute cosine similarity
cosine_sim = cosine_similarity(tfidf_matrix_sim)

# Create DataFrame
similarity_df = pd.DataFrame(
    cosine_sim,
    index=[f"Doc {i+1}" for i in range(len(docs_similarity))],
    columns=[f"Doc {i+1}" for i in range(len(docs_similarity))]
)

print("Document Similarity Matrix (Cosine Similarity)\n")
print(similarity_df.round(3))

print("\nDocuments:")
for i, doc in enumerate(docs_similarity, 1):
    print(f"  Doc {i}: {doc}")

In [None]:
# Visualize similarity matrix
plt.figure(figsize=(10, 8))
sns.heatmap(similarity_df, annot=True, fmt='.3f', cmap='coolwarm', 
            vmin=0, vmax=1, square=True, linewidths=1,
            cbar_kws={'label': 'Cosine Similarity'})
plt.title('Document Similarity Matrix\n(1.0 = Same Direction, 0.0 = No Shared Terms)', 
          fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Find most similar pairs
print("\nMost Similar Document Pairs:\n")
similarity_pairs = []
for i in range(len(cosine_sim)):
    for j in range(i+1, len(cosine_sim)):
        similarity_pairs.append((i+1, j+1, cosine_sim[i][j]))

similarity_pairs.sort(key=lambda x: x[2], reverse=True)

for rank, (doc1, doc2, score) in enumerate(similarity_pairs[:3], 1):
    print(f"{rank}. Doc {doc1} ↔ Doc {doc2}: {score:.3f}")
    print(f"   Doc {doc1}: {docs_similarity[doc1-1]}")
    print(f"   Doc {doc2}: {docs_similarity[doc2-1]}")
    print()

### 3.2 Text Classification with BoW and TF-IDF

In [None]:
# Sample dataset for classification
texts = [
    # Technology
    "Machine learning algorithms can predict outcomes",
    "Artificial intelligence is transforming industries",
    "Deep learning uses neural networks for pattern recognition",
    "Python is a popular programming language for data science",
    "Cloud computing provides scalable infrastructure",
    "Blockchain technology enables secure transactions",
    "Big data analytics helps make informed decisions",
    "IoT devices connect everyday objects to the internet",
    
    # Sports
    "The football team won the championship game",
    "Basketball players train hard for the tournament",
    "The tennis match was very exciting and competitive",
    "Olympic athletes compete for gold medals",
    "The soccer team scored three goals in the match",
    "Baseball season starts in spring every year",
    "Swimmers broke multiple world records at the event",
    "The cricket match lasted for five days",
    
    # Health
    "Regular exercise improves cardiovascular health",
    "A balanced diet is essential for good nutrition",
    "Meditation reduces stress and anxiety levels",
    "Vaccines help prevent infectious diseases",
    "Adequate sleep is crucial for mental health",
    "Yoga improves flexibility and strength",
    "Healthy lifestyle choices reduce disease risk",
    "Regular checkups help detect health issues early",
]

labels = ['tech'] * 8 + ['sports'] * 8 + ['health'] * 8

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Categories: {set(labels)}")

In [None]:
# Compare BoW vs TF-IDF for classification
def train_and_evaluate(vectorizer, vectorizer_name):
    """Train and evaluate classifier"""
    # Transform data
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)
    
    # Train classifier
    classifier = MultinomialNB()
    classifier.fit(X_train_vec, y_train)
    
    # Predict
    y_pred = classifier.predict(X_test_vec)
    
    # Evaluate
    accuracy = np.mean(y_pred == np.array(y_test))  # y_test is a list; convert it for an explicit element-wise comparison
    
    print(f"\n{'='*60}")
    print(f"{vectorizer_name} + Naive Bayes Classifier")
    print(f"{'='*60}")
    print(f"Accuracy: {accuracy:.3f}\n")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    
    return y_pred, accuracy

# Test with BoW
bow_vectorizer = CountVectorizer(stop_words='english', max_features=50)
y_pred_bow, acc_bow = train_and_evaluate(bow_vectorizer, "Bag-of-Words")

# Test with TF-IDF
tfidf_vectorizer_clf = TfidfVectorizer(stop_words='english', max_features=50)
y_pred_tfidf, acc_tfidf = train_and_evaluate(tfidf_vectorizer_clf, "TF-IDF")

In [None]:
# Compare confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# BoW confusion matrix
cm_bow = confusion_matrix(y_test, y_pred_bow, labels=['tech', 'sports', 'health'])
sns.heatmap(cm_bow, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['tech', 'sports', 'health'],
            yticklabels=['tech', 'sports', 'health'])
axes[0].set_title(f'Bag-of-Words\nAccuracy: {acc_bow:.3f}', fontweight='bold')
axes[0].set_ylabel('True Label')
axes[0].set_xlabel('Predicted Label')

# TF-IDF confusion matrix
cm_tfidf = confusion_matrix(y_test, y_pred_tfidf, labels=['tech', 'sports', 'health'])
sns.heatmap(cm_tfidf, annot=True, fmt='d', cmap='Greens', ax=axes[1],
            xticklabels=['tech', 'sports', 'health'],
            yticklabels=['tech', 'sports', 'health'])
axes[1].set_title(f'TF-IDF\nAccuracy: {acc_tfidf:.3f}', fontweight='bold')
axes[1].set_ylabel('True Label')
axes[1].set_xlabel('Predicted Label')

plt.tight_layout()
plt.show()

### 3.3 Feature Importance Analysis

In [None]:
# Train classifier with TF-IDF to analyze feature importance
tfidf_feat = TfidfVectorizer(stop_words='english', max_features=30)
X_train_feat = tfidf_feat.fit_transform(X_train)

classifier_feat = MultinomialNB()
classifier_feat.fit(X_train_feat, y_train)

# Get feature names
feature_names = tfidf_feat.get_feature_names_out()

# Get feature importance for each class
feature_importance = {}
for i, category in enumerate(classifier_feat.classes_):
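    # feature_log_prob_ holds log P(feature | class): high values mark terms that are
    # frequent in this class, not necessarily terms that separate it from other classes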
    log_probs = classifier_feat.feature_log_prob_[i]
    
    # Sort by importance
    top_indices = np.argsort(log_probs)[-10:][::-1]
    top_features = [(feature_names[idx], log_probs[idx]) for idx in top_indices]
    
    feature_importance[category] = top_features

# Display feature importance
print("Top 10 Most Important Features for Each Category:\n")
for category, features in feature_importance.items():
    print(f"\n{category.upper()}:")
    print("-" * 40)
    for rank, (feature, score) in enumerate(features, 1):
        print(f"  {rank:2d}. {feature:20s} (score: {score:.3f})")

In [None]:
# Visualize feature importance
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for idx, (category, features) in enumerate(feature_importance.items()):
    words = [f[0] for f in features]
    scores = [f[1] for f in features]
    
    axes[idx].barh(range(len(words)), scores, color='skyblue', alpha=0.8)
    axes[idx].set_yticks(range(len(words)))
    axes[idx].set_yticklabels(words)
    axes[idx].set_xlabel('Log Probability', fontsize=10)
    axes[idx].set_title(f'{category.upper()}\nTop Features', fontweight='bold', fontsize=12)
    axes[idx].invert_yaxis()
    axes[idx].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

### 3.4 Information Retrieval: Simple Search Engine

In [None]:
class SimpleSearchEngine:
    """Simple search engine using TF-IDF"""
    
    def __init__(self, documents):
        self.documents = documents
        self.vectorizer = TfidfVectorizer(stop_words='english')
        self.doc_vectors = self.vectorizer.fit_transform(documents)
    
    def search(self, query, top_k=3):
        """Search for most relevant documents"""
        # Transform query
        query_vector = self.vectorizer.transform([query])
        
        # Compute similarity
        similarities = cosine_similarity(query_vector, self.doc_vectors)[0]
        
        # Get top k results
        top_indices = np.argsort(similarities)[::-1][:top_k]
        
        results = []
        for idx in top_indices:
            if similarities[idx] > 0:
                results.append({
                    'rank': len(results) + 1,
                    'doc_id': idx,
                    'document': self.documents[idx],
                    'similarity': similarities[idx]
                })
        
        return results
    
    def display_results(self, query, results):
        """Display search results"""
        print(f"Query: '{query}'\n")
        print(f"Found {len(results)} relevant documents:\n")
        print("="*80)
        
        for result in results:
            print(f"\nRank {result['rank']} (Similarity: {result['similarity']:.4f})")
            print(f"Doc {result['doc_id'] + 1}: {result['document']}")  # 1-based, matching the other cells
            print("-"*80)

In [None]:
# Create search engine with all documents
search_engine = SimpleSearchEngine(texts)

# Test queries
queries = [
    "machine learning and artificial intelligence",
    "sports competition and athletes",
    "healthy lifestyle and exercise",
]

for query in queries:
    results = search_engine.search(query, top_k=3)
    search_engine.display_results(query, results)
    print("\n" + "#"*80 + "\n")

---
## Summary and Best Practices

### Bag-of-Words (BoW)

**When to use:**
- ✅ Simple baseline for text classification
- ✅ When word frequency is important
- ✅ Limited computational resources

**Limitations:**
- ❌ Ignores word order and context
- ❌ Doesn't capture semantic relationships
- ❌ High dimensionality (sparse vectors)

### TF-IDF

**When to use:**
- ✅ Information retrieval and search
- ✅ Document similarity tasks
- ✅ When rare words are more important
- ✅ Often a stronger default than raw counts, though results depend on the task and classifier

**Advantages over BoW:**
- ✅ Reduces impact of common words
- ✅ Highlights distinctive terms
- ✅ Better for ranking and retrieval

### Key Comparisons

| Aspect | Bag-of-Words | TF-IDF |
|--------|--------------|--------|
| **Representation** | Raw counts | Weighted importance |
| **Common words** | High values | Downweighted |
| **Rare words** | Low values | Upweighted |
| **Normalization** | None by default | L2 by default in sklearn |
| **Best for** | Classification | Retrieval/Similarity |
| **Complexity** | Simple | Moderate |

### Best Practices

1. **Preprocessing**
   - Remove stop words ("the", "is", "and")
   - Lowercase text
   - Remove punctuation
   - Consider lemmatization/stemming

2. **Feature Selection**
   - Use `max_features` to limit vocabulary size
   - Use `min_df` and `max_df` to filter rare/common words
   - Consider n-grams for capturing phrases

3. **Normalization**
   - L2 normalization for TF-IDF (default in sklearn)
   - Consider binary BoW for presence/absence

4. **Model Selection**
   - Try both BoW and TF-IDF
   - TF-IDF often performs better, but compare both on a held-out set
   - For modern applications, consider word embeddings (Word2Vec, GloVe, BERT)
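
To illustrate the document-frequency filtering mentioned under Feature Selection, the sketch below drops words that appear in too few or too many documents. It is plain Python with a hypothetical `filter_vocabulary` helper. It assumes `min_df` is an absolute document count and `max_df` is a proportion of the corpus.

```python
from collections import Counter

def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep words found in at least min_df documents and at most a max_df share of them."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))  # count each word once per document
    return sorted(
        word for word, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )

docs = ["the cat sat", "the dog sat", "the bird flew"]
kept = filter_vocabulary(docs, min_df=2, max_df=0.9)
print(kept)
```

In this toy corpus "the" is removed for appearing in every document, and the words that occur in a single document are removed as too rare, which leaves only "sat".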

### Moving Beyond Classical Methods

While BoW and TF-IDF are useful, modern NLP often uses:
- **Word Embeddings**: Word2Vec, GloVe, FastText
- **Contextualized Embeddings**: BERT, GPT, RoBERTa
- **Sentence Embeddings**: Sentence-BERT, Universal Sentence Encoder

These methods capture semantic similarity between words, and the contextualized models also account for the surrounding context, which count-based representations cannot do.

---
## Exercises

1. **Preprocessing**: Implement a text preprocessor with stemming/lemmatization and compare results

2. **N-gram Analysis**: Experiment with different n-gram ranges and analyze their impact on classification

3. **Feature Engineering**: Create custom features combining BoW, TF-IDF, and linguistic features

4. **Search Engine**: Extend the search engine with query expansion and relevance feedback

5. **Document Clustering**: Use TF-IDF vectors with K-means clustering to group similar documents

6. **Real Dataset**: Apply these techniques to a real dataset (e.g., 20 Newsgroups, IMDB reviews)

7. **Comparison**: Compare classical methods with modern embeddings on the same task