# Text and Sparse Features in Scikit-learn

## Overview

Most machine learning algorithms cannot consume raw text; it must first be converted to numerical features. Scikit-learn provides three main vectorizers for this:

- **CountVectorizer**: Converts text to a matrix of word counts
- **TfidfVectorizer**: Converts text to TF-IDF weighted features
- **HashingVectorizer**: Fast, memory-efficient approach based on feature hashing

All three return **sparse matrices**: representations that store only the non-zero entries, which is efficient when most values are zero.

## Why Sparse Matrices?

Sparse storage suits text data because:
- **Large vocabulary**: A corpus usually contains thousands of unique words, one column each
- **Sparse representation**: Each document uses only a small subset of the vocabulary
- **Memory efficiency**: A sparse matrix stores only the non-zero values

Example: the document "The cat sat" uses 3 words of a 10,000-word vocabulary, so its row is 99.97% zeros.
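
As a rough illustration of that figure, the sketch below builds a single-row CSR matrix with three non-zero entries in a 10,000-column space and measures its sparsity and storage. The column indices are arbitrary placeholders, not positions from a real vocabulary.

```python
import numpy as np
from scipy.sparse import csr_matrix

vocab_size = 10_000
# Hypothetical column indices for "the", "cat", and "sat" (chosen arbitrarily)
cols = np.array([17, 4203, 9001])
rows = np.zeros(3, dtype=int)
vals = np.ones(3, dtype=np.int64)

doc_row = csr_matrix((vals, (rows, cols)), shape=(1, vocab_size))

sparsity = 1 - doc_row.nnz / (doc_row.shape[0] * doc_row.shape[1])
print(f"Non-zero entries: {doc_row.nnz}")
print(f"Sparsity: {sparsity:.2%}")

# CSR keeps three small arrays instead of 10,000 cells
sparse_bytes = doc_row.data.nbytes + doc_row.indices.nbytes + doc_row.indptr.nbytes
dense_bytes = vocab_size * doc_row.dtype.itemsize
print(f"Sparse: {sparse_bytes} bytes, dense: {dense_bytes} bytes")
```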

## Setup and Sample Data

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from scipy.sparse import csr_matrix
import matplotlib.pyplot as plt

# Sample text corpus
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

print("Sample Corpus:")
for i, doc in enumerate(corpus, 1):
    print(f"{i}. {doc}")

## 1. CountVectorizer - Bag of Words

Converts text to a matrix of token counts (word frequency).

**Key Parameters:**
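- `max_features`: Keep only the top N terms, ranked by frequency across the corpus
- `min_df`: Ignore terms that appear in fewer than N documents (int) or in less than a given proportion of documents (float)
- `max_df`: Ignore terms that appear in more than N documents (int) or in more than a given proportion of documents (float, e.g. 0.8 = 80%)
- `ngram_range`: Range of n-gram sizes to extract: (1,1)=unigrams only, (1,2)=unigrams+bigrams
- `stop_words`: Remove common words ('english' or a custom list)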

In [None]:
# Basic CountVectorizer
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(corpus)

print("CountVectorizer Results:")
print(f"Shape: {X_counts.shape} (documents x vocabulary)")
print(f"Vocabulary size: {len(count_vec.vocabulary_)}")
print(f"Sparse matrix density: {X_counts.nnz / (X_counts.shape[0] * X_counts.shape[1]):.2%}")
print(f"\nFeature names (vocabulary): {count_vec.get_feature_names_out()}")

# Convert to dense for visualization
df_counts = pd.DataFrame(
    X_counts.toarray(),
    columns=count_vec.get_feature_names_out(),
    index=[f"Doc {i+1}" for i in range(len(corpus))]
)
print("\nWord Count Matrix:")
print(df_counts)

In [None]:
# Advanced CountVectorizer with parameters
count_vec_adv = CountVectorizer(
    max_features=10,        # Keep only top 10 words
    min_df=1,               # Word must appear in at least 1 doc
    max_df=0.8,             # Ignore words in >80% of docs
    ngram_range=(1, 2),     # Include unigrams and bigrams
    stop_words='english'    # Remove English stop words
)

X_counts_adv = count_vec_adv.fit_transform(corpus)

print("Advanced CountVectorizer (with bigrams, stop words removed):")
print(f"Vocabulary: {count_vec_adv.get_feature_names_out()}")
print(f"\nShape: {X_counts_adv.shape}")

df_counts_adv = pd.DataFrame(
    X_counts_adv.toarray(),
    columns=count_vec_adv.get_feature_names_out(),
    index=[f"Doc {i+1}" for i in range(len(corpus))]
)
print("\nWord Count Matrix (with n-grams):")
print(df_counts_adv)

## 2. TfidfVectorizer - Term Frequency-Inverse Document Frequency

**TF-IDF** weights each word by its importance:
- **TF (Term Frequency)**: How often the word appears in a document
- **IDF (Inverse Document Frequency)**: How rare the word is across all documents

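**Formula** (scikit-learn's default, with `smooth_idf=True`):
\[
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)
\]

\[
\text{IDF}(t) = \log\left(\frac{1 + n}{1 + \text{df}(t)}\right) + 1
\]

Here \(n\) is the number of documents, \(\text{df}(t)\) is the number of documents containing term \(t\), and \(\log\) is the natural logarithm. With the default `norm='l2'`, each document vector is then scaled to unit length.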

**Intuition:**
- Common words ("the", "is") → Low TF-IDF
- Rare, distinctive words → High TF-IDF
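
The formula can be checked by hand. The sketch below recomputes the smoothed IDF and the L2-normalized TF-IDF matrix from raw counts and compares them with what `TfidfVectorizer` produces under its default settings. It repeats the sample corpus so that it runs on its own, and it converts to dense arrays only because the data is tiny.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Raw term counts and document frequencies
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed IDF with the natural logarithm
idf_manual = np.log((1 + n_docs) / (1 + df)) + 1

# TF-IDF = TF * IDF, followed by L2 normalization of each row
tfidf_manual = counts * idf_manual
tfidf_manual = tfidf_manual / np.linalg.norm(tfidf_manual, axis=1, keepdims=True)

# Compare against scikit-learn's defaults
vec = TfidfVectorizer()
tfidf_sklearn = vec.fit_transform(corpus).toarray()

print("IDF matches:   ", np.allclose(idf_manual, vec.idf_))
print("TF-IDF matches:", np.allclose(tfidf_manual, tfidf_sklearn))
```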

In [None]:
# Basic TfidfVectorizer
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(corpus)

print("TfidfVectorizer Results:")
print(f"Shape: {X_tfidf.shape}")
print(f"Vocabulary: {tfidf_vec.get_feature_names_out()}")

# Convert to DataFrame
df_tfidf = pd.DataFrame(
    X_tfidf.toarray(),
    columns=tfidf_vec.get_feature_names_out(),
    index=[f"Doc {i+1}" for i in range(len(corpus))]
)
print("\nTF-IDF Matrix (values are weighted):")
print(df_tfidf.round(3))

In [None]:
# Compare Count vs TF-IDF for a specific word
word = 'document'

if word in count_vec.vocabulary_:
    word_idx = count_vec.vocabulary_[word]
    
    print(f"Comparison for word: '{word}'")
    print("\nDocument | Count | TF-IDF")
    print("-" * 35)
    for i in range(len(corpus)):
        count_val = X_counts[i, word_idx]
        tfidf_val = X_tfidf[i, word_idx]
        print(f"Doc {i+1}    |   {count_val}   | {tfidf_val:.4f}")
    
    print("\n💡 Notice: 'document' occurs in 3 of 4 documents, so its IDF is low")
    print("   Terms found in fewer documents get a higher IDF weight.")

In [None]:
# Advanced TfidfVectorizer
tfidf_vec_adv = TfidfVectorizer(
    max_features=50,
    min_df=1,
    max_df=0.8,
    ngram_range=(1, 2),
    sublinear_tf=True,      # Apply log scaling to TF
    use_idf=True,           # Enable IDF weighting
    smooth_idf=True,        # Smooth IDF weights
    norm='l2'               # L2 normalization
)

X_tfidf_adv = tfidf_vec_adv.fit_transform(corpus)

print("Advanced TF-IDF Configuration:")
print(f"Features: {tfidf_vec_adv.get_feature_names_out()}")
print(f"\nIDF values (higher = rarer word):")
idf_df = pd.DataFrame({
    'term': tfidf_vec_adv.get_feature_names_out(),
    'idf': tfidf_vec_adv.idf_
}).sort_values('idf', ascending=False)
print(idf_df)

## 3. HashingVectorizer - Memory-Efficient Alternative

**Advantages:**
- No vocabulary storage (memory efficient)
- Fast - no need to fit vocabulary
- Scalable to large datasets
- Supports online/streaming learning

**Disadvantages:**
- Hash collisions (multiple words → same feature)
- No inverse transform (can't retrieve original words)
- Fixed feature space size
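
A quick way to see collisions is to shrink the hash space until they are unavoidable. This sketch uses a made-up word list and a deliberately tiny `n_features=4`, so ten words must share at most four columns. Real applications use a far larger hash space, where collisions are much rarer.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Made-up word list for illustration
words = ["apple", "banana", "cherry", "grape", "lemon",
         "mango", "orange", "peach", "plum", "melon"]

# Deliberately tiny hash space; norm=None and alternate_sign=False keep raw counts
tiny_hasher = HashingVectorizer(n_features=4, norm=None, alternate_sign=False)

# Each single-word document has exactly one non-zero column: that word's bucket
buckets = {w: int(tiny_hasher.transform([w]).indices[0]) for w in words}
for word, bucket in buckets.items():
    print(f"{word:8s} -> column {bucket}")

n_buckets_used = len(set(buckets.values()))
print(f"{len(words)} words share {n_buckets_used} columns")
```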

**When to use:**
- Very large datasets (millions of documents)
- Online/streaming scenarios
- Memory constraints
- Features do not need to be interpreted
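
The online/streaming case might look like the sketch below. Because `HashingVectorizer` is stateless, each mini-batch can be transformed independently and passed to an estimator that supports `partial_fit`. The batches are made-up toy examples, and `SGDClassifier` is one possible choice of incremental learner rather than the only one.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy mini-batches standing in for a data stream (made-up examples)
# Label 0 = space, label 1 = baseball
batches = [
    (["the rocket reached orbit", "the pitcher threw a strike"], [0, 1]),
    (["the satellite left orbit", "a home run in the ninth inning"], [0, 1]),
    (["astronauts boarded the rocket", "the batter hit a home run"], [0, 1]),
]

hasher = HashingVectorizer(n_features=2**10)  # stateless: no fit() required
clf = SGDClassifier(random_state=42)

for texts, labels in batches:
    X_batch = hasher.transform(texts)  # each batch is hashed independently
    # All possible classes must be declared, since a batch may not contain every label
    clf.partial_fit(X_batch, labels, classes=[0, 1])

new_docs = ["the rocket launch", "a strike and a home run"]
predictions = clf.predict(hasher.transform(new_docs))
print(predictions)
```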

In [None]:
# HashingVectorizer
hash_vec = HashingVectorizer(
    n_features=20,          # Hash space size (trade-off: collisions vs memory)
    alternate_sign=True     # Reduce impact of collisions
)

X_hash = hash_vec.transform(corpus)  # No fit() needed!

print("HashingVectorizer Results:")
print(f"Shape: {X_hash.shape}")
print(f"No vocabulary stored (memory efficient!)")
print(f"\nHash Matrix (first 3 docs, first 10 features):")
print(X_hash[:3, :10].toarray())

print("\n⚠️ Note: Cannot retrieve feature names (no vocabulary)")
print("✓ But much faster and more memory efficient!")

## 4. Real-World Example: 20 Newsgroups Classification

Let's classify news articles using text vectorization:

In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load subset of categories
categories = ['alt.atheism', 'sci.space', 'rec.sport.baseball', 'comp.graphics']

print("Loading 20 Newsgroups dataset...")
newsgroups_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    remove=('headers', 'footers', 'quotes'),
    random_state=42
)

newsgroups_test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    remove=('headers', 'footers', 'quotes'),
    random_state=42
)

print(f"\nTraining samples: {len(newsgroups_train.data)}")
print(f"Test samples: {len(newsgroups_test.data)}")
print(f"Categories: {newsgroups_train.target_names}")

# Show sample
print(f"\nSample document (category: {newsgroups_train.target_names[newsgroups_train.target[0]]}):")
print(newsgroups_train.data[0][:300] + "...")

In [None]:
# Method 1: CountVectorizer + Naive Bayes
print("=" * 60)
print("Method 1: CountVectorizer + Multinomial Naive Bayes")
print("=" * 60)

count_vectorizer = CountVectorizer(
    max_features=5000,
    min_df=5,
    max_df=0.7,
    stop_words='english'
)

X_train_counts = count_vectorizer.fit_transform(newsgroups_train.data)
X_test_counts = count_vectorizer.transform(newsgroups_test.data)

print(f"Vocabulary size: {len(count_vectorizer.vocabulary_)}")
print(f"Training matrix shape: {X_train_counts.shape}")
print(f"Sparsity: {(1 - X_train_counts.nnz / (X_train_counts.shape[0] * X_train_counts.shape[1])):.2%}")

# Train classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_counts, newsgroups_train.target)

# Predict
y_pred_nb = nb_classifier.predict(X_test_counts)
accuracy_nb = accuracy_score(newsgroups_test.target, y_pred_nb)

print(f"\nAccuracy: {accuracy_nb:.4f}")
print("\nClassification Report:")
print(classification_report(newsgroups_test.target, y_pred_nb, 
                          target_names=newsgroups_test.target_names))

In [None]:
# Method 2: TF-IDF + Logistic Regression
print("=" * 60)
print("Method 2: TF-IDF + Logistic Regression")
print("=" * 60)

tfidf_vectorizer = TfidfVectorizer(
    max_features=5000,
    min_df=5,
    max_df=0.7,
    stop_words='english',
    ngram_range=(1, 2),
    sublinear_tf=True
)

X_train_tfidf = tfidf_vectorizer.fit_transform(newsgroups_train.data)
X_test_tfidf = tfidf_vectorizer.transform(newsgroups_test.data)

print(f"Vocabulary size: {len(tfidf_vectorizer.vocabulary_)}")
print(f"Training matrix shape: {X_train_tfidf.shape}")

# Train classifier
lr_classifier = LogisticRegression(max_iter=1000, random_state=42)
lr_classifier.fit(X_train_tfidf, newsgroups_train.target)

# Predict
y_pred_lr = lr_classifier.predict(X_test_tfidf)
accuracy_lr = accuracy_score(newsgroups_test.target, y_pred_lr)

print(f"\nAccuracy: {accuracy_lr:.4f}")
print("\nClassification Report:")
print(classification_report(newsgroups_test.target, y_pred_lr,
                          target_names=newsgroups_test.target_names))

In [None]:
# Method 3: HashingVectorizer + Logistic Regression
print("=" * 60)
print("Method 3: HashingVectorizer + Logistic Regression")
print("=" * 60)

hash_vectorizer = HashingVectorizer(
    n_features=2**12,  # 4096 features
    alternate_sign=True,
    stop_words='english'
)

X_train_hash = hash_vectorizer.transform(newsgroups_train.data)
X_test_hash = hash_vectorizer.transform(newsgroups_test.data)

print(f"Hash space size: {X_train_hash.shape[1]}")
print(f"Training matrix shape: {X_train_hash.shape}")
print("No vocabulary stored (stateless: no fit step required)")

# Train classifier
lr_hash = LogisticRegression(max_iter=1000, random_state=42)
lr_hash.fit(X_train_hash, newsgroups_train.target)

# Predict
y_pred_hash = lr_hash.predict(X_test_hash)
accuracy_hash = accuracy_score(newsgroups_test.target, y_pred_hash)

print(f"\nAccuracy: {accuracy_hash:.4f}")
print("\nClassification Report:")
print(classification_report(newsgroups_test.target, y_pred_hash,
                          target_names=newsgroups_test.target_names))

In [None]:
# Compare all methods
print("\n" + "=" * 60)
print("COMPARISON SUMMARY")
print("=" * 60)

results = pd.DataFrame({
    'Method': ['CountVec + NaiveBayes', 'TF-IDF + LogReg', 'Hashing + LogReg'],
    'Accuracy': [accuracy_nb, accuracy_lr, accuracy_hash],
    'Features': [X_train_counts.shape[1], X_train_tfidf.shape[1], X_train_hash.shape[1]]
})

print(results.to_string(index=False))
print("\n💡 Insights:")
print("  - TF-IDF often performs best for text classification")
print("  - HashingVectorizer trades slight accuracy for speed/memory")
print("  - CountVectorizer + Naive Bayes is fast and interpretable")

## 5. Analyzing Important Features

Extract most important words per category:

In [None]:
# Get top features for TF-IDF model
def show_top_features(classifier, vectorizer, categories, n=10):
    """Print the top n features (words) for each category.

    Assumes a multiclass linear model whose coef_ has one row per category.
    """
    if not hasattr(classifier, 'coef_'):
        return  # Nothing to show for models without linear coefficients
    feature_names = vectorizer.get_feature_names_out()

    for i, category in enumerate(categories):
        # Indices of the n largest coefficients, in descending order
        top_indices = np.argsort(classifier.coef_[i])[-n:][::-1]
        top_features = [feature_names[idx] for idx in top_indices]
        top_scores = classifier.coef_[i][top_indices]

        print(f"\n{category}:")
        for feat, score in zip(top_features, top_scores):
            print(f"  {feat:20s} {score:.4f}")

print("Top 10 Features per Category (TF-IDF + LogReg):")
print("=" * 60)
show_top_features(lr_classifier, tfidf_vectorizer, newsgroups_test.target_names)

## 6. Working with Sparse Matrices

Understanding sparse matrix operations:

In [None]:
from scipy.sparse import csr_matrix, save_npz, load_npz

# Create sample sparse matrix
vec = TfidfVectorizer(max_features=1000)
X_sparse = vec.fit_transform(newsgroups_train.data[:100])

print("Sparse Matrix Operations:")
print(f"Matrix shape: {X_sparse.shape}")
print(f"Matrix type: {type(X_sparse)}")
print(f"Number of non-zero elements: {X_sparse.nnz}")
print(f"Sparsity: {(1 - X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1])):.2%}")

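# Memory comparison: CSR keeps three arrays (data, indices, indptr)
sparse_size = (X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes) / 1024**2
# Size of the equivalent dense array, computed without materializing it
dense_size = X_sparse.shape[0] * X_sparse.shape[1] * X_sparse.dtype.itemsize / 1024**2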

print(f"\nMemory Usage:")
print(f"  Sparse: {sparse_size:.2f} MB")
print(f"  Dense:  {dense_size:.2f} MB")
print(f"  Savings: {(1 - sparse_size/dense_size):.1%}")

# Common operations
print(f"\nCommon Operations:")
print(f"  Get element: X_sparse[0, 10] = {X_sparse[0, 10]:.4f}")
print(f"  Slice row: {X_sparse[0].shape}")
print(f"  Convert to dense: {X_sparse.toarray().shape}")
print(f"  Matrix multiplication: X_sparse @ X_sparse.T shape = {(X_sparse @ X_sparse.T).shape}")
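
Sparse matrices can also be written to disk without converting them to dense arrays. The following sketch round-trips a small TF-IDF matrix through `save_npz` and `load_npz`. The documents and the file name are made up for illustration, and a temporary directory keeps the example self-contained.

```python
import os
import tempfile

from scipy.sparse import save_npz, load_npz
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents for illustration
docs = ["sparse matrices store only non-zero values",
        "dense arrays store every value"]
X = TfidfVectorizer().fit_transform(docs)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "features.npz")  # hypothetical file name
    save_npz(path, X)                              # compressed by default
    X_loaded = load_npz(path)

# The round trip preserves shape, format, and every stored value
same_values = (X != X_loaded).nnz == 0
print(X_loaded.shape, X_loaded.format, same_values)
```

Note that this saves only the matrix. The fitted vectorizer has to be persisted separately if new text must be transformed later.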

## Key Takeaways

### Choosing the Right Vectorizer

| Vectorizer | Best For | Pros | Cons |
|------------|----------|------|------|
| **CountVectorizer** | Small datasets, interpretability | Simple, fast, interpretable | Doesn't weight importance |
| **TfidfVectorizer** | Most text classification tasks | Weights word importance | Slightly slower |
| **HashingVectorizer** | Large/streaming data | Memory efficient, fast | Hash collisions, not invertible |

### Best Practices

1. **Consider removing stop words** (`stop_words='english'`), after checking that the list suits your domain
2. **Set min_df** to filter rare words (noise)
3. **Set max_df** to filter very common words
4. **Use max_features** to control vocabulary size
5. **Try n-grams** (bigrams can capture phrases)
6. **Keep sparse format** (don't convert to dense unless necessary)
7. **TF-IDF often outperforms** simple counts, but compare both on your own data

### Parameter Tuning Tips

```python
# Good starting point
TfidfVectorizer(
    max_features=5000,      # Reasonable vocabulary size
    min_df=5,               # Remove very rare words
    max_df=0.7,             # Remove very common words
    ngram_range=(1, 2),     # Include bigrams
    stop_words='english',   # Remove stop words
    sublinear_tf=True       # Dampen term frequency
)
```
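
Rather than fixing these values by hand, one option is to tune them with cross-validation. The sketch below wraps the vectorizer and a classifier in a `Pipeline` and searches over two vectorizer settings with `GridSearchCV`. The toy texts, labels, and the small grid are made up for illustration; a real search would use the full training set and more folds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up toy data: label 0 = space, label 1 = baseball
texts = [
    "the rocket reached orbit", "the satellite left orbit",
    "astronauts boarded the rocket", "the probe reached mars orbit",
    "the pitcher threw a strike", "a home run in the ninth inning",
    "the batter hit a home run", "the pitcher walked the batter",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Vectorizer settings are tuned like any other hyperparameter
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__sublinear_tf": [True, False],
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Keeping the vectorizer inside the pipeline means the vocabulary and IDF weights are refit on each training fold, so nothing from the validation fold leaks into the features.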