# Day 5: Text Vectorization & Topic Modeling
**The AI Engineer Course 2026 - Sections 24 & 25**

**Student:** Natruja

**Date:** Monday, February 16, 2026

---

## Learning Objectives
1. Understand text vectorization (converting words to numbers)
2. Learn Bag of Words (BoW) and TF-IDF
3. Understand topic modeling concepts
4. Use Latent Dirichlet Allocation (LDA)
5. Apply Non-Negative Matrix Factorization (NMF)

## Setup: Install and Import Required Libraries

In [None]:
import subprocess
import sys

# Install scikit-learn and NLTK
subprocess.check_call([sys.executable, "-m", "pip", "install", "scikit-learn", "nltk", "-q"])

# Download NLTK data
import nltk
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)

print("✓ Libraries installed successfully!")

In [None]:
# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import numpy as np
import nltk

print("✓ All imports successful!")

## Why Convert Text to Numbers?

Machine learning algorithms work with numbers, not text. We need to convert words into numerical representations.

### The Problem:
- Algorithms need vectors (lists of numbers)
- Can't process raw text directly
- Need to preserve word meaning and relationships

### The Solution:
- **Vectorization**: Convert documents into vectors
- Each number represents how often (BoW) or how strongly (TF-IDF) a word figures in the document
- Algorithms can now compare documents numerically
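
To make "compare documents numerically" concrete, here is a minimal sketch that vectorizes three made-up sentences and measures how close they are. Cosine similarity is my choice of comparison for illustration; the lesson itself does not prescribe a metric.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three toy documents (invented for this sketch)
docs = [
    "cats chase mice",
    "cats chase birds",
    "stocks fell sharply",
]

# Turn each document into a vector of word counts
vectors = CountVectorizer().fit_transform(docs)

# Pairwise cosine similarity: 1.0 = same direction, 0.0 = no shared words
similarity = cosine_similarity(vectors)

print(similarity.round(2))
```

The first two sentences share two of their three words, so their score is well above zero, while the third shares nothing with either and scores 0.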

## Method 1: Bag of Words (BoW) - CountVectorizer

**Bag of Words** counts how many times each word appears in a document.

### How It Works:
1. Create vocabulary of all unique words
2. For each document, count word occurrences
3. Create vector with word counts
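
The three steps above can be sketched in plain Python before handing the job to scikit-learn. The two sentences are invented, and splitting on whitespace is a simplification: `CountVectorizer` uses its own tokenizer, which by default also drops one-character tokens such as "I".

```python
from collections import Counter

docs = ["the cat sat", "the cat saw the dog"]

# Step 1: vocabulary of all unique words, sorted for a stable column order
tokenized = [doc.lower().split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Steps 2 and 3: count occurrences and lay them out in vocabulary order
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)  # missing words count as 0
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for doc, vector in zip(docs, vectors):
    print(f"{doc!r} -> {vector}")
```

Every document gets a vector of the same length (one slot per vocabulary word), which is what lets them be stacked into a matrix.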

### Example:
- Vocabulary (column order): [python, i, love, machine, learning]
- Doc1: "I love machine learning" → [0, 1, 1, 1, 1]
- Doc2: "I love Python" → [1, 1, 1, 0, 0]

### Pros & Cons:
- Pros: Simple, fast, and a solid baseline for many text tasks
- Cons: Ignores word order, and gives a filler word like "the" the same weight as an informative word like "important"

## EXAMPLE: Bag of Words

In [None]:
# Sample documents
documents = [
    "machine learning is fun",
    "machine learning is powerful",
    "deep learning is amazing",
    "python is great for machine learning"
]

# Create CountVectorizer
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

print("Bag of Words Vectorization:")
print("="*60)
print(f"Vocabulary: {feature_names}")
print(f"\nWord count vectors (shape: {bow_matrix.shape}):")
print(bow_matrix.toarray())

# Display in readable format
print("\nDetailed view:")
for i, doc in enumerate(documents):
    print(f"\nDoc {i+1}: \"{doc}\"")
    print(f"  Vector: {bow_matrix[i].toarray()[0]}")

## Method 2: TF-IDF (Term Frequency-Inverse Document Frequency)

**TF-IDF** gives more weight to important, unique words and less to common words.

### How It Works:
- **Term Frequency (TF)**: How often a word appears in a document
- **Inverse Document Frequency (IDF)**: How rare the word is across all documents
- **TF-IDF**: TF × IDF (words that appear often in one document but rarely across the collection get high scores)
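
The definitions above can be worked through by hand. The sketch below uses the textbook formula `idf = log(N / df)` on three invented sentences; the helper names `tf`, `idf`, and `tf_idf` are mine. Note that `TfidfVectorizer` defaults to a smoothed variant, `ln((1 + N) / (1 + df)) + 1`, and L2-normalizes each row, so its numbers will not match these.

```python
import math

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the quantum cat",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

def tf(word, tokens):
    # Term frequency: share of the document's tokens that are this word
    return tokens.count(word) / len(tokens)

def idf(word):
    # Inverse document frequency: log(N / number of docs containing the word)
    # Assumes the word occurs in at least one document (otherwise df is 0)
    df = sum(1 for tokens in tokenized if word in tokens)
    return math.log(n_docs / df)

def tf_idf(word, tokens):
    return tf(word, tokens) * idf(word)

last_doc = tokenized[2]
for word in ["the", "cat", "quantum"]:
    print(f"{word:8s} idf={idf(word):.3f}  tf-idf={tf_idf(word, last_doc):.3f}")
```

All three words occur once in the last sentence, so their TF is equal; the scores differ only because "the" is in every document, "cat" in two, and "quantum" in one.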

### Example:
- "the" appears 100 times in every document → Low IDF, low TF-IDF
- "quantum" appears 5 times in one document → High IDF, high TF-IDF

### Why Better Than BoW:
- Reduces importance of common words (the, a, is)
- Highlights distinctive words
- Better for topic detection and classification

## EXAMPLE: TF-IDF Vectorization

In [None]:
# Same documents
documents = [
    "machine learning is fun",
    "machine learning is powerful",
    "deep learning is amazing",
    "python is great for machine learning"
]

# Create TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)

# Get feature names
feature_names = tfidf_vectorizer.get_feature_names_out()

print("TF-IDF Vectorization:")
print("="*60)
print(f"\nTF-IDF scores (shape: {tfidf_matrix.shape}):")
print(tfidf_matrix.toarray())

# Show top words per document
print("\nTop 3 important words per document:")
for i, doc in enumerate(documents):
    print(f"\nDoc {i+1}: \"{doc}\"")
    # Convert the row to a dense array once, then take the 3 highest scores
    row = tfidf_matrix[i].toarray()[0]
    top_indices = row.argsort()[-3:][::-1]  # highest score first
    for idx in top_indices:
        if row[idx] > 0:
            print(f"  {feature_names[idx]}: {row[idx]:.3f}")

## Topic Modeling: Finding Hidden Themes

**Topic Modeling** discovers abstract topics in a collection of documents.

### What is a Topic?
- A collection of words that frequently appear together
- Example: {"machine", "learning", "algorithm", "data"}
- Documents can have multiple topics

### Real-World Applications:
- News categorization
- Document organization
- Discovering trends
- Understanding document collections

### Two Popular Algorithms:
1. **LDA (Latent Dirichlet Allocation)**: Probabilistic approach
2. **NMF (Non-Negative Matrix Factorization)**: Matrix decomposition

## EXAMPLE: LDA Topic Modeling

In [None]:
# Sample documents
documents = [
    "machine learning is a subset of artificial intelligence",
    "deep learning uses neural networks for data analysis",
    "python is popular for machine learning and data science",
    "neural networks are inspired by biological neurons",
    "data science involves statistics and programming",
    "natural language processing helps machines understand text"
]

# Vectorize using CountVectorizer
vectorizer = CountVectorizer(max_df=0.8, min_df=1, stop_words='english')
doc_term_matrix = vectorizer.fit_transform(documents)

# Fit LDA model (2 topics)
lda = LatentDirichletAllocation(n_components=2, random_state=42, max_iter=10)
lda.fit(doc_term_matrix)

# Get feature names
feature_names = vectorizer.get_feature_names_out()

print("LDA Topic Modeling:")
print("="*60)
print(f"\nFound 2 topics from {len(documents)} documents")
print(f"\nTop words per topic:")

# Display topics
n_top_words = 5
for topic_idx, topic in enumerate(lda.components_):
    top_words_idx = topic.argsort()[-n_top_words:][::-1]
    top_words = [feature_names[i] for i in top_words_idx]
    print(f"\nTopic {topic_idx + 1}: {', '.join(top_words)}")

# Show topic distribution per document
print(f"\n\nTopic distribution per document:")
doc_topic = lda.transform(doc_term_matrix)
for i, doc in enumerate(documents):
    print(f"\nDoc {i+1}: \"{doc[:50]}...\"")
    print(f"  Topic 1: {doc_topic[i][0]:.2f}")
    print(f"  Topic 2: {doc_topic[i][1]:.2f}")

# EXERCISES: Organized by Difficulty

## ⭐ EASY: Exercise 1 - Create CountVectorizer on Simple Docs

In [None]:
# Create a simple CountVectorizer and fit on documents

docs = [
    "I like cats",
    "I like dogs",
    "cats and dogs"
]

# TODO: Create a CountVectorizer object
vec = ___

# TODO: Fit and transform the documents
matrix = ___

print(f"Success! Matrix shape: {matrix.shape}")

## ⭐ EASY: Exercise 2 - Print Vocabulary (Feature Names)

In [None]:
# Print the vocabulary (all unique words) from a vectorizer

docs = [
    "hello world",
    "world of python",
    "python hello"
]

vec = CountVectorizer()
matrix = vec.fit_transform(docs)

# TODO: Get the feature names (vocabulary) using get_feature_names_out()
words = ___

# TODO: Print the vocabulary
print("Vocabulary:")
print(___)

## ⭐ EASY: Exercise 3 - Check Matrix Shape

In [None]:
# Understand the shape of the vectorized matrix

docs = [
    "apple orange",
    "banana apple",
    "orange banana apple",
    "apple apple banana"
]

vec = CountVectorizer()
matrix = vec.fit_transform(docs)

# TODO: Get the shape of the matrix
shape = ___

print(f"Matrix shape: {shape}")
print(f"Number of documents: {shape[0]}")
print(f"Number of unique words: {shape[1]}")

# TODO: Print what the shape means
print(f"Interpretation: We have {shape[0]} documents and {shape[1]} unique words")

## ⭐ EASY: Exercise 4 - Create TfidfVectorizer

In [None]:
# Create a TfidfVectorizer and fit on documents

docs = [
    "the quick brown fox",
    "the lazy dog",
    "quick brown dog"
]

# TODO: Create a TfidfVectorizer object
tfidf_vec = ___

# TODO: Fit and transform the documents
tfidf_matrix = ___

print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"\nTF-IDF matrix (converted to array):")
print(tfidf_matrix.toarray())

## ⭐ EASY: Exercise 5 - Stop Words Comparison

In [None]:
# Compare vectorization with and without stop words

docs = [
    "the machine learning is important",
    "machine learning is powerful",
    "important learning algorithms"
]

# TODO: Create CountVectorizer WITHOUT stop words
vec_no_stop = ___
matrix_no_stop = vec_no_stop.fit_transform(docs)

# TODO: Create CountVectorizer WITH English stop words
vec_with_stop = ___
matrix_with_stop = vec_with_stop.fit_transform(docs)

print(f"Without stop words: {len(vec_no_stop.get_feature_names_out())} words")
print(f"Vocabulary: {list(vec_no_stop.get_feature_names_out())}")
print(f"\nWith stop words: {len(vec_with_stop.get_feature_names_out())} words")
print(f"Vocabulary: {list(vec_with_stop.get_feature_names_out())}")

## ⭐⭐ MEDIUM: Exercise 6 - Compare BoW vs TF-IDF on Same Docs

In [None]:
# Compare Bag of Words and TF-IDF on the same documents

docs = [
    "data science data analysis",
    "machine learning models",
    "data science and machine learning"
]

# TODO: Create and fit CountVectorizer
bow_vec = ___
bow_matrix = ___

# TODO: Create and fit TfidfVectorizer
tfidf_vec = ___
tfidf_matrix = ___

print("Bag of Words (word counts):")
print(bow_matrix.toarray())

print("\nTF-IDF (weighted importance):")
print(tfidf_matrix.toarray())

print("\nNote: TF-IDF reduces weight on 'data' because it appears in multiple docs")

## ⭐⭐ MEDIUM: Exercise 7 - Use Bigrams with ngram_range

In [None]:
# Create vectors with bigrams (2-word phrases)

docs = [
    "machine learning is fun",
    "deep learning networks",
    "machine learning algorithms"
]

# TODO: Create CountVectorizer with ngram_range for bigrams (1,2)
# This should capture both individual words and 2-word phrases
vec = ___
matrix = vec.fit_transform(docs)

# TODO: Get feature names to see the bigrams
features = ___

print(f"Features (words + bigrams): {features}")
print(f"\nTotal features: {len(features)}")
print(f"\nMatrix shape: {matrix.shape}")

## ⭐⭐ MEDIUM: Exercise 8 - Limit Vocabulary with max_features

In [None]:
# Limit vocabulary size using max_features parameter

docs = [
    "python java javascript ruby programming languages",
    "data science machine learning analytics",
    "web development frontend backend",
    "artificial intelligence neural networks deep learning"
]

# TODO: Create CountVectorizer with max_features=5 (keep only top 5 words)
vec = ___
matrix = vec.fit_transform(docs)

# TODO: Get the feature names
words = ___

print(f"Vocabulary limited to {len(words)} words:")
print(words)
print(f"\nMatrix shape: {matrix.shape}")
print("Note: Only the most frequent words are kept")

## ⭐⭐ MEDIUM: Exercise 9 - Convert Sparse Matrix to Array and Inspect

In [None]:
# Convert sparse matrix to dense array and inspect individual values

docs = [
    "cat dog pet",
    "dog animal",
    "cat pet animal"
]

vec = CountVectorizer()
sparse_matrix = vec.fit_transform(docs)

# TODO: Convert sparse matrix to dense array using .toarray()
dense_array = ___

# TODO: Print the dense array
print("Dense array:")
print(___)

# TODO: Get and print a single document's vector (first document)
first_doc_vector = ___
print(f"\nFirst document vector: {first_doc_vector}")

# TODO: Get value at position [1, 2] (second doc, third word)
value = ___
print(f"\nValue at position [1, 2]: {value}")

## ⭐⭐ MEDIUM: Exercise 10 - Apply LDA with n_components=2

In [None]:
# Apply LDA topic modeling on simple documents

docs = [
    "machine learning algorithm data",
    "neural network deep learning",
    "machine learning model training",
    "deep neural network training"
]

# TODO: Create CountVectorizer and fit_transform
vec = ___
matrix = ___

# TODO: Create LatentDirichletAllocation with n_components=2
lda = ___

# TODO: Fit the LDA model
___

# TODO: Get feature names
words = ___

print("Topics found:")
for topic_idx, topic in enumerate(lda.components_):
    # Get top 3 word indices, highest weight first
    top_indices = topic.argsort()[-3:][::-1]
    top_words = [words[i] for i in top_indices]
    print(f"Topic {topic_idx + 1}: {', '.join(top_words)}")

## ⭐⭐⭐ HARD: Exercise 11 - Complete Vectorization Pipeline with Stop Words + Bigrams

In [None]:
# Build a complete vectorization pipeline with multiple parameters

docs = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps all day",
    "quick brown dogs love to run",
    "the fox is quick and clever"
]

# TODO: Create TfidfVectorizer with:
# - stop_words='english'
# - ngram_range=(1, 2)
# - max_features=10
vec = ___

# TODO: Fit and transform the documents
matrix = ___

# TODO: Get feature names
features = ___

print(f"Features (unigrams + bigrams, no stop words, max 10):")
print(features)
print(f"\nMatrix shape: {matrix.shape}")
print(f"\nTF-IDF values:")
print(matrix.toarray())

## ⭐⭐⭐ HARD: Exercise 12 - Apply LDA and Print Top Words Per Topic

In [None]:
# Apply LDA and display top N words for each topic

docs = [
    "python machine learning data science algorithms",
    "deep neural networks artificial intelligence",
    "machine learning models training data",
    "neural networks deep learning frameworks",
    "data science analytics python programming",
    "artificial intelligence machine learning"
]

# TODO: Vectorize with CountVectorizer (stop_words='english')
vec = ___
matrix = ___

# TODO: Create and fit LDA with n_components=3 (3 topics)
lda = ___
___

# TODO: Get feature names
words = ___

# TODO: Print top 5 words for each topic
print("Topic Modeling Results (LDA):")
for topic_idx, topic in enumerate(lda.components_):
    # Get top 5 word indices
    top_indices = ___
    top_words = [words[i] for i in top_indices]
    print(f"\nTopic {topic_idx + 1}: {', '.join(top_words)}")

## ⭐⭐⭐ HARD: Exercise 13 - Apply NMF and Compare with LDA

In [None]:
# Apply NMF topic modeling and compare with LDA results

docs = [
    "car vehicle automobile transportation",
    "airplane flight aircraft travel",
    "bicycle bike pedal cycle",
    "train railway locomotive transport",
    "ship vessel boat navigation",
    "helicopter aircraft sky transport"
]

# TODO: Vectorize with TfidfVectorizer (stop_words='english')
vec = ___
tfidf_matrix = ___

# TODO: Create and fit NMF with n_components=2
nmf = ___
___

# TODO: Get feature names
words = ___

# TODO: Print top 4 words for each NMF topic
print("NMF Topics:")
for topic_idx, topic in enumerate(nmf.components_):
    top_indices = ___
    top_words = [words[i] for i in top_indices]
    print(f"\nTopic {topic_idx + 1}: {', '.join(top_words)}")

## ⭐⭐⭐ HARD: Exercise 14 - Process Document Collection and Find Topics

In [None]:
# Process a complete document collection with vectorization and topic modeling

docs = [
    "artificial intelligence machine learning neural networks",
    "python programming language data science",
    "deep learning convolutional neural networks",
    "machine learning algorithms classification regression",
    "python libraries numpy pandas scikit learn",
    "natural language processing text mining",
    "data analysis statistics visualization",
    "computer vision image recognition"
]

# TODO: Vectorize with CountVectorizer
# - stop_words='english'
# - max_df=0.7 (ignore words in more than 70% of docs)
# - min_df=1 (keep words in at least 1 doc)
vec = ___
matrix = ___

# TODO: Create and fit LDA with n_components=3
lda = ___
___

# TODO: Get document-topic distribution
doc_topics = ___

# TODO: Get feature names
words = ___

print("=" * 60)
print("DOCUMENT COLLECTION ANALYSIS")
print("=" * 60)

# Print topics
print("\nTopics discovered:")
for topic_idx, topic in enumerate(lda.components_):
    top_indices = topic.argsort()[-5:][::-1]  # top 5 words, highest weight first
    top_words = [words[i] for i in top_indices]
    print(f"\nTopic {topic_idx + 1}: {', '.join(top_words)}")

# Print document topic assignments
print(f"\n\nDocument-Topic Distribution:")
for i, doc in enumerate(docs):
    dominant_topic = doc_topics[i].argmax()
    print(f"\nDoc {i+1}: \"{doc[:40]}...\"")
    print(f"  Dominant Topic: {dominant_topic + 1}")

## ⭐⭐⭐ HARD: Exercise 15 - Build Topic Analysis Function

In [None]:
# Build a reusable function that performs complete topic analysis

# TODO: Create a function that takes documents, n_topics, and returns summary
def analyze_topics(documents, n_topics=2, n_top_words=5):
    """
    Analyze topics in a document collection.
    
    Args:
        documents: list of text documents
        n_topics: number of topics to extract
        n_top_words: number of top words to show per topic
    
    Returns:
        dict with 'topics', 'doc_topics', and 'vocabulary'
    """
    # TODO: Vectorize documents with CountVectorizer
    # Use: stop_words='english', max_df=0.8, min_df=1
    vec = ___
    matrix = ___
    
    # TODO: Create and fit LDA model
    lda = ___
    ___
    
    # TODO: Extract topics
    words = ___
    topics = []
    
    for topic_idx, topic in enumerate(lda.components_):
        # Get top word indices
        top_indices = ___
        # Convert to words
        top_words = ___
        topics.append(top_words)
    
    # TODO: Get document-topic distribution
    doc_topics = ___
    
    return {
        'topics': topics,
        'doc_topics': doc_topics,
        'vocabulary': words
    }

# Test the function
test_docs = [
    "apple orange banana fruit",
    "cat dog animal pet",
    "apple fruit food healthy",
    "dog pet companion animal"
]

result = analyze_topics(test_docs, n_topics=2)

print("TOPIC ANALYSIS RESULTS")
print("=" * 60)
print("\nTopics:")
for i, topic_words in enumerate(result['topics']):
    print(f"Topic {i+1}: {', '.join(topic_words)}")

print(f"\nVocabulary size: {len(result['vocabulary'])} words")

## Summary

### Key Takeaways:
- **Vectorization** converts text to numbers for algorithms
- **Bag of Words** counts word occurrences
- **TF-IDF** weights words by importance
- **Topic Modeling** discovers hidden themes
- **LDA** and **NMF** are popular topic modeling algorithms

### When to Use:
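- **BoW**: Quick, simple baseline when raw word counts are enough
- **TF-IDF**: Text similarity and classification, where common words should count less
- **LDA**: Probabilistic topics; works on raw word counts and gives each document a topic mixture
- **NMF**: Matrix-factorization topics; usually paired with TF-IDF input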

### What's Next:
Tomorrow we'll build a **Text Classifier** using these techniques to categorize documents!

---

*Created for Natruja's NLP study plan*