# Lesson 38: Text preprocessing pipeline activity solution

In this activity, you will build a reusable text preprocessing pipeline and compare the effects of different preprocessing choices on text classification. The activity has three parts:

1. **Custom preprocessing function** - Create a function that applies multiple preprocessing steps
2. **Stemmer and lemmatizer comparison** - Compare the Porter, Lancaster, and Snowball stemmers with the WordNet lemmatizer and with no reduction
3. **Classification impact** - Measure how preprocessing choices affect Naive Bayes accuracy

## Notebook set-up

### Imports

In [1]:
import re

import nltk
from nltk.corpus import movie_reviews, stopwords
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

nltk.download('movie_reviews', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

True

## 1. Load data

In [2]:
# Load movie reviews
documents = [
    (' '.join(movie_reviews.words(fileid)), category)
    for category in movie_reviews.categories()
    for fileid in movie_reviews.fileids(category)
]

texts = [doc[0] for doc in documents]
labels = [1 if doc[1] == 'pos' else 0 for doc in documents]

print(f'Total documents: {len(texts)}')
print(f'Positive reviews: {sum(labels)}')
print(f'Negative reviews: {len(labels) - sum(labels)}')

Total documents: 2000
Positive reviews: 1000
Negative reviews: 1000


## 2. Build preprocessing pipeline

### Task 1: Create a reusable preprocessing function

The completed `preprocess_text` function below applies the following steps:

1. **Lowercase** the text
2. **Remove punctuation** and numbers
3. **Tokenize** the text
4. **Remove stopwords**
5. **Apply stemming or lemmatization** using the provided reducer (skipped when no reducer is passed)

In [3]:
stop_words = set(stopwords.words('english'))

def preprocess_text(text: str, reducer=None) -> str:
    """Lowercase, strip non-letters, tokenize, drop stopwords, then reduce.

    `reducer` may be an NLTK stemmer (any object with a `.stem` method), a
    `WordNetLemmatizer`, or None to skip the reduction step.
    """

    # Step 1: Lowercase
    text = text.lower()

    # Step 2: Remove punctuation and numbers
    text = re.sub(r'[^a-z\s]', '', text)

    # Step 3: Tokenize
    tokens = word_tokenize(text)

    # Step 4: Remove stopwords
    tokens = [t for t in tokens if t not in stop_words]

    # Step 5: Apply stemming or lemmatization (if a reducer is provided)
    if reducer is not None:
        if isinstance(reducer, WordNetLemmatizer):
            # Without a POS tag, lemmatize() treats each token as a noun
            tokens = [reducer.lemmatize(t) for t in tokens]
        else:
            tokens = [reducer.stem(t) for t in tokens]

    # Join tokens back to string
    return ' '.join(tokens)

In [4]:
# Test your preprocessing function
sample_text = "This movie was AMAZING! I loved the acting (10/10) and the storyline."
porter = PorterStemmer()

result = preprocess_text(sample_text, reducer=porter)
print(f'Original: {sample_text}')
print(f'Preprocessed: {result}')

Original: This movie was AMAZING! I loved the acting (10/10) and the storyline.
Preprocessed: movi amaz love act storylin
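
One detail of Step 2 is worth checking by hand: because the pattern replaces unwanted characters with an empty string, words joined by an apostrophe or hyphen are fused rather than split. The sketch below uses a made-up sentence (not from the corpus) and contrasts that behaviour with a hypothetical variant that substitutes a space instead. The `movie_reviews` texts are built from already-separated tokens, so this likely matters more for raw text than for this dataset.

```python
import re

# Made-up sentence for illustration (not taken from the corpus)
sample = "It's a well-written, 10/10 film"
lowered = sample.lower()

# Pattern used in preprocess_text: delete anything that is not a-z or whitespace
fused = re.sub(r'[^a-z\s]', '', lowered).split()

# Hypothetical variant: replace unwanted characters with a space instead
split_apart = re.sub(r'[^a-z\s]', ' ', lowered).split()

print(fused)        # ['its', 'a', 'wellwritten', 'film']
print(split_apart)  # ['it', 's', 'a', 'well', 'written', 'film']
```

Neither choice is right in general: deleting keeps contractions in one piece, while substituting a space keeps hyphenated parts as separate words.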


## 3. Compare stemmers and lemmatizers

### Task 2: Compare stemmer and lemmatizer outputs

Use your preprocessing function to compare how different stemmers and lemmatizers transform the same text.

In [5]:
# Initialize stemmers and lemmatizers
reducers = {
    'Porter': PorterStemmer(),
    'Lancaster': LancasterStemmer(),
    'Snowball': SnowballStemmer('english'),
    'WordNet': WordNetLemmatizer(),
    'None': None
}

# Compare on sample text
sample = "The running runners ran quickly through the beautiful forest"

print(f'Original: {sample}\n')

for name, reducer in reducers.items():
    result = preprocess_text(sample, reducer=reducer)
    print(f'{name:12s}: {result}')

Original: The running runners ran quickly through the beautiful forest

Porter      : run runner ran quickli beauti forest
Lancaster   : run run ran quick beauty forest
Snowball    : run runner ran quick beauti forest
WordNet     : running runner ran quickly beautiful forest
None        : running runners ran quickly beautiful forest


## 4. Classification impact

### Task 3: Measure preprocessing impact on classification

Evaluate how different preprocessing choices affect Naive Bayes classification accuracy.

In [6]:
results = {}

for name, reducer in reducers.items():

    # Preprocess all texts using this stemmer/lemmatizer
    preprocessed_texts = [preprocess_text(t, reducer) for t in texts]

    # Split into train/test sets
    X_train, X_test, y_train, y_test = train_test_split(
        preprocessed_texts,
        labels,
        test_size=0.2,
        random_state=315
    )

    # Vectorize using CountVectorizer
    vectorizer = CountVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)

    # Train Naive Bayes classifier
    clf = MultinomialNB()
    clf.fit(X_train_vec, y_train)

    # Calculate accuracy on test set
    y_pred = clf.predict(X_test_vec)
    accuracy = accuracy_score(y_test, y_pred)

    results[name] = accuracy
    print(f'{name:12s} accuracy: {accuracy:.4f}')

Porter       accuracy: 0.7925
Lancaster    accuracy: 0.7775
Snowball     accuracy: 0.8025
WordNet      accuracy: 0.8000
None         accuracy: 0.7950
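
A single 80/20 split gives one accuracy figure per configuration. One way to get a steadier estimate, sketched here under assumptions, is to wrap the vectorizer and classifier in a scikit-learn pipeline and score it with `cross_val_score`. The toy corpus and the `simple_clean` helper are hypothetical stand-ins so the block runs without NLTK data; in the notebook, `texts`, `labels`, and `preprocess_text` would take their place.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the movie reviews
toy_texts = [
    'a wonderful film with great acting',
    'great story and wonderful cast',
    'loved the acting and the story',
    'a great and moving film',
    'wonderful direction, loved it',
    'the cast was great',
    'a terrible film with awful acting',
    'awful story and boring cast',
    'hated the acting and the story',
    'a boring and terrible film',
    'awful direction, hated it',
    'the cast was terrible',
]
toy_labels = [1] * 6 + [0] * 6


def simple_clean(text: str) -> str:
    """Hypothetical stand-in for preprocess_text: lowercase, keep letters."""
    return re.sub(r'[^a-z\s]', '', text.lower())


# Passing the cleaner as `preprocessor` keeps cleaning inside the pipeline,
# so the vocabulary in each fold is learned from that fold's training data only.
pipeline = make_pipeline(
    CountVectorizer(preprocessor=simple_clean),
    MultinomialNB(),
)

scores = cross_val_score(pipeline, toy_texts, toy_labels, cv=3, scoring='accuracy')
print(f'Fold accuracies: {scores}')
print(f'Mean accuracy: {scores.mean():.4f}')
```

On the real data, the mean and spread of the fold accuracies would give a sense of how much of the difference between reducers survives a change of split.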


## 5. Analysis questions

**1. Which approach produced the highest accuracy? Why might this be?**

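In this run, Snowball scored highest (0.8025), followed by WordNet lemmatization (0.8000), no reduction (0.7950), Porter (0.7925), and Lancaster (0.7775). The spread is small (under three percentage points on a single 80/20 split), so the exact ranking may change with a different `random_state`. The more consistent pattern is that aggressive stemming tends to do slightly worse than conservative stemming, lemmatization, or no reduction. Stemming can conflate words with different meanings (e.g., "universal" and "university" both reduce to "univers", even under Porter), and aggressive stemmers do so more often: in the sample above, Lancaster maps both "running" and "runners" to "run", while Porter and Snowball keep "runner" distinct. Each such merge adds noise to the features. Lemmatization reduces the vocabulary more gently and keeps valid dictionary words, so it is less likely to merge unrelated terms.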
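
To put the reported accuracies in perspective, the arithmetic below converts them back to counts of correctly classified test reviews and estimates a rough binomial standard error. It assumes the test set holds 400 reviews (20% of 2,000) and treats each prediction as an independent trial, which is a simplification.

```python
import math

n_test = 400  # assumption: 20% of the 2,000 reviews

# Accuracies copied from the classification output above
reported = {
    'Porter': 0.7925,
    'Lancaster': 0.7775,
    'Snowball': 0.8025,
    'WordNet': 0.8000,
    'None': 0.7950,
}

correct_counts = {}
for name, acc in reported.items():
    correct_counts[name] = round(acc * n_test)
    # Binomial standard error of a proportion: sqrt(p * (1 - p) / n)
    std_err = math.sqrt(acc * (1 - acc) / n_test)
    print(f'{name:12s} correct: {correct_counts[name]} / {n_test}  std. error: {std_err:.3f}')
```

The best and worst configurations differ by 10 reviews, and the standard error of each accuracy is about 0.02, so the gaps between configurations are roughly the size of the sampling noise from a single split. This is a rough guide only, since all configurations share the same test reviews.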

**2. How does stemming/lemmatization affect the vocabulary size (number of unique words)?**

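Both stemming and lemmatization reduce vocabulary size by collapsing different word forms into a single term. More aggressive stemmers (Lancaster) produce smaller vocabularies than conservative ones (Porter, Snowball). Lemmatization typically lands between no reduction and Porter stemming, in part because `WordNetLemmatizer` treats every token as a noun unless it is given a part-of-speech tag, so most verb and adjective forms are left unchanged. The cell below measures this on the full corpus: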

In [7]:
# Measure vocabulary size for each configuration
for name, reducer in reducers.items():

    preprocessed_texts = [preprocess_text(t, reducer) for t in texts]
    vectorizer = CountVectorizer()
    vectorizer.fit(preprocessed_texts)
    vocab_size = len(vectorizer.vocabulary_)
    print(f'{name:12s}: {vocab_size:,} unique terms')

Porter      : 25,322 unique terms
Lancaster   : 20,665 unique terms
Snowball    : 24,838 unique terms
WordNet     : 34,402 unique terms
None        : 38,786 unique terms
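
The counts above are easier to compare as percentage reductions relative to the unreduced vocabulary. The snippet below is plain arithmetic on the reported numbers, not a re-run of the experiment.

```python
# Vocabulary sizes copied from the output above
vocab_sizes = {
    'Porter': 25_322,
    'Lancaster': 20_665,
    'Snowball': 24_838,
    'WordNet': 34_402,
    'None': 38_786,
}

baseline = vocab_sizes['None']

# Percentage of terms removed relative to no reduction
reductions = {
    name: 100 * (baseline - size) / baseline
    for name, size in vocab_sizes.items()
}

for name, pct in reductions.items():
    print(f'{name:12s}: {pct:5.1f}% smaller than no reduction')
```

Lancaster removes roughly 47% of the terms, Porter and Snowball about 35% to 36%, and WordNet about 11%.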


**3. What are the trade-offs between stemming and lemmatization?**

| Aspect | Aggressive Stemming (Lancaster) | Conservative Stemming (Porter) | Lemmatization (WordNet) |
|--------|--------------------------------|-------------------------------|-------------------------|
| Vocabulary size | Smallest | Medium | Medium-Large |
| Output validity | Non-words | Non-words | Valid words |
| Conflation errors | More frequent | Less frequent | Least frequent |
| Computational cost | Fast | Fast | Slower (dictionary lookup) |
| Context awareness | None | None | Limited (POS tagging helps) |
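
The conflation row can be probed directly by grouping words that share a stem. The word list below is chosen for illustration and is not drawn from the corpus; the stemmers themselves need no NLTK data downloads.

```python
from collections import defaultdict

from nltk.stem import LancasterStemmer, PorterStemmer

# Illustrative word list (an assumption, not taken from the reviews)
words = ['running', 'runners', 'universal', 'university', 'universe']

stemmers = {'Porter': PorterStemmer(), 'Lancaster': LancasterStemmer()}

groups = {}
for name, stemmer in stemmers.items():
    # Map each stem to the list of original words that produced it
    by_stem = defaultdict(list)
    for word in words:
        by_stem[stemmer.stem(word)].append(word)
    groups[name] = dict(by_stem)

    print(name)
    for stem, members in by_stem.items():
        print(f'  {stem:10s} <- {", ".join(members)}')
```

Any group with more than one member marks word forms that the classifier can no longer tell apart after stemming.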

**When to use each:**

- **Aggressive stemming**: When recall is important and vocabulary reduction is critical (e.g., limited memory)
- **Conservative stemming**: When you need speed and moderate vocabulary reduction
- **Lemmatization**: When interpretability matters or downstream tasks need valid words
- **No reduction**: When exact word forms carry important meaning or when using modern embeddings