# Lesson 38: Text preprocessing pipeline activity

In this activity, you will build a reusable text preprocessing pipeline and compare how different preprocessing choices affect text classification. The activity has three parts:

1. **Custom preprocessing function** - Create a function that applies multiple preprocessing steps
2. **Stemmer comparison** - Compare Porter, Lancaster, and Snowball stemmers
3. **Classification impact** - Measure how preprocessing choices affect Naive Bayes performance

## Notebook set-up

### Imports

In [None]:
import re

import nltk
from nltk.corpus import movie_reviews, stopwords
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

nltk.download('movie_reviews', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)

## 1. Load data

In [None]:
# Load movie reviews
documents = [
    (' '.join(movie_reviews.words(fileid)), category)
    for category in movie_reviews.categories()
    for fileid in movie_reviews.fileids(category)
]

texts = [doc[0] for doc in documents]
labels = [1 if doc[1] == 'pos' else 0 for doc in documents]

print(f'Total documents: {len(texts)}')
print(f'Positive reviews: {sum(labels)}')
print(f'Negative reviews: {len(labels) - sum(labels)}')

## 2. Build preprocessing pipeline

### Task 1: Create a reusable preprocessing function

Complete the `preprocess_text` function below to apply the following steps:

1. **Lowercase** the text
2. **Remove punctuation** and numbers
3. **Tokenize** the text
4. **Remove stopwords**
5. **Apply stemming** using the provided stemmer

**Hints:**
- Use `text.lower()` for lowercasing
- Use `re.sub(r'[^a-z\s]', '', text)` to keep only letters and spaces
- Use `word_tokenize()` for tokenization
- Filter tokens against the `stop_words` set defined in the cell below (built from `stopwords.words('english')`)
- Apply `stemmer.stem(token)` to each remaining token, but only when `stemmer` is not `None`
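
The first two hints can be tried on their own before writing the full function. The sketch below uses only the standard library. The sample string is made up for illustration, and `str.split()` stands in for `word_tokenize()` just to show the resulting tokens.

```python
import re

# Hypothetical sample string, not taken from the movie review corpus
raw = "Wow!!! 3 stars... Isn't it GREAT?"

# Step 1: lowercase first, so the character class below only needs a-z
lowered = raw.lower()

# Step 2: delete every character that is not a lowercase letter or whitespace
cleaned = re.sub(r'[^a-z\s]', '', lowered)

# Whitespace split used here only for display; the activity asks for word_tokenize()
tokens = cleaned.split()

print(repr(cleaned))
print(tokens)
```

Because characters are deleted rather than replaced with a space, a contraction such as `isn't` collapses to `isnt`, and removed digits can leave double spaces behind, which tokenization absorbs.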

In [None]:
stop_words = set(stopwords.words('english'))

def preprocess_text(text: str, stemmer=None) -> str:
    '''Preprocess text by lowercasing, removing punctuation, tokenizing,
    removing stopwords, and optionally stemming.
    
    TODO: Implement the preprocessing steps described above.
    '''
    
    # Step 1: Lowercase
    # YOUR CODE HERE
    
    # Step 2: Remove punctuation and numbers
    # YOUR CODE HERE
    
    # Step 3: Tokenize
    # YOUR CODE HERE
    
    # Step 4: Remove stopwords
    # YOUR CODE HERE
    
    # Step 5: Apply stemming (if stemmer provided)
    # YOUR CODE HERE
    
    # Join tokens back to string
    return ' '.join(tokens)

In [None]:
# Test your preprocessing function
sample_text = "This movie was AMAZING! I loved the acting (10/10) and the storyline."
porter = PorterStemmer()

result = preprocess_text(sample_text, stemmer=porter)
print(f'Original: {sample_text}')
print(f'Preprocessed: {result}')

# Expected output should be something like: "movi amaz love act storylin"

## 3. Compare stemmers

### Task 2: Compare stemmer outputs

Use your preprocessing function to compare how different stemmers transform the same text. Run the cell below to see the differences.

In [None]:
# Initialize stemmers
stemmers = {
    'Porter': PorterStemmer(),
    'Lancaster': LancasterStemmer(),
    'Snowball': SnowballStemmer('english'),
    'None': None
}

# Compare on sample text
sample = "The running runners ran quickly through the beautiful forest"

print(f'Original: {sample}\n')

for name, stemmer in stemmers.items():
    result = preprocess_text(sample, stemmer=stemmer)
    print(f'{name:12s}: {result}')
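
To inspect the stemmers in isolation, without the rest of the pipeline, a word-by-word comparison can help. The following is a minimal sketch: the word list and the names `words`, `word_stemmers`, and `stem_table` are illustrative choices, not part of the activity.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Words chosen for illustration; any list of lowercase tokens works
words = ['running', 'runners', 'quickly', 'beautiful']

# Separate dict so the notebook's `stemmers` (which includes 'None') is untouched
word_stemmers = {
    'Porter': PorterStemmer(),
    'Lancaster': LancasterStemmer(),
    'Snowball': SnowballStemmer('english'),
}

# One row per stemmer, one stem per input word
stem_table = {
    name: [s.stem(w) for w in words]
    for name, s in word_stemmers.items()
}

for name, stems in stem_table.items():
    print(f'{name:12s}: {stems}')
```

Rows that differ for the same input word show where one stemmer strips more of the suffix than another.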

## 4. Classification impact

### Task 3: Measure preprocessing impact on classification

Complete the code below to evaluate how different preprocessing choices affect Naive Bayes classification accuracy.

**Your task:**
1. Preprocess all texts using each stemmer configuration
2. Train a Naive Bayes classifier for each
3. Report accuracy for each configuration

**Hints:**
- Use a list comprehension to preprocess all texts: `[preprocess_text(t, stemmer) for t in texts]`
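- Use `train_test_split` with `random_state=315` so that every stemmer configuration is evaluated on the same train/test split
- Use `CountVectorizer` to convert preprocessed texts to features, fitting it on the training texts only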
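
The hints above follow a common split, vectorize, fit, score pattern. The sketch below shows that pattern on a handful of made-up, already-preprocessed documents rather than the movie reviews. The toy data, the `toy_*`/`tr_*`/`te_*` names, and the `test_size` and `stratify` settings are assumptions of this example, not requirements of the task.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical, already-preprocessed toy documents (1 = positive, 0 = negative)
toy_texts = [
    'love great act', 'amaz stori love', 'great fun film', 'enjoy love cast',
    'bore bad plot', 'terribl act bad', 'dull bore film', 'bad wast time',
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Same seed as the activity; test_size and stratify are assumptions of this sketch
tr_texts, te_texts, tr_labels, te_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=315, stratify=toy_labels
)

# Learn the vocabulary from the training texts only, then reuse it on the test texts
toy_vectorizer = CountVectorizer()
tr_counts = toy_vectorizer.fit_transform(tr_texts)
te_counts = toy_vectorizer.transform(te_texts)

# Fit multinomial Naive Bayes on word counts and score on the held-out texts
toy_model = MultinomialNB()
toy_model.fit(tr_counts, tr_labels)

toy_accuracy = accuracy_score(te_labels, toy_model.predict(te_counts))
print(f'Toy accuracy: {toy_accuracy:.4f}')
```

With so few documents the accuracy value itself is not meaningful; the point is the order of the calls and that the test texts only ever pass through `transform`.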

In [None]:
results = {}

for name, stemmer in stemmers.items():

    # TODO: Preprocess all texts using this stemmer
    preprocessed_texts = None  # YOUR CODE HERE
    
    # TODO: Split into train/test sets (use random_state=315)
    X_train, X_test, y_train, y_test = None, None, None, None  # YOUR CODE HERE
    
    # TODO: Vectorize using CountVectorizer
    vectorizer = CountVectorizer()
    # YOUR CODE HERE to fit_transform train and transform test
    
    # TODO: Train Naive Bayes classifier
    # YOUR CODE HERE
    
    # TODO: Calculate accuracy on test set
    accuracy = None  # YOUR CODE HERE
    
    results[name] = accuracy
    print(f'{name:12s} stemmer accuracy: {accuracy:.4f}')

## 5. Analysis questions

After completing the tasks above, answer these questions:

1. Which configuration (including no stemming) produced the highest accuracy? Why might this be?
2. How does stemming affect the vocabulary size (number of unique words)?
3. What are the trade-offs between aggressive stemming (Lancaster) and conservative stemming (Porter)?
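
For question 2, one way to measure vocabulary size is to count unique tokens before and after stemming. The following is a minimal sketch on a made-up two-document corpus; `toy_docs`, `vocab_stemmer`, `raw_vocab`, and `stemmed_vocab` are hypothetical names introduced for illustration.

```python
from nltk.stem import PorterStemmer

# Hypothetical toy corpus; in the activity this would be the preprocessed reviews
toy_docs = ['running runs run', 'loved loving love']

vocab_stemmer = PorterStemmer()

# Vocabulary = set of unique tokens across all documents
raw_vocab = {tok for doc in toy_docs for tok in doc.split()}
stemmed_vocab = {vocab_stemmer.stem(tok) for doc in toy_docs for tok in doc.split()}

print(f'Unique tokens without stemming: {len(raw_vocab)}')
print(f'Unique tokens with Porter:      {len(stemmed_vocab)}')
```

On the full dataset, `len(vectorizer.vocabulary_)` after fitting a `CountVectorizer` gives the same kind of count for each stemmer configuration.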