# Stopwords and Lemmatization
## Objective

Normalize tokenized text by:

- Removing non-informative words

- Reducing tokens to their canonical forms

The goal is signal preservation, not aggressive compression.

## Why This Step Matters

Without normalization:

- Vocabulary size explodes

- Models overfit on morphological noise

- Feature importance becomes fragmented

- Interpretability degrades

With over-normalization:

- Semantic meaning is lost

- Sentiment and intent can be distorted

This notebook demonstrates controlled normalization.

## Imports and Setup

In [2]:
import re
import pandas as pd
from typing import List

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\pantu\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\pantu\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\pantu\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\pantu\AppData\Roaming\nltk_data...


True

## Example Tokens (From Previous Notebook)

We assume tokens already exist from 01_basic_cleaning_and_tokenization.ipynb.

In [10]:
data = {
    "tokens": [
        ["this", "is", "amazing", "visit", "now"],
        ["nlp", "is", "hard", "or", "is", "it"],
        ["tokenization", "errors", "silent", "model", "failures"],
        ["clean", "text", "better", "models"]
    ]
}

df = pd.DataFrame(data)
df

|   | tokens |
| --- | --- |
| 0 | [this, is, amazing, visit, now] |
| 1 | [nlp, is, hard, or, is, it] |
| 2 | [tokenization, errors, silent, model, failures] |
| 3 | [clean, text, better, models] |


## Stopword Removal

### What Are Stopwords?

Stopwords are high-frequency terms that usually carry:

- Little semantic value

- Low discriminative power

Examples: `is, the, and, or`

### Load Stopword List

In [13]:
stop_words = set(stopwords.words("english"))

### Remove Stopwords

In [16]:
def remove_stopwords(tokens: List[str]) -> List[str]:
    return [token for token in tokens if token not in stop_words]

df["tokens_no_stopwords"] = df["tokens"].apply(remove_stopwords)
df[["tokens", "tokens_no_stopwords"]]


|   | tokens | tokens_no_stopwords |
| --- | --- | --- |
| 0 | [this, is, amazing, visit, now] | [amazing, visit] |
| 1 | [nlp, is, hard, or, is, it] | [nlp, hard] |
| 2 | [tokenization, errors, silent, model, failures] | [tokenization, errors, silent, model, failures] |
| 3 | [clean, text, better, models] | [clean, text, better, models] |


### Design Warning

Do NOT blindly remove stopwords when:

- Sentiment matters (not, never)

- Question detection is important

- Legal or medical language is involved

A custom stopword list tailored to the task is often the better choice.
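
One way to act on this warning is to start from a general-purpose list and subtract the terms the task depends on. The sketch below assumes a small hand-written `BASE_STOPWORDS` set as a stand-in for `stopwords.words("english")`, so it runs without NLTK data. `NEGATIONS`, `build_stopwords`, `filter_tokens`, and the sample tokens are illustrative names, not part of the notebook.

```python
from typing import FrozenSet, Iterable, List

# Stand-in for set(stopwords.words("english")); deliberately tiny (assumption).
BASE_STOPWORDS = frozenset(
    {"this", "is", "it", "or", "the", "and", "not", "no", "nor"}
)

# Terms a sentiment task usually needs to keep (illustrative, not exhaustive).
NEGATIONS = frozenset({"not", "no", "nor", "never"})


def build_stopwords(
    base: Iterable[str],
    keep: Iterable[str] = (),
    extra: Iterable[str] = (),
) -> FrozenSet[str]:
    """Return (base | extra) - keep as an immutable set."""
    return frozenset(set(base) | set(extra)) - frozenset(keep)


def filter_tokens(tokens: List[str], stop_words: FrozenSet[str]) -> List[str]:
    """Drop every token that appears in the given stopword set."""
    return [t for t in tokens if t not in stop_words]


sentiment_stopwords = build_stopwords(BASE_STOPWORDS, keep=NEGATIONS)
tokens = ["this", "is", "not", "amazing"]

default_result = filter_tokens(tokens, BASE_STOPWORDS)      # negation is lost
custom_result = filter_tokens(tokens, sentiment_stopwords)  # negation survives
print(default_result, custom_result)
```

With the real NLTK list, the same subtraction would apply to `set(stopwords.words("english"))`. The `extra` argument leaves room for domain-specific additions.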

## Stemming (Baseline Approach)

### What Is Stemming?

- Rule-based suffix stripping

- Fast but linguistically crude

- Can produce non-words

In [19]:
stemmer = PorterStemmer()

def stem_tokens(tokens: List[str]) -> List[str]:
    return [stemmer.stem(token) for token in tokens]

df["tokens_stemmed"] = df["tokens_no_stopwords"].apply(stem_tokens)
df[["tokens_no_stopwords", "tokens_stemmed"]]


|   | tokens_no_stopwords | tokens_stemmed |
| --- | --- | --- |
| 0 | [amazing, visit] | [amaz, visit] |
| 1 | [nlp, hard] | [nlp, hard] |
| 2 | [tokenization, errors, silent, model, failures] | [token, error, silent, model, failur] |
| 3 | [clean, text, better, models] | [clean, text, better, model] |
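
To make "rule-based suffix stripping" concrete, here is a toy stemmer. It is not the Porter algorithm, which applies several ordered phases with extra conditions on the remaining stem. `SUFFIX_RULES`, `toy_stem`, and `min_stem_len` are hypothetical names chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical, deliberately tiny rule set: (suffix, replacement).
# Rules are tried in order and the first match wins.
SUFFIX_RULES: List[Tuple[str, str]] = [
    ("ization", "ize"),
    ("ing", ""),
    ("es", ""),
    ("s", ""),
]


def toy_stem(token: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix if enough of the word remains."""
    for suffix, replacement in SUFFIX_RULES:
        remaining = len(token) - len(suffix)
        if token.endswith(suffix) and remaining >= min_stem_len:
            return token[:remaining] + replacement
    return token


for word in ["models", "amazing", "failures", "glass", "is"]:
    print(word, "->", toy_stem(word))
```

Even this small rule set shows both properties listed above: it is cheap to run, and it can emit non-words such as `amaz` or damage real ones (`glass` becomes `glas`).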


### When to Use Stemming

- ✅ Fast baselines
- ✅ Search / retrieval
- ❌ Interpretability-critical models

## Lemmatization (Preferred)

### What Is Lemmatization?

- Dictionary-based normalization

- Preserves real words

- Requires part-of-speech context (simplified here: `WordNetLemmatizer.lemmatize` treats every token as a noun unless a `pos` argument is passed)

In [22]:
lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(tokens: List[str]) -> List[str]:
    return [lemmatizer.lemmatize(token) for token in tokens]

df["tokens_lemmatized"] = df["tokens_no_stopwords"].apply(lemmatize_tokens)
df[["tokens_no_stopwords", "tokens_lemmatized"]]


|   | tokens_no_stopwords | tokens_lemmatized |
| --- | --- | --- |
| 0 | [amazing, visit] | [amazing, visit] |
| 1 | [nlp, hard] | [nlp, hard] |
| 2 | [tokenization, errors, silent, model, failures] | [tokenization, error, silent, model, failure] |
| 3 | [clean, text, better, models] | [clean, text, better, model] |
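
Only the plural nouns change in the output above, because the lemmatizer is called without a part-of-speech argument. A common way to supply one is to tag the tokens first and map each Penn Treebank tag to a WordNet POS code. The sketch below shows only that mapping step and takes the lemmatizer as a callable so it runs without NLTK data. `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names.

```python
from typing import Callable, List, Tuple


def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'VBG', 'JJR') to a WordNet POS code."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default


def lemmatize_tagged(
    tagged: List[Tuple[str, str]],
    lemmatize: Callable[[str, str], str],
) -> List[str]:
    """Lemmatize (token, tag) pairs using a POS-aware lemmatize(word, pos)."""
    return [lemmatize(token, penn_to_wordnet(tag)) for token, tag in tagged]


# Intended use with the notebook's objects (not executed here; nltk.pos_tag
# needs a tagger model that is not among the resources downloaded above):
#   tagged = nltk.pos_tag(tokens)
#   lemmas = lemmatize_tagged(tagged, lemmatizer.lemmatize)
```

Tagging works best on tokens in their original order, so in a real pipeline it would likely need to run before stopword removal, not after.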


## Stemming vs Lemmatization

| Aspect           | Stemming                                     | Lemmatization                            |
| ---------------- | -------------------------------------------- | ---------------------------------------- |
| Speed            | Very fast                                    | Slower                                   |
| Output           | May be invalid words                         | Valid words                              |
| Interpretability | Low                                          | High                                     |
| Production use   | Mainly fast baselines and search / retrieval | Preferred when tokens must stay readable |



## Handling Rare or Noisy Tokens

Rare tokens often represent:

- Typos

- OCR errors

- One-off identifiers

### Simple Length Filter

In [25]:
def remove_short_tokens(tokens: List[str], min_len: int = 3) -> List[str]:
    return [t for t in tokens if len(t) >= min_len]

df["tokens_filtered"] = df["tokens_lemmatized"].apply(remove_short_tokens)
df[["tokens_lemmatized", "tokens_filtered"]]


|   | tokens_lemmatized | tokens_filtered |
| --- | --- | --- |
| 0 | [amazing, visit] | [amazing, visit] |
| 1 | [nlp, hard] | [nlp, hard] |
| 2 | [tokenization, error, silent, model, failure] | [tokenization, error, silent, model, failure] |
| 3 | [clean, text, better, model] | [clean, text, better, model] |
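
A length filter catches short fragments but not rare tokens as such. A frequency-based filter is one possible complement: count how many documents each token appears in and drop tokens below a threshold. The sketch below is a minimal version. `min_df`, the helper names, and the sample documents (including the deliberate typo `modle`) are assumptions for illustration.

```python
from collections import Counter
from typing import Iterable, List, Set


def document_frequencies(docs: Iterable[List[str]]) -> Counter:
    """Count the number of documents each token appears in."""
    counts: Counter = Counter()
    for tokens in docs:
        counts.update(set(tokens))  # each token counts once per document
    return counts


def rare_tokens(docs: Iterable[List[str]], min_df: int = 2) -> Set[str]:
    """Return tokens that occur in fewer than min_df documents."""
    return {t for t, c in document_frequencies(docs).items() if c < min_df}


def remove_rare_tokens(tokens: List[str], rare: Set[str]) -> List[str]:
    """Drop tokens that were flagged as rare."""
    return [t for t in tokens if t not in rare]


docs = [
    ["clean", "text", "better", "model"],
    ["model", "failure", "silent", "error"],
    ["clean", "text", "modle"],  # "modle" is a deliberate typo
]

rare = rare_tokens(docs, min_df=2)
filtered = [remove_rare_tokens(d, rare) for d in docs]
print(filtered)
```

On a corpus this small the filter also removes legitimate words such as `failure`, so the threshold only becomes meaningful with more data. To avoid leakage, the rare set would normally be computed on the training split and reused unchanged elsewhere.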


## Pipeline-Safe Normalization Function

All logic must be:

- Deterministic

- Stateless

- Reusable

In [28]:
def normalize_tokens(tokens: List[str]) -> List[str]:
    tokens = [t for t in tokens if t not in stop_words]
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    tokens = [t for t in tokens if len(t) >= 3]
    return tokens

df["tokens_normalized"] = df["tokens"].apply(normalize_tokens)
df[["tokens", "tokens_normalized"]]


|   | tokens | tokens_normalized |
| --- | --- | --- |
| 0 | [this, is, amazing, visit, now] | [amazing, visit] |
| 1 | [nlp, is, hard, or, is, it] | [nlp, hard] |
| 2 | [tokenization, errors, silent, model, failures] | [tokenization, error, silent, model, failure] |
| 3 | [clean, text, better, models] | [clean, text, better, model] |
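
`normalize_tokens` reads `stop_words` and `lemmatizer` from module scope, so its behavior depends on whatever those names hold at call time. One possible way to make the dependencies explicit is a small factory that captures them once. `make_normalizer` and `toy_lemmatize` are hypothetical names, and the toy lemmatizer is only a stand-in for `lemmatizer.lemmatize` so the sketch runs without NLTK data.

```python
from typing import Callable, Iterable, List


def make_normalizer(
    stop_words: Iterable[str],
    lemmatize: Callable[[str], str],
    min_len: int = 3,
) -> Callable[[List[str]], List[str]]:
    """Build a normalizer with its configuration fixed at creation time."""
    frozen = frozenset(stop_words)  # later changes to the input have no effect

    def normalize(tokens: List[str]) -> List[str]:
        kept = [t for t in tokens if t not in frozen]
        lemmas = [lemmatize(t) for t in kept]
        return [t for t in lemmas if len(t) >= min_len]

    return normalize


def toy_lemmatize(token: str) -> str:
    """Stand-in lemmatizer (assumption): strips one trailing 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


normalize = make_normalizer({"is", "or", "it"}, toy_lemmatize)
result = normalize(["nlp", "is", "hard", "or", "is", "it"])
print(result)
```

In the notebook's setting the call would look like `make_normalizer(stop_words, lemmatizer.lemmatize)`, and the returned function can be passed to `df["tokens"].apply` as before.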


## Common Normalization Mistakes

- ❌ Removing negations (`not`, `no`), both of which are in NLTK's default English stopword list
- ❌ Mixing stemming and lemmatization
- ❌ Applying different stopword sets across splits
- ❌ Normalizing after vectorization
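
To avoid drifting stopword sets across splits, one option is to write the final list to disk once and load that same file wherever normalization runs. The sketch below assumes a JSON file. `save_stopwords`, `load_stopwords`, and the file name are illustrative.

```python
import json
import tempfile
from pathlib import Path
from typing import FrozenSet, Iterable


def save_stopwords(stop_words: Iterable[str], path: Path) -> None:
    """Write the stopword list in sorted order so the file is reproducible."""
    path.write_text(json.dumps(sorted(set(stop_words))), encoding="utf-8")


def load_stopwords(path: Path) -> FrozenSet[str]:
    """Load the stopword list as an immutable set."""
    return frozenset(json.loads(path.read_text(encoding="utf-8")))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "stopwords.json"
    save_stopwords({"is", "or", "it"}, path)

    # Every split loads the same file instead of rebuilding its own list.
    train_stopwords = load_stopwords(path)
    test_stopwords = load_stopwords(path)

print(train_stopwords == test_stopwords)
```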

## Key Takeaways

- Stopwords reduce noise but can remove meaning

- Lemmatization is safer than stemming

- Normalization choices affect interpretability

- Always encapsulate logic in reusable functions

## Next Notebook