# 🔧 Text Preprocessing — Stemming & Its Types

**Stemming** is a **rule-based text normalization process** that reduces words to a **base stem** by chopping off affixes (in practice, mostly **suffixes**) using a series of predefined rules.  
It is a **fast**, **heuristic**, and **dictionary-free** approach that only approximates a word’s **root form**; unlike a lemma, the stem need not be a real word.

---

📘 **Important Distinction**

> While **lemmatization** uses vocabulary and morphological analysis to find valid words (lemmas),  
> **stemming** merely truncates tokens based on heuristic rules — meaning the result *may not* be a valid dictionary word.

---

### 🧩 Example

| Original Word | Stem Output | Valid Word? |
|:--------------|:-------------|:-------------|
| connections | connect | ✅ |
| studies | studi | ❌ |
| organizing | organ | ⚠️ (a real word, but not the intended one) |

---

### 🧮 Formal Definition

Let $w$ be a token (word). The stemming process applies a function:
$$
\text{stem}(w) \rightarrow s
$$
where $s$ is the derived stem that may or may not correspond to a valid lemma.

Given a corpus $C = \{w_1, w_2, \dots, w_n\}$,
$$
S = \{\text{stem}(w_1), \text{stem}(w_2), \dots, \text{stem}(w_n)\}
$$

---

### 🔹 Why We Use Stemming

✅ Reduces the **vocabulary size** → smaller, faster models  
✅ Groups **morphological variants** → improves recall in text search  
⚠️ May **over-stem** (merge unrelated words) or **under-stem** (fail to merge related words)
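
The first two points can be sketched in a few lines. This is only an illustration: `toy_stem` is a hypothetical helper (a crude regex suffix stripper, not one of the NLTK stemmers), and the `sample` list is made up for the demo.

```python
import re
from collections import defaultdict

def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: strip one common suffix."""
    return re.sub(r"(ions|ion|ing|ed|s)$", "", word)

sample = ["connect", "connected", "connection", "connections", "connecting",
          "cats", "cat"]

# Group surface forms by stem: each key is one "conflation class".
classes = defaultdict(set)
for tok in sample:
    classes[toy_stem(tok)].add(tok)

print(f"vocabulary before: {len(set(sample))}, after: {len(classes)}")
for stem, forms in classes.items():
    print(f"{stem:>10} <- {sorted(forms)}")
```

Seven distinct tokens collapse into two classes here. A real stemmer does the same thing with a much larger rule set, and that is also where over- and under-stemming come from.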

---

### 🧠 Stemming vs Lemmatization

| Feature | Stemming | Lemmatization |
|:--------|:----------|:---------------|
| Output | Truncated stem | Valid dictionary word |
| Logic | Rule-based suffix chopping | Morphology + POS + Dictionary |
| Accuracy | Approximate | Precise |
| Speed | ⚡ Fast | 🐢 Slower |
| Example | *studies → studi* | *studies → study* |


In [41]:
# A diverse test vocabulary containing verbs, nouns, adjectives, adverbs, and edge cases.
# Each "family" groups related morphological variants to reveal how stemmers conflate them.
tokens = [
    # ----- connect-family (regular verb/noun derivations)
    "connect", "connected", "connection", "connections", "connecting",
    # ----- study-family (y→i + inflections)
    "study", "studies", "studying", "studied",
    # ----- happy-family (adj/adv/negation)
    "happy", "happiness", "unhappily",
    # ----- organize-family (ize/ization derivations)
    "organization", "organize", "organized", "organizing",
    # ----- generalize-family (ize/ization derivations)
    "generalize", "generalized", "generalization",
    # ----- go-family (irregular verb forms)
    "go", "going", "goes", "gone",
    # ----- comparatives/superlatives (irregular scale)
    "better", "best",
    # ----- relate-family (derivational adjective)
    "relational", "relate", "related", "relating",
    # ----- pluralization and irregular plurals
    "cats", "boxes", "mice",
    # ----- practical-family (derivational/orthographic similarity)
    "practically", "practical", "practicable",
    # write-family: test progressive (-ing) and 3rd-person singular (-s)
    "writing", "writes",
    # program-family: noun vs. verb derivations (+plural -s)
    "programming", "programs",
    # history-family: bare noun (tests that some words have no useful conflation)
    "history",
    # 🏁 final-family: adverb (-ly) vs. past participle (-ed)
    "finally", "finalized",
]

# Display metadata about our sample
print("🧾 Corpus Preview — Tokens for Stemming Exploration")
print(f"Total tokens: {len(tokens)}\n")
print(tokens)

# 💡 Design notes:
# - Regular and irregular inflections (connect, study, go)
# - Adjective/adverb forms and negation (happy → unhappily)
# - Derivational morphology (organization, generalization)
# - Comparatives/superlatives (better, best) to see if stemmers over-conflate
# - Plural forms (cats, boxes) and irregular plurals (mice)
# - Orthographically similar but semantically distinct (practical vs practicable)
# - Newly added: write/program/final/history families to probe -ing, -s, -ly, -ed endings
#   and noun/verb derivations common in software/text corpora.


🧾 Corpus Preview — Tokens for Stemming Exploration
Total tokens: 42

['connect', 'connected', 'connection', 'connections', 'connecting', 'study', 'studies', 'studying', 'studied', 'happy', 'happiness', 'unhappily', 'organization', 'organize', 'organized', 'organizing', 'generalize', 'generalized', 'generalization', 'go', 'going', 'goes', 'gone', 'better', 'best', 'relational', 'relate', 'related', 'relating', 'cats', 'boxes', 'mice', 'practically', 'practical', 'practicable', 'writing', 'writes', 'programming', 'programs', 'history', 'finally', 'finalized']


## 🧰 Types of Stemmers in NLP

There are several stemmers available in NLTK, each with its own level of aggressiveness and rule set.

| Stemmer | Description | Behavior |
|:--------|:-------------|:------------|
| **PorterStemmer** | 🧩 The most classic rule-based stemmer; moderate and reliable | Conservative |
| **SnowballStemmer (Porter2)** | ❄️ Enhanced Porter version; supports multiple languages | Balanced |
| **LancasterStemmer** | ⚡ Extremely aggressive; very short stems | Over-stems often |
| **RegexpStemmer** | 🔍 Custom regex-based stemmer; great for domain control | Fully customizable |

---

Each stemmer applies pattern-based rules:
$$
\text{stemmer}(w) = w - \text{(suffixes according to pattern rules)}
$$

Let's compare how these differ in practice 👇


In [42]:
# 📦 Import the stemmers from NLTK
# We'll test how different stemmers handle the `tokens` list defined above.

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer

In [43]:
# ✅ 1. Porter Stemmer — the most commonly used baseline
#    Developed by Martin Porter in 1980, uses a fixed set of rules (about 60)
#    Typically produces readable stems and avoids excessive truncation.

porter = PorterStemmer()

print("🔹 Porter Stemmer Results:\n")
for w in tokens:
    stem = porter.stem(w)
    print(f"{w:>15}  →  {stem}")

# 🧠 Notes:
# - "connections" → "connect"
# - "studies" → "studi"
# - "organizing" → "organ"
# Porter tries to keep stems interpretable, but not necessarily valid words.


🔹 Porter Stemmer Results:

        connect  →  connect
      connected  →  connect
     connection  →  connect
    connections  →  connect
     connecting  →  connect
          study  →  studi
        studies  →  studi
       studying  →  studi
        studied  →  studi
          happy  →  happi
      happiness  →  happi
      unhappily  →  unhappili
   organization  →  organ
       organize  →  organ
      organized  →  organ
     organizing  →  organ
     generalize  →  gener
    generalized  →  gener
 generalization  →  gener
             go  →  go
          going  →  go
           goes  →  goe
           gone  →  gone
         better  →  better
           best  →  best
     relational  →  relat
         relate  →  relat
        related  →  relat
       relating  →  relat
           cats  →  cat
          boxes  →  box
           mice  →  mice
    practically  →  practic
      practical  →  practic
    practicable  →  practic
        writing  →  write
         writes  →  write
    p

## 🧩 Why Porter Stemmer Produces “studi” and “organ”

The **Porter Stemmer** works through a series of **rule-based substitution phases**, where each step applies a pattern like:

$$
\text{(suffix)} \rightarrow \text{(replacement)}
$$

These rules are *ordered and conditional*, meaning:
- The stemmer checks for certain suffix patterns in a fixed sequence.
- Within a step, at most one rule fires (the one with the longest matching suffix); the word then moves on to the next step and is never sent back to an earlier one.

Let’s analyze the examples 👇

---

### 🔹 `"connections" → "connect"`
**Rule triggered:**
- Step 1a: remove plural **-s** → “connection”
- Step 4: remove **-ion** when the remaining stem ends in “s” or “t” and is long enough → “connect”

✅ This is a *perfect case*: Porter correctly identifies the root form “connect”.

---

### 🔹 `"studies" → "studi"`
**Rule triggered:**
- Step 1: `ies` → `i` (Porter replaces “ies” with “i” to generalize plural forms)
  
$$
\text{studies} \rightarrow \text{studi}
$$

🔍 Porter assumes that words ending in “ies” are plural or 3rd person forms of verbs ending with “y”.
But it does **not** convert “i” back to “y”, since it’s unaware of morphological semantics.

⚠️ **Result:** “studi” is **not** a valid word (should have been “study”).
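
As a minimal sketch, Step 1a of the original Porter algorithm can be written as an ordered rule list in which only the first matching suffix fires. The function name `porter_step_1a` is made up for this example, and NLTK's default mode adds a few tweaks on top of the original rules, so treat this as an illustration rather than NLTK's exact code.

```python
def porter_step_1a(word: str) -> str:
    """Original Porter Step 1a: the first (longest) matching suffix wins."""
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["studies", "caresses", "caress", "cats"]:
    print(f"{w:>10} -> {porter_step_1a(w)}")
```

The `("ss", "ss")` entry looks redundant, but it protects words like “caress” from the plain `-s` rule that follows it.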

---

### 🔹 `"organizing" → "organ"`
**Rule triggered:**
- Step 1b: remove **-ing** → “organiz”, then restore the **e** because the stem ends in **-iz** → “organize”
- Step 4: remove **-ize** when the remaining stem is long enough (measure $m > 1$); “organ” qualifies → “organ”
- No later step puts the suffix back, so the noun “organ” and the verb “organize” end up with the same stem.

⚠️ This happens because Porter doesn’t look at full morphology — it only applies simple text-based rules, not grammar.
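
Porter's rules decide whether a stem is "long enough" using the measure $m$: the number of vowel-consonant sequences in the pattern $[C](VC)^m[V]$. The sketch below computes it; the helper names `is_consonant` and `measure` are chosen for this example.

```python
import re

def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: a, e, i, o, u are vowels; y is a vowel only after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem: str) -> int:
    """Count the VC pairs after collapsing runs of consonants and vowels."""
    pattern = "".join("C" if is_consonant(stem, i) else "V" for i in range(len(stem)))
    collapsed = re.sub(r"(.)\1+", r"\1", pattern)  # e.g. "VCCVC" becomes "VCVC"
    return collapsed.count("VC")

for stem in ["organ", "bett", "tree"]:
    print(f"m({stem}) = {measure(stem)}")
```

Step 4 strips a suffix only when the remaining stem has $m > 1$. That is why “organize” loses **-ize** (the stem “organ” has $m = 2$), while “better” keeps its **-er** (the stem “bett” has $m = 1$).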

---

## ⚠️ Drawbacks of Porter Stemmer

| Limitation | Description | Example |
|:------------|:-------------|:----------|
| **1️⃣ Over-stemming** | Different words reduced to same root (loss of meaning) | “organization” & “organism” → “organ” |
| **2️⃣ Under-stemming** | Related words fail to merge | “unhappy” & “happiness” remain separate |
| **3️⃣ Non-dictionary roots** | Outputs stems like “studi”, “happi”, “gener” | “studies” → “studi” |
| **4️⃣ No POS or context awareness** | Doesn’t know if word is noun/verb/adjective; irregular forms are left alone | “better” stays “better” (never “good”) |
| **5️⃣ Fixed English-only rules** | Limited cross-linguistic support | Works poorly on non-English corpora |

---

📌 **In short:**
- Porter is **fast**, **deterministic**, and great for **IR tasks** (like search engines).
- But it’s **linguistically naive** — it just “chops”, it doesn’t “understand”.

---

### ✅ When Porter Stemmer is Still Useful
- For **Information Retrieval (IR)** — when perfect word forms aren’t necessary.  
- For **keyword-based search systems** (search “connect” finds “connections”).  
- When you need lightweight preprocessing in large text pipelines.

---

📚 **When NOT to use Porter:**
- When your downstream model depends on *precise grammatical or lexical meaning*  
  (like text generation, translation, or semantic similarity tasks).  
In those cases, prefer **Lemmatization**.

---

💡 **Summary Insight**

$$
\text{PorterStemmer} \approx \text{HeuristicSuffixRemover}
$$

✅ Efficient for search  
❌ Not semantically accurate


## 🧩 RegexpStemmer — Custom, Rule-Driven Stemming

**`RegexpStemmer`** (short for *Regular Expression Stemmer*) is a **lightweight and customizable** stemmer provided by NLTK.  
Unlike other stemmers (Porter, Snowball, Lancaster), which use **predefined linguistic rules**,  
this stemmer allows you to define **your own pattern-based stripping rules** using **regular expressions (regex)**.

---

### 🔹 Working Principle

The stemmer applies a **regex substitution** to remove specific suffixes or endings.

$$
\text{stem}(w) = w - \text{(regex\_matched\_suffix)}
$$

In other words:
1. The stemmer looks for every substring that matches your regex rule.  
2. Each match is replaced with an empty string (i.e., removed). To strip suffixes only, anchor the pattern **at the end** of the word with `$`.  
3. The operation is purely **text-based** — no morphological knowledge, no POS awareness.
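
The steps above can be sketched with the standard `re` module alone. The function `regexp_stem` and its `min_len` parameter are hypothetical names for this example; they only approximate how a regex stemmer behaves and are not NLTK's own code.

```python
import re

def regexp_stem(word: str, pattern: str, min_len: int = 0) -> str:
    """Drop every match of `pattern`, unless the word is shorter than `min_len`."""
    if len(word) < min_len:
        return word
    return re.sub(pattern, "", word)

anchored = r"(ing|s)$"    # strips only at the end of the token
unanchored = r"(ing|s)"   # strips every occurrence, anywhere in the token

print(repr(regexp_stem("singing", anchored)))         # → 'sing'
print(repr(regexp_stem("singing", unanchored)))       # → ''
print(repr(regexp_stem("is", anchored, min_len=3)))   # → 'is'
```

The second call shows why the `$` anchor matters: without it, the pattern removes every match and leaves nothing of the word.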

---

### 📘 Parameters

| Parameter | Type | Description |
|:--|:--|:--|
| **regexp** | `str` or compiled regex | A pattern describing what to remove (e.g., `(ing\|ly\|ed\|s)$`) |
| **min** | `int` | Minimum length a word must have to be stemmed; shorter words are returned unchanged (default `0`) |

---

### 🧪 Example

```python
from nltk.stem import RegexpStemmer
regexp = RegexpStemmer(regexp=r'(ing|ly|ed|ious|ies|ive|es|s|ment)$', min=3)

print(regexp.stem("studying"))      # → study
print(regexp.stem("boxes"))         # → box
print(regexp.stem("connections"))   # → connection
print(regexp.stem("happily"))       # → happi
```
---

### 🧩 Explanation
- The regex removes common English endings (like -ing, -ed, -s).
- `min=3` ensures that very short words (like “is”, “as”) are not truncated.
- The `$` ensures suffixes are matched only at the end of the token.
---

### ✅ Advantages

| Benefit             | Description                                                     |
| :------------------ | :-------------------------------------------------------------- |
| **Customizable**    | You control exactly which suffixes are removed                  |
| **Lightweight**     | No heavy rule set or model loading                              |
| **Fast**            | Regex-based → extremely quick on large corpora                  |
| **Domain-friendly** | Ideal for domain-specific text (medical, legal, software, etc.) |

---

### ⚠️ Limitations

| Drawback                  | Description                                           |
| :------------------------ | :---------------------------------------------------- |
| ❌ No linguistic knowledge | Doesn’t know valid lemmas or word families            |
| ❌ Can under- or over-stem | Regex too broad or too narrow causes errors           |
| ❌ Language-specific rules | A pattern written for English won’t transfer; you’d need new regex rules per language |
| ❌ Not context aware       | “goes” → “go” works, but irregular forms like “went” or “gone” are never mapped to “go” |

---

### 📌 Best Use Cases
- You want full control over stemming rules.
- You work in restricted domains (e.g., biomedical, legal, or software corpora).
- You need a fast and transparent way to reduce word variants without external dependencies.

---

### 🧮 Mathematical Recap

$$
\text{RegexpStemmer}(w) =
\begin{cases}
w - \text{regex\_suffix}, & \text{if pattern matches the end of } w \\
w, & \text{otherwise}
\end{cases}
$$

--- 

### 💡 Summary

| Feature      | Description                                            |
| :----------- | :----------------------------------------------------- |
| **Approach** | Regex-based suffix removal                             |
| **Accuracy** | Depends entirely on your regex quality                 |
| **Speed**    | ⚡ Very fast                                            |
| **Output**   | Non-linguistic stems                                   |
| **Best for** | Controlled pipelines and domain-specific normalization |



In [44]:
# 🧪 Regexp Stemmer — create domain-specific rules
# Here, we remove a few common English suffixes when they appear at the end of a word.

# Pattern: remove -ing, -ly, -ed, -ious, -ies, -ive, -es, -s, -ment
regexp = RegexpStemmer(
    regexp=r'(ing|ly|ed|ious|ies|ive|es|s|ment)$',
    min=3
)

print("\n🔹 Regexp Stemmer Results (Custom Suffix Chopping):\n")
for w in tokens:
    print(f"{w:>15}  →  {regexp.stem(w)}")

# 📝 Notes:
# - `min=3` leaves words shorter than 3 characters untouched; it does not cap how short a stem can get ("goes" → "go")
# - Tune the regex to your domain (biomedical, legal, etc.)



🔹 Regexp Stemmer Results (Custom Suffix Chopping):

        connect  →  connect
      connected  →  connect
     connection  →  connection
    connections  →  connection
     connecting  →  connect
          study  →  study
        studies  →  stud
       studying  →  study
        studied  →  studi
          happy  →  happy
      happiness  →  happines
      unhappily  →  unhappi
   organization  →  organization
       organize  →  organize
      organized  →  organiz
     organizing  →  organiz
     generalize  →  generalize
    generalized  →  generaliz
 generalization  →  generalization
             go  →  go
          going  →  go
           goes  →  go
           gone  →  gone
         better  →  better
           best  →  best
     relational  →  relational
         relate  →  relate
        related  →  relat
       relating  →  relat
           cats  →  cat
          boxes  →  box
           mice  →  mice
    practically  →  practical
      practical  →  practical
    practica

## ⚠️ Why Some Outputs Are Incorrect in RegexpStemmer

The **RegexpStemmer** is purely **pattern-based**, meaning it does not understand language morphology.  
It simply strips the suffixes that match your custom regular expression — even when doing so **breaks the word form**.

---

### 🔍 Example Analysis (Based on Output)

| Word | Output | Expected | What Happened |
|:--|:--|:--|:--|
| `studies` | `stud` | `study` | Regex removed `ies`, leaving `stud`; it has no rewrite rule such as `ies → y`. |
| `studied` | `studi` | `study` | Removed `ed` → produced non-word root. |
| `happiness` | `happines` | `happy` | Removed only final `s`, not `ness`; pattern didn’t match `ness`. |
| `unhappily` | `unhappi` | `unhappy` | Removed `ly`, but didn’t restore `y`. |
| `organized` | `organiz` | `organize` | Removed `ed`, unaware that base form has `e`. |
| `programming` | `programm` | `program` | Removed `ing`, but the doubled `m` is not collapsed to a single `m`. |
| `writing` | `writ` | `write` | Removed `ing`, unaware that root should regain the `e`. |
| `finalized` | `finaliz` | `finalize` | Removed `ed`, unaware of morphological “restore e” rule. |

---

### 🧩 Why This Happens

Your regex:

```python
regexp = RegexpStemmer(
    regexp=r'(ing|ly|ed|ious|ies|ive|es|s|ment)$',
    min=3
)
```

... blindly removes suffixes, so it:
- Removes endings like `-ed`, `-s`, `-ing`, etc.
- Does not know when to add back missing letters (like `y` or `e`).
- Does not skip partial matches (e.g., removes `s` in `happiness`).

This makes it fast ⚡ but linguistically dumb 🤖.
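
Some of these errors can be reduced by tuning the pattern. The sketch below uses plain `re.sub` with a hypothetical `TUNED` variant that adds `ness`; it illustrates the idea and is not a recommended production pattern.

```python
import re

BASE = r"(ing|ly|ed|ious|ies|ive|es|s|ment)$"
TUNED = r"(ness|ing|ly|ed|ious|ies|ive|es|s|ment)$"  # hypothetical variant: adds -ness

for word in ["happiness", "studies", "boxes"]:
    base = re.sub(BASE, "", word)
    tuned = re.sub(TUNED, "", word)
    print(f"{word:>10}  base: {base:<10} tuned: {tuned}")
```

Because the regex engine scans from the left and the pattern is anchored with `$`, the suffix that starts earliest (the longest one) wins, whatever its position in the alternation. That is why `studies` loses `ies` rather than just `s`, and why adding `ness` is enough to turn `happines` into `happi`.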

### 🧮 Mathematically

$$
\text{RegexpStemmer}(w) =
\begin{cases}
w - \text{regex\_suffix}, & \text{if the suffix matches the regex pattern at the end of } w \\
w, & \text{otherwise}
\end{cases}
$$

---

Unlike **Porter** or **Snowball**, it lacks cleanup rules that run after a suffix is removed, such as:

$$
\text{if } w' \text{ ends with "at", "bl", or "iz"} \Rightarrow \text{append "e"}
$$


## 🧊 Snowball Stemmer (aka Porter2)

The **Snowball Stemmer** — also known as **Porter2 Stemmer** — is an improved, modernized version of the classic **Porter Stemmer**.  
It was introduced by **Martin Porter** himself as part of the **Snowball framework** for multilingual stemming.

---

### 🔹 Overview

Unlike the original Porter Stemmer, whose rules cover English only,  
the Snowball framework expresses stemmers for **multiple languages** in a cleaner, more maintainable rule-definition syntax.

It uses a **larger rule set**, **stricter conditions**, and **refined suffix handling**, making it:
- More **consistent** across morphological cases  
- Far **less aggressive** than Lancaster, with fewer odd stems than Porter (e.g., `fairly` → `fair` rather than `fairli`)  
- **Multilingual**, supporting languages such as `english`, `french`, `spanish`, `german`, `italian`, etc.

---

### 🧠 Working Logic

1️⃣ **Input**: token $w$  
2️⃣ **Identify** suffixes and endings based on the target language  
3️⃣ **Apply** language-specific morphological rules  
4️⃣ **Return** truncated stem $s$

Mathematically:

$$
\text{SnowballStemmer}(w) =
\begin{cases}
w - \text{language\_specific\_suffix}, & \text{if rule applies for language} \\
w, & \text{otherwise}
\end{cases}
$$

---

### 🧮 Example (English)

```python
from nltk.stem import SnowballStemmer
snowball = SnowballStemmer(language='english')

for w in ["connected", "connections", "studies", "studying", "happiness", "generalization"]:
    print(f"{w:>15}  →  {snowball.stem(w)}")
```

---
### 🧩 Sample Output

| Word           | Snowball Stem |
| :------------- | :------------ |
| connected      | connect       |
| connections    | connect       |
| studies        | studi         |
| studying       | studi         |
| happiness      | happi         |
| generalization | general       |

---

### ⚙️ Features & Advantages

| Feature                          | Description                                     |
| :------------------------------- | :---------------------------------------------- |
| 🌍 **Multilingual support**      | Works for multiple languages                    |
| ⚖️ **Balanced approach**         | Avoids over-stemming seen in Lancaster          |
| 📏 **Improved rule definitions** | Simpler and more uniform rule syntax            |
| ⚡ **Fast and lightweight**       | Similar performance to Porter                   |
| 🧩 **Stable results**            | Produces consistent stems across similar tokens |

---

### ⚠️ Limitations

| Limitation                        | Description                                                    |
| :-------------------------------- | :------------------------------------------------------------- |
| ❌ Still heuristic-based           | No understanding of real word meaning                          |
| ❌ Not lemmatization               | “studies” → “studi” (non-word)                                 |
| ❌ Limited language coverage      | Only languages that have a Snowball algorithm are supported    |
| ⚠️ Slight differences from Porter | Can produce smaller or larger stems depending on rule ordering |

---

### 📘 When to Use Snowball Stemmer
- ✅ Choose Snowball Stemmer when:
    - You need a fast and accurate rule-based stemmer for English or European languages
    - You want consistency and clarity in stemming behavior
    - You’re preprocessing for Information Retrieval, Topic Modeling, or Search Indexing
- ❌ Avoid when:
    - You need linguistically correct root forms (→ use WordNet Lemmatizer)
    - You’re working on languages unsupported by Snowball

---

### 💡 Summary

| Aspect           | Porter               | Snowball (Porter2)     |
| :--------------- | :------------------- | :--------------------- |
| Rules            | ~60 hardcoded        | ~85 structured         |
| Accuracy         | Moderate             | High                   |
| Language support | English only         | Multi-language         |
| Aggressiveness   | Moderate             | Controlled             |
| Output example   | “fairly” → “fairli”  | “fairly” → “fair”      |
| Use case         | IR tasks             | IR + NLP preprocessing |

---

### 🧮 Mathematical Insight

$$
\text{SnowballStemmer}(w) =
\text{normalize}\Bigg(
w - 
\sum_{i=1}^{n}
\text{suffix}_i \cdot 
\mathbb{1}_{\text{rule}_i(w)}
\Bigg)
$$

where:

- $\mathbb{1}_{\text{rule}_i(w)}$ is an **indicator function**:
  $$
  \mathbb{1}_{\text{rule}_i(w)} =
  \begin{cases}
  1, & \text{if linguistic rule } i \text{ applies to } w \\
  0, & \text{otherwise}
  \end{cases}
  $$

- $\text{suffix}_i$ represents each possible removable suffix.  
- $\text{normalize}(\cdot)$ denotes **post-processing** (like removing double letters or trailing vowels).  

---

📘 **Intuition**

The Snowball Stemmer applies a series of $n$ language-specific rules.  
Each rule checks if a word $w$ matches a condition (via $\text{rule}_i$).  
If true ($\mathbb{1}=1$), it removes the corresponding $\text{suffix}_i$,  
and then the result is normalized to ensure consistency across derived stems.




In [45]:
# ✅ 2. Snowball Stemmer — also called Porter2
#    Developed as an improved version of Porter with:
#    - Better consistency
#    - Multi-language support
#    - More transparent rules
             
snowball = SnowballStemmer(language="english")

print("\n🔹 Snowball Stemmer (Porter2) Results:\n")
for w in tokens:
    stem = snowball.stem(w)
    print(f"{w:>15}  →  {stem}")

# 🧠 Notes:
# - Handles 'happiness' → 'happi' like Porter
# - More consistent rule application
# - Supports many languages: Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian,
#   Portuguese, Romanian, Russian, Spanish and Swedish.



🔹 Snowball Stemmer (Porter2) Results:

        connect  →  connect
      connected  →  connect
     connection  →  connect
    connections  →  connect
     connecting  →  connect
          study  →  studi
        studies  →  studi
       studying  →  studi
        studied  →  studi
          happy  →  happi
      happiness  →  happi
      unhappily  →  unhappili
   organization  →  organ
       organize  →  organ
      organized  →  organ
     organizing  →  organ
     generalize  →  general
    generalized  →  general
 generalization  →  general
             go  →  go
          going  →  go
           goes  →  goe
           gone  →  gone
         better  →  better
           best  →  best
     relational  →  relat
         relate  →  relat
        related  →  relat
       relating  →  relat
           cats  →  cat
          boxes  →  box
           mice  →  mice
    practically  →  practic
      practical  →  practic
    practicable  →  practic
        writing  →  write
         wri

## ⚠️ Why Some Snowball Stemmer Outputs Look Incorrect

Although **Snowball Stemmer (Porter2)** is an improvement over Porter,  
it’s still a **rule-based stemmer** — not a lemmatizer.  
It applies **heuristic truncation rules**, which means it:
- Strips **common suffixes** like *-ed*, *-ing*, *-ly*, *-ation*, *-ize*, etc.
- Restores characters only in a few narrow cases (e.g., the `e` in *writing* → *write*); it never turns a final `i` back into `y`
- Treats **word derivation families** as morphologically equivalent

---

### 🔍 Example Analysis

| Word | Output | Expected Lemma | Explanation |
|:--|:--|:--|:--|
| `studies` | `studi` | `study` | Replaces *-ies* → *-i*, doesn’t restore *y* |
| `studied` | `studi` | `study` | Replaces *-ied* → *-i*, same as the *-ies* rule |
| `happiness` | `happi` | `happy` | Removes *-ness*, doesn’t change *i* → *y* |
| `unhappily` | `unhappili` | `unhappy` | Only the final *y* → *i* rewrite fires; *-li* is kept because it follows *i*, which is not a valid “li-ending” |
| `organization` | `organ` | `organize` | Rewrites *-ization* → *-ize*, then strips *-ize* |
| `organized` | `organ` | `organize` | Removes *-ed*, then *-ize*, collapsing both |
| `generalization` | `general` | `generalize` | Rewrites *-ization* → *-ize*, then *-alize* → *-al* |
| `goes` | `goe` | `go` | Strips only the final *-s*; there is no rule for *-es* after *o* |
| `history` | `histori` | `history` | Rewrites the final *y* → *i* because it follows a consonant |
| `practically` | `practic` | `practical` | Rewrites *y* → *i*, then *-alli* → *-al*, then *-ical* → *-ic* |
| `writing` | `write` | `write` | ✅ Correct — known morphological case handled well |
| `programming` | `program` | `program` | ✅ Correct — doubled consonant handled |
| `finalized` | `final` | `finalize` | Removes *-ized*, returns the root *final* |
| `finally` | `final` | `final` | ✅ Expected stem; adverb stripped correctly |

---

### 🧩 Why These Happen — Mechanism of Snowball (Porter2)

The **Porter2 algorithm** runs a fixed sequence of steps (numbered 0 through 5, with Step 1 split into 1a, 1b, and 1c),  
where each step applies pattern-based transformations.

$$
\text{SnowballStemmer}(w) = 
\text{normalize}\Big(
w - \sum_{i=1}^{n} \text{suffix}_i \cdot \mathbb{1}_{\text{rule}_i(w)}
\Big)
$$

Each rule ($\text{rule}_i$):
- Checks if the word ends with a specific suffix (e.g., `-ed`, `-ing`, `-ation`)
- Verifies a *position* condition: the suffix must lie inside region R1 or R2, which in effect enforces a minimum stem length
- Applies replacements like:  
  - `ies → i`  
  - `ization → ize`  
  - `ational → ate`  
  - `fulness → ful`

🧠 However:
- There is **no general restoration rule**: a final *i* is never turned back into *y*, and *e* is added back only in a few narrow cases  
- There is **no dictionary check** to verify if the result is a valid word  
- It assumes words sharing the same morphological stem should reduce to the same root
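
Porter2 phrases its length conditions in terms of two regions of the word, R1 and R2. The sketch below computes them under a simplifying assumption: every `y` is treated as a vowel, whereas the full algorithm first marks some `y` letters as consonants. The function names are made up for this example.

```python
VOWELS = "aeiouy"
SPECIAL_PREFIXES = ("gener", "commun", "arsen")  # Porter2 exceptions for R1

def region_start(word: str, start: int = 0) -> int:
    """Index just after the first non-vowel that follows a vowel, searching from `start`."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)  # null region at the end of the word

def regions(word: str) -> tuple[str, str]:
    """Return (R1, R2) as substrings of `word`."""
    r1 = next((len(p) for p in SPECIAL_PREFIXES if word.startswith(p)), None)
    if r1 is None:
        r1 = region_start(word)
    r2 = region_start(word, r1)
    return word[r1:], word[r2:]

for w in ["organize", "generalize", "beautiful"]:
    r1, r2 = regions(w)
    print(f"{w:>12}  R1={r1!r:<10} R2={r2!r}")
```

A suffix such as *-ize* is deleted only when it lies inside R2. In “organize” it does, so the stem becomes “organ”. In “generalize”, *-alize* is first rewritten to *-al*, and that *-al* starts before R2, so the stemmer stops at “general”.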

---

### ⚙️ Design Philosophy (Intentional Behavior)

The Snowball stemmer intentionally produces **canonical base forms**,  
not **valid English words** — because it’s built for **Information Retrieval (IR)**, not grammar.

Example:

| Use Case | Goal |
|:--|:--|
| Search “connections” | Should match documents containing “connect”, “connecting”, “connection” |
| Text classification | Token frequency of “connect” should count all related forms |
| Linguistic correctness | ❌ Not required |

Hence, truncations like “studi”, “happi”, “organ” are **acceptable stems** for IR,  
since they merge semantically related words into one root.
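
The search use case can be sketched with a tiny inverted index keyed by stems. Everything here is hypothetical: `toy_stem` is a crude regex stripper standing in for a real stemmer, and the three documents are invented for the demo.

```python
import re
from collections import defaultdict

def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer."""
    return re.sub(r"(ions|ion|ing|ed|s)$", "", word.lower())

docs = {
    1: "connecting two services",
    2: "the connection failed",
    3: "organ donation",
}

# Index documents by stem instead of by surface form.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        index[toy_stem(token)].add(doc_id)

def search(query: str) -> list[int]:
    """Stem the query the same way, then look it up."""
    return sorted(index.get(toy_stem(query), set()))

print(search("connections"))  # → [1, 2]
```

The query and the documents never share a surface form, yet they match because both sides are reduced to the same stem. Whether that stem is a real word is irrelevant to the lookup.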

---

### 📘 Summary: Stemming ≠ Lemmatization

| Aspect | Stemming (Porter/Snowball) | Lemmatization (WordNet) |
|:--|:--|:--|
| Logic | Heuristic rules | Morphological + dictionary |
| Output | Non-word stems | Valid dictionary words |
| Context awareness | ❌ None | ✅ Uses POS tags |
| Example | “studies” → “studi” | “studies” → “study” |
| Use case | IR / Search / Topic Modeling | Linguistic / Semantic tasks |

---

### 💡 Key Takeaway

- ❗ **Incorrect-looking stems ≠ wrong** — they’re **intentional** truncations.  
- ❗ The **Snowball Stemmer** doesn’t “understand” language — it only applies suffix heuristics.  
- ✅ For **real-word roots**, move to **WordNet Lemmatizer** (which we’ll cover next).

---

📚 **In one line:**

> *Snowball stemmer trims words for equality, not for readability.*


## ⚖️ Porter vs Snowball Stemmer — Behavior on `fairly` and `sportingly`

We’ll compare how the **Porter Stemmer** and **Snowball (Porter2) Stemmer** process  
two adverbial words: `fairly` and `sportingly`.

---

### 🧩 Code Used

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language='english')

print("Porter:", porter.stem("fairly"), ",", porter.stem("sportingly"))
print("Snowball:", snowball.stem("fairly"), ",", snowball.stem("sportingly"))
```

In [46]:
print("Porter:", porter.stem("fairly"), ",", porter.stem("sportingly"))
print("Snowball:", snowball.stem("fairly"), ",", snowball.stem("sportingly"))

Porter: fairli , sportingli
Snowball: fair , sport


### 🔍 Output

| Word | **Porter Stemmer** | **Snowball Stemmer (Porter2)** | ✅ **Explanation** |
|:--|:--|:--|:--|
| `fairly` | `fairli` | `fair` | Porter only rewrites the final *y → i*; Snowball also strips the resulting `-li` |
| `sportingly` | `sportingli` | `sport` | Porter again only rewrites *y → i* (it has no `-ingly` rule); Snowball removes the full `-ingly` suffix |

---

### 🧠 Why Porter Gives “fairli” and “sportingli”

The **Porter Stemmer** applies a **fixed, rule-based sequence** without understanding grammar or word parts.  
It mechanically replaces or trims suffixes based on pattern rules.

In **Step 1c** of Porter’s algorithm:

> (*v*) Y → I  
> If the word contains a vowel and ends with *y*, replace *y* with *i*.

So:

$$
\text{fairly} \xrightarrow[\text{Step 1c: } y \rightarrow i]{} \text{fairli}
$$

No later Porter rule matches a bare *-li* ending, so the stem stays “fairli”. The same happens for *sportingly*:

$$
\text{sportingly} \xrightarrow[\text{Step 1c: } y \rightarrow i]{} \text{sportingli}
$$

✅ Correct per its rules,  
❌ but not linguistically meaningful (non-word outputs).
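
Step 1c of the original algorithm is small enough to sketch directly. The helper names are chosen for this example, and NLTK's default mode uses a slightly modified condition, so this follows the rule as quoted above rather than NLTK's exact code.

```python
def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: y counts as a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def porter_step_1c(word: str) -> str:
    """(*v*) Y -> I: rewrite a final y if the rest of the word contains a vowel."""
    stem = word[:-1]
    has_vowel = any(not is_consonant(stem, i) for i in range(len(stem)))
    if word.endswith("y") and has_vowel:
        return stem + "i"
    return word

for w in ["fairly", "sportingly", "sky"]:
    print(f"{w:>12} -> {porter_step_1c(w)}")
```

“sky” is left alone because the part before the `y` contains no vowel.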

---

### 🧊 Why Snowball (Porter2) Is Better

**Snowball (Porter2)** improves Porter’s logic by:
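- Recognizing multi-suffix patterns (*-ingly*, *-edly*) in the same step that handles *-ing* and *-ed*  
- Still rewriting *y → i*, but then stripping *-li* when it follows a valid “li-ending” letter such as *r*  
- Checking suffix positions against word regions (R1, R2) instead of Porter’s measure  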

Hence:

$$
\text{SnowballStemmer("fairly")} = \text{"fair"}
$$

$$
\text{SnowballStemmer("sportingly")} = \text{"sport"}
$$
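
The `-li` rule that separates the two stemmers on “fairly” can be sketched in isolation. This covers only that single rule from Porter2's Step 2 (the real step checks longer suffixes such as `-alli` first), takes its input after the *y → i* rewrite, and treats every `y` as a vowel when locating R1. The function names are made up for this example.

```python
VOWELS = "aeiouy"
VALID_LI_ENDINGS = "cdeghkmnrt"  # letters that may precede a removable "li"

def r1_start(word: str) -> int:
    """Start of R1: just after the first non-vowel that follows a vowel."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)

def strip_li(word: str) -> str:
    """Delete a final 'li' if it lies in R1 and follows a valid li-ending."""
    if (word.endswith("li") and len(word) >= 3
            and word[-3] in VALID_LI_ENDINGS
            and len(word) - 2 >= r1_start(word)):
        return word[:-2]
    return word

for w in ["fairli", "unhappili"]:
    print(f"{w:>10} -> {strip_li(w)}")
```

This also explains the earlier result for “unhappily”: its `-li` follows an `i`, which is not a valid li-ending, so Snowball keeps “unhappili” just as Porter does.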

---

### 🧮 Formal Difference

$$
\text{PorterStemmer}(w) =
\text{apply\_rules}(w, \text{fixed\_suffixes})
$$

$$
\text{SnowballStemmer}(w) =
\text{normalize}\Big(
w -
\sum_{i=1}^{n}
\text{suffix}_i \cdot
\mathbb{1}_{\text{rule}_i(w)}
\Big)
$$

where:

- $\mathbb{1}_{\text{rule}_i(w)} = 1$ if linguistic rule *i* applies  
- $\text{normalize}(\cdot)$ handles cleanup like removing double consonants or restoring vowels

---

### 📘 Summary

| Aspect | **Porter Stemmer** | **Snowball Stemmer (Porter2)** |
|:--|:--|:--|
| Rule design | Fixed, legacy English rules | Modular, modern rule system |
| Handles `-ly` / `-ingly` | Partial (→ “sportingli”) | ✅ Robust (→ “sport”) |
| `fairly` → | `fairli` | `fair` |
| `sportingly` → | `sportingli` | `sport` |
| Accuracy | ❌ Often non-word | ✅ Linguistically cleaner |
| Language support | English only | Multilingual (English, German, Spanish, etc.) |

---

### 💡 Takeaway

> The **Porter Stemmer** applies old mechanical rules — fast but crude.  
> The **Snowball (Porter2)** Stemmer refines these rules for readability, consistency, and multilingual support.  
> For **true dictionary words**, the next step is **WordNet Lemmatization**.


## ⚡ Lancaster Stemmer — The Aggressive Rule-Based Stemmer

The **Lancaster Stemmer** (also known as the **Paice/Husk Stemmer**) is a **very aggressive, iterative rule-based** stemmer.  
It was developed by **Chris Paice (1990)** and applies a series of **deletion and substitution rules** until no more rules can be applied.  

It’s fast ⚡ and simple, but often **over-stems** words — chopping too much and sometimes merging unrelated terms.

---

### 🔹 Overview

- Developed after Porter, designed to be **simpler and faster**  
- Uses a **rule lookup table** of around **120 rules**  
- Applies rules **iteratively** (multiple passes) until no further reduction is possible  
- Each rule defines:
  - a **suffix pattern** to remove  
  - a **replacement** (optional)  
  - and whether to **continue or stop**

---

### 🧠 How It Works

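In NLTK's default rule table, each Lancaster rule is a compact string with the general form:

$$
\text{<reversed ending><intact flag><remove count><append string><continue or stop>}
$$

The ending is stored reversed, the optional `*` restricts the rule to words that have not been changed yet, the digit is the number of characters to remove, and the final symbol is `>` (continue) or `.` (stop).

---

### 🧩 Example Rules

```text
de2>       → ending "ed": remove 2 characters, then continue
dei3y>     → ending "ied": remove 3 characters, append "y", then continue
ss0.       → ending "ss": remove nothing and stop (protects "-ss")
```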
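
For reference, rule strings in NLTK's default Lancaster table pack several fields into one token: the ending written backwards, an optional `*` (apply only to an unmodified word), a digit giving the number of characters to remove, optional letters to append, and `>` or `.` for continue or stop. The parser below is a minimal sketch under that assumption; `RULE_PATTERN` and `parse_rule` are illustrative names, not part of NLTK.

```python
import re

# Assumed layout of one rule string, modelled on NLTK's default Lancaster table:
#   reversed ending, optional "*" (intact-only), one digit (chars to remove),
#   optional letters to append, then ">" (continue) or "." (stop).
RULE_PATTERN = re.compile(r"^([a-z]+)(\*?)(\d)([a-z]*)([>.])$")


def parse_rule(rule):
    """Split a rule string into a small dict of named parts."""
    match = RULE_PATTERN.match(rule)
    if match is None:
        raise ValueError(f"not a valid rule string: {rule!r}")
    ending, intact, remove, append, flag = match.groups()
    return {
        "ending": ending[::-1],        # stored reversed, so flip it back
        "intact_only": intact == "*",
        "remove": int(remove),
        "append": append,
        "continue": flag == ">",
    }


for rule in ("de2>", "dei3y>", "s*1>", "ss0."):
    print(rule, "→", parse_rule(rule))
```

Reading the rules this way makes the digit's role clear: it counts characters to strip, and is not a minimum word length.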

---

### 🔍 Sample Output

| Word | **Lancaster Stem** | ✅ **Observation** |
|:--|:--|:--|
| `connection` | `connect` | ✅ Good |
| `connected` | `connect` | ✅ Good |
| `connecting` | `connect` | ✅ Good |
| `organization` | `org` | ⚠️ Over-stemmed (collapses to a three-letter stem) |
| `organized` | `org` | ⚠️ Same stem as `organization` |
| `happiness` | `happy` | ✅ Accurate |
| `practically` | `pract` | ⚠️ Truncated |
| `studies` | `study` | ✅ Correct |
| `generalization` | `gen` | ⚠️ Too short, over-stemmed |

---

### ⚠️ Characteristics & Limitations

| Aspect | Description |
|:--|:--|
| ⚡ **Aggressive** | Removes large parts of the word; may over-stem |
| 🔁 **Iterative** | Continues applying rules until no further matches |
| ❌ **Unstable** | Small changes in input can cause big differences in output |
| 🧩 **Short stems** | Often results in very short stems (e.g., “organization” → “org”, “generalization” → “gen”) |
| 📚 **No context** | Doesn’t distinguish noun/verb forms or semantics |

---

### 🧮 Mathematical View

$$
\text{LancasterStemmer}(w) =
\text{Iterate}\Big(
w - \sum_{i=1}^{n}
\text{rule}_i(w)
\Big)
$$

Where:  
- Each $\text{rule}_i$ is a transformation (delete/replace suffix)  
- The process **continues until convergence** (no rule applies)

That is:

$$
w_{t+1} = \text{apply\_rule}(w_t)
\quad \text{until} \quad
w_{t+1} = w_t
$$

---

### ⚙️ Strengths vs Weaknesses

| ✅ Strengths | ⚠️ Weaknesses |
|:--|:--|
| Very fast and simple | Over-stemming is common |
| Small code footprint | Can merge unrelated words |
| Iterative and deterministic | Not linguistically aware |
| Handles many English suffixes | Poor accuracy for complex derivations |

---

### 📘 Summary

| Feature | Description |
|:--|:--|
| Algorithm | Rule-based, iterative |
| Developer | Chris Paice (1990) |
| Aggressiveness | 🔥 Very high |
| Accuracy | ⚠️ Moderate |
| Iterative | Yes |
| Output validity | Often non-dictionary stems |
| Example | `organization → org` |
| Use case | When **speed > accuracy** (e.g., keyword compression) |

---

### 💡 Takeaway

> The **Lancaster Stemmer** is fast and the most aggressive of the classical stemmers compared here.  
> It’s suitable for quick **indexing or keyword matching**, but **not recommended for semantic NLP tasks**.  
>  
> For balanced results — use **Snowball (Porter2)**.  
> For linguistically valid roots — use **WordNet Lemmatizer** next.


In [47]:
# ⚠️ 3. Lancaster Stemmer — very aggressive
#     Often chops too much, merging unrelated words.
#     But it’s extremely fast and compact.

lancaster = LancasterStemmer()

print("\n🔹 Lancaster Stemmer Results (Aggressive):\n")
for w in tokens:
    stem = lancaster.stem(w)
    print(f"{w:>15}  →  {stem}")

# 🧠 Notes:
# - Often shorter stems than Porter/Snowball (e.g., 'organization' → 'org' vs. 'organ')
# - Can over-stem: 'practically', 'practical', and 'practicable' all → 'pract'
# - Useful when you want maximum vocabulary compression



🔹 Lancaster Stemmer Results (Aggressive):

        connect  →  connect
      connected  →  connect
     connection  →  connect
    connections  →  connect
     connecting  →  connect
          study  →  study
        studies  →  study
       studying  →  study
        studied  →  study
          happy  →  happy
      happiness  →  happy
      unhappily  →  unhappy
   organization  →  org
       organize  →  org
      organized  →  org
     organizing  →  org
     generalize  →  gen
    generalized  →  gen
 generalization  →  gen
             go  →  go
          going  →  going
           goes  →  goe
           gone  →  gon
         better  →  bet
           best  →  best
     relational  →  rel
         relate  →  rel
        related  →  rel
       relating  →  rel
           cats  →  cat
          boxes  →  box
           mice  →  mic
    practically  →  pract
      practical  →  pract
    practicable  →  pract
        writing  →  writ
         writes  →  writ
    programming  →  pr

## ⚠️ Why Some Lancaster Stemmer Results Look Incorrect

The **Lancaster Stemmer** is extremely **aggressive** and **iterative**, which makes it *fast* but often **linguistically inaccurate**.  
Unlike Porter or Snowball, it doesn’t just remove a suffix — it applies **a chain of truncation and replacement rules repeatedly**,  
until the word can no longer be shortened.

---

### 🔍 Example Output

| Word | **Lancaster Stem** | ✅ **Expected** | ⚙️ **Observation** |
|:--|:--|:--|:--|
| `organization` | `org` | `organize` | ⚠️ Over-stemmed: several suffix rules applied in sequence |
| `generalization` | `gen` | `generalize` | ⚠️ Over-stemmed: several suffix rules applied in sequence |
| `unhappily` | `unhappy` | `unhappy` | ✅ Correct: “-ily” replaced by “-y” |
| `finally` | `fin` | `final` | ⚠️ Over-stemmed: lost meaningful suffix |
| `gone` | `gon` | `go` | ⚠️ Only the final “e” is removed |
| `mice` | `mic` | `mouse` | ❌ Wrong: no understanding of irregular plurals |
| `better` | `bet` | `good` | ❌ No awareness of comparatives |
| `writing` | `writ` | `write` | ✅ Acceptable stem |
| `practical` | `pract` | `practical` | ⚠️ Truncated aggressively |

---

### 🧠 Why It Happens

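The **Lancaster Stemmer** uses a compact **rule table of about 120 transformation rules**.  
Each rule defines:
- an **ending** to match (stored reversed in NLTK's table),
- how many **characters to remove**, plus an optional string to **append**, and  
- whether to **continue or stop** processing.

A removal is also skipped when the remaining stem would be too short to be acceptable.

For example, some of the rules in NLTK's default table look like:

```text
noi3>      → ending "ion": remove 3 characters, continue
ta2>       → ending "at": remove 2 characters, continue
zi2>       → ending "iz": remove 2 characters, continue
na2>       → ending "an": remove 2 characters, continue
e1>        → ending "e": remove 1 character, continue
```

These are applied **iteratively**.  
So a word like **“organization”** goes through multiple transformations:

```text
organization 
→ organizat
→ organiz
→ organ
→ org
```

✅ The algorithm stops **only when no rule matches, a rule marked `.` fires, or the stem would become too short**,  
hence the stems are often **shorter than expected**.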
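
The cascade can be reproduced with a small, self-contained sketch. It assumes a hand-picked subset of rules that mirror entries in NLTK's default Lancaster table, plus a simplified acceptability check; `MINI_RULES`, `acceptable`, and `mini_lancaster` are illustrative names, not NLTK API. The intermediate forms depend on which rules are in the table, but with this subset the final stems agree with the notebook output above (`org`, `gen`).

```python
# A hand-picked subset of rules as (ending, chars_to_remove, append, continue?).
# Assumption: these mirror entries in NLTK's default Lancaster table.
MINI_RULES = [
    ("ion", 3, "", True),
    ("at", 2, "", True),
    ("iz", 2, "", True),
    ("an", 2, "", True),
    ("al", 2, "", True),
    ("er", 2, "", True),
    ("en", 2, "", True),
    ("e", 1, "", True),
]
VOWELS = "aeiouy"


def acceptable(stem):
    """Reject stems that are too short (simplified acceptability check)."""
    if not stem:
        return False
    if stem[0] in VOWELS:
        return len(stem) >= 2
    return len(stem) >= 3 and any(ch in VOWELS for ch in stem)


def mini_lancaster(word):
    """Apply MINI_RULES repeatedly and return every intermediate form."""
    trace = [word]
    while True:
        for ending, remove, append, keep_going in MINI_RULES:
            candidate = word[: len(word) - remove] + append
            if word.endswith(ending) and acceptable(candidate):
                word = candidate
                trace.append(word)
                break
        else:
            return trace  # no rule matched: fixed point reached
        if not keep_going:
            return trace  # the matched rule asked to stop


print(" → ".join(mini_lancaster("organization")))
print(" → ".join(mini_lancaster("generalization")))
```

The acceptability check is what keeps `gen` from shrinking further: removing "en" would leave a single consonant, so the rule is skipped and the loop ends.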

---

### 🧩 Mathematical Representation 

$$
w_{t+1} = f(w_t)
$$

where:

$$
f(w) = 
\begin{cases}
w - \text{suffix}, & \text{if a rule applies} \\
\text{replace}(w), & \text{if replacement condition met} \\
w, & \text{otherwise}
\end{cases}
$$

The process continues until convergence:

$$
\text{LancasterStemmer}(w) = \lim_{t \to T} f^{(t)}(w)
$$

---

### 🔎 Key Differences from Porter / Snowball

| Feature | **Lancaster** | **Porter** | **Snowball (Porter2)** |
|:--|:--|:--|:--|
| Rule type | Iterative table lookup | Sequential suffix removal | Structured rule system |
| Aggressiveness | 🔥 Very high | Medium | Controlled |
| Iterative passes | ✅ Yes | ❌ No | ❌ No |
| Output validity | Often non-word | Sometimes non-word | Usually readable |
| `organization` | `org` | `organ` | `organ` |
| `generalization` | `gen` | `gener` | `general` |
| `practically` | `pract` | `practic` | `practic` |
| Speed | ⚡ Fastest | ⚡ Fast | ⚡ Fast |
| Use case | IR compression | General NLP | Balanced NLP |

---

### 📘 When Results Are “Incorrect”

They look incorrect because:

1. The algorithm has **no linguistic understanding** — it only applies string patterns.  
2. Rules can **cascade**, meaning multiple suffix rules can apply in sequence.  
3. It doesn’t check whether the **resulting stem** is a valid English word.  
4. It was designed for **Information Retrieval (IR)**, not grammatical correctness.

---

### ✅ When It’s Still Useful

| Use Case | Reason |
|:--|:--|
| **Search indexing** | Fewer unique stems → faster lookups |
| **Text deduplication** | Groups related word forms aggressively |
| **Keyword extraction** | Reduces inflectional variations |
| **Count-based NLP models** | Smaller vocabulary → more efficient vectorization |
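
The "fewer unique stems" effect is easy to quantify. The sketch below copies a subset of Lancaster stems from the notebook output above and measures how much the vocabulary shrinks; the variable names are illustrative, and the ratio only describes this small sample, not a real corpus.

```python
# Lancaster stems copied from the notebook output above (a subset of `tokens`).
lancaster_stems = {
    "connect": "connect", "connected": "connect", "connection": "connect",
    "connections": "connect", "connecting": "connect",
    "study": "study", "studies": "study", "studying": "study", "studied": "study",
    "organization": "org", "organize": "org", "organized": "org", "organizing": "org",
    "relational": "rel", "relate": "rel", "related": "rel", "relating": "rel",
    "practically": "pract", "practical": "pract", "practicable": "pract",
}

vocab_before = len(lancaster_stems)                # distinct surface forms
vocab_after = len(set(lancaster_stems.values()))   # distinct stems
reduction = 1 - vocab_after / vocab_before

print(f"Unique tokens: {vocab_before}")
print(f"Unique stems:  {vocab_after}")
print(f"Vocabulary reduction: {reduction:.0%}")
```

A smaller vocabulary means fewer columns in a bag-of-words or TF-IDF matrix, which is where the efficiency gain for count-based models comes from.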

---

### 💡 Summary

> The **Lancaster Stemmer** is extremely fast but highly aggressive.  
> It’s ideal for **text retrieval** and **index compression**,  
> but unsuitable for linguistically sensitive NLP applications.  
>
> ✅ Use **Snowball (Porter2)** for balanced stemming.  
> ✅ Use **WordNet Lemmatizer** for meaningful dictionary words.


In [48]:
# 🧮 Compare outputs of all stemmers side by side for quick analysis

def compare_stemmers(word):
    return {
        "token": word,
        "Porter": porter.stem(word),
        "Snowball": snowball.stem(word),
        "Lancaster": lancaster.stem(word),
        "Regexp": regexp.stem(word),
    }

results = [compare_stemmers(w) for w in tokens]

# Compute column widths for nice console alignment
cols = ["token", "Porter", "Snowball", "Lancaster", "Regexp"]
widths = {c: max(len(c), max(len(str(r[c])) for r in results)) for c in cols}

# Header
header = "  ".join(c.upper().ljust(widths[c]) for c in cols)
print(header)
print("-" * len(header))

# Rows
for row in results:
    print("  ".join(str(row[c]).ljust(widths[c]) for c in cols))

# 💡 Alternative visualization:
# import pandas as pd
# df = pd.DataFrame(results)
# display(df)


TOKEN           PORTER     SNOWBALL   LANCASTER  REGEXP        
---------------------------------------------------------------
connect         connect    connect    connect    connect       
connected       connect    connect    connect    connect       
connection      connect    connect    connect    connection    
connections     connect    connect    connect    connection    
connecting      connect    connect    connect    connect       
study           studi      studi      study      study         
studies         studi      studi      study      stud          
studying        studi      studi      study      study         
studied         studi      studi      study      studi         
happy           happi      happi      happy      happy         
happiness       happi      happi      happy      happines      
unhappily       unhappili  unhappili  unhappy    unhappi       
organization    organ      organ      org        organization  
organize        organ      organ      or

## 📊 Observations and Comparison

| Word | Porter | Snowball | Lancaster | Regexp | Remarks |
|:----|:-------|:---------|:-----------|:--------|:--------|
| connected | connect | connect | connect | connect | ✅ Consistent |
| studies | studi | studi | study | stud | ⚠️ Regexp over-trims to “stud” |
| happiness | happi | happi | happy | happines | ⚠️ Regexp strips only the final “s” |
| organizing | organ | organ | org | organiz | ⚠️ Lancaster over-stems; Regexp retains “z” |
| practical | practic | practic | pract | practic | ⚠️ Lancaster over-stems |

---

### ⚠️ Over-stemming vs Under-stemming

- **Over-stemming:** words with different meanings merge into one stem  
  e.g., Lancaster maps “practical” and “practicable” both to “pract”
- **Under-stemming:** related words fail to merge  
  e.g., Lancaster leaves “go”, “goes”, and “gone” as “go”, “goe”, and “gon”
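
Both failure modes can be detected mechanically once a reference grouping exists. The sketch below uses Lancaster stems copied from the notebook output and a small, hand-labelled `families` mapping that is purely hypothetical (it is an assumption made for this check, not data from any library).

```python
from collections import defaultdict

# Lancaster stems copied from the notebook output above.
stems = {
    "practically": "pract", "practical": "pract", "practicable": "pract",
    "go": "go", "goes": "goe", "gone": "gon", "going": "going",
}

# Hypothetical hand-labelled word families, used only as a reference here.
families = {
    "practical": {"practically", "practical"},
    "practicable": {"practicable"},
    "go": {"go", "goes", "gone", "going"},
}

# Group words by the stem they received.
groups = defaultdict(set)
for word, stem in stems.items():
    groups[stem].add(word)

family_of = {w: name for name, members in families.items() for w in members}

# Over-stemming: one stem covers words from more than one family.
over = {s: ws for s, ws in groups.items() if len({family_of[w] for w in ws}) > 1}
# Under-stemming: one family is spread over more than one stem.
under = {
    name: {stems[w] for w in members}
    for name, members in families.items()
    if len({stems[w] for w in members}) > 1
}

print("Over-stemmed:", over)
print("Under-stemmed:", under)
```

What counts as an error depends entirely on the reference families, so the labels should reflect the distinctions that matter for the task at hand.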

---

### ✅ Best Practices

- Use **Porter** or **Snowball** for English text preprocessing pipelines.  
- Use **Lancaster** only for **aggressive vocabulary compression** (like search engines).  
- Use **Regexp** for **custom domains** — especially if you know common suffix patterns.  
- For linguistically valid roots → switch to **lemmatization**.

---

### 🧮 Mathematical Recap

$$
\text{Stem: } \mathcal{W} \to \mathcal{S}, \quad s = \text{stem}(w)
$$

$$
\text{Word Family: } \{\text{connect}, \text{connected}, \text{connections}, \text{connecting}\} 
\xrightarrow{\text{stem}} \{\text{connect}\}
$$


## 🧩 WordNet Lemmatization — Getting True Dictionary Words

Unlike **stemming**, which just chops off suffixes,  
**lemmatization** uses **vocabulary + grammar rules** to find the *actual dictionary root* (lemma) of a word.

It considers:
- Part of Speech (POS) — noun, verb, adjective, adverb  
- Morphological rules and exception lists (e.g., *studies → study*; *better → good* when tagged as an adjective)  
- WordNet lexical database for valid words
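
Because the lemma depends on the part of speech, a common pattern is to translate tagger output into the single-letter POS codes that `WordNetLemmatizer.lemmatize` accepts (`'n'`, `'v'`, `'a'`, `'r'`). The helper below is a minimal sketch that assumes the tags are Penn Treebank style, as produced by `nltk.pos_tag`; `penn_to_wordnet` is an illustrative name, not an NLTK function.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (default: noun)."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, ...)
    if tag.startswith("R"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun, and the fallback for everything else


for tag in ("NNS", "VBG", "JJR", "RB"):
    print(tag, "→", penn_to_wordnet(tag))
```

With a lemmatizer instance available, the result would be passed as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`. Falling back to noun matches the lemmatizer's own default when no POS is given.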

---

### 💻 Example Code


In [49]:
# Download WordNet if not already done
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/psundara/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/psundara/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [50]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

# The loops below reuse `tokens` (defined earlier) so the lemmas line up with the stemmer tables.

print("🔹 Lemmatization Examples:\n")
for w in tokens:
    print(f"{w:>10} → {lemmatizer.lemmatize(w)}")

# Using POS (Part of Speech) for better accuracy
print("\n🔹 With POS Tags:\n")
for w in tokens:
    print(f"{w} (verb): {lemmatizer.lemmatize(w, pos='v')}")
    print(f"{w} (adjective): {lemmatizer.lemmatize(w, pos='a')}")

    
print("studying (verb):", lemmatizer.lemmatize("studying", pos='v'))
print("better (adjective):", lemmatizer.lemmatize("better", pos='a'))

🔹 Lemmatization Examples:

   connect → connect
 connected → connected
connection → connection
connections → connection
connecting → connecting
     study → study
   studies → study
  studying → studying
   studied → studied
     happy → happy
 happiness → happiness
 unhappily → unhappily
organization → organization
  organize → organize
 organized → organized
organizing → organizing
generalize → generalize
generalized → generalized
generalization → generalization
        go → go
     going → going
      goes → go
      gone → gone
    better → better
      best → best
relational → relational
    relate → relate
   related → related
  relating → relating
      cats → cat
     boxes → box
      mice → mouse
practically → practically
 practical → practical
practicable → practicable
   writing → writing
    writes → writes
programming → programming
  programs → program
   history → history
   finally → finally
 finalized → finalized

🔹 With POS Tags:

connect (verb): connect
connect (adje