# 🔧 Text Preprocessing: Stemming & Its Types

**Stemming** is a **rule-based text normalization process** that reduces words to a **base stem** by stripping affixes (in practice mostly **suffixes**) according to a sequence of predefined rules.  
It is a **fast**, **heuristic**, and **dictionary-free** approach that only approximates a word’s **root form (lemma)**.

---

📘 **Important Distinction**

> While **lemmatization** uses vocabulary and morphological analysis to find valid words (lemmas),  
> **stemming** merely truncates tokens based on heuristic rules, so the result *may not* be a valid dictionary word.

---

### 🧩 Example

| Original Word | Stem Output (Porter) | Valid Word? |
|:--------------|:-------------|:-------------|
| connections | connect | ✅ |
| studies | studi | ❌ |
| organizing | organ | ✅ (a real word, but with a different meaning) |

---

### 🧮 Formal Definition

Let $w$ be a token (word). The stemming process applies a function:
$$
\text{stem}(w) \rightarrow s
$$
where $s$ is the derived stem that may or may not correspond to a valid lemma.

Given a corpus $C = \{w_1, w_2, \dots, w_n\}$,
$$
S = \{\text{stem}(w_1), \text{stem}(w_2), \dots, \text{stem}(w_n)\}
$$
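
As a concrete reading of the two formulas, the sketch below maps a small corpus through a stem function and collects the results into the set $S$. `toy_stem` is a hypothetical stand-in (a single suffix regex), not one of the NLTK stemmers; the only point is that $S$ can be much smaller than $C$.

```python
import re

def toy_stem(word: str) -> str:
    """Hypothetical stem(w): strip one of a few endings; not a real stemmer."""
    return re.sub(r"(ions|ion|ing|ed|s)$", "", word)

# C: a small connect-family
corpus = ["connect", "connected", "connection", "connections", "connecting"]

# S: the set of stems; duplicates collapse, which is the vocabulary reduction
stems = {toy_stem(w) for w in corpus}

print(len(corpus), "tokens ->", len(stems), "stem(s):", stems)
```

Because $S$ is a set, every variant that maps to the same stem is counted once.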

---

### 🔹 Why We Use Stemming

✅ Reduces the **vocabulary size** → smaller, faster models  
✅ Groups **morphological variants** → improves recall in text search  
⚠️ May **over-stem** (merge unrelated words) or **under-stem** (fail to merge related words)

---

### 🧠 Stemming vs Lemmatization

| Feature | Stemming | Lemmatization |
|:--------|:----------|:---------------|
| Output | Truncated stem | Valid dictionary word |
| Logic | Rule-based suffix chopping | Morphology + POS + Dictionary |
| Accuracy | Approximate | Precise |
| Speed | ⚡ Fast | 🐢 Slower |
| Example | *studies → studi* | *studies → study* |


In [41]:
# A diverse test vocabulary containing verbs, nouns, adjectives, adverbs, and edge cases.
# Each "family" groups related morphological variants to reveal how stemmers conflate them.
tokens = [
    # ----- connect-family (regular verb/noun derivations)
    "connect", "connected", "connection", "connections", "connecting",
    # ----- study-family (y→i + inflections)
    "study", "studies", "studying", "studied",
    # ----- happy-family (adj/adv/negation)
    "happy", "happiness", "unhappily",
    # ----- organize-family (ize/ization derivations)
    "organization", "organize", "organized", "organizing",
    # ----- generalize-family (ize/ization derivations)
    "generalize", "generalized", "generalization",
    # ----- go-family (irregular verb forms)
    "go", "going", "goes", "gone",
    # ----- comparatives/superlatives (irregular scale)
    "better", "best",
    # ----- relate-family (derivational adjective)
    "relational", "relate", "related", "relating",
    # ----- pluralization and irregular plurals
    "cats", "boxes", "mice",
    # ----- practical-family (derivational/orthographic similarity)
    "practically", "practical", "practicable",
    # ----- write-family (progressive -ing and 3rd-person singular -s)
    "writing", "writes",
    # ----- program-family (noun vs. verb derivations, plural -s)
    "programming", "programs",
    # ----- history-family (bare noun: some words have no useful conflation)
    "history",
    # ----- final-family (adverb -ly vs. past participle -ed)
    "finally", "finalized",
]

# Display metadata about our sample
print("🧾 Corpus Preview: Tokens for Stemming Exploration")
print(f"Total tokens: {len(tokens)}\n")
print(tokens)

# 💡 Design notes:
# - Regular and irregular inflections (connect, study, go)
# - Adjective/adverb forms and negation (happy → unhappily)
# - Derivational morphology (organization, generalization)
# - Comparatives/superlatives (better, best) to see if stemmers over-conflate
# - Plural forms (cats, boxes) and irregular plurals (mice)
# - Orthographically similar but semantically distinct (practical vs practicable)
# - Newly added: write/program/final/history families to probe -ing, -s, -ly, -ed endings
#   and noun/verb derivations common in software/text corpora.


🧾 Corpus Preview: Tokens for Stemming Exploration
Total tokens: 42

['connect', 'connected', 'connection', 'connections', 'connecting', 'study', 'studies', 'studying', 'studied', 'happy', 'happiness', 'unhappily', 'organization', 'organize', 'organized', 'organizing', 'generalize', 'generalized', 'generalization', 'go', 'going', 'goes', 'gone', 'better', 'best', 'relational', 'relate', 'related', 'relating', 'cats', 'boxes', 'mice', 'practically', 'practical', 'practicable', 'writing', 'writes', 'programming', 'programs', 'history', 'finally', 'finalized']


## 🧰 Types of Stemmers in NLP

There are several stemmers available in NLTK, each with its own level of aggressiveness and rule set.

| Stemmer | Description | Behavior |
|:--------|:-------------|:------------|
| **PorterStemmer** | 🧩 The most classic rule-based stemmer; moderate and reliable | Conservative |
| **SnowballStemmer (Porter2)** | ❄️ Enhanced Porter version; supports multiple languages | Balanced |
| **LancasterStemmer** | ⚡ Extremely aggressive; very short stems | Over-stems often |
| **RegexpStemmer** | 🔍 Custom regex-based stemmer; great for domain control | Fully customizable |

---

Each stemmer applies pattern-based rules:
$$
\text{stemmer}(w) = w - \text{(suffixes according to pattern rules)}
$$

Let's compare how these differ in practice 👇


In [42]:
# 📦 Import the stemmers from NLTK
# We'll test how each one handles the morphological forms in `tokens` (defined above).

from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer, RegexpStemmer

In [43]:
# ✅ 1. Porter Stemmer: the most commonly used baseline
#    Developed by Martin Porter in 1980, uses a fixed set of rules (about 60)
#    Typically produces readable stems and avoids excessive truncation.

porter = PorterStemmer()

print("🔹 Porter Stemmer Results:\n")
for w in tokens:
    stem = porter.stem(w)
    print(f"{w:>15}  →  {stem}")

# 🧠 Notes:
# - "connections" → "connect"
# - "studies" → "studi"
# - "organizing" → "organ"
# Porter tries to keep stems interpretable, but not necessarily valid words.


🔹 Porter Stemmer Results:

        connect  →  connect
      connected  →  connect
     connection  →  connect
    connections  →  connect
     connecting  →  connect
          study  →  studi
        studies  →  studi
       studying  →  studi
        studied  →  studi
          happy  →  happi
      happiness  →  happi
      unhappily  →  unhappili
   organization  →  organ
       organize  →  organ
      organized  →  organ
     organizing  →  organ
     generalize  →  gener
    generalized  →  gener
 generalization  →  gener
             go  →  go
          going  →  go
           goes  →  goe
           gone  →  gone
         better  →  better
           best  →  best
     relational  →  relat
         relate  →  relat
        related  →  relat
       relating  →  relat
           cats  →  cat
          boxes  →  box
           mice  →  mice
    practically  →  practic
      practical  →  practic
    practicabl

## 🧩 Why Porter Stemmer Produces “studi” and “organ”

The **Porter Stemmer** works through a series of **rule-based substitution steps**, where each step applies patterns of the form:

$$
\text{(suffix)} \rightarrow \text{(replacement)}
$$

These rules are *ordered and conditional*, meaning:
- The stemmer checks for certain suffix patterns in a fixed sequence of steps.
- Within a step, only the longest matching suffix is considered; the result is then passed to the next step, and earlier steps are never revisited.
- Many rules fire only if the remaining stem is long enough (Porter's *measure* $m$, roughly the number of vowel-consonant sequences in the stem).

Let’s analyze the examples 👇

---

### 🔹 `"connections" → "connect"`
**Rules triggered:**
- Step 1a: remove plural **-s** → “connection”
- Step 4: remove **-ion** when the remaining stem ends in **s** or **t** and is long enough ($m > 1$) → “connect”

✅ This is a *perfect case*: Porter recovers the root form “connect”.

---

### 🔹 `"studies" → "studi"`
**Rule triggered:**
- Step 1a: `ies` → `i`
  
$$
\text{studies} \rightarrow \text{studi}
$$

🔍 Porter never maps the “i” back to “y”. Instead, Step 1c rewrites a final “y” as “i”, so “study” also becomes “studi” and the whole family shares one stem.

⚠️ **Result:** “studi” is **not** a valid word (a lemmatizer would return “study”), but it is a consistent key for the study-family.

---

### 🔹 `"organizing" → "organ"`
**Rules triggered:**
- Step 1b: remove **-ing** → “organiz”, then restore the **e** after **-iz** → “organize”
- Step 4: remove **-ize** because the remaining stem “organ” is long enough ($m > 1$) → “organ”

⚠️ This happens because Porter doesn’t look at full morphology: it only applies string-based rules, so it cannot tell that “organize” and “organ” mean different things.
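
Several Porter rules are conditional on the *measure* $m$ of the remaining stem: writing a stem as $[C](VC)^m[V]$, $m$ counts its vowel-consonant sequences. The sketch below is a minimal, standalone computation of $m$ (the helper names are illustrative, not NLTK API). It shows that the stem “organ” has $m = 2$, which is why a Step 4 rule requiring $m > 1$ can strip *-ize* from “organize”.

```python
import re

VOWELS = "aeiou"

def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: 'y' counts as a consonant unless it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem: str) -> int:
    """Count the VC sequences in the [C](VC)^m[V] form of the stem."""
    pattern = "".join("C" if is_consonant(stem, i) else "V" for i in range(len(stem)))
    # Collapse runs of consonants and runs of vowels to a single letter each
    collapsed = re.sub(r"V+", "V", re.sub(r"C+", "C", pattern))
    return collapsed.count("VC")

for s in ["tree", "trouble", "organ", "connect"]:
    print(f"{s:>8}  m = {measure(s)}")
```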

---

## ⚠️ Drawbacks of Porter Stemmer

| Limitation | Description | Example |
|:------------|:-------------|:----------|
| **1️⃣ Over-stemming** | Different words reduced to same root (loss of meaning) | “organization” & “organism” → “organ” |
| **2️⃣ Under-stemming** | Related words fail to merge | “go”, “goes” → “goe”, and “gone” → “gone” stay separate |
| **3️⃣ Non-dictionary roots** | Outputs stems like “studi”, “happi”, “gener” | “studies” → “studi” |
| **4️⃣ No POS or context awareness** | Doesn’t know if word is noun/verb/adjective, and cannot link irregular forms | “better” stays “better” (never “good”) |
| **5️⃣ Fixed English-only rules** | Limited cross-linguistic support | Works poorly on non-English corpora |
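
The over- and under-stemming rows can be checked mechanically by grouping words by their stem. The sketch below hard-codes stems reported in this document (the Porter output and the table above) rather than calling NLTK, so the mapping is an assumption copied from those results.

```python
from collections import defaultdict

# Word -> stem pairs copied from the Porter results in this document
porter_stems = {
    "organization": "organ",
    "organize": "organ",
    "organism": "organ",   # unrelated meaning, same stem: over-stemming
    "go": "go",
    "going": "go",
    "goes": "goe",         # same verb, different stems: under-stemming
    "gone": "gone",
}

# Invert the mapping: stem -> list of words that share it
groups = defaultdict(list)
for word, stem in porter_stems.items():
    groups[stem].append(word)

for stem, words in groups.items():
    print(f"{stem:>6} <- {words}")
```

A group that mixes unrelated meanings signals over-stemming; one word family spread over several groups signals under-stemming.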

---

📌 **In short:**
- Porter is **fast**, **deterministic**, and great for **IR tasks** (like search engines).
- But it’s **linguistically naive**: it just “chops”, it doesn’t “understand”.

---

### ✅ When Porter Stemmer is Still Useful
- For **Information Retrieval (IR)**, when perfect word forms aren’t necessary.  
- For **keyword-based search systems** (a search for “connect” also finds “connections”).  
- When you need lightweight preprocessing in large text pipelines.

---

📚 **When NOT to use Porter:**
- When your downstream model depends on *precise grammatical or lexical meaning*  
  (like text generation, translation, or semantic similarity tasks).  
In those cases, prefer **Lemmatization**.

---

💡 **Summary Insight**

$$
\text{PorterStemmer} \approx \text{HeuristicSuffixRemover}
$$

✅ Efficient for search  
❌ Not semantically accurate


## 🧩 RegexpStemmer: Custom, Rule-Driven Stemming

**`RegexpStemmer`** (short for *Regular Expression Stemmer*) is a **lightweight and customizable** stemmer provided by NLTK.  
Unlike other stemmers (Porter, Snowball, Lancaster), which use **predefined linguistic rules**,  
this stemmer allows you to define **your own pattern-based stripping rules** using **regular expressions (regex)**.

---

### 🔹 Working Principle

The stemmer applies a **regex substitution** that deletes whatever the pattern matches.

$$
\text{stem}(w) = w - \text{(regex\_matched\_suffix)}
$$

In other words:
1. The stemmer looks for text that matches your regex rule.  
2. Every match is replaced with an empty string (i.e., removed). Anchor the pattern with `$` if only endings should be stripped; an unanchored pattern also deletes matches in the middle of a word.  
3. The operation is purely **text-based**: no morphological knowledge, no POS awareness.
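
The three steps above can be sketched in a few lines of standard-library Python. `TinyRegexpStemmer` is a hypothetical stand-in written for illustration, not NLTK's class; it assumes the behavior described here (delete every regex match) plus a minimum input length below which words are left alone.

```python
import re

class TinyRegexpStemmer:
    """Illustrative regex stemmer: delete matches, skip very short words."""

    def __init__(self, pattern: str, min_length: int = 0):
        self._regex = re.compile(pattern)
        self._min_length = min_length

    def stem(self, word: str) -> str:
        # Words shorter than min_length are returned unchanged
        if len(word) < self._min_length:
            return word
        # Replace every match with the empty string
        return self._regex.sub("", word)

stemmer = TinyRegexpStemmer(r"(ing|ly|ed|ious|ies|ive|es|s|ment)$", min_length=3)
for w in ["studying", "boxes", "studies", "is", "was"]:
    print(f"{w:>10} -> {stemmer.stem(w)}")
```

Note that the length check applies to the input word, so a three-letter word such as "was" is still stemmed.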

---

### 📘 Parameters

| Parameter | Type | Description |
|:--|:--|:--|
| **regexp** | `str` or compiled regex | The pattern describing the text to remove (e.g., `(ing\|ly\|ed\|s)$`) |
| **min** | `int` (default `0`) | Minimum length a word must have to be stemmed; shorter words are returned unchanged |

---

### 🧪 Example

```python
from nltk.stem import RegexpStemmer
regexp = RegexpStemmer(regexp=r'(ing|ly|ed|ious|ies|ive|es|s|ment)$', min=3)

print(regexp.stem("studying"))      # → study
print(regexp.stem("boxes"))         # → box
print(regexp.stem("connections"))   # → connection
print(regexp.stem("happily"))       # → happi
```
---

### 🧩 Explanation
- The regex removes common English endings (like -ing, -ed, -s).
- `min=3` ensures that words shorter than three characters (like “is”, “as”) are returned unchanged. The check is on the input length, so a three-letter word such as “was” is still stemmed (to “wa”).
- The `$` ensures suffixes are matched only at the end of the token.
---

### ✅ Advantages

| Benefit             | Description                                                     |
| :------------------ | :-------------------------------------------------------------- |
| **Customizable**    | You control exactly which suffixes are removed                  |
| **Lightweight**     | No heavy rule set or model loading                              |
| **Fast**            | Regex-based → quick even on large corpora                       |
| **Domain-friendly** | Ideal for domain-specific text (medical, legal, software, etc.) |

---

### ⚠️ Limitations

| Drawback                  | Description                                           |
| :------------------------ | :---------------------------------------------------- |
| ❌ No linguistic knowledge | Doesn’t know valid lemmas or word families            |
| ❌ Can under- or over-stem | Regex too broad or too narrow causes errors           |
| ❌ Language-specific rules | The example pattern targets English; other languages need their own regex |
| ❌ Not context aware       | “goes” → “go” works, but “went” is left as is and “was” is cut to “wa” |

---

### 📌 Best Use Cases
- You want full control over stemming rules.
- You work in restricted domains (e.g., biomedical, legal, or software corpora).
- You need a fast and transparent way to reduce word variants without dependencies beyond NLTK.

---

### 🧮 Mathematical Recap

$$
\text{RegexpStemmer}(w) =
\begin{cases}
w - \text{regex\_suffix}, & \text{if pattern matches the end of } w \\
w, & \text{otherwise}
\end{cases}
$$

--- 

### 💡 Summary

| Feature      | Description                                            |
| :----------- | :----------------------------------------------------- |
| **Approach** | Regex-based suffix removal                             |
| **Accuracy** | Depends entirely on your regex quality                 |
| **Speed**    | ⚡ Very fast                                            |
| **Output**   | Non-linguistic stems                                   |
| **Best for** | Controlled pipelines and domain-specific normalization |



In [44]:
# 🧪 Regexp Stemmer: create domain-specific rules
# Here, we remove a few common English suffixes when they appear at the end of a word.

# Pattern: remove -ing, -ly, -ed, -ious, -ies, -ive, -es, -s, -ment
regexp = RegexpStemmer(
    regexp=r'(ing|ly|ed|ious|ies|ive|es|s|ment)$',
    min=3
)

print("\n🔹 Regexp Stemmer Results (Custom Suffix Chopping):\n")
for w in tokens:
    print(f"{w:>15}  →  {regexp.stem(w)}")

# 📝 Notes:
# - `min=3` leaves words shorter than 3 characters untouched (the check is on the input length)
# - Tune the regex to your domain (biomedical, legal, etc.)



🔹 Regexp Stemmer Results (Custom Suffix Chopping):

        connect  →  connect
      connected  →  connect
     connection  →  connection
    connections  →  connection
     connecting  →  connect
          study  →  study
        studies  →  stud
       studying  →  study
        studied  →  studi
          happy  →  happy
      happiness  →  happines
      unhappily  →  unhappi
   organization  →  organization
       organize  →  organize
      organized  →  organiz
     organizing  →  organiz
     generalize  →  generalize
    generalized  →  generaliz
 generalization  →  generalization
             go  →  go
          going  →  go
           goes  →  go
           gone  →  gone
         better  →  better
           best  →  best
     relational  →  relational
         relate  →  relate
        related  →  relat
       relating  →  relat
           cats  →  cat
          boxes  →  box
           mice  →  mice
    p

## ⚠️ Why Some Outputs Are Incorrect in RegexpStemmer

The **RegexpStemmer** is purely **pattern-based**, meaning it does not understand language morphology.  
It simply strips the suffixes that match your custom regular expression, even when doing so **breaks the word form**.

---

### 🔍 Example Analysis (Based on Output)

| Word | Output | Expected | What Happened |
|:--|:--|:--|:--|
| `studies` | `stud` | `study` | Regex removed `ies`, leaving `stud`; nothing rewrites `ies → y`. |
| `studied` | `studi` | `study` | Removed `ed`, producing a non-word root. |
| `happiness` | `happines` | `happy` | Removed only the final `s`; the pattern has no `ness` alternative. |
| `unhappily` | `unhappi` | `unhappy` | Removed `ly`, but didn’t restore `y`. |
| `organized` | `organiz` | `organize` | Removed `ed`, unaware that base form has `e`. |
| `programming` | `programm` | `program` | Removed `ing`; nothing collapses the doubled `m`. |
| `writing` | `writ` | `write` | Removed `ing`, unaware that root should regain the `e`. |
| `finalized` | `finaliz` | `finalize` | Removed `ed`, with no “restore e” rule. |

---

### 🧩 Why This Happens

Your regex:

```python
regexp = RegexpStemmer(
    regexp=r'(ing|ly|ed|ious|ies|ive|es|s|ment)$',
    min=3
)
```

... blindly removes suffixes, so it:
- Removes endings like `-ed`, `-s`, `-ing`, etc.
- Does not know when to add back missing letters (like `y` or `e`).
- Strips partial suffixes (e.g., removes only the `s` of `-ness` in `happiness`).

This makes it fast ⚡ but linguistically dumb 🤖.
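
One way to soften these failures, while staying purely pattern-based, is an ordered list of (pattern, replacement) rules instead of a single delete-only regex. The sketch below is a hypothetical extension, not something `RegexpStemmer` offers; the rule list is an assumption chosen to repair the `studies` and `happiness` cases.

```python
import re

# Ordered (pattern, replacement) rules; the first rule that matches wins.
RULES = [
    (re.compile(r"ies$"), "y"),               # studies -> study
    (re.compile(r"ness$"), ""),               # happiness -> happi
    (re.compile(r"(ing|ed|ly|es|s)$"), ""),   # generic endings
]

def rule_stem(word: str, min_length: int = 3) -> str:
    """Apply the first matching rule; leave very short words untouched."""
    if len(word) < min_length:
        return word
    for pattern, replacement in RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word

for w in ["studies", "happiness", "boxes", "connections", "series"]:
    print(f"{w:>12} -> {rule_stem(w)}")
```

The last word shows that the approach is still blind: "series" also ends in `ies`, so it is rewritten to "sery".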

### 🧮 Mathematically

$$
\text{RegexpStemmer}(w) =
\begin{cases}
w - \text{regex\_suffix}, & \text{if the suffix matches the regex pattern at the end of } w \\
w, & \text{otherwise}
\end{cases}
$$

---

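Unlike **Porter** or **Snowball**, it lacks normalization rules that make related forms meet at the same stem. Those stemmers rewrite a final *y* as *i*, so that *study* and *studies* both become *studi*:

$$
\text{if } w' \text{ ends with } "y" \Rightarrow \text{replace it with } "i"
$$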


## 🧊 Snowball Stemmer (aka Porter2)

The **Snowball Stemmer** for English, also known as the **Porter2 Stemmer**, is an improved, modernized version of the classic **Porter Stemmer**.  
It was introduced by **Martin Porter** himself as part of the **Snowball framework** for multilingual stemming.

---

### 🔹 Overview

Unlike the original Porter Stemmer, whose rules cover English only,  
the Snowball framework generalizes the same concept to **multiple languages** using a cleaner, more maintainable rule definition syntax.

For English it uses a **revised rule set**, **stricter conditions**, and **refined suffix handling**, making it:
- More **consistent** across morphological cases  
- Much **less aggressive** than Lancaster, and generally more **consistent** than Porter  
- **Multilingual**, supporting languages such as `english`, `french`, `spanish`, `german`, `italian`, etc.

---

### 🧠 Working Logic

1️⃣ **Input**: token $w$  
2️⃣ **Identify** suffixes and endings based on the target language  
3️⃣ **Apply** language-specific morphological rules  
4️⃣ **Return** truncated stem $s$

Mathematically:

$$
\text{SnowballStemmer}(w) =
\begin{cases}
w - \text{language\_specific\_suffix}, & \text{if rule applies for language} \\
w, & \text{otherwise}
\end{cases}
$$

---

### 🧮 Example (English)

```python
from nltk.stem import SnowballStemmer
snowball = SnowballStemmer(language='english')

for w in ["connected", "connections", "studies", "studying", "happiness", "generalization"]:
    print(f"{w:>15}  →  {snowball.stem(w)}")
```

---
### 🧩 Sample Output

| Word           | Snowball Stem |
| :------------- | :------------ |
| connected      | connect       |
| connections    | connect       |
| studies        | studi         |
| studying       | studi         |
| happiness      | happi         |
| generalization | general       |

---

### ⚙️ Features & Advantages

| Feature                          | Description                                     |
| :------------------------------- | :---------------------------------------------- |
| 🌍 **Multilingual support**      | Works for multiple languages                    |
| ⚖️ **Balanced approach**         | Avoids over-stemming seen in Lancaster          |
| 📏 **Improved rule definitions** | Simpler and more uniform rule syntax            |
| ⚡ **Fast and lightweight**       | Similar performance to Porter                   |
| 🧩 **Stable results**            | Produces consistent stems across similar tokens |

---

### ⚠️ Limitations

| Limitation                        | Description                                                    |
| :-------------------------------- | :------------------------------------------------------------- |
| ❌ Still heuristic-based           | No understanding of real word meaning                          |
| ❌ Not lemmatization               | “studies” → “studi” (non-word)                                 |
| ❌ Limited language coverage       | Only languages that have a Snowball algorithm are supported    |
| ⚠️ Slight differences from Porter | Stems can be shorter or longer than Porter's for the same word (e.g., “generalization” → “general” vs “gener”) |

---

### 📘 When to Use Snowball Stemmer
- ✅ Choose Snowball Stemmer when:
    - You need a fast, consistent rule-based stemmer for English or another supported language
    - You want consistency and clarity in stemming behavior
    - You’re preprocessing for Information Retrieval, Topic Modeling, or Search Indexing
- ❌ Avoid when:
    - You need linguistically correct root forms (→ use WordNet Lemmatizer)
    - You’re working on languages unsupported by Snowball

---

### 💡 Summary

| Aspect           | Porter               | Snowball (Porter2)     |
| :--------------- | :------------------- | :--------------------- |
| Rules            | Original 1980 rule set (about 60 rules) | Revised and extended rule set |
| Accuracy         | Moderate             | Higher consistency     |
| Language support | English only         | Multi-language         |
| Aggressiveness   | Moderate             | Controlled             |
| Output example   | “fairly” → “fairli”  | “fairly” → “fair”      |
| Use case         | IR tasks             | IR + NLP preprocessing |

---

### 🧮 Mathematical Insight

$$
\text{SnowballStemmer}(w) =
\text{normalize}\Bigg(
w - 
\sum_{i=1}^{n}
\text{suffix}_i \cdot 
\mathbb{1}_{\text{rule}_i(w)}
\Bigg)
$$

where:

- $\mathbb{1}_{\text{rule}_i(w)}$ is an **indicator function**:
  $$
  \mathbb{1}_{\text{rule}_i(w)} =
  \begin{cases}
  1, & \text{if linguistic rule } i \text{ applies to } w \\
  0, & \text{otherwise}
  \end{cases}
  $$

- $\text{suffix}_i$ represents each possible removable suffix.  
- $\text{normalize}(\cdot)$ denotes **post-processing** (like removing double letters or trailing vowels).  

---

📘 **Intuition**

The Snowball Stemmer applies a series of $n$ language-specific rules.  
Each rule checks if a word $w$ matches a condition (via $\text{rule}_i$).  
If true ($\mathbb{1}=1$), it removes the corresponding $\text{suffix}_i$,  
and then the result is normalized to ensure consistency across derived stems.




In [45]:
# ✅ 2. Snowball Stemmer (also called Porter2 for English)
#    Developed as an improved version of Porter with:
#    - Better consistency
#    - Multi-language support
#    - More transparent rules

snowball = SnowballStemmer(language="english")

print("\n🔹 Snowball Stemmer (Porter2) Results:\n")
for w in tokens:
    stem = snowball.stem(w)
    print(f"{w:>15}  →  {stem}")

# 🧠 Notes:
# - Handles 'happiness' → 'happi' like Porter
# - More consistent rule application
# - Supports many languages: Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian,
#   Portuguese, Romanian, Russian, Spanish and Swedish.



🔹 Snowball Stemmer (Porter2) Results:

        connect  →  connect
      connected  →  connect
     connection  →  connect
    connections  →  connect
     connecting  →  connect
          study  →  studi
        studies  →  studi
       studying  →  studi
        studied  →  studi
          happy  →  happi
      happiness  →  happi
      unhappily  →  unhappili
   organization  →  organ
       organize  →  organ
      organized  →  organ
     organizing  →  organ
     generalize  →  general
    generalized  →  general
 generalization  →  general
             go  →  go
          going  →  go
           goes  →  goe
           gone  →  gone
         better  →  better
           best  →  best
     relational  →  relat
         relate  →  relat
        related  →  relat
       relating  →  relat
           cats  →  cat
          boxes  →  box
           mice  →  mice
    practically  →  practic
      practical  ‚Üí  pra

## ⚠️ Why Some Snowball Stemmer Outputs Look Incorrect

Although **Snowball Stemmer (Porter2)** is an improvement over Porter,  
it’s still a **rule-based stemmer**, not a lemmatizer.  
It applies **heuristic truncation rules**, which means it:
- Strips **common suffixes** like *-ed*, *-ing*, *-ly*, *-ation*, *-ize*, etc.
- Never maps a stem-final `i` back to `y`, and restores a dropped `e` only in a few cases
- Treats **word derivation families** as morphologically equivalent

---

### 🔍 Example Analysis

| Word | Output | Intuitive Base Form | Explanation |
|:--|:--|:--|:--|
| `studies` | `studi` | `study` | Replaces *-ies* → *-i*, doesn’t restore *y* |
| `studied` | `studi` | `study` | Replaces *-ied* → *-i*, doesn’t restore *y* |
| `happiness` | `happi` | `happy` | Removes *-ness*, doesn’t change *i* → *y* |
| `unhappily` | `unhappili` | `unhappy` | Rewrites final *y* → *i*, but the resulting *-li* is not removed because it follows *i* |
| `organization` | `organ` | `organize` | Rewrites *-ization* → *-ize*, then removes *-ize* |
| `organized` | `organ` | `organize` | Removes *-ed*, then *-ize*, collapsing both |
| `generalization` | `general` | `generalize` | Rewrites *-ization* → *-ize*, then *-alize* → *-al* |
| `goes` | `goe` | `go` | Strips only the final *-s*; no *-es* rule applies to this word |
| `history` | `histori` | `history` | Rewrites final *y* → *i* so that it matches *histories* |
| `practically` | `practic` | `practical` | Rewrites *-ally* → *-al*, then *-ical* → *-ic* |
| `writing` | `write` | `write` | ✅ Correct: *-ing* removed and *e* restored |
| `programming` | `program` | `program` | ✅ Correct: doubled consonant handled |
| `finalized` | `final` | `finalize` | Removes *-ed*, then rewrites *-alize* → *-al* |
| `finally` | `final` | `final` | ✅ Expected stem; adverb ending stripped correctly |

---

### 🧩 Why These Happen: Mechanism of Snowball (Porter2)

The **Porter2 algorithm** runs a **fixed sequence of steps** (numbered 0 through 5, with step 1 split into 1a, 1b, and 1c),  
where each step applies pattern-based transformations.

$$
\text{SnowballStemmer}(w) = 
\text{normalize}\Big(
w - \sum_{i=1}^{n} \text{suffix}_i \cdot \mathbb{1}_{\text{rule}_i(w)}
\Big)
$$

Each rule ($\text{rule}_i$):
- Checks if the word ends with a specific suffix (e.g., `-ed`, `-ing`, `-ation`)
- Verifies a *position* condition: many suffixes are only touched if they lie inside the region R1 or R2, which guarantees that enough of the word remains as a stem
- Applies replacements like:  
  - `ies → i`  
  - `ization → ize`  
  - `ational → ate`  
  - `fulness → ful`

🧠 However:
- Restoration is limited: a dropped *e* is added back in a few cases (*writing* → *write*), but *i* is never turned back into *y*  
- There is **no dictionary check** to verify if the result is a valid word  
- It assumes words sharing the same morphological stem should reduce to the same root
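
The length conditions in Porter2 are expressed through two regions. In the Snowball definition, R1 is the part of the word after the first non-vowel that follows a vowel, and R2 is the same region computed inside R1. The sketch below implements only that basic definition (it treats `y` as a vowel and ignores the special-case prefixes the full English algorithm handles), so it is a simplified illustration rather than the complete rule.

```python
VOWELS = "aeiouy"

def region_after_first_vc(text: str) -> str:
    """Return the part after the first non-vowel that follows a vowel."""
    for i in range(1, len(text)):
        if text[i] not in VOWELS and text[i - 1] in VOWELS:
            return text[i + 1:]
    return ""

def r1_r2(word: str):
    """Return the (R1, R2) regions of a word as strings."""
    r1 = region_after_first_vc(word)
    r2 = region_after_first_vc(r1)
    return r1, r2

for w in ["beautiful", "beauty", "animadversion", "fairli"]:
    print(f"{w:>14}  R1, R2 = {r1_r2(w)}")
```

A suffix rule restricted to R1 or R2 simply checks that the suffix lies entirely inside that region before touching it.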

---

### ⚙️ Design Philosophy (Intentional Behavior)

The Snowball stemmer intentionally produces **consistent index keys**,  
not **valid English words**, because it’s built for **Information Retrieval (IR)**, not grammar.

Example:

| Use Case | Goal |
|:--|:--|
| Search “connections” | Should match documents containing “connect”, “connecting”, “connection” |
| Text classification | Token frequency of “connect” should count all related forms |
| Linguistic correctness | ❌ Not required |

Hence, truncations like “studi”, “happi”, “organ” are **acceptable stems** for IR,  
since they merge related word forms into one key.
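
The search row of the table can be sketched as a tiny inverted index keyed by stems. The stems are hard-coded from the Snowball output shown in this document, and the three documents are made-up examples; a real pipeline would call the stemmer at both indexing and query time.

```python
from collections import defaultdict

# Stems copied from the Snowball results shown in this document
STEMS = {
    "connect": "connect", "connected": "connect", "connection": "connect",
    "connections": "connect", "connecting": "connect",
    "studies": "studi", "studying": "studi",
}

def stem(word: str) -> str:
    return STEMS.get(word, word)  # unknown words are left unchanged

# Hypothetical mini-corpus: document id -> text
docs = {
    1: "connecting two networks",
    2: "the connection failed",
    3: "studying network studies",
}

# Index every token under its stem
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        index[stem(token)].add(doc_id)

def search(query: str):
    """Return the sorted ids of documents sharing the query's stem."""
    return sorted(index.get(stem(query), set()))

print(search("connections"))
print(search("studies"))
```

The query "connections" never occurs in the documents, yet it retrieves both documents that contain a connect-family word.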

---

### 📘 Summary: Stemming ≠ Lemmatization

| Aspect | Stemming (Porter/Snowball) | Lemmatization (WordNet) |
|:--|:--|:--|
| Logic | Heuristic rules | Morphological + dictionary |
| Output | Non-word stems | Valid dictionary words |
| Context awareness | ❌ None | ✅ Uses POS tags |
| Example | “studies” → “studi” | “studies” → “study” |
| Use case | IR / Search / Topic Modeling | Linguistic / Semantic tasks |

---

### 💡 Key Takeaway

- ❗ **Incorrect-looking stems ≠ wrong**: they’re **intentional** truncations.  
- ❗ The **Snowball Stemmer** doesn’t “understand” language; it only applies suffix heuristics.  
- ✅ For **real-word roots**, move to **WordNet Lemmatizer** (which we’ll cover next).

---

📚 **In one line:**

> *Snowball stemmer trims words for equality, not for readability.*


## ⚖️ Porter vs Snowball Stemmer: Behavior on `fairly` and `sportingly`

We’ll compare how the **Porter Stemmer** and **Snowball (Porter2) Stemmer** process  
two adverbial words: `fairly` and `sportingly`.

---

### 🧩 Code Used

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language='english')

print("Porter:", porter.stem("fairly"), ",", porter.stem("sportingly"))
print("Snowball:", snowball.stem("fairly"), ",", snowball.stem("sportingly"))
```

In [46]:
print("Porter:", porter.stem("fairly"), ",", porter.stem("sportingly"))
print("Snowball:", snowball.stem("fairly"), ",", snowball.stem("sportingly"))

Porter: fairli , sportingli
Snowball: fair , sport


### 🔍 Output

| Word | **Porter Stemmer** | **Snowball Stemmer (Porter2)** | ✅ **Explanation** |
|:--|:--|:--|:--|
| `fairly` | `fairli` | `fair` | Porter only rewrites the final *y* → *i*; Snowball also removes the resulting `-li` |
| `sportingly` | `sportingli` | `sport` | Porter again only rewrites *y* → *i* and has no rule for `-ingli`; Snowball removes the full `-ingly` suffix |

---

### 🧠 Why Porter Gives “fairli” and “sportingli”

The **Porter Stemmer** applies a **fixed, rule-based sequence** without understanding grammar or word parts.  
It mechanically replaces or trims suffixes based on pattern rules.

In **Step 1c** of Porter’s algorithm:

> (*v*) Y → I  
> If the word ends with *y* and the part before it contains a vowel, replace *y* with *i*.

No later Porter rule strips a bare *-li*: Step 2 only rewrites a few longer endings such as *-ousli*, *-entli*, and *-alli*. So the *-ly* is never removed, only rewritten:

$$
\text{fairly} \xrightarrow[\text{Y} \rightarrow \text{I}]{} \text{fairli}
$$

And for *sportingly*:

$$
\text{sportingly} \xrightarrow[\text{Y} \rightarrow \text{I}]{} \text{sportingli}
$$

✅ Correct per its rules,  
❌ but not linguistically meaningful (non-word outputs).
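
Step 1c as quoted above is small enough to write out. The function below is a sketch of that single rule only (vowels are simplified to `a, e, i, o, u`), not a full Porter implementation; for these two adverbs it gives the same stems as the Porter output printed above.

```python
def step_1c(word: str) -> str:
    """Porter Step 1c, (*v*) Y -> I: needs a vowel somewhere before the final y."""
    stem = word[:-1]
    if word.endswith("y") and any(ch in "aeiou" for ch in stem):
        return stem + "i"
    return word

for w in ["fairly", "sportingly", "happy", "sky"]:
    print(f"{w:>12} -> {step_1c(w)}")
```

The word "sky" is left alone because the part before the `y` contains no vowel.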

---

### 🧊 Why Snowball (Porter2) Is Better

**Snowball (Porter2)** improves Porter’s logic by:
- Recognizing combined suffixes (*-ingly*, *-edly*) alongside *-ing* and *-ed*  
- Removing *-li* (the rewritten *-ly*) when it follows a suitable consonant, as in *fairli* → *fair*  
- Adding exception lists for words that the general rules handle badly  

Hence:

$$
\text{SnowballStemmer("fairly")} = \text{"fair"}
$$

$$
\text{SnowballStemmer("sportingly")} = \text{"sport"}
$$
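
The `fairly` case can be traced with two simplified Porter2-style rules: rewrite a final `y` to `i`, then delete a final `li` if it lies in R1 and follows one of the letters `c d e g h k m n r t`. The sketch below covers only these two rules (the real algorithm has many more, and handles `-ingly` in an earlier step), so it is an illustration, not a replacement for `SnowballStemmer`.

```python
VOWELS = "aeiouy"
VALID_LI_ENDINGS = "cdeghkmnrt"

def r1_start(word: str) -> int:
    """Index where R1 begins: just after the first non-vowel following a vowel."""
    for i in range(1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)

def y_to_i(word: str) -> str:
    # Final y becomes i when preceded by a non-vowel that is not the first letter
    if len(word) > 2 and word.endswith("y") and word[-2] not in VOWELS:
        return word[:-1] + "i"
    return word

def drop_li(word: str) -> str:
    # Delete final "li" if it is inside R1 and follows a valid li-ending
    if (
        len(word) >= 3
        and word.endswith("li")
        and len(word) - 2 >= r1_start(word)
        and word[-3] in VALID_LI_ENDINGS
    ):
        return word[:-2]
    return word

def mini_porter2(word: str) -> str:
    return drop_li(y_to_i(word))

for w in ["fairly", "unhappily", "happy"]:
    print(f"{w:>10} -> {mini_porter2(w)}")
```

In `unhappily` the `li` follows an `i`, which is not in the list of valid letters, so the sketch leaves `unhappili` untouched, consistent with the Snowball output shown earlier.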

---

### 🧮 Formal Difference

$$
\text{PorterStemmer}(w) =
\text{apply\_rules}(w, \text{fixed\_suffixes})
$$

$$
\text{SnowballStemmer}(w) =
\text{normalize}\Big(
w -
\sum_{i=1}^{n}
\text{suffix}_i \cdot
\mathbb{1}_{\text{rule}_i(w)}
\Big)
$$

where:

- $\mathbb{1}_{\text{rule}_i(w)} = 1$ if linguistic rule *i* applies  
- $\text{normalize}(\cdot)$ handles cleanup like removing double consonants or restoring vowels

---

### 📘 Summary

| Aspect | **Porter Stemmer** | **Snowball Stemmer (Porter2)** |
|:--|:--|:--|
| Rule design | Fixed, legacy English rules | Modular, modern rule system |
| Handles `-ly` / `-ingly` | Only rewrites *y* → *i* (→ “sportingli”) | ✅ Removes the suffix (→ “sport”) |
| `fairly` → | `fairli` | `fair` |
| `sportingly` → | `sportingli` | `sport` |
| Accuracy | ❌ Often non-word | ✅ Linguistically cleaner |
| Language support | English only | Multilingual (English, German, Spanish, etc.) |

---

### 💡 Takeaway

> The **Porter Stemmer** applies old mechanical rules: fast but crude.  
> The **Snowball (Porter2)** Stemmer refines these rules for readability, consistency, and multilingual support.  
> For **true dictionary words**, the next step is **WordNet Lemmatization**.


## ⚡ Lancaster Stemmer: The Aggressive Rule-Based Stemmer

The **Lancaster Stemmer** (also known as the **Paice/Husk Stemmer**) is a **very aggressive, iterative rule-based** stemmer.  
It was developed by **Chris Paice (1990)** and applies a series of **deletion and substitution rules** until no more rules can be applied.  

It’s fast ⚡ and simple, but often **over-stems** words, chopping too much and sometimes merging unrelated terms.

---

### 🔹 Overview

- Developed after Porter, designed to be **simpler and faster**  
- Uses a **rule lookup table** of around **120 rules**  
- Applies rules **iteratively** (multiple passes) until no further reduction is possible  
- Each rule defines:
  - an **ending** to match  
  - how many characters to **remove**, plus an optional **replacement**  
  - and whether to **continue or stop**

---

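### 🧠 How It Works

Each rule in Lancaster's table (in the notation NLTK uses) has the general form:

$$
\text{<reversed ending><intact flag><remove count><append string><continue symbol>}
$$

The ending is written backwards, an optional `*` restricts the rule to words that no earlier rule has changed, the digit is the number of trailing characters to delete, any letters after it are appended, and the final symbol is `>` (continue) or `.` (stop).

---

### 🧩 Example Rules

```text
noi3>      → "ion" reversed: remove 3 characters, then continue
de2>       → "ed" reversed: remove 2 characters, then continue
sei3y>     → "ies" reversed: remove 3 characters, append "y", then continue
```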
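
To make the notation concrete, here is a minimal sketch that splits a rule string of the kind found in NLTK's Lancaster rule table into its fields. The `Rule` tuple and `parse_rule` function are hypothetical names introduced for illustration; the sketch assumes the layout is reversed ending, optional `*`, one digit, optional letters, then `>` or `.`.

```python
import re
from typing import NamedTuple


class Rule(NamedTuple):
    ending: str        # suffix to match, in normal reading order
    intact_only: bool  # True if the rule needs a word no rule has changed yet
    remove: int        # number of trailing characters to delete
    append: str        # letters added after the deletion (may be empty)
    keep_going: bool   # True for '>', False for '.'


RULE_PATTERN = re.compile(r"^([a-z]+)(\*?)(\d)([a-z]*)([>.])$")


def parse_rule(rule: str) -> Rule:
    """Split one rule string into its five fields."""
    match = RULE_PATTERN.match(rule)
    if match is None:
        raise ValueError(f"not a valid rule: {rule!r}")
    reversed_ending, star, count, append, symbol = match.groups()
    return Rule(reversed_ending[::-1], star == "*", int(count), append, symbol == ">")


for text in ("noi3>", "de2>", "sei3y>", "ai*2."):
    print(f"{text:<8} {parse_rule(text)}")
```

Note that the digit is a removal count, not a minimum word length; the minimum-stem check is a separate step applied before a rule is allowed to fire.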

---

### 🔍 Sample Output

| Word | **Lancaster Stem** | ✅ **Observation** |
|:--|:--|:--|
| `connection` | `connect` | ✅ Good |
| `connected` | `connect` | ✅ Good |
| `connecting` | `connect` | ✅ Good |
| `organization` | `org` | ⚠️ Over-stemmed (cut past “organ”) |
| `organized` | `org` | ⚠️ Same stem as `organization` |
| `happiness` | `happy` | ✅ Accurate |
| `practically` | `pract` | ⚠️ Truncated |
| `studies` | `study` | ✅ Correct |
| `generalization` | `gen` | ⚠️ Too short, over-stemmed |

---

### ⚠️ Characteristics & Limitations

| Aspect | Description |
|:--|:--|
| ⚡ **Aggressive** | Removes large parts of the word; may over-stem |
| 🔁 **Iterative** | Continues applying rules until no further matches |
| ❌ **Unstable** | Small changes in input can cause big differences in output |
| 🧩 **Short stems** | Often results in very short root forms (e.g., “compute”, “computer” → “comput”) |
| 📚 **No context** | Doesn’t distinguish noun/verb forms or semantics |

---

### 🧮 Mathematical View

$$
\text{LancasterStemmer}(w) =
\big(\text{apply\_rule} \circ \cdots \circ \text{apply\_rule}\big)(w)
$$

Where:  
- $\text{apply\_rule}$ finds the first $\text{rule}_i$ whose ending matches and performs its transformation (delete/replace suffix)  
- The process **continues until convergence** (no rule applies)

That is:

$$
w_{t+1} = \text{apply\_rule}(w_t)
\quad \text{until} \quad
w_{t+1} = w_t
$$

---

### ⚙️ Strengths vs Weaknesses

| ✅ Strengths | ⚠️ Weaknesses |
|:--|:--|
| Very fast and simple | Over-stemming is common |
| Small code footprint | Can merge unrelated words |
| Iterative and deterministic | Not linguistically aware |
| Handles many English suffixes | Poor accuracy for complex derivations |

---

### 📘 Summary

| Feature | Description |
|:--|:--|
| Algorithm | Rule-based, iterative |
| Developer | Chris Paice (1990) |
| Aggressiveness | 🔥 Very high |
| Accuracy | ⚠️ Moderate |
| Iterative | Yes |
| Output validity | Often non-dictionary stems |
| Example | `organization → org` |
| Use case | When **speed > accuracy** (e.g., keyword compression) |

---

### 💡 Takeaway

> The **Lancaster Stemmer** is fast and the most aggressive of the classical stemmers covered here.  
> It’s suitable for quick **indexing or keyword matching**, but **not recommended for semantic NLP tasks**.  
>  
> For balanced results, use **Snowball (Porter2)**.  
> For linguistically valid roots, use **WordNet Lemmatizer** next.


In [47]:
# ⚠️ 3. Lancaster Stemmer: very aggressive
#     Often chops too much, merging unrelated words.
#     But it’s extremely fast and compact.

from nltk.stem import LancasterStemmer

lancaster = LancasterStemmer()

print("\n🔹 Lancaster Stemmer Results (Aggressive):\n")
for w in tokens:
    stem = lancaster.stem(w)
    print(f"{w:>15}  →  {stem}")

# 🧠 Notes:
# - Often shorter stems than Porter/Snowball (e.g., 'organization' → 'org' instead of 'organ')
# - Can over-stem: 'practically', 'practical', and 'practicable' all → 'pract'
# - Useful when you want maximum vocabulary compression



🔹 Lancaster Stemmer Results (Aggressive):

        connect  →  connect
      connected  →  connect
     connection  →  connect
    connections  →  connect
     connecting  →  connect
          study  →  study
        studies  →  study
       studying  →  study
        studied  →  study
          happy  →  happy
      happiness  →  happy
      unhappily  →  unhappy
   organization  →  org
       organize  →  org
      organized  →  org
     organizing  →  org
     generalize  →  gen
    generalized  →  gen
 generalization  →  gen
             go  →  go
          going  →  going
           goes  →  goe
           gone  →  gon
         better  →  bet
           best  →  best
     relational  →  rel
         relate  →  rel
        related  →  rel
       relating  →  rel
           cats  →  cat
          boxes  →  box
           mice  →  mic
    practically  →  pract
      practical  →  pract
    practicable  →  pract

## ⚠️ Why Some Lancaster Stemmer Results Look Incorrect

The **Lancaster Stemmer** is extremely **aggressive** and **iterative**, which makes it *fast* but often **linguistically inaccurate**.  
Unlike Porter or Snowball, it doesn’t just remove a suffix: it applies **a chain of truncation and replacement rules repeatedly**,  
until the word can no longer be shortened.

---

### 🔍 Example Output

| Word | **Lancaster Stem** | ✅ **Expected** | ⚙️ **Observation** |
|:--|:--|:--|:--|
| `organization` | `org` | `organize` | ⚠️ Over-stemmed; repeatedly truncated (acceptable only for text compression) |
| `generalization` | `gen` | `generalize` | Over-stemmed; multiple suffix rules applied |
| `unhappily` | `unhappy` | `unhappy` | ✅ Correct: “-ily” rewritten to “-y” |
| `finally` | `fin` | `final` | Over-stemmed; lost meaningful suffix |
| `gone` | `gon` | `go` | Partial removal of “e” only |
| `mice` | `mic` | `mouse` | ❌ Wrong; no understanding of irregular plurals |
| `better` | `bet` | `good` | ❌ No knowledge of irregular comparatives |
| `writing` | `writ` | `write` | ✅ Acceptable stem |
| `practical` | `pract` | `practical` | ⚠️ Truncated aggressively |

---

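### 🧠 Why It Happens

The **Lancaster Stemmer** uses a compact **rule table of about 120 transformation rules**.  
Each rule defines:
- an **ending** to match (stored reversed),
- the **number of characters to remove** (plus an optional string to append), and  
- whether to **continue or stop** processing.

A rule only fires if the remaining stem is still acceptable: roughly, at least two letters when the word starts with a vowel, and at least three letters including a vowel otherwise.

For example, some of its rules (in NLTK's notation) look like:

```text
noi3>      → "ion": remove 3 characters, then continue
ta2>       → "at": remove 2 characters, then continue
zi2>       → "iz": remove 2 characters, then continue
na2>       → "an": remove 2 characters, then continue
e1>        → "e": remove 1 character, then continue
yli3y>     → "ily": remove 3 characters, append "y", then continue
```

These are applied **iteratively**.  
So a word like **“organization”** goes through multiple transformations:

```text
organization
→ organizat   (noi3>)
→ organiz     (ta2>)
→ organ       (zi2>)
→ org         (na2>)
```

✅ The algorithm stops **only when no more rules match** (or a rule ends in `.`),  
hence the stems are often **shorter than expected**.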
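
The cascade can be reproduced with a small rule engine. This is a sketch under stated assumptions rather than NLTK's implementation: `RULES` is a hand-picked subset of endings, `acceptable` approximates the minimum-stem check, and `stem_trace` is a hypothetical helper that records every intermediate form.

```python
VOWELS = set("aeiouy")

# (ending, characters to remove, text to append, continue?)
# A hand-picked subset; the real table is far larger.
RULES = [
    ("ion", 3, "", True),
    ("at", 2, "", True),
    ("iz", 2, "", True),
    ("al", 2, "", True),
    ("an", 2, "", True),
    ("er", 2, "", True),
    ("ies", 3, "y", True),
]


def acceptable(stem: str) -> bool:
    """Approximate minimum-stem check applied before a rule may fire."""
    if not stem:
        return False
    if stem[0] in VOWELS:
        return len(stem) >= 2
    return len(stem) >= 3 and any(ch in VOWELS for ch in stem[1:3])


def stem_trace(word: str) -> list[str]:
    """Apply the first matching rule repeatedly, recording each intermediate form."""
    trace = [word]
    while True:
        for ending, remove, append, keep_going in RULES:
            stem = word[: len(word) - remove]
            if word.endswith(ending) and acceptable(stem):
                word = stem + append
                trace.append(word)
                break
        else:
            return trace      # no rule matched: converged
        if not keep_going:
            return trace      # the rule that fired said stop


for w in ("organization", "generalization", "studies"):
    print(" → ".join(stem_trace(w)))
```

Because every rule in this subset says "continue", the loop only ends when nothing matches, which is why `generalization` is cut down through `general` and `gener` to `gen`.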

---

### 🧩 Mathematical Representation

$$
w_{t+1} = f(w_t)
$$

where:

$$
f(w) = 
\begin{cases}
w - \text{suffix}, & \text{if a rule applies} \\
\text{replace}(w), & \text{if replacement condition met} \\
w, & \text{otherwise}
\end{cases}
$$

The process continues until convergence:

$$
\text{LancasterStemmer}(w) = \lim_{t \to T} f^{(t)}(w)
$$

---

### 🔎 Key Differences from Porter / Snowball

| Feature | **Lancaster** | **Porter** | **Snowball (Porter2)** |
|:--|:--|:--|:--|
| Rule type | Iterative table lookup | Sequential suffix removal | Structured rule system |
| Aggressiveness | 🔥 Very high | Medium | Controlled |
| Iterative passes | ✅ Yes | ❌ No | ❌ No |
| Output validity | Often non-word | Sometimes non-word | Usually readable |
| `organization` | `org` | `organ` | `organ` |
| `generalization` | `gen` | `gener` | `general` |
| `practically` | `pract` | `practic` | `practic` |
| Speed | ⚡ Fastest | ⚡ Fast | ⚡ Fast |
| Use case | IR compression | General NLP | Balanced NLP |

---

### 📘 When Results Are “Incorrect”

They look incorrect because:

1. The algorithm has **no linguistic understanding**: it only applies string patterns.  
2. Rules can **cascade**, meaning multiple suffix rules can apply in sequence.  
3. It doesn’t check whether the **resulting stem** is a valid English word.  
4. It was designed for **Information Retrieval (IR)**, not grammatical correctness.

---

### ✅ When It’s Still Useful

| Use Case | Reason |
|:--|:--|
| **Search indexing** | Fewer unique stems → faster lookups |
| **Text deduplication** | Groups related word forms aggressively |
| **Keyword extraction** | Reduces inflectional variations |
| **Count-based NLP models** | Smaller vocabulary → more efficient vectorization |

---

### 💡 Summary

> The **Lancaster Stemmer** is extremely fast but highly aggressive.  
> It’s ideal for **text retrieval** and **index compression**,  
> but unsuitable for linguistically sensitive NLP applications.  
>
> ✅ Use **Snowball (Porter2)** for balanced stemming.  
> ✅ Use **WordNet Lemmatizer** for meaningful dictionary words.


In [48]:
# 🧮 Compare outputs of all stemmers side by side for quick analysis

def compare_stemmers(word):
    return {
        "token": word,
        "Porter": porter.stem(word),
        "Snowball": snowball.stem(word),
        "Lancaster": lancaster.stem(word),
        "Regexp": regexp.stem(word),
    }

results = [compare_stemmers(w) for w in tokens]

# Compute column widths for nice console alignment
cols = ["token", "Porter", "Snowball", "Lancaster", "Regexp"]
widths = {c: max(len(c), max(len(str(r[c])) for r in results)) for c in cols}

# Header
header = "  ".join(c.upper().ljust(widths[c]) for c in cols)
print(header)
print("-" * len(header))

# Rows
for row in results:
    print("  ".join(str(row[c]).ljust(widths[c]) for c in cols))

# 💡 Alternative visualization:
# import pandas as pd
# df = pd.DataFrame(results)
# display(df)


TOKEN           PORTER     SNOWBALL   LANCASTER  REGEXP        
---------------------------------------------------------------
connect         connect    connect    connect    connect       
connected       connect    connect    connect    connect       
connection      connect    connect    connect    connection    
connections     connect    connect    connect    connection    
connecting      connect    connect    connect    connect       
study           studi      studi      study      study         
studies         studi      studi      study      stud          
studying        studi      studi      study      study         
studied         studi      studi      study      studi         
happy           happi      happi      happy      happy         
happiness       happi      happi      happy      happines      
unhappily       unhappili  unhappili  unhappy    unhappi       
organization    organ      organ      org        organization  
organize        organ      organ      or

## 📊 Observations and Comparison

| Word | Porter | Snowball | Lancaster | Regexp | Remarks |
|:----|:-------|:---------|:-----------|:--------|:--------|
| connected | connect | connect | connect | connect | ✅ Consistent |
| studies | studi | studi | study | stud | ⚠️ Three different stems |
| happiness | happi | happi | happy | happines | ⚠️ Regexp only drops the final ‘s’ |
| organizing | organ | organ | org | organiz | ⚠️ Lancaster over-stems; Regexp retains ‘z’ |
| practical | practic | practic | pract | practic | ⚠️ Lancaster over-stems |

---

### ⚠️ Over-stemming vs Under-stemming

- **Over-stemming:** unrelated words merge into one root  
  e.g., “practicable” and “practical” both become “practic” under Porter (and “pract” under Lancaster)
- **Under-stemming:** related words fail to merge  
  e.g., Lancaster leaves “go”, “goes”, and “gone” as three different stems: “go”, “goe”, and “gon”
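
Both failure modes can be checked mechanically once you decide which tokens *should* belong together. The sketch below uses Lancaster stems copied from the results printed earlier in this notebook; the `FAMILIES` grouping is an assumption made for illustration (it treats “practical” and “practicable” as unrelated), and `under_stemmed` / `over_stemmed` are hypothetical helper names.

```python
from collections import defaultdict

# Lancaster stems copied from the printed results earlier in this notebook.
LANCASTER = {
    "go": "go", "going": "going", "goes": "goe", "gone": "gon",
    "practically": "pract", "practical": "pract", "practicable": "pract",
    "connect": "connect", "connected": "connect", "connection": "connect",
}

# Assumed gold families: which tokens *should* share a stem.
FAMILIES = {
    "go": ["go", "going", "goes", "gone"],
    "practical": ["practically", "practical"],
    "practicable": ["practicable"],
    "connect": ["connect", "connected", "connection"],
}


def under_stemmed(families, stems):
    """Families whose members were split across more than one stem."""
    split = {}
    for name, words in families.items():
        distinct = {stems[w] for w in words}
        if len(distinct) > 1:
            split[name] = sorted(distinct)
    return split


def over_stemmed(families, stems):
    """Stems shared by members of more than one family."""
    owners = defaultdict(set)
    for name, words in families.items():
        for w in words:
            owners[stems[w]].add(name)
    return {stem: sorted(names) for stem, names in owners.items() if len(names) > 1}


print("Under-stemmed:", under_stemmed(FAMILIES, LANCASTER))
print("Over-stemmed: ", over_stemmed(FAMILIES, LANCASTER))
```

The result depends entirely on the gold families you supply, so treat the counts as a diagnostic for a given vocabulary rather than a property of the stemmer alone.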

---

### ✅ Best Practices

- Use **Porter** or **Snowball** for English text preprocessing pipelines.  
- Use **Lancaster** only for **aggressive vocabulary compression** (like search engines).  
- Use **Regexp** for **custom domains**, especially if you know common suffix patterns.  
- For linguistically valid roots, switch to **lemmatization**.

---

### 🧮 Mathematical Recap

$$
\text{Stem: } \mathcal{W} \to \mathcal{S}, \quad s = \text{stem}(w)
$$

$$
\text{Word Family: } \{\text{connect}, \text{connected}, \text{connections}, \text{connecting}\} 
\xrightarrow{\text{stem}} \{\text{connect}\}
$$
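
As a worked instance of the mapping $\mathcal{W} \to \mathcal{S}$, the sketch below computes the stem set and the ratio $|S| / |W|$ for two stemmers. The stems are copied from the first 13 rows of the side-by-side comparison printed earlier; `stem_image` and `compression_ratio` are hypothetical helper names, and the ratio is only one simple way to quantify how much a stemmer shrinks the vocabulary.

```python
# Stems copied from the side-by-side comparison printed earlier (first 13 tokens).
TOKENS = ["connect", "connected", "connection", "connections", "connecting",
          "study", "studies", "studying", "studied",
          "happy", "happiness", "unhappily", "organization"]
PORTER = ["connect"] * 5 + ["studi"] * 4 + ["happi", "happi", "unhappili", "organ"]
REGEXP = ["connect", "connect", "connection", "connection", "connect",
          "study", "stud", "study", "studi",
          "happy", "happines", "unhappi", "organization"]


def stem_image(stems):
    """S = {stem(w) for w in W}: the set of distinct stems."""
    return set(stems)


def compression_ratio(tokens, stems):
    """|S| / |W|: lower means more tokens were merged."""
    return len(stem_image(stems)) / len(set(tokens))


for name, stems in (("Porter", PORTER), ("Regexp", REGEXP)):
    print(f"{name:<7} |S| = {len(stem_image(stems)):>2}   ratio = {compression_ratio(TOKENS, stems):.2f}")
```

On this sample Porter collapses 13 tokens into 5 stems, while the Regexp stemmer leaves 9, which matches the impression from the table that it merges far fewer variants.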


## 🧩 WordNet Lemmatization: Getting True Dictionary Words

Unlike **stemming**, which just chops off suffixes,  
**lemmatization** uses **vocabulary + grammar rules** to find the *actual dictionary root* (lemma) of a word.

It considers:
- Part of Speech (POS): noun, verb, adjective, adverb  
- Morphological rules and exception lists (e.g., *studies → study*, *better → good*)  
- WordNet lexical database for valid words
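
Because the lemma depends on the part of speech, a common preparatory step is to translate tagger output into the one-letter POS codes WordNet uses (`'n'`, `'v'`, `'a'`, `'r'`). The helper below is a minimal sketch: `penn_to_wordnet` is a hypothetical name, and it assumes the tokens carry Penn Treebank tags (the tag set produced by many English taggers).

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank tag to a one-letter WordNet POS code."""
    if tag.startswith("J"):
        return "a"   # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"   # verb (VB, VBD, VBG, VBN, VBP, VBZ)
    if tag.startswith("R"):
        return "r"   # adverb (RB, RBR, RBS)
    return "n"       # noun, also used as the fallback


tagged = [("studies", "VBZ"), ("better", "JJR"), ("cats", "NNS"), ("quickly", "RB")]
for token, tag in tagged:
    print(f"{token:<8} {tag:<4} → pos='{penn_to_wordnet(tag)}'")
```

The returned code can be passed as the `pos` argument of a lemmatizer call; falling back to noun mirrors the lemmatizer's own default.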

---

### 💻 Example Code


In [49]:
# Download WordNet if not already done
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/psundara/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/psundara/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [50]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# The default POS is noun, so most verb and adjective forms come back unchanged
print("🔹 Lemmatization Examples:\n")
for w in tokens:
    print(f"{w:>10} → {lemmatizer.lemmatize(w)}")

# Using POS (Part of Speech) for better accuracy
print("\n🔹 With POS Tags:\n")
for w in tokens:
    print(f"{w} (verb): {lemmatizer.lemmatize(w, pos='v')}")
    print(f"{w} (adjective): {lemmatizer.lemmatize(w, pos='a')}")

    
print("studying (verb):", lemmatizer.lemmatize("studying", pos='v'))
print("better (adjective):", lemmatizer.lemmatize("better", pos='a'))

🔹 Lemmatization Examples:

   connect → connect
 connected → connected
connection → connection
connections → connection
connecting → connecting
     study → study
   studies → study
  studying → studying
   studied → studied
     happy → happy
 happiness → happiness
 unhappily → unhappily
organization → organization
  organize → organize
 organized → organized
organizing → organizing
generalize → generalize
generalized → generalized
generalization → generalization
        go → go
     going → going
      goes → go
      gone → gone
    better → better
      best → best
relational → relational
    relate → relate
   related → related
  relating → relating
      cats → cat
     boxes → box
      mice → mouse
practically → practically
 practical → practical
practicable → practicable
   writing → writing
    writes → writes
programming → programming
  programs → program
   history → history
   finally → fi