# 📘 WordNet Lemmatization

> **Objective:**  
> Transform words into their **base or dictionary form (lemma)** using linguistic knowledge of **morphology and part-of-speech (POS)** tags.  
> Unlike *stemming*, which crudely chops off word endings, **lemmatization** ensures the result is a valid, meaningful word found in a lexicon (WordNet).

---

## 🧩 1. What is Lemmatization?

- Lemmatization is related to stemming: both reduce inflected words to a common base form.
- Lemmatization finds the **canonical (dictionary) form** of a word, called the **lemma**.
- It’s **POS-sensitive**: it uses the grammatical category (noun, verb, adjective, adverb) to find the correct base form. It does not inspect the surrounding sentence itself; that context comes from a POS tagger.
- The lemmatizer covered here relies on **WordNet**, a large lexical database of English developed at Princeton University.
- The result is a valid dictionary word that preserves the meaning of the original.

💡 **Example Difference vs Stemming**

| Word | Porter Stemmer | Lemmatizer | Comment |
|------|----------------|-------------|----------|
| studies | studi | study | Returns valid word |
| better | better | good | Handles irregular adjective |
| mice | mice | mouse | Handles plural-to-singular |
| gone | gone | go | Handles past participle |
| running | run | run | Similar, but via POS check |

---

## 🧮 2. Theoretical Foundation (Mathematical View)

The lemmatization function can be mathematically defined as:

$$
\text{Lemma}(w, \text{POS}) = \operatorname{lookup}_{\text{WordNet}}\Big(\text{morph}(w), \text{POS}\Big)
$$

where:

- $ w $ → input word token  
- $ \text{POS} $ → part-of-speech tag  
- $ \text{morph}(w) $ → morphological normalization (e.g., removing inflectional suffixes like *-ing*, *-ed*)  
- $ \operatorname{lookup}_{\text{WordNet}} $ → dictionary-based retrieval of lemma

Accounting for the case where the lookup finds no match:

$$
\text{Lemma}(w, \text{POS}) =
\begin{cases}
\text{headword in WordNet}, & \text{if found}\\[3pt]
w, & \text{otherwise (NLTK returns the input unchanged)}
\end{cases}
$$

📘 **Simplified Intuition:**

$$
\text{Lemma}(w,\text{POS}) = \text{dictionary lookup based on morphology and POS}
$$
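
As a rough illustration of the definition above, the sketch below implements `morph` as suffix rewriting and `lookup` as membership in a tiny hand-made lexicon. `TOY_LEXICON`, `SUFFIX_RULES`, `morph`, and `toy_lemma` are hypothetical names invented for this example, and the data is a stand-in for WordNet rather than its actual contents.

```python
# Toy stand-ins for WordNet: a handful of headwords per POS (assumed data).
TOY_LEXICON = {
    "n": {"study", "dog", "church"},
    "v": {"study", "walk", "organize"},
}

# Simplified suffix rewrites per POS: (suffix to strip, replacement).
SUFFIX_RULES = {
    "n": [("ies", "y"), ("es", ""), ("s", "")],
    "v": [("ies", "y"), ("ed", "e"), ("ed", ""), ("ing", "e"), ("ing", "")],
}


def morph(word, pos):
    """Return candidate base forms: the word itself plus suffix rewrites."""
    candidates = [word]
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        if word.endswith(suffix):
            candidates.append(word[: -len(suffix)] + replacement)
    return candidates


def toy_lemma(word, pos="n"):
    """Lemma(w, POS): first candidate found in the lexicon, else w unchanged."""
    for candidate in morph(word, pos):
        if candidate in TOY_LEXICON.get(pos, set()):
            return candidate
    return word  # no match: fall back to the input


print(toy_lemma("studies", "n"))    # study
print(toy_lemma("organized", "v"))  # organize
print(toy_lemma("walking", "v"))    # walk
print(toy_lemma("blorfs", "n"))     # blorfs (not in the toy lexicon)
```

The last call shows the fallback branch: when no candidate is a known headword, the input comes back untouched.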

---

## ⚙️ 3. Implementation with NLTK’s WordNet Lemmatizer

NLTK provides a built-in **`WordNetLemmatizer`** class that uses the **WordNet** lexical database to return the lemma (dictionary base form) of a word.

> 💡 **Key Detail:**  
> The `WordNetLemmatizer` is a **thin wrapper** around NLTK’s `WordNetCorpusReader` class.  
> Under the hood, it calls the reader’s **`_morphy()`** method (its public counterpart is `wn.morphy()`) to perform **morphological analysis** and find the lemma.

This means:
- It first applies **morphological normalization** to remove inflectional suffixes (e.g., *-ing*, *-ed*, *-s*).  
- Then it performs a **dictionary lookup** within WordNet to find the canonical form of the word based on its **part of speech (POS)**.

---

### ⚙️ Import and Setup

Before using the lemmatizer, we must download required NLTK corpora and initialize the tools.


```python
import nltk

# Download necessary resources
nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)
# Newer NLTK releases look for the tagger under this resource name
nltk.download('averaged_perceptron_tagger_eng', quiet=True)
```

```python
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from nltk import pos_tag, word_tokenize

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()
```

Once initialized, the `lemmatizer.lemmatize()` method can be used as:

$$
\text{lemma} = \text{lemmatizer.lemmatize}(w, \text{pos})
$$

where:
- $ w $: input word  
- $ \text{pos} $: optional POS tag (`'n'`, `'v'`, `'a'`, `'r'`) corresponding to **noun**, **verb**, **adjective**, or **adverb**.

If `pos` is not provided, it defaults to **noun (`'n'`)**, which can lead to incorrect results for verbs or adjectives — hence POS tagging is essential.

---


## 🧩 4. Why POS Tagging is Crucial

Lemmatization is **part-of-speech (POS) sensitive**.  
Without knowing whether a word is a *noun*, *verb*, *adjective*, or *adverb*, the lemmatizer cannot accurately determine its correct base form.

For example:
- `running` as a **noun** → `running`
- `running` as a **verb** → `run`

Hence, we must first assign **POS tags** (using Penn Treebank tags), and then **map them** to WordNet POS tags understood by the lemmatizer.

---

### 💡 POS Tag Mapping (Penn → WordNet)

| Word Type | Penn Tag Prefix | WordNet Constant | Example |
|------------|-----------------|------------------|----------|
| Noun | `N` | `wn.NOUN` | mice → mouse |
| Verb | `V` | `wn.VERB` | gone → go |
| Adjective | `J` | `wn.ADJ` | better → good |
| Adverb | `R` | `wn.ADV` | finally → finally |

---

The mapping function can be defined as:

$$
\text{map}(t) =
\begin{cases}
wn.ADJ, & \text{if } t \text{ starts with } J \\[4pt]
wn.VERB, & \text{if } t \text{ starts with } V \\[4pt]
wn.NOUN, & \text{if } t \text{ starts with } N \\[4pt]
wn.ADV, & \text{if } t \text{ starts with } R \\[4pt]
wn.NOUN, & \text{otherwise (default)}
\end{cases}
$$



---
### ⚙️ The Internal Flow

```text
WordNetLemmatizer.lemmatize(word, pos)
        ↓
calls  WordNetCorpusReader._morphy(word, pos)
        ↓
applies exception lists and suffix-detachment rules for that POS
        ↓
keeps the candidates that exist as WordNet dictionary entries
```



---

### 💡 Visualization: Internal Flow Diagram

```text
WordNetLemmatizer
   │
   └──> WordNetCorpusReader._morphy(word, pos)
           │
           └──> Applies exception lists + suffix rules + dictionary lookup
```
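
The flow above can be approximated in plain Python. This is a simplified sketch of the ingredients the diagram names (suffix rules plus dictionary lookup), with a small exception table for irregular forms. `EXCEPTIONS`, `DETACHMENT_RULES`, `HEADWORDS`, and both functions are hypothetical and do not reproduce NLTK's internals or WordNet's data; picking the shortest surviving candidate is an assumption made for this sketch.

```python
# Hypothetical miniature data; WordNet ships much larger tables.
EXCEPTIONS = {
    "n": {"mice": ["mouse"]},
    "v": {"gone": ["go"], "was": ["be"]},
    "a": {"better": ["good", "well"]},
}
DETACHMENT_RULES = {
    "n": [("s", ""), ("ies", "y")],
    "v": [("s", ""), ("ed", "e"), ("ed", ""), ("ing", "e"), ("ing", "")],
    "a": [("er", ""), ("est", "")],
}
HEADWORDS = {
    "n": {"mouse", "runner"},
    "v": {"go", "be", "organize", "bake"},
    "a": {"good", "well", "quick"},
}


def morphy_sketch(form, pos):
    """Collect every candidate lemma for `form` that exists in HEADWORDS."""
    known = HEADWORDS.get(pos, set())
    # 1. Irregular forms come straight from the exception table.
    candidates = list(EXCEPTIONS.get(pos, {}).get(form, []))
    # 2. The form itself may already be a headword.
    candidates.append(form)
    # 3. Regular forms are produced by detaching suffixes.
    for suffix, replacement in DETACHMENT_RULES.get(pos, []):
        if form.endswith(suffix):
            candidates.append(form[: -len(suffix)] + replacement)
    # Keep dictionary words only, preserving order and dropping duplicates.
    return [c for c in dict.fromkeys(candidates) if c in known]


def lemmatize_sketch(word, pos="n"):
    """Return the shortest valid candidate, or the word itself if none."""
    lemmas = morphy_sketch(word, pos)
    return min(lemmas, key=len) if lemmas else word


for word, pos in [("mice", "n"), ("gone", "v"), ("baked", "v"), ("better", "a")]:
    print(word, pos, "->", lemmatize_sketch(word, pos))
```

Irregular forms such as *mice* and *gone* never match a suffix rule, which is why the exception table is consulted first.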

## 🧩 Understanding Penn Treebank POS Tags

When we run NLTK’s `pos_tag()` function on tokens,  
it returns **Penn Treebank POS tags** such as `NN`, `VBD`, `JJ`, `RBR`, etc.

These are **fine-grained grammatical tags** used in NLP pipelines to describe
each word’s syntactic role (noun, verb, adjective, adverb, etc.).

For example:

| Tag | Full Form | Example | Description |
|------|-------------|----------|--------------|
| **NN** | Noun, singular | dog | singular noun |
| **NNS** | Noun, plural | dogs | plural noun |
| **NNP** | Proper noun, singular | India | proper name |
| **VB** | Verb, base form | go | base verb |
| **VBD** | Verb, past tense | went | past tense |
| **VBG** | Verb, gerund/present participle | running | `-ing` form |
| **VBN** | Verb, past participle | gone | past participle |
| **VBP** | Verb, non-3rd person singular present | run | present tense |
| **VBZ** | Verb, 3rd person singular present | runs | present tense |
| **JJ** | Adjective | quick | adjective |
| **JJR** | Adjective, comparative | quicker | comparative adjective |
| **JJS** | Adjective, superlative | quickest | superlative adjective |
| **RB** | Adverb | quickly | adverb |
| **RBR** | Adverb, comparative | faster | comparative adverb |
| **RBS** | Adverb, superlative | fastest | superlative adverb |
| **PRP** | Personal pronoun | he, she | pronouns |
| **DT** | Determiner | the, a | articles |
| **IN** | Preposition / subordinating conjunction | in, on, after | connector |
| **CC** | Coordinating conjunction | and, but | conjunction |
| **TO** | "to" | to go | infinitive marker |

---

### 💡 Example



```python
from nltk import pos_tag, word_tokenize

text = "The better mice were running faster than others."
print(pos_tag(word_tokenize(text)))
```

```text
[('The', 'DT'), ('better', 'JJR'), ('mice', 'NN'), ('were', 'VBD'), ('running', 'VBG'), ('faster', 'RBR'), ('than', 'IN'), ('others', 'NNS'), ('.', '.')]
```


```python
from nltk.corpus import wordnet as wn

def penn_to_wn(penn_tag):
    """Convert Penn Treebank POS tags to WordNet POS tags."""
    if penn_tag.startswith('J'):
        return wn.ADJ
    elif penn_tag.startswith('V'):
        return wn.VERB
    elif penn_tag.startswith('N'):
        return wn.NOUN
    elif penn_tag.startswith('R'):
        return wn.ADV
    return wn.NOUN  # default fallback
```



---

### 🧩 How It Connects to Lemmatization

Each Penn tag must be mapped to one of the four WordNet POS categories:

| Penn Prefix | WordNet POS | Meaning   |
| ----------- | ----------- | --------- |
| `N`         | `wn.NOUN`   | Noun      |
| `V`         | `wn.VERB`   | Verb      |
| `J`         | `wn.ADJ`    | Adjective |
| `R`         | `wn.ADV`    | Adverb    |

For instance:

- `NNS` → starts with `N` → `wn.NOUN`
- `VBG` → starts with `V` → `wn.VERB`
- `JJR` → starts with `J` → `wn.ADJ`
- `RBR` → starts with `R` → `wn.ADV`

This mapping ensures the WordNetLemmatizer applies correct rules to find the right lemma.
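
One possible variation on the prefix mapping, sketched under the assumption that function words (determiners, conjunctions, and so on) should be left alone rather than treated as nouns. `PENN_PREFIX_TO_WN` and `penn_to_wn_or_none` are hypothetical names; the single-letter values are the strings NLTK uses for `wn.ADJ`, `wn.VERB`, `wn.NOUN`, and `wn.ADV`.

```python
# Single-letter codes matching wn.ADJ, wn.VERB, wn.NOUN, wn.ADV in NLTK.
PENN_PREFIX_TO_WN = {"J": "a", "V": "v", "N": "n", "R": "r"}


def penn_to_wn_or_none(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS code, or None if it has no match."""
    return PENN_PREFIX_TO_WN.get(penn_tag[:1])


for tag in ["NNS", "VBG", "JJR", "RBR", "DT", "CC"]:
    print(tag, "->", penn_to_wn_or_none(tag))
```

A caller can then skip lemmatization whenever the function returns `None`, which avoids looking up words like *the* or *and* as nouns.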

---

### 🧩 Code Cell

```python
import pandas as pd

penn_tags = ["NNS", "VBD", "VBG", "JJR", "RBR", "VBZ", "NNP", "RBS"]
mapped = [(t, penn_to_wn(t)) for t in penn_tags]
pd.DataFrame(mapped, columns=["Penn Tag", "Mapped WordNet POS"])
```

|   | Penn Tag | Mapped WordNet POS |
|---|----------|--------------------|
| 0 | NNS | n |
| 1 | VBD | v |
| 2 | VBG | v |
| 3 | JJR | a |
| 4 | RBR | r |
| 5 | VBZ | v |
| 6 | NNP | n |
| 7 | RBS | r |



---
### 💬 Quick Recap
- `pos_tag()` → outputs Penn Treebank tags (fine-grained syntax categories).
- `penn_to_wn()` → maps those tags to WordNet POS tags (`n`, `v`, `a`, `r`).
- `WordNetLemmatizer` → then uses these to perform accurate dictionary-based lemmatization.

---

## 🧮 5. Lemmatization in Action

Now let’s observe how **WordNet Lemmatization** behaves on real examples — with and without POS tagging.

Below are sample words demonstrating plural, tense, and irregular forms:


```python
from nltk import pos_tag

words = ["studies", "better", "gone", "was", "running", "organized", "mice", "finally"]

for w in words:
    pos = pos_tag([w])[0][1]          # Penn POS tag
    wn_pos = penn_to_wn(pos)          # Convert to WordNet POS
    lemma = lemmatizer.lemmatize(w, pos=wn_pos)
    print(f"{w:>10} ({pos}) → {lemma} → Mapped WordNet POS → {wn_pos}")
```

```text
   studies (NNS) → study → Mapped WordNet POS → n
    better (RBR) → well → Mapped WordNet POS → r
      gone (VBN) → go → Mapped WordNet POS → v
       was (VBD) → be → Mapped WordNet POS → v
   running (VBG) → run → Mapped WordNet POS → v
 organized (VBN) → organize → Mapped WordNet POS → v
      mice (NN) → mouse → Mapped WordNet POS → n
   finally (RB) → finally → Mapped WordNet POS → r
```



---
### 💡 Expected Output Explanation

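| Word | POS (Penn) | Lemma | Description |
|------|-------------|--------|-------------|
| studies | NNS | study | plural → singular |
| better | RBR | well | tagged as a comparative adverb in isolation; tagged as an adjective (`JJR`), the lemma is `good` |
| gone | VBN | go | past participle |
| was | VBD | be | irregular verb |
| running | VBG | run | gerund form reduced |
| organized | VBN | organize | past participle normalized |
| mice | NN | mouse | irregular plural (the tagger labels it `NN`) |
| finally | RB | finally | adverb unchanged |

Because each word is tagged on its own here, the tagger has no sentence context. Tags such as `RBR` for *better* can therefore differ from the tag the same word receives inside a sentence, and the lemma changes with them.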

---

🧮 **Lemmatization Equation with POS:**

$$
\text{Lemma}(w, \text{POS}) = \operatorname{lookup}_{\text{WordNet}}\big(\text{morph}(w), \text{POS}\big)
$$

---


```python
from nltk import word_tokenize, pos_tag

sentence = "The mice were running faster and the better runner was finally organized."

tokens = word_tokenize(sentence)
tagged = pos_tag(tokens)

def lemmatize_sentence(tokens):
    """Lemmatize a list of tokens using POS tagging."""
    lemmas = []
    # Tag the tokens passed in instead of relying on the global `tagged`
    for word, pos in pos_tag(tokens):
        wn_pos = penn_to_wn(pos)
        lemma = lemmatizer.lemmatize(word, pos=wn_pos)
        lemmas.append(lemma)
    return lemmas

print("Tokens:", tokens)
print("POS Tags:", tagged)
print("Lemmas:", lemmatize_sentence(tokens))
```

```text
Tokens: ['The', 'mice', 'were', 'running', 'faster', 'and', 'the', 'better', 'runner', 'was', 'finally', 'organized', '.']
POS Tags: [('The', 'DT'), ('mice', 'NN'), ('were', 'VBD'), ('running', 'VBG'), ('faster', 'RBR'), ('and', 'CC'), ('the', 'DT'), ('better', 'JJR'), ('runner', 'NN'), ('was', 'VBD'), ('finally', 'RB'), ('organized', 'VBN'), ('.', '.')]
Lemmas: ['The', 'mouse', 'be', 'run', 'faster', 'and', 'the', 'good', 'runner', 'be', 'finally', 'organize', '.']
```



---

Let’s visualize the transformation process for the sentence:

> “The mice were running faster and the better runner was finally organized.”

```text
┌──────────────────────────────────────────────────────────────────────────────┐
│                         Raw Input Text                                       │ 
│  "The mice were running faster and the better runner was finally organized." │
└──────────────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
                  🧩 Tokenization (word_tokenize)
                              │
                              ▼
["The", "mice", "were", "running", "faster", "and",
 "the", "better", "runner", "was", "finally", "organized", "."]
                              │
                              ▼
                    ⚙️ Stemming (Porter)
                              │
                              ▼
["the", "mice", "were", "run", "faster", "and",
 "the", "better", "runner", "wa", "final", "organ", "."]
  "run" = root form · "wa" = non-word · "organ" = over-stemmed
                              │
                              ▼
          📘 Lemmatization (WordNet + POS Awareness)
                              │
                              ▼
["The", "mouse", "be", "run", "faster", "and",
 "the", "good", "runner", "be", "finally", "organize", "."]
  plural→sing. (mouse) · tense→base (be, run) · comp→base (good) · participle→base (organize)

```

## 🧩 7. Lemmatizer vs. Stemmers — Comparative Study

| Word | Porter | Snowball | Lancaster | Regexp | Lemmatizer |
|------|---------|-----------|-------------|----------|-------------|
| studies | studi | studi | study | studi | study |
| better | better | better | bet | better | good |
| gone | gone | gone | gon | gone | go |
| was | wa | was | was | was | be |
| running | run | run | run | run | run |
| organized | organ | organ | organ | organiz | organize |
| mice | mice | mice | mic | mice | mouse |
| finally | final | final | fin | final | finally |

🧩 **Observations:**
- Lemmatizer produces valid **dictionary words**.  
- Stemmers often output non-words (`studi`, `wa`, `organiz`) or unrelated words (`organ` for *organized*).  
- Lemmatization correctly handles **irregular** forms and **POS** sensitivity.

---

## 🧮 8. Lemmatization Pipeline (Mathematical Model)

$$
\text{Lemma}(w) = \text{Lemma}\big(w,\; \text{map}(\text{tag}(w))\big)
$$

$$
\text{Lemma}(w, \text{POS}) = \operatorname{lookup}_{\text{WordNet}}\big(\text{morph}(w), \text{POS}\big)
$$

Thus,  
$$
\boxed{\text{Lemmatization} = \text{POS Tagging} + \text{Morphology} + \text{Dictionary Lookup}}
$$
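
The composition $\text{Lemma}(w, \text{map}(\text{tag}(w)))$ can be written as a small higher-order function. The sketch takes the tagger, the tag mapper, and the lemmatizer as arguments, so it runs here with stub functions. `lemmatize_tokens` and the stubs are hypothetical; in the notebook, `pos_tag`, `penn_to_wn`, and `lemmatizer.lemmatize` would presumably be passed in their place.

```python
def lemmatize_tokens(tokens, tag_fn, map_fn, lemma_fn):
    """Apply Lemma(w, map(tag(w))) to every token, preserving order."""
    return [lemma_fn(word, map_fn(tag)) for word, tag in tag_fn(tokens)]


# Stub components standing in for a real tagger and lemmatizer (assumed data).
STUB_TAGS = {"mice": "NNS", "were": "VBD", "running": "VBG"}
STUB_LEMMAS = {("mice", "n"): "mouse", ("were", "v"): "be", ("running", "v"): "run"}


def stub_tagger(tokens):
    """Return (token, Penn tag) pairs, defaulting unknown tokens to NN."""
    return [(t, STUB_TAGS.get(t, "NN")) for t in tokens]


def stub_mapper(tag):
    """Prefix-based Penn → WordNet mapping with a noun fallback."""
    return {"J": "a", "V": "v", "N": "n", "R": "r"}.get(tag[:1], "n")


def stub_lemmatizer(word, pos):
    """Look up (word, pos) in the stub table, else return the word."""
    return STUB_LEMMAS.get((word, pos), word)


result = lemmatize_tokens(["mice", "were", "running"],
                          stub_tagger, stub_mapper, stub_lemmatizer)
print(result)  # ['mouse', 'be', 'run']
```

Passing the three steps as arguments keeps each one replaceable, for example to swap in a different tagger without touching the loop.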

---


## ⚠️ 9. Caveats and Best Practices

⚠️ **Caveats:**
- Requires accurate **POS tagging** for correctness.
- Coverage limited to **WordNet vocabulary**.
- Does not perform **contextual disambiguation** on its own (e.g., *saw* may be the past tense of `see` or the noun/verb `saw`); it relies entirely on the POS it is given.
- Slower than stemmers due to lookups and tagging.

✅ **Best Practices:**
- Always combine with a POS tagger before lemmatization.
- Prefer for **semantic NLP tasks**: question answering, summarization, chatbot NLU.
- Use stemmers for **speed-critical** IR or indexing tasks.
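
Because the same word forms recur constantly in real text, one common way to soften the speed cost is to memoize lookups. The sketch below wraps a stand-in function with `functools.lru_cache`; `slow_lookup` and `cached_lemma` are hypothetical, and the counter exists only to show how often the underlying lookup actually runs. Whether this helps with a given lemmatizer is worth measuring rather than assuming.

```python
from functools import lru_cache

lookup_calls = 0  # counts how often the expensive lookup really runs


def slow_lookup(word, pos):
    """Stand-in for a dictionary-backed lemmatizer call (assumed behavior)."""
    global lookup_calls
    lookup_calls += 1
    toy = {("mice", "n"): "mouse", ("were", "v"): "be"}
    return toy.get((word, pos), word)


@lru_cache(maxsize=50_000)
def cached_lemma(word, pos):
    """Memoized wrapper: repeated (word, pos) pairs skip the lookup."""
    return slow_lookup(word, pos)


tagged = [("mice", "n"), ("were", "v"), ("mice", "n"), ("were", "v"), ("mice", "n")]
lemmas = [cached_lemma(w, p) for w, p in tagged]
print(lemmas)        # ['mouse', 'be', 'mouse', 'be', 'mouse']
print(lookup_calls)  # 2
```

Five tokens trigger only two real lookups, since the cache key is the `(word, pos)` pair.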

---

## ✅ 10. Final Comparison — All Normalization Techniques

| Technique | Output Type | POS-Aware | Handles Irregulars | Speed | Output Validity | Typical Use |
|------------|--------------|------------|---------------------|--------|------------------|---------------|
| **Porter Stemmer** | Truncated root (non-word) | ❌ | ❌ | ⚡ Fast | ⚠️ | IR baseline |
| **Snowball Stemmer** | Balanced stem | ❌ | ❌ | ⚡ Fast | ⚠️ | General stemming |
| **Lancaster Stemmer** | Aggressive | ❌ | ❌ | ⚡ Fast | ❌ | Noisy text |
| **Regexp Stemmer** | Pattern-based | ❌ | ❌ | ⚡ Fast | ⚠️ Custom only | Controlled suffix removal |
| **WordNet Lemmatizer** | Real word | ✅ | ✅ | 🐢 Slower | ✅ | Semantic, context-aware NLP |

---

### 💡 TL;DR

- **Stemming** → rule-based truncation (fast, but rough).  
- **Lemmatization** → linguistically accurate (slow, but meaningful).  
- Use **Lemmatizer + POS Tagging** for all **meaning-driven** NLP tasks.



---
## 📊 11. Lemmatization Visualization Pipeline

To make lemmatization results more intuitive, let's visualize how each **token** transforms through the process:

1. **Tokenization** → breaking the sentence into words  
2. **POS Tagging** → assigning grammatical roles  
3. **POS Conversion** → mapping Penn POS to WordNet POS  
4. **Lemmatization** → finding dictionary base forms

This table helps you observe how **POS affects the final lemma**, especially for verbs, adjectives, and irregular forms.


```python
import pandas as pd
from IPython.display import Markdown, display

def visualize_lemmatization(sentence):
    """Display a DataFrame showing Token → POS → WordNet POS → Lemma."""
    tokens = word_tokenize(sentence)
    tagged = pos_tag(tokens)

    data = []
    for word, pos in tagged:
        wn_pos = penn_to_wn(pos)
        lemma = lemmatizer.lemmatize(word, pos=wn_pos)
        data.append({
            "Token": word,
            "Penn POS": pos,
            "WordNet POS": wn_pos if wn_pos else "-",
            "Lemma": lemma
        })

    df = pd.DataFrame(data)
    display(Markdown(f"### 🔍 Input Sentence:\n> *{sentence}*"))
    display(df)

# Example usage
visualize_lemmatization("The mice were running faster and the better runner was finally organized.")
```

### 🔍 Input Sentence:
> *The mice were running faster and the better runner was finally organized.*

|   | Token | Penn POS | WordNet POS | Lemma |
|---|-------|----------|-------------|-------|
| 0 | The | DT | n | The |
| 1 | mice | NN | n | mouse |
| 2 | were | VBD | v | be |
| 3 | running | VBG | v | run |
| 4 | faster | RBR | r | faster |
| 5 | and | CC | n | and |
| 6 | the | DT | n | the |
| 7 | better | JJR | a | good |
| 8 | runner | NN | n | runner |
| 9 | was | VBD | v | be |

*(The captured output shows only the first 10 of the 13 rows.)*


---

### 💡 Observations from Visualization

| Insight | Explanation |
|----------|--------------|
| `mice → mouse` | Lemmatizer correctly handles plural nouns. |
| `were → be` | Verb inflection (past tense) normalized to root. |
| `running → run` | Gerund form reduced to base form using POS. |
| `better → good` | Irregular comparative adjective resolved through WordNet's exception list. |
| `finally → finally` | Adverbs usually remain unchanged. |

---

### 🧮 Lemmatization Flow Summary


$$
\text{Tokenized Words} 
\;\xrightarrow{\text{POS Tagger}}\;
\text{Tagged Pairs (w, t)}
\;\xrightarrow{\text{map}(t)}\;
\text{WordNet POS}
\;\xrightarrow{\text{lookup}}\;
\text{Lemma}(w, \text{POS})
$$


or in short:


$$
\boxed{\text{Lemma} = \text{dictionary lookup based on morphology and POS}}
$$


---

### ✅ Wrap-Up

- Lemmatization = **POS tagging + Morphological normalization + Dictionary lookup**  
- Produces linguistically valid words — crucial for **semantic NLP tasks** like:
  - Chatbots 🤖  
  - Information Retrieval 🔍  
  - Text Summarization 🧾  
  - Question Answering 💬  

---


```python
def lemmatize_text_pipeline(text):
    """End-to-end text normalization visualization."""
    print("📥 Input Text:", text)
    visualize_lemmatization(text)

# Try it
lemmatize_text_pipeline("Studies were better when mice were running and finally gone.")
```

```text
📥 Input Text: Studies were better when mice were running and finally gone.
```

### 🔍 Input Sentence:
> *Studies were better when mice were running and finally gone.*

|   | Token | Penn POS | WordNet POS | Lemma |
|---|-------|----------|-------------|-------|
| 0 | Studies | NNS | n | Studies |
| 1 | were | VBD | v | be |
| 2 | better | JJR | a | good |
| 3 | when | WRB | n | when |
| 4 | mice | NN | n | mouse |
| 5 | were | VBD | v | be |
| 6 | running | VBG | v | run |
| 7 | and | CC | n | and |
| 8 | finally | RB | r | finally |
| 9 | gone | VBN | v | go |

*(The captured output shows only the first 10 rows.)*
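
In the output above, `Studies` comes back unchanged, whereas lowercase `studies` was reduced to `study` earlier. A plausible explanation is that the lookup is case-sensitive, so one option is to retry with a lowercased token when nothing changes. The sketch takes the lemmatizer as a parameter and is exercised with a stub; `lemmatize_with_case_fallback` and `stub_lemmatize` are hypothetical names.

```python
def lemmatize_with_case_fallback(token, pos, lemmatize):
    """Try the token as written, then retry in lowercase if nothing changed."""
    lemma = lemmatize(token, pos)
    if lemma != token or token.islower():
        return lemma
    lowered = token.lower()
    retry = lemmatize(lowered, pos)
    # Keep the original token when lowercasing did not help either,
    # so proper nouns such as "India" stay capitalized.
    return retry if retry != lowered else token


def stub_lemmatize(word, pos):
    """Case-sensitive stand-in for a dictionary lemmatizer (assumed data)."""
    return {("studies", "n"): "study", ("mice", "n"): "mouse"}.get((word, pos), word)


print(lemmatize_with_case_fallback("Studies", "n", stub_lemmatize))  # study
print(lemmatize_with_case_fallback("mice", "n", stub_lemmatize))     # mouse
print(lemmatize_with_case_fallback("India", "n", stub_lemmatize))    # India
```

The trade-off is that a sentence-initial word loses its capital letter once it is lemmatized, which is usually acceptable for normalization.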


# 🧠 NLP Text Normalization Summary

You’ve now learned the **core preprocessing trilogy** of NLP:

> **Tokenization → Stemming → Lemmatization**

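Tokenization comes first; stemming and lemmatization are then alternative ways to normalize the tokens, and a pipeline usually applies one or the other. Together they form the foundation of almost every **Natural Language Processing** workflow, from search engines to chatbots, sentiment analyzers, and summarizers.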

---

## 📚 1. Conceptual Overview

| Stage | Purpose | Output Example | Notes |
|--------|----------|----------------|-------|
| **Tokenization** | Splits raw text into tokens (words/punctuations) | `"The cats are running"` → `["The", "cats", "are", "running"]` | Foundation of NLP preprocessing |
| **Stemming** | Removes suffixes to get the root form | `"running"` → `"run"` | Rule-based, can produce non-words |
| **Lemmatization** | Converts to dictionary (lemma) form | `"better"` → `"good"` | Uses WordNet + POS awareness |

---

## 🧮 2. Mathematical Summary

$$
\text{Normalization Pipeline:}
\quad
\text{Raw Text}
\xrightarrow{\text{tokenize}}
\text{Tokens}
\xrightarrow{\text{stem/lemma}}
\text{Normalized Tokens}
$$

Formally, for each token $ w $:

$$
\text{Lemma}(w, \text{POS}) = \operatorname{lookup}_{\text{WordNet}}\big(\text{morph}(w), \text{POS}\big)
$$

---

## ⚙️ 3. ASCII Pipeline Visualization

```text
         ┌────────────────────────────────────┐
         │        Raw Input Text              │
         │   "The mice were running fast."    │
         └────────────────────────────────────┘
                          │
                          ▼
             🧩 Tokenization (word_tokenize)
                          │
                          ▼
      ["The", "mice", "were", "running", "fast", "."]
                          │
                          ▼
           ⚙️ Stemming (Porter / Snowball / etc.)
                          │
                          ▼
   ["the", "mice", "were", "run", "fast", "."]   ← root-like forms
                          │
                          ▼
      📘 Lemmatization (WordNet + POS Awareness)
                          │
                          ▼
["the", "mouse", "be", "run", "fast", "."]   ← valid dictionary words
```


---

## 💡 4. Key Takeaways

### ✅ **Tokenization**
- Splits raw text into individual tokens (words, punctuation, etc.).
- It’s the **first step** in every NLP pipeline.
- Output forms the base for further processing.

### ⚙️ **Stemming**
- Applies **rule-based truncation** to get root-like forms.
- Fast, but often produces **non-words** (e.g., *organiz*).
- Ideal for **search engines** and **information retrieval (IR)** where exact meaning isn’t critical.

### 📘 **Lemmatization**
- Produces **valid dictionary words (lemmas)**.
- Depends on **POS tagging** and **morphological rules**.
- Slower but **more accurate** and **meaningful**.
- Preferred for **semantic NLP**, **chatbots**, and **language understanding** tasks.

---

## 🎯 5. TL;DR Visual Summary

| Stage | Algorithm | Example Transformation | Output Valid | POS-Aware | Best Used For |
|--------|------------|------------------------|---------------|------------|----------------|
| **Tokenization** | `word_tokenize` | `"cats are running"` → `["cats","are","running"]` | ✅ | ❌ | Text segmentation |
| **Porter Stemmer** | Rule-based suffix removal | `"running"` → `"run"` | ⚠️ Sometimes | ❌ | Quick baseline tasks |
| **Snowball Stemmer** | Improved Porter | `"studies"` → `"studi"` | ❌ | ❌ | General stemming |
| **Lancaster Stemmer** | Aggressive | `"connection"` → `"connect"` | ⚠️ | ❌ | Noisy data cleanup |
| **Regexp Stemmer** | Regex-based suffix trim | `"organizing"` → `"organ"` | ⚠️ | ❌ | Controlled suffix removal |
| **WordNet Lemmatizer** | Morph + POS + Dictionary | `"better"` → `"good"` | ✅ | ✅ | Semantic NLP, NLU tasks |

---

## 🧩 6. Visual Flow Recap

```text
Text
  │
  ▼
Tokenization
  │
  ▼
Stemming / Lemmatization
  │
  ▼
Feature Extraction (BoW / TF-IDF / Embeddings)
  │
  ▼
Model Input → Training / Inference
```
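
To make the hand-off to feature extraction concrete, here is a minimal bag-of-words count built with `collections.Counter`. The lemma list is copied from the sentence-level notebook output earlier in this document; lowercasing and dropping punctuation are assumptions made for the sketch, not steps the flow above prescribes.

```python
from collections import Counter

# Lemmas as printed by the sentence-level example earlier in the notebook.
lemmas = ["The", "mouse", "be", "run", "faster", "and", "the",
          "good", "runner", "be", "finally", "organize", "."]

# Assumed cleanup: lowercase everything and keep alphabetic tokens only.
bag_of_words = Counter(t.lower() for t in lemmas if t.isalpha())

print(bag_of_words["the"], bag_of_words["be"])  # 2 2
print(len(bag_of_words))                        # 10
```

Because *were* and *was* both became `be`, they contribute to a single feature instead of two.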