# Lemmatization with NLTK

This notebook shows how to use NLTK's **WordNetLemmatizer**. 

**Lemmatization** is just a fancy way of saying "turning a word back to its basic dictionary form." 

For example (the two verb forms only resolve when the lemmatizer is told the word is a verb):
* "running" $\rightarrow$ "run"
* "ate" $\rightarrow$ "eat"
* "rocks" $\rightarrow$ "rock"

### Step 1: Initial Test
A simple hello world to ensure the environment is working.

In [11]:
print("Hello")

Hello


### Step 2: Import NLTK, the Lemmatizer, and the Stemmer
Import the Natural Language Toolkit (NLTK) library and download the WordNet corpora (`wordnet` and `omw-1.4`) that the lemmatizer relies on. We also import the `PorterStemmer` for comparison.

In [12]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

# Ensure necessary NLTK data (WordNet) is downloaded
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

### Step 3: Initialize Tools
Create instances of both the `WordNetLemmatizer` and the `PorterStemmer`.

In [14]:
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

### Step 4: Comparison - Stemming vs. Lemmatization

**Stemming** is a faster, rule-based process that chops off suffixes. It often results in words that aren't real (e.g., "studies" becomes "studi").

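**Lemmatization** is slower but more accurate. It looks each word up in a dictionary (WordNet) to find its actual base form (the lemma). Because `lemmatize()` treats every word as a noun unless told otherwise, verb forms such as "eating" come back unchanged in the comparison below.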
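
To make the difference concrete, here is a minimal sketch of the two approaches in plain Python. It is a toy illustration only: `RULES`, `LEMMA_TABLE`, `toy_stem`, and `toy_lemmatize` are hypothetical names invented for this example, and the logic is far simpler than the real Porter algorithm or a WordNet lookup.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.

# Hypothetical (suffix, replacement) rules, tried in order.
RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

# Hypothetical mini-dictionary standing in for WordNet: (word, pos) -> lemma.
LEMMA_TABLE = {
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("eating", "v"): "eat",
    ("ate", "v"): "eat",
}


def toy_stem(word):
    """Apply the first matching suffix rule, keeping at least three leading characters."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)] + replacement
    return word


def toy_lemmatize(word, pos="n"):
    """Look up (word, pos) in the table; return the word unchanged if it is absent."""
    return LEMMA_TABLE.get((word, pos), word)


print(f"{'Word':<8} | {'Stem':<6} | {'Lemma (n)':<9} | Lemma (v)")
for w in ["studies", "eating", "ate"]:
    print(f"{w:<8} | {toy_stem(w):<6} | {toy_lemmatize(w):<9} | {toy_lemmatize(w, 'v')}")
```

The rule-based function never consults a vocabulary, so it can produce non-words and cannot handle irregular forms like "ate". The lookup-based function only returns real entries, but its answer depends on the part of speech it is given.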

In [17]:
words = ["eating", "eats", "eaten", "writing", "writes", "programming", "programs", "history", "finally", "finalized"]

print(f"{'Word':<12} | {'Stemming':<12} | {'Lemmatization'}")
print("-" * 43)
for word in words:
    s = stemmer.stem(word)
    l = lemmatizer.lemmatize(word)
    print(f"{word:<12} | {s:<12} | {l}")

Word         | Stemming     | Lemmatization
-------------------------------------------
eating       | eat          | eating
eats         | eat          | eats
eaten        | eaten        | eaten
writing      | write        | writing
writes       | write        | writes
programming  | program      | programming
programs     | program      | program
history      | histori      | history
finally      | final        | finally
finalized    | final        | finalized


### Step 5: Improving Lemmatization with POS Tags

> [!TIP]
> By default, `lemmatize()` assumes every word is a noun (`pos='n'`). To reduce verb forms to their roots (like turning "eating" into "eat"), pass the matching **Part of Speech (POS)** tag, for example `pos='v'`.

In [18]:
verb_test = ["eating", "writing", "finalized"]
for verb in verb_test:
    print(f"{verb} -> {lemmatizer.lemmatize(verb, pos='v')}")

eating -> eat
writing -> write
finalized -> finalize
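
Hard-coding `pos='v'` only works when you already know every word is a verb. One common approach is to run a POS tagger first and translate its tags into the single-letter codes WordNet expects. The sketch below assumes the tags follow the Penn Treebank convention (the kind `nltk.pos_tag` returns); `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names, not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # nouns and everything else use the lemmatizer's default


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Call lemmatize(word, pos=...) for each (word, penn_tag) pair."""
    return [lemmatize(word, pos=penn_to_wordnet(tag)) for word, tag in tagged_tokens]


# Hand-tagged pairs so the sketch runs without any tagger model.
for word, tag in [("eating", "VBG"), ("rocks", "NNS"), ("finally", "RB")]:
    print(f"{word} ({tag}) -> pos='{penn_to_wordnet(tag)}'")
```

With the objects from this notebook, the call would look like `lemmatize_tagged(nltk.pos_tag(tokens), lemmatizer.lemmatize)`. Note that `nltk.pos_tag` needs an additional tagger model to be downloaded first, and the name of that resource depends on the NLTK version installed.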


### Performance Summary

| Feature | Stemming | Lemmatization |
| :--- | :--- | :--- |
| **Speed** | **Very Fast** (Rule-based) | Slower (Dictionary-based) |
| **Accuracy** | Lower (Non-words common) | **Higher** (Real dictionary words) |
| **Complexity** | Simple | High (requires POS for best results) |
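
The speed row is easy to check on your own machine. The helper below is a rough sketch using the standard library's `timeit`; `seconds_per_word` is a hypothetical name, and `str.lower` is only a stand-in callable so the block runs without NLTK installed.

```python
import timeit


def seconds_per_word(func, words, repeats=1000):
    """Average wall-clock seconds func spends per word over `repeats` passes."""
    total = timeit.timeit(lambda: [func(w) for w in words], number=repeats)
    return total / (repeats * len(words))


# Stand-in callable; swap in the real stemmer and lemmatizer to compare them.
sample = ["eating", "writes", "history"]
print(f"str.lower: {seconds_per_word(str.lower, sample):.2e} s/word")
```

To compare the two tools from this notebook, pass `stemmer.stem` and `lambda w: lemmatizer.lemmatize(w, pos='v')` in place of `str.lower`. Call the lemmatizer once before timing, because WordNet is loaded lazily on first use and that one-time cost would otherwise skew the result.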