# Day 2: Stemming, Lemmatization & Regex
**The AI Engineer Course 2026 - Section 21b**

**Student:** Natruja

**Date:** Friday, February 13, 2026

---

## Learning Objectives
1. Understand stemming and its purpose
2. Learn lemmatization and how it differs from stemming
3. Use regular expressions (Regex) for text processing
4. Compare these text normalization techniques

## Setup: Install and Import Required Libraries

In [None]:
import subprocess
import sys

# Install NLTK
subprocess.check_call([sys.executable, "-m", "pip", "install", "nltk", "-q"])

# Download required NLTK data
import nltk
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

print("✓ NLTK installed and data downloaded successfully!")

✓ NLTK installed and data downloaded successfully!


[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1028)>
[nltk_data] Error loading wordnet: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1028)>
[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data]     [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify
[nltk_data]     failed: unable to get local issuer certificate
[nltk_data]     (_ssl.c:1028)>
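The certificate errors above mean the three downloads did not complete in this session, so later cells only work if the data is already on disk. A minimal sketch for checking what is available locally follows. The helper name `missing_nltk_resources` and the resource paths in `REQUIRED` are assumptions for illustration; the check relies on `nltk.data.find`, which raises `LookupError` when a resource cannot be found.

```python
def missing_nltk_resources(resource_paths):
    """Return the subset of resource_paths that NLTK cannot find locally."""
    try:
        import nltk
    except ImportError:
        # NLTK itself is not installed, so nothing can be found.
        return list(resource_paths)

    missing = []
    for path in resource_paths:
        try:
            nltk.data.find(path)
        except LookupError:
            missing.append(path)
    return missing


# Assumed on-disk locations of the three packages requested in the setup cell.
REQUIRED = [
    "tokenizers/punkt",
    "corpora/wordnet",
    "taggers/averaged_perceptron_tagger",
]

missing = missing_nltk_resources(REQUIRED)
if missing:
    print("Still missing:", missing)
else:
    print("All required NLTK data is available locally.")
```

If anything is reported as missing, the underlying certificate problem has to be fixed (or the data copied in manually) before the tokenizer and lemmatizer cells can run.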


In [2]:
# Import necessary libraries
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import re
import nltk

print("✓ All imports successful!")

✓ All imports successful!


## Text Normalization: Why Do We Need It?

In NLP, words can appear in different forms:
- "running", "runs", "run" - all refer to the same action
- "better", "best" - both related to "good"
- "flying", "flies" - same root concept

**Text Normalization** converts these variations into a standard form, helping NLP models recognize that they're related.

### Two Main Approaches:
1. **Stemming**: Remove suffixes to get the root (fast, rough)
2. **Lemmatization**: Convert to dictionary form using linguistic rules (slower, more accurate)

## Stemming: Quick Root Extraction

**Stemming** applies heuristic rules to strip common word endings. The Porter stemmer used here removes suffixes only; it does not touch prefixes.

### Porter Stemmer Algorithm:
- One of the most widely used stemming algorithms for English
- Uses a simple rule-based approach
- Very fast
- Often produces stems that are not real words

### Examples:
- "running" → "run"
- "jumps" → "jump"
- "relational" → "relat"
- "universities" → "univers"
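To make the idea of rule-based suffix stripping concrete, here is a toy sketch. It is not the Porter algorithm: the suffix list and the `toy_stem` helper are invented for illustration, and the real Porter stemmer applies several ordered rule phases with extra conditions.

```python
# Toy suffix list, checked in order. NOT the real Porter rule set.
SUFFIXES = ["ities", "ional", "ing", "ed", "es", "s"]


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["running", "jumps", "jumped", "relational", "universities", "ran"]:
    print(f"{w:<15} | {toy_stem(w)}")
```

This toy version turns "running" into "runn", whereas Porter returns "run" because it has an additional rule for doubled consonants. Both approaches share the same trade-off: the output is a truncated string, not necessarily a dictionary word.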

## EXAMPLE: Stemming with Porter Stemmer

In [3]:
# Create a Porter Stemmer object
stemmer = PorterStemmer()

# Words to stem
words = ['running', 'runs', 'ran', 'runner', 'jumped', 'jumps', 'jumping', 'relational', 'universities']

print("Porter Stemming Examples:")
print("="*40)
print(f"{'Original Word':<15} | {'Stemmed':<15}")
print("-"*40)

for word in words:
    stemmed = stemmer.stem(word)
    print(f"{word:<15} | {stemmed:<15}")

Porter Stemming Examples:
========================================
Original Word   | Stemmed        
----------------------------------------
running         | run            
runs            | run            
ran             | ran            
runner          | runner         
jumped          | jump           
jumps           | jump           
jumping         | jump           
relational      | relat          
universities    | univers        


## Lemmatization: Dictionary-Based Normalization

**Lemmatization** converts words to their base form (lemma) using vocabulary and morphological analysis.

### WordNetLemmatizer:
- Uses WordNet dictionary
- More accurate than stemming
- Slower but produces real words
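- Needs a POS (part-of-speech) tag for best results; `lemmatize()` treats every word as a noun (`pos='n'`) unless told otherwise

### Examples:
- "running" → "run" (with `pos='v'`)
- "better" → "good" (with `pos='a'`)
- "universities" → "university" (default `pos='n'`)
- "organized" → "organize" (with `pos='v'`)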
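The role of the POS tag can be illustrated without WordNet. The sketch below uses a tiny hand-written table as a stand-in for a real dictionary; `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, and the default of `pos="n"` is chosen to mirror how `WordNetLemmatizer.lemmatize()` behaves when no tag is given.

```python
# Tiny hand-written lemma table keyed by (word, pos); a stand-in for WordNet.
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("better", "a"): "good",
    ("organized", "v"): "organize",
    ("universities", "n"): "university",
    ("saw", "v"): "see",
}


def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word itself when there is no entry."""
    return LEMMA_TABLE.get((word, pos), word)


print(toy_lemmatize("running"))           # treated as a noun: unchanged
print(toy_lemmatize("running", pos="v"))  # treated as a verb: run
print(toy_lemmatize("better", pos="a"))   # treated as an adjective: good
print(toy_lemmatize("universities"))      # noun entry exists: university
```

The same word can map to different lemmas, or to none, depending on the part of speech supplied, which is why untagged lemmatization often appears to do nothing.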

## EXAMPLE: Lemmatization with WordNetLemmatizer

In [4]:
# Create a Lemmatizer object
lemmatizer = WordNetLemmatizer()

# Words to lemmatize; lemmatize() defaults to pos='n', so verbs and adjectives come back unchanged
words = ['running', 'runs', 'ran', 'better', 'best', 'organized', 'universities']

print("WordNet Lemmatization Examples:")
print("="*40)
print(f"{'Original Word':<15} | {'Lemma':<15}")
print("-"*40)

for word in words:
    lemma = lemmatizer.lemmatize(word)
    print(f"{word:<15} | {lemma:<15}")

WordNet Lemmatization Examples:
========================================
Original Word   | Lemma          
----------------------------------------
running         | running        
runs            | run            
ran             | ran            
better          | better         
best            | best           
organized       | organized      
universities    | university     


## EXAMPLE: Stemming vs Lemmatization Comparison

In [5]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ['running', 'better', 'organized', 'universities', 'jumped', 'happening']

print("Stemming vs Lemmatization:")
print("="*55)
print(f"{'Word':<15} | {'Stemmed':<15} | {'Lemmatized':<15}")
print("-"*55)

for word in words:
    stem = stemmer.stem(word)
    lemma = lemmatizer.lemmatize(word)
    print(f"{word:<15} | {stem:<15} | {lemma:<15}")

Stemming vs Lemmatization:
=======================================================
Word            | Stemmed         | Lemmatized     
-------------------------------------------------------
running         | run             | running        
better          | better          | better         
organized       | organ           | organized      
universities    | univers         | university     
jumped          | jump            | jumped         
happening       | happen          | happening      
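Most entries in the Lemmatized column above are unchanged because `lemmatize()` was called without a POS tag and so treated every word as a noun. One common remedy is to translate tagger output into the single-letter POS codes that WordNet expects. The sketch below is a rough mapping under stated assumptions: `penn_to_wordnet` is a hypothetical helper, and the tagged pairs are hard-coded stand-ins for what a tagger such as `nltk.pos_tag` might return.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (n, v, a, r).

    Simplification: anything that is not clearly an adjective, verb,
    or adverb is treated as a noun.
    """
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun (default)


# Hard-coded (word, tag) pairs standing in for real tagger output.
tagged = [
    ("running", "VBG"),
    ("better", "JJR"),
    ("universities", "NNS"),
    ("quickly", "RB"),
]

for word, tag in tagged:
    print(f"{word:<15} | {tag:<5} | pos='{penn_to_wordnet(tag)}'")
```

With the NLTK data available, each letter could then be passed along as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))`.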


## Regular Expressions (Regex): Pattern Matching

**Regular Expressions** are patterns for matching and manipulating text.

### Common Patterns:
- `\d` : Any digit (0-9)
- `\w` : Any word character (letters, digits, underscore)
- `\s` : Any whitespace
- `.` : Any character
- `*` : 0 or more repetitions
- `+` : 1 or more repetitions
- `[abc]` : Any character in brackets
- `[^abc]` : Any character NOT in brackets
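The patterns listed above can be tried one at a time with `re.findall`. The sample strings below are made up for illustration.

```python
import re

sample = "Order 66: ship 12 crates_to dock B7, ASAP!"

single_digits = re.findall(r"\d", sample)      # one match per digit
digit_runs = re.findall(r"\d+", sample)        # + groups consecutive digits
words = re.findall(r"\w+", sample)             # letters, digits, underscore
spaces = re.findall(r"\s", sample)             # one match per whitespace char
punctuation = re.findall(r"[^\w\s]", sample)   # NOT word char, NOT whitespace
b_plus_any = re.findall(r"B.", sample)         # 'B' followed by any character

print(single_digits)
print(digit_runs)
print(words)
print(len(spaces))
print(punctuation)
print(b_plus_any)

# * allows zero repetitions, + requires at least one.
goals = "goooal gal gl"
print(re.findall(r"go*al", goals))  # matches 'goooal' and 'gal'
print(re.findall(r"go+al", goals))  # matches 'goooal' only
```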

### Common Use Cases:
- Email validation
- Phone number extraction
- Removing special characters
- Finding patterns in text
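Validation and extraction use the same pattern in different ways: validation asks whether the whole string matches, while extraction looks for matches anywhere inside a longer text. The sketch below shows the difference with `re.fullmatch` and `re.search`. The pattern is deliberately simple and `looks_like_email` is a hypothetical helper; real email syntax is more permissive than this.

```python
import re

# Deliberately simple pattern; it does not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def looks_like_email(candidate):
    """Validation: the ENTIRE string must match the pattern."""
    return EMAIL_RE.fullmatch(candidate) is not None


print(looks_like_email("john@example.com"))         # True
print(looks_like_email("john@example"))             # False: no top-level domain
print(looks_like_email("Email: john@example.com"))  # False: extra text around it

# Extraction: search finds the address inside the longer string.
found = EMAIL_RE.search("Email: john@example.com")
print(found.group(0))
```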

## EXAMPLE: Regex for Cleaning Text

In [6]:
# Sample text with special characters and numbers
text = "Hello! I have 123 apples. My email is john@example.com. Call me at 555-1234!"

print("Original text:")
print(text)
print("\n" + "="*60)

# Remove special characters (keep only letters, numbers, and spaces)
cleaned = re.sub(r'[^a-zA-Z0-9\s]', '', text)
print("\nRemove special characters:")
print(cleaned)

# Remove numbers
no_numbers = re.sub(r'\d+', '', text)
print("\nRemove numbers:")
print(no_numbers)

# Collapse runs of whitespace into single spaces (this sample has none, so the output is unchanged)
no_extra_spaces = re.sub(r'\s+', ' ', text).strip()
print("\nClean extra whitespace:")
print(no_extra_spaces)

Original text:
Hello! I have 123 apples. My email is john@example.com. Call me at 555-1234!

============================================================

Remove special characters:
Hello I have 123 apples My email is johnexamplecom Call me at 5551234

Remove numbers:
Hello! I have  apples. My email is john@example.com. Call me at -!

Clean extra whitespace:
Hello! I have 123 apples. My email is john@example.com. Call me at 555-1234!
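The cell above applies each substitution to the original text independently. In a real cleaning step the substitutions are usually chained, and the whitespace collapse matters because removing digits leaves gaps behind. A minimal sketch, with `clean_text` as a hypothetical helper name:

```python
import re


def clean_text(text):
    """Chain the three substitutions: special chars, digits, then whitespace."""
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # drop punctuation and symbols
    text = re.sub(r"\d+", "", text)             # drop digit runs (leaves gaps)
    text = re.sub(r"\s+", " ", text).strip()    # collapse gaps, trim the ends
    return text


raw = "Hello! I have 123 apples. My email is john@example.com. Call me at 555-1234!"
cleaned_text = clean_text(raw)
print(cleaned_text)
# Hello I have apples My email is johnexamplecom Call me at
```

Note that the email address is mangled into "johnexamplecom". If addresses or phone numbers matter, they need to be extracted before this kind of blanket cleaning.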


## EXAMPLE: Regex for Pattern Extraction

In [7]:
# Text with email addresses and phone numbers
text = "Contact info: alice@example.com or bob@company.org. Phone: 555-1234 or 555-5678"

print("Text:")
print(text)
print("\n" + "="*60)

# Extract email addresses
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
print("\nEmails found:")
print(emails)

# Extract phone numbers
phones = re.findall(r'\d{3}-\d{4}', text)
print("\nPhone numbers found:")
print(phones)

# Extract all numbers
numbers = re.findall(r'\d+', text)
print("\nAll numbers found:")
print(numbers)

Text:
Contact info: alice@example.com or bob@company.org. Phone: 555-1234 or 555-5678

============================================================

Emails found:
['alice@example.com', 'bob@company.org']

Phone numbers found:
['555-1234', '555-5678']

All numbers found:
['555', '1234', '555', '5678']
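One detail worth knowing when writing character classes for patterns like these: inside square brackets, `|` is not alternation but an ordinary character. A class written as `[A-Z|a-z]` therefore also accepts a literal pipe, which is rarely what an email pattern intends. The short demonstration below uses a made-up sample string.

```python
import re

with_pipe = re.compile(r"[A-Z|a-z]+")    # the | is a literal character here
letters_only = re.compile(r"[A-Za-z]+")  # letters only

sample = "com|org"

print(with_pipe.findall(sample))     # ['com|org']
print(letters_only.findall(sample))  # ['com', 'org']
```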


---
# EXERCISE SECTION: 15 Exercises Organized by Difficulty

Work through these exercises to master stemming, lemmatization, and regex!

## ⭐ EASY: Exercise 1 - Stem a Single Word

In [None]:
# TODO: Create a stemmer and stem the word 'running'
# Store the result in a variable called 'result'

stemmer = PorterStemmer()
word = 'running'
result = stemmer.stem(word)

print(f"Original word: {word}")
print(f"Stemmed word: {result}")
print(f"Success! '{word}' was stemmed to '{result}'")

## ⭐ EASY: Exercise 2 - Stem a List of Words

In [None]:
# TODO: Stem all words in the list using list comprehension or a loop
# Create a list called 'stemmed_words' with the stemmed versions

stemmer = PorterStemmer()
words = ['happiness', 'walking', 'studies', 'flying', 'connection']

stemmed_words = ___

print(f"Original: {words}")
print(f"Stemmed:  {stemmed_words}")

## ⭐ EASY: Exercise 3 - Lemmatize a Single Word

In [None]:
# TODO: Create a lemmatizer and lemmatize the word 'better'
# Store the result in a variable called 'result'

lemmatizer = WordNetLemmatizer()
word = 'better'
# pos='a' marks 'better' as an adjective; the default pos='n' would return it unchanged
result = lemmatizer.lemmatize(word, pos='a')

print(f"Original word: {word}")
print(f"Lemmatized word: {result}")
print(f"Success! '{word}' was lemmatized to '{result}'")

## ⭐ EASY: Exercise 4 - Use re.findall() to Find Numbers

In [None]:
# TODO: Use re.findall() with the pattern \d+ to extract all numbers from the text
# Store the result in a variable called 'numbers'

text = "I have 3 cats, 5 dogs, and 12 birds. There are 100 total animals."

numbers = ___

print(f"Text: {text}")
print(f"Numbers found: {numbers}")
print(f"Total of {len(numbers)} numbers extracted!")

## ⭐ EASY: Exercise 5 - Use re.sub() to Remove Punctuation

In [None]:
# TODO: Use re.sub() to remove all punctuation from the text
# Pattern [^a-zA-Z\s] means: any character that is NOT a letter or whitespace (so digits are removed too)
# Replace it with empty string ''

text = "Hello! How are you? I'm great, thanks!!!!"

cleaned = ___

print(f"Original: {text}")
print(f"Cleaned:  {cleaned}")

## ⭐⭐ MEDIUM: Exercise 6 - Compare Stemming vs Lemmatization

In [None]:
# TODO: For each word in the list, create a dictionary with the original word,
# its stemmed version, and its lemmatized version.
# You can use a list comprehension or loop to create a list of dictionaries.

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
words = ['running', 'better', 'organized', 'universities']

results = [
    {
        'original' : word,
        'stemmed' : stemmer.stem(word),
        'lemmatized' : lemmatizer.lemmatize(word)
    }

    for word in words
    
]

# Print results in a readable format
print(f"{'Word':<15} | {'Stemmed':<15} | {'Lemmatized':<15}")
print("-" * 50)
for result in results:
    print(f"{result['original']:<15} | {result['stemmed']:<15} | {result['lemmatized']:<15}")

## ⭐⭐ MEDIUM: Exercise 7 - Lemmatize with Different POS Parameters

In [None]:
# TODO: Lemmatize the word 'saw' as:
# 1. A noun (pos='n') - should stay as 'saw'
# 2. A verb (pos='v') - should become 'see'
# The pos parameter changes how the word is interpreted!
import nltk
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
word = 'saw'

lemma_noun = lemmatizer.lemmatize(word, pos='n')  # pos='n'
lemma_verb = lemmatizer.lemmatize(word, pos='v')  # pos='v'

print(f"Original word: '{word}'")
print(f"As noun (pos='n'): {lemma_noun}")
print(f"As verb (pos='v'): {lemma_verb}")
print("\nNotice how the POS tag changes the lemmatization result!")

## ⭐⭐ MEDIUM: Exercise 8 - Extract Emails and URLs with Regex

In [None]:
# TODO: Use re.findall() with appropriate regex patterns to extract:
# 1. All email addresses from the text
# 2. All URLs from the text
# Hints:
# - Email pattern: r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
# - URL pattern: r'http\S+|www\S+'

text = """Contact us at support@company.com or sales@business.org. 
Visit our website at https://www.example.com or www.another-site.net. 
Email me at john.doe@mail.co.uk anytime!"""

emails = ___
urls = ___

print(f"Text: {text}\n")
print(f"Emails found: {emails}")
print(f"URLs found: {urls}")

## ⭐⭐ MEDIUM: Exercise 9 - Clean Text with Regex Then Tokenize

In [None]:
# TODO: 
# 1. Use re.sub() to remove special characters and numbers from the text
# 2. Convert to lowercase
# 3. Use word_tokenize() to split into words
# 4. Filter out empty strings

text = "The quick BROWN fox!!! It's jumping over the lazy 123 dogs... Wow!!!"

# Step 1: Clean with regex (remove punctuation, numbers, special chars)
cleaned = ___

# Step 2: Convert to lowercase
cleaned = cleaned.lower()

# Step 3: Tokenize
tokens = ___

# Step 4: Filter out empty strings
tokens = ___


print(f"Original: {text}")
print(f"Cleaned: {cleaned.strip()}")
print(f"Tokens: {tokens}")

## ⭐⭐ MEDIUM: Exercise 10 - Stem All Words in a Sentence

In [None]:
# TODO:
# 1. Tokenize the sentence into words
# 2. Stem each word using PorterStemmer
# 3. Create a list of stemmed words
# 4. Join them back into a sentence

stemmer = PorterStemmer()
sentence = "The runners were running quickly through the running competition."

# Step 1: Tokenize
tokens = ___

# Step 2 & 3: Stem each word (use list comprehension)
stemmed_tokens = ___

# Step 4: Join back into a sentence
stemmed_sentence = ' '.join(stemmed_tokens)

print(f"Original: {sentence}")
print(f"Stemmed:  {stemmed_sentence}")

## ⭐⭐⭐ HARD: Exercise 11 - Full Preprocessing Pipeline

In [None]:
# TODO: Build a complete text preprocessing pipeline that:
# 1. Removes special characters and numbers (keep only letters and spaces)
# 2. Converts to lowercase
# 3. Tokenizes the text
# 4. Removes stopwords (use nltk.corpus.stopwords)
# 5. Lemmatizes all remaining words

# First, download stopwords if needed
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords', quiet=True)

raw_text = "The quick BROWN fox!!! It's jumping 123 times over the lazy dogs... Amazing!!!"
lemmatizer = WordNetLemmatizer()
stop_words = ___

# Step 1: Remove special characters and numbers
step1 = ___

# Step 2: Convert to lowercase
step2 = step1.lower()

# Step 3: Tokenize
step3 = word_tokenize(step2)

# Step 4: Remove stopwords (filter out words in stop_words set)
step4 = ___

# Step 5: Lemmatize
step5 = ___

print(f"Original: {raw_text}")
print(f"After removing special chars: {step1}")
print(f"After tokenization: {step3}")
print(f"After removing stopwords: {step4}")
print(f"After lemmatization: {step5}")
print(f"\nFinal token count: {len(step5)}")

## ⭐⭐⭐ HARD: Exercise 12 - Smart POS-Based Lemmatization

In [None]:
# TODO: Create a function that lemmatizes words and tries different POS tags
# to find the shortest lemmatized form.
# 
# The function should:
# 1. Try lemmatizing with pos='n' (noun)
# 2. Try lemmatizing with pos='v' (verb)
# 3. Try lemmatizing with pos='a' (adjective)
# 4. Return the shortest result

import nltk
nltk.download('wordnet')

def smart_lemmatize(word, lemmatizer):
    """Try different POS tags and return shortest lemmatized form."""
    
    # Create a list of lemmatized results using different pos tags
    results = [
        lemmatizer.lemmatize(word, pos='n'),  # as noun
        lemmatizer.lemmatize(word, pos='v'),  # as verb
        lemmatizer.lemmatize(word, pos='a')   # as adjective
    ]
    
    # Return the shortest result
    return min(results, key=len)  # Use min() with key=len

lemmatizer = WordNetLemmatizer()
test_words = ['running', 'better', 'saw', 'organized']

print(f"{'Word':<15} | {'Smart Lemma':<15}")
print("-" * 30)
for word in test_words:
    result = smart_lemmatize(word, lemmatizer)
    print(f"{word:<15} | {result:<15}")

## ⭐⭐⭐ HARD: Exercise 13 - Extract and Clean Phone Numbers

In [None]:
# TODO: 
# 1. Extract phone numbers from the text using regex
# 2. Clean them to remove formatting (dashes, spaces, parentheses)
# 3. Return only valid phone numbers (exactly 10 digits)
#
# Hints:
# - Pattern to find phone numbers: r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
# - Use re.sub() to remove non-digit characters: r'\D' matches non-digits
# - Check length to ensure exactly 10 digits

text = """Contact us:
Main line: (555) 123-4567
Sales: 555.987.6543
Support: 555-246-8135
Fax: (555) 111-2222
Invalid: 555-12-34
"""

# Step 1: Extract potential phone numbers
# [-.\s]? also allows a space separator, as in "(555) 123-4567"
phone_pattern = r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}'
extracted = ___


# Step 2 & 3: Clean them and keep only valid ones (10 digits)
cleaned_phones = []
for phone in extracted:
    # Remove non-digit characters
    cleaned = ___
    # Keep only if exactly 10 digits
    if len(cleaned) == 10:
        cleaned_phones.append(cleaned)

print(f"Text: {text}")
print(f"\nExtracted phone numbers: {extracted}")
print(f"Cleaned phone numbers: {cleaned_phones}")

## ⭐⭐⭐ HARD: Exercise 14 - Process Multiple Documents

In [None]:
# TODO: Create a function that processes multiple documents through:
# 1. Regex cleaning (remove special characters, numbers)
# 2. Tokenization
# 3. Stopword removal
# 4. Lemmatization
# 
# The function should return stats about each document:
# - Original token count
# - Final token count
# - List of final lemmatized tokens

def process_document(text, lemmatizer, stop_words):
    """Process a single document through the full pipeline."""
    # Step 1: Clean (remove special chars, numbers)
    cleaned =  ___
    
    # Step 2: Tokenize
    tokens = ___
    original_count = len(tokens)
    
    # Step 3: Remove stopwords
    tokens = ___
    
    # Step 4: Lemmatize
    tokens = ___
    
    return {
        'original_count': original_count,
        'final_count': len(tokens),
        'tokens': tokens
    }

# Documents to process
documents = [
    "The quick brown fox jumps over the lazy dog!!!",
    "Natural Language Processing is amazing!!! I'm learning NLP with 100% enthusiasm.",
    "Python, Java, and C++ are popular programming languages. They're great!!!"
]

lemmatizer = WordNetLemmatizer()
stop_words = ___

results = []
for i, doc in enumerate(documents, 1):
    result = process_document(doc, lemmatizer, stop_words)
    results.append(result)
    print(f"\nDocument {i}:")
    print(f"  Original: {doc}")
    print(f"  Tokens: {result['original_count']} → {result['final_count']}")
    print(f"  Processed: {result['tokens']}")

## ⭐⭐⭐ HARD: Exercise 15 - Compare Stemming vs Lemmatization Pipeline

In [None]:
# TODO: Create two preprocessing pipelines - one using stemming, one using lemmatization
# Then compare their outputs.
# 
# Stemming pipeline:
# - Clean text with regex
# - Tokenize
# - Remove stopwords
# - Stem words
#
# Lemmatization pipeline:
# - Clean text with regex
# - Tokenize
# - Remove stopwords
# - Lemmatize words
#
# Compare results and count unique tokens in each pipeline.

def stemming_pipeline(text, stemmer, stop_words):
    """Process text using stemming."""
    cleaned = ___
    tokens = ___
    tokens = ___
    stemmed = ___
    return stemmed

def lemmatization_pipeline(text, lemmatizer, stop_words):
    """Process text using lemmatization."""
    cleaned = ___
    tokens = ___
    tokens = ___
    lemmatized = ___  # Apply lemmatizer to each token
    return lemmatized

# Test text
text = "The runners were running quickly. They ran and jumped repeatedly, running faster!"

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = ___

# Process with both pipelines
stemmed_result = ___
lemmatized_result = ___

print(f"Original text: {text}\n")
print(f"Stemming pipeline:")
print(f"  Result: {stemmed_result}")
print(f"  Unique tokens: {len(set(stemmed_result))}")
print(f"\nLemmatization pipeline:")
print(f"  Result: {lemmatized_result}")
print(f"  Unique tokens: {len(set(lemmatized_result))}")
print(f"\nComparison:")
print(f"  Stemming produced {len(stemmed_result)} tokens with {len(set(stemmed_result))} unique")
print(f"  Lemmatization produced {len(lemmatized_result)} tokens with {len(set(lemmatized_result))} unique")

---
## Summary

### Key Takeaways:
- **Stemming** strips suffixes by rule to get a stem (fast, approximate, not always a real word)
- **Lemmatization** converts to dictionary form (slower, more accurate)
- **Regex** is powerful for finding and cleaning text patterns
- Each technique has different use cases and trade-offs

### When to Use Each:
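- **Stemming**: High-volume or speed-sensitive work such as search indexing, where approximate stems are acceptable
- **Lemmatization**: Tasks that need valid dictionary forms, such as sentiment analysis, or whenever accuracy matters more than speed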
- **Regex**: Data cleaning, pattern extraction, preprocessing

### What's Next:
Tomorrow we'll explore **POS Tagging and Named Entity Recognition (NER)** - identifying what role each word plays in a sentence!

---

*Created for Natruja's NLP study plan*