* Objective: Apply Tokenization, Stemming, and Lemmatization to a sample text.
* Tools: Python (Jupyter/VS Code), NLTK (or spaCy).
* Steps: Define the sample text, import necessary tools, and print the output of each step.

In [1]:
sample_text = "Natural Language Processing (NLP) is a fascinating field! Computers are learning to understand human languages, which presents many exciting and challenging problems. We are discussing tokenizing, stemming, and lemmatizing."

In [2]:
# 1. Import necessary libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# 2. Download NLTK resources (only needs to be run once)
print("Downloading NLTK resources...")
# 'punkt' for tokenization, 'stopwords' for the list, 'wordnet' for lemmatization
# (assumption: recent NLTK releases may also need nltk.download('punkt_tab'))
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
print("Setup complete.")

Downloading NLTK resources...


[nltk_data] Downloading package punkt to /Users/mollynew/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mollynew/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/mollynew/nltk_data...


Setup complete.


Step 2: Tokenization
Break the raw text into a list of individual words or tokens.

In [3]:
# --- Step 2: Tokenization ---
print("## Step 2: Tokenization ##")

# Use word_tokenize to split the sample text
tokens = word_tokenize(sample_text)

print(f"\nOriginal Text: {sample_text}")
print("-" * 50)
print(f"Total Tokens: {len(tokens)}")
print(f"First 10 Tokens: {tokens[:10]}")

## Step 2: Tokenization ##

Original Text: Natural Language Processing (NLP) is a fascinating field! Computers are learning to understand human languages, which presents many exciting and challenging problems. We are discussing tokenizing, stemming, and lemmatizing.
--------------------------------------------------
Total Tokens: 37
First 10 Tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'fascinating', 'field']
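
To make the splitting behaviour above concrete, here is a minimal standard-library sketch of what a tokenizer does. It is not how `word_tokenize` is implemented; `simple_tokenize` and its regular expression are hypothetical stand-ins that happen to split this kind of plain sentence into the same words and punctuation marks.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a whole word; [^\w\s] matches one non-space, non-word character
    return re.findall(r"\w+|[^\w\s]", text)


demo_text = "Natural Language Processing (NLP) is a fascinating field!"
demo_tokens = simple_tokenize(demo_text)

print(f"Total Tokens: {len(demo_tokens)}")
print(f"Tokens: {demo_tokens}")
```

The sketch diverges from NLTK on harder input: a contraction such as "don't" comes out here as three pieces (`don`, `'`, `t`), while `word_tokenize` applies language-aware rules to it. That difference is the main reason to prefer the library tokenizer on real text.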


Step 3: Stopword Removal
Filter out highly common, uninformative words.

In [4]:
# --- Step 3: Stopword Removal ---
print("\n## Step 3: Stopword Removal ##")

# 1. Get the list of English stopwords and convert to a set for fast lookup
stop_words = set(stopwords.words("english"))
print(f"Stop words count: {len(stop_words)}")
#print(stop_words)

# 2. Filter the tokens
# Keep a token only if it is alphanumeric (this drops punctuation) and its
# lowercase form is not a stopword; the lowercase form is what gets stored
filtered_tokens = [
    word.lower() for word in tokens if word.lower() not in stop_words and word.isalnum()
]

print(f"\nStopwords List Snippet: {list(stop_words)[:5]}")
print("-" * 50)
print(f"Original Token Count: {len(tokens)}")
print(f"Filtered Token Count: {len(filtered_tokens)}")
print(f"Filtered Tokens: {filtered_tokens}")


## Step 3: Stopword Removal ##
Stop words count: 198

Stopwords List Snippet: ["isn't", 'i', "didn't", 'very', 'about']
--------------------------------------------------
Original Token Count: 37
Filtered Token Count: 20
Filtered Tokens: ['natural', 'language', 'processing', 'nlp', 'fascinating', 'field', 'computers', 'learning', 'understand', 'human', 'languages', 'presents', 'many', 'exciting', 'challenging', 'problems', 'discussing', 'tokenizing', 'stemming', 'lemmatizing']
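
One side effect of the `isalnum()` check is worth seeing in isolation: it removes punctuation, but it also removes any token that merely contains a hyphen or apostrophe. The sketch below uses a hypothetical five-word stopword set (`mini_stop_words`) and a made-up token list instead of the NLTK resources, so it runs without any downloads.

```python
# Hypothetical mini stopword set, standing in for stopwords.words("english")
mini_stop_words = {"is", "a", "the", "to", "and"}

# Made-up tokens chosen to include hyphenated and numeric cases
raw_tokens = ["NLP", "is", "a", "state-of-the-art", "field", "!", "GPT-4", "2024"]

# Same filter as in the cell above: lowercase, not a stopword, strictly alphanumeric
strict = [
    t.lower() for t in raw_tokens if t.lower() not in mini_stop_words and t.isalnum()
]

# Looser alternative: keep a token if it contains at least one alphanumeric character
loose = [
    t.lower()
    for t in raw_tokens
    if t.lower() not in mini_stop_words and any(ch.isalnum() for ch in t)
]

print(f"Strict filter: {strict}")
print(f"Loose filter:  {loose}")
```

The strict filter keeps `['nlp', 'field', '2024']`, and the loose one additionally keeps `'state-of-the-art'` and `'gpt-4'`. Which behaviour is appropriate depends on the task; for the sample text in this notebook the two give the same result because it has no hyphenated words.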


Steps 4 & 5: Stemming vs. Lemmatization
Compare the two ways of reducing a word to a base form on a few words to see how their outputs differ.

In [5]:
# --- Step 4 & 5: Stemming vs. Lemmatization ---
print("\n## Step 4 & 5: Stemming vs. Lemmatization Comparison ##")

# Initialize the tools
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Define a list of interesting words to compare
words_to_analyze = [
    "learning",
    "exciting",
    "challenging",
    "processing",
    "computers",
    "better",
    "feet",
]

# Create a comparison table
print(f"\n{'Word':<15} | {'Stemmed':<15} | {'Lemmatized (Verb/Adj)':<20}")
print("-" * 55)

for word in words_to_analyze:
    stemmed = stemmer.stem(word)
    # Lemmatize as a verb (pos="v"); without a pos argument the lemmatizer assumes noun
    lemmatized = lemmatizer.lemmatize(word, pos="v")

    # Re-run with adjective POS for 'better'
    if word == "better":
        lemmatized = lemmatizer.lemmatize(word, pos="a")

    print(f"{word:<15} | {stemmed:<15} | {lemmatized:<20}")

print(
    "\n**Insight:** Stemming often produces non-words (e.g., 'comput', 'excit'). Lemmatization returns dictionary words ('excite', 'good') when the POS tag fits; the nouns 'computers' and 'feet' are left unchanged under pos='v'."
)


## Step 4 & 5: Stemming vs. Lemmatization Comparison ##

Word            | Stemmed         | Lemmatized (Verb/Adj)
-------------------------------------------------------
learning        | learn           | learn               
exciting        | excit           | excite              
challenging     | challeng        | challenge           
processing      | process         | process             
computers       | comput          | computers           
better          | better          | good                
feet            | feet            | feet                

**Insight:** Stemming often produces non-words (e.g., 'comput', 'excit'). Lemmatization returns dictionary words ('excite', 'good') when the POS tag fits; the nouns 'computers' and 'feet' are left unchanged under pos='v'.
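
The table shows 'computers' and 'feet' passing through the lemmatizer unchanged, which comes down to the part-of-speech tag: both are nouns but were looked up as verbs. The toy sketch below illustrates the mechanism with the standard library only. `toy_stem` is a naive suffix stripper, not the Porter algorithm, and `TOY_LEMMAS` is a hand-written table standing in for WordNet; both are assumptions made for illustration.

```python
# Suffixes tried in order by the toy stemmer (NOT the Porter rule set)
SUFFIXES = ("ing", "ers", "er", "s")

# Hand-written (word, pos) -> lemma table, a stand-in for a real dictionary
TOY_LEMMAS = {
    ("learning", "v"): "learn",
    ("exciting", "v"): "excite",
    ("computers", "n"): "computer",
    ("better", "a"): "good",
    ("feet", "n"): "foot",
}


def toy_stem(word):
    """Chop the first matching suffix, leaving at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word, pos):
    """Look up (word, pos); fall back to the word itself when there is no entry."""
    return TOY_LEMMAS.get((word, pos), word)


for word, pos in [("exciting", "v"), ("computers", "v"), ("computers", "n"),
                  ("feet", "v"), ("feet", "n"), ("better", "a")]:
    print(f"{word:<12} pos={pos} | stem: {toy_stem(word):<8} | lemma: {toy_lemmatize(word, pos)}")
```

Because the lookup key includes the tag, `("feet", "v")` finds nothing and returns 'feet', while `("feet", "n")` returns 'foot'. With NLTK itself, the analogous change would be calling `lemmatizer.lemmatize(word, pos="n")` for the nouns in `words_to_analyze`.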
