# Text Preprocessing

This notebook covers essential text preprocessing techniques used in Natural Language Processing:
- **Tokenization**: Breaking text into words, sentences, or subwords
- **Stemming**: Reducing words to their root form
- **Lemmatization**: Converting words to their base/dictionary form
- **Named Entity Recognition (NER)**: Identifying entities like people, organizations, locations
- **Part-of-Speech (POS) Tagging**: Identifying grammatical roles of words

## Learning Objectives

- Understand different tokenization methods and when to use them
- Apply stemming and lemmatization for text normalization
- Use NER to extract structured information from text
- Perform POS tagging for grammatical analysis
- Compare and choose appropriate preprocessing techniques

## Setup

First, we need to install required libraries and download necessary data:
- `nltk`: Natural Language Toolkit
- `spacy`: Industrial-strength NLP library


## Installation

Run this cell to install required packages (uncomment if needed):


In [None]:
# Install packages (uncomment if needed)
# !pip install nltk spacy
# !python -m spacy download en_core_web_sm


In [23]:
# Import libraries
import nltk
import spacy
from nltk.tokenize import word_tokenize, sent_tokenize, wordpunct_tokenize
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, ne_chunk
from nltk.corpus import stopwords
import re

# Download required NLTK data (run once)
required_resources = [
    'punkt',
    'punkt_tab',  # Required for newer NLTK versions
    'stopwords',
    'wordnet',
    'averaged_perceptron_tagger',
    'maxent_ne_chunker',
    'words'
]

failed = []
for resource in required_resources:
    try:
        # nltk.download() reports most failures by returning False, not by raising
        if not nltk.download(resource, quiet=True):
            failed.append(resource)
    except Exception as e:
        print(f"Warning: Could not download {resource}: {e}")
        failed.append(resource)

if failed:
    print(f"Could not download: {', '.join(failed)}")
    print("If you encounter issues, try running nltk.download('all')")
else:
    print("NLTK data downloaded successfully!")

# Load spaCy model
try:
    nlp = spacy.load("en_core_web_sm")
    print("spaCy model loaded successfully!")
except OSError:
    print("spaCy model not found. Please run: python -m spacy download en_core_web_sm")


NLTK data downloaded successfully!
spaCy model loaded successfully!


## Sample Text

We'll use this sample text throughout the notebook to demonstrate various preprocessing techniques:


In [3]:
# Sample text for demonstration
sample_text = """
Natural Language Processing (NLP) is a branch of artificial intelligence that helps computers understand, 
interpret and manipulate human language. NLP draws from many disciplines, including computer science and 
computational linguistics, in its pursuit to fill the gap between human communication and computer understanding.

Companies like Google, Microsoft, and OpenAI are leading the development of NLP technologies. 
These technologies are being used in chatbots, translation services, and virtual assistants.

Dr. Sarah Johnson from Stanford University published a paper on transformer models in 2023. 
The research was conducted in California and involved collaboration with researchers from New York.
"""
print("Sample text loaded!")


Sample text loaded!


# 1. Tokenization

Tokenization is the process of breaking down text into smaller units (tokens) such as words, sentences, or subwords. It's the first step in most NLP pipelines.
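
Word and sentence tokenization are demonstrated in the subsections below; subword tokenization is not. As a rough illustration only, the sketch below splits a single word by greedy longest-match against a vocabulary, in the style of WordPiece. The function `subword_tokenize`, the `##` continuation prefix, the `[UNK]` fallback, and `TOY_VOCAB` are all assumptions made up for this example; real subword tokenizers learn their vocabulary from data.

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate from the right until it is in the vocabulary
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark pieces that continue a word
            if piece in vocab:
                break
            end -= 1
        if end == start:
            # No piece matched at this position: give up on the whole word
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary, chosen only for this demonstration
TOY_VOCAB = {"token", "##ization", "##ize", "##s", "un", "##known"}

for word in ["tokenization", "tokens", "unknown", "xyz"]:
    print(f"{word:<15} {subword_tokenize(word, TOY_VOCAB)}")
```

The point of the sketch is that a word missing from the vocabulary ("tokenization") can still be represented by known pieces, while a word with no matching pieces falls back to a single unknown marker.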

## 1.1 Word Tokenization

Word tokenization splits text into individual words.


In [6]:
# Using NLTK's word_tokenize
text = "Natural Language Processing is amazing! Let's learn about it."
words_nltk = word_tokenize(text)
print("NLTK Word Tokenization:")
print(words_nltk)
print(f"\nTotal tokens: {len(words_nltk)}")


NLTK Word Tokenization:
['Natural', 'Language', 'Processing', 'is', 'amazing', '!', 'Let', "'s", 'learn', 'about', 'it', '.']

Total tokens: 12


In [7]:
# Using spaCy
doc = nlp(text)
words_spacy = [token.text for token in doc]
print("spaCy Word Tokenization:")
print(words_spacy)
print(f"\nTotal tokens: {len(words_spacy)}")


spaCy Word Tokenization:
['Natural', 'Language', 'Processing', 'is', 'amazing', '!', 'Let', "'s", 'learn', 'about', 'it', '.']

Total tokens: 12


In [8]:
# Using simple regex (for comparison)
words_regex = re.findall(r'\b\w+\b', text)
print("Regex Word Tokenization:")
print(words_regex)
print(f"\nTotal tokens: {len(words_regex)}")


Regex Word Tokenization:
['Natural', 'Language', 'Processing', 'is', 'amazing', 'Let', 's', 'learn', 'about', 'it']

Total tokens: 10
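
The three outputs differ mainly in punctuation and contractions: NLTK and spaCy keep `!` and `.` as tokens and split "Let's" into `Let` + `'s`, while `\b\w+\b` drops punctuation and leaves a stray `s`. If a regex tokenizer is all you need, a slightly richer pattern can keep punctuation and hold contractions together. The pattern below is one possible choice for illustration, not what either library does internally.

```python
import re

text = "Natural Language Processing is amazing! Let's learn about it."

# Either a word with an optional apostrophe suffix ("Let's"),
# or any single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

tokens = TOKEN_PATTERN.findall(text)
print(tokens)
print(f"Total tokens: {len(tokens)}")
```

Whether "Let's" should be one token or two depends on the downstream task; the library tokenizers split it so that `'s` can be tagged and lemmatized separately.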


## 1.2 Sentence Tokenization

Sentence tokenization splits text into sentences.


In [9]:
# Using NLTK's sent_tokenize
sentences_nltk = sent_tokenize(sample_text)
print("NLTK Sentence Tokenization:")
for i, sent in enumerate(sentences_nltk, 1):
    print(f"\nSentence {i}: {sent[:80]}...")
print(f"\nTotal sentences: {len(sentences_nltk)}")


NLTK Sentence Tokenization:

Sentence 1: 
Natural Language Processing (NLP) is a branch of artificial intelligence that h...

Sentence 2: NLP draws from many disciplines, including computer science and 
computational l...

Sentence 3: Companies like Google, Microsoft, and OpenAI are leading the development of NLP ...

Sentence 4: These technologies are being used in chatbots, translation services, and virtual...

Sentence 5: Dr. Sarah Johnson from Stanford University published a paper on transformer mode...

Sentence 6: The research was conducted in California and involved collaboration with researc...

Total sentences: 6


In [10]:
# Using spaCy
doc = nlp(sample_text)
sentences_spacy = [sent.text for sent in doc.sents]
print("spaCy Sentence Tokenization:")
for i, sent in enumerate(sentences_spacy, 1):
    print(f"\nSentence {i}: {sent[:80]}...")
print(f"\nTotal sentences: {len(sentences_spacy)}")


spaCy Sentence Tokenization:

Sentence 1: 
Natural Language Processing (NLP) is a branch of artificial intelligence that h...

Sentence 2: NLP draws from many disciplines, including computer science and 
computational l...

Sentence 3: Companies like Google, Microsoft, and OpenAI are leading the development of NLP ...

Sentence 4: These technologies are being used in chatbots, translation services, and virtual...

Sentence 5: Dr. Sarah Johnson from Stanford University published a paper on transformer mode...

Sentence 6: The research was conducted in California and involved collaboration with researc...

Total sentences: 6
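
Both libraries keep "Dr. Sarah Johnson ..." together as one sentence even though "Dr." ends in a period. A naive splitter shows why that is not trivial. The text below is a shortened variant of the sample sentences, and the splitting rule is a deliberately simplistic assumption for comparison.

```python
import re

text = "Dr. Sarah Johnson published a paper in 2023. The research was conducted in California."

# Naive rule: split on whitespace that follows '.', '!' or '?'
naive_sentences = re.split(r"(?<=[.!?])\s+", text)

for i, sent in enumerate(naive_sentences, 1):
    print(f"Sentence {i}: {sent}")
print(f"Total sentences: {len(naive_sentences)}")
```

The naive rule yields three pieces because it breaks after the abbreviation "Dr.", which is the kind of case that trained or rule-rich sentence tokenizers are built to handle.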


In [11]:
# spaCy token attributes
text_example = "I'm learning NLP! It's amazing."
doc = nlp(text_example)

print("Token Analysis:")
print(f"{'Token':<15} {'Text':<15} {'Lemma':<15} {'POS':<10} {'Is Alpha':<10}")
print("-" * 70)
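for token in doc:
    # str(token) and token.text are the same string; a bool formatted with a
    # width spec (here :<10) prints as 1/0 rather than True/False
    print(f"{str(token):<15} {token.text:<15} {token.lemma_:<15} {token.pos_:<10} {token.is_alpha:<10}")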


Token Analysis:
Token           Text            Lemma           POS        Is Alpha  
----------------------------------------------------------------------
I               I               I               PRON       1         
'm              'm              be              AUX        0         
learning        learning        learn           VERB       1         
NLP             NLP             NLP             PROPN      1         
!               !               !               PUNCT      0         
It              It              it              PRON       1         
's              's              be              AUX        0         
amazing         amazing         amazing         ADJ        1         
.               .               .               PUNCT      0         


# 2. Stemming

Stemming reduces words to their root form by removing suffixes. It's a rule-based approach that may not always produce valid words.

**Example**: "running" → "run", "happiness" → "happi"
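
To make "rule-based suffix removal" concrete, here is a minimal toy stemmer. It is not the Porter algorithm: the rule list `SUFFIX_RULES`, the minimum stem length, and the doubled-consonant cleanup are all simplifying assumptions chosen so that a few familiar words behave as expected.

```python
# Hypothetical (suffix, replacement) rules, tried in order; first match wins
SUFFIX_RULES = [("iness", "i"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]


def toy_stem(word):
    """Strip one known suffix, then collapse a doubled final consonant."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Only strip when at least 3 characters would remain before the suffix
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: len(word) - len(suffix)] + replacement
            # "runn" -> "run"; leave vowels and l/s/z endings alone
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word  # no rule applied


for word in ["running", "runs", "studies", "studied", "happiness", "flew"]:
    print(f"{word:<15} {toy_stem(word):<15}")
```

Like a real stemmer, the toy version produces non-words such as "studi" and cannot relate irregular forms such as "flew" to "fly", because it only looks at the surface spelling.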

## 2.1 Porter Stemmer

The Porter Stemmer is one of the most common stemming algorithms.


In [12]:
# Initialize Porter Stemmer
porter = PorterStemmer()

# Example words
words = ["running", "runs", "ran", "happier", "happiest", "happiness", 
         "studies", "studying", "studied", "flies", "flying", "flew"]

print("Porter Stemmer Results:")
print(f"{'Original':<15} {'Stemmed':<15}")
print("-" * 30)
for word in words:
    stemmed = porter.stem(word)
    print(f"{word:<15} {stemmed:<15}")


Porter Stemmer Results:
Original        Stemmed        
------------------------------
running         run            
runs            run            
ran             ran            
happier         happier        
happiest        happiest       
happiness       happi          
studies         studi          
studying        studi          
studied         studi          
flies           fli            
flying          fli            
flew            flew           


In [13]:
# Stemming on sample text
words = word_tokenize(sample_text)
stemmed_words = [porter.stem(word) for word in words if word.isalpha()]

print("Original words (first 20):")
print(words[:20])
print("\nStemmed words (first 20):")
print(stemmed_words[:20])


Original words (first 20):
['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'branch', 'of', 'artificial', 'intelligence', 'that', 'helps', 'computers', 'understand', ',', 'interpret', 'and', 'manipulate']

Stemmed words (first 20):
['natur', 'languag', 'process', 'nlp', 'is', 'a', 'branch', 'of', 'artifici', 'intellig', 'that', 'help', 'comput', 'understand', 'interpret', 'and', 'manipul', 'human', 'languag', 'nlp']


## 2.2 Snowball Stemmer

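The Snowball Stemmer supports multiple languages. Its English stemmer, also known as Porter2, is a refinement of the original Porter algorithm; on the sample words below it produces the same stems as Porter.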


In [14]:
# Initialize Snowball Stemmer (for English)
snowball = SnowballStemmer(language='english')

words = ["running", "runs", "ran", "happier", "happiest", "happiness", 
         "studies", "studying", "studied"]

print("Snowball Stemmer Results:")
print(f"{'Original':<15} {'Stemmed':<15}")
print("-" * 30)
for word in words:
    stemmed = snowball.stem(word)
    print(f"{word:<15} {stemmed:<15}")


Snowball Stemmer Results:
Original        Stemmed        
------------------------------
running         run            
runs            run            
ran             ran            
happier         happier        
happiest        happiest       
happiness       happi          
studies         studi          
studying        studi          
studied         studi          


# 3. Lemmatization

Lemmatization converts words to their base or dictionary form (lemma). Unlike stemming, lemmatization considers the context and part of speech, producing valid words.

**Example**: "running" → "run", "better" → "good" (as an adjective), "was" → "be"
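
Conceptually, a lemmatizer is a dictionary lookup keyed by both the word and its part of speech. The sketch below uses a hand-written table to show why the POS matters. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names for this illustration; real lemmatizers rely on a full lexical database plus morphological rules.

```python
# Hypothetical mini-dictionary: (word, POS) -> lemma
TOY_LEMMAS = {
    ("was", "v"): "be",
    ("ran", "v"): "run",
    ("mice", "n"): "mouse",
    ("better", "a"): "good",  # adjective: good, better, best
    ("better", "r"): "well",  # adverb: well, better, best
}


def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for (word, pos), or the word itself if unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


for word, pos in [("better", "a"), ("better", "r"), ("was", "v"), ("was", "n")]:
    print(f"{word:<10} {pos:<5} {toy_lemmatize(word, pos)}")
```

The same surface form maps to different lemmas under different POS tags, and a lookup with the wrong POS (here "was" as a noun) simply finds nothing and returns the word unchanged.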

## 3.1 NLTK WordNet Lemmatizer

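The WordNetLemmatizer looks words up in the WordNet database. It does not infer the part of speech on its own: unless you pass `pos`, every word is treated as a noun, which is why "running" is left unchanged and "was" becomes "wa" in the first table below.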


In [17]:
# Initialize WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Example words
words = ["running", "runs", "ran", "happier", "happiest", "happiness", 
         "studies", "studying", "studied", "was", "better", "mice"]

print("NLTK WordNet Lemmatizer Results:")
print(f"{'Original':<15} {'Lemmatized':<15}")
print("-" * 30)
for word in words:
    # For better results, specify POS tag (verb='v', noun='n', adjective='a', adverb='r')
    # Default assumes noun
    lemmatized = lemmatizer.lemmatize(word)
    print(f"{word:<15} {lemmatized:<15}")

print("\n" + "="*50)
print("With POS tags (more accurate):")
print(f"{'Original':<15} {'POS':<10} {'Lemmatized':<15}")
print("-" * 40)
test_words = [("running", "v"), ("better", "a"), ("studies", "n"), ("was", "v")]
for word, pos in test_words:
    lemmatized = lemmatizer.lemmatize(word, pos=pos)
    print(f"{word:<15} {pos:<10} {lemmatized:<15}")


NLTK WordNet Lemmatizer Results:
Original        Lemmatized     
------------------------------
running         running        
runs            run            
ran             ran            
happier         happier        
happiest        happiest       
happiness       happiness      
studies         study          
studying        studying       
studied         studied        
was             wa             
better          better         
mice            mouse          

With POS tags (more accurate):
Original        POS        Lemmatized     
----------------------------------------
running         v          run            
better          a          good           
studies         n          study          
was             v          be             
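
Writing the POS letters by hand does not scale. A common pattern is to take Penn Treebank tags (the kind `pos_tag` returns) and map them to the four letters WordNet understands. The helper below is a minimal sketch of that mapping; the name `treebank_to_wordnet` is an assumption for this example, not an NLTK function.

```python
def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjectives: JJ, JJR, JJS
    if tag.startswith("V"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # nouns and everything else use the lemmatizer's default


for tag in ["VBG", "JJR", "NNS", "RBR", "DT"]:
    print(f"{tag:<5} -> {treebank_to_wordnet(tag)}")
```

In this notebook it could be combined with the objects defined earlier, for example `lemmatizer.lemmatize(word, pos=treebank_to_wordnet(tag))` for each `(word, tag)` pair returned by `pos_tag(tokens)`.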


In [24]:
# Using spaCy for lemmatization
text = "I was running faster and studying harder. The studies were better."
doc = nlp(text)

print("spaCy Lemmatization:")
print(f"{'Text':<15} {'Lemma':<15} {'POS':<10}")
print("-" * 40)
for token in doc:
    if not token.is_punct and not token.is_space:
        print(f"{token.text:<15} {token.lemma_:<15} {token.pos_:<10}")


spaCy Lemmatization:
Text            Lemma           POS       
----------------------------------------
I               I               PRON      
was             be              AUX       
running         run             VERB      
faster          fast            ADV       
and             and             CCONJ     
studying        study           VERB      
harder          hard            ADV       
The             the             DET       
studies         study           NOUN      
were            be              AUX       
better          well            ADJ       


In [26]:
# Compare Stemming vs Lemmatization
# Initialize WordNetLemmatizer (if not already done)
if 'lemmatizer' not in globals():
    lemmatizer = WordNetLemmatizer()

comparison_words = ["running", "happier", "studies", "was", "better"]

print("Stemming vs Lemmatization Comparison:")
print(f"{'Word':<15} {'Stemmed':<15} {'Lemmatized':<15}")
print("-" * 45)
for word in comparison_words:
    stemmed = porter.stem(word)
    # No POS is passed here, so the lemmatizer treats every word as a noun
    lemmatized = lemmatizer.lemmatize(word)
    print(f"{word:<15} {stemmed:<15} {lemmatized:<15}")


Stemming vs Lemmatization Comparison:
Word            Stemmed         Lemmatized     
---------------------------------------------
running         run             running        
happier         happier         happier        
studies         studi           study          
was             wa              wa             
better          better          better         


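# 4. Part-of-Speech (POS) Tagging

POS tagging assigns a grammatical category (noun, verb, adjective, etc.) to each token. NLTK reports Penn Treebank tags such as `NNP` and `VBZ`; spaCy exposes both a coarse universal tag (`pos_`) and a fine-grained tag (`tag_`).


In [28]: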
# Using NLTK POS tagging
nltk.download('averaged_perceptron_tagger_eng')

text = "Natural Language Processing is amazing and helps computers understand text."
tokens = word_tokenize(text)

try:
    pos_tags = pos_tag(tokens)
    print("NLTK POS Tagging:")
    print(f"{'Word':<20} {'POS Tag':<10}")
    print("-" * 30)
    for word, tag in pos_tags:
        print(f"{word:<20} {tag:<10}")
except LookupError as e:
    print(f"Error: {e}")
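    print("\nPOS tagger not found. Downloading...")
    # Older NLTK releases use 'averaged_perceptron_tagger'; newer ones look for the '_eng' variant
    for resource in ('averaged_perceptron_tagger', 'averaged_perceptron_tagger_eng'):
        nltk.download(resource, quiet=False)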
    print("\nPlease re-run this cell after downloading.")


[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/mx98/nltk_data...


NLTK POS Tagging:
Word                 POS Tag   
------------------------------
Natural              JJ        
Language             NNP       
Processing           NNP       
is                   VBZ       
amazing              JJ        
and                  CC        
helps                VBZ       
computers            NNS       
understand           JJ        
text                 NN        
.                    .         


[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


In [29]:
# Using spaCy POS tagging
doc = nlp(text)

print("spaCy POS Tagging:")
print(f"{'Word':<20} {'POS':<10} {'Tag':<10} {'Description':<30}")
print("-" * 70)
for token in doc:
    if not token.is_punct and not token.is_space:
        print(f"{token.text:<20} {token.pos_:<10} {token.tag_:<10} {spacy.explain(token.pos_):<30}")


spaCy POS Tagging:
Word                 POS        Tag        Description                   
----------------------------------------------------------------------
Natural              PROPN      NNP        proper noun                   
Language             PROPN      NNP        proper noun                   
Processing           NOUN       NN         noun                          
is                   AUX        VBZ        auxiliary                     
amazing              ADJ        JJ         adjective                     
and                  CCONJ      CC         coordinating conjunction      
helps                VERB       VBZ        verb                          
computers            NOUN       NNS        noun                          
understand           VERB       VB         verb                          
text                 NOUN       NN         noun                          
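
Once tokens are tagged, a typical next step is to filter or count by tag. The sketch below works on the (word, POS) pairs copied from the spaCy output above, so it runs without loading a model. The choice of which tags count as "content words" (`CONTENT_POS`) is an assumption for illustration.

```python
from collections import Counter

# (word, coarse POS) pairs copied from the spaCy output above
tagged = [
    ("Natural", "PROPN"), ("Language", "PROPN"), ("Processing", "NOUN"),
    ("is", "AUX"), ("amazing", "ADJ"), ("and", "CCONJ"), ("helps", "VERB"),
    ("computers", "NOUN"), ("understand", "VERB"), ("text", "NOUN"),
]

# Illustrative choice of tags that usually carry the meaning of a sentence
CONTENT_POS = {"NOUN", "PROPN", "VERB", "ADJ"}

content_words = [word for word, pos in tagged if pos in CONTENT_POS]
pos_counts = Counter(pos for _, pos in tagged)

print("Content words:", content_words)
print("Tag counts:", dict(pos_counts))
```

With a live `doc`, the list could be built as `[(token.text, token.pos_) for token in doc]` instead of being typed out.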


# 5. Named Entity Recognition (NER)

NER identifies and classifies named entities in text (people, organizations, locations, dates, etc.).


In [30]:
# Using spaCy for NER
doc = nlp(sample_text)

print("spaCy Named Entity Recognition:")
print(f"{'Entity':<25} {'Label':<15} {'Description':<30}")
print("-" * 70)
for ent in doc.ents:
    print(f"{ent.text:<25} {ent.label_:<15} {spacy.explain(ent.label_):<30}")


spaCy Named Entity Recognition:
Entity                    Label           Description                   
----------------------------------------------------------------------
NLP                       ORG             Companies, agencies, institutions, etc.
NLP                       ORG             Companies, agencies, institutions, etc.
Google                    ORG             Companies, agencies, institutions, etc.
Microsoft                 ORG             Companies, agencies, institutions, etc.
OpenAI                    GPE             Countries, cities, states     
NLP                       ORG             Companies, agencies, institutions, etc.
Sarah Johnson             PERSON          People, including fictional   
Stanford University       ORG             Companies, agencies, institutions, etc.
2023                      DATE            Absolute or relative dates or periods
California                GPE             Countries, cities, states     
New York                  GPE             Countries, cities, states     


In [32]:
# Visualize NER with spaCy's displaCy (if available)
# This creates an HTML visualization
from spacy import displacy

# Create a visualization of entities
try:
    # Try to render in Jupyter notebook
    html = displacy.render(doc, style="ent", jupyter=True)
except ImportError:
    # If IPython display is not available, render as HTML string
    try:
        html = displacy.render(doc, style="ent", jupyter=False)
        print("Entity visualization (HTML):")
        print("Note: To view the visualization, save the HTML to a file or use a web browser.")
        print(f"\nHTML preview (first 500 chars):\n{html[:500]}...")
    except Exception as e:
        print(f"Could not generate visualization: {e}")
        print("\nYou can still see the entities in the previous cell output.")
except Exception as e:
    print(f"Error generating visualization: {e}")
    print("\nYou can still see the entities in the previous cell output.")


Entity visualization (HTML):
Note: To view the visualization, save the HTML to a file or use a web browser.

HTML preview (first 500 chars):
<div class="entities" style="line-height: 2.5; direction: ltr"><br>Natural Language Processing (
<mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">
    NLP
    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">ORG</span>
</mark>
) is a branch of artificial intelligence that helps computers understand, <br>interpret and manipulate huma...


In [33]:
# Extract entities by type
print("Entities by Type:")
entities_by_type = {}
for ent in doc.ents:
    if ent.label_ not in entities_by_type:
        entities_by_type[ent.label_] = []
    entities_by_type[ent.label_].append(ent.text)

for label, entities in entities_by_type.items():
    print(f"\n{label} ({spacy.explain(label)}):")
    print(f"  {', '.join(set(entities))}")


Entities by Type:

ORG (Companies, agencies, institutions, etc.):
  NLP, Microsoft, Google, Stanford University

GPE (Countries, cities, states):
  California, New York, OpenAI

PERSON (People, including fictional):
  Sarah Johnson

DATE (Absolute or relative dates or periods):
  2023
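
The grouping cell joins `set(entities)`, and the iteration order of a set of strings is not guaranteed to be the same from run to run. A deterministic variant, sketched below, collects each label's entities in a set and sorts on output. The pairs are copied from the spaCy NER output earlier in this section exactly as the model produced them, including questionable labels such as `OpenAI` as GPE and `NLP` as ORG.

```python
from collections import defaultdict

# (text, label) pairs copied from the spaCy NER output earlier in this section
entities = [
    ("NLP", "ORG"), ("NLP", "ORG"), ("Google", "ORG"), ("Microsoft", "ORG"),
    ("OpenAI", "GPE"), ("NLP", "ORG"), ("Sarah Johnson", "PERSON"),
    ("Stanford University", "ORG"), ("2023", "DATE"),
    ("California", "GPE"), ("New York", "GPE"),
]

# A set per label removes duplicates such as the repeated "NLP"
grouped = defaultdict(set)
for text, label in entities:
    grouped[label].add(text)

# Sorting labels and entities makes the printout reproducible
for label in sorted(grouped):
    print(f"{label}: {', '.join(sorted(grouped[label]))}")
```

With a live `doc`, the input list could be built as `[(ent.text, ent.label_) for ent in doc.ents]`.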


## Summary

This notebook covered:
- ✅ **Tokenization**: Word and sentence tokenization using NLTK and spaCy
- ✅ **Stemming**: Porter and Snowball stemmers for reducing words to root forms
- ✅ **Lemmatization**: Converting words to dictionary forms with POS awareness
- ✅ **POS Tagging**: Identifying grammatical roles of words
- ✅ **NER**: Extracting named entities (people, organizations, locations, etc.)

### When to Use What?

- **Stemming**: Fast, good for search engines, information retrieval
- **Lemmatization**: Better for tasks requiring valid words, more accurate but slower
- **POS Tagging**: Needed for syntax analysis, dependency parsing, better lemmatization
- **NER**: Extract structured information, build knowledge graphs, information extraction
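
In practice these choices plug into one pipeline: tokenize, drop stop words, then normalize each token with whichever method suits the task. The sketch below keeps the normalizer as a parameter so a stemmer or lemmatizer can be swapped in. `preprocess`, `strip_plural`, and the tiny `STOP_WORDS` set are hypothetical stand-ins written for this example; NLTK's stop-word corpus is much larger.

```python
import re

# Tiny illustrative stop-word list (an assumption, not NLTK's list)
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def preprocess(text, normalize=lambda word: word):
    """Lowercase, tokenize on letters, drop stop words, then normalize each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [normalize(token) for token in tokens if token not in STOP_WORDS]


def strip_plural(word):
    """Crude stand-in for a stemmer: drop a trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


sentence = "The researchers are studying transformer models in California."
print(preprocess(sentence))
print(preprocess(sentence, normalize=strip_plural))
```

In this notebook, `porter.stem` or `lemmatizer.lemmatize` could be passed as `normalize` in place of `strip_plural`.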
