# Text Preprocessing

This notebook covers essential text preprocessing techniques used in Natural Language Processing:
- **Tokenization**: Breaking text into words, sentences, or subwords
- **Stemming**: Reducing words to their root form
- **Lemmatization**: Converting words to their base/dictionary form
- **Named Entity Recognition (NER)**: Identifying entities like people, organizations, locations
- **Part-of-Speech (POS) Tagging**: Identifying grammatical roles of words

## Learning Objectives

- Understand different tokenization methods and when to use them
- Apply stemming and lemmatization for text normalization
- Use NER to extract structured information from text
- Perform POS tagging for grammatical analysis
- Compare and choose appropriate preprocessing techniques

## Setup

First, we need to install required libraries and download necessary data:
- `nltk`: Natural Language Toolkit
- `spacy`: Industrial-strength NLP library


## Installation

Run this cell to install required packages (uncomment if needed):


In [None]:
# Install packages (uncomment if needed)
# !pip install nltk spacy
# !python -m spacy download en_core_web_sm


In [None]:
# Import libraries
import nltk
import spacy
from nltk.tokenize import word_tokenize, sent_tokenize, wordpunct_tokenize
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, ne_chunk
from nltk.corpus import stopwords
import re

# Download required NLTK data (run once)
# nltk.download() returns False on failure instead of raising, so check its result.
# Note: newer NLTK releases may also ask for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng'; add them here if a resource lookup fails.
resources = ['punkt', 'stopwords', 'wordnet',
             'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words']
failed = [name for name in resources if not nltk.download(name, quiet=True)]
if failed:
    print(f"Could not download: {', '.join(failed)}")
else:
    print("NLTK data is ready!")

# Load spaCy model
try:
    nlp = spacy.load("en_core_web_sm")
    print("spaCy model loaded successfully!")
except OSError:
    print("spaCy model not found. Please run: python -m spacy download en_core_web_sm")


## Sample Text

We'll use this sample text throughout the notebook to demonstrate various preprocessing techniques:


In [None]:
# Sample text for demonstration
sample_text = """
Natural Language Processing (NLP) is a branch of artificial intelligence that helps computers understand, 
interpret and manipulate human language. NLP draws from many disciplines, including computer science and 
computational linguistics, in its pursuit to fill the gap between human communication and computer understanding.

Companies like Google, Microsoft, and OpenAI are leading the development of NLP technologies. 
These technologies are being used in chatbots, translation services, and virtual assistants.

Dr. Sarah Johnson from Stanford University published a paper on transformer models in 2023. 
The research was conducted in California and involved collaboration with researchers from New York.
"""
print("Sample text loaded!")


# 1. Tokenization

Tokenization is the process of breaking down text into smaller units (tokens) such as words, sentences, or subwords. It's the first step in most NLP pipelines.
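
Word and sentence tokenization are demonstrated in the subsections that follow. For the subword case, here is a minimal sketch of greedy longest-match splitting in the WordPiece style. The function `subword_tokenize` and the tiny vocabulary are hypothetical and only for illustration; real subword tokenizers learn their vocabulary from a corpus.

```python
def subword_tokenize(word, vocab, unk="[UNK]"):
    """Greedily split `word` into the longest pieces found in `vocab`.

    Pieces that continue a word carry a leading '##' (WordPiece convention).
    Returns [unk] if the word cannot be fully covered by the vocabulary.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##ization", "##ize", "##s"}

print(subword_tokenize("tokenization", toy_vocab))
print(subword_tokenize("tokenizes", toy_vocab))
print(subword_tokenize("xyz", toy_vocab))
```

Because unseen words are broken into known pieces, a subword vocabulary can stay small while still covering rare words; only words with no matching pieces fall back to the unknown token.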

## 1.1 Word Tokenization

Word tokenization splits text into individual words.


In [None]:
# Using NLTK's word_tokenize
text = "Natural Language Processing is amazing! Let's learn about it."
words_nltk = word_tokenize(text)
print("NLTK Word Tokenization:")
print(words_nltk)
print(f"\nTotal tokens: {len(words_nltk)}")


In [None]:
# Using spaCy
doc = nlp(text)
words_spacy = [token.text for token in doc]
print("spaCy Word Tokenization:")
print(words_spacy)
print(f"\nTotal tokens: {len(words_spacy)}")


In [None]:
# Using simple regex (for comparison)
words_regex = re.findall(r'\b\w+\b', text)
print("Regex Word Tokenization:")
print(words_regex)
print(f"\nTotal tokens: {len(words_regex)}")
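
The three tokenizers above do not agree on punctuation and contractions. As a rough sketch of why, the cell below compares the `\b\w+\b` pattern with a slightly richer pattern that keeps punctuation marks as tokens and leaves simple contractions intact. The second pattern is an illustrative assumption, not the rule set used by NLTK or spaCy.

```python
import re

text = "Natural Language Processing is amazing! Let's learn about it."

# \b\w+\b keeps only runs of word characters: punctuation is dropped
# and the apostrophe splits "Let's" into "Let" and "s".
simple = re.findall(r"\b\w+\b", text)

# Illustrative alternative: a word with an optional apostrophe suffix,
# or any single character that is neither a word character nor whitespace.
PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")
richer = PATTERN.findall(text)

print(len(simple), simple)
print(len(richer), richer)
```

Neither output is the single correct tokenization; the right choice depends on whether downstream steps need punctuation and how they expect contractions to be represented.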


## 1.2 Sentence Tokenization

Sentence tokenization splits text into sentences.


In [None]:
# Using NLTK's sent_tokenize
sentences_nltk = sent_tokenize(sample_text)
print("NLTK Sentence Tokenization:")
for i, sent in enumerate(sentences_nltk, 1):
    print(f"\nSentence {i}: {sent[:80]}...")
print(f"\nTotal sentences: {len(sentences_nltk)}")


In [None]:
# Using spaCy
doc = nlp(sample_text)
sentences_spacy = [sent.text for sent in doc.sents]
print("spaCy Sentence Tokenization:")
for i, sent in enumerate(sentences_spacy, 1):
    print(f"\nSentence {i}: {sent[:80]}...")
print(f"\nTotal sentences: {len(sentences_spacy)}")
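
Sentence splitting is harder than it looks because a period does not always end a sentence. The sketch below, using two sentences from the sample text, shows a naive punctuation rule breaking on "Dr." and a small fix based on an abbreviation list. The `ABBREVIATIONS` set and `split_sentences` helper are hypothetical; trained sentence tokenizers handle far more cases.

```python
import re

text = ("Dr. Sarah Johnson from Stanford University published a paper on "
        "transformer models in 2023. The research was conducted in California "
        "and involved collaboration with researchers from New York.")

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
naive = re.split(r"(?<=[.!?])\s+", text)

# Hypothetical, deliberately tiny abbreviation list.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "Prof.", "e.g.", "i.e."}


def split_sentences(text):
    """Split on sentence-final punctuation, but not after known abbreviations."""
    sentences, current = [], []
    for chunk in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = chunk.split()
        if not words:
            continue
        current.append(chunk)
        # Close the sentence only if its last word is not an abbreviation.
        if words[-1] not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


better = split_sentences(text)
print(len(naive), naive[0])
print(len(better), better[0])
```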



In [None]:
# spaCy token attributes
text_example = "I'm learning NLP! It's amazing."
doc = nlp(text_example)

print("Token Analysis:")
print(f"{'Text':<15} {'Lemma':<15} {'POS':<10} {'Is Alpha':<10}")
print("-" * 55)
for token in doc:
    # str() is needed here: a bool formatted with a width spec prints as 1 or 0
    print(f"{token.text:<15} {token.lemma_:<15} {token.pos_:<10} {str(token.is_alpha):<10}")


# 2. Stemming

Stemming reduces words to their root form by removing suffixes. It's a rule-based approach that may not always produce valid words.

**Example**: "running" → "run", "happiness" → "happi"
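
To make the rule-based idea concrete, here is a toy suffix stripper. It is a deliberately simplified sketch, not the Porter algorithm: the rule list `TOY_RULES` and the `toy_stem` function are hypothetical and exist only to show how ordered suffix rules can yield stems that are not real words.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
# Toy rule set for illustration only, NOT the Porter algorithm.
TOY_RULES = [
    ("ies", "i"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least `min_stem_len` letters."""
    word = word.lower()
    for suffix, replacement in TOY_RULES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem + replacement
    return word


for w in ["studies", "running", "played", "cats", "ran", "bus"]:
    print(f"{w:<10} -> {toy_stem(w)}")
```

Two limitations are visible even in this small example: "running" becomes the non-word "runn", and the irregular form "ran" is left untouched because no suffix rule can relate it to "run".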

## 2.1 Porter Stemmer

The Porter Stemmer is one of the most common stemming algorithms.


In [None]:
# Initialize Porter Stemmer
porter = PorterStemmer()

# Example words
words = ["running", "runs", "ran", "happier", "happiest", "happiness", 
         "studies", "studying", "studied", "flies", "flying", "flew"]

print("Porter Stemmer Results:")
print(f"{'Original':<15} {'Stemmed':<15}")
print("-" * 30)
for word in words:
    stemmed = porter.stem(word)
    print(f"{word:<15} {stemmed:<15}")


In [None]:
# Stemming on sample text
# Keep only alphabetic tokens so the two lists line up word for word
words = [word for word in word_tokenize(sample_text) if word.isalpha()]
stemmed_words = [porter.stem(word) for word in words]

print("Original words (first 20):")
print(words[:20])
print("\nStemmed words (first 20):")
print(stemmed_words[:20])


## 2.2 Snowball Stemmer

The Snowball stemmer for English (also known as Porter2) is a revised version of the Porter algorithm. The Snowball family also includes stemmers for many other languages.


In [None]:
# Initialize Snowball Stemmer (for English)
snowball = SnowballStemmer(language='english')

words = ["running", "runs", "ran", "happier", "happiest", "happiness", 
         "studies", "studying", "studied"]

print("Snowball Stemmer Results:")
print(f"{'Original':<15} {'Stemmed':<15}")
print("-" * 30)
for word in words:
    stemmed = snowball.stem(word)
    print(f"{word:<15} {stemmed:<15}")


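# 3. Lemmatization

Lemmatization converts each word to its base/dictionary form (lemma). Unlike stemming, it relies on vocabulary and part-of-speech information, so the result is a valid word.

In [None]: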
# Using spaCy for lemmatization
text = "I was running faster and studying harder. The studies were better."
doc = nlp(text)

print("spaCy Lemmatization:")
print(f"{'Text':<15} {'Lemma':<15} {'POS':<10}")
print("-" * 40)
for token in doc:
    if not token.is_punct and not token.is_space:
        print(f"{token.text:<15} {token.lemma_:<15} {token.pos_:<10}")


## 3.3 Stemming vs Lemmatization Comparison


In [None]:
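# Compare Stemming vs Lemmatization
# WordNetLemmatizer treats every word as a noun unless a POS is given,
# so each word is paired with a WordNet POS letter (v=verb, a=adjective, n=noun).
lemmatizer = WordNetLemmatizer()
comparison_words = [("running", "v"), ("happier", "a"), ("studies", "n"),
                    ("was", "v"), ("better", "a")]

print("Stemming vs Lemmatization Comparison:")
print(f"{'Word':<15} {'Stemmed':<15} {'Lemmatized':<15}")
print("-" * 45)
for word, pos in comparison_words:
    stemmed = porter.stem(word)
    lemmatized = lemmatizer.lemmatize(word, pos=pos)
    print(f"{word:<15} {stemmed:<15} {lemmatized:<15}")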
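
Lemmatization depends on part of speech in a way that stemming does not: the same surface form can have different lemmas depending on its grammatical role. The sketch below illustrates this with a hand-written lookup table. `TOY_LEXICON` and `toy_lemmatize` are hypothetical; real lemmatizers combine large dictionaries with morphological rules.

```python
# Hypothetical mini lexicon mapping (surface form, POS) -> lemma.
TOY_LEXICON = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
    ("better", "ADJ"): "good",
    ("studies", "NOUN"): "study",
    ("studies", "VERB"): "study",
}


def toy_lemmatize(word, pos):
    """Return the dictionary form for (word, pos), or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return TOY_LEXICON.get(key, word.lower())


for word, pos in [("leaves", "NOUN"), ("leaves", "VERB"),
                  ("Saw", "VERB"), ("saw", "NOUN"), ("table", "NOUN")]:
    print(f"{word:<10} {pos:<6} -> {toy_lemmatize(word, pos)}")
```

A stemmer has no way to make this distinction, since it sees only the characters of the word and not its role in the sentence.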


# 4. Part-of-Speech (POS) Tagging

POS tagging assigns grammatical categories (noun, verb, adjective, etc.) to each word in a sentence.


In [None]:
# Using NLTK POS tagging
text = "Natural Language Processing is amazing and helps computers understand text."
tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)

print("NLTK POS Tagging:")
print(f"{'Word':<20} {'POS Tag':<10}")
print("-" * 30)
for word, tag in pos_tags:
    print(f"{word:<20} {tag:<10}")


In [None]:
# Using spaCy POS tagging
doc = nlp(text)

print("spaCy POS Tagging:")
print(f"{'Word':<20} {'POS':<10} {'Tag':<10} {'Description':<30}")
print("-" * 70)
for token in doc:
    if not token.is_punct and not token.is_space:
        print(f"{token.text:<20} {token.pos_:<10} {token.tag_:<10} {spacy.explain(token.pos_):<30}")
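
One common use of POS tags is to guide lemmatization. NLTK's `pos_tag` produces Penn Treebank tags (such as `NNS` or `VBG`), while `WordNetLemmatizer.lemmatize` expects a single-letter WordNet POS. The helper below is a minimal sketch of that mapping; the name `treebank_to_wordnet` and the hand-written tagged list are assumptions for illustration.

```python
def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun by default)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback


# Hypothetical tagger output in the (word, tag) form that pos_tag returns.
tagged = [("computers", "NNS"), ("are", "VBP"), ("helping", "VBG"),
          ("quickly", "RB"), ("better", "JJR"), ("the", "DT")]

for word, tag in tagged:
    print(f"{word:<12} {tag:<5} -> {treebank_to_wordnet(tag)}")
```

The returned letter can be passed as the `pos` argument of `lemmatizer.lemmatize(word, pos=...)`, so that verbs and adjectives are not lemmatized as if they were nouns.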


# 5. Named Entity Recognition (NER)

NER identifies and classifies named entities in text (people, organizations, locations, dates, etc.).


In [None]:
# Using spaCy for NER
doc = nlp(sample_text)

print("spaCy Named Entity Recognition:")
print(f"{'Entity':<25} {'Label':<15} {'Description':<30}")
print("-" * 70)
for ent in doc.ents:
    print(f"{ent.text:<25} {ent.label_:<15} {spacy.explain(ent.label_):<30}")


In [None]:
# Visualize NER with spaCy's displaCy
# With jupyter=True the HTML is displayed inline and render() returns None,
# so there is nothing useful to assign to a variable.
from spacy import displacy

displacy.render(doc, style="ent", jupyter=True)
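
Outside a notebook, the same information can be shown as plain text. The sketch below wraps entity spans inline using character offsets. The `annotate_entities` and `span` helpers are hypothetical, and the spans are written by hand for illustration; with spaCy they could be built from `ent.start_char`, `ent.end_char`, and `ent.label_` for each entity in `doc.ents`.

```python
def annotate_entities(text, spans):
    """Wrap each (start, end, label) character span as [text](LABEL).

    Spans are assumed not to overlap.
    """
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        out.append(f"[{text[start:end]}]({label})")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


sentence = "Dr. Sarah Johnson from Stanford University published a paper in 2023."


def span(substring, label):
    """Build a (start, end, label) tuple by locating `substring` in the sentence."""
    start = sentence.index(substring)
    return (start, start + len(substring), label)


# Hand-written spans standing in for model output.
spans = [span("Stanford University", "ORG"),
         span("Sarah Johnson", "PERSON"),
         span("2023", "DATE")]

print(annotate_entities(sentence, spans))
```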


In [None]:
# Extract entities by type
print("Entities by Type:")
entities_by_type = {}
for ent in doc.ents:
    if ent.label_ not in entities_by_type:
        entities_by_type[ent.label_] = []
    entities_by_type[ent.label_].append(ent.text)

for label, entities in entities_by_type.items():
    print(f"\n{label} ({spacy.explain(label)}):")
    # sorted() makes the output order deterministic; a bare set has no fixed order
    print(f"  {', '.join(sorted(set(entities)))}")


## Summary

This notebook covered:
- ✅ **Tokenization**: Word and sentence tokenization using NLTK and spaCy
- ✅ **Stemming**: Porter and Snowball stemmers for reducing words to root forms
- ✅ **Lemmatization**: Converting words to dictionary forms with POS awareness
- ✅ **POS Tagging**: Identifying grammatical roles of words
- ✅ **NER**: Extracting named entities (people, organizations, locations, etc.)

### When to Use What?

- **Stemming**: Fast, good for search engines, information retrieval
- **Lemmatization**: Better for tasks requiring valid words, more accurate but slower
- **POS Tagging**: Needed for syntax analysis, dependency parsing, better lemmatization
- **NER**: Extract structured information, build knowledge graphs, information extraction
