# Task 1.1: NLP Text Preprocessing Techniques

In this notebook, we will implement and explore several fundamental NLP preprocessing techniques:
1. Tokenization
2. Lemmatization
3. Stemming
4. Part-of-Speech (POS) Tagging
5. Named Entity Recognition (NER)

We'll also compare lemmatization and stemming with at least 10 examples to understand their differences.

## Setup and Installation

First, let's install the necessary packages: NLTK and spaCy.

In [27]:
# Install required packages
# !pip install nltk spacy
# !python -m spacy download en_core_web_sm

In [28]:
# Import required libraries
import nltk
import spacy
import pandas as pd
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, ne_chunk
import matplotlib.pyplot as plt
from IPython.display import display, HTML

# Download necessary NLTK resources
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # tokenizer data used by newer NLTK releases
nltk.download('averaged_perceptron_tagger_eng', quiet=True)
nltk.download('maxent_ne_chunker_tab', quiet=True)
nltk.download('words', quiet=True)
nltk.download('wordnet', quiet=True)

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

print("All libraries and resources loaded successfully!")

All libraries and resources loaded successfully!


## 1. Tokenization

Tokenization is the process of breaking text into tokens (words, sentences, etc.). We'll demonstrate both sentence tokenization and word tokenization using NLTK and spaCy.
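
Before turning to the libraries, the idea can be sketched with Python's `re` module alone. The function names `simple_word_tokenize` and `naive_sent_split` are hypothetical, and this is not how NLTK or spaCy tokenize. The naive sentence splitter breaks after the period in "Dr.", which is the kind of case the library tokenizers are built to handle.

```python
import re

def simple_word_tokenize(text):
    """Toy tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def naive_sent_split(text):
    """Toy splitter: break at whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

demo = "Dr. Smith developed a new algorithm. It helps computers."
print(simple_word_tokenize(demo))
print(naive_sent_split(demo))  # "Dr." is wrongly split off as its own sentence
```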

In [3]:
# Sample text for all operations
sample_text = """Natural Language Processing (NLP) is a subfield of artificial intelligence. It helps computers understand, interpret, and manipulate human language. The goal of NLP is to bridge the gap between human communication and computer understanding. Dr. Smith developed a new algorithm at Stanford University in California."""

print("Sample Text:")
print(sample_text)
print("\n" + "-"*80 + "\n")

Sample Text:
Natural Language Processing (NLP) is a subfield of artificial intelligence. It helps computers understand, interpret, and manipulate human language. The goal of NLP is to bridge the gap between human communication and computer understanding. Dr. Smith developed a new algorithm at Stanford University in California.

--------------------------------------------------------------------------------



In [4]:
# NLTK Tokenization
print("NLTK Sentence Tokenization:")
sentences = sent_tokenize(sample_text)
for i, sentence in enumerate(sentences, 1):
    print(f"Sentence {i}: {sentence}")

print("\nNLTK Word Tokenization:")
words = word_tokenize(sample_text)
print(words[:20], "...")
print(f"Total words: {len(words)}")

NLTK Sentence Tokenization:
Sentence 1: Natural Language Processing (NLP) is a subfield of artificial intelligence.
Sentence 2: It helps computers understand, interpret, and manipulate human language.
Sentence 3: The goal of NLP is to bridge the gap between human communication and computer understanding.
Sentence 4: Dr. Smith developed a new algorithm at Stanford University in California.

NLTK Word Tokenization:
['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'artificial', 'intelligence', '.', 'It', 'helps', 'computers', 'understand', ',', 'interpret', ','] ...
Total words: 53


In [5]:
# spaCy Tokenization
doc = nlp(sample_text)

print("spaCy Sentence Tokenization:")
for i, sent in enumerate(doc.sents, 1):
    print(f"Sentence {i}: {sent}")

print("\nspaCy Word Tokenization:")
spacy_tokens = [token.text for token in doc]
print(spacy_tokens[:20], "...")
print(f"Total tokens: {len(spacy_tokens)}")

# Compare NLTK and spaCy tokenization
print("\nComparison of token count:")
print(f"NLTK: {len(words)} tokens")
print(f"spaCy: {len(spacy_tokens)} tokens")

spaCy Sentence Tokenization:
Sentence 1: Natural Language Processing (NLP) is a subfield of artificial intelligence.
Sentence 2: It helps computers understand, interpret, and manipulate human language.
Sentence 3: The goal of NLP is to bridge the gap between human communication and computer understanding.
Sentence 4: Dr. Smith developed a new algorithm at Stanford University in California.

spaCy Word Tokenization:
['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'artificial', 'intelligence', '.', 'It', 'helps', 'computers', 'understand', ',', 'interpret', ','] ...
Total tokens: 53

Comparison of token count:
NLTK: 53 tokens
spaCy: 53 tokens

## 2. Lemmatization

Lemmatization reduces words to their base or dictionary form (known as the lemma), taking the context and part of speech into account.
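
To see why the part of speech matters, here is a minimal sketch of a lookup keyed by word and POS. The table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical and purely illustrative. Real lemmatizers are considerably more elaborate, but the dependence on the POS argument is the same in spirit.

```python
# Toy lookup table keyed by (word, POS); the entries are illustrative only.
TOY_LEMMAS = {
    ("better", "a"): "good",  # adjective reading
    ("better", "r"): "well",  # adverb reading
    ("ran", "v"): "run",
    ("mice", "n"): "mouse",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("better", "a"))
print(toy_lemmatize("better", "r"))
print(toy_lemmatize("ran", "n"))  # wrong POS: no entry, so the word is unchanged
```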

In [16]:
# NLTK Lemmatization (pos_tag is already imported in the setup cell)
lemmatizer = WordNetLemmatizer()

# We need POS information for better lemmatization with NLTK
def get_wordnet_pos(tag):
    """Map POS tag to first character used by WordNetLemmatizer"""
    if tag.startswith('J'):
        return 'a'  # Adjective
    elif tag.startswith('V'):
        return 'v'  # Verb
    elif tag.startswith('N'):
        return 'n'  # Noun
    elif tag.startswith('R'):
        return 'r'  # Adverb
    else:
        return 'n'  # Default to noun

# Lemmatize with POS information
words_with_pos = pos_tag(words)
lemmas_nltk = [lemmatizer.lemmatize(word, get_wordnet_pos(pos)) for word, pos in words_with_pos]

print("NLTK Lemmatization:")
for original, lemma in list(zip(words, lemmas_nltk))[:20]:
    if original != lemma:
        print(f"{original:<15} -> {lemma:<15}")

print("...")

NLTK Lemmatization:
is              -> be             
helps           -> help           
computers       -> computer       
...


In [17]:
# spaCy Lemmatization
lemmas_spacy = [token.lemma_ for token in doc]

print("spaCy Lemmatization:")
for original, lemma in list(zip(spacy_tokens, lemmas_spacy))[:20]:
    if original != lemma:
        print(f"{original:<15} -> {lemma:<15}")

print("...")

spaCy Lemmatization:
is              -> be             
It              -> it             
helps           -> help           
computers       -> computer       
...


## 3. Stemming

Stemming is the process of reducing words to their root/stem by removing affixes. Unlike lemmatization, stemming doesn't ensure that the resulting stem is a meaningful word.
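
As a rough illustration of rule-based suffix stripping, the sketch below applies a few ordered rules. The rule list `TOY_RULES` and the function `toy_stem` are hypothetical. This is not the Porter, Lancaster, or Snowball algorithm, only a toy showing how such rules can yield non-words like 'studi'.

```python
# Ordered (suffix, replacement) rules; longer suffixes are tried first.
TOY_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    for suffix, replacement in TOY_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["studies", "walking", "walked", "dogs", "mice"]:
    print(f"{w:<10} -> {toy_stem(w)}")
```

Note that 'mice' passes through unchanged: no suffix rule applies, so a purely rule-based approach cannot relate it to 'mouse'.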

In [18]:
# Initialize stemmers
porter_stemmer = PorterStemmer()
lancaster_stemmer = LancasterStemmer()
snowball_stemmer = SnowballStemmer('english')

# Apply stemming
stems_porter = [porter_stemmer.stem(word) for word in words]
stems_lancaster = [lancaster_stemmer.stem(word) for word in words]
stems_snowball = [snowball_stemmer.stem(word) for word in words]

# Create a comparison DataFrame
stemming_df = pd.DataFrame({
    'Original': words[:20],
    'Porter Stemmer': stems_porter[:20],
    'Lancaster Stemmer': stems_lancaster[:20],
    'Snowball Stemmer': stems_snowball[:20]
})

print("Stemming Comparison:")
display(stemming_df.style.set_properties(**{'text-align': 'left'}))

Stemming Comparison:


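| | Original | Porter Stemmer | Lancaster Stemmer | Snowball Stemmer |
|---|---|---|---|---|
| 0 | Natural | natur | nat | natur |
| 1 | Language | languag | langu | languag |
| 2 | Processing | process | process | process |
| 3 | ( | ( | ( | ( |
| 4 | NLP | nlp | nlp | nlp |
| 5 | ) | ) | ) | ) |
| 6 | is | is | is | is |
| 7 | a | a | a | a |
| 8 | subfield | subfield | subfield | subfield |
| 9 | of | of | of | of |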


## 4. Part-of-Speech (POS) Tagging

POS tagging assigns grammatical categories (like noun, verb, adjective) to each word in a text.

In [19]:
# NLTK POS Tagging
nltk_pos = pos_tag(words)

print("NLTK POS Tagging:")
for token, pos in nltk_pos[:20]:
    print(f"{token:<15} -> {pos:<6}")

print("...")

NLTK POS Tagging:
Natural         -> JJ    
Language        -> NNP   
Processing      -> NNP   
(               -> (     
NLP             -> NNP   
)               -> )     
is              -> VBZ   
a               -> DT    
subfield        -> NN    
of              -> IN    
artificial      -> JJ    
intelligence    -> NN    
.               -> .     
It              -> PRP   
helps           -> VBZ   
computers       -> NNS   
understand      -> VBP   
,               -> ,     
interpret       -> JJ    
,               -> ,     
...


In [20]:
# spaCy POS Tagging
print("spaCy POS Tagging:")
for token in list(doc)[:20]:
    print(f"{token.text:<15} -> {token.pos_:<6} (Fine-grained: {token.tag_})")

print("...")

spaCy POS Tagging:
Natural         -> PROPN  (Fine-grained: NNP)
Language        -> PROPN  (Fine-grained: NNP)
Processing      -> PROPN  (Fine-grained: NNP)
(               -> PUNCT  (Fine-grained: -LRB-)
NLP             -> PROPN  (Fine-grained: NNP)
)               -> PUNCT  (Fine-grained: -RRB-)
is              -> AUX    (Fine-grained: VBZ)
a               -> DET    (Fine-grained: DT)
subfield        -> NOUN   (Fine-grained: NN)
of              -> ADP    (Fine-grained: IN)
artificial      -> ADJ    (Fine-grained: JJ)
intelligence    -> NOUN   (Fine-grained: NN)
.               -> PUNCT  (Fine-grained: .)
It              -> PRON   (Fine-grained: PRP)
helps           -> VERB   (Fine-grained: VBZ)
computers       -> NOUN   (Fine-grained: NNS)
understand      -> VERB   (Fine-grained: VB)
,               -> PUNCT  (Fine-grained: ,)
interpret       -> ADJ    (Fine-grained: JJ)
,               -> PUNCT  (Fine-grained: ,)
...
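
The two taggers do not always agree. A small sketch, using a handful of (token, NLTK tag, spaCy fine-grained tag) triples copied from the outputs above, lists the disagreements. The variable names are illustrative.

```python
# (token, NLTK tag, spaCy fine-grained tag), copied from the outputs above.
tagged = [
    ("Natural", "JJ", "NNP"),
    ("Language", "NNP", "NNP"),
    ("(", "(", "-LRB-"),
    ("is", "VBZ", "VBZ"),
    ("understand", "VBP", "VB"),
    ("interpret", "JJ", "JJ"),
]

# Keep only the tokens where the two tags differ.
disagreements = [(tok, a, b) for tok, a, b in tagged if a != b]
for tok, a, b in disagreements:
    print(f"{tok:<12} NLTK={a:<4} spaCy={b}")
```

Agreement is not the same as correctness: both taggers label 'interpret' as an adjective (JJ), although it is used as a verb in the sample sentence.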


### Visualization of POS Tags with spaCy

In [21]:
from spacy import displacy

# Use the first sentence for visualization
first_sentence = list(doc.sents)[0]

# Render the dependency parse (POS tags appear under each token) in the Jupyter notebook
displacy.render(nlp(first_sentence.text), style="dep", jupyter=True, options={"distance": 120})

## 5. Named Entity Recognition (NER)

NER identifies and classifies named entities in text into predefined categories such as person names, organizations, locations, etc.

In [24]:
# NLTK NER
nltk_ner = ne_chunk(pos_tag(word_tokenize(sample_text)))

print("NLTK Named Entity Recognition:")
print(nltk_ner)

# Extract named entities
named_entities = []
for chunk in nltk_ner:
    if hasattr(chunk, 'label'):
        entity = ' '.join(c[0] for c in chunk)
        entity_type = chunk.label()
        named_entities.append((entity, entity_type))

print("\nExtracted Named Entities:")
for entity, entity_type in named_entities:
    print(f"{entity:<20} -> {entity_type}")

NLTK Named Entity Recognition:
(S
  Natural/JJ
  Language/NNP
  Processing/NNP
  (/(
  (ORGANIZATION NLP/NNP)
  )/)
  is/VBZ
  a/DT
  subfield/NN
  of/IN
  artificial/JJ
  intelligence/NN
  ./.
  It/PRP
  helps/VBZ
  computers/NNS
  understand/VBP
  ,/,
  interpret/JJ
  ,/,
  and/CC
  manipulate/VB
  human/JJ
  language/NN
  ./.
  The/DT
  goal/NN
  of/IN
  (ORGANIZATION NLP/NNP)
  is/VBZ
  to/TO
  bridge/VB
  the/DT
  gap/NN
  between/IN
  human/JJ
  communication/NN
  and/CC
  computer/NN
  understanding/NN
  ./.
  Dr./NNP
  (PERSON Smith/NNP)
  developed/VBD
  a/DT
  new/JJ
  algorithm/NN
  at/IN
  (ORGANIZATION Stanford/NNP University/NNP)
  in/IN
  (GPE California/NNP)
  ./.)

Extracted Named Entities:
NLP                  -> ORGANIZATION
NLP                  -> ORGANIZATION
Smith                -> PERSON
Stanford University  -> ORGANIZATION
California           -> GPE
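
Repeated mentions such as 'NLP' appear once per occurrence. A minimal sketch for grouping the extracted pairs by label and dropping repeats is shown below. The list `extracted_pairs` simply restates the output above, and the names are illustrative.

```python
from collections import defaultdict

# (entity, label) pairs as printed by the NLTK cell above.
extracted_pairs = [
    ("NLP", "ORGANIZATION"),
    ("NLP", "ORGANIZATION"),
    ("Smith", "PERSON"),
    ("Stanford University", "ORGANIZATION"),
    ("California", "GPE"),
]

by_label = defaultdict(list)
for entity, label in extracted_pairs:
    if entity not in by_label[label]:  # skip repeated mentions
        by_label[label].append(entity)

for label, entities in by_label.items():
    print(f"{label:<13} {entities}")
```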


In [25]:
# spaCy NER
print("spaCy Named Entity Recognition:")
for ent in doc.ents:
    print(f"{ent.text:<20} -> {ent.label_:<10} ({spacy.explain(ent.label_)})")

# Visualize NER
displacy.render(doc, style="ent", jupyter=True)

spaCy Named Entity Recognition:
Natural Language Processing -> ORG        (Companies, agencies, institutions, etc.)
NLP                  -> ORG        (Companies, agencies, institutions, etc.)
NLP                  -> ORG        (Companies, agencies, institutions, etc.)
Smith                -> PERSON     (People, including fictional)
Stanford University  -> ORG        (Companies, agencies, institutions, etc.)
California           -> GPE        (Countries, cities, states)


## Comparison of Lemmatization and Stemming

Now let's compare lemmatization and stemming with at least 10 examples to understand the differences between these two approaches.

In [26]:
# Define a list of example words for comparison
example_words = [
    'running', 'runs', 'ran',          # Run variations
    'better', 'best', 'good',          # Good variations
    'studies', 'studying', 'studied',  # Study variations
    'mice', 'mouse',                   # Irregular plurals
    'meeting', 'meet',                 # Meet variations
    'caring', 'cares', 'cared',        # Care variations
    'dogs', 'dog',                     # Regular plurals
    'wolves', 'wolf',                  # Irregular plurals
    'walking', 'walked',               # Walk variations
    'writing', 'wrote', 'written',     # Write variations
    'worse', 'worst', 'bad',           # Bad variations
    'singing', 'sang', 'sung',         # Sing variations
    'easily', 'easy',                  # Adverb formation
    'happiness', 'happy'               # Noun from adjective
]

# Apply stemming and lemmatization
# Note: each word is POS-tagged in isolation, so the tagger has no sentence context
lemmas = [lemmatizer.lemmatize(word, get_wordnet_pos(pos_tag([word])[0][1])) for word in example_words]
porter_stems = [porter_stemmer.stem(word) for word in example_words]
lancaster_stems = [lancaster_stemmer.stem(word) for word in example_words]

# Create a comparison DataFrame
comparison_df = pd.DataFrame({
    'Original Word': example_words,
    'Lemmatization (NLTK)': lemmas,
    'Porter Stemmer': porter_stems,
    'Lancaster Stemmer': lancaster_stems
})

# Display the comparison
print("Comparison of Lemmatization vs. Stemming:")
display(comparison_df.style.set_properties(**{'text-align': 'left'}))

Comparison of Lemmatization vs. Stemming:


| | Original Word | Lemmatization (NLTK) | Porter Stemmer | Lancaster Stemmer |
|---|---|---|---|---|
| 0 | running | run | run | run |
| 1 | runs | run | run | run |
| 2 | ran | ran | ran | ran |
| 3 | better | well | better | bet |
| 4 | best | best | best | best |
| 5 | good | good | good | good |
| 6 | studies | study | studi | study |
| 7 | studying | study | studi | study |
| 8 | studied | study | studi | study |
| 9 | mice | mouse | mice | mic |


### Key Differences Between Lemmatization and Stemming:

1. **Approach**:
   - **Stemming**: Uses heuristic rules to chop off word endings, often resulting in non-dictionary words.
   - **Lemmatization**: Analyzes word structure and uses morphological analysis to return proper dictionary forms.

2. **Part-of-Speech Awareness**:
   - **Stemming**: Typically doesn't consider the part of speech of the word.
   - **Lemmatization**: Usually takes into account the part of speech to apply the correct normalization rules.

3. **Output Validity**:
   - **Stemming**: May produce stems that are not actual words (e.g., 'studi', 'natur').
   - **Lemmatization**: Produces valid dictionary words (lemmas) (e.g., 'run', 'study').

4. **Handling of Irregular Forms**:
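   - **Stemming**: Generally fails with irregular forms (e.g., Porter leaves 'better' as 'better' and 'mice' as 'mice').
   - **Lemmatization**: Handles irregular forms when given the right part of speech (e.g., 'mice' → 'mouse'; 'better' → 'good' as an adjective, or 'well' as an adverb, which is what the table above shows).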

5. **Accuracy vs. Speed**:
   - **Stemming**: Faster but less accurate.
   - **Lemmatization**: More accurate but computationally more intensive.

6. **Use Cases**:
   - **Stemming**: Better for search engines and information retrieval where exact form is less important.
   - **Lemmatization**: Better for text analysis, NLP tasks where meaning and linguistic correctness matter.

7. **Dictionary Dependency**:
   - **Stemming**: Rule-based, doesn't require a dictionary.
   - **Lemmatization**: Often requires a dictionary lookup to determine the lemma.

The examples above illustrate these differences. Stemming sometimes produces non-words ('studi' from Porter; 'bet' and 'mic' from Lancaster), while lemmatization returns valid words but can miss connections when a word is tagged in isolation: in the table, 'ran' and 'best' are left unchanged, and 'better' becomes 'well' rather than 'good'.
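
One way to read the table is to ask which original words each method merges into a single form. The sketch below groups the ten rows shown above by lemma and by Porter stem. The helper `group_by` is hypothetical, and the rows are copied from the table rather than recomputed.

```python
from collections import defaultdict

# (original, NLTK lemma, Porter stem) rows copied from the table above.
rows = [
    ("running", "run", "run"), ("runs", "run", "run"), ("ran", "ran", "ran"),
    ("better", "well", "better"), ("best", "best", "best"), ("good", "good", "good"),
    ("studies", "study", "studi"), ("studying", "study", "studi"),
    ("studied", "study", "studi"), ("mice", "mouse", "mice"),
]

def group_by(rows, column):
    """Map each normalized form in `column` to the original words that share it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row[0])
    return dict(groups)

by_lemma = group_by(rows, 1)
by_stem = group_by(rows, 2)
print(by_lemma)
print(by_stem)
```

For these rows both methods merge the same words. The difference lies in the keys: the lemma keys are dictionary words ('study', 'mouse'), while the stem keys include 'studi'. Neither groups 'ran' with 'run' here.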

## Summary

In this notebook, we have implemented and explored various NLP preprocessing techniques:

1. **Tokenization**: Breaking text into words and sentences using both NLTK and spaCy.
2. **Lemmatization**: Reducing words to their base dictionary forms, considering context and part of speech.
3. **Stemming**: Reducing words to their word stems by removing affixes.
4. **POS Tagging**: Identifying the grammatical parts of speech for each token.
5. **Named Entity Recognition (NER)**: Identifying and classifying named entities in text.

We've also compared lemmatization and stemming using multiple examples, highlighting the key differences between these two text normalization approaches. While stemming is faster and simpler, lemmatization provides more linguistically accurate results, especially for irregular word forms and when maintaining meaningful word representations is important.