# Module 1, Week 1: Assignment - Stemming, Lemmatization, and Advanced Tokenization

### Objective
This assignment is designed to:
1. Introduce stemming and lemmatization.
2. Teach advanced tokenization techniques using NLTK and SpaCy.
3. Help compare different preprocessing techniques and their impact on text data.

---

### Instructions:
- Use the provided text or your own sample text for analysis.
- Perform stemming, lemmatization, and advanced tokenization on the text.
- Analyze the results and reflect on the differences between these techniques.

---

## Step 1: Import Required Libraries

In [1]:
# Import Required Libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import spacy

# Download Required NLTK Data Files
nltk.download('punkt')  # Newer NLTK releases (3.9+) may also need nltk.download('punkt_tab')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Load SpaCy Model
nlp = spacy.load('en_core_web_sm')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kaleem\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\kaleem\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\kaleem\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


## Step 2: Define the Input Text
You can use the sample text below or replace it with your own dataset.

In [2]:
# Sample Raw Text
raw_text = """
Text preprocessing is a critical step in natural language processing tasks. 
It involves converting text into a format that is easy for machines to understand. 
Key steps include stemming, lemmatization, and tokenization.
"""

# Print Raw Text
print("Raw Text:\n")
print(raw_text)

Raw Text:


Text preprocessing is a critical step in natural language processing tasks. 
It involves converting text into a format that is easy for machines to understand. 
Key steps include stemming, lemmatization, and tokenization.



## Step 3: Tokenization with NLTK
Tokenize the text into words using NLTK.

In [3]:
# Tokenize Text Using NLTK
nltk_tokens = word_tokenize(raw_text)
print("\nNLTK Tokenized Words:\n")
print(nltk_tokens)


NLTK Tokenized Words:

['Text', 'preprocessing', 'is', 'a', 'critical', 'step', 'in', 'natural', 'language', 'processing', 'tasks', '.', 'It', 'involves', 'converting', 'text', 'into', 'a', 'format', 'that', 'is', 'easy', 'for', 'machines', 'to', 'understand', '.', 'Key', 'steps', 'include', 'stemming', ',', 'lemmatization', ',', 'and', 'tokenization', '.']


## Step 4: Stemming
Perform stemming using NLTK's PorterStemmer.

In [4]:
# Initialize PorterStemmer
stemmer = PorterStemmer()

# Apply Stemming
stemmed_words = [stemmer.stem(word) for word in nltk_tokens]
print("\nStemmed Words:\n")
print(stemmed_words)


Stemmed Words:

['text', 'preprocess', 'is', 'a', 'critic', 'step', 'in', 'natur', 'languag', 'process', 'task', '.', 'it', 'involv', 'convert', 'text', 'into', 'a', 'format', 'that', 'is', 'easi', 'for', 'machin', 'to', 'understand', '.', 'key', 'step', 'includ', 'stem', ',', 'lemmat', ',', 'and', 'token', '.']


## Step 5: Lemmatization
Perform lemmatization using NLTK's WordNetLemmatizer and SpaCy.

In [5]:
# Initialize WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# Apply Lemmatization with NLTK
nltk_lemmatized = [lemmatizer.lemmatize(word) for word in nltk_tokens]
print("\nNLTK Lemmatized Words:\n")
print(nltk_lemmatized)

# Apply Lemmatization with SpaCy
doc = nlp(raw_text)
spacy_lemmatized = [token.lemma_ for token in doc]
print("\nSpaCy Lemmatized Words:\n")
print(spacy_lemmatized)


NLTK Lemmatized Words:

['Text', 'preprocessing', 'is', 'a', 'critical', 'step', 'in', 'natural', 'language', 'processing', 'task', '.', 'It', 'involves', 'converting', 'text', 'into', 'a', 'format', 'that', 'is', 'easy', 'for', 'machine', 'to', 'understand', '.', 'Key', 'step', 'include', 'stemming', ',', 'lemmatization', ',', 'and', 'tokenization', '.']

SpaCy Lemmatized Words:

['\n', 'text', 'preprocessing', 'be', 'a', 'critical', 'step', 'in', 'natural', 'language', 'processing', 'task', '.', '\n', 'it', 'involve', 'convert', 'text', 'into', 'a', 'format', 'that', 'be', 'easy', 'for', 'machine', 'to', 'understand', '.', '\n', 'key', 'step', 'include', 'stem', ',', 'lemmatization', ',', 'and', 'tokenization', '.', '\n']


## Step 6: Advanced Tokenization with SpaCy
Perform advanced tokenization using SpaCy, which includes handling punctuation and special characters. This step reuses the `doc` object created in Step 5.

In [6]:
# Advanced Tokenization with SpaCy
spacy_tokens = [token.text for token in doc]
print("\nAdvanced Tokenization with SpaCy:\n")
print(spacy_tokens)


Advanced Tokenization with SpaCy:

['\n', 'Text', 'preprocessing', 'is', 'a', 'critical', 'step', 'in', 'natural', 'language', 'processing', 'tasks', '.', '\n', 'It', 'involves', 'converting', 'text', 'into', 'a', 'format', 'that', 'is', 'easy', 'for', 'machines', 'to', 'understand', '.', '\n', 'Key', 'steps', 'include', 'stemming', ',', 'lemmatization', ',', 'and', 'tokenization', '.', '\n']


## Step 7: Compare Results
Analyze and compare the output of stemming, lemmatization, and tokenization techniques. Discuss the differences and their implications.

In [7]:
# Compare Results
print("\nComparison of Techniques:\n")
print("Raw Text:", raw_text)
print("\nStemmed Words:", stemmed_words)
print("\nNLTK Lemmatized Words:", nltk_lemmatized)
print("\nSpaCy Lemmatized Words:", spacy_lemmatized)
print("\nNLTK Tokenized Words:", nltk_tokens)
print("\nSpaCy Tokenized Words:", spacy_tokens)


Comparison of Techniques:

Raw Text: 
Text preprocessing is a critical step in natural language processing tasks. 
It involves converting text into a format that is easy for machines to understand. 
Key steps include stemming, lemmatization, and tokenization.


Stemmed Words: ['text', 'preprocess', 'is', 'a', 'critic', 'step', 'in', 'natur', 'languag', 'process', 'task', '.', 'it', 'involv', 'convert', 'text', 'into', 'a', 'format', 'that', 'is', 'easi', 'for', 'machin', 'to', 'understand', '.', 'key', 'step', 'includ', 'stem', ',', 'lemmat', ',', 'and', 'token', '.']

NLTK Lemmatized Words: ['Text', 'preprocessing', 'is', 'a', 'critical', 'step', 'in', 'natural', 'language', 'processing', 'task', '.', 'It', 'involves', 'converting', 'text', 'into', 'a', 'format', 'that', 'is', 'easy', 'for', 'machine', 'to', 'understand', '.', 'Key', 'step', 'include', 'stemming', ',', 'lemmatization', ',', 'and', 'tokenization', '.']

SpaCy Lemmatized Words: ['\n', 'text', 'preprocessing', 'be', 'a'

## Reflection Questions
1. What differences did you observe between stemming and lemmatization?
2. Which tokenization technique (NLTK vs. SpaCy) provided better results for your text?
3. How might the choice of preprocessing technique impact downstream NLP tasks like classification or summarization?
4. Experiment with different texts. How do the results vary for complex sentences or domain-specific text?
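
For question 1, one way to make the differences concrete is to list only the tokens that each technique changed. The sketch below uses plain Python on the lists printed in Steps 3 to 5 (copied here so the block runs on its own). The `changed` helper is an illustrative name, and ignoring case-only changes is an assumption made for this comparison.

```python
# Lists copied from the Step 3, Step 4, and Step 5 (NLTK) outputs above.
tokens = [
    'Text', 'preprocessing', 'is', 'a', 'critical', 'step', 'in', 'natural', 'language', 'processing', 'tasks', '.',
    'It', 'involves', 'converting', 'text', 'into', 'a', 'format', 'that', 'is', 'easy', 'for', 'machines', 'to', 'understand', '.',
    'Key', 'steps', 'include', 'stemming', ',', 'lemmatization', ',', 'and', 'tokenization', '.',
]
stems = [
    'text', 'preprocess', 'is', 'a', 'critic', 'step', 'in', 'natur', 'languag', 'process', 'task', '.',
    'it', 'involv', 'convert', 'text', 'into', 'a', 'format', 'that', 'is', 'easi', 'for', 'machin', 'to', 'understand', '.',
    'key', 'step', 'includ', 'stem', ',', 'lemmat', ',', 'and', 'token', '.',
]
lemmas = [
    'Text', 'preprocessing', 'is', 'a', 'critical', 'step', 'in', 'natural', 'language', 'processing', 'task', '.',
    'It', 'involves', 'converting', 'text', 'into', 'a', 'format', 'that', 'is', 'easy', 'for', 'machine', 'to', 'understand', '.',
    'Key', 'step', 'include', 'stemming', ',', 'lemmatization', ',', 'and', 'tokenization', '.',
]


def changed(originals, processed):
    """Return (original, processed) pairs that differ by more than letter case."""
    return [(o, p) for o, p in zip(originals, processed) if o.lower() != p.lower()]


stem_changes = changed(tokens, stems)
lemma_changes = changed(tokens, lemmas)

print(len(stem_changes), "tokens changed by stemming:", stem_changes)
print(len(lemma_changes), "tokens changed by NLTK lemmatization:", lemma_changes)
```

On this sample, stemming alters far more tokens than NLTK's default lemmatization, which only touches the three plural nouns.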

## Stemming vs Lemmatization

**Stemming** and **lemmatization** are both text normalization techniques used in **Natural Language Processing (NLP)** to reduce words to their base or root form. Here's a concise comparison:

### Stemming:
- **Definition**: Stemming strips affixes from a word to reduce it to a root form, which may not be a real word. The Porter stemmer used in this assignment strips suffixes only.
- **Example**: 
  - "running" → "run"
  - "happily" → "happi"
- **Approach**: Rule-based, often aggressive (may not result in a valid word).
- **Use Case**: Quick and efficient when exact meaning is less important.
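
The rule-based approach can be illustrated with a toy suffix stripper. This is a minimal sketch and not the Porter algorithm: the rule list and the `toy_stem` name are made up for illustration, and real stemmers apply many more rules and conditions.

```python
# Toy suffix-stripping stemmer (illustrative only; NOT the Porter algorithm).
# Rules are tried in order; the first suffix that matches is removed.
SUFFIX_RULES = ("ing", "ly", "es", "s")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least `min_stem_len` characters."""
    word = word.lower()
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ["running", "happily", "machines", "is"]:
    print(w, "->", toy_stem(w))
```

Because the rules ignore meaning, the result can be a non-word such as "runn" or "happi". The Porter stemmer adds further rules (for example, it reduces "running" to "run"), but the same trade-off applies.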

### Lemmatization:
- **Definition**: Lemmatization reduces a word to its base or dictionary form (lemma), ensuring the result is a valid word, often considering the word's part of speech.
- **Example**: 
  - "running" → "run"
  - "better" → "good" (as an adjective)
- **Approach**: Dictionary-based and more precise; it can take the word's part of speech and context into account.
- **Use Case**: Preferred when accurate meaning and valid words are necessary.
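
Dictionary lookup keyed on part of speech can be sketched as follows. The `TOY_LEMMAS` table and the `toy_lemmatize` function are hypothetical stand-ins for a real lexicon such as WordNet; they only illustrate why the same surface form can map to different lemmas depending on its part of speech.

```python
# Toy dictionary-based lemmatizer (hypothetical table, not WordNet itself).
# Keys are (word, part_of_speech): "n" = noun, "v" = verb, "a" = adjective.
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("better", "a"): "good",
    ("machines", "n"): "machine",
    ("converting", "v"): "convert",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word unchanged if there is no entry."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


print(toy_lemmatize("better", pos="a"))      # good
print(toy_lemmatize("converting", pos="v"))  # convert
print(toy_lemmatize("converting"))           # converting (no noun entry)
```

NLTK's `WordNetLemmatizer.lemmatize(word, pos=...)` accepts a part-of-speech argument in a similar way and defaults to noun, so passing the right tag matters for verbs and adjectives.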

### Summary:
- **Stemming** is faster but less accurate and can result in non-existent words.
- **Lemmatization** is more accurate, ensuring valid words, but is computationally slower.


## Brief Explanation of the Text Preprocessing Techniques

The notebook above compares **stemming**, **lemmatization**, and **tokenization** on a short sample text using **NLTK** and **SpaCy**, two popular NLP libraries.

### Raw Text:
The raw text discusses **text preprocessing**, which involves transforming text into a format suitable for machine understanding. It mentions key preprocessing steps like **stemming**, **lemmatization**, and **tokenization**.

---

### 1. **Stemmed Words**:
- **Stemming** reduces words to a root form by stripping affixes; the Porter stemmer used here removes suffixes and also lowercases its output.
- **Example**: 
  - "preprocessing" → "preprocess"
  - "critical" → "critic"
  - "language" → "languag"
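- **Note**: Some stemmed words are not valid dictionary words (e.g., "languag" instead of "language"). Others, such as "critic", are valid words but no longer mean the same as the original ("critical").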

---

### 2. **NLTK Lemmatized Words**:
- **Lemmatization** reduces words to their base or dictionary form, ensuring they are valid words.
- **Example**: 
  - "preprocessing" → "preprocessing" (no change)
  - "critical" → "critical"
  - "language" → "language"
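- **Note**: With no part-of-speech argument, NLTK's `WordNetLemmatizer` treats every token as a noun. In this output only the plural nouns change ("tasks" → "task", "machines" → "machine", "steps" → "step"), while verb forms such as "converting", "involves", and "is" are left unchanged.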

---

### 3. **SpaCy Lemmatized Words**:
- **SpaCy** also performs lemmatization, but it uses part-of-speech information from its pipeline, so the results differ from NLTK's:
  - "preprocessing" stays as "preprocessing".
  - "critical" stays as "critical".
  - Verb forms are reduced to their base form: "is" → "**be**", "involves" → "involve", "converting" → "convert".
  - In this output the lemmas are lowercased ("Text" → "text", "Key" → "key").
  - The output also contains `\n` (newline) tokens, one for each line break in `raw_text`; filter them out if they are not needed.
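
The disagreement between the two lemmatizers can be isolated with a few lines of plain Python. This sketch copies the second sentence of each Step 5 output, drops SpaCy's newline token, and ignores case (an assumption made here so that "It" vs. "it" is not reported).

```python
# Second sentence of the NLTK and SpaCy lemmatizer outputs from Step 5.
nltk_lemmas = ['It', 'involves', 'converting', 'text', 'into', 'a', 'format', 'that',
               'is', 'easy', 'for', 'machine', 'to', 'understand', '.']
spacy_lemmas = ['\n', 'it', 'involve', 'convert', 'text', 'into', 'a', 'format', 'that',
                'be', 'easy', 'for', 'machine', 'to', 'understand', '.']

# Drop whitespace-only entries so both lists line up token by token.
spacy_words = [lemma for lemma in spacy_lemmas if lemma.strip()]

# Keep only the positions where the two libraries disagree (ignoring case).
disagreements = [
    (n, s) for n, s in zip(nltk_lemmas, spacy_words) if n.lower() != s.lower()
]
print(disagreements)
```

All three disagreements in this sentence are verb forms, which is consistent with NLTK's lemmatizer assuming nouns unless told otherwise.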

---

### 4. **NLTK Tokenized Words**:
- **Tokenization** splits the text into smaller units (tokens) such as words and punctuation marks.
- The NLTK tokenizer has correctly separated the words and punctuation: 
  - `['Text', 'preprocessing', 'is', 'a', ...]`.
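
Splitting words from punctuation can be approximated with a single regular expression. This is a rough sketch and not NLTK's actual algorithm (`word_tokenize` handles contractions, abbreviations, and other cases that this pattern does not); the `simple_tokenize` name is made up for illustration.

```python
import re


def simple_tokenize(text):
    """Return runs of word characters and individual punctuation marks."""
    # \w+      matches a run of letters, digits, or underscores
    # [^\w\s]  matches any single character that is neither word nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


sample = "Key steps include stemming, lemmatization, and tokenization."
sample_tokens = simple_tokenize(sample)
print(sample_tokens)
```

On this simple sentence the pattern yields the same tokens as the NLTK output above; on text with contractions or abbreviations the two would diverge.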

---

### 5. **SpaCy Tokenized Words**:
- **SpaCy's tokenizer** produces the same word and punctuation tokens as NLTK for this text.
- It additionally emits a `\n` (newline) token for each line break in `raw_text`, because it keeps that whitespace as tokens rather than discarding it.
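
If the newline tokens are not wanted, they can be filtered out after tokenization. The sketch below works on a copy of the first sentence of the SpaCy output so that it runs without a model installed; on a real `Doc`, the equivalent check would use the token attribute `is_space`.

```python
# First sentence of the SpaCy token output from Step 6, including newline tokens.
spacy_tokens = ['\n', 'Text', 'preprocessing', 'is', 'a', 'critical', 'step', 'in',
                'natural', 'language', 'processing', 'tasks', '.', '\n']

# Keep only tokens that contain at least one non-whitespace character.
clean_tokens = [tok for tok in spacy_tokens if tok.strip()]
print(clean_tokens)

# With a real spaCy Doc, the same filter could be written as:
#   clean_tokens = [token.text for token in doc if not token.is_space]
```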

---

### Summary of Key Differences:
- **Stemming** is fast but less accurate, producing root forms that might not be valid words.
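- **Lemmatization** is more accurate and returns valid words, but results differ between libraries: NLTK's lemmatizer assumes nouns unless given a part of speech, while SpaCy relies on the part-of-speech tags from its own pipeline.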
- **Tokenization** splits text into words and punctuation, with both NLTK and SpaCy giving similar results, although SpaCy includes extra newline characters.